OCR For Business: Beyond One Desktop
by Doug Henschen



TABLE: A Three-Way Test of Leading OCR Products (PDF, 24K)


Optical character recognition (OCR) has long been used to turn paper documents into searchable images, editable documents and usable data. It's a mature technology in which steady improvements in image processing and character analysis have brought accuracy rates for the leading products above 99 percent on routine office documents. The differences from engine to engine might then seem trivial, but your perspective quickly changes when you examine errors per page, pages per document and documents per day.

For the readers of Transform, who often buy OCR seats by the dozen, rely on OCR embedded into production capture and forms processing systems, or integrate OCR engines into custom software, the differences in performance are often compounded across scores or even hundreds of users and thousands if not millions of documents. Depending on the nature of your documents and applications, the ultimate impact on productivity can be tremendous.
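To show how quickly a small accuracy gap compounds, here is a rough back-of-the-envelope sketch. The workload figures (characters per page, pages per document, documents per day) and the two accuracy rates are hypothetical values chosen only for illustration; they are not measurements from our test.

```python
def daily_errors(accuracy, chars_per_page, pages_per_doc, docs_per_day):
    """Expected character errors per day at a given character accuracy (0..1)."""
    error_rate = 1.0 - accuracy
    return error_rate * chars_per_page * pages_per_doc * docs_per_day


# Hypothetical workload, for illustration only (not from the test set).
CHARS_PER_PAGE = 2000
PAGES_PER_DOC = 5
DOCS_PER_DAY = 1000

# Two hypothetical engines that differ by half a percentage point.
engine_a = daily_errors(0.990, CHARS_PER_PAGE, PAGES_PER_DOC, DOCS_PER_DAY)
engine_b = daily_errors(0.995, CHARS_PER_PAGE, PAGES_PER_DOC, DOCS_PER_DAY)

print(f"Engine A (99.0%): {engine_a:,.0f} errors per day")
print(f"Engine B (99.5%): {engine_b:,.0f} errors per day")
print(f"Difference:       {engine_a - engine_b:,.0f} errors per day")
```

Under these assumed volumes, half a percentage point of accuracy is the difference between roughly 100,000 and 50,000 character errors a day. Substitute your own page and document counts to see where you land.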

Clean, laser-printed documents with common type sizes and single-column formatting are easily recognized, and if you're embedding the results behind the image without a correction step, as in a PDF Image+Text file, you may never notice OCR errors. This approach is typically used to create searchable image archives, and with fuzzy search technology, users will likely find all relevant documents even if recognition results are less than perfect.

The task becomes much more difficult when OCR encounters low-resolution scans, small or poor-quality type, or background shading. And if you want to recreate challenging documents in an editable form, such as Word or editable PDF files, then errors in recognition, page decomposition and formatting will take time and effort to repair.

With an eye on all these variables, Transform recently put Abbyy FineReader 7.0 Corporate Edition, ScanSoft OmniPage Pro 14 Office and I.R.I.S. Readiris Pro 9 through a head-to-head test on the same 53 images. We used a mix of laserjet- and offset-printed pages ranging from simple, single-column documents to complex, multi-column articles with graphics, small fonts and background shading. We also used two shipping documents with dot matrix printing and two inkjet-printed faxes. Our test set included an even split of bitonal and color images and 200 dpi and 300 dpi images, and we also threw in a few grayscale images. We concentrated the toughest documents with smaller and degraded type in the 300 dpi batch because we knew they would require better resolution. More of the color scanning was performed at 200 dpi, keeping file sizes in check.

We judged accuracy by letting each product process the test images automatically, skipping the OCR proofing step, and then counting character errors (an incredibly tedious and time-consuming task). We judged page formatting by exporting to Word and examining page breaks and the retention of column, table and graphic formatting. Speed was tested by recording the time required to recognize the same 20-page image and convert it to PDF Image+Text-style formats.

Abbyy FineReader 7.0 CE Lowers the Cost Per Seat
Last September, Fremont, CA-based Abbyy introduced the 7.0 releases of FineReader Professional ($299) and Corporate Edition ($499). Both products are identical in most respects, but CE lets organizations install and use the software on as many desktops as needed with a concurrent licensing model that lowers the cost of deployment. If 100 people need access to FineReader, but no more than 10 typically make use of the software at any given time, an organization can save the cost of 90 licenses.

Concurrent licenses are administered on the network with a centralized license manager. Monitoring capabilities let administrators see exactly who is using the software and how often, and power users can be assigned permanent access to a single license.
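The 100-user example above works out as follows. This is a minimal sketch that assumes every license, per-seat or concurrent, is bought at the $499 Corporate Edition list price and ignores volume discounts; actual pricing may differ. The function name is ours, for illustration only.

```python
LIST_PRICE_CE = 499  # FineReader 7.0 Corporate Edition list price, in dollars


def licensing_savings(total_users, peak_concurrent, price_per_license):
    """Savings from buying licenses for peak concurrent use instead of per user.

    Assumes (hypothetically) the same price per license under both models.
    """
    per_seat_cost = total_users * price_per_license
    concurrent_cost = peak_concurrent * price_per_license
    return per_seat_cost - concurrent_cost


# 100 people need access, but no more than 10 use the software at once.
savings = licensing_savings(100, 10, LIST_PRICE_CE)
print(f"Licenses avoided: {100 - 10}")
print(f"Savings at list price: ${savings:,}")  # $44,910
```

The result is sensitive to the peak-concurrency estimate, so it pays to use the license manager's monitoring data rather than a guess when sizing the purchase.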

FineReader's network approach eases workgroup-style processing. As soon as images are captured or page recognition is completed, results can be shared and load balanced between multiple users in production fashion. This lets you distribute the more arduous recognition and page formatting steps among multiple users. FineReader also offers block (page) templates and batch templates that let you store customized settings for specific page zoning schemes and document types.

Among other strengths, FineReader CE can read barcodes, and version 7.0 now reads PDF-417 two-dimensional barcodes. While all three products read all the major languages, FineReader handles 177 in total - more than any other product.

FineReader supports more than 20 image and electronic document output options including four types of PDF output: PDF text and pictures only, PDF page image only, PDF text over page image and PDF text under page image. Another advance in 7.0 is linearized PDF output, which lets users begin working with individual pages even if the entire document hasn't completed downloading.

New output options in the 7.0 release include the new Microsoft Word 2003 XML format and PowerPoint (in true .ppt format). Word 2003 compatibility also allows FineReader's Zoom Window image viewer to be called up directly within the word processor so you can handle OCR proofreading without toggling between windows.

In our tests, FineReader made slightly fewer recognition mistakes overall and it was the most accurate engine in reading small text and dot matrix print. FineReader's edge was on 300 dpi images (particularly those with small and dot matrix print). OmniPage performed better on 200 dpi images overall and made most of its mistakes at 300 dpi on just two images with dot matrix print.

FineReader's page formatting options are plentiful, though not as extensive as those offered in OmniPage. If you want to recreate the look and feel of an original document, FineReader's best option is its unique "text over page image" format, which places computer readable text over the document image. This choice does a great job of preserving graphics and text positioning, and you have the option of retaining the image of a word if the recognition results are uncertain. FineReader didn't do quite as good a job at converting to Word, particularly in the case of multi-column documents, which were sometimes split into two pages.

When it comes to speed, FineReader was the second-slowest product (after OmniPage in its normal "accuracy-preference" setting), recognizing a 20-page bitonal document in four minutes and 30 seconds and converting it to PDF text under page image format in 23 seconds.

OmniPage Pro 14 Office ($599) incorporates an all-new engine said to be its most accurate to date. This new version (which succeeds OmniPage Pro 12 Office) has much to offer high-volume users. As with the competitive products, you can set up page and batch templates (called Zone Templates and Workflows), but OmniPage lets you create more granular workflows separating the capture, recognition, proofing and formatting stages of processing. In addition, a Print Cover Page utility lets you print out (proprietary) barcoded separator sheets that will automatically initiate specific workflows. (Note: OmniPage does not support standard 1D or 2D barcode reading).

Most impressive for a product of this class, OmniPage has a Batch Manager that lets you schedule workflows to run in unattended fashion so you can capture images all day and then run recognition and save to multiple formats overnight. This can be a real time saver, particularly if you have workflows that don't require human intervention - such as committing PDF Image+Text-style files to an archive without an OCR proofing step.

Unfortunately, ScanSoft does not offer a concurrent licensing approach for OmniPage, but the company does offer volume discounts for those buying OCR seats by the score (as do Abbyy and I.R.I.S.).

In our tests, OmniPage Pro 14 Office was the best performer on 200 dpi images overall (see chart), and it fared slightly better overall on routine documents (clean office documents and magazine articles) with quality printing and conventional (9-point and larger) type.

If you want editable electronic documents that look as close as possible to the originals, OmniPage's True Page format offers the best layout retention available. The downside of this choice, however, is that it creates isolated blocks of text that must be individually selected. If you need to edit or revise documents, you'll need to use the Flowing Page format, which links paragraphs and columns in the natural order in which you would read them. Unfortunately, Flowing Page sometimes introduces blank or nearly blank pages not found in the original (due to bad page breaks), but it still did the best job on multicolumn documents.

OmniPage has five PDF output options including PDF Normal, PDF Edited, PDF With Image on Text, PDF With Image Substitutes and PDF Image Only. Unlike the competitors, OmniPage lets you add encryption and digital signatures to PDFs, and it also lets you read, edit and convert electronic (non image) PDFs to other formats. OmniPage supports the new Word XML format implemented in Office 2003, but you can't invoke an image viewer from within Word as you can with FineReader.

In our speed tests, OmniPage Pro 14 was the slowest of the three products at its normal accuracy setting, taking five minutes and 45 seconds to recognize and one minute and 20 seconds to convert the 20-page test image to a PDF Image on Text file. The software does offer a "speed" preference mode, which reduced recognition time to two minutes and 17 seconds, but accuracy and formatting results suffered dramatically. In our view, the better compromise between speed and accuracy was offered by FineReader, which was two minutes and 12 seconds faster than OmniPage on the same document with comparable accuracy.
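The two-minute-12-second gap comes from adding recognition and conversion times for each product. The sketch below reproduces that arithmetic from the timings reported in this review and derives a pages-per-minute figure; it assumes both sets of timings refer to the same 20-page test image, and the helper names are ours.

```python
PAGES = 20  # pages in the speed-test image


def to_seconds(minutes, seconds):
    """Convert a minutes/seconds timing to total seconds."""
    return minutes * 60 + seconds


# Recognition time plus conversion time, as reported in the review.
omnipage_total = to_seconds(5, 45) + to_seconds(1, 20)    # normal accuracy setting
finereader_total = to_seconds(4, 30) + to_seconds(0, 23)

gap = omnipage_total - finereader_total
gap_minutes, gap_seconds = divmod(gap, 60)


def pages_per_minute(total_seconds, pages=PAGES):
    """End-to-end throughput for the test image."""
    return pages * 60 / total_seconds


print(f"OmniPage total:   {omnipage_total} s ({pages_per_minute(omnipage_total):.2f} pages/min)")
print(f"FineReader total: {finereader_total} s ({pages_per_minute(finereader_total):.2f} pages/min)")
print(f"FineReader lead:  {gap_minutes} min {gap_seconds} s")
```

End to end, that is roughly 2.8 pages per minute for OmniPage at its normal setting versus about 4.1 for FineReader on this one document. A single test image is a thin basis for throughput planning, so treat these as indicative only.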

One feature OmniPage offers that competitors can't match is its built-in speech technology. Most useful is the text-to-speech option, which lets you turn recognized text into a .WAV file. ScanSoft says time-pressed doctors and lawyers, for instance, can turn long professional journal articles into audio files they can listen to in their car or while traveling. We did this with a Transform article and were pleased with the results, though we wouldn't want to listen to that computer-generated voice reading War and Peace.

Readiris Pro 9: The Hare in a Three-Engine Race
Speed is the strongest suit for Readiris Pro 9, but it just wasn't in the same league with FineReader and OmniPage when it came to recognition accuracy. You'll likely spend far more time cleaning up inaccurate recognition results than you will save during processing, so just like the hare in the proverbial race with the tortoise, you'll cross the finish line last in terms of ultimate productivity.

As we pointed out earlier, there are instances when speed trumps accuracy, as when you just need searchable archives (with recognition results behind the image). But if speed is important, you could choose OmniPage and use the "speed" recognition setting while still having higher accuracy and formatting capabilities available whenever required.

One feature we appreciated in Readiris Pro 9 was that it performs page decomposition (also known as parsing) immediately when scanning or importing images. The other products perform decomposition during the recognition step, which may explain Readiris's speed advantage. Immediate page segmentation lets you see how the documents will be interpreted before the recognition step. If you have complex documents that often require manual zoning, you can then review the automatic results and redraw text, table and graphic zones as required. The other products might make mistakes on autopilot that could be avoided with better segmentation preceding recognition.

Conclusions
The choice between the top two performers boils down to a matter of deployment scale, document types, applications and operational preferences. OmniPage was more accurate than FineReader at 200 dpi (making 318 versus 393 character errors on 26 images with 72,928 characters). FineReader 7.0 posted higher accuracy on the tougher 300 dpi images (making 704 versus OmniPage's 919 character errors on 27 images with 48,024 characters). While FineReader made fewer errors overall (1,097 versus 1,237), it's notable that OmniPage made nearly 500 of its errors (about 40 percent of its total, all at 300 dpi) on just two test images with dot-matrix and small print. If you remove these two images from consideration (which would make sense for those who no longer encounter dot matrix), then OmniPage had slightly higher accuracy than FineReader. No single document type had the same Achilles-heel effect on FineReader.
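For readers who prefer accuracy percentages to raw error counts, the figures above convert as sketched below. The character and error counts come straight from our results; the function names and table layout are ours, for illustration.

```python
# Character totals and error counts as reported in the review.
RESULTS = {
    "200 dpi": {"chars": 72_928, "FineReader": 393, "OmniPage": 318},
    "300 dpi": {"chars": 48_024, "FineReader": 704, "OmniPage": 919},
}
ENGINES = ("FineReader", "OmniPage")


def accuracy(errors, chars):
    """Character accuracy as a percentage."""
    return 100.0 * (1 - errors / chars)


def summarize(results):
    """Return {engine: {batch: accuracy %}} with an added 'overall' entry."""
    total_chars = sum(batch["chars"] for batch in results.values())
    summary = {}
    for engine in ENGINES:
        rows = {
            name: accuracy(batch[engine], batch["chars"])
            for name, batch in results.items()
        }
        total_errors = sum(batch[engine] for batch in results.values())
        rows["overall"] = accuracy(total_errors, total_chars)
        summary[engine] = rows
    return summary


summary = summarize(RESULTS)
for engine, rows in summary.items():
    for name, pct in rows.items():
        print(f"{engine:<10} {name:<8} {pct:.2f}%")
```

Expressed this way, the two engines are separated by roughly a tenth of a percentage point at 200 dpi and a bit under half a point at 300 dpi, which is why error counts per page are the more telling measure.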

With 53 images in our test batch, this review gives you a good idea of what you can expect from these products, but the best indicator of performance is a test on your own documents. If you have high volumes (as in scores, hundreds or thousands per day) of fairly consistent documents and you have lots of users relying on the software, then it's well worth testing those documents specifically.

Considering features and functions, FineReader's concurrent user licensing model could represent a real cost advantage in large deployments. If you encounter small type or dot matrix print, FineReader will likely give you better accuracy than OmniPage (particularly if you're scanning at 300 dpi).

The granular workflows and unattended operation capabilities offered in OmniPage Pro 14 Office could be huge time (and therefore cost) savers. If you're facing typical office documents with ordinary type sizes and no dot matrix print, OmniPage will give you an edge, and you'll be able to get away with 200 dpi scanning. OmniPage also offers more choices in terms of formatting, and it adds PDF security and text-to-speech features not found in the competing products.