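OCR for PDF books

lxh623 · posted 2010-1-24 09:10:01

I rarely did OCR before. Once a book is OCRed, you can underline, highlight, copy, and search it, and it can be indexed (by desktop search). My first OCR attempt ran into trouble. Luckily Lao Ma gave me some pointers and I got it done smoothly. I then searched around online and am sharing what I found here. Thanks again, Lao Ma!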

http://www.readfree.net/bbs/read.php?tid=4870429
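[quote]Quoting strnghrs, post #4, 2010-01-21 09:58:

For large images of text pages, be sure to reduce them to monochrome after enlarging them.
Also, when running OCR in Acrobat, set "PDF Output Style" to "Searchable Image (Exact)". With the default "Searchable Image", every monochrome image is re-encoded as JPG, so the file is bound to balloon in size.[/quote]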

1. Format
Image-to-PDF Flavors
Flavors of PDF for Paper-based Documents

Covering all PDF Flavors produced by JRAPublish™ 3.0 and by Adobe Acrobat from source page images. The term "flavors" was coined by Adobe in a white paper published in 2000 and titled "The Four Flavors of Adobe PDF for Paper-based Documents". The title is somewhat misleading, as it does not describe four different formats of PDF but rather four different options for PDF creation in one program (Acrobat Capture 3.0). Seven years ago, there were just 4 flavors of PDF for paper-based documents. Today there are more, and JRAPublish™ introduces several brand new flavors to the mix.

"Image-to-PDF" is an encoding process where single-page raster images, captured with a scanner or digital camera, are converted to multipage, intelligent PDF documents. A raster image (TIFF, JPEG, etc.) only encapsulates the image of a single page, while PDF can be characterized as a compound object container that is capable of storing much more complex and complete information about a document. Our objective is to build as much intelligence as possible into the PDF file that we derive from raster page images: OCRed and searchable text, image layers, and vector graphics.

Here we update the list of Image-to-PDF Flavors with the new flavors introduced by JRAPublish™ and also by Adobe since the publication of "The Four Flavors of Adobe PDF for Paper-based Documents".

JRAPublish™ 3.0:

Image-Only
Image-Only - 3-Layer
Image-Only - Vector
Searchable-Image - Exact
Searchable-Image - Layered

Searchable-Image - 3 Layer
Searchable-Image - 4 Layer

Searchable-Image - Vector
Formatted Text & Graphics
Formatted Text & Layered Graphics

Formatted Text - 3 Layer
Formatted Text - 4 Layer

PDF/A-1b

Adobe Acrobat (all versions):

Image-Only (Acrobat Capture, Acrobat 3 - 8)
Image-Only - Adaptive (Acrobat 8)
Searchable-Image - Exact (Acrobat Capture, Acrobat 3 - 8)
Searchable-Image - Compact & Adaptive

Searchable-Image - Compact (Acrobat Capture, Acrobat 3 - 5)
Searchable-Image - Adaptive-6 (Acrobat 6)
Searchable-Image - Adaptive-7 (Acrobat 7)

Formatted Text & Graphics (Acrobat Capture, Acrobat 3 - 8)
PDF/A-1b (Acrobat 8)

PDF Flavor Definitions:

Image-Only

Image-Only takes a bitmapped image of a document and applies a PDF wrapper to that raster image. The image can be compressed using any of the available compression methods for PDF (ZIP, JPEG, JPEG2000, CCITT4). Because Image-Only files do not contain OCR text, their content is not searchable. Image-Only PDF files are produced by Acrobat Capture and Adobe Acrobat Versions 3 - 8, and by JRAPublish™.

Usage Notes

In JRAPublish™ 3.0 and in Adobe Acrobat, this flavor of PDF is called "Image Only PDF".
In ABBYY FineReader 8, this flavor of PDF is called "Page Image Only".
In Scansoft OmniPage Professional 15, this flavor of PDF is called "PDF, Image Only".
In IRIS ReadIris Pro 11, this flavor is called "PDF Image".

Image-Only - Adaptive

In Adobe Acrobat 8, Image-Only - Adaptive divides each page into black-and-white, grayscale, and color regions and chooses a representation that preserves appearance while highly compressing each kind of content. For both color and bitonal input page images, there is also an option for "background removal". This whitens nearly white areas of grayscale and color input (not black-and-white input). It actually deletes the nearly white areas from the page image, showing only the default empty white background of the PDF page itself.

Unlike Image-Only - 3-Layer PDF, this flavor does not use transparency or layers to compress the size of the page image. Rather, it uses page layout analysis to divide the page image into multiple sub-images with differing bit depths and compression methods.

Page layout analysis is usually performed when OCR is performed (in order to identify text blocks and text flow). In Adobe Acrobat 8, the page layout analysis (from the OCR engine) can be performed for adaptive compression purposes only, even when OCR is not performed. In Acrobat 5, 6 and 7, adaptive compression could only be used when OCR was also performed. So "Image-Only - Adaptive" was not an available flavor; only "Searchable-Image - Adaptive" was available.
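To make the "differing bit depths" idea concrete, here is a minimal Python sketch of the classification step only. It is not Acrobat's algorithm (that is not documented in this post); the function name, the region names and the rule for telling the three classes apart are all my own assumptions.

```python
def classify_region(pixels):
    """Classify a region given as a list of (r, g, b) tuples, 0-255 each.

    Assumed rule (not Adobe's): a region is 'color' if any pixel has unequal
    channels, 'bitonal' if it contains only pure black and pure white,
    and 'grayscale' otherwise.
    """
    if not all(r == g == b for r, g, b in pixels):
        return "color"
    levels = {r for r, _, _ in pixels}
    if levels <= {0, 255}:
        return "bitonal"
    return "grayscale"


# Hypothetical regions, as a page layout analysis might hand them over.
regions = {
    "body text": [(0, 0, 0), (255, 255, 255), (0, 0, 0)],
    "pencil sketch": [(40, 40, 40), (180, 180, 180), (255, 255, 255)],
    "cover photo": [(200, 30, 30), (20, 90, 160), (255, 255, 255)],
}

for name, pixels in regions.items():
    print(name, "->", classify_region(pixels))
```

Each class could then get its own bit depth and compression method (for example 1-bit data for the bitonal regions), which is where the size saving would come from.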

Usage Notes:

Adobe is the only vendor offering this flavor of PDF for paper-based documents.

Image-Only - 3-Layer

This flavor segments color page images into 3 layers - a background layer, a foreground layer and a mask layer - and applies a PDF wrapper using transparency so that these three layers appear as one when viewed and printed. Each of the layers can be separately compressed using any of the available compression methods for PDF (ZIP, JPEG, JPEG2000, G4). The file size for pages with color will be 6 - 10 times smaller than the Image-Only flavor. Image-Only - 3-Layer PDF files are produced by JRAPublish™. Note that this is the same compression method used in the DjVu file format for color page images, where it is sometimes called "Segmented".
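How three layers "appear as one" is easiest to see in code. Below is a small Python sketch of the usual Mixed Raster Content idea: the 1-bit mask decides, pixel by pixel, whether the foreground or the background shows. The tiny layers and all names are made up for illustration, and I am assuming JRAPublish™ follows this general model; the help page does not spell out the details.

```python
def compose_three_layers(background, foreground, mask):
    """Rebuild a page from a background, a foreground and a 1-bit mask.

    Each layer is a list of rows. Where the mask bit is 1 the foreground
    pixel is shown, otherwise the background pixel is shown.
    """
    return [
        [fg if m else bg for bg, fg, m in zip(bg_row, fg_row, mask_row)]
        for bg_row, fg_row, mask_row in zip(background, foreground, mask)
    ]


paper = (250, 245, 230)   # background layer: paper tint (compresses well, smooth)
ink = (20, 20, 60)        # foreground layer: ink colour
background = [[paper] * 4 for _ in range(2)]
foreground = [[ink] * 4 for _ in range(2)]
mask = [[0, 1, 1, 0],     # mask layer: the sharp shapes of the text
        [0, 1, 0, 0]]

page = compose_three_layers(background, foreground, mask)
for row in page:
    print(row)
```

The sharp edges live only in the 1-bit mask, so the two colour layers can be stored at lower resolution or with stronger lossy compression without blurring the text.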

Usage Notes:

In JRAPublish™ 3.0, this flavor of PDF is called "Image-Only - 3-Layer PDF".
Adobe Acrobat does not offer this flavor.
ABBYY FineReader 8 does not offer this flavor.
In Scansoft OmniPage Professional 15, this flavor of PDF is called "PDF-MRC" (MRC standing for Mixed Raster Content).
IRIS ReadIris Pro 11 does not offer this flavor.
In cVision PDFCompressor 3.1, this is called "auto-segmentation".

Image-Only - Vector

This flavor applies only to bitonal page images. The page image is completely replaced by vector graphics using a shape tracing method, so that the vector graphics replicate the raster image in a slightly stylized form. Because the page is vector, it will re-scale at any viewing resolution and you will never see pixelation in the image (because there are no longer any pixels). This flavor is beneficial when you want to zoom into the document to see fine detail, such as with blueprint images and schematic diagrams, and is also advantageous for the visually impaired. Image-Only - Vector PDF files are produced by JRAPublish™.

Usage Notes:

Image-Only Vector is an exclusive PDF flavor offered only by JRAPublish™.

Searchable-Image - Exact

The Exact version of Searchable-Image preserves color as 8-bit to 24-bit files. Like Image-Only, it takes a bitmapped image of a document and applies a PDF wrapper to that raster image. This flavor stores image information on one layer and maintains a text version of the document on another hidden, invisible layer, so you can easily search your documents. The trade-off for having a single page image is a larger file size. Searchable-Image - Exact PDF files are produced by Acrobat Capture and Adobe Acrobat Versions 3 through 8, and by JRAPublish™. This flavor of PDF was previously referred to by Adobe as "PDF Image + Text".

From health records, tax forms, and insurance claims to old memos, magazines, and books, businesses are digitizing paper every day. With the advent of better search technology, having searchable text for all these documents is an obvious win. The common way to do this is to use OCR (Optical Character Recognition) to translate the images to a document format that indexers already know, but the drawback is that we often lose the layout, images and color of the original. Plus, since no OCR is perfect, we need the original image to be able to fix mistakes. What we want is a document format that looks like the original images when humans look at it, but that looks like plain text when the indexer looks at it. And, when we copy from the image, we want text put on the clipboard. This is the promise of the searchable PDF.

In a searchable PDF, the original scanned image is retained so any human can read the document. The textual content that is extracted via OCR is put behind the image so search indexers can see it and Acrobat Reader will let us select it as text. The ubiquity of desktop and enterprise search, ever-increasing OCR accuracy, and mass adoption of PDF are a powerful combination that makes searchable PDFs the ideal format to store digitized paper.
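For the curious, here is roughly what such a hidden text layer can look like inside the PDF. The PDF format has a text rendering mode 3 ("neither fill nor stroke"), which draws nothing but leaves the text selectable and searchable. The Python sketch below only builds the content-stream operators for a few OCR words; the function names, the font resource name /F1 and the word coordinates are my own assumptions, and I cannot say whether the products above use exactly this approach.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, font="/F1"):
    """Build content-stream operators for an invisible OCR text layer.

    words: list of (text, x, y, size), positions and size in PDF points.
    '3 Tr' selects text rendering mode 3, i.e. invisible text.
    """
    ops = ["BT", "3 Tr"]
    for text, x, y, size in words:
        ops.append(f"{font} {size} Tf")         # font and size for this word
        ops.append(f"1 0 0 1 {x} {y} Tm")       # place the word on the page
        ops.append(f"({escape_pdf_string(text)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


# Hypothetical OCR output: two words with their positions on the page.
stream = invisible_text_ops([("tax", 72, 700, 11), ("(2009)", 95, 700, 11)])
print(stream)
```

A real file also needs the font resource, a suitable text encoding and the page image itself, so treat this as an illustration of the idea rather than a working PDF writer.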

Searchable PDFs from scanned documents can be indexed by Google, Sharepoint, Microsoft desktop search, and other applications that will index PDF documents.

Usage Notes:

In JRAPublish™ 3.0 and in Adobe Acrobat, this flavor of PDF is called "Searchable-Image - Exact PDF". Prior to the year 2000, Adobe referred to this flavor as simply "Searchable Image PDF".
In ABBYY FineReader 8, this flavor of PDF is called "Text under the page image". But note that the text is not invisible as it should be; it is black instead.
In Scansoft OmniPage Professional 15, this flavor of PDF is called "PDF with Image on Text".
In IRIS ReadIris Pro 11, this flavor is called "PDF Image - Text".


Searchable-Image - Compact and Adaptive

Searchable-Image - Compact uses a color-segmentation process to create small file sizes from certain types of color documents. The Compact format is advantageous when the document you need to scan has some regions that are color images and some regions that are monochrome (for example, text in any two colors). The page is segmented into two types of regions. Image (color) regions are stored within the PDF file as JPEG data. Text (monochrome) regions are stored within the file as G4 or Zip compressed data. Regions containing text in any two colors, which would otherwise be saved as 8-bit to 24-bit color, are instead saved as 1-bit color. Searchable-Image - Compact PDF files are produced by Acrobat Capture and Adobe Acrobat Versions 3 - 5.

The Compact option works best for documents that have either a few colors or colors that are distinct from each other. For example, corporate letterhead is a good candidate for Searchable-Image - Compact because logos with limited color that would otherwise have to be saved as large, 8-bit images can be saved as 1-bit images. Searchable-Image - Compact PDF files are produced by Acrobat Capture and Adobe Acrobat Versions 3 through 5.

In Adobe Acrobat Versions 6, 7 and 8, Adobe changed the name of this compression method from "Compact" to "Adaptive", and the methods employed by each version differ slightly. What they all have in common is the strategy of dividing the page into regions for compression.
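The "text in any two colors becomes 1-bit" trick can be sketched in a few lines of Python. This is only my illustration of the idea, with made-up names and a toy region; it assumes the region contains exactly the flat colors shown, whereas a real scan has noise and would need color quantization first.

```python
def to_one_bit(region):
    """Reduce a region with at most two colors to a palette plus 1-bit data.

    region: list of rows of (r, g, b) tuples.
    Returns (palette, bits), or None if more than two colors are present,
    in which case the region would have to stay a color image.
    """
    colors = []
    for row in region:
        for px in row:
            if px not in colors:
                colors.append(px)
                if len(colors) > 2:
                    return None
    palette = tuple(colors)
    bits = [[palette.index(px) for px in row] for row in region]
    return palette, bits


blue = (0, 0, 160)
cream = (255, 250, 235)
letterhead = [[blue, cream, cream],
              [cream, blue, cream]]

palette, bits = to_one_bit(letterhead)
print(palette)   # the two colors, stored once
print(bits)      # one bit per pixel instead of 24
```

At 24 bits per pixel this region would need 24 times as much raw data as the 1-bit version, before any G4 or Zip compression is applied.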

Usage Notes:

Adobe is the only vendor offering this flavor of PDF for paper-based documents.

Searchable-Image - Layered

Searchable-Image - 3 Layer

This flavor segments color page images into 3 layers - a background layer, a foreground layer and a mask layer. A text version of the document is maintained on another hidden, invisible layer, so you can easily search your documents. Searchable-Image - 3-Layer PDF files are produced by JRAPublish™.

This flavor of PDF is also produced by Nuance Omnipage and Paperport products, where it is called PDF-MRC (MRC standing for Mixed Raster Content).

Usage Notes:

In JRAPublish™, this flavor of PDF is called "Searchable-Image - 3 Layer".
Adobe Acrobat does not offer this flavor.
Scansoft OmniPage Pro 15 does not offer this flavor.
IRIS ReadIris Pro 11 does not offer this flavor.
ABBYY FineReader 8 does not offer this flavor.
In cVision PDFCompressor 3.1, this is called "auto-segmentation with OCR".

Searchable-Image - 4 Layer

This flavor segments color page images into 4 layers - a background layer, a foreground layer, a mask layer and a photo object layer. The photos are not layered with transparency as is the rest of the page, but are retained as separate single-image regions. A text version of the document is maintained on another hidden, invisible layer, so you can easily search your documents. When the document has photo regions, 4-Layer is superior to 3-Layer because the quality of photo images is retained while just the other areas of the page are layered to reduce file size. Searchable-Image - 4-Layer PDF files are produced by JRAPublish™.

Usage Notes:

Searchable-Image - 4 Layer is an exclusive PDF flavor offered only by JRAPublish™.

Searchable-Image - Vector

This flavor applies only to bitonal page images. The page image is completely replaced by vector graphics using a shape tracing method, so that the vector graphics replicate the raster image in a slightly stylized form. Because the page is vector, it will re-scale at any viewing resolution and you will never see pixelation in the image (because there are no longer any pixels). A text version of the document is maintained on another hidden, invisible layer, so you can easily search your documents. Searchable-Image - Vector PDF files are produced by JRAPublish™.

Usage Notes:

Searchable-Image - Vector is an exclusive PDF flavor offered only by JRAPublish™.

Formatted Text & Graphics

PDF Formatted Text and Graphics, also known as PDF Normal, replaces the bitmapped image with true, computer-generated text and graphics based on OCR, using only one layer.

During OCR, bitmaps of text are analyzed, and words and characters are then substituted for the text images in those bitmapped areas. If the ideal substitution is uncertain, then the word is marked as suspect. Recognition suspects appear in the PDF as the original bitmap of the word, but the text is included on an invisible layer behind the bitmap of the word. This makes the word searchable even though it is displayed as a bitmap.

OCR recognition suspects have a confidence level, measured by how many characters of the word were recognized, and other factors. Prior to OCR, a user may define a confidence threshold, below which recognition suspects will be ignored and treated as fully recognized text.

Usage Notes:

In JRAPublish™ 3.0 and in Adobe Acrobat, this flavor of PDF is called "Formatted Text & Graphics". Prior to the year 2000, Adobe referred to this flavor as "PDF Normal".
In ABBYY FineReader 8, this flavor of PDF is called "Text and Pictures Only".
In Scansoft OmniPage Professional 15, this flavor of PDF is called "PDF (Normal)".
In IRIS ReadIris Pro 11, this flavor is called "PDF Text".

Formatted Text & Layered Graphics

Formatted Text & Graphics - 3 Layer

This flavor is created by first segmenting a color image into 3 layers, and then converting the text regions of the bitonal mask layer to formatted text.

Formatted Text & Graphics - 4 Layer

This flavor is created by first segmenting a color image into 3 layers, and then converting the text regions of the bitonal mask layer to formatted text.

With layered graphics, the text, along with other high-contrast content such as line drawings, is segmented to the bitonal mask layer. On the background page image, the text and line drawings are "erased" from the image. Only a slight shadow of the text remains in the background image.

When formatted text is created, it is created within the boundaries of text blocks (regions of the page where text occurs). The image of the text in the bitonal mask layer that falls within the text blocks is removed, and formatted, visible text is displayed instead. The image of line drawings within the bitonal mask layer is kept.

When 4-layer encoding is used, then picture blocks are extracted as separate graphic objects within the PDF file (kept apart from the 3-Layer segmentation process), possibly with a different resolution and compression setting than the rest of the page.

JRAPublish™ 3.0 is the only application offering the Formatted Text & Layered Graphics flavor of PDF. However, applications from IRIS and ABBYY offer similar but less advanced flavors that are worth noting.

IRIS ReadIris Pro 11 has a flavor called "text-image PDF". We might refer to it for explanatory purposes as "Formatted Text & Background Graphics". The image of the text in the color page image is removed, leaving only a slight shadow of the text, and then the visible text is displayed on top of the image. It is less advanced because it does not have a bitonal mask layer, a foreground layer or photo zones.

ABBYY FineReader 8 has a flavor called "Text over the page image". We might also refer to it as "Formatted Text & Background Graphics", and like ReadIris Pro 11 the image of the text is removed from the background image, but the background is not smoothed to create a shadow of the text. Instead, the text is "cookie-cuttered" from the background and filled with an adjacent color.

Usage Notes:

Formatted Text & Layered Graphics is an exclusive PDF flavor offered only by JRAPublish™.

PDF/A-1b

PDF/A-1b is the new archival flavor of PDF. It is a subset of PDF Version 1.4.

See the topic PDF/A for Digital Archiving.

Usage Notes:

This is a feature of JRAPublish™ and Adobe Acrobat 8.
Adobe Acrobat Capture does not offer this flavor.
Scansoft OmniPage Pro 15 does not offer this flavor.
IRIS ReadIris Pro 11 does not offer this flavor.
ABBYY FineReader 8 does not offer this flavor.
cVision PDFCompressor 3.1 does not offer this flavor.

Source: http://www.jrapublish.com/help/JRAPublish_Overview/Image-to-PDF_Flavors.htm

2. DPI

OCR works best with a 300 dpi monochrome (1-bit, black and white) scan. I find it works acceptably with 150 dpi scans, as well. Adobe claims that 72 dpi is adequate, but you'll find some mistakes in the character recognition with such a coarse bitmap.
Source: the book "Adobe Acrobat 8 For Windows And Macintosh"
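To put numbers on those resolutions, here is a small Python calculation. The page size (US Letter, 8.5 x 11 inches) and the 10-point body text are my own example values, not from the book; the sizes are for raw, uncompressed pixel data with each row padded to whole bytes.

```python
import math


def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    w = round(width_in * dpi)
    h = round(height_in * dpi)
    row_bytes = math.ceil(w * bits_per_pixel / 8)   # pad each row to whole bytes
    return w, h, row_bytes * h


def glyph_height_px(point_size, dpi):
    """Height in pixels of type of the given point size (1 point = 1/72 inch)."""
    return point_size * dpi / 72


for dpi in (300, 150, 72):
    w, h, mono = scan_size(8.5, 11, dpi, 1)
    _, _, color = scan_size(8.5, 11, dpi, 24)
    print(f"{dpi} dpi: {w} x {h} px, 1-bit {mono} bytes, 24-bit {color} bytes, "
          f"10 pt text about {glyph_height_px(10, dpi):.0f} px tall")
```

At 72 dpi a 10-point character is only about 10 pixels tall, which helps explain the recognition mistakes, while at 300 dpi it gets about 42 pixels. A 1-bit page is also roughly 24 times smaller than the same page in 24-bit color, which fits strnghrs's advice above to reduce text pages to monochrome.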

sunyasong · posted 2010-1-24 23:04:44

Great post, very thorough!
