|
An overview of Document Scanning
How to convert paper into electronic documents
Provides a guide to document scanning, covering the principles of scanning paper, how to scan, types of document scanners, quality options, compression options, and the best way of scanning
1 Introduction
This document provides guidelines and processes for document imaging that are compliant with BIP0008 and that help ensure the successful deployment of a document scanning solution. It covers the following areas:
· Document preparation for scanning
· Batching of documents
· Scanning process
· Sample set (documents used for calibration)
· Quality control
· Re-scanning
· Image processing
· Scanner recommendations
· Design of documents for optimal scanning
1.1 Back File and Forward Scanning
Within this document, reference is made to the terms “Back File Scanning” and “Forward” or “Day Forward” scanning. Back file scanning describes the scanning of historical paper records (such as files), converting them into an electronic medium so that the paper can be removed, thus freeing up storage space and reducing paper transfer costs. Forward or Day Forward scanning refers to the ongoing process of scanning new paper as it is received.
Reference is also made to “On Demand” scanning. This is a method of converting historical files to electronic media as required (as demanded). Typically, this occurs where large historical files on a subject (company, person, patient, project) exist, but back file scanning is not desired. With On Demand scanning, paper files that still exist are converted to electronic format when they are first requested, thus gradually removing the historical paper files.
As back file scanning quickly frees up large amounts of paper storage space, it is normally the preferred method, followed by forward scanning for new paper. Back file scanning can be performed either in the company's offices, or the paper can be taken off-site and quickly scanned using industrial scanners as a service.
1.2 Scope and Objectives
When a Company takes on a scanning solution for day forward and back file scanning, it is useful to first understand the best practices for scanning paper. This document highlights the recommended process for document preparation and scanning, and provides a recommendation of the types of scanners that will be required to implement the solution.
1.3 Background
This document has been created to answer the questions a company will have around the area of the scanning process. The questions answered here will include information based on the Legal Admissibility of documents as defined in BIP0008 (to provide copies of documents as evidence in court), and what techniques are recommended for successful scanning and document management.
2 Document scanning processes
2.1 General
This section includes recommendations relating to the procedures relevant to document image capture. These recommendations cover procedures for:
· preparation of documents
· document batching
· photocopying (to improve scanning success)
· scanning
· image processing
2.2 Preparation of paper documents
All paper documents need to be examined prior to the scanning process to ensure that a successful image is obtained. Attributes such as paper size, weight, physical state (thin paper, creased, stapled, etc.), binding, and print characteristics (black-and-white, colour, tonal range, etc.) can all affect the physical scanning process.
Where documents are found which are unlikely to be accepted by the scanner, there are a number of techniques that can be used. For example, the original could be photocopied or transparent wallets could be used.
· When removing staples, clips, or other document bindings, ensure that no damage is caused to the original that may affect the capture of the information from the document.
· Where a source document has physical attachments, for example, stick-on notes, they must be distinguished from the document to which they are attached and linked to the original document after scanning so that both can be viewed.
This should be achieved by capturing a separate image of the attachment on the original page. The index data should record the fact that there is an attachment and a link to the original page. It is suggested that the document preparer photocopies the original page and scans the copy as the first page of the attachment, with the attached note added as the second page (and any further notes as successive pages).
· Where a source document has physical amendments, for example, white correction fluid, the workflow introduced should ensure that the presence of such amendments is noted. This should be through the use of a black ink pen to circle the amendment.
· All pages of multi-page documents should be kept together and in the appropriate order before, during, and after scanning.
· All pages which require specialised scanning (e.g. forms, oversize pages, low contrast pages, etc.) should be extracted for scanning in specialised scanners or with different scanning settings (colour, contrast, resolution, etc.).
2.3 Document batching
Generally, documents can be scanned using one of two methods: batch scanning, where documents are sorted into batches by type and subject, or intelligent scanning, where OCR is performed on the scanned pages and text and/or markers are used to identify the storage context.
Batch scanning of documents generally requires the production of cover sheets to be inserted between each batch of pages. These cover sheets carry identification marks (normally bar codes) which indicate the type of document and context (company, patient, person, etc). The processing of batched pages within the scanning/reading process is very fast, but requires the pre-production of cover pages.
The intelligent recognition of pages removes the need to identify the context prior to scanning, but requires a table to be defined within Metamation with the storage context. The validation and identification are carried out in the background by the scanning reader engine (which performs the OCR on the documents), which means slower background throughput but less document preparation time.
Both batch and intelligent recognition can be used together for optimum performance (such as using intelligent recognition for document storage, but pre-printed cover pages for paper ‘forwarding’ to individuals within the organisation).
The choice of the preferred method of batch/scanning control will depend on the types of documents to be scanned.
2.4 Photocopying
It may be helpful for some documents to be photocopied prior to being scanned. Such documents include:
· documents that may be adversely affected by the scanning process, such as damaged or delicate documents
· documents where there are substantial contrast or density variations over the area of the original, and where photocopying demonstrably improves the image quality
· documents containing paper or ink colours that do not produce legible scanned images, and where photocopying demonstrably improves the image quality
Photocopiers and scanners may respond differently to different colours, and it is only in exceptional cases that the technique of photocopying prior to scanning does not produce satisfactory results.
Photocopies should be examined to ensure that there is no significant loss of information during this process.
Where an image was made from a photocopy, it should be stamped as a ‘photocopy’ or ‘original photocopy’, and indexed as having been captured from a photocopy, distinguishing between photocopies made during document preparation and source documents which are known photocopies.
2.5 Scanning processes
As a general rule of thumb, it is recommended that all scanning should be duplex, in colour, and at 400 dpi resolution. However, high-resolution colour images take up more storage space, so, as detailed within this document, different types of pages may require scanning at lower resolutions, in grey scale, etc.
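To give a feel for the storage trade-off mentioned above, the raw size of one scanned page side can be estimated from the page dimensions, the resolution, and the bit depth. The sketch below is illustrative only: it assumes an A4 page (8.27 x 11.69 inches), ignores compression and file-format overhead, and the function name is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw (uncompressed) size of one scanned page side in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


A4 = (8.27, 11.69)  # assumed page size in inches

for label, dpi, bpp in [
    ("400 dpi, 24-bit colour", 400, 24),
    ("300 dpi, 8-bit grey scale", 300, 8),
    ("300 dpi, 1-bit black-and-white", 300, 1),
]:
    size_mb = uncompressed_bytes(*A4, dpi, bpp) / 1_000_000
    print(f"{label}: about {size_mb:.1f} MB per side before compression")
```

Real files will normally be much smaller once compressed, but the relative differences between settings indicate why separate scanning jobs for different page types can be worthwhile.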
As part of the set-up and configuration of a scanning process, the types of pages to be scanned can be checked and scanning ‘jobs’ can be created to scan with different settings. The scanning process would then need to factor in the separation of different document types based on the types, sizes, contrast and usability of the paper concerned.
To ensure that all documents in a batch are fully scanned, a count of captured documents should be compared with the number of documents in the batch.
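The count comparison described above can be sketched as a simple reconciliation step. This is a minimal illustration, not part of any particular scanning product; the function name and the wording of the messages are assumptions.

```python
def reconcile_batch(expected_count, captured_count):
    """Compare the number of documents prepared with the number captured."""
    difference = captured_count - expected_count
    if difference == 0:
        return "ok"
    if difference < 0:
        return f"{-difference} document(s) missing: check for double feeds or skipped pages"
    return f"{difference} extra document(s): check for duplicate scans"


print(reconcile_batch(expected_count=50, captured_count=50))
print(reconcile_batch(expected_count=50, captured_count=48))
```

A batch that does not return "ok" would be held back for investigation rather than released for indexing.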
2.6 Quality control
Procedures are required which reduce the risk of scanned images being of unsatisfactory quality. The evidential weight of scanned images will be increased if it can be demonstrated that the images are of good quality, and that the scanner was working to agreed standards at the time of scanning.
· A sample set of source documents should be assembled for the purposes of evaluating scanner results against agreed quality control criteria. It should consist of representative types of the documents to be scanned and should include a duplex (front and back) content page.
· Documents in the sample set should be representative of the complete set of documents that is to be scanned.
· Documents in the sample set should include examples of source documents whose quality is poor relative to those of the majority of the documents.
Quality control criteria should cover:
· overall legibility
· smallest detail legibly captured (e.g. smallest type size for text; clarity of punctuation marks, including decimal points)
· completeness of detail (e.g. acceptability of broken characters, missing segments of lines)
· dimensional accuracy compared with the original
· scanner-generated speckle (i.e. speckle not present on the original)
· completeness of overall image area (i.e. missing information at the edges of the image area)
· density of solid ‘black’ areas, and colour fidelity
Quality control criteria for image quality should be realistic given the nature of the source material and the characteristics of the scanning equipment and based upon the sample set of documents.
2.7 Evaluating image quality
The scanners should be set up using the sample set of documents and should be retested on a weekly schedule to ensure the best quality of the scanned images.
During operation, image quality should be evaluated using a 20” monitor with a resolution of 90-100 dpi, which should allow the validation operator to view the document as a complete page to ensure that the comparison with the original is complete. The validation system should allow the operator to print suspect documents to verify that the image can be reproduced and validated against the original, and that the reproduced image is as good as the original.
The scanned image should be printed on a colour printer with a resolution greater than the 400 dpi of the scanner. This is to ensure that all information is printed.
The results of all quality control checks should be stored in the Quality Control Log with the reason for rejection.
The sample rate should be every 5th page in the first month, reducing to every 10th page in the second month and to one page every hour in the third month.
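The sampling schedule above can be expressed as a small decision function. This is a sketch under stated assumptions: months are counted from 1 at the start of the scanning operation, the hourly rate is assumed to continue after the third month, and the function and parameter names are hypothetical.

```python
def should_sample(month, page_number, minutes_since_last_sample=0):
    """Decide whether a scanned page should be pulled for a quality check."""
    if month <= 1:
        return page_number % 5 == 0    # every 5th page in the first month
    if month == 2:
        return page_number % 10 == 0   # every 10th page in the second month
    # Third month onwards (assumed): one page per hour.
    return minutes_since_last_sample >= 60


print(should_sample(month=1, page_number=15))
print(should_sample(month=3, page_number=15, minutes_since_last_sample=20))
```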
When the sample set is used the whole set should be validated for accuracy.
2.8 Checking scanner performance
Optical and paper transfer rollers should be cleaned daily or on demand, when for example, a clean original shows banding on the scanned image which is produced by dirt on the optical system.
The sample document should become the scanner test target and should be used to monitor scanner performance.
Scanner performance checks should be carried out weekly to ensure that the scanner performance is within agreed tolerances.
Hard copy prints should be made of the scanned images of the test targets and compared with the test targets themselves to determine whether the quality criteria are met.
2.9 Rescanning
All pages marked for rescanning should be identified and rescanned using a flatbed scanner where possible, to improve operator controls over the scanning.
The operator should have the ability to change the contrast adjustments or increase the resolution of the scanner to improve the scanned image. However, de-speckling of the image should not be allowed, as this can change the content of the scanned page, making the original unreproducible.
All pages rescanned should replace the original scanned page which was marked for rescanning. The operator should ensure that the information on the page is accurately represented before replacing the image.
2.10 Image processing
The following sections (3 and 4) describe some different types of documents, and associated image processing facilities that may be used. Some of these operations are carried out during and/or after scanning.
The scanners should be set up to automatically de-skew pages. On occasion the operator may need to carry out the de-skew operation using the pull-down menus in the application. This should be at the operator's discretion; the alternative is to reject the page and rescan it.
Where documents are OCR'd or OMR'd, then the operator should be required to verify the accuracy of the text or marks against the original page as well as the scanned page. This is to ensure that the accuracy of the content is represented when carrying out free text searches.
De-speckling and border removal are NOT acceptable. If a page requires extra processing to remove noise, then the page should be rejected and rescanned with a different scanner setting, and the page carefully validated for quality.
3 Image processing
3.1 General
For legal reproduction of the original scanned documents, the majority of image processing tools (e.g. de-speckling) cannot be used. The following are the acceptable tools that can be used.
3.2 Document skew
Document skew is a term used to describe the phenomenon of poor document alignment (rotation) during the scanning processes. In its most pronounced form, images can appear on a viewing screen as crooked or slanted. Even a small angle of skew is likely to affect data capture processes and thus reduce data recognition rates.
Passing images through de-skewing processes may correct this problem.
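To illustrate why even a small skew angle matters for data capture, the vertical drift of a text line across the page width can be estimated with basic trigonometry. The sketch assumes an A4 page width of 8.27 inches and a 400 dpi scan; the function name is hypothetical.

```python
import math


def skew_drift_pixels(width_in, dpi, skew_degrees):
    """Vertical drift, in pixels, of a horizontal line across the page width."""
    return width_in * dpi * math.tan(math.radians(skew_degrees))


for angle in (0.5, 1.0, 2.0):
    drift = skew_drift_pixels(8.27, 400, angle)
    print(f"{angle} degree skew: line drifts about {drift:.0f} pixels across the page")
```

A drift of this size can move text out of the fixed zones that a recognition engine expects on a form, which is one reason recognition rates fall on skewed pages.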
3.3 Speckle, noise and background marks
This feature should not be used. It is included for information only.
Random black marks (speckles) which appear on an image may have been generated during the scanning process, or may be present on the original document. These speckles may be removed by systems involving special algorithms. These algorithms assume that small isolated clusters of pixels contain no information, and may be deleted.
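For information only, the assumption behind such algorithms can be sketched on a tiny black-and-white image held as a list of rows (1 = black pixel). This is a simplified illustration, not the algorithm of any particular product; the function name and the 4-connected neighbourhood are assumptions.

```python
def remove_small_clusters(image, max_size):
    """Return a copy of a binary image with every cluster of max_size or
    fewer connected black pixels cleared. The input is left unchanged."""
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood fill to collect one 4-connected cluster of black pixels.
            cluster, stack = [], [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(cluster) <= max_size:
                for y, x in cluster:
                    result[y][x] = 0
    return result


page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],  # the lone pixel could be a speckle or a decimal point
    [1, 1, 0, 0, 0],
]
cleaned = remove_small_clusters(page, max_size=1)
```

The isolated pixel is deleted whether it was scanner noise or a genuine mark such as a decimal point, which shows how this kind of processing can alter the content of a page.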
3.4 Black border removal
This feature should not be used. It is included for information only.
When scanning documents of mixed sizes using certain scanner types (such as rotary scanners), black borders may be left around the edges of smaller documents. Black border removal entails the deletion of such large areas of black pixels.
3.5 Forms removal
The scanning of textual information on a pre-printed form is common when automated data capture processes such as OCR and OMR replace a large keyboarding operation. To increase the accuracy of the recognition rate, images can be passed through a post-scanning process that will remove boxes, lines, and pre-printed text.
Where new forms have been designed and are intended for OMR and OCR, then forms removal should be used. The forms will be defined during implementation; a list should be given to the operators, and the scanning systems set up to recognise the forms which are enabled for this method.
4 Scanning specific types of document
4.1 General
This section gives details of different types of documents, and the scanner characteristics needed to give acceptable results within the Metamation information management system. The characteristics detailed in this section are not applicable where Optical Character Recognition is to be performed on the scanned image.
4.2 Text, typed and printed
It is recommended that a resolution of 400 dpi be used as the minimum for the following reasons:
· At lower resolutions, some detail may be missing from some characters, particularly if they contain thin elements, including serifs; fonts under about 6 point on the original may not be captured very clearly.
· With material containing particularly small type sizes (e.g. superscripts and subscripts), a resolution of 600 dpi or more may be necessary.
· For material that may be processed using Optical (or ‘Intelligent’) Character Recognition, it may be beneficial to scan at a higher resolution than would be satisfactory for visual legibility. For example, while for much material 200 dpi would be satisfactory for visual representation, it may be preferable to use 300 dpi resolution if OCR/ICR is to be used; similarly, where 300 dpi may be visually satisfactory, 400 dpi may be better for OCR.
· Material which contains handwriting is known to be difficult to read; a resolution of greater than 300 dpi may be required.
No decisions should be made regarding choice of resolution without conducting tests against the sample set. Careful tests should be carried out to ensure that the resulting image remains an effectively ‘true’ facsimile of the original. These tests should use the sample set of documents, and hard copies should be made of scanned images.
There should be no anomalies introduced into the enhanced image that are visible under normal office lighting conditions.
It is important to bear in mind that the validation monitor should have an effective resolution of about 90 to 100 dpi. This is normally adequate for typed material but ‘zooming’ may be required with small sized print, and this requires that the scanning resolution should be substantially greater than the basic display resolution.
The results of these tests should be stored with other records of the scanning processes.
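The relationship between scanning resolution and display resolution described in this section can be put into numbers. The sketch below assumes that "full detail" means showing one scanned pixel per screen pixel; the function name is hypothetical.

```python
def magnification_at_full_detail(scan_dpi, display_dpi):
    """Magnification relative to the paper original when each scanned pixel
    is shown as exactly one screen pixel."""
    return scan_dpi / display_dpi


for scan_dpi in (200, 300, 400, 600):
    zoom = magnification_at_full_detail(scan_dpi, display_dpi=100)
    print(f"{scan_dpi} dpi scan on a 100 dpi monitor: up to {zoom:.0f}x zoom without interpolation")
```

In other words, the higher the scanning resolution relative to the monitor, the further the operator can zoom into small print before the displayed image runs out of captured detail.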
4.3 Line drawings/art
For line drawings/art which form part of otherwise text-oriented documents, the scanning resolutions applicable to text are typically satisfactory for the drawings also. With printed material, where fine lines are used in the artwork, 300 dpi may be too low, but this can only be determined via tests on sample documents.
4.4 Handwritten material
With material where a modern pen, ball-point, or pencil was used, 400 dpi will normally be adequate. For older material where a steel-nibbed fountain pen was used, the thinness of the upstrokes will often require 400 dpi as the minimum resolution which will satisfactorily capture the text without significant components of these upstrokes being lost.
Handwriting (or hand drawing) using pencils can be faint, and difficult to reproduce. Care should be taken to ensure that image brightness and contrast are appropriate for these images.
4.5 Charts, plans, and drawings
For hand-drawn charts, architectural, and engineering drawings, there may be finer lines present than would be the case with a typical ‘full-sized’ CAD drawing, and although 300 dpi will usually be a satisfactory resolution, tests should be done to ensure that the finest detail is captured. It may prove necessary to use 400 dpi.
If the scanning is to be done from copies of the originals, and if these copies have been reduced from the originals (which is quite common), then a higher resolution may be required than would otherwise have been satisfactory.
With drawings and Critical Care Unit (CCU) charts, dimensional accuracy may be important. Because of the large size of drawings, the paper or film may undergo dimensional change (due mainly to variations in moisture content). For working drawings it is often a requirement when scanning that dimensional inaccuracies are corrected, i.e. the scanned image may be post-processed to correct scale inaccuracies, skew or lack of orthogonality. Such corrections mean that the subsequent image is not a true facsimile of the original. Where legal admissibility may become an issue, it will be necessary to preserve an uncorrected version of the scanned image as well as the corrected version, with the appropriate links to both documents.
4.6 Maps
With maps, a minimum resolution of 400 dpi will be required, but much higher resolutions (e.g. up to 1000 dpi) may be required with some material which contains fine detail.
As with drawings, scanned images of maps are frequently corrected after scanning for scale inaccuracies and lack of orthogonality in the original.
Where coloured maps are being scanned, and the colour is to be preserved, the scanner should be capable of capturing individual colours with the required discrimination. While the number of colours subjectively present may be quite small, 8-bit colour (256 colours) may be inadequate and it may be necessary to scan with 24-bit colour in order to provide the required colour discrimination. Tests should be done to determine how many ‘bits’ of colour are required.
4.7 Half-tone material
Where half-tone material (black-and-white or colour separated) is present on a page along with text and/or line art, the outcome objectives of the scanning should be considered.
If the objective is to produce a scanned image that is comparable in quality to a ‘normal’ black-and-white photocopy, then a scanner which produces a bitonal image (i.e. ‘black-and-white’) will suffice. The resolution may have to be higher than that which would be acceptable for text only: 400 dpi will be required to capture half-tone material.
If the half-tone content has value in the application context, following the recommendations that apply to scanning text or line art may result in the capture of images of unacceptable quality from the half-tones.
Most scanners have different settings for scanning text or line art and scanning half-tones. It is a general problem when scanning mixed text or line art and half-tones with a ‘black-and-white’ scanner that the scanner settings that are optimal for text are far from optimal for the half-tones, and vice versa. When set for ‘text’, the quality of the half-tone images will generally be significantly worse than a photocopy; when set for ‘half-tone’ or ‘photographs’, the text may appear rather blurred in the scanned image, to the extent that the image would not form a good facsimile of the original text.
If the half-tone content has ‘cosmetic’ value only and does not contribute to the essential information content of the original, then the scanning should be done according to the recommendations which apply to text or line art material.
If the half-tone is to be captured to a quality level comparable to that of a typical (good quality) photocopy, then there are two options. One option is to scan the document with the scanner settings ‘normal’, at a higher resolution than would be necessary for the text alone; 400 dpi minimum is recommended. The other is to scan the document twice, to create two images, one where the text/line art is captured to satisfactory quality and the other where the half-tone material is satisfactorily captured. In the latter case a record should be kept that the production of the two images involved different scanner settings (affecting the processing performed on the images).
If the half-tone material is to be produced to a quality comparable to that of the original, then it should be processed according to the recommendations for photographs.
4.8 Continuous-tone images
Continuous-tone images include photographs, medical and industrial radiographs (X-rays), and images generated by computer as photographic style images, including, for example, ultrasound images, CT and MR images.
With material containing continuous-tone areas (grey scale or colour), where the tonal information should be preserved, scanning should be performed with a scanner capable of capturing the required number of grey levels and/or colours. The number of levels that is appropriate should be determined by benchmark tests on the sample set of documents.
For images from photographic material, the number of grey levels will typically be 16, 64, or 256 (i.e. 4, 6, or 8 bits per pixel). For very high quality images, 256 levels are normally used, and for X-rays, up to 1024 levels of grey (i.e. 10 bits per pixel) may be necessary.
For colour photographs, 24 bits per pixel of colour information are used in most applications, but for very high quality images, up to 36 bits per pixel may be necessary. Typically, 15 or 16 bits of colour are used; for source material containing only a small palette of colours, 256 colour levels may suffice. Tests should be performed to determine how many colour levels are required.
· With continuous-tone colour, most scanners capture 8 bits of colour information in three different regions of the colour spectrum: Red, Green, Blue (‘RGB’), resulting in 24 bits per pixel, or the ability to reproduce over 16 million colour variations.
· With only 8 bits of colour information (256 levels), there may be a noticeable ‘blockiness’ in the image if the original contains a broad range of colours.
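The bit depths quoted in this section map onto numbers of grey levels or colours as powers of two. A short sketch of the arithmetic (the function name is illustrative only):

```python
def levels(bits_per_pixel):
    """Number of distinct grey levels or colours for a given bit depth."""
    return 2 ** bits_per_pixel


for bits in (4, 6, 8, 10, 24):
    print(f"{bits} bits per pixel: {levels(bits):,} levels")
```

Three 8-bit channels (red, green, blue) give 24 bits per pixel, which is where the figure of over 16 million colour variations comes from.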
Scanning resolution requirements for documents containing colour are normally similar to those for black-and-white material, particularly if there is text present on the original. Thus scanning may be performed at 200-400 dpi, referred to the original photograph. If there is no text present on the original, satisfactory images may be achieved at lower resolutions, down to television quality levels (about 350 lines per image frame); this would typically be satisfactory for identity photographs and similar applications.
To assess image quality, in general it is satisfactory to compare the screen images with the original. If there is likely to be use of high quality hard copy images then the comparison should be made between hard copies of the images, produced on a high quality colour printer, and the originals.
Care should be taken when comparing screen colours with an original that the colours were correctly balanced at the time of image capture, and that the display system has also been calibrated correctly. Otherwise the displayed colours may be significantly different from the colours on the original. The same requirement applies when comparing the original with hard copies of the captured image.
Where colour accuracy is important, a standard Colour Gamut test chart should be scanned at the same time as the original (or batch of originals scanned at the same time), and the image of this chart stored along with the original.
4.9 Mixed mode documents
Mixed mode documents comprise more than one document type inside a single document (e.g. photograph, text). From a scanning perspective the documents described above containing half-tone material are essentially of this type, even though the original has been created in a single print operation. As described in 4.7, the use of scanner settings optimized for one type of material can result in the loss of information in material of other types. As suggested in 4.7, one solution is to capture multiple images, with scanner settings (or even scanner type) selected to optimize the image quality for each material type.
One option is to use a scanning system that can scan mixed mode documents automatically, with automatic detection of each type of material and automatic optimization of the settings for each type. These systems can also be set to select the most appropriate compression algorithm for each type of material. Benchmark testing should be done to ensure that the results are acceptable.
4.10 Documents with note sheets attached
Some documents may have note sheets or notelets attached. Care should be taken when scanning such documents. It may be necessary to remove the attachment where, for example, it obscures information on the document. If removal is required, the note should be marked or stamped as being a part or page of the document to which it was attached, and scanned and indexed separately. The original page should also be indexed to indicate that it has an attachment.
Where a system has a facility to indicate that a document has a related image, then this facility should be used.
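The two-way indexing described in this section (the page records that it has an attachment, and the attachment records which page it came from) can be sketched with two simple record types. All class, field, and identifier names here are hypothetical and do not reflect the index schema of any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    page_id: str
    has_attachment: bool = False
    attachment_ids: list = field(default_factory=list)


@dataclass
class AttachmentRecord:
    attachment_id: str
    parent_page_id: str


def link_attachment(page, attachment_id):
    """Index a removed note so that the page and the note refer to each other."""
    page.has_attachment = True
    page.attachment_ids.append(attachment_id)
    return AttachmentRecord(attachment_id, page.page_id)


page = PageRecord("DOC-001-P3")
note = link_attachment(page, "DOC-001-P3-N1")
```

With both links stored, a viewer can open the note from the page and the page from the note.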
4.11 Microform documents
Microforms should be examined carefully prior to deciding upon the scanning approach. Within multi-frame microfilm media (roll film, microfiche, microfiche jackets, multi-frame aperture cards), automated frame detection should not be used unless the inter-frame gap can be detected unambiguously.
If the gap is not detected, multiple frames may be merged into one image. Depending on the physical characteristics of the scanning system, it is possible that some part(s) of the digitized image may be lost.
With jacketed film, film strips may overlap. The processing procedures should ensure that such overlaps may be detected and corrected before scanning, otherwise some page images will be missing or illegible, in whole or in part.
Where a rotary camera has been used, images on the film may not have a one-to-one correspondence with the original documents. For example, two pages may have been fed at once, so that on the film part or all of an original page may be missing.
http://knol.google.com/k/an-overview-of-document-scanning#
A comprehensive explanation of scanning. |
|