From paper to database–Part I
Selecting a PDF and OCR-program for automating scanned paper based information into a data warehouse.
Sometimes when you build a BI application part of your data is not from your own computer systems. If you are a customer you could often claim that you want the data in a digital form and get it into your database for analysing by file transfers. It could be data about competitors or as in my case it is a customer and you are not in the position of claiming your “rights”. You just have the data printed on paper.
So one of my customers gets about 100 pages of paper each week with detailed information that they have a need to analyse. For many years we have used manual input having it “keypunched” from paper into an Excel worksheet in the traditional way. But since costs have risen and the volume is too small for getting it done overseas, last summer time has come to investigate other methods.
Choosing a tool.
Since we started the application many years ago, today all paper invoices get scanned and stored before entered into the accounting system, so all information is already in a digital form in a PDF-file. The problem is that there are many types of objects that could be hidden in a PDF-file, in our case it is actually an image, with no interpretation of the content.
The information we want to use in the database are from different parts of the paper, so even if we get the image there is still some work to do. A good combination of a program capable of reading PDF and OCR-scanning the picture would do most of the work making our project realistic. Some five years ago we tested this approach and then found it to be impractical, but this summer we gave it a new try. So we found after some investigations a Canadian product called AbleToExtract from InvestTech in Toronto. We downloaded the free trial and tested it for our purpose before actually paying the $130 cost of buying it.
The tool has several options when scanning and we tested a few, finding that the defaults chosen were good enough for us as a starting point. We have never used OCR-software before so we were newbies in the field. Starting up the program it is a very easy three step procedure: Show the PDF-file, select output format (Excel for us), and then you have the possibility to select a lot of options that we didn’t spend too much time with, since we looked for an automated process. We decided to use “Convert” under Automatic. The program is heavy on the CPU and is one of the few programs capable of using some of the power in my quad-processor:
With this 61 page sample it uses my machine for about 3-4 minutes before it is actually done. When ready I end up in Excel with the converted PDF-document interpreted and put into cells of a worksheet. Not exactly as good as one would hope for, but certainly good enough to be useful for further processing.
In this sample there is actually no error in the OCR-interpretation on page 61, but depending on the quality of the scanned images this was an easy one. After testing a hand full of files we found some properties that you have to be prepared for. Firstly on some pages AbleToExtract shifts the columns where it put in your data. Here we see that the quantity has moved from H to I and then on page 13 it is in column G. On some pages we have also found that there are shifts within a page, especially if there are “dust”-problems.
We have no control of the scanning process of the original documents, since this service is outsourced to an external company, but we have found that some files are much more cumbersome than others. The dust problem which occurs on some pages means that the OCR interpreter finds a lot of dark spots that sometimes get interpreted as: . ’ – , and so on. If these are lonely in a cell they are easy to neglect but sometimes 0 will be found as 8,6 or 9. So don’t expect everything to be correctly interpreted. You will find other types of problems 1 could sometimes be a lowercase l and sometimes be an uppercase i. Zero goes into O.
So after the first interpretation you will need a “washing-program”, which is capable of correcting most of the errors. If it finds some tricky situation it gives up and tells you about the problem so you could easily find it for manual correction. So we found AbleToExtract to be a good enough candidate to give the next step a try…