Paper to database – third phase

When using OCR scanned data as input, you run into dust problems, erroros in OCR-interpretation and as with AbleToExtract also some alignment problems in which column to find the data you are looking for. Dust should be removed, common errors with figures could also easily be handled, and floating columns makes need for an anchor.

So now we have picked up all rows in correct order and have them in random columns, which is a good start. But what about the dust? It does occur so why not remove it if it is by it’s own in a cells by:

image 

Where the sub “TaBortAlla”, Swedish for “RemoveThemAll”:image 
Removing the dust in empty cells, makes it easier to recognize empty cells. 
One way of navigating on a row is counting columns from the left. But it happened sometimes that the where new columns inserted when there was a shift of page so we went for better approach. Ahta we found was that there were seldom shifts in columns in the columns in the middle where we had the article numbers, i.e. keeping neighbours together. What we needed was an anchor point in each row, a place to feel comfortable around to identify it as a valid input row. In our case we were lucky all article numbers started with ‘”AXX” in one of the columns, so we believed it to be a good anchor. Some columns where to the left of it and some where to the right. So we did some scanning for “AXX” which in most cases where successful, but it also appeared as: “AXXC”,”AX0C”,”AX0(“,”A)(X”,”A)CX”….. OCR is not easy! So far we have found this code useful to solve the anchor point problem, it is a bit of fuzzy logic but, it works…:

If x = "AXXC" Or x = "AX0C" Or x = "AX0(" Or x = "A)(X" Or x = "A)CX" Or x = "44XX" Or x = "4.XX" Or x = "A)0(" Or x = "AX0(" Or x = "A)O(" Then
    iColAXX = j
    iWidthAXX = 4
Else
    x = Left(x, 3)
    If x = "AXX" Or x = "AX0" Or x = "A)(" Or x = "A)C" Or x = "4XX" Or x = "44X" Or x = "4.X" Or x = "A)0" Or x = "AYX" Or x = "AX0" Then
        iColAXX = j
        iWidthAXX = 3

We also added some lines of code to find new version of “AXX”, logging rows with many columns where AXX was not found with the above lines, picking up new alternatives as they arrive. So with these lines of code we pick them all up and have got our anchor on a row:

image
image

Sometimes our anchor is in F sometimes i G and sometimes in … But in the column to the right we assume the quantity will be and then the amount per unit etc, and to the left we assume that it contains the article number of the wholesaler,etc. We seldom found that a column had changed it’s relative position to the anchor column, which would have made the logic is a little bit more tricky. All data we where looking for on each row was supposed to be numeric.
But what is numeric when working with OCR-scanned data? There are some common problems with numbers in OCR, so if we are almost certain that in this columns should only contain figures we use the following code:

image
Three types of replacement, first is dust among the figures, then comes common errors like Il for 1 etc, and then the decimal point which could be “.;,”, since we use “,” in Swedish that is what we want. If value still is not numeric the content is written to the log for the human eye to investigate.
For each valid (ie one with an identified anchor column) row we find, we produce an output row that will be used in as input in the next step. If we suspect there are some errors the code could not handle we produce a message on a sheet we use for logging:
image
For each error found the row where the problem was found is recorded. There are three problems reported on this sample, 2 in the 1st phase and one in the second which we describe in the next part of this story. There is also a severity on each error if it is just informative or if it is critical.

Be the first to comment - What do you think?
Posted by Rickard - February 4, 2012 at 17:32

Categories: Uncategorized   Tags:

Excel Conference in London 25/1 2012

imageSo I went there without too much thought about why? Ryanair having a marketing campaign on the route made my travel expenses close to nil. As a knowledge partner for my customers I often need input from smart guys, so I thought I will give these UK guys a try. Simon and the others I met at the pub the evening before impressed on me. Many of them where experienced VBA-guys gone through them same process I have been trough, with similar but in some aspect different experiences.

imageLate but not too late I arrived at the conference. The starter was not my favourite subject. I spent a lot of time trying to figure out the importance of being accurate in the definition of how and why you solve problems with Excel and why that is important. Suddenly I realised that this is probably a result of the financial crises we have been trough… Sorry but it did not strike me that hard, so that might be why I lost the importance….

Next on stage was Bob Phillips who told the story about Power Pivot the new invention from MS in Excel 2010. To be really honest I was a bit disappointed about the speed, Bob told us about using VPC, but anyhow, why VPC?? I think that due to all interrupts he may have missed giving me my point. I don’t believe ordinary end user are capable of doing my jobDjävul, and so did Bob. I think the important point about Power Pivot is setting up the business aimed views and combining them with tables you have in your own spread sheet is the real selling point, at least that is a feature I think my customers will like.

Next on stage was Mike Staunton and ExcelDNA. I have never heard of it, so to me it was very interesting, though I am still disappointed about the “.Net-thing” MS gave us. I have for a long time being disappoint about MS, and suddenly these guys give me hope Ler. So after this presentation I might give my 20k’ codebase a new try, not tomorrow but in 2012, maybe?? He also talked a lot about FFT and how to make them work faster, maybe not what I am up to most of my time.. Nice guy with a lot of enthusiasm, but maybe a little bit academic for me..Cool

We went into “The Slaughtered Lamb” and hade a nice lunch sponsored by F1F9. Thanks Morten.

After arriving late after lunch Simon, being a good host, gave up his presentation on project management. Instead he introduced Mathias Brandewinder, a guy who presented himself having a strange accent in English, not being my mother’s tongue, I had some problems. He told his story about going from VBA to VSTO… In this issue I have been through a lot of pain, making me very uncomfortable for him, sorry for making to many interrupts Utomjording. He tried as many before to convince my brain of doing the shift, but sorry to say he did not manage. I still don’t know how to convince my customers of doing the effort of a transfer to the “promised land”…. Pls pls give me the solution I have been searching for a long time.

Next on was Charles Williams telling us the story about converting your brain from VBA to C++!!!
Rullar runt på golvet och skrattarSjukProbably there are more feelings about such a shift in your brain. He made a very good presentation on what this is all about. Dramatic was, when he got  hidden compiler errors, but everything was under control, just giving us some excitement. Very subtle. Probably I wont go into it, but it was a very good presentation.image

Morten from Denmark living in Baath running F1F9 (Help+Calculate) was next on stage. Another story about making spread sheets understandable and possible to audit. I liked his presentation though it was a little beside my main interest.

Simon was then given the possibility, and he told an amazing story about bridges I will never forget. After that we ended up in a pub were all problems for the world was solved.

Great experience, and thanks to all I have learned to know.Tummen upp

Some other views from the conference:

Be the first to comment - What do you think?
Posted by Rickard - January 26, 2012 at 01:07

Categories: Uncategorized   Tags:

ERP-data into Excel from Movex

Summary.

The demand for current information is increasing and the technical development allows more applications to be profitable by direct extraction of data from your ERP-system

Background.

imageExtracting data directly from the ERP-system is today in many cases a realistic approach. Some years ago it was a dream. Often you needed to extract the data into a data warehouse and then transfer it into a cube. In many cases these intermediate steps made the needed fast implementation unrealistic and expensive, killing the idea behind the application. Most BI-vendors have understood this, like Microsoft’s announcement of PowerPivot for Excel 2010. MS claims  it will make it possible for advanced Excel-users to extract data direct from their ERP-system, making people like me unemployed. Having been in the arena for some decades I am not too worried, but the elimination of unnecessary steps is good, very good. The real problem behind this is still unsolved and will probably be for a long time. When I first heard about the collaboration between SAP and MS I was naïve to believe they together would fix it. Still after many years we have seen nothing. What the user needs when interacting with his ERP is a view of the data of his interest not a collection of some thousands of tables with many mysterious rules hidden in relations and triggers. To build these views you need to understand a lot behind the scene, so at the end it is all bark no bite.

Trivial example.

imageLet us assume you are a product-specialist in a company, responsible for the introduction of new product into the market. When introducing new products you need to follow sales, often daily in the beginning to make the correct decision about if is is a go or a no go. First problem is to define “what is a new product”. In this case it is easily solved by collecting a list of product numbers, probably this is done in Excel. The next step is to answer: “Where are our sales stored?”. Asking his IT department he is directed to the data warehouse (if he is lucky), which is updated once a week. This will kill his application, because when he tells them his needs they will tell him – summarizing data daily will cost a fortune.

imageSolving this problem has technically been possible for decades, but in the past cost to much in resource usage of computers to be profitable. So let us go ahead and solve it if your ERP-system is Movex (Lawson/M3). The database with my customers is in DB2 in AS400 and have been accessible through ODBC for a long time, so that is really no news. Daily I meet people struggling with MS Query digging into the tables of Movex using a language they call “Moviska”. The real challenge is finding the right tables and how to join them, then you have to find the columns to use which could be solved by recognizing figures from your daily work when browsing though the details. We try Google for this and enters “Movex sales stats” and in 2nd document we find the keyword “OSBSTD”. This is actually the table where sales stats are stored!!

How to connect to DB2 i AS400 with ODBC i standard techniques, you could use MS Query, MS Access pass thru or linked tables. I prefer linked tables in Access which makes it very convenient to construct the SQL you need, and it is also very simple to join it to information of your own needs, so I put up a table ProdList which contains my list of interest. Then I start the query designer:

image image

As you see, you need to know the names of the columns in one way or another (Moviska), but then it easy simple and straight forward. So we move the data to Excel, either by copy/paste or with a query in Excel, where we put it into a feasible graph and after some clicks we have:

image

So we are all done with the prototype and wants to put it into production. Problem is the execution of the query takes 260 sec and a lot of resources making the IT-department worried. With a minor change in the SQL you will have an execution time of 60 seconds. Making it a passthru query and making it use a feasible index reduces the execution time down a factor of 20 giving 3 sec – for the same query. Why?

The SQL used if you would like to copy:
SELECT UCIVDT AS Invdate, Sum(UCSAAM) AS Sales, Sum(UCSAAM-UCUCOS) AS Revenue 
FROM OSBSTD
WHERE UCITNO In (’121881′,’121891′,’121901′,’121911′,’121921′,’121931′)
AND UCCONO = 1
GROUP BY UCIVDT
ORDER BY UCIVDT

Knowing the methods and where and when to use them, that is what craftsmanship is all about in IT.

Be the first to comment - What do you think?
Posted by Rickard - January 23, 2012 at 11:20

Categories: Uncategorized   Tags:

From paper to database–Part II

Summary.

When trying to use AbleToExtract for extracting information from a pictures in a PDF-file into an Excel-worksheet we went into some problems. We investigated them ending up finding solutions in a practical way for all of them.

Background.

This summer we were asked to automate the process of getting paper based information into a data warehouse, by using scanned pages stored in a PDF-file. In an earlier post is the story of selecting a tool for doing the first part of the job. We found a tool called AbleToExtract that takes a PDF-file as input ending up with an Excel workbook containing OCR interpreted data put into feasible cells.

Our next step is to take this file and translate it into a worksheet for loading the data into the tables in the warehouse.

Looking for useful connections in data.

In the earlier manual process there had always been up to 0,5-1% in difference of the total sum of the invoice and the sum of the transaction read into the database. With that method we found getting it down lower was not worth the price, so this error has been accepted, but has not helped building confidence for the data. Now we put the goal much lower, max 0,1% if any.

The data in this application is discounts offered to a specific customer from a wholesaler but paid by the manufacturer. On a weekly or monthly basis these discounts are invoiced to the manufacturer from the wholesaler. This type of transactions are very common when the public sector is buying in the Swedish food market, for example for school meals.

Looking firstly at the detail rows:

image

To the left is the name of the product sold, which is not needed since the products are stored in a product table in the DW. The OCR program is English so it does no take into account the three special Swedish characters ÅÄÖ. The next part is “ST” which is the unit used for the amounts. The next box contains the article number, for the wholesaler and for the producer (prefix AXX), and the last four numbers are quantity, and then the discount for each unit (repeated twice) and the product of quantity and discount producing a total for each row. So there is a lot of redundant information in each row that could be used to find and in many cases eliminate the errors in the OCR interpretation. 
Starting with the product codes, which should be numeric when neglecting the prefix ‘AXX’.  Both these numbers are stored in a DW table, so they are checked there. If the number does not hit in this table it produces an error message that has to be handled manually. If it get a hit on one of the numbers it is only reported as a possible error, if both numbers hit no error is assumed.
In the amount fields we have: If the two numbers representing discount/unit both are equal and numeric (dec. point i Sweden is ‘,’ ) they are believed to be correct. The next test is if the product is correct (quantity * disc/unit = rowtotal) if not, the quantity is checked if it is an integer, in that case the quantity is believed to be correct and the product is calculated. An error message is produced for all rows where this multiplication is incorrect, but with the extra assumptions the rate for manual corrections are minimized.

Let us have a look at the subtotal headers which occurs before a section with new detail lines:

image

The first number is a customer number in the ERP of the wholesaler followed by the name of the customer. Since a customer could have more than one contract there is a number of the contract to be picked up as well. Since both of these numbers have occurred in imports earlier the chance of getting a hit on them is pretty high. These numbers are collected in a table in the DW, so they could be checked for validity against that table. If no hit then they are reported as a new contract making it a possible OCR error, but also possibly representing a contract not yet entered in the DW table, so manual inspection is called for in this case, to ensure 100% validity. Of course fuzzy logic could be applied trying to match the customer name, but this has not been tested yet.

To make thing still a little bit more complicated the pages of the invoice are not in the correct order in the PDF-file. It always starts with the last page holding the total, probably because that is the best use for the accounting system, when used to check incoming invoices. We need the pages in the correct order to pick up subheading that was printed on the previous page if a page break occurred before a section was completely written, to guaranty that the correct customer is found.

The page problem.

So we have to deal with the page problem first to ensure all rows appears in the worksheet in the correct order. “Page” in Swedish is “Sida”. We put up a filter in Excel to see where it occurs.

  image image

So here we see that this word i found in column E and F and sometimes even in D. As we looked further we also found it in column L on page 17!  So it is reasonable to scan all occupied columns in a row, being unable to trust the column. We also discovered that sometimes the word to be found was “Side” due to dust problems. So the program has to be adaptable to variations in just finding this keyword to get rows correctly sorted. The idea is to put in two new column to the left putting in a page number and in the second the relative row within that page. When sorting on A and B we will get the correct order of the rows. For backtracking we put in the original row number so we easily could find the correct row when making manual corrections in the original file. We write the program and get:

image

We sort on column A and B and delete these columns to achieve:

image

Having the rows of the sheet in the correct order with the original row number in column A! Now we are ready for the next step.

Be the first to comment - What do you think?
Posted by Rickard - January 23, 2012 at 11:15

Categories: Uncategorized   Tags:

From paper to database–Part I

Summary.

Selecting a PDF and OCR-program for automating scanned paper based information into a data warehouse.

Background.

Sometimes when you build a BI application part of your data is not from your own computer systems. If you are a customer you could often claim that you want the data in a digital form and get it into your database for analysing by file transfers. It could be data about competitors or as in my case it is a customer and you are not in the position of claiming your “rights”. You just have the data printed on paper.

So one of my customers gets about 100 pages of paper each week with detailed information that they have a need to analyse. For many years we have used manual input having it “keypunched” from paper into an Excel worksheet in the traditional way. But since costs have risen and the volume is too small for getting it done overseas, last summer time has come to investigate other methods.

Choosing a tool.

Since we started the application many years ago, today all paper invoices get scanned and stored before entered into the accounting system, so all information is already in a digital form in a PDF-file. The problem is that there are many types of objects that could be hidden in a PDF-file, in our case it is actually an image, with no interpretation of the content.

image

The information we want to use in the database are from different parts of the paper, so even if we get the image there is still some work to do. A good combination of a program capable of reading PDF and OCR-scanning the picture would do most of the work making our project realistic. Some five years ago we tested this approach and then found it to be impractical, but this summer we gave it a new try. So we found after some investigations a Canadian product called AbleToExtract from InvestTech in Toronto. We downloaded the free trial and tested it for our purpose before actually paying the $130 cost of buying it.

clip_image012The tool has several options when scanning and we tested a few, finding that the defaults chosen were good enough for us as a starting point. We have never used OCR-software before so we were newbies in the field. Starting up the program it is a very easy three step procedure: Show the PDF-file, select output format (Excel for us), and then you have the possibility to select a lot of options that we didn’t spend too much time with, since we looked for an automated process. We decided to use “Convert” under Automatic. The program is heavy on the CPU and is one of the few programs capable of using some of the power in my quad-processor:

clip_image014

With this 61 page sample it uses my machine for about 3-4 minutes before it is actually done. When ready I end up in Excel with the converted PDF-document interpreted and put into cells of a worksheet. Not exactly as good as one would hope for, but certainly good enough to be useful for further processing.

image

In this sample there is actually no error in the OCR-interpretation on page 61, but depending on the quality of the scanned images this was an easy one. After testing a hand full of files we found some properties that you have to be prepared for. Firstly on some pages AbleToExtract shifts the columns where it put in your data. Here we see that the quantity has moved from H to I and then on page 13 it is in column G. On some pages we have also found that there are shifts within a page, especially if there are “dust”-problems.

image

clip_image020We have no control of the scanning process of the original documents, since this service is outsourced to an external company, but we have found that some files are much more cumbersome than others. The dust problem which occurs on some pages means that the OCR interpreter finds a lot of dark spots that sometimes get interpreted as: . ’ – , and so on. If these are lonely in a cell they are easy to neglect but sometimes 0 will be found as 8,6 or 9. So don’t expect everything to be correctly interpreted. You will find other types of problems 1 could sometimes be a lowercase l and sometimes be an uppercase i. Zero goes into O.

So after the first interpretation you will need a “washing-program”, which is capable of correcting most of the errors. If it finds some tricky situation it gives up and tells you about the problem so you could easily find it for manual correction. So we found AbleToExtract to be a good enough candidate to give the next step a try…

1 comment - What do you think?
Posted by Rickard - January 23, 2012 at 11:07

Categories: Uncategorized   Tags: