
Hierarchical versus relational database


The title above was my obvious first Google search when preparing for this post. The very first hit was an interesting article from 2005 by Narendra Naidu from Pune, India, explaining the shortcomings of the hierarchical model. I hope to rank above him next time I do that search.

I have seen it happen many times: when things get tricky, you end up compromising by trying to model the world as a hierarchy.

In the mainframe era of the 70s, IMS was the model for structuring the world – in hierarchies. Two decades later, when OLAP ran into sizing problems calculating sums for all possible levels, the hierarchy was again needed as a compromise. And now, 20 years later, the same thing is happening again: when managing "Big Data", this narrow model is called for once more. When you run into performance problems with computers, this is the rescue. "The world is hierarchical" – which is easy to believe if you are naive, and which sometimes might be an acceptable compromise.

Having worked with database design for decades, I would say it is a sacrifice to make. Far too many models fail. The classic example: OK, you have one boss, so that can be modelled with "H". But what about Mr X, who also works in two projects? He has a boss for his ordinary job, but for 25% of his time he has another boss as well. Sales districts are clear-cut too – except for Mr Y, whose nephew owns a shop outside his district, so of course he is responsible for that customer anyhow. But as Mr Naidu explains, a disk and its directories can be modelled perfectly as a hierarchy, and with that I agree totally. You love how your files are organized, don't you?

FAT was a compromise, and we still have to live with it! Do you really know in which folder you stored that email? For most of what you observe in the world, the hierarchical approach means many compromises. OK, "bill of material" is a good fit, but what more? The military might also be an exception, where "the map applies if reality does not match". A son can only have one father, you say – or can he? For most other examples, the compromises you have to accept sooner or later make your database incapable of keeping up.

Don't get trapped! Be aware of the shortcomings when the model is "H"!

The crowd yelling "NoSQL" is loud and vocal.

The reason for using simplified models for Big Data is basically technical (performance and development costs). It is much easier to program for parallelism and loosely coupled processors when the data is a hierarchy and you ignore basics like integrity, usefulness and confidence. Who really needs those?


Your flying companion

The development of hard disks has been just amazing; the price per MB has fallen every year. If you crash one and take a look inside, you will see one or more very smooth brownish platters and a mechanical device with one or more arms. The price of the mechanics and platters has been roughly constant, so you have paid about the same for the box but got more storage and capacity. In my collection of 2.5-inch drives from the last 10 years I have: 0.7, 4, 40, 200 and now 450 GB in the laptop I bought a year ago. For the last five years I have never managed to fill one up before buying a new one. I never clean up anymore, which is a pity… Though I still remember the 10 MB drive (10 MB, 85 ms, 0.5 MB/s) in my first PC XT in 1985. I also remember the good old 3340 (70 MB, 25 ms, 0.9 MB/s) from the seventies, which you could actually look at when it was pulled out of the box, and the 3380 (2.5 GB, 16 ms, 3 MB/s), where the platters sat in a sealed dark enclosure that could not be opened. Comparing mainframe and PC disks in the eighties: $100k/2,500 MB = $40/MB versus $2k/10 MB = $200/MB, so it was still cheaper to store data on a mainframe, and with much better performance.

But matters have changed in 30 years: $100/750 GB ≈ $0.0001/MB, 7,200 rpm and 100 MB/s (though I was unable to find the access time…). The rotation speed has not changed much during this period, and the speed of the arms has not changed dramatically either. BUT the packing of the bits – the areal density – is the explanation behind both the speed of getting data into the computer and the price per MB; plotted over time, areal density needs a logarithmic y-axis! So things have happened.

But what happens inside that dark box? Each morning it is the same procedure: a boring trip, seen from the perspective of the disk head, starting up some version of Windows – how much fun is that? You wake him up and the day starts, exactly the same trip every morning! A new boring(?) day, he thinks – though I am not so sure about that. He flies up from the landing zone, so let us see things from his perspective. OK, flying? Well, there are about 10-20 molecules of air between him and the surface. Flying? He flies at around 3 nm, maybe a little higher, at 120 km/h in my computer (7,200 rpm), or at double that speed in a high-performance drive. That is risk taking in my view! The most dramatic flying I have ever seen was in Budapest on the 1st of May 2004, when Hungary was celebrating its entrance into the EU and a guy flew under the Chain Bridge. I don't know the exact height of that flight – 20 m maybe, at 300 km/h? But that is quite a margin compared to less than 10 nm at 120 km/h! Well, you say, it is all about surface smoothness, isn't it? That was what I thought, at least: it must be OK if the surface is smooth enough. So I asked "How smooth is it?" and googled. It is not an easy question to answer; it is all about "nanodefects", I learned – nothing you could see or feel. Visible light cannot resolve these things, it is a matter of wavelength and resolution; you would need X-ray eyes, so either my disk head has those or he is flying blind like a bat. At that scale a grain of dust would look like Ayers Rock, so he is flying like the guy in Budapest, but blind. Don't worry, just take a look at the mathematical theory behind this; it is no miracle. The laws of physics (hydrodynamic lubrication) tell you what happens down there – nothing to worry about, he just flies over the rock, at least most of the time… as is easily seen in the formulas. So don't worry, you will probably do fine if you take your backup tomorrow.
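A back-of-the-envelope check of that 120 km/h figure, assuming the head sits near the outer track of a 3.5-inch platter (radius roughly 4.5 cm): v = 2πr × rpm/60 = 2π × 0.045 m × 7,200/60 ≈ 34 m/s ≈ 120 km/h. The same arithmetic at 15,000 rpm gives roughly 250 km/h, which is why the high-performance drives fly about twice as fast.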

[picture: a close-up of your "flying companion", the disk head]

Here is a list of some of the links that inspired me:

http://www.tribology-abc.com/abc/headdisk.htm

http://en.wikipedia.org/wiki/Perpendicular_recording

http://www.pctechguide.com/hard-disks/hard-disk-hard-drive-construction

http://www.hdd-tool.com/hdd-basic/hard-disk-and-hard-drive-physical-components.htm

http://www.azonano.com/article.aspx?ArticleID=2696

http://wiki.answers.com/Q/How_does_the_size_of_a_nitrogen_molecule_compare_to_that_of_oxygen

http://domino.research.ibm.com/comm/wwwr_thinkresearch.nsf/pages/storage297.html

http://www.kepcil.com/kepcilin/harddisk/hdplattr.htm

https://minty2.stanford.edu/Papers/Lev03a.pdf

http://www.ipn.mx/MX/Revistas/Polytec/PDF/ISSUES/Polytec%20InFocus%202008-2.pdf

http://en.wikipedia.org/wiki/Head_crash

http://datacent.com/hard_drive_sounds.php

http://www.datarecovery.com.sg/data_recovery/hard_disk_head_crash.htm


The melting cube

The cube has never been a stable configuration in nature – look at ice cubes or sodium chloride crystals. Sooner or later they melt down to nothing, and if you mix them they melt even faster.


Is the same thing happening with the OLAP-cube in BI?

The OLAP cube was an invention of the seventies, but not really used until the mid-nineties. Management's demand for summary information with short response times left, at that time, really no other solution. You had to foresee the needs and have most sums calculated during the night shift or at weekends, when the load on the server was moderate. With the cube came a new programming language, MDX – a rather tough piece to grasp, still understood by few compared to SQL. Do we need MDX any more? Most BI problems can be solved with either.

The dramatic price reduction of memory, the increasing speed of hard disks (not to mention SSDs) and the falling price of computing power (though through parallelism these days) have in recent years changed the picture dramatically. We now talk about "Big Data" and about scanning through massive amounts of data, calculating what most BI applications need on demand, with sub-second response times.

So is there a need to calculate in advance? In many common scenarios where the cube is the standard today, it is no longer needed. This is also the way the newer BI tools work: QlikView, PowerPivot and ordinary pivot tables in Excel 2007/2010.

As Donald Farmer stated at a seminar a month ago, when the remaining 72% want their data for decision support, they will not pay the price of the expensive, old-fashioned cube-based solutions. They will buy a multicore server with some SSDs and a few dozen GB of storage (around $10k), load a few million rows of raw data into the data warehouse, build a handful of dimension tables, and be done with the implementation. While the user clicks around in the tool, whether it is QlikView or Excel, the calculations are done on demand in sub-seconds.

In some posts to follow I will try to give more insight into what I call "Not so BIG Data".

Inspiration: Areal density, SSD for DW, Memory 1957-2012, OLAP Codd, Fast Track DW.



Donald Farmer in Malmö May 11


 

With great expectations I got into my car and drove to Stratiteq for the seminar they had arranged (and to which they also invited competitors like me!) with the guru Donald Farmer, product advocate at QlikTech, formerly of Microsoft/SQL Server. It was a lunch meeting with delicious soup served by Pernilla.
The title of the presentation was "Social Business Social Discovery". Donald was soon introduced by Johan Lindberg of Stratiteq and entered the stage.
A true guru with a Scottish accent, living in Seattle, age unknown – but since he said he had 25 years of experience of BI, from when the term was still "decision support", he has to be 50+. In other words, a young man in my eyes today. He referenced one of those Gartner figures: 28% of knowledge workers use BI tools! In other words, the remaining 72% are a market of potential customers. The figure was 25% ten years ago, so what are you guys doing? In Donald's professional view, he has to work on the tools to make them behave in a way that these 72% will accept. How to make that happen?

The first word was agile – agile like a circus, making everything happen in 60 hours. Then it was about "cutting up the elephant", i.e. making BI projects small and reaching the goal step by step with fast business value. We share some common values, I must say.

Second came discovery, which is also about being prepared to discover. Donald meant that by making the tools "browsable", more like Google when you look for information, the 72% could be reached – the same method birds use when looking for berries. It also means using the human brain to find patterns; computers are still not especially good at that. The social part, i.e. sharing discoveries, is not just a library! People share, and the tools have to adapt. Which QlikView does….

The final part, about the "next release", was vague, which to me means it will not happen tomorrow.
Next release will be:

  1. Genius and compelling: "like Heidi Lehmann, the famous Czech film star from the 1940s". A sophisticated joke? She is not to be found on Google – and if Google does not know you, you do not exist. Was he kidding us about Cook too?
  2. Compulsive collaboration (?) Okay.
  3. Mobility and agility: made primarily for mobile devices, with an adapted version for the PC.
  4. A platform for developers, i.e. a full API for partners.
  5. Simpler to handle, i.e. installation and maintenance, making it the IT department's friend.

 

After the summer it will become more specific! A sophisticated show by a great guy!


Deep dive–SQL Server 2012

[photos: Tobias, Simon and Mikael]

A lot of buzz – a new release of SQL Server, four years of intense development… I was very curious about what had happened during this long period. Would there be anything that could change my life? So I went to this seminar with high expectations, early in the morning of April 4th in Malmö.

It was a promising agenda, with a Swedish guy direct from Redmond – Tobias Ternström, working there as a program manager – warmed up by Simon Lidberg and Mikael Colliander from Microsoft in Sweden.

First was an overview by Simon presenting the three main pillars: "Mission Critical Confidence", "Breakthrough Insight" and "Cloud On Your Terms". What is he talking about? The intro ended with a demo that was supposed to show a new feature called "AlwaysOn". Unfortunately it remained a vision, since Mikael failed to get the failover server to start… So the start was a bit short of 100% (just kidding).

Tobias, arriving a little late, started to talk about the T-SQL news now available in SQL Server 2012: functions that have been in PostgreSQL for some time, finally making it possible to do something useful (windowing) with the OVER clause that arrived in 2005 and has been dormant for seven years. Of course it is not yet fully implemented, but it is a good start! There are still limitations, with time intervals not being a native SQL Server data type, yet… And NEW: SEQUENCEs, also found in PostgreSQL for quite a while. There are some nice new functions for easing data cleansing in T-SQL, TRY_CONVERT for example. Yet the most trivial data cleansing function of them all, TRIM, is still not implemented! I don't know of any other database engine lacking this function; "they didn't have the time" was the excuse? It would not take me more than a few minutes to add that function, but we have to wait at least four more years. Strange – do they have a party going on in Redmond, or what are these guys doing?
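To make it concrete, here is a minimal T-SQL sketch of the kind of thing 2012 now allows. This is my own illustration against an assumed dbo.Sales table, not an example from the seminar:

-- A sequence object (new in 2012):
CREATE SEQUENCE dbo.InvoiceNo AS int START WITH 1 INCREMENT BY 1;
SELECT NEXT VALUE FOR dbo.InvoiceNo AS NextInvoiceNo;

-- TRY_CONVERT and the extended OVER clause (a running total):
SELECT  InvoiceDate,
        TRY_CONVERT(decimal(18,2), Amount) AS CleanAmount,   -- NULL instead of an error on bad input
        SUM(TRY_CONVERT(decimal(18,2), Amount))
            OVER (ORDER BY InvoiceDate
                  ROWS UNBOUNDED PRECEDING) AS RunningTotal  -- windowing finally made useful
FROM dbo.Sales;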

Mikael followed up with the new possibilities to interact with the file system from within SQL, using SQL syntax (functionality the FSO object in scrrun.dll has offered since the end of the 90s). The strange limitation, though, is that the folder structure has to be created from within SQL Server – I don't know of any messy shared drive that was created that way. It is there the search logic is needed, not in some fraction of it imported into SQL Server – or should everything be moved over to SQL Server? If a user wants to create a new subfolder, does he have to contact the DBA? I can't really see how they thought that would ever happen. There must be something I missed here – it was just before the very nice lunch was served, so maybe I was asleep?
The lunch was very nice and tasty, by the way!

Simon had the "after lunch" slot, giving us insight into the need for order in the database – what was called a data dictionary in the 80s and is now renamed "EIM in SQL Server 2012". A nice feature, but only available in the Enterprise edition, so of no interest in my business. I am still not sure that centrally stored, well-ordered data is more than a dream – is it possible in reality? I doubt it.

Next on stage was Tobias again, now talking about SQL Azure: $5/month for a 100 MB SQL Server database, without having to ask the IT department for help. For me this is a dream come true! My brain immediately boiled up at least ten new business ideas based on this concept. This pitch was the value of the day for me, and I will cover my discoveries in posts to follow. This is good, very good news! AND in Azure you only wait months for new features, so hopefully TRIM appears there before I die.

Mikael was next on stage, talking about performance in the big arena – not some mundane performance issue with a user complaining about response times. It was about bulk loading times (the data warehouses I know of load in 15-20 minutes without partitioning) and about columnstore indexes, both Enterprise-only features, i.e. not my business.

The show ended with questions to the panel. I asked about getting full use of the 24 cores in my server, i.e. making the optimizer understand that it should use parallel plans instead of running on one core. I was told it was my fault that the engine handled things wrong – parallelism is "black magic" covered in "white papers" nobody understands. And have you updated statistics? Isn't that exactly what SQL Server is supposed to understand and do by itself when needed? In other words, they didn't have a clue. I was advised to google some names: Paul White, Conner (Angry Irish) and Joe Chang. When I have read and understood, I will probably tell you about it here… (half a promise – black magic).
If I fail, I will probably convert to QlikView like Donald Farmer, where the data is always stored in the server for me without my having to say how, so it could not be my fault.

So, to sum up: some useful stuff, and hopefully a day worth the cost. At least I had a nice day, with nice company and some really enthusiastic choir leaders. Thanks! /Rickard


Paper to database – third phase

When using OCR-scanned data as input, you run into dust problems, errors in the OCR interpretation and, as with AbleToExtract, some alignment problems regarding which column the data you are looking for ends up in. Dust should be removed, common errors in the figures can be handled fairly easily, and floating columns create the need for an anchor.

So now we have picked up all the rows in the correct order, though with the values in somewhat random columns – a good start. But what about the dust? It does occur, so why not remove it when it sits on its own in a cell?


The sub "TaBortAlla" (Swedish for "RemoveThemAll") does the actual removal.
Removing the dust from otherwise empty cells makes it easier to recognize them as empty.
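The original code was shown as screenshots; as a rough reconstruction of the idea (not the original code – the dust character list and names are just illustrations), the sub could look something like this:

Sub TaBortAlla(ws As Worksheet)
    ' Sketch: clear cells that contain nothing but "dust" characters
    ' such as . , ' - left behind by the OCR interpretation.
    Dim c As Range, s As String, i As Long, keep As Boolean
    Const DUST As String = ".,'-_: "
    For Each c In ws.UsedRange
        If Not IsEmpty(c.Value) Then
            s = CStr(c.Value)
            keep = False
            For i = 1 To Len(s)                  ' keep the cell if any non-dust character exists
                If InStr(DUST, Mid(s, i, 1)) = 0 Then keep = True
            Next i
            If Not keep Then c.ClearContents     ' the cell held only dust: empty it
        End If
    Next c
End Sub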
One way of navigating a row is to count columns from the left. But sometimes new columns were inserted at a page shift, so we went for a better approach. What we found was that there were seldom column shifts in the middle of the row, where we had the article numbers – neighbours tend to stay together. What we needed was an anchor point in each row, a fixed place to orient around and to identify the row as valid input. In our case we were lucky: all article numbers start with "AXX" in one of the columns, so we believed that would make a good anchor. Some columns are to the left of it and some to the right. So we scanned for "AXX", which in most cases was successful, but it also appeared as "AXXC", "AX0C", "AX0(", "A)(X", "A)CX"… OCR is not easy! So far we have found the following code useful for solving the anchor point problem; it is a bit of fuzzy logic, but it works:

' j is the column currently being examined; x holds the text of the cell in that column
If x = "AXXC" Or x = "AX0C" Or x = "AX0(" Or x = "A)(X" Or x = "A)CX" Or x = "44XX" Or x = "4.XX" Or x = "A)0(" Or x = "A)O(" Then
    iColAXX = j
    iWidthAXX = 4
Else
    x = Left(x, 3)
    If x = "AXX" Or x = "AX0" Or x = "A)(" Or x = "A)C" Or x = "4XX" Or x = "44X" Or x = "4.X" Or x = "A)0" Or x = "AYX" Then
        iColAXX = j
        iWidthAXX = 3
    End If
End If

We also added some lines of code to catch new variants of "AXX": rows with many populated columns where the code above finds no anchor are logged, so that new alternatives can be picked up as they appear (a sketch of such a fallback follows below). With this in place we catch them all and have our anchor on each row.

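A hedged sketch of that logging fallback (my reconstruction, not the original screenshots; ws, wsLog, lLogRow and the row counter i are assumed names, and iColAXX is assumed to be reset to 0 for each row):

' If no anchor was found on a row that still has many populated columns, log it so
' new "AXX" variants can be added to the lists above.
If iColAXX = 0 Then
    If Application.WorksheetFunction.CountA(ws.Rows(i)) >= 6 Then
        wsLog.Cells(lLogRow, 1).Value = i                      ' original row number
        wsLog.Cells(lLogRow, 2).Value = "No AXX anchor found"  ' message for manual review
        lLogRow = lLogRow + 1
    End If
End If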

Sometimes our anchor is in F, sometimes in G and sometimes in … But in the column to the right of it we assume the quantity will be found, then the amount per unit, etc., and to the left we assume the wholesaler's article number, and so on. We seldom found that a column had changed its position relative to the anchor column, which would have made the logic a little trickier. All the data we were looking for on each row was supposed to be numeric.
But what is numeric when working with OCR-scanned data? There are some common problems with numbers in OCR, so when we are almost certain that a column should contain only figures, we run the value through a small washing routine.

It does three types of replacement: first dust among the figures, then common misreads such as "I" or "l" for 1, and finally the decimal point, which may arrive as ".", ";" or ","; since we use "," in Swedish, that is what we want. If the value still is not numeric, the content is written to the log for a human eye to investigate.
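The original routine was shown as a screenshot; a minimal sketch of the idea (my reconstruction – the name and character lists are illustrative, not the originals) could look like this:

Function TvattaTal(ByVal s As String) As Variant
    ' Sketch of a "washing" function for a value that should be numeric:
    ' 1) strip dust, 2) fix common OCR misreads, 3) normalize the decimal separator.
    s = Replace(Replace(Replace(s, "'", ""), "`", ""), " ", "")    ' dust
    s = Replace(Replace(Replace(s, "I", "1"), "l", "1"), "O", "0") ' common misreads
    s = Replace(Replace(s, ".", ","), ";", ",")                    ' Swedish decimal comma
    If IsNumeric(s) Then
        TvattaTal = CDbl(s)           ' clean value
    Else
        TvattaTal = CVErr(xlErrValue) ' still not numeric: flag for manual inspection
    End If
End Function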
For each valid row we find (i.e. one with an identified anchor column), we produce an output row that will be used as input in the next step. If we suspect errors that the code could not handle, we write a message to a sheet we use for logging:
[screenshot: the log sheet for this sample]
For each error found, the row where the problem occurred is recorded. Three problems are reported for this sample: two in the first phase and one in the second, which we describe in the next part of this story. Each error also carries a severity, indicating whether it is just informative or critical.


Excel Conference in London 25/1 2012

So I went there without too much thought about why. Ryanair had a marketing campaign on the route, which made my travel expenses close to nil. As a knowledge partner for my customers I often need input from smart guys, so I thought I would give these UK guys a try. Simon and the others I met at the pub the evening before impressed me. Many of them were experienced VBA guys who have gone through the same process I have been through, with similar but in some respects different experiences.

Late, but not too late, I arrived at the conference. The opening talk was not my favourite subject: I spent a lot of time trying to figure out the importance of being precise about how and why you solve problems with Excel, and why that matters. Suddenly I realised that this is probably a result of the financial crisis we have been through… Sorry, but it did not hit me that hard, so that might be why the importance was lost on me.

Next on stage was Bob Phillips, who told the story of PowerPivot, the new invention from Microsoft in Excel 2010. To be really honest I was a bit disappointed about the speed; Bob told us he was running it in a VPC, but anyhow, why a VPC? I think that due to all the interruptions he may not have got to my point. I don't believe ordinary end users are capable of doing my job, and neither did Bob. The important point about PowerPivot – the real selling point – is setting up business-oriented views and combining them with tables you have in your own spreadsheet; at least that is a feature I think my customers will like.

Next on stage was Mike Staunton with ExcelDNA. I had never heard of it, so to me it was very interesting, though I am still disappointed about the ".NET thing" Microsoft gave us. I have been disappointed with Microsoft for a long time, and suddenly these guys give me hope. So after this presentation I might give my 20k codebase a new try – not tomorrow, but in 2012, maybe? He also talked a lot about FFTs and how to make them run faster, which is maybe not what I spend most of my time on. A nice guy with a lot of enthusiasm, but maybe a little too academic for me.

We went to "The Slaughtered Lamb" and had a nice lunch sponsored by F1F9. Thanks, Morten.

Since we arrived back late after lunch, Simon, being a good host, gave up his own presentation on project management. Instead he introduced Mathias Brandewinder, a guy with an accent in English that – English not being my mother tongue – gave me some trouble. He told his story about going from VBA to VSTO… I have been through a lot of pain on this issue, which made me an uncomfortable audience for him; sorry for interrupting so many times. Like many before him, he tried to convince my brain to make the shift, but sorry to say, he did not manage. I still don't know how to convince my customers to make the effort of a transfer to the "promised land"… Please, please give me the solution I have been searching for, for a long time.

Next on was Charles Williams, telling us the story of converting your brain from VBA to C++!
There are probably quite a few feelings involved in such a shift of the brain. He gave a very good presentation on what it is all about. The dramatic moment came when he got hidden compiler errors, but everything was under control – it just gave us some excitement. Very subtle. I probably won't go down that route, but it was a very good presentation.

Morten from Denmark, living in Bath and running F1F9 (Help+Calculate), was next on stage, with another story about making spreadsheets understandable and auditable. I liked his presentation, though it was a little outside my main interest.

Simon was then given his chance, and he told an amazing story about bridges that I will never forget. After that we ended up in a pub where all the world's problems were solved.

A great experience, and thanks to everyone I got to know.

Some other views from the conference:


ERP-data into Excel from Movex

Summary.

The demand for up-to-date information is increasing, and technical development allows more applications to be profitable through direct extraction of data from your ERP system.

Background.

Extracting data directly from the ERP system is today, in many cases, a realistic approach. Some years ago it was a dream: often you needed to extract the data into a data warehouse and then transfer it into a cube. In many cases these intermediate steps made the fast implementation that was needed unrealistic and expensive, killing the idea behind the application. Most BI vendors have understood this – take Microsoft's announcement of PowerPivot for Excel 2010, which Microsoft claims will make it possible for advanced Excel users to extract data directly from their ERP system, making people like me unemployed. Having been in this arena for some decades I am not too worried, but the elimination of unnecessary steps is good, very good. The real problem behind this is still unsolved and will probably remain so for a long time. When I first heard about the collaboration between SAP and Microsoft, I was naïve enough to believe they would fix it together. Still, after many years, we have seen nothing. What the user needs when interacting with his ERP is a view of the data he is interested in, not a collection of thousands of tables with mysterious rules hidden in relations and triggers. To build those views you need to understand a lot of what goes on behind the scenes, so in the end it is all bark and no bite.

Trivial example.

Let us assume you are a product specialist in a company, responsible for introducing new products to the market. When introducing new products you need to follow sales, often daily in the beginning, to make the right go/no-go decision. The first problem is to define "what is a new product?". In this case it is easily solved by collecting a list of product numbers, probably in Excel. The next step is to answer "where are our sales stored?". Asking the IT department, our specialist is directed to the data warehouse (if he is lucky), which is updated once a week. This will kill his application, because when he explains his needs they will tell him that summarizing data daily will cost a fortune.

Solving this problem has technically been possible for decades, but in the past it cost too much in computer resources to be profitable. So let us go ahead and solve it, assuming your ERP system is Movex (Lawson/M3). At my customers the database is DB2 on the AS/400 and has been accessible through ODBC for a long time, so that is really no news. Daily I meet people struggling with MS Query, digging into the Movex tables using a vocabulary they call "Moviska". The real challenge is finding the right tables and how to join them; then you have to find the columns to use, which can be solved by recognizing figures from your daily work while browsing the details. We try Google for this, enter "Movex sales stats", and in the second document we find the keyword "OSBSTD". This is actually the table where the sales statistics are stored!

Connecting to DB2 on the AS/400 with ODBC is standard technique: you can use MS Query, an MS Access pass-through query or linked tables. I prefer linked tables in Access, which make it very convenient to construct the SQL you need, and it is also very simple to join in information of your own, so I set up a table ProdList containing my list of products of interest. Then I start the query designer:

[screenshots: the Access query designer with the linked OSBSTD table joined to ProdList]

As you can see, you need to know the names of the columns in one way or another (Moviska), but after that it is simple and straightforward. So we move the data to Excel, either by copy/paste or with a query in Excel, put it into a suitable graph, and after some clicks we have:

[chart: daily sales and revenue for the selected products]

So we are done with the prototype and want to put it into production. The problem is that the query takes 260 seconds and a lot of resources to execute, which makes the IT department worried. With a minor change to the SQL the execution time drops to 60 seconds. Making it a pass-through query, and making it use a suitable index, cuts the execution time by a factor of 20, down to 3 seconds – for the same query (a sketch of the pass-through setup follows after the SQL listing below). Why?

The SQL used if you would like to copy:
SELECT UCIVDT AS Invdate, Sum(UCSAAM) AS Sales, Sum(UCSAAM-UCUCOS) AS Revenue
FROM OSBSTD
WHERE UCITNO In ('121881','121891','121901','121911','121921','121931')
AND UCCONO = 1
GROUP BY UCIVDT
ORDER BY UCIVDT
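For illustration, here is a hedged sketch of how such a pass-through query could be set up from Access VBA with DAO, so the SQL above is executed entirely on the AS/400 instead of being processed by Access. The DSN and query names are assumptions, not from the original setup:

Sub CreatePassThrough()
    ' Sketch: create a pass-through query so DB2 on the AS/400 does all the work.
    Dim db As DAO.Database, qdf As DAO.QueryDef
    Set db = CurrentDb
    Set qdf = db.CreateQueryDef("qrySalesPassThru")
    qdf.Connect = "ODBC;DSN=MyAS400;"          ' assumed ODBC data source name
    qdf.SQL = "SELECT UCIVDT, SUM(UCSAAM), SUM(UCSAAM-UCUCOS) " & _
              "FROM OSBSTD " & _
              "WHERE UCITNO IN ('121881','121891','121901','121911','121921','121931') " & _
              "AND UCCONO = 1 GROUP BY UCIVDT ORDER BY UCIVDT"
    qdf.ReturnsRecords = True                  ' the query returns a result set
End Sub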

Knowing the methods and where and when to use them, that is what craftsmanship is all about in IT.


From paper to database–Part II

Summary.

When using AbleToExtract to extract information from the pictures in a PDF file into an Excel worksheet, we ran into some problems. We investigated them and ended up finding practical solutions for all of them.

Background.

This summer we were asked to automate the process of getting paper-based information into a data warehouse, using scanned pages stored in a PDF file. An earlier post tells the story of selecting a tool for the first part of the job: we found a tool called AbleToExtract that takes a PDF file as input and produces an Excel workbook with the OCR-interpreted data placed into reasonable cells.

Our next step is to take this file and translate it into a worksheet for loading the data into the tables in the warehouse.

Looking for useful connections in data.

In the earlier manual process there had always been a difference of up to 0.5-1% between the total sum of the invoice and the sum of the transactions read into the database. With that method, getting it any lower was not worth the price, so this error was accepted – but it has not helped build confidence in the data. Now we set the goal much lower: at most 0.1%, preferably none.

The data in this application is discounts offered to a specific customer by a wholesaler but paid for by the manufacturer. On a weekly or monthly basis these discounts are invoiced to the manufacturer by the wholesaler. This type of transaction is very common when the public sector buys on the Swedish food market, for example for school meals.

Let us look first at the detail rows:

[scanned detail row, as interpreted into Excel]

To the left is the name of the product sold, which is not needed since the products are stored in a product table in the DW. The OCR program is English, so it does not handle the three special Swedish characters ÅÄÖ. The next part is "ST", which is the unit used for the quantities. The next box contains the article numbers, one for the wholesaler and one for the producer (prefix AXX), and the last four numbers are the quantity, the discount per unit (repeated twice) and the product of quantity and discount, giving a total for each row. So there is a lot of redundant information in each row that can be used to find, and in many cases eliminate, errors in the OCR interpretation.
Start with the product codes, which should be numeric once the prefix "AXX" is ignored. Both numbers are stored in a DW table, so they are checked against it. If neither number hits in the table, an error message is produced that has to be handled manually; if only one of them hits, it is reported as a possible error; if both hit, no error is assumed.
For the amount fields: if the two numbers representing discount per unit are equal and numeric (the decimal point in Sweden is ","), they are believed to be correct. The next test is whether the product is correct (quantity * discount/unit = row total); if not, the quantity is checked for being an integer, in which case the quantity is believed to be correct and the product is recalculated. An error message is produced for every row where the multiplication does not add up, but with these extra assumptions the rate of manual corrections is minimized.
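As a hedged sketch of that cross-check (my illustration with assumed variable names, not the production code):

' Use the redundancy in a detail row to validate the OCR result.
' dDisc1, dDisc2 = discount per unit (printed twice), dQty = quantity, dTotal = row total.
If dDisc1 = dDisc2 And Abs(dQty * dDisc1 - dTotal) < 0.005 Then
    ' Everything agrees: accept the row as correct.
ElseIf dDisc1 = dDisc2 And dQty = Int(dQty) Then
    dTotal = dQty * dDisc1        ' trust quantity and unit discount, recalculate the total
    bPossibleError = True         ' but still flag the row for the log
Else
    bError = True                 ' inconsistent row: must be corrected manually
End If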

Let us have a look at the subtotal headers that occur before each section of new detail lines:

[subtotal header row, as interpreted into Excel]

The first number is a customer number in the wholesaler's ERP, followed by the name of the customer. Since a customer can have more than one contract, there is also a contract number to pick up. Both of these numbers have usually occurred in earlier imports, so the chance of getting a hit on them is pretty high. They are collected in a table in the DW and can be checked for validity against that table. If there is no hit, they are reported as a new contract – possibly an OCR error, but possibly also a contract not yet entered in the DW table – so manual inspection is called for in this case, to ensure 100% validity. Of course, fuzzy logic could be applied to try to match the customer name, but this has not been tested yet.

To make things a little more complicated, the pages of the invoice are not in the correct order in the PDF file. It always starts with the last page, the one holding the total, probably because that suits the accounting system best when checking incoming invoices. We need the pages in the correct order so that we can pick up a subheading printed on the previous page when a page break occurred before a section was completely written, and thus guarantee that the correct customer is found.

The page problem.

So we have to deal with the page problem first, to ensure that all rows appear in the worksheet in the correct order. "Page" in Swedish is "Sida". We set up a filter in Excel to see where it occurs.

[screenshots: Excel autofilter showing where "Sida" occurs]

Here we see that the word is found in columns E and F, and sometimes even in D. As we looked further we also found it in column L on page 17! So it is reasonable to scan all occupied columns in a row, since the column cannot be trusted. We also discovered that sometimes the word to find was "Side", due to dust problems, so the program has to be tolerant of variations just to find this keyword and get the rows correctly sorted. The idea is to insert new columns to the left, putting the page number in the first and the relative row within that page in the second; sorting on A and B then gives the correct order of the rows. For backtracking we also keep the original row number, so we can easily find the right row when making manual corrections in the original file. We write the program (a sketch follows below) and get:

[screenshot: the sheet with page number and relative row number added in columns A and B]
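A minimal sketch of such a numbering pass (my reconstruction, assuming the keyword variants and helper columns described above – not the original program):

Sub NumberPages(ws As Worksheet)
    ' Sketch: assumes three empty helper columns have been inserted at A-C and the data
    ' starts at row 1. Writes page number (A), row-within-page (B) and the original row
    ' number (C), scanning every occupied cell for the page keyword ("Sida"/"Side").
    Dim r As Long, c As Long, lastRow As Long, lastCol As Long
    Dim pageNo As Long, rowInPage As Long
    lastRow = ws.UsedRange.Rows.Count
    For r = 1 To lastRow
        lastCol = ws.Cells(r, ws.Columns.Count).End(xlToLeft).Column
        For c = 4 To lastCol
            If InStr(CStr(ws.Cells(r, c).Value), "Sid") > 0 Then  ' catches both "Sida" and "Side"
                pageNo = pageNo + 1                               ' a new page starts on this row
                rowInPage = 0
                Exit For
            End If
        Next c
        rowInPage = rowInPage + 1
        ws.Cells(r, 1).Value = pageNo      ' page number
        ws.Cells(r, 2).Value = rowInPage   ' relative row within the page
        ws.Cells(r, 3).Value = r           ' original row number, for backtracking
    Next r
End Sub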

We sort on columns A and B and then delete them to achieve:

[screenshot: the sheet sorted into page order, with the original row number in column A]

The rows of the sheet are now in the correct order, with the original row number in column A. Now we are ready for the next step.


From paper to database–Part I

Summary.

Selecting a PDF/OCR program for automating the flow of scanned, paper-based information into a data warehouse.

Background.

Sometimes when you build a BI application, part of the data does not come from your own computer systems. If you are the customer you can often demand the data in digital form and get it into your database for analysis via file transfers. But it could be data about competitors or, as in my case, a situation where you are the customer yet not in a position to claim your "rights": you just have the data printed on paper.

So one of my customers gets about 100 pages of paper each week with detailed information they need to analyse. For many years we used manual input, having it "keypunched" from paper into an Excel worksheet in the traditional way. But since costs have risen and the volume is too small to have the work done overseas, last summer the time had come to investigate other methods.

Choosing a tool.

Since we started the application many years ago, things have changed: today all paper invoices are scanned and stored before being entered into the accounting system, so all the information already exists in digital form in a PDF file. The problem is that many types of objects can hide inside a PDF file – in our case it is actually an image, with no interpretation of the content.


The information we want to use in the database comes from different parts of the paper, so even once we have the image there is still work to do. A program combining PDF reading with OCR of the picture would do most of the work and make our project realistic. Some five years ago we tested this approach and found it impractical, but this summer we gave it a new try. After some investigation we found a Canadian product called AbleToExtract from InvestTech in Toronto. We downloaded the free trial and tested it for our purpose before actually paying the $130 to buy it.

The tool has several options for the scanning, and we tested a few, finding that the defaults were good enough for us as a starting point. We had never used OCR software before, so we were newbies in the field. Starting up the program, it is a very easy three-step procedure: point to the PDF file, select the output format (Excel for us), and then choose among a lot of options that we didn't spend much time on, since we were looking for an automated process. We decided to use "Convert" under Automatic. The program is heavy on the CPU and is one of the few programs capable of using some of the power of my quad-core processor:

[screenshot: CPU usage while converting]

With this 61-page sample it occupies my machine for about 3-4 minutes before it is done. When ready, I end up in Excel with the converted PDF document interpreted and placed into the cells of a worksheet – not exactly as good as one would hope for, but certainly good enough to be useful for further processing.

[screenshot: page 61 of the converted document in Excel]

In this sample there is actually no error in the OCR interpretation of page 61, but the quality of the scanned images varies and this was an easy one. After testing a handful of files we found some behaviours you have to be prepared for. Firstly, on some pages AbleToExtract shifts the columns into which it puts your data: here we see that the quantity has moved from H to I, and on page 13 it is in column G. On some pages we have also found shifts within a page, especially when there are "dust" problems.


We have no control over the scanning of the original documents, since that service is outsourced to an external company, but we have found that some files are much more cumbersome than others. The dust problem that occurs on some pages means that the OCR interpreter finds a lot of dark spots, which sometimes get interpreted as ".", "'", "-", "," and so on. If these sit alone in a cell they are easy to ignore, but sometimes a 0 will be read as 8, 6 or 9. So don't expect everything to be correctly interpreted. You will find other types of problems too: a 1 can sometimes come out as a lowercase "l" and sometimes as an uppercase "I", and zero turns into "O".

So after the first interpretation you will need a "washing program" capable of correcting most of the errors. When it finds a tricky situation it gives up and tells you about the problem, so you can easily find it and correct it manually. All in all, we found AbleToExtract to be a good enough candidate to give the next step a try…

