Peter Bengtsson

Standard deviation to eliminate guesses

By: Peter Bengtsson, 9th of October 2007


Yesterday I added a neat feature to SnapExpense, our in-house application (currently in private beta), that I'm quite proud of. Not because it's technically amazing or anything, but because it's useful and uses a mathematical model rather than guesswork.

It extracts a bunch of numbers from a chunk of text (the OCRed text of an image of an expense receipt). Some of these numbers are real money amounts and some are just numbers that also happen to appear in the text. For example, in one text I find "176.0 2.0 12.3" using a set of "complex" regular expressions and filters. Which one of these is the total price that goes with the text? If the text is a receipt from EasyJet (the airline) then any human being will be able to eliminate "2.0" and "12.3" as mistakes. For example, "2.0" was maybe part of a date.
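The post doesn't show its regular expressions, so here is only a minimal sketch of the extraction step in Python. The pattern `NUMBER_RE` and the helper `extract_numbers` are hypothetical names made up for illustration, and the pattern is deliberately simpler than whatever SnapExpense actually uses.

```python
import re

# Hypothetical pattern (an assumption, not the post's real expressions):
# digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every number-like token found in text as a float."""
    return [float(m.group().replace(",", "")) for m in NUMBER_RE.finditer(text)]


# Every number is a candidate at this point, including ones that are
# really fragments of dates or reference codes.
candidates = extract_numbers("Total 176.0 on 2.0 ref 12.3")
print(candidates)
```

A pattern like this happily picks up junk, which is exactly why a later filtering step is needed.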

The challenge was to filter out numbers that aren't likely candidates. If, in a separate procedure, I manage to find a vendor in the text, I can use the vendor to look up which other amounts have been saved in the database for that vendor. For example, for the vendor Amazon.co.uk, most amounts are in the range £15 to £40, never £1,500. See?

The solution was to calculate the Standard Deviation of the previously entered amounts and check whether each number found in the text falls within one Standard Deviation of their average.

Here's an extract of amounts previously spent on the vendor Amazon.co.uk by people at Fry-IT:

     18.7     55.75     32.65    159.73        38     31.52     15.79
    40.21     39.51     49.21     25.83     16.51     19.18     12.01
    70.39     35.08     46.17      9.25     16.61     54.99     17.48
     7.75     64.29     20.45     21.08     18.09     34.64     24.88
    61.53     57.25     18.18     11.41     16.86     45.88     93.99
     73.9      87.2      32.5     64.56     39.73

That gives an average of 39.97 with a Standard Deviation of 29.40. Assuming, roughly, that the data is normally distributed, about 68.2% of the area under the curve lies within one Standard Deviation (+/- 29.40) of the average. So numbers that are smaller than 10.57 or greater than 69.36 can be considered "junk", i.e. out in the flattening flanks of the distribution curve.
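As a rough sketch of that calculation, here it is in Python using the standard library's `statistics` module. The function names `plausible_range` and `filter_candidates` are made up for illustration, and the use of the sample Standard Deviation (dividing by n - 1) is an assumption, chosen because it appears to be the variant that matches the figures quoted above.

```python
import statistics

# The Amazon.co.uk amounts listed in the post.
amounts = [
    18.7, 55.75, 32.65, 159.73, 38, 31.52, 15.79,
    40.21, 39.51, 49.21, 25.83, 16.51, 19.18, 12.01,
    70.39, 35.08, 46.17, 9.25, 16.61, 54.99, 17.48,
    7.75, 64.29, 20.45, 21.08, 18.09, 34.64, 24.88,
    61.53, 57.25, 18.18, 11.41, 16.86, 45.88, 93.99,
    73.9, 87.2, 32.5, 64.56, 39.73,
]


def plausible_range(history, width=1.0):
    """Return (low, high): the mean plus/minus width standard deviations."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)  # sample standard deviation (n - 1)
    return mean - width * spread, mean + width * spread


def filter_candidates(candidates, history, width=1.0):
    """Keep only the candidates that fall inside the plausible range."""
    low, high = plausible_range(history, width)
    return [c for c in candidates if low <= c <= high]


low, high = plausible_range(amounts)
print(round(low, 2), round(high, 2))
print(filter_candidates([176.0, 2.0, 12.3], amounts))
```

Against this Amazon.co.uk history, the three example numbers from earlier would be narrowed down to 12.3 only. The `width` parameter is there so the cut-off can be loosened to two Standard Deviations without changing the code.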

Remember, it's still guesswork but with some quality to it. It's nice to see some of that degree in mathematics come to some use numerically too.

A closing thought

Standard Deviation only really works if the data is normally distributed, i.e. it has that nice bell curve shape with a hump right in the middle. For things like these, another mathematical model that might be useful is to work out which numbers are outliers instead. We'll see. At least I now have the foundation to do more testing and playing.
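The post doesn't say which outlier test it has in mind. One common choice, shown here purely as an assumption, is Tukey's fences: anything more than 1.5 times the interquartile range outside the quartiles counts as an outlier. The names `tukey_fences` and `is_outlier` are hypothetical, and `statistics.quantiles` needs Python 3.8 or later.

```python
import statistics

# A subset of the Amazon.co.uk amounts from the post, for illustration.
sample = [
    18.7, 55.75, 32.65, 159.73, 38, 31.52, 15.79,
    40.21, 39.51, 49.21, 25.83, 16.51, 19.18, 12.01,
]


def tukey_fences(history, k=1.5):
    """Return (low, high) fences: the quartiles widened by k times the IQR."""
    q1, _median, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, history, k=1.5):
    """True if value lies outside the Tukey fences of history."""
    low, high = tukey_fences(history, k)
    return value < low or value > high


print(is_outlier(1500.0, sample))
print(is_outlier(38.0, sample))
```

Because quartiles are less sensitive to a single huge purchase than the average is, this does not rely on the bell shape. One catch: with skewed price data the lower fence can drop below zero, so this variant mostly catches junk that is too large rather than too small.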




Comment

Jan - 13th October 2007
Cool, good that you did it! You could still do second std dev later, or some estimation on whether the distribution of prices is indeed normal. But I'm just going to shut up about statistics... this year ;)
 


