Community mailing list archives

Re: [Odoo] accounting roadmap v9: a survey and a doodle

David Arnold
- 07/29/2014 15:35:48
Playing arround a little bit with SimpleIndex OCR and reading a bit about tesseract, a proposed (as on the feedback form) Invoice OCR/Scanning could have the following components:

  1. Create an Invoice
  2. Set a partner first (to get trained invoice templates loaded, can be more than one per partner when we think of receipts of the same partner in his multiple stores)
  3. Upload a PDF (in the right quality)
  4. Load this PDF into a pop-up window and set a template (=tesseract training file), 
  5. On the right there are freetext fields of the eligible fields (reference, date, total, subtotal, linees[qty,name,total])
  6. On the right of those fields is a read only field, where string operations and dictionary mathich is already applied
  • dictionary matching can be quite useful for date strings  and some date-time or number formatting dictionaries maybe could be preconfigured
  • The FreeText fields are filled by an OCR-Zone scan of an invoice area delimited by a mouse click-drag movement
  • The user corrects the filled field to the exact value found on the invoice (incl. points, # ect)
  • The user configures the transformation rules for that field (Upper-Lower-CamelCase, StripSpaces, StripCaracters($), ReplaceCaracters(.->,), maybe regex-operations)
    • StringReplace could be used in a structurate manner to match invoice strings aggainst internal product codes. (I mean: Matching the manually corrected string onto a drop down of internal product lists)
  • At saving the values, a method loads the tesseract training and traines the zones with the corrected text of the boxes (thats why identical representation would be important) on the specific template (remember there could be probably more than one for a partner, if partner has different invoice styles)
  • After the 5th invoice or so, accuracy will tend towards 99,5%, 

  • There can be improved variatons:
    • Checkbox-Option to make a field a fixed zone of the template, so for this field there would be no need to set the zone manually any more. This is usefull if you use at time of scanning fixed physical templates and rules about orientation of the invoices
    • Make A python preprocessor in combination with printable scan templates, so that you can print meta information (generated by odoo) on a blank sheet, put the invoice onto this sheet (wich has positioning marks, ect.) and scan it into a general inbox. from there it could be automaticall routed (by automatically extracting the meta information) to a specific application document inbox (eg. invoices, contracts, curriculum vitae, bank statments, etc.). In those document inboxes one should be able to create e.g. invoices out of a scanned document in an integrated workflow ("create invoice from this document")
    • Make a field to assign tesseract standard language dictionary XOR user dicitionres to the specific template. Like this the accouracy can be greatly improved already from within the tesseract algorithm (training files). An example could be, if you have some products that you always buy from the same supplier, you can register the product string and it will match if a specifically defiend confidentiality is reached. (this could also be the first step towards a tag-based recognition of date-time, reference, and probably also invoice line values...
    • There is little use for line scanning as this is very complex and IMO quite likely not to work correctly.
    • However, Invoice generation from scan extraction would probably be a great help for accountants.

    It would be intersting to hear your opinion on this suggestion. Especially about it's general usefulness?