Community mailing list archives
Re: [Odoo] accounting roadmap v9: a survey and a doodle
After playing around a little with SimpleIndex OCR and reading a bit about Tesseract, I think the proposed (as on the feedback form) invoice OCR/scanning feature could have the following components:
- Create an Invoice
- Set a partner first (so that the trained invoice templates get loaded; there can be more than one per partner, e.g. for receipts from the same partner's multiple stores)
- Upload a PDF (in sufficient scan quality)
- Load this PDF into a pop-up window and set a template (= Tesseract training file)
- On the right there are free-text fields for the eligible fields (reference, date, total, subtotal, lines [qty, name, total])
- To the right of those fields is a read-only field showing the value after string operations and dictionary matching have been applied
- Dictionary matching can be quite useful for date strings, and some date-time or number-formatting dictionaries could perhaps be preconfigured
- StringReplace could be used in a structured manner to match invoice strings against internal product codes (I mean: matching the manually corrected string onto a drop-down of internal product lists)
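To make the read-only "cleaned" field a bit more concrete, here is a minimal Python sketch of the string operations and dictionary matching. Everything in it is an assumption for illustration only: the format list `DATE_FORMATS`, the rules in `REPLACE_RULES`, the `PRODUCT_CODES` mapping and the three function names are hypothetical and not part of Odoo or Tesseract.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical preconfigured dictionary of date formats, tried in order.
DATE_FORMATS = ["%d.%m.%Y", "%d/%m/%Y", "%Y-%m-%d"]

# Hypothetical replace rules applied to the (manually corrected) string.
REPLACE_RULES = [("Art.-Nr.", ""), ("80 g", "80g")]

# Hypothetical mapping from supplier wording to internal product codes.
PRODUCT_CODES = {"a4 paper 80g": "PAPER-A4-80", "toner black": "TONER-BK"}


def normalize_date(raw):
    """Return an ISO date string, or None if no configured format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw):
    """Parse amounts such as '1.234,56' or '1,234.56' into a Decimal.

    The last separator is treated as the decimal mark, so a bare
    thousands group like '1,234' would be misread; a real setup would
    need a per-template number format instead.
    """
    text = raw.strip().replace(" ", "")
    if text.rfind(",") > text.rfind("."):
        text = text.replace(".", "").replace(",", ".")
    else:
        text = text.replace(",", "")
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


def match_product(raw):
    """Apply the replace rules, then look up the internal product code."""
    text = raw
    for old, new in REPLACE_RULES:
        text = text.replace(old, new)
    key = " ".join(text.lower().split())  # collapse case and whitespace
    return PRODUCT_CODES.get(key)
```

In the pop-up, the free-text field would hold the raw or corrected string and the read-only field the result of these functions; a `None` result would mean the user still has to pick a value by hand.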
There can be improved variations:
- Checkbox option to make a field a fixed zone of the template, so that the zone no longer has to be set manually for this field. This is useful if, at scanning time, you use fixed physical templates and rules about the orientation of the invoices
- Make a Python preprocessor in combination with printable scan templates, so that you can print meta information (generated by Odoo) on a blank sheet, put the invoice onto this sheet (which has positioning marks, etc.) and scan it into a general inbox. From there it could be routed automatically (by extracting the meta information) to a specific application document inbox (e.g. invoices, contracts, curricula vitae, bank statements, etc.). In those document inboxes one should be able to create e.g. invoices out of a scanned document in an integrated workflow ("create invoice from this document")
- Make a field to assign either a Tesseract standard language dictionary or user dictionaries (XOR) to the specific template. This way the accuracy can be greatly improved already from within the Tesseract algorithm (training files). For example, if you have some products that you always buy from the same supplier, you can register the product string and it will match if a specifically defined confidence is reached. (This could also be the first step towards a tag-based recognition of date-time, reference, and probably also invoice line values.)
- There is little use for line scanning as this is very complex and IMO quite likely not to work correctly.
- However, Invoice generation from scan extraction would probably be a great help for accountants.
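As a rough illustration of two of the variations above (routing from the general inbox, and matching against a user dictionary with a defined confidence), here is a small Python sketch. It is only a sketch under assumptions: the meta line format `ODOO-META;type=...;partner=...`, the inbox names and all function names are invented for the example, and plain string similarity from `difflib` stands in for whatever confidence value the OCR engine itself would report.

```python
from difflib import SequenceMatcher

# Hypothetical meta line printed on the scan template sheet, e.g.
# "ODOO-META;type=invoice;partner=42"
META_PREFIX = "ODOO-META"

# Hypothetical document inboxes, keyed by the "type" meta value.
INBOXES = {
    "invoice": "inbox.invoices",
    "contract": "inbox.contracts",
    "cv": "inbox.cv",
    "bank_statement": "inbox.bank_statements",
}
GENERAL_INBOX = "inbox.general"


def parse_meta(ocr_text):
    """Extract key=value pairs from the first meta line, if there is one."""
    for line in ocr_text.splitlines():
        parts = line.strip().split(";")
        if parts[0] != META_PREFIX:
            continue
        return dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    return {}


def route(ocr_text):
    """Pick the target inbox; documents without usable meta stay general."""
    meta = parse_meta(ocr_text)
    return INBOXES.get(meta.get("type"), GENERAL_INBOX)


def match_user_dictionary(ocr_string, user_words, min_confidence=0.8):
    """Return (word, score) for the best user-dictionary entry.

    The word is None when the best score stays below min_confidence.
    The score is a difflib similarity ratio, used here as a stand-in
    for an OCR confidence value.
    """
    best_word, best_score = None, 0.0
    for word in user_words:
        score = SequenceMatcher(None, ocr_string.lower(), word.lower()).ratio()
        if score > best_score:
            best_word, best_score = word, score
    if best_score >= min_confidence:
        return best_word, best_score
    return None, best_score
```

For example, `route("ODOO-META;type=invoice;partner=42\n...")` would send the scan to the invoices inbox, and a misread string like "Toner b1ack" would still match a registered "Toner black" at the default threshold.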
It would be interesting to hear your opinion on this suggestion, especially about its general usefulness.
Quentin De Paoli (qdp)