Fresco: A Programming Language for Extracting Information from Structured Documents
In daily business processes, a large number of structured documents must be processed. Typically, the task is to extract specific information from these documents and to incorporate it into the workflow. Example document types include invoices, tax forms, general business letters, and business reply mail. Such documents may convey the same information in differing layouts. Current commercial form readers are not able to process this variety of documents.
FRESCO is a programming language for extracting information from arbitrarily structured documents using OCR output. It is a concept-based language (or graph grammar) that describes the items to be extracted by specifying their local features, their structure, and their relationships. Tools are available to apply FRESCO efficiently: a compiler, a linker, and a run-time system. The heart of the run-time system is a best-first graph search that exploits the modeled knowledge.
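To illustrate the idea of a best-first search over concept hypotheses, the following is a minimal Python sketch. It is not FRESCO syntax and not the actual run-time system: the token list, the two concepts, the scoring function, and the name `best_first_extract` are all assumptions made for illustration. Each concept is described by a local feature (a regular expression) and a relationship (a keyword to its left on the same line), and the search always expands the highest-scoring partial assignment first.

```python
import heapq
import itertools
import re

# Hypothetical OCR output as (text, x, y) word tokens; not taken from the paper.
TOKENS = [("Invoice", 10, 10), ("No.", 60, 10), ("4711", 90, 10),
          ("Total", 10, 200), ("129.50", 90, 200)]

# Hypothetical concepts: a local feature (regex on the token text) and a
# relationship (a keyword expected to the left on the same line).
CONCEPTS = {
    "invoice_number": (re.compile(r"^\d{3,}$"), "No."),
    "total": (re.compile(r"^\d+\.\d{2}$"), "Total"),
}

def score(concept, token):
    """Score in [0, 1]: 0 if the local feature fails, reduced if the keyword is missing."""
    pattern, keyword = CONCEPTS[concept]
    text, x, y = token
    if not pattern.match(text):
        return 0.0
    related = any(t == keyword and ty == y and tx < x for t, tx, ty in TOKENS)
    return 1.0 if related else 0.3

def best_first_extract():
    """Assign one token to each concept, expanding the best partial hypothesis first."""
    order = list(CONCEPTS)
    tie = itertools.count()  # tie-breaker so the heap never compares tuples of indices
    # Heap entries: (negated score, tie-breaker, indices of tokens assigned so far).
    heap = [(-1.0, next(tie), ())]
    while heap:
        neg, _, assigned = heapq.heappop(heap)
        if len(assigned) == len(order):
            # Scores only shrink as hypotheses grow, so the first complete
            # hypothesis popped has the best overall score.
            values = [TOKENS[i][0] for i in assigned]
            return dict(zip(order, values)), -neg
        concept = order[len(assigned)]
        for i, token in enumerate(TOKENS):
            if i in assigned:
                continue
            s = score(concept, token)
            if s > 0:
                heapq.heappush(heap, (neg * s, next(tie), assigned + (i,)))
    return None, 0.0

result, confidence = best_first_extract()
print(result, confidence)
```

Because every factor is at most 1, the score of a partial hypothesis is an upper bound on the score of any of its completions, which is what makes it safe to stop at the first complete hypothesis. A real system would model far richer features and structure than this toy example.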
Three successful applications are presented: tax return forms and two different kinds of invoice processing. FRESCO is already in commercial use in one invoice application, which is discussed in more detail. The automation rate on a test set of 500 different invoices was 64% with respect to the items to be extracted.