FAQ: Data extraction
I have a number of documents that are very similar in layout. Is there a way of simplifying the data extraction?

When you mark up data fields and extract them, EscapeE creates a file (extension *.EE) with the field name definitions. So rather than redefining each whole document, you can reuse the same .EE file and make the few modifications manually to create a new .EE file.

There doesn't seem to be a place to define where the .EE file is. Do I need to put DEFAULT.EE in the same folder as the print files?

You can do this - it will apply to all files in that folder unless they have an .EE file with the same stem. E.g. if you have the following files:

I need text extraction for the whole document, not for fields. Is there a way to get the text contents extracted and TIFFs produced in one pass?

Proceed as follows:

When I try to extract text I get an empty file - why is this?

There is no text in your file, merely graphics. EscapeE (from version 8.50) can do OCR, but you must have Microsoft Office 2003 or 2007 installed on your PC with the Microsoft Office Document Imaging tool (MODI), and you must purchase the OCR plugin from RedTitan. Look up OCR in the EscapeE help index for further details, or contact help@redtitan.com.

Alternatively, if the file was produced by a Windows driver, there may well be a way of persuading the driver to produce text. In the Print Setup dialogue click on Properties, then Advanced Graphics options. Make sure you choose to download TrueType fonts either as outline soft fonts or as bitmap soft fonts.

Tip: if your file is mainly graphics, right-clicking on some text will not enable the "Text details" or "Font properties" options, only "Graphic details".

When I tried to extract text I obtained rubbish in my file. Why is this?

You are suffering from the non-standard character codes used by some drivers.
Most of the problems come from Windows drivers, since customised software or Unix systems tend to drive printers in a fairly straightforward way, so I assume your output was created via a Windows driver. If you go to Options|Configuration and set 'Type' to 'Windows HP Driver', the correct code conversion will be engaged.

A possible workaround is to use only the printer's resident fonts in your document, since the driver is compelled to use standard codes in this case. Some drivers allow you to tell them which fonts are resident, so you could try that. If the file is something such as a Word document that can be moved to another PC running Windows 98, then it can be printed using a different driver. Another possibility is to use the free Microsoft Word viewer to do the printing, which often results in simpler output. You may also be able to configure the driver differently, e.g. to download the fonts rather than using graphics.

In some of the fields why is there more text included than I marked?

The problem is that there are two overlapping pieces of text in your field, so EscapeE concatenates the two. The solution is to be more specific in the searching criteria, or perhaps to be more accurate in delimiting the field. For example, if the two pieces of text are in different fonts or sizes, you can specify the attributes of the one you want in the Fields/Searching dialogue. You can check for overlapping fields by right-clicking on the text and choosing Text Details. You will see a line for each piece of text found at the point where you clicked.

When I try to extract a number of fields from the detail line of an invoice, additional data is picked up on this line.

A line is considered part of the field if any part of it falls within the field, and the characters on such lines are included if at least half the character's width is in the field. If the fields are not well aligned with the data, extra lines may become included.
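
The inclusion rule just described can be illustrated with a minimal Python sketch. This is not EscapeE code: the `Box` type, the function names and the coordinate convention (y increasing down the page) are all assumptions made for the example, and the line test here only looks at vertical overlap.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical rectangle in page units; y grows downwards."""
    left: float
    top: float
    right: float
    bottom: float


def line_in_field(line: Box, field: Box) -> bool:
    # A line counts as soon as any part of it overlaps the field vertically.
    return line.top < field.bottom and line.bottom > field.top


def char_in_field(char: Box, field: Box) -> bool:
    # A character counts if at least half of its width lies inside the field.
    overlap = min(char.right, field.right) - max(char.left, field.left)
    return overlap >= (char.right - char.left) / 2


def extract(lines, field):
    """lines is a list of (line_box, [(character, char_box), ...]) pairs."""
    result = []
    for line_box, chars in lines:
        if line_in_field(line_box, field):
            text = "".join(c for c, box in chars if char_in_field(box, field))
            result.append(text)
    return result
```

In this sketch a line that only just overlaps the bottom edge of the field is still picked up, which is how a slightly misaligned field gathers extra lines.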
It is therefore crucial that the fonts do not change between defining the fields and extracting the data (e.g. if Courier is substituted for a missing font). You can sometimes avoid this by making the fields relative to an explicit tag: e.g. make the description fields use the 'Description' text as a reference, so that their offsets are measured from wherever that text is printed.

How do I define a field relative to a tag?

To change a field, right-click on it and choose 'Field properties'. You can set up fields which only appear on pages containing a specified textual tag by making them refer to a tag. You define the tag by right-clicking on the tag text wherever it appears on the page and choosing 'Define tag'. The text that you clicked on will be shown in the Tag box and may be edited if required. Then click on OK. Next, define your field by marking it out (or select an existing field), click on the Reference Field box in the Field Properties dialogue, and choose the appropriate tag as a reference.

Can I define the page field tags, names and positions directly in the .EE file without having to use the page viewer?

You can write the .EE file yourself, since it is XML and therefore just a text file.

When extracting fields, do you have a way of handling different page formats in a single PCL file?

The field extraction can be tailored to each different kind of page by choosing a tag string which is unique to that page and basing a different series of fields on each such tag. You can also define multi-page sets which repeat every n pages (see Field Definitions|Advanced). The starting page can be specified separately, so a field could be defined to start at page 3 and then appear every 2 pages. To skip the first page, define a field that starts at page 2 and then appears on every page.
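
The start-page and repeat arithmetic above can be checked with a short Python sketch. The function and parameter names (`field_applies`, `start`, `every`) are hypothetical and are not EscapeE settings; the sketch assumes pages are numbered from 1 and says nothing about how EscapeE stores these values.

```python
def field_applies(page: int, start: int = 1, every: int = 1) -> bool:
    """Return True if a field that begins on page `start` and repeats
    every `every` pages applies to the given 1-based page number."""
    if every < 1:
        raise ValueError("repeat interval must be at least 1")
    return page >= start and (page - start) % every == 0


# Start at page 3, then every 2 pages: pages 3, 5, 7 and 9 of a 9-page file.
odd_from_three = [p for p in range(1, 10) if field_applies(p, start=3, every=2)]

# Skip the first page: start at page 2, then every page.
all_but_first = [p for p in range(1, 10) if field_applies(p, start=2, every=1)]
```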