PDF documents exist in the trillions and support all types of personal and business activities. A large percentage of these documents were “born digital”, meaning that they were created from electronic files such as Microsoft Word or Excel documents and then converted to PDF. The remainder are simply scanned images stored inside a PDF “container”.
While the original premise of the PDF format was to provide a common standard for storing and sharing final-form documents, many knowledge workers and organizations have come to rely on being able to easily and automatically take the data stored within these documents and use it for other purposes, such as locating specific data within a PDF or exporting a table to a spreadsheet.
For all the benefits of PDF, such as compact, sharable, secure, and perfectly displayed documents, using the data within a PDF document is difficult. Data stored within a PDF does not have the metadata needed to identify individual words, let alone complex data structures such as fields and tables.
Report Miner enables even novice users to quickly take data from PDF files and use it within other applications. Through a simple-to-use interface, knowledge workers can define data elements that are then used to automatically locate and extract this data to text files, XML, or spreadsheet formats.
Although primarily designed for parsing and extracting data from PDF documents, Report Miner is equally capable of processing scanned images, such as TIFF files, and text documents, including print streams and EBCDIC files with vertical carriage controls. These extensions make Report Miner ideal for processing high volumes of images as well as traditional Computer Output to Laser Disc (COLD) data files.
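To give a sense of what processing an EBCDIC print stream involves, here is a minimal Python sketch. It is not Report Miner's implementation. It assumes fixed-length records, the cp037 EBCDIC code page (one of several in use), and ASA-style carriage controls in the first column of each record; the function name `split_pages` and the record length are hypothetical.

```python
def split_pages(data: bytes, record_len: int) -> list[list[str]]:
    """Decode fixed-length EBCDIC records and split them into pages.

    Assumes ASA carriage controls in column 1 of every record:
    ' ' single space, '0' double space, '-' triple space, '1' new page.
    The '+' (overprint) control is treated as an ordinary line here,
    which is a simplification.
    """
    pages: list[list[str]] = [[]]
    for start in range(0, len(data), record_len):
        record = data[start:start + record_len].decode("cp037")
        control, text = record[:1], record[1:].rstrip()
        if control == "1" and pages[-1]:
            pages.append([])             # skip to a new page
        elif control == "0":
            pages[-1].append("")         # one blank line before the text
        elif control == "-":
            pages[-1].extend(["", ""])   # two blank lines before the text
        pages[-1].append(text)
    return pages
```

Each inner list holds the text lines of one page, with blank strings standing in for the vertical spacing requested by the control characters.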
PDF Report Miner uses a custom PDF parsing and viewing engine with sophisticated text handling and automatic reading order detection, a feature that is lacking in most traditional PDF viewers and libraries. Reliable, high-quality text extraction is vital to successfully parsing more complicated data structures such as fields and tables.
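The idea behind reading order detection can be illustrated with a small Python sketch. This is a deliberately simple heuristic for single-column pages, not the algorithm Report Miner uses. The `Fragment` type, the `reading_order` function, and the tolerance value are hypothetical, and coordinates are assumed to be measured from the top-left corner of the page.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline, measured downward from the top of the page (assumption)


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> list[str]:
    """Group positioned text fragments into lines, top to bottom, left to right."""
    lines: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # A fragment joins the current line if its baseline is close to the
        # baseline of the first fragment placed on that line.
        if lines and abs(frag.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Fragments arrive in arbitrary order, as they can in a PDF content stream.
page = [
    Fragment("World", 60, 100.5),
    Fragment("Hello", 10, 100.0),
    Fragment("Total:", 10, 120.0),
    Fragment("42", 80, 119.2),
]
print(reading_order(page))  # ['Hello World', 'Total: 42']
```

A heuristic like this breaks down on multi-column layouts, rotated text, and tables, which is why dependable reading order detection needs considerably more analysis than a sort by position.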
PDF Report Miner was designed for flexibility and scalability. The available SDK for the .NET environment allows customization and precise control of the processing engine through plugins and a callback interface. Unlike many traditional PDF engines, Report Miner can process very large PDF documents with many thousands of pages.