PDF Table Extraction Extension Released!
By: Edwin Yaqub, PhD
In my last post, I introduced the ‘Web Table Extraction’ extension, which provides a convenient way to retrieve data tables from Wiki-like HTML pages. In this post, I will introduce you to ‘PDF Table Extraction’, another extension developed at RapidMiner Research as part of the Data Search for Data Mining project (DS4DM, http://ds4dm.de) and released today. So let us see how this extension adds value to RapidMiner processes.
Problem: You may have already faced a situation where you wanted to use data tables from PDF documents. PDF has become a de facto standard for read-only documents. It is certainly possible, and sometimes unavoidable, to extract data tables out of a PDF using fine-grained scraping techniques, but parsing content this way is a meticulous activity. In the worst case, your efforts might not be reusable if tables in other documents use a different header structure. The problem is to raise the level of abstraction so that data tables with an arbitrary header structure can be extracted from a PDF document in an easy way.
Solution: The ‘Read PDF Table’ operator solves this problem. It provides a generic solution to automatically detect and extract data tables from a PDF document as RapidMiner example sets. Simply provide it with the path of your PDF file, or its URL if the file resides on the web, and execute the process. The output is a collection of example sets, as the operator tries to calibrate the detection of tables in the document. One of these example sets is highly likely to be the most accurate representation of your table. Let’s try some examples, along with a few hints you might find useful when dealing with tables whose headers are complex.
Examples: The first example is rather simple. We use a document in which the tables have a clear single-layer header, available here [1]. The operator accurately detects and extracts the tables, as seen below.
Read PDF Table operator
Read PDF Table Results
In the second example, the document [2] contains a table with a 3-layer header. The operator uses the first layer to construct the example set attributes. Suppose we find that the second row serves as a more descriptive table header. The ‘Rename by Example Values’ operator easily resolves this task.
Renaming!
The Rename Process
Results after renaming
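To make the renaming step concrete, here is a minimal Python sketch of the idea behind promoting a data row to the header. It is not the operator's implementation; the function name `rename_by_example_values` and the small table are hypothetical and only illustrate the effect.

```python
def rename_by_example_values(header, rows, row_index=0):
    """Use the values of one data row as the new column names, then drop that row."""
    new_header = [
        str(value).strip() if value not in (None, "") else old_name
        for old_name, value in zip(header, rows[row_index])
    ]
    remaining = rows[:row_index] + rows[row_index + 1:]
    return new_header, remaining


# Hypothetical table: the first header layer became the attribute names,
# and the more descriptive second layer ended up as the first data row.
header = ["Group", "Values", "Values_1"]
rows = [
    ["Region", "Count 2015", "Count 2016"],
    ["North", "10", "12"],
    ["South", "7", "9"],
]

new_header, data = rename_by_example_values(header, rows, row_index=0)
print(new_header)
print(data)
```

An empty cell in the chosen row keeps the old attribute name, which is a design choice of this sketch so that no column ends up unnamed.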
Now that we have the ability to extract data tables from a PDF document, let’s make use of some interesting statistics from the European Commission (Eurostat). Eurostat offers many datasets [3] downloadable as PDF files. One such dataset, stored at [4], shows the percentage of individuals who obtain information from public authorities’ websites (per year, 2008-16). Governments use websites to educate the public on a variety of issues such as health awareness, political canvassing, travel warnings, development plans, etc. The question is whether more attention is being paid to this information in certain countries, and if so, how much more. If such differences are found, spending could be optimized and different means could be used to expand the audience in specific groups of countries. As we have no means to classify the data, we turn to RapidMiner Clustering to discover groupings. Here we go:
Read PDF Table and Cluster Data
After reading the PDF document from this URL [4], we realize that the example set has an arbitrary attribute in second place, which shifts the rest of the attributes one step to the right. We can easily fix this by using the Data Editor view from the Text processing extension to rename the attributes and delete the last, redundant attribute. Owing to my programmer instincts, I wrote a short Groovy script that automates this and renames the first column. RapidMiner does not require you to write code, but if you have small scripts that do big things, you can of course use the Execute operators.
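The Groovy script itself is part of the attached process and is not reproduced here. As a rough illustration of the same fix, the Python sketch below drops the stray attribute name, lets the remaining names move one step to the left, and trims the now unnamed last column. The function `fix_shifted_header`, the sample names, and the values are hypothetical, and the layout of the extracted table is my reading of the description above.

```python
def fix_shifted_header(names, rows, stray_index=1, first_name="Country"):
    """Realign attribute names that were pushed right by one stray attribute."""
    # Drop the stray name so every following name moves one step to the left.
    fixed_names = names[:stray_index] + names[stray_index + 1:]
    # Give the first column a readable name (assumed here to be "Country").
    fixed_names[0] = first_name
    # The last column no longer has a name: it is the redundant one, so trim it.
    fixed_rows = [row[:len(fixed_names)] for row in rows]
    return fixed_names, fixed_rows


# Hypothetical extraction result: "att2" is the arbitrary attribute in second
# place, and the last column of every row is empty.
names = ["geo", "att2", "2008", "2009", "2010"]
rows = [
    ["Country A", "31", "35", "38", ""],
    ["Country B", "12", "15", "19", ""],
]

fixed_names, fixed_rows = fix_shifted_header(names, rows)
print(fixed_names)
print(fixed_rows)
```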
Next, some pre-processing is performed. We remove the redundant attribute, trailing whitespace, and useless examples from the top and bottom, clean alphanumeric values to keep only the numeric part, filter out examples with missing values, assign data types, convert nominal to numeric and perform k-means clustering. Now we face the moment of truth: what value should we set for k? As we are clueless, here is the good deal about RapidMiner: situations like these are ideal for leveraging its Wisdom of the Crowds [5], a guidance feature that suggests parameter values based on how community members used the same operator. Empowered with this knowledge, we quickly try k with 4 and 5, and it becomes clear that 5 provides the better inflection point in reducing the error rate, also considering the output of the Cluster Performance operator (for average within-cluster distance as well as the Davies-Bouldin index).
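For readers who like to see the mechanics, the following Python sketch mimics two of these steps on a tiny, made-up table: cleaning alphanumeric cells down to their numeric part, and comparing the average within-cluster distance for two candidate values of k. It is a plain textbook k-means with a simplified, deterministic seeding, not the RapidMiner operator, and all names and values in it are hypothetical.

```python
import math
import re


def to_number(text):
    """Return the first number found in a cell such as ' 35(b) ', or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, max_iter=100):
    """Plain k-means; seeds are k evenly spaced points so the run is repeatable."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
    return centroids, labels


def avg_within_distance(points, centroids, labels):
    """Average distance of each point to its own centroid (lower means tighter clusters)."""
    return sum(math.dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)


# Hypothetical raw cells: footnote markers, stray whitespace, and ':' for a missing value.
raw_rows = [
    ["10", "12(b)"], ["11 ", "13"], ["50", "52"], ["51", ":"],
    ["52", "54"], ["90", "91"], ["92", "93"],
]
cleaned = [[to_number(cell) for cell in row] for row in raw_rows]
points = [row for row in cleaned if None not in row]  # drop rows with missing values

errors = {}
for k in (2, 3):
    centroids, labels = kmeans(points, k)
    errors[k] = avg_within_distance(points, centroids, labels)
    print(k, round(errors[k], 3))
```

Plotting such an error against k and looking for the point where the drop flattens out is the usual way to read an inflection point; the sketch compares only two values to keep it short.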
Although our dataset was relatively small, it was not easy to draw conclusions manually. Clustering allowed us to identify five groups of countries. The Centroid table view of the cluster model provides more details on attributes (Country, usage data for years 2008-16) in each cluster. A simpler way to interpret the clusters in this case can be to use the overall mean value of attributes (for 2008-16).
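The interpretation by overall mean can be sketched in a few lines of Python. The function `rank_clusters_by_overall_mean`, the cluster labels, and the numbers below are hypothetical and are not the Eurostat figures; the sketch only shows how averaging all yearly values per cluster yields a simple ranking.

```python
from collections import defaultdict
from statistics import mean


def rank_clusters_by_overall_mean(rows, labels):
    """Return (cluster, overall mean) pairs sorted from the lowest to the highest mean."""
    values = defaultdict(list)
    for row, label in zip(rows, labels):
        values[label].extend(row)  # pool every yearly value of the cluster
    return sorted(((label, mean(vals)) for label, vals in values.items()),
                  key=lambda pair: pair[1])


# Hypothetical yearly percentages for four countries and their cluster labels.
rows = [
    [10.0, 12.0, 14.0],
    [20.0, 22.0, 24.0],
    [60.0, 62.0, 64.0],
    [70.0, 72.0, 74.0],
]
labels = [0, 0, 1, 1]

ranking = rank_clusters_by_overall_mean(rows, labels)
print(ranking)
```

The first entry of the ranking is the cluster whose members obtained the least information on average, and the last entry is the cluster that obtained the most.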
Results - Davies-Bouldin Index
We find that individuals of cluster 2 (Croatia and Poland) obtained the least information from public authorities’ websites, while those of cluster 4 (Netherlands, Sweden and Norway) obtained the most.
Conclusion: In this post, the RapidMiner extension for PDF data table extraction was introduced. This can boost your productivity by expanding your reach to data tables inside PDF - the universal data format. Feel free to reuse the example process (attached), extend the dataset by joining more PDF data tables (from Eurostat or another source) that interest you, and hand over the complexity to RapidMiner clustering. Have fun discovering more insights!
References:
[3] http://ec.europa.eu/eurostat/web/digital-economy-and-society/data/main-tables