The ‘Data Search for Data Mining’ – Extension Release!
By: Edwin Yaqub, PhD
At RapidMiner Research, we are addressing problems that are becoming increasingly pertinent to businesses. As part of the German research project DS4DM (http://ds4dm.de), we have now released the ‘Data Search for Data Mining’ extension, which provides data enrichment capabilities in RapidMiner.
Motivation:
Data analysts increasingly face the situation that the data they need for a data mining project exists somewhere on the web or in an organization’s intranet, but they are not able to find it. On the web, data is generally found through search engines using keywords or text. This is an example of unstructured search. In cases where structured data exists, e.g. in the form of a table, structured and contextualized search is possible. The objective is to enrich an existing table with additional data by harnessing diverse sources of data in an efficient manner. In the literature, this topic is often referred to as Entity Augmentation or Search-Join [1,2]. Search-Joins are useful within a wide range of application scenarios. For example, given a dataset containing attributes like the name, GDP and region of a country, we would like to enrich the dataset by:
- Searching for relevant datasets that contain an attribute of interest, e.g. the language that is spoken in a country or the currency used there.
- Integrating the new attribute into our original dataset, either by automatically filtering it out from potentially large candidate datasets or by allowing a human to manually refine the integration.
The ‘Data Search’ extension implements both of these capabilities and thus brings the Search-Join data enrichment method to RapidMiner.
Besides the subject matter, this post also shows that Java developers can reuse RapidMiner libraries to customize visualizations, add GUI panels and controls in their extensions to suit their needs.
Data enrichment through the Search-Join method
The Backend: For the search function, the extension uses a Search-Join data server at the backend. This server is developed by our project partner, the University of Mannheim (Data and Web Science group). The backend comprises a corpus of heterogeneous data tables, which are extracted from data sources, indexed and stored. The current implementation uses a subset of Wikipedia as a source, but more sources will be added in future. The extension (frontend) interfaces with the backend through a web-service, which uses algorithms to discover candidate tables. The discovery is based on schema (column level) and instance (row level) matches between the provided query and the tabular corpus.
The Frontend: The extension is composed of three operators: Data Search, Translate and Fuse, which work together in an operator chain as seen in Fig. 1.
Fig. 1 RapidMiner process for data search and integration
Data Search operator: This operator queries the web-service for relevant tables by submitting an entity query. The entity query comprises an existing dataset, one attribute of which is recognized as the subject identifier (the primary identifier of a row), and a keyword for the additional attribute to be discovered. The server returns a collection of relevant tables. The schema level and instance level matches are also made available at the output ports.
If you select the checkbox ‘apply manual refinements’ in the operator parameter panel, the process execution is paused and you are taken to a Control Panel graphical view. Here, you see the discovered data tables matching your query, as shown in Fig. 2. The customized tree view lists candidate tables, which can contribute values for your new attribute. The red legend indicates that the table (shown as a named node in the tree panel) has an attribute (column) match to your original table. Similarly, the blue legend indicates a match at the instance (row) level, and both legends together indicate both kinds of match, which is the ideal case.
The panel shows the distribution of two statistics over the collection to give a high-level view at a glance:
- Coverage: the number of examples that matched between the query (your original) table and the fetched (candidate) table, divided by the number of examples in the query table.
- Ratio: the number of examples that matched between the query table and the fetched table divided by the number of examples in the fetched table.
Fig. 2 Results of the Data Search operator
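The two statistics above can be reproduced by hand on a toy example. The following is only a minimal sketch: it assumes that examples are matched by exact equality of the subject identifier and that identifiers are unique, whereas the backend's matching is likely more tolerant. The function name and the country lists are made up for illustration.

```python
def coverage_and_ratio(query_keys, candidate_keys):
    """Return (coverage, ratio) for one candidate table.

    Assumption: two examples match when their subject identifiers
    are exactly equal, and identifiers are unique within a table.
    """
    query = set(query_keys)
    candidate = set(candidate_keys)
    matched = len(query & candidate)
    # Coverage: share of the query table that the candidate can serve.
    coverage = matched / len(query) if query else 0.0
    # Ratio: share of the candidate table that is relevant to the query.
    ratio = matched / len(candidate) if candidate else 0.0
    return coverage, ratio


# Illustrative data: subject identifiers of the query and one candidate table.
query_countries = ["Germany", "France", "Japan", "Brazil"]
candidate_countries = ["Germany", "France", "Italy", "Spain", "Japan"]

coverage, ratio = coverage_and_ratio(query_countries, candidate_countries)
print(f"coverage={coverage:.2f}, ratio={ratio:.2f}")  # coverage=0.75, ratio=0.60
```

A candidate with high coverage but low ratio is a large table of which only a small part concerns your entities; one with high ratio but low coverage is focused but fills in few of your rows.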
Noise Removal: It is a fact that data search is prone to noise. If the analyst deems a discovered table to be noisy, it should be deleted before the process execution is resumed.
A noisy table can be deleted by selecting it in the tree, right-clicking and then selecting the Delete menu item. This changes the data model of the operator, so the changes need to be committed in memory by clicking the ‘Commit Updates’ button before resuming the process execution. If you accidentally delete a node, the original collection can be restored through the ‘Restore Original’ button at any time. These controls are shown in Fig. 3. Notice that the example sets at the output ports of the operator, i.e. the schema and instance match tables, are updated accordingly. The idea is that only refined output reaches the next operator in the chain (Translate).
Fig. 3 Delete from list and commit changes in memory
Visual aids
Care must be taken when deleting tables to prevent the loss of potentially valuable tables. To assist the data analyst in this exploratory task, two visualizations are provided.
Interactive Document Map: RapidMiner provides a Self-Organizing Map (SOM) visualization which can be used to expose patterns in data. We reuse and customize it to tag each dot (a point shown on the map) with text showing key properties of the table, i.e. its full name and the counts of schema and instance matches. The map also provides a drill-down mechanism in that each dot is implemented as a hyperlink. If clicked, it opens the associated table in the tree-tabular view. This makes it easier to locate and filter tables.
The document map helps you understand how the candidate space of discovered tables appears in a landscape-like layout. For example, tables with more schema or instance matches might be (but are not necessarily) stronger candidates. You may not want to delete these tables, while others may be less interesting. The map can also reveal neighbourhoods based on (dis)similarities among the tables, derived from the table properties that are fed internally to the underlying neural network. Fig. 4 shows a document map for the results of a sample query.
Fig. 4 Interactive Document Map showing discovered tables
3D Labelled Scatter Plot: While the interactive map provides a landscape view of the search space, the 3D scatter plot shows the tables as points along the x, y and z axes. The points are labelled with the table name. This visualization is intended to show whether and how the tables cluster along individual axes and whether a Pareto frontier exists. If so, the Pareto-efficient tables are stronger trade-off candidates, which you may want to keep. Fig. 5 shows such a plot for the results of a sample query.
Fig. 5 Labelled 3D scatter plot showing discovered tables
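To make the notion of Pareto-efficient tables concrete, here is a minimal sketch that computes the Pareto frontier over a handful of tables. It is not the extension's code. The three axes (coverage, ratio and number of schema matches, all treated as "higher is better") are an assumption for illustration, since the axes of the plot are not fixed here, and the table names and scores are invented.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b on every axis
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(tables):
    """Return the names of tables that no other table dominates.

    tables: dict mapping a table name to a tuple of scores.
    """
    return sorted(
        name
        for name, score in tables.items()
        if not any(
            dominates(other_score, score)
            for other_name, other_score in tables.items()
            if other_name != name
        )
    )


# Hypothetical scores: (coverage, ratio, schema matches).
tables = {
    "countries_a": (0.9, 0.5, 2),
    "countries_b": (0.7, 0.8, 1),
    "countries_c": (0.6, 0.4, 1),
    "countries_d": (0.9, 0.5, 1),
}

print(pareto_front(tables))  # ['countries_a', 'countries_b']
```

Here countries_c and countries_d are both dominated by countries_a, so they are the weaker candidates, while countries_a and countries_b each win on a different axis and form the frontier.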
Translate operator:
The outputs of the Data Search operator are passed on to the Translate operator. This is where data integration, the Join step in Search-Join, starts. Translate processes the candidate tables using the schema and instance matches. As a result, a new collection of tables in the image of your original dataset is created. This collection of 'translated' tables is composed only of those candidate tables that have at least one cell value to contribute to your new attribute. Here again, the 'apply manual refinements' checkbox can be selected to keep unwanted tables from reaching the Fuse operator. Interested readers are referred to [3] for conceptual details.
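The translation step can be pictured with a small sketch. This is an illustration under simplifying assumptions, not the operator's implementation: tables are lists of dictionaries, the schema match is a mapping from candidate column names to query column names, and the instance match is a mapping from candidate subject values to query subject values. The function name and all data are hypothetical.

```python
def translate(query_rows, subject, new_attr, candidate_rows,
              schema_match, instance_match):
    """Rewrite one candidate table in the image of the query table.

    Returns None if the candidate has no cell value to contribute.
    """
    # Invert the schema match to find the candidate's own column names.
    to_candidate = {target: source for source, target in schema_match.items()}
    cand_subject = to_candidate[subject]
    cand_attr = to_candidate[new_attr]

    # Collect the candidate's values, keyed by the query's subject values.
    values = {}
    for row in candidate_rows:
        query_key = instance_match.get(row[cand_subject])
        if query_key is not None and row.get(cand_attr) is not None:
            values[query_key] = row[cand_attr]

    if not values:
        return None  # this candidate contributes nothing and is dropped

    # Same rows and columns as the query table, plus the new attribute.
    return [{**row, new_attr: values.get(row[subject])} for row in query_rows]


query_rows = [
    {"country": "Germany", "region": "Europe"},
    {"country": "France", "region": "Europe"},
    {"country": "Japan", "region": "Asia"},
]
candidate_rows = [
    {"Nation": "Federal Republic of Germany", "Official language": "German"},
    {"Nation": "France", "Official language": "French"},
    {"Nation": "Italy", "Official language": "Italian"},
]
schema_match = {"Nation": "country", "Official language": "language"}
instance_match = {"Federal Republic of Germany": "Germany", "France": "France"}

translated = translate(query_rows, "country", "language",
                       candidate_rows, schema_match, instance_match)
for row in translated:
    print(row)
```

In this sketch the translated table has one row per query row; Japan has no matching candidate row, so its new attribute stays empty (None), and the unmatched Italy row is discarded.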
Fuse operator:
The last operator in the Search-Join process is the Fuse operator. Fuse takes the outputs of the Translate operator as input. It then selects a particular cell value for the new attribute from the collection of translated tables. The decision about which value to choose from which table is made by a fusion policy, which uses criteria provided by the user in the operator parameter panel. At this stage, we provide a default fusion policy. Finally, the chosen cell values are fused into the corresponding instances (rows) of your original dataset, and an enriched dataset with the new attribute is produced. This concludes the data integration (Join) step. Fig. 6 shows the enriched dataset, where the new attributes ‘language’ and ‘currency’ have been added to the original dataset.
Fig. 6 Dataset enriched with 'language' and 'currency' attributes
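As an illustration of what a fusion policy does, the sketch below picks, for every row, the value that most translated tables agree on. Majority voting is an assumption chosen for this example; it is not necessarily the default policy shipped with the extension. The function name and data are hypothetical.

```python
from collections import Counter


def fuse_by_majority(query_rows, subject, new_attr, translated_tables):
    """Add new_attr to each query row using the most frequent value
    proposed for that row across the translated tables."""
    fused = []
    for row in query_rows:
        # Gather every non-empty value proposed for this row.
        proposals = [
            t_row[new_attr]
            for table in translated_tables
            for t_row in table
            if t_row[subject] == row[subject] and t_row.get(new_attr) is not None
        ]
        # Pick the most common proposal; leave the cell empty if there is none.
        value = Counter(proposals).most_common(1)[0][0] if proposals else None
        fused.append({**row, new_attr: value})
    return fused


original = [{"country": "Germany"}, {"country": "France"}, {"country": "Japan"}]
translated_tables = [
    [{"country": "Germany", "language": "German"},
     {"country": "France", "language": "French"},
     {"country": "Japan", "language": None}],
    [{"country": "Germany", "language": "German"},
     {"country": "France", "language": None},
     {"country": "Japan", "language": None}],
    [{"country": "Germany", "language": "Deutsch"},
     {"country": "France", "language": None},
     {"country": "Japan", "language": None}],
]

for row in fuse_by_majority(original, "country", "language", translated_tables):
    print(row)
```

Other policies are conceivable, for example preferring the table with the highest coverage; the point is only that conflicting proposals are resolved to one cell value per row.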
Conclusion
In this blog post, you learned about the ‘Data Search for Data Mining’ extension, which can be used to enrich an existing dataset with relevant new attributes. The GUI features shown here reuse RapidMiner source code to achieve the necessary customizations. If you perform similar customizations, just ensure that the RapidMiner security guidelines [4] are respected. The DS4DM project [5] is under active development, and new features being developed at the backend and the frontend will be rolled out in subsequent releases. I will stop here and urge you to go ahead, install the extension from the marketplace and simply execute the sample process (attached and below) for a first-hand experience.
Acknowledgments
The Data Search extension is developed as part of the Data Search for Data Mining project (DS4DM, http://ds4dm.de), sponsored by the German Federal Ministry of Education and Research (BMBF).
References
[1] Bizer, Christian et al. 'The Mannheim Search Join Engine'. Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 35, Part 3, Dec. 2015.
[2] Bizer, Christian, Tom Heath, and Tim Berners-Lee. 'Linked data - the story so far'. Semantic services, interoperability and web applications: emerging concepts (2009): 205-227.
[3] Bizer, Christian. 'Schema Mapping and Data Translation', lecture notes, weblink: http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/Lehre/WebDataIntegration/HWS2015/WDI03-Mapping-HWS2015.pdf
[4] RapidMiner documentation on Security and Restrictions, weblink: http://docs.www.turtlecreekpls.com/developers/security
[5] Data Search for Data Mining (DS4DM) project, weblink: http://ds4dm.de