Similarity between mutiple tables

Robin1992Robin1992 MemberPosts:5Contributor II
Hi,

Currently, I am working on a thesis research for my university to solve an entity resolution problem. Today I have tried to integrate two tables with each other through measuring the Similarity between these tables. If the threshold is above 0,9 it is considered as useful and will it be used in the second evaluation. In the second evaluation the variables will be evaluated on weight. For example, a phone number is a better unique key, than a firstname. At the end, the customer representation need to be evaluated as followed (0.9*2)+(0,8*7) = .... if the threshold is above the 0.8 (for example) it will consider as usefull and integrate the rows. I Tried to perform the similarity (with a couple of similarity measures) measure In rapid miner, but I received extreme values ( <0 or >1).

(currently, I cannot post any screenshots, since I am new)

What do I wrong?

Cheers, Robin

Tagged:

Answers

  • eyey 主持人、雇员、RapidMiner注册分析师, RapidMiner Certified Expert, RMResearcher, MemberPosts:21RM Research
    edited May 2019
    Hello Robin,

    I would refer you to the Data Search for Data Mining extension, which we developed as part of the research project DS4DM (http://ds4dm.com). In this extension, we dealt with table matchings to identify common entities in two tables (or a corpus of tables). The idea is to extend the initial table using fuzzy matching and automatic conflict resolution. If common entities (duplicates) are detected, attributes of the matched table are augmented to the initial table - which bring new knowledge about the matched entities OR help fill missing values for existing attribute(s).

    Please try the 'Enrich Table by Data Fusion (ETDF)' operator, which will try to extend or augment your table with attributes from a matched table. It does this only for Entities (Examples), whose so-called Subject-Identifier attributes (near unique but not necessarily primary identifier) would match by applying instance matching (attribute-value matching) heuristic. Optionally, you can apply schema matching (attribute-name matching) heuristic which will filter out tables that are not structurally similar enough. Please read the Help documentation for details. I am assuming you do not have a unique identifier, else simply use inner Join and use nested loop over two tables (m x n) to compute similarity among attribute values for each Example.

    Usually, fullname (first and last name together), email, a title, or a short textual description would be a good candidate to select as Subject-Identfier attribute, else you can concat attributes to create a Subject-Identfier attribute for matching. Typos and duplicates are allowed and expected. If the ETDF operator detects multiple entity matches in the other table, it automatically resolves this conflict by choosing the highest matched. Various fuzzy matching methods are provided, which are used for matching string-pairs (this is called 'label based matching') between your initial table and the matched table. Please note that the execution time and precision heavily depends on the method.

    Your requirements seem very specific to your case. You mention that you would like to apply own weights to match 'comparable attributes' after an Entity match is found. This shows you are dealing with structurally similar tables, but the ETDF operator is designed to process tables in a structurally in-different manner. However, we took this aspect into consideration. Our approach is to transform the augmentation process into an optimization problem. To implement your own evaluation scheme, I would refer you to the last tutorial process of the ETDF operator, which shows how you can bring your own definition of 'precise-match' in order to filter out the incorrect matches. In this case, first you apply the ETDF operator to extend the table, and in the second step, you can choose to accept or reject a matched Entity (Example) by applying your own weighted-matching scheme on comparable attributes. This tutorial also relieves you from finding good parameter values for the ETDF operator by using an optimizer.

    If you are able to share your tables (or a subset), then we could have a closer look.

    Best Regards,
    Edwin Yaqub
    sgenzer MartinLiebig Robin1992
  • Robin1992Robin1992 MemberPosts:5Contributor II
    edited May 2019
    Hi Edwin,

    Thanks for your quick reply!

    Today, I tried to use the extension to solve my problem,.however many problems occur during the execution of the model. Maybe I can clarify the problem more accurately:

    Currently, my thesis supervisor is looking for a solution to connect multiple databases and extract the entities (customers). Our customers are recorded in the main database that contains high-quality data. I can use this data set (in this case the truth) to connect with other sources, such as a marketing database. In the main database (with the truth) we have the following columns: Initials, Full name (text), home_number (int), Postcode (text) , City_name (text) , Gender (text), Phone_Number (int) and e-mail . When we try to connect to another source it could be that address details are missing, and the phone number is incorrectly inserted. Since the data is missing or is incorrect, we would like to find a way how to extract the entities and add additional characteristics to the customer profile. For example in the main database were the attributes: Initials, Full name, home_number, Postcode, City_name, Gender, Phone_Number and e-mail, but after integration the attribute eye colour is added.

    Yes, we would like to add some weights to the similarity because a match between full names is less effective than phone number for example.

    My first thoughts are:

    Inserting data --> Cleaning data--> check on similarity, if above the threshold, then proceed --> combine the similarity with weights. e.g. eye-colour will never change so, a heavy weight = 1, while city_name is less stable and gets the weight 0,8. Combine the weights with similarity (0,95*1)*(0,97*0,9) = 0.82, which is above the threshold of 0.8 --> Integrate the data and add additional columns if available.

    Do you also have an example with regular operations in rapid miner. For my thesis I need to understand why things happen and if I am using extensions it will be a black box for me.

    Robin

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager
    edited May 2019
    @Robin1992I just 'boosted' you so you should now be able to post screenshots.

    Scott

  • Robin1992Robin1992 MemberPosts:5Contributor II
    Thank Scot,

    Currently, I made the following process in rapid miner:



    我使用相同的数据集,使用正确的数据and the other with manipulated data (same columns). To start with the first cross distance test I selected the "initials" attribute. Within the cross distance operate I selected "nominal measures" and "JaccardSimilarity". I received the following results::

    Results:




    I was expecting results such as: 0,43, 0,33 etc, see below an real example:


Sign InorRegisterto comment.