"Compare two customer databases"

xiaobo_sxbxiaobo_sxb MemberPosts:17Maven
edited June 2019 inHelp
Hi

I have two customer tables which contains their information like name, address, phone etc. Most of them are actually the same customer set. I'd like to map them if they are the same, by comparing above fields. Both table has more than 10K records. Does anybody know how to do that in Rapidminer?

Best Regards
Steven
Tagged:

Answers

  • awchisholmawchisholm RapidMiner Certified Expert, MemberPosts:458Unicorn
    Hello Steven

    The Join operator lets you join tables together. You could also use a distance to similarity approach to see what records are close to one another.

    问候,

    Andrew
  • xiaobo_sxbxiaobo_sxb MemberPosts:17Maven
    Hi Andrew

    Thank you for your reply. I still have questions for your proposal.

    First, the join operator require the two dataset have the same ID (the key). For my case, I don't have the same ID.

    For the "data to similarity" operator, still not good enough. First, it will create a cross join across all the rows, in my case I have more than 10K rows for both of the tables, and I doubt the performance. Second, even I have the similarity score, I don't know the threshold for determining the possibility of two rows as the same customer. Is it possible to generate the possibility to say, how much percent of confidence we can say the two customers are actually the same one?

    问候
    Steven
  • awchisholmawchisholm RapidMiner Certified Expert, MemberPosts:458Unicorn
    Hello Steven

    Well if there is no common ID then there is obviously no way to use Join.

    Actually a better operator would be Cross Distances which allows the selection of the top k nearest. The threshold completely depends on the data you have and I can't answer that.

    Performance may not be that bad; you have to try it.

    regards

    Andrew
  • xiaobo_sxbxiaobo_sxb MemberPosts:17Maven
    Does anyone know how the cross distance operator work? Seems it works well for numbers but not for text. I use the operator for double bytes text and the result is totally incorrect. I created several records manually, some of them are quite close with only few words difference, some of them are quite different, but the distance it generated does not provide the result as it should be.

    问候
    Steven
  • awchisholmawchisholm RapidMiner Certified Expert, MemberPosts:458Unicorn
    One way is to extract the words from each description to create a word vector where the value of each attribute is based on a measure like tf.idf. The text processing extension is your friend for this.

    Andrew
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    Also the Join operator is still an option, but it will only match customers that have an exact match in name, phone number etc.: in RapidMiner you can join by comparing the desired fields, even if there is no common id.
    An outer join is the join type you'll want to use in your case to spot customers that are part of only one of the tables.

    Best regards,
    Marius
Sign InorRegisterto comment.