"Compare two customer databases"

xiaobo_sxb · April 2014

Hi

I have two customer tables which contains their information like name, address, phone etc. Most of them are actually the same customer set. I'd like to map them if they are the same, by comparing above fields. Both table has more than 10K records. Does anybody know how to do that in Rapidminer?

Best Regards
Steven

awchisholm · April 2014

Hello Steven

The Join operator lets you join tables together. You could also use a distance to similarity approach to see what records are close to one another.

问候,

Andrew

xiaobo_sxb · April 2014

Hi Andrew

Thank you for your reply. I still have questions for your proposal.

First, the join operator require the two dataset have the same ID (the key). For my case, I don't have the same ID.

For the "data to similarity" operator, still not good enough. First, it will create a cross join across all the rows, in my case I have more than 10K rows for both of the tables, and I doubt the performance. Second, even I have the similarity score, I don't know the threshold for determining the possibility of two rows as the same customer. Is it possible to generate the possibility to say, how much percent of confidence we can say the two customers are actually the same one?

问候
Steven

awchisholm · April 2014

Hello Steven

Well if there is no common ID then there is obviously no way to use Join.

Actually a better operator would be Cross Distances which allows the selection of the top k nearest. The threshold completely depends on the data you have and I can't answer that.

Performance may not be that bad; you have to try it.

regards

Andrew

xiaobo_sxb · April 2014

Does anyone know how the cross distance operator work? Seems it works well for numbers but not for text. I use the operator for double bytes text and the result is totally incorrect. I created several records manually, some of them are quite close with only few words difference, some of them are quite different, but the distance it generated does not provide the result as it should be.

问候
Steven

awchisholm · April 2014

One way is to extract the words from each description to create a word vector where the value of each attribute is based on a measure like tf.idf. The text processing extension is your friend for this.

Andrew

MariusHelf · April 2014

Also the Join operator is still an option, but it will only match customers that have an exact match in name, phone number etc.: in RapidMiner you can join by comparing the desired fields, even if there is no common id.
An outer join is the join type you'll want to use in your case to spot customers that are part of only one of the tables.

Best regards,
Marius

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"Compare two customer databases"

Answers