Document similarity of 2 excel spreadsheets containing text

ekotas Member Posts: 8 Contributor I
edited December 2018 in Help

Hello,

I've posted before about text mining, but I think my question was too vague and didn't explain very well what I wanted to do. So I've gone away, watched some (a lot!) tutorials and tried again.

So what I've done is read in 2 Excel spreadsheets: 1 containing relevant text, keywords etc., and 1 containing 504 references exported from medical databases. Both spreadsheets contain title and abstract, and in the second one each of the 504 references is on its own row. The aim is to compare the 2 spreadsheets to find the references most relevant to the text in the 1st Excel spreadsheet.

OK, so I've played around with this a lot and got a few things to work (Euclidean distance, cosine similarity etc.), but it's not quite doing what I wanted it to... I want it to re-order the 504 references according to how similar they are to the relevant text in the first Excel spreadsheet. Ideally the most relevant references would be first in the list and the least relevant at the bottom... if that makes sense.

Also, just to clarify, I am no data scientist, so I don't actually know what the results mean when I run cosine similarity, Euclidean distance and so on. All I know is I got it to work without any errors, which at the moment is a pretty good achievement for me.

Anyway, I've gone off topic. Can anyone help with what I'm aiming to do with the ranking of the documents?? Also, I don't know if you need to see what I've done so far?

Thank you so much

Answers

  • Pavithra_Rao Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 123 RM Data Scientist

    Hi @ekotas,

    Here is a good community post about the basics of text mining. It also refers to a sample process within RapidMiner Studio >> Community samples that shows how to see the similarity to each row. You could easily translate this into your use case.

    https://community.www.turtlecreekpls.com/t5/RapidMiner-Text-Analytics-Web/Term-Frequencies-and-TF-IDF-How-are-these-calculated/ta-p/46333
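
The ranking idea itself can be sketched outside RapidMiner. The following is only a minimal illustration in plain Python, not the sample process from the linked post: it builds TF-IDF vectors, computes the cosine similarity of every reference to the relevant text, and sorts the references from most to least similar. The function names, the smoothed IDF formula and the three example titles are all assumptions made up for the sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (term count times a smoothed IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant; other formulas are equally common).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_references(query_text, references):
    """Return (reference index, score) pairs, most similar to query_text first."""
    vectors = tfidf_vectors([query_text] + references)
    query_vec, ref_vecs = vectors[0], vectors[1:]
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(ref_vecs)]
    # Sort by descending score; ties keep the original row order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical stand-ins for the relevant text and the exported references.
query = "statin therapy for lowering cholesterol in adults"
references = [
    "Knee replacement surgery outcomes",
    "Statin therapy and cholesterol reduction in adults",
    "Cholesterol screening guidelines",
]
ranking = rank_references(query, references)
for index, score in ranking:
    print(f"{score:.3f}  {references[index]}")
```

With real data, `query` would be the title and abstract text from the first spreadsheet and `references` the 504 rows of the second, for example with title and abstract joined into one string per row.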

    Hope this helps.

    Cheers

  • ekotas Member Posts: 8 Contributor I

    Lovely stuff, looks very relevant. I'll give it a go, thank you!

  • Knut-RM Administrator, Employee, Member, University Professor Posts: 110 Administrator

    Can you tell us which tutorials you watched and which were helpful? Where did you find them, and what are you missing? Background: we are always working on new material, so we are interested in what people are having issues with...

    Cheers, K.

  • ekotas Member Posts: 8 Contributor I

    Hi,

    I watched a series of 5 tutorials on YouTube, starting with this one: https://www.youtube.com/watch?v=hpvda_Rfg3s. They were really helpful. I also got some information from this forum and from other tutorials on YouTube, but I didn't find them as helpful as the series of 5. What I was missing was really the thing I wanted to do, which was to rank them. Also, because I don't know much about cosine similarity and the like, I could have used some help with what the numbers in the output meant.
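
On what the numbers mean, a small worked example may help. This is a sketch with made-up word-count vectors (the vocabulary and counts are assumptions, not taken from the spreadsheets). For word counts, which are never negative, cosine similarity lies between 0 and 1 and a higher value means more similar; Euclidean distance starts at 0 for identical vectors and a higher value means less similar.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical word counts over the same three-word vocabulary.
doc = [2, 1, 0]
same_topic_longer = [4, 2, 0]  # same word proportions, but twice as long
different_topic = [0, 0, 3]    # shares no words with doc

cos_same = cosine_similarity(doc, same_topic_longer)
cos_diff = cosine_similarity(doc, different_topic)
dist_same = euclidean_distance(doc, same_topic_longer)
dist_diff = euclidean_distance(doc, different_topic)

print(f"cosine:    same topic {cos_same:.3f}, different topic {cos_diff:.3f}")
print(f"euclidean: same topic {dist_same:.3f}, different topic {dist_diff:.3f}")
```

Here the longer document on the same topic gets a cosine similarity of 1 because only the proportions of words matter, while its Euclidean distance is still above 0 because the raw counts differ. That sensitivity to document length is one reason cosine similarity is often preferred for ranking texts.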

  • Knut-RM Administrator, Employee, Member, University Professor Posts: 110 Administrator

    Great - thanks for the feedback!

  • ekotas Member Posts: 8 Contributor I

    Hello,

    I'm back with some more problems... I'm trying to classify my Excel spreadsheet to see if this shows anything interesting in the data... it might not, like.

    Anyway, it's all set up, but I'm getting an error: "Attributes do not match. The input ExampleSet does not match the training ExampleSet. Missing attribute: 'Abstract = ?' "

    I have tried to work it out via the forums etc., but I can't. Can anyone help, please? :)

    Thanks!

  • Pavithra_Rao Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 123 RM Data Scientist

    Hi @ekotas

    Please share the RapidMiner process XML so that we can see in detail what the error is.

    Cheers,

  • ekotas Member Posts: 8 Contributor I

    Hello,

    I've attached the XML... hope I've done it right!

    Thanks

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi @ekotas,

    The topic seems interesting. The relevance surely also depends on factors other than word vectors (Author, Impact, Times Referred, etc.), but the text analysis is a good start. I would like to know more about the Excel files, at least which columns they have and what kind of data those contain.

    Best,

    Regards
