Synonym Detection with Word2Vec

MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,352, RM Data Scientist
edited May 2020 in Knowledge Base

Introducing the Word2Vec Extension to the RapidMiner Marketplace!

We recently published a new extension on our marketplace: an advanced algorithm for text mining called Word2Vec. The core operator is called Word2Vec and can be thought of as a learner. In the following I will briefly explain the basics of what Word2Vec does and afterwards how you can use it in your RapidMiner text mining processes.

What is Word2Vec?

One of the key problems of text mining is that distances between words are hard to define. One could also say: "It’s hard to do math with words in any way." For example, there are words like beautiful and gorgeous, which have similar meanings but are spelled very differently. How should an algorithm know that "beautiful" and "gorgeous" have the same meaning? Or do they have similar connotations but different meanings?

Word2Vec is a word vector algorithm which attempts to tackle this problem. As the name implies, this operator takes a word and turns it into a vector. So what is so special about Word2Vec? The cool part is that this new Word2Vec vector can be associated with the “meaning” of a word. For example:

1. Let's take a sentence from raw text: RapidMiner has a new extension called Word2Vec

2. Now let's 'window' our sentence and always leave out the word in the middle:

RapidMiner has ___ new extension

has a ___ extension called

new extension ___ Word2Vec

3. Word2Vec defines a probability P for the missing word, depending on the surrounding words. To do so, Word2Vec assigns a vector to every word. The whole trick of Word2Vec is that it optimises all vector entries to maximize the probability of the correct gap words and minimize it for all others. The result is a trained vector for every word.
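The windowing in step 2 can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `context_windows` and the choice of two context words on each side are assumptions made for this example, not settings taken from the extension.

```python
def context_windows(tokens, half_width=2):
    """Return (context_words, missing_word) pairs, one per token position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_width):i]      # words before the gap
        right = tokens[i + 1:i + 1 + half_width]     # words after the gap
        pairs.append((left + right, target))
    return pairs


sentence = "RapidMiner has a new extension called Word2Vec".split()
windows = context_windows(sentence)
for context, gap in windows:
    print(context, "->", gap)
```

Each pair is one training example: the model sees the context words and is asked to give a high probability to the word that was left out.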

Sample Process with Word2Vec

There are various ways to use Word2Vec as a useful addition to your data science processes. In this sample process we will create a custom stemming dictionary from TripAdvisor review data (available here). All depicted processes are attached to this post.

Our analysis is split into three parts. The first process reads in the data and transforms it into a collection of documents. Each document is already tokenized. The second process then trains a Word2Vec model on it, and the third process generates a stemming dictionary.

Step 1: Read and Tokenize

The data is provided in one flat file for each hotel with the following structure:

4
$302
http://www.tripadvisor.com/ShowUserReviews-g60878-d100504-r22932337-Hotel_Monaco_Seattle_a_Kimpton_Hotel-Seattle_Washington.html

selizabethm
Wonderful time- even with the snow! What a great experience! From the goldfish in the room (which my daughter loved) to the fact that the valet parking staff who put on my chains on for me it was fabulous. The staff was attentive and went above and beyond to make our stay enjoyable. Oh, and about the parking: the charge is about what you would pay at any garage or lot- and I bet they wouldn't help you out in the snow!
Dec 23, 2008
-1
-1
5
4
5
5
5
5
5
-1

We read all files in with a Loop Files + Read Document combination, and then loop over all documents to extract only the content with a Cut Document operator. In the Cut Document we quickly transform all tokens to lower case and tokenize our document. After flattening the collection to one straight collection of documents, we store it in our repository for later use.
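For readers who want to see the lower-casing and tokenizing step outside RapidMiner, here is a minimal Python sketch. It assumes that tokens are simply runs of letters; the exact settings used in the attached process are not shown in this post, so treat the regular expression as an illustrative choice.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


review = "Wonderful time- even with the snow! What a great experience!"
tokens = tokenize(review)
print(tokens)
```

Note that this simple rule splits contractions such as "wouldn't" into two tokens, which may or may not be what you want for your own data.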

Figure: Read In Process

Step 2: Train the Model

Training a Word2Vec model is straightforward: get the data, apply Word2Vec, and store the result. The layer size, which defines the length of one vector, is set to a moderate 100 and the window size is set to 7. The iterations parameter is set to a high 50, which should ensure convergence.

Figure: Training Process
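If you want to reproduce a comparable training run in Python, the gensim library offers a Word2Vec implementation. The sketch below maps the three settings from the text onto gensim's argument names; that mapping is an assumption, since this post does not say how the RapidMiner operator is implemented, and the two tools may define the window slightly differently.

```python
# Settings from the text, mapped onto gensim's argument names.
# The mapping is an assumption, not taken from the extension itself.
PARAMS = {
    "vector_size": 100,  # "layer size": length of each word vector
    "window": 7,         # maximum distance between gap word and context word
    "epochs": 50,        # "iterations": number of passes over the corpus
}


def train_word2vec(tokenized_docs):
    """Train a gensim Word2Vec model on a list of token lists."""
    # Imported here so the settings above can be inspected without gensim;
    # calling this function requires gensim 4.0 or newer.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=tokenized_docs, **PARAMS)
```

After training, `model.wv["beautiful"]` would return the 100-dimensional vector for a word that occurs often enough in the corpus.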

Step 3: Building the Stemming Dictionary

Building the final dictionary needs a tiny bit of postprocessing. The new operator Extract Vocabulary is able to extract vectors for all or parts of the used corpus. Using Cross Distances it is possible to get the distance between two word vectors, measured as cosine similarity.

In the postprocessing we first need to remove the pairs in which a word is matched with itself; these are created by the cross distance.

Afterwards there is a different type of duplicate. These are the ones where the first word in the first example equals the second word in the second example and vice versa.

Word1 Word2

Gorgeous Beautiful

Beautiful Gorgeous

Figure: The final process, with postprocessing that creates a stemming dictionary

Finally we apply a threshold on the similarity to produce a well-pruned list. This is controlled with a macro and can thus also be used from the outside. The only thing we need to make sure is that a word is not a synonym more than once. We can do this by removing some additional duplicates.
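The postprocessing described above can be written down compactly in Python. This is a sketch of the logic only, using made-up toy vectors; the function names and the threshold value are chosen for illustration and are not part of the attached process.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def synonym_pairs(vectors, threshold):
    """Return (word, synonym, similarity) rows, most similar first."""
    # combinations() yields every unordered pair exactly once, so neither
    # (word, word) rows nor mirrored (b, a) duplicates can appear.
    scored = [
        (a, b, cosine(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ]
    scored = [row for row in scored if row[2] >= threshold]
    scored.sort(key=lambda row: row[2], reverse=True)

    used_as_synonym = set()
    result = []
    for a, b, sim in scored:
        if b in used_as_synonym:
            continue  # a word may be listed as a synonym only once
        used_as_synonym.add(b)
        result.append((a, b, sim))
    return result


# Toy vectors, invented for this example.
toy = {
    "beautiful": [0.9, 0.1, 0.0],
    "gorgeous": [0.8, 0.2, 0.0],
    "parking": [0.0, 0.1, 0.9],
}
pairs = synonym_pairs(toy, threshold=0.9)
print(pairs)
```

Raising the threshold gives a shorter, cleaner list; lowering it finds more candidates but also more noise.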

Let's have a look at the results!

Figure: Found synonyms

If you check the results, you can see some obvious similarities like wall and walls, and some more clever synonyms like people and guests, or anywhere and somewhere.

It gets interesting when words with opposite meanings are considered synonyms (best-worst, warm-cool, etc.). This is due to the way Word2Vec works: these words can be put into the same gaps and are hence considered similar to each other. Depending on your task this can be useful (e.g. topic recognition) or detrimental (e.g. sentiment analysis). For the latter you need to manually walk through the result list and prune further.

As a last step we can use an Aggregate operator in combination with a Generate Attributes operator to generate regular expressions. For example:

amazing:awesome

american:european

amsterdam:berlin

and:very|with

another:later

anywhere:somewhere

appointed:maintained

area:areas

arrived:checked|arrival

asked:requested|ask

The format can be used on any document you have. The operator for this is called “Stem Tokens using Example Set” and is part of Operator Toolbox extension.
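To make the rule format concrete, here is a small Python sketch that builds such lines from word pairs and applies them to a token list. It assumes that the part after the colon is a regular expression and that a matching token is replaced by the word before the colon; how the operator itself interprets the rules is not spelled out in this post.

```python
import re
from collections import defaultdict


def build_rules(pairs):
    """Group (stem, synonym) pairs into 'stem:syn1|syn2' lines."""
    grouped = defaultdict(list)
    for stem, synonym in pairs:
        grouped[stem].append(synonym)
    return [f"{stem}:{'|'.join(syns)}" for stem, syns in sorted(grouped.items())]


def apply_rules(tokens, rules):
    """Replace each token that fully matches a rule's pattern with its stem."""
    compiled = []
    for rule in rules:
        stem, pattern = rule.split(":", 1)
        compiled.append((stem, re.compile(pattern)))
    out = []
    for token in tokens:
        for stem, regex in compiled:
            if regex.fullmatch(token):
                token = stem
                break
        out.append(token)
    return out


pairs = [("arrived", "checked"), ("arrived", "arrival"), ("area", "areas")]
rules = build_rules(pairs)
stemmed = apply_rules(["we", "checked", "the", "areas"], rules)
print(rules)
print(stemmed)
```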

Where can I learn more?

- Head of Data Science Services at RapidMiner -
Dortmund, Germany

Comments

  • websiteguy, Member, Posts: 24, Maven

    Thanks very much @mschmitz for this fantastic process. Just experimenting now.

    If I want to analyse a group of documents and find not only the single words that have a vector relationship, but also bigram, trigram phrases is that possible? Or does it melt your computer...

    Can this be combined with any other text processing or modified to produce topical buckets of terms?

    I was wondering if it is possible to split the input documents by punctuation.

    I am inputting webpages that have headings etc., at present I am stripping out stop words, short strings 4 letters.

    Therefore, I end up with just long strings.

    However, was thinking if I split each document by sentence or paragraph/ list content? I could then create many separate documents (from one html page) that could be classified or grouped by similarity.

    Using document to similarity to process those buckets of sentences.

    I would get words output in the dictionary of Word2vec that are not just related to each other, but related to the concept (as defined by the documents to similarity groupings of sentences or lists extracted from the html document.)

    I am probably not thinking correctly about it.

    My goal is to end up with buckets of words that could then be used in the construction of paragraphs within a new written document and that are known to be related by vector space. Not only to each other, but also to other words within the topical buckets.

    (The buckets being defined by the pre-processing using documents to similarity) rather than just individual words related to each other.

    I used the ITF/TO before and that works ok to find bigrams and trigram strings to get them on the page.

    However, the problem is the same you end up with the phrases on the page, but not necessarily near to each other.

    It works, as far as creating a statistically similar page (google) is concerned, but it's very time-consuming with lots of manual pruning.

    Then you have to post process your document for synonyms to ensure you have not overegged it.

    I would like to create some sort of process that stitches several processes together (ITF/TO, Word2Vec, Document clustering, LSI) to produce some sort of master grouping of words.

    That way it would just be a matter of taking that grouping of n words and forming a meaningful paragraph out of it.

    Knowing in advance that it has ticked all the boxes.

    I purchased the book, not picked it up yet :)

    I was also looking at this: lda2vec

    https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=5&lambda=1&term=

    Is this possible in RapidMiner?

    regards lee

  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,352, RM Data Scientist

    Hi @websiteguy,

    first of all: thanks for the kind words and for using the operator. It is always cool to see people use the tools you write.

    Let's go through your questions one by one:

    If I want to analyse a group of documents and find not only the single words that have a vector relationship, but also bigram, trigram phrases is that possible? Or does it melt your computer...

    Word2Vec by itself does not support bigrams. But maybe you can find frequent bigrams using Process Documents and then use Replace Tokens to replace e.g. "not good" with "not_good", which is then considered as one word in Word2Vec.
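The same trick can be sketched in Python: count adjacent token pairs, then glue the frequent ones together with an underscore before training. The function name and the `min_count` cut-off are assumptions made for this illustration, and the matching is a simple greedy left-to-right pass.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times with '_'."""
    counts = Counter(
        (doc[i], doc[i + 1]) for doc in docs for i in range(len(doc) - 1)
    )
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [
    ["the", "room", "was", "not", "good"],
    ["food", "not", "good"],
    ["the", "pool", "was", "great"],
]
merged = merge_frequent_bigrams(docs)
print(merged)
```

Running the function a second time on its own output would produce trigram tokens such as "was_not_good", provided they are frequent enough.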

    I was wondering if it is possible to split the input documents by punctuation.

    Sure, Cut Document should do the trick.

    I am inputting webpages that have headings etc., at present I am stripping out stop words, short strings 4 letters. Therefore, I end up with just long strings. However, was thinking if I split each document by sentence or paragraph/ list content? I could then create many separate documents (from one html page) that could be classified or grouped by similarity.
    Using document to similarity to process those buckets of sentences.

    You can treat whole sentences as words in the operator. This also includes things like tags or parts of code. The only thing I would be worried about is that you need a large enough sample.

    I would get words output in the dictionary of Word2vec that are not just related to each other, but related to the concept (as defined by the documents to similarity groupings of sentences or lists extracted from the html document.) I am probably not thinking correctly about it.

    Not sure what you mean here.

    I would like to create some sort of process that stitches several processes together ITF/TO Word2Vec, Document clustering, LSI to produce some sort of master grouping of words.

    I would consider clustering the vectors with some cosine similarity measure.
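One very simple way to group word vectors by cosine similarity is sketched below in Python. It uses a greedy "leader" clustering, chosen here only because it is short; it is not necessarily the method meant in the reply above, and the toy vectors and the threshold are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cluster_by_cosine(vectors, threshold=0.8):
    """Put each word into the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of words; its first word is the leader
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])  # no similar leader found: start a new cluster
    return clusters


# Toy two-dimensional vectors, invented for this example.
toy = {
    "pool": [0.9, 0.1],
    "spa": [0.8, 0.3],
    "valet": [0.1, 0.9],
    "parking": [0.2, 0.8],
}
clusters = cluster_by_cosine(toy)
print(clusters)
```

The result depends on the order of the words and on the threshold, so for real data a standard clustering algorithm with a cosine-based distance is likely the more robust choice.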

    lda2vec

    Never saw this before, but thanks for the link! This is not yet supported, but we may investigate it. The LDAvis package for Python seems to be a good resource for the recent LDA operator I published in the Operator Toolbox.


    Cheers,
    Martin

  • websiteguy, Member, Posts: 24, Maven

    Hi @mschmitz

    Thanks for the quick reply,

    So I strip out stop words, turn the text into bi-grams or tri-grams, create a document, then collect and save?

    Then I process these strings of two or three words with the connecting _ and each of them would be a string used in the vector, is that right?

    -----------

    I am trying to create a new document

    That has a statistical similarity to the original set of documents, by including these word2vec results in its creation.

    (I have found the ITF/TO works but it does not allow for distance, so you have to slavishly ensure the inclusion of bigrams/tri-grams that occur in the original documents to ensure similarity. Even then, you have to return to your document at a later date and shift the usage of the strings about to get the nearness to other bigram strings.)

    ...

    Image: word-2-vec.png

    If clustering were done on our newly created document, the original set of documents and a random set of other docs, the new doc would fall into the same cluster as the original set as it "is like" the original set.

    ---------------------------

    At present, the vector interpretation of documents produces words from a set of documents that co-occur (by a distance of K words/synonyms from each other) therefore these words have a relationship. Is that correct?

    So when processing documents, we get a list of words and co-occurring examples of words, which acts as a representation of commonalities of word usage as defined by the 'K' distance (stemrule).

    Word2vec helps us to know we should include "acne|naturally|grab" in a sentence in our new document.

    "For sufferers of acne, I would always treat it naturally, which is why I recommend you grab a copy of my new book"

    However, not how near this sentence should be to another sentence that includes another stemrule?

    So if I used another stemrule in a sentence:

    "It’s absolutely vital that when keeping an injury or trauma protected we act quickly to ensure the bone does not shift"

    These two new sentences could be in the same paragraph or distant from each other in the new document.

    Is there any way to know this "nearness of stemrules"? So that the vector stemrules are used in a way that takes into account each stemrule's distance from other stemrules?

    Therefore, we get "grouped stem rules" and therefore our new document we produce is more "like the originals"

    Or, is this essentially what lda2vec is doing?

    ----------------------

    "I would consider clustering the vectors with some cosine similarity measure"

    Any chance you could show me how to do this, or explain a little further?

    thanks for your help,

    regards lee

    -----------------

  • Thomas_Ott, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,761, Unicorn

    There seem to be a lot of questions floating around the Community on how to use Word2Vec with Twitter data. I made a fast and dirty process on how to do it here.

  • Campello, Member, Posts: 3, Learner I
    Okay, so I'm completely new to this. I keep getting an error when looping over my own dataset, which is an Excel file. It says the number of iterations can't be smaller than 1. I've tried to run "read excel" instead of "read document" but with no results. Also, when I cut the document I should fill in "string matching queries" and I can't figure out what that means. Can you maybe give me a hand?
  • kayman, Member, Posts: 662, Unicorn
    Could you share your process? To read Excel you definitely need to use the Read Excel operator; this loads the data as an example set (like a spreadsheet, using columns and rows).

    Now, if you want to do some proper text mining, this data needs to be converted to documents (so text format). There are quite a few operators with different options for the job, so it all depends on what you actually want or need to do, and how your Excel file is constructed.
  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,352, RM Data Scientist
    The error you describe indicates that the Loop Files operator does not find any files meeting your conditions. Can you make sure the directory is set up correctly, and maybe not apply any filter on the files?
    Best,
    Martin
  • Campello, Member, Posts: 3, Learner I
    Hey guys, thanks for the quick reply. So, it seems like after I've put "data to docs" inside the loop thing it finally works. Although I'm not sure if it was the perfect operator, since as @kayman suggests that may depend on my data and what I need to do (what I need to do is, well, find the meanings of certain words, such as "people" and "nation" in these speeches, looking for an 'lsa' kind of thing here). I keep getting errors when cutting the document though. I'm attaching a few images I think may help you to understand my issues. One shows my dataset (a series of parliamentary speeches, over 900 rows). The others, my processes. Btw, when cutting the document I've set the queries to "," and ",", coz I didn't know what to do and I figured "," was as good a guess as any other lol, just to see if it worked (it didn't, but not for that reason haha). Thank you so much for helping a newbie, I sure appreciate it :)


  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,352, RM Data Scientist
    Hi,
    I'll post something on the tech issue, but it looks like you want to build something like this: https://www.zeit.de/politik/deutschland/2019-09/bundestag-jubilaeum-70-jahre-parlament-reden-woerter-sprache-wandel#s=pay gap? It's basically a data journalism piece on all speeches held in the German Bundestag. I know it's German and your text is Portuguese, but maybe this is still a nice reference for you.

    Best,
    Martin
  • Campello, Member, Posts: 3, Learner I
    Hi @mschmitz! This looks very beautiful! Not exactly what I'm going for, since I'm analyzing only President Bolsonaro's speeches (as a former deputy), not the whole of the speeches at the Assembly, but you got the point and I'll save that website, it gave me important insights. The goal is to compare the results I get from Bolsonaro's conceptions of nation and people with Marine Le Pen's, processing her speeches in the same manner. I should be able to detect if they have similar or divergent ideas about those themes. I'm a political science researcher and research on the topic of contemporary right wing populism :)

  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,352, RM Data Scientist
    edited January 2021
    Hi,
    that's why I thought this would be interesting to you. One of the examples they are exploring is the change from the word Ausländer (foreigner) to the word Migrant (migrant) and how often each was used over time. You can see how the frequency is high in the early 90s, when there were racist riots in Germany, but also in 2015, when a stream of refugees came to Germany. So if you speak German this is a good source of inspiration for you.
    The source (Zeit) is one of the best-known and most trusted newspapers in Germany, comparable to the NYT or the Washington Post.

    On your process: Are you able to share the data with me? That would allow me to quickly set it up for you. You can send it to my email: mschmitz at www.turtlecreekpls.com

    Best,
    Martin

