How to use word2vec and LSTM to classify sequences of tokens
As a first step towards building a chatbot in RapidMiner (for educational purposes), I am trying to loop over a collection of documents containing tokenized text. For each document, which holds a sequence of tokens, I want to 1) translate each token into a word2vec embedding (learned over the complete collection of documents) and then 2) pass the resulting embedded tokens to an RNN (using a deep learning LSTM layer).
I cannot figure out how to design such a process. One problem I keep stumbling upon is that I cannot pass an object into a nested process. For example, once I have learned the word2vec embedding and want to loop over the collection of documents, I need to pass the word2vec model into the loop operator. Another challenge is that, inside that loop, I need to loop over the individual tokens of a single document, apply the word2vec model to translate each token into its embedding, and then pass the embedded tokens on to a deep learning process that contains the LSTM layer. I keep getting errors because inputs don't match what is expected. For example, a collection of documents is passed to a loop operator, which hands the individual documents to a document windowing operator inside the loop, but the document windowing operator reports that this is not the right input.
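To make the intended data flow concrete, here is a minimal sketch in plain Python rather than as a RapidMiner process. It assumes the trained word2vec model can be treated as a simple token-to-vector lookup; `toy_vectors`, `embed_document`, `embed_collection`, `dim` and `max_len` are hypothetical names chosen for illustration, and mapping unknown tokens to a zero vector and padding to a fixed length are assumptions, not something RapidMiner prescribes.

```python
from typing import Dict, List

Vector = List[float]


def embed_document(tokens: List[str], vectors: Dict[str, Vector],
                   dim: int, max_len: int) -> List[Vector]:
    """Translate each token into its embedding, then truncate or
    zero-pad so every document has exactly max_len time steps."""
    zero = [0.0] * dim
    # Inner loop: one embedding per token; unknown tokens become zeros (assumption).
    embedded = [list(vectors.get(tok, zero)) for tok in tokens[:max_len]]
    # Pad short documents so all sequences share the same length.
    embedded += [list(zero) for _ in range(max_len - len(embedded))]
    return embedded


def embed_collection(documents: List[List[str]], vectors: Dict[str, Vector],
                     dim: int, max_len: int) -> List[List[Vector]]:
    """Outer loop over documents; the lookup table is passed in explicitly."""
    return [embed_document(doc, vectors, dim, max_len) for doc in documents]


# Hypothetical toy table standing in for a trained word2vec model.
toy_vectors = {"romeo": [0.1, 0.9], "juliet": [0.8, 0.2], "love": [0.5, 0.5]}
docs = [["romeo", "love"], ["juliet", "love", "night", "romeo"]]

# Result is nested as [document][time step][embedding dimension].
batch = embed_collection(docs, toy_vectors, dim=2, max_len=3)
```

The nested result (documents, time steps, embedding dimensions) is the kind of input a recurrent layer usually expects, which is why the model has to be available inside both loops.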
Has anyone done anything like this before and could share their process? My first attempt is to connect the LSTM layer to a fully connected layer and then classify each input document according to the person who spoke its content. I am using Romeo and Juliet, from which I extracted all passages spoken by each of the two characters. The goal is to use an RNN to classify texts as spoken by either Romeo or Juliet. I want to compare the performance to a more traditional approach using TF-IDF vectorisation of the documents.
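For the TF-IDF baseline, the vectorisation step can be sketched in a few lines of plain Python. This is only one common variant (relative term frequency multiplied by `log(N / df)`), so it is an assumption that it matches what RapidMiner's text processing computes; the function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse {term: weight} vector per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample documents, already tokenized.
sample_docs = [["love", "night", "night"], ["love", "sun"]]
sample_weights = tfidf(sample_docs)
```

With this variant, a term that occurs in every document (here `love`) gets weight zero, so only speaker-specific vocabulary contributes to the baseline classifier's features.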
Looking forward to any bit of help
3 Answers