Word2Vec
Synopsis
This operator learns a Word2Vec model on a given corpus.
Description
The operator expects a tokenized collection of documents as an input. Each token represents one word for which you want to learn a vector. If you have a collection of documents (e.g. from Read Files) you can use Loop Collection and Tokenize to get the desired input.
Word2Vec is a popular algorithm based on: Efficient Estimation of Word Representations in Vector Space, Mikolov et al. (2013). Trained on a corpus, the algorithm generates one multidimensional vector for each word. These vectors are known to carry semantic meaning; a commonly used distance measure between them is cosine similarity. The returned RMWord2VecModel can be thought of as a dictionary or hash map containing the vectors. To access them you can use Apply Word2Vec or Extract Vocabulary.
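To illustrate how such vectors are compared, the following sketch computes the cosine similarity of two word vectors as plain double arrays. The vector values shown are made up; in practice you would obtain the real vectors from the model, e.g. via Extract Vocabulary.

```java
public class CosineSimilarityExample {

    // Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 4-dimensional vectors; real vectors have "Layer size" dimensions.
        double[] king  = {0.50, 0.68, -0.59, 0.10};
        double[] queen = {0.54, 0.60, -0.70, 0.12};
        System.out.println(cosineSimilarity(king, queen)); // close to 1 for similar words
    }
}
```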
For more details on the algorithm we recommend, for example, Lecture 2 of Stanford's NLP and Deep Learning course, available on YouTube: https://www.youtube.com/watch?v=ERibwqs9p38. The blog post by Chris McCormick is another nice resource: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
The implementation used is a Java port of the original implementation. The source code can be found at: https://github.com/allenai/Word2VecJava
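For readers who want to see how the operator's parameters relate to that library, here is a rough sketch of training a model directly with Word2VecJava. The builder method names follow the project's README and may differ between versions; the tiny in-memory corpus is made up and only stands in for the tokenized documents the operator receives.

```java
import java.util.Arrays;
import java.util.List;

import com.medallia.word2vec.Word2VecModel;
import com.medallia.word2vec.neuralnetwork.NeuralNetworkType;

public class TrainWord2VecExample {
    public static void main(String[] args) throws Exception {
        // A tokenized collection of documents: one list of tokens per document.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "king", "rules", "the", "kingdom"),
                Arrays.asList("the", "queen", "rules", "the", "kingdom"));

        Word2VecModel model = Word2VecModel.trainer()
                .setMinVocabFrequency(1)    // "Minimal vocab frequency"
                .setWindowSize(5)           // "Window size"
                .type(NeuralNetworkType.SKIP_GRAM)
                .setLayerSize(100)          // "Layer size"
                .useNegativeSamples(5)      // "Use negative samples"
                .setDownSamplingRate(1e-3)  // "Down sampling rate"
                .setNumIterations(5)        // "Iterations"
                .train(corpus);

        // Query the trained model: nearest neighbours of "king" by cosine similarity.
        System.out.println(model.forSearch().getMatches("king", 2));
    }
}
```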
Input
doc
A tokenized collection of documents.
Output
mod
The resulting Word2Vec model.
Parameters
Minimal vocab frequency
Minimal number of occurrences a word needs to have to be considered for model generation.
Layer size
Size of the generated word vectors. Typical values are between 50 and 500.
Window size
During model generation each text is split into context windows; this parameter specifies how many words each window contains. Typical values are 3-7.
Use negative samples
During Word2Vec optimization, the objective would normally take all negative (i.e. not observed) words into account. With negative sampling, only the given number of randomly drawn words is used instead. Usual values are 5-20 for small texts and 2-5 for large texts.
Iterations
Number of iterations during training.
Down sampling rate
This parameter can be used to down-sample frequent words. Smaller values mean frequent words are less likely to be kept. A common range is 1e-3 to 1e-5; a short sketch of the underlying formula follows this parameter list.
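To make the down sampling rate more concrete, the sketch below uses the sub-sampling formula from Mikolov et al. (2013): a word is kept with probability sqrt(t / f(w)), capped at 1, where f(w) is the word's relative frequency in the corpus and t is the down sampling rate. This illustrates the formula only; it is not the operator's internal code.

```java
public class SubSamplingExample {

    // Probability of keeping a word, following Mikolov et al. (2013):
    // P(keep) = sqrt(t / f(w)), capped at 1.
    static double keepProbability(double wordFrequency, double t) {
        return Math.min(1.0, Math.sqrt(t / wordFrequency));
    }

    public static void main(String[] args) {
        double t = 1e-3; // "Down sampling rate"
        // A very frequent word (e.g. ~5% of all tokens) is kept rarely,
        // while a rarer word (0.01% of all tokens) is always kept.
        System.out.println(keepProbability(0.05, t));   // ~0.14
        System.out.println(keepProbability(0.0001, t)); // 1.0
    }
}
```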