Word2Vec

Synopsis

This operator learns a Word2Vec model on a given corpus.

Description

The operator expects a tokenized collection of documents as an input. Each token represents one word for which you want to learn a vector. If you have a collection of documents (e.g. from Read Files) you can use Loop Collection and Tokenize to get the desired input.

Word2Vec is a popular algorithm based on: Efficient Estimation of Word Representations in Vector Space, Mikolov et al. (2013). Trained on a corpus, the algorithm generates one multidimensional vector for each word. These vectors are known to carry semantic meaning; a commonly used distance measure between them is cosine similarity. The returned RMWord2VecModel can be thought of as a dictionary or hash map containing the vectors. To access them you can use Apply Word2Vec or Extract Vocabulary.
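The following sketch illustrates the idea of such a word-to-vector lookup and the cosine similarity measure. It uses a plain Java Map with toy three-dimensional vectors; this is not the actual RMWord2VecModel API, which is only accessible through the Apply Word2Vec and Extract Vocabulary operators.

import java.util.HashMap;
import java.util.Map;

public class CosineSimilarityExample {

    // Cosine similarity between two vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional vectors; a trained model typically uses 50-500 dimensions.
        Map<String, float[]> vectors = new HashMap<>();
        vectors.put("king",  new float[]{0.8f, 0.1f, 0.3f});
        vectors.put("queen", new float[]{0.7f, 0.2f, 0.35f});
        vectors.put("car",   new float[]{0.1f, 0.9f, 0.0f});

        System.out.println(cosine(vectors.get("king"), vectors.get("queen"))); // high similarity
        System.out.println(cosine(vectors.get("king"), vectors.get("car")));   // lower similarity
    }
}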

For more details on the algorithm we recommend, for example, Lecture 2 of Stanford's NLP and Deep Learning course available on YouTube: https://www.youtube.com/watch?v=ERibwqs9p38. The blog post by Chris McCormick is also a nice resource: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

The implementation used is a Java port of the original implementation. The source code can be found at: https://github.com/allenai/Word2VecJava

Input

doc

A tokenized collection of documents

Output

mod

The resulting Word2Vec model.

Parameters

Minimal vocab frequency

Minimal number of occurrences a word needs to have to be considered for model generation.

Layer size

Size of the generated vectors. Typical values are between 50 and 500.

Window size

During model generation each text is split into windows; this parameter specifies how large the window is. Typical values are 3-7. See the sketch below.
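As an illustration of the windowing, the following sketch shows how a window size of 2 turns one tokenized sentence into (center word, context word) pairs. The class name and token list are made up for the example; the operator performs this step internally.

import java.util.Arrays;
import java.util.List;

public class WindowExample {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox", "jumps");
        int window = 2; // the "Window size" parameter

        for (int center = 0; center < tokens.size(); center++) {
            int from = Math.max(0, center - window);
            int to = Math.min(tokens.size() - 1, center + window);
            for (int ctx = from; ctx <= to; ctx++) {
                if (ctx != center) {
                    // Each pair is one training example: predict the context word from the center word.
                    System.out.println(tokens.get(center) + " -> " + tokens.get(ctx));
                }
            }
        }
    }
}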

Use negative samples

Training Word2Vec with the full softmax requires updating the weights for every negative (not observed) word in the vocabulary at each step. With negative sampling, only x randomly drawn negative words are taken into account. Usual values are 5-20 for small texts and 2-5 for large texts.
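The following sketch illustrates the idea of negative sampling with a uniform random draw over a toy vocabulary. Real implementations typically draw negatives from a smoothed unigram distribution (word frequency raised to the power 0.75), which is omitted here for brevity.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class NegativeSamplingExample {
    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog");
        String positiveContext = "fox"; // the observed context word
        int numNegatives = 3;           // 5-20 for small corpora, 2-5 for large ones
        Random random = new Random();

        // Instead of updating weights for every word in the vocabulary, only the
        // positive word plus a few randomly drawn negatives are updated per step.
        List<String> negatives = new ArrayList<>();
        while (negatives.size() < numNegatives) {
            String candidate = vocabulary.get(random.nextInt(vocabulary.size()));
            if (!candidate.equals(positiveContext) && !negatives.contains(candidate)) {
                negatives.add(candidate);
            }
        }
        System.out.println("positive: " + positiveContext + ", negatives: " + negatives);
    }
}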

Iterations

Number of iterations during training.

Down sampling rate

This parameter can be used to downsample frequent words. Smaller values mean frequent words are less likely to be kept. A common range is 1e-3 to 1e-5.
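As a rough illustration, the following sketch computes a keep probability from the down sampling rate t and a word's relative corpus frequency f, using the subsampling formula P(keep) = sqrt(t / f) described by Mikolov et al. The exact formula used by a particular implementation may differ slightly.

public class SubsamplingExample {

    // Probability of keeping a word occurrence, capped at 1.0 for rare words.
    static double keepProbability(double frequency, double t) {
        return Math.min(Math.sqrt(t / frequency), 1.0);
    }

    public static void main(String[] args) {
        double t = 1e-3; // the "Down sampling rate" parameter
        System.out.println(keepProbability(0.05, t));   // very frequent word, often dropped (~0.14)
        System.out.println(keepProbability(0.0005, t)); // rare word, always kept (1.0)
    }
}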