Generate TFIDF

Synopsis

This operator calculates TF-IDF values for the given ExampleSet. TF-IDF is a numerical statistic which reflects how important a word is to a document.

Description

The Generate TFIDF operator generates TF-IDF values from the given ExampleSet. The ExampleSet must contain either the binary occurrences (which will be normalized during calculation of the term frequency TF) or the already calculated term frequency values (in which case no normalization is done). This behavior can be selected using the calculate term frequencies parameter.

TF-IDF (term frequency–inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
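The weighting described above can be illustrated with a minimal Python sketch. It assumes one common textbook formulation, tf × log(N / df), with tf taken as the term count divided by the document's total count. The exact formula used by the operator is not stated here, so the function name `tf_idf` and the choice of formula are illustrative assumptions only.

```python
import math

def tf_idf(counts):
    """Compute TF-IDF weights for a small corpus.

    counts: list of dicts mapping term -> raw count, one dict per document.
    Assumption: weight = (count / total count in document) * log(N / df).
    """
    n_docs = len(counts)

    # Document frequency: number of documents in which each term occurs.
    df = {}
    for doc in counts:
        for term, c in doc.items():
            if c > 0:
                df[term] = df.get(term, 0) + 1

    result = []
    for doc in counts:
        total = sum(doc.values())
        weights = {}
        for term, c in doc.items():
            tf = c / total if total else 0.0
            # A term that never occurs gets weight 0; otherwise df[term] >= 1.
            idf = math.log(n_docs / df[term]) if c > 0 else 0.0
            weights[term] = tf * idf
        result.append(weights)
    return result


docs = [{"the": 2, "cat": 1}, {"the": 1, "dog": 1}]
weights = tf_idf(docs)
```

In this sketch the word "the" occurs in every document, so its inverse document frequency is log(1) = 0 and its weight vanishes, while "cat" and "dog" each occur in only one document and keep a positive weight.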

Input

example set input

This input port expects an ExampleSet. It is the output of the Read CSV operator in the attached Example Process.

Output

example set output

The TF-IDF is calculated and the resultant ExampleSet is returned through this port.

original

The ExampleSet that was given as input is passed through this port to the output unchanged. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

calculate term frequencies

This parameter indicates whether term frequency values should be generated. It must be set to true if the input data is given as simple occurrence counts.
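The effect of this parameter can be sketched in Python as follows. The function name `term_frequencies` and the normalization shown (dividing each count by the document's total count) are assumptions made for illustration; the normalization actually applied by the operator is not specified here.

```python
def term_frequencies(row, calculate_term_frequencies=True):
    """Return the term frequency values for one example (one document).

    row: list of per-term values for the document.
    """
    if not calculate_term_frequencies:
        # The values are taken to be term frequencies already: no normalization.
        return list(row)
    total = sum(row)
    # Assumed normalization: each count divided by the document's total count.
    return [v / total if total else 0.0 for v in row]


from_counts = term_frequencies([2, 1, 1])
passed_through = term_frequencies([0.5, 0.25, 0.25], calculate_term_frequencies=False)
```

With the flag set to true, occurrence counts are converted to frequencies; with it set to false, the input values are used as they are.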