"Creating SVDs in X-Validation operator very slow"

text_miner · February 2010

I am trying to set up a text-mining process in RapidMiner that uses SVDs. I compared the time it takes to create SVDs from the entire dataset with the time it takes from only a training set (inside the training subprocess of an X-Validation operator). Both processes are detailed below. Using the entire dataset, the whole process finishes within a minute or so. With the X-Validation operator, the time increases dramatically: after 45 minutes the SVDs still had not been created. Any ideas why creating SVDs takes so much longer inside the X-Validation operator?

For both processes I am using the comp.graphics and comp.windows.x newsgroups mini-datasets available from http://archive.ics.uci.edu/ml/databases/20newsgroups/20newsgroups.html (mini_newsgroups.tar.gz).

Entire Dataset:

(Process XML not preserved; only these fragments remain.)
<process expanded="true" height="521" width="614">
<parameter key="prune_above_absolute" value="200"/>
<process expanded="true" height="650" width="1092">

X-Validation:

Note: I tried putting a Materialize Data operator in before creating the SVDs, but it doesn't seem to speed up the creation of the SVDs.

(Process XML not preserved; only these fragments remain.)
<process expanded="true" height="521" width="614">
<parameter key="prune_above_absolute" value="200"/>
<process expanded="true" height="650" width="1092">
<process expanded="true" height="650" width="614">
<process expanded="true" height="650" width="547">

Any help would be greatly appreciated. Thanks!

land · February 2010

Hi,
I would guess the problem arises because there are fewer examples. This might produce a worse-conditioned matrix, so that the SVD algorithm either hangs or needs much longer to compute the results. Did you try changing the random seed? A different distribution of the examples across the folds might solve the problem.

Greetings,
Sebastian

text_miner · February 2010

Sebastian,

Thanks for the reply. After trying different seed values I was still getting the same problem. So I investigated a little further and found the solution.

The issue was due to missing values being introduced into the dataset after calculating TFIDF values for the term-by-document matrix. Since only a subset of the data was used in training each fold, there were certain attributes (i.e., terms) that had zero occurrences for all examples. For those attributes, the TFIDF operator put missing values ("?") for all examples of that term.

The solution was to use the Replace Missing Values operator after the TFIDF operator to replace all missing values with zero. After replacing the missing values, the SVD operator worked without a problem.
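
A minimal sketch of how such missing values can arise and how replacing them with zero works. It assumes that an unguarded TF-IDF computation divides by a document frequency of zero and that missing values are represented as NaN; the thread does not show the original code path, and the class and method names here are hypothetical, not RapidMiner API.

```java
// Hypothetical sketch: names are illustrative and not part of RapidMiner.
final class TfIdfMissingValueSketch {

    // Naive TF-IDF for one cell, with no guard for a term that occurs in no document.
    static double naiveTfIdf(double tf, int documentCount, int documentFrequency) {
        // documentFrequency == 0 makes the ratio Infinity, so idf is Infinity.
        double idf = Math.log((double) documentCount / (double) documentFrequency);
        // tf is 0 for such a term, and 0 * Infinity is NaN.
        return tf * idf;
    }

    // Replaces every NaN in the matrix with zero, in place.
    static void replaceMissingWithZero(double[][] matrix) {
        for (double[] row : matrix) {
            for (int j = 0; j < row.length; j++) {
                if (Double.isNaN(row[j])) {
                    row[j] = 0.0d;
                }
            }
        }
    }

    public static void main(String[] args) {
        // A term that appears in none of the 100 training documents of a fold.
        double cell = naiveTfIdf(0.0d, 100, 0);
        System.out.println("is NaN: " + Double.isNaN(cell));

        double[][] matrix = { { 0.5d, cell }, { 0.25d, cell } };
        replaceMissingWithZero(matrix);
        System.out.println("after replacement: " + matrix[0][1] + ", " + matrix[1][1]);
    }
}
```

In the process itself, the Replace Missing Values operator plays the role of `replaceMissingWithZero`.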

Thanks again for the reply!

land · February 2010

Hi,
OK, then it seems to be a good idea for the SVD operator to throw a warning that it cannot cope with missing values. I will note that down.

Greetings,
Sebastian

text_miner · February 2010

Sebastian,

I agree, a warning would be nice.

In addition, another thing to consider is changing the TFIDFFilter class to set zeros for columns without any counts. Although the missing values can currently be changed to zeros with the Replace Missing Values operator, this (1) requires the use of another operator and (2) changes the order of attributes in the matrix. While the first point is not a big deal, I imagine the second point may cause problems. For example, consider creating SVDs with a training set and then wanting to map (i.e., fold in) examples from the testing set into the pre-existing latent semantic space. This example assumes that TFIDF was applied to the training and testing sets separately and that the sets have different attributes with zero counts (although in reality, the IDF values from the training set would probably be applied to the testing set). To fold in these new "pseudo documents", the order of the attributes should be the same between the two sets.
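
To illustrate why attribute order matters for folding in, here is a minimal sketch using the textbook LSI fold-in projection, d_hat = S^-1 * U^T * d. It is not RapidMiner's implementation; the class name, method name, and toy matrices are hypothetical.

```java
// Hypothetical sketch: textbook LSI fold-in, not RapidMiner's SVD operator.
final class FoldInSketch {

    // Projects a term vector into the k-dimensional latent space.
    // u holds one row per term and one column per latent dimension.
    static double[] foldIn(double[][] u, double[] singularValues, double[] termVector) {
        if (u.length != termVector.length) {
            throw new IllegalArgumentException("term vector must have one entry per row of u");
        }
        int k = singularValues.length;
        double[] projected = new double[k];
        for (int j = 0; j < k; j++) {
            double sum = 0.0d;
            for (int i = 0; i < termVector.length; i++) {
                sum += u[i][j] * termVector[i]; // (U^T * d)_j
            }
            projected[j] = sum / singularValues[j]; // scale by 1 / s_j
        }
        return projected;
    }

    public static void main(String[] args) {
        // Toy training result: 3 terms, 2 latent dimensions, orthonormal columns.
        double[][] u = { { 1.0d, 0.0d }, { 0.0d, 1.0d }, { 0.0d, 0.0d } };
        double[] s = { 2.0d, 1.0d };

        // Test document with terms in the same order as the training set.
        double[] sameOrder = foldIn(u, s, new double[] { 4.0d, 3.0d, 0.0d });
        // Same document, but with the first two attributes swapped.
        double[] swappedOrder = foldIn(u, s, new double[] { 3.0d, 4.0d, 0.0d });

        System.out.println(sameOrder[0] + ", " + sameOrder[1]);       // 2.0, 3.0
        System.out.println(swappedOrder[0] + ", " + swappedOrder[1]); // 1.5, 4.0
    }
}
```

The same document lands at a different point in the latent space once two attributes trade places, which is why a reordering between the training and testing sets would matter.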

Listed below is the TFIDFFilter class with two simple changes to set zeros for columns without any counts. The first change is on line 106 and just makes sure at least one document has a count for the current term before trying to calculate IDF. The second change adds an OR to lines 118-119; the value is set to zero if IDF is zero for the current term.


/*
 * RapidMiner
 *
 * Copyright (C) 2001-2009 by Rapid-I and the contributors
 *
 * Complete list of developers available at our web site:
 *
 * http://rapid-i.com
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with this program. If not, see http://www.gnu.org/licenses/.
 */
package com.rapidminer.operator.preprocessing.filter;

import java.util.LinkedList;
import java.util.List;

import com.rapidminer.example.Attribute;
import com.rapidminer.example.Example;
import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.UserError;
import com.rapidminer.operator.ports.metadata.AttributeMetaData;
import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;
import com.rapidminer.operator.ports.metadata.MetaData;
import com.rapidminer.operator.ports.metadata.SetRelation;
import com.rapidminer.operator.preprocessing.AbstractDataProcessing;
import com.rapidminer.parameter.ParameterType;
import com.rapidminer.parameter.ParameterTypeBoolean;
import com.rapidminer.parameter.UndefinedParameterError;


/**
 * This operator generates TF-IDF values from the input data. The input example
 * set must contain either simple counts, which will be normalized during
 * calculation of the term frequency TF, or it already contains the calculated
 * term frequency values (in this case no normalization will be done).
 *
 * @author Ingo Mierswa
 */
public class TFIDFFilter extends AbstractDataProcessing {

    /** The parameter name for "Indicates if term frequency values should be generated (must be done if input data is given as simple occurence counts)." */
    public static final String PARAMETER_CALCULATE_TERM_FREQUENCIES = "calculate_term_frequencies";

    public TFIDFFilter(OperatorDescription description) {
        super(description);
    }

    @Override
    protected MetaData modifyMetaData(ExampleSetMetaData metaData) throws UndefinedParameterError {
        for (AttributeMetaData amd: metaData.getAllAttributes()) {
            if (!amd.isSpecial() && amd.isNumerical()) {
                amd.getMean().setUnkown();
                amd.setValueSetRelation(SetRelation.UNKNOWN);
            }
        }
        return metaData;
    }

    @Override
    public ExampleSet apply(ExampleSet exampleSet) throws OperatorException {
        if (exampleSet.size() < 1)
            throw new UserError(this, 110, new Object[] { "1" });
        if (exampleSet.getAttributes().size() == 0)
            throw new UserError(this, 106, new Object[0]);

        // init
        double[] termFrequencySum = new double[exampleSet.size()];
        List<Attribute> attributes = new LinkedList<Attribute>();
        for (Attribute attribute: exampleSet.getAttributes()) {
            if (attribute.isNumerical())
                attributes.add(attribute);
        }
        int[] documentFrequencies = new int[attributes.size()];

        // calculate frequencies
        int exampleCounter = 0;
        for (Example example: exampleSet) {
            int i = 0;
            for (Attribute attribute : attributes) {
                double value = example.getValue(attribute);
                termFrequencySum[exampleCounter] += value;
                if (value > 0)
                    documentFrequencies[i]++;
                i++;
            }
            exampleCounter++;
            checkForStop();
        }

        // calculate IDF values
        double[] inverseDocumentFrequencies = new double[documentFrequencies.length];
        for (int i = 0; i < attributes.size(); i++) {
            if (documentFrequencies[i] > 0) { // change 1: skip terms that occur in no document
                inverseDocumentFrequencies[i] = Math.log((double) exampleSet.size() / (double) documentFrequencies[i]);
            }
        }

        // set values
        boolean calculateTermFrequencies = getParameterAsBoolean(PARAMETER_CALCULATE_TERM_FREQUENCIES);
        exampleCounter = 0;
        for (Example example: exampleSet) {
            int i = 0;
            for (Attribute attribute : attributes) {
                double value = example.getValue(attribute);
                if (termFrequencySum[exampleCounter] == 0.0d ||
                        inverseDocumentFrequencies[i] == 0.0d) { // change 2: zero when IDF is zero
                    example.setValue(attribute, 0.0d);
                } else {
                    double tf = value;
                    if (calculateTermFrequencies)
                        tf /= termFrequencySum[exampleCounter];
                    double idf = inverseDocumentFrequencies[i];
                    example.setValue(attribute, (tf * idf));
                }
                i++;
            }
            exampleCounter++;
            checkForStop();
        }
        return exampleSet;
    }

    @Override
    public List<ParameterType> getParameterTypes() {
        List<ParameterType> types = super.getParameterTypes();
        ParameterType type = new ParameterTypeBoolean(PARAMETER_CALCULATE_TERM_FREQUENCIES, "Indicates if term frequency values should be generated (must be done if input data is given as simple occurence counts).", true);
        type.setExpert(false);
        types.add(type);
        return types;
    }

}

Thanks!

land · February 2010

Hi,
I will add this and it will be included in the upcoming final version.

Anyway, usually we use the TFIDF filter of the Process Documents operator, where this error does not arise as far as I know.

Greetings,
Sebastian
