"Creating SVDs in X-Validation operator very slow"
text_miner
Member Posts: 11 Contributor II
I am trying to set up a text mining process in RapidMiner that uses SVDs. I compared the time it takes to create SVDs using the entire dataset with the time it takes using only a training set (within the training subprocess of an X-Validation operator). Both processes are detailed below. Using the entire dataset, the whole process finishes within a minute or so. With the X-Validation operator, the time increases dramatically: after 45 minutes the SVDs still had not been created. Any ideas on why creating SVDs takes so much longer inside the X-Validation operator?
For both processes I am using the comp.graphics and comp.windows.x newsgroups mini-datasets available from http://archive.ics.uci.edu/ml/databases/20newsgroups/20newsgroups.html (mini_newsgroups.tar.gz).
Entire Dataset:
X-Validation:
<process expanded="true" height="521" width="614">
<parameter key="prune_above_absolute" value="200"/>
<process expanded="true" height="650" width="1092">
<process expanded="true" height="650" width="614">
<process expanded="true" height="650" width="547">
Note: I tried adding a Materialize Data operator before creating the SVDs, but it doesn't seem to speed up their creation.
Any help would be greatly appreciated. Thanks!
Answers
I would guess the problem arises because there are fewer examples. This might produce a more poorly conditioned matrix, so that the SVD algorithm either hangs or needs longer to compute the results. Did you try changing the random seed? A different distribution of the examples across the folds might solve the problem.
Greetings,
Sebastian
Thanks for the reply. After trying different seed values I was still getting the same problem. So I investigated a little further and found the solution.
The issue was due to missing values being introduced into the dataset when TFIDF values were calculated for the term-by-document matrix. Since only a subset of the data was used for training in each fold, certain attributes (i.e., terms) had zero occurrences across all examples. For those attributes, the TFIDF operator set a missing value ("?") for every example.
The solution was to use the Replace Missing Values operator after the TFIDF operator to replace all missing values with zero. After replacing the missing values, the SVD operator worked without a problem.
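To illustrate how this can happen, here is a minimal NumPy sketch, not RapidMiner's implementation. It assumes a plain TF-IDF of the form count * log(n_docs / doc_freq), and the function name `naive_tfidf` and the toy counts are made up for the example. A term that never occurs in the fold gets an infinite IDF, and 0 * inf yields NaN, which plays the role of the "?" values here.

```python
import numpy as np

def naive_tfidf(counts):
    """Unguarded TF-IDF (hypothetical helper): count * log(n_docs / doc_freq)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)      # documents containing each term
    with np.errstate(divide="ignore", invalid="ignore"):
        idf = np.log(n_docs / doc_freq)      # inf where doc_freq == 0
        return counts * idf                  # 0 * inf -> NaN

# Rows are documents in one training fold, columns are terms.
# The third term never occurs in this fold.
fold_counts = [[2, 1, 0],
               [0, 3, 0],
               [1, 0, 0]]

weights = naive_tfidf(fold_counts)
missing_columns = np.isnan(weights).all(axis=0)   # columns made up entirely of NaN

# Rough analogue of Replace Missing Values with zero as the replacement.
cleaned = np.where(np.isnan(weights), 0.0, weights)
singular_values = np.linalg.svd(cleaned, compute_uv=False)
```

After the replacement the matrix contains only finite values, so the SVD can be computed as usual.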
Thanks again for the reply!
OK, then it seems to be a good idea to throw a warning that the SVD operator cannot cope with missing values. I will note that down.
Greetings,
Sebastian
I agree, a warning would be nice.
In addition, it may be worth changing the TFIDFFilter class to set zeros for columns without any counts. Although the missing values can currently be changed to zeros with the Replace Missing Values operator, this (1) requires another operator and (2) changes the order of attributes in the matrix. The first point is not a big deal, but I imagine the second may cause problems. For example, consider creating SVDs with a training set and then wanting to map (i.e., fold in) examples from the testing set into the pre-existing latent semantic space. This example assumes that TFIDF was applied to the training and testing sets separately and that the two sets have different attributes with zero counts. (In reality, the IDF values from the training set would probably be applied to the testing set.) To fold in these new "pseudo documents", the order of the attributes must be the same in both sets.
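To make the ordering concern concrete, here is a small NumPy sketch of folding in. It assumes documents are rows, uses the common fold-in formula d_hat = d * V_k * S_k^-1, and the term names, weights, and the `align` and `fold_in` helpers are invented for the illustration.

```python
import numpy as np

# Training matrix: rows are documents, columns follow train_terms.
train_terms = ["graphics", "window", "server", "pixel"]
train = np.array([[1.2, 0.0, 0.0, 0.8],
                  [0.9, 0.1, 0.0, 1.1],
                  [0.0, 1.5, 0.7, 0.0],
                  [0.1, 1.0, 1.3, 0.0]])

k = 2  # number of latent dimensions kept
U, s, Vt = np.linalg.svd(train, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

def align(weights_by_term, term_order):
    """Put a document's weights into the training attribute order (absent terms -> 0)."""
    return np.array([weights_by_term.get(term, 0.0) for term in term_order])

def fold_in(doc_vector):
    """Project a document into the existing latent space: d * V_k * S_k^-1."""
    return doc_vector @ V_k / s_k

# Same weights as the first training document, but supplied in another order.
new_doc = {"pixel": 0.8, "graphics": 1.2}
coords = fold_in(align(new_doc, train_terms))

# Feeding the values in the wrong column order lands somewhere else.
misordered = fold_in(np.array([0.8, 1.2, 0.0, 0.0]))
```

With the attributes aligned, `coords` matches the first row of `U_k`, which is what one would expect for a document identical to a training document. The misordered vector does not.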
Listed below is the TFIDFFilter class with two simple changes to set zeros for columns without any counts. The first change is on line 106 and just makes sure at least one document has a count for the current term before trying to calculate IDF. The second change adds an OR to lines 118-119; the value is set to zero if IDF is zero for the current term. Thanks!
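The idea behind the two changes can be sketched in NumPy as follows. This is not the TFIDFFilter source; the function name `guarded_tfidf` is hypothetical, the formula count * log(n_docs / doc_freq) is an assumption, and so is the other half of the OR condition (a zero count).

```python
import numpy as np

def guarded_tfidf(counts):
    """TF-IDF that writes zeros, not missing values, for terms without counts."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)

    # Change 1: only compute IDF when at least one document contains the term.
    idf = np.zeros(counts.shape[1])
    has_counts = doc_freq > 0
    idf[has_counts] = np.log(n_docs / doc_freq[has_counts])

    # Change 2: the weight is zero if the count is zero OR the IDF is zero.
    return np.where((counts == 0) | (idf == 0), 0.0, counts * idf)

# The first term occurs in every document, the third term in none.
weights = guarded_tfidf([[2, 1, 0],
                         [1, 3, 0],
                         [1, 0, 0]])
```

Because the zeros are written in place, no separate replacement step is needed and the attribute order stays untouched.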
I will add this and it will be included in the upcoming final version.
Anyway, usually we use the TFIDF filter of the Process Documents operator, where this error does not arise as far as I know.
Greetings,
Sebastian