"Applying Feature Selection on text input"
Hello. I am new to using RapidMiner so please excuse my ignorance.
I am trying to perform K-Means Clustering on a set of text files. I have downloaded and installed the plug-in needed to input text files. Now, I want to apply Feature Selection to it. However, when I try to, it seems that it needs an ExampleSet to be able to perform the Feature Selection function. Is there a way for me to apply Feature Selection on text input?
Here is how my xml looks like right now:
<操作符r name="Root" class="Process" expanded="yes">
<操作符r name="TextInput" class="TextInput" expanded="yes">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
<operator name="StopwordFilterFile" class="StopwordFilterFile">
<operator name="StopwordFilterFile (2)" class="StopwordFilterFile">
<operator name="KMeans" class="KMeans">
When I try to add the following:
<操作符r name="BackwardElimination" class="FeatureSelection" expanded="yes">
The following error occurs:
Error in: TextInput (TextInput) Error in experiment setup: com.rapidminer.operator.MissingIOObjectException: The operator needs some input of type com.rapidminer.example.ExampleSet which is not provided
Can anyone please suggest something to help me do this? Thank you very much. :-*
Answers
<操作符r name="Root" class="Process" expanded="yes">
<操作符r name="TextInput" class="TextInput" expanded="yes">
< / >列表
<操作符r name="StringTokenizer" class="StringTokenizer">
<操作符r name="StopwordFilterFile" class="StopwordFilterFile">
<操作符r name="ExampleSetGenerator" class="ExampleSetGenerator">
<操作符r name="BackwardElimination" class="FeatureSelection" breakpoints="after" expanded="yes">
<操作符r name="KMeans" class="KMeans">
but it returns this error:
Root[1] (Process)
+- TextInput[1] (TextInput)
| +- StringTokenizer[1] (StringTokenizer)
| +- StopwordFilterFile[1] (StopwordFilterFile)
| +- ExampleSetGenerator[1] (ExampleSetGenerator)
here ==> | +- BackwardElimination[1] (FeatureSelection)
+- KMeans[0] (KMeans)
I would really appreciate if anyone has any ideas why this error appears. Thanks a lot.
Well, the approach you are taking is a bit, umh, ... broken. Feature selection does not work this way. An example of a forward selection is in the samples folder under 05_features/10_ForwardSelection.xml. The important point is: you need to have your learner inside the forward selection; otherwise, it does not know how to optimize. In general, the feature selection takes an ExampleSet and must contain operators that are able to evaluate such an example set by producing a PerformanceVector.
As an aside, it may well turn out that backward elimination is a bad idea on text data anyway, since it has to start from the full, very large set of word attributes.
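As a rough sketch of that nesting (this is not the sample file itself; the class and parameter names below are only placeholders from memory of the 4.x operator set, so compare them with 10_ForwardSelection.xml before relying on them):

<operator name="FeatureSelection" class="FeatureSelection" expanded="yes">
    <!-- placeholder parameter; the exact key and values may differ in your version -->
    <parameter key="selection_direction" value="forward"/>
    <!-- the inner chain must turn the incoming ExampleSet into a PerformanceVector -->
    <operator name="XValidation" class="XValidation" expanded="yes">
        <!-- any learner that fits your data; NaiveBayes is just an example -->
        <operator name="NaiveBayes" class="NaiveBayes"/>
        <operator name="ApplierChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier"/>
            <operator name="Performance" class="Performance"/>
        </operator>
    </operator>
</operator>

The inner cross-validation is what produces the PerformanceVector the feature selection uses to compare attribute subsets; the operator following the feature selection (your KMeans, for example) can then work on the selected attributes.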
Best,
Simon
I am currently trying out this XML. However, it is running very slowly, and it cannot handle about 300 text files; it returns a Java heap space error. I have tried changing the rapidminerGUI script, but nothing changes. Do you have any idea how I can increase the maximum heap space?
Thank you very much. You are very helpful.
The topic of adjusting the maximum heap size has been discussed in this forum many times. Please use the search button to find one of those discussions and the solutions.
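For reference, the setting involved is the JVM's maximum heap size, which the rapidminerGUI start script passes to java via the -Xmx option. A purely hypothetical excerpt (the variable name, jar path, and main class below are assumptions and differ between versions, so check your own script):

# hypothetical excerpt of a rapidminerGUI start script -- adjust to what your version actually contains
MAX_JAVA_MEMORY=2048   # maximum heap size in MB; raise this value
java -Xmx${MAX_JAVA_MEMORY}m -cp "$RAPIDMINER_HOME/lib/rapidminer.jar" com.rapidminer.gui.RapidMinerGUI "$@"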
Greetings,
Sebastian
There is an interesting problem with this model. I ran it on my optimized workstation, which has 7 GB of memory dedicated exclusively to the JVM and customized JVM arguments.
I used the hardware and graphics examples, and a RuntimeException was caught: java.lang.OutOfMemoryError: GC overhead limit exceeded. Very strange for such small data sets.
With this workstation I have already run a bag-of-words (BOW) with 9,700 words and 8,500 lines.
Using the top command on Linux, I watched the process and noticed several Java PIDs while the model was running.
Marcello Sandi
We don't start any other Java process, so this is probably an artifact from somewhere else...
We are aware that the feature selection sometimes has problems on example sets with a very large number of attributes. Since such large attribute counts mostly occur in text mining, and feature selection on text mining data is of limited use, the problem has not been a top priority.
But with the next major release we will add a more memory-efficient variant.
Greetings,
Sebastian