"Applying Feature Selection on text input"

jebadiah, Member, Posts: 4, Contributor I
edited May 2019 in Help
Hello. I am new to using RapidMiner so please excuse my ignorance.

I am trying to perform K-Means Clustering on a set of text files. I have downloaded and installed the plug-in needed to input text files. Now, I want to apply Feature Selection to it. However, when I try to, it seems that it needs an ExampleSet to be able to perform the Feature Selection function. Is there a way for me to apply Feature Selection on text input?

Here is what my XML looks like right now:

<operator name="Root" class="Process" expanded="yes">
<operator name="TextInput" class="TextInput" expanded="yes">


</list>

<operator name="StringTokenizer" class="StringTokenizer">

<operator name="StopwordFilterFile" class="StopwordFilterFile">


<operator name="StopwordFilterFile (2)" class="StopwordFilterFile">



<operator name="KMeans" class="KMeans">





When I try to add the following:

<operator name="BackwardElimination" class="FeatureSelection" expanded="yes">



the following error occurs:

Error in: TextInput (TextInput) Error in experiment setup: com.rapidminer.operator.MissingIOObjectException: The operator needs some input of type com.rapidminer.example.ExampleSet which is not provided


Can anyone please suggest something to help me do this? Thank you very much. :-*
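
For readers unfamiliar with the term in the error message: an ExampleSet is a table of examples (rows) and attributes (columns). For text, that usually means one row per document and one column per term. The sketch below is plain Python, not RapidMiner code, and only illustrates the general idea of turning tokenized, stopword-filtered text into such a table. The file names, texts, and stopword list are made up for the example.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the text files and the stopword file.
documents = {
    "doc1.txt": "The cat sat on the mat",
    "doc2.txt": "The dog chased the cat",
}
stopwords = {"the", "on"}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_table(docs, stop):
    """Return (sorted vocabulary, {document name: list of term counts})."""
    counts = {
        name: Counter(t for t in tokenize(text) if t not in stop)
        for name, text in docs.items()
    }
    # The vocabulary is the union of all terms seen in any document.
    vocabulary = sorted(set().union(*counts.values()))
    # One row per document, one column per vocabulary term.
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


vocabulary, rows = build_table(documents, stopwords)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

Operators that work on attributes, such as feature selection or clustering, need this kind of table as input rather than raw text.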

Answers

  • jebadiah, Member, Posts: 4, Contributor I
    Hi again. I was able to produce this XML file:

    <operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">


    </list>

    <operator name="StringTokenizer" class="StringTokenizer">

    <operator name="StopwordFilterFile" class="StopwordFilterFile">


    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">



    <operator name="BackwardElimination" class="FeatureSelection" breakpoints="after" expanded="yes">



    <operator name="KMeans" class="KMeans">





    but it returns this error:

    Root[1] (Process)
    +- TextInput[1] (TextInput)
    | +- StringTokenizer[1] (StringTokenizer)
    | +- StopwordFilterFile[1] (StopwordFilterFile)
    | +- ExampleSetGenerator[1] (ExampleSetGenerator)
    here ==> | +- BackwardElimination[1] (FeatureSelection)
    +- KMeans[0] (KMeans)


    I would really appreciate it if anyone has any idea why this error appears. Thanks a lot.
  • jebadiah, Member, Posts: 4, Contributor I
    No one? Please? I really need to do this. Thanks in advance.
  • fischer, Member, Posts: 439, Maven
    Hi,

    Well, the approach you are taking is a bit, umh, ... broken. Feature selection does not work this way. An example of a ForwardSelection is in the samples folder under 05_features/10_ForwardSelection.xml. The important point is: you need to have your learner inside the forward selection. Otherwise, it does not know how to optimize. In general, the FS takes an ExampleSet and must contain operators that are able to evaluate such an example set by producing a PerformanceVector.

    As an aside, it might turn out that it is a bad idea to try backward elimination on text data.

    Best,
    Simon
  • jebadiah, Member, Posts: 4, Contributor I
    Hello, thank you for your reply.

    I am currently trying out this XML:
    <operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">


    </list>

    <operator name="StringTokenizer" class="StringTokenizer">

    <operator name="StopwordFilterFile" class="StopwordFilterFile">



    <operator name="FS" class="FeatureSelection" expanded="yes">
    <operator name="XValidation" class="XValidation" expanded="yes">



    <operator name="NearestNeighbors" class="NearestNeighbors">


    <operator name="ApplierChain" class="OperatorChain" expanded="yes">
    <operator name="Applier" class="ModelApplier">

    </list>

    <operator name="Performance" class="Performance">




    <operator name="KMeans" class="KMeans">



    However, it is running very slowly, and it cannot handle about 300 text files: it returns a Java heap space error. I have tried changing the rapidminerGUI script, but nothing changes. Do you have any idea how I can change the maximum heap size?

    Thank you very much. You are very helpful.
  • land, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2,531, Unicorn
    Hi,
    the topic of adjusting the maximum heap size has been discussed in this forum a lot of times. Please use the search button to find one of those discussions and its solutions.

    Greetings,
    Sebastian
  • keith, Member, Posts: 157, Guru
    Or check the RM Wiki page on the topic: http://rapid-i.com/wiki/index.php?title=Memory_Issues
  • land, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2,531, Unicorn
    Good hint. It seems I'm not used to the Wiki yet :)
  • Marcello_Sandi, Member, Posts: 15, Maven
    Hi,

    There is an interesting problem with this model. I ran it on my optimized workstation, which has 7 GB of memory dedicated to the JVM and customized JVM arguments.

    I used the hardware and graphics examples, and a RuntimeException was caught: java.lang.OutOfMemoryError: GC overhead limit exceeded. Very strange for such small data sets.

    On this workstation I have already run a BOW with 9700 words and 8500 lines.


    Watching the process with the top command on Linux, I noticed several Java PIDs while the model was running.

    Marcello Sandi
  • land, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2,531, Unicorn
    Hi Marcello,
    we don't start any other Java process, so this is probably an artifact from somewhere else...

    We are aware that feature selection sometimes has problems on example sets with a really large number of attributes. Since such large numbers mostly occur in text mining, and feature selection is of limited use in text mining, the problem was not a top priority.
    But with the next major release we will add a more memory-efficient variant.

    Greetings,
    Sebastian
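
To illustrate the two points made in this thread (the selection needs something inside it that scores each candidate attribute subset, and the cost grows quickly with the number of attributes), here is a minimal sketch of a greedy wrapper-style forward selection in plain Python. It is not RapidMiner's implementation, which may search or stop differently. The `evaluate` callback stands in for the inner learner plus performance evaluation, and the attribute names and `toy_score` function are hypothetical.

```python
def forward_selection(attributes, evaluate):
    """Greedy wrapper selection.

    evaluate(subset) must return a score (higher is better).
    Returns (selected attributes, best score, number of evaluate calls).
    """
    selected, best_score, calls = [], float("-inf"), 0
    remaining = list(attributes)
    while remaining:
        round_best, round_attr = best_score, None
        # Try adding each remaining attribute to the current selection.
        for attr in remaining:
            calls += 1
            score = evaluate(selected + [attr])
            if score > round_best:
                round_best, round_attr = score, attr
        if round_attr is None:  # no candidate improved the score
            break
        selected.append(round_attr)
        remaining.remove(round_attr)
        best_score = round_best
    return selected, best_score, calls


# Toy evaluator (hypothetical): rewards two useful attributes, penalises size.
USEFUL = {"price", "memory"}


def toy_score(subset):
    return len(USEFUL & set(subset)) - 0.1 * len(subset)


selected, score, calls = forward_selection(
    ["price", "colour", "memory", "weight"], toy_score
)
print(selected, round(score, 2), calls)


def worst_case_evaluations(n):
    """Evaluations if every round adds an attribute: n + (n-1) + ... + 1."""
    return n * (n + 1) // 2


# A bag of words with 9700 terms, as mentioned above, in the worst case:
print(worst_case_evaluations(9700))
```

In a real process each call to `evaluate` would be a full cross-validation of the learner, so with thousands of word attributes the number of model trainings becomes very large. That is one reason wrapper-style selection tends to be slow on text data.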