Splitting the dataset

ikayunida123 | Member | Posts: 17 | Contributor II
edited June 2019 in Help

Hello everyone!

So I'm doing text classification right now, and I want to ask how to split the dataset into training data and testing data in RapidMiner. I know there are operators like Split Data and Split Validation, but it looks like they split the data automatically(?), so I don't know which part is the training data and which is the testing data.

My teacher wants me to compare the result of the text classification that I'm doing manually with the result of my RapidMiner process. So I must make sure the training data and testing data in those two processes are the same.

Please help me. Thank you :catvery-happy:

Answers

  • rfuentealba | Moderator, RapidMiner Certified Analyst, Member, University Professor | Posts: 568 | Unicorn

    Hi @ikayunida123,

    It depends on how you chose your data for training and testing.

    Let's say you have 10 samples, numbered 1 to 10. If you train your model manually with samples 1, 2 and 3 and test it with samples 4 to 10, choose "linear sampling" in "Split Data" or when configuring "Split Validation". If, however, you specifically chose (e.g.) examples 1, 4 and 6 to train your model and the rest to test it, you might prefer not to work with Split Data at all, but to create two datasets that match your choices and build the model without Split Validation.

    There are three ways to create data samples in RapidMiner (well, there are four, but the fourth one is "automatically choose between stratified or shuffled depending on the data types you have"): "Linear" is 1, 2, 3..., "Shuffled" is random, and "Stratified" is shuffled while trying to keep the class proportions the same in your training data and your testing data.
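    The three sampling types described above can be sketched in plain Python. This is not RapidMiner code and not how the operators are implemented; the function names, the made-up labels and the seed value are assumptions chosen only to illustrate the idea.

```python
import random
from collections import defaultdict

def linear_split(examples, train_ratio):
    """Keep the original order: the first part trains, the rest tests."""
    cut = round(len(examples) * train_ratio)
    return examples[:cut], examples[cut:]

def shuffled_split(examples, train_ratio, seed):
    """Shuffle a copy with a fixed seed, then cut it like a linear split."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return linear_split(shuffled, train_ratio)

def stratified_split(examples, labels, train_ratio, seed):
    """Shuffle within each class so both parts keep similar class proportions."""
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    train, test = [], []
    for label in sorted(by_class):
        part_train, part_test = shuffled_split(by_class[label], train_ratio, seed)
        train.extend(part_train)
        test.extend(part_test)
    return train, test

samples = list(range(1, 11))           # examples 1..10, as in the post
labels = ["spam"] * 4 + ["ham"] * 6    # made-up class labels (assumption)

linear_train, linear_test = linear_split(samples, 0.3)
shuffled_train, shuffled_test = shuffled_split(samples, 0.3, seed=42)
strat_train, strat_test = stratified_split(samples, labels, 0.5, seed=42)
```

    With a ratio of 0.3, the linear split trains on examples 1, 2 and 3 and tests on 4 to 10, which is the case from the post. The shuffled and stratified splits depend on the seed, so a manual calculation can only match them if you first look at which examples ended up in each part.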

    Regarding Split Data, Split Validation and DIY validation, here are some pictures of what each one looks like:

    Split Validation:

    [Image: Split Validation - General View]
    [Image: Split Validation - Testing/Training]

    Split Data:

    [Image: Split Data] Notice that this is equivalent to performing the Split Validation, but harder to read in a larger model.

    With DIY Validation, data splitting is your responsibility. Basically, your model looks much like the "Split Data" model, except that you have two Retrieve operators, one with your chosen training data and the other with your chosen testing data. TBH I was too lazy to build an example.
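    As a rough illustration of splitting the data yourself, here is a minimal Python sketch that builds the two datasets from a hand-picked list of example ids. The rows, the id numbers and the helper name `split_by_ids` are all hypothetical; only the idea of "these exact examples train, the rest test" comes from the post.

```python
# Hypothetical rows (id, text, label); replace them with your own examples.
rows = [
    (1, "win a free prize now", "spam"),
    (2, "meeting moved to monday", "ham"),
    (3, "cheap loans available", "spam"),
    (4, "lunch at noon?", "ham"),
    (5, "claim your reward", "spam"),
    (6, "see the attached report", "ham"),
]

# The hand-picked training examples, as in the "1, 4 and 6" case.
train_ids = {1, 4, 6}

def split_by_ids(rows, train_ids):
    """Return (training rows, testing rows) according to explicit ids."""
    known_ids = {row[0] for row in rows}
    missing = train_ids - known_ids
    if missing:
        raise ValueError(f"unknown training ids: {sorted(missing)}")
    train = [row for row in rows if row[0] in train_ids]
    test = [row for row in rows if row[0] not in train_ids]
    return train, test

train_rows, test_rows = split_by_ids(rows, train_ids)
```

    Each of the two resulting lists would then be stored as its own dataset, so that one Retrieve operator loads the training data and the other loads the testing data.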

    For the sake of completeness, I prefer to do Cross Validation whenever I have enough memory and processor to use it (or use it as a pretext to ask my boss to buy more memory and a better processor). It works just like Split Validation, but repeated: let's say you have 100 examples and you want to split them into 5 folds. An iterator uses examples 1 to 20 for testing and the rest for training, then examples 21 to 40 for testing, then 41 to 60, and so on... More folds means smaller test sets but more iterations. There is also the "leave one out" option, but with this option enabled the amount of computing power required is... quite high if your dataset is large.
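    The fold iteration described above can be sketched in a few lines of Python. This assumes linear (unshuffled) folds and uses a hypothetical helper named `linear_folds`; it only shows which example numbers land in each fold, not how RapidMiner implements Cross Validation.

```python
def linear_folds(n_examples, n_folds):
    """Yield (train, test) lists of example numbers, one pair per fold."""
    fold_size = n_examples // n_folds
    for fold in range(n_folds):
        first = fold * fold_size + 1
        # The last fold absorbs the remainder when the split is uneven.
        last = n_examples if fold == n_folds - 1 else first + fold_size - 1
        test = list(range(first, last + 1))
        train = [i for i in range(1, n_examples + 1) if i < first or i > last]
        yield train, test

# 100 examples in 5 folds: 1-20 test first, then 21-40, and so on.
folds = list(linear_folds(100, 5))

# "Leave one out" is the extreme case: as many folds as examples.
leave_one_out = list(linear_folds(10, 10))
```

    Every fold means training one more model, which is why leave one out on a large dataset costs so much computing power: 10,000 examples would mean 10,000 training runs.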

    [Image: Screen Shot 2018-06-08 at 01.02.22.png] This is the same as the split validation, but more powerful and more CPU and memory consuming.

    Hope this helps,

    Rodrigo.

  • ikayunida123 | Member | Posts: 17 | Contributor II

    Hello Mr. @rfuentealba :catvery-happy: Your answer definitely helps me understand the concept of validation in RapidMiner. Thank you for your help! But the one closest to what I need (from your explanation) is DIY validation. Can you give me an example of DIY validation? Any pictures or an XML process would be fine. Thank you and have a nice day :catvery-happy:

  • Telcontar120 | Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member | Posts: 1,635 | Unicorn

    You can use the Log operator to capture whatever intermediate results you want from the training and the testing data inside a split validation. Just make sure you use the local random seed option so your results are reproducible.

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • rfuentealba | Moderator, RapidMiner Certified Analyst, Member, University Professor | Posts: 568 | Unicorn

    Hi @ikayunida123,

    If I understood it correctly, you picked some data by hand, you are performing the calculations manually, and your teacher wants you to see how performance differs between what you do with paper and pencil and what RapidMiner does (May Odin bless your patience for such a task, I failed COBOL twice at the Uni because I didn't have enough of it). So you want something that lets you specify that you want exactly that data for training and that other data for testing, am I right? In that case, the Log operator that @Telcontar120 suggested might not work. (However, that's a good catch! You might want to try it the other way round: run your process in RapidMiner, use the Log operator to see what your training data looks like, and then take paper and pencil to that data. That's much less hassle!)

    Well, DIY means "Do It Yourself" (I had a joke with "Just Do It" but I just didn't... For reference, that is the slogan of a well-known sportswear company), and it means getting rid of all the help given to you by RapidMiner process blocks.

    The first thing you have to get rid of is our good friend the Split Data operator; you must split the examples yourself. Then apply everything as you normally would, but... before applying the Performance block, you have to define which columns in your result have the label and prediction roles. The Performance operator takes these two to analyze how far your algorithm was from the truth; this is handled transparently for you when using the Split Validation or Split Data methods.
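    To see why the label and prediction roles matter, here is a small Python sketch of the kind of comparison a performance measure makes: it needs to know which column holds the truth and which holds the model's guess. The column names "class" and "guess", the rows and the `accuracy` helper are made up for illustration; this is not the Performance operator itself.

```python
def accuracy(rows, label_column, prediction_column):
    """Share of rows whose prediction matches the true label."""
    if not rows:
        raise ValueError("no rows to evaluate")
    hits = sum(1 for row in rows if row[label_column] == row[prediction_column])
    return hits / len(rows)

# Hypothetical scored test set: the true class and the model's guess.
scored = [
    {"text": "win a free prize now", "class": "spam", "guess": "spam"},
    {"text": "meeting moved to monday", "class": "ham", "guess": "ham"},
    {"text": "cheap loans available", "class": "spam", "guess": "ham"},
    {"text": "lunch at noon?", "class": "ham", "guess": "ham"},
]

# Naming the two columns plays the part of assigning the roles.
result = accuracy(scored, label_column="class", prediction_column="guess")
```

    Three of the four guesses match here, so `result` is 0.75. The same count can be done by hand on paper, which makes it a convenient number to compare against the RapidMiner result.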

    Finally, apply Performance as normal, et voilà: DIY Validation! Here is a screenshot.

    [Image: DIY Validation!]

    Hope this helps!

    Cheers and have a nice weekend,

  • ikayunida123 | Member | Posts: 17 | Contributor II

    Hello @Telcontar120, thank you for your suggestion :cathappy:

    But I still don't understand. Can you please give me an example of how to use the Log operator in my case? I'm sorry for asking so many questions.

    This is my process:

    [Process XML: most of the listing was lost; only these fragments survive]

    <parameter key="Text" value="regular"/>
    [...]
    @#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]" value="link"/>
    @ value=at "/>
    [...]
    <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
    [...]
    <parameter key="Text" value="regular"/>
    [...]
    @#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]" value="link"/>
    @ value=at "/>
    [...]
    <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
    <connect from_op="Generate n-Grams (Terms) (2)" from_port="document" to_port="document 1"/>
    [...]
    <portSpacing port="source_input 1" spacing="0"/>

    Thank you.

  • Telcontar120 | Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member | Posts: 1,635 | Unicorn

    Here is an example process which uses split validation to create a model and logs the training performance and the testing performance separately.

    [Process XML: most of the listing was lost; only this fragment survives]

    <portSpacing port="source_input 1" spacing="0"/>

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts