How can I obtain the accuracy list of my process?

fiddinyusfida Member Posts: 12 Contributor II
edited August 2019 in Help
Hi everyone,

I am very new to RapidMiner and am finding some difficulty here. I am running a loop process for a model, say 10 iterations, and calculating the accuracy performance. However, the result shows only the averaged (or final) accuracy. I need the list of accuracies (containing all 10 values) so I can check them further in statistical software such as SPSS.

Is it possible to obtain the accuracy list of my process using RapidMiner?

Below is a sample of the averaged accuracy. Thanks in advance for your kind response.


Answers

  • fiddinyusfida Member Posts: 12 Contributor II
    @varunm1 Thank you for your response.

    In your previous solution, I cannot define how many iterations to run. I have attached the loop with an average function here.

    After calculating manually, why does this process produce a different averaged result?

    <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="85"/>
    <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="715" y="85"/>
    <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="514" y="187"/>
    <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="648" y="136"/>

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @fiddinyusfida,

    Thanks for the process; I did check it. My understanding is that the change in accuracy comes from the splitting of the data. Because the split is random, the test and train sets sometimes change, and that changes the accuracy. I fixed it by enabling the "local random seed" option in the Split Data operator. Can you check the modified process below and see whether it works for you?

    <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="85"/>
    <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="715" y="85"/>
    <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="514" y="187"/>
    <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="648" y="136"/>
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • fiddinyusfida Member Posts: 12 Contributor II
    @varunm1 Many thanks, this helps me a lot.

    I am just curious:
    Is there any way to make this local random seed increase with each iteration of the process?

    Such as this pseudocode:
    for i in range(1, 6):
        set_random_seed(i)   # pseudocode: set the split's local random seed to i


  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    You can use macros for that, such as %{execution_count}.
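    In plain Python, the idea behind tying the seed to the iteration number looks roughly like this (a minimal sketch; `run_once` and its toy "accuracy" are hypothetical stand-ins for one RapidMiner split-train-test iteration, not real RapidMiner API):

```python
import random

def run_once(seed, data):
    # Hypothetical stand-in for one loop iteration: seed the split
    # (like "local random seed" in Split Data), hold out 30% as a
    # test set, and return a toy "accuracy" so the sketch runs.
    rng = random.Random(seed)          # the seed changes per iteration
    shuffled = data[:]
    rng.shuffle(shuffled)
    test = shuffled[: int(0.3 * len(shuffled))]
    return sum(1 for x in test if x % 2 == 0) / len(test)

data = list(range(100))
# Like %{execution_count}: the seed follows the iteration number, so every
# iteration differs but the whole experiment is exactly repeatable.
accuracies = [run_once(seed, data) for seed in range(1, 11)]
print(accuracies)  # the full list of 10 accuracies, not just the mean
```

    This keeps the individual accuracies, which is what you need for the SPSS step, while the whole run stays reproducible.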
    Regards,
    Varun

  • hughesfleming68 Member Posts: 323 Unicorn
    edited September 2019
    Just a quick comment here: I wouldn't try to increment your random seed this way. If you choose your best accuracy by changing your random seed, any improvement won't translate to out-of-sample data. There are many ways to trick yourself into thinking your model is better than it really is, and this is one of them.
  • fiddinyusfida Member Posts: 12 Contributor II
    @varunm1 Thanks for the suggestion; I am still learning how to implement macros.

    @hughesfleming68 Thanks for the response. What I actually want to do is repeat the process 30 times (based on the Central Limit Theorem) using a random seed.

    After I obtain the 30 accuracies (from random seeds 1 to 30), I want to run a statistical hypothesis test to determine whether my proposed method is significantly better than another.

    Or is there any suggestion about this?

    I quoted the central limit theorem from this link
    (https://www.investopedia.com/terms/c/central_limit_theorem.asp):
    "Sample sizes equal to or greater than 30 are considered sufficient for the CLT to hold."
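    For the comparison step described above, a paired t-test on the two accuracy lists is a common choice. A minimal stdlib-Python sketch (the accuracy lists here are simulated placeholders, not real results):

```python
import math
import random
from statistics import mean, stdev

def paired_t_statistic(a, b):
    # t = mean(d) / (stdev(d) / sqrt(n)), where d holds the paired
    # per-seed differences between the two methods' accuracies.
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Simulated placeholders for 30 seeded runs of two methods.
rng = random.Random(0)
acc_proposed = [0.80 + rng.gauss(0, 0.02) for _ in range(30)]
acc_baseline = [0.78 + rng.gauss(0, 0.02) for _ in range(30)]

t = paired_t_statistic(acc_proposed, acc_baseline)
# Compare |t| with the critical value for 29 degrees of freedom
# (about 2.045 for a two-sided test at alpha = 0.05).
```

    SPSS's paired-samples t-test on the two columns of accuracies computes the same statistic and also reports an exact p-value.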
  • hughesfleming68 Member Posts: 323 Unicorn
    edited September 2019
    Hi @fiddinyusfida. Sometimes that approach is unavoidable. It is something I have to deal with when I use TensorFlow for time-series forecasting, as opposed to other frameworks like DL4J or PyTorch, which make it much easier to get repeatable results.

    In your case, the data splitting is the weak link: whether you change the split ratio, sampling type, or random seed, you could still get wildly different results. It is something I would do only as a last resort. It is better to use as much data as you can and then use cross-validation (or sliding-window validation in the case of a time series) to get a result you can start to trust. In the end, only testing on out-of-sample data will tell you whether your testing was valid. If your data is very random (sometimes we can't control this part), then even averaging 30 runs might not be helpful. It all depends on how stable your data is.
  • fiddinyusfida Member Posts: 12 Contributor II
    Hi @hughesfleming68,

    I have only 100 records, and it seems hard to add more data since I obtained it from a public dataset repository.

    So, based on your tips, would it be better to use cross-validation (for instance, K = 10) and just average the accuracies, instead of doing a data split with a fixed ratio?
  • hughesfleming68 Member Posts: 323 Unicorn
    I would start with five-fold cross-validation and switch between linear and shuffled sampling to see what effect that has on your result. Either way, you are going to be data-limited, though it depends on how regular your data is. I would still choose that over a split. Luck plays a large role when splitting small data sets. I always seem to get over-optimistic in-sample results and poorer out-of-sample results, so I am very cautious.
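    A minimal stdlib-Python sketch of the five-fold setup being suggested (the model training itself is elided; `kfold_indices` and the placeholder accuracies are illustrative only):

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=None):
    # Split indices 0..n-1 into k folds. seed=None keeps the data
    # order (linear sampling); an integer seed shuffles it first
    # (shuffled sampling), mirroring the two sampling types.
    idx = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(100, 5, seed=42)   # 5 folds of 20 records each
fold_accuracies = []
for test_idx in folds:
    train_idx = [j for f in folds if f is not test_idx for j in f]
    # ... train on train_idx, evaluate on test_idx ...
    fold_accuracies.append(len(test_idx) / 100)   # placeholder value
print(mean(fold_accuracies))   # report the mean, but keep the list too
```

    With 100 records, every record gets used for both training (four times) and testing (once), which wastes far less data than a single split.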
  • fiddinyusfida Member Posts: 12 Contributor II
    @hughesfleming68

    I really appreciate your advice, thank you.