Not normally distributed data
jeroenheijlen
Hi,
I'm trying to find a model that predicts the execution time of a process step. I have data on more than 200 different recurring process steps from the past 2 years (160,000 rows in an Excel sheet). When I plot the execution-time data per event, the data is not normally distributed but looks more like a Poisson distribution. Just loading the data into RapidMiner Studio and applying the models does not return a good fit. What can I do? (For data pre-processing in Python or R I would need a step-by-step guide, because I'm pretty new to all of this.)
Some help would really be appreciated!
Best regards
Jeroen
Answers
Have you tried submitting your data to Auto Model (the AutoML tool of RapidMiner)?
Regards,
Lionel
Yes, I tried Auto Model, but even after I seriously reduced the variation in the input data, no model does a good job on my data.
Maybe there are no relationships between your independent features and your label (your target).
In this case, it is impossible to find a good model and machine learning is of no use...
In the meantime, you can try to:
- enable feature selection / feature generation in the options of Auto Model
- tune the hyper-parameters of your best models to try to increase the accuracy / decrease the error rate.
Regards,
Lionel
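One way to test the suggestion that the features may carry no signal is to compare against a trivial baseline on held-out rows. The sketch below is only an illustration under stated assumptions: the data is synthetic, the `(step, minutes)` row layout is hypothetical, and the target is log-transformed because skewed execution times are often easier to model on a log scale. It uses only the Python standard library.

```python
import math
import random
import statistics
from collections import defaultdict

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_median_baseline(train):
    """Return per-step medians of log(minutes) and a global fallback median."""
    by_step = defaultdict(list)
    for step, minutes in train:
        by_step[step].append(math.log(minutes))
    per_step = {s: statistics.median(v) for s, v in by_step.items()}
    overall = statistics.median(math.log(m) for _, m in train)
    return per_step, overall

def mean_abs_log_error(test, per_step, overall, use_step=True):
    """Mean absolute error on the log scale for the chosen baseline."""
    errors = []
    for step, minutes in test:
        pred = per_step.get(step, overall) if use_step else overall
        errors.append(abs(math.log(minutes) - pred))
    return statistics.mean(errors)

# Synthetic stand-in for the real export: 20 hypothetical steps, each with
# right-skewed (log-normal) execution times in minutes.
rng = random.Random(42)
rows = []
for step in range(20):
    mu = rng.uniform(2.0, 5.0)
    for _ in range(200):
        rows.append((f"step_{step}", rng.lognormvariate(mu, 0.5)))

train, test = split_rows(rows)
per_step, overall = fit_median_baseline(train)
err_global = mean_abs_log_error(test, per_step, overall, use_step=False)
err_step = mean_abs_log_error(test, per_step, overall, use_step=True)
print(f"global median baseline:   {err_global:.3f}")
print(f"per-step median baseline: {err_step:.3f}")
```

If a trained model cannot beat the per-step median on held-out data, the remaining features probably add little beyond the identity of the step, which would match the concern raised above.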
I'm indeed afraid that the variation within each of the process steps is too large, and that therefore no model can find a correlation or a predictive fit.
Thanks for your advice.
I will try a few more things (automatic feature selection fails), such as starting with a smaller dataset (data from only a few of the process steps, with more of the outliers removed, although the data will still never be normally distributed) and also recoding the target as a binary outcome (more than 2 hours vs. less than 2 hours, or so).
If I ever succeed, I will post the outcome ;-).
Best regards
Jeroen
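The two pre-processing ideas in this post (removing outliers and recoding the target as a two-class label) could be sketched in Python roughly as follows. This is a minimal sketch, not a recommendation: the 1.5 x IQR rule is one common outlier heuristic chosen here as an assumption, the 120-minute threshold comes from the "2 hours" mentioned above, and the sample durations are made up.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def trim_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR-based limits."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

def to_binary_label(minutes, threshold_minutes=120):
    """Recode an execution time as 'long' (over the threshold) or 'short'."""
    return "long" if minutes > threshold_minutes else "short"

# Made-up execution times (minutes) for a single process step.
durations = [35, 40, 42, 50, 55, 61, 70, 90, 130, 150, 900]

kept = trim_outliers(durations)
labels = [to_binary_label(m) for m in kept]

for minutes, label in zip(kept, labels):
    print(minutes, label)
```

In practice the trimming would be applied per process step rather than to the whole sheet at once, since each step has its own typical duration.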
Good luck!
Regards,
Lionel