Not normally distributed data

jeroenheijlen Member Posts: 4 Learner I
Hi,
I'm trying to find a model that predicts the execution time of a process step. I have data from over 200 different recurring process steps from the past 2 years (160,000 rows in an Excel sheet). When I plot the execution-time data per event, the data is not normally distributed but looks more like a Poisson distribution. Simply loading the data into RapidMiner Studio and applying the models does not give a good fit. What can I do? (For data pre-processing in Python or R I would need a step-by-step guide, because I'm pretty new to all of this.)
Some help would really be appreciated!
Best regards
Jeroen

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @jeroenheijlen,

    Have you tried submitting your data to Auto Model (RapidMiner's AutoML tool)?

    Regards,

    Lionel
  • jeroenheijlen Member Posts: 4 Learner I
    edited May 2020
    Hi @lionelderkrikor, thanks for your reply.
    Yes, I tried Auto Model, but even after I seriously reduced the variation in the input data, no model does a good job on my data:

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @jeroenheijlen,

    Maybe there is no relationship between your independent features and your label (your target).
    In that case, it is impossible to find a good model, and machine learning is of no use...
    In the meantime, you can try to:
    - enable feature selection / feature generation in the options of Auto Model
    - tune the hyper-parameters of your best models to try to increase the accuracy / decrease the error rate.
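
One rough way to check whether a feature carries any signal at all is to measure how much of the label's variance it explains on its own. The sketch below is not part of Auto Model; it uses only the Python standard library, and the function name `explained_variance_share` and the toy rows are hypothetical. It computes the correlation ratio (eta squared) of execution time grouped by process step, assuming the data has already been read into `(process_step, execution_hours)` pairs:

```python
from collections import defaultdict
from statistics import fmean


def explained_variance_share(rows):
    """Share of the label's variance explained by group membership (eta squared).

    rows: iterable of (group, value) pairs, e.g. (process_step, execution_hours).
    Returns a value between 0 (the grouping explains nothing) and 1 (everything).
    """
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)

    all_values = [v for values in groups.values() for v in values]
    grand_mean = fmean(all_values)
    total_ss = sum((v - grand_mean) ** 2 for v in all_values)
    if total_ss == 0:
        return 0.0  # the label is constant, so there is nothing to explain

    # Variation of the group means around the overall mean, weighted by group size
    between_ss = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return between_ss / total_ss


# Hypothetical toy data: (process step, execution time in hours)
rows = [
    ("step_a", 1.0), ("step_a", 1.2), ("step_a", 0.8),
    ("step_b", 3.0), ("step_b", 3.5), ("step_b", 2.5),
]
share = explained_variance_share(rows)
print(f"share of variance explained by process step: {share:.2f}")
```

A share close to 0 for every candidate feature would support the suspicion that there is little for any model to learn; a share well above 0 suggests the signal exists and the modelling setup deserves a closer look.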

    Regards,

    Lionel
  • jeroenheijlen Member Posts: 4 Learner I
    Hi @lionelderkrikor,
    I'm indeed afraid that the variation within each of the process steps is too large, and that therefore no model can find a correlation or a predictive fit.
    Thanks for your advice.
    I will try a few more things (auto feature selection fails), such as starting with a smaller dataset (data for only a few of the process steps, with more of the outliers removed, although the data will still never be normally distributed) and also recasting the target as a binary outcome (for example, more than 2 hours versus less than 2 hours).
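
For anyone wanting to try the same plan in Python, here is a minimal sketch using only the standard library. It assumes the Excel rows have already been read into `(process_step, execution_hours)` pairs; the function name `prepare`, the step names, and the interquartile-range rule for outliers are illustrative assumptions, not something prescribed in this thread:

```python
from collections import defaultdict
from statistics import quantiles

THRESHOLD_HOURS = 2.0  # the "more than 2 hours / less than 2 hours" split


def prepare(rows, keep_steps, iqr_factor=1.5):
    """Keep a few process steps, drop per-step outliers, add a binary label.

    rows: iterable of (process_step, execution_hours) pairs.
    Returns a list of (process_step, execution_hours, is_long) tuples.
    """
    by_step = defaultdict(list)
    for step, hours in rows:
        if step in keep_steps:
            by_step[step].append(hours)

    prepared = []
    for step, values in by_step.items():
        if len(values) >= 4:
            # Interquartile-range rule (an assumed outlier definition)
            q1, _, q3 = quantiles(values, n=4)
            spread = q3 - q1
            low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
        else:
            # Too few rows to judge outliers, so keep everything
            low, high = float("-inf"), float("inf")
        for hours in values:
            if low <= hours <= high:
                prepared.append((step, hours, hours > THRESHOLD_HOURS))
    return prepared


# Hypothetical toy data: (process step, execution time in hours)
rows = [
    ("step_a", 1.0), ("step_a", 1.1), ("step_a", 0.9), ("step_a", 1.2),
    ("step_a", 1.0), ("step_a", 0.8), ("step_a", 1.3), ("step_a", 1.1),
    ("step_a", 25.0),
    ("step_b", 2.5), ("step_b", 3.0), ("step_b", 1.8), ("step_b", 2.2),
    ("step_c", 0.5),
]
prepared = prepare(rows, keep_steps={"step_a", "step_b"})
for step, hours, is_long in prepared:
    print(step, hours, "long" if is_long else "short")
```

Filtering outliers per process step rather than over the whole sheet matters here, because a duration that is normal for one step may be extreme for another. The resulting `is_long` column can then be used as the label for a classification model instead of predicting the raw time.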

    If I ever succeed, I will post the outcome ;-).
    Best regards
    Jeroen
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    You're welcome, @jeroenheijlen.

    Good luck !

    regards,

    Lionel