DBSCAN taking very long time

moritz_moeller, Member, Posts: 5, Learner I
January 2019, edited, in Help
Hello there,

I am currently trying to do a cluster analysis with DBSCAN. Since this is my first time doing a cluster analysis or using DBSCAN, my knowledge comes only from papers and online documents. Maybe one of you can help me out:

I am analyzing a fairly large amount of data (I know that's relative): 10 columns and around 6 million rows. I select attributes, filter them, normalize them, and then feed them into the DBSCAN clustering. My parameters are epsilon=0.5 and minpts=4. I want to look at 2 attributes at a time, since I'll compare the results to k-means.
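
For readers who want to see what the pipeline described above does to the data, here is a minimal pure-Python sketch of "normalize two attributes, then run DBSCAN with epsilon=0.5 and minpts=4". It is not RapidMiner's implementation. The function names (`z_normalize`, `region_query`, `dbscan`), the choice of z-score normalization, and the toy data are all assumptions made for illustration. The neighborhood search is a plain linear scan, so every point is compared with every other point.

```python
import math

def z_normalize(column):
    """Scale a list of numbers to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # guard against constant columns
    return [(x - mean) / std for x in column]

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], found by a linear scan."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps=0.5, min_pts=4):
    """Naive DBSCAN. Returns one label per point; -1 marks noise.

    A point counts as its own neighbor, so min_pts includes the point itself.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

# Toy data (hypothetical): two tight groups and one outlier, two attributes each.
xs = [1.0, 1.1, 1.0, 1.1, 1.05, 5.0, 5.1, 5.0, 5.1, 5.05, 3.0]
ys = [1.0, 1.0, 1.1, 1.1, 1.05, 5.0, 5.0, 5.1, 5.1, 5.05, 9.0]
points = list(zip(z_normalize(xs), z_normalize(ys)))
labels = dbscan(points, eps=0.5, min_pts=4)
print(labels)
```

On this toy data the two groups come out as two clusters and the last point is labeled noise. Keep in mind that epsilon is measured on the normalized scale, so the same epsilon=0.5 can cover very different neighborhoods depending on how the attributes were normalized.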

The problem is that it already takes over an hour to preprocess the data (the loading circle is shown on the clustering operator) before the progress indicator even starts counting from 1 to 100. Is there anything I can change in my process to make it faster? Quite likely I am making some beginner mistakes.

Thanks for your answers and have a nice day.

EDIT: I have 64 GB of RAM, and the process uses around 32 GB at the moment. I set the maximum to 50 GB. I should also add that I only have numeric attributes.

Best Answer

  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist
    Solution Accepted
    Hi Moritz,
    I guess 6M rows are simply a lot for this. If I remember correctly, the runtime is in O(n²).

    BR,
    martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
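
To put the O(n²) remark above into numbers, the following sketch counts how many distinct point pairs exist for a given number of rows. It assumes the worst case, in which every pair of points has to be compared; whether RapidMiner's operator actually does this is not confirmed in this thread. The helper name `pairwise_distance_count` is made up for the example.

```python
def pairwise_distance_count(n):
    """Number of distinct point pairs among n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2

full = pairwise_distance_count(6_000_000)
tenth = pairwise_distance_count(600_000)

print(f"{full:,} pairs for 6,000,000 rows")   # 17,999,997,000,000
print(f"{tenth:,} pairs for 600,000 rows")    # 179,999,700,000
print(f"ratio: {full / tenth:.1f}x")          # ratio: 100.0x
```

Under this assumption, cutting the rows to a tenth cuts the work to roughly a hundredth, which is why even a moderate reduction in rows can change the runtime dramatically.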

Answers

  • moritz_moeller, Member, Posts: 5, Learner I
    Well, it seems you're correct. I am now working with only a range of my rows, and the runtime is considerably lower.

    Thanks for the answer; I assume this is the correct one.
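
The reply above reduces the data by taking a range of rows. If the rows happen to be ordered (by time, for example), a contiguous range may not be representative, so a random sample is a possible alternative. The sketch below shows one way to draw a reproducible sample in plain Python. The function name `sample_rows`, the seed, and the toy rows are assumptions for illustration and are not part of the original process.

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Return a reproducible random subset holding about `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(rows) * fraction))
    # A dedicated Random instance keeps the result independent of global state.
    return random.Random(seed).sample(rows, k)

# Hypothetical stand-in for the real table: 1000 rows with two numeric attributes.
rows = [(i, i * 0.5) for i in range(1000)]
subset = sample_rows(rows, 0.1)
print(len(subset))  # 100
```

Clustering the sample first and checking the result before scaling up keeps each experiment short while epsilon and minpts are still being tuned.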