Load intensive processes and operators - RM Server autoscaling testing

Nikouy Member Posts: 22 Contributor II
edited March 2020 in Help
Hello,

I have set up a Kubernetes cluster for RM-Server using EKS and I need to run a series of tests for Horizontal, Vertical and Cluster scaling. I need to generate a lot of load, and I would like to use some real world processes to generate load.

- What kind of processes/operators would exhaust the memory?
- What kind of processes/operators are heavier on the CPU?
- Is there any process publicly available that I can use, either for prediction, classification or something else?

I do not really care about what I am processing, as long as I can exhaust memory and/or CPU while using a real data set.

Thanks,
Nicolas






Answers

  • hbajpai Member Posts: 102 Unicorn
    Hey @Nikouy,

    I feel loops are one of the easiest ways to exhaust memory in RM, especially if we deactivate parallel execution.
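As a rough illustration of the idea outside RapidMiner, here is a minimal Python sketch (not the RapidMiner process itself). It assumes a sequential loop that keeps every intermediate result alive, so memory use grows with the iteration count. The function name and the sizes are hypothetical and chosen only for the example.

```python
import tracemalloc


def run_sequential_loop(iterations, rows_per_iteration):
    """Run a plain (non-parallel) loop and retain every intermediate table."""
    retained = []
    for i in range(iterations):
        # Build a fresh "table" on each pass and deliberately keep it,
        # so nothing can be garbage-collected until the loop ends.
        table = [float(i + j) for j in range(rows_per_iteration)]
        retained.append(table)
    return retained


tracemalloc.start()
results = run_sequential_loop(iterations=50, rows_per_iteration=10_000)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

total_values = sum(len(table) for table in results)
print(f"retained {total_values} values, peak traced memory {peak_bytes} bytes")
```

Raising `iterations` or `rows_per_iteration` increases the retained memory roughly linearly, which is the behaviour a memory-based autoscaling test would want to trigger.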

    Try the process below. Also, please share the results; I am interested in understanding the autoscaling aspect too.

    <?xml version="1.0" encoding="utf-8"?> <process version="9.6.000">



    Best,
    Harshit
  • Nikouy Member Posts: 22 Contributor II
    Hey @hbajpai,

    Thanks for your input. I tried this process on my laptop and it almost fried it! I'll give it a try in my cluster and share my findings here. I am still trying to figure out how to fix the Kubernetes DNS so that the load balancer redirects requests to multiple servers (for high availability).

    Community, is there any way I can do something similar using supervised algorithms? I am thinking of using a large data set from UCI.

    Thanks,
    Nicolas


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,404 RM Data Scientist
    Hi @Nikouy,
    can you explain why you are doing this? We run tests like this internally, of course. But what are you trying to get out of it?

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Nikouy Member Posts: 22 Contributor II
    edited April 2020

    I am currently undertaking a research project as part of my MSc dissertation. RapidMiner is the focus of my project, which answers a call from the scientific and big data community to “develop scalable higher-level models” (Elshawi et al., 2018) and thus to help those who need to automate the flexible scaling of infrastructure (Zhao et al., 2015).
    Therefore, I am exploring how to deploy an auto-scalable RapidMiner fleet in the cloud using Kubernetes, with the aim of providing a reference architecture.
    Obviously, I will need to test the system after its implementation to demonstrate high availability and scalability, and I would like to do so using real data sets and various algorithms in order to understand how it behaves under different circumstances or test cases.

    Thanks,
    Nicolas

  • Nikouy Member Posts: 22 Contributor II
    Rodrigo,

    Thank you for taking the time to write such a detailed reply and for highlighting the differences between HPC and Blackboard Systems. I consider parallel processing key, which is why I was asking which algorithms (either supervised or unsupervised) would make good use of parallel processing, so that I could focus on one or two processes at most.

    I didn't quite get point number 3, so I'd appreciate it if you could expand on it. What would I be achieving with this?

    Thanks again,
    Nicolas
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Sure!

    I understand that you are launching more agents with Kubernetes on demand depending on the process, am I right?

    When you use a local process that requires parallel work, RapidMiner launches these parallel processes on the same machine. Which processes can do that?

    · Looping with “use parallel execution”.

    · Cross validation.

    · Feature selection.

    When you do this on RapidMiner Server, it behaves the same way (parallel processes on the same machine), and the same processes apply.

    But if you are talking about horizontal scaling (adding more machines), your processes need to be ready to send data to other RapidMiner agents, and that is done by creating a process that can be scheduled through the server. For horizontal scaling, you should invoke “Schedule Process” in a loop, and Cross Validation and Feature Selection can no longer be parallelized on many servers.

    Basically, that’s why (in my humble opinion) I think you might want to focus on scoring with a previously trained model: it will be easier for you to research horizontal and vertical scaling. If you want to discuss this in private, drop me a line.
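To make the scoring suggestion concrete, here is a minimal Python sketch (not RapidMiner code, and it does not use the Server scheduling API). It assumes a trivially simple pre-trained linear model and two made-up agent names; the point is only that scoring batches are independent, so each one could be handed to a different machine as its own scheduled job.

```python
def split_into_batches(rows, batch_size):
    """Cut the data set into independent chunks, one per scheduled job."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def score_batch(weights, bias, batch):
    """Apply an already trained linear model to one batch of rows."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in batch]


# Hypothetical pre-trained model and toy data set, for illustration only.
weights = [0.5, -1.0]
bias = 2.0
rows = [[float(i), float(i % 3)] for i in range(10)]

batches = split_into_batches(rows, batch_size=4)

# Hypothetical agent names; round-robin stands in for a real scheduler.
agents = ["agent-a", "agent-b"]
assignments = {name: [] for name in agents}
for index, _batch in enumerate(batches):
    assignments[agents[index % len(agents)]].append(index)

# Scoring batch by batch gives the same result as scoring everything at once,
# which is what makes this workload easy to spread over several machines.
scores = [s for batch in batches for s in score_batch(weights, bias, batch)]
print(assignments)
print(scores[:3])
```

Training steps such as cross validation do not split this cleanly, because their folds share one model-building process; that is why the sketch sticks to scoring.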

    All the best,

    Rod
  • Nikouy Member Posts: 22 Contributor II
    Thanks Rodrigo, that totally makes sense :). I'll probably be reaching out.

    @hbajpai, I tried executing the process you suggested on the server, but it looks like (for some reason) Studio ends up picking it up. Please see the screenshot below from my laptop. I did not see any load increase at all on the server.



    Thanks,
    Nicolas

