Joining collections in Radoop

JEdward (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member) Posts: 578, Unicorn
edited December 2018 in Help

Hi guys,

I want to loop over attributes in Radoop to compute some aggregates and append them back to my original table.

The output is a collection, and I'd really love to be able to loop through that collection with Remember and Recall and then join all of the datasets together.

Unfortunately this is Radoop, so those operators aren't available to me and I need a smarter solution.

For efficiency I want to make the most of my cluster resources, so using Reuse Results isn't a good option. Each iteration has to wait for the previous one to finish, which creates a bottleneck: in a test run each iteration takes 10 seconds, and because the iterations can't run in parallel, the entire process takes 7 hours, with or without the full dataset.

Anyone got any suggestions on how to rejig it to combine the output collections so I can use more cluster nodes in the execution?

<enumeration key="tables_to_reload"/>

:'(

<connect from_op="Join (2)" from_port="join" to_port="output 1"/>

For efficiency in execution I don't want to reuse results. I want to join the resulting sets together. Unfortunately I can't find the operator that would help with this.

Answers

  • mborbely (Member) Posts: 14, Contributor II

    Hi,

    It's not entirely clear to me what you want to achieve, but I can make some assumptions.
    It seems like you want to aggregate multiple attributes, and end up with a result similar to this:

    image.png

    Is this correct?
    Now, from the process you posted it seems like you want to apply the same aggregation method to all of your attributes. If that's the case, you can achieve this by using default aggregation. You can pass your Example Set directly to the Aggregate operator, and set it up the following way:

    • Check "use default aggregation"
    • For attribute filter type select "all"
    • Set the aggregation function in the "default aggregation function" selector (for example "average")
    • For "group by attributes" select your group by attribute. From your process, I assume you want to use "label".

    The output of this operation is going to be similar to the image above. You don't even have to worry about attributes to which your aggregation function doesn't apply; these will simply be disregarded.
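
    To make the idea of default aggregation concrete, here is a rough plain-Python analogue of what the settings above ask for: one function applied to every attribute it fits, grouped by "label". This is only a sketch of the behaviour, not Radoop's implementation; the helper name default_aggregate, the sample rows, and the "mean(att)" output naming are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def default_aggregate(rows, group_by, func=mean):
    """Apply one aggregation function to every numeric attribute, per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)

    result = []
    for key, members in groups.items():
        out = {group_by: key}
        for attr in members[0]:
            if attr == group_by:
                continue
            values = [m[attr] for m in members]
            # Attributes the function does not apply to (here: anything
            # non-numeric) are simply left out of the result.
            if all(isinstance(v, (int, float)) and not isinstance(v, bool)
                   for v in values):
                out[f"{func.__name__}({attr})"] = func(values)
        result.append(out)
    return result


# Hypothetical sample data, not taken from the original process.
rows = [
    {"label": "a", "att1": 1.0, "att2": 10.0, "note": "x"},
    {"label": "a", "att1": 3.0, "att2": 20.0, "note": "y"},
    {"label": "b", "att1": 5.0, "att2": 30.0, "note": "z"},
]
aggregated = default_aggregate(rows, "label")
```

    Here aggregated holds one row per label with the averages of att1 and att2, and the text attribute "note" is dropped, which mirrors the "simply disregarded" behaviour described above.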

    It's possible that I misunderstood you, and you want a different aggregation for each of your attributes. In that case you cannot use default aggregation; instead, provide your aggregations in "aggregation attributes" one by one. Whichever option you choose, this is the right way to execute it, because a single Hive statement then takes care of all of your aggregations.
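
    As a rough illustration of why a single statement is enough, the sketch below builds one HiveQL query covering several (attribute, function) pairs. It is not the SQL that Radoop actually generates; the function build_aggregate_query, the table name my_table, and the attribute names are hypothetical, and the code does no escaping, so it should only be fed trusted names.

```python
def build_aggregate_query(table, group_by, aggregations):
    """Build one HiveQL SELECT covering every (attribute, function) pair."""
    select_parts = [f"`{group_by}`"]
    for attribute, function in aggregations:
        select_parts.append(
            f"{function.upper()}(`{attribute}`) AS `{function.lower()}_{attribute}`"
        )
    return (
        f"SELECT {', '.join(select_parts)} "
        f"FROM `{table}` GROUP BY `{group_by}`"
    )


# Hypothetical table and attribute names, chosen only for illustration.
query = build_aggregate_query(
    "my_table",
    "label",
    [("att1", "avg"), ("att2", "max"), ("att3", "sum")],
)
print(query)
```

    All three aggregations end up in the same SELECT with one GROUP BY, so the cluster scans the table once instead of once per attribute, which is the point of avoiding the loop.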

    The last thing I want to mention is that you are trying to do a join by id at the end of your process. This does not make a whole lot of sense on an aggregated dataset. I don't really know what you want to achieve through this, but generally you shouldn't need any join with the solution proposed by me.
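
    A small plain-Python sketch of why the id join has nothing to work with: after grouping by "label" there is one value per label and no id left. If the aggregates really do have to be appended to the original rows, as the question suggests, the group-by attribute would be the natural key. That last step is an assumption about the goal, and the sample data and the "average(att1)" column name are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data, not taken from the original process.
rows = [
    {"id": 1, "label": "a", "att1": 1.0},
    {"id": 2, "label": "a", "att1": 3.0},
    {"id": 3, "label": "b", "att1": 5.0},
]

# Aggregate att1 per label; the result is keyed by label and carries no id.
values_by_label = defaultdict(list)
for row in rows:
    values_by_label[row["label"]].append(row["att1"])
aggregated = {label: mean(values) for label, values in values_by_label.items()}

# Attaching the aggregate back to each original row uses label as the key.
enriched = [
    {**row, "average(att1)": aggregated[row["label"]]} for row in rows
]
```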

    Hope this is what you were looking for.

    Cheers,

    Máté
