Joining time series with time as index
Hallo,
I am trying to do something similar as in this post:
I have two source files with each having basically two attributes: "timestamp" and "att". Data is only recorded if a value in "att"changes and the values of source file 1 and source file 2 change independently. Therefore, I am having two data collections that have the same timeframe from start to end, but the individual datapoints in between have to a large extent different timestamp values and a different amount of datapoints (i.e. examples).
I now want to join those two source files by basically treating the timestamp of one of the two files as index (the one with more entries). Hence, I would like to be able to answer the following question:
At timestamp t, att of source 1 had a value of x and att of source 2 a value of y.
As RapidMiner can only join exact index matches (i.e. no closest value), I would like to generate an index attribute which basically takes the current value of att from source 2 and finds the closest value to it in the array of all att values in source 1. As a result, source file 2 would now have three columns (timestamp, att and index).
I have looked around a lot and tried to find a solution with Macros and Loops, but I just don't manage to get the desired result. Rounding the timestamp values as proposed in the post above does not work for me, as precision is key to my process.
I can think of several use cases for such an operator, so I am convinced that there has to be a solution in which I can generate an attribute by looping through the array of existing attribute values and finding the closest value.
I would be really grateful, if someone could point me in the right direction (preferably without using the ExecuteScript operator).
Thank you very much.
Best Answer
-
bhupendra_patil Administrator, Employee, MemberPosts:168RM Data Scientist
see if the attached example gets the correct result
basically I am doing a join on time stamps from both a&b,
then the time series extension allows replacement with previous values
so we then get values for B for all values of A
Then we do left join again to remove unwanted rows.
You will need the series extension for this
2
Answers
Do you have some sample data that you can post to work with? It will make it much easier to test a few different approaches.
There is an operator that was created by one of the RMStaff called "generate date series" (attached here but you should be able to find the original by searching the forum if you prefer), which allows you to generate a series of date/time stamps by whatever intervals you need, which you might then be able to use to fill in the missing gaps so you can join everything and have a consistent series, and then use some looping to group them together.
Another idea is that you may want to look at the lag and window operators from the series extension--that would work as long as you have consistent spacing between records of each type in the dataset.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you for your response.
I am attaching a small made up dataset that illustrates the problem. Tab 1 represents source 1, tab 2 source 2 and the third tab shows the desired result, by adding an index to source 2 that represents the closest value of Attribute that can be found in source 1. The timestamp is in ms since epoch and unfortunately the datapoints are collected outside of a set time interval. Whenever Attribute changes in either of the two sources, a new datapoint is collected in that source file. I am looking at the operator you attached, but believe that it starts out with the assumption of fixed time intervals - hence it might not work.
在Excel中我通过应用可以解决这种类型的问题ing a combination of INDEX and MATCH (the real dataset is far too big to solve my problem there and re-import it into RM). However, I have so far not found a way to iterate over all examples of one attribute and generate a new attribute that is based on the relationship of a fixed example of another attribute column and the attribute column which I iterated over.
Or maybe more general: Is there a way to use the "Generate Attribute" operator with more than one input?
if i have t1 as timestamps in sheet1 and t2 and timestamps in sheet 2
Do you care about closest timestamp in forward backward both directions
i.e after the join, it should have the closest value if the future timestamp is closer?
or always use the closest value from the past?
Best case would be the closest value from the past. I.e. Timstamp from 1 > Timestamp form 2
The excel solution in my example ignored that fact, as I believe I could do that with the filter operator in some way.
Thank you so much. This does exactly what I was looking for.