What is the best way to build an incremental load to prepare data

MarlaBotMarlaBot Administrator, Moderator, Employee, MemberPosts:57Community Manager
edited December 2018 inHelp
dllanoswants to know the answer to this question, "I'm trying to see how is the process of an incremental load, for a data that over time will continue to grow"
rfuentealba

Answers

  • MarcoBarradasMarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, MemberPosts:271Unicorn
    As always it depends what the source of the data will be.
    If its form a DB the best way will be by managing a GUID and a timestamp.
    If its from a web source like twitter you will need the max id of the last time you pulled data from that source.
    If the data is extracted from files you will need to keep track of what you had already processed and what is new.
    Give us more information of what the sorce of the data and then we can begin to process the information.
    rfuentealba
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn
    一个方便的技巧you can always use the macro %{process_start} for the time when a process is run. You could then store that value in an attribute in an exampleset and then use that in a later process to set the minimum for a date/time stamp to select only data after that point.
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
    rfuentealba MarcoBarradas
  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University ProfessorPosts:568Unicorn
    edited November 2018
    视情况而定。

    There are 4 kinds of fields that will help us performing incremental load, here are these from best to worst:
    • Creation Date, Date.
    • Modification Date, Date.
    • Deletion Date, Date.
    • ID, Numerical.
    Here is how to use these approaches:

    • If there is a creation date, then we can useAggregate和read the last date on the database to performINSERT.
    • If there is a modification date, then we can use the last date on theCreation Dateto performUPDATE.
    • If there is adeletion date, then we can use the last date on theCreation Dateto performDELETE. Haven't seen anyone who uses the deletion date without the other two.
    • If there are no dates because the system doesn't have auditing, then we can rely upon a numerical ID. Notice that ID's are created with sequences, and sequences are prone to lack of synchronization (at least in some databases). In that case, theAggregateis still required, just on the ID field.
    This is basically summing up what my great senseis@Telcontar120@MarcoBarradasalready told you.

    If this is not what you are looking for, please, break this glassONLYin case of emergency:
    There is another idea. It is overkill and totally not recommended, but it is the last resort if we want to keep the database updated. If there is a unique field that isn't a sequential one and you are required to keep the data warehouse up to date with the required information, and the database supports it, you can perform an UPSERT. Let's say the one who wrote the database had the marvelous idea of putting car plates, identification numbers or addresses as ID's (yes, I saw them, they exist, and I actually became a bit famous for using data science to overcome technical debt... I hope you don't have the same problems but if you do, let me tell you you are not alone), your only alternative is to perform an INSERT that can fail on a unique value but update the entire rest.

    Again, this isoverkill, as it performs up to 3 operations per row. I have to do something similar: upserts on a table that has 4.7 Gb of floating point numbers per hour on alittle pet project(Dear@sgenzer, please insert recorded laughs here) that I'm doing. I ended up creating the table dynamically and updating the view to point to the last one, because that was the best way I could go back to recover my hours of sleep.

    Still, this isnota common approach,notrecommended. Only a seasoned DBA or a totally insane and fearless person could give this one to you. I am in the second class of people.

    Hope this helps,

    Rodrigo.

    hughesfleming68 sgenzer
Sign InorRegisterto comment.