How to work with a very large .csv file?

ncjanes · Member, Posts: 3, Contributor I
edited December 2018 in Help

Hi folks, it looks like there have been a similar post or two in the past, but they are years old at this point, so I thought it would be helpful to refresh...

I need to load a huge .csv file (4.72GB, ~23MM lines), and I need to break it up into smaller .csv files according to one polynominal (categorical) attribute. This is public State of Texas data, so the attribute by which I want to split into smaller data sets is "County", and I want those new .csv files to be kicked out onto my local disk.

What's the best, most computationally efficient way to do this?

Thanks!!

cc @sgenzer


Answers

  • yyhuang · Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 363, RM Data Scientist

    Hi @ncjanes,

    This is an interesting topic. You have several options; here are some ideas off the top of my head.

    1. Shell script: split the file by the value of a2 (the second field). A chunked Python version of the same split is sketched after this list.
      # assumes four plain comma-separated fields; appends each line to a .csv named after field a2
      while IFS=, read -r a1 a2 a3 a4; do
        echo "$a1,$a2,$a3,$a4" >> "$a2".csv
      done < file.csv
      Please refer to this post for examples
    2. RDB (relational database): load the CSV data into a MySQL or PostgreSQL table and run in-database SQL queries to select the subsets
    3. Scale up: if you have access to an environment with plenty of RAM, you can run a RapidMiner process that loads the big CSV and splits the tabular data with either
      (a) the "Loop Values" operator (enable parallel execution to make it run faster); inside the loop, use the iteration macro with "Filter Examples" to find the subset and then "Write CSV" with a macro-defined file name to store each individual small CSV file, or
      (b) the "Group Into Collection" operator with "Loop Collection"; inside that loop, write each example set to a separate CSV
    4. Scale out: HDFS or Spark RDDs are popular solutions nowadays for really huge data (4.72 GB is not that big)
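    For reference, a minimal Python sketch of the same split-by-value idea, done in chunks so the full 4.72 GB never has to sit in memory at once. The file name, chunk size, and the "County" column name are assumptions; adjust them to the actual data.

      # Sketch: split a huge CSV into one file per County without loading it all at once.
      # Assumes a header row and a column literally named "County".
      import os
      import pandas as pd

      chunksize = 500_000  # rows per chunk; tune this to your RAM

      for chunk in pd.read_csv("texas.csv", chunksize=chunksize, dtype=str):
          for county, rows in chunk.groupby("County"):
              out = f"{county}.csv"
              # write the header only when the county file is created for the first time
              rows.to_csv(out, mode="a", index=False, header=not os.path.exists(out))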

    HTH!

    YY

  • kypexin · Moderator, RapidMiner Certified Analyst, Member, Posts: 290, Unicorn

    Hi,

    Just my 5 cents... I would say the most computationally efficient way is the second one suggested by YY.

    You can actually break it down into two parts:

    1) Use a simple Python (or whatever else) script to read the arbitrarily large file in chunks and save them into a SQL table; see a Python example here: http://odo.pydata.org/en/latest/perf.html (a minimal sketch follows this list)

    2) Then run SQL queries on subsets, using any SQL-related tool or RapidMiner.
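    A minimal sketch of both steps, using pandas with SQLite purely for illustration (the odo example linked above or any other loader works just as well; the file, table, and column names here are assumptions):

      # 1) load the big CSV into a database table in chunks, 2) query subsets back out
      import sqlite3
      import pandas as pd

      con = sqlite3.connect("texas.db")

      # step 1: chunked load, so memory use stays bounded
      for chunk in pd.read_csv("texas.csv", chunksize=500_000, dtype=str):
          chunk.to_sql("records", con, if_exists="append", index=False)

      # step 2: pull subsets with plain SQL, e.g. one county at a time
      counties = pd.read_sql("SELECT DISTINCT County FROM records", con)["County"]
      for county in counties:
          subset = pd.read_sql("SELECT * FROM records WHERE County = ?", con, params=(county,))
          subset.to_csv(f"{county}.csv", index=False)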

    This approach also keeps the data easily available for any future processing, including use within different RM processes.

  • MartinLiebig · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist

    Hi,

    this reminds me of an operator I wanted to implement: Stream Lines

    Basically, it reads a file line by line and gives it to you as a document. You could then filter it and move forward.
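    Just to illustrate the processing model (not the operator itself), a rough Python equivalent; the column name, filter value, and file names are made up:

      # Stream the file row by row, filter, and write, so only one row is in memory at a time.
      import csv

      with open("texas.csv", newline="") as src, open("harris.csv", "w", newline="") as dst:
          reader = csv.DictReader(src)
          writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
          writer.writeheader()
          for row in reader:
              if row["County"] == "Harris":  # keep only the rows you care about
                  writer.writerow(row)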

    Now I just need to find the time to implement this.

    @kypexin: Do you think this is useful?

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin · Moderator, RapidMiner Certified Analyst, Member, Posts: 290, Unicorn

    Hi Martin,

    Well, some time ago I was asking more or less the same question about processing a long CSV: https://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Forum/Partial-retrieve-of-example-set/m-p/42320

    So that kind of operator you describe would be really helpful in my opinion. At least it can give an alternative option for handling such data.

  • MartinLiebig · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist

    Hi @kypexin,

    Well, doing something like this on a repository item is harder, since those are either DB tables or serialized Java objects. Doing it on flat files is easier.

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin · Moderator, RapidMiner Certified Analyst, Member, Posts: 290, Unicorn

    But this is still a good alternative solution for the problem :) So I think I would use such an operator!

  • Telcontar120 · Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn

    I think it would be a very useful operator to add! I still miss the similar "Stream DB" operator, which used to exist but was deprecated for some unknown reason back in v7 somewhere...

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • BalazsBarany · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Posts: 949, Unicorn

    The most elegant and RapidMiner-like solution would be a two-part operator like Handle Exception.

    In the left part you could put an operator returning an example set (like Read Database, Read CSV, Read Hive). The output would be processed by the right part in batches of selectable size.

    The input operators could be updated to know about the batch size (internal parameter?) and would adhere to it, saving memory in the process. Operators that aren't yet updated would just return everything, so this would be like Loop Batches with the data source outside.
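    A rough Python illustration of the pattern (all names made up): the "left part" is any source that can hand out batches, the "right part" processes one batch at a time, so memory stays bounded regardless of the total size.

      # Sketch of the two-part idea: a batch-producing source feeding a per-batch processing step.
      from typing import Callable, Iterable
      import pandas as pd

      def run_batched(source: Iterable[pd.DataFrame],
                      process: Callable[[pd.DataFrame], None]) -> None:
          """Feed each batch from the source into the processing step."""
          for batch in source:
              process(batch)

      def count_rows(batch: pd.DataFrame) -> None:
          print(len(batch), "rows in this batch")

      # example: the source is a chunked CSV reader, the step just counts rows per batch
      run_batched(pd.read_csv("texas.csv", chunksize=500_000, dtype=str), count_rows)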

    Balázs

  • MartinLiebig · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist

    @BalazsBarany,

    that would require that the user specifies the batching manually on the left-hand side?

    If yes, can't you do this with a normal Loop operator?

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • BalazsBarany · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Posts: 949, Unicorn

    As long as the input operators don't have the optional batch size, this is just an elegant frontend for our existing loops or Loop Batches.

    My idea is more futuristic.
