How to work with a very large .csv file?
Hi folks, it looks like there has been a similar post or two in the past, but years old at this point so I thought it would be helpful to refresh...
I need to load a huge .csv file (4.72 GB, ~23 million lines) and break it up into smaller .csv files according to one polynominal attribute. This is public State of Texas data, so the attribute I want to split on is "County", and I want the new .csv files written to my local disk.
What's the best, most computationally efficient way to do this?
Thanks!!
cc @sgenzer
Answers
Hi @ncjanes,
This is an interesting topic. You have several options. Here are some ideas off the top of my head.
HTH!
YY
Hi,
Just my 5 cents... I would say the most computationally efficient way is the second one suggested by YY.
You can actually break it down into two parts:
1) Use a simple Python (or whatever else) script to read the arbitrarily large file in chunks and save them into an SQL table; see the Python example here: http://odo.pydata.org/en/latest/perf.html
2) Then run SQL queries on subsets, using any SQL-related tool or RapidMiner.
This approach also keeps the data available for any future processing, including use within different RM processes.
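As a rough illustration of the two steps above, here is a minimal sketch that uses only the Python standard library (`csv` and `sqlite3`) rather than the odo library from the link. The function name `load_csv_into_sqlite`, the table name `records`, and the chunk size are all made up for this example. It assumes a well-formed, UTF-8 file with a header row, every row having the same number of fields, and column names that contain no double quotes.

```python
import csv
import sqlite3
from itertools import islice


def load_csv_into_sqlite(csv_path, db_path, table="records",
                         chunk_rows=50_000, index_column=None):
    """Stream a CSV into an SQLite table chunk by chunk; return rows loaded."""
    conn = sqlite3.connect(db_path)
    total = 0
    try:
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Untyped columns named after the CSV header (hypothetical schema).
            cols = ", ".join(f'"{name}"' for name in header)
            marks = ", ".join("?" for _ in header)
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            while True:
                # Only chunk_rows rows are held in memory at any time.
                chunk = list(islice(reader, chunk_rows))
                if not chunk:
                    break
                conn.executemany(
                    f'INSERT INTO "{table}" VALUES ({marks})', chunk)
                conn.commit()
                total += len(chunk)
        if index_column is not None:
            # An index makes the later per-subset queries much cheaper.
            conn.execute(
                f'CREATE INDEX IF NOT EXISTS "idx_{table}_{index_column}" '
                f'ON "{table}" ("{index_column}")')
            conn.commit()
    finally:
        conn.close()
    return total
```

Step 2 is then an ordinary query, for example `SELECT * FROM records WHERE "County" = ?` with the county name passed as a parameter, either from Python or from any SQL tool pointed at the same database file.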
Vladimir
http://whatthefraud.wtf
Hi,
this reminds me of an operator I wanted to implement: Stream Lines.
Basically, it reads a file line by line and hands each line to you as a document. You could then filter it and move on.
Now I just need to find the time to implement this.
@kypexin: Do you think this is useful?
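Outside RapidMiner, the same line-by-line idea can be sketched in plain Python. This is not the Stream Lines operator, just an illustration of streaming a file and routing each row to an output file by one column, which is what the original question asks for. The function name `split_csv_by_column` and the file-naming rule are assumptions for this example; it also assumes a UTF-8 file with a header row, and that no two distinct values map to the same sanitized file name.

```python
import csv
import os
import re
from contextlib import ExitStack


def split_csv_by_column(csv_path, out_dir, column="County"):
    """Stream csv_path once, writing one CSV per distinct value of `column`.

    Returns a dict mapping each value to the number of rows written for it.
    """
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    writers = {}
    with ExitStack() as stack:
        src = stack.enter_context(
            open(csv_path, newline="", encoding="utf-8"))
        reader = csv.reader(src)
        header = next(reader)
        key_index = header.index(column)
        for row in reader:  # one row in memory at a time
            key = row[key_index]
            writer = writers.get(key)
            if writer is None:
                # Hypothetical naming rule: replace unsafe characters with "_".
                safe = re.sub(r"[^A-Za-z0-9_-]+", "_", key) or "blank"
                path = os.path.join(out_dir, f"{safe}.csv")
                out = stack.enter_context(
                    open(path, "w", newline="", encoding="utf-8"))
                writer = csv.writer(out)
                writer.writerow(header)  # every output file gets the header
                writers[key] = writer
                counts[key] = 0
            writer.writerow(row)
            counts[key] += 1
    return counts
```

The sketch keeps one output file open per distinct value. Texas has 254 counties, so on a system with a low open-file limit you may need to raise that limit or reopen files in append mode instead.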
~Martin
Dortmund, Germany
Hi Martin,
Well, some time ago I was asking more or less the same question about processing a long csv: https://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Forum/Partial-retrieve-of-example-set/m-p/42320
So that kind of operator you describe would be really helpful in my opinion. At least it can give an alternative option for handling such data.
Vladimir
Hi @kypexin,
well, doing something like this on a repository item is harder, since these are either DB tables or serialized Java objects. Doing it on flat files is easier.
~Martin
But this is still a good alternative solution for the problem, so I think I would use such an operator!
Vladimir
I think it would be a very useful operator to add! I still miss the similar "Stream DB" which used to exist but was deprecated for some unknown reason back in v 7 somewhere...
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
The most elegant and RapidMiner-like solution would be a two-part operator like Handle Exception.
In the left part you could put an operator returning an example set (like Read Database, Read CSV, Read Hive). The output would be processed by the right part in batches of selectable size.
The input operators could be updated to know about the batch size (internal parameter?) and would adhere to it, saving memory in the process. Operators that aren't yet updated would just return everything, so this would be like Loop Batches with the data source outside.
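To make the batching idea concrete outside RapidMiner, here is a small Python analogy. It is only a sketch of the concept, not how any RapidMiner operator works: `read_batches` plays the role of a batch-aware input (the "left part") and `process_in_batches` applies a caller-supplied function to each batch (the "right part"). Both function names are invented for this example, and a UTF-8 CSV with a header row is assumed.

```python
import csv
from itertools import islice


def read_batches(csv_path, batch_size):
    """Yield lists of at most batch_size rows (as dicts) from a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            # Pull the next batch from the stream; earlier batches are freed.
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch


def process_in_batches(csv_path, batch_size, handle_batch):
    """Apply handle_batch to each batch in turn and collect the results."""
    return [handle_batch(batch)
            for batch in read_batches(csv_path, batch_size)]
```

For example, `process_in_batches(path, 10_000, len)` returns the size of each batch while never holding more than 10,000 rows in memory at once.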
Balázs
@BalazsBarany,
that would require the user to specify the batching manually on the left-hand side?
If so, couldn't you do this with a normal Loop operator?
Best,
Martin
As long as the input operators don't have the optional batch size, this is just an elegant frontend for our existing Loop or Loop Batches operators.
My idea is more futuristic.