"read in csv file skip first n lines"

josef_frankjosef_frank MemberPosts:3Contributor I
edited May 2019 inHelp

Hi,

when importing csv files, is there any method to skip lines in the beginning of a file?
Something that behaves like R's "read.table"-option "skip".

Quite some time I'd like to read in csv files that have a structure like windows' ini-files:

[Header]
some garbage here
some non-relevant info here
[Data]
Var1,Var2,Var3,...
0,1,0,...
....

I'd like to start reading the file in the line "Var1,Var2,.."
Just can't find any option to get this done :-(

Best
Josef
Tagged:

Best Answer

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn
    Solution Accepted

    @sgenzerSo Martin posted a comment in the related product idea that I think showed me the solution. You can use the "Edit Annotations" parameter to manually specify that certain lines are to be treated as comments (even if they don't have the comment character, which is where I was getting hung up before). See the following:

    annotations.PNG

    That effectively skips those lines on import and it starts reading on the specified line, and interprets the correct number of columns! And at that point, the extra work with the macros to rename the attributes isn't needed either, since you can specify the relevant line to be read as names--but I'm sure going to save that process anyways because it could come in handy in other more complex data files in the future. Thanks again!

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:578Unicorn
    Hi Josef,

    I think step 3 of 5 of the import wizard offers this as you can set lines as name, comment or unit but I haven't used it before.
    Give that a try and see if it does what you expect.

    Best,
    JEdward.

  • el_chiefel_chief MemberPosts:63Maven
    ya, this has been annoying me as well. I usually delete the header line in notepad first.

    Simon, can you pleeeeeeeeeeeease fix this? And in Excel importer too

    josef.frank wrote:

    Hi,

    when importing csv files, is there any method to skip lines in the beginning of a file?
    Something that behaves like R's "read.table"-option "skip".

    Quite some time I'd like to read in csv files that have a structure like windows' ini-files:

    [Header]
    some garbage here
    some non-relevant info here
    [Data]
    Var1,Var2,Var3,...
    0,1,0,...
    ....

    I'd like to start reading the file in the line "Var1,Var2,.."
    Just can't find any option to get this done :-(

    Best
    Josef

  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:578Unicorn
    Hi guys,

    I gave a quick test with a csv with the following format:

    Line1- RandomText
    Line2 - [Var1], [Var2] (column headings)
    Line3 - RandomText
    Line4 - RandomText
    Line5 - RandomText
    Line6+ - [data1], [data2] (data)

    The below process worked fine for me. What have I missed?






























  • josef_frankjosef_frank MemberPosts:3Contributor I
    JEdward wrote:

    Hi JEdward

    The below process worked fine for me. What have I missed?
    The suggested method promising at a first glance
    (had already tried that before the original post).
    This procedure however has two drawbacks:
    1st: it gets quite unwieldy when 80 or 100 lines have
    to be annotated manually.
    2nd: It just doesn't work as intended:

    Maybe youl'd like to try another file:
    none1,none2,none3
    nothing
    var1,var2,var3,var4,var5
    1,2,3,4,5
    2,3,4,5,6
    3,4,5,6,7
    When annotating the first two lines as comments and the 3rd line
    with "Name", in the example above, the result set starts indeed with var1,
    but the number of fields that are read in is still determined by the first line,
    so this time the number of fields is determined by a comment:o
    Perhaps I'm just using it incorrectly. Could you give any hint?

    Best Josef
  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:578Unicorn
    Hi Josef,

    点1。——我认为这是笨拙的。一种around this would be using Excel & concatenate formulae to build the number of lines that you need and then pasting it into the XML of the process.
    For example::
    A B C D
    1 =CONCATENATE(A1,B1,C1)
    2 =CONCATENATE(A2,B2,C2)
    3 =CONCATENATE(A3,B3,C3)
    4 =CONCATENATE(A4,B4,C4)
    Not the most elegant solution & having an option to type a range of rows in RapidMiner would be very nice.

    Point 2. -Ah! I see the problem, that's a pain.
    The below worked for me after I named the required columns in "data_set_meta_data_information". Try this.


































  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn
    So, is it really true that in 6 years, no one has found a better way to solve this problem??? I just added a product idea for the "skip n lines" for Read CSV.
    这个程序ears to be the only relevant thread from the forum using the obvious search terms.
    In the meantime, if anyone else has found a more effective way to do this that doesn't involve going into csv files manually to add comment characters (or simply delete the header lines), I would love to hear about it.@sgenzer, any tricks up your sleeve here?
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager

    darn you,@Telcontar120- now that you've thrown down the gauntlet I am going to tackle this. You know my weakness for bizarre ETL puzzles.:)Stay tuned.

    Scott

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager

    how's this? test2.csv is attached












    set n rows to %{skip}











    get number of examples %{NEx}





    %{startRow} = skip + 1




    filter startRow to NEx












    test2.csv 143B
    Telcontar120
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn

    Brilliant, I really appreciate you working on this@sgenzer!很有创意的使用宏来做到这一点。这个程序roach is promising, but I think it only works if there are enough delimiters in the first line so it reads the correct number of columns. If you modify the csv so the first line has only text (unfortunately more typical with title rows), then it only reads one column in, and the rest of the process works but only on that first column. And after a bit of testing it seems like it will generally read however many columns as it detects from the first row. So if you can figure out how to tell RapidMiner how many columns to read (not based on the first row of data), then something like this should do the trick.

    Happy Thanksgiving!

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
    sgenzer
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager








    <运营商激活= =“false”类"set_macro" compatibility="7.6.001" expanded="true" height="68" name="Set Macro" width="90" x="45" y="187">


    set n rows to %{skip}













    <运营商激活= =“false”类"extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="581" y="187">


    get number of examples %{NEx}

    <运营商激活= =“false”类"generate_macro" compatibility="7.6.001" expanded="true" height="82" name="Generate Macro" width="90" x="715" y="187">



    %{startRow} = skip + 1

    <运营商激活= =“false”类"filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="Filter Example Range" width="90" x="849" y="187">


    filter startRow to NEx













  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager

    oh darn - of course I work on the next one and then see your note. Leave it to@mschmitzto one-up me.:)Perfect solution.

    Happy Thanksgiving.


    Scott

  • matt_yauchmatt_yauch MemberPosts:1Contributor I

    I don't see how the macros can manage a blank row in the csv file. Looking to skip the first 2 lines (first is comment, second is blank) without editing the csv file. I didn't have luck with blank row via annotations, but perhaps I overlooked a setting. Any help with skipping blank lines between comment and column names?

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University ProfessorPosts:3,404RM Data Scientist

    Hi@matt_yauch,

    you can manually say in the operator parameters to skip first x lines and just read it. RM will then automatically detect the att names etc.

    BR,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
    sgenzer
Sign InorRegisterto comment.