Split a single XML file into several docs or an example set

mohammadreza (Member, Posts: 23, Contributor I)
edited February 2020 in Help
Hi. I am new to the RapidMiner text plugin.

I have an XML file consisting of document elements; each document tag contains one document, as follows:


[XML sample lost in the forum formatting; each document tag wraps one document's text.]
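
Roughly, the layout is like this (the element names below are only placeholders, not the real tag names in my file):

    <documents>
      <document>
        1
        ...............
      </document>
      <document>
        2
        ...............
      </document>
      ...
    </documents>
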
I think I have to split them first and extract documents to be able to construct the word vector. Is there any way to do that?


Answers

  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist)
    Is there any reason not to use Read XML and convert the example set to a document afterwards?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza (Member, Posts: 23, Contributor I)
    Thanks Martin,

    I think the Read XML operator is a wise choice, but I need to do some text classification after that. That's why I wanted to work with documents through the text plugin. Assuming that, following your explanation, I use Read XML, is there any way to work with the text plugin? I mean, how should I connect the output of Read XML to some operator like "Process Document" or any other operator that would allow me to do the tokenization and stemming and build the word vector?

    Thanks
  • fras (Member, Posts: 93, Contributor II)
    Hi, try this as a starting point:

    [RapidMiner process XML lost in the forum formatting; only two of its query values remain legible, one matching <Family>...</Family> regions and one matching <document>...</document> regions, apparently the Cut Document queries used to split the file.]
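
    Outside RapidMiner, the same cutting idea can be sketched in a few lines of Python; the file name and tag name below are only placeholders, and the sketch assumes document elements are not nested:

        import re

        # Placeholder file name; point this at the real corpus file.
        INPUT_FILE = "corpus.xml"

        # Non-greedy match of everything from an opening <document ...> tag to the
        # matching </document>, i.e. one chunk per document, mirroring the cut queries.
        DOC_PATTERN = re.compile(r"<document\b.*?</document>", re.DOTALL)

        with open(INPUT_FILE, encoding="utf-8") as f:
            xml_text = f.read()

        chunks = DOC_PATTERN.findall(xml_text)
        print("cut into", len(chunks), "document chunks")
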
  • mohammadreza (Member, Posts: 23, Contributor I)
    Thank you indeed, Fras. I will try your solution and let you know about the results ASAP. I think your solution is more efficient, if I can adapt it, because I designed the RM process with the Read XML operator (as Martin suggested) and ran out of memory even with 32 GB of RAM. My XML file is only about 160 MB, but the de-serialization in Read XML takes a lot of RAM. So I want to try your approach and report back whether it can handle my 160 MB file. Thanks again.
  • mohammadreza (Member, Posts: 23, Contributor I)
    Hi Fras. I am trying your solution for reading my 160 MB XML file. I got stuck dealing with the following XML schema, which has more than one node in each document.


    [Second XML sample lost in the forum formatting; it showed a document 1 and a document 2, each containing several child text nodes rather than a single one.]
    In the previous solution (Martin's solution) I used the Read XML operator and set the "XPath for attributes" property to extract all of the nodes for each document. But in the new solution, as you explained, the "Cut Document" operator nicely separates each document, which is then passed through the "Loop Collection" operator. This is where I need to extract all of the nodes in the document (e.g. via XPath) and convert them to one attribute for my example set. But I cannot get all of the nodes for each document. Do you think there is any solution for this?

    Thanks in advance.
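
    In plain Python terms (with made-up tag names standing in for my real schema), this is roughly what I am trying to do with each cut-out document:

        import xml.etree.ElementTree as ET

        # A made-up document chunk, standing in for one piece coming out of Cut Document;
        # the real file uses its own tag names.
        chunk = """
        <document>
          <id>2</id>
          <title>..........</title>
          <body>...............</body>
          <body>...............</body>
        </document>
        """

        root = ET.fromstring(chunk)

        # Take the text of every child node (everything except the id here) and join it
        # into the single text attribute wanted for the example set.
        parts = [node.text.strip() for node in root if node.tag != "id" and node.text]
        full_text = " ".join(parts)
        print(full_text)
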
  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist)
    Hi,

    Looks to me like an XPath expression can solve this.
    Have you tried the import wizard?

    Sadly I have no time to try it myself, but I guess it works.

    best
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza (Member, Posts: 23, Contributor I)
    Thanks for the answer, Martin; XPath does solve this problem in the Read XML operator. But Read XML cannot handle a 160 MB file, so I am playing around with Fras' solution, and I need to use XPath in that one. Any ideas, please?
  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist)
    The file size should be no problem for Read XML.
    The wizard might get slow because it caches the file at some point, but it still works.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza (Member, Posts: 23, Contributor I)
    Hi Martin. That's interesting about Read XML. But I used it on my 160 MB of XML data and waited for 2 days and 4 hours (52 hours in total) on a system with 32 GB of memory. After 52 hours the process was still busy with Read XML, so I stopped it, thinking that something was wrong. Do you think I should have waited longer, or is something wrong with big files? As an experiment, I split the file into several pieces and got results after 9 hours. In neither case did I use the import wizard, and since the split files produced results, I am sure my XPath expressions are correct. This experiment might be helpful for others. Please let me know what you think about it.
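
    For comparison, a streaming pass over the file, sketched below in Python with a placeholder file and tag name, touches one document at a time, which is essentially what splitting the file into pieces achieved:

        import xml.etree.ElementTree as ET

        count = 0
        # iterparse yields each element as soon as its closing tag has been read,
        # so finished documents can be freed instead of keeping the whole file in memory.
        for event, elem in ET.iterparse("corpus.xml", events=("end",)):
            if elem.tag == "document":      # placeholder tag name
                count += 1
                # ... hand elem off for tokenization / word-vector building here ...
                elem.clear()                # free the finished subtree
        print(count, "documents processed")
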
  • xmlguy (Member, Posts: 1, Contributor I)
    Why not use a tool designed for splitting XML? Over on Stack Exchange, an answer to the following question lists some tools:
    http://stackoverflow.com/questions/700213/xml-split-of-a-large-file/7823719#7823719