Split a single XML file into several docs or an example set

mohammadreza (Member, Posts: 23, Contributor I)
edited February 2020 in Help
Hi. I am new to the RapidMiner text plugin.

I have an XML file consisting of document elements; each document tag contains one document, as follows:


[XML sample lost in the forum formatting; each document tag wraps one document's text.]
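
Roughly, the layout is like this (the element names below are only placeholders, not the real tag names in my file):

    <documents>
      <document>
        1
        ...............
      </document>
      <document>
        2
        ...............
      </document>
      ...
    </documents>
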
I think I have to split them first and extract documents to be able to construct the word vector. Is there any way to do that?


Answers

  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist)
    Is there any reason not to use Read XML and convert the example set to a document afterwards?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza (Member, Posts: 23, Contributor I)
    Thanks Martin,

    I think the Read XML operator is a wise choice, but I need to do some text classification after that. That's why I wanted to work with documents through the text plugin. Assuming that, following your explanation, I use Read XML, is there any way to work with the text plugin? I mean, how should I connect the output of Read XML to some operator like "Process Document" or any other operator that would allow me to do the tokenization and stemming and build the word vector?

    Thanks
  • fras (Member, Posts: 93, Contributor II)
    Hi, try this as a starting point:

    [RapidMiner process XML lost in the forum formatting; only two of its query values remain legible, one matching <Family>...</Family> regions and one matching <document>...</document> regions, apparently the Cut Document queries used to split the file.]
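
    Outside RapidMiner, the same cutting idea can be sketched in a few lines of Python; the file name and tag name below are only placeholders, and the sketch assumes document elements are not nested:

        import re

        # Placeholder file name; point this at the real corpus file.
        INPUT_FILE = "corpus.xml"

        # Non-greedy match of everything from an opening <document ...> tag to the
        # matching </document>, i.e. one chunk per document, mirroring the cut queries.
        DOC_PATTERN = re.compile(r"<document\b.*?</document>", re.DOTALL)

        with open(INPUT_FILE, encoding="utf-8") as f:
            xml_text = f.read()

        chunks = DOC_PATTERN.findall(xml_text)
        print("cut into", len(chunks), "document chunks")
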
  • mohammadreza (Member, Posts: 23, Contributor I)
    Thank you indeed, Fras. I will try your solution and let you know about the results ASAP. I think your solution is more efficient, if I can adapt it, because I designed the RM process with the Read XML operator (as Martin suggested) and ran out of memory even with 32 GB of RAM. My XML file is only about 160 MB, but the de-serialization in Read XML takes a lot of RAM. So I want to try your approach and report back whether it can handle my 160 MB file. Thanks again.
  • mohammadreza (Member, Posts: 23, Contributor I)
    Hi Fras. I am trying your solution for reading my 160 MB XML file. I got stuck dealing with the following XML schema, which has more than one node in each document.


    [Second XML sample lost in the forum formatting; it showed a document 1 and a document 2, each containing several child text nodes rather than a single one.]
    In the previous solution (Martin's solution) I used the Read XML operator and set the "XPath for attributes" property to extract all of the nodes for each document. But in the new solution, as you explained, the "Cut Document" operator nicely separates each document, which is then passed through the "Loop Collection" operator. This is where I need to extract all of the nodes in the document (e.g. via XPath) and convert them to one attribute for my example set. But I cannot get all of the nodes for each document. Do you think there is any solution for this?

    Thanks in advance.
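
    In plain Python terms (with made-up tag names standing in for my real schema), this is roughly what I am trying to do with each cut-out document:

        import xml.etree.ElementTree as ET

        # A made-up document chunk, standing in for one piece coming out of Cut Document;
        # the real file uses its own tag names.
        chunk = """
        <document>
          <id>2</id>
          <title>..........</title>
          <body>...............</body>
          <body>...............</body>
        </document>
        """

        root = ET.fromstring(chunk)

        # Take the text of every child node (everything except the id here) and join it
        # into the single text attribute wanted for the example set.
        parts = [node.text.strip() for node in root if node.tag != "id" and node.text]
        full_text = " ".join(parts)
        print(full_text)
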
  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist)
    Hi,

    Looks to me like an XPath expression can solve this.
    Have you tried the import wizard?

    Sadly I have no time to try it myself, but I guess it works.

    best
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza (Member, Posts: 23, Contributor I)
    Thanks for the answer, Martin; XPath does solve this problem in the Read XML operator. But Read XML cannot handle a 160 MB file, so I am playing around with Fras' solution, and I need to use XPath in that one. Any ideas, please?
  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist)
    The file size should be no problem for Read XML.
    The wizard might get slow because it caches the file at some point, but it still works.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza (Member, Posts: 23, Contributor I)
    Hi Martin. That's interesting about Read XML. But I used it on my 160 MB of XML data and waited for 2 days and 4 hours (52 hours in total) on a system with 32 GB of memory. After 52 hours the process was still busy with Read XML, so I stopped it, thinking that something was wrong. Do you think I should have waited longer, or is something wrong with big files? As an experiment, I split the file into several pieces and got results after 9 hours. In neither case did I use the import wizard, and since the split files produced results, I am sure my XPath expressions are correct. This experiment might be helpful for others. Please let me know what you think about it.
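
    For comparison, a streaming pass over the file, sketched below in Python with a placeholder file and tag name, touches one document at a time, which is essentially what splitting the file into pieces achieved:

        import xml.etree.ElementTree as ET

        count = 0
        # iterparse yields each element as soon as its closing tag has been read,
        # so finished documents can be freed instead of keeping the whole file in memory.
        for event, elem in ET.iterparse("corpus.xml", events=("end",)):
            if elem.tag == "document":      # placeholder tag name
                count += 1
                # ... hand elem off for tokenization / word-vector building here ...
                elem.clear()                # free the finished subtree
        print(count, "documents processed")
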
  • xmlguy (Member, Posts: 1, Contributor I)
    Why not use a tool designed for splitting XML? Over on Stack Exchange, an answer to the following question lists some tools:
    http://stackoverflow.com/questions/700213/xml-split-of-a-large-file/7823719#7823719