Cut Document
CaptainChaos
Member, Posts: 17, Maven
Hi Guys,
I think I have some kind of trivial problem but couldn't figure out how to solve it.
I am working with the Reuters data set, and I have a stemmed version consisting of one big document which contains all the other documents. So it is a big .txt file in which the beginning and ending of each document is marked by the word "reuter". I tried to use the "Cut Document" operator to split them. As the query expression I used "reuters". The problem is that all documents now have the same name (label), which makes it hard to work with them.
Does anybody know how to give different names to all documents, like 1, 2, 3, 4, 5 for example, and then write/export them to Excel or a database?
Thanks in advance
Cheers
Answers
Here is a little example of how you could write the single documents as files:
If you prefer a list-based output like Excel or a database, this is the way to go:
Hope these examples help you a little. Feel free to ask if you have further questions.
Regards
Matthias
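For readers who want to see the list-based idea outside RapidMiner, here is a rough Python sketch, not the process from the post above. It assumes the documents have already been split into a list of strings, gives each one a running ID (1, 2, 3, ...) and writes one row per document as CSV, which Excel can open. The function names `documents_to_rows` and `write_csv` and the sample texts are made up for illustration.

```python
import csv
import io


def documents_to_rows(documents):
    """Attach a running ID (starting at 1) to every document."""
    return [{"id": i, "text": doc} for i, doc in enumerate(documents, start=1)]


def write_csv(rows, out):
    """Write one row per document to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample data standing in for the already split documents.
documents = ["first document", "second document", "third document"]

rows = documents_to_rows(documents)

# An in-memory buffer keeps the sketch self-contained; a real run would
# pass open("documents.csv", "w", newline="") instead.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

The ID column plays the role of the distinct name asked for in the question, so each document can be told apart after the export.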
First of all, thank you very much for your help; my model now works a lot better than before.
But I would like to ask you one more question. In the next pic I copied your code and marked one line which is different from mine. Could you explain that line to me?
<operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
<operator activated="true" class="text:create_document" compatibility="5.1.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
<operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="30">
<operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
<operator activated="true" class="generate_attributes" compatibility="5.1.011" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="30">
<operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="581" y="165">
Kind regards Roberto
You can use the code style by adding CODE tags around it. It's the third symbol from the right, just above the smileys.
The highlighted line is my cut expression. You said there is a word marking the beginning and the end of each document. Since I manually typed some example document contents, I simply used "marker" for this. I think it should be "reuters" in your case. The regular expression used to cut the text collects anything between two marker words (the first capturing group) and also uses \s* to cut off whitespace surrounding the content (a newline between the marker word and the beginning of the actual document content, for example).
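To make the idea concrete, here is a small Python sketch of an expression of this kind. The exact pattern is an assumption written to match the description above, not a copy of the expression in the process, and the sample text and the name `MARKER_PATTERN` are made up. It assumes every document has its own opening and closing marker word.

```python
import re

# Assumed pattern: marker word, optional whitespace, the document content
# as the first capturing group (non-greedy), optional whitespace, marker word.
# re.DOTALL lets "." also match newlines inside a document.
MARKER_PATTERN = re.compile(r"marker\s*(.*?)\s*marker", re.DOTALL)

# Hypothetical sample text: two documents, each wrapped in its own
# opening and closing marker.
text = (
    "marker\n"
    "First document.\n"
    "marker\n"
    "marker\n"
    "Second document,\nspanning two lines.\n"
    "marker"
)

# findall returns the content of the first capturing group for each match.
documents = MARKER_PATTERN.findall(text)
```

Because the \s* parts sit outside the capturing group, the newlines right after the opening marker and right before the closing marker are dropped, while line breaks inside a document are kept. With "reuters" as the marker word, only the two literal words in the pattern would change.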
Hope this clarifies things.
Regards
Matthias