"Intelligent Text Extraction"
Hi everyone,
this is a very basic question. I am trying to extract text from various locally stored HTML files. The main structure of the part of the text that I want to extract from each document is similar but not 100% identical. Is there any possibility to define a start text and an end text (i.e. 2-3 words that are always at the beginning or end) AND define some "keywords" that must occur in the text between start and end, to tell RapidMiner that it is extracting the correct text? The problem that I am encountering at the moment with "Cut Document" and its Regular Region parameter is that the start of my text CAN occur a few times before the actual text part that I really want to have.
Example:
.....
...
...
So what I need is the second "This is an example Text" as the starting point and all the HTML text down to "Unique End Tag". If I use "Cut Document", I have the problem that I cannot write a regex that distinguishes between the first and second occurrence of my starting text, as the beginning of each HTML string can be completely different. I do have some unique words that could specify the region that I want to extract (in my example, "Keyword"). I was playing with the Information Extraction Plugin, as I could do some annotation there, but I couldn't figure out how this would work for my purpose.
Is there something like an "Intelligent Text Extraction" operator in RapidMiner? Any other suggestions welcome!
Answers
Hi,
this seems tricky. My approach would be either a (tricky) regex or something like HTML to XML and then Process XSLT?
~Martin
Dortmund, Germany
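For what it's worth, a "tricky" regex of the kind mentioned above could look roughly like the following. This is only a sketch in Python, outside RapidMiner, and the three marker strings are hypothetical values borrowed from the example in the question. The idea is a negative lookahead that forbids the match from stepping over another start marker, so the regex engine is forced to begin at the last start text before the keyword. Java's regex engine also supports lookaheads, but whether this pattern fits into the Regular Region parameter of "Cut Document" is an assumption that would need checking.

```python
import re

# Hypothetical markers, taken from the example in the question.
START = "This is an example Text"
KEYWORD = "Keyword"
END = "Unique End Tag"


def build_pattern(start, keyword, end):
    """Build a regex matching start ... keyword ... end with no second start in between."""
    s, k, e = (re.escape(x) for x in (start, keyword, end))
    # (?:(?!s|e).)*?  consumes characters lazily, but never across another
    #                 start marker or an end marker, so an earlier start text
    #                 that is not followed by the keyword cannot match.
    # (?:(?!e).)*?    consumes characters lazily up to the first end marker.
    return re.compile(f"{s}(?:(?!{s}|{e}).)*?{k}(?:(?!{e}).)*?{e}", re.DOTALL)


def extract(html):
    """Return the matched region, or None if the document has no such region."""
    match = build_pattern(START, KEYWORD, END).search(html)
    return match.group(0) if match else None
```

Calling `extract(html)` on a document where the start text appears twice should return only the region that begins at the occurrence closest to the keyword.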
Hi Martin,
I know, normally a regex would be the best solution if I had some structure that let me distinguish between my different start texts. However, I don't know whether a very complex regex that contains multiline lookahead and lookbehind features will run into performance issues, as I have a lot of documents...
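If regex performance is the worry, the same selection can be sketched with plain substring searches instead of lookarounds: find the keyword first, then search backwards for the nearest start text and forwards for the next end text. The snippet below is a minimal Python illustration of that idea, not a RapidMiner operator; the function name and its parameters are made up for this example.

```python
def extract_region(html, start, keyword, end):
    """Return the text from the last `start` before a `keyword` up to the next `end`.

    Returns None when no keyword has a start marker before it and an end
    marker after it.
    """
    kw = html.find(keyword)
    while kw != -1:
        s = html.rfind(start, 0, kw)             # nearest start before the keyword
        e = html.find(end, kw + len(keyword))    # first end after the keyword
        # Accept only if no end marker sits between start and keyword,
        # otherwise the keyword belongs to a different region.
        if s != -1 and e != -1 and html.find(end, s, kw) == -1:
            return html[s:e + len(end)]
        kw = html.find(keyword, kw + 1)          # try the next keyword occurrence
    return None
```

Each call only performs a handful of linear scans over the document, so there is no backtracking to worry about.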
For XSLT, I doubt that it would work, as my text has no unique tags but is randomly formatted with inline classes which do not necessarily contain similar attributes...
To get back to my original question: Are you aware of any operator within the IE plugin that could address this problem? Or is this really something that I will have to do with "Cut Document" and the Regular Region parameter?
Or is there any possibility that I could extract one text as a reference and "train" RapidMiner to detect this part in all other files due to high similarity?
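The reference-text idea can at least be approximated without any training: cut out every candidate region between a start text and the next end text, score each one against a manually extracted reference, and keep the best match. The sketch below does this in Python with the standard library's `difflib.SequenceMatcher`. It is an illustration only, not an IE plugin feature; the helper names are hypothetical, and `SequenceMatcher` can be slow on long texts, so this may not scale to a large corpus as written.

```python
from difflib import SequenceMatcher


def candidate_regions(html, start, end):
    """Yield every substring running from a start marker to the next end marker."""
    pos = html.find(start)
    while pos != -1:
        stop = html.find(end, pos + len(start))
        if stop == -1:
            return  # no end marker left, so no further candidates
        yield html[pos:stop + len(end)]
        pos = html.find(start, pos + 1)


def most_similar_region(html, reference, start, end):
    """Return (region, score) for the candidate most similar to `reference`."""
    best, best_score = None, 0.0
    for region in candidate_regions(html, start, end):
        # ratio() is a similarity in [0, 1]; autojunk is disabled so that
        # frequent characters in long texts are not ignored.
        score = SequenceMatcher(None, reference, region, autojunk=False).ratio()
        if score > best_score:
            best, best_score = region, score
    return best, best_score
```

A threshold on the returned score could be used to flag documents where no candidate resembles the reference closely enough.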