"Extract text from crawled example set"
Hello,
I am new to RapidMiner. I have created a process to crawl the web (which works perfectly fine) and want to extract information from the resulting example set. I'm using the Cut Document operator, but I get an error stating that it expects a document instead of an example set.
I need to extract a paragraph with a consistent start and end expression from all the web pages that I have crawled.
Can someone advise how I should go about doing this?
Thanks.
Answers
You need to take 2 steps here. First, make sure the attribute you want to cut is of text type. RapidMiner will define these as nominal by default, so use the Nominal to Text operator to let RapidMiner know it needs to treat the attribute as a text field.
Next, you need to convert your data set (examples) to documents, using the Data to Documents operator. Now your examples are no longer considered examples but a list of documents, and you can apply text logic to the fields defined as text.
Depending on your workflow, you can either loop through all of the examples, apply the document logic to each one and convert back to examples, or convert everything to documents first and loop through those.
It sounds more scary than it is...
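If it helps to see the idea outside RapidMiner, here is a rough Python sketch of what the conversion step amounts to. It is only an analogy, not RapidMiner code: the rows, the URLs and the attribute name `page_text` are made up for illustration.

```python
# Analogy only: plain Python, not RapidMiner operators.
# Hypothetical crawl result: one dict per example; "page_text" is a made-up attribute name.
examples = [
    {"url": "http://example.com/a", "page_text": "<p>first page</p>"},
    {"url": "http://example.com/b", "page_text": "<p>second page</p>"},
]


def data_to_documents(rows, text_attribute):
    """Turn each row into a standalone text 'document' (a plain string here)."""
    return [str(row[text_attribute]) for row in rows]


# The table of examples becomes a list (collection) of documents.
documents = data_to_documents(examples, "page_text")
print(len(documents), "documents")
```

The point is only that after this step you are holding a list of texts rather than a table, which is why the text operators start to accept the data.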
Thank you for the guidance, Kayman
I tried all of that, however I get a new error - "Expected Document but received IOObjectCollection"
Sharing the screen shot. Do advise.
I see. That is because you are actually providing a collection of documents to an operator that expects a single document.
There are 2 ways to deal with this. The first is to use the Loop Collection operator. It will iterate over all your documents, apply some logic to each one, and return the converted data. In other words, this is where you insert your Cut Document operator: connect the Loop Collection operator after the Data to Documents one, and move the Cut Document operator inside the loop.
Another (and maybe easier) way is to use the Process Documents from Data operator. You can deselect the "create word vector" option and keep the rest at default settings. It will automatically loop through all your examples, apply the operators you place inside it to the attribute you defined as text, and return the final result in one go as a new example set.
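To make the single document versus collection difference concrete, here is a small Python sketch of the same logic. Again, this is an analogy under assumptions and not how RapidMiner implements it: `cut_document`, `cut_collection`, the sample pages and the start/end expressions are all hypothetical.

```python
import re


def cut_document(text, start_expr, end_expr):
    """Return the text between the first start_expr and the next end_expr, or None."""
    # DOTALL lets the segment span several lines; .*? stops at the first end match.
    pattern = re.compile(start_expr + r"(.*?)" + end_expr, re.DOTALL)
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def cut_collection(documents, start_expr, end_expr):
    """Apply cut_document to every document, like a loop over a collection."""
    return [cut_document(doc, start_expr, end_expr) for doc in documents]


# Made-up pages standing in for the crawled documents.
pages = [
    "<html><div class='intro'>First paragraph.</div></html>",
    "<html><div class='intro'>\nSecond paragraph.\n</div></html>",
    "<html><p>No matching block here.</p></html>",
]

segments = cut_collection(pages, r"<div class='intro'>", r"</div>")
for segment in segments:
    print(segment)
```

`cut_document` works on one text at a time, so handing it the whole list directly fails, which is roughly what the "Expected Document but received IOObjectCollection" error is telling you. The loop is what bridges the two.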