text mining pdf articles omitting references

mlubicz · June 2019

In a previous post (https://community.www.turtlecreekpls.com/discussion/53107/text-mining-of-multiple-pdf-files-with-separate-key-word-counts), an approach for mining multiple PDF files was described.

If the PDFs are articles, is there a way to exclude the References section from being mined? The section usually starts with the same term (e.g. 'References'), so I tried to define a Split or a specific Tokenize option, but I failed.

I would be grateful for any suggestion.

kayman · June 2019

Yeah, an option to filter documents based on content would be nice, but as far as I know it's not available.

A workaround could be as follows: use the Documents to Data operator and filter on the reference keyword. Next, convert back to documents (or deal with it as data).

I've attached a very simplified example that might get you started.

<?xml version="1.0" encoding="utf-8"?> <process version="9.2.001">
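For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the workaround described above: treat each line of a document as a data row and keep only the rows that come before the reference keyword. This is not the attached process. The function name `drop_from_keyword` and the sample rows are hypothetical, and the sketch assumes the heading stands alone on its own line.

```python
def drop_from_keyword(lines, keyword="References"):
    """Keep the rows that come before the first line equal to the keyword."""
    kept = []
    for line in lines:
        # Assumption: the section heading stands alone on its line.
        if line.strip().lower() == keyword.lower():
            break
        kept.append(line)
    return kept


# Hypothetical rows, as they might look after converting a document to data.
rows = ["Introduction", "Some body text.", "References", "[1] A. Author, 2001."]
main_rows = drop_from_keyword(rows)
```

If the keyword never appears, all rows are kept, so documents without a reference list pass through unchanged.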

mlubicz · June 2019

Thank you for the inspiration. In fact, the task was to split each PDF document into main text and references and run text mining on the main text only, while the references should be saved as an example set (e.g. xlsx), a desirable by-product.

I tried experimenting with Split File by Content and Split File by Point, which do the same thing; however, it is more convenient to have one file rather than multiple segments.
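The split into main text and references could be sketched in plain Python roughly as below. This is only an illustration under stated assumptions, not a RapidMiner process: it assumes the PDF text has already been extracted to a string, that the section heading is a line containing only "References", and it writes CSV as a stand-in for the xlsx export. The names `split_article` and `references_to_csv` are hypothetical.

```python
import csv
import io
import re

# Assumption: the heading is a line that contains only the word "References".
HEADING = re.compile(r"^[ \t]*references[ \t]*$", re.IGNORECASE | re.MULTILINE)


def split_article(text):
    """Return (main_text, reference_lines), split at the last heading match."""
    matches = list(HEADING.finditer(text))
    if not matches:
        return text, []  # no heading found: everything is main text
    last = matches[-1]
    main_text = text[:last.start()].rstrip()
    refs = [ln.strip() for ln in text[last.end():].splitlines() if ln.strip()]
    return main_text, refs


def references_to_csv(refs, handle):
    """Write one reference per row (CSV stand-in for the xlsx by-product)."""
    writer = csv.writer(handle)
    writer.writerow(["reference"])
    writer.writerows([r] for r in refs)


sample = "Title\nBody text here.\nReferences\n[1] First entry.\n[2] Second entry.\n"
main_text, refs = split_article(sample)
buffer = io.StringIO()
references_to_csv(refs, buffer)
```

Splitting at the last match rather than the first is a design choice: it avoids cutting the article short if a line reading "References" happens to appear earlier in the body. Each document stays in one piece, with `main_text` going on to text mining and `refs` saved separately.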
