Text extraction of key themes/words from a series of PDF files

pimlico35 (Member, Posts: 4, Newbie)
Hi guys,

I'm new to this and struggling a little bit :)

I just wanted some easy (explicit) steps to help me achieve what I want to do, which is:

I have a series of reports, mostly PDFs;
- I want to extract key themes or words that recur throughout the reports, for example 'serious accident' or 'safety'

What I have done so far is to put all these files into a new repository. I have tried to use operators to read through the files, tokenise, etc., but I'm getting lost in translation, so to speak ;)

- I'm not sure whether I have to convert the PDFs into Word files first, if that makes it easier to get them into RapidMiner; but that seems to defeat the whole purpose...

- I then want a document or table of these commonly occurring words so I can see how often they are used. Later I can also check the least used words in the same output...
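
To make the target output concrete, here is a rough sketch in plain Python (not RapidMiner) of the kind of frequency table I mean. It assumes the text has already been pulled out of the PDFs somehow; `reports`, `tokenize` and `term_counts` are made-up names for illustration only, and the two sample strings are invented stand-ins for real report text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_counts(documents, max_n=2):
    """Count every phrase of 1..max_n consecutive words across all documents."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical stand-ins for text already extracted from the PDF reports.
reports = [
    "A serious accident occurred. Safety procedures were reviewed.",
    "Safety training reduces the risk of a serious accident.",
]

counts = term_counts(reports)

# Most frequent words/phrases first, then the least frequent ones.
most_common = counts.most_common(10)
least_common = sorted(counts.items(), key=lambda kv: kv[1])[:10]

for term, freq in most_common:
    print(f"{term}\t{freq}")
```

Counting two-word phrases as well as single words is what lets something like 'serious accident' show up as one entry; a real run would probably also need to drop very common words such as 'a' and 'the'.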

I would really appreciate any help or pointing me in the direction of videos that explicitly look at this.

Many thanks!