"pdf tokenization (?)"

margkw Member Posts: 14 Contributor II
edited June 2019 in Help
Hello guys,
I am totally new here and to RapidMiner!
I have an assignment to finish, so there is not much time for me to explore RapidMiner. I will post my question here and hope to find the answer. It might be trivial; I apologise for that.

I have several PDF files. I want to tokenize them, i.e. to see every word that appears and how many times each word appears.
For example, let's assume that a PDF contains the word "process". I want to see how many times this word appears, and I want to do the same for all the words in the PDF file. Is tokenization what I need to do? If yes, how do I do it? If not, what do you propose?
Thank you in advance!

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Yes, it is. Just load the data with Read Documents from Files, connect it to Process Documents, inside Process Documents add the Tokenize operator, and finally connect the output ports of the Process Documents operator to the process output.

    To get the aforementioned operators, you have to install the Text Processing extension.

    Best, Marius
  • margkw Member Posts: 14 Contributor II
    Thank you very much. I will try that out and get back to you if I have any problems. Many, many thanks! :)
  • margkw Member Posts: 14 Contributor II
    It's me again! How can I insert the Tokenize operator inside Process Documents?

    And what should the process output be?

    Sorry for the stupid questions. I am completely new to this.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    these are very important concepts which are rather easy to understand, but hard to explain here in text form. I would like to refer you to the video tutorials on our website; there is one complete section about text processing.

    You will find links to the tutorials in the post linked in my signature.

    Happy Mining!
    -Marius
  • margkw Member Posts: 14 Contributor II
    Thanks! I will be back with more questions! :D
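
For readers who want to see what "tokenize and count" means in practice, here is a minimal Python sketch of the idea behind the question. It is not RapidMiner code and not how the Text Processing extension is implemented. It assumes the text has already been extracted from the PDF into a plain string, and it splits on anything that is not a letter, which is an assumption made for this illustration. The names `tokenize`, `word_counts` and `sample` are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the lowercase word tokens of `text`.

    Assumption for this sketch: a token is a run of letters, so digits,
    punctuation and whitespace all act as separators.
    """
    # [^\W\d_]+ matches one or more letters (word characters minus digits and "_").
    return re.findall(r"[^\W\d_]+", text.lower())


def word_counts(text):
    """Count how many times each token occurs in `text`."""
    return Counter(tokenize(text))


# Stand-in for text already extracted from a PDF (hypothetical sample).
sample = "The process starts. Each process step feeds the next process."
counts = word_counts(sample)

print(counts["process"])  # occurrences of one specific word
for word, n in counts.most_common(3):
    print(word, n)  # the most frequent words with their counts
```

Looking up `counts["process"]` answers the "how many times does this word appear" question for a single word, and iterating over `counts` gives the same figure for every word in the text.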