Extract e-mail addresses out of a PDF

marcel_hanselma Member Posts: 3 Learner I
edited May 2020 in Help
Hello dear RapidMiner community,
I have a PDF full of addresses (name, street, phone number, e-mail). I want to extract only the e-mail addresses and store them, one per line, in an Excel or CSV file. What is the best approach? (I am really a RapidMiner newbie.)
Greetings, Marcel

Best Answer


  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    If your PDF has a nicely formatted table, the PDF Table Extraction extension will do this with almost no time or effort. Otherwise, you can use "Read Document" from the Text Processing extension and do a bit of gymnastics parsing the text.
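
The text-parsing step mentioned above usually comes down to a regular expression. Here is a minimal sketch of that idea in Python, outside RapidMiner, assuming the PDF text has already been read into a string. The names `EMAIL_RE`, `extract_emails`, `write_emails_csv` and the sample text are made up for illustration, and the pattern is a deliberate simplification that covers common addresses rather than every valid one.

```python
import csv
import io
import re

# Simplified pattern (an assumption, not a full RFC 5322 validator):
# local part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return the e-mail addresses found in text, without duplicates,
    in order of first appearance (comparison is case-insensitive)."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


def write_emails_csv(emails, handle):
    """Write one address per row to an open text handle, with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["email"])
    for address in emails:
        writer.writerow([address])


# Made-up sample standing in for the text read from the PDF.
sample_text = """John Doe, Main Street 1, 555-0100, john.doe@example.com
Jane Roe, Side Road 2, 555-0101, jane_roe@mail.example.org"""

emails = extract_emails(sample_text)

# Written to an in-memory buffer here; for a real file, pass
# open("emails.csv", "w", newline="") instead.
buffer = io.StringIO()
write_emails_csv(emails, buffer)
print(buffer.getvalue())
```

The same kind of pattern can probably be adapted for use inside a RapidMiner process on the output of "Read Document", although the exact operators and settings depend on how the PDF text comes out.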

  • marcel_hanselma Member Posts: 3 Learner I
    That "bit of gymnastics" is what I am missing. I can read the document, but then I fail to extract all the e-mail addresses. The PDF is not nicely formatted.
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Can you provide your .pdf file so that we can see how to extract the e-mail addresses?

    You can send it via private message if it is not confidential...


  • marcel_hanselma Member Posts: 3 Learner I
    Wow, thank you Lionel.
    It worked flawlessly. :-)