PDF encoding issue

limegreenman900 (Member, Posts: 26, Contributor I)
edited November 2018 in Help

Hi everyone,

I was trying to do the simplest thing one can do: read a PDF file into RM. I have done this several times before, but now I am stuck with what I suspect is an encoding issue.

After the "Read Document" operator ("extract text only" and "use file extension as type" are both checked), I inserted a breakpoint before doing any preprocessing of the text. However, I don't get any text out of my PDF. What I get instead is something like:


¨EøC&13 #新元o / Y¢¬¬——³UUiai = UOsbsurnaºçsOæ1óŠòvç=Ë�ËïÏŸ\ä»hÙ¢óÖê‚#…¤Â¼Â�…³‹ãoZ<]TÔUt}‰`IÃ’sK—V-ý¤˜Y,+>TB(É/ÙSòƒ,]6*›-•–¾W:#—È7Ë*¢ŠÊe¿ò^YDYÙ}U„j£êAyTù`ù#µD=¬þ¶"©b{ųÊôÊ+¬Ê¯: !kJ4Gµm¥ötµ}uCõ%�—®K7YV³©fFŸ¢ßY Õ.©=bàá?SŒîÆ•Æ©ºÈº‘ºçõyõ‡Ø
Ú† �ž�kï5%4ý¦m–7Ÿlqlio™Z³lG+ÔZÚz²Í¹³mzyâò]íÔöÊö?uøuôw|¿"űN»Îå�wW&®ÜÛe֥ﺱ*|ÕöÕèjõê‰5k¶¬yÝèþ¢Ç¯g°ç‡^yïkEk‡Öþ¸®lÝD_p߶õÄõÚõ×7DmØÕÏîoê¿»1mãál {àûMÅ›Î
nßLÝlÜ<9”úO

Does anyone have an idea where the problem is? I suspect it is an encoding issue.

If I open the PDF file and copy and paste the text into a Word file, there is no problem and the text is displayed correctly.
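One possibility that this thread does not confirm: output like the sample above can also appear when the compressed bytes inside a PDF are shown as text without being parsed as PDF first. PDF page content is commonly stored in compressed streams, and no character encoding turns compressed bytes back into words. The sketch below only illustrates that general effect with Python's standard `zlib` module. The sample sentence and the `looks_like_pdf` helper are made up for illustration and are not part of RapidMiner.

```python
import zlib


def looks_like_pdf(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes start with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Made-up sample text standing in for the words on a page.
text = "Hello from page one of the report. " * 4
compressed = zlib.compress(text.encode("utf-8"))

# Treating the compressed bytes as text gives garbage under any encoding.
as_latin1 = compressed.decode("latin-1")
as_utf8 = compressed.decode("utf-8", errors="replace")

# Only decompressing first, then decoding, recovers the words.
recovered = zlib.decompress(compressed).decode("utf-8")

print(repr(as_latin1[:40]))
print(recovered[:34])
```

If this is what is happening, switching the encoding parameter would not help, and it would be worth checking that the file is actually being handled as a PDF (for example, that its name ends in `.pdf` when "use file extension as type" is checked).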

Answers

  • Thomas_Ott (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,761, Unicorn)

    You can change the encoding on the "Read Document" operator. Enable the advanced settings and an additional encoding parameter will appear in the parameter window, where you can change it.

  • limegreenman900 (Member, Posts: 26, Contributor I)

    I am working with RM 5.3, where the encoding of the "Read Document" operator is set to "System" by default. This should automatically match the correct encoding, right?

  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,381, RM Data Scientist)

    Hi,

    Usually it does. If you have a UTF-encoded file on a Windows machine, it might not work, so I would give UTF-8 a try.

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
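The mismatch Martin describes can be reproduced outside RapidMiner. This is a minimal Python sketch, assuming the "System" default on a Windows machine corresponds to a legacy code page such as cp1252. The sample string is made up for illustration.

```python
# Made-up sample text containing non-ASCII characters.
original = "Größe: 5 µm"

# The bytes as they would be stored in a UTF-8 encoded file.
utf8_bytes = original.encode("utf-8")

# Decoding with the wrong (assumed system default) code page garbles
# every non-ASCII character, while plain ASCII characters survive.
wrong = utf8_bytes.decode("cp1252")

# Decoding with the encoding the file was written in gives the text back.
right = utf8_bytes.decode("utf-8")

print(wrong)
print(right)
```

A mismatch of this kind usually leaves the plain ASCII letters readable and damages only accented or special characters. Output in which nothing is readable may point to a different cause.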
  • limegreenman900 (Member, Posts: 26, Contributor I)

    @mschmitz: I gave it a try with UTF, but it didn't work. I'll figure out another way; somehow it has to work.

    Nevertheless, thanks for your help.
