Text mining in UTF-8

i_anicka Member Posts: 2 Contributor I
edited November 2018 in Help

Hello all,

I need to use RapidMiner for text mining in Cyrillic.
I tried setting the encoding to UTF-8, but the results are displayed as garbled characters instead of Cyrillic words.

Thanks,

Best Answer

  • i_anicka Member Posts: 2 Contributor I
    Solution Accepted

    Hi guys,

    I have solved my problem.

    I had set the utf-8 encoding everywhere except on the process level.

    I changed this and it works!

    Thank you all for your replies.

    Ana
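For readers who want to see why one mismatched stage is enough to break the output, here is a minimal Python sketch. It does not use RapidMiner at all; `decode_stage` is a hypothetical helper standing in for any step that turns bytes into text, and Latin-1 is only an assumed example of a non-UTF-8 default.

```python
def decode_stage(raw: bytes, encoding: str) -> str:
    """Stand-in for one processing stage that interprets bytes as text."""
    return raw.decode(encoding)


if __name__ == "__main__":
    # The file on disk holds UTF-8 bytes for a Cyrillic word.
    raw = "текст".encode("utf-8")

    # Stage configured for UTF-8: the word comes back intact.
    print(decode_stage(raw, "utf-8"))

    # Stage left on a different default (Latin-1 assumed here): each
    # two-byte UTF-8 sequence is read as two unrelated characters.
    print(repr(decode_stage(raw, "latin-1")))
```

The second call does not raise an error, it just returns the wrong characters, which is why a forgotten setting can go unnoticed until the results are displayed.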


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,421 RM Data Scientist

    Hi,

    Could you maybe post an example?
    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    It could be that your original document isn't in UTF-8, but in another encoding.

    One way to be absolutely sure is to create a loop that uses macros to change the encoding parameter of your Process Documents operator, and then to look at all the resulting outputs. The one that looks 'right' tells you the encoding of the original document.
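The same idea can be tried outside RapidMiner before building the loop. The following is a rough Python sketch, not a RapidMiner process: the candidate list is an assumption (common Cyrillic encodings) and `decode_with_candidates` is a hypothetical helper name. It decodes the same bytes with each candidate so the outputs can be compared by eye.

```python
# Assumed candidates: encodings commonly used for Cyrillic text.
# Extend or replace this list for other languages.
CANDIDATES = ["utf-8", "cp1251", "koi8-r", "iso8859-5", "cp866"]


def decode_with_candidates(raw: bytes, candidates=CANDIDATES) -> dict:
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding at all.
            results[enc] = None
    return results


if __name__ == "__main__":
    # Simulate a document that was actually saved as Windows-1251.
    sample = "Привет, мир".encode("cp1251")
    for enc, text in decode_with_candidates(sample).items():
        print(f"{enc:10} {text!r}")
```

Several single-byte encodings will decode without any error and still produce Cyrillic letters, just the wrong ones, so the final choice has to be made by reading the output, as suggested above.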

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Agreed. I just did a quick check and there's no problem with Cyrillic in UTF-8.





    Scott

  • arunasethupathy Member Posts: 4 Contributor I

    I want to use the Tamil language for text mining.

    Where did you change the UTF-8 option for this?

    I have tried setting it at the process level, but it does not work.

    Could anybody please help?

  • arunasethupathy Member Posts: 4 Contributor I

    To change the encoding option to UTF-8 (for processing the Tamil language):

    I changed the encoding to UTF-8 in the RapidMiner Studio preferences.

    I then simply read the document using the Read Document operator in the Text Mining extension.

    But it is not working; the screenshot is attached (Doc7.docx).

    Kindly help me to sort out this problem.

    Thank you

    Doc7.docx 131.3K
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @arunasethupathy, Tamil is not a language I have worked with before. Could you please post your XML process AND your text document (in Tamil) so I can take a look?

    Thank you.

    Scott

  • arunasethupathy Member Posts: 4 Contributor I

    Sir,

    Please find attached the sample Tamil text document.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Thank you @arunasethupathy. Can you please also post your XML process?

    Scott
