Problem with Mandarin text mining - HanMiner

YoGVA Member Posts: 2 Newbie
edited December 2019 in Help
Hi everyone,

I am a newbie here but here is my situation.
I need to conduct a qualitative content analysis of a large number of Chinese reports. However, RapidMiner needs an extension to handle Chinese characters. I found one called HanMiner, posted by another member.

I followed the instructions and installed the extension via GitHub, but the extension does not show up in RapidMiner ...

Any ideas on how to solve this issue? Or is there another way to text mine Chinese documents?

Any help would be much appreciated!
Yoyo

Best Answer

  • jwpfau Employee, Member Posts: 245 RM Engineering
    edited May 25 Solution Accepted
    Hi,

    the third-party HanMiner extension has no option to define the encoding of the imported file. As a workaround you could use macros:

    <?xml version="1.0" encoding="UTF-8"?><process version="10.1.002"> ... https://us.v-cdn.net/6030995/uploads/editor/sf/nq6mm23abhpa.txt"/> ...

    Greetings,
    Jonas
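
If the source file can be converted before it reaches RapidMiner, another way around the missing encoding option is to re-encode the file to UTF-8 first. The following is only a minimal sketch outside RapidMiner, not part of the macro workaround above. The function name `reencode_to_utf8` is hypothetical, and the default source encoding `gb18030` is an assumption; the actual encoding of the reports has to be checked.

```python
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "gb18030") -> None:
    """Read src with source_encoding and write the same text to dst as UTF-8.

    source_encoding is an assumption: gb18030 is common for Simplified
    Chinese files on Windows, but the real encoding must be verified.
    """
    # Decoding raises UnicodeDecodeError if source_encoding does not fit the bytes.
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text, encoding="utf-8")
```

A call such as `reencode_to_utf8(Path("report.txt"), Path("report_utf8.txt"))` (both file names are placeholders) leaves the original file untouched and writes a UTF-8 copy next to it.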

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @YoGVA I'm sorry that no one has chimed in here. Is this still an issue?

    Scott
  • YoGVA Member Posts: 2 Newbie
    Hi Scott,

    Yes it is.

    I'm trying to install the following but no success so far.
    https://github.com/joeyhaohao/rapidminer-Hanminer
    Nothing happens at step 4 when I try to install the extension.

    I am also trying to look at other options but it is harder than I expected...

    Any help would be great, cheers!
    Yoyo
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @YoGVA hmm never seen that repo before!

    I'm going to cc my good friend and colleague @yyhuang who will know a LOT more about this than I do.

    Scott

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,368 RM Data Scientist
    Hi @YoGVA,
    here is a compiled version of the GitHub repository, which you can just unzip and copy to .RapidMiner/extensions. This works, but I have not tested the operators, of course.

    Best,
    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 363 RM Data Scientist
    edited January 2020
    Thanks for sharing the compiled extension, Dr. @mschmitz!
    After installing manually by copying the unzipped .jar file into my local extensions folder C:\Users\Yy\.RapidMiner\extensions and restarting, everything is working fine. Hi @YoGVA, you can follow the instructions here: https://community.www.turtlecreekpls.com/discussion/31996/install-extensions-manually-for-rapidminer-studio
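
The manual install described above amounts to copying the unzipped .jar into the `extensions` folder under `.RapidMiner` and restarting Studio. As a rough sketch of that copy step only: the function name `install_extension` is hypothetical, and the actual .jar file name depends on the compiled download, so it is left as a parameter.

```python
import shutil
from pathlib import Path


def install_extension(jar_path: Path, rapidminer_home: Path) -> Path:
    """Copy an extension .jar into <rapidminer_home>/extensions.

    rapidminer_home is assumed to be the user's .RapidMiner folder,
    for example Path.home() / ".RapidMiner".
    """
    extensions_dir = rapidminer_home / "extensions"
    # Create the folder if it does not exist yet.
    extensions_dir.mkdir(parents=True, exist_ok=True)
    target = extensions_dir / jar_path.name
    shutil.copy2(jar_path, target)
    return target
```

RapidMiner Studio still has to be restarted afterwards, as noted above, before the new operators appear.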


    Six new operators are added under the new extension folder "Text Miner".

    A quick test on the news data looks reasonable.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.5.001"> ... <parameter key="process_duration_for_mail" value="1"/> ...

  • ruhaila Member Posts: 48 Guru
    Hi.

    My apologies if I should have opened a new question. My question is related to the latest version of HanMiner, v1.0.3. I noticed that the READ TEXT operator is now named READ DOCUMENT.

    My problem is that when I import from a file using this operator, the Chinese characters turn into unreadable symbols.

    I have tried several approaches:
    1. I tried the different encodings listed and installed Chinese character support on my Windows PC, but it made no difference.
    2. I imported the dataset as an example set and used the DATA TO DOCUMENTS operator. However, I received an error.
    3. I tried connecting the DATA TO DOCUMENTS operator to the READ DOCUMENT operator, but this resulted in a wrong input/output connection.



    Perhaps @yyhuang can help shed some light here. Really appreciate it.

    Thank you kindly.

  • jwpfau Employee, Member Posts: 245 RM Engineering
    Hi,

    have you tried to change the encoding to UTF-8?

    Greetings,
    Jonas
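
When it is unclear which encoding a file actually uses, one quick check outside RapidMiner is to try decoding its raw bytes with a few candidates. This is only a sketch: the function name `find_working_encodings` and the candidate list are assumptions. A clean decode does not prove the encoding is right, since some encodings accept almost any byte sequence and simply produce wrong characters, so the decoded text still needs a visual check.

```python
from typing import List, Sequence


def find_working_encodings(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "gb18030", "big5"),
) -> List[str]:
    """Return the candidate encodings that decode raw without an error.

    The candidate list is an assumption covering common Chinese encodings.
    """
    working = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence.
            continue
        working.append(name)
    return working
```

If `utf-8` is not in the returned list for a file's bytes, setting the encoding to UTF-8 in the operator cannot work for that file, and it would need to be converted or read with its real encoding.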
  • ruhaila Member Posts: 48 Guru
    Hi Jonas,

    Yes, I have, but still nothing.
  • ruhaila Member Posts: 48 Guru
    Thank you Jonas. That worked fine. Didn't think of macros here. :)