text mining breaks text into symbols

imkeimke MemberPosts:12Contributor I
edited December 2018 inHelp

Hello,

I'm trying to do text mining with a large excel table with many text entrys (many words in a cell). Unfortunately my "Process Documents from Files" breaks my text into a mixture of symbols and letters.unbenannt_2.png

I aktually do not know why it is doing that, but also my word list looks like that.

unbenannt.png

Can you tell why this happens?

Thanks a lot

Imke

Answers

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University ProfessorPosts:3,404RM Data Scientist

    Hi,

    can you make sure that you tried the right encoding? it looks like this was stored with UTF-8 (Mac/Linux Standard) but read with a Windows Encoding.

    Br,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
    sgenzer
  • imkeimke MemberPosts:12Contributor I

    Hello,

    underneath you can see my process. Maybe you can tell, what is wrong, and why it crashes rapid miner, too.

    Thank you

    Imke







    <运营商激活d="true" class="process" compatibility="7.5.003" expanded="true" name="Process">

    <运营商激活d="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="380" y="34">
    <列出关键= " text_directories " >




    <运营商激活d="true" breakpoints="after" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
    <运营商激活d="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34"/>
    <运营商激活d="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="380" y="34"/>
    <运营商激活d="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="514" y="34"/>
    <运营商激活d="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="648" y="34"/>












    <连接from_op = "公关ocess Documents from Files" from_port="example set" to_port="result 1"/>
    <连接from_op = "公关ocess Documents from Files" from_port="word list" to_port="result 2"/>








  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University ProfessorPosts:3,404RM Data Scientist

    For Reference,

    the issue was that the files in the folder were Excel-Files. Read Document from Files is only able to handle pure text files. The attached process soled the issue.

    Best,

    Martin







    <运营商激活d="true" class="process" compatibility="9.0.002" expanded="true" name="Process">

    <运营商激活d="true" class="read_excel" compatibility="9.0.002" expanded="true" height="68" name="Read Excel" width="90" x="45" y="187">


    Use Import Wizard to read your file

    <运营商激活d="true" class="nominal_to_text" compatibility="9.0.002" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="187">
    Make sure the text is tagged as text

    <运营商激活d="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="187">


    <运营商激活d="true" breakpoints="after" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <运营商激活d="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
    <运营商激活d="true" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="313" y="34"/>
    <运营商激活d="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="34"/>
    <运营商激活d="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>













    <连接from_op = "公关ocess Documents from Data" from_port="example set" to_port="result 1"/>








    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
Sign InorRegisterto comment.