Stem (Dictionary) for the Greek language

slimik Member Posts: 7 Contributor I
edited July 2019 in Help

Hello RapidMiner community,

I'm trying to create a stemmer for the Greek language, but I can't work out a general rule for removing suffixes. For example, I want words like "fishes", "fished", "fishing", and "fishery" to be reduced to "fish". Because Greek has such a wide range of inflectional endings, it is too difficult to map every possible inflected form to its stem by hand. So I tried a rule like this:

fish:fish.*

but it didn't work. Is there any way to do that?

Thank you in advance.
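For reference, the intent of a rule like the one above can be sketched outside RapidMiner. The snippet below is a minimal Python illustration, assuming each rule line has the form `stem:pattern` and that a pattern must match the whole token. The helper names `parse_rules` and `stem_token` are hypothetical, and this is not a description of how the Stem (Dictionary) operator is implemented internally.

```python
import re

def parse_rules(lines):
    """Parse lines of the assumed form 'stem:pattern1 pattern2 ...'."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        stem, patterns = line.split(":", 1)
        for pattern in patterns.split():
            rules.append((re.compile(pattern), stem.strip()))
    return rules

def stem_token(token, rules):
    """Return the stem of the first rule whose pattern matches the whole token."""
    for pattern, stem in rules:
        if pattern.fullmatch(token):
            return stem
    return token  # no rule matched: leave the token unchanged

rules = parse_rules(["fish:fish.*"])
stems = [stem_token(t, rules) for t in ["fishes", "fished", "fishing", "fishery", "dog"]]
print(stems)  # ['fish', 'fish', 'fish', 'fish', 'dog']
```

If the same rule behaves differently inside the process, the difference probably lies in how the rule file is read or matched rather than in the regular expression itself.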

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    That should work. Can you post your process?

  • slimik Member Posts: 7 Contributor I

    <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="112" y="238">

    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\stemmer.txt"/>

    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.3.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="514" y="238">
    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\gr_stopwords.txt"/>

    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.3.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="34"/>

    <operator activated="true" class="support_vector_machine" compatibility="7.3.001" expanded="true" height="124" name="SVM (2)" width="90" x="179" y="34">

    <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="112" y="238">

    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\stemmer.txt"/>

    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.3.000" expanded="true" height="82" name="Filter Stopwords (2)" width="90" x="514" y="187">
    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\gr_stopwords.txt"/>

    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.3.000" expanded="true" height="68" name="Filter Stopwords (3)" width="90" x="514" y="34"/>

    <portSpacing port="source_input 1" spacing="0"/>

    After Process Documents from Data, the WordList (Process Documents from Data) result window is empty, so the process can't continue to the Validation step because there are no attributes. I've tried the same process with and without Stem (Dictionary), and the problem is with the stemmer for Greek words.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    The way you have it should work. I wonder if there's a bug in those stemmer/stopword dictionary operators, because they ask you to enter the file path and name of the txt file.

    Try it with an Open File operator attached to them and let's see if that works.

    <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="313" y="34">

    <parameter key="filename" value="C:\Users\klimi\Desktop\Thesis\stemmer.txt"/>

    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\stemmer.txt"/>

    <parameter key="filename" value="C:\Users\klimi\Desktop\Thesis\gr_stopwords.txt"/>

    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.3.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="648" y="289">
    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\gr_stopwords.txt"/>

    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.3.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="782" y="187"/>

    <operator activated="true" class="support_vector_machine" compatibility="7.3.001" expanded="true" height="124" name="SVM (2)" width="90" x="179" y="34">

    <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="112" y="238">

    <parameter key="filename" value="C:\Users\klimi\Desktop\Thesis\stemmer.txt"/>

    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\stemmer.txt"/>

    <parameter key="filename" value="C:\Users\klimi\Desktop\Thesis\gr_stopwords.txt"/>

    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.3.000" expanded="true" height="82" name="Filter Stopwords (2)" width="90" x="648" y="391">
    <parameter key="file" value="C:\Users\klimi\Desktop\Thesis\gr_stopwords.txt"/>

    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.3.000" expanded="true" height="68" name="Filter Stopwords (3)" width="90" x="514" y="34"/>

    <portSpacing port="source_input 1" spacing="0"/>

  • slimik Member Posts: 7 Contributor I

    With your solution, the process gets past the error I mentioned. But the stemmer still doesn't work, no matter what rule I give it. Is there any way to implement a Python-based stemmer as a RapidMiner operator?

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Should be possible but might require some work on your side of course. You need to install the Python extension from the Marketplace (https://marketplace.www.turtlecreekpls.com/UpdateServer/faces/product_details.xhtml?productId=rmx_python_scripting) and then implement the function yourself.

    Cheers,

    Ingo
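A rough sketch of what such a script could look like is given below. It assumes the Execute Python operator from the Python Scripting extension, which calls a function named `rm_main` with the input data as a pandas DataFrame. The column name `text`, the example rules, and the helper `stem_text` are placeholders invented for illustration; they are not part of the original process.

```python
import re

# Hypothetical rule table: stem -> pattern that must match the whole token.
RULES = {
    "fish": re.compile(r"fish\w*"),
    "ψάρ": re.compile(r"ψάρ\w*"),  # e.g. ψάρι, ψάρια
}

def stem_text(text):
    """Replace each whitespace-separated token that fully matches a rule by its stem."""
    out = []
    for token in str(text).split():
        for stem, pattern in RULES.items():
            if pattern.fullmatch(token):
                token = stem
                break
        out.append(token)
    return " ".join(out)

def rm_main(data):
    # 'data' is expected to be a pandas DataFrame; 'text' is an assumed column name.
    data["text"] = data["text"].apply(stem_text)
    return data
```

With this approach the stemming happens on the raw text column before tokenization, so the stemmed text would still need to be passed on to Process Documents from Data afterwards.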

  • slimik Member Posts: 7 Contributor I

    OK, thank you. I'll give it a try!

  • todimary Member Posts: 1 Contributor I

    Hello, any news with your work?
