"Stem (Dictionary) Indonesia Language with regex"

baybay  Member  Posts: 3  Contributor I
edited June 2019 in Help

Hello,

I have a problem using regex with the Stem (Dictionary) operator for the Indonesian language.
Here is an example sentence in Indonesian:

saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus

and I want to turn it into:

saya sangat senang dengan kalian, tampil dan suara sangat bagus

This works when I use stem rules like these:

kalian:kalian.*
tampil:tampil.*
suara:suara.*

But it fails when I try to use other regex patterns:

:-(.*)$
:(ku|mu|nya|lah|kah|tah|pun)$

How can I use stemming other than with rules of the form "text:text.*"?

Please help me with this case :)


Thanks

Best Regards,

Bay

Answers

  • sgenzer  Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator  Posts: 2,959  Community Manager

    hello @baybay - hmm, I don't speak Indonesian and am very puzzled about what you're trying to do with your first RegEx expression:

    -(.*)$

    The second one seems OK. If you could post your XML and your sample data set, it would be a lot easier to help. Also tagging my go-to RegEx guru @Telcontar120 :)
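
    As a rough illustration of what the two patterns from the question match, here is a minimal Python sketch that applies them as plain substitutions to the sample sentence. This uses Python's `re` module, not the RapidMiner operator, and the tokenizer and the `strip_once` helper are assumptions made for the example.

```python
import re

sentence = ("saya sangat senang dengan kalian-kalian, "
            "tampilannya dan suaranya sangat bagus")

# Assumed tokenizer: keep letters and hyphens so "kalian-kalian" stays one token.
tokens = re.findall(r"[A-Za-z-]+", sentence)

# The two patterns from the question, without the leading ":".
reduplication = re.compile(r"-(.*)$")                   # everything from the first hyphen on
particle = re.compile(r"(ku|mu|nya|lah|kah|tah|pun)$")  # a trailing particle or possessive

def strip_once(token):
    """Remove a reduplicated tail, then one trailing particle (hypothetical helper)."""
    token = reduplication.sub("", token)
    token = particle.sub("", token)
    return token

stripped = [strip_once(t) for t in tokens]
print(" ".join(stripped))
```

    Under these assumptions "kalian-kalian" becomes "kalian" and "suaranya" becomes "suara", but "tampilannya" only becomes "tampilan", because neither pattern covers the suffix "-an". Patterns this broad can also cut into words that merely happen to end in one of the listed syllables.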

    Scott

  • yyhuang  Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member  Posts: 363  RM Data Scientist

    Hi @baybay,

    You definitely can use a rule-based stemmer. A preferred way is the "Stem Tokens Using ExampleSet" operator from the Operator Toolbox extension.

    <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
    <operator activated="true" class="operator_toolbox:stem_tokens_using_exampleset" compatibility="1.2.000" expanded="true" height="82" name="Stem Tokens Using ExampleSet" width="90" x="514" y="34">
    <connect from_op="Stem Tokens Using ExampleSet" from_port="document" to_port="document 1"/>

    A comprehensive study of stemming for Indonesian:

    https://pdfs.semanticscholar.org/8ed9/c7d54fd3f0b1ce3815b2eca82147b771ca8f.pdf
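
    To make the rule-based idea concrete, here is a minimal Python sketch of a lookup-style stemmer driven by a rule table. It assumes each rule has the "stem:pattern" form from the question and that a token is replaced by the stem only when the pattern matches the whole token. The names `RULES_TEXT`, `parse_rules`, and `stem_token` are hypothetical, and this is not the operator's actual implementation.

```python
import re

# Hypothetical rule table in the "stem:pattern" style used in the question.
RULES_TEXT = """
kalian:kalian.*
tampil:tampil.*
suara:suara.*
"""

def parse_rules(text):
    """Turn 'stem:pattern' lines into (stem, compiled pattern) pairs."""
    rules = []
    for line in text.strip().splitlines():
        stem, pattern = line.split(":", 1)
        rules.append((stem.strip(), re.compile(pattern.strip())))
    return rules

def stem_token(token, rules):
    """Return the stem of the first rule whose pattern matches the whole token."""
    for stem, pattern in rules:
        if pattern.fullmatch(token):
            return stem
    return token  # no rule applies, keep the token unchanged

rules = parse_rules(RULES_TEXT)
tokens = ["saya", "kalian-kalian", "tampilannya", "suaranya", "bagus"]
stemmed = [stem_token(t, rules) for t in tokens]
print(stemmed)
```

    If rules behave this way, a line with an empty stem such as ":-(.*)$" has nothing to replace the token with, which may be why it does not produce the expected result. Stripping suffixes without listing every word needs a substitution step rather than a lookup.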

    HTH,

    YY

  • baybay  Member  Posts: 3  Contributor I

    Hi @sgenzer,

    I have sent the dataset, the XML, and the stemming file as attachments.

    Thanks

    Bay

  • baybay  Member  Posts: 3  Contributor I

    Hi @yyhuang,

    So we must enter the stem rules one by one, like "suara:suara.*"?

    I just want to remove the affixes automatically, as described at this link.

    Thanks

    Bay
