"Stem (Dictionary) Indonesia Language with regex"

baybay · July 2018

Hello,

I have a problem using regex with the Stem (Dictionary) operator for the Indonesian language.
Here is an example sentence in Indonesian:

saya sangat senang dengan kalian-kalian, tampilannya dan suaranya sangat bagus

and I want to turn it into this:

saya sangat senang dengan kalian, tampil dan suara sangat bagus

It works when I use stem entries like this:

kalian:kalian.*
tampil:tampil.*
suara:suara.*

But it fails when I try to use other regex patterns:

:-(.*)$
:(ku|mu|nya|lah|kah|tah|pun)$

How can I use stemming other than with entries of the form "text:text.*"?

Please help me with this case.

Thanks

Best Regards,

Bay
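
The "stem:pattern" entries in the question can be illustrated outside RapidMiner with a small Python sketch. It assumes each entry maps a fixed stem to a regex that must match the whole token, which is consistent with the `.*` at the end of the working entries; the function names and the simple letters-only tokenizer are hypothetical and not part of RapidMiner. Under that assumption, an entry such as `:(ku|mu|nya|lah|kah|tah|pun)$` has an empty stem before the colon, so it cannot express "keep the word but drop the suffix".

```python
import re

# Dictionary entries copied from the question, in "stem:pattern" form.
RULES_TEXT = """\
kalian:kalian.*
tampil:tampil.*
suara:suara.*
"""

def load_rules(text):
    """Parse 'stem:pattern' lines into (stem, compiled regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stem, pattern = line.split(":", 1)
        rules.append((stem, re.compile(pattern)))
    return rules

def stem_token(token, rules):
    """Return the stem of the first rule whose pattern matches the whole token."""
    for stem, pattern in rules:
        if pattern.fullmatch(token):  # assumption: whole-token matching
            return stem
    return token  # no rule applies: keep the token unchanged

def stem_text(text, rules):
    """Split on non-letters (hypothetical tokenizer) and stem each token."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [stem_token(t, rules) for t in tokens]

rules = load_rules(RULES_TEXT)
sentence = ("saya sangat senang dengan kalian-kalian, "
            "tampilannya dan suaranya sangat bagus")
print(" ".join(stem_text(sentence, rules)))
```

With this tokenizer the hyphen splits "kalian-kalian" into two tokens, so "kalian" appears twice in the output; collapsing the repetition would need a separate step.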

sgenzer · July 2018

Hello @baybay - hmm, I don't speak Indonesian and am very puzzled about what you're trying to do with your first RegEx expression:

-(.*)$

The second one seems OK. If you could post your XML and your sample data set, it would be a lot easier to help. Also tagging my go-to RegEx guru @Telcontar120.

Scott

yyhuang · July 2018

Hi @baybay,

You definitely can use a rule-based stemmer. A preferred way is the "Stem Tokens Using ExampleSet" operator from the Operator Toolbox extension.

<operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
<operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="136">
<operator activated="true" class="operator_toolbox:stem_tokens_using_exampleset" compatibility="1.2.000" expanded="true" height="82" name="Stem Tokens Using ExampleSet" width="90" x="514" y="34">
<connect from_op="Stem Tokens Using ExampleSet" from_port="document" to_port="document 1"/>

A comprehensive study of stemming on Indonesia

https://pdfs.semanticscholar.org/8ed9/c7d54fd3f0b1ce3815b2eca82147b771ca8f.pdf

HTH,

YY

baybay · July 2018

Hi @sgenzer,

I have attached the dataset, the XML, and the stemming file.

Thanks

Bay

baybay · July 2018

Hi @yyhuang,

So we must enter the stem entries one by one, like "suara:suara.*"?

I just want the stemming to be done automatically, like at this link.

Thanks

Bay
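
For the automatic approach asked about here, a rough Python sketch of suffix stripping with the two patterns from the original question may help show what they would do. This is only an illustration under stated assumptions, not a RapidMiner operator: the function names, the tokenizer, and the order of the two rules are hypothetical. The first rule drops everything from a hyphen onward (so "kalian-kalian" becomes "kalian"), the second drops one trailing particle or possessive suffix.

```python
import re

# Patterns taken from the question; rule order and tokenizer are assumptions.
REDUPLICATION = re.compile(r"-.*$")  # "-kalian" in "kalian-kalian"
PARTICLE_SUFFIX = re.compile(r"(ku|mu|nya|lah|kah|tah|pun)$")

def strip_affixes(token):
    """Apply the reduplication rule, then strip one trailing suffix."""
    token = REDUPLICATION.sub("", token)
    stripped = PARTICLE_SUFFIX.sub("", token)
    # Guard: if the whole word would be removed, keep it unchanged.
    return stripped or token

def stem_sentence(sentence):
    """Tokenize on letters (keeping hyphenated words together) and stem."""
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sentence)
    return " ".join(strip_affixes(t) for t in tokens)

sentence = ("saya sangat senang dengan kalian-kalian, "
            "tampilannya dan suaranya sangat bagus")
print(stem_sentence(sentence))
# saya sangat senang dengan kalian tampilan dan suara sangat bagus
```

Two limits are visible in this sketch. "tampilannya" becomes "tampilan" rather than "tampil", because the suffix "-an" is not in the pattern and would need a further rule. Blind suffix stripping also over-stems words that merely end in one of the listed strings, for example "buku" becomes "bu", so a root-word list or a minimum stem length is usually needed on top of the regex.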
