Compare and analyze text documents

TobiasNehrig Member Posts: 41 Guru
edited December 2018 in Help

Hi Experts,

I'm experimenting with text mining and analysis. I've created a neighborhood co-occurrence list from one text and am trying to analyze it and compare it with a larger corpus.

My Example Set looks like:

Row No. | Document | Word1 | Word2 | n
1 | 1 | aaa | bbb | 2
2 | 1 | bbb | ddd | 3
3 | 1 | aaa | bbb | 4
4 | 2 | aaa | ccc | 3
5 | 2 | aaa | bbb | 4
6 | 2 | ccc | aaa | 3

This is my process:

<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<parameter key="prune_above_absolute" value="3000"/>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
http://www.spiegel.de"/>
<parameter key="prune_above_absolute" value="3000"/>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="45" y="34">
<connect from_op="Spon Process Documents from Data" from_port="example set" to_op="Splitting" to_port="in 1"/>
<connect from_op="Spon Process Documents from Data (2)" from_port="example set" to_op="Splitting (2)" to_port="in 1"/>

I'm out of ideas on how to compare and analyze them.

Does anyone have an idea how I can do this?

Regards

Tobias

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,437 RM Data Scientist

    Hi @TobiasNehrig,

    Are you working with texts or tuples? And does the order matter? I guess the solution is something like Pivot + Cross Distance or Aggregate + Cross Distance, but the precise solution depends on your use case.

    Cheers,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
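
Whether the order matters changes what counts as the same tuple. As a rough illustration only, here is a plain Python sketch (not a RapidMiner process) over the example rows from the question. It sums n for duplicate (Document, Word1, Word2) rows, roughly what an aggregation step would do, and can optionally treat (aaa, ccc) and (ccc, aaa) as one pair. The helper name `aggregate_pairs` and its `ordered` flag are hypothetical.

```python
from collections import defaultdict

# Example rows from the question: (Document, Word1, Word2, n)
rows = [
    (1, "aaa", "bbb", 2),
    (1, "bbb", "ddd", 3),
    (1, "aaa", "bbb", 4),
    (2, "aaa", "ccc", 3),
    (2, "aaa", "bbb", 4),
    (2, "ccc", "aaa", 3),
]


def aggregate_pairs(rows, ordered=True):
    """Sum n per (document, word pair). Hypothetical helper.

    With ordered=False the two words are sorted first, so
    (ccc, aaa) is counted together with (aaa, ccc).
    """
    totals = defaultdict(int)
    for doc, w1, w2, n in rows:
        pair = (w1, w2) if ordered else tuple(sorted((w1, w2)))
        totals[(doc, pair)] += n
    return dict(totals)


ordered_counts = aggregate_pairs(rows, ordered=True)
unordered_counts = aggregate_pairs(rows, ordered=False)
```

With ordered pairs, document 2 keeps (aaa, ccc) and (ccc, aaa) as separate entries with a count of 3 each; with unordered pairs they merge into a single entry with a count of 6.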
  • TobiasNehrig Member Posts: 41 Guru

    Hi @mschmitz,

    As I understand it, these should be tuples.

    Regards

    Tobias

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,437 RM Data Scientist

    Ok,

    I would concatenate the two words, Pivot, Replace Missing Values with 0, and then use Cross Distance.

    Best,

    Martin

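
The steps suggested in this answer can be sketched outside RapidMiner in plain Python, purely to show the data flow: concatenate the two words into one key, pivot to one row per document with one column per word pair, fill missing pairs with 0, and compute a distance between the document vectors. The function names, the `_` separator, and the choice of Euclidean distance are assumptions made for this sketch, not settings taken from the thread.

```python
import math

# Example rows from the question: (Document, Word1, Word2, n)
rows = [
    (1, "aaa", "bbb", 2),
    (1, "bbb", "ddd", 3),
    (1, "aaa", "bbb", 4),
    (2, "aaa", "ccc", 3),
    (2, "aaa", "bbb", 4),
    (2, "ccc", "aaa", 3),
]


def pivot(rows):
    """Return (sorted pair columns, {document: count vector})."""
    counts = {}
    for doc, w1, w2, n in rows:
        key = f"{w1}_{w2}"  # concatenate the two words into one attribute
        doc_counts = counts.setdefault(doc, {})
        doc_counts[key] = doc_counts.get(key, 0) + n  # sum duplicate rows
    columns = sorted({key for d in counts.values() for key in d})
    # Pairs that never occur in a document are filled with 0.
    vectors = {doc: [d.get(c, 0) for c in columns] for doc, d in counts.items()}
    return columns, vectors


def euclidean(a, b):
    """Euclidean distance between two equally long count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


columns, vectors = pivot(rows)
distance = euclidean(vectors[1], vectors[2])
```

For the example data the columns are aaa_bbb, aaa_ccc, bbb_ddd and ccc_aaa, document 1 becomes [6, 0, 3, 0], and document 2 becomes [4, 3, 0, 3]. A smaller distance means the two documents share more co-occurrence pairs with similar counts; for documents of very different length, normalizing the counts first may be worth considering.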