Wordlist with column per link/text

saure Member Posts: 4 Contributor I
edited December 2018 in Help

Hello experts,

I am looking for a way to subdivide the wordlist after a web mining process.

The process I used is based on this video: https://www.youtube.com/watch?v=OXIKydgGbYk

Read Excel (5 links) > Get Pages > Data to Document > Process Documents > Wordlist

Everything works fine!

My question is:

Is it possible to subdivide the wordlist into columns, one per link from the link list?

Like this:

Word | Attribute Name | Total Occurences | Document Occurences | Link1 | Link2 | Link3 | ...

power | power | 14 | 2 | 10 | 0 | 4 | ...

Is it possible to do this?

Here is the code:

Thanks for the support!

Bernd
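
The table Bernd describes (one row per word, with the total count, the number of documents containing the word, and one count column per link) can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the target layout, not the RapidMiner process from this thread: the `documents` texts, the `tokenize` helper, and the `expanded_wordlist` function are hypothetical names made up for the example.

```python
from collections import Counter
import re

# Hypothetical stand-ins for the page text fetched from each link.
documents = {
    "Link1": "power grid power market",
    "Link2": "market auction",
    "Link3": "power auction auction",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())

def expanded_wordlist(docs):
    """Return one row per word: total count, document count, and a count per link."""
    per_link = {link: Counter(tokenize(text)) for link, text in docs.items()}
    vocabulary = sorted(set().union(*per_link.values()))
    rows = []
    for word in vocabulary:
        counts = [per_link[link][word] for link in docs]
        rows.append({
            "Word": word,
            "Total Occurrences": sum(counts),
            "Document Occurrences": sum(1 for c in counts if c > 0),
            **dict(zip(docs, counts)),  # one column per link
        })
    return rows

rows = expanded_wordlist(documents)
for row in rows:
    print(" | ".join(str(value) for value in row.values()))
```

Because the per-link columns are generated from the keys of `documents`, the same sketch works for 5 links or 40 without changes.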

Best Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,438 RM Data Scientist
    Solution Accepted

    Dear Bernd,

    This was trickier than I expected, but I think the attached process should do the trick.


    ~Martin

    <operator activated="false" class="read_excel" compatibility="7.5.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="493">

    http://www.energate-messenger.de/news/suche/index.php?cmdStartSearch=1&amp;categories[]=508&amp;pattern[]=Bundesnetzagentur&quot;"/>

    http://www.energate-messenger.de/news/suche/index.php?cmdStartSearch=1&amp;categories[]=508&amp;pattern[]=Ausschreibungen&quot;"/>

    <operator activated="true" class="append" compatibility="7.5.001" expanded="true" height="103" name="Append" width="90" x="179" y="289"/>

    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,438 RM Data Scientist
    Solution Accepted

    Dear Bernd,

    You can switch the Generate Data operators back to your Read Excel operator. That was just my quick and dirty way to get your URLs in.

    For the occurrences, you can switch from Binary Term Occurrences to Term Occurrences in Process Documents. That should do the trick.

    Best,

    Martin

  • saure Member Posts: 4 Contributor I
    Solution Accepted

    Dear Martin,

    Thanks again. That's it!

    Best regards
    Bernd
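
The difference between binary occurrences and plain occurrences that Martin points to can be illustrated with a small Python sketch. It is not RapidMiner code; the token lists and the two helper functions are hypothetical and only show why a binary vector yields 1/0 per link while an occurrence vector yields the actual count.

```python
from collections import Counter

def term_occurrences(tokens):
    """Count how often each token appears in one document."""
    return dict(Counter(tokens))

def binary_term_occurrences(tokens):
    """Record only whether each token appears in one document (always 1)."""
    return {token: 1 for token in set(tokens)}

# Hypothetical token lists for two links.
link1 = "power grid power power market".split()
link2 = "power market".split()

counts = [term_occurrences(link1), term_occurrences(link2)]
flags = [binary_term_occurrences(link1), binary_term_occurrences(link2)]

# Summing counts over links gives the total occurrences of a word;
# summing binary flags gives the number of documents containing it.
total_power = sum(doc.get("power", 0) for doc in counts)
documents_with_power = sum(doc.get("power", 0) for doc in flags)

print(counts[0]["power"], flags[0]["power"])
print(total_power, documents_with_power)
```

With binary flags the per-link value for "power" is 1 in both links, which matches the yes/no columns Bernd saw; with counts it is the number of times the word appears in each link.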


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,438 RM Data Scientist

    Dear Bernd,

    Could you explain a bit more what you mean by "subdivide"?

    BR,

    Martin

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    This is in the wrong forum. I will move it.

  • saure Member Posts: 4 Contributor I

    Hello Martin,

    "Subdivide" is the wrong term; "expand" is the better one.

    The wordlist should show the total occurrences and also the occurrences per link/document.

    I hope this is understandable.

    BR, Bernd

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,438 RM Data Scientist

    Hi,

    Sounds like you can use Aggregate and Join to do this. Any chance you can send me the first 10 lines of your Excel file as a private message?

    ~Martin

  • saure Member Posts: 4 Contributor I

    Hello Martin,

    Thanks a lot for this solution.

    Setting up each link with "Generate Data..." is not very convenient, but it works. (In one project I have nearly 40 links.)

    However, the ExampleSet shows only a yes/no (1/0) value per word, not the word's count. The correct sum appears under total_x, but I do not know how many occurrences are in each link. Have I overlooked something?

    BR, Bernd
