"Read from Database, Process Documents From Data, kMeans Clustering"

natenash203 Member Posts: 2 Contributor I
edited June 2019 in Help
Greetings - My question concerns something that I imagine is very simple and that, as a newbie, I am merely overlooking. However, after reading the manual and similar posts (like this one: http://rapid-i.com/rapidforum/index.php/topic,5518.0.html), I am still at a loss.

I am reading data from a DB with the following columns:
  • entity_id
  • raw_text
My current process grabs each row, turns it into a document, processes the document, then attempts to cluster the documents using the k-means clustering operator. My goal is to have the docs clustered, but show the entity_id value instead of the id value generated by RapidMiner. I have attempted the following with no luck:
  • Add a Set Role operator that sets the attribute entity_id to id, after the Process Documents From Data operator
- Doesn't work: for entity_id to show up after the Process Documents From Data operator, it appears I need to check the "Add meta information" box. If I do this, the k-means clustering operator complains about nominal (non-numeric) values, specifically attributes such as title, language, etc. These attributes do not exist in my data and appear to be added by the Process Documents From Data operator.
  • Add a Set Role operator that sets the attribute entity_id to id, before the Process Documents From Data operator
- Same issue as above. entity_id doesn't make it through without checking the "Add meta information" box. As a result, the k-means clustering operator complains about the title, language, and robots attributes, which I did not create.
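
The conflict in the two attempts above can be illustrated outside RapidMiner. What follows is a minimal Python sketch, not RapidMiner code: it assumes each example is a plain dict, that entity_id plays the id role, and that the meta attributes (title, language) arrive as non-numeric values. The helper name `split_roles` and the sample values are hypothetical.

```python
from numbers import Number


def split_roles(rows, id_attribute="entity_id"):
    """Separate the id column from the numeric attributes used for clustering.

    Non-numeric attributes (for example meta information such as title or
    language) are dropped, because a distance-based clusterer cannot use them.
    """
    ids, features = [], []
    for row in rows:
        ids.append(row[id_attribute])
        numeric = {
            name: value
            for name, value in row.items()
            if name != id_attribute
            and isinstance(value, Number)
            and not isinstance(value, bool)
        }
        features.append(numeric)
    return ids, features


# Hypothetical examples: word-vector weights plus meta attributes.
rows = [
    {"entity_id": 1003, "title": "?", "language": "?", "snowboarding": 0.7, "sports": 0.7},
    {"entity_id": 2097, "title": "?", "language": "?", "snowboarding": 0.0, "sports": 0.0},
]

ids, features = split_roles(rows)
print(ids)       # the ids travel alongside the features, not inside them
print(features)  # only numeric attributes remain
```

The point of the sketch is that the id is carried next to the feature vectors rather than being one of them, so it can be reattached to the cluster assignments afterwards.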


Many thanks in advance for helping me through what I imagine is a total noob oversight.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hm, where does RapidMiner create an id? If you add the id before Process Documents, it survives until the end of the process, even if you add a clustering algorithm at the end. Please have a look at the attached process. If you keep having problems, please post your process setup, as described in the link in my signature.

    Best regards,
    Marius


    [Attached process XML; only these fragments survived extraction]
    <output/>
    <connect from_op="Generate ID" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>

  • natenash203 Member Posts: 2 Contributor I
    Hi Marius - Thanks for responding so quickly! In reference to your first question, I believe I am referring to the "id" column that is generated in the Results View. It appears to correspond to the row number in the ExampleSet. Also, each of the documents within the clusters is identified by this same id value.

    My data looks like this in the database I am querying. In real life, the values within the raw_text column are significantly longer. Also, I rename my database's id column to entity_id, return only ids over 1000, and limit the result to 100 rows.
    id raw_text
    1003 This entity is about snowboarding and other fun winter sports
    2097 This entity is about orange juice and pancakes
    2318 This entity is about elephants
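    The database call works fine. Once the data is in, here is my process:

    [Process XML; only these fragments survived extraction]
    <output/>
    <enumeration key="parameters"/>
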
    The goal is to have the k-means operator cluster on raw_text, then identify each document by its entity_id, not the row number (which seems to be named "id" and is always between 1 and 100). For example, if I am looking at the "Folder View", cluster_0 might expand to 2220, 3862, and 1034 (entity_id), not 12, 44, 86 (row number, called id).
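
    The intended result can be sketched in plain Python. This is only an illustration under stated assumptions, not RapidMiner's implementation: term-frequency vectors stand in for whatever vector creation Process Documents From Data is configured to use, the first k vectors seed the centroids so that the run is deterministic, and the function names are hypothetical. The rows are the three sample rows shown above.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(texts):
    """Turn each text into a term-frequency vector over a shared vocabulary."""
    counts = [Counter(tokenize(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    return [[count[word] for word in vocabulary] for count in counts]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20):
    """Plain k-means; seeding with the first k vectors is a simplification."""
    centroids = [list(vector) for vector in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(vector, centroids[c]))
            for vector in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_by_entity_id(rows, k):
    """Cluster on the text only, but report members by their entity_id."""
    ids = [entity_id for entity_id, _ in rows]
    vectors = vectorize([text for _, text in rows])
    clusters = defaultdict(list)
    for entity_id, cluster in zip(ids, k_means(vectors, k)):
        clusters[f"cluster_{cluster}"].append(entity_id)
    return dict(clusters)


rows = [
    (1003, "This entity is about snowboarding and other fun winter sports"),
    (2097, "This entity is about orange juice and pancakes"),
    (2318, "This entity is about elephants"),
]

result = cluster_by_entity_id(rows, k=2)
for name, members in sorted(result.items()):
    print(name, members)
```

    The entity_id never enters the distance computation; it is only zipped back onto the assignments at the end, which is the behaviour wanted from the "Folder View".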