"Text Mining: How do I assign/create a macro to reference a group of attributes collectively?"

batstache611 Member Posts: 45 Guru
edited June 2019 in Help

Hi,

Before I begin, I sincerely apologize that I cannot share my process here due to confidentiality issues with my school. This is a service learning project that I am doing for the co-op placements department of my business school. But I will try my best to describe it.

I have a process that tries to classify a job based on lexicons. I have one dictionary per job category, so about 30 dictionaries for about 30 categories, plus one large custom stopwords dictionary. Each job posting is run against these dictionaries to see how many words from each dictionary it contains. The idea is that whichever dictionary gets the highest word count for a given job posting gives the predicted category of that job. The concept itself is simple, but to automate the whole thing and run it at scale I'm using file and repository loops, macros, branches, subprocesses, etc.
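
For readers outside RapidMiner, the counting idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy dictionaries, the stopword list, and the function names are hypothetical, and the sketch counts every occurrence of a dictionary word, which may or may not match how the original process counts.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries; the real process loads about 30 lexicon files.
DICTIONARIES = {
    "accounting": {"audit", "ledger", "tax"},
    "marketing": {"brand", "campaign", "seo"},
}
# Hypothetical stand-in for the custom stopwords dictionary.
STOPWORDS = {"the", "and", "a", "for", "with", "of"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def count_hits(text, dictionaries=DICTIONARIES):
    """Return {category: number of tokens found in that category's dictionary}."""
    tokens = Counter(tokenize(text))
    return {
        name: sum(n for word, n in tokens.items() if word in words)
        for name, words in dictionaries.items()
    }


def predict_category(text, dictionaries=DICTIONARIES):
    """Pick the category with the highest hit count (ties go to the first listed)."""
    hits = count_hits(text, dictionaries)
    return max(hits, key=hits.get)


posting = "Prepare the tax filings and audit the ledger for a brand campaign"
print(count_hits(posting))        # per-category counts
print(predict_category(posting))  # category with the most hits
```

Because the categories come from the keys of `DICTIONARIES`, adding a new dictionary changes the data, not the code, which is the property the question is after.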

The process works fine, except the results are very cluttered. For every job posting, word counts for all 30 dictionaries are returned. I'd like to limit that to just the highest one or the top 3. I know that I can use the Max function in Generate Attributes to select the one with the highest count, but that would mean the dictionary names are hard-coded into the process. I'd like it to handle new dictionaries on its own in the future without me having to go into the parameter settings and modify things. Also, if I used attribute names in Max(), the function would be very long, e.g. Max(dictionary_1, dictionary_2, dictionary_3, ...., dictionary_30). Is there a way to use a macro instead to refer to these dictionary attributes, so that I can write a simple function, Max(%{dictionary}), and have it select the highest count?

I've attached a sample CSV with breakpoint results for one row/document/job posting. As you can see, it has word counts for several dictionaries, but I'm only interested in the largest one. And I need to do this for over 5k job postings. I want one or more attributes that pick the top category, or the top three categories, for each document/row using macros and Generate Attributes.

Thank you very much, your help is greatly appreciated.

demo.csv 886B

Best Answer

  • batstache611 Member Posts: 45 Guru
    Solution Accepted

    @mschmitz, never mind, I got it. Using the Aggregate operator at the end does the trick. I needed to aggregate Count using the maximum function and PredictedCat using mode, and then group by the company IDs. Thank you very much for pointing me in the right direction though!

    Best regards.
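
The accepted approach (group by company ID, maximum of Count, mode of PredictedCat) can be mimicked outside RapidMiner roughly as follows. This is a sketch under assumptions, not the poster's actual process: the sample rows and the function name `aggregate_by_company` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like the described output:
# (JobCompanyID, Count, PredictedCat).
rows = [
    ("C1", 4, "accounting"),
    ("C1", 9, "accounting"),
    ("C1", 2, "marketing"),
    ("C2", 7, "marketing"),
    ("C2", 3, "marketing"),
]


def aggregate_by_company(rows):
    """Group by company ID; take the max of Count and the mode of PredictedCat."""
    groups = defaultdict(list)
    for company_id, count, category in rows:
        groups[company_id].append((count, category))

    result = {}
    for company_id, items in groups.items():
        highest = max(count for count, _ in items)
        # most_common(1) returns the single most frequent category in the group.
        mode_cat = Counter(cat for _, cat in items).most_common(1)[0][0]
        result[company_id] = (highest, mode_cat)
    return result


print(aggregate_by_company(rows))
```

Note that the maximum and the mode are computed independently here, so the reported category is the most frequent one in the group and is not guaranteed to be the category on the row holding the maximum count.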


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,404 RM Data Scientist

    Hi,

    What you can do is use Generate Aggregation with a regex. This lets you take the max of n attributes.

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
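
The regex idea in this reply, selecting the attributes by a name pattern instead of listing them, looks roughly like this in plain Python. It is only an analogy to Generate Aggregation, not its implementation; the row, the pattern `dictionary_\d+`, and the function name are assumptions made for illustration.

```python
import re

# Hypothetical row: one job posting with one count attribute per dictionary.
row = {"job_id": 17, "dictionary_1": 4, "dictionary_2": 11, "dictionary_3": 6}


def max_over_matching(row, pattern=r"dictionary_\d+"):
    """Return the largest value among attributes whose name fully matches pattern."""
    regex = re.compile(pattern)
    values = [value for name, value in row.items() if regex.fullmatch(name)]
    if not values:
        raise ValueError(f"no attribute matches {pattern!r}")
    return max(values)


print(max_over_matching(row))
```

A later `dictionary_31` attribute would be picked up by the same pattern, so nothing has to be edited when a new dictionary is added. Non-matching attributes such as `job_id` are ignored.
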
  • batstache611 Member Posts: 45 Guru

    Thank you very much @mschmitz. But that only gets me halfway. I want the attribute name to be the category that was picked. So in the parameter settings of Generate Aggregation, where it asks for the Attribute Name, I want to insert some kind of macro that returns the name of the dictionary with the highest count.

    To summarise, I want to pick the dictionary with the highest tag count and return the name and count number for that dictionary. Hope I was able to explain myself clearly. Thank you for your solution.

    Update: I am using a macro that grabs the file name of the dictionary used. However, if I use this macro for Generate Aggregation's attribute name, it returns the name of the last dictionary used, which is always the same.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,404 RM Data Scientist

    Hey,

    OK, so you do not just need the max, but also the name of the max. Is transposing the table and then using Sort + Filter Example Range an option?

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
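
The transpose, sort, and filter-range suggestion returns both the name and the count of the winning dictionaries. A minimal Python analogy, assuming a single hypothetical row of counts and a made-up helper name `top_categories`:

```python
# Hypothetical row of per-dictionary counts for one job posting.
counts = {"dictionary_1": 4, "dictionary_2": 11, "dictionary_3": 6, "dictionary_4": 9}


def top_categories(counts, k=3):
    """Return the k (name, count) pairs with the highest counts, largest first."""
    # "Transpose": turn one wide row into a list of (name, count) records.
    records = list(counts.items())
    # "Sort": order by count, descending; ties keep their original order.
    records.sort(key=lambda item: item[1], reverse=True)
    # "Filter Example Range": keep only the first k records.
    return records[:k]


print(top_categories(counts, k=1))  # the single best category with its count
print(top_categories(counts))       # the top three
```

Setting `k=1` gives the predicted category and its count; `k=3` gives the top three asked for in the original question.
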
  • batstache611 Member Posts: 45 Guru

    Hi @mschmitz, I'm sorry it took me a while to reply, as I went on a tangent from this process for a short while. Yes, transposing and sorting works. However, I'm not exactly sure how Filter Example Range would help me. I've attached a CSV of the current process output. Cat ID is all the job categories that we have, HighestCount is the word count for the category with the most hits for a given job posting, PredictedCat is the name of that category, and JobCompanyID is the ID of the company that posted the job.

    As you can see, for each company that posts jobs, the HighestCount number can vary widely. But I'd only like to keep the row with the greatest number. So in the example of the demo file, where 3 companies have job postings, I'd like RapidMiner to return only 3 rows with company ID, count, and predicted category. Hope I was able to explain myself. Thank you very much.

    demoII.csv 9.1K