Can I create multiple models using an attribute as a loop variable?
I have a data set that is comprised of messages from dozens of different authors. My goal is to develop a model based on multiple attributes (including TF-IDF) of each author's messages. Since each author's messages are likely to be unique in terms of their content, topics, word usage, etc., I'd like to develop one model for each author. In other words, if I have 10 authors, I want to create 10 unique models (one for each author's messages). Thus, I have several questions:
1) One of the attributes of my data is the author's name. Can I use this attribute somehow as a loop variable so that for each iteration of the loop I can analyze all of an author's messages and train and create a model unique to that author?
2) How can I name and store these models in such a way that in another RM process I can retrieve a model based on an author's name? In other words, if I train a model based on messages whose author is Jenny, then how can I retrieve and apply "Jenny's model" if I get new messages from Jenny in the future (or "Steve's model" if I get new messages from Steve, and so on)?
3) Also, is there an unsupervised model that can be used to model all of an individual author's messages as a single class, and then apply the model to future messages to detect deviations or anomalies?
Best Answer
MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor), Posts: 3,362, RM Data Scientist
- You can use Extract Macro to put the first author's name into a macro, then use that macro in Write Clustering.
- I would recommend the usual Loop rather than Loop Examples. It requires you to extract numberOfExamples first, but the loop can then run in parallel.
- I would use a Store operator rather than Write Clustering.
@adamf: You may want to try LDA from the Toolbox on it.
BR,
Martin
- Head of Data Science Services at RapidMiner -
Dortmund, Germany
Answers
Hi @adamf,
I will try to provide some elements of an answer:
0. Hypothesis:
I assume that your dataset has this form:
1. Process 1:
Basically, this process creates N cluster models from your dataset (one model per author, k = 1, model used = DBScan) and writes each cluster model to a path on your computer (set the path in the parameters).
Process 1:
2. Process 2:
I also thought of a process that is not exactly what you are asking for, but I think it may be relevant to your end use:
This process creates a single cluster model from your training dataset with k = number of authors. In this case, each author/message belongs to a cluster.
So when you have a future message from a "known author" to score:
- if the author does use the same wording in this second message, the model will assign it to the author's cluster.
- if the wording of this second message differs from that of the first message (from the training dataset), the model will assign it to another author's cluster. From there you can study deviations and anomalies, as you said. Maybe you can calculate the distance between the two clusters (I don't know whether that is feasible in RapidMiner).
Process 2:
I hope this helps,
Regards,
Lionel
NB: I don't know how to rename the model with the author's name. In my first process, the models are named model_1, model_2, ..., model_N, in the order of the authors.
Hi Adam,
I will go a bit more to the point:
Best,
Sebastian
Thanks for the suggestions. I am a bit unclear whether I should use a distinct model for each author or one model for all authors. For the former approach, would a one-class SVM be a good option? Is it supported in RM? Other/better clustering models?
My goal is: given a new message X from author Y, predict whether the message X is really from author Y or is from an imposter pretending to be author Y.
I thought that if I can train one model per author, then when I receive a new message/author tuple I could retrieve the appropriate model based on the author and predict most accurately whether the message is consistent with the other messages by the same author or is an outlier.
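To make the "one model per author, then check a new message against it" idea concrete, here is a small sketch in plain Python. It is not a one-class SVM and not a RapidMiner process: as a simple stand-in for a one-class model, it builds a TF-IDF centroid from each author's messages and scores a new message by cosine similarity to that centroid. The class name `AuthorProfile`, the toy messages, and the threshold of 0.2 are all hypothetical choices for illustration; a real threshold would have to be tuned on held-out data.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class AuthorProfile:
    """Stand-in one-class model: TF-IDF centroid of one author's messages."""

    def fit(self, texts):
        docs = [Counter(tokenize(t)) for t in texts]
        n = len(docs)
        # Document frequency: in how many messages each term appears.
        df = Counter(term for doc in docs for term in doc)
        # Smoothed IDF; unseen terms get the weight of a term with df = 0.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
        self.unseen_idf = math.log(1 + n) + 1
        # Centroid = mean TF-IDF vector over the author's training messages.
        self.centroid = Counter()
        for doc in docs:
            for term, weight in self._vector(doc).items():
                self.centroid[term] += weight / n
        return self

    def _vector(self, counts):
        return {t: c * self.idf.get(t, self.unseen_idf) for t, c in counts.items()}

    def score(self, text):
        """Cosine similarity between a new message and the author's centroid."""
        vec = self._vector(Counter(tokenize(text)))
        dot = sum(w * self.centroid.get(t, 0.0) for t, w in vec.items())
        norm = math.sqrt(sum(w * w for w in vec.values())) * math.sqrt(
            sum(w * w for w in self.centroid.values())
        )
        return dot / norm if norm else 0.0


# Hypothetical training messages, keyed by author.
training = {
    "Jenny": [
        "the quarterly report is ready for review",
        "please review the budget report",
        "budget review moved to friday",
    ],
}
profiles = {author: AuthorProfile().fit(texts) for author, texts in training.items()}


def looks_like(author, text, threshold=0.2):
    # The threshold is an arbitrary illustrative value, not a recommendation.
    return profiles[author].score(text) >= threshold


typical = profiles["Jenny"].score("please review the quarterly budget")
unusual = profiles["Jenny"].score("free pizza party tonight")
```

Here `looks_like("Jenny", ...)` looks up Jenny's profile by author name and flags messages whose wording is far from her usual vocabulary. A one-class SVM or an autoencoder would replace `AuthorProfile` while keeping the same per-author lookup.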
Hi,
I have found some examples of using Autoencoders for anomaly detection in text (or other unstructured data). There are some kernels in Kaggle:
https://www.kaggle.com/imrandude/h2o-autoencoders-and-anomaly-detection-python
I think you can do it in RM with the Keras extension, but I'm not sure. Or maybe with tweaking the Deep Learning operator.
I would definitely like to see your finished process.
Regards,
Sebastian
Hi @adamf,
Another resource:
I think that Chapter 12 of "Data Mining for the Masses", dedicated to text mining, can be helpful for you.
I hope it helps,
Regards,
Lionel
@adamf you could check out my tutorial on one-class SVMs and auto-labeling a training set here: http://www.neuralmarkettrends.com/use-rapidminer-to-auto-label-twitter-training-set/
Hi @Thomas_Ott,
Your link leads to a "404 not found".
I also tried to reach the page directly from within your blog, with the same result.
Regards,
Lionel
Hi again @Thomas_Ott,
OK, after another test, the link works.
Regards,
Lionel
@lionelderkrikor yes, I borked something as I was making website updates. Should be all fixed now.