LOF on Text Data
Hello,
I am fairly new to RM and currently conducting some research on online text.
In particular, I am trying to detect outliers from a set of documents by using the LOF operator.
Right now I am having some trouble, since the LOF for each document is very close to 1, no matter how I set the MinPtsUB and MinPtsLB.
Basically, I have represented each document as a vector of term frequencies and as a TF-IDF vector before applying the LOF operator.
So I have two ExampleSets representing the corpus, a matrix of TF values and a matrix of TF-IDF values, to check the differences.
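For context, the equivalent pipeline outside RapidMiner would look roughly like this (a minimal sketch using scikit-learn's CountVectorizer, TfidfVectorizer and LocalOutlierFactor as stand-ins for my actual process, with a made-up toy corpus):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import LocalOutlierFactor

docs = ["first example document", "second example document", "something completely different"]

tf = CountVectorizer().fit_transform(docs).toarray()      # term-frequency matrix
tfidf = TfidfVectorizer().fit_transform(docs).toarray()   # TF-IDF matrix

# n_neighbors plays roughly the role of MinPts in the LOF operator
lof = LocalOutlierFactor(n_neighbors=2)
lof.fit(tfidf)
print(-lof.negative_outlier_factor_)                      # LOF scores; values near 1 mean "inlier"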
However, for both matrices I get LOF values that are equal or very close to one, which does not make any sense to me.
Could you tell me if and what I am doing wrong?
Best
Please find my XML enclosed:
Best Answer
tamberge, Member, Posts: 6, Contributor II
So I have been trying different methods in all possible combinations for a test set of 26 examples:
changing MinPts UB and LB (1-2, 2-3, 5-10),
choosing different vectors (TF, TF-IDF, Term Occurrence, Binary Term Occurrence),
pruning (filtering frequent words, and filtering infrequent words).
However, I was not able to get LOF values >> 1. So does anyone have a theory where this is coming from? I can also share the data, if you want.
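For what it's worth, here is a rough way to check whether the neighbourhood size alone explains it: sweep the neighbour count and look at the spread of the scores (a sketch with scikit-learn's LocalOutlierFactor, where n_neighbors stands in for MinPts and random data stands in for my 26-document matrix):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Random stand-in for the 26-document TF-IDF matrix, only to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.random((26, 100))

# n_neighbors corresponds roughly to MinPts; print the range of LOF scores per setting
for k in (2, 3, 5, 10):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_
    print(k, scores.min(), scores.max())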
Answers
Or you could switch to a different outlier detection algorithm that is inherently distance-based, like the k-NN anomaly score, rather than density-based, although you may still run into similar problems.
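If it helps, one common reading of such a distance-based score is the mean distance to the k nearest neighbours; a minimal sketch (using scikit-learn's NearestNeighbors, not necessarily the exact definition of the RapidMiner operator) would be:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_score(X, k=5):
    # Mean distance to the k nearest neighbours; larger means more anomalous
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # k + 1 because each point is its own nearest neighbour
    dist, _ = nn.kneighbors(X)
    return dist[:, 1:].mean(axis=1)                  # drop the zero self-distance

# Random stand-in data, only to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.random((26, 100))
print(knn_anomaly_score(X, k=5))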
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
I will let you know if it has any positive impact on the outcome!
Thanks again!