LOF on Text Data

tamberge (Member, Posts: 6, Contributor II)
edited April 2019 in Help
Hello,

I am fairly new to RM and currently conducting some research on online text.
In particular, I am trying to detect outliers in a set of documents using the LOF operator.
I am now having some trouble: the LOF for each document is very close to 1, no matter how I set MinPtsUB and MinPtsLB.
Basically, I have represented each document as a vector of term frequencies and as a TF-IDF vector before applying the LOF operator.
So I have two ExampleSets representing the corpus, a matrix of TF values and a matrix of TF-IDF values, to check the differences.
However, for both matrices I get LOF values that are equal or very close to one, which does not make any sense to me.

Could you tell me what, if anything, I am doing wrong?

Best

Please find my XML enclosed:

-
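Outside RapidMiner, the setup described in the question can be sketched in Python with scikit-learn; the toy corpus and the choice of `n_neighbors` (roughly MinPts) are assumptions for illustration, not the poster's actual data:

```python
# Minimal sketch: TF-IDF vectors + LOF, mirroring the RapidMiner pipeline.
# The corpus below is a hypothetical toy example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LocalOutlierFactor

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog played",
    "quantum chromodynamics lattice gauge theory",  # intended outlier
]

# Represent each document as a TF-IDF vector.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# n_neighbors plays roughly the role of MinPts in the LOF operator.
lof = LocalOutlierFactor(n_neighbors=2)
lof.fit(tfidf)

# scikit-learn negates the factor; flip the sign to get LOF values,
# where values near 1 indicate inliers.
scores = -lof.negative_outlier_factor_
print(scores)
```

On a tiny, clearly separated corpus like this one the outlier does stand out; the question is why that stops working on real, high-dimensional document vectors.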

Best Answer

  • tamberge (Member, Posts: 6, Contributor II)
    Solution Accepted
    So I have been trying different methods in all possible combinations on a test set of 26 examples:
    changing MinPts UB and LB (1-2, 2-3, 5-10),
    choosing different vectors (TF, TF-IDF, Term Occurrence, Binary Term Occurrence),
    pruning (filtering frequent words and filtering infrequent words).

    However, I was not able to get LOF values much greater than 1.
    So does anyone have a theory as to where this is coming from?
    I can also share the data if you want.


Answers

  • Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn)
    I can't see your text data, but this is likely an artifact of the "curse of dimensionality": with a large TF-IDF vector, the multivariate differences between set members are simply not large enough to register under the LOF algorithm. This can easily happen if there are many terms in common and only a few differentiating terms. You might get better results by generating your TF-IDF matrix from a reduced wordlist containing only words that are likely to be differentiating.
    Or you could switch to a different outlier detection algorithm that is more inherently distance-based, like the k-NN anomaly score, rather than density-based, although you may still run into similar problems.
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
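Both suggestions can be sketched outside RapidMiner in Python; the corpus and the `max_features` cut-off are hypothetical illustrations, and the k-NN anomaly score here is approximated as the mean distance to the k nearest neighbours:

```python
# Sketch of the two suggestions above: shrink the vocabulary, and score
# anomalies by distance (mean k-NN distance) instead of local density.
# The corpus is a hypothetical toy example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = [
    "payment failed with error code",
    "payment failed, please retry",
    "payment succeeded after retry",
    "recipe for chocolate banana bread",  # intended outlier
]

# max_features keeps only the most frequent terms, trimming the
# dimensionality that washes out density differences under LOF.
tfidf = TfidfVectorizer(max_features=20).fit_transform(docs).toarray()

# n_neighbors=3 includes the point itself, so each row gets 2 real neighbours.
nn = NearestNeighbors(n_neighbors=3).fit(tfidf)
dist, _ = nn.kneighbors(tfidf)

# Column 0 is the zero self-distance; average the remaining distances.
knn_score = dist[:, 1:].mean(axis=1)
print(knn_score)
```

Unlike LOF, this score is an absolute distance, so a document far from everything gets a large value even when local densities are uniform.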
  • tamberge (Member, Posts: 6, Contributor II)
    Hi Brian, thank you for your quick reply. I guess I will just try to reduce the vector size by pruning more.
    I will let you know if it has any positive impact on the outcome!
    Thanks again!
  • tamberge (Member, Posts: 6, Contributor II)
    I have found a solution to the challenge: not using any pruning, and normalizing the data before applying the LOF operator.
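The fix can be sketched in Python with scikit-learn's `StandardScaler`, assuming the normalization was a Z-transformation (zero mean, unit variance per attribute, as in RapidMiner's Normalize operator); the corpus is a hypothetical toy example:

```python
# Sketch of the accepted fix: no pruning, Z-transform the TF-IDF matrix,
# then apply LOF. The corpus is a hypothetical toy example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

docs = [
    "server restarted after the nightly update",
    "server restarted after a crash",
    "nightly update applied to the server",
    "my cat knocked the router off the shelf",  # intended outlier
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Z-transformation: zero mean, unit variance per attribute. This rescales
# rarely used dimensions so they contribute to the distances LOF sees.
X = StandardScaler().fit_transform(X)

lof = LocalOutlierFactor(n_neighbors=2)
lof.fit(X)

# LOF values; the outlier should now stand out with a score above 1.
scores = -lof.negative_outlier_factor_
print(scores)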