Help with correctly understanding classification results

Serek91 Member Posts: 22 Contributor II
edited October 2019 in Help
Hi, I have the following table of classification results:



I used 4 algorithms. Classification was run on 16 different training sets:
- all => all 15 predictors were used
- 1-15 => each set contains 14 predictors; in each set, one different type of predictor was removed

An example set is attached.

Type of excluded predictor | column name in csv
1 - characters_number
2 - sentences_number
3 - words_number
4 - average_sentence_length
5 - average_sentence_words_number
6 - ratio_unique_words
7 - average_word_length
8 - ratio_word_length_[1-16]
9 - ratio_special_characters
10 - ratio_numbers
11 - ratio_punctuation_characters
12 - most_used_word_[1-4]
13 - ratio_letter_[a-z]
14 - ratio_questions
15 - ratio_exclamations

I have to somehow explain why the results for sets 1-15, for each algorithm, are better or worse than the results in the "ALL" column.

But I have no idea why. I know that in most cases, when the difference between column ALL and a column [1-15] is very small (e.g. < 1%), it is just luck and randomness. But when the difference is larger, something is probably causing it.
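
A rough way to judge whether a gap of that size could be noise is to compare it with the binomial standard error of the accuracies. The sketch below is only an approximation: the test-set size `n_test=1000` is a hypothetical value (the real size is not given here), the function names are made up for illustration, and the two accuracies are treated as independent, which is not strictly true when both models are scored on the same examples.

```python
import math

def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

def looks_like_noise(acc_all: float, acc_ablated: float, n_test: int, z: float = 2.0) -> bool:
    """True if the gap between two accuracies is within z standard errors.

    Assumption: both accuracies are treated as independent estimates,
    which is only a rough approximation.
    """
    se_diff = math.sqrt(
        accuracy_standard_error(acc_all, n_test) ** 2
        + accuracy_standard_error(acc_ablated, n_test) ** 2
    )
    return abs(acc_all - acc_ablated) <= z * se_diff

# Hypothetical numbers: 54% with all predictors, n_test = 1000 examples.
small_gap_is_noise = looks_like_noise(0.54, 0.535, n_test=1000)
large_gap_is_noise = looks_like_noise(0.54, 0.46, n_test=1000)
```

With these assumed numbers, a 0.5-point gap falls inside the noise band while an 8-point gap does not. Repeating the cross-validation with different random seeds would give a more direct picture of the spread.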

Most importantly, I don't know why the k-NN results are the same for columns 9-15...

It would also be good to know why Naive Bayes is the best (54%) and k-NN is a bad algorithm for this task (20%).

Can someone help me with that?











Answers

  • Serek91 Member Posts: 22 Contributor II
    OK, thanks. I added normalization to k-NN and now I get better results (~46%).

    Is normalization not needed for the other algorithms (Naive Bayes, Decision Tree)? I don't see any difference with or without it.
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    I wouldn't say it's not needed, but for k-NN you will definitely see a difference with normalization. The reason is the distance calculation that k-NN uses: it predicts from the nearest surrounding data samples, so attributes with large numeric ranges dominate the distance. There is a nice visual example in the Cross Validated (Stack Exchange) post below.

    https://stats.stackexchange.com/questions/287425/why-do-you-need-to-scale-data-in-knn

    In my experience, normalization won't make much difference for a decision tree, since it calculates an impurity index for each attribute and splits on it.
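
The effect of scale on k-NN can be illustrated with a tiny 1-nearest-neighbour sketch in plain Python. This is not RapidMiner's implementation; the rows, values, and helper names are hypothetical, with one large-range attribute (like characters_number) and one small-range attribute (like ratio_questions).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, rows):
    """Index of the row closest to query (1-NN)."""
    return min(range(len(rows)), key=lambda i: euclidean(query, rows[i]))

def fit_min_max(rows):
    """Per-column minimum and range, learned from the training rows."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return lows, spans

def apply_min_max(row, lows, spans):
    """Rescale a row with the training minimum and range."""
    return [(v - lo) / s for v, lo, s in zip(row, lows, spans)]

# Hypothetical rows: (characters_number, ratio_questions)
train = [[1000.0, 0.90], [1200.0, 0.10]]
query = [1150.0, 0.88]

# Unscaled: characters_number dominates the distance.
raw_neighbor = nearest(query, train)
# Dropping the small-range attribute does not change the neighbour.
raw_neighbor_without_ratio = nearest(query[:1], [r[:1] for r in train])

# Scaled: both attributes contribute comparably.
lows, spans = fit_min_max(train)
scaled_neighbor = nearest(
    apply_min_max(query, lows, spans),
    [apply_min_max(r, lows, spans) for r in train],
)
```

In this toy data the unscaled neighbour is the second row with or without the ratio attribute, while after min-max scaling the first row becomes the nearest. A single-attribute threshold split, as used by a decision tree, is unaffected by this kind of rescaling, because the ordering of values within an attribute stays the same.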
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • Serek91 Member Posts: 22 Contributor II
    OK, thanks.

    Results for k-NN are now much better. Results for Decision Tree are a bit better, but the difference is not significant. I will keep trying to improve it.





  • Serek91 Member Posts: 22 Contributor II
    Hi, I have another question:
    Decision Tree - the results in columns ALL and 12 are the same. Column 12 contains only string values (words), not numerical ones. Can a Decision Tree use predictors with text values? It seems that it can't.
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    From my understanding, the text data is treated as categorical (nominal) in this case.

    Regards,
    Varun

  • Serek91 Member Posts: 22 Contributor II
    edited October 2019
    According to the docs:
    "This Operator can process ExampleSets containing both nominal and numerical Attributes."

    So it should have some impact on the final result. But the result is still the same, whether or not the predictor is included.


  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited October 2019
    You should inspect the two models and check whether that feature/attribute appears in the tree. Maybe that attribute got pruned.
    Regards,
    Varun

  • Serek91 Member Posts: 22 Contributor II
    edited October 2019
    I made a prediction using only this one predictor, and I got:





  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Makes sense. The accuracy is zero because the model cannot predict from that attribute alone; it just labeled the predictions randomly. If you want to predict from text, you should use techniques like tokenization.
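
As a minimal sketch of what tokenization means here, the plain-Python snippet below splits text into word tokens and turns each text into a vector of word counts (a bag of words). It is not a RapidMiner operator; the function names and the sample sentences are hypothetical and only show the idea of converting raw text into numeric attributes a classifier can use.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(texts):
    """Return (vocabulary, count vectors) for a list of texts."""
    token_lists = [tokenize(t) for t in texts]
    # Sorted vocabulary so that every vector uses the same column order.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample texts.
vocabulary, vectors = bag_of_words(["Is this a question?", "This is not!"])
```

Each column of `vectors` then corresponds to one word in `vocabulary`, so the text becomes a set of numeric predictors instead of a single nominal value that is nearly unique per example.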
    Regards,
    Varun
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    What is 792246? Is it a column name? I think there is some issue in the process structure, but I can't be sure unless I see the data and the process. The posted picture confuses me a bit. The only reasons I can think of are that everything got pruned because it added no value to the tree, or that there is some issue in the process input.
    Regards,
    Varun

  • Serek91 Member Posts: 22 Contributor II
    edited October 2019
    Ehhh... so it will be hard to do it now... I don't have time for it...

    Thanks anyway.

    EDIT:
    "What is 792246? Is it a column name? I think there is some issue in the process structure, but I can't be sure unless I see the data and the process. The posted picture confuses me a bit. The only reasons I can think of are that everything got pruned because it added no value to the tree, or that there is some issue in the process input."

    I added the wrong image ^^

    It should be this one:


