Classification Advice
Greetings to all community members!
I am new at using the very interesting RM and I am working on a classification problem. For an experienced user these might be easy questions and that is why I would like your advice.
I have a data set of 13000 rows. I am trying to predict if a bank is going to sell their product through cellphone or telephone (label = Yes/No) having some data about their clients (age, job, personal loan, house loan, balance etc). I have tried to use Decision Trees, ANN and SVM. But in my question I am only going to ask about Decision Trees.
My data types are binominal, polynominal and integer. Some of those attributes have a lot of missing values, which were already filled with “unknown” in the data that I have. One attribute, “pdays” (the number of days that passed after a client was last contacted in a previous campaign), is filled with -1 when the client was not previously contacted.
What I did first is to run my algorithms without any preprocess to have my first results to use them as baseline (using Cross Validation).
After that, seeing that my data is unbalanced (12% Yes and 88% No), I tried both upsampling and downsampling to see which one gives the best results. Then I optimized the parameters to get some first results. Upsampling gave the best results in my case.
But I was thinking that I should do some preprocessing as well. What I was thinking of doing is the following:
1)Missing Values:
There are some attributes that have many unknown values (80%) and others that have just a few unknown values (4%). I believe I can’t just take them out, at least not the one with 80%, because I may lose information. But I don’t know if replacing them with the most frequent value or the average would be a good solution, especially for polynominal attributes. I also don’t know what to do with the value -1 (80%), because I think it is not good for my algorithms to keep it. What approach should I think of?
2)Outliers:
I think that I should also check about outliers, because I have for example attribute “balance” [-8000 to 100000] while only one person has balance 100000 and the average is 1350. As I mentioned before I have polynominal and integer types. Should I check for outliers only for integers or also for all the attributes? And what about -1? Doesn’t it influence my results?
3)Normalization:
Because I have data with different ranges I was thinking of normalizing the integers. I have watched tutorials of RM and it was suggested to do normalization in the cross validation in the training set. I did it in the training set in the cross validation where I had balanced my data but I got worse results.
4)Feature Selection:
I used Optimize Selection and I chose backward selection. The attributes that have weight 0 should be taken out. So I took them out and reran the performance of the decision trees. The accuracy got slightly better and precision improved by 8%, but the F-measure and recall fell by about 20%. Is this normal?
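The pattern described above (accuracy and precision up, recall and F-measure down) is arithmetically possible on unbalanced data. Here is a minimal sketch in plain Python; the two confusion matrices are hypothetical numbers chosen for illustration only, not results from this data set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f_measure) for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure


# Hypothetical confusion matrices: 1000 examples, 120 of them "Yes" (12%).
before = classification_metrics(tp=60, fp=60, fn=60, tn=820)
after = classification_metrics(tp=36, fp=26, fn=84, tn=854)

for name, metrics in (("before", before), ("after", after)):
    print(name, [round(m, 3) for m in metrics])
```

In this sketch the second model predicts "Yes" less often. It makes fewer false alarms, so precision and accuracy go up, but it misses more of the rare "Yes" cases, so recall and F-measure go down.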
5)Correlation:
I also tried to see if there is any correlation between the attributes. The attribute “duration” has weight 1, but in the correlation matrix I don’t understand which other attribute duration is correlated with, because its column has only small values (<0.08). I have the weights only for the integer types (7 in total). I then tried to include all the other attributes by converting Nominal to Binominal, but the correlation matrix was very complicated (48x48).
From these results I don’t understand which attributes to take out. For example, in the matrix there is marital = married in the x column and marital = single in the y column with the value -0.772. Should the attributes that are highly correlated be taken out?
I have watched many tutorials and read many books (ex. Introduction to Data Mining, Data Mining for the Masses) but theory is a little bit different from real life problems.
I know my message is big but I really need some advice how to approach this problem and I would like to thank you in advance for your time.
Best regards,
Nikos
Answers
It is good to see that you are thinking about preprocessing and feature engineering.
Many of the items that you list here are not really relevant for decision trees: for example, normalizing data, removing outliers, and replacing missing values will generally have no impact at all on the decision tree solution. The biggest concern you have for decision trees is overfitting, which is something that you can try to minimize through the pruning and pre-pruning parameters (although even with these, in my experience trees are prone to overfit solutions). Perhaps a better alternative would be to try the Random Forest operator, which is based on trees but is much less likely to overfit.
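To illustrate why normalization does not matter for a tree: a threshold split depends only on the ordering of the values, so a monotone rescaling leaves the resulting partition unchanged. The sketch below is a hand-written single-split search (fewest misclassifications, not the criterion RapidMiner uses), and the balance values and labels are hypothetical.

```python
def best_split(values, labels):
    """Return the row indices sent to the left branch by the best
    single-threshold split (fewest errors with a majority vote per side)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        # A threshold can only be placed between two distinct values.
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        errors = 0
        for side in (left, right):
            yes = sum(labels[i] for i in side)
            errors += min(yes, len(side) - yes)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left


# Hypothetical data: one extreme balance, as in the question.
balance = [-8000, 200, 900, 1350, 2500, 4000, 100000]
label = [0, 0, 0, 1, 0, 1, 1]

# Min-max normalization to the range [0, 1].
lo, hi = min(balance), max(balance)
scaled = [(b - lo) / (hi - lo) for b in balance]

left_raw = best_split(balance, label)
left_scaled = best_split(scaled, label)
print(left_raw == left_scaled)
```

The same rows end up on the same side of the split before and after scaling, and the extreme value of 100000 simply falls into the right branch.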
For several of the other approaches that you might consider (ANN, SVM), these types of attribute pre-processing and feature selection would be more important.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @nikpap201
Generally, you are heading in the right direction!
I could add a few thoughts to some of the points you mentioned.
Missing values. Use common sense in deciding what a replacement value should be. It depends on what type of measure each variable represents. For example, you can replace a missing house loan amount with zero, but you can't replace a missing age with zero. Also consider whether a zero value has a special meaning in your business problem. For example, zero income in credit risk is a serious negative factor, etc.
Outliers. Usually these are numeric values, so check the distribution of the numeric variables to see if you have some obviously outlying values. Sometimes it makes sense to bin variables, for example put all balance values into low, medium and high bins. With polynominal attributes, it's different: you may have values that occur too rarely, so you may handle these with the 'REPLACE RARE VALUES' operator from the Operator Toolbox extension (needs to be installed from the Marketplace).
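As a rough sketch of both ideas in plain Python: the cut-off values, the `min_share` threshold, and the job names below are assumptions for illustration, and the second function only imitates the idea of replacing rare values, not the actual operator.

```python
from collections import Counter


def bin_balance(balance, low_cut=500, high_cut=5000):
    """Map a numeric balance to a coarse bin (cut-offs are hypothetical)."""
    if balance < low_cut:
        return "low"
    if balance < high_cut:
        return "medium"
    return "high"


def replace_rare_values(values, min_share=0.1, replacement="other"):
    """Replace nominal values whose relative frequency is below min_share."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= min_share else replacement for v in values]


# Hypothetical nominal attribute with two rare categories.
jobs = ["admin"] * 10 + ["technician"] * 8 + ["student", "housemaid"]
cleaned = replace_rare_values(jobs)

print([bin_balance(b) for b in (-8000, 1350, 100000)])
print(Counter(cleaned))
```

After binning, the single balance of 100000 is just one more "high" value, so it no longer stretches the range of the attribute.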
More generally, there are two interesting measurements used in the Auto Model process, which are called ID-ness (the attribute tends to have different values for all examples) and stability (the attribute tends to have the same value for all examples). Both situations (high ID-ness or high stability) indicate that this particular attribute is not usable in modelling.
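A minimal sketch of the two measures, assuming ID-ness is the share of distinct values and stability is the share of the most frequent value. These formulas are my reading of the descriptions above and may differ in detail from what Auto Model computes.

```python
from collections import Counter


def id_ness(values):
    """Share of distinct values: 1.0 means every example has its own value."""
    return len(set(values)) / len(values)


def stability(values):
    """Share of the most frequent value: 1.0 means the attribute is constant."""
    return Counter(values).most_common(1)[0][1] / len(values)


# Hypothetical attributes.
client_id = list(range(10))              # behaves like an ID
default = ["no"] * 9 + ["yes"]           # almost constant

print(id_ness(client_id), stability(client_id))
print(id_ness(default), stability(default))
```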
Feature selection. This depends on the total number of attributes you have. Sometimes, if you don't have many attributes, it's easier and faster not to use forward selection or backward elimination, but rather to weight all the attributes (for example, with 'WEIGHT BY CORRELATION') and choose the top k attributes. Also, have a look at the 'REMOVE CORRELATED ATTRIBUTES' operator.
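The idea of weighting by correlation and keeping the top k attributes can be sketched as follows. This is a plain Python illustration using the absolute Pearson correlation with a 0/1 label, not the operator's implementation, and all values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_k_by_correlation(columns, label, k):
    """Rank attributes by |correlation with the label| and keep the top k."""
    weights = {name: abs(pearson(vals, label)) for name, vals in columns.items()}
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Hypothetical numeric attributes and a 0/1 label (1 = "Yes").
label = [0, 0, 0, 1, 0, 1, 1, 0]
columns = {
    "age": [30, 45, 52, 38, 41, 33, 60, 29],
    "duration": [60, 90, 120, 600, 150, 700, 650, 100],
    "campaign": [1, 2, 1, 1, 3, 1, 2, 4],
}

print(top_k_by_correlation(columns, label, k=2))
```

Note that a weight of this kind measures the relation between an attribute and the label, whereas the correlation matrix measures the relation between pairs of attributes, so a high weight and small matrix entries do not contradict each other.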
Vladimir
http://whatthefraud.wtf
Thank you @Telcontar120 and also @kypexin for your very extensive answers. I will take your advice into account.
Just two more questions. About the value -1 in the attribute “pdays”, which refers to the number of days that passed after a client was last contacted in a previous campaign (-1 means the client was not previously contacted): should I change this value or keep it as it is? I am asking because it is not actually a missing value; it marks people who were not previously contacted.
Also, the missing values in my data were already filled with the value "unknown". So when I load my data in RM, "unknown" is treated as a regular nominal value and not as a missing value. Should I, for example, replace that "unknown" value with a blank (the space key on the keyboard) or keep it as it is?
Kind regards,
Nikos
Hi @nikpap201
1) As for the pdays attribute: does it really matter for your business case to count the exact number of days passed since the last contact? Because otherwise you could generate a binominal attribute 'was_contacted', for example, which is false for pdays = -1 and true otherwise. But anyway, this '-1' value will be treated by a model as an existing value, not a missing one.
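In plain Python, deriving such a flag could look like the sketch below. The rows are hypothetical, and in RapidMiner itself you would build the attribute inside the process rather than in code like this.

```python
def add_was_contacted(rows):
    """Add a boolean 'was_contacted' flag: False when pdays is -1."""
    return [{**row, "was_contacted": row["pdays"] != -1} for row in rows]


# Hypothetical examples; -1 means the client was never contacted before.
clients = [
    {"age": 41, "pdays": -1},
    {"age": 35, "pdays": 92},
    {"age": 58, "pdays": -1},
]

flagged = add_was_contacted(clients)
print([row["was_contacted"] for row in flagged])
```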
2) This case with an 'unknown' nominal value is pretty common if you deal with pre-processed datasets. To fix it, you can use the 'DECLARE MISSING VALUES' operator, which converts 'unknown' nominal values into actual missing values, and then apply 'REPLACE MISSING VALUES' to handle those.
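The two steps can be pictured with this plain Python sketch. It only mimics the idea of the two operators, the education values are hypothetical, and replacing with the most frequent value is just one possible strategy.

```python
from collections import Counter


def declare_missing(values, marker="unknown"):
    """Turn the placeholder string into a real missing value (None)."""
    return [None if v == marker else v for v in values]


def replace_missing_with_mode(values):
    """Fill missing values with the most frequent known value."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical nominal attribute where "unknown" stands for a missing value.
education = ["secondary", "unknown", "tertiary", "secondary", "unknown"]

declared = declare_missing(education)
filled = replace_missing_with_mode(declared)
print(declared)
print(filled)
```

For an attribute where 80% of the values are unknown, filling with the most frequent value would mostly invent data, so keeping "unknown" as its own category may be the more honest choice there.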
Vladimir
http://whatthefraud.wtf