Confused about how to approach my data: start with clustering, go straight to prediction, or is there a better idea?
Dear all,
I am working with a dataset that contains more than 8456 rows and 26 columns. The data is about projects that take place in Europe; each row is a project.
These are the columns:
Office | Office Country | Competence | Executive competence | Classification | Enquiry date | Creation date | Confirmation date | Proposal Date | Final invoice sent date | Intermediary | Client ID | Client | Event | Group name | Reference code | Start date | End date | Project manager | Main contact | Via sales contact | Project location | Project country | Heard About Us | Source Market | Client Kind | Client Sector | Region | Market | Lead Sent to | Event Frequency | Pipeline Future Projects | Initial Pax | Estimated turnover | Estimated costs | Estimated profit % | Status | Pax | Net turnover | Net costs | Gross profit | Gross profit % | Net profit | Net profit % | Agency commissions | Supplier commissions | Cancellation/Rejection reason | Cancellation date | Remarks | Controlled | Financial Regime | Currency | Exchange Rate | Payment status % | Required(Net) | Required | Invoiced | To invoice | Receipt | To pay | Custom invoices | Balance carried forward | Comments to low margin | Debits | Assets | Balance | TO Inv. | TO Acc. | TO Total | Cost Eff. | Cost Man. | Cost Acc. | Cost Total |
For privacy reasons I cannot share the data itself, so I created some imaginary data just for illustration:
Office | Office Country | Competence | Executive competence | Classification | Enquiry date | Creation date | Confirmation date | Proposal Date | Final invoice sent date | Intermediary | Client ID | Client | Event | Reference code | Start date | End date | Project manager | Project location | Project country | Heard About Us | Source Market | Client Kind | Client Sector | Region | Initial Pax | Estimated turnover | Estimated costs | Estimated profit % | Status | Pax | Net turnover | Net costs | Gross profit | Gross profit % | Net profit | Net profit % | Agency commissions | Supplier commissions | Cancellation/Rejection reason | Cancellation date | Remarks | Controlled | Financial Regime | Currency | Exchange Rate | Payment status % | Required(Net) | Required | Invoiced | To invoice | Receipt | To pay | Custom invoices | Balance carried forward | Debits | Assets | Balance | TO Inv. | TO Acc. | TO Total | Cost Eff. | Cost Man. | Cost Acc. | Cost Total |
Saint Louis | Senegal | BL | Saint Louis | Unknown | 22.02.2016 | 08.04.2016 | 08.04.2016 | 23.02.2016 | 08.04.2016 | 11896 | Client 2 | zina 2016 | code e1 2 | 15.04.2016 | 16.04.2016 | Maya | Saint Louis 1 hall | Senegal | BL | Agency | Other | 35 | 0 | 0 | 0 | Completed | 35 | 1.950 | 1.486 | 463 | 24 | 122 | 6 | 0 | 0 | Input/Output | EUR | 1 | 100 | 1.950 | 2.321 | 2.321 | 0 | 2.321 | 0 | 0 | 0 | 0 | 0 | 0 | 1.950 | 0 | 1.950 | 0 | 0 | 1.487 | 1.487 | |||||||
Saint Louis | Senegal | BL | Saint Louis | Other | 08.06.2016 | 08.07.2016 | 08.07.2016 | 14.06.2016 | 25.07.2016 | 43 | Client 3 | code e1 3 | 07.07.2016 | 07.07.2016 | Maya | Saint Louis | Senegal | BL | Agency | Other | 0 | 200 | 0 | 100 | Completed | 0 | 297 | 9 | 288 | 97 | 236 | 79 | 0 | 0 | Input/Output | EUR | 1 | 100 | 297 | 354 | 354 | 0 | 354 | 0 | 0 | 0 | 0 | 0 | 0 | 297 | 0 | 297 | 0 | 0 | 9 | 9 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Embassy | 19.05.2016 | 20.05.2016 | 04.08.2016 | 04.08.2016 | 04.08.2016 | 1978 | Client 4 | leab 2016 | code e1 4 | 11.09.2016 | 16.09.2016 | Laura | Saint Louis | Senegal | BL | Agency | 32 | 12.000 | 0 | 100 | Completed | 32 | 9.614 | 7.416 | 2.197 | 23 | 515 | 5 | 0 | 0 | Input/Output | EUR | 1 | 100 | 9.614 | 11.441 | 11.441 | 0 | 11.441 | 0 | 0 | 0 | 0 | 0 | 0 | 9.614 | 0 | 9.614 | 0 | 0 | 7.417 | 7.417 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Embassy | 20.05.2016 | 21.05.2016 | 28.06.2016 | 28.06.2016 | 04.08.2016 | 1978 | Client 5 | leab 2016 | code e1 5 | 12.09.2016 | 16.09.2016 | Laura | Saint Louis | Senegal | BL | Agency | 12 | 4.500 | 0 | 100 | Completed | 12 | 4.550 | 3.526 | 1.024 | 22 | 227 | 5 | 0 | 0 | Input/Output | EUR | 1 | 100 | 4.550 | 5.415 | 5.415 | 0 | 5.415 | 0 | 0 | 0 | 0 | 0 | 0 | 4.550 | 0 | 4.550 | 0 | 0 | 3.526 | 3.526 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Unknown | 21.03.2016 | 01.04.2016 | 15.06.2016 | 01.04.2016 | 28.11.2016 | 807 | Client 6 | festival 2016 | code e1 6 | 23.09.2016 | 25.09.2016 | Martin | Saint Louis | Senegal | BL | Agency | 20 | 18.000 | 0 | 100 | Completed | 20 | 11.276 | 9.676 | 2.104 | 19 | 130 | 1 | 0 | 503 | Input/Output | EUR | 1 | 100 | 11.277 | 12.815 | 12.815 | 0 | 12.815 | 0 | 0 | 0 | 0 | 0 | 0 | 11.277 | 0 | 11.277 | 0 | 0 | 9.676 | 9.676 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Unknown | 28.06.2016 | 29.06.2016 | 10.08.2016 | 10.08.2016 | 14.09.2016 | 43 | Client 7 | code e1 7 | 04.10.2016 | 05.10.2016 | Laura | Saint Louis | Senegal | BL | Agency | Other | 30 | 6.000 | 0 | 100 | Completed | 30 | 4.789 | 3.778 | 1.011 | 21 | 173 | 4 | 0 | 0 | Input/Output | EUR | 1 | 100 | 4.790 | 5.700 | 5.700 | 0 | 5.700 | 0 | 0 | 0 | 0 | 0 | 0 | 4.790 | 0 | 4.790 | 0 | 0 | 3.779 | 3.779 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Unknown | 05.08.2016 | 06.08.2016 | 10.08.2016 | 10.08.2016 | 10.08.2016 | 2374 | Client 8 | code e1 8 | 04.10.2016 | 06.10.2016 | Laura | Saint Louis | Senegal | BL | Agency | Other | 2 | 1.500 | 0 | 100 | Completed | 2 | 2.007 | 1.753 | 254 | 13 | -97 | -5 | 0 | 0 | Input/Output | EUR | 1 | 100 | 2.008 | 2.228 | 2.228 | 0 | 2.228 | 0 | 0 | 0 | 0 | 0 | 0 | 2.008 | 0 | 2.008 | 0 | 0 | 1.753 | 1.753 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Incentive | 01.09.2016 | 02.09.2016 | 29.11.2016 | 06.09.2016 | 02.11.2016 | 535 | Client 9 | code e1 9 | 19.10.2016 | 20.10.2016 | Larissa | Saint Louis | Senegal | BL | Agency | Other | 15 | 2.700 | 0 | 100 | Completed | 15 | 2.240 | 1.736 | 503 | 22 | 111 | 5 | 0 | 0 | Input/Output | EUR | 1 | 100 | 2.240 | 2.666 | 2.666 | 0 | 2.666 | 0 | 0 | 0 | 0 | 0 | 0 | 2.240 | 0 | 2.240 | 0 | 0 | 1.737 | 1.737 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Incentive | 22.09.2016 | 12.10.2016 | 23.11.2016 | 14.10.2016 | 07.11.2016 | 43 | Client 10 | code e1 10 | 19.10.2016 | 20.10.2016 | Maya | Saint Louis | Senegal | BL | Agency | Other | 25 | 1.000 | 0 | 100 | Completed | 25 | 2.360 | 1.433 | 926 | 39 | 513 | 22 | 0 | 0 | Input/Output | EUR | 1 | 100 | 2.360 | 2.808 | 2.808 | 0 | 2.808 | 0 | 0 | 0 | 0 | 0 | 0 | 2.360 | 0 | 2.360 | 0 | 0 | 1.434 | 1.434 | ||||||||
Saint Louis | Senegal | BL | Saint Louis | Incentive | 05.07.2016 | 06.07.2016 | 11.01.2017 | 12.07.2016 | 04.11.2016 | 535 | Client 11 | code e1 11 | 21.10.2016 | 22.10.2016 | Larissa | Saint Louis | Senegal | BL | Agency | Other | 24 | 4.500 | 3.500 | 22 | Completed | 24 | 7.513 | 6.404 | 1.109 | 15 | -206 | -3 | 0 | 0 | Input/Output | EUR | 1 | 100 | 7.514 | 8.791 | 8.791 | 0 | 8.791 | 0 | 0 | 0 | 0 | 0 | 0 | 7.514 | 0 | 7.514 | 0 | 0 | 6.405 | 6.405 |
With this data, I want to do analysis and prediction/classification to gain new insights and contribute something. I am using this company data as the basis for my master's thesis.
I need to carry out a data mining process, for example predicting next year's Net turnover, or clustering the projects to gain new insights.
I am somewhat new to RapidMiner, and I am struggling to choose the right path to start on.
I thought about generating two new columns at the beginning (in Turbo Prep): one column called
"Year" = the year of each project
and another column
"Project's length" = the number of days each project lasts
Could you please tell me whether, with the attributes I have, I can reach a satisfying result? Do you have any ideas? I am stuck in the middle, with too much data and too many dilemmas in my head, which prevents me from concentrating and taking the right approach.
That is why I need some fresh ideas, some motivation, and recommendations, please.
I thought about clustering and getting insights from the resulting clusters, and then building on that with a decision tree model that predicts next year's net turnover, for example. (It could be something other than predicting the turnover if you have another idea; I am open to everything.)
I tried Auto Model and clustering, but I am not getting any useful results. I guess there might be two reasons for this:
1. I do not know exactly how to approach this procedure, and I am missing something.
or
2. The data I have is not good enough for this type of approach.
Any help, please?
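For reference, the two derived columns described above (the project year and the project length in days) could be computed outside RapidMiner along these lines. This is only a rough sketch in plain Python: the function name derive_date_features and the dictionary keys are made up for illustration, the dd.mm.yyyy date format is assumed from the sample rows, and counting both the first and the last day is a choice that may not match what is wanted.

```python
from datetime import datetime

# Assumed date format, based on sample values such as "15.04.2016".
DATE_FORMAT = "%d.%m.%Y"

def derive_date_features(start_date: str, end_date: str) -> dict:
    """Return the project's year and its length in days (hypothetical helper)."""
    start = datetime.strptime(start_date, DATE_FORMAT)
    end = datetime.strptime(end_date, DATE_FORMAT)
    return {
        # The year is taken from the start date; the enquiry date is another option.
        "Year": start.year,
        # Both the first and last day are counted, so a one-day project has length 1.
        "Project length (days)": (end - start).days + 1,
    }

# Start/end dates copied from three of the illustrative rows.
projects = [
    {"Start date": "15.04.2016", "End date": "16.04.2016"},
    {"Start date": "07.07.2016", "End date": "07.07.2016"},
    {"Start date": "11.09.2016", "End date": "16.09.2016"},
]
for project in projects:
    project.update(derive_date_features(project["Start date"], project["End date"]))
    print(project)
```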
@sgenzer @jczogalla @David_A @mschmitz @stevefarr @Pavithra_Rao
Many thanks and much gratitude.
Kind regards,
Jana
Answers
You could also use clustering to see what kind of patterns are in the data. You should also look for outliers.
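One simple way to look for outliers in a single numeric column is the interquartile-range (IQR) rule. The sketch below uses only the Python standard library; the helper name iqr_outliers and the multiplier k=1.5 are illustrative choices rather than anything prescribed in this thread, and the input is the "Gross profit %" column of the ten illustrative rows in the question.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# "Gross profit %" values from the ten illustrative rows.
gross_profit_pct = [24, 97, 23, 22, 19, 21, 13, 22, 39, 15]
print(iqr_outliers(gross_profit_pct))
```

Values flagged this way are only candidates for a closer look; a project with a very high margin may be a data entry problem or a genuinely unusual project.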
Another option would be to reformulate your target label; sometimes predicting a continuous numerical value (like net turnover) is more difficult. Could you redefine it as a classification problem, by setting a threshold level of net turnover and then assigning a class (either above that level or below it)?
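As a rough sketch of that relabeling, the snippet below turns net turnover into a two-class label. The helper name label_by_threshold, the class names "high" and "low", and the use of the median as the default threshold are all assumptions made for illustration; the threshold should really come from what is meaningful for the business. The turnover figures are taken from the first five illustrative rows, reading the dot as a thousands separator.

```python
from statistics import median

def label_by_threshold(values, threshold=None):
    """Label each value "high" if it is strictly above the threshold, else "low"."""
    if threshold is None:
        # Default to the median so the two classes are roughly balanced.
        threshold = median(values)
    return ["high" if v > threshold else "low" for v in values]

# Net turnover of the first five illustrative rows (1.950 read as 1950, and so on).
net_turnover = [1950, 297, 9614, 4550, 11276]
labels = label_by_threshold(net_turnover)
print(list(zip(net_turnover, labels)))
```

The resulting label can then serve as the target for a classifier such as a decision tree in place of the numeric column.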
Without seeing your actual data, it is almost impossible to say whether there is enough predictive power in your attributes to do a good job predicting your outcome. But these are a few other things you should try.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
If they just gave you the data and said "Find something interesting", you would certainly want to look for interesting relationships between the various data fields. You could then discuss these with the people who gave you the data, which might lead to you learning more about what the data fields mean or what your colleagues would like you to concentrate on.
You may also want to check for missing and NULL values in the various data fields, and look for any inconsistencies in the values, because if the data is not entered in a consistent manner, it could be more difficult for RapidMiner to find interesting relationships between the data fields. It's usually helpful to get a sense of the minimum, average, median, and maximum values for the numeric data fields and how evenly (or unevenly) the data for each data field is distributed.
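A quick way to get those figures for one column outside RapidMiner might look like the sketch below. It is only an illustration: the helper names parse_number and summarize are made up, the dot is assumed to be a thousands separator (as the sample rows in the question suggest), and the example column mixes four "Net profit" values from the sample rows with one empty cell added to show how missing values are counted.

```python
from statistics import mean, median

def parse_number(text):
    """Parse cells such as "1.950" or "-97"; return None for an empty cell.

    Assumes "." is a thousands separator, which is a guess from the sample rows.
    """
    text = text.strip()
    if not text:
        return None
    return int(text.replace(".", ""))

def summarize(raw_values):
    """Count missing cells and compute basic statistics for the rest."""
    parsed = [parse_number(v) for v in raw_values]
    present = [v for v in parsed if v is not None]
    return {
        "missing": len(parsed) - len(present),
        "min": min(present),
        "mean": mean(present),
        "median": median(present),
        "max": max(present),
    }

# Four "Net profit" values from the sample rows plus one deliberately empty cell.
net_profit = ["122", "236", "515", "", "-97"]
print(summarize(net_profit))
```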
Hope this helps, good luck, and best wishes, Michael Martin