autotagging and autocategorizing text pieces

mayageudens · November 2017

Hello RapidMiner community!

First of all, thank you for taking the time to read my question. Secondly, I apologize for my ignorance. I am totally new to data mining, and I have looked around the community but did not find any other post answering my question. Perhaps this is because of my lack of knowledge. Okay, so this is my problem:

I have around 5000 text pieces. I have categorized and tagged them. I want to build a rulebook that can autotag and autocategorize new text pieces. I have about 600 tags and about 20 categories. Every snippet can have several tags but only one category. Specifically, I want:

- to analyze the text so I can automatically give each snippet the correct tags (up to 4) from a list I have made myself.
- to analyze the text (or analyze the tags, whichever is easier) and find rules for putting each snippet in a category automatically.

I have no idea how to even begin this process, and I would be forever grateful if someone would be willing to guide me through it!

kypexin · November 2017

Hi @mayageudens

I can advise you on the second part, text categorization (I have done this before in a big project that categorized web sites based on their content and detected restricted categories like adult, drugs, weapons, etc.). I am not ready at this moment to advise on tagging the texts, though, as it seems to be a pretty different task that I have never approached.

1. Start with installing "Text processing" RapidMiner extension from the marketplace as this is gonna be the main tool for you.

2. Study operators "PROCESS DOCUMENTS FROM FILES" or "PROCESS DOCUMENTS FROM DATA", depending on the way your text data is stored. I have actually used the first one as I had all the data stored in text files which were then read by this operator.

3. Important thing is that you have to vectorize text data for further classification. I used TF-IDF for creating word vectors from text files.

4. For classifying text documents I found the simplest k-NN classification algorithm could produce really good results.
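Steps 3 and 4 above can be sketched outside RapidMiner to show what the operators do conceptually. This is only a minimal illustration in plain Python, not the process from the screenshots: the toy event texts, the category names, the smoothed IDF formula, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(token_lists):
    """Compute a smoothed inverse document frequency for every training term."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Build a length-normalised TF-IDF vector as a sparse dict.

    Terms that never occurred in the training data are dropped.
    """
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vectors, train_labels, idf, k=3):
    """Let the k most similar training documents vote on the category."""
    query = tfidf_vector(tokenize(text), idf)
    scored = sorted(
        ((cosine(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Invented toy data; a real run would use the ~5000 labelled snippets.
train_texts = [
    "techno dj night with live set at the club",
    "house party dj set until morning",
    "pottery workshop for beginners",
    "painting workshop learn watercolor techniques",
    "jazz concert with live band",
    "rock concert band on tour",
]
train_labels = ["party", "party", "workshop", "workshop", "concert", "concert"]

token_lists = [tokenize(t) for t in train_texts]
idf = fit_idf(token_lists)
train_vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]

prediction = knn_predict("beginners watercolor workshop",
                         train_vectors, train_labels, idf, k=3)
print(prediction)  # workshop
```

With 20 categories and real text, the value of k and the tokenization (stop words, stemming, language) would need tuning, which is what the vectorizing settings in the operator control.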

Here are also some screenshots from my process I used for the task. This doesn't mean that simply copying the structure will do the trick on your data, but at least it can give you many hints about how to approach the problem.

Whole process:

Screenshot 2017-11-17 13.04.09.png

Process Documents from Files:

Screenshot 2017-11-17 13.04.18.png

Vectorizing settings:

Screenshot 2017-11-17 13.04.29.png

Labelling and files structure (I used a separate directory for storing documents for each category):

Screenshot 2017-11-17 13.04.44.png

Cross validation:

Screenshot 2017-11-17 13.05.11.png
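For readers who cannot see the screenshot, the idea behind the cross-validation step can be sketched as follows. This is a generic k-fold loop in plain Python, not an export of the RapidMiner process; `majority_baseline` is a placeholder model invented for the example, and any classifier with the same `fit_predict` signature could be plugged in instead.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(texts, labels, fit_predict, k=5, seed=0):
    """Return the accuracy measured on each held-out fold.

    fit_predict(train_texts, train_labels, test_texts) must return one
    predicted label per test text. Assumes k <= len(texts).
    """
    accuracies = []
    for fold in k_fold_indices(len(texts), k, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(texts)) if i not in held_out]
        predictions = fit_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in fold],
        )
        correct = sum(pred == labels[i] for pred, i in zip(predictions, fold))
        accuracies.append(correct / len(fold))
    return accuracies


def majority_baseline(train_texts, train_labels, test_texts):
    """Placeholder model: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * len(test_texts)


# Invented toy data: 7 concerts and 3 workshops.
texts = ["event %d" % i for i in range(10)]
labels = ["concert"] * 7 + ["workshop"] * 3

scores = cross_validate(texts, labels, majority_baseline, k=5)
mean_accuracy = sum(scores) / len(scores)
print(scores, mean_accuracy)
```

A baseline like this is useful as a sanity check: a real text classifier should beat the accuracy of always guessing the largest category.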

I am also attaching slides about the whole project, which I presented at the RapidMiner Wisdom 2015 conference in Ljubljana. They might also be a useful source of knowledge.

Telcontar120 · November 2017

Agreed that the full scope of everything you have requested would be quite a complicated project, and quite likely beyond the scope of a forum answer. Thanks to @kypexin for a great starting point of resources!

A few additional comments/questions for your consideration:

What is the purpose of the tagging as opposed to the classification? It is possible (and in many cases preferable) for machine learning to do the classification component without the tagging (such as the k-NN example already given). Is tagging really needed, or is it simply an intermediate step to help a human? If the algorithm can do classification without tagging, is it necessary?
Do you really need 20 separate categories? The more categories, the harder it will be for any classification model. Could you simplify your categories to reduce the number?
Do you really need a "rulebook" type of classification? That will restrict the machine learning algorithms to tree or rule-based learners. Many other algorithms provide good results for text classification, such as SVM, k-NN, neural nets, and even Naive Bayes, but they will not produce "rulebooks" that are human-interpretable.
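To make the "rulebook" point concrete, here is a deliberately crude sketch of what a human-readable rule list looks like. It is not RapidMiner's tree or rule induction, just a keyword heuristic invented for illustration: a word becomes a rule if it appears in at least `min_support` training texts and always with the same category. The toy texts, thresholds, and function names are all assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def learn_keyword_rules(texts, labels, min_support=2, min_precision=1.0):
    """Return a list of (keyword, category, support) rules and a default category."""
    by_word = defaultdict(Counter)
    for text, label in zip(texts, labels):
        for word in tokenize(text):
            by_word[word][label] += 1
    rules = []
    for word, label_counts in by_word.items():
        category, hits = label_counts.most_common(1)[0]
        support = sum(label_counts.values())
        if support >= min_support and hits / support >= min_precision:
            rules.append((word, category, support))
    # Rules with more supporting texts are checked first; ties go alphabetically.
    rules.sort(key=lambda rule: (-rule[2], rule[0]))
    default = Counter(labels).most_common(1)[0][0]
    return rules, default


def apply_rules(text, rules, default):
    """Return the category of the first rule whose keyword occurs in the text."""
    words = tokenize(text)
    for keyword, category, _ in rules:
        if keyword in words:
            return category
    return default


# Invented toy data.
texts = [
    "techno dj night at the club",
    "house dj set until morning",
    "pottery workshop for beginners",
    "painting workshop with watercolor",
    "jazz band concert",
    "rock band on tour",
]
labels = ["party", "party", "workshop", "workshop", "concert", "concert"]

rules, default = learn_keyword_rules(texts, labels)
for keyword, category, support in rules:
    print("IF text contains %r THEN %s (seen %d times)" % (keyword, category, support))
print("ELSE", default)
```

Rules like these can be read and edited by hand, which is the appeal of a rulebook. The trade-off is the one described above: learners that weigh all words at once usually classify text more accurately but cannot be written down as a short list.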

mayageudens · November 2017

Thank you for answering!

I realize now that this project is maybe too big for me to handle or to set up. I will give you guys a little more information. I have a website that takes information about a bulk of events and categorizes and tags them. You can see the website here: http://findout.be/.
- As you can see, the tagging is really necessary. The category in itself is not enough to give people enough information about the event.
- Sadly, it is also impossible for me to simplify the categories. However, every event takes place in a venue, and every venue has about 3 possible categories (a club would almost never organize a workshop). Perhaps this will help me along?
- I really don't need a 'rulebook' if it is possible to set up this system and link it to my website database.

What do you guys think will be the best way to achieve this? I think I have realized I need help. I would be okay with spending some money on this, but my budget is very, very limited.
I truly appreciate the help you've already given me!
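The venue constraint and the four-tag limit described in this thread could be applied as a post-processing step on top of whatever model produces the scores. The following is only a sketch under assumptions: the score dictionaries stand in for the output of some trained model, and the venue name, categories, tags, and the 0.5 threshold are all invented for the example.

```python
def pick_category(category_scores, allowed_categories):
    """Return the best-scoring category among those the venue allows.

    Falls back to all categories if none of the allowed ones was scored
    (for example, for a venue that has not been seen before).
    """
    candidates = {c: s for c, s in category_scores.items()
                  if c in allowed_categories}
    if not candidates:
        candidates = category_scores
    return max(candidates, key=candidates.get)


def pick_tags(tag_scores, max_tags=4, threshold=0.5):
    """Return at most max_tags tags, best first, whose score reaches the threshold."""
    ranked = sorted(tag_scores.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, score in ranked[:max_tags] if score >= threshold]


# Hypothetical lookup table: which categories each venue can host.
venue_categories = {
    "example club": {"party", "concert", "festival"},
}

# Hypothetical model output for one new event at "example club".
category_scores = {"workshop": 0.41, "party": 0.38, "concert": 0.21}
tag_scores = {
    "techno": 0.9, "live": 0.7, "outdoor": 0.2,
    "dj": 0.8, "free": 0.55, "family": 0.6,
}

category = pick_category(category_scores, venue_categories["example club"])
tags = pick_tags(tag_scores)
print(category, tags)  # party ['techno', 'dj', 'live', 'family']
```

In this example the model's top guess ("workshop") is overruled because the venue never hosts workshops, which is how the venue information can turn a 20-way decision into a choice among about 3 categories.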

Telcontar120 · November 2017

Yep, this seems like it is more complicated than what you would get in terms of community support, unless you are planning to do a lot of the underlying work yourself.
One other option you have would be to post this as a project in the RapidMiner Experfy data science channel: https://www.experfy.com/channels/rapid-miner/marketplaces
There you can post a brief project description and your requirements, provide some sample data, state your budget and timeframe, and invite qualified data scientists to bid on the project. You'll probably be pleasantly surprised as to what you can get there.

Altair RapidMiner Community