"Content analysis of annual corporate reports with text processing"
Hi RM-Crew,
I have a question regarding a text mining project I want to do for my master's thesis.
I want to do a content analysis of corporate disclosures. To do so, I want to train a model on an example set (an Excel list of representative sentences, each classified into one of 6 topic categories). After that, I want to apply the model to several unseen annual reports (PDF format) to measure how much each company discloses on those 6 categories.
Now I am a little lost in choosing the right transformation steps for the annual reports. I could tokenize the documents to get a full list of sentences, but I don't actually want every sentence to be categorized. I only want the model to measure how much of the content of each annual report refers to each of the 6 topics.
Do you have an idea, or has somebody done a similar project?
Thanks and best regards,
Nadine
Answers
What you are describing is very similar to LDA, which is a topic modeling approach for text data. Check out the operator and the tutorial sample included in RapidMiner (you'll need the Toolbox extension, which is free). However, this doesn't allow you to "train" the classifier with particular examples; instead, it looks for patterns in the data and comes up with its own topic groupings. To train it, you feed it the entire document and typically tokenize at the word level rather than the sentence level, because that is much more granular and accurate.
If you really have to train a model on your predefined categories, and the categories are not mutually exclusive, then you will likely have to build 6 separate predictive models, one for each topic. You then run every document through those models and get a confidence score for each topic. In that case you probably want to tokenize every document at the word level again, so the model has the raw material in its most flexible form for determining the classification labels (which you will provide for an initial sample).
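To make the "one model per topic" idea concrete outside of RapidMiner, here is a minimal Python sketch. It is only an illustration under assumptions: the topic names, the training sentences, and the choice of a simple naive Bayes model with add-one smoothing are all hypothetical (the thread does not say which algorithm to use), and only 3 topics are shown instead of 6 to keep it short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-level tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_binary_nb(examples, topic):
    """Train one topic-vs-rest naive Bayes model from (sentence, label) pairs."""
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for sentence, label in examples:
        is_topic = label == topic
        counts[is_topic].update(tokenize(sentence))
        n_docs[is_topic] += 1
    vocab = set(counts[True]) | set(counts[False])
    return {"counts": counts, "n_docs": n_docs, "vocab": vocab}


def confidence(model, text):
    """Return the model's confidence (0..1) that the text relates to its topic."""
    total_docs = sum(model["n_docs"].values())
    vocab_size = len(model["vocab"])
    log_score = {}
    for cls in (True, False):
        total_words = sum(model["counts"][cls].values())
        # Smoothed class prior, then smoothed per-word likelihoods.
        score = math.log((model["n_docs"][cls] + 1) / (total_docs + 2))
        for tok in tokenize(text):
            if tok in model["vocab"]:  # words never seen in training are ignored
                score += math.log(
                    (model["counts"][cls][tok] + 1) / (total_words + vocab_size)
                )
        log_score[cls] = score
    # Normalise the two log scores into a probability for the topic class.
    top = max(log_score.values())
    exp_true = math.exp(log_score[True] - top)
    exp_false = math.exp(log_score[False] - top)
    return exp_true / (exp_true + exp_false)


# Hypothetical labelled sentences standing in for the Excel example set.
examples = [
    ("We reduced carbon emissions and energy use", "environment"),
    ("Waste and water consumption were lowered", "environment"),
    ("Employee training and workplace safety improved", "employees"),
    ("We hired new staff and expanded employee benefits", "employees"),
    ("The board of directors approved the audit committee charter", "governance"),
    ("Shareholders elected independent directors", "governance"),
]
topics = sorted({label for _, label in examples})

# One separate binary model per topic.
models = {topic: train_binary_nb(examples, topic) for topic in topics}

# Score an unseen text against every topic model.
new_text = "The company cut emissions and saved energy"
scores = {topic: confidence(models[topic], new_text) for topic in topics}
```

Because each topic has its own model, the scores do not have to sum to 1, which is what you want when the categories are not mutually exclusive.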
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
@Telcontar120,
thank you very much for your detailed answer.
Yes, unfortunately I have to train the model to recognize and measure pre-defined categories, so LDA will apparently not work out for me.
Do you know which predictive models are usually used for such text analysis projects? (Unfortunately, the website https://mod.www.turtlecreekpls.com/ for finding the right model doesn't work, and neither does the link Ingo posted in his article: //www.turtlecreekpls.com/blog/doc-ingo-what-model-should-i-use/)
I assume the confidence score as the result of each topic model could be interpreted as the “amount of information” the unknown text contains about the specific topic, right?
Thank you very, very much. You are really helping me a lot!
Best regards,
Nadine
The resulting score is more accurately interpreted as the confidence that the algorithm has that a specific text relates to a specific topic. So you would likely want to rank the documents by score and establish some threshold cutoffs to say which ones are related to or "about" each topic. Whether a document is really "about" a given topic is a somewhat individualized judgment, I think.
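As a small illustration of ranking and thresholding, the Python sketch below sorts documents by their confidence for one topic and flags those at or above a cutoff. The report names, the scores, and the cutoff of 0.6 are all made up for the example; the right cutoff is a judgment call you would have to make for your own data.

```python
def rank_and_flag(scores, threshold):
    """Rank documents by confidence (highest first) and flag those at or above the cutoff."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc, score, score >= threshold) for doc, score in ranked]


# Hypothetical confidence scores from one topic model.
topic_scores = {"report_a": 0.91, "report_b": 0.42, "report_c": 0.67}

# 0.6 is an arbitrary cutoff chosen only for this example.
ranking = rank_and_flag(topic_scores, threshold=0.6)

for doc, score, is_about in ranking:
    print(f"{doc}: {score:.2f} {'above' if is_about else 'below'} cutoff")
```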
By the way, if you are strictly looking to see whether a given document mentions a given set of words or phrases, you can build a rule-based model pretty easily using the binary occurrences word vector and a given wordlist. That will simply let you know which documents contain those words or phrases. Sometimes that type of "score" is used in an audit context (e.g., it will return positive for every document that contains words of interest) but it doesn't necessarily tell whether the document is "about" that topic.
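For comparison, the rule-based idea can be sketched in a few lines of Python. This is not the RapidMiner operator itself, just an approximation of a binary occurrences word vector restricted to a given wordlist; the wordlist and the report texts are invented for illustration, and only single words are handled (phrases would need n-grams).

```python
import re


def binary_occurrences(text, wordlist):
    """Return 1 for each wordlist entry that occurs in the text, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {word: int(word in tokens) for word in wordlist}


def mentions_any(text, wordlist):
    """Rule: the document is flagged if it contains at least one word of interest."""
    return any(binary_occurrences(text, wordlist).values())


# Hypothetical wordlist and documents.
wordlist = ["emissions", "recycling", "energy"]
documents = {
    "report_a": "Our energy consumption fell while emissions stayed flat.",
    "report_b": "Revenue grew and the dividend was increased.",
}

vectors = {name: binary_occurrences(text, wordlist) for name, text in documents.items()}
flagged = [name for name, text in documents.items() if mentions_any(text, wordlist)]
```

A flag here only says that a word appears, so a report that mentions "energy" once in passing is treated the same as one that devotes a whole chapter to it.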