Overview of Text Mining

June 4, 2015

Text mining is an analytical field that derives high-quality information from text. It is widely used in industry when data is unstructured. The derived information can be delivered as numbers (indices), categories or clusters, or a summary of the text. In this blog, we will focus on the applications of text mining, its workflow, and an example.

Text Mining Applications

1. Analyze open-ended survey comments- Analysis of open-ended comments is one of the most common applications in the current market. When a survey is conducted, customers can often give feedback to the company in open-ended responses rather than constraining their opinions to a fixed rating scale. Sometimes these responses run to more than 5000 words, and at that volume a human reader cannot reliably gather and extract the information. A practical solution is to use text mining algorithms.

2. Analyze customer insurance/warranty claims, feedback forms, etc.- In the insurance domain, claim information is usually open-ended. For example, when a motor claim is filed, the insured describes the cause of the accident in free-text comments, and you can imagine how difficult and error-prone it is for a company to process a huge number of motor claims by hand every month.

3. Analyze sentiment of users towards a particular product/campaign/review using social media data- Every company is concerned about its brand, customer satisfaction and customer preferences. It takes just seconds for a customer to go on the internet and spread negative comments about a company. Social media analytics uses text mining to compute customer sentiment. Text mining also helps identify the core topics that customers discuss on social media every day.

4. Automatic processing of emails/images/messages etc.- Text mining algorithms are used for automatic classification of texts. In Outlook, a user manually sorts emails into various folders or marks them as spam. Similarly, on a larger scale, text mining algorithms can identify key topics so that emails are automatically forwarded to the appropriate department.

5. Identify competitors' performance- In the business intelligence sector, identifying competitors' performance, capabilities, products offered and target business lines can be automated using a combination of web crawling and text mining.

6. Automatic document search- In recent months, researchers have focused on text mining to identify reference documents for their research. For example, suppose you are a researcher and want a summary of a chapter in a document. There are two ways to get it: read the entire chapter, or use text mining algorithms.
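As a rough illustration of application 4 (forwarding emails to the right department), here is a minimal sketch based on simple keyword rules. The department names, the keyword lists and the `route_email` helper are all hypothetical and chosen only for illustration; a production system would more likely use a trained classifier.

```python
import re

# Hypothetical routing rules: department name -> keywords that suggest it.
ROUTING_RULES = {
    "claims": {"claim", "accident", "damage"},
    "billing": {"invoice", "refund", "payment"},
    "support": {"login", "password", "error"},
}


def route_email(text, rules=ROUTING_RULES, default="general"):
    """Return the department whose keywords match the most words in the email."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_department, best_hits = default, 0
    for department, keywords in rules.items():
        hits = len(words & keywords)
        # Strict comparison: on a tie, the department listed first wins.
        if hits > best_hits:
            best_department, best_hits = department, hits
    return best_department


print(route_email("I filed a claim after the accident last week"))
print(route_email("Please process my refund, the payment failed"))
```

An email that matches none of the keywords falls back to the `default` department, so nothing is silently dropped.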

Workflow of Text Mining

1. Collect Data- Unstructured information from websites, emails, blogs, social media websites, user comments, etc.

2. Text Parsing- This step involves extraction of words, part-of-speech tagging, word filtering (removing prepositions, numbers, and punctuation), synonym handling, tokenization, and stemming.

3. Text Filtering- Removing irrelevant terms, building a stop word dictionary, and removing stop words.

4. Transformation- Building a term-document matrix (TDM) or document-term matrix (DTM), computing term frequency counts, and calculating the singular value decomposition (SVD).

5. Text Mining Algorithms- Hierarchical clustering, topic extraction with Latent Dirichlet Allocation (LDA) and Gibbs sampling, text summarization using TextBlob noun phrase extraction, sentiment analysis by identifying polarity with a naïve Bayes classifier, and Boolean rules.

6. Analysis, Insights & Recommendations- Relationships between key categories, fish bowl analysis, risk analysis, identifying gaps, and recommending actions to the business and key stakeholders.
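Steps 2 to 4 of the workflow can be sketched in a few lines of plain Python. This is only a minimal sketch: the sample comments, the tiny stop word dictionary and the helper names (`tokenize`, `filter_tokens`, `document_term_matrix`) are assumptions made for illustration, and real projects typically rely on a text mining library and a much larger stop word list.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for collected survey comments.
DOCS = [
    "The claim process was slow and the staff was not helpful.",
    "Great product, fast delivery and helpful staff!",
    "Delivery was slow, but the product is great.",
]

# A tiny illustrative stop word dictionary ("not" is kept on purpose).
STOP_WORDS = {"the", "was", "and", "is", "but", "a", "an"}


def tokenize(text):
    """Text parsing: lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Text filtering: drop tokens that appear in the stop word dictionary."""
    return [token for token in tokens if token not in stop_words]


def document_term_matrix(docs):
    """Transformation: return (vocabulary, matrix).

    matrix[i][j] is the number of times vocabulary[j] occurs in docs[i].
    """
    counts = [Counter(filter_tokens(tokenize(doc))) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


vocab, dtm = document_term_matrix(DOCS)
print(vocab)
for row in dtm:
    print(row)
```

Each row of the matrix describes one document and each column one term; transposing it gives the term-document matrix (TDM). The algorithms in step 5 usually take a matrix of this kind as input.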

Text Mining Terminologies

1. Text cleanup- Removes hyperlinks, special characters and ads from web pages, and removes figures and formulas from web pages and documents.

2. Tokenization- Tokenization is the process of dividing unstructured text into tokens such as words, phrases, keywords, and other elements.

3. Stemming- A process used to reduce words to their base form. E.g. “amazing”, “amazed”, and “amaze” can all be reduced to “amaze” using stemming.

4. Parts of Speech Tagging- POS tagging assigns a part of speech to every word in the document: noun, verb, adjective, pronoun, singular noun, plural noun, etc.

5. N-grams- An n-gram is a sequence of n adjacent tokens, and n-grams are created as part of tokenization. Creating n-grams is important for understanding the data. E.g. “good” is a positive sentiment whereas “not” is neutral, but when you combine them into “not good” it’s a negative sentiment.


Suppose you want to analyze “the quick red fox jumps over the lazy brown dog”.

a. Bi-gram:- A sequence of 2 adjacent words. E.g. “red fox” tells us that the fox is red, whereas “brown dog” tells us that the dog is brown.

b. Tri-gram:- A sequence of 3 adjacent words. E.g. “quick red fox” tells us that the fox is quick and red, and “red fox jumps” that the red fox can jump, whereas “lazy brown dog” tells us that the dog is lazy and brown. Hence, this could be used to compare the fox and the dog, where the former is characterized by its quickness and the latter by its laziness.

Analyzing the sentence without n-grams leads to inaccurate information. Words like “red”, “fox”, “jumps”, “lazy”, “brown” and “dog” analyzed separately don’t make sense, because the link between each description and the animal it describes is lost.
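The n-gram idea can be sketched in a few lines of Python. The sentence used here includes the word “brown” so that “lazy brown dog” appears as a tri-gram. The `ngrams` and `polarity` helpers and the two-word sentiment lexicon are hypothetical, written only to illustrate the point about “not good”; they are not a complete sentiment method.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the quick red fox jumps over the lazy brown dog"
tokens = sentence.split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)

# Hypothetical two-word sentiment lexicon, for illustration only.
POLARITY = {"good": 1, "bad": -1}


def polarity(tokens):
    """Sum word polarities, flipping the sign of a word preceded by "not"."""
    score = 0
    for i, token in enumerate(tokens):
        value = POLARITY.get(token, 0)
        if i > 0 and tokens[i - 1] == "not":
            value = -value  # the bi-gram ("not", word) reverses the sentiment
        score += value
    return score


print(polarity("the service was good".split()))
print(polarity("the service was not good".split()))
```

Looking at single words only, both test sentences would score as positive because of “good”; taking the preceding word into account is what lets “not good” come out negative.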

We shall discuss the mathematical applications of text mining algorithms in upcoming blogs.


About the Author

Anuj Mehra has 4 years of professional experience in the field of analytics. His core competencies are machine learning, statistics, algorithms and text mining, spanning the retail and insurance verticals.

