
Text Mining Overview

Text mining is an analytical field that derives high-quality information from text. It is widely used in industry when data is unstructured. The derived information can take the form of numbers (indices), categories or clusters, or text summaries. In this blog, we will focus on the applications of text mining, its workflow, and an example.

Text Mining Applications

1. Analyze open-ended survey comments- Analysis of open-ended comments is one of the most common applications in the current market. When a survey is conducted, customers can often give feedback in free-text fields rather than constraining their opinions to a fixed rating scale. These responses sometimes run to more than 5000 words, too much for a person to read and summarize reliably, so text mining algorithms are a practical solution.

2. Analyze customer insurance/warranty claims, feedback forms, etc.- In the insurance domain, claim information is usually open-ended. For example, when a motor claim is filed, the insured describes the cause of the accident in free text, and you can imagine how difficult and error-prone it is for a company to process a huge number of motor claims manually every month.

3. Analyze user sentiment toward a particular product, campaign, or review using social media data- Every company is concerned about its brand, customer satisfaction, and customer preferences. It takes just seconds for a customer to go online and post negative comments about a company. Social media analytics uses text mining to compute customer sentiment and to identify the core topics that customers discuss on social media every day.

4. Automatic processing of emails/images/messages, etc.- Text mining algorithms are used to classify texts automatically. In Outlook, a user manually sorts emails into folders or marks them as spam. Similarly, on a larger scale, text mining algorithms can identify key topics so that emails are automatically forwarded to the appropriate department.

5. Identify competitors’ performance- In the business intelligence sector, information on competitors’ performance, capabilities, product offerings, and target business lines can be gathered and processed automatically using a combination of web crawling and text mining.

6. Automatic document search- Recently, researchers have focused on text mining to identify reference documents for their research. For example, suppose you are a researcher and want a summary of a chapter in a document. There are two ways to get it: read the entire chapter, or use text mining algorithms.

Workflow of Text Mining

1. Collect Data- Gathering unstructured information from websites, emails, blogs, social media, user comments, etc.

2. Text Parsing- This step involves word extraction, part-of-speech tagging, word filtering (removing prepositions, numbers, and punctuation), synonym handling, tokenization, and stemming.

3. Text Filtering- Removing irrelevant terms, building a stop word dictionary, and removing stop words.

4. Transformation- Building a term-document matrix (TDM) or document-term matrix (DTM), computing term frequency counts, and calculating the singular value decomposition (SVD).

5. Text Mining Algorithms- Hierarchical clustering, topic extraction, LDA with Gibbs sampling, text summarization using TextBlob noun phrase extraction, sentiment analysis by identifying polarity with a naive Bayes classifier, and Boolean rules.

6. Analysis, Insights & Recommendations- Relationships between key categories, fish bowl analysis, risk analysis, and identifying gaps and recommending actions to the business and key stakeholders.
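
To make the parsing, filtering, and transformation steps more concrete, here is a minimal sketch in plain Python that builds a small document-term matrix. The sample documents, the tiny stop word set, and the function names (parse, filter_terms, build_dtm) are all assumptions made for this illustration; a real project would typically use a dedicated library and a much larger stop word dictionary.

```python
import re
from collections import Counter

# Hypothetical sample documents, invented for this illustration.
documents = [
    "The claim was filed after a motor accident.",
    "The customer filed a complaint about the motor warranty.",
    "Great product, the customer loved the warranty!",
]

# A tiny illustrative stop word dictionary (real ones are much larger).
STOP_WORDS = {"the", "a", "was", "after", "about"}


def parse(text):
    """Text parsing: lowercase the text and extract alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_terms(tokens):
    """Text filtering: drop stop words from a token list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_dtm(docs):
    """Transformation: one row per document, one column per vocabulary term."""
    counts = [Counter(filter_terms(parse(d))) for d in docs]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, dtm = build_dtm(documents)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row of the matrix is one document and each column counts one term, so documents that share terms such as "motor" or "warranty" end up with overlapping non-zero columns. Clustering or topic extraction algorithms can then work on these rows.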

Text Mining Terminologies

1. Text cleanup- Removes hyperlinks, special characters, and ads from web pages, and removes figures and formulas from web pages and documents.

2. Tokenization- Tokenization is the process of dividing unstructured text into tokens such as words, phrases, keywords, and other elements.

3. Stemming- Stemming reduces words to their base form. E.g. “amazing”, “amazed”, and “amaze” can all be reduced to “amaze” using stemming.

4. Parts of Speech Tagging- POS tagging assigns a part of speech to every word in the document: noun, verb, adjective, pronoun, singular noun, plural noun, etc.

5. N-grams- N-grams are built during tokenization by combining adjacent tokens. Creating n-grams is important for understanding the data. E.g. “good” is a positive sentiment whereas “not” is neutral, but the combination “not good” is a negative sentiment.


If you want to analyze “the quick red fox jumps over the lazy brown dog”:

a. Bi-gram:- A combination of 2 adjacent words. E.g. “red fox” tells us that the fox is red, whereas “brown dog” tells us that the dog is brown. Hence, bi-grams can be used to compare the fox and the dog, where each is characterized by the word next to it.

b. Tri-gram:- A combination of 3 adjacent words. E.g. “red fox jumps” tells us that the fox is red and that it can jump, whereas “lazy brown dog” tells us that the dog is brown and lazy.

Analyzing the sentence without n-grams would lead to inaccurate information. Words like “red”, “fox”, “jumps”, “lazy”, “brown”, and “dog” analyzed separately don’t make sense.
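
The cleanup, tokenization, and n-gram ideas above can be sketched in a few lines of plain Python. The function names (clean, tokenize, ngrams) and the regular expressions are assumptions chosen for this illustration, and the sample sentence "The quick red fox jumps over the lazy brown dog!" is used as input. This is a simplified sketch, not a production tokenizer.

```python
import re


def clean(text):
    """Text cleanup: remove hyperlinks and special characters, then lowercase."""
    text = re.sub(r"https?://\S+", " ", text)   # drop hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Tokenization: split the cleaned text into word tokens."""
    return clean(text).split()


def ngrams(tokens, n):
    """Build n-grams by joining every run of n adjacent tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The quick red fox jumps over the lazy brown dog!")
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)

# The "not good" case: the bi-gram keeps the negation next to the adjective.
review_bigrams = ngrams(tokenize("The service was not good."), 2)
print(review_bigrams)
```

Because n-grams are built only from adjacent tokens, "red fox" and "lazy brown dog" appear in the output, while a pair such as "quick fox" does not, since "red" sits between those two words.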

We shall discuss the mathematical applications of text mining algorithms in upcoming blogs.