
Week 10: Text Analytics

Why do we need text analytics


- Database notes: call centre transcripts, other CRM (customer relationship management) data
- Email
- Open-ended survey responses
- Web pages
- Newsgroups
- Actual documents
- Information on competition

Text Analytics Process

Text Analysis

Data Collection
- Web Forums
- Blogs
- Speeches
- Focus Groups
- Open-ended Questions
- Call Logs
- Newspapers

Text Identification
> Computer Software:
- Identify synonyms
- Group related words
> Human Coders:
- Provide context
- Discern themes and patterns

Text Interpretation
> Content Analysis:
- Experimentally derived words, themes, and patterns
- Creation of word dictionaries and coding schemes
- Identification and measurement of emotions, attitudes, thought processes, and salient concerns

Text Mining
> Statistical Analyses:
- Regression
- Cluster Analysis
- Factor Analysis
- Structural Equation Modelling
- CHAID

Text Mining
Steps: Data Collection, Parsing, Filtering, Mining Topics and Trends

- The use of computer software to:
> Annotate and extract information from text sources, such as entities, concepts, topics, facts, attributes, and attitudes
> Analyse the annotated/extracted data
> Process documents for the purpose of retrieval, categorisation, classification and association

Data Collection
- Collection of textual data with text records
Parsing
- The purpose of parsing is to quantify information about the terms contained in the documents
- Includes the following tasks: identifying sentences; stemming (identifying root words); removing stop words (frequently occurring words that carry little meaning); tagging parts of speech; identifying entities (such as addresses); spell checks
- Stop word removal: many words are not informative and are thus irrelevant for document representation (e.g. the, and, an, is, of)
- Stemming: reducing words to their root form. A document may contain several variants of a word, such as fish, fisher and fishing, and an exact keyword query for one variant (e.g. fishing) will miss the others. They all share the stem fish, so they should be represented by the stem instead of the actual word.
- After parsing, the document is represented in a term document matrix, which is a numeric representation
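
The parsing tasks above can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: the stop-word list, the suffix rules in `toy_stem`, and the two sample documents are all assumptions made for the example, and a real project would use an established stemming algorithm rather than these toy rules.

```python
import re
from collections import Counter

# Assumed stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "and", "an", "a", "is", "of", "in"}

# Toy suffix rules, checked in order. NOT a real stemming algorithm.
SUFFIXES = ("ing", "er", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(document):
    """Lowercase, split into word tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def term_document_matrix(documents):
    """Return (terms, matrix); matrix[i][j] counts terms[i] in document j."""
    counts = [Counter(parse(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


# Hypothetical sample documents.
docs = [
    "The fisher is fishing in the lake",
    "Fish and fishing",
]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
# fish [2, 2]
# lake [1, 0]
```

Because fisher, fishing and fish all reduce to the stem fish, both documents end up with the same count in the fish row.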

Text Filtering
- The purpose of text filtering is to reduce the total number of parsed terms or documents being analysed, according to value and relevance (this reduces the amount of working data and hence processing time)
- Text filtering includes the following tasks: spell checks and dictionary; term weights; others (minimum number of documents in which a term must occur to be considered for analysis, keyword-based filtering)
- Term weights: term weights are assigned to identify important terms, based on how frequently they occur in individual documents (frequency weighting) and on how they are distributed throughout the document collection (term weighting)
> Frequency Weighting:
- The frequency tf_{t,d} of term t in document d is defined as the number of times that term occurs in d
- Raw frequency is not ideal, as term relevance does not increase proportionately with raw frequency
- Therefore, we use log-frequency: log-frequency = ln(tf_{t,d} + 1)
> Term Weighting:
- Rare terms are more informative than frequent terms such as stop words
- Consider a term that is rare in the collection of documents (e.g. technocratic)
- A document containing this term is likely to be relevant to a search for technocratic, so such rare terms are given a high weight
- Given that frequent terms are less informative than rare terms, we use the inverse document frequency: idf = log(1 / proportion of documents containing the term)
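
The filtering tasks above (a minimum-document threshold, log-frequency, and idf) can be sketched as follows. The term counts are hypothetical, the function names are made up for this example, and the natural logarithm is assumed for idf because the notes do not state a base.

```python
import math


def filter_by_min_docs(term_counts, min_docs):
    """Keep only terms that occur in at least `min_docs` documents."""
    return {
        term: counts
        for term, counts in term_counts.items()
        if sum(1 for c in counts if c > 0) >= min_docs
    }


def log_frequency(tf):
    """Frequency weight from the notes: ln(tf + 1)."""
    return math.log(tf + 1)


def idf(counts):
    """Term weight: log(1 / proportion of documents containing the term).

    Assumes the term occurs in at least one document (natural log assumed).
    """
    n_docs = len(counts)
    n_containing = sum(1 for c in counts if c > 0)
    return math.log(n_docs / n_containing)


# Hypothetical raw counts: term -> count in each of four documents.
term_counts = {
    "service": [3, 1, 2, 5],
    "technocratic": [0, 0, 4, 0],
    "refund": [1, 0, 0, 2],
}

for term, counts in term_counts.items():
    print(term, "idf =", round(idf(counts), 3))

print(sorted(filter_by_min_docs(term_counts, min_docs=2)))
```

A term that appears in every document (service here) gets an idf of 0, while the rare term technocratic gets the highest weight. Note the tension with the minimum-document threshold: with `min_docs=2` the same rare term is filtered out, so the threshold has to be chosen with the analysis goal in mind.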
Final Weighted Term Document Matrix
- After parsing, each document is represented by the terms that occur in the document
- Next, as part of the filtering step, the frequency and term weightings are determined, which creates the weighted term document matrix
- This final matrix represents each document as a vector, which can be used for similarity calculations in clustering, information retrieval and other advanced analytics on textual data
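
A minimal sketch of how document vectors from a weighted matrix could feed a similarity calculation. Two things here are assumptions rather than statements from the notes: each weighted entry is taken as the product ln(tf + 1) × idf, and cosine similarity is used as the similarity measure. The counts are hypothetical.

```python
import math


def weighted_matrix(counts):
    """counts[i][j] is the raw count of term i in document j.

    Each entry becomes ln(tf + 1) * idf (the product is an assumption).
    Every term is assumed to occur in at least one document.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        n_containing = sum(1 for c in row if c > 0)
        idf = math.log(n_docs / n_containing)
        weighted.append([math.log(c + 1) * idf for c in row])
    return weighted


def document_vector(matrix, j):
    """Column j of the matrix: the vector representing document j."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term counts for three documents.
counts = [
    [2, 1, 0],  # "refund"
    [0, 0, 3],  # "delivery"
    [1, 1, 1],  # "service": occurs in every document, so its idf is 0
]
w = weighted_matrix(counts)
d0, d1, d2 = (document_vector(w, j) for j in range(3))
print(cosine_similarity(d0, d1))  # close to 1: both are about "refund"
print(cosine_similarity(d0, d2))  # 0.0: no weighted terms in common
```

The term service contributes nothing to either similarity because its idf is 0, which is the effect term weighting is meant to have on uninformative terms.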
Topic Extraction
- A topic is a collection of terms that define a theme or an idea
- Every document can be assigned to a topic depending on the terms that are contained in it
- One document can belong to one or many topics (unlike clustering)
- The objective of creating topics is to represent a group of documents as themes of interest contained in the corpus
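
To illustrate only the assignment step, the sketch below matches a document's terms against topics that are already defined as term collections. The topics, the terms, and the `min_matches` threshold are hypothetical, and this is not a topic extraction algorithm: in practice the topics themselves are derived from the corpus.

```python
# Hypothetical topics, each defined by a collection of (stemmed) terms.
TOPICS = {
    "billing": {"invoice", "refund", "charge"},
    "delivery": {"courier", "late", "parcel"},
}


def assign_topics(document_terms, topics=TOPICS, min_matches=1):
    """Return every topic sharing at least `min_matches` terms with the document."""
    terms = set(document_terms)
    return sorted(
        name
        for name, topic_terms in topics.items()
        if len(terms & topic_terms) >= min_matches
    )


print(assign_topics(["parcel", "late", "refund"]))  # ['billing', 'delivery']
print(assign_topics(["invoice", "charge"]))         # ['billing']
print(assign_topics(["hello"]))                     # []
```

The first document is assigned to two topics at once, which is the difference from clustering noted above.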
