
IST755 – R.P. - Indira Guzman – May, 2002

DATA MINING: A Strategic Decision Support Tool For Organizations


Author: Indira Guzman
May 7, 2002

Introduction

The growth of information resources along with the accelerating rate of technological
change has produced huge amounts of information that often exceed the ability of
managers and employees to assimilate and use it productively. Data must be categorized
in some manner if it is to be accessed, re-used, organized, or synthesized to build a
picture of the company’s competitive environment or solve a specific business problem
(Pearlson, 2001, p.196).

In recent years, the need to extract knowledge automatically from very large databases
has grown. In response, the closely related fields of knowledge discovery in databases
(KDD) and data mining have developed processes and algorithms that attempt to
intelligently extract interesting and useful information from vast amounts of raw data
(Fayyad, Grinstein & Wierse, 2002). For example, Wal-Mart has one of the world’s
largest databases of customer transactions, with over 20 million transactions being
handled per day (Babcock, 1994). Wal-Mart’s aim in mining this data is practical: it wants
to know to whom it should mail its next advertising circular, not to prove a hypothesis.

Intelligent data mining, according to Edelstein (1996), discovers information within data
warehouses that queries and reports cannot reveal. That is why a data-mining project
requires the best possible selection of hardware, software and human resources. The market
offers software tools that require expertise and a high level of knowledge. Because of the
large amount of data involved, data mining tools usually run on sophisticated computers
with high-speed processors and large storage capacity. However, technology is not
everything, and the role of the analysts who work with data mining is very important. For
instance, database administrators should understand data mining requirements and try to
design databases that are as accessible for mining as possible. One of the biggest problems
in data mining projects is preparing the data for the algorithms. Data, as stored, is not
ready to mine.

Preparing data could be easier if databases were designed from the start with mining
purposes in mind.

There is a wide variety of data mining software and methodologies on the market. Hence, it
is important to choose the right methodology and the right tool. Identifying where and how
to run data-mining models is important, but so is interpreting the results that will
ultimately help organizations make strategic decisions.

Specifically, because data mining attempts to discover patterns, trends, and correlations
hidden in the data, those results can give a company a strategic business advantage
(O’Brien, 2001, p.363). Data mining can help managers make decisions and apply
more effective strategies in their organizations.

What is Data Mining?

“Data Mining is a non-trivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns from data” (Srikant & Agrawal, 1996).

That is a very brief definition that implies the purposes of mining and extracting
new information from data:

• It is “valid” because the patterns are expected to be well grounded in logic. To ensure
validity, the processes can be automatic or semiautomatic, and many tools are used to
make the algorithms and the resulting patterns as valid as possible;
• It is “novel” because data mining is a young field in which much research remains
to be done;
• It is “potentially useful” because the results can be used in the decision-making
process of any organization, in fields such as health, education, marketing, etc.;
• The patterns are “understandable” because the results should be capable of being
understood or interpreted by users from different backgrounds, not only by
researchers;
• They are “patterns” from previous data because a perceptual structure has been created
as a model that can be applied to new data;
• “Data” refers to the digitized information, first in databases and later in data
warehouses, that can be accessed by data mining tools.

According to the Data Mining Glossary from Two Crows (2002), data mining is an information
extraction activity whose goal is to discover hidden facts contained in databases. Using a
combination of machine learning, statistical analysis, modeling techniques and database
technology, data mining finds patterns and subtle relationships in data and infers rules
that allow the prediction of future results. The main points of this definition are
the term “facts”, because data mining works with real data, and “hidden facts”, because
data mining reveals behavior and performance that are not easily discovered.

Commercial databases are growing at unprecedented rates. In the evolution from business
data to business information, each new step has built upon the previous one. According to
Kurt Thearling (~1996), these are the steps in the evolution of data mining:

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today-1996) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining.


Although Thearling describes data mining in Table 1 as emerging “today”, that is, in 1996,
it is still emerging in 2002. The evolution of data mining is still going on, but what
happened in the past has created the basis for organizational mining.

What can data mining do?

Data mining can perform basically six tasks. The first three are all examples of directed
data mining, where the goal is to use the available data to build a model that describes
one particular variable of interest in terms of the rest of the available data. For
example, in analyzing bankruptcy, the target variable is a binary variable that describes
whether a client was declared bankrupt or not. In directed data mining, we try to find the
patterns that make that variable take the value 0 or 1. The next three tasks are examples
of undirected data mining, where no variable is singled out as a target and the goal is to
establish some relationship among all the variables. In the bankruptcy example,
undirected data mining tries to identify patterns in the behavior of customers without
indicating whether those customers are bankrupt or not.
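
The contrast between directed and undirected mining can be illustrated with a minimal sketch. The client records, the field names (late_payments, utilization, bankrupt) and both helper functions are hypothetical and exist only for illustration; they are not taken from any of the cited sources.

```python
from collections import Counter, defaultdict

# Hypothetical client records; 'bankrupt' is the binary target (1 = declared bankrupt).
clients = [
    {"late_payments": "high", "utilization": "high", "bankrupt": 1},
    {"late_payments": "high", "utilization": "high", "bankrupt": 1},
    {"late_payments": "high", "utilization": "low", "bankrupt": 0},
    {"late_payments": "low", "utilization": "high", "bankrupt": 0},
    {"late_payments": "low", "utilization": "low", "bankrupt": 0},
    {"late_payments": "low", "utilization": "low", "bankrupt": 0},
]


def bankruptcy_rate_by(records, feature):
    """Directed: relate one input feature to the target variable."""
    totals = defaultdict(int)
    bankrupt = defaultdict(int)
    for r in records:
        totals[r[feature]] += 1
        bankrupt[r[feature]] += r["bankrupt"]
    return {value: bankrupt[value] / totals[value] for value in totals}


def behavior_profiles(records):
    """Undirected: count recurring behavior patterns, ignoring the target."""
    return Counter((r["late_payments"], r["utilization"]) for r in records)


rates = bankruptcy_rate_by(clients, "late_payments")
profiles = behavior_profiles(clients)
```

In the directed case the target column drives the analysis, whereas the undirected function never reads it and only reports which combinations of behavior occur together.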

These are the types of information that can be obtained by data mining, summarized from
two sources, Turban & Aronson (2001) and Berry & Linoff (1999):

• Classification: consists of examining the features of a newly presented object and
assigning it to a predefined class or group. The task is to build a model that can be
applied to unclassified data in order to classify it using the defined characteristics
of a certain group (e.g., classifying credit applicants as low, medium or high risk).
• Estimation: given some input data, we use estimation to come up with a value
for some unknown continuous variable such as income, height, or credit card
balance (e.g., a bank trying to decide to whom it should offer a home equity
loan based on the probability that the person will respond positively to an offer).
• Prediction: records are classified according to some predicted future behavior or
estimated future value based on patterns within large sets of data (e.g., demand
forecasting or predicting which customers will leave within the next six months).
• Association: identifies relationships between events that occur at one time and
determines which things go together (e.g., the contents of a shopping basket: beer
with cigarettes).
• Clustering: identifies groups of items that share a particular characteristic,
segmenting a diverse group into a number of more similar subgroups or clusters.
Clustering differs from classification in that it does not rely on predefined classes
or characteristics for each group (e.g., as a first step in a market segmentation
effort, we can divide the customer base into clusters of people with similar buying
habits, and then ask what kind of promotion works best for each cluster or group).
• Description and Visualization: the purpose is to describe what is going on in a
complicated database in a way that increases our understanding of the people,
products, or processes that produced the data in the first place. A good description
suggests where to start looking for an explanation (e.g., repeat visits to a
supermarket).
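
As an illustration of the association task, the sketch below computes two common measures for a shopping-basket example: support (the share of baskets containing a pair of items) and confidence (the share of baskets with one item that also contain another). The baskets and the function names are hypothetical, and this is a simple counting approach, not the algorithm of any particular commercial tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one set of items per transaction).
baskets = [
    {"beer", "cigarettes", "chips"},
    {"beer", "cigarettes"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"cigarettes", "beer", "milk"},
]


def pair_support(transactions):
    """Fraction of transactions that contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every pair a single canonical (alphabetical) key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
beer_cig_support = support[("beer", "cigarettes")]
beer_to_cig = confidence(baskets, "beer", "cigarettes")
```

A pair with both high support and high confidence is a candidate for the kind of "things that go together" finding described under Association.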

Data mining techniques

The most commonly used techniques in data mining are (Thearling, 1995):

• Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These
decisions generate rules for the classification of a dataset. Specific decision tree
methods include Classification and Regression Trees (CART) and Chi Square
Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic
combination, mutation, and natural selection in a design based on the concepts of
evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset
based on a combination of the classes of the k record(s) most similar to it in a
historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor
technique.
• Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
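
Of these techniques, the nearest neighbor method is the simplest to show in a few lines. The following is a minimal sketch that assumes Euclidean distance as the similarity measure and a majority vote as the "combination of the classes"; the applicant data and the function name knn_classify are hypothetical.

```python
import math
from collections import Counter


def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among the k closest historical records.

    `history` is a list of (feature_vector, class_label) pairs.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort historical records by Euclidean distance to the query.
    by_distance = sorted(history, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical credit applicants: (income in $1,000s, debt ratio in %) -> risk class.
history = [
    ((25, 60), "high"),
    ((30, 55), "high"),
    ((28, 65), "high"),
    ((70, 20), "low"),
    ((80, 15), "low"),
    ((75, 25), "low"),
]

risk = knn_classify(history, (72, 22), k=3)
```

In practice the features would normally be rescaled to comparable ranges first, since otherwise the variable with the largest numeric range dominates the distance.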

Data mining tools

Data mining software is based on mathematical algorithms and statistics. Developers have
been working to make data mining software tools more user friendly, and the
different products available on the market have advantages and disadvantages that
basically relate to their interface and available techniques. Today, the market offers a
variety of products. According to a Kdnuggets poll conducted between June 27 and July 17,
2000, which asked “Which tool do you plan to try or buy next?” (242 votes in total), the
five most preferred tools were Clementine (32), Megaputer (32), SAS EM (Enterprise Miner)
(31), GainSmarts (28), and EasyMiner (15).

Gartner places six tool suites in the generic market: Angoss Knowledge Suite, IBM's
Intelligent Miner for Data, Oracle's Darwin, SAS' EnterpriseMiner, SGI's MineSet, and
SPSS' Clementine. Some of them are very expensive - data mining products range in
price from $1,000 on a PC to more than $100,000 for algorithms that run on mainframes
(Foley & Russell, 1998) - and not very user friendly. However, the best product for a
company would be the one that is easy to run with current corporate data and easy to
interpret for the managerial purposes of the company.

What are the steps for conducting a data-mining project?

The following are the necessary steps for conducting a data-mining project:

• Understand the purpose of the project together with the managers of the organization.
Define the objectives and how the results will be applied in the organization.
• Create the dataset that is going to be used. This may include creating a data
warehouse and selecting the databases that will probably need to be joined.
• Define the tools that are going to be used: hardware (e.g., powerful computers
with high-speed processors), software (e.g., database management software and
a data mining tool such as Clementine, SAS or SPSS) and, most important, the people who
are going to work on the project.
• Analyze the dataset together with the managers. It is important to define each
variable. Sometimes the names of the variables do not convey their real meaning, or
they can be interpreted in different ways.
• Check whether the data is ready for data mining methods. That means, for example,
cleaning the data and filling in missing values. Create dummy variables if necessary. In
the bankruptcy example, a dummy variable can be created that sets
bankrupt account = 1 if the client was declared bankrupt three months later.
• Reduce the size of the dataset. For example, summarize variables that have the
same meaning, and reduce the storage format of binary variables from one byte to one
bit. That reduces the size of the file, so it is easier to manipulate.
• Identify the relevance of variables. For instance, Wal-Mart may primarily require
transaction variables rather than the frequency of transactions.
• Take sample data and identify the best rules for partitioning the data. For
example, define what percentage of the available data is going to be used for learning,
validation and testing. One can use data from 1996 to learn patterns, then 1997
data to validate the results and 1998 data to test the results.
• Use data mining software tools and data mining techniques to search for the
models that perform best.
• Test the results on a different set of data.
• Develop an interpretation of the results, so that they can be understood not only by
the analysts but also by the managers.
• Use the results in the management decision-making process. For example,
choosing a better packing strategy after mining CRM data.
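
Two of the preparation steps above, creating a dummy variable and partitioning the data by year, can be sketched as follows. The account records, the field names and the choice of filling missing balances with the mean are assumptions made for illustration only; a real project would choose its own fill rule.

```python
# Hypothetical account records; field names are illustrative only.
accounts = [
    {"id": 1, "year": 1996, "status_3m_later": "bankrupt", "balance": 5200.0},
    {"id": 2, "year": 1996, "status_3m_later": "current", "balance": None},
    {"id": 3, "year": 1997, "status_3m_later": "current", "balance": 310.0},
    {"id": 4, "year": 1998, "status_3m_later": "bankrupt", "balance": 4100.0},
    {"id": 5, "year": 1998, "status_3m_later": "current", "balance": 890.0},
]


def prepare(records):
    """Fill missing balances with the mean and add a 0/1 bankruptcy dummy."""
    known = [r["balance"] for r in records if r["balance"] is not None]
    mean_balance = sum(known) / len(known)
    prepared = []
    for r in records:
        row = dict(r)  # copy so the original records stay untouched
        if row["balance"] is None:
            row["balance"] = mean_balance  # assumed fill rule: overall mean
        row["bankrupt"] = 1 if row["status_3m_later"] == "bankrupt" else 0
        prepared.append(row)
    return prepared


def split_by_year(records, learn_year=1996, validate_year=1997, test_year=1998):
    """Partition records into learning, validation, and testing sets by year."""
    def pick(year):
        return [r for r in records if r["year"] == year]

    return pick(learn_year), pick(validate_year), pick(test_year)


rows = prepare(accounts)
learn, validate, test = split_by_year(rows)
```

Note that this sketch computes the fill value over all years at once. A stricter design would compute it from the learning data only, so that nothing from the validation or testing years influences the model.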

Data mining issues

a. Privacy

Data mining is both a powerful and profitable tool, but it poses challenges to the
protection of individual privacy. Many critics wonder whether companies should be
allowed to collect such detailed information about individuals. Consumer groups, privacy
activists and persons associated with the Federal Trade Commission are concerned about
the privacy issues that arise with the vast amount of information that companies acquire
when data mining. To protect the privacy of customers, some measures were set out in
the Prepared Statement of the Federal Trade Commission, "Online Profiling: Benefits
and Concerns", delivered in Washington, D.C. on June 13, 2000, and available at
http://www.ftc.gov/os/2000/06/onlineprofile.htm.

b. Database requirements

To be applied effectively, data mining techniques need to be fully integrated with a data
warehouse. Figure 1 shows the importance of the data warehouse in the data mining process.

Figure 1. Adapted from Meltzer, 2001.

As the data warehouse grows, the organization can continually mine the best practices
and apply them to future decisions. However, if the data warehouse is not ready to mine,
preparing the data could take a large share of the resources available for the project.

c. Interpretation of the results

Sometimes the results that are obtained after mining data can be difficult to interpret.
Davenport and Prusak (2000, p.140) define neural networks as a statistically oriented tool
that excels at using data to classify cases into one category or another - say, whether a
loan customer is likely to default on a loan, or pay it back. However, it is not easy to
explain why neural networks did what they did. A particular case is classified in a
particular fashion according to nodes and variable weightings, and the classification is
therefore difficult to interpret. Data mining offers a variety of tools that make results
easier to interpret and explain, but an intelligent human is still required to (a)
structure the data in the first place; (b) interpret the data to understand the identified
pattern; and (c) make a decision based on the knowledge (Davenport & Prusak, 2000).

Why is data mining a strategic decision support tool for organizations?

Data mining’s main purpose is knowledge discovery leading to decision support. Data
mining tools find patterns in data and may even infer rules from them. These patterns and
rules can be used to guide decision-making and forecast the effect of these decisions.
Data mining can also speed analysis by focusing attention on the most important
variables. In fact, as Turban & Aronson (2001) state, the dramatic drop in the
cost/performance ratio of computer systems has enabled many organizations to start
applying the complex algorithms used in data mining techniques.

As an illustration, early users found great benefits with IBM’s DecisionEdge for
Relationship Marketing decision support software. “We had a full return on our
investment 14 months after installing the data mining component,” said Jo Ann
Boylan, an executive vice president in the key Technology Service division at Key Corp.,
the nation’s 13th largest retail bank with 7 million customers. She added that the data
mining and analysis system helped raise the bank’s direct-mail response rate from 1 to as
high as 10 percent. It also helped identify unprofitable product lines (O’Brien, 2001,
p.363).
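
The response-rate figures quoted above lend themselves to a quick calculation of what such an improvement means in practice. In the sketch below the two rates come from the quotation, while the mailing size of 100,000 pieces is a purely hypothetical number chosen for illustration.

```python
def expected_responses(pieces_mailed, response_rate):
    """Expected number of replies to a direct mailing."""
    return pieces_mailed * response_rate


baseline_rate = 0.01  # 1 percent, as quoted in the text
mined_rate = 0.10     # "as high as 10 percent", as quoted in the text
pieces = 100_000      # hypothetical mailing size, not from the source

# Lift: how many times more responsive the mined mailing is than the baseline.
lift = mined_rate / baseline_rate
extra = expected_responses(pieces, mined_rate) - expected_responses(pieces, baseline_rate)
```

Under these assumptions the same mailing would produce roughly ten times as many responses at the upper end of the quoted range, which helps explain how the investment could pay for itself quickly.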

According to Deering (2002), any company facing competition for customers confronts
the basic question, how do we anticipate and manage against initiatives intended to cut
our share of the market? This is true whether the competition is a new entrant to the
marketplace or a known competitor seeking greater market share. In any contested
marketing environment, the organization that is best able to anticipate competitor
strategies and tactics and offer real alternative value wins. The win could be the retention
of current valuable customers through skillful customer relationship management (CRM)
and acquisition of new customers in the head-to-head marketing environment.

Some strategic applications of data mining include (Laudon & Laudon, 2000, p.53):

• Identifying individuals or organizations most likely to respond to a direct mailing.
• Determining which products or services are commonly purchased together, such
as beer and cigarettes.
• Predicting which customers are likely to switch to competitors.
• Identifying which transactions are likely to be fraudulent.
• Identifying common characteristics of customers who purchase the same product.
• Predicting what each visitor to a Web site is most interested in seeing.

Because competitor plans are never fully known ahead of time, it is essential to leverage
all available information about customers’ reactions to potential offers. Sometimes the
information comes from the specific markets where more aggressive marketing initiatives
are anticipated; sometimes the information comes from sources external to that market. In
either case, effective knowledge management (KM) requires seeking diverse data about
customer and competitor activities and capitalizing on these data.

Creating new knowledge for competitive situations requires openness to an enlarged
array of data sources and the ability to capitalize on developments in data modeling and
mining. Interpretation and analysis that might otherwise lead only to directional guidance
can result in more specific decision parameters.

Data mining can also be used to locate individual customers with specific interests or
determine the interests of a specific group of customers. For example, American Express
continually mines a gigantic pool of computerized data on its 30 million credit card
holders to create highly personalized marketing campaigns (Laudon & Laudon, 2000). If,
for example, a customer purchases a dress at the Saks Fifth Avenue department store,
American Express might include in her next bill an offer of a discount on a pair of shoes
purchased at the same store and charged to her American Express Card. The two goals
are to increase the customer’s use of the American Express card and to expand the
presence of American Express at Saks.

Conclusion

In the last decade, investments in information technology resources have come to represent
an important asset for all organizations. Consequently, those resources should be able to
create a competitive advantage for organizations. One way to achieve this goal is
to apply data mining techniques.

Once large databases from different sources are converted into data warehouses, the data
mining process can extract information, discover hidden facts and create patterns.
Those patterns can be used to create “rules” that will ultimately support the managerial
decision-making process and create a competitive advantage for the organization. For
instance, through a data-mining project and focused differentiation, a company can
provide a specialized product or service for a narrow target market better than its
competitors. A business can create new market niches by identifying a specific target for
a product or service that it can serve in a superior manner.

Despite the privacy issues, database problems and interpretation problems, data
mining is an important tool that can be applied in every company to build a competitive
advantage.

The selection of the best hardware, software and human resources is a key issue that will
have direct influence on the success or failure of any data-mining project.

Currently, the market has many software tools that are difficult to categorize because all
of them still have their own advantages and disadvantages. Fortunately, new tools are
being developed constantly that more effectively mine data and provide decision support
based on that data mining. So far, many big companies like Wal-Mart, American Express
and Key Corp., have been running data mining projects with great results. Their success
supports the continuous research and evolution of data mining. In the same way, it
demonstrates that data mining is a strategic decision support tool for organizations.

References

Babcock C. (1994). Parallel Processing Mines Retail Data. Computer World.
Berry M.J.A. & Linoff G. (1999). Mastering Data Mining: The Art and Science of
Customer Relationship Management. John Wiley and Sons, Inc.
Davenport T.H. & Prusak L. (2000). Working Knowledge: How organizations manage
what they know. Boston, Massachusetts; Harvard Business School Press.
Deering B.J. (2002). Chapter 11: KM for competitive advantage: mining diverse sources
for marketing intelligence. Knowledge Management Strategy and Technology.
Bellaver R.F. & Lusa J.M. Editors. Artech House.
Edelstein H. (1996, Jan. 8). Mining Data Warehouses. Information Week.
Fayyad U., Grinstein G.E. & Wierse A. (2002). Information Visualization in Data
Mining and Knowledge Discovery. Morgan Kaufmann Publishers. Academic
Press.
Foley J. & Russell J.D. (1998). Mining your Own Business. InformationWeek.com, March
6, 1998. Retrieved May 4, 2002 from:
http://www.informationweek.com/673/73iudat.htm
Gartner Group Sized up Workbench Market: Data Mining News. Retrieved May 5, 2002
from: http://www.idagroup.com/v3n0101.htm
Kdnuggets poll (June 27 - July 17, 2000) with the question: Which tool do you plan to
try or buy next? Retrieved April 24, 2002 from:
http://www.kdnuggets.com/polls/next_dm_tool-2000-07-17.htm
Laudon K.C. & Laudon J.P. (2000). Management Information Systems: Organization
and Technology in the Networked Enterprise. Sixth Edition; Prentice Hall.
Meltzer M. (2001). E-Mining Myth & Magic: Using Data Mining Successfully. Retrieved
May 1, 2002 from: http://www.crm-forum.com/library/art/art-048/art-048.htm.
O’Brien J.A. (2001). Introduction to Information Systems: Essentials for the
Internetworked E-Business Enterprise. Tenth Edition; McGraw-Hill Irwin.
Pearlson K.E. (2001). Managing and Using Information Systems: A Strategic Approach. New
York; John Wiley & Sons, Inc.
Srikant R. & Agrawal R. (1996). Mining sequential patterns: Generalizations and
performance improvements. Proc. of the 5th International Conf. on Extending
Database Technology, France (March).
Thearling K. Retrieved May 6, 2002 from:
http://www3.shore.net/~kht/text/dmwhite/dmwhite.htm
Turban E. & Aronson J.E. (2001). Decision Support Systems and Intelligent Systems.
Sixth Ed. New Jersey; Prentice Hall.
Two Crows: Data Mining Glossary. Retrieved April 28, 2002 from:
http://www.twocrows.com/glossary.htm.
