
Analytics on Microsoft Excel

Overview of solutions around the platform, from a real-world perspective

Alberto Guillén
03. June 2008
 Capgemini is a leading company with long experience in technology services
 We are one of the biggest actors in Business Intelligence in Norway
 A major demand from our clients is delivering solutions in Microsoft Excel; we have continuously updated our efforts to adapt to clients’ needs

Alberto Guillén
Consultant, Risk Management & Compliance
MSc. Mathematics
MSc. Statistics
Phone: +47 46444721
E-Mail: alberto.guillen@capgemini.com

Excel is more than worksheet functions and tables

Basic statistical environment

Data Visualization and Data Mining

Desktop for analytical solutions


EXCEL

Front for Data Warehouse Collection (SQL Server)

GUI for in-house coded programs (VBA)

Reporting and Monitoring Tool (Performance Point)

Excel is the industry standard for end-user calculations, and it also serves as a front-end interface

Analytical solutions can be created on different
complexity layers beyond basic Excel

[Diagram: complexity layers around Excel]
 Analysis ToolPak
 Solver
 Third-party Add-ins
 Data Mining Add-in
 VBA
 Statistical Programming Languages

The average user masters the standard Excel tools
BASIC EXCEL

[Figure: screenshots of the standard Excel tools (Tables and Filters, Worksheet functions, Charts, "What If" Analysis and Pivot Tables with Sum of Income by row and column labels), all applied to a sample customer table]

Sample rows from the customer table:

ID     Marital Status  Gender  Income  Children  Education            Occupation      Home Owner
12496  Married         Female  40000   1         Bachelors            Skilled Manual  Yes
24107  Married         Male    30000   3         Partial College      Clerical        Yes
14177  Married         Female  80000   5         Partial College      Professional    No
24381  Single          Male    70000   0         Bachelors            Professional    Yes
25597  Single          Male    30000   0         Bachelors            Clerical        No
27974  Single          Male    160000  2         High School          Management      Yes
19364  Married         Male    40000   1         Bachelors            Skilled Manual  Yes
22155  Married         Male    20000   2         Partial High School  Clerical        Yes
11434  Married         Male    170000  5         Partial College      Professional    Yes
23542  Single          Male    60000   1         Partial College      Skilled Manual  No
12610  Married         Female  30000   1         Bachelors            Clerical        Yes
25940  Single          Male    20000   2         Partial High School  Clerical        Yes

Standard Excel allows direct interaction with the raw data

The Analysis ToolPak makes rigorous analysis possible
ANALYSIS TOOLPAK

Basic statistical analyses are available through the Analysis ToolPak

Solver leverages computational abilities
SOLVER

 Free Microsoft download
 Set of optimization and root-finding algorithms
 Can be called in the background from VBA
 Practical but slow in heavy calculations
 Convergence is sometimes not exact!
 Can be tuned
 Needs good seeds (starting values)

Solver implements standard algorithms for mathematical optimization problems
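
The remark that Solver needs good seeds holds for iterative methods in general. The sketch below is not Solver's own algorithm: it uses plain Newton iteration on a hypothetical test function, f(x) = x^3 - 2x + 2, chosen only because one starting value converges while another cycles forever.

```python
def newton(f, df, seed, tol=1e-10, max_iter=50):
    """Newton root finding; returns (x, converged)."""
    x = seed
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x, True
        x = x - fx / df(x)
    return x, abs(f(x)) < tol


# Hypothetical test function, picked because it is sensitive to the seed.
def f(x):
    return x**3 - 2 * x + 2


def df(x):
    return 3 * x**2 - 2


good_root, good_ok = newton(f, df, seed=-2.0)  # converges to the real root
bad_root, bad_ok = newton(f, df, seed=0.0)     # cycles 0 -> 1 -> 0 -> ...
print(good_root, good_ok)
print(bad_root, bad_ok)
```

The same function and the same algorithm give a usable answer from one seed and none from the other, which is why it pays to try several starting values before trusting a result.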

Third-party add-ins easily provide new functionality
ADD-INS

 Cheap
 Simple
 Easy to use
 No development effort

Many small software developers use Excel as their GUI

There are several third-party add-ins offering solutions for quantitative analysis and Monte Carlo simulation
ADD-INS

Hundreds of free or cheap add-ins offer solutions in fields like Risk Management

The Table Analysis Tools add-in brings data mining capabilities
DATA MINING

Data Mining is embedded into table functionality

The Data Mining add-in eases data mining for business analysts
DATA MINING

 What is Data Mining?
• Data mining is frequently described as "the process of extracting valid, authentic, and actionable information from large databases."

 Microsoft’s approach to Data Mining:
• Business Intelligence with a user-friendly interface, accessible to end-users and developers

 Software
• SQL Server 2005/2008 (Visual Studio BI)
• Excel/Visio add-ins
• DMX
• ADOMD.Net / AMO

Microsoft brings Data Mining to business users for the first time

Microsoft takes a different approach to Data Mining
DATA MINING

Donald Farmer, Principal Program Manager for Microsoft’s Data Mining:

"We don't have all the functionality of something like a SAS or an SPSS, because that's just not our market. […] Our market just has to be a much larger market."

"We have a huge database marketing team who do classic customer analysis. These guys were all SAS users, but when they joined Microsoft, they started using our tools. […], they actually use the Excel data mining add-ins to do it. It's not that there's nothing they don't miss, it's that they are able to achieve the same business results using our tools."

"For a function such as 'Detect Categories,' what the add-in is doing is building a clustering model in the background […], but we don't expect the Excel user to understand that. We just call it 'Build Categories.'"

"We're seeing a lot of interest in the Excel-side data mining, for one thing, but we're also seeing interest in the embed-ability, too. The people who are actually pushing this are from the developer side."

Microsoft will not compete with traditional DM vendors; Microsoft targets other users

Data Mining assists in various business processes
DATA MINING

Top Business Scenarios for DM
 Cross-sell and up-sell
 Campaign management
 Customer acquisition
 Budget and forecasting
 Customer retention
 New fields: manufacturing, retail and entertainment

Main DM tasks
 Classification
 Estimation
 Prediction
 Association
 Clustering

Data Mining is being used in several business areas

Data Mining is performed in SQL Server 2005 / 2008
DATA MINING

SQL Server Business Intelligence Development Studio and DMX code are the natural environment

Data Mining is also accessible through Excel 2007
DATA MINING

The Excel add-in acts as a client to an instance of Analysis Services

Both Excel and SQL Server Analysis Services support the full DM cycle:
Data understanding, Data preparation, Modeling, Validation, Deployment

Excel sends DM queries and data directly to SQL Server Analysis Services

Data Mining is an iterative process
DATA MINING

A mining model is part of a larger process that includes everything from the problem to deployment in the working environment

This process can be defined by using the following six basic steps:
Defining the problem
Preparing Data
Exploring Data
Building Models
Exploring and Validating Models
Deploying and Updating

Although the process is illustrated as circular, creating a data mining model is a dynamic and iterative process

There are 9 Data Mining algorithms available in Excel
DATA MINING

 Decision/Regression Trees
 Clustering
 Naïve Bayes
 Association rules
 Sequence clustering
 Time series
 Neural Networks
 Logistic regression
 Linear regression
 Plug-in algorithms
• Third-party or self-programmed algorithms that implement a set of COM interfaces

The 9 built-in algorithms can be tuned to obtain new ones

Decision and Regression trees find natural splits
DATA MINING

 Decision trees classify and find associations
 Regression trees build segmented regressions
 Example:
• Identify potential buyers

Decision trees give decision rules that are suitable to business understanding

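
One common way to find a "natural split" is to pick the threshold that minimizes the weighted Gini impurity of the two resulting groups. The sketch below illustrates that idea on hypothetical income data; it is an assumption for illustration and does not reproduce the scoring method used by the Microsoft Decision Trees algorithm.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, impurity) minimizing the weighted Gini impurity."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < threshold]
        right = [l for v, l in zip(values, labels) if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: income and whether the customer bought (1) or not (0).
income = [10000, 20000, 30000, 40000, 70000, 80000, 90000, 160000]
bought = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(income, bought)
print(f"rule: income >= {threshold} -> potential buyer (impurity {impurity:.2f})")
```

The result reads as a decision rule a business user can follow, which is the main appeal of trees.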
Clustering finds homogeneous groups
DATA MINING

 Example: Find segments of similar clients

[Figure: scatter plot of clients by age and income with three clusters: middle age, 2 cars, no children; "older" age, many cars and children; young people, no children]

Clustering can find hidden classes and identify outliers

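
The segmentation idea can be sketched with plain k-means. This is a minimal illustration under stated assumptions, not the add-in's implementation: the client data (age, income in thousands) and the three starting centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    labels = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            nearest = dists.index(min(dists))
            clusters[nearest].append(p)
            labels.append(nearest)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return labels, centroids


# Hypothetical clients: (age, income in thousands).
clients = [(25, 30), (27, 35), (30, 32),
           (45, 80), (48, 85), (50, 78),
           (65, 50), (68, 55), (70, 52)]
labels, centers = kmeans(clients, centroids=[(25, 30), (45, 80), (65, 50)])
print(labels)
print(centers)
```

In practice the attributes should be put on comparable scales first, otherwise the one with the largest numbers dominates the distance.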
Naïve Bayes provides probabilities of group membership
DATA MINING

 Example: marketing campaign

Naïve Bayes is an efficient method to assess the probability of classification

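
The probability of group membership follows from Bayes' rule with the "naïve" assumption that attributes are independent given the class. A minimal sketch on hypothetical campaign data (marital status, home owner, responded) is shown below; it uses raw counts without smoothing, which is a simplification.

```python
def naive_bayes_posterior(rows, query):
    """P(class | query), assuming attributes are independent given the class."""
    classes = sorted({r[-1] for r in rows})
    scores = {}
    for c in classes:
        members = [r for r in rows if r[-1] == c]
        score = len(members) / len(rows)            # prior P(class)
        for i, value in enumerate(query):
            matches = sum(1 for r in members if r[i] == value)
            score *= matches / len(members)         # likelihood P(value | class)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical campaign history: (marital status, home owner, responded).
history = [
    ("Married", "Yes", 1), ("Married", "Yes", 1),
    ("Married", "No", 1), ("Single", "Yes", 1),
    ("Single", "No", 0), ("Single", "No", 0),
    ("Married", "No", 0), ("Single", "Yes", 0),
]
posterior = naive_bayes_posterior(history, ("Married", "Yes"))
print(posterior)
```

For a married home owner the response probability comes out at about 0.9, so this profile would be a natural campaign target.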
Association rules unveil hidden logic
DATA MINING

 Example: Shopping Basket

Association rules visualize the logical rules that underlie your business

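
Shopping-basket rules are usually ranked by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The baskets below are hypothetical and the sketch only evaluates one rule; it does not search for rules as a full algorithm would.

```python
def support(baskets, items):
    """Share of baskets that contain all the given items."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) from basket counts."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(f"bread -> milk: support {rule_support:.2f}, confidence {rule_confidence:.2f}")
```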
Sequence clustering finds event patterns in time
DATA MINING

 Example: Web navigation

Sequence clustering identifies clusters of similarly ordered events in a sequence

Time series forecasts processes in time
DATA MINING

 ARTXP, a Microsoft proprietary algorithm
 ARIMA available in SQL Server 2008

[Figure: time series chart with historical and predicted values]

 Example: forecast seasonal sales to keep suitable stock

The past patterns that it discovers can be used to predict values for future time steps.

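
The idea of predicting future steps from past patterns can be illustrated with the simplest autoregressive model, y[t] = a + b * y[t-1], fitted by least squares. This is neither ARTXP nor ARIMA and it ignores seasonality; the series is hypothetical and constructed to follow the model exactly.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = a + b * y[t-1]."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b


def forecast(series, steps):
    """Iterate the fitted model forward from the last observed value."""
    a, b = fit_ar1(series)
    last, out = series[-1], []
    for _ in range(steps):
        last = a + b * last
        out.append(last)
    return out


# Hypothetical history generated by y[t] = 2 + 0.5 * y[t-1].
history = [10.0, 7.0, 5.5, 4.75, 4.375, 4.1875]
a, b = fit_ar1(history)
predicted = forecast(history, steps=3)
print(a, b, predicted)
```

Each forecast feeds the next one, so errors accumulate with the horizon; long forecasts from such models deserve caution.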
Neural networks discover predictive patterns by learning
DATA MINING

 Example: fraud detection

Neural networks learn in an uncontrolled manner

Logistic regression predicts binary responses
DATA MINING

 Microsoft Logistic Regression is implemented as a trivial neural network
 Example: Probability of credit default based on personal information

Logistic regression gives probabilities of "YES/NO" given some attributes

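
Seen as a "trivial neural network", logistic regression is a single sigmoid neuron with no hidden layer. The sketch below trains such a neuron by batch gradient descent; the debt-to-income data, learning rate and epoch count are hypothetical choices for illustration, not the Microsoft implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=1.0, epochs=2000):
    """One sigmoid neuron (no hidden layer) trained by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Hypothetical applicants: debt-to-income ratio and default flag (1 = default).
ratio = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
default = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(ratio, default)
p_low = sigmoid(w * 0.2 + b)    # probability of default for a low ratio
p_high = sigmoid(w * 0.8 + b)   # probability of default for a high ratio
print(p_low, p_high)
```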
Linear regression is of course also available
DATA MINING

 It is however extended by Regression Trees (Linear Regression is implemented as a particular case)
 Example: extrapolate the influence of oil price on house prices

The classical linear regression is also integrated in the add-in

Chosen examples vs. real life problems
DATA MINING

 So far, we have seen chosen examples:
• Shopping basket
• Web navigation
• Market segmentation
• …

 Unfortunately, it is not that easy; Data Mining is a creative and unclear process. Sometimes there is no answer with data mining.

 Books don’t show examples of when not to use the algorithms
• Time series: long forecasts
• Classification trees: credit scoring
• Sequence clustering: non-Markovian processes
• …

Bottom line: understand the statistical models behind the icons

Data Mining: using the Data Mining add-in to forecast Credit Default
DATA MINING

[Figure: applicant attributes (number of payments, income, age, civil status) feed a score giving the probability of default, which maps to ranked classes A to D]

After training the algorithm, probabilities of default can be predicted for new applicants

Data Mining: using the Data Mining add-in to forecast Debt Recovery
DATA MINING

 Hybrid between time series and regression trees

[Figure: % recovered plotted against period t, training phase]

 Problem: need the k-nearest neighbours algorithm
• Can be implemented as a plug-in algorithm

Some problems require creative approaches

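
The k-nearest neighbours idea mentioned above is simple to state: predict the recovery of a new case as the average recovery of the k most similar past cases. The sketch below is a plain Python illustration with hypothetical cases and attributes; it is not a plug-in algorithm for Analysis Services.

```python
def knn_predict(cases, query, k=3):
    """Average the target of the k cases closest to the query (Euclidean distance)."""
    by_distance = sorted(
        cases,
        key=lambda case: sum((a - b) ** 2 for a, b in zip(case[0], query)),
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical debt cases: (period, debtor age) -> % recovered.
cases = [
    ((1, 25), 10.0),
    ((2, 30), 20.0),
    ((3, 35), 30.0),
    ((10, 60), 70.0),
    ((11, 62), 80.0),
    ((12, 65), 90.0),
]
early = knn_predict(cases, (2, 31), k=3)    # neighbours: the three early cases
late = knn_predict(cases, (11, 63), k=3)    # neighbours: the three late cases
print(early, late)
```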
Data Mining: using the Data Mining add-in to forecast Debt Recovery
DATA MINING

[Figure: % recovered plotted against period t, predicting phase]

 To predict, each case is made one period older: Period = Period + 1, Age = Age + …

VBA for Excel is the main tool for automated solutions
VBA

 Communication with other software (COM Server)
 Build algorithms not available in Excel
 Automation of processes ("Macro programming")
 Easy and quick interaction with the solution through ActiveX Buttons and Userforms
 Possibility to embed analytical solutions in a simple front end for end users without the right competence

VBA allows in-house built solutions

VBA: Building a statistical tool for analyzing and forecasting Debt Collection
VBA

 With VBA it is possible to deliver customized solutions to end users
 Problems: implementing statistical algorithms is a lot of work, and Solver can get slow

VBA is the tool to use to provide end-users with an interactive workstation

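There are no limits with statistical programming languages
R

[Diagram: complexity layers around Excel: Analysis ToolPak, Solver, Third-party Add-ins, Data Mining Add-in, VBA and Statistical Programming Languages (COM Server)]

[Timeline of the Microsoft component technologies behind the COM server: DDE; 1991: OLE; 1996: ActiveX; DCOM; 1999: COM+; .Net; WCF]
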
R is becoming the standard in the scientific community
R

 R is a statistical programming language with syntax similar to S-PLUS
• R is free (under the GNU General Public License)
• R uses statistical libraries created by statisticians all over the world

 R communicates with Excel through a COM server
• "COM" is a set of interfaces that covers OLE, ActiveX, DCOM, ...

 R Excel add-in
• Background mode
• Small Ribbon toolbar
• Fast
• Code embedded in:
− Worksheet functions
− VBA
− Cells

R allows analysts to implement the most advanced mathematical models

Histograms are a dangerous tool to approximate empirical PDFs
R

With standard Excel:

With the R add-in, advanced semiparametric methods are available
R

With the R add-in for Excel:

Empirical probability distribution functions are easily approximated with the R add-in

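
Why histograms can mislead, and what a smoother alternative looks like, can be shown in a few lines. The sketch below assumes kernel density estimation as the alternative, since the slides do not name the method behind the charts, and it is written in Python rather than R; the sample and bandwidth are hypothetical.

```python
import math


def histogram(data, start, width, bins):
    """Counts per bin for bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for x in data:
        i = int((x - start) // width)
        if 0 <= i < bins:
            counts[i] += 1
    return counts


def gaussian_kde(data, x, bandwidth):
    """Kernel density estimate at x using a Gaussian kernel."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm


# Hypothetical sample with two clumps, around 2 and around 3.
data = [1.9, 2.0, 2.1, 2.9, 3.0, 3.1]

hist_a = histogram(data, start=1.5, width=1.0, bins=2)
hist_b = histogram(data, start=1.0, width=1.0, bins=3)
print(hist_a, hist_b)   # same data, same bin width, different picture

grid = [i * 0.01 for i in range(-500, 1000)]
area = sum(gaussian_kde(data, x, bandwidth=0.3) for x in grid) * 0.01
print(area)             # the estimated density integrates to about 1
```

Shifting the bin origin alone changes the shape of the histogram, while the kernel estimate has no bin edges to choose; its bandwidth still has to be picked with care.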
Statistical programming languages: problem case
R

 Forecasting Oslo Børs Hovedindeks

Some problems demand advanced statistical approaches

Statistical programming languages: problem case
R

 Using Monte Carlo simulation to predict default in Specialized Lending

[Figure: two simulated series plotted against Index: NIBOR (axis label r) and Oil price (axis label olje)]

Complex multivariate Monte Carlo models are developed fast in R

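
A multivariate Monte Carlo model of this kind draws correlated scenarios for the risk drivers and counts how often the borrower cannot pay. The sketch below is in Python rather than R, and every number in it (correlation, volatilities, income and debt figures, default rule) is a hypothetical assumption made only to show the mechanics.

```python
import math
import random


def simulate_default_probability(n_paths=20000, rho=-0.3, seed=42):
    """Share of simulated scenarios in which a hypothetical project defaults."""
    rng = random.Random(seed)
    defaults = 0
    for _ in range(n_paths):
        z1 = rng.gauss(0.0, 1.0)
        # Second shock built to have correlation rho with the first.
        z2 = rho * z1 + math.sqrt(1 - rho**2) * rng.gauss(0.0, 1.0)
        rate = 0.05 + 0.01 * z1              # interest rate scenario
        oil = 100.0 * math.exp(0.3 * z2)     # oil price scenario (lognormal)
        income = 0.8 * oil                   # project income driven by oil
        debt_service = 1000.0 * rate         # cost driven by the interest rate
        if income < debt_service:
            defaults += 1
    return defaults / n_paths


probability = simulate_default_probability()
print(f"estimated probability of default: {probability:.3f}")
```

Fixing the seed makes the run reproducible; the estimate still carries sampling error that shrinks as the number of paths grows.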
Industrial solutions are another alternative
Industry vendor

[Diagram: the complexity layers around Excel (Analysis ToolPak, Solver, Third-party Add-ins, Data Mining Add-in, VBA, Statistical Programming Languages via COM Server), extended with Industrialized Vendors]

Industrial solutions should be chosen only if the area requires it
Industry vendor

When to consider Industrial Solutions:

 Big companies (important deployment)
 Special industrial subject area
 Data warehouse integration
 Highly competent staff within Analytics
 Expensive investment: study whether it is worthwhile
 Whatever the vendor, check Excel compatibility (reporting, platform migrations, …)

The previously presented alternatives for Excel can do their job at end-user level

Further references

 Capgemini:
• www.no.capgemini.com (alberto.guillen@capgemini.com)

 Microsoft Data Mining:
• http://www.sqlserverdatamining.com
• http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx

 R:
• http://www.r-project.org
• http://sunsite.univie.ac.at/rcom/
