
Data Mining in Excel Using XLMiner
Nitin R. Patel
Cytel Software and M.I.T. Sloan

Contact Info
XLMiner is distributed by Resampling Stats, Inc.
www.xlminer.net
Contact Peter Bruce: pbruce@resample.com
703-522-2713

What is XLMiner?
XLMiner is an affordable, easy-to-use tool for business analysts, consultants, and business students to:
- learn the strengths and weaknesses of data mining methods,
- prototype large-scale data mining applications,
- implement medium-scale data mining applications.
More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally intensive techniques.

Available Data Mining Software
- Application-specific: aimed at providing solutions to end users for common tasks (e.g. Unica for customer relationship management, Urban Science for location and distribution)
- Technique-specific: focused on a few data mining methods (e.g. CART from Salford Systems, neural nets from HNC Software)

TECHNIQUE-SPECIFIC PRODUCTS
[Table: products (CART (Salford), See5, NeuroShell, WizWhy, Cognos) crossed with algorithms (Classification & Regression Trees, Linear Regression, Multilayer Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, Kohonen Net); each product supports only one or a few of the algorithms. Source: Elder Research.]

Available Data Mining Software
- Horizontal products: designed for data mining analysts (e.g. SAS Enterprise Miner, SPSS Clementine, IBM Intelligent Miner, NCR TeraMiner, S-PLUS Insightful Miner, Darwin/Oracle)
- Powerful, comprehensive, easy to use; but they require substantial learning effort and are expensive

HORIZONTAL PRODUCTS
[Table: products (Enterprise Miner (SAS), Clementine (SPSS), Intelligent Miner (IBM), MineSet (SGI), Darwin (Oracle), PRW (Unica)) crossed with the same algorithm list; each horizontal product covers most of the algorithms. Source: Elder Research.]

Desiderata for Data Mining and Modern Data Analysis Software
Easy to use:
- Data import (e.g. cross-platform, various databases)
- Data handling (e.g. data partitioning, scoring)
- Invoking and experimenting with procedures
Comprehensive range of procedures:
- Statistics (e.g. regression, multivariate procedures)
- Machine learning (e.g. neural nets, classification trees)
- Database (e.g. association rules)

XLMiner is Unique
- Low cost
- Comprehensive set of data mining models and algorithms that includes statistical, machine learning, and database methods
- Based on a prototype used in three years of MBA courses on data mining at the Sloan School, M.I.T.
- Focus on business applications: a book of lecture notes and cases is in preparation (first draft available for examination)

Why Data Mining in Excel?
Leverage the familiarity of MBA students, managers, and business analysts with the interface and functionality of Excel to give them hands-on experience in data mining.

Advantages
- Low learning hurdle
- Promotes understanding of the strengths and weaknesses of different data mining techniques and processes
- Enables interactive analysis of data (important in the early stages of model building)
- Facilitates incorporation of domain knowledge (often key to successful applications) by empowering end users to participate actively in data mining projects
- Enables pre-processing of data and post-processing of results using Excel functions, reporting in Word, and presentations in PowerPoint

Advantages (cont.)
- Supports communication between data miners and end users
- Supports a smooth transition from prototyping to custom solution development (VB and VBA)
- Emphasizes openness:
  - integration with other analytic software for optimization (Solver), simulation (Crystal Ball), and numerical methods;
  - interface modifications (e.g. custom forms and outputs);
  - solution-specific routines (VBA)
Examples:
- Boston Celtics analysis of player statistics
- Clustering for improving forecasts and optimizing price markdowns

Size Limitations
- An Excel worksheet cannot exceed 65,536 rows. If data records are stored as rows in a single worksheet, this is the largest data set that can be accommodated. The number of variables cannot exceed 256 (the number of columns).
- These limits do not apply to deployment of a model to score large databases.
- If Excel is used as a view-port into a database such as Access, MS SQL Server, Oracle, or SAS, these limits do not apply.

Sampling
Practical data mining methodologies such as SEMMA (SAS) and CRISP-DM (SPSS and a European industry consortium) recommend working with a sample (typically 10,000 random cases) in the model and algorithm selection phase. This facilitates interactive development of data mining models.
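For readers working outside XLMiner, here is a minimal sketch of this sampling step in Python with pandas; the file name customers.csv is a hypothetical stand-in for a large source table:

```python
import pandas as pd

# Draw a random sample of up to 10,000 cases for the model- and
# algorithm-selection phase, as SEMMA and CRISP-DM recommend.
full = pd.read_csv("customers.csv")          # hypothetical source file
sample = full.sample(n=min(10_000, len(full)), random_state=12345)
sample.to_csv("customers_sample.csv", index=False)
```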

XLMiner
- Free 30-day trial version: limit is 200 records per partition.
- Education version: limit is 2,000 records per partition, so the maximum size for a data set is 6,000 records.
- Standard version (currently in beta test; available by end of August):
  - Up to 60,000 records, obtained by drawing samples from large databases in accordance with SAS's SEMMA (Sample, Explore, Modify, Model, Assess) methodology. Training data restricted to 10,000 records.
  - Sampling from and scoring to Access databases (later SQL Server, Oracle, SAS).

Data Mining Procedures in XLMiner
- Partitioning data sets (into training, validation, and test data sets)
- Scoring of training, validation, test, and other data
- Prediction (of a continuous variable)
- Classification
- Data reduction and exploration
- Affinity
- Utilities: sampling, graphics, missing data, binning, creation of dummy variables

Prediction
- Multiple Linear Regression with subset selection, residual analysis, and collinearity diagnostics
- K-Nearest Neighbors
- Regression Tree
- Neural Net

Classification
- Logistic Regression with subset selection, residual analysis, and collinearity diagnostics
- Discriminant Analysis
- K-Nearest Neighbors
- Classification Tree
- Naïve Bayes
- Neural Networks

Data Reduction and Exploration
- Principal Components
- K-Means Clustering
- Hierarchical Clustering

Affinity
- Association Rules (Market Basket Analysis)

Partitioning
Aim: To construct training, validation, and test data sets from the Boston Housing data. A minimal sketch follows.
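Here is a minimal sketch of this partitioning step in Python with pandas, assuming a hypothetical housing.csv file with the Boston Housing columns; the 50/30/20 proportions and the seed 81801 mirror the XLMiner partition sheet shown below:

```python
import pandas as pd

# Shuffle the records, then slice off roughly 50% training, 30%
# validation, and 20% test -- the 253/152/101 split XLMiner reports.
housing = pd.read_csv("housing.csv")                     # hypothetical file
shuffled = housing.sample(frac=1, random_state=81801).reset_index(drop=True)

n = len(shuffled)
n_train, n_valid = round(0.5 * n), round(0.3 * n)
train = shuffled.iloc[:n_train]
valid = shuffled.iloc[n_train:n_train + n_valid]
test  = shuffled.iloc[n_train + n_valid:]
```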


Boston Housing Data (first 14 records)

CRIM     ZN  INDUS  CHAS  NOX    RM     AGE   DIS   RAD  TAX  PTRATIO  B    LSTAT  MEDV
0.00632  18  2.31   0     0.538  6.575  65.2  4.09  1    296  15.3     397  4.98   24
0.02731  0   7.07   0     0.469  6.421  78.9  4.97  2    242  17.8     397  9.14   21.6
0.02729  0   7.07   0     0.469  7.185  61.1  4.97  2    242  17.8     393  4.03   34.7
0.03237  0   2.18   0     0.458  6.998  45.8  6.06  3    222  18.7     395  2.94   33.4
0.06905  0   2.18   0     0.458  7.147  54.2  6.06  3    222  18.7     397  5.33   36.2
0.02985  0   2.18   0     0.458  6.43   58.7  6.06  3    222  18.7     394  5.21   28.7
0.08829  13  7.87   0     0.524  6.012  66.6  5.56  5    311  15.2     396  12.43  22.9
0.14455  13  7.87   0     0.524  6.172  96.1  5.95  5    311  15.2     397  19.15  27.1
0.21124  13  7.87   0     0.524  5.631  100   6.08  5    311  15.2     387  29.93  16.5
0.17004  13  7.87   0     0.524  6.004  85.9  6.59  5    311  15.2     387  17.1   18.9
0.22489  13  7.87   0     0.524  6.377  94.3  6.35  5    311  15.2     393  20.45  15
0.11747  13  7.87   0     0.524  6.009  82.9  6.23  5    311  15.2     397  13.27  18.9
0.09378  13  7.87   0     0.524  5.889  39    5.45  5    311  15.2     391  15.71  21.7
0.62976  0   8.14   0     0.538  5.949  61.8  4.71  4    307  21       397  8.26   20.4

XLMiner : Data Partition Sheet
Date: 29-Jul-2003 13:50:09 (Ver: 1.2.0.1)

Data source:          housing!$A$2:$O$507
Selected variables:   CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT
Partitioning method:  Randomly chosen
Random seed:          81801
# training rows:      253
# validation rows:    152
# test rows:          101

The output sheet then lists each partition's records with their original row ids and selected variables: the training partition begins with rows 1, 2, 5, 6, 7, 8, 10, 12, 14, ...; the validation partition with rows 3, 9, 13, ...; and the test partition with rows 4, 11, 17, ...

Prediction: Multiple Linear Regression using subset selection
Aim: To estimate the median residential property value for a census tract. A minimal sketch follows.
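Here is a minimal sketch of fitting this model with scikit-learn (housing.csv is a hypothetical file; the split mirrors the earlier partition):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

housing = pd.read_csv("housing.csv")                     # hypothetical file
predictors = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
              "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Roughly 50/30/20 train/validation/test split, seeded for reproducibility.
train, rest = train_test_split(housing, train_size=0.5, random_state=81801)
valid, test = train_test_split(rest, train_size=0.6, random_state=81801)

model = LinearRegression().fit(train[predictors], train["MEDV"])
for name, part in (("training", train), ("validation", valid), ("test", test)):
    rms = mean_squared_error(part["MEDV"], model.predict(part[predictors])) ** 0.5
    print(f"{name}: RMS error = {rms:.3f}")
```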

The Regression Model

Input variables  Coefficient  Std. Error  p-value  SS
Constant term    32.677       7.444       0.000    128852
CRIM             -0.094       0.049       0.054    3566
ZN               0.055        0.020       0.007    2550
INDUS            0.030        0.091       0.742    1529
CHAS             2.836        1.199       0.019    645
NOX              -15.889      5.463       0.004    143
RM               3.872        0.597       0.000    4697
AGE              0.007        0.019       0.728    0
DIS              -1.405       0.292       0.000    938
RAD              0.358        0.097       0.000    1
TAX              -0.013       0.005       0.019    174
PTRATIO          -0.934       0.208       0.000    620
B                0.014        0.004       0.000    502
LSTAT            -0.582       0.073       0.000    1623

Residual df: 239   Multiple R-squared: 0.738   Std. Dev. estimate: 5.025   Residual SS: 6036

Training Data scoring - Summary Report:   total sum of squared errors 6036, RMS error 4.884, average error 0.000
Validation Data scoring - Summary Report: total sum of squared errors 2848, RMS error 4.329, average error 0.066
Test Data scoring - Summary Report:       total sum of squared errors 2392, RMS error 4.866, average error -1.019
(# records: 253 training, 152 validation, 101 test)

Subset selection (exhaustive enumeration)

Subset size  RSS         Cp        R-Squared  Adj. R-Squared  Prob    Models (constant present in all models)
2            19472.3789  362.7529  0.5441     0.5432          0.0000  Constant, LSTAT
3            15439.3086  185.6474  0.6386     0.6371          0.0000  Constant, RM, LSTAT
4            13727.9863  111.6489  0.6786     0.6767          0.0000  Constant, RM, PTRATIO, LSTAT
5            13228.9072  91.4852   0.6903     0.6878          0.0000  Constant, RM, DIS, PTRATIO, LSTAT
6            12469.3447  59.7537   0.7081     0.7052          0.0000  Constant, NOX, RM, DIS, PTRATIO, LSTAT
7            12141.0723  47.1754   0.7158     0.7123          0.0000  Constant, CHAS, NOX, RM, DIS, PTRATIO, LSTAT
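Exhaustive enumeration simply fits every subset of predictors and keeps the best model of each size. A minimal brute-force sketch, slow but exact (2^13 = 8,192 fits here; housing.csv is hypothetical):

```python
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression

housing = pd.read_csv("housing.csv")                     # hypothetical file
predictors = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
              "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
y = housing["MEDV"]

# For each subset size, keep the predictor subset with the smallest
# residual sum of squares (RSS); the constant is always included.
best = {}
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        X = housing[list(subset)]
        rss = ((y - LinearRegression().fit(X, y).predict(X)) ** 2).sum()
        if k not in best or rss < best[k][0]:
            best[k] = (rss, subset)

for k, (rss, subset) in sorted(best.items()):
    print(f"{k} predictors: RSS = {rss:.0f}  {subset}")
```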

The Regression Model (six-variable subset)

Predictor (Indep. Var.)  Coefficient  Std. Error  p-value  SS
Constant                 42.8367      7.1766      0.0000   126430.6016
NOX                      -21.7852     4.6042      0.0000   3404.4565
RM                       3.7503       0.6177      0.0000   6583.3579
DIS                      -1.4072      0.2535      0.0000   211.6853
PTRATIO                  -1.0086      0.1747      0.0000   1453.9551
LSTAT                    -0.5907      0.0696      0.0000   2060.2676

Residual df: 247   Multiple R-squared: 0.6601   Std. Dev. estimate: 5.3467   Residual SS: 7061.1646

XLMiner : Multiple Linear Regression - Prediction of Validation Data
Data range: Data_Partition1!$C$273:$P$424
RMSErr = 4.9355   %RMSErr = 21.5%   MaxAbsErr = 20.33   AvAbsErr = 3.57   AvMEDV = 22.9645

Predicted  Actual  NOX     RM      DIS     PTRATIO  LSTAT    AbsErr  SqErr
22.0187    21.1    0.4640  5.8560  4.4290  18.6000  13.0000  0.92    0.8440
32.8687    32.4    0.4470  6.7580  4.0776  17.6000  3.5300   0.47    0.2197
25.4623    25      0.4890  6.1820  3.9454  18.6000  9.4700   0.46    0.2137
31.0814    28.5    0.4110  6.8610  5.1167  19.2000  3.3300   2.58    6.6638
22.4236    20.4    0.5470  5.8720  2.4775  17.8000  15.3700  2.02    4.0948
24.5690    20.3    0.5440  5.9720  3.1025  18.4000  9.9700   4.27    18.2245
23.4704    22.9    0.5240  6.0120  5.5605  15.2000  12.4300  0.57    0.3253
14.6983    21.9    0.7180  4.9630  1.7523  20.2000  14.0000  7.20    51.8641
(first eight validation records shown)

Distribution of absolute errors in the validation data set (%AvAbsErr = 15.6%):

AbsErr bin  Frequency
0           0
2           61
4           40
6           25
8           10
10          9
12          2
14          3
16          0
18          0
20          1
22          1

[Histogram of the AbsErr frequencies in the validation data set.]

Prediction: K-Nearest Neighbors
Aim: To estimate the median residential property value for a census tract. A minimal sketch follows.
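Here is a minimal k-NN sketch with scikit-learn, using the five inputs from the XLMiner run below. Inputs are standardized because k-NN is distance-based (XLMiner's "Normalization = TRUE"), and k = 1 as in the output; housing.csv is hypothetical:

```python
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

housing = pd.read_csv("housing.csv")                     # hypothetical file
inputs = ["NOX", "RM", "DIS", "PTRATIO", "LSTAT"]
train, valid = train_test_split(housing, train_size=0.6, random_state=81801)

# Standardize inputs, then predict each case from its single nearest neighbor.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=1))
knn.fit(train[inputs], train["MEDV"])
rms = mean_squared_error(valid["MEDV"], knn.predict(valid[inputs])) ** 0.5
print(f"validation RMS error = {rms:.3f}")
```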

XLMiner : K-Nearest Neighbors Prediction

Data
Source data worksheet:                     Data_Partition1
Training data used for building the model: Data_Partition1!$C$19:$Q$322
Validation data:                           Data_Partition1!$C$323:$Q$524
# cases in the training data set:          304
# cases in the validation data set:        202
Normalization:                             TRUE
# nearest neighbors (k):                   1

Variables
Input variables: NOX, RM, DIS, PTRATIO, LSTAT
Output variable: MEDV

Parameters/Options: # nearest neighbors = 1

Training Data scoring - Summary Report:   total sum of squared errors 0 (with k = 1, each training case is its own nearest neighbor)
Validation Data scoring - Summary Report: total sum of squared errors 3314, RMS error 4.669, average error 0.805
Test Data scoring - Summary Report:       total sum of squared errors 3895, RMS error 6.210, average error -0.450
(# records: 253 training, 152 validation, 101 test)

Timings: overall 3.00 secs

Validation Data prediction details (first eight records)

Row Id.  Predicted  Actual  Residual  #Nearest    CRIM     ZN    INDUS  CHAS  NOX
         Value      Value             Neighbors
3        28.70      34.70   6.00      1           0.02729  0     7.07   0     0.469
9        14.40      16.50   2.10      1           0.21124  12.5  7.87   0     0.524
13       22.90      21.70   -1.20     1           0.09378  12.5  7.87   0     0.524
15       19.60      18.20   -1.40     1           0.63796  0     8.14   0     0.538
16       20.40      19.90   -0.50     1           0.62739  0     8.14   0     0.538
20       20.40      18.20   -2.20     1           0.7258   0     8.14   0     0.538
25       16.60      15.60   -1.00     1           0.75026  0     8.14   0     0.538
29       19.60      18.40   -1.20     1           0.77299  0     8.14   0     0.538

Classification: Classification Tree
Aim: To classify census tracts into high and low residential property value classes. A minimal sketch follows.
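Here is a minimal classification-tree sketch with scikit-learn. The HIGHCLASS cutoff (MEDV > 30) and the ccp_alpha value are assumptions for illustration; the slides do not state the exact threshold or pruning parameter:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

housing = pd.read_csv("housing.csv")                       # hypothetical file
housing["HIGHCLASS"] = (housing["MEDV"] > 30).astype(int)  # assumed cutoff
X = housing.drop(columns=["MEDV", "HIGHCLASS"])
y = housing["HIGHCLASS"]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.6,
                                          random_state=81801)

# Cost-complexity pruning (ccp_alpha) plays the role of XLMiner's
# minimum-error / best-pruned trees: larger alpha, smaller tree.
tree = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0)
tree.fit(X_tr, y_tr)
print(f"validation error = {1 - tree.score(X_va, y_va):.4f}")
```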

Boston Housing Data (with HIGHCLASS label, first 13 records)

CRIM     ZN  INDUS  CHAS  NOX   RM     AGE   DIS   RAD  TAX  PTRATIO  B       LSTAT  MEDV  HIGHCLASS
0.00632  18  2.31   0     0.54  6.575  65.2  4.09  1    296  15.3     396.9   4.98   24    0
0.02731  0   7.07   0     0.47  6.421  78.9  4.97  2    242  17.8     396.9   9.14   21.6  0
0.02729  0   7.07   0     0.47  7.185  61.1  4.97  2    242  17.8     392.83  4.03   34.7  1
0.03237  0   2.18   0     0.46  6.998  45.8  6.06  3    222  18.7     394.63  2.94   33.4  1
0.06905  0   2.18   0     0.46  7.147  54.2  6.06  3    222  18.7     396.9   5.33   36.2  1
0.02985  0   2.18   0     0.46  6.43   58.7  6.06  3    222  18.7     394.12  5.21   28.7  0
0.08829  13  7.87   0     0.52  6.012  66.6  5.56  5    311  15.2     395.6   12.43  22.9  0
0.14455  13  7.87   0     0.52  6.172  96.1  5.95  5    311  15.2     396.9   19.15  27.1  0
0.21124  13  7.87   0     0.52  5.631  100   6.08  5    311  15.2     386.63  29.93  16.5  0
0.17004  13  7.87   0     0.52  6.004  85.9  6.59  5    311  15.2     386.71  17.1   18.9  0
0.22489  13  7.87   0     0.52  6.377  94.3  6.35  5    311  15.2     392.52  20.45  15    0
0.11747  13  7.87   0     0.52  6.009  82.9  6.23  5    311  15.2     396.9   13.27  18.9  0
0.09378  13  7.87   0     0.52  5.889  39    5.45  5    311  15.2     390.5   15.71  21.7  0

Training Log - Growing the Tree

#Nodes  Error
1       13.82
2       3.45
3       2.97
4       0.67
5       0.65
6       0.56
7       0.20
8       0.14
9       0.06
10      0.05
11      0.05
12      0.04
13      0.02
14      0.01
15      0.01

Validation Misclassification Summary

Classification Confusion Matrix
          Predicted 0  Predicted 1
Actual 0  152          6
Actual 1  8            36

Error Report
Class    # Cases  # Errors  % Error
0        158      6         3.80
1        44       8         18.18
Overall  202      14        6.93

XLMiner : Classification Tree - Prune Log

# Decision Nodes  Error
15                0.0792
14                0.0644
13                0.0644
12                0.0644
11                0.0644
10                0.0644  <-- Minimum Error Prune
9                 0.0743
8                 0.0743
7                 0.0743
6                 0.0693
5                 0.0693
4                 0.0693
3                 0.0693  <-- Best Prune
2                 0.0990
1                 0.2079

Std. Err. 0.0172708

Classification Tree : Full Tree
[Tree diagram: the full tree splits first on RM at 6.5505 (228 vs. 76 cases, 73.0% in the larger node), with further splits on DIS (1.35929, 3.43515, 4.13499), CRIM (10.1702), LSTAT (19.45, 1.25934, 4.62499), RM (6.791, 7.635, 7.06449), PTRATIO, ZN (35.000), and TAX (286.000, 378); leaf percentages range from 0.32% to 8.22% of cases.]

Classification Tree : Best Pruned Tree
[Tree diagram: three decision nodes, splitting on RM at 6.5505 (136 vs. 66 cases, 67.3%), RM at 6.791 (class 0 for 50 cases, 7.92%), and PTRATIO at 19.45 (44 vs. 6 cases, 21.7% / 2.97%).]

Classification Tree : Minimum Error Tree
[Tree diagram: ten decision nodes; root split on RM at 6.5505 (136 vs. 66 cases), with further splits on DIS (1.35929, 3.43515), CRIM (10.1702), RM (6.791, 7.635, 7.06449), LSTAT (19.45), PTRATIO (19.45), and TAX (286.000, 378).]

Classification: Neural Network
Aim: To classify census tracts into high and low residential property value classes.

XLMiner : Neural Network Classification

Epochs Information
Number of epochs:    30
Accumulated trials:  9120 (7860 / 1260 by class)

Architecture
Number of hidden layers:        1
# nodes in hidden layer:        25
Step size for gradient descent: 0.1000
Weight change momentum:         0.6000
Weight decay:                   0.0000
Cost function:                  Squared Error
Hidden layer sigmoid:           Standard
Output layer sigmoid:           Standard
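Here is a minimal scikit-learn sketch mirroring these settings (one hidden layer of 25 logistic units, step size 0.1, momentum 0.6, no weight decay, 30 epochs); the HIGHCLASS cutoff is again an assumption for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

housing = pd.read_csv("housing.csv")                       # hypothetical file
housing["HIGHCLASS"] = (housing["MEDV"] > 30).astype(int)  # assumed cutoff
X = housing.drop(columns=["MEDV", "HIGHCLASS"])
y = housing["HIGHCLASS"]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.6,
                                          random_state=81801)

# Hyperparameters chosen to mirror the XLMiner architecture above.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(25,), activation="logistic",
                  solver="sgd", learning_rate_init=0.1, momentum=0.6,
                  alpha=0.0, max_iter=30, random_state=0),
)
net.fit(X_tr, y_tr)
print(f"validation error = {1 - net.score(X_va, y_va):.4f}")
```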

Training Data scoring - Summary Report
Cutoff probability value for success (updatable): 0.5

Classification Confusion Matrix
          Predicted 1  Predicted 0
Actual 1  40           11
Actual 0  4            249

Error Report
Class    # Cases  # Errors  % Error
1        51       11        21.57
0        253      4         1.58
Overall  304      15        4.93

Validation Data scoring - Summary Report
Cutoff probability value for success (updatable): 0.5

Classification Confusion Matrix
          Predicted 1  Predicted 0
Actual 1  26           7
Actual 0  1            168

Error Report
Class    # Cases  # Errors  % Error
1        33       7         21.21
0        169      1         0.59
Overall  202      8         3.96

[Lift chart (validation dataset): cumulative HIGHV when cases are sorted by predicted values vs. cumulative HIGHV using the average, plotted against # cases (0-300).]

[Decile-wise lift chart (validation dataset): decile mean / global mean for deciles 1-10.]
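Both charts are easy to compute by hand. Here is a minimal sketch of the underlying arithmetic (NumPy only; y_true and y_score are placeholder names for the 0/1 HIGHV labels and the model's predicted probabilities):

```python
import numpy as np

def cumulative_lift(y_true, y_score):
    """Cumulative count of actual positives when cases are sorted by
    descending predicted score, plus the straight-line random baseline."""
    order = np.argsort(-np.asarray(y_score))
    hits = np.cumsum(np.asarray(y_true)[order])
    baseline = np.mean(y_true) * np.arange(1, len(y_true) + 1)
    return hits, baseline

def decile_lift(y_true, y_score):
    """Decile mean of actual positives divided by the global mean."""
    order = np.argsort(-np.asarray(y_score))
    deciles = np.array_split(np.asarray(y_true)[order], 10)
    return [d.mean() / np.mean(y_true) for d in deciles]
```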

Data Reduction and Exploration: Hierarchical Clustering
Aim: To cluster electric utilities into similar groups. A minimal sketch follows.
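Here is a minimal sketch with SciPy; utilities.csv is a hypothetical file holding the 22 utilities and variables x1-x8 shown on the next slide. Variables are standardized so that no single scale dominates the distances:

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.stats import zscore

utils = pd.read_csv("utilities.csv", index_col=0)        # hypothetical file

# Standardize, then build the hierarchy; switch method to "complete"
# for the complete-linkage dendrogram shown later.
Z = linkage(zscore(utils.values), method="single")
clusters = fcluster(Z, t=4, criterion="maxclust")        # cut into 4 clusters
print(dict(zip(utils.index, clusters)))

dendrogram(Z, labels=list(utils.index))
plt.show()
```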

Utilities Data

Company   seq#  x1    x2    x3   x4    x5    x6     x7    x8
Arizona   1     1.06  9.2   151  54.4  1.6   9077   0     0.628
Boston    2     0.89  10.3  202  57.9  2.2   5088   25.3  1.555
Central   3     1.43  15.4  113  53    3.4   9212   0     1.058
Common    4     1.02  11.2  168  56    0.3   6423   34.3  0.7
Consolid  5     1.49  8.8   192  51.2  1     3300   15.6  2.044
Florida   6     1.32  13.5  111  60    -2.2  11127  22.5  1.241
Hawaiian  7     1.22  12.2  175  67.6  2.2   7642   0     1.652
Idaho     8     1.1   9.2   245  57    3.3   13082  0     0.309
Kentucky  9     1.34  13    168  60.4  7.2   8406   0     0.862
Madison   10    1.12  12.4  197  53    2.7   6455   39.2  0.623
Nevada    11    0.75  7.5   173  51.5  6.5   17441  0     0.768
NewEngla  12    1.13  10.9  178  62    3.7   6154   0     1.897
Northern  13    1.15  12.7  199  53.7  6.4   7179   50.2  0.527
Oklahoma  14    1.09  12    96   49.8  1.4   9673   0     0.588
Pacific   15    0.96  7.6   164  62.2  -0.1  6468   0.9   1.4
Puget     16    1.16  9.9   252  56    9.2   15991  0     0.62
SanDiego  17    0.76  6.4   136  61.9  9     5714   8.3   1.92
Southern  18    1.05  12.6  150  56.7  2.7   10140  0     1.108
Texas     19    1.16  11.7  104  54    -2.1  13507  0     0.636
Wisconsi  20    1.2   11.8  148  59.9  3.5   7287   41.1  0.702
United    21    1.04  8.6   204  61    3.5   6650   0     2.116
Virginia  22    1.07  9.3   174  54.3  5.9   10093  26.6  1.306

[Dendrogram (data range: T12-5!$M$3:$U$24, method: single linkage); distance axis roughly 0.5-4; leaf order includes 18, 14, 19, 10, 13, 20, 12, 21, 15, 22, 16, 17, 11.]

Predicted Clusters
[Table listing each of the 22 utilities with its x1-x8 values and the cluster id assigned by single-linkage clustering.]

[Dendrogram (data range: T12-5!$M$3:$U$24, method: complete linkage); distance axis up to 7; leaf order includes 18, 14, 19, 22, 20, 10, 13, 12, 21, 15, 17, 16, 11.]

Predicted Clusters
[Table listing each of the 22 utilities with its x1-x8 values and the cluster id assigned by complete-linkage clustering.]

Predicted Clusters (sorted)
[Table of the 22 utilities sorted by cluster id. Cluster 1: Arizona, Central, Florida, Kentucky, Oklahoma, Southern, Texas. Cluster 2: Boston, Common, Consolid, Madison, Northern, Wisconsi, Virginia. Cluster 3: Hawaiian, NewEngla, Pacific, SanDiego, United. Cluster 4: Idaho, Nevada, Puget.]

Cluster means
Cluster  x1    x2    x3   x4    x5   x6     x7    x8
1        1.21  12.5  128  55.5  1.7  10163  3.2   0.874
2        1.13  10.9  183  55.1  3.1  6546   33.2  1.065
3        1.02  9.1   171  62.9  3.7  6526   1.8   1.797
4        1.00  8.9   223  54.8  6.3  15505  0.0   0.566

Affinity: Association Rules (Market Basket Analysis)
Aim: To identify types of books that are likely to be bought by customers, based on past purchases of books.

Purchase data for 2,000 customers (binary matrix; first 20 rows shown)

ChildBks  YouthBks  CookBks  DoItYBks  RefBks  ArtBks  GeogBks  ItalCook  ItalAtlas  ItalArt  Florence
0         1         0        1         0       0       1        0         0          0        0
1         0         0        0         0       0       0        0         0          0        0
0         0         0        0         0       0       0        0         0          0        0
1         1         1        0         1       0       1        0         0          0        0
0         0         1        0         0       0       1        0         0          0        0
1         0         0        0         0       1       0        0         0          0        1
0         1         0        0         0       0       0        0         0          0        0
0         1         0        0         1       0       0        0         0          0        0
1         0         0        1         0       0       0        0         0          0        0
1         1         1        0         0       0       1        0         0          0        0
0         0         0        0         0       0       0        0         0          0        0
0         0         1        0         0       0       1        0         0          0        0
1         0         0        0         0       1       0        0         0          0        1
1         1         0        1         1       1       0        0         1          1        0
1         1         1        0         0       0       0        0         0          0        0
1         1         1        0         0       0       1        0         0          0        0
0         0         1        0         0       0       0        0         0          0        0
0         0         1        0         0       0       0        0         0          0        0
1         1         1        1         1       1       1        0         0          0        0
1         1         1        0         0       1       0        0         0          0        1

XLMiner : Association Rules

Input data:    Sheet1!$A$1:$K$2001
Data format:   Binary Matrix
Min. support:  200
Min. conf. %:  70
# rules:       19

Rule #  Conf. %  Antecedent (a)         Consequent (c)  Support(a)  Support(c)  Support(a U c)  Lift Ratio
1       100      ItalCook =>            CookBks         227         862         227             2.32
2       82.19    DoItYBks, ArtBks =>    CookBks         247         862         203             1.91
3       81.89    DoItYBks, GeogBks =>   CookBks         265         862         217             1.90
4       80.33    CookBks, RefBks =>     ChildBks        305         846         245             1.90
5       80       ArtBks, GeogBks =>     ChildBks        255         846         204             1.89
6       81.18    ArtBks, GeogBks =>     CookBks         255         862         207             1.88
7       79.63    YouthBks, CookBks =>   ChildBks        324         846         258             1.88
8       80.86    ChildBks, RefBks =>    CookBks         303         862         245             1.88
9       78.87    DoItYBks, GeogBks =>   ChildBks        265         846         209             1.86
10      79.35    ChildBks, DoItYBks =>  CookBks         368         862         292             1.84
11      77.87    CookBks, DoItYBks =>   ChildBks        375         846         292             1.84
12      77.66    CookBks, GeogBks =>    ChildBks        385         846         299             1.84
13      78.18    ChildBks, YouthBks =>  CookBks         330         862         258             1.81
14      77.85    ChildBks, ArtBks =>    CookBks         325         862         253             1.81
15      75.75    CookBks, ArtBks =>     ChildBks        334         846         253             1.79
16      76.67    ChildBks, GeogBks =>   CookBks         390         862         299             1.78
17      70.65    GeogBks =>             ChildBks        552         846         390             1.67
18      70.63    RefBks =>              ChildBks        429         846         303             1.67
19      71.1     RefBks =>              CookBks         429         862         305             1.65
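With a binary matrix this small, support, confidence, and lift can be computed directly. Here is a minimal brute-force sketch (books.csv is a hypothetical file holding the 0/1 matrix; a real application would use an Apriori-style algorithm to avoid enumerating all itemsets):

```python
from itertools import combinations

import pandas as pd

baskets = pd.read_csv("books.csv")                       # hypothetical file
n = len(baskets)
MIN_SUPPORT, MIN_CONF = 200, 0.70                        # XLMiner settings above

def support(items):
    """Number of customers who bought every item in the set."""
    return int(baskets[list(items)].all(axis=1).sum())

items = list(baskets.columns)
for size in (1, 2):                                      # antecedents of 1-2 items
    for a in combinations(items, size):
        sup_a = support(a)
        if sup_a < MIN_SUPPORT:
            continue
        for c in items:
            if c in a:
                continue
            sup_ac = support(a + (c,))
            conf = sup_ac / sup_a
            if sup_ac >= MIN_SUPPORT and conf >= MIN_CONF:
                lift = conf / (support((c,)) / n)
                print(f"{', '.join(a)} => {c}: "
                      f"conf = {conf:.2%}, lift = {lift:.2f}")
```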

Some Utilities
- Sampling from worksheets and databases
- Database scoring
- Graphics
- Binning

Simple Random Sampling
[Screenshot of the simple random sampling dialog.]

Stratified Random Sampling
[Screenshot of the stratified random sampling dialog.]

Scoring to databases and worksheets
[Screenshot of the database scoring dialog.]

Binning continuous variables
[Screenshot of the binning dialog.] A minimal sketch of equivalent binning in Python follows.
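Here is a minimal sketch of binning in pandas (housing.csv is hypothetical); two common choices, equal-count bins and equal-interval bins, correspond to qcut and cut:

```python
import pandas as pd

housing = pd.read_csv("housing.csv")                     # hypothetical file

# Equal-count bins (quartiles) and equal-interval bins for TAX;
# labels=False yields integer bin codes, shifted to start at 1.
housing["Binned_TAX"] = pd.qcut(housing["TAX"], q=4,
                                labels=False, duplicates="drop") + 1
housing["TAX_equal_width"] = pd.cut(housing["TAX"], bins=4, labels=False) + 1
```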

Missing Data
[Screenshot of the missing data handling dialog.]

Graphics: Boston Housing data
[Histogram and box plot of AGE; frequencies run to about 160 and AGE spans 0-100.]

[Histogram and box plot of RM; frequencies run to about 250 and RM spans roughly 3.6-9.6.]

[Matrix plot of RM, AGE, and TAX, with the annotation: do high-tax towns have fewer rooms on average?]

[Box plot of RM grouped by Binned_TAX.]

Future Extensions
- Cross-validation
- Bootstrap, bagging, and boosting
- Error-based clustering
- Time series and sequences
- Support vector machines
- Collaborative filtering

In Conclusion
XLMiner is a modern tool-belt for data mining: an affordable, easy-to-use tool for consultants, MBAs, and business analysts to learn, create, and deploy data mining methods.
More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally intensive techniques.
