
Data Mining 101

Okiriza Wibisono - @okiriza


Ali Akbar Septiandri - @aliakbars

Outline

Introduction

Terminology
Potential application
Venn diagram

Process overview

Business understanding
Data understanding (exploration)
Data preparation (preprocessing)
Modeling
Evaluation
Deployment (presentation)

Tools & Resource

Introduction Terminology

Data science
Big data analytics
Statistics
Knowledge Discovery in Databases
Data mining

"The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships."
Data Mining - dictionary.reference.com

Introduction Potential Application

Customer segmentation
Recommendation engine
Social media mining


What should we do?
Where to start? Do I have to get a master's degree in statistics?

http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg

Data Science Venn Diagram


http://drewconway.com/zia/2013/3/26/thedata-science-venn-diagram

And now the business process

CRISP DM Methodology
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

Business Understanding
CRISP DM Methodology

Objective Statement

Bottom-up (Data) vs Top-down (Problem)

Situation Assessment
Inventory of Resources
Requirements, Assumptions, and Constraints
Risks and Contingencies
Terminology
Costs and Benefits
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment
Inventory of Resources
Resources:
Hardware
Data, knowledge, tools
Personnel

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment
Requirements, Assumptions, and Constraints
Requirements: Scheduling, Accuracy, Security
Assumptions: Data quality, External factors, Reporting type
Constraints: Legal issues, Budget, Resources

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment
Risks and Contingencies
Business
Organizational

Financial

Contingency Plan
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment Terminology


Write down related terminology

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg

Situation Assessment Costs and Benefits


Money, money, money!

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg


How to evaluate the results?
Define your success criteria!

Data Understanding
CRISP DM Methodology

Data Collection

External vs Internal

Watch out!

visible accessible
storable presentable
Victor Lavrenko Text Technologies

http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf

Data Exploration
Visualization Heuristics

Visualize fast. Visualize reactively.

Go for high information 2D visualizations.

Select data subsets to visualize.

http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

Data Exploration
Visualization Heuristics

Never let anomalies pass you by. Dig deeper.

Use your visualizations to inform potential models. Use your potential model to direct your visualizations.

Expect problems in your data.

http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

"This is the cheapest and most informative stage of data mining."
Nigel Goddard, DME Visualization

http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

Data Exploration
Visualization Tools

Column/bar: large changes
Line, curve: small changes over long periods
Histogram: frequency distribution

https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
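
As a rough illustration of the histogram entry above, the sketch below bins a small list of values and prints a text histogram using only the Python standard library. The sample ages, the bin width, and the helper name `histogram` are made up for this example.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample: ages of ten customers.
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]
age_counts = histogram(ages, bin_width=10)

# One row per bin, one '#' per observation.
for start, count in age_counts.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

A plotting library would normally draw this, but even a text version is enough to see the shape of a frequency distribution quickly.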

Data Preparation
CRISP DM Methodology

Which one should I include (or exclude)?

Data Selection

Data Cleaning
Remember: Expect problems in your data.

Dirty data:
Missing value
Outlier
Duplication
Incomplete
Outdated
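
A minimal sketch of three of the cleaning steps named above (duplication, missing value, outlier), using only the standard library. The records, the median fill, and the 1.5 x IQR outlier rule are assumptions chosen for illustration, not the only reasonable choices.

```python
from statistics import median, quantiles

# Hypothetical raw records: (customer_id, monthly_spend); None marks a missing value.
raw = [
    (1, 120.0), (2, None), (3, 95.0), (3, 95.0), (4, 110.0),
    (5, 105.0), (6, 5000.0), (7, 100.0), (8, 130.0),
]

# 1. Duplication: keep the first occurrence of each identical record.
deduped = list(dict.fromkeys(raw))

# 2. Missing value: fill with the median of the observed values
#    (the median is less sensitive to extreme values than the mean).
observed = [spend for _, spend in deduped if spend is not None]
fill = median(observed)
filled = [(cid, fill if spend is None else spend) for cid, spend in deduped]

# 3. Outlier: flag values outside the 1.5 * IQR fences.
q1, _, q3 = quantiles([spend for _, spend in filled], n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(cid, spend) for cid, spend in filled if not low <= spend <= high]

print("fill value:", fill)
print("flagged outliers:", outliers)
```

Flagging is deliberately separate from deleting: whether an outlier is an error or a real, interesting case is a business question.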

Data Construction

Feature engineering: derived attributes, e.g.:
year from timestamp
quarter from timestamp
BMI from weight and height
log(x) for skewed data (e.g. house price)
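
The derived attributes above can be sketched in a few lines of standard-library Python. The field names (`timestamp`, `weight_kg`, `height_cm`, `house_price`) and the sample record are hypothetical.

```python
import math
from datetime import datetime


def derive_features(record):
    """Build the derived attributes from one raw record (a dict)."""
    ts = datetime.fromisoformat(record["timestamp"])
    height_m = record["height_cm"] / 100
    return {
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,          # months 1-3 -> Q1, 4-6 -> Q2, ...
        "bmi": record["weight_kg"] / height_m ** 2,   # kg / m^2
        "log_price": math.log(record["house_price"]), # compresses a long right tail
    }


# Hypothetical record.
record = {
    "timestamp": "2015-02-14T19:00:00",
    "weight_kg": 70.0,
    "height_cm": 175.0,
    "house_price": 250000.0,
}
features = derive_features(record)
print(features)
```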

Data Splitting
Two kinds of data splitting:
Training-Validation-Testing
Cross Validation

Data Splitting
Training-Validation-Testing

Split randomly to avoid bias

Training: construct classifier
Validation: pick algorithm, knob settings (tree depth, k in kNN, C in SVM)
Testing: estimate future error rate

http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
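
A minimal sketch of a random training-validation-testing split, assuming a 60/20/20 ratio and a fixed seed for reproducibility. The function name `train_val_test_split` and its parameters are made up for this example.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into three disjoint parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # random order avoids ordering bias
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for 100 data points
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

The testing part should be touched only once, at the very end; tuning on it would make the estimated future error rate too optimistic.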

Data Splitting
Cross Validation
Every point is both training and testing, never at the same time
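
The idea can be sketched as a k-fold index generator: each point lands in exactly one test fold and in the training set of every other fold. The helper name `k_fold_indices` is made up for this example, and the data is assumed to be shuffled already.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold i tests on every k-th point starting at i."""
    indices = list(range(n))
    for i in range(k):
        test_idx = indices[i::k]
        held_out = set(test_idx)
        train_idx = [j for j in indices if j not in held_out]
        yield train_idx, test_idx


folds = list(k_fold_indices(n=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Averaging the error over the k test folds gives an estimate that uses every point for testing without ever testing on a point the model was trained on.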

Dimensionality Reduction

Principal Component Analysis vs Linear Discriminant Analysis
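
A small sketch of the PCA side only, restricted to 2D points so that the eigen-decomposition of the covariance matrix has a closed form and no linear algebra library is needed. The sample points and the function name `first_principal_component` are assumptions for illustration. Unlike PCA, Linear Discriminant Analysis also needs class labels, so it is not covered here.

```python
import math
from statistics import mean


def first_principal_component(points):
    """Return (unit direction, explained variance ratio) of the first PC of 2D points."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    n = len(points)
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigen-decomposition of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)   # angle of the major axis
    spread = math.hypot((a - c) / 2, b)
    largest = (a + c) / 2 + spread           # largest eigenvalue
    return (math.cos(theta), math.sin(theta)), largest / (a + c)


# Hypothetical points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained = first_principal_component(points)
print("direction:", direction)
print("explained variance ratio:", explained)
```

When one direction explains almost all the variance, projecting onto it reduces two columns to one with little information lost.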

Modeling
CRISP DM Methodology

Machine Learning

Classification

Regression

Ranking

Clustering

Model Selection

Generalization bound

Regression Techniques

Linear regression
Kernel ridge regression
Support vector regression
Lasso
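
As a hedged sketch of the simplest entry in this list, the code below fits a one-variable linear regression in closed form. The optional `alpha` argument adds a ridge (L2) penalty on the slope, which is plain ridge regression, not the kernel version. The data and the name `fit_line` are made up for this example.

```python
from statistics import mean


def fit_line(xs, ys, alpha=0.0):
    """Least-squares fit of y = slope * x + intercept.

    alpha > 0 adds a ridge penalty on the slope; the intercept is not penalized.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / (s_xx + alpha)
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: house size vs price, generated exactly as price = 2 * size + 5.
sizes = [30.0, 45.0, 60.0, 75.0, 90.0]
prices = [65.0, 95.0, 125.0, 155.0, 185.0]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 100.0 + intercept
ridge_slope, _ = fit_line(sizes, prices, alpha=2250.0)

print("slope:", slope, "intercept:", intercept)
print("prediction for size 100:", predicted)
print("ridge slope:", ridge_slope)
```

The ridge slope is smaller than the plain least-squares slope: the penalty trades a little bias for lower variance.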


Which one should I choose?
Should I use all of them?

It depends on

Model Selection

Assumptions
The predictors are linearly independent
The error is a random variable with a mean of zero conditional on the explanatory variables
The sample is representative of the population for the inference prediction

Interpretability
The understandability of why the model is true or how the model is induced from
https://chenhaot.com/pubs/mldg-interpretability.pdf
Beware of Overfitting!

http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png
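
One way to see overfitting in a few lines, sketched under made-up assumptions: a kNN classifier on noisy synthetic 1D data. With k = 1 the model memorizes the training set (zero training error) but does worse on unseen data. The data generator, the noise level, and all function names are hypothetical.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1D feature, labels 0/1)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def error_rate(train, data, k):
    """Fraction of points in data that kNN (fitted on train) gets wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)


def make_data(n, rng, noise=0.2):
    """True rule: label is 1 when x > 0.5; each label is flipped with probability noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


rng = random.Random(42)
train_set, test_set = make_data(200, rng), make_data(200, rng)

# Odd k values avoid tied votes.
results = {
    k: (error_rate(train_set, train_set, k), error_rate(train_set, test_set, k))
    for k in (1, 15)
}
for k, (train_err, test_err) in results.items():
    print(f"k={k:>2}  training error={train_err:.2f}  testing error={test_err:.2f}")
```

A large gap between training error and testing error is the usual warning sign.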

Model Assessment

Regression: (R)MSE, Mean Absolute Error, Correlation Coefficient
Classification: Accuracy, Precision, Recall, F-score
Descriptive: Std. Error, p-value, Confidence Interval
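
A minimal sketch of several of the regression and classification measures above, written from their standard definitions. The function names and the toy predictions are made up for this example; the F-score shown is F1, the harmonic mean of precision and recall.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def classification_scores(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy regression predictions.
reg_rmse = rmse([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
reg_mae = mae([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])

# Toy classification predictions.
scores = classification_scores([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])

print("RMSE:", reg_rmse, "MAE:", reg_mae)
print(scores)
```

RMSE punishes large errors more than MAE does, which is why the two differ on the same predictions.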

Evaluation
CRISP DM Methodology

Does my model solve the problem?
What is the impact? Is it novel? How useful is the solution?

Deployment
CRISP DM Methodology

The Tasks

Plan deployment

Plan monitoring and maintenance

Produce final report

Review project

Tools & Resource

Text mining: NLTK, spaCy, OpenNLP

Query expansion & clustering: Carrot2, Weka

Data mining & machine learning: Weka, scikit-learn

Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala

Python lib: Pandas, SciPy, NumPy, scikit-learn

Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark

Visualization: D3.js

Community: Big Data & Open Data Indonesia


Thank you!

Data Mining 101, Python-ID Meetup, February 2015


Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars
