
Business Analytics

Data

Pristine

Pristine www.edupristine.com

Agenda
Introduction
Data
Predictive modeling using Linear Regression


2. Data
I. Population vs. Sample
II. Types of Data Variables
III. Summarizing data
IV. Describe measure of central tendency/measure of location of data set
V. Describe spread/variability of data set
VI. Symmetry and Skewness for the distribution of a data set
VII. Data Collection
VIII. Data Dictionary
IX. Outlier Treatment
X. Missing Value Imputation


2.a. Population vs. Sample


Population
Not to be confused with the literal meaning of "population", which is the number of people living in a defined geographical region. The "population" in statistics includes all members of a defined group that we are studying or collecting information on for data-driven decisions.
Examples: Current inflation rates of EU countries. All the votes cast in an electoral poll.

Sample
A part of the "population". It can be biased or unbiased (an unbiased sample is also known as a random sample).
Examples: Current inflation rates of EU countries having per capita income of less than 20000 Euros per annum. A portion of votes collected to predict the election outcome through an "Exit Poll".

[Figure: Population, with Sample1, Sample2 and Sample3]

2.b. Case: Types of Data variables


Romanov, an Analytics consultant, works with Credit One bank. His manager gave him a list of the names of the bank's customers. He has been asked to pull the information pertaining to this customer list from the bank's database. The information will be around the credit cards issued by the bank. He needs to define the variable types and the type of value each variable will contain. Romanov, who has just started his professional career, doesn't have a good idea about the different variable types. Now, suppose that after extracting the data he approached you and asked for your help in categorizing the different variables. Help Romanov in variable categorization.


2.b. Case: Types of Data variables


Information to be extracted by Romanov. For each variable, the Value Stored, Variable Type and Remarks are to be filled in:
Name of Customer
Customer ID
Number of Credit Cards
Age of Customer Last Birthday
Gender of Customer
Marital Status of Customer
Annual Salary
Monthly Credit Card Usage

2.b. Case: Types of Data variables (Data snapshot)


Sl # | Name of Customer | Customer ID | Number of Credit Cards | Age of Customer Last Birthday | Gender of the Customer | Marital Status of the Customer | Annual Salary (in USD) | Monthly Credit Card Usage
1 | Josh | 111669 | 5 | 42 | F | Never Married | 88,001 | Low
2 | Janice | 146861 | 6 | 25 | F | Married | 592,489 | Low
3 | Dandre | 171690 | 3 | 50 | M | Divorced | 272,304 | Low
4 | Aiden | 161721 | 6 | 37 | M | Married | 726,593 | Low
5 | Celine | 170359 | 7 | 50 | F | Never Married | 612,075 | Low
6 | Emilio | 175646 | 5 | 41 | M | Never Married | 490,356 | Low
7 | Joaquin | 180732 | 2 | 62 | F | Divorced | 164,732 | Low
8 | Justus | 113136 | 7 | 26 | F | Never Married | 510,321 | Low
9 | Chaya | 169254 | 4 | 24 | M | Never Married | 358,534 | Low
10 | Justyn | 149771 | 4 | 35 | M | Married | 140,400 | Low
11 | Jadon | 166226 | 7 | 36 | M | Never Married | 105,259 | Low

2.b. Case: Types of Data variables


Variable | Value Stored
Name of Customer | Name of the individual customer
Customer ID | Unique identifier
Number of Credit Cards | 1, 2, 3
Age of Customer Last Birthday | 18, 19, 20
Gender of Customer | Male / Female
Marital Status of Customer | Married / Divorced / Never Married
Annual Salary | Amount
Monthly Credit Card Usage | Low(<25%) / Medium(<50%) / High(<75%) / Very High(>75%)

(Variable Type and Remarks: yet to be filled in)

2.b. Types of Data Variables


Data consists of a combination of "variables", which actually contain the values.
Variables at a high level are of two types, depending on the kind of values they store:
Numerical
Categorical

Numerical variables
Discrete
  Arises from counting.
  Can take only a set of particular values, which may include negative and fractional values.
  Examples: Credit score, number of credit cards owned by a person, number of states in a country, charge on an electron etc.
Continuous
  Arises from measuring.
  Can take any value within a specified range.
  Examples: Height, amount of money, age etc.

Categorical variables
Binary (or Dichotomous)
  Has only two categories.
  Examples: yes/no, male/female, pass/fail etc.
Nominal
  Has several unordered categories.
  Examples: Type of bank account, type of insurance policy etc.
Ordinal
  Has several ordered categories.
  Examples: questionnaire responses such as "strongly in favour / ... / strongly against".

2.b. Types of Data Variables - Summary


Data (Consists of Variables)
  Numerical
    Continuous: Arises from measuring
    Discrete: Arises from counting
  Categorical
    Dichotomous or Binary: Only two categories
    Nominal: Several unordered categories
    Ordinal: Several ordered categories

2.b. Case: Types of Data variables (Revisited)


Variable | Value Stored | Variable Type | Remarks
Name of Customer | Name of the individual customer | -- | Identifier
Customer ID | Unique identifier | -- | Identifier
Number of Credit Cards | 1, 2, 3 | Numerical (Discrete) | Arises from counting. Takes certain discrete values in a given range
Age of Customer Last Birthday | 18, 19, 20 | Numerical (Discrete) | Arises from counting. Takes certain discrete values in a given range
Gender of Customer | Male / Female | Categorical (Binary) | Only two categories
Marital Status of Customer | Married / Divorced / Never Married | Categorical (Nominal) | Several unordered categories
Annual Salary | Amount | Numerical (Continuous) | Takes many values in a given range
Monthly Credit Card Usage | Low(<25%) / Medium(<50%) / High(<75%) / Very High(>75%) | Categorical (Ordinal) | Several ordered categories
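The ordinal "Monthly Credit Card Usage" variable can be thought of as a numeric usage percentage cut into ordered bands. The sketch below is only an illustration: the names `usage_band` and `band_rank` are hypothetical, and the treatment of values exactly on a boundary (25, 50, 75) is an assumption, since the table only gives <25%, <50%, <75% and >75%.

```python
# Ordered categories, lowest to highest; the ordering is what makes the variable ordinal.
USAGE_BANDS = ["Low", "Medium", "High", "Very High"]


def usage_band(usage_pct: float) -> str:
    """Map a monthly usage percentage to the ordinal band from the table.

    Assumption: a value exactly on a boundary falls into the higher band;
    the source does not say which side a boundary belongs to.
    """
    if usage_pct < 25:
        return "Low"
    if usage_pct < 50:
        return "Medium"
    if usage_pct < 75:
        return "High"
    return "Very High"


def band_rank(band: str) -> int:
    """Position of a band in the ordering, so that bands can be compared."""
    return USAGE_BANDS.index(band)


print(usage_band(10), usage_band(60), usage_band(90))
print(band_rank("Low") < band_rank("High"))
```

A nominal variable such as marital status has no such ranking, so a comparison like the last line would not be meaningful for it.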

2.c. Case: Summarizing Data


Romanov, an Analytics consultant, works with Credit One bank. His manager gave him some credit card data: the number of credit cards issued to a set of customers and the credit limit of the cards. He has been tasked to summarize the data in a presentable form and prepare a report. Romanov, who has just started his professional career, has never worked with this kind of data, so he is clueless about the different summarizing techniques. Now, suppose he approached you and asked for your help in preparing the report. Help Romanov in summarizing the data and preparing the report.


2.c. Comments: Summarizing Data


There are various ways to summarize data. Some of them are:
1. Frequency distribution
2. Grouped frequency distribution
3. Cumulative frequency distribution
4. Stem leaf diagram
5. Line plots

2.c. Summarizing Data - Frequency distribution


A technique to summarize discrete data.
A simple process which involves counting of distinct discrete values.
The representation can be either tabular or graphical.

Example: Number of credit cards owned in a sample of 3000 individuals

Tabular representation
Number of Credit Cards | # Customers
1 | 150
2 | 300
3 | 450
4 | 660
5 | 540
6 | 300
7 | 240
8 | 150
9 | 120
10 | 90

Graphical representation - Bar Chart
[Figure: bar chart "Freq Distribution - # Cards vs. # Customers", with # Cards (1 to 10) on the horizontal axis and # Customers (0 to 700) on the vertical axis]
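Outside Excel, the same counting can be sketched in a few lines of Python. This is a minimal illustration, not the method used on the slides; it uses the "Number of Credit Cards" column of the 11 customers in the earlier data snapshot rather than the full sample of 3000.

```python
from collections import Counter

# Number of credit cards for the 11 customers in the data snapshot.
cards = [5, 6, 3, 6, 7, 5, 2, 7, 4, 4, 7]

# Frequency distribution: how many customers hold each distinct number of cards.
frequency = Counter(cards)

# Tabular representation, sorted by the number of cards.
print("# Cards  # Customers")
for n_cards in sorted(frequency):
    print(f"{n_cards:>7}  {frequency[n_cards]:>11}")

# Rough text version of the bar chart: one '#' per customer.
for n_cards in sorted(frequency):
    print(f"{n_cards:>2} | " + "#" * frequency[n_cards])
```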

2.c. Summarizing Data - Frequency distribution (Using MS Excel)


[Screenshots: MS Excel steps 1 to 3, with the column "Number of Credit Cards"]
4. Press Ctrl+Shift+Enter
[Figure: bar chart of # Customers (0 to 700) against number of cards (1 to 10)]

2.c. Summarizing Data - Grouped Frequency distribution


A technique to summarize continuous data, or discrete data having a large number of observations and an extended range.
A simple process which involves counting the values falling under the different intervals (groups).
Example and illustration 2.2: Number of customers falling under different salary groups
Graphical representation - Bar Chart
[Figure: bar chart "Freq Distribution - Salary Band vs. # Customers", with Salary Band on the horizontal axis and # Customers (0 to 120) on the vertical axis]
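A grouped frequency distribution can be sketched the same way: assign each value to an interval, then count per interval. The code below is an illustration under assumptions. The function name `salary_band` is hypothetical, the band layout (a first band "0-75000" followed by bands 25,000 wide, such as "100001-125000") is inferred from the axis labels of the chart, and the data are the 11 annual salaries from the earlier snapshot.

```python
from collections import Counter

BAND_WIDTH = 25_000  # assumption: equal-width bands above the first band


def salary_band(salary: int) -> str:
    """Return the label of the salary band that contains `salary`."""
    if salary <= 75_000:
        return "0-75000"
    # Bands look like 75001-100000, 100001-125000, ...
    lower = ((salary - 1) // BAND_WIDTH) * BAND_WIDTH + 1
    return f"{lower}-{lower + BAND_WIDTH - 1}"


# Annual salaries (in USD) of the 11 customers in the data snapshot.
salaries = [88_001, 592_489, 272_304, 726_593, 612_075, 490_356,
            164_732, 510_321, 358_534, 140_400, 105_259]

grouped = Counter(salary_band(s) for s in salaries)

# Print the bands in increasing order of their lower limit.
for band in sorted(grouped, key=lambda label: int(label.split("-")[0])):
    print(band, grouped[band])
```

With only 11 observations most bands hold a single customer; on a larger sample the counts per band give the bar heights of the chart.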

2.c. Summarizing Data - Grouped Frequency distribution (Using MS Excel)

[Screenshots: MS Excel steps 1 and 2]
Press Ctrl+Shift+Enter
4. From Edit, select the salary bands as the horizontal axis
5. Observe the difference between the horizontal axes of the two charts
[Figure: bar chart of # Customers (0 to 120) by salary band; axis labels shown: 0-75000, 100001-125000, 150001-175000, ..., 950001-975000]

2.c. Summarizing Data - Cumulative Frequency distribution


Cumulative frequencies are obtained by accumulating the frequencies to give the total number of observations up to and including the value or group in question.

Example and illustration 2.3: Cumulative number of cards in the sample of 3000 individuals

Tabular representation
Number of Credit Cards Up to | Cumulative # Customers
1 | 150
2 | 450
3 | 900
4 | 1560
5 | 2100
6 | 2400
7 | 2640
8 | 2790
9 | 2910
10 | 3000

Graphical representation
[Figure: chart of Cumulative # Customers (0 to 3000) against # Cards (0 to 10)]
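The accumulation step can be sketched as a running total over the frequency table. This is a minimal illustration using the frequencies of the 3000-individual sample; the variable names are only for this example.

```python
from itertools import accumulate

cards = list(range(1, 11))  # number of credit cards: 1 to 10
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# Running total of the frequencies: customers holding up to k cards.
cumulative = list(accumulate(customers))

for n_cards, total in zip(cards, cumulative):
    print(f"up to {n_cards:>2} cards: {total:>4} customers")

# The last entry equals the total number of observations.
print(cumulative[-1] == sum(customers))
```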

2.c. Summarizing Data - Cumulative Frequency distribution (Using MS Excel)


[Screenshots: MS Excel steps 1 and 2]
3. Observe the last entry. It is equal to the total number of observations.
[Figure: chart of Cumulative # Customers (0 to 3500) against # Cards (0 to 12)]

2.c. Summarizing Data - Stem-leaf diagram

Stem-leaf diagram
Not suitable for large data. Hence, not extensively used in industry.
Illustration: Given the age of 20 individuals in years, represent them using a stem-leaf diagram.

Sl # | Age | Age (Sorted)
1 | 23 | 21
2 | 33 | 23
3 | 23 | 24
4 | 33 | 27
5 | 34 | 30
6 | 21 | 31
7 | 54 | 33
8 | 52 | 34
9 | 34 | 35
10 | 36 | 36
11 | 52 | 39
12 | 51 | 40
13 | 48 | 43
14 | 35 | 48
15 | 40 | 49
16 | 43 | 51
17 | 49 | 52
18 | 54 | 53
19 | 27 | 54
20 | 39 | 57

Stem | Leaf
20 | 1 3 4 7
30 | 0 1 3 4 5 6 9
40 | 0 3 8 9
50 | 1 2 3 4 7
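A stem-leaf diagram splits each value into a stem (here the tens) and a leaf (the units). The sketch below is a minimal illustration built from the sorted ages listed above; the variable names are only for this example.

```python
from collections import defaultdict

# Sorted ages of the 20 individuals.
ages = [21, 23, 24, 27, 30, 31, 33, 34, 35, 36,
        39, 40, 43, 48, 49, 51, 52, 53, 54, 57]

# Group the unit digits (leaves) under their tens (stems).
stems = defaultdict(list)
for age in sorted(ages):
    stem = (age // 10) * 10
    stems[stem].append(age % 10)

for stem in sorted(stems):
    print(f"{stem:>3} | " + " ".join(str(leaf) for leaf in stems[stem]))
```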

2.c. Summarizing Data - Line Plots

Line plot diagram
Not suitable for large data. Hence, not extensively used in industry.
Illustration: Given the test scores of 20 students, represent them using a line plot diagram.

Sl # | Score | Score (Sorted)
1 | 50 | 20
2 | 20 | 20
3 | 50 | 20
4 | 50 | 30
5 | 50 | 30
6 | 30 | 30
7 | 30 | 30
8 | 40 | 30
9 | 30 | 40
10 | 40 | 40
11 | 30 | 40
12 | 20 | 40
13 | 50 | 40
14 | 40 | 50
15 | 20 | 50
16 | 30 | 50
17 | 40 | 50
18 | 40 | 50
19 | 50 | 50
20 | 50 | 50
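A line plot stacks one mark above each distinct value for every time it occurs. The text rendering below is only a rough sketch of that idea, using the 20 test scores from the illustration; the layout of the output is an assumption, not a reproduction of the original figure.

```python
from collections import Counter

scores = [50, 20, 50, 50, 50, 30, 30, 40, 30, 40,
          30, 20, 50, 40, 20, 30, 40, 40, 50, 50]

counts = Counter(scores)
values = sorted(counts)
tallest = max(counts.values())

# Build the plot row by row from the top: an X wherever that score
# occurs at least `level` times, then the axis of score values.
rows = []
for level in range(tallest, 0, -1):
    rows.append("  ".join(" X" if counts[v] >= level else "  " for v in values))
rows.append("  ".join(f"{v:>2}" for v in values))

print("\n".join(rows))
```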

2.c. Case: Measure of Central Tendency/Location


After Romanov presented the summarized data to his manager at Credit One, he was asked to produce the various measures of Central Tendency of the Credit Card data.

Romanov, being unaware of the term "central tendency", has again approached you and asked for your help in calculating the central tendency of the data in question. Help Romanov in carrying out his task.


2.d. Measure of Central Tendency/Location


There are a number of different quantities, which can be used to estimate the central point of a sample.

These are called measures of central tendency, or measures of location.


They are just different ways of calculating the "average" value of a dataset. These are:
Mean
Median
Mode

2.d. Measure of Central Tendency/Location - Mean


By far the most common measure for describing the location of a set of data is the mean. For a set of observations denoted by $x_1, x_2, \dots, x_n$, the mean $\bar{x}$ (also written <x>) is defined by
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

For a frequency distribution with values $x_1, x_2, \dots, x_n$ and corresponding frequencies $f_1, f_2, \dots, f_n$, it is defined as
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n}$$

Illustration 2.4: Calculating mean for sample of 3000 individuals having credit cards.
1. Using Excel function for granular data
2. Using Excel function for frequency distribution table
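Both forms of the mean can be checked against each other on the credit card sample. This is a minimal sketch in Python rather than Excel; the granular list is rebuilt from the frequency table, which is an assumption made only so that the two formulas can be compared.

```python
import statistics

cards = list(range(1, 11))  # values x_i: number of credit cards
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]  # frequencies f_i

# Mean from the frequency table: sum(f_i * x_i) / sum(f_i).
n = sum(customers)
mean_from_table = sum(f * x for f, x in zip(customers, cards)) / n

# Mean from granular data: one entry per individual, rebuilt from the table.
granular = [x for x, f in zip(cards, customers) for _ in range(f)]
mean_granular = statistics.fmean(granular)

print(n, mean_from_table, mean_granular)
```

Both calculations give a mean of 4.7 cards per individual.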

2.d. Measure of Central Tendency/Location - Median


Another useful measure of location. The median is a value which splits the data set into two equal halves, so that half the observations are less than the median and half are greater than the median.
If n is odd, the median is the middle observation, i.e. the (n + 1)/2-th observation of the sorted data.
If n is even, the median is the midpoint of the middle two observations, i.e. of the n/2-th and (n/2 + 1)-th observations.
One of the potential advantages of the median for certain data sets is that it is robust, or resistant, to the effects of extreme observations.
Illustration 2.5: Calculating median for sample of 3000 individuals having credit cards, along with a demonstration of extreme observations.

2.d. Measure of Central Tendency/Location - Median


1. Using Excel function for granular data
2. For summarized data in form of frequency table

Median # Cards = 4
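The median of the sample, and its resistance to extreme observations, can be sketched as follows. The granular list is rebuilt from the frequency table, and the small lists `typical` and `with_extreme` are made-up values used only to show the effect of one extreme observation.

```python
import statistics

cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# One entry per individual, rebuilt from the frequency table.
granular = [x for x, f in zip(cards, customers) for _ in range(f)]
median_cards = statistics.median(granular)
print(median_cards)

# Hypothetical data: replacing one value by an extreme observation
# moves the mean a lot but leaves the median unchanged.
typical = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 500]
print(statistics.mean(typical), statistics.median(typical))
print(statistics.mean(with_extreme), statistics.median(with_extreme))
```

Here n = 3000 is even, and both middle observations (the 1500th and 1501st) are 4, so the median is 4.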

2.d. Measure of Central Tendency/Location - Mode


A third measure of location is the mode, defined as the value which occurs with the greatest frequency, i.e. the most typical value.
Illustration 2.6: Finding the mode for sample of 3000 individuals having credit cards.
Excel has an inbuilt function, Mode, for granular data.
For summarized data it can be found easily by visual inspection.

Tabular representation
Number of Credit Cards | # Customers
1 | 150
2 | 300
3 | 450
4 | 660
5 | 540
6 | 300
7 | 240
8 | 150
9 | 120
10 | 90

Mode = 4, i.e. the highest number of individuals have 4 cards
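The "visual inspection" of a frequency table amounts to picking the value with the largest frequency. A minimal sketch, with the snapshot of 11 customers added as an example of granular data:

```python
import statistics

# Frequency table of the 3000-individual sample: cards -> # customers.
customers_by_cards = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                      6: 300, 7: 240, 8: 150, 9: 120, 10: 90}

# Summarized data: the mode is the value with the greatest frequency.
mode_from_table = max(customers_by_cards, key=customers_by_cards.get)

# Granular data: number of cards of the 11 customers in the data snapshot.
snapshot_cards = [5, 6, 3, 6, 7, 5, 2, 7, 4, 4, 7]
mode_snapshot = statistics.mode(snapshot_cards)

print(mode_from_table, mode_snapshot)
```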

2.d. Case: Measure of Spread


After Romanov presented the summarized data along with "measures of Central tendency" to his manager at Credit One, he was further asked to add the various measures of spread to the report. Now, Romanov being unaware of the term "measures of spread" again approached you and asked for your help. Help Romanov in carrying out his task.


2.d. Measure of Spread


The central tendency of a data set is usually the main feature of interest.
Another feature of interest is the spread (or variability, dispersion or scatter), meaning how widely spread the data are about the mean (or other measure of location).
The different measures of spread are:
Variance and Standard Deviation
The Range
The Inter quartile range

2.d. Measure of Spread - Variance and Standard Deviation


The most commonly used measure of spread is the standard deviation. Essentially it is a measure of how far, on average, the observations are from the mean.

For a data set having values $x_1, x_2, \dots, x_n$ (or $x_i$ where $i = 1, 2, \dots, n$) and mean <x> (written $\bar{x}$ in the formulas), the variance is calculated as
For granular data:
$$\sigma^2 = \frac{\sum_i (x_i - \bar{x})^2}{n}$$
For summarized frequency table:
$$\sigma^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{n}, \quad n = \sum_i f_i$$
Standard deviation is the positive square root of the variance, denoted by $\sigma$.
For a sample, the variance is calculated as
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$$
Dividing by (n - 1) makes the sample variance an unbiased estimator of the population variance. We will look into the details of it in a later part of the course.
Illustration 2.7: Calculating variance and standard deviation for sample of 3000 individuals having credit cards
Exercise: Do the algebra to make sure that the above mentioned formulae of variance are equivalent.
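The variance formulas can be checked numerically on the credit card sample. This is a sketch in Python rather than Excel; the granular list is rebuilt from the frequency table only so that the granular and frequency-table forms can be compared.

```python
import math
import statistics

cards = list(range(1, 11))  # values x_i
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]  # frequencies f_i

n = sum(customers)
mean = sum(f * x for f, x in zip(customers, cards)) / n

# Frequency-table form: sum(f_i * (x_i - mean)^2), divided by n or n - 1.
squared_deviations = sum(f * (x - mean) ** 2 for f, x in zip(customers, cards))
population_variance = squared_deviations / n
sample_variance = squared_deviations / (n - 1)
standard_deviation = math.sqrt(population_variance)

# Granular form on the expanded data; it should agree with the table form.
granular = [x for x, f in zip(cards, customers) for _ in range(f)]
granular_variance = statistics.pvariance(granular)

print(population_variance, granular_variance, sample_variance, standard_deviation)
```

With 3000 observations the divisor n - 1 barely changes the result; the difference matters more for small samples.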

2.d. Measure of Spread - Variance and Standard Deviation (Using MS Excel)


1. Using Excel function for granular data

2. For summarized data in form of frequency table


2.d. Measure of Spread - Range


The range is a very simple measure of spread defined, as its name suggests, by the difference between the largest and smallest observations in the data set.

Range = max(xi) - min(xi)

It is a poor measure of the spread of the data, as it relies on the extreme values, which aren't necessarily representative of the data as a whole.

Illustration 2.8: Calculating Range for sample of 3000 individuals having credit cards

2.d. Measure of Spread - Inter quartile Range


Similar to the Range, but not affected by the data extremes.
Just as the median divides a set of data into two halves, the quartiles divide a set of data into four quarters. They are denoted by Q1, Q2 and Q3. Q2 is just the median, while Q1 is called the lower quartile and Q3 the upper quartile.
Q1 can be defined to be the (n + 2)/4-th observation counting from below and Q3 as the same counting from above, with relevant interpolation if needed.
The Inter quartile range is defined as Q3 - Q1.
Illustration 2.9: Calculating Inter quartile Range for sample of 3000 individuals having credit cards
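The range and the inter quartile range of the credit card sample can be sketched as below, following the (n + 2)/4 definition given above. The helper name `observation_at` is hypothetical, and linear interpolation between neighbouring observations is one reasonable reading of "relevant interpolation"; other quartile conventions (including those built into spreadsheet and statistics software) can give slightly different values.

```python
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# One entry per individual, in increasing order.
data = sorted(x for x, f in zip(cards, customers) for _ in range(f))


def observation_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in `ordered`.

    A fractional position is resolved by linear interpolation between
    the two neighbouring observations.
    """
    lower = int(position)
    fraction = position - lower
    first = ordered[lower - 1]
    second = ordered[min(lower, len(ordered) - 1)]
    return first + fraction * (second - first)


n = len(data)
position = (n + 2) / 4                     # 750.5 for n = 3000
q1 = observation_at(data, position)        # counting from below
q3 = observation_at(data[::-1], position)  # counting from above
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q3, iqr, data_range)
```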

2.d. Case: Symmetry and skewness of data


Romanov was appreciated after he presented the summarized data along with the "measures of Central tendency" and "measures of spread" to his manager at Credit One. He was then asked to create an illustration around symmetry and skewness of data and, following that, to carry out the analysis of the credit card data. Romanov, being unaware of the term "symmetry and skewness", has again approached you and asked for your help. In return he promised to gift you a bottle of Champagne. Help Romanov in carrying out his task.


2.d. Symmetry and skewness


It deals with the shape of the distribution of a data set, that is, whether it is symmetric or skewed to one side or the other.

The approximate shape of a distribution can be determined by looking at a histogram.


Illustration 2.9: Calculating mean, median, mode and variance for symmetric and skewed data.

[Figure: three histograms, titled "Symmetrical", "Positively Skewed" and "Negatively Skewed"]

2.d. Symmetry and skewness

Symmetrical: Mean = Median = Mode



Positively Skewed: Mean > Median > Mode

Negatively Skewed: Mean < Median < Mode


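The three relations can be seen on small made-up data sets. The values below are hypothetical and chosen only so that each shape is easy to recognize; on real data the ordering of mean, median and mode is a useful rule of thumb rather than a guarantee.

```python
import statistics

# Hypothetical data sets, one per shape.
datasets = {
    "symmetrical": [1, 2, 2, 3, 3, 3, 4, 4, 5],
    "positively skewed": [1, 2, 2, 2, 3, 3, 4, 5, 9],   # long tail to the right
    "negatively skewed": [1, 5, 6, 7, 7, 8, 8, 8, 9],   # long tail to the left
}

summary = {
    name: (statistics.mean(values), statistics.median(values), statistics.mode(values))
    for name, values in datasets.items()
}

for name, (mean, median, mode) in summary.items():
    print(f"{name}: mean={mean:.2f}, median={median}, mode={mode}")
```

For the credit card sample itself, the mean (4.7) is above the median and mode (both 4), which points to a positive skew.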

2.d. Case: Data Collection and Management Framework


After Romanov presented the summarized data along with
Measure of central tendency
Measure of dispersion and
Skewness
he was appreciated for his work. As the next step, his manager asked him to put a data collection and management framework in place.

Let's help Romanov in putting up the framework.

2.d. Comments: Data Collection and Management Framework


At a high level, from an analyst's perspective, a data collection and management framework will involve the following components:
Data collection mechanism
Maintaining a data dictionary
Missing value imputation
Outlier treatment

2.e. Data Collection - quick background


Identify Data Needs
  Start with the Business Question
  Determine the data needed for delivering the desired outcome
  Illustration:
    Business Question: How to match the most profitable credit product with each new customer?
    Solution: Using Credit & Payment History and Financial Statement data to predict account performance for different products.
    Data Request: A representative sample of customers from each product with usage and payment data for a sufficient no. of months, along with their credit and financial history prior to acquisition.

Data Mapping
  Before preparing a data request, it is necessary to become as familiar as possible with the data sources, and their content, that might be available to address the business question to be answered.
  This Data Mapping has basically three major components:
    Interview clients
    Obtain & study data layouts
    Obtain & evaluate data samples
  Note: The results of each step may require us to repeat one or more previous steps.

Data Request Plan
  Identify & Assess Available Population Coverage
    Data availability constraints; viz. archives time span
    Population sizing by key characteristics like credit history
    Discuss with client unexpected size limitations
  Identify & Assess Alternate Data Sources
    Choose between alternatives
    Identify master data source
    Ensure that link keys work between sources chosen (beware of key length or encryption differences)
  Plan to Optimize Client Resource Use
    Minimize workload for client IT department, even if it makes more work at our end to link files, convert media, reformat etc.

Data Request Prep.
  Be as specific as possible!
    Accurate file names
    Specify selection criteria with respect to actual field names and value formats (e.g. "Values of the field STATE_CD in the subset = (IN,MI)" rather than "Records from Indiana and Michigan")
    Specify required or acceptable file formats
  Give detailed randomization and/or stratification instructions
    Note: In case of Account x Transaction level data, random sampling of records is not the same as random sampling of accounts.
  Prepare the driver file if requesting data that needs to match another source - test the driver file to ensure that it can be linked back to the data source you already have

Quality Check and Merge Step
  Always examine results before acceptance!
  For each data file received,
    Compare basic statistics (no. of records, no. of fields, range of values in each field) to expectations and resolve any discrepancies
    Ensure that delimiters, file format and record format meet requirements
    Ensure that the data dictionary matches the file exactly
    Enter file into data inventory, recording basic descriptive information (file name, date received, file size, record length, programmer source)
  While merging files,
    Watch out for identical merge-key field name with different meanings in two files
    Beware of the consequences of merging two datasets with few identically named non-key fields
    Specify a distinct output file for sorting
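The "compare basic statistics to expectations" step of the quality check can be sketched as a small script. Everything named here is an assumption made for illustration: the helper names `basic_stats` and `discrepancies`, the comma-delimited layout, and the tiny sample file with the hypothetical fields ACCOUNT_ID and BALANCE next to the STATE_CD field mentioned above.

```python
import csv
import io


def basic_stats(csv_text, delimiter=","):
    """Number of records, number of fields and range of values per field."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = list(reader)
    fields = reader.fieldnames or []
    ranges = {}
    for field in fields:
        values = [row[field] for row in rows]
        try:
            numbers = [float(v) for v in values]
            ranges[field] = (min(numbers), max(numbers))
        except ValueError:
            # Non-numeric field: report the distinct values instead.
            ranges[field] = sorted(set(values))
    return {"records": len(rows), "fields": len(fields), "ranges": ranges}


def discrepancies(stats, expected):
    """List the record / field counts that differ from what was requested."""
    return [
        f"{key}: expected {expected[key]}, got {stats[key]}"
        for key in ("records", "fields")
        if stats[key] != expected[key]
    ]


# Hypothetical delivered file.
received = """ACCOUNT_ID,STATE_CD,BALANCE
1001,IN,250.0
1002,MI,90.5
1003,IN,410.0
"""

stats = basic_stats(received)
print(stats)
print(discrepancies(stats, {"records": 4, "fields": 3}))
```

Any discrepancy reported this way would be resolved with the data provider before the file is accepted and merged.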

2.f. Data Dictionary


A comprehensive data dictionary should be maintained and updated as and when any new information is gathered. USE: It can go a long way in helping us understand the data better. For instance, it can help us to revisit old information and see what our initial hypothesis was and how it is changing with the new updated information.

Things To Include In The Data Dictionary:
Meaning of all Potential Predictors:
  Maintain labels of as many variables as possible
  If possible, one should also try to capture the business sense of these variables
  Wherever things are not clear, it should be noted down so that it can be clarified with the client later on
Clear Definition of Unique Identifier and its Meaning:
  Ascertain the level at which data is to be rolled up / down. For instance,
    Individual level
    Individual x Account level
    Individual x Month level
    Individual x Account x Month level, etc.
  Identify unique key of every dataset. Few examples below:
    Payment data may be at transaction level
    Demographic data at individual level
    Census data at zip code level
Dependent Variable Definition and Meaning:
  This is a very crucial step in a modeling exercise, as a wrong definition can lead to completely wrong conclusions. In the absence of a clear definition at this stage, it may be defined later after some actual data analysis.
Variable Classification:
  If not already given, one should always try and classify the variables like
    Demographic variables, e.g. age, gender
    Performance variables, e.g. spend, number of transactions
    Credit Attributes, e.g. total credit line, FICO score
    Census level, e.g. population, location attributes such as income levels

2.g. Missing Value Imputation


There are a variety of techniques for missing value imputation; but these should be considered more as scenario-specific than just being a set of pure alternative choices.

Missing Value Imputation Techniques
A. Impute Missing Values with ZERO
B. Impute Missing Values with MEDIAN
C. Impute Missing Values with MEAN
D. Impute Missing Values with MODE
E. Information based Segmentation
F. Non-Missing Dummy Creation
G. Imputation and Non-Missing Dummy Creation
H. Impute based on Bivariate Graphs
I. Impute using Regression on other Non-Missing Predictors
J. DNI
K. Multiple Imputation
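Techniques A to D, combined with a non-missing dummy as in G, can be sketched as one small function. This is an illustration under assumptions: the function name `impute` is hypothetical, missing values are represented by `None`, the dummy is taken to be 1 where the original value was present and 0 where it was missing, and the sample values are made up.

```python
import statistics


def impute(values, strategy="median"):
    """Fill missing values (None) and return (imputed_values, non_missing_dummy)."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    imputed = [fill if v is None else v for v in values]
    # Dummy keeps the information about which entries were originally present.
    non_missing_dummy = [0 if v is None else 1 for v in values]
    return imputed, non_missing_dummy


# Hypothetical variable with two missing entries.
values = [40, None, 55, 100, None, 55, 50]

for strategy in ("zero", "median", "mean", "mode"):
    imputed, dummy = impute(values, strategy)
    print(strategy, imputed, dummy)
```

Which strategy fits depends on the scenario: the median is less sensitive to extreme values than the mean, and zero is only sensible when a missing value genuinely means "none".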

2.h. Outlier Treatment


An outlier is a single observation "far away" from the rest of the data.
Reasons for outliers:
  Errors
    Data errors
    Sampling error
    Standardization failure
    Faulty distributional assumptions
    Human Error
  Genuine Outliers
[Figure: plots with the outlying points labelled "Outlier"]

Why do we care about outliers?
Outliers are BAD: The presence of outliers can lead to inflated error rates and substantial distortions of results that can lead to wrong conclusions and inferences.
Outliers are GOOD: The outliers can provide useful information in the data, for example, a spike in spend behavior of some customers may prove to be the deciding factor in marketing response campaigns.
So care should be taken while dealing with outliers. In short, outliers are important and hence should not be ignored.

Techniques for outlier detection / treatment:
  Capping and Flooring Technique
  Exponential Smoothing Technique
  Sigma Approach
  Robust Regression Technique
  Mahalanobis Distance Technique
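Two of the listed techniques, the sigma approach and capping and flooring, can be sketched as follows. The sketch rests on assumptions: the function names are hypothetical, the sigma approach is taken to flag values more than k standard deviations from the mean (k = 3 is a common choice, not something the slides specify), the cap and floor are passed in by the analyst (in practice they are often chosen from percentiles of the data), and the spend values are made up.

```python
import statistics


def sigma_outliers(values, k=3.0):
    """Values lying more than k standard deviations away from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


def cap_and_floor(values, lower, upper):
    """Replace values below `lower` by `lower` and values above `upper` by `upper`."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical monthly spend: 20 ordinary values and one spike.
spend = [12, 15, 14, 13, 16] * 4 + [200]

flagged = sigma_outliers(spend)
capped = cap_and_floor(spend, lower=12, upper=16)

print(flagged)
print(max(spend), max(capped))
```

Capping keeps the observation in the data set but limits its influence; whether a flagged value is an error or a genuine outlier still has to be judged case by case.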

Thank you!
