Data
Pristine
Pristine www.edupristine.com
Agenda
1. Introduction
2. Data
3. Predictive modeling using Linear Regression
2. Data
I. Population vs. Sample
II. Types of Data Variables
III. Summarizing data
IV. Describe measure of central tendency/measure of location of data set
V. Describe spread/variability of data set
VI. Symmetry and Skewness for the distribution of a data set
VII. Data Collection
VIII. Data Dictionary
IX. Outlier Treatment
X. Missing Value Imputation
[Figure: a Population with Sample1, Sample2 and Sample3 drawn from it]
Variable                          Value Stored                                               Variable Type   Remarks
Number of Credit Cards            1, 2, 3
Age of Customer (Last Birthday)   18, 19, 20
Gender of Customer                Male / Female
Marital Status of Customer        Married / Divorced / Never Married
Annual Salary                     Amount
Monthly Credit Card Usage         Low(<25%) / Medium(<50%) / High(<75%) / Very High(>75%)
Numerical variables
  Discrete
    Arises from counting
    Can take only a set of particular values, which may include negative and fractional values
    Examples: credit score, number of credit cards owned by a person, number of states in a country, charge on an electron, etc.
  Continuous
    Arises from measuring
    Can take any value within a specified range
    Examples: height, amount of money, age, etc.
Categorical variables
  Binary (or Dichotomous)
    Has only two categories
    Examples: yes/no, male/female, pass/fail, etc.
  Nominal
    Has several unordered categories
    Examples: type of bank account, type of insurance policy, etc.
  Ordinal
    Has several ordered categories
    Examples: questionnaire responses such as "strongly in favour / ... / strongly against"
[Figure: Classification of variables. Numerical: Continuous, Discrete. Categorical: Dichotomous or Binary, Nominal, Ordinal.]
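Variable types in the customer data:
Identifier. Value stored: unique identifier. Variable type: --. Remarks: identifier.
Number of Credit Cards. Value stored: 1, 2, 3. Variable type: Numerical (Discrete). Remarks: arises from counting; takes certain discrete values in a given range.
Age of Customer (Last Birthday). Value stored: 18, 19, 20. Variable type: Numerical (Discrete). Remarks: arises from counting; takes certain discrete values in a given range.
Gender of Customer. Value stored: Male / Female. Remarks: only two categories.
Annual Salary. Value stored: Amount.
Monthly Credit Card Usage. Variable type: Categorical (Ordinal).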
[Figure: frequency distribution of # Customers by # Cards (1 to 10), shown as a table and as a bar chart]
4. Press ctrl+alt+enter
Grouped frequency distribution: a simple process of counting the values that fall within each interval (group).
Example and illustration 2.2: Number of customers falling under different Salary groups
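The grouped counting described above can be sketched in Python. This is only an illustration: the salary values are hypothetical, and the equal band width of 25,000 and the helper name `salary_band` are assumptions, not taken from the deck's data.

```python
from collections import Counter

def salary_band(salary, width=25_000):
    """Label of the band that contains a whole-number salary.

    Bands are 0-25000, 25001-50000, 50001-75000, ... (upper limit inclusive).
    The width is an assumed value for illustration.
    """
    k = max(-(-salary // width), 1)          # integer ceiling; 0 goes to band 1
    lower = 0 if k == 1 else (k - 1) * width + 1
    upper = k * width
    return f"{lower}-{upper}"

# Hypothetical annual salaries (not the deck's sample)
salaries = [60_000, 110_000, 125_000, 125_001, 180_000, 101_500]

# Count how many salaries fall in each band
freq = Counter(salary_band(s) for s in salaries)

# Print the bands in ascending order of their lower limit
for band, count in sorted(freq.items(), key=lambda kv: int(kv[0].split("-")[0])):
    print(band, count)
```

Choosing the band edges is the main design decision here: a value exactly on a boundary (125000 above) must belong to one band only.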
Graphical representation - Bar Chart
[Figure: Freq Distribution - Salary Band vs. # Customers]
1. Press ctrl+alt+enter
5. Observe the difference between the horizontal axes of the two charts
[Figure: # Customers by Salary Band, horizontal axis labelled from 0-75000 to 950001-975000]
Example and illustration 2.3: Cumulative number of cards in the sample of 3000 individuals
Tabular representation

Number of Credit Cards (up to)   Cumulative # Customers
 1                                150
 2                                450
 3                                900
 4                               1560
 5                               2100
 6                               2400
 7                               2640
 8                               2790
 9                               2910
10                               3000

Graphical representation
[Figure: Cumulative # Customers against # Cards (0 to 10)]
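The cumulative counts in Illustration 2.3 are running totals of the per-card frequencies. A minimal Python sketch, using the # Customers figures from the deck's frequency table for the 3000 individuals (Python is used here in place of the spreadsheet steps):

```python
from itertools import accumulate

cards = list(range(1, 11))
# Customers holding exactly 1, 2, ..., 10 cards (from the frequency table)
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# Running total: customers holding up to k cards
cumulative = list(accumulate(customers))

for k, total in zip(cards, cumulative):
    print(f"up to {k:>2} cards: {total}")
```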
Sl #   Age   Age (Sorted)
 1     23    21
 2     33    23
 3     23    24
 4     33    27
 5     34    30
 6     21    31
 7     54    33
 8     52    34
 9     34    35
10     36    36
11     52    39
12     51    40
13     48    43
14     35    48
15     40    49
16     43    51
17     49    52
18     54    53
19     27    54
20     39    57

Stem   Leaf
20     1 3 4 7
30     0 1 3 4 5 6 9
40     0 3 8 9
50     1 2 3 4 7
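A stem-and-leaf display like the one above can be built by splitting each value into its tens (stem) and units (leaf). A short sketch using the sorted ages listed above:

```python
from collections import defaultdict

ages_sorted = [21, 23, 24, 27, 30, 31, 33, 34, 35, 36,
               39, 40, 43, 48, 49, 51, 52, 53, 54, 57]

# Group the unit digits (leaves) under their tens value (stem)
stems = defaultdict(list)
for age in sorted(ages_sorted):
    tens, units = divmod(age, 10)
    stems[tens * 10].append(units)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

Every observation contributes exactly one leaf, so the number of leaves equals the number of data points (20 here).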
Score
50 20 50 50 50 30 30 40 30 40 30 20 50 40 20 30 40 40 50 50
Score (Sorted)
20 20 20 30 30 30 30 30 40 40 40 40 40 50 50 50 50 50 50 50
Now Romanov, being unaware of the term "central tendency", has again approached you and asked for your help in calculating the central tendency of the data in question. Help Romanov carry out his task.
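One way to help Romanov, assuming the usual three measures of central tendency (mean, median and mode) are what is wanted, is sketched below with Python's standard `statistics` module and the 20 scores listed above.

```python
from statistics import mean, median, mode

scores = [50, 20, 50, 50, 50, 30, 30, 40, 30, 40,
          30, 20, 50, 40, 20, 30, 40, 40, 50, 50]

print("mean  :", mean(scores))    # arithmetic average of the scores
print("median:", median(scores))  # middle value of the sorted scores
print("mode  :", mode(scores))    # most frequent score
```

The three measures differ for this data set, which is a reminder that "central tendency" is not a single number until a measure is chosen.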
For a frequency distribution with values x1, x2, ..., xn and corresponding frequency values f1, f2, ..., fn, the mean is defined as
$$\langle x \rangle = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n}$$
Illustration 2.4: Calculating mean for sample of 3000 individuals having credit cards.
1. Using Excel function for granular data
2. Using Excel function for frequency distribution table
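The same two calculations can be sketched in Python instead of Excel. The frequencies are the # Customers values from the deck's table for the 3000 individuals; the granular list is rebuilt from that table for illustration.

```python
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# 2. Frequency distribution table: weighted average of the values
total = sum(customers)
weighted_mean = sum(f * x for f, x in zip(customers, cards)) / total

# 1. Granular data: one entry per individual, then a plain average
granular = [x for x, f in zip(cards, customers) for _ in range(f)]
granular_mean = sum(granular) / len(granular)

print(total, weighted_mean, granular_mean)
```

Both routes give the same mean, because the frequency table is only a compressed form of the granular data.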
Median # Cards: 4
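The median can be read off a frequency table without expanding it, by walking the running total until the middle position is reached. A minimal sketch; the helper name `median_from_frequencies` is hypothetical, and the data are the deck's card counts for 3000 individuals.

```python
def median_from_frequencies(values, freqs):
    """Median of a frequency table; values must be in ascending order."""
    n = sum(freqs)
    # 1-based positions of the middle observation(s); equal when n is odd
    lo_pos, hi_pos = (n + 1) // 2, n // 2 + 1

    def value_at(pos):
        running = 0
        for v, f in zip(values, freqs):
            running += f
            if pos <= running:
                return v
        raise ValueError("position beyond the data")

    return (value_at(lo_pos) + value_at(hi_pos)) / 2

cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]
print(median_from_frequencies(cards, customers))
```

With 3000 observations the middle positions are 1500 and 1501, and both fall among the customers holding 4 cards.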
Tabular representation

Number of Credit Cards   # Customers
 1                       150
 2                       300
 3                       450
 4                       660
 5                       540
 6                       300
 7                       240
 8                       150
 9                       120
10                        90
For a data set having values x1, x2, ..., xn (or xi where i = 1, 2, ..., n) and mean $\langle x \rangle$, variance is calculated as follows.
For granular data: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2}{n}$
For a summarized frequency table: $\sigma^2 = \frac{\sum_{i} f_i (x_i - \langle x \rangle)^2}{n}$, where n is the total number of observations (the sum of the frequencies $f_i$)
Standard deviation is the positive square root of variance, denoted by $\sigma$.
For a sample, variance is calculated as $s^2 = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2}{n - 1}$
Dividing by (n - 1) makes the sample variance an unbiased estimator of the population variance. We will look into the details in a later part of the course.
Illustration 2.7: Calculating variance and standard deviation for sample of 3000 individuals having credit cards
Exercise: Do the algebra to make sure that the above-mentioned formulae of variance are equivalent.
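A sketch of Illustration 2.7 in Python, applying the frequency-table formula to the deck's card counts for 3000 individuals. The variable names are illustrative; the division by n and by (n - 1) follows the two definitions above.

```python
import math

cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

n = sum(customers)                                   # total observations
mean_cards = sum(f * x for f, x in zip(customers, cards)) / n

# Sum of squared deviations from the mean, weighted by frequency
ss = sum(f * (x - mean_cards) ** 2 for f, x in zip(customers, cards))

pop_var = ss / n            # divide by n
sample_var = ss / (n - 1)   # divide by (n - 1)
pop_sd = math.sqrt(pop_var)

print(pop_var, sample_var, pop_sd)
```

With n = 3000 the two variances differ only slightly; the choice of divisor matters much more for small samples.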
Illustration 2.8: Calculating Range for sample of 3000 individuals having credit cards
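A minimal sketch of this illustration, assuming the usual definition of range as maximum minus minimum (the defining slide is not part of this text). The data are the deck's card counts for the 3000 individuals.

```python
cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# Only values that actually occur (frequency > 0) count towards the range
observed = [x for x, f in zip(cards, customers) if f > 0]
data_range = max(observed) - min(observed)

print(data_range)
```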
[Figure: three histograms showing Symmetrical, Positively Skewed and Negatively Skewed distributions]
He was appreciated for his work. As a next step, his manager asked him to put a data management framework in place.
Data Mapping
Before preparing a data request, it is necessary to become as familiar as possible with the data sources, and their content, that might be available to address the business question to be answered.
For each data file received:
- Compare basic statistics (no. of records, no. of fields, range of values in each field) to expectations and resolve any discrepancies
- Ensure that delimiters, file format and record format meet requirements
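The basic-statistics check above can be sketched with Python's standard `csv` module. This is an illustration under assumptions: the helper name `profile_delimited`, the comma delimiter, the column names and the tiny in-memory sample are all hypothetical.

```python
import csv
import io

def profile_delimited(text, delimiter=","):
    """Return (record count, field count, {field: (min, max)}) for numeric fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    ranges = {}
    n_records = 0
    for row in reader:
        n_records += 1
        for field, raw in row.items():
            try:
                value = float(raw)
            except (TypeError, ValueError):
                continue  # non-numeric, missing or surplus field: no range tracked
            lo, hi = ranges.get(field, (value, value))
            ranges[field] = (min(lo, value), max(hi, value))
    return n_records, len(reader.fieldnames or []), ranges

# Hypothetical file content standing in for a received data file
sample = "customer_id,num_cards,age\n1,2,34\n2,5,51\n3,1,23\n"
n_records, n_fields, ranges = profile_delimited(sample)

# Compare against what the data request led us to expect
expected = {"records": 3, "fields": 3}
print(n_records == expected["records"], n_fields == expected["fields"], ranges)
```

Any mismatch between the profile and the expectations is a discrepancy to resolve with the data provider before analysis starts.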
Data Mapping has basically three major components:
- Interview clients
- Obtain & study data layouts
- Obtain & evaluate data samples
Note: The results of each step may require us to repeat one or more previous steps.

Business question: How to match the most profitable credit product with each new customer?
Solution: Using Credit & Payment History and Financial Statement data to predict account performance for different products.
Data Request: A representative sample of customers from each product with usage and payment data for a sufficient no. of months, along with their credit and financial history prior to acquisition.
- Give detailed randomization and/or stratification instructions
  Note: In case of Account x Transaction level data, random sampling of records is not the same as random sampling of accounts.
- Ensure that the data dictionary matches the file exactly
- Enter file into data inventory, recording basic descriptive information (file name, date received, file size, record length, programmer source)
- ... department; even if it makes more work at our end to link files, convert media, reformat etc.
- Prepare the driver file if requesting data that needs to match another source; test the driver file to ensure that it can be linked back to the data source you already have
- While merging files:
  - Watch out for identical merge-key field names with different meanings in two files
  - Beware of the consequences of merging two datasets with a few identically named non-key fields
  - Specify a distinct output file for sorting
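The note on Account x Transaction level data can be illustrated with a short Python sketch. The transaction list, the account names and the fixed seed are hypothetical; the point is only the difference between the two sampling units.

```python
import random

# Hypothetical Account x Transaction level data: (account_id, amount)
transactions = [
    ("A", 10), ("A", 20), ("A", 30), ("A", 40),
    ("B", 15),
    ("C", 5), ("C", 25),
    ("D", 50), ("D", 60), ("D", 70),
]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Random sample of RECORDS: busy accounts are over-represented and
# a sampled account may come with only part of its history.
record_sample = rng.sample(transactions, k=5)

# Random sample of ACCOUNTS: pick accounts first, then keep every
# transaction of each chosen account.
accounts = sorted({acct for acct, _ in transactions})
chosen = set(rng.sample(accounts, k=2))
account_sample = [t for t in transactions if t[0] in chosen]

print(record_sample)
print(account_sample)
```

For account-level modeling the second approach is usually what the data request should spell out, since it keeps each sampled account's history complete.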
Things To Include In The Data Dictionary:

Meaning of all Potential Predictors:
- Maintain labels of as many variables as possible
- If possible, one should also try to capture the business sense of these variables
- Wherever things are not clear, this should be noted down so that it can be clarified with the client later on

Clear Definition of Unique Identifier and its Meaning:
- Ascertain the level at which data is to be rolled up / down. For instance:
  - Individual level
  - Individual x Account level
  - Individual x Month level
  - Individual x Account x Month level, etc.
- Identify the unique key of every dataset. A few examples:
  - Payment data may be at transaction level
  - Demographic data at individual level
  - Census data at zip code level

Dependent Variable Definition and Meaning:
- This is a very crucial step in a modeling exercise, as a wrong definition can lead to completely wrong conclusions. In the absence of a clear definition at this stage, it may be defined later after some actual data analysis.

Variable Classification:
- If not already given, one should always try to classify the variables, for example:
  - Demographic variables, e.g. age, gender
  - Performance variables, e.g. spend, number of transactions
  - Credit Attributes, e.g. total credit line, FICO score
  - Census level, e.g. population, location attributes such as income levels
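Identifying the unique key of a dataset can be checked mechanically: a set of fields is a key only if no combination of their values repeats. A minimal sketch; the helper `is_unique_key`, the field names and the payment rows are hypothetical.

```python
from collections import Counter

def is_unique_key(rows, key_fields):
    """True if the combination of key_fields identifies every row uniquely."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return all(c == 1 for c in counts.values())

# Hypothetical payment data at Individual x Account x Month level
payments = [
    {"individual": 1, "account": "X", "month": "2024-01", "amount": 100},
    {"individual": 1, "account": "X", "month": "2024-02", "amount": 120},
    {"individual": 1, "account": "Y", "month": "2024-01", "amount": 80},
    {"individual": 2, "account": "Z", "month": "2024-01", "amount": 60},
]

for key in (["individual"], ["individual", "account"],
            ["individual", "account", "month"]):
    print(key, is_unique_key(payments, key))
```

The smallest set of fields that passes the check tells you the level of the data, which is what the data dictionary should record.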
K. Multiple Imputation
Outlier
Why do we care about outliers?

Outliers are BAD
The presence of outliers can lead to inflated error rates and substantial distortions of results, which in turn can lead to wrong conclusions and inferences.

Outliers are GOOD
Outliers can provide useful information in the data. For example, a spike in the spend behavior of some customers may prove to be the deciding factor in marketing response campaigns.

So care should be taken while dealing with outliers. In short, outliers are important and hence should not be ignored.

Outlier treatment techniques:
- Capping and Flooring Technique
- Exponential Smoothing Technique
- Sigma Approach
- Robust Regression Technique
- Mahalanobis Distance Technique
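Two of the listed techniques, capping and flooring and the sigma approach, can be sketched with Python's standard `statistics` module. The deck does not give thresholds, so the 1st/99th percentile limits, the 3-sigma cut-off, the function names and the spend values below are all assumptions for illustration.

```python
from statistics import mean, pstdev, quantiles

def cap_and_floor(values, lower_pct=1, upper_pct=99):
    """Pull values below/above the chosen percentiles back to those limits."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    floor, cap = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, floor), cap) for v in values]

def sigma_outliers(values, k=3.0):
    """Values lying more than k standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > k * sd]

# Hypothetical monthly spend with one extreme value
spend = list(range(1, 101)) + [10_000]

capped = cap_and_floor(spend)
flagged = sigma_outliers(spend)
print(min(capped), max(capped), flagged)
```

Capping keeps every record but limits its influence, while the sigma approach only flags records; what to do with a flagged record is a separate business decision, in line with the caution above.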
Thank you!