1/28/15
Contents:
Way out
How to do them in R?
A typical conversation
Analyst 1: I'm in some trouble. My manager wants me to build a logistic
regression model, but I have only a 2% event rate in my data. Logistic
regression won't be a good choice here: the ML estimate will be biased.
Analyst 2: Not necessarily. It's the total count rather than the percentage of
events that matters. How many cases do you have for the rarer event, and
how big is your dataset?
Analyst 1: We've got about 1800-odd events in a dataset of about 100,000
cases, a less-than-2% scenario.
Analyst 2: Hmm. With this many cases for the rarer event, you can very well
use logistic regression. There are methods to address such skewed, or sparse,
data situations.
Analyst 1: Wow. Really! Please tell me more!!
Analyst 2: There are a couple of alternatives. For one, you can use exact
logistic regression; this is to be used when the sample size is too small for
your usual logistic regression using regular maximum-likelihood-based
estimation. Another option in your scenario is to use the penalized-likelihood
estimation method. This second one has the advantage of being
computationally less demanding than the exact logistic method.
In the current context, this refers to the scenario where, under a binary
outcome space (response/no-response, good/bad, default/no-default,
purchase/no-purchase, etc.), one of the two events is far rarer than the other.
Suppose that in a sample of 1000 applicants for a position only 20 are selected. Here the
event of being selected is the rare event, with a low event rate of 2%.
Suppose that in a sample of 100,000 purchases from an online retailer about 1800 are
returned by the customer. Here the event of goods being returned is the rare event,
with a low event rate of 1.8%.
Chargebacks in credit card transactions
Goods returned in online retailing
Why is this a problem for logistic regression? It's still binary anyway.
The problem here is with the estimation method: the usual maximum-likelihood
method is susceptible to small-sample bias, and this bias is
strongly dependent on the count (as opposed to the percentage) of the rarer of
the events.
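The count-versus-percentage point can be made concrete with the two examples used earlier in this deck. The sketch below is only arithmetic in base R; the data frame and its column names are illustrative and not part of the original slides.

```r
# Two scenarios from the slides: similar event rates, very different event counts
scenarios <- data.frame(
  scenario = c("applicants selected", "purchases returned"),
  cases    = c(1000, 100000),
  events   = c(20, 1800)
)

# Event rate in percent; the rates are close, the counts are not
scenarios$event_rate_pct <- 100 * scenarios$events / scenarios$cases
print(scenarios)
```

Both rows show a rate near 2%, but only the second has an event count large enough for the argument made in the conversation above.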
In case of a small sample and/or very unbalanced binary data (when you
have just 20 cases in a sample of 1000), exact logistic regression is to
be used.
If, however, you have a larger count of the rarer of the two events, say
1000 (even better if it's 2000), in a sample of 100,000 with the same low
event rate (1% to 2%), you can use logistic regression, but the estimation will
have to be done using the penalized likelihood method (also called Firth's
penalized likelihood approach, after its inventor).
While we mentioned this method only in the context of the small-sample/rare-event
scenario, it is a general method for addressing issues of separability, small
sample sizes, and bias of the parameter estimates.
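For reference, the penalized likelihood idea can be written out as follows. This is the standard textbook form of Firth's penalty for logistic regression, not something taken from these slides; the symbols (\beta for the coefficients, X for the design matrix, \pi_i for the fitted probabilities, h_i for the hat-matrix diagonals) are notation assumed here.

```latex
% Penalized log-likelihood: ordinary log-likelihood plus half the log-determinant
% of the Fisher information
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\,\log\det I(\beta),
\qquad I(\beta) = X^{\top} W X,
\qquad W = \operatorname{diag}\bigl(\pi_i (1 - \pi_i)\bigr)

% Resulting modified score equations for coefficient r
U^{*}(\beta_r) = \sum_{i=1}^{n} \Bigl( y_i - \pi_i + h_i \bigl(\tfrac{1}{2} - \pi_i\bigr) \Bigr) x_{ir} = 0,
\qquad h_i = \bigl[\, W^{1/2} X (X^{\top} W X)^{-1} X^{\top} W^{1/2} \,\bigr]_{ii}
```

The penalty term keeps the estimates finite under separation and removes the leading term of the small-sample bias, which is why the same method covers all three issues listed above.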
Tejamoy Ghosh Data Science ATG - New Delhi, India
You can add other options for what you want to have in your
output.
The Exact statement after the Model statement and the Freq
statement are the key differences here.
An alternative Event/Trial Syntax:
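Proc Logistic Data = YourRareEventData;
  /* Events/trials syntax: RareEvent holds event counts, CellCount holds trial counts */
  Model RareEvent / CellCount = X1 X2;
  /* Exact conditional analysis of X1; Both requests parameter and odds-ratio estimates */
  Exact X1 / Estimate = Both;
Run;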
How to do them in R
Exact Logistic in R
Package required: elrm
This package implements (approximate) exact logistic regression.
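A minimal sketch of how a call to elrm might look, mirroring the events/trials layout of the SAS example. The counts below are made up for illustration (they add up to 20 events in 1000 cases, as in the small-sample scenario above), the column names `events` and `trials` are assumptions, and the chain length and burn-in are arbitrary choices. elrm expects one row per covariate pattern and approximates the exact conditional inference by MCMC, so results vary from run to run. The call is skipped if the package is not installed.

```r
# Hypothetical collapsed data: one row per covariate pattern
dat <- data.frame(
  X1     = c(0, 0, 1, 1),
  X2     = c(0, 1, 0, 1),
  events = c(2, 3, 6, 9),          # made-up counts of the rare event
  trials = c(250, 250, 250, 250)   # cases per covariate pattern
)

if (requireNamespace("elrm", quietly = TRUE)) {
  library(elrm)
  # interest = ~ X1 requests inference on X1, conditioning on X2
  exact_fit <- elrm(formula = events/trials ~ X1 + X2,
                    interest = ~ X1,
                    iter = 22000, burnIn = 2000,
                    dataset = dat)
  summary(exact_fit)
}
```

Continuous covariates would need to be grouped before collapsing, since the method conditions on discrete covariate patterns.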
Penalized Logistic in R
Package: logistf
This package runs Firth's bias-reduced logistic regression.
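A minimal sketch of a logistf fit next to an ordinary maximum-likelihood fit. The data are simulated purely for illustration; the variable names follow the SAS example, while the sample size and coefficient values are assumptions. The Firth fit is skipped if logistf is not installed.

```r
# Simulated data with a low event rate (illustration only)
set.seed(1)
n  <- 5000
X1 <- rnorm(n)
X2 <- rbinom(n, 1, 0.5)
p  <- plogis(-4 + 0.8 * X1 + 0.5 * X2)   # large negative intercept makes events rare
dat <- data.frame(RareEvent = rbinom(n, 1, p), X1 = X1, X2 = X2)

# Ordinary maximum-likelihood logistic regression for comparison
ml_fit <- glm(RareEvent ~ X1 + X2, family = binomial, data = dat)

if (requireNamespace("logistf", quietly = TRUE)) {
  # Firth's penalized-likelihood fit with the same formula
  firth_fit <- logistf::logistf(RareEvent ~ X1 + X2, data = dat)
  print(summary(firth_fit))
  # Side-by-side coefficients
  print(cbind(ML = coef(ml_fit), Firth = firth_fit$coefficients))
}
```

With a few thousand cases and a reasonable number of events the two sets of coefficients should be close; the difference grows as the event count shrinks.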
EDUCATION
Econometrics, Statistics, Economics
Vanderbilt, Cincinnati, Indian Statistical Institute, Jawaharlal Nehru University
Research Scholars
Journal Articles
EXPERTISE
Predictive modelling, Segmentation, Market research, Clickstream data analysis, Forecasting, Financial Time Series, Simulation, Bayesian econometrics, Machine Learning Techniques, Decision Trees
SAS, SPSS, R, Octave, Stata, Eviews, Matlab, Maxima, Netlogo
EXPERIENCE
18 years combined
Marketing analytics, Risk analytics, Financial analytics, Analytic Solution & Tools development, Analytics CoE setup, Advanced Analytics Training
EXISTING/SERVED CLIENTS
A large Global Beverage company, A small insurance company, A renowned business school, A large Global HR & Compensation Consulting Group, A large Global IT Research group, A third party analytics vendor, A mid sized analytics consulting
What we don't do
Quick and dirty back-of-the-envelope calculation
Use jargon presentations with little impact on your problem
Hide that we are stumped