
Lecture 17: Review Lecture
Sandy Eckel
seckel@jhsph.edu
20 May 2008

Types of Biostatistics

1) Descriptive Statistics
- Exploratory Data Analysis
- Often not in the literature
- Summaries; "Table 1" in a paper
- Goal: visualize relationships, generate hypotheses

2) Inferential Statistics
- Confirmatory Data Analysis (Methods section of a paper)
- Hypothesis tests
- Confidence intervals
- Regression modeling
- Goal: quantify relationships, test hypotheses
Summary of the approach to modeling I

A general approach for most statistical modeling is to:
- Define the Population of Interest
- State the Scientific Questions & Underlying Theories
- Describe and Explore the Observed Data
- Define the Model
  - Probability part (models the randomness / noise)
  - Systematic part (models the expectation / signal)

Summary of the approach to modeling II

- Estimate the Parameters in the Model
- Fit the Model to the Observed Data
- Make Inferences about Covariates
- Check the Validity of the Model
  - Verify the Model Assumptions
  - Re-define, Re-fit, and Re-check the Model if necessary
- Interpret the results of the Analysis in terms of the Scientific Questions of Interest
Descriptive Statistics

- ALWAYS look at your data
- If you can't see it, then don't believe it

Key Descriptive Statistics Ideas

- Visualizing data
  - Stem-and-leaf plots
  - Histograms
  - Boxplots
  - Scatterplots
- Describing data
  - Distribution shapes (especially skewness)
  - Quartiles
  - Measures of central tendency: Median, Mean, Mode
  - Measures of spread: Variance, Standard Deviation, Range, Interquartile Range

Stem-and-Leaf Plots

- Age in years (10 observations): 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

  2* | 5 6 9
  3* | 2 5 6 8
  4* | 4 9
  5* | 1

Histograms

- Pictures of the frequency or relative frequency distribution
- [Figure: Histogram of Age; frequency (0 to 4) by age category (1 to 4)]
Box-and-Whisker Plots (Boxplot)

- [Figure: box plot of Age in Years, roughly 25 to 51]
- IQR = 44 - 29 = 15 (see the sketch below)
- Upper Fence = 44 + 15*1.5 = 66.5
- Lower Fence = 29 - 15*1.5 = 6.5

2 Continuous Variables

- Scatterplot
- [Figure: scatterplot of Height in Centimeters (about 150 to 190) against Age in Years (about 25 to 50)]
- Scatterplots visually display the relationship between two continuous variables
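A minimal Python sketch (illustrative, not part of the lecture) of the fence arithmetic above; Tukey's hinges are used so the quartiles match the slide's Q1 = 29 and Q3 = 44:

    # Quartiles, IQR, and boxplot fences for the age data from the stem-and-leaf slide
    from statistics import median

    ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]   # sorted

    n = len(ages)
    lower_half = ages[: n // 2]          # values below the overall median
    upper_half = ages[(n + 1) // 2 :]    # values above the overall median
    q1, q3 = median(lower_half), median(upper_half)   # 29 and 44

    iqr = q3 - q1                        # 44 - 29 = 15
    upper_fence = q3 + 1.5 * iqr         # 44 + 22.5 = 66.5
    lower_fence = q1 - 1.5 * iqr         # 29 - 22.5 = 6.5
    print(q1, q3, iqr, lower_fence, upper_fence)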

Skewness

Positively Skewed
- Longer tail in the high values
- Mean > Median > Mode
- [Figure: right-skewed density with Mode, Median, and Mean marked from left to right]

Negatively Skewed
- Longer tail in the low values
- Mode > Median > Mean
- [Figure: left-skewed density with Mean, Median, and Mode marked from left to right]

Symmetric
- Right and left sides are mirror images
- Left tail looks like right tail
- Mean = Median = Mode
- [Figure: symmetric density with Mean, Median, and Mode at the same point]

Other descriptive statistics (review on your own)
- Quartiles
- Measures of central tendency: Median, Mean, Mode
- Measures of spread: Variance, Standard Deviation, Range, Interquartile Range

Concepts from Biostat I used in this class

Key Ideas of Probability/Distributions

- Mutually exclusive
- Statistically independent
- Addition rule
- Conditional probability
- Common distributions
  - Continuous: Normal, t-distribution, Chi-square, F-distribution
  - Discrete: Binomial
Key Ideas of the Normal Distribution and Statistical Inference

Normal distribution
- Parameters: mean, variance
- Standard normal
- 68-95-99.7 Rule
- Areas under the curve: relation to p-values

Statistical Inference
- Population: parameters
- Sample: statistics
- We use sample statistics along with theoretical results to make inferences about population parameters
- Sampling distribution of the sample mean
- Central Limit Theorem

68-95-99.7 Rule

- 68% of the area under the curve is within 1 standard deviation of the mean
- 95% of the area under the curve is within 2 standard deviations of the mean
- 99.7% of the area under the curve is within 3 standard deviations of the mean
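A quick numerical check of the 68-95-99.7 rule (a sketch, not part of the lecture), using the standard normal CDF from scipy:

    # Area under the standard normal curve within k standard deviations of the mean
    from scipy.stats import norm

    for k in (1, 2, 3):
        area = norm.cdf(k) - norm.cdf(-k)
        print(f"within {k} SD: {area:.4f}")   # ~0.6827, 0.9545, 0.9973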
Sampling Distribution of the Sample Mean

- Usually μ is unknown and we would like to estimate it
- We use X̄ to estimate μ
- We know the sampling distribution of X̄
- Definition: Sampling distribution
  The distribution of all possible values of some statistic, computed from samples of the same size randomly drawn from the same population, is called the sampling distribution of that statistic

The Central Limit Theorem

- Given a population of any distribution with mean μ and variance σ², the sampling distribution of X̄, computed from samples of size n from this population, will be approximately normally distributed with mean μ and variance σ²/n, when the sample size is large
- In general, this applies when n ≥ 25
  - The approximation to normality becomes better as n increases
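A minimal simulation sketch of the Central Limit Theorem (illustrative, not from the lecture); the exponential population and the sample size are arbitrary choices:

    # Sample means from a skewed population are approximately N(mu, sigma^2 / n) for n >= 25
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma2 = 1.0, 1.0                 # exponential(1) population: mean 1, variance 1
    n, reps = 25, 10_000

    sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    print(sample_means.mean())            # close to mu = 1
    print(sample_means.var())             # close to sigma2 / n = 0.04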

Inferential Statistics

Confidence Intervals

- Point estimation
  - An estimate of a population parameter
- Interval estimation
  - A point estimate plus an interval that expresses the uncertainty or variability associated with the estimate
- 100(1 - α)% confidence interval:
  estimate ± (critical value of z or t) × (standard error of estimate)
- The critical value is the cutoff such that the area under the curve in the tails beyond the critical value (both positive and negative direction) is α
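A minimal sketch (illustrative, not from the lecture) of the confidence-interval formula above, applied to the age data used earlier:

    # 100(1 - alpha)% t-based confidence interval for a mean
    import numpy as np
    from scipy import stats

    x = np.array([25, 26, 29, 32, 35, 36, 38, 44, 49, 51])
    alpha = 0.05

    estimate = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))                # standard error of the mean
    t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)  # critical value of t

    ci = (estimate - t_crit * se, estimate + t_crit * se)
    print(estimate, ci)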
Interpretation of confidence interval?
Use a CI for μ as an example

- Before the data are observed, the probability is at least (1 - α) that [L, U] will contain μ, the population parameter
  - In repeated sampling from a normally distributed population, 100(1 - α)% of all intervals of the form above will include the population mean μ
- After the data are observed, the constructed interval [L, U] either contains the true mean or it does not (no probability involved anymore)

Steps of Hypothesis Testing

- Define the null hypothesis, H0
- Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0"
- Define the type 1 error, α, usually 0.05
- Calculate the test statistic
- Calculate the p-value
- If the p-value is less than α, reject H0; otherwise fail to reject H0

MANY types of hypothesis testing

We discussed
- A single mean
  - H0: μ = 3000 vs. Ha: μ ≠ 3000
- A single proportion
  - H0: p = 0.35 vs. Ha: p ≠ 0.35
- Difference of means
  - H0: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 ≠ 0
  - Need to decide whether to assume equality of variance
- Difference of proportions
  - H0: p1 - p2 = 0 vs. Ha: p1 - p2 ≠ 0
- Others for regression modelling

Which test statistic do I use for each kind of test?

- Usually, the form of the test statistic depends on
  - Population distribution
  - Sample size
  - Population variance
    - Whether known or estimated
    - Or assumptions about equality
- To find out which test statistic to use, check the summary sheets
  - Not something you really have to memorize
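A minimal sketch (illustrative, not from the lecture; the data are simulated) of two of the tests listed above, a single mean and a difference of means, using scipy:

    # One-sample t-test of H0: mu = 3000 and two-sample t-test of H0: mu1 - mu2 = 0
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    birthweight = rng.normal(3100, 500, size=40)   # hypothetical sample
    group_a = rng.normal(3100, 500, size=40)
    group_b = rng.normal(2900, 500, size=40)

    t1, p1 = stats.ttest_1samp(birthweight, popmean=3000)          # single mean
    t2, p2 = stats.ttest_ind(group_a, group_b, equal_var=False)    # difference of means, unequal variances
    print(p1, p2)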
Example
Summary: Hypothesis test for a single mean

- [Worked example: test statistic and decision for a single-mean hypothesis test]

Relation between CI and hypothesis testing

- General rule on the 100(1 - α)% confidence interval approach to two-sided hypothesis testing
  - If the null hypothesis value is not contained in the confidence interval, you reject the null hypothesis with p-value < α
  - If the null hypothesis value is contained in the confidence interval, you fail to reject the null hypothesis with p-value > α

Why is the power of a test important?

- Power indicates the chance of finding a significant difference when there really is one
- Low power: likely to obtain non-significant results even when significant differences exist
- High power is desirable!
- Low power is usually caused by small sample size

We're not always right

- Aim: to keep Type I error (α) small by specifying a small rejection region
  - α is set before performing a test, usually at 0.05
- Aim: to keep Type II error (β) small and thus power high
  - Power = 1 - β (see the sketch below)
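A minimal sketch (illustrative, not from the lecture) of how power = 1 - β is computed for a two-sided one-sample z-test and how it grows with sample size; the numbers are hypothetical:

    # Power to detect a true mean of 3200 when testing H0: mu = 3000, sigma = 500
    from math import sqrt
    from scipy.stats import norm

    def power_one_sample_z(mu0, mu1, sigma, n, alpha=0.05):
        z_crit = norm.ppf(1 - alpha / 2)
        shift = (mu1 - mu0) / (sigma / sqrt(n))     # standardized alternative
        # probability of landing in the rejection region when the alternative is true
        return norm.cdf(-z_crit - shift) + 1 - norm.cdf(z_crit - shift)

    for n in (10, 25, 100):
        print(n, round(power_one_sample_z(3000, 3200, 500, n), 3))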
β: Probability of Type II Error

- The value of β is usually unknown since it depends on a specified alternative value
- β depends on sample size and α
- Before data collection, scientists decide
  - the test they will perform
  - α
  - the desired power
- They will use this information to choose the sample size

P-Values

- Definition: The p-value for a hypothesis test is the probability of obtaining, by chance alone when H0 is true, a value of the test statistic as extreme or more extreme (in the appropriate direction) than the one actually observed

Regression Modeling

Correlation

- Measures strength and direction of the linear relationship between two continuous variables
- The correlation coefficient, ρ, takes values between -1 and +1
  - -1: Perfect negative linear relationship
  - 0: No linear relationship
  - +1: Perfect positive linear relationship
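A minimal sketch (illustrative, not from the lecture; the heights are made up) of estimating the correlation between two continuous variables:

    # Sample Pearson correlation between age and height
    import numpy as np
    from scipy import stats

    age = np.array([25, 26, 29, 32, 35, 36, 38, 44, 49, 51])
    height = np.array([152, 155, 160, 162, 165, 170, 172, 178, 183, 188])

    r, p_value = stats.pearsonr(age, height)   # r lies between -1 and +1
    print(r, p_value)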
Association and Causation

- In general, association between two variables means there is some form of relationship between them
- The relationship is not necessarily causal
- Association does not imply causation, no matter how much we would like it to
  - Example: hot days, ice cream, drowning

Why use linear regression?

- Linear regression is very powerful. It can be used for many things:
  - Binary X
  - Continuous X
  - Categorical X
  - Adjustment for confounding
  - Interaction
  - Curved relationships between X and Y

Simple Linear Regression: Y = β0 + β1X1 + ε

- Linear regression is used for continuous outcome variables
- β0: mean outcome when X = 0 (center!)
- Binary X = dummy variable for group
  - β1: mean difference in outcome between groups
- Continuous X
  - β1: difference in mean outcome corresponding to a 1-unit increase in X
  - Center X to give meaning to β0
- Test β1 = 0 in the population

Assumptions of Linear Regression

- L  Linear relationship
- I  Independent observations
- N  Normally distributed around line
- E  Equal variance across Xs
- Most often assessed with graphs
  - One type: AV (added-variable) plots
    - Visualize the relationship between the outcome and a continuous predictor after adjusting for the effects of a third variable
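A minimal sketch (illustrative, not from the lecture; the data are simulated) of fitting Y = β0 + β1X + error with a centered continuous predictor, so that β0 is the mean outcome at the centering value:

    # Simple linear regression with X centered at 45
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    age = rng.uniform(25, 55, size=100)
    height = 150 + 0.8 * (age - 45) + rng.normal(0, 5, size=100)

    X = sm.add_constant(age - 45)          # centering gives beta0 a meaningful interpretation
    fit = sm.OLS(height, X).fit()
    print(fit.params)                      # beta0: mean height at age 45; beta1: change per year
    print(fit.pvalues[1])                  # test of H0: beta1 = 0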
Regression Methods

- In simple linear regression (SLR):
  - One Predictor / Covariate / Explanatory Variable: X
- In multiple linear regression (MLR):
  - Same assumptions as SLR (i.e. L.I.N.E.), but:
  - More than one covariate: X1, X2, X3, ..., Xp
- Model:
  - Y ~ N(μ, σ²)
  - μ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp

Regression Methods

- Interactions can allow us to draw separate lines for two groups
  - X1 = Year
  - X2 = Group

Nested models

- One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables
- H0: all new βs are zero
- Assess using the F-test
  - If only one additional variable, use the t-test
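A minimal sketch (illustrative, not from the lecture; the data are simulated) of comparing nested linear models with an F-test; with a single added variable this matches the t-test on that coefficient:

    # Parent model (year only) vs. extended model (year + group)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 200
    year = rng.uniform(0, 10, n)
    group = rng.integers(0, 2, n)
    y = 2 + 0.5 * year + 1.0 * group + rng.normal(0, 1, n)

    fit_small = sm.OLS(y, sm.add_constant(year)).fit()
    fit_big = sm.OLS(y, sm.add_constant(np.column_stack([year, group]))).fit()

    f_stat, p_value, df_diff = fit_big.compare_f_test(fit_small)   # H0: all new betas are zero
    print(f_stat, p_value, df_diff)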
Effect Modification

- In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor
- If the 3rd predictor is binary, that results in a graph in which the two lines (for the two groups) are no longer parallel

Confounding: the epidemiologic definition

- C is a confounder of the relation between X and Y if:
  [Diagram: Confounder C is related to both Predictor X and Outcome Y]

Confounding: example

- Smoking is a confounder of the relation between coffee consumption (X) and lung cancer (Y) since:
  [Diagram: Smoking (C) is related to both Coffee Consumption (X) and Lung Cancer (Y)]

Modeling confounding and effect modification

- Potential confounder(s)
  - Run model without confounder (model 1)
  - Run model with confounder (model 2)
  - Compare the model 2 estimate to the model 1 CI of the primary predictor to see whether the new parameter estimate is significantly different
- Effect modification
  - Model using an interaction term
  - Test if statistically significant using a t-test
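A minimal sketch (illustrative, not from the lecture; the data and effect sizes are made up) of the two strategies above: compare the coffee coefficient with and without the smoking term, and test a coffee-by-smoking interaction:

    # Confounding check and effect-modification test with statsmodels formulas
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    n = 300
    smoking = rng.integers(0, 2, n)
    coffee = 1 + smoking + rng.normal(0, 1, n)            # smoking drives coffee consumption
    risk = 0.1 * coffee + 2.0 * smoking + rng.normal(0, 1, n)
    df = pd.DataFrame({"coffee": coffee, "smoking": smoking, "risk": risk})

    model1 = smf.ols("risk ~ coffee", data=df).fit()              # without the confounder
    model2 = smf.ols("risk ~ coffee + smoking", data=df).fit()    # with the confounder
    model3 = smf.ols("risk ~ coffee * smoking", data=df).fit()    # adds the interaction term

    print(model1.params["coffee"], model2.params["coffee"])       # compare the two estimates
    print(model3.pvalues["coffee:smoking"])                       # t-test of the interaction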
Spline Terms

- Splines are used to allow the regression line to bend
  - The breakpoint is arbitrary and decided graphically or by hypothesis
  - The actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e. the change in slope)
- [Figure: "Broken Arrow Model" of Expenditures (about 2000 to 3500) by length of stay (3 to 9 days); slope = β1 below the breakpoint, slope = β1 + β2 above it]

Summary: Flexibility in linear models

- A spline allows the slope for a continuous predictor to change at a given point; the coefficient is the difference in slopes (in logistic regression, a difference in log odds ratios)
- An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is the difference in log odds ratios
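A minimal sketch (illustrative, not from the lecture; the data mimic the broken-arrow figure) of a linear spline with one breakpoint, where the added term (x - knot)+ lets the slope change from β1 to β1 + β2:

    # Linear spline (broken arrow) model with a knot at 5 days
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    los = rng.uniform(3, 9, 200)                              # length of stay (days)
    knot = 5.0
    expenditures = 2000 + 150 * los + 250 * np.clip(los - knot, 0, None) + rng.normal(0, 100, 200)

    spline_term = np.clip(los - knot, 0, None)                # (los - 5)+, zero below the knot
    X = sm.add_constant(np.column_stack([los, spline_term]))
    fit = sm.OLS(expenditures, X).fit()

    b0, b1, b2 = fit.params
    print("slope below knot:", b1, "slope above knot:", b1 + b2)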

Using R² as a model selection criterion

- The coefficient of determination, R², evaluates the entire model
- R² shows the proportion of the total variation in Y that has been predicted by this model
  - Model 1: 0.0076; 0.8% of variation explained
  - Model 2: 0.05; 5% of variation explained
  - Model 3: 0.20; 20% of variation explained
- You want a model with large R²

Logistic Regression

Basic idea:
- Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial distribution
  - Linear regression is the type of regression we use for a continuous, normally distributed response variable (Y)
- Model the log odds of the probability, which we also call the logit
  - The baseline term is interpreted as a log odds
  - Other coefficients are log odds ratios
- Transform a log odds / log odds ratio to the odds / odds ratio scale by exponentiating the coefficient
Logit Function

- Relates the log-odds (logit) to p = Pr(Y = 1)
- [Figure: the logit function; log-odds (about -10 to 10) plotted against probability of success (0 to 1)]

A basic logistic regression model and interpretation

logit(pi) = β0 + β1(agei - 45)

- β0 = log-odds of blindness among 45 year olds
  - exp(β0) = odds of blindness among 45 year olds
- β1 = difference in log-odds of blindness comparing a group that is one year older than another
  - exp(β1) = odds ratio of blindness comparing a group that is one year older than another
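A minimal sketch (illustrative, not from the lecture; the data are simulated) of fitting logit(p) = β0 + β1(age - 45) and exponentiating the coefficients:

    # Logistic regression with a centered age term
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    age = rng.uniform(30, 80, 500)
    true_logit = -2.0 + 0.08 * (age - 45)
    blind = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

    X = sm.add_constant(age - 45)
    fit = sm.Logit(blind, X).fit(disp=0)

    print(np.exp(fit.params[0]))   # odds of blindness at age 45
    print(np.exp(fit.params[1]))   # odds ratio per one-year increase in age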

Why we can interpret the difference in log odds as the log odds ratio

- The slope is both
  - a difference in log odds associated with a 1-unit change in X (controlling for other Xs)
  - a log odds ratio associated with a 1-unit change in X (controlling for other Xs)
- Why? log(a) - log(b) = log(a/b), so
  log(odds | X=1) - log(odds | X=0) = log(OR for X=1 vs. X=0)

Multiplicative change interpretation of the slope

- The slope in a logistic regression is multiplicative on the odds scale
  - e^β1 is the proportional increase in the odds of not visiting a physician corresponding to a one-year increase in age:
    e^β1 × (odds for a 30-yr-old) = (odds for a 31-yr-old), i.e. e^β1 = (odds for a 31-yr-old) / (odds for a 30-yr-old)
  - e^(β1·10) = e^(10β1) is the proportional increase in the odds of not visiting a physician corresponding to a ten-year increase in age
More useful math:
how to get the probability from the odds

- odds = probability / (1 - probability)
- probability = odds / (1 + odds)
- so P(Y = 1 | X = 1) = e^(β0 + β1) / (1 + e^(β0 + β1))

Comparing nested models with logistic regression

- Models that differ by one variable
  - Compare models with a p-value or CI using the Wald test, a test that applies the CLT
  - H0: the new variable is not needed, or H0: βnew = 0 in the population
- Models that differ by more than one variable
  - Likelihood ratio test (Chi-square test of deviance)
  - H0: all new variables are not needed, or H0: all new βs = 0 in the population
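A minimal sketch (illustrative, not from the lecture; the coefficients are hypothetical) of the odds and probability arithmetic above, evaluated at X = 1:

    # From log odds to odds to probability
    import numpy as np

    beta0, beta1 = -2.0, 0.08         # hypothetical fitted coefficients
    log_odds = beta0 + beta1 * 1      # logit at X = 1

    odds = np.exp(log_odds)                    # odds = probability / (1 - probability)
    probability = odds / (1 + odds)            # probability = odds / (1 + odds)
    print(odds, probability)                   # probability = e^(b0 + b1) / (1 + e^(b0 + b1))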

Grand summary

- Exploratory analysis includes graphs and tables; good to get a feel for the data
- Confirmatory analysis is useful for making definitive conclusions
- Linear models provide us with a framework in which to perform confirmatory analysis in many settings

Grand summary: linear models

- Linear regression: for continuous (normal) outcomes
- Logistic regression: for binary outcomes
Grand summary: modelling

- In all generalized linear models, we can use the following tools to make models more flexible:
  - Adjust for confounders using additive covariates
  - Allow effect modification using interaction terms
  - Fit curved and bent lines through polynomials and splines

Grand summary: testing

- We can test the significance of a single predictor using a z-test (or t-test for linear regression)
- Test the significance of several covariates using a pair of nested models compared by a likelihood ratio test
- Know how to interpret p-values and confidence intervals!

References: textbooks
For more information

- Friendly intro epidemiology textbook
  - Epidemiology by Leon Gordis
- Intro to biostatistical modeling textbook
  - Regression Modeling Strategies by Frank E. Harrell Jr.
- Slightly more theoretical intro to statistical modeling textbook
  - Mathematical Statistics and Data Analysis by John A. Rice

References: online JHSPH open courseware
For further directed self-study

- JHSPH Biostatistics Open Courseware: http://ocw.jhsph.edu/Topics.cfm?topic_id=33
  - Essentials of Probability and Statistical Inference IV
  - Methods in Biostatistics I & II
  - Statistical Reasoning I & II
  - Statistics for Laboratory Scientists I & II
  - Statistics for Psychosocial Research: Structural Models & Measurement
