Review Lecture

Lecture 17: Review Lecture Summaries
Sandy Eckel
seckel@jhsph.edu
20 May 2008

Types of Biostatistics
1) Descriptive Statistics
  Exploratory Data Analysis
    often not in literature
  "Table 1" in a paper
  Goal: visualize relationships, generate hypotheses
2) Inferential Statistics
  Confirmatory Data Analysis (Methods Section of paper)
  Hypothesis tests
  Confidence Intervals
  Regression modeling
  Goal: quantify relationships, test hypotheses

Summary of the approach to modeling I
A general approach for most statistical modeling is to:
  Define the Population of Interest
  State the Scientific Questions & Underlying Theories
  Describe and Explore the Observed Data
  Define the Model
    Probability part (models the randomness / noise)
    Systematic part (models the expectation / signal)

Summary of the approach to modeling II
  Estimate the Parameters in the Model
    Fit the Model to the Observed Data
  Make Inferences about Covariates
  Check the Validity of the Model
    Verify the Model Assumptions
  Re-define, Re-fit, and Re-check the Model if necessary
  Interpret the results of the Analysis in terms of the Scientific Questions of Interest

Descriptive Statistics
  ALWAYS look at your data
  If you can't see it, then don't believe it

Key Descriptive Statistics Ideas
Visualizing data
  Stem and leaf plots
  Histograms
  Boxplots
  Scatterplots
Describing data
  Distribution shapes (especially skewness)
  Quartiles
  Measures of central tendency
    Median, Mean, Mode
  Measures of spread
    Variance, Standard Deviation, Range, Interquartile Range

Stem-and-Leaf Plots
Age in years (10 observations)
  25, 26, 29, 32, 35, 36, 38, 44, 49, 51

  Age Interval   Observations
  2*             569
  3*             2568
  4*             49
  5*             1

Histograms
  Pictures of the frequency or relative frequency distribution
  [Figure: Histogram of Age; x-axis: Age Category, y-axis: Frequency]

Box-and-Whisker Plots (Boxplot)
  [Figure: Box Plot of Age; y-axis: Age in Years]
  IQR = 44 - 29 = 15
  Upper Fence = 44 + 15*1.5 = 66.5
  Lower Fence = 29 - 15*1.5 = 6.5

2 Continuous Variables
Scatterplot
  [Figure: Age by Height in cm; x-axis: Age in Years, y-axis: Height in Centimeters]
  Scatterplots visually display the relationship between two continuous variables
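
The box plot fences can be recomputed from the ten ages in the stem-and-leaf example. The sketch below assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is the convention that reproduces the slide's values of 29 and 44; other quartile conventions give slightly different numbers. The function name `quartiles_by_halves` is made up for this illustration.

```python
from statistics import median

# Ages from the stem-and-leaf example (10 observations).
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: this "median of halves" convention is used because it
    matches the slide; it is only one of several quartile definitions.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]  # skips the middle value when the count is odd
    return median(lower), median(upper)

q1, q3 = quartiles_by_halves(ages)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Lower fence:", lower_fence, "Upper fence:", upper_fence)
```

Observations outside the two fences would be drawn as individual points on the box plot; none of the ten ages fall outside them here.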

Skewness
Positively Skewed
  Longer tail in the high values
  Mean > Median > Mode
  [Figure: positively skewed curve with Mode, Median, and Mean marked]

Skewness
Negatively Skewed
  Longer tail in the low values
  Mode > Median > Mean
  [Figure: negatively skewed curve with Mean, Median, and Mode marked]

Symmetric
  Right and left sides are mirror images
  Left tail looks like right tail
  Mean = Median = Mode
  [Figure: symmetric curve with Mean, Median, and Mode at the center]

Other descriptive statistics
Review on your own
  Quartiles
  Measures of central tendency
    Median, Mean, Mode
  Measures of spread
    Variance, Standard Deviation, Range, Interquartile Range

Concepts from Biostat I used in this class

Key Ideas of Probability/Distributions
  Mutually exclusive
  Statistically independent
  Addition rule
  Conditional probability
  Common distributions
    Continuous: Normal, t-distribution, Chi-square, F-distribution
    Discrete: Binomial

Key Ideas of the Normal Distribution and Statistical Inference
Normal distribution
  Parameters: mean, variance
  Standard normal
  68-95-99.7 Rule
  Areas under the curve: relation to p-values
Statistical Inference
  Population: parameters
  Sample: statistics
  We use sample statistics along with theoretical results to make inferences about population parameters
  Sampling distribution of sample mean
  Central Limit Theorem

68-95-99.7 Rule
  68% of the area under the curve is within 1 standard deviation of the mean
  95% of the area under the curve is within 2 standard deviations of the mean
  99.7% of the area under the curve is within 3 standard deviations of the mean
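
The three percentages in the 68-95-99.7 rule can be checked numerically. The short sketch below uses the identity P(|Z| < k) = erf(k / sqrt(2)) for a standard normal Z; the helper name `area_within` is made up for this illustration.

```python
import math

def area_within(k):
    """Area under the standard normal curve within k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.4f}")
```

The rule rounds these areas; for example, the area within 2 standard deviations is slightly above 95%, which is why the exact two-sided 95% cutoff is 1.96 rather than 2.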

Sampling Distribution of the Sample Mean
  Usually $\mu$ is unknown and we would like to estimate it
  We use $\bar{x}$ to estimate $\mu$
  We know the sampling distribution of $\bar{x}$
  Definition: Sampling distribution
    The distribution of all possible values of some statistic, computed from samples of the same size randomly drawn from the same population, is called the sampling distribution of that statistic

The Central Limit Theorem
  Given a population of any distribution with mean, $\mu$, and variance, $\sigma^2$, the sampling distribution of $\bar{x}$, computed from samples of size n from this population, will be approximately normally distributed with mean, $\mu$, and variance, $\sigma^2/n$, when the sample size is large.
  In general, this applies when $n \ge 25$.
  The approximation to normality becomes better as n increases
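
A small simulation can illustrate the Central Limit Theorem. The sketch below is only an illustration under assumed settings: it draws repeated samples of size n = 30 from an Exponential(1) population (right-skewed, mean 1, variance 1) and checks that the sample means centre on the population mean with variance close to the population variance divided by n. The population, sample size, and number of replicates are all hypothetical choices.

```python
import random
from statistics import mean, variance

random.seed(0)  # fixed seed so the run is repeatable

n = 30        # sample size (hypothetical choice)
reps = 5000   # number of simulated samples (hypothetical choice)

# Exponential(rate=1) population: skewed, with mean 1 and variance 1.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = mean(sample_means)
var_of_means = variance(sample_means)

print("mean of sample means:", round(mean_of_means, 3), "(theory: 1)")
print("variance of sample means:", round(var_of_means, 4),
      "(theory:", round(1 / n, 4), ")")
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed.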

Inferential Statistics

Confidence Intervals
  Point estimation
    An estimate of a population parameter
  Interval estimation
    A point estimate plus an interval that expresses the uncertainty or variability associated with the estimate
  $100(1-\alpha)\%$ Confidence interval:
    estimate $\pm$ (critical value of z or t) $\times$ (standard error of estimate)
  Critical value is the cutoff such that the area under the curve in the tails beyond the critical value (both positive and negative direction) is $\alpha$
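
The interval formula can be sketched in a few lines. The example below is a minimal sketch that assumes a z critical value and reuses the ten ages from the stem-and-leaf example as data; with a sample this small a t critical value would normally be used and would give a somewhat wider interval. The function name `z_confidence_interval` is made up for this illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def z_confidence_interval(data, alpha=0.05):
    """Return (lower, upper, z_crit) for a 100(1 - alpha)% z-based interval."""
    # Critical value: the area in both tails beyond +/- z_crit totals alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimate = mean(data)
    std_error = stdev(data) / math.sqrt(len(data))
    return estimate - z_crit * std_error, estimate + z_crit * std_error, z_crit

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
lower, upper, z_crit = z_confidence_interval(ages)
print("critical value:", round(z_crit, 3))
print("95% CI for the mean age:", round(lower, 2), "to", round(upper, 2))
```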

Interpretation of confidence interval?
Use a CI for $\mu$ as an example:
  Before the data are observed, the probability is at least $(1-\alpha)$ that [L,U] will contain $\mu$, the population parameter
  In repeated sampling from a normally distributed population, $100(1-\alpha)\%$ of all intervals of the form above will include the population mean $\mu$
  After the data are observed, the constructed interval [L,U] either contains the true mean or it does not (no probability involved anymore)

Steps of Hypothesis Testing
  Define the null hypothesis, H0.
  Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0".
  Define the type 1 error, $\alpha$, usually 0.05.
  Calculate the test statistic
  Calculate the P-value
  If the P-value is less than $\alpha$, reject H0. Otherwise fail to reject H0.
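
The steps of hypothesis testing can be traced in code. The sketch below assumes a two-sided z-test for a single mean with a large sample; the null value 3000 and the summary numbers (sample mean, standard deviation, sample size) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math
from statistics import NormalDist

# Steps 1-2: H0: mu = 3000 versus Ha: mu != 3000 (hypothetical null value).
mu0 = 3000.0

# Step 3: type 1 error rate.
alpha = 0.05

# Hypothetical summary statistics from a sample.
xbar, s, n = 3100.0, 500.0, 100

# Step 4: test statistic (z is used here, assuming a large sample).
z = (xbar - mu0) / (s / math.sqrt(n))

# Step 5: two-sided p-value, the area in both tails beyond |z|.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 6: decision.
reject_h0 = p_value < alpha

print("z =", round(z, 2), "p-value =", round(p_value, 4))
print("reject H0" if reject_h0 else "fail to reject H0")
```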

MANY types of hypothesis testing
We discussed
  A single mean
    H0: $\mu = 3000$ vs. Ha: $\mu \ne 3000$
  A single proportion
    H0: $p = 0.35$ vs. Ha: $p \ne 0.35$
  Difference of means
    H0: $\mu_1 - \mu_2 = 0$ vs. Ha: $\mu_1 - \mu_2 \ne 0$
    Need to decide whether to assume equality of variance
  Difference of proportions
    H0: $p_1 - p_2 = 0$ vs. Ha: $p_1 - p_2 \ne 0$
  Others for regression modelling

Which test statistic do I use for each kind of test?
Usually, the form of the test statistic depends on
  Population distribution
  Sample size
  Population variance
    Whether known or estimated
    Or assumptions about equality
To find out which test statistic to use, check the summary sheets
  Not something you really have to memorize

Example
Summary: Hypothesis test for a single mean

Relation between CI and hypothesis testing
General rule on the $100(1-\alpha)\%$ confidence interval approach to two-sided hypothesis testing
  If the null hypothesis value is not contained in the confidence interval, you reject the null hypothesis with p-value $\le \alpha$
  If the null hypothesis value is contained in the confidence interval, you fail to reject the null hypothesis with p-value $> \alpha$

Why is the power of a test important?
  Power indicates the chance of finding a significant difference when there really is one
  Low power: likely to obtain non-significant results even when significant differences exist
  High power is desirable!
  Low power is usually caused by small sample size

We're not always right
  Aim: to keep Type I error ($\alpha$) small by specifying a small rejection region
  $\alpha$ is set before performing a test, usually at 0.05
  Aim: To keep Type II error ($\beta$) small and thus power high
  Power = $1 - \beta$

$\beta$: Probability of Type II Error
  The value of $\beta$ is usually unknown since it depends on a specified alternative value.
  $\beta$ depends on sample size and $\alpha$.
  Before data collection, scientists decide
    the test they will perform
    $\alpha$
    the desired $\beta$
  They will use this information to choose the sample size

P-Values
  Definition: The p-value for a hypothesis test is the probability of obtaining by chance alone, when H0 is true, a value of the test statistic as extreme or more extreme (in the appropriate direction) than the one actually observed.
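
How power depends on sample size can be seen with a small calculation. The sketch below assumes a two-sided z-test for a single mean with known standard deviation; `delta` (the true difference from the null value), `sigma`, and the sample sizes are hypothetical, and the function name `power_two_sided_z` is made up for this illustration.

```python
import math
from statistics import NormalDist

def power_two_sided_z(delta, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-sided z-test when the true mean differs
    from the null value by delta and sigma is assumed known."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * math.sqrt(n) / sigma
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

for n in (10, 25, 50, 100):
    print("n =", n, "power =", round(power_two_sided_z(0.5, 1.0, n), 3))
```

With the difference and standard deviation held fixed, power rises as n grows, which is why the desired beta and alpha together drive the choice of sample size.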

Regression Modeling

Correlation
  Measures strength and direction of the linear relationship between two continuous variables
  The correlation coefficient, $\rho$, takes values between -1 and +1
    -1: Perfect negative linear relationship
    0: No linear relationship
    +1: Perfect positive linear relationship
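
The three reference values of the correlation coefficient can be reproduced with tiny made-up data sets. The sketch below computes the sample (Pearson) correlation directly from its definition; the function name `pearson_r` and all data values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
print(pearson_r(x, [2 * v + 1 for v in x]))   # exact increasing line
print(pearson_r(x, [-3 * v for v in x]))      # exact decreasing line
print(pearson_r(x, [v ** 2 for v in x]))      # curved, not linear
```

The last case is worth noting: y is completely determined by x, yet the correlation is zero because the relationship is not linear.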

Association and Causation
  In general, association between two variables means there is some form of relationship between them
    The relationship is not necessarily causal
  Association does not imply causation, no matter how much we would like it to
    Example: Hot days, ice cream, drowning

Why use linear regression?
Linear regression is very powerful. It can be used for many things:
  Binary X
  Continuous X
  Categorical X
  Adjustment for confounding
  Interaction
  Curved relationships between X and Y

Simple Linear regression: $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
  Linear regression is used for continuous outcome variables
  $\beta_0$: mean outcome when X=0 (Center!)
  Binary X = dummy variable for group
    $\beta_1$: mean difference in outcome between groups
  Continuous X
    $\beta_1$: difference in mean outcome corresponding to a 1-unit increase in X
    Center X to give meaning to $\beta_0$
  Test $\beta_1 = 0$ in the population

Assumptions of Linear Regression
  L  Linear relationship
  I  Independent observations
  N  Normally distributed around line
  E  Equal variance across Xs
  Most often assess with graphs
  - One type: AV plots
    visualize the relationship between the outcome and a continuous predictor after adjusting for the effects of a third variable
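
The effect of centering X can be shown with a least-squares fit on made-up data. The sketch below uses the closed-form formulas for the slope and intercept of a simple linear regression; the ages and heights are hypothetical, and `fit_line` is a name made up for this illustration.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: age in years and height in centimeters.
age = [25, 30, 35, 40, 45, 50]
height = [172, 170, 171, 168, 169, 166]

b0, b1 = fit_line(age, height)

# Center age at its mean so the intercept refers to an average-aged person.
mean_age = sum(age) / len(age)
age_centered = [a - mean_age for a in age]
b0_c, b1_c = fit_line(age_centered, height)

print("uncentered: intercept", round(b0, 2), "slope", round(b1, 3))
print("centered:   intercept", round(b0_c, 2), "slope", round(b1_c, 3))
```

Centering leaves the slope unchanged, while the intercept becomes the mean outcome at the mean age instead of an extrapolation to age 0.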

In Simple Linear Regression

Regression Methods
In simple linear regression (SLR):
  One Predictor / Covariate / Explanatory Variable: X
In multiple linear regression (MLR):
  Same Assumptions as SLR (i.e. L.I.N.E.), but:
  More than one Covariate: $X_1, X_2, X_3, \ldots, X_p$
Model:
  $Y \sim N(\mu, \sigma^2)$
  $\mu = E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p$

Regression Methods
  Interactions can allow us to draw separate lines for two groups
    $X_1$ = Year
    $X_2$ = Group

Nested models
  One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.
  H0: all new $\beta$s are zero
  Assess using F-test
  If only one additional variable, use t-test

Effect Modification
  In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.
  If the 3rd predictor is binary, that results in a graph in which the two lines (for the two groups) are no longer parallel.

Confounding: the epidemiologic definition
  C is a confounder of the relation between X and Y if:
  [Diagram relating Confounder C, Predictor X, and Outcome Y]

Confounding: example
  Smoking is a confounder of the relation between coffee consumption (X) and lung cancer (Y) since:
  [Diagram relating Smoking C, Coffee Consumption X, and Lung Cancer Y]

Modeling confounding and effect modification
Potential confounder(s)
  Run model without confounder (model 1)
  Run model with confounder (model 2)
  Compare model 2 estimate to the model 1 CI of primary predictor to see whether new parameter is significantly different
Effect modification
  Model using interaction term
  Test if statistically significant using a t-test

Spline Terms
  Splines are used to allow the regression line to bend
    the breakpoint is arbitrary and decided graphically or by hypothesis
    the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e. the change in slope)
  [Figure: Broken Arrow Model; x-axis: length of stay (days), y-axis: Expenditures; segments labeled Slope = $\beta_1$ and Slope = $\beta_1 + \beta_2$]

Summary: Flexibility in linear models
  A spline allows the slope for a continuous predictor to change at a given point; the coefficient is for the difference in log odds ratio
  An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is for the difference in log odds ratio
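
The broken-arrow idea can be written as one extra term that is zero below the breakpoint. The sketch below assumes the usual linear spline coding, in which the added term is the amount by which x exceeds the breakpoint (zero below it), so the slope is beta1 below the breakpoint and beta1 + beta2 above it. All coefficient values and the breakpoint are hypothetical, and `broken_arrow_mean` is a name made up for this illustration.

```python
def broken_arrow_mean(x, beta0, beta1, beta2, knot):
    """Mean outcome under a linear spline with a single breakpoint (knot)."""
    spline_term = max(0.0, x - knot)  # zero below the knot
    return beta0 + beta1 * x + beta2 * spline_term

# Hypothetical coefficients: expenditures versus length of stay in days.
beta0, beta1, beta2, knot = 2000.0, 100.0, 150.0, 5.0

slope_below = (broken_arrow_mean(4, beta0, beta1, beta2, knot)
               - broken_arrow_mean(3, beta0, beta1, beta2, knot))
slope_above = (broken_arrow_mean(8, beta0, beta1, beta2, knot)
               - broken_arrow_mean(7, beta0, beta1, beta2, knot))

print("slope below the breakpoint:", slope_below)   # beta1
print("slope above the breakpoint:", slope_above)   # beta1 + beta2
```

The spline coefficient beta2 is the change in slope, while the quantity usually reported is the slope on each side of the breakpoint.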

Using $R^2$ as a model selection criterion
  The coefficient of determination, $R^2$, evaluates the entire model
  $R^2$ shows the proportion of the total variation in Y that has been predicted by this model
    Model 1: 0.0076; 0.8% of variation explained
    Model 2: 0.05; 5% of variation explained
    Model 3: 0.20; 20% of variation explained
  You want a model with large $R^2$

Logistic Regression
Basic Idea:
  Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial distribution
  Linear regression is the type of regression we use for a continuous, normally distributed response variable (Y)
  Model the log odds of the probability, which we also call the logit
    Baseline term interpreted as log odds
    Other coefficients are log odds ratios
  Transform log odds / log odds ratio to odds / odds ratio scale by exponentiating coefficient
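
The proportion of variation explained can be computed from observed and fitted values. The sketch below assumes the standard definition of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares; the data are hypothetical, and `r_squared` is a name made up for this illustration.

```python
def r_squared(y, fitted):
    """Proportion of the total variation in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_resid = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1 - ss_resid / ss_total

y = [3.0, 5.0, 7.0, 9.0]                 # hypothetical outcomes
mean_only = [sum(y) / len(y)] * len(y)   # a model that ignores X entirely

print(r_squared(y, y))          # fitted values equal the data
print(r_squared(y, mean_only))  # fitted values are just the overall mean
```

The two calls show the extremes: fitted values that reproduce the data exactly explain all of the variation, and a model that only predicts the overall mean explains none of it.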

Logit Function
  Relates log-odds (logit) to p = Pr(Y=1)
  [Figure: logit function; x-axis: Probability of Success (0 to 1), y-axis: log-odds (-10 to 10)]

A basic logistic regression model and interpretation
  $\mathrm{logit}(p_i) = \beta_0 + \beta_1(\mathrm{age}_i - 45)$
  $\beta_0$ = log-odds of blindness among 45 year olds
  $\exp(\beta_0)$ = odds of blindness among 45 year olds
  $\beta_1$ = difference in log-odds of blindness comparing a group that is one year older than another
  $\exp(\beta_1)$ = odds ratio of blindness comparing a group that is one year older than another

Why we can interpret the difference in log odds as the log odds ratio
The slope in a logistic regression is
  A difference in log odds associated with a 1 unit change in X (controlling for other Xs)
  A log odds ratio associated with a 1 unit change in X (controlling for other Xs)
Why?
  $\log(a) - \log(b) = \log(a/b)$
  so
  $\log(\text{odds} \mid X=1) - \log(\text{odds} \mid X=0) = \log(\text{OR for } X=1 \text{ vs. } X=0)$

Multiplicative change interpretation of the slope
  $e^{\beta_1}$ is the multiplicative change in the odds of not visiting a physician corresponding to a one year increase in age
    $(\text{odds for 30-yr-old}) \times e^{\beta_1} = (\text{odds for 31-yr-old})$, that is, $e^{\beta_1} = \dfrac{\text{odds for 31-yr-old}}{\text{odds for 30-yr-old}}$
  $(e^{\beta_1})^{10} = e^{10\beta_1}$ is the multiplicative change in the odds of not visiting a physician corresponding to a ten year increase in age
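
Both identities can be checked numerically. The sketch below assumes a logistic model with hypothetical coefficients (an intercept of -2.0 and an age slope of 0.05) and confirms that the difference in log odds for a one year age gap equals the log of the odds ratio, and that the ten year odds ratio is the one year odds ratio raised to the tenth power.

```python
import math

# Hypothetical coefficients for log odds = beta0 + beta1 * age.
beta0, beta1 = -2.0, 0.05

def log_odds(age):
    return beta0 + beta1 * age

def odds(age):
    return math.exp(log_odds(age))

# One year difference: difference in log odds versus log of the odds ratio.
diff_log_odds = log_odds(31) - log_odds(30)
one_year_or = odds(31) / odds(30)

# Ten year difference: compare with the one year odds ratio to the 10th power.
ten_year_or = odds(40) / odds(30)

print("difference in log odds:", round(diff_log_odds, 4))
print("log of one-year odds ratio:", round(math.log(one_year_or), 4))
print("ten-year odds ratio:", round(ten_year_or, 4))
print("one-year odds ratio ** 10:", round(one_year_or ** 10, 4))
```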

More useful math
how to get the probability from the odds
  $\text{odds} = \dfrac{\text{probability}}{1 - \text{probability}}$
  $\text{probability} = \dfrac{\text{odds}}{1 + \text{odds}}$
  so $\Pr(Y = 1 \mid X = 1) = \dfrac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$

Comparing nested models with logistic regression
Models that differ by one variable
  Compare models with p-value or CI using the Wald test, a test that applies the CLT
  H0: the new variable is not needed
  or H0: $\beta_{new} = 0$ in the population
Models that differ by more than one variable
  Likelihood ratio test (Chi-square test of deviance)
  H0: all new variables not needed
  or H0: all $\beta_{new} = 0$ in the population
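
The conversions between odds and probability are short enough to write as two helper functions. The sketch below uses hypothetical coefficients (an intercept of -1.0 and a slope of 0.5) to show the predicted probability when X = 1; the function names are made up for this illustration.

```python
import math

def odds_from_probability(p):
    return p / (1 - p)

def probability_from_odds(odds):
    return odds / (1 + odds)

# Hypothetical logistic regression coefficients.
beta0, beta1 = -1.0, 0.5

# When X = 1 the log odds are beta0 + beta1, so the odds are exp(beta0 + beta1).
odds_x1 = math.exp(beta0 + beta1)
p_x1 = probability_from_odds(odds_x1)

print("odds when X = 1:", round(odds_x1, 4))
print("probability when X = 1:", round(p_x1, 4))
print("round trip:", odds_from_probability(p_x1))
```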

Grand summary
  Exploratory analysis includes graphs and tables: good to get a feel for the data
  Confirmatory analysis is useful for making definitive conclusions
  Linear models provide us with a framework in which to perform confirmatory analysis in many settings

Grand summary: linear models
  Linear regression: for continuous (normal) outcomes
  Logistic regression: for binary outcomes

Grand summary: modelling
In all generalized linear models, we can use the following tools to make models more flexible:
  Adjust for confounders using additive covariates
  Allow for effect modification using interaction terms
  Curved and bent lines through polynomials and splines

Grand summary: testing
  We can test significance of a single predictor using z-test (or t-test for linear regression)
  Test significance of several covariates using a pair of nested models by a likelihood ratio test
  Know how to interpret p-values and confidence intervals!

References - textbooks
For more information
  Friendly Intro Epidemiology textbook
    Epidemiology by Leon Gordis
  Intro to Biostatistical Modeling Textbook
    Regression Modeling Strategies by Frank E. Harrell, Jr.
  Slightly more theoretical intro to statistical modeling textbook
    Mathematical Statistics and Data Analysis by John A. Rice

References - online JHSPH open courseware
For further directed self-study
  JHSPH Biostatistics Open Courseware
    http://ocw.jhsph.edu/Topics.cfm?topic_id=33
    Essentials of Probability and Statistical Inference IV
    Methods in Biostatistics I & II
    Statistical Reasoning I & II
    Statistics for Laboratory Scientists I & II
    Statistics for Psychosocial Research: Structural Models & Measurement
