Lecture 17: Review Lecture Summaries
Sandy Eckel
seckel@jhsph.edu
20 May 2008

1) Descriptive Statistics
  Exploratory Data Analysis
  Often not in the literature
  "Table 1" in a paper
  Goal: visualize relationships, generate hypotheses

2) Inferential Statistics
  Confirmatory Data Analysis (Methods section of paper)
  Hypothesis tests
  Confidence intervals
  Regression modeling
  Goal: quantify relationships, test hypotheses

A general approach for most statistical modeling is to:
  Define the Population of Interest
  State the Scientific Questions & Underlying Theories
  Describe and Explore the Observed Data
  Define the Model
    Probability part (models the randomness / noise)
    Systematic part (models the expectation / signal)
  Estimate the Parameters in the Model
    Fit the Model to the Observed Data
    Make Inferences about Covariates
  Check the Validity of the Model
    Verify the Model Assumptions
    Re-define, Re-fit, and Re-check the Model if necessary
  Interpret the results of the Analysis in terms of the Scientific Questions of Interest

Descriptive Statistics
ALWAYS look at your data
If you can't see it, then don't believe it

Key Descriptive Statistics Ideas
Visualizing data
  Stem and leaf plots
  Histograms
  Boxplots
  Scatterplots
Describing data
  Distribution shapes (especially skewness)
  Quartiles
  Measures of central tendency
    Median, Mean, Mode
  Measures of spread
    Variance, Standard Deviation, Range, Interquartile Range
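
The measures of central tendency and spread listed above can be computed with Python's standard library. This is a minimal sketch: the `ages` values and the `describe` helper are hypothetical and chosen only for illustration, not data from the lecture.

```python
import statistics

# Hypothetical ages, made up for illustration (not data from the lecture).
ages = [23, 25, 25, 31, 38, 40, 52]

def describe(values):
    """Return common measures of central tendency and spread."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "modes": statistics.multimode(ordered),    # every value tied for most frequent
        "variance": statistics.variance(ordered),  # sample variance (n - 1 denominator)
        "sd": statistics.stdev(ordered),           # square root of the sample variance
        "range": ordered[-1] - ordered[0],
    }

summary = describe(ages)
for name, value in summary.items():
    print(name, value)
```

Here the median is 31, the only mode is 25, and the range is 29. The mean (about 33.4) sits above the median because the largest value pulls it upward, which is the kind of skewness worth checking for in a plot.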

[Figure: stem-and-leaf plot and histogram (x-axis: Age Category; y-axis: Frequency)]
2* | 569
3* | 2568
4* | 49
5* | 1
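
A stem-and-leaf display like the one above can be built by splitting each value into a tens digit (stem) and a units digit (leaf). The sketch below assumes the display encodes the ages 25, 26, 29, 32, 35, 36, 38, 44, 49 and 51, which is a reading of the plot rather than something the slides state; `stem_and_leaf` is a hypothetical helper.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by tens digit (stem) and list the units digits (leaves)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    return [f"{stem}* | {''.join(leaves[stem])}" for stem in sorted(leaves)]

# Ages as read off the stem-and-leaf display (an interpretation of the plot).
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
plot_lines = stem_and_leaf(ages)
print("\n".join(plot_lines))
```

The number of leaves on each stem (3, 4, 2, 1) gives the bar heights a histogram with one bar per decade would show.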

Box-and-Whisker Plots (Boxplot)
[Figure: boxplot of Age in Years]
IQR = 44 - 29 = 15
Upper Fence = 44 + 15*1.5 = 66.5
Lower Fence = 29 - 15*1.5 = 6.5

2 Continuous Variables
[Figure: scatterplot of Height in Centimeters versus Age in Years]
Scatterplots visually display the relationship between two continuous variables
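
The fence arithmetic for the boxplot can be reproduced in a few lines. This sketch makes two assumptions: the ages are those read off the stem-and-leaf display, and each quartile is taken as the median of the lower or upper half of the sorted data. Other software uses other quartile rules and may give slightly different fences. `tukey_fences` is a hypothetical helper name.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (Q1, Q3, IQR, lower fence, upper fence).

    Quartiles are the medians of the lower and upper halves of the sorted
    data; for an odd sample size the middle value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr

# Ages as read off the stem-and-leaf display (an interpretation of the plot).
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
q1, q3, iqr, lower_fence, upper_fence = tukey_fences(ages)
print(q1, q3, iqr, lower_fence, upper_fence)  # 29 44 15 6.5 66.5
```

Under this rule the quartiles are 29 and 44, matching the IQR of 15 and the fences of 6.5 and 66.5 shown for the boxplot. No age falls outside the fences, so no point would be flagged as an outlier.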

Skewness

Concepts from Biostat I used in this class
  Mutually exclusive
  Statistically independent
  Addition rule
  Conditional probability
  Common distributions
    Continuous: Normal, t-distribution, Chi-square, F-distribution
    Discrete: Binomial

Key Ideas of The Normal Distribution and Statistical Inference
Normal distribution
  Parameters: mean, variance
  Standard normal
  68-95-99.7 Rule
  Areas under the curve and their relation to p-values
Statistical Inference
  Population: parameters
  Sample: statistics
  We use sample statistics along with theoretical results to make inferences about population parameters
  Sampling distribution of sample mean
  Central Limit Theorem

68-95-99.7 Rule
  68% of the area under the curve is within 1 standard deviation of the mean
  95% of the area under the curve is within 2 standard deviations of the mean
  99.7% of the area under the curve is within 3 standard deviations of the mean
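
The 68-95-99.7 rule can be checked numerically from the standard normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` from the Python standard library; `area_within` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Area under the standard normal curve within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} SD: {area:.4f}")
```

The areas are roughly 0.6827, 0.9545 and 0.9973, so the rule is a rounded version of the exact values. The area left in the two tails (1 minus each value) is what a two-sided p-value measures.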

Sampling Distribution of the Sample Mean
  Usually $\mu$ is unknown and we would like to estimate it
  We use $\bar{X}$ to estimate $\mu$
  We know the sampling distribution of $\bar{X}$
  Definition: Sampling distribution
    The distribution of all possible values of some statistic, computed from samples of the same size randomly drawn from the same population, is called the sampling distribution of that statistic

The Central Limit Theorem
  Given a population of any distribution with mean, $\mu$, and variance, $\sigma^2$, the sampling distribution of $\bar{X}$, computed from samples of size $n$ from this population, will be approximately normally distributed with mean, $\mu$, and variance, $\sigma^2/n$, when the sample size is large.
  In general, this applies when $n \geq 25$.
  The approximation to normality becomes better as $n$ increases
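
A small simulation can make the Central Limit Theorem concrete. The sketch below assumes a skewed population (exponential with mean 1 and variance 1, an arbitrary choice for illustration) and draws many samples of size 25; `simulate_sample_means` is a hypothetical helper.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size `n` from an Exponential(1) population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

n = 25
sample_means = simulate_sample_means(n=n, reps=10_000)

# The population has mean 1 and variance 1, so theory predicts the sample
# means have mean close to 1 and variance close to 1 / n = 0.04.
print("mean of sample means:    ", statistics.fmean(sample_means))
print("variance of sample means:", statistics.variance(sample_means))
```

Even though the population is strongly skewed, the sample means cluster around 1 with variance near 0.04, and a histogram of `sample_means` looks much closer to a bell curve than the population does.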

Inferential Statistics

Confidence Intervals
Point estimation
  An estimate of a population parameter
Interval estimation
  A point estimate plus an interval that expresses the uncertainty or variability associated with the estimate
$100(1-\alpha)\%$ Confidence interval:
  estimate $\pm$ (critical value of z or t) $\times$ (standard error of estimate)
Critical value is the cutoff such that the area under the curve in the tails beyond the critical value (both positive and negative direction) is $\alpha$
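
The interval formula above translates directly into code. This sketch uses a z critical value from `statistics.NormalDist`; the standard library has no t quantile function, so a t-based interval (appropriate for small samples with an estimated standard deviation) would need a table or another package. The estimate and standard error below are hypothetical numbers.

```python
from statistics import NormalDist

def z_confidence_interval(estimate, standard_error, alpha=0.05):
    """Return the 100(1 - alpha)% interval: estimate +/- z * standard error."""
    # Cutoff leaving alpha / 2 in each tail of the standard normal curve.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    margin = critical * standard_error
    return estimate - margin, estimate + margin

# Hypothetical point estimate and standard error, for illustration only.
lower, upper = z_confidence_interval(estimate=120.0, standard_error=2.0)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

With a critical value of about 1.96 the interval is roughly (116.08, 123.92). Lowering `alpha` to 0.01 raises the critical value to about 2.58 and widens the interval.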

Interpretation of confidence interval?
Use a CI for $\mu$ as an example:
  Before the data are observed, the probability is at least $(1-\alpha)$ that [L,U] will contain $\mu$, the population parameter
  In repeated sampling from a normally distributed population, $100(1-\alpha)\%$ of all intervals of the form above will include the population mean
  After the data are observed, the constructed interval [L,U] either contains the true mean or it does not (no probability involved anymore)

Steps of Hypothesis Testing
  Define the null hypothesis, $H_0$.
  Define the alternative hypothesis, $H_a$, where $H_a$ is usually of the form "not $H_0$".
  Define the type 1 error rate, $\alpha$, usually 0.05.
  Calculate the test statistic
  Calculate the P-value
  If the P-value is less than $\alpha$, reject $H_0$. Otherwise fail to reject $H_0$.
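
The testing steps can be followed in order in a short function. The sketch below assumes a two-sided z-test for a single mean with known population standard deviation, which is only one of the possible test statistics; all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0 versus Ha: mu != mu0 (sigma known)."""
    # Test statistic: distance from the null value in standard-error units.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # P-value: area in both tails beyond |z| under the standard normal curve.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Decision: reject H0 when the P-value is less than alpha.
    return z, p_value, p_value < alpha

# Hypothetical summary data, for illustration only.
z, p_value, reject = one_sample_z_test(sample_mean=104, mu0=100, sigma=16, n=64)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```

Here the standard error is 16 / 8 = 2, so z = 2.0 and the two-sided P-value is about 0.0455, just under 0.05, and H0 is rejected at that level.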

MANY types of hypothesis testing
We discussed
  A single mean
    $H_0: \mu = 3000$ vs. $H_a: \mu \neq 3000$
  A single proportion
    $H_0: p = 0.35$ vs. $H_a: p \neq 0.35$
  Difference of means
    $H_0: \mu_1 - \mu_2 = 0$ vs. $H_a: \mu_1 - \mu_2 \neq 0$
    Need to decide whether to assume equality of variance
  Difference of proportions
    $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 \neq 0$
  Others for regression modelling

Which test statistic do I use for each kind of test?
Usually, the form of the test statistic depends on
  Population distribution
  Sample size
  Population variance
    Whether known or estimated
    Or assumptions about equality
To find out which test statistic to use, check the summary sheets
  Not something you really have to memorize
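
As one instance of the tests listed above, here is a sketch of a large-sample z-test for a single proportion with the null value p = 0.35. The counts (45 successes out of 100) are hypothetical, and the standard error is computed under the null hypothesis, which is one common convention rather than necessarily the one on the course summary sheets.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alpha=0.05):
    """Two-sided large-sample z-test of H0: p = p0 versus Ha: p != p0."""
    p_hat = successes / n
    se_null = math.sqrt(p0 * (1 - p0) / n)  # standard error assuming H0 is true
    z = (p_hat - p0) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value, p_value < alpha

# Hypothetical data: 45 successes in 100 trials.
p_hat, z, p_value, reject = one_proportion_z_test(successes=45, n=100, p0=0.35)
print(f"p_hat = {p_hat:.2f}, z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject}")
```

The same skeleton (estimate minus null value, divided by a standard error) carries over to differences of means or proportions; only the standard error formula changes.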

Example
Summary: Hypothesis test for a single mean
Relation between CI and hypothesis testing
Correlation
Association and Causation
Why use linear regression?
Effect Modification
Confounding: the epidemiologic definition

Spline Terms
  Splines are used to allow the regression line to bend
    the breakpoint is arbitrary and decided graphically or by hypothesis
    the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e., the change in slope)
[Figure: Broken Arrow Model; y-axis: Expenditures]

Summary: Flexibility in linear models
  A spline allows the slope for a continuous predictor to change at a given point; the coefficient is for the difference in log odds ratio
  An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is for
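
A linear spline ("broken arrow") term can be written as the amount by which the predictor exceeds the breakpoint, or zero below it. The sketch below shows how the slopes below and above the breakpoint follow from the coefficients. The function names, the coefficients and the breakpoint of 40 are all hypothetical, not values from the lecture's fitted model.

```python
def linear_spline_prediction(x, b0, b1, b2, knot):
    """Mean response under a broken-arrow model: b0 + b1*x + b2*(x - knot)+."""
    spline_term = max(x - knot, 0.0)  # zero below the breakpoint
    return b0 + b1 * x + b2 * spline_term

def slopes(b1, b2):
    """Slope below the breakpoint is b1; slope above it is b1 + b2."""
    return b1, b1 + b2

# Hypothetical coefficients and breakpoint, for illustration only.
b0, b1, b2, knot = 1000, 20, -15, 40
at_knot = linear_spline_prediction(40, b0, b1, b2, knot)
above_knot = linear_spline_prediction(50, b0, b1, b2, knot)
slope_below, slope_above = slopes(b1, b2)
print(at_knot, above_knot, slope_below, slope_above)
```

With these numbers the line rises 20 units per unit of x below the breakpoint and 5 units per unit above it. The spline coefficient (-15) is the change in slope, which is why the two slopes themselves are usually the quantities to report.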

Using $R^2$ as a model selection criterion
  The coefficient of determination, $R^2$, evaluates the entire model
  $R^2$ shows the proportion of the total variation in Y that has been explained by this model
    Model 1: 0.0076; 0.8% of variation explained
    Model 2: 0.05; 5% of variation explained
    Model 3: 0.20; 20% of variation explained
  You want a model with large $R^2$

Logistic Regression
Basic Idea:
  Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial distribution
  Linear regression is the type of regression we use for a continuous, normally distributed response variable (Y)
  Model the log odds of the probability, which we also call the logit
    Baseline term interpreted as log odds
    Other coefficients are log odds ratios
  Transform log odds / log odds ratio to odds / odds ratio scale by exponentiating the coefficient
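
For the coefficient of determination described above, the proportion of variation explained is one minus the ratio of residual to total sum of squares. A minimal sketch with hypothetical observed and predicted values; `r_squared` is an illustrative helper, not a function from the course software.

```python
def r_squared(observed, predicted):
    """Proportion of the total variation in Y explained by the predictions."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_resid / ss_total

# Hypothetical observed responses and model predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.1, 3.8, 5.0]
r2 = r_squared(observed, predicted)
print(f"R^2 = {r2:.2f}")
```

Here the total sum of squares is 10 and the residual sum of squares is 0.10, so about 99% of the variation is explained. Predicting the mean of Y for every observation would give an R-squared of 0.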

Logit Function
  $\log(\text{odds} \mid X=1) - \log(\text{odds} \mid X=0) = \log(\text{OR for } X=1 \text{ vs. } X=0)$

A basic logistic regression model and interpretation
  so $(e^{\beta_1})^{10} = e^{10\beta_1}$ is the factor by which the odds of not visiting a physician are multiplied for a ten-year increase in age
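
The logit and the exponentiation step can be sketched as follows. The age coefficient of 0.02 per year is hypothetical and chosen only to show the arithmetic; `logit` and `odds_ratio` are illustrative helper names.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def odds_ratio(beta, delta=1.0):
    """Odds ratio for a `delta`-unit increase in a predictor whose
    coefficient (log odds ratio per unit) is `beta`."""
    return math.exp(beta * delta)

beta_age = 0.02  # hypothetical log odds ratio per year of age
or_one_year = odds_ratio(beta_age)
or_ten_years = odds_ratio(beta_age, delta=10)
print(f"OR per year: {or_one_year:.3f}, OR per ten years: {or_ten_years:.3f}")
```

The ten-year odds ratio equals the one-year odds ratio raised to the tenth power, because differences add on the log odds scale and therefore multiply on the odds scale.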

More useful math
  how to get the probability from the odds
Comparing nested models with logistic regression
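
Getting the probability back from the odds, or from the log odds that a logistic model produces, takes one line each. A minimal sketch; the function names are illustrative.

```python
import math

def probability_from_odds(odds):
    """Invert odds = p / (1 - p) to get p = odds / (1 + odds)."""
    return odds / (1 + odds)

def probability_from_log_odds(log_odds):
    """Invert the logit: p = 1 / (1 + exp(-log_odds))."""
    return 1 / (1 + math.exp(-log_odds))

print(probability_from_odds(3))        # odds of 3 to 1
print(probability_from_log_odds(0.0))  # log odds of 0 means even odds
```

Odds of 3 correspond to a probability of 0.75, and log odds of 0 correspond to a probability of 0.5.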
Grand summary: modelling
Grand summary: testing