
STA371H Exam 1

1. Explanations & Evidence
- Selection bias
  - Ex: study about smoking weed vs. intelligence
  - One possible reason you may see confounding in a non-experimental study
- Confounding; exogenous/endogenous variables
  - If some third factor z is associated with both your x and y, it may affect your results
  - Exogenous: from outside the system
    - Explains a causal beginning: X -> Y, where X is the exogenous variable
    - If no arrows point at a variable, it is exogenous
    - Real experiments create exogenous variation
  - Endogenous: from inside the system
    - X -> Y: Y is the endogenous variable
    - If a variable has an arrow pointing at it, it is endogenous
  - Add a variable Z: if it forms a triangle and points at both X and Y, then X and Y are both endogenous and Z is exogenous

- Natural experiments vs. real experiments
  - Paradigm for thinking about good evidence: randomize & intervene
    - Placebo-controlled
    - Double-blind trial
  - Ex: Israeli schools

----------------------------------------------------------------------------------------------------------------------------
2. Exploring Multivariate Data
- Plots and summaries
  - Histograms / standard deviation (1 variable)
    - Think of standard deviation as average error
    - For a histogram: the avg error I make when I guess the sample mean
    - For a regression: the avg error I make when I guess from the regression line
  - Boxplots / dot plots (2 variables)
  - Scatterplots (2 variables)
  - Lattice plots (3 variables)
- Simple group-wise models
  - Group means
  - Ordinary least squares regression models
  - Fitted values / residuals / etc.
    - Basic decomposition: breaks the observed value into fitted and residual parts
    - Observed = fitted + residual
    - The fitted part is what the model can predict; it is systematic
    - The residual is uncertainty; it can't be predicted and is random
- Things you can do with a regression model

  - Plug-in prediction ("bronze level" prediction)
    - You have past data relating past x to past y. You see a new x*; what is y*?
    - Just plug in: y* = beta-naught + beta-one(x*)
    - What's wrong with it? It doesn't take uncertainty into account. No error bar.
  - Interpreting beta-naught and beta-one in a least squares regression line
    - What does the intercept mean? (Avg value of the y variable when x = 0)
    - What does the slope mean? (Incremental change in y per unit change in x)
  - Taking the x-ness out of y
    - First idea of statistical adjustment ("silver level" statistical adjustment)
    - Ex: adjusting price for food score
      1. Which is lowest compared to what you would expect?
      2. Look at the residuals and find the most negative one
    - This takes out the systematic trend: subtract the fitted value (what is predictable by x) from the observed value, leaving the residual, which means you have adjusted for that factor
      - Observed - fitted = residual
  - Reducing (quantifying) uncertainty in a regression model
    - A regression is a machine for reducing your uncertainty when forecasting a y-variable
    - Look at the ratio between how spread out the original data was and how spread out the residuals are
- Transformations for nonlinear models
  - Logs & power laws
    - Ex: body weight vs. brain weight of mammals exercise
    - Taking the logarithm un-squishes the data (log-log scale)
    - Corresponds to fitting a geometric (power-law) relationship between x and y instead of a linear one
  - Polynomials
    - Adding x^2, x^3, etc.
    - Ex: car sales vs. months on job
- Trade-off between simplicity and fit
  - Want to avoid over-fitting
  - Create models that generalize well to future cases and are no more complex than they need to be
  - Ex: exercise 3, which has us choose which model is best
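A minimal numpy sketch of the "bronze level" plug-in prediction and "silver level" residual adjustment described above. The data here are made up for illustration (think of x as a food score and y as a price):

```python
import numpy as np

# Hypothetical data: food score (x) vs. price (y) for six restaurants.
x = np.array([3.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([12.0, 18.0, 20.0, 25.0, 27.0, 33.0])

# Least-squares fit: y ~ beta0 + beta1 * x  (polyfit returns slope first).
beta1, beta0 = np.polyfit(x, y, 1)

# Bronze level: plug-in prediction for a new x* -- no error bar.
x_star = 7.5
y_star = beta0 + beta1 * x_star

# Silver level: observed = fitted + residual; the most negative residual
# is the best "value" (cheapest relative to what its score predicts).
fitted = beta0 + beta1 * x
residuals = y - fitted
best_value_index = np.argmin(residuals)
```

Note that the residuals of a least-squares fit with an intercept always sum to zero, which is what makes "observed = fitted + residual" a clean decomposition.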

----------------------------------------------------------------------------------------------------------------------------
3. Predictable & Unpredictable Variation
- Coverage intervals
  - Read Kaplan book, chapter 5
  - Any collection of numbers has a coverage interval
  - An interval that contains a set % of the data points
  - Confidence interval: a specific kind of coverage interval, for a particular set of numbers
- Naive prediction intervals ("silver level" prediction)
  - The naive interval builds on the plug-in prediction
  - Adds a standard deviation term to account for uncertainty about the future random part
    - Doesn't take into account uncertainty about the systematic part
    - This is why it's naive
  - y* = beta-naught + beta-one(x*) +/- t(standard deviation)
  - Can be thought of as a coverage interval for your past points

  - To quantify naive prediction intervals:
    - Empirical coverage: plot the points and count the number of data points that fall inside the interval
    - Simple rules of thumb (less tedious than counting):
      - If you want a 65-70% level, use t = 1
      - If you want a 95% level, use t = 2
- R^2 and the decomposition of variance
  - It's more difficult to quantify correlation with more than one variable
  - With more variables, R^2 is a useful generalization of correlation
  - The higher the R^2, the more informative the predictor
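The empirical-coverage check above can be sketched in a few lines of numpy. The data are simulated (a line plus normal noise), and the rules of thumb (t = 1 gives roughly 65-70% coverage, t = 2 roughly 95%) can be verified by counting:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 2.0, 200)   # simulated line plus noise

beta1, beta0 = np.polyfit(x, y, 1)
resid = y - (beta0 + beta1 * x)
s = resid.std(ddof=2)                          # residual standard deviation

# Empirical coverage: fraction of points within t residual-sds of the line.
cov1 = np.mean(np.abs(resid) <= 1 * s)         # rule of thumb: ~65-70%
cov2 = np.mean(np.abs(resid) <= 2 * s)         # rule of thumb: ~95%
```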

----------------------------------------------------------------------------------------------------------------------------
4. Quantifying Uncertainty (Parameters/Predictions)
- Sampling distribution
  - Impossible to understand anything in statistics without understanding the sampling distribution
- Standard error: the standard deviation of the sampling distribution
  - Describes the spread (average error) of a sampling distribution
  - Useful for telling how sure I am that my slope is close to the true slope
- Confidence intervals: a range of plausible values in light of the data
  - How confident are you that your slope is representative of the population slope?
  - Another, mathematical definition: the Frequentist Coverage Property***
    - Aka the assembly-line property, or truth in advertising
    - The stated level of confidence is actually right in a broader frequency sense
    - If you quote your level as 95%, you actually capture the truth in 95% of cases
    - It is not a claim about a particular interval
    - This is what makes YOUR confidence interval correct: being able to take repeated samples from the population
- *All three are useful theoretical ideals; in reality you can't see these objects, only estimates. How to estimate, then?
- Bootstrapping: faking samples by sampling from the sample
  - With replacement, to get ties and omissions that simulate the kind of variability we'd get if we took repeated samples from the population
  - Requires an assumption: your sample is broadly representative of the population in the ways that matter
    - If this is false, the estimates you get will not be trustworthy or correspond to the real sampling distribution in the thought experiment we want to make
  - Estimates parameter uncertainty (contrast with cross-validation, which estimates generalization error)
- Normal linear regression model
  - Assumption: yi = beta-naught + beta-one(xi) + ei
  - The difference is in the assumptions about the residuals:
    1. Normal distribution
    2. Constant variance (standard deviations are the same for all i)
    3. Independence, and mean 0 (not biased upwards or downwards; no one residual gives any info about the other residuals)
  - *Be able to articulate what these assumptions are and why they matter
  - Explanation: why? By making these assumptions and asking where the randomness comes in, we find that observations differ because they have different residuals. Each residual is a draw from a normal distribution.

    - Then we can use mathematics to derive formulas for what the standard errors and sampling distributions of the slopes ought to be
- Cross-validation
  - Purpose: estimate the prediction or generalization error of a statistical model
    - What is our future forecasting error likely to be?
    - Trying to decide which model order to use (ex: utilities exercise)
  - Mechanics:
    - Train/test splits: call some past data the notional prediction/test set
    - Fit the model on the "old" data
    - Predict on the "new" data
    - See how well you did: an estimate of the generalization error
    - Repeat and average over ~100 resamples
  - Estimates generalization error (contrast with bootstrapping, which estimates parameter uncertainty)
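The bootstrap idea outlined above (resample the sample with replacement, refit, and look at how the estimate varies) can be sketched with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, n)   # true slope is 0.5

# Resample rows with replacement, refit, and collect the slopes.
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, n)              # ties and omissions are the point
    b1, b0 = np.polyfit(x[idx], y[idx], 1)
    boot_slopes.append(b1)

se = np.std(boot_slopes)                     # bootstrapped standard error
ci = np.percentile(boot_slopes, [2.5, 97.5]) # a 95% confidence interval
```

The spread of `boot_slopes` approximates the sampling distribution of the slope estimator, which is exactly the object the notes say we cannot observe directly.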

----------------------------------------------------------------------------------------------------------------------------
5. Grouping Variables
- Aggregation paradoxes
  - Recognize that a grouping variable might be a confounder
- Dummy variables: expressing coefficients in baseline/offset form
  - This is the solution to aggregation paradoxes
  - Puts grouping variables into the model
  - An indicator (0 or 1: do you have the condition or not?)
  - Used to change the intercept of a regression line
- Interaction terms
  - Used to change the slope of a regression line
----------------------------------------------------------------------------------------------------------------------------
6. Multiple Regression
- Partial slopes
- Statistical adjustment ("gold level" statistical adjustment)
  - Measures the rate of change in the y variable as x1 changes, holding x2 constant
  - Estimates the relationship between y and x1 when x2 is held constant
- Collinearity: what happens when you have predictors that are themselves related
  - A logical problem in the data set that makes it difficult to isolate the effect of each variable
  - Ex: weight transfer and topspin in tennis both improve serves; which improvement actually mattered?
  - Need to isolate effects in such situations (ex: vote undercount, exercise 6)
----------------------------------------------------------------------------------------------------------------------------
7. Hypothesis Testing
- Neyman-Pearson testing: 6 steps
- Permutation test: shuffling the deck
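The dummy-variable idea from section 5 above (an indicator that shifts the intercept, with coefficients in baseline/offset form) can be checked numerically. The data are made up; the point is that regressing on a 0/1 indicator exactly reproduces the two group means:

```python
import numpy as np

# Hypothetical outcome y with a 0/1 condition indicator.
dummy = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y = np.array([10.0, 12.0, 11.0, 20.0, 22.0, 21.0])

# Least-squares fit on the dummy gives baseline/offset form:
#   beta0           = mean of the baseline group (dummy = 0)
#   beta0 + beta1   = mean of the "on" group; beta1 is the offset
beta1, beta0 = np.polyfit(dummy, y, 1)
```

This is the claim in the notes that fitting a dummy by least squares is mathematically equivalent to computing the group-wise means separately.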

Terms

1. Explanations and evidence: evidence and causality

Key Concepts
- Randomize and intervene
  - Randomize: randomly split the sample into two groups, denoted the treatment and control groups
    - Ensures that other factors, even unknown factors, do not lead us astray in our causal reasoning
  - Intervene: have everyone in the treatment group take the real treatment and everyone in the control group take the placebo
    - Allows differentiation between the two tested variables
  - The surest way to establish causality
- Confounding: some systematic effect correlated with both the response and the predictors
  - Usually the z variable; a lurking/omitted variable that causes bias
- Selection bias: when the sample selected is not random and produces results that may not be trustworthy
  - Ex: if only healthy people are given a diabetes drug, the results cannot be trusted because we do not know whether the drug or the participants' lifestyles caused their health
- Natural experiments: experiments that are not designed but observed in the real world, where nature seems to have done the randomization and intervention
  - Something you didn't design yourself, but almost as good as if you had
  - Gives exogenous variation for free
  - Example of exogenous variation: the effect of class size on student achievement in Israel
    - Classes are capped at 40, so cohorts of 41 students end up as two classes of about 20 each, while cohorts of exactly 40 get only one class
  - When evaluating natural experiments, ask:
    1. How good is the control group?
    2. How good is the randomization?
- Endogeneity ("dirty variation"): when variables are caught up in some unknown, difficult-to-untangle knot of dependence between predictors and response
  - Ex: state health policies, public attitudes, and diabetes rates
    - Hypothesis 1: state health policies change public attitudes toward health and lower the incidence of diabetes
    - Hypothesis 2: underlying attitudes toward health among the citizens affect both health policies and the diabetes rate
    - Health policies and diabetes rates are endogenous because they have a possible common cause
  - These variables are not subject to experimental manipulation
  - Designed experiments get around this problem

Other Ideas to Know
- Dependence graphs: give an ordering of the variables and show which lead and which follow (e.g. X -> Y -> Z)
  1. Specify all relevant variables
  2. Specify all structural dependencies by drawing arrows pointed in the direction of assumed causation
- Double-blind experiment: neither the administrator of the trial nor the participant knows who is taking which drug or activity
- Placebo effect: some participants are given an inert drug/activity without knowing it, so the experimenters can see the real effect of drug vs. non-drug
- Longitudinal study: follows the same group of subjects over time
- Cross-sectional study: takes a cross-section of subjects at a single point in time and studies them all
  - Ex: looking at all 50 states and seeing how diabetes rates, state health policies, and public attitudes vary from one to the next

2. Exploring multivariate data: describing variation with pictures and simple models

Key Concepts
- Group means: the mean for each category (blue dots/line in the dot plot above)
  - Group means vs. baseline/offset form
  - Simple group-wise models
- Grand mean: the mean for the entire data set (dotted green line in the dot plot above)
- Partitioning the variability:
  - Individual case = group mean + deviation of that case
  - = grand mean + deviation of group + deviation of case
- Basic plots and summaries
  - Histogram
  - Boxplot: allows assessment of variability both between and within the groups
    - Many times within-group variability matters the most
    - Leads into the idea of group-wise means

  - Dot plot (aka strip chart): also depicts between-group and within-group variation; shows the relationship between variables, with each dot representing an individual data point and groups of data points lined up over each x value

  - Lattice plot: good for comparing multivariate data

  - Scatter plot: depicts the relationship between an x (predictor) and y (response) variable in a sample

- The trend line shows the general relationship and can be described by a linear model through the process of linear regression
  - Regression: fitting the parameters to the observed data
  - Parameters of the regression equation: beta-naught (intercept) and beta-one (slope); ei is the residual (what is uncertain; the noise term)
  - The name "regression" comes from expecting that future y-values will regress to the mean specified by the linear predictor (which is why it is a good method of prediction)
  - Method of least squares (criterion): fit the parameters by choosing them so that the sum of squared residuals is as small as possible (minimizing the sum of squared residuals)
    - Offers the attractiveness of a single best answer
    - Says: of all possible straight-line fits, this line has the smallest sum of squared residuals
- What are the various purposes of a linear regression model?
  1. Plug-in prediction: maps inputs into outputs
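The least-squares criterion above has a closed-form solution in the one-predictor case, which is worth knowing: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. A quick numpy check with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares:
#   beta1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   beta0 = ybar - beta1 * xbar
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Should agree with numpy's own least-squares fit.
b1_check, b0_check = np.polyfit(x, y, 1)
```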

  - Fitted values: the y-values resulting from plugging the x-values into the least-squares linear predictor
    - beta-naught + beta-one(xi)
    - The part predictable by x
  - Can also be computed for observations not in the original data set
  - Useful for forecasting the response for a known value of the predictor
  2. Summarizes the trend in the data: shows how y changes as a function of x, using the slope

  - Coefficients: the slope tells the rate of change
  3. Taking the x-ness out of y
  - Residuals = (observed y value) - (fitted value)
    - The difference between observed and fitted: the error of the prediction
    - How far the actual data point is from the regression line in a graph
    - The part unpredictable by x
  - Adjusting for various factors to see a direct relationship between one particular predictor and the response variable
  - Ex: adjusting price for food score (taking out location, customer service, etc.; exercise 2)
    - Finding the best value (low price and high rating)
    - Not simply the least expensive restaurant; also adjust for food quality
    - Simply subtract the fitted value from the observed value of y, leaving the residual, which captures the response after the predictor has been taken into account


4. Reduces uncertainty
- Nonlinear models: when a least-squares line cannot capture the underlying relationship between x and y
  - Logarithms: use if the model generating the data is a function of log(x)
    - Taking the log of both sides of the exponential equation can help fit a straight line to nonlinear data
    - Logs are the inverse functions of exponentials
    - Taking the log stretches out the data points
    - Performing a log transformation:
      - Logging both sides results in a linear function on the right side of the equation, which can be plotted on a graph
      - Then use the least-squares criterion (minimizing the sum of squared residuals) to choose the parameter values
      - Must then undo the transformation, because you don't want to make a statement about log(yi), just yi
    - Transformation: a general technique for making linear least squares work for nonlinear relationships in the y variable
      - Transforming can boost the R^2 (proportion of variance explained) without adding extra parameters, meaning it creates a better prediction method
    - Interpreting coefficients under the transformed (log-log) model:
      - Slope: the ratio of percentage change in y to percentage change in x
      - Ex: animal brain vs. body weight (exercise 3): a 100% change in body weight (x) is associated with a 74% expected change in brain weight (y)

- Power law: use if the true model generating the data is a function of x to some power c
  - A power-law decay curve never goes negative; it approaches 0 asymptotically (a log curve can keep decreasing forever)
  - Two fixes to make linear regression work for power laws:
    - Add powers of x (keep adding polynomial terms x^2, x^3, etc. to the regression)
    - Take the log of both x and y
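The log-log fix above can be demonstrated with simulated power-law data: logging both sides of y = A * x^c gives log y = log A + c * log x, a straight line whose slope recovers the power. All numbers below are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated power law y = 2 * x^0.75 with multiplicative noise.
x = rng.uniform(1, 100, 200)
y = 2.0 * x ** 0.75 * np.exp(rng.normal(0, 0.1, 200))

# Fit a straight line on the log-log scale:
#   log y = log A + c * log x
c, logA = np.polyfit(np.log(x), np.log(y), 1)
A = np.exp(logA)   # undo the transformation to recover the leading constant
```

The fitted slope c is the elasticity interpretation from the notes: a 100% change in x is associated with roughly a (100 * c)% expected change in y.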


- Exponential growth: use if the model generating the data is an exponential function of x
  - Growth accelerates as x increases, increasing y by an exponential amount
  - Distinguishing feature: e^(-x) = 1 when x = 0 (power-law decay curves x^(-c) instead blow up to infinity in the limit as x approaches 0)
  - Much faster than power-law growth
  - Just take the log of the y-variable (allows you to infer the rate of growth or decay)
  - Must undo the transformation, because taking the log of y changes not only the y~x relationship but also the relationship of y with all the other predictors in the model

Other Ideas to Know
- Continuous and grouping variables
- Contingency tables: used to summarize categorical variables
  - Sometimes uses cross-tabulation (calculating the sums of all the rows and columns)

  - Used for data sets with a few classifying variables
  - Types of categories:
    - Ordinal variables: have a natural ordering (e.g. measuring the severity of a hurricane)
    - Success/failure: 2 options (e.g. heads/tails, survived/died)
      - Use 1s and 0s in R
- Variation between and within groups: use boxplots
- Parameters of a statistical model: beta-naught and beta-one
- Interpretation of intercept and slope:
  - Intercept (beta-naught): what is expected from the response y when the predictor x = 0
    - Ex: with a line y = 2 + 3x relating food quality score (x) to volume of customers (y), a food quality score of 0 predicts 2 customers
    - Sometimes the intercept may be hard to interpret, e.g. when it has a negative value
  - Slope (beta-one): denotes the rate of change (rise/run)

3. Predictable and unpredictable variation: partitioning sums of squares

Key Concepts
- Key Q: After adjusting for some factors, how much variation remains to be explained by other factors?
- Sample correlation (coefficient): shows the strength of the linear dependence between two observed quantities
  - Always between -1 and 1; 0 means uncorrelated
  - Closely related to linear least squares
- Naive prediction intervals:
  - Add a standard deviation term to account for uncertainty about the future random part
  - Count the data points within intervals of 1, 2, ... standard deviations from the least-squares line; the fraction of points inside gives the empirical coverage of the interval
    - Doesn't take into account uncertainty about the parameters, so it understates the total amount of uncertainty in our interval estimate (this is why it's naive)
  - y* = beta-naught + beta-one(x*) +/- t(standard deviation)
    - The standard deviation here is the residual standard deviation
    - Raw standard deviation: how much a data point deviates from the sample mean
    - Residual standard deviation: how much a data point deviates from the least-squares line
  - Can be thought of as a coverage interval for your past points
  - To quantify naive prediction intervals:
    - Empirical coverage: plot the points & count the number of data points that fall inside the interval
    - Simple rules of thumb (less tedious than counting): for a 65-70% level, use t = 1; for a 95% level, use t = 2
- R^2 (coefficient of determination): measures the percent of variance that can be predicted by the model
  - Common mistakes when interpreting R^2:
    1. Confusing R^2 with the slope of the regression line
    2. Ignoring residuals
      - When fitting a regression, always plot residuals versus x; don't just look at R^2
      - Also look out for outliers that can skew the fit
    3. Confusing statistical explanations with real explanations
      - A high R^2 means that x is relatively successful at predicting y, not necessarily that x causes y

Other Ideas to Know
- Point versus interval estimates
  - Point estimate: a single best guess (fitted values)
  - Interval estimate: a range of likely values
- Partitioning the sum of squares: partitioning total variation into predictable and unpredictable components
  - ONLY sums of squares can be partitioned this way, which is why they are a good measure of variation
  - 3 values to keep track of:
    - Observed values
    - Grand mean
    - Fitted values (the group means corresponding to each observation)
  - 3 important quantities:
    - Total variation (TV): sum of squared deviations from the grand mean
      - Measures the variability in the original data
      - Tells us how much variation we started with
    - Predictable variation (PV): sum of squared differences between the fitted values and the grand mean
      - Measures the variability described by the model
    - Unpredictable variation (UV): sum of squared residuals from the group-wise model
      - Measures the variation left over in the observed values after accounting for group membership
      - Tells us how much variation is left over
  - **TV = PV + UV (because they are all sums of squares, the identity is analogous to the Pythagorean theorem)
  - The model partitions the original sum of squares into predictable and unpredictable parts
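The TV = PV + UV identity above can be verified directly on a toy group-wise model (two groups, each fitted by its group mean; the numbers are made up):

```python
import numpy as np

# Two groups; the group-wise model's fitted value is each group's mean.
y = np.array([4.0, 5.0, 6.0, 10.0, 11.0, 12.0])
group = np.array([0, 0, 0, 1, 1, 1])

grand = y.mean()
fitted = np.array([y[group == g].mean() for g in group])  # group means

TV = np.sum((y - grand) ** 2)        # total variation
PV = np.sum((fitted - grand) ** 2)   # predictable variation
UV = np.sum((y - fitted) ** 2)       # unpredictable variation
# The decomposition TV = PV + UV holds exactly -- but only for SUMS OF
# SQUARES; sums of raw deviations or absolute deviations do not partition.
```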

4. Quantifying uncertainty: confidence in estimates and predictions

Key Concepts
- Key Q: If our data set had been different merely due to chance, would our answer have been different too?
  - We typically equate trustworthiness with stability under the influence of luck (repeated trials)
  - If comparing results from one sample of 500 to another sample taken under the same conditions for the same variables gives completely different results, then the results aren't trustworthy
  - Sources of instability:
    - Individual observations are subject to the forces of randomness
    - The effect of sampling variability (because we can't study the entire population)

- Sampling distribution: how estimates of parameters change from sample to sample
  - Any number or statistic drawn from a sample has a sampling distribution
  - Summarizes how an estimator behaves under repeated sampling from a particular population
  - Summarizing sampling distributions:
    - Mean: if the mean of the sampling distribution equals the true population parameter, the estimator is unbiased
    - Standard error: the standard deviation of a sampling distribution
      - Under repeated samples with the same estimator, the estimate is typically off from the truth by about the standard error
      - The bigger the standard error, the less stable and trustworthy the estimator
- Two ways of quantifying uncertainty
  1. Bootstrapping (resampling): repeatedly taking samples from the sample itself, with replacement
    - Why? The variability of the estimates across all bootstrapped samples can be used to approximate the sampling distribution of the corresponding estimator
    - Assumption: the sample resembles the greater population (look out for outliers)
    - How?
      1. Take a real sample of size n from the population
      2. Take a large number of bootstrapped samples (the original sample of size n acts as a pseudo-population)
      3. Compute the least-squares line for each bootstrapped sample
    - Things to consider:
      - As sample size increases, the bootstrapped sampling distributions get closer to the truth and vary less from sample to sample
      - Bootstrapped standard errors are close to the true standard errors
      - Want bell-shaped curves; if the bootstrapped distribution is skewed or has multiple peaks, question its trustworthiness
      - We are confident in a bootstrapped sampling distribution only when the original sample is representative of the wider population
    - Bootstrapped prediction intervals
      - Account for uncertainty in the parameter estimates (unlike naive prediction intervals)
      - Steps to quantify uncertainty:

        1. Take a single bootstrapped sample and compute least-squares estimates for the parameters (giving your best guess for a future y)
        2. Sample a residual at random from the bootstrapped least-squares fit
        3. Set y(r) = y-hat(r) + e(r)
        4. Take the standard deviation of all the y(r)'s to quantify the uncertainty in the prediction
      - The interval bends outward away from the center of the sample, because prediction uncertainty increases as you move away from the mean of x
      - The sample must be representative of the population or the interval will not be trustworthy
  2. Normal linear regression model (probability theory)
    - Assumes the x and y variables follow a straight-line trend
    - 4 assumptions about the underlying state of nature:
      1. Independence: no one residual conveys information about another residual
        - Residuals as aggregations of nudges: the sum of many small nudges, each equally likely to be up or down
        - A nudge is like a sequence of independent coin flips: heads brings you up from the line, tails brings you down
        - Successive nudges are independent of each other
      2. Residuals come from a normal distribution: mean 0 and variance s^2 (Gauss)
        - The maximum-likelihood solution for beta-naught and beta-one is then the same as the least-squares solution
        - An aggregation of up & down nudges is well described by a normal curve
        - Has thin tails: only about a 5% chance of being more than 2 standard deviations from the mean, and less than a 0.3% chance of being more than 3 standard deviations away
      3. Constant variance (homoskedasticity): the variance s^2 is the same for all observations i
      4. Linearity
      - The first three can be summarized as i.i.d. normal: independent and identically distributed
    - Simulating a simple regression model:
      1. Choose particular values for the parameters (beta-naught, beta-one, variance)
      2. Choose particular values for the predictor xi
      3. Simulate normally distributed residuals ei ~ N(0, s^2)
      4. Set yi = beta-naught + beta-one(xi) + ei (ei is the random deviation from the line - the uncertainty)
      5. Repeat steps 2-4 until we have n different (xi, yi) pairs
      - This produces fake data but lets us deduce information: parameters estimated across many simulated trials recover the true parameters
    - Factors that control the variation of the fitted line about the true line:
      - Standard deviation: better fit as s gets smaller
      - Sample size: better fit as n gets bigger
- Confidence interval: a summary of how precisely the data have allowed you to estimate the underlying population parameter (an interval of plausible values)
  - Intervals should be generated by a method that satisfies the Frequentist Coverage Principle (FCP)
    - If your sample is fairly representative of the population, the procedure satisfies the FCP
    - Repeated trials come up with similar CIs and results
    - The FCP is not a property of an individual data set, but a property of our procedure
  - Express a bootstrapped confidence interval in 2 ways:
    - Quote a symmetric error bar: least-squares estimate +/- k * standard error
      - k = critical value (the number of standard errors you must go out from the center to capture a certain percentage of the distribution)
    - Quote a coverage interval of the bootstrapped sampling distribution (an interval of numbers at some percentage level); this is more specific than the symmetric error bar
  - A confidence interval is a coverage interval for a sampling distribution
- Cross-validation:
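The five-step simulation recipe above maps almost line for line onto numpy code. The parameter values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, sigma = 2.0, 0.5, 1.0     # 1. choose parameter values
n = 500
x = rng.uniform(0, 10, n)               # 2. choose predictor values xi
e = rng.normal(0, sigma, n)             # 3. simulate residuals ei ~ N(0, s^2)
y = beta0 + beta1 * x + e               # 4. set yi = beta0 + beta1*xi + ei
                                        # 5. (vectorized: all n pairs at once)

# The fitted line should fall close to the true line; the fit improves
# as sigma shrinks or n grows, as the notes say.
b1_hat, b0_hat = np.polyfit(x, y, 1)
```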

- Purpose: estimate the prediction error/generalization error of a statistical model
  - Example: modeling gas bill vs. temperature and trying to choose which order of regression model to use
    - Estimate all the candidate models and see which gets closest on data it hasn't seen
  - How to fake "future" data:
    1. Split the data down the middle; call one half the training/old data and the other half data we have not seen (the new/test data)
      - **Treating half of the data as old and half as new
    2. Fit the models using the data we've seen
    3. Forecast on the data we haven't seen
    4. Pull back the curtain and see how well we did: which model got closest to the real data, on average
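The four-step recipe above can be sketched with simulated data: split in half, fit polynomial models of several orders on the "old" half, and score them on the "new" half. Data and candidate orders are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 200)   # truly linear data

# 1. Split the data down the middle: half "old" (train), half "new" (test).
idx = rng.permutation(200)
train, test = idx[:100], idx[100:]

gen_errors = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x[train], y[train], degree)  # 2. fit on data we've seen
    pred = np.polyval(coeffs, x[test])               # 3. forecast on unseen data
    gen_errors[degree] = np.mean((y[test] - pred) ** 2)  # 4. see how well we did
```

Repeating the split-and-average over many resamples, as the outline suggests (~100), gives a more stable estimate of the generalization error.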

Other Ideas to Know
- Parameter and prediction uncertainty: bootstrapping and the NLRM
- T-statistic: a signal-to-noise ratio for coefficient estimates
  - If you want a 65-70% level, use t = 1
  - If you want a 95% level, use t = 2
- Binomial distribution: describes the number of successes in m independent yes-or-no trials, where the probability of success in each trial is p (not necessarily 1/2)

5. Grouping variables

Key Concepts
- Aggregation paradox: when a trend that holds for individuals doesn't hold for groups of individuals
  - Ex: high SAT scores predict high GPAs, but being in a college with high average SAT scores does not predict being in a college with a high average GPA
    - Here the paradox disappears when we realize the college variable is a confounder (systematically associated) for the relationship between SAT score and GPA
  - How to disaggregate the data and fit different regression lines:
    1. Fit many different lines with different intercepts but all with the same slope
      - Appropriate if we think the SAT-GPA relationship ought to hold in each college, but each college has a higher or lower average GPA
    2. Fit many different lines, allowing both slope and intercept to differ
      - Appropriate if we think the SAT-GPA relationship itself differs across colleges
- Grouping variables in regression modeling: use dummy variables and interaction terms
- Dummy variables: the quantity 1{xi = 1}, which takes the value 1 when xi = 1 and 0 otherwise
  - Used to disaggregate data
  - Implies the following:
    - Group mean for the case where x is "off" = beta-naught
      - Beta-naught is the baseline (intercept)
    - Group mean for the case where x is "on" = beta-naught + beta-one
      - Beta-one is the offset
    - Estimate the parameter values using the least-squares criterion (mathematically equivalent to computing the group-wise means separately)
  - Why use baseline/offset form?
    - It gives us the differences between the means directly
  - If predictor x has more than 2 levels, expand it by adding more than 1 dummy variable
    - With 4 levels, the model will have beta-naught through beta-three
    - In R, the level with no dummy variable associated with it is the baseline case (against which the other levels are compared)
  - Fitting models with common slopes but unique intercepts
- Interactions: new predictors formed by multiplying a quantitative predictor and a dummy (0 or 1) variable
  - One variable modulates the effect of another
    - When the dummy variable = 0, the interaction term disappears
    - When the dummy = 1, the interaction equals the original quantitative predictor (the partial slope changes)
    - Ex: baseball salary
  - Used when you expect a categorical variable to change both the intercept and the slope
    - Ex: the GPA of liberal-arts students may vary more sharply with SAT verbal scores, and less sharply with math scores, than for engineering students
  - Has only one residual variance term (compared to the three needed if fitting three different models)
  - In R, the regression line for the "on" group is y = intercept + (slope + interaction slope) * x
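The interaction idea above (a dummy that shifts both intercept and slope) can be sketched by building the design matrix by hand. The true coefficients here are invented so the fit can be checked against them:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, n)
d = rng.integers(0, 2, n).astype(float)   # dummy variable (0/1)

# True model: when d = 1, the intercept shifts by 3 and the slope by 1.5.
y = 1.0 + 2.0 * x + 3.0 * d + 1.5 * d * x + rng.normal(0, 0.5, n)

# Design matrix with columns: intercept, x, dummy, interaction d*x.
X = np.column_stack([np.ones(n), x, d, d * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# Slope for the d=1 group is beta[1] + beta[3], matching the notes'
# "(slope + interaction slope)" form.
```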

6. Multiple regression

Key Concepts
- Partial slopes: the change in y associated with a one-unit change in one x, holding all other variables constant
  - The beta values attached to the x-variables
  - Why "partial"?
    - It measures the effect on y of changing one x-variable alone
    - The effects of the different predictors are completely separable
    - For example, in a linear equation, changing beta-one only affects how yi changes with x1; it does not affect how y changes with x2
  - Interpretation: the rate of change in the y-variable that we can predict as one x-variable changes, holding (adjusting for) the other x-variables constant
  - Uses statistical adjustment to isolate the relationship between one x-variable and the y-variable
- Collinearity: a perfect linear relationship among predictors (explanatory/x-variables)
  - Within a data set, some predictors can be almost totally predicted by other predictors
  - Leaves no unique solution for the regression coefficients
  - Multicollinearity: when multiple predictors have very high correlations with one another

Other Ideas to Know
- Geometry of multiple regression: planes through point clouds
- From simple to multiple regression
  - What stays the same?
    - The principle of least squares
    - Using R^2 to summarize the preciseness of fit
      - The only difference is that y-hat is now a function of more than just an intercept and a single slope
      - It is still the square of the correlation coefficient between y and y-hat
    - The assumption of normally distributed residuals (the linearity assumption extended to more than one predictor) and the rationale for quantifying uncertainty
  - What changes?
    - Beta-hat is an estimated partial slope
    - We no longer have simple formulas for these quantities
    - Intuition: we must estimate more parameters than in the one-variable case, using up additional degrees of freedom in the data
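The collinearity problem above can be demonstrated with two nearly identical simulated predictors: the individual partial slopes become hard to pin down, even though their combined effect is estimated reliably. All numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = x1 + rng.normal(0, 0.01, n)     # x2 is almost a copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The data cannot logically separate x1's effect from x2's effect,
# so the individual slopes beta[1] and beta[2] are unstable -- but
# their SUM (the combined effect) is still estimated well (~2.0).
combined = beta[1] + beta[2]
```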

7. Hypothesis testing
Key Concepts
- Key Q: What is our threshold of believable surprise, beyond which the data will change our minds?
- Key Q 2: Does the real data fall at or beyond the critical value?
  o If yes, we reject the null
  o If no, we fail to reject
- Hypothesis testing: deciding between rejecting and failing to reject a null hypothesis about a data set
- Critical value: the threshold of believability
- Rejection region: the values of the statistic at or beyond the critical value; we reject the null if we observe a value at least as extreme as the critical value
  o Ex: "beyond reasonable doubt" in criminal trials
- When choosing a threshold, we must balance between 2 types of error:
  o False positives: wrongly reject a true null
  o False negatives: wrongly fail to reject a false null
  o When the sampling distribution is wider, Type II errors happen more often than Type I errors
  o Alpha level: the probability that t falls in the rejection region given that H0 is true
    o This is a Frequentist guarantee
    o A lower alpha (significance level) indicates less tolerance for rejecting true nulls (greater conservatism)

                 Reject                                   Fail to reject
  H0 true        False positive (Type I / alpha error)    Good
  H0 false       Great!                                   False negative (Type II / beta error)
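The alpha level can be checked by simulation. This is a minimal sketch with an invented setup (group sizes, critical value, and distributions are assumptions): simulate many data sets under H0 (both groups drawn from the same distribution) and count how often the test statistic t lands in the rejection region.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims = 10_000
crit = 0.5                      # hypothetical critical value for |t|

hits = 0
for _ in range(n_sims):
    # Under H0, both groups come from the same distribution
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    t = a.mean() - b.mean()
    if abs(t) >= crit:          # t falls in the rejection region R
        hits += 1

alpha_hat = hits / n_sims       # estimated Type I error rate of this R
print(round(alpha_hat, 3))
```

Making `crit` larger shrinks the rejection region, lowering alpha (fewer false positives) at the cost of more Type II errors.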

Setting up a Neyman-Pearson test
1. Choose a null hypothesis H0
  o Ex A (wage premium): H0 = there is no wage premium for men compared to women in the wider population
  o Ex B (red states): H0 = the green states are a random sample of all states
2. Choose a summary measure/test statistic t
  o Test statistic: a summary of the data that measures the discrepancy between the expectations of H0 and HA
  o Ex A: t = mean of men - mean of women
  o Ex B: t = # of overlaps between the green states (random) and the red states (fixed)
3. Calculate (or simulate) the sampling distribution of t, assuming H0 is true
4. Choose a rejection region R
  o This can be anything
  o Ex A: R = any difference outside of +/- $1
  o Ex B: R = any overlap outside of +/- 2 states
5. Calculate the size of R
6. Check whether t falls in the rejection region
  o If t falls in R, reject
  o If t does not fall in R, fail to reject
- P-values: the probability of observing a result as extreme as the result actually observed, assuming H0 is true
  o Play no role in Neyman-Pearson testing
  o A p-value of 0.01 is not 10 times stronger evidence than a p-value of 0.10, though it is stronger overall
- F-test and analysis of variance: quantifies how much R2 is expected to increase when we add predictors to a model that are uncorrelated with the response
  o Look at how much R2 increases with each additional predictor variable
  o Useful for comparing a complex model to a simpler one
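The motivation for the F-test can be illustrated numerically. In this sketch (data and names invented), adding a pure-noise predictor to a model still nudges R2 upward, which is why raw R2 cannot choose between nested models and a test of the expected increase is needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

def r_squared(X, y):
    # R^2 = 1 - SSE/SST (intercept included, so residuals have mean zero)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

X_simple = np.column_stack([np.ones(n), x])
noise = rng.normal(size=n)                      # predictor unrelated to y
X_complex = np.column_stack([np.ones(n), x, noise])

r2_simple = r_squared(X_simple, y)
r2_complex = r_squared(X_complex, y)
print(r2_complex >= r2_simple)  # True: R^2 never decreases when predictors are added
```

The F-test formalizes this by asking whether the observed increase in R2 is larger than what junk predictors would be expected to produce by chance.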

Permutation tests: a randomization test in which the permutation distribution is used to calculate the p-value
- Ex: shuffling data like a deck of cards
  o Suppose you have an algorithm that generates randomness. How do you test it?
- Under the null iid model, all shuffles of the data are equally likely when you don't know the group labels
- Permutation testing procedure:
  o 1. Choose a test statistic T (larger values of T suggest greater deviation from the null model)
  o 2. Evaluate T on the data; call the observed value t
  o 3. Randomly shuffle the data
  o 4. Recompute the test statistic T on the shuffled data and call the result t*
  o 5. Repeat steps 3 & 4 a large number of times
  o 6. The p-value is the probability of observing a result as extreme as t by chance alone (the fraction of t* values at least as extreme as t)
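The six steps above can be sketched directly in code. This is a minimal illustration with invented two-group data (the group names, sizes, and effect size are assumptions): shuffle the pooled observations, recompute the statistic each time, and take the p-value as the fraction of shuffles at least as extreme as the observed t.

```python
import numpy as np

rng = np.random.default_rng(4)
group_a = rng.normal(0.8, 1, 40)    # hypothetical labeled groups
group_b = rng.normal(0.0, 1, 40)

# Steps 1-2: test statistic T = |difference in group means|; observed value t
def T(a, b):
    return abs(a.mean() - b.mean())

t_obs = T(group_a, group_b)

# Steps 3-5: shuffle the pooled data many times, recomputing t* each time
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
t_star = []
for _ in range(5_000):
    rng.shuffle(pooled)
    t_star.append(T(pooled[:n_a], pooled[n_a:]))

# Step 6: p-value = fraction of shuffled results at least as extreme as t
p_value = np.mean(np.array(t_star) >= t_obs)
print(p_value)
```

Shuffling destroys any real association between group label and value, which is exactly the "all shuffles are equally likely under the null" idea from the notes.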
