
Q- What is regression analysis?
A- Regression analysis is a statistical process for determining the relationship between one variable, called the dependent variable, and one or more other variables, called independent variables. The dependent variable is modelled as a function of the independent variables.

Q- What is logistic regression?
A- Logistic regression is a statistical process for determining the relationship between a dependent variable whose response is binary and one or more independent variables.

Q- What is covariance?
A- Covariance is a measure of how much two variables change together.
Q- What is correlation?
A- Correlation is a scaled version of covariance: it measures how much two variables change together once both have been put on the same scale. The result is called the coefficient of correlation and is denoted by r. It ranges from -1.0 to +1.0. Correlation has no units, whereas covariance carries the units of X multiplied by the units of Y.
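Q- What is the formula of covariance?
A- Let us denote the dependent variable by Y and the independent variable by X. Then the covariance is
$\mathrm{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$ for a population, or
$\mathrm{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$ for a sample.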
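The covariance and correlation definitions above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the height/weight figures are made up for illustration and are not part of the original notes.

```python
from math import sqrt

def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    Divides by n - 1 for a sample and by n for a population.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

def correlation(xs, ys):
    """Pearson r: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: heights in cm and weights in kg.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 86]

print(covariance(heights, weights))                # sample covariance, in cm * kg
print(covariance(heights, weights, sample=False))  # population covariance
print(correlation(heights, weights))               # unitless, between -1 and +1
```

Changing the units (for example heights in metres) changes the covariance but leaves r unchanged, which is why correlation is described as the scaled version.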
Q- What are mean, median and mode?
A- We use the mean to describe an entire set of observations with a single value representing the center of the data.
The mean, or arithmetic mean, is the sum of the observations divided by the number of observations. The median is the
central point of the data after arranging it in ascending order. If the number of observations N is odd, the median is
the ((N+1)/2)th value; if N is even, the median is the midpoint of the (N/2)th and (N/2+1)th values.
The mode is the value that occurs most frequently in a set of observations.
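As a quick illustration of these three measures, here is a small sketch using Python's standard `statistics` module. The score lists are invented for the example.

```python
import statistics

scores = [4, 8, 6, 8, 3]              # odd number of observations
print(statistics.mean(scores))        # sum of observations / number of observations
print(statistics.median(scores))      # middle value of 3, 4, 6, 8, 8
print(statistics.mode(scores))        # most frequent value

even_scores = [4, 8, 6, 8, 3, 7]      # even number of observations
print(statistics.median(even_scores)) # midpoint of the two middle values, 6 and 7
```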
Q- What is statistics?
A- A method to get the information from a given set of observations.
Q- What is descriptive statistics?
A- Descriptive statistics provides a concise summary of data.
Q- What is inferential statistics?
A- Inferential statistics uses a random sample from a population to draw conclusions about that
population.
Q- What is a variable?
A- A variable is some characteristic of a population or sample.
Q- What is data?
A- Data is the collection of observed values of all variables.
Q- Define types of data?
A- Data is categorized into two categories: numeric (quantitative) data and categorical (qualitative)
data.
Quantitative data has meaning as a measurement, such as height, weight etc.
Numerical data can be further broken into two types: discrete and continuous.
Discrete data represent items that can be counted.
Continuous data represent measurements; their possible values cannot be counted and can only be
described using intervals on the real number line.

Categorical data represent characteristics such as a person's gender. Categorical data fall into two
categories: ordinal, where the categories have a natural order (e.g. ratings of 1, 2, 3), and nominal, where they do not (e.g. sex).

Q- What is Range?
The range is the difference between the highest and lowest observations of a variable.
Q- What is coefficient of determination?
The coefficient of determination is a measure of how well the regression line
represents the data. If the regression line passed exactly through every point on
the scatter plot, it would explain all of the variation. The further the line
is from the points, the less it is able to explain.
What is the rejection region?
The rejection region is the range of values such that, if the test statistic falls into it, we reject
the null hypothesis.

What is the p-value?
The p-value of a test is the probability of observing a test statistic at least as extreme as the one
computed, assuming that the null hypothesis is true.

What is RSS?
RSS is stand for sum of squares of difference between the mean value of observed
dependent variable and the regression value of dependent variable.

What is SSE?
Sum of square error?
It is sum of squares of difference between the regression value of a dependent
variable and the observed value of that dependent variable.

What is square sum of total?


It is sum of RSS and SSE.
What is coefficient of determination or R square?
The R2 of regression model is the proportion of total variance explained by the
independent variables of model.
It is defined by the R2=SSR/SST
What is adjusted R square and why do we use it?
When we add any variable to a linear regression model, the value of R square increases (or at least does not decrease)
even if the added variable is not significant. Hence we use adjusted R2, which is a
modified version of R square.
When we add an insignificant variable to the model, the adjusted R square is penalized,
whereas its value increases when we add a significant variable.
R2 assumes that every single variable explains the variation in the dependent variable. The
adjusted R2 tells you the percentage of variation explained by only the independent
variables that actually affect the dependent variable.
The formula is adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of independent variables.
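The sums of squares, R2 and adjusted R2 can be worked through on a tiny data set. The following is a sketch only: the helper names (`fit_line`, `fit_summary`) and the five data points are assumptions made for illustration, and the fit is a simple one-predictor least squares line.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for one independent variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_summary(xs, ys, p=1):
    """SSR, SSE, SST, R2 and adjusted R2 for the fitted line (p predictors)."""
    n = len(xs)
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    mean_y = sum(ys) / n
    ssr = sum((f - mean_y) ** 2 for f in fitted)          # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in ys)              # total variation
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "R2": r2, "adjR2": adj_r2}

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
summary = fit_summary(xs, ys)
print(summary)
```

For this data SSR + SSE equals SST, and the adjusted value is lower than R2 because of the (n - 1)/(n - p - 1) penalty.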

What is standard deviation?

Standard deviation is a measure of the spread of the data around the mean.
It is useful when we compare two data sets in order to see which one has the greater
spread.
What are the two rules or theorems that describe the spread of data around
the mean?
There are two rules:
1- Empirical rule, which states that if the distribution of the data is bell
shaped (the data is normally distributed) then
(a) about 68% of the data will fall within one standard deviation of the mean.
(b) about 95% of the data will fall within two standard deviations of the mean.
(c) about 99.7% of the data will fall within three standard deviations of the mean.
2- Chebyshev's theorem, which states that for any distribution the proportion of data within
k standard deviations of the mean is at least 1 - 1/k^2, where k > 1. So when k = 2, at least 75% of the data
will fall within 2 standard deviations of the mean.
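Both rules can be checked numerically. The sketch below is illustrative only: the skewed data set is invented, the normal data is simulated with a fixed seed, and `share_within` is a hypothetical helper name.

```python
import random
import statistics

def share_within(data, k):
    """Fraction of observations within k standard deviations of the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)   # population standard deviation of the data
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

def chebyshev_bound(k):
    """Minimum fraction guaranteed by Chebyshev's theorem for k > 1."""
    return 1 - 1 / k ** 2

# Hypothetical, deliberately skewed data: Chebyshev still applies.
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 5, 20]
for k in (2, 3):
    print(k, share_within(skewed, k), ">=", chebyshev_bound(k))

# Simulated bell-shaped data: the empirical rule should roughly hold.
rng = random.Random(1)
normal = [rng.gauss(0, 1) for _ in range(10_000)]
for k in (1, 2, 3):
    print(k, round(share_within(normal, k), 3))
```

Chebyshev gives only a lower bound for any distribution, while the empirical rule gives approximate values that rely on normality.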
What is the ordinary least squares method and why do we use it?

It is the simplest and most common method of fitting a straight line to a sample of data: the line is chosen by
minimizing the sum of the squares of the errors of the data from the line.
What are the basic assumptions of the ordinary least squares method?
1- The model is linear in its parameters.
2- The errors are statistically independent of each other.
3- The independent variables are not highly collinear.
4- The errors are normally distributed.
5- The independent variables are measured precisely.
6- The expected value of the error is zero.
7- The residuals have constant variance.

What is Standard Error of Estimate?

The standard error of estimate, Se, is the square root of the variance of the error, Se^2.
For a simple linear regression with n observations, Se^2 = SSE/(n-2), so Se = sqrt(SSE/(n-2)).

What is heteroscedasticity and how to detect it?

One of the key assumptions of regression is that the variance of the errors is
constant across observations. If the errors have constant variance, they are
called homoscedastic; if the variance is not constant, they are called heteroscedastic.
Typically, the residuals are plotted to assess this assumption.
Standard estimation methods are inefficient when the errors are heteroscedastic.

How to remove heteroscedasticity? Need to learn.

What is VIF (variance inflation factor)?

The variance inflation factor (VIF) is used to describe the multicollinearity between the
independent variables (IDVs). It measures how much the variance of an estimated regression coefficient is
inflated compared with the case where there is no linear relation between the IDVs.
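For predictor j the usual definition is VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With only two predictors, R_j^2 is simply the squared correlation between them, which makes a small sketch possible. The data and function names below are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical predictors.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]      # correlated with x1 (r = 0.8)
x3 = [1, -1, 0, -1, 1]    # uncorrelated with x1 (r = 0)

print(vif_two_predictors(x1, x2))  # above 1: the variance is inflated
print(vif_two_predictors(x1, x3))  # 1.0: no inflation
```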

What are the methods to check the fit of a linear regression model?

R square and adjusted R square.
MAPE (mean absolute percentage error), which as a rule of thumb should be less than or equal to 5%.
VIF, which as a rule of thumb should be less than 10.
Model fit check by plotting predicted vs actual values.
Residual check for heteroscedasticity.

What is logistic regression and when to use it?

We use logistic regression when the dependent variable is categorical and the IDVs are a mix of categorical and quantitative
variables. It predicts the log odds of the response instead of directly
predicting the response.
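The link between log odds and probability can be shown in a few lines. This is a sketch under assumed values: the coefficients `b0` and `b1` and the input `x` are hypothetical, not taken from any fitted model.

```python
import math

def log_odds(p):
    """Log odds (logit) of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def probability(z):
    """Inverse of the logit: turns a log-odds value back into a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients of a one-variable logistic model.
b0, b1 = -4.0, 0.1
x = 50
z = b0 + b1 * x              # the model is linear on the log-odds scale
print(z, probability(z))     # predicted log odds and the matching probability
print(log_odds(0.5))         # even odds correspond to log odds of 0
```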
How many types of logistic regression are there?
Binary logistic regression - when the response of the dependent variable is dichotomous (only two
responses), we use binary logistic regression.
Ordered or ordinal logistic regression - when the response of the dependent variable is ordinal with
more than two levels, e.g. high, medium, low, we use an ordinal logistic regression model.
Multinomial logistic regression - when the response of the dependent variable is nominal with more
than two levels, e.g. type of bread, we use a multinomial logistic regression model.
How to check multicollinearity in logistic regression in SAS?

What is MLE (maximum likelihood estimation)?

Maximum likelihood is a technique for finding the regression coefficients of a logistic regression: it chooses the coefficient values under which the observed data are most likely.

What are the assumptions of MLE?

Variance of Error in estimated model is constant.

What is central tendency of data?

The term central tendency refers to the middle value of the data. Mean, median and mode are the
three statistical measures used to measure the central tendency of the data.
The mean is the arithmetic average, the mode is the most frequent value and the median is the positional average.

What are the different types of sampling?

Simple Random Sample
A simple random sample is a sample selected in such a way
that every possible sample with the same number of observations is equally likely to be
chosen.
Stratified Random Sample
A stratified random sample is obtained by separating the population into mutually
exclusive sets, or strata, and then drawing simple random samples from each stratum.
Cluster Sample
A cluster sample is a simple random sample of groups or clusters of elements.
https://faculty.elgin.edu/dkernler/statistics/ch01/1-4.html
Simple Random Sampling and Other Sampling Methods
Sampling Methods can be classified into one of two categories:

Probability Sampling: Sample has a known probability of being selected


Non-probability Sampling: Sample does not have known probability of
being selected as in convenience or voluntary response surveys
Probability Sampling
In probability sampling it is possible to both determine which sampling units belong to which sample
and the probability that each sample will be selected. The following sampling methods, which are
listed in Chapter 4, are types of probability sampling:

1. Simple Random Sampling (SRS)
2. Stratified Sampling
3. Cluster Sampling
4. Systematic Sampling
5. Multistage Sampling (in which some of the methods above are combined in stages)
Of the five methods listed above, students have the most trouble distinguishing between stratified
sampling and cluster sampling.

Stratified Sampling is possible when it makes sense to partition the population into groups based on
a factor that may influence the variable that is being measured. These groups are then called strata.
An individual group is called a stratum. With stratified sampling one should:

1. partition the population into groups (strata)
2. obtain a simple random sample from each group (stratum)
3. collect data on each sampling unit that was randomly sampled from each group (stratum)
Stratified sampling works best when a heterogeneous population is split into fairly homogeneous
groups. Under these conditions, stratification generally produces more precise estimates of the
population percents than estimates that would be found from a simple random sample. Table
3.2 shows some examples of ways to obtain a stratified sample.
Table 3.2. Examples of Stratified Samples

                              | Example 1 | Example 2 | Example 3
------------------------------|-----------|-----------|----------
Population                    | All people in U.S. | All PSU intercollegiate athletes | All elementary students in the local school district
Groups (Strata)               | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district
Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 PSU teams | 20 students from each of the 11 elementary schools
Sample                        | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students

Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should:

1. divide the population into groups (clusters).
2. obtain a simple random sample of a chosen number of clusters from all possible clusters.
3. obtain data on every sampling unit in each of the randomly selected clusters.

It is important to note that, unlike with the strata in stratified sampling, the clusters should be
microcosms, rather than subsections, of the population. Each cluster should be heterogeneous.
Additionally, the statistical analysis used with cluster sampling is not only different, but also more
complicated than that used with stratified sampling.
Table 3.3. Examples of Cluster Samples

                              | Example 1 | Example 2 | Example 3
------------------------------|-----------|-----------|----------
Population                    | All people in U.S. | All PSU intercollegiate athletes | All elementary students in a local school district
Groups (Clusters)             | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district
Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 elementary schools from the 11 possible elementary schools
Sample                        | every person in the 2 selected time zones | every athlete on the 8 selected teams | every student in the 4 selected elementary schools
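The difference between the two designs shows up clearly in code. The sketch below mirrors the school example (11 schools, 20 students per school for the stratified sample, 4 whole schools for the cluster sample). The population itself, with 30 students per school, and the function names are invented for illustration.

```python
import random

def stratified_sample(groups, per_group, rng):
    """Simple random sample of `per_group` units from every group (stratum)."""
    return {name: rng.sample(units, per_group) for name, units in groups.items()}

def cluster_sample(groups, n_clusters, rng):
    """Simple random sample of whole groups (clusters); all their units are kept."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: list(groups[name]) for name in chosen}

# Hypothetical population: 11 schools with 30 students each.
schools = {
    f"school_{i}": [f"student_{i}_{j}" for j in range(30)]
    for i in range(1, 12)
}

rng = random.Random(42)
strat = stratified_sample(schools, 20, rng)
clus = cluster_sample(schools, 4, rng)

print(len(strat), sum(len(v) for v in strat.values()))  # every school appears, 20 students each
print(len(clus), sum(len(v) for v in clus.values()))    # only 4 schools, but every student in them
```

In the stratified design every group contributes a few units. In the cluster design only some groups are chosen, and each chosen group is observed completely.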

What are the different types of error that arise when a sample of observations is taken from a
population?
Two types of error arise:
Sampling error - Sampling error refers to differences between the sample and the population
that exist only because of the observations that happened to be selected for the sample.
Nonsampling error - Nonsampling errors result from mistakes made in the acquisition of data or from the sample
observations being selected improperly. There are three types:
1- Errors in data acquisition.
2- Nonresponse error.
3- Selection bias. Selection bias occurs when the sampling plan is such that some members of
the target population cannot possibly be selected for inclusion in the sample. Together with
nonresponse error, selection bias played a role in the Literary Digest poll being so wrong, as
voters without telephones or without a subscription to Literary Digest were excluded from
possible inclusion in the sample taken.

What is a random experiment?

A random experiment is an experiment or process that leads to one of several possible outcomes.

What is Sample Space?


A sample space of a random experiment is a list of all possible outcomes of the
experiment. The outcomes must be exhaustive and mutually exclusive.
What is Probability of an Event?
The probability of an event is the sum of the probabilities of the simple events that
constitute the event.
What is discrete and continuous data?

Discrete data can only take certain values.
Example: the number of students in a class (you can't have half a student).

Continuous data can take any value (within a range).
Example: a person's height, which could be any value (within the range of human heights), not just certain fixed heights.

What is Random Variable?


A random variable is a function or rule that assigns a number to each outcome of
an experiment.

What is a binomial experiment?

1. The binomial experiment consists of a fixed number of trials. We represent the number
of trials by n.
2. Each trial has two possible outcomes. We label one outcome a success, and the other a
failure.
3. The probability of success is p. The probability of failure is 1 - p.
4. The trials are independent, which means that the outcome of one trial does not affect the
outcomes of any other trials.
Binomial Probability Distribution
The probability of x successes in a binomial experiment with n trials and probability of
success p is
$P(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$ for x = 0, 1, 2, ..., n
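The binomial formula translates directly into code. This is a minimal sketch: the coin-flip scenario (10 trials, p = 0.5) is a made-up example, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical experiment: 10 flips of a fair coin.
n, p = 10, 0.5
print(binomial_pmf(3, n, p))                             # exactly 3 heads: 120 / 1024
print(sum(binomial_pmf(x, n, p) for x in range(n + 1)))  # all outcomes together sum to 1
```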

Poisson distribution: from book.

What is central limit theorem?

The Central Limit Theorem (CLT for short) basically says that for non-normal data, the
distribution of the sample means has an approximate normal distribution, no matter what the
distribution of the original data looks like, as long as the sample size is large enough (usually at
least 30) and all samples have the same size.
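A small simulation makes the theorem concrete. The sketch below draws samples from an exponential distribution (strongly skewed, so clearly non-normal) and looks at the distribution of the sample means. The distribution, the seed, the sample size of 30 and the 2000 repetitions are all arbitrary choices for illustration.

```python
import math
import random
import statistics

def sample_means(draw, sample_size, n_samples):
    """Mean of each of `n_samples` samples, all of the same size."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

rng = random.Random(0)
# Exponential population with mean 1 and standard deviation 1.
means = sample_means(lambda: rng.expovariate(1.0), sample_size=30, n_samples=2000)

print(statistics.fmean(means))   # close to the population mean, 1.0
print(statistics.stdev(means))   # close to sigma / sqrt(n)
print(1 / math.sqrt(30))         # theoretical value of sigma / sqrt(n)
```

A histogram of `means` would look roughly bell shaped even though the individual observations are heavily skewed.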

What are concordant, discordant and tied pairs?

Percent Concordant: Percentage of pairs where the observation with the desired outcome
(event) has a higher predicted probability than the observation without the outcome (nonevent).
Percent Discordant: Percentage of pairs where the observation with the desired outcome
(event) has a lower predicted probability than the observation without the outcome (nonevent).
Percent Tied: Percentage of pairs where the observation with the desired outcome (event) has
the same predicted probability as the observation without the outcome (nonevent).
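These three percentages can be computed by comparing every event with every nonevent. The sketch below follows the definitions literally, using exact comparison of the predicted probabilities. The probabilities are invented, and `pair_percentages` is a hypothetical helper name.

```python
def pair_percentages(event_probs, nonevent_probs):
    """Percent concordant, discordant and tied over all event/nonevent pairs."""
    concordant = discordant = tied = 0
    for p_event in event_probs:
        for p_nonevent in nonevent_probs:
            if p_event > p_nonevent:
                concordant += 1
            elif p_event < p_nonevent:
                discordant += 1
            else:
                tied += 1
    total = len(event_probs) * len(nonevent_probs)
    return {
        "concordant": 100 * concordant / total,
        "discordant": 100 * discordant / total,
        "tied": 100 * tied / total,
    }

# Hypothetical predicted probabilities from a fitted model.
events = [0.9, 0.7, 0.4]          # observations where the outcome occurred
nonevents = [0.6, 0.4, 0.2, 0.1]  # observations where it did not

result = pair_percentages(events, nonevents)
print(result)   # 12 pairs: 10 concordant, 1 discordant, 1 tied
```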
What is ROC?
ROC (Receiver Operating Characteristic) - In a ROC curve we plot the true positive rate (sensitivity) against the
false positive rate (100 - specificity) for different cutoff points.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular
decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC
curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore the
closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.
Sensitivity: probability that a test result will be positive when the disease is present (true positive
rate, expressed as a percentage).
= a / (a+b), where a is the number of true positives and b is the number of false negatives.
Specificity: probability that a test result will be negative when the disease is not present (true
negative rate, expressed as a percentage).
= d / (c+d), where c is the number of false positives and d is the number of true negatives.
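The two ratios can be computed from the four cell counts of a 2x2 table. In this sketch a, b, c and d are taken to be true positives, false negatives, false positives and true negatives, which is the reading implied by the formulas. The counts themselves are hypothetical.

```python
def sensitivity(a, b):
    """True positive rate: a true positives out of a + b diseased cases."""
    return a / (a + b)

def specificity(c, d):
    """True negative rate: d true negatives out of c + d healthy cases."""
    return d / (c + d)

# Hypothetical counts at one cutoff point.
a, b = 80, 20   # diseased: test positive, test negative
c, d = 10, 90   # healthy:  test positive, test negative

tpr = sensitivity(a, b)
fpr = 1 - specificity(c, d)   # the x-axis of the ROC curve
print(tpr, fpr)               # one (sensitivity, false positive rate) point on the curve
```

Repeating the calculation for several cutoff points and plotting the resulting pairs traces out the ROC curve.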
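What is the difference between NODUPKEY and NODUPRECS?
NODUPRECS (alias NODUP) removes an observation only when all of its variable values match those of the previous observation, whereas NODUPKEY removes observations that have duplicate BY variable values. The examples below run both options on the same data with different BY lists.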
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1;
run;
options nocenter;
proc print data=test1;

run;
/* NODUPRECS (NODUP) compares the values of all variables with those of the previous observation
before writing an observation to the output, and does not write it if it is a duplicate */
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2 extra;
run;
options nocenter;
proc print data=test1;
run;
/* NODUPKEY checks the BY variable values and deletes observations that duplicate them */
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1 id2 extra;
run;
options nocenter;
proc print data=test1;
run;
What are the differences between the WHERE and IF statements?
1- The IF statement can be used only in a data step, whereas the WHERE statement can be used in a data step as
well as in a proc step. WHERE can also be written as a data set option:
Data test (where=(name='gau'));
2- The IF statement can be used in a data step that reads raw records with an INPUT statement,
whereas the WHERE statement cannot.
Data test;
Input a b c;
If b in (2,3);
Datalines;
1 2 3
3 4 5
4 2 6
5 3 7
;
Run;
3- The IF statement is applied after the data has been read into the PDV (program data vector), whereas the WHERE statement is applied before
the data is read into the PDV.
Execute multiple conditional statements
Suppose you have data on college students' mathematics scores and you want to rate the students on the
basis of their scores.
Conditions:

1. If a score is less than 40, create a new variable named Rating and give a Poor rating to these
students.
What are SAS macros?
Macros are used to perform repetitive tasks.
What is a macro variable and how many types of macro variables are there?
Macro variables are used to store values as text. There are two types of macro variable:

Local - If the macro variable is defined inside a macro, its scope is local. It is available for use in that macro only and is
removed when the macro finishes.
Global - If the macro variable is defined outside a macro, its scope is global. It can be used anywhere in the SAS program and is
removed at the end of the session.


data test;
input a;
datalines;
1
2
;
run;
/* global variable */
%let dat=test;
/* define macro */
%macro sample;
proc print data=&dat;
run;
%mend sample;
/* invoke macro */
%sample
/* local variable */
%macro sample1;
%let dat1=test;
proc print data=&dat1;
run;
%mend sample1;
/* invoke macro */
%sample1

Note that the value does not require quotation marks, even for character values.
What are the methods to create macro variables?
1- %LET statement
2- Macro parameters (named and positional)

Positional parameters: we provide only the parameter names while defining the macro, and values are supplied in the same order when the macro is called.
%macro S(obs,b);
data t2;
set &b;
run;
%mend S;
Keyword (named) parameters: we provide each parameter name with an equal sign, and we can also
assign default values to the parameters.

Definition

%MACRO <macro name> (Parameter1=Value, Parameter2=Value, ..., Parameter-n=Value);

Macro Text;

%MEND;

Calling

%<macro name> (Parameter1=Value, Parameter2=Value, ..., Parameter-n=Value);

3- CALL SYMPUT - suppose we want to use the variable (need to check).

What is the difference between PROC UNIVARIATE and PROC MEANS?

1- Both procedures produce descriptive statistics. PROC UNIVARIATE by default
produces all of its statistics (sometimes not all are required), whereas in PROC MEANS it is possible
to request only the statistics that we want.

2- PROC UNIVARIATE can produce histograms and box plots and reports quartiles by default,
whereas PROC MEANS produces no plots and reports quartiles only on request.

What are the rules for SAS data sets?

1- A SAS data set name can be 1 to 32 characters long.
2- The name should start with an underscore or a letter, and subsequent
characters can be letters, digits or underscores.
3- A SAS data set consists of two parts: a descriptor portion and a
data portion.

What are the SAS variable attributes?

Name - can be 1 to 32 characters long. It should start with an
underscore or a letter, and subsequent characters can be letters,
digits or underscores.
Type - numeric (digits, +, -, decimal point, and E for scientific notation) or
character (letters, digits and special characters).
A missing character value is represented by a blank, and a missing
numeric value by a period (.).
Length - a character variable has a default length of 8 bytes and
can have a length of up to 32,767 bytes.
All numeric variables have a default length of 8. Numeric values (no matter how many digits they
contain) are stored as floating-point numbers in 8 bytes of storage, unless you specify a different
length.

What are formats and informats, and what are the differences between them?
A format is used to write data values to output data sets or reports. An informat is used to read data values
from raw files.
Whereas formats write values out in a particular form, informats read data values in
certain forms into standard SAS values.
How many types of informat are there?
SAS informats are grouped into three categories:
character ($w.), numeric (w.d), and date and time.
For example, if a data value is numeric and appears in the file as $126,000, we use the DOLLAR8. informat to read it from the
external file. Please note that the informat applies only to reading the data. To display the value in that form in the output we
have to use a format; if no format is used, the value is displayed as 126000.

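What is the YEARCUTOFF system option?

When SAS encounters a two-digit year in a date, it uses the YEARCUTOFF= system option to decide which century the year belongs to:
the option specifies the first year of the 100-year span that is applied to two-digit years. (SAS stores dates as the number of days since 1 January 1960.)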
What is a date constant and when do we use it?
You can assign date values to variables in assignment statements by using date constants.
To represent a constant in SAS date form, specify the date as 'ddmmmyy' or 'ddmmmyyyy',
followed by a D.
data clinic.stress;
   infile tests;
   input ID 1-4 Name $ 6-25 RestHr 27-29 MaxHR 31-33
         RecHR 35-37 TimeMin 39-40 TimeSec 42-43
         Tolerance $ 45;
   TotalTime=(timemin*60)+timesec;
   TestDate='01jan2000'd;
run;
Note: You can also use SAS time constants and SAS datetime constants in
assignment statements.
Time='9:25't;
DateTime='18jan2005:9:27:05'dt;
How do you export and import a SAS data file, in a data step and in a proc step?
