Q- What is covariance?
A- Covariance is a measure of how much two variables change together.
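As a rough illustration of the definition above, the sketch below computes the sample covariance by hand. The function name `covariance` and the study-hours data are made up for the example, and the n - 1 divisor is an assumption (the notes do not say whether the sample or population form is meant).

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of the deviations from each mean.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

cov_up = covariance(hours, scores)          # variables rise together
cov_down = covariance(hours, scores[::-1])  # one rises while the other falls
print(cov_up, cov_down)
```

A positive value means the two variables tend to move in the same direction, a negative value means they tend to move in opposite directions.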
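Q- What is correlation?
A- Correlation measures the strength and direction of the linear relationship between two
variables. It is the covariance divided by the product of the two standard deviations and always
lies between -1 and 1.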
Categorical data: categorical data represent characteristics such as a person's gender. Categorical
data fall into two types: ordinal, where the categories have a natural order (for example ratings
coded 1, 2, 3), and nominal, where they do not (for example sex).
Q- What is Range?
Range is the difference between the highest and lowest observations of a variable.
Q-What is coefficient of determination?
The coefficient of determination is a measure of how well the regression line
represents the data. If the regression line passes exactly through every point on
the scatter plot, it would be able to explain all of the variation. The further the line
is away from the points, the less it is able to explain.
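The idea can be sketched numerically. The snippet below assumes the usual formula R^2 = 1 - SSE/SST (the notes do not write it out) and uses two invented data sets: one lying exactly on the line y = 2x + 1 and one scattered around it. The name `r_squared` is hypothetical.

```python
def r_squared(ys, fitted):
    """Coefficient of determination, computed as 1 - SSE / SST."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)             # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    return 1 - sse / sst


xs = [1, 2, 3, 4]
line = [2 * x + 1 for x in xs]   # fitted values from y = 2x + 1

ys_exact = [3, 5, 7, 9]          # every point lies on the line
ys_noisy = [4, 4, 6, 10]         # same least-squares line, but with scatter

r2_exact = r_squared(ys_exact, line)
r2_noisy = r_squared(ys_noisy, line)
print(r2_exact, r2_noisy)
```

The first data set gives 1.0 because the line explains all of the variation; the second gives about 0.83 because some variation is left in the residuals.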
What is Rejection Region?
The rejection region is the range of values such that if the test statistic falls into it, we
reject the null hypothesis.
What is P value?
The p value of a test is the probability of observing a test statistic at least as extreme as the
one computed, assuming that the null hypothesis is true.
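To make the rejection region and the p value concrete, here is a minimal sketch for one specific case: a two-sided z test, where the test statistic is standard normal under the null hypothesis. That choice of test, the 0.05 significance level, and the function names are assumptions for illustration only.

```python
import math


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def reject_null(z, alpha=0.05):
    """True when z falls in the rejection region, i.e. the p value is at most alpha."""
    return two_sided_p_value(z) <= alpha


for z in (1.0, 1.96, 2.5):
    print(z, round(two_sided_p_value(z), 4), reject_null(z))
```

For alpha = 0.05 the rejection region is roughly |z| >= 1.96, so a statistic of 2.5 leads to rejection and a statistic of 1.0 does not.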
What is RSS?
RSS stands for regression sum of squares: the sum of squared differences between the
regression (fitted) values of the dependent variable and the mean of its observed values.
What is SSE?
SSE stands for sum of squared errors: the sum of squared differences between the
regression value of the dependent variable and its observed value.
Ordinary least squares (OLS) is the simplest and most common method of fitting a straight line to
a sample of data: it chooses the line that minimizes the sum of the squared errors of the data
from the line.
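A minimal sketch of least-squares fitting, using the closed-form slope and intercept for a single predictor. The data are invented and the names `fit_line`, `rss`, `sse` and `sst` are chosen for the example; `rss` follows the definition given earlier (fitted value minus mean), not the "residual sum of squares" meaning the abbreviation has in some textbooks.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [0, 1, 2, 3]
ys = [2, 3, 6, 11]   # hypothetical observations

slope, intercept = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
mean_y = sum(ys) / len(ys)

rss = sum((f - mean_y) ** 2 for f in fitted)         # regression sum of squares
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # sum of squared errors
sst = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
print(slope, intercept, rss, sse, sst)
```

For a least-squares line with an intercept, the two pieces add up to the total variation (here 45 + 4 = 49), which is why the share explained by the regression can be read off as rss / sst.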
What are the basic assumptions of the ordinary least squares method?
One of the key assumptions of regression is that the variance of the errors is
constant across observations. If the errors have constant variance, the errors are
called homoscedastic. Typically, residuals are plotted to assess this assumption.
Standard estimation methods are inefficient when the errors are heteroscedastic, that is, when
they have non-constant variance.
Simple Random Sample
A simple random sample is a sample selected in such a way that every possible sample with the
same number of observations is equally likely to be chosen.
Stratified Random Sample
A stratified random sample is obtained by separating the population into mutually
exclusive sets, or strata, and then drawing simple random samples from each stratum.
Cluster Sample
A cluster sample is a simple random sample of groups or clusters of elements.
https://faculty.elgin.edu/dkernler/statistics/ch01/1-4.html
Simple Random Sampling and Other Sampling Methods
Sampling methods can be classified into one of two categories: probability sampling and
non-probability sampling.
Stratified Sampling is possible when it makes sense to partition the population into groups based on
a factor that may influence the variable that is being measured. These groups are then called strata.
An individual group is called a stratum. With stratified sampling one should partition the population
into strata, obtain a simple random sample from each stratum, and collect data on each unit sampled.
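Example 1
Groups (Strata): 4 time zones in the U.S. (Eastern, Central, Mountain, Pacific)
Obtain a Simple Random Sample: 500 people from each of the 4 time zones
Sample: 4 × 500 = 2000 selected people

Example 2
Population: All PSU intercollegiate athletes
Groups (Strata): 26 PSU intercollegiate teams
Obtain a Simple Random Sample: 5 athletes from each of the 26 teams
Sample: 26 × 5 = 130 selected athletes

Example 3
Groups (Strata): 11 different elementary schools in the local school district
Obtain a Simple Random Sample: 20 students from each of the 11 schools
Sample: 11 × 20 = 220 selected students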
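Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should
divide the population into clusters, obtain a simple random sample of clusters, and collect data on
every unit in each selected cluster.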
It is important to note that, unlike with the strata in stratified sampling, the clusters should be
microcosms, rather than subsections, of the population. Each cluster should be heterogeneous.
Additionally, the statistical analysis used with cluster sampling is not only different, but also more
complicated than that used with stratified sampling.
Table 3.3. Examples of Cluster Samples

Example 2
Groups (Clusters): 26 PSU intercollegiate teams

Example 3
Groups (Clusters): 11 different elementary schools in the local school district
Obtain a Simple Random Sample: 4 elementary schools from the 11 possible elementary schools
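The difference between the two designs can be sketched in a few lines. The example below reuses the elementary school setting from the tables above (11 schools, 20 students per stratum, 4 clusters); the 30 students per school, the identifiers and the function names are invented for illustration.

```python
import random


def stratified_sample(strata, k_per_stratum, rng):
    """Draw a simple random sample of k units from every stratum."""
    return {name: rng.sample(units, k_per_stratum) for name, units in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Draw a simple random sample of whole clusters and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical district: 11 schools with 30 students each.
schools = {
    f"school_{i}": [f"student_{i}_{j}" for j in range(30)]
    for i in range(1, 12)
}

strat = stratified_sample(schools, 20, rng)  # some students from every school
clust = cluster_sample(schools, 4, rng)      # every student from 4 schools

print(sum(len(v) for v in strat.values()), "students from", len(strat), "schools")
print(sum(len(v) for v in clust.values()), "students from", len(clust), "schools")
```

In the stratified design every school contributes students, while in the cluster design most schools contribute nobody and the selected schools contribute everyone.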
What types of error arise when a sample of observations is taken from a population?
Two types of error arise:
Sampling error- Sampling error refers to differences between the sample and the population
that exist only because of the observations that happened to be selected for the sample.
Non-sampling error- Nonsampling errors result from mistakes made in the acquisition of data or
from the sample observations being selected improperly. There are three kinds:
1-Errors in data acquisition.
2-Nonresponse error.
3-Selection bias. Selection bias occurs when the sampling plan is such that some members of
the target population cannot possibly be selected for inclusion in the sample. Together with
nonresponse error, selection bias played a role in the Literary Digest poll being so wrong, as
voters without telephones or without a subscription to Literary Digest were excluded from
possible inclusion in the sample taken.
The Central Limit Theorem (CLT for short) basically says that for non-normal data, the
distribution of the sample means has an approximate normal distribution, no matter what the
distribution of the original data looks like, as long as the sample size is large enough (usually at
least 30) and all samples have the same size.
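A small simulation can make this visible. The sketch below draws repeated samples from an exponential distribution, which is strongly skewed, and looks at the distribution of the sample means. The choice of distribution, the sample size of 50, the 2000 repetitions and the seed are all arbitrary assumptions.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, rng):
    """Mean of each of n_samples samples, all of the same size."""
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(42)

# Exponential data with rate 1: mean 1, standard deviation 1, heavily right-skewed.
means = sample_means(lambda r: r.expovariate(1.0), 50, 2000, rng)

center = statistics.fmean(means)
spread = statistics.stdev(means)
print(round(center, 3), round(spread, 3))
```

The sample means cluster around the population mean of 1, with a spread close to 1 / sqrt(50), about 0.14, and a histogram of `means` looks roughly bell-shaped even though the raw data do not.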
Percent Concordant: Percentage of pairs where the observation with the desired outcome
(event) has a higher predicted probability than the observation without the outcome (nonevent).
Percent Discordant: Percentage of pairs where the observation with the desired outcome
(event) has a lower predicted probability than the observation without the outcome (nonevent).
Percent Tied: Percentage of pairs where the observation with the desired outcome (event) has
the same predicted probability as the observation without the outcome (nonevent).
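The three percentages can be computed by comparing every event with every nonevent. The sketch below does this by brute force; the function name and the predicted probabilities are made up for the example.

```python
def concordance(event_probs, nonevent_probs):
    """Return (percent concordant, percent discordant, percent tied)."""
    pairs = len(event_probs) * len(nonevent_probs)
    concordant = discordant = tied = 0
    for p_event in event_probs:
        for p_nonevent in nonevent_probs:
            if p_event > p_nonevent:
                concordant += 1
            elif p_event < p_nonevent:
                discordant += 1
            else:
                tied += 1
    return tuple(100 * count / pairs for count in (concordant, discordant, tied))


# Hypothetical predicted probabilities from a fitted model.
events = [0.9, 0.7, 0.4]
nonevents = [0.6, 0.4, 0.2, 0.1]

pct_concordant, pct_discordant, pct_tied = concordance(events, nonevents)
print(pct_concordant, pct_discordant, pct_tied)
```

With 3 events and 4 nonevents there are 12 pairs: 10 are concordant, 1 is discordant and 1 is tied, so the three percentages sum to 100.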
What is ROC?
ROC stands for Receiver Operating Characteristic. In an ROC curve we plot the true positive rate
(sensitivity) against the false positive rate (100 - specificity) for different cutoff points.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular
decision threshold. A test with perfect discrimination (no overlap in the two distributions) has an ROC
curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore the
closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.
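Sensitivity: probability that a test result will be positive when the disease is present (true positive
rate, expressed as a percentage).
= a / (a+b), where a is the number of true positives and b is the number of false negatives.
Specificity: probability that a test result will be negative when the disease is not present (true
negative rate, expressed as a percentage).
= d / (c+d), where c is the number of false positives and d is the number of true negatives.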
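As a sketch of how these quantities produce an ROC curve, the code below computes a / (a+b) and d / (c+d) at a given cutoff and then sweeps the cutoff over every observed score. The labels, scores and function names are invented, the cell meanings (a true positives, b false negatives, c false positives, d true negatives) are read off the formulas, and rates are returned as fractions rather than percentages.

```python
def sensitivity_specificity(labels, scores, threshold):
    """labels: 1 = disease present, 0 = absent. Test is positive when score >= threshold."""
    a = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)  # true positives
    b = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)   # false negatives
    c = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)  # false positives
    d = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)   # true negatives
    return a / (a + b), d / (c + d)


def roc_points(labels, scores):
    """(false positive rate, true positive rate) at each distinct cutoff, strictest first."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        points.append((1 - spec, sens))
    return points


# Hypothetical test results for 4 diseased and 4 healthy subjects.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print(sensitivity_specificity(labels, scores, 0.5))
for point in roc_points(labels, scores):
    print(point)
```

Lowering the cutoff moves along the curve from the lower left (few positives of either kind) toward the upper right (everything called positive).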
What is the difference between NODUPKEY and NODUPRECS?
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1;
run;
options nocenter;
proc print data=test1;
run;
/* NODUPRECS (alias NODUP) compares the values of all variables with those of the previous
observation written to the output and does not write the observation if they are identical. */
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2 extra;
run;
options nocenter;
proc print data=test1;
run;
/* NODUPKEY checks for and deletes observations that have duplicate BY variable values. */
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1 id2 extra;
run;
options nocenter;
proc print data=test1;
run;
What are the differences between the WHERE and IF statements?
1- The IF statement can be used only in a DATA step, whereas the WHERE statement can be used in
a DATA step as well as in a PROC step. WHERE is also available as a data set option:
Data test (where=(name='gau'));
2- The IF statement can be used in a DATA step that reads raw records with an INPUT statement,
whereas the WHERE statement cannot.
Data test;
Input a b c;
If b in (2,3); /* subsetting IF: keep only records where b is 2 or 3 */
Datalines;
1 2 3
3 4 5
4 2 6
5 3 7
;
Run;
3- The IF statement is applied after an observation has been read into the program data vector
(PDV), whereas the WHERE statement is applied before the observation is read into the PDV.
Execute multiple conditional statements
Suppose you have data on college students' mathematics scores and want to rate the students on
the basis of their scores.
Conditions:
1. If a score is less than 40, create a new variable named Rating and give Poor rating to these
students.
What are SAS macros?
Macros are used to perform repetitive tasks.
What is a macro variable, and how many types of macro variables are there?
Macro variables are used to store text values for reuse in a program. There are two types of macro variable
run;
Note that the value does not require quotation marks, even for character values.
What are the methods to create macro variables?
1- %LET statement
2- Macro parameters (named and positional)
Definition
Macro Text;
%MEND;
Calling
What are formats and informats, and what are the differences between them?
A format is used to write data values to output data sets or reports. An informat is used to read
data values from raw files.
Whereas formats write values out in a particular form, informats read data values in
certain forms into standard SAS values.
How many types of informats are there?
SAS informats are grouped into three categories:
character ($w.), numeric (w.d), and date and time.
For example, if a data value is numeric and appears as $126,000 in an external file, we use the
dollar8. informat to read it. Note that the informat applies only to reading the file. To display
the value with the dollar sign and comma in output we have to apply a format; if no format is
used, the value is displayed as 126000.