
Basic Statistical Procedures in SAS

Contents

6 Basic Statistical Procedures in SAS
6.1 PROC UNIVARIATE
6.2 PROC MEANS
6.3 PROC TTEST
6.3.1 One-sample comparison
6.3.2 Two independent samples comparison
6.3.3 Paired comparisons
6.4 PROC FREQ
6.5 PROC ANOVA
6.6 PROC CORR
6.7 PROC REG

6.1 PROC UNIVARIATE

Statisticians know the importance of exploring the characteristics of the data at hand
before jumping into actual (and complicated) statistical analysis. PROC UNIVARIATE
produces statistics and graphs describing the distribution of a single variable (hence the
name UNIVARIATE). The format is:

PROC UNIVARIATE;
VAR variable-list;
RUN;

REMARKS:

• omitting the VAR statement causes SAS to calculate all statistics for all numeric
variables

• specify the NORMAL option to produce tests of normality

• you can request the following plots in PROC UNIVARIATE: CDFPLOT, HISTOGRAM,
PPPLOT, PROBPLOT and QQPLOT
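
The remarks above can be sketched as follows. This is a minimal illustration, not part of the original notes: it uses sashelp.class, a sample data set that ships with SAS, as a stand-in for your own data, and the choice of the variables height and weight is arbitrary.

```sas
/* sashelp.class ships with SAS; it stands in for your own data set */
proc univariate data=sashelp.class normal;      /* NORMAL requests tests of normality */
   var height weight;                           /* analyze only these two variables */
   histogram height / normal;                   /* histogram with a fitted normal curve */
   qqplot height / normal(mu=est sigma=est);    /* Q-Q plot against a fitted normal */
run;
```

Dropping the VAR statement would make the procedure report on every numeric variable in the data set.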

Example 6.1 Consider the data in Scores.dat, which consist of test scores from a
statistics class. Each line contains scores for 10 students. Explore the distribution of the
test scores using PROC UNIVARIATE.
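
One possible way to set up Example 6.1 is sketched below. The file path, the data set name scores and the variable name score are assumptions, and the sketch assumes the scores on each line are separated by blanks.

```sas
data scores;
   infile 'Scores.dat';   /* assumed path; point this at your copy of the file */
   input score @@;        /* @@ holds the line so every score on it is read */
run;

proc univariate data=scores;
   var score;
run;
```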

Example 6.2 The data found in “body fat data.xlsx” were used to produce predictive
equations for lean body weight, a measure of health. Measurements were made on n = 252
men in order to relate the percentage bodyfat determined by underwater weighing (variable
bodyfat), which is inconvenient and costly to obtain, to a number of body circumference
measurements, recorded using only a scale and measuring tape.

The other variables derived from the records are age in years (age), weight in lb
(weight), height in inches (height), neck circumference in cm (neck), chest circumference
in cm (chest), abdomen 2 circumference in cm (abdomen), hip circumference in cm (hip),
thigh circumference in cm (thigh), knee circumference in cm (knee), ankle circumference
in cm (ankle), extended biceps circumference in cm (biceps), forearm circumference in
cm (forearm), and wrist circumference in cm (wrist).

Using PROC UNIVARIATE, explore the distribution of all variables by answering the
following guide questions:

• Which among the variables has the highest dispersion?

• Which variables have a positively skewed distribution?

• At 0.05 level of significance, which variables exhibit a non-normal distribution?

• What is the mean bodyfat of the respondents? Interpret the value.

• Among the following age categories, which category has the highest mean bodyfat?
Which has the lowest mean bodyfat?

– Group 1: Less than 30 years old

– Group 2: 30 to 40 years old

– Group 3: 41 to 50 years old

– Group 4: More than 50 years old
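
A possible approach to the age-group question is sketched below. It assumes the Excel file can be read with PROC IMPORT (DBMS=XLSX needs SAS/ACCESS Interface to PC Files), that the imported columns are named age and bodyfat as in the description above, and that age is recorded in whole years. The data set names bodyfat and bodyfat2 and the variable agegrp are hypothetical.

```sas
/* read the workbook; the path and the output name are assumptions */
proc import datafile="body fat data.xlsx" out=bodyfat dbms=xlsx replace;
run;

/* build the four age groups listed in the example */
data bodyfat2;
   set bodyfat;
   if age = . then agegrp = .;          /* keep missing ages missing */
   else if age < 30 then agegrp = 1;    /* less than 30 years old */
   else if age <= 40 then agegrp = 2;   /* 30 to 40 years old */
   else if age <= 50 then agegrp = 3;   /* 41 to 50 years old */
   else agegrp = 4;                     /* more than 50 years old */
run;

/* the CLASS statement gives a separate analysis for each age group */
proc univariate data=bodyfat2 normal;
   class agegrp;
   var bodyfat;
run;
```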

Exercise 6.1 Using the diabetes data set and your knowledge about PROC UNIVARIATE,
answer the following guide questions:

• Calculate the mean, median, mode, standard deviation, coefficient of variation,
coefficient of skewness, and coefficient of kurtosis for glucose, diastolic BP, skinfold
thickness, BMI and age.

• Which of the variables from the first item exhibit signs of asymmetry in their distribution?

• Which of the variables from the first item have non-normal distributions? (Use α = 0.05)

• What is the mean age of diabetic and non-diabetic subjects?

• What is the mean glucose level of diabetic and non-diabetic patients?

6.2 PROC MEANS

Most of the statistics generated by the UNIVARIATE procedure can also be produced by
the MEANS procedure. If you want an in-depth look at each of the variables of interest
then use PROC UNIVARIATE. Otherwise, use the MEANS procedure to get only the
statistics you want.
Remarks:

• the general form is PROC MEANS statistic-keywords;

• if you don’t specify any statistic keywords, the MEANS procedure gives the number of
non-missing values, the mean, the standard deviation, the minimum value and the
maximum value for each numeric variable in the data set

• see page 258 of The Little SAS Book for a list of available statistic keywords

• if you specify at least one statistic keyword, only the requested statistics are shown;
the default statistics are not
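
As a small illustration of statistic keywords, the sketch below again uses sashelp.class as a stand-in data set. The keywords chosen here are only one possible selection; MAXDEC=2 limits the printed values to two decimal places.

```sas
/* only the listed statistics are printed, not the five defaults */
proc means data=sashelp.class n mean median std cv skewness kurtosis maxdec=2;
   var height weight;
run;
```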

Example 6.3 (Confidence Limit Example) Consider the data in Picbooks.dat. It
contains data collected by your writer friend, who wants to know how many pages her
children’s book should have. She took a random sample of books at a local library and
counted the number of pages in each.

Example 6.4 Answer Example 6.2 using PROC MEANS. In addition, compute the
following:

• A 95% confidence interval for the mean body fat of the population.

• A 90% confidence interval for the mean body fat of the population.

• Minimum and maximum values for the measurements except for bodyfat.

• The 3rd quartile, 1st decile and 50th percentile of the variable weight.
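
The confidence-interval and percentile items might be approached as in the sketch below. It assumes the body fat data have already been read into a data set named bodyfat (a hypothetical name) containing the variables bodyfat and weight.

```sas
/* 95% confidence limits for the mean; ALPHA=0.05 is the default */
proc means data=bodyfat n mean clm;
   var bodyfat;
run;

/* 90% confidence limits: set ALPHA=0.10 */
proc means data=bodyfat n mean clm alpha=0.10;
   var bodyfat;
run;

/* third quartile, first decile and 50th percentile of weight */
proc means data=bodyfat q3 p10 p50;
   var weight;
run;
```

The same pattern with a different ALPHA= value gives intervals at other confidence levels.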

Exercise 6.2 Using the data from Exercise 6.1, answer the following questions:

• Calculate the mean, median, mode, standard deviation, coefficient of variation,
coefficient of skewness, and coefficient of kurtosis for glucose, diastolic BP, skinfold
thickness, BMI and age.

• What is the mean age and mean glucose level of diabetic and non-diabetic subjects?

• Compute a 93% CI for the mean glucose level of diabetics and non-diabetics.

6.3 PROC TTEST

PROC TTEST performs t-tests for comparing means. As statistics students, you know
that there are three possible comparisons:

• one-sample comparison

• two independent samples comparison

• paired comparison

The TTEST procedure allows users to specify the following options in the PROC statement:

• ALPHA=α specifies the significance level (default is 0.05 if you omit this option);
confidence intervals are reported at the 100(1 − α)% level

• H0=µ0 requests a test of the null hypothesis H0: µ = µ0. The default value is 0.

• SIDES=type specifies whether the p-value and confidence interval are two-sided
or one-sided. For one-sided, specify the type as L (lower one-sided) or U (upper
one-sided). For two-sided, specify 2.

6.3.1 One-sample comparison

In doing a one-sample t-test, use the following form of the TTEST procedure:

PROC TTEST H0=x options;
VAR variable;
RUN;

where x is the null value of the population mean µ.

Example 6.5 A researcher believes that in recent years women have been getting taller.
She knows that 10 years ago the average height of young adult women living in her city
was 63 inches. The standard deviation is unknown. She randomly samples eight young
adult women currently residing in her city and measures their heights. The following are
the observed values:
64, 66, 68, 60, 62, 65, 66, 63
Test the researcher’s claim at the 0.05 level of significance in SAS.
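
A minimal sketch for Example 6.5 follows, using the eight heights listed above. The data set name heights and the variable name height are assumptions. Because the claim is that women are now taller than 63 inches on average, the sketch tests H0: µ = 63 against the upper one-sided alternative µ > 63.

```sas
data heights;
   input height @@;   /* @@ reads all eight values from one line */
   datalines;
64 66 68 60 62 65 66 63
;
run;

/* upper one-sided test of H0: mu = 63 at the 0.05 level */
proc ttest data=heights h0=63 sides=u alpha=0.05;
   var height;
run;
```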

Exercise 6.3 The Imaginex Corporation owns a factory that produces sulfuric acid, used
primarily in automobile batteries. Because of changing conditions, the plant’s output is
quite variable. The company’s president has observed that the output is normally
distributed with a mean of 8,200 liters per hour. He recently has been informed that the
government is considering a new law that would require him to alter the way in which the
chemical is manufactured. Unsure about whether he should lobby against the legislation,
he undertakes an experiment. He reorganizes production facilities to comply with the
proposed law and observes the hourly output for two working days (16 hours). The data are
in Hourly Output.txt. Do these data indicate that the new legislation will reduce output?
A significance level of 5% is considered appropriate.

6.3.2 Two independent samples comparison

The format of the TTEST procedure for an independent samples t-test is

PROC TTEST options;
CLASS variable;
VAR variable;
RUN;

The CLASS statement tells SAS which variable indicates the grouping of the observations.
The VAR statement names the variable to be tested.
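
To illustrate the form above, the sketch below compares mean height between the two levels of sex in sashelp.class, a sample data set that ships with SAS. It stands in for your own data and is not one of the course data sets.

```sas
proc ttest data=sashelp.class alpha=0.05 sides=2;
   class sex;      /* grouping variable with exactly two levels (F and M) */
   var height;     /* variable whose group means are compared */
run;
```

For a one-sided test, keep in mind that by default the difference is taken as the first CLASS level minus the second in sorted order (F minus M here), so SIDES=L and SIDES=U refer to that difference.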

Example 6.6 (Lifespans of Rats (in Days) Given Two Diets) This dataset involves
the lifespan of two groups of rats, one group given a restricted diet and the other an ad
libitum diet (i.e., a free-eating diet). The research aims to determine whether lifespan is
affected by diet.
Set up the hypothesis test for this research problem.

Example 6.7 (One-Sided Test) Despite some controversy, scientists generally agree
that high-fiber cereals reduce the likelihood of various forms of cancer. However, one
scientist claims that people who eat high-fiber cereal for breakfast will consume, on average,
fewer calories for lunch than people who don’t eat high-fiber cereal for breakfast. If this is
true, high-fiber cereal manufacturers will be able to claim another advantage of eating their
product: potential weight reduction for dieters. As a preliminary test of the claim, 30 people
were randomly selected and asked what they regularly eat for breakfast and lunch. Each
person was identified as either a consumer or a nonconsumer of high-fiber cereal, and the
number of calories consumed at lunch was measured and recorded. These data are in
calories.txt. Can the scientist conclude at the 0.05 level of significance that his belief is
correct?

6.3.3 Paired comparisons

Use this format for performing a paired t-test in SAS:

PROC TTEST options;
PAIRED variable1*variable2;
RUN;

Example 6.8 Consider the data in Olympic50mSwim.dat. It gives the finishing times
for semifinal and final races of the women’s 50 meter freestyle swim. Each swimmer’s
initials are followed by their final time and semifinal time in seconds. Is the semifinal time
significantly different from the final time? Test this hypothesis at the 0.10 level of significance.
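
One way Example 6.8 might be set up is sketched below. The file path and the names swim, swimmer, finaltime and semifinaltime are assumptions. The trailing @@ assumes that each line of the file holds several swimmers, which the description does not state; drop it if there is one swimmer per line.

```sas
data swim;
   infile 'Olympic50mSwim.dat';   /* assumed path */
   /* initials, then final time, then semifinal time, as described above */
   input swimmer $ finaltime semifinaltime @@;
run;

/* paired t-test on the difference semifinaltime - finaltime, alpha = 0.10 */
proc ttest data=swim alpha=0.10;
   paired semifinaltime*finaltime;
run;
```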

Exercise 6.4 A neuroscientist believes that the lateral hypothalamus is involved in eating
behaviour. If so, then electrical stimulation of that area might affect the amount eaten.
To test this possibility, chronic indwelling electrodes are implanted in 10 rats. Each rat
has two electrodes: one implanted in the lateral hypothalamus and the other in the area
where electrical stimulation is known to have no effect. After the animals have recovered
from surgery, they each receive 30 minutes of electrical stimulation to each brain area,
and the amount of food eaten during the stimulation is measured.

Exercise 6.5 A physiologist has the hypothesis that hormone X is important in producing
sexual behaviour. In particular, the physiologist believes that hormone X increases sexual
behaviour. To investigate this hypothesis, 20 male rats were randomly sampled and then
randomly assigned to two groups. The animals in group 1 were injected with hormone X
and then were placed in individual housing with a sexually receptive female. The animals
in group 2 were given similar treatment except they were injected with a placebo solution.
The number of matings was counted over a 20-minute period. The results are shown in
Hormone X.sav. Test the physiologist’s claim at the 0.05 level of significance.

6.4 PROC FREQ

PROC FREQ is used to analyze categorical data (nominal or ordinal level of measurement).
Of particular importance is the chi-square test, one of the analyses available in
PROC FREQ. The basic form of PROC FREQ is

PROC FREQ;
TABLES variable-combinations / options;
RUN;

NOTE: The only option covered in this class is CHISQ, which requests chi-square tests
of independence.

Example 6.9 One day your neighbor, who rides the bus to work, complains that the
regular bus is usually late. He says the express bus is usually on time. Realizing that this
is categorical data, you decide to test whether there really is a relationship between the
type of bus and arriving on time. You collect data for type of bus (E for express or R for
regular) and promptness (L for late or O for on time). Each line of data contains several
observations. The data are in Bus.dat.
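
A sketch for Example 6.9 follows. The file path and the names bus, bustype and ontimeorlate are assumptions, as is the order of the two fields (type of bus first, then promptness). The @@ is used because each line holds several observations.

```sas
data bus;
   infile 'Bus.dat';                    /* assumed path */
   input bustype $ ontimeorlate $ @@;   /* E or R, then L or O */
run;

/* two-way table of bus type by promptness, with a chi-square test */
proc freq data=bus;
   tables bustype*ontimeorlate / chisq;
run;
```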

Example 6.10 (using weights in PROC FREQ) Suppose that subjects are classified
according to levels of two variables A and B. The row variable A has three categories:
a1, a2, a3 and the column variable B has three categories: b1, b2, b3. Consider the following
table of frequencies:

         b1    b2    b3   Total
a1        8    16    31      55
a2        9    18    74     101
a3       34    23    17      74
Total    51    57   122     230
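
When only the cell counts are available, each cell can be entered as one record and the count supplied through the WEIGHT statement. The sketch below uses the nine cell frequencies from Example 6.10; the data set name abtable and the variable names a, b and count are assumptions.

```sas
data abtable;
   input a $ b $ count;   /* one record per cell of the table */
   datalines;
a1 b1 8
a1 b2 16
a1 b3 31
a2 b1 9
a2 b2 18
a2 b3 74
a3 b1 34
a3 b2 23
a3 b3 17
;
run;

proc freq data=abtable;
   weight count;          /* each record stands for count subjects */
   tables a*b / chisq;    /* chi-square test of independence */
run;
```

The marginal totals are not entered; PROC FREQ computes them from the cell counts.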

6.5 PROC ANOVA

The ANOVA procedure requires two statements: the CLASS and MODEL statements.
The basic form is given by:

PROC ANOVA;
CLASS variable-list;
MODEL dependent=effects;
RUN;

Remarks:

• the CLASS statement must come before the MODEL statement; it works like
the CLASS statement in the TTEST procedure

• for one-way ANOVA the effect is the classification variable

• MEANS is the only optional statement we are going to use

Example 6.11 Your daughter plays basketball on a team that travels throughout the
state. She complains that it seems like the girls from the other regions in the state are
all taller than the girls from her region. You decide to test her hypothesis by getting the
heights for a sample of girls from the four regions and performing one-way analysis of
variance to see if there are any differences. Each data line includes region and height for
eight girls. The data are in GirlHeights.dat. Use a 0.05 level of significance.
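
Example 6.11 might be set up as in the sketch below. The file path and the names heights, region and height are assumptions. The @@ reflects the statement that each data line holds eight region and height pairs. The SCHEFFE option on the MEANS statement is one possible choice of multiple-comparison method, not one prescribed by these notes.

```sas
data heights;
   infile 'GirlHeights.dat';   /* assumed path */
   input region $ height @@;   /* eight region/height pairs per line */
run;

proc anova data=heights;
   class region;               /* must come before MODEL */
   model height = region;      /* one-way ANOVA: the effect is the class variable */
   means region / scheffe;     /* group means with Scheffe comparisons */
run;
quit;                          /* PROC ANOVA stays active until QUIT */
```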

6.6 PROC CORR

Use PROC CORR to evaluate the strength and direction of the linear relationship between
pairs of variables. The basic form is:

PROC CORR PLOTS=(plot-list);
VAR variable-list;
WITH variable-list;
RUN;

Remarks:

• use SCATTER in the plot-list to create scatter plots for pairs of variables

• omitting the VAR and WITH statements computes correlations between all pairs of
numeric variables

Example 6.12 Suppose that for each student in a statistics class, data on test score, the
number of hours spent watching television in the week prior to the test, and the number of
hours spent exercising during the same week were recorded. The data are in Exercise.dat.
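
A sketch for Example 6.12 follows. The layout of Exercise.dat is not described here, so the sketch assumes one student per line with the test score, television hours and exercise hours in that order. The names class, score, television and exercise are hypothetical.

```sas
data class;
   infile 'Exercise.dat';             /* assumed path and layout */
   input score television exercise;   /* one student per line (assumption) */
run;

ods graphics on;                      /* PLOTS= needs ODS Graphics */

/* correlation of score with each of the other two variables, plus scatter plots */
proc corr data=class plots=(scatter);
   var television exercise;
   with score;
run;
```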

6.7 PROC REG

The REG procedure fits linear regression models. The details of linear regression analysis
will not be discussed, but the interpretation of the output is covered. The basic form
of the REG procedure is

PROC REG;
MODEL dependent=independent;
RUN;

Example 6.13 Let us fit a linear regression model using the data used in PROC CORR.
Let the test score be a linear function of the other two variables.
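
Example 6.13 might look like the sketch below. As before, the layout of Exercise.dat and the names class, score, television and exercise are assumptions rather than details given in these notes.

```sas
data class;
   infile 'Exercise.dat';             /* assumed path and layout */
   input score television exercise;   /* one student per line (assumption) */
run;

/* test score as a linear function of the other two variables */
proc reg data=class;
   model score = television exercise;
run;
quit;                                 /* PROC REG stays active until QUIT */
```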

Example 6.14 At your young neighbor’s T-ball game (that’s where the players hit the ball
from the top of a tee instead of having the ball pitched to them), he said to you, “You can
tell how far they’ll hit the ball by how tall they are.” To give him a little practical lesson in
statistics, you decide to test his hypothesis. You gather data from 30 players, measuring
their height in inches and their longest of three hits in feet. The data are in Baseball.dat.
