
Adam Sweeney

Math 1040
April 14, 2018
Spring 2018 Term Project – Skittles Candy Analysis

To facilitate the class’s understanding of concepts taught throughout the semester, our

class performed several statistical analyses on Skittles candies. This work was performed both

individually and as a part of groups which enabled a student to practice the concepts covered in

the lesson material, but also encouraged discussion and validation with other members of the

class. The steps of the project were staggered so that they corresponded to the ideas,

methodologies, and graphs taught up to that point in the semester. For example, after we

covered different methods of sampling, the first part of the project involved class members

obtaining a simple random sample of candies from 2.17-ounce bags of Skittles (Original). This

resulted in 74 good samples (two were removed since they were clear outliers in the data) and

was detailed as follows:

Part 1

Modified Skittle Data Spring 2018 Math 1040


ID  Red Count  Orange Count  Yellow Count  Green Count  Purple Count  Total Candies in Each Bag
1 12 10 6 23 12 63
2 10 18 8 11 13 60
3 14 12 13 9 13 61
4 13 10 13 10 12 58
5 13 8 14 17 10 62
6 9 16 12 12 9 58
7 11 11 9 17 14 62
8 12 9 11 13 14 59
9 16 11 13 15 10 65
10 16 9 14 13 9 61
11 12 11 17 12 8 60
12 12 13 11 12 15 63
13 21 15 10 7 6 59
14 11 15 13 11 9 59
15 20 11 11 10 7 59
16 16 12 14 13 10 65
17 11 9 5 16 16 57
18 15 11 12 15 8 61
19 16 11 9 6 18 60
20 5 13 12 9 20 59
21 18 5 12 12 13 60
22 13 13 17 10 9 62
23 12 15 7 14 11 59
24 19 7 13 13 10 62
25 12 12 14 9 11 58
26 11 12 15 15 9 62
27 13 15 15 7 11 61
28 11 7 12 11 16 57
29 13 10 14 12 8 57
30 7 17 14 7 16 61
31 10 7 14 13 9 53
32 9 12 12 12 14 59
33 10 8 13 12 15 58
34 15 12 9 21 5 62
35 8 12 17 15 8 60
36 6 11 10 13 14 54
37 17 6 11 12 16 62
38 9 15 8 17 12 61
39 10 15 12 13 9 59
40 9 15 15 9 12 60
41 11 14 14 10 13 62
42 10 12 13 11 12 58
43 9 14 6 11 20 60
44 13 7 11 9 15 55
45 6 19 12 9 11 57
46 19 11 15 8 5 58
47 13 15 5 10 16 59
48 11 17 7 14 9 58
49 11 17 11 12 10 61
50 10 10 14 11 13 58
51 12 11 7 15 9 54
52 10 15 15 10 10 60
53 11 9 12 12 8 52
54 12 12 12 18 7 61
55 11 11 14 11 13 60
56 13 11 17 12 8 61
57 12 12 16 12 13 65
58 12 9 18 9 7 55
59 11 17 15 4 16 63
60 16 6 13 14 11 60
61 14 10 9 17 10 60
62 11 13 10 10 12 56
63 11 8 17 7 14 57
64 12 8 15 10 13 58
65 13 13 14 6 13 59
66 12 16 10 14 5 57
67 11 14 8 7 13 53
68 16 11 13 15 10 65
69 15 10 12 9 10 56
70 12 11 12 12 13 60
71 8 15 12 16 7 58
72 11 15 14 7 10 57
73 10 9 17 19 8 63
74 7 11 9 20 13 60

The second part involved a discussion as a group of the expected results, a comparison

of this expectation to the observed results, graphing of the data using both a Pareto chart and a

Pie chart, as well as a discussion of whether the data represents a simple random sampling:

Part 2 - Group

The expected proportions/percentages for Red, Orange, Yellow, Green, and Purple are 20%
each. This is based on the assumption the colors have even chances of appearing. In reality,
even though the Skittles are distributed by standardized processes and machinery, variability
will, to some extent, still be introduced. Therefore, it is highly unlikely each color will account for
exactly 20% in each bag.
Color                 Red     Orange  Yellow  Green   Purple
Expected Proportion   20.0%   20.0%   20.0%   20.0%   20.0%
Observed Proportion   20.3%   19.9%   20.5%   20.2%   19.1%

1) Pareto Chart

Pie Chart

Yes, the data represents a random sampling of 2.17-ounce bags of Skittles, at least within the
Salt Lake City, Utah area. The population represented by this sample is all 2.17-ounce bags of
Skittles available for purchase. The bags were presumably purchased from various (and
somewhat unique) stores by each member of the class, though likely these stores were
conveniently accessible for each student. The results could perhaps be distorted if the production
process, delivery process, or availability of 2.17-ounce bags of Skittles were different for this
geographic region and, in particular, for the stores that were most convenient to the students. A
likely better representation of the population would be to purchase 2.17-ounce bags of Skittles
from different geographical locations and from different stores, varying days and times of
purchase leading up to the assignment. This would probably provide a better sampling of the
population since this increases the chances of purchasing bags of Skittles from different
production groups.

This was followed by an individual comparison of the class’s data to each student’s individual

data sample, as follows:


Part 2 – Individual

Personal Skittle Count compared to Class Skittle Count

Count Red Count Orange Count Yellow Count Green Count Purple Total Count
My Bag 13 (21%) 8 (12.9%) 14 (22.6%) 17 (27.4%) 10 (16.1%) 62
Class Counts 893 (20.3%) 874 (19.9%) 900 (20.5%) 889 (20.2%) 838 (19.1%) 4394

The graphs (and information presented in the tables) of the class data essentially match what I
expected to see regarding each color’s count approximating 20% of the total (within a 1%
margin of error). I note that the sample from my personal bag of Skittles varied quite a bit more,
over 7% different in a couple of cases. I believe that the class counts benefit from a wider
sampling, especially since one sample is almost certainly not sufficient for gathering the
appropriate data. The class counts appear to have fairly consistent numbers, with the exception
of the two entries where a significant variance occurred (~630% more than “usual” in one of the
cases). These outliers would potentially skew the proportions if included, especially the case
with 106 Skittles, 58 of which were purple. This proportion of ~55% is nearly triple the class
average (not including this case) and so would inflate the purple Skittle proportion. This
emphasizes to me the importance of doing everything possible to eliminate “bad data” that
could skew results, as well as the importance of acquiring a good sample of data to better
illustrate the behavior of the population.
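
The comparison above can be checked with a short calculation. The following is a minimal Python sketch that is not part of the original assignment; the counts are copied from the table above, and the function and variable names are illustrative only.

```python
# Counts copied from the "Personal Skittle Count compared to Class Skittle Count" table.
COLORS = ["Red", "Orange", "Yellow", "Green", "Purple"]
my_bag = dict(zip(COLORS, [13, 8, 14, 17, 10]))
class_counts = dict(zip(COLORS, [893, 874, 900, 889, 838]))
EXPECTED = 0.20  # assumes all five colors are equally likely


def proportions(counts):
    """Return each color's share of the total count."""
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


def max_deviation(counts):
    """Largest absolute gap between an observed share and the expected 20%."""
    return max(abs(p - EXPECTED) for p in proportions(counts).values())


for label, counts in (("My bag", my_bag), ("Class", class_counts)):
    shares = ", ".join(f"{c} {p:.1%}" for c, p in proportions(counts).items())
    print(f"{label}: {shares} (max deviation {max_deviation(counts):.1%})")
```

Run as written, every class proportion stays within one percentage point of 20%, while the Green and Orange shares of the single bag each differ from 20% by more than seven points.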

Later in the semester, the project groups performed a more detailed statistical breakdown of

the class totals. This included determining the mean, standard deviation, minimum, median,

and maximum values of the data, as well as identification of the first and third quartiles. This

breakdown was accompanied by a frequency histogram and box plot of candy counts per bag:

Part 3 – Group

Measures for Total Candies in Each Bag:


a. Mean Number of Candies per Bag: 59.4
b. Std. Deviation of Number of Candies per Bag: 2.8
c. 5-Number Summary for Number of Candies per Bag:
i. 52 (Min)
ii. 58 (Q1)
iii. 60 (Median)
iv. 61 (Q3)
v. 65 (Max)
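
These measures can be reproduced from the "Total Candies in Each Bag" column of the Part 1 table. The sketch below uses Python's standard `statistics` module rather than whatever software the group actually used, so it should be read as a cross-check under that assumption. Quartile conventions differ between tools; the default "exclusive" method of `statistics.quantiles` is one common choice.

```python
import statistics

# "Total Candies in Each Bag" column from the Part 1 table, bags 1 through 74.
totals = [
    63, 60, 61, 58, 62, 58, 62, 59, 65, 61,
    60, 63, 59, 59, 59, 65, 57, 61, 60, 59,
    60, 62, 59, 62, 58, 62, 61, 57, 57, 61,
    53, 59, 58, 62, 60, 54, 62, 61, 59, 60,
    62, 58, 60, 55, 57, 58, 59, 58, 61, 58,
    54, 60, 52, 61, 60, 61, 65, 55, 63, 60,
    60, 56, 57, 58, 59, 57, 53, 65, 56, 60,
    58, 57, 63, 60,
]

mean = statistics.mean(totals)
std_dev = statistics.stdev(totals)  # sample standard deviation (n - 1 denominator)
q1, median, q3 = statistics.quantiles(totals, n=4)  # default "exclusive" method
five_number = (min(totals), q1, median, q3, max(totals))

print(f"mean = {mean:.1f}, s = {std_dev:.1f}")
print("five-number summary:", five_number)
```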

Individually, students were asked to answer a question regarding the findings of the variable

“Total candies in each bag” as well as to write a paragraph explaining the difference between

quantitative data and qualitative (or categorical) data. My responses are below:

Part 3 – Individual

Total Candies in Each Bag Response

The findings regarding the variable “Total candies in each bag” generally follow what I would
expect: the total Skittles in each bag, while having some variance/outliers in count, would most
frequently be near the average (~59 Skittles per bag). This is represented in both the bell-shaped
frequency histogram and the “centered” (i.e. approximately equal whisker length, bell-shaped data)
box diagram. This is also represented by the mean and median number of candies per bag being very
close to equal in value (less than one Skittle difference). The histogram shows the majority of bags
contain a number of candies within one standard deviation (~3 Skittles) of the average. This is
corroborated by the box diagram, which shows that the majority of bags fall very close to the
mean and median with only two outliers falling outside the lower fence. These findings are
supported by the 62 Skittles found in my own bag, a count that is within one standard deviation of
the average produced by the 74 total bag counts of the class.
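
The two claims about distance from the mean can be checked numerically. This is an illustrative Python sketch, not part of the original response; the frequency table is tallied from the "Total Candies in Each Bag" column in Part 1, and the variable names are my own.

```python
import statistics

# Number of bags observed at each total, tallied from the Part 1 table.
bags_per_total = {
    52: 1, 53: 2, 54: 2, 55: 2, 56: 2, 57: 7, 58: 10,
    59: 10, 60: 13, 61: 9, 62: 8, 63: 4, 65: 4,
}
totals = [total for total, bags in bags_per_total.items() for _ in range(bags)]

mean = statistics.mean(totals)
std_dev = statistics.stdev(totals)  # sample standard deviation

# Count the bags whose total lies within one standard deviation of the mean.
within_one_sd = sum(1 for t in totals if abs(t - mean) <= std_dev)
share_within = within_one_sd / len(totals)

my_bag = 62
z_score = (my_bag - mean) / std_dev  # distance from the mean in standard deviations

print(f"{within_one_sd} of {len(totals)} bags ({share_within:.0%}) are within one s of the mean")
print(f"my bag z-score: {z_score:.2f}")
```

Under these counts, 57 of the 74 bags fall within one standard deviation of the mean, and the 62-candy bag sits a little under one standard deviation above it.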

Quantitative and Qualitative Data Response


Categorical data (which could also be defined as qualitative data) is data that cannot be added. This
data provides descriptions of observations which can be broken into categories. Examples could
include political party, type of car, favorite food, or species of animal. Categorical data can be
counted within their categories and averaged against a total. For example, the number of red Skittles
in a bag out of the total Skittles in a bag. This kind of data lends itself to a pie chart or a bar chart
where category “size” can be compared to a total. Scatter plots, histograms, and boxplots would not
be able to meaningfully display this data. For example, if a bag of Skittles contained 15 Skittles of
each color except Green which had 2, a box plot would show this as an outlier when it’s a completely
different category than the others. Contrast this to a pie chart comparing relative frequency between
categories to the total number of Skittles which would represent a meaningful proportion.

Quantitative data can have arithmetic operations performed on it to provide further meaning.
Examples could include a grade point average, salary earned, number of cats owned, or counts of
Skittles in specific size bags. A mean, median, mode, standard deviation, and quartile can be
calculated for this data. For example, the total number of Skittles in a bag and for several bags could
be averaged (mean) or observed to determine if one (or several) totals are repeated more than
others (mode). This kind of data lends itself to scatter plots (e.g. time-series), histograms, and box
plots where trends can be observed. The Skittle example can be plotted in a histogram to present
potential even variance in totals (evenly distributed) or if a majority falls near a consistent number
with a minority on either side (bell-shaped) and so on. Presenting this data in a pie chart would not
present any meaningful information as there is no “grand total” to compare the bag totals against.
A few weeks later, after the class discussed the concept of confidence intervals, groups were

asked to construct and interpret a 99% confidence interval estimate for the population

proportion of yellow candies. They were also asked to construct and interpret a 90% confidence

interval estimate for the population mean number of candies per bag.

Part 4 – Group

99% Confidence Interval for the Population Proportion of Yellow Candies:


a. Sample proportion of Yellow Candies (p̂):
vi. $\hat{p} = \frac{x}{n}$, where x = 900 and n = 4394
vii. $\hat{p} = \frac{900}{4394} = 0.205$ or 20.5%
b. Since we have the sample proportion, we will construct a confidence interval for
a population proportion (p).
c. The three requirements that must be met to construct a confidence interval for a
population proportion are:
viii. The sample was obtained through a simple random sample since
several students obtained a 2.17-ounce bag of Skittles from various and
(at least somewhat) unique locations.
ix. $n\hat{p}(1 - \hat{p}) \geq 10$, where n = 4394, p̂ = 0.205, and 1 − p̂ = 0.795
1. $4394 \times 0.205(1 - 0.205) = 716.112$
2. $716.112 \geq 10$ Verified
x. Skittles were sampled from 74 bags out of millions sold. It is therefore
reasonable to assume that the sample size is less than 5% of the
population size ($n \leq 0.05N$).
d. 99% Confidence Interval is (0.189, 0.221).
xi. Lower and Upper bounds: $\hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, where α = 0.01 and $z_{0.01/2} = 2.5758$
1. Lower: $0.205 - 2.5758 \times \sqrt{\frac{0.205(1-0.205)}{4394}} = 0.189$
2. Upper: $0.205 + 2.5758 \times \sqrt{\frac{0.205(1-0.205)}{4394}} = 0.221$
e. The margin of error is equal to $\frac{\text{upper limit} - \text{lower limit}}{2}$
xii. $\frac{0.221 - 0.189}{2} = 0.016$ or 1.6%

This confidence interval, 0.205 ± 0.016, indicates that if a large number of different
samples is obtained, we expect 99% of intervals will encapsulate the population proportion of
Yellow Candies out of all Candies.
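
As a cross-check of the arithmetic above, the interval can be recomputed in a few lines. This is a minimal Python sketch that assumes the same normal-approximation formula used by the group; the function name `proportion_interval` is illustrative, and the critical value comes from the standard library's `NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Normal-approximation (Wald) interval for a population proportion."""
    p_hat = successes / n
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


# Yellow candies out of all candies in the class sample.
lower, upper, margin = proportion_interval(900, 4394, 0.99)
print(f"99% CI: ({lower:.3f}, {upper:.3f}), margin of error {margin:.3f}")
```

The sketch keeps the unrounded sample proportion 900/4394 instead of 0.205, which does not change the bounds at three decimal places.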
90% Confidence Interval for the Population Mean Number of Candies per Bag:
a. Sample mean Number of Candies per Bag (x̄):
xiii. $\bar{x} = \frac{\text{Candies in each bag}}{\text{Number of bags}} = \frac{4394}{74} = 59.4$ Candies per Bag
b. Since we have the sample mean, we will construct a confidence interval for a
population mean (μ).
c. The two requirements that must be met to construct a confidence interval for a
population mean are:
xiv. The sample was obtained through a simple random sample since
several students obtained a 2.17-ounce bag of Skittles from various and
(at least somewhat) unique locations.
xv. $n = 74 \geq 30$ Verified
d. 90% Confidence Interval is (58.8, 59.9).
xvi. Lower and Upper bounds: $\bar{x} \pm t_{\alpha/2} \times \frac{s}{\sqrt{n}}$, where α = 0.10, $t_{0.10/2} = 1.6660$, and $s = 2.812412$
1. Lower: $59.4 - 1.6660 \times \frac{2.812412}{\sqrt{74}} = 58.8$
2. Upper: $59.4 + 1.6660 \times \frac{2.812412}{\sqrt{74}} = 59.9$
e. The margin of error is equal to $\frac{\text{upper limit} - \text{lower limit}}{2}$
xvii. $\frac{59.9 - 58.8}{2} = 0.55$ or 0.55 Candies per Bag

This confidence interval, 59.4 ± 0.55, indicates that if a large number of different samples is
obtained, we expect 90% of intervals will encapsulate the population mean Number of
Candies per Bag.
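
The same kind of cross-check works for the mean. The sketch below is illustrative Python, not the group's actual workflow. It takes the critical value 1.6660 and the standard deviation 2.812412 directly from the calculation above, since the standard library has no t-distribution, and it assumes that critical value corresponds to n − 1 = 73 degrees of freedom. It also uses the unrounded sample mean 4394/74 rather than 59.4.

```python
from math import sqrt


def mean_interval(x_bar, s, n, t_crit):
    """t-interval for a population mean: x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin, margin


n = 74
x_bar = 4394 / n  # unrounded sample mean, about 59.38
s = 2.812412      # sample standard deviation reported above
t_crit = 1.6660   # t_{0.10/2} as reported above (assumed to use 73 degrees of freedom)

lower, upper, margin = mean_interval(x_bar, s, n, t_crit)
print(f"90% CI: ({lower:.1f}, {upper:.1f}), margin of error {margin:.2f}")
```

Computed directly from the formula, the margin of error is roughly 0.54; the 0.55 above comes from halving the difference of the already rounded bounds.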

Individuals were asked to generally explain the purpose and meaning of a confidence interval.

My response was as follows:

Part 4 – Individual

A confidence interval is used to represent an estimate of a characteristic of a population (e.g. the


proportion or the mean) with an associated level of confidence. This interval is based upon the
sample proportion/mean plus or minus a margin of error. The level of confidence represents the
expected proportion of intervals that would include the true proportion or mean. For example, a 90%
confidence level would mean that if a population was sampled in a similar manner numerous times,
and an interval was calculated for each sample, it is expected that 90% of the intervals would
encapsulate the true proportion or mean. This also means that 10% of the time it is expected the
intervals would not capture the true proportion or mean. It is worth noting that as the confidence
level increases or as the sample size decreases, the interval’s width will increase (i.e. the
lower bound will be lower, the upper bound will be higher, and the margin of error will increase). The
opposite holds true in that as the confidence level decreases or as the sample size increases,
the interval’s width will decrease.
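
Both ideas in this response, the effect of confidence level and sample size on width and the long-run meaning of the confidence level, can be illustrated with a small simulation. This Python sketch is my own illustration under stated assumptions: a hypothetical population in which exactly 20% of candies are yellow, samples of 500 candies, and a normal-approximation interval for a proportion. None of these values come from the class data.

```python
import random
from math import sqrt
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence):
    """Margin of error of a normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)


def coverage(true_p, n, confidence, trials, seed=0):
    """Fraction of simulated intervals that contain the true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        if abs(p_hat - true_p) <= margin_of_error(p_hat, n, confidence):
            hits += 1
    return hits / trials


# The margin grows with the confidence level and shrinks with the sample size.
print(margin_of_error(0.2, 500, 0.90), margin_of_error(0.2, 500, 0.99))
print(margin_of_error(0.2, 500, 0.90), margin_of_error(0.2, 2000, 0.90))

# Hypothetical population with 20% yellow candies, sampled 2,000 times.
print(coverage(true_p=0.2, n=500, confidence=0.90, trials=2000))
```

The simulated coverage should land close to 0.90, which is the sense in which about 90% of such intervals are expected to capture the true proportion.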

Ultimately, the term project culminated in each student writing a reflection essay about

concepts learned throughout the semester. This reflection could cover topics including what

the student learned, how mathematics and statistics skills will impact future classes in the

student’s school career, how the project helped to develop the student’s problem solving skills,

among other topics.

Part 5 – Reflection

I believe one of the key takeaways from this semester is that statistical analysis can be applied to

a very diverse repertoire of problems or situations. As demonstrated in the variety of examples provided

in this course’s material, statistical analysis is used in an attempt to provide insight into the populations

we are a part of. This includes everything from the likelihood of a candy bar being within a margin of

error of average weight to correlating gun-related incidents during a period of change in firearm

legislation. This information can influence decisions that have significant impact on people’s lives.

Examples include new initiatives a business is considering (and thereby an employer’s potential success

or failure), the lawmakers we vote for (and thereby the laws and policies we abide by), or even the

viewership of television shows (and thereby the longevity of one of our recreational avenues). I have

been aware of the use of statistics throughout my life, but I believe I better understand the breadth of

applications this analysis is useful for.

Another takeaway from this semester is how these analyses are performed and how inferences

are made. I have previously been rather dubious about the authenticity or accuracy of statistics as they

often appear skewed to sell an argument. I would recall the old joke, “[Insert random percentage here]%
of all statistics are made up”. I believe this course has prepared me to both better recognize potentially

skewed data, as well as to better understand and trust carefully performed analyses. In particular,

the group project throughout the semester has made me better aware of the need for consistent scales,

appropriate graphical representations of different types of data, the influence of outliers, and the

importance of proper modeling of data distributions. I believe I can leverage this information in my own

work and better appreciate the need and use of it in the world around me.

This project really helped reinforce the concepts taught this semester. We were able to

obtain a sample and walk through the different levels of analysis taught, including identifying

the mean and median (and understanding when to use which), and developing confidence

intervals. The group work encouraged discussion of topics and provided opportunities for

students to clarify concepts to each other. This practice, in particular, helped me to ensure I

was secure in my understanding of confidence intervals because of my desire to answer

questions for others. I believe this project was a practical way of ensuring students remained

involved all semester long.
