
inferential_statistics_project1

July 3, 2017

1 What is the True Normal Human Body Temperature?


Background The mean normal body temperature was held to be 37 °C (98.6 °F) for more than
120 years after it was first conceptualized and reported by Carl Wunderlich in a famous 1868
book. But is this value statistically correct?

In [36]: # libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pylab

# colors
green = '#7fc97f'
purple = '#beaed4'
orange = '#fdc086'
yellow = '#ffff99'
blue = '#386cb0'

df = pd.read_csv('data/human_body_temperature.csv')

In [16]: df.head()

Out[16]: temperature gender heart_rate


0 99.3 F 68.0
1 98.4 F 81.0
2 97.8 M 73.0
3 99.2 F 66.0
4 98.0 F 73.0

1.0.1 1. Check Normality


In [33]: # histogram
plt.hist(df.temperature, color = blue)
plt.xlabel("Body Temperature")
plt.title("Histogram of Body Temperature")

plt.show()
plt.clf()

According to the histogram, the distribution looks approximately normal, with a slight skew to
the right. We can check this further using a Q-Q plot.

In [40]: # use qqplot to check normality


stats.probplot(df.temperature, dist="norm", plot=pylab)
pylab.title("QQplot of Body Temperature")
pylab.show()

According to the Q-Q plot, the distribution is approximately normal.
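A formal test can complement the visual checks. The following is a minimal sketch, not part of the original analysis: it uses synthetic data as a stand-in for df.temperature (the mean 98.25, standard deviation 0.73 and seed are assumptions chosen only for illustration) and applies the D'Agostino-Pearson test from scipy.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df.temperature (assumption: the real CSV is not loaded here).
rng = np.random.default_rng(42)
sample = rng.normal(loc=98.25, scale=0.73, size=130)

# D'Agostino-Pearson test: H0 is that the sample comes from a normal distribution.
statistic, p_value = stats.normaltest(sample)

# A large p-value means we fail to reject normality; it does not prove normality.
fail_to_reject = p_value > 0.05
print(f"statistic={statistic:.3f}, p-value={p_value:.3f}, "
      f"fail to reject normality: {fail_to_reject}")
```

On the real data, the same call would be stats.normaltest(df.temperature).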

1.0.2 2. Is the sample size large? Are the observations independent?


In [47]: # test sample size
len(df.temperature)

Out[47]: 130

Since the sample size is n = 130 > 30, we conclude that it is large enough.
We treat the observations as independent, assuming each row is a measurement from a different person.

1.0.3 3. Is the true population mean really 98.6 degrees F?


H0 : The true population mean is 98.6 degrees F (μ = 98.6).
H1 : The true population mean is not 98.6 degrees F (μ ≠ 98.6).
We will use a two-tailed test because the alternative hypothesis is "not equal" rather than
"greater than" or "smaller than".
For a sample this large, the t test and the Z test give nearly identical results. In this case,
since we do not know the variance of the population, we will use the t test.

In [51]: sample_mean = df.temperature.mean()


sample_std = df.temperature.std()
[sample_mean, sample_std]

Out[51]: [98.24923076923078, 0.7331831580389454]

In [104]: # the result holds both the t statistic and the p-value
t_result = stats.ttest_1samp(df.temperature, popmean = 98.6)
t_result

Out[104]: Ttest_1sampResult(statistic=-5.4548232923645195, pvalue=2.410632041556127

Since the p-value < 0.05, we reject the null hypothesis and conclude that the true population
mean is unlikely to be 98.6 degrees F.
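To make the test less of a black box, the t statistic can be recomputed by hand from the summary numbers reported above (n from Out[47], mean and standard deviation from Out[51]). This is a minimal sketch; the variable names mu_0, standard_error and p_two_sided are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from Out[47] and Out[51] above.
n = 130
sample_mean = 98.24923076923078
sample_std = 0.7331831580389454  # pandas .std() uses ddof=1 by default
mu_0 = 98.6                      # hypothesized population mean under H0

# t = (sample mean - hypothesized mean) / standard error of the mean
standard_error = sample_std / np.sqrt(n)
t_stat = (sample_mean - mu_0) / standard_error

# Two-tailed p-value from the t distribution with n - 1 degrees of freedom
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_two_sided)
```

The hand-computed statistic should agree with the value returned by stats.ttest_1samp, about -5.45.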

1.0.4 4. At what temperature should we consider someone's temperature to be abnormal?


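If we use the 95% confidence interval from the t distribution, temperatures < 98.122 or > 98.376 would be considered abnormal. Note that this interval describes the uncertainty in the population mean, so it is much narrower than the spread of individual readings.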

In [113]: CI = stats.t.interval(0.95, len(df.temperature)-1, sample_mean, stats.sem


[round(item, 3) for item in CI]

Out[113]: [98.122, 98.376000000000005]
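The interval above can be reproduced from the summary statistics, and it can be contrasted with an interval for a single new reading. The sketch below is an illustration under stated assumptions: the prediction-interval formula assumes normally distributed temperatures, and the names ci, pi and t_crit are not from the original notebook.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from Out[47] and Out[51] above.
n = 130
sample_mean = 98.24923076923078
sample_std = 0.7331831580389454

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

# Confidence interval for the population MEAN: mean +/- t * s / sqrt(n)
ci_half = t_crit * sample_std / np.sqrt(n)
ci = (sample_mean - ci_half, sample_mean + ci_half)

# Prediction interval for ONE new reading: mean +/- t * s * sqrt(1 + 1/n)
pi_half = t_crit * sample_std * np.sqrt(1 + 1 / n)
pi = (sample_mean - pi_half, sample_mean + pi_half)

print("95% CI for the mean:   ", [round(x, 3) for x in ci])
print("95% prediction interval:", [round(x, 3) for x in pi])
```

The prediction interval is the more natural yardstick for judging whether one person's temperature is unusual, since it accounts for person-to-person variation and not only for uncertainty about the mean.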

1.0.5 5. Is there a significant difference between males and females in normal temperature?
We can test this using:
* Overlap
* Probability of superiority
* Pooled variance

In [71]: female_temp = df.temperature[df.gender == "F"]


male_temp = df.temperature[df.gender == "M"]
female_mean = female_temp.mean()
male_mean = male_temp.mean()
female_var = female_temp.var()
male_var = male_temp.var()
n1 = len(female_temp)
n2 = len(male_temp)
print("Female: (mean, var, len)", female_mean, female_var, n1)
print("Male: (mean, var, len)", male_mean, male_var, n2)

Female: (mean, var, len) 98.39384615384613 0.5527740384615375 65


Male: (mean, var, len) 98.1046153846154 0.488259615384615 65

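In [86]: # overlay the two histograms; alpha makes the overlapping region visible
plt.hist(female_temp, alpha = 0.5, color = purple, label = "Female")
plt.hist(male_temp, alpha = 0.5, color = green, label = "Male")
plt.xlabel("Body Temperature")
plt.legend()
plt.show()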

Overlap
In [79]: threshold = (female_mean * n1 + male_mean * n2)/(n1 + n2) # need need to w
threshold

Out[79]: 98.24923076923076

In [88]: overlap_rate = sum(female_temp < threshold)/n1 + sum(male_temp > threshold


overlap_rate
misclassification_rate = overlap_rate / 2
misclassification_rate

Out[88]: 0.42307692307692313

The misclassification rate is really high, which means there is not much difference between the two
distributions.
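The overlap idea can be written as a small helper. This is a minimal sketch on toy data, not the notebook's exact cell: the function name misclassification_rate and the example arrays are hypothetical, and the threshold is taken to be the mean of the combined sample, as in the cell above.

```python
import numpy as np

def misclassification_rate(high_group, low_group):
    """Classify by a threshold at the combined mean and return (threshold, error rate).

    high_group is the group expected to have larger values. A high_group value
    below the threshold, or a low_group value above it, counts as misclassified.
    """
    high = np.asarray(high_group, dtype=float)
    low = np.asarray(low_group, dtype=float)
    # Mean of the pooled sample, i.e. the size-weighted average of the group means
    threshold = (high.sum() + low.sum()) / (len(high) + len(low))
    high_below = np.mean(high < threshold)  # share of high group on the wrong side
    low_above = np.mean(low > threshold)    # share of low group on the wrong side
    # Average the two error shares so the rate lies between 0 and 1
    return threshold, (high_below + low_above) / 2

# Toy data (hypothetical values, for illustration only)
threshold, rate = misclassification_rate([3, 4, 5, 6], [1, 2, 3, 4])
print(threshold, rate)
```

A rate near 0.5 means the threshold separates the groups no better than a coin flip; a rate near 0 means the groups barely overlap.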

Probability of Superiority
In [96]: new_female_temp = np.random.choice(female_temp, n1, replace=True)
new_male_temp = np.random.choice(male_temp, n2, replace=True)
sum(x > y for x,y in zip(new_female_temp, new_male_temp))/n1

Out[96]: 0.53846153846153844

The probability of superiority (about 0.54) is only slightly above 0.5, far from a high value such
as 0.9, so on its own it does not show a clear difference between the body temperature of females
and males. However, we still need a test based on the pooled variance to check this rigorously.
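The resampling above gives a slightly different answer on every run. A deterministic alternative is to compare every possible pair of values; the sketch below shows this on toy data. The function name probability_of_superiority and the example arrays are hypothetical, and ties are counted as "not greater".

```python
import numpy as np

def probability_of_superiority(a, b):
    """Fraction of all (x, y) pairs, x from a and y from b, with x > y."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Broadcasting builds a len(a) x len(b) table of pairwise comparisons
    return np.mean(a[:, None] > b[None, :])

# Toy data (hypothetical values, for illustration only)
ps = probability_of_superiority([3, 4, 5, 6], [1, 2, 3, 4])
print(ps)
```

On the real data this would be called as probability_of_superiority(female_temp, male_temp); a value of 0.5 corresponds to no tendency for either group to be higher.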

Pooled Variance H0 : The difference between the means of the two distributions is 0.

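In [114]: diff = female_mean - male_mean


# pooled variance; with n1 == n2 this equals the simple average of the two variances
pooled_var = (n1 * female_var + n2 * male_var)/(n1 + n2)
# standardized mean difference (Cohen's d): an effect size, not a p-value
effect_size = diff/np.sqrt(pooled_var)
effect_size

Out[114]: 0.40089173785982207

The value 0.40 is a standardized effect size (Cohen's d, the mean difference divided by the pooled
standard deviation), not a p-value, so it cannot be compared with the 0.05 significance level. An
effect of this size indicates a small-to-moderate difference between the body temperature of females
and males. Whether that difference is statistically significant has to be decided by a two-sample
t test based on the same pooled variance.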
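A pooled-variance two-sample t test can be run directly from the summary statistics printed earlier (the Female and Male lines under In [71]). The sketch below uses scipy's ttest_ind_from_stats; it assumes equal population variances and takes the printed variances to be sample variances with ddof=1, which is the pandas default.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the output of In [71] above.
female_mean, female_var, n1 = 98.39384615384613, 0.5527740384615375, 65
male_mean, male_var, n2 = 98.1046153846154, 0.488259615384615, 65

# Pooled-variance (equal_var=True) two-sample t test; H0: the two means are equal.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=female_mean, std1=np.sqrt(female_var), nobs1=n1,
    mean2=male_mean, std2=np.sqrt(male_var), nobs2=n2,
    equal_var=True,
)
print(t_stat, p_value)
```

For these summary values the t statistic is roughly 2.3, above the two-sided 5% critical value of about 1.98 for 128 degrees of freedom, so this sketch would reject H0 at the 0.05 level. With the raw data available, stats.ttest_ind(female_temp, male_temp) performs the same test.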
