Data Analysis and Classification

Analysis and Classification of Respiratory Health
Risks with Respect to Air Pollution Levels

Ruhul Amin Dicken, S.A.M Fazle Rubby, Sheefta Naz, A. M. Arefin Khaled, Shuvo Ashish Rahman, Sharmina Rahman, Rashedur M Rahman
Department of Electrical and Computer Engineering, North South University,
Plot-15, Block-B, Bashundhara, Dhaka 1229, Bangladesh
ruhul.amin1125@gmail.com, fazle2712@yahoo.com, sheefta@hotmail.com, akonkshaan@gmail.com, shuvo.nova74886@gmail.com,
sharmina.rahman01@yahoo.com, rashedur.rahman@northsouth.edu
Abstract- Air pollutants are a serious hazard in Bangladesh. This
paper studies the relationship between air pollutants and the
admission of patients to medical facilities, and analyzes the
reasons behind the increase of the disease rate in hospitals.
Medical data were collected from the National Institute of Disease
of the Chest and Hospital (NIDCH), located in Dhaka, Bangladesh,
together with air pollutant data for the city of Dhaka. The
k-means clustering method is used to cluster the air pollutants
across the seasons of Bangladesh. The CART method is used to
classify the patients according to the rate of admission. Missing
values were replaced using a probabilistic mean median method.
Keywords- data mining; health problem; decision tree; air
pollution; respiratory diseases.
I. INTRODUCTION
Air pollution is the presence of biological matter or other
harmful materials in the Earth's atmosphere. Air pollutants can
have adverse effects not only on human lives but on all kinds of
living beings. The substances can be solid particles, liquid
droplets or gases. Pollutants can be of natural origin or
man-made, for example from industry. They are classified as
primary or secondary pollutants.
Primary pollutants are emitted directly from a source, for
example ash from volcanic eruptions or CO from vehicle exhaust
emissions. Secondary pollutants are not emitted directly; they
form in the air when primary pollutants react or interact, with
ground-level ozone being a common example.
In this paper the primary emissions of pollutants are considered.
NOx is produced by thunderstorms and industrial activity. SO2 is
released by volcanic eruptions, industrial activity and petroleum
combustion, which has various sources but comes mostly from
vehicles. Further oxidation of SO2 in the atmosphere later
produces acid rain. CO is created by combustion in vehicles and
by burning wood, natural gas, coal, etc. O3 is formed through
similar processes, by chemical reactions and oxidation of these
emitted compounds. PM10 and PM2.5 are particulate matter: solid
particles and liquid droplets in the atmosphere that can contain
lead and other harmful substances. PM10 refers to particles up to
10 micrometers in diameter and PM2.5 to particles up to 2.5
micrometers. The coarse particulates are PM10 and the finer
particulates are PM2.5. PM10 consists of smoke, dust and dirt
from roadsides and factories. It is created when rocks and soil
are broken down, so that mixtures of rock and metal are reduced
to smaller particles and mixed. The finer particulate matter
contains toxic organic compounds and heavy metals. It is created
by driving automobiles, burning plants, and smelting or purifying
metals. The human body responds to particulate invasion more than
to any other pollutant in the atmosphere. PM2.5 is more harmful
to living beings than PM10.
978-1-4799-8676-7/15/$31.00 copyright 2015 IEEE
SNPD 2015, June 1-3 2015, Takamatsu, Japan
Particulate matter triggers diseases such as asthma and plays an
important part in respiratory diseases, as well as in cancers
that cause premature death. It has been reported that some air
pollutants increase in the dry season compared to the rainy
season. Most lung diseases are the result of air pollutants
present in our atmosphere.
Air pollution is a major problem faced in day-to-day life. It is
becoming more serious with the growth of cities and their
increasing populations. The increase in the number of vehicles,
industrial expansion, etc. are major contributors to pollution.
Bangladesh also faces this problem due to its continuously
increasing population. As it is a developing country with an
increasing number of vehicles, air pollution is now affecting the
health of its people.
The conversion of vehicles, the establishment of factories, the
illegal disposal of chemical products and the use of
ozone-depleting products are the main causes of pollution. The
different levels of air pollutants present in the air also cause
health problems. Such pollutants include Carbon Monoxide (CO),
Nitrogen Oxides (NOx), Sulfur Oxides (SOx), PM10 and PM2.5.
These pollutants cause serious health problems and diseases which
can be fatal. They affect not only healthy adults but also
children, and the resulting diseases affect people of certain age
groups.
In this research the presence of air pollutants, their percentage
and the number of patients admitted to the hospital are
investigated using the decision tree algorithm, to learn the
effect of vehicle fuel emissions on the air and the effect of air
pollutants on people's health. This study focuses on Dhaka city.
As Bangladesh is a developing country, the technologies and
software available here are not up to date for data mining tasks,
so data processing can be hard.
Medical data are collected from the well-known NIDCH (National
Institute of Disease of the Chest and Hospital) and the air
pollution data are collected from the DoE (Department of
Environment). The main goal of this study is to find the level of
air pollutants in the atmosphere and the effect of vehicle fuel
emissions on the air, so that the necessary steps can be taken to
prevent the health risks due to exposure.
II. RELATED WORK

Ojeda et al. [1] used Fuzzy c-Means to obtain a combined
measurement from three chosen air monitoring stations. By
observing the pollutant levels recorded by every single station
and the combination of the measurements, the authors analyzed the
relation between the pollutant groups and the environmental
variables.
Kyoko et al. [2] discussed different particulate matter in
relation to respiratory diseases. To find the relation, the
K-Maximum Sub-Array (2D) algorithm was used, where K is the
threshold taken. Hospital admissions were measured for ages 0 to
98 years. Each admission and its disease were recorded. The
records were mostly divided by season, and the increase and
decrease of diseases across the seasons were recorded. The air
pollutant levels were recorded for PM10. PM10 varies between
level groups in association with gender and time period. The
K-MSA model is used with a slight modification to search for the
largest sub-array of the one-dimensional array, which is the
solution to the problem.
Haibo et al. [3] presented a classification of roadside pollution
and human exposure. A model was developed to simulate the
distribution of the number of people exposed as a function of
urban area, together with road traffic pollutant concentrations
for separate traffic situations. The ratio between two pollutants
was observed and applied to find the finer particulate levels.
These results were applied to represent the frequency
distribution of personal exposure and to find the effects of
urbanization. In that work K-means was used for classifying the
roadside sections.
Cizao et al. [4] reported that, although many countries have
developed in past years, many studies have shown air pollution
levels ranging from high to low. The authors proposed co-vantages
lagging models for the air pollutants. The results show that
short-term exposure to air pollutants has dangerous health
outcomes.
Ketzel et al. [5] describe how traffic is a major cause of air
pollution. It depends on three factors: vehicle pollutant
emissions, natural conditions and surroundings. According to the
paper, dispersion models can relate the emissions to the
concentration levels in the street. Here the model calculations
mostly rely on real emission data, which is unnecessary. The data
are collected from a traffic site where there is a permanent
monitoring station. Pollutants such as NOx and CO are considered.
These are estimated with the COPERT model, and the emissions are
calculated more specifically with the OSPM model so that
underestimation is avoided. The concentrations are calculated by
a dispersion model, depending on the parameterization of the
dispersion process applied in the model. Each gas is measured and
plotted per day of the week. The gas emissions are measured
differently. In conclusion, the result of the clustering is
compared as a ratio between NOx and CO, represented in a linear
graph.
Simpson et al. [6] demonstrated the relation between air
pollutants and respiratory diseases affecting children. The study
focused on children of three age groups and used hospital
admission records for the period 1998-2001 from five major cities
of Australia and two cities of New Zealand. The air pollution
levels of the cities for the same period were recorded as well.
The study was conducted by combining the analyzed hospital
admission data from the different cities with the pollutant
levels and running them through a random-effects meta-analysis.
III. DATA SOURCES
Bangladesh is the fourth most polluted place in the world. Thus a
large amount of dirt and pollutants is present here. To analyze
the health hazard due to air pollution, data had to be collected.
Data on the air pollutants responsible for causing health
problems were collected from the Department of Environment,
through the CASE (Clean Air and Sustainable Environment) project.
The data consist of the pollutants present in the targeted
region's atmosphere. CASE is a government-funded project focused
on providing a cleaner atmosphere by reducing gas emissions from
gas plants, brick-making industries and vehicles. It also
monitors and regulates Sustainable Environment Initiatives (SEIs)
throughout Dhaka City. From the results provided, it is found
that CO, SO2, NO2, O3, PM2.5 and PM10 are present in large
amounts in Dhaka City's atmosphere. These data were collected
using CAMS (Continuous Air Monitoring Stations). These stations
monitor and measure the levels of air pollutants present in
different regions, in our case Dhaka City. In Dhaka City the
stations are located near Shangshad Bhaban, at BARC (Bangladesh
Agricultural Research Council) in Farmgate, and at Darus-Salam.
The data from the CAMS located in these areas are used in the
project. Fig. 1 depicts this.
Figure 1: The air pollutants and meteorological variables records
from CAM 1, CAM 2 and CAM 3
It is necessary to know the diseases caused by these harmful air
pollutants. Air pollution mostly causes respiratory problems in
people, but in some cases it causes more dangerous diseases.
NIDCH (National Institute of Disease of the Chest and Hospital)
was chosen as the source of data on respiratory diseases. It is a
well-known, state-supported research institute and hospital in
Bangladesh. It provides diagnosis and surgical treatment for
tuberculosis, airborne diseases and chest diseases, and also
conducts research in this field.
This institution has long been diagnosing airborne diseases,
mainly tuberculosis, and conducting research on such cases. Its
data are very useful for finding a relationship between air
pollution and its health effects. The data collected from the
institution show that the diseases caused by air pollution are
asthma, bronchogenic/bronchial carcinoma, COPD (chronic
obstructive pulmonary disease) and ILD (interstitial lung
disease). These are some of the serious diseases affecting the
health of the residents of the city.
3.1. Missing Data
Missing data has been a serious issue in this project. While
performing the weather and pollution analysis, some of the values
in the table were missing, so the results varied dramatically
from what was expected; to achieve accurate results the missing
values had to be filled. Several measures were taken to formulate
the missing values. The output of different missing value
imputation algorithms was observed to visualize the cluster and
pattern formation in the output graph. At first the missing
values were categorized into two sets:
1) One Value Missing (OVM)
2) Series Value Missing (SVM)
OVM means that a single value in the table was found to be
missing, and SVM means that two or more consecutive values were
found to be missing. For both categories the probabilistic mean
median mode method was carried out, with a slight change in the
method for value formulation, and in some cases probable values
were taken into account. For OVM, suppose the value of a certain
month is missing. To formulate the missing value, the sum of all
the values is taken and their mean is calculated. The mean is
then compared with the values of the table; the two or three
values closest to the mean are taken and the mean is subtracted
from each. The maximum difference obtained is taken; call it x.
Then the value above the missing value and the value below it are
taken and their average is calculated. This average is checked to
be less than the value above and greater than the value below.
Finally x is subtracted from the value above or added to the
value below (depending on one's judgment), and the result is put
in the table as the missing value. For SVM the same procedure is
used, but the calculated value is increased by 50% for the value
above and decreased by 50% for the value below.
Among the various algorithms and methods, two were frequently
used to check the formulated missing value: 1) the Probabilistic
Neural Network algorithm and 2) the interpolation method. These
two methods give values more or less close to the calculated
value. In this project RapidMiner software is used to compute the
probabilistic neural network, and interpolation is used to make
sure the error is low (as low as 10%). An example is given in
Table I.
TABLE I. EXAMPLE
Month   Value
Jan     11.82
.       .
.       .
Jun     6.61
Jul     (missing)
Aug     4.53
.       .
Dec     7.76
Mean = 7.42
Two values closest to the mean: 7.76 and 7.30.
Maximum difference x = 7.76 - 7.42 = 0.34.
Value above missing value = 6.61
Value below missing value = 4.53
Average = (6.61 + 4.53)/2 = 5.57
Check: 4.53 < 5.57 and 5.57 < 6.61.
Add x to the value below or subtract x from the value above,
using one's judgment to see which value fits best for the missing
value. Here 6.61 - 0.34 = 6.27, so the calculated missing value
is 6.27.
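The OVM steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `max_difference` and `ovm_candidates` are hypothetical, and the reading that x is subtracted from the larger neighbour or added to the smaller one is inferred from the worked numbers (6.61 - 0.34 = 6.27).

```python
def max_difference(mean, closest_values):
    """Largest absolute gap between the mean and the values closest to it (x)."""
    return max(abs(v - mean) for v in closest_values)


def ovm_candidates(above, below, x):
    """Return the neighbour average and two candidate fills for one missing value.

    Assumption: the candidates are (larger neighbour - x) and
    (smaller neighbour + x); the analyst picks whichever fits best.
    """
    average = (above + below) / 2
    low, high = min(above, below), max(above, below)
    # Sanity check from the text: the average lies between the two neighbours.
    if not (low <= average <= high):
        raise ValueError("average is not between the neighbouring values")
    return average, high - x, low + x


# Numbers taken from the worked example of Table I.
x = max_difference(7.42, [7.76, 7.30])
average, from_above, from_below = ovm_candidates(6.61, 4.53, x)
print(round(x, 2), round(average, 2), round(from_above, 2), round(from_below, 2))
```

With the Table I numbers, the first candidate rounds to the reported fill of 6.27; the second candidate is the alternative that the text leaves to the analyst's judgment.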
IV. ALGORITHMS
Two algorithms were used in the overall project. First a k-means
clustering algorithm was used to cluster the records of air
pollution levels and also the numbers of patient admissions to
the NIDCH. Then the CART analysis was performed on both data sets
to create a decision tree model that classifies the data. Our
approaches to implementing the algorithms are explained further
below.
4.1. Clustering using k-Means algorithm:
Given the nature of our data set discussed earlier, the k-means
algorithm was the best suited to work with. The algorithm's
tendency to locate clusters of comparable spatial extent is
useful, as our air pollution data contain attributes with very
high values compared to the values of other attributes within the
set. The application of the algorithm in feature learning also
comes in handy for recognizing the trends of increase and
decrease of air pollution levels. For all the clusters, the
dissimilarity between objects was measured by the squared
Euclidean distance.
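The clustering step can be illustrated with a small, dependency-free k-means that uses the squared Euclidean distance named above. This is only a sketch: the paper does not state how centroids were initialized, so seeding with the first k points is an assumption made here for determinism, and the sample readings are hypothetical.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-attribute readings forming two obvious groups.
readings = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
labels, centroids = k_means(readings, k=2)
print(labels)
```

Because squared Euclidean distance is scale-sensitive, attributes with very large values dominate the result unless the columns are standardized first; whether the original analysis rescaled the attributes is not stated.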
The traditional Bengali calendar consists of six seasons.
However, for our project, we grouped the seasons of Bangladesh
into four categories: Winter (November-January), Summer
(February-May), Monsoon (June-August) and Autumn
(September-October). Our goal was to create a cluster model that
characterizes the transition of the pollution levels from cluster
to cluster with respect to the change of the four seasons. Thus
we chose k = 4 to get four clusters for the four seasons.
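The four-season grouping above can be written as a lookup table, which also makes it easy to tabulate how cluster labels fall across seasons. The sketch below is illustrative only: the helper name `season_cluster_counts` and the sample cluster labels are hypothetical.

```python
from collections import Counter

# Month number -> season, following the four categories defined in the text.
SEASON_BY_MONTH = {
    11: "Winter", 12: "Winter", 1: "Winter",
    2: "Summer", 3: "Summer", 4: "Summer", 5: "Summer",
    6: "Monsoon", 7: "Monsoon", 8: "Monsoon",
    9: "Autumn", 10: "Autumn",
}


def season_cluster_counts(months, cluster_labels):
    """Count how often each (season, cluster) pair occurs."""
    return Counter(
        (SEASON_BY_MONTH[m], c) for m, c in zip(months, cluster_labels)
    )


months = [12, 1, 3, 4, 7, 8, 9, 10]
clusters = [0, 0, 1, 1, 2, 2, 3, 3]  # hypothetical k-means output
counts = season_cluster_counts(months, clusters)
print(counts[("Monsoon", 2)])
```

A table like this shows at a glance whether each cluster is dominated by one season or straddles a transition between two.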
Figure 2: Clustered Model of the Air pollution levels

The data set of the air pollutants and the meteorological
variables from the CASE project for the period 2013-2014 was
applied to the k-means clustering algorithm, and the model shown
in Figure 2 was formed. When the clusters are plotted against the
months, we can clearly observe a trend in how the objects are
clustered. When considered with respect to the change of seasons,
the clusters also seem to form in a transitional state, which is
clearly visible in the figure for the months February to April.
This also suggests that our technique for generating some of the
missing values was mostly accurate and did not affect the model
in any negative manner. The clusters make it easier to understand
the rise and drop of the pollutant levels. The following figure
shows the correlation among the air pollution and the
meteorological variables.
Figure 3: Correlation among the Air data attributes

Here we can see that all the air pollutant levels are highly
correlated with each other. Rainfall and solar radiation proved
to be important factors for the levels of most of the pollutants,
whereas humidity and temperature do not seem to affect the
pollution levels.
Clustering of the medical data was conducted with the objective
of creating an appropriate dataset for classification. While
preprocessing the medical data, we observed that a majority of
the admitted patients fell in the age groups 24-49 and 50+.
Almost no children and only a few teenagers were admitted in the
two-year period, and their numbers were insufficient to work with
in any model. As the ultimate goal of this project is to create a
model that classifies and predicts the levels of hospital
admissions, we decided to create three clusters for the numbers
of hospital admissions in the existing data: HIGH (H), MEDIUM (M)
and LOW (L). So the value k = 3 was chosen when creating the
clusters for hospital admissions with respect to month. The
purpose of these clusters was to use the three levels (H, M and
L) of the number of admissions as a class label when preparing
the decision tree later in our project. This approach is more
appropriate than using numerical levels for the admissions, since
our dataset came from only one hospital and is therefore very
small. From the clusters we were easily able to prepare the class
attribute, i.e. the number of hospital admissions for each age
group of the three targeted diseases. Thus we now had a proper
dataset ready for classification.
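Turning monthly admission counts into H, M and L class labels with k = 3 can be sketched as a one-dimensional k-means followed by ranking the clusters by their centroids. This is an illustration under assumptions: the function name `label_admissions`, the seeding with the minimum, median and maximum counts, and the sample counts are all hypothetical.

```python
def label_admissions(counts, max_iter=100):
    """Cluster monthly admission counts into three groups named L, M and H."""
    ordered = sorted(counts)
    # Seed with the minimum, median and maximum (assumption, for determinism).
    centroids = [
        float(ordered[0]),
        float(ordered[len(ordered) // 2]),
        float(ordered[-1]),
    ]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(3), key=lambda c: (v - centroids[c]) ** 2) for v in counts
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(3):
            members = [v for v, a in zip(counts, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    # Rank clusters by centroid: smallest mean becomes L, largest becomes H.
    rank = sorted(range(3), key=lambda c: centroids[c])
    names = {rank[0]: "L", rank[1]: "M", rank[2]: "H"}
    return [names[a] for a in assignment]


# Hypothetical monthly admission counts for one disease and age group.
monthly = [12, 15, 40, 42, 80, 85, 14, 41, 83, 13, 39, 82]
levels = label_admissions(monthly)
print(levels)
```

Ranking by centroid keeps the label meaning stable regardless of the arbitrary cluster numbers that the clustering step returns.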
4.2. Classification using the CART analysis:
The decision tree algorithm was selected for classification
because of its robustness. The aim is that, using the air
pollution data with the clustered medical data acting as class
label, we generate a decision tree which, when given a particular
month's air pollution data, predicts the level of hospital
admissions (H, M or L). Since our data from NIDCH consist of the
age groups 24-49 and 50+ for the diseases COPD, ILD and Bronchial
Carcinoma, we decided to build a decision tree for each age group
of each disease. Therefore we had to produce six decision trees,
each classifying the level of hospital admission for a specific
age group of a particular disease. The decision tree generation
process was conducted with three different criterion metrics: (i)
Information Gain, (ii) Gini Index and (iii) Gain Ratio. The two
best trees were then selected for our results. To better explain
this method, let us use the case of the decision tree for COPD
patients of age group 50+.
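The three criterion metrics named above can be made concrete with a short sketch that scores one candidate split. The formulas are the standard textbook definitions; the function names and the toy labels are hypothetical, and this is not a reproduction of the tool used in the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_scores(parent, children):
    """Score a split of `parent` into the non-empty label lists in `children`."""
    n = len(parent)
    weights = [len(ch) / n for ch in children]
    info_gain = entropy(parent) - sum(
        w * entropy(ch) for w, ch in zip(weights, children)
    )
    gini_gain = gini(parent) - sum(w * gini(ch) for w, ch in zip(weights, children))
    # Gain ratio normalizes information gain by the entropy of the split itself.
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return {
        "information_gain": info_gain,
        "gini_gain": gini_gain,
        "gain_ratio": gain_ratio,
    }


# A perfect split of four months into two pure groups.
scores = split_scores(["H", "H", "L", "L"], [["H", "H"], ["L", "L"]])
print(scores)
```

The criteria can rank the same candidate splits differently, which is one reason trees grown with different metrics on the same data may differ in shape and accuracy.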
COPD:
First the dataset of air pollutant levels for the period
2013-2014 was merged with the clustered levels (H, M and L) of
hospital admissions of both male and female patients of age group
50+ under treatment for COPD in the same period. The resulting
merged set was used as input to a decision tree algorithm. The
resulting model was then tested by applying it to the same
dataset in order to evaluate its performance. This process was
repeated several times using the three criterion metrics
mentioned earlier, with different values set for minimum gain.
After several observations the best two trees were selected for a
final evaluation to choose the better model. In this process the
value for minimum gain was selected as 0.06. The best two trees
created are shown in Figure 4: Tree 1a, formed using Information
Gain, and Tree 1b, formed using Gini Index.
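One way a minimum-gain setting such as 0.06 can act as a stopping rule is sketched below: a numeric attribute is split at the threshold with the highest information gain, and the split is rejected when that gain falls below the minimum. This is an assumption about how such a parameter typically behaves, not a description of the software used in the study; `best_threshold` and the SO2 readings are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, min_gain=0.06):
    """Best binary split `value <= t` on one attribute, or None if gain < min_gain."""
    base = entropy(labels)
    n = len(labels)
    best = None
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = (
            base
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right)
        )
        if best is None or gain > best[1]:
            best = (t, gain)
    if best is None or best[1] < min_gain:
        return None  # no split is informative enough; the node stays a leaf
    return best


# Hypothetical monthly SO2 readings with their admission levels.
so2 = [10.0, 12.0, 30.0, 32.0]
split = best_threshold(so2, ["L", "L", "H", "H"])
no_split = best_threshold(so2, ["H", "H", "H", "H"])
print(split, no_split)
```

Raising the minimum gain prunes weak splits and yields smaller trees, which is consistent with the observation in this section that different settings produce trees using different numbers of parameters.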
Here Tree 1b seems to be the better classified tree, since it
uses more parameters to label an object; however, Tree 1a shows a
higher accuracy. For this reason two trees are chosen before
making the final choice. The process was repeated with the same
air pollutant data, but this time with the hospital admission
data for patients of age group 24-49. The two chosen trees are
Tree 1c, formed using Information Gain, and Tree 1d, formed using
Gini Index. Choosing the final tree based on the comparison of
performance is discussed later in the paper. For the remainder of
this section the two selected trees for each age group are
discussed.
ILD:
A similar CART analysis was applied to the ILD patients. For
patients of age group 50+ the best two generated trees were, as
before, Tree 2a, formed using Information Gain, and Tree 2b,
formed using Gini Index. However, for patients of age group 24-49
the best two were Tree 2c, formed using Information Gain, and
Tree 2d, formed using Gain Ratio. For Trees 2a and 2b the minimum
gain selected was 0.06, and for Trees 2c and 2d the value
selected was 0.08. These trees also seem to use fewer parameters
to deduce the class and are highly dependent on the levels of
SO2.
Bronchial Carcinoma:
For the third category of respiratory disease in our dataset, the
decision trees were first created with a minimum gain of 0.05 for
patients of age group 50+. This resulted in Tree 3a, formed using
Gini Index, and Tree 3b, formed using Gain Ratio. For patients of
age group 24-49 the minimum gain was changed to 0.03, which
resulted in Tree 3c, formed using Information Gain, and Tree 3d,
formed using Gain Ratio. Both trees formed using Gain Ratio seem
to split the parameters better than the trees formed with the
other criteria.
4.3. Performance Evaluation:
After completing the CART analysis for all the age groups, the
end result was a set of two decision trees for each age group of
the three respiratory diseases, a total of 12 decision trees.
However, only one tree needs to be chosen for each age group. To
achieve this, a cross-validation (X-validation) test was run on
each of the 12 classification trees, where the generated tree was
applied to the same dataset and the resulting class labels were
compared with the original labels of the objects. For a model to
be considered applicable to real-world scenarios it must have an
accuracy higher than 50%. Figure 5 shows the evaluated accuracy
for all the generated trees.
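A cross-validated accuracy of the kind reported in Figure 5 can be sketched as below. The fold layout, the helper names and the majority-class baseline used as a stand-in classifier are all assumptions for illustration; the actual trees and fold settings of the study are not reproduced here.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(rows, labels, train_fn, k=4):
    """Fraction of held-out rows predicted correctly across k folds."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        predict = train_fn(train_rows, train_labels)
        correct += sum(predict(rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)


def majority_baseline(train_rows, train_labels):
    """Stand-in classifier: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common


rows = list(range(8))  # placeholder feature rows
labels = ["H", "H", "H", "H", "H", "H", "L", "L"]
accuracy = cross_val_accuracy(rows, labels, majority_baseline, k=4)
print(accuracy)
```

Comparing a tree's accuracy with such a majority-class baseline gives a reference point for judging whether the tree has learned anything beyond the class frequencies.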
Figure 5a displays the accuracy of the decision trees for COPD.
For the age group 50+, Tree 1a and Tree 1b show accuracies of
62.5% and 58.5% respectively. Since both values are higher than
50%, both decision trees prove to be strong classification
models. The tree with the higher accuracy of 62.5% is chosen. For
the age group 24-49, represented by Trees 1c and 1d, the decision
models seem to be weaker than before, as both accuracies lie
close to 50%. This problem may arise from the small size of our
dataset. Since Tree 1c shows an accuracy of 51%, it is chosen
over Tree 1d.
The accuracies of the decision trees for ILD are shown in Figure
5b. From the accuracy table it can be concluded that Tree 2a and
Tree 2d, which have accuracies higher than 50%, would prove to be
strong classification models for their respective age groups.
The accuracy table shown in Figure 5c is for the decision trees
for Bronchial Carcinoma. Despite trying several values for the
minimum gain, the process did not yield an accuracy higher than
35%. Thus the classification models generated are inapplicable in
real-life scenarios. This may be due to the small amount of data.
It can also be concluded that other factors related to the
diagnosis of the disease play a more important role, and that
levels of air pollution alone are not enough to create a
sufficient classification model.
V. CONCLUSION
While the research was being conducted, numerous obstructions
arose due to errors such as missing information in the data. The
COPD and ILD models proved applicable, but bronchial carcinoma
gave a model that is not applicable in real life due to its low
accuracy. As most of the models work, this suggests that the
measures taken to fill the missing data worked well.
REFERENCES
[1] Magana Ojeda, Januchs Cortina M.G., Adame Barrone, Dominiguez
Quintella J., Hernandez W., Corona Vega A., Ruclas R. and Andina D.
Air pollution Analysis with a PFCM Clustering Algorithm Applied in a
Real Database of Salamanca. IEEE International Conference on
Industrial Technology (ICIT), pp. 1297-1302, 2010. Retrieved from
http://oa.upm.es/8108/2/INVE_MEM_2010_81133.pdf
[2] Fukuda Kyoko & Takaoka Tadao. Analysis of Air Pollution (PM10)
and Respiratory Morbidity Rate using K-Maximum Sub-array (2-D)
Algorithm. Proceedings of ACM Symposium on Applied Computing (SAC),
pp. 153-157, 2007. Retrieved from
http://dl.acm.org/citation.cfm?id=1244041.
[3] Chen Haibo, Namdeo Anil & Bell Margaret. Classification of road
traffic and roadside pollution concentrations for assessment of
personal exposure. Environmental Modelling & Software, Volume 23,
Issue 3, pp. 282-287, 2008. Retrieved from
doi:10.1016/j.envsoft.2007.04.006.
[4] Ren Cizao, Tong Shilu. Health effects of ambient air pollution:
recent research development and contemporary methodological
challenges. Environmental Health 7:56, November 6, 2008. Retrieved
from www.ehjournal.net/content/7/1/56.
[5] Ketzel M., Winther M. & Berkowicz R. Traffic pollution modeling
and emission data. Environmental Modelling & Software, Volume 21,
Issue 4, April 2006, pp. 454-460. Retrieved from
www.elsevier.com/locate/envsoft.
[6] Barnett G., Williams M., Schwartz & Simpson W. Air Pollution and
Child Respiratory Health: A Case-Crossover Study in Australia and New
Zealand. American Journal of Respiratory and Critical Care Medicine,
171(11):1272-8, March 2005. Retrieved from
http://dl.acm.org/citation.cfm?id=1244041.
Figure 4: Decision trees generated for the different age groups.
COPD: Tree 1a (I.G) and Tree 1b (G.I) for age group 50+; Tree 1c
(I.G) and Tree 1d (G.I) for age group 24-49. ILD: Tree 2a (I.G)
and Tree 2b (G.I) for age group 50+; Tree 2c (I.G) and Tree 2d
(G.R) for age group 24-49. Bronchial Carcinoma: Tree 3a (G.I) and
Tree 3b (G.R) for age group 50+; Tree 3c (I.G) and Tree 3d (G.R)
for age group 24-49.
Figure 5: Accuracy Tables of the decision trees
