Variable Ranking in the Context of Neighboring Features through the Nested Subset Method: An Empirical Study Advocating Greedy Strategies on Benchmark Microarray Datasets

Babu Reddy. M
Department of Computer Science, Krishna University, Machilipatnam, AP, India

ABSTRACT
Databases are continuously modified to suit the dynamic requirements of their users. In this process there is considerable scope for the dimensionality to grow, which in turn degrades the databases through increased redundancy. To optimize the efficiency of classification tasks, irrelevant and repeated features need to be eliminated. This paper analyzes the usefulness of the variable ranking method in improving classification performance. An empirical study has been carried out on the public-domain benchmark data set of Pima Indian diabetic patients. Both forward subset selection and backward subset elimination methods have been exercised, and the results are discussed in terms of increased learning accuracy and improved comprehensibility.
Key Words: Variable ranking, Classification efficiency, Redundancy, Feature subset
1. INTRODUCTION
Among the many variable or feature selection mechanisms, the variable ranking method can be treated as a principal or auxiliary selection mechanism because of its straightforwardness, scalability, and acceptable empirical success. Several researchers have addressed the usefulness of variable ranking as a baseline method [2][1][7][9]. Apart from prediction, one of its common uses in the microarray analysis domain is to find a set of drug leads: a ranking criterion is used to find genes that distinguish between healthy and diseased patients. Ranking criteria can be defined for individual variables, independent of the context of the others. Correlation methods are among the prominent methods of this category.
1.1 Method & Notation for Feature Weighting
Consider a set of m samples {X_k, Y_k} (k = 1..m), each consisting of n input variables X_{k,i} (i = 1..n) and an output variable Y_k. Feature weighting makes use of a scoring function S(i) computed from the values X_{k,i} and Y_k, for all k = 1..m. By convention, a high-scoring feature is treated as a valuable feature, and the feature set is sorted in decreasing order of S(i). To use feature ranking to build predictors, nested subsets incorporating features of decreasing relevance are defined. According to the classification of Kohavi and John (1997) [3], feature ranking is a filter method, i.e. the process is independent of the choice of the particular predictor. Yet, under certain independence assumptions, it may be optimal with respect to a given predictor. Even when feature ranking is not optimal, it may be preferable to other feature subset selection methods because of its computational and statistical scalability: it requires only the computation of n scores and a sort. Overfitting is also mitigated, since feature ranking introduces bias but it may have significantly less variance [12].
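A minimal sketch (not from the paper) of the generic scheme just described: score each feature with some S(i), sort in decreasing order, and form the nested subsets used to train predictors of increasing size. The particular scoring function (a class-mean separation score) and the toy data are illustrative assumptions.

```python
import numpy as np

def rank_features(X, y, score):
    """Rank features by decreasing score S(i) computed from X[:, i] and y."""
    scores = np.array([score(X[:, i], y) for i in range(X.shape[1])])
    order = np.argsort(-scores)          # indices sorted by decreasing S(i)
    return order, scores

def nested_subsets(order):
    """Nested subsets of decreasing relevance: top-1, top-2, ..., all features."""
    return [list(order[:k]) for k in range(1, len(order) + 1)]

# One illustrative scoring function: class-mean separation scaled by spread.
def s2n(xi, y):
    a, b = xi[y == 0], xi[y == 1]
    return abs(a.mean() - b.mean()) / (a.std() + b.std() + 1e-12)

# Toy data: 100 samples, 4 features, only feature 1 is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 1] + 0.2 * rng.normal(size=100) > 0).astype(int)

order, scores = rank_features(X, y, s2n)
print("ranking:", order)                      # feature 1 should come first
print("nested subsets:", nested_subsets(order))
```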
A few of the feature ranking methods are discussed here:
1.2 Correlation for variable/feature ranking
In general, a feature is good if it is highly correlated with the class but not with any of the other features. To measure
the correlation between two random variables, broadly two approaches can be followed. One is based on classical linear
correlation and the other is based on information theory. Under the first approach, the most familiar measure is linear
correlation coefficient. For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula [8]:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}    (1)




The same formula can also be expressed as:

R(i) = \frac{\mathrm{Cov}(X_i, Y)}{\sqrt{\mathrm{Var}(X_i)\,\mathrm{Var}(Y)}}    (2)

where \bar{x} is the mean of X and \bar{y} is the mean of Y. The value of the correlation coefficient r lies between -1 and 1, inclusive. If X and Y are completely correlated, r takes the value 1 or -1; if X and Y are totally independent, r is zero. Correlation is a symmetrical measure of two variables. There are several benefits in choosing linear correlation as a feature goodness measure for classification. First, it helps to identify and remove features with near-zero linear correlation to the class. Second, it helps to reduce redundancy among selected features. It is known that if data is linearly separable in the original representation, it is still linearly separable when all but one of a group of linearly dependent features are removed [Das, 1971]. The set of all features can be ordered based on the correlation coefficient, and only the significant ones need be taken into consideration. Linear regression explains the linear relation between X_i and Y: the square of R(i) represents the fraction of the total variance around the mean value of Y that is explained. Hence, using R(i)^2 as a feature ranking criterion, features are assigned a rank according to the goodness of linearity of the individual variables.
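The following sketch (an illustration, not code from the paper) computes r of equation (1) directly and uses R(i)^2 as the ranking criterion of this section; the toy feature and target names are assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Linear correlation coefficient of equation (1)."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

def r_squared_ranking(X, y):
    """Rank features by R(i)^2, i.e. goodness of linear fit to the target."""
    r2 = np.array([pearson_r(X[:, i], y) ** 2 for i in range(X.shape[1])])
    return np.argsort(-r2), r2

# Tiny synthetic example: feature 0 is linearly related to y, feature 1 is noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)

order, r2 = r_squared_ranking(X, y)
print(order, np.round(r2, 3))        # feature 0 should come first
print(np.isclose(pearson_r(X[:, 0], y), np.corrcoef(X[:, 0], y)[0, 1]))  # sanity check
```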
1.3. Single Feature Classifier
As already mentioned, regression enforces a ranking according to the goodness of linearity of individual variables. One can extend this idea to building a classifier based on the predictive power of an individual feature. In many applications, the value of the feature itself can be used as a discriminant function: by setting a threshold on the value of a feature, a classifier can be designed. During classification, the predictive power of a feature can be measured in terms of error rate. In addition to error rate, other criteria can be defined which involve the FPR (false positive classification rate) and FNR (false negative classification rate). By varying the threshold value, the tradeoff between FNR and FPR can be monitored. Break-even point analysis (FNR = FPR at a specific threshold) and the ROC curve, which plots the hit rate (1 - FNR) against the FPR, can be used to keep a close watch on FNR and FPR. When a large number of variables separate the data well, the success rate of classification cannot be used to distinguish between the top-ranking variables. In that case the correlation coefficient, or another statistical tool such as the margin (the distance between the examples of opposite classes that are closest to one another for a given variable), is preferred [5].
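As a hedged illustration of this idea (not from the paper), the sketch below sweeps thresholds over a single feature and reports error rate, FPR and FNR at each; the approximate break-even point is the threshold where FPR and FNR are closest.

```python
import numpy as np

def single_feature_rates(x, y, threshold):
    """Classify as positive when x > threshold; return error rate, FPR, FNR."""
    pred = (x > threshold).astype(int)
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    fpr = fp / max(np.sum(y == 0), 1)
    fnr = fn / max(np.sum(y == 1), 1)
    err = np.mean(pred != y)
    return err, fpr, fnr

# Toy data: one informative feature, binary target.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.5, 1.0, 100)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Sweep candidate thresholds and locate the approximate break-even point (FPR = FNR).
thresholds = np.linspace(x.min(), x.max(), 50)
rates = np.array([single_feature_rates(x, y, t) for t in thresholds])
best = np.argmin(np.abs(rates[:, 1] - rates[:, 2]))
print("break-even threshold ~", round(thresholds[best], 2),
      "error:", round(rates[best, 0], 3))
```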
1.4. Information Theoretic Ranking Criteria (Mutual Information)
Many of the approaches to the feature selection problem using information theoretic criteria [2][18][7][10] rely on pragmatic estimates of the mutual information between each feature and the target:

MI(i) = \int_{x_i} \int_{y} P(x_i, y) \log \frac{P(x_i, y)}{P(x_i)\,P(y)} \, dx\, dy    (3)
where P is a probability density function and MI(i) is a measure of the dependency between the density of variable X_i and the density of the target y. The main difficulty is that estimating these probability densities is hard. For discrete variables, however, the task is easier because the integral becomes a sum:

I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{P(X = x_i, Y = y)}{P(X = x_i)\,P(Y = y)}    (4)

The probabilities are then estimated from frequency counts. In the case of continuous variables, estimating the probabilities remains difficult; possible solutions are to discretize the variables or to approximate their densities with a non-parametric method such as Parzen windows [10]. If the normal distribution is used to estimate the densities, the covariance between X_i and Y has to be estimated, and the criterion becomes similar to a correlation coefficient.
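A minimal sketch of equation (4) for discrete variables (illustrative, not from the paper): the joint and marginal probabilities are estimated from frequency counts and plugged into the sum.

```python
import numpy as np

def mutual_information(xi, y):
    """Estimate I(i) of equation (4) from frequency counts of two discrete variables."""
    xi_vals, y_vals = np.unique(xi), np.unique(y)
    n = len(xi)
    mi = 0.0
    for xv in xi_vals:
        for yv in y_vals:
            p_xy = np.sum((xi == xv) & (y == yv)) / n   # joint probability
            p_x = np.sum(xi == xv) / n                  # marginal of the feature
            p_y = np.sum(y == yv) / n                   # marginal of the target
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Toy example: a feature identical to the target vs. an unrelated one.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, 200)
print(round(mutual_information(y, y), 3))                        # high MI (close to ln 2 for a balanced target)
print(round(mutual_information(rng.integers(0, 2, 200), y), 3))  # near zero
```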
1.5. Limitations of Variable Ranking Methods
(i) Use of redundant features - can redundant features help each other?
(ii) Better class prediction may not be obtained by adding likely redundant features.
(iii) How does correlation impact feature redundancy?
(iv) No information is gained by adding highly correlated variables, i.e. they are truly redundant.
(v) Can a feature that is useless by itself be useful with other features?
A feature that is completely useless by itself can provide a significant performance improvement when taken together with others, and two features that are insignificant by themselves can be useful together.
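A tiny worked example (my own illustration) of point (v): with a XOR-style target, each feature alone has essentially zero correlation with the class, yet the pair separates it perfectly.

```python
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.integers(0, 2, 1000)
x2 = rng.integers(0, 2, 1000)
y = x1 ^ x2                      # XOR target: depends only on the pair

# Individually, each feature is (nearly) uncorrelated with the class ...
print(round(np.corrcoef(x1, y)[0, 1], 3), round(np.corrcoef(x2, y)[0, 1], 3))
# ... but together they predict it exactly.
print(np.mean((x1 ^ x2) == y))   # 1.0
```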

1.6. Variable Ranking in the Context of Others
The discussion so far has been limited to feature/variable weighting methods that use a criterion computed from a single feature independently, ignoring the perspective of the other features.
1.7. Nested Subset Methods
Individually, some variables may have a low rank because of their redundancy, yet still be relevant to the target concept. Nested subset methods provide a useful ranking of subsets, not of individual variables. The search process is guided by an objective function whose value depends on the moves made in feature subset space. Combined with greedy search strategies such as forward selection or backward elimination, nested subsets of features are produced (a sketch of the greedy forward-selection variant follows this list). The selected feature subset is then used for training, based on the expected change in the objective function.
Let s = the number of features selected and
J(s) = the value of the objective function.
The change in the objective function can be obtained by:
(i) Finite difference calculation: the difference between J(s) and J(s+1) or J(s - 1).
(ii) Quadratic approximation of the cost function.
Through the pruning of feature weights W_i, this method can be used for backward elimination of variables. In a second-order Taylor expansion of J at its optimum, the first-order term can be neglected, yielding for a variable i the variation

DJ_i = \frac{1}{2} \frac{\partial^2 J}{\partial W_i^2} (DW_i)^2.

For the removal of variable i, the change in weight DW_i is equal to W_i.
(iii) Sensitivity of the objective function: the absolute value or the square of the derivative of J with respect to W_i (or X_i) is used as the ranking criterion.
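A minimal sketch of the greedy search referred to above (an illustration under my own assumptions, not the paper's implementation): forward selection adds, at each step, the feature whose inclusion gives the largest finite-difference improvement J(s+1) - J(s) of a user-supplied objective function. The nearest-class-mean objective is just an example.

```python
import numpy as np

def forward_selection(X, y, objective, n_select):
    """Greedy forward selection yielding nested feature subsets."""
    selected, remaining, subsets = [], list(range(X.shape[1])), []
    current = 0.0                         # J of the empty subset, taken as 0 here
    for _ in range(n_select):
        # Finite-difference criterion: gain J(s+1) - J(s) for each candidate feature.
        gains = [(objective(X[:, selected + [j]], y) - current, j) for j in remaining]
        best_gain, best_j = max(gains)
        selected.append(best_j)
        remaining.remove(best_j)
        current += best_gain
        subsets.append(list(selected))    # each subset nests the previous one
    return subsets

# Illustrative objective: training accuracy of a nearest-class-mean classifier.
def objective(Xs, y):
    m0, m1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - m1, axis=1) < np.linalg.norm(Xs - m0, axis=1)).astype(int)
    return np.mean(pred == y)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(forward_selection(X, y, objective, 3))  # features 0 and 3 should be picked early
```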
2. EMPIRICAL STUDY AND SIMULATION RESULTS
Variable ranking has great significance in feature selection and classification tasks. The nested subsets method has been applied to the benchmark dataset of Pima Indian diabetic patients through forward subset selection and backward feature elimination strategies. An attempt has been made to identify the relation among the percentage of training samples, the classification efficiency, and the learning rate. Learning Vector Quantization (LVQ) also has great significance in feature selection and classification tasks: variable/feature selection has been carried out on the benchmark data set, and the LVQ method has then been applied to the benchmark dataset of diabetic patients [15]. This helps to analyze the performance in terms of classification efficiency in a supervised learning environment.
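The paper does not spell out its exact LVQ configuration beyond the learning rates and iteration count listed below, so the following is only a generic LVQ1 training sketch (prototype count, initialisation and the prediction helper are my own assumptions) of the kind of supervised quantizer that could be run on the selected feature subsets.

```python
import numpy as np

def train_lvq1(X, y, protos_per_class=1, lr=0.1, epochs=30, seed=0):
    """Basic LVQ1: pull the winning prototype toward same-class samples, push it away otherwise."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    protos, labels = [], []
    for c in classes:                      # initialise prototypes from random class samples
        idx = rng.choice(np.where(y == c)[0], protos_per_class, replace=False)
        protos.append(X[idx])
        labels.extend([c] * protos_per_class)
    W, labels = np.vstack(protos).astype(float), np.array(labels)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            w = np.argmin(np.linalg.norm(W - X[i], axis=1))    # winning prototype
            sign = 1.0 if labels[w] == y[i] else -1.0
            W[w] += sign * lr * (X[i] - W[w])
    return W, labels

def predict_lvq(W, labels, X):
    """Assign each sample the label of its nearest prototype."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return labels[np.argmin(d, axis=1)]

# Usage on a toy 2-class problem (the selected feature columns of the real data would go here).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
W, labels = train_lvq1(X, y, lr=0.1, epochs=30)
print("training accuracy:", np.mean(predict_lvq(W, labels, X) == y))
```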
Pima-Indian Diabetic Data Set
Database size: 768                    Learning rate: 0.1
No. of attributes: 8                  No. of classes: 2 (0 & 1)
No. of iterations performed: 30
Source of database:
(i) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
(ii) Donor: Vincent Sigillito, Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707, (301) 953-6231

TABLE 1
TABULATION OF CLASSIFICATION RESULTS WITH VARYING TRAINING SAMPLES AND LEARNING RATE, ADOPTING FORWARD SELECTION OF THE FEATURE SUBSET


% of Training Samples  % of Testing Samples  Efficiency for varying learning rate
                                             0.1      0.2      0.3      0.4      0.5
        10                     90            11.23    12.12    12.13    11.23    10.33
        20                     80            23.67    33.67    33.26    22.17    22.08
        30                     70            54.09    64.19    65.29    58.44    51.21
        40                     60            86.34    89.24    96.13    76.14    74.45
        50                     50            98.45    99.74    99.45    88.45    80.35
        60                     40            112.25   122.13   131.56   116.15   110.25
        70                     30            127.64   130.16   137.44   120.13   109.24
        80                     20            200.67   231.67   220.55   198.19   178.67
        90                     10            220.98   233.18   232.58   220.82   205.19


TABLE 2
TABULATION OF CLASSIFICATION RESULTS WITH VARYING TRAINING SAMPLES AND LEARNING RATE, ADOPTING BACKWARD ELIMINATION OF THE FEATURE SUBSET

% of Training Samples  % of Testing Samples  Efficiency for varying learning rate
                                             0.1      0.2      0.3      0.4      0.5
        10                     90            21.12    25.22    28.11    20.29    18.23
        20                     80            33.16    44.62    45.66    27.62    26.16
        30                     70            64.92    76.32    74.64    59.12    54.31
        40                     60            92.23    102.45   100.28   85.12    82.12
        50                     50            118.15   124.52   128.35   111.52   108.21
        60                     40            132.25   142.12   146.15   134.50   122.25
        70                     30            141.49   149.90   151.55   136.14   131.92
        80                     20            212.24   232.28   234.42   192.66   192.14
        90                     10            253.19   273.22   272.20   212.46   201.48

The following graphs depict the relation between the classification results yielded by the forward subset selection and backward subset elimination methods applied to the benchmark data set with varying percentages of training samples. The impact of the percentage of training samples and of the learning rate on the classification efficiency can also be clearly identified from these graphs.


Figure 1. Analysis of classification efficiency for learning rate = 0.1


Figure 2. Analysis of classification efficiency for learning rate = 0.2









Figure 3. Analysis of classification efficiency for learning rate = 0.3


Figure 4. Analysis of classification efficiency for learning rate = 0.4


Figure 5. Analysis of classification efficiency for learning rate = 0.5

It has been clearly observed that the classification efficiency is encouraging after reducing the dimensionality of the data sets through the nested subsets method. Better classification efficiency has been identified for the optimum size of the subset chosen in forward selection and backward elimination. To ensure accurate results, the program needs to be run in an ideal standalone environment, as dynamic load on the processor may affect the performance.
3. CONCLUSION
Variable ranking/feature ranking has great influence on the processing of high dimensional data and on optimizing the performance of classification tasks, as it concentrates on features with an acceptable level of relation to the target class. Among the many variable or feature selection mechanisms, the variable ranking method can be treated as a principal or auxiliary selection mechanism because of its straightforwardness, scalability, and acceptable empirical success. However, many variable ranking or feature weighting methods treat the features independently.


Nested subset methods provide a useful ranking of subsets, not of individual variables. To illustrate the usefulness of the nested subset method, an empirical study has been conducted on the benchmark microarray data set of Pima Indian diabetic patients, and it is found that nested subset methods yield better classification performance. In addition to the backward elimination and forward selection strategies, more appropriate greedy strategies can be identified to optimize the performance of classification tasks, and there will be scope to design better heuristics to ensure optimum utilization of computing resources in the process. The impact of the learning rate and of the threshold value chosen for subset selection on the classification performance can also be studied.
4. ACKNOWLEDGEMENT
The author of this paper wishes to acknowledge Vincent Sigillito, Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707, for his effort in building and maintaining the benchmark data sets and for kindly donating them, which helped the author in carrying out the empirical study.
REFERENCES
[1] R. Caruana and D. Freitag, "Greedy Attribute Selection," in Proceedings of the International Conference on Machine Learning, Morgan Kaufmann, 1994.
[2] R. Bekkerman et al., "Distributional Word Clusters vs. Words for Text Categorization," Journal of Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
[3] R. Kohavi and G. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[4] S.K. Das, "Feature Selection with a Linear Dependence Measure," IEEE Transactions on Computers, pp. 1106-1109, 1971.
[5] Halina Kwasnicka and Krzysztof Michalak, "Correlation-based Feature Selection Strategy in Classification Problems," International Journal of Applied Mathematics and Computer Science, vol. 16, no. 4, pp. 503-511, 2006.
[6] G.H. John, R. Kohavi and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," in Proceedings of the 11th International Conference on Machine Learning, pp. 121-129, New Brunswick, 1994.
[7] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[8] S.C. Gupta and V.K. Kapoor, Fundamentals of Mathematical Statistics, Sultan Chand & Sons.
[9] J. Weston et al., "Feature Selection for SVMs," in Advances in Neural Information Processing Systems, 2003.
[10] K. Torkkola, "Feature Extraction by Non-Parametric Mutual Information Maximization," Journal of Machine Learning Research, vol. 3, pp. 1415-1438, 2003.
[11] M. Babu Reddy, "Independent Feature Elimination in High Dimensional Data: Empirical Study by Applying Learning Vector Quantization Method," IJAIEM, vol. 2, issue 10, pp. 248-253, Oct. 2013.
[12] T. Hastie et al., The Elements of Statistical Learning, Springer-Verlag, New York, 2001.
[13] S.N. Sivanandam, S. Sumathi and S.N. Deepa, Introduction to Neural Networks Using Matlab 6.0, Tata McGraw-Hill, 2006.
[14] Huan Liu and Lei Yu, "Redundancy Based Feature Selection for Microarray Data," in Proceedings of KDD '04, pp. 737-742, Seattle, WA, USA, 2004.
[15] Vincent Sigillito, UCI Machine Learning Repository: Pima Indian Diabetes Data. Available at: http://archives.uci.edu/ml/datasets/
[16] C. Blake and C. Merz, UCI Repository of Machine Learning Databases. Available at: http://ics.uci.edu/~mlearn/MLRepository.html
[17] Y. LeCun, O. Matan, B. Boser, J.S. Denker, et al., "Handwritten Zip Code Recognition with Multilayer Networks," in Proceedings of ICPR, vol. II, pp. 35-40, 1990.
[18] I.S. Dhillon et al., "Information-Theoretic Co-Clustering," in Proceedings of ACM SIGKDD, pp. 89-98, 2003.






AUTHOR
Dr. M. Babu Reddy received his Master's degree in 1999 and his Doctor of Philosophy in 2010 from Acharya Nagarjuna University, AP, India. He has been actively involved in teaching and research for the past 15 years and is currently working as Assistant Professor of Computer Science, Krishna University, Machilipatnam, AP, India. His research interests include Machine Learning, Software Engineering, Algorithm Complexity Analysis and Data Mining.
