Abstract
With the huge amount of data generated in the world every day, at a rate far beyond what human comprehension alone can analyse, data mining has become an extremely important task for extracting as much useful information from this data as possible. Standard data mining techniques are satisfactory to a certain extent, but they are constrained by certain limitations, and it is in these cases that evolutionary approaches are both more capable and more efficient. In this paper we present the use of nature-inspired evolutionary techniques for data mining, augmented with human interaction to handle situations in which concept definitions are abstract and hard to define, and hence not quantifiable in an absolute sense. Finally, we propose some ideas for future implementations of these techniques.
Table of Contents
1 Introduction ............................................................................................................................. 4
2 Overview of Data mining and Knowledge Discovery ............................................... 5
2.1 Data Mining Pre-processing ....................................................................................... 5
2.2 Data Mining Tasks .......................................................................................................... 6
2.2.1 Models and Patterns ............................................................................................. 6
2.3 Conventional Techniques of Data Mining ............................................................. 8
3 Evolutionary Algorithms and Data Mining ................................................................... 8
3.1 Genetic Algorithms ........................................................................................................ 9
3.2 Genetic Programming ................................................................................................... 9
3.3 Co-evolutionary Algorithms .................................................................................... 10
3.4 Representation and Encoding ................................................................................ 11
3.4.1 Rules Representation ........................................................................................ 11
3.4.2 Fuzzy Logic Based Rules Representation .................................................. 12
3.5 Genetic Operators........................................................................................................ 12
3.5.1 Crossover ............................................................................................................... 12
3.5.2 Mutation ................................................................................................................. 13
3.5.3 Fuzzy logic Operators ....................................................................................... 13
3.6 Fitness Evaluation ....................................................................................................... 15
3.6.1 Objective Fitness Evaluation .......................................................................... 15
3.6.2 Subjective Fitness Evaluation (Interactive Evolutionary Algorithms) ......... 17
3.7 Selection and Replacement...................................................................................... 18
3.8 Integrating Conventional Techniques with Evolutionary Algorithms.... 19
4 Applications of Data Mining Using IEA ....................................................................... 19
4.1 Extracting Knowledge from a Text Database ................................................... 19
4.2 Extracting Marketing Rules from User Data ..................................................... 22
4.3 Fraud Detection Using Data Mining and IEA Techniques ............................ 24
4.4 Some current work being done.............................................................................. 25
5 Conclusion and Future Work .......................................................................................... 25
6 References .............................................................................................................................. 27
1 Introduction
In recent years, the massive growth in the amount of stored data has increased the demand
for effective data mining methods to discover the hidden knowledge and patterns in these
data sets. Data mining means to “mine”, or extract, relevant information from any available
data of concern to the user. Data mining is not a new idea: it has been practised for
centuries on problems like regression analysis and knowledge discovery from records of
various types.
As computers entered almost every conceivable field of human knowledge and occupation,
their advantages were widely promoted, but it soon became apparent that with the
increasing amounts of data that could be generated, stored and analysed, some way was
needed to sift through it all and extract what was important. In the earlier days a person, or
a group of people, would analyse the data manually using statistical techniques, but the
curve of data generation was far steeper than what could realistically be processed by hand.
This led to the emergence of the field of data mining, whose purpose was essentially to
define and formalize standard techniques for extracting knowledge from large data
warehouses. As data mining evolved, it was observed that the data at hand was almost
never perfect or suitable to be fed directly to data mining engines, and needed several steps
of pre-processing before it could be put through “mining”. Generally these inconsistencies
lie in the data format, the level of noise or incorrect data, and unnecessary or redundant
data. The pre-processing steps clean, integrate and discretize the data and select the most
relevant attributes before any mining is performed.
A whole new area called intelligent data analysis has emerged, which uses efficient
techniques for mining large data sets. These techniques must ensure that the knowledge
obtained is useful, while remembering that the time available for mining is constrained and
the user requires results as soon as possible. Some of the methods used to mine data
include support vector machines, decision trees, nearest-neighbour analysis, Bayesian
classification, and latent semantic analysis.
Given the problems associated with conventional data mining techniques, clever new ways
to overcome them were needed, and the application of AI techniques to the field resulted in
a very powerful hybrid. Evolutionary optimization techniques provided a useful and novel
solution to these issues, and once data mining was enhanced with evolutionary
computation, many of the previously mentioned problems were no longer serious obstacles.
This paper begins by describing some concepts in data mining and general evolutionary
algorithms by giving relevant concepts and descriptions. In the later sections we discuss
some of the areas where these are implemented and lastly we give a few ideas of where
these techniques may be implemented in the future.
2 Overview of Data Mining and Knowledge Discovery
Knowledge discovery and data mining, as defined by Fayyad et al. (1996), is “the process of
identifying valid, novel, useful, and understandable patterns in data”. Data mining has
emerged particularly in situations where analysing the data manually or by using simple
queries is either impossible or very complicated (Cantú-Paz & Kamath, 2001). Data mining is
a multi-disciplinary field that incorporates knowledge from many disciplines, mainly machine
learning, artificial intelligence, statistics, signal and image processing, mathematical
optimization, and pattern recognition (ibid.).
Knowledge discovery and data mining consist of three main steps to convert a collection of
raw data to valuable knowledge. These three steps are data pre-processing, knowledge
extraction, and data post-processing (Freitas, 2003). The discovered knowledge should be
accurate, comprehensible, relevant and interesting to the end user in order for the data
mining process to be considered successful (Cantú-Paz & Kamath, 2001).
This section gives an overview of data mining pre-processing, data mining tasks, and the
conventional techniques for data mining.
2.1 Data Mining Pre-processing
The purpose of data mining pre-processing is to eliminate outliers, inconsistency and
incompleteness in the data in order to obtain accurate results (Freitas, 2003). These pre-
processes are listed below:
• Data integration: removes redundant and inconsistent data from data that is
collected from different sources.
• Attribute selection: selects the data relevant to the analysis process from all the
data sets.
• Data mining: after the previous steps are done, data mining algorithms or
techniques can be applied to the data in order to extract the desired knowledge.
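The steps above can be sketched in a few lines of plain Python. This is a minimal illustration only; the record contents and all function names (`integrate`, `select_attributes`) are invented for the example, not taken from the paper.

```python
# A minimal sketch of integration and attribute selection on toy records.

def integrate(*sources):
    """Data integration: merge sources and drop exact duplicate records."""
    seen, merged = set(), []
    for source in sources:
        for record in source:
            key = tuple(sorted(record.items()))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

def select_attributes(records, relevant):
    """Attribute selection: keep only the attributes relevant to the analysis."""
    return [{k: r[k] for k in relevant if k in r} for r in records]

crm = [{"age": 34, "income": 50, "fax": "n/a"}]
web = [{"age": 34, "income": 50, "fax": "n/a"},   # duplicate of the CRM row
       {"age": 61, "income": 20, "fax": "n/a"}]

# Integrate the two sources, then drop the irrelevant "fax" attribute.
data = select_attributes(integrate(crm, web), ["age", "income"])
```

After these steps, `data` holds the de-duplicated records with only the selected attributes, ready for the mining step proper.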
2.2 Data Mining Tasks
It is very important to define the data mining task that the algorithm should address before
designing it for application to a particular problem. There are several tasks of data mining
and each of them has specific purposes in terms of the knowledge to be discovered (Freitas,
2002).
2.2.1 Models and Patterns
In data mining, a model “is a high level description of the data set” (Hand, 2001). A model
can be either descriptive or predictive. As the names imply, a descriptive model is an
unsupervised model that aims to describe the data, while a predictive model is a supervised
model that aims to predict values from the data.
Patterns are used to define the important and interesting features of the data. An unusual
combination of purchased items in a supermarket is an example of a pattern. Models are
used to describe the whole data set, while patterns are used to highlight particular aspects
of the data.
Classification Task
Some terms need to be introduced in order to describe classification tasks. The data
sets that classification techniques or algorithms are applied to are composed of a
number of instances (objects). Each instance has a number of attributes, which have
discrete values. The records in database tables, for example, represent the instances
and the fields represent the attributes. In other words, each row represents an
object and the columns describe this object in terms of its attributes. The
classification task is to extract the hidden knowledge from some attribute values, in
the form of patterns, in order to predict the value of a particular field or attribute.
This target value is known as the class (Dzeroski & Lavrac, 2001).
The inputs for the classification algorithm are data instances and the outputs are the
patterns used to predict the class each instance belongs to. A classification rule takes
the general form IF <attribute conditions> THEN <class> (Freitas, 2002).
The data in the classification task is divided into two “mutually exclusive” data sets,
the training dataset and the testing dataset (Freitas, 2003). The training dataset is
used to build the classification model and the test dataset is used to evaluate the
predictive performance of the model. Overfitting occurs when the model is over-
trained on the training dataset and is simply “memorizing” it, which results in poor
predictive performance on the testing dataset. By contrast, underfitting occurs when
the model is undertrained and has not learned well from the training data. In
underfitting situations, the model consists of a number of rules that each cover too
many training instances (ibid.).
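The train/test protocol described above can be sketched as follows. The data, the deliberately naive majority-class “model”, and the split fraction are all illustrative choices, not from the paper.

```python
# Sketch: mutually exclusive train/test split and predictive-accuracy measurement.
import random

def split(instances, train_frac=0.7, seed=0):
    """Divide the data into mutually exclusive training and testing sets."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def majority_class(training):
    """A deliberately simple 'model': always predict the most common class."""
    classes = [c for _, c in training]
    return max(set(classes), key=classes.count)

# Toy instances: (attributes, class-label) pairs.
instances = [((i,), "buy" if i % 3 else "skip") for i in range(30)]
train, test = split(instances)
model = majority_class(train)

# Predictive performance is always measured on the held-out testing set.
test_acc = sum(model == c for _, c in test) / len(test)
```

A model that merely memorized `train` could score perfectly there and still do poorly on `test`; that gap is exactly the overfitting symptom the text describes.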
Regression Task
Clustering Task
Clustering simply means grouping: placing data instances into different groups, or
clusters, such that instances from the same cluster are similar to one another and
easily distinguished from instances that belong to other clusters (Zaki et al., 2010).
Association Analysis Task
Association analysis refers to the process of extracting association rules from a data
set that describe interesting relations hidden in that data set. For illustration,
imagine the market-basket transactions example, where we have two items A and B
and the following rule is extracted from the data: {A} -> {B}. This rule suggests that
there is a strong relation between item A and item B in terms of the frequency of
their occurrence together (Tan et al., 2006). This means that if item A is in the basket
then there is a high probability that item B will be in the basket as well.
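The strength of a rule like {A} -> {B} is conventionally measured by its confidence, the fraction of baskets containing A that also contain B. A minimal sketch, with an invented set of baskets:

```python
# Confidence of the rule {A} -> {B} over market-basket transactions.

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    return len(with_both) / len(with_a) if with_a else 0.0

baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}, {"A", "B"}]
conf_ab = confidence(baskets, "A", "B")   # 3 of the 4 baskets with A also hold B
```

Here `conf_ab` is 0.75: seeing A in the basket makes B quite likely, which is the “strong relation” the rule expresses.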
2.3 Conventional Techniques of Data Mining
Several tools and techniques are available for data mining and knowledge discovery. These
techniques have been developed from two main fields: statistics and machine learning.
Multivariate analysis, logistic regression, linear discrimination, ID3, k-nearest neighbor,
Bayesian classifiers, principal component analysis, and support vector machines are
examples of these techniques. These techniques are designed to discover accurate and
comprehensible rules, but most of them are not designed to discover interesting rules
(Freitas, 2003).
Statistics and machine learning techniques are the most widely used techniques for data
mining, but they have some drawbacks. The models or rules discovered using these
techniques are not always optimal, owing to their sensitivity to noise in the data set, which
may cause them to overfit the data (Vafaie & Jong, 1994). They also tend to generate
models with more features than really necessary, which increases the computational cost of
the model (ibid.). Another drawback is that they typically assume a priori knowledge about
the data set, which is not available in most cases. Statistical methods have the further
problem that they assume linearity in the models and in the distribution of the data (Terano
& Ishino, 1996).
3 Evolutionary Algorithms and Data Mining
Evolutionary algorithms have several features that make them attractive for the data mining
process (Freitas, 2003; Vafaie & Jong, 1994). They are a domain-independent technique,
which makes them ideal for applications where domain knowledge is difficult to provide.
They have the ability to explore large search spaces, consistently finding good solutions. In
addition, they are relatively insensitive to noise, and can manage attribute interaction better
than conventional data mining techniques.
Therefore, much work has been done in recent years to develop new techniques for data
mining using evolutionary algorithms. These attempts have used evolutionary algorithms for
different data mining tasks such as feature extraction, feature selection, classification, and
clustering (Cantú-Paz & Kamath, 2001). The main role of evolutionary algorithms in most of
these approaches is optimization: they are used to improve the robustness and accuracy of
some of the traditional data mining techniques.
Different types of evolutionary algorithms have been developed over the years, such as
genetic algorithms, genetic programming, evolution strategies, evolutionary programming,
differential evolution, cultural algorithms and co-evolutionary algorithms (Engelbrecht,
2007). The types used in data mining include genetic algorithms, genetic programming and
co-evolutionary algorithms. Genetic algorithms are used for data pre-processing and for
post-processing the discovered knowledge, while genetic programming is used for rule
discovery and data pre-processing (Freitas, 2003).
This section will give a general overview of genetic algorithms, genetic programming, and
co-evolutionary algorithms, followed by an overview of different representation schemes,
genetic operators, and fitness evaluation for the purpose of data mining. Finally a brief
discussion of integrating conventional data mining techniques with evolutionary algorithms
is given.
3.1 Genetic Algorithms
Genetic algorithms “have been originally proposed as a general model of adaptive
processes, but by far the largest application of the techniques is in the domain of
optimization” (Back et al., 1997). Inspired by natural evolution, they maintain a population
of individual solutions that are acted upon by a series of genetic operators in order to
generate new, and hopefully better, solutions to a particular problem.
The term ‘genetic algorithm’ was coined in the early 1970s by John Holland, who had been
working since the early 1960s with systems that generate populations of potential solutions
using natural methods. In his paper “Outline for a Logical Theory of Adaptive Systems”,
Holland (1962) describes a system in which a “generation tree” of populations is generated.
By applying a number of solutions (the population) to a number of problems (the
environment), the solutions that successfully solve the problems are given
reward/activation scores, which enable solutions to be compared with one another; the
best of these are used to generate the next branch of the generation tree.
Virtually all modern evolutionary systems have the same general stages:
• A random population of solutions is generated, to be used as the initial population
• The solutions within the population are evaluated to determine their ‘fitness’
• Solution pairs are first selected, based on their fitness, and then are combined to
create offspring, which are added to the next generation of the population
• Other genetic operators, such as mutation, are also applied to offspring
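The stages above can be sketched as a minimal genetic algorithm. The toy problem (maximise the number of 1-bits in a string), the use of tournament selection, and all parameter values are illustrative choices, not from the paper.

```python
# Minimal GA: initialise -> evaluate -> select & recombine -> mutate, repeated.
import random

rng = random.Random(42)
LENGTH, POP, GENS, MUT = 16, 20, 40, 0.05

def fitness(bits):
    """Evaluate a solution: here, simply the count of 1-bits."""
    return sum(bits)

def select(pop):
    """Tournament selection: the fittest of 3 randomly drawn individuals."""
    return max(rng.sample(pop, 3), key=fitness)

def crossover(a, b):
    """One-point crossover: combine a head of one parent with a tail of the other."""
    cut = rng.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

def mutate(bits):
    """Flip each bit independently with probability MUT."""
    return [1 - b if rng.random() < MUT else b for b in bits]

# A random initial population, evolved for GENS generations.
population = [[rng.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]
best = max(population, key=fitness)
```

Every modern variant reshuffles these same ingredients; only the representation, operators, and fitness function change with the problem.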
3.2 Genetic Programming
Figure 1: Examples of evolved LISP programs. Fitness calculated by the number of outputs closer than
20% to the correct output. (Mitchell, 1998)
Koza realized that not only could genetic algorithms be used to evolve programs, but also
other complex structures, such as equations and rule sets. This is particularly useful when
looking at genetically evolving data mining techniques, since we can use the principles of
genetic programming in the evolution of rule constructs.
3.3 Co-evolutionary Algorithms
In co-evolutionary algorithms, two populations are evolved together, with the fitness
function involving the relationship with other individuals. The individuals of the two
populations evolve either by competing against each other or by cooperating with each
other (Engelbrecht, 2007). The competitive approach is used to “obtain exclusivity on a
limited resource” (Tan et al., 2005), while the cooperative approach is used to “gain access
to some hard to attain resource” (ibid.). In the competitive approach, the fitness of an
individual in one population is based on direct competition with individuals in the other
population. In the cooperative approach, on the other hand, the fitness of an individual in
one population is based on how much it cooperates with the individuals in the other
population. Co-evolutionary approaches, particularly the cooperative approach, can address
some of the problems of single-population evolutionary algorithms, such as poor
performance and convergence to local optima when dealing with problems that have
complex solutions (Tan et al., 2005).
Several attempts have been made to apply co-evolutionary algorithms to the field of data
mining. One of them is the distributed evolutionary classifier for knowledge discovery
proposed by Tan et al. (2005). In their approach, a cooperative evolutionary algorithm
evolves two populations: each individual of the first population represents a single rule,
while each individual of the second population represents a set of rules. They validated
their approach using six datasets, and their classifier performed better than the C4.5
classifier (a well-known algorithm for generating decision trees). The proposed co-
evolutionary approach reduces computation time by sharing the workload among multiple
computers. It also achieved a smaller number of rules in the rule set compared with other
classification techniques, which increases the comprehensibility of the classification model.
Moreover, it is more robust to noise in the data and has robust prediction accuracy.
Another approach that applies co-evolutionary algorithms to data mining is the co-
evolutionary system for discovering fuzzy classification rules developed by Mendes et al
(2001). They used two evolutionary algorithms in their system: a genetic programming
algorithm and an evolutionary algorithm to co-evolve two populations. The genetic
programming algorithm evolves a population of fuzzy rule sets, and the evolutionary
algorithm evolves a population of membership function definitions. The advantage of the
co-evolutionary process is that it discovers fuzzy rule sets and membership function
definitions that are better adjusted to each other.
3.4 Representation and Encoding
The traditional method of encoding the genetic rules, which perhaps resembles most closely
the way evolution occurs in nature, is to use a direct representation scheme to encode the
population data as a series of bitstrings – a binary string representative of the genes, which
build each chromosome in a population. An 8-bit binary string would be representative of a
population whose genetic data consisted of 8 Boolean values, where each bit had some
specific meaning. For example, a system looking to design a new car could use its first bit to
represent whether or not the car has two or four doors, the second to represent whether it
has 3 or 4 wheels, the third to represent whether the car has a spoiler, etc. In this system, a
population member with a value 011xxxxx would represent a car with two doors, four
wheels and a spoiler. The key issue with this kind of representation, however, is that it
defines a very specific search space, with a set number of genetic ‘parameters’ and a very
restricted set of values that each of these parameters can take.
A simple way to make the genetic algorithm considerably more powerful is to alter the
representation so that rather than being encoded as a set of Boolean values, each gene
stores a number, such as an integer or a floating-point value. This means that the search
space defined by the system described before could be hugely expanded, allowing for a far
greater number of genetic possibilities in its population. For example, the second bit
represented whether the car had three or four wheels; by using an integer rather than a
Boolean we broaden the system so that it can represent a car with any number of wheels.
Expanding the representation in this way comes with an additional memory overhead – a
gene encoded as a 1-byte integer is 8 times the size of a binary gene – however, the number
of possible values leaps to 256, so the memory increase is a small cost to pay for a
significant improvement to the representation.
Of course, not all genes need to be encoded in the same way; a chromosome can be
constructed by any combination of data types that best fit the space being represented. For
example, it would not make sense to represent whether a car has a spoiler or not with an
integer, as there are only two possibilities, so a Boolean would be sufficient.
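The mixed encoding described above can be sketched directly. The car attributes are the paper's own example; the field names and the use of a dataclass are our illustrative choices.

```python
# A chromosome mixing data types: Booleans where two values suffice,
# an integer where a whole range of values is needed.
from dataclasses import dataclass

@dataclass
class CarChromosome:
    two_doors: bool      # Boolean gene: only two possibilities (two vs. four doors)
    wheels: int          # integer gene: any wheel count, not just 3 or 4
    has_spoiler: bool    # Boolean gene: a spoiler is either present or not

# A car with two doors, four wheels and a spoiler.
car = CarChromosome(two_doors=True, wheels=4, has_spoiler=True)
```

The point is simply that each gene's type is chosen to fit the space it represents, rather than forcing everything into bits or everything into integers.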
3.4.2 Fuzzy Logic Based Rules Representation
For our rule set, therefore, we will use just two binary operators, AND and OR; four unary
operators, NOT, LOW, MEDIUM and HIGH; and an integer value to represent each data
variable. We will assign a numerical value to each of these, so for instance values 0-5 could
represent the operators, and the variables could be 6 and above. We will use 1-byte binary
representations of these, which gives us up to 250 possible variables; if we require more
variables, we can simply choose a larger representation (e.g. 2 bytes gives us 65,530
variables).
If we consider a sample rule, LOW(Age) AND NOT(HIGH(Height) OR HIGH(Weight)), we can
see how the binarized form of the rule, 0000 0011 0110 0010 0001 0101 1000 0101 0111,
can be parsed and understood (in this example, we have used a 4-bit representation).
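A short parser makes the encoding concrete. The specific code assignment used here (AND=0, OR=1, NOT=2, LOW=3, MEDIUM=4, HIGH=5, then variables Age=6, Height=7, Weight=8) is an assumption consistent with the bit string above, not something the text fixes explicitly.

```python
# Decode the 4-bit, prefix-notation token stream back into a readable rule.

BINARY = {0: "AND", 1: "OR"}
UNARY = {2: "NOT", 3: "LOW", 4: "MEDIUM", 5: "HIGH"}
VARS = {6: "Age", 7: "Height", 8: "Weight"}

def parse(tokens):
    """Consume one prefix-notation expression from the front of the token list."""
    t = tokens.pop(0)
    if t in BINARY:
        left, right = parse(tokens), parse(tokens)
        return f"({left} {BINARY[t]} {right})"
    if t in UNARY:
        return f"{UNARY[t]}({parse(tokens)})"
    return VARS[t]

bits = "0000 0011 0110 0010 0001 0101 1000 0101 0111"
tokens = [int(nibble, 2) for nibble in bits.split()]
rule = parse(tokens)
# Since OR is commutative, the decoded operand order is equivalent
# to the rule as written in the text.
```

Under this assignment the stream decodes to `(LOW(Age) AND NOT((HIGH(Weight) OR HIGH(Height))))`.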
3.5 Genetic Operators
3.5.1 Crossover
A common method for crossover is one-point crossover (Rawlins, 1991). Bitstring
representations will be used for simplicity here, but the methods do not vary with the data
type used. In one-point crossover, the two parent chromosomes are split at the same place,
and half of one set of genes is combined with the other half of the other. The crossover
point tends to be randomized each time a pair of chromosomes reproduces. As an
example, consider the two parent chromosomes 01100101 and 10011100. Since there are
eight genes in the chromosome, the crossover point can be anywhere between bits 1-2 and
7-8. Say the crossover point is between bits 3-4, the two halves of each parent will be
011|00101 and 100|11100. Depending on which way the parents are combined, the
offspring will be either 011|11100 or 100|00101.
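The worked example above can be reproduced in a few lines (the function name is ours):

```python
# One-point crossover: split both parents at the same point and swap the tails.

def one_point_crossover(parent1, parent2, point):
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# Crossover point between bits 3 and 4, as in the text.
offspring = one_point_crossover("01100101", "10011100", 3)
# reproduces the two offspring from the text: ("01111100", "10000101")
```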
Example of generalizing and specializing crossover, where the symbol “|” illustrates the
crossover points (ibid.).
3.5.2 Mutation
Mutation is a fairly simple operator, in which bits are flipped to alter a chromosome’s
genetic makeup. The mutation rate affects how often these mutations occur – a system
with a high mutation rate will produce many mutated offspring. Mutation is necessary
because it provides renewable variety: it allows the system to explore solutions that may
not be reachable by recombination alone.
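A bit-flip mutation sketch; the mutation rate shown is deliberately high for illustration, and in practice rates are usually very small.

```python
# Flip each bit independently with probability `rate`.
import random

def mutate(bitstring, rate, rng):
    return "".join(str(1 - int(b)) if rng.random() < rate else b
                   for b in bitstring)

rng = random.Random(0)
mutant = mutate("01100101", rate=0.5, rng=rng)
```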
3.5.3 Fuzzy Logic Operators
With a fuzzy logic representation, we need to come up with new ways to cross and mutate
the individuals in our population, ensuring that the rules remain valid within the structure of
the grammar.
Mutation
Mutation is a simple enough process: we can interchange the binary functions AND and OR,
leaving a syntactically correct rule; we can add or remove NOT before any of the operators,
whether unary or binary; and we can substitute any of the fuzzy classification functions
LOW, MEDIUM or HIGH with one another.
Crossover
Crossover, however, is a more difficult problem. One-point crossover is not a suitable
method here, since it can result in syntactically incorrect rules. Figure 3 shows the result of
crossing the rule shown in Figure 2 with another rule, NOT( (MEDIUM(Age) OR LOW(Age))
OR (LOW(Height)) ), at a random point.
Rule 1: 0000 0011 0110 0010 0001 0101 | 1000 0101 0111
Rule 2: 0010 0001 0001 0100 0110 | 0011 0110 0011 1000
Rule 3: 0000 0011 0110 0010 0001 0101 | 0011 0110 0011 1000
Figure 4
As you can see, crossing at this point has cut a binary operator in half, resulting in a rule that
cannot be parsed. For this reason, it is important to look more carefully at the crossover
points. Crossover may not occur after a fuzzy classification function; however, it may occur
at any point where an AND, OR, or NOT branches. In addition to this method of merging
rules, other systems have used a method of occasionally simply combining whole rules
using AND or OR. This can be a useful technique if used infrequently, so we can use this
combination method 10% of the time, and use the merging method the rest of the time
(Walter, 2000).
3.6 Fitness Evaluation
Each member of the population in an evolutionary system has a fitness level, defined by
how effective the solution is deemed to be at solving a particular problem. The general aim
of any genetic algorithm is to adapt the parameters of its population in order to evolve
solutions with maximal fitness. Fitness functions can be extremely complicated, as they
require some method for quantitatively or qualitatively evaluating solutions, often without
knowing in advance what makes a good solution.
In data mining, the fitness function is used to evaluate the fitness of the prediction rules. As
mentioned earlier in this paper, prediction accuracy, comprehensibility and interestingness
represent the quality criteria of the discovered rules and can be used to measure their
fitness. Two main types of fitness functions are used in data mining to evaluate the fitness
of an individual: objective and subjective fitness evaluation.
A major issue with using evolutionary algorithms in data mining, which needs to be
considered when designing the fitness function, is the interestingness of the discovered
rules. Evolutionary algorithms are powerful techniques that can perform a global search
and generate huge numbers of rules; however, these rules can be trivial and uninteresting.
3.6.1 Objective Fitness Evaluation
This section discusses the objective, or quantitative, approaches to evaluating the fitness of
discovered rules (Freitas, 2003). Different approaches have been proposed to design
effective objective fitness functions. These approaches are organized here according to the
quality criteria of discovered rules mentioned earlier. The examples used to illustrate these
approaches are represented using the Michigan scheme, in which each individual
represents a single rule.
Prediction Accuracy Criteria
One approach to measuring the prediction accuracy of a rule is to use the confidence factor
CF (Freitas, 2003). Suppose that the rule to be evaluated is as follows:
IF A THEN B
CF = |A&B| / |A|
where |A| represents the number of instances in the data that satisfy all the conditions in
the antecedent part A of the rule, and |A&B| represents the number of instances in the
data that satisfy all the conditions in A and are classified as class B (Freitas, 2003).
Here is an example of how to calculate CF: if |A| = 100 and |A&B| = 60, then CF will be 60%,
which gives an insight into how accurate the rule is. Therefore, the higher the rule's
accuracy on the training set, the more likely it is to be selected. This is a very simple
approach to defining prediction accuracy, but one obvious drawback of such an approach is
that it is likely to overfit the data, which results in poor prediction performance on the
testing data set.
Another approach mentioned in Freitas (2003) is to use a confusion matrix: a 2 × 2 matrix
that describes the “predictive performance” of the rule in terms of true positives (TP), false
positives (FP), false negatives (FN) and true negatives (TN). The confidence factor is then
CF = TP / (TP + FP)
and the fitness combines it with a completeness measure Comp:
Fitness = CF × Comp
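A sketch of this fitness. The confusion matrix itself did not survive in this copy of the paper, so reading Comp as completeness (recall), TP / (TP + FN), is an assumption based on the usual formulation in Freitas (2003).

```python
# Fitness = CF * Comp from a rule's confusion-matrix counts.
# Comp = TP / (TP + FN) is assumed here (completeness / recall).

def rule_fitness(tp, fp, fn):
    cf = tp / (tp + fp) if tp + fp else 0.0      # confidence factor
    comp = tp / (tp + fn) if tp + fn else 0.0    # completeness (assumed)
    return cf * comp

fit = rule_fitness(tp=30, fp=30, fn=10)   # CF = 0.5, Comp = 0.75
```

Multiplying the two pushes the search away from rules that are accurate but cover almost nothing, and from rules that cover everything but predict poorly.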
Comprehensibility Criteria
The fitness function can be extended to cover the comprehensibility criterion as follows
(Freitas, 2003):
Fitness = W1 × (CF × Comp) + W2 × Simp
where W1 and W2 are user-defined weights and Simp refers to the simplicity measure of a
rule. One obvious way to measure simplicity is to count the number of conditions in the
rule: the smaller the number of conditions, the simpler the rule.
For data mining approaches that use genetic programming, the simplicity of a rule can be
measured by counting the number of nodes. A possible method to measure rule simplicity,
mentioned in Freitas (2003), is to define a maximum number of nodes in a tree (individual)
and then calculate the simplicity as follows:
Simp = (MaxNodes − 0.5 × NumNodes − 0.5) / (MaxNodes − 1)
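Coding the measure shows its behaviour at the extremes: a one-node tree scores 1.0, and a tree at the allowed maximum scores 0.5. The maximum of 21 nodes is an arbitrary illustrative choice.

```python
# Tree-simplicity measure: falls linearly from 1.0 (one node)
# to 0.5 (tree at the allowed maximum size).

def simplicity(num_nodes, max_nodes):
    return (max_nodes - 0.5 * num_nodes - 0.5) / (max_nodes - 1)

s_small = simplicity(1, 21)    # smallest possible tree -> 1.0
s_large = simplicity(21, 21)   # tree at the size limit  -> 0.5
```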
Interestingness Criteria
Noda et al. (1999) proposed a fitness function composed of two parts: the first measures
the degree of interestingness and the second measures the predictive accuracy. The
degree-of-interestingness part itself consists of two parts, and users are expected to set the
weights of the interestingness and predictive-accuracy parts. A well-known measure of this
kind is Piatetsky-Shapiro's rule-interest function:
PS = |A&B| − |A| × |B| / N
According to Piatetsky-Shapiro, as summarized in Gebhardt (1991), there are three
principles for rule interestingness (RI) measures:
• RI = 0 if |A & B| = |A| |B| / N, i.e. when the antecedent and the consequent of the
rule are statistically independent.
• RI monotonically increases with |A&B| when the other parameters, |A| and |B|, are
fixed. In this case the CF and Comp factors also increase, which means a more
interesting rule.
• RI monotonically decreases with |A| or |B| when the other parameter, |A&B|, is
fixed. In this case the CF and Comp factors decrease, which means a less interesting
rule.
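The PS measure and its first principle can be checked directly; the counts below are invented for illustration.

```python
# Piatetsky-Shapiro rule-interest: PS = |A&B| - |A||B|/N.

def ps(n_a, n_b, n_ab, n):
    return n_ab - n_a * n_b / n

# Principle 1: PS is zero when A and B are statistically independent
# (here 20 joint occurrences is exactly 50 * 40 / 100).
independent = ps(n_a=50, n_b=40, n_ab=20, n=100)

# A and B co-occurring more often than chance gives a positive score.
positive = ps(n_a=50, n_b=40, n_ab=30, n=100)
```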
3.6.2 Subjective Fitness Evaluation (Interactive Evolutionary Algorithms)
Writing an appropriate objective fitness function can be a very hard task. This is particularly
true in situations where domain knowledge or prior knowledge is not available, which
makes it difficult to decide what counts as interesting knowledge. In such cases, subjective
fitness evaluation can be very useful. Subjective fitness evaluation is done by human
experts.
In data mining, domain experts evaluate the fitness of the discovered rules according to
their interestingness. Rules can be interesting if they are unexpected and actionable for
the user (Liu et al., 1997). In many domains, however, knowledge about the domain data
varies from one user to another. A user's prior knowledge of the domain can be either a
general impression (GI), when the user has only vague feelings about the domain, or
reasonably precise knowledge (RPK), when the user has definite ideas. Generally, discovered
rules are evaluated and ranked against these two types of knowledge (ibid.).
A major problem with data mining is that the discovered models do not necessarily contain
important or interesting rules. They sometimes include trivial rules or, even worse,
counterintuitive rules (Pazzani, 2002). One previous attempt to address this problem is to
have domain experts evaluate the models and identify what is interesting and important;
the model is then adjusted according to their feedback (for example by adding or removing
variables) until an acceptable model is found (ibid.). Subjective fitness evaluation and
interactive evolutionary algorithms accelerate this process and can generate more
interesting rules by involving the domain expert in the search itself, biasing it toward
models that are more novel and comprehensible.
Using subjective fitness evaluation in data mining offers many opportunities for future
research. For example, Pazzani's idea (2002), illustrated in Figure 5, could be realized
through subjective fitness evaluation. He describes how the different fields of artificial
intelligence, statistics, databases and cognitive psychology should be combined to improve
the performance of the multi-disciplinary field of data mining. Interactive evolutionary
algorithms can bring cognitive psychology into tools and techniques for data mining and
knowledge discovery by involving the human cognitive process in the search for interesting
patterns and the discovery of new knowledge from data sets.
Selection is the process of choosing which individuals in the population to use for
reproduction, and replacement is the process of selecting which individuals in the
population will go through to the next generation. Whilst selection and replacement can use
the same methods almost interchangeably, they do not both need to be implemented the
same way in a particular genetic system: a system may use the ‘roulette wheel’ method for
selection, and the ‘absolute’ method for replacement. There are a number of different
methods for making these selections:
Absolute
The n fittest individuals in the population are chosen for breeding, or the n least fit
individuals are replaced. Whilst this seems like a good strategy, it can result in losing
individuals that, while less fit, hold genetic material that could be useful in evolving
even better solutions. It is important to keep a mix of genetic material within the
population to stop the solutions converging prematurely on local optima.
Random
The opposite of absolute selection is random selection. Here, no regard is given at all
to the fitness – all individuals are selected with uniform probability. Whilst this
method does preserve variety, it also means that it can take a very long time to find
a good solution, and if it is used as a replacement strategy, good solutions have as
much chance of being overlooked as bad ones, meaning that good solutions may
never develop.
Roulette Wheel
By looking at the previous two methods, we can see that whilst it is important to
focus on the individuals with higher fitness levels, we also need to ensure that we do
not throw away potentially useful solutions. The roulette wheel method addresses
this by picking randomly, but in proportion to the fitness of the individuals, so that
very fit individuals have a higher chance of being selected for breeding, and less fit
individuals have a higher chance of being replaced.
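A minimal sketch of fitness-proportionate selection, assuming non-negative fitness values (the function signature is our own):

```python
import random

def roulette_select(population, fitnesses, k):
    """Roulette-wheel (fitness-proportionate) selection.

    Each individual's chance of being picked is proportional to its
    non-negative fitness; fit individuals may be picked more than once.
    """
    total = sum(fitnesses)
    chosen = []
    for _ in range(k):
        spin = random.uniform(0, total)      # spin the wheel once per pick
        cumulative = 0.0
        for individual, fitness in zip(population, fitnesses):
            cumulative += fitness            # walk wheel segments in order
            if cumulative >= spin:
                chosen.append(individual)
                break
    return chosen
```

For replacement the same routine can be spun on an inverted fitness (e.g. `max(fitnesses) - f`), so that less fit individuals are the ones more likely to be replaced.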
Several hybrid approaches have been proposed that integrate evolutionary algorithms with
one of the conventional techniques to tackle some of their problems, such as minimizing
the number of selected features and selecting more interesting features. One successful
attempt to integrate evolutionary algorithms with data mining is the approach developed by
Terano and Ishino (1996). Their approach combines an evolutionary algorithm with a machine
learning data mining technique, namely an inductive learning technique that generates
decision trees. They used the inductive learning algorithm to find rules from the data,
then used an interactive evolutionary algorithm to refine these rules. Their work is
discussed in greater depth in the following section.
In this section we examine some areas where data mining with interactive evolutionary
algorithm (IEA) techniques has been successfully applied.
The first approach detailed is general in the sense that it can be used to classify any
text-based data and hence is not limited to any specific discipline. The approach requires
textual data in the form of reports, which can be plain text files corresponding to the
database from which the knowledge needs to be extracted.
This technique, proposed by Sakurai et al. (2001), details a means to extract knowledge from
any database with the help of domain-dependent dictionaries. The particular application in
the paper deals with text mining from daily business reports generated by some institution
and classification of the reports based on knowledge dictionaries. In their experiment, two
kinds of knowledge dictionaries were used: one called the key concept dictionary, the other
the concept relation dictionary.
The daily business reports generated from any source are decomposed into words using
lexical analysis, and the words are checked against the entries in the key concept
dictionary. All reports are then classified under particular concepts according to the
words in the report that represent concepts in the key concept dictionary. Each report is
also checked to see whether its key concepts appear in the concept relation dictionary.
Reports are then classified according to the set of concept relations, and reports having
the same text class are put into the same group. This helps the end users, since they can
read only those reports in groups whose topics match their interests; it also gives them an
indication of the trends of topics in reports.
The key concept dictionary contains concepts having common features, concepts and
related keywords, and expressions and phrases concerned with the target problem. An
example of the key concept dictionary can be seen in the figure below. The concept relation
dictionary contains relations, each describing a condition and a result; this is a mapping
from key concepts to classes. Since creating a dictionary is time consuming and prone to
errors, the paper describes an automatic way of creating the concept relation dictionary.
A relation in the concept relation dictionary is like a rule and can be acquired by
inductive learning if training examples are available. To do so, words are extracted from
the document by lexical analysis and checked against the expressions in the key concept
dictionary. This yields the following mapping: concept classes are attributes, concepts are
values, and the text classes given by the reader are the result classes we want; together
these form a training example. For attributes that have no matched values, 0 is assigned.
An overview of this is depicted in the figure below.
Figure 7 (Sakurai et al., 2001)
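A hedged sketch of how one training example might be assembled from a report; the dictionary contents, concept classes and text class below are invented for illustration, not Sakurai et al.'s actual data:

```python
# Hypothetical key concept dictionary: expression -> (concept class, concept).
key_concept_dict = {
    "sold out": ("sales", "good sales"),
    "shortage": ("supply", "short supply"),
}
concept_classes = ["sales", "supply", "weather"]

def make_training_example(report_expressions, text_class):
    """Concept classes become attributes, matched concepts their values,
    0 marks an attribute with no matched concept, and the text class
    given by the reader is the result class."""
    example = {cls: 0 for cls in concept_classes}
    for expression in report_expressions:
        if expression in key_concept_dict:
            cls, concept = key_concept_dict[expression]
            example[cls] = concept
    example["class"] = text_class
    return example

make_training_example(["sold out", "rain"], "best sale")
```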
For the inductive learning to work we need a fuzzy algorithm, as reports written by humans
do not follow strict descriptions. The method used for the learning is therefore the IDF
algorithm, which is a fuzzy algorithm. This algorithm makes rules from the generated
training examples, and the generated rules have the genotype of a tree.
The whole process can be seen in Figure 8 below, which shows the inputs and the processes
that turn the input dictionaries and data into the final outputs.
The algorithm was tested on daily reports for a retail sales business, classified into 3
classes describing a sales opportunity as best, missed or other. The key concept dictionary
was composed of 13 concept classes, each with its own subset of concepts. Reports containing
contradicting descriptions were regarded as unnecessary, and no training examples were
generated from them. The results, using 10-fold cross-validation, showed that the concept
relation dictionary was successfully generated and obtained better results than IDF alone
on the reports generated for retailing.
Since marketing decisions require optimal rules derived from customer data, which can be
very noisy, simulated breeding and inductive learning methods have been tested to create
such rules; they have been able to generate simple, easy-to-understand results in a form
that can be used directly by the marketing agent.
This work was developed by Terano and Ishino (1996). The conventional way to generate
efficient decision-making rules was to use statistical methods, but these prove weak since
they assume the mined data follows linear models. Multivariate analysis, which is popularly
used, fails to satisfy the need for both quantitative and qualitative analysis of the data.
AI techniques, on the other hand, focus on the problem of feature selection, which is based
on machine learning and aims to find the optimal number of features to describe the target
concept. This does not work for the current problem either, so well-known standard
techniques cannot be applied to choose the appropriate features.
Hence the approach proposed by the authors is to use both simulated breeding and inductive
learning techniques. Inductive learning is used to generate decision rules from the data,
emphasizing the relationship between product and features, while simulated breeding is used
to obtain the effective features. This work was the first of its kind to specifically
address the problem of clarifying the relationship between the product image and its
features using user questionnaire data.
Simulated breeding is a GA-based technique for evolving offspring. Offspring judged by a
human expert to have desired features are allowed to breed; the judgment is done
interactively. It is used in cases where the fitness function is hard to define. Inductive
learning is used to generate rules, in the form of a decision tree, as output for the
analysis of feature and attribute-value pairs. This specific implementation used C4.5.
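The interactive loop at the heart of simulated breeding can be sketched as follows. This is only an illustration of the idea, not Terano and Ishino's implementation: `human_rank` stands in for the expert's interactive judgment, and all names and parameters are our own:

```python
import random

def simulated_breeding(population, human_rank, crossover, mutate,
                       n_parents=2, generations=3):
    """Interactive GA sketch: a human expert, represented here by
    `human_rank`, replaces the numeric fitness function."""
    for _ in range(generations):
        # The expert interactively ranks individuals; the top ones breed.
        parents = human_rank(population)[:n_parents]
        children = []
        while len(children) < len(population):
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = children
    return population

# Toy usage: individuals are bit lists, the 'expert' prefers more 1-bits.
rank = lambda pop: sorted(pop, key=sum, reverse=True)
cross = lambda a, b: a[: len(a) // 2] + b[len(b) // 2 :]
result = simulated_breeding([[1, 1], [1, 0], [0, 1], [0, 0]],
                            rank, cross, mutate=lambda x: x)
```

In a real session `human_rank` would display the candidate rules to the expert and collect her choices, which is exactly what makes the approach suitable when fitness cannot be written down.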
Marketing decisions must be made by analysts who design promotion strategies for their
product according to an abstract image of that product. They need to keep in mind that the
data gathered from users is inherently noisy and based on complicated models, hence simple
rules are needed to explain the characteristics of the products. Moreover, the features of
the product that realize the image are left to the intuition of the experts, and there is
no clear way to derive them. So the information needs to be organized in a clear manner to
understand the relationship between the features and the image of the product.
This analysis was carried out on oral care products, with 2,300 users filling in the
questionnaire; the knowledge obtained was tested by a domain expert at the manufacturing
company. The domain expert must know the basic principles of inductive learning and
statistics, and must understand the outputs obtained. Using the output decision trees, she
interactively evaluates the quality of the obtained knowledge.
4.3 Fraud Detection Using Data Mining and IEA Techniques
An interesting application of Genetic Programming and rule based data mining can be seen
in the work done by Bentley (2000) where a system is designed to analyze data provided by
a bank and discover the cases of fraud; in this particular case, for insurance applications.
The motivation for this work stems from a pressing issue: the increase in fraud in all
forms of financial institutions. For a large bank this is typically hard to handle, as the
fraudulent cases are masked by the much larger number of genuine applicants and hence slip
through undetected. An effective method is needed to find such cases in huge amounts of
data, which is where data mining comes in. Evolutionary computation techniques are used to
generate rules that might be the underlying explanations for fraud cases.
In this experiment, the data was first clustered into 3 segments, which correspond to the
domains of the rule-generation membership functions. These functions give the “degree of
membership” of the input data in the fuzzy sets “LOW”, “MEDIUM” and “HIGH”. Genetic
programming is used to evolve rules, with each rule represented as a tree. After a set of
rules has been generated, they are evaluated by an expert system and assigned a score
before being applied to the training data. This is the data for which the bank has
certified the fraud cases, and it is used to generate rules that accurately describe them.
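The fuzzification step can be illustrated with a small sketch. Bentley's actual membership functions are derived from the clustering, so the triangular shapes and cluster centres below are assumptions for illustration only:

```python
def fuzzy_memberships(x, low, med, high):
    """Degree of membership of x in LOW/MEDIUM/HIGH fuzzy sets, using
    triangular functions centred on three cluster centres (an assumed
    shape, not Bentley's exact construction)."""
    def tri(x, left, centre, right):
        if x <= left or x >= right:
            return 0.0
        if x <= centre:
            return (x - left) / (centre - left)
        return (right - x) / (right - centre)
    return {
        "LOW": 1.0 if x <= low else tri(x, 2 * low - med, low, med),
        "MEDIUM": tri(x, low, med, high),
        "HIGH": 1.0 if x >= high else tri(x, med, high, 2 * high - med),
    }
```

A value sitting exactly on the middle cluster centre belongs fully to MEDIUM and not at all to LOW or HIGH; values in between belong partially to two sets, which is what lets the evolved rules tolerate the noise in real application data.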
The fitness function checks the scores and assigns fitness values, with the key objectives
of minimizing the number of misclassified items, differentiating between the “suspicious”
and “unknown” classes (giving “suspicious” more weight), and ensuring that the generated
rule is concise yet understandable. A single run of GP generates one rule, which might not
classify all suspicious items, so the GP is run several times until all suspicious items
are classified, yielding more than one rule. Any rule that misclassifies a number of claims
is removed from the final set.
Now comes the role of human interaction in the process. Because the evolutionary system has
many variables, each of which can affect the outcome, there is no single choice of settings
that will classify every data set correctly: cluster sizes, membership functions, rule
interpreters, fitness functions, GA settings and so on can all be tweaked. Therefore, to
help the human decision maker, four versions of the system with different settings are run
in parallel and all the generated results are presented. The human's task is to select the
best results from the four by performing a series of tasks to find the most accurate, the
most intelligible, and the most accurate-and-intelligible rule sets. The chosen rule set is
then evaluated on the global data and, if need be, the parameter settings are adjusted by
the human to generate better rules.
With these settings and data obtained from a bank, the research team achieved an accuracy
of up to 60%, which is impressive given that the training data containing reported
incidents of fraud was small and spread over a number of years, while the data to be tested
covered only the past couple of months and the percentage of suspicious items was unknown.
4.4 Some current work being done
This is a brief glimpse of research going on at the TAKAGI lab, Kyushu University that
involves interactive evolutionary computation and data mining from different sources.
One project constructs image or music feature spaces and impression spaces. Neural networks
are used to learn the mapping from features to impressions and to search for a point in
feature space given a point in impression space. The aim is to retrieve images or music
based on human impressions.
In this paper we have discussed the use of different evolutionary algorithms in the data
mining and knowledge discovery field. The main motive for using evolutionary algorithms in
data mining is their attractive features, which resolve some of the drawbacks of
conventional data mining techniques and enable them to discover novel solutions: their
robustness when dealing with noisy data, and their ability to interpret data without any
a priori knowledge. The difficulty of discovering novel and interesting knowledge is one of
the main issues in data mining, and interactive evolutionary algorithms have been used to
address this problem. Interactive evolutionary algorithms provide a promising research area
for data mining and knowledge discovery, and there is a wide range of applications that use
evolutionary algorithms in data mining, a number of which have been presented in this
paper.
In this section we propose some applications where data mining combined with IEA methods
could be fruitfully implemented, and we hope this will become evident in the near future.
For the analysis of stock market data at any given point, the number of variables to be
taken into account can be enormous, and each of these variables in turn can have many
different settings. An expert needs to decide which subset of these variables to consider
before making any assumptions. One example of a variable is the selection of trading rules,
the constraints or choices that define when to buy, sell or hold a stock; several such
rules are available, like filter rules, moving averages, support and resistance, abnormal
return, etc. Another variable, more evident in the share market, is the effect of other
stocks on the stock being considered, including how many other stocks to take into account.
Hence it is almost never true that at a given time one strategy fits all situations. We
propose that future implementations will let humans shape the fitness function by selecting
the best rules for a given situation, specifying the number and types of variables to be
chosen at any given time, and then testing their effectiveness by deploying them.
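One of the trading rules mentioned above can be sketched as a candidate building block for such a human-shaped fitness function. This is a toy moving-average rule; the window sizes are illustrative parameters an expert might tune, not a recommended strategy:

```python
def moving_average_signal(prices, short=5, long=20):
    """Toy moving-average rule: 'buy' when the short-term average is
    above the long-term one, 'sell' when below, 'hold' otherwise or
    when there is too little price history."""
    if len(prices) < long:
        return "hold"
    short_ma = sum(prices[-short:]) / short   # recent trend
    long_ma = sum(prices[-long:]) / long      # baseline trend
    if short_ma > long_ma:
        return "buy"
    if short_ma < long_ma:
        return "sell"
    return "hold"
```

In the envisioned IEA setting, the human would pick which such rules enter the candidate pool and with which parameter ranges, while the evolutionary search combines and tunes them.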
Intrusion Detection in Networks or Websites
6 References
Zaki, M. J., Yu, J. X., Ravindran, B., & Pudi, V. (Eds.). (2010). Advances in Knowledge
Discovery and Data Mining, Part I, Proceedings of the 14th Pacific-Asia Conference
(PAKDD 2010). Springer.
Vafaie, H., & De Jong, K. (1994). Improving a rule induction system using genetic
algorithms. In R. S. Michalski & G. Tecuci (Eds.), Machine Learning: A Multistrategy
Approach (pp. 453-470). San Francisco, CA: Morgan Kaufmann.
Bäck, T., Hammel, U., & Schwefel, H.-P. (1997). Evolutionary computation:
Comments on the history and current state. IEEE Transactions on Evolutionary
Computation, 1(1), 3-17.
Cantú-Paz, E., & Kamath, C. (2001). On the use of evolutionary algorithms in data
mining. In H. A. Abbass, R. A. Sarker, & C. Sincla (Eds.), Data Mining: A Heuristic
Approach (pp. 48-71). Idea Group Inc.
Dzeroski, S., & Lavrac, N. (2001). Relational Data Mining. Secaucus, NJ: Springer.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in
Knowledge Discovery and Data Mining. Menlo Park, Calif.: The MIT Press.
Holland, J. (1962). Outline for a logical theory of adaptive systems. Journal of
the ACM, 9(3), 297-314.
Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco: Morgan
Kaufmann.
Larose, D. T. (2006). Data Mining: Methods and Models. New York: Wiley
Interscience, Inc.
Liu, B., Hsu, W., & Chen, S. (1997). Using general impressions to analyze discovered
classification rules. Knowledge Discovery & Data Mining, (pp. 31-3).
Li, W. (2004). Using Genetic Algorithm for network intrusion detection. United States
Department of Energy Cyber Security Group 2004 Training Conference, (pp. 24-27).
Kansas City, Kansas.
Noda, E., Freitas, A. A., & Lopes, H. S. (1999). Discovering interesting prediction rules
with a genetic algorithm. Congress on Evolutionary Computation 1999 (CEC-99),
(pp. 1322-1329). Washington D.C.
Mendes, R. R., Voznika, F. d., Freitas, A. A., & Nievola, J. C. (2001). Discovering fuzzy
classification rules with genetic programming and co-evolution. In L. De Raedt & A.
Siebes (Eds.), Principles of Data Mining and Knowledge Discovery (Vol. 2168, pp. 314-
325). Heidelberg, Berlin: Springer-Verlag.
Pazzani, M. J. (2002). Knowledge discovery from data? Intelligent Systems and their
Applications, IEEE , 15 (2), 10-12.
Sakurai, S., Ichimura, Y., Suyama, A., & Orihara, R. (2001). Acquisition of a knowledge
dictionary for a text mining system using an inductive learning method. IJCAI 2001
Workshop on Text Learning: Beyond Supervision, (pp. 45–52).
Tan, K. C., Yu, Q., & Lee, T. H. (2005). A distributed evolutionary classifier for
knowledge discovery in data mining. IEEE Trans. on Systems, Man, and
Cybernetics: Part C - Applications and Reviews, 35(2), 131-142.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Association analysis: Basic concepts and
algorithms. In Introduction to Data Mining. Pearson Addison Wesley.
Terano, T., & Ishino, Y. (1996). Knowledge acquisition from questionnaire data using
simulated breeding and inductive learning methods. Expert Systems with
Applications , 11 (4), 507-518.