
CHAPTER 1

INTRODUCTION
1.1 DOMAIN OVERVIEW
Knowledge and Data Engineering
A knowledge engineer integrates knowledge into computer systems in order to
solve complex problems normally requiring a high level of human expertise.
Often, knowledge engineers are employed to translate the information elicited
from domain experts into terms that cannot be easily communicated by the
highly specialized domain expert (ESDG 2000).
Knowledge engineers interpret and organize information on how to make systems
decisions (Aylett & Doniat 2002).
The term "knowledge engineer" first appeared in the 1980s in the first wave of
commercialization of AI the purpose of the job is to work with a client who
wants an expert system created for them or their business.
Validation and verification
Knowledge engineers are involved with validation and verification.
Validation is the process of ensuring that something is correct or conforms to a
certain standard. A knowledge engineer is required to carry out data collection and
data entry, but they must use validation in order to ensure that the data they collect,
and then enter into their systems, fall within the accepted boundaries of the
application collecting the data.

It is important that a knowledge engineer incorporates validation procedures into
their systems within the program code. After the knowledge-based system is
constructed, it can be maintained by the domain expert (Bultman, Kuipers & van
Harmelen 2000).
Database Systems and Knowledge-base Systems share many common
principles. Data & Knowledge Engineering (DKE) stimulates the exchange of
ideas and interaction between these two related fields of interest. DKE reaches a
world-wide audience of researchers, designers, managers and users. The major aim
of the journal is to identify, investigate and analyze the underlying principles in the
design and effective use of these systems. DKE achieves this aim by publishing
original research results, technical advances and news items concerning data
engineering, knowledge engineering, and the interface of these two fields.
DKE covers the following topics:
1. Representation and Manipulation of Data & Knowledge: Conceptual data
models. Knowledge representation techniques. Data/knowledge manipulation
languages and techniques.
2. Architectures of database, expert, or knowledge-based systems: New
architectures for database / knowledge base / expert systems, design and
implementation techniques, languages...
3. Construction of data/knowledge bases: Data/knowledge base design
methodologies and tools, data/knowledge acquisition methods,
integrity/security/maintenance issues.
4. Applications, case studies, and management issues: Data administration issues,
knowledge engineering practice, office and engineering applications.

5. Tools for specifying and developing Data and Knowledge Bases using tools
based on Linguistics or Human Machine Interface principles.
6. Communication aspects involved in implementing, designing and using KBSs in
Cyberspace.
1.2 TECHNIQUES IN DATA MINING
The Classics
These two sections have been broken up based on when the data mining technique
was developed and when it became technically mature enough to be used for
business, especially for aiding in the optimization of customer relationship
management systems. Thus this section contains descriptions of techniques that
have classically been used for decades; the next section covers techniques that
have only been widely used since the early 1980s.
This section should help the user to understand the rough differences between the
techniques and provide at least enough information to be well armed and not be
baffled by the vendors of different data mining tools.
The main techniques that we will discuss here are the ones that are used 99.9% of
the time on existing business problems. There are certainly many other ones as
well as proprietary techniques from particular vendors - but in general the industry
is converging to those techniques that work consistently and are understandable
and explainable.
Statistics
By strict definition "statistics" or statistical techniques are not data mining. They
were being used long before the term data mining was coined to apply to business
applications. However, statistical techniques are driven by the data and are used to

discover patterns and build predictive models. And from the user's perspective you
will be faced with a conscious choice when solving a "data mining" problem as to
whether you wish to attack it with statistical methods or other data mining
techniques. For this reason it is important to have some idea of how statistical
techniques work and how they can be applied.
1.3 FEATURE SELECTION (DATA MINING)
Feature selection is a term commonly used in data mining to describe the
tools and techniques available for reducing inputs to a manageable size for
processing and analysis. Feature selection implies not only cardinality reduction,
which means imposing an arbitrary or predefined cutoff on the number of attributes
that can be considered when building a model, but also the choice of attributes,
meaning that either the analyst or the modeling tool actively selects or discards
attributes based on their usefulness for analysis.
The ability to apply feature selection is critical for effective analysis,
because datasets frequently contain far more information than is needed to build
the model. For example, a dataset might contain 500 columns that describe the
characteristics of customers, but if the data in some of the columns is very sparse
you would gain very little benefit from adding them to the model. If you keep the
unneeded columns while building the model, more CPU and memory are required
during the training process, and more storage space is required for the completed
model.
Even if resources are not an issue, you typically want to remove unneeded
columns because they might degrade the quality of discovered patterns, for the
following reasons:

Some columns are noisy or redundant, and this noise makes it more difficult to
discover meaningful patterns in the data. Moreover, to discover quality patterns,
most data mining algorithms require a much larger training data set on a
high-dimensional data set, but in some data mining applications the training data
is very small.
If only 50 of the 500 columns in the data source have information that is useful in
building a model, you could just leave them out of the model, or you could use
feature selection techniques to automatically discover the best features and to
exclude values that are statistically insignificant. Feature selection helps solve the
twin problems of having too much data that is of little value, or having too little
data that is of high value.
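To make this concrete, the short sketch below (not part of the project itself) filters a synthetic 500-column table down to its 50 most informative columns with a mutual-information filter from scikit-learn; the data, column counts, and choice of scoring function are assumptions for illustration only.

```python
# A minimal sketch of filter-style feature selection, assuming scikit-learn is
# available. Only the k columns most informative about the label are kept,
# mirroring the "500 columns, 50 useful" scenario described above.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n_rows, n_cols, n_useful = 1000, 500, 50

# Synthetic stand-in for a customer table: only the first 50 columns
# actually carry signal about the class label.
X_signal = rng.normal(size=(n_rows, n_useful))
y = (X_signal.sum(axis=1) > 0).astype(int)
X_noise = rng.normal(size=(n_rows, n_cols - n_useful))
X = np.hstack([X_signal, X_noise])

# Rank columns by mutual information with the label and keep the best 50.
selector = SelectKBest(score_func=mutual_info_classif, k=n_useful)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)                      # (1000, 500) -> (1000, 50)
print("columns kept:", selector.get_support(indices=True)[:10], "...")
```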
1.4 KNOWLEDGE-BASED SYSTEMS
In earlier days, research in Artificial Intelligence (AI) was focused on the
development of formalisms, inference mechanisms and tools to operationalize
Knowledge-based Systems (KBS). Typically, the development efforts were
restricted to the realization of small KBSs in order to study the feasibility of the
different approaches. Though these studies offered rather promising results, the
transfer of this technology into commercial use in order to build large KBSs failed
in many cases. The situation was directly comparable to a similar situation in the
construction of traditional software systems, called software crisis in the late
sixties: the means to develop small academic prototypes did not scale up to the
design and maintenance of large, long living commercial systems. In the same way
as the software crisis resulted in the establishment of the discipline Software
Engineering the unsatisfactory situation in constructing KBSs made clear the need

for more methodological approaches. So the goal of the new discipline Knowledge
Engineering (KE) is similar to that of Software Engineering: turning the process of
constructing KBSs from an art into an engineering discipline. This requires the
analysis of the building and maintenance process itself and the development of
appropriate methods, languages, and tools specialized for developing KBSs.
Subsequently, we will first give an overview of some important historical
developments in KE; special emphasis will be put on the paradigm shift from the
so-called transfer approach to the so-called modeling approach. This paradigm
shift is sometimes also considered as the transition from first generation expert
systems to second generation expert systems. This discussion will be concluded by
describing two prominent developments of the late eighties: Role-limiting Methods
and Generic Tasks. We will then present some modeling frameworks which have
been developed in recent years: CommonKADS, MIKE, and PROTÉGÉ-II, and give
a short overview of specification languages for KBSs.


Problem-solving methods have been a major research topic in KE for the last
decade. Basic characteristics of (libraries of) problem-solving methods are
described. Ontologies, which have gained a lot of importance during recent years,
are also discussed. The paper concludes with a discussion of current developments
in KE and their relationships to other disciplines. In KE much effort has also been
put into developing methods and supporting tools for knowledge elicitation. For
example, in the VITAL approach a collection of elicitation tools, such as repertory
grids, is offered to support the elicitation of domain knowledge. However, a
discussion of the various elicitation methods is beyond the scope of this paper.
Historical Roots
Basic Notions

In this section we will first discuss some main principles which characterize the
development of KE from the very beginning.
Knowledge Engineering as a Transfer Process
This transfer and transformation of problem-solving expertise from a
knowledge source to a program is the heart of the expert-system development
process. In the early eighties the development of a KBS was seen as a
transfer process of human knowledge into an implemented knowledge base. This
transfer was based on the assumption that the knowledge which is required by the
KBS already exists and just has to be collected and implemented. Most often, the
required knowledge was obtained by interviewing experts on how they solve
specific tasks.
Typically, this knowledge was implemented in some kind of production rules
which were executed by an associated rule interpreter. However, a careful analysis of
the various rule knowledge bases showed that the rather simple representation
formalism of production rules did not support an adequate representation of
different types of knowledge: e.g. in the MYCIN knowledge base strategic
knowledge about the order in which goals should be achieved (e.g. consider
common causes of a disease first) is mixed up with domain specific knowledge
about for example causes for a specific disease. This mixture of knowledge types,
together with the lack of adequate justifications of the different rules, makes the
maintenance of such knowledge bases very difficult and time consuming.
Therefore, this transfer approach was only feasible for the development of small
prototypical systems, but it failed to produce large, reliable and maintainable
knowledge bases.
Furthermore, it was recognized that the assumption of the transfer approach, that is
that knowledge acquisition is the collection of already existing knowledge
elements, was wrong due to the important role of tacit knowledge for an expert's
problem-solving capabilities. These deficiencies resulted in a paradigm shift from

the transfer approach to the modeling approach.
Knowledge Engineering as a Modeling Process
Nowadays there exists an overall consensus that the process of
building a KBS may be seen as a modeling activity. Building a KBS means
building a computer model with the aim of realizing problem-solving capabilities
comparable to a domain expert. It is not intended to create a cognitively adequate
model, i.e. to simulate the cognitive processes of an expert in general, but to create
a model which offers similar results in problem-solving for problems in the area of
concern.
While the expert may consciously articulate some parts of his or her knowledge, he
or she will not be aware of a significant part of this knowledge since it is hidden in
his or her skills. This knowledge is not directly accessible, but has to be built up
and structured during the knowledge acquisition phase. Therefore this knowledge
acquisition process is no longer seen as a transfer of knowledge into an appropriate
computer representation, but as a model construction process. This modeling view
of the building process of a KBS has the following consequences:
Like every model, such a model is only an approximation of the reality. In
principle, the modeling process is infinite, because it is an incessant activity with
the aim of approximating the intended behaviour. The modeling process is a
cyclic process. New observations may lead to a refinement, modification, or
completion of the already built-up model. On the other hand, the model may guide
the further acquisition of knowledge.
The modeling process is dependent on the subjective interpretations of the
knowledge engineer. Therefore this process is typically faulty and an evaluation of
the model with respect to reality is indispensable for the creation of an adequate

model. According to this feedback loop, the model must therefore be revisable in
every stage of the modeling process.

Problem-Solving Methods


Clancey reported on the analysis of a set of first generation expert systems
developed to solve different tasks. Though they were realized using different
representation formalisms (e.g. production rules, frames, LISP), he discovered a
common problem solving behaviour. Clancey was able to abstract this common
behaviour to a generic inference pattern called Heuristic Classification, which
describes the problem-solving behaviour of these systems on an abstract level, the
so-called Knowledge Level. This knowledge level allows reasoning to be described
in terms of the goals to be achieved, the actions necessary to achieve these goals,
and the knowledge needed to perform these actions. A knowledge-level description
of a problem-solving process abstracts from details concerned with the implementation
of the reasoning process and results in the notion of a Problem-Solving Method
(PSM).
A PSM may be characterized as follows:
A PSM specifies which inference actions have to be carried out for solving a
given task.
A PSM determines the sequence in which these actions have to be activated.
In addition, so-called knowledge roles determine which role the domain
knowledge plays in each inference action.

These knowledge roles define a domain independent generic terminology. When


considering the PSM Heuristic Classification in some more detail (Figure 1) we
can identify the three basic inference actions abstract, heuristic match, and refine.
Furthermore, four knowledge roles are defined: observables, abstract observables,
solution abstractions, and solutions. It is important to see that such a description of
a PSM is given in a generic way. When considering a medical domain, an
observable like 41° C may be abstracted to high temperature by the inference
action abstract. This abstracted observable may be matched to a solution
abstraction, e.g. infection, and finally the solution abstraction may be
hierarchically refined to a solution, e.g. the disease influenza.
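To illustrate how these generic roles play out, the toy sketch below walks the medical example through the three inference actions; the lookup tables, function names, and the final refinement step are invented for illustration and are not taken from any particular KBS.

```python
# A toy rendering of the Heuristic Classification PSM, with hand-written lookup
# tables standing in for real domain knowledge. The knowledge roles
# (observables, abstract observables, solution abstractions, solutions) appear
# as the inputs and outputs of the three inference actions.

ABSTRACTIONS = {          # observable -> abstract observable
    "temperature=41C": "high temperature",
}
HEURISTIC_MATCHES = {     # abstract observable -> solution abstraction
    "high temperature": "infection",
}
REFINEMENTS = {           # solution abstraction -> candidate solutions
    "infection": ["influenza", "bacterial infection"],
}

def abstract(observable):
    return ABSTRACTIONS[observable]

def heuristic_match(abstract_observable):
    return HEURISTIC_MATCHES[abstract_observable]

def refine(solution_abstraction):
    # A real system would use further observables to choose among the
    # candidates; this sketch simply returns the first one.
    return REFINEMENTS[solution_abstraction][0]

if __name__ == "__main__":
    observable = "temperature=41C"
    solution = refine(heuristic_match(abstract(observable)))
    print(observable, "->", solution)   # temperature=41C -> influenza
```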
In the meantime various PSMs have been identified, such as Cover-and-Differentiate
for solving diagnostic tasks or Propose-and-Revise for parametric design tasks.

Fig. 1 The Problem-Solving Method Heuristic Classification

PSMs may be exploited in the knowledge engineering process in different ways:


PSMs contain inference actions which need specific knowledge in order to
perform their task. For instance, Heuristic Classification needs a hierarchically
structured model of observables and solutions for the inference actions abstract and
refine, respectively. So a PSM may be used as a guideline to acquire static domain
knowledge.
A PSM allows one to describe the main rationale of the reasoning process of a KBS,
which supports the validation of the KBS because the expert is able to understand
the problem-solving process. In addition, this abstract description may be used
during the problem-solving process itself for explanation facilities.
1.5 CLUSTER ANALYSIS
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense or
another) to each other than to those in other groups (clusters). It is a main task of
exploratory data mining, and a common technique for statistical data analysis, used
in many fields, including machine learning, pattern recognition, image analysis,
information retrieval, bioinformatics, data compression, and computer graphics.
Cluster analysis itself is not one specific algorithm, but the general task to be
solved. It can be achieved by various algorithms that differ significantly in their
notion of what constitutes a cluster and how to efficiently find them. Popular

notions of clusters include groups with small distances among the cluster members,
dense areas of the data space, intervals or particular statistical distributions.
Clustering can therefore be formulated as a multi-objective optimization problem.
The appropriate clustering algorithm and parameter settings (including values such
as the distance function to use, a density threshold or the number of expected
clusters) depend on the individual data set and intended use of the results. Cluster
analysis as such is not an automatic task, but an iterative process of knowledge
discovery or interactive multi-objective optimization that involves trial and failure.
It is often necessary to modify data pre-processing and model parameters until the
result achieves the desired properties.
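As a small illustration of this iterative, parameter-driven character, the sketch below (an assumption-laden example, not tied to any system discussed here) clusters a toy 2-D data set with k-means for several values of k and compares the results with a silhouette score.

```python
# A minimal clustering sketch, assuming scikit-learn. It illustrates that the
# parameter choice (here the number of clusters k) is part of the analysis and
# is usually compared across several candidate settings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated 2-D blobs as a stand-in data set.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

# Try several values of k and compare them with a simple internal measure.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```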

CHAPTER 2
OBJECTIVE
In this project, the clustering-based subset selection algorithm works in two steps. In
the first step, features are divided into clusters by using graph-theoretic clustering
methods. In the second step, the most representative feature that is strongly related
to target classes is selected from each cluster to form a subset of features.
2.1 EXISTING SYSTEM
Data stream clustering is typically done as a two-stage process with an online part
which summarizes the data into many micro-clusters or grid cells and then, in an
offline process, these micro-clusters (cells) are re-clustered/merged into a smaller
number of final clusters. Since the re-clustering is an offline process and thus not
time critical, it is typically not discussed in detail in papers about new data stream
clustering algorithms. Most papers suggest using a (sometimes slightly modified)
existing conventional clustering algorithm (e.g., weighted k-means in CluStream)
where the micro-clusters are used as pseudo points. Another approach, used in
DenStream, is to use reachability, where all micro-clusters which are less than a given
distance from each other are linked together to form clusters. Grid-based
algorithms typically merge adjacent dense grid cells to form larger clusters (see,
e.g., the original version of D-Stream and MR-Stream).
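The reachability idea can be sketched roughly as follows: micro-cluster centres closer than a chosen threshold are linked, and connected components of the resulting graph become the final clusters. The centres, the threshold, and the plain Euclidean distance below are illustrative assumptions, not the DenStream implementation.

```python
# A sketch of reachability-style offline re-clustering, assuming the online
# phase has already produced micro-cluster centres. Centres closer than `eps`
# are linked, and each connected component becomes one final cluster.
import numpy as np

def recluster_by_reachability(centres, eps):
    """Return a final-cluster label for every micro-cluster centre."""
    n = len(centres)
    # Pairwise Euclidean distances between micro-cluster centres.
    diff = centres[:, None, :] - centres[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adjacent = dist <= eps

    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first traversal over the reachability graph.
        stack = [start]
        labels[start] = current
        while stack:
            node = stack.pop()
            for neighbour in np.flatnonzero(adjacent[node]):
                if labels[neighbour] == -1:
                    labels[neighbour] = current
                    stack.append(neighbour)
        current += 1
    return labels

if __name__ == "__main__":
    centres = np.array([[0.0, 0.0], [0.4, 0.1], [5.0, 5.0], [5.3, 4.8]])
    print(recluster_by_reachability(centres, eps=1.0))   # [0 0 1 1]
```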
2.2 Disadvantages:

1. The number of clusters varies over time for some of the datasets. This needs
to be considered when comparing to CluStream, which uses a fixed number
of clusters.
2. This reduces the speed and accuracy of learning algorithms.
3. Some existing systems remove irrelevant features but do not remove redundant features.
2.3 PROPOSED SYSTEM
In the proposed system, we develop and evaluate a new method to address this
problem for micro-cluster-based algorithms. We introduce the concept of a shared
density graph, which explicitly captures the density of the original data between
micro-clusters during clustering, and then show how the graph can be used for
re-clustering micro-clusters.
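One possible reading of this idea is sketched below: while points arrive, count how many fall within the radius of two micro-clusters at once, and later merge pairs whose shared count is high enough. The radius, threshold, toy stream, and union-find merging are illustrative assumptions rather than the project's actual implementation.

```python
# A hedged sketch of the shared-density idea (not the project's own code):
# while points stream in, count how many fall inside the radius of two
# micro-clusters at once; pairs with a high shared count are merged during
# offline re-clustering via a small union-find.
from collections import defaultdict
from itertools import combinations
import numpy as np

def shared_density_reclustering(points, centres, radius=1.0, min_shared=3):
    shared = defaultdict(int)                     # (i, j) -> shared point count
    for p in points:
        dists = np.linalg.norm(centres - p, axis=1)
        inside = np.flatnonzero(dists <= radius)  # micro-clusters covering p
        for i, j in combinations(inside, 2):
            shared[(i, j)] += 1

    # Merge micro-clusters connected by enough shared density (union-find).
    parent = list(range(len(centres)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(centres))]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    centres = np.array([[0.0, 0.0], [1.2, 0.0], [6.0, 6.0]])
    stream = np.vstack([rng.normal(loc=[0.6, 0.0], scale=0.3, size=(30, 2)),
                        rng.normal(loc=[6.0, 6.0], scale=0.3, size=(30, 2))])
    # The first two micro-clusters share many points and end up merged.
    print(shared_density_reclustering(stream, centres))
```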


In this project, the proposed clustering-based subset selection algorithm uses a
minimum spanning tree (MST) based method to cluster features. Moreover, our
proposed algorithm is not limited to specific types of data.
Irrelevant features, along with redundant features, severely affect the accuracy of
the learning machines. Thus, feature subset selection should be able to identify and
remove as much of the irrelevant and redundant information as possible. Moreover,
good feature subsets contain features highly correlated with (predictive of) the
class, yet uncorrelated with (not predictive of) each other.
Our proposed cluster-based subset selection algorithm involves 1) the
construction of a minimum spanning tree from a weighted complete graph; 2) the
partitioning of the MST into a forest, with each tree representing a cluster; and 3)
the selection of representative features from the clusters.
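A rough sketch of these three steps is given below. It uses absolute Pearson correlation as the edge weight and a fixed cut threshold, where a full implementation would typically use an information-theoretic measure such as symmetric uncertainty; the data, threshold, and helper names are assumptions for illustration.

```python
# A rough sketch of the three steps above, assuming SciPy is available. It uses
# absolute Pearson correlation for the edge weights and a fixed cut threshold;
# a full implementation would typically use an information-theoretic measure
# such as symmetric uncertainty instead.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_feature_clusters(X, y, cut_threshold=0.5):
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature similarity
    dist = 1.0 - corr                             # turn similarity into a distance
    np.fill_diagonal(dist, 0.0)

    # Step 1: minimum spanning tree of the complete feature graph.
    mst = minimum_spanning_tree(dist).toarray()

    # Step 2: cut long MST edges to split the tree into a forest of clusters.
    mst[mst > cut_threshold] = 0.0
    n_clusters, labels = connected_components(mst, directed=False)

    # Step 3: from each cluster keep the feature most correlated with the target.
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        selected.append(int(members[np.argmax(relevance[members])]))
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 3))
    # Six features: three informative ones plus three noisy near-copies.
    X = np.hstack([base, base + rng.normal(scale=0.05, size=base.shape)])
    y = (base.sum(axis=1) > 0).astype(int)
    print("selected features:", mst_feature_clusters(X, y))
```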
2.4 Advantages:

1. This is an important advantage since it implies that we can tune the online
component to produce fewer micro-clusters for shared-density re-clustering.
2. It improves performance and, in many cases, the saved memory more than
offsets the memory requirement for the shared density graph.

CHAPTER 3
LITERATURE SURVEY
3.1 OVERVIEW:
A literature review is an account of what has been published on a topic by
accredited scholars and researchers. Occasionally you will be asked to write one as
a separate assignment, but more often it is part of the introduction to an essay,

research report, or thesis. In writing the literature review, your purpose is to convey
to your reader what knowledge and ideas have been established on a topic, and
what their strengths and weaknesses are. As a piece of writing, the literature review
must be defined by a guiding concept (e.g., your research objective, the problem or
issue you are discussing, or your argumentative thesis). It is not just a descriptive
list of the material available, or a set of summaries.
Besides enlarging your knowledge about the topic, writing a literature
review lets you gain and demonstrate skills in two areas:
1. INFORMATION SEEKING: the ability to scan the literature efficiently,
using manual or computerized methods, to identify a set of useful articles
and books
2. CRITICAL APPRAISAL: the ability to apply principles of analysis to
identify unbiased and valid studies.

3.2. Clustering Performance on Evolving Data Streams: Assessing Algorithms
and Evaluation Measures within MOA
Author - Philipp Kranen ; Hardy Kremer ; Timm Jansen ; Thomas Seidl
In today's applications, evolving data streams are ubiquitous. Stream clustering
algorithms were introduced to gain useful knowledge from these streams in real-time. The quality of the obtained clusterings, i.e. how well they reflect the data,

can be assessed by evaluation measures. A multitude of stream clustering


algorithms and evaluation measures for clusterings were introduced in the
literature; however, until now there has been no general tool for a direct comparison of the
different algorithms or the evaluation measures. In our demo, we present a novel
experimental framework for both tasks. It offers the means for extensive evaluation
and visualization and is an extension of the Massive Online Analysis (MOA)
software environment released under the GNU GPL License.
3.3. Organizing multimedia big data using semantic based
video content extraction technique
Author - Manju ; P. Valarmathie
With the proliferation of the internet, video has become the principal source. Video
big data introduce many hi-tech challenges, which include storage space,
broadcast, compression, analysis, and identification. The increase in multimedia
resources has brought an urgent need to develop intelligent methods to process and
organize them. The combination of multimedia resources and the Semantic Link
Network provides a new prospect for organizing them with their semantics. The
tags and surrounding texts of multimedia resources are used to measure their
association relation. Two evaluation methods, namely clustering and retrieval, are
used to measure the semantic relatedness between images accurately
and robustly. This method is effective on the image searching task. The semantic gap
between semantics and video visual appearance is still a challenge. A model for
generating the association between video resources using Semantic Link Network
model is proposed. The user can select the attributes or concepts as the search
query. This is done by providing the knowledge conduction during information
extraction and by applying fuzzy reasoning. The first action line is related to the

establishment of techniques for the dynamic management of video analysis based


on the knowledge gathered in the semantic network. This helps the decisions taken
during the analysis process. Based on a set of rules it is able to handle the fuzziness
of the annotations provided by the analysis modules gathered in the semantic
network.
3.4. Evaluation Methodology for Multiclass Novelty Detection Algorithms
Author - Elaine R. Faria ; Isabel J. C. R. Goncalves ; Joao Gama
Novelty detection is a useful ability for learning systems, especially in data stream
scenarios, where new concepts can appear, known concepts can disappear and
concepts can evolve over time. There are several studies in the literature
investigating the use of machine learning classification techniques for novelty
detection in data streams. However, there is no consensus regarding how to
evaluate the performance of these techniques, particularly for multiclass problems. In
this study, we propose a new evaluation approach for multiclass data streams
novelty detection problems. This approach is able to deal with: i) multiclass
problems, ii) confusion matrix with a column representing the unknown examples,
iii) confusion matrix that increases over time, iv) unsupervised learning, that
generates novelties without an association with the problem classes and v)
representation of the evaluation measures over time. We evaluate the performance
of the proposed approach using known novelty detection algorithms with artificial and
real data sets.
3.5. Joint Image-Text News Topic Detection and Tracking by Multimodal
Topic And-Or Graph
Author - Weixin Li ; Jungseock Joo ; Hang Qi ; Song-Chun Zhu

This article presents a novel method for automatically detecting and tracking news
topics from multimodal TV news data. We propose a Multimodal Topic And-Or
Graph (MT-AOG) to jointly represent textual and visual elements of news stories
and their latent topic structures. An MT-AOG leverages a context sensitive
grammar that can describe the hierarchical composition of news topics by semantic
elements about people involved, related places and what happened, and model
contextual relationships between elements in the hierarchy. We detect news topics
through a cluster sampling process which groups stories about closely related
events together. Swendsen-Wang Cuts (SWC), an effective cluster sampling
algorithm, is adopted for traversing the solution space and obtaining optimal
clustering solutions by maximizing a Bayesian posterior probability. The detected
topics are then continuously tracked and updated with incoming news streams. We
generate topic trajectories to show how topics emerge, evolve and disappear over
time. The experimental results show that our method can explicitly describe the
textual and visual data in news videos and produce meaningful topic trajectories.
Our method also outperforms previous methods on the task of document clustering
on the Reuters-21578 dataset and our novel dataset, the UCLA Broadcast News Dataset.

3.6. Performance evaluation of distance measures for preprocessing of set-valued data in feature vector generated from LOD datasets
Author - Rajesh Mahule ; Akshenndra Garg
The linked open data cloud has evolved as a huge repository of data with data from
various domains. A lot of work has been done in generating these datasets and
enhancing the LOD cloud, whereas little work has been done on the consumption
of the available data from the LOD. There are several types of applications that
have been developed using the data from the LOD cloud; of which, one of the
areas that has attracted the researchers and developers most is the use of these data
for machine learning and knowledge discovery. Using the available, state of the art
knowledge discovery and machine learning algorithms requires conversion of the
heterogeneous interlinked RDF graph datasets, available in LOD cloud, to a feature
vector. This conversion is performed with the subject set as instances, the
predicate set as attributes, and the object set as attribute values in a feature vector.
However, choosing the most suitable of the available distance measures is a
problem that needs to be addressed. This paper provides a
performance study to select the most suitable distance measure that can be used in
pre-processing by building the feature vector with the different distance measures
for set-valued data attributes and applying transformation with Fastmap. The
evaluation of the distance measures is done using clustering of the transformed
feature vector table with pre-identified class labels and getting micro-precision
values for the clustering results. Performing the experimental analysis with LMDB
data, it has been found that the Hausdorff and RIBL distance measures are the most
suitable distance measures that can be used to pre-process the created feature
vector with set-valued data from the linked open data cloud.
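For reference, the Hausdorff distance between two set-valued attribute values can be sketched as below; the numeric example sets and the absolute-difference base metric are assumptions chosen only to make the computation concrete.

```python
# A small sketch of the Hausdorff distance between two set-valued attribute
# values, assuming numeric elements and absolute difference as the base metric.
def hausdorff(a, b):
    """Classic Hausdorff distance between two finite, non-empty sets."""
    d_ab = max(min(abs(x - y) for y in b) for x in a)   # how far a strays from b
    d_ba = max(min(abs(x - y) for y in a) for x in b)   # how far b strays from a
    return max(d_ab, d_ba)

if __name__ == "__main__":
    values_a = {1994, 1999, 2003}   # a set-valued attribute from one resource
    values_b = {1995, 2004}         # and from another
    print(hausdorff(values_a, values_b))   # 4
```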
3.7. Analyzing Enterprise Storage Workloads With Graph
Modeling and Clustering
Author - Yang Zhou ; Ling Liu ; Sangeetha Seshadri
Utilizing graph analysis models and algorithms to exploit complex interactions
over a network of entities is emerging as an attractive network analytic technology.
In this paper, we show that traditional column or row-based trace analysis may not

be effective in deriving deep insights hidden in the storage traces collected over
complex storage applications, such as complex spatial and temporal patterns,
hotspots and their movement patterns. We propose a novel graph analytics
framework, GraphLens, for mining and analyzing real world storage traces with
three unique features. First, we model storage traces as heterogeneous trace graphs
in order to capture multiple complex and heterogeneous factors, such as diverse
spatial/temporal access information and their relationships, into a unified analytic
framework. Second, we employ and develop an innovative graph clustering
method that employs two levels of clustering abstractions on storage trace analysis.
We discover interesting spatial access patterns and identify important temporal
correlations among spatial access patterns. This enables us to better characterize
important hotspots and understand hotspot movement patterns. Third, at each level
of abstraction, we design a unified weighted similarity measure through an
iterative dynamic weight learning algorithm. With an optimal weight assignment
scheme, we can efficiently combine the correlation information for each type of
storage access patterns, such as random versus sequential, read versus write, to
identify interesting spatial/temporal correlations hidden in the traces.

ABSTRACT

Data streams are massive, fast-changing, and infinite. Clustering is a prominent
task in mining data streams, which groups similar objects into clusters. With the aim
of choosing a re-cluster subset of good features with respect to the target
concepts, feature subset selection is an effective way for reducing dimensionality,
removing irrelevant data, increasing learning accuracy, and improving result
comprehensibility. While the efficiency concerns the time required to find a
re-cluster subset of features, the effectiveness is related to the quality of the subset of
features. In this project, the proposed clustering-based subset selection algorithm
works in two steps. In the first step, features are divided into clusters by using
graph-theoretic clustering methods. In the second step, the most representative
feature that is strongly related to target classes is selected from each cluster to form
a subset of features. To ensure the efficiency of this algorithm, we are going to use
the mRMR method with a heuristic algorithm. A heuristic algorithm is used for
solving a problem more quickly or for finding an approximate re-cluster subset
selection solution. Minimum Redundancy Maximum Relevance (mRMR) selection
has been shown to be more powerful than maximum relevance selection. It will
provide an effective way to predict the efficiency and effectiveness of the
clustering-based subset selection algorithm.
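A greedy mRMR-style ranking can be sketched as follows, using absolute Pearson correlation as a stand-in for the mutual-information terms; the synthetic data, scoring choice, and number of selected features are illustrative assumptions, not the project's final implementation.

```python
# A hedged sketch of greedy minimum-Redundancy-Maximum-Relevance (mRMR)
# ranking, using absolute Pearson correlation in place of mutual information.
import numpy as np

def mrmr_rank(X, y, k):
    n_features = X.shape[1]
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    redundancy = np.abs(np.corrcoef(X, rowvar=False))

    selected = [int(np.argmax(relevance))]   # start with the most relevant feature
    while len(selected) < k:
        remaining = [j for j in range(n_features) if j not in selected]
        # Relevance minus average redundancy against what is already chosen.
        scores = [relevance[j] - redundancy[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    informative = rng.normal(size=(300, 2))
    X = np.hstack([informative,
                   informative[:, :1] + rng.normal(scale=0.01, size=(300, 1)),  # redundant copy
                   rng.normal(size=(300, 3))])                                  # pure noise
    y = (informative.sum(axis=1) > 0).astype(int)
    # The greedy order favours the two informative columns before the copy/noise.
    print("mRMR order:", mrmr_rank(X, y, k=3))
```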
