INTRODUCTION
1.1 DOMAIN OVERVIEW
Knowledge and Data Engineering
A knowledge engineer integrates knowledge into computer systems in order to
solve complex problems normally requiring a high level of human expertise.
Often, knowledge engineers are employed to translate the information elicited
from domain experts into terms which cannot be easily communicated by the
highly technicalized domain expert (ESDG 2000).
Knowledge engineers interpret and organize information on how to make systems
decisions (Aylett & Doniat 2002).
The term "knowledge engineer" first appeared in the 1980s, during the first wave
of commercialization of AI. The purpose of the job is to work with a client who
wants an expert system created for them or their business.
Validation and verification
Knowledge engineers are involved with validation and verification.
Validation is the process of ensuring that something is correct or conforms to a
certain standard. A knowledge engineer is required to carry out data collection and
data entry, but they must use validation in order to ensure that the data they collect,
and then enter into their systems, fall within the accepted boundaries of the
application collecting the data.
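A range check of this kind can be sketched in a few lines of Python; the field names and accepted bounds below are hypothetical, not drawn from any particular application:

```python
# Minimal sketch of validating collected data against accepted boundaries
# before entry into a system. Field names and bounds are illustrative only.
ACCEPTED_BOUNDS = {"age": (0, 120), "heart_rate": (20, 250)}

def validate_record(record):
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for field, (low, high) in ACCEPTED_BOUNDS.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not (low <= value <= high):
            errors.append(f"{field}: {value} outside [{low}, {high}]")
    return errors

print(validate_record({"age": 34, "heart_rate": 72}))   # []
print(validate_record({"age": 150, "heart_rate": 72}))  # ['age: 150 outside [0, 120]']
```

Records failing such checks would be rejected or flagged for review rather than entered into the knowledge base.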
Systems share many common tools, data/knowledge acquisition methods, and
integrity/security/maintenance issues.
4. Applications, case studies, and management issues: Data administration issues,
knowledge engineering practice, office and engineering applications.
5. Tools for specifying and developing Data and Knowledge Bases using tools
based on Linguistics or Human Machine Interface principles.
6. Communication aspects involved in implementing, designing and using KBSs in
Cyberspace.
1.2 TECHNIQUES IN DATA MINING
The Classics
These two sections have been broken up based on when the data mining technique
was developed and when it became technically mature enough to be used for
business, especially for aiding in the optimization of customer relationship
management systems. Thus this section contains descriptions of techniques that
have classically been used for decades, while the next section describes
techniques that have only been widely used since the early 1980s.
This section should help the user to understand the rough differences between
the techniques, and it provides at least enough information to be dangerous:
well armed enough not to be baffled by the vendors of different data mining
tools.
The main techniques that we will discuss here are the ones that are used 99.9% of
the time on existing business problems. There are certainly many other ones as
well as proprietary techniques from particular vendors - but in general the industry
is converging to those techniques that work consistently and are understandable
and explainable.
Statistics
By strict definition "statistics" or statistical techniques are not data mining. They
were being used long before the term data mining was coined to apply to business
applications. However, statistical techniques are driven by the data and are used to
discover patterns and build predictive models. From the user's perspective you
will be faced with a conscious choice when solving a "data mining" problem as to
whether you wish to attack it with statistical methods or other data mining
techniques. For this reason it is important to have some idea of how statistical
techniques work and how they can be applied.
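As a concrete illustration of a classical statistical technique used to build a predictive model, simple linear regression can be fitted by ordinary least squares in a few lines. The data below are synthetic, purely for illustration:

```python
# Simple linear regression fitted by ordinary least squares (closed form),
# a classical statistical technique for building a predictive model.
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared prediction error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend vs. sales, in arbitrary units.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [2.1, 3.9, 6.1, 7.9]
m, b = fit_line(spend, sales)
print(round(m, 2), round(b, 2))  # 1.96 0.1
```

The fitted line can then be used to predict sales for an unseen spend value, which is the same predictive-modeling use case other data mining techniques address.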
1.3 FEATURE SELECTION (DATA MINING)
Feature selection is a term commonly used in data mining to describe the
tools and techniques available for reducing inputs to a manageable size for
processing and analysis. Feature selection implies not only cardinality reduction,
which means imposing an arbitrary or predefined cutoff on the number of attributes
that can be considered when building a model, but also the choice of attributes,
meaning that either the analyst or the modeling tool actively selects or discards
attributes based on their usefulness for analysis.
The ability to apply feature selection is critical for effective analysis,
because datasets frequently contain far more information than is needed to build
the model. For example, a dataset might contain 500 columns that describe the
characteristics of customers, but if the data in some of the columns is very sparse
you would gain very little benefit from adding them to the model. If you keep the
unneeded columns while building the model, more CPU and memory are required
during the training process, and more storage space is required for the completed
model.
Even if resources are not an issue, you typically want to remove unneeded
columns because they might degrade the quality of discovered patterns, for the
following reasons:
1. Some columns are noisy or redundant. This noise makes it more difficult to
discover meaningful patterns from the data;
2. To discover quality patterns, most data mining algorithms require a much
larger training data set on a high-dimensional data set, but the training data
is very small in some data mining applications.
If only 50 of the 500 columns in the data source have information that is useful in
building a model, you could just leave them out of the model, or you could use
feature selection techniques to automatically discover the best features and to
exclude values that are statistically insignificant. Feature selection helps solve the
twin problems of having too much data that is of little value, or having too little
data that is of high value.
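A minimal sketch of these ideas follows. The column data, the density threshold, and the choice of correlation with the target as the usefulness score are all illustrative assumptions, not a prescribed method:

```python
# Sketch of feature selection: drop columns that are too sparse, then keep
# the attributes most correlated with the target (toy data, illustrative only).
import math

def pearson(xs, ys):
    """Pearson correlation between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(columns, target, min_density=0.5, top_k=2):
    """columns: {name: list of values, None = missing}; returns kept names."""
    scored = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) / len(values) < min_density:
            continue  # too sparse to contribute to the model
        filled = [v if v is not None else 0.0 for v in values]
        scored.append((abs(pearson(filled, target)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

cols = {"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0],
        "c": [None, None, None, 1.0], "d": [1.0, 1.0, 2.0, 1.0]}
print(select_features(cols, [1.0, 2.0, 3.0, 4.0]))
```

Here the sparse column is discarded before scoring, mirroring the cardinality-reduction step, while the correlation ranking mirrors the choice of attributes by usefulness.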
1.4 Knowledge-based Systems
In earlier days research in Artificial Intelligence (AI) was focused on the
development of formalisms, inference mechanisms and tools to operationalize
Knowledge-based Systems (KBS). Typically, the development efforts were
restricted to the realization of small KBSs in order to study the feasibility of the
different approaches. Though these studies offered rather promising results, the
transfer of this technology into commercial use in order to build large KBSs failed
in many cases. The situation was directly comparable to a similar situation in the
construction of traditional software systems, called software crisis in the late
sixties: the means to develop small academic prototypes did not scale up to the
design and maintenance of large, long living commercial systems. In the same way
as the software crisis resulted in the establishment of the discipline Software
Engineering, the unsatisfactory situation in constructing KBSs made clear the need
for more methodological approaches. So the goal of the new discipline Knowledge
Engineering (KE) is similar to that of Software Engineering: turning the process of
constructing KBSs from an art into an engineering discipline. This requires the
analysis of the building and maintenance process itself and the development of
appropriate methods, languages, and tools specialized for developing KBSs.
Subsequently, we will first give an overview of some important historical
developments in KE; special emphasis will be put on the paradigm shift from the
so-called transfer approach to the so-called modeling approach. This paradigm
shift is sometimes also considered as the transition from first generation expert
systems to second generation expert systems. This discussion will be concluded
by describing two prominent developments in the late eighties: Role-limiting
Methods and Generic Tasks. Finally, we will present some modeling frameworks
which have been developed in recent years: Common
In this section we will first discuss some main principles which characterize the
development of KE from the very beginning.
Knowledge Engineering as a Transfer Process
This transfer and transformation of problem-solving expertise from a
knowledge source to a program is the heart of the expert-system development
process. In the early eighties the development of a KBS was seen as a transfer
process of human knowledge into an implemented knowledge base. This
transfer was based on the assumption that the knowledge which is required by the
KBS already exists and just has to be collected and implemented. Most often, the
required knowledge was obtained by interviewing experts on how they solve
specific tasks.
Typically, the acquired knowledge was implemented as production rules, which
were executed by an associated rule interpreter. However, a careful analysis of
the various rule knowledge bases showed that the rather simple representation
formalism of production rules did not support an adequate representation of
different types of knowledge: e.g. in the MYCIN knowledge base strategic
knowledge about the order in which goals should be achieved (e.g. consider
common causes of a disease first) is mixed up with domain specific knowledge
about for example causes for a specific disease. This mixture of knowledge types,
together with the lack of adequate justifications of the different rules, makes the
maintenance of such knowledge bases very difficult and time consuming.
Therefore, this transfer approach was only feasible for the development of small
prototypical systems, but it failed to produce large, reliable and maintainable
knowledge bases.
Furthermore, it was recognized that the assumption of the transfer approach,
namely that knowledge acquisition is the collection of already existing knowledge
elements, was wrong due to the important role of tacit knowledge in an expert's
problem-solving capabilities. These deficiencies resulted in a paradigm shift from
the transfer approach to the modeling approach, in which building a KBS is seen
as a modeling activity with a feedback loop: the model must therefore be revisable
in every stage of the modeling process.
Cluster analysis is used in many fields, including machine learning, pattern
recognition and image analysis. Notions of clusters include groups with small
distances among the cluster members, dense areas of the data space, intervals or
particular statistical distributions.
Clustering can therefore be formulated as a multi-objective optimization problem.
The appropriate clustering algorithm and parameter settings (including values such
as the distance function to use, a density threshold or the number of expected
clusters) depend on the individual data set and intended use of the results. Cluster
analysis as such is not an automatic task, but an iterative process of knowledge
discovery or interactive multi-objective optimization that involves trial and failure.
It is often necessary to modify data pre-processing and model parameters until the
result achieves the desired properties.
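The iterative, trial-and-error nature of this process can be illustrated with a plain k-means run over several candidate values of one key parameter, the number of expected clusters k, comparing the within-cluster squared error each time. The 1-D data below are toy values:

```python
# Plain k-means on 1-D data, run for several k to illustrate the iterative
# search for suitable parameter settings in cluster analysis.
import random

def kmeans(points, k, iters=20, seed=0):
    """Return (centers, within-cluster sum of squared errors)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]      # two obvious groups
for k in (1, 2, 3):
    _, sse = kmeans(data, k)
    print(k, round(sse, 3))
```

The error drops sharply once k matches the true structure, which is exactly the kind of feedback an analyst inspects before settling on parameters.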
OBJECTIVE
In this project, the clustering-based subset selection algorithm works in two
steps. In the first step, features are divided into clusters by using
graph-theoretic clustering methods. In the second step, the most representative
feature that is strongly related to target classes is selected from each cluster to
form a subset of features.
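A toy sketch of this two-step idea follows. The feature similarity and target-relevance values below are hypothetical; a real implementation would compute them from data (e.g. from correlation measures), and here the graph-theoretic step is simplified to connected components of a thresholded feature graph:

```python
# Two-step subset selection sketch: (1) cluster features via connected
# components of a feature-similarity graph; (2) pick the feature most
# relevant to the target class from each cluster. Values are hypothetical.
def components(nodes, edges):
    """Connected components of an undirected graph via traversal."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def select_subset(features, similarity, relevance, threshold=0.7):
    """Return one representative feature per cluster, sorted by name."""
    edges = [(a, b) for a in features for b in features
             if a < b and similarity.get((a, b), 0.0) > threshold]
    return sorted(max(comp, key=lambda f: relevance[f])
                  for comp in components(features, edges))

feats = ["f1", "f2", "f3", "f4"]
sim = {("f1", "f2"): 0.9, ("f1", "f3"): 0.1, ("f1", "f4"): 0.2,
       ("f2", "f3"): 0.2, ("f2", "f4"): 0.1, ("f3", "f4"): 0.8}
rel = {"f1": 0.6, "f2": 0.4, "f3": 0.3, "f4": 0.7}
print(select_subset(feats, sim, rel))  # ['f1', 'f4']
```

Highly similar features end up in the same cluster, so keeping only one representative per cluster removes redundant features while the relevance score keeps the subset related to the target classes.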
2.1 EXISTING SYSTEM
Data stream clustering is typically done as a two-stage process with an online part
which summarizes the data into many micro-clusters or grid cells and then, in an
offline process, these micro-clusters (cells) are re-clustered/merged into a smaller
number of final clusters. Since the re-clustering is an offline process and thus not
time critical, it is typically not discussed in detail in papers about new data stream
clustering algorithms. Most papers suggest using an (sometimes slightly modified)
existing conventional clustering algorithm (e.g., weighted k-means in CluStream)
where the micro-clusters are used as pseudo points. Another approach, used in
DenStream, is to use reachability, where all micro-clusters which are less than a
given distance from each other are linked together to form clusters. Grid-based
algorithms typically merge adjacent dense grid cells to form larger clusters (see,
e.g., the original version of D-Stream and MR-Stream).
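One way to picture the CluStream-style offline step is weighted k-means over micro-cluster centers treated as pseudo points. The sketch below uses 1-D toy values, the weights stand in for the number of points each micro-cluster absorbed, and the deterministic initialization is a simplification for the demo:

```python
# Offline re-clustering sketch: micro-cluster centers act as weighted pseudo
# points and are merged into k final clusters by weighted k-means (1-D toy data).
def weighted_kmeans(points, weights, k, iters=25):
    """Return k final cluster centers from weighted pseudo points."""
    centers = list(points[:k])               # deterministic init for the demo
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            groups[nearest].append((p, w))
        new_centers = []
        for i, g in enumerate(groups):
            if g:
                total = sum(w for _, w in g)
                new_centers.append(sum(p * w for p, w in g) / total)
            else:
                new_centers.append(centers[i])   # keep empty cluster in place
        centers = new_centers
    return centers

micro_centers = [0.9, 1.0, 1.2, 4.8, 5.0, 5.3]   # online-phase summaries
micro_weights = [10, 30, 20, 25, 40, 15]          # points absorbed by each
print([round(c, 2) for c in weighted_kmeans(micro_centers, micro_weights, k=2)])
# [1.05, 4.99]
```

Because only the (few) micro-clusters are re-clustered rather than the full stream, this offline step stays cheap even for unbounded data.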
2.2 Disadvantages:
1. The number of clusters varies over time for some of the datasets. This needs
to be considered when comparing to CluStream, which uses a fixed number
of clusters.
2. This reduces the speed and accuracy of learning algorithms.
3. Some existing systems do not remove redundant features.
2.3 PROPOSED SYSTEM
The proposed system develops and evaluates a new method to address this problem
for micro-cluster-based algorithms. We introduce the concept of a shared density
graph, which explicitly captures the density of the original data between
micro-clusters during clustering, and then show how the graph can be used for
re-clustering.
1. This is an important advantage since it implies that we can tune the online
component to produce fewer micro-clusters for shared-density re-clustering.
2. It improves performance and, in many cases, the saved memory more than
offsets the memory requirement for the shared density graph.
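The re-clustering idea can be sketched as follows: micro-clusters are merged whenever the shared density between them reaches a threshold, and the connected groups become the final clusters. The density values below are hypothetical and this is an illustrative reading of the approach, not the exact published algorithm:

```python
# Sketch of re-clustering with a shared density graph: merge micro-clusters
# whose shared density (density of original points observed between them)
# meets a threshold, then take connected groups as final clusters.
def recluster(n_micro, shared_density, threshold):
    """shared_density: {(i, j): density of points shared between i and j}."""
    adj = {i: set() for i in range(n_micro)}
    for (i, j), d in shared_density.items():
        if d >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    seen, clusters = set(), []
    for i in range(n_micro):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            comp.append(c)
            stack.extend(adj[c])
        clusters.append(sorted(comp))
    return clusters

graph = {(0, 1): 0.9, (1, 2): 0.7, (2, 3): 0.1, (3, 4): 0.8}
print(recluster(5, graph, threshold=0.5))  # [[0, 1, 2], [3, 4]]
```

Micro-clusters 0-2 are chained together through strong shared density even though 0 and 2 share no edge directly, which is what lets the method recover non-spherical clusters from many small micro-clusters.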
CHAPTER 3
LITERATURE SURVEY
3.1 OVERVIEW:
A literature review is an account of what has been published on a topic by
accredited scholars and researchers. Occasionally you will be asked to write one as
a separate assignment, but more often it is part of the introduction to an essay,
research report, or thesis. In writing the literature review, your purpose is to convey
to your reader what knowledge and ideas have been established on a topic, and
what their strengths and weaknesses are. As a piece of writing, the literature review
must be defined by a guiding concept (e.g., your research objective, the problem or
issue you are discussing or your argumentative thesis). It is not just a descriptive
list of the material available, or a set of summaries.
Besides enlarging your knowledge about the topic, writing a literature
review lets you gain and demonstrate skills in two areas:
1. INFORMATION SEEKING: the ability to scan the literature efficiently,
using manual or computerized methods, to identify a set of useful articles
and books
2. CRITICAL APPRAISAL: the ability to apply principles of analysis to
identify unbiased and valid studies.
This article presents a novel method for automatically detecting and tracking news
topics from multimodal TV news data. We propose a Multimodal Topic And-Or
Graph (MT-AOG) to jointly represent textual and visual elements of news stories
and their latent topic structures. An MT-AOG leverages a context sensitive
grammar that can describe the hierarchical composition of news topics by semantic
elements about people involved, related places and what happened, and model
contextual relationships between elements in the hierarchy. We detect news topics
through a cluster sampling process which groups stories about closely related
events together. Swendsen-Wang Cuts (SWC), an effective cluster sampling
algorithm, is adopted for traversing the solution space and obtaining optimal
clustering solutions by maximizing a Bayesian posterior probability. The detected
topics are then continuously tracked and updated with incoming news streams. We
generate topic trajectories to show how topics emerge, evolve and disappear over
time. The experimental results show that our method can explicitly describe the
textual and visual data in news videos and produce meaningful topic trajectories.
Our method also outperforms previous methods for the task of document clustering
on Reuters-21578 dataset and our novel dataset, UCLA Broadcast News Dataset.
3.6. Performance evaluation of distance measures for preprocessing of set-valued data in feature vector generated from LOD datasets
Author - Rajesh Mahule ; Akshenndra Garg
The linked open data cloud has evolved as a huge repository of data with data from
various domains. A lot of work has been done in generating these datasets and
enhancing the LOD cloud, whereas little work has been done on the consumption
of the available data from the LOD. There are several types of applications that
have been developed using the data from the LOD cloud; of which, one of the
areas that has attracted the researchers and developers most is the use of these data
for machine learning and knowledge discovery. Using the available, state of the art
knowledge discovery and machine learning algorithms requires conversion of the
heterogeneous interlinked RDF graph datasets, available in LOD cloud, to a feature
vector. This conversion is performed with the subject set as instances, the
predicate set as attributes, and the object set as attribute values in a feature
vector. However, choosing the most suitable of the different distance measures
available is a problem that needs to be addressed. This paper provides a
performance study to select the most suitable distance measure that can be used in
pre-processing by building the feature vector with the different distance measures
for set-valued data attributes and applying transformation with Fastmap. The
evaluation of the distance measures is done using clustering of the transformed
feature vector table with pre-identified class labels and getting micro-precision
values for the clustering results. Performing the experimental analysis with LMDB
data it has been found that the Hausdorff and RIBL distance measures are the most
suitable distance measures that can be used to pre-process the created feature
vector with set-valued data from the linked open data cloud.
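As an illustration of one of the measures the paper found suitable, the Hausdorff distance between two value sets takes, for each element, the distance to the nearest element of the other set, and then the worst case over both directions. The numeric sets below are toy values, not taken from the LMDB data:

```python
# Hausdorff distance between two sets of numbers: the largest distance from
# any element of one set to its nearest neighbor in the other set.
def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty numeric sets."""
    d_ab = max(min(abs(x - y) for y in b) for x in a)
    d_ba = max(min(abs(x - y) for y in a) for x in b)
    return max(d_ab, d_ba)

print(hausdorff({1.0, 2.0}, {2.0, 6.0}))  # 4.0
```

For set-valued attributes in a feature vector, such a measure gives a single number per attribute pair that downstream steps like Fastmap transformation and clustering can consume.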
3.7. Analyzing Enterprise Storage Workloads With Graph
Modeling and Clustering
Author - Yang Zhou ; Ling Liu ; Sangeetha Seshadri
Utilizing graph analysis models and algorithms to exploit complex interactions
over a network of entities is emerging as an attractive network analytic technology.
In this paper, we show that traditional column or row-based trace analysis may not
be effective in deriving deep insights hidden in the storage traces collected over
complex storage applications, such as complex spatial and temporal patterns,
hotspots and their movement patterns. We propose a novel graph analytics
framework, GraphLens, for mining and analyzing real world storage traces with
three unique features. First, we model storage traces as heterogeneous trace graphs
in order to capture multiple complex and heterogeneous factors, such as diverse
spatial/temporal access information and their relationships, into a unified analytic
framework. Second, we employ and develop an innovative graph clustering
method that employs two levels of clustering abstractions on storage trace analysis.
We discover interesting spatial access patterns and identify important temporal
correlations among spatial access patterns. This enables us to better characterize
important hotspots and understand hotspot movement patterns. Third, at each level
of abstraction, we design a unified weighted similarity measure through an
iterative dynamic weight learning algorithm. With an optimal weight assignment
scheme, we can efficiently combine the correlation information for each type of
storage access patterns, such as random versus sequential, read versus write, to
identify interesting spatial/temporal correlations hidden in the traces.
ABSTRACT