the key pieces of information used to manage and direct the business for the most
profitable outcome. In other words, a data warehouse is the data (meta/fact/dimension/aggregation) and the process managers (load/warehouse/query) that make information available, enabling people to make informed decisions.
DEFINITION:
SUBJECT-ORIENTED:
Data are organized around the major subjects of the enterprise rather than around applications. For example, an insurance company using a data warehouse would organize its data by customer, premium, and claim instead of by product (auto, life, etc.). Data organized by subject contain only the information necessary for decision-support processing.
NON-VOLATILE:
Data in the warehouse are stored separately from the application data found in the operational environment. The data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed.
TIME-VARYING:
The data warehouse contains a place for storing data that are 5 to 10 years old, or older, to be used for comparisons, trends, and forecasting.
Email: chinna_chetan05@yahoo.com
Visit: www.geocities.com/chinna_chetan05/forfriends.html
INTEGRATED:
A data warehouse is constructed by integrating multiple heterogeneous sources such as relational databases, flat files, and OLTP files. Data cleaning and data integration techniques are applied to ensure consistency.
DATA WAREHOUSE ARCHITECTURE:
Based on the logical data model of the data warehouse, we shall see the popular 3-tier architecture. Tier 1 is the warehouse server, Tier 2 is the OLAP engine for analytical processing, and Tier 3 is a client containing reporting tools, visualization tools, data mining tools, query tools, etc. There is also the backend process, which is concerned with extracting data from multiple operational databases and external sources; with cleaning, transforming, and integrating data for loading into the data warehouse server; and, of course, with periodically refreshing the warehouse. Tier 1 contains the main data warehouse. It can follow one of three models, or some combination of these: it can be a single enterprise warehouse, it may contain several departmental data marts, or it can be a virtual warehouse. Tier 2 can follow three different ways of designing the OLAP engine: relational OLAP (ROLAP), multidimensional OLAP (MOLAP), or a hybrid of the two.
WAREHOUSE SERVER:
ENTERPRISE WAREHOUSE:
This model collects all the information about the subjects, spanning the entire organization.
DATA MARTS:
Data marts are partitions of the overall data warehouse. A data mart is a subset of a data warehouse built specifically for a department. Different marts may also contain some overlapping data. The physical data marts together serve as the conceptual data warehouse. These marts must provide the easiest possible access to the information required by their departments.
STAND-ALONE MART:
Here the management of the data sources by the enterprise database is required.
VIRTUAL WAREHOUSE:
In a virtual warehouse, we have a logical description of all the databases and their
structures, and individuals who want to get information from those databases do not
have to know anything about them. This approach creates a single “virtual database”
from all the data sources. The data sources can be local or remote. In this type of data
warehouse, the data is not moved from the sources; instead, the user gets direct access to
the data.
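A minimal sketch of this idea, using SQLite as a stand-in for the independent sources (the database and table names are invented for illustration): a single connection attaches the source databases and queries them as one “virtual database,” without moving any data.

```python
import sqlite3

# Create two independent "source" databases (stand-ins for remote sources).
for name, rows in [("sales.db", [("north", 100), ("south", 80)]),
                   ("returns.db", [("north", 5)])]:
    con = sqlite3.connect(name)
    table = name.split(".")[0]
    con.execute(f"CREATE TABLE IF NOT EXISTS {table} (region TEXT, qty INTEGER)")
    con.execute(f"DELETE FROM {table}")
    con.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    con.commit()
    con.close()

# The "virtual warehouse": one logical connection over both sources;
# the data stays where it is, and queries run directly against it.
virtual = sqlite3.connect(":memory:")
virtual.execute("ATTACH DATABASE 'sales.db' AS sales_src")
virtual.execute("ATTACH DATABASE 'returns.db' AS returns_src")

row = virtual.execute("""
    SELECT s.region, s.qty - IFNULL(r.qty, 0) AS net_qty
    FROM sales_src.sales s
    LEFT JOIN returns_src.returns r ON s.region = r.region
    ORDER BY s.region
""").fetchall()
print(row)
```

Note that every such query hits the source tables directly, which is exactly why a virtual warehouse competes with production transactions for resources.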
A virtual database is easy and fast to set up, but it is not without problems. Since its queries must compete with the production transactions, its performance can be considerably degraded. Since there is no metadata, no summary data, and no history, all queries must access the detailed operational data directly.
METADATA:
Metadata provides a catalogue of the data in the data warehouse and the pointers to this data. Among other things, it covers data extraction and transformation information; summarization, aggregation, etc.; and changing policies.
TYPES OF METADATA:
Due to the variety of metadata, it is necessary to categorize it into different types based on how it is generated and used.
1. BUILD-TIME METADATA:
Whenever we design and build a warehouse, the metadata that we generate can be
termed build-time metadata. This metadata links business and warehouse terminology and
describes the data’s technical structure. It is the primary source of most of the other kinds of metadata used in the warehouse.
2. USAGE METADATA:
When the warehouse is in production, usage metadata, which is derived from
build-time metadata, is an important tool for users and data administrators. This
metadata is used differently from build-time metadata, and its structure must be designed accordingly.
3. CONTROL METADATA:
Most control metadata is of interest only to the tools themselves, but one subset, which is generated and used by the tools that populate the warehouse, is of
considerable interest to users and data warehouse administrators. It provides vital
information about the timeliness of warehouse data and helps users track the sequence and timing of warehouse events.
PROCESS MANAGERS:
The data warehouse process managers are the pieces of software responsible for the flow,
maintenance, and upkeep of the data, both into and out of the data warehouse
database. The three process managers are:
LOAD MANAGER
WAREHOUSE MANAGER
QUERY MANAGER
LOAD MANAGER:
The load manager is responsible for any data transformation required and for the loading of data into the data warehouse. Its tasks include:
Data transformation
Data load
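As a rough illustration of those two tasks (the CSV layout, table name, and column names below are invented for this sketch), a load manager's transform-then-load cycle might look like:

```python
import csv
import io
import sqlite3

# Hypothetical operational extract: raw CSV with inconsistent formatting.
raw = io.StringIO("cust_id,amount,date\n 001 ,12.50,2024-01-03\n002, 7.00 ,2024-01-04\n")

def transform(row):
    # Data transformation: trim whitespace and cast types before loading.
    return (row["cust_id"].strip(), float(row["amount"]), row["date"].strip())

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (cust_id TEXT, amount REAL, sale_date TEXT)")

# Data load: bulk-insert the transformed rows into the warehouse table.
rows = [transform(r) for r in csv.DictReader(raw)]
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
warehouse.commit()

totals = warehouse.execute("SELECT COUNT(*), SUM(amount) FROM fact_sales").fetchone()
print(totals)
```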
WAREHOUSE MANAGER:
The warehouse manager is responsible for maintaining the data while it is in the data warehouse. Its tasks include:
Data movement
Metadata management
Data archiving
QUERY MANAGER:
The query manager is responsible for managing and directing user queries. Its tasks include:
Query scheduling
Query monitoring
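A toy sketch of both tasks, assuming a priority-queue scheduler and a simple timing log (the table and queries are invented for illustration):

```python
import heapq
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (k INTEGER)")
con.executemany("INSERT INTO fact VALUES (?)", [(i,) for i in range(1000)])

# Query scheduling: a priority queue; a lower number runs first.
schedule = []
heapq.heappush(schedule, (2, "SELECT COUNT(*) FROM fact"))
heapq.heappush(schedule, (1, "SELECT SUM(k) FROM fact"))

# Query monitoring: record the result and elapsed time of each query.
log = []
while schedule:
    priority, sql = heapq.heappop(schedule)
    start = time.perf_counter()
    result = con.execute(sql).fetchone()[0]
    log.append((sql, result, time.perf_counter() - start))

for sql, result, elapsed in log:
    print(f"{sql} -> {result} ({elapsed:.6f}s)")
```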
SECURITY:
Security can affect many different parts of the data warehouse, such as:
User access
Data load
Data movement
Query generation
Security also has a cost. Any security that is implemented will cost in terms of either resources or performance.
VIEWS:
Views are a standard RDBMS mechanism for applying restrictions to data access.
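For example, a view can expose only non-sensitive columns and a subset of rows; users are then granted access to the view rather than the base table. A small SQLite sketch with an invented customer table (SQLite has no user accounts or GRANT statement, so that last step is only noted in a comment):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT, region TEXT, ssn TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                [(1, "Ann", "north", "111"), (2, "Bob", "south", "222")])

# The view hides the sensitive ssn column and restricts rows to one region.
# In an RDBMS with accounts, one would GRANT SELECT on the view only.
con.execute("""
    CREATE VIEW north_customers AS
    SELECT id, name, region FROM customer WHERE region = 'north'
""")

visible = con.execute("SELECT * FROM north_customers").fetchall()
print(visible)
```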
DATA MOVEMENT:
Because of the volumes of data being handled in the data warehouse, data movement needs to be controlled carefully.
There are a number of different ways in which bulk data movements can occur:
1. data loads
2. aggregation creation
3. data extracts.
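Aggregation creation, for instance, is a bulk movement of data from a detailed fact table into a summary table that later queries can hit instead. A minimal SQLite sketch with invented table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [("north", 10.0), ("north", 5.0), ("south", 7.5)])

# Aggregation creation: one bulk movement from the detailed fact table
# into a summary table, so later queries avoid rescanning the detail.
con.execute("""
    CREATE TABLE agg_sales_by_region AS
    SELECT region, SUM(amount) AS total, COUNT(*) AS n
    FROM fact_sales GROUP BY region
""")

summary = sorted(con.execute("SELECT * FROM agg_sales_by_region").fetchall())
print(summary)
```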
AUDITING:
Clearly any auditing that has to be performed will have a CPU impact, because
each audited action will require some code to be run. Auditing also requires disk
space.
Backup is one of the most important regular operations carried out on any system.
BACKUP STRATEGIES:
There is a major interaction between the backup strategy and the database design; the
two go hand in hand. Data warehouses are such large and complex systems that backup
should be an integral part of the system. We need to design the whole data warehouse with backup in mind.
DESIGN STRATEGIES:
Read-only tables are one of the main weapons in the battle to reduce the amount of data that has to be backed up regularly.
Another way of reducing the regular backup requirements is to reduce the amount of
journaling or redo generated. This is possible with some RDBMSs, because they allow certain operations, such as bulk loads, to run with logging reduced or disabled.
RECOVERY STRATEGIES:
The recovery strategy will be built around the backup strategy. Any recovery situation must be planned and tested in advance.
Whatever software we choose, the recovery steps for the failure scenarios below need
to be fully documented:
Instance failure
Media failure
And others
Creation of aggregations
DISASTER RECOVERY:
Backup of database
DATA WAREHOUSE APPLICATIONS:
1. SALES ANALYSIS:
Sales can be analyzed by many factors.
2. FINANCIAL ANALYSIS:
3. OTHER AREAS
DATA MINING:
Data mining is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data. Data mining attempts to search out
patterns and trends in the data and infers rules from these patterns.
DEFINITIONS:
The term ‘data mining’ refers to the finding of relevant and useful information from databases.
“Data mining is the search for the relationships and global patterns that exist in
large databases but are hidden among vast amounts of data, such as the
relationship between patient data and their medical diagnosis. This relationship
represents valuable knowledge about the database, and the objects in the
database, if the database is a faithful mirror of the real world registered by the
database.”
“Data mining is the process of discovering meaningful new correlations, patterns
and trends by sifting through large amounts of data stored in repositories, using
pattern recognition technologies as well as statistical and mathematical
techniques.”
The goals of data mining fall into two classes:
Prediction.
Description.
Prediction involves using some variables in the data to predict unknown or future values of other variables. Description focuses on finding patterns describing the data and the subsequent presentation of these patterns for user interpretation.
VERIFICATION MODEL:
In this process of data mining, the user makes a hypothesis and tests the hypothesis
on the data to verify its validity. The emphasis is on the user who is responsible for
formulating the hypothesis and issuing the query on the data to affirm or negate the
hypothesis.
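A tiny sketch of the verification model (the hypothesis and the claims data below are invented): the user supplies the hypothesis, and the query merely affirms or negates it.

```python
# The user's hypothesis: "average claim amount is higher for urban customers
# than for rural ones." The user, not the system, formulates it; the code
# below only checks it against the data.
claims = [
    ("urban", 1200.0), ("urban", 900.0), ("urban", 1500.0),
    ("rural", 800.0), ("rural", 700.0),
]

def mean_for(group):
    amounts = [amt for g, amt in claims if g == group]
    return sum(amounts) / len(amounts)

urban, rural = mean_for("urban"), mean_for("rural")
verdict = "affirmed" if urban > rural else "negated"
print(f"urban={urban:.0f} rural={rural:.0f} -> hypothesis {verdict}")
```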
DISCOVERY MODEL:
The discovery model differs in its emphasis, in that it is the system that automatically
discovers important information hidden in the data. The data is sifted in search of
frequently occurring patterns, trends, and generalizations about the data, without intervention or guidance from the user.
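By contrast, a minimal sketch of discovery: the program itself sifts invented market-basket data for frequently co-occurring item pairs, with no user hypothesis involved.

```python
from collections import Counter
from itertools import combinations

# Invented transactions; the system looks for patterns on its own.
baskets = [
    {"bread", "milk"}, {"bread", "butter", "milk"},
    {"bread", "milk"}, {"beer", "bread"},
]
min_support = 3  # a pattern must appear in at least 3 baskets

# Count every item pair across all baskets.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only the frequently occurring pairs.
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)
```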
Typical discovery tasks include:
Clustering.
Deviation detection.
These tasks are of an exploratory nature and cannot be directly handed over to standard query tools.
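Deviation detection, for instance, can be sketched as flagging values that lie far from the mean; the readings and the two-standard-deviation threshold below are invented for illustration.

```python
import statistics

# Invented sensor readings containing one anomalous value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]

mu = statistics.mean(readings)
sigma = statistics.pstdev(readings)

# Flag anything more than 2 population standard deviations from the mean.
outliers = [x for x in readings if abs(x - mu) > 2 * sigma]
print(outliers)
```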
MINING PROBLEMS:
Data for data mining need not always be enterprise-related data residing in a relational database or on a stand-alone system. Data sources are very diverse and appear in varied forms: the data can be textual data, image data, CAD data, map data, ECG data, or much else besides.
SEQUENCE MINING:
Sequence mining is concerned with mining sequence data, in which records are considered in terms of their order of occurrence. A related area that falls into the same larger domain is trend discovery; a distinguishing feature of sequence discovery, in comparison with trend discovery, is the lack of shapes.
WEB MINING:
With the huge amount of information available online, the WWW is a fertile area for
data mining research. Web mining is the use of data mining techniques to
automatically discover and extract information from web documents and services.
Web mining can be decomposed into the following subtasks:
1. Resource finding.
2. Information selection and pre-processing.
3. Generalization.
4. Analysis.
TEXT MINING:
The term text mining, or KDT (Knowledge Discovery in Text), was first proposed by
Feldman and Dagan in 1996. Presently the term text mining is being used to cover
many applications, such as IE (Information Extraction).
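A first toy step in this direction, sketching term-frequency extraction over invented documents and an invented stop-word list:

```python
import re
from collections import Counter

# Toy corpus and stop-word list, invented for illustration.
docs = [
    "The warehouse stores the cleaned data",
    "Data mining finds patterns in the data",
]
stop = {"the", "in", "a", "of"}

# Tokenize, lowercase, drop stop words, and count term frequencies.
counts = Counter(
    w for doc in docs
    for w in re.findall(r"[a-z]+", doc.lower())
    if w not in stop
)
print(counts.most_common(1))
```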
SPATIAL DATA MINING:
Spatial data mining is the branch of data mining that deals with spatial (location) data. Advances in IT, digital mapping, remote sensing, and the global diffusion of GIS place new demands on spatial data analysis and modelling.
Data mining systems depend on databases to supply the raw input, and this raises
problems: databases tend to be dynamic, incomplete, noisy, and large. Other issues include:
Limited information.
Uncertainty.
DM APPLICATION AREAS:
BUSINESS TRANSACTIONS:
ELECTRONIC COMMERCE:
Not only does electronic commerce produce large data sets in which the analysis
of marketing patterns and risk patterns is critical, but it is also important to do this analysis in a timely fashion.
GENOMIC DATA:
Large amounts of genomic data are accessible on the web. Finding relationships between these data sources is an important data mining task.
SENSOR DATA:
SIMULATION DATA:
Simulation is now accepted as a mode of science complementing theory and experiment. Data mining and, more generally, data-intensive computing can serve as a bridge connecting simulation with theory and experiment.
MULTIMEDIA DOCUMENTS:
WEB DATA:
The data on the web is growing not only in volume but also in complexity. Web
data now includes not only text, audio, and video material, but also streaming data.
RISK ANALYSIS
TARGETED MARKETING
CUSTOMER RETENTION
PORTFOLIO MANAGEMENT
BRAND LOYALTY
BANKING