
WHAT IS A DATA WAREHOUSE?


A data warehouse, in its simplest form, is no more than a collection of the key pieces of information used to manage and direct the business towards the most profitable outcome. In other words, a data warehouse is the data (meta, fact, dimension and aggregation data) and the process managers (load, warehouse and query managers) that make information available, enabling people to make informed decisions.

DEFINITION:

“A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of the management’s decision-making process.”

SUBJECT-ORIENTED:

Data are organized according to subject instead of application. For example, an insurance company using a data warehouse would organize its data by customer, premium, and claim instead of by product (auto, life, etc.). The data organized by subject contain only the information necessary for decision-support processing.

NON-VOLATILE:

A data warehouse is always a physically separate store of data, which is transformed from the application data found in the operational environment. The data are not updated or changed in any way once they enter the data warehouse, but are only loaded, refreshed and accessed for queries.

TIME-VARYING:

The data warehouse contains a place for storing data that are 5 to 10 years old, or older, to be used for comparisons, trends and forecasting.


INTEGRATED:

A data warehouse is usually constructed by integrating multiple, heterogeneous sources such as relational databases, flat files, and on-line transaction (OLTP) records. Data cleaning and data integration techniques are applied to maintain consistency in naming conventions, measures of variables, encoding structures, and physical attributes.

DATA WAREHOUSE ARCHITECTURE:


A data warehouse must be architected to support three major driving factors:

• Populating the warehouse,
• Day-to-day management of the warehouse,
• The ability to cope with requirements evolution.

Based on the logical data model of the data warehouse, we shall see the popular three-tier architecture and the components of the warehouse at its different layers. Tier 1 is essentially the warehouse server, Tier 2 is the OLAP engine for analytical processing, and Tier 3 is a client containing reporting tools, visualization tools, data mining tools, query tools, etc. There is also the backend process, which is concerned with extracting data from multiple operational databases and external sources; with cleaning, transforming and integrating this data for loading into the data warehouse server; and, of course, with periodically refreshing the warehouse. Tier 1 contains the main data warehouse. It can follow one of three models, or some combination of these: it can be a single enterprise warehouse, it may contain several departmental data marts, or it can be a virtual warehouse. Tier 2 follows three different ways of designing the OLAP engine, namely ROLAP, MOLAP and extended SQL OLAP.
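
To make the backend flow concrete, here is a minimal sketch in Python of the extract-clean-load cycle; the CSV layout, table schema and cleaning rule are illustrative assumptions, not part of any particular product.

    import csv
    import sqlite3

    def extract(path):
        # Read raw rows from a hypothetical operational CSV export.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def clean(rows):
        # Apply simple consistency rules before loading.
        for row in rows:
            row["customer"] = row["customer"].strip().title()
            if row["premium"]:              # drop rows missing the measure
                yield row

    def load(rows, conn):
        # Append cleaned rows into the warehouse fact table, then commit.
        conn.executemany(
            "INSERT INTO premium_fact (customer, premium)"
            " VALUES (:customer, :premium)", rows)
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE premium_fact (customer TEXT, premium REAL)")
    # A periodic refresh would simply re-run the pipeline:
    # load(clean(extract("policies.csv")), conn)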

WAREHOUSE SERVER:

As mentioned earlier, there are three data warehouse models.


ENTERPRISE WAREHOUSE:

This model collects all the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers. An enterprise data warehouse is traditionally implemented on a mainframe-class server.

DATA MARTS:

Data marts are partitions of the overall data warehouse. A data mart is a subset of a data warehouse built specifically for a department, and marts may also contain some overlapping data. Together, the physical data marts serve as the conceptual data warehouse. These marts must provide the easiest possible access to the information required by their user communities.

STAND-ALONE DATA MART:

This approach enables a department to implement a data mart with minimal or no impact on the enterprise’s operational database.

DEPENDENT DATA MART:

Here, management of the data sources by the enterprise database is required. These data sources include operational databases and external sources of data.

VIRTUAL DATA WAREHOUSE:

In a virtual warehouse, we have a logical description of all the databases and their structures, and individuals who want to get information from those databases do not have to know anything about them. This approach creates a single “virtual database” from all the data sources, which can be local or remote. In this type of data warehouse the data is not moved from the sources; instead, users get direct access to the data.


A virtual database is easy and fast to set up, but it is not without problems. Since its queries must compete with the production data transactions, its performance can be considerably degraded. Since there is no metadata, no summary data and no history, all queries must be repeated, creating an additional burden on the system.

METADATA: “data about data”

Metadata provides a catalogue of the data in the data warehouse and pointers to this data.

A metadata repository should contain the following (a minimal sketch of one repository entry follows the list):

• A description of the structure of the data warehouse.
• Operational metadata, such as data lineage, currency of data and monitoring information.
• The summarization processes, which include dimension definitions, partitions, aggregations, etc.
• Details of the data sources.
• Data related to system performance.
• Business metadata, which includes business terms and definitions, and changing policies.
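
As an illustration only, the sketch below models one repository entry as a Python dataclass; the field names are assumptions chosen to mirror the list above, not a standard schema.

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class TableMetadata:
        name: str                 # warehouse structure this entry describes
        source_system: str        # detail of the data source
        columns: dict             # column name -> business definition
        last_refreshed: datetime  # operational metadata: currency of the data
        aggregations: list = field(default_factory=list)  # summarizations

    catalog = {
        "premium_fact": TableMetadata(
            name="premium_fact",
            source_system="policy OLTP system",
            columns={"premium": "annual premium paid by the customer"},
            last_refreshed=datetime(2005, 1, 1),
        )
    }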

TYPES OF METADATA:

Due to the variety of metadata, it is necessary to categorize it into different types based on how it is used.

1. BUILD-TIME METADATA:

Whenever we design and build a warehouse, the metadata that we generate is termed build-time metadata. This metadata links business and warehouse terminology and describes the data’s technical structure. It is the primary source of most of the metadata used in the warehouse.


2. USAGE METADATA:

When the warehouse is in production, usage metadata, which is derived from build-time metadata, is an important tool for users and data administrators. This metadata is used differently from build-time metadata, and its structure must accommodate this fact.

3. CONTROL METADATA:

Most control metadata is of interest only to systems programmers. However, one subset, which is generated and used by the tools that populate the warehouse, is of considerable interest to users and data warehouse administrators: it provides vital information about the timeliness of warehouse data and helps users track the sequence and timing of warehouse events.

DATA WAREHOUSE PROCESS MANAGERS:

The data warehouse process managers are the pieces of software responsible for the flow, maintenance and upkeep of the data, both into and out of the data warehouse database.

There are three different data warehouse process managers:

• LOAD MANAGER
• WAREHOUSE MANAGER
• QUERY MANAGER

LOAD MANAGER:

The load manager is responsible for any data transformation required and for the loading of data into the database. Its responsibilities are as follows:

• Data source interaction
• Data transformation


• Data load

WAREHOUSE MANAGER:

The warehouse manager is responsible for maintaining the data while it is in the data warehouse. Its responsibilities are listed below:

• Data movement
• Metadata management
• Performance monitoring and tuning
• Data archiving

QUERY MANAGER:

The query manager has several distinct responsibilities; they are listed below, with a minimal sketch of all three managers after the list:

• User access to the data
• Query scheduling
• Query monitoring
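
As promised above, a minimal sketch of the division of labour among the three managers; the class and method names are assumptions, and the bodies are deliberately left empty.

    class LoadManager:
        # Data source interaction, transformation and load.
        def interact_with_source(self, source): ...
        def transform(self, rows): ...
        def load(self, rows): ...

    class WarehouseManager:
        # Data movement, metadata management, monitoring/tuning, archiving.
        def move_data(self): ...
        def manage_metadata(self): ...
        def monitor_and_tune(self): ...
        def archive(self): ...

    class QueryManager:
        # User access, query scheduling and query monitoring.
        def grant_access(self, user): ...
        def schedule(self, query): ...
        def monitor(self, query): ...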

SECURITY:

Security can affect many different parts of the data warehouse, such as:

• User access
• Data load
• Data movement
• Query generation

PERFORMANCE IMPACT OF SECURITY:

Security also has a cost. Any security that is implemented will cost in terms of either processing power or disk space, or both.

VIEWS:


Views are a standard RDBMS mechanism for applying restrictions to data access (a small illustration follows the list below). Some common restrictions that views themselves impose are:

• Restricted DML operations
• Lost query optimization paths
• Restrictions on parallel processing of view projections.
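
As a hedged illustration of the mechanism, using SQLite through Python's standard library (table and column names are invented), a view can project away a sensitive column; SQLite's views also happen to be read-only, which illustrates the restricted-DML point above.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE claims (customer TEXT, amount REAL, ssn TEXT)")
    conn.execute("INSERT INTO claims VALUES ('Ada', 120.0, '000-00-0000')")

    # Restrict access by projecting away the sensitive column.
    conn.execute("CREATE VIEW claims_public AS"
                 " SELECT customer, amount FROM claims")
    print(conn.execute("SELECT * FROM claims_public").fetchall())
    # -> [('Ada', 120.0)]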

DATA MOVEMENT:

Because of the volumes of data being handled in the data warehouse, data movement is an expensive process, in terms both of resources and of time. There are a number of different ways in which bulk data movements can occur (an example of the second follows the list):

1. Data loads
2. Aggregation creation
3. Temporary tables of results
4. Data extracts.
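
As an example of the second kind, the sketch below (with invented tables) populates a summary table from a fact table in a single bulk INSERT ... SELECT, which is the usual shape of aggregation creation.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_fact (product TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                     [("auto", 2), ("auto", 3), ("life", 1)])

    # Aggregation creation: one bulk movement from detail to summary.
    conn.execute("CREATE TABLE sales_by_product"
                 " (product TEXT, total_qty INTEGER)")
    conn.execute("INSERT INTO sales_by_product"
                 " SELECT product, SUM(qty) FROM sales_fact GROUP BY product")
    print(conn.execute("SELECT * FROM sales_by_product").fetchall())
    # -> [('auto', 5), ('life', 1)]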

AUDITING:

Clearly, any auditing that has to be performed will have a CPU impact, because each audited action will require some code to be run. Auditing also requires disk space.

BACKUP AND RECOVERY:

Backup is one of the most important regular operations carried out on any system.

BACKUP STRATEGIES:

1. EFFECT ON DATABASE DESIGN:

There is a major interaction between the backup strategy and the database design; the two go hand in hand. Data warehouses are such large and complex systems that backup should be an integral part of the system. We need to design the whole data warehouse system in a unified fashion, and it is particularly important to manage the design of the backup, the database and the overnight processing together.

2. DESIGN STRATEGIES:

Read-only tables are one of the main weapons in the battle to reduce the amount of data that needs to be backed up. Another way of reducing the regular backup requirement is to reduce the amount of journaling or redo generated. This is possible with some RDBMSs, because they allow you to turn off logging for certain operations, as the sketch below illustrates.
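
For instance, SQLite exposes this through its journal_mode pragma; the sketch below (which opens or creates a local file) suppresses the rollback journal for a bulk load and then restores the default. Whether this trade-off is acceptable depends entirely on the recovery strategy.

    import sqlite3

    conn = sqlite3.connect("warehouse.db")        # opens/creates a local file
    conn.execute("PRAGMA journal_mode = OFF")     # no rollback journal is kept
    conn.execute("CREATE TABLE IF NOT EXISTS staging (v INTEGER)")
    conn.executemany("INSERT INTO staging VALUES (?)",
                     [(i,) for i in range(1000)])
    conn.commit()
    conn.execute("PRAGMA journal_mode = DELETE")  # restore the default mode
    conn.close()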

RECOVERY STRATEGIES:

The recovery strategy will be built around the backup strategy. Any recovery situation naturally implies that some failure has occurred. Whatever software we choose, the recovery steps for the failure scenarios below need to be fully documented:

• Instance failure
• Media failure
• Loss or damage of table space
• Loss or damage of redo log files
• Loss or damage of archive log files
• Failure during data movements
• And others

There are a number of data movement scenarios that need to be covered:

• Data load into staging tables
• Movement from staging to fact table
• Partition roll-up into larger partitions


• Creation of aggregations

DISASTER RECOVERY:

Recovering from a disaster requires the following:

• Replacement / standby systems
• Sufficient tape and disk capacity
• Communication links to users
• Communication links to data sources
• Copies of all relevant pieces of software
• Backups of the database
• Application-aware systems administration and operations staff

DATA WAREHOUSE APPLICATIONS:

1. SALES ANALYSIS:

• Determine real-time product sales
• Analyze historical product sales
• Evaluate successful products and determine key success factors
• Rapidly identify preferred customer segments
• Quickly isolate past preferred customers who no longer buy

2. FINANCIAL ANALYSIS:

• Compare actuals to budgets on a timely basis
• Review past cash flow trends and forecast future needs
• Identify and analyze key expense generators
• Receive near-real-time, interactive financial statements

3. HUMAN RESOURCE ANALYSIS:


• Evaluate trends in benefit program use
• Identify the wage and benefits costs to determine company-wide variation
• Review compliance levels for EEOC and other regulated activities

4. OTHER AREAS:

• Warehouses have also been applied to areas such as logistics, inventory, purchasing, detailed transaction analysis, and load balancing.


DATA MINING:
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining attempts to seek out patterns and trends in the data and infers rules from these patterns.

DEFINITIONS:

The term ‘data mining’ refers to the finding of relevant and useful information from databases. A few definitions are given below:

“Data mining, or knowledge discovery in databases as it is also known, is the non-trivial extraction of implicit, previously unknown and potentially useful information from the data. This encompasses a number of technical approaches, such as clustering, data summarization, classification, finding dependency networks, analyzing changes, and detecting anomalies.”

“Data mining is the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data, such as the relationship between patient data and their medical diagnoses. This relationship represents valuable knowledge about the database, and the objects in the database, if the database is a faithful mirror of the real world registered by the database.”

“Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.”


DATA MINING TECHNIQUES:

Researchers identify two fundamental goals of data mining:

• Prediction.
• Description.

Prediction makes use of existing variables in the database in order to predict unknown or future values of interest. Description focuses on finding patterns describing the data and their subsequent presentation for user interpretation; a small prediction sketch follows.
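
The sketch below is a minimal, invented illustration of the prediction goal: a one-nearest-neighbour rule that predicts an unknown label from existing variables. The training data is made up for the example.

    def predict_1nn(train, query):
        # train: list of (feature_tuple, label); query: a feature tuple.
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        features, label = min(train, key=lambda pair: dist(pair[0], query))
        return label

    train = [((30, 40000), "low risk"), ((60, 20000), "high risk")]
    print(predict_1nn(train, (35, 38000)))   # -> low risk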

One way to study DM techniques is to classify them as:

• User-guided or verification-driven data mining.
• Discovery-driven or automatic discovery of rules.

Most techniques of data mining have elements of both models.

VERIFICATION MODEL:

In this process of data mining, the user makes a hypothesis and tests the hypothesis on the data to verify its validity. The emphasis is on the user, who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate the hypothesis.

DISCOVERY MODEL:

The discovery model differs in its emphasis in that it is the system that automatically discovers important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalizations about the data without intervention or guidance from the user. The typical discovery-driven tasks are:

• Discovery of association rules.


• Discovery of classification rules.
• Clustering.
• Discovery of frequent episodes.
• Deviation detection.

These tasks are of an exploratory nature and cannot be directly handed over to currently available database technology; the sketch below gives the flavour of the first task.
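
A minimal sketch of the first step towards association rules: counting frequent item pairs across transactions. The baskets are invented for illustration.

    from collections import Counter
    from itertools import combinations

    transactions = [{"bread", "milk"},
                    {"bread", "butter"},
                    {"bread", "milk", "butter"}]

    # Count every unordered item pair occurring in a basket.
    pair_counts = Counter(pair for basket in transactions
                          for pair in combinations(sorted(basket), 2))

    min_support = 2
    frequent = {p: n for p, n in pair_counts.items() if n >= min_support}
    print(frequent)   # {('bread', 'milk'): 2, ('bread', 'butter'): 2}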

MINING PROBLEMS:

A data mining system can either be a portion of a data warehousing system or a stand-alone system. Data for data mining need not always be enterprise-related data residing in a relational database. Data sources are very diverse and appear in varied forms: textual data, image data, CAD data, map data, ECG data or the much talked-about genome data. The DM problems for different types of data are outlined below.

SEQUENCE MINING:

Sequence mining is concerned with mining sequence data. It may be noted that in the discovery of association rules we are interested in finding associations between items irrespective of their order of occurrence, whereas in sequence mining the order matters (a small sketch follows). Another related area, which falls into the larger domain of temporal data mining, is trend discovery. One characteristic of sequence-pattern discovery, in comparison with trend discovery, is the lack of shapes, since the causal impact of a series of events cannot be shaped.
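
As promised, a small sketch in which order matters: counting consecutive event pairs across invented click sequences.

    from collections import Counter

    sequences = [["login", "search", "buy"],
                 ["login", "buy"],
                 ["login", "search"]]

    # Unlike association rules, ('login', 'search') != ('search', 'login').
    bigrams = Counter((a, b) for seq in sequences
                      for a, b in zip(seq, seq[1:]))
    print(bigrams.most_common(2))   # [(('login', 'search'), 2), ...]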

WEB MINING:

With the huge amount of information available online, the WWW is a fertile area for data mining research. Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.

13 Email: chinna_chetan05@yahoo.com
Visit: www.geocities.com/chinna_chetan05/forfriends.html

Web mining can be broken down into the following subtasks (a sketch of the first two follows the list):

1. Resource finding.
2. Information selection and preprocessing.
3. Generalization.
4. Analysis.
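
A minimal sketch of subtasks 1 and 2, using only the Python standard library; the URL is a placeholder assumption and the network call is left commented out.

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class TextExtractor(HTMLParser):
        # Information selection: keep only the visible text of the page.
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())

    def fetch_text(url):
        # Resource finding: retrieve the document, then preprocess it.
        parser = TextExtractor()
        parser.feed(urlopen(url).read().decode("utf-8", errors="replace"))
        return " ".join(parser.chunks)

    # text = fetch_text("https://example.com/")  # generalization and
    # analysis would then run over 'text'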

TEXT MINING:

The term KDT (Knowledge Discovery in Text) was first proposed by Feldman and Dagan in 1996. Presently, the term text mining is being used to cover many applications such as text categorization, exploratory data analysis, text clustering, finding patterns in text databases, finding sequential patterns in texts, IE (Information Extraction), empirical computational linguistics tasks, and association discovery. A sketch of one such primitive follows.
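
As a small sketch of one such primitive, the bag-of-words cosine similarity below scores how alike two (invented) documents are; text clustering and pattern finding build on measures like this.

    from collections import Counter
    from math import sqrt

    def cosine(doc_a, doc_b):
        # Represent each document as a bag of lower-cased words.
        wa = Counter(doc_a.lower().split())
        wb = Counter(doc_b.lower().split())
        dot = sum(wa[t] * wb[t] for t in wa)
        norm = (sqrt(sum(v * v for v in wa.values()))
                * sqrt(sum(v * v for v in wb.values())))
        return dot / norm if norm else 0.0

    print(cosine("data mining finds patterns",
                 "mining patterns in data"))   # -> 0.75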

SPATIAL DATA MINING:

Spatial data mining is the branch of data mining that deals with spatial (location) data. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS places demands on developing data-driven, inductive approaches to spatial analysis and modelling.

ISSUES AND CHALLENGES IN DM:

Data mining systems depend on databases to supply the raw input, and this raises problems in that databases tend to be dynamic, incomplete, noisy and large. The difficulties in data mining can be categorized as:

• Limited information.
• Noise or missing data.
• User interaction and prior knowledge.


• Uncertainty.
• Size, updates and irrelevant fields.

DM APPLICATION AREAS:

The applications can be naturally divided into three broad categories:

1. Business and e-commerce data.
2. Scientific, engineering and health care data.
3. Multimedia documents and web data.

A. BUSINESS AND E-COMMERCE DATA

This category is a major source of data mining applications.

BUSINESS TRANSACTIONS:

Modern business processes are consolidating, with millions of customers and billions of their transactions. Business enterprises require the necessary information for their effective functioning in today’s competitive world.

ELECTRONIC COMMERCE:

Not only does electronic commerce produce large data sets in which the analysis of marketing patterns and risk patterns is critical, but it is also important to do this in near-real time in order to meet the demands of online transactions.

B. SCIENTIFIC, ENGINEERING AND HEALTH CARE DATA

• GENOMIC DATA:

Genomic sequencing and mapping efforts have produced a number of databases which are accessible on the web. Finding relationships between these data sources is another fundamental challenge for data mining.

• SENSOR DATA:


Remote sensing data is another source of voluminous data. Remote sensing satellites and a variety of other sensors produce large amounts of geo-referenced data.

• SIMULATION DATA:

Simulation is now accepted as an important mode of science, supplementing theory and experiment. Data mining and, more generally, data-intensive computing are proving to be a critical link between theory, simulation, and experiment.

• HEALTH CARE DATA:

Hospitals, health care organizations, insurance companies, and the concerned government agencies accumulate large collections of patient and health-care-related data.

C. MULTIMEDIA DOCUMENTS AND WEB DATA:

• MULTIMEDIA DOCUMENTS:

Today’s technology for retrieving multimedia items on the web is far from satisfactory. It is becoming harder to extract meaningful information from the archives of multimedia data as the volume grows.

• WEB DATA:

The data on the web is growing not only in volume but also in complexity. Web data now includes not only text, audio and video material, but also streaming data and numerical data.

OTHER APPLICATION AREAS:

• RISK ANALYSIS
• TARGETED MARKETING
• CUSTOMER RETENTION


• PORTFOLIO MANAGEMENT
• BRAND LOYALTY
• BANKING

The application areas in banking are:

1. Detecting patterns of fraudulent credit card use
2. Identifying ‘loyal’ customers
3. Predicting customers likely to change their card affiliation
4. Determining credit card spending by customer groups
5. Finding hidden correlations between different financial indicators
6. Identifying stock trading rules from historical market data.

