
Data Warehouse Architecture: Processes

Overview
The architecture is a technical blueprint. It must support three major driving forces:
Populating the warehouse
Data extraction, cleaning and loading

Day-to-day management of the warehouse


Large volumes of data, create/delete summaries

The ability to cope with requirement evolution


Cope with future changes in query profiles

Typical Process Flow


Extract and load the data
Clean and transform data into a form that provides good query performance
Backup and archive data
Manage queries, and direct them to appropriate data sources

Extract & Load Process


Extract
Takes data from source systems and makes it available to the data warehouse

Load
Takes extracted data and loads it into the data warehouse

Data in operational systems is held in a form suitable for that system. Before loading the data into the DW, its information content must be reconstructed: data must become value-added business information.
The extract & load process must take data and add context and meaning.

Issues with ELP


When should extraction start, and when should transformations, consistency checks and so on run?
A controlling mechanism is essential to fire each module at the appropriate time
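The controlling mechanism above can be sketched as a small pipeline driver that fires each module only after its predecessor succeeds. This is a minimal illustration, not a real ETL framework; all module names and the sample data are assumptions.

```python
# Minimal sketch of a controlling mechanism that fires each ETL module
# in order, each stage running only after the previous one succeeds.
# Module names and data are illustrative assumptions.

def extract(data):
    # Pull rows from a source system (stubbed with a list here).
    return data

def transform(rows):
    # Normalise values, e.g. strip whitespace from strings.
    return [r.strip() if isinstance(r, str) else r for r in rows]

def consistency_check(rows):
    # Reject the whole batch if any value is missing.
    if any(r is None for r in rows):
        raise ValueError("inconsistent batch: missing values")
    return rows

def load(rows, warehouse):
    warehouse.extend(rows)
    return warehouse

def run_pipeline(source, warehouse):
    # The controller: fire each module when appropriate, in sequence.
    for stage in (extract, transform, consistency_check):
        source = stage(source)
    return load(source, warehouse)

warehouse = []
run_pipeline(["  alice", "bob  "], warehouse)
print(warehouse)  # ['alice', 'bob']
```

In practice this role is played by a scheduler or workflow manager rather than a hand-written loop, but the dependency ordering is the same.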

When to extract?
Data must be in a consistent state. Start extracting data from a source only when it represents the same snapshot in time as all the other data sources.
Eg. Customer database

Issues
Loading the data
Extracted data is loaded into a temporary data store for clean-up and consistency checking
Do not execute consistency checks until all the data sources have been loaded into the temporary data store
Eg. Customer canceling subscriptions

Error recovery must be an integral part of the design
The effort required to clean up the source systems increases exponentially with the number of overlapping data sources
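The staging rule above, that consistency checks wait until every source is loaded, can be sketched as follows. The source names ("customers", "subscriptions") and the cross-source rule are illustrative assumptions, echoing the subscription-cancellation example.

```python
# Hedged sketch: load every source into a temporary (staging) store
# first, and run cross-source consistency checks only once all sources
# are present. Source names and the rule are illustrative assumptions.

staging = {}

def load_to_staging(source_name, rows):
    staging[source_name] = rows

def check_consistency(staging):
    # Cross-source rule: every subscription must reference a customer
    # known to the customer feed (a cancelled customer in one feed
    # would otherwise leave orphan subscriptions behind).
    customers = {c["id"] for c in staging["customers"]}
    return [s for s in staging["subscriptions"]
            if s["customer_id"] not in customers]

load_to_staging("customers", [{"id": 1}, {"id": 2}])
load_to_staging("subscriptions", [{"customer_id": 1}, {"customer_id": 3}])

# Only now, with both sources staged, is the check meaningful.
print(check_consistency(staging))  # [{'customer_id': 3}]
```

Running the check before the subscriptions feed arrived would either crash or, worse, silently pass on incomplete data.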

Issues
Copy Management tools and clean up
Eg. IBM's Information Warehouse Framework
Data Refresher & Data Hub

Most copy management tools cannot perform consistency checks directly (the user must write and code the logic)
Make a cost-benefit analysis before purchasing a copy management tool
If source systems do not overlap, then consistency checks are very simple

Clean and Transform Data


Steps involved are:
Clean and transform the loaded data into a structure that speeds up queries
Partition the data to speed up queries, optimize hardware performance and simplify DW management
Create aggregations to speed up the common queries
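The aggregation step in the list above can be illustrated with a pre-computed summary: a common query then reads a small summary table instead of scanning the full fact table. Table and column names here are assumptions for illustration.

```python
# Sketch of pre-computing an aggregation (daily sales totals) so that a
# common query hits a small summary instead of the detailed fact table.
# Table and column names are illustrative assumptions.
from collections import defaultdict

fact_sales = [
    {"day": "2024-01-01", "amount": 10.0},
    {"day": "2024-01-01", "amount": 5.0},
    {"day": "2024-01-02", "amount": 7.5},
]

def build_daily_aggregate(fact):
    # One pass over the fact table at load time, not at query time.
    agg = defaultdict(float)
    for row in fact:
        agg[row["day"]] += row["amount"]
    return dict(agg)

daily = build_daily_aggregate(fact_sales)
print(daily["2024-01-01"])  # 15.0
```

The same trade-off the slides describe applies: the aggregate speeds up a known query profile but costs storage and must be refreshed (or deleted) as profiles change.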

Data needs to be cleaned and checked in the following ways:
Make sure data is consistent with itself
Make sure data is consistent with other data within the same source
Make sure data is consistent with data in the other source systems
Make sure data is consistent with the information already in the DW

Once the data is cleaned, convert the source data into a structure designed to balance query performance and operational cost.
The structure must be suitable for long-term storage.
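The four consistency checks listed above can be sketched for a single batch of rows. The field names, rules and return shape are illustrative assumptions, not a standard cleaning API.

```python
# Minimal sketch of the four consistency checks applied to one batch.
# Field names and rules are illustrative assumptions.

def clean_batch(batch, other_source_ids, warehouse_ids):
    clean, errors = [], []
    seen = set()
    for row in batch:
        if row.get("id") is None:               # 1. internal consistency
            errors.append("missing id"); continue
        if row["id"] in seen:                   # 2. within-source consistency
            errors.append(f"duplicate {row['id']}"); continue
        seen.add(row["id"])
        if row["id"] not in other_source_ids:   # 3. cross-source consistency
            errors.append(f"{row['id']} unknown in other source"); continue
        clean.append(row)
    # 4. consistency with the DW: load only rows not already present
    new_rows = [r for r in clean if r["id"] not in warehouse_ids]
    return new_rows, errors

batch = [{"id": 1}, {"id": 1}, {"id": None}, {"id": 2}, {"id": 3}]
new_rows, errors = clean_batch(batch,
                               other_source_ids={1, 2, 3},
                               warehouse_ids={2})
print(new_rows)  # [{'id': 1}, {'id': 3}]
print(errors)    # ['duplicate 1', 'missing id']
```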

Backup & archive


Regular backups are essential to recover data from loss
Archiving
Older data is removed from the system in a format that allows it to be quickly restored if required
Issue
As the DW evolves, all information may change. Hence, to ensure that a restored archive remains valid, it becomes necessary to archive all related data and structures as well.

Query Management Process


It is a system process that:
Manages the queries
Speeds them up by directing queries to the most effective data source
Ensures that all system resources are used effectively
Monitors query profiles to manage which aggregations to generate
This process operates at all times
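Directing a query to the most effective data source can be sketched as a simple router that serves a query from a pre-built summary when the summary's grain can answer it, and falls back to the detailed fact table otherwise. Source names and the grain rule are illustrative assumptions.

```python
# Hedged sketch of a query manager routing a query to the most
# effective data source. Names and the grain rule are assumptions.

SOURCES = {
    "daily_summary": {"grain": "day"},          # small, fast aggregation
    "fact_sales": {"grain": "transaction"},     # large, detailed fact table
}

def route(query_grain):
    # A day-level query can be answered from the summary; any finer
    # grain must fall through to the detailed fact table.
    if query_grain == "day":
        return "daily_summary"
    return "fact_sales"

print(route("day"))          # daily_summary
print(route("transaction"))  # fact_sales
```

A real query manager would also record which grains are requested most often, feeding the decision about which aggregations to generate or drop.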
