
Microsoft EDW Architecture, Guidance and Deployment Best Practices

Chapter One: Overview


By Microsoft Corporation

Acknowledgements:
Contributing writers from Solid Quality Mentors: Larry Barnes
Technical reviewers from Microsoft: Benjamin Wright-Jones
Contributing editors from Solid Quality Mentors: Kathy Blomstrom

Published:
Applies to: SQL Server 2008 R2


Microsoft Corporation

Copyright 2010

Chapter One: Overview
  Introduction
  Why a Data Warehouse?
    Data Warehouse Access Patterns
  Data Warehouse Components
  Data Warehouse Life Cycle and Team Model
    Data Warehouse Life Cycle
    Data Warehouse Team Model
    Data Stewardship
    Data Warehouse Projects
    Managing a Data Warehouse
  Chapter Roadmap
    Chapter 2 - Data Architecture
    Chapter 3 - Data Integration
    Chapter 4 - Operations Management and Security
    Chapter 5 - Querying, Performance Monitoring, and Tuning
    Getting Started


Introduction
An Enterprise Data Warehouse (EDW) has long been considered a strategic asset for organizations. An EDW spans multiple business subject areas, which often reside in multiple databases. The success of a data warehouse requires more than the underlying hardware platform and software products. It also requires executive-level sponsorship and governance programs, a solid team structure, strong project management, and good communications. This document does not focus on those subjects. Instead, the focus is on best practices and repeatable patterns for developing and maintaining large data warehouses on the Microsoft SQL Server platform (SQL Server), as the focus areas diagram in Figure 1-1 shows.

Figure 1-1: Microsoft EDW Architecture, Guidance and Deployment Best Practices focus areas

Note that migrating data warehouses to SQL Server from other platforms is an important topic, but it is also out of scope for this document. The audience for this document is the project team tasked with designing, building, deploying, maintaining, and enhancing components for a data warehouse built on the SQL Server platform. The document contains the following chapters:
- Introduction - Providing an overview of the toolkit, key data warehouse phases and roles, and an introduction to the remaining chapters
- Data Architecture - Covering design of the database architecture and data models
- Data Integration - Addressing data movement to and from the data warehouse
- Database Administration - Explaining how to manage and maintain the data warehouse
- Querying, Performance Monitoring, and Tuning - Covering optimization and monitoring of query and load performance

Although this document is primarily written for the data warehouse team's data architects, data developers, and database administrators (DBAs), other team members involved in the development and maintenance of a data warehouse may also find it useful.


Why a Data Warehouse?


What's the benefit of implementing a data warehouse? Although the answer is probably already clear if you're reading this paper, it's worth revisiting the challenges that data warehouses address. Relational databases have traditionally been the data store for Line of Business (LOB) applications. These applications require a write-optimized database that supports thousands of concurrent users. Figure 1-2 shows the characteristics of a LOB application.

Figure 1-2: Line of Business application characteristics

Table 1-1 lists some characteristics of an LOB workload, which is also often referred to as Online Transaction Processing (OLTP).

Characteristic          Line of Business (OLTP)
Concurrent Users        Many (100s - 10,000+)
Database Access         Stored procedures (stored procs.), SQL
Transaction Scope       Short: one record to several records
Database Operations     Singleton selects, inserts, updates, deletes
Data Volatility         Volatile, current data

Table 1-1: LOB workload characteristics

Everything works well until the business starts running large reports against the database, as shown in Figure 1-3.


Figure 1-3: Reporting against a LOB database

In such scenarios, you'll start seeing LOB database reporting issues, including:
- Blocking and locking - Reports of any size request many rows from many tables joined together in one Select statement. This results in the report blocking LOB application activity (see the example after this list).
- Poor performance - Report performance is often sub-optimal even if there's no blocking, simply because of the number of joins required to produce the results.
- Lack of history - Many reports, such as trend analysis, require historical data to report trends over. LOB systems overwrite historical data with current data, making it impossible to report on historical data.
- Scope - Organizations frequently need to report on enterprise-level information. This requires information from multiple subject areas, which can span multiple LOB systems.
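To make the contrast concrete, the following T-SQL sketch shows the difference between a typical LOB singleton operation and a reporting query that joins and scans many rows against the same database. The table and column names are hypothetical and used only for illustration.

-- Typical LOB (OLTP) access: touches one or a few rows by key,
-- usually within a short transaction or stored procedure.
BEGIN TRANSACTION;

UPDATE dbo.OrderHeader
SET    OrderStatus = 'Shipped'
WHERE  OrderID = 184302;                -- singleton update by primary key

INSERT INTO dbo.OrderAudit (OrderID, ChangedOn, NewStatus)
VALUES (184302, GETDATE(), 'Shipped');

COMMIT TRANSACTION;

-- Typical report against the same LOB database: joins several tables and
-- scans large date ranges, taking shared locks that can block the
-- singleton writes above.
SELECT  c.Region,
        p.ProductCategory,
        YEAR(oh.OrderDate) AS OrderYear,
        SUM(od.LineTotal)  AS Revenue
FROM    dbo.OrderHeader AS oh
JOIN    dbo.OrderDetail AS od ON od.OrderID   = oh.OrderID
JOIN    dbo.Product     AS p  ON p.ProductID  = od.ProductID
JOIN    dbo.Customer    AS c  ON c.CustomerID = oh.CustomerID
WHERE   oh.OrderDate >= '20090101'
GROUP BY c.Region, p.ProductCategory, YEAR(oh.OrderDate);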

These LOB reporting limitations then become requirements for a data warehouse:
- Scope - The ability to report on the big picture
- History - The ability to report on historical information
- Read optimized - Data models that are tuned in support of large queries

These requirements in turn feed into the following challenges for properly designing and implementing the data warehouse:
- Scope - The scope of a data warehouse often crosses organizational boundaries and requires strong communication and governance at the enterprise level.
- Scale - The large volumes of data involved require an appropriate database configuration, a scalable software platform, and correctly configured, scalable hardware.
- Read performance - Read-optimizing large volumes of data requires a solid data model and, often, physical partitioning schemes (see the sketch after this list) to meet the performance needs of the business.
- Load performance - Loading large amounts of data into a read-optimized data model within decreasing windows of time requires efficient load procedures.
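As a preview of the physical design topics covered later, the sketch below shows one common way to address both read and load performance in SQL Server 2008 R2: partitioning a large fact table by date. The table, columns, and boundary values are assumptions for illustration only.

-- Partition a hypothetical fact table by month so queries touch only the
-- partitions they need and loads can switch new data in quickly.
CREATE PARTITION FUNCTION pfOrderDate (date)
AS RANGE RIGHT FOR VALUES ('20100101', '20100201', '20100301');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales
(
    OrderDateKey   date   NOT NULL,
    ProductKey     int    NOT NULL,
    CustomerKey    int    NOT NULL,
    SalesAmount    money  NOT NULL
) ON psOrderDate (OrderDateKey);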

Data warehouses have traditionally addressed the above challenges by serializing different classes of activities, as we'll see in the next section.

Data Warehouse Access Patterns

As data warehouses mature over time, the emphasis shifts from making data available to consumers to the production of data. Many times this translates into the data warehouse having a Production data area and a Consumption data area. Data production processes typically run at night, as shown in Figure 1-4. The start time traditionally has depended on when source systems have completed their daily processing and post-processing.

Figure 1-4: Data warehouse access patterns

Note that the Consumption area is unavailable to consumers during a certain period of time each day; usually this is in the early morning hours when usage is at its lowest. This is because load and maintenance activities are write-intensive and perform best when they have exclusive access to the database. Also note that the Production area has backup and maintenance activities as well, but these can be scheduled throughout the day and are less dependent upon consumer activity. However, there is increasing pressure to compress the period of time that the data warehouse consumption area is closed to consumers, including:
- Requirements to keep the data warehouse available for longer periods. Picture a global Fortune 1000 company with business consumers in all time zones. In these cases, the argument can be made that the data warehouse should always be open for business.

Combine these demands with increased volumes of data plus the desire for more current data, and you can see the challenges data warehouse teams face as they try to meet all business requirements while maintaining good performance and producing high-quality data. Data quality is foundational to the business trusting the data within the data warehouse and is a core requirement for the data production processes.

Data Warehouse Components


Let's look more closely at the data warehouse, beginning with an overview of its components. The Corporate Information Factory (CIF), a well-known architecture in the industry, defines a data warehouse as "a subject-oriented, integrated, time-variant and non-volatile collection of data used to support the strategic decision-making process for the enterprise." For more information on CIF, visit Bill Inmon's web site: http://www.inmoncif.com


A data warehouse is fed by one or more sources and, in turn, is accessed by multiple consumers, as Figure 1-5 shows.

Figure 1-5: Data warehouse components

A data warehouse consists of the following components:
- The Production area is where source data is cleansed, reconciled, and consolidated. This is where the majority of a data warehouse team's time and resources are spent.
- The Consumption area is where data is transformed for consumption. Reporting rules are applied and the data is denormalized and in some cases aggregated. Consumers then use SQL to access the data.
- The Platform is the hardware and software products, including network and SAN, on which the data warehouse is implemented.
- Data integration processes are responsible for the data movement and transformation as it makes its way from sources through the Production data area to the Consumption data area.

For more information on these components, read Chapter 2 - Data Architecture. A data warehouse has a long life span and is rarely implemented all at once. This is due to a variety of factors, including mergers and acquisitions, a changing business landscape, requests for new subject areas, and new questions asked by the business. This translates into ongoing activities in support of a data warehouse as well as ongoing projects to implement new features and functionality. Let's look at an overview of the common activities in a data warehouse life cycle and the team responsible for each activity.

Data Warehouse Life Cycle and Team Model


Key factors for any successful data warehouse include:
- The team responsible for developing and maintaining the data warehouse
- The methodology used to develop and deploy the data warehouse
- The processes employed to ensure the ongoing alignment of the data warehouse with its business users

To achieve success in each of these areas, you need to take the organization's culture and people into account. Thus, it's difficult to recommend a prescriptive set of best practices without detailed knowledge of an organization. Because of this, detailed team models, project methodologies and processes, and governance activities are out of scope for this document. However, we will review them briefly in the following section, and then focus the remainder of this document on patterns and best practices for data warehouses implemented on SQL Server.

Data Warehouse Life Cycle

Figure 1-6 shows a high-level abstraction of a data warehouse's life cycle, not representing any specific methodology, and highlights the activities that this toolkit focuses on.

Figure 1-6: Data warehouse life cycle

The life cycle components include:
- Governance - This oversight activity ensures that the organization is aligned around a data warehouse.
- Define - This phase, involving developing solid business requirements and translating them into a design, is critical to every data warehouse project's success.
- Develop-Test - This important phase includes development and testing of data models, integration processes, and consumer SQL access for subject areas within the data warehouse.
- Deploy - Deployment activities promote data warehouse project deliverables across different physical areas including development, test, QA, and production.
- Maintain - Maintenance activities are the ongoing tasks that support a successful data warehouse.

Governance and the Define project phase are beyond the scope of this document. Although we will discuss deployment topics in the remaining chapters, the discussion will be smaller than that for the Develop-Test and Maintain stages within the life cycle.

Data Warehouse Team Model

Figure 1-7 shows the key roles within the different phases of the data warehouse life cycle.


Figure 1-7: Data warehouse roles

Data warehouse roles fall into these general categories:
- Business - This group, essential in data warehouse governance and the Define phase, represents the key business roles within the data warehouse team: business sponsors, business analysts, and data stewards.
- Develop and test - Developers, quality assurance staff, and data stewards are key team members while the solution is being developed and tested.
- Technical oversight - The data architect has an oversight role in all technical phases of the data warehouse.
- Maintain - DBAs, data stewards, and IT operations staff are responsible for the ongoing data warehouse maintenance activities.

In addition, release engineering, DBAs, and IT operations are involved in solution deployment across environments. As noted earlier, although deployment topics will be addressed in the toolkit, they aren't a primary focus for this document. Table 1-2 maps the level of involvement for different roles in the data warehouse life cycle.


Role                    Define    Dev-Test    Deploy    Maintain
Project Manager
Business Analyst
Data Steward
Data Architect
Developer
QA
Release Engineering
DBA
IT Operations

Table 1-2: Data warehouse roles and responsibilities

Data Stewardship

The data steward is a key team member. Data stewards are responsible for data quality. This role is different from the traditional Quality Assurance role, which focuses on product quality. The ideal data steward has significant tenure within an organization and understands both the business and the source system or systems supporting the business. The data steward is involved in all phases of a data warehouse:
- Governance - The data steward contributes to the business and technical metadata deliverables from ongoing data governance activities and is responsible for ensuring business trust in data warehouse results.
- Define - At this stage, the data steward provides subject matter expertise for the business system and underlying database(s) that are sources for the data warehouse.
- Develop and Test - The data steward provides subject matter expertise when identifying and diagnosing data quality issues during the development and test phases.
- Maintain - During ongoing maintenance, the data steward is responsible for identifying data exceptions and correcting them at their source.

Note that data stewards are not a primary audience for this toolkit; the focus, instead, is on the tools and frameworks that are developed in support of data stewardship. As noted earlier, our goal is to provide best practices and guidance for:
- Data architects providing oversight and architecture design for the data warehouse
- Database and integration developers implementing the data warehouse
- DBAs responsible for managing and maintaining the data warehouse

Data Warehouse Projects

Most data warehouse projects are responsible for delivering the code and data models that implement an area within the data warehouse. Other projects are foundational and deliver the frameworks used by other projects. Figure 1-8 shows the primary deliverables for each phase of the project and the role primarily responsible for the delivery.


Figure 1-8: Roles and primary deliverables

Remember that the Define phase is out of scope for this document. The key project roles and deliverables for the other data warehouse phases are:
- Data architect - Responsible for the platform and database architectures
- Database developer - Responsible for data models, supporting data objects, SQL queries, and reports
- Integration developer - Responsible for integration code and frameworks

Managing a Data Warehouse

Most of a data warehouse's costs are related to ongoing management and maintenance. Figure 1-9 shows the key roles and responsibilities for these efforts.

Figure 1-9: Ongoing maintenance roles and responsibilities

Ongoing roles responsible for continuing data warehouse management and maintenance are:
- DBAs, who manage and monitor the data warehouse
- Data stewards, who monitor data quality and repair data exceptions
- IT operations staff, who manage and monitor platform components supporting the data warehouse


Document Roadmap
Building on this overview of the data warehouse life cycle and key roles, we are ready to dig into the patterns and best practices for the development and test phases of a data warehouse project and the ongoing maintenance of a data warehouse on Microsoft's SQL Server database platform. Let's look at what you will find in the remaining four chapters of this toolkit.

Chapter 2 - Data Architecture

Targeted at the data architects responsible for the database and platform architecture of a SQL Server data warehouse and the database developers responsible for the data models, Chapter 2 covers the following major topics:
- Introduction and overview of roles, responsibilities, and concepts
- Database architecture
- Master Data and Master Data Management (MDM)
- Platform architecture
- SQL Server database considerations
- Data modeling
- Conclusion and Resources

Chapter 3 - Data Integration

The primary audience for Chapter 3 includes the developers responsible for all data integration code and frameworks as well as the data architects who provide oversight of frameworks, patterns, and best practices. It addresses these essential topics:
- Introduction
- Data integration concepts and patterns
- Data integration overview
- ETL frameworks
- Data quality
- Data integration best practices
- SQL Server Integration Services best practices
- Conclusion and Resources

Chapter 4 - Operations Management and Security

Directed at the DBAs responsible for ongoing data warehouse maintenance, monitoring, and troubleshooting, Chapter 4 covers best practices and guidance in the following areas:
- Introduction and context
- DBA data warehouse essentials
- Working with a data warehouse database server
- Managing databases
- Database server security
- Managing database change
- Business continuity
- Maintenance
- Monitoring tools
- Conclusion and Resources


Chapter 5 - Querying, Performance Monitoring, and Tuning

The last chapter in the toolkit is targeted at database developers responsible for writing efficient queries and optimizing existing queries against the data warehouse. It's organized into the following sections:
- Introduction and query optimization overview
- Querying
- Monitoring
- Performance tuning
- Best practices summary
- Conclusion and Resources

Note that product-specific best practices and guidance within Chapters 2 through 5 are for the symmetric multi-processing (SMP) versions of SQL Server 2008 R2; specific guidance for SQL Server 2008 R2 Parallel Data Warehouse (PDW) is out of scope for the initial release of this document.

Getting Started

Each of these chapters contains not only concepts, but also patterns and best practices that data warehouse practitioners can use to successfully implement SQL Server data warehouses within their organizations. These patterns and best practices have been successfully used by the content authors in client SQL Server data warehouse engagements over the last ten to fifteen years. In addition, there are links to web content that expands on the material within each chapter. The reader can choose to read all four chapters or to go directly to the chapter most closely aligned with the reader's roles and responsibilities within the data warehouse effort.


Chapter 2: Data Architecture


By Microsoft Corporation

Acknowledgements:
Contributing writers from Solid Quality Mentors: Larry Barnes, Bemir Mehmedbasic
Technical reviewers from Microsoft: Eric Kraemer, Ross LoForte, Ted Tasker, Benjamin Wright-Jones
Contributing editors from Solid Quality Mentors: Kathy Blomstrom

Published:
Applies to: SQL Server 2008 R2


Chapter 2: Data Architecture
  Introduction
    Chapter Focus
    Roles and Responsibilities
    Requirements
    Challenges
    Data Warehouse Maturity Model Overview
    Data Quality and the Role of Governance and Stewardship
  Data Warehouse Concepts
    Data Warehouse Components
    Business Intelligence = Data Warehousing?
    Data Warehouse Architectures
    Data Models: Normalized to Denormalized
  Database Architecture
    One Data Area: Full Loads
    One Data Area: Incremental Loads
    Adding a Production Data Area
    The Consumption Data Area
    The Data in Data Area
    Exception and Logging Data Areas
    Archiving
    Metadata
    Operational Data Stores
    Consumer Interfaces
  Master Data and Master Data Management
    What Is Master Data?
    What Is Master Data Management?
    Where Do Data Warehousing and MDM Overlap?
  Platform Architecture
    Data Warehouse Server
    Server Virtualization
    SQL Server Fast Track Data Warehouse
    Data Warehouse Appliances
    Which Server Option Should You Choose?
  Database Architecture
    Databases, Schemas, and Filegroups
    Considerations for Physical Database Design
  Data Modeling
    Data Modeling Overview
    Conceptual Model
    Logical Model
    Physical Model
    Data Modeling - Column Types
    Keys
    Dimensions
    Fact Tables
    Reference Data
    Hierarchies
    Bridge Tables
    Nulls and Missing Values
    Referential Integrity
    Clustered vs. Heap
    SQL Server Data Type Considerations
    Very Large Data Sets
  Conclusion and Resources
    Resources


Introduction
Data architecture is an umbrella term for the standards, metadata, architectures, and data models used to ensure that an organization's data warehouse meets the strategic decision-making needs of its business users. At its core, a data warehouse consists of a collection of databases organized into production and consumption areas, as shown in Figure 2-1.

Figure 2-1: Data warehouse organized into production and consumption areas

Each database contains tables and supporting objects. These tables are populated with large amounts of data. Data integration processes are responsible for moving and transforming the data as it flows from sources to the consumption area. This simple concept is complicated by the following factors:
- Scope - Multiple, cross-organizational subject areas
- Scale - Very large volumes of time-variant data
- Quality - Cleansing, integrating, and conforming diverse data from multiple sources

There is a lot of literature available on data warehouse data architecture, but the two most visible data warehouse authors in the industry focus on different aspects of data architecture:
- Bill Inmon's Corporate Information Factory (CIF) focuses on database architecture, a top-down approach.
- Ralph Kimball's Enterprise Data Bus focuses on data modeling, a bottom-up approach.

We'll look briefly at these different approaches in the next section. However, the objective of this chapter is to distill the available information into a set of concepts and patterns and then present tangible best practices for data warehouses implemented on the Microsoft SQL Server database platform. The audience for this chapter is the members of the data warehouse team responsible for the data architecture, database architecture, data models, and overall quality of the data warehouse.


Chapter Focus

The data architecture team has responsibilities for oversight and specific deliverables throughout the data warehouse development life cycle, shown in Figure 2-2.


Figure 2-2: Focus of this chapter

This chapter focuses on the deliverables that the data architecture team produces as part of the development phase or in support of development, namely architectures and data models. Note that physical best practices and guidance in this chapter are for the symmetric multiprocessing (SMP) versions of SQL Server 2008 R2; guidance for SQL Server 2008 R2 Parallel Data Warehouse (PDW) is out of scope for the initial release of this chapter and document. Figure 2-3 shows the data architecture deliverables and the inputs into these deliverables.


Figure 2-3: Data architecture deliverables

Data architecture team deliverables include:
- Technical metadata and standards - The team provides oversight and contributes to all database development standards and technical metadata used within the data warehouse.
- Data models - The team provides oversight and understanding of all data models within the data warehouse and acts as subject matter expert for data models within data governance and other cross-organizational efforts.
- Database architecture - The team has primary responsibility for the data warehouse database architecture.
- Platform architecture - The team contributes to the product selection and underlying hardware and software platform that hosts the data warehouse.

One primary driver behind these data architecture deliverables is the maturity level of an organization's data warehouse. This is briefly covered in the next section.

Roles and Responsibilities

The data architect, data developer, and data steward each play key roles within data architecture and are responsible for working with business analysts and the extended team to translate business requirements into technical requirements. Figure 2-4 shows these roles along with their responsibilities.


Figure 2-4: Roles and responsibilities on the data architecture team

Roles and responsibilities:
- The data architect is a member of the data governance team and is responsible for the data warehouse architecture, metadata, overall quality of the data warehouse solution, and in some cases, the initial data model.
- The database developer is responsible for the development of data models and other database objects within the data warehouse and contributes to data warehouse metadata.
- The data steward, also a member of the data governance team, contributes to business and technical metadata and is responsible for the quality of data within the data warehouse.

This chapter's primary audience is data architects and database developers. Data stewardship is a key role within a data warehouse project and will be covered when it intersects with the core data architecture team. However, data stewards and their day-to-day activities are not a focus of Chapter 2.

Requirements

Data architecture for data warehouses should be driven first by business need and should also conform to organizational and technology needs. As Figure 2-5 illustrates, as enterprise data warehouse systems are being developed, data architecture should be implemented to:
- Support the business objectives
- Enable information management (data flows, validations, consumption)
- Productize data (i.e., turn data into an asset) for competitive advantage
- Produce a single version of the truth across time
- Achieve high performance


Figure 2-5: Requirements that the data architecture needs to support

Support Business Objectives
Ideally, business objectives are provided to teams developing data architecture. When business objectives are not clearly defined or don't exist, data architecture team members need to be proactive in acquiring the missing information from business stakeholders and subject matter experts.

Information Management - Data Consumption and Integration
Business objectives are further broken down into data requirements that define the databases and data models used by business consumers. The data architecture team works closely with the data integration team to ensure that data requirements can be successfully populated by data integration processes. These data mappings, business rules, and transformations are often a joint effort between business analysts and data integration developers.

Productize Data
As data warehousing systems and business intelligence (BI) within the organization mature over time, organizations begin to realize the potential for productizing the data, meaning transforming data from a raw asset into a measure of business results that ultimately provides insight into the business.

Single Version of the Truth
There is no alternative to one version of the truth in successful data architecture. The data warehouse team shouldn't underestimate the difficulty involved in achieving one version of the truth. The team must overcome technical, cultural, political, and technological obstacles to achieve what is probably the most challenging requirement when building a data warehouse.

Achieve High Performance
Achieving high performance for data retrieval and manipulation within the data warehouse is a requirement of data architecture. Later in this chapter, we discuss the different logical and physical data modeling techniques and best practices that directly relate to ensuring the most efficient performance for the very large databases within the data warehouse consumption area.


However, it's often the performance of the data integration processes that drives the data models and architectures within the production area of very large data warehouses (VLDWs). These dual requirements for high-performance loads and high-performance queries result in different databases and data models for the production and consumption areas.

Secure Data
The data architecture must ensure that data is secured from individuals who do not have authorization to access it. Often, users are allowed to view only a subset of the data, and the data architecture is responsible for ensuring that the underlying constructs are in place to meet this requirement.

Challenges

Providing high-performance access to one version of the truth that meets business objectives and requirements over time is not a simple task. It requires both a solid initial implementation and the ability to enhance and extend the data warehouse over a period of many years. Figure 2-6 presents some of the challenges for the data architecture team.


Figure 2-6: Challenges for the data warehouse team

These challenges include:
- The lack of complete business requirements or conflicting business requirements from business stakeholders
- Scope, or the cross-organizational communication toward a common goal and the need for common definitions and models
- Understanding source systems, their limitations (including data quality and antiquated systems), and the impact on the downstream data warehouse data stores
- Agility, or the ability to meet the needs of the business in a timely manner
- Data volumes and the need for high performance for both queries and data loads


Business Challenges
In a perfect world, business objectives are clearly defined and well understood by stakeholders and users. In reality, however, business needs are often not clearly defined or are not broken down into a clear set of requirements. Frequently, this is due to the difference between business descriptions of objectives and the technical interpretations of these objectives.

Technical Challenges
In addition, there will always be organizational change during the lifetime of the data warehouse, which requires that the data architecture be agile and able to respond to changes to the organization and environment. The requirements and challenges depend upon the scope and scale of the data warehouse, which can often be mapped to its level of maturity. The next section provides a brief overview of a data warehouse maturity model.

Data Warehouse Maturity Model Overview

Different organizations are at different levels of maturity with respect to their data warehouse initiatives. Several challenges, including scope and data volumes, are a function of a data warehouse's maturity. Figure 2-7, from a 2004 Information Management article, "Gauge Your Data Warehouse Maturity," by Wayne Eckerson, provides an overview of a maturity model for data warehouses.

Figure 2-7: Data warehouse maturity model

The following section contains excerpts from this article:

Data marts are defined as a shared, analytic structure that generally supports a single application area, business process, or department. These "independent" data marts do a great job of supporting local needs; however, their data can't be aggregated to support cross-departmental analysis. After building their third data mart, most departments recognize the need to standardize definitions, rules, and dimensions to avoid an integration nightmare down the road. Standardizing data marts can be done in a centralized or decentralized fashion. The most common strategy is to create a central data warehouse with logical dependent data marts. This type of data warehouse is commonly referred to as a hub-and-spoke data warehouse.

Although a data warehouse delivers many new benefits, it doesn't solve the problem of analytic silos. Most organizations today have multiple data warehouses acquired through internal development, mergers, or acquisitions. Divisional data warehouses contain overlapping and inconsistent data, creating barriers to the free flow of information within and between business groups and the processes they manage.

In the adult stage, organizations make a firm commitment to achieve a single version of the truth across the organization. Executives view data as a corporate asset that is as valuable as people, equipment, and cash. They anoint one data warehouse as the system of record or build a new enterprise data warehouse (EDW) from scratch. This EDW serves as an integration machine that continuously consolidates all other analytic structures into itself.

In the adult stage, the EDW serves as a strategic enterprise resource for integrating data and supporting mission-critical applications that drive the business. To manage this resource, executives establish a strong stewardship program. Executives assign business people to own critical data elements and appoint committees at all levels to guide the development and expansion of the EDW resource.

In summary, the need for a data warehouse arises from a desire by organizations to provide a single version of the truth. The complexity of this effort is magnified when done at the enterprise level (i.e., for an EDW). As stated above, stewardship programs are central to a successful data warehouse. The last section in this introduction briefly discusses stewardship, data governance, and how both are essential to delivering a high-quality data warehouse.

Data Quality and the Role of Governance and Stewardship

Acquiring and maintaining business trust is a foundational objective of a data warehouse that requires strong communication between the data warehouse team and the business users as well as the ability to provide accessible, high-quality data. Data governance and data stewardship are ongoing processes in support of maximizing business trust in a data warehouse. The Data Governance Institute defines data governance as:


"A system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods."

This definition can be distilled down to the following statement: Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise. The data governance conceptual process flows that intersect with data architecture include:
1. Identifying key data assets within the organization.
2. Ensuring that data assets have common definitions within the organization. Once common definitions are defined, business consumers can share this data, as opposed to re-creating it within each solution.
3. Ensuring that quality data is loaded into the data warehouse.

The set of processes to ensure maximum data integrity is called data stewardship. Data stewardship focuses on the management of data assets to improve reusability, accessibility, and quality. The team members implementing and enforcing these objectives are data stewards. These individuals should have a thorough understanding of business processes, data flows, and data sources. Additionally, data stewards are liaisons between data warehouse architects and developers and the business community. Typically, one data steward is responsible for one subject area in a data warehouse.

Maintaining data warehouse data quality is an ongoing process. Explicitly creating direct ownership and accountability for quality data for data warehouse sources eliminates a shadow role that exists in many data warehouses. This shadow role often falls to business analysts and database developers once the business starts questioning results within the data warehouse. Failure to allocate resources to data stewardship activities can result in data warehouse team member burnout and negative attrition. Data stewards are responsible for establishing and maintaining:
- Business naming standards
- Entity definitions
- Attribute definitions
- Business rules
- Base and calculated measures
- Data quality analysis
- Linkages to and understanding of data sources
- Data security specifications
- Data retention criteria


This chapter's focus is less on the process and more on deliverables related to data architecture. The following links provide more information about data governance and data stewardship:
- Data Governance Institute
- Data Governance & Stewardship Community of Practice
- Kimball University: Data Stewardship 101: First Step to Quality and Consistency


Data Warehouse Concepts


This chapter's focus is on data warehousing, not business intelligence (BI). Figure 2-8 illustrates the different data responsibilities and what is considered a data warehouse component or process.

Figure 2-8: Data warehousing vs. business intelligence

Data responsibilities are segmented as follows:
- OLTP source systems and external data feeds are responsible for data creation
- Data warehousing focuses on data production
- BI focuses on data consumption

Data Warehouse Components

The data warehouse comprises the following components:
- The consumption area serves up information to downstream consumers through SQL queries.
- The production area is where source data is transformed, normalized, and consolidated. Note that a "Data in" area is common within data warehouse implementations and is typically housed within the production area.
- The metadata area is populated with data about data, that is, business, technical, and process metadata providing detail behind the actual data, data models, and data integration processes.
- The platform is the system or systems where the data warehouse resides.


Data integration processes are responsible for the movement of data from sources to destination databases. See Chapter 3 - Data Integration for details about this topic. The data warehouse is responsible for optimizing the query performance within the data consumption area. BI components are beyond the scope of this chapter but include:
- Data presentation, including decision-support systems, reports, and analytics
- Data delivery channels
- Downstream data stores, including data marts and semantic data models (e.g., OLAP)

Business Intelligence = Data Warehousing?

There are many examples within the SQL Server community where the terms business intelligence and data warehousing are used interchangeably. In reality, these are two separate disciplines. This is especially true for enterprise data warehouses due to an EDW's scope, complexity, and large volumes of data. As seen in the data warehouse maturity model, as a data warehouse matures, the data warehouse team spends more and more resources on data integration. This shifts the focus from data consumption to data production. Simply put, data warehousing and BI differ because:
- The primary focus for a data warehouse is the production of data.
- The primary focus for BI is the consumption, presentation, and delivery of the data produced by the data warehouse.

One example of where BI is confused with data warehousing within the SQL Server community is the AdventureWorks samples available at the Microsoft SQL Server Community Projects & Samples site. These samples include the following databases:
- A sample OLTP database (AdventureWorks2008R2)
- A sample data warehouse database (AdventureWorksDW2008R2)
- A sample SQL Server Analysis Services (SSAS) database, or cube

The AdventureWorks sample data warehouse link provides more information about the AdventureWorks data warehouse and supporting scenarios in which data warehouse, data mining, and Online Analytical Processing (OLAP) are relevant. However, note that these are BI scenarios because the focus is on consumption, not production. There are no samples for the data warehouse production scenario; instead, the data warehouse is populated directly from Comma Separated Values (CSV) files. In summary, BI focuses more on data consumption and should not be equated to data warehousing, which concentrates more and more on data production as it matures, especially when you are working with very large data volumes.


Data Warehouse Architectures

As we noted earlier, much of the data warehouse literature is grounded in one of two different implementation approaches:
- The top-down approach is often used when describing Bill Inmon's Corporate Information Factory reference architecture.
- The bottom-up approach is often used when describing Ralph Kimball's dimensional modeling and Enterprise Data Bus strategy.


The top-down approach historically has led to a centralized EDW, and the bottom-up approach has led to federated data marts. This section reviews these two approaches and presents a third approach used for many data warehouses. The centralized EDW, shown in Figure 2-9, was the first data warehouse database architecture. This architecture creates a central repository for all of an organization's integrated data.

Figure 2-9: Centralized EDW

As stated earlier, the sheer scope of implementing an EDW often results in analysis paralysis, that is, inordinate amounts of time spent gathering requirements and designing subject area data models. These extended development cycles increase the risk of user requirement changes, user requirement misinterpretation, and ultimately, a failed implementation. In 1996, Ralph Kimball published The Data Warehouse Toolkit. This book introduced dimensional data modeling to a large population and contained examples of dimensional data models for a variety of vertical scenarios. Kimball was also active on the lecture circuit and started to support federated data marts over a centralized EDW. Figure 2-10 illustrates the federated data mart approach.


Figure 2-10: Federated data marts

In this architecture, a separate data mart is implemented for each business process and subject area. This strategy allows for shorter development cycles, lowers the risk of changing and misunderstanding user requirements, and delivers partial solutions faster. However, a drawback of federated data marts is that they often result in multiple versions of the truth, even when the ideal approach for federated data marts is to have a common data model. The data warehouse space started seeing the implementation of both centralized data warehouses, which often presented aggregated data to business users in support of faster SQL queries, and subject-oriented data marts created and populated from the central data warehouse. This resulted in a third approach, the hub-and-spoke architecture, shown in Figure 2-11. This model incorporates the benefits of a centralized data warehouse database and federated data marts:
- The central data warehouse database provides business consumers with a single version of the truth.
- Separate data marts provide business consumers with better performance.


Figure 2-11: Hub-and-spoke architecture

The hub-and-spoke approach has downstream data marts that are fed from a common data warehouse database. Data marts return consistent results because they are all populated from the data warehouse database, which contains a single version of the truth. Performance is improved for business unit analysis because the marts contain less information than the data warehouse database and also have a smaller user community. Note that scope issues still exist with the hub-and-spoke configuration: the data warehouse database still contains multiple subject areas and many consolidated sources. Microsoft's Parallel Data Warehouse (PDW) is a distributed architecture that supports the hub-and-spoke approach. PDW provides a publish model that supports the parallel loading of data mart spokes from the data warehouse hub. This publish model reduces data integration processes and the resources required to maintain these processes. You can read more about PDW in the MSDN article "Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation." One of the primary deliverables for the data warehouse team is the data model or models. The next section provides an overview of this topic.

Data Models: Normalized to Denormalized

One characteristic of many data warehouses is that the data is transformed from a normalized to denormalized form as it makes its way from sources to the production and consumption areas, as Figure 2-12 shows.


Figure 2-12: Normalized to denormalized data model

Most source data originates from transactional line of business (LOB) systems. These LOB systems' data models are highly normalized in support of Online Transaction Processing (OLTP). Other sources, mostly file-based, are typically in a denormalized format and come from a variety of origins (e.g., external vendors and partners, exports from internal systems). In some cases, these files are created and maintained by business and technical users within applications such as Microsoft Excel and Access. The production area is where source data is consolidated and rationalized. The subject areas within the production area are typically modeled in a normalized format, but less so than the source LOB systems. This data is then transformed, denormalized, and aggregated when it flows to the consumption area. Dimensional data models are an example of a denormalized structure optimized for data consumption; Figure 2-13 shows one example, the AdventureWorksDW2008R2 FactResellerSales snowflake data model.

Figure 2-13: Snowflake dimension example: FactResellerSales

Note that this sample database is modeled to demonstrate the capabilities of SSAS. A normalized data model would not have the Parent-Child and Snowflake dimensions highlighted above.

Normalized vs. Denormalized
The key differences between normalized and denormalized data structures are as follows:
- OLTP systems are where data is created. Normalized data structures support a high-volume transactional workload, which consists of inserts, updates, deletes, and selects of individual records or small numbers of records.
- In a denormalized approach, data structures are optimized for data consumption (i.e., workloads that process high volumes of records).

Dimensional modeling is one common example of a denormalized approach, where information is grouped into dimensions and facts, as we see in Figure 2-13. Facts contain surrogate key pointers to dimensions along with mostly numeric measures. Dimensions contain flattened or snowflaked hierarchies and relationships for entities and relevant attributes.

Deciding on the amount of denormalization within the consumption area involves balancing consumer requirements with source system models and load complexity, including the following considerations:

- Whether to use a star schema or snowflake schema implementation. A star schema is a fully denormalized implementation. See the Kimball Group's Web site for more information about star schemas, snowflake schemas, and dimensional modeling.
- How much to denormalize data for the most efficient SQL access. More denormalization, however, can increase load complexity. When choosing the appropriate level of denormalization and the complexity of the data integration processes related to it, consider the data warehouse requirements for analysis and data latency as well as overall delivery milestones and timelines for the data warehouse initiative.
- Whether to use database views to present a denormalized view of normalized data (see the sketch following this list).
  o This implementation pattern allows for introducing denormalized data structures without having to materialize tables. With this approach, data integration for existing tables doesn't need to change.
  o This approach is sometimes used as a feed for a downstream semantic layer such as SSAS.
  o Views will have a negative impact on performance unless they are materialized.
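For illustration, here is a minimal sketch of the view-based option, assuming hypothetical normalized tables (Production.Vendor, Production.VendorAddress, Production.VendorCategory) in the production area; actual table and column names will depend on your data model:

    -- Present normalized production tables as a single denormalized structure
    CREATE VIEW Consumption.vDimVendor
    AS
    SELECT v.VendorID,
           v.VendorName,
           a.City,
           a.StateProvince,
           c.CategoryName AS VendorCategory
    FROM   Production.Vendor         AS v
    JOIN   Production.VendorAddress  AS a ON a.VendorID   = v.VendorID
    JOIN   Production.VendorCategory AS c ON c.CategoryID = v.CategoryID;

Because the view is resolved at query time, its performance depends entirely on the indexes of the underlying normalized tables unless it is materialized as an indexed view.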

Figure 2-14 shows different table structures for a hypothetical entity containing information about vendors for an organization. This illustrates how a data model changes as it flows from the source to the production area and then to the consumption area.

Figure 2-14: Example source, production, and consumption area table structures

Production area data store characteristics are as follows:
- Tables are somewhat denormalized, but the overall data model is still normalized.
- Common naming conventions are applied to tables, attributes, and other database objects.
- Business keys are identified and enforced, as are relationships between entities.
- Business rules for data types for various types of string and numeric attributes are introduced and enforced.
All of these are building blocks that help provide consistency and manageability across a data warehouse.

A consumption data store has these characteristics (contrasted in the sketch below):
- Tables are denormalized. These structures are designed for the most effective data retrieval.
- Natural keys in dimensions are referenced via surrogate keys.
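To make the contrast concrete, here is a minimal sketch under assumed, hypothetical table and column names, using the vendor entity from Figure 2-14 only as an illustration:

    -- Production area: lightly normalized, business (natural) key enforced, lineage carried
    CREATE TABLE Production.Vendor
    (
        VendorCode      varchar(20)   NOT NULL PRIMARY KEY,   -- business key
        VendorName      nvarchar(100) NOT NULL,
        VendorTypeCode  char(2)       NOT NULL
            REFERENCES Production.VendorType (VendorTypeCode),
        SourceSystemID  int           NOT NULL,               -- lineage
        LoadDate        datetime      NOT NULL
    );

    -- Consumption area: denormalized dimension keyed by a surrogate
    CREATE TABLE dbo.DimVendor
    (
        VendorKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- surrogate key
        VendorCode  varchar(20)       NOT NULL,               -- natural key carried for traceability
        VendorName  nvarchar(100)     NOT NULL,
        VendorType  nvarchar(50)      NOT NULL                -- reference description flattened in
    );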

Natural and surrogate keys are discussed in more detail in the data modeling section later in this chapter.

In summary, choosing an appropriate data warehouse architecture and data model approach is central to the success of a data warehouse. The decisions about which data warehouse architecture and which data models to use, however, are independent of one another; a federated data mart can have a more normalized data model, while a centralized EDW can have both normalized and denormalized data models, for example. Once you've selected the architecture for your data warehouse, the next deliverable is the database architecture, which we cover in the following section.

Database Architecture
This section expands on the data warehouse components and provides an overview of the different data areas within the data warehouse, as shown in Figure 2-15. Note that each data area can include one or more physical databases.

Figure 2-15: Data warehouse data areas

Data warehouse data areas include:
- Production area: Production databases are where data is cleansed, lineage is introduced, business rules are applied, and versioned data is introduced.
- Data in area: Data in databases contain a mirror copy of a subset of source system data.
- Consumption area: The consumption area is accessed by business consumers and can include a data warehouse, multiple data marts, or a combination of the two.
- Exception area: The exception data area is where records that fail data quality checks and business rules are held.

- Metadata: Metadata describes the data itself, in both business and technical terms. It includes definitions, rules, and origins of all data in the data warehouse. In addition, process metadata helps define configurations and implement security within the data warehouse.
- Logging: Logging databases are important for recording the day-to-day activity within the data warehouse and typically include logging activity from data integration processes and, optionally, consumption data area access.
- Archive: Archived databases hold aged data removed from other data areas to improve performance.

These data areas are the focus of the remainder of this section, which is organized by data warehouse maturity level, starting with basic configurations seen in departmental data marts and moving to more advanced configurations seen in EDWs.

One Data Area: Full Loads
The first implementation for many data warehouses and data marts is a full load, shown in Figure 2-16.

Figure 2-16: Full load

This implementation truncates and reloads the data warehouse or data mart directly from one source on a scheduled basis (i.e., daily, weekly, or monthly), as sketched below. This is the simplest implementation and requires the least amount of data integration code. The data model is typically a denormalized dimensional model. However, organizations soon realize that full loads have the following issues:
- No history, or no point-in-time history. Many OLTP systems keep only current data, which precludes historical reporting. Even if the OLTP systems store history, these historical records are often modified over time. This makes it impossible to capture the state of data at a particular point in time.
- Extended processing times. As data volumes grow, it takes longer and longer to drop indexes, truncate, reload, and reindex all data. This becomes an issue when processing extends into business-usage hours.
- Data quality issues. Data stewards have little to no visibility into record change histories, which makes it difficult to track data issues back to the source data and process.
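The following is a minimal sketch of the truncate-and-reload pattern, assuming hypothetical fact, index, and source names and omitting the transformation and dimension-lookup steps a real load would perform:

    -- Disable nonclustered indexes to speed up the bulk insert
    ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales DISABLE;

    -- Remove the results of the previous load
    TRUNCATE TABLE dbo.FactSales;

    -- Reload everything from the single source system
    INSERT INTO dbo.FactSales (OrderDateKey, ProductKey, SalesAmount, OrderQuantity)
    SELECT s.OrderDateKey, s.ProductKey, s.SalesAmount, s.OrderQuantity
    FROM   SourceDB.dbo.SalesOrderDetail AS s;

    -- Rebuild the disabled indexes after the load completes
    ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales REBUILD;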

These problems with full loads lead most organizations to an incremental load approach.

One Data Area: Incremental Loads
The first step for incremental loads is to create a data model that supports history. The dimensional data model is extended to support record versions; when implemented within dimensions, this is referred to as a Slowly Changing Dimension Type II (SCD II), and a simple load sketch follows the list below. With incremental loads, organizations can report on history, but this approach has the following issues:
- Because there is one data area for both consumption and production, the database is offline to business users while data integration processing is active.
- Merging multiple business entities into one dimension makes it difficult to enforce a single version of the truth, and tracking down and resolving data quality issues is a time-consuming process.
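Here is a minimal SCD II sketch, assuming a hypothetical staging table and a DimCustomer dimension that carries RowStartDate, RowEndDate, and IsCurrent columns; NULL handling and most attributes are omitted for brevity:

    -- Expire the current version of any customer whose tracked attributes changed
    UPDATE d
    SET    d.RowEndDate = GETDATE(),
           d.IsCurrent  = 0
    FROM   dbo.DimCustomer  AS d
    JOIN   Staging.Customer AS s
           ON s.CustomerBusinessKey = d.CustomerBusinessKey
    WHERE  d.IsCurrent = 1
      AND (d.MaritalStatus <> s.MaritalStatus OR d.Education <> s.Education);

    -- Insert a new current version for changed customers and for brand-new customers
    INSERT INTO dbo.DimCustomer
           (CustomerBusinessKey, MaritalStatus, Education, RowStartDate, RowEndDate, IsCurrent)
    SELECT s.CustomerBusinessKey, s.MaritalStatus, s.Education, GETDATE(), NULL, 1
    FROM   Staging.Customer AS s
    LEFT JOIN dbo.DimCustomer AS d
           ON d.CustomerBusinessKey = s.CustomerBusinessKey
          AND d.IsCurrent = 1
    WHERE  d.CustomerBusinessKey IS NULL;   -- no current row: the customer is new or was just expired above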

The AdventureWorksDW2008R2 DimCustomer dimension provides a simple example of how one dimension is populated from many source tables. Figure 2-17 shows the dimensional table along with its sources from the AdventureWorks2008R2 OLTP database. Note that 13 tables were identified as sources for the DimCustomer dimension.

Figure 2-17: DimCustomer dimension and its sources

Note that the AdventureWorks2008R2 OLTP data model presents an overly simplistic source data model. For example, the Person table contains two XML data columns (contact information and demographics) which contain the columns used to populate DimCustomer columns, including education, number of cars owned, total children, and number of children at home. This information is typically obtained from a loan application, a credit card application, and/or a survey, which are often separate applications with separate data models.

Here are additional issues encountered when the dimensional model is used to store and track history:
- Common entities (e.g., Address, Person) can't be used to populate other data warehouse tables.
- Consolidating customer information from multiple source systems directly into the denormalized data warehouse table would make it more difficult for data stewards to track changes back to the respective source systems they're responsible for.

Hopefully this simple example illustrates the level of complexity that exists within the data, and why it's beneficial to have a separate production data area.

Adding a Production Data Area
Figure 2-18 shows separate data areas for production and consumption. The production area is where information loaded from sources is normalized and where business rules are applied, business keys are identified and enforced, lineage is introduced, and data is prepared for loading into downstream data layers.

Figure 2-18: Separate production data area

Production areas are typically seen in organizations that have moved from data marts to data warehouses within the data warehouse maturity model. At this point, the primary focus of the data warehouse shifts from providing results to consumers to integrating data from multiple sources. The production area is the data integration working area where a single version of the truth is created; it serves as the basis for data consistency and quality:
- The data model is typically normalized, although less so than in the source systems.
- This layer enforces relationships between entities using natural keys.
- The production area is referenced by data stewards and analysts when verifying and reconciling outputs with inputs.
- Production databases feed all consumption databases.

What About Very Large Tables?
Source system transaction and fact tables are the largest tables in a data warehouse and often have multi-billion record counts. Given these volumes, it's tempting for the data warehouse team to load such a table directly into the consumption area and not keep a master copy in the production area. Although this strategy reduces the size of the production database, it also means that the production area no longer supports a single version of the truth. Given this issue, it's generally recommended that the data warehouse team not use this approach. See Chapter 3 for more information about loading patterns for very large tables using versioned inserts.

The Consumption Data Area
After data is loaded into the production area, the next step is to transform the data into a format more conducive to consumption, such as a dimensional data model. In addition, data will often be summarized and aggregated when there are large volumes of data. Figure 2-19 shows the options for the consumption area: one data warehouse, multiple data marts, or a combination of both. This decision is based on many factors, including organizational structure, geographic location, security, data volumes, and platform architecture. For the purpose of this discussion, the difference between a data mart and a data warehouse within the consumption area is scope:
- A data mart contains one subject area.
- A data warehouse has larger volumes of data than a data mart and contains multiple subject areas.

Note that there are strong views in the industry surrounding data marts and data warehouses. You can read Bill Inmon's perspective in the article Data Mart Does Not Equal Data Warehouse.

Figure 2-19: Consumption area options

As we noted, the consumption area options are as follows:

- One data warehouse with multiple subject areas provides the greatest availability to data; users can drill down and drill across subject areas within the data warehouse.
- Multiple data marts typically provide better performance because the data volumes are smaller than those in a data warehouse.
- Having both a data warehouse and multiple data marts allows the business consumer to choose between completeness and performance.

Often, the option chosen by the data warehouse team depends on data volumes, as shown in Figure 2-20.

Figure 2-20: Data volume impact on the consumption area

Note that the question of when the consumption area can no longer support one data warehouse depends not only on data volumes but also on the underlying platform architecture.

The Data in Data Area
The next decision is whether to have a Data in area. Figure 2-21 shows how the Data in area is populated with source data as the first step in the data integration process. The production area is then loaded from the Data in area.

Figure 2-21: Data in area

The decision about whether to have a Data in data store is based on business rules, available storage, data integration patterns, and whether your system can accommodate increased processing times. Data in databases are probably not needed if data sources are fully available and there is an archive strategy in place for sources. However, here are some reasons to have a Data in area:

- Source data is preserved for auditing and reload.
- Preserving flat file sources within a database allows database backup and restore to be used instead of a separate file system backup.
- The Data in area facilitates very large data warehouse implementations:
  o Landing source data for very large data warehouses in extract databases lets you manage the amount of data being loaded by segmenting data into batches.
  o Batches can be processed in parallel, reducing data latency in the data warehouse.
  o Aggregations and calculations can be applied at the batch level, speeding up loads that require an intermediate area for applying aggregations and other calculations to source data.
- The Data in area supports entities residing in multiple source systems:
  o Entities residing in multiple source systems typically have different designs. For example, the Products entity within a large financial services conglomerate can exist in many different systems.
  o Introducing extract databases can help when the complete definition of a dimension is not available until data from all relevant entities has been made available or processed.
- Such an area helps handle data inconsistency and late-arriving facts and dimensions:
  o Issues related to data inconsistency and dirty data are preserved in the extract layer so that this information can be analyzed and corrected at the source level.
  o Late-arriving facts and dimensions can be hosted in the extract area for consolidation and loading into downstream data warehouse layers once they are complete.

Preserving the source data in an extract layer also helps in identifying common patterns in entities, which allows for more efficient design of the dimensions that are shared across multiple data sources. Chapter 3 covers source system extraction and data integration patterns in more detail. However, note that several SQL Server technologies can be used to populate the Data in area. These technologies include database mirroring, log shipping, and database replication, and SQL Server 2008 introduced a fourth option: Change Data Capture, sketched below. The following links provide details about these SQL Server options:
- Database Mirroring
- Log Shipping Overview
- SQL Server Replication
- Basics of Change Data Capture
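As a minimal sketch, Change Data Capture is enabled on the source database and then on each tracked table; the database and table names here are assumptions, and an extract process would subsequently query the generated change-table functions (cdc.fn_cdc_get_all_changes_<capture instance>):

    -- Enable Change Data Capture at the database level (requires sysadmin)
    USE SourceDB;
    GO
    EXEC sys.sp_cdc_enable_db;
    GO

    -- Track changes to one source table; captured changes land in a change table
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'SalesOrderHeader',
         @role_name     = NULL;   -- no gating database role for this example
    GO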

For more information about whether to use a Data in area, see the To Stage or not to Stage section of Chapter 2 in The Data Warehouse ETL Toolkit, by Ralph Kimball and Joe Caserta.

Exception and Logging Data Areas
The exception and logging data areas are both populated by data integration processes, as Figure 2-22 shows.

Figure 2-22: Exception and logging data areas

Data integration processes use business rules, data quality checks, and data lookups to identify and move data records into the exception area. Whether the entire data record or just its natural key is written to the exception area depends on the implementation. Data stewards use the exception data area to troubleshoot and correct data exceptions.

Data integration processes also populate the logging data area with information used to monitor and track the status of data loads. In addition, data integration processing errors are logged and used by data integration developers to troubleshoot the processes. See Chapter 3 on data integration for more information about these topics. The data warehouse may also log consumer query activity, which is often useful in determining usage patterns, optimizing queries, and developing user charge-back models.

Archiving
One way to optimize performance for any database is to reduce the size of the data being queried. Deleting data, however, conflicts with the data warehouse's objective of providing one version of the truth over time. When data is deleted or purged, it is lost, along with the ability to obtain certain historical perspectives. The alternative to deleting data is to archive it. Archiving data is the process of moving data from a primary data store to a secondary data store. Figure 2-23 contains an example of an archive data area within the data warehouse.

Figure 2-23: Data warehouse with an archive data area

Note that the secondary data store in the above example is a database, but it could also be a file or set of files. Archiving improves performance while supporting the preservation of data. Archiving is typically based on when the data was loaded, and the data archive schedule within a data warehouse is a function of:
- The data area that is being archived
- Data volumes over time
- How frequently business consumers query the data

The Data in data area will be archived or truncated frequently. Production data may require more frequent archiving for data warehouses with massive amounts of data. Examples of how vertical industry requirements affect consumption data retention include the following (a partition-based archiving sketch follows the list):
- Health care industry: Seven to 10 years of claims data is required for companies doing business in the US.
- Financial industry: Seven years of financial audit data is required for companies doing business in the US.
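When large tables are physically partitioned by a date key, archiving a load period can be a metadata-only operation. The sketch below assumes hypothetical tables dbo.FactSales and dbo.FactSales_Archive that are partitioned identically on the same date-based partition scheme, with partition 1 holding the oldest data:

    -- Move the oldest partition out of the active fact table and into the archive table.
    -- Because both tables share the partition scheme, the switch moves metadata, not rows.
    ALTER TABLE dbo.FactSales
    SWITCH PARTITION 1 TO dbo.FactSales_Archive PARTITION 1;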

In general, every very large data warehouse should have an archiving strategy. This strategy also needs to support repopulating archived tables should the need for this data arise; in these cases, the destination can be the same or a different table. The architecture team should base its archiving requirements on legal and industry standards as well as user access patterns. Note that the physical partitioning of tables by date simplifies the data archival processes.

Metadata
As we mentioned earlier, metadata is data that describes the data itself. It includes definitions, rules, and origins of all data in the data warehouse. Metadata is important because it gives context to the data in the data warehouse, and it is typically classified into business and technical metadata. Ralph Kimball adds another category: process metadata, which is a variant of technical metadata.
- Business metadata: Provides business definitions for database objects, including databases, tables, and columns. It also includes additional information about columns, including but not limited to business rules and field types.
- Technical metadata: Documents technical aspects of the data; the classic example of technical metadata is a data dictionary. Other examples include source-to-destination mappings used in data integration as well as results from data profiling and lineage.
- Process metadata: A type of technical metadata used to configure key components and processes within the data warehouse, including data integration and security.

The following are examples of process metadata.

Configurations
Data warehouses often include databases used for storing information on various configurations (a sketch follows the list), including:
- Data integration connection properties
- Environment variables and XML configuration files
- The size of ETL batches
- Data warehouse default values
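As a minimal sketch, and assuming a hypothetical meta schema already exists, such configuration values can be kept in a simple name/value table that data integration packages read at run time:

    -- Name/value configuration table read by data integration processes
    CREATE TABLE meta.Configuration
    (
        ConfigurationKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ConfigurationName  nvarchar(100)     NOT NULL UNIQUE,
        ConfigurationValue nvarchar(400)     NOT NULL,
        Description        nvarchar(400)     NULL
    );

    INSERT INTO meta.Configuration (ConfigurationName, ConfigurationValue, Description)
    VALUES (N'ETLBatchSize', N'50000', N'Maximum number of rows per data integration batch');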

Security
Security-related scenarios are addressed in databases residing in the data warehouse security area. Access to data in the data warehouse is defined by business rules. There are often requirements to secure data both vertically and horizontally. Client-level access restrictions are a common example of these requirements, where data needs to be secured by specific client IDs. Additionally, there are often provisions for securing data on a row-level basis.
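SQL Server 2008 R2 has no built-in row-level security feature, so one common pattern is to filter through a view joined to a user-to-client mapping table. The following is a sketch only, with hypothetical schema and table names:

    -- Mapping of logins to the client IDs they are allowed to see
    CREATE TABLE sec.UserClient
    (
        LoginName nvarchar(128) NOT NULL,
        ClientID  int           NOT NULL,
        PRIMARY KEY (LoginName, ClientID)
    );
    GO

    -- Consumers query the view instead of the base fact table
    CREATE VIEW dbo.vFactSalesSecured
    AS
    SELECT f.*
    FROM   dbo.FactSales  AS f
    JOIN   sec.UserClient AS uc
           ON uc.ClientID  = f.ClientID
          AND uc.LoginName = SUSER_SNAME();   -- login of the caller
    GO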

It is important to catalog business and technical metadata in database structures so that this information is not scattered across various documents, diagrams, and meeting notes. Once introduced into databases, metadata can be queried and analyzed. Reports can be developed to answer commonly asked questions about data transformation rules, data types, and column defaults. Using metadata also provides for more effective integration of data design tools and development tools; an entire software market is dedicated to metadata tools and metadata repositories.

This concludes our overview of data areas. Next, let's look at brief overviews of operational data stores and data warehouse consumers.

Operational Data Stores
Operational data stores (ODSs) are databases that support operational reporting outside of source systems. An ODS is a key data area within the Corporate Information Factory and is typically categorized into different classes, with the key variable being the delay between the live source data and the ODS data, as follows:
- Class I: One- to two-second delay; this short delta often requires asynchronous push data integration processes instead of the traditional pull method.
- Class II: Intraday; typically a two- to four-hour delay.
- Class III: One day; this can be part of the daily data integration processes.
- Class IV: Loaded directly from the data warehouse.

Note that Class I and Class II ODSs require special data integration processes (i.e., processes that run more frequently than the nightly data integration batch schedule). Also note that an ODS may be used as a Data in area, depending upon the implementation. You can read more about ODSs at the following links:
- The Operational Data Store
- Corporate Information Factory

Consumer Interfaces
Data integrated and stored in data warehouses is consumed by a variety of interfaces. Typical data consumers include external data feeds, queries, reports, OLAP structures, BI semantic layer tools, and suites of interrelated applications and services. In addition, data marts are consumers when the data warehouse architecture is a centralized EDW. This section covers the typical data consumer interfaces to consider in your data warehouse architecture. Note that there are many products within each of these categories; the products listed below are provided as examples within the Microsoft product suite for each category.

Queries
One of the most straightforward methods for consuming data from a data warehouse is using queries. These queries typically scan large numbers of records, are often compiled in stored procedures, and provide a consistent source of data for analysts and decision makers. Queries are also often combined with other information delivery vehicles, such as reports, and provide for uniform representation of information. Managing changes to queries and managing security are important aspects of using this delivery method.

Reports
SQL Server Reporting Services (SSRS) is one of the most commonly used tools to access data from data warehouses. This tool provides for enterprise-wide access to information in predefined forms as well as for ad hoc access to data.

SSRS, coupled with the powerful report-authoring environment provided by Microsoft Report Builder, frequently provides the primary data consumption methods within an organization. For more information about SSRS and Report Builder, see the following links:
- SQL Server 2008 Reporting Services Web site
- SQL Server 2008 R2 Books Online topic SQL Server Reporting Services
- TechNet article Getting Started with Report Builder 3.0

OLAP
OLAP takes data consumption from data warehouses to a higher level. One of the most advanced OLAP technologies available is SSAS, which gives organizations advanced enterprise-wide analytical capabilities, complementing the power of data contained in data warehouses. How well data warehouse systems are architected directly affects how efficiently you can implement SSAS. In other words, straightforward and effective data warehouse design greatly reduces the complexities around OLAP models.

BI Semantic Layer Tools
SharePoint 2010 Insights is a powerful BI semantic layer tool that enables users to create BI dashboards that include powerful analytic reports, Key Performance Indicators (KPIs), and scorecards. Using PerformancePoint Services (PPS) in SharePoint 2010 provides for even greater insight into how the business is performing and the status of key indicators for business processes. Adding PPS to the list of data consumers from a data warehouse enables organizations to make real-time decisions and improve the overall ROI of data warehouse efforts. For more information about PPS, see the following links:
- PerformancePoint Services
- What's new for PerformancePoint Services (SharePoint Server 2010)

Embedded BI Applications
This class of applications is developed to solve a targeted business problem, such as fraud detection. These applications often use data mining algorithms or provide a more guided user experience than their BI tool equivalents.

Suites of Applications and Services
Enabling ease of access to information contained in data warehouse systems is a main objective of any enterprise BI strategy. In this respect, having information users experience pervasive BI in their organization at low cost through the Microsoft Office Suite tools they use every day efficiently accomplishes this goal. Applications and services within Office provide for easy and direct access to data in data warehouses, whether users connect directly from Microsoft Excel or use the advanced analytics of Microsoft PowerPivot. For details about the Microsoft Office Suite or PowerPivot, see the following links:

- Microsoft Office Web site
- Microsoft PowerPivot Web site

External Data Feeds
External data feeds can include LOB systems, such as customer relationship management (CRM) systems. It is crucial for CRM systems to consume data that is put together from various data sources, so a data warehouse properly architected in this respect represents an ideal source of data for CRM. Data warehouses provide for data feeds that encompass a central, comprehensive, and consistent perspective of customers, making CRM systems more efficient and effective. Additionally, data warehouses represent a consistently cleansed and reliable source of customer data that is accessible enterprise-wide.

Master Data and Master Data Management


Master data and Master Data Management (MDM) are hot topics in the industry today. Best-of-breed MDM software products have hit the market, and major software vendors have added MDM capabilities to their product suites. SQL Server 2008 R2, for example, introduced Master Data Services (MDS). However, it's still early in the adoption curve. There is still confusion in this area, and few organizations are managing all of their master data within an MDM solution. This section provides more clarity around the topics of master data and MDM and gives guidance around the options that organizations have for managing their master data. Figure 2-24 illustrates some questions organizations are asking today, such as:
- What is our organization's MDM solution?
- Is there an overlap between data warehouses and MDM?
- Do I need an MDM software product? And if so, how does my MDM software communicate with my data warehouse?

Figure 2-24: Master Data Management questions

Before continuing, it's useful to look back in time to provide a historical perspective on how the MDM industry arrived at where it is today. Figure 2-25 shows a timeline with significant events leading up to the present.

Figure 2-25: Master Data Management timeline

According to the timeline:
- Data warehousing has been part of the database industry's lexicon since the early 1990s. A major selling point for an EDW was that it provided a single version of the truth.
- The 1990s was also a decade that saw wide adoption of LOB systems, starting with Enterprise Resource Planning (ERP). One business driver for these systems was Y2K concerns.
- The advent of the Internet and multiple customer channels drove demand for CRM systems. The wide adoption of CRM created challenges for data warehouses because there were now multiple versions of the truth implemented in complex enterprise-level systems such as SAP and Siebel (now Oracle).
- Customer Data Integration (CDI) started appearing in analyst writings around 2000. Although this software category never gained critical mass, it did raise awareness of the need to provide a single version of the truth for key business entities.
- The concept of a data governance discipline started to emerge after organizations recognized that data quality was an ongoing enterprise discipline, not a single-point-in-time solution.
- The term Master Data Management began to appear in 2002-2004 and was followed by the emergence of best-of-breed MDM products.
- In 2010, SQL Server 2008 R2 shipped with Master Data Services 1.0.

An Accenture article titled Master Data Management (MDM) Hot Topic Getting Hotter outlines MDM common business definitions and management processes for business and reference data. It also stresses the importance of data governance, stating: "The technology side consists of master data management systems that provide a single version of the truth."

Here's where the confusion lies. Today's consensus is that a data governance discipline is necessary, but how about a single version of the truth? Where does the authoritative version of the data reside? Is it within a master data database, the data warehouse, or both? What tools are used to create it: the data warehouse and data integration processes, an MDM system, or a combination of the two? The first step is to define master data and Master Data Management.

What Is Master Data?
Many papers and articles include definitions of master data. The following definition is from the MSDN article The What, Why, and How of Master Data Management, by Roger Wolter and Kirk Haselden: "Master data is the critical nouns of a business. These nouns generally fall into four groupings: people, things, places, and concepts." (Note that this definition is abbreviated for readability.) This article also categorizes an organization's data into data types, as Figure 2-26 shows, and describes which ones are stored within a data warehouse.

Note: Instead of the term data type, which could be confused with the same term commonly used in physical database design, in this chapter we use data category.

Figure 2-26: Data categories

The different data categories are:
- Unstructured: Data found in email, magazine articles, corporate intranet portals, and so on
- Metadata: Data about data, including report definitions, database column descriptions, and configuration files
- Master: The critical nouns of a business, which generally fall into four groupings: people, things, places, and concepts

- Transactional: Data related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions
- Hierarchical: One- or multi-level relationships between other data, such as organizational charts and product lines
- Reference: A classification of a noun or transaction; often has two columns (code and description), and examples include marital status and gender

Notice in Figure 2-26 that:
- Master data exists within a data warehouse and can be fed from an MDM system.
- Reference and hierarchy data are used to classify master data and transaction data.
- Unstructured data is outside the scope of the data warehouse.
- Metadata is valuable but out of scope for this discussion.

What Is Master Data Management?
The next step is to define Master Data Management. David Loshin, in the book Master Data Management, defines MDM as: "A collection of best data management practices that orchestrate key stakeholders, participants, and business clients in incorporating the business applications, information management methods, and data management tools to implement the policies, procedures, services, and infrastructure to implement the capture, integration, and subsequent shared use of accurate, timely, consistent and complete master data."

There are many other definitions, but the key points are:
- MDM is not only a technical solution; it's an ongoing set of processes and an organizational discipline.
- Transactional data is out of scope for MDM.
- Classification data (that is, reference and hierarchy data) is in scope for MDM.

Where Do Data Warehousing and MDM Overlap?
Now that we've defined master data and MDM, the next question is: Where is the overlap between MDM and data warehousing? The simple answer is master, reference, and hierarchy data. Master data existed within the data warehouse long before the term Master Data Management was coined. This section shows where master data exists within a data warehouse; Figure 2-27 shows the data stores in scope for this discussion.

Figure 2-27: Data stores containing master data

Let's start with the data warehouse destination and work our way back to the original source. Most data marts, either within the consumption area or downstream from the data warehouse, use a denormalized data model for optimal query access. Figure 2-28 shows an example within the SQL Server sample data warehouse, AdventureWorksDW2008. (Note that this chapter assumes that the reader is familiar with dimensional data modeling concepts.)

Figure 2-28: Reseller Sales snowflake schema

The Reseller Sales fact table contains all sales transactions for the Adventure Works bicycle shop. This fact table contains sales order quantity, product cost, and other numeric values recorded at the sales transaction. The fact table also consists of foreign keys to dimensions, which are used to filter the results from the joined fact and dimension tables. Note that transactional data is out of the scope of MDM and fact tables are not considered master data.

Each dimension in the data warehouse data store is sourced from master data, reference data, and hierarchy data residing in the staging data store. ETL processes transform source data en route to the staging area. This processing includes cleansing, consolidating, and correlating master, reference, and hierarchy data as well as linking it to transaction data. Figure 2-29 shows a partial mapping of three dimensions to their master, reference, and hierarchy data sources within the staging data store. Note that these mappings are conceptual rather than actual mappings, since the Adventure Works database samples do not include a staging data store.

Figure 2-29: Mapping three dimensions to their data categories

Note that the objective of this example is not completeness but rather to show that:
- Each dimension is in a denormalized form. Its source is the staging data store.
- The production area contains master, reference, and hierarchy data categories, which are the same categories managed by a data management system.
- The production area also contains transactional data.

Thus, the answer to the question, "Is there an overlap between data warehousing and MDM?" is: Yes, the master, reference, and hierarchy data categories exist in both.

Given this overlap, the next question arises: Can an organization forego a data warehouse when it has an MDM software product? The answer is no, because MDM software does not manage transactional data. However, transactional data holds all the numeric and monetary values used to measure an organization's performance, a key objective for a data warehouse.

The next question then becomes: Do I need an MDM software product? Does an MDM solution require an MDM software product? The answer here is maybe. Below are some questions to determine the answer for your organization:
- How many data sources contain master data? Does the master data have strong natural keys?
- How mature is the current data warehouse solution? Does it already handle master data?
- What level of manual effort is required for reference list maintenance? What if one reference table exists and is separately maintained within multiple source systems?
- What level of hierarchy management exists? What level is required? What if the internal hierarchy is a super-set of a standard? For example, what if a health care organization's diagnostic codes are different than the ICD-9 standards?

Let's look at how your answers to some of these questions might influence your decision to acquire and utilize an MDM software product.

How Many Data Sources Contain Master Data?
It's useful to understand where master data is sourced from. Figure 2-30 shows an example of master data within a production area and its sources.

Figure 2-30: Master data by category and sources

Some notes about this source data:

- Master data resides within the LOB systems.
- Master data can be loaded from multiple LOB systems; this requires the data to be normalized and consolidated.
- Reference and hierarchy data can reside in LOB systems, internal feeds, and external feeds.
- Traditionally, internal feeds were created and maintained with Excel.
- External data traditionally was structured in CSV format, but many industry standards are moving toward an XML format.

The need for a separate MDM software product increases as the number of sources increases.

How Mature Is the Current Data Warehouse Solution? Does It Already Handle Master Data?
The next step in determining whether an MDM software product is required is to review the data integration processes already in place. Figure 2-31 shows a high-level view of the data integration process used to load master data from one source to the staging area. Note that this chapter presents these data integration processes at a conceptual level; refer to Chapter 3 for details about data integration.

Figure 2-31: Master data integration flow

The data integration steps for master data are as follows (a simplified T-SQL sketch follows the list):
1. Master data is extracted from the source.
2. Data quality business rules are applied, and data flows to an exception area if it fails a data quality check.
3. Master data is then normalized (i.e., attributes are used as lookups into reference lists and hierarchies). Data may flow to an exception area if an attribute does not match any entries in a reference list or hierarchy.
4. The process checks whether the master data already exists.
5. Master data flows to the production area if it doesn't exist or if it has changed.
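The following sketch illustrates steps 3 through 5 for a hypothetical Employee entity; the schema and table names (DataIn, Exception, Production) are assumptions, and the handling of changed rows (update in place versus versioned insert) is omitted:

    -- Step 3: rows whose Education code has no match in the reference list go to the exception area
    INSERT INTO Exception.Employee (EmployeeID, EducationCode, ExceptionReason, LoadDate)
    SELECT e.EmployeeID, e.EducationCode, N'Unknown education code', GETDATE()
    FROM   DataIn.Employee AS e
    LEFT JOIN Production.EducationReference AS r
           ON r.EducationCode = e.EducationCode
    WHERE  r.EducationCode IS NULL;

    -- Steps 4 and 5: rows that pass the lookup and do not yet exist flow to the production area
    INSERT INTO Production.Employee (EmployeeID, EducationCode, MaritalStatus, LoadDate)
    SELECT e.EmployeeID, e.EducationCode, e.MaritalStatus, GETDATE()
    FROM   DataIn.Employee AS e
    JOIN   Production.EducationReference AS r
           ON r.EducationCode = e.EducationCode
    LEFT JOIN Production.Employee AS p
           ON p.EmployeeID = e.EmployeeID
    WHERE  p.EmployeeID IS NULL;   -- changed-row handling is omitted for brevity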

The above conceptual process for loading master data is very similar to the processes used to populate dimensions and is familiar to data integration developers. The simple answer to whether an MDM software product is required for loading master, reference, and hierarchy data is no if data integration processes already exist and are working without issues. If these master data integration processes are not in place, the answer is maybe, depending on the complexity of the master data integration processes combined with other factors such as:
- Whether the existing data warehouse team (developers and operations) is comfortable learning a new technology
- Whether the new technology is affordable (both in acquisition and ongoing costs)

The takeaway from this section is that introducing an MDM software product has associated costs. The decision on whether an MDM software product is required should take these costs into consideration along with the complexity of the master data processes. The next section introduces two master data scenarios and the data integration processing required for each.

Employee Data: One Source
In this first scenario, the data exists in one and only one source system, as Figure 2-32 shows. The Employee table is populated with Education and Marital Status reference data sourced from the same system.

Figure 2-32: Master data example, one source

A separate MDM product is not required in the simplest of scenarios.

Employee Data: Multiple Sources

However, master data often exists in more than one place, typically in multiple source systems, but it can also come from an external or internal feed. The next example assumes that Education data comes from an external feed, as shown in Figure 2-33.

Figure 2-33: Master data example, multiple sources

Note that even in this simple multi-source scenario, the following issues arise:
- Reference lookup failures: The reference table in staging is sourced from the LOB system. The second source containing employee education information uses different descriptions. These cases require intermediate mapping tables, which take codes or descriptions from the source and map them into the staging reference table.
- Production area table lookup failures: The lookup against the production area Employee table fails in the above scenario. It demonstrates two common reasons for these lookup failures:
  o Lack of a strong natural key: Employee ID is a strong natural key and is ideal for comparing employee data across internal LOB systems if, and only if, this key exists in each system. If it doesn't, the source-to-stage lookup becomes more complicated and error-prone.
  o Inexact matching: Name and/or address matching are perfect examples of where the matching process is not completely deterministic. In these cases, either a fuzzy matching algorithm or a product with capabilities specifically developed for name and address matching is required.

SQL Server 2008 R2 Master Data Services (MDS) may be beneficial in this scenario if the implementation team thinks that leveraging MDS capabilities is preferable to adding capabilities to existing features (such as Fuzzy Lookups). Here are a couple of things to think about when considering MDS (a simple approximate-matching sketch follows the list):
- Reference table management: What processes and interfaces are used for creating and maintaining reference tables and mappings that aren't sourced directly from a LOB system?
- Name/address matching: Can the team leverage SQL Server Integration Services (SSIS) Fuzzy Lookups and provide tools for data stewards to make a human decision when the fuzzy matching doesn't produce strong results?
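SSIS Fuzzy Lookup (or MDS matching capabilities) is the richer option, but as a crude illustration of generating candidate matches for data-steward review, T-SQL's SOUNDEX-based DIFFERENCE function can score name similarity. The table and column names below are assumptions:

    -- DIFFERENCE() returns 0 (weak) through 4 (strong) based on the SOUNDEX codes of two strings
    SELECT s.SourceRowID,
           p.EmployeeID,
           s.LastName AS SourceLastName,
           p.LastName AS ProductionLastName,
           DIFFERENCE(s.LastName, p.LastName) AS MatchScore
    FROM   DataIn.EducationFeed AS s
    JOIN   Production.Employee  AS p
           ON DIFFERENCE(s.LastName, p.LastName) >= 3    -- keep only plausible candidates
    ORDER BY s.SourceRowID, MatchScore DESC;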

Complex Master Data: Multiple Sources and Hierarchy Management
The multi-source Employee data example above provides a simple illustration of master data within organizations. Many organizations have more complicated master data needs as well as other advanced MDM needs, such as creating and maintaining multiple versions of very complex hierarchies (e.g., organizational charts, charts of accounts, disease classifications, etc.). In these cases, organizations can benefit from advanced data integration capabilities such as merge/match. In addition, version control allows data stewards to work on new reference data and hierarchies prior to publishing them in the data warehouse, workflow approval supports a multi-tier review process, and role-based security lets data stewards work only on subsets of master data. Each of these capabilities supports complex master data environments, and all are features within SQL Server 2008 R2 MDS.

In summary, master data and MDM are hot topics today but are old problems within data warehousing. Many existing data warehouse implementations already process master data effectively. However, organizations with complex master data needs that don't have enterprise MDM capabilities in place can benefit from adopting an enterprise MDM solution such as SQL Server 2008 R2 MDS.

Now that we've covered the key concepts involved in data warehouse architecture, we are ready to address the responsibilities that are owned or co-owned by the data architecture team.

Platform Architecture
The data architecture team is typically responsible for selecting the platform architecture for a data warehouse. The final selection of a hardware and software platform depends on a combination of many factors, including performance and adherence to corporate standards. This section provides a brief overview of this topic; a detailed discussion of platform architecture is out of scope for this document.

Data warehouse data volumes and usage requirements call for large dedicated servers and storage hardware. Historically, organizations deployed SQL Server data warehouses by getting a big server with lots of CPUs and lots of memory and allocating lots of space in the Storage Area Network (SAN) for the data warehouse databases. However, alternative options have emerged over the past few years, as shown in Figure 2-34.

Figure 2-34: Platform architecture server options

This illustration shows the following server options:
- Server: The historical Symmetrical Multiprocessor (SMP) configuration is a multi-processor architecture with lots of memory running SQL Server Enterprise Edition.
- Server virtualization: This configuration maximizes server resources by supporting the running of multiple virtual instances of SQL Server. This is a potential configuration for pre-production data warehouse environments, including test, staging, and QA.
- Data warehouse appliance: SQL Server 2008 R2 PDW is a massively parallel processing (MPP) architecture supporting both centralized EDW and hub-and-spoke architectures on dedicated SAN storage.
- Server/reference architecture: Microsoft's Fast Track Data Warehouse architecture is a reference hardware configuration tailored to run a SQL Server data warehouse workload.

We'll look briefly at each of these options, but the selection, acquisition, and deployment of data warehouse platform architectures is out of scope for this document.

Data Warehouse Server
The traditional SQL Server data warehouse server for medium-size data volumes is a 64-bit machine running SQL Server Enterprise Edition and configured with a lot of CPU and memory. Given the memory- and compute-intensive workloads for both querying and populating the data warehouse, organizations should purchase hardware with the maximum CPU and memory configuration they can afford.

These data warehouse servers are typically connected to the corporate SAN. This shared SAN provides centralized administration and is more flexible than storage directly attached to a server, which simplifies administration operations such as backup and disaster recovery. However, the SAN is the primary bottleneck for most data warehouse workloads experiencing performance issues, so pay careful attention to SAN configurations.

One issue with this large-server approach is that it becomes very expensive because multiple servers are required for different phases of the data warehouse lifecycle, including development, test, QA, and production. Traditionally, organizations purchased smaller servers for every environment other than production, but virtualization has provided another option.

Server Virtualization
Server virtualization has emerged as a dominant theme in computing over the past 5 years. Virtualization reduces the number of servers in data centers, which results in lower IT costs from acquisition through ongoing management. It also provides a flexible mechanism for spinning up and tearing down server environments. This flexibility is useful for non-production environments, but dedicated servers are still the norm for production data warehouse environments. SQL Server 2008 R2 Datacenter Edition is a new product offering from Microsoft for organizations looking for a virtualized SQL Server solution.

SQL Server Fast Track Data Warehouse
The challenges involved in correctly acquiring and configuring a data warehouse hardware and software platform led Microsoft to introduce the SQL Server Fast Track Data Warehouse. This is a scale-up reference architecture targeted at data warehouses containing up to 48 TB of data. Options include different reference hardware configurations from HP, Dell, Bull, EMC, and IBM. The benefits of this solution are that preconfigured, industry-standard hardware provides a lower cost of ownership through better price/performance, rapid deployment, and correct configurations. For details about SQL Server Fast Track Data Warehouse, see these links:
- SQL Server Fast Track Data Warehouse Web site
- MSDN white paper An Introduction to Fast Track Data Warehouse Architectures

Data Warehouse Appliances
SQL Server 2008 R2 PDW is a highly scalable data warehouse appliance that uses an MPP software architecture based on the DATAllegro acquisition. Its target workload is data warehouses containing up to 400 TB of data. In a traditional, symmetric multi-processing (SMP) architecture, query processing occurs entirely within one physical instance of a database; CPU, memory, and storage impose physical limits on speed and scale. A PDW MPP appliance partitions large tables across multiple physical nodes, each node having dedicated CPU, memory, and storage and each running its own instance of SQL Server in a parallel, shared-nothing design. All components are balanced against each other to reduce performance bottlenecks, and all server and storage components are mirrored for enterprise-class redundancy. PDW is a distributed architecture that can act both as the centralized EDW and as the hub in a hub-and-spoke architecture, as covered earlier. You can read about PDW at these links:

- SQL Server 2008 R2 Parallel Data Warehouse Web site
- TechNet white paper Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation

Which Server Option Should You Choose?
Large 64-bit servers running SQL Server Enterprise Edition can support multi-terabyte data warehouses and are a solid choice for organizations today. Here is some guidance for considering one of the alternative options:
- Non-production environments: SQL Server 2008 R2 Datacenter Edition provides the flexibility of quickly spinning up and tearing down environments, which lowers the total cost of ownership (TCO) for non-production data warehouse environments.
- Performance issues: Performance issues, both directly experienced and projected, can be addressed by scaling up with SQL Server Fast Track Data Warehouse. Pre-configured solutions help minimize common performance problems, such as SAN bottlenecks due to incorrect configurations.
- Enterprise data warehouse: PDW provides a platform for the largest organizations' EDW needs. Its hub-and-spoke architecture supports data marts optimized for target business consumer communities and simplifies data integration between the data warehouse and downstream data marts.

These options allow organizations to standardize their data warehouse on the SQL Server platform whether the data warehouse is measured in gigabytes or terabytes, first by selecting a platform and then by choosing more scalable platforms as their data warehouse matures. Now that we've briefly covered platform architecture, the next section addresses the data warehouse database architectures implemented on that platform.

Database Architecture
This section focuses on how different data warehouse data stores and data categories are organized into logical and physical databases. Choosing the most appropriate database architecture for a data warehouse system is an important aspect of the overall data warehouse strategy: a poorly implemented database architecture will negatively affect data warehouse performance and the overall user experience. It's not uncommon to find SQL Server data warehouse implementations in which no effort was made to create more than the single default filegroup, which results in all data, indexes, and logs residing on the same set of disks. Despite having a strong data model, such a data warehouse initiative will be labeled a failure because of:
- Slow queries
- Backup files becoming too large, resulting in excessive backup and restore times

- Data files becoming too large to manage, causing excessive execution times for rebuilding indexes and updating statistics

This underscores the need for an effective physical database design consisting of files and filegroups. In addition, a good database schema design can make database objects easier to access and can simplify the security implementation.

Databases, Schemas, and Filegroups
Data modelers are responsible for developing the data warehouse data models, which are accessed by the three roles mentioned below. A data warehouse can be viewed differently depending on your role and your objectives:
- Database developers view the data warehouse through a SQL lens. Their objective is to select the correct information from the data warehouse as efficiently as possible.
- Data integration developers view the data warehouse as a destination. Their objective is to populate the data warehouse as efficiently as possible.
- Database administrators view the data warehouse as a precious resource. Their objective is to ensure that the data warehouse is operational, secure, and high-performing, both when serving up data to business consumers and during database maintenance operations.

Figure 2-35 shows these different roles and how they relate to the data warehouse.

Figure 2-35: Data warehouse database logical and physical view
The data warehouse database can be viewed both as a logical entity accessed through SQL and as a physical entity consisting of multiple files stored within the disk subsystem. Every SQL Server database is stored on disk and has two or more files (a minimal file-layout sketch follows this list):
- Primary data file – A database has only one primary data file, which contains startup information and pointers to other files in the database.
- Log file(s) – A database can have one or more log files.
- Secondary data file(s) – These optional files are used to store database objects. Although optional, the use of secondary data files is highly recommended for a SQL Server data warehouse.
- Backup file(s) – These files are used to back up information from the data warehouse.
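To make the file types concrete, the following is a minimal, hypothetical T-SQL sketch of a data warehouse database with a primary file, two secondary data files in a user filegroup, and a log file. The database name, file names, paths, and sizes are illustrative assumptions only and would be tuned to the actual disk subsystem.

-- Illustrative file layout only; adjust names, paths, and sizes to your environment.
CREATE DATABASE EDW_Sales
ON PRIMARY
    (NAME = EDW_Sales_Primary, FILENAME = 'E:\Data\EDW_Sales.mdf', SIZE = 100MB),
FILEGROUP FG_Data
    (NAME = EDW_Sales_Data1, FILENAME = 'F:\Data\EDW_Sales_Data1.ndf', SIZE = 10GB),
    (NAME = EDW_Sales_Data2, FILENAME = 'G:\Data\EDW_Sales_Data2.ndf', SIZE = 10GB)
LOG ON
    (NAME = EDW_Sales_Log, FILENAME = 'H:\Log\EDW_Sales_Log.ldf', SIZE = 5GB);

Placing the secondary data files and the log file on separate physical drives follows the separation of read/write activity discussed later in this section.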

All database files are organized into one or more filegroups. Every file has one filegroup parent, and every database object is associated with a filegroup. The primary filegroup stores all system tables. A key database architecture task is determining which database objects reside in which filegroups. This task must take the different developer and DBA activities into consideration, including select, insert, and update activity; backup and restore processes; and database maintenance operations such as reindexing, the updating of statistics, defragmentation, and so on.
Databases
Databases are the parent for both logical schemas and physical filegroups and files. Before we go deeper into physical filegroup configurations, it's useful to review how database objects are logically accessed and organized within SQL Server. The SQL Server Transact-SQL (T-SQL) language supports three-level naming when accessing database objects within the same server:
[Database].[Schema].Object
Both database and schema are optional, as shown in the following query, which assumes our current database is AdventureWorksDW2008:
SELECT * FROM DimCustomer
The following query, which specifies all three levels, returns identical results to the above query:
SELECT * FROM AdventureWorksDW2008.dbo.DimCustomer
Every database table has a schema, which we look at in a moment. When a schema isn't included, the default schema is assumed. The default schema is dbo or the default schema explicitly assigned to the user account by the DBA. Note that SQL Server linked servers extend this query model to four-level naming, as shown below, to access an object on another server:
[Linked-Server-Name].[Database].[Schema].[Object]
However, linked servers are not regularly used in production for performance and security reasons. Every linked server request has to first connect to a remote server and then retrieve the results before returning them to the requester. More often, SSIS data flows are used to populate tables local to the server, which are then directly accessible within a SQL Server instance.
Schemas
Schemas provide for logical grouping of tables. Figure 2-36 shows an example in which tables are grouped by the logical domain they belong to.

Figure 2-36: Subset of schemas from the AdventureWorks2008 sample database
The decision to create schemas is made by the data warehouse team in the design phase. The following are reasons for using schemas:
- Logical grouping – The ability to logically organize tables into groupings or subject areas makes it easier to navigate complex databases such as data warehouses. The example above illustrates how tables can be organized into subject areas.
- Logical partitioning – There are many other ways to logically group tables; one example might be to group by entity. For example, the Delta-American Airlines merger creates the need to consolidate data from the LOB systems used by each entity prior to the merger. Creating a Delta and an AmericanAirlines schema with identical table structures, such as DimCustomer, would be an intermediate structure prior to loading the consolidated DimCustomer table.
- Security – The ability to grant and restrict access to database users and roles through different schemas simplifies the security approach for a database. For the example in Figure 2-36, you could (see the T-SQL sketch after this list):
o Grant SELECT access to the HR department for the HumanResources schema
o Grant INSERT, UPDATE, and DELETE access to the HR Admin role for the HumanResources and Person schemas
o Grant SELECT access to all data warehouse users for the Person schema
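The following is a minimal sketch of the schema-level grants described above. The role names (HRDepartment, HRAdmin, DataWarehouseUsers) are hypothetical; in the AdventureWorks sample the HumanResources and Person schemas already exist, so the CREATE SCHEMA statements are shown only for completeness.

-- Hypothetical roles; schemas shown for completeness only.
CREATE ROLE HRDepartment;
CREATE ROLE HRAdmin;
CREATE ROLE DataWarehouseUsers;
GO
CREATE SCHEMA HumanResources AUTHORIZATION dbo;
GO
CREATE SCHEMA Person AUTHORIZATION dbo;
GO
-- Read-only access to HR data for the HR department.
GRANT SELECT ON SCHEMA::HumanResources TO HRDepartment;
-- Data modification rights for the HR Admin role on both schemas.
GRANT INSERT, UPDATE, DELETE ON SCHEMA::HumanResources TO HRAdmin;
GRANT INSERT, UPDATE, DELETE ON SCHEMA::Person TO HRAdmin;
-- All data warehouse users can read the Person schema.
GRANT SELECT ON SCHEMA::Person TO DataWarehouseUsers;

Granting at the schema level means new tables added to a schema automatically inherit the existing access rules, which is part of the simplification the text describes.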

The downside to creating schemas is that the schema name must be included in all database object references. This may not seem like a significant issue for developers and DBAs, but it could be an issue for downstream business consumers. Also, any change in the schema name will require a change to every SQL statement accessing objects within the schema, except in the case where the user is referencing objects within their default schema. Changes in the schema name will not impact the underlying physical structure because schemas are a logical construct. When using schemas in SQL Server, it is important to recognize that schemas are not equal to owners. Thus, it is not necessary to change owners of objects when owner accounts are being removed. A schema does have an owner, but the owner is not tied to the name. Microsoft Corporation Copyright 2010

Note that dbo is the only schema supported in the initial PDW release.
Filegroups
A core database architecture concept is the filegroup, which contains database files and represents the physical aspect of the architecture. Database tables, indexes, and logs are created on filegroups, while filegroups contain one or more physical files. Proper management of files and filegroups is instrumental to an efficient data warehouse design and maintenance. Here are some scenarios where wise filegroup architecture can provide significant value (a partitioning sketch follows this list):
- Partitioned fact tables – Large fact tables (with more than 100 million rows) stored in one database file benefit from partitioning. Modifying such tables to have their data divided among multiple physical files, with each file stored on a separate physical disk array, would enable the Database Engine to perform multiple parallel reads (one per file), improving read performance. Figure 2-37 shows a simple scenario of a fact table with two partitions.

Figure 2-37: Partitioned fact table
- Backup/restore and archiving – For VLDBs, it may become impossible to manage backups and restores for certain files in a timely fashion unless multiple filegroups are introduced and separately backed up.
- Other database maintenance activities – Partitions also have advantages for indexing, consistency checking, updating statistics, and other maintenance tasks.
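As a rough illustration of the partitioned fact table scenario, the following sketch maps two partitions of a hypothetical fact table to two filegroups, each on its own file. The database name (EDW_Sales), filegroup and file names, paths, and the boundary value are assumptions for illustration; a real design would typically have many more partitions and filegroups.

-- Illustrative only: two filegroups, each with one file on a separate drive.
ALTER DATABASE EDW_Sales ADD FILEGROUP FG_Fact2009;
ALTER DATABASE EDW_Sales ADD FILEGROUP FG_Fact2010;
ALTER DATABASE EDW_Sales ADD FILE
    (NAME = Fact2009, FILENAME = 'F:\Data\Fact2009.ndf', SIZE = 20GB) TO FILEGROUP FG_Fact2009;
ALTER DATABASE EDW_Sales ADD FILE
    (NAME = Fact2010, FILENAME = 'G:\Data\Fact2010.ndf', SIZE = 20GB) TO FILEGROUP FG_Fact2010;
GO
-- Partition on an integer date key; rows below 20100101 go to the first partition.
CREATE PARTITION FUNCTION pfOrderDate (int)
    AS RANGE RIGHT FOR VALUES (20100101);
CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate TO (FG_Fact2009, FG_Fact2010);
GO
-- The fact table is created directly on the partition scheme.
CREATE TABLE dbo.FactSales
(
    OrderDateKey int   NOT NULL,
    ProductKey   int   NOT NULL,
    SalesAmount  money NOT NULL
) ON psOrderDate (OrderDateKey);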

For more information about SQL Server filegroups, see the MSDN article Files and Filegroups Architecture.
Database Files
Regarding database files, it is important to consider the physical architecture of your hardware infrastructure to take advantage of separate disk controllers. In addition:
- Placing non-clustered indexes on files separate from files containing data is a common practice to improve data warehouse performance.
- A strategy for where to store tempdb in SQL Server is an essential part of an overall database architecture because, if not properly architected, write operations to this frequently used database often conflict with read/write operations for data and log files.

The next section of this chapter will address the questions: How many databases should be in my data warehouse, and what factors affect this decision?
Database Files and Storage
To have a high-performing data warehouse, you need to understand and properly manage storage for database objects. The MSDN article Storage Top 10 Best Practices is a good place to start. Most data warehouses use SAN storage, with disks organized in applicable types of RAID configurations. Depending on disaster recovery policies, some data warehouse systems will reside on database clusters, but which services should be set up for failover depends on an organization's particular business requirements.
Considerations for Physical Database Design
This section outlines several considerations for physical database design as it relates to logical design and architecture decisions. For more about best practices for partitioning with SQL Server, see the partitioning section of Chapter 4.
Capacity Planning and Designing for Change
Initial capacity planning is required for every data warehouse and includes estimates for initial data size and data growth. In addition, physical designs need to be implemented so that new disks and servers can be added without affecting systems already in production.
Templates in Physical Design
The process of deploying the physical database architecture can be simplified by establishing templates. These templates are deployed to a new database with a predefined architecture for filegroups and files. However, these patterns need to be adjusted for special cases, for example, when the complexity and size of data in the fact tables require multiple files for more efficient partitioning. Some of the recommended practices for using templates in database architecture include maintaining scripts for template databases, including all relevant CREATE statements. These scripts are modified based on naming standards used in a particular data warehouse implementation and are executed to create new database objects.
Partitioning
Partitioning is a valuable technique for very large tables within the data warehouse. Partitioning creates smaller clusters of data, which enables maintenance operations to be applied on a partition-by-partition basis. Partitioning also enables minimal latency because source data is continuously loaded into a passive partition, which gets switched with active partitions on set schedules.

In physical design, typically for purposes of producing proof-of-concept models, the partitioning strategy is implemented after business rules, entity relationships, and attribute and measure definitions are established. Partitioning is a key tool employed by both DBAs and database developers when working with very large tables and is covered in more detail in Chapters 4 and 5.
Striping
Striping is associated with the RAID levels implemented as part of the data warehouse architecture. RAID levels 0, 1, 5, and 10 are typically implemented with SQL Server. This is an important aspect of database architecture because striping improves read performance by spreading operations across disks. The following SQL Server Customer Advisory Team (SQLCAT) article contains more information about SQL Server striping and RAID levels for SQL Server: Storage Top 10 Best Practices

Data Compression
Data compression can help reduce the size of the database as well as improve the performance of I/O-intensive workloads. However, CPU cycles are required on the database server to compress and decompress the data while data is exchanged with the application. SQL Server provides two levels of data compression: row compression and page compression.
- Row compression helps store data more efficiently in a row by storing fixed-length data types in a variable-length storage format. A compressed row uses 4 bits per compressed column to store the length of the data in the column. NULL and 0 values across all data types take no additional space other than these 4 bits.
- Page compression is a superset of row compression. In addition to storing data efficiently inside a row, page compression optimizes storage of multiple rows in a page by minimizing data redundancy.

It is important to estimate space savings and apply compression only to those tables and indexes that will yield reduced I/O and memory consumption due to the reduced size. The sp_estimate_data_compression_savings stored procedure can be used for SQL Server 2008 R2 databases. Data compression for SQL Server is explained in detail in the SQLCAT article Data Compression: Strategy, Capacity Planning and Best Practices.
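The following is a minimal sketch of that workflow for a hypothetical dbo.FactSales table (the table name is an assumption): estimate the savings first, and only then rebuild the table, or an individual partition, with page compression.

-- Estimate page compression savings before applying compression.
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- If the estimate justifies it, rebuild the whole table compressed...
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);
-- ...or compress a single partition of a partitioned fact table.
ALTER TABLE dbo.FactSales REBUILD PARTITION = 2 WITH (DATA_COMPRESSION = PAGE);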

Data Modeling
Success of a data warehouse implementation initiative is greatly affected by what data modeling technique is used and how it is implemented. Data modeling begins the process of structuring the data elements that make up a data warehouse.

In this process, requirements are analyzed and defined to support the objectives of the data warehouse. The results of this analysis are contained within conceptual, logical, and physical data models. Before we dig into the different types of data models, let's review some key aspects of data modeling for a data warehouse.
Inherent Definition
Data modeling identifies definitions of entities in regard to their inheritance properties. This provides clear relationships and dependencies. For example, a result of inherent definition modeling would be a definition that a product can belong to multiple categories in a manufacturing data warehouse implementation.
Alignment through Logical Decomposition
Entities, relationships, and dependencies as well as metrics are modeled across subject areas of a system. Commonalities in modeling are identified and applied throughout systems. Modeling of each component of a data warehouse is aligned with a business requirement relevant to a particular subject. Thus, all models in a data warehouse are included in an overall data warehouse strategy parsed through initiatives and relevant projects.
Identifying Complexity, Priorities, and Needs for Simplification
Data modeling efforts become more complex as you define entities and the relationships between them. Incomplete data models lead to inadequate business rules, resulting in a sub-optimal solution. However, it's also important to prioritize data modeling efforts in accordance with project milestones and deliverables. Decisions about the importance of defining all possible attributes of a single entity versus defining just the desired outputs and metrics of a data warehouse system will affect the overall result of data warehouse initiatives.
Eliminating Non-Critical, Isolated, and Unrelated Data
An overarching objective for a data warehouse is often to include all data that exists in an enterprise. However, it's important to keep the focus on data as it relates to the overall scope of requirements and priorities. Failure to do so can result in extended timelines for deliverables, analysis paralysis, and eventual failure of a data warehouse.
Data Modeling Variables and Inputs
A sub-optimal data model affects the ROI of a data warehousing effort, including higher costs for systems, infrastructure, and business consumer discovery and access as well as additional resources for maintenance and modification. Business rules and requirements are inputs into data models, and this dependency translates into data model changes when business rules and requirements change.
Data Modeling Overview
A data warehouse data model includes definitions of entity types, which are typically grouped into dimensions, reference tables, hierarchies, fact tables, and bridge tables. This section will cover each of these elements. In addition, data modeling includes the classifications of attributes, measures, relationships, and integrity rules.
The process of data modeling begins with the translation of business requirements into conceptual models, which include data definitions. Following the conceptual model is a logical model that outlines the actual implementation. The logical data model identifies both data objects and the relationships between them. One conceptual model often yields multiple logical models, which are referred to as subject or context models. The physical model rounds out the data modeling process and includes additional physical information, such as physical data types, referential integrity, constraints, and partitions.
Data warehouse teams are faced with several options when implementing data models:
- The team can work in serial order, where the data modeling team completes the entire model before moving to the next phase (i.e., conceptual model first, then logical, and finally physical).
- Or the data modeling team can work by subject area, iterating through the conceptual, logical, and physical models for one subject area before moving on to the next one.
- Alternatively, the team can perform a parallel effort, in which different areas within the data model are in different development stages (i.e., conceptual, logical, or physical).

The parallel effort typically involves enterprise data modeling tools and requires the team to traverse between subject areas as well as to move between the conceptual, logical, and physical data model within a subject area until the overall model is complete. However, the reality of data warehouses is that data models change and subject areas are continuously added. This reality frequently translates into a parallel or continuous data modeling effort.
For example, in a recent data warehouse project, an organization realized it needed to add support for analyzing data warehouse measures with respect to multiple time zones. The data warehouse had been in production for over a year when the request came in from business stakeholders, but this added requirement was part of a never-ending need to support changes in how the business operated. This is a very common scenario, and it should come as no surprise to data warehouse developers that they will have to be agile and creative in combining conceptual, logical, and physical data modeling efforts to support new and changing business needs.
The following sections provide more detail on the purpose of the different data models and the activities that take place during each stage.
Conceptual Model
Major characteristics of a conceptual data model relate to the following points:
Reduce the Enterprise Highest-Level Nouns
Creating one common definition for common entities reduces ambiguity when comparing data and results across different business and organizational units. Examples of these efforts are consolidated definitions for Customer, Product, and Inventory entities. These nouns may be referred to as domains, master objects, or another abstract term. For example, the Product domain in physical terms will include numerous tables, but in the conceptual phase, it is identified as a single abstract term.
Limit the Conceptual Model to 1 Page, or Less than 8 Objects
The purpose of the conceptual model is to provide a high-level understanding of the major entities and business processes in scope for the data warehouse. The general rule is that if the conceptual model can't fit on a single page, it's too granular. Figure 2-38 is an example of a high-level conceptual model.

Figure 2-38: Conceptual data model
Subject or Context Model
The context or subject data model defines intersections of conceptual data elements with subject areas in a data warehouse. Here are the main characteristics of a subject data model:
- The subject model presents the highest-level functions of the enterprise (Sales, Marketing, Finance, IT). Some high-level concepts are also subjects (Finance).
- The grid of concepts by subject should show that most subjects need most concepts.
- The subject model should anticipate the design of data marts or reporting groups.

Figure 2-39 shows an example of a subject model represented via a reverse-engineered bus matrix of dimension usage in SSAS.

Figure 2-39: Subject data model
Logical Model
The logical data model deepens the analysis of data elements in scope of a data warehouse effort and includes more detail about the data. Included in the logical model are:
- Entities, attributes, and domains – The logical model includes more granular information than the conceptual model, defining logical attribute names and the domains to which entities belong.
- Normalization/denormalization and relationships – Primary and foreign key relationships are identified, as well as objects containing various levels of hierarchies and entity relationships.
- Advanced concepts – The following concepts are also addressed in the logical model: sub-typing (inheritance), one-to-many relationships (rather than 0-to-many), null meaning and operations, and row interdependence (chaining in history).

For example, attributes for product type, product category, product sub-category, and product are defined, as well as relationships between these hierarchies. In transactional systems, data is normalized, and definitions of primary and foreign keys as well as relevant relationships are included in the model. In analytical systems, data can be denormalized, and in this case, logical models will also include definitions of business keys on dimensions and composites for surrogate keys on facts.

Figure 2-40 is an example of a partial logical model for the larger denormalized model.

Figure 2-40: Partial logical data model
Physical Model
The physical model adds final detail to the modeling effort with respect to column data types, nullability, primary keys, indexes, statistics, and other relevant table properties. The diagram in Figure 2-41 expands on the logical model from above, introducing information related to the physical modeling level of detail.

Figure 2-41: Partial physical data model
In physical modeling, it is important to properly manage data definition language (DDL) statements for all tables, functions, stored procedures, and other relevant database objects. When changes to database objects are properly managed, switching between versions of metadata changes becomes more feasible. This is sometimes required in the normal course of development and the introduction of new features. Having DDL properly managed also provides for more encompassing disaster recovery procedures and enables recreating data warehouse structures in additional development, testing, or user acceptance testing (UAT) environments. An effective method for managing DDL is to have scripts versioned and controlled via version control software such as Microsoft Team Foundation Server. Figure 2-42 shows the DDL for the DimDate table used in the previous diagram.
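The exact DDL shown in Figure 2-42 is not reproduced here; the following is a minimal, hypothetical sketch of the kind of script that would be kept under version control for a date dimension. Column names and data types are illustrative assumptions, not the actual sample definition.

-- Simplified date dimension DDL; the real DimDate script carries more attributes
-- and physical options (filegroup placement, extended properties, and so on).
CREATE TABLE dbo.DimDate
(
    DateKey         int          NOT NULL,  -- intelligent key, e.g. 20100501
    FullDate        date         NOT NULL,
    CalendarYear    smallint     NOT NULL,
    CalendarQuarter tinyint      NOT NULL,
    CalendarMonth   tinyint      NOT NULL,
    MonthName       nvarchar(10) NOT NULL,
    DayOfMonth      tinyint      NOT NULL,
    CONSTRAINT PK_DimDate PRIMARY KEY CLUSTERED (DateKey)
);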

Figure 2-42: DimDate DDL
Data Modeling Column Types
So far, our focus has been at the table level. However, this section covers different column types defined and used during data modeling.
Surrogate Keys
A surrogate key, shown in Figure 2-43, is a unique identifier for records within a table. Surrogate keys are either generated by the database when the record is inserted into a table or by a data integration application prior to the record insertion. SQL Server automatically generates the surrogate key when the column is created with the IDENTITY attribute. Note that PDW does not support the IDENTITY attribute in its initial release.

Figure 2-43: Surrogate key - AdventureWorksDW2008 DimProduct table
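The following sketch shows what an IDENTITY-based surrogate key looks like in a dimension table. The table is a simplified, hypothetical stand-in for the kind of dimension shown in Figure 2-43, not the actual AdventureWorksDW2008 DimProduct definition.

-- Hypothetical dimension with an IDENTITY surrogate key and a natural (business) key.
CREATE TABLE dbo.DimProduct
(
    ProductKey          int IDENTITY(1,1) NOT NULL,  -- surrogate key, generated by SQL Server
    ProductAlternateKey nvarchar(25)      NOT NULL,  -- natural key from the source system
    EnglishProductName  nvarchar(50)      NOT NULL,
    CONSTRAINT PK_DimProduct PRIMARY KEY CLUSTERED (ProductKey)
);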

Surrogate keys are at the core of every data warehouse model and are typically integer columns that provide a pointer to a unique instance of a dimension member defined by its natural key. Surrogate keys are used to keep fact tables as narrow as possible, to create effective indexes on dimension tables, and to support Type 2 dimensions. In addition, surrogate keys replace the need for including a source system identifier (along with the natural key) for each table that merges data from multiple source systems. Surrogate keys are typically designed as sequentially incrementing integers and have no logical correlation to a natural key (defined below). In special cases, such as in the Date dimension, the surrogate key is an integer representation of the canonical date value (e.g., May 1, 2009, has a surrogate key of 20090501). This type of surrogate key is also referred to as an intelligent key. Although intelligent keys are not generally recommended, they are acceptable in this case because the calendar is a stable entity not subject to change.
The SQL Server data type for a surrogate key is typically an int because this data type's maximum value (2,147,483,647) is larger than most tables' projected cardinality. However, use of tinyint and smallint for small dimensions, as well as use of bigint for very large dimensions, is relevant depending on the projected table cardinality over the life of the data warehouse. The SQL Server data type section below contains more information about integer data types, including their minimum and maximum values.
Data modeling also includes creating a record containing a default surrogate key used to represent null or not available records. Data integration processes will use the default surrogate key (e.g., 0) instead of a NULL value when a lookup doesn't produce a match.
Natural Keys
The natural key, also known as a business ID, is one or more attributes (columns) within a source system used to uniquely identify an entity. Figure 2-44 shows the natural key (ProductLevel1Cd) for an AdventureWorks product.

Figure 2-44: Natural key
Note that a natural key is used to uniquely identify one product, while the surrogate key is used to uniquely identify one instance of that product over time. Also note in the above example that the SourceSystemKey may also be part of the natural key when there's a risk of duplicated product codes from different source systems. The data warehouse data model needs to be flexible to account for changes in the uniqueness of dimension records as business rules change over time. However, the natural key is one value that should never change over time; when it does change, the result is considered a new item (e.g., a new product SKU).

Attributes
An attribute (or column) adds descriptive characteristics to a table. Attributes exist for all types of tables within a data model. Figure 2-45 shows some attributes for the AdventureWorks DimVendor table.

Figure 2-45: Attributes for DimVendor table
Measures or Metrics
A measure, or metric, is a numeric column typically used to store transaction amounts, counts, quantities, and ratios. Figure 2-46 shows some measures from the AdventureWorksDW2008 FactSalesDetail table.

Figure 2-46: Measures from the FactSalesDetail table
Measures are classified as either base or calculated. One or more base measures are used when resolving the value of a calculated measure. Decisions about which layer of a data warehouse architecture hosts measure calculations are made during data modeling sessions. These decisions are based on whether or not calculations are relevant to all subject areas in a data warehouse. Measures can also be categorized by how they are calculated:
- Additive – These measures aggregate across all dimensions with no special provisions. Examples of additive measures are the OrderQuantity and SalesAmount measures found in the AdventureWorksDW2008 fact tables.
- Semi-additive – These measures can be aggregated only over specific dimensions. Examples of semi-additive measures are account balances and inventory levels. These measures can be aggregated by some of the dimensions, but if account balances, for example, are summed over one year for one customer, the resulting sum of account balance snapshots would be inaccurate.
- Non-aggregated measures – Ratios and percentages are examples of non-aggregated measures. These measures can't be aggregated over any dimension.

Dates and Time
Dates play a central role within a data warehouse. Time does as well, but usually to a lesser degree. Date and time are special entities in a data warehouse and, as such, require particular attention in data modeling efforts. Dates and time are typically stored in the SQL Server datetime or smalldatetime data types and can represent:
- Time of day (e.g., May 10, 2010, 10:30am EST)
- Date (e.g., May 10, 2010)
- Time within a day (e.g., 10:30am EST)

Besides including intelligent surrogate keys for these dimensions, as described in the section on surrogate keys above, organizations often use date and time for "to date" calculations (e.g., year to date). Whether to keep date and time in a single dimension or to separate them is one of the questions data modeling touches on. Separating these two dimensions provides for greater manageability and usability in a data warehouse. In this approach, two surrogate keys are derived from one source column. For example, a source column for a transaction date of 2010-01-01 14:01:00 would yield a surrogate key for the Date dimension based on the 2010-01-01 segment, while a surrogate key for the Time dimension would be derived from the 14:01:00 part of this source column. Date and time are often used as role-playing dimensions, so one physical date dimension will logically be modeled to reference invoice date and transaction date, as well as create date and other dates in a data warehouse. Also note that date dimensions differ across vertical industries; for example, a financial date dimension will differ from a retail date dimension. Figure 2-47 shows an example of both a Date and a Time dimension.
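One way (among several) to derive the two surrogate keys from a single source datetime column is sketched below; the variable and column aliases are illustrative only.

-- Derive a Date key (yyyymmdd) and a Time key (hhmmss) from one datetime value.
DECLARE @TransactionDate datetime = '2010-01-01 14:01:00';

SELECT
    CAST(CONVERT(char(8), @TransactionDate, 112) AS int)    AS DateKey,  -- 20100101
    DATEPART(hour,   @TransactionDate) * 10000
  + DATEPART(minute, @TransactionDate) * 100
  + DATEPART(second, @TransactionDate)                      AS TimeKey;  -- 140100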


Figure 2-47: Date and Time dimensions
Support Columns
Support columns are included in data models to support tracking values over time. These columns allow for more efficient management of state changes for dimensions and for more effective auditing of data loads. Figure 2-48 shows some instrumentation support columns.

Figure 2-48: Support columns
The following columns are used to track different historical versions:
- StartDate – The date when this record version became active.
- EndDate – The date when this record became inactive. The current record will have either a NULL value or a value representing the maximum date (e.g., 12/31/9999).
- RecordStatus – The status of this record. Typical values are Active, Inactive, and Pending.
- VersionId – This value represents the version number of the record and is incremented every time a new version of the record is created.
- InferredMemberInd – This value indicates whether a dimension member was loaded during the fact load as an inferred member. Inferred members are dimension members that don't have attributes other than the business ID available at the time facts are being loaded (a lookup for the surrogate key during the fact load yields no results).

The other support column types are in support of data integration instrumentation. LineageId is an example of this kind of column and contains a value that lets data stewards track the process that loaded the record. Another example is InferredMemberInd, which as we just saw is a flag indicating whether a record is an inferred member or not. All of these columns are covered in more detail in Chapter 3, Data Integration.
Other Columns
Data modelers also address other types of columns as they are relevant to business requirements. These columns include spatial data, XML, large text, and images.
Keys
Data models include definitions of keys as well as changes to keys as data changes custody between data warehouse layers. Typically, data models account for introducing new keys for each change in data custody. These keys are implemented by using either the natural key from the source system or the surrogate key created during the data integration process. The blanket recommendation is for data warehouse data models to use surrogate keys for all primary/foreign key activity. The reasons for doing so include:
- More efficient representation – Natural keys in source systems are often defined using character data types, which are less efficient than integer data types in SQL Server. Additionally, natural keys may be represented as different values in different source systems, making them challenging to consolidate. Instead, use a surrogate key.
- Tracking changes across history – The need to report across history requires the data model to store the instance of that record at a particular point in time.

Natural keys are often used in data warehouses to aggregate multiple versions of one entity across all instances of this entity (i.e., one natural key for more than one surrogate key). Additional benefits of proper key management in data warehouses include:
- Enabling the conceptual and enterprise model – Failing to properly identify natural keys or to efficiently manage lookups of natural keys for surrogates negates the basic principles of data modeling. Key management is one of the essential factors for enabling accurate and effective conceptual and logical models.
- Preparing for the future – Managing natural and surrogate keys prepares the data warehouse for downstream data consumption. OLAP structures and BI semantic layer components perform more efficiently and provide for greater data analysis when data models include proper handling of keys.

Once a decision has been made to use surrogate keys, the next question becomes: How do we generate surrogate keys? The short answer is that the SQL Server data modeler and developer can use globally unique identifiers (GUIDs), IDENTITY columns, or a key generation utility. GUIDs usually are not used due to their size (16 bytes) and their randomly generated nature. For more information about surrogate key generation, see the Surrogate Key section in Chapter 3, Data Integration.
Dimensions
Dimension is a term most commonly associated with Ralph Kimball. In his white paper Facts and Fables about Dimensional Modeling, Kimball attributes the first reference to facts and dimensions to a joint research project conducted by General Mills and Dartmouth University in the 1960s. Wikipedia states that: "Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts."
Dimensions are often modeled based on multiple entities from one or more source systems. Depending on reporting needs, consumers of a data warehouse system will access information in entities for operational reporting, while analytical reporting will rely on dimensions. Traditionally, flattened, denormalized structures are the most efficient data model technique because they require the least amount of joins to produce the requested result. However, in a few cases, a dimension can become too wide, and data modelers need to opt for a more normalized structure. For more information about dimensions, a good starting point is Kimball's articles at his Web site.
Let's spend the rest of this section getting an overview of the most common dimension types and concepts:
- Star dimension – A star dimension is a fully denormalized version of a dimension, where all hierarchies and reference values are merged into the dimension. Star dimension tables are directly linked to the fact table.
- Snowflake dimension – Snowflake dimensions contain more than one dimensional table to include references to all relevant attributes. The AdventureWorks Product Category hierarchy, shown in Figure 2-49, is an example of a snowflake dimension. In a snowflake dimension, some of the tables may be indirectly linked to the fact table.

Figure 2-49: Snowflake schema example
- Parent-child dimension – This dimension type is used to model hierarchies such as an organizational chart or a chart of accounts. These hierarchies are unbalanced and ragged, which makes them difficult to model using a snowflake technique. The AdventureWorksDW2008 DimEmployee table, shown in Figure 2-50, is an example of a parent-child hierarchy.

Figure 2-50: Parent-child dimension example
- Role-playing dimension – A role-playing dimension occurs when one dimension is linked to multiple times within one table. The most common example of this kind of dimension is Date, where one dimension is used to provide information on order date, invoice date, create date, and other dates, as shown in Figure 2-51.

Figure 2-51: Role-playing dimensions
- Junk dimension – Junk dimensions represent a collection of low-cardinality, non-related attributes contained within one dimension, where each possible combination of attributes is represented with a single surrogate key. This design decision is purely for optimization: one 2-byte junk dimension key is more efficient than five 1-byte keys, which results in significant savings in storage space for fact tables with billions of rows or more.
- Degenerate dimension – A degenerate dimension does not have its own table; it is represented by its value within the fact table. This typically occurs in transaction tables, and examples are order number and invoice number. Degenerate dimensions are useful to capture the transaction number or natural primary key of the fact table. Because it does not make sense to create a dimension with no attributes, the attributes instead may be stored directly in the fact table. Figure 2-52 shows two examples of degenerate dimensions within the AdventureWorksDW2008 FactInternetSales table.

Figure 2-52: Degenerate dimensions
Tracking History
Accurately reporting historical results often requires that the state of a dimension at a particular point in time be recorded and saved, as opposed to being overwritten. There are basically three methods for tracking history:
- Update one record – There's one version of a record, and all changes are applied to this one record. This approach is often referred to as a Type I Slowly Changing Dimension (or SCD I).
- Track record changes – Every change in a record will result in a new version of that record with a unique surrogate key. This is often referred to as a Type II SCD (or SCD II).
- Add new columns – Every change in a key value results in a new column added to the table. This is often referred to as a Type III SCD (or SCD III).

It's often not practical to add columns, so the options come down to two data integration patterns: versioned inserts and versioned updates. Each of these is covered in more detail in Chapter 3, Data Integration. Now that dimensions have been covered, the next topic discusses fact tables, which typically model a business transaction or business process.
Fact Tables
The largest tables within source systems are the transaction tables. These tables are often orders of magnitude larger than dimensions and are modeled as fact tables within a dimensional data model. Another common source for fact tables is business processes. Fact tables are optimized structures that are typically comprised of numbers, dates, and very small character columns. The numbers are divided into numeric and monetary values and foreign key references to dimensions. Classification of facts can be done by fact table types and categories.

Fact table categories were introduced in the Column Types section above and include additive, semi-additive, and non-additive measures. Another category is a custom rollup fact table, where the aggregation rules are specific to a dimensional value or values. The most common example of this is a financial chart of accounts. Fact table types include:
- Transactional – This is the most common type of fact table. The classic example is entering one product sale at a store. Transaction facts are usually additive; that is, SQL aggregate functions such as SUM, MIN, MAX, and COUNT can be applied to the measures.
- Snapshot – The facts within a snapshot fact table are not additive across one or more dimensions, typically the Date dimension. The classic example is inventory, where the measures represent values at a point in time. Inventory is not additive across time, but it is additive across the other dimensions referenced by the snapshot fact table. Other examples include event booking levels and chart of account balance levels.
- Factless – This table has no measured facts; row counts are the only measure. It is typically used to describe events (something that has or has not happened) or many-to-many relationships such as coverage models. It contains only dimensions, or one fact with a value of 0 or 1. Common examples include class attendance, event tracking, coverage tables, promotions, or campaign facts. Ralph Kimball used voyages as the example in his seminal book, The Data Warehouse Toolkit.
- Aggregate – Aggregate fact tables include information rolled up at a certain hierarchy level. These tables are typically created as a performance optimization in support of a business reporting requirement. Deriving an aggregation table once in a data integration process is much more efficient than aggregating every time the information is requested. Aggregations need to account for the additive nature of the measures, whether created on the fly or by pre-aggregation.

Reference Data
Reference data was described earlier in this chapter as a simple classification of a noun or transaction. These classifications play a key role in data warehouse reporting and analytics by providing the ability to filter and/or aggregate results. Reference data is considered a category of master data. The DimScenario reference table, shown in Table 2-1, is one example of reference data within the AdventureWorks samples.

ScenarioKey  ScenarioName
1            Actual
2            Budget
3            Forecast


Table 2-1: AdventureWorksDW2008 DimScenario reference table
This table is a common reference table found in budgeting and forecasting software and allows the filtering of fact tables. Note that the above example isn't truly representative because many reference data tables consist of a character code and description. For example, a State reference table would have state code and description columns, as shown in Table 2-2.

StateCode  StateName
NH         New Hampshire
NY         New York

Table 2-2: Example State reference table
In the above example, the data modeler decides whether to add a surrogate key, as shown in Table 2-3, or to use the natural key.

StateKey  StateCode  StateName
29        NH         New Hampshire
32        NY         New York

Table 2-3: Example State reference table with surrogate key
The general data modeling rule is that adding a surrogate key rarely has a negative impact. The more relevant question is whether to use the surrogate key (StateKey) or the natural key (StateCode) as the foreign key. The general rule of thumb is to always use the surrogate key, but the downside is that:
- There will always be a need to join to the reference table.
- Reference tables with versioned inserts reference an instance of a reference table entry.

Data modelers often include the natural key and description in a denormalized table when modeling hierarchies, as shown in Table 2-4.

CityKey  City     StateCode  StateName
1477     Concord  NH         New Hampshire
1597     Nyack    NY         New York

Table 2-4: Denormalized City table
Merging Reference Data
The data warehouse team will often need to merge multiple reference data tables into one master reference table. This is required when there are multiple source systems, each with its own set of reference tables and codes. In addition, organization-specific reference data frequently needs to be merged with industry-standard data because the standard either does not fully meet the organization's needs or the reference data existed prior to the standard. This is a common occurrence in the health care industry, where codes used in subject areas such as clinical analysis are organization-specific. Creating a single version of the truth for reference tables and hierarchies is a significant portion of MDM efforts. Merging different clinical codes may require the creation of a new natural key, which is represented by two or more different natural keys in their respective source systems. Note that using a surrogate key as the natural key could result in an issue if the reference table supports versioned inserts.
What seems like a very simple issue can turn into something much more complex. Although adding a surrogate key doesn't hurt, determining whether to use the surrogate key or the natural key for foreign key references is a function of:
- Whether the natural key will ever change
- Whether the objective is to reference a versioned instance of the record or the record regardless of version

This decision becomes more important when modeling hierarchies, which are often comprised of multiple levels, each with its own reference table.
Hierarchies
A hierarchy is a multi-level structure in which each member is at one level within the hierarchy. It is a logical structure of dimension attributes that uses ordered levels to define data aggregations and end-user navigational drill paths. Each hierarchy member has zero or one parent and zero or more children. Figure 2-53 shows an example of a date hierarchy.

Figure 2-53: Date hierarchy
The leaf level is the lowest level in the hierarchy, and each member within this leaf level is connected to one parent. This structure is repeated for each level above the leaf level, until the top level is reached.

Multiple hierarchies can be created from one source. Figure 2-54 shows an example of this: The AdventureWorksDW2008 DimDate table is displayed on the left, and the multiple hierarchies that can be created from this one table are displayed on the right. (Note that this SSAS example was used to visually show multiple hierarchies; SSAS is not required when creating multiple hierarchies within a data warehouse.)

Figure 2-54: DimDate table hierarchies
Hierarchies are a large topic, and this chapter covers balanced, ragged, unbalanced, and network hierarchies.

Balanced Hierarchies
Figure 2-54 above is an example of a balanced hierarchy. All branches of a balanced hierarchy descend to the same level, with each member's parent being at the level immediately above the member. Balanced hierarchies can be collapsed into one table, as in the above example, or exist in multiple tables, as with the Product Category hierarchy shown in Figure 2-49. In a standard balanced hierarchy, each level of the hierarchy is stored in one and only one column of the dimension table.
Ragged Hierarchies
In a ragged hierarchy, the parent of a member can come from any level above the level of the member, not just from the level immediately above. This type of hierarchy can also be referred to as ragged-balanced because levels still exist. A Geography dimension is a typical example of such a hierarchy. In some countries, the province/state level may not exist, as with the Republic of Singapore or Vatican City State, for example.
Unbalanced Hierarchies
Unbalanced hierarchies include levels that have a consistent parent-child relationship but have logically inconsistent levels. The hierarchy branches can also have inconsistent depths. An example of an unbalanced hierarchy is an organization chart, which shows reporting relationships among employees in an organization. The levels within the organizational structure are unbalanced, with some branches in the hierarchy having more levels than others. The AdventureWorksDW2008 DimEmployee dimension table is an example of an unbalanced hierarchy.
Hierarchies and History
Hierarchies change over time, which requires the data warehouse to display the correct hierarchy based on a particular date range or period in time. One example is reporting on sales based on the Sales Territory hierarchy. Figure 2-55 shows the Northeast sales territory in 2009 and 2010. This hierarchy has three levels:
- Division (Northeast)
- Region (Mid-Atlantic, New England)
- District (State)

Figure 2-55: Northeast sales territory
As you can see, sales management changed the sales territories in 2010 and subdivided New York and Connecticut as follows:
- Created a new district, Upstate NY, and assigned it to the New England region. The remainder of New York stays within the New York district.
- Divided Connecticut into two districts:
o North of Hartford is now part of the New England region.
o Hartford and south is part of the Mid-Atlantic region.

However, the business consumer may still want to see 2009 Northeast sales totals based on:
- How the sales territory was organized in 2009
- The current sales territory structure (i.e., the 2010 sales territory)

Modeling a Versioned Balanced Hierarchy
A hierarchy can be modeled in several ways, depending on whether the hierarchy is balanced or not. When the hierarchy is balanced, it can be represented by multiple reference tables, where each reference table is used to populate one level in the hierarchy. This can be modeled either by each level containing a foreign key reference to its parent or in a flattened structure, as shown in Table 2-5. This table shows how the changed sales territory information presented above is stored within a denormalized Sales Territory table. Note that this table could also be modeled in a normalized structure, using multiple reference tables with foreign key references to their parent. However, each of the reference tables would need to be versioned to account for the sales territory versioned changes over time.

PK  Ver #  RecSts  Start     End         Div  Div. Desc  RegCd  RegionDesc    DistCd  DistrictDesc       Mod Dt
1   3      I       1/1/2009  12/31/2009  NE   Northeast  MA     Mid-Atlantic  NY      New York           2/1/2009
2   3      I       1/1/2009  12/31/2009  NE   Northeast  NE     New England   CT      Connecticut        2/1/2009
3   4      A       1/1/2010  NULL        NE   Northeast  MA     Mid-Atlantic  NY      New York           2/1/2010
4   4      A       1/1/2010  NULL        NE   Northeast  NE     New England   UNY     Upstate New York   2/1/2010
5   4      A       1/1/2010  NULL        NE   Northeast  NE     New England   NCT     Connecticut North  2/1/2010
6   4      A       1/1/2010  NULL        NE   Northeast  MA     Mid-Atlantic  SCT     Connecticut South  2/1/2010

Table 2-5: Sales territory structure
Note the following about the Sales Territory table (a query sketch follows these notes):
- Sales Division, Region, and District are all separate reference tables. Their natural key is used to model the sales territory. Using the natural key, as opposed to the surrogate key, avoids an issue of returning different results from reference tables that support versioned inserts.
- The version of the hierarchy is a key attribute in this example. Many hierarchical structures, such as a financial chart of accounts or a sales territory, are created as one instance or version, as opposed to treating each separate change as its own version.
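The following sketch shows how the Start and End columns from Table 2-5 can be used to return the hierarchy as it existed on a given "as of" date. The table and column names (dbo.DimSalesTerritory, StartDate, EndDate, and so on) are illustrative assumptions rather than an actual schema.

-- Return the sales territory rows in effect on a given "as of" date.
DECLARE @AsOfDate date = '2009-06-30';

SELECT DivisionCd, RegionCd, DistrictCd, DistrictDesc
FROM dbo.DimSalesTerritory
WHERE StartDate <= @AsOfDate
  AND (EndDate IS NULL OR EndDate >= @AsOfDate);

Running the same query with an @AsOfDate in 2010 would return the four version-4 rows instead of the two version-3 rows.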

The next consideration is how dimension members that are children of the Sales Territory table (such as Sales Rep and Company of both Suppliers and Customers) should be modeled. Figure 2-56 shows the options available to data modelers.

Figure 2-56: Hierarchy modeling options
Data modeling options include:
1. Store the surrogate key of the Sales Territory instance.
2. Store the natural key of the sales territory.
3. Denormalize the Sales Territory into the Company table.
4. Add to the hierarchy a Company level that contains the Company surrogate and natural keys. Every version of the hierarchy will include all companies.

Options #1 and #3 would both require a different version for every Sales Rep record each time the sales territory changed. Option #2 would not require an extra version of the Sales Rep record, but would require the query to have a version filter to return the desired version of the sales territory. Option #4 would require the creation of one Sales Rep record for every version of the sales territory.
Modeling a Time-Dependent Balanced Hierarchy
Other hierarchies change based on events as opposed to a versioned release. One example of this scenario is a company linking structure that links subsidiaries and legal entities to their parent company. Figure 2-57 shows a simple example with two parent companies, each consisting of two companies. The time-dependent event would be Holding Company B's acquisition of Company A.

Figure 2-57: Company linking example
Table 2-6 shows how this two-level hierarchy could be modeled.

SK   NK   Company    Par Cd  Parent Name        Start     End       Rec Sts  Ver #
100  C_A  Company A  HC A    Holding Company A  1/1/1980  NULL      A        1
101  C_B  Company B  HC A    Holding Company A  1/1/1980  5/2/2010  I        1
102  C_C  Company C  HC B    Holding Company B  1/1/1980  NULL      A        1
103  C_D  Company D  HC B    Holding Company B  1/1/1980  NULL      A        1
104  C_B  Company B  HC B    Holding Company B  5/3/2010  NULL      A        2

Table 2-6: Modeling a company linking hierarchy
Notice that in this example, the changes are in response to an independent event instead of a scheduled version release. Any query that aggregates or filters on the parent company could return different results when you apply an "as of" date to the Start and End date ranges.
Ragged and Unbalanced Hierarchies
As discussed earlier, this class of hierarchy will have different levels and will be unbalanced. Organizational charts and a financial chart of accounts are typical examples of ragged, unbalanced hierarchies. These hierarchies are often modeled as parent-child structures. The DimEmployee dimension illustrated in Figure 2-50 above is a good example of this.
Although the parent-child structure makes it easy to store these values, reporting becomes more difficult because the table is not level-centric, meaning there isn't one table for each level in the hierarchy. When reporting, the hierarchy levels become more important because aggregation is frequently based on a level number. The options in this case are:
- Create a balanced hierarchy
- Transform to a balanced hierarchy within SQL Server

Creating a balanced hierarchy from a ragged, unbalanced hierarchy is typically done by following these steps:
1. Create a balanced hierarchy table, with code and description columns for the maximum level within the ragged hierarchy.
2. Populate this balanced hierarchy table starting with the parent and descending down through the hierarchy.
3. Repeat parent values in the child level if the number of levels within a branch is less than the maximum number of levels.
Using Common Table Expressions
In SQL Server, Common Table Expressions (CTEs) are a powerful tool for querying hierarchies. For example, Figure 2-58 shows a CTE for querying the HumanResources.Employee hierarchy in the AdventureWorks sample database.
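The exact query in Figure 2-58 is not reproduced here, but the following is a minimal recursive CTE of the same kind. The column names (EmployeeID, ManagerID, Title) assume the original AdventureWorks sample; newer versions of the sample replace these with BusinessEntityID and a hierarchyid column, so adjust accordingly.

-- Recursive CTE returning each employee with a derived organization level.
WITH EmployeeHierarchy (EmployeeID, ManagerID, Title, OrgLevel) AS
(
    -- Anchor member: the top of the hierarchy (no manager).
    SELECT EmployeeID, ManagerID, Title, 0 AS OrgLevel
    FROM HumanResources.Employee
    WHERE ManagerID IS NULL

    UNION ALL

    -- Recursive member: each employee joined to the level above it.
    SELECT e.EmployeeID, e.ManagerID, e.Title, eh.OrgLevel + 1
    FROM HumanResources.Employee AS e
    INNER JOIN EmployeeHierarchy AS eh ON e.ManagerID = eh.EmployeeID
)
SELECT EmployeeID, ManagerID, Title, OrgLevel
FROM EmployeeHierarchy
ORDER BY OrgLevel, ManagerID;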


Figure 2-58: Common Table Expression example

Results of this CTE include manager and employee IDs, relevant titles, and levels within the organization, as Figure 2-59 illustrates.


Figure 2-59: Common Table Expression results

For more information about SQL Server CTEs, see WITH common_table_expression (Transact-SQL). Note that CTEs are not supported in the initial release of PDW. Also note that in a network hierarchy, nodes can contain more than one parent. A common example of a network hierarchy is a family tree.

Which Hierarchy-Handling Option Is Most Effective?
In data warehouse architecture, the most effective design option for handling hierarchies has proven to be flattening hierarchies into a single table. If more than one hierarchy is defined for a dimension, all hierarchies should be included in that one table. This approach eliminates joins between the main dimension table and lookup tables, improving data retrieval, which is what data warehouse systems are built for. Finally, in parent-child hierarchies, you can use a typical ID-ParentID recursive approach for smaller dimensions. For larger dimensions, however, this technique can have significant performance issues; a recommended strategy in these cases is to introduce many-to-many (factless) bridge tables. This approach works well when extending data models to OLAP, especially with SSAS.

Bridge Tables
Bridge tables hold values for multiple instances of relationships between entities. These containers for storing many-to-many relationships are also referred to as junction or cross-reference tables. One example of using bridge tables is handling security privileges for users in a data warehouse. Figure 2-60 depicts this scenario, with the UserSecurity table serving as a bridge table.
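As a minimal sketch of this pattern (the table and column definitions below are hypothetical and are not taken from Figure 2-60), a bridge table resolving a many-to-many relationship between users and the sales territories they may see could be declared as:

CREATE TABLE dbo.DimUser (
    UserSK     int          NOT NULL PRIMARY KEY,
    UserLogin  varchar(100) NOT NULL
);

CREATE TABLE dbo.DimSalesTerritory (
    SalesTerritorySK  int         NOT NULL PRIMARY KEY,
    TerritoryName     varchar(50) NOT NULL
);

-- Bridge (junction) table: one row per user/territory privilege
CREATE TABLE dbo.UserSecurity (
    UserSK            int NOT NULL REFERENCES dbo.DimUser (UserSK),
    SalesTerritorySK  int NOT NULL REFERENCES dbo.DimSalesTerritory (SalesTerritorySK),
    CONSTRAINT PK_UserSecurity PRIMARY KEY (UserSK, SalesTerritorySK)
);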


Figure 2-60: Bridge table example

Nulls and Missing Values
This section outlines guidelines for handling nulls and missing values when architecting effective data warehouse systems.

Null Values
Nulls are not recommended for attribute values because they provide no meaning for analysis. The existence of nulls also translates into more complex queries (i.e., the existence of nulls must be checked in addition to comparing values). In addition, nulls in business ID columns need to be handled before they reach data warehouse tables.
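As a minimal sketch (the staging table and column names are hypothetical), a data integration step might apply the typical defaults described in the paragraph that follows ('N/A' for code attributes, 'Unknown' for descriptive attributes) before a row reaches the dimension table:

SELECT
    src.CustomerID,
    ISNULL(src.RegionCode, 'N/A')            AS RegionCode,         -- default for a coded attribute
    ISNULL(src.RegionDescription, 'Unknown') AS RegionDescription   -- default for a descriptive attribute
FROM staging.Customer AS src;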


Business rules implemented in data integration processes need to include handlers for nulls and process them based on the type of attribute they are assigned to. Typical rules for handling null values for dimension attributes include replacing code values with 'N/A' and replacing descriptive values with 'Unknown'. Of course, data integration processing of null values must be consistent across the data warehouse; otherwise, data will be subject to varying interpretations. Finally, if business ID columns include nulls, depending on the relevant business rules, the corresponding records can be prevented from loading into the data warehouse by diverting them to exception tables. Alternatively, they can be loaded under an 'Unknown' dimension member.

For measures, the decision to persist null values is sometimes a function of the downstream consumers. For example, OLAP engines such as SSAS that support sparse cubes typically perform much better when null values are used for the fact table measures that populate the OLAP database.

Missing Values (Empty Strings)
Empty strings for attributes usually don't translate into an asset for a data warehouse. Preserving empty strings can lead to inconsistent reporting and confusion for information workers if an empty string in a source system is equivalent to a null value. Business consumers should have input into whether null values and empty strings can both be replaced with the same default. If so, data integration business rules can replace missing values and null values with the same default value.

Referential Integrity
Referential integrity defines relationships between entities as stipulated by business rules in the logical data warehouse model. Enforcing referential integrity is a core element of data quality. Traditionally, the best practice in data modeling for preserving referential integrity was to create a foreign key constraint. However, this practice can cause performance issues when applied to large tables with a significant number of foreign keys, because SQL Server enforces referential integrity by verifying that a record exists in the referenced table prior to each insert. This is a consideration for most SQL Server data warehouses because many fact tables have a large number of foreign keys and very large row counts.

The alternative is to enforce referential integrity within the data integration processes rather than creating foreign key constraints. This is a best practice in large SQL Server data warehouse implementations and applies to both the staging and data warehouse data stores.

Clustered vs. Heap
One of the important physical design decisions for SQL Server tables is whether a table should be modeled with a clustered index or as a heap. SQL Server clustered tables have one index that stores and sorts the data rows based on the index key values. Tables without a clustered index are referred to as heap tables. Figure 2-61, from SQL Server 2008 R2 Books Online, illustrates the difference between heaps and clustered tables at a physical level.


Figure 2-61: Heap vs. clustered table on-disk structure

The leaf level of a clustered table index is the data page, which is organized by the clustered index key. Heap tables are organized by their insert order; each insert appends the record to the last data page. Table 2-7 compares clustered and heap tables with no indexes across storage and access options.

Category    | Heap                                  | Clustered
Storage     | Data is stored in insert order        | Data is stored by the clustered key
Reads       | Table scan                            | Index if the request is by the clustered key
Inserts     | Inserts occur at the end of the table | Inserts occur by the clustered key and can result in fragmentation
Updates     | Updates can result in page splits     | Updates can result in page splits
Overhead    | No overhead                           | Additional overhead for both disk space and time to manage the index
Maintenance | Less need for defragmentation         | Defragmentation required due to clustered key inserts

Table 2-7: Heap vs. clustered table comparison

The general rule for SQL Server data modelers is that clustered tables provide better performance when the clustered key(s) is commonly used for query operations. This is typically the case for dimensional data models. The MSDN SQL Server best practices article Clustered Indexes and Heaps recommends that a clustered index always be created for a table. However, data modelers should recognize that creating a clustered index on every table will result in additional load times for very large fact tables. That additional load time may be unacceptable if it causes the data integration processes' execution times to overlap with the time the data warehouse is available to business consumers.

Heap tables with no indexes are necessary in VLDW scenarios. In this case, the design is optimized for loading: data is inserted into heap tables, and updates are never applied because they are too expensive for very large tables. Heap tables are a consideration when large tables are not directly accessed by business consumers, such as the data warehouse database within a hub-and-spoke architecture. In this scenario, the fact tables within the data mart spokes should be modeled as clustered unless VLDW performance considerations prohibit it.

Here are some implementation notes to keep in mind about clustered indexes:
- For clustered tables, the data integration process should sort the source data by the clustered index key, which in most cases will translate into records being inserted at the end of the table, reducing fragmentation.
- These indexes may need to be maintained on a regular basis because clustered indexes can suffer from fragmentation depending on the extent of insert and update activity. Fragmentation in clustered tables can easily be resolved by rebuilding or reorganizing the clustered index. However, this can be a very lengthy operation, especially for fact tables with cardinality in the billions of rows.
- It's recommended that the clustered index column be an integer value. For dimensional data models, it's common for the Date dimension's primary key to be a 4-byte integer (YYYYMMDD). For example, the date Aug 4, 2010, is stored as 20100804. The clustered index key for the fact table is also a date represented in this format.
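A minimal sketch of a fact table clustered on such a key (the table and column names are hypothetical, not from the toolkit):

CREATE TABLE dbo.FactSales (
    DateKey        int            NOT NULL,   -- e.g., 20100804 for Aug 4, 2010
    ProductKey     int            NOT NULL,
    CustomerKey    int            NOT NULL,
    SalesAmount    decimal(18, 2) NOT NULL,
    OrderQuantity  int            NOT NULL
);

-- Cluster on the date key so nightly loads sorted by date append to the end of the table
CREATE CLUSTERED INDEX CIX_FactSales_DateKey
    ON dbo.FactSales (DateKey);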


Chapter 5 includes details about performance and query plans when accessing clustered vs. heap tables.

SQL Server Data Type Considerations
As a general rule, data modelers should choose the smallest data types when possible to reduce the total size of a record. This in turn reduces the amount of storage, disk I/O, and network bandwidth required to load and retrieve the data. This section reviews the data type options available to data modelers.

Character Data Types
The general rules of thumb for choosing character data types are:


- Only use nchar and nvarchar when the universe of values spans or will span multiple languages. This is because the nchar and nvarchar data types are twice as large (2 bytes vs. 1 byte per character) as their char and varchar equivalents.
- Use varchar or nvarchar for columns with descriptive values, names, or addresses. These varying data types are represented by a length and a value, which makes them slower to locate within a data page but smaller to store when there's any variability in the length of the values.
  - Source system char data types are often space-padded, so a name column defined as char(8) would store the name Smith as "Smith" followed by three trailing spaces.
- Use char for columns containing coded values. SQL Server char columns are more efficient because they are stored on disk in fixed locations within a record. Columns containing codes and abbreviations, such as StateCode, should be a char(2), not a varchar(2).


For more information about SQL Server character data types, see the following links:
- char and varchar
- nchar and nvarchar (Transact-SQL)
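A brief illustration of these rules of thumb (a hypothetical dimension, not one defined in this toolkit):

CREATE TABLE dbo.DimCustomer (
    CustomerSK    int           NOT NULL,
    StateCode     char(2)       NOT NULL,    -- fixed-length coded value
    CustomerName  varchar(100)  NOT NULL,    -- variable-length descriptive value
    -- nvarchar only where the values can span multiple languages
    CustomerNameLocalized  nvarchar(100) NULL
);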

Integers
Exact integers are the most common data type found in data warehouses and are used for measures, counts, and surrogate keys. The SQL Server data modeler can choose to model exact integers as:
- 1 byte (tinyint): 0 to 255
- 2 bytes (smallint): -32,768 to 32,767
- 4 bytes (int): -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)
- 8 bytes (bigint): -2^63 to 2^63-1

Keep these notes in mind about SQL Server integer data types:
- There are no unsigned smallint, int, or bigint data types.
- Date dimensions are an example where smallint surrogate keys can be used, provided that the date range is less than about 89 years (i.e., 32,767 divided by the number of days in a year).
- Int is the most common data type used in SQL Server data warehouse models. Bigint should only be used when the maximum value has a chance of exceeding 2.14 billion, the maximum value for an int.
- GUIDs are 16-byte values and are a convenient way to generate unique surrogate key values. However, they are not recommended if other options such as IDENTITY columns or a key generator are available. This recommendation is discussed in more detail in Chapter 3 - Data Integration.

For more about SQL Server integer data types, see int, bigint, smallint, and tinyint (Transact-SQL).

Numeric Data Types
SQL Server includes the decimal, numeric, real, float, money, and smallmoney data types to store numbers. For data warehousing implementations, the use of the precise data types (decimal, numeric, money, and smallmoney) is recommended. The money data type can be used instead of decimal as long as requirements don't call for more than four decimal digits. The approximate data types, real and float, are not recommended because they can lose precision when aggregated. If source data is stored in the real data type, there can be a slight loss of precision when converting to decimal, but decimal data types are more accurate for querying in WHERE conditions and are typically more compatible with applications consuming data from a data warehouse. See Using decimal, float, and real Data for more information about SQL Server real data types.

Date Data Types
Date and time values are relevant to data warehouse implementations from the perspective of designing calendar dimensions. Depending on the grain of the data in a data warehouse, modelers can decide to have a single calendar dimension or one dimension for date values and another for time values. The datetime2 data type should be considered for columns that store full date values when the detail includes hours, minutes, seconds, and milliseconds. The date data type should be considered if the data contains only year, month, and day values, due to the storage savings. If a dimension holding time-of-day values is modeled, use the time data type. With two calendar-type dimensions, fact tables carry both surrogate keys, providing for more efficient data analysis. You can find more information about SQL Server date data types at datetime (Transact-SQL).

Other Data Types
SQL Server supports many other data types. For a complete list, see Data Types (Transact-SQL). Data warehouses, however, mostly use the data types covered in this section, predominantly because data warehouses are architected for the most efficient data analysis. Note that the emergence of XML as a standard for data integration shouldn't translate into XML being used as a data type in data warehouses. Data modelers should review the XML data and decide whether to shred it into a table structure more conducive to efficient querying.

Very Large Data Sets
When you are architecting data warehouses with very large data sets, you can run into data consumption and data management difficulties. While there are a number of considerations to keep in mind when architecting for VLDBs, the major points related to the data warehouse architecture revolve around removing contention between read and write operations and maintaining the effectiveness of data retrieval and indexes.


Read and Write Contention
Organizing data files, log files, and tempdb onto separate disks is the focus of the effort to remove conflicts between read and write processes. This topic is covered in the Database Architecture section earlier in this chapter.

Effective Data Retrieval and Indexes
The size of a VLDB often introduces difficulties with quick retrieval of the data needed for reports and analysis. Data consumption also becomes more complicated because indexes tend to grow to the point that their maintenance becomes impractical. You can address these issues in SQL Server by incorporating a partitioning strategy for both data and indexes. Table partitioning, which we referenced in the Database Architecture section above, is an important tool for getting the most value out of your data warehouse data. See Chapter 4 for details about table partitioning, and see Chapter 5 for how to set up partitioned tables for efficient query access.
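Chapter 4 covers partitioning in depth; as a brief, hypothetical sketch (the object names and boundary values here are invented for illustration), a monthly partitioning strategy on an integer date key might look like this:

-- Partition function: boundary values define monthly ranges on the integer date key
CREATE PARTITION FUNCTION pfMonthly (int)
    AS RANGE RIGHT FOR VALUES (20100101, 20100201, 20100301, 20100401);

-- Partition scheme: map every partition to a filegroup (a single filegroup shown for brevity)
CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- Create the fact table on the partition scheme, partitioned by the date key
CREATE TABLE dbo.FactSales_Partitioned (
    DateKey     int            NOT NULL,
    ProductKey  int            NOT NULL,
    SalesAmount decimal(18, 2) NOT NULL
) ON psMonthly (DateKey);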


Conclusion and Resources


A successful data warehouse requires a solid data architecture. Data architecture is a broad topic, and this chapter has focused on the data architect and data developer responsibilities and deliverables, including database architecture, platform architecture, and data models.

First, the database architecture depends on the data warehouse implementation pattern, which is typically a centralized EDW, a federated data mart, or a hub-and-spoke approach. In all cases, the data warehouse team is faced with a series of challenges, including scope (maintaining a single version of the truth throughout the implementation), scale (handling huge data volumes), and quality (delivering results that business users trust).

Providing a single version of the truth presents the data warehouse team with significant technical challenges. However, there are larger business and process issues that have resulted in the emergence of data governance as a key corporate activity. Master data and MDM are crucial to arriving at a single version of the truth and are old problems in data warehousing that have spawned an emerging class of software products. The decision about whether to purchase and use an MDM software product is not a foregone conclusion and depends on a variety of factors.

Once you've selected an appropriate data warehouse implementation pattern, a robust platform architecture is required to support the data warehouse volumes, including loading data into the data warehouse and obtaining results from it. The SQL Server platform provides customers with a variety of options, including reference hardware solutions (Fast Track Data Warehouse), data warehouse appliances (SQL Server 2008 R2 PDW), and data virtualization (SQL Server 2008 R2 Enterprise Edition or Data Center).


As stated above, the physical best practices and guidance within this chapter are for the symmetric multiprocessing (SMP) versions of SQL Server 2008 R2, due to some differences in functionality between the SQL Server 2008 R2 SMP release and the initial release of PDW.

The next key deliverable is developing the correct data models for the different databases within the data warehouse. The level of denormalization your database development team chooses depends on whether the database's user community is business consumers (denormalized) or data integration developers and data stewards (more normalized).

As we have seen in this chapter, the data architecture deliverables provide the foundation for the core data warehouse activities, which we cover in the remaining chapters of this toolkit:
- Loading data into the data warehouse (Chapter 3 - Data Integration)
- Managing the data warehouse (Chapter 4 - Database Administration)
- Retrieving results from the data warehouse (Chapter 5 - Querying, Monitoring, and Performance)


Resources
To learn more about data warehouse data architecture considerations and best practices, see the following links:
- Best Practices for Data Warehousing with SQL Server 2008: http://msdn.microsoft.com/en-us/library/cc719165(v=SQL.100).aspx
- Clustered Indexes and Heaps: http://msdn.microsoft.com/en-us/library/cc917672.aspx
- Data Compression: Strategy, Capacity Planning and Best Practices: http://msdn.microsoft.com/en-us/library/dd894051(v=SQL.100).aspx
- Data Governance Institute's Web site: http://www.datagovernance.com/
- The Data Governance & Stewardship Community of Practice: http://www.datastewardship.com/
- The Data Loading Performance Guide: http://msdn.microsoft.com/en-us/library/dd425070(v=SQL.100).aspx
- Data Warehousing 2.0 and SQL Server: Architecture and Vision: http://msdn.microsoft.com/en-us/library/ee730351.aspx
- Data Warehouse Design Considerations: http://msdn.microsoft.com/en-us/library/aa902672(SQL.80).aspx
- Fast Track Data Warehouse 2.0 Architecture: http://msdn.microsoft.com/en-us/library/dd459178(v=SQL.100).aspx
- Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation: http://msdn.microsoft.com/en-us/library/dd459147(v=SQL.100).aspx
- Introduction to New Data Warehouse Scalability Features in SQL Server 2008: http://msdn.microsoft.com/en-us/library/cc278097(v=SQL.100).aspx

- An Introduction to Fast Track Data Warehouse Architectures: http://technet.microsoft.com/en-us/library/dd459146(SQL.100).aspx
- Introduction to the Unified Dimensional Model (UDM): http://msdn.microsoft.com/en-US/library/ms345143(v=SQL.90).aspx
- Kimball University: Data Stewardship 101: First Step to Quality and Consistency: http://www.intelligententerprise.com/showArticle.jhtml?articleID=188101650
- Partitioned Tables and Indexes in SQL Server 2005: http://msdn.microsoft.com/en-US/library/ms345146(v=SQL.90).aspx
- Partitioned Table and Index Strategies Using SQL Server 2008: http://download.microsoft.com/download/D/B/D/DBDE7972-1EB9-470A-BA1858849DB3EB3B/PartTableAndIndexStrat.docx
- Scaling Up Your Data Warehouse with SQL Server 2008: http://msdn.microsoft.com/en-us/library/cc719182(v=SQL.100).aspx
- Storage Top 10 Best Practices: http://msdn.microsoft.com/en-US/library/cc966534.aspx
- Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server: http://msdn.microsoft.com/en-US/library/cc966457.aspx
- Thinking Global BI: Data-Warehouse Principles for Supporting Enterprise-Enabled Business-Intelligence Applications: http://msdn.microsoft.com/en-us/architecture/aa699414.aspx
- Top 10 SQL Server 2005 Performance Issues for Data Warehouse and Reporting Applications: http://msdn.microsoft.com/en-US/library/cc917690.aspx
- Using SQL Server to Build a Hub-and-Spoke Enterprise Data Warehouse Architecture: http://msdn.microsoft.com/en-us/library/dd458815(v=SQL.100).aspx



Chapter 3 - Data Integration


By Microsoft Corporation
Acknowledgements:
Contributing writers from Solid Quality Mentors: Larry Barnes, Erik Veerman
Technical reviewers from Microsoft: Ross LoForte, Benjamin Wright-Jones, Jose Munoz
Contributing editors from Solid Quality Mentors: Kathy Blomstrom
Published:
Applies to: SQL Server 2008 R2


Chapter 3 - Data Integration .......... 104
  Introduction .......... 107
  Data Integration Overview .......... 108
    Data Integration Patterns .......... 109
    Which Pattern Should You Use? .......... 110
    Roles and Responsibilities .......... 111
  Data Integration Concepts .......... 112
    Consolidation, Normalization, and Standardization .......... 112
    Data Integration Paradigms (ETL and ELT) .......... 118
    ETL Processing Categories .......... 122
    Incremental Loads .......... 123
    Detecting Net Changes .......... 125
    Data Integration Management Concepts .......... 127
    Batch Processing and the Enterprise ETL Schedule .......... 133
  ETL Patterns .......... 135
    Destination Load Patterns .......... 135
    Versioned Insert Pattern .......... 136
    Update Pattern .......... 138
    Versioned Insert: Net Changes .......... 139
  Data Quality .......... 140
    Data Quality Scenario .......... 141
    Data Exceptions .......... 143
    Data Stewardship and Validation .......... 145
    Data Profiling .......... 146
    Data Cleansing .......... 148
    Data Reconciliation .......... 151
    Lineage .......... 152
  ETL Frameworks .......... 155
    ETL Framework Components .......... 156
    Users and Interfaces .......... 157
    Configurations .......... 158
    Logging .......... 159
    Master Package .......... 163
    Execution Package .......... 169
    Package Storage Options .......... 171
    Backing Out Batches .......... 173
    More ETL Framework Information .......... 175
  Data Integration Best Practices .......... 175
    Basic Data Flow Patterns .......... 175
    Surrogate Keys .......... 182
    Change Detection .......... 188
    De-duping .......... 193
    Dimension Patterns .......... 196
    Fact Table Patterns .......... 200
    Data Exception Patterns .......... 203
  SSIS Best Practices .......... 207
    The Power of Data Flow Scripting .......... 207
    Destination Optimization (Efficient Inserts) .......... 212
    Partition Management .......... 214
    SSIS Scale and Performance .......... 215
    Source Control .......... 217
  Conclusion and Resources .......... 217
    Resources .......... 217


Introduction
Data integration is responsible for the movement of data throughout the data warehouse and the transformation of that data as it flows from a source to its next destination. Today's reality is that a large percentage of a data warehouse's total cost of ownership (TCO) is related to post-development integration costs; that is, the ongoing costs of loading source data into the data warehouse and distributing data from the data warehouse to downstream data stores. The daily, and in some cases intraday, process of loading data and validating the results is time-consuming and repetitive. The resources required to support this process increase over time due to:
- Increases in data volumes
- The growth of data warehouse integration processes and the long lifetime of those processes once the data warehouse is in production
- The lack of best software engineering practices when developing integration solutions
- The growing need for real-time or near real-time data

This chapter's objective is to help reduce data integration TCO for data warehouses implemented on the Microsoft SQL Server platform by presenting a set of integration patterns and best practices found in successful Microsoft-centric data warehouse implementations today. This chapter covers the following topics from the perspective of the noted intended audiences:
- Data integration overview and challenges
- ETL concepts and patterns (audience: data integration team)
- Data quality (audience: ETL operations, ETL developers, and data stewards)
- ETL frameworks (audience: ETL developers and data architects)
- Data integration best practices (audience: ETL developers)
- SSIS best practices (audience: ETL developers)
- Conclusion and resources: links to Web content


Data Integration Overview


Data integration is responsible for moving, cleansing, and transforming set-based data (often very large data sets) from source(s) into the Production data area and then into the Consumption data area, as shown in Figure 3-1.

Figure 3-1: The role of data integration in a data warehouse project

The requirements for the data integration component include:
- Trust: Business consumers must be able to trust the results obtained from the data warehouse.
- One version of the truth: Consolidating disparate sources into an integrated view supports business consumers' need for an enterprise-level view of data.
- Current and historical views of data: The ability to provide both a historical view of data as well as a recent view supports key business consumer activities such as trend analysis and predictive analysis.
- Availability: Data integration processes must not interfere with business consumers' ability to get results from the data warehouse.

The challenges for the data integration team in support of these requirements include:
- Data quality: The data integration team must promote data quality to a first-class citizen.
- Transparency and auditability: Even high-quality results will be questioned by business consumers. Providing complete transparency into how the results were produced is necessary to allay business consumers' concerns about data quality.
- Tracking history: The ability to correctly report results for a particular period in time is an ongoing challenge, particularly when there are adjustments to historical data.
- Reducing processing times: Efficiently processing very large volumes of data within ever-shortening processing windows is an ongoing challenge for the data integration team.


Data Integration Patterns
The industry has several well-known data integration patterns to meet these requirements and solve these challenges, and it's important for data warehouse practitioners to use the correct pattern for their implementation. How do you determine which of these patterns you should use for your data integration needs? Figure 3-2 positions the different integration options that are available.


Figure 3-2: Integration patterns

The two axes in Figure 3-2 represent the main characteristics for classifying an integration pattern:
- Timing: Data integration can be a real-time operation or can occur on a scheduled basis.
- Volumes: Data integration can process one record at a time or data sets.

The primary integration patterns are:
- Enterprise Information Integration (EII): This pattern loosely couples multiple data stores by creating a semantic layer above the data stores and using industry-standard APIs such as ODBC, OLE DB, and JDBC to access the data in real time.
- Enterprise Application Integration (EAI): This pattern supports business processes and workflows that span multiple application systems. It typically works on a message/event-based model and is not data-centric (i.e., it is parameter-based and does not pass more than one record at a time). Microsoft BizTalk is an example of an EAI product.
- Extract, Transform, and Load (ETL): This pattern extracts data from sources, transforms the data in memory, and then loads it into a destination. SQL Server Integration Services (SSIS) is an example of an ETL product.
- Extract, Load, and Transform (ELT): This pattern first extracts data from sources and loads it into a relational database. The transformation is then performed within the relational database rather than in memory. The term is newer than ETL but, in fact, ELT was the method used in early data warehouses before ETL tools started to emerge in the 1990s.
- Replication: This is a relational database feature that detects changed records in a source and pushes the changed records to one or more destinations. The destination is typically a mirror of the source, meaning that the data is not transformed en route from source to destination.


Data integration, which frequently deals with very large data sets, has traditionally been scheduled to run on a nightly basis during off hours. In this scenario, the following has held true for the different patterns:
- EII is not commonly used in data warehouses because of performance issues. The size and data volumes of data warehouses prohibit the real-time federation of diverse data stores, which is the technique employed by the EII pattern.
- EAI is not used in data warehouses because the volume of the data sets results in poor performance for message/event-based applications.
- ETL is the most widely used integration pattern for data warehouses today.
- ELT is seen mostly in legacy data warehouse implementations and in very large data warehouse implementations where the data volumes exceed the memory required by the ETL pattern.
- Replication, used to extract data from sources, is used in conjunction with an ETL or ELT pattern for some data warehouse implementations.
  - The decision to use replication can be based on a variety of factors, including the lack of a last-changed column or when direct access to source data is not allowed.

However, because of the growing need for real-time or near real-time reporting outside of the line-of-business (LOB) database, organizations are increasingly running some data integration processes more frequently, some close to real time. To efficiently capture net changes for near real-time data integration, more and more companies are turning to the following solutions:
- Replication to push data out for further processing in near real time when the consumer requires recent data (replication is also useful when the source system doesn't have columns that the ETL or ELT tool can use to detect changed records)
- Relational databases' additional capabilities to detect and store record changes, such as SQL Server 2008 Change Data Capture (CDC), which is based on the same underlying technology used by replication
- Incremental change logic within an ETL or ELT pattern, as long as the source table has a date or incrementing column that can be used to detect changes (a minimal watermark sketch follows this list)
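A minimal sketch of the watermark-based incremental logic mentioned in the last item (the watermark table, source table, and column names are hypothetical):

-- Watermark recorded by the previous ETL run
DECLARE @LastExtractDate datetime2;

SELECT @LastExtractDate = LastExtractDate
FROM etl.ExtractWatermark
WHERE SourceTableName = 'Sales.Orders';

-- Pull only the rows changed since the last run
SELECT o.OrderID, o.CustomerID, o.OrderAmount, o.ModifiedDate
FROM Sales.Orders AS o
WHERE o.ModifiedDate > @LastExtractDate;

-- After a successful load, advance the watermark to the highest value extracted
UPDATE etl.ExtractWatermark
SET LastExtractDate = (SELECT MAX(ModifiedDate) FROM Sales.Orders)
WHERE SourceTableName = 'Sales.Orders';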

Which Pattern Should You Use?
Typically, a data warehouse should use either ETL or ELT to meet its data integration needs. The cost of maintaining replication, especially when re-synchronizing the replication process is required, makes it a less attractive alternative for extracting data from sources. However, hybrid approaches, such as ETL/ELT combined with source system net-change detection capabilities, may be required for near real-time data.


Throughout the rest of this document, we will use ETL for most patterns and best practices and explicitly point out where ELT and source system net-change detection are applicable.

Roles and Responsibilities
Chapter 1 outlined the team's roles and responsibilities within the entire data warehouse effort. Figure 3-3 shows the responsibilities of the data steward, data architect, ETL developer, and ETL operations roles for a data warehouse's data integration component.


Figure 3-3: Data Integration Team roles and responsibilities

The responsibilities of the different team roles are:
- Governance: Data stewards and data architects are members of the data warehouse Governance team.
- Architecture: The data architect is responsible for the data warehouse architecture, including but not limited to the platform architecture, best practices and design patterns, oversight of frameworks and templates, and creating naming conventions and coding standards.
- Development: ETL developers are responsible for designing and developing ETL packages and the underlying ETL framework. In addition, ETL developers are typically called when there's an issue with the ETL processes (errors) or with the data results (exceptions).
- ETL Operations: The ETL operations team is responsible for deploying ETL solutions across the different environments (e.g., Dev, Test, QA, and Prod) and the day-to-day care and feeding of the ETL solutions once in production.
- Data Quality: Data stewards are responsible for data quality.

Note that the objective of this team setup is to minimize the TCO of daily data warehouse ETL activity. It's important that ETL operations and data stewards have the necessary tools to diagnose errors and exceptions. Otherwise, the time required to diagnose and fix errors and exceptions increases, and ETL developers will be pulled into all error and exception activity, reducing the time they can spend on new applications. More important, this constant firefighting leads to burnout for all parties.

The rest of this chapter expands on the patterns and best practices you can use to reduce data integration's TCO, starting with key ETL concepts.


Data Integration Concepts


This section introduces the following key data integration concepts, which we will look at in more detail as we present best practices later in this chapter:
- Consolidation, normalization, and standardization
- Data integration paradigms (ETL and ELT)
- ETL processing categories
- Incremental loads
- Detecting net changes
- Data integration management concepts

Consolidation, Normalization, and Standardization
Data integration processes typically have a long shelf life; it's not uncommon for an ETL process to be in production for more than 10 years. These processes undergo many revisions over time, and the number of data processes grows over time as well. In addition, different development teams often work on these processes without coordination or communication. The result is duplication of effort and multiple ETL processes moving the same source data to different databases. Each developer often uses different approaches for common data integration patterns, error handling, and exception handling. Worse yet, the lack of error and exception handling can make diagnosing errors and data exceptions very expensive.

The absence of consistent development patterns and standards results in longer development cycles and increases the likelihood that the ETL code will contain bugs. Longer development times, inconsistent error and exception handling, and buggy code all contribute to increasing data integration TCO. Well-run data integration shops have recognized that time spent up-front on ETL consistency is well worth the effort, reducing both maintenance and development costs. ETL consistency is achieved through three practices: consolidation, normalization, and standardization.

- Consolidation is the practice of managing the breadth of processes and servers that handle ETL operations. This includes both the operations that perform ETL, such as SSIS packages, and the databases and file stores that support the ETL, such as Data In and Production databases. If your environment has dozens of databases and servers that do not generate data but merely copy and transform data (and often the same data!), you are not alone. However, you likely spend a lot of time managing these duplicate efforts.


- Normalization involves being consistent in your ETL processing approach. You can develop an ETL package in many different ways to get to the same result. However, not all approaches are efficient, and using different approaches to accomplish similar ETL scenarios makes managing the solutions difficult. Normalization is about being consistent in how you tackle data processing: taking a normal, or routine, implementation approach to similar tasks.
- Standardization requires implementing code and detailed processes in a uniform pattern. If normalization is about processes, standardization is about the environment and code practices. Standardization in ETL can involve naming conventions, file management practices, server configurations, and so on. Data integration standards, like any standards, need to be defined up-front and then enforced. ETL developers and architects should implement the standards during development. You should never agree to implement standards later.

Let's look at each of these practices more closely.

Consolidation
Suppose you work for a moderately large organization, such as an insurance company. The core systems involve policy management, claims, underwriting, CRM, accounting, and agency support. As with most insurance companies, your organization has multiple systems performing similar operations due to industry consolidation and acquisitions or to support the various insurance products offered. The supporting LOB systems and department applications far outnumber the main systems because of the data-centric nature of insurance. However, many of the systems require data inputs from the core systems, making the web of information sharing very complicated. Figure 3-4 shows the conceptual data layout and connections between the systems.


Figure 3-4: System dependency scenario for an insurance company

Each line in Figure 3-4 involves ETL of some nature. In some cases, the ETL is merely an import and export of raw data. Other cases involve more complicated transformation or cleansing logic, or even the integration of third-party data for underwriting or marketing. If each line in the diagram were a separate, uncoordinated ETL process, the management and IT support costs of this scenario would be overwhelming. The fact is that a lot of the processes involve the same data, making consolidation of the ETL greatly beneficial. The normalization of consistent data processes (such as the summary of claims data) would help stabilize the diversity of operations that perform an aggregation. In addition, the sheer number of ETL operations between systems would benefit from a consolidation of the servers handling the ETL, as well as from the normalization of raw file management and the standardization of supporting database names and even the naming conventions of the ETL packages.

Normalization


Because normalization applies to being consistent about the approach to processes, ETL has several layers of normalization. In fact, a large part of this chapter is dedicated to normalization, first as we look at common patterns found in ETL solutions and later as we cover best practices. Normalization in ETL includes but is not limited to:
- Common data extraction practices across varying source systems
- A consistent approach to data lineage and metadata reporting
- Uniform practices for data cleansing routines
- Defined patterns for handling versioning and data changes
- Best-practice approaches for efficient data loading

Standardization
Although it sounds basic, the first step toward standardization is implementing consistent naming conventions for SSIS packages. Consider the screen shot in Figure 3-5, which represents a small slice of ETL packages on a single server. For someone trying to track down an issue or identify the packages that affect a certain system, the confusion caused by the variety of naming styles creates huge inefficiencies. It is hard enough for an experienced developer or support engineer to remember all the names and processes; add a new developer or IT support person to the mix, and the challenges increase.


Figure 3-5: Example of non-standard package naming

In contrast, Figure 3-6 shows another slice of packages that are named consistently. These packages follow a standard naming convention:

[Source System].[Destination System].[OperationDescription].[ExecutionFrequency].dtsx
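For example, a package that loads claim summaries from a claims system into the data warehouse each night might be named Claims.DataWarehouse.ClaimSummaryLoad.Daily.dtsx (a hypothetical name invented for illustration, not one of the packages shown in Figure 3-6).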


Figure 3-6: Standard package naming simplifies management and troubleshooting

However, the point isn't about this particular naming convention, but about the need to define and follow a consistent standard, whatever is appropriate for your environment. The ETL Frameworks section presents additional standards, including SSIS package templates used as a foundation for all SSIS ETL development.

Benefits of the Big 3
In summary, the benefits of consolidation, normalization, and standardization include:
- Improved process governance: ETL consistency and consolidation help you achieve better overall enterprise data stewardship and effective operations.
- Better resource transition: As people move in and out of a support or development environment, they can focus their energies on the core business problem or technical hurdle, rather than trying to figure out where things are and what they do.
- Enhanced support and administration: Any ETL support team will benefit from following consistent patterns and consolidation, especially if the support organization is in a different location (such as in an off-shore operations management scenario).
- More effective change management: The ability to nimbly handle system changes is enhanced when you can clearly see what processes are running and those processes are consistently implemented.


- Reduced development costs: The implementation of development standards reduces the cost of development because, in the long run, developers are able to focus more on the business requirement they are coding when they're given clear direction and process.

Failure to address consolidation, normalization, and standardization up-front, or to stabilize an existing ETL environment that is deficient in any or all of these areas, will make your job of architecting, developing, or managing data integration for your data warehouse more complicated. Each of the benefits above can be turned into a drawback without the proper standards and processes in place: difficult process management and administration, ineffective knowledge transfer, and challenges in change management of processes and systems. The ETL Frameworks section below presents a set of package templates that provide a solid foundation for ETL developers.

Data Integration Paradigms (ETL and ELT)
ETL products populate one or more destinations with data obtained from one or more sources. The simplest pattern is where one source loads one destination, as illustrated in Figure 3-7.


Figure 3-7: Simple ETL data flow

The processing steps are as follows:
1. The ETL tool retrieves data sets from the source, using SQL for relational sources or another interface for file sources.
2. The data set enters the data pipeline, which applies transformations to the data one record at a time. Intermediate data results are stored in memory.
3. The transformed data is then persisted into the destination.

Advantages of this process are that:
- Procedural programming constructs support complex transformations.
- Storing intermediate results in memory is faster than persisting to disk.
- Inserts are efficiently processed using bulk-insert techniques.

However, the disadvantages include the following:


- Very large data sets could overwhelm the memory available to the data pipeline.
- Updates are more efficient using set-based processing, meaning one SQL UPDATE statement for all records, not one UPDATE per record.


Figure 3-8 shows an example of an SSIS data flow that performs transformation processes (joining, grouping, calculating metrics, and so on) in the pipeline. This data flow has the advantage of leveraging the memory resources of the server and can perform many of the transformation tasks in parallel. However, when memory is limited or the data set must fit entirely in memory, the processes will slow down.

Figure 3-8: SSIS data flow example

Remember that ELT (Extract, Load, and Transform) also moves data from sources to destinations. ELT relies on the relational engine for its transformations. Figure 3-9 shows a simple example of ELT processing.


Figure 3-9: Simple ELT data flow

The processing steps in the ELT data flow are as follows:
1. Source data is loaded either directly into the destination or into an intermediate working table when more complex processing is required. Note that transformations can be implemented within the source SQL SELECT statement.
2. Transformations are optionally applied using the SQL UPDATE command. More complex transformations may require multiple UPDATEs for one table.
3. Transformations and lookups are implemented within the SQL INSERT...SELECT statement that loads the destination from the working area.
4. Updates for complex transformations and consolidations are then applied to the destination.

The advantages of this process include the following:
- The power of the relational database system can be utilized for very large data sets. Note, however, that this processing will impact other activity within the relational database.
- SQL is a very mature language, which translates into a greater pool of developers than ETL tools can draw on.

However, you need to consider these disadvantages:
- As just noted, ELT places a greater load on the relational database system. You will also see more disk activity because all intermediate results are stored in a table, not in memory.
- Implementing transformations and consolidations using one or more SQL UPDATEs is less efficient than the ETL equivalents, which make only one pass through the data and apply the changes to the destination using a single SQL statement rather than multiple ones.
- Complex transformations can exceed the capabilities of the SQL INSERT and UPDATE statements when the transformation occurs at the record level rather than the data set level. When this occurs, SQL cursors are used to iterate over the data set, which results in decreased performance and hard-to-maintain SQL code.
- For a given transformation, the processes applied are often serialized in nature and add to the overall processing time.
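As a minimal, hypothetical T-SQL sketch of steps 1 through 4 (the staging, source, and destination object names are invented for illustration):

-- Step 1: land source data in a working (staging) table; simple transformations
-- can be applied in the source SELECT
INSERT INTO stage.SalesOrder (OrderID, CustomerCode, OrderDate, OrderAmount)
SELECT OrderID, LTRIM(RTRIM(CustomerCode)), OrderDate, OrderAmount
FROM src.SalesOrder;

-- Step 2: optional set-based transformation applied in place
UPDATE stage.SalesOrder
SET OrderAmount = 0
WHERE OrderAmount IS NULL;

-- Step 3: transformations and lookups applied while loading the destination
INSERT INTO dw.FactSalesOrder (CustomerSK, DateKey, OrderAmount)
SELECT c.CustomerSK,
       CONVERT(int, CONVERT(char(8), s.OrderDate, 112)),  -- YYYYMMDD integer date key
       s.OrderAmount
FROM stage.SalesOrder AS s
INNER JOIN dw.DimCustomer AS c ON c.CustomerCode = s.CustomerCode;

-- Step 4: any remaining consolidations would be applied to the destination
-- with additional UPDATE statements.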


Figure 3-10 shows an SSIS control flow used in more of an ELT-type operation. You can identify ELT-type operations by their multiple linear tasks, which either run Execute SQL Tasks or perform straight data loads using a few working tables.


Figure 3-10: SSIS control flow ELT process

Which Should You Use for Your Implementation?
The decision about whether to use an ETL or ELT pattern for a SQL Server data integration solution is generally based on the following considerations.

Use ETL when:
- You are working with flat files and non-relational sources. ETL tools have readers that can access non-relational sources such as flat files and XML files. ELT tools leverage the SQL language, which requires that the data first be loaded into a relational database.
- This is a new data integration project, or the current first-generation implementation is hard to manage and maintain. The visual workflows for tasks and data flows make the process easier for non-developers to understand.
- The transformations are complex. ETL's ability to apply complex transformations and business rules far exceeds the abilities of one set-based SQL statement. Many legacy ELT solutions have become unmanageable over time because of cursor-based logic and the multiple UPDATE operations used to implement complex transformations.

Use ELT when:


- The data volumes being processed are very large. Huge data sets may exhaust the available memory for an ETL approach. Remember that the ETL data pipeline uses in-memory buffers to hold intermediate data results.
- The source and destination data is on the same server and the transformations are very simple. A SQL-centric ELT solution is also a reasonable choice when the current database development team is not trained on SSIS, but keep in mind that complex transformations can easily translate into poorly performing, unmaintainable data integration code.

ETL Processing Categories
One of the key benefits of a data warehouse is its ability to compare or trend an organization's performance over time. Two important questions for every data warehouse are:
- How much time is stored in the data warehouse?
- When is the data warehouse populated?

The question of how much time to store is a function of:
- Historical reporting and the needs of business consumers. Many industries have similar needs around the amount of historical data within their data warehouse. For example, retail organizations often report on a three-year time period, and many health care organizations keep data for seven years to conform to health care regulations.
- Data warehouse maturity and size. Mature data warehouses typically archive historical data older than a specified period of time because the older, less frequently referenced data degrades query and load performance.

ETL processes have traditionally loaded data warehouses on a nightly basis during non-working hours (e.g., 11pm to 7am). Note that the more global the data warehouse, the less downtime exists, because every period of the day is working hours for some geography. However, business consumers' requests for real-time data are placing additional demands on the traditional ETL processing methods. In addition, many organizations use the concept of a current period, where information within the current period changes frequently before being frozen at the end of the period. Figure 3-11 organizes the data warehouse by time.

Figure 3-11: Data warehouse organized by time
Data warehouses organized by time have the following categories:
- Archive: Historical versioned data that is referenced infrequently. Archived information can either be stored in a separate database or loaded on demand into a database from flat files. Note that archived data still falls within the data warehouse umbrella.
- Historical: The majority of data warehouse tables are loaded on a scheduled basis, usually nightly, but it could be weekly or monthly for very large data warehouses.
- Current period: Current period areas are frequently seen in industries where the current set of data is in progress and changes frequently before being frozen at the end of a time period, such as at the end of the month. Data is truncated and fully reloaded for the current time period.
- Near real time: Business consumers are frequently dissatisfied with the reporting capabilities of LOB systems. In addition, expensive queries can lock out transaction users. In these cases, your data warehouse may need to serve up real-time or near real-time data.
- Real time: The LOB system is the source for real-time data.

The key point is that different ETL load patterns are used for each of these time categories:
- The archive database/flat files are populated using an ETL data flow and bulk inserts. The archived data is then deleted from the data warehouse. Note that this delete is an expensive operation on very large tables, so you need to allocate sufficient time for it (a batched delete sketch follows this list).
- The historical data warehouse is populated using incremental load patterns, which are covered in the next section.
- The current period area is populated using the full load pattern, also covered in the next section.
- Real-time and near real-time data requires a more active ETL process, such as a historical process that runs every five minutes or a push process, which we look at in the change-detection section later in this chapter.
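Because the archive delete can be expensive on very large tables, it is often run in batches to keep transaction log growth and lock escalation under control. The following is a minimal sketch under assumed names (dbo.FactSales and an OrderDate cutoff are hypothetical):

-- Delete already-archived rows in manageable chunks
DECLARE @Cutoff date = '2007-01-01';  -- rows older than this have already been archived

WHILE 1 = 1
BEGIN
    DELETE TOP (50000)
    FROM dbo.FactSales
    WHERE OrderDate < @Cutoff;

    IF @@ROWCOUNT = 0 BREAK;  -- stop when no more rows qualify
END;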

Incremental Loads
Many first-generation data warehouses or data marts are implemented as full loads, meaning they're rebuilt every time they're populated. Figure 3-12 illustrates the different steps within a full load.

Figure 3-12: Full load process
The steps in a full-load process are:
1. Drop indexes: Indexes increase load times, so they are removed before the load.
2. Truncate tables: Delete all records from the existing tables.
3. Bulk copy: Load data from the source system into an Extract In area.
4. Load data: Use stored procedures and SQL INSERT statements to load the data warehouse.
5. Post process: Re-apply indexes to the newly loaded tables.
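A minimal T-SQL sketch of these steps follows. The index, file path, staging, and destination names are hypothetical and stand in for your own objects.

-- 1. Drop indexes so the load is not slowed by index maintenance
DROP INDEX IX_DimCustomer_Name ON dbo.DimCustomer;

-- 2. Truncate the existing table
TRUNCATE TABLE dbo.DimCustomer;

-- 3. Bulk copy the source extract into the Extract In area
BULK INSERT extractin.Customer
FROM '\\fileserver\extracts\customer.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

-- 4. Load the data warehouse table from the Extract In area
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Region)
SELECT CustomerID, CustomerName, Region
FROM extractin.Customer;

-- 5. Post process: re-create the indexes on the newly loaded table
CREATE INDEX IX_DimCustomer_Name ON dbo.DimCustomer (CustomerName);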

However, full loads are problematic because the time to reload will eventually exceed the window of time allocated for the load process. More important, business consumers don't have access to historical point-in-time reporting because only the most recent copy of the source system data is available in the data warehouse. With full loads often unable to support point-in-time historical reporting, many organizations have turned to a second-generation approach that uses an incremental load, which Figures 3-13 and 3-14 show.

Figure 3-13: Incremental load with an Extract In area

Figure 3-14: Incremental load without an Extract In area
The steps involved in an incremental load are:
1. Load net changes from the previous load process.
2. Insert/update net changes into the Production area.
3. Insert/update the Consumption area from the Production area.
The primary differences between full loads and incremental loads are that incremental loads:
- Do not require additional processing to drop, truncate, and re-index
- Do require net change logic
- Do require updates in addition to inserts

These factors combine to make incremental loads more efficient as well as more complex to implement and maintain. Let's look at the patterns used to support incremental loads.

Detecting Net Changes
The incremental ETL process must be able to detect records that have changed within the source. This can be done using either a pull technique or a push technique:
- With the pull technique, the ETL process selects changed records from the source. Ideally, the source system has a last changed column that can be used to select changed records. If no last changed column exists, all source records must be compared with the destination.
- With the push technique, the source detects changes and pushes them to a destination, which in turn is queried by the ETL process.

Pulling Net Changes: Last Changed Column
Many source system tables contain columns recording when a record was created and when it was last modified. Other sources have an integer value that is incremented every time a record is changed. Both of these techniques allow the ETL process to efficiently select the changed records by comparing against the maximum value of the column encountered during the last execution instance of the ETL process. Figure 3-15 shows the example where change dates exist in the source table.

Figure 3-15: Detecting net changes using change date
Figure 3-16 shows how an integer value can be used to select changed records. Note that this example shows one benefit of adding an Execution lineage ID within the Production area.

Figure 3-16: Detecting net changes using an integer value
It's the responsibility of the ETL process to calculate and store the maximum net change value for each instance in which the process is invoked, as Figure 3-17 shows.
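A minimal T-SQL sketch of this watermark logic, assuming a hypothetical etl.NetChangeValue table and a ModifiedDate column on the source:

DECLARE @LastLoaded datetime;

-- Retrieve the value saved by the previous execution instance
SELECT @LastLoaded = LastModifiedDate
FROM etl.NetChangeValue
WHERE TableName = 'Customer';

-- Select only the records that changed since the last load
SELECT CustomerID, CustomerName, ModifiedDate
FROM src.Customer
WHERE ModifiedDate > @LastLoaded;

-- After a successful load, save the new maximum for the next run
UPDATE etl.NetChangeValue
SET LastModifiedDate = (SELECT MAX(ModifiedDate) FROM src.Customer)
WHERE TableName = 'Customer';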

Figure 3-17: Calculating and saving the change value
The ETL process is then responsible for retrieving this saved maximum value and dynamically applying it to the source SQL statement.

Pulling Net Changes: No Last Changed Column
The lack of a last changed column requires the ETL process to compare all source records against the destination. Figure 3-18 shows the process flow when no last changed column exists.

Figure 3-18: Detecting net changes with no last changed column
This process flow is as follows:
1. Join the source with the destination using a left outer join. Note that this should be implemented within the ETL product when the source and destination are on different systems.
2. Process all source records that don't exist in the destination.
3. Compare source and destination attributes when the record does exist in the destination.
4. Process all source records that have changed.
Because all records are processed, pulling net changes when there's no last changed column is a less efficient approach, which is especially important for very large transaction tables.

Pushing Net Changes
When pushing net changes, the source system is responsible for pushing the changes to a table, which then becomes the source for the ETL process. Figure 3-19 shows two common push methods.

Figure 3-19: Push net change options
What's different about these two options? In the first scenario, the source system's relational database actively monitors the transaction log to detect and then insert all changes into a destination change table. In the second scenario, developers create triggers that insert changes into the destination change table every time a record changes.
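As an illustration of the second scenario, a trigger-based push might look like the following sketch. The table names are hypothetical, and a production version would typically also capture deletes and the type of change.

CREATE TRIGGER trg_Customer_Change
ON src.Customer
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Push the changed keys into a change table that the ETL process reads later
    INSERT INTO src.Customer_Changes (CustomerID, ChangedAt)
    SELECT CustomerID, GETDATE()
    FROM inserted;
END;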

Note that in both cases, additional load is placed on the source system, which impacts OLTP performance. However, the transaction log reader scenario is usually much more efficient than the trigger scenario.

Net Change Detection Guidance
The first rule of ETL processing is that LOB source systems should not incur additional overhead during peak usage hours due to ETL processing requirements. Because ETL processing typically occurs during non-peak hours, the preferred option is a pull, not push, mechanism for net change detection. That said, the scenarios where you might consider a push option are:
- The process is already occurring within the source system (e.g., audit tables exist and are populated when a net change occurs in the LOB system). Note that the ETL logic will still need to filter records from the change table, so it will need to use a last changed column.
- There is no last changed column and the logic to detect net changes is complex. One example would be a person table where a large number of columns flow to the data warehouse. The ETL logic required to compare many columns for changes would be complex, especially if the columns contain NULL values.
- There is no last changed column and the source table is very large. The ETL logic required to detect net changes in very large transaction tables without a net change column can result in significant system resource usage, which could force the ETL processing to exceed its batch window.

Data Integration Management Concepts
Reducing ongoing TCO for data integration operations is a top priority for organizations. To reduce costs, you need to understand what contributes to the ETL TCO:
- ETL package installation and configuration from development through production
- ETL package modifications in response to hardware and software issues
- ETL package modifications for different processing options
- Tracking down system errors when they occur (e.g., when a server is down or a disk is offline)
- Detecting programming issues

ETL developers can help ETL operations reduce ongoing TCO by building dynamic configuration and logging capabilities into their ETL packages. The two primary areas that developers can focus on are supporting dynamic configurations and providing robust logging.

Dynamic Configurations
Dynamic configurations support the run-time configuration of SSIS connections and variables used within SSIS package workflows. ETL developers use the following SSIS capabilities to develop packages that support dynamic configurations:
- SSIS expressions provide a rich expression language that can be used to set almost any property or value within a package.
- Variables can be set either statically or dynamically by values or the results of an expression.
- Package configurations can be used to dynamically set variables and task properties.
Figure 3-20 shows how expressions and variables combine to dynamically configure the destination database connection.

Figure 3-20: Dynamic configurations: Destination connection
What's going on in Figure 3-20? Here's the process:
- Dst is a SQL Server database connection for the destination database.
- An expression is used to set the destination ConnectionString property from the inpCnDst variable.
- The Initvars_EtlFwk Script task then populates the inpCnDst SSIS variable with a value stored within the ETL framework's configuration table.

Dynamic configurations are also commonly used for configuring SQL statements, for example, to add filters for incremental loads.

Note that this is only one example of a dynamic configuration. We'll cover this implementation of dynamic configurations in more detail later in the ETL Framework section. Package configurations are also a powerful tool for initializing properties and variables. The following resources provide more information about package configurations:
- SQL Server Integration Services SSIS Package Configuration
- SSIS Parent package configurations. Yay or nay?
- SSIS Nugget: Setting expressions
- SSIS - Configurations, Expressions and Constraints
- Creating packages in code - Package Configurations
- Microsoft SQL Server 2008 Integration Services Unleashed (Kirk Haselden), Chapter 24, Configuring and Deploying Solutions
- BIDS Helper, a very useful add-in that includes an expression highlighter

Integration Services Logging
Basic ETL execution auditing can be performed through the built-in logging feature in SSIS, which captures task events, warnings, and errors to a specified logging provider such as a text file, a SQL table, or the event log. Any ETL operation needs logging to track execution details and troubleshoot errors; the SSIS logging provider is the first step. Figure 3-21 shows the logging event details.

Figure 3-21: SSIS logging events
Table 3-1 shows a few of the details captured by different logging events; the details are logged linearly and associated with the appropriate package and task within the package (package and task columns not shown).

Event | Source | starttime | endtime | Message
OnWarning | Customer_Import | 2009-11-04 17:17:33 | 2009-11-04 17:17:33 | Failed to load at least one of the configuration entries for the package. Check configuration entries for "XML Config; SQL Configurations; Configuration 1" and previous warnings to see descriptions of which configuration failed.
OnPreExecute | Customer_Import | 2009-11-04 17:17:34 | 2009-11-04 17:17:34 |
OnPreExecute | Data Flow Task | 2009-11-04 17:17:34 | 2009-11-04 17:17:34 |
OnWarning | Data Flow Task | 2009-11-04 17:17:34 | 2009-11-04 17:17:34 | Warning: Could not open global shared memory to communicate with performance DLL; data flow performance counters are not available. To resolve, run this package as an administrator, or on the system's console.
OnWarning | Customer_Import | 2009-11-04 17:17:34 | 2009-11-04 17:17:34 | Warning: Could not open global shared memory to communicate with performance DLL; data flow performance counters are not available. To resolve, run this package as an administrator, or on the system's console.

Table 3-1: SSIS logging output
The benefits of SSIS's built-in logging are its simplicity and ease of configuration. However, SSIS logging falls short rather quickly when dealing with data warehouse-type ETL that has any level of complexity or volume. Here are some drawbacks to the basic logging functionality:
- The data is not normalized, and it is therefore difficult to query: the log table must be joined to itself several times just to get the start and stop times for a single task and to identify the package.
- Scalability may be a problem if you want to capture every package execution to a common logging table.
- The logging shows no precedence between tasks or between parent and child packages, making it difficult to capture overall batch activity.
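For example, approximate task durations have to be derived by collapsing the OnPreExecute and OnPostExecute events, along the lines of this sketch against the SQL Server 2008 logging table (dbo.sysssislog):

SELECT source         AS task_name,
       executionid    AS execution_guid,
       MIN(starttime) AS task_start,
       MAX(endtime)   AS task_end
FROM dbo.sysssislog
WHERE event IN ('OnPreExecute', 'OnPostExecute')
GROUP BY source, executionid
ORDER BY MIN(starttime);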

The following links provide more information about SSIS logging:
- SQL Server 2005 Report Packs. This page has a link to the SQL Server 2005 Integration Services Log Reports. Note: The SSIS logging table has changed from sysdtslog90 (2005) to sysssislog (2008); you will need to change the SQL in all of the reports or create a view if running SSIS 2008 or later.
- Integration Services Error and Message Reference. This is useful for translating numeric SSIS error codes into their associated error messages.

Custom Logging An alternative to SSIS logging is to implement a custom logging solution, which is often part of an overarching ETL execution framework that manages configurations, package execution coordination, and logging.

A robust, centralized logging facility allows ETL operations to quickly see the status of all ETL activity and to locate and track down issues in ETL processes. ETL developers can decide to leverage SSIS logging capabilities or access their own logging facility from within ETL packages. We'll expand on custom logging a bit later in the ETL Framework section.

Batch Processing and the Enterprise ETL Schedule
As we noted earlier, most data warehouses load source data nightly during non-peak processing hours. Many of these processes are organized into a sequential series of scheduled ETL processes, what we call an ETL batch. ETL batches often need to be scheduled in a sequenced order due to data dependencies among the various batches. In general, there are two basic ETL batch types:
- Source-centric batches: Data is extracted from one source
- Destination-centric batches: One destination subject area is populated

The enterprise ETL schedule is responsible for sequencing and coordinating all of the ETL batches used to load the data warehouse. Such a schedule is necessary when there are data interdependencies among batches. A scheduling software package, such as SQL Server Agent, is typically used to sequence the ETL batches. The following are key considerations when developing ETL batches and creating the enterprise ETL schedule.

Parallel Processing
Running ETL processes in parallel maximizes system resources while reducing overall processing times. Because efficient ETL processes typically obtain table-level locks on destinations, the ability to run ETL processes in parallel translates into only one process loading a table at any particular point in time. This is a major consideration for ETL operations resources when creating the enterprise ETL schedule. Note that SQL Server table partitioning can be used to support the parallel loading of large tables. However, there is still an issue with readers having access to tables or partitions that are in the process of being loaded. For more information on best practices for high-volume loads into partitioned tables, see the "We Loaded 1TB in 30 Minutes with SSIS, and So Can You" article.

Terminating and Restarting Batches
One consideration when running ETL processes together in one batch is the ability to restart and resume execution at the point where the ETL process failed. On the surface, this seems like a simple concept. However, consider the fairly simple scenario presented in Figure 3-22.

Figure 3-22: Simple batch processing scenario
Let's say the following batch processing occurs nightly, starting at 11pm:
1. Load the customer records in the Production area from the Customer Relationship Management (CRM) system. This is the definitive source for customer information and is where customer records are created.
2. Load product, customer, and sales information from the Enterprise Resource Planning (ERP) system into the Production area.
3. Load the Consumption area product, customer, and sales tables from the Production area.
Batch #2 has a dependency on Batch #1 (Customer). Batch #3 has dependencies on batches #1 and #2. These dependencies lead to the following questions:
- What happens when Batch #1 fails with an error? Should batches #2 and #3 execute?
- What happens when Batch #2 fails? Should Batch #3 execute?
- What happens when Batch #3 fails? Should it be rerun when the problem is located and diagnosed?

The answer to all of these is "it depends." For example, batches #2 and #3 can run when Batch #1 fails if and only if:
- Incremental loads are supported in Batch #1
- Late-arriving customers are supported in the Batch #2 customer and sales ETL processing

Two primary approaches to rerunning ETL processes are:
- The ability to rerun an entire ETL batch process and still produce valid results in the destination. This supports the simplest workflow (i.e., exit batch processing when the first severe error is encountered); you rerun the entire batch once the problem is fixed.
- The ability to checkpoint an ETL batch at the point of failure. When rerun, the ETL batch then resumes processing where it left off. Note that SSIS supports the concept of a checkpoint.

In either case, one feature that does need to be supported is the ability to back out the results from a particular instance or instances of a batch execution.

Backing Out Batches
Conceptually, an ETL batch can be viewed as one long-running transaction. ETL processes almost never encapsulate activity within a transaction, however, due to the excessive overhead of logging potentially millions of record changes. Because ETL processing doesn't run within a transaction, what happens if a process within an ETL batch is faulty? What happens when source data is determined to be in error for a particular instance of an ETL load? One answer is the ability to back out all the activity for one ETL batch or for subsets of the ETL batch. The ETL Framework section later in this chapter presents a pattern for backing out results from a particular batch execution instance.

ETL Patterns
Now that we have our changed source records, we need to apply these changes to the destination. This section covers the following patterns for applying such changes; we will expand on these patterns to present best practices later in the chapter:
- Destination load patterns
- Versioned insert pattern
- Update pattern
- Versioned insert: net changes

Destination Load Patterns
Determining how to add changes to the destination is a function of two factors:
- Does the record already exist in the destination?
- Is the pattern for the destination table an update or a versioned insert?

Figure 3-23's flow chart shows how the destination table type impacts how the source record is processed. Note that we will cover deletes separately in a moment.

Figure 3-23: Destination load pattern flow chart

Versioned Insert Pattern
A versioned insert translates into multiple versions for one instance of an entity. The Kimball Type II Slowly Changing Dimension is an example of a versioned insert. A versioned insert pattern requires additional columns that represent the state of the record instance, as shown in Figure 3-24.

Figure 3-24: Versioned insert supporting columns
The key versioned insert support columns are as follows:
- Start Date: The point in time when the record instance becomes active
- End Date: The point in time when the record instance becomes inactive
- Record Status: The record status; at minimum, this is set to Active or Inactive
- Version #: An optional column that records the version of this record instance

Figure 3-25 shows an example of the first record for a versioned insert.

Figure 3-25: Versioned insert: First record
Let's assume that this record changes in the source system on March 2, 2010. The ETL load process will detect this and insert a second instance of the record, as shown in Figure 3-26.

Figure 3-26: Versioned insert: Second record
When the second record is inserted into the table, notice that the prior record has been updated to reflect the following:
- End date: The record is no longer active as of this point in time
- Record status: Changed from Active to Inactive

The ETL process implementing this versioned insert pattern should implement the following logic for optimal performance, as Figure 3-27 shows:
- Use a bulk-insert technique to insert the new version
- Use one set-based UPDATE to set the previous record instance's End Date and Record Status values
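A minimal T-SQL sketch of the set-based portion, assuming a hypothetical dimension table and a staging table that holds the incoming changed records:

-- Expire the prior versions with one set-based UPDATE (before the new versions arrive)
UPDATE d
SET    d.EndDate = GETDATE(),
       d.RecordStatus = 'Inactive'
FROM   dbo.DimCustomer AS d
JOIN   stg.CustomerChanges AS s ON s.CustomerID = d.CustomerID
WHERE  d.RecordStatus = 'Active';

-- Bulk insert the new versions as the active records
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, StartDate, EndDate, RecordStatus)
SELECT s.CustomerID, s.CustomerName, GETDATE(), NULL, 'Active'
FROM   stg.CustomerChanges AS s;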

Figure 3-27: Versioned insert pattern: Combine data flow with set-based SQL
The best practices section later in this chapter will show an SSIS example that implements this pattern.

Why Not Always Use Versioned Inserts?
Some data warehouse implementations predominantly use a versioned insert pattern and never use an update pattern. The benefit of this strategy is that all historical changes are tracked and recorded. However, one cost is that frequently changing records can result in an explosion of record versions.

Consider the following scenario: Company A's business is analyzing patterns in health care claims and providing analytics around scoring and categorizing patient activity, which can be used by payors to set health insurance premiums. Company A:
- Receives extracts from payors in a comma-separated value (CSV) format
- Loads these extracts into its data warehouse

One extract is the insured party extract, which the payor provides each week. Insured party information is entered manually by the payor's customer service representatives. Figure 3-28 shows the change activity that can occur in this scenario.

Figure 3-28: Heavy change activity in a versioned insert pattern
The data warehouse team should consider the update pattern for instances where changes don't impact historical reporting. In the above example, that would reduce the number of records from five to one. In situations where human data entry can result in many small changes, the update pattern, which we cover next, could result in a table that is factors smaller than one using the versioned insert pattern.

Update Pattern
An update pattern updates an existing record with changes from the source system. The benefit of this approach is that there's only one record, as opposed to multiple versioned records, which makes queries more efficient. Figure 3-29 shows the supporting columns for the update pattern.

Figure 3-29: Update pattern support columns
The key update support columns are:
- Record Status: The record status; at minimum, this is set to Active or Inactive
- Version #: An optional column that records the version of this record instance

The primary issues with the update pattern are:
- History is not recorded. Change histories are valuable tools for data stewards and are also useful when data audits occur.
- Updates are a set-based pattern; applying updates one record at a time within the ETL tool is very inefficient.
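The second issue is avoided by staging the changed records and applying them with set-based statements. A minimal sketch, using hypothetical staging and destination table names:

-- Update existing records in one set-based statement
UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.Region       = s.Region,
       d.VersionNbr   = d.VersionNbr + 1
FROM   dbo.DimCustomer AS d
JOIN   stg.CustomerChanges AS s ON s.CustomerID = d.CustomerID;

-- Insert records that do not yet exist in the destination
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Region, RecordStatus, VersionNbr)
SELECT s.CustomerID, s.CustomerName, s.Region, 'Active', 1
FROM   stg.CustomerChanges AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d WHERE d.CustomerID = s.CustomerID);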

One approach that addresses the above issues is to add a versioned insert table to the update pattern, as Figure 3-30 shows.

Figure 3-30: Update pattern: Adding a history table
Adding a history table that records all source system changes supports data stewardship and auditing and allows for an efficient set-based update of the data warehouse table.

Versioned Insert: Net Changes
The versioned insert: net changes pattern is often used in very large fact tables where updates would be expensive. Figure 3-31 shows the logic used for this pattern.

Figure 3-31: Versioned insert: net change pattern

Note that with this pattern:
- Every numeric and monetary value is calculated and stored as a net change from the previous instance of the fact table record.
- There is no post-processing activity (i.e., updates to the fact table after the data flow completes). The goal is to avoid updates on a very large table.
- The lack of updates, combined with the size of the underlying fact table, makes the record change-detection logic more complex. The complexity comes from the need to efficiently compare the incoming fact table records with the existing fact table.
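A minimal sketch of the net-change calculation, assuming a hypothetical staging table that holds the latest source snapshot and a fact table keyed by order line:

-- Insert a new row that carries only the difference from what is already stored
INSERT INTO dbo.FactOrderLine (OrderID, LineNbr, Quantity, Amount, LoadDate)
SELECT s.OrderID,
       s.LineNbr,
       s.Quantity - ISNULL(f.Quantity, 0) AS Quantity,
       s.Amount   - ISNULL(f.Amount, 0)   AS Amount,
       GETDATE()
FROM stg.OrderLine AS s
LEFT JOIN (SELECT OrderID, LineNbr,
                  SUM(Quantity) AS Quantity, SUM(Amount) AS Amount
           FROM dbo.FactOrderLine
           GROUP BY OrderID, LineNbr) AS f
       ON f.OrderID = s.OrderID AND f.LineNbr = s.LineNbr
WHERE s.Quantity <> ISNULL(f.Quantity, 0)
   OR s.Amount   <> ISNULL(f.Amount, 0);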

The best practices section later in this chapter will show examples of all these patterns.

Data Quality
Data quality is of primary importance to every data warehouse. A lack of data quality leads directly to business distrust of data warehouse results, which in turn can result in extensive ongoing maintenance costs to track down each reported data issue. Worse yet, business users might stop using the data warehouse entirely. Some common reasons why data quality issues exist include:
- Data quality issues exist in the source system; for example, LOB or core enterprise systems have some level of incomplete data, duplicates, or typographical errors.
- Connecting and correlating data across multiple systems involves data that often does not associate easily.

- External data that is part of the data warehouse rarely matches the codes or descriptions used by internal systems.
- Combining data across similar internal systems (e.g., regional, line-of-business, or sub-entity systems) introduces duplicates, differing data types, and uncoordinated system keys.
The challenge for the data warehouse team is balancing the time and cost involved in trying to cleanse or connect incomplete or erroneous data. The first step is to promote data quality to a first-class citizen by doing the following:
- Implementing robust data quality checks within all ETL processes
- Correcting data quality issues where possible
- Logging data quality issues as exceptions when they can't be corrected
- Building reporting tools for data stewards so they can assist in the detection and correction of data exceptions at the source, where they originated

However, that's not enough. Even correct results are often questioned by business consumers, who rarely have complete insight into how the data has been consolidated and cleansed. Providing complete transparency for all ETL processing is required so that data stewards can track a data warehouse result all the way back to the source data from which it was derived. This section covers data quality through the following topics:
- A data quality scenario
- Data errors and exceptions
- Data stewardship and exception reporting
- Data profiling
- Data cleansing
- Data reconciliation and lineage

Let's start by looking at a data quality scenario. We'll then walk through the other concepts, including data profiling and data cleansing, providing guidance on how to leverage SQL Server technologies to identify and resolve data quality issues.

Data Quality Scenario
A common data quality situation occurs when tracking customer sales across channels. Most retailers sell their product through different sales channels, such as a Web merchant, direct marketing, or a storefront. Customer identification many times isn't recorded within the sales transaction. But without customer identification, there is no easy way to identify and track one customer's transaction history across all sales channels. Tracking a customer's transaction history involves a few steps:
1. Identifying a unique customer across person-centric LOB systems (Call Center, Sales, Marketing, etc.)

2. Mapping the consolidated customers across the various sales channels
3. Mapping the final customers to transactions
The first two steps involve matching customers to identify their residence, which is often called address geocoding or postal identification. This involves matching a received address against a master list of addresses for the area and identifying the physical geographic location. Whether or not the location is used, it provides the ability to identify matches across data sets. Of course, people move or share residences, so another step is to try to match to individuals based on name after an address is identified. In Table 3-2, you can see that a single customer exists multiple times in different systems.

Source | Name | Address | City | Postal Code
Call Center | John Frame | 18 4-Oaks Dr | Monroe, LA | 71200
Customer portal | J.S. Frame | 18 Four Oaks Dr | Unknown | 71200
Website | Johnathan Frame | 18 Four-Oaks drive | Monroe, LA | 00000

Table 3-2: Multiple occurrences of one customer
The master record in Table 3-3 is the cleansed record that the above source records need to match. Running the records above through an address-cleansing utility will identify that these customers are all one and the same. Several applications can handle this task and either use SQL Server as a source or integrate with SSIS.

Master Address ID | Name | Address | City | Postal Code
L456BDL | Jonathan S Frame | 18 Four Oaks Dr | Monroe, LA | 71200-4578

Table 3-3: One occurrence of a customer
Once the addresses and names are matched, the source transactions can be merged together with a common customer association. Examples of where bad data can enter a data warehouse system are numerous:
- Think about all those sales force automation applications where sales representatives are responsible for entering information, including prospect and customer names and contact information.
- Or consider a mortgage-origination system whose results are used to incent and compensate mortgage originators on completing a mortgage. A lack of checks and balances within the system could allow incorrect information (e.g., base salary and assets) to be entered.

Many first-generation data warehouse ETL processes in production today were developed without the checks required to detect, flag, and log data exceptions. In these cases, bad data can make its way into the data warehouse, which ultimately results in business consumer distrust. Once the business starts questioning the results, the burden is on the data warehouse delivery team to prove that the results are correct or to find where the data is in error.

Data Exceptions
There are many types of data exceptions, and this section covers some of the more common types. As you deal with exceptions, you need to decide how to handle each situation: whether you discard the entire related record, cleanse the data, replace the bad data with an unknown, or flag the record for manual review.

Missing or Unknown Values
Missing values are the most common data exception because many systems do not put constraints on every data entry field. When you receive missing values in a data warehouse system, you need to decide whether to replace the value with an Unknown identifier or leave the value blank. The importance of the column also affects how you handle the data. If you are missing a required field to identify the product purchased, for example, you may need to flag the record as an exception for manual review. If the value is just an attribute where unknowns are expected, you can use a placeholder value.
Unknown values go beyond values that are missing. An unknown can occur when you receive a code that does not match your master code list for the given attribute. Or let's say you are doing a lookup from the source record to associate it with other data, but there isn't a match; this is also a type of unknown. In this case, there's an additional option: create a placeholder record in the lookup table or dimension. Kimball refers to this type of data exception as a late-arriving dimension.

Data-Type Conversion and Out-of-Range Errors
Another common data problem, especially when dealing with text files, involves conversion problems. This problem manifests itself through truncation errors, range errors with numeric constraint issues, or general conversion issues (such as converting text to numeric). Table 3-4 contains some examples of these kinds of data quality issues. For each situation, you need to decide whether to try to fix the issue, NULL out the result, or flag the problem for manual review.

Source Value | Normalized Data Type | Issue | Resolution
00/00/2010 | Date | Date does not exist | NULL value or convert to 01/01/2010
1O | 2-byte integer | Typo: the letter O used for the digit 0 | NULL value or convert to 10
ABC | Non-Unicode text | Characters do not map to 1 byte | Remove value, flag for review
Marriott | 5-character text | Truncation issue of destination data type | Truncate value at 5 characters or flag for review of data type
1000 | 1-byte integer | 255 is the max value for a 1-byte integer | Discard value, convert to 100, flag for review

Table 3-4: Data exception examples
Thousands of variations of data conversion and range issues can occur. The appropriate solution may involve simply ignoring the value if the attribute or measure is not core to the analysis, or, at the opposite end of the spectrum, sending the record to a temporary location until it can be reviewed.

Names and Addresses
The data quality scenario above illustrated the type of data exceptions that can occur with addresses and names. Company names are also a common challenge when correlating systems that contain vendors or business-to-business (B2B) customers. Take this list of organizations:
- Home Design, INC
- Home Design, LLC
- Home design of NC
- Home Designs
They could all be the same company, or they could be separate entities, especially if they are located in different states. If you have an address, you can use it to try to match different representations of the company name to the same company, but larger organizations often have multiple locations. However, a combination of name, region, and address will often do a decent job of matching. You can also purchase corporate data that can help associate businesses and other entities.

Uniformity and Duplication
Data entry systems often have free-text data entry or allow a system user to add a new lookup value if he or she doesn't find it, even if it already exists. Here are some considerations for data uniformity:
- When dealing with data in textual descriptions, it is often best to leave the raw data in an intermediate data warehouse database table for drill-down or drill-through use rather than trying to parse out the values.
- When dealing with duplication of codes and values that are more discrete, a lookup mapping table can be maintained to help in the process. Consider managing the mappings through a data validation interface to keep operations or development from being directly involved.
Two general types of duplication exist:
- Duplication of exact values or records that match across all columns

- Duplication where values are not exact but similar
The first step in removing duplication is identifying where duplications can exist and the type or range of variations that indicate duplication.

Data Stewardship and Validation
Before moving on to implementation patterns, let's step back and consider data quality from an overall perspective. Figure 3-32 illustrates the life cycle of data quality across the data warehouse data integration process. (Note that this figure doesn't include an Extract In database, which is often populated from the source and in turn populates the Production data area.)

Figure 3-32: Data quality checkpoints in the data integration life cycle
Data warehouse efforts need a data steward who owns everything from the initial data profiling (identifying data issues during development) to the data exception planning process and exception reporting.

Data Quality Checkpoints
The earlier data exceptions are detected, the easier it is to repair the error. There are several steps in the data integration process where data exceptions can be identified:
- Operational reports: Detecting data quality errors from operational reports allows the data steward to repair the data error.
- Data profile reports: Data profiling allows the data steward to detect certain data quality errors, including out-of-range values. Identifying data exceptions at this stage enables the ETL processes to check for and log these exceptions.
- Exception reports: Exception reports show that the ETL process detected and logged an exception in the Source- (or Extract In-) to-Stage ETL process. The data steward can use the exception report to identify and repair the error at the source.

Repairing/Flagging Data Exceptions

The ETL process in Figure 3-32 shows an Exception database (Exc) where data exceptions from ETL processing are inserted. An alternative approach is to repair the data exception and/or flag the records as exceptions and allow them to flow to the data warehouse, as shown in Figure 3-33.

Figure 3-33: Flagging exceptions
Reasons you might consider this approach include:
- The data exception is not severe and can be repaired by the ETL process. Missing values and unknown values are examples of this.
- The record has valid data along with the exceptions that need to be flagged.

Data Profiling
One of the first tasks for the data steward is to profile the source data to document data quality issues that need to be handled by the development team. This assumes that a data map has been generated to identify which data points from the source systems and files will be used in the warehouse and how they are mapped to the destination entities. SQL Server 2008 introduced a built-in Data Profiling task in SSIS that allows an easy review of source data for this purpose. To leverage the Data Profiling task, you first have to have the data in SQL Server tables, such as in an Extract In database. Figure 3-34 shows the task editor and the types of data profiling that can be done, including NULL evaluations, value distributions and patterns, statistics on numeric columns, candidate key options, and more.

Figure 3-34: Data profiling in SSIS
After the task is run in SSIS, it generates an XML file that can be viewed with the Data Profile Viewer, found in the SQL Server 2008 applications folder, which you can access from the Start menu. As an example, the report in Figure 3-35 shows the column length distribution for a column. It also provides sample values and lets you include or exclude leading or trailing spaces.

Figure 3-35: Column length distribution
Other data profile types include pattern identification with regular expression outputs; max, min, and mean value statistics; distribution of numbers; and discreteness. You can use this collective information to help determine where and how data quality issues should be handled.

Data Cleansing
Data cleansing is the activity of dealing with data quality issues and data exceptions in the ETL process. Cleansing can range from simply replacing a blank or NULL value to a complicated matching process or de-duplication task. Here are some general guidelines for implementing data cleansing:
- Multiple layers of data cleansing are often applied in a data warehouse ETL process. The first layer usually involves data parsing and handling of common data exceptions such as NULLs, unknowns, and data range problems. The second layer may involve de-duplication, data correlation to a different source, data consolidation of different sources, or data aggregation and summary.

- A common mistake is to perform updates against the raw tables in the Extract In environment. This adds a synchronous step, incurs a database performance and I/O hit because of the transaction, and lengthens the overall execution time. In addition, this approach invalidates the ability to go back to the extraction environment and see the raw data as it was first extracted.
- A better approach is to either apply the initial layer of data cleansing in the query that is run against the extract table or leverage the transformation capabilities in the ETL tool. Both of these strategies have much less overhead and allow more server resources and time to be applied to any complicated data cleansing. Of course, you can also use T-SQL queries to cleanse data, employing ISNULL, NULLIF, RTRIM, LTRIM, REPLACE, and so on (a short T-SQL sketch follows the SSIS example below).
When using SSIS for data cleansing, you can apply many of the transformations to different data cleansing situations. Here are some examples:
- You can use the Derived Column transformation for text parsing, NULL identification and replacement, calculations, and more.
- Lookup and Merge Join transformations can help with data correlation and association.
- Fuzzy Lookup and Fuzzy Grouping allow for complicated data associations and de-duplication.
- Pivot and Unpivot transformations let you change the granularity and normalization pattern of data.
As a simple example, Table 3-5 contains data from a text column that identifies rough locations of equipment. The challenge is parsing out the data.

Values
6N03 /D E. HATTERAS, NC
6N46 /A CANAVERAL, FL
10D08/M WESTERN GULF
3D35 /D LANEILLE, TX
10D09/A WESTERN CARIBBEAN

Table 3-5: Sample data equipment locations
However, you can use two SSIS Derived Column transformations to parse and cleanse the data. The first transformation does some preparatory textual calculations, and the second performs the primary parsing. Figure 3-36 shows the data flow and the Data Viewer with the resulting parsed data.

Figure 3-36: Text cleansing in SSIS
The Derived Column transformation uses the SSIS Expression Language. Tables 3-6 and 3-7 show the expressions used in each transformation. The reason two transformations are used is that the columns added in the first transformation are referenced in the second, making the expressions easier to follow and manage.

Column | Purpose | Expression
Location | Replace NULLs | ISNULL(Location) ? "Unknown" : TRIM(Location)
LocationPosition | Starting position for Location | FINDSTRING(Location,"/",1)
StatePosition | Starting position for State | FINDSTRING(Location,",",1)

Table 3-6: Derived Column 1 expressions

Column | Purpose | Expression
ParsedLocation | Parse Location | SUBSTRING(Location,LocationPosition + 3,(StatePosition == 0 ? (LEN(Location) - LocationPosition + 4) : (StatePosition - LocationPosition - 3)))
ParsedState | Parse State | (StatePosition == 0) ? "UNK" : SUBSTRING(Location,StatePosition + 2,LEN(Location) - StatePosition + 1)

Table 3-7: Derived Column 2 expressions
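As noted earlier, the same first-layer cleansing can also be expressed in T-SQL in the query that reads the extract table. The following is a minimal sketch; the extract table and its columns are hypothetical.

SELECT
    -- Trim whitespace and replace empty or NULL values with a placeholder
    ISNULL(NULLIF(LTRIM(RTRIM(Location)), ''), 'Unknown') AS Location,
    -- Convert only values that are actually valid dates; otherwise NULL them out
    CASE WHEN ISDATE(ReportedDateText) = 1
         THEN CONVERT(datetime, ReportedDateText)
         ELSE NULL
    END                                                   AS ReportedDate,
    -- Collapse doubled spaces introduced by free-text entry
    REPLACE(EquipmentName, '  ', ' ')                     AS EquipmentName
FROM extractin.EquipmentLocation;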

The SSIS Best Practices section of this chapter provides additional approaches to and examples of de-duplication.

Data Reconciliation
As we noted earlier, users will distrust data warehouse results when the numbers don't add up. A common complaint is, "I get different results from different systems." This complaint can be unfounded, but if that's the perception, the reality often doesn't matter. Here's where reconciliation becomes a key component of a data steward's responsibilities. Reconciliation is the process of comparing results for the same metric from two different systems; the classic example is balancing a checkbook. Table 3-8 shows a sample bank statement; Table 3-9 is the checkbook register.

Date | Account | Type | Ending Balance
31-May | 100-35 | Checking | 550.00

 | Count | Amount
Previous Balance | | 0.00
Deposits/Credits | 1 | 1000.00
Checks/Debits | 4 | 450.00

Date | Description | Number | Amount | Balance
1-May | Deposit | | 1000.00 | 1000.00
17-May | Check | 1 | 100.00 | 900.00
24-May | Check | 2 | 50.00 | 850.00
25-May | Check | 3 | 200.00 | 650.00
31-May | Check | 4 | 100.00 | 550.00

Table 3-8: Bank statement

Date | Description | Number | Amount | Balance
1-May | Deposit | | 1000.00 | 1000.00
15-May | Electric Utility | 1 | 100.00 | 900.00
22-May | Gas | 2 | 50.00 | 850.00
23-May | Rent | 3 | 200.00 | 650.00
30-May | Grocery Store | 4 | 100.00 | 550.00
31-May | Restaurant | 5 | 100.00 | 450.00

Table 3-9: Checkbook register

To reconcile the statement with your checkbook records, you would:
1. Compare the bank statement Ending Balance with the checkbook Balance on May 31, the date the bank statement was created.
2. If these values are the same, you've successfully reconciled your checking account. If these values are not the same, as in the above example, more analysis is required.
3. One next step is to compare deposit and check counts. In the above example, the bank statement shows four checks, while the checkbook ledger shows five.

4. The next step is to look at the detailed check ledgers to find the discrepancy. In this example, the bank statement does not include Check #5. This is probably due to the delta between when a check is issued by the buyer to the seller and when the seller deposits the check in his or her bank account.
This simple example demonstrates the key role of reconciliation within a data warehouse and how it can grow to be an all-consuming process. It's important to equip the data steward with tools that assist in this process. We'll present reconciliation best practices later in this chapter. Dates also play a significant role in any data warehouse and are a common cause of business consumers questioning data warehouse results. We'll discuss the role of dates and time in data integration in an upcoming section.

Lineage
Data exception and reconciliation processes both benefit when lineage is introduced into the data integration processes. Merriam-Webster's online dictionary defines lineage as "a descent in a line from a common progenitor." Lineage in data integration can be applied to:
- Code and Data Definition Language (DDL) lineage: Identifies a particular version of the ETL code or a table or column definition
- Data lineage: Identifies the source record or records for a destination record
- Execution lineage: Identifies the instance of the ETL process that inserted a new record or updated an existing record

Code and DDL Lineage
Although not directly related to data quality, code and DDL change tracking does allow data integration issues and bugs to be identified. Code lineage is used to tie an executing instance of an ETL process to a particular version of that ETL process. Capturing and logging code lineage at execution time assists error and exception analysis: the team tracking down a code error or a data exception can check to see whether the code lineage changed from the previous execution instance. SSIS has many package-level system variables available at runtime that provide detailed information about the particular version of the package that's executing. You can learn more about system variables in SQL Server Books Online.
DDL lineage is about tracking changes to the structure of the data warehouse over time. Although often manually maintained through DDL scripts, DDL lineage assists in root cause analysis when an issue is identified. Like any system, when the physical tables or key relationships need to change, the change needs to be scripted out and packaged into a deployment, including:
- Handling of data changes that are affected by the DDL change
- A DDL change script encompassed within a transaction

DDL lineage is commonly handled by simply maintaining SQL scripts in a deployment folder. After the scripts are tested in a QA environment, they are usually run by the DBA in a planned downtime window and then archived for future reference. DDL changes should be documented in the project management system.

Data Lineage
One of the two most important types of lineage related to data warehouse systems is data lineage. Data lineage is about tracking each data row through the system from the source to the data warehouse. The value of data lineage includes:
- Data validation and drill-through analysis
- Troubleshooting data questions and errors
- Auditing for compliance and business user acceptance
- Identification of records for updates or comparison
Two general approaches can be taken for data lineage:
- Leveraging the source system's key for lineage tracking. In some cases, using the source system keys is the easiest way to track rows throughout the integration process. Figure 3-37 highlights a couple of sets of data lineage columns in the represented fact table. The primary lineage columns are the sales order numbers, but the tracking and PO numbers are also useful for lineage.

Figure 3-37: Data lineage example
- Creating custom lineage identifiers in the data warehouse ETL load. Not all sources have keys, or the keys may not uniquely identify the source record. In this case, data lineage can be handled by generating an auto-incrementing number in the ETL, as Figure 3-38 shows.

Figure 3-38: Custom data lineage example
Generating a lineage value is easiest to do right in the export process from the source. In SSIS, a couple of lines of code will allow an auto-incrementing value to be added. Figure 3-39 shows a data flow with a Script Component that adds the lineage.

Figure 3-39: Data lineage in the SSIS data flow

In this example, a package variable of type Int32 has been created called DataLineageID. This value is first updated in an Execute SQL task with the last lineage value for the table. The data flow Script Component then simply increments an internal variable for the lineage by 1 for each row. The code in the Script Component is shown below.
Public Class ScriptMain
    Inherits UserComponent

    ' Seeded from the DataLineageID package variable (set earlier by an Execute SQL task)
    Private DataLineageID As Integer = Me.Variables.DataLineageID

    Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)
        ' Increment the lineage counter and stamp it on the current row
        Me.DataLineageID = Me.DataLineageID + 1
        Row.DataLineageID = Me.DataLineageID
    End Sub

End Class
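The Execute SQL task that seeds the DataLineageID variable might run a statement along these lines (a sketch assuming a hypothetical destination fact table):

-- Seed the package variable with the highest lineage value loaded so far
SELECT ISNULL(MAX(DataLineageID), 0) AS DataLineageID
FROM dbo.FactSales;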

The challenges of data lineage revolve around complicated transformation logic where multiple source rows combine into one destination row. In this case, a mapping table of surrogate keys to lineage IDs can allow history and mapping to be tracked.

Execution Lineage
The other important lineage in data warehouses is execution lineage, which is about recording detailed information for each ETL package execution. When stored as an integer value within each destination table, execution lineage allows the data steward or auditor to view the specifics of the instance of the ETL process that created or modified the record. Execution lineage tracks:
- When a process ran
- What ETL batch it belongs to
- The precedence of steps
- The duration of steps
- Errors from process failures and warnings
Execution lineage is created within the context of a logging facility, which is usually a component of a larger ETL framework.

ETL Frameworks
Successful enterprise data warehouse integration efforts typically have an ETL framework as one of their best practices. ETL frameworks at their core support dynamic configurations and centralized logging. This section will first provide an overview of ETL frameworks and then cover examples of SSIS template packages that integrate into an ETL framework.

This section's best practices, used for ongoing management and monitoring of ETL processes, are targeted at architects, developers, and ETL operations staff:
- Data architects and developers are responsible for implementing the tools and frameworks used by ETL operations.
- ETL operations can be a separate team in large organizations, or systems and database administrators in smaller organizations. These resources are responsible for keeping the wheels on the bus for all ETL processes within an organization.


Note that all the packages presented here, as well as the ETL reports shown in this section, can be found online at http://etlframework.codeplex.com/.

ETL Framework Components
The primary components of an ETL framework are a database, template packages, and reports, as shown in Figure 3-40.

Figure 3-40: ETL framework components

The ETL framework database stores all metadata and logging for enterprise ETL activity:
- Active Metadata: Consists of technical metadata tables used to drive ETL package execution. Examples include configuration tables, sequenced execution packages, and source-to-destination mappings.

- Logging: Contains all logging activity from master and execution packages.


Package templates are used as starting points for all ETL development. Each template interfaces with the ETL framework stored procedures. There are several package templates and components:
- Master package: Creates and logs batches and provides workflow for execution packages.
- Execution package: Creates, logs, and provides workflow for one or more data flows. In addition, the template presented in the Execution Package section below provides post-processing activity for the insert and update data flow patterns.
- Execution lineage: A unique value representing one execution instance of a data flow. As we saw earlier, this value is stored in destinations and provides a link between the destination data and the process that created or modified the data.

Users and Interfaces
Developers, operations personnel, and data stewards are the primary users of the ETL framework. Figure 3-41 shows how they interface with the ETL framework.

Figure 3-41: ETL framework roles and interfaces

Let's look a little more closely at the key roles and responsibilities involved in the ETL framework.

ETL developers:


- Develop the ETL packages and maintain and extend the ETL templates
- Maintain and extend the ETL framework reports
- Assist operations with diagnosing and correcting errors


ETL operations:
- Maintain configuration parameters
- Schedule master packages
- Monitor ETL activity
- Identify and coordinate ETL error diagnostics
- Receive error alerts

Data stewards:
- Review reports for out-of-band activity
- Review, diagnose, and manage data change requests
- Receive exception alerts

Master and execution packages interface with ETL framework database objects.

Master package interfaces:
- Load configurations into SSIS variables and properties
- Log batch activity
- Log errors from master and execution packages
- Send email error and exception alerts
- Send batch and configuration information to execution packages

Execution package interfaces:
- Accept batch and configuration parameters from the master package
- Log data flow activity
- Get/set filter values for incremental loads

Most mature ETL development shops have their own version of an ETL framework. Dynamic configurations and logging are core capabilities, but they may be implemented differently from one implementation to the next.

Configurations
Most ETL frameworks developed for SSIS use either XML configuration files or database tables for dynamic configurations. XML configurations are typically preferred by developers who are fluent in XML. Configurations are changed by directly editing values stored within the XML file. The XML file is then loaded by either an SSIS XML configuration or a script task.

Database configuration tables are typically preferred by IT operations and DBAs who are less fluent in XML than developers. Another advantage of configuration tables is that each instance of the configuration parameters can be stored in a table, which helps later during diagnostics if an incorrect parameter value is specified. You can find more information about SSIS configurations in the following articles:
- SQL Server Integration Services SSIS Package Configuration
- Simple Steps to Creating SSIS Package Configuration File
- Reusing Connections with Data Sources and Configurations
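As a rough illustration of the table-based approach, a configuration table only needs to associate a configuration identifier with named parameter values. The following T-SQL is a minimal sketch with assumed table and column names, not the actual framework schema.

-- Hypothetical configuration table: one row per parameter per configuration set
CREATE TABLE MDM.EtlConfiguration
(
    ConfigurationId varchar(20)   NOT NULL,  -- e.g., 'DEV', 'QA', 'PROD'
    ParameterName   varchar(128)  NOT NULL,  -- e.g., 'cfgDtsxDirectoryName'
    ParameterValue  varchar(1024) NOT NULL,
    CONSTRAINT PK_EtlConfiguration PRIMARY KEY (ConfigurationId, ParameterName)
);

-- A master package reads its parameter set at run time, for example:
SELECT ParameterName, ParameterValue
FROM MDM.EtlConfiguration
WHERE ConfigurationId = 'DEV';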


Logging
Most ETL frameworks use custom logging. This is usually due to the low-level nature of SSIS logging and because SSIS lacks the concept of a batch, which provides a logical grouping of activity. This section presents reports from an ETL framework that uses custom logging to record package activity. The reports we'll look at in a moment are generated from logging tables, as shown in Figure 3-42.

Figure 3-42: ETL framework logging tables

This collection of logging tables consists of:
- Batch: Contains one record for each instance of a batch invocation. A batch workflow is typically implemented in a master package, which is described below.
- Parameter: Contains one record for each parameter used to dynamically configure a batch instance.
- Activity: Contains one record for an activity, a logical construct that can contain zero or more transfers (a.k.a. data flows).
- Error: Contains one record for each error thrown within an ETL activity.
- Transfer: Contains one record for a data flow between one source and one destination.
- Object: Contains one record for each unique instance of an object used as a source and/or destination.
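To make the relationships concrete, the following is a minimal sketch of the two central tables, Batch and Activity. Only the BatchId and ActivityId columns appear in the framework queries later in this chapter; the remaining column names and data types are illustrative assumptions, not the actual framework schema.

-- Hypothetical shapes for the batch and activity logging tables
CREATE TABLE LOG.EtlBatch
(
    BatchId       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    MasterPackage varchar(128)      NOT NULL,                 -- controller package name (assumed)
    StatusCode    char(1)           NOT NULL,                 -- E = Error, S = Success, A = Active (assumed)
    CreatedOn     datetime          NOT NULL DEFAULT (GETDATE())
);

CREATE TABLE LOG.EtlActivity
(
    ActivityId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,      -- also used as the execution lineage ID
    BatchId      int               NOT NULL REFERENCES LOG.EtlBatch (BatchId),
    ActivityName varchar(128)      NOT NULL,                  -- assumed
    CompletedOn  datetime          NULL                       -- assumed
);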

Now let's look at some reports that leverage these tables to see the value of centralized logging for all ETL activity within an organization. ETL operations staff use these reports to gauge the overall health of ETL


activity. Note that the examples presented below are all developed using SQL Server Reporting Services (SSRS). Figure 3-43 shows the highest-level report, displaying ETL batch activity for a specified period of time.


Figure 3-43: Batch Summary Report

Table 3-10 summarizes the report columns and provides a short description of each.

Column Name     Description
Master Package  Name of the controller package that manages the ETL batch
Batch ID        Unique ID for the batch instance
Config          Configuration identifier used to retrieve the batch configuration parameters
Sts             Completion status: E = Error, S = Success, A = Active
Created On      Time the batch started execution
Duration        How long the batch took to execute
Errors          Total error count (if any) for this batch execution instance
Reads           Total number of records read from the source
Exc             Total number of records sent to the exception area
Ignored         Total number of records not processed
Inserts         Total number of records inserted
Updates         Total number of records updated


Table 3-10: Batch Summary Report columns

Note that ETL operations staff can choose to filter batches started on or after a specified date. Clicking the Config value in the above report links to another report showing the input values used in the dynamic configuration process, as Figure 3-44 shows.


Figure 3-44: Batch Parameter Report

This example shows the parameters used to load dynamic configurations for the development (DEV) environment. Note that:
- Each of the parameters maps directly to an SSIS package variable. A dynamic configuration script task reads the value from the configuration table; this value in turn is used to populate the SSIS variable.
- ETL operations can change a configuration by modifying values in the ETL framework configuration table.
- The directory to which the ETL packages have been deployed is determined by the cfgDtsxDirectoryName parameter value.
- The source and destination databases are initialized by the cfgxxxServerName and cfgxxxDatabaseName parameter values.

Error details are also a click away. Figure 3-45 shows the report you get when you click the Batch 506 error count field in the Batch Summary Report. This report displays details about all the errors encountered within the batch instance.


Figure 3-45: Batch Error Report

Table 3-11 summarizes the Batch Error Report's columns.

Column Name        Description
Activity ID        Identifier for the activity run instance. Note that 506 is the batch ID and 1589 is the execution lineage ID.
Source Name        The activity that generated the error
Error Description  The errors thrown by the SSIS process

Table 3-11: ETL Error Report columns

Notice in Figure 3-45 that there was a violation of a unique key constraint when the SSIS data flow attempted to insert a record that already existed. Clicking the Batch Id field within the Batch Summary Report displays the Batch Detail Report, shown in Figure 3-46.

Figure 3-46: Batch Detail Report

Note that each report line maps to one execution lineage ID. This report provides details on the activity, the data flow source and destination, and record counts. Table 3-12 contains a brief description of each Batch Detail Report column.

Column Name   Description
Id            Activity ID; this is the execution lineage ID
Package       Package name
Activity      Activity name
Source        Source name, 3-level naming (Database.Schema.Table)
Destination   Destination name, 3-level naming (Database.Schema.Table)
Rows          Reconciliation value across the Read, Ignored, Exc, Inserts, and Updates counts
Read          Records read
Ignored       Records ignored
Exc           Records sent to the exception area
Inserts       Total number of records inserted
Updates       Total number of records updated
SQL           Click through to the SQL used to select from the source


Table 3-12: Batch Detail Report columns

These reports show one approach to centralized custom logging for ETL activity, with all of the above reports coming from the logging tables populated by an ETL framework. The next sections look at examples of the master and execution package templates that interface with the ETL framework and support table-based dynamic configurations and custom logging.

Master Package
Master packages control ETL package workflow for one ETL batch. This section provides an overview of the components within a master package. Figure 3-47 shows a master package.


Figure 3-47: Master package

A master package is composed of:
- ETL framework tasks: These stock tasks, present in every master package, are responsible for loading and recording dynamic configuration instances, creating and updating the ETL batch record, and sending notifications to interested parties.
- Execution package tasks: The master package developer creates the workflow for the ETL batch by connecting a sequenced set of execution package tasks.
- Contract: The master package's variables can be viewed as the contract between ETL framework components and SSIS connections and tasks.

The master package developer starts with a master package template, which contains the ETL framework tasks used to interface with the ETL framework. The execution package workflow is then added to this master package template. Figure 3-48 shows an example of an execution package workflow.


This scenario populates the AdventureWorksDW2008 DimCustomer and DimGeography tables from the AdventureWorks2008 OLTP database. Notice how three Production data area tables (PersonContact, City, and Address) are populated prior to loading the DimGeography and DimCustomer data warehouse tables. This illustrates how even simple scenarios require an intermediate Production data area within a data warehouse.


Figure 3-48: Master package execution package workflow

Here are the steps for developing this workflow:
1. Determine the workflow for the ETL batch. The simple example above shows the logical steps used to build the Geography and Customer dimensions. Again, notice how activity is required within the Production data area prior to loading the dimensions.
2. Create one Execution Package task for each separate ETL activity.
3. Create one file connection for each execution package.
4. Set the DelayValidation property to True for each file connection. This allows the file connection to be dynamically initialized at run time.
5. Create an expression for each ConnectionString property. Figure 3-49 shows the expression for the EP_LoadCity file connection.


Figure 3-49: ConnectionString expression

The file directory where the SSIS packages reside is loaded dynamically. This simple expression allows the ETL operations team to easily deploy packages onto a server by:
1. Creating a directory to store the ETL packages
2. Copying the ETL packages into the directory
3. Creating configuration entries (this ETL framework uses a Configuration table)
4. Using dynamic configurations to initialize the cfgDtsxDirectoryName variable at run time

Here are a few developer notes to keep in mind:
- When DelayValidation is set to False, SSIS validates the connection metadata at package open time, not package execution time. Setting DelayValidation = False while a hard-coded directory value is stored in the cfgDtsxDirectoryName variable is a common developer oversight. The result is that the wrong set of ETL packages can get invoked or, more commonly, the package will fail once it moves from the development environment to the test environment.
- OLE DB sources within data flows have a ValidateExternalMetadata property, which defaults to True. When set to False, the source metadata is not checked at design time, which could result in a run-time error when the source metadata changes.

Error Handling
The SSIS event model allows developers to build SSIS workflows that are activated when an event fires. Figure 3-50 shows the workflow for the master package On Error event.


Figure 3-50: On Error workflow

This workflow writes an error record to the ETL framework and then optionally sends an alert if the error results in termination of the master package. The error record is populated with a combination of SSIS user and system variables. SSIS system variables contain additional information that is useful when diagnosing the error. The following link has more information on the system variables available to the SSIS developer: http://technet.microsoft.com/en-us/library/ms141788.aspx

Note that this master package's On Error event also captures all errors from execution packages invoked within the master package. Adding specific error handlers for execution packages is necessary only if error logic specific to the execution package is required (e.g., whether to continue processing instead of terminating).

Master Package Example: Metadata-Driven Approach
An alternative strategy for master packages is a metadata-driven approach: storing a sequenced set of execution packages within a table and using it to dynamically configure an Execute Package task. Figure 3-51 shows an example of a metadata-driven master package.


Figure 3-51: Metadata-driven master package

Metadata-driven master package tasks include:
- A GetPackages Execute SQL task that queries a table containing a sequenced set of execution packages and returns an ADO result set into a variable. Figure 3-52 shows this task's properties and the SQL statement used to return the result set.
- A Foreach Loop container that iterates through the ADO result set returned by GetPackages. This result set contains an ordered set of execution packages, and each record value is mapped to an SSIS variable.
- An EP_Package Execute Package task that references the EP_Package connection.
- The EP_Package connection, which uses the following expression to dynamically set the execution package location and name:

@[User::cfgDtsxDirectoryName] + @[User::rsPackageName] + ".dtsx"
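The metadata table queried by the GetPackages task needs only a handful of columns. The following T-SQL is a minimal sketch based on the columns referenced by the getPackageSql query shown below; the data types are assumptions, not the actual framework schema.

-- Hypothetical shape of the execution-package metadata table
CREATE TABLE MDM.EtlPackage
(
    MasterPackageName varchar(128) NOT NULL,  -- master package that owns this step
    PackageName       varchar(128) NOT NULL,  -- execution package to run (without the .dtsx extension)
    SortOrder         int          NOT NULL,  -- execution sequence within the master package
    ActiveFlag        char(1)      NOT NULL   -- 'A' = active; other values are skipped
);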


Figure 3-52: GetPackages properties with the getPackageSql variable value

This sample master package allows execution packages to be added to and removed from the execution flow without any code changes. The getPackageSql variable contains the following expression:

"SELECT DISTINCT PackageName, SortOrder FROM MDM.EtlPackage WHERE ActiveFlag = 'A' AND MasterPackageName = '" + @[System::PackageName] + "' ORDER BY SortOrder "

Note how the master package name, stored in the System::PackageName variable, is used to filter the result set.

Execution Package
Now let's look at an example of an execution package. Figure 3-53 shows an execution package template. Note that every task in this template is an ETL framework task. The ETL developer is responsible for adding the data flow and initializing the Contract variables to drive task behavior.


Figure 3-53: Execution package template

This execution package template includes the following tasks:
- InitVars_EtlFwk: Retrieves the dynamic configuration information created and persisted by the master package, using the batch ID as a filter. The batch ID is initialized using a Parent Package Variable configuration.

- InitActivity: Creates an ETL framework activity record. This record's primary key is the execution lineage ID and is stored in the destination's LineageId column by the data flow.
- BuildSqlFilter: Retrieves the last maximum value for the source system column used to log changes (e.g., ModifiedDate). This value is inserted into the source's SELECT statement to limit the records processed to the ones that have changed since the last invocation of the package instance. (A sketch of such a filtered extract follows this list.)
- dfContainer: The container in which the ETL developer creates the data flow. Note the comments section, which lists the output variables that the data flow must populate.
- PostProcess: Three tasks that combine to implement set-based processing after the data flow has completed. These set-based tasks are the ones mentioned in the ETL Patterns section above. The BuildSql task creates the set-based updates and inserts, which are then executed by the UpdateTable and InsertHistory tasks. The ETL developer is responsible for setting the following variables:
  o ETL pattern: Is this an update or versioned insert pattern?
  o Natural key(s): The natural key list is used in the update and versioned insert set-based updates to join instances of a record together.
  o Update columns: These columns are the values updated in the update pattern.
- UpdateActivity: Updates the ETL framework activity record with execution status and completion date/time. It also inserts transfer record details (e.g., source and destination names, reconciliation row counts, and the source SQL statement).
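The following T-SQL sketch shows the kind of filtered extract that BuildSqlFilter enables, using the AdventureWorks Sales.SalesOrderHeader table; the ETL.LoadWatermark table used to persist the last value is an illustrative assumption.

-- 1. Retrieve the watermark saved by the previous run
DECLARE @LastModifiedDate datetime;
SELECT @LastModifiedDate = LastValue
FROM ETL.LoadWatermark
WHERE TableName = 'Sales.SalesOrderHeader';

-- 2. Extract only the rows changed since the last run
SELECT soh.SalesOrderID, soh.CustomerID, soh.TotalDue, soh.ModifiedDate
FROM Sales.SalesOrderHeader AS soh
WHERE soh.ModifiedDate > @LastModifiedDate;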


This section's sample execution package template isn't meant to be the definitive example, but it does illustrate how the following key requirements can be implemented:
- Dynamic configurations: The dynamic initialization of connections provides for hands-off management as ETL solutions move from development to production environments.
- Batch and execution lineage: Creating an execution lineage ID allows data stewards to navigate a table's change history when business consumers question data warehouse content. Analyzing particular batch runs is possible because the lineage IDs all have a batch parent record.
- Incremental loads: Adding a filter on a source system change column (a date or integer) significantly reduces the number of records processed.
- Post-processing: Using set-based updates to post-process the update and versioned insert patterns is more efficient than performing the updates row by row within a procedural data flow.

This ETL framework example uses a file-based storage option for its master and execution packages. However, there are other options, and the next section provides guidance on the key criteria for choosing a package storage approach.

Package Storage Options
The master package example above uses a file-based option for storing SSIS packages. The benefits of file-based storage are:


- A file copy can be used to transfer ETL packages between environments.
- File operations (copy and backup) are well understood by the IT operations resources responsible for deploying solutions into production and QA environments.
- The SSIS service is not required for this option, which eliminates an additional moving part and simplifies the environment. For more information on the SSIS service, see http://msdn.microsoft.com/en-us/library/ms137731.aspx.


However, there are two other package-storage options:
- Storing packages within the msdb database. One benefit of this option is security: The administrator can use built-in roles (e.g., db_ssisadmin, db_ssisltduser, and db_ssisoperator) when granting access to a package.
- Storing packages within a directory monitored by the SSIS service. A benefit of this option, which requires the SSIS service, is that the SSIS service monitors all package activity on a server.

Which package storage option is most appropriate for your environment? The workflow illustrated in Figure 3-54 can help you decide.

Figure 3-54: Package storage options workflow


Essentially, SQL Server package storage is appropriate when you're using SQL Server as a relational database and:
- Package backup and security are important, or
- You have hundreds of packages


File package storage is appropriate when:
- You don't use SQL Server as a relational database, or
- Your source control system is used to control package deployment, or
- You have a limited number of packages

The article Managing and Deploying SQL Server Integration Services contains good information about managing and deploying SSIS solutions.

Backing Out Batches
As discussed earlier, the ability to back out all new and changed records for a particular batch execution instance should be a requirement for every ETL framework. This capability is necessary because ETL batches are long-running, I/O-intensive processes that don't lend themselves to being efficiently encapsulated within a database transaction.

Note: It's critical that data stewards locate processing errors as soon as possible. Backing out an ETL batch that ran in the past (e.g., two weeks ago) would most likely require backing out all subsequent batches, because incremental changes applied to a bad record would also result in bad data.

Execution lineage and custom logging help simplify the task of backing out changes. Figure 3-55 shows the high-level flow of the procedure used to back out all changes from a recently executed batch.

Figure 3-55: Batch back-out code

The SQL used to retrieve the table set for this ETL framework is shown below:
SELECT DISTINCT
    d.SchemaName + '.' + d.ObjectName AS DestinationTable,
    x.ActivityId AS LineageId,
    a.BatchId
FROM LOG.EtlXfr x
INNER JOIN LOG.EtlActivity a ON x.ActivityId = a.ActivityId
INNER JOIN LOG.EtlObject d ON x.DstId = d.ObjectId
WHERE a.BatchId = @BatchID
ORDER BY x.ActivityId DESC


Table 13 shows the result set from this query for the scenario presented in the execution package section above.

DestinationTable    LineageId  BatchId
dbo.DimCustomer     1618       513
dbo.DimGeography    1617       513
MDM.Address         1616       513
MDM.City            1615       513
MDM.PersonContact   1614       513

Table 13: Destination table result set for BatchId 513

Notice how the records are sorted in descending order by LineageId. This allows you to iterate through the record set without worrying about conflicts such as primary key/foreign key constraints, because the master package executes ETL packages in an order that honors referential integrity. The following sequence is executed for each table. (Note that this section refers to the versioned insert and update integration patterns, which are covered in the next section.)

1. The first step is to delete all the newly inserted records. This is simple with the versioned insert table pattern because the only operations on the table are inserts:

   DELETE FROM @DestinationTable WHERE LineageId = @LineageId

   This is also a simple operation for the update table pattern when there's a version ID column or some other indicator that this was the first operation on that record, as shown below:

   DELETE FROM @DestinationTable WHERE LineageId = @LineageId AND VersionId = 1

2. Update table patterns require one more operation: re-applying the previous values to any recently updated fields. This is where history tables are useful. Figure 3-56 shows the pseudo-code used to dynamically build this UPDATE statement.

Figure 3-56: Code to re-apply the previous version

The following is pseudo-code for the UPDATE statement generated from the logic in Figure 3-56:

UPDATE @TableName
SET tableName.ColumnName = h.ColumnName, ...
FROM @TableName
INNER JOIN @HistoryTable h ON tableName.NaturalKey = h.NaturalKey
WHERE LineageId = @LineageId
  AND h.VersionId = tableName.VersionId - 1

More ETL Framework Information
The ETL framework presented in this section is available online at http://etlframework.codeplex.com/. It includes:
- Scripts to create the ETL framework database
- ETL framework reports
- Master and execution package templates
- Execution package examples


Data Integration Best Practices


Now that you have a good understanding of SSIS data integration concepts and patterns, let's look at some examples that put those ideas and patterns into action.

Basic Data Flow Patterns
This section presents SSIS examples of the two fundamental patterns that almost all ETL data flows are based on: the update pattern and the versioned insert pattern. Each of these patterns can be applied to the table types made popular by Kimball, Slowly Changing Dimensions Type 1 (SCD 1) and Type 2 (SCD 2), as well as fact tables:
- SCD 1: Uses the update pattern
- SCD 2: Uses the versioned insert pattern
- Fact table: The pattern you can use depends on the type of fact table being loaded. (We'll discuss this in more detail later in this chapter.)

Update Pattern
A table containing only the current version of a record is populated using an update data flow pattern, as Figure 3-57 shows.


Figure 3-57: Update data flow pattern

Some things to keep in mind about the update data flow:
- This data flow only inserts rows into a destination. SSIS destinations have a Table or view - fast load option, which supports very fast bulk loads.
- You can update records within a data flow by using the OLE DB Command transform. However, this transform issues one SQL UPDATE per record; there is no concept of a bulk-load update within a data flow. Because of this, the pattern we present only inserts records and never updates records within the data flow.
- Every table has an associated history table that is responsible for storing all activity. One record is inserted into the history table every time the source record exists in the destination but has different values.
  o The alternative is to send the changed records into a working table that is truncated after the update processing completes.
  o The benefit of keeping all records in a history table is that it creates an audit trail that data stewards can use to diagnose data issues raised by business consumers.
- There are different options for detecting record changes; these are covered later in this chapter.

Figure 3-58 shows an SSIS update pattern data flow.


Figure 3-58: SSIS update data flow pattern

This SSIS update data flow contains the following elements:
- Destination table lookup: The lkpPersonContact transform queries the destination table to determine whether the record already exists (using the natural key).
- Change detection logic: The DidRecordChange conditional split transform compares the source and destination. The record is ignored if there's no difference between the source and destination; it is inserted into the history table if the source and destination differ.
- Destination inserts: Records are inserted into the destination table if they don't already exist.
- Destination history inserts: Records are inserted into the destination history table if they do exist.

After the data flow completes, a post-processing routine is responsible for updating the records in the primary table with the records stored in the history table. This is implemented by the execution package template's post-processing tasks, as Figure 3-59 shows.


Figure 3-59: SSIS update post-processing tasks

The execution package's PostProcess container is responsible for all set-based activity for this pattern, including:
- Updating the table's records using the records inserted into the history table (see the sketch after this list).
- Inserting the first version of all new records into the history table. The table's primary key is an IDENTITY column whose value isn't known until after the insert, so we need to wait until the records are inserted before copying them to the history table.
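A minimal T-SQL sketch of the set-based update step follows, using the PersonContact Production data area table from this scenario and a hypothetical history table; the column and key names are assumptions chosen to match the pattern, not the framework's generated SQL.

-- Apply the changed values captured by this run's data flow
DECLARE @LineageId int = 1589;

UPDATE pc
SET pc.FirstName    = h.FirstName,
    pc.LastName     = h.LastName,
    pc.EmailAddress = h.EmailAddress
FROM MDM.PersonContact AS pc
INNER JOIN MDM.PersonContactHistory AS h
    ON h.BusinessEntityID = pc.BusinessEntityID   -- natural key (assumed)
WHERE h.LineageId = @LineageId;                   -- only rows from this execution instance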

Update Pattern ETL Framework
Figure 3-60 shows the update data flow with ETL framework instrumentation transformations. These transformations:
- Initialize supporting columns: The Init_EtlFwk derived column transformations initialize the Record Status, Lineage Id, and Version Id supporting columns documented earlier in this chapter.
- Save record counts: The ReadCount, IgnoreCount, InsertCount, UpdateCount, and Row Count transformations initialize SSIS variables with record counts. These in turn are inserted into ETL framework logging tables by a post-processing task.
- Net changes: The GetMaxDate aggregate transformation calculates the maximum value of the Modified On date. The InitOutFilterMaxValue transformation stores this value in an SSIS variable, which is then inserted into an ETL framework table. It is used to retrieve only the changed records the next time this data flow runs.


Figure 3-60: SSIS update data flow pattern with ETL instrumentation

The Benefit of This Data Flow
Your first observation might be that this data flow looks a lot more complicated than the original one. That might be the case because most frameworks require extra up-front time in the development phase. However, this is time well spent when you can avoid the daily costs of reconciling data. The benefits of adding this additional logic are as follows:
- Supporting columns make the auditing and data stewardship tasks much easier. (See the discussion above for the benefits of adding execution lineage.)


- Row counts provide good indicators of the health of a particular data flow. A data steward can use these record counts to detect anomalies in processing and data patterns for an execution instance. Figure 3-61 shows an example of the Batch Summary Report with record counts.


Figure 3-61: Batch Summary Report with record counts

ETL operations staff and data stewards can use the summary record counts to understand more about the activity within a particular batch instance. Further investigation, including drilling into the detail reports, can then occur if any of the record counts look suspicious, either by themselves or in relation to other batch instances.

Versioned Insert Pattern
A table with versioned inserts is populated using a versioned insert data flow pattern, as shown in Figure 3-62. Note that the logic flow has been revised for efficiency.

Figure 3-62: Versioned insert data flow pattern

Note that with the versioned insert data flow:
- All new/changed records go into the versioned insert table.
- This is a simpler data flow than the update data flow pattern.

Figure 3-63 shows an SSIS versioned insert data flow.


Figure 3-63: SSIS versioned insert data flow pattern

Some notes about this SSIS versioned insert data flow:
- The lkpDimGeography Lookup transform has its no-match handling set to ignore failures rather than redirect rows.
- The DidRecordChange conditional split task checks whether the destination's primary key is not null and whether the source and destination columns are the same. The record is ignored if this expression evaluates to true.
- The net change logic is not included; in this case, there's no Modified Date column in the source.
- The ETL framework supporting column and record count transformations are similar to the previous data flow.

Update vs. Versioned Insert
The versioned insert pattern is simpler to implement and has less I/O activity than the update pattern. So why bother with the update pattern at all?


The answer is that there are fewer records in the update table. Fewer records translate into fewer I/Os, which means better performance.

OK, so why do we need a history table? Couldn't we truncate this table after it's used to update the primary table? The answer is that the history table allows data stewards and data auditors to track changes over time, which can be invaluable when business consumers question results from the data warehouse.

Surrogate Keys
Surrogate keys play an integral part in every data warehouse, and determining how surrogate keys are generated is an important aspect of any data integration effort. The options for key generation when using SQL Server are well known, and each option has its pros and cons:
- GUIDs: These are 16-byte keys that are guaranteed to be unique across all systems.
  o GUIDs are not recommended for surrogate keys.
  o Their size (16 bytes) is at least two times larger than big integers and four times larger than integers.
- IDENTITY: This property generates the next unique value for a column and can only be associated with one column in a table.
  o Although identity columns are a solid option for key generation, the key value is not determined until after an insert occurs.
  o This means the ETL developer cannot use this value within the SSIS data flow responsible for the inserts, making implementation of patterns such as late-arriving dimensions difficult.
- ETL key generation: The ETL process itself is responsible for generating the surrogate keys. The steps within this pattern are:
  1. Obtain the base value for the surrogate key (i.e., the starting point).
  2. Calculate the new surrogate key.
  3. Optional: Store the last surrogate key in a table for future reference.

The following snippet shows code within a data flow Script Component that calculates surrogate keys:
Dim gNextKey As Integer = 0

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    '
    ' Init the next key with the Max key value the first time around
    ' Increment by 1 to create the next unique key
    '
    If gNextKey = 0 Then
        gNextKey = Row.MaxId
    End If
    gNextKey = gNextKey + 1
    ' gNextKey now holds the new surrogate key and would be assigned to the output key column
End Sub
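Step 1 of the pattern, obtaining the base value, can be done with a simple query whose result is supplied to the data flow as the MaxId column used above. The following T-SQL is a minimal sketch against the AdventureWorksDW2008 DimCustomer table.

-- Starting point for ETL-generated surrogate keys (0 when the dimension is empty)
SELECT ISNULL(MAX(CustomerKey), 0) AS MaxId
FROM dbo.DimCustomer;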


Key Lookups
Every ETL operation involves joining data from a source to a destination at some level. This could include:
- Dimension surrogate key lookups
- Code-to-value lookups
- Primary key lookups across systems
- Master data management key lookups

Using a relational database SQL join is not always an option, nor is it always efficient, for the following reasons:
- When the source data and lookup table are in different environments, you would need to land the source data in the same RDBMS before performing the join.
- Flat files are a common source for data warehouse ETL operations and often require looking up codes or keys, which can't be done at the file layer.
- Even if data is stored in the same type of RDBMS, such as SQL Server, but on different servers, a cross-server join has major performance implications, especially with large data sets.
- The Data-in or Production environment is usually not on the same server as the Consumption database or databases, so you would need to add another working area to accomplish an efficient join.

Of course, using a SQL join should not be taken off the table. You should do a healthy evaluation of all options to select the appropriate solution. SSIS has three primary mechanisms to perform key lookups, each with benefits:
- Lookup transformation
- Merge Join transformation
- Fuzzy Lookup transformation

Let's look briefly at each of these transformations.

Lookup Transformation
SSIS's Lookup transformation is the most common and easiest-to-use solution. It works by matching the source rows in the data flow to a table or view that has been defined in the transformation. Usually, the Lookup is configured to load the entire table into the SSIS memory space, which alleviates the need for a join to the database and is very efficient. Figure 3-64 shows a data flow that uses lookups to acquire keys. Note that even though the lookups are strung together in the data flow, the SSIS engine performs the lookups simultaneously.


Figure 3-64: SSIS Lookup transformation

Figure 3-65 highlights the product lookup column mapping that acquires the data warehouse product key by matching across the source key.


Figure 3-65: SSIS Lookup column matching

Here are some considerations about when to use the Lookup transformation:
- The Lookup works well when the entire lookup reference table can fit into memory. If your lookup reference table has several million rows or your ETL server is limited in memory, you will run into problems and should consider another route. As a rule of thumb, a 1-million-record table that is about 50-100 bytes wide will consume about 100 MB of memory. A large 64-bit server with a lot of memory available for ETL can handle large lookup tables.
- The Lookup can be configured without cache or with partial cache. No cache means that every row will run a query against the RDBMS; do not use this approach if you want a scalable solution. Partial cache means the cache gets loaded as rows are queried and matched against the source system.


- You can share the Lookup cache across Lookup transformations. This is a valuable capability if you need to use the same lookup table multiple times during the same ETL.
- When the lookup does not find a match, you can either fail the process or ignore the failure and return a NULL to the output.


Merge Transformation
A second solution for looking up keys is to use the Merge Join transformation. Merge Join does a lot more than just a lookup because you can perform different join types across data coming from any source. The Merge Join transformation requires that the data be sorted in the order of the keys that are joined. You can use a Sort transformation for this, or if the source is already sorted (physically or through an ORDER BY), you can configure the source adapter as pre-sorted. Figure 3-66 shows an example of a Merge Join lookup.

Figure 3-66: Merge Join data flow

In this example, one of the sources is already pre-sorted, and the second one uses a Sort transformation. The Merge Join brings the data together, in this case joining transactions from a flat file to a customer list that also comes from a flat file. Figure 3-67 shows the Merge Join Transformation Editor for this example.


Figure 3-67: Merge Join Transformation Editor

Here, an inner join is used; however, left outer and full outer joins are also available (as well as a right outer join). The sources are joined across the keys, and the data sets are merged. The following are key considerations for when to use the Merge Join transformation:
- The Merge Join adds a level of complexity over and above the Lookup transformation. So if your reference table will easily fit in memory, use the Lookup; otherwise, the Merge Join can be effective.
- The Merge Join allows the data to be joined even when there is more than one match per key. If you need each merge input to be matched with zero, one, or more records in the other input, the Merge Join will do that. The Lookup always returns only one row per match, even if there are duplicate records with the same key.

Fuzzy Lookup Transformation
The third approach to looking up data for data warehouse ETL is the Fuzzy Lookup transformation. Very similar to the Lookup transformation, the Fuzzy Lookup joins to a reference table not on exact matches but on possible matches, where the data may not be identical but is similar. Note that the Fuzzy Lookup requires the lookup reference table to be in SQL Server. It also has significant processing overhead, but it is valuable when dealing with bad data.

Change Detection
We briefly discussed change detection earlier, but the following examples provide more detail. Remember that change detection involves:
- Detecting changes in existing data
- Identifying new or deleted records


Some systems track changes through audit trails, others append timestamps to records (such as CreationDate or ModifiedDate), and still other systems don't have any identifiers.

Working with Change Identifiers
When working with a system that tracks changes, you can use SSIS to easily identify the changed records and process them. For example, if you have a system with CreationDatetime and LastModifiedDatetime timestamps appended to the rows and you need to process inserts and updates, the following steps are needed:
1. Merge the creation and last-modified dates into one value, for example: ISNULL(LastModifiedDatetime, CreationDatetime). Note that this value is referred to as LastModifiedDatetime within this example.
2. Identify the maximum value of the last modified date.
3. Extract and process the changed records: ISNULL(LastModifiedDatetime, CreationDatetime) > LastModifiedDatetime
4. Save the maximum value of the above ISNULL comparison.

Steps one and four imply that the LastModifiedDatetime is stored. Assuming this value is captured in a SQL Server table, SSIS can extract the table data and store the results in a package variable. Figure 3-68 shows an Execute SQL task configured to receive the results of a query into a package variable.


Figure 3-68: Execute SQL task and variables

Not shown in Figure 3-68 is the Result Set tab, which maps the data from the query to the variable. An alternative approach is to use a parameterized stored procedure and map input and output parameters to variables using the Parameter Mapping tab in the editor. The second step is to extract the targeted data from the source based on the LastModifiedDatetime. You can use either a parameterized query or a variable that contains the SQL statement. Figure 3-69 shows a source adapter with a parameterized query.


Figure 3-69: Parameterized query

The alternative approach is to set the Data Access Mode to SQL Command from Variable and build the SQL query before the data flow task. This approach works well for non-SQL sources that do not support parameterized queries. The final step is to save the MAX of the created or modified date to a table. You can do this with a MAX SQL query, or if you are using the data flow, you can use the Max aggregate in the Aggregate transformation. Your situation may require slightly different steps, but this process can be adjusted to accommodate other scenarios where you have some way to identify changes.

Note that Figure 3-69 shows a parameterized query for an OLE DB source. Working with parameters can differ across data providers (e.g., OLE DB, ODBC, ADO, and ADO.NET). The following link contains more information on this topic: http://technet.microsoft.com/en-us/library/cc280502.aspx.
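Pulled together in T-SQL, the watermark pattern looks roughly like the following sketch. The dbo.SourceOrders source table and the ETL.LoadWatermark table are illustrative assumptions; the source columns follow the CreationDatetime/LastModifiedDatetime example above.

-- 1. Retrieve the watermark saved by the previous run (assumed watermark table)
DECLARE @LastRun datetime;
SELECT @LastRun = LastValue
FROM ETL.LoadWatermark
WHERE TableName = 'dbo.SourceOrders';

-- 2. Extract only the rows created or modified since the last run
SELECT OrderID, CustomerID, OrderAmount,
       ISNULL(LastModifiedDatetime, CreationDatetime) AS LastModifiedDatetime
FROM dbo.SourceOrders
WHERE ISNULL(LastModifiedDatetime, CreationDatetime) > @LastRun;

-- 3. Save the new maximum value for the next run
UPDATE ETL.LoadWatermark
SET LastValue = (SELECT MAX(ISNULL(LastModifiedDatetime, CreationDatetime)) FROM dbo.SourceOrders)
WHERE TableName = 'dbo.SourceOrders';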

Change Identification through Data Comparison
If you have a system where you are not able to identify changes through flags, dates, or audit trails, you will need to handle change detection by comparing the data. There are several approaches to doing this; the most common include:
- SQL join comparison
- SSIS Conditional Split
- SQL checksums
- Slowly Changing Dimension transformation


SQL Join Comparison
The first solution is to compare the data by joining the source data to the destination data across the keys and then comparing the attributes in the WHERE clause. Typically, the SQL pattern to apply the changes looks like this:

UPDATE [Destination Table]
SET [Destination Attribute Columns] = [Source Attribute Columns]
FROM [Destination Table]
INNER JOIN [Source Table] ON [Destination Keys] = [Source Keys]
WHERE [Destination Attribute Columns] <> [Source Attribute Columns]

If you are also identifying new records, you would perform a LEFT OUTER JOIN and test for keys that are NULL, as follows:

SELECT [Source Columns]
FROM [Source Table]
LEFT OUTER JOIN [Destination Table] ON [Source Keys] = [Destination Keys]
WHERE [Destination Key] IS NULL

The drawbacks to this approach include the need to stage the source data to a table in the same SQL instance and the processing overhead of performing the joins.

Conditional Split
Another solution is to use an SSIS Conditional Split to compare the data from the source and destination. In the next section on dimension patterns, we look at an example where a Merge Join brings the data together and then a Conditional Split compares the results.


The Conditional Split transformation uses the SSIS expression language to evaluate a Boolean condition for every row. The following SSIS expression compares each source column with its corresponding destination column:
GeographyKey != DW_GeographyKey || Title != DW_Title || FirstName != DW_FirstName ||
MiddleName != DW_MiddleName || LastName != DW_LastName || BirthDate != DW_BirthDate ||
MaritalStatus != DW_MaritalStatus || Suffix != DW_Suffix || Gender != DW_Gender ||
EmailAddress != DW_EmailAddress || TotalChildren != DW_TotalChildren ||
NumberChildrenAtHome != DW_NumberChildrenAtHome || HouseOwnerFlag != DW_HouseOwnerFlag ||
NumberCarsOwned != DW_NumberCarsOwned || AddressLine1 != DW_AddressLine1 ||
AddressLine2 != DW_AddressLine2 || Phone != DW_Phone ||
DateFirstPurchase != DW_DateFirstPurchase || CommuteDistance != DW_CommuteDistance


Notice how the above expression assumes there are no NULL values in the source or destination fields. The logic required when NULLs exist in the source and/or destination would be more complex; this is one argument for eliminating NULLs within your ETL data flow.

SQL Checksums
You can also use checksums (a computed hash of the binary data) across all the source and destination columns that need to be evaluated for change. In your destination table, you can generate this checksum by using the T-SQL CHECKSUM operator. To compare the destination to the source data, you need to perform the same CHECKSUM against the source from the Production area environment. However, here are a couple of cautions when using checksums:
- Checksums and other hash algorithms are not guaranteed to be 100% accurate.
- You need to be absolutely sure that the data types and column order are the same on both the source and destination columns.
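The following T-SQL sketch shows the comparison; the staged source table dbo.stgCustomer and the column list are illustrative assumptions.

-- Return business keys whose attribute checksums differ between the staged source and the dimension
SELECT s.CustomerAlternateKey
FROM dbo.stgCustomer AS s
INNER JOIN dbo.DimCustomer AS d
    ON d.CustomerAlternateKey = s.CustomerAlternateKey
WHERE CHECKSUM(s.FirstName, s.LastName, s.EmailAddress, s.Phone)
   <> CHECKSUM(d.FirstName, d.LastName, d.EmailAddress, d.Phone);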

Slowly Changing Dimension Transformation
The final approach is to use the SSIS Slowly Changing Dimension (SCD) transformation. This built-in transformation allows comparison of source data with destination data to identify new records, changes that require an update, changes that require an insert (for preserving history), and updates that handle inferred members (when the destination is merely a placeholder key). The SCD transformation uses the paradigm of the Ralph Kimball dimension change types, but its use goes beyond just dimension processing. Although the SCD component is very useful for smaller tables (a few thousand rows or fewer), it does not scale well for table comparisons involving hundreds of thousands or millions of rows. It incurs a lot of overhead by making a call to the database for every row it compares, every row it updates, and every row it inserts.


De-duping
Duplication is often an issue, especially when dealing with merged sources. In fact, duplication can occur in any system, whether from an operator error or from a customer signing up again on a site because they forgot their login. SSIS has two built-in approaches for standard de-duplication: the Sort transformation and the Fuzzy Grouping transformation. Of course, more complicated de-duplication can be achieved through a combination of features.

Sort Transformation with De-duplication
The Sort transformation has the ability to remove duplicates across the selected sort columns. Figure 3-70 shows the Sort Transformation Editor with the Remove rows with duplicate sort values option selected.


Figure 3-70: Sort Transformation Editor

The de-duplication feature of the Sort transformation requires that the key values match exactly. Notice that you can also pass through the other columns that aren't involved in the de-duplication. A simple GROUP BY or DISTINCT SQL statement can do a similar operation, but if you are dealing with a file or your set of distinct columns is a subset of the column list, the Sort transformation can be a valuable option.

Fuzzy Grouping Transformation
The Fuzzy Grouping transformation is effective at de-duplication when the data set you are working with requires identifying duplicates across similar, but not exactly matching, data. Using the Fuzzy Grouping transformation is like using any other transformation: You connect it to the data flow data set and then configure it. Figure 3-71 shows the Fuzzy Grouping Transformation Editor configured against the same data as the previous Sort transformation.


Figure 3-71: Fuzzy Grouping Transformation Editor

In this example, the State column is set to Exact match, which means that the State has to be identical for the engine to identify more than one record as a duplicate. The other columns have similarity thresholds set as needed. Although not shown, the Advanced tab has an overall Similarity Threshold, which applies to all the columns defined in the column list. A word of caution: Fuzzy Grouping leverages the SQL Server Database Engine for some of its work, and while it has powerful applications for dealing with data quality issues, it also has high overhead, meaning that you should perform volume testing as you plan the ETL solution.


Dimension Patterns
We've discussed many ways of loading and updating dimensions throughout this toolkit, such as how to deal with updates, record versioning, and generating keys. Updating dimensions involves:
- Tracking history
- Making updates
- Identifying new records
- Managing surrogate keys


If you are dealing with a smaller dimension (on the order of thousands of rows or fewer, as opposed to hundreds of thousands or more), you can consider using the built-in dimension processing transformation in SSIS, the Slowly Changing Dimension transformation. However, because this transformation has several performance-limiting characteristics, it is often more efficient to build your own process.

The process of loading dimension tables is really about comparing data between a source and a destination. You are typically comparing a new version of a table, or a new set of rows, with the equivalent set of data from an existing table. After identifying how the data has changed, you can then perform a series of inserts or updates. Figure 3-72 shows a quick dimension processing example, with the following general steps:
1. The top-left source adapter pulls records into SSIS from a source system (or intermediate system). The top-right source adapter pulls data from the dimension table itself.
2. The Merge Join compares the records based on the source key (see Figure 3-73 for the transformation details).
3. A Conditional Split evaluates the data, and rows are either inserted directly into the dimension table (bottom-left destination adapter) or inserted into a Production data area table (bottom-right destination) for an update process.
4. The final step (not shown) is a set-based update between the Production data area table and the dimension table.


Figure 3-72: Dimension processing

The Merge Join performs the correlation between the source records and the dimension records by joining across the business or source key (in this case, CustomerAlternateKey). Figure 3-73 shows the setup in the Merge Join Transformation Editor. When you use this approach, be sure to set the join type to left outer join, which will let you identify new records from the source that are not yet present in the dimension table.


Figure 3-73: Using Merge Join for dimension processing

The last step is to compare the data to determine whether a record is new, changed, or unaffected. Figure 3-74 shows the Conditional Split transformation, which does this evaluation. (The Conditional Split uses the expression displayed earlier in the Change Detection section.)


Figure 3-74: Conditional Split transformation

The Conditional Split redirects the records either directly to the dimension table through a destination adapter or to a working update table using a destination adapter, followed by a set-based UPDATE statement. The UPDATE statement for the set-based approach joins the working table directly to the dimension table and performs a bulk update, as follows:
UPDATE DimCustomer
SET AddressLine1 = stgDimCustomerUpdates.AddressLine1
  , AddressLine2 = stgDimCustomerUpdates.AddressLine2
  , BirthDate = stgDimCustomerUpdates.BirthDate
  , CommuteDistance = stgDimCustomerUpdates.CommuteDistance
  , DateFirstPurchase = stgDimCustomerUpdates.DateFirstPurchase
  , EmailAddress = stgDimCustomerUpdates.EmailAddress
  , EnglishEducation = stgDimCustomerUpdates.EnglishEducation
  , EnglishOccupation = stgDimCustomerUpdates.EnglishOccupation
  , FirstName = stgDimCustomerUpdates.FirstName
  , Gender = stgDimCustomerUpdates.Gender
  , GeographyKey = stgDimCustomerUpdates.GeographyKey
  , HouseOwnerFlag = stgDimCustomerUpdates.HouseOwnerFlag
  , LastName = stgDimCustomerUpdates.LastName
  , MaritalStatus = stgDimCustomerUpdates.MaritalStatus
  , MiddleName = stgDimCustomerUpdates.MiddleName
  , NumberCarsOwned = stgDimCustomerUpdates.NumberCarsOwned
  , NumberChildrenAtHome = stgDimCustomerUpdates.NumberChildrenAtHome
  , Phone = stgDimCustomerUpdates.Phone
  , Suffix = stgDimCustomerUpdates.Suffix
  , Title = stgDimCustomerUpdates.Title
  , TotalChildren = stgDimCustomerUpdates.TotalChildren
FROM AdventureWorksDW2008.dbo.DimCustomer DimCustomer
INNER JOIN dbo.stgDimCustomerUpdates
    ON DimCustomer.CustomerAlternateKey = stgDimCustomerUpdates.CustomerAlternateKey


Fact Table Patterns
Fact tables have a few unique processing requirements. First, you need to acquire the surrogate dimension keys and possibly calculate measures. These tasks can be handled through Lookup, Merge Join, and Derived Column transformations. The more difficult work is dealing with updates, record differentials, or snapshot table requirements.

Inserts
Most fact tables involve inserts; it is the most common fact table pattern. Some fact tables have only inserts, which makes the ETL process perhaps the most straightforward. Inserts also involve bulk loading of data, index management, and partition management as necessary. We'll talk more about how to best handle inserts in the Destination Optimization section later in this chapter.

Updates
Updates to fact tables are typically handled in one of three ways:
- Through a change or update to the record
- Via an insert of a compensating transaction
- Using a SQL MERGE

In the case where changing fact records are less frequent or the update process is manageable, the easiest approach is to perform an UPDATE statement against the fact table. (See the earlier section on Change Detection for ways to identify changes.) The most important point to remember when dealing with updates is to use a set-based update approach, as we showed in the Dimension Patterns section.

The second approach is to insert a compensating or net change record rather than performing an update. This strategy simply inserts the net change between the source and the destination fact table as a new record.

For example, Table 3-13 shows the current source value, Table 3-14 shows what exists in the fact table, and Table 3-15 shows the fact table after the compensating record is inserted.

Source ID    Measure Value
12345        80
Table 3-13: Source data

Source ID    Measure Value
12345        100
Table 3-14: Current fact table data

Source ID    Measure Value
12345        100
12345        -20
Table 3-15: New fact table data

The last approach is to use a SQL MERGE statement, where you land all the new or changed fact data in a working table and then use the merge to compare and either insert or update the data. The following example code shows that when the merge does not find a match, it inserts a new row; when it finds a match, it performs an update:
MERGE dbo.FactSalesQuota AS T
USING SSIS_PDS.dbo.stgFactSalesQuota AS S
   ON T.EmployeeKey = S.EmployeeKey
  AND T.DateKey = S.DateKey
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EmployeeKey, DateKey, CalendarYear, CalendarQuarter, SalesAmountQuota)
    VALUES (S.EmployeeKey, S.DateKey, S.CalendarYear, S.CalendarQuarter, S.SalesAmountQuota)
WHEN MATCHED AND T.SalesAmountQuota != S.SalesAmountQuota THEN
    UPDATE SET T.SalesAmountQuota = S.SalesAmountQuota
;


The drawback to the MERGE approach is performance. Although it simplifies the insert and update process, it also performs row-based operations (one row at a time). In situations where you are dealing with a large amount of data, you are often better off with a bulk-load insert and set-based updates.

Snapshot Fact Table
A snapshot fact table captures measure balances on a recurring basis, such as weekly or monthly inventory levels or daily account balances. Instead of capturing the transactions, the snapshot fact table summarizes the balance; in other words, there is one record per product or account per period. Take the example of inventory captured once a week. Table 3-16 shows what the snapshot fact table would look like.

Product SK    Week SK               Quantity In Stock
1             1001 (2011 Week 1)    10
2             1001 (2011 Week 1)    8
3             1001 (2011 Week 1)    57
4             1001 (2011 Week 1)    26
5             1001 (2011 Week 1)    0
1             1002 (2011 Week 2)    5
2             1002 (2011 Week 2)    6
3             1002 (2011 Week 2)    40
4             1002 (2011 Week 2)    21
5             1002 (2011 Week 2)    15
Table 3-16: Weekly Product Inventory fact table

This example shows each product's inventory quantities for two weeks. If you were to ask what the total inventory is for Product SK = 1, the answer would not be 15, because inventory quantities are not additive across time periods. Rather, it is the average, the first, or the last value, depending on the analysis. The ETL for snapshot fact tables is commonly handled through one of two approaches:
- If the source has the current data but no history, you simply perform an insert every time you reach the first day of the new time interval, using the new interval's time key (Week SK in the example above).
- If the source is just tracking changes to levels, you apply the updates to the latest inventory levels, and then on the first day of the new time interval you use the prior interval's current data as the basis for the new period (a minimal T-SQL sketch of this second approach follows).
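The following is a hedged sketch only; the fact table name, key columns, and the way the new Week SK value is determined are assumptions for illustration, not a prescribed design from the toolkit.

-- Hypothetical sketch: seed the new snapshot period from the prior period's current levels.
DECLARE @PriorWeekSK int = 1001;  -- last completed interval (assumed)
DECLARE @NewWeekSK   int = 1002;  -- first day of the new interval (assumed)

INSERT dbo.FactProductInventory (ProductSK, WeekSK, QuantityInStock)
SELECT ProductSK, @NewWeekSK, QuantityInStock
FROM   dbo.FactProductInventory
WHERE  WeekSK = @PriorWeekSK;

During the rest of the interval, the nightly process then applies set-based updates to the rows carrying @NewWeekSK.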


The difference is that one process creates the snapshot from the source data, and the other uses the current fact table data as the snapshot. When creating a new snapshot from the existing fact table, you can consider using a SELECT INTO operation. If you are creating a new partition in a fact table (a SQL partition), create a table with the new data and then switch that table into the partitioned table.

Managing Inferred Members
An inferred member occurs when a dimension member is missing at the time you load the fact table. To handle this, you add a placeholder record to the dimension during processing. You have three ways to manage inferred members:
- Scan the staged fact records before inserting them, create any inferred dimension members at that time, and then load the fact records.
- During the fact load, send any records with missing members to a temporary table, add the missing dimension records, and then reload those fact records into the fact table.
- In the data flow, when a missing member is identified, add the record to the dimension at that time, get the surrogate key back, and then load the fact record (a sketch of a procedure that supports this option follows).
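As a hedged illustration of the third option, the stored procedure called from the data flow might look something like the following sketch. The table, column, procedure names, and placeholder values are assumptions for illustration; the toolkit does not prescribe this exact implementation.

-- Hypothetical procedure: insert an inferred (placeholder) dimension member if it
-- does not already exist, and return its surrogate key to the data flow.
CREATE PROCEDURE dbo.GetOrAddInferredProduct
    @ProductAlternateKey nvarchar(25)
AS
BEGIN
    SET NOCOUNT ON;

    IF NOT EXISTS (SELECT 1 FROM dbo.DimProduct
                   WHERE ProductAlternateKey = @ProductAlternateKey)
    BEGIN
        -- Placeholder attributes; real values arrive later during dimension processing.
        INSERT dbo.DimProduct (ProductAlternateKey, EnglishProductName)
        VALUES (@ProductAlternateKey, N'Unknown (inferred)');
    END;

    SELECT ProductKey
    FROM   dbo.DimProduct
    WHERE  ProductAlternateKey = @ProductAlternateKey;
END;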


Figure 3-75 shows the third option.


Figure 3-75: Inferred member process

In this process, the record with the missing product key is sent to an OLE DB Command transformation, where a procedure is called to add the record to the dimension table (after checking that it didn't already exist). The non-cached Lookup transformation then gets the surrogate key back, and the data is brought back into the data flow with a Union All transformation. An alternative approach is to use a Script Component to handle the inferred member insert. If you can manage the surrogate key generation in the package, you can optimize the process. As an example of this kind of scripting, see the data flow scripting example in the Power of Data Flow Scripting section later in this chapter.

Data Exception Patterns
As we noted earlier in this chapter, processing data exceptions can be difficult if they are not identified early in the ETL operations. Here's a common scenario: You created a nightly process that loads data into a Production data area table from several sources, and you wrote a SQL INSERT statement that includes some data conversions for loading data into a reporting table. The data contains a few hundred million rows and typically takes 45 minutes to run (indexes are not dropped and re-created). Early one morning, you get a call from the off-shore ops department that the process failed. It ran for 40 minutes, hit a failure, and then took 60 minutes to roll back. Now you have only 30 minutes before users start to run reports against the table.


Sound familiar? The challenge is that a single data exception can cause a lot of lost time. The solution is to deal with data quality, exceptions, and conversions early in the data warehouse data integration process and to leverage the data flow error-row handling in SSIS.

Data Flow-based Pattern
Data flow exceptions can be handled for sources, transformations, and destinations and may be caused by a data conversion error or truncation. The most common data exceptions happen when importing data from a text file, given the loose nature of text file data types and the strong typing in SSIS. The best practice is to redirect failure rows to an exception table for review. Figure 3-76 shows the error row output of the Flat File Source adapter.


Figure 3-76: Source adapter error row output

Because there was a conversion error in one of the input columns, all the input columns are converted to a text string and passed to the error output (which allows the data to be reviewed), along with the error code and description. Note that the error table can contain either all of the source data or just the key values.

Now let's look at a destination error. Figure 3-77 shows a data flow where the error rows during the insert are sent to a temporary table for manual error review.


Figure 3-77: Data flow error-row output

You have several options for handling errors in SSIS data flow components (sources, transformations, and destinations):
- Failing the package
- Redirecting the error rows (as in the above example)
- Ignoring the issue; in this case, a NULL is used in place of the value that caused the exception

Figure 3-78 shows how you configure error row handling.

Figure 3-78: Error row configuration

The following are considerations when dealing with error rows:
- Use text columns when redirecting error rows to an error table; otherwise, the conversion error or exception will simply recur in the error table (a sketch of such a table follows this list).
- When redirecting error rows from a destination with Fast Load enabled, the entire commit batch is redirected, not just the failing row. This is fine as long as you then insert the rows from that batch one at a time into the destination table and do a second redirect of the genuinely bad rows to a temporary table.
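A minimal sketch of such an error-row table is shown below; the table and column names are assumptions for illustration, not a prescribed design.

-- Hypothetical error-row table: columns are declared as text so that the conversion
-- or truncation that caused the redirect cannot fail again on insert.
CREATE TABLE dbo.stgCustomerErrorRows
(
    CustomerAlternateKey  nvarchar(50)    NULL,
    NumberChildrenAtHome  nvarchar(50)    NULL,   -- kept as text, not tinyint
    ErrorCode             int             NULL,   -- from the SSIS error output
    ErrorColumn           int             NULL,   -- from the SSIS error output
    ErrorDescription      nvarchar(4000)  NULL,
    LoadDate              datetime        NOT NULL DEFAULT (GETDATE())
);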


Data Flow-based Exception Handling
Both of the above scenarios take a reactive approach: they log data exceptions when they are encountered. The objective for a well-run ETL development shop is to convert this reactive approach into a proactive one. A proactive approach adds transformation logic to the data flow, or to the source SQL statement, to correct the problem inline or to flag that particular column as a data exception. This is not always possible, because some data is critical to business consumers and cannot be propagated to the destination if it is in error. Other data exceptions are less critical, however, and in those cases the data can flow to the destination as a NULL value, as a code indicating that the data was in error, or as a transformed version of the offending value.

Consider the AdventureWorksDW2008 DimCustomer table's NumberChildrenAtHome column, which is a tinyint. A source value that falls outside the tinyint range of 0 through 255, such as -129, will cause a data overflow error when loaded directly. Other values can also be considered data exceptions for business reasons: a negative value such as -1, an implausibly large value such as 35, or a NumberChildrenAtHome value that exceeds TotalChildren. Adding the following logic to the source SQL statement applies these rules (an ELSE branch is included so that valid values pass through unchanged):
CASE
    WHEN NumberChildrenAtHome < 0 THEN NULL
    WHEN NumberChildrenAtHome > 25 THEN NULL
    WHEN NumberChildrenAtHome > TotalChildren THEN NULL
    ELSE NumberChildrenAtHome    -- valid values pass through unchanged
END AS RepairedNumberChildrenAtHome

The data flow logic can now store the RepairedNumberChildrenAtHome value in the destination. In addition, it can flow the record (using a Multicast transformation) to an exception table when RepairedNumberChildrenAtHome is NULL and NumberChildrenAtHome is NOT NULL. There are many ways to implement these business rules within SSIS; the point is that proactive rather than reactive data exception handling reduces the time data stewards must spend tracking down and repairing exceptions in the source data. This in turn lowers TCO and reduces the risk that data exception analysis affects data warehouse availability for business consumers.

Set-based Exception Handling


Another approach is to pre-process the source data using set-based SQL logic that detects and logs exceptions to an exception table before the data enters the data flow. For the NumberChildrenAtHome example we just looked at, the following SQL pseudo-code would detect and log the data exceptions:

INSERT ExceptionTable (Column List)
SELECT (Column List)
FROM Source.Customer
WHERE (NumberChildrenAtHome < 0
    OR NumberChildrenAtHome > 25
    OR NumberChildrenAtHome > TotalChildren)

This is not a particularly scalable approach, because each data exception check is implemented in a separate set-based SQL statement. However, it lends itself nicely to a metadata-driven solution: a SQL code generator that reads from a metadata table containing the business rules. Table 3-17 shows an example of a metadata table used to drive a SQL data exception code generator.

Id   Database             Table           Rule                                                     Description
1    AdventureWorks2008   Person.Person   NumberChildrenAtHome < 0 OR NumberChildrenAtHome > 25    Incorrect value for the NumberChildrenAtHome column
2    AdventureWorks2008   Person.Person   NumberChildrenAtHome > TotalChildren                     NumberChildrenAtHome is greater than TotalChildren
Table 3-17: Business rules for a set-based data exception implementation

Note that the goal of this example is to demonstrate the concept; the AdventureWorks2008 Person table actually stores the NumberChildrenAtHome value within an XML data type.
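As a hedged sketch of what such a generator could look like (the metadata and exception table names, their columns, and the key-column handling are all assumptions for illustration):

-- Hypothetical generator: build one INSERT...SELECT per business rule and execute it.
-- Assumes a metadata table dbo.DataExceptionRule(Id, DatabaseName, TableName, KeyColumn, Rule)
-- and a logging table dbo.DataException(RuleId, SourceTable, SourceKey, DetectedOn);
-- all of these object and column names are illustrative only.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql + N'
INSERT dbo.DataException (RuleId, SourceTable, SourceKey, DetectedOn)
SELECT ' + CAST(Id AS nvarchar(10)) + N', '''
         + DatabaseName + N'.' + TableName + N''', ' + KeyColumn + N', GETDATE()
FROM '   + DatabaseName + N'.' + TableName + N'
WHERE '  + Rule + N';'
FROM dbo.DataExceptionRule;

EXEC sys.sp_executesql @sql;

Generating the checks from metadata keeps the rules in one place, so data stewards can add or adjust rules without changing package code.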


SSIS Best Practices


The previous section contained best practices for common patterns. This section focuses on SSIS best practices for specific technical scenarios.

The Power of Data Flow Scripting
With the variety of systems, customizations, and integration requirements out there, chances are you will run into a situation that is not easily solved by either set-based SQL or an out-of-the-box SSIS solution. SSIS scripting in the data flow is an alternative strategy that in many cases can be the most effective solution to a problem. As a simple example, take the trending of product inventory over time, where the full history of inventory needs to be processed nightly but the source system tracks inventory only through differences (diffs). Table 3-18 shows the source.

Week    Product    Stock Difference
48      A          10
50      A          3
51      A          5
52      A          -1
50      B          5
52      B          -4
49      C          1
50      C          5
51      C          -6


Table 3-18: Inventory differences

If the goal is to fill in missing weeks where there was no stock difference and to track the total stock for each product on a weekly basis, this can be a challenging problem, especially when you are dealing with approximately 500 million records that need to be updated nightly. Table 3-19 shows the desired output.

Week    Product    Stock Difference    Week Stock Level
48      A          10                  10
49      A                              10
50      A          3                   13
51      A          5                   18
52      A          -1                  17
50      B          5                   5
51      B                              5
52      B          -4                  1
49      C          1                   1
50      C          5                   6
51      C          -6                  0
52      C                              0


Table 3-19: Inventory stock week levels

If you were thinking about how to accomplish this in SQL at this volume of data, you are not going to find good options. The problem is not set-based, because you have to process the data sequentially to calculate the weekly levels, and you also have to identify the missing weeks where there were no changes in stock. This is where an SSIS Script Component can be very effective. Because of the pipeline engine and its efficient use of memory and processor, you can achieve the above solution relatively easily. Figure 3-79 shows a data flow with a flat file source adapter where the data is sourced, followed by a script transformation and a destination. It also highlights two data viewers, showing the source data and the script output, to illustrate what the script is doing.


Figure 3-79: Script Component results

The Script Component is configured to create a new output with the four columns shown (this is called an asynchronous script). For each row, the script compares the previous row's product and week to determine whether the stock level needs to be reset or whether records must be added for missing weeks. Here's the code used in the component:
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private String myLastProduct = "";
    private int myLastWeek = 0;
    private int myStockLevelCalc;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        /* check to see if the product is the same but there has been a skipped week */
        if (Row.Product == myLastProduct & Row.Week > myLastWeek + 1)
        {
            while (Row.Week > myLastWeek + 1)
            {
                myLastWeek = myLastWeek + 1;

                /* perform the insert for the skipped week */
                StockLevelOutputBuffer.AddRow();
                StockLevelOutputBuffer.ProductOutput = Row.Product;
                StockLevelOutputBuffer.WeekOutput = myLastWeek;
                StockLevelOutputBuffer.StockDifferenceOutput = 0;
                StockLevelOutputBuffer.WeekStockLevelOutput = myStockLevelCalc;
            }
        }

        /* check for an existing product and update the stock level */
        if (Row.Product == myLastProduct)
        {
            myStockLevelCalc = myStockLevelCalc + Row.StockDifference;
        }
        /* reset the stock level for a new product */
        else
        {
            myStockLevelCalc = Row.StockDifference;
        }

        /* perform the insert for the existing week */
        StockLevelOutputBuffer.AddRow();
        StockLevelOutputBuffer.ProductOutput = Row.Product;
        StockLevelOutputBuffer.WeekOutput = Row.Week;
        StockLevelOutputBuffer.StockDifferenceOutput = Row.StockDifference;
        StockLevelOutputBuffer.WeekStockLevelOutput = myStockLevelCalc;

        /* update the private variables for the next row */
        myLastProduct = Row.Product;
        myLastWeek = Row.Week;
    }
}


Other examples of leveraging the SSIS Script Component include complicated pivoting scenarios or challenging data cleansing logic where you want the full function library of C# or Visual Basic. Note that script transformations cannot be shared across data flows; that is, one script transformation cannot be referenced by multiple data flows. In these cases, the SSIS team should consider keeping the master copy of the script within source code version control. Any changes to this master script then need to be propagated to all instances in the data flows.


Destination Optimization (Efficient Inserts)
Much of SSIS ETL performance for data warehouse loads revolves around the interaction with the database layer, specifically around large-volume data inserts. If you are used to dealing with ETL operations involving hundreds of thousands or millions of rows (or more!), you can attest that the biggest performance gains come from optimizing table loading. The two primary ways to optimize SSIS data loading are:
- Using bulk load settings
- Managing indexes


First, bulk table loading means inserting more than one record at a time into a table. SSIS supports both bulk load (called fast load) and standard load. A standard insert writes one row at a time into the table, any triggers fire, and indexes are updated on a row-by-row basis; when tracing the activity, you will see each row in the trace log. You can configure SSIS to insert data in bulk through the destination adapters in the data flow. Figure 3-80 shows the OLE DB destination adapter with the data access mode set to fast load, which allows thousands of records to be inserted into the table at one time. (The number of records inserted per batch depends on the number of columns and data types in the data flow as well as the Rows per batch setting; typically the OLE DB bulk load inserts about 10,000 rows at a time.)

Note that for the most efficient inserts, the recovery model for the destination database should not be set to full; instead, it should be set to simple or bulk-logged. The implication is that an SSIS bulk load can't be rolled back within a transaction. However, this trade-off is often deemed acceptable in order to maximize performance. Refer to the Backing out batches section above for techniques used to roll back bulk operations.
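For reference, switching the recovery model is a one-line statement. The database name here is just the sample database used elsewhere in this chapter, and whether this change is appropriate depends on your backup and recovery requirements:

-- Switch the destination database to the bulk-logged recovery model before large loads.
ALTER DATABASE AdventureWorksDW2008 SET RECOVERY BULK_LOGGED;

-- Optionally switch back to full recovery (and take a log backup) after the load window.
ALTER DATABASE AdventureWorksDW2008 SET RECOVERY FULL;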


Figure 3-80: OLE DB destination bulk settings

Even with fast load turned on, you may still have a bottleneck with your destination inserts. This is because any indexes on the table need to be updated with the new data. When you are loading millions of rows into a table, committing the rows requires that the indexes be updated, a process that can take as long as, if not longer than, inserting the rows themselves. The most common way to deal with indexes on large tables that require large bulk inserts is to drop the indexes before loading the data and then re-create them afterward.


This may not sound intuitive, but it more often than not is faster than allowing the engine to reorganize the index as new data is added in bulk. Note that rebuilding indexes is often an expensive operation that results in parallel query plans; using MAXDOP to restrict the amount of parallel activity for index rebuilds may help reduce bottlenecks on your I/O subsystem. The following list shows the SSIS task flow for dropping and re-creating the indexes:
1. An Execute SQL Task with a DROP INDEX statement.
2. A data flow task that runs the primary data loading logic with fast load.
3. An Execute SQL Task with a CREATE INDEX statement.

However, if you have a large table but are inserting only a few thousand or a few hundred thousand records per ETL cycle, it may be faster to leave the indexes in place. In many (but not all) cases, large tables in a data warehouse have fewer indexes, so you are not rebuilding many indexes on one table at once. If a data warehouse table has a clustered index, you may want to keep it when the inserted data will be appended to the end of the table. For example, if your fact table is clustered on a date key and new records carry the latest date key, there is no need to drop the clustered index.

Partition Management
Large tables (in the hundreds of millions, billions, or trillions of rows) are often partitioned to help manage the physical data and indexes. SQL Server tables support physical partitioning, where the table is made up of separate physical structures tied together into a single queryable table. A partitioned table acts like any other table in that you can query it and insert or update its records. However, just as with index management and large data volumes, when you insert data into a partitioned table there is some overhead for the engine to determine which partition each row belongs to. In addition, if the partitions have indexes on them, inserting takes even more time, because you cannot drop an index on just one partition of a partitioned table. The alternative solution for inserting into a partitioned table involves:
1. Switching out the most current partition, the one that most of the data will be added to. This is a metadata-only operation with little overhead, performed with the T-SQL ALTER TABLE ... SWITCH statement.
2. Dropping some or all of the indexes on the table that has been switched out of the partitioned table.
3. Inserting the new warehouse data into that table using the fast load settings.
4. Re-creating the indexes on the table.
5. Switching the table back into the partitioned table with ALTER TABLE ... SWITCH.


6. Possibly re-creating any indexes that apply to the entire partitioned table (across partitions), if necessary.

Figure 3-81 shows an SSIS control flow of what this operation might look like.


Figure 3-81: ETL partition table management

The following links provide more information about table partitioning and partition management:
- We Loaded 1TB in 30 Minutes with SSIS, and So Can You
- Designing and Tuning for Performance your SSIS packages in the Enterprise (SQL Video Series)
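As a hedged T-SQL sketch of the switch-out, load, switch-in sequence described in the numbered steps above (the table name, partition number, and index names are hypothetical, and the staging table is assumed to already exist with a matching structure on the same filegroup):

-- 1. Switch the target partition out to an empty staging table with the same structure.
ALTER TABLE dbo.FactSales SWITCH PARTITION 12 TO dbo.FactSales_Staging;

-- 2. Drop nonclustered indexes on the staging table before the bulk load.
DROP INDEX IX_FactSales_Staging_DateKey ON dbo.FactSales_Staging;

-- 3. (The SSIS data flow bulk-loads dbo.FactSales_Staging here using fast load.)

-- 4. Re-create the indexes so the staging table again matches the partitioned table.
CREATE NONCLUSTERED INDEX IX_FactSales_Staging_DateKey
    ON dbo.FactSales_Staging (DateKey);

-- 5. Switch the fully loaded staging table back in as partition 12.
ALTER TABLE dbo.FactSales_Staging SWITCH TO dbo.FactSales PARTITION 12;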

SSIS Scale and Performance
The two worst offenders for ETL performance and scalability are:
- Poor ETL design patterns
- I/O bottlenecks

If you've worked with data warehouses for a few years, you've probably run into poor ETL design. Here's a common very-bad-design scenario: A well-intentioned DBA or developer is tasked with transforming some data for a report or a systems integration project. A common approach is for the DBA to build a two-step process that involves three tables: two for the source data and a final one for the merged data. After review, the system or report owner reports that some information is missing and the data is not exactly right.

By now, the developer is working on a new project, and the quickest way to get the user the results is to write a new step or two. Five more tables and four serialized T-SQL steps with updates later, the process is done, right? Wrong. Another business unit wants similar data. One department has found the tables and has started to use one of the intermediate tables. Before the developer knows it, the processing takes four hours and has become mission critical. It's mission critical because he's now getting calls about it at 7 am when the process breaks. Even worse, it has a nickname: The John Doe Process.

Recognize the scenario? The important thing to realize is that the issues didn't develop overnight, and no one was sitting in a back room trying to cause problems. Here are the critical path issues that led to the poor design:
- Data Stewardship: Planning - First of all, planning didn't happen. The business needed a solution, and the quickest one that came to mind was put in place.
- Data Stewardship: Architecture - Second, an architectural strategy was not thought through, either because a strategy did not exist in the organization or because the developer did not step back and think about how the solution would support change or growth. The default approach was implemented because it was the simplest and fastest to develop.
- Future Considerations - The path of least resistance was taken at every step, which caused the solution to grow into a complicated and overburdened process. In the end, more money was spent supporting the process (with people and hardware) than it would have cost to spend the extra time up front to ensure an effective solution.


The resulting bad design suffered from some common data challenges:
- Serial processes - Each part of the process was an all-or-nothing step, and each subsequent step required that the prior step complete. This approach causes delays and risk when a failure occurs.
- High I/O - Because the process was heavy with working tables, I/O was a hidden bottleneck. Every time data is written to a working table, that data must be persisted to the physical drive; each time that happens, data is read from disk and inserted into a new table, which doubles the I/O. The differences between read and write I/O make the process even more inefficient.
- Dependency issues - When an interim step was intercepted by another process, it added to the dependency chain of the system. These dependency chains can easily get out of control, causing more complication and less agility in handling changes.

Scaling SSIS ETL begins with the right design. The inverse of the bullets above gives you the general guidance you need to think through SSIS design. When you are planning your ETL process, be mindful of enterprise processes. Always look to get your data from the source of record or the identified enterprise standard for source data.


Do not create duplication of code when you can avoid it. Leverage the SSIS data flow where you can take advantage of its features and processes. Use standards when naming and designing similar processes.

In summary, the best practices covered throughout this chapter give you a good place to start. Remember that these principles are here to guide you to the best solution, and there may be more than one right solution. One thing is for sure: there will always be more than one bad solution.

Source Control
SSIS packages are code and should be placed under source control. Many SSIS development shops use Team Foundation Server (TFS). TFS can show the differences between two SSIS packages side by side by comparing the underlying XML, but this is sometimes hard to follow; add-ins such as BIDS Helper provide a more filtered view of the differences. See the following link for more information on BIDS Helper's features, including Smart Diff: http://www.mssqltips.com/tip.asp?tip=1892.


Conclusion and Resources


Data integration is critical to the success of any data warehouse and typically represents the largest cost, both for the initial development and for ongoing maintenance. Loading large volumes of data within shrinking execution windows requires ETL developers to use industry best practices and patterns as well as best practices for SQL Server and SSIS. More importantly, business consumers must trust the data loaded into the data warehouse. This requires elevating data quality to a first-class citizen throughout the data integration life cycle, including:
- Profiling source data
- Handling and reporting data exceptions within the integration code
- Adding data and execution lineage throughout the integration data flow
- Creating reports that data stewards can use to reconcile results and identify the root cause of data exceptions

The creation of an ETL framework and ETL template packages allows ETL developers to create consistent, scalable solutions in less time. These templates also reduce development and maintenance costs over the lifetime of the ETL solution, and the framework's dynamic configurations and logging make ongoing ETL operations more efficient and reduce the resources they require. Finally, building your data warehouse on the SQL Server product stack reduces overall software acquisition costs as well as the training costs for the data warehouse team.

Resources
This section contains links mentioned in this chapter along with other useful links on SSIS.

SSIS Sites / Blogs:


- SQLCAT Team's Integration Services best practices
- SSIS Team blog


Additional information for SSIS:
- SSIS System Variables
- SSIS Service
- Integration Services Error and Message Reference. This is useful for translating numeric SSIS error codes into their associated error messages.
- Working with Parameters and Return Codes in the Execute SQL Task
- SSIS Nugget: Setting expressions

SSIS Performance:
- SSIS 2008 Data flow improvements
- SSIS Performance Design Patterns video

Partitioned Tables:
- http://sqlcat.com/msdnmirror/archive/2010/03/03/enabling-partition-level-locking-in-sqlserver-2008.aspx
- http://blogs.msdn.com/b/sqlprogrammability/archive/2009/04/10/sql-server-2005-2008-tablepartitioning-important-things-to-consider-when-switching-out-partitions.aspx
- We Loaded 1TB in 30 Minutes with SSIS, and So Can You
- Designing and Tuning for Performance your SSIS packages in the Enterprise (SQL Video Series)

Configuration and Deployment:
- SQL Server Integration Services SSIS Package Configuration
- SSIS Parent package configurations. Yay or nay?
- SSIS - Configurations, Expressions and Constraints
- Creating packages in code - Package Configurations
- Microsoft SQL Server 2008 Integration Services Unleashed (Kirk Haselden), Chapter 24 Configuring and Deploying Solutions
- Simple Steps to Creating SSIS Package Configuration File
- Reusing Connections with Data Sources and Configurations
- Managing and Deploying SQL Server Integration Services

SSIS Tools and Add-ins:

- BIDS Helper. This is a very useful add-in that includes an expression highlighter.
- BIDS Helper Smart Diff
- ETL Framework
- SQL Server 2005 Report Packs. This page has a link to the SQL Server 2005 Integration Services Log Reports. Note: The SSIS logging table has changed from sysdtslog90 (2005) to sysssislog (2008); you will need to change the SQL in all of the reports or create a view if running SSIS 2008 or later.


