
DATAPREV - DNG (Business Directorate) - DETI

Business Department - Information Processing

Multidimensional Modeling for Data Warehouse

Instructor: Roge Oliveira. Contributors: Alfredo M. V. Martins, Delmir Peixoto A. Jr.

CONTENTS
Preface
Course index
Cross-reference table: Appendices x Chapters
Appendices
Exercise lists


PREFACE
This handout is a collection of various publications on the subject of Data Warehousing. The authors of the respective articles are indicated, and the credit for the ideas belongs to them. Its purpose is to serve as a reference source for the various concepts presented in the course. Where possible, the articles appear directly in the body of the handout, some in translation and others in the original language. Articles that are too long are not included here, but can be found through the indicated hyperlinks. The course topic index is presented next. A cross-reference table helps in choosing which articles best cover a given subject. At the end, the exercises proposed to be worked on during the classes are presented.


COURSE INDEX
1. Introduction
   1.1 Evolution of Databases
   1.2 Data Warehousing Concepts
   1.3 Exercises
2. Multidimensional Modeling
   2.1 Definition of the elements used
       2.1.1 Adopted notation
       2.1.2 Fact Tables
       2.1.3 Dimension Tables
       2.1.4 Exercises
   2.2 Comparison of approaches
       2.2.1 Entity-Relationship Model
       2.2.2 Star Schema Model
       2.2.3 Snowflake Model
       2.2.4 Exercises
   2.3 Procedures for building a model
       2.3.1 Starting from an ER model
       2.3.2 Starting from the queries to be answered
       2.3.3 Demonstration of a practical case
   2.4 Special cases
       2.4.1 Surrogate keys and Slowly Changing Dimensions
       2.4.2 NxM relationships
       2.4.3 ODS
       2.4.4 Factless fact tables
3. Model Integration
   3.1 Designing a Data Warehouse
   3.2 Designing independent Data Marts
   3.3 Designing Data Marts integrated into a Data Warehouse
   3.4 Exercises
4. Conclusion
   4.1 Final questions and comments


Cross-reference table: Appendices x Chapters


Appendix | Title | Chapters
1  | Dimensional Modeling and E-R Modeling In The Data Warehouse | 2.2
2  | Managing Auxiliary Tables | 2.4.2
3  | There Are No Guarantees | 2.2.1
4  | Princípios de Projeto para um Data Warehouse Dimensional | 2.3.2
5  | Mapeamento Entre os Modelos E/R e Star | 2 and 3
6  | Three Interesting Cases for Using Snowflakes | 2.2.3
7  | What Not To Do | 2
8  | Data Mart Does Not Equal Data Warehouse | 3
9  | Curso de Data Warehouse | 1, 2 and 3
10 | Strategies to Solutions: How to Implement a Data Warehouse | 3
11 | The Anti-Architect | 3
12 | Getting Started And Finishing Well | 3
13 | A Conceptual Modelling Perspective for DataWarehouses | 1 and 2
14 | Information Strategy: Data Mart vs. Data Warehouse | 3.3
15 | Business Intelligence | 1
16 | Factless Fact Tables | 2.4.4
17 | Slowly Changing Dimensions | 2.4.1
18 | Surrogate Keys | 2.4.1
19 | Introduction: The Operational Data Store | 2.4.3
20 | Designing the ODS | 2.4.3
21 | Relocating the ODS | 2.4.3


Appendix 1
Dimensional Modeling and E-R Modeling In The Data Warehouse
by Joseph M. Firestone, Ph.D.
White Paper No. Eight, June 22, 1998

Introduction

Dimensional Modeling (DM) is a favorite modeling technique in data warehousing. In DM, a model of tables and relations is constituted with the purpose of optimizing decision support query performance in relational databases, relative to a measurement or set of measurements of the outcome(s) of the business process being modeled. In contrast, conventional E-R models are constituted to (a) remove redundancy in the data model, (b) facilitate retrieval of individual records having certain critical identifiers, and (c) therefore, optimize On-line Transaction Processing (OLTP) performance. Practitioners of DM have approached developing a logical data model by selecting the business process to be modeled and then deciding what each individual low-level record in the "fact table" (the grain of the fact table) will mean. The fact table is the focus of dimensional analysis. It is the table that dimensional queries segment in the process of producing solution sets. The criteria for segmentation are contained in one or more "dimension tables" whose single-part primary keys become foreign keys of the related fact table in DM designs. The foreign keys in a related fact table constitute a multi-part primary key for that fact table, which, in turn, expresses a many-to-many relationship [1]. Further, in a DM the grain of the fact table is usually a quantitative measurement of the outcome of the business process being analyzed, while the dimension tables are generally composed of attributes measured on some discrete category scale that describe, qualify, locate, or constrain the fact table's quantitative measurements. Since a dimensional model is visually represented as a fact table surrounded by dimension tables, it is frequently called a star schema. Figure One is an illustration of a DM/star schema using a student academic fact database.
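Figure One is not reproduced in this handout. As a rough sketch of what such a star schema might look like in SQL (all table and column names here are illustrative assumptions, not taken from the original figure):

CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, calendar_date DATE, semester VARCHAR(20));
CREATE TABLE student_dim (student_key INTEGER PRIMARY KEY, student_name VARCHAR(80), major VARCHAR(40));
CREATE TABLE course_dim  (course_key  INTEGER PRIMARY KEY, course_name VARCHAR(80), department VARCHAR(40));

-- The grain of the fact table is one row per student per course per date;
-- its multi-part primary key is made up of the dimensions' foreign keys.
CREATE TABLE academic_fact (
    date_key     INTEGER REFERENCES date_dim,
    student_key  INTEGER REFERENCES student_dim,
    course_key   INTEGER REFERENCES course_dim,
    grade_points DECIMAL(4,2),  -- quantitative measurement of the business-process outcome
    credit_hours DECIMAL(4,1),
    PRIMARY KEY (date_key, student_key, course_key)
);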


While there is consensus in the field of data warehousing on the desirability of using DM/star schemas in developing data marts, there is an on-going controversy over the form of the data model to be used in the data warehouse. The "Inmonites" support a position identified with Bill Inmon and contend that the data warehouse should be developed using an E-R model. The "Kimballites" believe in Ralph Kimball's view that the data warehouse should always be modeled using a DM/star schema. Indeed, Kimball has stated that DM/star schemas have the advantages of greater understandability and superior performance relative to E-R models, and that their use involves no loss of information, because any E-R model can be represented as an equivalent set of DM/star schema models. In this paper I will comment on two issues related to the controversy: first, the claim that any E-R model can be represented as an equivalent set of DM/star schema models [2], and second, the question of whether an E-R structured data warehouse, absent associative entities, i.e., fact tables, is a viable concept, given recent developments in data warehousing.

Can DM Models Represent E-R Models?

In a narrow technical sense, not every E-R model can be represented as a star schema or closely related dimensional model. It depends on the relationships in the conceptual model formalized by the logical data model. As Ralph Kimball has pointed out on numerous occasions, star schemas represent many-to-many relationships. If there are no many-to-many relationships in an underlying conceptual model, there is no opportunity to define a series of dimensional models. That is, the possibility of a dimensional model is
associated with the presence of many-many relationships of whatever order. On the other hand, an E-R model can be defined whether or not many-many relationships exist. But without them it would have no fact tables. Having said the above, it really doesn't directly address the central question of whether an E-R data warehouse model can always be represented as a series of dimensional models. But it does shed some light on it. Specifically, the answer to the question depends on whether the underlying conceptual model of a data warehouse must always contain many-to-many relationships. I think the answer to this question is yes, and that it follows that an E-R data warehouse can be expressed as a star schema. Here are my reasons. (1) Data warehouses must contain "grain" attributes in the sense of the term specified by Ralph Kimball in The Data Warehouse Toolkit. This is a necessary conclusion for anyone who believes either in a queryable data warehouse, or in a data warehouse that will primarily serve as a feeder system for queryable data marts. In either case, the grain attributes must be available as part of the data warehouse, because they provide data on the extent to which any business is meeting its goals or objectives. Without such attributes, business performance can't be evaluated, and a primary DSS-related purpose of the data warehouse architecture can't be fulfilled. (2) If the grain attributes are present in the data warehouse, what kinds of relationships will be associated with them and what kinds of entities will contain them? In the underlying conceptual model of the data warehouse, there will be attributes that are causally related to the grain attributes, attributes that are effects of the grain attributes, and attributes such as product color, geographic level, and time period that are descriptive of the grain attributes. In the conceptual model, the grain attributes will be associated with many-many relations among these different classes of factors. How can these manymany relations be resolved in a formal model, whether E-R or dimensional? (3) The various causal, effect, and descriptive factors will be contained in fundamental entities, and perhaps in attributive entities, or sub-type entities as well. In a correct E-R or dimensional model, however, the entities containing the grain attributes can only be associative entities, because the grain attributes will not belong to any one fundamental entity in the model; but will be properties of a manymany relation (an n-ary association) among fundamental entities. Since fact tables are resolved many-many relations among fundamental entities, it follows that in a correct E-R model, fact tables are a necessary consequence of grain attributes and of standard E-R modeling rules requiring conceptual correctness and conceptual and syntactic completeness. It goes without saying that fact tables are also the means of resolving many-many relationships in dimensional models. (4) If fact tables must be present in correct E-R models, it still doesn't follow, however, that the fundamental entities related to them must be de-normalized dimension tables as specified in dimensional models. Here, in my view, is where the major distinction between dimensional and E-R data warehouse models will be found. In E-R models, normalization through addition of attributive and sub-type entities destroys the clean dimensional structure of star schemas and creates "snowflakes," which, in general, slow browsing performance. 
But in star schemas, browsing performance is protected by restricting the formal model to associative and fundamental entities, unless certain special conditions (pointed out in "Toolkit," and in Ralph Kimball's various DBMS columns) exist. So, that's it. In data warehouses, conventional E-R models and Star Schemas are both options, and this is due to the semantics of data warehouses as DSS applications requiring many-to-many relationships containing essential grain attributes. Kimball's position is therefore essentially correct: a data
warehouse E-R model can be represented as a series of dimensional models. But this argument has an additional implication I'd like to see widely discussed. I emphasized earlier that both correct dimensional and E-R models rely on fact tables to resolve the many-many relations encompassing grain attributes that are so essential for the data warehouse. If this is true, then why are fact tables so frequently associated with dimensional data warehouse models and not with correct E-R data warehouse models? I suspect this may be because many E-R data warehouse models may not always explicitly recognize many-many relations and the need to resolve them with associative entities, i.e. fact tables. Instead, these models are being defined with fundamental entities containing some of the characteristics of associative entities but also carrying with them the risks of confusion, contradiction, and redundancy inherent in an incomplete resolution of many-to-many relationships, and ad hoc de-normalization of fundamental entities. I can't prove that this hunch of mine is valid, and that the problem in E-R data modeling I've inferred is widespread. But there are examples of the problem in the data warehousing literature. One good example is in the recent book by Silverston, Inmon, and Graziano (Wiley, 1997) [3], called "The Data Model Resource Book." Figure 10.2 on P. 266 presents a sample data warehouse data model. This data model contains no fact tables, but three tables come closest: CUSTOMER_INVOICES, PURCHASE_INVOICES, and BUDGET_DETAILS. Let's focus on CUSTOMER_INVOICES, which is typical of the three. The multi-part primary key is composed of: INVOICE_ID, and LINE_ITEM_SEQ. A number of foreign keys are included as mandatory attributes, but constitute no part of the primary key, and are not determined by it. These are: CUSTOMER_ID, SALES_REP_ID, and PRODUCT_CODE. Other mandatory attributes are: INVOICE_DATE, BILL_TO_ADDRESS_ID, MANAGER_REP_ID, ORGANIZATON_ID, ORG_ADDRESS_ID, QUANTITY, UNIT_PRICE, AMOUNT, and LOAD_DATE. An optional attribute is PRODUCT_COST. I believe that this entity diverges as much as it does from a fact table in a dimensional model, not because it is an E-R model-based entity, but because: (a) it fails to adequately model the conceptual distinction between customer invoice and customer sales, (b) doesn't recognize that unit price, amount, and quantity are attributes of a sale, related not only to an invoice but also to Sales Reps, Products, and Customers, and (c) in consequence doesn't correctly resolve the many-many relationship of Sales Reps, Customer Invoices, Products, and Customers. In short, the CUSTOMER_INVOICES entity, as constructed in the example, represents an error in the E-R model. That is why the QUANTITY, UNIT_PRICE, and AMOUNT attributes are not contained in a CUSTOMER_SALES associative entity, a true fact table, with a multi-part key drawn from SALES_REPS, CUSTOMER_INVOICES, PRODUCTS, and CUSTOMERS.
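A rough SQL sketch of the CUSTOMER_SALES associative entity the author describes (the referenced dimension tables and the data types are assumed here for illustration, not taken from the book being discussed):

CREATE TABLE customer_sales (
    invoice_key   INTEGER REFERENCES customer_invoices,
    sales_rep_key INTEGER REFERENCES sales_reps,
    product_key   INTEGER REFERENCES products,
    customer_key  INTEGER REFERENCES customers,
    quantity      DECIMAL(12,2),  -- non-key attributes fully dependent on the whole key
    unit_price    DECIMAL(12,2),
    amount        DECIMAL(12,2),
    PRIMARY KEY (invoice_key, sales_rep_key, product_key, customer_key)
);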


This point is emphasized further by looking at the star schema design for sales analysis provided in Figure 11.1 on P. 271. This design is supposed to provide an example of a departmental specific data warehouse, (or data mart). While this figure includes a CUSTOMER_SALES table that looks a lot like a fact table, it still reflects the conceptual confusion in the underlying model. Specifically, the multipart key of this "fact table" includes INVOICE_ID, and LINE_ITEM_SEQ, as parts of the primary key. But neither attribute comes from a dimension table, nor are they degenerate dimension attributes since they are part of the primary key. Instead they originate in the "fact table." And since from the previous CUSTOMER_INVOICES entity we know that INVOICE_ID, and LINE_ITEM_SEQ constitute a unique primary key, it follows that CUSTOMER_SALES is not an associative entity or fact table at all, but instead is another fundamental entity, very similar to CUSTOMER_INVOICES, that again confuses the distinction between CUSTOMER_INVOICES and CUSTOMER_SALES. In short, Figure 11.1 is not a valid star schema design, as Figure 10.2 is not a valid E-R model. Because neither the CUSTOMER_INVOICES entity in one, nor the CUSTOMER_SALES entity in the other, is an appropriately normalized entity, whose non-key attributes are fully dependent on the primary key. If they were, they would present properly constructed associative entities resolving many-many relations including CUSTOMER_INVOICES, and CUSTOMER_SALES. Again, how typical this example is of E-R modeling in data warehousing I can't say. That's the question I'd like to see more widely discussed. Is the widely perceived divergence between E-R and dimensional modeling in data warehousing due to the fact that dimensional modeling necessarily involves fact tables and E-R modeling normally does not, or is the perceived divergence due to the fact that E-R modeling practices in data warehousing are not faithful to E-R modeling principles; and if they were they would involve fact tables to exactly the same extent as dimensional models? Is An E-R Data Warehouse Model With No Fact Tables A Viable Concept? DM/Star schemas represent n-ary associations. N-ary associations are embodied in many-to-many relations. These may be resolved within a data model in an entity associating two or more entities. A star schema with one fact table (the associative entity) and two dimension tables represents a binary association. One with one fact table, and three dimension tables represents a ternary association, and so on. As we have seen E-R models can also represent n-ary associations. They differ from star schemas not in the presence of fact tables, but in the fact that their dimension tables are "snowflaked" to meet the requirements of normalization. Since star schemas and "snowflaked" E-R models represent n-ary associations, to say that another type of E-R model eliminating fact tables should be used to structure the data in the data warehouse is also to say that n-ary associations should not be used for this purpose. But n-ary associations are essential for analysis in the context of DBMS DSS applications, because analytical DSS queries employ manyto-many relationships and are frequently multi-stage in character. Many-to-many relationships can only be resolved in data models into (1) n-ary associations of various types with associative entities (fact tables), or (2) more atomic data dependency relationships in E-R models without fact tables. I think the
second alternative ensures poor query response performance in large databases, and therefore discourages and often prevents execution of a multi-stage analysis process. It does so because it provides no structure for navigating the logic of the particular n-ary association implied by an analytical DSS query, and therefore requires that the DBMS engine construct the association "on the fly." In contrast, the first alternative provides a navigational structure for such a query, with consequent good query performance and practical implementation of multi-stage analysis processes. Among associative models, however, a DM/star design generally provides better navigation and performance than an E-R/snowflake design (in the absence of tools with special capability to handle the more complex snowflake model). If one accepts this argument (and if it's correct, 95% of it is in some way owed to Ralph Kimball, and if it's wrong, the correct 95% of it is still owed to Ralph Kimball), then the claim that dimensional modeling or "snowflaked" E-R models should not be employed in the data warehouse largely amounts to the claim that only the limited, constrained analysis supported by data dependency models without associative entities should be employed. That is, the data warehouse becomes no more than a big staging area for data marts, and has no independent analytical function of its own. I can't subscribe to this conclusion. After all, in recent data warehousing/data mart system architectures, we've added an Operational Data Store (ODS) [4], distinct from the data warehouse, and a non-queryable centralized staging area for storing extracted, cleansed, and transformed data and for gathering centralized metadata for implementing an Enterprise Data Mart Architecture (EDMA) [5]. Why then do we need yet another non-queryable staging area? Also, if the data warehouse is only a staging area and we can do analysis only in data marts, where do we go for enterprise-wide DSS?

Conclusion

In the context of the "Inmonite"/"Kimballite" dispute over the proper form of data warehouse data models, this paper examined: (1) the claim that any E-R model can be represented as an equivalent set of DM/star schema models; and (2) the question of whether an E-R structured data warehouse, absent associative entities, i.e., fact tables, is a viable concept given recent developments in data warehousing. A number of conclusions are supported by the arguments:

- Not every E-R model can be represented as a set of star schemas containing equivalent information, but every properly constructed E-R data warehousing model can be so represented.
- Many E-R data warehouse models are not properly constructed, in that they don't explicitly recognize many-many relations and the need to resolve them with associative entities, i.e., fact tables.
- To use data warehousing E-R models specifying atomic data dependency relationships without fact tables is to ensure poor query response performance in large databases, and therefore to discourage, and often prevent, execution of a multi-stage analysis process. In effect, it is to make the data warehouse no more than a big staging area for data marts, with no independent analytical function of its own.
- Given the development of ODSs and non-queryable centralized staging areas for storing extracted, cleansed, and transformed data and for gathering centralized metadata for
implementing an Enterprise Data Mart Architecture (EDMA), we don't need another non-queryable staging area called a data warehouse. What we do need, instead, is a dimensionally modeled data warehouse for enterprise-wide DSS, prepared to provide the best in query response performance and to support the most advanced OLAP [6] functionality we can devise.

References

[1] Ralph Kimball, The Data Warehouse Toolkit (New York, NY: John Wiley & Sons, Inc., 1996), pp. 15-16.
[2] I thank Ralph Kimball for prodding myself and other participants in the dwlist@datawarehousing.com list server group about the importance of examining this issue.
[3] Len Silverston, W. H. Inmon, and Kent Graziano, The Data Model Resource Book (New York, NY: John Wiley & Sons, Inc., 1997).
[4] W. H. Inmon, Claudia Imhoff, and Ryan Sousa, Corporate Information Factory (New York, NY: John Wiley & Sons, Inc., 1998), pp. 87-100.
[5] Douglas Hackney, Understanding and Implementing Successful Data Marts (Reading, MA: Addison-Wesley, 1997), pp. 52-54, 183-84, 257, 307-309.
[6] "What is OLAP?" The OLAP Report, revised February 19, 1998, http://www.olapreport.com/fasmi.htm


Appendix 2
Managing Auxiliary Tables


Author: Ralph Kimball. Translation: Delmir Peixoto.

A careful look at many-to-many relationships between important dimensions.

Multi-valued dimensions are normally illegal in a dimensional design. The usual requirement is that, once the grain of a fact table has been declared, the only legal dimensions that can be attached to that fact table are those that take a single value at that grain. For example, in the banking world, if the grain of the fact table is Account by Month, then the Transaction dimension is excluded, because it takes many different values during the month. If you want to see individual transactions, you declare a finer grain, such as Account by Transaction by Time of Day. But every rule has exceptions. Sometimes, even when a dimension takes multiple values at the grain of the fact table, it is natural to attach the multi-valued dimension to the fact table without changing the grain. It is highly desirable, for example, to attach the Customer dimension to the bank fact table whose grain is Account by Month. The problem is that the number of customers associated with each account is open-ended. Someone may have a checking account in his own name, but he and his wife may also have a joint account. There may even be a family account with five or six customer names. The best way to handle multi-valued dimensions is through an auxiliary (bridge) table, as shown in the figure below, where it is called the Account-to-Customer Map.
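The figure is not reproduced here. A minimal SQL sketch of the Account-to-Customer Map as described in the text (column names and types are illustrative):

CREATE TABLE account_to_customer_map (
    account_key  INTEGER REFERENCES account_dim,   -- Type 2 SCD surrogate key
    customer_key INTEGER REFERENCES customer_dim,  -- Type 2 SCD surrogate key
    begin_date   DATE,
    end_date     DATE,                             -- open-ended for currently valid rows
    PRIMARY KEY (account_key, customer_key, begin_date)
);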

Using surrogate keys


The Account-to-Customer Map is a kind of fact table whose primary key (PK) is composed of multiple foreign keys (FK). The primary key in this example consists of the Account Key, the Customer Key, and the Begin-Date Key. An individual record in this table shows that a particular customer was part of a specific account during the interval defined by the begin and end dates. But this definition deserves a careful look. It is very important that the customer and account foreign keys be surrogate keys referring to their respective dimensions, both of which are Type 2 slowly changing dimensions (SCDs). In other words, changes in the customer and account dimensions are carefully tracked, and new versions of records are continually issued in these dimensions to reflect the changes. In a Type 2 SCD, the natural keys for customer and account remain constant, but the surrogate keys change whenever a new record is inserted into the dimension. The auxiliary table needs the surrogate keys so that the record of the customer's relationship to the account refers to correctly updated descriptions of the customer and the account during the designated time interval. But this precision has a price: every time the customer or the account undergoes a Type 2 change, a new record must be issued in the auxiliary table to reflect the new key combinations. Thus, the begin and end dates in the auxiliary table really refer to the period during which the customer was part of the account and neither the customer description nor the account description had changed. Although this sounds complicated, the next section shows that, using twin timestamps, interesting queries can be run without having to be a logic expert.

Using Twin Timestamps

A list of the customers of an account called ABC123 during a particular period of time can be obtained with a very simple SQL query:

SELECT customer.name
FROM account, map, customer
WHERE account.accountkey = map.accountkey
  AND customer.customerkey = map.customerkey
  AND account.naturalid = 'ABC123'
  AND '7/18/2001' BETWEEN map.begindate AND map.enddate

This is not the standard reading of BETWEEN. SQL specifies the BETWEEN syntax as field BETWEEN value AND value; in this example a reverse relationship was used, value BETWEEN fields. But most modern relational databases, such as Oracle, support this syntax. The drawback of using twin timestamps is that it complicates the maintenance of the auxiliary tables. Every currently valid account map record has an open-ended ENDDATE, which is ugly. When a new record replaces it, the ENDDATE must be set to the real value. The alternative of storing only the BEGINDATE makes the query much more complex: the query above would have to look for the greatest begin date that is less than or equal to the requested date.


That kind of selection statement is inefficient and hard to implement in traditional query tools. Twin timestamps make a time-span query quite simple. Suppose you want to list all customers who were part of an account at any time between two dates. You only need to test (1) whether the begin date falls within the requested span, or (2) whether the begin and end dates completely enclose the requested span. The query would look like this:

SELECT DISTINCT customer.name
FROM account, map, customer
WHERE account.accountkey = map.accountkey
  AND customer.customerkey = map.customerkey
  AND account.naturalid = 'ABC123'
  AND (map.begindate BETWEEN '7/18/2000' AND '7/18/2001'
       OR (map.begindate < '7/18/2000' AND map.enddate > '7/18/2001'))


Appendix 3
There Are No Guarantees
Author: Ralph Kimball. Translation and summary: Delmir Peixoto.

Entity-Relationship modeling is far from being a universal solution for data warehouse business rules. Business rules are the heart and soul of applications. If the systems obey the business rules, the data will be correct and the applications will work. But what exactly is a business rule? Where are rules declared or enforced? They can occur at four levels:

1. Simple field format definitions, enforced directly by the database. Example: the Payment field must be an amount interpreted as dollars.
2. Multiple key-field relationships, enforced by key declarations residing in the database. Example: a product foreign key in a Sales table has a many-to-one relationship with the product primary key in the Product table.
3. Relationships between entities, declared in an entity-relationship (E/R) diagram but not directly enforced by the database because the relationship is many-to-many. Example: Employee is a subtype of Person.
4. Complex business logic, tied to business processes and perhaps enforced only at data-entry time by a complex application. Example: when a security policy has been defined but not yet approved by the person responsible, the effective date may be NULL; once it is signed, the date must be current and more recent than the date of the agreement that defined it.

The database software core manages only the first two levels, field format definitions and key-field relationships. Yet there is far more valuable business content at levels 3 and 4, relationships between entities and complex business logic. E/R modeling seems to be a comprehensive language for describing relationships between entities, but it is not. E/R modeling is a diagramming technique for specifying one-to-one, many-to-one, and many-to-many relationships between data elements. The E/R model is only a logical model; tools such as Computer Associates' Erwin convert an E/R diagram into data definition language (DDL) statements that establish key definitions and constraints between tables, enforcing the various types of relationships appropriately. Although E/R modeling is a useful technique for starting the process of understanding business rules, it falls short in terms of completeness and guarantees:

- E/R modeling is incomplete. The entities and relationships in a given diagram represent only what the analyst decided to emphasize, or was told about. There is no test in an E/R model to determine whether the analyst specified all the possible one-to-one, many-to-one, or many-to-many relationships.
- E/R modeling is not unique. A given set of data relationships can be represented by many different E/R diagrams.
- Most data relationships are many-to-many. There are many varieties of many-to-many relationships, involving various conditions and degrees of correlation, that it would be useful to capture as business rules, but E/R modeling provides no extensions to the basic many-to-many declaration.
- Most large E/R models are ideals, not reality. Almost all corporate data models are an exercise in how things ought to be. They are an exercise in understanding the business, but if the model is not physically fed with real data, it is not worth using the corporate data model as the basis for a practical data warehouse implementation.
- E/R models are rarely models of real data. A corollary of the previous point is that there are no tools that scan real data and then create E/R models. Almost always the E/R models are created first and the data is then fitted to the model. As a result, when dirty data arrives in the data staging area after being extracted from a primary production source, it cannot be inserted into the E/R model as if it were clean data; it has to be cleaned first. And, considering the first two points in this section, even if the data is eventually cleaned and placed in the E/R model, there is no guarantee that the cleaning step is complete, unique, or has captured the data relationships that matter.
- E/R models lead to absurdly complex schemas that lose sight of the original goal. Every programmer is aware of how complex an E/R model can become. The E/R models underlying Oracle Financials can easily require 2,000 tables, and the SAP model can easily require 10,000 of them. These gigantic schemas are obstacles to the basic data warehouse goals of understandability and high performance.
- The E/R model is completely incapable of handling integrity constraints or business rules, except in a few special cases. Declarative rules are too complex to be captured as part of the business model and must be defined separately by the analyst/developer.

E/R modeling is useful in transaction processing because it reduces data redundancy, and it is useful in a limited set of data cleaning activities, but it is far from being a comprehensive platform for data warehouse business rules.


Appendix 4
Princípios de Projeto para um Data Warehouse Dimensional
Maria Cláudia Cavalcanti, Lawrence Zordam Klein, Pablo Lopes Alenquer
http://genesis.nce.ufrj.br/dataware/DataWarehouse/trabalhos_DW.html


Appendix 5
Mapeamento Entre os Modelos E/R e Star
Roberto Reis Monteiro Neto http://genesis.nce.ufrj.br/dataware/DataWarehouse/trabalhos_DW.html


Appendix 6
Three Interesting Cases for Using Snowflakes
Author: Ralph Kimball. Translation: Delmir Peixoto.

When should you use a snowflake? It is not a good idea to expose end users to a physical snowflake design, because it almost always compromises understandability and performance. However, in certain situations a snowflake structure is not only acceptable but recommended.

The Classic Snowflake

The way to create a classic snowflake is to remove low-cardinality attributes from a dimension table and place them in a secondary dimension table connected by a snowflake key. In those cases where a set of attributes forms a multi-level hierarchy, the resulting series of tables looks like a snowflake, hence the name. Variations on the snowflake are key to design success in the following three cases:

Large customer dimension tables

The customer dimension is probably the most challenging dimension in a data warehouse. In a large organization, the customer dimension can be gigantic, with millions of records and many attributes. A complicating factor is that the largest customer dimensions usually contain two categories of consumers: visitors and customers. Visitors are anonymous. You may see them more than once, but you do not know their name or anything else about them. On a Web site, the only knowledge you have about a visitor is a cookie indicating that they have returned. In a retail operation, a visitor does business through an anonymous transaction. Customers, on the other hand, are reliably registered with the company. You know the customer's name, address, and demographic and historical data obtained directly from the customer or acquired from third parties. Let us assume that, at the most granular level of a given data collection, 80% of the fact table activity involves visitors and 20% involves customers. For visitors, only two behavior scores are accumulated: recency (when they last visited) and frequency (how many times they have visited). For customers, we will assume fifty attributes and measures, covering all the location data, payment behavior, credit behavior, directly captured demographic attributes, and purchased demographic attributes.

Visitors and customers are combined into a single logical dimension called Shopper. The visitor or customer is given a single, permanent shopper identity (ID), but the key to the table is made a surrogate key so that changes to the shopper can be tracked over time. The shopper dimension will have the following attributes:

- Shopper surrogate key
- Shopper ID (fixed ID for each physical shopper)
- Recency
- Frequency

Customer-only attributes:
- 5 name attributes
- 10 location attributes
- 10 behavior attributes
- 25 demographic attributes

Note the importance of including the recency and frequency information as dimensional attributes, rather than as facts, and updating them over time. This decision makes the shopper dimension powerful: classic shopper segmentations can be done directly from the dimension without navigating a fact table in a complex application. Assuming that most of the 50 customer-only attributes are textual, the total record width could be 500 bytes or more. Suppose there are 20 million shoppers (16 million visitors and 4 million registered customers). In that case 80% of the records will have the last 50 attributes empty; in a 10 GB dimension this percentage is significant. This is a clear case where, depending on the database, it is advisable to introduce a snowflake. The dimension should be split into a base dimension and a snowflake subdimension. All visitors will share a single record in the subdimension, which will contain special null attribute values. See the figure below.
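The figure is not reproduced here. A minimal SQL sketch of the split, with assumed names and only a few of the fifty customer-only attributes shown:

-- Snowflake subdimension: one row per registered customer, plus one shared
-- "not registered" row that every anonymous visitor points to.
CREATE TABLE customer_detail_dim (
    customer_detail_key INTEGER PRIMARY KEY,
    customer_name       VARCHAR(80),
    city                VARCHAR(40),
    credit_behavior     VARCHAR(20)
    -- ... remaining location, behavior, and demographic attributes ...
);

-- Base dimension: visitors and customers together.
CREATE TABLE shopper_dim (
    shopper_key         INTEGER PRIMARY KEY,   -- surrogate key
    shopper_id          VARCHAR(30),           -- permanent shopper ID
    recency_score       INTEGER,
    frequency_score     INTEGER,
    customer_detail_key INTEGER REFERENCES customer_detail_dim  -- snowflake key
);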


In a fixed-length database, and under the previous assumptions, the base shopper dimension would be 20 million x 25 bytes = 500 MB, and the snowflake subdimension would be 4 million x 475 bytes = 1.9 GB. Using the snowflake therefore saves about 8 GB.

Financial Product Dimensions

Banks, brokerage houses, and insurance companies all have trouble modeling their product dimensions, because each individual product has many special attributes not shared by other products. A checking account looks little like a mortgage or a certificate of deposit, and each has a different number of attributes. If you try to build a single product dimension as the union of all possible attributes, you end up with an enormous number of attributes, most of them empty. The solution in this case is to build a context-dependent snowflake. The core attributes should be isolated in a base product dimension table, and a snowflake key should be included in each base record, pointing to its own extended product subdimension. See the figure below.

This solution is not a conventional relational join! The snowflake key must connect to a particular subdimension table determined by the specific product type.
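A sketch of the idea, with assumed product types and attributes; note that no declared foreign key is possible on the snowflake key, because the target table depends on the product type:

CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(80),
    product_type  VARCHAR(20),   -- e.g. 'CHECKING', 'MORTGAGE'
    extension_key INTEGER        -- snowflake key into the type-specific subdimension
);

CREATE TABLE checking_extension (extension_key INTEGER PRIMARY KEY, minimum_balance DECIMAL(12,2), overdraft_limit DECIMAL(12,2));
CREATE TABLE mortgage_extension (extension_key INTEGER PRIMARY KEY, original_principal DECIMAL(14,2), interest_rate DECIMAL(7,4), term_months INTEGER);

-- The join is resolved per product type, for example for checking accounts only:
SELECT p.product_name, c.overdraft_limit
FROM product_dim p, checking_extension c
WHERE p.extension_key = c.extension_key
  AND p.product_type = 'CHECKING';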


Multi-company calendar dimensions

Building a calendar dimension in a distributed data warehouse spanning multiple organizations is difficult, because each organization has its own fiscal periods, seasons, and holidays. Although a heroic effort can be made to reconcile incompatible calendar labels, you often want to look at the entire multi-company environment from the perspective of just one of the companies. Unlike the financial product dimensions, each of the calendars can have the same number of attributes describing fiscal periods, seasons, and holidays. But there may be hundreds of separate calendars; an international retailer may have to deal with a different calendar for each country. In this case the snowflake design must be modified so that the snowflake key connects to a single calendar subdimension (see the figure below). But the subdimension has higher cardinality than the base dimension! The key to the subdimension is both the snowflake key and the organization key.
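The figure is not reproduced here. A sketch of the design, with assumed names; as the next paragraph explains, a single organization must be specified before the join is evaluated:

CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,   -- also serves as the snowflake key
    calendar_date DATE,
    month_name    VARCHAR(12)
);

-- Higher cardinality than the base dimension: one row per date per organization.
CREATE TABLE fiscal_calendar_subdim (
    date_key         INTEGER REFERENCES date_dim,
    organization_key INTEGER,
    fiscal_period    VARCHAR(20),
    season           VARCHAR(20),
    holiday_flag     CHAR(1),
    PRIMARY KEY (date_key, organization_key)
);

-- Looking at the whole environment through one company's calendar:
SELECT d.calendar_date, f.fiscal_period
FROM date_dim d, fiscal_calendar_subdim f
WHERE d.date_key = f.date_key
  AND f.organization_key = 17;   -- illustrative organization key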

In this situation, a single organization must be specified in the subdimension before the join between the tables is evaluated. When this is done correctly, the subdimension has a one-to-one relationship with the base dimension, as if the two tables were a single entity. The data warehouse of the multi-company environment can then be queried through the calendar of any one of the companies that make it up.

Permissible Snowflakes

These three examples show how variations on the snowflake design can be useful. When thinking about design alternatives, the physical aspects should be kept separate from the logical ones. The physical design drives performance; the logical design determines ease of understanding. A snowflake can be used when it maximizes both of these goals.


Appendix 7
What Not To Do
Ralph Kimball http://www.rkimball.com/html/articles.html


Appendix 8
Data Mart Does Not Equal Data Warehouse
Bill Inmon. Published in DM Direct, November 1999. Translation: Alfredo M. V. Martins.

"The Data Warehouse is nothing more than the union of all the Data Marts...", Ralph Kimball, December 29, 1997. "You can catch all the sardines in the ocean and stack them up and they still do not make a whale", Bill Inmon, January 8, 1998.

The most important challenge for the information technology manager this year is deciding whether to build the data warehouse first or to start with a data mart. Data mart vendors have claimed that data warehouses are difficult and expensive to build, take a long time to design and develop, require thought and investment, and force the corporation to face hard problems such as integrating legacy data, managing massive data volumes, and justifying the costs of the DSS (Decision Support System)/data warehouse project to the management committee. The picture painted by data mart advocates of building the data warehouse is gloomy. It also serves their own interests, and it is wrong. Data mart vendors look at the data warehouse as an obstacle standing between them and their sales revenue. Of course they want to keep the data warehouse from lengthening their sales cycle, without considering the long-term effect of building a handful of data marts and no data warehouse. Data mart vendors are selling a very short-term perspective at the expense of long-term architectural success. Data mart advocates suggest that there may be alternative, much easier paths to successful Decision Support Systems (DSS) than building a data warehouse. One of these paths is to build several data marts and, when they have grown large enough, call them a data warehouse rather than building a real one. Data mart advocates argue that a data mart can be built much faster and more cheaply than a warehouse. When you build a data mart there is no need for a huge organizational or disciplinary confrontation and no concern for the long-term architecture created by the data marts. Unfortunately, by avoiding the visceral internal organizational and design problems of warehousing, data mart advocates lose much of the focus of warehousing. By building an architecture consisting entirely of data marts, they steer the organization toward even greater confusion. Instead of a confused legacy of operational systems, we now have a confused legacy of operational systems AND confused data marts. Stovepipe data marts (a stovepipe data mart is one that is incompatible with other data marts) and stovepipe DSS applications are the result of building only data marts.
And a Decision Support environment without integration is like a man without a skeleton, hardly a viable and useful entity.

A Change in Approach

In the early days of the data warehouse market, data mart vendors tried to jump on the warehouse bandwagon by proclaiming that a data warehouse was the same thing as a data mart. Ad after ad, data mart vendors confused people with mistaken definitions of what a data warehouse is and what a data mart is. Data mart vendors spread half-truths and misinformation about data warehousing. The result was confusion. The confusion sown by data mart vendors led some confused customers to build data marts with no real warehouse. After the third data mart, customers discovered that something was rotten in Denmark. The architectural deficiency of building only data marts was exposed. The customer discovered that when you do not build a data warehouse, there is: massive redundancy of detailed and historical data from one data mart to another; inconsistent and irreconcilable results from one data mart to another; an unmanageable interface between the data marts and the legacy application environment; and so on. In a short period of time, the world discovered that a DSS environment without a data warehouse was an extremely unsatisfactory reality. Now that the world has discovered that building data marts is not the proper way to proceed with DSS, the data mart vendors and their advertisers are back again, sowing a different kind of confusion. This time they have altered their original words a bit and promised a new and improved path to easy success. In a slight shift from the original concept, the notion now being spread is that a data warehouse is merely a collection of integrated data marts (whatever that may be). The notion that multiple data marts can be integrated is paradoxical. The essential point about data marts is that their users build their data store precisely so that they do not have to integrate it with other marts. Put simply, for a variety of very powerful reasons, you cannot build data marts, watch them grow, and magically transform them into a data warehouse when they reach a certain size. Likewise, integrating data across data marts is equally unthinkable, because each department that owns its own data mart has its own specifications.


To understand why one or more data marts cannot be transformed into a data warehouse, you first have to understand what a data mart is and what a data warehouse is.

Different Architectural Structures

A data mart and a data warehouse are essentially different architectural structures, even though the two look alike when viewed from a distance and superficially.

What Is a Data Mart?

A data mart is a set of aggregated data organized for decision support, based on the needs of a given department. Finance has its data mart, Marketing has its own, and so on. And the data mart for Marketing only vaguely resembles the data mart of another department. Perhaps most important, the individual departments OWN the hardware, the software, the data, and the programs that make up the data mart. These ownership rights allow the departments to sidestep any attempts at control or discipline that might coordinate the data coming from the different departments. Each department has its own interpretation of what a data mart should look like, and each departmental data mart is unique and specific to its own needs. Typically, the database design for a data mart is built around a star-join structure that is optimal for the needs of the users found in that department. In order to shape the star join, the user requirements for the department must be gathered. The data mart contains only a modest amount of historical information, and it is granular only to the degree that suits the department's needs. The data mart is typically hosted on multidimensional technology, which is good in terms of analytic flexibility but not optimal for large amounts of data. The data found in data marts is highly indexed. There are two kinds of data marts: dependent and independent. A dependent data mart is one whose source is the data warehouse. An independent data mart is one whose source is the legacy application environment. All dependent data marts are fed from the same source, the data warehouse. Each independent data mart is fed uniquely and separately by the legacy application environment. Dependent data marts are architecturally and structurally sound. Independent data marts are unstable and architecturally unsound, at least for broad integration. The problem with independent data marts is that their deficiencies do not show up until the organization has built many of them.

What Is a Data Warehouse?

Data warehouses are significantly different from data marts. Data warehouses are organized around the corporate subject areas found in the corporate data model. Normally the data warehouse is built and owned by a centrally coordinating organization, such as the classical Information Technology (IT) department.
The data warehouse represents a truly corporate effort. There may or may not be a relationship between the subject areas of any given department and the subject areas of the corporation. The data warehouse contains the most granular data the corporation has. Data mart data is usually much less granular than data warehouse data (that is, data warehouses contain much more detailed information, while most data marts contain more summarized or aggregated data). The data structure of the data warehouse is essentially a normalized structure. The structure and content of the data in a data warehouse do not reflect the bias of any particular department, but represent the data needs of the corporation. The volume of data found in a data warehouse is significantly different from the volume of data found in a data mart. Because of that volume, the data warehouse is lightly indexed. The data warehouse contains a large amount of historical data. The hosting technology of the data warehouse is optimized for handling industrial-strength amounts of data. Data warehouse data is integrated from many legacy sources. In short, there are very significant differences between the structure and content of the data stored in a data warehouse and the structure and content of the data stored in a data mart. Figure 1 shows some differences between a data mart and a data warehouse.

Because the data stored in a data warehouse is granular, integrated, and historical, the data warehouse attracts a significant volume of data. Because the warehouse attracts a significant volume of data, it is advisable to build it iteratively.

If you do not build the warehouse iteratively, you will spend years building it. Since the first article ever written about data warehousing, it has been recognized that there is an urgency to deliver concrete, tangible results to the end user as quickly as possible. The best advice from authors and consultants on building data warehouses has been to build the warehouse quickly and avoid long, drawn-out efforts. Interestingly, data mart advocates and their spokespeople claim that data warehouses take a long time to build. It is only in the exaggerated rhetoric of data mart advocates that the warehouse is supposed to be built in gigantic proportions. Figure 2 shows the recommended construction path for data warehouses.

The latest theory from data mart advocates is that you can build one or more data marts, integrate them (although no one is very clear about what that means), and then, when they grow to a certain size, they can be (magically) transformed into a warehouse. Unfortunately this suggestion is wrong for a variety of reasons:

- The data mart is designed to meet the needs of one department. Many departments with very different goals must be satisfied; that is why there are so many different data marts in the corporation, each with its own view and perception. The data warehouse is designed to meet the collective needs of the corporation as a whole. A given design can be optimal for an isolated department or for the corporation, but not for both. The design goals for the corporation are very different from the design goals for a given department.
- The granularity of the data in a data mart is very different from the granularity of the data in a data warehouse. The data mart contains aggregated or summarized data; the data warehouse contains the most detailed data found in the corporation. Because the data mart's granularity is much coarser than that found in the data warehouse, you cannot easily decompose the data mart's granularity into the data warehouse's granularity. But you can always go in the opposite direction and summarize detailed units of data into aggregations.

- The structure of the data in a data mart (normally a star-join structure) is only remotely compatible with the structure of the data in a warehouse (a normalized structure).
- The amount of historical data found in a data mart is very different from the data history found in a warehouse. Data warehouses contain a vast amount of history; data marts contain only modest amounts of it.
- The subject areas found in a data mart are only remotely related to the subject areas found in a data warehouse. The relationships found in a data mart are not the relationships found in a data warehouse.
- The information retrieval requests (queries) served by a data mart are very different from those found in a data warehouse.
- The type of users ("farmers") found in the marts is quite different from the type of users ("explorers") found in a data warehouse.
- The key structures found in a data mart are significantly different from the key structures found in a data warehouse, and so on.

Reality

There are simply too many significant differences between a data mart environment and a data warehouse environment. The claim that a data mart can be transformed into a data warehouse when it reaches a certain size, or that data marts can be integrated together, is as invalid as claiming that a weed that grows big enough can turn into an oak tree. Reality and genetics being what they are, it is true that a weed and an oak are, at a certain moment in their lives, living green organisms planted in the soil and roughly the same size. But just because those two plants share a few basic characteristics does not mean that a creeping weed can turn into an oak. Only an uninformed person would mistake a weed for an oak at any stage of the plants' lives.


Appendix 9
Curso de Data Warehouse
Rubens Melo http://www.mcc.ufc.br/eti/etipages/eti2000/moddesc.htm#DWH


Appendix 10
Strategies to Solutions: How to Implement a Data Warehouse
Gary Clark http://www.dmreview.com/portal.cfm?NavID=91&EdID=660&PortalID=8&Topic=4


Appendix 11
The Anti-Architect
Ralph Kimball http://www.rkimball.com/html/articles.html


Appendix 12
Getting Started And Finishing Well
Peter Nolan Ralph Kimball http://www.rkimball.com/html/articles.html


Appendix 13
A Conceptual Modelling Perspective for DataWarehouses
Jaroslav Pokorný, Peter Sokolowsky
http://wi99.iwi.uni-sb.de/Folien/Sek11_Pokorny.PDF


Appendix 14
Information Strategy: Data Mart vs. Data Warehouse
Jane Griffin
Published in DM Review in February 1998. Printed from DMReview.com

Do we need a single, enterprise-wide data warehouse, or are the information-intensive departments' data marts sufficient? This question is an industry debate and a common one for organizations that are considering an investment in an integrated information system. Data marts are often an attractive alternative to the mammoth job of implementing an enterprise-wide data warehouse. A data warehouse incorporates information about many subject areas--often the entire enterprise--while the data mart focuses on one or more subject areas. The data mart represents only a portion of an enterprise's data--perhaps data related to a business unit or work group. Typically, a data mart's data is targeted to a smaller audience of end users or used to present information on a smaller scope. The smaller-scale data mart is typically easier to build than the enterprise-wide warehouse; can be quickly implemented; and offers tremendous, fast payback for the users. The downside comes when several department-focused data marts are implemented with no forethought for a future data warehouse that serves the entire enterprise. What at first may seem like a quick and easy solution can cause a problem rather than solve it. Implementing several data marts to serve as reporting systems for individual departments can lead to data mart anarchy. Danger looms when individual departments select different hardware and software platforms, and the organization neglects to standardize and integrate information. This leaves the information technology (IT) department potentially supporting multiple databases, network operating systems and a variety of OLAP reporting tools. The ultimate goal with any integrated information system--whether it be a data mart or a data warehouse--is to provide consistent, accurate data about the organization to the users. Departmentfocused data marts have only the information that group needs. Each department has its own specific uses for a data mart, which often ignore the information needs of other areas. Having different departments with various data marts also escalates the number of problems and issues for the IT group to resolve. Unlike the enterprise-wide data warehouse, IT cannot manage and maintain these information stores from one central location. And one solution cannot address the myriad of problems that may arise from the data marts. Despite their potential pitfalls, data marts can pave the way for a large-scale IT investment in data warehousing. The key rests in designing and implementing a scalable technical infrastructure for the data marts that will allow the leveraging of information for an enterprise-wide data warehouse.


A critical component of a scalable infrastructure involves using a standardized technical architecture across all data marts. Like building a data warehouse, the data mart's architecture should be stable, yet flexible. There are several key components that must be in place to ensure this flexibility and stability. One component is centralized, integrated meta data and consistent definitions. Such consistency will smooth the transition from data marts to data warehouse by making the individual systems compatible. All of the tools used to build the data marts, and eventually the data warehouse, must "speak" to each other. This communication is accomplished by selecting tools that have integrated meta data. The extraction, transformation and loading (ETL) tools selected for the data mart must transform data into common formats and integrate, match and index information from disparate sources. Using a standard technical platform--including a standard operating system, ETL, meta data management tool and reporting tool--can be an effective way to accomplish this task.

Use of data marts with a standard infrastructure can offer unsurpassed business analysis and management capabilities. Data must ultimately be put into the hands of the people who are responsible for the achievement of business objectives and strategies. Issues to consider in information management are: data load times, synchronization, recovery, summarization levels, method of data security implementation, data distribution, data access and query speed, and ease of maintenance. All of these issues should be addressed when implementing the data mart. If not now, they will have to be considered when an enterprise-wide system is implemented. With these key components, organizations implementing data marts will be able to scale the technical architectures they put in place today into an enterprise-wide data warehouse to serve the information demands of tomorrow.

While many of the issues are technical, the core issue of the data warehouse versus data mart is often political. Can IT deliver the data warehouse fast enough to meet the expectations and needs of the departments that are demanding them? How much does standardization slow down the organization? IT must wrestle with and overcome the time and standardization issues if they are to build a flexible, expansive, data warehousing architecture that meets the future needs of the organization.


Anexo 15
Business Intelligence
Valentim Silva http://www.dds.pt/docs/BI%20WhitePaper.pdf


Anexo 16
Factless Fact Tables
Two Types of Useful Fact Tables Contain No Facts At All.
DBMS - September 1996

Over the past year I have given many examples of fact tables in dimensional data warehouses. You should recall that fact tables are the large tables "in the middle" of a dimensional schema. Fact tables always have a multipart key, in which each component of the key joins to a single dimension table. Fact tables contain the numeric, additive fields that are best thought of as the measurements of the business, measured at the intersection of all of the dimension values.

There has been so much talk about numeric additive values in fact tables that it may come as a surprise that two kinds of very useful fact tables don't have any facts at all! They may consist of nothing but keys. These are called factless fact tables.

The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless. One good example is shown in Figure 1. Here you will track student attendance at a college. Imagine that you have a modern student tracking system that detects each student attendance event each day. With the heightened powers of dimensional thinking that you have developed over the past few months, you can easily list the dimensions surrounding the student attendance event. These dimensions include:

- Date: one record in this dimension for each day on the calendar
- Student: one record in this dimension for each student
- Course: one record in this dimension for each course taught each semester
- Teacher: one record in this dimension for each teacher
- Facility: one record in this dimension for each room, laboratory, or athletic field


The grain of the fact table in Figure 1 is the individual student attendance event. When the student walks through the door into the lecture, a record is generated. It is clear that these dimensions are all well-defined and that the fact table record, consisting of just the five keys, is a good representation of the student attendance event. Each of the dimension tables is deep and rich, with many useful textual attributes on which you can constrain and from which you can form row headers in reports.

The only problem is that there is no obvious fact to record each time a student attends a lecture or suits up for physical education. Tangible facts such as the grade for the course don't belong in this fact table. This fact table represents the student attendance process, not the semester grading process or even the midterm exam process. You are left with the odd feeling that something is missing. Actually, this fact table consisting only of keys is a perfectly good fact table and probably ought to be left as is. A lot of interesting questions can be asked of this dimensional schema, including:

- Which classes were the most heavily attended?
- Which classes were the most consistently attended?
- Which teachers taught the most students?
- Which teachers taught classes in facilities belonging to other departments?
- Which facilities were the most lightly used?
- What was the average total walking distance of a student in a given day?

My only real criticism of this schema is the unreadability of the SQL. Most of the above queries end up as counts. For example, the first question starts out as:

SELECT COURSE, COUNT(COURSE_KEY)
FROM FACT_TABLE, COURSE_DIMENSION, ETC.
WHERE ...
GROUP BY COURSE

In this case you are counting the course_keys non-distinctly. It is an oddity of SQL that you can count any of the keys and still get the same correct answer. For example:

SELECT COURSE, COUNT(TEACHER_KEY)
FROM FACT_TABLE, COURSE_DIMENSION, ETC.
WHERE ...
GROUP BY COURSE

would give the same answer because you are counting the number of keys that fly by the query, not their distinct values. Although this doesn't faze a SQL expert (such as my fellow columnist Joe Celko), it does make the SQL look odd. For this reason, data designers will often add a dummy "attendance" field at the end of the fact table in Figure 1. The attendance field always contains the value 1. This doesn't add any information to the database, but it makes the SQL much more readable. Of course, select count (*) also works, but most query tools don't automatically produce the select count (*) alternative. The attendance field gives users a convenient and understandable place to make the query.
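To make the structure concrete, here is a minimal DDL sketch of the attendance schema just described (only two of the five dimensions are spelled out); all table and column names are illustrative, not taken from the article, and the ATTENDANCE column is the dummy convenience field just mentioned.

CREATE TABLE STUDENT_DIM (STUDENT_KEY INTEGER PRIMARY KEY, STUDENT_NAME VARCHAR(80));
CREATE TABLE COURSE_DIM  (COURSE_KEY  INTEGER PRIMARY KEY, COURSE_NAME  VARCHAR(80));

CREATE TABLE ATTENDANCE_FACT (
    DATE_KEY     INTEGER  NOT NULL,
    STUDENT_KEY  INTEGER  NOT NULL REFERENCES STUDENT_DIM,
    COURSE_KEY   INTEGER  NOT NULL REFERENCES COURSE_DIM,
    TEACHER_KEY  INTEGER  NOT NULL,
    FACILITY_KEY INTEGER  NOT NULL,
    ATTENDANCE   SMALLINT DEFAULT 1,   -- always 1; there are no other facts
    PRIMARY KEY (DATE_KEY, STUDENT_KEY, COURSE_KEY, TEACHER_KEY, FACILITY_KEY)
);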


Now your first question reads:

SELECT COURSE, SUM(ATTENDANCE)
FROM FACT_TABLE, COURSE_DIMENSION, ETC.
WHERE ...
GROUP BY COURSE

You can think of these kinds of event tables as recording the collision of keys at a point in space and time. Your table simply records the collisions that occur. (Automobile insurance companies often literally record collisions this way.) In this case, the dimensions of the factless fact table could be:

- Date of Collision
- Insured Party
- Insured Auto
- Claimant
- Claimant Auto
- Bystander Witness
- Claim Type

Like the college course attendance example, this collision database could answer many interesting questions. The author has designed a number of collision databases, including those for both automobiles and boats. In the case of boats, a variant of the collision database required a "dock" dimension as well as a boat dimension.

A second kind of factless fact table is called a coverage table. A typical coverage table is shown in Figure 2. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. Figure 2 also shows a simple sales fact table that records the sales of products in stores on particular days under each promotion condition. The sales fact table does answer many interesting questions but cannot answer questions about things that didn't happen. For instance, it cannot answer the question, "Which products were on promotion that didn't sell?" because it contains only the records of products that did sell.

The coverage table comes to the rescue. A record is placed in the coverage table for each product in each store that is on promotion in each time period. Notice that you need the full generality of a fact table to record which products are on promotion. In general, which products are on promotion varies by all of the dimensions of product, store, promotion, and time. This complex many-to-many relationship must be expressed as a fact table. This is one of Kimball's Laws: Every many-to-many relationship is a fact table, by definition.

Perhaps some of you would suggest just filling out the original fact table with records representing zero sales for all possible products. This is logically valid, but it would expand the fact table enormously. In a typical grocery store, only about 10 percent of the products sell on any given day. Including all of the zero sales could increase the size of the database by a factor of ten. Remember, too, that you would have to carry all of the additive facts as zeros. Because many big grocery store sales fact tables approach a billion records, this would be a killer. Besides, there is something obscene about spending large amounts of money on disk drives to store zeros.


The coverage factless fact table can be made much smaller than the equivalent set of zeros described in the previous paragraph. The coverage table must only contain the items on promotion; the items not on promotion that also did not sell can be left out. Also, it is likely for administrative reasons that the assignment of products to promotions takes place periodically, rather than every day. Often a store manager will set up promotions in a store once each week. Thus we don't need a record for every product every day. One record per product per promotion per store each week will do. Finally, the factless format keeps us from storing explicit zeros for the facts as well.

Answering the question, "Which products were on promotion that did not sell?" requires a two-step application. First, consult the coverage table for the list of products on promotion on that day in that store. Second, consult the sales table for the list of products that did sell. The desired answer is the set difference between these two lists of products.

Coverage tables are also useful for recording the assignment of sales teams to customers in businesses in which the sales teams make occasional very large sales. In such a business, the sales fact table is too sparse to provide a good place to record which sales teams were associated with which customers. The sales team coverage table provides a complete map of the assignment of sales teams to customers, even if some of the combinations never result in a sale.
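As a rough illustration of that two-step application, the sketch below expresses the set difference as a single anti-join; PROMO_COVERAGE and SALES_FACT are illustrative names (they do not come from the article), and differences in grain between the weekly coverage table and the daily sales table are ignored for simplicity.

SELECT c.PRODUCT_KEY
FROM   PROMO_COVERAGE c
WHERE  c.STORE_KEY = 123          -- the store being examined (illustrative value)
  AND  c.DATE_KEY  = 456          -- the day being examined (illustrative value)
  AND  NOT EXISTS (SELECT 1
                   FROM   SALES_FACT s
                   WHERE  s.PRODUCT_KEY = c.PRODUCT_KEY
                     AND  s.STORE_KEY   = c.STORE_KEY
                     AND  s.DATE_KEY    = c.DATE_KEY);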


FIGURE 1

-- A factless fact table for recording student attendance on a daily basis at a college. The five dimension tables contain rich descriptions of dates, students, courses, teachers, and facilities. There are no additive, numeric facts.

FIGURE 2

--A factless coverage table used in conjunction with an ordinary sales fact table to answer the question, "Which products were on promotion that did not sell?"

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.


Anexo 17
Slowly Changing Dimensions
Unlike OLTP Systems, Data Warehouses Can Track Historical Data.
DBMS, April 1996


One major difference between an OLTP system and a data warehouse is the ability to accurately describe the past. OLTP systems are usually very poor at correctly representing a business as of a month or a year ago. A good OLTP system is always evolving. Orders are being filled and, thus, the order backlog is constantly changing. Descriptions of products, suppliers, and customers are constantly being updated, usually by overwriting. The large volume of data in an OLTP system is typically purged every 90 to 180 days. For these reasons, it is difficult for an OLTP system to correctly represent the past. In an OLTP system, do you really want to keep old order statuses, product descriptions, supplier descriptions, and customer descriptions over a multiyear period?

The data warehouse must accept the responsibility of accurately describing the past. By doing so, the data warehouse simplifies the responsibilities of the OLTP system. Not only does the data warehouse relieve the OLTP system of almost all forms of reporting, but the data warehouse contains special structures that have several ways of tracking historical data. (OLTP systems produce "flash reports" for management, and the people who run OLTP systems are proud of that capability. But beyond these simple daily and weekly summaries and counts, the OLTP environment is a very costly environment in which to do any kind of complex reporting. Whether an OLTP shop likes it or not, the economics of reporting favor the data warehouse.)


A dimensional data warehouse database consists of a large central fact table with a multipart key. This fact table is surrounded by a single layer of smaller dimension tables, each containing a single primary key. In a dimensional database, these issues of describing the past mostly involve slowly changing dimensions. A typical slowly changing dimension is a product dimension in which the detailed description of a given product is occasionally adjusted. For example, a minor ingredient change or a minor packaging change may be so small that production does not assign the product a new SKU number (which the data warehouse has been using as the primary key in the product dimension), but nevertheless gives the data warehouse team a revised description of the product. The data warehouse team faces a dilemma when this happens. If they want the data warehouse to track both the old and new descriptions of the product, what do they use for the key? And where do they put the two values of the changed ingredient attribute?

Other common slowly changing dimensions are the district and region names for a sales force. Every company that has a sales force reassigns these names every year or two. This is such a common problem that this example is something of a joke in data warehousing classes. When the teacher asks, "How many of your companies have changed the organization of your sales force recently?" everyone raises their hands.

There are three main techniques for handling slowly changing dimensions in a data warehouse: overwriting, creating another dimension record, and creating a current value field. Each technique handles the problem differently. The designer chooses among these techniques depending on the users' needs.

Overwriting

The first technique is the simplest and fastest. But it doesn't maintain past history! Nevertheless, overwriting is frequently used when the data warehouse team legitimately decides that the old value of the changed dimension attribute is not interesting. For example, if you find incorrect values in the city and state attributes in a customer record, then overwriting would almost certainly be used. After the overwrite, certain old reports that depended on the city or state values would not return exactly the same values. Most of us would argue that this is the correct outcome.
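In SQL terms, the first technique is nothing more than an in-place update of the dimension row; a minimal sketch, assuming an illustrative CUSTOMER_DIM table and key value:

-- Technique 1 (overwriting): correct the attributes in place; the old values are lost.
UPDATE CUSTOMER_DIM
SET    CITY  = 'Springfield',
       STATE = 'IL'
WHERE  CUSTOMER_KEY = 10482;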


Creating Another Dimension Record


The second technique is the most common and has a number of powerful advantages. Suppose you work in a manufacturing company and one of your main data warehouse schemas is the company's shipments. The product dimension is one of the most important dimensions in this dimensional schema. (See Figure 1.) A typical product dimension in a shipments schema would have several thousand detailed records, each representing a distinguishable product capable of being shipped. A good product dimension table would have at least 50 attributes describing the products, including hierarchical attributes such as brand and category, as well as nonhierarchical attributes such as flavor and package type. An important attribute provided by manufacturing operations is the SKU number assigned to the product. You should start by using the SKU number as the key to the product dimension table.

Suppose that manufacturing operations makes a slight change in packaging of SKU #38, and the packaging description changes from "glued box" to "pasted box." Along with this change, manufacturing operations decides not to change the SKU number of the product, or the bar code (UPC) that is printed on the box. If the data warehouse team decides to track this change, the best way to do this is to issue another product record, as if the pasted box version were a brand new product. The only difference between the two product records is the packaging description. Even the SKU numbers are the same. The only way you can issue another record is if you generalize the key to the product dimension table to be something more than the SKU number. A simple technique is to use the SKU number plus two or three version digits. Thus the first instance of the product key for a given SKU might be SKU# + 01. When, and if, another version is needed, it becomes SKU# + 02, and so on. Notice that you should probably also park the SKU number in a separate dimension attribute (field) because you never want an application to be parsing the key to extract the underlying SKU number. Note the separate SKU attribute in the Product dimension in Figure 1.

This technique for tracking slowly changing dimensions is very powerful because new dimension records automatically partition history in the fact table. The old version of the dimension record points to all history in the fact table prior to the change. The new version of the dimension record points to all history after the change. There is no need for a timestamp in the product table to record the change. In fact, a timestamp in the dimension record may be meaningless because the event of interest is the actual use of the new product type in a shipment. This is best recorded by a fact table record with the correct new product key.

Another advantage of this technique is that you can gracefully track as many changes to a dimensional item as you wish. Each change generates a new dimension record, and each record partitions history perfectly. The main drawbacks of the technique are the requirement to generalize the dimension key, and the growth of the dimension table itself.
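The sketch below illustrates the second technique for the SKU #38 packaging change, under the key scheme described above (SKU plus a two-digit version suffix); the table layout, column names, and values are illustrative assumptions only.

-- Technique 2 (new dimension record): add a second row for the "pasted box" version.
-- The generalized key is the SKU plus a version suffix; the plain SKU is also kept as
-- an ordinary attribute so applications never have to parse the key.
INSERT INTO PRODUCT_DIM (PRODUCT_KEY, SKU, PRODUCT_DESC, PKG_TYPE)
VALUES ('3802', '38', '20 oz cereal box', 'pasted box');
-- The earlier row ('3801', '38', '20 oz cereal box', 'glued box') is left untouched
-- and continues to point to all fact history recorded before the change.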


Creating a Current Value Field


You use the third technique when you want to track a change in a dimension value, but it is legitimate to use the old value both before and after the change. This situation occurs most often in the infamous sales force realignments, where although you have changed the names of your sales regions, you still have a need to state today's sales in terms of yesterday's region names, just to "see how they would have done" using the old organization. You can attack this requirement, not by creating a new dimension record as in the second technique, but by creating a new "current value" field.

Suppose in a sales team dimension table, where the records represent sales teams, you have a field called "region." When you decide to rearrange the sales force and assign each team to newly named regions, you create a new field in the sales dimension table called "current_region." You should probably rename the old field "previous_region." (See Figure 2.) No alterations are made to the sales dimension record keys or to the number of sales team records. These two fields now allow an application to group all sales fact records by either the old sales assignments (previous region) or the new sales assignments (current region). This schema allows only the most recent sales force change to be tracked, but it offers the immense flexibility of being able to state all of the history by either of the two sales force assignment schemas. It is conceivable, although somewhat awkward, to generalize this approach to the two most recent changes. If many of these sales force realignments take place and it is desired to track them all, then the second technique should probably be used.
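A minimal sketch of the third technique follows, with illustrative table, column, and region names (none of them come from the article):

-- Technique 3 (current value field): keep the old assignment and add the new one.
ALTER TABLE SALES_TEAM_DIM ADD COLUMN CURRENT_REGION VARCHAR(40);
UPDATE SALES_TEAM_DIM SET CURRENT_REGION = 'Northeast Metro' WHERE TEAM_KEY = 17;
-- (the old "region" column would ideally be renamed previous_region)

-- Reports can now group history by either assignment scheme:
SELECT d.CURRENT_REGION, SUM(f.SALES_DOLLARS)
FROM   SALES_FACT f
JOIN   SALES_TEAM_DIM d ON d.TEAM_KEY = f.TEAM_KEY
GROUP  BY d.CURRENT_REGION;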
Choosing a Technique

The second and third techniques described here will handle the great majority of applications with slowly changing dimensions. The second technique, creating another dimension record, works very well for dimension tables with up to several hundred thousand records. Even the addition of many new records to these moderately large dimensions will not compromise performance in a DBMS with good indexing techniques, such as bit vector indexing. However, eventually a point may be reached in very large dimensions, such as multimillion record customer lists, where the second technique cannot be used. In this case, you are forced to resort to a cruder technique, appropriate for Monster Dimensions. This will be the subject of my column next month.


Figure 1.

--A typical manufacturing shipments schema with five dimensions, showing the Product dimension expanded. In this article I show how to track a meaningful change of the package type (pkg_type) attribute over time when the OLTP system refuses to change the master product key (SKU #).

Figure 2.

--A typical Sales Team dimension for almost any company that sells products. In this article I show how to track a change in the region attribute when you need to see both the old and new versions of the attribute over all historical data.

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. You can reach Ralph through his Internet web page at http://www.rkimball.com.


Anexo 18
Surrogate Keys
Keep control over record identifiers by generating new keys for the data warehouse
DBMS - May 1998

According to Webster's Unabridged Dictionary, a surrogate is an "artificial or synthetic product that is used as a substitute for a natural product." That's a great definition for the surrogate keys we use in data warehouses. A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. Actually, a surrogate key in a data warehouse is more than just a substitute for a natural key. In a data warehouse, a surrogate key is a necessary generalization of the natural production key and is one of the basic elements of data warehouse design.

Let's be very clear: Every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys. It is up to the data extract logic to systematically look up and replace every incoming natural key with a data warehouse surrogate key each time either a dimension record or a fact record is brought into the data warehouse environment. In other words, when we have a product dimension joined to a fact table, or a customer dimension joined to a fact table, or even a time dimension joined to a fact table, as shown in Figure 1, the actual physical keys on either end of the joins are not natural keys directly derived from the incoming data. Rather, the keys are surrogate keys that are just anonymous integers. Each one of these keys should be a simple integer, starting with one and going up to the highest number that is needed. The product key should be a simple integer, the customer key should be a simple integer, and even the time key should be a simple integer. None of the keys should be:
- Smart, where you can tell something about the record just by looking at the key
- Composed of natural keys glued together
- Implemented as multiple parallel joins between the dimension table and the fact table; so-called double- or triple-barreled joins

If you are a professional DBA, I probably have your attention. If you are new to data warehousing, you are probably horrified. Perhaps you are saying, "But if I know what my underlying key is, all my training suggests that I make my key out of the data I am given." Yes, in the production transaction processing environment, the meaning of a product key or a customer key is directly related to the record's content. In the data warehouse environment, however, a dimension key must be a generalization of what is found in the record.

As the data warehouse manager, you need to keep your keys independent from the production keys. Production has different priorities from you. Production keys such as product keys or customer keys are generated, formatted, updated, deleted, recycled, and reused according to the dictates of production. If you use production keys as your keys, you will be jerked around by changes that can be, at the very least, annoying, and at the worst, disastrous. Suppose that you need to keep a three-year history of product sales in your large sales fact table, but production decides to purge their product file every 18 months. What do you do then? Let's list some of the ways that production may step on your toes:
- Production may reuse keys that it has purged but that you are still maintaining, as I described.
- Production may make a mistake and reuse a key even when it isn't supposed to. This happens frequently in the world of UPCs in the retail world, despite everyone's best intentions.
- Production may recompact its key space because it has a need to garbage-collect the production system. One of my customers was recently handed a data warehouse load tape with all the production customer keys reassigned!
- Production may legitimately overwrite some part of a product description or a customer description with new values but not change the product key or the customer key to a new value. You are left holding the bag and wondering what to do about the revised attribute values. This is the Slowly Changing Dimension crisis, which I will explain in a moment.
- Production may generalize its key format to handle some new situation in the transaction system. Now the production keys that used to be integers become alphanumeric. Or perhaps the 12-byte keys you are used to have become 20-byte keys.
- Your company has just made an acquisition, and you need to merge more than a million new customers into the master customer list. You will now need to extract from two production systems, but the newly acquired production system has nasty customer keys that don't look remotely like the others.

The Slowly Changing Dimension crisis I mentioned earlier is a well-known situation in data warehousing. Rather than blaming production for not handling its keys better, it is more constructive to recognize that this is an area where the interests of production and the interests of the data warehouse legitimately diverge. Usually, when the data warehouse administrator encounters a changed description in a dimension record such as product or customer, the correct response is to issue a new dimension record. But to do this, the data warehouse must have a more general key structure. Hence the need for a surrogate key. I discussed Slowly Changing Dimensions in my April 1996 column. In next month's column, I will describe the low-level architecture for recognizing and processing Slowly Changing Dimensions at high speed.

There are still more reasons to use surrogate keys. One of the most important is the need to encode uncertain knowledge. You may need to supply a customer key to represent a transaction, but perhaps you don't know for certain who the customer is. This would be a common occurrence in a retail situation where cash transactions are anonymous, like most grocery stores. What is the customer key for the anonymous customer? Perhaps you have introduced a special key that stands for this anonymous customer. This is politely referred to as a "hack." If you think carefully about the "I don't know" situation, you may want more than just this one special key for the anonymous customer. You may also want to describe the situation where "the customer identification has not taken place yet." Or maybe, "there was a customer, but the data processing system failed to report it correctly." And also, "no customer is possible in this situation." All of these metasituations call for a data warehouse customer key that cannot be composed from the transaction production customer keys. Don't forget that in the data warehouse you must provide a customer key for every fact record in the schema shown in Figure 1. A null key automatically turns on the referential integrity alarm in your data warehouse because a foreign key (as in the fact table) can never be null.

The "I don't know" situation occurs quite frequently for dates. You are probably using date-valued keys for your joins between your fact tables and your dimension tables. Once again, if you have done this you are forced to use some kind of real date to represent the special situations where a date value is not possible. I hope you have not been using January 1, 2000 to stand for "I don't know." If you have done this, you have managed to combine the production key crisis with the Year 2000 crisis.

Maybe one of the reasons you are holding on to your smart keys built up out of real data is that you think you want to navigate the keys directly with an application, avoiding the join to the dimension table. It is time to forget this strategy. If the fifth through ninth alpha characters in the join key can be interpreted as a manufacturer's ID, then copy these characters and make them a normal field in the dimension table. Better yet, add the manufacturer's name in plain text as a field. As the final step, consider throwing away the alphanumeric manufacturer ID. The only reason the marketing end users know these IDs is that they have been forced to use them for computer requests. Holding onto real date values as keys is also a strategic blunder. Yes, you can navigate date keys with straight SQL, thereby avoiding the join, but you have left all your special calendar information marooned in the date dimension table. If you navigate naked date keys with an application, you will inevitably begin embedding calendar logic in your application. Calendar logic belongs in a dimension table, not in your application code.

You may be able to save substantial storage space with integer-valued surrogate keys. Suppose you have a big fact table with a billion rows of data. In such a table, every byte wasted in each row is a gigabyte of total storage. The beauty of a four-byte integer key is that it can represent more than 2 billion different values. That is enough for any dimension, even the so-called monster dimensions that represent individual human beings. So we compress all our long customer IDs and all our long product stock keeping units and all our date stamps down to four-byte keys. This saves many gigabytes of total storage.

The final reason I can think of for surrogate keys is one that I strongly suspect but have never proven. Replacing big, ugly natural keys and composite keys with beautiful, tight integer surrogate keys is bound to improve join performance. The storage requirements are reduced, and the index lookups would seem to be simpler. I would be interested in hearing from anyone who has harvested a performance boost by replacing big ugly fat keys with anonymous integer keys. Having made the case for surrogate keys, we now are faced with creating them.
Fundamentally, every time we see a natural key in the incoming data stream, we must look up the correct value of the surrogate key and replace the natural key with the surrogate key. Because this is a significant step in the daily extract and transform process within the data staging area, we need to tighten down our techniques to make this lookup simple and fast. In next month's column, I will describe the state of the art for surrogate key architectures.
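To make the lookup step concrete, here is a hedged sketch of the substitution performed in the staging area; every table and column name (STAGE_SALES, NATURAL_SKU, IS_CURRENT, and so on) is an illustrative assumption rather than part of the article.

-- Each incoming natural key is swapped for the surrogate key of the current
-- dimension row before the fact record is loaded.
INSERT INTO SALES_FACT (DATE_KEY, PRODUCT_KEY, CUSTOMER_KEY, SALES_DOLLARS)
SELECT d.DATE_KEY,
       p.PRODUCT_KEY,                          -- anonymous integer surrogate
       c.CUSTOMER_KEY,                         -- anonymous integer surrogate
       s.SALES_DOLLARS
FROM   STAGE_SALES   s
JOIN   DATE_DIM      d ON d.CALENDAR_DATE   = s.SALE_DATE
JOIN   PRODUCT_DIM   p ON p.NATURAL_SKU     = s.SKU     AND p.IS_CURRENT = 'Y'
JOIN   CUSTOMER_DIM  c ON c.NATURAL_CUST_ID = s.CUST_ID AND c.IS_CURRENT = 'Y';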


Figure 1. A sample data warehouse schema.


Ralph Kimball was coinventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996) is now available. You can reach Ralph through his Web page at www.rkimball.com.


Anexo 19
Introduction: The Operational Data Store
by W. H. Inmon
The ODS is first and foremost an operational construct. Because it is operational it is up to the second in its timeliness. The ODS can be updated, unlike the data warehouse. The ODS can provide an integrated, operational collective view of information. Online, high performance integration, then, is one of the primary characteristics of the ODS.

Some of the other characteristics of the ODS are that the ODS contains only very current data and only detailed data. The ODS does not contain anywhere near the lengthy historical perspective of data found in the data warehouse. At most one week's data is contained in the ODS. And the ODS physically contains only detailed data, not summarized data. In these regards the ODS is very different from the data warehouse.

The ODS is fed from the operational, legacy environment, much like the data warehouse. The data that passes out of the legacy application environment feeds the ODS. The ODS in turn feeds the data warehouse environment. There still is a clear separation between the operational and informational world even when there is an ODS.

Consider two kinds of architectures - an architecture where there is a legacy environment and a data warehouse, and an architecture where there is a legacy environment, a data warehouse, and an ODS. There is the question - which architecture is correct? In truth both architectures are correct. In other words the architectures are not mutually exclusive. In some cases for some systems the architecture without the ODS is proper. In other cases the architecture with the ODS is proper. The existence and validity of one architecture does not invalidate the existence of the other architecture.

There are four different types of ODS, depending on the speed and source with which the ODS is updated. When the ODS is updated asynchronously with the operational environment, the ODS is said to be class I. In this case there may be only a half second lag between an operational update and an ODS update. When the updates that occur in the operational environment are held for one or two hours - in a store and forward mode - before being applied to the ODS, the ODS is said to be class II. And when the operational updates are held overnight and are applied to the ODS as part of the batch update cycle, the ODS is said to be a class III ODS. A class IV ODS is one that is updated from the data warehouse.

The ODS is architecturally similar to the data warehouse in that both the ODS and the data warehouse are integrated and subject oriented. But the ODS and the data warehouse are very dissimilar in several respects:
- the data warehouse is not updated by transactions, while the ODS certainly can be updated by transactions;
- the data warehouse contains summarized data, while the ODS does not;
- the data warehouse contains a lengthy historical perspective of data, while the ODS does not;
- the data warehouse is used for strategic decisions, while the ODS is used for tactical decisions;
- the data warehouse is used at the managerial level, while the ODS is used at the clerical level.

There are then significant differences between the ODS and the data warehouse.

Copyright 1999 by BILLINMON.COM LLC, all rights reserved.


Anexo 20
Designing the ODS
by W. H. Inmon
A discussion of the ODS best begins with a schematic that shows how an ODS is architecturally positioned. The ODS sits between the legacy applications environment and the data warehouse. The ODS is an architectural structure that is fed by integration and transformation (I/T) programs. These I/T programs can be the same programs as the ones that feed the data warehouse or they can be separate programs. The ODS in turn feeds data to the data warehouse. Some operational data traverses directly into the data warehouse through the I/T layer while other operational data passes from the operational foundation into the I/T layer, then into the ODS and on into the data warehouse.

An ODS is an:

integrated, subject oriented, volatile (including update), current valued

structure designed to serve operational users as they do high performance integrated processing. The essence of an ODS is the enablement of integrated, collective online processing. An ODS delivers consistent high transaction performance - 2 to 3 seconds. An ODS supports online update. An ODS is integrated across many applications. An ODS provides a foundation for collective, up to the second views of the enterprise. And at the same time the ODS supports decision support processing (DSS).

Because of the many roles that an ODS fulfills, it is necessarily a complex structure. Its underlying technology is complex. Its design is complex. Monitoring and maintaining the ODS is complex. The ODS typically takes a long time to design and implement. The ODS requires changing or replacing old legacy systems that are unintegrated. In short, it is no small undertaking, this ODS.

In order to understand the foundation of the design of the ODS, you first need to understand that two very different types of users are attracted to the ODS - farmers and explorers. The first user of the ODS is a user who can be called a "farmer". Farmers are those people who do the same task repetitively. Farmers know what they want when they set out to search for something. Farmers look at small amounts of data with each transaction. Farmers almost always find what they want. Farmers usually find small flakes of gold, not huge nuggets at the completion of their transaction. Farmers operate in a world of structure - structured data, structured processing, structured procedures, and so forth.

The other type of user that is served by the ODS is the "explorer". The explorer is the antithesis of the farmer. The explorer operates in a random manner. The explorer does not know what he/she is looking for at the outset of the analysis. Explorers operate in a heuristic mode. Explorers look at very large sets of data. Explorers look for associations between types of data, patterns that are useful, and relationships that have heretofore never been discovered. The explorer often finds nothing as a result of an analysis. But occasionally the explorer finds huge nuggets of gold. Explorers operate in a pattern that defies prediction. The explorer operates in an almost completely unstructured manner.

The ODS must satisfy the needs of both the farmer and the explorer, and because of this paradox, the design of the ODS is a difficult task, in the best of circumstances.


The classical design of the structures found in the DSS environment begins with a data model, which reflects the informational needs of the corporation. From the data model are generated normalized tables. These tables constitute what can be described as a logical design. The many normalized tables are combined into a form of physical design that can be described as lightly normalized design. In a lightly normalized design, tables are combined on the basis of containing common keys and general common usage.

The design technique of creating normalized / lightly normalized structures based on a data model that has been described here fits many instances of DSS design. But there is a fly in the ointment of this approach. When the issues of:

- performance, where many tables must be joined,
- performance, where there are many occurrences of data that will populate the design, and
- simplicity, where users find it unnatural to join many tables together to represent data in a form comprehensible to the end user each time the end user does a transaction

are considered, the design technique of light normalization yields marginal results. An alternate design approach is to take into consideration the volume and usage of the data. When the volume and usage of the data are factored into the design, a mutant form of normalization is achieved. The light normalization turns into heavy normalization, and a structure known as the "star join" is created.

Consider a star join. There are two essential parts to a star join - fact tables and dimension tables. The fact table represents the structure that holds the majority of the occurrences of the data. Fact tables typically combine data and cross reference keys from a variety of other tables. The other type of table that participates in a star join is the dimension table. Dimension tables contain data which is not terribly voluminous. Dimension tables are related to fact tables by means of a foreign key relationship. Fact tables are efficient to access because data has been prejoined into the table at the moment of loading. The end user is able to access fact tables efficiently because the fact tables are extremely streamlined in their design. In addition, the fact table is familiar to the end user, in terms of the day to day structuring of data that the end user is used to seeing. By building star joins, the designer has created a structure for efficient access, large volumes of data, and natural end user viewing.

But there is a problem with star joins. In order to know how to create the star join, the designer must make assumptions about the usage of the data. Stated differently, without knowing the predominant pattern of access and usage of the data, you cannot create a star join. At the heart of the design of any star join is the implicit understanding of how the data in the star join is to be used. Unfortunately one department will look at data very differently from another department. The star join for finance will be very different from the star join for production, for example. And there is a second problem with star join structures, and that problem is that online update plays havoc with the underlying data management required to make the star join complete. In a DSS world where there is no update, this is not a problem. But in an ODS world where online update is a normal event, the inability of the star join to gracefully handle updates presents a special challenge.

The ODS designer has a dilemma then. On the one hand the designer wishes to have efficiency of access and the ability to handle large amounts of data. On the other hand the ODS designer must design the system to be able to accommodate a wide variety of users. The following table illustrates the dilemma of the ODS data base designer:


NORMALIZED STRUCTURE                     STAR JOIN STRUCTURE
inefficient to access                    efficient to access
holds modest amounts of data             holds large amounts of data
applicable to a wide audience            applicable to a restricted audience
handles updates                          does not handle updates

The designer in the ODS environment faces Hobson's choice. Neither design approach - normalized or star join - is optimal for the ODS. Both approaches have their strengths and weaknesses. The way the sophisticated designer goes about solving this apparent contradiction is to go back to the users of the system. For those parts of the system used primarily by explorers, a normalized design is optimal. Explorers do not know how they are going to use the system, so normalization suits them just fine. For those parts of the system used primarily by farmers, a star join approach is optimal. Since farmers have a predictable and repetitive usage pattern, a star join can be created to allow them optimal access. A dual design approach is the best for the ODS.

The next factor that must be accounted for is the issue of update or pure DSS processing. Some farmers do no update. They are the "pure" DSS processors. Other farmers do update as a regular part of their ODS processing. Explorers however seldom do online update. If explorers do update at all, it is by creating sweeping batch programs that march across entire tables and make massive changes. But explorers are not known for making changes, certainly not online updates.

The proper basis of design for an ODS is entirely dependent on who is using the ODS and what kind of work they are doing. If the ODS is used ONLY by farmers doing DSS processing, then an exclusive star join approach is in order for the entire ODS. But if update processing is being done by farmers or if there is usage of the ODS by explorers to any extent, then one or the other form of normalization is in order. If the ODS is used ONLY by explorers, then a normalized approach is in order for the entire ODS.
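To illustrate the predictable, repetitive access pattern that makes a star join attractive to farmers, here is a hedged sketch of a typical star-join query; all table, column, and value names are illustrative assumptions.

-- A farmer's repetitive question: units shipped by product category and region
-- for a given month, answered by joining the fact table to three dimensions.
SELECT p.CATEGORY,
       r.REGION,
       SUM(f.SHIP_QUANTITY) AS UNITS_SHIPPED
FROM   SHIPMENTS_FACT f
JOIN   PRODUCT_DIM p ON p.PRODUCT_KEY = f.PRODUCT_KEY
JOIN   STORE_DIM   r ON r.STORE_KEY   = f.STORE_KEY
JOIN   DATE_DIM    d ON d.DATE_KEY    = f.DATE_KEY
WHERE  d.MONTH_NAME = 'June 1999'
GROUP  BY p.CATEGORY, r.REGION;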

Copyright 1999 by BILLINMON.COM LLC, all rights reserved.


Anexo 21
Relocating the ODS
Moving the Operational Data Store Will Solve a Number of Problems.
DBMS - December 1997

The operational data store (ODS) is a part of the data warehouse environment about which many managers have confused feelings. I am often asked, "Should I build an ODS?" I have decided that the underlying question is, "What is an ODS, anyway?" According to Bill Inmon and Claudia Imhoff in their book Building the Operational Data Store (John Wiley & Sons, 1996), an ODS is a "subject-oriented, integrated, volatile, current valued data store, containing only corporate detailed data." This definition for an ODS reflects a real market need for current, operational data.

If anything, the need for the ODS function has grown in recent years and months. At the same time as our data warehouse systems have gotten bigger, the need to analyze ever more detailed customer behavior and ever more specific operational texture has grown. In most cases the analysis must be done on the most granular and detailed data that we can possibly source. The emergence of data mining has also demanded that we crawl through reams of the lowest-level data, looking for correlations and patterns.

Until now, the ODS was considered a different system from the main data warehouse because the ODS was based on "operational" data. The downstream data warehouse was almost always summarized. Because the warehouse was a complete historical record, we usually didn't dare store this operational (transactional) data as a complete history. However, the hardware and software technology supporting data warehousing has kept rolling forward, able to store more data and able to process larger answer sets. We also have discovered how to extract and clean data rapidly, and we have figured out how to model it for user understandability and extreme query performance. It is fashionable these days to talk about multiterabyte data warehouses, and consultants braver than I talk about petabyte (1,000 terabyte) data warehouses being just around the corner.

Now I am getting suspicious of the ODS assumption we have been making that you cannot store the individual transactions of a big business in a historical time series. Let us stop for a moment and estimate the number of low-level sales transactions in a year for a large retailer. This is surprisingly simple. I use the following technique to triangulate the overall size of a data warehouse before I ever interview the end users.


Imagine that our large retailer has six billion dollars in retail sales per year. The only other fact we need is the average size of a line item on a typical sales transaction. Suppose that our retailer is a drug store and that the average dollar value of a line item is two dollars. We can immediately estimate the number of transaction line items per year as six billion dollars divided by two dollars, or three billion. This is a large number, but it is well within the range of many current data warehouses. Even a three-year history would "only" generate nine billion records. If we did a tight dimensional design with four 4-byte dimension keys, and four 4-byte facts, then the raw fact table data size per year would be three billion times 32 bytes, or 96GB. Three years of raw data would be 288GB. I know of more than a dozen data warehouses bigger than this today.

Our regular data warehouses can now embrace the lowest-level transaction data as a multiyear historical time series, and we are using high-performance data extracting and cleaning tools to pull this data out of the legacy systems at almost any desired frequency each day. So why is my ODS a separate system? Why not just make the ODS the leading, breaking wave of the data warehouse itself? With the growing interest in data mining fine-grained customer behavior in the form of individual customer transactions, we increasingly need detailed transaction-time histories available for analysis. The effort expended to make a lightweight, throwaway, traditional ODS data source (for example, a volatile, current valued data source restricted to current data) is becoming a dead end and a distraction.

Let us take this opportunity to tighten and restrict the definition of the ODS. We will view the ODS simply as the "front edge" of the existing data warehouse. By bringing the ODS into the data warehouse environment, we make it more useful to clerks, executives, and analysts, and we need only to build a single extract system. This new, simplified view of the ODS is shown in Figure 1 and Figure 2. Let us redefine the ODS as follows. The ODS is a subject-oriented, integrated, frequently augmented store of detailed data in the enterprise data warehouse.

The ODS is subject-oriented. That is, the ODS, like the rest of the data warehouse, is organized around specific business domains such as Customer, Product, Activity, Policy, Claim, or Shipment.

The ODS is integrated. The ODS gracefully bridges between subjects and presents an overarching view of the business rather than an incompatible stovepipe view of the business.

The ODS is frequently augmented. This requirement is a significant departure from the original ODS statement that said the ODS was volatile; for example, the ODS was constantly being overwritten and its data structures were constantly changing. This new requirement of frequently augmenting the data also invalidates Inmon and Imhoff's statement that the ODS contains only current valued data. We aren't afraid to store the transaction history. In fact, that has now become our mission.

The ODS sits within the full data warehouse framework of historical data and summarized data. In a data warehouse containing a monthly summarized view of data in addition to the transaction detail, the input flow to the ODS also contributes to a special "current rolling month."
In many cases, when the last day of the month is reached, the current rolling month becomes the most recent member of the standard months in the time series and a new current rolling month is created.
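The month-end handling of that rolling month can be sketched roughly as follows; MONTHLY_SNAPSHOT and MONTH_TYPE are illustrative assumptions, and the actual mechanics vary from shop to shop.

-- At month end, the rolling snapshot is frozen as an ordinary member of the
-- monthly time series; daily loads then start accumulating a new rolling month.
UPDATE MONTHLY_SNAPSHOT
SET    MONTH_TYPE = 'STANDARD'
WHERE  MONTH_TYPE = 'CURRENT_ROLLING';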


The ODS naturally supports a collective view of data. We now see how the ODS presents a collective view to the executive who must be able to see a customer's overall account balance. The executive can immediately and gracefully link to last month's collective view of the customer (via the time series) and to the surrounding class of customers (via the data warehouse aggregations).

The ODS is organized for rapid updating directly from the legacy system. The data extraction and cleaning industry has come a long way in the last few years. We can pipeline data from the legacy systems through data cleaning and data integrating steps and drop it into the ODS portion of the data warehouse. Inmon and Imhoff's original distinctions of Class I (near realtime upload), Class II (upload every few hours), and Class III (upload perhaps once per day) are still valid, but the architectural differences in the extract pipeline are far less interesting than they used to be. The ability to upload data very frequently will probably be based more on waiting for remote operational systems to deliver necessary data than computing or bandwidth restrictions in the data pipeline.

The ODS should be organized around a star join schema design. Inmon and Imhoff recommend the star join data model as "the most fundamental description of the design of the data found in the operational data store." The star join, or dimensional model, is the preferred data model for achieving user understandability and predictable high performance. (For further information on this subject, please see my article, "A Dimensional Modeling Manifesto," in the August 1997 issue of DBMS.)

The ODS contains all of the text and numbers required to describe low-level transactions, but may additionally contain back references to the legacy system that would allow realtime links to be opened to the legacy systems through terminal- or transaction-based interfaces. This is an interesting aspect of the original definition of the ODS, and is somewhat straightforward if the low-level transactions are streamed out of the legacy system and into the ODS portion of the data warehouse. What this means is that operational keys like the invoice number and line number are kept in the data flow all the way into the ODS, so that an application can pick up these keys and link successfully back to the legacy system interface.

The ODS is supported by extensive metadata needed to explain and present the meaning of the data to end users through query and reporting tools, as well as metadata describing an extract "audit" of the data warehouse contents.

Bringing the ODS into the existing data warehouse framework solves a number of problems. We can now focus on building a single data extract pipeline. We don't need to have a split personality where we are willing to have a volatile, changing data structure with no history and no support for performance-enhancing aggregations. Our techniques have improved in the last few years. We understand how to take a flow of atomic-level transactions, put them into a dimensional framework, and simultaneously build a detailed transaction history with no compromising of detail, and at the same time build a regular series of periodic snapshots that lets us rapidly track a complex enterprise over time. As I just mentioned, a special snapshot in this time series is the current rolling snapshot at the very front of the time series. This is the echo of the former separate ODS.
In next month's column, I will describe the dual personality of transaction and snapshot schemas that is at the heart of the "operational data warehouse." Finally, if you have been reading this with a skeptical perspective, and you have been saying to yourself, "storing all that transaction detail just isn't needed in my organization: all my management needs are high-level summaries," then broaden your perspective and listen to what is going on in the marketing world. I believe that we are in the midst of a major move to one-on-one marketing in which large organizations are seeking to understand and respond to detailed and individual customer behavior. Banks need to know exactly who is at that ATM between 5:00 p.m. and 6:00 p.m., what transactions they are performing, and how that pattern has evolved this year in response to various bank incentive programs. Catalina Marketing is ready to print coupons at your grocery store register that respond to what you have in your shopping basket and what you have been buying in recent trips to the store. To do this, these organizations need all the gory transaction details, both current and historical. Our data warehouse hardware and software are ready for this revolution. Our data warehouse design techniques are ready for this revolution. Are you ready? Bring your ODS in out of the rain and into the warehouse.

Figure 1.

The original ODS architecture necessitated two pathways and two systems because the main data warehouse wasn't prepared to store low-level transactions.

Pg. 60

Figure 2. The new ODS reality. The cleaning and loading pathway needs only to be a single system because we are now prepared to build our data warehouse on the foundation of individual transactions.

Ralph Kimball was coinventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996) is now available. You can reach Ralph through his Web page at www.rkimball.com.

DBMS and Internet Systems (http://www.dbmsmag.com), December 1997. Copyright 1997 Miller Freeman, Inc.



Multidimensional Modeling for Data Warehouse - Exercise 1: Questions


The questions below are simply a way of exercising the concepts presented in the course, aiming at a better assimilation of them. Consider the following situations:

a) "In a certain store, an invoice is issued for each customer, and a customer may make several purchases. An invoice contains several line items, each corresponding to a particular product. A given product may appear on several invoices."

b) "Sales for the first quarter of this year are 10% lower than those for the same period last year. It is necessary to identify which branches are showing the greatest losses and which product types had lower-than-expected demand."

Which type of system, OLAP or OLTP, typically handles each of the problems described above? Which type of database (operational or DW) is most suitable for each?

State which DBMS architecture is most appropriate for implementing a DW, and what its advantages and disadvantages are compared with the other options.

List the techniques related to the Data Mining process and the purpose of each one.





Multidimensional Modeling for Data Warehouse - Exercise 2: Creating a Star Schema with Designer

Now let's get our hands dirty. Open Oracle Designer and familiarize yourself with the software. Within the practice application, create a fact table and a few dimension tables.
A. Objective

Create a star schema for a drugstore chain, storing data about customer purchases for its marketing department. This will make it possible to analyze changes in customers' purchasing behavior over time. The tool to be used is the Design Editor of Oracle Designer. The source for the star schema is a text file extracted from a Point of Sale OLTP system. Each record contains 23 attributes. Some data is missing. Your mission is to create a physical model for this gigabyte-scale analytical database, so as to make data extraction more efficient and reduce query time. Please save your model for future use.
B. Metadata

The 23 fields are the following:

Name             Description                                             Type
store            Store code                                              Number
customer         Customer ID                                             Text
date             Purchase date                                           Date
receipt          Receipt number                                          Number
Quantity         Quantity                                                Number
time             Purchase time                                           Date/Time
manufacturer     Manufacturer code                                       Number
brand            Brand code                                              Number
cat2             2-digit category code                                   Number
cat4             4-digit category code                                   Number
cat6             6-digit category code                                   Number
dob              Date of birth                                           Date
gender           Gender (0 = group, 1 = male, 2 = female)                Number
cost             Cost of item                                            Number
discount         Discount amount                                         Number
total_profit     Quantity x profit/unit                                  Number
product_ID       Product code                                            Number
location         Location code                                           Number
day              Day of the week                                         Text
week_number      Week number (from 1998/01 to 1998/12)                   Number
week_begin_date  The begin date of each week                             Date
buy_not          Whether the customer will purchase 6 months after       Number
                 their last visit in 1998 (0 = not buy, 1 = buy)

In this tutorial, you will learn how to:
1. Analyze the data structure and build the star schema.
2. Create tables and define their columns.
3. Ensure uniqueness by defining primary keys.
4. Create foreign keys linking the tables.


C. Process

Sequence of steps to be followed:


Create a new Server Model Diagram
Add the tables to the schema
Define the columns
Define the primary keys
Define the foreign keys

Step 1: Create a new Server Model Diagram
In the Design Editor, menu File, New, Server Model Diagram, Save Diagram As SMD_GRUPO_N (where N is your group number).

Step 2: Add the tables to the schema
Click the icon, then click on the blank area. Define the name, alias, and type of your table. Prefix both the name and the alias with GN (according to your group). Click Finish.

When finished, the star schema should contain the dimension tables store, product, customer, and time, linked around the fact table fact_sales. Primary keys are marked (PK) below.

GN_Fact_Sales (fact table): Customer ID, Time_key, Product Code, Store Code, Receipt Number, Discount Amount, Cost, Profit Amount, Quantity, Buy / Not Buy

GN_Customer (dimension): Customer ID (PK), Gender, DOB
GN_Time (dimension): Time_key (PK), Time, Date, Week Number, Week_Begin_Date
GN_Product (dimension): Product Code (PK), Manufacturer Code, Brand Code, Cat2, Cat4, Cat6
GN_Store (dimension): Store Code (PK), Location

Step 3: Define the columns
To define the columns of the tables:
1. Right-click the table.
2. Click Add Column.
3. Fill in the name, type, size, and whether the column is mandatory.
4. Click Finish.


Step 4: Define the primary keys
To define the primary keys of the tables:
1. Right-click the table.
2. Click Add Primary Key.
3. Choose the column(s) that make up the PK.
4. Fill in the name and mark the PK as not updatable.
5. Click Finish.

Step 5: Define the foreign keys
To define the foreign keys of the tables:
1. Right-click the fact table.
2. Click Add Foreign Key.
3. Choose the table the FK points to.
4. Fill in the name and mark the FK as mandatory.
5. Choose Join to the PK.
6. Choose Select columns and then pick the appropriate column.
7. Click Finish.
8. Save and close the diagram.

For reference, a sketch of the SQL DDL corresponding to this star schema appears below.
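This is a minimal sketch only. It assumes the GN_ table names and the columns shown in the diagram (replace GN with your group prefix); the data types, the renamed columns purchase_time and purchase_date (DATE is a reserved word in Oracle), and the choice of primary key for the fact table are assumptions made here, not the exact scripts that Designer generates.

CREATE TABLE gn_customer (
    customer_id  VARCHAR2(20)  NOT NULL,            -- Customer ID (Text in the source file)
    gender       NUMBER(1),                         -- 0 = group, 1 = male, 2 = female
    dob          DATE,
    CONSTRAINT gn_customer_pk PRIMARY KEY (customer_id)
);

CREATE TABLE gn_time (
    time_key         NUMBER  NOT NULL,
    purchase_time    DATE,                          -- "Time" in the diagram
    purchase_date    DATE,                          -- "Date" in the diagram
    week_number      NUMBER,
    week_begin_date  DATE,
    CONSTRAINT gn_time_pk PRIMARY KEY (time_key)
);

CREATE TABLE gn_product (
    product_code       NUMBER  NOT NULL,
    manufacturer_code  NUMBER,
    brand_code         NUMBER,
    cat2               NUMBER,
    cat4               NUMBER,
    cat6               NUMBER,
    CONSTRAINT gn_product_pk PRIMARY KEY (product_code)
);

CREATE TABLE gn_store (
    store_code  NUMBER  NOT NULL,
    location    NUMBER,
    CONSTRAINT gn_store_pk PRIMARY KEY (store_code)
);

CREATE TABLE gn_fact_sales (
    customer_id      VARCHAR2(20)  NOT NULL,
    time_key         NUMBER        NOT NULL,
    product_code     NUMBER        NOT NULL,
    store_code       NUMBER        NOT NULL,
    receipt_number   NUMBER        NOT NULL,
    discount_amount  NUMBER(12,2),
    cost             NUMBER(12,2),
    profit_amount    NUMBER(12,2),
    quantity         NUMBER,
    buy_not          NUMBER(1),                     -- 0 = not buy, 1 = buy
    -- one possible grain: the dimension keys plus the receipt number identify a row
    CONSTRAINT gn_fact_sales_pk PRIMARY KEY
        (customer_id, time_key, product_code, store_code, receipt_number),
    CONSTRAINT gn_fact_sales_cust_fk  FOREIGN KEY (customer_id)  REFERENCES gn_customer (customer_id),
    CONSTRAINT gn_fact_sales_time_fk  FOREIGN KEY (time_key)     REFERENCES gn_time (time_key),
    CONSTRAINT gn_fact_sales_prod_fk  FOREIGN KEY (product_code) REFERENCES gn_product (product_code),
    CONSTRAINT gn_fact_sales_store_fk FOREIGN KEY (store_code)   REFERENCES gn_store (store_code)
);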

You have created your first star schema for a drugstore chain. Congratulations!



Multidimensional Modeling for Data Warehouse - Exercise 3: Generating a Star Schema with Designer
Let's continue with the hands-on part. Generate the scripts, but do not run them against the database. Examine the different types of files generated.



Multidimensional Modeling for Data Warehouse - Exercise 4: Querying a Star Schema with SQL*Plus
Continuing with the hands-on part, now write the following queries in SQL*Plus.
1) Write a query that shows the customer, the customer's gender, the dates of their purchases, and the profits.
2) Retrieve customer, gender, total_profit, and cost, summarizing total_profit and cost per customer.
3) Count the number of distinct visits made by each customer in 1998.
4) Create a cross-tab query showing the profit of each male customer for each 2-digit category code.
Sketches of possible solutions appear after this list.
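These are minimal sketches of what such queries might look like against the GN_ star schema from Exercise 2. They assume the table and column names used in the DDL sketch of that exercise (replace GN with your group prefix) and use the comma-style join syntax accepted by older versions of SQL*Plus; treat them as starting points rather than the only correct answers.

-- 1) Customer, gender, purchase dates, and profits
SELECT c.customer_id, c.gender, t.purchase_date, f.profit_amount
  FROM gn_fact_sales f, gn_customer c, gn_time t
 WHERE f.customer_id = c.customer_id
   AND f.time_key    = t.time_key;

-- 2) Total profit and total cost summarized per customer
SELECT c.customer_id, c.gender,
       SUM(f.profit_amount) AS total_profit,
       SUM(f.cost)          AS total_cost
  FROM gn_fact_sales f, gn_customer c
 WHERE f.customer_id = c.customer_id
 GROUP BY c.customer_id, c.gender;

-- 3) Distinct visits per customer in 1998
--    (a "visit" is taken here to be a distinct receipt number)
SELECT f.customer_id, COUNT(DISTINCT f.receipt_number) AS visits_1998
  FROM gn_fact_sales f, gn_time t
 WHERE f.time_key = t.time_key
   AND t.purchase_date >= TO_DATE('1998-01-01', 'YYYY-MM-DD')
   AND t.purchase_date <  TO_DATE('1999-01-01', 'YYYY-MM-DD')
 GROUP BY f.customer_id;

-- 4) Profit of each male customer by 2-digit category code
--    (a grouped result; a true cross-tab would pivot cat2 into columns, for example with DECODE)
SELECT c.customer_id, p.cat2, SUM(f.profit_amount) AS profit
  FROM gn_fact_sales f, gn_customer c, gn_product p
 WHERE f.customer_id  = c.customer_id
   AND f.product_code = p.product_code
   AND c.gender = 1                                 -- 1 = male, per the source metadata
 GROUP BY c.customer_id, p.cat2;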

