
DATA WAREHOUSING AND DATA MINING
BHOJ REDDY ENGINEERING COLLEGE FOR WOMEN

ABSTRACT

DATA WAREHOUSING DEFINITION: Data warehousing/business intelligence is a general term for a system used in an organization to collect data, most of it transactional data such as purchase records, from one or more data sources, such as the database of a transactional system, into a central data location, the data warehouse, and later report those data, generally in an aggregated way, to business users in the organization. This system generally consists of an ETL tool, a database, a reporting tool and other facilitating tools, such as a data modeling tool.

A data warehouse (DW) is a database used for reporting. The data is offloaded from the operational systems for reporting. The data may pass through an operational data store for additional operations before it is used in the DW for reporting. A data warehouse maintains its functions in three layers: staging, integration, and access. Staging is used to store raw data for use by developers (analysis and support). The integration layer is used to integrate data and to provide a level of abstraction from users. The access layer is for getting data out for users.

HISTORY: The concept of data warehousing dates back to the late 1980s,[1] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments; in larger corporations it was typical for multiple decision support environments to operate independently.

ARCHITECTURE:

Operational database layer - The source data for the data warehouse. An organization's enterprise resource planning systems fall into this layer.
Data access layer - The interface between the operational and informational access layers. Tools to extract, transform and load data into the warehouse fall into this layer.
Metadata layer - The data dictionary. This is usually more detailed than an operational system data dictionary. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
Informational access layer - The data accessed for reporting and analysis, and the tools for reporting and analyzing data. This is also called the data mart. Business intelligence tools fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in this article, have to do with this layer.
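As a rough end-to-end illustration of this stack and of the staging, integration and access flow described in the abstract, here is a minimal Python sketch. The table names, sample rows and the region-cleaning rule are invented for the example:

    import sqlite3

    # Minimal ETL sketch: extract rows from an operational source,
    # conform them in an integration step, and load an aggregate that
    # the access layer can report against. The schema is illustrative only.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Operational database layer: a transaction-oriented source table.
    cur.execute("CREATE TABLE orders_src (order_id INTEGER, amount REAL, region TEXT)")
    cur.executemany("INSERT INTO orders_src VALUES (?, ?, ?)",
                    [(1, 120.0, "north"), (2, 80.0, "NORTH"), (3, 45.5, "south")])

    # Staging: pull the raw rows across unchanged.
    rows = cur.execute("SELECT order_id, amount, region FROM orders_src").fetchall()

    # Integration: conform inconsistent codes (here, region casing).
    clean = [(oid, amt, region.lower()) for oid, amt, region in rows]

    # Access: an aggregated table that reporting tools query.
    totals = {}
    for _, amt, region in clean:
        totals[region] = totals.get(region, 0.0) + amt
    cur.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")
    cur.executemany("INSERT INTO sales_by_region VALUES (?, ?)", list(totals.items()))

    print(cur.execute("SELECT region, total FROM sales_by_region").fetchall())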

Top-down versus bottom-up design methodologies

Bottom-up design

In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. It is important to note that in the Kimball methodology, the bottom-up process is the result of an initial business-oriented top-down analysis of the relevant business processes to be modelled. Data marts contain, primarily, dimensions and facts. Facts can contain either atomic data or, if necessary, summarized data. A single data mart often models a specific business area such as "Sales" or "Production". These data marts can eventually be integrated to create a comprehensive data warehouse. The integration of data marts is managed through the implementation of what Kimball calls "a data warehouse bus architecture".

Top-down design

Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise.[5] Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.

In the Inmon vision, the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities. Inmon states that the data warehouse is:

Subject-oriented - The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Non-volatile - Data in the data warehouse are never over-written or deleted; once committed, the data are static, read-only, and retained for future reporting.
Integrated - The data warehouse contains data from most or all of an organization's operational systems, and these data are made consistent.
Time-variant - Changes to the data in the data warehouse are tracked and recorded, so that reports can show how the data have changed over time.

The top-down design methodology generates highly consistent dimensional views of data across data marts, since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage of the top-down methodology is that it represents a very large project with a very broad scope. A sketch of the dimensional structure used by data marts follows.
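As a concrete illustration of the dimensional structure that such data marts use, here is a minimal star-schema sketch in Python; the tables and figures are invented, and a real mart would live in a database rather than in dictionaries:

    # Sketch of a star schema: one fact table whose rows reference
    # dimension tables by surrogate key. All data here is invented.
    dim_product = {1: {"name": "Laptop", "category": "Electronics"},
                   2: {"name": "Desk", "category": "Furniture"}}

    dim_date = {20240101: {"year": 2024, "month": 1},
                20240102: {"year": 2024, "month": 1}}

    # Fact table: one row per sale at atomic grain, measures plus foreign keys.
    fact_sales = [
        {"date_key": 20240101, "product_key": 1, "units": 2, "revenue": 1800.0},
        {"date_key": 20240102, "product_key": 2, "units": 1, "revenue": 300.0},
        {"date_key": 20240102, "product_key": 1, "units": 1, "revenue": 900.0},
    ]

    # A typical dimensional query: revenue by product category.
    revenue_by_category = {}
    for row in fact_sales:
        category = dim_product[row["product_key"]]["category"]
        revenue_by_category[category] = revenue_by_category.get(category, 0.0) + row["revenue"]

    print(revenue_by_category)  # {'Electronics': 2700.0, 'Furniture': 300.0}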

Data warehouses versus operational systems

Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through the use of database normalization and an entity-relationship model. Operational system designers generally follow the Codd rules of database normalization in order to ensure data integrity. Codd defined five increasingly stringent rules of normalization. Fully normalized database designs (that is, those satisfying all five Codd rules) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.

Data warehouses are optimized for speed of data analysis. Frequently, data in data warehouses are denormalised via a dimension-based model. Also, to speed data retrieval, data warehouse data are often stored multiple times, in their most granular form and in summarized forms called aggregates. Data warehouse data are gathered from the operational systems and held in the data warehouse even after the data have been purged from the operational systems.
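A minimal sketch of that contrast, with invented rows: the operational side keeps each fact in exactly one normalized table, while the warehouse side copies attributes onto one wide, denormalised row so analysis needs no joins:

    # Sketch: the same order stored normalized (operational) versus
    # denormalized (warehouse). All rows are invented for illustration.

    # Normalized operational tables: each fact lives in exactly one place.
    customers = {101: {"name": "Asha", "city": "Hyderabad"}}
    products = {7: {"name": "Laptop", "price": 900.0}}
    orders = [{"order_id": 1, "customer_id": 101, "product_id": 7, "qty": 2}]

    # Denormalized warehouse rows: customer and product attributes are
    # copied onto each row so analytical queries need no joins.
    warehouse_rows = []
    for o in orders:
        c, p = customers[o["customer_id"]], products[o["product_id"]]
        warehouse_rows.append({
            "order_id": o["order_id"],
            "customer_name": c["name"],
            "city": c["city"],
            "product_name": p["name"],
            "revenue": p["price"] * o["qty"],
        })

    print(warehouse_rows[0])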

Evolution in organization use

These terms refer to the level of sophistication of a data warehouse:

Offline operational data warehouse - Data warehouses in this initial stage are developed by simply copying the data of an operational system to another server, where the processing load of reporting against the copied data does not impact the operational system's performance.
Offline data warehouse - Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data are stored in a data structure designed to facilitate reporting.
Real-time data warehouse - Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order, a delivery or a booking).
Integrated data warehouse - These data warehouses assemble data from different areas of business, so users can look up the information they need across other systems.

Benefits

Some of the benefits that a data warehouse provides are as follows:[7][8]

A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
Prior to loading data into the data warehouse, inconsistencies are identified and resolved; this greatly simplifies reporting and analysis (see the conforming sketch after this list).
Information in the data warehouse is under the control of data warehouse users, so that even if the source system data are purged over time, the information in the warehouse can be stored safely for extended periods of time.
Data warehouses can work in conjunction with, and hence enhance the value of, operational business applications, notably customer relationship management (CRM) systems.
Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
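Here is a minimal sketch of that conforming step; the two source formats and the mapping rules are hypothetical:

    # Sketch: conforming inconsistent source records before loading.
    # Source formats and mapping rules are hypothetical.
    from datetime import datetime

    crm_rows = [{"cust": "ACME Corp.", "signup": "2021-03-05"}]
    billing_rows = [{"customer_name": "acme corp", "since": "05/03/2021"}]

    def conform_crm(row):
        # Normalize the name and parse the ISO-style date.
        return {"customer": row["cust"].rstrip(".").lower(),
                "signup_date": datetime.strptime(row["signup"], "%Y-%m-%d").date()}

    def conform_billing(row):
        # Same target representation, different source conventions.
        return {"customer": row["customer_name"].lower(),
                "signup_date": datetime.strptime(row["since"], "%d/%m/%Y").date()}

    loaded = [conform_crm(r) for r in crm_rows] + [conform_billing(r) for r in billing_rows]
    print(loaded)  # both sources now share one customer representation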

Disadvantages

There are also disadvantages to using a data warehouse. Some of them are:

Data warehouses are not the optimal environment for unstructured data.

Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
Over their life, data warehouses can have high costs.
Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems.

Sample applications

Some of the applications data warehousing can be used for are:

Decision support
Trend analysis
Financial forecasting
Churn prediction for telecom subscribers, credit card users, etc.
Insurance fraud analysis
Call record analysis
Logistics and inventory management
Agriculture [9]

Future

Data warehousing, like any technology, has a history of innovations that did not receive market acceptance.[10] A 2009 Gartner Group paper predicted the following developments in the business intelligence/data warehousing market:[11]

Because of a lack of information, processes, and tools, through 2012 more than 35 percent of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.
By 2012, business units will control at least 40 percent of the total budget for business intelligence.

DATA MINING
Data mining, a branch of computer science[1] and artificial intelligence,[2] is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence, giving an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques to sample portions of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). These techniques can, however, be used in the creation of new hypotheses to test against the larger data populations.
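The data-snooping hazard can be simulated in a few lines of Python. In this purely illustrative sketch, testing two hundred random "features" against twenty rows of noise reliably turns up a correlation that looks strong but is meaningless:

    # Sketch: data dredging on pure noise. With many hypotheses tested
    # on a small sample, some correlations look "strong" by chance alone.
    import random

    random.seed(0)
    n_rows, n_features = 20, 200  # small sample, many candidate patterns

    target = [random.gauss(0, 1) for _ in range(n_rows)]

    def correlation(xs, ys):
        # Pearson correlation coefficient, computed directly.
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy)

    best = max(
        (correlation([random.gauss(0, 1) for _ in range(n_rows)], target)
         for _ in range(n_features)),
        key=abs,
    )
    # The "best" feature correlates strongly with noise it cannot explain.
    print(f"strongest correlation found in noise: {best:.2f}")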

Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection, storage and manipulation. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[3]

A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub)sets of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain.

Research and evolution

In addition to industry-driven demand for standards and interoperability, professional and academic activity has also made considerable contributions to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarises the results of a literature survey which traces and analyzes this evolution.[8] The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings,[9] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[10] Other computer science conferences on data mining include:

DMIN - International Conference on Data Mining;[11]
DMKD - Research Issues on Data Mining and Knowledge Discovery;
ECML-PKDD - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases;
ICDM - IEEE International Conference on Data Mining;[12]
MLDM - Machine Learning and Data Mining in Pattern Recognition;
SDM - SIAM International Conference on Data Mining;
EDM - International Conference on Educational Data Mining;
ECDM - European Conference on Data Mining;
PAKDD - the annual Pacific-Asia Conference on Knowledge Discovery and Data Mining.

Process

Data mining commonly involves four classes of tasks:[13]

Clustering - the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
Classification - the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines (a classification sketch follows this list).
Regression - attempts to find a function which models the data with the least error.
Association rule learning - searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
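To make the classification task concrete, here is a tiny, self-contained naive Bayes spam classifier; the training messages and the Laplace smoothing are invented for the example, and a real filter would be trained on a large corpus:

    # Sketch: naive Bayesian classification of email as spam or legitimate.
    # Training data is invented; real systems train on large corpora.
    from collections import Counter
    import math

    train = [
        ("win money now", "spam"),
        ("cheap money offer", "spam"),
        ("meeting agenda attached", "ham"),
        ("lunch meeting tomorrow", "ham"),
    ]

    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in train:
        label_counts[label] += 1
        word_counts[label].update(text.split())

    vocab = {w for c in word_counts.values() for w in c}

    def classify(text):
        scores = {}
        for label in word_counts:
            # log P(label) + sum of log P(word | label), Laplace-smoothed.
            score = math.log(label_counts[label] / len(train))
            total = sum(word_counts[label].values())
            for word in text.split():
                score += math.log(
                    (word_counts[label][word] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    print(classify("cheap money"))         # spam
    print(classify("agenda for meeting"))  # ham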

Notable uses

Games

Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3 chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened up. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to fully have the required high level of abstraction to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

Business

Data mining can automatically discover the segments or groups within a customer data set. Businesses employing data mining may see a return on investment, but they also recognise that the number of predictive models can quickly become very large.

Rather than one model to predict how many customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may want to send offers only to loyal customers. Finally, it may also want to determine which customers are going to be profitable over a window of time, and send the offers only to those that are likely to be profitable. In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining; a sketch of such a model registry follows.
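A minimal sketch of such a registry, assuming a simple (region, customer type) segmentation; the "model" here is a stand-in value, not a real learner or any particular product's API:

    # Sketch: managing many per-segment predictive models with versions.
    # The "model" is a stand-in (just a stored churn rate), not a real learner.
    from datetime import date

    class ModelRegistry:
        def __init__(self):
            self.models = {}  # (region, customer_type) -> list of versions

        def publish(self, region, customer_type, model):
            # Append a new version rather than overwriting the old one.
            versions = self.models.setdefault((region, customer_type), [])
            versions.append({"version": len(versions) + 1,
                             "trained_on": date.today(),
                             "model": model})

        def latest(self, region, customer_type):
            return self.models[(region, customer_type)][-1]

    registry = ModelRegistry()
    registry.publish("south", "prepaid", {"churn_rate": 0.18})
    registry.publish("south", "prepaid", {"churn_rate": 0.21})  # retrained

    print(registry.latest("south", "prepaid"))  # version 2 wins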

Science and engineering

In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering. In the area of study on human genetics, an important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, it is to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important in helping to improve the diagnosis, prevention and treatment of these diseases. The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.

Challenges

There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han [27] offer the following list of emerging research topics in the field:

Developing and supporting geographic data warehouses - Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues in spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy and position.
Better spatio-temporal representations in geographic knowledge discovery - Current geographic knowledge discovery (GKD) techniques generally use very simple representations of geographic objects and spatial relationships. Geographic data mining techniques should recognise more complex geographic objects (lines and polygons) and relationships (non-Euclidean distances, direction, connectivity and interaction through attributed geographic space such as terrain); see the distance sketch after this list. Time needs to be more fully integrated into these geographic representations and relationships.
Geographic knowledge discovery using diverse data types - GKD techniques should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).
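As a small illustration of the non-Euclidean distances mentioned above, the haversine formula gives great-circle distance over the Earth's surface; the coordinates below are illustrative and the Earth radius is the usual approximation:

    # Sketch: great-circle (haversine) distance, a non-Euclidean distance
    # of the kind geographic data mining must handle. Radius is approximate.
    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        r = 6371.0  # mean Earth radius in km
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    # Hyderabad to Delhi: far longer than a flat-plane reading of the
    # raw coordinate differences would suggest.
    print(round(haversine_km(17.385, 78.4867, 28.6139, 77.209)))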

Applications

Data mining has been applied in many areas, including:

Data Mining in Agriculture
Surveillance / Mass surveillance
National Security Agency
Quantitative structure-activity relationship
Customer analytics
Police-enforced ANPR in the UK
Stellar Wind (code name)
Educational Data Mining

Advantages and disadvantages

ADVANTAGES OF DATA MINING

Marketing/Retailing:

Data mining can aid direct marketers by providing them with useful and accurate trends about their customers' purchasing behavior. Based on these trends, marketers can direct their marketing attention to their customers with more precision. For example, marketers of a software company may advertise their new software to consumers who have a lot of software purchasing history.
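One hedged sketch of how such purchasing trends can be surfaced is a market-basket co-occurrence count (the baskets below are invented):

    # Sketch: counting which product pairs are bought together
    # (market-basket style trend mining on invented baskets).
    from collections import Counter
    from itertools import combinations

    baskets = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"milk", "coffee"},
        {"bread", "butter", "coffee"},
    ]

    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # The most frequent pair suggests a trend marketers can act on.
    print(pair_counts.most_common(1))  # [(('bread', 'butter'), 3)]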

Banking/Crediting:

Data mining can assist financial institutions in areas such as credit reporting and loan information. For example, by examining previous customers with similar attributes, a bank can estimate the level of risk associated with each given loan.
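A toy nearest-neighbour version of that idea, on invented applicant attributes and outcomes: the estimated risk for a new applicant is the default rate among the most similar previous customers:

    # Sketch: estimating loan risk from previous customers with similar
    # attributes (k-nearest-neighbour on invented data).

    # (income in thousands, existing debts) and whether the loan defaulted.
    history = [
        ((25, 3), True), ((30, 2), True), ((60, 1), False),
        ((75, 0), False), ((40, 2), True), ((90, 1), False),
    ]

    def distance(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def estimated_default_risk(applicant, k=3):
        # Share of defaulters among the k most similar past customers.
        nearest = sorted(history, key=lambda rec: distance(rec[0], applicant))[:k]
        return sum(defaulted for _, defaulted in nearest) / k

    print(estimated_default_risk((28, 2)))  # high risk: similar to defaulters
    print(estimated_default_risk((80, 0)))  # low risk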

DISADVANTAGES OF DATA MINING

Security issues:

Although companies have a lot of personal information about us available online, they do not have sufficient security systems in place to protect that information. For example, the Ford Motor Credit Company recently had to inform 13,000 consumers that their personal information, including Social Security numbers, addresses, account numbers and payment histories, had been accessed by hackers who broke into a database belonging to the Experian credit reporting agency.


References

^ "Data Warehouse". http://www.tech-faq.com/datawarehouse.html.
^ Yang, Jun. WareHouse Information Prototype at Stanford (WHIPS). Stanford University. July 7, 1998.
^ Caldeira, C. (2008). Data Warehousing - Conceitos e Modelos. Edições Sílabo. ISBN 978-972-618-479-9.
^ Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases". http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf. Retrieved 2008-12-17.
^ Monk, Ellen; Wagner, Bret (2006). Concepts in Enterprise Resource Planning, Second Edition. Thomson Course Technology, Boston, MA. ISBN 0-619-21663-8. OCLC 224465825.

