
DATABASE MANAGEMENT SYSTEMS
Prepared by II CSE (2010-2014), CBIT, Proddatur, R09 batch (Guidance by J UMA MAHESH, CBIT CSE Dept.)

DATABASE MANAGEMENT SYSTEMS

UNIT-1
Database systems: Data vs information - introducing the database and the DBMS - why database design is important - files and file systems - problems with file system data management - database systems.
Data models: Data modeling and data models - the importance of data models - data model basic building blocks - business rules - the evolution of data models - degrees of data abstraction.

UNIT-2
Entity relationship modelling: The entity relationship model (ERM) - developing an ER diagram - database design challenges: conflicting goals - the extended entity relationship model - entity clustering - entity integrity: selecting primary keys - learning flexible database design - data modeling checklist.

UNIT-3
Relational database model: A logical view of data - keys - integrity rules - relational set operators - the data dictionary and the system catalog - relationships within the relational database - data redundancy revisited - indexes - Codd's relational database rules.

UNIT-4
Structured query language (SQL): Introduction to SQL - data definition commands - data manipulation commands - SELECT queries - advanced data definition commands - advanced SELECT queries - virtual tables: creating a view - joining database tables.


Advanced SQL: Relational set operators - SQL join operators - subqueries and correlated queries - SQL functions - Oracle sequences - updatable views - procedural SQL - embedded SQL.

UNIT-5
Normalization of database tables: Database tables and normalization - the need for normalization - the normalization process - improving the design - surrogate key considerations - higher-level normal forms - normalization and database design - denormalization.

UNIT-6
Transaction management and concurrency control: What is a transaction? - transaction state - implementation of atomicity and durability - concurrency control - serializability - testing for serializability - concurrency control with locking methods - concurrency control with timestamping methods - concurrency control with optimistic methods - database recovery management - validation-based protocols - multiple granularity.

UNIT-7
Recovery system: Recovery and atomicity - log-based recovery - recovery with concurrent transactions - buffer management - failure with loss of non-volatile storage - advanced recovery techniques - remote backup systems.

UNIT-8
File structure and indexing: Overview of physical storage media - magnetic disks - RAID - tertiary storage - storage access - file organization - organization of records in files - data dictionary storage - basic concepts of indexing - ordered indices - B+-tree index files - B-tree index files - multiple key access - static hashing - dynamic hashing - comparison of ordered indexing and hashing - bitmap indices - indexed sequential access method (ISAM).


DATABASE MANAGEMENT SYSTEM

UNIT 1

DATABASE SYSTEMS:
o DATA VS INFORMATION
o INTRODUCING THE DATABASE AND THE DBMS
o WHY DATABASE DESIGN IS IMPORTANT
o FILES AND FILE SYSTEMS
o PROBLEMS WITH FILE SYSTEM DATA MANAGEMENT
o DATABASE SYSTEMS

DATA MODELS:
o DATA MODELING AND DATA MODELS
o THE IMPORTANCE OF DATA MODELS
o DATA MODEL BASIC BUILDING BLOCKS
o BUSINESS RULES
o THE EVOLUTION OF DATA MODELS
o DEGREES OF DATA ABSTRACTION

DATA VS INFORMATION

INTRODUCTION:

DATABASE:


A database is managed by a DBMS; DBMS means Database Management System. The father of the relational database (RDBMS) is Edgar F. Codd. The database software (the DBMS) is a program used to store a huge amount of data and to retrieve data from the database.

The primary goal of a DBMS is to store data and to retrieve (extract) it from the database. Examples of DBMS products are as follows:
Oracle
DB2
Ingres
PostgreSQL
(SQL, the Structured Query Language, and PL/SQL, Oracle's procedural extension of SQL, are the languages used to work with such systems; data warehouses are also built on databases like these.)

The hierarchy of every database is as follows:

DATA -> INFORMATION -> KNOWLEDGE

Data, when processed, gives information, and information gives knowledge to the user.


DATA: Data is the collection of raw facts/materials. Raw facts are defined as a collection of characters, numbers, special symbols, special operators, pictures, measurements, etc. Example: 21212.

INFORMATION: Information is processed data. By processing data we can understand it, and the processed (organized, structured) data is called information. The user can understand information because it gives knowledge. Example: 21-2-12; by this we can say it is a date.

Difference between Data and Information:

DATA: Data is the collection of raw facts: characters, numbers, special symbols, special operators, pictures, measurements, etc. Data alone does not give any meaning; it is unstructured (or unorganized) data. Example: consider 21212.

INFORMATION: Information is the processed (organized, structured) data. Information always gives knowledge to the user, so the user can understand it. Example: consider 21-2-12; it is a date, and it is information, so it is understood by the user.


INTRODUCING THE DATABASE AND THE DBMS

DATABASE: A database is a collection of interrelated data. By data, we mean known facts. For example, consider the names, telephone numbers, and addresses of people.

A database consists of a huge amount of data to store; its primary goal is to store and retrieve that data. Some database software products are Oracle, DB2, Ingres and PostgreSQL, worked with through SQL and PL/SQL.

DATABASE MANAGEMENT SYSTEMS (DBMS): A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that


facilitates the processes of defining, constructing, and manipulating databases for various applications.

Defining a database involves specifying the data types, structures, and constraints for the data to be stored in the database.

Constructing the database is the process of storing the data itself on some storage medium that is controlled by the DBMS.

Manipulating a database includes such functions as querying the database to retrieve specific data, updating the database to reflect changes in the mini-world, and generating reports from the data. (A small SQL sketch of these three processes appears at the end of this section.)

WHY DATABASE DESIGN IS IMPORTANT

Database design is the plan of the database structure that will be used to store and manage data. A database management system is the software used to implement a database design. Modern database and application development software is so easy to use that many people can quickly learn to implement a simple database and develop simple applications.

Note: As data requirements become more complex, those without design skills will produce the required data incorrectly by simply adding columns or tables to the database. So we follow rules for a good database. The following are the basic reasons for doing database design:
Good application programs cannot overcome bad database design.
The end user and the designer decide what data will be stored in the database.
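Returning to defining, constructing and manipulating: a minimal SQL sketch of the three processes, using a hypothetical student table (the same example table used later in these notes):

-- Defining: specify data types, structures and constraints
create table student (
  roll_no number(5) primary key,
  name    varchar2(30) not null,
  city    varchar2(20)
);

-- Constructing: store the data itself
insert into student values (1, 'RAVI', 'PRODDATUR');

-- Manipulating: query and update the stored data
select name, city from student where city = 'PRODDATUR';
update student set city = 'KADAPA' where roll_no = 1;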

FILE SYSTEM VERSUS DATABASE

BASIC FILE TERMINOLOGY:

DATA: Data is the collection of raw facts/materials. Raw facts are defined as a collection of characters, numbers, special symbols, special operators, pictures, measurements, etc.


Example: 21212

FIELD: A character (or) group of characters that has a special meaning. A field is used to define and store data.

RECORD: A logically connected set of one (or) more fields that describes a person, place (or) thing.

FILE: A collection of related records.

PROBLEMS WITH FILE SYSTEM

The major problems with a file system are:
1. DATA REDUNDANCY
2. DATA ISOLATION
3. DATA INCONSISTENCY
4. DIFFICULTY IN ACCESSING DATA
5. CONCURRENT ACCESS & ANOMALIES
6. ATOMICITY PROBLEM
7. DATA INTEGRITY PROBLEM
8. SECURITY PROBLEM

1. Data redundancy: Data redundancy is duplication; that is, the same copy of data is present in several files. Data redundancy causes problems in a file system:
1. wasted storage space
2. wastage of time
BY USING DBMS: A DBMS contains unique files; there is no redundancy.


2. Data isolation: The data is stored in various file formats (.exe, .doc, etc.), so it is difficult to write programs that access the information present in the files.
BY USING DBMS: With a DBMS it is easy to write programs and to retrieve the information.

3. Data inconsistency: A modification done in one file affects other files and does not give an accurate result; this is called data inconsistency.
BY USING DBMS: A DBMS gives accurate results.

4. Difficulty in accessing data: In a file system it is difficult to access the data continuously and effectively.
BY USING DBMS: A DBMS gives the result effectively.

5. Concurrent access & anomalies: A file system does not allow more than one transaction at a time.
BY USING DBMS: A DBMS allows concurrent transactions at a time.


6. Atomicity problems: Recovery of transactions occurs in a DBMS.
Example: Consider two transactions, T1(A) and T2(B), and suppose we are transferring funds from T1(A) to T2(B). While transferring the funds, if the power fails or the system crashes, there is a chance of data inconsistency. To overcome this problem we use a DBMS: it restarts from the failure stage, whereas in a file system the transaction must restart from the initial state.

7. Data integrity problems: A modification done in one file affects another file and causes problems.
BY USING DBMS: With a DBMS, integrity constraints are checked before an update.

8. Security problems: In a file system there is no security.
BY USING DBMS: Security can be provided through the Database Administrator (DBA).

EVOLUTION OF DATA MODELS (Data modeling and data models)

A data model is a collection of conceptual tools that can be used to describe the structure of the database. Data models include a set of basic operations for specifying retrievals and updates on the database. The basic building blocks of all data models are entities, attributes, relationships and constraints.

ENTITY: An entity represents a particular type of object in the real world. An entity may be a physical object, such as a customer.

ATTRIBUTE: An attribute is a characteristic of an entity; (or) each entity has associated with it a set of attributes describing it.

RELATIONSHIPS: A relationship is an association among several entities. Data models use three types of relationships: one-to-one, one-to-many and many-to-many.


CONSTRAINTS: A constraint is a restriction placed on the data. Constraints are important because they help to ensure data integrity. Constraints are normally expressed in the form of rules.

The data models can be classified into three types. They are:

1. OBJECT-BASED LOGICAL DATA MODEL
2. RECORD-BASED LOGICAL DATA MODEL
3. PHYSICAL DATA MODEL

Fig: EVOLUTION OF DATA MODELS
- OBDM (object-based data model), which includes the ERDM (entity relationship data model) and the OODM (object-oriented data model)
- RBDM (record-based data model), which includes the HDM (hierarchical data model) and the NDM (network data model)
- PDM (physical data model)

1. OBJECT-BASED DATA MODEL (OBDM): It is the first step in the data models. It describes data at the conceptual and view levels, provides fairly flexible structuring capabilities, and allows one to specify data constraints explicitly. It includes the E-R model and the object-oriented data model.

E-R MODEL: An E-R model is also called the entity relationship model. It is based on a perception of the world as consisting of a collection of basic objects (entities) and relationships among these objects.

Special symbols are used to draw an E-R diagram. An E-R diagram shows the relationships among entities and the relationships between entities and attributes:

RECTANGLE: represents entity sets.
ELLIPSE: represents attributes.
DIAMOND: represents relationships among entities.
LINES: link attributes to entities and entity sets to relationships.

OBJECT-ORIENTED DATA MODEL:
1. An object-oriented model is based on a collection of objects, like the E-R model.
2. An object contains values stored in instance variables within the object.
3. Unlike the record-oriented models, these values are themselves objects.


4. An object also contains bodies of code that operate on the object. The internal parts of the object, the instance variables and the code, are not visible externally.

2. RECORD-BASED LOGICAL DATA MODEL:
1. It also describes data at the view level and conceptual level.
2. Unlike object-oriented models, record-based models are used to specify the overall logical structure of the database and to provide a higher-level description of the implementation.
3. The database is structured in fixed-format records of several types.
4. Each record type defines a fixed number of fields, or attributes.
5. Each field is usually of a fixed length.
6. Separate languages associated with the model are used to express database queries and updates.
7. The three most widely accepted models are the relational, network and hierarchical models.

Hierarchical model:
1. This model uses a tree structure to represent relationships among records.
2. A hierarchical database consists of a collection of records; each record consists of a collection of fields, each of which contains one data value.
3. Each hierarchical tree can have only one root record, and this record does not have a parent record.
4. The root can have any number of child records; a child record can have only one parent record.
5. If any modification is done in the parent record, it applies to all child records also.


6. It is conceptually simple and easy to design, but its implementation is difficult.
7. It is very efficient for one-to-many relationships.
8. It cannot handle many-to-many relationships.

Network data model:
1. The network model replaces the hierarchical tree with a graph.
2. In this model records are allowed to have more than one parent.
3. So many-to-many relationships are possible.
4. Here data access is flexible.
5. Updates done to a parent record do not affect child records.
6. It presents a very complex structure to the application programmer.
7. It is difficult to make changes in such a database.


Fig: (network between different vendors)

Relational data model: The relational model was introduced by Dr. E. F. Codd. This model allows data to be represented in a simple row-and-column format: each data field is considered a column and each record is considered a row of a table. The relational model eliminates all parent-child relationships. Databases based on the relational model have built-in support for query languages. (See the SQL sketch below.)

ADVANTAGES:
Flexibility
Ease of use
Security
Data independence
Data manipulation language

3. Physical data model: Physical data models are used to describe data at the lowest level. A physical model is a representation of the data in the database; it applies the constraints of a given database management system. Eg: unifying model, frame memory.
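A brief sketch of the row-and-column idea with hypothetical DEPT and EMP tables: in the relational model the link between records is carried by matching values in a common column, not by parent-child pointers.

create table dept (
  dept_id   number(3) primary key,
  dept_name varchar2(20)
);

create table emp (
  emp_id   number(5) primary key,
  emp_name varchar2(30),
  dept_id  number(3) references dept(dept_id)  -- common column linking the rows
);

-- The relationship is recovered declaratively by matching values:
select e.emp_name, d.dept_name
from emp e, dept d
where e.dept_id = d.dept_id;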

The Importance of Data Models

A data model is:
1. A relatively simple representation, usually graphical, of complex real-world data structures
2. A communications tool to facilitate interaction among the designer, the applications programmer, and the end user


3. Good database design uses an appropriate data model as its foundation
4. End users have different views of and needs for data
5. A data model organizes data for the various users

Business Rules of Data Models:

The relational data model was given by E. F. Codd. Data models and their properties are used as business rules for enterprises. The business rules are as follows:
1. Data should be brief.
2. It should be precise (accurate data).
3. It should be unambiguous (without errors).
4. Data should follow some policies, i.e. rules set when designing the database within the organization.
5. Business rules are applied to an organization; the organization uses the data and gets the result.
6. Data should be nothing but processed facts.
7. Rules must be put in writing.
8. Rules must be kept up to date.
9. The data should be easy to understand.
10. Each and every property or characteristic of the data should be explained clearly.

Basic building blocks:

The basic building blocks of all data models are:
Entities
Attributes
Relationships
Constraints

Entity: An entity is a particular type of object in the real world. An entity is capable of independent existence and can be uniquely identified. An entity is anything (a person, a place, a thing, or an event) about which data are to be collected and stored.


An entity may be a physical object such as a house or a car, an event such as a house sale or a car service, or a concept such as a customer transaction or order. An entity is represented in terms of a rectangle. Eg: Bus

Attribute: An attribute is a characteristic of an entity. An attribute is a single data item related to a database object. The database schema associates one or more attributes with each database entity. Attributes are the equivalent of fields in file systems. Attributes are represented in terms of an oval. Eg: Name

Relationship: A relationship describes an association among entities. A relationship is represented in terms of a rhombus. For example, a relationship exists between customers and agents that can be described as follows: an agent can serve many customers, and each customer may be served by one agent. Data models use four types of relationships:
One-to-many (1:M or 1..*)
One-to-one (1:1 or 1..1)
Many-to-many (M:N or *..*)
Many-to-one (M:1 or *..1)

One-to-many relationship: A painter paints many different paintings, but each one of them is painted by only one painter. Thus the painter (the "one") is related to the paintings (the "many"). Therefore, database designers label the relationship PAINTER paints PAINTING as 1:M.


Many-to-one relationship: The preceding discussion identified each relationship in both directions; that is, relationships are bidirectional: the paintings are painted by one painter. Therefore the relationship PAINTINGS are painted by PAINTER is labeled M:1.

One-to-one relationship: A retail company's management structure may require that each of its stores be managed by a single employee. In turn, each store manager, who is an employee, manages only a single store. Therefore, the relationship EMPLOYEE manages STORE is labeled 1:1.

Many-to-many relationship: An employee may learn many job skills, and each job skill may be learned by many employees. Database designers label the relationship EMPLOYEE learns SKILL as M:N.

Constraint: A constraint is a restriction placed on the data. Constraints are important because they help to ensure data integrity. Constraints are normally expressed in the form of rules. Eg:
An employee's salary must have values that are between 6,000 and 35,000.
Each class must have one and only one teacher.
(A SQL sketch of the 1:M relationship above is shown below.)

Degrees of Data Abstraction:
1. A way of classifying data models
2. Many processes begin at a high level of abstraction and proceed to an ever-increasing level of detail
3. Designing a usable database follows the same basic process
4. The American National Standards Institute/Standards Planning and Requirements Committee (ANSI/SPARC) classified data models according to their degree of abstraction (1970s):
   1. Conceptual
   2. Internal
   3. External


Data Abstraction Levels

1. The Conceptual Model:
1. Represents a global view of the database
2. Enterprise-wide representation of data as viewed by high-level managers
3. Basis for identification and description of the main data objects, avoiding details
4. The most widely used conceptual model is the entity relationship (ER) model


Advantages of the Conceptual Model:
1. Independent of both software and hardware
2. Does not depend on the DBMS software used to implement the model
3. Does not depend on the hardware used in the implementation of the model
4. Changes in either the hardware or the DBMS software have no effect on the database design at the conceptual level

2. The Internal Model:
1. Representation of the database as seen by the DBMS
2. Adapts the conceptual model to the DBMS
3. Software dependent
4. Hardware independent

3. The External Model:
1. The end user's view of the data environment
2. Requires the modeler to subdivide the set of requirements and constraints into functional modules that can be examined within the framework of their external models
3. Good design should:
   1. Consider such relationships between views
   2. Provide programmers with a set of restrictions that govern common entities

Advantages of External Models:
Use of database subsets makes application program development much simpler

4. The Physical Model:
1. Operates at the lowest level of abstraction, describing the way data are saved on storage media such as disks or tapes
2. Software and hardware dependent
3. Requires that database designers have detailed knowledge of the hardware and software used to implement the database design


UNIT-2

ENTITY RELATIONSHIP MODELLING: The entity relationship model (ERM) - developing an ER diagram - database design challenges: conflicting goals - the extended entity relationship model - entity clustering - entity integrity: selecting primary keys - learning flexible database design - data modeling checklist.

The Entity-Relationship model: It is expressed in terms of entities in the business environment, the relationships (or associations) among those entities, and the attributes (properties) of both the entities and their relationships. The E-R model is usually expressed as an E-R diagram.

E-R Model Constructs:
Entity - a person, place, object, event or concept.
Entity Type (or Entity Set) - a collection of entities that share common properties or characteristics. Each entity type is given a name; since this name represents a set of items, it is always singular. It is placed inside the box representing the entity type.
Entity instance - a single occurrence of an entity type. An entity type is described just once (using metadata) in a database, while many instances of that entity type may be represented by data stored in the database. E.g. there is one EMPLOYEE entity type in most organizations, but there may be hundreds of instances of this entity stored in the database.

ER Model Basics:

Entity: A real-world object distinguishable from other objects. An entity is described (in the DB) using a set of attributes.
Entity Set: A collection of similar entities, e.g. all employees. All entities in an entity set have the same set of attributes. (Until we consider ISA hierarchies, anyway!) Each entity set has a key. Each attribute has a domain.

ENTITY TYPES: Entities are classified into two types, namely strong entities and weak entities.


Strong entity: An entity which does not depend on another entity is called a strong entity.
Weak entity: An entity which does depend on another entity is called a weak entity.

Key and key attributes:
Key: a unique value for an entity.
Key attributes: a group of one or more attributes that uniquely identify an entity in the entity set.

Super key, candidate key, and primary key:
Super key: a set of attributes that allows one to identify an entity uniquely in the entity set.
Candidate key: a minimal super key. There can be many candidate keys.
Primary key: a candidate key chosen by the designer, denoted by underlining the attribute in the ER diagram.

Fig: Example ER diagram (an entity with the attribute ssn underlined as its key).

Relationship: An association among two or more entities, e.g. Jack works in the Pharmacy department.
Relationship Set: A collection of similar relationships. An n-ary relationship set R relates n entity sets E1 ... En; each relationship in R involves entities e1 in E1, ..., en in En. The same entity set could participate in different relationship sets, or in different roles in the same set.


Fig: Example of a bank ER diagram.

Note: The Entity Relationship (E-R) Model
1. A relationship's degree indicates the number of associated entities or participants.
   1. A unary relationship exists when an association is maintained within a single entity.
   2. A binary relationship exists when two entities are associated.
   3. A ternary relationship exists when three entities are associated.


DATABASE DESIGN CHALLENGES: CONFLICTING GOALS

DESIGN STANDARDS: The database must be designed to conform to design standards, i.e. to the exact information requirements. High-speed processing may require design compromises. The quest for timely information may be the focus of database design. Other concerns:
Security
Performance
Shared access
Integrity

Note: High processing speed means minimal access time, which may be achieved by minimizing the number and complexity of logically desirable relationships.


Conflicting Goals

A design that meets all logical requirements and design conventions is an important goal. However, if this "perfect" design fails to meet the customer's transaction speed and/or information requirements, the designer will not have done a proper job from the end user's point of view. Compromises are a fact of life in the real world of database design.

Even while focusing on the entities, attributes, relationships and constraints, the designer should begin thinking about end user requirements such as performance, security, shared access, and data integrity. The designer must consider processing requirements and verify that all update, retrieval, and deletion options are available.

Finally, a design is of little value unless the end product is capable of delivering all specified query and reporting requirements. You are likely to discover that even the best design process produces an ERD that requires further changes mandated by operational requirements. Such changes should not discourage you from using the process.


ER modeling is essential in the development of a sound design that is capable of meeting the demands of adjustment and growth.

THE EXTENDED ENTITY RELATIONSHIP MODEL

It is simply called EERM. It is the extended form of the entity relationship diagram (ERD); that is, some additional features are added to the entity relationship diagram (ERD).

The additional features are:
Specialization
Generalization
Entity clustering
Crow's foot notations (cardinalities)

Let us first see what an ERD is. ERD means Entity Relationship Diagram. It consists of entities, attributes, relationships, cardinalities and constraints.

Example: an ER diagram with entities SCHOOL, TEACHER and STUDENTS. SCHOOL consists of TEACHER, SCHOOL has STUDENTS, and TEACHER teaches STUDENTS. Attributes: sname and saddress for SCHOOL; tid and tname for TEACHER; sid and sname for STUDENTS.
Let us discuss the features.

SPECIALIZATION: The relation from the parent class (super class) entities to the child class (sub class) entities is called SPECIALIZATION. It is the top-down approach of forming sub-group entities.

Example (specialization): a super class entity at the top, specialized into sub class entities below it.

GENERALIZATION: Generalization is the counterpart of specialization. It is the bottom-to-top approach.

The relation from the sub class (child class) entities to the super class (parent class) entities is called GENERALIZATION. It is the bottom-up approach of forming sub-group entities.

Example (generalization): sub class entities at the bottom, generalized into a super class entity above them.

The combination of both specialization and generalization is called CLASS HIERARCHY.

The ISA relationship is denoted with a TRIANGLE, i.e. a triangle symbol labeled ISA.

We can use the DISCRIMINATOR CONSTRAINT instead of ISA. The discriminator constraint is also called the COMPLETENESS CONSTRAINT.

It is of 2 types. They are:
1. Total participation
2. Partial participation

Total completeness means that every super type occurrence must belong to at least one sub type. Total participation (like a weak entity) is denoted with a double line.

Partial completeness means that there may be some super type occurrences that are not members of any sub type. Partial participation (like a strong entity) is denoted with a single line.


ENTITY CLUSTERING: It is an advanced technique used to group a number of entities into one entity. Clustering means grouping similar entities.

We group entities to remove confusion and to reduce the complexity of the ER diagram, including its relationships and their cardinalities.

Entity clustering is also called a VIRTUAL ENTITY or ABSTRACT ENTITY. A design can be drawn in 2 ways:
1. With clustering
2. Without clustering

A diagram with clustering is much simpler than one without clustering.

EXAMPLE WITHOUT CLUSTERING: an ER diagram with entities SCHOOL, CORRESPONDENT, HEAD, BUILDING, CLASS, ROOM, TEACHER and STUDENTS, with relationships such as has, orders, appoints, consists of and teaches, together with all their attributes (id, name, address, room no, colour, etc.). The full diagram is large and hard to read.

EXAMPLE WITH CLUSTERING: the same design with related entities grouped into cluster entities, giving a much smaller diagram.

Cardinalities:

A cardinality gives the range of a relationship. There are normal notations and crow's foot notations. The normal notations are:

*..* (many-to-many)
1..* (one-to-many)
*..1 (many-to-one)
1..1 (one-to-one)

ENTITY INTEGRITY: SELECTING PRIMARY KEYS - LEARNING FLEXIBLE DATABASE DESIGN:

A key is defined as a column or attribute of a database table. For example, if a table has id, name and address as its column names, then each one is known as a key for that table; we can also say that the table has 3 keys: id, name and address. Keys are also used to identify each record in the database table. The following are the various types of keys available in a DBMS.

1. A simple key contains a single attribute.

2. A composite key is a key that contains more than one attribute.

3. A candidate key is an attribute (or set of attributes) that uniquely identifies a row. A candidate key must possess the following properties:
   o Unique identification - for every row, the value of the key must uniquely identify that row.
   o Non redundancy - no attribute in the key can be discarded without destroying the property of unique identification.

4. A primary key is the candidate key which is selected as the principal unique identifier. Every relation must contain a primary key. The primary key is usually the key selected to identify a row when the database is physically implemented. For example, a part number is selected instead of a part description.

5. A superkey is any set of attributes that uniquely identifies a row. A superkey differs from a candidate key in that it does not require the non redundancy property.

6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non key attribute in one relation and as a primary key attribute in another relation. I say "usually" because it is possible for a foreign key to also be the whole or part of a primary key:
   o A many-to-many relationship can only be implemented by introducing an intersection or link table, which then becomes the child in two one-to-many relationships. The intersection table therefore has a foreign key for each of its parents, and its primary key is a composite of both foreign keys.
   o A one-to-one relationship requires that the child table has no more than one occurrence for each parent, which can only be enforced by letting the foreign key also serve as the primary key.

7. A semantic or natural key is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America. The value 'USA' has meaning to the user.

8. A technical or surrogate or artificial key is a key for which the possible values have no obvious meaning to the user or the data. These are used instead of semantic keys for any of the following reasons:
   o When the value in a semantic key is likely to be changed by the user, or can have duplicates. For example, on a PERSON table it is unwise to use PERSON_NAME as the key, as it is possible to have more than one person with the same name, or the name may change, such as through marriage.
   o When none of the existing attributes can be used to guarantee uniqueness. In this case adding an attribute whose value is generated by the system, e.g. from a sequence of numbers, is the only way to provide a unique value. Typical examples would be ORDER_ID and INVOICE_ID. The value '12345' has no meaning to the user, as it conveys nothing about the entity to which it relates.
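As a sketch of keys in SQL, assuming hypothetical COUNTRY and ORDERS tables in the spirit of the examples above:

-- Natural (semantic) primary key: the value 'USA' has meaning to the user.
create table country (
  country_code varchar2(3) primary key,
  country_name varchar2(40)
);

-- Surrogate (artificial) primary key: a generated number with no user meaning,
-- with INVOICE_NO as a second candidate key declared UNIQUE.
create table orders (
  order_id   number(10) primary key,
  order_date date,
  invoice_no varchar2(12) unique
);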

Data Modeling Checklist: The data modeling checklist is a combination of the business rules and the data models.



UNIT-III

RELATIONAL DATABASE MODEL: A logical view of data - keys - integrity rules - relational set operators - the data dictionary and the system catalog - relationships within the relational database - data redundancy revisited - indexes - Codd's relational database rules.

A Logical View of Data - Keys - Integrity Rules

Types of Data Constraints: There are 2 types of data constraints that can be applied to data being inserted into an Oracle table:
Input/output constraints
Business rule constraints

Input/Output Constraints: These are:
1. Primary Key Constraints
2. Foreign Key Constraints
3. Candidate Key Constraints
4. Composite Key Constraints
5. Super Key Constraints
6. Partial Key Constraints
7. Unique Key Constraints


1. Primary Key Constraints: A primary key is one (or) more columns that uniquely identify the tuples or rows of a table. A primary key does not allow NULL values or duplication. Primary keys are divided into 2 types:

1. Simple primary key: a table which contains only one field as the primary key.
2. Composite primary key: a table in which more than one field together forms the primary key.

The primary key always acts as a parent.
Ex: create table student (st_name varchar2(10), st_id varchar2(10) primary key);

2. Foreign Key Constraints: A foreign key is one (or) more columns that reference the primary key of another table. It is derived from a primary key: the foreign key reference must match the referenced primary key. A foreign key allows NULL values and duplicate values. The combination of a primary key and a foreign key gives the relation known as the referential integrity constraint.


Ex: create table emp1 (emp_name varchar2(10), emp_id varchar2(10) references employee(emp_id1));

3. Candidate Key Constraints: A candidate key is a set of fields that uniquely identifies the tuples or rows of a table. One of the candidate keys may be chosen as the primary key. It does not allow duplication or NULL values.

4. Composite Key Constraints: A composite key is part of the primary key concept: it is defined as more than one field together acting as the primary key of a table.

5. Super Key Constraints: A super key is any set of columns which uniquely identifies the rows (or tuples) of a table. (There is no SUPER KEY keyword in SQL; a super key is a design concept rather than a declarable constraint.) It allows NULL values and does not allow duplication.

6. Partial Key Constraints: A partial key is one or more columns which identify the rows of a weak entity.

7. Unique Key Constraints: A unique key is one (or) more columns that uniquely identify the tuples or rows of a table. It allows NULL values and does not allow duplication.

Business Rule Constraints:


It is also called a general constraint. General constraints are again classified into 2 types, namely:
1. Table constraints
2. Assertion constraints

General Constraints: A general constraint is a constraint which is applied in general. Ex: if a student's age is not greater than 18, then he is eligible to sit in the B.Tech class; the end user cannot violate this constraint. (A sketch of this rule as a SQL CHECK constraint is shown below.)
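A sketch of the student-age rule as a table-level CHECK constraint; the btech_student table and its columns are hypothetical:

create table btech_student (
  st_id   varchar2(10) primary key,
  st_name varchar2(30),
  st_age  number(2),
  -- the rule as stated above: age must not be greater than 18
  check (st_age <= 18)
);
-- An insert with st_age = 19 is rejected by the DBMS,
-- so the end user cannot violate the constraint.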

1. Table Constraints: A constraint which is applicable to a particular table is called a table constraint.
2. Assertion Constraints: A constraint which is applicable to a collection of tables is called an assertion constraint.
-------------------------------------------------------------------------------
RELATIONAL SET OPERATORS

Relational set operators work properly only if the relations are union-compatible: the names of the relation attributes must be the same and their data types must be identical.

There are several relational set operators:
1. UNION
2. UNION ALL


3. INTERSECT
4. MINUS

1. UNION: It combines rows from two or more queries without including duplicate rows.

QUERY:
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer
UNION
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer1;

It can be used to unite more than two queries.

EXAMPLE:

customer:           customer1:
Cus_name  Cus_id    Cus_name  Cus_id
A         1         D         4
B         2         A         1
C         3         B         2
                    C         3

RESULT:
Cus_name  Cus_id
A         1
B         2
C         3
D         4

2. UNION ALL:

It produces a relation that retains duplicate rows. It can be used to unite more than two queries. It allows redundancy.

QUERY:
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer
UNION ALL
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer1;

EXAMPLE:

customer:           customer1:
Cus_name  Cus_id    Cus_name  Cus_id
A         1         D         4
B         2         A         1
C         3         B         2
                    C         3

RESULT:
Cus_name  Cus_id
A         1
B         2
C         3
D         4
A         1
B         2
C         3

3. INTERSECT: It combines rows from two queries, returning only the rows that appear in both sets.

QUERY:
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer
INTERSECT
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer1;

EXAMPLE:

customer:           customer1:
Cus_name  Cus_id    Cus_name  Cus_id
A         1         D         4
B         2         B         2
C         3         C         3
D         4         F         6
E         5         G         7

RESULT:
Cus_name  Cus_id
B         2
C         3
D         4

4. MINUS: It combines rows from two queries and returns only the rows that appear in the first set but not in the second.

QUERY:
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer
MINUS
select cus_Lname, cus_Fname, cus_initial, cus_Areacode from customer1;

EXAMPLE:

customer:           customer1:
Cus_name  Cus_id    Cus_name  Cus_id
A         1         D         4
B         2         A         1
C         3         B         2
E         5         C         3
F         6

RESULT:
Cus_name  Cus_id
E         5
F         6

-------------------------------------------------------------------------------
DATA DICTIONARY AND SYSTEM CATALOG

DATA DICTIONARY: The data dictionary provides a detailed accounting of all tables found within the user- or designer-created database. Thus the data dictionary contains at least all of the attribute names and characteristics for each table in the system. It therefore contains metadata, or data about data. It is sometimes described as "the database designer's database" because it records the design decisions about tables and their structures.

SYSTEM CATALOG: Like the data dictionary, the system catalog contains metadata. It can be described as a detailed system data dictionary that describes all objects within the database. Because the system catalog contains all required data dictionary information, the terms "system catalog" and "data dictionary" are often interchangeable. The system catalog is actually a system-created database whose tables store the characteristics and contents of the user- or designer-created database; it can therefore be queried like any other user- or designer-created database, as the sketch below shows.
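For example, Oracle exposes its catalog through data dictionary views such as USER_TABLES and USER_TAB_COLUMNS, which can be queried like ordinary tables:

select table_name from user_tables;

select column_name, data_type
from user_tab_columns
where table_name = 'STUDENT';
-------------------------------------------------------------------------------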


Relationships in the Relational Database

Relation: A relation is a data structure which consists of attributes and a set of tuples which share the same type. A relation has zero or more tuples. For example, to store the data of students, a relation is represented by the 'Student' table. This table has the column attributes 'Roll-No', 'Name' and 'City'. Each column contains values of the attribute; e.g., the 'Name' column contains only the names of students. If there are four rows in the table, records of four students are represented. In relational data model terminology, we therefore define the following terms:

Attributes: Each column of a relation has a heading, called the field name. The named columns of the relation are called attributes. In the above "Student" relation, the attribute names are 'Roll-No', 'Name' and 'City'.

Degree: The number of attributes in a relation is called its degree. In the "Student" relation, the degree is 3.

Tuple: Each row of the relation is called a tuple. Each tuple consists of a set of values that are related to a particular instance of an entity type.

Cardinality: The number of tuples (rows or records) in the relation is called its cardinality. In the 'Student' relation there are 4 tuples, so the cardinality of the relation is 4.

Domain: The range of acceptable values for an attribute is called the domain of the attribute. Two or more attributes may have the same domain.

A relation that contains a minimum amount of redundancy and allows users to insert, modify, and delete rows without errors is known as a well-structured relation. A relation schema is used to describe a relation; it is made up of a relation name and a list of attributes. For example, if R is a relation name and its attributes are A1, A2, ..., An, then the relation schema of R is denoted as R(A1, A2, ..., An).


An example of a relation schema for a relation of degree 3, which describes the students of a university, is given below:

STUDENT (Roll-No, Name, City)

where 'Student' represents the name of the relation (or table), followed by the names of the attributes (in parentheses).

Properties of Relations:
There is no duplication of tuples in a relation.
All attribute values of a relation are atomic.
All values of an attribute must come from the same domain.
Tuples of a relation are unordered from top to bottom.
Attributes of a relation are unordered from left to right.

The relationship between any two entities can be one-to-one, one-to-many, many-to-many, or many-to-one. One-to-many relationships are probably the most common type. An invoice includes many products. A salesperson creates many invoices. These are both examples of one-to-many relationships.

One-to-One:
Eg:
1. In South Africa a person can have one and only one Identity Document. Also, a specific Identity Document can belong to one and only one person.
2. A voter can cast only one vote in an election. A ballot paper can belong to only one voter. So there will be a 1:1 relationship between a Voter and a Ballot Paper.

Let's look at the Person - ID Document example:


The two entities could translate into the following two tables:

Person Table: ID, Surname, Name, SecondName, Initials, DateOfBirth
ID_Document Table: IDNumber, DateIssued, ExpiryDate, IssuedLocation, CheckedBy

In the Person table, the column ID will be the primary key. In the ID_Document table, IDNumber will be the primary key.

One-to-Many:
Eg: Master-detail. You have a master (or header) record with many detail records, for example an order. There will be a master record with the order date, the person placing the order, etc., and then detail records of everything ordered. The master record will have many details, and each detail will have only one master.

Let's look at the Master - Detail example:


The two entities could translate into the following two tables:

Master Table: OrderNumber, Date, ClientCode, CapturedBy
Detail Table: OrderNumber, LineNumber, ItemCode, Quantity

In the Master table, the Order Number is the primary key. In the Detail table, Order Number is a foreign key. The primary key of the Detail table will be the combination of Order Number and Line Number.
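A sketch of these two tables in SQL, with the composite primary key on the Detail table; the column names follow the example above:

create table master (
  order_no    number(8) primary key,
  order_date  date,
  client_code varchar2(10),
  captured_by varchar2(20)
);

create table detail (
  order_no  number(8) references master(order_no),  -- foreign key to the master
  line_no   number(4),
  item_code varchar2(10),
  quantity  number(6),
  primary key (order_no, line_no)                   -- composite key: order + line
);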

Many-to-Many:
Eg: Student - Professor. A student will have one or more professors. The same professor will have lots of students.

Let's look at the Student - Professor example:


Now, because many-to-many is not allowed directly, we change this design by adding a junction entity in the middle. Hint: the name of the junction entity could be a combination of the other 2 entities, in this example StudentProfessor. Notice that the "many" side always points to the junction entity. The three entities could translate into the following three tables:

Student Table: StudentNumber, Name, Address, HomeTelNumber
StudentProfessor Table: StudentNumber, ProfNumber
Professor Table: ProfNumber, Name, OfficeNumber, WorkTelNumber
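A sketch of the junction table in SQL; it assumes student and professor tables keyed on student_no and prof_no already exist:

create table student_professor (
  student_no number(8) references student(student_no),
  prof_no    number(8) references professor(prof_no),
  primary key (student_no, prof_no)  -- composite of the two foreign keys
);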

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

DATA REDUNDANCY REVISITED

Data redundancy leads to data anomalies. Those anomalies can destroy the effectiveness of the database. It is, however, possible to control data redundancies by using common attributes, also known as foreign keys, that are shared by tables. The proper use of foreign keys is crucial to data redundancy control; however, they do


not eliminate data redundancies, because the foreign key values themselves can be repeated many times. There are also instances where data redundancy is necessary.

Problems caused by redundancy: Redundancy can lead to several problems:
Redundant storage
Insertion anomalies
Update anomalies
Deletion anomalies

Redundant Storage: The same information is stored repeatedly; this is called redundant storage.
Insertion Anomalies: It is not possible to insert certain information unless some unrelated (waste) information is also inserted into the relation.
Update Anomalies: If only one copy of such repeated data is updated, the data becomes inconsistent.
Deletion Anomalies: It is not possible to delete certain information without also deleting some meaningful information.

Indexes: An index is a special or additional feature in a DBMS. The primary key itself acts as an index. A unique index does not allow duplication. Indexes are used for quick reference to the columns of a table.


Syntax for a normal index:
create index emp_x on emp (emp_name, emp_id);

Syntax for a unique index:
create unique index emp_y on emp1 (emp_name, emp_id1);

Indexes are of 2 types. They are:
Normal index
Unique index

A unique index is an index in which the index key can have only one pointer value or row associated with it. A table can have many indexes, but each index is associated with only one table.

CODD'S RULES: In 1985, Codd published rules that define the minimum requirements for an RDBMS used by a company. Every organisation's system must satisfy these rules.

Information. All information in a relational database must be logically represented as column values in rows within tables.

Guaranteed Access. Every value in a table is guaranteed to be accessible through a combination of table name, primary key value, and column name.

Systematic Treatment of Nulls. Nulls must be represented and treated in a systematic way, independent of data type.

Dynamic On-Line Catalog Based on the Relational Model. The metadata must be stored and managed as ordinary data, that is, in tables within the database. Such data must be available to authorized users using the standard database relational language.

Comprehensive Data Sublanguage. The relational database may support many languages. However, it must support one well-defined, declarative language with support for data definition, view definition, data manipulation, integrity constraints, authorization, and transaction management.


View Updating. Any view that is theoretically updatable must be updatable through the system.

High-Level Insert, Update and Delete. The database must support set-level inserts, updates, and deletes.

Physical Data Independence. Application programs and ad hoc facilities are logically unaffected when physical access methods or storage structures are changed.

Logical Data Independence. Application programs and ad hoc facilities are logically unaffected when changes are made to the table structures that preserve the original table values.

Integrity Independence. All relational integrity constraints must be definable in the relational language and stored in the system catalog, not at the application level.

Distribution Independence. The end users and application programs are unaware of and unaffected by the data location.

No Subversion. If the system supports low-level access to the data, there must not be a way to bypass the integrity rules of the database.

Rule Zero. All preceding rules are based on the notion that in order for a database to be considered relational, it must use its relational facilities exclusively to manage the database.

UNIT-4


STRUCTURED QUERY LANGUAGE:
Introduction to SQL
Data definition commands
Data manipulation commands
Advanced SELECT queries
Virtual tables
Creating a view
Joining database tables

ADVANCED SQL:
Relational set operators
SQL join operators
Subqueries and correlated queries
SQL functions
Oracle sequences
Updatable views
Procedural SQL
Embedded SQL

Introduction to SQL:
SQL means Structured Query Language.
SQL was developed by IBM.
SQL is based on tuple relational calculus, a branch of mathematics.
The first name of SQL was SEQUEL; SEQUEL means Structured English Query Language.


In 1968, IBM used IMS to store and retrieve data. It was similar to the CODASYL approach, and IBM later developed SQL. Here IMS means Information Management System.

Data definition commands:

These are the data definition (DDL) commands. They are as follows:
create table tablename (fieldname datatype(size));
alter table tablename add (fieldname datatype(size));
alter table tablename modify (fieldname datatype(size));
alter table tablename drop column columnname;
drop table tablename;
truncate table tablename;
rename tablename1 to tablename2;

Data manipulation commands:
insert into tablename values (&fieldname);
update tablename set columnname = value where columnname = value;
delete from tablename;
select * from tablename;
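A short worked sequence showing these DDL and DML commands in use on a hypothetical emp table; the &fieldname form above is an SQL*Plus substitution variable, so literal values are used here:

create table emp (emp_id varchar2(10), emp_name varchar2(10));
alter table emp add (emp_sal number(8,2));
alter table emp modify (emp_name varchar2(20));

insert into emp values ('519', 'RAVI', 25000);
update emp set emp_sal = 30000 where emp_id = '519';
select * from emp;
delete from emp where emp_id = '519';

alter table emp drop column emp_sal;
rename emp to employee;
truncate table employee;
drop table employee;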


Advanced select queries:

Syntax to display a column from a table:
select columnname from tablename;
Ex: select empname from emp;

Virtual tables: Creating a view:

View: A view is a virtual table based on a SELECT query. The query can contain columns, computed columns, aliases and aggregate functions from one or more tables. The tables on which a view is based are called BASE TABLES. A view lives in the schema, like tables and aliases. A view provides security to the database; it is used by higher-authority people, such as the administrator, for security purposes. Here a view is a duplicate table of the original table.

Views are accessed by the database administrator. Changes made in the original table are reflected in the views, so the higher-authority people know about modifications done in the tables; but changes done in the views (the duplicate tables) do not affect the original tables. Views are dynamically updated. We can create and drop views, and delete rows through them; the syntax is shown below:

CREATE:


Create view viewname as select column1, column2, ... from tablename;
Ex: Create view emp1 as select emp_name, emp_id from emp;
DROP:
Drop view viewname;
Ex: Drop view emp1;
DELETE (removing rows through a view):
Delete from viewname where columnname = value;
Ex: Delete from emp1 where emp_id = 519;
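A small end-to-end sketch, assuming the emp table used earlier (all names are illustrative); a simple single-table view reads through to its base table and, as discussed under updatable views later, can even be updated through it:

Create view emp_v as select emp_name, emp_id from emp;
Select * from emp_v;                                     -- rows come from the base table
Update emp_v set emp_name = 'RAVI' where emp_id = 519;   -- simple views are updatable
Drop view emp_v;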

Joining database tables:
Joins:
Joins are used to join tables on their common attributes. A join is performed when data is retrieved from more than one table at a time. To join tables, you simply list the tables in the FROM clause of the SELECT statement; the DBMS will create the Cartesian product of every table in the FROM clause, and the WHERE clause then restricts the result. In relational algebra the join symbol is ⋈, and the operation is related to the mathematical Cartesian product, whereas in relational calculus the JOIN keyword is used. Joins are of 3 types (an SQL sketch follows this list). They are as follows:
1. Cross join (natural join)
2. Outer join
3. Theta join
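A hedged sketch of the FROM-clause style just described, using the EMP and EMP_WORK tables of the following examples (the table named EMP WORK is written here as emp_work): listing both tables forms the Cartesian product, and the WHERE clause keeps only the matching rows.

Select e.name, e.street, w.branch, w.sal
from emp e, emp_work w        -- Cartesian product of the two tables
where e.name = w.name;        -- the join condition keeps matching rows only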


Cross join:
It is also called a natural join.
Natural join: it is similar to a join; it performs the operation on both the tables and is denoted by the join symbol ⋈.
Ex: R1 ⋈ R2, where R1 = EMP and R2 = EMP_WORK.
Syntax: Select * from table1 natural join table2;
It involves some loss of data, as shown in the following example.

EMP:
name  street  city
A     X       1
B     Y       2
C     Z       3
D     W       4

EMP_WORK:
NAME  BRANCH  SAL
A     CSE     10
B     EEE     20
C     ECE     30
D     IT      40
E     CIVIL   50

RESULT:
NAME  STREET  CITY  BRANCH  SAL
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40

The row for E is lost because it has no match in EMP. To overcome this drawback we use the outer join.
Outer join:
It is again classified into 3 types. They are as follows:
1. Left outer join
2. Right outer join
3. Full outer join

Left outer join:
It is one type of outer join. It performs the operation with respect to the left table and is denoted by the symbol ⟕. It displays the matching values on the common column, plus the remaining rows of the left table.
Syntax: Select * from T1 left outer join T2 on T1.C1 = T2.C1;
It does not lose information from the left table.

EMP:
name  street  city
A     X       1
B     Y       2
C     Z       3
D     W       4

EMP_WORK:
NAME  BRANCH  SAL
A     CSE     10
B     EEE     20
C     ECE     30
D     IT      40
E     CIVIL   50

RESULT:
NAME  STREET  CITY  BRANCH  SAL
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40

Right outer join:
It is one type of outer join. It performs the operation with respect to the right table and is denoted by the symbol ⟖. It displays the matching values on the common column, plus the remaining rows of the right table.
Syntax: Select * from T1 right outer join T2 on T1.C1 = T2.C1;
It does not lose information from the right table.

EMP:
name  street  city
A     X       1
B     Y       2
C     Z       3
D     W       4

EMP_WORK:
NAME  BRANCH  SAL
A     CSE     10
B     EEE     20
C     ECE     30
D     IT      40
E     CIVIL   50

RESULT:
NAME  STREET  CITY  BRANCH  SAL
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40
E     NULL    NULL  CIVIL   50

Full outer join:
It is one type of outer join. It performs the operation on both the tables and is denoted by the symbol ⟗. It displays all the rows of both the tables.
Syntax: Select * from T1 full outer join T2 on T1.C1 = T2.C1;
It does not lose information from either table.

EMP:
name  street  city
A     X       1
B     Y       2
C     Z       3
D     W       4

EMP_WORK:
NAME  BRANCH  SAL
A     CSE     10
B     EEE     20
C     ECE     30
D     IT      40
E     CIVIL   50

RESULT:
NAME  STREET  CITY  BRANCH  SAL
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40
E     NULL    NULL  CIVIL   50
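The three outer joins above can also be written in ANSI join syntax against the example tables (a sketch; the table named EMP WORK is written emp_work):

Select * from emp e left outer join emp_work w on e.name = w.name;
Select * from emp e right outer join emp_work w on e.name = w.name;
Select * from emp e full outer join emp_work w on e.name = w.name;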

Theta join:
It is denoted by the symbol θ. It is used to place a condition on the join; if the condition is satisfied, the matching results are displayed. It can be performed over a single table or a pair of tables.
Ex: applying the condition sal < 20 on emp1 gives:

name  branch  sal
A     CSE     10

Relational set operators:
These are of 4 types. They are as follows:
1. Union
2. Union all
3. Intersect
4. Minus
They work properly only if the relations are union-compatible: the names of the relation attributes must be the same and their data types must be identical. They operate column-wise on the whole relations.

UNION:
It combines the rows from 2 or more queries without including duplicate rows. It can be used to combine more than 2 queries.
Ex:
Select e_name, e_id from EMP


Union
Select e_name, e_id from emp1;
Union all:
It produces a relation that retains the duplicate rows. It too can combine more than 2 queries.
Ex:
Select e_name, e_id from EMP
Union all
Select e_name, e_id from emp1;
Intersect:
It combines rows from 2 queries and returns the rows that appear in both sets. It does not allow duplicates.
Ex:
Select e_name, e_id from EMP
Intersect
Select e_name, e_id from emp1;
Minus:
It is also called DIFFERENCE or SET DIFFERENCE. It combines rows from 2 queries and returns only the rows that appear in the first set but not in the second set.


Ex:
Select e_name, e_id from EMP
Minus
Select e_name, e_id from emp1;

Example data for the relational set operators:

EMP:
name  id
A     1
B     2
C     3
D     4
E     5

EMP1:
name  id
A     1
B     2
F     6
G     7
H     8

UNION:
NAME  ID
A     1
B     2
C     3
D     4
E     5
F     6
G     7
H     8

UNION ALL:
NAME  ID
A     1
A     1
B     2
B     2
C     3
D     4
E     5
F     6
G     7
H     8

INTERSECT:
NAME  ID
A     1
B     2

MINUS (EMP minus EMP1):
NAME  ID
C     3
D     4
E     5
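A runnable sketch of the four set operators on the tables above (the column types are assumptions):

Create table emp  (e_name char(1), e_id number);
Create table emp1 (e_name char(1), e_id number);
Insert into emp  values ('A', 1);  -- plus rows B-E, as in the EMP table above
Insert into emp1 values ('A', 1);  -- plus rows B, F, G, H, as in EMP1 above

Select e_name, e_id from emp union     select e_name, e_id from emp1;
Select e_name, e_id from emp union all select e_name, e_id from emp1;
Select e_name, e_id from emp intersect select e_name, e_id from emp1;
Select e_name, e_id from emp minus     select e_name, e_id from emp1;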


QUERIES USING SQL FUNCTIONS:
Sub queries and correlated queries:
Sub queries:
A sub query is also called a nested query. A nested query is a query that appears inside another SQL statement; simply, a select within a select query is called a nested query. Embedding one select statement into another select statement is nesting: the embedded select query is called the inner query or sub query, and the outer query is called the main query. Nested queries are used to design complex queries.
Ex: Find all the employees who have the same post as RAJ. The inner query finds RAJ's post:
Select post from employee where e_name = 'RAJ';
Output: MANAGER
Note: With ordinary queries each result is displayed separately, whereas by using a sub query we can get the result with one query.
Ex: To find all employees who have the same SAL as MANOJ:
SELECT * FROM employee WHERE sal = (SELECT sal FROM employee WHERE name = 'MANOJ');


Output:

Name   Branch     Post  Sal
MANOJ  HYDERABAD  G.M   2L
Raj    BANGLORE   M.D   2L

CORRELATED SUBQUERY:
In a correlated subquery the inner query is executed once for each row of the outer query. It can be written with any comparison operator; EXISTS and NOT EXISTS are common ones. They allow the user to test, for each row of the outer query, whether the inner query returns any row. Here the sub query is dependent on the main query.
Ex: Find the names of the sailors who have reserved boat number 103:
Select s.s_name from sailors s
where exists (select * from reserves r
              where r.s_id = s.s_id and r.bid = 103);
Ex: Find the names of the sailors who have not reserved boat number 103:
Select s.s_name from sailors s
where not exists (select * from reserves r
                  where r.s_id = s.s_id and r.bid = 103);
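A correlated subquery is not limited to EXISTS; any comparison operator can correlate the inner query with the outer row. A hedged sketch (an emp table with dept and sal columns is an assumption): find the employees earning more than the average salary of their own department.

Select e.e_name, e.sal
from emp e
where e.sal > (select avg(x.sal)
               from emp x
               where x.dept = e.dept);  -- inner query re-evaluated for each outer row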


Oracle sequences:
A sequence is an object which is used to generate sequential values, which are often assigned to primary keys. A sequence is not physically linked with a table; it is the developer's responsibility to use a sequence only with the primary key it was created for.
Syntax:

Create sequence seq_name
    increment by n
    start with n;
Here INCREMENT BY gives the step by which the sequence value increases, and START WITH gives the first number issued; the sequence is used to give records their sequence numbers.
Ex:
Create sequence emp_no_seq
    increment by 1
    start with 1000;
A sequence has 2 pseudo columns. They are:
1. NEXTVAL
2. CURRVAL
NEXTVAL:
It is used to get the next sequential value of the sequence.
Ex: Insert into EMP values (emp_no_seq.nextval);
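A hedged end-to-end sketch (Oracle syntax; the emp table and its columns are illustrative):

Create sequence emp_no_seq
    start with 1000
    increment by 1;
Insert into emp (emp_id, emp_name) values (emp_no_seq.nextval, 'RAJ');   -- gets 1000
Insert into emp (emp_id, emp_name) values (emp_no_seq.nextval, 'RAVI');  -- gets 1001
Drop sequence emp_no_seq;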


CURRVAL:
It is used to get the current value of the sequence.
Ex: Select emp_no_seq.currval from dual;
Updatable views:
Updatable views means updating rows through the view by using the UPDATE command. Here the view is a schema object and acts as a duplicate table of the original table. In updatable views we can perform joins, whereas in normal views we cannot perform the join operator.
Ex:

PROD_MASTER:
Prod_id  Prod_no
1        A123
2        B456

PROD_SALES:
Prod_id  Prod_qty
1        50
2        60
3        70

RESULT:
Prod_id  Prod_qty
1        50
2        60


PL/SQL AND EMBEDDED SQL:
PL/SQL:
PL/SQL has control statements, whereas plain SQL has no control statements. Using only SQL, the DBA suffers from various disadvantages:
1. SQL does not have any procedural capability.
2. SQL does not provide programming techniques like control statements, i.e., checking, looping and branching.
3. In SQL the speed of data processing is slow.
4. SQL does not offer exception handling.
ADVANTAGES OF PL/SQL:
PL/SQL is a development tool which supports SQL and provides other facilities like checking, looping and branching.
PL/SQL allows exception handling mechanisms.
PL/SQL allows procedures.
PL/SQL makes data processing very fast.
Architecture of PL/SQL:
A PL/SQL program contains 3 blocks, namely:
1. DECLARE block
2. BEGIN block
3. END block
PL/SQL supports other functionalities like procedures, triggers, cursors, etc.; a minimal block sketch is shown below.
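A minimal anonymous PL/SQL block, assuming the emp table used earlier (a sketch, not a full program): it shows the DECLARE / BEGIN / END structure, a branching statement that plain SQL lacks, and an exception handler.

DECLARE
    v_sal NUMBER := 0;
BEGIN
    SELECT sal INTO v_sal FROM emp WHERE emp_id = 1;
    IF v_sal < 20000 THEN                         -- branching (checking)
        UPDATE emp SET sal = sal * 1.10 WHERE emp_id = 1;
    END IF;
EXCEPTION
    WHEN NO_DATA_FOUND THEN                       -- exception handling
        DBMS_OUTPUT.PUT_LINE('No such employee');
END;
/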


EMBEDDED SQL:
We access data from the database by embedding DML and DDL statements in an application program. Here the application program means a program in a host programming language; for example, we may embed DML and DDL statements in a COBOL program:

COBOL
BEGIN
    ...
    DDL STATEMENTS
    DML STATEMENTS
    ...
END


UNIT-5
NORMALIZATION OF DATABASE TABLES
Database tables & normalization - the need for normalization - the normalization process - improving the design - surrogate key considerations - higher level normal forms - normalization and database design - denormalization.
Normalization of Database Tables
Database tables and Normalization:
Generally, database normalization is a combination of schema refinement and decomposition techniques. Or: database normalization is the process of removing redundant data and other anomalies caused by dependencies such as multivalued dependencies, partial dependencies and transitive dependencies. The database becomes an efficient database by applying the different normal forms. The process of normalization includes a series of stages known as normal forms; these normal forms were developed by Edgar F. Codd. The main normal forms are first normal form (1NF), second normal form (2NF) and third normal form (3NF). Although we have other, higher normal forms like 4NF and 5NF, relations in 3NF are often described as normalized because they are largely free from redundancy.


Dependencies: Definitions
Multivalued Attributes (or repeating groups): non-key attributes, or groups of non-key attributes, whose values are not uniquely identified by (are not functionally dependent on) the value of the primary key or a part of it.

STUDENT:
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00

Partial Dependency: when a non-key attribute is determined by only a part of a composite primary key, this is called a partial dependency.

CUSTOMER:
Cust_ID  Name   Order_ID
101      AT&T   1234
101      AT&T   156
125      Cisco  1250

(Here Name depends only on Cust_ID, one part of the composite key.)


Transitive Dependency: when a non-key attribute determines another non-key attribute, this is called a transitive dependency.

EMPLOYEE:
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg

(Here the non-key attribute Dept_ID determines the non-key attribute Dept_Name.)

The normalization process:
First normal form (1NF):
1NF can be defined as follows: a table with a primary key and with no multivalued attributes, repeating groups or redundant values is in 1NF. A relation schema is said to be in first normal form if the attribute values in the relation are atomic, i.e., there should be no repeating values in a particular column.
Converting a Relational Table to 1NF
The conversion of a relational table to 1NF can be done by performing the following steps.
Step 1
In this step, the repeating groups must be removed by eliminating the null columns. This can be done by ensuring that every repeating group has an appropriate value.


Step 2
In this step, a primary key among the set of attributes must be identified; this primary key helps in uniquely identifying every row.
Step 3
In this step, the dependencies existing in the relation must be identified. This can be done by knowing which attributes in the relation depend on the primary key.
Example
Step 1: Bringing a relation to 1NF. Make the determinant of the repeating group (or the multivalued attribute) a part of the primary key.

Composite Primary Key: (Stud_ID, Course_ID)

STUDENT:
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00


Step 2: Consider the dependency on Course_ID, making Course_ID part of the primary key.

STUDENT:
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00

The above table's primary key contains the two attributes Stud_ID and Course_ID, called a composite primary key.
Step 3: Remove the repeating group from the relation. Create another relation which contains all the attributes of the repeating group plus the PK of the first relation.

STUDENT:
Stud_ID  Name
101      Lennon
125      Johnson

STUDENT_COURSE:
Stud_ID  Course   Units
101      MSI 250  3
101      MSI 415  3
125      MSI 331  3
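A hedged DDL sketch of the 1NF result above (column types are assumptions):

Create table student (
    stud_id number(5) primary key,
    name    varchar2(30)
);
Create table student_course (
    stud_id   number(5) references student (stud_id),
    course_id varchar2(10),
    units     number(4,2),
    primary key (stud_id, course_id)
);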


Second normal form (2NF):
A relation is said to be in 2NF if the following two conditions are satisfied:
1. The relation must be in 1NF.
2. It does not contain any partial dependencies.
In other words, a relation is in 2NF if every non-key attribute is functionally dependent on the whole primary key.
Example
The conversion of a relation in 1NF (with a composite primary key) into a relation in 2NF can be done by performing the following steps.
Step 1
Initially all the single key components of the relation are listed separately; then the key component which is a composite key is specified. Consider the STUDENT table, which has a composite primary key and partial dependencies.

Composite Primary Key: (Stud_ID, Course_ID)
Partial Dependencies: Stud_ID determines Name; Course_ID determines Units

STUDENT:
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00


Step 2
The above table contains two partial dependencies. Remove the attributes that depend on part (but not the whole) of the primary key from the original relation.

STUDENT:
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00

Step 3
In this step, for each partial dependency create a new relation keyed by the corresponding part of the primary key.

STUDENT:
Stud_ID  Name
101      Lennon
125      Johnson

STUDENT_COURSE:
Stud_ID  Course_ID
101      MSI 250
101      MSI 415
125      MSI 331

COURSE:
Course_ID  Units
MSI 250    3.00
MSI 415    3.00
MSI 331    3.00


Third normal form (3NF):
A relation is said to be in third normal form if the following two conditions are satisfied:
1. The relation must be in 2NF.
2. The relation must not have any transitive dependencies.
In other words, the relation does not contain a transitive dependency, i.e., a non-key attribute determining another non-key attribute.
Converting 2NF to 3NF
The conversion of a relation in 2NF into a relation in 3NF can be done by performing the following steps.
Step 1
In this step consider an EMPLOYEE table with a transitive dependency (Dept_ID determines Dept_Name):

EMPLOYEE:
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg


Step 2
Remove the attributes which are dependent on a non-key attribute from the original relation.

EMPLOYEE:
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg

Step 3
For each transitive dependency, create a new relation with the non-key attribute that is the determinant of the transitive dependency as its primary key, and the dependent non-key attribute as a non-key attribute.


EMPLOYEE:
Emp_ID  F_Name  L_Name  Dept_ID
111     Mary    Jones   1
122     Sarah   Smith   2

DEPARTMENT:
Dept_ID  Dept_Name
1        Acct
2        Mktg
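A hedged DDL sketch of the 3NF result (types are assumptions); the transitive dependency disappears because Dept_Name now lives only in DEPARTMENT:

Create table department (
    dept_id   number(3) primary key,
    dept_name varchar2(20)
);
Create table employee (
    emp_id  number(5) primary key,
    f_name  varchar2(20),
    l_name  varchar2(20),
    dept_id number(3) references department (dept_id)
);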

Boyce-Codd Normal Form (BCNF):
A table is said to be in BCNF when each and every determinant in the table is a candidate key. BCNF is an extension of 3NF. BCNF removes dependencies in which the determinant is not a candidate key. A BCNF table may contain more than one candidate key, and it does not allow redundancy. BCNF is used in higher-level databases like business transaction databases. Every relation in BCNF is also in 3NF, but a relation in 3NF need not be in BCNF. In a table, 3NF and BCNF are equal when the table includes a single candidate key; otherwise they may differ. A table can be in 3NF but not in BCNF when it includes a dependency whose determinant is not a candidate key (for example, with overlapping composite candidate keys).


When an attribute depends on a determinant that is not a candidate key, the table may still be in 3NF but it is not in BCNF; the reason is that BCNF requires every determinant in the table to be a candidate key.
Fourth Normal Form (4NF):
4NF is a normal form beyond 3NF used to remove multivalued dependencies and the anomalies they cause. Here a multivalued dependency means redundant sets of values in a relation. You might encounter a poorly designed database, or you might be asked to convert spreadsheets into a database format in which multiple multivalued attributes exist.
Unnormalization (denormalization):
An unnormalized table contains redundancy or repeating groups; it may not even contain a primary key. Normalized database tables generally help to create a good database, but when processing requirements are considered, sometimes unnormalized database tables provide a better solution: the normalized design includes several data tables which need to be joined using I/O operations, thereby reducing the processing speed. However, rare and occasional circumstances may allow some degree of denormalization so that processing speed can be increased. The disadvantages are:
When programs for reading and updating tables are considered, they deal with large tables, and hence data updates are less efficient.
The implementation of indexing is very complex on an unnormalized table.
Virtual tables (views) cannot readily be created using unnormalized tables.
Improving the design:
Every relation can be designed with the following rules in order to improve the database:
1. Evaluate the primary key of each relation.
2. Evaluate naming conventions (no two attributes should have the same name).
3. Refine attribute atomicity.
4. Evaluate related attributes.


5. Evaluate related relationships.
6. Connect existing relation details to the new relation.
7. Evaluate the types of the attributes.

Surrogate key considerations:
A surrogate key is an artificial primary key which is generated automatically by the system and managed through the DBMS. The surrogate key is numeric, and it is automatically incremented for each new row. Designers need a surrogate key when the natural primary key is inappropriate. A surrogate key is used to improve the efficiency of the database, and it acts like an index; an index is used for quick reference. Indexes are classified into 2 types:
Normal index
Unique index
Applying a unique index to the key columns supports the surrogate key design.
Syntax:
Create unique index indexname on tablename (field1, field2);
Eg:
Create unique index cus_x on customer (cus_name, cus_id);
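A hedged sketch combining the ideas above: a sequence generates the numeric surrogate key, while the unique index still protects the natural candidate key (all names and types are illustrative):

Create table customer (
    cus_key  number primary key,   -- the surrogate key
    cus_id   varchar2(10),
    cus_name varchar2(30)
);
Create sequence cus_seq start with 1 increment by 1;
Create unique index cus_x on customer (cus_name, cus_id);
Insert into customer values (cus_seq.nextval, 'C001', 'RAJ');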


UNIT-VI
What is a transaction? - Transaction states - Implementation of atomicity and durability - Concurrency control - Serializability - Testing for serializability - Concurrency control with locking methods - Concurrency control with timestamping methods - Concurrency control with optimistic methods - Database recovery management - Validation-based protocols - Multiple granularity

What is a transaction?
Transaction processing systems collect, store, modify and retrieve the transactions of an organisation. A transaction is an event that generates or modifies data that is eventually stored in an information system. A transaction is a small application program, or a unit of a program, executed by the user. Every transaction is written in a high-level data manipulation language or programming language.
Transaction states:
For recovery purposes, the system needs to keep track of when a transaction starts, terminates, and commits or aborts.


Every transaction has 5 states, which are as follows:
Active state
Partially committed state
Committed state
Failed state
Aborted state

Active state:Active state marks the beginning of transaction execution. Or Active state is initial state of the transaction. If there is any error in the transaction then it went to failed state. If there is no error in the transaction then it went to partially committed state. Partially committed state:When the transaction comes from active state to partially committed state the transaction executes some list of operations. Failed state:When ever we are performing transactions due to some failures that mean Hardware or software failures or system crashes then the transactions lead to failure state. Committed state:Commit gives the signal a successful end of the transaction so that any changes executed by the transaction can be safely committed to the database. Or


If the transaction executes successfully then the transaction is in the committed state.
Aborted state:
The aborted state is also called the rollback state. Abort signals an unsuccessful end of the transaction, so that any changes or effects that the transaction may have applied to the database are undone. The abort state leads to 2 possible operations on the failed transaction. They are as follows:
Kill the transaction
Restart the transaction
Kill the transaction:
Killing the transaction is done when the transaction has a logical error; the remedy is reconstructing a new transaction and executing the new transaction.
Restart the transaction:
Restart applies a recovery mechanism after identifying and rectifying the type of error. Examples of recovery mechanisms are:
Write-ahead logging protocol
Shadow paging technique, or shadow database scheme
In the aborted state a transaction may be restarted later, either automatically or after being resubmitted by the user. The transaction state diagram is as follows:
[Figure: transaction state diagram - active, then partially committed, then committed; on an error, active or partially committed moves to failed, and failed moves to aborted.]


Types of Transactions:Transactions are classified into 2 types namely: 1. serial transaction 2. non-serial transaction Serial Transaction:Serial transaction follows the property called serializability. List of operations are operating serially (or) sequence called transaction. Ex:- consider one example followed by read and write operations to perform a serializability i.e., funds transferring from A account to B account . T1: READ (A) A: A-50 WRITE (A) T2: READ (B)


    B := B + 50
    WRITE (B)
Non-serial transaction:
Non-serial transactions are also called concurrent transactions. Transactions executing in parallel (non-serial) order are called non-serial transactions; a non-serial schedule need not be serializable.
Ex: consider an example with read and write operations transferring funds between both accounts:

T1: READ (A)
    A := A - 50
    WRITE (A)
T2: READ (B)
    B := B - 50
    WRITE (B)
T3: READ (B)
    B := B + 50
    WRITE (B)
T4: READ (A)
    A := A + 50


WRITE (A) Schedule:Executing a list of operations in a transaction is called schedule. Schedules are again classified into 2 types they are: serial schedule non serial schedule Serial schedule called serial transaction Non serial schedule called concurrent transactions or non serial transactions Note:All the list of begin and end statements.

Properties of transaction:Every transaction is executed by following 4 properties called ACID properties namely:

A- Atomicity C- Consistency I- Isolation D- Durability. Atomicity:In an atomicity the transaction may be executed successfully or may not be executed successfully.


Consistency:In a DBMS always database is consistent then the result is of a transaction shall consistent in the database. Isolation:DBMS allows the transaction may be executed concurrently but transaction does not know one another. Isolation leads to complication when the transactions are executing in concurrent transactions. Complications are nothing but it gives a loss of data. To recover the loss of data we use Recovery Mechanisms. Recovery mechanisms are 2-phase locking protocol, 2-phase commitment protocol. Durability:If the transaction is executed successfully it is stored in the persistent way i.e., if any failure occurs it does not effect to the result of a transactions.

Implementation of atomicity and durability:Problems occurred in atomicity and durability so we are used to implementing atomicity and durability Atomicity problem:Atomicity problem occurs due to some hardware or software failure or system crashes which means does not execute successfully this leads to the inconsistent database. Durability problem:If atomicity property fails it causes the durability which mean the result does not stored in the persistent way


To overcome the atomicity and durability problems we implement atomicity and durability with the help of recovery mechanisms, namely:
Write-ahead logging protocol.
Shadow paging technique, or shadow database scheme.
Write-ahead logging protocol:
To be able to recover from failures that affect transactions, the system maintains a log to keep track of all transaction operations that affect the values of database items. This information may be needed to permit recovery from failures. The log is kept on disk, so it is not affected by any type of failure except disk or catastrophic failure. In addition, the log is periodically backed up to archival storage (tape) to guard against such disk failures. The write-ahead logging protocol maintains this log file; the log contains all the details needed for the redo and undo operations of each transaction. If a transaction fails while executing, then with the help of the log file we can recover the lost data after finding and rectifying the type of error. Here the type of error means a hardware failure, a system crash, a software failure or improper logic.
Shadow paging technique:
The shadow paging technique is also called the shadow database scheme, and it is one of the recovery mechanisms. Shadow paging maintains 2 copies of the database, namely the old copy and the shadow copy. The shadow copy is also called the new copy, and the old copy is also called the original copy. Every database contains a database pointer.


The database pointer is used to detect and discard errors in the shadow copy. The shadow paging technique does not work with concurrent transactions: it works only with serial transactions, and it does not suit large databases. During transaction execution the old copy is never modified; on failure the database pointer moves back to the old copy, and by doing this we can recover the lost data of the particular transaction. In a multi-user environment with concurrent transactions, logs and checkpoints must be incorporated into the shadow paging technique. If the operations on the new database execute successfully, we delete the old copy of the database.
[Figure: shadow paging - the database pointer moves from the old copy to the shadow (new) copy only on successful completion.]

Concurrency control:Database management systems allow multiple transactions to allow execution of a transaction.


Concurrent transactions have advantages compared with serial transactions. The advantages are:
Increased processing speed and disk utilisation.
Reduced average waiting time for a transaction.
Concurrency control schemes:
To achieve the isolation property, that is, to control the interactions among the concurrent transactions and to provide a consistent database, we use concurrency control schemes.
Eg:
Locking protocols
Timestamp-based protocols
Optimistic methods

Concurrency control mechanism through locking techniques:
A lock is a variable associated with a data item. Locking is a technique for concurrency control; lock information is managed by the lock manager, and every database management system contains a lock manager. The main techniques used to control concurrent execution of transactions are based on the concept of locking.
Types of locks:
Locks are classified into 3 types. They are:
1. binary locks
2. multiple-mode locks
3. 2-phase locking
Binary locking:
Binary locking can be applied at the different levels of granularity.


In a binary lock every lock has 2 states:
Locked state
Unlocked state
The locked state is denoted by the symbol 1, and the unlocked state by the symbol 0. Locking means acquiring the lock; unlocking means releasing the lock. There are 4 levels of granularity, shown below:
1. column level
2. row level
3. page level
4. table level
To control concurrent transactions we use binary locking. A binary lock is simple to implement: represent the lock as a record of the form <data item name, lock state, locking transaction>, plus a queue of transactions waiting for the data item X.
Multiple mode locking (or shared/exclusive locking):
Multiple mode locking is also called shared/exclusive locking or read/write locking. Every database management system contains a lock manager. These locks are of 2 types:
1. Shared lock.
2. Mutually exclusive lock.
Shared lock:
Let the transactions be T1 and T2, with T1 holding a shared lock. When T2 wants to read the data locked by T1, it cannot do so immediately because the item is locked; T2 sends a request to the lock manager, the lock manager grants the shared lock, and then T2 reads the data.


Mutually exclusive lock:If the transactions are T1 and T2.T1 is apply with exclusive lock when T2 is read the data from T1 but it cannot read because it has licked then T2 send request for the lock manager but the lock manager does not releasing to the T2. If the T1 transaction is completed automatically it releases then we read the data from T1 toT2.If does not give permission then T2 is waiting (or) suspended. If LOCK(X) = write locked, the value of locking transactions is a single transaction that holds the exclusive (write) lock on x. If LOCK(X) = read-locked, the value of locking transactions is a list of one (or) more transactions that hold the shared (read) lock on x.

2-phase mode locking:A transaction is said to follow two- phase locking protocol if all the locking operations are (read-lock, write-lock) precede the first unlock operation in the transaction. (Or) To follow two- phase locking protocol in a transaction here the condition is it should be unlock in the first phase. Such a transaction can be divided into 2 phases: 1. growing phase 2. shrinking phase Growing phase:Growing phase is also called as expanding phase. Growing phase is the first phase, during which new locks on items can be acquired but none can be released. (Or)


In the growing phase the transaction acquires locks, but no lock is released.
Shrinking phase:
The shrinking phase is the second phase, during which existing locks can be released but no new locks can be acquired (no new locks can be granted).

[Figure: the number of locks held grows from the transaction's start (growing phase), peaks, and then shrinks until the transaction ends (shrinking phase).]

Note:Locking mechanisms over concurrency control suffer with dead lock problem. Dead lock:Dead lock occurs when each transaction T in a set of 2 (or) more transactions is waiting for some data item that is locked by some other transaction T in the set.


Example:T1 READ LOCK(X) WRITE LOCK (X) READ LOCK(X) WRITE LOCK(X) T2

Deadlock can be handled by using other concurrency control mechanisms like:
1. optimistic concurrency control
2. deadlock prevention protocols
3. validation-based protocols
4. deadlock detection protocols
5. timeouts
6. deadlock avoidance protocols
Serializability:
Serializability means that a list of instructions (operations) executed serially or non-serially produces the same result. Here we use the concept of schedules. If a schedule is serializable then the database is always consistent. Serializability is classified into 2 types:
1. Conflict serializability
2. View serializability


Conflict serializability:Consider 2 transactions TA and TB with the instructions IA and IB and perform read and write operations in the same data item x then it belongs to the conflict serializability. In conflict serializability IA is dependent of IB and IB is dependent of IA. Conflicts are 4 types they are: Read-read conflict.(mostly not occur) 2. Read-write conflict. 3. Write-read conflict. 4. Write-write conflict. To solve conflict serializability we are using testing for serializability with the precedence graph (or) directed graph (or) serializability graph.
1.

A precedence graph can be represented as:

[Figure: precedence graph with two nodes Ta and Tb and a directed edge from Ta to Tb.]

Every precedence graph contains vertices (nodes) and edges (arcs). Transactions are represented as nodes, and conflicting operations are represented as arcs. If the precedence graph is acyclic, the transactions have no cyclic dependency and the schedule is conflict serializable; a cycle in the graph means the schedule is not conflict serializable.
Testing for serializability:


Testing for serializability is nothing but testing the conflicts for serializability.
Read-read conflict:
Consider 2 transactions TA and TB with instructions IA and IB, both reading the same data item X. Two read operations do not change X, so this case mostly does not cause a conflict. Here the data item is the same for the TA and TB transactions.
Read-write conflict:
Consider 2 transactions TA and TB with instructions IA and IB. Suppose IA wants to perform a read operation; it cannot read the data item X until the write operation performed on X by the instruction IB in transaction TB has been ordered with respect to it.
Write-read conflict:
Consider 2 transactions TA and TB with instructions IA and IB. Suppose IA wants to perform a write operation; it cannot write the data item X until the read operation on X by the instruction IB in transaction TB has been performed.

Write write conflict:Consider 2 transactions TA and TB of instructions of IA and IB. suppose IA wants to perform write operation but it cannot write until and unless write operation perform on the data item x by the instruction IB in the transaction TB. View serializability:View serializability is a part of serializability. Here first we are executing all the operations (or) instructions of TA then we are moving to TB else first execute all operations of TB then TA. Example:TA Read (A) TB


TA: Read (A)
    A := A - 50
    Write (A)
    Read (B)
    B := B + 50
    Write (B)
TB: Read (A)
    A := A - 50
    Write (A)
    Read (B)
    B := B + 50
    Write (B)

Timestamp mechanisms to control concurrent transactions:
The timestamp mechanism has a timestamp manager. Timestamp managers are classified into 2 types:
1. Global clock manager.
2. Local clock manager.
The timestamp mechanism is used to decide, when transactions are involved in a deadlock situation, which operations to perform to resolve the deadlock:
1. Abort (or pre-empt) a transaction.
2. Kill a transaction.
The timestamp of each transaction is denoted by the symbol TS(TA). Here,


TS is the timestamp value and TA is transaction A. Every timestamp scheme has 2 properties:
1. Uniqueness.
2. Monotonicity.
Uniqueness:
Every transaction gets a unique timestamp value.
Monotonicity:
Timestamp values always increase.
Example:
If T1 started before T2 then this is noted as TS(T1) < TS(T2). Here T1 is the older transaction and T2 is the younger transaction.
There are 2 schemes to prevent deadlock, which are as follows: wait-die and wound-wait.
Wait-die:
Suppose transaction T1 tries to lock the data item X but is not able to, because X is already locked by T2, and here the rule is: if TS(T1) < TS(T2) then T1 is allowed to wait; otherwise (if T1 is the younger transaction) T1 dies and is restarted later.
Wound-wait:
Suppose T1 tries to lock X but is not able to, because X is locked by T2 with a conflicting lock. Then the rule is:


If TS(T1) < TS(T2) then abort T2 (T1 wounds T2) and restart it later with the same timestamp; otherwise T1 is allowed to wait. Any transaction left waiting outside such rules can still lead to the deadlock phase.
Optimistic concurrency control:
Optimistic concurrency control is one mechanism to control concurrent transactions. It is also called a validation-based protocol. Optimistic concurrency control does not use any locking or timestamp mechanism. In this scheme every transaction is executed in 3 phases:
1. Read phase.
2. Validation phase.
3. Write phase.

Read phase:In this we are reading the database (or) transaction (or) data item and we are performing some computations and its results will be stored in the temporary database (or) copy of the database. Then we have to move to validation phase. Validation phase:In this we are validating the transaction and its results are stored in the temporary database. If the results are positive (or) validation phase by copying the results from temporary to permanent database. If the results are negative (or) validation is negative then the restart (or) reschedule the transaction from the previous phase that mean read phase.


Write phase:In the write phase we are writing the transaction in permanent database when the validation is positive. Recovery management:Recovery management means recovery mechanism to recover the loss of data throw shadow copy technique (or) shadow paging technique and write ahead logging protocol. Write ahead logging protocol:To be able to recover from failure that affects transactions, the system maintains a log to keep track of all transaction operations that affect the values of database items. This information may be needed to permit recovery from failures. . The log is kept on disk, so it is not affected by ant type of failure except for disk (or) catastrophic failure. In addition, the log is periodically backed up to archival storage (tape) to guard against such disk failures. Log is protocol used for the recovery mechanism which maintains a log file. Log contains all the details of the transaction redo and undo operations. If our transaction got failure while we executing, no problem with the help of log file we can recovery the loss of data after find outing and rectifying the type of errors. Here the type of error means hardware (or) system crashes (or) software failure (or) improper logic. Shadow paging technique:Shadow paging technique is also called as shadow database scheme. Shadow paging technique is one of the recovery mechanisms. Shadow paging techniques maintain 2 databases namely old copy and shadow copy. Shadow copy is also called as new copy and old copy is also called as original copy.

DATABASE MANAGEMENT SYSTEMS PREPARED BY II CSE(2010-2014) CBIT, PRODDATUR R09 batch (Guidance by J UMA MAHESH, CBIT CSE, DEPT.)

Every database contains the database pointer. Database pointer used to point the error in a shadow copy. Shadow paging technique is not wormed with concurrent transactions. It means it is worked with only serial transactions. It does not allow large database. During transaction execution the shadow copy is never modified. The database pointer which moves back to old copy. By doing this we can recovery the loss of data of a particular transaction. In a multi user environment concurrent transactions logs and check points must be incorporated into a shadow paging technique. If our operations in a new database executed in successfully there we are deleting old copy of the data base. The diagrammatical representation of an shadow paging technique is:-

Multiple Granularities:Multiple locking is a locking mechanism to apply locking on different data items at a time where as, in normal locking it is not possible to lock different data items of different transactions. So, normal locking is meant for to lock single data item at a time.


Multiple granularity is defined as applying locking at different levels. Here the different levels are the database, tables, columns, pages, rows, etc. Multiple granularity locking is divided into 2 methods:
1. fine granularity
2. coarse granularity
Fine granularity:
Fine granularity applies locking from the bottom upwards; here the performance for concurrent transactions is high, and a high degree of locking is available.
Coarse granularity:
Coarse granularity applies locking from the top downwards; here the performance for concurrent transactions is lower, and a lower degree of locking is available.

Unit-VII
RECOVERY SYSTEM
1. RECOVERY AND ATOMICITY
2. LOG BASED RECOVERY
3. RECOVERY WITH CONCURRENT TRANSACTIONS
4. BUFFER MANAGEMENT
5. FAILURE WITH LOSS OF NON-VOLATILE STORAGE
6. ADVANCED RECOVERY TECHNIQUES
7. REMOTE BACKUP SYSTEMS
RECOVERY & ATOMICITY:


Consider 2 transactions on 2 accounts, A and B; we transfer funds from account A to account B. Initially A contains 700 and B contains 500. Now the user performs a transaction moving Rs.80 from account A to account B. If a power failure or any other error hits that transaction midway, the database is left in an inconsistent state: the transaction is stuck in an intermediate state and Rs.80 is not added to account B.
Whether the user re-executes the interrupted work or does not re-execute it, the database can still end up inconsistent, because the transaction was never committed. So we use a recovery technique to bring the database from the inconsistent state back to a consistent state. Recovery techniques provide recovery mechanisms like log-based recovery, shadow paging, checkpoints, the write-ahead logging protocol, etc.
LOG-BASED RECOVERY:
A log is a file which maintains all the details of transactions. The log is the most widely used structure for recording database modifications. The log is a sequence of log records and maintains a history of all update activities in the database. The log maintains details like:
* Transaction identifier
* Data-item identifier
* Old value
* New value
The transaction identifier refers to the transaction which executes the write operation. The data-item identifier refers to the data item written to disk, that is, the location of the data item on disk. The old value refers to the value of the data item before the write operation, and the new value refers to its value after the write operation. Log records that are useful for recovery from system and disk failures must reside on stable storage. Log records record the significant activities associated with database transactions, such as the beginning of a transaction and the aborting or committing of a transaction:
<T begin>


This refers to the beginning of transaction T.
<T, x, v1, v2>
This refers to transaction T performing a write operation on the data item x; the values of x before and after the write operation are v1 and v2 respectively.
<T commit>
This refers to the commit of transaction T.
The basic operations of the log are redo and undo. Through the log we can recover lost data with 2 tables:
1. Transaction table: it contains all the new values (updated operations) of a transaction.
2. Dirty page table: it contains all the old values (outdated operations) of a transaction.
RECOVERY WITH CONCURRENT TRANSACTIONS:
1. Log-based recovery techniques are used for handling concurrent transactions.
2. Every such system not only deals with the execution of concurrent transactions but also contains a log and a disk buffer.
3. In case of transaction failure, an undo operation is performed during the recovery phase; undo is also called backward scanning (redo is also called forward scanning).
4. Consider an example: suppose the value of a data item x has been changed or modified by T1; then for restoring its value some log-based recovery technique can be used.
5. In this technique the old value of the data item x is recovered by using the undo operation.
6. Only a single transaction should update a data item at any time, that is, the next update comes only after the previous transaction commits.


Note: Multiple transactions can perform different update operations under the 2-phase locking protocol.
TRANSACTION ROLLBACK:
Every transaction can be rolled back by using 2 operations:
1. undo
2. redo
Every record in the log is of the form <T, x, old value, new value>, for example <T, x, 15, 25>.
For example, consider 2 transactions T1 and T2 with the log records <T1, p, 25, 35> and <T2, q, 35, 45>. The two transactions T1 and T2 have modified their data items' values; hence a backward scan of the log assigns the value 25 to p, and if a forward scan is done then the value of p is 35. During the rollback of a transaction, if a data item has been updated by T1 then no other transaction can perform an update operation on it.
CHECKPOINTS:
1. A checkpoint is a recovery mechanism.
2. Checkpoints are synchronization points between the database and the log file, used for reducing the time taken to recover the system.
3. On a system crash during a particular transaction, without checkpoints the entire log has to be searched to determine which transactions need the undo or redo operations.
4. Scanning the full log causes 2 major problems:
(a) searching for a record in the entire log is time-consuming;
(b) the redo operation takes a longer time.
5. A checkpoint refers to a sequence of actions:
(a) every log record currently available inside main memory is moved to stable storage;
(b) a log record of the form <checkpoint L> is written.


Note: During recovery, the number of records in the log that the system must scan can be minimized by using checkpoints.
BUFFER MANAGEMENT:
Buffer management uses 2 techniques to ensure consistency and to reduce the interactions of transactions with the disk. The 2 techniques are:
1. Log record buffering.
2. Database buffering.
Log record buffering:
1. As log records are created, they are buffered and eventually output to stable storage.
2. Log records are written to stable storage in units called blocks.
3. At the physical level these blocks are mostly larger than individual records.
4. A data block at the physical level may therefore be the result of several different output operations.
5. Partially written blocks from several different output operations could confuse the recovery mechanism.
6. So additional requirements are used while storing log records in stable storage.
7. The additional requirements are: a transaction is regarded as committed only after its <T commit> record is stored in stable storage;
8. and all the log records associated with a data block must be in stable storage before that block is written out.
Holding log records in memory and writing them to stable storage under these rules is called log record buffering.


DATABASE BUFFERING:1. The database is stored in the non-volatile storage disk then we can transfer needed data blocks from the system to main memory. 2. Typically the size of main memory is less than the entire database. 3. Applying requirements or rules on the block of data called database buffering. 4. Consider 2 data blocks B1 & B2 then by considering certain rules, the log records are given to the stable storage. 5. Thisrules restrict the freedom to system to provide blocks of data to main memory that is before bringing the block 2 into the main memory B1 has taken out from the main memory into the disk then operations are possible on block B2. 6. In database buffering the following actions take place (a). The log records are brought into the stable storage. (b).The block B1 is brought onto the disk after its operations. (c).From the disk the block B2 is brought into main memory. FAILURES WITH LOSS OF NON VOLATILE STORAGE:1. The information present in the volatile storage gets lost whenever a system crash occurs. 2. But the loss of information in a non-volatile storage is very rare. 3. To avoid this type of failure certain techniques to be considered. 4. One technique is to be dump the entire data base information onto the stable storage. 5. When a system crashes the information present in physical data-base get lost. 6. In order to bring the data-base back to consistant state the dump is used for restoring the information by the system.


7. During the processing of a dump no transaction must be in progress.
8. The dump follows these additional requirements:
All the log records that are present in memory must be stored into the stable storage.

All the database information is copied onto the stable storage.
A log record of the form <dump> is stored into the stable storage.
NOTE: If a failure results in the loss of information residing in non-volatile storage, then with the help of the dump procedure the database can be restored back to disk by the system.
Disadvantages:

1. While the dump is being processed, no transaction is allowed to run.
2. The dump procedure is expensive because a huge amount of data is copied onto the stable storage.
ADVANCED RECOVERY TECHNIQUES:
An advanced recovery technique is a combination of several techniques: undo-phase and redo-phase techniques like transaction rollback, the write-ahead logging protocol, checkpoints, etc.
ARIES:
ARIES is a recovery algorithm. It is based on the write-ahead log protocol. It supports undo and redo operations. It stores the related log records in stable storage. It supports concurrency control protocols,


for example optimistic concurrency control and timestamp-based concurrency control. It maintains a dirty page table and a transaction table. It has three phases:
a. Analysis phase, in which the following recovery steps are performed:
1. determining the log records in stable storage;
2. determining the checkpoints;
3. determining the dirty page table and the transaction table.
b. Redo phase: forward scanning of the log.
c. Undo phase: backward scanning of the log.
Another feature is support for locking at different levels, using the multiple granularity technique.
REMOTE BACKUP SYSTEMS:
1. There is a need for designing transaction processing systems that can continue even if the system fails or crashes due to natural disasters like earthquakes, floods, etc.
2. That is possible by giving the transaction processing system the property called a high degree of availability.
3. Availability can be defined as the same copy of the data being present at the primary site as well as at secondary sites.
4. A remote backup system is thus defined as maintaining the same copy of data at several sites; even if a site fails, we can continue our transactions with the backup sites.
5. While designing a remote backup system we should consider the following major issues:
a. Failure detection
b. Recovery time


c. Transfer of control
d. Commit time

a. FAILURE DETECTION: The failure of the primary site must be detected by the remote backup system; this is possible through good network communication.
b. RECOVERY TIME: Using checkpoints we can reduce the recovery time. The log size at the remote system also affects the recovery time: if the size of the log increases, then the time taken to perform recovery also increases.
c. TRANSFER OF CONTROL: Transfer of control can be defined as the remote backup system taking over as the primary site.

d. COMMIT TIME: After completion, every transaction must be committed (and its commit made durable at the backup); otherwise the database is left inconsistent, meaning durability is not achieved.


UNIT-VIII
FILE STRUCTURE AND INDEXING
Overview of physical storage media - magnetic disks - RAID - tertiary storage - storage access - file organization - organization of records in files - data dictionary storage - basic concepts of indexing - ordered indices - B+-tree index files - B-tree index files - multiple key access - static hashing - dynamic hashing - comparison of ordered index & hashing - bit map indices - indexed sequential access methods (ISAM).

Overview of Physical Storage Media:
Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability.

Cache: the most costly and fastest form of storage; usually very small, and managed by the hardware.
Main Memory (MM): the storage area for data available to be operated on.


General-purpose machine instructions operate on main memory. The contents of main memory are usually lost in a power failure or "crash". Main memory is usually too small (even with megabytes) and too expensive to store the entire database.

Flash memory: EEPROM (electrically erasable programmable read-only memory). Data in flash memory survive power failure. Reading data from flash memory takes about 10 nanoseconds (roughly as fast as main memory), but writing is more complicated: a write takes about 4-10 microseconds, and to overwrite what has been written, one has to first erase an entire bank of the memory. Flash supports only a limited number of erase cycles. It has found popularity as a replacement for disks for storing small volumes of data (5-10 megabytes).

Magnetic-disk storage: the primary medium for long-term storage. Typically the entire database is stored on disk. Data must be moved from disk to main memory in order to be operated on; after operations are performed, data must be copied back to disk if any changes were made. Disk storage is called direct-access storage because it is possible to read data on the disk in any order (unlike sequential access). Disk storage usually survives power failures and system crashes.

Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disks (for archival storage of data), and jukeboxes (containing a few drives and numerous disks loaded on demand).

Tape storage: used primarily for backup and archival data. Cheaper, but much slower access, since tape must be read sequentially from the beginning.


Tape is also used as protection against disk failures! The storage-device hierarchy is presented in Figure 10.1: the higher levels are more expensive (cost per bit) and faster (access time), but their capacity is smaller.

Figure 10.1: Storage-device hierarchy

Magnetic disk:


Read-write head:
- Positioned very close to the platter surface (almost touching it).
- Reads or writes magnetically encoded information.
The surface of a platter is divided into circular tracks:
- Over 50K-100K tracks per platter on typical hard disks.
Each track is divided into sectors:
- Sector size is typically 512 bytes.
- Typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks).
To read/write a sector:
- The disk arm swings to position the head on the right track.
- The platter spins continually; data is read/written as the sector passes under the head.
Head-disk assemblies:
- Multiple disk platters on a single spindle (usually 1 to 5).
- One head per platter, mounted on a common arm.
Cylinder i consists of the ith track of all the platters. Earlier-generation disks were susceptible to head crashes, leading to loss of all data on the disk; current-generation disks are less susceptible to such disastrous failures, but individual sectors may still get corrupted.
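As a back-of-the-envelope illustration of this geometry, the sketch below computes a disk's capacity and the cost of one random sector read. All figures are assumptions chosen to fall inside the ranges quoted above, not the specification of any real drive.

    # Illustrative disk geometry (assumed values, one recording surface per platter).
    platters = 4
    tracks_per_surface = 60_000       # within the 50K-100K range quoted above
    sectors_per_track = 750           # between 500 (inner) and 1000 (outer)
    sector_size = 512                 # bytes

    capacity = platters * tracks_per_surface * sectors_per_track * sector_size
    print(f"capacity ~= {capacity / 10**9:.1f} GB")

    # Random sector read = seek + rotational latency + transfer time.
    seek_ms = 9.0                     # assumed average seek
    rpm = 7200
    rotation_ms = 60_000 / rpm        # one full revolution
    latency_ms = rotation_ms / 2      # on average, half a revolution
    transfer_ms = rotation_ms / sectors_per_track
    print(f"random sector read ~= {seek_ms + latency_ms + transfer_ms:.2f} ms")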


RAID: Redundant Arrays of Independent Disks
Disk organization techniques that manage a large number of disks, providing a view of a single disk of:
- high capacity and high speed, by using multiple disks in parallel, and
- high reliability, by storing data redundantly, so that data can be recovered even if a disk fails.

Improvement of Reliability via Redundancy

Redundancy: store extra information that can be used to rebuild information lost in a disk failure.

E.g., Mirroring (or shadowing):
- Duplicate every disk: a logical disk consists of two physical disks.
- Every write is carried out on both disks.
- Reads can take place from either disk.
- If one disk in a pair fails, the data is still available on the other.
RAID Levels:
- RAID Level 0: Block striping; non-redundant.
- RAID Level 1: Mirrored disks with block striping.
- RAID Level 2: Memory-style Error-Correcting Codes (ECC) with bit striping.
- RAID Level 3: Bit-interleaved parity.
- RAID Level 4: Block-interleaved parity.


- RAID Level 5: Block-interleaved distributed parity; partitions data and parity among all N + 1 disks, rather than storing data on N disks and parity on 1 disk.
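The parity used in RAID levels 3-5 is plain bitwise XOR, which is what makes single-disk recovery possible: XOR-ing the surviving blocks with the parity block reproduces the lost block. A minimal sketch, with three hypothetical data disks plus one parity disk:

    from functools import reduce

    def parity(blocks):
        # XOR the blocks byte by byte: parity = b1 ^ b2 ^ ... ^ bN
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    d1, d2, d3 = b"\x01\x02", b"\x10\x20", b"\xff\x00"   # data blocks on 3 disks
    p = parity([d1, d2, d3])                             # stored on a 4th disk

    # Disk 2 fails: its block is the XOR of the survivors and the parity block.
    rebuilt = parity([d1, d3, p])
    assert rebuilt == d2
    print("recovered:", rebuilt)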



In summary, RAID is a disk organization that takes advantage of large numbers of inexpensive, mass-market disks. The main idea is improvement of reliability via redundancy, i.e., storing extra information that can be used to rebuild information lost in a disk failure, e.g., by mirroring (shadowing), where each logical disk consists of two physical disks. The different RAID levels (0-6) have different cost, performance, and reliability characteristics.

Storage Access
A database file is partitioned into fixed-length storage units called blocks (or pages). Blocks/pages are the units of both storage allocation and data transfer. The database system seeks to minimize the number of block transfers between disk and main memory; transfers can be reduced by keeping as many blocks as possible in main memory.
Buffer Pool: the portion of main memory available to store copies of disk blocks.
Buffer Manager: the system component responsible for allocating and managing buffer space in main memory. A program calls on the buffer manager when it needs a block from disk.


The requesting program is given the address of the block in main memory if the block is already present in the buffer. If the block is not in the buffer, the buffer manager allocates space in the buffer for it, replacing (throwing out) some other block if necessary to make space. The block that is thrown out is written back to disk only if it was modified since the most recent time it was written to or fetched from disk. Once space is allocated in the buffer, the buffer manager reads the block from disk into the buffer and passes the address of the block in main memory to the requesting program.

Buffer Replacement Policies
Most operating systems replace the least recently used block (LRU strategy).
- LRU uses the past references to a block as a predictor of future references.
- Queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user's query to predict future references.
- LRU can be a bad strategy for certain access patterns involving repeated sequential scans of data files.
- A mixed strategy, with hints on replacement provided by the query optimizer, is preferable (based on the query processing algorithm(s) in use).

Pinned block: a memory block that is not allowed to be written back to disk.
Toss-immediate strategy: frees the space occupied by a block as soon as the final record (tuple) of that block has been processed.


Most recently used (MRU) strategy: the system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned and becomes the most recently used block. The buffer manager can also use statistical information regarding the probability that a request will reference a particular relation; e.g., the data dictionary is frequently accessed, so data dictionary blocks are kept in the main-memory buffer.
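A minimal sketch of a buffer manager along the lines described above: LRU replacement, pinned blocks that may not be evicted, and write-back of modified (dirty) blocks only. The class and its interface are invented for illustration; `disk` is a plain dict standing in for block I/O.

    from collections import OrderedDict

    class BufferPool:
        def __init__(self, capacity, disk):
            self.capacity, self.disk = capacity, disk
            self.frames = OrderedDict()     # block_id -> [data, pinned, dirty]

        def get(self, block_id):
            if block_id in self.frames:
                self.frames.move_to_end(block_id)     # now most recently used
                return self.frames[block_id][0]
            if len(self.frames) >= self.capacity:     # evict the LRU unpinned block
                for victim, (data, pinned, dirty) in self.frames.items():
                    if not pinned:
                        if dirty:                     # write back only if modified
                            self.disk[victim] = data
                        del self.frames[victim]
                        break
                else:
                    raise RuntimeError("all blocks pinned")
            self.frames[block_id] = [self.disk[block_id], False, False]
            return self.frames[block_id][0]

        def pin(self, block_id):   self.frames[block_id][1] = True
        def unpin(self, block_id): self.frames[block_id][1] = False

    disk = {i: f"block-{i}" for i in range(10)}
    pool = BufferPool(capacity=2, disk=disk)
    pool.get(0); pool.get(1); pool.get(2)    # block 0 (least recently used) evicted
    print(list(pool.frames))                 # [1, 2]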

File Organization
A database is stored as a collection of files. Each file is a sequence of records, and a record is a sequence of fields. The simplest approaches to organizing records in files assume that:
- the record size is fixed,
- each file has records of one particular type only, and
- different files are (typically) used for different relations.

Fixed-Length Records
Simple approach: store record i starting at byte n x (i - 1), where n is the size of each record. Record access is simple, but records may span blocks.
Deletion of record i (alternatives to avoid fragmentation):
- move records i + 1, ..., n to positions i, ..., n - 1;
- move record n to position i; or
- link all free records in a free list.
Free Lists
- Maintained in the block(s) that contain deleted records.
- Store the address of the first deleted record in the file header.

- Use this first record to store the address of the second available record, and so on. One can think of these stored addresses as pointers, because they "point" to the location of a record (a linked list).
- A more space-efficient representation: reuse the space for the normal attributes of free records to store the pointers (i.e., no pointers are stored in in-use records).
- This requires careful programming: dangling pointers occur if a record is moved or deleted while another record contains a pointer to it; that pointer then no longer points to the desired record.
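A small sketch of both ideas together: fixed-length slots at byte n x (i - 1), and a free list threaded through the deleted records themselves. The record layout and field names are assumptions for illustration.

    import struct

    REC = struct.Struct("<10sI")    # name (10 bytes) + id; n = REC.size
    NIL = 0xFFFFFFFF                # "end of free list" marker

    class RecordFile:
        def __init__(self):
            self.data = bytearray()
            self.free_head = NIL    # would live in the file header on disk

        def insert(self, name, rid):
            if self.free_head != NIL:    # reuse a deleted slot from the free list
                slot = self.free_head
                self.free_head = struct.unpack_from("<I", self.data,
                                                    slot * REC.size)[0]
            else:                        # otherwise append at the end of the file
                slot = len(self.data) // REC.size
                self.data += bytes(REC.size)
            REC.pack_into(self.data, slot * REC.size, name.encode(), rid)
            return slot

        def delete(self, slot):
            # store the old free-list head inside the dead record's own space
            struct.pack_into("<I", self.data, slot * REC.size, self.free_head)
            self.free_head = slot

    f = RecordFile()
    a = f.insert("Alice", 1); b = f.insert("Bob", 2)
    f.delete(a)
    print(f.insert("Carol", 3) == a)    # True: the deleted slot is reused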

Variable-Length Records Variable-length records arise in database systems in several ways: Storage of multiple record types in a file. Record types that allow variable length for one or more fields (e.g., varchar)

Approaches to storing variable-length records:
1. End-of-record marker:
- difficult to reuse the space of deleted records (fragmentation);
- no room for a record to grow (e.g., due to an update), which then requires moving it.
2. Field delimiters (e.g., a $): Field1 $ Field2 $ Field3 $ Field4 $

- requires a scan of the record to get to the n-th field value;
- requires a special field value to represent a null.
3. Each record has an array of field offsets: [off1, off2, off3, off4] Field1 Field2 Field3 Field4
+ For the overhead of the offsets, we get direct access to any field.
+ Null values: the begin and end pointers of a field point to the same address.

Variable-length records typically cause problems on (record) attribute modifications:
- growth of a field requires shifting all other fields;
- a modified record may no longer fit into its block (leave a forwarding address at the old location);
- a record can span multiple blocks.

Block formats for variable-length records: each record is identified by a record identifier (rid, or tuple identifier, tid). The rid/tid contains the block number and the position within the block; this simplifies shifting a record from one position/block to another. Allocation of records in a block is based on a slotted block structure:

[ Block header: #entries | end of free space | (location, size) per record ] ... free space ... [R4][R3][R2][R1]

- The block header contains the number of record entries, the end of the free space in the block, and the location and size of each record.
- Records are inserted from the end of the block.
- Records can be moved around within the block to keep them contiguous.
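A minimal sketch of such a slotted page (deletion and compaction are omitted): the slot directory grows from the front, records are inserted from the back, and the slot number handed out stays valid even if the record bytes later move within the page.

    class SlottedPage:
        def __init__(self, size=4096):
            self.size = size
            self.slots = []              # slot i -> (offset, length)
            self.data = bytearray(size)
            self.free_end = size         # records grow downward from here

        def insert(self, record: bytes):
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.slots.append((self.free_end, len(record)))
            return len(self.slots) - 1   # slot number; (block#, slot#) forms a RID

        def get(self, slot):
            off, length = self.slots[slot]
            return bytes(self.data[off:off + length])

    page = SlottedPage()
    rid = page.insert(b"tuple-1")
    page.insert(b"tuple-2")
    print(page.get(rid))                 # b'tuple-1'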

RID-Concept:
(Figure: a record with RID 4711/2 lives in slot 2 of block 4711. If an update makes the record too large for its block, the record is moved to another block and the old slot stores a forwarding RID, e.g. 5013/1, so the record's externally visible RID stays stable.)

Organization of Records in Files
Requirements: a file must (efficiently) support:
- insert/delete/update of a record;
- access to a record (typically using its rid);
- a scan of all records.

Ways to organize blocks (pages) in a file:
- Heap file (unsorted file): the simplest file structure; contains records in no particular order; a record can be placed anywhere in the file where there is space.
- Sequential file: records are stored in sequential order, based on the value of the search key of each record.
- Clustered index: related to sequential files; discussed later under Indexes.

Heap File Organization
- At DB run time, pages/blocks are allocated and deallocated.
- Information to maintain for a heap file includes the pages, the free space on pages, and the records on a page.
- A typical implementation is based on doubly linked lists of pages, starting with a header block. Two lists can be associated with the header block: (1) a full-page list, and (2) a list of pages having free space.

Sequential File Organization
- Suitable for applications that require sequential processing of the entire file.
- The records in the file are ordered by a search key.
- Deletions of records are managed using pointer chains.
- Insertions must locate the position in the file where the record is to be inserted: if there is free space, insert the record there; if not, insert the record in an overflow block. In either case, the pointer chain must be updated.
- If there are many record modifications (in particular insertions and deletions), the correspondence between search-key order and physical order can be totally lost, requiring file reorganization.


Topics: Basic Concepts - Ordered Indices - B+-Tree Index Files - B-Tree Index Files - Static Hashing - Dynamic Hashing - Comparison of Ordered Indexing and Hashing - Index Definition in SQL - Multiple-Key Access

Data Dictionary Storage:
The data dictionary (also called the system catalog) stores metadata, that is, data about data, such as:
- Information about relations: names of relations; names and types of the attributes of each relation; names and definitions of views; integrity constraints.
- User and accounting information, including passwords.
- Statistical and descriptive data: number of tuples in each relation.
- Physical file organization information:


- how the relation is stored (sequential/hash/...);
- the physical location of the relation.
- Information about indices.
Catalog structure: a relational representation on disk, plus specialized data structures designed for efficient access in memory. A possible catalog representation:
Relation_metadata = (relation_name, number_of_attributes, storage_organization, location)
Attribute_metadata = (attribute_name, relation_name, domain_type, position, length)
User_metadata = (user_name, encrypted_password, group)
Index_metadata = (index_name, relation_name, index_type, index_attributes)
View_metadata = (view_name, definition)
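Real systems expose the catalog itself as relations. As a concrete (if simpler) analogue of the layout sketched above, SQLite keeps its metadata in a catalog table named sqlite_master, which can be queried like any other table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, balance REAL)")
    con.execute("CREATE INDEX idx_balance ON account(balance)")

    # The catalog stores metadata about relations and indices as ordinary rows.
    for type_, name, sql in con.execute("SELECT type, name, sql FROM sqlite_master"):
        print(type_, name, "->", sql)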

Indexing: Basic Concepts Indexing mechanisms are used to speed up access to desired data. E.g., author catalog in library

Search key: an attribute or set of attributes used to look up records in a file. An index file consists of records (called index entries) of the form (search-key, pointer).

Index files are typically much smaller than the original file Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order


Hash indices: search keys are distributed uniformly across buckets using a hash function.

Index definition: an index is a file whose records (entries) have the structure (search-key, pointer or set of pointers to data records). When the search key is unique the index has fixed-size records; otherwise it may have variable-size records. What is the pointer? A block number or a RID!
Index Evaluation Metrics (indexing techniques are evaluated on the basis of):
- Access types supported efficiently, e.g., records with a specified value in an attribute, or records with an attribute value falling in a specified range of values.
- Access time
- Insertion time
- Deletion time
- Space overhead
Ordered Indices:

In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library. Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.

Also called clustering index



The search key of a primary index is usually but not necessarily the primary key.

Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index. Index-sequential file: ordered sequential file with a primary index. Classes of Indexes:

- Unique/non-unique: whether the search key is unique or non-unique.
- Dense/non-dense: every/not every search key (every record, for a unique key) in the data file has a corresponding pointer in the index.
- Clustered/non-clustered: the order of the index search key is equal/not equal to the file order.
- Primary/secondary: the search key is equal/not equal to the primary key; also, the answer is a single record/a set of data-file records. (Note the difference in the definition of "primary index" here vs. Silberschatz.)
Note: for a unique key, a dense index indexes every record; a non-dense index does not index every record.
Note: a non-dense index must be clustered!
Dense Clustered Index File:
Dense index: an index record appears for every search-key value in the file. (Note: for every search key vs. for every record.)


Sparse Index Files: Sparse Index: contains index records for only some search-key values. Applicable when records are sequentially ordered on search-key

To locate a record with search-key value K we:

Find the index record with the largest search-key value <= K.

Search the file sequentially, starting at the record to which that index record points.
- Less space and less maintenance overhead for insertions and deletions.
- Generally slower than a dense index for locating records.
- Good tradeoff: a sparse index with one index entry per block of the file, corresponding to the least search-key value in the block. This is the basis of the ISAM index.
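A minimal sketch of a sparse-index lookup under these rules, with one index entry per block holding the block's least search-key value:

    from bisect import bisect_right

    blocks = [[2, 5, 9], [12, 15, 18], [21, 30, 31]]     # sequentially ordered file
    sparse_index = [blk[0] for blk in blocks]            # least key of each block

    def lookup(key):
        # find the index entry with the largest search-key value <= key ...
        i = bisect_right(sparse_index, key) - 1
        if i < 0:
            return None
        # ... then scan that block sequentially
        return key if key in blocks[i] else None

    print(lookup(15))   # 15: the index points into block 1, the scan finds it
    print(lookup(16))   # None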


Multilevel Index: If primary index does not fit in memory, access becomes expensive. To reduce number of disk accesses to index records, treat primary index kept on disk as a sequential file and construct a sparse index on it. outer index a sparse index of primary index inner index the primary index file If even outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion:


If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index as well.
Single-level index deletion:
- Dense indices: deletion of the search key is similar to file record deletion.
- Sparse indices: if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion:
Single-level index insertion: perform a lookup using the search-key value appearing in the record to be inserted.
- Dense indices: if the search-key value does not appear in the index, insert it.
- Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In that case, the first search-key value appearing in the new block is inserted into the index.
Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.
Secondary Indices:

Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition.

Example 1: In the account database stored sequentially by account number, we may want to find all accounts in a particular branch

Example 2: as above, but where we want to find all accounts with a specified balance or range of balances.
We can have a secondary index with an index record for each search-key value; the index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
Primary and Secondary Indices:
- Secondary indices have to be dense.
- Indices offer substantial benefits when searching for records, but when a file is modified, every index on the file must be updated; updating indices imposes overhead on database modification.
- A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive: each record access may fetch a new block from disk, since the index is usually unclustered.
B+-Tree Index Files:
A B+-tree is a rooted tree satisfying the following properties:
- All paths from the root to a leaf are of the same length.

- Each node that is neither the root nor a leaf has between ⌈n/2⌉ and n children.
- A leaf node holds between ⌈(n-1)/2⌉ and n-1 values.

Special cases: If the root is not a leaf, it has at least 2 children.



If the root is a leaf (that is, there are no other nodes in the tree), it can hold between 0 and n-1 values.

B+-Tree Node Structure:

A typical node has the layout [P1, K1, P2, K2, ..., Pn-1, Kn-1, Pn], where:
- the Ki are the search-key values;
- the Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).

The search-keys in a node are ordered

K1 < K2 < K3 < ... < Kn-1
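A minimal sketch of B+-tree search using this node layout (the Node class is invented for illustration; real nodes would live in disk blocks): at each non-leaf node we follow the child whose key range covers the search key, then scan the leaf.

    from bisect import bisect_right

    class Node:
        def __init__(self, keys, children=None, records=None):
            self.keys = keys
            self.children = children      # non-leaf: one more child than keys
            self.records = records        # leaf: one record (or bucket) per key

    def search(node, key):
        while node.children is not None:              # descend to a leaf
            node = node.children[bisect_right(node.keys, key)]
        if key in node.keys:
            return node.records[node.keys.index(key)]
        return None

    leaf1 = Node([10, 20], records=["r10", "r20"])
    leaf2 = Node([30, 40], records=["r30", "r40"])
    root = Node([30], children=[leaf1, leaf2])
    print(search(root, 40))    # 'r40'
    print(search(root, 25))    # None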

(Figure: a B+-tree for the account file with n = 3; note that an index entry points to the FIRST key of its subtree.)
B+-Tree File Organization:

Index file degradation problem is solved by using B+-Tree indices. Data file degradation problem is solved by using B+-Tree File Organization.

The leaf nodes in a B+-tree file organization store records, instead of pointers.

Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node. Leaf nodes are still required to be half full.

Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

It is the simplest way to implement a sequential file organization: no need for overflow blocks!

Good space utilization is important, since records use more space than pointers. To improve space utilization, involve more sibling nodes in redistribution during splits and merges: involving 2 siblings in redistribution (to avoid a split/merge where possible) results in each node being at least about two-thirds full.
B-Tree Index Files:
Similar to a B+-tree, but a B-tree allows search-key values to appear only once, eliminating the redundant storage of search keys. Search keys in non-leaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a non-leaf node must be included.

Generalized B-tree leaf node

Nonleaf node pointers Bi are the bucket or file record pointers.

Advantages of B-Tree indices:

May use fewer tree nodes than a corresponding B+-tree.

Sometimes possible to find search-key value before reaching leaf node.

Disadvantages of B-Tree indices: only a small fraction of all search-key values are found early.

- Non-leaf nodes are larger, so fan-out is reduced (because of the extra pointer!); thus B-trees typically have greater depth than the corresponding B+-tree.
- Insertion and deletion are more complicated than in B+-trees.
- Implementation is harder than for B+-trees.

Reading in sorted order is hard

Typically, the advantages of B-trees do not outweigh the disadvantages.

Hash Indices: Hashing can be used not only for file organization, but also for index-structure creation.

A hash index organizes the search keys, with their associated record pointers, into a hash file structure.

Strictly speaking, hash indices are always secondary indices if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary. However, we use the term hash index to refer to both secondary index structures and hash organized files.

I don't agree! You may have a hash index as the primary index while the file is clustered on another key, or is a heap file.

Note the difference between a hash index and a hash file: whether the buckets contain pointers to the data records or the records themselves (the same as the difference between a B+-tree index and a B+-tree file organization!).
Deficiencies of Static Hashing:

In static hashing, the function h maps search-key values to a fixed set B of bucket addresses. But databases grow with time: if the initial number of buckets is too small, performance will degrade due to too many overflows. If the file size at some future point is anticipated and the number of buckets is allocated accordingly, a significant amount of space will be wasted initially; if the database shrinks, space will again be wasted. One option is periodic reorganization of the file with a new hash function, but that is very expensive.

These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically. Summary of Static Hashing:

Advantages: Simple.

Very fast (1-2 accesses).

Disadvantages: direct access only (no ordered access!); non-dynamic, needs reorganization.

Dynamic Hashing: good for a database that grows and shrinks in size; allows the hash function to be modified dynamically.

Extendable hashing is one form of dynamic hashing.

The hash function generates values over a large range, typically b-bit integers with b = 32.

At any time use only a prefix of the hash function to index into a table of bucket addresses.

Let the length of the prefix be i bits, 0 <= i <= 32. The bucket address table size is 2^i; initially i = 0. The value of i grows and shrinks as the size of the database grows and shrinks.

Multiple entries in the bucket address table may point to a bucket.

Thus, the actual number of buckets is <= 2^i, and the number of buckets also changes dynamically due to coalescing and splitting of buckets.
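A minimal sketch of the directory arithmetic (bucket splitting and coalescing are omitted): keys are hashed to a 32-bit value and only the leading i bits select a directory entry, so several directory entries can share one bucket.

    import hashlib

    B = 32                      # bits produced by the hash function

    def h(key, bits):
        full = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:4], "big")
        return full >> (B - bits)          # i-bit prefix of the 32-bit hash value

    i = 2                                  # global depth: directory has 2**i entries
    directory = [0, 0, 1, 2]               # 4 entries, but only 3 distinct buckets
    buckets = {0: [], 1: [], 2: []}

    for key in ["Perryridge", "Mianus", "Redwood", "Downtown"]:
        buckets[directory[h(key, i)]].append(key)
    print(buckets)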

Multiple-Key Access: Use multiple indices for certain types of queries. Example:

select account-number
from account
where branch-name = "Perryridge" and balance = 1000

Possible strategies for processing the query using indices on single attributes:
1. Use the index on branch-name to find the accounts of the Perryridge branch; test balance = 1000 on each.
2. Use the index on balance to find accounts with balances of $1000; test branch-name = "Perryridge" on each.
3. Use the branch-name index to find pointers to all records pertaining to the Perryridge branch; similarly use the index on balance; take the intersection of both sets of pointers obtained.
Indices on Multiple Attributes:
Suppose we have an index on the combined search key (branch-name, balance).

With the where clause
where branch-name = "Perryridge" and balance = 1000
the index on the combined search key will fetch only records that satisfy both conditions; using separate indices is less efficient, since we may fetch many records (or pointers) that satisfy only one of the conditions. The combined index can also efficiently handle
where branch-name = "Perryridge" and balance < 1000

But it cannot efficiently handle
where branch-name < "Perryridge" and balance = 1000
which may fetch many records that satisfy the first but not the second condition.
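These three cases can be observed directly in SQLite, whose planner output shows when the composite index is usable (the schema below is an assumption modeled on the account example; the exact detail strings vary by SQLite version):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE account (account_number TEXT, branch_name TEXT, balance REAL)")
    con.execute("CREATE INDEX idx_branch_bal ON account(branch_name, balance)")

    for where in ["branch_name = 'Perryridge' AND balance = 1000",   # both via index
                  "branch_name = 'Perryridge' AND balance < 1000",   # equality + range
                  "branch_name < 'Perryridge' AND balance = 1000"]:  # weak leading range
        plan = con.execute(
            f"EXPLAIN QUERY PLAN SELECT account_number FROM account WHERE {where}"
        ).fetchall()
        print(where, "->", plan[0][-1])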

ISAM: Indexed-Sequential-Access-Method: A Note of Caution



ISAM is an old-fashioned idea: B+-trees are usually better, as we'll see.

Though not always

But it's a good place to start: simpler than a B+-tree, yet with many of the same ideas. Upshot:

Don't brag about being an ISAM expert on your resume; do understand how ISAM works, and its tradeoffs with B+-trees.

Range Searches
"Find all students with gpa > 3.0." If the data is in a sorted file, do a binary search to find the first such student, then scan to find the others. The cost of binary search over a large database file can be quite high. Simple idea: create an 'index' file, a level of indirection again!

(Figure: an index file with entries k1, k2, ..., kN, one per page, sitting above Page 1 ... Page N of the data file.)

Now we can do the binary search on the (much smaller) index file!

ISAM
An index entry (index page) has the layout [P0, K1, P1, K2, P2, ..., Km, Pm]: pointer P0 leads to entries with keys below K1, and each pointer Pi leads to entries with keys >= Ki.

Index file may still be quite large. But we can apply the idea repeatedly!

(Figure: the ISAM file layout, with non-leaf pages on top, primary leaf pages below them, and overflow pages hanging off the primary pages. Leaf pages contain the data entries.)

Example ISAM Tree
Each node can hold 2 entries; there is no need for 'next-leaf-page' pointers. (Why?)

Root:        [40]
Index level: [20, 33]            [51, 63]
Leaf pages:  [10*,15*] [20*,27*] [33*,37*] [40*,46*] [51*,55*] [63*,97*]

Comments on ISAM
- File creation: leaf (data) pages are allocated sequentially, sorted by search key; then the index pages are allocated; then space for overflow pages.
- Index entries: <search-key value, page id>; they 'direct' the search for the data entries, which are in the leaf pages.
- Search: start at the root; use key comparisons to go to a leaf. The cost is about log_F N, where F = # entries per index page and N = # leaf pages.
- Insert: find the leaf where the data entry belongs and put it there (possibly on an overflow page).
- Delete: find and remove the entry from its leaf; if an overflow page becomes empty, de-allocate it.
- Static tree structure: inserts/deletes affect only leaf pages; the index pages never change.
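A minimal sketch of ISAM lookup under these rules, using the example tree above (the index levels are flattened into one sorted list for brevity): the static index is consulted first, then the primary leaf page and its overflow chain are scanned.

    from bisect import bisect_right

    index = [20, 33, 40, 51, 63]             # least key of leaf pages 1..5
    primary = [[10, 15], [20, 27], [33, 37], [40, 46], [51, 55], [63, 97]]
    overflow = {3: [41, 42]}                 # chain hanging off leaf page 3

    def lookup(key):
        page = bisect_right(index, key)      # the static index never changes
        if key in primary[page]:
            return ("primary", page)
        if key in overflow.get(page, []):    # walk the overflow chain
            return ("overflow", page)
        return None

    print(lookup(41))    # ('overflow', 3): inserted after file creation
    print(lookup(37))    # ('primary', 2)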



After inserting 23*, 48*, 41*, 42*: the index pages (root [40]; [20, 33] and [51, 63]) are unchanged, because the tree structure is static. Each new entry goes to the leaf page its key belongs to; since those primary leaf pages are full, the new entries (23* under the [20*, 27*] page; 41*, 48* and 42* under the [40*, 46*] page) are placed on overflow pages chained off the primary pages.

... then deleting 42*, 51*, 97*: each entry is removed from its primary or overflow page, and an overflow page that becomes empty is de-allocated. Note that after these deletions 51 still appears at the index level even though 51* is no longer in any leaf!

ADDITIONAL NOTES ON DATABASE NORMALIZATION

Normalization


Normalization is a formal process for determining which fields belong in which tables in a relational database. Normalization follows a set of rules worked out at the time relational databases were born. A normalized relational database provides several benefits:
- Elimination of redundant data storage.
- Close modeling of real-world entities, processes, and their relationships.
- Structuring of data so that the model is flexible.
Normalization ensures that you get the benefits relational databases offer. Time spent learning about normalization will begin paying for itself immediately.
Normalized Design: Pros and Cons
There are various advantages to producing a properly normalized design before you implement your system. A detailed list of the pros and cons is given below:

Pros of normalizing:
- More efficient structure.
- Better understanding of your data.
- More flexible data structure.
- Easier-to-maintain database structure.
- Few (if any) costly surprises down the road.
- Validates your common sense and intuition.
- Avoids redundant fields.
- Ensures that distinct tables exist when necessary.
Cons of normalizing:
- You can't start building the database before you know what the user needs.

1st Normal Form (1NF)
Def: A table (relation) is in 1NF if:
1. There are no duplicated rows in the table.
2. Each cell is single-valued (i.e., there are no repeating groups).
3. Entries in a column (attribute, field) are of the same kind.
Note: the order of the rows is immaterial, and the order of the columns is immaterial. The requirement that there be no duplicated rows means that the table has a key (although the key might be made up of more than one column, possibly even all of them).
A relation is in 1NF if and only if all underlying domains contain atomic values only. The first normal form deals only with the basic structure of the relation and does not resolve the problems of redundant information or the anomalies. For example, consider the relation
student(sno, sname, dob)
where dob is the date of birth, the primary key of the relation is sno, and the functional dependencies are sno -> sname and sno -> dob. The relation is in 1NF as long as dob is considered an atomic value, not a composite of three components (day, month, year). The relation may still suffer from anomalies and need further normalization.
A relation is in first normal form if and only if, in every legal value of that relation, every tuple contains exactly one value for each attribute. This definition merely states that relations are always in first normal form, which is always correct. However, a relation that is only in first normal form has a structure that is undesirable for a number of reasons.
First normal form (1NF) sets the very basic rules for an organized database:

- Eliminate duplicative columns from the same table.
- Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
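A small sketch of the 1NF rules with SQLite (the table and column names are invented for illustration): the multi-valued phones column is replaced by a separate table with one row per value.

    import sqlite3

    con = sqlite3.connect(":memory:")

    # Not in 1NF: the phones cell is multi-valued (a repeating group in one cell).
    con.execute("CREATE TABLE student_bad (sno INTEGER PRIMARY KEY, sname TEXT, phones TEXT)")
    con.execute("INSERT INTO student_bad VALUES (1, 'Ravi', '555-1212,555-3434')")

    # 1NF: every cell single-valued; each row identified by a (composite) key.
    con.execute("CREATE TABLE student (sno INTEGER PRIMARY KEY, sname TEXT)")
    con.execute("CREATE TABLE student_phone (sno INTEGER, phone TEXT, PRIMARY KEY (sno, phone))")
    con.execute("INSERT INTO student VALUES (1, 'Ravi')")
    con.executemany("INSERT INTO student_phone VALUES (?, ?)",
                    [(1, "555-1212"), (1, "555-3434")])
    print(con.execute("SELECT * FROM student_phone").fetchall())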

2nd Normal Form (2NF)
Def: A table is in 2NF if it is in 1NF and all non-key attributes are dependent on the entire key.
The second normal form attempts to deal with the problems identified in relations that are only in 1NF. The aim of 2NF is to ensure that all the information in one relation is only about one thing. A relation is in 2NF if it is in 1NF and every non-key attribute is fully dependent on each candidate key of the relation.
Note: since a partial dependency occurs when a non-key attribute is dependent on only a part of the (composite) key, the definition of 2NF is sometimes phrased as: "A table is in 2NF if it is in 1NF and if it has no partial dependencies."
Recall the general requirements of 2NF:
- Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
- Create relationships between these new tables and their predecessors through the use of foreign keys.
These rules can be summarized in a simple statement: 2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s), and creating relationships between those tables.
For example, imagine an online store that maintains customer information in a database. Its Customers table might look something like this:

CustNum  FirstName  LastName  Address          City        State  ZIP
1        John       Doe       12 Main Street   Sea Cliff   NY     11579
2        Alan       Johnson   82 Evergreen Tr  Sea Cliff   NY     11579
3        Beth       Thompson  1912 NE 1st St   Miami       FL     33157
4        Jacob      Smith     142 Irish Way    South Bend  IN     46637
5        Sue        Ryan      412 NE 1st St    Miami       FL     33157

A brief look at this table reveals a small amount of redundant data. We're storing the "Sea Cliff, NY 11579" and "Miami, FL 33157" entries twice each. Now, that might not seem like too much added storage in our simple example, but imagine the wasted space if we had thousands of rows in our table. Additionally, if the ZIP code for Sea Cliff were to change, we'd need to make that change in many places throughout the database. In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let's call it ZIPs) might look like this:

ZIP    City        State
11579  Sea Cliff   NY
33157  Miami       FL
46637  South Bend  IN

If we want to be super-efficient, we can even fill this table in advance -- the post office provides a directory of all valid ZIP codes and their city/state relationships.

Surely, you've encountered a situation where this type of database was utilized. Someone taking an order might have asked you for your ZIP code first and then knew the city and state you were calling from. This type of arrangement reduces operator error and increases efficiency. Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table:

CustNum  FirstName  LastName  Address          ZIP
1        John       Doe       12 Main Street   11579
2        Alan       Johnson   82 Evergreen Tr  11579
3        Beth       Thompson  1912 NE 1st St   33157
4        Jacob      Smith     142 Irish Way    46637
5        Sue        Ryan      412 NE 1st St    33157
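The same decomposition expressed as DDL, runnable with SQLite; the ZIP column of Customers is the foreign key back to ZIPs:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE ZIPs (
        ZIP   TEXT PRIMARY KEY,
        City  TEXT,
        State TEXT);
    CREATE TABLE Customers (
        CustNum   INTEGER PRIMARY KEY,
        FirstName TEXT,
        LastName  TEXT,
        Address   TEXT,
        ZIP       TEXT REFERENCES ZIPs(ZIP));
    INSERT INTO ZIPs VALUES ('11579','Sea Cliff','NY'), ('33157','Miami','FL');
    INSERT INTO Customers VALUES (1,'John','Doe','12 Main Street','11579');
    """)
    print(con.execute("""SELECT FirstName, City, State
                         FROM Customers JOIN ZIPs USING (ZIP)""").fetchall())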

We've now minimized the amount of redundant information stored within the database, and the structure is in second normal form.
Let's take one more example. The concept of 2NF requires that all attributes that are not part of a candidate key be fully dependent on each candidate key. If we consider the relation
student(sno, sname, cno, cname)
with the functional dependencies

sno -> sname and cno -> cname, and assume that (sno, cno) is the only candidate key (and therefore the primary key), the relation is not in 2NF, since sname and cname are not fully dependent on the key. The relation suffers from anomalies and repetition of information, since sname and cname will be repeated. To resolve these difficulties we can remove from the relation those attributes that are not fully dependent on the candidate keys, decomposing it into the following projections of the original relation:
S1(sno, sname)
S2(cno, cname)
SC(sno, cno)
We may recover the original relation by taking the natural join of the three relations.
If, however, we assume that sname and cname are unique, then we have the following candidate keys:
(sno, cno), (sno, cname), (sname, cno), (sname, cname)
The relation is then in 2NF, since it has no non-key attributes. It still has the same problems as before, but it does satisfy the requirements of 2NF; higher-level normalization is needed to resolve such problems, and further normalization will result in the decomposition of such relations.
3rd Normal Form (3NF)
Def: A table is in 3NF if it is in 2NF and it has no transitive dependencies.
The basic requirements of 3NF are:
- Meet the requirements of 1NF and 2NF.
- Remove columns that are not fully dependent upon the primary key.

Although transforming a relation that is not in 2NF into a number of relations that are in 2NF removes many of its anomalies, not all anomalies are removed, and further normalization is sometimes needed. These anomalies arise because a 2NF relation may have attributes that are not directly related to the thing being described by the candidate keys of the relation.
A relation R is in third normal form if it is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R. To understand the third normal form, we need to define transitive dependence, which is based on one of Armstrong's axioms: let A, B and C be three attributes of a relation R such that A -> B and B -> C. From these FDs we may derive A -> C; this dependence A -> C is transitive.
The 3NF differs from the 2NF in that all non-key attributes in 3NF are required to be directly dependent on each candidate key of the relation. The 3NF therefore insists, in the words of Kent (1983), that all facts in the relation are about the key (or the thing that the key identifies), the whole key, and nothing but the key. If some attributes are dependent on the keys transitively, that is an indication that those attributes provide information not about the key but about a non-key attribute. So the information is not directly about the key, although it obviously is related to the key.
Consider the following relation:
subject(cno, cname, instructor, office)
Assume that cname is not unique, so cno is the only candidate key. The following functional dependencies exist:
cno -> cname
cno -> instructor
instructor -> office
We can derive cno -> office from the above functional dependencies, so the relation is in 2NF. The relation is, however, not in 3NF, since office is not directly dependent on cno. This transitive dependence is an indication that the

relation has information about more than one thing (viz., course and instructor) and should therefore be decomposed. The primary difficulty with the above relation is that an instructor might be responsible for several subjects, so his office address may need to be repeated many times; this leads to all the update problems. To overcome these difficulties we decompose the above relation into the two relations:
s(cno, cname, instructor)
ins(instructor, office)
s is now in 3NF, and so is ins. An alternative decomposition of the relation subject is possible:
s(cno, cname)
inst(instructor, office)
si(cno, instructor)
The decomposition into three relations is not necessary, since the original relation is based on the assumption of one instructor for each course.
The 3NF is usually quite adequate for most relational database designs. There are, however, some situations, for example the relation student(sno, sname, cno, cname) discussed under 2NF above, where 3NF may not eliminate all the redundancies and inconsistencies. The problem with student(sno, sname, cno, cname) arises from the redundant information in the candidate keys; such problems are resolved by further normalization using BCNF.
Imagine that we have a table of widget orders:

Order Number  Customer Number  Unit Price  Quantity  Total
1             241              $10         2         $20
2             842              $9          20        $180
3             919              $19         1         $19
4             919              $12         10        $120

Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are there any duplicative columns? No. Do we have a primary key? Yes, the order number. Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF. Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order. What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity; therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form:

Order Number  Customer Number  Unit Price  Quantity
1             241              $10         2
2             842              $9          20
3             919              $19         1
4             919              $12         10

Now our table is in 3NF.
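One way to honor this 3NF step in practice, sketched with SQLite: store only the base columns and expose Total through a view, so the derived value can never disagree with UnitPrice x Quantity.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE orders (
        OrderNumber    INTEGER PRIMARY KEY,
        CustomerNumber INTEGER,
        UnitPrice      REAL,
        Quantity       INTEGER);
    INSERT INTO orders VALUES (1,241,10,2),(2,842,9,20),(3,919,19,1),(4,919,12,10);
    CREATE VIEW orders_with_total AS
        SELECT *, UnitPrice * Quantity AS Total FROM orders;
    """)
    print(con.execute("SELECT OrderNumber, Total FROM orders_with_total").fetchall())
    # [(1, 20.0), (2, 180.0), (3, 19.0), (4, 120.0)]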


Boyce-Codd Normal Form (BCNF)
The relation student(sno, sname, cno, cname) has all attributes participating in candidate keys, since all the attributes are assumed to be unique. We therefore had the following candidate keys:
(sno, cno), (sno, cname), (sname, cno), (sname, cname)
Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF, in spite of suffering from the modification problems. The difficulty in this relation is caused by dependences within the candidate keys. The second and third normal forms assume that all attributes not part of the candidate keys depend on the candidate keys, but do not deal with dependencies within the keys. BCNF deals with such dependencies.

A relation R is said to be in BCNF if whenever X -> A holds in R and A is not in X, then X is a candidate key for R.

It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF relation is not in BCNF and this happens only if (a) the candidate keys in the relation are composite keys (that is, they are not single attributes), (b) there is more than one candidate key in the relation, and (c) the keys are not disjoint, that is, some attributes in the keys are common.


The BCNF differs from the 3NF only when there is more than one candidate key and the keys are composite and overlapping. Consider, for example, the relation
enrol(sno, sname, cno, cname, date-enrolled)
Let us assume that the relation has the following candidate keys:
(sno, cno), (sno, cname), (sname, cno), (sname, cname)
(we have assumed sname and cname are unique identifiers). The relation is in 3NF but not in BCNF, because there are dependencies

sno -> sname and cno -> cname
in which attributes that are part of a candidate key are dependent on part of another candidate key. Such dependencies indicate that although the relation is about some entity or association identified by the candidate keys, e.g. (sno, cno), there are attributes that are not about the whole thing that the keys identify. For example, the above relation is about an association (enrolment) between students and subjects, so the relation needs to include only one identifier for students and one identifier for subjects. Providing two identifiers for students (sno, sname) and two for subjects (cno, cname) means that some information about students and subjects that is not needed is being carried, which results in repetition of information and the anomalies. If we wish to include further information about students and courses in the database, it should not be done by putting that information in the present relation but by creating new relations representing the entities student and subject.
These difficulties may be overcome by decomposing the above relation into the following three relations:
(sno, sname)

(cno, cname)
(sno, cno, date-of-enrolment)
We now have one relation with information only about students, another only about subjects, and a third only about enrolments. All the anomalies and repetition of information have been removed.
So, a relation is in BCNF if and only if it is in 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a relation is in BCNF if it is in 3NF and the only determinants are the candidate keys.
Desirable properties of decompositions
So far our approach has consisted of looking at individual relations and checking whether they belong to 2NF, 3NF or BCNF. If a relation was not in the normal form being checked for, and we wished to normalize it to that form so that some of the anomalies could be eliminated, it was necessary to decompose the relation into two or more relations. The process of decomposing a relation R into a set of relations R1, R2, ..., Rn was based on identifying different components and using them as the basis of the decomposition. The decomposed relations R1, R2, ..., Rn are projections of R and are of course not disjoint; otherwise the glue holding the information together would be lost. Decomposing relations this way, by a "recognize and split" method, is not a particularly sound approach, since we do not even have a basis for determining that the original relation can be reconstructed from the decomposed relations if necessary.
Desirable properties of a decomposition are:
1. Attribute preservation
2. Lossless-join decomposition
3. Dependency preservation
4. Lack of redundancy


Attribute Preservation
This is a simple and obvious requirement: all the attributes that were in the relation being decomposed must be preserved.
Lossless-Join Decomposition
So far we have decomposed relations intuitively; we need a better basis for deciding decompositions, since intuition may not always be correct. We illustrate how a careless decomposition may lead to problems, including loss of information. Consider the relation
enrol(sno, cno, date-enrolled, room-no, instructor)
Suppose we decompose it into two relations, enrol1 and enrol2, as follows:
enrol1(sno, cno, date-enrolled)
enrol2(date-enrolled, room-no, instructor)
There are problems with this decomposition, but we wish to focus on one aspect at the moment. Let an instance of the relation enrol be:

Sno     Cno    Date-Enrolled  Room-No  Instructor
830057  CP302  1FEB1984       MP006    Gupta
830057  CP303  1FEB1984       MP006    Jones
820159  CP302  10JAN1984      MP006    Gupta
825678  CP304  1FEB1984       CE122    Wilson
826789  CP305  15JAN1984      EA123    Smith


and let the decomposed relations enrol1 and enrol2 be:

enrol1:
Sno     Cno    Date-Enrolled
830057  CP302  1FEB1984
830057  CP303  1FEB1984
820159  CP302  10JAN1984
825678  CP304  1FEB1984
826789  CP305  15JAN1984

enrol2:
Date-Enrolled  Room-No  Instructor
1FEB1984       MP006    Gupta
1FEB1984       MP006    Jones
10JAN1984      MP006    Gupta
1FEB1984       CE122    Wilson
15JAN1984      EA123    Smith

All the information that was in the relation enrol appears to be still available in enrol1 and enrol2 but this is not so. Suppose, we wanted to retrieve the student numbers of all students taking a course from Wilson, we would need to join enrol1 and enrol2. The join would have 11 tuples as follows:

Sno     Cno    Date-Enrolled  Room-No  Instructor
830057  CP302  1FEB1984       MP006    Gupta
830057  CP302  1FEB1984       MP006    Jones
830057  CP303  1FEB1984       MP006    Gupta
830057  CP303  1FEB1984       MP006    Jones
830057  CP302  1FEB1984       CE122    Wilson
830057  CP303  1FEB1984       CE122    Wilson

(remaining tuples omitted)
The join contains a number of spurious tuples that were not in the original relation enrol. Because of these additional tuples, we have lost the information about which students take courses from Wilson (we have more tuples but less information, because we are unable to say with certainty who is taking courses from Wilson). Such decompositions are called lossy decompositions. A non-loss or lossless decomposition is one that guarantees that the join will result in exactly the same relation as was decomposed.
We need to analyze why some decompositions are lossy. The common attribute in the above decomposition was Date-enrolled. The common attribute is the glue that gives us the ability to find the relationships between relations by joining them together. If the common attribute is not unique, the relationship information is not preserved. If each tuple had a unique value of Date-enrolled, the problem of losing information would not exist; the problem arises because several enrolments may take place on the same date.
A decomposition of a relation R into relations R1, R2, ..., Rn is called a lossless-join decomposition (with respect to FDs F) if the relation R is always the natural join of the relations R1, R2, ..., Rn. It should be noted that the natural join is the only way to recover the relation from the decomposed relations; there is no other set of operators that can recover the relation if the join cannot. Furthermore, when the decomposed relations R1, R2, ..., Rn are obtained by projecting on the relation R, for example R1 by the projection of R onto R1's attributes, the relation R1 may not always be


precisely equal to the projection, since the relation R1 might have additional tuples called dangling tuples.
It is not difficult to test whether a given decomposition is lossless-join given a set of functional dependencies F. Consider the simple case of a relation R decomposed into R1 and R2. If the decomposition is lossless-join, then one of the following two conditions must hold:
R1 ∩ R2 -> R1 - R2
R1 ∩ R2 -> R2 - R1
That is, the common attributes in R1 and R2 must include a candidate key of either R1 or R2.
Dependency Preservation
It is clear that a decomposition must be lossless so that we do not lose any information from the relation being decomposed. Dependency preservation is another important requirement, since a dependency is a constraint on the database: if X -> Y holds, we know that the two (sets of) attributes are closely related, and it would be useful if both appeared in the same relation so that the dependency can be checked easily.
Let us consider a relation R(A, B, C, D) with dependencies F that include the following:
A -> B
A -> C

If we decompose the above relation into R1(A, B) and R2(B, C, D), the dependency A -> C cannot be checked (or preserved) by looking at only one relation. It is desirable that decompositions be such that each dependency in F may be checked by looking at only one relation, and that no joins need be computed to check dependencies. In some cases it may not be possible to preserve each and every


dependency in F, but as long as the dependencies that are preserved are equivalent to F, that is sufficient.

Let F be the set of dependencies on a relation R which is decomposed into relations R1, R2, ..., Rn.

We can partition the dependencies in F into F1, F2, ..., Fn, where each Fi contains the dependencies that only involve attributes of relation Ri. If the union of the Fi implies all the dependencies in F, then we say that the decomposition has preserved dependencies; otherwise it has not. If the decomposition does not preserve the dependencies in F, then the decomposed relations may contain tuples that do not satisfy F, or updates to the decomposed relations may require a join to check that the constraints implied by the dependencies still hold.
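Both tests above reduce to computing attribute closures. A minimal Python sketch: closure implements the standard X+ computation, and lossless applies the two-relation condition from the previous section to the lossy enrol1/enrol2 decomposition.

    def closure(attrs, fds):
        # fds: list of (lhs_set, rhs_set) pairs. Returns the closure of attrs.
        result = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if lhs <= result and not rhs <= result:
                    result |= rhs
                    changed = True
        return result

    def lossless(r1, r2, fds):
        # Lossless iff (R1 ∩ R2)+ contains all of R1 or all of R2.
        c = closure(r1 & r2, fds)
        return r1 <= c or r2 <= c

    # enrol1/enrol2 share only 'date', which is not a key of either relation,
    # so the decomposition loses information:
    fds = [({"sno", "cno"}, {"date", "room", "instructor"})]
    print(lossless({"sno", "cno", "date"},
                   {"date", "room", "instructor"}, fds))               # False
    print(lossless({"sno", "cno", "date"},
                   {"sno", "cno", "room", "instructor"}, fds))         # True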

Consider the following relation:

Sub(sno, instructor, office)

We may wish to decompose this relation to remove the transitive dependency of office on sno. A possible decomposition is

S1(sno, instructor)
S2(sno, office)

The relations are now in 3NF, but the dependency instructor → office cannot be verified by looking at one relation; a join of S1 and S2 is needed. In this decomposition it is quite possible to record more than one office number for one instructor, even though the functional dependency instructor → office does not allow it, as the sketch below shows.
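A tiny illustration of the point (the sample values are invented): each relation looks fine on its own, and only the join exposes the violation of instructor → office.

s1 = {("s1", "JONES"), ("s2", "JONES")}    # S1(sno, instructor)
s2 = {("s1", "RM101"), ("s2", "RM202")}    # S2(sno, office)

# Each relation is internally consistent, yet joining on sno shows JONES
# with two different offices, violating instructor -> office:
joined = {(sno, ins, off)
          for (sno, ins) in s1
          for (s, off) in s2 if s == sno}
print(joined)   # JONES appears with both RM101 and RM202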


Lack of Redundancy

Lossless-join, dependency preservation and lack of redundancy are not always achievable together with BCNF. With 3NF, all three are always possible.

Deriving BCNF

Given a set of dependencies F, we may decompose a given relation into a set of relations that are in BCNF using the following algorithm:

(1) Find out the facts about the real world.
(2) Reduce the list of functional relationships.
(3) Find the keys.
(4) Combine related facts.

If any relation R has a dependency A → B where A is not a key, the relation violates the conditions of BCNF and may be decomposed into AB and R - B (keeping A in both relations so that the pieces can be joined back). The relation AB is then in BCNF, and we check whether R - B is also in BCNF. If not, we apply the procedure again, until all the relations are in BCNF. A sketch of this loop appears below.
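The decomposition loop can be sketched as follows. This is a simplified version (superkeys are tested against the full set of FDs rather than projections, which is the usual textbook shortcut); closure() is the same helper used in the lossless-join sketch, repeated so the fragment stands alone:

def closure(attrs, fds):
    # attributes reachable from attrs under the FDs
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_decompose(relation, fds):
    schemes = [set(relation)]
    done = []
    while schemes:
        r = schemes.pop()
        # find an FD that applies inside r and whose left side is not a superkey
        violation = next(((lhs, rhs) for lhs, rhs in fds
                          if lhs <= r and (rhs & r) - lhs
                          and not r <= closure(lhs, fds)), None)
        if violation is None:
            done.append(r)                     # r is in BCNF
        else:
            lhs, rhs = violation
            schemes.append(lhs | (rhs & r))    # the "AB" relation
            schemes.append(r - (rhs - lhs))    # R minus the dependent attributes
    return done

fds = [({"sno"}, {"instructor"}), ({"instructor"}, {"office"})]
print(bcnf_decompose({"sno", "instructor", "office"}, fds))
# two schemes: (sno, instructor) and (instructor, office)

With sno → instructor and instructor → office, the loop peels off (instructor, office) and leaves (sno, instructor), removing the transitive dependency in the Sub relation above.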

Multivalued Dependencies

Several difficulties can arise when an entity has multivalued attributes. In the relational model, if all of the information about such an entity is to be represented in one relation, it is necessary to repeat all of the other information with each value of the multivalued attribute. This results in many tuples about the same instance of the entity, and in the relation having a composite key (the entity identifier plus the multivalued attribute). The other option, of course, is to represent the multivalued information in a separate relation. The situation becomes much worse if an entity has more than one multivalued


attribute and the values are represented in one relation by a number of tuples for each entity instance, such that every value of one multivalued attribute appears with every value of the other in order to maintain consistency. The multivalued dependency addresses exactly this problem, which arises when more than one multivalued attribute exists.

Consider the following relation, which represents an entity employee with one multivalued attribute, proj:

emp (e#, dept, salary, proj)

Normalization up to BCNF is based on functional dependencies, that is, on dependencies that apply only to single-valued facts. For example, e# → dept implies only one dept value for each value of e#. Not all information in a database is single-valued: proj in the employee relation may be the list of all projects the employee is currently working on. Although e# determines the list of all projects an employee is working on, e# → proj is not a functional dependency.

So far, a multivalued fact about an entity has been represented by inserting a tuple for each value of that fact. This results in a composite key, since the multivalued fact must form part of the key.

The fourth and fifth normal forms deal with multivalued dependencies.


programmer (emp_name, qualifications, languages)

The above relation includes two multivalued attributes of the entity programmer: qualifications and languages. There are no functional dependencies.

The attributes qualifications and languages are assumed to be independent of each other. If we considered qualifications and languages separate entities, we would have two relationships: one between employees and qualifications, the other between employees and programming languages. Both relationships are many-to-many: one programmer may have several qualifications and know several programming languages, while one qualification may be held by several programmers and one programming language may be known to many programmers. The above relation is therefore in 3NF (indeed in BCNF), but it still has some disadvantages. Suppose a programmer has several qualifications (B.Sc, Dip. CS, etc.) and is proficient in several programming languages; how should this information be represented? There are several possibilities:

Emp Name    Qualifications    Languages
Smith       B.Sc              Fortran
Smith       B.Sc              Cobol
Smith       B.Sc              Pascal
Smith       Dip. CS           Fortran
Smith       Dip. CS           Cobol
Smith       Dip. CS           Pascal

Emp Name    Qualification     Language
Smith       B.Sc              NULL
Smith       Dip. CS           NULL
Smith       NULL              Fortran
Smith       NULL              Cobol
Smith       NULL              Pascal

Emp Name    Qualification     Language
Smith       B.Sc              Fortran
Smith       Dip. CS           Cobol
Smith       NULL              Pascal


Other variations are possible (there is no relationship between qualifications and programming languages), but all of them have disadvantages. If the information is repeated, we face the same problems of redundancy and anomalies that arise when second or third normal form conditions are violated. If there is no repetition, there are still difficulties with searches, insertions and deletions: the role of NULL values in the above relations is confusing, and since the candidate key is (emp_name, qualifications, languages), entity integrity requires that no NULLs appear in it. These problems may be overcome by decomposing the relation as follows:

Emp Name    Qualifications
Smith       B.Sc
Smith       Dip. CS

Emp Name    Languages
Smith       Fortran
Smith       Cobol
Smith       Pascal


The basis of the above decomposition is the concept of multivalued dependency (MVD). A functional dependency A → B relates one value of A to one value of B, while a multivalued dependency A →→ B defines a relationship in which a set of values of attribute B is determined by a single value of A. The concept of multivalued dependency was developed to provide a basis for decomposing relations like the one above.

If a relation like Enrolment(sno, subject#) has a relationship in which sno determines the set of values of subject#, the dependency sno →→ subject# is called a trivial MVD, since the relation Enrolment cannot be decomposed any further. More formally, an MVD X →→ Y is trivial if either Y is a subset of X or X and Y together form the whole relation R. A trivial MVD places no constraints on the relation. A relation having non-trivial MVDs must therefore have at least three attributes, two of them multivalued. Non-trivial MVDs do constrain the relation, since all possible combinations of the multivalued attribute values must then be present in the relation.

Let us now define the concept of multivalued dependency formally.

The multivalued dependency X →→ Y is said to hold for a relation R(X, Y, Z) if, for a given set of values of the attributes X (a set of values if X is more than one attribute), there is a set of (zero or more) associated values for the set of attributes Y, and the Y values depend only on the X values, having no dependence on the set of attributes Z.

In the example above, if there were some dependence between the attributes qualifications and languages, for example if the language were related to the qualification (perhaps the qualification was a training certificate in a particular language), then the relation would have no MVD and could not be decomposed into two relations as above. Note that whenever X →→ Y holds, so does X →→ Z, since the roles of the attributes Y and Z are symmetrical.

Consider two different situations:

(a) Z is a single-valued attribute. In this situation we deal with R(X, Y, Z) as before, by entering several tuples about each entity.

(b) Z is multivalued. More formally, X →→ Y is said to hold for R(X, Y, Z) if, whenever t1 and t2 are two tuples in R with the same values for the attributes X, that is t1[X] = t2[X], then R also contains tuples t3 and t4 (not necessarily distinct) such that


t1[X] = t2[X] = t3[X] = t4[X]
t3[Y] = t1[Y] and t3[Z] = t2[Z]
t4[Y] = t2[Y] and t4[Z] = t1[Z]

In other words, if t1 and t2 are given by

t1 = [X, Y1, Z1]
t2 = [X, Y2, Z2]

then there must be tuples t3 and t4 such that

t3 = [X, Y1, Z2]
t4 = [X, Y2, Z1]

We are therefore insisting that every value of Y appear with every value of Z, to keep the relation instances consistent. In other words, the conditions above insist that Y and Z are determined by X alone and that there is no relationship between Y and Z: since Y and Z appear in every possible pairing, the pairings carry no information and are of no significance. Only if some of these pairings were absent would the pairings be significant. For example, in a relation (instructor, qualifications, subjects), if subject were single-valued no such condition would arise; otherwise all combinations of qualifications and subjects must occur. (Note: if Z is single-valued and functionally dependent on X, then Z1 = Z2; if Z is multivalue-dependent on X, then Z1 <> Z2 is possible.)

The theory of multivalued dependencies is very similar to that for functional dependencies. Given a set D of MVDs, we may find D+, the closure of D, using a set of axioms. The four-tuple condition above also translates directly into a brute-force test, sketched below.
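In this sketch (plain Python, with the data taken from the Smith example above), attribute positions are passed as index lists; Y is everything outside X and Z, and the symmetric t4 condition is covered when the loop visits the same pair of tuples in the other order:

def mvd_holds(rows, x, z):
    # check X ->-> Y, where Y is all positions outside x and z
    def pick(t, idx):
        return tuple(t[i] for i in idx)
    for t1 in rows:
        for t2 in rows:
            if pick(t1, x) != pick(t2, x):
                continue
            t3 = list(t1)
            for i in z:
                t3[i] = t2[i]        # t3 takes Y from t1 and Z from t2
            if tuple(t3) not in rows:
                return False         # the required "swapped" tuple is missing
    return True

rows = {("Smith", q, l)
        for q in ("B.Sc", "Dip. CS")
        for l in ("Fortran", "Cobol", "Pascal")}
print(mvd_holds(rows, x=[0], z=[2]))   # True: emp_name ->-> qualifications

rows.discard(("Smith", "Dip. CS", "Pascal"))
print(mvd_holds(rows, x=[0], z=[2]))   # False: the pairings now carry information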


Multivalued Normalization --- Fourth Normal Form

Def: A table is in 4NF if it is in BCNF and it has no multivalued dependencies (other than trivial ones).

The relation Programmer(emp_name, qualifications, languages) above illustrates the problems that arise if such a relation is not normalised further. It can be decomposed into P1(emp_name, qualifications) and P2(emp_name, languages) to overcome these problems; the decomposed relations are in fourth normal form (4NF).

Formally, a relation R is in 4NF if, whenever a multivalued dependency X →→ Y holds, either (a) the dependency is trivial, or (b) X is a candidate key for R. The dependency X →→ Y in a relation R(X, Y) is trivial, since it must hold for every R(X, Y); similarly, (X, Y) →→ Z must hold for every relation R(X, Y, Z) with only three attributes.

In fourth normal form, a relation holds information about only one entity. If a relation has more than one multivalued attribute, we should decompose it to remove the difficulties with multivalued facts. Intuitively, R is in 4NF if all dependencies are a result of keys. When multivalued dependencies exist, a relation should not contain two or more independent multivalued attributes. Decomposing a relation to achieve 4NF normally not only reduces redundancy but also avoids anomalies, as the sketch below shows.
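The 4NF decomposition of Programmer can be played out on the Smith data (plain Python, values from the tables above):

rows = {("Smith", q, l)
        for q in ("B.Sc", "Dip. CS")
        for l in ("Fortran", "Cobol", "Pascal")}
p1 = {(n, q) for (n, q, _) in rows}   # P1(emp_name, qualifications)
p2 = {(n, l) for (n, _, l) in rows}   # P2(emp_name, languages)

# join P1 and P2 back on emp_name:
rejoined = {(n, q, l) for (n, q) in p1 for (m, l) in p2 if n == m}
print(rejoined == rows)               # True: the decomposition is lossless
print(len(rows), len(p1) + len(p2))   # 6 tuples shrink to 2 + 3 = 5

Six three-column tuples shrink to five two-column tuples, and the natural join on emp_name reconstructs the original exactly, so nothing is lost.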

Fifth Normal Form

Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and every join dependency in the table is a consequence of the candidate keys of the table.


The normal forms discussed so far require that a relation R not in the given normal form be decomposed into two relations that meet the requirements of that normal form. In some rare cases, a relation can suffer from redundant information and update anomalies yet cannot be decomposed into two relations to remove them. In such cases it may be possible to decompose the relation into three or more relations using 5NF. The fifth normal form deals with join dependencies, a generalisation of the MVD. The aim of fifth normal form is to have relations that cannot be decomposed further; a relation in 5NF cannot be constructed from several smaller relations.

A relation R satisfies the join dependency (R1, R2, ..., Rn) if and only if R is equal to the join of R1, R2, ..., Rn, where each Ri is a subset of the set of attributes of R. A relation R is in 5NF (or project-join normal form, PJNF) if for every join dependency at least one of the following holds:

(a) (R1, R2, ..., Rn) is a trivial join dependency (that is, one of the Ri is R itself);
(b) every Ri is a candidate key for R.

An example of 5NF is provided by the relation below, which deals with departments, subjects and students.

Dept           Subject    Student
Comp Sc.       CP1000     John Smith
Mathematics    MA1000     John Smith
Comp Sc.       CP2000     Arun Kumar
Comp Sc.       CP3000     Reena Rani
Physics        PH1000     Raymond Chew
Chemistry      CH2000     Albert Garcia

The above relation says that Comp Sc. offers subjects CP1000, CP2000 and CP3000, which are taken by a variety of students. No student takes all the subjects and no subject has all the students enrolled in it, so all three attributes are needed to represent the information.

The relation does not show MVDs, since the attributes subject and student are not independent; they are related to each other, and the pairings carry significant information. The relation therefore cannot be decomposed into two relations

(dept, subject) and (dept, student)

without losing important information. It can, however, be decomposed into the three relations

(dept, subject), (dept, student) and (subject, student)

and it can be shown that this decomposition is lossless, as the sketch below confirms.
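The claim can be checked mechanically (plain Python over the sample table above): joining any two projections produces spurious tuples, while the three-way join restores the original relation for this instance.

rows = {
    ("Comp Sc.", "CP1000", "John Smith"),
    ("Mathematics", "MA1000", "John Smith"),
    ("Comp Sc.", "CP2000", "Arun Kumar"),
    ("Comp Sc.", "CP3000", "Reena Rani"),
    ("Physics", "PH1000", "Raymond Chew"),
    ("Chemistry", "CH2000", "Albert Garcia"),
}
ds = {(d, s) for (d, s, _) in rows}   # (dept, subject)
dt = {(d, t) for (d, _, t) in rows}   # (dept, student)
st = {(s, t) for (_, s, t) in rows}   # (subject, student)

# joining only two projections invents enrolments:
two_way = {(d, s, t) for (d, s) in ds for (d2, t) in dt if d == d2}
# filtering through the third projection removes the spurious tuples:
three_way = {(d, s, t) for (d, s, t) in two_way if (s, t) in st}
print(two_way == rows)     # False: e.g. ('Comp Sc.', 'CP1000', 'Arun Kumar') is spurious
print(three_way == rows)   # True: the three-way join is lossless here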

Domain-Key Normal Form (DKNF)

Def: A table is in DKNF if every constraint on the table is a logical consequence of the definition of keys and domains.
