
UNIT-4

DATABASE DESIGN ISSUES: ER Model, Normalization, Security, Integrity, Consistency, Tuning, Optimization, Research Issues, Temporal and Spatial Databases

Database Design Process


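[Figure: the database design process. Requirements analysis of the real world yields functional requirements and database requirements. Functional requirements lead to functional analysis, access specifications, and application program design. Database requirements lead to conceptual design (E-R modeling) and the conceptual model, then logical design (choice of a DBMS, data model mapping) and the logical schema, then physical design.]

Fall 2001

Database Systems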

Entity-Relationship Model
An entity is a collection of real-world objects that have many common properties.
Examples: Students, Instructors, Courses, Sections.
Student entities have properties: name, address, major, graduation-year.
A student may be: John Smith, 22 Sage Rd., Computer Science, 2000.
An attribute is a data item that describes a property of an entity.

Entities
[Figure: entity Students with primary identifier sid, multi-valued attribute hobbies, and composite attribute student_name made up of lastname, firstname, and mid_initial.]

Mapping Entities to Relations


Each entity in an E-R model is mapped to a separate relation:
the primary identifier is mapped to the primary key (underlined!)
each regular attribute is mapped to an attribute in the table
each subpart of a composite attribute is mapped to a different attribute
each multi-valued attribute is mapped to a separate relation that inherits the primary key of the parent relation

Mapping Entities
Students( sid, lastname, firstname, mid_initial )
Hobbies( sid, hobby )
Foreign key sid in Hobbies references the Students relation!
[Figure: the Students entity with sid, hobbies, and student_name (lastname, firstname, mid_initial), annotated with the two relations above.]
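The mapping of Students and its multi-valued attribute hobbies can be tried out with a small sketch. It uses Python's sqlite3 module purely for illustration; the column types, the sample row, and the helper hobbies_of are assumptions, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Students (
    sid         INTEGER PRIMARY KEY,   -- primary identifier
    lastname    TEXT,                  -- subparts of the composite student_name
    firstname   TEXT,
    mid_initial TEXT
);
CREATE TABLE Hobbies (                 -- one relation per multi-valued attribute
    sid   INTEGER REFERENCES Students(sid),
    hobby TEXT,
    PRIMARY KEY (sid, hobby)
);
""")
conn.execute("INSERT INTO Students VALUES (1, 'Smith', 'John', 'Q')")  # made-up sample row
conn.executemany("INSERT INTO Hobbies VALUES (?, ?)", [(1, "chess"), (1, "rowing")])

def hobbies_of(sid):
    """Return the hobbies of one student, alphabetically (hypothetical helper)."""
    rows = conn.execute("SELECT hobby FROM Hobbies WHERE sid = ? ORDER BY hobby", (sid,))
    return [r[0] for r in rows]
```

Each hobby becomes its own row in Hobbies, so a student with no hobbies simply has no rows there.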

Map the Auction Entities


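[Figure: entity Owners with attributes oid, #itemsold, email (multi-valued), phone, and composite owner_name (lastname, firstname, mid_initial); entity Buyers with attributes buyid, buyer_name, cc_num, email (multi-valued), phone, and composite address (street, city, state, zip).]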

Map the Auction Entities
[Figure: entity Items with attributes iid, name, description, and location; entity Bids with attributes bid, amount, and composite date/time (date, time).]

Relationships
Given a set of entities E1, E2, ..., Ek, a relationship R defines a rule of correspondence between these entities. An instance r(e1, e2, ..., ek) of the relationship R means that entities e1, e2, ..., ek are related by R in this instance.

If two people are married, they are in a relationship:
married(Bob, Margaret)
If a student A takes a course C offered by professor B, then A, B, C are in a relationship.

Relationships
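[Figure: E-R diagram of the auction database with entities Items, Owners, Buyers, and Bids and relationships own, accept (with attribute date), and place. The diagram labels one binary relationship and one ternary relationship.]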

Cardinalities of Relationships
Participation cardinalities of a relationship R for an entity E are:
min-card(E, R): the minimum number of entities in E that should be mapped via R
max-card(E, R): the maximum number of entities in E that can be mapped via R
Own is a relationship between owner and item.
Should each owner be selling items?
How many items can an owner sell?

Cardinalities of Relationships
[Figure: three E-R-F diagrams, one for each case below.]
One-to-one relationship: min-card(E,R)=0, max-card(E,R)=1, min-card(F,R)=0, max-card(F,R)=1
Many-to-one relationship: min-card(E,R)=0, max-card(E,R)=N, min-card(F,R)=1, max-card(F,R)=1
Many-to-many relationship: min-card(E,R)=0, max-card(E,R)=N, min-card(F,R)=0, max-card(F,R)=N

Cardinalities
[Figure: the auction E-R diagram (Items, Owners, Buyers, Bids; relationships own, accept, place; attribute date) annotated with (min,max) participation cardinalities such as (1,1), (0,N), and (0,1).]

Cardinalities
If max-card(E,R)=1 then E has single-valued participation in R.
If max-card(E,R)=N then E has multi-valued participation in R.
Given a binary relationship R between E and F, R is said to be
one-to-one if both E and F have single-valued participation
one-to-many if E has single-valued and F has multi-valued participation
many-to-many if both E and F have multi-valued participation

Mapping Relationships to Relations


Map one-to-one and one-to-many (or many-to-one) relationships into the existing relations (derived from entities).
If E-R-F is one-to-many, then include the primary key of the relation for F in the relation for E.
If E-R-F is one-to-one, then include the key for E in F, or the key for F in E.
If E-R-F is many-to-many, create a new relation for R that has the primary keys for both E and F.
If R has attributes, migrate them to the relation that contains the foreign key(s)!

Mapping the Auction Database


Owners( oid, itemsold, lastname, firstname, mid_initial, phone )
OwnerEmail( oid, email )
Buyers( buyid, buyername, ccnum, street, state, city, zip, phone )
BuyerEmail( buyid, email )
Items( iid, name, location, description, oid )
Bids( bid, date, time, amount, acceptingoid, acceptdate, buyid, iid )
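As a rough illustration of folding the own relationship into Items, the sketch below keeps only a few columns of Owners and Items and lets SQLite enforce the foreign key. The NOT NULL on Items.oid is one way to express a minimum participation of 1 for Items; the column types, the sample rows, and the helper try_insert_item are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Owners (oid INTEGER PRIMARY KEY, lastname TEXT, firstname TEXT);
CREATE TABLE Items (
    iid  INTEGER PRIMARY KEY,
    name TEXT,
    oid  INTEGER NOT NULL REFERENCES Owners(oid)  -- the own relationship
);
""")
conn.execute("INSERT INTO Owners VALUES (1, 'Lee', 'Ann')")   # made-up sample rows
conn.execute("INSERT INTO Items VALUES (10, 'lamp', 1)")

def try_insert_item(iid, name, oid):
    """Return True if the item was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO Items VALUES (?, ?, ?)", (iid, name, oid))
        return True
    except sqlite3.IntegrityError:
        return False
```

An item that names a non-existent owner is rejected, while one owner can be referenced by any number of items.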


Problem
Consider the design of a database to manage airline reservations:
For flights, it contains the departure and arrival airports, dates and times.
For flights, it also contains a number of different pricing plans with different conditions (Saturday stay, advance booking, etc.).
For passengers, it contains the name, telephone number and seat type preference.
Reservations include the seat assigned to a passenger.
Passengers can have multiple reservations.

Solution
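[Figure: E-R diagram of one possible solution with entities airport, flight, pricing plan, and passenger; relationships depart, arrive, and reservation; and attributes including name, date, time, conditions, seat, phone, and seat pref. Participation cardinalities shown include (0,N) and (1,1).]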

Entity Relationship Model


Entities play different roles in a relationship.
Recursive relationship: Employees participates in supervises in two roles, Supervised-by (1,1) and Supervisor-of (0,N).
Employees( eid, ..., supervisor-id ), where supervisor-id is a foreign key to Employees.
[Figure: a second diagram relates Employees and Projects through supervises, with roles employee-supervisor (1,1), project-supervisor (1,N), and supervised (0,N).]

Entity Relationship Model


Many-to-many relationships are translated into new relations.
[Figure: relationship buy among Buyers, Items, and Stores, each with participation (0,N), and with attribute amount.]

BUY
Item  Buyer  Store  Amount
I1    B1     S1     3
I2    B1     S2     4
I3    B4     S1     5
I2    B5     S2     2

Entity Relationship Model


Ternary relationships may be represented by binary relationships.
[Figure: Buyers, Items, and Stores connected pairwise by the binary relationships buy_item, buy_from, and sell_item, every participation (0,N).]
Is this conceptually equivalent to the previous ternary relationship?

Weak Entities
The existence of a weak entity W depends on the existence of another (strong) entity E through a relationship R.
(Alternate) Two different weak entities may have the same identity (key) if they are related to two different strong entities.
[Figure: entities Bank and Branch connected by the relationship has, with participations (0,N) and (1,1); attributes shown are name, number, and address.]

Weak Entities
Weak entities can be mapped to the relational model as follows:
Map each weak entity E that depends on a strong entity F to a new relation R.
Relation R contains all the attributes in E and the primary key of F.
The primary key for R is the primary key of E together with the primary key of F.
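A minimal sketch of this rule, assuming Bank is the strong entity identified by name and Branch is the weak entity identified by number within a bank (the attribute assignment, column types, and sample rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Bank (name TEXT PRIMARY KEY);
CREATE TABLE Branch (
    bank_name TEXT    NOT NULL REFERENCES Bank(name),  -- key of the strong entity
    number    INTEGER NOT NULL,                        -- key of the weak entity
    address   TEXT,
    PRIMARY KEY (bank_name, number)                    -- combined key
);
INSERT INTO Bank VALUES ('First'), ('Second');
INSERT INTO Branch VALUES ('First', 1, '1 Main St');
""")

def add_branch(bank_name, number, address):
    """Return True if the branch was stored, False if a key constraint rejected it."""
    try:
        conn.execute("INSERT INTO Branch VALUES (?, ?, ?)", (bank_name, number, address))
        return True
    except sqlite3.IntegrityError:
        return False
```

Branch number 1 can exist once per bank: the same number under a different bank is accepted, a repeat under the same bank is not.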

Generalization Hierarchies
Lower items inherit attributes of their parents.
[Figure: Concerts (date, location) with subtypes Classical (orchestra, pieces, soloists, conductor) and Other (performers).]
Option 1. Translate into a single relation with a flag for the type of entity [many null values].
Option 2. Translate into three entities and two is-a relationships, then translate the resulting graph.

Extensions
All relational DBMSs come with extensions that give more flexibility to the DBA.
Examples from Informix:
composite attributes -> translate as a record, e.g. address of type ROW(street string, city string, state string, zip string)
multi-valued attributes -> translate into collection types such as sets, lists, multi-sets (bags)
hierarchies -> create typed tables and translate into a type hierarchy

REMEMBER, the extensions complicate the data model and make certain SQL queries much harder or impossible, leaving the database programmer with a much harder job of maintaining the database!

Security and Integrity

Intro
Database Security: aspects of security, access to databases, privileges and views
Database Integrity: view updating, integrity constraints
For more information: Connolly and Begg chapters 6 and 19

Database Security
Database security is about controlling access to information
Some information should be available freely
Other information should only be available to certain people or groups

Many aspects to consider for security:
Legal issues
Physical security
OS/Network security
Security policies and protocols
Encryption and passwords
DBMS security

DBMS Security Support
DBMS can provide some security
Each user has an account, username and password
These are used to identify a user and control their access to information

DBMS verifies password and checks a user's permissions when they try to
Retrieve data
Modify data
Modify the database structure

Permissions and Privilege
SQL uses privileges to control access to tables and other database objects:
SELECT privilege
INSERT privilege
UPDATE privilege
DELETE privilege
The owner (creator) of a database has all privileges on all objects in the database, and can grant these to others
The owner (creator) of an object has all privileges on that object and can pass them on to others

Privileges in SQL
GRANT <privileges> ON <object> TO <users> [WITH GRANT OPTION]
<users> is a list of user names or PUBLIC
<object> is the name of a table or view (later)
<privileges> is a list of SELECT <columns>, INSERT <columns>, DELETE, and UPDATE <columns>, or simply ALL
WITH GRANT OPTION means that the users can pass their privileges on to others

Privileges Examples
GRANT ALL ON Employee TO Manager WITH GRANT OPTION
The user Manager can do anything to the Employee table, and can allow other users to do the same (by using GRANT statements)
GRANT SELECT, UPDATE(Salary) ON Employee TO Finance
The user Finance can view the entire Employee table, and can change Salary values, but cannot change other values or pass on their privilege

Removing Privileges
If you want to remove a privilege you have granted you use
REVOKE <privileges> ON <object> FROM <users>

If a user has the same privilege from other users then they keep it
All privileges dependent on the revoked one are also revoked

Removing Privileges
Example
Admin grants ALL privileges to Manager, and SELECT to Finance with grant option
Manager grants ALL to Personnel
Finance grants SELECT to Personnel
[Figure: grant graph. Admin -> Manager (ALL), Admin -> Finance (SELECT), Manager -> Personnel (ALL), Finance -> Personnel (SELECT).]

Removing Privileges
Manager revokes ALL from Personnel
Personnel still has SELECT privileges from Finance
Admin revokes SELECT from Finance
Personnel loses SELECT also
[Figure: the grant graph after each revocation.]
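The two revocation rules (a privilege survives if another grantor still supplies it, and dependent grants are revoked too) can be modelled with a toy grant graph. This is not how any particular DBMS implements REVOKE: the class GrantGraph and its methods are hypothetical, ALL is taken to mean the four table privileges, and every grant is assumed to carry the grant option.

```python
ALL = frozenset({"SELECT", "INSERT", "UPDATE", "DELETE"})

class GrantGraph:
    """Toy model: each grant is an edge (grantor, grantee, privilege)."""

    def __init__(self, owner):
        self.owner = owner          # the owner holds every privilege
        self.grants = set()

    def _holders(self, privilege):
        # Users reachable from the owner along grant edges for this privilege.
        holders = {self.owner}
        changed = True
        while changed:
            changed = False
            for grantor, grantee, priv in self.grants:
                if priv == privilege and grantor in holders and grantee not in holders:
                    holders.add(grantee)
                    changed = True
        return holders

    def privileges(self, user):
        return {p for p in ALL if user in self._holders(p)}

    def grant(self, grantor, grantee, privileges):
        for p in privileges:
            if grantor not in self._holders(p):
                raise PermissionError(f"{grantor} does not hold {p}")
            self.grants.add((grantor, grantee, p))

    def revoke(self, grantor, grantee, privileges):
        for p in privileges:
            self.grants.discard((grantor, grantee, p))
            # Cascade: drop grants whose grantor no longer holds the privilege.
            holders = self._holders(p)
            self.grants = {g for g in self.grants if g[2] != p or g[0] in holders}

g = GrantGraph("Admin")
g.grant("Admin", "Manager", ALL)
g.grant("Admin", "Finance", {"SELECT"})
g.grant("Manager", "Personnel", ALL)
g.grant("Finance", "Personnel", {"SELECT"})

g.revoke("Manager", "Personnel", ALL)
after_first = g.privileges("Personnel")    # SELECT is still supplied by Finance

g.revoke("Admin", "Finance", {"SELECT"})
after_second = g.privileges("Personnel")   # the Finance grant is gone as well
```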

Views
Privileges work at the level of tables
You can restrict access by column
You cannot restrict access by row

Views provide derived tables
A view is the result of a SELECT statement which is treated like a table
You can SELECT from (and sometimes UPDATE etc) views just like tables

Views, along with privileges, allow for customised access

Creating Views
CREATE VIEW <name> AS <select stmt>
<name> is the name of the new view
<select stmt> is a query that returns the rows and columns of the view
Example
We want each user to be able to view the names and phone numbers (only) of those employees in their own department

View Example
Example
We want each user to be able to view the names and phone numbers (only) of those employees in their own department
In Oracle, you can refer to the current user as USER

Employee
ID    Name  Phone  Department  Salary
E158  Mark  x6387  Accounts    15,000
E159  Mary  x6387  Marketing   15,000
E160  Jane  x6387  Marketing   15,000
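One way such a view could look is sketched below. SQLite has no USER pseudo-column, so a one-row table CurrentSession stands in for Oracle's USER; that table, the view name OwnDept, and the assumption that user names match the Name column are all illustrative choices rather than anything stated on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (ID TEXT PRIMARY KEY, Name TEXT, Phone TEXT,
                       Department TEXT, Salary INTEGER);
INSERT INTO Employee VALUES
    ('E158', 'Mark', 'x6387', 'Accounts',  15000),
    ('E159', 'Mary', 'x6387', 'Marketing', 15000),
    ('E160', 'Jane', 'x6387', 'Marketing', 15000);

-- Hypothetical stand-in for Oracle's USER.
CREATE TABLE CurrentSession (username TEXT);
INSERT INTO CurrentSession VALUES ('Mary');

-- Names and phone numbers only, restricted to the current user's department.
CREATE VIEW OwnDept AS
    SELECT Name, Phone
    FROM Employee
    WHERE Department = (SELECT Department
                        FROM Employee
                        WHERE Name = (SELECT username FROM CurrentSession));
""")

rows = conn.execute("SELECT Name, Phone FROM OwnDept ORDER BY Name").fetchall()
```

Granting SELECT on the view, and not on Employee itself, then gives the row-level restriction that table privileges alone cannot express.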

Database Tuning

Overview
After ER design, schema refinement, and the definition of views, we have the conceptual and external schemas for our database. The next step is to choose indexes, make clustering decisions, and to refine the conceptual and external schemas (if necessary) to meet performance goals. We must begin by understanding the workload:

The most important queries and how often they arise.
The most important updates and how often they arise.
The desired performance for these queries and updates.

Understanding the Workload


For each query in the workload:
Which relations does it access?
Which attributes are retrieved?
Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?

For each update in the workload:
Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.

Decisions to Make
What indexes should we create?
Which relations should have indexes? What field(s) should be the search key? Should we build several indexes?

For each index, what kind of an index should it be?
Clustered? Hash/tree? Dynamic/static? Dense/sparse?

Should we make changes to the conceptual schema?
Consider alternative normalized schemas? (Remember, there are many choices in decomposing into BCNF, etc.)
Should we "undo" some decomposition steps and settle for a lower normal form? (Denormalization.)
Horizontal partitioning, replication, views ...


Choice of Indexes
One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it. Before creating an index, must also consider the impact on updates in the workload!

Trade-off: indexes can make queries go faster, updates slower. Require disk space, too.
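The "would an additional index give a better plan?" step can be tried on a small scale by asking the system for its plan before and after creating the index. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for whatever plan display a given DBMS offers; the Emp columns, the generated rows, and the index name emp_age are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT, "
             "age INTEGER, sal INTEGER, dno INTEGER)")
conn.executemany(
    "INSERT INTO Emp VALUES (?, ?, ?, ?, ?)",
    [(i, f"e{i}", 20 + i % 40, 1000 * (i % 10), i % 5) for i in range(1000)],
)

def plan(sql):
    """Return the human-readable steps of SQLite's plan for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT ename FROM Emp WHERE age = 30"
before = plan(query)                               # no usable index yet
conn.execute("CREATE INDEX emp_age ON Emp(age)")
after = plan(query)                                # the planner can now search emp_age
```

The cost side of the trade-off is not shown here: every INSERT, DELETE, or UPDATE of age now has to maintain emp_age as well.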

Issues to Consider in Index Selection


Attributes mentioned in a WHERE clause are candidates for index search keys.

Exact match condition suggests hash index. Range query suggests tree index.
Clustering is especially useful for range queries, although it can help on equality queries as well in the presence of duplicates.

Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.

Issues in Index Selection (Contd.)


Multi-attribute search keys should be considered when a WHERE clause contains several conditions.

If range selections are involved, order of attributes should be carefully chosen to match the range ordering. Such indexes can sometimes enable index-only strategies for important queries.
For index-only strategies, clustering is not important!

When considering a join condition:

Hash index on inner is very good for Index Nested Loops.


Should be clustered if join column is not key for inner, and inner tuples need to be retrieved.

Clustered B+ tree on join column(s) good for Sort-Merge.

Example 1

SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno

Hash index on D.dname supports 'Toy' selection.
Given this, index on D.dno is not needed.

Hash index on E.dno allows us to get matching (inner) Emp tuples for each selected (outer) Dept tuple.
What if WHERE included: "... AND E.age=25"?
Could retrieve Emp tuples using index on E.age, then join with Dept tuples satisfying dname selection. Comparable to strategy that used E.dno index.
So, if E.age index is already created, this query provides much less motivation for adding an E.dno index.

Example 2

SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE E.sal BETWEEN 10000 AND 20000 AND E.hobby='Stamps' AND E.dno=D.dno

Clearly, Emp should be the outer relation.
Suggests that we build a hash index on D.dno.

What index should we build on Emp?
B+ tree on E.sal could be used, OR an index on E.hobby could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions.
As a rule of thumb, equality selections are more selective than range selections.

As both examples indicate, our choice of indexes is guided by the plan(s) that we expect an optimizer to consider for a query. Have to understand optimizers!

Examples of Clustering
SELECT E.dno
FROM Emp E
WHERE E.age>40

B+ tree index on E.age can be used to get qualifying tuples.
How selective is the condition? Is the index clustered?

SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>10
GROUP BY E.dno

Consider the GROUP BY query.
If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly. Clustered E.dno index may be better!

SELECT E.dno
FROM Emp E
WHERE E.hobby='Stamps'

Equality queries and duplicates:
Clustering on E.hobby helps!

Clustering and Joins


SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno

Clustering is especially important when accessing inner tuples in INL (Index Nested Loops).
Should make index on E.dno clustered.

Suppose that the WHERE clause is instead:
WHERE E.hobby='Stamps' AND E.dno=D.dno

If many employees collect stamps, Sort-Merge join may be worth considering. A clustered index on D.dno would help.

Summary: Clustering is useful whenever many tuples are to be retrieved.

Multi-Attribute Index Keys


To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be better than an index on age or an index on sal.
Such indexes are also called composite or concatenated indexes.
Choice of index key is orthogonal to clustering etc.

If condition is: 20<age<30 AND 3000<sal<5000:
Clustered tree index on <age,sal> or <sal,age> is best.

If condition is: age=30 AND 3000<sal<5000:
Clustered <age,sal> index much better than <sal,age> index!

Composite indexes are larger, updated more often.

Index-Only Plans

A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available.

Index on <E.dno>:
SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno

Tree index on <E.dno,E.eid>:
SELECT D.mgr, E.eid
FROM Dept D, Emp E
WHERE D.dno=E.dno

Index on <E.dno>:
SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno

Tree index on <E.dno,E.sal>:
SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno

Tree index on <E.age,E.sal> or <E.sal,E.age>:
SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
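SQLite reports an index-only plan as a "covering index" scan, which makes the MIN(E.sal) example easy to check. The sketch assumes an Emp table with the columns used on the slides and a composite index named emp_dno_sal; the exact wording of the plan text can differ between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT, "
             "age INTEGER, sal INTEGER, dno INTEGER)")
conn.execute("CREATE INDEX emp_dno_sal ON Emp(dno, sal)")   # composite key <dno, sal>

steps = [row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT dno, MIN(sal) FROM Emp GROUP BY dno")]

# True when the plan reads only the index and never touches the Emp rows.
uses_index_only_plan = any("COVERING INDEX emp_dno_sal" in s for s in steps)
```

Both columns the query needs are in the index key, and the key order already groups rows by dno, so the base table is not needed at all.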

Tuning the Conceptual Schema


The choice of conceptual schema should be guided by the workload, in addition to redundancy issues:

We may settle for a 3NF schema rather than BCNF.
Workload may influence the choice we make in decomposing a relation into 3NF or BCNF.
We may further decompose a BCNF schema!
We might denormalize (i.e., undo a decomposition step), or we might add fields to a relation.
We might consider horizontal decompositions.

If such changes are made after a database is in use, called schema evolution; might want to mask some of these changes from applications by defining views.

Example Schemas
Contracts (Cid, Sid, Jid, Did, Pid, Qty, Val)
Depts (Did, Budget, Report)
Suppliers (Sid, Address)
Parts (Pid, Cost)
Projects (Jid, Mgr)

We will concentrate on Contracts, denoted as CSJDPQV. The following integrity constraints are given to hold: JP -> C, SD -> P, C is the primary key.

What are the candidate keys for CSJDPQV?
What normal form is this relation schema in?
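Both questions can be explored mechanically with attribute closures. The brute-force search below is only a sketch for a schema this small; the function names are hypothetical, and "C is the primary key" is encoded as the FD C -> SJDPQV.

```python
from itertools import combinations

ATTRS = "CSJDPQV"
FDS = [("C", "SJDPQV"), ("JP", "C"), ("SD", "P")]

def closure(attrs, fds=FDS):
    """All attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attrs=ATTRS, fds=FDS):
    """Minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(attrs) + 1):          # smallest sets first
        for combo in combinations(attrs, size):
            s = set(combo)
            if any(k <= s for k in keys):          # superset of a key: not minimal
                continue
            if closure(s, fds) == set(attrs):
                keys.append(s)
    return keys

keys = candidate_keys()
prime = set().union(*keys)                         # attributes that appear in some key

def is_superkey(lhs):
    return closure(lhs) == set(ATTRS)

in_bcnf = all(is_superkey(lhs) for lhs, _ in FDS)
in_3nf = all(is_superkey(lhs) or set(rhs) - set(lhs) <= prime for lhs, rhs in FDS)
```

The search reports C and JP, and also SDJ, which follows from SD -> P and then JP -> C. Because P is part of a key, SD -> P violates BCNF but not 3NF.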

Settling for 3NF vs BCNF


CSJDPQV can be decomposed into SDP and CSJDQV, and both relations are in BCNF. (Which FD suggests that we do this?)
Lossless decomposition, but not dependency-preserving.
Adding CJP makes it dependency-preserving as well.

Suppose that this query is very important:
Find the number of copies Q of part P ordered in contract C.
Requires a join on the decomposed schema, but can be answered by a scan of the original relation CSJDPQV.
Could lead us to settle for the 3NF schema CSJDPQV.


Denormalization
Suppose that the following query is important:

Is the value of a contract less than the budget of the department?

To speed up this query, we might add a field budget B to Contracts.

This introduces the FD D -> B with respect to Contracts. Thus, Contracts is no longer in 3NF.

We might choose to modify Contracts despite this if the query is sufficiently important, and we cannot obtain adequate performance otherwise (i.e., by adding indexes or by choosing an alternative 3NF schema.)

Choice of Decompositions
There are 2 ways to decompose CSJDPQV into BCNF:
SDP and CSJDQV; lossless-join but not dep-preserving.
SDP, CSJDQV and CJP; dep-preserving as well.

The difference between these is really the cost of enforcing the FD JP -> C.
2nd decomposition: Index on JP on relation CJP.
1st:
CREATE ASSERTION CheckDep CHECK
( NOT EXISTS
( SELECT *
FROM PartInfo P, ContractInfo C
WHERE P.sid=C.sid AND P.did=C.did
GROUP BY C.jid, P.pid
HAVING COUNT (C.cid) > 1 ))

Choice of Decompositions (Contd.)


The following FDs were given to hold: JP -> C, SD -> P, C is the primary key.
Suppose that, in addition, a given supplier always charges the same price for a given part: SPQ -> V.
If we decide that we want to decompose CSJDPQV into BCNF, we now have a third choice:
Begin by decomposing it into SPQV and CSJDPQ.
Then, decompose CSJDPQ (not in 3NF) into SDP, CSJDQ.
This gives us the lossless-join decomp: SPQV, SDP, CSJDQ.
To preserve JP -> C, we can add CJP, as before.

Choice: { SPQV, SDP, CSJDQ } or { SDP, CSJDQV } ?

Decomposition of a BCNF Relation


Suppose that we choose { SDP, CSJDQV }. This is in BCNF, and there is no reason to decompose further (assuming that all known integrity constraints are FDs). However, suppose that these queries are important:

Find the contracts held by supplier S.
Find the contracts that department D is involved in.

Decomposing CSJDQV further into CS, CD and CJQV could speed up these queries. (Why?) On the other hand, the following query is slower:

Find the total value of all contracts held by supplier S.

Horizontal Decompositions
Our definition of decomposition, so far: Relation is replaced by a collection of relations that are projections. This is vertical decomposition. Most important case. Sometimes, might want to replace relation by a collection of relations that are selections. This is horizontal decomposition.

Each new relation has same schema as the original, but a subset of the rows. Collectively, new relations contain all rows of the original. Typically, the new relations are disjoint.

Horizontal Decompositions (Contd.)


Suppose that contracts with value > 10000 are subject to different rules. This means that queries on Contracts will often contain the condition val>10000.
One way to deal with this is to build a clustered B+ tree index on the val field of Contracts.
A second approach is to replace Contracts by two new relations: LargeContracts and SmallContracts, with the same attributes (CSJDPQV).
Performs like index on such queries, but no index overhead.
Can build clustered indexes on other attributes, in addition!

Masking Conceptual Schema Changes


CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val)
AS SELECT *
FROM LargeContracts
UNION
SELECT *
FROM SmallContracts

The replacement of Contracts by LargeContracts and SmallContracts can be masked by the view. However, queries with the condition val>10000 must be asked wrt LargeContracts for efficient execution: so users concerned with performance have to be aware of the change.
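The masking view can be exercised in SQLite as a quick check. In this sketch the CHECK constraints that keep each row in the right partition, the column types, and the two sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cols = ("cid INTEGER, sid INTEGER, jid INTEGER, did INTEGER, "
        "pid INTEGER, qty INTEGER, val INTEGER")
conn.executescript(f"""
CREATE TABLE LargeContracts ({cols}, CHECK (val > 10000));
CREATE TABLE SmallContracts ({cols}, CHECK (val <= 10000));

-- Applications written against Contracts keep working through the view.
CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val) AS
    SELECT * FROM LargeContracts
    UNION
    SELECT * FROM SmallContracts;

INSERT INTO LargeContracts VALUES (1, 1, 1, 1, 1, 5, 50000);
INSERT INTO SmallContracts VALUES (2, 1, 2, 1, 2, 3, 700);
""")

# Old-style query through the view: sees rows from both partitions.
all_contracts = conn.execute(
    "SELECT cid, val FROM Contracts ORDER BY cid").fetchall()

# Performance-aware query: goes straight to the partition that can match.
large_only = conn.execute(
    "SELECT cid FROM LargeContracts WHERE val > 10000").fetchall()
```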

Tuning Queries and Views


If a query runs slower than expected, check if an index needs to be re-built, or if statistics are too old. Sometimes, the DBMS may not be executing the plan you had in mind. Common areas of weakness:

Selections involving null values.
Selections involving arithmetic or string expressions.
Selections involving OR conditions.
Lack of evaluation features like index-only strategies or certain join methods or poor size estimation.

Check the plan that is being used! Then adjust the choice of indexes or rewrite the query/view.

Rewriting SQL Queries


Complicated by interaction of:
NULLs, duplicates, aggregation, subqueries.

Guideline: Use only one query block, if possible.

SELECT DISTINCT *
FROM Sailors S
WHERE S.sname IN (SELECT Y.sname FROM YoungSailors Y)

is equivalent to

SELECT DISTINCT S.*
FROM Sailors S, YoungSailors Y
WHERE S.sname = Y.sname

Not always possible ...

SELECT *
FROM Sailors S
WHERE S.sname IN (SELECT DISTINCT Y.sname FROM YoungSailors Y)

is not equivalent to

SELECT S.*
FROM Sailors S, YoungSailors Y
WHERE S.sname = Y.sname

The Notorious COUNT Bug


SELECT dname
FROM Department D
WHERE D.num_emps >
(SELECT COUNT(*)
FROM Employee E
WHERE D.building = E.building)

CREATE VIEW Temp (empcount, building) AS
SELECT COUNT(*), E.building
FROM Employee E
GROUP BY E.building

SELECT dname
FROM Department D, Temp
WHERE D.building = Temp.building
AND D.num_emps > Temp.empcount;

What happens when Employee is empty??
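The difference between the nested query and its unnested form shows up as soon as some building has no employees. The sketch below runs both in SQLite on made-up rows; the view is named TempCounts here (rather than Temp) only to steer clear of the TEMP keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (dname TEXT, building TEXT, num_emps INTEGER);
CREATE TABLE Employee (ename TEXT, building TEXT);
INSERT INTO Department VALUES ('Toy', 'B1', 2), ('Shoe', 'B2', 1);
INSERT INTO Employee VALUES ('Ann', 'B1');          -- nobody works in B2

CREATE VIEW TempCounts (empcount, building) AS
    SELECT COUNT(*), E.building FROM Employee E GROUP BY E.building;
""")

# Nested form: COUNT(*) over an empty set is 0, so 'Shoe' qualifies (1 > 0).
nested = [r[0] for r in conn.execute("""
    SELECT dname FROM Department D
    WHERE D.num_emps > (SELECT COUNT(*) FROM Employee E
                        WHERE D.building = E.building)
    ORDER BY dname""")]

# Unnested form: B2 has no row in the view, so the join silently drops 'Shoe'.
unnested = [r[0] for r in conn.execute("""
    SELECT dname FROM Department D, TempCounts T
    WHERE D.building = T.building AND D.num_emps > T.empcount
    ORDER BY dname""")]
```

With Employee completely empty, the nested query returns every department with num_emps > 0, while the join returns nothing.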

Summary on Unnesting Queries


DISTINCT at top level: Can ignore duplicates.
Can sometimes infer DISTINCT at top level! (e.g. subquery clause matches at most one tuple)
DISTINCT in subquery w/o DISTINCT at top: Hard to convert.
Subqueries inside OR: Hard to convert.
ALL subqueries: Hard to convert.
EXISTS and ANY are just like IN.
Aggregates in subqueries: Tricky.
Good news: Some systems now rewrite under the covers (e.g. DB2).

More Guidelines for Query Tuning


Minimize the use of DISTINCT: don't need it if duplicates are acceptable, or if answer contains a key.
Minimize the use of GROUP BY and HAVING:

SELECT MIN (E.age)
FROM Employee E
GROUP BY E.dno
HAVING E.dno=102

can be rewritten as

SELECT MIN (E.age)
FROM Employee E
WHERE E.dno=102

Consider DBMS use of index when writing arithmetic expressions: E.age=2*D.age will benefit from index on E.age, but might not benefit from index on D.age!

Guidelines for Query Tuning (Contd.)

Avoid using intermediate relations:

SELECT E.dno, AVG(E.sal)
FROM Emp E, Dept D
WHERE E.dno=D.dno AND D.mgrname='Joe'
GROUP BY E.dno

vs.

SELECT * INTO Temp
FROM Emp E, Dept D
WHERE E.dno=D.dno AND D.mgrname='Joe'

and

SELECT T.dno, AVG(T.sal)
FROM Temp T
GROUP BY T.dno

The first query does not materialize the intermediate reln Temp. If there is a dense B+ tree index on <dno, sal>, an index-only plan can be used to avoid retrieving Emp tuples in the first query!

Summary (Design)
Database design consists of several tasks: requirements analysis, conceptual design, schema refinement, physical design and tuning.

In general, have to go back and forth between these tasks to refine a database design, and decisions in one task can influence the choices in another task.

Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design.

What are the important queries and updates? What attributes/relations are involved?

Summary (Design Contd.)


Indexes must be chosen to speed up important queries (and perhaps some updates!).

Index maintenance overhead on updates to key fields.
Choose indexes that can help many queries, if possible.
Build indexes to support index-only strategies.
Clustering is an important decision; only one index on a given relation can be clustered!
Order of fields in composite index key can be important.

Static indexes may have to be periodically re-built. Statistics have to be periodically updated.

Summary (Tuning)
The conceptual schema should be refined by considering performance criteria and workload:

May choose 3NF or lower normal form over BCNF.
May choose among alternative decompositions into BCNF (or 3NF) based upon the workload.
May denormalize, or undo some decompositions.
May decompose a BCNF relation further!
May choose a horizontal decomposition of a relation.
Importance of dependency-preservation based upon the dependency to be preserved, and the cost of the IC check.
Can add a relation to ensure dep-preservation (for 3NF, not BCNF!); or else, can check dependency using a join.

Summary (Tuning Contd.)


Over time, indexes have to be fine-tuned (dropped, created, re-built, ...) for performance.
Should determine the plan used by the system, and adjust the choice of indexes appropriately.

System may still not find a good plan:
Only left-deep plans considered!
Null values, arithmetic conditions, string expressions, the use of ORs, etc. can confuse an optimizer.

So, may have to rewrite the query/view:
Avoid nested queries, temporary relations, complex conditions, and operations like DISTINCT and GROUP BY.

Temporal Databases

What is temporal DB?


Temporal databases encompass all DB applications that require some aspect of time when organizing their information. They exhibit the need for developing a set of unifying concepts for application developers to use. Temporal DB applications have been developed since the early days of database usage. However, in creating these applications, it was mainly left to the application developers to discover, design, program, and implement the temporal concepts.

Applications of temporal db
There are many examples of applications where some aspect of time is needed to maintain the information in a DB:
Health care: patient histories need to be maintained
Insurance: claims and accident histories are required
Finance: stock price histories need to be maintained
Personnel management: salary and position history need to be maintained
Banking: credit histories

Terminology
Valid time. The valid time denotes when facts are true with respect to the real world.
Transaction time. The transaction time of a database fact is the time when the fact is current in the database.

Introduction
Temporal database: a database that contains historical data as well as current data.
Note: "historical" is a misleading term; temporal databases may contain data regarding the future as well as the past.

Extreme case: data is only inserted, never deleted from a temporal database (e.g. vehicle position data in the project). So far, we have studied the other extreme, i.e. snapshot databases.

Introduction
Temporal data: encoded representation of timestamped facts. Each tuple must include at least one timestamp.
Problem: What about queries that produce results that are not temporal? i.e. the result of the query is outside the domain of the (temporal) database.
e.g. Get names of all people who have supplied something in the past.

Redefine temporal database: database that includes, but is not limited to, temporal data.

Motivation
Queries on time-varying data are difficult to express in SQL. Temporal databases provide built-in support for recording and querying such information. It is possible to use SQL to evaluate these queries, but performance is poor.

Motivation
Most applications manage temporal data. If a temporal database is used for such data:
Schemas, including integrity constraints, are simpler.
Queries are simpler.
Application code is less complex:
easier to understand
easier to produce
easier to maintain

Applications
Most applications of database technology are temporal in nature:
Financial apps.: portfolio management, accounting & banking, stock market analysis, audit analysis
Record-keeping apps.: personnel, medical records, inventory management, legal records (commercial laws change frequently)
Data Warehousing: historical trends for analysis
Scheduling apps.: airline, car, hotel reservations and project management
Scientific apps.: weather monitoring, chemical process monitoring

Intervals
An interval [s,e] is a set of times from time s to time e. Does interval [s,e] represent an infinite set? Assumption: Timeline is a finite sequence of discrete, indivisible time quanta. Time Quanta: smallest unit of time system can represent. Timepoints/point: time unit considered indivisible for our purpose. An interval is treated as a single type, not as pair of separate values. Interval can be open/closed w.r.t. start point/end point. eg. [d04,d10],[d04,d11),(d03,d10],(d03,d11) all represent the sequence of days from day4 to day10 inclusive.


Operators on Intervals
Temporal predicate operators, for i1 = [s1,e1] and i2 = [s2,e2]:
- i1 BEFORE i2: e1 < s2
- i1 MEETS i2: s2 = e1
- i1 EQUALS i2: s1 = s2 AND e1 = e2
- i1 OVERLAPS i2: s2 < s1 < e2 OR s1 < s2 < e1

Operators on Intervals
- i1 DURING i2: s2 < s1 AND e2 > e1
- i1 STARTS i2: s1 = s2 AND e1 < e2
- i1 FINISHES i2: e1 = e2 AND s1 > s2
Additional operators:
- i1 MERGES i2: i1 MEETS i2 OR i1 OVERLAPS i2
- i1 CONTAINS i2: i2 DURING i1
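The interval predicates can be transcribed almost literally. The sketch below assumes intervals are closed pairs of integer time points and copies the conditions as the slides state them; the `Interval` type and the lowercase function names are illustrative choices.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    s: int  # start point
    e: int  # end point

# Each function mirrors the condition given on the slides.
def before(i1, i2):   return i1.e < i2.s
def meets(i1, i2):    return i2.s == i1.e
def equals(i1, i2):   return i1.s == i2.s and i1.e == i2.e
def overlaps(i1, i2): return i2.s < i1.s < i2.e or i1.s < i2.s < i1.e
def during(i1, i2):   return i2.s < i1.s and i2.e > i1.e
def starts(i1, i2):   return i1.s == i2.s and i1.e < i2.e
def finishes(i1, i2): return i1.e == i2.e and i1.s > i2.s
def merges(i1, i2):   return meets(i1, i2) or overlaps(i1, i2)
def contains(i1, i2): return during(i2, i1)

print(overlaps(Interval(4, 8), Interval(6, 9)))   # s1 < s2 < e1 holds
print(contains(Interval(4, 10), Interval(5, 6)))  # [5,6] lies strictly inside [4,10]
```

Because the inequalities in OVERLAPS and DURING are strict as written, two intervals that share a start point are not reported as overlapping by this transcription.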

Scalar and Relational Operators


- DURATION(i): returns the number of time points in i, e.g. DURATION([d03,d07]) returns 5.
- i1 UNION i2: returns [MIN(s1,s2), MAX(e1,e2)] if i1 MERGES i2; otherwise undefined.
- i1 INTERSECT i2: returns [MAX(s1,s2), MIN(e1,e2)] if i1 OVERLAPS i2; otherwise undefined.
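A minimal sketch of these three operators follows. It assumes closed intervals over integer time points and, as a labeled assumption, reads the preconditions in the discrete sense: two intervals overlap when they share at least one time point, and they meet when one starts at the time point immediately after the other ends. "Undefined" is modeled as `None`.

```python
def duration(i):
    s, e = i
    return e - s + 1                      # number of time points in [s, e]

def overlaps(i1, i2):
    # Assumption: share at least one time point.
    return i1[0] <= i2[1] and i2[0] <= i1[1]

def meets(i1, i2):
    # Assumption: one interval starts right after the other ends.
    return i2[0] == i1[1] + 1 or i1[0] == i2[1] + 1

def union(i1, i2):
    if not (overlaps(i1, i2) or meets(i1, i2)):
        return None                       # undefined: a gap separates the intervals
    return (min(i1[0], i2[0]), max(i1[1], i2[1]))

def intersect(i1, i2):
    if not overlaps(i1, i2):
        return None                       # undefined: no common time point
    return (max(i1[0], i2[0]), min(i1[1], i2[1]))

print(duration((3, 7)), union((3, 5), (4, 6)), intersect((3, 5), (4, 6)))
```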

Aggregate Operators
EXPAND(X): X is a set of intervals; the output is also a set. Used to generate time quantum intervals.
The expanded form of X is the set of all intervals of the form [p,p] where p is a time point in some interval in X.
e.g.:
  X1 = { [d01,d01], [d03,d05], [d04,d06] }
  X2 = { [d01,d01], [d03,d04], [d05,d05], [d05,d06] }
  X3 = { [d01,d01], [d03,d03], [d04,d04], [d05,d05], [d06,d06] }
Then EXPAND(X1) = EXPAND(X2) = X3
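The EXPAND example can be reproduced with a short sketch, assuming intervals are closed pairs of integers (d01 becomes 1, and so on). The function name `expand` is an illustrative choice.

```python
def expand(x):
    """Return the set of unit intervals [p, p] for every time point p covered by x."""
    points = {p for (s, e) in x for p in range(s, e + 1)}
    return {(p, p) for p in points}

x1 = {(1, 1), (3, 5), (4, 6)}
x2 = {(1, 1), (3, 4), (5, 5), (5, 6)}
x3 = {(1, 1), (3, 3), (4, 4), (5, 5), (6, 6)}

print(expand(x1) == x3, expand(x2) == x3)
```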

Aggregate Operators
COLLAPSE(X): The collapsed form of X is the set Y of intervals of the same type such that
(a) X and Y have the same expanded form, and
(b) no two distinct members i1 and i2 of Y are such that (i1 MERGES i2) is true.
e.g.:
  X1 = { [d01,d01], [d03,d05], [d04,d06] }
  X2 = { [d01,d01], [d03,d04], [d05,d05], [d05,d06] }
  X3 = { [d01,d01], [d03,d06] }
Then COLLAPSE(X1) = COLLAPSE(X2) = X3
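COLLAPSE can be sketched as a sort-and-merge pass. The code assumes closed intervals over integer time points and treats intervals as mergeable when they share a point or are directly adjacent (the next one starts at most one point after the current one ends), which is the reading the example above requires.

```python
def collapse(x):
    """Merge overlapping or adjacent intervals into maximal intervals."""
    result = []
    for s, e in sorted(x):
        if result and s <= result[-1][1] + 1:
            # Overlaps or directly follows the last interval: extend it.
            result[-1] = (result[-1][0], max(result[-1][1], e))
        else:
            result.append((s, e))
    return set(result)

x1 = {(1, 1), (3, 5), (4, 6)}
x2 = {(1, 1), (3, 4), (5, 5), (5, 6)}

print(collapse(x1), collapse(x2))
```

Sorting first means each interval only has to be compared with the most recently emitted one, so the pass is linear after the sort.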

Relation Operators Involving Intervals

PACK r ON A: groups the relation r by all its attributes apart from A and collapses the intervals in each group. This is equivalent to:
    WITH ( r GROUP { A } AS X ) AS R1,
         ( EXTEND R1 ADD COLLAPSE ( X ) AS Y ) { ALL BUT X } AS R2 :
    R2 UNGROUP Y
UNPACK r ON A: replace COLLAPSE with EXPAND in the definition of PACK.

Example
Given two temporal relations:
- S: supplier S# was under contract during the interval During.
- SP: supplier S# was able to supply part P# during the interval During.

S
S#   During
S1   [d04,d10]
S2   [d02,d04]
S2   [d07,d10]
S3   [d03,d10]
S4   [d04,d10]
S5   [d02,d10]

SP
S#   P#   During
S1   P1   [d04,d10]
S1   P7   [d05,d10]
S1   P3   [d09,d10]
S1   P5   [d06,d10]
S2   P1   [d02,d04]
S2   P9   [d03,d03]
S2   P1   [d08,d10]
S2   P5   [d09,d10]
S3   P1   [d08,d10]
S4   P2   [d06,d09]
S4   P5   [d04,d08]
S4   P7   [d05,d10]

Example 1
Active supplier intervals: Get S#-DURING pairs for suppliers who have been able to supply at least one part during at least one interval of time, where DURING designates such an interval.

    PACK SP {S#,DURING} ON DURING

SP
S#   P#   During
S1   P1   [d04,d10]
S1   P7   [d05,d10]
S1   P3   [d09,d10]
S1   P5   [d06,d10]
S2   P1   [d02,d04]
S2   P9   [d03,d03]
S2   P1   [d08,d10]
S2   P5   [d09,d10]
S3   P1   [d08,d10]
S4   P2   [d06,d09]
S4   P5   [d04,d08]
S4   P7   [d05,d10]

RESULT
S#   During
S1   [d04,d10]
S2   [d02,d04]
S2   [d08,d10]
S3   [d08,d10]
S4   [d04,d10]
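The result of Example 1 can be reproduced with a small sketch of PACK. It assumes days are integers and relations are lists of tuples; `pack` is a hypothetical helper that groups by the non-interval value and merges overlapping or adjacent intervals, not the Tutorial D operator itself.

```python
from collections import defaultdict

# SP as (S#, P#, start, end), taken from the example relation.
SP = [
    ("S1", "P1", 4, 10), ("S1", "P7", 5, 10), ("S1", "P3", 9, 10), ("S1", "P5", 6, 10),
    ("S2", "P1", 2, 4), ("S2", "P9", 3, 3), ("S2", "P1", 8, 10), ("S2", "P5", 9, 10),
    ("S3", "P1", 8, 10), ("S4", "P2", 6, 9), ("S4", "P5", 4, 8), ("S4", "P7", 5, 10),
]

def pack(rows):
    """rows: iterable of (key, start, end). Returns packed rows sorted by key, start."""
    groups = defaultdict(list)
    for key, s, e in rows:
        groups[key].append((s, e))
    packed = []
    for key in sorted(groups):
        merged = []
        for s, e in sorted(groups[key]):
            if merged and s <= merged[-1][1] + 1:      # overlapping or adjacent
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        packed.extend((key, s, e) for s, e in merged)
    return packed

# Project SP on {S#, DURING}, then pack on DURING.
result = pack((sno, s, e) for sno, _pno, s, e in SP)
print(result)
```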

Example 2
Inactive (passive) supplier intervals: Get S#-DURING pairs for suppliers who have been unable to supply any parts at all during at least one interval of time, where DURING designates such an interval.

    PACK ( ( UNPACK S {S#,DURING} ON DURING )
           MINUS
           ( UNPACK SP {S#,DURING} ON DURING ) ) ON DURING

RESULT
S#   During
S2   [d07,d07]
S3   [d03,d07]
S5   [d02,d10]
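Example 2 follows the unpack, subtract, pack pattern, which can be sketched with sets of (supplier, day) pairs. The sketch assumes integer days; `unpack` and `pack_points` are hypothetical helper names, and the SP rows are already projected on S# and During.

```python
# S and the projection of SP on {S#, DURING}, as (S#, start, end).
S = [("S1", 4, 10), ("S2", 2, 4), ("S2", 7, 10), ("S3", 3, 10), ("S4", 4, 10), ("S5", 2, 10)]
SP = [
    ("S1", 4, 10), ("S1", 5, 10), ("S1", 9, 10), ("S1", 6, 10),
    ("S2", 2, 4), ("S2", 3, 3), ("S2", 8, 10), ("S2", 9, 10),
    ("S3", 8, 10), ("S4", 6, 9), ("S4", 4, 8), ("S4", 5, 10),
]

def unpack(rows):
    """Expand every interval into individual (key, time point) pairs."""
    return {(k, p) for k, s, e in rows for p in range(s, e + 1)}

def pack_points(points):
    """Fold consecutive time points of the same key back into intervals."""
    result = []
    for k, p in sorted(points):
        if result and result[-1][0] == k and result[-1][2] + 1 == p:
            result[-1] = (k, result[-1][1], p)   # extend the current interval
        else:
            result.append((k, p, p))             # start a new interval
    return result

inactive = pack_points(unpack(S) - unpack(SP))   # set difference plays the role of MINUS
print(inactive)
```

Working on unit time points makes the difference a plain set operation; the cost is that unpacked relations grow with the length of the intervals.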

More Relational Operators


USING ( AList ) r1 op r2 is shorthand for:
    PACK ( ( UNPACK r1 ON ( AList ) ) op ( UNPACK r2 ON ( AList ) ) ) ON ( AList )
where op is UNION, INTERSECT, MINUS or JOIN.

Various comparison operators on relations are defined similarly.

Temporal Databases
1. VALID-TIME TEMPORAL DATA MODEL
2. TIME NORMALIZATION
3. TEMPORAL QUERY LANGUAGE
4. CONCEPTUAL DESIGN AND LOGICAL DESIGN

1. VALID-TIME TEMPORAL DATA MODEL


Temporal database systems typically use relational databases, which provide well-defined data models and query languages. However, the relational model has two significant shortcomings regarding temporal data:
1. The relational model provides poor support for storing complex temporal information. For example, it does not support automatic merging of temporally overlapping data.
2. The SQL query language provides very limited support for expressing temporal queries.
Therefore, applications that work with complex temporal data should define their own (1) temporal models and (2) query systems.

2. TIME NORMALIZATION
This section defines different types of synchronism among time-varying attributes. It is valid to maintain synchronous attributes in a single relation. We define the concept of temporal dependence, which is used to define the notion of time normalization.

Synchronism and Temporal Dependence
A set of time-varying attributes (TVAs) in a given relation is called synchronous if every TVA can be uniformly associated with, and directly applied to, the timestamp values in each tuple of the relation.

Example 1: The Employee Relation. Here, an employee gets a raise in salary if and only if he or she gets a promotion, and an employee is never demoted. Thus, the Salary and Position form a set of synchronous attributes.

Empno   Salary   Position    TS   TE
33      20K      Typist      12   24
33      25K      Secretary   25   35
45      27K      Jr Engr     28   37
45      30K      Sr Engr     38   42

Example 2: The relation Maintenance. All time-varying attributes Part, Cond, Place and Cost collectively describe the maintenance event. These TVAs form a quasi synchronous set.

Plane#   Part    Cond       Place     Cost   TS   TE
91       Wheel   Detached   Atlanta   1000   10   20
105      Door    Broken     N.Y.      2000   35   47
105      Door    Unhinged   L.A.      2500   35   62
142      Wing    Cracked    Boston    7000   60   72

The relation Sal-Mgr

Empno   Salary   Manager    TS   TE
52      18K      Smith      5    9
52      20K      Smith      10   20
52      25K      Smith      21   29
52      25K      Jones      30   38
52      31K      Jones      39   42
52      31K      Smith      43   47
52      38K      Smith      48   Now
97      30K      Bradford   12   17
97      35K      Bradford   18   Now

Consider the relation Sal-Mgr. The relation shows the manager and salary of employees over a period of time. In this relation, the attributes Salary and Manager form two singleton synchronous sets: they change in an asynchronous fashion. Such asynchronism leads to the fragmentation of the lifespan information of a TVA over several tuples and creates update and retrieval anomalies.

Definition (Temporal dependence). Let R be a time-varying relation, where K is its time-invariant key, Xi, for i in [1,n], are its TVAs, and TS and TE are its timestamp attributes. In a relational schema R, for any two TVAs Xi and Xj (i != j), R is said to have a temporal dependency, Xi T Xj, iff there exists an instance of R that contains two tuples t1 and t2 such that:
- t1(K) = t2(K)
- t1(Xi) = t2(Xi) XOR t1(Xj) = t2(Xj)
- the intervals [t1(TS), t1(TE)] and [t2(TS), t2(TE)] are adjacent.

In Sal-Mgr, the attributes Salary and Manager, according to the above definition, have a temporal dependency (consider two tuples <52, 18K, Smith, 5, 9> and <52, 20K, Smith, 10, 20> or two tuples <52, 25K, Smith, 21, 29> and <52, 25K, Jones, 30, 38>).
Temporal dependencies arise when two or more temporally unrelated facts are mixed in one time-varying relation.
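The definition can be checked mechanically on the two example relations. The sketch below assumes rows of the form (key, Xi, Xj, TS, TE), reads "adjacent" as "t2 starts at the time point right after t1 ends", and uses a hypothetical integer sentinel `NOW` for the open-ended timestamp.

```python
NOW = 9999  # hypothetical sentinel standing in for the "Now" timestamp

SAL_MGR = [
    (52, "18K", "Smith", 5, 9), (52, "20K", "Smith", 10, 20), (52, "25K", "Smith", 21, 29),
    (52, "25K", "Jones", 30, 38), (52, "31K", "Jones", 39, 42), (52, "31K", "Smith", 43, 47),
    (52, "38K", "Smith", 48, NOW), (97, "30K", "Bradford", 12, 17), (97, "35K", "Bradford", 18, NOW),
]
EMPLOYEE = [
    (33, "20K", "Typist", 12, 24), (33, "25K", "Secretary", 25, 35),
    (45, "27K", "Jr Engr", 28, 37), (45, "30K", "Sr Engr", 38, 42),
]

def dependency_witnesses(rows):
    """Return tuple pairs (t1, t2) that witness a temporal dependency between Xi and Xj."""
    witnesses = []
    for t1 in rows:
        for t2 in rows:
            same_key = t1[0] == t2[0]
            adjacent = t1[4] + 1 == t2[3]                    # t2 directly follows t1
            exactly_one = (t1[1] == t2[1]) != (t1[2] == t2[2])  # XOR of the two equalities
            if same_key and adjacent and exactly_one:
                witnesses.append((t1, t2))
    return witnesses

print(len(dependency_witnesses(SAL_MGR)), len(dependency_witnesses(EMPLOYEE)))
```

Sal-Mgr yields witness pairs because Salary and Manager change independently, while the Employee relation yields none because Salary and Position always change together.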

3. TEMPORAL QUERY LANGUAGE


TSQL is a query language designed for querying a temporal database. It was proposed by S.B. Navathe and R. Ahmed in 1993. TSQL is a superset of SQL and introduces several new semantic and syntactic components. TSQL adds the following new constructs to standard SQL:
- Conditional temporal expressions using the WHEN clause
- Retrieval of timestamp values with or without computation
- Retrieval of temporally ordered information
- Specification of time domain using the TIME-SLICE clause
- Modified aggregate functions and the GROUP BY clause

The formal syntax of a TSQL retrieval statement:
    SELECT [FIRST | SECOND | THIRD | Nth | LAST] select_item_list
    FROM table_name_list
    WHEN temporal_comparison_list
    WHERE search_condition_list

Example Database
TSQL will be illustrated by examples on a database with the following relational schema:
    E(eno, name, address, date-of-birth)
    S(eno, salr, TS, TE)
    M(eno, mgr, TS, TE)
    T(eno, city, country, cost, TS, TE)
E stands for Employee, S for Salary, M for Manager, and T for Travel.

Temporal Query Semantics


The syntax of the temporal query language is an extension of standard SQL syntax. The semantics of a temporal query are based on the temporal relational model outlined in section 1. The temporal semantics contained in a temporal query cannot be translated into standard relational algebra. We therefore modify and extend standard relational algebra to create a version that incorporates temporal operations on time points and intervals. The algebraic operators that support temporal querying requirements are temporal projection, selection, and join.

Assumption: well-defined tables Assume that the valid time component in temporal table(s) must be well-defined before performing the operation. That means temporal tables do not contain tuples with the same non-temporal attribute values but overlapping or consecutive time intervals. Such tuples are automatically folded in advance by merging their time intervals.

Temporal projection
Temporal projection is similar to standard projection, except that the restriction applies only to the non-temporal attributes: neither timestamp column can be excluded from the resulting history. After temporal projection, folding is enforced so that adjoining intervals are merged into a single interval in the resulting relation.
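Projection followed by folding can be sketched on the Sal-Mgr relation by projecting away Salary. The sketch assumes rows end with (TS, TE), integer timestamps, and a hypothetical sentinel `NOW`; `temporal_project` is an illustrative helper, not a TSQL operator.

```python
NOW = 9999  # hypothetical sentinel standing in for the "Now" timestamp

SAL_MGR = [
    (52, "18K", "Smith", 5, 9), (52, "20K", "Smith", 10, 20), (52, "25K", "Smith", 21, 29),
    (52, "25K", "Jones", 30, 38), (52, "31K", "Jones", 39, 42), (52, "31K", "Smith", 43, 47),
    (52, "38K", "Smith", 48, NOW), (97, "30K", "Bradford", 12, 17), (97, "35K", "Bradford", 18, NOW),
]

def temporal_project(rows, keep):
    """Keep the non-temporal columns listed in `keep` plus TS/TE, then fold adjoining intervals."""
    projected = sorted((tuple(r[i] for i in keep), r[-2], r[-1]) for r in rows)
    folded = []
    for values, ts, te in projected:
        if folded and folded[-1][0] == values and ts <= folded[-1][2] + 1:
            prev = folded[-1]
            folded[-1] = (values, prev[1], max(prev[2], te))   # merge into previous interval
        else:
            folded.append((values, ts, te))
    return [values + (ts, te) for values, ts, te in folded]

mgr_history = temporal_project(SAL_MGR, keep=(0, 2))   # Empno and Manager
print(mgr_history)
```

Without the folding step the projection would keep seven fragments for employee 52, one per salary change, even though the manager changed only twice.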

Temporal selection
TSQL adds the following new construct to standard SQL: selection based on temporal comparisons of time points and intervals using terms in a WHEN clause. The WHEN clause is used to express the temporal part of a query. The temporal comparison in the WHEN clause has the following form:
    WHEN a interval_compare_operator b
where a and b are intervals and interval_compare_operator is one of the keywords BEFORE, AFTER, DURING, EQUIVALENT, ADJACENT, OVERLAPS, PRECEDES, and FOLLOWS.
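For comparison, an overlap condition of the kind a WHEN clause expresses can be written in standard SQL directly on the TS and TE columns. The sketch below is not TSQL: it uses SQLite through Python's `sqlite3` module, the S and M schemas from the example database, and a few illustrative rows chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S (eno INTEGER, salr INTEGER, TS INTEGER, TE INTEGER);
CREATE TABLE M (eno INTEGER, mgr TEXT, TS INTEGER, TE INTEGER);
INSERT INTO S VALUES (52, 18, 5, 9), (52, 20, 10, 20), (52, 25, 21, 38);
INSERT INTO M VALUES (52, 'Smith', 5, 29), (52, 'Jones', 30, 42);
""")

# Salary and manager that held at the same time, with the common sub-interval.
# Two closed intervals overlap when each starts no later than the other ends.
rows = conn.execute("""
SELECT sal.eno, sal.salr, mg.mgr,
       MAX(sal.TS, mg.TS) AS start_t,
       MIN(sal.TE, mg.TE) AS end_t
FROM S AS sal JOIN M AS mg ON sal.eno = mg.eno
WHERE sal.TS <= mg.TE AND mg.TS <= sal.TE
ORDER BY start_t
""").fetchall()
print(rows)
```

The interval logic has to be spelled out by hand in the WHERE clause and the select list, which is the burden a dedicated temporal comparison keyword is meant to remove.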

CONCEPTUAL DESIGN AND LOGICAL DESIGN FOR TEMPORAL DATABASES


Conceptual Design
There are several approaches to conceptual design for temporal databases. Snodgrass advocates the following approach: conceptual design initially ignores the time-varying nature of the application. We focus on capturing the current reality and temporarily ignore any history that may be useful. Only after the full design is complete do we augment the ER schema with the time-varying semantics of the application. We consider each component of the ER schema in turn, annotating that component with its temporal semantics, if any. Entity types, relationship types, attributes, and keys are each individually considered.

Nontemporal ER Schema
- Strong Entity Types
- Weak Entity Types
- Entity Type Identifiers (Key Attributes)
- Attributes
- Relationship Types
- Integrity Constraints

Adding Temporal Annotations

Entity Lifespans
Entities have a lifespan denoting when they existed. Entities are instantaneous or have a lifespan with a duration. If the entities of an entity type exist for all of time, there may be no need to record the lifespan explicitly (they are nontemporal). Otherwise, the entity types are temporal. In this case, the designer should also specify the granularity of the lifespan.

Adding Temporal Annotations (cont.)


Relationship Valid Time
A relationship type can model either instantaneous relationships or relationships that have a duration. The valid time for any specific relationship must be a subset of the intersection of the lifespans of the associated entities.
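The subset constraint can be expressed as a small check. This sketch assumes a binary relationship and models lifespans and valid times as closed integer intervals; the function names are hypothetical.

```python
def intersection(a, b):
    """Common part of two closed intervals, or None if they share no time point."""
    s, e = max(a[0], b[0]), min(a[1], b[1])
    return (s, e) if s <= e else None

def valid_time_ok(rel_valid, lifespan1, lifespan2):
    """True if the relationship's valid time lies within both entity lifespans."""
    common = intersection(lifespan1, lifespan2)
    return common is not None and common[0] <= rel_valid[0] and rel_valid[1] <= common[1]

print(valid_time_ok((12, 20), (5, 30), (10, 25)))   # inside [10, 25]
print(valid_time_ok((8, 20), (5, 30), (10, 25)))    # starts before the second entity exists
```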

Adding Temporal Annotations (cont.)


Valid Time of Attributes
The value of an attribute may change over the lifespan of the associated entity or the valid time of the associated relationship, or may not vary over time. The valid time of an attribute's value for any specific entity (or relationship) must be a subset of the lifespan of that entity (relationship).

Key Attributes
A time-varying key uniquely identifies a particular entity at each point in time. A nontemporal key (time-invariant key) identifies a particular entity over all time.

Logical Design for Temporal DB


Logical Design proceeds in two stages. First, the nontemporal ER schema is mapped to a nontemporal relation schema, a collection of tables. Here again we ignore the temporal aspects of the application. In the second stage, each of the annotations is applied to the logical schema, modifying the tables (or the integrity constraints) to accommodate that temporal aspect. We proceed in a disciplined fashion, dealing with each annotation in turn.

Mapping to Relational Schema.


The nontemporal ER schema is mapped to a nontemporal relation schema, a collection of tables.

Applying Temporal Annotations


User-Defined-Time Attributes
Each attribute is mapped to a column in the associated table. Attributes that record user-defined-time values can be of type instant, interval or period. All temporal values have a granularity.

Entity Lifespan
For each table corresponding to an entity type for which the lifespan, or the valid time of an associated attribute, is captured, there are two alternatives for timestamps:
- instant
- period (represented with two instants)

Applying Temporal Annotations (cont.)


Relationship Valid Time To each table corresponding to a relationship type with a recorded valid-time extent or having attribute(s) whose valid time is recorded, we add either instant or period timestamps.

In short, for tables corresponding to entity and relationship types for which valid time is to be recorded, add either
- a single instant timestamp column, or
- a period timestamp, represented with two instant timestamp columns.

Applying Temporal Annotations (cont.)


Valid Time of Attributes
If some attributes have a valid time, and the lifespan of the associated entity or the valid time of the associated relationship is not recorded, the time-varying columns should be placed in a separate table, along with the primary key of the original table, which also serves as a foreign key to that table. This task is termed temporal support decomposition.
Example: EMPLOYEE(empno, sal, sex, addr, birth-of-date, ...), in which sal is an attribute with valid time, should be separated into two tables:
    EMPLOYEE(empno, sex, addr, birth-of-date, ...)
    SALARY(empno, sal, ts, te)
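The decomposition can be tried out in SQLite through Python's `sqlite3` module. This is a sketch under assumptions: the column types, the `birth_of_date` spelling (hyphens are not valid in unquoted SQL identifiers), the composite primary key (empno, ts) and the CHECK constraint are choices made for this example, and `salary_at` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE EMPLOYEE (
    empno INTEGER PRIMARY KEY,
    sex TEXT, addr TEXT, birth_of_date TEXT
);
CREATE TABLE SALARY (
    empno INTEGER NOT NULL REFERENCES EMPLOYEE(empno),
    sal   INTEGER NOT NULL,
    ts    INTEGER NOT NULL,
    te    INTEGER NOT NULL,
    PRIMARY KEY (empno, ts),               -- assumption: one salary row per start time
    CHECK (ts <= te)
);
INSERT INTO EMPLOYEE VALUES (45, 'F', 'Atlanta', '1970-01-01');
INSERT INTO SALARY VALUES (45, 27000, 28, 37), (45, 30000, 38, 42);
""")

def salary_at(empno, t):
    """Salary valid for the employee at time point t, or None if no row covers t."""
    row = conn.execute(
        "SELECT sal FROM SALARY WHERE empno = ? AND ts <= ? AND ? <= te",
        (empno, t, t),
    ).fetchone()
    return row[0] if row else None

print(salary_at(45, 40))
```

The time-invariant columns stay in EMPLOYEE and are stored once, while each salary change adds one row to SALARY.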

Applying Temporal Annotations (cont.)


Note: When the granularity of the attribute is finer than that of the entity or relationship type to which the attribute is attached, there are two options:
(1) Change the granularity of the associated table (entity type) to that of the column (attribute).
(2) Break off those columns into a separate table; this is termed precision decomposition.

In short, we should decompose tables so that all attributes of a table have an identical temporal support and precision.

Spatial Databases

Outline:
a rather old (but quite complete) survey on Spatial DBMS

- Introduction & definition
- Modeling
- Querying
- Data structures & algorithms
- System architecture

Introduction
A common technology for several applications:
- GIS (geographic/geo-referenced data)
- VLSI design (geometric data)
- modeling complex phenomena (spatial data)

All need to manage large collections of relatively simple spatial objects.

Spatial DB vs. image/pictorial DB [1990]:
- A spatial DB contains objects in the space.
- An image DB contains representations of a space (images, pictures, ...: raster data).

SDBMS Definition
A spatial database system:
- Is a database system
  - A DBMS with additional capabilities for handling spatial data
- Offers spatial data types (SDTs) in its data model and query language
  - Structure in space: e.g., POINT, LINE, REGION
  - Relationships among them: (l intersects r)
- Supports SDTs in its implementation, providing at least
  - spatial indexing (retrieving objects in a particular area without scanning the whole space)
  - efficient algorithms for spatial joins (not simply filtering the Cartesian product)
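To illustrate what "without scanning the whole space" means, here is a minimal sketch of a uniform grid index over points. It is one simple indexing scheme chosen for illustration, not the structure any particular SDBMS uses; the class name, cell size and sample points are all assumptions.

```python
from collections import defaultdict

class GridIndex:
    """Uniform grid over the plane; each cell lists the points that fall in it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, name, x, y):
        self.cells[self._cell(x, y)].append((name, x, y))

    def window_query(self, xmin, ymin, xmax, ymax):
        """Names of points inside the window; only cells touching the window are visited."""
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for name, x, y in self.cells.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:   # exact check
                        hits.append(name)
        return sorted(hits)

index = GridIndex(cell_size=10)
for name, x, y in [("a", 1, 1), ("b", 12, 3), ("c", 25, 25), ("d", 14, 14)]:
    index.insert(name, x, y)

print(index.window_query(10, 0, 20, 15))
```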

Modeling
Assuming 2-D and a GIS application, two basic things need to be represented:
- Objects in space: cities, forests, or rivers (single objects)
- Coverage/field: say something about every point in space, e.g., partitions, thematic maps (spatially related collections of objects)

Modeling: spatial primitives for objects


- Point: object represented only by its location in space, e.g. center of a state
- Line (actually a curve or polyline): representation of moving through, or connections in, space, e.g. road, river
- Region: representation of an extent in 2-D space, e.g. lake, city

Modeling: coverages
- Partition: set of region objects that are required to be disjoint (of interest: adjacency, i.e. region objects with common boundaries), e.g. thematic maps
- Networks: embedded graph in the plane consisting of a set of point objects (vertices) and line objects (edges), e.g. highways, power supply lines, rivers

Modeling: a sample spatial type system (1/2)


EXT={lines, regions}, GEO={points, lines, regions}

Spatial predicates for topological relationships:


- inside: geo x regions -> bool
- intersect, meets: ext1 x ext2 -> bool
- adjacent, encloses: regions x regions -> bool
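As an illustration of one such predicate, the sketch below implements `inside` for the special case of a point and a simple polygonal region, using the standard ray-casting test. Regions are assumed to be lists of (x, y) vertices; points exactly on the boundary are not treated specially.

```python
def inside(point, region):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    result = False
    n = len(region)
    for i in range(n):
        x1, y1 = region[i]
        x2, y2 = region[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                            # crossing lies to the right of the point
                result = not result
    return result

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(inside((2, 2), square), inside((5, 2), square))
```

An odd number of crossings means the point is inside, which is why the flag is toggled on every crossing.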

Operations returning atomic spatial data types:


- intersection: lines x lines -> points
- intersection: regions x regions -> regions
- plus, minus: geo x geo -> geo
- contour: regions -> lines

Modeling: a sample spatial type system (2/2)


Spatial operators returning numbers
- dist: geo1 x geo2 -> real
- perimeter, area: regions -> real
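The numeric operators can be sketched for simple cases. The code assumes points are (x, y) pairs and regions are simple polygons given as vertex lists; `dist` is restricted to two points here, whereas the signature above allows any pair of spatial values. The area uses the shoelace formula.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def perimeter(region):
    """Sum of the edge lengths of a polygon given as a list of vertices."""
    n = len(region)
    return sum(dist(region[i], region[(i + 1) % n]) for i in range(n))

def area(region):
    """Shoelace formula for the area of a simple polygon."""
    n = len(region)
    twice = sum(
        region[i][0] * region[(i + 1) % n][1] - region[(i + 1) % n][0] * region[i][1]
        for i in range(n)
    )
    return abs(twice) / 2

rect = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(dist((0, 0), (3, 4)), perimeter(rect), area(rect))
```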

Spatial operations on set of objects


- sum: set(obj) x (obj -> geo) -> geo
  A spatial aggregate function: the geometric union of all attribute values, e.g. the union of a set of provinces determines the area of the country.
- closest: set(obj) x (obj -> geo1) x geo2 -> set(obj)
  Determines, within a set of objects, those whose spatial attribute value has minimal distance from the geometric query object.
- Other complex operations: overlay, buffering, ...

Modeling: spatial relationships


- Topological relationships: e.g. adjacent, inside, disjoint. They are invariant under topological transformations like translation, scaling, rotation.
- Direction relationships: e.g. above, below, or north_of, southwest_of, ...
- Metric relationships: e.g. distance
There are 6 valid topological relationships between two simple regions (no holes, connected): disjoint, in, touch, equal, cover, overlap.

Modeling: SDBMS data model


The DBMS data model must be extended by SDTs at the level of atomic data types (such as integer, string), or better, be open for user-defined types (OR-DBMS approach):
    relation states (sname: STRING; area: REGION; spop: INTEGER)
    relation cities (cname: STRING; center: POINT; ext: REGION; cpop: INTEGER)
    relation rivers (rname: STRING; route: LINE)

Querying
Two main issues:
1. Connecting the operations of a spatial algebra (including predicates for spatial relationships) to the facilities of a DBMS query language. The fundamental spatial algebra operators are:
   - Spatial selection
   - Spatial join (overlay, fusion)

2. Providing graphical presentation of spatial data (i.e. results of queries), and graphical input of SDT values used in queries.

Querying: spatial selection


Spatial selection: returns those objects that satisfy a spatial predicate with the query object.
- All cities in Bavaria:
    SELECT cname FROM cities c WHERE c.center inside Bavaria.area
- All rivers intersecting a query window:
    SELECT * FROM rivers r WHERE r.route intersects Window
- All big cities no more than 100 km from Hagen:
    SELECT cname FROM cities c WHERE dist(c.center, Hagen.center) < 100 and c.cpop > 500k
  (conjunction with other predicates and query optimization)

Querying: spatial join


Spatial join: A join which compares any two joined objects based on a predicate on their spatial attribute values.
For each river passing through Bavaria, find all cities within less than 50 km.
    SELECT r.rname, c.cname, length(intersection(r.route, c.area))
    FROM rivers r, cities c
    WHERE r.route intersects Bavaria.area and dist(r.route,c.area) < 50
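The distance part of such a join can be sketched as a nested loop over both relations. This is a simplification under stated assumptions: rivers are polylines, cities are reduced to their center points, the Bavaria restriction and the intersection length are omitted, and all names and coordinates are made up for illustration.

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to the segment from a to b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the orthogonal projection, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def dist_to_route(p, route):
    """Minimal distance from p to any segment of the polyline."""
    return min(point_segment_dist(p, a, b) for a, b in zip(route, route[1:]))

rivers = [("R1", [(0, 0), (100, 0), (200, 100)]), ("R2", [(0, 300), (300, 300)])]
cities = [("A", (50, 30)), ("B", (150, 100)), ("C", (100, 290)), ("D", (400, 400))]

def spatial_join(rivers, cities, max_dist):
    """All (river, city) pairs whose distance is below max_dist (nested-loop join)."""
    return [
        (rname, cname)
        for rname, route in rivers
        for cname, center in cities
        if dist_to_route(center, route) < max_dist
    ]

print(spatial_join(rivers, cities, 50))
```

The nested loop evaluates the predicate on the full Cartesian product, which is exactly what a spatial DBMS tries to avoid with spatial indexes and dedicated join algorithms.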

Querying: I/O (1/2)


Graphical I/O issue: how to determine "Window" or "Bavaria" in the previous examples (input), or how to show "intersection(route, Bavaria.area)" or "r.route" (output). Results are usually a combination of several queries.
Requirements for spatial querying [Egenhofer]:
- Spatial data types
- Graphical display of query results
- Graphical combination (overlay) of several query results (start a new picture, add/remove layers, change order of layers)
- Display of context (e.g., show background such as a raster image (satellite image) or boundary of states)
- Facility to check the content of a display (which query contributed to the content)

Querying: I/O (2/2)


Other requirements for spatial querying [Egenhofer]:
- Extended dialog: use a pointing device to select objects within a subarea, zooming, ...
- Varying graphical representations: different colors, patterns, intensity, symbols for different object classes or even objects within a class
- Legend: clarify the assignment of graphical representations to object classes
- Label placement: selecting object attributes (e.g., population) as labels
- Scale selection: determines not only the size of the graphical representations but also what kind of symbol is used and whether an object is shown at all
- Subarea for queries: focus attention for follow-up queries
