
INFO2120 SUMMARY BOOKLET
SEMESTER 1, 2014

By the awesome people in Jenna's tutorials

These notes are merged from multiple groups summarizing the same chapters.
Topic missing from this booklet:
Week 1: Introduction to Databases and Transactions

Week 2: Conceptual DB Design (ER diagrams)


Conceptual design - a technique for understanding and capturing business information requirements
graphically; it facilitates planning, operation and maintenance of various data resources.
Entities
Entity - a person, place, object, event, or concept about which you want to gather and store data. It must
be distinguishable from other entities. E.g. John Doe, unit COMP5138, account 4711
Entity type (set) - a collection of entities that share common properties or characteristics, e.g.
student, course, account. (Represented as a rectangle.) NOTE: entity sets need not be disjoint.
Attribute - describes one aspect of an entity type, e.g. people have a name and an address.
Relationships
Relationship - relates two or more entities; the number of entities involved is also known as the degree of the
relationship. E.g. John is enrolled in INFO2120.
Relationship type (relationship set) - a set of similar relationships, e.g. Student (entity type) is related to
UnitOfStudy (entity type) by EnrolledIn (relationship type).
Distinction: a relation (relational model) is a set of tuples; a relationship (E-R model) describes a
connection between entities; both entity sets and relationship sets (E-R model) may be represented
as relations (in the relational model).
Schema of relationship types
- The combination of the primary keys of the participating entity types forms a super key of a
relationship.
- Relationship set schema: relationship name, role names, relationship attributes and their
types, key.

Key constraints - if, for a particular participating entity type, each entity participates in at most one
relationship, the corresponding role is a key of the relationship type, e.g. the employee role is unique in WorksIn.
Participation constraint - if every entity participates in at least one relationship, a participation
constraint holds. A participation constraint of entity type E having some role in relationship
type R states that for every e in E there is an r in R such that role(r) = e. (Represented in an E-R diagram by a thick
line.)
Cardinality constraints - a generalisation of key and participation constraints. A cardinality
constraint for the participation of an entity set E in a relationship R specifies how often an entity of
set E participates in R: at least (minimum cardinality) and at most (maximum cardinality).
Weak entities - an entity type that does not have a primary key of its own, e.g. a child of parents, or payments of a
loan. The primary key of a weak entity type is formed by the primary key of the strong entity type(s)
on which the weak entity type is existence-dependent, plus the weak entity type's discriminator.
Constraints On ISA Hierarchies
- Overlap constraints - disjoint: an entity can belong to only one lower-level entity set;
  overlapping: an entity can belong to more than one lower-level entity set
- Covering constraints - total: an entity must belong to one of the lower-level entity sets;
  partial (the default): an entity need not belong to one of the lower-level entity sets

Week 3: The Relational Data Model (NULLs, keys, referential integrity)


The relational data model is based on the mathematical concept of a relation.
The strength of the relational approach to data management comes from its simple way of structuring data.
Data Model vs. Schema
Data model: a collection of concepts for describing data
Schema: a description of a particular collection of data at some abstraction level, using a given data model.
Relational data model is most widely used model today
Definition of a relation: a relation is a named, two-dimensional table of data -> it consists of rows (records) and columns
(attributes or fields).
Relation schema vs. Relation instance
A relation R has a relation schema: it specifies the name of the relation and the name and data type of each attribute.
A relation instance: a set of tuples (table) for a schema
Creating and Deleting Relations in SQL

Creation of a table (relation): CREATE TABLE name (list of columns)

Deletion of a table (relation): DROP TABLE name


Base data types of SQL
SMALLINT/INTEGER/BIGINT - integer values
DECIMAL/NUMERIC - fixed-point numbers
FLOAT(p)/REAL - floating-point numbers with precision p
CHAR/VARCHAR/CLOB - alphanumerical character string types
Null value
An RDBMS allows the special entry NULL in a column to represent facts that are not relevant or not yet known.
PRO: NULL is useful because using an ordinary value with a special meaning does not always work.
CON: NULL causes complications in the definition of many operations.
Modifying relations using SQL
Insertion of new data into a table: INSERT INTO table (list of columns) VALUES (list of expressions)
Updating of tuples in a table: UPDATE table SET column = expression {, column = expression}
Deletion of tuples from a table: DELETE FROM table [WHERE search_condition]
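Since the booklet's SQL snippets aren't directly runnable, here is a minimal sketch of the CREATE/INSERT/UPDATE/DELETE statements above, run through Python's built-in sqlite3 module; the Student table and its rows are illustrative, not from the lecture material.

```python
import sqlite3

# In-memory database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: relation schema with a name and typed columns
cur.execute("CREATE TABLE Student (sid INTEGER PRIMARY KEY, name VARCHAR(40), country CHAR(3))")

# INSERT: add new tuples
cur.execute("INSERT INTO Student (sid, name, country) VALUES (1, 'John Doe', 'AUS')")
cur.execute("INSERT INTO Student (sid, name, country) VALUES (2, 'Jane Roe', 'NZL')")

# UPDATE: modify existing tuples matching a search condition
cur.execute("UPDATE Student SET country = 'AUS' WHERE sid = 2")

# DELETE: remove tuples matching a search condition
cur.execute("DELETE FROM Student WHERE sid = 1")

print(cur.execute("SELECT sid, name, country FROM Student").fetchall())
# → [(2, 'Jane Roe', 'AUS')]
```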
Relational database
Data structure: a relational database is a set of relations with tuples and fields- a simple and consistent structure.
Data manipulation: powerful operators to manipulate the data stored in relations
Data integrity: facilities to specify a variety of rules to maintain the integrity of data when it is manipulated.
Integrity constraints
An integrity constraint is a condition that must be true for any instance of the database. A legal instance of a relation is one that satisfies all specified
ICs.
Non-null columns
One domain constraint is to insist that no value in a given column can be NULL. Note that in an SQL-based RDBMS (without a key constraint) it is possible to insert a
row where every attribute has the same value as an existing row, i.e. tables may contain duplicate rows.
Relational keys
Primary keys are unique, minimal identifiers in a relation, e.g. CONSTRAINT Student_PK PRIMARY KEY (sid).
Foreign keys are identifiers that enable a dependent relation to refer to its parent relation, e.g. FOREIGN KEY (lecturer)
REFERENCES Lecturer (empid).
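A small sketch of the primary/foreign key idea, again using Python's sqlite3 as a stand-in RDBMS; the Lecturer/UnitOfStudy schema follows the booklet's example, and note that SQLite only enforces foreign keys when the pragma is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.execute("CREATE TABLE Lecturer (empid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE UnitOfStudy (
    uos_code CHAR(8) PRIMARY KEY,
    lecturer INTEGER,
    FOREIGN KEY (lecturer) REFERENCES Lecturer (empid))""")

conn.execute("INSERT INTO Lecturer VALUES (100, 'Jenna')")
conn.execute("INSERT INTO UnitOfStudy VALUES ('INFO2120', 100)")  # ok: parent row exists

# Referential integrity: a child row may not point at a nonexistent parent.
try:
    conn.execute("INSERT INTO UnitOfStudy VALUES ('COMP5138', 999)")  # no lecturer 999
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```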
Mapping E-R diagrams into relations
Each entity type becomes a relation; simple attributes map directly to attributes of the relation. Composite attributes are flattened out by
creating a separate field for each component attribute.
Weak entity type
Becomes a separate relation with a foreign key taken from the superior (strong) entity type.
Mapping of relationship types
Many-to-many: create a new relation with the primary keys of the two entity types as its primary key
One-to-many: the primary key on the "one" side becomes a foreign key on the "many" side
One-to-one: the primary key on the mandatory side becomes a foreign key on the optional side
Relationship attributes become fields of either the dependent relation or the new relation, respectively
Relational views
A view is a virtual relation.
Syntax: create view name as <query expression>.
<query expression> is any legal query expression (can even combine multiple relations)

Week 4A: Introduction to Declarative Querying - Relational Algebra


1. Set Operations
Union ( ∪ ) - tuples in relation 1 OR in relation 2.
   o Example: R ∪ S
   o Definition: R ∪ S = { t | t ∈ R ∨ t ∈ S }
Intersection ( ∩ ) - tuples in relation 1 AND in relation 2.
   o Example: R ∩ S
   o Definition: R ∩ S = { t | t ∈ R ∧ t ∈ S }
Difference ( − ) - tuples in relation 1, but not in relation 2.
   o Example: R − S
   o Definition: R − S = { t | t ∈ R ∧ t ∉ S }
Important: R and S must be union-compatible:
   R and S have the same schema
   R and S have the same arity (same number of fields)
   Corresponding fields must have the same names and domains

2. Operations that remove parts of a relation
Selection ( σ ) - selects a subset of rows from a relation.
   o Example: σ_country='AUS' (Student)
Projection ( π ) - deletes unwanted columns from a relation.
   o Example: π_name,country (Student)
3. Operations that combine tuples from two relations
Cross-product ( × ) - allows us to fully combine two relations. Also called the Cartesian product.
Join ( ⋈_condition ) - combines matching tuples from two relations.
   o Example: Student ⋈_family_name=last_name Lecturer
Natural Join ( ⋈ ) - equijoin on all common fields.
   o Example: R ⋈ S
   o Result schema similar to cross-product, but with only one copy of the fields for which equality is
     specified.
4. A schema-level rename operation
Rename ( ρ ) - allows us to rename one field to another name.
   o Example: ρ_Classlist(2→cid, 4→uos_code) ( Enrolled × UnitOfStudy )
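The core operators above can be sketched in a few lines of Python, treating a relation as a list of dicts; the Student/Enrolled data and function names are illustrative, and this is a toy model of the algebra, not how a DBMS implements it.

```python
# Tiny relational-algebra sketch: a relation is a list of dicts (tuples).
def select(pred, r):                      # sigma: keep rows satisfying the predicate
    return [t for t in r if pred(t)]

def project(attrs, r):                    # pi: keep only named columns (set semantics)
    seen, out = set(), []
    for t in r:
        row = tuple(t[a] for a in attrs)
        if row not in seen:               # drop duplicate tuples, as pure RA does
            seen.add(row)
            out.append(dict(zip(attrs, row)))
    return out

def natural_join(r, s):                   # equijoin on all common attribute names
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in common)]

Student = [{"sid": 1, "name": "Ann", "country": "AUS"},
           {"sid": 2, "name": "Bob", "country": "NZL"}]
Enrolled = [{"sid": 1, "uos_code": "INFO2120"}]

print(select(lambda t: t["country"] == "AUS", Student))
print(project(["name"], Student))
print(natural_join(Student, Enrolled))   # joins on the shared attribute sid
```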
Six basic operations
We can distinguish between basic and derived RA operations.
1. Union ( ∪ )
2. Set Difference ( − )
3. Selection ( σ )
4. Projection ( π )
5. Cross-product ( × )
6. Rename ( ρ )
Additional (derived) operations:
Intersection, join, division:
   o Not essential, but VERY useful (cf. join)
Composition and equivalence rules
Commutation rules
1. π_A( σ_p( R ) ) = σ_p( π_A( R ) )   (provided p only references attributes in A)
2. R ⋈ S = S ⋈ R
Association rule
1. R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
Idempotence rules
1. π_A( π_B( R ) ) = π_A( R ) if A ⊆ B
2. σ_p1( σ_p2( R ) ) = σ_p1∧p2( R )
Distribution rules
1. π_A( R ∪ S ) = π_A( R ) ∪ π_A( S )
2. σ_P( R ∪ S ) = σ_P( R ) ∪ σ_P( S )
3. σ_P( R ⋈ S ) = σ_P( R ) ⋈ S if P only references R
4. π_A,B( R ⋈ S ) = π_A( R ) ⋈ π_B( S ) if the join attributes are in (A ∩ B)
5. R ⋈ ( S ∪ T ) = ( R ⋈ S ) ∪ ( R ⋈ T )

Week 4B: Introduction to SQL (+ Joins)


1. DDL (Data Definition Language) - create, drop, or alter the relation schema;
specify integrity constraints: PK, FK, NULL/NOT NULL constraints
DML (Data Manipulation Language) - query, insert, delete and modify information in the DB:
INSERT INTO, UPDATE, DELETE FROM
DCL (Data Control Language) - control the DB, e.g. administering privileges and users
SELECT - lists the columns (and expressions) that should be returned from the query;
DISTINCT removes duplicates, * = all columns; can use +, -, *, / as arithmetic operators
FROM - indicates the table(s) from which data will be obtained; lists the relations involved in the
query. AS renames relations and attributes
WHERE - indicates the conditions to include a tuple in the result;
comparison operators: =, >, >=, <, <=, !=, <>. Combine with AND, OR, and NOT.
BETWEEN allows a range query; LIKE is used for string matching, % = any substring,
_ = any character, || = concatenate
GROUP BY - indicates the categorization of tuples
HAVING - indicates the conditions to include a category
ORDER BY - sorts the result according to specified criteria, ASC (default) or DESC

Date and Time:

4 types: DATE, TIME, TIMESTAMP, INTERVAL. Can use CURRENT_DATE and
CURRENT_TIME as constants; the normal time-order comparisons apply: =, >, <, <=, >=
Main operations: EXTRACT( component FROM date ), DATE 'string', +/- INTERVAL

2. Join:
You can join two or more tables using attribute conditions.
Types of join: NATURAL JOIN, INNER JOIN and OUTER JOIN
R NATURAL JOIN S
R INNER JOIN S ON <join condition>
R INNER JOIN S USING (<list of attributes>)
R LEFT OUTER JOIN S
R RIGHT OUTER JOIN S
R FULL OUTER JOIN S
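The difference between INNER and LEFT OUTER JOIN can be seen in a quick sqlite3 sketch (the Student/Enrolled tables and data are illustrative): the outer join keeps unmatched left-hand rows, padding the right side with NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Enrolled (sid INTEGER, uos_code TEXT);
INSERT INTO Student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO Enrolled VALUES (1, 'INFO2120');
""")

# INNER JOIN: only matching tuples from both relations
inner = conn.execute("""SELECT S.name, E.uos_code
                        FROM Student S INNER JOIN Enrolled E ON S.sid = E.sid""").fetchall()

# LEFT OUTER JOIN: also keeps students with no enrolment, padded with NULL
outer = conn.execute("""SELECT S.name, E.uos_code
                        FROM Student S LEFT OUTER JOIN Enrolled E ON S.sid = E.sid""").fetchall()

print(inner)   # [('Ann', 'INFO2120')]
print(outer)   # [('Ann', 'INFO2120'), ('Bob', None)]
```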

3. NULL Values and Three-Valued Logic:


Three-valued logic uses three different result values for logical expressions:
TRUE if a condition holds;
FALSE if a condition does not hold; and
UNKNOWN if a comparison includes a NULL
The use of three-valued logic is needed because of possible NULL values in databases
and because a logical condition to be decidable needs all values to be known.

4. Set operators:
The set operations UNION, INTERSECT, and EXCEPT (Oracle:
MINUS) operate on relations and correspond to the relational
algebra operations union, intersection and set difference.
Example: (select customer_name from depositor)
union
(select customer_name from borrower)

Week 5: Nested Subqueries, Grouping, and Relational Division


Nested Subqueries

A subquery is a SELECT-FROM-WHERE expression that is nested within another query.
Common uses: set membership, set comparisons and set cardinality.

Non-correlated subqueries
- Don't depend on data from the outer query
- Execute once for the entire outer query
- Typically used with IN: a comparison operation that compares a value v with a
  set/multiset of values V, and evaluates to true if v is one of the elements in V

Correlated subqueries
- Make use of data from the outer query
- Execute once for each row of the outer query
- Can use the EXISTS operator: used to check whether the result of a
  correlated nested query is empty (contains no tuples) or not

The following queries check for each student S whether there is at least one entry in the Enrolled table for that student in INFO2120:

SELECT sid, name
FROM Student
WHERE sid IN ( SELECT E.sid
               FROM Enrolled E
               WHERE E.uos_code = 'INFO2120' )

SELECT sid, name
FROM Student S
WHERE EXISTS ( SELECT *
               FROM Enrolled E
               WHERE E.sid = S.sid AND E.uos_code = 'INFO2120' )
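The two formulations above are equivalent, which can be checked with a small sqlite3 run (the data is illustrative): the non-correlated IN version and the correlated EXISTS version return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Enrolled (sid INTEGER, uos_code TEXT);
INSERT INTO Student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO Enrolled VALUES (1, 'INFO2120'), (2, 'COMP5138');
""")

# Non-correlated subquery: the inner SELECT runs once
q_in = conn.execute("""SELECT sid, name FROM Student
                       WHERE sid IN (SELECT E.sid FROM Enrolled E
                                     WHERE E.uos_code = 'INFO2120')""").fetchall()

# Correlated subquery: the inner SELECT references the outer row S
q_exists = conn.execute("""SELECT sid, name FROM Student S
                           WHERE EXISTS (SELECT * FROM Enrolled E
                                         WHERE E.sid = S.sid
                                           AND E.uos_code = 'INFO2120')""").fetchall()

print(q_in, q_exists)  # both [(1, 'Ann')]
```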

Grouping

A group is a set of tuples that have the same value for all attributes in the grouping list.
NOTE: any non-aggregated attribute in the SELECT clause must be in the GROUP BY clause as well.

SYNTAX - it must follow this order:

SELECT target-list
FROM relation-list
WHERE qualification
GROUP BY grouping-list
HAVING group-qualification

EXAMPLE What was the average mark of each course?


SELECT uos_code as unit_of_study , AVG (mark)
FROM Assessment
GROUP BY uos_code
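The average-mark query can be run as-is in sqlite3, here extended with a HAVING clause to show how group qualifications filter whole groups; the Assessment rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Assessment (sid INTEGER, uos_code TEXT, mark INTEGER);
INSERT INTO Assessment VALUES (1, 'INFO2120', 80), (2, 'INFO2120', 60), (1, 'COMP5138', 70);
""")

# One output row per group (per uos_code); HAVING filters groups, not rows.
rows = conn.execute("""SELECT uos_code AS unit_of_study, AVG(mark)
                       FROM Assessment
                       GROUP BY uos_code
                       HAVING AVG(mark) >= 65
                       ORDER BY uos_code""").fetchall()
print(rows)  # [('COMP5138', 70.0), ('INFO2120', 70.0)]
```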

Relational Division
Definition
- R (a1, ..., an, b1, ..., bm)
- S (b1, ..., bm)
- R/S, with attributes a1, ..., an, is the set of all tuples <a> such that for every tuple <b> in S, there is an <a,b> tuple in R
- It is not an essential operator: just a useful shorthand
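SQL has no division operator, but R/S is commonly expressed with a doubly-nested NOT EXISTS ("students for whom no core unit exists that they are not enrolled in"); this sqlite3 sketch uses illustrative Enrolled/CoreUnits tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Enrolled (sid INTEGER, uos_code TEXT);   -- plays the role of R(a, b)
CREATE TABLE CoreUnits (uos_code TEXT);               -- plays the role of S(b)
INSERT INTO Enrolled VALUES (1, 'INFO2120'), (1, 'COMP5138'), (2, 'INFO2120');
INSERT INTO CoreUnits VALUES ('INFO2120'), ('COMP5138');
""")

# R / S: students enrolled in EVERY core unit
rows = conn.execute("""SELECT DISTINCT sid FROM Enrolled E1
                       WHERE NOT EXISTS (
                           SELECT * FROM CoreUnits C
                           WHERE NOT EXISTS (
                               SELECT * FROM Enrolled E2
                               WHERE E2.sid = E1.sid
                                 AND E2.uos_code = C.uos_code))""").fetchall()
print(rows)  # [(1,)] -- only student 1 takes both core units
```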

Week 6: Schema Normalization (including BCNF)


Motivation
The most important requirement of DB design is adequacy - every important process can be done using the data in the
database.
If a design is adequate, seek to avoid redundancy in the data - the same information repeated in several places.
Redundancy is at the root of several problems associated with relational schemas: redundant storage, insertion
anomalies, deletion anomalies, and update anomalies.

Functional Dependencies and Normal Forms


Functional Dependency (FD): the value of one attribute (the determinant) determines the value of another attribute.
X → Y means X functionally determines Y, and Y is functionally dependent on X.
If you know the FDs, you can check whether a column (or set of columns) is a key for the relation. There may be several candidate
keys. Choose one candidate key as the primary key. A superkey is a column (or set) that includes a candidate key.
Schema Normalisation (SN): only allow FDs in the form of key constraints. SN is the process of validating and
improving a logical design so that it satisfies certain constraints (normal forms) that avoid unnecessary duplication of
data.

First normal form (1NF): domains of all attributes are atomic


Second normal form (2NF): 1NF + no partial dependencies
Third normal form (3NF): 2NF + no transitive dependencies
BCNF: the only non-trivial FDs that hold are key constraints

Table Decomposition
A decomposition of R consists of replacing R by two or more relations such that: Each new relation scheme contains a
subset of the attributes of R (and no attributes that do not appear in R), every attribute of R appears as an attribute of
one of the new relations, and all new relations differ. Example: R ( A, B, C, D ) with FDs: {A -> B D and B -> C}.
Overall Design Process: Consider a proposed schema | Find out application domain properties expressed as
functional dependencies | See whether every relation is in BCNF | If not, use a bad FD to decompose one of the
relations; start with partial dependencies (Replace the original relation by its decomposed tables) | Repeat the above,
until you find that every relation is in BCNF.

Making it Precise
It is essential that all decompositions used to deal with redundancy be lossless!
Dependency-preserving: If R is decomposed into S and T, then all FDs that were given to hold on R must also hold
on S and/or T. (Dependency preserving does not imply lossless join & vice-versa!)
Must consider whether all FDs are preserved. If a dependency-preserving decomposition into BCNF is not possible
(or unsuitable, given typical queries), should consider decomposition into 3NF.
Candidate Key: main idea -> only allow FDs in the form of a key constraint. Each non-key field is functionally dependent
on every candidate key. Candidate key identification: identify all FDs that hold on our data set | then reason
over those FDs using a set of rules on how we can combine FDs to infer candidate keys | or alternatively, use
these FDs to verify whether a given set of attributes is a candidate key or not.
From FDs to Keys: candidate keys are defined by functional dependencies | consequently, FDs help us to identify
candidate keys. From the attribute closure to keys: the set of functional dependencies can be used to find
candidate keys.
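The attribute-closure idea can be sketched directly: repeatedly apply any FD whose left-hand side is already in the result set. The code below uses the decomposition example from above, R(A, B, C, D) with A → BD and B → C; the function name is illustrative.

```python
# Attribute closure X+ under a set of FDs, each FD given as (lhs_set, rhs_set).
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already determine lhs, we also determine rhs.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# R(A, B, C, D) with FDs A -> BD and B -> C
fds = [({"A"}, {"B", "D"}), ({"B"}, {"C"})]

print(closure({"A"}, fds))  # {'A','B','C','D'}: A+ covers all of R, so A is a candidate key
print(closure({"B"}, fds))  # {'B','C'}: B is not a key
```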

Week 7: Database Security and Integrity (+ Triggers)


Every database's security needs to be managed at some level - that's why there is database access control. There are two types of
access control, namely authentication and authorization. Authentication makes use of logins and passwords to make sure the
person who tries to log in really is who they claim to be. Authorization, on the other hand, lets the owner of the
database grant some rights to other people on their tables and views using the syntax:
GRANT privilege ON tablename TO username;
And revoke access using the syntax:
REVOKE privilege ON tablename FROM username;
Privilege: INSERT / DELETE / SELECT / UPDATE

We can also create views and grant or revoke access to them for specific people. We can add constraints to the database
such as ON DELETE NO ACTION, so that a parent table's tuple cannot be deleted while child tuples still reference it. Why are all these
measures taken? To protect the private data of individuals. Semantic integrity constraints were introduced so
that there is no loss of data consistency when changes are made to the database. One example of a semantic integrity
constraint is the UNIQUE keyword on the student ID. Integrity constraints are conditions that must be satisfied for every
instance of the database. Integrity constraints are specified in the database schema and are checked when the database is
modified. If the conditions are not satisfied, the violating transaction is aborted. There are two types of integrity
constraints, namely static and dynamic integrity constraints. A static integrity constraint is a condition that every
legal instance of a database must satisfy; some examples are domain constraints, key
constraints and assertions. A dynamic integrity constraint is a condition that a legal database state change must satisfy,

e.g. triggers. Let's say you have a database containing many VARCHAR columns with the same value restriction and you don't want to rewrite the
same check again and again; then you can use a domain constraint to create a domain which will be available to all the
tables in the database, and a check will verify that each value is within its limits. E.g.
CREATE DOMAIN domain_name AS VARCHAR CHECK (VALUE IN (...));
A DEFERRABLE constraint lets the transaction complete first and then checks the constraint (at commit time); a NOT DEFERRABLE
constraint is checked immediately, every time the database is modified.
ASSERTIONs are schema objects and static integrity constraints that make the database always satisfy a
condition. E.g. CREATE TABLE Student ( sid INTEGER PRIMARY KEY, name VARCHAR );
CREATE ASSERTION check_sid CHECK ( (SELECT COUNT(sid) FROM Student) <= 100 );
checks that the number of students must not exceed 100.
One example of a dynamic integrity constraint is the trigger. A trigger is a statement that automatically fires when some
specific modification occurs on the database. E.g.
CREATE TRIGGER name
AFTER/BEFORE INSERT OR UPDATE OF column ON tablename BEGIN action END;
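SQLite supports this trigger syntax, so the idea can be demonstrated end-to-end; the Student/AuditLog tables and the trigger body are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE AuditLog (event TEXT);

-- Dynamic constraint machinery: fires automatically after every INSERT on Student
CREATE TRIGGER log_insert AFTER INSERT ON Student
BEGIN
    INSERT INTO AuditLog VALUES ('inserted ' || NEW.sid);
END;
""")

conn.execute("INSERT INTO Student VALUES (1, 'Ann')")
print(conn.execute("SELECT * FROM AuditLog").fetchall())  # [('inserted 1',)]
```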

Week 8: DB Application Development


Database Application Architectures
-Data-intensive systems: Three types of functionality - presentation logic, processing logic, data
management
-System architectures can be 1, 2, or 3 tiered depending on presence of client, DB server and
web/application server
-Interactive SQL refers to SQL statements input directly at the terminal, with DBMS output to the screen;
non-interactive SQL refers to SQL statements included in an application program
Client-side Database Application Development
To integrate SQL with a host language (e.g. Java, C), one can either embed SQL in the language (statement-level interface), or use an API to call SQL commands (call-level interface)
PHP scripting language for dynamic websites that is embedded into HTML
Variables: begin with $, value must belong to a class, but can be declared without giving a type
Strings: double quotes replace substrings with variables, single quotes do not
Arrays: numeric arrays are indexed 0,1, etc. associative arrays are paired
PDO PHP Data Objects, extension to PHP that provides a database abstraction layer (used to
connect PHP and database). Five problems with interfacing with SQL:
1. Establishing database connection. $conn = new PDO( DSN, $userid, $passwd [,$params] );
a. PDO is DBMS independent, when creating new connection, need to insert DBMS prefix
b. New connections take some time, so should only be done once in a program
2. Executing SQL statements. Three different ways of executing SQL statements: semi-static (PDO::query(sql)), parameterized (PDO::prepare(sql)), or immediately run (PDO::exec(sql))
Placeholders: anonymous placeholders are represented as a ? inside a query and linked using
$stmt->bindValue(1, $variable) - the 1 refers to the first ? in the query.
Named placeholders are represented in the query using the format :name and are linked using
$stmt->bindValue(':name', $variable)
NULL - PHP supports NULL by default; isset($var) checks if var exists and is not NULL;
empty($var) is true if var does not exist or has an empty or zero value
Error Handling - Never show Database errors to end user.
Exception Handling - PDOException::getMessage() - returns the exception message
- PDOException::getCode() - returns the exception code
SQL injection attacks most frequently occur when an unauthorised user exploits unchecked user
input or buffer overflows in the database. Building a query dynamically by concatenating user input into the SQL
string is what opens the door to an SQL injection attack; for this reason prepared, parameterized queries
(placeholders) are the better choice to avoid this.
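The same contrast PDO's prepare/bindValue provides can be shown with Python's sqlite3 placeholders (the table and the injection string are illustrative): concatenation lets input rewrite the query, while a placeholder treats it as a plain value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (sid INTEGER, name TEXT)")
conn.execute("INSERT INTO Student VALUES (1, 'Ann')")

user_input = "' OR '1'='1"   # a classic injection attempt

# UNSAFE: string concatenation lets the input become part of the SQL itself
unsafe = conn.execute(
    "SELECT * FROM Student WHERE name = '" + user_input + "'").fetchall()

# SAFE: a placeholder treats the input as a plain value, never as SQL
safe = conn.execute(
    "SELECT * FROM Student WHERE name = ?", (user_input,)).fetchall()

print(unsafe)  # [(1, 'Ann')] -- injection succeeded: the WHERE became always-true
print(safe)    # []           -- no student is literally named "' OR '1'='1"
```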
Stored Procedures are when application logic is run from within the database server. There are many
advantages to stored procedures: improved maintainability, reduced data transfer, fewer locks that are
being held for long periods, abstraction layer (programmers need not know the schema).

Week 9: Transaction Management (ACID, serialisability)


Transaction a collection of one or more operations on one or more databases, which reflects a discrete unit of
work.
Transaction does:
Return information from database
Update the database to reflect the occurrence of a real world event
Cause the occurrence of a real world event
ACID Properties:
Atomicity: Transaction should either complete or have no effect at all
Consistency: Execution of a transaction in isolation preserves the consistency of the database
Isolation: Although multiple transactions may execute concurrently, each transaction must be unaware of other
concurrently executing transactions
Durability: The effect of a transaction on the database state should not be lost once the transaction has committed
Commit: if the transaction successfully completes
Abort: if transaction does not successfully complete
Database is consistent if all static integrity constraints are satisfied
A sequence of database operations is serializable if it is equivalent to a serial execution of the involved transactions.
A serializable execution guarantees correctness in that it moves a database from one consistent state to another
consistent state. Basically, each transaction preserves database consistency; thus it follows ACID and fulfils the
consistency component.
Concurrency control is the protocol that manages simultaneous operations against a database so that serializability
is assured.
Locking Protocol => Two-Phase Locking Protocol (2PL). A transaction must obtain either:
o S (shared) lock - the data can only be read, and may be shared by several readers.
o X (exclusive) lock - the data can be read and written by only one transaction.
Under 2PL, once a transaction releases any lock it cannot request additional locks afterwards (a growing
phase followed by a shrinking phase).
Be careful of deadlocks - a cycle of transactions waiting for locks to be released by each other.
Versioning / Snapshot Isolation => each transaction reads from a snapshot of the database taken when it started; updates
create new versions of the items, which become visible to later transactions once the writer commits.
There are different isolation levels, and different applications require different levels. From lowest to highest:
o Read uncommitted - uncommitted records may be read.
o Read committed - only committed records can be read, but successive reads of a record may return different
  values. The most used level in practice.
o Repeatable read - only committed records can be read, and repeated reads of the same record must return the same
  value. Doesn't mean a transaction is always 100% serializable.
o Serializable - the default according to the SQL standard. Means that all transactions appear to execute serially and follow ACID.

Week 10: Indexing and Tuning


A database is a collection of relations. Each relation is a set of records. A record is a sequence of attributes.
1. Indexes
- data structures to organize records via trees or hashing
1.1. Two examples:
1.1.1. Ordered index: search keys are stored in sorted order
1.1.2. Hash index: search keys are distributed uniformly across buckets using a hash function

An index is an access path to efficiently locate row(s) via search key fields without having
to scan the entire table.
Primary index: an index whose search key specifies the sequential order of the file. Also called
main index or integrated index.
Secondary index: an index whose structure is separated from the data file and whose search key
typically specifies an order different from the sequential order of the file.

In SQL, an index is created with:

CREATE INDEX name ON relation-name (<attribute-list>)
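The syntax works as-is in sqlite3; the sketch below builds a secondary index on a non-key field and asks the optimizer for its plan (the table, data, and index name are illustrative; the exact plan text varies between SQLite versions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)",
                 [(i, f"s{i}", "AUS" if i % 2 else "NZL") for i in range(1000)])

# Secondary (unclustered) index on a non-key search field
conn.execute("CREATE INDEX Student_country ON Student (country)")

# The optimizer can now locate matching rows without scanning the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Student WHERE country = 'AUS'").fetchall()
print(plan)  # the plan row should mention the Student_country index

n = conn.execute("SELECT COUNT(*) FROM Student WHERE country = 'AUS'").fetchone()[0]
print(n)  # 500
```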

Clustered index
- good for range searches over a range of search key values
- index entries and rows are ordered in the same way
- there can be at most one clustered index on a table
- CREATE TABLE generally creates an integrated, clustered (main) index on the primary key

Unclustered (secondary) index
- index entries and rows are not ordered in the same way
- there can be many unclustered indices on a table
- unclustered isn't ever as good as clustered, but may be necessary for attributes other
  than the primary key

Types of Indexes:
Tree-based indexes: B+-tree
o Very flexible; the only index type that supports point queries, range queries and prefix searches
Hash-based indexes
o Fast for equality searches
Special indexes
o Such as bitmap indexes for OLAP or R-trees for spatial databases

Week 11: Data Analysis - OLAP and Data Warehousing


The Problem/Motivation:
- Current and historical data are analyzed to identify useful patterns and to support business
strategies.
- Businesses aim at complex, interactive and exploratory analysis of datasets by integrating data
collected across the enterprise.
- The Internet helps the sharing of big data sets, and correlating them with one's own data becomes more important.
- Data visualization turns large amounts of data into useful information that businesses can understand easily
and base decisions on.
- Examples: Google Fusion Tables, Maptd.com
- Data needs to be gathered in a form suitable for analysis.
Data Warehousing: Issues and the ETL Process
- Three complementary trends of data analysis in the enterprise:
Data warehousing: consolidate data from many sources in one large repository
OLAP: interactive and online queries based on spreadsheet-style operations and a multidimensional
view of data
Data mining: exploratory search for interesting trends and anomalies
OLTP vs OLAP vs Data Mining
- OLTP (On-Line Transaction Processing) maintains a database of some real-world enterprise and supports day-to-day
operations. OLTP workloads are short, simple transactions with frequent updates that only access a small fraction
of the database at a time.
- OLAP (On-Line Analytic Processing) uses mainly historic data in the database to guide strategic decisions. OLAP
workloads contain complex queries with infrequent updates. Transactions access a large fraction of the database.
- Traditionally OLAP queries data collected by an OLTP system, but newer applications such as Internet
companies prefer gathering whatever data they need, potentially even purchasing it. Data are queried in more
sophisticated and more specific ways.
- Data mining attempts to find patterns and extract useful information from a database without setting a strict
guideline in the query.
Data Warehouse
- Data (often derived from OLTP) for OLAP and data mining applications is usually stored in a special database
called a data warehouse.
- A data warehouse contains large amounts of read-only data that has been gathered at different times, spanning
long periods, provided by different vendors and with different schemas.
- Populating such warehouses is non-trivial (data integration etc.)
- Issues in data warehousing include semantic integration (eliminate mismatches between different sources, e.g.
different attribute names or domains), heterogeneous sources (access data from a variety of source formats),
load-refresh-purge (load data, refresh periodically, and purge old data) and metadata management (keep
track of source, loading time, and other information for data in the warehouse)
- Must include a metadata repository, which holds information about the physical and logical organization of the data.
Populating a data warehouse: the ETL Process (Capture/Extract, Transform, and Load)
- Typical operational data is transient, not comprehensive, and potentially contains inconsistencies and errors.
After ETL, data should be detailed, periodic, and comprehensive.
New techniques for database design, indexing, and analytical querying need to be supported.
- Star Schema
- CUBE, ROLLUP and GROUPING SET
- Window and Ranking Queries
- ROLAP/MOLAP

Week 12: Introduction to Data Exchange with XML


XML has 4 core specifications: XML Documents, Document Type Definitions (DTDs), Namespaces, XML Schema
SQL can be ignorant of how data is stored, but a schema is still required! But, how do we transport semi-structured data? XML!
Semistructured Data: Self-describing, irregular data, no a priori structure
Origins: Integration of heterogeneous sources; Data sources with non-rigid structure (Biological or Web)
Characteristics: Missing or additional attributes; Multiple attributes; different types in different objects, and heterogeneous
collections
XML describes content whereas HTML describes presentation. Specifics for XML: syntactic structure, elements & attributes,
character set; and has a logical / physical structure - DTD with entities.
Database Issues: Model XML using graphs, Store XML, Query XML using XQuery, Processing XML.
Paradigm Shifts: Web - from HTML to XML; from information retrieval to data management
Databases - from the relational model to semistructured data, from data processing to data/query
translation, from storage to transport
XML vs. JSON - JSON: JavaScript Object Notation - a text-based, semi-structured format for data interchange; originates from
object serialization a la JavaScript. A low-overhead format as opposed to XML.
XML:
<person name="John Smith">
  <address street="1 Cleveland Street" city="Sydney"
           state="NSW" zipcode="2006" />
</person>

JSON:
{ "name": "John Smith",
  "address": {
    "street": "1 Cleveland Street",
    "city": "Sydney",
    "state": "NSW",
    "zipcode": 2006 } }
DTD (Document Type Definition)
<!ELEMENT book (title)>
- Grammar
- Elements + attributes
- Only "part of" relationships
- Specified as part of the prologue of an XML document

XML Schema
<xsd:simpleType name="Score">
  <xsd:restriction base="xsd:integer">
    <xsd:minInclusive value="0"/>
    <xsd:maxInclusive value="100"/>
  </xsd:restriction>
</xsd:simpleType>
- Structure and typing
- Elements, attributes, simple and complex types, groups
- Supports "includes" relationships - inheritance
- Specified as an attribute of the document element
Modern databases support SQL/XML
Provides an XML datatype to store XML in the database - stored in native tree form.
Integrates XML support functions for querying and inserting XML data:
XMLPARSE() parses XML fragments or documents so that they can be stored in SQL.
XPATH(xpath, xml): selects the XML content specified by the xpath expression from the xml data.
XMLEXTRACT and XMLEXISTS: tell whether the set of nodes returned by an XPath expression is empty (not supported
by PostgreSQL; to be added in upcoming version 9.3)
XMLELEMENT() produces a single nested XML element
XMLATTRIBUTES() only as optional part of an XMLELEMENT call, adds attribute(s) to a new XML element.
XMLCONCAT() concatenates individual XML values
XMLAGG() an aggregate function that concatenates several input xml rows to a single XML output value
XMLCOMMENT creates an XML comment element containing text
An SQL query does not return XML directly. Produces tables that can have columns of type XML.
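SQLite (used here for a runnable sketch) has no SQL/XML functions, so the example below emulates in Python the shape of result that `XMLAGG(XMLELEMENT(...))` with `XMLATTRIBUTES` would produce - one element per row, aggregated into a single XML value. The table and element names are illustrative:

```python
# Build one <student> element per row and aggregate them, mimicking
# XMLAGG(XMLELEMENT(NAME student, XMLATTRIBUTES(sid AS id), name)).
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
""")

root = ET.Element("students")  # wrapper holding the aggregated value
for sid, name in conn.execute("SELECT sid, name FROM student ORDER BY sid"):
    el = ET.SubElement(root, "student", id=str(sid))  # element + attribute
    el.text = name                                    # element content
xml_value = ET.tostring(root, encoding="unicode")     # single XML output value
print(xml_value)
# → <students><student id="1">Ann</student><student id="2">Bob</student></students>
```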
Jennas Super Summary
1. Basic database stuff
- Types of relations
- Relational schema (e.g. ER diagram)
- Relational schema instance (e.g. DB)
- Relation (e.g. foreign key field)
- Relational instance (e.g. foreign key value, e.g. 3)
- Types of fields
- INTEGER
- VARCHAR
- CHAR
- TEXT
- ENUM: CREATE TYPE x AS ENUM
- Types of keys
1. Primary
2. Candidate
3. Super
4. Foreign
- personid INTEGER REFERENCES person (id) ON DELETE NO ACTION
- ON DELETE: CASCADE, NO ACTION (default, post-triggers), RESTRICT
(pre-triggers), SET NULL, SET DEFAULT
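A minimal sketch of ON DELETE CASCADE, using SQLite (which only enforces foreign keys after `PRAGMA foreign_keys = ON`); deleting the parent row removes the referencing row automatically:

```python
# Demonstrate that CASCADE deletes child rows when the parent is deleted.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone (
        num TEXT PRIMARY KEY,
        personid INTEGER REFERENCES person (id) ON DELETE CASCADE
    );
    INSERT INTO person VALUES (1, 'Ann');
    INSERT INTO phone VALUES ('0400 000 000', 1);
""")
conn.execute("DELETE FROM person WHERE id = 1")  # cascades to phone
remaining = conn.execute("SELECT COUNT(*) FROM phone").fetchone()[0]
print(remaining)   # → 0
```

With NO ACTION or RESTRICT instead, the same DELETE would raise an integrity error while the phone row still references the person.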
- Types of constraints
- Integrity constraints (all constraints, enforces data integrity)
- Static constraints:
- Domain constraints (fields must be of correct data domain) (constraint
on ONE attribute)
1. Null/not null
2. ENUM checks
3. Unique and Unique checks
- Key constraints
1. Keys (including foreign keys)
- Semantic integrity constraints (constraints on MULTIPLE attributes)
1. Checks
- anonymous: status VARCHAR CHECK (status = 'A' OR status = 'B')
- named: CONSTRAINT chk_status CHECK (status = 'A' OR status =
'B')
2. Assertions
- CREATE ASSERTION x CHECK ( NOT EXISTS ( SELECT ... ) )
3. Functional dependencies
- Dynamic constraints
1. Triggers
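A minimal sketch of a named CHECK constraint being enforced, using SQLite (the table name is illustrative); the violating INSERT is rejected with an integrity error:

```python
# A named CHECK constraint rejects rows outside the allowed domain.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        status VARCHAR,
        CONSTRAINT chk_status CHECK (status = 'A' OR status = 'B')
    )
""")
conn.execute("INSERT INTO account VALUES (1, 'A')")      # passes the check
try:
    conn.execute("INSERT INTO account VALUES (2, 'X')")  # violates chk_status
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)   # → True
```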
2. ER diagrams
- syntax
- entity: rectangle
- attribute: ellipse
- keys are underlined
- double-ellipses: multi-valued attributes
- ellipses to ellipses: related attributes
- relationships: diamonds
- arrow: at most one
- thick line: at least one
- thick arrow: exactly one
- relationship applies to FURTHEST entity
- Employee works in AT MOST ONE department
Employee --> WorksIn --- Department
- Employee works in AT LEAST ONE department
Employee === WorksIn --- Department
- Employee works in EXACTLY ONE department
Employee ==> WorksIn --- Department
- Employee works in 1 TO 3 departments
Employee ==1..3== WorksIn --0..*-- Department
- Weak entity types: double rectangles
- Weak entity (identifying) relationship: double diamonds
- Discriminator (aka partial key): discriminates among all entities related to
one of the other entity
- Superclass/subclass
- Triangle - superclass at tip of triangle
- overlapping: default (can belong to 1 or more)
- disjoint: write disjoint (can belong to only 1)
- total: default (an entity must belong to one)
- partial: thick line (an entity doesn't have to belong to any)
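The cardinalities above translate into DDL when the diagram is mapped to tables. A minimal sketch (table and column names are illustrative) of "Employee works in EXACTLY ONE department": "at most one" comes from folding WorksIn into Employee as a single foreign-key column, "at least one" from declaring that column NOT NULL:

```python
# Map Employee ==> WorksIn --- Department (exactly one) to tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE department (did INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        eid  INTEGER PRIMARY KEY,
        name TEXT,
        did  INTEGER NOT NULL REFERENCES department (did)  -- exactly one
    );
    INSERT INTO department VALUES (1, 'Sales');
    INSERT INTO employee VALUES (10, 'Ann', 1);
""")
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Bob', NULL)")  # no dept
    accepted = True
except sqlite3.IntegrityError:
    accepted = False
print(accepted)   # → False: NOT NULL rejects an employee without a department
```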
3. SQL
- CREATE TABLE
- SELECT FROM WHERE GROUP BY HAVING ORDER BY
- SELECT stuff
- SELECT ALL (keep dups - default) or SELECT DISTINCT (remove dups)
- SELECT * (all columns)
- SELECT 3 * 4 (arithmetic operations)
- SELECT x AS y (rename operator)
- WHERE stuff
- = , > , >= , < , <= , != , <>
- AND, OR, NOT
- BETWEEN 75 AND 100
- LIKE 'POST%' (and lots of other string/regex operations)
- CURRENT_DATE and CURRENT_TIME
- EXTRACT(year FROM enrolmentDate)
- DATE '2012-03-01'
- DATE '2012-04-01' + INTERVAL '36' HOUR
- JOIN stuff
- join: combine fields
- Equi-join: when fields are equal
- Natural join: duplicate column names
- Outer join: non-matches included as NULL
- left outer join: joined table can have null attributes
- right outer join: non-joined table can have null attributes
- full outer join: both tables can have null attributes
- Union join: all columns included, all rows included (cartesian join)
- e.g.
- R natural join S
- R inner join S on <join condition>
- R inner join S using (<list of attributes>)
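A runnable sketch of inner vs. left outer join, using SQLite (table contents are illustrative): the 'HR' department has no employees, so it appears only in the outer join, padded with NULL on the employee side:

```python
# Inner join keeps only matches; left outer join also keeps unmatched
# rows from the left table, filling the right side with NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (did INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp  (eid INTEGER PRIMARY KEY, ename TEXT, did INTEGER);
    INSERT INTO dept VALUES (1, 'Sales'), (2, 'HR');
    INSERT INTO emp  VALUES (10, 'Ann', 1);
""")
inner = conn.execute("""
    SELECT dname, ename FROM dept JOIN emp ON dept.did = emp.did
""").fetchall()
outer = conn.execute("""
    SELECT dname, ename FROM dept LEFT OUTER JOIN emp ON dept.did = emp.did
    ORDER BY dept.did
""").fetchall()
print(inner)   # → [('Sales', 'Ann')]
print(outer)   # → [('Sales', 'Ann'), ('HR', None)]
```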
- Aggregate functions
- Avg, min, max, sum, count
- select count(*)
- select count(distinct sid) from Enrolled
- select avg(mark)
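A sketch of the aggregate functions above over an Enrolled table (contents illustrative); note that count(DISTINCT sid) counts each student once even with multiple enrolments:

```python
# count(*), count(DISTINCT ...) and avg() over a small Enrolled table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Enrolled (sid INTEGER, uos TEXT, mark INTEGER);
    INSERT INTO Enrolled VALUES
        (1, 'INFO2120', 80), (1, 'COMP5138', 60), (2, 'INFO2120', 70);
""")
total, students, average = conn.execute(
    "SELECT count(*), count(DISTINCT sid), avg(mark) FROM Enrolled"
).fetchone()
print(total, students, average)   # → 3 2 70.0
```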
- set operations
- UNION (add rows)
- INTERSECT (duplicate rows only)
- EXCEPT (minus duplicate rows)
- Subqueries
- Correlated vs uncorrelated
- IS NULL, not = null
- 5 + null returns null
- most aggr. functions ignore nulls
- three-valued logic - OR, AND, NOT
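The NULL rules above can be checked directly in SQLite: arithmetic with NULL yields NULL, aggregates skip NULLs, and `x = NULL` is never true (you must write IS NULL):

```python
# NULL propagation, aggregate behaviour, and IS NULL vs = NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (10), (NULL);
""")
assert conn.execute("SELECT 5 + NULL").fetchone()[0] is None     # NULL propagates
assert conn.execute("SELECT avg(x) FROM t").fetchone()[0] == 10  # NULL ignored
# '= NULL' evaluates to unknown, so no row qualifies:
assert conn.execute("SELECT count(*) FROM t WHERE x = NULL").fetchone()[0] == 0
assert conn.execute("SELECT count(*) FROM t WHERE x IS NULL").fetchone()[0] == 1
```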
4. Relational Algebra
1. set operations
- union - OR
- intersection - AND
- difference - MINUS
2. remove parts
- selection (sigma) - WHERE clause (select rows)
- projection (pi) - SELECT clause (select only specified cols)
3. combine parts
- cross-product (X) - fully combine relations (col x row = for each thing in
the col, that plus the row)
- natural join: join on all equal fields
- conditional join: join on specified fields
- join (triangular-infinity thing) - combine matching tuples (col x row = same
as col, but with extra row-part for matches)
4. rename parts
- rename (rho, looks like a rounded p) - rename one field to another
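The operators above can be sketched in Python by modelling a relation as a list of dicts (tuples keyed by attribute name); this is an illustration of the algebra's semantics, not how a database implements it:

```python
# Selection, projection and natural join over relations as lists of dicts.
def select(rel, pred):                 # sigma: keep rows matching a predicate
    return [t for t in rel if pred(t)]

def project(rel, attrs):               # pi: keep only listed columns
    out = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in out:             # projection removes duplicates (sets!)
            out.append(row)
    return out

def natural_join(r, s):                # join on all shared attribute names
    shared = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in shared)]

R = [{"sid": 1, "name": "Ann"}, {"sid": 2, "name": "Bob"}]
S = [{"sid": 1, "uos": "INFO2120"}]

print(select(R, lambda t: t["sid"] == 1))  # → [{'sid': 1, 'name': 'Ann'}]
print(project(R, ["name"]))                # → [{'name': 'Ann'}, {'name': 'Bob'}]
print(natural_join(R, S))  # → [{'sid': 1, 'name': 'Ann', 'uos': 'INFO2120'}]
```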
5. Functional dependencies
- Why is a table called a relation? Relation from primary key to every column
- Data redundancy causes anomalies
1. insertion: duplicate data or null values
2. delete: loss of data needed for future rows
3. update: changes in one row cause changes to all rows (biggest problem)
- A --> B if 'A functionally determines B', or 'B is functionally dependent on A'
- a primary key functionally determines the whole row
- a candidate key determines every column
- a superkey is a set of columns that contains the candidate key
- Attribute closure X^+ of some attributes X is 'all attributes that are
determined by X' (functionally dependent on X), including X itself
  1. Initialise result with the given set of attributes: X = {A1, ..., An}
  2. Repeatedly search for some FD A1 A2 ... Am -> C such that all A1, ..., Am
  are already in the set result, but C is not. Add C to the set result.
  3. Repeat step 2 until no more attributes can be added to result
  4. The set result is the correct value of X^+ (the closure of attributes)
- To find all candidate keys, look at each set of attributes K and calculate the
attribute closure K^+
  - If K^+ contains all columns, K is a superkey
  - Check each subset of K to see if it is also a superkey
  - Find the candidate keys (the smallest subsets that are still superkeys)
  - Pick one candidate key to be the primary key

6. Normalisation ('decomposing' into normal forms)
- 1NF: all attributes are atomic (no multivalued or composite attributes)
- 2NF: no partial dependencies (not important)
- 3NF: no transitive dependencies (not important)
- BCNF: no remaining anomalies from functional dependencies (good!)
  - The only non-trivial FDs are key constraints
  - a trivial FD is X --> Y where Y is a subset of X (you determine yourself)
  - formally: for every FD A --> B, either the FD is trivial or A is a superkey
  (primary key, candidate key, or more)
- 4NF: no multivalued dependencies (not important)
- 5NF: no remaining anomalies (not important)
- Decomposition properties
  - Lossless-join decomposition
    - When you join the decomposed relations, you get the original relation
    - Not lossless-join doesn't usually mean whole rows are lost; it could
    mean that meaningless rows are added
    - If R(A, B, C) has A -> B, then the decomposition R1(A, B) and R2(A, C) is
    always lossless-join
  - Dependency-preserving decomposition
    - Every dependency from the original is still in the decomposed relations
    - Often, we say every original dependency is in exactly ONE of the
    decomposed relations

7. Serialisability
- ACID
  - Atomicity (all or nothing)
  - Consistency (db always in valid state: triggers, cascading deletes, CHECKs,
  etc)
  - Isolation (transactions do not interfere)
  - Durability (committing MEANS committed; once a commit returns, any
  crash can recover to that commit)
- a Transaction is a list of SQL statements that are ACID, one logical 'unit of
work'
  - they happen in order, together; if one fails they all fail
- 'Auto-commit' means every SQL statement is an entire transaction
- Serialisability means interleaved execution is equivalent to some batch
(serial) execution: given 2 transactions, the final state is the same as running
them one after the other in some order
- dirty read (reading uncommitted data, WR conflict)
  T1: R(A),W(A),                 R(B),W(B),Abort
  T2:            R(A),W(A),Commit
- unrepeatable read (two reads in a transaction give different results, RW
conflict)
  T1: R(A),                 R(A),W(A),Commit
  T2:       R(A),W(A),Commit
- lost update (overwriting uncommitted data, WW conflict)
  T1: W(A),                 W(B),Commit
  T2:       W(A),W(B),Commit
- 2-phase-locking ensures serialisable executions, but can mean some
operations are blocked
  - Before reading, take a shared lock
  - Before writing, take an exclusive lock
  - Hold locks until the transaction commits/aborts

8. Indexing
- records are stored in pages; each page contains a maximum number of
records
- an index is a type of page
- Types of indexes
  - sorted (uncommon, tree is better)
  - tree (like sorted but better; good for range, equality and prefix searches)
    - multileveled, e.g. 2 levels mean records are at most 2 indexes away
  - hash (good for equality and that's it)
  - special (e.g. bitmap indexes, r-trees for spatial data)
- With an index, selecting takes less time, inserting takes more time
- A covering index (for a query) means all fields in the query are indexed, so
the records are not accessed at all
- An "access path" is the journey you take to reach the data (e.g. query -->
table scan --> record)
- A "search key" is a sequence of attributes that are indexed; includes the
primary key
- Properties of indexes
  1. Main [or primary] (indexes contain the whole row) vs secondary (indexes
  contain a pointer)
  2. Unique (index over a candidate key) vs nonunique
  3. Clustered (data records are ordered the same way as indexes) vs
  unclustered
    - There can be at most one clustered index on a table
    - Clustered is good for "range searches" (key is between two limits)
  4. Single- vs multi-attribute
- CREATE TABLE usually creates a unique, clustered, main index on the
primary key
- CREATE INDEX usually creates a secondary, unclustered index
  - CREATE INDEX name ON table (field)
- Space and time problems
  - how much space per row? add up the space per field (e.g. 20 byte record)
  - how many records per block? divide the space of a block by the record
  size (round down!) (e.g. 4K block)
  - how many blocks? divide the total # of records by the # of records per
  block (e.g. 50 blocks)
  - how long does the query take? multiply the number of blocks by the time
  an access takes (reading a disk block into memory)
  - assumptions:
    - if a field has 3 possible values, there are an equal number of records
    with each value
    - 10% of the records with A = a also have B = b

9. OLAP
- OLAP stands for "online analytical processing"
- Data warehousing
  - db needs to be optimised for SELECT queries - UPDATE, DELETE etc can be
  slow
  - LOTS of tricks used: indexes, redundant fields, etc
  - maximise (to a point) redundancy
- Star schema
  - 1 central fact table, with FKs from the fact table to n dimension tables
  - for each dimension, we have a hierarchy
  - getting totals and subtotals for the hierarchies:
    - CUBE(x, y, z) does GROUP BY for every combination of the attributes,
    including GROUP BY (nothing)
    - ROLLUP(x, y, z) does GROUP BY (x, y, z), GROUP BY (x, y), GROUP BY (x),
    GROUP BY (nothing)
- WINDOW queries
  SELECT AGG() OVER name FROM ...
  WINDOW name AS (
    [ PARTITION BY attributelist ]  (attributes to partition by)
    [ ORDER BY attributelist ]  (attributes to order by)
    [ (RANGE|ROWS) BETWEEN v1 PRECEDING AND v2 FOLLOWING ]  (rows
    to look at)
  )

10. XML
- Not summarised
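The attribute-closure algorithm summarised in section 5 can be sketched directly in Python; FDs are modelled as pairs of attribute sets, and the loop runs until a fixpoint:

```python
# Compute the closure X+ of a set of attributes under functional dependencies.
def closure(attrs, fds):
    """attrs: set of attribute names; fds: list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)          # step 1: initialise result with X
    changed = True
    while changed:               # steps 2-3: repeat until nothing can be added
        changed = False
        for lhs, rhs in fds:
            # lhs already determined but rhs not yet fully included?
            if lhs <= result and not rhs <= result:
                result |= rhs    # add the newly determined attributes
                changed = True
    return result                # step 4: result is X+

# Example: R(A, B, C, D) with A -> B and B -> C.
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(sorted(closure({"A"}, fds)))       # → ['A', 'B', 'C']  (not a key: no D)
print(sorted(closure({"A", "D"}, fds)))  # → ['A', 'B', 'C', 'D']  (a superkey)
```

Since {A, D}+ contains all columns and no proper subset does, {A, D} is a candidate key of this example relation.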