Written by Luka Kerr on November 23, 2019
Introduction
Data consists of known, recorded facts with implicit meaning. A database is a collection of related data satisfying constraints. A DBMS is a database management system.
Data Modelling
Data modelling is a design process that converts requirements into a data model.
The aims of data modelling are
- To describe what information is contained in the database
- To describe relationships between data items
- To describe constraints on data
There are two kinds of data models:
- Logical: abstract, for conceptual design (e.g. ER, ODL)
- Physical: record-based, for implementation (e.g. relational)
The most important aspects of a data model are correctness, completeness and consistency.
ER Model
When modelling using the ER model we think of the world as a collection of inter-related entities.
The ER model has three main constructs
- Attribute: data item describing a property of interest
- Entity: collection of attributes describing object of interest
- Relationship: association between entities
ER Diagrams
ER diagrams are a graphical tool for data modelling.
An ER diagram consists of
- A collection of entity set definitions
- A collection of relationship set definitions
- Attributes associated with entity and relationship sets
- Connections between entity and relationship sets
Entities
An entity can be viewed as either a set of entities with the same set of attributes (extensional) or an abstract description of a class of entities (intensional).
Entities contain
- Keys: a set of attributes whose values are distinct over the entity set
- Candidate Key: a minimal key
- Primary Key: candidate key chosen by database designer
Relationship Sets
A relationship is an association among several entities. A relationship set is a collection of relationships of the same type.
Relationships contain
- Degree: the number of entities involved in the relationship
- Cardinality: the number of associated entities on each side of the relationship
- one-to-one: [entity]<---(rel)--->[entity]
- one-to-many: [entity]<---(rel)----[entity]
- many-to-many: [entity]----(rel)----[entity]
- Participation: whether every entity must be in the relationship
- partial (thin line)
- total (thick line)
In some cases, a relationship needs associated attributes.
Weak Entities
Weak entities exist only because of association with strong entities. They have no key of their own and have a discriminator.
Subclasses & Inheritance
A subclass of an entity set $A$ is a set of entities that has all attributes of $A$, plus (usually) its own attributes, and that is involved in all of $A$’s relationships, plus its own.
Subclasses can be
- overlapping or disjoint (can an entity be in multiple subclasses?)
- total or partial (does every entity have to also be in a subclass?)
Relational Model
The relational data model describes the world as a collection of inter-connected relations (or tables).
A relational data model consists of
- Relations
- Attributes
- Constraints
The value NULL belongs to all domains.
Relations
Each relation has
- A name which is unique within a given database
- A set of attributes
- A unique key
Attributes
Each attribute has
- A name which is unique within a given database
- An associated domain or set of allowed values
Constraints
Constraints are logical statements that describe what values are/are not allowed and what combinations of values are/are not allowed.
Integrity Constraints
Integrity constraints can
- Limit the set of values that attributes can take (domain constraints)
- Identify attributes that uniquely identify tuples (key constraints)
- Require keys to be fully-defined (entity integrity constraints)
- Require references to other tables to be valid (referential integrity constraints)
Referential Integrity
Referential integrity constraints
- Describe references between relations
- Are related to the notion of a foreign key (see the sketch below)
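As a sketch, the constraint kinds above can all appear in table definitions (the Departments and Employees tables here are hypothetical):
-- Hypothetical tables illustrating the kinds of integrity constraints
CREATE TABLE Departments (
   id    integer PRIMARY KEY,                   -- key / entity integrity
   name  varchar(50) NOT NULL
);
CREATE TABLE Employees (
   id       integer PRIMARY KEY,
   salary   integer CHECK (salary >= 0),        -- domain constraint
   works_in integer REFERENCES Departments(id)  -- referential integrity (foreign key)
);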
Relational DBMSs
A relational database management system (RDBMS)
- Is software designed to support large-scale data-intensive applications
- Allows high-level description of data with high-level access to the data
- Provides efficient storage and retrieval
- Supports multiple simultaneous users
- Does multiple simultaneous operations
- Maintains reliable access to the stored data
PostgreSQL
PostgreSQL is a fully featured, client-server DBMS. It is resource intensive and can run distributed and replicated. It follows the SQL standard closely, but not completely. It also has extra datatypes (e.g. JSON).
SQLite
SQLite is a fully featured, serverless DBMS. It is lightweight and intended to be embedded in applications. It doesn’t have stored procedures, but does have extra datatypes.
Managing Databases
# Create a new, empty database
$ createdb dbname
# Drop and remove all data from a database
$ dropdb dbname
# Dump the contents of a database
$ pg_dump dbname > dumpfile
# Restore the contents of a database dump
$ psql dbname -f dumpfile
SQL
SQL provides a Data Definition Language (DDL) that can formalise relational schemas.
CREATE TABLE TableName (
attrName1 domain1 constraints1,
attrName2 domain2 constraints2,
...
PRIMARY KEY (attri, attrj, ...),
FOREIGN KEY (attrx, attry, ...)
    REFERENCES OtherTable (attrm, attrn, ...)
);
Syntax
-- Comments are after two dashes
-- Identifiers are alphanumeric and can be inside double quotes
-- Identifiers are case insensitive
"An Identifier", AnIdentifier
-- There are many keywords
CREATE, SELECT, TABLE, WHERE
-- Strings are inside single quotes
'a string'
-- Numbers are similar to C
1, -5, 3.14159
-- There are multiple types
integer, float, char(n), varchar(n), date
-- There are multiple operators
=, <>, <, <=, >, >=, AND, OR, NOT
Managing Tables
-- Create a table with attributes and constraints
CREATE TABLE table (Attributes + Constraints)
-- Modify a table
ALTER TABLE table TableSchemaChanges
-- Drop a table
DROP TABLE table(s) [ CASCADE ]
-- Truncate a table
TRUNCATE TABLE table(s) [ CASCADE ]
Managing Tuples
-- Insert into a table
INSERT INTO table (Attrs) VALUES Tuple(s)
-- Delete from a table
DELETE FROM table WHERE condition
-- Update a table
UPDATE table SET AttrValueChanges WHERE condition
where
- Attrs = (attr1, attr2, ..., attrn)
- Tuple = (val1, val2, ..., valn)
- AttrValueChanges is a comma-separated list of attrname = expression
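For example (using a hypothetical Students(zid, name, wam) table):
INSERT INTO Students (zid, name, wam) VALUES (5012345, 'John Smith', 72.5);
UPDATE Students SET wam = 75.0 WHERE zid = 5012345;
DELETE FROM Students WHERE zid = 5012345;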
Types
-- Define own constrained type
CREATE DOMAIN Name AS Type CHECK (Constraint)
-- Define tuple type
CREATE TYPE Name AS (AttrName AttrType, ...)
-- Define enumerated type
CREATE TYPE Name AS ENUM ('Label', ...)
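A small sketch of each form (all names are hypothetical):
-- constrained domain, tuple type, enumerated type
CREATE DOMAIN Mark AS integer CHECK (VALUE BETWEEN 0 AND 100);
CREATE TYPE Point2D AS (x float, y float);
CREATE TYPE Grade AS ENUM ('FL', 'PS', 'CR', 'DN', 'HD');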
String Comparison
- a < b compares strings a and b using dictionary order
- a LIKE pattern matches string a against pattern, where % matches any string and _ matches a single character
PostgreSQL provides regexp-based pattern matching.
-- ~ and !
Attr ~ 'RegExp' -- Matches 'RegExp'
Attr !~ 'RegExp' -- Doesn't match 'RegExp'
-- ~* and !~*
Attr ~* 'RegExp' -- Matches 'RegExp' case-insensitive
Attr !~* 'RegExp' -- Doesn't match 'RegExp' case-insensitive
String Manipulation
-- Concatenate a and b
a || b
-- Lowercase a
lower(a)
-- Extract substring from a
substring(a, start, count)
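For example, against a hypothetical Students(name, email) table:
-- names starting with 'J'
SELECT name FROM Students WHERE name LIKE 'J%';
-- case-insensitive regexp match
SELECT name FROM Students WHERE name ~* '^john';
-- concatenation and lowercasing
SELECT name || ' <' || lower(email) || '>' FROM Students;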
Queries
An SQL query consists of a sequence of clauses:
SELECT projectionList
FROM relations/joins
WHERE condition
GROUP BY groupingAttributes
HAVING groupCondition
The FROM, WHERE, GROUP BY and HAVING clauses are optional.
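For example, a sketch using most of the clauses (the Courses(id, ...) and CourseEnrolments(course, student, mark) tables and their columns are hypothetical):
SELECT c.id, count(e.student) AS nstudents
FROM   Courses c
JOIN   CourseEnrolments e ON e.course = c.id
WHERE  e.mark IS NOT NULL
GROUP  BY c.id
HAVING count(e.student) > 100;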
Views
A view associates a name with a query. Each time the view is invoked (in a FROM clause) the query is evaluated, yielding a set of tuples, which is used as the value of the view.
A view can be treated as a ‘virtual table’ and is useful for packaging a complex query to use in other queries.
CREATE VIEW viewName [(attributes)] AS Query
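For example, a complex query can be packaged as a view and then reused (a sketch using the same hypothetical Courses and CourseEnrolments tables):
CREATE VIEW LargeCourses AS
SELECT c.id, count(e.student) AS nstudents
FROM   Courses c
JOIN   CourseEnrolments e ON e.course = c.id
GROUP  BY c.id
HAVING count(e.student) > 100;
-- the view can now be used like a table
SELECT * FROM LargeCourses WHERE nstudents > 500;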
Programming With SQL
PostgreSQL Stored Procedures
The PostgreSQL syntax for defining stored functions:
CREATE OR REPLACE FUNCTION
funcName(arg1, arg2, ...) RETURNS retType
AS $$
String containing function definition
$$ LANGUAGE funcDefLanguage;
where
- arg1, arg2, ... consists of name type pairs
- $$ ... $$ is just a type of string quote
- LANGUAGE specifies a function definition language (e.g. Python, SQL, PLpgSQL)
PLpgSQL
The PLpgSQL interpreter
- Executes procedural code and manages variables
- Calls the PostgreSQL engine to evaluate SQL statements
Function Return Types
A PostgreSQL function can return a value which is
- void
- An atomic data type (integer, text, …)
- A tuple
- A set of atomic values (setof integer, …)
- A set of tuples (setof Employee, where Employee is a tuple)
A function returning a set of values is similar to a view.
SQL Functions
PostgreSQL allows functions to be defined in SQL:
CREATE OR REPLACE FUNCTION
funcName(arg1type, arg2type, ...)
RETURNS retType
AS $$
SQL statements
$$ LANGUAGE sql;
Within the function, arguments are accessed as $1, $2, ... corresponding to their position in the function definition.
A parameterless function behaves similarly to a view.
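A minimal sketch (the CourseEnrolments table is hypothetical):
CREATE OR REPLACE FUNCTION
   courseSize(integer) RETURNS bigint
AS $$
SELECT count(*) FROM CourseEnrolments WHERE course = $1;
$$ LANGUAGE sql;
-- used like any other function
SELECT courseSize(1234);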
PLpgSQL
PLpgSQL stands for Procedural Language extensions to PostgreSQL. It is a PostgreSQL-specific language integrating features of procedural programming and SQL programming.
It provides a means for extending DBMS functionality, specifically it
- Implements constraint checking (triggered functions)
- Has complex query evaluation (e.g. recursive)
- Has complex computation of column values
- Has detailed control of displayed results
PLpgSQL functions are created (and inserted into db) via:
CREATE OR REPLACE FUNCTION
funcName(param1, param2, ...)
RETURNS retType
AS $$
DECLARE
variable declarations
BEGIN
code for function
END;
$$ LANGUAGE plpgsql;
The entire function body is stored as a single SQL string.
PLpgSQL
- Doesn’t provide any I/O facilities
- Does not syntax-check functions when loaded into the DB
- Has unhelpful error messages
- Stores functions as strings
Data Types
PLpgSQL constants and variables can be defined using
- Standard SQL data types (CHAR, DATE, NUMBER, …)
- User defined PostgreSQL data types (e.g. Point)
- A special structured record type (RECORD)
- Table-row types (e.g. Branches%ROWTYPE)
- Types of existing variables (e.g. Branches.location%TYPE)
There is also a CURSOR type for interacting with SQL.
Variables can also be defined in terms of
- The type of an existing variable or table column
- The type of an existing table row (implicit RECORD type)
Syntax
-- Assignment
var := exp
SELECT exp INTO var
-- Selection
IF C1 THEN S1
ELSIF C2 THEN S2 ...
ELSE S END IF
-- Iteration
LOOP S END LOOP
WHILE C LOOP S END LOOP
FOR rec_var IN Query LOOP ...
FOR int_var IN lo..hi LOOP ...
SELECT ... INTO
Can capture query results via
SELECT Exp1, Exp2, ..., ExpN
INTO Var1, Var2, ..., VarN
FROM TableList
WHERE Condition ...
where the query is executed as usual, the projection list is evaluated as usual, and the value of each ExpI is assigned to the corresponding VarI.
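A small sketch combining SELECT ... INTO with selection (the CourseEnrolments table is hypothetical):
CREATE OR REPLACE FUNCTION
   courseStatus(courseId integer) RETURNS text
AS $$
DECLARE
   nStudents integer;
BEGIN
   -- capture the query result into a variable
   SELECT count(*) INTO nStudents
   FROM   CourseEnrolments
   WHERE  course = courseId;
   IF nStudents >= 10000 THEN
      RETURN 'full';
   ELSE
      RETURN 'open';
   END IF;
END;
$$ LANGUAGE plpgsql;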
Aggregates
Aggregates reduce a collection of values into a single result.
The action of an aggregate function can be viewed as
State = initial state
for each item V {
// update State to include V
State = updateState(State, V)
}
return makeFinal(State)
PostgreSQL User Defined Aggregates
The SQL standard does not specify user-defined aggregates, but PostgreSQL does provide a mechanism for defining them.
To define a new aggregate, we need to supply
- A BaseType, the type of input values
- A StateType, the type of intermediate states
- A state mapping function sfunc(state, value) -> newState
- Optionally, an initial state value (defaults to NULL)
- Optionally, a final function ffunc(state) -> result
Aggregates are created using the CREATE AGGREGATE statement:
CREATE AGGREGATE AggName(BaseType) (
sfunc = UpdateStateFunction,
stype = StateType,
initcond = InitialValue,
finalfunc = MakeFinalFunction,
sortop = OrderingOperator
);
where
- initcond (with type StateType) is optional; defaults to NULL
- finalfunc is optional; defaults to the identity function
- sortop is optional; needed for min/max-type aggregates
User Defined count()
CREATE FUNCTION oneMore(sum int, x anyelement) RETURNS int
AS $$
BEGIN
   RETURN sum + 1;
END;
$$ LANGUAGE plpgsql;
CREATE AGGREGATE myCount(anyelement) (
   stype = int,
   initcond = 0,
   sfunc = oneMore
);
Constraints
Column and table constraints ensure validity of one table. Referential integrity constraints ensure connections between tables are valid. However, specifying validity of an entire database often requires constraints involving multiple tables.
Assertions
Assertions are schema-level constraints. They
- Typically involve multiple tables
- Express a condition that must hold at all times
- Need to be checked on each change to relevant tables
- Reject changes if changes would cause check to fail
To create an assertion, we use the syntax CREATE ASSERTION name CHECK (condition), where the condition is expressed as “there are no violations in the database”.
For example, to ensure that the number of students in any university course must be < 10000, an assertion can be defined as follows
CREATE ASSERTION ClassSizeConstraint CHECK (
NOT EXISTS (
SELECT c.id FROM Courses c, CourseEnrolments e
WHERE c.id = e.course
GROUP BY c.id
HAVING count(e.student) > 9999
)
)
On each update, it is expensive to determine which assertions need to be checked and to run the queries which check the assertions. Most RDBMSs do not implement general assertions, but rather provide triggers.
Triggers
Triggers are procedures stored in the database that are activated in response to database events.
Triggers provide event-condition-action (ECA) programming where
- An event activates the trigger
- A condition is checked
- If the condition holds, an action is executed
To define a trigger, we use the syntax
CREATE TRIGGER TriggerName
{AFTER|BEFORE} Event1 [ OR Event2 ... ]
[ FOR EACH ROW ]
ON TableName
[ WHEN ( Condition ) ]
Block of Procedural/SQL Code;
Possible Events are INSERT, DELETE and UPDATE.
Semantics
Triggers can be activated BEFORE or AFTER the event.
If activated BEFORE
- A value NEW contains the “proposed” value of the changed tuple
- Modifying NEW causes a different value to be placed in the database
If activated AFTER
- NEW contains the current value of the changed tuple
- A value OLD contains the previous value of the changed tuple
- Constraint checking has been done for NEW
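A sketch using PostgreSQL’s concrete trigger syntax, where the action is a separately defined trigger function (the Employees table and the non-negative-salary rule are hypothetical):
CREATE OR REPLACE FUNCTION checkSalary() RETURNS trigger
AS $$
BEGIN
   IF NEW.salary < 0 THEN
      RAISE EXCEPTION 'salary must be non-negative';
   END IF;
   RETURN NEW;  -- value actually placed in the database
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER SalaryCheck
BEFORE INSERT OR UPDATE ON Employees
FOR EACH ROW EXECUTE FUNCTION checkSalary();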
Catalogs
A standard information_schema exists for describing metadata about the database.
PostgreSQL Catalog
PostgreSQL implemented it’s own catalogs long before the standard information_schema
was developed.
The PostgreSQL catalog is accessible via pg_XXX
tables/views, for example
pg_roles(oid, rolname, rolsuper, ...)
pg_database(oid, datname, datdba, ..., datacl)
pg_attribute(oid, attrrelid, attname, atttypid, ...)
PostgreSQL also makes the standard catalog available, via information_schema.
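For example, both catalogs can be queried like ordinary tables:
-- list databases via the PostgreSQL-specific catalog
SELECT datname FROM pg_database;
-- list tables in the current database via the standard catalog
SELECT table_name FROM information_schema.tables WHERE table_schema = 'public';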
Programming With Databases
A common database access pattern used in programming languages is seen below.
db = connect_to_dbms(DBname, User/Password)
query = build_SQL('SQLStatementTemplate', values)
results = execute_query(db, query)
while more_tuples_in(results):
tuple = fetch_row_from(results)
# do something with values in tuple
Python & Psycopg2
Psycopg2 is a Python module that provides
- A method to connect to PostgreSQL databases
- A collection of DB-related exceptions
- A collection of type and object constructors
A standard Psycopg2 program is seen below.
import psycopg2
conn = psycopg2.connect(DB_connection_string)
cur = conn.cursor()
cur.execute('SQLStatementTemplate', values)
conn.close()
Psycopg2 has various useful methods
- conn.commit(): commit changes made to the database since the last commit()
- cur.mogrify('SQLStatementTemplate', values): return the SQL statement as a string with values inserted
- cur.fetchall(): return a list of tuples
- cur.fetchone(): return a single tuple
- cur.fetchmany(nTuples): return a list of up to nTuples tuples
Relational Design Theory
A good relational database design
- Must capture all necessary attributes/associations
- Must do this with a minimal amount of stored information
Relational Design & Redundancy
In database design, redundancy is generally a bad thing as it causes problems maintaining consistency after updates.
Functional Dependencies
A relation instance $r(R)$ satisfies a dependency $X \to Y$ if for any $t, u \in r, t[X] = u[X] \implies t[Y] = u[Y]$.
In other words, if two tuples in $R$ agree in their values for the set of attributes $X$, then they must also agree in their values for the set of attributes $Y$. In this case, we say that $Y$ is functionally dependent on $X$.
Attribute sets $X$ and $Y$ may overlap, and it is trivially true that $X \to X$.
Inference Rules
Armstrong’s rules are general rules of inference on functional dependencies.
- Reflexivity: $X \to X$
- Augmentation: $X \to Y \implies XZ \to YZ$
- Transitivity: $X \to Y, Y \to Z \implies X \to Z$
Other useful rules exist too.
- Additivity: $X \to Y, X \to Z \implies X \to YZ$
- Projectivity: $X \to YZ \implies X \to Y, X \to Z$
- Pseudotransitivity: $X \to Y, YZ \to W \implies XZ \to W$
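For example, additivity can be derived from Armstrong’s rules alone: from $X \to Y$, augmentation (by $X$) gives $X \to XY$; from $X \to Z$, augmentation (by $Y$) gives $XY \to YZ$; transitivity on these two then gives $X \to YZ$.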
Closure
Given a set $F$ of functional dependencies, how many new functional dependencies can we derive? For a finite set of attributes, there must be a finite set of derivable functional dependencies.
The largest collection of dependencies that can be derived from $F$ is called the closure of $F$ and is denoted $F^+$.
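For example (a hypothetical schema), with $R = ABC$ and $F = \{ A \to B, B \to C \}$, $F^+$ also contains $A \to C$ (transitivity), $A \to BC$ (additivity), $AB \to BC$ (augmentation), and so on; the set of attributes determined by $A$ is $ABC$, so $A$ is a candidate key for $R$.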
Normalisation
Normalisation aims to
- Be able to characterise the level of redundancy in a relational schema
- Provide mechanisms for transforming schemas to remove redundancy
Normal Forms
Normalisation theory defines six normal forms
- First, Second, Third Normal Forms (1NF, 2NF, 3NF)
- Boyce-Codd Normal Form (BCNF)
- Fourth Normal Form (4NF)
- Fifth Normal Form (5NF)
where 1NF allows the most redundancy and 5NF allows the least redundancy.
Normalisation aims to put a schema into xNF by ensuring that all relations in the schema are in xNF.
Relation Decomposition
The standard transformation technique to remove redundancy is to decompose a relation $R$ into relations $S$ and $T$.
We accomplish decomposition by
- Selecting (overlapping) subsets of attributes
- Forming new relations based on attribute subsets
Properties include $R = S \cup T$, $S \cap T \neq \emptyset$ and $r(R) = s(S) \bowtie t(T)$.
Boyce-Codd Normal Form (BCNF)
A relation schema $R$ is in BCNF with reference to a set $F$ of functional dependencies iff for all functional dependencies $X \to Y \in F^+$
- Either $X \to Y$ is trivial (i.e. $Y \subseteq X$)
- Or $X$ is a superkey (i.e. non-strict superset of attributes in key)
A database schema is in BCNF if all of its relation schemas are in BCNF.
The following algorithm converts an arbitrary schema to BCNF.
# Inputs: schema R, set F of functional dependencies
# Output: set Res of BCNF schemas
Res = {R};
while any schema S in Res is not in BCNF:
choose any fd X -> Y on S that violates BCNF
Res = (Res-S) union (S-Y) union XY
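A small worked example on a hypothetical schema: let $R = ABC$ with $F = \{ AB \to C, C \to B \}$, whose candidate keys are $AB$ and $AC$. The dependency $C \to B$ violates BCNF (it is non-trivial and $C$ is not a superkey), so the algorithm replaces $R$ with $S - Y = AC$ and $XY = CB$, giving the result $\{ AC, BC \}$; both schemas are now in BCNF.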
Third Normal Form (3NF)
A relation schema $R$ is in 3NF with reference to a set $F$ of functional dependencies iff for all functional dependencies $X \to Y \in F^+$
- Either $X \to Y$ is trivial (i.e. $Y \subseteq X$)
- Or $X$ is a superkey
- Or $Y$ is a single attribute from a key
A database schema is in 3NF if all relation schemas are in 3NF.
If we transform a schema into 3NF, we are guaranteed
- Lossless join decomposition
- The new schema preserves all of the functional dependencies from the original schema
The following algorithm converts an arbitrary schema to 3NF.
# Inputs: schema R, set F of fds
# Output: set R_i of 3NF schemas
let F_c be a minimal cover for F
Res = {}
for each fd X -> Y in F_c:
if no schema S in Res contains XY:
Res = Res union XY
if no schema S in Res contains a candidate key for R:
K = any candidate key for R
Res = Res union K
Relational Algebra
Relational algebra can be viewed as a
- Mathematical system for manipulating relations
- Data manipulation language for the relational model
It consists of
- Operands: relations, or variables representing relations
- Operators that map relations to relations
- Rules for combining operands/operators into expressions
- Rules for evaluating such expressions
The core relational algebra operations are
- Selection: choosing a subset of tuples/rows
- Projection: choosing a subset of attributes/columns
- Product, Join: combining relations
- Union, Intersection, Difference: combining relations
- Rename: change names of relations/attributes
Notation
Operation | Standard Notation | Our Notation |
---|---|---|
Selection | $\sigma_{expr}(Rel)$ | $Sel[expr](Rel)$ |
Projection | $\pi_{A, B, C}(Rel)$ | $Proj[A, B, C](Rel)$ |
Join | $Rel_1 \bowtie_{expr} Rel_2$ | $Rel_1 \ Join[expr] \ Rel_2$ |
Rename | $\rho_{schema} Rel$ | $Rename[schema](Rel)$ |
We define the semantics of RA operations using
- “Conditional set” expressions, e.g. $\{ X \mid condition \ on \ X \}$
- Tuple notations
- $t[A, B]$ extracts attributes $A$ and $B$ from tuple $t$
- $(x, y, z)$ enumerated tuples, specify attribute values
- Quantifiers, set operations, boolean operators
Selection
Selection $Sel[C](r)$ returns a subset of the tuples in a relation $r(R)$ that satisfy a specified condition $C$.
result = {}
for each tuple t in relation r:
if C(t):
result = result + {t}
Projection
Projection $Proj[X](r)$ returns a set of tuples containing a subset of the attributes in the original relation, where $X$ specifies a subset of the attributes of $R$.
In pseudocode
result = {}
for each tuple t in relation r:
result = result + {t[X]}
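For example, with a hypothetical relation $Employee(id, name, dept)$, the names of employees in the Sales department could be written $Proj[name](Sel[dept = 'Sales'](Employee))$.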
Product
Product $r \times s$ combines information from two relations pairwise on tuples.
In pseudocode
result = {}
for each tuple t1 in relation r:
for each tuple t2 in relation s:
result = result + {(t1:t2)}
Natural Join
Natural join $r \ Join \ s$ is a specialised product. It contains only pairs that match on common attributes with one of each pair of common attributes eliminated.
In pseudocode
result = {}
for each tuple t1 in relation r:
for each tuple t2 in relation s:
if matches(t1, t2):
result = result + {combine(t1, t2)}
Theta Join
Theta join $r \ Join[C] \ s$ is a specialised product containing only pairs that match on a supplied condition $C$.
Rename
Rename provides “schema mapping”. If expression $E$ returns a relation $R(A_1, A_2, \dots, A_n)$ then $Rename[S(B_1, B_2, \dots, B_n)](E)$ gives a relation called $S$ containing the same set of tuples as $E$ but with the name of each attribute changed from $A_i$ to $B_i$.
Union
Union $r_1 \cup r_2$ combines two compatible relations into a single relation via set union of sets of tuples.
In pseudocode
result = r1
for each tuple t in relation r2:
result = result + {t}
Intersection
Intersection $r_1 \cap r_2$ combines two compatible relations into a single relation via set intersection of sets of tuples.
In pseudocode
result = {}
for each tuple t in relation r1:
if t in r2:
result = result + {t}
Difference
Difference $r_1 - r_2$ finds the set of tuples that exist in one relation but do not occur in a second compatible relation.
In pseudocode
result = {}
for each tuple t in relation r1:
if t not in r2:
result = result + {t}
Division
Division $r \ Div \ s$ considers each subset of tuples in a relation $R$ that match on $t[R - S]$ for a relation $S$. For this subset of tuples, it takes the $t[S]$ values from each, and if this covers all tuples in $S$, then it includes $t[R - S]$ in the result.
We have $r \ Div \ s = \{ t : t \in r[R - S] \land satisfy \}$ where $satisfy = \forall t_s \in s : \exists t_r \in r : t_r[S] = t_s \land t_r[R - S] = t$.
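For example, with hypothetical relations $Enrolled(student, course)$ and $Core(course)$, $Enrolled \ Div \ Core$ gives the students who are enrolled in every core course.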
DBMS Architecture
Query Evaluation
The cost of evaluating a SQL query is determined by
- The operations specified in the query execution plan
- The size of relations (database relations and temporary relations)
- Access mechanisms (indexing, hashing, sorting, join algorithms)
- Size/number of main memory buffers (and replacement strategy)
The analysis of costs involves estimating the size of intermediate results then, based on this, the cost of secondary storage accesses. Accessing data from disk is the dominant cost in query evaluation.
Relational algebra operations can use different algorithms:
- Sorting
- External merge sort ($\mathcal{O}(N \log_B N)$ with $B$ memory buffers)
- Selection
- Sequential scan ($\mathcal{O}(N)$)
- Index based ($\mathcal{O}(\log N)$ with b-trees)
- Hash based ($\mathcal{O}(1)$ best case, but only works with equality tests)
- Join
- Nested loop join ($\mathcal{O}(NM)$, reduced to $\mathcal{O}(N + M)$ with buffering)
- Sort merge join ($\mathcal{O}(N \log N + M \log M)$)
- Hash join ($\mathcal{O}(N + MN / B)$ with $B$ memory buffers)
Query Optimisation
A DBMS query optimiser works as follows:
# Input: relational algebra expression
# Output: execution plan (sequence of RA ops)
bestCost = INF
bestPlan = None
while more possible plans:
plan = produce a new evaluation plan
cost = estimated_cost(plan)
if cost < bestCost:
bestCost = cost
bestPlan = plan
return bestPlan
Typically, there are very many possible plans, but smarter optimisers generate a subset of possible plans.
Indexes
Indexes provide efficient content-based access to tuples. Indexes can be built on any combination of attributes.
To define an index
CREATE INDEX name ON table (attr1, attr2, ...)
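For example (the Students table and query are hypothetical):
CREATE INDEX student_name_idx ON Students (name);
-- may then speed up content-based access such as
SELECT * FROM Students WHERE name = 'John Smith';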
PostgreSQL Query Tuning
PostgreSQL provides the EXPLAIN statement to give a representation of the query execution plan with information that may help to tune query performance.
To explain a query
EXPLAIN ANALYZE Query
Without ANALYZE, EXPLAIN shows a plan with estimated costs; with ANALYZE, EXPLAIN executes the query and prints the real costs.
Transactions
A transaction is an atomic “unit of work” in an application which may require multiple database changes. Transactions happen in a multi-user, unreliable environment.
To maintain integrity of data, transactions must be
- Atomic: either fully completed or totally rolled-back
- Consistent: map database between consistent states
- Isolated: transactions do not interfere with each other
- Durable: persistent, restorable after system failures
A transaction must always terminate, either successfully (COMMIT) with all changes preserved, or unsuccessfully (ABORT) with the database unchanged.
To describe transaction effects, we consider
- READ: transfer data from disk to memory
- WRITE: transfer data from memory to disk
- ABORT: terminate the transaction, unsuccessfully
- COMMIT: terminate the transaction, successfully
normally abbreviated to R(X), W(X), A, C.
Transaction Consistency
Transactions typically have intermediate states that are invalid. However, states before and after transactions must be valid.
A schedule defines a specific execution of one or more transactions, typically concurrent, with interleaved operations.
Arbitrary interleaving of operations can cause anomalies, so that two consistency-preserving transactions produce a final state which is not consistent.
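For example, the following interleaving (a sketch of the classic lost-update anomaly) is not consistency-preserving: both transactions read the original value of $X$, and $T_2$’s write overwrites (loses) $T_1$’s update.
T1: R(X)      W(X)
T2:      R(X)      W(X)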
Serial Schedules
Given two transactions $T_1$ and $T_2$
T1: R(X) W(X) R(Y) W(Y)
T2: R(X) W(X) R(Y) W(Y)
there exist two serial executions
T1: R(X) W(X) R(Y) W(Y)
T2:                     R(X) W(X) R(Y) W(Y)
and
T1:                     R(X) W(X) R(Y) W(Y)
T2: R(X) W(X) R(Y) W(Y)
Serial execution guarantees a consistent final state if the initial state of the database is consistent and $T_1$ and $T_2$ are consistency-preserving.
Concurrent Schedules
Concurrent schedules interleave transaction operations. Some concurrent schedules ensure consistency, and some cause anomalies.
Serialisability
A concurrent schedule for $T_1 \dots T_n$ is serialisable if its final state $S$ is also a final state of one of the possible serial schedules for $T_1 \dots T_n$.
There are two common formulations of serialisability
- Conflict serialisability: read/write operations occur in the right order
- View serialisability: read operations see the correct version of data
Conflict Serialisability
Two transactions have a potential conflict if they perform operations on the same data item and at least one of the operations is a write operation. In such cases, the order of operations affects the result. If there is no conflict, we can swap order without affecting the result.
If we can transform a schedule by swapping the order of non-conflicting operations such that the result is a serial schedule then we can say that the schedule is conflict serialisable.
To check for conflict serialisability we need to determine whether the ordering of conflicting operations in the concurrent schedule could also occur in some serial schedule. This can be checked by building a precedence graph where
- Nodes represent transactions
- Edges represent the order of an action on shared data
- An edge from $T_1$ to $T_2$ means $T_1$ acts on $X$ before $T_2$
- Cycles indicate that the schedule is not conflict serialisable
View Serialisability
View serialisability is an alternative formulation of serialisability that is less strict than conflict serialisability.
Two schedules $S$ and $S^\prime$ on $T_1 \dots T_n$ are view equivalent iff for each shared data item $X$
- If, in $S$, $T_j$ reads the initial value of $X$, then, in $S^\prime$, $T_j$ also reads the initial value of $X$
- If, in $S$, $T_j$ reads $X$ written by $T_k$, then, in $S^\prime$, $T_j$ also reads the value of $X$ written by $T_k$ in $S^\prime$
- If, in $S$, $T_j$ performs the final write of $X$, then, in $S^\prime$, $T_j$ also performs the final write of $X$
To check serialisability of $S$, find a serial schedule that is view equivalent to $S$ from among the $n!$ possible serial schedules.
Concurrency Control
Serialisability tests are useful theoretically, but don’t provide a mechanism for organising schedules.
Approaches to ensuring ACID transactions:
- Lock based
- Synchronise transaction execution via locks on some portion of the database
- Version based
- Allow multiple consistent versions of the data to exist, and allow each transaction exclusive access to one version
- Timestamp based
- Organise transaction execution in advance by assigning timestamps to operations
- Validation based
- Exploit typical execution-sequence properties of transactions to determine safety dynamically
Lock Based Concurrency Control
Lock based concurrency control involves synchronising access to shared data items via the following rules
- Before reading $X$, get a shared (read) lock on $X$
- Before writing $X$, get an exclusive (write) lock on $X$
- An attempt to get a shared lock on $X$ is blocked if another transaction already has an exclusive lock on $X$
- An attempt to get an exclusive lock on $X$ is blocked if another transaction has any kind of lock on $X$
Overall, locking reduces concurrency which leads to lower throughput. Different granularity levels of locking (i.e. field, row, table, whole database) can also impact performance.
Version Based Concurrency Control
Version based concurrency control provides multiple (consistent) versions of the database and gives each transaction access to a version that maintains the serialisability of the transaction.
In version based concurrency control, writing never blocks reading and reading never blocks writing.