23
CS 4332 Introduction to Database Systems Byron Gao

CS 4332 Introduction to Database Systems

Embed Size (px)

DESCRIPTION

CS 4332 Introduction to Database Systems. Byron Gao. Logistics. Instructor: Byron Gao Page: http://cs.txstate.edu/~jg66/ Course page Textbook workload Grading: nice curve, but won ’ t please History (usually you get what you expect) Weighting Late policy Academic honesty - PowerPoint PPT Presentation

Citation preview

Page 1: CS 4332 Introduction to Database Systems

CS 4332Introduction to Database

Systems

Byron Gao

Page 2: CS 4332 Introduction to Database Systems

Logistics

Instructor: Byron Gao Page: http://cs.txstate.edu/~jg66/

Course page

Textbook workload Grading: nice curve, but won’t please

History (usually you get what you expect) Weighting Late policy Academic honesty TRACS: https://tracs.txstate.edu/portal

Page 3: CS 4332 Introduction to Database Systems

Objective

Provide in-depth introduction to database management systems, emphasizing how to design a database and use a DBMS effectively Relational Database Management Systems (RDBMS) Relational algebra Structured Query Language (SQL) Database design

Web application Data warehousing and data mining XML data and unstructured data management

Page 4: CS 4332 Introduction to Database Systems

Overview of Database Systems

Chapter 1

Page 5: CS 4332 Introduction to Database Systems

What Is a DBMS? Science paradigms: empirical, theoretical,

computational, data-intensive Data deluge, information explosion, information overload Databases (structured), data mining, information retrieval DBA … data scientists (more data mining, analytics …)

A very large, integrated collection of data Models real-world enterprise

Entities (e.g., students, courses) Relationships (e.g., Madonna is taking CS 4332)

A Database Management System (DBMS) is a software package designed to store and manage databases

Page 6: CS 4332 Introduction to Database Systems

Example: use of structured data

mysql -u jg66 -h mysql.cs.txstate.edu –pshow databases;use jg66;show tables;select * from Sailors;select * from Boats;select * from Reserves;

Find the names of sailors who’ve reserved a red boat?

SELECT S.sname FROM Sailors S, Boats B, Reserves R WHERE S.sid=R.sid and B.bid=R.bid and B.color="red";

Page 7: CS 4332 Introduction to Database Systems

More examplesFind the colors of boats reserved by Lubber?

SELECT B.colorFROM Sailors S, Boats B, Reserves RWHERE S.sid=R.sid and B.bid=R.bid and S.sname="Lubber";

Find sid’s of sailors who’ve reserved red but not green boats?

Find the age of the youngest sailor with age 18, for each rating with at least 2 such sailors?

Find those ratings for which the average age is the minimum over all ratings?

Page 8: CS 4332 Introduction to Database Systems

Historical Perspective Early 60’s: network data model

Charles Bachman, 1973 first Turing award Late 60’s: hierarchical data model, IBM 70: relational data model

Edgar Codd, IBM Research, 1981 Turing award 80’s:

SQL, IBM System-R; Transactions, concurrent execution of database programs, Jim

Gary, 1999 Turing award Late 80’s and 90’s:

Storing images and text, more complex queries, data warehouses Internet era, files -> DBMS Decision support, data mining

Recently, managing unstructured data DB2, Oracle, Sql Server, Informix ... Ingres, PostgreSql …

MySql

Page 9: CS 4332 Introduction to Database Systems

Files vs. DBMS

OS inadequate Application must stage large datasets

between main memory and secondary storage (e.g., buffering, page-oriented access, 32-bit addressing, etc.)

Special code for different queries Must protect data from inconsistency due to

multiple concurrent users Crash recovery Security and access control

Page 10: CS 4332 Introduction to Database Systems

Why Use a DBMS?

Data independence and efficient access Reduced application development time Data integrity and security Uniform data administration Concurrent access, recovery from

crashes

Reasons not to use DBMS

Page 11: CS 4332 Introduction to Database Systems

Why Study Databases??

Shift from computation to information At the “low end”: scramble to webspace (a mess!) At the “high end”: scientific applications

Datasets increasing in diversity and volume. Digital libraries, interactive video, Human

Genome project, EOS project ... need for DBMS exploding

DBMS encompasses most of CS OS, languages, theory, AI, multimedia, logic

?

Page 12: CS 4332 Introduction to Database Systems

Data Models A data model is a collection of concepts for

describing data A schema is a description of a particular

collection of data, using the given data model The relational model of data is the most

widely used model today Main concept: relation, basically a table with rows

and columns Every relation has a schema, which describes the

columns, or fields

Page 13: CS 4332 Introduction to Database Systems

Levels of Abstraction

Many views, single conceptual (logical) schema and physical schema Views describe how

users see the data

Conceptual schema defines logical structure

Physical schema describes the files and indexes used Schemas are defined using DDL; data is modified/queried using DML

Physical Schema

Conceptual Schema

View 1 View 2 View 3

Page 14: CS 4332 Introduction to Database Systems

Example: University Database

Conceptual (logical) schema: Students(sid: string, name: string, login: string,

age: integer, gpa:real) Courses(cid: string, cname:string, credits:integer) Enrolled(sid:string, cid:string, grade:string)

Physical schema: Relations stored as unordered files Index on first column of Students

External Schema (View): course_info(cid:string,enrollment:integer)

Page 15: CS 4332 Introduction to Database Systems

Data Independence

Applications insulated from how data is structured and stored

Logical data independence: Protection from changes in logical structure of data

Physical data independence: Protection from changes in physical structure of data

One of the most important benefits of using a DBMS!

Page 16: CS 4332 Introduction to Database Systems

Concurrency Control

Concurrent execution of user programs is essential for good DBMS performance Because disk accesses are frequent, and relatively

slow, it is important to keep the cpu humming by working on several user programs concurrently

Interleaving actions of different user programs can lead to inconsistency: e.g., check is cleared while account balance is being computed

DBMS ensures such problems don’t arise: users can pretend they are using a single-user system

Page 17: CS 4332 Introduction to Database Systems

Transaction: An Execution of a DB Program Key concept is transaction, which is an atomic

sequence of database actions (reads/writes) Each transaction, executed completely, must

leave the DB in a consistent state if DB is consistent when the transaction begins Users can specify some simple integrity constraints on

the data, and the DBMS will enforce these constraints Beyond this, the DBMS does not really understand the

semantics of the data. (e.g., it does not understand how the interest on a bank account is computed)

Thus, ensuring that a transaction (run alone) preserves consistency is ultimately the user’s responsibility!

Page 18: CS 4332 Introduction to Database Systems

Scheduling Concurrent Transactions

DBMS ensures that execution of {T1, ... , Tn} is equivalent to some serial execution T1’ ... Tn’ Before reading/writing an object, a transaction

requests a lock on the object, and waits till the DBMS gives it the lock. All locks are released at the end of the transaction (Strict 2PL locking protocol)

Idea: If an action of Ti (say, writing X) affects Tj (which perhaps reads X), one of them, say Ti, will obtain the lock on X first and Tj is forced to wait until Ti completes; this effectively orders the transactions

What if Tj already has a lock on Y and Ti later requests a lock on Y? (Deadlock!) Ti or Tj is aborted and restarted!

Page 19: CS 4332 Introduction to Database Systems

Ensuring Atomicity

DBMS ensures atomicity (all-or-nothing property) even if system crashes in the middle of a Xact

Idea: Keep a log (history) of all actions carried out by the DBMS while executing a set of Xacts: Before a change is made to the database, the

corresponding log entry is forced to a safe location. (WAL protocol; OS support for this is often inadequate)

After a crash, the effects of partially executed transactions are undone using the log. (Thanks to WAL, if log entry wasn’t saved before the crash, corresponding change was not applied to database!)

Page 20: CS 4332 Introduction to Database Systems

The Log

The following actions are recorded in the log: Ti writes an object: the old value and the new value

• Log record must go to disk before the changed page! Ti commits/aborts: a log record indicating this action

Log records chained together by Xact id, so it’s easy to undo a specific Xact (e.g., to resolve a deadlock)

Log is often duplexed and archived on “stable” storage.

All log related activities (and in fact, all CC related activities such as lock/unlock, dealing with deadlocks etc.) are handled transparently by the DBMS

Page 21: CS 4332 Introduction to Database Systems

Databases make these folks happy ...

End users and DBMS vendors DB application programmers

e.g. smart webmasters Database administrator (DBA)

Designs logical /physical schemas Handles security and authorization Data availability, crash recovery Database tuning as needs evolve

Must understand how a DBMS works!

Page 22: CS 4332 Introduction to Database Systems

Structure of a DBMS

A typical DBMS has a layered architecture

The figure does not show the concurrency control and recovery components

This is one of several possible architectures; each system has its own variations

Query Optimizationand Execution

Relational Operators

Files and Access Methods

Buffer Management

Disk Space Management

DB

These layersmust considerconcurrencycontrol andrecovery

Page 23: CS 4332 Introduction to Database Systems

Summary DBMS used to maintain, query large datasets Benefits include recovery from system

crashes, concurrent access, quick application development, data integrity and security.

Levels of abstraction give data independence. A DBMS typically has a layered architecture DBAs hold responsible jobs

and are well-paid! DBMS R&D is one of the broadest,

most exciting areas in CS