81
Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke [email protected] http://www.cs.cornell.edu/ johannes

The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke [email protected]

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Software Infrastructurefor Electronic Commerce

Databases and Data MiningLecture 1:

A Manager’s View of Database Technology

Johannes [email protected]

http://www.cs.cornell.edu/johannes

Page 2: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Goal Of Lectures One and Two

• Understand the basics of modern data management• DBMS technology• Data models• Transaction processing versus decision

support• An data management architecture of an

e-business

Page 3: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Goals of the DBMS Lectures (Contd.)

• Gain technical understanding to feel confident to ask questions.

• Learn to ask the right questions.

Page 4: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Big Picture

WWW SiteVisitor

THE WEB

Public Web Server

BusinessTransaction

Server

MainMemoryCache

DBMS

DataWarehouseApplication

Server

INTRANET,VPN

Internal User

InternalWeb Server

Page 5: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Web Pages with Database Contents

• Web pages contain the results of database queries. How do we generate such pages?• Web server creates a new process for a

program interacts with the database.• Web server communicates with this

program via some data exchange protocol

• Program generates result page with content from the database

Page 6: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Business Transaction Server

• Application server: Piece of software between the web server and the applications

• Functionality:• Hold a set of pre-forked threads or processes

for performance• Database connection pooling (reuse a set of

existing connections)• Integration of heterogeneous data sources• Transaction management involving several

data sources• Session management

Page 7: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Other Server-Side Processing

• Java Servlets: Java programs that run on the server and interact with the server through a well-defined API

• JavaBeans: Reusable software components written in Java

• Java Server Pages and Active Server Pages: Code inside a web page that is interpreted by the web server

Page 8: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

What Happens If You Click On A Link?

www.company.com

Company web server

www.company.com Johannes Gehrke

Banner web serverwww.banners.com

Hidden linkAdditional log serverwww.mycookies.com

1

2

3

4

CLIENT SIDE SERVER SIDE

Page 9: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Web Server Log Data

Common Log Format:host ident authuser date request status bytes

• host: domain name or IP address of the client

• date: date and time of the request• request: the request line from the client• status: three digit status code returned to

the client• bytes: number of bytes in the object

returned to the client

Page 10: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Web Server Log Data (Contd.)

Usually, even more data available:• URL of the referring server• Name and version of the browser• Name of the file requested• Time to serve the request• IP address of the requesting browser• Cookie

Page 11: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Cookies

• The communication protocol (HTTP) between your browser and the web server is stateless. (Compare to a vending machine.)

• Remedy: Store information (called a cookie) at the browser of the user

• Example cookie (from www.google.com): PREFID=3415aaaf73b7bfe3,TM=956724506|google.com/|0|261887833632111634|2662722800|29339450|*(name=value|domain|secure|expiration date|expiration time|last used date|last used time)

Page 12: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Cookies (Contd.)

• A cookie is always associated with a specific domain (amazon.com, bn.com, preferences.com, doubleclick.net, cornell.edu)

• Cookies have expiration dates• The secrets are in the (name=value) pairs

(usually encrypted):PREFID=3415aaaf73b7bfe3,TM=956724506

• Cookies have their own life on your computer:• \Documents and Settings\UserName\Cookies,

\Windows\Cookies\Windows\Profiles\UserName\Cookies\ProgramFiles\Netscape\Users\Default\cookies.txt

Page 13: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Cookies (Contd.)

• Applications of cookies:• Shopping carts• “Log in once” (example: New York Times)• Personalization (amazon.com: Welcome back,

Johannes)• General tracking of user behavior

• User privacy• Other personalization/tracking techniques:

Hidden fields in html pages, unique page names

• Online demonstration

Page 14: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

And Then You Click Purchase …

(Simplified process)• Insert customer data into the

database/check customer data• Check order availability• Insert order data into the database• Return order confirmation number to the

customer

All this data is stored in a database system (DBMS).

Page 15: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Why Store Data in a DBMS?

• Benefits• Transactions (concurrent data access,

recovery from system crashes)• High-level abstractions for data access,

manipulation, and administration• Data integrity and security• Performance and scalability

Page 16: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Transactions

• A transaction is an atomic sequence of database actions (reading, writing or updating a database object).

• Each transaction must leave the DB in a consistent state (if DB is consistent when the transaction starts).

• The ACID Properties:• Atomicity• Consistency• Isolation• Durability

Page 17: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Example Transaction: Online Store

Your purchase transaction:• Atomicity: Either the complete purchase

happens, or nothing• Consistency: The inventory and internal

accounts are updated correctly• Isolation: It does not matter whether other

customers are also currently making a purchase

• Durability: Once you have received the order confirmation number, your order information is permanent, even if the site crashes

Page 18: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Transactions (Contd.)

A transaction will commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions.

Page 19: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Example Transaction: ATM

You withdraw money from the ATM machine• Atomicity• Consistency• Isolation• Durability

Commit versus Abort?What are reasons for commit or abort?

Page 20: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Concurrency Control

(Start: A=$100; B=$100)

Consider two transactions:• T1: START, A=A+100, B=B-100, COMMIT• T2: START, A=1.06*A, B=1.06*B, COMMIT

The first transaction is transferring $100 from B’s account to A’s account. The second transaction is crediting both accounts with a 6% interest payment.

Page 21: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Example (Contd.)

(Start: A=$100; B=$100)

• Consider a possible interleaving (schedule):T1: A=A+$100, B=B-$100 COMMIT T2: A=1.06*A, B=1.06*B COMMITEnd result: A=$106; B=$0

• Another possible interleaving:T1: A=A+100, B=B-100 COMMITT2: A=1.06*A, B=1.06*B COMMITEnd result: A=$106; B=$6

The second interleaving is incorrect! Concurrency control of a database system makes sure that the second schedule does not happen.

Page 22: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Ensuring Atomicity

• DBMS ensures atomicity (all-or-nothing property) even if the system crashes in the middle of a transaction.

• Idea: Keep a log (history) of all actions carried out by the DBMS while executing :• Before a change is made to the database, the

corresponding log entry is forced to a safe location.

• After a crash, the effects of partially executed transactions are undone using the log.

Page 23: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Recovery

• A DBMS logs all elementary events on stable storage. This data is called the log.

• The log contains everything that changes data: Inserts, updates, and deletes.

• Reasons for logging:• Need to UNDO transactions• Recover from a systems crash

Page 24: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Recovery: Example

(Simplified process)• Insert customer data into the

database• Check order availability• Insert order data into the database• Write recovery data (the log) to stable

storage• Return order confirmation number to

the customer

Page 25: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Tips

• Transactions, concurrency control, and recovery are important aspects of the functionality of a database system

• Tips for capacity planning:• Load influences level of concurrency

Determines hardware requirements• Insufficient resources for concurrency and

recovery can force serialization of transactions Bad performance

• Need ample space for the log, often mirrored onto two disks at the same time

Page 26: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Why Store Data in a DBMS?

• Benefits• Transactions (concurrent data access,

recovery from system crashes)• High-level abstractions for data access,

manipulation, and administration• Data integrity and security• Performance and scalability

Page 27: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Data Independence

Applications should be insulated from how data is structured and stored. Thus the DBMS needs to provide high-level abstractions to applications!

Physical Schema

Conceptual Schema

View 1 View 2 View 3

Page 28: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Data Model

• A data model is a collection of concepts for describing data.

• Examples:• ER model (used for conceptual

modeling)• Relational model, object-oriented

model, object-relational model (actually implemented in current DBMS)

Page 29: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Relational Data Model

A relational database is a set of relations. Turing Award (Nobel Price in CS) for Codd in 1980 for his work on the relational model

• Example relation:Customers(cid: integer, name: string, byear: integer, state: string)

cid name byear state

1 Jones 1960 NY

2 Smith 1974 CA

3 Smith 1950 NY

Page 30: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Relational Model: Terminology

• Relation instance and schema• Field (column)• Record or tuple (row)• Cardinality

cid name byear state

1 Jones 1960 NY

2 Smith 1974 CA

3 Smith 1950 NY

Page 31: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Customer Relation (Contd.)

• In your enterprise, you are more likely to have a schema similar to the following:

Customers(cid, identifier, nameType, salutation, firstName, middleNames, lastName, culturalGreetingStyle, gender, customerType, degrees, ethnicity, companyName, departmentName, jobTitle, primaryPhone, primaryFax, email, website, building, floor, mailstop, addressType, streetNumber, streetName, streetDirection, POBox, city, state, zipCode, region, country, assembledAddressBlock, currency, maritalStatus, bYear, profession)

Page 32: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Product Relation

• Relation schema:Products(pid: integer, pname: string, price: float,

category: string)

• Relation instance:

pid pname price category

1 Intel PIII-700 300.00 hardware

2 MS Office Pro 500.00 software

3 IBM DB2 5000.00 software

4 Thinkpad 600E 5000.00 hardware

Page 33: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Transaction Relation

• Relation schema:Transactions(

tid: integer,tdate: date,cid: integer,pid: integer)

• Relation instance:

tid tdate cid pid

1 1/ 1/ 2000 1 1

1 1/ 1/ 2000 1 2

2 1/ 1/ 2000 1 4

3 2/ 1/ 2000 2 3

3 2/ 1/ 2000 2 4

Page 34: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

TIPS

• Any enterprise has an abundance of different relations in its DBMS. Good management of this meta-data is crucial:• Documentation• Evolution• Assignment of responsibilities• Security

• (ERP packages usually create several thousand relations)

Page 35: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Relational DBMS Market

Market Share Of RDBMS In 1998(Gartner Group 4/1999)

0.0%10.0%20.0%30.0%40.0%50.0%60.0%70.0%

Ora

cle

7/8

and

Lite

Mic

roso

ftS

QL

IBM

DB

2

Syb

ase

AS

A/A

SA

Info

rmix

NC

R

Oth

er

NT

Unix

Page 36: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Relational DBMS Market (Contd.)

Best-Selling Client/Server DBMS(Computer Reseller News 8/1999)

43%

25%

6% 4%

22%

41%

32%

6% 5%

16%

0%

10%

20%

30%

40%

50%

Microsoft Oracle Sybase IBM Others

Jan-Jun 98

Jan-Jun 99

Page 37: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Relational DBMS Market (Contd.)

Market Share Of Database Revenues(Computer Reseller News, 5/1999)

0.0%

5.0%

10.0%

15.0%

20.0%

25.0%

30.0%

35.0%

IBM Oracle Microsoft Informix Sybase Other

In 1997

In 1998

Page 38: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Object-Oriented Data Model

• Richer data model. Goal: Bridge impedance mismatch between programming languages and the database system.

• Example components of the data model: Relationships between objects directly as pointers.

• Result: Can store abstract data types directly in the DBMS• Pictures• Geographic coordinates• Movies• CAD objects

Page 39: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Object-Oriented DBMS

• Advantages: Engineering applications (CAD and CAM and CASE computer aided software engineering), multimedia applications.

• Disadvantages:• Technology not as mature as relational

DMBS• Not suitable for decision support, weak

security• Vendors are much smaller companies and

their financial stability is questionable.

Page 40: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Object-Oriented DBMS (Contd.)

Vendors:• Gemstone (www.gemstone.com)• Objectivity (www.objy.com) • Object Design (www.odi.com) • POET (www.poet.com) • Versant (www.versant.com)Organizations:• OMDG: Object Database Management Group

(www.omdg.org) • OMG: Object Management Group (www.omg.org)

Page 41: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The OO DBMS Market

Forecast Revenues For OO Systems Software Worldwide (IDC 3/1999)

0.0

200.0

400.0

600.0

800.0

1,000.0

1997 1998 1999 2000 2001 2002

Year

Mil

lio

n $

Page 42: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Object-Relational DBMS

• Mixture between the object-oriented and the object-relational data model• Combines ease of querying with ability

to store abstract data types• Conceptually, the relational model, but

every field

• All major relational vendors are currently extending their relational DBMS to the object-relational model

Page 43: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Query Languages

We need a high-level language to describe and manipulate the data

Requirements:• Precise semantics • Easy integration into applications written

in C++/Java/Visual Basic/etc.• Easy to learn• DBMS needs to be able to efficiently

evaluate queries written in the language

Page 44: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Relational Query Languages

• The relational model supports simple, powerful querying of data. • Precise semantics for relational queries• Efficient execution of queries by the

DBMS• Independent of physical storage

Page 45: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

SQL: Structured Query Language

• Developed by IBM (System R) in the 1970s

• ANSI standard since 1986:• SQL-86• SQL-89 (minor revision)• SQL-92 (major revision, current

standard)• SQL-99 (major extensions)

Page 46: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Why Store Data in a DBMS?

• Benefits• Transactions (concurrent data access,

recovery from system crashes)• High-level abstractions for data access,

manipulation, and administration• Data integrity and security• Performance and scalability

Page 47: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Integrity Constraints

• Integrity Constraints (ICs): Condition that must be true for any instance of the database.• ICs are specified when schema is defined.• ICs are checked when relations are

modified.• A legal instance of a relation is one that

satisfies all specified ICs.• DBMS should only allow legal instances.

• Example: Domain constraints.

Page 48: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Primary Key Constraints

• A set of fields is a superkey for a relation if no two distinct tuples can have same values in all key fields.

• A set of fields is a key if the set is a superkey, and none of its subsets is a superkey.

• Example:• {cid, name} is a superkey for Customers• {cid} is a key for Customers

• Where do primary key constraints come from?

Page 49: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Primary Key Constraints (Contd.)

• Can there be more than one key for a relation?

• What is the maximum number of superkeys for a relation?

• What is the primary key of the Products relation? How about the Transactions relation?

Page 50: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Foreign Keys, Referential Integrity

• Foreign key : Set of fields in one relation that is refers to a unique tuple in another relation. (The foreign key must be a superkey of the second relation.)

• Example: The field cid in the Transactions relation is a foreign key referring to Customers.

• If all foreign key constraints are enforced, we say that referential integrity is achieved.• No dangling references.• Compare to links in HTML.

Page 51: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Foreign Keys: Example

• The pid field of the Transactions relation refers to the cid field of the Customer relation.

tid tdate cid pid

1 1/ 1/ 2000 1 1

1 1/ 1/ 2000 1 2

2 1/ 1/ 2000 1 4

3 2/ 1/ 2000 2 3

3 2/ 1/ 2000 2 4

cid name byear state

1 Jones 1960 NY

2 Smith 1974 CA

3 Smith 1950 NY

Page 52: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Enforcing Referential Integrity

• What should be done if a Transaction tuple with a non-existent Customer id is inserted? (Reject it!)

• What should be done if a Customer tuple is deleted?• Also delete all Transaction tuples that refer to

it.• Disallow deletion of a Customer tuple that has

associated Transactions.• Set cid in Transactions tuples to a default or

special cid.

• SQL supports all three choices

Page 53: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Where do ICs Come From?

• ICs are based upon the semantics of the real-world enterprise that is being described in the database relations.

• We can check a database instance to see if an IC is violated, but we can NEVER infer that an IC is true by looking at an instance.• An IC is a statement about all possible

instances!• From example, we know state cannot be a key,

but the assertion that cid is a key is given to us.

• Key and foreign key ICs are very common; a DBMS supports more general ICs.

Page 54: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Security

• Secrecy: Users should not be able to see things they are not supposed to.• E.g., A student can’t see other students’

grades.• Integrity: Users should not be able to

modify things they are not supposed to.• E.g., Only instructors can assign grades.

• Availability: Users should be able to see and modify things they are allowed to.

Page 55: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Discretionary Access Control

• Based on the concept of access rights or privileges for objects (tables and views), and mechanisms for giving users privileges (and revoking privileges).

• Creator of a table or a view automatically gets all privileges on it.• DMBS keeps track of who subsequently gains

and loses privileges, and ensures that only requests from users who have the necessary privileges (at the time the request is issued) are allowed.

• Users can grant and revoke privileges

Page 56: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Role-Based Authorization

• In SQL-92, privileges are actually assigned to authorization ids, which can denote a single user or a group of users.

• In SQL:1999 (and in many current systems), privileges are assigned to roles.• Roles can then be granted to users and to

other roles.• Reflects how real organizations work.• Illustrates how standards often catch up with

“de facto” standards embodied in popular systems.

Page 57: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Security Is Important

Financial estimate of database losses, by activity in 1998

(ENT/CSI/FBI Computer Crime Survey, 5/1999)

Activity Loss in million $Theft of Proprietary Information 28.51 System Penetration by Outsiders 13.39 Financial Fraud 11.24 Unauthorized Insider Access 50.57 Laptop Theft 0.16

Total 103.87

Page 58: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Why Store Data in a DBMS?

• Benefits• Transactions (concurrent data access,

recovery from system crashes)• High-level abstractions for data access,

manipulation, and administration• Data integrity and security• Performance and scalability

Page 59: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Why Parallel Access To Data?

1 Terabyte

10 MB/s

At 10 MB/s1.2 days to scan

1 Terabyte

1,000 x parallel1.5 minute to scan.

Parallelism: Divide a big problem into many smaller ones to be solved in parallel.

Bandwidth

Page 60: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Parallel DBMS: Intro

• Parallelism is natural to DBMS processing• Pipeline parallelism: many machines each doing one step in a

multi-step process. • Partition parallelism: many machines doing the same thing to

different pieces of data.• Both are natural in DBMS!

Pipeline

Partition

Any Sequential Program

Any Sequential Program

SequentialSequential SequentialSequentialAny

Sequential Program

Any Sequential Program

outputs split N ways, inputs merge M ways

Page 61: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

DBMS: The || Success Story

• DBMSs are the most (only?) successful application of parallelism.• Every major DBMS vendor has some || server• Workstation manufacturers now depend on ||

DB server sales.

• Reasons for success:• Bulk-processing (= partition ||-ism).• Natural pipelining.• Inexpensive hardware can do the trick!• Users/app-programmers don’t need to think in ||

Page 62: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Some || Terminology

• Speed-Up• More resources

means proportionally less time for given amount of data.

• Scale-Up• If resources

increased in proportion to increase in data size, time is constant.

degree of ||-ism

Xact

/sec.

(th

roug

hpu

t) Ideal

degree of ||-ism

sec.

/Xact

(resp

on

se t

ime)

Ideal

Page 63: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Architecture Issue: Shared What?

Shared Memory (SMP)

Shared Disk Shared Nothing (network)

CLIENTS CLIENTSCLIENTS

MemoryProcessors

Easy to programExpensive to buildDifficult to scaleup

Hard to programCheap to buildEasy to scaleup

Sybase Oracle Compaq, Teradata, SP2

Page 64: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Distributed Database Systems

• Data is stored at several sites, each managed by a DBMS that can run independently.

• Distributed data independence: Users should not have to know where data is located (extends physical and logical data independence principles).

• Distributed transaction atomicity: Users should be able to write transactions accessing multiple sites just like local transactions.

Page 65: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Types of Distributed Databases

• Homogeneous: Every site runs same type of DBMS.

• Heterogeneous: Different sites run different DBMSs (different RDBMSs or even non-relational DBMSs).

DBMS1 DBMS2 DBMS3

Gateway

Page 66: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Distributed DBMS Architectures

• Client-Server

Collaborating-Server

CLIENT CLIENT

SERVER SERVER SERVER

QUERY

SERVER

SERVER

SERVERQUERY

Client ships query to single site. All queryprocessing at server. - Thin vs. fat clients. - Set-oriented communication, client side caching.

Query can span multiplesites.

Page 67: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

DBMS and Performance

• Efficient implementation of all database operations

• Indexes: Auxiliary structures that allow fast access to the portion of data that a query is about

• Smart buffer management• Query optimization: Finds the best

way to execute a query• Automatic query parallelization

Page 68: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Performance Tips

• Bad connection management is the number one scalability problem (can make two order of magnitude difference)

• Reuse once-executed queries as much as possible

• Physical performance database tuning is crucial for good performance• Performance tuning is an art and a science• Recent trend: DBMS include tuning wizards for

automatic performance tuning

• Include space for indexes during space planning

Page 69: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Summary Of DBMS Benefits

• Transactions• ACID properties, concurrency control, recovery

• High-level abstractions for data access• Data models

• Data integrity and security• Key constraints, foreign key constraints,

access control

• Performance and scalability• Parallel DBMS, distributed DBMS, performance

tuning

Page 70: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

The Future of Data Exchange: Markup

From HTML to XML

Markup Languages: HTML• Simple markup language• Text is annotated with language

commands called tags, usually consisting of a start tag and an end tag

Page 71: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

HTML Example: Book Listing

<HTML><BODY>Fiction:<UL><LI>Author: Milan Kundera</LI? <LI>Title: Identity</LI> <LI>Published: 1998</LI></UL>Science:<UL><LI>Author: Richard Feynman</LI> <LI>Title: The Character of Physical

Law</LI> <LI>Hardcover</LI></UL></BODY></HTML>

Page 72: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Beyond HTML: XML

• Extensible Markup Language (XML): “Extensible HTML”

• Confluence of SGML and HTML: The power of SGML with the simplicity of HTML

• Allows definition of new markup languages, called document type declarations (DTDs)

Page 73: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

XML: Language Constructs

• Elements• Main structural building blocks of XML• Start and end tag• Must be properly nested

• Element can have attributes that provide additional information about the element

• Entities: like macros, represent common text.

• Comments• Document type declarations (DTDs)

Page 74: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Booklist Example in XML

<?XML version=“1.0” standalone=“yes”?><!DOCTYPE BOOKLIST SYSTEM “booklist.dtd”><BOOKLIST><BOOK genre=“Fiction”> <AUTHOR> <FIRST>Milan</FIRST><LAST>Kundera</LAST> </AUTHOR> <TITLE>Identity</TITLE> <PUBLISHED>1998</PUBLISHED><BOOK genre=“Science” format=“Hardcover”> <AUTHOR> <FIRST>Richard</FIRST><LAST>Feynman</LAST> </AUTHOR> <TITLE>The Character of Physical Law</TITLE></BOOK></BOOKLIST>

Page 75: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

XML: DTDs

• A DTD is a set of rules that defines the elements, attributes, and entities that are allowed in the document.

• An XML document is well-formed if it does not have an associated DTD but it is properly nested.

• An XML document is valid if it has a DTD and the document follows the rules in the DTD.

Page 76: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

An Example DTD

<!DOCTYPE BOOKLIST [ <!ELEMENT BOOKLIST (BOOK)*> <!ELEMENT BOOK (AUTHOR, TITLE, PUBLISHED?)> <!ELEMENT AUTHOR (FIRST, LAST)> <!ELEMENT FIRST (#PCDATA)> <!ELEMENT LAST (#PCDATA)> <!ELEMENT TITLE (#PCDATA)> <!ELEMENT PUBLISHED (#PCDATA)> <!ATTLIST BOOK genre (Science|Fiction) #REQUIRED> <!ATTLIST BOOK format (Paperback|Hardcover)

“Paperback”>]>

Page 77: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Domain-Specific DTDs

• Development of standardized DTDs for specialized domains enables data exchange between heterogeneous sources

• Example: Mathematical Markup Language (MathML)• Encodes mathematical material on the web• In HTML: <IMG SRC=“xysquare.gif” ALT=“(x+y)^2”>

• In MathML: <apply> <power/> <apply> <plus/> <ci>x</ci> <ci>y</ci> </apply> <cn>2</cn> </apply>

Page 78: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

XML: The Future

• Seamless integration of heterogeneous B2B data sources and business processes

• XML-specified services, discovered by autonomous software agents

• Industry-specific DTDs that simplify data exchange

• “Only XML”-industry revenue forecasts:1999: $31 millions, 2001: $93 millions(Interactive Week, 4/1999)

Page 79: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Summary Of Lecture One

• Database systems are powerful, but also complex pieces of software• Transactions (concurrent data access, recovery from system

crashes)• High-level abstractions for data access, manipulation, and

administration• Data integrity and security• Performance and scalability

• XML is the Future of Data Exchange

Page 80: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

In The Second Lecture

• Database design• A short introduction to SQL• Transaction processing versus decision

support: On to the data warehouse!

Page 81: The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 1: A Manager’s View of Database Technology Johannes Gehrke johannes@cs.cornell.edu

Questions?