Introduction to Databases

Second La Serena School for Data Science: Applied Tools for Astronomy

August 2014

Mauro San Martín [email protected]

Universidad de La Serena



Contents

• Introduction: databases and scientific data management

• Part I. Relational Databases: defining, storing, updating, and querying data

• Part II. Non-Relational Databases: a short introduction to alternatives to relational databases


LSSDS2013: Intro. to Astronomical RDBs Introduction

Definitions

• Database
  - An organized and self-describing collection of data, with an intended meaning, and maintained with a purpose.

• Database Management System (DBMS)
  - A software system designed and implemented to define, maintain, and share a database, and
  - to separate application logic from low-level data I/O.


Scientific Data Management

[Diagram: Scientific Data Management (collect, organize, store, transform, and query), with two kinds of systems: Relational Databases (SQL) and NoSQL Databases.]


Part I. Relational Databases

Theory


RDBs at a glance

• E. F. Codd, 1970: "A Relational Model of Data for Large Shared Data Banks"

• Main characteristics
  - One simple data structure: the relation
  - Solid mathematical foundations
  - Several comprehensive implementations available: PostgreSQL, MySQL, Oracle, SQL Server, etc.

• Industry standard since the 1980s


Modeling data

Capturing the world (or the universe)

• The relational data model
  - Data structure: relations/tables (collections of tuples)
  - Operations (query + update): relational algebra and calculus/SQL
  - Integrity constraints: data type, not null, referential integrity
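The three parts of the relational model can be tried out in a few lines. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the tables star and observation and their columns are made up for illustration and are not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Data structure: two relations (tables), each a collection of tuples (rows).
# The tables and columns here are hypothetical.
conn.execute("""
    CREATE TABLE star (
        star_id INTEGER PRIMARY KEY,   -- key: identifies each tuple
        name    TEXT NOT NULL,         -- not-null constraint
        mag     REAL                   -- declared data type
    )""")
conn.execute("""
    CREATE TABLE observation (
        obs_id  INTEGER PRIMARY KEY,
        star_id INTEGER NOT NULL REFERENCES star(star_id),  -- referential integrity
        flux    REAL
    )""")

# Operations: update (insert) and query.
conn.execute("INSERT INTO star VALUES (1, 'Sirius', -1.46)")
conn.execute("INSERT INTO observation VALUES (10, 1, 3.2)")
rows = conn.execute("SELECT name, mag FROM star").fetchall()

def violates(sql):
    """Return True if the statement is rejected by an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

null_rejected = violates("INSERT INTO star VALUES (2, NULL, 0.0)")
orphan_rejected = violates("INSERT INTO observation VALUES (11, 99, 1.0)")
print(rows, null_rejected, orphan_rejected)
```

Both rejected inserts leave the database unchanged: the DBMS, not the application, guards the constraints.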


Schemas

• Schema: definition of relations (columns, types, and keys) and integrity constraints.

• A good schema avoids data duplication, null values, and update anomalies.

• Normalization: an algorithm to build good schemas. Identify keys, then divide relations that have problems (separating their columns into new relations).
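As a rough illustration of dividing a relation, the sketch below starts from a flat table that repeats a star's spectral type in every observation and splits it into two relations. The table names, columns, and values are invented for the example; it shows one decomposition step, not a full normalization algorithm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized: the spectral type is repeated in every observation of a star.
conn.execute(
    "CREATE TABLE obs_flat (obs_id INTEGER PRIMARY KEY, star TEXT, sp_type TEXT, flux REAL)")
conn.executemany("INSERT INTO obs_flat VALUES (?, ?, ?, ?)", [
    (1, "Vega", "A0", 2.1), (2, "Vega", "A0", 2.3), (3, "Sirius", "A1", 9.8)])

# Normalized: sp_type depends only on the star, so it moves to its own relation.
conn.execute("CREATE TABLE star (star TEXT PRIMARY KEY, sp_type TEXT)")
conn.execute(
    "CREATE TABLE obs (obs_id INTEGER PRIMARY KEY, star TEXT REFERENCES star(star), flux REAL)")
conn.execute("INSERT INTO star SELECT DISTINCT star, sp_type FROM obs_flat")
conn.execute("INSERT INTO obs SELECT obs_id, star, flux FROM obs_flat")

# A correction now touches exactly one row, so copies cannot disagree.
cur = conn.execute("UPDATE star SET sp_type = 'A0V' WHERE star = 'Vega'")
rows_changed = cur.rowcount
joined = conn.execute("""
    SELECT o.obs_id, s.sp_type
    FROM obs AS o JOIN star AS s ON o.star = s.star
    ORDER BY o.obs_id""").fetchall()
print(rows_changed, joined)
```

In the flat table the same correction would have to update every Vega row, and missing one would leave the data inconsistent (an update anomaly).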


Querying the DB

Map data from the DB to the information needed

[Diagram: Query Evaluation. Database → (data collection) → Intermediate Result → (sort, group, aggregate) → Result. Costs: reading data, storing temporary data, sorting data.]


Updates

• Update: add and modify data.
  - Updates may render the database inconsistent.

• Transactions and ACID
  - Atomicity
  - Consistency
  - Isolation
  - Durability

[Diagram: a Transaction groups Operation 1, Operation 2, Operation 3, ..., Operation n.]
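Atomicity is the easiest ACID property to observe directly. The following sketch, using Python's sqlite3 module, groups two updates into one transaction; the account table and the transfer helper are hypothetical and only serve to show that either both operations apply or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

def transfer(amount):
    """Move `amount` from a to b as one transaction: both updates or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'b'", (amount,))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'a'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(30)        # both operations succeed
failed = transfer(500)   # second operation violates the CHECK; the first is undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

The failed transfer leaves no trace: the credit to b is rolled back together with the rejected debit from a, so the database stays consistent.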


Part I. Relational Databases

Practice


What is an RDBMS?

A DataBase Management System (software) for the relational model

[Diagram: RDBMS components: SQL Query, Query Compiler, Query Optimizer, Runtime Query Evaluator, Concurrency Control, DB Backup, DB Recovery, Schema (Metadata), Stored Database (Instance).]


RDBMS Objects

• Tables: represent data as a collection of records
  - Record: a set of attributes (columns)

• Views: named queries

• Indices: improve search and access time

• Functions: extend the query language

Example table:

ObjectID   A     B
ID1        3.4   a
ID2        4.0   b
ID2        2.1   c
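The four kinds of objects can be created in one short session. This is a minimal sketch with Python's sqlite3 module; the table obj, the view bright, the index idx_obj_a, and the function mag_to_flux are all names invented here, and treating column a as a magnitude is an assumption made only for the example.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# Table: a collection of records, each a set of attributes (columns).
conn.execute("CREATE TABLE obj (object_id TEXT, a REAL, b TEXT)")
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)",
                 [("ID1", 3.4, "a"), ("ID2", 4.0, "b"), ("ID3", 2.1, "c")])

# View: a named query that can be used like a table.
conn.execute("CREATE VIEW bright AS SELECT object_id, a FROM obj WHERE a > 3")

# Index: an auxiliary structure that speeds up lookups on a column.
conn.execute("CREATE INDEX idx_obj_a ON obj (a)")

def mag_to_flux(mag):
    """Hypothetical helper: relative flux for a magnitude (zero point ignored)."""
    return 10 ** (-0.4 * mag)

# Function: extends the query language with user-defined logic.
conn.create_function("mag_to_flux", 1, mag_to_flux)

view_rows = conn.execute(
    "SELECT object_id FROM bright ORDER BY object_id").fetchall()
flux_rows = conn.execute(
    "SELECT object_id, mag_to_flux(a) FROM obj WHERE object_id = 'ID2'").fetchall()
print(view_rows, flux_rows)
```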


SQL

• Structured Query Language

• Actually it includes
  - Data Definition Language (Schema)
      create table myTable(number int, letter char)
  - Data Manipulation Language (Update)
      insert into myTable values(1, 'a')


Querying the DB

• Basic query structure
  - SELECT: definition of the output table (set of columns)
  - FROM: identification of the source tables
  - WHERE: optional condition (filter or join)

• Additional blocks
  - GROUP BY: group-defining criteria
  - HAVING: optional condition on aggregate values
  - ORDER BY: sorting criteria for the result
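A single query can use all six blocks at once. The sketch below runs one against a small in-memory SQLite table; the table is named specObj for familiarity, but its rows are made-up toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specObj (oid INTEGER PRIMARY KEY, class TEXT, subclass TEXT)")
conn.executemany("INSERT INTO specObj VALUES (?, ?, ?)", [
    (1, "GALAXY", "STARFORMING"), (2, "GALAXY", "STARFORMING"),
    (3, "GALAXY", "AGN"), (4, "STAR", "K5"), (5, "GALAXY", "STARBURST"),
    (6, "GALAXY", "STARBURST"), (7, "GALAXY", "STARBURST")])

query = """
    SELECT subclass, count(*) AS n      -- output columns
    FROM specObj                        -- source table
    WHERE class = 'GALAXY'              -- row filter, applied before grouping
    GROUP BY subclass                   -- one group per subclass
    HAVING count(*) >= 2                -- filter on aggregate values
    ORDER BY n DESC                     -- sort the result
"""
result = conn.execute(query).fetchall()
print(result)
```

WHERE removes the star before grouping, while HAVING removes the AGN group afterwards because it has only one member.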


Query Evaluation

Note that query results are relations, so queries can be composed.

[Diagram: Query Evaluation. A query (SELECT, FROM, WHERE) drives: Database → (data collection) → Intermediate Result → (sort, group, aggregate) → Result.]


Executing Queries

• Parametric

• SQL
  - System console
  - Applications and web interfaces

• From code
  - Parametric from the programmer's perspective
  - Languages + libraries
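One way to read "parametric from the programmer's perspective" is a function that wraps a fixed SQL text and takes the varying values as arguments. The sketch below assumes Python with the standard sqlite3 library; the helper objects_of_class and the toy specObj rows are illustrative only.

```python
import sqlite3

def objects_of_class(conn, cls, max_rows=10):
    """Return up to max_rows object ids of the given class.

    The ? placeholders let the library pass values separately from the SQL
    text, so the query is reusable and input is never spliced into it.
    """
    cur = conn.execute(
        "SELECT oid FROM specObj WHERE class = ? ORDER BY oid LIMIT ?",
        (cls, max_rows))
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specObj (oid INTEGER PRIMARY KEY, class TEXT, subclass TEXT)")
conn.executemany("INSERT INTO specObj VALUES (?, ?, ?)",
                 [(1, "GALAXY", "AGN"), (2, "STAR", "K5"), (3, "GALAXY", "STARBURST")])

galaxies = objects_of_class(conn, "GALAXY")
stars = objects_of_class(conn, "STAR")
print(galaxies, stars)
```

Other languages follow the same pattern with their own database libraries: connect, send SQL with parameters, iterate over the result rows.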


Query Examples

• Example database
  - Source: Sloan Digital Sky Survey (SDSS) DR10

• Schema
  - Tables
      photoObj(oid, ra, dec, g, r)
      specObj(oid, class, subclass)


Basic Query

SELECT * FROM photoObj;

SELECT oid, class
FROM specObj
WHERE class = 'GALAXY';


Complex Conditions

SELECT oid, ra, dec
FROM photoObj
WHERE g < 12
  and r < 12
  and g - r < 0;


Joins

SELECT p.oid, p.ra, p.dec, s.subclass
FROM photoObj as p, specObj as s
WHERE p.oid = s.oid
  and p.g < 12 and p.r < 12
  and p.g - p.r < 0
  and s.class = 'GALAXY';
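The join above puts the join condition in the WHERE clause. The same query can also be written with an explicit JOIN ... ON, which keeps the join condition apart from the filters. The sketch below checks on a tiny in-memory SQLite database that both forms return the same rows; the four objects are invented test data, not SDSS values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photoObj (oid INTEGER PRIMARY KEY, ra REAL, dec REAL, g REAL, r REAL)")
conn.execute("CREATE TABLE specObj (oid INTEGER PRIMARY KEY, class TEXT, subclass TEXT)")
conn.executemany("INSERT INTO photoObj VALUES (?, ?, ?, ?, ?)", [
    (1, 10.0, -5.0, 11.0, 11.5),   # bright and g - r < 0
    (2, 20.0, 15.0, 11.8, 11.2),   # bright but g - r > 0
    (3, 30.0, 25.0, 10.5, 10.9),   # bright and g - r < 0, but a star
    (4, 40.0, 35.0, 15.0, 15.5)])  # too faint
conn.executemany("INSERT INTO specObj VALUES (?, ?, ?)", [
    (1, "GALAXY", "STARBURST"), (2, "GALAXY", "AGN"),
    (3, "STAR", "A0"), (4, "GALAXY", "STARFORMING")])

# Implicit join: tables listed in FROM, join condition in WHERE.
implicit = conn.execute("""
    SELECT p.oid, p.ra, p.dec, s.subclass
    FROM photoObj AS p, specObj AS s
    WHERE p.oid = s.oid
      AND p.g < 12 AND p.r < 12 AND p.g - p.r < 0
      AND s.class = 'GALAXY'""").fetchall()

# Explicit join: join condition in ON, filters in WHERE.
explicit = conn.execute("""
    SELECT p.oid, p.ra, p.dec, s.subclass
    FROM photoObj AS p
    JOIN specObj AS s ON p.oid = s.oid
    WHERE p.g < 12 AND p.r < 12 AND p.g - p.r < 0
      AND s.class = 'GALAXY'""").fetchall()
print(implicit, explicit)
```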


Groups and Aggregates

SELECT s.subclass, count(*)
FROM photoObj as p, specObj as s
WHERE p.oid = s.oid
  and p.g < 12 and p.r < 12
  and p.g - p.r < 0
  and s.class = 'GALAXY'
GROUP BY s.subclass;


Sub-Queries

SELECT oid, ra, dec
FROM photoObj
WHERE g < 12 and r < 12
  and g - r < 0
  and oid in (SELECT oid
              FROM specObj
              WHERE class = 'GALAXY');


Query Complexity

• Data volume
  - I/O-based cost model: number of reads from and writes to persistent storage

• Query complexity
  - Table size: n; number of tables: k
  - Search: O(1) to O(log n) to O(n)
  - Joins: O(n) to O(n^k)
  - Sort, group, and aggregates: O(n log n) (in the size of the intermediate result)
  - Subqueries: hard for the optimizer
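The gap between O(n) and O(log n) search is easy to see by counting comparisons. The sketch below is plain Python rather than a DBMS: a list of sorted ids stands in for an indexed key column, a linear scan plays the role of a search without an index, and a binary search plays the role of an ordered index lookup. Both functions are written here for illustration only.

```python
def linear_search(values, target):
    """Scan every element in order: O(n) comparisons in the worst case."""
    steps = 0
    for i, v in enumerate(values):
        steps += 1
        if v == target:
            return i, steps
    return -1, steps

def binary_search(sorted_values, target):
    """Halve the search range each step: O(log n) comparisons."""
    lo, hi, steps = 0, len(sorted_values) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

n = 100_000
ids = list(range(n))   # stands in for a sorted (indexed) key column
pos_lin, steps_lin = linear_search(ids, n - 1)
pos_bin, steps_bin = binary_search(ids, n - 1)
print(steps_lin, steps_bin)
```

The linear scan inspects all n entries to reach the last id, while the binary search needs at most about log2(n) steps. In a real system each step may cost a disk read, which is why the I/O-based cost model matters.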


Part II. NoSQL

Not only SQL


RDBMS Comfort Zone


An RDBMS performs better when …

• Data is complete, homogeneous and well defined.

• All data is together (in the same computer).

• Answers must be complete and fully consistent.

• Vertical scaling is possible.


NoSQL Comfort Zone


• Data is massive, heterogeneous, and distributed.

• Partial and eventually consistent answers are acceptable.

• Data must be always available.

• Horizontal scaling is preferred (or vertical scaling is not practical).


NoSQL Databases

• Aggregate
  - Key: identifies each aggregate
  - Data: heterogeneous collections of attributes as name/value pairs

• Main types
  - Key-Value Stores: fast to retrieve data with unknown structure
  - Document Databases: (mostly) tree-structured data
  - Column-Family Stores: complex structured data
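The three types differ mainly in how much of the aggregate's structure the store can see. The sketch below mimics each with plain Python dictionaries; it is a toy model with made-up keys and attributes, not the API of any particular NoSQL product.

```python
import json

# Key-value store: the store only sees an opaque value behind each key.
kv_store = {}
kv_store["obj:1"] = json.dumps({"ra": 10.0, "dec": -5.0})
kv_store["obj:2"] = json.dumps({"class": "GALAXY", "spectra": [{"z": 0.03}]})

# Document database: the aggregate is a (mostly) tree-structured document
# whose fields can be inspected; documents need not share the same attributes.
documents = {key: json.loads(value) for key, value in kv_store.items()}
galaxy_keys = [k for k, doc in documents.items() if doc.get("class") == "GALAXY"]
first_redshift = documents["obj:2"]["spectra"][0]["z"]

# Column-family store: each row key maps to named families of columns.
column_families = {
    "obj:1": {"position": {"ra": 10.0, "dec": -5.0}, "photometry": {"g": 11.0}},
    "obj:2": {"position": {"ra": 20.0, "dec": 15.0}},
}
ra_column = {key: row["position"]["ra"] for key, row in column_families.items()}
print(galaxy_keys, first_redshift, ra_column)
```

In all three cases the key identifies one aggregate, and the two aggregates carry different sets of attributes, which a fixed relational schema would not allow without null values.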


CAP Theorem

• Consistency

• Availability

• Partition Tolerance

Choose two!


Query Evaluation

• Map-Reduce: a parallel (cluster) data-processing pattern.

• Two steps
  - Map
      Input is an aggregate; output is a set of key-value pairs.
      Each map is independent (across aggregates in the whole cluster).
  - Reduce
      Map results are collected, sorted, and combined.
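The two steps can be imitated on a single machine. The sketch below counts galaxies per subclass, the same question as the earlier GROUP BY query, using a map function and a reduce function in plain Python. The names map_fn, reduce_fn, and map_reduce and the sample aggregates are hypothetical; a real framework would run the map calls in parallel on the nodes that hold the data.

```python
from collections import defaultdict

# Each aggregate is one object record; in a cluster these would live on different nodes.
aggregates = [
    {"oid": 1, "class": "GALAXY", "subclass": "STARBURST"},
    {"oid": 2, "class": "GALAXY", "subclass": "AGN"},
    {"oid": 3, "class": "STAR", "subclass": "A0"},
    {"oid": 4, "class": "GALAXY", "subclass": "STARBURST"},
]

def map_fn(aggregate):
    """Map: one aggregate in, a list of key-value pairs out."""
    if aggregate["class"] == "GALAXY":
        return [(aggregate["subclass"], 1)]
    return []

def reduce_fn(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)

def map_reduce(records):
    grouped = defaultdict(list)
    for record in records:                 # each call is independent of the others
        for key, value in map_fn(record):
            grouped[key].append(value)     # collect map output by key
    # Sort by key, then combine the values of each key.
    return [reduce_fn(key, grouped[key]) for key in sorted(grouped)]

counts = map_reduce(aggregates)
print(counts)
```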


Map-Reduce: Map

[Figure: the map step. Source: Fowler and Sadalage, NoSQL Distilled.]


Map-Reduce: Reduce

[Figure: the reduce step. Source: Fowler and Sadalage, NoSQL Distilled.]


Summary

• RDBMS
  - Tables: collections of records with keys
  - SQL queries: basic, join, groups and aggregates, subqueries

• NoSQL
  - Aggregates: collections of key-value pairs with one identifier
  - Map-Reduce