Introduction to SQL and ADQL

Tom McGlynn, NASA/GSFC

(with thanks to Maria Nieto-Santisteban and Gretchen Greene)

What are SQL and ADQL?

• SQL (sometimes pronounced ‘sequel’) is the Structured Query Language, a standard for creating and getting information from relational databases.

• ADQL (Astronomical Data Query Language) is an adaptation of SQL to meet specific needs for astronomical queries, especially positional queries.

• These enable astronomers to make sophisticated queries of astronomical databases.

Lots of places to find info

• NVO Book chapters on SQL
• Web site with lots of links about SQL:
  – http://www.thefreecountry.com/documentation/onlinesql.shtml
• On-line tutorials
  – http://nvo-twiki.stsci.edu/twiki/pub/Main/NVOSS3CourseNotes/SQL2006.html
  – http://www.w3schools.com/sql/default.asp
  – http://www.sql-tutorial.net/
  – http://www.firstsql.com/tutor.htm
  – GIYF
• ADQL standard
  – http://www.ivoa.net/Documents/latest/ADQL.html
• SDSS Online database
  – http://casjobs.sdss.org/CasJobs/
• OpenSkyQuery
  – http://openskyquery.net/Sky/skysite/

Background: The kinds of databases

• Network databases: pointers
  – Internal data in programs
• Hierarchical databases: structure
  – XML files
• Relational databases: common indices
  – Relational database management systems

Network database

[Diagram: a network of records connected by pointers. Node labels: Dora, Supervisor, Jamie, CMS, Joe, Salary: 75,000, Admin, Roles, Personnel, Team.]

The World Wide Web is the most successful of all network databases.

Hierarchical Database

<Company>
  <Team name="CMS">
    <TeamLeader>
      <Person name="Dora"> <Salary>75,000</Salary> </Person>
    </TeamLeader>
    <TeamMembers>
      <Person name="Joe"><Salary>…
      <Person name="Jamie"><Salary>…
    </TeamMembers>
  </Team>
  ….
</Company>

With XML, hierarchical databases are making a comeback. (e.g., Carnivore registry)

Relational Database

Personnel

UID   Name    Salary
1     Dora    75,000
2     Joe     30,000
3     Jamie   66,000

Teams

TeamID   Name   LeadUID
1        CMS    1
2        Test   13

TeamMembers

TeamID   UID
1        2
1        3
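The three tables above can be tried out directly. This is a minimal sketch using Python's built-in sqlite3 module (an assumption made so the example runs anywhere; the slides themselves use CasJobs and MySQL). It loads the rows shown and follows the shared indices (TeamID, UID, LeadUID) to recover who is on the CMS team and who leads it.

```python
import sqlite3

# In-memory database holding the three tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel (UID INTEGER, Name TEXT, Salary INTEGER);
CREATE TABLE Teams (TeamID INTEGER, Name TEXT, LeadUID INTEGER);
CREATE TABLE TeamMembers (TeamID INTEGER, UID INTEGER);
INSERT INTO Personnel VALUES (1, 'Dora', 75000), (2, 'Joe', 30000), (3, 'Jamie', 66000);
INSERT INTO Teams VALUES (1, 'CMS', 1), (2, 'Test', 13);
INSERT INTO TeamMembers VALUES (1, 2), (1, 3);
""")

# Follow the common indices: team -> membership rows -> people.
members = conn.execute("""
    SELECT t.Name, p.Name
    FROM Teams t, TeamMembers m, Personnel p
    WHERE t.TeamID = m.TeamID AND m.UID = p.UID
    ORDER BY p.Name
""").fetchall()

# The team leader is found the same way, through LeadUID.
lead = conn.execute("""
    SELECT p.Name FROM Teams t, Personnel p
    WHERE t.LeadUID = p.UID AND t.Name = 'CMS'
""").fetchone()

print(members, lead)
```

Nothing in the tables points at anything else; the relationships exist only because the same UID and TeamID values appear in more than one table.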

What about astronomy?

• Relational DBMSs are used by all major astronomical data providers:
  – ADS, MAST, IPAC, SDSS, …
  – Only an RDBMS can scale to the size of modern astronomical tables (100s of columns, 10^9 rows)
  – Flexible interactions between tables
• Standard SQL provides limited support for positional queries
• Some RDBMSs have support for objects

RDBMS servers

• MySQL: free, widely used, fast
• Postgres: free, widely used, better standards compliance, object support
• Sybase (commercial)
• SQLServer (Microsoft)
• Gazillions more
• MySQL installation at NOAO available for use in summer school projects, or just download a copy to run on your machine (http://dev.mysql.com/downloads/)

Web availability

• CASJOBS
  – SQL based, SDSS database. Allows users to generate their own tables.
• OpenSkyQuery
  – ADQL based, lots of missions but more fragile.

Using CASJOBS

1. Connect to the CASJOBS web site: http://casjobs.sdss.org/CasJobs/
2. Get an account or log in
3. Build and query data

This is a production service and does not always respond in ‘webtime’.

Basic SQL commands

CREATE TABLE tablename (col1 type1, col2 type2, …)

DROP TABLE tablename;

INSERT INTO tablename (col1,col2,…) VALUES(val1,val2,…)

DELETE FROM tablename WHERE condition

UPDATE tablename SET col1=val1,col2=val2,… WHERE condition

SELECT fields FROM tables WHERE conditions ORDER BY col1,col2
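A minimal sketch that runs all six commands once, using Python's built-in sqlite3 module as a stand-in database (an assumption for illustration). The stars table, its columns and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: name the table, then each column and its type.
cur.execute("CREATE TABLE stars (starid INTEGER, name VARCHAR(20), mag REAL)")

# INSERT: one row per statement.
cur.execute("INSERT INTO stars (starid, name, mag) VALUES (1, 'alpha', 4.2)")
cur.execute("INSERT INTO stars (starid, name, mag) VALUES (2, 'beta', 7.9)")
cur.execute("INSERT INTO stars (starid, name, mag) VALUES (3, 'gamma', 5.5)")

# UPDATE and DELETE act only on the rows matching the WHERE condition.
cur.execute("UPDATE stars SET mag = 5.0 WHERE starid = 3")
cur.execute("DELETE FROM stars WHERE mag > 6")

# SELECT: read back what is left, brightest first.
rows = cur.execute("SELECT starid, name, mag FROM stars ORDER BY mag").fetchall()

# DROP TABLE removes the table itself, not just its rows.
cur.execute("DROP TABLE stars")
tables_left = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(rows, tables_left)
```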

Create an account in CASJOBS

Select MYDB to access your private database

Then use CREATE TABLE command

Table name

Column names

Column types

MySQL doesn’t like the column name dec (it is a reserved word there). It must be escaped as `dec`

Types

• Numeric types
  – int, bigint, smallint
  – real, float
  – As in C, the size of types is not well standardized
  – Typically lots of aliases for various sizes of integers and floating point numbers
• Character types: char(n), varchar(n), text
  – Use varchar for long, variable length strings
  – Use text for very long strings that you won’t need to compare with others (e.g., file content)
• Business oriented types (money, dates, decimal values)
  – Some are occasionally useful
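A minimal sketch of a CREATE TABLE statement using several of these types, run through Python's built-in sqlite3 module (an assumption for illustration). The stars table and its columns are hypothetical. SQLite accepts these type names but enforces them only loosely, which is one more example of types not being well standardized. The dec column is written with double quotes, the standard SQL way to escape a name; MySQL uses backticks instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One column per kind of type from the list above.
conn.execute("""
    CREATE TABLE stars (
        starid INT,
        name   VARCHAR(20),
        ra     FLOAT,
        "dec"  FLOAT,
        notes  TEXT
    )
""")

# Ask the database what it recorded: (column name, declared type).
cols = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(stars)")]
print(cols)
```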

Add data to small table: The INSERT command

Use single quotes for strings.

Values in same order as in create statement.

RA,Dec normally stored as decimal degrees
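The three notes above can be sketched as follows, using Python's built-in sqlite3 module (an assumption for illustration). The stars table is hypothetical and the coordinates are approximate values included only to show the format. The helper hms_to_deg is a name introduced here, not something from the slides.

```python
import sqlite3

def hms_to_deg(hours, minutes, seconds):
    """Convert a right ascension in h, m, s to decimal degrees (15 degrees per hour)."""
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE stars (starid INT, name VARCHAR(20), ra FLOAT, "dec" FLOAT)')

# Values are listed in the same order as the columns in CREATE TABLE.
rigel_ra = hms_to_deg(5, 14, 32.27)
conn.execute("INSERT INTO stars VALUES (1, 'Rigel', ?, ?)", (rigel_ra, -8.202))

# Strings use single quotes; a quote inside a string is doubled.
conn.execute("INSERT INTO stars VALUES (2, 'Barnard''s Star', 269.452, 4.693)")

names = [row[0] for row in conn.execute("SELECT name FROM stars ORDER BY starid")]
print(names, round(rigel_ra, 4))
```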

How do I get rid of rows in a table?

DELETE FROM table WHERE conditions
DELETE FROM stars WHERE ra is null

To delete all rows (but not delete the table entirely)

DELETE FROM table
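A minimal sketch of both forms of DELETE, using Python's built-in sqlite3 module (an assumption for illustration) and a hypothetical stars table in which two rows have no RA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE stars (starid INT, name VARCHAR(20), ra FLOAT, "dec" FLOAT)')
conn.executemany(
    "INSERT INTO stars VALUES (?, ?, ?, ?)",
    [(1, "s1", 10.0, 5.0), (2, "s2", None, 7.0), (3, "s3", None, None), (4, "s4", 30.0, -2.0)],
)

# Delete only the rows that match the condition.
removed = conn.execute("DELETE FROM stars WHERE ra IS NULL").rowcount
remaining = conn.execute("SELECT starid FROM stars ORDER BY starid").fetchall()

# With no WHERE clause every row goes, but the (empty) table is still there.
conn.execute("DELETE FROM stars")
count_after = conn.execute("SELECT COUNT(*) FROM stars").fetchone()[0]
still_exists = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(removed, remaining, count_after, still_exists)
```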

How do I modify a table?

• Update values in existing rows:
  – UPDATE table SET field=value,field=value,… WHERE condition
    UPDATE stars SET ra=1.14983, dec=-31.243 WHERE starid=49
• Adding columns
  – Most DBs support ALTER TABLE … ADD COLUMN …. Where that is unavailable, create a new table with the new columns and copy the old values.
• Copying tables (differs a bit from DBMS to DBMS)
  – SELECT * INTO table2 FROM table1 (creates table2, e.g. in SQL Server)
    or
  – Create the second table, then INSERT INTO table2 SELECT * FROM table1
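A minimal sketch of the three modifications, using Python's built-in sqlite3 module (an assumption for illustration; SQLite supports ALTER TABLE … ADD COLUMN and the INSERT … SELECT form of copying). The tables stars and stars2 and the mag column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stars (starid INT, ra FLOAT, "dec" FLOAT);
INSERT INTO stars VALUES (49, 0.0, 0.0), (50, 10.5, 20.25);

-- Update values in an existing row.
UPDATE stars SET ra = 1.14983, "dec" = -31.243 WHERE starid = 49;

-- Add a column; existing rows get NULL in it.
ALTER TABLE stars ADD COLUMN mag FLOAT;

-- Copy: create the second table, then fill it from the first.
CREATE TABLE stars2 (starid INT, ra FLOAT, "dec" FLOAT, mag FLOAT);
INSERT INTO stars2 SELECT * FROM stars;
""")

copied = conn.execute("SELECT * FROM stars2 ORDER BY starid").fetchall()
print(copied)
```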

Null values: no name for these stars

Specify parameters to fill

Include an explicit null in the list
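Both ways of leaving the name empty can be sketched as below, using Python's built-in sqlite3 module (an assumption for illustration) and a hypothetical stars table. The last query shows why nulls need care: a comparison with = never matches a null, so IS NULL must be used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE stars (starid INT, name VARCHAR(20), ra FLOAT, "dec" FLOAT)')

# 1. Specify the columns to fill; the omitted name column becomes NULL.
conn.execute('INSERT INTO stars (starid, ra, "dec") VALUES (1, 10.0, -5.0)')

# 2. Give every column a value and include an explicit NULL.
conn.execute("INSERT INTO stars VALUES (2, NULL, 20.0, 30.0)")

nameless = conn.execute(
    "SELECT starid FROM stars WHERE name IS NULL ORDER BY starid"
).fetchall()

# Comparing with = NULL is never true, so this finds nothing.
wrong_test = conn.execute("SELECT starid FROM stars WHERE name = NULL").fetchall()

print(nameless, wrong_test)
```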

What’s in the table?

Status of my tables

Let’s try to query the table we created: use the SELECT command

But it fails! Try again…

This query works. CasJobs doesn’t like null values. The lesson is that one needs to be chary of nulls (and that CasJobs is not a full featured DB).

CasJobs put the results in another table for us. We click on MYDB in the top bar, then the created table name, then on Sample to see the results.

The SELECT statement

The SELECT statement is used to query an existing table or set of tables

A typical query is:
  SELECT field1,field2,…
  FROM table1, table2, …
  WHERE condition1 AND/OR condition2 AND/OR …
  ORDER BY sortfield1,sortfield2,…

The list of fields to be returned can usually be specified as ‘*’ to get all the fields in the table.

There are also GROUP BY and HAVING clauses for advanced queries.

The FROM clause

• The table, or list of tables, to be queried.
  – … FROM mystars …
• There may be a ‘database’ specified, a collection of related tables
  – … FROM mydb.mystars …
• Each table in the list may have an alias that can be used to identify that table elsewhere
  – … FROM mystars m ….

Fields (all other clauses)

Can be:
• Simple name of column in table: mag
• Expression: ra-dec
• A constant: 3.14159, or 'pi'
• You can name the result if you like using AS, e.g.
    select ra AS myra, ra-dec AS diff from mystars
  will return columns named myra and diff
• When there are multiple tables, table columns can be distinguished by table names and aliases.
  – Select t1.ra-t2.ra from mystars t1, yourstars t2 where t1.id=t2.id
  – Select mystars.ra-yourstars.ra from mystars,yourstars where mystars.id=yourstars.id
• One difference between SQL and ADQL is that ADQL requires all column names to specify their origin table, not just where there is an ambiguity.
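A minimal sketch of AS and table aliases, using Python's built-in sqlite3 module (an assumption for illustration). The contents of mystars and yourstars are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mystars (id INT, ra FLOAT, "dec" FLOAT);
CREATE TABLE yourstars (id INT, ra FLOAT, "dec" FLOAT);
INSERT INTO mystars VALUES (1, 10.5, 2.5), (2, 20.0, -5.0);
INSERT INTO yourstars VALUES (1, 10.25, 2.5), (2, 19.5, -5.0);
""")

# AS names the output columns, including the computed one.
cur = conn.execute('SELECT ra AS myra, ra - "dec" AS diff FROM mystars')
column_names = [d[0] for d in cur.description]

# Aliases t1 and t2 say which table each ra comes from.
ra_offsets = conn.execute("""
    SELECT t1.id, t1.ra - t2.ra
    FROM mystars t1, yourstars t2
    WHERE t1.id = t2.id
    ORDER BY t1.id
""").fetchall()

print(column_names, ra_offsets)
```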

Conditions

• Comparisons:
  – mag < 6
  – mag between 4 and 6
  – name='star1'
  – ra < 3*dec
  – sin(ra)*cos(dec)-cos(ra)*sin(dec) > .234
  – name like 'star%'
    • '%' is the wild card (any run of characters). '_' matches one character.
• Lists
  – name in ('star1', 'star2', 'star3')
• Null tests:
  – name is null
• Negation
  – mag not between 4 and 6
  – name is not null
• Compound conditions
  – (mag > 6 and mag < 10 and mag > 2*flux)
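A minimal sketch that tries several of these conditions against a small hypothetical stars table, using Python's built-in sqlite3 module (an assumption for illustration). The helper names_where is a name introduced here for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (name VARCHAR(20), mag FLOAT, flux FLOAT)")
conn.executemany(
    "INSERT INTO stars VALUES (?, ?, ?)",
    [("star1", 3.0, 1.0), ("star2", 5.0, 4.0), ("star3", 8.0, 2.0),
     ("nova9", 9.5, 1.0), (None, 12.0, 0.5)],
)

def names_where(condition):
    """Return the names of the rows satisfying a WHERE condition, sorted by magnitude."""
    sql = f"SELECT name FROM stars WHERE {condition} ORDER BY mag"
    return [row[0] for row in conn.execute(sql)]

bright = names_where("mag < 6")
mid_range = names_where("mag BETWEEN 4 AND 6")
star_like = names_where("name LIKE 'star%'")
listed = names_where("name IN ('star1', 'star3')")
unnamed_count = len(names_where("name IS NULL"))
outside = names_where("mag NOT BETWEEN 4 AND 6")
compound = names_where("(mag > 6 AND mag < 10 AND mag > 2*flux)")

print(bright, mid_range, star_like, listed, unnamed_count, outside, compound)
```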

Querying multiple tables: Joins

Joins ‘join’ the rows in two or more tables
– One to one relationship
  • The two tables are logically one bigger table
  • A salary table and address table with one row for each employee.
– Many to one
  • Components
  • Checks to Checking account
    – Each check belongs to exactly one account but each account can have many checks
– Many to Many
  • Observations/Sources
    – A single observation may include many sources
    – A single source may be seen in many observations

How does a query work…

Table1
a11,a12,a13,…
a21,a22,a23,…
a31,a32,a33,…
…

Table2
b11,b12,b13,…
b21,b22,b23,…
b31,b32,b33,…
…

Naïve view: First, looking at the FROM clause, the database creates the product of all the rows of all the participating tables.

If there are n rows in table 1 and m in table 2, then there are n×m rows in the join. For many-way joins this can get very large very quickly!

Database optimization is all about trying to reduce the number of rows that have to be looked at.

Product of Table1 and Table2
a11,a12,a13,…,b11,b12,b13,…
a11,a12,a13,…,b21,b22,b23,…
a11,a12,a13,…,b31,b32,b33,…
a21,a22,a23,…,b11,b12,b13,…
a21,a22,a23,…,b21,b22,b23,…
a21,a22,a23,…,b31,b32,b33,…
a31,a32,a33,…,b11,b12,b13,…
a31,a32,a33,…,b21,b22,b23,…
a31,a32,a33,…,b31,b32,b33,…
…
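The row counts in the naïve view are easy to check. This minimal sketch uses Python's built-in sqlite3 module (an assumption for illustration) with two three-row tables named as on the slide, plus a hypothetical four-row table3 to show how a three-way join multiplies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (a1 TEXT, a2 TEXT, a3 TEXT);
CREATE TABLE table2 (b1 TEXT, b2 TEXT, b3 TEXT);
CREATE TABLE table3 (c1 TEXT);
INSERT INTO table1 VALUES ('a11','a12','a13'), ('a21','a22','a23'), ('a31','a32','a33');
INSERT INTO table2 VALUES ('b11','b12','b13'), ('b21','b22','b23'), ('b31','b32','b33');
INSERT INTO table3 VALUES ('c1'), ('c2'), ('c3'), ('c4');
""")

# A FROM clause with no WHERE gives every pairing of rows.
product = conn.execute("SELECT * FROM table1, table2 ORDER BY a1, b1").fetchall()

# Adding a third table multiplies the count again: 3 x 3 x 4.
three_way = conn.execute("SELECT COUNT(*) FROM table1, table2, table3").fetchone()[0]

print(len(product), product[0], three_way)
```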

WHERE clause: Filtering the intermediate table

The WHERE clause filters the rows in the product table.

If a condition involves only one table it can be applied to the inputs before cross-product.

If a condition applies to both tables, they can sometimes be organized (indexed) such that only a few rows in one table need to be checked for each row in the other.
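A minimal sketch of the filtering step, using Python's built-in sqlite3 module (an assumption for illustration). Tables a and b, their columns, and the index name b_id are hypothetical. The index changes how the database can find matching rows, not which rows it returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INT, x INT)")
conn.execute("CREATE TABLE b (id INT, y INT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(i, i) for i in range(100)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(2 * i, i) for i in range(50)])

def count(where=""):
    """Count the rows of the a-b product that survive an optional WHERE clause."""
    return conn.execute(f"SELECT COUNT(*) FROM a, b {where}").fetchone()[0]

product = count()                                    # 100 x 50 pairings
matched = count("WHERE a.id = b.id")                 # condition on both tables
filtered = count("WHERE a.id = b.id AND a.x < 10")   # plus a one-table condition

# An index on the join column lets the database look up matches in b
# for each row of a instead of scanning all of b.
conn.execute("CREATE INDEX b_id ON b (id)")
matched_indexed = count("WHERE a.id = b.id")

print(product, matched, filtered, matched_indexed)
```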

Specifying the output

• The selection list tells the DBMS which columns to include (or generate) on the output

• The sort tells the DBMS the order.

Astronomy and RDBMS’s

• RDBMSs expect exact joins on keys (user IDs, SSNs, product numbers), which matches business use.
  – Sometimes correct for astronomical tables, e.g., targets in a proposal.
• Many astronomical joins are soft: nearby in position and/or time, looking for counterparts.
  – 1-D is not a problem, but 2-D soft joins are hard.
  – GIS systems provide some non-standard features
  – ADQL extensions
  – Indexing of tables and planning of queries requires lots of work to do efficiently.
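The difference between an exact join and a soft join can be sketched on two tiny hypothetical catalogs, using Python's built-in sqlite3 module (an assumption for illustration). The soft join here is a simple box of ±0.001 degrees in each coordinate; it ignores the cos(dec) factor in RA and does nothing clever about indexing, so it only shows the idea, not how to do it efficiently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cat1 (id INT, ra FLOAT, "dec" FLOAT);
CREATE TABLE cat2 (label TEXT, ra FLOAT, "dec" FLOAT);
INSERT INTO cat1 VALUES (1, 10.0, 20.0), (2, 50.0, -30.0), (3, 120.0, 5.0);
INSERT INTO cat2 VALUES ('a', 10.0004, 20.0003), ('b', 50.002, -30.0), ('c', 200.0, 60.0);
""")

# Exact join on position: measured coordinates almost never agree exactly.
exact = conn.execute("""
    SELECT c1.id, c2.label FROM cat1 c1, cat2 c2
    WHERE c1.ra = c2.ra AND c1."dec" = c2."dec"
""").fetchall()

# Soft join: accept a counterpart anywhere inside a small box.
soft = conn.execute("""
    SELECT c1.id, c2.label FROM cat1 c1, cat2 c2
    WHERE c2.ra    BETWEEN c1.ra    - 0.001 AND c1.ra    + 0.001
      AND c2."dec" BETWEEN c1."dec" - 0.001 AND c1."dec" + 0.001
""").fetchall()

print(exact, soft)
```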

Create sample tables

create table pets (owner varchar(12), pet varchar(12), species varchar(12))

insert into pets values('fred', 'fido', 'dog')
insert into pets values('joe', 'ruff', 'dog')
insert into pets values('mary', 'ears', 'dog')
insert into pets values('andie', 'max', 'dog')
insert into pets values('joe', 'silvertip', 'cat')
insert into pets values('marlene', 'buster', 'cat')
insert into pets values('bill', 'meow', 'cat')
insert into pets values('simone', 'furball', 'cat')
insert into pets values('mary', 'silver', 'horse')
insert into pets values('nell', 'fairmane', 'horse')
insert into pets values('effie', 'belle', 'horse')
insert into pets values('nell', 'slither', 'salamander')
insert into pets values('nell', 'hairy', 'tarantula')

create table vets (vet varchar(12), species varchar(12))

insert into vets values('merriwether', 'dog')
insert into vets values('parell', 'dog')
insert into vets values('parell', 'cat')
insert into vets values('parell', 'horse')
insert into vets values('nestor', 'salamander')

Sample Queries:

Which vets should each owner know?
  select p.owner, v.vet from pets p, vets v
  where p.species = v.species
  order by p.owner

Are any pets not covered by a vet?
  select pet from pets
  where species not in (select species from vets)

How many pets can each vet treat?
  select v.vet, count(p.pet) as pcnt from pets p, vets v
  where v.species = p.species
  group by v.vet
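The three sample queries can be run as written against the pets and vets tables. This minimal sketch uses Python's built-in sqlite3 module as the database (an assumption for illustration); the data and the queries are the ones from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table pets (owner varchar(12), pet varchar(12), species varchar(12))")
conn.execute("create table vets (vet varchar(12), species varchar(12))")
conn.executemany("insert into pets values (?, ?, ?)", [
    ("fred", "fido", "dog"), ("joe", "ruff", "dog"), ("mary", "ears", "dog"),
    ("andie", "max", "dog"), ("joe", "silvertip", "cat"), ("marlene", "buster", "cat"),
    ("bill", "meow", "cat"), ("simone", "furball", "cat"), ("mary", "silver", "horse"),
    ("nell", "fairmane", "horse"), ("effie", "belle", "horse"),
    ("nell", "slither", "salamander"), ("nell", "hairy", "tarantula"),
])
conn.executemany("insert into vets values (?, ?)", [
    ("merriwether", "dog"), ("parell", "dog"), ("parell", "cat"),
    ("parell", "horse"), ("nestor", "salamander"),
])

# Which vets should each owner know?
owner_vets = conn.execute("""
    select p.owner, v.vet from pets p, vets v
    where p.species = v.species
    order by p.owner
""").fetchall()

# Are any pets not covered by a vet?
uncovered = conn.execute("""
    select pet from pets
    where species not in (select species from vets)
""").fetchall()

# How many pets can each vet treat?
pets_per_vet = dict(conn.execute("""
    select v.vet, count(p.pet) as pcnt from pets p, vets v
    where v.species = p.species
    group by v.vet
"""))

print(len(owner_vets), uncovered, pets_per_vet)
```

The first query returns one row per matching pet and vet pair, so an owner with several pets can see the same vet more than once; adding DISTINCT after select would collapse the repeats.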

Upload SPOCS data

Query for Sun-like stars

SELECT s.spocs, s.Name, s.Teff, s.Log_g, s.M_o_H
FROM spocs s
WHERE s.Teff BETWEEN 5720 AND 5820
  AND s.Log_g BETWEEN 4.34 AND 4.54
  AND s.M_o_H BETWEEN -0.1 AND +0.1
ORDER BY s.Teff
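To see what this query keeps and rejects, here is a minimal sketch that runs it on five made-up rows, using Python's built-in sqlite3 module (an assumption for illustration). The rows are hypothetical and are not taken from the SPOCS catalog; only the column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spocs (spocs INT, Name TEXT, Teff FLOAT, Log_g FLOAT, M_o_H FLOAT)")
conn.executemany("INSERT INTO spocs VALUES (?, ?, ?, ?, ?)", [
    (1, "sunlike A", 5770, 4.44, 0.0),    # passes all three cuts
    (2, "hot star", 6200, 4.30, 0.1),     # too hot
    (3, "metal poor", 5750, 4.40, -0.5),  # metallicity too low
    (4, "giant", 5800, 3.00, 0.0),        # surface gravity too low
    (5, "sunlike B", 5730, 4.50, 0.05),   # passes all three cuts
])

sunlike = conn.execute("""
    SELECT s.spocs, s.Name, s.Teff, s.Log_g, s.M_o_H
    FROM spocs s
    WHERE s.Teff BETWEEN 5720 AND 5820
      AND s.Log_g BETWEEN 4.34 AND 4.54
      AND s.M_o_H BETWEEN -0.1 AND +0.1
    ORDER BY s.Teff
""").fetchall()

print([row[1] for row in sunlike])
```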

Correlate with SDSS: Can it be done?

SELECT p.objID, p.field, p.ra, p.dec
FROM dr2..PhotoObj p, mydb..spocs s
WHERE s._dej2000 between p.dec-.001 and p.dec+.001
  and s._raj2000 between p.ra-.001 and p.ra+.001

Making this possible is what ADQL is really about!

~7 ½ minutes to run this little query.

Other SQL topics

• TOP/LIMIT/SET ROWCOUNT
  – Different ways to limit the rows output
• Functions and procedures
• Indices, clusters, and ensuring efficient access
• How do we maintain a dynamic database?
  – Referential integrity, triggers, transactions.
  – Less critical to typical, relatively static astronomical databases.
• Groups and group functions
  – AVG, MIN, MAX, COUNT, SUM
  – GROUP BY and HAVING clauses
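A minimal sketch of the group functions and of limiting output, using Python's built-in sqlite3 module (an assumption for illustration) and a hypothetical obs table of repeated magnitude measurements. SQLite uses LIMIT; other systems use TOP or SET ROWCOUNT for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (source TEXT, mag FLOAT)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", [
    ("A", 10.0), ("A", 12.0), ("B", 15.0), ("C", 9.0), ("C", 11.0), ("C", 13.0),
])

# GROUP BY makes one output row per source; HAVING filters whole groups.
repeated = conn.execute("""
    SELECT source, COUNT(*) AS n, AVG(mag) AS mean_mag, MIN(mag), MAX(mag)
    FROM obs
    GROUP BY source
    HAVING COUNT(*) >= 2
    ORDER BY source
""").fetchall()

# LIMIT keeps only the first row of the sorted output.
brightest = conn.execute(
    "SELECT source, mag FROM obs ORDER BY mag LIMIT 1"
).fetchone()

print(repeated, brightest)
```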

What to do if DB needed in project

• Databases and accounts set up and available for use at NOAO.

• Lots of free databases available. If you want help setting one up just ask….