Introduction to SQL and ADQL
Tom McGlynn, NASA/GSFC
(with thanks to Maria Nieto-Santisteban and Gretchen Greene)
What are SQL and ADQL?
• SQL (sometimes pronounced ‘sequel’) is the Structured Query Language, a standard for creating relational databases and getting information from them.
• ADQL (Astronomy Data Query Language) is an adaptation of SQL to meet specific needs for astronomical queries, especially positional queries.
• These enable astronomers to make sophisticated queries of astronomical databases.
Lots of places to find info
• NVO Book chapters on SQL
• Web site with lots of links about SQL:
– http://www.thefreecountry.com/documentation/onlinesql.shtml
• On-line tutorials
– http://nvo-twiki.stsci.edu/twiki/pub/Main/NVOSS3CourseNotes/SQL2006.html
– http://www.w3schools.com/sql/default.asp
– http://www.sql-tutorial.net/
– http://www.firstsql.com/tutor.htm
– GIYF
• ADQL standard
– http://www.ivoa.net/Documents/latest/ADQL.html
• SDSS Online database
– http://casjobs.sdss.org/CasJobs/
• OpenSkyQuery
– http://openskyquery.net/Sky/skysite/
Background: The kinds of databases
• Network databases: pointers
– Internal data in programs
• Hierarchical database: structure
– XML files
• Relational databases: common indices
– Relational database management systems
Network database
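[Diagram: nodes linked by pointers, including the people Dora, Joe, and Jamie, the team CMS, a salary of 75,000, and the labels Supervisor, Admin, Roles, Personnel, and Team]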
The WorldWideWeb is the most successful of all network databases.
Hierarchical Database
<Company>
  <Team name="CMS">
    <TeamLeader>
      <Person name="Dora"> <Salary>75,000</Salary> </Person>
    </TeamLeader>
    <TeamMembers>
      <Person name="Joe"><Salary>…
      <Person name="Jamie"><Salary>…
    </TeamMembers>
  </Team>
  ….
</Company>
With XML, hierarchical databases are making a comeback. (e.g., Carnivore registry)
Relational Database
Personnel
UID  Name   Salary
1    Dora   75,000
2    Joe    30,000
3    Jamie  66,000
…

Teams
TeamID  Name  LeadUID
1       CMS   1
2       Test  13
…

TeamMembers
TeamID  UID
1       2
1       3
…
What about astronomy?
• Relational DBMS’s used by all major astronomical data providers:
– ADS, MAST, IPAC, SDSS, …
– Only an RDBMS can scale to the size of modern astronomical tables (100’s of columns, 10^9 rows)
– Flexible interactions between tables
• Standard SQL provides limited support for positional queries
• Some RDBMS’s have support for objects
RDBMS servers
• MySQL: free, widely used, fast
• Postgres: free, widely used, better standards compliance, object support
• Sybase (commercial)
• SQLServer (Microsoft)
• Gazillions more
• MySQL installation at NOAO available for use in summer school projects, or just download a copy to run on your machine (http://dev.mysql.com/downloads/).
Web availability
• CASJOBS
– SQL based, SDSS database. Allows users to generate their own tables.
• OpenSkyQuery
– ADQL based, lots of missions but more fragile.
Using CASJOBS
1. Connect to CASJOBS web site: http://casjobs.sdss.org/CasJobs/
2. Get account or login
3. Build and query data
This is a production service and does not always respond in ‘webtime’.
Basic SQL commands
CREATE TABLE tablename (col1 type1, col2 type2, …)
DROP TABLE tablename
INSERT INTO tablename (col1,col2,…) VALUES(val1,val2,…)
DELETE FROM tablename WHERE condition
UPDATE tablename SET col1=val1,col2=val2,… WHERE condition
SELECT fields FROM tables WHERE conditions ORDER BY col1,col2
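The six commands above can be tried end to end without a server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for CasJobs or MySQL; the table name stars and its columns are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE tablename (col1 type1, col2 type2, ...)
cur.execute("CREATE TABLE stars (starid INTEGER, name VARCHAR(20), mag REAL)")

# INSERT INTO tablename (col1, col2, ...) VALUES (val1, val2, ...)
cur.execute("INSERT INTO stars (starid, name, mag) VALUES (1, 'star1', 5.2)")
cur.execute("INSERT INTO stars (starid, name, mag) VALUES (2, 'star2', 7.9)")
cur.execute("INSERT INTO stars (starid, name, mag) VALUES (3, 'star3', 3.1)")

# UPDATE tablename SET col1=val1 WHERE condition
cur.execute("UPDATE stars SET mag = 8.0 WHERE starid = 2")

# DELETE FROM tablename WHERE condition (removes star3 only)
cur.execute("DELETE FROM stars WHERE mag < 4")

# SELECT fields FROM tables WHERE conditions ORDER BY col1
rows = cur.execute("SELECT starid, name, mag FROM stars ORDER BY mag").fetchall()
print(rows)

# DROP TABLE tablename removes the table itself, not just its rows.
cur.execute("DROP TABLE stars")
conn.close()
```

Other database systems accept the same statements with small dialect differences, mostly in the type names.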
Create an account in CASJOBS
Select MYDB to get access to your private database
Then use CREATE TABLE command
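[Screenshot: CREATE TABLE statement with callouts for the table name, column names, and column types]
MySQL doesn’t like the column name dec, which is a reserved word there. It must be escaped as `dec`.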
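A CREATE TABLE statement of the kind described on this slide might look like the sketch below. It assumes Python's sqlite3 module rather than CasJobs, and the table mystars with its columns is hypothetical. SQLite accepts MySQL-style backticks around identifiers, so the escaped `dec` can be shown as is; standard SQL would use double quotes instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table name, then (column name, column type) pairs.
# `dec` is escaped with backticks as MySQL requires; SQLite tolerates this form.
conn.execute(
    "CREATE TABLE mystars ("
    "starid INTEGER, name VARCHAR(20), ra REAL, `dec` REAL, mag REAL)"
)

# PRAGMA table_info is SQLite-specific: it lists each column's name and declared type.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(mystars)")]
for name, declared_type in columns:
    print(name, declared_type)
```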
Types
• Numeric types
– int, bigint, smallint
– real, float
– As in C, the size of types is not well standardized
– Typically lots of aliases for various sizes of integers and floating point numbers
• Character types: char(n), varchar(n), text
– Use varchar for long, variable length strings
– Use text for very long strings that you won’t need to compare with others (e.g., file content)
• Business oriented types (money, dates, decimal values)
– Some are occasionally useful
Add data to small table: The INSERT command
Use single quotes for strings.
Values in same order as in create statement.
RA,Dec normally stored as decimal degrees
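The three points above can be sketched as follows, assuming Python's sqlite3 module and a hypothetical mystars table. The two conversion helpers are illustrative names, not part of any library, and the Sirius position is approximate.

```python
import sqlite3

def ra_to_degrees(hours, minutes, seconds):
    """Convert right ascension from hours/minutes/seconds to decimal degrees."""
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)

def dec_to_degrees(sign, degrees, arcmin, arcsec):
    """Convert declination from a sign (+1 or -1) and d/m/s to decimal degrees."""
    return sign * (degrees + arcmin / 60.0 + arcsec / 3600.0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mystars (starid INTEGER, name VARCHAR(20), ra REAL, dec REAL)")

# Approximate J2000 position of Sirius: RA 06h 45m 08.9s, Dec -16d 42' 58".
ra = ra_to_degrees(6, 45, 8.9)
dec = dec_to_degrees(-1, 16, 42, 58)

# String literal in single quotes; values listed in the CREATE TABLE column order.
conn.execute("INSERT INTO mystars VALUES (1, 'Sirius', 101.287, -16.716)")

# The same kind of insert with placeholders, which avoids hand-quoting strings.
conn.execute("INSERT INTO mystars VALUES (?, ?, ?, ?)", (2, "Sirius (computed)", ra, dec))

rows = conn.execute("SELECT starid, name, ra, dec FROM mystars ORDER BY starid").fetchall()
for row in rows:
    print(row)
```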
How do I get rid of rows in a table?
DELETE FROM table WHERE conditions
DELETE FROM stars WHERE ra is null
To delete all rows (but not delete the table entirely)
DELETE FROM table
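A short sketch of both forms of DELETE, assuming Python's sqlite3 module and a made-up stars table. It also shows why the example uses `is null`: a comparison written as `ra = NULL` is never true, so it deletes nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (starid INTEGER, ra REAL, dec REAL)")
conn.executemany(
    "INSERT INTO stars VALUES (?, ?, ?)",
    [(1, 10.0, 20.0), (2, None, 5.0), (3, 30.0, -40.0)],  # None is stored as NULL
)

# "= NULL" never matches, because comparing anything with NULL is not true.
deleted_with_equals = conn.execute("DELETE FROM stars WHERE ra = NULL").rowcount

# "IS NULL" is the correct test and removes the one row without an RA.
deleted_with_is_null = conn.execute("DELETE FROM stars WHERE ra IS NULL").rowcount
remaining = conn.execute("SELECT COUNT(*) FROM stars").fetchone()[0]

# No WHERE clause: every row goes, but the (now empty) table still exists.
conn.execute("DELETE FROM stars")
rows_after_delete_all = conn.execute("SELECT COUNT(*) FROM stars").fetchone()[0]

print(deleted_with_equals, deleted_with_is_null, remaining, rows_after_delete_all)
```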
How do I modify a table?
• Update values in existing rows:
– UPDATE table SET field=value,field=value,… WHERE condition
– UPDATE stars SET ra=1.14983, dec=-31.243 WHERE starid=49
• Adding columns.
– ALTER TABLE … ADD COLUMN … is part of standard SQL, but not every DB supports it. Where it is missing, create a new table with the new columns and copy the old values.
• Copying tables (differs a bit from DBMS to DBMS)
– SELECT * INTO table2 FROM table1 (creates table2; SQL Server syntax), or
– Create the second table, then INSERT INTO table2 SELECT * FROM table1
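The three kinds of modification can be sketched together, assuming Python's sqlite3 module. SQLite supports ALTER TABLE … ADD COLUMN and the INSERT … SELECT form of copying; the table names stars and stars_copy are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (starid INTEGER, ra REAL, dec REAL)")
conn.execute("INSERT INTO stars VALUES (49, 0.0, 0.0)")

# Update values in an existing row.
conn.execute("UPDATE stars SET ra = 1.14983, dec = -31.243 WHERE starid = 49")

# Add a column; rows that already exist get NULL in the new column.
conn.execute("ALTER TABLE stars ADD COLUMN mag REAL")

# Copy the table: create the target first, then fill it from a SELECT.
conn.execute("CREATE TABLE stars_copy (starid INTEGER, ra REAL, dec REAL, mag REAL)")
conn.execute("INSERT INTO stars_copy SELECT * FROM stars")

copied = conn.execute("SELECT * FROM stars_copy").fetchall()
print(copied)
```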
Null values: no name for these stars
Specify parameters to fill
Include an explicit null in the list
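Both ways of leaving the name empty can be sketched as below, assuming Python's sqlite3 module and a hypothetical mystars table. Either route stores a NULL, which Python shows as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mystars (starid INTEGER, name VARCHAR(20), ra REAL, dec REAL)")

# Option 1: specify which columns to fill; the omitted name column becomes NULL.
conn.execute("INSERT INTO mystars (starid, ra, dec) VALUES (1, 10.5, -3.2)")

# Option 2: give every column a value and put an explicit NULL in the list.
conn.execute("INSERT INTO mystars VALUES (2, NULL, 11.0, 4.7)")

names = conn.execute("SELECT name FROM mystars ORDER BY starid").fetchall()
unnamed = conn.execute("SELECT COUNT(*) FROM mystars WHERE name IS NULL").fetchone()[0]
print(names, unnamed)
```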
What’s in the table?
Status of my tables
Let’s try to query the table we created: use the SELECT command
But it fails! Try again…
This query works. CasJobs doesn’t like null values. The lesson is that one needs to be chary of nulls (and that CasJobs is not a full featured DB).
CasJobs put the results in another table for us. We click on MYDB in the top bar, then the created table name, then on Sample to see the results.
The SELECT statement
The SELECT statement is used to query an existing table or set of tables
A typical query is:
SELECT field1,field2,…
FROM table1, table2, …
WHERE condition1 AND/OR condition2 AND/OR …
ORDER BY sortfield1,sortfield2,…
The list of fields to be returned can usually be specified as ‘*’ to get all the fields in the table.
There are also GROUP BY and HAVING clauses for advanced queries.
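A typical query of this shape, run against a small made-up table, might look like the following sketch. It assumes Python's sqlite3 module; the four rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mystars (starid INTEGER, name VARCHAR(20), ra REAL, dec REAL, mag REAL)")
conn.executemany(
    "INSERT INTO mystars VALUES (?, ?, ?, ?, ?)",
    [
        (1, "star1", 10.0, 20.0, 5.2),
        (2, "star2", 15.0, -5.0, 7.9),
        (3, "star3", 200.0, 45.0, 3.1),
        (4, "star4", 120.0, 10.0, 5.9),
    ],
)

# Field list, FROM, WHERE with two conditions joined by AND, then ORDER BY.
bright_north = conn.execute(
    "SELECT name, mag FROM mystars WHERE mag < 6 AND dec > 0 ORDER BY mag"
).fetchall()

# '*' returns every column of the table.
everything = conn.execute("SELECT * FROM mystars").fetchall()

print(bright_north)
print(len(everything), "rows with", len(everything[0]), "columns each")
```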
The FROM clause
• The table, or list of tables, to be queried.
– … FROM mystars …
• There may be a ‘database’ specified, a collection of related tables
– … FROM mydb.mystars …
• Each table in the list may have an alias that can be used to identify that table elsewhere
– … FROM mystars m ….
Fields (all other clauses)
Can be:
• Simple name of column in table: mag
• Expression: ra-dec
• A constant: 3.14159, or 'pi'
• You can name the result if you like using AS, e.g.
select ra AS myra, ra-dec AS diff from mystars
will return columns named myra and diff
• When there are multiple tables, table columns can be distinguished by table names and aliases.
– Select t1.ra-t2.ra from mystars t1, yourstars t2 where t1.id=t2.id
– Select mystars.ra-yourstars.ra from mystars,yourstars where mystars.id=yourstars.id
• One difference between SQL and ADQL is that ADQL requires all column names to specify their origin table, not just where there is an ambiguity.
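Table aliases and AS can be seen working together in the sketch below, which assumes Python's sqlite3 module and two invented tables, mystars and yourstars. Every column is qualified with its table alias, which is the style ADQL asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mystars (id INTEGER, ra REAL, dec REAL)")
conn.execute("CREATE TABLE yourstars (id INTEGER, ra REAL, dec REAL)")
conn.executemany("INSERT INTO mystars VALUES (?, ?, ?)", [(1, 10.0, 20.0), (2, 15.5, -5.0)])
conn.executemany(
    "INSERT INTO yourstars VALUES (?, ?, ?)",
    [(1, 10.25, 20.0), (2, 15.0, -5.0), (3, 99.0, 0.0)],
)

# t1 and t2 are aliases for the two tables; AS names the output columns.
cursor = conn.execute(
    "SELECT t1.id AS id, t1.ra - t2.ra AS diff "
    "FROM mystars t1, yourstars t2 "
    "WHERE t1.id = t2.id ORDER BY t1.id"
)
column_names = [description[0] for description in cursor.description]
differences = cursor.fetchall()

print(column_names)
print(differences)
```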
Conditions
• Comparisons:
– mag < 6
– mag between 4 and 6
– name='star1'
– ra < 3*dec
– sin(ra)*cos(dec)-cos(ra)*sin(dec) > .234
– name like 'star%'
  • ‘%’ matches any sequence of characters. ‘_’ matches exactly one character.
• Lists
– name in ('star1', 'star2', 'star3')
• Null tests:
– name is null
• Negation
– not between 4 and 6
– name is not null
• Compound conditions
– (mag < 6 or mag > 10) and mag > 2*flux
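Each kind of condition can be checked against a few made-up rows. This is a sketch assuming Python's sqlite3 module; the helper ids() and the sample data are illustrative, and the compound condition is written with OR between the two magnitude ranges so that it can actually match something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mystars (id INTEGER, name VARCHAR(20), mag REAL, flux REAL)")
conn.executemany(
    "INSERT INTO mystars VALUES (?, ?, ?, ?)",
    [
        (1, "star1", 3.5, 1.0),
        (2, "star2", 5.0, 4.0),
        (3, "nova9", 5.5, 1.0),
        (4, None, 12.0, 2.0),   # a star with no name (NULL)
        (5, "star10", 6.5, 1.0),
    ],
)

def ids(condition):
    """Return the ids of rows satisfying a WHERE condition (trusted input only)."""
    query = "SELECT id FROM mystars WHERE " + condition + " ORDER BY id"
    return [row[0] for row in conn.execute(query)]

between = ids("mag BETWEEN 4 AND 6")                  # both ends are included
like_any = ids("name LIKE 'star%'")                   # % matches any run of characters
like_one = ids("name LIKE 'star_'")                   # _ matches exactly one character
in_list = ids("name IN ('star1', 'star2', 'star3')")
is_null = ids("name IS NULL")
not_between = ids("mag NOT BETWEEN 4 AND 6")
compound = ids("(mag < 6 OR mag > 10) AND mag > 2*flux")

print(between, like_any, like_one, in_list, is_null, not_between, compound)
```

Note that the unnamed star never matches a LIKE or IN test, since comparisons with NULL are not true.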
Querying multiple tables: Joins
Joins ‘join’ the rows in two or more tables
– One to one relationship
  • The two tables are logically one bigger table
  • A salary table and address table with one row for each employee.
– Many to one
  • Components
  • Checks to Checking account
    – Each check belongs to exactly one account but each account can have many checks
– Many to Many
  • Observations/Sources
    – A single observation may include many sources
    – A single source may be seen in many observations
How does a query work…
Table1
a11,a12,a13,…
a21,a22,a23,…
a31,a32,a33,…
…

Table2
b11,b12,b13,…
b21,b22,b23,…
b31,b32,b33,…
…
Naïve view: Looking first at the FROM clause, the database creates the product of all the rows of all the participating tables.
If there are n rows in table 1 and m in table 2, then there are n×m rows in the join. For many-way joins this can get very large very quickly!
Database optimization is all about trying to reduce the number of rows that have to be looked at.
a11,a12,a13,…,b11,b12,b13,…
a11,a12,a13,…,b21,b22,b23,…
a11,a12,a13,…,b31,b32,b33,…
a21,a22,a23,…,b11,b12,b13,…
a21,a22,a23,…,b21,b22,b23,…
a21,a22,a23,…,b31,b32,b33,…
a31,a32,a33,…,b11,b12,b13,…
a31,a32,a33,…,b21,b22,b23,…
a31,a32,a33,…,b31,b32,b33,…
…
WHERE clause: Filtering the intermediate table
The WHERE clause filters the rows in the product table.
If a condition involves only one table it can be applied to the inputs before cross-product.
If a condition applies to both tables, they can sometimes be organized (indexed) such that only a few rows in one table need to be checked for each row in the other.
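The naïve picture of product-then-filter can be made concrete with two tiny tables. This is a sketch assuming Python's sqlite3 module; table1, table2, the key column k, and the index name are invented for illustration, and how the database really executes the query is up to its planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (k INTEGER, a TEXT)")
conn.execute("CREATE TABLE table2 (k INTEGER, b TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "a1"), (2, "a2"), (3, "a3")])
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [(1, "b1"), (1, "b2"), (2, "b3"), (4, "b4")],
)

# An index on the join column is what lets a DBMS look up the few matching
# rows of table2 for each row of table1 instead of scanning all of them.
conn.execute("CREATE INDEX table2_k ON table2 (k)")

# No WHERE clause: the full product, n x m = 3 x 4 rows.
product_rows = conn.execute("SELECT COUNT(*) FROM table1, table2").fetchone()[0]

# A condition on both tables keeps only the rows whose keys agree.
joined = conn.execute(
    "SELECT t1.a, t2.b FROM table1 t1, table2 t2 "
    "WHERE t1.k = t2.k ORDER BY t1.a, t2.b"
).fetchall()

# An extra condition on table1 alone could be applied before the product is formed.
filtered = conn.execute(
    "SELECT t1.a, t2.b FROM table1 t1, table2 t2 "
    "WHERE t1.k = t2.k AND t1.k > 1 ORDER BY t1.a, t2.b"
).fetchall()

print(product_rows, joined, filtered)
```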
Specifying the output
• The selection list tells the DBMS which columns to include (or generate) in the output
• The ORDER BY clause tells the DBMS how to sort the output rows.
Astronomy and RDBMS’s
• RDBMS’s expect exact joins on keys (user id, SSN’s, product number) which matches business use.
– Sometimes correct for astronomical tables, e.g., targets in a proposal.
• Many astronomical joins are soft: nearby in position and/or time looking for counterparts.
– 1-D is not a problem, but 2-D soft joins are hard.
– GIS systems provide some non-standard features
– ADQL extensions
– Indexing of tables and planning of queries requires lots of work to do efficiently.
Create sample tables
create table pets (owner varchar(12), pet varchar(12), species varchar(12))
insert into pets values('fred', 'fido', 'dog')
insert into pets values('joe', 'ruff', 'dog')
insert into pets values('mary', 'ears', 'dog')
insert into pets values('andie', 'max', 'dog')
insert into pets values('joe', 'silvertip', 'cat')
insert into pets values('marlene', 'buster', 'cat')
insert into pets values('bill', 'meow', 'cat')
insert into pets values('simone', 'furball', 'cat')
insert into pets values('mary', 'silver', 'horse')
insert into pets values('nell', 'fairmane', 'horse')
insert into pets values('effie', 'belle', 'horse')
insert into pets values('nell', 'slither', 'salamander')
insert into pets values('nell', 'hairy', 'tarantula')

create table vets (vet varchar(12), species varchar(12))
insert into vets values('merriwether', 'dog')
insert into vets values('parell', 'dog')
insert into vets values('parell', 'cat')
insert into vets values('parell', 'horse')
insert into vets values('nestor', 'salamander')
Sample Queries:
Which vets should each owner know?
select p.owner, v.vet
from pets p, vets v
where p.species = v.species
order by p.owner

Are any pets not covered by a vet?
select pet from pets
where species not in (select species from vets)

How many pets can each vet treat?
select v.vet, count(p.pet) as pcnt
from pets p, vets v
where v.species = p.species
group by v.vet
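The three sample queries can be run against the pets and vets tables as given. The sketch below assumes Python's sqlite3 module in place of CasJobs; the only change to the queries is an added ORDER BY on the last one so the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (owner VARCHAR(12), pet VARCHAR(12), species VARCHAR(12))")
conn.execute("CREATE TABLE vets (vet VARCHAR(12), species VARCHAR(12))")
conn.executemany("INSERT INTO pets VALUES (?, ?, ?)", [
    ("fred", "fido", "dog"), ("joe", "ruff", "dog"), ("mary", "ears", "dog"),
    ("andie", "max", "dog"), ("joe", "silvertip", "cat"), ("marlene", "buster", "cat"),
    ("bill", "meow", "cat"), ("simone", "furball", "cat"), ("mary", "silver", "horse"),
    ("nell", "fairmane", "horse"), ("effie", "belle", "horse"),
    ("nell", "slither", "salamander"), ("nell", "hairy", "tarantula"),
])
conn.executemany("INSERT INTO vets VALUES (?, ?)", [
    ("merriwether", "dog"), ("parell", "dog"), ("parell", "cat"),
    ("parell", "horse"), ("nestor", "salamander"),
])

# Which vets should each owner know? One row per (pet, matching vet) pair.
owner_vets = conn.execute(
    "SELECT p.owner, v.vet FROM pets p, vets v "
    "WHERE p.species = v.species ORDER BY p.owner"
).fetchall()

# Are any pets not covered by a vet?
uncovered = conn.execute(
    "SELECT pet FROM pets WHERE species NOT IN (SELECT species FROM vets)"
).fetchall()

# How many pets can each vet treat?
per_vet = conn.execute(
    "SELECT v.vet, COUNT(p.pet) AS pcnt FROM pets p, vets v "
    "WHERE v.species = p.species GROUP BY v.vet ORDER BY v.vet"
).fetchall()

print(len(owner_vets), "owner/vet rows;", len(set(owner_vets)), "distinct pairs")
print(uncovered)
print(per_vet)
```

The first query repeats an owner/vet pair when an owner has two pets treated by the same vet; SELECT DISTINCT would collapse those.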
Upload SPOCS data
Query for Sun-like stars
SELECT s.spocs, s.Name, s.Teff, s.Log_g, s.M_o_H
FROM spocs s
WHERE s.Teff BETWEEN 5720 AND 5820
AND s.Log_g BETWEEN 4.34 AND 4.54
AND s.M_o_H BETWEEN -0.1 AND +0.1
ORDER BY s.Teff
Correlate with SDSS: Can it be done?
SELECT p.objID, p.field, p.ra, p.dec
FROM dr2..PhotoObj p, mydb..spocs s
WHERE s._dej2000 between p.dec-.001 and p.dec+.001
and s._raj2000 between p.ra-.001 and p.ra+.001
Making this possible is what ADQL is really about!
~7 ½ minutes to run this little query.
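The shape of this positional box match can be tried on a toy scale. The sketch below assumes Python's sqlite3 module, and every row in it is invented: photoobj and spocs here are small hypothetical stand-ins, not the SDSS or SPOCS data, and the column names are simplified. It says nothing about performance on real catalogs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photoobj (objid INTEGER, ra REAL, dec REAL)")
conn.execute("CREATE TABLE spocs (name TEXT, raj2000 REAL, dej2000 REAL)")

# Invented positions in decimal degrees.
conn.executemany("INSERT INTO photoobj VALUES (?, ?, ?)", [
    (101, 150.0005, 2.0003),
    (102, 150.0100, 2.0000),
    (103, 200.0000, -10.0000),
])
conn.executemany("INSERT INTO spocs VALUES (?, ?, ?)", [
    ("A", 150.0000, 2.0000),
    ("B", 200.0004, -10.0009),
    ("C", 10.0000, 10.0000),
])

# Soft join: keep pairs whose RA and Dec each agree to within 0.001 degrees.
matches = conn.execute(
    """
    SELECT s.name, p.objid
    FROM photoobj p, spocs s
    WHERE s.dej2000 BETWEEN p.dec - 0.001 AND p.dec + 0.001
      AND s.raj2000 BETWEEN p.ra - 0.001 AND p.ra + 0.001
    ORDER BY s.name
    """
).fetchall()

print(matches)
```

A box in RA and Dec is only a rough stand-in for a true angular separation, which is part of what the ADQL positional extensions are meant to address.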
Other SQL topics
• TOP/LIMIT/SET ROWCOUNT
– Different ways to limit the rows output
• Functions and procedures
• Indices, clusters, and ensuring efficient access
• How do we maintain a dynamic database?
– Referential integrity, triggers, transactions.
– Less critical to typical, relatively static astronomical databases.
• Groups and group functions
– AVG, MIN, MAX, COUNT, SUM
– GROUP BY and HAVING clauses
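Two of these topics, limiting output rows and grouping, can be sketched with the pets table from the earlier slides. This assumes Python's sqlite3 module; SQLite uses LIMIT, whereas SQL Server uses TOP, so the row-limiting syntax shown is one dialect among several.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (owner VARCHAR(12), pet VARCHAR(12), species VARCHAR(12))")
conn.executemany("INSERT INTO pets VALUES (?, ?, ?)", [
    ("fred", "fido", "dog"), ("joe", "ruff", "dog"), ("mary", "ears", "dog"),
    ("andie", "max", "dog"), ("joe", "silvertip", "cat"), ("marlene", "buster", "cat"),
    ("bill", "meow", "cat"), ("simone", "furball", "cat"), ("mary", "silver", "horse"),
    ("nell", "fairmane", "horse"), ("effie", "belle", "horse"),
    ("nell", "slither", "salamander"), ("nell", "hairy", "tarantula"),
])

# LIMIT keeps only the first rows of the sorted result.
first_two = conn.execute("SELECT pet FROM pets ORDER BY pet LIMIT 2").fetchall()

# GROUP BY makes one output row per species; COUNT is computed within each group;
# HAVING then filters the groups, much as WHERE filters individual rows.
common_species = conn.execute(
    "SELECT species, COUNT(*) AS n FROM pets "
    "GROUP BY species HAVING COUNT(*) >= 3 "
    "ORDER BY n DESC, species"
).fetchall()

print(first_two)
print(common_species)
```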
What to do if DB needed in project
• Databases and accounts set up and available for use at NOAO.
• Lots of free databases available. If you want help setting one up just ask….