Upload
justin-horton
View
218
Download
2
Tags:
Embed Size (px)
Citation preview
Promised AbstractPromised Abstract
Can database technology help manage and mine Can database technology help manage and mine scientific data? That is the question I have been trying to scientific data? That is the question I have been trying to answer with my astronomy colleagues (especially Alex answer with my astronomy colleagues (especially Alex Szalay.) We have had some success but still face many Szalay.) We have had some success but still face many problems. I will start by describing the odyssey of putting problems. I will start by describing the odyssey of putting the Sloan Digital Sky Survey online and give some the Sloan Digital Sky Survey online and give some statistics about how it is used and what we are doing now. statistics about how it is used and what we are doing now. That segment will end with a discussion of how the That segment will end with a discussion of how the integration of SQL with the CLR (common language run integration of SQL with the CLR (common language run time) makes it much easier for us to handle scientific time) makes it much easier for us to handle scientific datatypes and spatial access methods. The World-Wide datatypes and spatial access methods. The World-Wide Telescope is an attempt to federate all the astronomy Telescope is an attempt to federate all the astronomy archives of the world. I will briefly describe the architecture archives of the world. I will briefly describe the architecture of SkyQuery a prototype portal to several archives each of of SkyQuery a prototype portal to several archives each of which is a web service. Urls: which is a web service. Urls: http://research.microsoft.com/~grayhttp://research.microsoft.com/~gray http://skyserver.pha.jhu.edu/dr1/http://skyserver.pha.jhu.edu/dr1/ http://skyQuery.nethttp://skyQuery.net
Session CodeSession CodeYukon Features For Yukon Features For SkyServer Database SkyServer Database Jim Gray: MicrosoftJim Gray: MicrosoftAlex Szalay (and friends): Johns HopkinsAlex Szalay (and friends): Johns Hopkins
Help from:Help from:
Cathan Cook (personal SkyServer),Cathan Cook (personal SkyServer),
Maria A. Nieto-Santisteban (image cutout service)Maria A. Nieto-Santisteban (image cutout service)
SkyServer Overview SkyServer Overview (10 min)(10 min)
10 minute SkyServer tour10 minute SkyServer tourPixel space Pixel space http://skyserver.sdss.org/en/http://skyserver.sdss.org/en/
Record space:Record space:
Doc space: NedDoc space: Ned
Set space: Set space:
Web & Query Logs Web & Query Logs
Dr1 WebServiceDr1 WebService
You can download You can download (thanks to Cathan Cook )(thanks to Cathan Cook )
Data + Database code:Data + Database code:
Website: Website:
Data Mining the SDSS SkyServer DatabaseData Mining the SDSS SkyServer Database MSR-TR-2002-01MSR-TR-2002-01
select top 10 * from weblog..weblog where yy = 2003 and mm=7 and dd =25 order by seq desc
select top 10 * from weblog..sqlLogorder by theTime Desc
http://research.microsoft.com/~gray/SDSS
Cutout Service (10 min)Cutout Service (10 min)A typical web serviceA typical web service
Show itShow it
Show WSDLShow WSDL
Show fixing a bugShow fixing a bug
Rush through code.Rush through code.
You can download it.You can download it.Maria A. Nieto-Santisteban did most of this (Alex and I started it)Maria A. Nieto-Santisteban did most of this (Alex and I started it)
http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm
SkyQuery: SkyQuery: http://skyquery.net/http://skyquery.net/
Distributed Query tool using a set of web servicesDistributed Query tool using a set of web servicesFifteen astronomy archives from Fifteen astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)…Pasadena, Chicago, Baltimore, Cambridge (England)…Feasibility study, built in 6 weeks Feasibility study, built in 6 weeks
Tanu Malik (JHU CS grad student) Tanu Malik (JHU CS grad student) Tamas Budavari (JHU astro postdoc)Tamas Budavari (JHU astro postdoc)With help from Szalay, Thakar, GrayWith help from Szalay, Thakar, Gray
Implemented in C# and .NETImplemented in C# and .NETAllows queries like:Allows queries like:
SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2
2MASS
INT
SDSS
FIRST
SkyQueryPortal
ImageCutout
SkyQuery StructureSkyQuery Structure
Each SkyNode publishes Each SkyNode publishes
Schema Web ServiceSchema Web Service
Database Web ServiceDatabase Web Service
Portal is Portal is
Plans Query (2 phase) Plans Query (2 phase)
Integrates answersIntegrates answers
Is itself a web serviceIs itself a web service
Four Database TopicsFour Database Topics
Sparse tables: Sparse tables: column vs row storecolumn vs row storetag and index tablestag and index tablespivotpivot
Maplist (cross apply)Maplist (cross apply)
Bookmark bug Bookmark bug
Object Relational has arrived.Object Relational has arrived.
Column Store PyramidColumn Store PyramidUsers see fat base tables Users see fat base tables (universal relation)(universal relation)
Define popular columns indexDefine popular columns index tag tabletag table 10% ~ 100 columns 10% ~ 100 columns
Make many skinny indicesMake many skinny indices 1% ~ 10 columns 1% ~ 10 columns
Query optimizer picks right planQuery optimizer picks right plan
Automate definition & useAutomate definition & use
Fast read, slow insert/update Fast read, slow insert/update
Data warehouseData warehouse
Note: prior to Yukon, Note: prior to Yukon, index had 16 column limit. index had 16 column limit.
A bane of my existence. A bane of my existence. Simpl
e
Typical Semi-join
Fat quer
y
Obese query
BASEBASE
INDICIESINDICIES
TAGTAG
ExamplesExamples
create table base (id bigint, f1 int primary key,f2 int, …,f1000 int)
create index tag on base (id) include (f1, …, f100)
create index skinny on base(f2,…f17)
Simple
Typical Semi-join
Fat quer
y
Obese query
BASE
INDICIESINDICIES
TAGTAG
A Semi-Join ExampleA Semi-Join Example
create table fat(a int primary key, b int, c int, fat char (988))declare @i int, @j int; set @i = 0again: insert fat values(@i, cast(100*rand() as int), cast (100*rand() as int), ' ') set @i = @i + 1; if (@i < 1000000) goto again create index ab on fat(a,b)create index ac on fat(a,c)
dbcc dropcleanbuffers with no_infomsgs select count(*) from fat with(index (0)) where c = b-- Table 'fat'. Scan 3, reads 137,230, CPU : 1.3 s, elapsed 31.1s.
dbcc dropcleanbuffers with no_infomsgs select count(*) from fat where b=c-- Table 'fat'. Scan 2, reads: 3,482 CPU 1.1 s, elapsed: 1.4 s.
1GB
8MB 8MB
b=c 3.4K IO 1.4 sec
abab acac
b=c 137 K IO
31 sec
Moving From Rows to ColumnsMoving From Rows to ColumnsPivotPivot & & UnPivotUnPivot
What if the table is sparse? What if the table is sparse? LDAP has 7 mandatory LDAP has 7 mandatory
and 1,000 optional attributesand 1,000 optional attributes
Store row, col, valueStore row, col, value
create table Features ( object varchar , attribute varchar,
value varchar, primary key ( object,
attribute)) select *from (features pivot value on attribute
in (year, color) ) as Twhere object = ‘4PNC450’
Featuresobject attribute value
●●●●4PNC450 year 20004PNC450 color white4PNC450 make Ford4PNC450 model Taurus●●●●
TObject year color 4PNC450 2000 white
Maplist Meets SQL – cross applyMaplist Meets SQL – cross apply
Your table-valued function F(a,b,c) Your table-valued function F(a,b,c) returns all objects related to a,b,c.returns all objects related to a,b,c.
spatial neighbors, spatial neighbors,
sub-assemblies, sub-assemblies,
members of a group, members of a group,
items in a folder,…items in a folder,…
Apply this function to each rowApply this function to each row
Classic drill-downClassic drill-downuse use outer applyouter apply if f() may be null if f() may be null
select p.*, q.*from parent as p cross apply f(p.a, p.b, p.c) as qwhere p.type = 1
p1
f(p1)
p2
f(p2)
pn
f(pn)
The Bookmark BugThe Bookmark Bug
SQL is a non-procedural language.SQL is a non-procedural language.
The compiler/optimizer picks the procedureThe compiler/optimizer picks the procedurebased on statistics.based on statistics.
If the stats are wrong or missing….If the stats are wrong or missing….Bad things happen.Bad things happen.Queries can run VERY slowly.Queries can run VERY slowly.
Strategy 1: allow users to specify plan.Strategy 1: allow users to specify plan.
Strategy 2: make the optimizer smarterStrategy 2: make the optimizer smarter(and accept hints from the user.)(and accept hints from the user.)
An Example of the ProblemAn Example of the ProblemA query selects some fields A query selects some fields
of an index of an index and of huge table.and of huge table.
Bookmark plan: Bookmark plan:
look in index for a subset.look in index for a subset.
Lookup subset in Fat table.Lookup subset in Fat table.
This is This is
great if subset << table.great if subset << table.
terrible if subset ~ table.terrible if subset ~ table.
If statistics are wrong, If statistics are wrong, or if predicates not independent,or if predicates not independent,you get the wrong plan.you get the wrong plan.
How to fix the statistics?How to fix the statistics?
Ind
ex
Huge table
A Fix: Let user ask for statsA Fix: Let user ask for statsCreate Statistics on View(f1,..,fn)Create Statistics on View(f1,..,fn)
Then the optimizer has the right dataThen the optimizer has the right dataPicks the right plan.Picks the right plan.Statistics on ViewsStatistics on Views, , C. Galindo-Legaria, M. Josi, F. Waas, M. Wu, VLDB 2003C. Galindo-Legaria, M. Josi, F. Waas, M. Wu, VLDB 2003, ,
Q3: Q3: Select count(*) from Galaxy Select count(*) from Galaxy where r < 22 and r_extinction > 0.120where r < 22 and r_extinction > 0.120
Bookmark: 34 M random IO, Bookmark: 34 M random IO, 520 minutes520 minutes Create Statistics on Galaxy(objID ) Create Statistics on Galaxy(objID )
Scan: 5 M sequential IOScan: 5 M sequential IO 18 minutes18 minutes
Ultimately this should be automated,Ultimately this should be automated, but for now,… it’s a step in the right direction.but for now,… it’s a step in the right direction.
Object Relational Has ArrivedObject Relational Has ArrivedVMs are moving inside the DBVMs are moving inside the DB
Yukon includes Common Language Runtime Yukon includes Common Language Runtime (Oracle & DB2 have similar mechanisms).(Oracle & DB2 have similar mechanisms).
So, C++, VB, C# and Java are So, C++, VB, C# and Java are co-equal with TransactSQL.co-equal with TransactSQL.
You can define classes and methodsYou can define classes and methodsSQL will store the instancesSQL will store the instancesAccess them via methodsAccess them via methods
You can put your analysis code You can put your analysis code INSIDEINSIDE the database. the database.
Minimizes data movement.Minimizes data movement.You can’t move petabytes to the clientYou can’t move petabytes to the clientBut we will soon have petabyte databases.But we will soon have petabyte databases.
datacode
datacode
+code
And..And..Fully-async and synchronous (blocking) callsFully-async and synchronous (blocking) callsand multi-concurrent-result sets and multi-concurrent-result sets per connection (transaction)per connection (transaction)
Queues built in (service broker):Queues built in (service broker):Fire-and forget asynchronous processingFire-and forget asynchronous processing
It listens to Port 80 for SOAP calls :It listens to Port 80 for SOAP calls : TP-lite is backTP-lite is back
It’s a web service It’s a web service
Notification service and Notification service and data mining and data mining and olap and olap and reporting and reporting and xml and xml and xquery andxquery and.... ).... )
But, back to OR.But, back to OR.
Some BackgroundSome BackgroundTable valued functionsTable valued functions
SQL operates on tables.SQL operates on tables.If you can make tables, you can extend SQLIf you can make tables, you can extend SQLThis is the idea behind OLE/DBThis is the idea behind OLE/DB
create function Evens(@maxVal int)returns @T table (a int)
beginwhile (@maxVal > 0)
beginif (@maxVal % 2 = 0) insert @T values(@maxVal)set @maxVal = @maxVal -1end
return end
select * from Evens(10)
a ----------- 108642
Using table Valued Functions Using table Valued Functions For Spatial SearchFor Spatial Search
Use function to return likely key ranges.Use function to return likely key ranges.
Use filter predicate to eliminate objects Use filter predicate to eliminate objects outside the query box. outside the query box.
Select objIDFrom Objects O join fGetRanges( @latitude, @longitude, @radius) R
on O.htmID between R.begin and R.endwhere abs(o.Lat - @latitude) + abs(o.Lon – @longitude) < @radius
Table valued function returns candidate ranges of some space-filling curve.
Filter discards false positives.
The Pre CLR designThe Pre CLR design
Transact SQL sp_HTM(20 lines) 469 lines of
“glue”looking like: // Get Coordinates param datatype, and param length information of if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL) ErrorExit("srv_paraminfo failed...");
// Is Coordinate param a character stringif (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR &&
bType1 != SRVVARCHAR && bType1 != SRVCHAR)ErrorExit("Coordinate param should be a string.");
// Is Coordinate param non-nullif (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1)
ErrorExit("Coordinate param is null.");
// Get pointer to Coordinate parampzCoordinateSpec = (char *) srv_paramdata (pSrvProc, 1);if (pzCoordinateSpec == NULL)
ErrorExit("Coordinate param is null.");pzCoordinateSpec[cbActualLen1] = 0;
// Get OutputVector datatype, and param length information if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL) ErrorExit("Failed to get type info on HTM Vector param...");
The HTM code body
The “glue” CLR designThe “glue” CLR designDiscard 450 lines of UGLY codeDiscard 450 lines of UGLY code
The HTM code body
C# SQL sp_HTM(50 lines)
using System;using System.Data;using System.Data.SqlServer;using System.Data.SqlTypes;using System.Runtime.InteropServices;namespace HTM {
public class HTM_wrapper {[DllImport("SQL_HTM.dll")] static extern unsafe void * xp_HTM_Cover_get (byte *str);public static unsafe void HTM_cover_RS(string input) {
// convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown) byte * input; byte * output;
// invoke the HTM routine output = (byte *)xp_HTM_Cover_get(input);
// Convert the array to a tableSqlResultSet outputTable = SqlContext.GetReturnResultSet();
if (output[0] == 'O') { // if Output is “OK”uint c = *(UInt32 *)(s + 4); // cast results as datasetInt64 * r = ( Int64 *)(s + 8); // Int64 r[c-1,2]
for (int i = 0; i < c; ++i) { SqlDataRecord newRecord = outputTable.CreateRecord(); newRecord.SetSqlInt64(0, r[0]);
newRecord.SetSqlInt64(1, r[1]); r++;r++;outputTable.Insert(newRecord);
} } // return outputTable;} } }
Thanks!!! To Peter Kukol (who wrote this)
The Clean CLR designThe Clean CLR designDiscard all glue codeDiscard all glue codereturn array cast as tablereturn array cast as table
CREATE ASSEMBLY HTM_AFROM '\\localhost\HTM\HTM.dll' CREATE FUNCTION HTM_cover( @input NVARCHAR(100) )RETURNS @t TABLE ( HTM_ID_START BIGINT NOT NULL PRIMARY KEY, HTM_ID_END BIGINT NOT NULL
)ASEXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover
using System;using System.Data;using System.Data.Sql;using System.Data.SqlServer;using System.Data.SqlTypes;using System.Runtime.InteropServices;namespace HTM_NS {
public class HTM_C {public static Int64[,2] HTM_cover(string input) {
// invoke the HTM routine return (Int64[,2]) xp_HTM_Cover(input); // the actual HTM C# or C++ or Java or VB code goes here.
} } }
Your/My code
goes here
Performance (Beta1)Performance (Beta1) On a 2.2 Ghz Xeon On a 2.2 Ghz Xeon
Call a Transact SQL functionCall a Transact SQL function 3333μμss
Call a C# functionCall a C# function 5050μμss
Table valued functionTable valued function 1,580 μμss + per row per row 42 μs
Array (== table) valued function Array (== table) valued function 200 200 μμss
+ per row 27 + per row 27 μμs s
CREATE ASSEMBLY ReturnOneAFROM '\\localhost\C:\ReturnOne.dll'GOCREATE FUNCTION ReturnOne_Int( @input INT) RETURNS INTAS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_IntGO----------------------------------------------- time echo an integerdeclare @i int, @j int, @cpu_seconds float, @null_loop floatdeclare @start datetime, @end datetimeset @j = 0set @i = 10000set @start = current_Timestampwhile(@i > 0) begin
set @j = @j + 1set @i = @i -1
endset @end = current_Timestampset @null_loop = datediff(ms, @start,@end) / 10.0
set @i = 10000set @start = current_Timestampwhile(@i > 0) begin
select @j = dbo.ReturnOne_Int(@i)set @j = @j + 1set @i = @i -1
endset @end = current_Timestamp set @cpu_seconds = datediff(ms, @start,@end) / 10.0 - @null_loopprint 'average cpu time for 1,000 calls to ReturnOne_Int was ' + str(@cpu_seconds,8,2)+ ' micro seconds'
The CodeThe Code
using System;using System.Data;using System.Data.SqlServer;using System.Data.SqlTypes;using System.Runtime.InteropServices;namespace ReturnOneNS { public class ReturnOneC {
public static int ReturnOne_Int(int input) {return input;
} }}
Function written in C# inside the DB
Program in DB in different
language (Tsql) calling function
What Is the Significance?What Is the Significance?
No more inside/outside DB dichotomy.No more inside/outside DB dichotomy.
You can put your code near the data.You can put your code near the data.
Indeed, we are letting users put personal Indeed, we are letting users put personal databases near the data archive.databases near the data archive.
This avoids moving large datasets.This avoids moving large datasets.
Just move questions and answers.Just move questions and answers.
Meta-MessageMeta-Message
Trying to fit science data into databasesTrying to fit science data into databases
When it does not fit, something is wrong.When it does not fit, something is wrong.
Look for solutionsLook for solutions
Many solutions come from OR extensionsMany solutions come from OR extensions
Some are fundamental engine changesSome are fundamental engine changes
More structure in DBMore structure in DB
Richer operator setsRicher operator sets
Better statisticsBetter statistics
© 2002 Microsoft Corporation. All rights reserved.© 2002 Microsoft Corporation. All rights reserved.This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
evaluationsevaluations
© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only.© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only.MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.