Introduction to Large Databases & Data Mining
Tips for Assembling Your Data Analysis Toolbox for the 22nd Century
10/05/12 Jim Heasley, Institute for Astronomy

Outline - I
• Relational Databases & BIG DATA
– Big data volumes require a new data handling paradigm
– Advantages of a relational database
  • Organization of data
  • Data integrity
  • SQL: Structured (and almost standard) query language for queries
– What a database is not.

Outline - II
• Data mining
– What is it?
– Common data mining tasks
– (FREE) Tools available to you to perform many of these tasks.

Outline - III
• Examples: Imagined & Real
– If we only had time travel…
– Things one might start to do with PAN-STARRS data (right now).

RELATIONAL DATABASES

Basic Definitions
• Database:
– A collection of related data organized to provide information.
• Data:
– Known facts that can be recorded and have an implicit meaning.
– Often integrated from several sources.
– Stored in a standard format for use by multiple applications.
• Database Management System (DBMS):
– A software package/system to facilitate the creation and maintenance of a computerized database.
• Database System:
– The DBMS software together with the data itself and the hardware upon which it runs. Sometimes, the applications are also included.

Two approaches
– Generally, there are two approaches to extract information from data:
  • file processing approach
    – file-based software programs
  • database approach
    – DBMS

File processing approach
– Issues:
  • data redundancy
  • redundant processes/interfaces
  • data integrity
    – ease of maintenance
    – consistency
  • security
    – preservation: valuable company asset
    – access control
• Each application program has a specific purpose
• Each program uses its own data
[Figure: application programs 1 … n, each with its own data and instructions]

Motivation for databases
– Data is a very important asset of an organization
– Motivations for databases
  • to maintain data independent from application programs
  • to avoid:
    – redundant data
    – redundant processes/interfaces
  • to enable:
    – ease of maintenance
    – sharing of data
    – data access control

Database approach
– DBMS: a "general purpose" software
  • is self-describing
  • contains
    – data
    – metadata (i.e. data about data)
[Figure: application programs 1 … n hold only instructions; the DBMS holds the data and metadata they share]

Main Characteristics of the Database Approach
• Self-describing nature of a database system:
– A DBMS catalog stores the description of a particular database (e.g. data structures, types, and constraints)
• Insulation between programs and data:
– Called program-data independence.
• Data Abstraction:
– A data model is used to hide storage details and present the users with a conceptual view of the database.
• Support of multiple views of the data:
– Each user may see a different view of the database, which describes only the data of interest to that user.
• Concurrent Executions
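
The self-describing catalog can be seen in a few lines of Python. This is only a sketch using the SQLite engine bundled with Python; the table name objects and its columns are made up for illustration, and other DBMSs expose their catalogs under different names.

```python
import sqlite3

# Hypothetical in-memory database with one illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects ("
    " obj_id INTEGER PRIMARY KEY,"
    " ra REAL NOT NULL,"
    " dec REAL NOT NULL,"
    " mag_r REAL)"
)

# SQLite's catalog table (sqlite_master) lists the tables the database holds.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(objects)")}

print(tables)   # table names, read back from the database itself
print(columns)  # column name -> declared type
```

The program never hard-codes the table layout when reading it back; the description comes from the database, which is the point of the first bullet above.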

Characteristics of DBMS
– Data is:
  • integrated, shared, persistent
  • self-describing
– Abstraction
  • program and data independence
– Multiple views of the data
  • different users need different kinds of information

Advantages of Using the Database Approach
• Controlling redundancy
– Sharing of data among multiple users.
• Restricting unauthorized access to data.
• Providing persistent storage for program objects
• Providing storage structures (e.g. indexes) for efficient query processing
• Providing backup and recovery services.
• Providing multiple interfaces to different classes of users.
• Representing complex relationships among data.
• Enforcing integrity constraints.
• Drawing inferences and actions from the stored data using deductive and active rules
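
Two of these advantages, integrity constraints and indexes, are easy to try out. The sketch below again assumes SQLite via Python's sqlite3 module; the tables objects and detections, the magnitude limits in the CHECK clause, and the helper try_insert are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE objects (obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL)")
conn.execute(
    "CREATE TABLE detections ("
    " det_id INTEGER PRIMARY KEY,"
    " obj_id INTEGER NOT NULL REFERENCES objects(obj_id),"
    " mag REAL CHECK (mag BETWEEN -5 AND 40))"
)
# A storage structure that speeds up look-ups of detections by object.
conn.execute("CREATE INDEX idx_det_obj ON detections(obj_id)")

conn.execute("INSERT INTO objects VALUES (1, 150.1, 2.2)")
conn.execute("INSERT INTO detections VALUES (1, 1, 21.3)")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO detections VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

orphan_accepted = try_insert((2, 99, 20.0))    # refers to an object that does not exist
bad_mag_accepted = try_insert((3, 1, 400.0))   # magnitude outside the allowed range
print(orphan_accepted, bad_mag_accepted)
```

Both bad rows are refused by the database itself, so every application sharing the data gets the same protection without re-implementing the checks.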

Additional advantages of the database approach
– Re-use of data across multiple applications
– Data structure and access can be changed without changing applications
– Enforcement of standards and computation of statistics
– Improved responsiveness, productivity

Additional Implications of Using the Database Approach
• Potential for enforcing standards
• Reduced application development time
• Flexibility to change data structures
• Availability of current information
– Extremely important for on-line transaction systems such as airline, hotel, car reservations.
• Economies of scale

Disadvantages of the database approach
– Complexity
– Size (of software and application)
– Cost
– Performance
– Risk of (spectacular!) failures

When not to use a DBMS
• Main inhibitors (costs) of using a DBMS:
– High initial investment and possible need for additional hardware.
– Overhead for providing generality, security, concurrency control, recovery, and integrity functions.
• When a DBMS may be unnecessary:
– If the database and applications are simple, well defined, and not expected to change.
– If access to data by multiple users is not required.
• When no DBMS may suffice:
– If the database system is not able to handle the complexity of data because of modeling limitations
– If the database users need special operations not supported by the DBMS.

Database Logic
• Operations within the database are governed by standard set theory and logic. New types of databases that are built upon fuzzy sets, fuzzy logic, and fuzzy measure are currently the subject of active research, but are not (as yet) widely available.
• The two key set operations of interest in databases are INTERSECTION (the JOIN) and UNION (called the same in the DB world).
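
A small sketch of the two operations, assuming SQLite through Python's sqlite3 module. The tables survey_a and survey_b and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey_a (obj_id INTEGER PRIMARY KEY, mag_r REAL);
    CREATE TABLE survey_b (obj_id INTEGER PRIMARY KEY, redshift REAL);
    INSERT INTO survey_a VALUES (1, 18.2), (2, 19.5), (3, 20.1);
    INSERT INTO survey_b VALUES (2, 0.12), (3, 0.34), (4, 0.56);
""")

# JOIN: keep only the objects present in both tables, combining their columns.
joined = conn.execute(
    "SELECT a.obj_id, a.mag_r, b.redshift "
    "FROM survey_a AS a JOIN survey_b AS b ON a.obj_id = b.obj_id "
    "ORDER BY a.obj_id"
).fetchall()

# UNION: every object id that appears in either table, duplicates removed.
united = conn.execute(
    "SELECT obj_id FROM survey_a UNION SELECT obj_id FROM survey_b ORDER BY obj_id"
).fetchall()

print(joined)
print(united)
```

Note that the JOIN behaves like an intersection on the key (obj_id) while also gathering the attributes from both tables into each result row.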

Structured Query Language
• The user usually interacts with the database by expressing what she/he wants to accomplish as a request in SQL. Note that SQL tells the database what you want to do, but not how to do it.
• There are many helpful tutorials about SQL available on the web. An excellent introduction is available at
www2.aao.gov.au/2dfgrs/Public/Release/Database/sql_intro.pdf
• This introduction is sufficiently vanilla that it will get you started despite the minor variations between different flavors of SQL.
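
One way to see the "what, not how" split is to ask the database how it intends to run a query. The sketch below assumes SQLite, whose EXPLAIN QUERY PLAN command reports the chosen strategy; the wording of its output varies between SQLite versions, and the table, index name, and helper plan are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (obj_id INTEGER PRIMARY KEY, mag_r REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?)",
                 [(i, 15.0 + 0.01 * i) for i in range(1000)])

# The query text says only WHAT is wanted.
query = "SELECT obj_id FROM objects WHERE mag_r < 16.0"

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(query)                                 # no index yet: full table scan
conn.execute("CREATE INDEX idx_mag ON objects(mag_r)")
plan_after = plan(query)                                  # same query, different strategy

print(plan_before)
print(plan_after)
```

The query string is identical in both calls; only the database's choice of HOW changes once the index exists.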

The Schema
• The logical schema defines how attributes are assigned to various tables and the definition of keys (indexes) that help to tie tables together. A user must have an understanding of the logical schema.
• The physical schema defines how the data tables are stored on the physical storage media (e.g., disks). Generally, users do not need to know the physical schema, although the system developers must leverage this to maximize the performance of their system.

User Queries
• Users develop queries to the database in a declarative language, usually some form of SQL, that builds requests for information stored in the database's tables, often making use of internal relationships inherent in the data (e.g., intersections between different tables).

The SQL Select Command
• The most frequently used SQL command (by the typical users) is the SELECT command. This is used to get (i.e. select) data from the database tables.
• The basic syntax of the SELECT command is
SELECT (list of attributes you want)
FROM (list of tables containing them)
WHERE (list of limiting/restricting conditions)
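
A concrete instance of the SELECT / FROM / WHERE pattern, as a sketch run against SQLite from Python. The objects table, its columns, and the cut values are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag_r REAL);
    INSERT INTO objects VALUES
        (1, 10.5, -1.0, 17.2),
        (2, 11.0,  0.5, 21.8),
        (3, 12.3,  1.1, 19.4);
""")

rows = conn.execute(
    "SELECT obj_id, mag_r "          # list of attributes you want
    "FROM objects "                  # table containing them
    "WHERE mag_r < ? AND dec > ? "   # limiting/restricting conditions
    "ORDER BY mag_r",
    (20.0, -2.0),                    # values bound to the two ? placeholders
).fetchall()

print(rows)
```

Passing the cut values as parameters rather than pasting them into the string keeps the query reusable and avoids quoting mistakes.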

What a Database isn't!
While the column arrangement of attributes in database tables might remind the user of a spreadsheet program like Excel, a database is not a computing engine. Further, because of the nature of SQL, the user's query simply defines what data is wanted, not how to get it. That also includes how the database may choose to execute numerical operations the user embeds in the query.

DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

The purpose of computing is insight, not numbers.
Richard Hamming, in the preface to his 1962 text on numerical methods.

What is Data Mining?
• Finding (meaningful) patterns in data
– Classification
– Association Rules
– Cluster Analysis
– Anomaly Detection
– Regression
• Data mining tools have been used extensively in
– Biology, genetics, medical research (Bioinformatics)
– Business and Economics
– Ecology and resource management
– Engineering
– Literature
– Music
– Voice and facial recognition

Don't Re-invent the Wheel!

Relationship between Databases & Data Mining
• Databases are often a key component in data mining. One often finds data warehouses providing the information needed by the mining tools.
• However, one usually finds that the actual data mining operations are executed outside the database itself. Databases are excellent information servers but are not good compute engines!
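
The division of labor can be sketched as follows: the database selects the rows, and the analysis environment does the arithmetic. SQLite and Python's statistics module stand in for the warehouse and the mining tool here; the photometry table and its values are invented.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photometry (obj_id INTEGER, mag_g REAL, mag_r REAL)")
conn.executemany(
    "INSERT INTO photometry VALUES (?, ?, ?)",
    [(1, 18.9, 18.3), (2, 20.4, 19.6), (3, 21.0, 20.5),
     (4, 19.7, 19.2), (5, 25.0, None)],   # object 5 has no r-band measurement
)

# Step 1: the database serves the information (selection only).
rows = conn.execute(
    "SELECT obj_id, mag_g, mag_r FROM photometry WHERE mag_r IS NOT NULL"
).fetchall()

# Step 2: the computation happens outside the database.
colors = [g - r for _, g, r in rows]
mean_color = statistics.mean(colors)
median_color = statistics.median(colors)

print(len(rows), round(mean_color, 3), round(median_color, 3))
```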

Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
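
The train/test procedure can be sketched in plain Python. The classifier here is a 1-nearest-neighbor rule, chosen only because it fits in a few lines; the synthetic two-class data, the class labels, the 70/30 split, and all function names are assumptions for illustration.

```python
import math
import random

def make_data(seed=0, n_per_class=50):
    """Synthetic records: ((attribute_1, attribute_2), class_label)."""
    rng = random.Random(seed)
    data = []
    for label, center in (("star", (0.0, 0.0)), ("galaxy", (3.0, 3.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 0.5), rng.gauss(center[1], 0.5))
            data.append((point, label))
    rng.shuffle(data)
    return data

def predict_1nn(train, x):
    """The 'model': assign the class of the closest training record."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true one."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

records = make_data()
split = int(0.7 * len(records))            # 70% for training, 30% for testing
train_set, test_set = records[:split], records[split:]
test_accuracy = accuracy(train_set, test_set)
print(len(train_set), len(test_set), test_accuracy)
```

The test records play no part in building the model, so the reported accuracy is an estimate of how well unseen records would be classified.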

Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
[Table: Market-Basket transactions]
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence, not causality!
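
Rules like these are usually ranked by support (how often the items occur together) and confidence (how often the right-hand side appears when the left-hand side does). The sketch below computes both; the five transactions are hypothetical stand-ins, since the slide's own table is not reproduced here.

```python
# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)

print(support({"Diaper", "Beer"}), confidence({"Diaper"}, {"Beer"}))
print(support({"Beer", "Bread", "Milk"}), confidence({"Beer", "Bread"}, {"Milk"}))
```

Both numbers are counts of co-occurrence, which is why a high-confidence rule says nothing about causality.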

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
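
As one concrete example of a clustering method, here is a bare-bones k-means sketch in plain Python. k-means is a stand-in chosen for brevity, not a method the slides single out; the points and starting centroids are invented, and the number of clusters must be supplied by the user.

```python
import math

def kmeans(points, centroids, n_iter=20):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]                 # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:        # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.3)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(centroids)
print(clusters)
```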

Anomaly/Outlier Detection
• What are anomalies/outliers?
– The set of data points that are considerably different than the remainder of the data
• Variants of Anomaly/Outlier Detection Problems
– Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
– Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
– Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications:
– Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
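
The three variants can be sketched with a simple anomaly score. The score f(x) used here, the distance from the median in units of the median absolute deviation, is only one possible choice and is an assumption of this example, as are the data values and function names.

```python
import statistics

def score(x, d):
    """Anomaly score f(x) of x with respect to data set d: distance from the
    median in units of the median absolute deviation (assumed nonzero)."""
    med = statistics.median(d)
    mad = statistics.median([abs(v - med) for v in d])
    return abs(x - med) / mad

def above_threshold(d, t):
    """Variant 1: all points in d whose score exceeds the threshold t."""
    return [x for x in d if score(x, d) > t]

def top_n(d, n):
    """Variant 2: the n points in d with the largest scores."""
    return sorted(d, key=lambda x: score(x, d), reverse=True)[:n]

D = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7]

outliers = above_threshold(D, t=5.0)
worst_two = top_n(D, n=2)
new_point_score = score(10.05, D)   # Variant 3: score a test point against D
print(outliers, worst_two, new_point_score)
```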

Regression (Prediction)
Regression is the process of finding a function that describes the data for the purpose of being able to predict numerical (typically continuous) data values. Numerous approaches for developing the desired function exist, including classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. Prediction also encompasses the identification of distribution trends based on the available data.
Both classification and prediction may need to be preceded by relevance analysis, which attempts to identify those attributes or features that do not contribute to the classification or prediction process. These attributes can then be excluded from the analysis. A common relevance analysis technique is principal component analysis.
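
The simplest "mathematical formula" case is a straight line fitted by ordinary least squares. The sketch below is plain Python with made-up data; fit_line is a hypothetical helper, not part of any package mentioned in this talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up measurements that roughly follow a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept   # predict a value at an unseen x
print(slope, intercept, predicted)
```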

Machine Learning

Data Mining Environments
There are a large number of data mining software packages available, both commercial and open source. A search of the internet can quickly identify these. A comprehensive review of these packages is far beyond the scope of what we can deal with in this talk, so I will restrict my comments here to several well-known packages used for data analysis and mining: the R statistical analysis package, Matlab (and the open source work-alike Octave), and the data mining packages Weka and Scikits.Learn.

• The R Project for Statistical Computing www.r-project.org/
• R, also called GNU S, is a strongly functional language and environment to statistically explore data sets and make many graphical displays of data. Very strong statistical tools.
• The basic system has been greatly expanded by the addition of packages developed by its user community

Matlab (Octave)
• MATLAB, a commercial product from MathWorks, is a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical modeling.
http://www.mathworks.com/products/matlab/
• GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It is an open source work-alike version of MATLAB.
http://www.gnu.org/software/octave/

Weka (Waikato Environment for Knowledge Analysis) is a well-known suite of machine learning software that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization, and feature selection. Its techniques are based on the hypothesis that the data is available as a single flat file or relation, where each data point is labeled by a fixed number of attributes. Weka provides access to SQL databases utilizing Java Database Connectivity and can process the result returned by a database query. Its main user interface is the Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.
http://www.cs.waikato.ac.nz/~ml/weka/

scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit scientific Python world (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems, accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.
Tools are available for supervised & unsupervised learning, model selection, data sets, feature extraction.
http://scikit-learn.org/stable/

Pluses, Minuses, Observations
The R and Weka software both have a large community which contributes to extending their functionality through the development of new add-on packages. Further, R and Weka can be interfaced via the RWeka package. There are many excellent on-line tutorials for these packages, and Weka itself is well described in the text Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank, & Hall. This text provides both a good underpinning of the methods and practical tutorial information. (The text is available as an e-book.)
Scikits.learn, while still fairly new (current release is version 0.7), has a very impressive collection of tools and an extensive user guide. The software is written in Python. My main reservation about this software is that while the user guide presents many examples, there is an implicit assumption that the user knows a great deal about the field of data mining. This may leave the new user somewhat in over their head in trying to determine exactly which tool best serves their need.

EXAMPLES - IMAGINARY & REAL

How could we have helped this lady?

Or these gentlemen?

Or him?

Pan-STARRS Opportunities
• The PS1 Small Area Survey (SAS), covering an area of 81 deg², overlaps with the SDSS Stripe 82. In addition to the deep Stripe 82 database, the images from this region have been examined by the Citizen Science team known as the Galaxy Zoo. This interesting overlap of resources provides data for some exciting data mining experiments.
• Star-Galaxy classification (or more precisely, Star-Galaxy-QSO classification) is an on-going challenge for the PS1 science teams. While this work has been reasonably successful, the efforts thus far seem to have attempted to get by with the simplest possible classification approach. What might happen if we performed a classification exercise wherein we use a wide range of IPP measurements (e.g., psf, Kron, Petrosian magnitude, Petrosian radii, various moments measured in individual frames and stack) with SDSS and Galaxy Zoo data providing classification "truth"?
• A similar analysis, using visual inspection of the images to identify artifacts in the PS1 images and/or stacks, might provide a robust garbage rejection process. Not necessarily glamorous but definitely important.

Empirical Photo-Z Methods
• Artificial Neural Networks
• Support Vector Machines
• Self-Organizing Maps
• Gaussian Process Regression
• Kernel Regression
• Linear/Nonlinear polynomial fitting
• Instance Based Learning & Nearest Neighbors
• Boosted Decision Trees
• Regression Trees
And these are just the ones I've found so far!

Galaxy Clusters?
• We all know the best way to identify clusters of galaxies is from their x-ray emission. Unfortunately, current x-ray surveys don't provide sufficient sky & depth coverage to do this.
• Optical surveys have sufficient depth but suffer from background issues, overlapping foreground & background clusters, etc.
• It has long been hoped that in large scale optical surveys such as Pan-STARRS and LSST, we will be able to use Photo-Z values to sort out real clusters from accidental clustering of galaxies, and overlapping clusters at different distances. (Some of the PS1 partners in Taiwan are working on this problem.)

Galaxy Clusters: Can Data Mining Help?
• While there is a plethora of data mining techniques for finding clusters within data, most are probably not well suited for finding galaxy clusters. Many methods start off by assuming that one knows how many clusters are present in a given region. Clearly this is not the case with our problem. Further, we need to deal with the fact that in the 3-D representation, we have much larger uncertainty along the line of sight due to the accuracy of the Photo-Z measures.
• Some interesting work in this area has made use of a friends-of-friends approach. I think this could be generalized to include better background discrimination including the Photo-Z distribution.
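
A minimal friends-of-friends sketch in plain Python, to make the idea concrete. It is not the method of any particular paper: the separate linking lengths on the sky (link_sky) and in Photo-Z (link_z), the flat-sky coordinates, and the six toy galaxies are all assumptions for illustration. Using a looser link along the line of sight is one crude way to reflect the larger Photo-Z uncertainty.

```python
from collections import deque

def friends_of_friends(galaxies, link_sky, link_z):
    """Group galaxies given as (x, y, photo_z). Two galaxies are friends if
    their projected separation is <= link_sky and their photo-z difference
    is <= link_z; a group is everything reachable through chains of friends."""
    def are_friends(a, b):
        sky_sep = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
        return sky_sep <= link_sky and abs(a[2] - b[2]) <= link_z

    unvisited = set(range(len(galaxies)))
    groups = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        group, queue = [seed], deque([seed])
        while queue:                      # breadth-first walk over friends
            i = queue.popleft()
            friends = [j for j in unvisited
                       if are_friends(galaxies[i], galaxies[j])]
            for j in friends:
                unvisited.remove(j)
                group.append(j)
                queue.append(j)
        groups.append(sorted(group))
    return groups

# Toy catalog: (x, y, photo_z). Galaxy 3 overlaps the first group on the
# sky but lies far behind it.
galaxies = [
    (0.00, 0.00, 0.30), (0.10, 0.00, 0.33), (0.20, 0.10, 0.28),
    (0.15, 0.05, 0.80),
    (5.00, 5.00, 0.31), (5.10, 5.00, 0.29),
]
groups = friends_of_friends(galaxies, link_sky=0.2, link_z=0.06)
print(groups)
```

The number of groups is an output rather than an input, which addresses the objection raised in the first bullet; galaxies 0 and 2 are not direct friends but end up together through galaxy 1.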
PAU