Introduction to Large Databases & Data Mining
Tips for Assembling Your Data Analysis Toolbox for the 22nd Century
10/05/12 Jim Heasley, Institute for Astronomy


Outline I

• Relational Databases & BIG DATA
  – Big data volumes require a new data handling paradigm
  – Advantages of a relational database
    • Organization of data
    • Data integrity
    • SQL: Structured (and almost standard) query language for queries
  – What a database is not.


Outline II

• Data mining
  – What is it?
  – Common data mining tasks
  – (FREE) Tools available to you to perform many of these tasks.


Outline III

• Examples: Imagined & Real
  – If we only had time travel…
  – Things one might start to do with Pan-STARRS data (right now).


RELATIONAL DATABASES


Basic Definitions

• Database:
  – A collection of related data organized to provide information.
• Data:
  – Known facts that can be recorded and have an implicit meaning.
  – Often integrated from several sources.
  – Stored in a standard format for use by multiple applications.
• Database Management System (DBMS):
  – A software package/system to facilitate the creation and maintenance of a computerized database.
• Database System:
  – The DBMS software together with the data itself and the hardware upon which it runs. Sometimes, the applications are also included.


Two approaches

– Generally, there are two approaches to extract information from data:
  • file processing approach
    – file-based software programs
  • database approach
    – DBMS


File processing approach

– Issues:
  • data redundancy
  • redundant processes/interfaces
  • data integrity
    – ease of maintenance
    – consistency
  • Security
    – preservation: valuable company asset
    – access control

• Each application program has a specific purpose
• Each program uses its own data

[Diagram: application programs 1 through n, each containing its own data and instructions]


Motivation for databases

– Data is a very important asset of an organization
– Motivations for databases
  • to maintain data independent from application programs
  • to avoid:
    – redundant data
    – redundant processes/interfaces
  • to enable:
    – ease of maintenance
    – sharing of data
    – data access control


Database approach

– DBMS: a “general purpose” software
  • is self-describing
  • contains
    – data
    – metadata (i.e. data about data)

[Diagram: application programs 1 through n, each containing only instructions, all connected to one DBMS that holds the data and metadata]


Main Characteristics of the Database Approach

• Self-describing nature of a database system:
  – A DBMS catalog stores the description of a particular database (e.g. data structures, types, and constraints)
• Insulation between programs and data:
  – Called program-data independence.
• Data Abstraction:
  – A data model is used to hide storage details and present the users with a conceptual view of the database.
• Support of multiple views of the data:
  – Each user may see a different view of the database, which describes only the data of interest to that user.
• Concurrent Executions
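As a small illustration of the self-describing catalog, the sketch below asks a database to describe itself. It assumes SQLite through Python's standard sqlite3 module; the table and column names (objects, ra_deg, dec_deg, mag) are hypothetical, and other DBMS products expose their catalogs through different system tables.

```python
import sqlite3

# In-memory database; the table and its columns are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects ("
    " obj_id INTEGER PRIMARY KEY,"
    " ra_deg REAL NOT NULL,"
    " dec_deg REAL NOT NULL,"
    " mag REAL)"
)

# SQLite's catalog table, sqlite_master, lists every table in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default_value, pk)
columns = [(row[1], row[2])
           for row in conn.execute("PRAGMA table_info(objects)")]

print(tables)
print(columns)
```

An application can therefore discover table and column definitions at run time instead of hard-coding a file layout.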


Characteristics of DBMS

– Data is:
  • integrated, shared, persistent
  • self-describing
– Abstraction
  • program and data independence
– Multiple views of the data
  • different users need different kinds of information
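Multiple views of the same stored data can be sketched with SQL views. This is a minimal example assuming SQLite via Python's sqlite3 module; the detections table, its columns, and the two user roles are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE detections (
        det_id INTEGER PRIMARY KEY, obj_id INTEGER,
        mag REAL, observer TEXT);
    INSERT INTO detections VALUES
        (1, 10, 17.2, 'alice'), (2, 10, 17.4, 'bob'), (3, 11, 21.9, 'alice');
    -- A photometry user only needs object ids and magnitudes.
    CREATE VIEW photometry AS SELECT obj_id, mag FROM detections;
    -- An operations user only needs to know who observed how often.
    CREATE VIEW workload AS
        SELECT observer, COUNT(*) AS n FROM detections GROUP BY observer;
""")

photometry = conn.execute("SELECT * FROM photometry ORDER BY mag").fetchall()
workload = conn.execute("SELECT * FROM workload ORDER BY observer").fetchall()
print(photometry)
print(workload)
```

Both views read the same underlying table, so there is still only one copy of the data to maintain.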


Advantages of Using the Database Approach

• Controlling redundancy
  – Sharing of data among multiple users.
• Restricting unauthorized access to data.
• Providing persistent storage for program objects
• Providing storage structures (e.g. indexes) for efficient query processing
• backup and recovery services.
• multiple interfaces to different classes of users.
• complex relationships among data.
• integrity constraints.
• Drawing inferences and actions from the stored data using deductive and active rules
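Two of the advantages listed above, integrity constraints and index structures, can be sketched as follows. The example assumes SQLite via Python's sqlite3 module; the schema, the magnitude range in the CHECK clause, and the helper try_insert are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.executescript("""
    CREATE TABLE objects (obj_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE detections (
        det_id INTEGER PRIMARY KEY,
        obj_id INTEGER NOT NULL REFERENCES objects(obj_id),
        mag REAL CHECK (mag BETWEEN -30 AND 40));
    -- Index to speed up lookups of all detections of one object.
    CREATE INDEX idx_det_obj ON detections(obj_id);
    INSERT INTO objects VALUES (1, 'star A');
    INSERT INTO detections VALUES (100, 1, 17.3);
""")

def try_insert(sql):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

ok_row = try_insert("INSERT INTO detections VALUES (101, 1, 18.0)")
orphan = try_insert("INSERT INTO detections VALUES (102, 999, 18.0)")  # no such object
bad_mag = try_insert("INSERT INTO detections VALUES (103, 1, 400.0)")  # fails CHECK
n_rows = conn.execute("SELECT COUNT(*) FROM detections").fetchone()[0]
print(ok_row, orphan, bad_mag, n_rows)
```

The rules live in the database rather than in each application, so every program that writes to these tables is held to the same checks.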


Additional advantages of the database approach

– Re-use of data across multiple applications
– Data structure and access can be changed without changing applications
– Enforcement of standards and computation of statistics
– Improved responsiveness, productivity


Additional Implications of Using the Database Approach

• Potential for enforcing standards
• Reduced application development time
• Flexibility to change data structures
• Availability of current information
  – Extremely important for on-line transaction systems such as airline, hotel, car reservations.
• Economies of scale


Disadvantages of the database approach

– Complexity
– Size (of software and application)
– Cost
– Performance
– Risk of (spectacular!) failures


When not to use a DBMS

• Main inhibitors (costs) of using a DBMS:
  – High initial investment and possible need for additional hardware.
  – Overhead for providing generality, security, concurrency control, recovery, and integrity functions.
• When a DBMS may be unnecessary:
  – If the database and applications are simple, well defined, and not expected to change.
  – If access to data by multiple users is not required.
• When no DBMS may suffice:
  – If the database system is not able to handle the complexity of data because of modeling limitations
  – If the database users need special operations not supported by the DBMS.


Database Logic

• Operations within the database are governed by standard set theory and logic. New types of databases that are built upon fuzzy sets, fuzzy logic, and fuzzy measure are currently the subject of active research, but are not (as yet) widely available.

• The two key set operations of interest in databases are INTERSECTION (the JOIN) and UNION (called the same in the DB world).
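The two set operations can be tried directly. The sketch below assumes SQLite via Python's sqlite3 module, with two made-up tables (survey_a, survey_b) that share an obj_id key. It also shows the sense in which a JOIN acts like an intersection: it keeps only keys present in both tables, while also carrying along attributes from each side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey_a (obj_id INTEGER PRIMARY KEY, mag_a REAL);
    CREATE TABLE survey_b (obj_id INTEGER PRIMARY KEY, mag_b REAL);
    INSERT INTO survey_a VALUES (1, 15.0), (2, 16.5), (3, 18.1);
    INSERT INTO survey_b VALUES (2, 16.7), (3, 18.0), (4, 20.2);
""")

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Objects seen in both surveys.
both = ids("SELECT obj_id FROM survey_a INTERSECT "
           "SELECT obj_id FROM survey_b ORDER BY obj_id")

# Objects seen in at least one survey (duplicates removed).
either = ids("SELECT obj_id FROM survey_a UNION "
             "SELECT obj_id FROM survey_b ORDER BY obj_id")

# The join keeps the same keys as the intersection, plus columns from each table.
joined = conn.execute(
    "SELECT a.obj_id, a.mag_a, b.mag_b "
    "FROM survey_a AS a JOIN survey_b AS b ON a.obj_id = b.obj_id "
    "ORDER BY a.obj_id").fetchall()

print(both, either, joined)
```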


Structured Query Language

• The user usually interacts with the database by expressing a request in SQL. Note that SQL tells the database what you want done, but not how to do it.

• There are many helpful tutorials about SQL available on the web. An excellent introduction is available at www2.aao.gov.au/2dfgrs/Public/Release/Database/sql_intro.pdf

• This introduction is sufficiently vanilla that it will get you started despite the minor variations between different flavors of SQL.


The Schema

• The logical schema defines how attributes are assigned to various tables and the definition of keys (indexes) that help to tie tables together. A user must have an understanding of the logical schema.

• The physical schema defines how the data tables are stored on the physical storage media (e.g., disks). Generally, users do not need to know the physical schema, although the system developers must leverage it to maximize the performance of their system.


User Queries

• Users develop queries to the database in a declarative language, usually some form of SQL, that builds requests for information stored in the database's tables, often making use of internal relationships inherent in the data (e.g., intersections between different tables).


The SQL Select Command

• The most frequently used SQL command (by the typical users) is the SELECT command. This is used to get (i.e. select) data from the database tables.

• The basic syntax of the SELECT command is

    SELECT (list of attributes you want)
    FROM (list of tables containing them)
    WHERE (list of limiting/restricting conditions)
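A concrete query in this SELECT / FROM / WHERE shape might look like the sketch below. It runs against SQLite through Python's sqlite3 module; the objects table, its columns, and the cut values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (obj_id INTEGER PRIMARY KEY, ra_deg REAL,
                          dec_deg REAL, mag REAL);
    INSERT INTO objects VALUES
        (1, 10.5, -1.2, 17.3), (2, 11.0, 0.4, 21.8),
        (3, 150.2, 2.1, 16.9), (4, 12.7, 0.9, 19.5);
""")

# Values are bound with "?" placeholders rather than pasted into the SQL text.
bright_nearby = conn.execute(
    "SELECT obj_id, mag "                        # attributes wanted
    "FROM objects "                              # table containing them
    "WHERE mag < ? AND ra_deg BETWEEN ? AND ? "  # restricting conditions
    "ORDER BY mag",
    (20.0, 10.0, 13.0),
).fetchall()

print(bright_nearby)
```

The query states which rows are wanted; whether the database scans the table or uses an index to find them is left to its query planner.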


What a Database isn’t!

While the column arrangement of attributes in database tables might remind the user of a spreadsheet program like Excel, a database is not a computing engine. Further, because of the nature of SQL, the user’s query simply defines what data is wanted, not how to get it. The same applies to any numerical operations the user embeds in the query: the database chooses how to execute them.


DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES

[Diagram: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]


The purpose of computing is insight, not numbers.

Richard Hamming, in the preface to his 1962 text on numerical methods.


What is Data Mining?

• Finding (meaningful) patterns in data
  – Classification
  – Association Rules
  – Cluster Analysis
  – Anomaly Detection
  – Regression

• Data mining tools have been used extensively in
  – Biology, genetics, medical research (Bioinformatics)
  – Business and Economics
  – Ecology and resource management
  – Engineering
  – Literature
  – Music
  – Voice and facial recognition


Don’t Re-invent the Wheel!


Relationship between Databases & Data Mining

• Databases are often a key component in data mining. One often finds data warehouses providing the information needed by the mining tools.

• However, one usually finds that the actual data mining operations are executed outside the database itself. Databases are excellent information servers but are not good compute engines!


Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.

• Find a model for the class attribute as a function of the values of the other attributes.

• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
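The train/test procedure can be sketched in a few lines. Everything here is an assumption made for illustration: the synthetic two-attribute records, the "star"/"galaxy" labels, the 70/30 split, and the choice of a 1-nearest-neighbor model. It is one simple classifier among many, not a recommended method.

```python
import math
import random

# Hypothetical records: (attribute_1, attribute_2, class_label).
random.seed(0)
records = (
    [(random.gauss(0.0, 0.5), random.gauss(0.0, 0.5), "star") for _ in range(50)]
    + [(random.gauss(3.0, 0.5), random.gauss(3.0, 0.5), "galaxy") for _ in range(50)]
)
random.shuffle(records)

# Divide the data: 70% to build the model, 30% held back to validate it.
split = int(0.7 * len(records))
training_set, test_set = records[:split], records[split:]

def predict(attributes):
    """Assign the class of the closest training record (1-nearest neighbor)."""
    nearest = min(training_set, key=lambda r: math.dist(r[:2], attributes))
    return nearest[2]

# Accuracy is measured only on records the model has never seen.
correct = sum(predict(r[:2]) == r[2] for r in test_set)
accuracy = correct / len(test_set)
print(f"test accuracy: {accuracy:.2f}")
```

Measuring accuracy on the training set instead would overstate how well the model handles new records.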


Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Example of association rules from market-basket transactions:

    {Diaper} → {Beer}
    {Milk, Bread} → {Eggs, Coke}
    {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
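Rules of this kind are usually scored by support (how often the items occur together) and confidence (how often the right-hand side appears when the left-hand side does). The sketch below computes both. The five transactions are illustrative data supplied here as an assumption, since no transaction table accompanies the rules.

```python
# Illustrative market-basket data (an assumption, not taken from the slides).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing antecedent, the fraction also containing consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"Diaper", "Beer"})
rule_confidence = confidence({"Diaper"}, {"Beer"})
print(f"{{Diaper}} -> {{Beer}}: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

Both numbers count co-occurrence only, so a high confidence says nothing about whether one purchase causes the other.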


What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
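The two quantities a good clustering trades off can be computed directly. In this sketch the points and their assignment to groups "A" and "B" are hypothetical, and Euclidean distance is assumed as the measure of similarity.

```python
import math
from itertools import combinations

# Hypothetical 2-D points already assigned to two groups.
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Intra-cluster: distances between pairs of points in the same group.
intra = mean(
    math.dist(p, q)
    for points in clusters.values()
    for p, q in combinations(points, 2)
)

# Inter-cluster: distances between pairs of points in different groups.
inter = mean(
    math.dist(p, q)
    for p in clusters["A"]
    for q in clusters["B"]
)

print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A clustering algorithm searches for the assignment that makes the first number small relative to the second.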


Anomaly/Outlier Detection

• What are anomalies/outliers?
  – The set of data points that are considerably different than the remainder of the data

• Variants of Anomaly/Outlier Detection Problems
  – Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  – Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  – Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

• Applications:
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
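The three problem variants can be sketched with one simple scoring function. The score used here, distance from the mean in units of the standard deviation, is an assumption chosen for brevity, as are the data values and the threshold; real applications generally need a score suited to the data.

```python
import statistics

# Hypothetical measurements; most are "normal", two are not.
D = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0, 10.0, 9.7, -4.0]

mu = statistics.mean(D)
sigma = statistics.pstdev(D)

def f(x):
    """Anomaly score of x with respect to D: |x - mean| in units of sigma."""
    return abs(x - mu) / sigma

# Variant 1: every point whose score exceeds a threshold t.
t = 2.0
above_threshold = [x for x in D if f(x) > t]

# Variant 2: the n points with the largest scores.
n = 2
top_n = sorted(D, key=f, reverse=True)[:n]

# Variant 3: score a new test point against D.
test_score = f(30.0)

print(above_threshold, top_n, round(test_score, 2))
```

Note that the outliers themselves inflate the mean and standard deviation used in the score; a median-based score is a common way to reduce that effect.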


Regression (Prediction)

Regression is the process of finding a function that describes the data so that it can be used to predict numerical (typically continuous) data values. Numerous approaches for developing the desired function exist, including classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. Prediction also encompasses the identification of distribution trends based on the available data.

Both classification and prediction may need to be preceded by relevance analysis, which attempts to identify those attributes or features that do not contribute to the classification or prediction process. These attributes can then be excluded from the analysis. A common relevance analysis technique is principal component analysis.
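The simplest "mathematical formula" case is a straight line fitted by ordinary least squares. The sketch below uses made-up (x, y) pairs that lie close to y = 2x + 1; the data and the choice of a linear model are assumptions for illustration.

```python
# Hypothetical (x, y) pairs lying close to y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Ordinary least squares for a straight line y = slope * x + intercept.
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

def predict(x):
    """Predicted numerical value at a new x."""
    return slope * x + intercept

print(f"y = {slope:.2f} x + {intercept:.2f}")
print(f"prediction at x = 5: {predict(5.0):.2f}")
```

The same fit-then-predict pattern applies when the function is a decision tree or a neural network; only the form of the fitted function changes.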


Machine Learning


Data Mining Environments

There are a large number of data mining software packages available, both commercial and open source. A search of the internet can quickly identify these. A comprehensive review of these packages is far beyond the scope of what we can deal with in this talk, so I will restrict my comments here to several well-known packages used for data analysis and mining: the R statistical analysis package, Matlab (and the open source work-alike Octave), and data mining packages Weka and Scikits.Learn.


• The R Project for Statistical Computing www.r-project.org/

• R, also called GNU S, is a strongly functional language and environment for statistically exploring data sets and making many kinds of graphical displays of data. It has very strong statistical tools.

• The basic system has been greatly expanded by the addition of packages developed by its user community


Matlab (Octave)

• MATLAB, a commercial product from MathWorks, is a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical modeling.
  http://www.mathworks.com/products/matlab/

• GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It is an open source work-alike version of MATLAB.
  http://www.gnu.org/software/octave/


Weka (Waikato Environment for Knowledge Analysis) is a well-known suite of machine learning software that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization, and feature selection. Its techniques are based on the hypothesis that the data is available as a single flat file or relation, where each data point is labeled by a fixed number of attributes. Weka provides access to SQL databases utilizing Java Database Connectivity and can process the result returned by a database query. Its main user interface is the Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.

http://www.cs.waikato.ac.nz/~ml/weka/


scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit scientific Python world (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems, accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.

Tools are available for supervised & unsupervised learning, model selection, data sets, feature extraction.

http://scikit-learn.org/stable/


Pluses, Minuses, Observations

The R and Weka software both have a large community which contributes to extending their functionality through the development of new add-on packages. Further, R and Weka can be interfaced via the RWeka package. There are many excellent on-line tutorials for these packages, and Weka itself is well described in the text Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank, & Hall. This text provides both a good underpinning of the methods and practical tutorial information. (The text is available as an e-book.)

Scikits.learn, while still fairly new (current release is version 0.7), has a very impressive collection of tools and an extensive user guide. The software is written in Python. My main reservation about this software is that while the user guide presents many examples, there is an implicit assumption that the user knows a great deal about the field of data mining. This may leave the new user somewhat in over their head in trying to determine exactly which tool best serves their need.


EXAMPLES: IMAGINARY & REAL


How could we have helped this lady?


Or these gentlemen?


Or him?


Pan-STARRS Opportunities

• The PS1 Small Area Survey (SAS), covering an area of 81 deg², overlaps with the SDSS Stripe 82. In addition to the deep Stripe 82 database, the images from this region have been examined by the Citizen Science team known as the Galaxy Zoo. This interesting overlap of resources provides data for some exciting data mining experiments.

• Star-Galaxy classification (or more precisely, Star-Galaxy-QSO classification) is an on-going challenge for the PS1 science teams. While this work has been reasonably successful, the efforts thus far seem to have attempted to get by with the simplest possible classification approach. What might happen if we performed a classification exercise wherein we use a wide range of IPP measurements (e.g., psf, Kron, Petrosian magnitude, Petrosian radii, various moments measured in individual frames and stack) with SDSS and Galaxy Zoo data providing classification “truth”?

• A similar analysis, using visual inspection of the images to identify artifacts in the PS1 images and/or stacks, might provide a robust garbage rejection process. Not necessarily glamorous but definitely important.


Empirical Photo-Z Methods

• Artificial Neural Networks
• Support Vector Machines
• Self-Organizing Maps
• Gaussian Process Regression
• Kernel Regression
• Linear/Nonlinear polynomial fitting
• Instance Based Learning & Nearest Neighbors
• Boosted Decision Trees
• Regression Trees

And these are just the ones I’ve found so far!


Galaxy Clusters?

• We all know the best way to identify clusters of galaxies is from their x-ray emission. Unfortunately, current x-ray surveys don’t provide sufficient sky & depth coverage to do this.

• Optical surveys have sufficient depth but suffer from background issues, overlapping foreground & background clusters, etc.

• It has long been hoped that in large scale optical surveys such as Pan-STARRS and LSST, we will be able to use Photo-Z values to sort out real clusters from accidental clustering of galaxies, and overlapping clusters at different distances. (Some of the PS1 partners in Taiwan are working on this problem.)


Galaxy Clusters: Can Data Mining Help?

• While there is a plethora of data mining techniques for finding clusters within data, most are probably not well suited for finding galaxy clusters. Many methods start off by assuming that one knows how many clusters are present in a given region. Clearly this is not the case with our problem. Further, we need to deal with the fact that in the 3-D representation, we have much larger uncertainty along the line of sight due to the accuracy of the Photo-Z measures.

• Some interesting work in this area has made use of a friends-of-friends approach. I think this could be generalized to include better background discrimination including the Photo-Z distribution.
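A minimal friends-of-friends sketch shows why the approach does not need the number of clusters in advance: any two points within a linking length are "friends", and groups are the chains of friends. The 2-D positions, the single isotropic linking length, and the function name are assumptions for illustration; this version does nothing about the larger line-of-sight uncertainty or background discrimination, and its all-pairs loop would be too slow for survey-sized catalogs.

```python
import math

def friends_of_friends(points, linking_length):
    """Group points so that any two points within linking_length share a group.

    Groups are built with a union-find structure; no cluster count is supplied.
    Returns a sorted list of groups, each a list of point indices.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of friends (O(n^2) comparisons).
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= linking_length:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical (x, y) positions: a chain of three, a pair, and an isolated point.
galaxies = [(0.0, 0.0), (0.8, 0.0), (1.6, 0.0),
            (10.0, 10.0), (10.5, 10.5),
            (30.0, 0.0)]
groups = friends_of_friends(galaxies, linking_length=1.0)
print(groups)
```

Points 0 and 2 are farther apart than the linking length but end up in the same group because both are friends of point 1.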


PAU
