Data Informatics Seon Ho Kim, Ph.D. [email protected]

Data Informatics - InfoLab | Welcome · Unstructured Data • Unstructured data is a generic label for describing any corporate information that is not in a database . – Textual

Page 1

Data Informatics

Seon Ho Kim, Ph.D. [email protected]

Page 2

What is Big Data?

Page 3

What is Big Data?

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Page 4

Trends leading to Data Flood

• More data is generated:
  – Bank, telecom, other business transactions...
  – Scientific data: astronomy, biology, etc.
  – Web, text, and e-commerce

Page 5

Who’s Generating Big Data

• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Page 6

Unstructured Data

• Unstructured data is a generic label for describing any corporate information that is not in a database.
  – Textual or non-textual
  – Facebook, YouTube, Twitter, Weblog, etc.
• Storage and search problem
  – Just adding more hardware to house data while ignoring its content no longer suffices

Page 7

Characteristics of Big Data: 1 - Scale (Volume)

• Data Volume
  – 44x increase from 2009 to 2020
  – From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially

Exponential increase in collected/generated data
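
To make the "exponential" claim concrete, the two figures on this slide can be turned into an implied yearly growth rate. The sketch below assumes a constant growth rate between 2009 and 2020, which the slide does not state; the function names are illustrative only.

```python
import math

def annual_growth(start, end, years):
    """Constant yearly multiplier that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years)

def doubling_time(rate):
    """Years needed to double at a constant yearly multiplier `rate`."""
    return math.log(2) / math.log(rate)

# Figures from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
rate = annual_growth(0.8, 35.0, 2020 - 2009)
print(f"overall growth: {35.0 / 0.8:.1f}x")
print(f"implied yearly multiplier: {rate:.2f}")
print(f"implied doubling time: {doubling_time(rate):.1f} years")
```

Under that assumption the volume grows by roughly 41% per year, so it doubles about every two years.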

Page 8

Characteristics of Big Data: 2 - Complexity (Variety)

• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many types of data

To extract knowledge → all these types of data need to be linked together

Page 9

Characteristics of Big Data: 3 - Speed (Velocity)

• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missing opportunities
• Examples
  – E-Promotions: Based on your current location, your purchase history, what you like → send promotions right now for the store next to you
  – Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
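
The healthcare monitoring example can be sketched as an online check that inspects each reading as it arrives instead of waiting for a batch job. This is a minimal illustration, not a clinical method: the rolling window size, the 3-sigma threshold, and the sample heart-rate values are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent = deque(maxlen=window)  # only the last `window` normal readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)

# Hypothetical stream of heart-rate samples (beats per minute).
heart_rate = [72, 74, 71, 73, 72, 75, 140, 73, 74]
alerts = list(detect_anomalies(heart_rate))
print(alerts)
```

Because the function is a generator, an alert can be acted on as soon as the abnormal reading is seen, which is the point of the velocity requirement.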

Page 10

Big Data: 3 V’s

Page 11

Some Make it 4 V’s

Page 14

The Model Has Changed…

• The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

Page 15

More Formally Big Data

• Big data is a term for datasets that are so large or complex that traditional data processing applications are inadequate.
• Challenges include:
  – Management (capture, store, process, share, etc.). For example, the Hadoop ecosystem.
  – Analysis (predictive analysis or others to extract value from data). For example, machine learning.
  – Privacy: open question
• Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk.

Page 16

Management

Page 17

Exploring Big Data

Gathering & preparing data (95%)

• The time for developing an analysis (initially working with big data)
• ETL process: taking a raw feed of data, reading it, and producing a usable set of output

Analyzing data (5%)

Extract → Transform → Load
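
The three ETL stages can be sketched with the Python standard library alone. The raw feed, the column names, and the `sales` table below are hypothetical, chosen only to show one malformed record being dropped during the transform step.

```python
import csv
import io
import sqlite3

# Hypothetical raw feed: inconsistent spacing, casing, and one bad amount.
RAW_FEED = """customer,amount,currency
alice, 12.50 ,usd
bob,not-a-number,usd
carol,7,USD
"""

def extract(text):
    """Extract: read the raw feed into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize fields and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # skip malformed records
        clean.append((row["customer"].strip(), amount, row["currency"].strip().upper()))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a queryable table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_FEED)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(row_count, total)
```

Even in this toy case most of the code handles reading and cleaning, while the analysis is a single query, which mirrors the 95% / 5% split on the slide.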

Page 18

Why Machine Learning?

• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
• There is no need to “learn” to calculate payroll
• Learning is used when:
  – Human expertise does not exist (navigating on Mars)
  – Humans are unable to explain their expertise (speech recognition)
  – Solution changes in time (routing on a computer network)
  – Solution needs to be adapted to particular cases (user biometrics)

Page 19

What We Talk About When We Talk About “Learning”

• Learning models from data of particular examples
• Data is cheap and abundant; knowledge is expensive and scarce.
• Example in retail: customer transactions to consumer behavior: People who bought “X” also bought “Y”
• Build a model that is a good and useful approximation to the data.
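
The retail example ("people who bought X also bought Y") is commonly quantified with two numbers, support and confidence. The sketch below computes both for one candidate rule; the shopping baskets are made-up data, and `rule_stats` is a name introduced only for this illustration.

```python
def rule_stats(transactions, x, y):
    """Return (support, confidence) for the rule 'bought x => also bought y'."""
    n = len(transactions)
    has_x = sum(1 for t in transactions if x in t)
    has_both = sum(1 for t in transactions if x in t and y in t)
    support = has_both / n                        # share of all baskets with x and y
    confidence = has_both / has_x if has_x else 0.0  # share of x-baskets that also have y
    return support, confidence

# Hypothetical customer transactions (one set of items per basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

support, confidence = rule_stats(transactions, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here three of the five baskets contain both items, and three of the four baskets with diapers also contain beer.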

Page 20

What is Machine Learning?

• Optimize a performance criterion using example data or past experience.
• Role of Statistics:
  – Build mathematical models
  – Inference from samples
• Role of Computer Science:
  – Efficient algorithms to
    • Solve the optimization problem
    • Represent and evaluate the model for inference
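
"Optimize a performance criterion using example data" can be illustrated with the simplest case: fitting a line by minimizing mean squared error with gradient descent. The choice of criterion, the learning rate, the step count, and the data points are all assumptions made for this sketch; the slides do not prescribe a particular algorithm.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error on every example under the current model.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Example data generated from y = 2x + 1 (no noise, for clarity).
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(f"w={w:.3f} b={b:.3f}")
```

The statistics side supplies the model and the criterion; the loop is the computer-science side, an algorithm that solves the resulting optimization problem.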

Page 21

The Structure of Big Data

• Structured: Most traditional data sources
• Semi-structured: Many sources of big data
• Unstructured: Video data, audio data

Page 22

Applications

• Association
• Supervised Learning: learning from known values
  – Classification (Recognition)
  – Regression
• Unsupervised Learning: learning from unknown values
  – Clustering (Grouping)
• Reinforcement Learning: learning a policy, a sequence of outputs
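
The difference between learning from known and unknown values can be shown side by side on one-dimensional data. The sketch below uses a nearest-neighbor classifier for the supervised case and a two-cluster k-means for the unsupervised case; both are stand-ins picked for brevity, and the data and labels are invented.

```python
def nearest_neighbor(train, query):
    """Supervised: predict the label of the closest labeled example."""
    _, label = min(train, key=lambda pair: abs(pair[0] - query))
    return label

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D k-means, k=2)."""
    centers = [min(values), max(values)]  # simple deterministic start
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Supervised: the training examples come with known labels.
labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
prediction = nearest_neighbor(labeled, 7.2)

# Unsupervised: only raw values, the grouping has to be discovered.
clusters = two_means([1.0, 1.5, 2.0, 8.0, 9.0, 8.5])
print(prediction, clusters)
```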

Page 23

Techniques Creating Business Values

Anomaly or Outlier detection

Association rule learning

Clustering analysis

Classification analysis

Regression analysis

Page 24

Big Data Visualization

Page 25

Big Data Analysis Example

Page 26

What’s driving Big Data

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

Page 27

Value of Big Data Analytics

• Big data is more real-time in nature than traditional Data Warehouse applications
• Traditional DW architectures are not well-suited for big data apps
• Shared nothing, massively parallel processing, scale out architectures are well-suited for big data applications
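
The shared-nothing idea can be sketched in a few lines: records are routed to nodes by key, each node aggregates only its own partition, and the partial results are merged at the end. The sketch simulates the nodes as plain lists inside one process; in a real scale-out system each partition would live on a separate machine. The function names and the click data are hypothetical.

```python
import zlib
from collections import Counter

def partition(records, num_nodes):
    """Route each (key, value) record to one node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        nodes[zlib.crc32(key.encode()) % num_nodes].append((key, value))
    return nodes

def local_aggregate(node_records):
    """Work done on one node using only its own data (nothing shared)."""
    totals = Counter()
    for key, value in node_records:
        totals[key] += value
    return totals

def scale_out_sum(records, num_nodes=3):
    """Aggregate per node, then merge the independent partial results."""
    partials = map(local_aggregate, partition(records, num_nodes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # a key lives on exactly one node, so no conflicts
    return dict(merged)

clicks = [("home", 1), ("cart", 2), ("home", 3), ("search", 1), ("cart", 1)]
totals = scale_out_sum(clicks)
print(totals)
```

Because no node needs another node's data, adding nodes changes only how the records are spread out, not the result.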

Page 28

Challenges in Handling Big Data

• The bottleneck is in technology
  – New architecture, algorithms, techniques are needed
• Also in technical skills
  – Experts in using the new technology and dealing with big data

Page 29

Big Data Summary

• Big Data is being generated everywhere
  – Human and machines
• Big Data analysis is already everywhere
• Still Risks:
  – Overwhelmed: right problem, right person?
  – Cost escalates fast: how much data, accuracy?
  – Privacy issue: what is tolerable?
• Big potential for new startup business too!