Welcome to DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/...• Final Project...

Preview:

Citation preview

DS504/CS586: Big Data Analytics Application I Prof. Yanhua Li

Welcome to

Time:6:00pm–8:50pmRLoca2on:AK232

Fall2016

•  18 critiques & Next Thur we have the last critique. –  Already graded 8 of them. –  Plan to grade 1-2 more.

•  Grading –  Projects (40%)

•  Project 1 (10%) •  Project 2 (30%)

–  Finalreportsinthediscussionforum(by11:59pm12/13);–  Self-and-peerevalua2onformforproject2(by11:59PM12/13);

– WriPenwork(30%):•  Cri2ques+Projectreports(20%)•  Quiz(10%,with5%each)

– Oralwork(30%):•  Presenta2on

•  Final Project Presentation – 22 minutes each group (including Q&A) –  Schedule:

•  12/15Thu

5

Next Class: Summary and Discussion

v  Review of the semester v  Plus the last critique/review

Ar2ficialNeuralNetworks(ANN)

X1 X2 X3 Y1 0 0 01 0 1 11 1 0 11 1 1 10 0 1 00 1 0 00 1 1 10 0 0 0

X1

X2

X3

Y

Black box

Output

Input

OutputYis1ifatleasttwoofthethreeinputsareequalto1.

Ar2ficialNeuralNetworks(ANN)

X1 X2 X3 Y1 0 0 01 0 1 11 1 0 11 1 1 10 0 1 00 1 0 00 1 1 10 0 0 0

S

X1

X2

X3

Y

Black box

0.3

0.3

0.3 t=0.4

Outputnode

Inputnodes

⎩⎨⎧

=

>−++=

otherwise0 trueis if1

)( where

)04.03.03.03.0( 321

zzI

XXXIY

Ar2ficialNeuralNetworks(ANN)

•  Modelisanassemblyofinter-connectednodesandweightedlinks

•  Outputnodesumsupeachofitsinputvalueaccordingtotheweightsofitslinks

•  Compareoutputnodeagainstsomethresholdt

S

X1

X2

X3

Y

Black box

w1

t

Outputnode

Inputnodes

w2

w3

)( tXwIYi

ii −= ∑PerceptronModel

)( tXwsignYi

ii −= ∑

or

GeneralStructureofANN

Activationfunction

g(Si )Si Oi

I1

I2

I3

wi1

wi2

wi3

Oi

Neuron iInput Output

threshold, t

InputLayer

HiddenLayer

OutputLayer

x1 x2 x3 x4 x5

y

TrainingANNmeanslearningtheweightsoftheneurons

Real-worldproblemsarealwaysmessy

•  Mul2plemodels

•  Keyfeatures

•  DataSparsity

•  Whatdowedotosolveaclassifica2on/inference/predic2onproblem?–  DataCleaning

–  Featureselec2on

–  Inferencemodel

–  Evalua2on

•  Anexampleofhowtosolverealworldapplica2onproblem

U-Air:WhenUrbanAirQualityMeetsBigData

Authors:YuZheng,MicrosogResearchAsia

Background •  Airquality

–  NO2,SO2–  Aerosols:PM2.5,PM10

•  WhyitmaPers–  Healthcare–  Pollu2oncontrolanddispersal

•  Reality–  Buildingameasurementsta2on

isnoteasy–  Alimitednumberofsta2ons

(poorcoverage)

Beijingonlyhas22airqualitymonitorsta2onsinitsurbanareas(50kmx40km)

Air quality monitor station

2PM,June17,2013

Challenges•  Airqualityvariesbyloca2ons

non-linearly•  Affectedbymanyfactors

–  Weathers,traffic,landuse…–  Subtletomodelwithaclearformula

0 40 80 120 160 200 240 280 320 360 400 440 4800.00

0.05

0.10

0.15

0.20

0.25

0.30

Por

titio

n

Deviation of PM2.5 between S12 and S13

>35%Prop

or2o

n

A) Beijing (8/24/2012 - 3/8/2013)

WedonotreallyknowtheairqualityofalocaGonwithoutamonitoringstaGon!

Challenges •  Exis2ngmethodsdonotworkwell

–  Linearinterpola2on–  Classicaldispersionmodels

•  GaussianPlumemodelsandOpera2onalStreetCanyonmodels•  Manyparametersdifficulttoobtain:Vehicleemissionrates,street

geometry,theroughnesscoefficientoftheurbansurface…

–  Satelliteremotesensing•  Sufferfromclouds•  Doesnotreflectgroundairquality•  Varyinhumidity,temperature,loca2on,andseasons

–  Outsourcedcrowdsensingusingportabledevices•  Limitedtoafewgasses:CO2andCO•  Sensorsfordetec2ngaerosolarenotportable:PM10,PM2.5•  Alongperiodofsensingprocess,1-2hours

30,000+USD,10ug/m3 202×85×168(mm)

InferringReal-TimeandFine-GrainedairqualitythroughoutacityusingBigData

Meteorology Traffic POIs RoadnetworksHumanMobility

Historicalairqualitydata Real-2meairqualityreports

Applica2ons •  Loca2on-basedairqualityawareness

–  Fine-grainedpollu2onalert–  Rou2ngbasedonairquality

•  Deployingnewmonitoringsta2ons•  Asteptowardsiden2fyingtherootcauseofairpollu2on

S2

S1

S5

S3

S7

S6S4

S1

S8

S9

S10

B) Shanghai

Difficul2es •  1.Howtoiden2fyfeaturesfromeachkindofdatasource

•  2.Incorporatemul2pleheterogeneousdatasourcesintoalearningmodel–  Spa2ally-relateddata:POIs,roadnetworks–  Temporally-relateddata:traffic,meteorology,humanmobility

•  3.Datasparseness(liPletrainingdata)–  Limitednumberofsta2ons–  Manyplacestoinfer

MethodologyOverview •  Par22onacityintodisjointgrids•  Extractfeaturesforeachgridfromitsaffec2ngregion

–  Meteorologicalfeatures –  Trafficfeatures–  Humanmobilityfeatures–  POIfeatures–  Roadnetworkfeatures

•  Co-training-basedsemi-supervisedlearningmodelforeachpollutant

–  PredicttheAQIlabels–  Datasparsity–  Twoclassifiers

MeteorologicalFeatures:Fm

•  Rainy,Sunny,Cloudy,Foggy•  Windspeed•  Temperature•  Humidity•  Barometerpressure

GoodModerate

UnhealthyUnhealthy-S

Very Unhealthy

AQIofPM10AugusttoDec.2012inBeijing

TrafficFeatures:Ft

•  Distribu2onofspeedby2me:F(v)•  Expecta2onofspeed:E(V)•  Standarddevia2onofSpeed:D

Good Moderate

km

Unhealthy-S Very UnhealthyUnhealthy

0≤v<20

20≤v<40

v≥ 40

E(v)

D(v)

GPStrajectoriesgeneratedbyover30,000taxisFromAugusttoDec.2012inBeijing

HumanMobilityFeatures:Fh •  Humanmobilityimplies

–  Trafficflow–  Landuseofaloca2on–  Func2onofaregion(likeresiden2alorbusinessareas)

•  Features:–  Numberofarrivals 𝑓↓𝑎 andleavings𝑓↓𝑙 

A) AQI of PM10 B) AQI of NO2

fl fl

fa faGoodModerate

UnhealthyUnhealthy-S

Very Unhealthy

GoodModerate

UnhealthyUnhealthy-S

Very Unhealthy

Numberofarrivalsfaandleavings(departures)fl

Parksvsfactories

Extrac2ngTraffic/HumanMobilityFeatures

•  Offlinespa2o-temporalindexing•  ta:arrival2me•  Traj:trajID•  Ii:theindexofthefirstGPSpoint(inthetrajectory)enteringagrid•  Io:theindexofthelastGPSpoint(inthetrajectory)enteringthegrid

POIFeatures:Fp•  WhyPOI

–  Indicatethelanduseandthefunc2onoftheregion–  thetrafficpaPernsintheregion

•  Features–  NumbersofPOIsovercategories–  Por2onofvacantplaces–  ThechangesinthenumberofPOIs

•  Factories,shoppingmalls,•  hotelandrealestates•  Parks,decora2onandfurnituremarkets

RoadNetworkFeatures:Fr •  Whyroadnetworks

–  Haveastrongcorrela2onwithtrafficflows–  Agoodcomplementaryoftrafficmodeling

•  Features:

–  Totallengthofhighways𝑓↓ℎ –  Totallengthofother(low-level)roadsegments𝑓↓𝑟 –  Thenumberofintersec2ons𝑓↓𝑠 inthegrid’saffec2ngregion

fh

fr

fs

Semi-SupervisedLearningModel

•  Philosophyofthemodel–  Statesofairquality

•  Temporaldependencyinaloca2on•  Geo-correla2onbetweenloca2ons

–  Genera2onofairpollutants•  Emissionfromaloca2on•  Propaga2onamongloca2ons

–  Twosetsoffeatures•  Spa2ally-related•  Temporally-related

s2s1

s3

s4l

s2s1

s3

s4l

s2s1

s3

s4

ti

t1

t2

lTime

Geospace

A location with AQI labels A location to be inferred Temporal dependencySpatial correlation

POIs: Spatial

Fh Temporal

Road Networks: Fr

Ft FmMeteorologic:Traffic:Human mobility:

FpSpa2alClassifier

TemporalClassifierCo-Training

Co-Training-BasedLearningModel

•  Spa2alclassifier–  Modelthespa2alcorrela2onbetweenAQIofdifferentloca2ons–  Usingspa2ally-relatedfeatures–  BasedonaBPneuralnetwork

•  Inputgenera2on–  Selectnsta2onstopairwith–  Performmrounds

∆P1x

∆R1x

c

D1Fp

Frl1 D2

c

d1x

D1

D2

D1

D1

1

1

Fp

Frlk

k

k

Fpx Frx lx

∆Pkx

∆Rkx

cdkx

1

k

x

ANNInput generation

w'11

w'qr

w1

wr

wpq

w11 b1

bq

b'r

b'1

b''

∆P1x

∆R1x

c

D1Fp

Frl1 D2

c

d1x

D1

D2

D1

D1

1

1

Fp

Frlk

k

k

Fpx Frx lx

∆Pkx

∆Rkx

cdkx

1

k

x

ANNInput generation

w'11

w'qr

w1

wr

wpq

w11 b1

bq

b'r

b'1

b''

Co-Training-BasedLearningModel

•  Temporalclassifier–  Modelthetemporaldependencyoftheairqualityinaloca2on–  Usingtemporallyrelatedfeatures–  BasedonaLinear-ChainCondi2onalRandomField(CRF)

Yt-1

Fm(t-1) t-1Ft(t-1) Fh(t-1) Fm(t) tFt(t) Fh(t) Fm(t+1) t+1Ft(t+1) Fh(t+1)

Yt Yt-1

LearningProcess

Yt-1

Fm(t-1) t-1Ft(t-1) Fh(t-1) Fm(t) tFt(t) Fh(t) Fm(t+1) t+1Ft(t+1) Fh(t+1)

Yt Yt-1

∆P1x

∆R1x

c

D1Fp

Frl1 D2

c

d1x

D1

D2

D1

D1

1

1

Fp

Frlk

k

k

Fpx Frx lx

∆Pkx

∆Rkx

cdkx

1

k

x

ANNInput generation

w'11

w'qr

w1

wr

wpq

w11 b1

bq

b'r

b'1

b''

Training

Temporally-relatedfeatures

Spa2ally-relatedfeatures

Labeleddata Unlabeleddata

Inference

InferenceProcess

∆P1x

∆R1x

c

D1Fp

Frl1 D2

c

d1x

D1

D2

D1

D1

1

1

Fp

Frlk

k

k

Fpx Frx lx

∆Pkx

∆Rkx

cdkx

1

k

x

ANNInput generation

w'11

w'qr

w1

wr

wpq

w11 b1

bq

b'r

b'1

b''

Temporally-relatedfeatures

Spa2ally-relatedfeatures

<𝑝↓𝑐1 , 𝑝↓𝑐2 ,…,𝑝↓𝑐𝑛 >

<𝑝′↓𝑐1 , 𝑝′↓𝑐2 ,…,𝑝′↓𝑐𝑛 >

× 𝑐= arg↓𝑐↓𝑖 ∈𝒞 𝑀𝑎𝑥( 𝑝↓𝑐𝑖 × 𝑝′↓𝑐𝑖 )

< pc1 , · · · , pcn >

< p0c1 , · · · , p0cn >

c = argci2CMax(P ciSC ⇥ P

ciTC)

Yt-1

Fm(t-1) t-1Ft(t-1) Fh(t-1) Fm(t) tFt(t) Fh(t) Fm(t+1) t+1Ft(t+1) Fh(t+1)

Yt Yt-1

Evalua2on

Datasources Beijing Shanghai Shenzhen Wuhan POI 2012Q1 271,634 321,529 107,061 102,467

2012Q3 272,109 317,829 107,171 104,634

Road #.Segments 162,246 171,191 45,231 38,477 Highways 1,497km 1,963km 256km 1,193km Roads 18,525km 25,530kmKM 6,100km 9,691km #.Intersec. 49,981 70,293 32,112 25,359

AQI #.Sta2on 22 10 9 10 Hours 23,300 8,588 6,489 6,741 Timespans 8/24/2012-3/8/201

3 1/19/2013-3/8/20

13 2/4/2013-3/8/201

3 2/4/2013-3/8/2013

UrbanSize(grids) 5050km(2500) 5050km(2500) 5745km(2565) 4525km(1165)

•  Datasets

S1

S2

S4

S5

S8

S5

S2

S1

S7

S5

S3

S3

S6 S7S6

S9S10

S12

S11

S13 S14

S22S15

S16

S16

S17

S18

S19

S20

S21S3

S7

S6

S4

S1

S8

S9

S10

S1

S4

S2

S6

S9

S8

S1

S2

S4

S3S10

S5

S9

S6 S7

S8

A) Beijing B) Shanghai C) Shenzhen D) Wuhan

Evalua2on

•  GroundTruth–  Removeasta2on–  Crossci2es

•  Baselines–  LinearandGaussianInterpola2ons–  ClassicalDispersionModel–  DecisionTree(DT):–  CRF-ALL–  ANN-ALL

Evalua2on

PM10 NO2

Features Precision Recall Precision Recall

Fm 0.572 0.514 0.477 0.454 Ft 0.341 0.36 0.371 0.35 Fh 0.327 0.364 0.411 0.483 Fp+Fr 0.441 0.443 0.307 0.354

Fm+Ft 0.664 0.675 0.634 0.635 Fm+Ft+Fp+Fr 0.731 0.734 0.701 0.691

Fm+Ft+Fp+Fr+Fh 0.773 0.754 0.723 0.704

•  Doeseverykindoffeaturecount?

Evalua2on •  Overallperformanceoftheco-training

0 20 40 60 80 100 120 140 160

0.65

0.70

0.75

0.80

Pre

cisi

on

Num. of Iterations

SC TC Co-Training

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

NO2PM10

Acc

urac

y

U-Air Linear Guassian Classical DT CRF-ALL ANN-ALL

Accuracy

Evalua2on

GroundTruth

PredicGons

G M S U

G 3789 402 102 0 0.883

Recall

M 602 3614 204 0 0.818 S 41 200 532 50 0.646 U 0 22 70 219 0.704 0.855 0.853 0.586 0.814 0.828

Precision

•  ConfusionmatrixofCo-TrainingonPM10

Evalua2on

CiGes PM2.5 PM10 NO2

Prec. Rec. Prec. Rec. Prec. Rec.

Beijing 0.764 0.763 0.762 0.745 0.730 0.749

Shanghai 0.705 0.725 0.702 0.718 0.715 0.706

Shenzhen 0.740 0.737 0.710 0.742 0.732 0.722

Wuhan 0.727 0.723 0.731 0.739 0.744 0.719

•  PerformanceofSpa2alclassifier

Evalua2on

Procedures Time(ms) Procedures Time(ms) Feature

extracGon (pergrid)

Ft&Fh 53.2 Inference (pergrid)

SC 21.5 Fp 28.8 TC 13.1 Fr 14.4 Total 131

•  Efficiencystudy•  Singlegrid131s•  InferringtheAQIsforen2reBeijingin5minutes

Conclusion•  Inferfine-grainedairqualitywith

–  Real-2meandhistoricalairqualityreadingsfromexis2ngsta2ons–  Otherdatasources:meteorology,POIs,roadnetwork,humanmobility,and

trafficcondi2on

•  Co-Training-basedsemi-supervisedlearningapproach–  Dealwithdatasparsitybylearningfromunlabeleddata–  Modelthespa2alcorrela2onamongtheairqualityofdifferentloca2ons–  Modelthetemporaldependencyoftheairqualityinaloca2on

•  Results–  0.82withtrafficdata(co-training)–  0.76ifonlyusingspa2alclassifier

Ques2ons?

Recommended