TACO:
Tunable Approximate Computation of Outliers in
Wireless Sensor Networks
8 July 2010 HDMS 2010, Ayia Napa, Cyprus
* Dept. of Informatics, University of Piraeus, Piraeus, Greece
† Dept. of Informatics, Athens University of Economics and Business, Athens, Greece
# Dept. of Electronic and Computer Engineering, Technical University of Crete, Greece
Nikos Giatrakos* Yannis Kotidis† Antonios Deligiannakis#
Vasilis Vassalos† Yannis Theodoridis*
ΠΑΟ: Approximate computation of Outlier values in wireless sensor network environments (Greek rendering of the TACO title)
Outline
• Introduction
  – Why outlier detection is important
  – Definition of outlier
• The TACO Framework
  – Compression of measurements at the sensor level (LSH)
  – Outlier detection within and amongst clusters
  – Optimizations: Boosting Accuracy & Load Balancing
• Experimental Evaluation
• Related Work
• Conclusions
4
Introduction
• Wireless Sensor Networks utility
  – Place inexpensive, tiny motes in areas of interest
  – Perform continuous querying operations
  – Periodically obtain reports of quantities under study
  – Support sampling procedures, monitoring/surveillance applications, etc.
• Constraints
  – Limited power supply
  – Low processing capabilities
  – Constrained memory capacity
• Remark: data communication is the main factor of energy drain
5
Why Outlier Detection is Useful
• Outliers may denote malfunctioning sensors
  – sensor measurements are often unreliable
  – dirty readings affect computations/decisions [Deligiannakis et al, ICDE ’09]
• Outliers may also represent interesting events detected by few sensors
  – e.g., a fire detected by a single sensor
• Take into consideration
  – the recent history of samples acquired by single motes
  – correlations with measurements of other motes
6
Outlier Definition
(Example window of recent readings: 16 19 24 30 32 40 39)
• Let ui denote the latest W measurements obtained by mote Si
• Given a similarity metric sim: R^W → [0,1] and a similarity threshold Φ, sensors Si, Sj are considered similar if:
  sim(ui, uj) > Φ
• Minimum Support Requirement
  – a mote is classified as an outlier if its latest W measurements are not found to be similar with at least minSup other motes
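To make the definition concrete, here is a minimal centralized Python sketch (not TACO's distributed protocol): a mote is reported as an outlier when its window is similar, under a given sim() and threshold Φ, to fewer than minSup other windows. The cosine-based sim() and all names are illustrative assumptions.

import math

def cosine_sim(u, v):
    """Cosine similarity mapped to [0, 1]; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else max(0.0, dot / norm)

def find_outliers(windows, sim, phi, min_sup):
    """windows: {mote_id: latest W measurements}. A mote is an outlier if it is
    similar (sim > phi) to fewer than min_sup other motes."""
    ids = list(windows)
    support = {i: 0 for i in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if sim(windows[ids[a]], windows[ids[b]]) > phi:
                support[ids[a]] += 1
                support[ids[b]] += 1
    return [i for i in ids if support[i] < min_sup]

# Example: mote S3 reports values whose shape deviates from the rest.
windows = {"S1": [16, 19, 24, 30], "S2": [15, 20, 23, 31],
           "S3": [90, 5, 88, 2], "S4": [17, 18, 25, 29]}
print(find_outliers(windows, cosine_sim, phi=0.95, min_sup=2))

With the values above only S3 is reported.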
10
TACO Framework – General Idea
Network organization into clusters [(Younis et al, INFOCOM ’04), (Qin et al, J. UCS ’07)]
11
TACO Framework – General Idea
Step 1: Data Encoding and Reduction
• Motes obtain samples and keep the latest W measurements in a tumble
• Encode the W measurements in a bitmap of size d << W
(Figure: a regular sensor maps its tumble of raw readings, e.g. 8.2, 4.3, …, 5.1, to a d-bit encoding, e.g. 0 0 … 1, and reports it to its clusterhead)
12
TACO Framework – General Idea
Step 2: Intra-cluster Processing
• Encodings are transmitted to clusterheads
• Clusterheads perform similarity tests based on a given similarity measure and a similarity threshold Φ
• … and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; }
13
TACO Framework – General Idea
Step 3: Inter-cluster Processing
• An approximate TSP problem is solved; lists of potential outliers are exchanged among clusterheads
• Additional load-balancing mechanisms and improvements in accuracy are devised
14
Data Encoding and Reduction
• Step 1 recap: motes keep the latest W measurements in a tumble and encode them in a bitmap of d << W bits
• Desired Properties
  – Dimensionality Reduction: reduced bandwidth consumption
  – Similarity Preservation: allows us to later derive the initial sim(ui, uj) during vector comparisons
15
• Locality Sensitive Hashing (LSH): a family F of hash functions such that, for h drawn uniformly at random from F, Pr[h(ui) = h(uj)] = sim(ui, uj)
• Practically, any similarity measure satisfying a set of criteria [Charikar, STOC ’02] may be incorporated into TACO’s framework
16
LSH Example: Random Hyperplane Projection [(Goemans & Williamson, J.ACM ’95), (Charikar, STOC ’02)]
• Family of n d-dimensional random vectors (rvi)
• Generates for each data vector a bitmap of size n as follows:
  – Sets biti = 1 if the dot product of the data vector with the i-th random vector is positive
  – Sets biti = 0 otherwise
(Figure: a 2-dimensional sensor data vector tested against random vectors rv1-rv4, yielding the TACO encoding 1 0 1 0)
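A minimal sketch of the random hyperplane projection encoding described above. The Gaussian random vectors and the shared seed (so that every node generates the same family) are assumptions for illustration.

import random

def make_random_vectors(n, dim, seed=42):
    """n random hyperplanes for dim-dimensional windows; a shared seed lets
    every mote and clusterhead generate the same family."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

def rhp_encode(u, random_vectors):
    """bit_i = 1 if the dot product of u with the i-th random vector is positive."""
    return [1 if sum(a * b for a, b in zip(rv, u)) > 0 else 0
            for rv in random_vectors]

# Example: encode a window of W = 8 readings into n = 16 bits.
rvs = make_random_vectors(n=16, dim=8)
window = [8.2, 4.3, 5.1, 6.0, 7.4, 5.9, 6.6, 7.1]
print(rhp_encode(window, rvs))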
17
Computing Similarity
• Cosine Similarity: cos(θ(ui, uj)), where θ(ui, uj) is the angle between the two measurement vectors
• The angle is estimated from the Hamming distance of the n-bit encodings: two bitmaps that differ in Dh positions yield the estimate θ ≈ Dh/n * π
• Example: RHP(ui) = 1 0 1 1 1 1 and RHP(uj) = 0 0 0 1 1 1 differ in 2 of n = 6 bits, so θ(RHP(ui), RHP(uj)) = 2/6 * π = π/3

Supported Similarity Measures
• Cosine Similarity, cos(θ(ui, uj)): angle similarity estimated via the Hamming distance of the encodings
• Correlation Coefficient: corr(ui, uj) = cov(ui, uj)/(σui*σuj) = E[(ui - E[ui])(uj - E[uj])]/(σui*σuj) [details in paper]
• Jaccard Coefficient: Jaccard(A, B) = |A ∩ B|/|A ∪ B|, see [Gionis et al, SIGMOD ’01]
• Lp-Norms: see [Datar et al, DIMACS SDAM ’03]
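Since RHP guarantees that two encodings differ in each bit with probability θ/π, the angle (and hence the cosine similarity) can be estimated from the Hamming distance of the bitmaps. A small illustrative sketch; names are my own:

import math

def hamming(x, y):
    """Number of positions at which two equal-length bitmaps differ."""
    return sum(a != b for a, b in zip(x, y))

def estimate_angle(x, y):
    """RHP guarantees P[bits differ] = theta / pi, so Dh/n estimates theta/pi."""
    return hamming(x, y) / len(x) * math.pi

def estimate_cosine_similarity(x, y):
    return math.cos(estimate_angle(x, y))

# Running example from the slide: 2 of 6 bits differ.
xi = [1, 0, 1, 1, 1, 1]
xj = [0, 0, 0, 1, 1, 1]
print(estimate_angle(xi, xj), estimate_cosine_similarity(xi, xj))

Run on the slide's example this prints an angle of π/3 (about 1.047) and a cosine estimate of 0.5.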
19
TACO Framework
20
(Recap) Step 2: Intra-cluster Processing. Encodings are transmitted to clusterheads, which perform similarity tests based on a given similarity measure and threshold Φ and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; }
Intra-cluster Processing
• Goal: find potential outliers within each cluster's realm
• Back to our running example: sensor vectors are considered similar when θ(ui, uj) < Φθ
• Translate the user-defined similarity threshold Φθ into a Hamming-distance threshold Φh = Φθ * d/π
• For any received pair of bitmaps Xi, Xj, clusterheads can estimate the initial similarity from their Hamming distance, testing: Dh(Xi, Xj) < Φh
• At the end of the process, <Si, Xi, support> lists are extracted for motes that do not satisfy the minSup parameter
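A sketch of the clusterhead-side step, combining the threshold translation Φh = Φθ·d/π with support counting and extraction of the potential-outlier list; function and variable names are my own.

import math

def intra_cluster_potential_outliers(encodings, phi_theta, min_sup):
    """encodings: {mote_id: d-bit bitmap}. Returns <Si, Xi, support> entries for
    motes whose support stays below min_sup (potential outliers)."""
    ids = list(encodings)
    d = len(encodings[ids[0]])
    phi_h = phi_theta * d / math.pi          # translated Hamming-distance threshold
    support = {i: 0 for i in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            dh = sum(x != y for x, y in zip(encodings[ids[a]], encodings[ids[b]]))
            if dh < phi_h:                   # estimated angle below phi_theta
                support[ids[a]] += 1
                support[ids[b]] += 1
    return [(i, encodings[i], support[i]) for i in ids if support[i] < min_sup]

# Example with 4 motes, a 70-degree threshold and min_sup = 2.
enc = {"S1": [1, 0, 1, 1, 1, 1], "S2": [1, 0, 1, 1, 1, 0],
       "S3": [0, 1, 0, 0, 0, 0], "S4": [1, 0, 1, 1, 0, 1]}
print(intra_cluster_potential_outliers(enc, phi_theta=math.radians(70), min_sup=2))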
21
Intra-cluster Processing
22
Probability of correctly classifying similar motes as such (W=16, θ=5, Φθ=10), shown as a chart on the original slide.
TACO Framework
23
(Recap) Step 3: Inter-cluster Processing. An approximate TSP problem is solved and lists of potential outliers are exchanged among clusterheads.
Boosting TACO Encodings
25
• Partition each d-bit encoding into μ groups of n bits (d = n·μ), run the similarity test on each group and obtain the answer provided by the majority of the μ tests
(Figure: Xi and Xj partitioned into μ groups; the majority of the per-group tests report similarity, so SimBoosting(Xi, Xj) = 1)
• Check the quality of the boosting estimation (θ(ui, uj) ≤ Φθ):
  – Unpartitioned bitmaps: Pwrong(d) = 1 - Psimilar(d)
  – Boosting: Pwrong(d, μ) ≤ …
• Decide an appropriate μ:
  – Restriction on μ: Psimilar(d/μ) > 0.5
  – Comparison of Pwrong(d, μ) against Pwrong(d)
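A sketch of the boosting test, assuming the per-group test reuses the translated Hamming threshold on n = d/μ bits (the exact per-group threshold and the error bound are detailed in the paper):

import math

def boosted_similarity(xi, xj, mu, phi_theta):
    """Split the d-bit encodings into mu groups, test each group independently
    and return 1 if the majority of the mu tests report similarity."""
    d = len(xi)
    n = d // mu                                  # group size (assumes mu divides d)
    phi_h_group = phi_theta * n / math.pi        # assumed per-group Hamming threshold
    votes = 0
    for g in range(mu):
        a, b = xi[g * n:(g + 1) * n], xj[g * n:(g + 1) * n]
        dh = sum(x != y for x, y in zip(a, b))
        votes += dh < phi_h_group
    return 1 if votes > mu / 2 else 0

# Example: 12-bit encodings, mu = 3 groups, 60-degree threshold.
xi = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
xj = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
print(boosted_similarity(xi, xj, mu=3, phi_theta=math.radians(60)))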
Comparison Pruning
26
• The clusterhead election process is modified to additionally return B bucket nodes
• A 2nd level of hashing is introduced, based on the Hamming weight of the bitmaps
• Comparison pruning is achieved by hashing highly dissimilar bitmaps to different buckets
(Figure: the hash key space [0, d] split among four buckets, [0, d/4], (d/4, d/2], (d/2, 3d/4], (3d/4, d]; clusterhead, bucket nodes and regular sensors)
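A rough sketch of the weight-based bucket hashing: the key space [0, d] is split evenly among B buckets and each bitmap is routed by its Hamming weight, so bitmaps whose weights lie far apart are never compared. The edge-handling rule used here (also forward to the neighbouring bucket when the weight is within Φh of a boundary) is an assumption standing in for the exact rule in the paper.

def buckets_for_bitmap(x, num_buckets, phi_h):
    """Route a bitmap to the bucket owning its Hamming weight; if the weight is
    within phi_h of a bucket boundary, also route to the adjacent bucket
    (edge handling here is an illustrative assumption)."""
    d = len(x)
    width = d / num_buckets
    w = sum(x)                                   # Hamming weight, 0 <= w <= d
    primary = min(int(w / width), num_buckets - 1)
    targets = {primary}
    if primary > 0 and w - primary * width < phi_h:
        targets.add(primary - 1)
    if primary < num_buckets - 1 and (primary + 1) * width - w < phi_h:
        targets.add(primary + 1)
    return sorted(targets)

# Example: d = 16 bits, 4 buckets, Hamming threshold 2.
x = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # weight 7, near a boundary
print(buckets_for_bitmap(x, num_buckets=4, phi_h=2))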
Load Balancing Among Buckets
27
• Histogram Calculation Phase: each bucket constructs an equi-width histogram of the Hamming weights of the received Xi's
• Histogram Communication Phase: each bucket communicates to the clusterhead its estimated frequency counts and its width parameter ci
• Hash Key Space Reassignment: the clusterhead determines a new space partitioning and broadcasts the corresponding information
(Figure example: buckets SB1-SB4 (SB2 = SC) report [f=(0,0,1), c1=d/12], [f=(3,3,2,2), c2=d/16], [f=(3,3,4,6), c3=d/16], [f=(1,0,0), c4=d/12]; the reassigned ranges become SB1 [0, 3d/8], SB2 (3d/8, 9d/16], SB3 (9d/16, 11d/16], SB4 (11d/16, d], as in the sketch after this slide)
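A sketch of the reassignment computation at the clusterhead: merge the per-bucket equi-width histograms into one view of the key space and cut it into ranges of roughly equal load. Splitting only at histogram-bin edges and passing each bucket's range start alongside its report are my assumptions; with d = 48 the example reproduces the ranges shown on the slide.

def reassign_ranges(histograms, d, num_buckets):
    """histograms: list of (counts, bin_width, range_start) per bucket, as reported
    to the clusterhead. Returns new (start, end] ranges balancing the load."""
    # Flatten into (bin_end, count) pairs over the whole key space [0, d].
    bins = []
    for counts, width, start in histograms:
        for i, c in enumerate(counts):
            bins.append((start + (i + 1) * width, c))
    bins.sort()
    total = sum(c for _, c in bins)
    target = total / num_buckets
    ranges, acc, start = [], 0, 0
    for end, c in bins:
        acc += c
        if acc >= target and len(ranges) < num_buckets - 1:
            ranges.append((start, end))
            start, acc = end, 0
    ranges.append((start, d))
    return ranges

# Example loosely following the slide: 4 buckets over [0, d], d = 48.
d = 48
hists = [([0, 0, 1], 4, 0), ([3, 3, 2, 2], 3, 12), ([3, 3, 4, 6], 3, 24), ([1, 0, 0], 4, 36)]
print(reassign_ranges(hists, d, num_buckets=4))   # [(0, 18), (18, 27), (27, 33), (33, 48)]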
Outline
• Introduction
  – Why is Outlier Detection Important and Difficult
• Our Contributions
  – Outlier detection with limited bandwidth
  – Compute measurement similarity over compressed representations of measurements (LSH)
• The TACO Framework
  – Compression of measurements at the sensor level
  – Outlier detection within and amongst clusters
• Optimizations: Load Balancing & Comparison Pruning
• Experimental Evaluation
• Related Work
• Conclusions
28
Sensitivity Analysis: Intel Lab Data, Temperature
29
(Charts: Average Recall and Average Precision vs. similarity angle (10-30) for 1/2, 1/4, 1/8 and 1/16 reduction; TumbleSize=16, support=4)
Sensitivity Analysis: Boosting, Intel Lab Data, Humidity
30
(Charts: Average Recall and Average Precision vs. TumbleSize (16-32) for 1, 4 and 8 boosting groups; Reduction=1/8, support=4, Φθ=30)
Performance Evaluation in TOSSIM
31
• For 1/8 reduction, TACO consumes on average 1/12 of the bandwidth of the non-TACO alternatives, and as little as 1/15 at best
(Chart: Min/Average/Max total bits transmitted per tumble for TACO with 1/16, 1/8 and 1/4 reduction vs. NonTACO and SelectStar)
Performance Evaluation in TOSSIM
32
• Network Lifetime: the epoch at which the first mote in the network dies
• Average lifetime for motes initialized with 5000 mJ of residual energy
• The reduction in power consumption reaches a ratio of 1/2.7
(Chart: average lifetime in epochs for TACO 1/4 reduction vs. NonTACO and SelectStar)
TACO vs Hierarchical Outlier Detection Techniques
33
• Robust [Deligiannakis et al, ICDE ’09] falls short of TACO by up to 10% in terms of the F-Measure metric
• TACO ensures lower bandwidth consumption, with a ratio varying from 1/2.6 to 1/7.8
(Charts: F-Measure vs. support (1-4) for Robust and TACO with 1/4 and 1/8 reduction, and average bits transmitted per tumble per support value for 1/4, 1/8 and 1/16 reduction, split into TACO inter-cluster and remaining traffic; TumbleSize=16, Corr_Threshold=cos(30)≈0.87)
Outline (repeated; identical to the outline slide above)
34
Related Work - Ours
• Outlier reports on par with aggregate query answers [Kotidis et al, MobiDE ’07]
  – hierarchical organization of motes
  – takes into account temporal & spatial correlations as well
  – reports aggregates, witnesses & outliers
• Outlier-aware routing [Deligiannakis et al, ICDE ’09]
  – routes outliers towards motes that can potentially witness them
  – validates the detection scheme for different similarity metrics (correlation coefficient and Jaccard index, also supported in TACO)
• Snapshot Queries [Kotidis, ICDE ’05]
  – motes maintain local regression models for their neighbors
  – models can be used for outlier detection
• Random Hyperplane Projection using Derived Dimensions [Georgoulas et al, MobiDE ’10]
  – extends the LSH scheme for skewed datasets
  – up to 70% improvement in accuracy
35
Related Work
• Kernel-based approach [Subramaniam et al, VLDB ’06]
• Centralized approaches [Jeffery et al, Pervasive ’06]
• Localized voting protocols [(Chen et al, DIWANS ’06), (Xiao et al, MobiDE ’07)]
• Report of top-K values with the highest deviation [Branch et al, ICDCS ’06]
• Weighted moving average techniques [Zhuang et al, ICDCS ’07]
36
Conclusions (Συμπεράσματα)
• Our Contributions
  – outlier detection with limited bandwidth
• The TACO/ΠΑΟ Framework
  – LSH compression of measurements at the sensor level
  – outlier detection within and amongst clusters
  – optimizations: Boosting Accuracy & Load Balancing
• Experimental Evaluation
  – accuracy exceeding 80% in most of the experiments
  – bandwidth consumption reduced up to a factor of 1/12 for 1/8-reduced bitmaps
  – network lifetime prolonged up to a factor of 3 for a 1/4 reduction ratio
38
TACO:
Tunable Approximate Computation of Outliers in
Wireless Sensor Networks
Thank you!
Nikos Giatrakos Yannis Kotidis Antonios Deligiannakis
Vasilis Vassalos Yannis Theodoridis
Backup Slides
40
TACO Framework
41
(Backup recap of the full pipeline)
Step 1: Data Encoding and Reduction: motes keep the latest W measurements in a tumble and encode them in a bitmap of d << W bits
Step 2: Intra-cluster Processing: encodings are transmitted to clusterheads, which perform similarity tests (threshold Φ) and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; }
Step 3: Inter-cluster Processing: an approximate TSP problem is solved and lists of potential outliers are exchanged
• Comparison Pruning is ensured by the fact that highly dissimilar bitmaps are hashed to different buckets, thus never being tested for similarity
Leveraging Additional Motes for Outlier Detection
42
Introducing a 2nd level of hashing:
• Besides cluster election, the process continues in each cluster so as to select B bucket nodes
• Since 0 ≤ Wh(Xi) ≤ d, equally distribute the hash key space [0, d] amongst the B buckets
• Hash each bitmap to the bucket whose range contains its Hamming weight Wh(Xi)
• For bitmaps with Wh(Xi) at the edge of a bucket, transmit Xi to a range around Wh(Xi), which is guaranteed to contain at most 2 buckets
(Figure: hash key space [0, d] split into ranges [0, d/4], (d/4, d/2], (d/2, 3d/4], (3d/4, d] across clusterhead, bucket nodes and regular sensors)
Leveraging Additional Motes for Outlier Detection
43
Intra-cluster Processing:
• Buckets perform bitmap comparisons as in regular intra-cluster processing
• Constraints:
  – If both bitmaps were hashed to the same single bucket, the similarity test is performed only in that bucket
  – For encodings that were hashed to the same 2 buckets, similarity is tested only in the bucket with the lowest SBi
• PotOut formation:
  – Si ∉ PotOut if it is not reported by all buckets it was hashed to
  – Received support values are added and Si ∈ PotOut iff SupportSi < minSup
(Figure: hash key space [0, d] split into four bucket ranges)
Experimental Setup
• Datasets:
  – Intel Lab Data:
    • Temperature and Humidity measurements
    • Network consisting of 48 motes organized into 4 clusters
    • Measurements for a period of 633 and 487 epochs respectively
    • minSup = 4
  – Weather Dataset:
    • Temperature, Humidity and Solar Irradiance measurements
    • Network consisting of 100 motes organized into 10 clusters
    • Measurements for a period of 2000 epochs
    • minSup = 6
44
Experimental Setup
• Outlier Injection
  – Intel Lab Data & Weather Temperature/Humidity data:
    • 0.4% probability that a mote obtains a spurious measurement at some epoch
    • 6% probability that a mote fails dirty at some epoch
      – Every mote that fails dirty increases its measurements by 1 degree until it reaches a MAX_VAL parameter, imposing about 15% noise on the values (see the sketch after this slide)
      – Intel Lab Data: MAX_VAL = 100; Weather Data: MAX_VAL = 200
  – Weather Solar Irradiance data:
    • Random injection of values obtained at various time periods into the sequence of epoch readings
• Simulators
  – TOSSIM network simulator
  – Custom, lightweight Java simulator
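A rough sketch of the injection model for reproducibility; the per-epoch +1 increment after a dirty failure and the magnitude of a spurious reading are assumptions beyond what the slide states.

import random

def inject_outliers(readings, p_spurious=0.004, p_fail=0.06, max_val=100, seed=0):
    """readings: one mote's per-epoch values. Returns a copy with spurious
    readings and a 'dirty failure' injected (the 0.4% / 6% probabilities and
    MAX_VAL come from the slide; the rest is assumed)."""
    rng = random.Random(seed)
    out, drift = [], None
    for v in readings:
        if drift is None and rng.random() < p_fail:
            drift = 0                               # mote fails dirty at this epoch
        if drift is not None:
            drift += 1                              # assumed: +1 degree per epoch
            v = min(v + drift, max_val)             # capped at MAX_VAL
        elif rng.random() < p_spurious:
            v = v + rng.uniform(20, 50)             # assumed magnitude of a spurious reading
        out.append(v)
    return out

print(inject_outliers([20.0 + 0.1 * t for t in range(15)]))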
45
Sensitivity Analysis: Intel Lab Data, Humidity and Weather Data, Humidity
46
(Charts: Average Recall and Average Precision vs. similarity angle (10-30) for 1/2, 1/4, 1/8 and 1/16 reduction; Intel Lab Humidity with TumbleSize=16, support=4; Weather Humidity with TumbleSize=20, support=6)
Sensitivity Analysis: Weather Data, Solar Irradiance; Boosting on Intel Lab Data, Humidity
47
(Charts: Average Recall and Average Precision vs. similarity angle (10-30) for 1/2, 1/4, 1/8 and 1/16 reduction with TumbleSize=32, support=6; and vs. TumbleSize (16-32) for 1, 4 and 8 boosting groups with Reduction=1/8, support=4, Φθ=30)
Performance Evaluation in TOSSIM
48
• Transmitted bits categorization per approach
(Chart: average bits transmitted per tumble, split into ToClusterhead, Retransmissions, Intercluster and ToBS traffic, for TACO with 1/16, 1/8 and 1/4 reduction, NonTACO and SelectStar)
Bucket Node Introduction
49
Effect of the number of bucket nodes per cluster, for similarity angles Φθ = 10 and Φθ = 20:

Cluster Size | #Buckets | Φθ=10: #Cmps | #Multihash Msgs | #Bitmaps/Bucket | Φθ=20: #Cmps | #Multihash Msgs | #Bitmaps/Bucket
12 | 1 | 66 | 0 | 12 | 66 | 0 | 12
12 | 2 | 38.08 | 0.90 | 6.45 | 40.92 | 1.36 | 6.68
12 | 4 | 24.55 | 7.71 | 3.65 | 30.95 | 8.88 | 4.08
24 | 1 | 276 | 0 | 24 | 276 | 0 | 24
24 | 2 | 158.06 | 1.62 | 12.81 | 171.80 | 2.76 | 13.38
24 | 4 | 101.10 | 14.97 | 7.27 | 128.63 | 17.61 | 8.15
36 | 1 | 630 | 0 | 36 | 630 | 0 | 36
36 | 2 | 363.64 | 2.66 | 19.33 | 394.97 | 4.30 | 20.15
36 | 4 | 230.73 | 22.88 | 10.88 | 291.14 | 26.28 | 12.19
48 | 1 | 1128 | 0 | 48 | 1128 | 0 | 48
48 | 2 | 640.10 | 3.14 | 25.57 | 710.95 | 5.85 | 26.93
48 | 4 | 412.76 | 30.17 | 14.49 | 518.57 | 34.64 | 16.21