TACO:
Tunable Approximate Computation of Outliers in
Wireless Sensor Networks
8 July 2010 HDMS 2010, Ayia Napa, Cyprus
* Dept. of Informatics, University of Piraeus, Piraeus, Greece
† Dept. of Informatics, Athens University of Economics and Business, Athens, Greece
# Dept. of Electronic and Computer Engineering, Technical University of Crete, Greece
Nikos Giatrakos* Yannis Kotidis† Antonios Deligiannakis#
Vasilis Vassalos† Yannis Theodoridis*
ΠΑΟ: Approximate computation of Outlier values in wireless sensor network environments (Greek rendering of the TACO title)
Outline
• Introduction
  – Why outlier detection is important
  – Definition of outlier
• The TACO Framework
  – Compression of measurements at the sensor level (LSH)
  – Outlier detection within and amongst clusters
  – Optimizations: Boosting Accuracy & Load Balancing
• Experimental Evaluation
• Related Work
• Conclusions
4
Introduction
• Wireless Sensor Networks utility
  – Place inexpensive, tiny motes in areas of interest
  – Perform continuous querying operations
  – Periodically obtain reports of quantities under study
  – Support sampling procedures, monitoring/surveillance applications, etc.
• Constraints
  – Limited power supply
  – Low processing capabilities
  – Constrained memory capacity
• Remark: data communication is the main factor of energy drain
5
Why Outlier Detection is Useful
• Outliers may denote malfunctioning sensors
  – sensor measurements are often unreliable
  – dirty readings affect computations/decisions [Deligiannakis et al, ICDE ’09]
• Outliers may also represent interesting events detected by few sensors
  – e.g., a fire detected by a single sensor
• Take into consideration
  – the recent history of samples acquired by single motes
  – correlations with measurements of other motes
6
Outlier Definition
(Example window of recent readings: 16 19 24 30 32 40 39)
• Let ui denote the latest W measurements obtained by mote Si
• Given a similarity metric sim: R^W → [0,1] and a similarity threshold Φ, sensors Si, Sj are considered similar if:
  sim(ui, uj) > Φ
• Minimum Support Requirement
  – a mote is classified as an outlier if its latest W measurements are not found to be similar with at least minSup other motes
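To make the definition concrete, here is a minimal centralized Python sketch (not TACO's distributed protocol): a mote is reported as an outlier when its window is similar, under a given sim() and threshold Φ, to fewer than minSup other windows. The cosine-based sim() and all names are illustrative assumptions.

import math

def cosine_sim(u, v):
    """Cosine similarity mapped to [0, 1]; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else max(0.0, dot / norm)

def find_outliers(windows, sim, phi, min_sup):
    """windows: {mote_id: latest W measurements}. A mote is an outlier if it is
    similar (sim > phi) to fewer than min_sup other motes."""
    ids = list(windows)
    support = {i: 0 for i in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if sim(windows[ids[a]], windows[ids[b]]) > phi:
                support[ids[a]] += 1
                support[ids[b]] += 1
    return [i for i in ids if support[i] < min_sup]

# Example: mote S3 reports values whose shape deviates from the rest.
windows = {"S1": [16, 19, 24, 30], "S2": [15, 20, 23, 31],
           "S3": [90, 5, 88, 2], "S4": [17, 18, 25, 29]}
print(find_outliers(windows, cosine_sim, phi=0.95, min_sup=2))

With the values above only S3 is reported.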
10
TACO Framework – General Idea
Network organization into clusters [(Younis et al, INFOCOM ’04), (Qin et al, J. UCS ’07)]
11
TACO Framework – General Idea
Step 1: Data Encoding and Reduction
• Motes obtain samples and keep the latest W measurements in a tumble
• Encode the W measurements in a bitmap of size d << W
(Figure: a regular sensor maps its tumble of raw readings, e.g. 8.2, 4.3, …, 5.1, to a d-bit encoding, e.g. 0 0 … 1, and reports it to its clusterhead)
12
TACO Framework – General Idea
Step 2: Intra-cluster Processing
• Encodings are transmitted to clusterheads
• Clusterheads perform similarity tests based on a given similarity measure and a similarity threshold Φ
• … and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; }
13
TACO Framework – General Idea
Step 3: Inter-cluster Processing
• An approximate TSP problem is solved; lists of potential outliers are exchanged among clusterheads
• Additional load-balancing mechanisms and improvements in accuracy are devised
14
Data Encoding and Reduction
• Step 1 recap: motes keep the latest W measurements in a tumble and encode them in a bitmap of d << W bits
• Desired Properties
  – Dimensionality Reduction: reduced bandwidth consumption
  – Similarity Preservation: allows us to later derive the initial sim(ui, uj) during vector comparisons
15
• Locality Sensitive Hashing (LSH): a family F of hash functions such that, for h drawn uniformly at random from F, Pr[h(ui) = h(uj)] = sim(ui, uj)
• Practically, any similarity measure satisfying a set of criteria [Charikar, STOC ’02] may be incorporated into TACO’s framework
16
LSH Example: Random Hyperplane Projection [(Goemans & Williamson, J.ACM ’95), (Charikar, STOC ’02)]
• Family of n d-dimensional random vectors (rvi)
• Generates for each data vector a bitmap of size n as follows:
  – Sets biti = 1 if the dot product of the data vector with the i-th random vector is positive
  – Sets biti = 0 otherwise
(Figure: a 2-dimensional sensor data vector tested against random vectors rv1-rv4, yielding the TACO encoding 1 0 1 0)
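A minimal sketch of the random hyperplane projection encoding described above. The Gaussian random vectors and the shared seed (so that every node generates the same family) are assumptions for illustration.

import random

def make_random_vectors(n, dim, seed=42):
    """n random hyperplanes for dim-dimensional windows; a shared seed lets
    every mote and clusterhead generate the same family."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

def rhp_encode(u, random_vectors):
    """bit_i = 1 if the dot product of u with the i-th random vector is positive."""
    return [1 if sum(a * b for a, b in zip(rv, u)) > 0 else 0
            for rv in random_vectors]

# Example: encode a window of W = 8 readings into n = 16 bits.
rvs = make_random_vectors(n=16, dim=8)
window = [8.2, 4.3, 5.1, 6.0, 7.4, 5.9, 6.6, 7.1]
print(rhp_encode(window, rvs))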
17
Computing Similarity
• Cosine Similarity: cos(θ(ui, uj)), where θ(ui, uj) is the angle between the two measurement vectors
• The angle is estimated from the Hamming distance of the n-bit encodings: two bitmaps that differ in Dh positions yield the estimate θ ≈ Dh/n * π
• Example: RHP(ui) = 1 0 1 1 1 1 and RHP(uj) = 0 0 0 1 1 1 differ in 2 of n = 6 bits, so θ(RHP(ui), RHP(uj)) = 2/6 * π = π/3

Supported Similarity Measures
• Cosine Similarity, cos(θ(ui, uj)): angle similarity estimated via the Hamming distance of the encodings
• Correlation Coefficient: corr(ui, uj) = cov(ui, uj)/(σui*σuj) = E[(ui - E[ui])(uj - E[uj])]/(σui*σuj) [details in paper]
• Jaccard Coefficient: Jaccard(A, B) = |A ∩ B|/|A ∪ B|, see [Gionis et al, SIGMOD ’01]
• Lp-Norms: see [Datar et al, DIMACS SDAM ’03]
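Since RHP guarantees that two encodings differ in each bit with probability θ/π, the angle (and hence the cosine similarity) can be estimated from the Hamming distance of the bitmaps. A small illustrative sketch; names are my own:

import math

def hamming(x, y):
    """Number of positions at which two equal-length bitmaps differ."""
    return sum(a != b for a, b in zip(x, y))

def estimate_angle(x, y):
    """RHP guarantees P[bits differ] = theta / pi, so Dh/n estimates theta/pi."""
    return hamming(x, y) / len(x) * math.pi

def estimate_cosine_similarity(x, y):
    return math.cos(estimate_angle(x, y))

# Running example from the slide: 2 of 6 bits differ.
xi = [1, 0, 1, 1, 1, 1]
xj = [0, 0, 0, 1, 1, 1]
print(estimate_angle(xi, xj), estimate_cosine_similarity(xi, xj))

Run on the slide's example this prints an angle of π/3 (about 1.047) and a cosine estimate of 0.5.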
19
TACO Framework
20
(Recap) Step 2: Intra-cluster Processing. Encodings are transmitted to clusterheads, which perform similarity tests based on a given similarity measure and threshold Φ and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; }
Intra-cluster Processing
• Goal: find potential outliers within each cluster's realm
• Back to our running example: sensor vectors are considered similar when θ(ui, uj) < Φθ
• Translate the user-defined similarity threshold Φθ into a Hamming-distance threshold Φh = Φθ * d/π
• For any received pair of bitmaps Xi, Xj, clusterheads can estimate the initial similarity from their Hamming distance, testing: Dh(Xi, Xj) < Φh
• At the end of the process, <Si, Xi, support> lists are extracted for motes that do not satisfy the minSup parameter
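A sketch of the clusterhead-side step, combining the threshold translation Φh = Φθ·d/π with support counting and extraction of the potential-outlier list; function and variable names are my own.

import math

def intra_cluster_potential_outliers(encodings, phi_theta, min_sup):
    """encodings: {mote_id: d-bit bitmap}. Returns <Si, Xi, support> entries for
    motes whose support stays below min_sup (potential outliers)."""
    ids = list(encodings)
    d = len(encodings[ids[0]])
    phi_h = phi_theta * d / math.pi          # translated Hamming-distance threshold
    support = {i: 0 for i in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            dh = sum(x != y for x, y in zip(encodings[ids[a]], encodings[ids[b]]))
            if dh < phi_h:                   # estimated angle below phi_theta
                support[ids[a]] += 1
                support[ids[b]] += 1
    return [(i, encodings[i], support[i]) for i in ids if support[i] < min_sup]

# Example with 4 motes, a 70-degree threshold and min_sup = 2.
enc = {"S1": [1, 0, 1, 1, 1, 1], "S2": [1, 0, 1, 1, 1, 0],
       "S3": [0, 1, 0, 0, 0, 0], "S4": [1, 0, 1, 1, 0, 1]}
print(intra_cluster_potential_outliers(enc, phi_theta=math.radians(70), min_sup=2))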
21
Intra-cluster Processing
22
Probability of correctly classifying similar motes as such (W=16, θ=5, Φθ=10), shown as a chart on the original slide.
TACO Framework
23
(Recap) Step 3: Inter-cluster Processing. An approximate TSP problem is solved and lists of potential outliers are exchanged among clusterheads.
Boosting TACO Encodings
25
• Partition each d-bit encoding into μ groups of n bits (d = n·μ), run the similarity test on each group and obtain the answer provided by the majority of the μ tests
(Figure: Xi and Xj partitioned into μ groups; the majority of the per-group tests report similarity, so SimBoosting(Xi, Xj) = 1)
• Check the quality of the boosting estimation (θ(ui, uj) ≤ Φθ):
  – Unpartitioned bitmaps: Pwrong(d) = 1 - Psimilar(d)
  – Boosting: Pwrong(d, μ) ≤ …
• Decide an appropriate μ:
  – Restriction on μ: Psimilar(d/μ) > 0.5
  – Comparison of Pwrong(d, μ) against Pwrong(d)
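A sketch of the boosting test, assuming the per-group test reuses the translated Hamming threshold on n = d/μ bits (the exact per-group threshold and the error bound are detailed in the paper):

import math

def boosted_similarity(xi, xj, mu, phi_theta):
    """Split the d-bit encodings into mu groups, test each group independently
    and return 1 if the majority of the mu tests report similarity."""
    d = len(xi)
    n = d // mu                                  # group size (assumes mu divides d)
    phi_h_group = phi_theta * n / math.pi        # assumed per-group Hamming threshold
    votes = 0
    for g in range(mu):
        a, b = xi[g * n:(g + 1) * n], xj[g * n:(g + 1) * n]
        dh = sum(x != y for x, y in zip(a, b))
        votes += dh < phi_h_group
    return 1 if votes > mu / 2 else 0

# Example: 12-bit encodings, mu = 3 groups, 60-degree threshold.
xi = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
xj = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
print(boosted_similarity(xi, xj, mu=3, phi_theta=math.radians(60)))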
Comparison Pruning
26
• The clusterhead election process is modified to additionally return B bucket nodes
• A 2nd level of hashing is introduced, based on the Hamming weight of the bitmaps
• Comparison pruning is achieved by hashing highly dissimilar bitmaps to different buckets
(Figure: the hash key space [0, d] split among four buckets, [0, d/4], (d/4, d/2], (d/2, 3d/4], (3d/4, d]; clusterhead, bucket nodes and regular sensors)
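A rough sketch of the weight-based bucket hashing: the key space [0, d] is split evenly among B buckets and each bitmap is routed by its Hamming weight, so bitmaps whose weights lie far apart are never compared. The edge-handling rule used here (also forward to the neighbouring bucket when the weight is within Φh of a boundary) is an assumption standing in for the exact rule in the paper.

def buckets_for_bitmap(x, num_buckets, phi_h):
    """Route a bitmap to the bucket owning its Hamming weight; if the weight is
    within phi_h of a bucket boundary, also route to the adjacent bucket
    (edge handling here is an illustrative assumption)."""
    d = len(x)
    width = d / num_buckets
    w = sum(x)                                   # Hamming weight, 0 <= w <= d
    primary = min(int(w / width), num_buckets - 1)
    targets = {primary}
    if primary > 0 and w - primary * width < phi_h:
        targets.add(primary - 1)
    if primary < num_buckets - 1 and (primary + 1) * width - w < phi_h:
        targets.add(primary + 1)
    return sorted(targets)

# Example: d = 16 bits, 4 buckets, Hamming threshold 2.
x = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # weight 7, near a boundary
print(buckets_for_bitmap(x, num_buckets=4, phi_h=2))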
Load Balancing Among Buckets
27
• Histogram Calculation Phase: each bucket constructs an equi-width histogram of the Hamming weights of the received Xi's
• Histogram Communication Phase: each bucket communicates to the clusterhead its estimated frequency counts and its width parameter ci
• Hash Key Space Reassignment: the clusterhead determines a new space partitioning and broadcasts the corresponding information
(Figure example: buckets SB1-SB4 (SB2 = SC) report [f=(0,0,1), c1=d/12], [f=(3,3,2,2), c2=d/16], [f=(3,3,4,6), c3=d/16], [f=(1,0,0), c4=d/12]; the reassigned ranges become SB1 [0, 3d/8], SB2 (3d/8, 9d/16], SB3 (9d/16, 11d/16], SB4 (11d/16, d], as in the sketch after this slide)
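A sketch of the reassignment computation at the clusterhead: merge the per-bucket equi-width histograms into one view of the key space and cut it into ranges of roughly equal load. Splitting only at histogram-bin edges and passing each bucket's range start alongside its report are my assumptions; with d = 48 the example reproduces the ranges shown on the slide.

def reassign_ranges(histograms, d, num_buckets):
    """histograms: list of (counts, bin_width, range_start) per bucket, as reported
    to the clusterhead. Returns new (start, end] ranges balancing the load."""
    # Flatten into (bin_end, count) pairs over the whole key space [0, d].
    bins = []
    for counts, width, start in histograms:
        for i, c in enumerate(counts):
            bins.append((start + (i + 1) * width, c))
    bins.sort()
    total = sum(c for _, c in bins)
    target = total / num_buckets
    ranges, acc, start = [], 0, 0
    for end, c in bins:
        acc += c
        if acc >= target and len(ranges) < num_buckets - 1:
            ranges.append((start, end))
            start, acc = end, 0
    ranges.append((start, d))
    return ranges

# Example loosely following the slide: 4 buckets over [0, d], d = 48.
d = 48
hists = [([0, 0, 1], 4, 0), ([3, 3, 2, 2], 3, 12), ([3, 3, 4, 6], 3, 24), ([1, 0, 0], 4, 36)]
print(reassign_ranges(hists, d, num_buckets=4))   # [(0, 18), (18, 27), (27, 33), (33, 48)]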
Outline
• Introduction
  – Why is Outlier Detection Important and Difficult
• Our Contributions
  – Outlier detection with limited bandwidth
  – Compute measurement similarity over compressed representations of measurements (LSH)
• The TACO Framework
  – Compression of measurements at the sensor level
  – Outlier detection within and amongst clusters
• Optimizations: Load Balancing & Comparison Pruning
• Experimental Evaluation
• Related Work
• Conclusions
28
Sensitivity Analysis: Intel Lab Data, Temperature
29
(Charts: Average Recall and Average Precision vs. similarity angle (10-30) for 1/2, 1/4, 1/8 and 1/16 reduction; TumbleSize=16, support=4)
Sensitivity Analysis: Boosting, Intel Lab Data, Humidity
30
(Charts: Average Recall and Average Precision vs. TumbleSize (16-32) for 1, 4 and 8 boosting groups; Reduction=1/8, support=4, Φθ=30)
Performance Evaluation in TOSSIM
31
• For 1/8 reduction, TACO consumes on average 1/12 of the bandwidth of the non-TACO alternatives, and as little as 1/15 at best
(Chart: Min/Average/Max total bits transmitted per tumble for TACO with 1/16, 1/8 and 1/4 reduction vs. NonTACO and SelectStar)
Performance Evaluation in TOSSIM
32
• Network Lifetime: the epoch at which the first mote in the network dies
• Average lifetime for motes initialized with 5000 mJ of residual energy
• The reduction in power consumption reaches a ratio of 1/2.7
(Chart: average lifetime in epochs for TACO 1/4 reduction vs. NonTACO and SelectStar)
TACO vs Hierarchical Outlier Detection Techniques
33
• Robust [Deligiannakis et al, ICDE ’09] falls short of TACO by up to 10% in terms of the F-Measure metric
• TACO ensures lower bandwidth consumption, with a ratio varying from 1/2.6 to 1/7.8
(Charts: F-Measure vs. support (1-4) for Robust and TACO with 1/4 and 1/8 reduction, and average bits transmitted per tumble per support value for 1/4, 1/8 and 1/16 reduction, split into TACO inter-cluster and remaining traffic; TumbleSize=16, Corr_Threshold=cos(30)≈0.87)
Outline (repeated; identical to the outline slide above)
34
Related Work - Ours
• Outlier reports on par with aggregate query answers [Kotidis et al, MobiDE ’07]
  – hierarchical organization of motes
  – takes into account temporal & spatial correlations as well
  – reports aggregates, witnesses & outliers
• Outlier-aware routing [Deligiannakis et al, ICDE ’09]
  – routes outliers towards motes that can potentially witness them
  – validates the detection scheme for different similarity metrics (correlation coefficient and Jaccard index, also supported in TACO)
• Snapshot Queries [Kotidis, ICDE ’05]
  – motes maintain local regression models for their neighbors
  – models can be used for outlier detection
• Random Hyperplane Projection using Derived Dimensions [Georgoulas et al, MobiDE ’10]
  – extends the LSH scheme for skewed datasets
  – up to 70% improvement in accuracy
35
Related Work
• Kernel-based approach [Subramaniam et al, VLDB ’06]
• Centralized approaches [Jeffery et al, Pervasive ’06]
• Localized voting protocols [(Chen et al, DIWANS ’06), (Xiao et al, MobiDE ’07)]
• Report of top-K values with the highest deviation [Branch et al, ICDCS ’06]
• Weighted moving average techniques [Zhuang et al, ICDCS ’07]
36
Conclusions (Συμπεράσματα)
• Our Contributions
  – outlier detection with limited bandwidth
• The TACO/ΠΑΟ Framework
  – LSH compression of measurements at the sensor level
  – outlier detection within and amongst clusters
  – optimizations: Boosting Accuracy & Load Balancing
• Experimental Evaluation
  – accuracy exceeding 80% in most of the experiments
  – bandwidth consumption reduced up to a factor of 1/12 for 1/8-reduced bitmaps
  – network lifetime prolonged up to a factor of 3 for a 1/4 reduction ratio
38
TACO:
Tunable Approximate Computation of Outliers in
Wireless Sensor Networks
Thank you!
Nikos Giatrakos Yannis Kotidis Antonios Deligiannakis
Vasilis Vassalos Yannis Theodoridis
Backup Slides
40
TACO Framework
41
(Backup recap of the full pipeline)
Step 1: Data Encoding and Reduction: motes keep the latest W measurements in a tumble and encode them in a bitmap of d << W bits
Step 2: Intra-cluster Processing: encodings are transmitted to clusterheads, which perform similarity tests (threshold Φ) and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; }
Step 3: Inter-cluster Processing: an approximate TSP problem is solved and lists of potential outliers are exchanged
• Comparison Pruning is ensured by the fact that highly dissimilar bitmaps are hashed to different buckets, thus never being tested for similarity
Leveraging Additional Motes for Outlier Detection
42
Introducing a 2nd level of hashing:
• Besides cluster election, the process continues in each cluster so as to select B bucket nodes
• Since 0 ≤ Wh(Xi) ≤ d, equally distribute the hash key space [0, d] amongst the B buckets
• Hash each bitmap to the bucket whose range contains its Hamming weight Wh(Xi)
• For bitmaps with Wh(Xi) at the edge of a bucket, transmit Xi to a range around Wh(Xi), which is guaranteed to contain at most 2 buckets
(Figure: hash key space [0, d] split into ranges [0, d/4], (d/4, d/2], (d/2, 3d/4], (3d/4, d] across clusterhead, bucket nodes and regular sensors)
Leveraging Additional Motes for Outlier Detection
43
Intra-cluster Processing:
• Buckets perform bitmap comparisons as in regular intra-cluster processing
• Constraints:
  – If both bitmaps were hashed to the same single bucket, the similarity test is performed only in that bucket
  – For encodings that were hashed to the same 2 buckets, similarity is tested only in the bucket with the lowest SBi
• PotOut formation:
  – Si ∉ PotOut if it is not reported by all buckets it was hashed to
  – Received support values are added and Si ∈ PotOut iff SupportSi < minSup
(Figure: hash key space [0, d] split into four bucket ranges)
Experimental Setup
• Datasets:
  – Intel Lab Data:
    • Temperature and Humidity measurements
    • Network consisting of 48 motes organized into 4 clusters
    • Measurements for a period of 633 and 487 epochs respectively
    • minSup = 4
  – Weather Dataset:
    • Temperature, Humidity and Solar Irradiance measurements
    • Network consisting of 100 motes organized into 10 clusters
    • Measurements for a period of 2000 epochs
    • minSup = 6
44
Experimental Setup
• Outlier Injection
  – Intel Lab Data & Weather Temperature/Humidity data:
    • 0.4% probability that a mote obtains a spurious measurement at some epoch
    • 6% probability that a mote fails dirty at some epoch
      – Every mote that fails dirty increases its measurements by 1 degree until it reaches a MAX_VAL parameter, imposing about 15% noise on the values (see the sketch after this slide)
      – Intel Lab Data: MAX_VAL = 100; Weather Data: MAX_VAL = 200
  – Weather Solar Irradiance data:
    • Random injection of values obtained at various time periods into the sequence of epoch readings
• Simulators
  – TOSSIM network simulator
  – Custom, lightweight Java simulator
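A rough sketch of the injection model for reproducibility; the per-epoch +1 increment after a dirty failure and the magnitude of a spurious reading are assumptions beyond what the slide states.

import random

def inject_outliers(readings, p_spurious=0.004, p_fail=0.06, max_val=100, seed=0):
    """readings: one mote's per-epoch values. Returns a copy with spurious
    readings and a 'dirty failure' injected (the 0.4% / 6% probabilities and
    MAX_VAL come from the slide; the rest is assumed)."""
    rng = random.Random(seed)
    out, drift = [], None
    for v in readings:
        if drift is None and rng.random() < p_fail:
            drift = 0                               # mote fails dirty at this epoch
        if drift is not None:
            drift += 1                              # assumed: +1 degree per epoch
            v = min(v + drift, max_val)             # capped at MAX_VAL
        elif rng.random() < p_spurious:
            v = v + rng.uniform(20, 50)             # assumed magnitude of a spurious reading
        out.append(v)
    return out

print(inject_outliers([20.0 + 0.1 * t for t in range(15)]))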
45
Sensitivity Analysis: Intel Lab Data, Humidity and Weather Data, Humidity
46
(Charts: Average Recall and Average Precision vs. similarity angle (10-30) for 1/2, 1/4, 1/8 and 1/16 reduction; Intel Lab Humidity with TumbleSize=16, support=4; Weather Humidity with TumbleSize=20, support=6)
Sensitivity Analysis: Weather Data, Solar Irradiance; Boosting on Intel Lab Data, Humidity
47
(Charts: Average Recall and Average Precision vs. similarity angle (10-30) for 1/2, 1/4, 1/8 and 1/16 reduction with TumbleSize=32, support=6; and vs. TumbleSize (16-32) for 1, 4 and 8 boosting groups with Reduction=1/8, support=4, Φθ=30)
Performance Evaluation in TOSSIM
48
• Transmitted bits categorization per approach
(Chart: average bits transmitted per tumble, split into ToClusterhead, Retransmissions, Intercluster and ToBS traffic, for TACO with 1/16, 1/8 and 1/4 reduction, NonTACO and SelectStar)
Bucket Node Introduction
49
Effect of the number of bucket nodes per cluster, for similarity angles Φθ = 10 and Φθ = 20:

Cluster Size | #Buckets | Φθ=10: #Cmps | #Multihash Msgs | #Bitmaps/Bucket | Φθ=20: #Cmps | #Multihash Msgs | #Bitmaps/Bucket
12 | 1 | 66 | 0 | 12 | 66 | 0 | 12
12 | 2 | 38.08 | 0.90 | 6.45 | 40.92 | 1.36 | 6.68
12 | 4 | 24.55 | 7.71 | 3.65 | 30.95 | 8.88 | 4.08
24 | 1 | 276 | 0 | 24 | 276 | 0 | 24
24 | 2 | 158.06 | 1.62 | 12.81 | 171.80 | 2.76 | 13.38
24 | 4 | 101.10 | 14.97 | 7.27 | 128.63 | 17.61 | 8.15
36 | 1 | 630 | 0 | 36 | 630 | 0 | 36
36 | 2 | 363.64 | 2.66 | 19.33 | 394.97 | 4.30 | 20.15
36 | 4 | 230.73 | 22.88 | 10.88 | 291.14 | 26.28 | 12.19
48 | 1 | 1128 | 0 | 48 | 1128 | 0 | 48
48 | 2 | 640.10 | 3.14 | 25.57 | 710.95 | 5.85 | 26.93
48 | 4 | 412.76 | 30.17 | 14.49 | 518.57 | 34.64 | 16.21