Location Trace Analysis
Deepti Chafekar
Quettra, June 6, 2014
Design Problem
• Given: GPS location trace data for users
• Goal: Extract insights for a given user from his location trace data
• Solution split into several parts:
Extracting Insights
• How would you organize the data?
• What models could be created to understand this data?
• What useful insights can be extracted from these models?

Using the Extracted Insight
• What apps and business services can be designed to use this insight?

Scalability Issues
• How can this system be made scalable across millions of users and 100s of incoming data points per day?
Overview
Part 1: Extract Insights
• Input data conditioning: aggregation, clustering, filtering, labeling → labeled significant location clusters
• Data modeling: Markov models, location-time distribution model, ML classifiers for location prediction
Part 2: Use Insight
• Place recommendation service: collaborative filtering, memory-based (Min-Hashing)
• Scalability considered at every stage
Data Organization
Location Trace Data
Data Source and Format
• Generated sample data (~400 points) for 5 weekdays and 4 weekend days
• Raw data format: Timestamp, Lat, Long
• Probes generated every 15 mins
• Trace data generated based on my schedule on weekdays for 1 week; around 400 points generated

Data Organization
• Aggregate similar points to make the data manageable, remove noise, and identify significant data points
Clustering
Data Organization: Clustering
Variation of K-means
• K is not fixed but set dynamically
• 2-phase approach

Phase 1:
1. For each point P, compute the distance between it and all cluster centroids Ci
2. Compute the minimum distance Dmin(P, Ci)
3. If Dmin(P, Ci) <= d, insert point P into cluster Ci
4. Else create a new cluster with P as its centroid
5. Update the centroids of all clusters

Phase 2:
1. Run K-means on the resulting clusters and points

References: Extracting Places from Traces of Locations, Kang et al.; Using GPS to Learn Significant Locations and Predict Movement Across Multiple Users, Ashbrook and Starner
[Diagram: distances D(P,C1), D(P,C2), D(P,C3) from a point P to cluster centroids C1, C2, C3]
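Phase 1 above can be sketched as follows. This is a minimal illustration, not the deck's implementation: the plain Euclidean distance and the mean-based centroid update are assumptions (real GPS traces would use haversine distance).

```python
import math

def dist(p, q):
    # Euclidean distance; real lat/lon data would use haversine distance
    return math.hypot(p[0] - q[0], p[1] - q[1])

def phase1(points, d):
    """Greedy clustering: K is not fixed, a new cluster is opened
    whenever a point is farther than d from every existing centroid."""
    clusters = []  # each cluster: {"centroid": (x, y), "points": [...]}
    for p in points:
        if clusters:
            # distance from p to every existing centroid
            dists = [dist(p, c["centroid"]) for c in clusters]
            i = min(range(len(dists)), key=dists.__getitem__)
            if dists[i] <= d:
                clusters[i]["points"].append(p)
                # update the centroid as the mean of member points
                pts = clusters[i]["points"]
                clusters[i]["centroid"] = (
                    sum(q[0] for q in pts) / len(pts),
                    sum(q[1] for q in pts) / len(pts),
                )
                continue
        clusters.append({"centroid": p, "points": [p]})
    return clusters
```

Phase 2 would then run standard K-means seeded with the centroids this pass produces.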
Clustering Results
[Maps: trace points before and after clustering]
• Reduce effective data size
• Filter out redundant points
Identify Significant Clusters
• Filter the distance-based clusters further to identify significant clusters
• Significance: based on time spent per cluster and frequency of visits
• Time spent: sort the traces by time and compute hours spent per cluster
• Frequency: # of times a cluster is visited in a day
[Bar chart: total time spent (hours) per cluster, for clusters L10, L6, L7, L23]
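The two significance signals can be computed in one pass over a time-sorted trace. A minimal sketch, with an assumed input shape of `(timestamp_in_hours, cluster_id)` pairs:

```python
from collections import defaultdict

def cluster_stats(trace):
    """trace: time-sorted list of (timestamp_hours, cluster_id).
    Hours per cluster are accumulated from the gap to the next probe;
    frequency counts cluster *entries* (transitions into a cluster)."""
    hours = defaultdict(float)
    visits = defaultdict(int)
    prev_cluster = None
    for i, (t, c) in enumerate(trace):
        if i + 1 < len(trace):
            hours[c] += trace[i + 1][0] - t
        if c != prev_cluster:
            visits[c] += 1
        prev_cluster = c
    return dict(hours), dict(visits)
```

Clusters whose total hours or visit count fall below a chosen threshold would then be dropped as insignificant.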
Data Conditioning: Labeling
• Add labels to the clusters to make them meaningful
• Google Places API:
  – Given lat, lon: returns information about nearby places
  – Returns metadata such as the name of the place and its type
  – E.g., given lat, lon (37.406679, -122.036603), the Places API gives
    name: Moffett Towers Club
    type: ["gym", "health", "spa"]
• Metadata associated with a cluster:
  – Center point (Lat, Lon)
  – Label: name, type
  – Total hours spent
• Survey data from the American Time Use Survey (ATUS, http://www.bls.gov/tus/): statistics on when people are normally at home (sample of ~38K people; gives average hours spent at work and at home)

References: Learning Likely Locations, Krumm et al.; Google Maps Places API
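A sketch of the labeling step, shown against an already-parsed response dict rather than a live API call. The `results`/`name`/`types` field names mirror the Places API response shape the deck describes; the sample values below are hand-written for illustration.

```python
def label_cluster(places_response):
    """Extract a (name, types) label for a cluster center from a
    Places-API-style JSON response that has already been parsed
    into a dict."""
    results = places_response.get("results", [])
    if not results:
        return None
    top = results[0]  # take the nearest / most prominent place
    return {"name": top["name"], "types": top["types"]}

# Sample response in the shape the deck describes for
# lat, lon (37.406679, -122.036603)
sample = {"results": [{"name": "Moffett Towers Club",
                       "types": ["gym", "health", "spa"]}]}
```

In the full pipeline, each significant cluster's centroid would be sent to the API and the returned label attached to the cluster's metadata.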
Data Conditioning: Labeling
[Map: labeled clusters for weekday trace — Home, Work, Gym, School]
[Map: labeled clusters for weekend trace — Home, Park, Zoo, Museum]
Modeling to Extract Insights
• Goal
  – Predict the user's location at time t
  – Given the user's current location and time t, predict the user's next location
• Can be used for location-based services:
  – Place/activity recommendation systems
  – Automatically provide relevant traffic updates and route recommendations
  – Targeted advertisements for certain activities (gym, kids' places)
Data Modeling
Modeling: Markov Model
[Diagram: Markov model for weekday activities over states home, school, work, gym, with transition probabilities 0.8, 0.2, 1.0, 0.7, 0.3]
[Diagram: Markov model for weekend activities over states home, park, zoo/museum, with transition probabilities 0.7, 0.3, 1.0]
Insight: predicts your next location
• Simple design
• Does not capture temporal aspects
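A first-order Markov predictor is a few lines once the transition probabilities are learned. The matrix below is a hypothetical example in the shape of the weekday diagram, not the deck's learned values:

```python
# Hypothetical weekday transition probabilities; the structure
# illustrates the diagram, the numbers are assumptions.
WEEKDAY = {
    "home":   {"school": 0.8, "work": 0.2},
    "school": {"work": 1.0},
    "work":   {"gym": 0.7, "home": 0.3},
    "gym":    {"home": 1.0},
}

def predict_next(current, transitions=WEEKDAY):
    """Most likely next location under a first-order Markov model:
    argmax over the outgoing transition probabilities."""
    nxt = transitions[current]
    return max(nxt, key=nxt.get)
```

Note the prediction depends only on the current state, which is exactly the "does not capture temporal aspects" limitation the slide points out.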
Modeling: Location-Time Distribution
• Need to capture the time-location correlation
• Capture the location distribution over time
• Discretize time into 24 one-hour slots (T0, T1, T2, ..., T23)
• For every slot, sum up the hours spent at a given location
• E.g., if from 4:00 to 4:30 the location was work and at 4:45 the location was the gym, then T16(work) = 30 mins

Reference: PnLUM: System for prediction of next location for users with mobility, Nguyen et al.
Day Time Home School Work Gym Park
weekday 8 5
weekday 9 1
weekday 10 5
weekday 11 5
weekday 12 5
weekday 13 5
Table shows for each time slot the sum of hours user spent at a certain location
Modeling: Location-Time Distribution
[Charts: location-time distribution for weekdays (Home, Work, School, Gym) and weekends (Home, Park, Zoo/Museum), probability vs. hour of day]
• For each time slot Ti, compute the probability PTi(Lj) that the user is at location Lj
• E.g., P12(work) = 1, P18(gym) = 0.3

Insights: On weekdays, the user typically spends mornings and evenings at home and the day at work. On weekends, the user typically spends mornings and evenings at home, and the daytime, 10 am to 6 pm, at parks and kids' activity places.
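Turning the per-slot hour sums into the probabilities PTi(Lj) is a per-slot normalization. A minimal sketch, with an assumed input of `{(slot, location): total_hours}`:

```python
from collections import defaultdict

def time_location_probs(slot_hours):
    """slot_hours: {(slot, location): total_hours} accumulated over
    the trace. Returns P_Ti(Lj) = hours at Lj during Ti divided by
    the total hours observed during Ti."""
    totals = defaultdict(float)
    for (slot, _), h in slot_hours.items():
        totals[slot] += h
    return {(slot, loc): h / totals[slot]
            for (slot, loc), h in slot_hours.items()}
```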
Modeling: Location Prediction Classifiers
• Predict a future location given my current context (location and time): a classification problem
• Trained 3 classifiers: (1) pruned decision tree (J48), (2) Naïve Bayes, (3) K-Nearest Neighbors
• Training set:

Time Slot  Curr Location  Day      Next Location
9          school         weekday  work
10         work           weekday  work
11         work           weekday  work
16         gym            weekday  home

Reference: PnLUM: System for prediction of next location for users with mobility, Nguyen et al.
Location Prediction: Classifiers

Classifier    Accuracy
J48           90%
Naïve Bayes   91%
K-NN          95%

References: Weka: open-source ML library (http://www.cs.waikato.ac.nz/ml/weka/); PnLUM: System for prediction of next location for users with mobility, Nguyen et al.

• The approach is good for coarse location predictions, e.g., where will I be in the next hour
• For finer predictions we need finer time slots, which could increase the complexity of training
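The deck trained its classifiers in Weka; as an illustrative stdlib-only stand-in, a tiny K-NN over the training rows above can be sketched with a simple feature-mismatch distance (the distance function and k=1 default are assumptions, not the deck's configuration):

```python
def overlap_dist(a, b):
    # categorical distance: number of mismatching features
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    """train: list of ((time_slot, curr_loc, day), next_loc) rows.
    Majority vote among the k rows nearest to the query context."""
    ranked = sorted(train, key=lambda row: overlap_dist(row[0], query))
    votes = {}
    for feats, label in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# The training rows from the slide
TRAIN = [((9, "school", "weekday"), "work"),
         ((10, "work", "weekday"), "work"),
         ((11, "work", "weekday"), "work"),
         ((16, "gym", "weekday"), "home")]
```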
Data Insights
[Recap figures: the weekday and weekend location-time distribution charts and the labeled cluster maps]
Models and Insights: Recap
• A single model does not capture all aspects
  – Consider a combination of models to gain insights
  – Identify the user's location at a given time
  – Predict the user's next location given the current context
• Insights:
  – Understand the user's routine and schedule on weekdays and weekends
    • User leaves home at ~9 am and leaves work at ~5 pm (show traffic updates and route suggestions at those times)
    • User visits a preschool on weekdays (target advertisements for preschoolers)
    • User is at kid-friendly places on weekends during the day (recommend kids' places and activities; show deals and coupons for these places)
  – Insights into the user's interests and habits
    • User goes to the gym on average 2 times a week (target advertisements for gym accessories)
Part 2: Using insights to construct an app/service
Place Recommendation System
• Goal: Recommend places for kids activities based on user preferences
• Problem Statement
– Given: N users, M kids places and user history of places visited
– Output: recommend K places to the user that he/she might be interested in visiting
Place Recommendation Methods

Collaborative Filtering
• Recommends places that people with similar preferences liked in the past
• Can recommend new types of places

Content-Based Filtering
• Recommends places similar to ones the user herself preferred in the past
• Requires comparing the similarity of places (needs detailed metadata for each place; how do you compare a park with a zoo?)
• Does not recommend a new type of place, e.g., a planetarium

References: A Survey of Collaborative Filtering Techniques, Su et al.; Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Adomavicius et al.
Collaborative Filtering: Memory-Based Approach
1. Collect user ratings for visited places
2. Find people with similar tastes (neighbors)
3. Recommend places highly rated by neighbors; make rating predictions for users based on their past ratings
Collaborative Filtering
• User rating: implicit
• Frequency fsi = # of times a user visited place si / total places visited
  – E.g., P = {(Discovery Museum, 5), (Oakland Zoo, 1), (Ortega Park, 6)}, so fDiscovery Museum = 5/12
• For simplicity, use a binary rating: if fsi > threshold, rating rsi = 1, else 0
• The rating set R = {s1, s2, ..., sl} consists of all places with rating = 1
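The implicit-rating step above can be sketched in a few lines; the threshold value of 0.2 below is an assumption for illustration:

```python
def implicit_ratings(visits, threshold=0.2):
    """visits: {place: visit_count}. Frequency f_si = visits to s_i
    divided by total visits; binary rating r_si = 1 if f_si exceeds
    the threshold. Returns the rating set R (places rated 1)."""
    total = sum(visits.values())
    freqs = {p: c / total for p, c in visits.items()}
    return {p for p, f in freqs.items() if f > threshold}
```

With the slide's example counts, Discovery Museum (5/12) and Ortega Park (6/12) clear the threshold while Oakland Zoo (1/12) does not.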
Collaborative Filtering
• Similarity metric: find the common places between two users A, B

  RA = {s1, s3}
  RB = {s1, s2, s3, s4}

  Sim(A, B) = |RA ∩ RB| / |RA ∪ RB| = 2/4

• For each user u, keep the set Wu of the top L users (neighbors) most similar to u

Reference: Google News Personalization: Scalable Online Collaborative Filtering, Das et al.
Collaborative Filtering
• Predict the rating of place sk for user u from the ratings given by the users in the neighbor set Wu:

  r(u, sk) = Σ_{v ∈ Wu} Sim(u, v) · r(v, sk) / Σ_{v ∈ Wu} Sim(u, v)

Memory-Based Approach
Pros
• Easy implementation, simple design
• New users can be added easily and incrementally
Cons
• Scalability issues for millions of users
• Performance decreases when data is sparse
• Adding a new place would require recomputing the rating vectors
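The Jaccard similarity and the weighted rating prediction together fit in a short sketch (the `top_l` neighborhood size is an assumption):

```python
def jaccard(ra, rb):
    # Sim(A, B) = |RA ∩ RB| / |RA ∪ RB|
    return len(ra & rb) / len(ra | rb)

def predict_rating(u, place, ratings, top_l=2):
    """ratings: {user: set of places rated 1}. Predict r(u, place) as
    the similarity-weighted average of the binary ratings given by
    u's top-L most similar users."""
    sims = sorted(((jaccard(ratings[u], rv), v)
                   for v, rv in ratings.items() if v != u), reverse=True)
    num = den = 0.0
    for s, v in sims[:top_l]:
        num += s * (1.0 if place in ratings[v] else 0.0)
        den += s
    return num / den if den else 0.0
```

Places with the highest predicted ratings that u has not yet visited would then be recommended.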
Scalability
• 100 points per day (for 6 months) for 1M users: the data size is in the terabytes
• Data storage: MapReduce; trace data for a given user is split across different machines
• Scalability challenges:
  – Clustering
  – Location-time distribution (sorting of trace data)
  – Min-Hash
K-Means Clustering with MapReduce
• Input: trace data for a user, split into shards Data(u1) across machines M1, M2, M3
• Map: for each point, compute K-means assignment and emit a (cluster centroid, point) key-value pair, e.g. (C1, P1), (C2, P2), (C4, P3), (C1, P4), (C3, P5), (C4, P6)
• Reduce: group the points and recompute the clusters and centroids, e.g. (C'1: P1, P2), (C'2: P4, P6), (C'3: P3, P5)
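One K-means iteration in map/reduce form can be sketched as below. This is an in-process illustration of the contract, not an actual Hadoop job; the squared-distance assignment is standard K-means:

```python
from collections import defaultdict

def kmeans_map(shard, centroids):
    """Map: for each point in the shard, emit (index of the nearest
    centroid, point) as a key-value pair."""
    out = []
    for p in shard:
        i = min(range(len(centroids)),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                            + (p[1] - centroids[j][1]) ** 2)
        out.append((i, p))
    return out

def kmeans_reduce(pairs):
    """Reduce: group points by centroid key and recompute each
    centroid as the mean of its assigned points."""
    groups = defaultdict(list)
    for k, p in pairs:
        groups[k].append(p)
    return {k: (sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts))
            for k, pts in groups.items()}
```

Running map over each shard in parallel and reducing the combined output gives one full K-means iteration; iterating to convergence repeats the pair of jobs.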
Sorting Traces by Time with MapReduce
• Input: trace data for a user, split into shards Data(u1) across machines M1, M2, M3
• Map: sort the trace data and generate (timestamp, data point) key-value pairs, e.g. (T1, P1), (T3, P2), (T4, P3), (T2, P4), (T5, P5), (T6, P6)
• Reduce: each reducer, e.g. R(3) and R(4), owns a key and is assigned the elements <= that key; the sorted reducer outputs concatenate to (T1, P1), (T2, P4), (T3, P2), (T4, P3), (T5, P5), (T6, P6)
Collaborative Filtering: Scalability
• The similarity metric is computed between all pairs of users; the O(N^2) complexity explodes for millions of users
• We don't have to consider all user pairs: consider only those pairs that have a high probability of being similar
• Locality Sensitive Hashing (LSH): a hashing technique that hashes data points such that the probability of collision is higher for objects close to each other
• Points with the same hash value form a cluster; the similarity metric is computed only between pairs of users within a cluster
Collaborative Filtering: Min-Hash
• Hashing function
  – Let P = {s1, s2, s3, ..., sM} be the set of all M possible kids' places in an area
  – Pick a random permutation of P; for a user u with rating set Ru, h(u) is the first element of the permutation that belongs to Ru
  – E.g., for Ru = {s1, s3, s4} and Rv = {s1, s3}: Probability[h(u) = h(v)] = |Ru ∩ Rv| / |Ru ∪ Rv| = 2/3, which is exactly our Jaccard similarity metric

Reference: Google News Personalization: Scalable Online Collaborative Filtering, Das et al.
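The collision property can be checked empirically: over many random permutations, the fraction of hash collisions between two users converges to their Jaccard similarity. A minimal sketch (the permutation count and seed are arbitrary choices for illustration):

```python
import random

def minhash(rating_set, perm):
    """h(u) = index of the first element of the random permutation
    `perm` (over all M places) that is in u's rating set."""
    return next(i for i, s in enumerate(perm) if s in rating_set)

def estimate_sim(ru, rv, all_places, n_hashes=2000, seed=0):
    """Estimate P[h(u) = h(v)] by counting collisions across many
    random permutations; this converges to the Jaccard similarity."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_hashes):
        perm = all_places[:]
        rng.shuffle(perm)
        hits += minhash(ru, perm) == minhash(rv, perm)
    return hits / n_hashes
```

In practice a small fixed number of hash functions is used per user, and users sharing hash values land in the same candidate bucket.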
Locality Sensitive Hashing
[Diagram: hash bucket 1 holds users U1, U3, U7; bucket 2 holds U2, U5, U8]
• Users U1, U3, U7 with the same hash value (1) are similar to each other with probability equal to the similarity metric
Locality Sensitive Hashing with MapReduce
• Input: rating sets Ru1, Ru2, Ru3 for different users, spread across machines M1, M2, M3
• Map: compute the hash for each user in parallel, e.g. H(u1) = 1, H(u2) = 1, H(u3) = 2, and output (hash, userid) key-value pairs: (1, u1), (1, u2), (2, u3)
• Reduce: combine users with the same hash value: 1 -> (u1, u2), 2 -> (u3)

Reference: Google News Personalization: Scalable Online Collaborative Filtering, Das et al.
Summary
• Data organization: variation of K-means, labeling
• Modeling: different models to predict the user's current and future location
• Insights: the user's schedule and interests
• Business service: place recommendation system
• Scalability: MapReduce-based approaches
References
• Extracting Places from Traces of Locations, Kang et al.
• Using GPS to Learn Significant Locations and Predict Movement Across Multiple Users, Ashbrook et al.
• PnLUM: System for prediction of next location for users with mobility, Nguyen et al.
• A Survey of Collaborative Filtering Techniques, Su et al.
• Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Adomavicius et al.
• Google News Personalization: Scalable Online Collaborative Filtering, Das et al.
• Learning Likely Locations, Krumm et al.
• Learning Travel Recommendations from User-Generated GPS Traces, Zheng et al.
• Google Maps APIs
• Weka: open-source ML library
Thanks
Questions/Discussion