1. Crowd Detector Reza Asad Insight Data
Engineering June 2015
2. Motivation Avoid waiting time in crowded areas.
3. Data Let's imagine we had data about people's locations. This
could be collected from people's cell phones. How can we use such
data?
4. Naive Approach
5. Demo
6. Data But such data is not available to me ... Solution:
Engineer the data! Take data from Yelp. Perform a random walk.
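A minimal sketch of the random-walk idea, assuming each simulated person starts at a known location (for example a Yelp business coordinate) and moves by a small random offset at every step. The function name, the step size, and the starting point are illustrative assumptions, not the values used in the project.

```python
import random


def random_walk(lat, lon, steps, step_deg=0.0005, seed=None):
    """Return a list of (lat, lon) positions from a simple 2D random walk."""
    rng = random.Random(seed)
    path = [(lat, lon)]
    for _ in range(steps):
        # Shift each coordinate by a small, uniformly random offset.
        lat += rng.uniform(-step_deg, step_deg)
        lon += rng.uniform(-step_deg, step_deg)
        path.append((lat, lon))
    return path


# Hypothetical starting point, roughly downtown San Francisco.
path = random_walk(37.7749, -122.4194, steps=5, seed=42)
for position in path:
    print(position)
```

Each position in the path could then be sent as one message, so that a single seed location produces a stream of plausible movements.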
7. Pipeline Data
8. Engineering Challenges Choosing K?
9. Engineering Challenges The area of SF: 46.87 mi². For the
purpose of this project each cluster is 0.09 mi². This means k is
roughly 500.
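The estimate of k is the city area divided by the target cluster area. The short check below uses the two figures from the slide, read as areas in square miles; the variable names are illustrative.

```python
SF_AREA_SQ_MI = 46.87        # area of San Francisco, from the slide
CLUSTER_AREA_SQ_MI = 0.09    # target area covered by one cluster

k_estimate = SF_AREA_SQ_MI / CLUSTER_AREA_SQ_MI
print(round(k_estimate))  # 521, i.e. roughly 500
```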
10. Engineering Challenges Parameters to tune: time it takes to
produce the messages; processing time for k-means in Spark Streaming;
the update interval for a fixed data point in the database.
11. Goal Tune the parameters in order to have a stable system.
The total delay after processing each batch must be constant and
comparable to the batch interval. You can check this in the Spark
API.
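The stability condition can be illustrated with a small simulation. This is a simplified model, not Spark itself: it assumes batches arrive once per batch interval and are processed one at a time, so a batch that takes longer than the interval delays every batch after it. The function name and the sample numbers are assumptions made for illustration.

```python
def total_delays(batch_interval, processing_times):
    """Return the total delay (scheduling delay + processing time) per batch."""
    delays = []
    scheduling_delay = 0.0  # time a batch waits before processing starts
    for processing_time in processing_times:
        delays.append(scheduling_delay + processing_time)
        # The next batch arrives one interval later; it waits only if the
        # current batch has not finished by then.
        scheduling_delay = max(
            0.0, scheduling_delay + processing_time - batch_interval
        )
    return delays


# Stable: processing time is below the 2 s batch interval.
print(total_delays(2.0, [1.5] * 4))  # [1.5, 1.5, 1.5, 1.5]

# Unstable: processing time exceeds the interval, so the delay keeps growing.
print(total_delays(2.0, [2.5] * 4))  # [2.5, 3.0, 3.5, 4.0]
```

In the stable case the total delay stays constant and below the batch interval; in the unstable case it grows by the shortfall on every batch.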
12. Tackling Challenges Having multiple producers and consumers:
Kafka is fast at sending messages and is not the bottleneck.
Establishing some safe limits: using
spark.streaming.receiver.maxRate to control the input rate.
Understanding the complexity of the process in Spark Streaming.
Choosing the right batch interval.
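One way to pick a safe limit is sketched below. `spark.streaming.receiver.maxRate` caps the number of records per second that each receiver accepts, so the sketch divides a measured processing throughput across the receivers and keeps some headroom. The helper name, the 80% headroom, and the example throughput are assumptions, not values from the project.

```python
def receiver_max_rate(throughput_per_s, num_receivers, headroom_pct=80):
    """Records per second each receiver may accept, leaving some headroom."""
    return throughput_per_s * headroom_pct // (100 * num_receivers)


# Hypothetical measurement: the job can process 10,000 records per second
# and reads from 4 receivers.
conf = {
    "spark.streaming.receiver.maxRate": str(receiver_max_rate(10_000, 4)),
}
print(conf)  # {'spark.streaming.receiver.maxRate': '2000'}
```

The resulting key/value pair can be supplied to the application as an ordinary Spark configuration property, for example through `--conf` on `spark-submit`.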
13. Raw Data
14. Data Process Data filtering in Spark Streaming
15. Data Process
16. About Me Long time ago - B.S in pure math, University of
Toronto More recent - M.S in applied math, University of British
Columbia The exciting now - A data engineer who wants to go camping
with other data engineers