Towards Scalable Multimedia Analytics
Björn Þór Jónsson, datasys group, Computer Science Department, IT University of Copenhagen
Today’s Media Collections
• Massive and growing
  – Europeana: > 50 million items
  – DeviantArt: > 250 million items (160K/day)
  – Facebook: > 1,000 billion items (200M/day)
• Variety of users and applications
  – Novices → enthusiasts → scholars → experts
  – Current systems aimed at helping experts
• Need for understanding and insights
Media Tasks
Search ↔ Exploration
Media Tasks
Multimedia Analytics
[Zahálka and Worring, 2014]
Multimedia Analytics
Multimedia Analysis
Visual Analytics
[Zahálka and Worring, 2014]
[Zahálka and Worring, 2014; Keim et al., 2010]
From Data to Insight
The Two Gaps
• Semantic Gap [Smeulders et al., 2000]:
  from generic data and annotation based on objective understanding
  to a specific, context- and task-driven, subjective understanding
• Pragmatic Gap [Zahálka and Worring, 2014]:
  from predefined, fixed annotation based on understanding of the collection
  to a dynamically evolving, interaction-driven understanding of collections
Multimedia Analytics: State of the Art
• Theory is developing
• Early systems have appeared
• No real-life applications (?)
• Small collections only
Scalable Multimedia Analytics
• Multimedia Analysis
• Visual Analytics
• Data Management
[Jónsson et al., 2016]
The Three Gaps
• Semantic Gap [Smeulders et al., 2000]:
  from generic data and annotation based on objective understanding
  to a specific, context- and task-driven, subjective understanding
• Pragmatic Gap [Zahálka and Worring, 2014]:
  from predefined, fixed annotation based on understanding of the collection
  to a dynamically evolving, interaction-driven understanding of collections
• Scale Gap [Jónsson et al., 2016]:
  from pre-computed indices and bulk processing of large datasets
  to serendipitous and highly interactive sessions on small data subsets
Ten Research Questions for Scalable Multimedia Analytics
• Velocity, Volume, Variety, Visual Interaction
[Jónsson et al., MMM 2016]
Big Data Framework: Lambda Architecture
• Batch Layer
• Speed Layer
• Service Layer
• Storage Layer
[Marz and Warren, 2015]
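To make the layer split concrete, here is a minimal Python sketch of the Lambda-architecture idea: a query is answered by merging a precomputed batch view with a small speed-layer view of recent data. Only the layer names come from the slide; the toy views, values, and the query function are illustrative assumptions.

```python
# Minimal Lambda-architecture sketch (illustrative names and data, not from the talk).

batch_view = {"cat": 1_200_000, "dog": 950_000}   # recomputed periodically over the full collection
speed_view = {"cat": 42, "fox": 7}                # incremental counts since the last batch run

def query(term: str) -> int:
    """Serve a result by combining the batch view with recent speed-layer increments."""
    return batch_view.get(term, 0) + speed_view.get(term, 0)

print(query("cat"))  # 1_200_042
print(query("fox"))  # 7
```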
Outline
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!
Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin: Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark. Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017.
Spark Case Study: Motivation
• How can multimedia tasks harness the power of cloud computing?
  – Multimedia collections are growing
  – Computing power is abundant
• ADCFs = Hadoop || Spark
  – Automatically Distributed Computing Frameworks
  – Designed for high-throughput processing
Design Choices: ADCF = Spark
• Hadoop is not suitable (more on this later)
• Resilient Distributed Datasets (RDDs)
  – Transform one RDD into another via operators
  – Lazy execution
  – Master-and-workers paradigm
• Supports deep pipelines
• Supports memory sharing among worker tasks
• Lazy execution allows for optimizations
Design Choices: Application Domain
• Content-Based Image Retrieval (CBIR)
  – Well-known application
  – Two phases: off-line & “on-line”
  – Query image → CBIR system → search results
Design Choices: DeCP Algorithm
Properties:
• Clustering-based
• Deep hierarchical index
• Approximate k-NN search
• Trades response time for throughput by batching
Why DeCP?
• Very simple
• Prototypical of many CBIR algorithms
• Previous Hadoop implementation facilitates comparison
DeCP as a CBIR System
• Off-line
  – Build the index hierarchy
  – Cluster the data collection
• On-line
  – Approximate k-NN search
  – Vote aggregation
• The index is kept in RAM; the clusters reside on disk
Searching a single feature: identify the relevant cluster(s) in the clustered collection, retrieve them, scan them, and return the k-NN (a hedged sketch follows).
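Below is a minimal Python sketch of the identify/retrieve/scan/k-NN idea for a single query feature, using a toy two-level centroid hierarchy. The hierarchy depth, the NumPy brute-force distances, and all names are illustrative assumptions; this is not the DeCP implementation, which uses a deeper index and disk-resident clusters.

```python
import numpy as np

def nearest(points: np.ndarray, q: np.ndarray, k: int = 1) -> np.ndarray:
    """Indices of the k rows of `points` closest to q (brute-force Euclidean)."""
    return np.argsort(np.linalg.norm(points - q, axis=1))[:k]

def search_one_feature(q, top_centroids, leaf_centroids, clusters, k=20, b=1):
    """Approximate k-NN for one query feature: identify the closest leaf cluster(s)
    via the in-memory hierarchy, retrieve and scan only those, return the k nearest."""
    top = nearest(top_centroids, q)[0]                  # identify: level 1
    leaves = nearest(leaf_centroids[top], q, b)         # identify: level 2
    cand = np.vstack([clusters[(top, leaf)] for leaf in leaves])  # retrieve
    return cand[nearest(cand, q, k)]                    # scan + k-NN

# Tiny toy index: 2 top-level centroids, 2 leaves each, random 2-D "features".
rng = np.random.default_rng(42)
top_centroids = rng.random((2, 2))
leaf_centroids = rng.random((2, 2, 2))                  # [top][leaf] -> centroid
clusters = {(t, l): rng.random((100, 2)) for t in range(2) for l in range(2)}
print(search_one_feature(rng.random(2), top_centroids, leaf_centroids, clusters, k=5))
```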
Design Choices: Feature Collection
• YLI feature corpus from YFCC100M
  – Various feature sets (visual, semantic, …)
  – 99.2M images and 0.8M videos
  – Largest publicly available dataset
• Use all 42.9 billion SIFT features!
  – Goal is to test at a very large scale
  – No feature aggregation or compression
  – Largest feature collection reported!
Research Questions
• What is the complexity of the Spark pipeline for typical multimedia-related tasks?
• How well does background processing scale as collection size and resources grow?
• How does batch size impact throughput of an online service?
Requirements for the ADCF
• R1 Scalability: ability to scale out with additional computing power
• R2 Computational flexibility: ability to carefully balance system resources as needed
• R3 Capacity: ability to gracefully handle data that vastly exceeds main-memory capacity
• R4 Updates: ability to gracefully update the data structures for dynamic workloads
• R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
• R6 Simplicity: how efficiently the programmer’s time is spent
DeCP on Hadoop
• Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines
• Conclusion: limited success
  – Scalability limited due to RAM per core
  – Two-step Map-Reduce pipeline is too rigid
    • Ex: single data source only
    • Ex: could not search multiple clusters
  – R1, R2, R3 = partially; R4 = no; R5, R6 = no
DeCP on Spark
• A very different ADCF from Hadoop
• Several advantages:
  – Arbitrarily deep pipelines
    • Easily implement all features and functionality
  – Broadcast variables
    • Solve the RAM-per-core limitation (a hedged sketch follows)
  – Multiple data sources
    • Ex: allows join operations for maintenance (R4)
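As a concrete illustration of the broadcast-variable point, the PySpark sketch below shares one read-only copy of a toy centroid index with every executor, instead of shipping a copy with every task. Only the broadcast mechanism itself is the Spark feature being shown; the index contents and the assignment function are illustrative assumptions, not DeCP code.

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-sketch")

# Toy stand-in for the in-memory index hierarchy (centroid id -> centroid).
index = {0: [0.1, 0.2], 1: [0.9, 0.8]}

# One read-only copy per executor, not one per task/core:
# this is what addresses the RAM-per-core limitation seen with Hadoop.
bc_index = sc.broadcast(index)

def assign(feature):
    """Nearest centroid by squared Euclidean distance, read from the broadcast index."""
    centroids = bc_index.value
    return min(centroids, key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(centroids[c], feature)))

features = sc.parallelize([[0.15, 0.22], [0.85, 0.79]])
print(features.map(lambda f: (assign(f), f)).collect())

sc.stop()
```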
Spark Pipeline Symbols
• .map = one-to-one transformation
• .flatMap = one-to-any transformation
• .groupByKey = Hadoop’s Shuffle
• .reduceByKey = Hadoop’s Reduce
• .collectAsMap = collect to the Master
Search Pipeline
• The indexing and search pipelines are built from the operators above (a hedged sketch of the search side follows).
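Putting those operators together, here is a hedged PySpark sketch of what a batched search pipeline might look like: route each query feature to a cluster, scan the matching clusters, aggregate votes per (query, image) pair, and collect the small result table at the master. The toy 1-D features, the routing rule, and the scoring function are illustrative assumptions, not the pipeline from the MMSys paper.

```python
from pyspark import SparkContext

sc = SparkContext(appName="search-pipeline-sketch")

# Toy clustered collection: (cluster_id, (image_id, feature)).
collection = sc.parallelize([
    (0, ("imgA", 0.10)), (0, ("imgB", 0.15)),
    (1, ("imgC", 0.80)), (1, ("imgD", 0.95)),
])

# Toy query batch: (query_id, feature).
queries = sc.parallelize([("q1", 0.12), ("q2", 0.90)])

# .map: route each query to the cluster it should scan (toy 1-D "index").
routed = queries.map(lambda q: (0 if q[1] < 0.5 else 1, q))

# .flatMap after a join: scanning a cluster emits one scored vote
# per (query, candidate image) pair found in that cluster.
votes = routed.join(collection).flatMap(
    lambda kv: [((kv[1][0][0], kv[1][1][0]),
                 1.0 / (1e-9 + abs(kv[1][0][1] - kv[1][1][1])))]
)

# .reduceByKey: aggregate votes for the same (query, image) pair.
tallies = votes.reduceByKey(lambda a, b: a + b)

# .collectAsMap: bring the small result table back to the master.
print(tallies.collectAsMap())

sc.stop()
```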
Evaluation: Off-line Indexing
• Hardware: 51 AWS c3.8xl nodes
  – 800 real cores + 800 virtual cores
  – 2.8 TB of RAM and 30 TB of SSD space
• Indexing time as the collection grows (scaling is relative to the 8.5-billion baseline):

  Features (billions)   Indexing time (s)   Scaling (relative)
  8.5                    3,287               –
  17.2                   5,030               1.53
  26.0                  11,943               3.63
  34.5                  14,192               4.31
  42.9                  19,749               6.00
Evaluation: “On-line” Search
• Throughput as a function of batch size [figure], with the Hadoop throughput limit shown for comparison
Summary
  Requirement                      Spark                        Hadoop
  R1 Scalability                   Yes                          Partial (RAM per core)
  R2 Computational Flexibility     Yes                          Partial
  R3 Capacity                      Yes                          Partial
  R4 Updates                       Partial (full re-shuffle)    No
  R5 Flexible Pipelines            Yes                          No
  R6 Simplicity                    Yes                          No
Outline
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Blackthorn: Large-Scale Interactive Multimodal Learning. Under revision at IEEE Transactions on Multimedia.
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Interactive Multimodal Learning on 100 Million Images. Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016.
Framework: Lambda Architecture
• Batch Layer, Speed Layer, Service Layer, Storage Layer
Blackthorn: Motivation
• Do not impose a dictionary on the user
• Let the user synthesize categories of relevance from semantic annotations on the fly
• Let the user search and explore along those categories interactively
• Interactive semi-supervised learning, at scale! (a hedged sketch of such a loop follows)
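To illustrate what an interactive semi-supervised loop can look like, here is a minimal Python sketch of one relevance-feedback round: the user labels a handful of items, a tiny linear model is refit, and the whole collection is rescored so the highest-scoring unseen items can be shown next. The perceptron-style learner, the random features, and all names are assumptions for illustration only; Blackthorn's actual learner and features differ.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((10_000, 64))   # stand-in for (compressed) semantic features
labels = {}                           # item_id -> +1 (relevant) / -1 (not relevant)

def retrain_and_score(features, labels, lr=0.1, epochs=50):
    """Fit a tiny linear relevance model on the labelled items (perceptron-style)
    and score the whole collection; a stand-in for an interactive learner."""
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        for i, y in labels.items():
            if y * (features[i] @ w) <= 0:      # misclassified -> update
                w += lr * y * features[i]
    return features @ w

# One round of interaction: the user marks a few items, everything is rescored.
labels.update({0: 1, 1: 1, 2: -1})
scores = retrain_and_score(features, labels)
suggestions = np.argsort(-scores)[:10]          # next items to show the user
print(suggestions)
```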
Honza’s Scalability Illustration
• “Yesterday”: 10-100K images
• YFCC: 100M images
Image credit: http://demonocracy.info/infographics/usa/us_debt/us_debt.html
Scale > Size
• Single (high-end) workstation
• 1000-D features → 800 GB
• Interactive response time!
• Computing feature scores takes minutes!
Blackthorn Overview
Blackthorn Compression
Blackthorn Results: 1.2M Collection
• Compression: 880 GB → 5 GB
• Precision: 89-108% of uncompressed
• Scoring time: 60-80x faster
• Recall over time: Blackthorn rocks!
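One plausible way to reach compression ratios of this kind is to keep only the strongest components of each high-dimensional feature vector and score against them without decompressing. The sketch below illustrates that general idea only; it is an assumption, not Blackthorn's actual compression scheme (see the papers cited above for that).

```python
import numpy as np

def compress_topk(vec: np.ndarray, k: int = 50):
    """Keep only the k strongest components of a dense feature vector as
    (index, 8-bit value) pairs. Illustrative assumption, not Blackthorn's scheme."""
    idx = np.argsort(-vec)[:k].astype(np.uint16)
    vals = np.clip(vec[idx] * 255, 0, 255).astype(np.uint8)
    return idx, vals

def score_compressed(idx: np.ndarray, vals: np.ndarray, w: np.ndarray) -> float:
    """Dot product of a compressed vector with a linear model, without decompressing."""
    return float(w[idx] @ (vals / 255.0))

rng = np.random.default_rng(1)
vec = rng.random(1000)            # one 1000-D feature, as on the "Scale > Size" slide
idx, vals = compress_topk(vec)
w = rng.standard_normal(1000)     # a toy linear relevance model
print(score_compressed(idx, vals, w))
```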
Blackthorn Results: YFCC100M Collection
• Scoring time: ~1 second!
Blackthorn Future Work
• More (user) evaluation is needed
• Other applications may (will) require adaptations
• Further scalability: combine eCP and Blackthorn
Outline
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!
Why Scale?
• Current and future applications
• Future of computing
• Because we cannot yet!
We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.
JFK, September 12, 1962
Scalability Hurdles: Can Industry Help?
• Industry-level collections
  – Data
  – Processing capacity
• The small-minded reviewer
  – “Are there users willing to explore 100M data sets interactively?”
• Interactive applications
  – Application knowledge
  – User study “victims”
Summary
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!