Video Search: Opportunities and Challenges Keynote Speech at ACM MIR Workshop ACM Multimedia Conference 2005 Singapore Dr. Ramesh R. Sarukkai Yahoo! Search

Embed Size (px)

Citation preview

  • Slide 1

Video Search: Opportunities and Challenges Keynote Speech at ACM MIR Workshop ACM Multimedia Conference 2005 Singapore Dr. Ramesh R. Sarukkai Yahoo! Search { [email protected] [email protected] }[email protected]@yahoo-inc.com Slide 2 About the Presenter Dr. Ramesh Rangarajan Sarukkai currently heads up Video Search engineering at Yahoo where he manages video search/core media search infrastructure. Over his 6+ years tenure at Yahoo, Dr. Sarukkai has grown and managed different teams including the Small Business (e-commerce stores, hosting, domains) billing platform, and architected wireless applications & voice portals including 1-800-my-yahoo. Prior to Yahoo, he has worked in Kurzweil A.I., IBM Watson Lab, and has collaborated with HP Labs. Dr. Sarukkai is the author of the book entitled Foundations of Web Technology (Kluwer/Springer 2002). In addition, he holds the first patent on large vocabulary voice web browsing, and has a number of patents awarded/pending in the areas of speech, natural language modeling, personalized search, information retrieval, optimal sponsored search, media and wireless technologies. Dr. Sarukkai also served on the World Wide Web Consortium (W3C) Voice Browser working group, recently chaired a panel in World Wide Web 2005 conference (Japan) on networking effects of the Web, and has contributed as a reviewer/Program Committee in a number of leading conferences/journals. His current interests include next-generation media search, social annotation, networking effects on the Web, emergent web technologies and new business models. Slide 3 Outline Introduction Market Trends Video Search Opportunities Challenges Conclusion Slide 4 Introduction Key questions discussed in this talk: Why is video search important? What broadly constitutes video search? How is Web Video Search different? What are some opportunities and challenges? Slide 5 Outline Introduction Market Trends Video Search Opportunities Challenges Conclusion Slide 6 Market Trends Whoever controls the media - the images - controls the culture. - Allen Ginsberg American Poet Slide 7 Market Trends Broadband doubling over next 3-5 years Video enabled devices are emerging rapidly Emergence of mass internet audience Mainstream media moving to the Web International trends are similar Money Follows Slide 8 Market Trends How many of you are aware of video on the Web? How many have viewed a video on the Web in The last 3 months? The last 6 months? Ever? Would you watch video on your devices (ipod/wireless)? How many of you have produced video (personal or otherwise) recently? How many of you have shared that with your friends/community? Would you have liked to? Slide 9 Market Trends How many of you are aware of video on the Web? Large portion of online users How many have viewed a video on the Web in The last 3 months? 50% The last 6 months? Ever? Would you watch video on your devices (ipod/wireless)? 1M downloads in 20 days (iPod) How many of you have produced video (personal or otherwise) recently? Continuing to skyrocket with digital camera phones/devices How many of you have shared that with your friends/community? Would you have liked to? Huge interest & adoption in viral communities * Source: Forrester Report Slide 10 Market Trends Technology more media friendly Storage costs plummeting (GB TB) CPU speed continuing to double (Moores law) Increased bandwidth Device support for media Adding media to sites drives traffic Web continues to propel scalable infrastructure for media products/communities Slide 11 Market Trends Democratization of Mass Media in the 21 st Century (aka the long tail) Of the People (media is ultimately defined by users) By the people (production,tagging) For the people (consumption, devices, personalization) - Abraham Lincoln as a Media Mogul in the 21 st Century! The Long Tail was coined by Chris Andersens article in the Wired magazine. Slide 12 Market Trends Mainstream media migrating online & interactive Media content online is growing Video Streaming continuing to grow aggressively online. More interactive in nature (user voting): MTV Interactive American Idol Reality shows Slide 13 Market Trends Video search is a key part of the future of digital media Slide 14 Outline Introduction Market Trends Video Search Opportunities Challenges Conclusion Slide 15 Video Search What is Video Search? Multimedia? As far as I'm concerned, it's reading with the radio on! - Rory Bremner Actor/Writer Slide 16 Video Search Meta-Content Video Production Unstructured Data Video Consumption /Community Content Features + Meta-Content Structured + Unstructured Data Slide 17 Video Search Media Information Retrieval Been around since the late 1970s Text Based/DB Issues: Manual Annotation, Subjectivity of Human Perception Content Based Color, texture, shape, face detection/recognition, speech transcriptions, motion, segmentation boundaries/shots High Dimensionality Limited success to date. Citations: Image Retrieval: Current Techniques, Promising Directions, and Open Issues [Rui et al 99] Slide 18 Video Search Active Research Area Graph from A new perspective on Visual Information Retrieval, Horst Eidenberger, 2004 Black: Image Retrieval; Grey:Video Retrieval; IEEE Digital Library Slide 19 Video Search Traditional 3-Step Overview: Meta-Data/Visual Feature Extraction Multi-dimensional indexing Retrieval System design Slide 20 Video Search Popular features/techniques: Color, Shape, Texture, Shape descriptors OCR, ASR A number of prototype or research products with small data sets More researched for visual queries Slide 21 Video Search: Features 1970 2000199019802010 Texture: Autocorrelation; Wavelet transforms; Gabor Filters Shape: Edge Detectors; Moment invariants; Animate Vision Marr; Finite Element Methods; Shape from Motion; Color: Color Moments Color Histograms Color Autocorrelograms Segmentation: Scene segmentation; Scene Segmentation; Shot detection; OCR: Modeling; Successful OCR deployments; Face: Face Detection algorithms; Neural Networks; EigenFaces ASR: Acoustic analysis; HMMS; N-grams; CSR; LVCSR; Domain Specific; NIST Video TREC Starts Media IR systems Web Media Search Slide 22 Video Search: Features Color Robust to background Independent of size, orientation Color Histogram [Swain & Ballard] Sensitive to noise and sparse- Cumulative Histograms [Stricker & Orgengo] Color Moments Color Sets: Map RGB Color space to Hue Saturation Value, & quantize [Smith, Chang] Color layout- local color features by dividing image into regions Color Autocorrelograms Texture One of the earliest Image features [Harlick et al 70s] Co-occurrence matrix Orientation and distance on gray- scale pixels Contrast, inverse deference moment, and entropy [Gotlieb & Kreyszig] Human visual texture properties: coarseness, contrast, directionality, likeliness, regularity and roughness [Tamura et al] Wavelet Transforms [90s] [Smith & Chang] extracted mean and variance from wavelet subbands Gabor Filters And so on Region Segmentation Partition image into regions Strong Segmentation: Object segmentation is difficult. Weak segmentation: Region segmentation based on some homegenity criteria Scene Segmentation Shot detection, scene detection Look for changes in color, texture, brightness Context based scene segmentation applied to certain categories such as broadcast news Slide 23 Video Search: Features Face Face detection is highly reliable - Neural Networks [Rwoley] - Wavelet based histograms of facial features [Schneiderman] Face recognition for video is still a challenging problem. - EigenFaces: Extract eigenvectors and use as feature space OCR OCR is fairly successful technology. Accurate, especially with good matching vocabularies. Script recognition still an open problem. ASR Automatic speech recognition fairly accurate for medium to large vocabulary broadcast type data Large number of available speech vendors. Still open for free conversational speech in noisy conditions. Shape Outer Boundary based vs. region based Fourier descriptors Moment invariants Finite Element Method (Stiffness matrix- how each point is connected to others; Eigen vectors of matrix) Turing function based (similar to Fourier descriptor) convex/concave polygons[Arkin et al] Wavelet transforms leverages multiresolution [Chuang & Kao] Chamfer matching for comparing 2 shapes (linear dimension rather than area) 3-D object representations using similar invariant features Well-known edge detection algorithms. Slide 24 Video Search: Video TREC Overview: Shot detection, story segmentation, semantic feature extraction, information retrieval Corpora of documentaries, advertising films Broadcast news added in 2003 Interactive and non-interactive tests CBIR features Speech transcribed (LIMSI) OCR Slide 25 Video Search: Video TREC 1 st TREC (2001) Mean Average Precision @100 items (MAP) 0.033 (in general category) Transcript only was better than transcript+image aspects+ASR Also there were different types of test: known items versus general. Known items had at least one result. 2 nd TREC (2002) Interactive runs to rephrase queries Again text based on only ASR was the best performing system Mean Average precision of 0.137 Leading system employed multiple systems: TF-IDF variants (Mean Average Precision MAP 0.093). Other was boolean with query expansion (0.101 MAP) OCR was not particularly applicable; Phonetic ASR did not help. 3 rd TREC (2003) Data changes radically (Broadcast news added CNN, ABC, CSPAN) Baseline ASR + CC MAP 0.155 ASR + CC + VOCR + Image Similarity + Person X retrieval MAP 0.218 * Successful Approaches in the TREC Video Retrieval Evaluations, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004. Slide 26 Video Search: Browsing Search by text Navigation with customized categories Random Browsing Search by example Search by Sketch Slide 27 Video Search: Pre-2000 or Research QBIC (IBM Almaden) Photobook (MIT Media Lab) FourEyes (MIT Media Lab) Netra (UCSB Digital Library) MARS (UCI) PicToSeek (UVA.nl) VisualSEEK (Columbia) PicHunter (NEC) ImageRover (BU) WebSEEK (Columbia) Virage (now Autonomy) Visual RetrievalWare (Convera) AMORE (NEC) BlobWorld (UC Berkeley) Slide 28 Video Search We have discussed the background of video search, techniques used, and applications. It should be clear that video search is a fairly broad term. Slide 29 Video Search Note the differences in these broader definitions of video search: No mention of content based versus meta-data based. No mention of the actual browsing paradigm. Production and consumption are a key aspect of video search, and should be taken into core consideration in the models. Device, personalization and on-demand needs addressed. Slide 30 Video Search Video Search has evolved a lot since its inception and opportunities much broader. Multimedia technologists (We) are defining and influencing the emergent media culture! Slide 31 Outline Introduction Market Trends Video Search Opportunities Challenges Conclusion Slide 32 Opportunities Engineering is the art of compromise, and there is always room for improvement in the real world; but engineering is also the art of the practical. Engineers realize that they must, at some point curtail design, and begin to manufacture or build - H. Petrovski Invention by design (Harvard Univ. Press) Slide 33 Opportunities The Internet Revolution Exposed the internet backbone to the masses Standardized publication of web media (HTML,CSS,XML) Scaled to millions of users Pre-dot-com bust- slow emergence of media search (notably AltaVista Audio-Visual search) Post-dot-com bust Recent trends of community tagging especially for Images: Flickr (Y!), DeliCious, and so on. Slide 34 Opportunities Media Search Now Yahoo! Image, Video, Audio, Podcast searches Flickr(Y!) AOL/SingingFish BlinxTV Google Many smaller companies Slide 35 Opportunities Web Based Video Search Web Search meets Media Search Leverage Meta-content from: Web (inferred meta-data) Community (Tagging and user submission/mRSS) Structured Meta-Data from different sources Augment with limited media analysis OCR, ASR, Classification Slide 36 Video Search Meta-Content Video Production Unstructured Data Video Consumption /Community Content Features + Meta-Content Structured + Unstructured Data Web Video Search Traditional Media Search Slide 37 Opportunities Yahoo! Media Search Billions of Images Millions of Videos Hundreds of Millions of users Reasonable Precision rates Slide 38 Opportunities Example 1: User query zorro User 1 wants to see Zorro videos User 2 wants to see Legend of Zorro movie clips User 3 just wants to see home videos about Zorro Can content based analysis help over structured meta- data query inference? Slide 39 Slide 40 Opportunities Example 2: For main-stream head content such as news videos. Meta-data are fairly descriptive Usually queried based on non-visual attributes. Task: Pull up recent Hurricane Katrina videos Slide 41 Slide 42 Opportunities Example 3: Creative Home Video Community video rendering! The now famous Star Wars Kid Example of social buzz combined with innovative tail content video production. Slide 43 Play Video 1 Play Video 2 Slide 44 Opportunities A hard example: Lets take an example: Supposing you want to find videos that depict a monkey/chimp doing karate! CBIR Approach: Train models for Chimps/Monkeys Motion Analysis for Karate movement models Many open issues/problems! Slide 45 Play Video Slide 46 Opportunities What are the key opportunities for Video Search? Automatic Speech Recognition (ASR) as a comparable case study Slide 47 Opportunities Speech Recognition: Case study From small experimental models to usable large scale products in restricted domains. 60s Very small experimental tests; Digit recognition; 70s Core Acoustic modeling work; Vocal Tract Modeling; Formant analysis; Signal Processing (FFT, MFCC) 80s HMMs; Statistical Language Models; Large collections of acoustic data. Very Large collection of non-acoustic Text Data; Speaker ID; 90s Large vocabulary speech recognition deployments in many domains (IVR, LVCSR) Natural Language Understanding, Core Background Noise/Multiple speakers, Spontaneous Dialog still open research problems Slide 48 Opportunities 1970 2000199019802010 Acoustic Modeling Vocal Tract Modeling Formant Analysis DSP: FFT, MFCC Digit Recognizers Viterbi search; HMMs Sub-phonetic Models FSMs, Grammars, Statistical N-grams Digit Recognition; Small Vocabulary Discrete Medium Vocabulary Continuous Large Vocabulary Continuous LVCSR Telephony/Broadcast Slide 49 Opportunities 1970 2000199019802010 Acoustic Modeling Vocal Tract Modeling Formant Analysis DSP: FFT, MFCC Digit Recognizers Viterbi search; HMMs Sub-phonetic Models FSMs, Grammars, Statistical N-grams Digit Recognition; Small Vocabulary Discrete Medium Vocabulary Continuous Large Vocabulary Continuous LVCSR Telephony/Broadcast Millions of non-audio Text Data Hundreds of Hours of Audio Data Modeling Framework Milestones Slide 50 Opportunities Speech Recognition Case Study Key Milestones: Foundations: HMMs, Viterbi search, Statistical Language Models Creative use of Data: Millions of non-audio text data from Wall Street Journal Corpus applied for Language model training Data Crunching at the lowest levels: Hundreds of hours of data applied to model thousands of sub-phonetic models Slide 51 Opportunities Speech Recognition Case Study: Exploiting non-media sources of meta/text data are beneficial Applying contextual restraints enables improved applicability Large scale data crunching does indeed help solve hard problems Slide 52 Opportunities Web Video Search: Mass Adoption: Large Community of users/taggers/web developers Different sources of Media and non-Media data Highly Scalable systems Slide 53 Opportunities The ASR takeaways can be applied to video search: Exploiting non-media sources of meta/text data are beneficial (Web Meta-data, Community Tagging) Applying contextual restraints enables improved applicability (User models, structured data,Tailored) Large scale data crunching does indeed help solve hard problems (Throw Horsepower) Slide 54 Opportunities Web Browsing Paradigms Search by text (Very popular) Navigation with customized categories (Very Popular) Random Browsing (Popular for certain categories: Discovery) Search by example (Not popular currently) Search by Sketch (Not popular) Combined usage models - users skipping to scan and then pick selective videos to play. Slide 55 Opportunities Leverage Community Exploit Structured Meta-Data Constrain with Context Slide 56 Opportunities Tying into social network modeling for media search How do you define social networks? How do you integrate into video search indexing/ranking? How do you filter out the noise from the authoritative ones? Slide 57 Opportunities Exploit Structured Meta-Data Video has a number of different editorial sources; Exploit such structured data Tailor results to users perceived needs Music Videos, Movies aka traditional video markets Slide 58 Opportunities Constrain with context Occams razor: Simpler the model, lesser the number of parameters to train, the more general the model. Rather than throw in all meta + media feature vectors, tailor the model; A decision tree based approach to picking the right algorithms. E.g. News broadcast will require certain type of content features and meta-data. Context can also be constrained by user (preferences), device parameters (media type, geo-location) Slide 59 Outline Introduction Market Trends Video Search Opportunities Challenges Conclusion Slide 60 Challenges A man will occasionally stumble over truth, but usually manages to pick himself up, walk over or around it and carry on. - Winston Churchill Slide 61 Challenges Theoretical Frameworks Leverage CBIR solutions Learn User Needs/Behavior Inherit Web Search Problems Slide 62 Challenges Theoretical Frameworks Establishing the right theoretical frameworks for video search is still a problem for a variety of reasons: Lack of proper classified data Large dimensionality; Different sources of meta-data. Difficult to define the correct perceived relevance metric (Semantic Gap*) Cost functions such as average precision are not differentiable & need to be approximated with heuristics Integrate with Application specific cost functions: E.g. search: Click through rates, clicks, video views, customer time spent Sometimes ranking needs to be based on user buzz, quality and other non-IR metrics * Content-Based Image Retrieval at the End of the Early Years, Arnold Smeulders et al, IEEE PAMI, Dec. 2000 Slide 63 Challenges Leverage CBIR Video TREC Summary: Text based retrieval still is much better than non-text based searches. Visual queries have poor MAP, but amenable for interactive search. Commercial detectors work okay with broadcast news, but not okay with documentaries with no commercials. (also face detection) Visual features are seldom useful: Not without exceptions, edge detectors usually helped only in aircraft and animal categories when color features were not as discriminative. ASR was not as useful in tasks such as shot detection (as opposed to indexing whole clip). OCR is more closer to the actual occurrence of the word. Successful Approaches in the TREC Video Retrieval Evaluations, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004. Video Retrieval using Speech and Image Information, Alex Hauptmann et al. Slide 64 Challenges Leverage CBIR Apply what works in a tailored manner: Color, Texture, OCR, Speech Recognition Classification (in restricted domains) Copyright infringement prevention Story-boarding/Shots Music/Speech Detection Apply other techniques in selective domains Broadcast news: ASR, Segmentation, Speaker Detection Motion for event detection restricted to time regions isolated by ASR Slide 65 Challenges Overcome the temptation to solve the computer vision problem*. Yet leverage computer vision techniques Today, CBIR is mostly image based Leverage motion flow information more Apply techniques in compressed domain (or leverage compressed data- e.g. MPEG-4 already computes local/global motion) Strong Image Segmentation is hard- Consistent Segmentation across image sequences in video can be exploited. * Guest Introduction: The changing Shape of Computer Vision in the Twenty-First Century, M. Shah, IJCV, 2002 Slide 66 Challenges User Needs/Behavior Understand user needs Better modeling of query and user intent for video search Not much work has been done on user intent analysis for video search Leverage User Behavior Implicit versus explicit relevance feedback Challenges in tracking user feedback Blind/Implicit feedback is useful for limited domains How do you assess user relevance? More challenging for video since there is a temporal duration associated with the actions. Explicit ratings are an option, but not incentivizable Slide 67 Challenges Inherit Web Search Problems Meta-data Spam Users spam through Web or tagging Leverage Web Search solutions of spam detection based on text, linkage analysis, or user disrepute. Potential dilution due to long tail How do you balance classic web search ranking with media based ranking? Slide 68 Outline Introduction Market Trends Video Search Opportunities Challenges Conclusion Slide 69 If you look back too much, you will soon be headed that way. - Unknown Slide 70 Conclusion In this presentation, we covered: Market trends that are fuelling large scale demand globally for video search. The changing nature of video search largely driven by the Web. Discussed different systems and the techniques that have been applied to media search. Identified key areas of opportunities and challenges. Slide 71 Conclusion Video Search is here to stay. We (the attendees) have an opportunity to define and shape next generation media production, consumption, usage and ultimately the culture. Slide 72 Conclusion Many thanks to: MIR Workshop organizers Qi Tian Alex Jaimes ACM Multimedia 2005 Conference Organizers Y! Search John Thrall Qi Lu Bradley Horowitz Video Search team, esp. Ruofei (Bruce) Zhang for help with references Y! Blore R&D: Sesha Shah, Srinivasan, YBL (Marc Davis et al) & YRL (Malcolm Slaney) - Prof. S.K. Rangarajan for help with quotes Slide 73 Conclusion Thanks to the audience Questions? Slide 74 Appendix: Leveraging Research & Industrial Communities Scalable systems deployed and processing millions of media files End-user testing with huge volume of traffic General problem areas covered in R&D (e.g. VideoTREC) are very relevant Selective application and proper modelling of information fusion is key. How do we share or leverage that data to foster research? How can industries help? How can universities help?