In computer vision, context has been mostly ignored in the last two decades. We show that in understanding images, context plays a more significant role than content.
© Ramesh Jain
Ramesh Jain (with Pinaki Sinha and other collaborators)
Department of Computer Science University of California, Irvine
Contenxt: Bridging the Semantic Gap
Football Highlight System: Automatic Segmentation
- 15 college teams
- All games – 4 cameras
- 30 minutes after the game
Find Mubarak Shah
Image Search: Ramesh Jain
Gives some details
Tells me who may not be the ‘real’ Ramesh Jain
Finds people who are my friends
Image Search: Finds activities
My current research
- EventWeb: connecting and accessing events
  - From Twitter, Facebook; from Web cams, Planetary Skin, …
  - Connecting environments
- Personal Media Management: images, video, text, …
- Doing Computer Vision Correctly
Computer Vision
Computer vision is the science and technology of machines that see.
As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images.
From Wikipedia.
How do you Search for Images?
- Use a content-based image retrieval engine from XYZ University?
- Use a content-based image search engine from a company? Is there any? I tried to do one in 1994 and built Virage as a result – but …
- Or do you use just a ‘text’ search engine?
Text Search Engines
How good are text search engines at object recognition?
Let’s look at some real working systems by searching for people here.
Disruptive Stages in Computing:1
Data: Numbers, Text, Statistics, Sensors (Video)
Data (Computation)
Computing 1: Data
- Mainframes and workstations
- Main applications: scientific and engineering, business
- Users: sophisticated, expected to be trained
- Dominant technology: computing
Disruptive Stages in Computing:2
Data: Numbers, Text, Statistics, Sensors (Video)
Data (Computation)
Information: Search, Specialized sources
Information (Communication)
Computing 2: Information
- PC and Internet
- Main applications: information, communication
- Users: common people in the ‘developed world’; easy access using keyboards
- Dominant technology: authoring tools, access mechanisms, sharing
What Next?
Disruptive Stages in Computing:3
Data: Numbers, Text, Statistics, Sensors (Video)
Data (Computation)
Information: Search, Specialized sources
Information (Communication)
Experience: Direct observation or
participation
Experience (Insights)
Computing 3: Experience
- Experiential devices: mobile phones
- Main applications: experience management, experiential communication
- Users: humans; no language issues
- Dominant technology: sensor understanding
- Vision and audio will be dominant
The Challenge
Connecting
Bits and Bytes
Alphanumeric Characters
Lists, Arrays, Documents, Images …
Transformations
Semantic Gap
The semantic gap is the lack of coincidence between the information that one can extract from the (visual) data and the interpretation that the same data have for a user in a given situation. A linguistic description is almost always contextual, whereas an (image) may live by itself.
Content-Based Image Retrieval at the End of the Early Years, in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Arnold Smeulders et al., December 2000
Data Information Experience
Rorschach Test
Psychologists use this test to examine a person's personality characteristics and emotional functioning.
Falling Tree and George Berkeley
"If a tree falls in a forest and no one is around to hear it, does it make a sound?"
"No. Sound is the sensation excited in the ear when the air or other medium is set in motion."
Observation, Reality, and Perception.
Context
- text surrounding word or passage: the words, phrases, or passages that come before and after a particular word or pas…
- surrounding conditions: the circumstances or events that form the environment within which something exists or takes …
- data transfer structure: a data structure used to transfer electronic data to and from a business management system
Content
- amount of something in container: the amount of something contained in something else
- subject matter: the various issues, topics, or questions dealt with in speech, discussion, or a piece of writing
- meaning or message: the meaning or message contained in a creative work, as distinct from its appearance, form, or style
The Story of Computer Vision
The Psychology of Computer Vision (McGraw-Hill Computer Science Series), 1975.
Marvin Minsky and the summer project to solve computer vision.
D.L. Waltz, Understanding Line Drawings of Scenes with Shadows.
MSYS: A System for Reasoning about Scenes
Harry Barrow and Martin Tenenbaum
April 1976
MSYS: Relational Constraints
Relaxation labelling algorithms — a review J Kittler and J Illingworth
Image and Vision Computing Volume 3, Issue 4, November 1985, Pages 206-216
Abstract: An important research topic in image processing and image interpretation methodology is the development of methods to incorporate contextual information into the interpretation of objects. Over the last decade, relaxation labelling has been a useful and much studied approach to this problem. It is an attractive technique because it is highly parallel, involving the propagation of local information via iterative processing. The paper surveys the literature pertaining to relaxation labelling and highlights the important theoretical advances and the interesting applications for which it has proven useful.
Serge Belongie and Co-Researchers
semantic context (probability), spatial context (position) and scale context (size).
Modeling the World
- Data (Semantic Web)
- Objects (search companies, …)
- Events (relationships among objects and attributes)
Both Objects and Events are essential to model the world.
Events
- Take place in the real world.
- Are captured using different sensory mechanisms; each sensor captures only a limited aspect of the event.
- Are used to understand a situation.
What is in an Event?
[Figure: events plotted against time and one-dimensional space]
History: Gopher to Google
We had the Internet: lots of computers connected to each other, and the computers had files on them. We had Gopher, FTP, and similar mechanisms.
Tim Berners-Lee thought:
Suppose all the information stored on computers everywhere were linked.
Suppose I could program my computer to create a space in which anything could be linked to anything.
Others – including Vannevar Bush – had that idea earlier, but the technology was not ready.
That resulted in the Web
DocumentWeb:
- Each node is a ‘page’ or a document.
- Pages are linked through explicit referential links.
Then Came Google, Facebook, Twitter
- Search, Maps, …
- Social Network
- Events: Twitter status updates, Eventful
Evolution of Search
- Alphanumeric structured data: databases
- Information retrieval
- Search
- Multimedia search
- Real-time search (event search)
Will lead to identifying situations.
Continuing the Evolution of the Web
Consider a Web in which each node:
- Is an event
- Has informational as well as experiential data
- Is connected to other nodes using referential, structural, relational, and causal links
Explicit links can be created by anybody. This EventWeb is connected to other Webs.
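The node-and-link structure above can be sketched as a small data model. This is an illustrative sketch – the class and field names are assumptions, not the actual EventWeb implementation:

```python
from dataclasses import dataclass, field

# The four link types named above: referential, structural, relational, causal.
LINK_TYPES = {"referential", "structural", "relational", "causal"}

@dataclass
class EventNode:
    """One node of an EventWeb: an event with informational and
    experiential data, linked to other event nodes."""
    name: str
    info: dict = field(default_factory=dict)    # informational data
    media: list = field(default_factory=list)   # experiential data (photos, video, ...)
    links: list = field(default_factory=list)   # (link_type, target) pairs

    def link(self, link_type, target):
        """Add an explicit, typed link to another event node."""
        if link_type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {link_type}")
        self.links.append((link_type, target))

# Usage: anyone can create an explicit link, e.g. a game causing a celebration.
game = EventNode("UCI football game")
party = EventNode("post-game celebration", media=["IMG_0344.jpg"])
game.link("causal", party)
```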
Connectors
- My 5 senses are connectors between ‘me’ and the world.
- We use our sensors (vision, audio, …) to experience the world.
- Sensors could be the interface between cyberspace and the real world.
- Sensors are placed for ‘detecting events’.
- How do you decide what sensors to put at any place? Would you put a sensor if nothing interesting ever happens at a place?
From Atomic Events to Composite Events
- Spatial and temporal aggregation
- Assimilation
- Composition using sophisticated models
- Ontological models could be used
- May include causality
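As a toy illustration of the temporal-aggregation step, here is a simple gap-threshold rule – a hypothetical stand-in for the sophisticated and ontological models mentioned above:

```python
def aggregate(timestamps, max_gap=3600):
    """Group atomic-event timestamps (in seconds) into composite events:
    a new composite event starts whenever the gap between consecutive
    atomic events exceeds max_gap (here, one hour)."""
    composites = []
    for t in sorted(timestamps):
        if composites and t - composites[-1][-1] <= max_gap:
            composites[-1].append(t)   # close enough: same composite event
        else:
            composites.append([t])     # large gap: start a new composite
    return composites

# Five photos: three in quick succession, then two much later
# -> two composite events.
groups = aggregate([0, 100, 200, 10000, 10050])
```

A real system would aggregate over space as well as time, and use event ontologies rather than a single threshold.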
[Figure: an EventWeb plotted against time and one-dimensional space]
Types of Context
- Relationships among different objects, and even their subparts, in the real world
- Environmental parameters of the digital device at the time the photo is taken
- Knowledge about the person taking the photo, and even of the person interpreting it
- The real-world situation in which the data is interpreted
Context Starts much Before the Photo is Taken
- Where
- When
- Why
- Who (the photographer)
- Which device
- Parameters of the device
Modern Cameras
- Are more than ‘camera obscura’: they capture an event.
- Many sensors capture scene context and store it along with the intensity values.
- EXIF data is all metadata related to the event: exposure time, aperture diameter, flash, metering mode, ISO rating, focal length, time, location, (soon) face.
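The EXIF story above can be made concrete. A minimal sketch that pulls the optical context out of raw tag/value pairs – the tag IDs come from the EXIF specification, but the hand-rolled table and function names are illustrative, not a full parser:

```python
# Tag IDs per the EXIF specification; a small hand-rolled table so the
# sketch needs no imaging library.
EXIF_TAGS = {
    0x829A: "ExposureTime",
    0x829D: "FNumber",          # aperture
    0x8827: "ISOSpeedRatings",
    0x9003: "DateTimeOriginal",
    0x9207: "MeteringMode",
    0x9209: "Flash",
    0x920A: "FocalLength",
}

def scene_context(raw_exif):
    """Map raw (tag_id, value) EXIF pairs to named scene-context fields,
    dropping tags that are not part of the optical context."""
    return {EXIF_TAGS[t]: v for t, v in raw_exif if t in EXIF_TAGS}

# e.g. an exposure of 1/60 s at f/2.8 with flash fired;
# 0x0110 (camera model) is filtered out as non-optical metadata.
ctx = scene_context([(0x829A, (1, 60)), (0x829D, (28, 10)),
                     (0x9209, 1), (0x0110, "DSC-T2")])
```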
Sony CyberShot DSC-T2 Touchscreen 8MP Digital Camera with Smile Detection
Information in a Digital Photo
- Exposure time, focal length, aperture, flash, ISO rating
- Date, time, time zone
- Latitude, longitude
- Voice tags, preset modes, ontology, etc.
Experiential Media Management Environment
- Event-based
- Should be able to deal with ‘multimedia’: photos, audio, video, text, information and data, …
- Searching based on events and media
- Storytelling
EMME Event Cycle
[Diagram: atomic events enter from EXIF data, image features, tags/context, and photo-stream segmentation; they are grouped, linked, and assimilated – using an event ontology and user annotations – into an Event Base, which supports event presentation/navigation, storytelling, search, and exploration.]
Using EMME (ACM MM 2009)
- Searching for photos
- Creating albums: professional, family, tourism
- Telling stories: What did I do in Beijing?
Scenario: In December 2009, I have 20,000 pictures taken in 2008. How do I (semi-automatically) select 25 to send to my mother, the uncle that I hate, my personal friend, my professional friend, …?
Contenxt = Content + Context
Context is as powerful as content, possibly more so, in understanding audio-visual information.
Examples of Photos from the Unsupervised Clusters: High Exposure Time, Small Aperture
Examples of Photos from the Unsupervised Clusters: Low Aperture (High DOF), Low FL (Wide Angle)
Examples of Photos from the Unsupervised Clusters: High Aperture (Low DOF), High FL (Telephoto)
Examples of Photos from the Unsupervised Clusters: Photos with Flash: Indoor shots
Examples of Photos from the Unsupervised Clusters: Photos with Flash: Darker Outdoors
Photos can be tagged using only EXIF!
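The cluster slides above came from unsupervised clustering over EXIF parameters. A minimal k-means sketch over hypothetical (log exposure time, f-number) features – illustrative, not the authors' exact procedure:

```python
def kmeans(points, k, iters=20):
    """Plain k-means over per-photo EXIF feature vectors.
    Centers start from the first k photos for determinism
    (a real system would use random restarts)."""
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each photo to its nearest cluster center
        labels = [min(range(k),
                      key=lambda c: sum((x - y) ** 2
                                        for x, y in zip(pt, centers[c])))
                  for pt in points]
        # move each center to the mean of its assigned photos
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return labels

# Hypothetical photos as (log2 exposure time, f-number) pairs: two
# long-exposure/small-aperture shots vs. two short-exposure/wide-aperture shots.
photos = [(-1.0, 16.0), (-5.0, 2.8), (-1.1, 15.0), (-5.2, 2.0)]
labels = kmeans(photos, k=2)
```

With only these two EXIF dimensions, the two shooting conditions already fall into separate clusters – mirroring the "high exposure time, small aperture" grouping above.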
Guess the Tags!!
Using Image Features Only:
Scenery, City Streets, Illuminations, People Posing for Photo, Wildlife.
Using Optical Parameters:
Single Person Indoors, Portraits, Party Indoors, People at Dinner.
Confusing Background!!
Predicted tags:
- Using image features only: Scenery, City Streets, People Posing Outdoors, Group Photo Indoors, Wildlife
- Using optical metadata and thumbnail features: Group Photo Indoors, Single Person Indoors, Indoor Party, Indoor Artifact, Illuminations
Automatic Annotation
- Use both content and optical context.
- How do we combine them?
- Is the optical context really useful for annotation?
- What should be the nature of the annotations? Grass, sky, …? People, animals, …?
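One simple way to combine the two signals is late fusion: a weighted average of per-tag scores from the content-based and context-based classifiers. The weight `w` and the tag scores below are hypothetical illustrations, not values from the experiments:

```python
def fuse(p_content, p_context, w=0.5):
    """Late fusion: weighted average of per-tag confidence scores from a
    content-based and a context-based classifier. w is a tuning knob;
    tags missing from one classifier score 0 there."""
    tags = set(p_content) | set(p_context)
    return {t: w * p_content.get(t, 0.0) + (1 - w) * p_context.get(t, 0.0)
            for t in tags}

# Content alone says "scenery"; optical context (flash fired, short focal
# length) pushes toward an indoor tag, and wins after fusion.
scores = fuse({"scenery": 0.7, "wildlife": 0.6},
              {"indoor party": 0.9, "scenery": 0.1})
best = max(scores, key=scores.get)
```

More elaborate schemes (learned fusion weights, joint models) are possible; the point is only that neither signal needs to be discarded.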
More on EXIF-Related Experiments for Photo Tagging
- Build models separately for point-and-shoot vs. SLR cameras, since their optical parameters vary a lot.
- Do rigorous experiments using the same dataset (NUS-WIDE or MIR Flickr) to find how content-based classifiers compare with context-based classifiers, and how much we gain by including both.
Personal-Photo-EventWeb
Singapore – Outdoor -- People
People-No Face - Outdoor
Sharing Photos
- Taking photos is (almost) zero cost. People now ‘shoot first – see later’.
- Let me share the 344 photos that I took yesterday with you: here, on Flickr, on Facebook.
- Tweeting cameras: $12.30 at Amazon.com.
This is a serious problem now. Today.
I want to share, but …
[Diagram: the photo-sharing problem – Flickr, Facebook]
Our Solution: Photo Summarization
Many TYPES of summaries to choose from:
- Time/face based
- Image-feature based
Applications:
- Sharing with friends without making them enemies
- Uploading to your favorite sites
- Selecting exemplar photos for printing
- Refreshing your memory
- Photo frames
Soon it will be available on your camera.
Technical Specifications: uses and extends the state of the art
- EXIF
- GIST features
- Faces
- Color histograms
- Affinity propagation algorithm
Performance: great! Very intuitive, very fast.
Human in the loop: fine tuning. We believe – you are the BOSS.
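Affinity propagation is the exemplar-picking step: given pairwise similarities between photos, it selects a few photos that best represent the rest. A bare-bones numpy sketch of the algorithm (Frey & Dueck's update rules), not the system's tuned implementation:

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    """Minimal affinity propagation. S is an n-by-n similarity matrix
    between photos, with exemplar 'preferences' on the diagonal;
    returns the indices of the photos chosen as exemplars."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibility: evidence that k should serve i
    A = np.zeros((n, n))  # availability: evidence that i should pick k
    for _ in range(iters):
        # responsibility: r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[np.arange(n), top]
        AS[np.arange(n), top] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), top] = S[np.arange(n), top] - second
        R = damping * R + (1 - damping) * Rnew
        # availability: a(i,k) = min(0, r(k,k) + sum of positive r(i',k))
        Rpos = np.maximum(R, 0)
        np.fill_diagonal(Rpos, R.diagonal())
        Anew = Rpos.sum(axis=0)[None, :] - Rpos
        diag = Anew.diagonal().copy()   # a(k,k) is not clamped at zero
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, diag)
        A = damping * A + (1 - damping) * Anew
    # a photo is an exemplar if its combined self-evidence is positive
    return np.where((A + R).diagonal() > 0)[0]
```

In the summarizer's setting the similarities would come from GIST, color histograms, faces, and EXIF; the diagonal preference value controls how many exemplar photos the summary keeps.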
Photo Summarization
Original Data Set
Photo-Summarization using content
Photo-Summarization using Faces
Using Contenxt to find Unique People in Photostreams from Multiple People in an Event
Using Clothing + Face Features (Contenxt):
Step 1: Detect faces across all photostreams.
Step 2: Detect clothing across all photostreams.
Step 3: Cluster clothing based on color.
Step 4: Find unique faces within each clothing cluster.
Step 5: Iterate through steps 3–4, refining the parameters, to get a unique set of people.
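Steps 3–4 can be sketched with a crude dominant-color quantizer standing in for the real clothing-color clustering – the bin count, function names, and data layout here are illustrative assumptions, not the paper's method:

```python
from collections import Counter

def dominant_color_bin(pixels, bins=4):
    """Quantize clothing-region pixels into a coarse RGB histogram and
    return the fullest bin – a crude stand-in for step 3's clustering."""
    counts = Counter((r * bins // 256, g * bins // 256, b * bins // 256)
                     for r, g, b in pixels)
    return counts.most_common(1)[0][0]

def cluster_by_clothing(people):
    """people: list of (face_id, clothing_pixels) pairs. Group faces whose
    clothing falls into the same dominant-color bin; each resulting
    cluster is then searched for unique faces (step 4)."""
    clusters = {}
    for face_id, pixels in people:
        clusters.setdefault(dominant_color_bin(pixels), []).append(face_id)
    return clusters

# Two people in red shirts end up in one cluster, one in blue in another.
red, blue = [(250, 10, 10)] * 5, [(10, 10, 250)] * 5
groups = cluster_by_clothing([("A", red), ("B", red), ("C", blue)])
```

The Contenxt point survives the simplification: clothing (context) narrows the search so that face matching (content) only has to distinguish people within a cluster.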
Clothing Cluster 1 with corresponding Faces
Unique Faces in Cluster 1: (each row is one person)
Clothing Cluster 2 with corresponding Faces
Unique Faces in Cluster 2: (each row is one person)
Clothing Cluster 3 with corresponding Faces
Unique Faces in Cluster 3: (each row is one person)
Conclusions and Future Research
- Content (data) is important for computer vision.
- Context is more important than content for solving real (and hard) problems in vision.
- Real success is only possible by using Contenxt.