Contenxt 100407

DESCRIPTION

In computer vision, context has been mostly ignored for the last two decades. We show that, in understanding images, context plays a more significant role than content.


Page 1: Contenxt 100407

© Ramesh Jain

Ramesh Jain (with Pinaki Sinha and other collaborators)

Department of Computer Science University of California, Irvine

[email protected]

Contenxt: Bridging the Semantic Gap

Page 2: Contenxt 100407

© Ramesh Jain

Football Highlight System: Automatic Segmentation

  15 College teams
  All games – 4 cameras
  30 minutes after the game

Page 3: Contenxt 100407

© Ramesh Jain

Find Mubarak Shah

Page 4: Contenxt 100407

© Ramesh Jain

Image Search: Ramesh Jain

Page 5: Contenxt 100407

© Ramesh Jain

Gives some details

Page 6: Contenxt 100407

© Ramesh Jain

Tells me who may not be the ‘Real’ one

Page 7: Contenxt 100407

© Ramesh Jain

Finds people who are my friends

Page 8: Contenxt 100407

© Ramesh Jain

Image Search: Finds activities

Page 9: Contenxt 100407

© Ramesh Jain

My current research

  EventWeb
    Connecting and accessing Events
      From Twitter, Facebook
      From Web cams, Planetary Skin, …
    Connecting environments
  Personal Media Management
    Images, Video, Text, …
  Doing Computer Vision Correctly

Page 10: Contenxt 100407

© Ramesh Jain

Computer Vision

  Computer vision is the science and technology of machines that see.

  As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images.

From Wikipedia.

Page 11: Contenxt 100407

© Ramesh Jain

How do you Search for Images?

  Use a Content-based Image retrieval engine from XYZ University?

  Use a Content Based Image Search Engine from a company?
    Is there any?
    I tried to do one in 1994 and built Virage as a result – but …
  Or do you use just a ‘text’ search engine?

Page 12: Contenxt 100407

© Ramesh Jain

Text Search Engines

  How good are text search engines in object recognition?

  Let’s look at some real working systems by searching for people here.

Page 13: Contenxt 100407

© Ramesh Jain

Disruptive Stages in Computing:1

Data (Computation): Numbers, Text, Statistics, Sensors (Video)

Page 14: Contenxt 100407

© Ramesh Jain

Computing 1: Data

  Mainframe and workstations
  Main applications:
    Scientific and engineering
    Business
  Users:
    Sophisticated
    Expected to be trained
  Dominant Technology:
    Computing

Page 15: Contenxt 100407

© Ramesh Jain

Disruptive Stages in Computing:2

Data (Computation): Numbers, Text, Statistics, Sensors (Video)
Information (Communication): Search, Specialized sources

Page 16: Contenxt 100407

© Ramesh Jain

Computing 2: Information

  PC and Internet
  Main applications:
    Information
    Communication
  Users:
    Common people in the ‘developed world’
    Easy access using keyboards
  Dominant Technology:
    Authoring tools
    Access mechanisms
    Sharing

Page 17: Contenxt 100407

© Ramesh Jain

What Next?

Disruptive Stages in Computing:3

Data (Computation): Numbers, Text, Statistics, Sensors (Video)
Information (Communication): Search, Specialized sources
Experience (Insights): Direct observation or participation

Page 18: Contenxt 100407

© Ramesh Jain

Computing 3: Experience

  Experiential devices: Mobile phones
  Main applications:
    Experience management
    Experiential communication
  Users:
    Humans
    No language issues
  Dominant Technology:
    Sensor understanding
    Vision and audio will be dominant

Page 19: Contenxt 100407

© Ramesh Jain

Page 20: Contenxt 100407

© Ramesh Jain

Page 21: Contenxt 100407

© Ramesh Jain

The Challenge

Connecting

Page 22: Contenxt 100407

© Ramesh Jain

Bits and Bytes

Alphanumeric Characters

Lists, Arrays, Documents, Images …

Transformations

Page 23: Contenxt 100407

© Ramesh Jain

Semantic Gap

The semantic gap is the lack of coincidence between the information that one can extract from the (visual) data and the interpretation that the same data have for a user in a given situation. A linguistic description is almost always contextual, whereas an image may live by itself.

Arnold Smeulders et al., “Content-Based Image Retrieval at the End of the Early Years,” IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2000.

Page 24: Contenxt 100407

© Ramesh Jain

Data Information Experience

Page 25: Contenxt 100407

© Ramesh Jain

Rorschach Test

  Psychologists use this test to examine a person’s personality characteristics and emotional functioning.

Page 26: Contenxt 100407

© Ramesh Jain

Falling Tree and George Berkeley

  “If a tree falls in a forest and no one is around to hear it, does it make a sound?”
  “No. Sound is the sensation excited in the ear when the air or other medium is set in motion.”

  Observation, Reality, and Perception.

Page 27: Contenxt 100407

© Ramesh Jain

Context

  - text surrounding word or passage: the words, phrases, or passages that come before and after a particular word or pas…

  - surrounding conditions: the circumstances or events that form the environment within which something exists or takes …

  - data transfer structure: a data structure used to transfer electronic data to and from a business management system

Page 28: Contenxt 100407

© Ramesh Jain

Content

  - amount of something in container: the amount of something contained in something else

  - subject matter: the various issues, topics, or questions dealt with in speech, discussion, or a piece of writing

  - meaning or message: the meaning or message contained in a creative work, as distinct from its appearance, form, or style

Page 29: Contenxt 100407

© Ramesh Jain

The Story of Computer Vision

The Psychology of Computer Vision (McGraw-Hill Computer Science Series), 1975.

Marvin Minsky and the summer project to solve computer vision.

Page 30: Contenxt 100407

© Ramesh Jain

D.L. Waltz, Understanding Line Drawings of Scenes with Shadows.

Page 31: Contenxt 100407

© Ramesh Jain

MSYS: A System for Reasoning about Scenes

Harry Barrow and Martin Tenenbaum

April 1976

Page 32: Contenxt 100407

© Ramesh Jain

MSYS: Relational Constraints

Page 33: Contenxt 100407

© Ramesh Jain

Relaxation labelling algorithms — a review J Kittler and J Illingworth

Image and Vision Computing Volume 3, Issue 4, November 1985, Pages 206-216

Abstract: An important research topic in image processing and image interpretation methodology is the development of methods to incorporate contextual information into the interpretation of objects. Over the last decade, relaxation labelling has been a useful and much studied approach to this problem. It is an attractive technique because it is highly parallel, involving the propagation of local information via iterative processing. The paper surveys the literature pertaining to relaxation labelling and highlights the important theoretical advances and the interesting applications for which it has proven useful.

Page 34: Contenxt 100407

© Ramesh Jain

Serge Belongie and Co-Researchers

  Semantic context (probability)
  Spatial context (position)
  Scale context (size)

Page 35: Contenxt 100407

© Ramesh Jain

Modeling the World

  Data (Semantic Web)
  Objects (Search Companies, …)
  Events (Relationships among objects and attributes)

Both Objects and Events are essential to model the world.

Page 36: Contenxt 100407

© Ramesh Jain

Events

  Take place in the real world.
  Captured using different sensory mechanisms.
    Each sensor captures only a limited aspect of the event.
  Are used to understand a Situation.

Page 37: Contenxt 100407

© Ramesh Jain

What is in an Event?

Page 38: Contenxt 100407

© Ramesh Jain

Events: diagram of events laid out over Time and a 1-dimensional Space

Page 39: Contenxt 100407

© Ramesh Jain

History: Gopher to Google

  We had the Internet.
  Lots of computers were connected to each other.
  Computers had files on them.
  We had GOPHER and other FTP mechanisms.

Page 40: Contenxt 100407

© Ramesh Jain

Tim Berners-Lee thought:

  Suppose all the information stored on computers everywhere were linked.

  Suppose I could program my computer to create a space in which anything could be linked to anything.

Others, including Vannevar Bush, had that idea earlier, but the technology was not ready.

Page 41: Contenxt 100407

© Ramesh Jain

That resulted in the Web

  DocumentWeb
    Each node is a ‘Page’ or a document.
    Pages are linked through explicit referential links.

Page 42: Contenxt 100407

© Ramesh Jain

Then Came Google, Facebook, Twitter

  Search
  Maps
  …
  Social Network
  Events
    Twitter
    Status Updates
    Eventful

Page 43: Contenxt 100407

© Ramesh Jain

Evolution of Search

  Alphanumeric structured data: Databases
  Information Retrieval
  Search
  Multimedia Search
  Real Time Search (Event Search)
    Will lead to identifying situations

Page 44: Contenxt 100407

© Ramesh Jain

Continuing the Evolution of the Web

  Consider a Web in which each node (see the sketch below)
    Is an event
    Has informational as well as experiential data
    Is connected to other nodes using
      Referential links
      Structural links
      Relational links
      Causal links
  Explicit links can be created by anybody.
  This EventWeb is connected to other Webs.
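To make the node structure concrete, here is a minimal sketch in Python. The class name, the fields, and the usage example are illustrative assumptions, not the talk's implementation; only the four link types come from the slide.

from dataclasses import dataclass, field

@dataclass
class EventNode:
    """One node of an EventWeb: an event with data, media, and typed links."""
    event_id: str
    when: str                                   # time or interval of the event
    where: tuple                                # (latitude, longitude)
    info: dict = field(default_factory=dict)    # informational data (tags, text)
    media: list = field(default_factory=list)   # experiential data (photos, video, audio)
    links: list = field(default_factory=list)   # (link_type, target event_id) pairs

    def link(self, other, link_type):
        """link_type is one of: referential, structural, relational, causal."""
        self.links.append((link_type, other.event_id))

# Hypothetical usage: two events at the same place, linked relationally.
dinner = EventNode("e1", "2009-12-24T19:00", (33.64, -117.84))
party = EventNode("e2", "2009-12-24T21:00", (33.64, -117.84))
dinner.link(party, "relational")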

Page 45: Contenxt 100407

© Ramesh Jain

Connectors

  My 5 Senses are connectors between ‘me’ and the world.
  We use our sensors (vision, audio, …) to experience the world.
  Sensors could be the interface between Cyberspace and the Real World.
  Sensors are placed for ‘detecting events’.
    How do you decide what sensors to put at any place?
    Would you put a sensor if nothing interesting ever happens at a place?

Page 46: Contenxt 100407

© Ramesh Jain

From Atomic Events to Composite Events

  Spatial and Temporal aggregation (see the sketch below)
  Assimilation
  Composition
    Using sophisticated models
    Ontological models could be used
    May include causality
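As a concrete illustration of spatial and temporal aggregation, the sketch below merges atomic events that are close in both time and place. The thresholds and the crude degrees-to-kilometres conversion are assumptions for illustration, not the sophisticated models mentioned above.

from datetime import timedelta
from math import dist

def aggregate_atomic_events(atomic_events, max_gap=timedelta(hours=6), max_km=5.0):
    """atomic_events: dicts with 'time' (datetime) and 'loc' (lat, lon), sorted by time.
    Returns composite events, each a list of atomic events."""
    composites = []
    for ev in atomic_events:
        last = composites[-1][-1] if composites else None
        close_in_time = last and ev["time"] - last["time"] <= max_gap
        # ~111 km per degree: a rough flat-earth approximation, good enough for a sketch.
        close_in_space = last and 111.0 * dist(ev["loc"], last["loc"]) <= max_km
        if close_in_time and close_in_space:
            composites[-1].append(ev)
        else:
            composites.append([ev])
    return composites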

Page 47: Contenxt 100407

© Ramesh Jain

EventWeb: diagram of linked events laid out over Time and a 1-dimensional Space

Page 48: Contenxt 100407

© Ramesh Jain

Types of Context

  Relationships among different objects, and even among their subparts, in the real world
  Environmental parameters of the digital device at the time the photo is taken
  Knowledge about the person taking the photo, and even about the person interpreting it
  The real-world situation in which the data is interpreted

Page 49: Contenxt 100407

© Ramesh Jain

Context Starts much Before the Photo is Taken

  Where
  When
  Why
  Who (Photographer)
  Which device
  Parameters of the device

Page 50: Contenxt 100407

© Ramesh Jain

Modern Cameras

  Are more than ‘Camera Obscura’: they capture an event.
  Many sensors capture scene context and store it along with intensity values.
  EXIF data is all metadata related to the Event (a minimal reader is sketched below):

Exposure Time, Aperture Diameter, Flash, Metering Mode, ISO Ratings, Focal Length
Time, Location (soon), Face
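A minimal sketch of pulling these fields out of a JPEG, assuming the Pillow library; the file name is hypothetical, and real cameras differ in which tags they actually write.

from PIL import Image
from PIL.ExifTags import TAGS

def read_capture_context(path):
    """Return the EXIF fields that describe the capture event of a JPEG."""
    wanted = {"ExposureTime", "FNumber", "Flash", "MeteringMode",
              "ISOSpeedRatings", "FocalLength", "DateTimeOriginal", "GPSInfo"}
    exif = Image.open(path)._getexif() or {}   # None when no EXIF block is present
    return {TAGS.get(tag, tag): value
            for tag, value in exif.items()
            if TAGS.get(tag) in wanted}

print(read_capture_context("IMG_0001.jpg"))   # hypothetical file name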

Page 51: Contenxt 100407

© Ramesh Jain

Sony CyberShot DSC-T2 Touchscreen 8MP Digital Camera with Smile Detection

Page 52: Contenxt 100407

© Ramesh Jain

Information in a Digital Photo

Exposure Time, Focal Length, Aperture, Flash, ISO Ratings
Date, Time, Time Zone
Latitude, Longitude
Voice Tags, Preset Modes, Ontology, etc.

Page 53: Contenxt 100407

© Ramesh Jain

Experiential Media Management Environment

  Event-based
  Should be able to deal with ‘multimedia’
    Photos
    Audio
    Video
    Text
    Information and data
    …
  Searching based on events and media
  Storytelling

Page 54: Contenxt 100407

© Ramesh Jain

EMME Event Cycle

Atomic Event Entry (photo stream segmented into events, using EXIF, features, tags/context, the event ontology, and user annotations) → Event Grouping, Linking, Assimilation → Event Base → Event Presentation/Navigation (Story Telling, Search, Explore)
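One simple way to realize the photo-stream segmentation step in this cycle is to start a new atomic event whenever the gap between consecutive capture times exceeds a threshold. The sketch below assumes exactly that rule and a two-hour threshold; it is an illustration, not the EMME implementation.

from datetime import datetime, timedelta

def segment_photo_stream(photos, gap=timedelta(hours=2)):
    """photos: (filename, capture_time) pairs sorted by capture_time.
    Returns atomic events, each a list of filenames."""
    events, current, last_time = [], [], None
    for name, t in photos:
        if last_time is not None and t - last_time > gap:
            events.append(current)
            current = []
        current.append(name)
        last_time = t
    if current:
        events.append(current)
    return events

stream = [("a.jpg", datetime(2009, 12, 24, 18, 0)),
          ("b.jpg", datetime(2009, 12, 24, 18, 5)),
          ("c.jpg", datetime(2009, 12, 25, 9, 30))]
print(segment_photo_stream(stream))   # [['a.jpg', 'b.jpg'], ['c.jpg']]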

Page 55: Contenxt 100407

© Ramesh Jain

Using EMME

  Searching for photos (ACM MM 2009)
  Creating Albums:
    Professional
    Family
    Tourism
  Telling stories
    What did I do in Beijing?
  Scenario: In December 2009, I have 20,000 pictures taken in 2008. How do I (semi-automatically) select 25 to send to
    My mother
    The uncle that I hate
    My personal friend
    My professional friend
    …

Page 56: Contenxt 100407

© Ramesh Jain

Contenxt Content Context

  Contenxt = Content + Context

  Context is as powerful as content, possibly more so, in understanding audio-visual information.

Page 57: Contenxt 100407

© Ramesh Jain

Examples of Photos from the Unsupervised Clusters: High Exposure Time, Small Aperture

Page 58: Contenxt 100407

© Ramesh Jain

Examples of Photos from the Unsupervised Clusters:

Low Aperture (High DOF), Low FL (Wide Angle)

Page 59: Contenxt 100407

© Ramesh Jain

Examples of Photos from the Unsupervised Clusters: High Aperture (Low DOF), High FL (Telephoto)

Page 60: Contenxt 100407

© Ramesh Jain

Examples of Photos from the Unsupervised Clusters: Photos with Flash: Indoor shots

Page 61: Contenxt 100407

© Ramesh Jain

Examples of Photos from the Unsupervised Clusters:

Photos with Flash: Darker Outdoors

Page 62: Contenxt 100407

© Ramesh Jain

Photos can be tagged using only EXIF!
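The unsupervised clusters on the preceding slides group photos purely by their optical parameters. A rough sketch of that idea follows, assuming scikit-learn, k-means, and EXIF values already read into numeric dicts (e.g. by the reader sketched earlier); the feature set, fallbacks, and clustering method used for the slides may well differ.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def exif_to_vector(exif):
    """Optical-context features: exposure time, f-number, focal length, ISO, flash."""
    flash_fired = int(exif.get("Flash", 0)) & 1      # low bit of the EXIF Flash tag
    return [float(exif.get("ExposureTime", 1 / 60)),
            float(exif.get("FNumber", 4.0)),
            float(exif.get("FocalLength", 35.0)),
            float(exif.get("ISOSpeedRatings", 200)),
            float(flash_fired)]

def cluster_by_optical_context(exif_records, n_clusters=5):
    """Unsupervised grouping of photos from their EXIF parameters alone."""
    X = StandardScaler().fit_transform(np.array([exif_to_vector(e) for e in exif_records]))
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)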

Page 63: Contenxt 100407

© Ramesh Jain

Guess the Tags!!

Using Image Features Only:

Scenery, City Streets, Illuminations, People Posing for Photo, Wildlife.

Using Optical Parameters:

Single Person Indoors, Portraits, Party Indoors, People at Dinner.

Page 64: Contenxt 100407

© Ramesh Jain

Guess The Tags!! (Confusing Background!!)

Predicted Tags:

Using Image Features Only:
  Scenery
  City Streets
  People Posing Outdoors
  Group Photo Indoors
  Wildlife

Using Optical Metadata and Thumbnail Features:
  Group Photo Indoors
  Single Person Indoors
  Indoor Party
  Indoor Artifact
  Illuminations

Page 65: Contenxt 100407

© Ramesh Jain

Automatic Annotation

  Use both Content and Optical Context (see the fusion sketch below)
    How do we combine them?
    Is the Optical Context really useful for annotation?
  What should be the nature of the annotations?
    Grass, sky, …
    People, animals, …
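One simple way to probe both questions is early fusion: concatenate the content and optical-context feature vectors and compare the three resulting taggers. The sketch below assumes scikit-learn and uses logistic regression as a stand-in for the actual classifiers; it also serves the content-vs-context comparison experiments proposed on the next slide.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_content_vs_context(X_img, X_exif, y, cv=5):
    """Cross-validated accuracy for one tag using content-only, context-only,
    and early-fusion (concatenated) feature sets."""
    scores = {}
    for name, X in [("content only", X_img),
                    ("context only", X_exif),
                    ("content + context", np.hstack([X_img, X_exif]))]:
        clf = LogisticRegression(max_iter=1000)
        scores[name] = cross_val_score(clf, X, y, cv=cv).mean()
    return scores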

Page 66: Contenxt 100407

© Ramesh Jain

More on Exif Related Experiments For Photo Tagging

  Build models separately for point-and-shoot vs. SLR cameras, since their optical parameters vary a lot.
  Do rigorous experiments using the same dataset (NUS-WIDE or MIR Flickr) to find out how content-based classifiers compare with context-based classifiers.
  How much do we gain by including both?

Page 67: Contenxt 100407

© Ramesh Jain

Personal-Photo-EventWeb

Page 68: Contenxt 100407

© Ramesh Jain

Singapore – Outdoor – People

Page 69: Contenxt 100407

© Ramesh Jain

People – No Face – Outdoor

Page 70: Contenxt 100407

© Ramesh Jain

Sharing Photos

  Taking photos is (almost) zero cost.
  People now ‘Shoot first – see later’.
  Let me share 344 photos that I took yesterday with you.
    Here
    On Flickr
    On Facebook
  Tweeting cameras

$12.30 at Amazon.com

This is a serious problem now. Today.

Page 71: Contenxt 100407

© Ramesh Jain

I want to share, but …

  Flickr Problem   Facebook

Page 72: Contenxt 100407

© Ramesh Jain

Our Solution: Photo Summarization

  Many TYPES of Summaries to choose from:
    Time/Face based
    Image feature based
  Applications
    Sharing with friends without making them enemies
    Uploading to your favorite sites
    Selecting exemplar photos for printing
    Refreshing your memory
    Photo frames
  Soon to be available on your camera.

Page 73: Contenxt 100407

© Ramesh Jain

Technical Specifications

  Uses and extends the state of the art:
    EXIF
    GIST Features
    Faces
    Color Histograms
    Affinity Propagation Algorithm (see the sketch below)
  Performance: Great!
    Very intuitive
    Very fast
  Human in the Loop: Fine Tuning
    We believe – You are the BOSS
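Exemplar selection with affinity propagation can be sketched as follows, assuming scikit-learn; the feature vectors stand in for the EXIF, GIST, face, and color-histogram features listed above.

import numpy as np
from sklearn.cluster import AffinityPropagation

def summarize(photo_features, photo_names):
    """Pick one exemplar photo per cluster as the summary of a photo set."""
    ap = AffinityPropagation(random_state=0).fit(np.asarray(photo_features))
    return [photo_names[i] for i in ap.cluster_centers_indices_]

A nice property of affinity propagation here is that it chooses the number of exemplars itself, so the summary length adapts to how varied the photo set is.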

Page 74: Contenxt 100407

© Ramesh Jain

Photo Summarization

Page 75: Contenxt 100407

© Ramesh Jain

Original Data Set

Page 76: Contenxt 100407

© Ramesh Jain

Photo-Summarization using content

Page 77: Contenxt 100407

© Ramesh Jain

Photo-Summarization using Faces

Page 78: Contenxt 100407

© Ramesh Jain

Using Contenxt to find Unique People in Photostreams from Multiple People in an Event

Page 79: Contenxt 100407

© Ramesh Jain

  Step 1: Detect faces across all photostreams.
  Step 2: Detect clothing across all photostreams.
  Step 3: Cluster clothing based on color.
  Step 4: Find unique faces within each clothing cluster.
  Step 5: Iterate through steps 3 and 4, refining the parameters, to get a unique set of people.

Using Clothing + Face Feature (Contenxt) – a rough sketch of steps 1–3 follows below.
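A rough sketch of steps 1–3, assuming OpenCV for face detection and a plain color histogram of the region below each face as the clothing descriptor; the actual Contenxt pipeline and its unique-face grouping (step 4) are more involved.

import cv2
import numpy as np
from sklearn.cluster import KMeans

# Step 1: an off-the-shelf face detector (Haar cascade shipped with OpenCV).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(image):
    """Return face boxes (x, y, w, h) in a BGR photo."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return face_detector.detectMultiScale(gray, 1.1, 5)

def clothing_color_histogram(image, face_box, bins=8):
    """Step 2: describe clothing as a color histogram of the region below a face
    (assumes the torso region is inside the frame)."""
    x, y, w, h = face_box
    torso = image[y + h: y + 3 * h, max(x - w // 2, 0): x + w + w // 2]
    hist = cv2.calcHist([torso], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def cluster_clothing(images_and_faces, n_clusters=3):
    """Step 3: group face detections whose clothing colors look alike."""
    feats = [clothing_color_histogram(img, box) for img, box in images_and_faces]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(feats))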

Page 80: Contenxt 100407

© Ramesh Jain

Clothing Cluster 1 with corresponding Faces

Page 81: Contenxt 100407

© Ramesh Jain

Unique Faces in Cluster 1: (each row is one person)

Page 82: Contenxt 100407

© Ramesh Jain

Clothing Cluster 2 with corresponding Faces

Page 83: Contenxt 100407

© Ramesh Jain

Unique Faces in Cluster 2: (each row is one person)

Page 84: Contenxt 100407

© Ramesh Jain

Clothing Cluster 3 with corresponding Faces

Page 85: Contenxt 100407

© Ramesh Jain

Unique Faces in Cluster 3: (each row is one person)

Page 86: Contenxt 100407

© Ramesh Jain

Conclusions and Future research

  Content (data) is important for computer vision.

  Context is more important than content for solving real (and hard) problems in vision.

  Real success is only possible by using Contenxt.

Page 87: Contenxt 100407

© Ramesh Jain

Thanks.

For more information,

[email protected]

?