
Presentation, 17 May morning keynote, Cees Snoek


22-05-13

A Rosetta Stone for Image Understanding

Cees Snoek

University of Amsterdam

The Netherlands

Euvision Technologies

The Netherlands

A classical problem

Understanding was lost from 394 CE to 1822


Rosetta Stone discovery in 1799

A decree by King Ptolemy V
•  Hieroglyphs
•  Demotic script
•  Ancient Greek

Key to decipherment in 1822

J.F. Champollion

RECOGNIZING WORDS

Understanding images

Mazloom et al., ICMR 2013


How difficult is the problem?

Human vision consumes 50% of brain power…

Van Essen, Science 1992

Visual labeling in a nutshell

Visualization by Jasper Schulte


Visual labeling by machine

[Figure: processing pipeline with the stages Encode, Reduce, Learn, Label]

International competition

NIST TRECVID Benchmark

Promote progress in video retrieval research

Open data, tasks, evaluation and innovation

http://trecvid.nist.gov/


Are we making progress?

[Plot: MediaMill team, TRECVID 2004-2012. Legend: • 1000+ others, x MediaMill team]

Performance doubled in just 3 years

Snoek & Smeulders, IEEE Computer 2010

Software licensed by Euvision Technologies


MediaMill video search engines

Learning from social-tagged images

Xirong Li et al., TMM 2009

Exploit consistency in tagging behavior of different users for visually similar images


Tag relevance

Objective tags are identified and reinforced

Based on 3.5 million images downloaded from Flickr
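The tag relevance idea above (users tend to tag visually similar images consistently) can be illustrated with a small neighbor-voting sketch. This is an assumption-laden toy, not the implementation behind the slides: the function name `tag_relevance`, the 2-D features, and the tiny collection are all made up for illustration. Each tag gets the number of votes from the k visually nearest images, minus the number of votes the tag would be expected to get from k random images.

```python
import math


def tag_relevance(query_feature, query_tags, collection, k):
    """Score each tag of one image by neighbor voting (illustrative sketch).

    collection: list of (feature_vector, set_of_tags) for other users' images.
    Returns {tag: votes from the k nearest neighbors - votes expected by chance}.
    """
    # Rank the collection by Euclidean distance to the query image.
    neighbors = sorted(
        collection, key=lambda item: math.dist(query_feature, item[0])
    )[:k]
    scores = {}
    for tag in query_tags:
        votes = sum(1 for _, tags in neighbors if tag in tags)
        # Chance level: expected count of the tag among k random images.
        tagged = sum(1 for _, tags in collection if tag in tags)
        prior = k * tagged / len(collection)
        scores[tag] = votes - prior
    return scores


# Made-up collection: 2-D "visual features" with user-provided tags.
collection = [
    ((0.0, 0.1), {"beach", "holiday"}),
    ((0.1, 0.0), {"beach", "sea"}),
    ((0.1, 0.1), {"beach", "2012"}),
    ((5.0, 5.0), {"city", "holiday"}),
    ((5.1, 5.0), {"city", "2012"}),
    ((5.0, 5.1), {"street", "holiday"}),
]
scores = tag_relevance((0.05, 0.05), {"beach", "holiday"}, collection, k=3)
print(scores)
```

In this toy collection the objective tag "beach" is reinforced (positive score), while the subjective tag "holiday" falls below chance level (negative score).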

RECOGNIZING SENTENCES

Understanding images

Mazloom et al., ICMR 2013


Human event description on web video

We analyze 13K web videos and their descriptions

People competing in a sand sculpting competition and children playing on the beach.

A woman folds and packages a scarf she has made.

Habibian et al., ICMR 2013

Human concept-vocabulary

•  Consists of 5K distinct and mostly rare concepts
•  Includes general and specialized concepts
•  It is composed of various concept types

[Bar chart: portions (in %, axis from 0 to 50) of the concept types Non Visual, Attribute, Scene, Action, Object, Animal, People]


Concepts categorized by type

Object

People

Animal

Scene

Action

Attribute

From concepts to sentences

[Diagram: Input Video → Concept Vocabulary (Concept 1, Concept 2, …, Concept K) → Train SVM → Event Models, with the example event "Attempting a board trick"]

Creating the concept vocabulary is critical

Sadanand, CVPR12; Merler, TMM12; Althoff, MM12
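The flow from input video through a concept vocabulary to an SVM event model can be sketched as follows. Everything specific here is an assumption for illustration: the slides do not say how frame-level detector scores are pooled (averaging is assumed), the three concept names and all scores are made up, and the small subgradient-descent trainer merely stands in for whatever SVM solver was actually used.

```python
def video_representation(frame_scores):
    """Average per-frame concept detector scores into one vector per video."""
    n_frames = len(frame_scores)
    return [sum(column) / n_frames for column in zip(*frame_scores)]


def event_score(w, b, x):
    """Linear decision value of an event model for one video vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Linear SVM via subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):  # label is +1 or -1
            margin = label * event_score(w, b, x)
            # Regularization step: shrink the weights slightly.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # Hinge loss is active: move towards the correct side.
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


# Hypothetical vocabulary and made-up video-level concept scores.
concepts = ["skateboard", "ramp", "saw"]
X = [[0.9, 0.8, 0.1], [0.8, 0.7, 0.2],   # "Attempting a board trick"
     [0.1, 0.2, 0.9], [0.2, 0.1, 0.8]]   # other videos
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [1 if event_score(w, b, x) > 0 else -1 for x in X]
print(dict(zip(concepts, w)), predictions)
```

Because the event model only sees the concept scores, its quality is bounded by what the vocabulary can express, which is why the choice of concepts matters so much.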


Video sentence examples

Attempting a board trick

Working on a woodworking project

Changing a vehicle tire

Are more concepts better?

In general, more is better. But a vocabulary of 500 concepts exists that outperforms all others

Mazloom et al., ICMR 2013


Results for “Landing a fish in”

A vocabulary of 100 concepts is the best performer

Informative concepts vs All concepts

The 23% most informative concepts lead to a 65% relative increase in event detection accuracy.
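Keeping only a fraction of the vocabulary can be sketched with a simple ranking heuristic. This is not the selection procedure of the cited work, which the slides do not describe; as a stand-in, each concept is scored by the gap between its mean detector score on positive and on negative videos of an event. The helper name, concept names, and scores are hypothetical.

```python
def select_informative_concepts(X, y, names, fraction):
    """Keep the top fraction of concepts, ranked by mean-score gap.

    X: video-level concept scores, y: +1 for event videos, -1 otherwise.
    """
    positives = [x for x, label in zip(X, y) if label == 1]
    negatives = [x for x, label in zip(X, y) if label != 1]

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    # Informativeness proxy: how differently a concept fires on the two classes.
    gaps = [abs(mean(positives, j) - mean(negatives, j))
            for j in range(len(names))]
    keep = max(1, round(fraction * len(names)))
    ranked = sorted(range(len(names)), key=lambda j: gaps[j], reverse=True)
    return [names[j] for j in ranked[:keep]]


# Made-up scores for a hypothetical four-concept vocabulary.
names = ["fishing rod", "water", "person", "sky"]
X = [[0.9, 0.8, 0.6, 0.5], [0.8, 0.9, 0.5, 0.6],   # event videos
     [0.1, 0.7, 0.6, 0.5], [0.2, 0.2, 0.5, 0.6]]   # other videos
y = [1, 1, -1, -1]

selected = select_informative_concepts(X, y, names, fraction=0.5)
print(selected)
```

Concepts that fire equally often on both classes, such as "person" and "sky" in this toy data, carry no information about the event and are dropped first.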


What concepts are informative?

Font size correlates with informativeness

[Word clouds for two events: Wedding Ceremony, Landing a Fish]

Visual translation

Represent images and text in a unified semantic space

[Diagram: Concept Detectors (Textual) and Concept Detectors (Visual) map text and images onto the concepts C1, C2, …, Cn of a shared Semantic Space]

Example text: "The 18th-largest country in the world in terms of area at 1,648,195 Iran has a population of around 75 million. It is a country of particular geo.."
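Once images and text are both described by scores over the same concepts C1, …, Cn, translation reduces to comparing vectors in that shared space. The sketch below assumes cosine similarity as the comparison, which the slides do not specify; the document names and all scores are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two concept-score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_texts(visual_vector, text_vectors):
    """Order text items by similarity to a visual item in the shared space."""
    return sorted(
        text_vectors,
        key=lambda name: cosine(visual_vector, text_vectors[name]),
        reverse=True,
    )


# Made-up scores over three shared concepts (C1, C2, C3).
video = [0.9, 0.1, 0.0]            # output of visual concept detectors
texts = {                          # output of textual concept detectors
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
}

ranking = rank_texts(video, texts)
print(ranking)
```

The same ranking function works in the other direction (text query, visual results), since both modalities live in one space.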


Example: query by a video

Video translation

Summary of most likely translations

Habibian et al., submitted


Conclusion

AI progress and human descriptions on the web act as a ‘Rosetta Stone’ for image understanding.

Automatic metadata generation jumps from words to sentences.

www.ceessnoek.info