
Page 1: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Giorgos Kordopatis-Zilos1, Adrian Popescu2, Symeon Papadopoulos1 and Yiannis Kompatsiaris1

1 Information Technologies Institute (ITI), CERTH, Greece

2 CEA LIST, 91190 Gif-sur-Yvette, France

MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

Page 2: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Summary

#2

Tag-based location estimation (1 run)
• Built upon the scheme of our 2015 participation [1] (Kordopatis-Zilos et al., MediaEval 2015)
• Based on a refined probabilistic Language Model

Visual-based location estimation (1 run)
• Extract PCA-reduced VGG features to compute image similarities
• Geospatial clustering scheme of the most visually similar images

Hybrid location estimation (3 runs)
• Combination of the textual and visual approaches using a set of rules

Training sets
• Training set released by the organisers (≈4.7M geotagged items)
• YFCC dataset, excl. images from users in the test set (≈40M geotagged items)
• External data derived from gazetteers, i.e. Geonames and OpenStreetMap

Page 3: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Tag-based location estimation

#3

• Processing steps of the approach
  – Offline: language model construction
  – Online: location estimation

[Figure: overview diagram of the offline/online pipeline, with OpenStreetMap as an external data source]

Page 4: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Pre-processing

• Tags and titles of the training set items are processed

• Apply
  – URL decoding
  – lowercase transformation
  – tokenization

• Remove
  – accents
  – symbols
  – punctuation

• Multi-word tags are split into their individual terms, which are also added to the item's term set

• Numeric terms and terms with fewer than three characters are discarded
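A minimal Python sketch of this pre-processing pipeline (illustrative only; the function name and exact tokenization rules are assumptions, not the authors' released code):

```python
import re
import unicodedata
from urllib.parse import unquote

def extract_terms(tags_and_title):
    """Sketch of the term extraction described above (names and exact rules are illustrative)."""
    terms = set()
    for raw in tags_and_title:
        text = unquote(raw).lower()                    # URL decoding + lowercase transformation
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))  # remove accents
        tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]  # tokenize, drop symbols/punctuation
        if len(tokens) > 1:
            terms.add("".join(tokens))                 # keep the multi-word tag as one term ...
        for tok in tokens:                             # ... and also its individual terms
            if not tok.isdigit() and len(tok) >= 3:    # discard numerics and terms shorter than 3 chars
                terms.add(tok)
    return terms

# e.g. extract_terms(["Eiffel%20Tower", "paris"]) -> {"eiffeltower", "eiffel", "tower", "paris"}
```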

#4

Page 5: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Language Model (LM)

• LM-based estimation
  – the Most Likely Cell (mlc) is the cell with the highest probability and is used to produce the estimation

  $mlc_j = \arg\max_i \sum_{k=1}^{|T_j|} p(t_k \mid c_i) \cdot w(t_k)$

Inspired from [3]: (Popescu, MediaEval 2013)

#5

• LM generation scheme
  – divide the earth's surface into rectangular cells with a side length of 0.01°
  – calculate term-cell probabilities: $p(t \mid c) = N_u / N_t$
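A compact sketch of how such a language model could be built and queried, assuming N_u counts the users of term t inside cell c and N_t the users of t overall (data layout and names are illustrative, not the authors' implementation):

```python
from collections import defaultdict
from math import floor

CELL = 0.01  # cell side length in degrees

def cell_of(lat, lon, side=CELL):
    """Map a coordinate to its rectangular grid cell id."""
    return (floor(lat / side), floor(lon / side))

def build_language_model(train_items, side=CELL):
    """train_items: iterable of (user_id, lat, lon, terms). Returns p[t][cell] = N_u / N_t."""
    users_per_term_cell = defaultdict(lambda: defaultdict(set))
    users_per_term = defaultdict(set)
    for user, lat, lon, terms in train_items:
        c = cell_of(lat, lon, side)
        for t in terms:
            users_per_term_cell[t][c].add(user)
            users_per_term[t].add(user)
    return {t: {c: len(us) / len(users_per_term[t]) for c, us in cells.items()}
            for t, cells in users_per_term_cell.items()}

def most_likely_cell(item_terms, lm, weight=lambda t: 1.0):
    """Score each cell by the weighted sum of term-cell probabilities and return the best one."""
    scores = defaultdict(float)
    for t in item_terms:
        for c, p in lm.get(t, {}).items():
            scores[c] += p * weight(t)
    return max(scores, key=scores.get) if scores else None
```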

Page 6: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Feature Selection and Weighting

#6

Feature Selection

• Calculate term locality using a grid of 0.01°×0.01°

• When a user uses a given term, he/she is assigned to the entire cell neighborhood instead of a unique cell as in [1]

$l(t) = \frac{N_t \cdot \sum_{c \in C} \sum_{u \in U_{t,c}} |\{u' \mid u' \in U_{t,c},\ u' \neq u\}|}{N_t^2}$

• Terms with a non-zero locality score form the term set T

Feature Weighting

• Locality weight function, based on the term's relative position in T

• Spatial Entropy weight function, a Gaussian function based on the term's spatial entropy

• Linear combination of the two weights
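An illustrative sketch of the locality computation under this reading of the formula (the cell-neighborhood assignment is simplified to the 3×3 block of surrounding cells; names and details are assumptions, not the authors' code):

```python
from collections import defaultdict
from math import floor

def neighborhood(cell):
    """A cell and its 8 surrounding cells (the 'entire cell neighborhood')."""
    i, j = cell
    return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def locality(term_usages, side=0.01):
    """term_usages: list of (user_id, lat, lon) for a single term t. Returns l(t)."""
    users_per_cell = defaultdict(set)
    all_users = set()
    for user, lat, lon in term_usages:
        all_users.add(user)
        base = (floor(lat / side), floor(lon / side))
        for c in neighborhood(base):          # assign the user to the whole cell neighborhood
            users_per_cell[c].add(user)
    n_t = len(all_users)
    if n_t == 0:
        return 0.0
    # inner double sum: ordered pairs of distinct users of t within the same cell
    pairs = sum(len(us) * (len(us) - 1) for us in users_per_cell.values())
    return n_t * pairs / (n_t ** 2)
```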

Page 7: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Refinements

#7

• Multiple Grids
  – Built an additional LM using a finer grid (cell side length of 0.001°)
  – Combine the MLCs of the individual language models

• Similarity search [5] (Van Laere et al., ICMR 2011)
  – Determine the k_t most similar training images within the MLC
  – Their center-of-gravity is the final location estimate

From [2]: (Kordopatis-Zilos et al., PAISI 2015)
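A minimal sketch of the similarity-search refinement (k_t and the Jaccard similarity on term sets are assumptions in the spirit of Van Laere et al. [5], not the exact configuration used in the runs):

```python
def refine_with_similarity_search(query_terms, mlc_items, k_t=5):
    """mlc_items: list of (terms, lat, lon) for training images located inside the MLC.
    Returns the center-of-gravity of the k_t most similar items."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 0.0

    ranked = sorted(mlc_items, key=lambda it: jaccard(query_terms, it[0]), reverse=True)[:k_t]
    if not ranked:
        return None
    lat = sum(it[1] for it in ranked) / len(ranked)
    lon = sum(it[2] for it in ranked) / len(ranked)
    return (lat, lon)
```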

Page 8: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Visual-based location estimation

#8

• Main Objectives
  – Ensure that the visual features are generic and transferable
  – Provide a compact representation of the features

• Model building
  – CNN features extracted by fine-tuning the VGG model [4]
  – Training: ~5K Points Of Interest (POIs), over 7M Flickr images collected using queries with:
      – the POI name and a radius of 5 km around its coordinates
      – the POI name and the associated city name
  – Compressed outputs of the fc7 layer (4096d) to 128d using PCA, learned on a subset of 250,000 training images

• Similarity Search based on the PCA-reduced CNN features
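A sketch of the feature compression step using scikit-learn (the fine-tuned VGG extraction itself is omitted; file names are hypothetical and whitening is a common choice rather than something stated on the slide):

```python
import numpy as np
from sklearn.decomposition import PCA

# fc7_subset: (250000, 4096) fc7 activations from a subset of the training images (assumed precomputed)
# fc7_all:    (N, 4096) fc7 activations of all images to index
fc7_subset = np.load("fc7_subset.npy")   # hypothetical file names
fc7_all = np.load("fc7_all.npy")

pca = PCA(n_components=128, whiten=True)  # learn the projection on the 250K-image subset
pca.fit(fc7_subset)

features_128d = pca.transform(fc7_all)    # compact 128-d descriptors
features_128d /= np.linalg.norm(features_128d, axis=1, keepdims=True)  # L2-normalise for similarity search
```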

Page 9: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Visual-based location estimation

#9

Location Estimation

• Geospatial clustering of 𝑘𝑣 = 20 visually most similar images

• The largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate

Visual Confidence

• Confidence metric for the visual estimation is based on the size of the largest cluster

$conf_v(i) = \max\left(\frac{n_i - n_t}{k_v - n_t},\ 0\right)$

where n_i is the number of neighbors in the largest cluster of image i and n_t is a configuration parameter controlling the "strictness" of the confidence score
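An illustrative sketch of the clustering-based estimate and confidence score (the clustering criterion, grouping neighbors within a fixed distance of a seed, and the default n_t are assumptions; the slides do not specify the exact algorithm):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def visual_estimate(neighbors, k_v=20, n_t=4, radius_km=1.0):
    """neighbors: (lat, lon) of the k_v visually most similar images.
    Greedily groups points lying within radius_km of a seed (assumed clustering),
    then returns the centroid of the largest cluster and conf_v."""
    remaining = list(neighbors)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster = [seed] + [p for p in remaining if haversine_km(seed, p) <= radius_km]
        remaining = [p for p in remaining if p not in cluster]
        clusters.append(cluster)
    largest = max(clusters, key=len)          # first one wins ties, as in the slides
    n_i = len(largest)
    centroid = (sum(p[0] for p in largest) / n_i, sum(p[1] for p in largest) / n_i)
    conf_v = max((n_i - n_t) / (k_v - n_t), 0)
    return centroid, conf_v
```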

Page 10: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Hybrid-based location estimation

• A set of rules to determine the source of estimation between the text and visual approaches

• The visual estimation is chosen when:

→ no estimation could be produced by the text approach

→ the visual estimation falls inside the borders of the mlc

→ the comparison of the confidence scores conf_v and conf_t [1] favours the visual estimation

• Otherwise the text estimation is selected
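A sketch of such a rule set (the ordering of the rules and the strict confidence comparison are assumptions based on the bullets above and [1]):

```python
def hybrid_estimate(text_est, visual_est, conf_t, conf_v, mlc_bounds):
    """text_est / visual_est: (lat, lon) or None; mlc_bounds: (lat_min, lat_max, lon_min, lon_max) of the mlc."""
    def inside(point, bounds):
        lat, lon = point
        lat_min, lat_max, lon_min, lon_max = bounds
        return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

    if text_est is None:                      # rule 1: no text estimate available
        return visual_est
    if visual_est is not None:
        if inside(visual_est, mlc_bounds):    # rule 2: visual estimate agrees with the mlc
            return visual_est
        if conf_v > conf_t:                   # rule 3: confidence comparison (assumed strict '>')
            return visual_est
    return text_est                           # otherwise fall back to the text estimate
```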

#10

Page 11: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Runs and Results

#11

RUN-1: Tag-based location estimation + released training set

RUN-2: Visual-based location estimation + released training set

RUN-3: Hybrid location estimation + released training set

RUN-4: Hybrid location estimation + YFCC dataset

RUN-5: Hybrid location estimation + YFCC + External data

RUN-E: Visual-based location estimation + entire YFCC dataset

[Results table: Images]

Page 12: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Runs and Results

#12

RUN-1: Tag-based location estimation + released training set

RUN-2: Visual-based location estimation + released training set

RUN-3: Hybrid location estimation + released training set

RUN-4: Hybrid location estimation + YFCC dataset

RUN-5: Hybrid location estimation + YFCC + External data

[Results table: Videos]

Page 13: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

References

#13

[1] G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. SocialSensor at MediaEval Placing Task 2015. In MediaEval 2015 Placing Task, 2015.

[2] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics (PAISI), pages 21–40, 2015.

[3] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.

[4] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[5] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. ICMR ’11, pages 48:1–48:8, New York, NY, USA, 2011. ACM.

Page 14: MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Thank you!

#14

Data/Code:

– https://github.com/MKLab-ITI/multimedia-geotagging/

Get in touch:

– Giorgos Kordopatis-Zilos: [email protected]

– Symeon Papadopoulos: [email protected] / @sympap

With the support of: