1
Proximate Sensing: Inferring What-Is-Where From Georeferenced Photo Collections
Daniel Leung and Shawn Newsam
Electrical Engineering & Computer Science
University of California at Merced
CVPR 2010
June 17th, 2010
2
Remote sensing: using overhead images of distant scenes to derive geographic information.
Figures: satellite image (Google Maps); National Land Cover Database (USGS).
3
Proximate sensing: use ground-level images of close-by objects and scenes.
Land Cover Map 2000 (UK Centre for Ecology & Hydrology)
?
community-contributed photos (Geograph Britain and Ireland project)
Study area: 100x100 km region in southeastern UK (region TQ in National Grid)
4
community-contributed photos (Geograph Britain and Ireland project)
Proximate sensing: use ground-level images of close-by objects and scenes.
5
Proximate Sensing
• We conjecture that the visual content of georeferenced images can be used to derive maps of what-is-where on the surface of the earth.
• Motivation:
– Such collections are becoming increasingly available, e.g. Flickr (100+ million geotagged images), Panoramio, Picasa, Geograph, TrekEarth.
– Derive geographic information not possible through other means, e.g. land-use classification.
– Exciting new application of CV that not only provides another context to apply/revisit standard techniques but stands to motivate novel problems.
6
Proximate Sensing: Context
• Volunteered Geographic Information (Wikipedia):
– VGI is the harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals (Goodchild, 2007).
– Goodchild, M. 2007. Citizens as Sensors: The World of Volunteered Geography.
Diagram labels: proximate sensing, citizen science, volunteered geographic information
7
VGI: Flickr
• 103,679,986 geotagged items
• 2.8 million things geotagged this month
8
VGI: Geograph
• “The Geograph Britain and Ireland project aims to collect geographically representative photographs and information for every square kilometre of Great Britain and Ireland, and you can be part of it.”
• 9,973 users have contributed 1,897,042 images covering 255,904 grid squares, or 77.1% of the total.
Example photo: “Railway bridge crossing R. Rother. This is now a dismantled railway, further east it becomes the Kent & East Sussex Railway.”
9
Objective
• Eventual goal is to use the visual content of georeferenced photos to produce land use/cover maps.
• Initial focus on simpler problem of binary classification into developed and undeveloped regions.
10
Related Work
• Other researchers have leveraged location information in georeferenced photo collections:
– To annotate novel images [Quack et al., CIVR 2008; Moxley et al., MIR 2008].
– To geolocate novel images [Hays and Efros, CVPR 2008].
– To organize the collections themselves [Crandall et al., WWW 2009].
• However, ours is the first work (to the best of our knowledge) to use the collections to infer what-is-where on the surface of the earth on a large scale.
11
Overview
Pipeline diagram:
Training: training images → label images → feature extraction → train classifier
Mapping: target images → feature extraction → classify target images → aggregate labels in 1x1 km tiles → fraction developed map, binary classification map
12
Ground Truth (1)
• Land Cover Map 2000 (UK Centre for Ecology & Hydrology)
LCM AC 10: Oceanic Seas
LCM AC 8: Standing open water
LCM AC 4: Improved grassland
LCM AC 7: Built up areas and gardens
LCM AC 3: Arable and horticulture
LCM AC 1: Broad-leaved / mixed woodland
LCM AC 9: Coastal
LCM AC 2: Coniferous woodland
LCM AC 5: Semi-natural grass
LCM AC 6: Mountain, heath, bog
13
Ground Truth (2)
• Aggregate 10 land cover classes into 2 superclasses:
– Developed: LCM AC 7: Built up areas and gardens
– Undeveloped: other 9 classes
• Derive 2 ground truth maps:
– Fraction map: percent developed for each 1x1 km tile.
– Binary classification map: apply 50% threshold to fraction map.
Ground truth fraction map indicating percent developed for each 1x1 km tile.
Ground truth binary classification map indicating tiles labelled as developed (white) or undeveloped (black).
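The two ground truth maps can be sketched for a single tile as follows. This is a minimal illustration, not the authors' code: it assumes a tile is given as a list of per-cell LCM aggregate class codes, the function names and the toy tile are hypothetical, and treating a tile at exactly 50% as developed is an assumption.

```python
DEVELOPED_CLASS = 7  # LCM AC 7: Built up areas and gardens


def fraction_developed(tile_classes):
    """Fraction of land cover cells in one tile that fall in the developed superclass."""
    if not tile_classes:
        raise ValueError("tile has no land cover cells")
    developed = sum(1 for c in tile_classes if c == DEVELOPED_CLASS)
    return developed / len(tile_classes)


def binary_label(fraction, threshold=0.5):
    """Apply the 50% threshold to a fraction map value."""
    # Assumption: a tile sitting exactly on the threshold counts as developed.
    return "developed" if fraction >= threshold else "undeveloped"


# Hypothetical tile with 8 cells, 5 of them built up.
toy_tile = [7, 7, 3, 4, 7, 1, 7, 7]
frac = fraction_developed(toy_tile)
label = binary_label(frac)
```

Here `frac` is the tile's value in the fraction map and `label` its value in the binary classification map.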
14
Datasets (1)
• Downloaded 920K Flickr images for the TQ region.
• Distribution for 1x1 km tiles shown to left (log10 scale).
• 5,420 tiles contain no Flickr images.
• 4,580 tiles contain an average of 200, a median of 10, and a maximum of 53,840 images.
Flickr
15
Datasets (2)
• Downloaded 120K images from the Geograph Britain and Ireland project.
• Distribution for 1x1 km tiles shown to left (log10 scale).
• Only 614 tiles without images.
• 9,386 tiles contain an average of 13, a median of 5, and a maximum of 1,458 images.
Geograph
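The per-tile counts reported on the two Datasets slides could be computed along these lines. This is a sketch under stated assumptions: photo positions are taken to be metres east and north of the study region's south-west corner, and `tile_index`, `tile_statistics`, and the sample positions are hypothetical names and data, not part of the original work.

```python
from collections import Counter
from statistics import mean, median

TILE_SIZE_M = 1000        # 1x1 km tiles
REGION_SIZE_M = 100_000   # 100x100 km study area


def tile_index(easting_m, northing_m):
    """Map a position inside the region to its (column, row) tile."""
    if not (0 <= easting_m < REGION_SIZE_M and 0 <= northing_m < REGION_SIZE_M):
        raise ValueError("position outside study area")
    return int(easting_m // TILE_SIZE_M), int(northing_m // TILE_SIZE_M)


def tile_statistics(positions):
    """Count photos per tile and summarise the occupied tiles."""
    counts = Counter(tile_index(e, n) for e, n in positions)
    total_tiles = (REGION_SIZE_M // TILE_SIZE_M) ** 2
    per_tile = list(counts.values())
    return {
        "empty_tiles": total_tiles - len(counts),
        "occupied_tiles": len(counts),
        "mean": mean(per_tile),
        "median": median(per_tile),
        "max": max(per_tile),
    }


# Hypothetical photo positions: two in tile (0, 0), one in (1, 0), one in (99, 99).
photos = [(500, 500), (900, 100), (1500, 500), (99_999, 99_999)]
stats = tile_statistics(photos)
```

The region holds 100 x 100 = 10,000 tiles, which matches the slide totals (5,420 + 4,580 for Flickr, 614 + 9,386 for Geograph).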
16
Image Features
• Extract simple five-dimensional edge histogram features for each image.
• Motivated by the observation that images of developed scenes typically have a higher proportion of horizontal and vertical edges than images of undeveloped scenes.
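One way such a five-bin edge histogram might look is sketched below. The slides do not define the bins, so everything here is an assumption: the bins (vertical, horizontal, 45°, 135°, non-directional), the 2x2 filters in the style of the MPEG-7 edge histogram descriptor, the non-overlapping blocks, and the threshold value. The function name `edge_histogram` is hypothetical, and this is not the descriptor implementation used in the original work.

```python
import math

SQRT2 = math.sqrt(2)

# Assumed 2x2 filters, applied to a block laid out as (top-left, top-right,
# bottom-left, bottom-right). The bin order below defines the feature vector.
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diagonal_45":     (SQRT2, 0, 0, -SQRT2),
    "diagonal_135":    (0, SQRT2, -SQRT2, 0),
    "non_directional": (2, -2, -2, 2),
}
BIN_NAMES = list(FILTERS)


def edge_histogram(gray, threshold=50.0):
    """Return a 5-D histogram of dominant edge types over 2x2 blocks.

    `gray` is a list of rows of grayscale intensities. Blocks whose strongest
    filter response does not exceed `threshold` are treated as edge-free.
    """
    rows, cols = len(gray), len(gray[0])
    counts = [0] * len(BIN_NAMES)
    for r in range(0, rows - 1, 2):
        for c in range(0, cols - 1, 2):
            block = (gray[r][c], gray[r][c + 1], gray[r + 1][c], gray[r + 1][c + 1])
            responses = [
                abs(sum(w * p for w, p in zip(FILTERS[name], block)))
                for name in BIN_NAMES
            ]
            strongest = max(responses)
            if strongest > threshold:
                counts[responses.index(strongest)] += 1
    total = sum(counts)
    # Normalise over blocks that contain an edge; all zeros if none do.
    return [n / total for n in counts] if total else [0.0] * len(BIN_NAMES)


vertical_stripes = [[0, 255, 0, 255] for _ in range(4)]
horizontal_stripes = [[0] * 4, [255] * 4, [0] * 4, [255] * 4]
flat = [[128] * 4 for _ in range(4)]

hist_vertical = edge_histogram(vertical_stripes)
hist_horizontal = edge_histogram(horizontal_stripes)
hist_flat = edge_histogram(flat)
```

Under these assumptions, an image dominated by building outlines would put most of its mass in the first two bins, which is the cue the slide describes.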
17
Image Classification
• Perform image-level binary classification:
– Developed.
– Undeveloped.
• SVM classifier with Gaussian RBF kernel, five-fold cross validation, and grid search for optimal parameter selection.
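A classifier set-up of this kind can be sketched with scikit-learn. The slides do not say which toolkit or parameter ranges were used, so the library choice, the `C` and `gamma` grids, and the synthetic 5-D feature vectors below are all assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5-D feature vectors for two well-separated classes.
X_developed = rng.normal(loc=0.7, scale=0.1, size=(40, 5))
X_undeveloped = rng.normal(loc=0.3, scale=0.1, size=(40, 5))
X = np.vstack([X_developed, X_undeveloped])
y = np.array([1] * 40 + [0] * 40)  # 1 = developed, 0 = undeveloped

# Assumed search grid; the original parameter ranges are not given.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

# Five-fold cross validation over the grid, using an RBF-kernel SVM.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
predictions = search.predict(X)  # uses the model refit with the best parameters
```

`search.predict` would then be applied to the feature vectors of the target images to obtain per-image developed/undeveloped labels.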
18
Experiments (1)
Pipeline diagram:
Training: training images → label images → feature extraction → train classifier
Mapping: target images → feature extraction → classify target images → aggregate labels in 1x1 km tiles → fraction developed map, binary classification map
19
Experiments (2)
• Fraction developed map: the fraction of images classified as developed in each tile.
• Binary classification map: threshold applied to fraction map.
• Explore two types of thresholds:
– Fixed at 0.5.
– Adaptive so that 38.9% of the tiles are labelled as developed (this represents prior knowledge on the distribution of developed vs. undeveloped regions).
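The aggregation and the two thresholds can be sketched as follows. This is an illustrative reading, not the authors' implementation: the adaptive threshold is realised by ranking tiles and labelling the top 38.9% as developed, ties and rounding are handled in an assumed way, and all function names and the toy data are hypothetical.

```python
def fraction_map(tile_predictions):
    """tile_predictions maps a tile to its list of 0/1 image labels (1 = developed)."""
    return {
        tile: sum(labels) / len(labels)
        for tile, labels in tile_predictions.items()
        if labels  # tiles without images get no prediction
    }


def fixed_threshold_map(fractions, threshold=0.5):
    # Assumption: a tile exactly at the threshold counts as developed.
    return {tile: f >= threshold for tile, f in fractions.items()}


def adaptive_threshold_map(fractions, developed_share=0.389):
    """Label the highest-ranked `developed_share` of tiles as developed."""
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    n_developed = round(developed_share * len(ranked))
    developed = set(ranked[:n_developed])
    return {tile: tile in developed for tile in fractions}


# Hypothetical data: tile i holds 10 images, i of them classified as developed.
toy_predictions = {i: [1] * i + [0] * (10 - i) for i in range(10)}
fractions = fraction_map(toy_predictions)
fixed = fixed_threshold_map(fractions)
adaptive = adaptive_threshold_map(fractions)
```

With these ten toy tiles the fixed threshold marks five tiles as developed, while the adaptive rule marks four (38.9% of ten, rounded).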
20
Experiments (3)
• Results are qualitatively evaluated by visually comparing predicted maps with ground truth maps.
• Results are quantitatively evaluated using ground truth:
– Binary classification: number of tiles with same label.
– Fraction developed: correlation coefficient over tiles. Also, mean absolute difference (MAD) and root mean squared difference (RMSD).
• Quantitative results computed over 4,553 tiles for which there are both Flickr and Geograph images.
– 38.9% of these tiles are developed in the ground truth, so the chance binary classification rate is 61.1%, achievable by labelling all tiles as undeveloped.
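The quantitative measures can be sketched in a few lines. The slides say only "correlation coefficient", so the use of Pearson's coefficient here is an assumption, and the function names and toy values are hypothetical.

```python
import math


def overall_rate(predicted, truth):
    """Share of tiles whose predicted binary label matches the ground truth."""
    matches = sum(1 for p, t in zip(predicted, truth) if p == t)
    return matches / len(truth)


def chance_rate(truth):
    """Rate obtained by giving every tile the majority ground truth label."""
    developed_share = sum(truth) / len(truth)
    return max(developed_share, 1 - developed_share)


def pearson(x, y):
    # Assumption: Pearson correlation between predicted and true fractions.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mad(x, y):
    """Mean absolute difference between two fraction maps."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmsd(x, y):
    """Root mean squared difference between two fraction maps."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# 38.9% developed tiles gives the 61.1% chance rate quoted on the slide.
truth_labels = [True] * 389 + [False] * 611
chance = chance_rate(truth_labels)

# Hypothetical fraction values for three tiles.
truth_fraction = [0.0, 0.5, 1.0]
predicted_fraction = [0.1, 0.5, 0.9]
rho = pearson(predicted_fraction, truth_fraction)
mean_abs = mad(predicted_fraction, truth_fraction)
root_mean_sq = rmsd(predicted_fraction, truth_fraction)
```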
21
Experiments (4)
• Manual vs. weakly-supervised labelling of training set.
• Effect of photographer intent.
• Relative importance of training vs. target set.
• Filtering out non-informative images.
• Training set size.
• Training set quality.
22
Results—Manually Labelled Training Set (1)
• Training set contains 2,740 Flickr images which have been manually labelled as depicting a scene that is developed or undeveloped.
• Developed: containing constructed materials such as those used in houses, buildings, etc.
23
Results—Manually Labelled Training Set (2)
Figure: ground truth maps and maps generated using Flickr images.
24
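Results—Manually Labelled Training Set (4)
Binary maps: overall and average classification rates (%). Fraction maps: correlation coefficient, MAD, RMSD.
Training Set | Target Set | Training Set Size (fraction developed) | Overall Class. Rate, Fixed Thresh. (%) | Overall Class. Rate, Adaptive Thresh. (%) | Avg. Class. Rate, Fixed Thresh. (%) | Avg. Class. Rate, Adaptive Thresh. (%) | Corr. Coef. | MAD | RMSD
Manual (Flickr) | Flickr | 2740 (0.51) | 66.4 | 64.9 | 68.8 | 63 | 0.374 | 0.287 | 0.383
Value in parentheses: fraction of images labelled as developed in the training set.
• Performance is better than chance (61.1%).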
25
Results—Weakly-Supervised Training (1)
• Labelled training set constructed in fully automated fashion:
– Select 2 images at random from tiles with 4 or more images.
– Label them with the majority label of the tile in the ground truth map.
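The weak labelling rule can be sketched as follows. The data structures (a dict of image ids per tile and a dict of ground truth tile labels), the function name, and the fixed random seed are hypothetical choices made for illustration.

```python
import random


def build_weak_training_set(tile_images, tile_is_developed,
                            per_tile=2, min_images=4, seed=0):
    """Sample images from sufficiently populated tiles and give them the tile's label.

    tile_images: tile -> list of image ids located in that tile.
    tile_is_developed: tile -> ground truth binary label of the tile.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    training = []
    for tile in sorted(tile_images):
        images = tile_images[tile]
        if len(images) < min_images:
            continue  # skip tiles with fewer than 4 images
        for image_id in rng.sample(images, per_tile):
            training.append((image_id, tile_is_developed[tile]))
    return training


# Hypothetical tiles: the middle one has only 3 images and is skipped.
toy_images = {
    (0, 0): ["a", "b", "c", "d"],
    (0, 1): ["e", "f", "g"],
    (1, 0): ["h", "i", "j", "k", "l"],
}
toy_truth = {(0, 0): True, (0, 1): False, (1, 0): False}
weak_set = build_weak_training_set(toy_images, toy_truth)
```

No image is inspected by hand: each sampled image simply inherits the label of the tile it falls in.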
26
Results—Weakly-Supervised Training (2)
Binary maps: overall and average classification rates (%). Fraction maps: correlation coefficient, MAD, RMSD.
Training Set | Target Set | Training Set Size (fraction developed) | Overall Class. Rate, Fixed Thresh. (%) | Overall Class. Rate, Adaptive Thresh. (%) | Avg. Class. Rate, Fixed Thresh. (%) | Avg. Class. Rate, Adaptive Thresh. (%) | Corr. Coef. | MAD | RMSD
Manual (Flickr) | Flickr | 2740 (0.51) | 66.4 | 64.9 | 68.8 | 63 | 0.374 | 0.287 | 0.383
Weakly (Flickr) | Flickr | 5872 (0.52) | 67.2 | 66.9 | 68.7 | 65.2 | 0.380 | 0.279 | 0.373
• Weakly-labelled training set outperforms manually-labelled one.
– Suggests training sets can be generated from regions for which maps exist and then used to train classifiers for mapping unmapped regions.
27
Results—Photographer Intent (1)
• Compare Flickr vs. Geograph results.
28
Results—Photographer Intent (2)
Figure: ground truth maps, maps generated using Flickr images, and maps generated using Geograph images.
29
Results—Photographer Intent (4)
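Binary maps: overall and average classification rates (%). Fraction maps: correlation coefficient, MAD, RMSD.
Training Set | Target Set | Training Set Size (fraction developed) | Overall Class. Rate, Fixed Thresh. (%) | Overall Class. Rate, Adaptive Thresh. (%) | Avg. Class. Rate, Fixed Thresh. (%) | Avg. Class. Rate, Adaptive Thresh. (%) | Corr. Coef. | MAD | RMSD
Flickr | Flickr | 5872 (0.52) | 67.2 | 66.9 | 68.7 | 65.2 | 0.380 | 0.279 | 0.373
Geograph | Geograph | 10576 (0.26) | 68.2 | 74.0 | 60.8 | 72.6 | 0.520 | 0.271 | 0.358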
• Photographer intent is a significant factor.
30
Results—Importance of Training vs. Target Set (1)
• Geograph training+target set outperforms Flickr training+target set.
• Investigate whether improvement is due to training or target set.
• Training and target sets from different collections.
31
Results—Importance of Training vs. Target Set (2)
Binary maps: overall and average classification rates (%). Fraction maps: correlation coefficient, MAD, RMSD.
Training Set | Target Set | Training Set Size (fraction developed) | Overall Class. Rate, Fixed Thresh. (%) | Overall Class. Rate, Adaptive Thresh. (%) | Avg. Class. Rate, Fixed Thresh. (%) | Avg. Class. Rate, Adaptive Thresh. (%) | Corr. Coef. | MAD | RMSD
Flickr_good | Flickr | 5070 (0.49) | 67.0 | 68.1 | 67.4 | 66.6 | 0.329 | 0.285 | 0.374
Geograph_good | Flickr | 5603 (0.47) | 60.7 | 68.3 | 53.8 | 66.6 | 0.330 | 0.294 | 0.381
Geograph_good | Geograph | 5603 (0.47) | 74.2 | 74.6 | 71.5 | 73.1 | 0.551 | 0.231 | 0.308
Flickr_good | Geograph | 5070 (0.49) | 69.9 | 73.1 | 71.5 | 71.7 | 0.496 | 0.254 | 0.331
• Photographer intent is more important for target than training set.
32
Results—Filtering Out Non-informative Images (1)
• Investigate whether removing images with faces improves results.
• Motivation: photographs of people are less likely to be geographically informative, especially close-in portraits.
33
Results—Filtering Out Non-informative Images (2)
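Binary maps: overall and average classification rates (%). Fraction maps: correlation coefficient, MAD, RMSD.
Training Set | Target Set | Training Set Size (fraction developed) | Overall Class. Rate, Fixed Thresh. (%) | Overall Class. Rate, Adaptive Thresh. (%) | Avg. Class. Rate, Fixed Thresh. (%) | Avg. Class. Rate, Adaptive Thresh. (%) | Corr. Coef. | MAD | RMSD
Flickr | Flickr | 5872 (0.52) | 67.2 | 66.9 | 68.7 | 65.2 | 0.380 | 0.279 | 0.373
Flickr | Flickr, no faces | 5872 (0.52) | 66.8 | 66.7 | 66.8 | 64.2 | 0.367 | 0.301 | 0.414
Geograph | Flickr | 5603 (0.47) | 60.7 | 68.3 | 53.8 | 66.6 | 0.330 | 0.294 | 0.381
Geograph | Flickr, no faces | 5603 (0.47) | 59.9 | 68.0 | 52.0 | 65.2 | 0.312 | 0.321 | 0.428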
• Filtering out images with faces from the target set does not result in improved performance.
34
Discussion (1)
• Demonstrated that georeferenced community-contributed photo collections can be considered as a form of VGI.
• Maps of developed/undeveloped regions automatically generated using Flickr and Geograph images shown to be similar to ground truth maps.
– Despite simple image features.
35
Discussion (2)
• Weakly-labelled training set outperforms manually-labelled training set.
– Clear benefits for training classifiers.
• Photographer intent is significant, especially for target set.
– Restricts what can be used as target sets.
– Poses interesting research challenges such as how to use the Geograph dataset to filter the “noisy” Flickr dataset.
• Initial results on filtering out images with faces inconclusive.
36
Extensions
• Improved image features.
– Gist.
• Integrate textual annotations.
– Flickr tags.
– Geograph descriptive text.
• Additional land-cover/use classes.
• Spatial models:
– Tobler’s first law of geography: all things are related, but nearby things are more related than distant things.
37
Come to our poster this afternoon
38
Thank you! Questions?
Acknowledgements:
• This work was funded in part by the following grants:
– DOE Early Career Scientist and Engineer Award/PECASE
– NSF 0917069: IIS Core
• Thanks to Nathan Graves for implementing the edge histogram descriptors.