Using the Grid to enable content-based image retrieval on an industrial scale

Karl Harrison
School of Physics and Astronomy
University of Birmingham

UK e-Science All-Hands Meeting “Crossing The Boundaries”
Edinburgh, 8-11 September 2008

11th September 2008


Proliferation of images

-Images can be massively influential: informing, entertaining, inspiring, shocking, capturing attention

-Volume of digital images to which people have access is growing continually

-Global annual sales of 1x10^8 digital cameras and 3x10^8 camera phones

-People are creating their own images and sharing with others: flickr, ImageShack, facebook, etc
  - Number of images on internet estimated to be in excess of 1.5x10^10

-Also have professionally maintained image collections: newspaper archives, picture libraries, etc

-Finding the right image for a particular situation isn’t always easy


Approaches to image retrieval

-Majority of information-retrieval systems tend to rely on keyword searches

-Keyword searches often give good results, but have drawbacks
  - No attempt to understand meaning of a query
    • For example, a search with Google for James Bond films without Roger Moore does exactly the opposite of what’s intended, or finds nothing if an exact match is requested
  - For image searches, rely on annotations (tags), which may be missing or wrong, or on surrounding text, which may be misleading
    • No attempt to extract information from image itself

-Imense Ltd (formerly Cambridge Ontology) has developed a new kind of image-retrieval system
  - Automated analysis and recognition of image content, using Java and C++ software, and an extensible set of classifiers
  - Interpretation of user queries, based on semantic and linguistic relations between search terms


Imense image analysis

Pixels

Regions

Objects

Concepts

Segmentation into regions: computation of properties (size, colour, shape, texture)

Scene classification: indoor, beach, sunset, nighttime, autumn

Region classification: material and environmental categories (skin, cloth, grass, sky, wood, water)

Object detection and recognition: human faces detected and analysed (sex, age, facial expression)

Semantic descriptor extraction: combine all information in index


Sample image features extracted by Imense analysis

Image 1:
- Group
- Adult woman
- Boy
- Girl
- Indoors
- White plastic
- Orange wall
- Dark clothing

Image 2:
- No people
- Outdoors
- Light pink flowers
- Dark green leaves
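One way to picture how descriptors like these can support content-based search is an inverted index that maps each descriptor to the images carrying it. The sketch below is illustrative only: the file names and the `build_index` and `search` helpers are hypothetical, and nothing here is meant to describe how the Imense index is actually implemented.

```python
from collections import defaultdict

# Descriptors are taken from the two sample images; the file names are hypothetical.
features = {
    "family.jpg": {"group", "adult woman", "boy", "girl", "indoors",
                   "white plastic", "orange wall", "dark clothing"},
    "flowers.jpg": {"no people", "outdoors", "light pink flowers",
                    "dark green leaves"},
}

def build_index(features):
    """Map each descriptor to the set of images that carry it."""
    index = defaultdict(set)
    for image, descriptors in features.items():
        for descriptor in descriptors:
            index[descriptor].add(image)
    return index

def search(index, *descriptors):
    """Return the images that carry every requested descriptor."""
    matches = [index.get(d, set()) for d in descriptors]
    return set.intersection(*matches) if matches else set()

index = build_index(features)
print(search(index, "indoors", "girl"))   # {'family.jpg'}
print(search(index, "outdoors", "girl"))  # set()
```

A query that combines descriptors is then a set intersection, which is why the second query, mixing descriptors from different images, returns nothing.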


Imense image search


Turning a good idea into a working system

Four basic steps to enabling searches based on image content:

Image location: images may be in an archive stored on disk, or may be distributed between web sites

Image retrieval: retrieve images from storage location to processing node

Image analysis: perform feature extraction

Indexing: collate and store analysis results

-Bulk of processing requirement is in analysis step: typically a few seconds per image

-Proof of principle based on several thousand images is straightforward using minimal resources

-Building up index for many millions of images is more challenging

-Images are analysed independently of one another, so massive parallelisation is possible

This is the type of problem where Grid solutions work well
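Because each image is analysed independently, the workload can be cut into chunks and the elapsed time falls roughly in proportion to the number of jobs running at once. The sketch below shows the idea under stated assumptions: 3 seconds per image stands in for "a few seconds", the 10 million images and the placeholder URLs are invented for illustration, and the helper names are hypothetical. The chunk size of 500 and the figure of 150 parallel jobs are values quoted elsewhere in the talk.

```python
def split_into_jobs(image_urls, images_per_job=500):
    """Divide a list of image URLs into independent chunks, one per Grid job."""
    return [image_urls[i:i + images_per_job]
            for i in range(0, len(image_urls), images_per_job)]

def wall_clock_days(n_images, seconds_per_image, parallel_jobs):
    """Estimate elapsed days, assuming analysis time dominates and scales perfectly."""
    cpu_seconds = n_images * seconds_per_image
    return cpu_seconds / parallel_jobs / 86400

# Placeholder URLs, for illustration only.
urls = [f"http://example.org/img/{i}.jpg" for i in range(1200)]
jobs = split_into_jobs(urls)
print([len(job) for job in jobs])  # [500, 500, 200]

# Assumed figures: 3 s per image, 10 million images.
print(round(wall_clock_days(10_000_000, 3, 1), 1))    # 347.2 (days on one CPU)
print(round(wall_clock_days(10_000_000, 3, 150), 1))  # 2.3 (days with 150 parallel jobs)
```

The estimate ignores Grid overheads and differences between sites, so it should be read as a lower bound on elapsed time.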


Getting image processing onto the particle-physics Grid

-STFC knowledge-transfer projects set up to investigate Grid solutions for large-scale image processing
  - November 2006 - June 2007: mini-PIPSS award (feasibility study)
  - October 2007 - April 2009: PIPSS award (optimised system)
  - Collaboration between Imense Ltd, University of Cambridge High-Energy Physics Group and Cambridge e-Science Centre
  - Continued involvement from former Cambridge researchers now based at Birmingham

-New Virtual Organisation (camont) set up, and enabled at seven GridPP sites
  - Access to more sites possible if needed
  - Help with teething problems from GridPP experts and site managers

-Grid effectively providing computing on demand
  - Highest number of parallel jobs so far is about 150
  - Often useful at present to be able to run a few tens of parallel jobs
  - Aim to ramp up to larger samples later in the year


Job-management system and Grid user interface

[Diagram: Ganga architecture]
- Applications: LHCb applications, ATLAS applications, other applications
- Processing systems (backends): experiment-specific workload-management systems, local batch systems, distributed (Grid) systems
- Data storage and retrieval: metadata catalogues, file catalogues, tools for data management
- Ganga job archives: local repository, remote repository
- Ganga monitoring loop
- User interface for job definition and management

• Use Ganga system, developed to support particle-physics experiments (ATLAS and LHCb)

• Component architecture allows customisation for other user groups


Ganga job abstraction

[Diagram: building blocks of a Job]
- Application: what to run
- Backend: where to run
- Input Dataset: data read by application
- Output Dataset: data written by application
- Splitter: rule for dividing into subjobs
- Merger: rule for combining outputs

A job in Ganga is constructed from a set of building blocks, not all needed for every job
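The building-block idea can be sketched as a plain Python data class in which only the application and backend are mandatory. This is a toy model written for illustration, not Ganga code: `JobSketch`, `halves` and `concatenate` are hypothetical names, and Ganga's own classes are considerably richer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class JobSketch:
    """Toy model of a job assembled from optional building blocks."""
    application: str                        # what to run
    backend: str                            # where to run
    inputdata: Optional[List[str]] = None   # data read by application
    outputdata: Optional[str] = None        # data written by application
    splitter: Optional[Callable[[List[str]], List[List[str]]]] = None
    merger: Optional[Callable[[List[list]], list]] = None

    def subjobs(self):
        """Apply the splitter, if present, giving one sub-job per input chunk."""
        if self.splitter is None or self.inputdata is None:
            return [self]
        return [JobSketch(self.application, self.backend, chunk, self.outputdata)
                for chunk in self.splitter(self.inputdata)]

    def merge(self, outputs):
        """Apply the merger, if present; otherwise return the outputs unchanged."""
        return self.merger(outputs) if self.merger else outputs

def halves(items):
    """Example splitting rule: cut the input list in two."""
    mid = len(items) // 2
    return [items[:mid], items[mid:]]

def concatenate(outputs):
    """Example merging rule: join per-sub-job result lists into one list."""
    return [item for output in outputs for item in output]

job = JobSketch("classify", "grid",
                inputdata=["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
                splitter=halves, merger=concatenate)
for sub in job.subjobs():
    print(sub.backend, sub.inputdata)
print(job.merge([["sky"], ["grass", "water"]]))  # ['sky', 'grass', 'water']
```

A job with no splitter simply runs as a single unit, which mirrors the point that not every block is needed for every job.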


Image-analysis jobs in Ganga

    # Define application to perform image analysis, specifying input
    app = Classify( version = "2.0.1", imageList = "imageURLs.txt" )

    # Define processing system where job will run
    bck = LCG( middleware = "GLITE" )

    # Define type of output data to be produced
    out = CamontDataset()

    # Create job
    j = Job( application = app, backend = bck, outputdata = out )

    # Submit job
    j.submit()

Ganga provides a command-line interface and scripting language, built on Python

Ganga also provides a graphical interface (screenshot panels: Job details, Logical Folders, Job Monitoring)

In practice, use Ganga script: automated job submission and checking 24 hours a day
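The shape of an automated submit-and-check loop can be sketched without any Grid access by using stand-in job objects. Everything below is hypothetical: `FakeJob` replays a scripted list of statuses, the status strings are chosen for illustration, and `run_until_done` is not the script used in this project.

```python
class FakeJob:
    """Stand-in for a Grid job that replays a scripted sequence of statuses."""
    def __init__(self, name, statuses):
        self.name = name
        self._statuses = list(statuses)
        self.status = "new"
        self.submissions = 0

    def submit(self):
        self.submissions += 1
        self.status = "submitted"

    def poll(self):
        """Advance to the next scripted status, if any remain."""
        if self._statuses:
            self.status = self._statuses.pop(0)

def run_until_done(jobs, max_polls=10, max_submissions=3):
    """Submit all jobs, then poll them, resubmitting failures up to a limit."""
    for job in jobs:
        job.submit()
    for _ in range(max_polls):
        for job in jobs:
            job.poll()
            if job.status == "failed" and job.submissions < max_submissions:
                job.submit()  # give a failed job another attempt
        if all(job.status in ("completed", "failed") for job in jobs):
            break
        # A real script would sleep here before polling again.
    return {job.name: job.status for job in jobs}

jobs = [
    FakeJob("a", ["running", "completed"]),
    FakeJob("b", ["running", "failed", "running", "completed"]),
]
print(run_until_done(jobs))  # {'a': 'completed', 'b': 'completed'}
print(jobs[1].submissions)   # 2
```

Capping the number of resubmissions keeps a persistently failing job from being retried indefinitely.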


Job destinations and execution times


Results for 3638 jobs submitted over four-week period, July-August 2008

Destination chosen by Resource Broker of Workload Management System, based on minimum estimated waiting time

Significant differences in execution times reflect inhomogeneity of site resources


Site monitoring

Example site monitoring: running jobs at Lancaster for 8-day period (July 2008)

Image-processing jobs (camont) are small fraction of total


Data-transfer rates



Image downloads from hosting site, using wget





Upload of results to Grid storage elements, using globus-url-copy



Grid overheads








-Useful time is when job is downloading and processing images

-Grid overheads come from: startup time, system time for logging job completion, result upload and retrieval

-For jobs of 500 images, average start-to-finish time is 164 minutes, with 39 minutes spent on Grid overheads: 73% useful time
  - Timing distributions non-Gaussian, with long tails
  - Need to increase processing load to increase fraction of useful time
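The effect of job size on the useful-time fraction can be estimated from the quoted averages. The sketch below assumes that the 39-minute overhead does not depend on job size and that useful time scales linearly with the number of images; both are simplifications, and the function name is hypothetical. Taking the ratio of the averages gives about 76% for 500-image jobs, slightly above the quoted 73%. The difference might arise if the quoted figure averages the fraction job by job over the long-tailed distributions, but that is a guess, since the slides do not say.

```python
OVERHEAD_MIN = 39.0     # average Grid overhead per job, from the timing study
TOTAL_MIN_500 = 164.0   # average start-to-finish time for 500-image jobs

# Useful minutes per image implied by the averages: 125 / 500 = 0.25 (15 s).
USEFUL_MIN_PER_IMAGE = (TOTAL_MIN_500 - OVERHEAD_MIN) / 500

def useful_fraction(n_images, overhead_min=OVERHEAD_MIN):
    """Fraction of job time spent downloading and processing images."""
    useful = n_images * USEFUL_MIN_PER_IMAGE
    return useful / (useful + overhead_min)

for n in (500, 1000, 2000):
    print(n, f"{useful_fraction(n):.0%}")
# 500 76%
# 1000 87%
# 2000 93%
```

Under these assumptions, doubling the job size to 1000 images would raise the useful fraction by roughly ten percentage points.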


Experience of image processing on the Grid

-Grid has been successfully used to process several million images
  - Have processed both images from a disk archive and images retrieved directly to Grid nodes from image-hosting web sites
  - This has contributed to launch of beta version of new image search engine: http://imense.com/

-Have automated system, based on Ganga, for job submission and output retrieval
  - Makes keeping track of thousands of jobs and millions of images completely painless

-Job failure rates have been at 2% level, with two main causes
  - Proxy credential of submitting user expires before job starts
  - Network failures, preventing upload of results to storage element

-Positive experience with using the Grid for image retrieval and processing has prompted interest in using the Grid also for image location
  - Grid-enabled web crawler now at testing phase


Imense: sunset with cloudy sky


Google: sunset with cloudy sky

Keyword search:
- confuses skyscapes and cocktails
- images found often disappoint expectations


Imense: streets after dark


Google: streets after dark

Keyword search:
- doesn’t distinguish between an object and a reference


Imense: man on right


Google: man on right

Keyword search:
- doesn’t know left from wrong
- doesn’t understand prepositions
- has no concept of what a man is

Conclusions

-Imense Ltd has developed a new kind of image-retrieval system
  - Allows searches based on image content
  - Able to extract meaning from the user query

-Image retrieval and processing has been performed successfully at seven GridPP sites, allowing analysis of several million images
  - Aiming for samples of tens of millions of images by end of year

-Ganga job-management system, developed for particle-physics experiments, has been used to submit and track Grid jobs
  - Makes whole process painless

-Timing studies show that Grid overheads, mainly for queueing and for completion logging, contribute an average of 39 minutes per job
  - Relevant timing distributions are non-Gaussian with long tails
  - Overhead likely to be a function of sites used and overall Grid activity
  - Longer jobs have greater fraction of useful time

-Beta version of new image search engine has been launched
  - Try it out at: http://imense.com/
