Using the Grid to enable content-based image retrieval
on an industrial scale
Karl Harrison
School of Physics and Astronomy
University of Birmingham
UK e-Science All-Hands Meeting “Crossing The Boundaries”
Edinburgh, 8-11 September 2008
11th September 2008
2/23
Proliferation of images
- Images can be massively influential: informing, entertaining, inspiring, shocking, capturing attention
- Volume of digital images to which people have access is growing continually
  - Global annual sales of 1×10^8 digital cameras and 3×10^8 camera phones
- People are creating their own images and sharing with others: flickr, ImageShack, facebook, etc
  - Number of images on internet estimated to be in excess of 1.5×10^10
- Also have professionally maintained image collections: newspaper archives, picture libraries, etc
- Finding the right image for a particular situation isn’t always easy
Approaches to image retrieval
- Majority of information-retrieval systems tend to rely on keyword searches
- Keyword searches often give good results, but have drawbacks
  - No attempt to understand meaning of a query
    • For example, a search with Google for James Bond films without Roger Moore does exactly the opposite of what’s intended, or finds nothing if an exact match is requested
  - For image searches, rely on annotations (tags), which may be missing or wrong, or on surrounding text, which may be misleading
    • No attempt to extract information from image itself
- Imense Ltd (formerly Cambridge Ontology) has developed a new kind of image-retrieval system
  - Automated analysis and recognition of image content, using Java and C++ software, and an extensible set of classifiers
  - Interpretation of user queries, based on semantic and linguistic relations between search terms
Imense image analysis
Pixels → Regions → Objects → Concepts
- Segmentation into regions: computation of properties (size, colour, shape, texture)
- Scene classification: indoor, beach, sunset, nighttime, autumn
- Region classification: material and environmental categories (skin, cloth, grass, sky, wood, water)
- Object detection and recognition: human faces detected and analysed (sex, age, facial expression)
- Semantic descriptor extraction: combine all information in index
Sample image features extracted by Imense analysis
- Group
- Adult woman
- Boy
- Girl
- Indoors
- White plastic
- Orange wall
- Dark clothing

- No people
- Outdoors
- Light pink flowers
- Dark green leaves
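One simple way to make descriptors like these searchable is an inverted index, which maps each descriptor to the set of images that carry it, so that a multi-term query becomes a set intersection. The sketch below is illustrative only and is not the Imense implementation, which the slides do not describe. The image identifiers, the function names, and the assignment of the descriptors to two separate images are assumptions.

```python
from collections import defaultdict

# Hypothetical image identifiers; descriptors are taken from the sample slide.
features = {
    "img_001": ["group", "adult woman", "boy", "girl", "indoors",
                "white plastic", "orange wall", "dark clothing"],
    "img_002": ["no people", "outdoors", "light pink flowers",
                "dark green leaves"],
}

def build_index(features):
    """Map each descriptor to the set of image ids that carry it."""
    index = defaultdict(set)
    for image_id, descriptors in features.items():
        for descriptor in descriptors:
            index[descriptor].add(image_id)
    return index

def search(index, *descriptors):
    """Return ids of images that carry every requested descriptor."""
    if not descriptors:
        return set()
    matches = [index.get(d, set()) for d in descriptors]
    return set.intersection(*matches)

index = build_index(features)
print(search(index, "outdoors", "no people"))   # {'img_002'}
print(search(index, "indoors", "outdoors"))     # set()
```

A real system would also need the query-interpretation step (semantic and linguistic relations between search terms), which this sketch leaves out.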
Imense image search
Turning a good idea into a working system
Four basic steps to enabling searches based on image content:
- Image location: images may be in an archive stored on disk, or may be distributed between web sites
- Image retrieval: retrieve images from storage location to processing node
- Image analysis: perform feature extraction
- Indexing: collate and store analysis results

- Bulk of processing requirement is in analysis step: typically a few seconds per image
- Proof of principle based on several thousand images is straightforward using minimal resources
- Building up index for many millions of images is more challenging
- Images are analysed independently of one another, so massive parallelisation is possible
This is the type of problem where Grid solutions work well
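Because the images are analysed independently, the wall-clock time should fall roughly in proportion to the number of jobs running in parallel. The back-of-envelope sketch below makes that concrete. The figure of 3 seconds per image is an illustrative assumption standing in for "a few seconds per image", the sample sizes are also assumptions, and Grid overheads are ignored.

```python
def wall_clock_hours(n_images, seconds_per_image, parallel_jobs):
    """Idealised wall-clock time, in hours, assuming perfect parallelism
    and no Grid overhead."""
    total_cpu_seconds = n_images * seconds_per_image
    return total_cpu_seconds / parallel_jobs / 3600.0

SECONDS_PER_IMAGE = 3.0  # assumption: "a few seconds per image"

# (number of images, number of parallel jobs): illustrative values only
for n_images, jobs in [(5_000, 1), (10_000_000, 1), (10_000_000, 150)]:
    hours = wall_clock_hours(n_images, SECONDS_PER_IMAGE, jobs)
    print(f"{n_images:>10} images, {jobs:>3} parallel jobs: {hours:8.1f} h")
```

Under these assumptions, a few thousand images need only a few hours on one processor, whereas ten million images would take most of a year on one processor and a little over two days with 150 jobs in parallel.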
Getting image processing onto the particle-physics Grid
- STFC knowledge-transfer projects set up to investigate Grid solutions for large-scale image processing
  - November 2006 - June 2007: mini-PIPSS award, feasibility study
  - October 2007 - April 2009: PIPSS award, optimised system
  - Collaboration between Imense Ltd, University of Cambridge High-Energy Physics Group and Cambridge e-Science Centre
  - Continued involvement from former Cambridge researchers now based at Birmingham
- New Virtual Organisation (camont) set up, and enabled at seven GridPP sites
  - Access to more sites possible if needed
  - Help with teething problems from GridPP experts and site managers
- Grid effectively providing computing on demand
  - Highest number of parallel jobs so far is about 150
  - Often useful at present to be able to run a few tens of parallel jobs
  - Aim to ramp up to larger samples later in the year
Job-management system and Grid user interface
Applications: LHCb applications, ATLAS applications, other applications
Processing systems (backends): experiment-specific workload-management systems, local batch systems, distributed (Grid) systems
Data storage and retrieval: metadata catalogues, file catalogues, tools for data management
Ganga job archives: local repository, remote repository
Ganga monitoring loop
User interface for job definition and management
• Use Ganga system, developed to support particle-physics experiments (ATLAS and LHCb)
• Component architecture allows customisation for other user groups
Ganga job abstraction
Job
- Application: what to run
- Backend: where to run
- Input Dataset: data read by application
- Output Dataset: data written by application
- Splitter: rule for dividing into subjobs
- Merger: rule for combining outputs
A job in Ganga is constructed from a set of building blocks, not all needed for every job
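The way these building blocks fit together can be mimicked in a few lines of plain Python. The sketch below is not Ganga code and does not use Ganga's API. The names `SketchJob`, `split_by_chunk`, `merge_lists` and `fake_classify` are hypothetical, the URLs are placeholders, and the "backend" simply runs every subjob in the local process.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class SketchJob:
    """Toy stand-in for a job built from optional building blocks."""
    application: Callable[[List[str]], List[str]]   # what to run
    inputdata: List[str]                            # data read by application
    splitter: Optional[Callable] = None             # rule for dividing into subjobs
    merger: Optional[Callable] = None               # rule for combining outputs

    def run_locally(self):
        """Stand-in for a backend: run every subjob in this process."""
        chunks = self.splitter(self.inputdata) if self.splitter else [self.inputdata]
        outputs = [self.application(chunk) for chunk in chunks]
        return self.merger(outputs) if self.merger else outputs

def split_by_chunk(size):
    """Build a splitter that cuts the input list into chunks of `size` items."""
    def split(items):
        return [items[i:i + size] for i in range(0, len(items), size)]
    return split

def merge_lists(outputs):
    """Concatenate the per-subjob output lists into one list."""
    return [item for output in outputs for item in output]

def fake_classify(urls):
    """Placeholder for image analysis: tags each URL as analysed."""
    return [f"analysed:{url}" for url in urls]

urls = [f"http://example.org/img{i}.jpg" for i in range(5)]  # placeholder URLs
job = SketchJob(application=fake_classify, inputdata=urls,
                splitter=split_by_chunk(2), merger=merge_lists)
results = job.run_locally()
print(len(results))  # 5
```

Leaving out the splitter and merger still gives a valid job, in line with the remark that not every building block is needed for every job.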
Image-analysis jobs in Ganga
# Define application to perform image analysis, specifying input
app = Classify( version = "2.0.1", imageList = "imageURLs.txt" )
# Define processing system where job will run
bck = LCG( middleware = "GLITE" )
# Define type of output data to be produced
out = CamontDataset()
# Create job
j = Job( application = app, backend = bck, outputdata = out )
# Submit job
j.submit()
Ganga provides a command-line interface and scripting language, built on Python
Ganga also provides a graphical interface (panels: Job details, Logical Folders, Job Monitoring)
In practice, use Ganga script: automated job-submission and checking 24 hours a day
Job destinations and execution times
Results for 3638 jobs submitted over four-week period, July-August 2008
Destination chosen by Resource Broker of Workload Management System, based on minimum estimated waiting time
Significant differences in execution times reflect inhomogeneity of site resources
Site monitoring
Example site monitoring: running jobs at Lancaster for 8-day period (July 2008)
Image-processing jobs (camont) are small fraction of total
Data-transfer rates
Image downloads from hosting site, using wget
Upload of results to Grid storage elements, using globus-url-copy
Results for 3638 jobs submitted over four-week period, July-August 2008
Grid overheads
Results for 3638 jobs submitted over four-week period, July-August 2008
- Useful time is when job is downloading and processing images
- Grid overheads come from: startup time, system time for logging job completion, result upload and retrieval
- For jobs of 500 images, average start-to-finish time is 164 minutes, with 39 minutes spent on Grid overheads: 73% useful time
Timing distributions non-Gaussian, with long tails
Need to increase processing load to increase fraction of useful time
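The link between job length and useful-time fraction can be made explicit with a little arithmetic. The sketch below uses the quoted averages (164 minutes start to finish, 39 minutes of overhead, 500 images per job). It assumes, purely for illustration, that the overhead is fixed per job and that useful time scales linearly with the number of images. Note that the ratio of the two averages comes out near 76%, slightly above the quoted 73%. One possible explanation, which the slides do not confirm, is that the quoted figure is an average of per-job fractions, which need not equal the ratio of averages when the distributions have long tails.

```python
TOTAL_MIN = 164.0      # average start-to-finish time for a 500-image job
OVERHEAD_MIN = 39.0    # average Grid overhead per job
IMAGES_PER_JOB = 500

useful_min = TOTAL_MIN - OVERHEAD_MIN            # minutes of download + processing
minutes_per_image = useful_min / IMAGES_PER_JOB  # average useful time per image

def useful_fraction(n_images, overhead_min=OVERHEAD_MIN):
    """Fraction of job time spent downloading and processing images,
    assuming a fixed overhead and linear scaling of useful time."""
    useful = n_images * minutes_per_image
    return useful / (useful + overhead_min)

# Job sizes other than 500 are illustrative assumptions
for n in (100, 500, 1000, 2000):
    print(f"{n:>5} images per job: {useful_fraction(n):.0%} useful time")
```

Under these assumptions the useful fraction rises steadily with job size, which is the motivation for increasing the processing load per job.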
Experience of image processing on the Grid
- Grid has been successfully used to process several million images
  - Have processed both images from a disk archive and images retrieved directly to Grid nodes from image-hosting web sites
  - This has contributed to launch of beta version of new image search engine: http://imense.com/
- Have automated system, based on Ganga, for job submission and output retrieval
  - Makes keeping track of thousands of jobs and millions of images completely painless
- Job failure rates have been at 2% level, with two main causes
  - Proxy credential of submitting user expires before job starts
  - Network failures, preventing upload of results to storage element
- Positive experience with using the Grid for image retrieval and processing has prompted interest in using the Grid also for image location
  - Grid-enabled web crawler now at testing phase
Imense: sunset with cloudy sky
Google: sunset with cloudy sky
Keyword search:
- confuses skyscapes and cocktails
- images found often disappoint expectations
Imense: streets after dark
Google: streets after dark
Keyword search:
- doesn’t distinguish between an object and a reference
Imense: man on right
Google: man on right
Keyword search:
- doesn’t know left from wrong
- doesn’t understand prepositions
- has no concept of what a man is
Conclusions
- Imense Ltd has developed a new kind of image-retrieval system
  - Allows searches based on image content
  - Able to extract meaning from the user query
- Image retrieval and processing has been performed successfully at seven GridPP sites, allowing analysis of several million images
  - Aiming for samples of tens of millions of images by end of year
- Ganga job-management system, developed for particle-physics experiments, has been used to submit and track Grid jobs
  - Makes whole process painless
- Timing studies show that Grid overheads, mainly for queueing and for completion logging, contribute an average of 39 minutes per job
  - Relevant timing distributions are non-Gaussian with long tails
  - Overhead likely to be a function of sites used and overall Grid activity
  - Longer jobs have greater fraction of useful time
- Beta version of new image search engine has been launched
  - Try it out at: http://imense.com/