TOWARD FLEXIBLE USER INTERACTION IN CONTENT-BASED MULTIMEDIA DATA RETRIEVAL
BY
MUNEHIRO NAKAZATO
B.S., Keio University, 1995
M.S., Keio University, 1997
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the University of Illinois at Urbana-Champaign, 2003
Urbana, Illinois
Abstract
This thesis discusses various aspects of digital image retrieval and management. First, we discuss user interfaces and visualization for digital image management. Two innovative systems are proposed. 3D MARS is an immersive 3D display for image visualization and search. The user browses and searches images in a 3D virtual reality environment. ImageGrouper is another graphical user interface for digital image search and organization. A new concept, Object-Oriented User Interaction, is introduced. The system improves image retrieval and eases text annotation and organization of digital images. Unlike traditional user interfaces for image retrieval, ImageGrouper allows the user to group query example images. To take advantage of this feature, a new algorithm for relevance feedback is proposed. Next, this thesis discusses data structures and algorithms for high-dimensional data access, an essential component of multimedia data retrieval. The results of preliminary experiments are presented.
Table of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1
1.1 Motivation: The World of Digital Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Drawbacks of the Traditional Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 User Interfaces Support for Content-based Image Retrieval . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Two-class Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.3 Indexing Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2 Navigation in Immersive 3D Image Space. . . . . . . . . . . . . . . . . . . . . . . .6
2.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Text Visualization vs. Image Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 3D MARS: the Interactive Visualization for CBIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 User Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Query Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7.1 Total Ranking vs. Feature Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8 Visualization Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8.1 Projection Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8.1.1 Static Axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8.1.2 Dynamic Axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.9 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.10 Possible Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.10.1 Integration of Browsing and Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10.2 Migrating to 6-Sided CAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10.3 Improvement on User Input Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.10.4 Improvement on Navigation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.10.5 Multi-Cluster Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 3 Group-Oriented User Interface for Digital Image Retrieval and Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26
3.1 User Interface Support for Content-based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 The Traditional Approaches: Incremental Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Limitation of Incremental Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 El Niño System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Query-by-Groups with ImageGrouper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 The Basic Query Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 The Flexibility of Query-by-Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Trial and Error Query by Mouse Dragging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Groups in a Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Experiment on Trial and Error Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Text Annotations on Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.1 Current Approaches for Text Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.2 Annotation by Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.2.1 Annotating New Images with the Same Keywords. . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.2.2 Hierarchical Annotation with Groups in a Group . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6.2.3 Overlap between Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Organizing Images by Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.7.1 Photo Albums and Group Icons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8 Usability Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.8.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.8.1.1 Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.8.1.2 Apparatus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8.2 Scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8.2.1 Experimental Task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8.2.2 Training Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8.2.3 Experiment Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8.3 The Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.8.3.1 Error Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.8.3.2 Task Completion Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.8.4 Improving ImageGrouper Based on the Lessons We Learned . . . . . . . . . . . . . . . . . . . 53
3.9 Implementation Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.9.1 The Client-Server Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.9.1.1 The User Interface Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.9.1.2 The Query Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.9.2 Relevance Feedback Algorithm in the Query Engine. . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.10 Conclusion and Future Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 4 Relevance Feedback Algorithms for Group Query . . . . . . . . . . . . . . .59
4.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.1 Image Retrieval as a One-Class Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.2 Image Retrieval as a Two-Class Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.3 Image Retrieval as a (1+x)-Class Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.4 Multi-Class Relevance Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Proposed Approach: Extending to a (x+y)-Class Problem . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5 Analysis on Toy Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Experiments on Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6.1 Data Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6.2 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6.3 Ranking Strategies for GBDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.8 Possible Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.8.1 Groups with Different Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.8.2 Automated Clustering of Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.8.3 More than Two Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 5 Integrating ImageGrouper into the 3D Virtual Space . . . . . . . . . . . . . .82
5.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Design Choices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 User Interaction of Grouper in 3D MARS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5 The Benefits of Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Chapter 6 Storage and Visual Features for Content-based Image Retrieval . . . . .89
6.1 Data Structure for High-Dimensional Data Access. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.1.1 Background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Preliminary Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Image Retrieval by Local Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4 Integration of Content-based and Keyword-based Image Retrieval . . . . . . . . . . . . . . . . . . 94
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .97
Appendix A Image Features in the Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . .106
A.1 Color Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A.2 Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.3 Edge Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Appendix B Implementation Details of ImageGrouper and Query Engine . . . . .110
B.1 ImageGrouper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
B.1.1 Structure of ImageGrouper User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
B.1.2 Image Drag Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B.2 Query Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B.2.1 Overview of the Query Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B.2.2 System Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
B.2.3 Client-Server Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
B.2.3.1 Standalone Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
B.3 Building Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
B.3.1 Directory Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
B.3.2 Building Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
B.3.2.1 Additional Libraries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
B.3.2.2 Setting Java Parameter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
B.3.2.3 Modifying Server Makefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
B.3.2.4 Compiling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.3.3 Running Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.3.3.1 Image File Location. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.3.3.2 Metadata File Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.3.3.3 Server URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.3.3.4 Starting Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .122
List of Figures
Chapter 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1-1 The Digital Image Toolkit.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2 Navigation in Immersive 3D Image Space. . . . . . . . . . . . . . . . . . . . . . 6
2-1 Initial configuration of 3D MARS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2-2 3D MARS in CAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2-3 3D MARS on a desktop VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2-4 The result after the user selected one “red flower” picture (in fixed-axes mode). The query example is displayed near the origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2-5 The result after the user selected another “flower” image. Red flowers of different texture are aligned along the red arrow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2-6 The Sphere Mode. The number of images is 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2-7 The sphere mode from a different view angle (from the zenith of the space). The relationship between color and structure is visualized. . . . . . . . . . . . . . . . . . . . . . . . . . 16
2-8 The system architecture of 3D MARS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 3 Group-Oriented User Interface for Digital Image Retrieval and Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3-2 Example of “More is not necessarily better.” The left is the case of one example; the right is the case of two examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3-1 Typical Graphical User Interfaces for CBIR Systems: (a) slider-based GUI, (b) click-based GUI. On both systems, the search results are also displayed on the same workspace. . . . 29
3-3 The overview of ImageGrouper user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3-4 The sequence of the basic query operation on ImageGrouper . . . . . . . . . . . . . . . . . . . 36
3-5 The number of hits until the 10th round or convergence. . . . . . . . . . . . . . . . . . . . . . . 40
3-6 Groups in a group. In ImageGrouper, the user can create a hierarchy of image groups. In this example, the entire group is “cars.” Once the whole group is annotated as “car,” the user only needs to type “red” to annotate the red cars. . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-7 Overlap between groups. Two images in the overlapped region contain both mountain and cloud. Once the two groups are annotated, the overlapped region is automatically annotated with both keywords. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-8 The Usability Test Setting (click-based GUI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3-9 The average time required to select each image in different problems. The X-axis is the total number of images selected during the problems. . . . . . . . . . . . . . . . . . . . . . . . . . 52
3-10 The average selection time per image. The differences are statistically significant at p < .001. The click-based user interface achieves the shortest task completion time. . . . . . . 53
Chapter 4 Relevance Feedback Algorithms for Group Query . . . . . . . . . . . . . 59
4-1 Co-evolution of GUI design and search algorithm development. A new GUI design motivates development of a new search algorithm. Meanwhile, an existing search algorithm could be reinforced by a new GUI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4-2 Graphical concepts of the previous space translation schemes. (a) The two-class scheme tries to divide data into two classes, positive and negative. (b) The (1+x)-class scheme handles the problem as one positive group and multiple negative groups. . . . . . . . . . . . 65
4-3 White flowers and red flowers. Both groups can be considered subsets of the “flower” class. In ImageGrouper, users can separate them into two positive groups. . . . . . . . . . . 68
4-4 Concept of the new feature space transform. It minimizes the scatter of each positive class while maximizing the scatter between positive and negative samples. . . . . . . . . . . . . . . 70
4-5 Comparison of MDA, BDA, and GBDA on toy problem 1. The original data (in 3D) and the resulting projections (2D) are plotted. In this problem, GBDA performs similarly to MDA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-6 Comparison of MDA, BDA, and GBDA on toy problem 2. The original data (in 3D) and the resulting projections (2D) are plotted. In this toy problem, GBDA performs similarly to BDA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-7 Sample query images. Each set is divided into two subsets. . . . . . . . . . . . . . . . . . . . . . 76
4-8 Comparison of BDA and GBDA on the real data. The results are shown in the weighted hit count. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-9 Comparison of BDA and GBDA on the real data (measured in precision and recall for different sample sizes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Chapter 5 Integrating ImageGrouper into the 3D Virtual Space. . . . . . . . . . . . 82
5-1 A user interacting in 3D MARS with a wireless-equipped notebook PC. . . . . . . . . . . . 85
5-2 Overview of the integrated system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Chapter 6 Storage and Visual Features for Content-based Image Retrieval. . . 89
6-1 Comparison of k-nearest-neighbor search time. K=50, dimension=37; the number of original data points is 100,000 (28 MB). Iterated 100,000 times. . . . . . . . . . . . . . . . . . 92
6-2 Block-based image selection. In this example, the image is divided into 5 x 5 blocks. The user may be interested in the 2 x 2 region colored blue. . . . . . . . . . . . . . . . . . . . . . . . . 93
6-3 Approximating image region from smaller blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6-4 Quad-Tree Decomposition [98]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6-5 Example of Keyword and Free Text Annotation of MPEG-7 . . . . . . . . . . . . . . . . . . . . 95
Appendix A Image Features in the Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A-1 The HSV color space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A-2 The wavelet texture features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Appendix B Implementation Details of ImageGrouper and Query Engine . . . 110
B-1 Layered structure of ImageGrouper user interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B-2 Image Dragging from the Grid to the Palette. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B-3 Client-Server configuration.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
B-4 Standalone configuration with a local query engine . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Chapter 1
Introduction
1.1 Motivation: The World of Digital Images

More and more people are enjoying Digital Imaging these days [37][64]. New, inexpensive, high-quality digital cameras hit the market every month, and they are replacing traditional film cameras. Even camera-equipped mobile phones have appeared [93]. We can take plenty of digital pictures without worrying about the cost of film. We can easily edit pictures as we like. We can share the images with our family and friends by e-mail and the World Wide Web.
Meanwhile, hospitals store huge numbers of medical images such as MRI and CT scans in digital formats. Digital museums are becoming numerous, too. Thus, Digital Imaging plays important roles in both the consumer and professional markets.
Users, however, have difficulty organizing and searching the large numbers of images in these databases. Current commercial database systems are designed for text data and are not suitable for Data Mining [39] of digital images. To make matters worse, digital image systems often automatically generate file names like “DSCF0052.JPG,” which are meaningless to humans. An efficient way to manage digital images is needed.
1.2 Related Work

Many researchers have proposed systems to find an image in large image databases
[18][20][25][31][43][55][58][63][77][84][86][99][110]. We can divide these approaches into two
types of interactions: Browsing and Searching. In Image Browsing, the user looks through the entire collection on the display. In most systems, the images are clustered in a hierarchical manner, and the user can traverse the hierarchy by zooming and panning [9][20][55][74]. In [74], browsing and searching are integrated so that the user can switch back and forth between the two.
Image Searching systems ask the user to provide some information to the system. Image searching can be further divided into Keyword-based and Content-based approaches. The advantage of Keyword-based approaches is that the user can retrieve images by high-level concepts, such as the names of objects in the image or the location where the picture was taken. However, in order to make a keyword-based approach effective, the users have to annotate all images manually. While this might make sense for commercial photo stocks, it is an extremely tedious task for home users.
Meanwhile, enormous amounts of research have been done on Content-Based Image Retrieval (CBIR)
[25][31][86][99]. In CBIR systems, the user searches images by low-level visual similarity such as color
[105], texture [97] and structure [120]. These features are automatically extracted from the images and
indexed in the database. Then, the system computes the similarity between the images based on these
features.
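As a rough sketch of this similarity computation, a system might combine per-feature distances with fixed weights. The feature names, vectors, and weights below are illustrative placeholders, not the actual features of any system in this thesis:

```python
import numpy as np

def weighted_distance(query_features, image_features, weights):
    """Combine per-feature Euclidean distances (color, texture, ...)
    into a single dissimilarity score using fixed weights."""
    total = 0.0
    for name, w in weights.items():
        q = np.asarray(query_features[name], dtype=float)
        x = np.asarray(image_features[name], dtype=float)
        total += w * np.linalg.norm(q - x)
    return total

# Toy feature vectors; a real system extracts these from the images.
query = {"color": [0.8, 0.1, 0.1], "texture": [0.3, 0.7]}
image = {"color": [0.7, 0.2, 0.1], "texture": [0.9, 0.1]}
weights = {"color": 0.6, "texture": 0.4}
d = weighted_distance(query, image, weights)  # smaller means more similar
```

Ranking the whole database by such a score is what produces the one-dimensional result lists discussed below.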
The most popular query method for CBIR is Query-by-Example (QbE). In this method, the users select
example images (as relevant or irrelevant) and ask the system to retrieve visually similar images. In
addition, in order to improve the retrieval further, CBIR systems often employ Relevance Feedback
[46][86][87][119], in which the users refine the search incrementally by giving feedback to the
previous query result. In Active Learning Relevance Feedback [110], the users are asked to select relevant
images from a set of the most informative images.
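One classic formulation of such incremental feedback is the Rocchio query update, sketched here purely as an illustration; the relevance feedback algorithms developed in this thesis differ, and the parameter values below are conventional defaults, not values from any cited system:

```python
import numpy as np

def rocchio_update(query, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query vector toward the mean of the relevant examples
    and away from the mean of the irrelevant ones."""
    new_q = alpha * np.asarray(query, dtype=float)
    if relevant:
        new_q += beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
    if irrelevant:
        new_q -= gamma * np.mean(np.asarray(irrelevant, dtype=float), axis=0)
    return new_q

# One feedback round: two relevant examples, one irrelevant example.
new_query = rocchio_update([0.0, 0.0],
                           relevant=[[1.0, 1.0], [3.0, 1.0]],
                           irrelevant=[[-2.0, 0.0]])
```

Each round of feedback re-runs the search with the updated query vector, which is the incremental refinement loop described above.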
1.3 Drawbacks of the Traditional Systems

In this section, we briefly discuss the major drawbacks of traditional CBIR systems in several respects.
1.3.1 User Interfaces Support for Content-based Image Retrieval
In most user interfaces for digital image retrieval, the query results are aligned in a grid. This means the results must be ordered and mapped onto a line. However, the visual features of CBIR systems consist of high-dimensional numerical data, so much information has to be discarded for the result display. Next, in CBIR systems, there are inevitable gaps between low-level image features and the user’s concepts. Therefore, trying different combinations of query examples in query-by-example is essential for successful retrieval, yet most systems do not support this type of query. These problems are addressed in Chapter 2 and Chapter 3.
1.3.2 Two-class Classification
Most CBIR systems ask the user to specify relevant and irrelevant image examples. Therefore, many
relevance feedback algorithms for CBIR addressed image retrieval as a two-class pattern classification
problem that classifies data into two classes (positive and negative) [26][48][86][87][110][117][118].
These approaches, however, introduce undesirable side effects because they mix up all negative
examples into one class. In the actual scenario of image retrieval, negative examples can be from many
classes of images in the database.
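The side effect of collapsing all negatives into one class can be seen with a deliberately simple nearest-mean classifier on synthetic data; this toy stand-in is not any of the cited algorithms, but it exhibits the same failure mode:

```python
import numpy as np

# Negative examples drawn from two unrelated image classes
# (say, "cars" and "sky") plus one coherent positive class.
negatives = np.array([[0.0, 0.0], [0.2, 0.1],    # negative cluster A
                      [9.0, 9.0], [9.2, 8.8]])   # negative cluster B
positives = np.array([[4.0, 1.0], [4.2, 0.8]])

neg_mean = negatives.mean(axis=0)   # ~(4.6, 4.5): lies in neither cluster
pos_mean = positives.mean(axis=0)

def is_positive(x):
    """Two-class nearest-mean rule: positive if closer to the positive mean."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x - pos_mean) < np.linalg.norm(x - neg_mean)

# A new image from negative cluster A is misclassified as positive,
# because merging the two clusters put the negative mean near nothing.
misled = is_positive([0.1, 0.05])
```

Treating each negative cluster separately (as the (1+x)-class approaches in Chapter 4 do) avoids this collapsed, meaningless negative mean.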
In addition, the user’s high-level concepts often cannot be expressed by only one class of images. Our new flexible user interface allows the user to specify more than one group of relevant images. Few researchers have addressed multi-positive-class relevance feedback algorithms [76]. We will discuss how we extend two-class classification to take advantage of multiple positive classes in Chapter 4.
1.3.3 Indexing Visual Features
The visual image features for CBIR systems are high-dimensional numerical data. It is difficult to manage these data with traditional commercial database systems because those systems are designed for text data and low-dimensional numerical data. While many researchers have proposed architectures for indexing high-dimensional data [10][11][12][35][38][50][57][81][88][113][114], these systems have not been tested for real-world performance. Often, a simpler method can outperform those sophisticated systems. Section 6.1 discusses this problem in detail.
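The “simpler method” here refers to a sequential scan; a minimal sketch of brute-force k-nearest-neighbor search is below. The 37-dimensional random data is only a stand-in for real extracted features:

```python
import numpy as np

def knn_linear_scan(database, query, k):
    """Brute-force k-nearest-neighbor search: one pass over every vector.
    In high dimensions a plain sequential scan is often competitive with
    tree-based index structures."""
    db = np.asarray(database, dtype=float)
    dists = np.linalg.norm(db - np.asarray(query, dtype=float), axis=1)
    return np.argsort(dists)[:k]  # indices of the k closest vectors

rng = np.random.default_rng(0)
db = rng.random((1000, 37))       # 37-dimensional features, as in Chapter 6
idx = knn_linear_scan(db, db[42], k=5)
```

The scan touches every record but has perfectly predictable cost and no index-maintenance overhead, which is why it serves as the baseline in the Section 6.1 experiments.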
1.4 System Overview

Figure 1-1 shows an overview of our proposed system. The system consists of the following components:
• User Interface (Chapter 2, Chapter 3 and Chapter 5)
• Query Engine (Chapter 4)
• Fast Data Indexing Structure (Section 6.1 in Chapter 6)
• Feature Extraction (Section 6.3 in Chapter 6)
• Meta-data Management (Section 6.4 in Chapter 6)
We discuss each component in the following chapters.
1.5 Organization of the Thesis

In the next chapter, we propose an innovative 3D visualization system for digital images named 3D
MARS. 3D MARS allows the user to browse and search images in an immersive virtual reality
environment. In Chapter 3, a new graphical user interface (GUI) for digital image retrieval and organization is presented. The new GUI, named ImageGrouper, introduces a new interaction method for content-based image query. To take advantage of this new interaction method, a new algorithm for relevance feedback is proposed in Chapter 4. In Chapter 5, 3D MARS and ImageGrouper are integrated to provide more flexible image query. Finally, several other research topics are discussed in Chapter 6.
Figure 1-1. The Digital Image Toolkit.
5
Chapter 2
Navigation in Immersive 3D Image Space
In this chapter, we propose an interactive 3D visualization system for Content-based Image Retrieval
(CBIR) named 3D MARS. In 3D MARS, query results are displayed on a projection-based immersive
Virtual Reality system or a desktop Virtual Reality system. Based on the users’ feedback, the system
dynamically reorganizes its visualization strategies. 3D MARS eases the tedious task of searching for
images in a large collection. In addition, the sphere display mode effectively visualizes clusters in the
image database, making the system a powerful analysis tool for CBIR researchers.
2.1 Introduction
While CBIR systems provide us with a smart way of searching for images, they have significant
limitations. First, the image features consist of high-dimensional vectors of different image properties
such as color, texture, and structure. Meanwhile, in traditional CBIR systems, the query results are
ordered and displayed in a single line based on the weighted sum of the distance measures. This means
the high-dimensional image features have to be mapped onto a one-dimensional space. As a result, much
information can be lost in visualization. This causes problems especially when the number of query
examples is small: the system cannot tell which feature is most important to the user. Consequently,
the most relevant images may not appear in the early stages of the query. One solution to this
problem is to let the user adjust the query parameters, as done in many image retrieval systems [31].
In this approach, the user has to specify the weight of each feature. This process, however, is very
tedious and difficult for novice users.
Second, in a conventional two-dimensional display, the query result images are tiled on a monitor.
Thus, only a limited number of images can be displayed at the same time. It is painful for the user to go
back and forth in the browser by clicking the “Next” and “Previous” buttons.
3D visualization for Content-Based Image Retrieval (CBIR) alleviates these problems. In addition, it
eases the integration of searching and browsing in a large image database. Many researchers have
proposed 3D visualization systems for CBIR. In most approaches, however, the image features used for
display are fixed by the system developers, even though not all features are equally important to the
users. Moreover, in most systems, all images in the database are displayed regardless of the user’s interest.
Displaying too many images consumes resources and may confuse the users.
In this chapter, we propose a new visualization system for Content-based Image Retrieval named 3D
MARS. In this system, images are displayed in a projection-based immersive Virtual Reality environment
or in non-immersive desktop VR. The three-dimensional space can display more images at the same time
than traditional CBIR systems. By giving a different meaning to each axis, the user can simultaneously
browse the retrieved images with respect to three different criteria.
In addition, responding to the users’ feedback, the system incrementally refines the query results and
dynamically adapts visualization strategies by relevance feedback techniques [86][87][119]. Moreover,
with the Sphere Display Mode, the system provides a powerful analysis tool for CBIR researchers.
The rest of this chapter is organized as follows. In the next section, we describe the difference between
3D visualization for text databases and image databases. In Section 2.3 a brief overview of previous
approaches is presented. Then, the proposed system is described in the following sections. Finally, the
future work and conclusion are presented in Section 2.10.
2.2 Text Visualization vs. Image Visualization
Many researchers have already proposed 3D information visualization systems for text document
databases [9][115][41]. Why do we need another visualization system for image databases? Because
there are significant differences between text documents and image documents with regard to
visualization.
First, in most text document visualization systems, only the title and minimal information can be
displayed simultaneously; otherwise, the display would be cluttered with text. Meanwhile, it is
difficult for the user to judge the relevance of documents from their titles alone. In order to see
detailed information such as the abstract or the contents of a document, the user has to select one
of the documents and open another display window (focus+context).
On the other hand, in image retrieval, the user needs only the image itself for relevance judgement. This
judgement is instant and does not require an additional display window. Hence, the system needs
to show only the images themselves (and the titles if necessary.) Therefore, images are more suitable for
fully immersive Virtual Reality systems such as the CAVE.
Second, in both text and image retrieval systems, documents are indexed in a high dimensional space.
Thus, in order to display the documents in a 3D space, the dimensionality has to be reduced. Because
the index of text retrieval is made of the occurrence and frequency of keywords, it is difficult to group
these components automatically in a meaningful manner. Such a task is usually domain specific and
requires human operation. On the other hand, the feature vectors of image retrieval systems can be
grouped easily, for example, into color, texture and structure. Therefore, the feature space can be easily
organized in a hierarchical manner for 3D visualization.
In content-based image databases, however, there are significant semantic gaps [89] between the image
features and the user’s concept. Most image databases index images with numerical features such as
color moments and wavelet coefficients, as described in Appendix A. These features are not directly
related to the user’s concept. Even if two images are close to each other in the high-dimensional feature
space, they do not necessarily look similar to the users. Therefore, in order to express the user’s
semantic concept with these low-level features, the weights of the feature components should be
adjusted automatically; relevance feedback thus plays a significant role. Meanwhile, in text
databases, it is more likely that related documents share the same keywords and are located close to each
other in the feature space.
2.3 Related Work
Many researchers have proposed 2D or 3D visualization systems for Content-based Image Retrieval
[21][42][54][84][109].
Virgilio [54] is a non-immersive VR environment for image retrieval implemented in VRML. In this
system, the locations of the images are computed off-line, and interactive query is not
possible. Only system administrators can send a query to the system and the other users can only
browse the resulting visualization.
Hiroike et al. [42] also developed a VR system for image retrieval. In their system, hundreds of images
in the database are displayed in a 3D space. According to the user’s feedback, these images are
reorganized and form clusters around the sample images. In their system, all the images in the database
are always displayed at the same time.
Chen et al. [21] applied the Pathfinder Network Scaling technique [91] to image databases. A Pathfinder
network creates links among the images such that each path is a shortest path between images.
In their system, mutually linked images are displayed in a 3D VR space. Depending on the features
selected, the network forms very different shapes. The number of images is fixed at 279.
Several researchers applied Multidimensional Scaling (MDS) [52] to image visualization. Rubner et al.
[84][85] used MDS for 2D and 3D visualization of images. Tian and Taylor [109] applied MDS to
visualize 80 color texture images in 3D. They compared visualization results for different sets of image
features. However, because MDS is computationally expensive ($O(N^2)$ time), it is not suitable for
interactive visualization of a large number of images.
2.4 3D MARS: Interactive Visualization for CBIR
In most of the approaches described above, a set of image features has to be selected in advance. The
problem, however, is that not all features are equally important to the users. For example, assume a user
is looking for images of “balls of any color.” In this case, “Color” features are not very useful and should
not be used for visualization. Inappropriate visualization can be misleading. Furthermore, the
important sets of features are context dependent. Thus, the user has to change the feature set according
to his current interest. This is a very difficult task for novice users.
Furthermore, in many systems, all images in the database are displayed regardless of the users’ interest.
Displaying too many images exhausts resources and is annoying for the users.
To address these problems, we propose a new visualization system for image databases named 3D
MARS. In 3D MARS, the system dynamically changes its visualization strategies (the sets of image
features, the sets of displayed images, and their locations) according to the user’s interest. The user of
3D MARS tells the
system his interest by specifying example images (Query-by-Example.) Repeating this feedback loop,
the system can incrementally optimize the display space.
2.5 User Navigation
In 3D MARS, images are displayed in a projection-based immersive VR or non-immersive desktop VR.
In the immersive case, we use NCSA CAVE. The image space is projected on four walls (front, left,
right, and floor) surrounding the user (Figure 2-2.) With shutter glasses, the user can see a stereoscopic
view of the space. The user interacts with objects by a wand. The user can freely walk around in CAVE.
In the desktop case, the VR space is displayed on a CRT monitor. The user interacts with the system
with a keyboard and a mouse (Figure 2-3.) The user can wear shutter glasses for a better VR experience.
When the system starts, it displays images aligned in front of the user like a gallery (Figure 2-1.) As the
user moves, the images rotate to face the user. These images are randomly chosen by the system. When
the user touches one of the images with the wand, the image is highlighted and its filename is displayed
below it. By moving the wand (or mouse), the user can move the image to any position. The user can select
an image as relevant (i.e., a query example) by pressing a wand/mouse button. More than one image
can be selected. The selected images are displayed with red frames. In order to deselect an image, the
user presses the button again. S/he can also specify an image as a negative example. The negative
examples are displayed with blue frames. Moreover, the users can fly-through in the space with joystick.
To prevent the user from getting lost in the space, a virtual compass is provided on the floor. Three
arrows of the compass are always facing X-axis, Y-axis, and Z-axis respectively (Figure 2-5.)
When the user presses the QUERY button on the left wall, the system retrieves and displays the most
similar images from the image database (Figure 2-4.) The locations of the images are determined by
their similarity to the query images. The X-axis, Y-axis, and Z-axis represent the color, texture, and
structure of the images, respectively. The more similar an image is, the closer to the origin of the space
it is located.
If the user finds another relevant (or irrelevant) image in the result set, s/he selects it with the wand as
an additional relevant (or irrelevant) example and presses the QUERY button again. By repeatedly
picking up new images, the query is improved incrementally and additional relevant images of the
user’s interest are clustered near the origin (Figure 2-5.)
Figure 2-4 shows the result after a user selects one “red flower” image as a positive example. Because
only one example is specified, the system assumes every feature is equally important. As a result, various
types of images are displayed. From this result, the user can give another feedback by selecting more
“red flower” images. In this example, the total number of images is 50.
Figure 2-5 shows the resulting visualization after the user selected two “red flower” images as query
examples. More pictures of flowers are clustered around the origin. Here, the green arrow marks the
“Color (X)” axis, the blue arrow the “Texture (Y)” axis, and the red arrow the “Edge structure (Z)” axis.
The “red flower” pictures have color features very similar to the query examples but different texture
and structure. Therefore, they are displayed on the Y-Z plane. Meanwhile, the “white flower” has
different color features but a shape similar to the examples. Thus, it is displayed on the X-Y plane.
For researchers of image retrieval systems, seeing how the query vector is formed and how images are
clustered in the feature space is useful for evaluating their algorithms. For this purpose, we
have implemented Sphere Mode in our system (Figure 2-6.) In this mode, all the images are represented
by spheres. Therefore, it is easier for the user to examine the clusters in the VR space at a glance. The
positive examples are displayed as red spheres, and the negative ones are displayed as blue spheres. By
flying through the space in this mode, the researcher can examine how images are clustered from
different viewing angles. For example, by looking down at the floor from a higher position, the user can
see how images are clustered with respect to color and structure (see Figure 2-7.)
Figure 2-1. Initial configuration of 3DMARS.
Figure 2-4. The result after the user selected one “red flower” picture (in fixed axes mode.) The query example is displayed near the origin.
Figure 2-5. The result after the user selected another “flower” image. Red flowers of different texture are aligned along the red arrow.
Figure 2-6. The Sphere Mode. The number of images is 100.
Figure 2-7. The sphere mode from a different viewing angle (from the zenith of the space.) The relationship between color and structure is visualized.
2.6 System Overview
3D MARS is implemented as a client-server system. The system consists of a Query Server and a
Visualization Engine (client), as shown in Figure 2-8. They communicate via the Hyper-Text
Transfer Protocol (HTTP) [29]. More than one client can connect to one server simultaneously. Image
features are extracted in advance and stored in the meta-data database.
2.7 Query Server
The Query Server is implemented as an extension of MARS (Multimedia Analysis and Retrieval
System) [86][119] to 3D visualization. The server maintains the image files and their meta-data (i.e.,
visual features; see Appendix A.) When the server receives a request from a client, it computes the
distance between the user-selected images and the images in the database. Then, the server sends back
the IDs of the k most similar images and their locations in 3D.
Figure 2-8. The system architecture of 3D MARS. The Query Server (Sun Enterprise) serves an immersive VR client (CAVE, on an SGI Onyx) and a desktop VR client (SGI O2) over HTTP; the server side comprises the meta-data database, the image file database, and the feature extractor.
2.7.1 Total Ranking vs. Feature Ranking
In the original MARS system [86], the ranking of similar images is based on a weighted combination of
all three features. The weight of each feature is computed from the query examples. In the early stages
of the user interaction loop, however, the user may specify only one example. In this case, the query
server cannot tell which feature is important, so the system assumes every feature is equally important.
As a result, an image is considered relevant only when every feature is close to the query. This can cause
the search to fall into a local minimum.
To remedy this problem, we use two ranking strategies: Feature Ranking and Total Ranking. The Feature
Ranking is a ranking with respect to only one group of features. First, for each feature group
$i \in \{\text{Color}, \text{Texture}, \text{Structure}\}$, the system computes a query vector $q_i$ based on the positive
examples specified by the user. Next, it computes the feature distance $d_{ni}$ of each image $n$ in the database
as follows,

$$d_{ni} = \sum_{k} w_{ik} \left( x_{nik} - q_{ik} \right) \qquad (2\text{-}1)$$

where $x_{nik}$ is the $k$-th component of the $i$-th feature of image $n$ and $q_{ik}$ is the $k$-th component of $q_i$. The weight
$w_{ik}$ is the inverse of the standard deviation of $x_{nik}$ ($n = 1 \ldots N$),

$$w_{ik} = \frac{1}{\sigma_{ik}} \qquad (2\text{-}2)$$

Then, the feature ranking is computed by comparing the $d_{ni}$ ($n = 1, 2 \ldots N$). In addition, the value $d_{ni}$
is also used to determine the location along the corresponding axis in the fixed axes mode described later.

After the Feature Ranking is computed, the system combines the feature distances $d_{ni}$ into the total
distance $D_n$. The total distance of image $n$ is the weighted sum of each $d_{ni}$,

$$D_n = u^T d_n \qquad (2\text{-}3)$$

where $d_n = [d_{n1}, \ldots, d_{nI}]$ and $u = [u_1, \ldots, u_I]$. $I$ is the total number of feature groups; in our case, $I$ is 3.
The optimal solution for $u$ is derived by Rui et al. [87] as follows,

$$u_i = \sum_{j=1}^{I} \sqrt{\frac{f_j}{f_i}} \qquad (2\text{-}4)$$

where $f_i = \sum_{n=1}^{N} d_{ni}$ and $N$ is the number of positive examples. This gives a higher weight to a
feature whose total distance is smaller; that is, if the query examples are similar with respect to a
feature, that feature gets a higher weight. The complete discussion of the original MARS system is
found in [86] and [87] as well as Chapter 4.
Finally, the Total Ranking is computed based on the total distances. The server sends back to the client
the IDs of the top $K_{feature}$ images in the feature ranking and the top $K_{total}$ images in the total ranking.
With both the Feature Ranking and the Total Ranking, the system can return images even when only one
of their features is close to the query. Without the feature ranking, such images would be located at
distant positions in the 3D space. The feature ranking is especially important in the early stages of the
query process, when the user does not yet have enough query examples; these images would be ignored
in traditional CBIR systems.
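The two-stage ranking above can be sketched as follows. This is an illustrative Python rendition of Eqs. 2-1 through 2-4, not the actual server code: the query vector is assumed to be the component-wise mean of the positive examples, absolute differences are used so the distances stay non-negative, and all function names are our own.

```python
def feature_distances(X, positives):
    """Distance of every image to the query vector for ONE feature group (Eq. 2-1).

    X:         list of feature vectors, one per image in the database
    positives: indices of the user's positive examples
    """
    dim, n = len(X[0]), len(X)
    # Query vector: component-wise mean of the positive examples (an assumption)
    q = [sum(X[p][k] for p in positives) / len(positives) for k in range(dim)]
    # Weight w_ik = 1 / sigma_ik, the inverse standard deviation over the database (Eq. 2-2)
    w = []
    for k in range(dim):
        mean_k = sum(x[k] for x in X) / n
        var_k = sum((x[k] - mean_k) ** 2 for x in X) / n
        w.append(1.0 / (var_k ** 0.5) if var_k > 0 else 1.0)  # guard against zero variance
    return [sum(w[k] * abs(x[k] - q[k]) for k in range(dim)) for x in X]

def total_distances(per_group, positives):
    """Combine per-group distances d_ni into D_n = u^T d_n (Eqs. 2-3 and 2-4)."""
    I = len(per_group)
    # f_i: summed distance of the positive examples within group i
    f = [sum(d[p] for p in positives) for d in per_group]
    # Optimal weights from Rui et al.: u_i = sum_j sqrt(f_j / f_i)
    u = [sum((f[j] / f[i]) ** 0.5 for j in range(I)) for i in range(I)]
    n = len(per_group[0])
    return [sum(u[i] * per_group[i][m] for i in range(I)) for m in range(n)]
```

Note how a group in which the positive examples are tightly clustered (small $f_i$) receives a large weight $u_i$, exactly the behavior described above.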
2.7.2 Implementation
The server is implemented as a Java Servlet with the Apache Web Server and is written in C++ and Java.
The server can simultaneously communicate with different types of clients, such as a Java applet client
[69]. It runs on a Sun Enterprise Server. Currently, 17,000 images and their feature vectors are stored.
2.8 Visualization Engine
The Visualization Engine takes a request from the user, sends the request to the server, and receives the
result from the server. Then it visualizes the resulting images in the VR space. In the immersive display,
the images are displayed on the four walls of the CAVE, a projection-based Virtual Reality system.
When the user pushes the QUERY button, it sends the IDs of the selected (positive or negative) images
to the server. The requests are sent as a “GET” command of HTTP. When the reply is returned, the
client receives a list of IDs of the k most similar images and their locations. Next, it downloads all the
corresponding image files (such as JPEG files) from the image database. Finally, these images are
displayed in the virtual space. The system can display an arbitrary number of images in the VR space
as long as resources (texture memory) are available. In our environment, 50 to 200 images are
displayed. In Sphere mode, more data can be displayed simultaneously because the image textures
do not have to be stored in memory.
This component is written in C++ with OpenGL and the CAVE library. The immersive version of the
visualization engine runs on a twelve-processor Silicon Graphics Onyx 2. Each wall of the CAVE
is drawn by a dedicated processor. Loaded image data are stored in shared memory and accessed by
these processors. The desktop VR version runs on an SGI O2.
2.8.1 Projection Strategies
In order to project the high dimensional feature space into 3D space, we take two different approaches:
Static Axes and Dynamic Axes.
2.8.1.1 Static Axes
In the static axes approach, the meanings of X, Y, and Z-axis are fixed to some extent. In our
implementation, the X, Y and Z always mean the distance with respect to Color, Texture and Structure,
respectively. The location of each image is determined by the weighted sum of the corresponding
feature distance computed in the Query Server as described in Eq. 2-1. Therefore, for each axis, the
system automatically chooses an appropriate combination of features from the corresponding feature
group.
Because the meanings of the axes do not change for each interaction, the user can use the axes to obtain
a context for the image search. This makes navigation in the VR space easier. The problem with the
static axes approach is that some axes (i.e., groups of features) may give the user no useful information.
For example, if none of the texture features are significant, the Y-axis has no meaning.
2.8.1.2 Dynamic Axes
In the dynamic axes approach, the meanings of the axes change with every interaction. The locations of
the images are determined by projecting the full 34-dimensional feature vector into a three-dimensional
space. Many techniques have been proposed for this purpose. Because our goal is to provide fully
interactive visualization, a computationally expensive method such as MDS [52] is not suitable. Instead,
we use the faster FastMap [27] method developed by Faloutsos et al. FastMap takes a distance matrix of
points and recursively maps the points into lower-dimensional hyperplanes. FastMap requires only
$O(Nk)$ computation, where N is the number of images and k is the desired dimension.
First, we feed the raw feature vectors of the retrieved images (including the query vector) into FastMap.
Here, there is no distinction among the color, texture, and structure feature groups; they are combined
into one 34-dimensional vector. After FastMap projects the image features into 3D, we translate the entire
VR space so that the location of the query vector matches the origin of the space. This guarantees that
the distance between an image and the origin always represents the degree of similarity to the query
example. The advantage of this approach is that the system requires only the feature vectors of the
images to discriminate among them. The disadvantage is that, because the meanings of the directions
change constantly, the user may be confused.
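The FastMap projection can be sketched as follows. This is a simplified Python rendition of Faloutsos and Lin's algorithm with a heuristic pivot search, not the engine's C++ implementation; all names are illustrative.

```python
import math

def fastmap(objects, dist, k):
    """Project objects into k dimensions with FastMap.

    objects: list of items to embed
    dist:    original distance function dist(a, b)
    Returns one k-dimensional coordinate list per object.
    """
    coords = [[0.0] * k for _ in objects]

    def current_dist(i, j, axis):
        # Residual distance after removing the components already mapped
        d2 = dist(objects[i], objects[j]) ** 2
        for ax in range(axis):
            d2 -= (coords[i][ax] - coords[j][ax]) ** 2
        return math.sqrt(max(d2, 0.0))

    def farthest_pair(axis):
        # Heuristic pivot choice: start anywhere, hop to the farthest object twice
        a = 0
        for _ in range(2):
            b = max(range(len(objects)), key=lambda j: current_dist(a, j, axis))
            a, b = b, a
        return a, b

    for axis in range(k):
        a, b = farthest_pair(axis)
        d_ab = current_dist(a, b, axis)
        if d_ab == 0.0:
            continue  # all remaining residual distances are zero
        for i in range(len(objects)):
            d_ai = current_dist(a, i, axis)
            d_bi = current_dist(b, i, axis)
            # Cosine-law projection onto the line through the two pivots
            coords[i][axis] = (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2.0 * d_ab)
    return coords
```

Each axis costs one pass over the N objects (plus the pivot search), which is where the $O(Nk)$ behavior comes from.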
2.9 Conclusion
In this chapter, we proposed a new interactive visualization system for Content-Based Image Retrieval
named 3D MARS. Compared with traditional CBIR systems, more images can be displayed
simultaneously in 3D space. By giving a different meaning to each axis, the user can browse the retrieved
images with respect to three different criteria at a glance. In addition, using the feature ranking, the
system can display images that would be ignored in traditional CBIR systems. Furthermore, unlike
other 3D image visualization systems, where the mapping to 3D space is fixed, 3D MARS can interactively
optimize its visualization strategies in response to the user’s feedback.
With the Sphere display mode, 3D MARS provides CBIR researchers with a powerful analysis tool.
By flying through the space, the user can analyze image clusters from different viewpoints.
2.10 Possible Improvements
In this section, we discuss possible improvements to the user interaction and display strategies of
the system.
2.10.1 Integration of Browsing and Searching
One limitation of our system is that the user has to find an initial query example from a random
selection, repeating the random query until an interesting image is found. Chen et al.
[20] proposed a technique to automatically generate a tree structure over an image database. By
following this hierarchy, the user can browse images effectively.
Pecenovic et al. [120] integrated image browsing and query-by-example. In their system, images are
organized into a hierarchical structure by recursively clustering them with the k-means algorithm. At
every level, each node is represented by the image closest to the centroid of its cluster. The user can
switch from browsing to query-by-example at any time.
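The recursive clustering scheme can be sketched as follows. This is a minimal Python illustration, not Pecenovic et al.'s implementation; the stopping rules, parameters, and names are our own choices.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on feature vectors; returns the cluster assignment."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def build_hierarchy(points, ids, k=2, min_size=2):
    """Recursively cluster images; every node is represented by the image
    closest to the centroid of its cluster."""
    centroid = [sum(col) / len(points) for col in zip(*points)]
    rep = min(range(len(points)),
              key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], centroid)))
    node = {"representative": ids[rep], "children": []}
    if len(points) <= min_size:
        node["members"] = list(ids)          # small cluster: stop splitting
        return node
    assign = kmeans(points, k)
    if len(set(assign)) < 2:                 # clustering failed to split: stop
        node["members"] = list(ids)
        return node
    for c in range(k):
        sub = [i for i in range(len(points)) if assign[i] == c]
        if sub:
            node["children"].append(
                build_hierarchy([points[i] for i in sub], [ids[i] for i in sub], k, min_size))
    return node
```

Browsing then amounts to showing each node's representative image and descending into the child whose representative the user selects.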
If some images in the database have text annotations, this information can be used as the starting point
of a query. We plan to integrate several browsing strategies into 3D MARS.
2.10.2 Migrating to 6-Sided CAVE
The current system was developed on a 4-sided CAVE, which has projectors on the front, right, and left
walls and the floor. Because the system does not have projectors above and behind the users, our display
algorithm was limited: the query results had to be displayed only in front of the user. We are going to
implement the system on a 6-sided CAVE (CUBE). Because a 6-sided CAVE provides the user with a full
field of view, there are no limitations on the visualization strategies. On the 6-sided CAVE, we are going
to investigate more effective and intuitive user interfaces. For example, we can rank retrieved images in
six different ways.
2.10.3 Improvement on User Input Methods
In the current system, the user specifies his interest by marking sample images as relevant or irrelevant
one by one. We will investigate other forms of query specification. For example, Santini et al. [89]
proposed topological user input. In their system, the user specifies the relevance of images by moving
them in the display space. If the user believes two images are similar, he moves them close to each
other. If the user believes an image is not relevant, s/he moves it to a distant location from the
relevant images. In their system, however, the orientation of the display space is ignored. We are
interested in how the orientation of the display space improves the usability of the system.
2.10.4 Improvement on Navigation
We plan to improve the feedback from the system to the user in several ways. First, the user of 3D
MARS can move in the virtual space freely. In order to prevent the user from being lost, we provided a
compass on the floor whose arrows always point in the x, y, and z directions. We plan to investigate
other forms of feedback for navigation. One possibility is the use of sonification [15][95]. By changing
the pitch of ambient sounds, the system can signal state changes in the virtual space. In addition,
localized sound can give the user another clue to orientation and location. The user can hear these
sound feedbacks even while concentrating on other tasks.
Another possibility is the use of force feedback. Gravity to the origin of the virtual space gives the user
a sense of location while traveling. Moreover, attractive or repulsive forces among the image
objects may provide another type of feedback.
2.10.5 Multi-Cluster Display
In the prototype, the system computes one query vector from the positive and negative examples.
Therefore, the user can select only one set of similar examples; querying two different types of
images simultaneously is not allowed, and the system shows only one cluster for each query. For
some users, however, querying more than one type of image might be desirable. The important question
is how to display the relationship between two different image classes in the display space. To this end,
modification of the classification algorithm may be required.
Chapter 3
Group-Oriented User Interface for Digital Image Retrieval and Management
3.1 User Interface Support for Content-based Image Retrieval
In Content-based Image Retrieval (CBIR), experimental (i.e., trial-and-error) query is essential for
successful retrieval. Unfortunately, the traditional user interfaces are not suitable for trying different
combinations of query examples. This is because first, these systems assume query examples are added
incrementally. Second, the query specification and result display are done on the same workspace.
Once the user removes an image from the query examples, the image may disappear from the user
interface. In addition, it is difficult to combine the result of different queries.
In this chapter, we propose a new user interface for Content-based Image Retrieval named
ImageGrouper. ImageGrouper is a Group-Oriented User Interface in that each operation is done by
creating groups of images. The users can interactively compare different combinations of query
examples by dragging and grouping images on the workspace (Query-by-Group.) Because the query
results are displayed on another pane, the user can quickly review the results. Combining different
queries is also easy.
Furthermore, the concept of “Image Groups” is also applied for annotating and organizing many
images. Annotation-by-Groups method relieves the user of tedious task of annotating textual
information on many images. This method realizes hierarchical annotation of images and bulk
annotation. Organize-by-Group method lets the users manipulate the image groups as “photo albums”
to organize a large number of images.
3.2 Related Work
3.2.1 The Traditional Approaches: Incremental Query
Little research has been done on user interface support for Content-based Image Retrieval (CBIR)
systems [89][120]. Figure 3-1 shows a typical GUI for a CBIR system that supports Query-by-Example
with Relevance Feedback. Here, a number of images are aligned in a grid. In the beginning, the system
displays randomly selected images. Effective ways to align the images are studied in [82]. In some cases,
the initial images are found by browsing or keyword-based search.
Under each image, a slide bar is attached so that the user can tell the system which images are relevant.
If the user thinks an image is relevant, s/he moves the slider to the right; if s/he thinks an image is not
relevant and should be avoided, s/he moves the slider to the left. The amount of slider movement
represents the degree of relevance (or irrelevance.) In some systems, the user selects example images by
clicking check boxes or by clicking the images themselves. In these cases, degrees are not specified.
When the “Query” button is pressed, the system computes the similarity between selected images and
the database images, then retrieves the N most similar images. The grid images are replaced with the
retrieved images, which are ordered by their degree of similarity.
If the user finds additional relevant images in the result set, s/he selects them as new query examples.
If a highly irrelevant image appears in the result set, the user can select it as a negative example. Then,
the user presses the “Query” button again. The user can repeat this process until s/he is satisfied. In some
systems, the users are allowed to directly weight the importance of image features such as color and
texture.
In [96], Smeulders et al. classified Query by Image Example and Query by Group Example into two
different categories. From a user interface viewpoint, however, the two are very similar; the only
difference is whether the user is allowed to select multiple images. In this chapter, we classify
Figure 3-2. Example of “more is not necessarily better.” The left is the case of one example; the right is the case of two examples.
Figure 3-1. Typical Graphical User Interfaces for CBIR systems: (a) slider-based GUI, (b) click-based GUI. In both systems, the search results are displayed on the same workspace.
both approaches as the Query-by-Examples method. Instead, we use the term “Query by Groups” to
refer to our new model of query specification described later.
3.2.2 Limitation of Incremental Query
The traditional Query-by-Example approach has several drawbacks. First of all, these systems assume
that “the more query examples are available, the better the result.” Therefore, the users are expected to
search for images incrementally by adding new example images from the results of the previous query.
This assumption, however, is not always true: additional query examples may contain undesired
features and degrade the retrieval performance.
Figure 3-2 shows an example of a situation where more query examples lead to worse results. In this example, the user is trying to retrieve pictures of cars. The left column shows the query result when only one image of a “car” is used as a query example. The right column shows the result with two query examples. The results are ordered by similarity rank. In both cases, the same relevance feedback algorithm (Section 3.9.1.2 and [86]) was used and tested on a Corel image set of 17,000 images. In this example, even though the additional example image looks visually good to the human eye, it introduces undesirable features into the query. Thus, no car image appears in the top 8 images; the first car image appears at rank 13.
This example is not a special case. It happens often in image retrieval and confuses the users. The problem arises because of the Semantic Gap [89][96] between the high-level concept in the user’s mind and the extracted features of the images. Furthermore, finding good combinations of query examples is very difficult because image features are numerical values that are impossible for humans to estimate. The only way to find the right combination is trial and error. Otherwise, the user can be trapped in a small part of the image database [120].
Unfortunately, the traditional user interfaces were designed for incremental search and are not suitable
for the Trial-and-Error query. This is because in these systems, query specification and result display
must be done on the same workspace. Once the user removes an image from the query examples during
relevance feedback loops, the image may disappear from the user interface. Thus, it is awkward to bring
it back later for another query.
Second, the traditional interface does not allow the user to put aside the query results for later use.
This type of interaction is desired because the users are not necessarily looking for only one type of
images. The users’ interests may change during retrieval. This behavior is known as berry picking [7] and has been observed in text document retrieval by O’Day and Jeffries [73].
Moreover, because of the Semantic Gap [89] mentioned above, the users often need to make more than one query to satisfy their needs [7]. For instance, a user may be looking for images of “beautiful
flowers.” The database may contain many different “flower” images. These images might be completely
different in terms of low-level visual features. Thus, the user needs to retrieve “beautiful flowers” as a
collection of different types of images.
Finally, in some cases, it is better for the user to start from a general concept of objects and narrow down to specific ones. For example, suppose the user is looking for images of “red cars.” Because image retrieval systems use various image features [97][120] including colors [105], even cars of different colors may have many features in common with “red cars.” In this case, it is better to start by collecting images of “cars of any color.” Once enough car images are collected, the user can specify “red cars” as positive examples, and the other cars as negative examples. Current interfaces for CBIR systems, however, do not support these types of query behavior.
3.2.3 El Niño System
Another interesting approach to Query-by-Examples has been proposed by Santini et al. [89]. In their El Niño system, the user specifies a query by the mutual distances between example images. The user drags images on the workspace so that the more similar images (in the user’s mind) are located closer to each other. The system then reorganizes the images’ locations to reflect the user’s intent. There are two drawbacks in the El Niño system. First, it is unknown to the users how close similar images should be placed and how far negative examples should be kept from good examples. It may take a while for the user to learn “the metric system” used in this interface.
The second problem is that, like traditional interfaces, query specification and result display are done on the same workspace. Thus, the user’s previous decision (in the form of the mutual distances between the images) is overridden by the system when it displays the results. This makes trial-and-error query difficult. Given the analogue nature of this interface, trial-and-error support might be essential. Even if the user gets an unsatisfactory result, there is no way to redo the query with a slightly different configuration. No experimental results are provided in their paper.
3.3 Query-by-Groups with ImageGrouper
We developed a new user interface for CBIR systems named ImageGrouper. The design goal of ImageGrouper is to improve the flexibility and usability of image retrieval. In this system, we introduce a new concept for relevance feedback called Query-by-Groups. The Query-by-Groups model is an extension of the Query-by-Example model described above. The major difference is that while Query-by-Example handles images individually, in Query-by-Groups a “group of images” is the basic unit of the query.
Figure 3-3 shows the display layout of ImageGrouper. The main menu at the top of the user interface is used for detailed control of the system. Under the menu, there are several buttons and text fields. The “Query” button initiates the search. The “Random” button retrieves images randomly from the database. The “Keyword” text field is used for keyword-based image retrieval. The “Selection Clear” button removes all group bounding boxes from the workspace. The “Remove” button deletes selected objects. The “Information” button displays the user manual in another window.
The main workspace under the button bar is divided into two panes. The left pane is the ResultView, which displays the results of content-based retrieval, keyword-based retrieval, or random retrieval. The images
Figure 3-3. The overview of the ImageGrouper user interface.
are tiled in a grid. This is very similar to the traditional user interfaces (Figure 3-1), except that there are no sliders or buttons under the images. The right pane is the GroupPalette, where the user creates and manages image groups by drawing bounding boxes, as described in the following sections. Unlike in the ResultView, the user can move the images to arbitrary positions within the palette.
3.3.1 The Basic Query Sequences
Figure 3-4 shows the sequence of a typical image retrieval with Query-by-Groups. In order to search for images, the user first has to create at least one image group. To this end, the user drags one or more images from the ResultView into the GroupPalette (Figure 3-4 (1)). Then s/he encloses the images by drawing a rectangle (box), just as one draws a rectangle in drawing applications (Figure 3-4 (2)). All the images within the group box become members of this group. Any number of groups can be created in the palette. The user can move images from one group to another at any moment. In addition, groups can overlap each other, i.e., each image can belong to multiple groups. To remove an image from a group, the user simply drags it out of the box.
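As a sketch of this grouping model: membership can be decided purely by whether an image’s position falls inside a group’s bounding box, which is what makes overlapping groups (and thus multiple membership) fall out naturally. The class names and coordinates below are hypothetical, not ImageGrouper’s actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Group:
    # Bounding box drawn on the GroupPalette (hypothetical coordinates).
    x: float
    y: float
    w: float
    h: float
    prop: str = "neutral"  # "positive", "negative", or "neutral"

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h

def groups_of(image_pos, groups):
    # An image belongs to every group whose box encloses it, so
    # overlapping boxes naturally yield multiple memberships.
    return [g for g in groups if g.contains(*image_pos)]

a = Group(0, 0, 100, 100, "positive")
b = Group(80, 0, 100, 100, "negative")
overlap = groups_of((90, 50), [a, b])   # an image dropped in the overlap
```

Dragging an image out of a box is then just a position update; no explicit membership bookkeeping is needed.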
When the right mouse button is pressed on a group box, a popup menu appears so that the user can give query properties (positive, negative, or neutral) to the group (Figure 3-4 (2)). The properties of the groups can be changed at any moment. The colors of the corresponding boxes change accordingly. To retrieve images based on these groups, the user presses the “Query” button at the top of the window (Figure 3-3). The system then retrieves new images that are similar to the images in the positive groups while avoiding images similar to the negative groups. The result images are displayed in the ResultView (Figure 3-4 (3)). If the user finds new relevant images in the result, s/he can refine the search by dragging these images into the palette, then pressing “Query” again (Figure 3-4 (4)). S/he can repeat this until s/he is satisfied, or until no additional relevant images can be found in the ResultView.
When a group is specified as Neutral (displayed as a white box), the group does not contribute to the search at the moment. It can be turned into a positive or negative group later for another retrieval. If a group is Positive (displayed as a blue box), the system uses the common features among the images in the group. On the other hand, if a group is given the Negative (red box) property, the common features in the group are used as negative feedback. The user can specify multiple groups as positive or negative. In this case, these groups are merged into one group, i.e., the union of the groups is taken. The details of the algorithm are described in Section 3.9.1.2.
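The merging step described above (union of all positive groups, union of all negative groups, neutral groups ignored) can be sketched as follows. The query combination shown is a generic Rocchio-style update standing in for the actual algorithm of Section 3.9.1.2, and the weights `alpha` and `beta` are illustrative only.

```python
import numpy as np

def merge_groups(groups):
    # Union the image ids of all positive groups and of all negative
    # groups; neutral groups do not contribute to the search.
    pos = set().union(*([g["images"] for g in groups if g["prop"] == "positive"] or [set()]))
    neg = set().union(*([g["images"] for g in groups if g["prop"] == "negative"] or [set()]))
    return pos, neg

def query_vector(features, pos, neg, alpha=1.0, beta=0.5):
    # Rocchio-style combination: move toward the positive examples
    # and away from the negative ones (illustrative weights).
    q = alpha * features[sorted(pos)].mean(axis=0)
    if neg:
        q -= beta * features[sorted(neg)].mean(axis=0)
    return q

groups = [{"prop": "positive", "images": {1, 2}},
          {"prop": "positive", "images": {2, 3}},
          {"prop": "negative", "images": {7}},
          {"prop": "neutral",  "images": {9}}]
pos, neg = merge_groups(groups)   # two positive groups merged by union
```

Because the merge is a plain set union, an image that appears in two positive groups is counted only once.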
While the user created only one group in Figure 3-4, the user can create multiple groups on the workspace. Figure 3-3 shows an example with three groups. As in Figure 3-4, the user is retrieving images of “flowers.” In the GroupPalette, three flower images are grouped as a positive group. To the right of this group, a red box represents a negative group that consists of only one image. Below the “flowers” group, there is a neutral group (white box), which is not used for retrieval at this moment. Images can be moved out of any group in order to temporarily remove them from the groups.
The gestural operations of ImageGrouper are similar to the file operations of window-based operating systems. Furthermore, because the user’s mission is to collect images, the operation of “dragging images into a box” naturally matches the user’s cognitive state.
(1) Drag a flower image to the palette.
(2) Draw a rectangle around the image, then choose “Relevant” from the popup menu.
(3) The result after the first query.
(4) More relevant images are being dragged into the group.
Figure 3-4. The sequence of the basic query operation on ImageGrouper.
3.4 The Flexibility of Query-by-Groups
ImageGrouper provides greater flexibility in image retrieval with Query-by-Groups. In this section, we describe how the Query-by-Groups method improves relevance feedback for content-based image retrieval.
3.4.1 Trial and Error Query by Mouse Dragging
In ImageGrouper, images can be easily moved between groups by mouse drags. In addition, the neutral groups and the space outside any group in the palette can be used as a storage area [49] for images that are not used at the moment. They can be reused later for another query. This makes trial and error in relevance feedback easy. The user can quickly explore different combinations of query examples by dragging images into or out of the boxes. Moreover, the query specification that the user made is preserved and visible in the palette. Thus, it is easy to modify a previous decision when the query result is not satisfactory. In Section 3.5, we evaluate the effects of trial-and-error query on image retrieval performance.
3.4.2 Groups in a Group
ImageGrouper allows the users to create a new group within a group (Groups in a Group). With this method, the user begins by collecting relatively generic images first, then narrows down to more specific images.
Figure 3-6 shows an example of Groups in a Group. Here, the user is looking for “red cars.” When s/he does not have enough examples, however, the best way to start is to retrieve images of “cars of any color.” This is because these images may have many common features with red car
images, though their color features are different. The large white box is a group for “cars of any color.”
Once the user has collected enough car images, s/he can narrow the search down to red cars only. To do so, the user divides the collected images into two sub-groups by creating two new boxes, one for red cars and one for other cars. Then the user specifies the red car group as positive and the other car group as negative. In Figure 3-6, the smaller blue (i.e., positive) box on the left is the group of red cars, and the red (i.e., negative) box on the right is the group of non-red cars. This narrowing-down search was not possible with conventional CBIR systems.
3.5 Experiment on Trial and Error Query
In order to examine the effect of ImageGrouper’s trial-and-error query, we compared the query performance of our system with that of a traditional incremental approach (Figure 3-1). In these experiments, we used a Corel photo stock that contains 17,000 images as the data set. For both interfaces, the same image features and relevance feedback algorithm (described in Section 3.9.1.2) are used.
For the traditional interface, the top 30 images are displayed and examined by the user in each relevance feedback round. For ImageGrouper, the top 20 images are displayed in the ResultView. Only one positive group and one neutral group are created for this test. In both interfaces, no negative feedback is given. Feedback loops are repeated up to 10 rounds or until convergence.
We tested eight classes of images (Table 3-1). For each class, the query starts from one image example. In the case of the traditional interface, the query is repeated by giving additional examples from the result of the previous query. When no more good examples appear, the search stops (convergence). Meanwhile, for ImageGrouper, the search is refined incrementally at first. When the incremental search converges,
trial-and-error search is applied by moving some images out of the positive group into a neutral group (this means that the number of positive examples is temporarily decreased). Then, the search is refined incrementally again until another convergence occurs. The user repeats this until trial-and-error query has no effect.
Figure 3-5 shows the number of correctly retrieved images after convergence (or after the 10th round). This value is proportional to the recall rate; thus, a larger value means better retrieval performance. Table 3-1 shows the number of relevance feedback loops until convergence. A value of 10 means the query did not converge before the 10th round.
Clearly, ImageGrouper achieves better retrieval (i.e., higher recall) even though the underlying technologies (relevance feedback algorithm and visual features) are identical. In addition, the search with ImageGrouper is less likely to converge prematurely, even when the search with the traditional interface converges to a small number of images after a few iterations. This result suggests the importance of support for trial-and-error query.
Meanwhile, Query-by-Groups can introduce a new problem. In a trial-and-error query, the users need to decide which images should be included in the query set. Often, the user expects the search to improve by adding or removing an image, but the effect of such a change is not always easy to predict. Although ImageGrouper enables fast trial and error, the user may have to try many different combinations until the search improves. Unfortunately, it is difficult to automate this process because the judgement of the query results is subjective and depends on the user. However, it is still desirable that the system suggest some promising query combinations. For example, the system can find an outlier in the current set of query examples.
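One plausible heuristic for such a suggestion, offered here purely as an illustration (it is not a feature of ImageGrouper), is to flag the example farthest from the centroid of the current query examples as a candidate to drag into a neutral group:

```python
import numpy as np

def suggest_outlier(features, example_ids):
    # Flag the example farthest (in feature space) from the centroid
    # of the current query examples as a removal candidate.
    X = features[example_ids]
    centroid = X.mean(axis=0)
    dists = np.linalg.norm(X - centroid, axis=1)
    return example_ids[int(np.argmax(dists))]

# Three tightly clustered examples and one far-away one (toy features).
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
candidate = suggest_outlier(feats, [0, 1, 2, 3])
```

The final judgement would still rest with the user, who can move the suggested image to a neutral group and rerun the query.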
Table 3-1. The number of feedback loops until convergence. A value of 10 means that the query did not converge by the 10th round.
Object Traditional Grouper
Red Car 4 10
Tiger 3 10
Bird 4 8
Yellow Flower 3 10
Citrus 2 4
Polar Bear 5 10
Elephant 3 10
Swimmer 2 10
Figure 3-5. The number of hits until the 10th round or convergence (y-axis: hits, 0 to 30; x-axis: image class; bars compare Traditional and ImageGrouper).
3.6 Text Annotations on Images
Keyword-based search is a very powerful method for searching images. The problem is that it works well only when all the images are annotated with textual information. For commercial photo stocks, it may be feasible to add keywords to all images manually. For home users, however, it is too tedious.
When keyword search is integrated with CBIR, as in our system and [120], keyword-based search can be used to find the initial query examples for content-based search. In this scheme, the user does not have to annotate all images. It is nevertheless important to provide easy and quick ways to annotate many images.
3.6.1 Current Approaches for Text Annotation
The most primitive way to annotate is to select an image, then type in keywords. Because this interaction requires the user to switch between mouse and keyboard repeatedly, it is too frustrating for a large image database.
Several researchers have proposed smarter user interfaces for keyword annotation of images. In the Bulk Annotation method of FotoFile [53], the user selects multiple images on the display, selects several
attribute/value pairs from a menu, and then presses the “Annotate” button. Therefore, the user can add
the same set of keywords on many images at the same time. To retrieve images, the user selects entries
from the menu, and then presses the “Search” button. Because of visual and gestural symmetry [53], the
user needs to learn only one tool for both annotation and retrieval.
PhotoFinder [94] introduced a drag-and-drop method, where the user selects a label from a scrolling list,
then drags it directly onto an image. Because the labels remain visible at the designated location on the
images and these locations are stored in the database, these labels can be used as “captions” as well as
for keyword-based search. For example, the user can annotate the name of a person directly on his/her
portrait in the image, so that other users can associate the person with his/her name. When the user
needs new words to annotate, s/he adds them to the scrolling list. Because the user drags keywords onto individual images, bulk annotation is not supported in this system.
3.6.2 Annotation by Groups
Most home users do not want to annotate images one by one, especially when the number of images is large. In many cases, the same set of keywords is sufficient for several images. For example, a user may just want to annotate “My Roman Holiday, 1997” on all images taken in Rome. Annotating the same keywords repeatedly is painful enough to discourage him/her from using the system.
ImageGrouper introduces the Annotation-by-Groups method, where keywords are annotated not on individual images, but on groups. As in Query-by-Groups, the user first creates a group of images by dragging images from the ResultView into the GroupPalette and drawing a rectangle around them. In order to give keywords to the group, the user opens the Group Information Window by selecting “About This Group” from the pop-up menu (Figure 3-3). In this window, an arbitrary number of words can be added. Because the users can simultaneously annotate the same keywords on a number of images, annotation becomes much faster and less error prone. Although Annotation-by-Groups is similar to the bulk annotation of FotoFile, it has several advantages, described below.
3.6.2.1 Annotating New Images with the Same Keywords
In bulk annotation [53], once the user has finished annotating keywords to some images, there is no fast way to give the same annotation to another image later. The user has to repeat the same steps (i.e., select images, select keywords from the list, then press “Annotate”). This is awkward when the user has to add a large number of keywords. Meanwhile, in Annotation-by-Groups, the system attaches
annotations not to each image, but to groups. Therefore, by dragging new images into an existing group, the same keywords are automatically given to them. The user does not have to type the same words again.
3.6.2.2 Hierarchical Annotation with Groups in a Group
In ImageGrouper, the user can annotate images hierarchically using Groups in a Group method
described above (Figure 3-6.) For example, the user may want to add new keyword “Trevi Fountain”
to only a part of the image group that has been labeled “My Roman Holiday, 97.” This is easily done
by creating a new sub-group within the group and annotating only on the sub-group.
In order to annotate hierarchically on FotoFile with bulk annotation, the user has to select a subset of
images that are already annotated, and then annotate them again with more keywords. On the other
hand, ImageGrouper allows the user to visually construct a hierarchy on the GroupPalette first, then
edit keywords on the Group Information Window. This method is more intuitive and less error prone.
3.6.2.3 Overlap between Images
An image often contains multiple objects or people. In such cases, the image can be referenced in more than one context. ImageGrouper supports this multiple referencing by allowing overlaps between image groups, i.e., an image can belong to multiple groups at the same time. For example, in Figure 3-7, there are two image groups: “Cloud” and “Mountains.” Because some images contain both clouds and mountains, these images belong to both groups (in the overlapped region). Once the two groups are annotated with “Cloud” and “Mountain” respectively, the images in the overlapped region are automatically referenced as “Cloud and Mountain.” This concept is not supported in other systems.
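This behavior follows directly if an image’s keywords are defined as the union of the keywords of every group that contains it, as in the following sketch (the dictionary-based representation is hypothetical, not ImageGrouper’s internal data model):

```python
def keywords_of(image_id, groups):
    # An image inherits the keywords of every group it belongs to, so
    # images in an overlapped region inherit from both groups at once.
    kw = set()
    for g in groups:
        if image_id in g["images"]:
            kw |= g["keywords"]
    return kw

groups = [{"keywords": {"Cloud"},    "images": {1, 2, 3}},
          {"keywords": {"Mountain"}, "images": {3, 4}}]
both = keywords_of(3, groups)   # image 3 sits in the overlap
```

No per-image bookkeeping is required: re-annotating a group automatically updates every image it contains, including those shared with other groups.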
3.7 Organizing Images by Groups
In the previous two sections, we described how ImageGrouper supports content-based query and keyword annotation. These features are closely related and complementary to each other. In order to annotate images, the user can first collect visually similar images using content-based retrieval with Query-by-Groups. Then s/he can annotate the group of collected images with textual information. After this point, the user can quickly retrieve the same images using keyword-based search.
Conversely, the results of keyword-based search can be used as a starting point for content-based search.
This method is useful especially when the image database is only partially annotated or when the user
is searching images based on visual appearance only.
3.7.1 Photo Albums and Group Icons
As described above, ImageGrouper allows groups to overlap. In addition, the user can attach textual information to these groups. Therefore, the groups in ImageGrouper can be used to organize pictures as “photo albums” [53]. Similar concepts are proposed in FotoFile [53] and Ricoh’s Storytelling system [6]. In both systems, albums are used for “slide shows” to tell stories to other users.
In ImageGrouper, the user can convert a group into a Group Icon (Figure 3-3). When the user selects “Iconify” from the popup menu (Figure 3-3), the images in the group disappear and a new icon for the group appears in the GroupPalette. When the group overlaps with another group, the images in the overlapped region remain on the display.
Furthermore, the users can manipulate these group icons just as they handle individual images. They can drag the group icons anywhere in the palette. The icons can even be moved into another group box, realizing Groups in a Group.
Finally, group icons themselves can be used as examples for content-based query. A group icon can be used as an independent query example or combined with other images and groups. In order to use a group icon as a normal query group, the user right-clicks the icon and opens a popup menu. Then, s/he can select “relevant,” “irrelevant,” or “neutral.” On the other hand, in order to combine a group icon with other example images, the user simply draws a new rectangle and drags them into it.
The Organize-by-Groups method described here is partially inspired by the Digital Library Integrated Task Environment (DLITE) [24]. In DLITE, each text document and each search result is visually represented by an icon. The user can directly manipulate those documents in a workcenter (direct manipulation). In [49], Jones proposed another graphical tool for query specification, named VQuery. In VQuery, the user specifies the query by creating Venn diagrams. The number of matched documents is displayed in the center of each circle.
While DLITE and VQuery are designed for text document retrieval systems [3], the idea of direct manipulation [24] applies more naturally to image databases. In a text document database, it is difficult to determine the contents of a document from its icon. Therefore, the user has to open another window to examine the details [24] (in the case of DLITE, a web browser is opened). In image databases, on the other hand, the images themselves (or their thumbnails) are the objects that the user operates on. Therefore, instant judgment by the user is possible on a single workspace [120][96].
Figure 3-6. Groups in a Group. In ImageGrouper, the user can create a hierarchy of image groups. In this example, the entire group is “cars.” Once the whole group is annotated as “car,” the user only needs to type “red” to annotate the red cars.
Figure 3-7. Overlap between groups. Two images in the overlapped region contain both mountain and cloud. Once the two groups are annotated, the overlapped region is automatically annotated with both keywords.
3.8 Usability Study
Even if ImageGrouper provides powerful search ability, the users cannot take advantage of the system unless it is easy to use. In order to compare the usability of ImageGrouper with that of traditional graphical user interfaces (GUIs), we conducted usability tests. ImageGrouper was compared with two traditional GUIs: a simple Click-based Interface (Figure 3-1 (b)) and a Slider-based Interface (Figure 3-1 (a)). An example of the Click-based Interface is QBIC [31], and an example of the Slider-based Interface is the original MARS [86]. Each GUI employs a different relevance feedback method and has different expressiveness. For example, the Click-based GUI is very simple and easy to use, but allows only positive feedback. The Slider-based GUI allows the user to specify the degree of relevance (from -1.0 to 1.0). Both the Click-based and Slider-based GUIs lose the previous query information when the search results are returned. Meanwhile, ImageGrouper requires drag-and-drop operations, but it realizes a more flexible, trial-and-error query. Therefore, it is very difficult to compare the usability of these systems in actual image retrieval scenarios. Instead, we compared the task completion time and the error rate of image selection tasks in a simplified scenario, as described below.
3.8.1 Experimental Settings
3.8.1.1 Subjects
Ten people volunteered to participate in this experiment. All subjects were familiar with commonly used widgets that require mouse operations, such as sliders and check boxes. Their ages ranged from the 20s to the 30s. One was female and nine were male. Most subjects had no experience with any content-based image retrieval system.
3.8.1.2 Apparatus
Both training and tests were conducted on a PC with dual Intel Xeon 2 GHz processors and 1 GB of main memory running Windows XP. The PC was located in a small quiet room. The video card was an nVIDIA Quadro2 with 32 MB of video RAM. For the display, a 17-inch LCD monitor set at 1280 by 960 in 32-bit color was used. All user operations were done with a Microsoft Optical USB Mouse. Each system was implemented in Java 2 (version 1.3.x) with the Swing API. ImageGrouper requires slightly more graphics computation than the other two because the images have to be dragged over the workspace. The above specs, however, are more than enough to run each GUI smoothly, eliminating differences in computational overhead. Note that ImageGrouper does not require such a high-spec machine; it runs smoothly on average PCs or workstations.
3.8.2 Scenarios
3.8.2.1 Experimental Task
The task in this experiment was to find and select relevant images with each GUI. In each run, when a subject presses the “Start” button, a sample image and its description (in text) are displayed in a separate window located to the left of the main GUI. At the same time, 16 different images are tiled on the main GUI. The subjects are then asked to find and select all images that are “semantically similar” to the sample image. The correct images do not have to be visually similar to the sample image. For example, if a sample image is a picture of a standing penguin and the description is “Penguin,” the subject has to choose all pictures of penguins even if the pictures look different from the sample image (for example, a picture of two penguins lying on ice).
Each user interface requires different operations to select the images. In the case of the Click-based system, “select” means simply clicking the pictures on the GUI. In the case of the Slider-based
Interface, the user selects an image by moving its slider to the right. For ImageGrouper, images are selected by dragging them from the ResultView to the GroupPalette. In every case, the subjects can examine all images without scrolling. When the subject believes every relevant image has been selected, s/he presses the “Next” button. The system then displays the next problem (a new sample image and another set of 16 images on the GUI). For both training and experiments, the task was repeated 10 times with different sample images such as “elephant,” “flowers,” “airplane,” “cars,” and “fireworks.” The number of correct images differed from one problem to another. The size of each image was fixed at 120 by 80 pixels (or 80 by 120).
Each subject was tested with all three GUIs one by one. The order of the GUIs was randomly chosen
for each subject. For each interface, the subjects were first trained with the sample problems, then they
were tested with the same interface immediately after the training.
3.8.2.2 Training Session
Before the training and testing of each GUI began, each subject individually received a brief instruction on the concept of the GUI (how to select and deselect images) and the procedure of the experiment. Then, s/he was asked to perform the image selection operations on the GUI with 10 training problems under the trainer’s supervision. Because the purpose of this training was to make him/her familiar with the GUI, the subject was allowed to ask the trainer for assistance during the session.
3.8.2.3 Experiment Session
After the subject became familiar enough with the user interface, the subject began the same procedure with new problem sets. The number of images to be selected was not known to the subject and was different for each problem. Unlike in the training phase, the subject was encouraged to perform the operations as fast and as accurately as possible. No questions were allowed during the sessions.
The completion time and the number of errors (the number of images missed and the number of images incorrectly selected) were recorded. The subjects were interviewed after the experiment.
3.8.3 The Results
3.8.3.1 Error Rate
To see the effects of the user interfaces on the accuracy of the tasks, we compared the number of missed images and the number of incorrectly selected images during the image selection tasks. Table 3-2 shows the totals over 10 problems. The values are averaged over the ten subjects. With each interface, the subjects rarely selected wrong images (at most once in ten problems). The differences among the user interfaces were not significant (F(2, 27) = 0.144, p = .254).
Figure 3-8. The Usability Test Setting (Click-based GUI)
Meanwhile, the subjects failed to select correct images up to four times in ten problems (the average was less than 1.5 misses per 10 runs with every interface). In each problem, most subjects missed at most one image. The differences in the total number of missed images were not statistically significant among the three interfaces (F(2, 27) = 0.0859, p = .918).
During the experiments, we informed the subjects that the tasks were timed and encouraged them to
finish the tasks as fast as possible. Therefore, most subjects favored speed over accuracy. The
subjects nevertheless achieved fairly accurate image selection with all of these user interfaces.
3.8.3.2 Task Completion Time
The task completion time was measured as the duration from the time the subject pressed the start
button to the moment s/he pressed the finish button after completing the ten problems. The
number of images to be selected varied from 2 to 6, depending on the problem. Figure 3-10 shows the
average time required to select each single image. The completion times differed significantly among
the user interfaces (F(2, 27) = 21.05, p < .001). The Slider-based GUI was the slowest. Its average
selection time was significantly slower than that of ImageGrouper (F(1, 18) = 17.32, p < .001). The
Click-based user interface achieved the shortest task completion time and was significantly faster than
ImageGrouper (F(1, 18) = 4.455, p < .05). This result is not surprising since ImageGrouper
Table 3-2. The number of missed images and incorrectly selected images in ten problems. The values
are averaged over ten subjects. There are no statistically significant differences in accuracy.

             Missed   Incorrectly Selected   Total
  Click       1.1            0.2              1.3
  Slider      1.3            0.3              1.6
  Grouper     1.1            0.5              1.6
requires dragging operations in addition to mouse clicks on the images. The difference between the
Click-based GUI and Grouper was smaller than the difference between Grouper and the Slider-based GUI.
The number of images to be selected differed among the problems, ranging from two to six.
Figure 3-9 shows how the time required to select one image varies with the total number of
images to be selected. Before the experiments, we expected that the time per image in ImageGrouper
would decrease significantly as the total number increases. This is because on ImageGrouper, as the
user drags images to the workspace, the number of images on the grid (ResultView) decreases, making
it easier to examine the remaining images. Interestingly, similar effects also occurred with the
traditional user interfaces.
In the interviews, several subjects said they felt frustrated with the Slider-based GUI. In order
to move the mouse pointer to the small slider handles and drag them, the users had to focus their
attention on the handles instead of on the tasks. Therefore, the Slider-based GUI is not suitable for
repetitive image retrieval tasks with relevance feedback.

Figure 3-9. The average time required to select each image in the different problems. The X-axis is
the total number of images selected in the problem; the Y-axis is the selection time per image (msec),
plotted for the Click, Grouper, and Slider interfaces.
3.8.4 Improving ImageGrouper Based on the Lessons We Learned
As shown in the previous section, the Click-based GUI is the simplest and achieves the shortest task
completion time for the simplified tasks. However, the Click-based interface is limited to positive-only
relevance feedback, where the user can select only relevant images and cannot select irrelevant images
as negative feedback. In addition, as discussed in Section 3.2.2, the Click-based GUI is not suitable
for trial-and-error relevance feedback since it uses the same workspace both for query creation
and for result display. In order to incorporate the advantages of the Click-based GUI into
ImageGrouper, we modified ImageGrouper so that it allows the user to select and move images by
double-clicking. When the user double-clicks an image on the ResultView, the image is moved to the
selected group on the GroupPalette. If no group is selected, the image is moved to the positive group
that was created first. If there is no group on the GroupPalette, the system creates a new group and
places the image in the new group. This is especially effective when many images have to be dragged
to the GroupPalette. Of course, the user can also move the images by drag and drop.

Figure 3-10. The average selection time per image (the values shown in the chart are 1903.2, 2203.0,
and 2859.0 msec). The differences are statistically significant at p < .001. The Click-based user
interface achieves the shortest task completion time.
3.9 Implementation Details
A prototype of ImageGrouper is implemented as a client-server system, which consists of User Interface
Clients and a Query Server. As in 3D MARS (Chapter 2), they communicate via the Hyper-Text
Transfer Protocol (HTTP) [29].
3.9.1 The Client-Server Architecture
3.9.1.1 The User Interface Client
The user interface client of ImageGrouper is implemented as a Java2 Applet with the Swing API (Figure
3-3). Thus, the users can use the system through Web browsers on various platforms such as Windows,
Linux, Unix, and Mac OS X.
The client interacts with the user and determines his/her interests from the group information or
keyword input. When the "Query" button is pressed, it sends this information to the server. Then, it
receives the result from the server and displays it on the ResultView. Because the client is implemented
in a multi-threaded manner, it remains responsive while it is downloading images. Thus, the user can
drag a new image into the palette as soon as it appears in the ResultView.
Note that the user interface of ImageGrouper is independent of the relevance feedback algorithms and
the extracted image features (described below). Thus, as long as the communication protocols are
compatible, the user interface clients can access any image database server, with various algorithms
and image features. Although the retrieval performance depends on the underlying algorithms and
image features used, the usability of ImageGrouper is not affected by those factors.
3.9.1.2 The Query Server
The Query Server stores all the image files and their low-level visual features. These visual features are
extracted and indexed in advance. When the server receives a request from a client, it computes the
weights of the features and compares the user-selected images with the images in the database. Then, the
server sends back the IDs of the k most similar images.
The server is implemented as a Java Servlet that runs on the Apache Web Server and the Jakarta Tomcat
Servlet container. It is written in Java and C++. In addition, the server is implemented as a stateless
server, i.e., the server does not hold any information about the clients. This design allows different
types of clients, such as the traditional user interface [68] (Figure 3-1) and the 3D Virtual Reality
interface (Chapter 2), to access the same server simultaneously.
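The stateless design can be illustrated with a short sketch. The following Python handler is purely hypothetical (the actual server is a Java Servlet, and the request fields shown here are invented for illustration); the point is that every request carries the complete query, so the server keeps no per-client session state between requests:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the feature-weighting and ranking step.
def rank_images(positive_ids, negative_ids, k):
    # ...compute feature weights from the examples and rank the database...
    return list(range(k))  # placeholder result IDs

class QueryHandler(BaseHTTPRequestHandler):
    """Stateless: each request carries the complete query (all positive and
    negative example IDs), so no session data is stored on the server."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        # Hypothetical request shape: {"positive": [...], "negative": [...], "k": 20}
        query = json.loads(body)
        result = rank_images(query["positive"], query["negative"], query.get("k", 20))
        payload = json.dumps({"ids": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

Because no state survives a request, any mix of clients (Applet, 3D interface, standalone application) can hit the same server concurrently without coordination.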
For home users who wish to organize and retrieve images locally on their PCs' hard disks, ImageGrouper
can be configured as a standalone application, in which the user interface and the query server reside
on the same machine and communicate directly without a Web server. Currently, the system runs on
Solaris (with the Sun Performance Library), Mac OS X, and Microsoft Windows with Cygwin [79].
CLAPACK [2] is required to compile the system.
3.9.2 Relevance Feedback Algorithm in the Query Engine
The similarity ranking is computed as follows. First, the system computes the similarity of each image
with respect to only one of the features. For each feature i (i \in {color, texture, structure}), the
system computes a query vector q_i based on the positive and negative examples specified by the user.
Then, it calculates the feature distance g_{ni} between each image n and the query vector,

    g_{ni} = (p_{ni} - q_i)^T W_i (p_{ni} - q_i)    (3-1)

where p_{ni} is the feature vector of image n regarding feature i. For the computation of the distance
matrix W_i, we used Biased Discriminant Analysis (BDA). The details of BDA are described in [119].
After the feature distances are computed, the system combines the feature distances into the total
distance d_n. The total distance of image n is a weighted sum of the g_{ni},

    d_n = u^T g_n    (3-2)

where g_n = [g_{n1}, ..., g_{nI}] and u = [u_1, ..., u_I]. I is the total number of features; in our
case, I = 3. The optimal solution for the feature weighting vector u was derived by Rui et al. [87]
as follows,

    u_i = \sum_{j=1}^{I} \sqrt{f_j / f_i}    (3-3)

where f_i = \sum_{n=1}^{N} g_{ni}, and N is the number of positive examples. This gives a higher weight
to a feature whose total distance is small: if the positive examples are similar with respect to a
certain feature, that feature gets a higher weight. Finally, the images in the database are ranked by
the total distance, and the system returns the k most similar images. The details of the algorithms are
further discussed and extended in Chapter 4.
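The ranking described by Eqs. 3-1 through 3-3 can be sketched in a few lines of numpy. This is a simplified sketch rather than the production engine: each per-feature matrix W_i is taken to be the identity (plain Euclidean distance) instead of the BDA-derived matrix, and the function name is ours:

```python
import numpy as np

def rank_by_weighted_features(features, query_vectors, k=5):
    """Sketch of Eqs. 3-1 to 3-3, with every W_i simplified to the identity.
    features:      dict of feature name -> (n_images, d_i) array
    query_vectors: dict of feature name -> (d_i,) query vector"""
    names = list(features)
    # Eq. 3-1 (simplified): per-feature squared distance of every image to q_i
    g = np.stack([np.sum((features[i] - query_vectors[i]) ** 2, axis=1)
                  for i in names], axis=1)             # shape (n_images, I)
    f = g.sum(axis=0)                                  # f_i = sum_n g_ni
    # Eq. 3-3: u_i = sum_j sqrt(f_j / f_i)
    u = np.array([np.sqrt(f / f[i]).sum() for i in range(len(names))])
    d = g @ u                                          # Eq. 3-2: total distance
    return np.argsort(d)[:k]                           # k most similar images
```

Features on which the positive examples agree produce a small f_i, hence a large u_i, exactly the "similar feature gets higher weight" behavior described above.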
3.10 Conclusion and Future Work
In this chapter, we presented ImageGrouper, a new group-oriented user interface for digital image
retrieval and organization. In this system, the users search, annotate, and organize digital images by
groups. ImageGrouper has several advantages regarding image retrieval, text annotation, and image
organization.
First, in content-based image retrieval (CBIR), predicting a good combination of query examples is
very difficult. Thus, trial-and-error is essential for successful retrieval. However, the previous
systems assume an incremental search and do not support trial-and-error search. In contrast, the
Query-by-Groups concept in ImageGrouper allows the user to try different combinations of query
examples quickly and easily. We showed that this lightweight operation helps the users achieve higher
recall rates. Second, with the Groups in a Group configuration, the user can search for images by
narrowing down the scope of the search from a generic concept to more specific concepts.
Next, typing text information for a large number of images is very tedious and time consuming. The
Annotate-by-Groups method relieves the users of this task by allowing them to annotate multiple images
at the same time. The Groups in a Group method realizes hierarchical annotation, which was difficult in
the previous systems. Moreover, by allowing the groups to overlap each other, ImageGrouper further
reduces typing.
In addition, our concept of image groups also applies to organizing image collections. A group in
the GroupPalette can be shrunk into a small icon. These group icons can be used as "Photo Albums,"
which can be directly manipulated and organized by the users.
Finally, these three concepts, Query-by-Groups, Annotate-by-Groups, and Organize-by-Groups, share
similar gestural operations, i.e., dragging images and drawing a rectangle around them. Thus, once the
user has learned one task, s/he can easily adapt to the other tasks. The operations in
ImageGrouper are also similar to the file operations used on Windows and Macintosh computers, as well
as in most drawing programs. Therefore, the user can easily learn to use our system.
There are several possibilities for improving the system. First, although our system automatically
determines the feature weights, advanced users might be confident that they know which features
are important for their query. Thus, we plan to allow the users to specify which features are weighted
more heavily for each group. Some image groups might be important in terms of the color features only,
while others might be important in terms of structure.
Second, when the number of photo albums becomes large, the workspace may become cluttered. We
plan to add "book shelves" for the photo albums so that the user can archive unused photo albums
and save space. The users should also be able to search the photo albums by keyword or by content-based
search.
Finally, the implementation of ImageGrouper does not depend on the underlying image retrieval
technologies (features and algorithms). In addition, since the user interface and the query engines
communicate via HTTP over the Internet, they can be located in different places. This makes the
ImageGrouper system an ideal platform for benchmarking [59] various image retrieval systems.
Chapter 4
Relevance Feedback Algorithms for Group Query
4.1 Introduction
In the previous chapter, we described how the design of our new group-oriented user interface addresses
problems in the traditional image retrieval systems. By supporting fast trial-and-error search, the new
user interface helps the user improve the search results even if the underlying algorithms are unchanged.
Although ImageGrouper can improve an existing image retrieval system without changing its relevance
feedback algorithms, new algorithms are desirable so that we can take further advantage of the new user
interface. Thus, our new user interface motivated the development of new search algorithms for
relevance feedback. ImageGrouper allows the user to create multiple image groups visually in the
workspace. For example, the user can specify positive query examples as two distinct groups, each of
which has different properties (see Figure 4-3). However, the existing relevance feedback algorithms
cannot take advantage of this information. We propose a new feature-weighting algorithm that handles
the user's request as a problem with multiple positive and multiple negative classes. This is an
interesting example where user interface design and algorithm development co-evolve (Figure 4-1).
In the next section, the basic concepts of ImageGrouper are briefly introduced. In the following two
sections, we describe how our system improves and extends content-based image retrieval. In Section
4.6, we present experimental results of our new algorithms.
4.2 Related Work
In interactive image retrieval, the system needs to recalculate the distance measure based on the user's
feedback (Relevance Feedback [46][86][87][119][121]). This problem reduces to the online calculation
of the feature space transformation matrix W. Once W is obtained, the images x in the database are
ordered by their distance from the query vector q as follows,

    dist(q, x) = (q - x)^T W (q - x)    (4-1)

When the matrix W is diagonal, this is equivalent to

    dist(q, x) = \sum_{i=1}^{k} w_{ii} (x_i - q_i)^2    (4-2)

where k is the dimensionality of the features, w_{ii} is the (i, i)-th element of W, and x_i and q_i
are the i-th elements of x and q, respectively.
Figure 4-1. Co-evolution of GUI design and search algorithm development. A new GUI design motivates
development of a new search algorithm. Meanwhile, an existing search algorithm can be reinforced by a
new GUI.
In most methods discussed below, the query vector q is defined as the weighted centroid of the N
positive examples specified by the user:

    q = \sum_{n=1}^{N} \pi_n x_n / \sum_{n=1}^{N} \pi_n    (4-3)

where x_n is the n-th positive example, and \pi_n is its degree of relevance. Porkaew et al. [78][79]
proposed another way to define the query points. In their Query Expansion (QEX) method, multiple
query points are created by clustering the relevant examples. The distance from each image in the
database to the multipoint query is a weighted combination of the distances from the image to the
centroid of each query cluster. The weight for each query cluster is proportional to the number of
relevant examples in the cluster.
In the following subsections, we briefly explain the methods previously proposed in the literature.
These approaches can be classified into four types by the way they address image retrieval:
• One-Class Relevance Feedback [46][86][87]
• Two-Class (positive and negative) Relevance Feedback [26][33]
• (1+x)-Class (one positive and x negative) Relevance Feedback [119]
• Multi-Class Relevance Feedback [76]
4.2.1 Image Retrieval as a One-Class Problem
When no negative image examples are considered, the simplest way of feature weighting is to calculate
the variance of each feature among the query examples. First, the variance of each feature
(\sigma_1, \sigma_2, \sigma_3, ..., \sigma_k) among the positive examples (relevant images) is
calculated. Then, the inverse of the variance becomes the weight of each feature [86]. Thus, the
feature weighting matrix W becomes a diagonal matrix:

    w_{ii} = 1 / \sigma_i    (4-4)

where w_{ii} is the (i, i)-th element of W.
This gives higher weights to the features in which the example images are similar. When the number of
example images is larger than the dimensionality of the features, the optimal solution for W was
derived by Rui et al. [87] as follows:

    W = (det C)^{1/k} C^{-1}    (4-5)

where k is the dimensionality of the features, and C is the k x k weighted covariance matrix [46] of
the N positive examples, defined as

    C_{ij} = \sum_{n=1}^{N} \pi_n (x_{ni} - q_i)(x_{nj} - q_j) / \sum_{n=1}^{N} \pi_n    (4-6)

C_{ij} is the (i, j)-th element of C, x_{ni} is the i-th element of the n-th image in the database,
and \pi_n is the degree of relevance for the n-th image specified by the user (Chapter 3). W takes a
full matrix form when the number of relevant images is larger than the dimensionality of the features.
In [23], Chen et al. applied variants of the Support Vector Machine [12][16][111], called (linear)
One-class SVM (LOC-SVM) and Kernel One-class SVM (KOC-SVM) [90], to the one-class image retrieval
problem. The authors reported that in their experiments with 500 images, KOC-SVM outperformed
LOC-SVM and the linear-transform-based approaches discussed above.
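Both linear weighting schemes of this subsection can be sketched together in numpy. This is an illustrative sketch only; the function name and the uniform default relevance degrees are our assumptions, and the diagonal branch assumes every feature has nonzero variance:

```python
import numpy as np

def one_class_weights(X, pi=None):
    """Sketch of the one-class weighting schemes of Section 4.2.1.
    X:  (N, k) array of positive examples
    pi: optional (N,) relevance degrees; defaults to uniform"""
    N, k = X.shape
    pi = np.ones(N) if pi is None else np.asarray(pi, dtype=float)
    q = (pi[:, None] * X).sum(axis=0) / pi.sum()     # weighted centroid (Eq. 4-3)
    if N <= k:
        # Eq. 4-4: diagonal W, inverse per-feature variance (assumed nonzero)
        return q, np.diag(1.0 / X.var(axis=0))
    # Eqs. 4-5 and 4-6: full-matrix solution W = det(C)^(1/k) C^-1
    Xc = X - q
    C = (pi[:, None, None] * (Xc[:, :, None] * Xc[:, None, :])).sum(axis=0) / pi.sum()
    return q, np.linalg.det(C) ** (1.0 / k) * np.linalg.inv(C)
```

The det(C)^(1/k) factor normalizes the scale of W, so only the shape of the covariance, not its overall magnitude, influences the ranking.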
4.2.2 Image Retrieval as a Two-Class Problem
Some systems allow the user to specify negative (irrelevant) examples explicitly. In this case, image
retrieval can be considered a two-class classification problem. One classic solution to this problem is
Fisher's Discriminant Analysis (FDA) [26][33]. FDA tries to maximize the distance between the two
classes (the between-class scatter) while minimizing the scatter within each class (the within-class
scatter). The within-scatter matrix is defined as follows,

    S_W = S_1 + S_2    (4-7)

where S_1 and S_2 are the scatter matrices of the two classes (C_1 and C_2):

    S_i = \sum_{x \in C_i} (x - m_i)(x - m_i)^T    (4-8)

and m_i is the mean vector of class C_i. Next, the between-scatter matrix is defined as follows,

    S_B = (m_1 - m_2)(m_1 - m_2)^T    (4-9)

Finally, the criterion function of FDA is

    W = argmax_W (W^T S_B W) / (W^T S_W W)    (4-10)
Several researchers have applied Support Vector Machines [12][16][111] to two-class image retrieval
[48][110][118]. In [110], Tong and Chang proposed an Active Learning method. During the relevance
feedback loops of this method, the system displays "the most informative" images instead of "the most
relevant" images. The users are then asked to select only the relevant images from the displayed set.
The images to display are chosen so that the possible solution space (called the version space) is
halved after each feedback iteration.
Meanwhile, Wu et al. [117] applied the Discriminant Expectation-Maximization (D-EM) algorithm to the
image retrieval problem. The D-EM approach, however, requires many annotated data and is
computationally expensive.
These systems address image retrieval as a two-class (positive and negative) problem (Figure 4-2
(a)). These approaches, however, introduce undesirable side effects because they mix all negative
examples into one class. In an actual image retrieval scenario, the negative examples can come from
many classes of images in the database.
4.2.3 Image Retrieval as a (1+x)-Class Problem
Zhou et al. [119][121] proposed a new relevance feedback algorithm which takes negative examples
into account effectively. They consider the relevance feedback problem as a (1+x)-class (i.e., one
positive and multiple negative classes) problem (Figure 4-2 (b)). In this scheme, it is assumed that
the negative examples come from an uncertain number of classes, while the positive examples can be
clustered into one class. Their algorithm, named Biased Discriminant Analysis (BDA), is characterized
by the following objective function,

    W = argmax_W (W^T S_bias W) / (W^T S_W W)    (4-11)

where,

    S_W = \sum_{x \in C} (x - m)(x - m)^T    (4-12)

    S_bias = \sum_{y \in D} (y - m)(y - m)^T    (4-13)

C is the set of positive examples, D is the set of negative examples, and m is the mean vector of the
positive examples. In short, BDA tries to minimize the within-class scatter matrix S_W of the positive
examples, while keeping the negative examples
away from the positive examples. It has been shown that BDA outperforms FDA and Multiple Discriminant
Analysis (MDA) [121].
4.2.4 Multi-Class Relevance Feedback
Peng [76] proposed a new algorithm named the MUlti-class Relevance Feedback algorithm (MURF),
which considers image retrieval as a multi-class problem. MURF assumes that the images can belong
to different classes according to different image features.

Figure 4-2. Graphical concepts of the previous space translation schemes. (a) The two-class scheme
(FDA) tries to divide the data into two classes, a positive and a negative group. (b) The (1+x)-class
scheme (BWT) handles the problem as one positive group and multiple negative groups.

Suppose there are J classes in the database and the dimensionality of the visual feature is N. For
each class j, the system first calculates the estimate \hat{Pr}(j|q) of the class-conditional
probability at the query vector q as follows:
    \hat{Pr}(j|q) = (number of neighbors of the query labeled as class j)
                    / (total number of neighbors of the query)    (4-14)

Next, the system assumes that the images can be clustered differently for each dimension. Thus, it
also calculates the estimate \hat{Pr}(j|x_i = q_i) of the class-conditional probability of the query
vector attributed to the i-th feature (i = 1, ..., k),

    \hat{Pr}(j|x_i = q_i) = (number of neighbors along the i-th feature labeled as class j)
                            / (total number of neighbors along the i-th feature)    (4-15)

The parameters that determine the neighborhood (i.e., the number of neighbors) are defined by the
users. \hat{Pr}(j|q) and \hat{Pr}(j|x_i = q_i) are normalized as follows,

    \tilde{Pr}(j|q) = \hat{Pr}(j|q) + c_1 (\hat{P} - \hat{Pr}(j|q))    (4-16)

    \tilde{Pr}(j|x_i = q_i) = \hat{Pr}(j|x_i = q_i) + c_2 (\hat{P}_i - \hat{Pr}(j|x_i = q_i))    (4-17)

where,

    \hat{P} = (1/J) \sum_{j=1}^{J} \hat{Pr}(j|q)    (4-18)

    \hat{P}_i = (1/J) \sum_{j=1}^{J} \hat{Pr}(j|x_i = q_i)    (4-19)

and c_1 and c_2 are constants. Then, it calculates "the ability of the i-th feature to predict the
Pr(j|q)s at q_i = z_i" [76] for the weighted distance, using a \chi^2-type measure,

    r_i = \sum_{j=1}^{J} [\tilde{Pr}(j|q) - \tilde{Pr}(j|x_i = q_i)]^2 / \tilde{Pr}(j|x_i = q_i)    (4-20)

Here, r_i becomes smaller when the difference between the class-conditional probability of the whole
query vector and that of its i-th feature is small. Therefore, it measures how much the i-th feature
contributes to the posterior probabilities. Finally, the feature weighting matrix becomes a diagonal
matrix,

    w_{ii} = exp(T \bar{r}_i) / \sum_{l=1}^{k} exp(T \bar{r}_l)    (4-21)

where T is a scaling constant and

    \bar{r}_i = max_j {r_j} - r_i    (4-22)

This means a feature with a smaller r_i is given a higher weight. These weights have a similar effect
to Eq. 4-4, in that more discriminating features are given higher weights and each feature is treated
independently of the others.
In theory, MURF can support an arbitrary number of image classes, and the author tested the system on
an eight-class classification task over simulated data. The system was also tested on real image
retrieval problems. Unlike our proposed system (Section 4.3), however, image retrieval is there
considered a three-class problem in which the users label the images as "relevant," "somewhat
relevant," or "irrelevant." Since this can be considered a coarse way to specify the degree of
relevance rather than a true multi-class classification, it can be classified as a variant of the
two-class classification problems discussed in Section 4.2.2.
Peng claims better retrieval performance for MURF [76] over his previous PFRL (Probabilistic
Feature Relevance Learning) system [75]. However, this might be due to the larger number of query
examples: in the experiments, MURF is given both "relevant" and "somewhat relevant" examples, while
PFRL is given only relevant examples [76].
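The final weighting step of MURF (Eqs. 4-21 and 4-22) is essentially a softmax over the negated per-feature scores: small r_i (a good predictor) yields a large weight. A minimal sketch, with the score vector r and the constant T supplied by the caller:

```python
import numpy as np

def murf_feature_weights(r, T=1.0):
    """Sketch of Eqs. 4-21/4-22: convert per-feature prediction scores r_i
    into diagonal weights; a smaller r_i yields a larger weight."""
    r = np.asarray(r, dtype=float)
    rbar = r.max() - r              # Eq. 4-22: best-scoring feature gets rbar = max
    e = np.exp(T * rbar)
    return e / e.sum()              # Eq. 4-21: weights sum to one
```

Subtracting from the maximum also keeps the exponentials bounded, so the computation is numerically stable regardless of the scale of r.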
4.3 Proposed Approach: Extending to a (x+y)-Class Problem
In the initial version of ImageGrouper (Chapter 3), when the user specified more than one group as
positive, these groups were merged into one group for the query. Thus, the Biased Discriminant
Analysis described in Section 4.2.3 was used to calculate the feature weights. Here, the system
supported one positive group and multiple negative groups.
However, this approach does not take full advantage of our group-oriented interface. When the user
searches for images, s/he may have a high-level concept in mind, such as "beautiful flowers." Such a
concept cannot be expressed by only one class of images. See Figure 4-3. Although white flowers and
red flowers may have common visual features, they also have very different features, most notably the
color features. If the system tries to put them together as one class of images, the color features
have to be discarded. This is not desirable because these features may be beneficial for retrieving a
specific color of flowers.
Figure 4-3. White flowers and red flowers. Both groups can be considered subsets of the "flower"
class. In ImageGrouper, users can separate them into two positive groups.
For the users, it is more intuitive if they can specify them as two different, but closely related,
subsets of an image class. This problem has not been addressed before because the traditional user
interfaces do not allow the user to divide the positive image examples into multiple groups. This fact
motivated us to extend the relevance feedback problem to a (x+y)-class (i.e., multiple positive and
multiple negative classes) problem.
In our new scheme, named Group Biased Discriminant Analysis (GBDA), the objective function is:

    W = argmax_W (W^T S_PN W) / (W^T S_W W)    (4-23)

where S_W is the sum of the within-class scatter matrices [26] of the positive groups, defined as
follows,

    S_W = \sum_{i=1}^{c} S_i    (4-24)

    S_i = \sum_{x \in C_i} (x - m_i)(x - m_i)^T    (4-25)

where m_i is the mean vector of the i-th positive class C_i, and c is the number of positive groups.
S_PN is the positive-to-negative scatter, which is introduced in this chapter as follows,

    S_PN = \sum_{i=1}^{c} S_Ni    (4-26)

    S_Ni = \sum_{y \in D} (y - m_i)(y - m_i)^T    (4-27)

where D is the set of the negative examples.
Figure 4-4 illustrates the concept of our scheme. In short, GBDA tries to cluster each positive class
while scattering the negative examples away from the positive classes. Like BDA, each negative
example belongs to its own class.
As in FDA and BDA, W is solved as the generalized eigenvector(s) associated with the largest
eigenvalue(s) \lambda,

    S_PN w_i = \lambda_i S_W w_i    (4-28)

Finally, our Discriminating Transformation Matrix [26] becomes,

    A = \Phi \Lambda^{1/2}    (4-29)

where \Phi is the matrix whose columns are the eigenvectors, and \Lambda is the diagonal matrix of
the corresponding eigenvalues. Once the transformation matrix is available, the distance between two
images x and y is computed as follows,

    dist(x, y) = (x - y)^T A (x - y)    (4-30)

In our current implementation, the system compares the distance between the images in the database
and the mean of each positive group. Then, the database images are ordered according to those
distances (see Section 4.6.3).
Figure 4-4. Concept of the new feature space transform. It minimizes the scatter of each positive
class while maximizing the scatter between the positive and negative samples.

4.4 Implementation
The architecture of the system is similar to the one discussed in the previous chapter. The user
interface is slightly modified so that it can support more flexible queries. The feature weighting
module of the Query
Server is replaced with the new algorithms. When the user does not specify any negative examples, the
server uses Multiple Discriminant Analysis (MDA) [26] instead of GBDA. Currently, the system runs on
Solaris, Mac OS X, and Microsoft Windows. It is implemented in Java and C++ with CLAPACK
[2] and Automatically Tuned Linear Algebra Software (ATLAS) [4]. The Windows version requires
Cygwin [80] to compile. For the Solaris version, the Sun Performance Library [106] is used instead of
ATLAS. In addition, we also implemented this algorithm in MATLAB. The MATLAB version made
modification and evaluation of our algorithms easier; the experiments discussed in the following
section were conducted with this version.
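The GBDA weighting step (Eqs. 4-23 through 4-30) can be sketched in numpy. This is an illustrative sketch rather than the actual Java/C++/MATLAB implementation; the regularization term and the clipping of tiny negative eigenvalues are our own numerical safeguards, added because the positive groups in a real query are small:

```python
import numpy as np

def gbda_transform(pos_groups, negatives, reg=1e-6):
    """Sketch of Group Biased Discriminant Analysis (Eqs. 4-23 to 4-29).
    pos_groups: list of (n_i, d) arrays, one per positive group
    negatives:  (m, d) array of negative examples"""
    d = negatives.shape[1]
    S_w = np.zeros((d, d))    # sum of positive within-group scatters (Eq. 4-24)
    S_pn = np.zeros((d, d))   # positive-to-negative scatter (Eq. 4-26)
    for X in pos_groups:
        m_i = X.mean(axis=0)
        Xc = X - m_i
        S_w += Xc.T @ Xc                   # Eq. 4-25
        Yc = negatives - m_i
        S_pn += Yc.T @ Yc                  # Eq. 4-27
    S_w += reg * np.eye(d)                 # keep S_w invertible for small samples
    # Generalized eigenproblem S_pn w = lambda S_w w  (Eq. 4-28)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_pn))
    order = np.argsort(evals.real)[::-1]
    Phi = evecs.real[:, order]
    Lam = np.diag(np.maximum(evals.real[order], 0.0))
    return Phi @ np.sqrt(Lam)              # A = Phi Lam^(1/2)  (Eq. 4-29)

def gbda_distance(A, x, y):
    """Eq. 4-30: distance under the discriminating transformation."""
    diff = np.asarray(x) - np.asarray(y)
    return float(diff @ A @ diff)
```

Note that with a single positive group this construction reduces to BDA (Section 4.2.3), since S_W and S_PN then coincide with S_W and S_bias of Eqs. 4-12 and 4-13.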
4.5 Analysis on Toy Problems
To illustrate the effects of GBDA, we use three-dimensional toy problems, as shown in Figure 4-5 and
Figure 4-6. The original data are plotted in 3D, where the positive examples are represented by 'o's
and the negative ones by 'x's. The projections of the data into a two-dimensional space are shown
directly below.
In this section, we address two typical cases. The first case is when the two positive clusters are
distant from each other (Figure 4-5). The second case is when the two positive clusters are close to
each other (Figure 4-6). In both cases, the negative examples are scattered across the problem space.
We compared the projection results with Multiple Discriminant Analysis (MDA) [26] and BDA. In MDA,
the negative examples are considered one class. Thus, MDA deals with the problem as a three-class
(two positive and one negative) problem.
In the first case, BDA merges the two positive clusters into one, while MDA and GBDA preserve the
separation of the two clusters. GBDA clusters each positive class more tightly than MDA. For the
second case, GBDA and BDA have similar effects: the two positive clusters are merged into one positive
cluster. This is a desirable effect since these two positive groups are very close to each other and
should be considered one group even though the user specified them as different. Meanwhile, although
MDA keeps these positive groups separated, it attracts the negative examples towards one positive
cluster. This effect is not suitable for image retrieval since it increases the possibility that many
irrelevant images are retrieved.
4.6 Experiments on Real Data
4.6.1 Data Sets
To examine the advantages of GBDA over BDA, the retrieval performances of the two systems were
tested on the COREL image database. The dimensionality of the image feature space (color, texture,
and structure) is 37, as described in the previous chapters. We used a subset of 804 images from the
database. This subset consisted of six classes of images: airplanes, cars, flowers, stained glass,
horses, and eagles. From each class, up to 40 images were given as positive query examples. For GBDA,
the query examples were further divided into two sub-groups by a domain expert. For example, the
images of horses were divided into two subsets: "white horses" and "brown horses." In BDA, they were
considered one group. For simplicity, the sizes of the sub-groups were kept the same (up to 20).
Figure 4-7 shows the images from two representative test sets. The first set (Figure 4-7 (a)) is
images of "horses," divided into the two subsets "white horses" and "brown horses." The second set is
images of "flowers," divided into the two subsets "red flowers" and "yellow flowers."
Figure 4-5. Comparison of MDA, BDA, and GBDA on toy problem 1. The original data (in 3D) and the
resulting projections (in 2D) are plotted (BDA: J = 0.18565; GBDA: J = 0.51331). In this problem,
GBDA performs similarly to MDA.
[Figure 4-6 panels: Original Data (3D), MDA Projections, BDA Projections (J = 0.70149), BWT Projections, GBDA Projections (J = 0.75454).]
Figure 4-6. Comparison of MDA, BDA, and GBDA on toy problem 2. The original data (in 3D) and the resulting projections (in 2D) are plotted. In this toy problem, GBDA performs similarly to BDA.
Figure 4-8 (a) and (b) show the results for the two image classes. For each test, an equal number of images is picked from each sub-group. Therefore, the total sample size varies from 2 to 40 (in increments of two).
4.6.2 Performance Measures
As a performance measure, we introduce the weighted hit count. The weighted hit count (whc) is calculated over the images $x_i$ in the top 20 results as

$whc = \sum_{i=1}^{20} hc(x_i)$   (4-31)

where

$hc(x_i) = \begin{cases} 1 & (x_i \in C) \\ -1/i & (\text{otherwise}) \end{cases}$   (4-32)

and $C$ is the target image class. In this scheme, every relevant image in the result set is given a count of one. Meanwhile, any false hit is given negative points based on its rank in the result set. Therefore, false hits that are closer to the top of the list are penalized more than those at the bottom. This measure reflects the quality of the returned ranking as well as the number of correctly returned images. We conducted each test 10 times and averaged the results.
4.6.3 Ranking Strategies for GBDA
For GBDA, we used two different ranking strategies (GBDA-1 and GBDA-2). For both strategies, two rankings are calculated based on the distances from the two group means. GBDA-1 then returns the top 10 images from each group. GBDA-2, on the other hand, orders the images by the closer of their distances to the two group means and returns the top 20 images.
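The weighted hit count is straightforward to compute. The following is a minimal sketch of Eqs. 4-31 and 4-32; the function and variable names are our own, not taken from the thesis implementation:

```python
def hit_count(rank, is_relevant):
    # Eq. 4-32: +1 for a relevant image, -1/rank for a false hit.
    return 1.0 if is_relevant else -1.0 / rank

def weighted_hit_count(result_classes, target_class):
    # Eq. 4-31: sum hc(x_i) over the top-20 result list.
    return sum(hit_count(i, c == target_class)
               for i, c in enumerate(result_classes[:20], start=1))
```

For example, twenty relevant results score 20, while a false hit at rank 2 contributes -0.5.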
4.6.4 Results
Figure 4-8 (a) and (b) show the results for “horses” and “flowers,” respectively. The horizontal axis is the number of positive examples. The vertical axis is the weighted hit count described above.
In the case of “horses” (Figure 4-8 (a)), when the number of query examples is small (sample size < 8), GBDA-1 shows performance similar to BDA, while GBDA-2 performs worse than BDA.
When the number of samples gets larger (sample size > 8), GBDA-1 performs better than BDA, while GBDA-2 shows performance similar to BDA. This corresponds to the first toy problem in Section 4.5. It also shows that in order for GBDA to take advantage of the group information, it requires at least four samples from each group.
Figure 4-8 (b) shows the results for the “red and yellow flower” images. In this case, there are no significant differences among BDA, GBDA-1, and GBDA-2. This is because the two groups are too close to each other1 for GBDA to create two separated clusters. Note that GBDA does no harm even if it cannot utilize the group information; in this situation, it has a similar effect to BDA. This corresponds to the second toy problem.
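The two ranking strategies of Section 4.6.3 can be sketched as follows. This is a simplified illustration; in particular, how GBDA-1 treats an image that ranks in the top 10 of both groups is our assumption, not specified in the text:

```python
import numpy as np

def rank_gbda1(dists_g1, dists_g2, per_group=10):
    # GBDA-1: take the top images from each positive group separately;
    # duplicates are skipped when filling the second quota (assumption).
    top1 = [int(i) for i in np.argsort(dists_g1)[:per_group]]
    top2 = [int(i) for i in np.argsort(dists_g2) if int(i) not in set(top1)][:per_group]
    return top1 + top2

def rank_gbda2(dists_g1, dists_g2, k=20):
    # GBDA-2: rank every image by its distance to the closer group mean.
    best = np.minimum(dists_g1, dists_g2)
    return [int(i) for i in np.argsort(best)[:k]]
```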
Figure 4-9 (a) and (b) show the same results in terms of precision and recall [5] for different sample sizes. Recall and precision are defined as follows:

$\mathrm{Recall} = \dfrac{\text{the number of relevant images retrieved}}{\text{the number of relevant images in the database}}$   (4-33)

$\mathrm{Precision} = \dfrac{\text{the number of relevant images retrieved}}{\text{the number of retrieved images}}$   (4-34)

1. The closeness of the two groups depends on the diversity of images in the database.
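Given a result list and a ground-truth set of relevant image IDs (illustrative inputs, not the thesis code), Eqs. 4-33 and 4-34 can be computed directly:

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved images that are relevant.
    # Recall: fraction of relevant images that were retrieved.
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)
```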
[Figure 4-8 panels: (a) HORSES, legend values GBDA-1 19.105, BDA 18.615, GBDA-2 18.455; (b) FLOWERS, legend values GBDA-1 9.04, BDA 8.975, GBDA-2 8.725. Horizontal axis: Sample Size / 2; vertical axis: # Hits (weighted average).]
Figure 4-8. Comparison of BDA and GBDA on the real data. The results are shown in the weighted hit count. (a) Horse Examples. (b) Flower Examples.
[Figure 4-9 panels: Average Precision and Average Recall versus Sample Size / 2 for GBDA-1, GBDA-2, and BDA, for the horse and flower examples.]
Figure 4-9. Comparison of BDA and GBDA on the real data (measured in precision and recall for different sample sizes). (a) Horse Examples. (b) Flower Examples.
Note that in information retrieval with relevance feedback, many systems include the query examples
themselves in the query result so that the user can reuse the images in the previous query. Therefore,
the traditional Precision versus Recall curve [5] is not suitable for the evaluation.
4.7 Conclusion
We are investigating query algorithms that can take advantage of the group-oriented user interface. In this chapter, we introduced a new feature space transform algorithm. In this method, the user’s interest can be expressed as an (x+y)-class (multiple positive and multiple negative) problem. This new scheme was inspired by the development of the user interface. In fact, there was no way to apply this scheme with the traditional interfaces because the user could not specify positive examples as multiple classes.
4.8 Possible Improvement
4.8.1 Groups with Different Sizes
In our experiments, we assumed that the sizes of the two positive groups are equal. This is not an appropriate assumption in the real world. Especially with relevance feedback, the user adds new images into the groups. Therefore, it is very likely that the number of images in each group differs. We will investigate the case of different group sizes. It may be necessary to normalize the clusters to prevent the returned result from being biased toward the larger group.
4.8.2 Automated Clustering of Groups
Although making groups in ImageGrouper is very easy, it would be convenient if the system could automatically create multiple positive groups from one positive group. Once the groups are created, this group information can be used by GBDA. While this approach gives the user less control, it can avoid creating superfluous groups.
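As an illustration of what such automated grouping might look like, a simple 2-means clustering over the positive examples’ feature vectors could split one group into two sub-groups for GBDA. This sketch is our own, not the thesis implementation:

```python
import numpy as np

def split_positive_group(examples, iters=20, seed=0):
    # Minimal 2-means: pick two distinct examples as initial centers,
    # then alternate nearest-center assignment and center updates.
    examples = np.asarray(examples, dtype=float)
    rng = np.random.default_rng(seed)
    centers = examples[rng.choice(len(examples), 2, replace=False)].copy()
    for _ in range(iters):
        # assign each example to its nearest center
        labels = np.argmin(((examples[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(2):
            if (labels == c).any():
                centers[c] = examples[labels == c].mean(axis=0)
    return labels
```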
4.8.3 More than Two Groups
In this chapter, we used only two positive groups. Although our implementation supports an arbitrary
number of positive groups, the benefit of more than two groups is not clear. In order to determine the
properties of each group, the system requires several samples for each group. Unless plenty of positive
examples are available, it is difficult to create many groups effectively.
Chapter 5
Integrating ImageGrouper into the 3D Virtual Space
5.1 Introduction
In the previous chapters, we introduced two different types of user interface for content-based image retrieval: 3D MARS and ImageGrouper. Both systems have their own strengths and weaknesses. Table 5-1 compares 3D MARS and ImageGrouper in various aspects. Here, we assume 3D MARS is used in a projection-based virtual reality system (i.e., CAVE; see Figure 2-2 on page 14). 3D MARS can display many images simultaneously in the three-dimensional virtual reality space. Therefore, it is suitable for finding relevant images among many irrelevant images. In addition, those images are located based on three different criteria along the x, y, and z axes. Therefore, it provides a powerful visualization tool for the visual feature space.
However, the electromagnetic tracking devices attached to the wand and shutter glasses (Section 2.5) are not very accurate. Moreover, because the immersive 3D virtual reality is realized by projecting images on four flat screens, the objects cannot be perfectly aligned. As a result, there are discrepancies between the tracking and the 3D projection. Thus, it is very difficult for the user to conduct operations that require precision in CAVE. Interactions are limited to simple operations such
as touching and moving relatively large objects. It is not suitable for manipulating small objects. Hence, the user of 3D MARS creates a query by touching images and clicking the wand button, as discussed in Section 2.5. S/he can only specify each image as relevant, irrelevant, or neutral. Furthermore, since the images are scattered in the space, grouping operations are not straightforward. The user has to fly through the space with a joystick to collect images.
Meanwhile, ImageGrouper uses a mouse as an input device. Therefore, it can support finer and more
accurate control on the workspace. In addition, the user can make a flexible query by making image
groups. On the other hand, the workspace is two-dimensional. Thus, it can display only a limited
number of images at the same time.
We are interested in how we can take advantage of both ImageGrouper and 3D MARS by integrating the two systems. In the following sections, we discuss the design choices of the integration, the system architecture, and the benefits of the integration.
Table 5-1. Comparison of 3D MARS and ImageGrouper

                                3D MARS                        ImageGrouper
Display Dimensionality          3D                             2D
Number of images displayed      Many                           Limited
Control Device                  Wand                           Mouse
Control Accuracy                Inaccurate                     Accurate
Flexibility of Query            Less flexible (only relevant   Flexible (Query-by-Groups)
                                and irrelevant)
83
5.2 Design Choices
For the integration, we had two choices. The first option was to realize group-oriented interaction similar to ImageGrouper in CAVE. The second option was to bring a wireless-equipped portable device running ImageGrouper into CAVE.
Although implementing the first method is easy, the grouping operation in this method is awkward for several reasons. First, the grouping operation requires enclosing objects in the three-dimensional space. In the case of 3D MARS in CAVE, this can be achieved by allowing the user to draw a translucent rectangular solid that encloses images. However, since the images are flat rectangular plates, it is hard for the user to see whether an image is within the rectangular solid. In addition, as discussed above, the images in 3D MARS are scattered in the space. To enclose many images, the user has to draw a large rectangular solid or put relevant images together by flying through the space.
Therefore, we chose the second option, i.e., to bring a portable device running ImageGrouper into CAVE. In the next section, we discuss how the user interacts with the integrated system for image retrieval tasks.
5.3 User Interaction of Grouper in 3D MARS
Figure 5-1 shows a user interacting with ImageGrouper in 3D MARS. The user is holding a wireless-equipped notebook PC, which is running a customized version of ImageGrouper. This version of ImageGrouper is designed to synchronize with 3D MARS, as described in the next section. Ideally, tablet-based PCs or PDAs should be used because the user has to hold the device in CAVE. Pen input is a natural choice for the drag-and-drop operations of the group-oriented user interactions in this situation.
84
In the integrated system, ImageGrouper and 3D MARS are always synchronized. Both systems display
the same set of images. This means 3D MARS displays all images that appear in both the GroupPalette
and the ResultView of the ImageGrouper (Section 3.3.) The images’ locations in the 3D space are
determined by their visual features.
The user interactions are also synchronized. When the user selects an image on ImageGrouper, the same
image in 3D MARS is highlighted and vice versa. If an image clicked in 3D MARS is hidden in the
ResultView of ImageGrouper, it automatically scrolls so that the image is visible. If the user deletes an
image in one user interface, the same image disappears in the other interface. When the user clicks a
group in the GroupPalette, all images in the group are highlighted simultaneously in 3D MARS. Thus,
the user can see how images in the group s/he created are distributed in the feature space. In addition,
the user can change the display options of 3D MARS, such as the sphere mode (Section 2.5) and the projection strategies (Section 2.8.1), from menus of ImageGrouper. The user does not have to touch the buttons in the 3D space.
Figure 5-1. User interacting in 3D MARS with a wireless-equipped notebook PC.
The user can create a query from either system. The query method differs depending on the system on
which the user creates the query. A query in ImageGrouper uses the Query-by-Groups method, where the user can specify an arbitrary number of image groups (Section 3.3). As described above, this Query-by-Groups operation is difficult in 3D MARS. Therefore, a query in 3D MARS uses the standard Query-by-Example method, where the user specifies images as relevant, irrelevant, or neutral. Regardless of the system on which the query is created, the search results are displayed on both systems.
5.4 System Architecture
Figure 5-2 shows the system architecture of the Grouper in 3D MARS system. The portable device in CAVE runs a customized version of ImageGrouper. This version is designed to control and synchronize with 3D MARS. The device is equipped with an IEEE 802.11b wireless card so that it can communicate with 3D MARS and the Query Engine wirelessly. When the user presses the query button on the grouper, the grouper sends a query to the Query Engine. In addition to the IDs and file names of the result images, this version receives the images’ feature distances from the query vector. The grouper not only displays the results in its ResultView, but also sends this information to 3D MARS. Then, 3D MARS loads the image files directly from the Query Engine.
When the user makes a query and presses the button in 3D MARS, the system collects the query information in the 3D space and sends it to ImageGrouper. Then, ImageGrouper sends the query to the Query Engine, receives the result, and forwards it to 3D MARS. Therefore, 3D MARS communicates
with the Query Engine only when it loads image files. This approach makes sure the two systems are always synchronized.
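The mediated query flow can be sketched as follows. The class and method names are illustrative stand-ins, not the actual implementation:

```python
class QueryEngine:
    """Stand-in for the Query Engine: ranks image IDs by distance to a query vector."""
    def __init__(self, index):          # index: {image_id: feature_vector}
        self.index = index
    def search(self, query_vec, k=3):
        dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, query_vec))
        return sorted(self.index, key=lambda i: dist(self.index[i]))[:k]

class Display:
    """Stand-in for a front end (the ResultView or the 3D MARS renderer)."""
    def __init__(self):
        self.shown = None
    def show(self, ids):
        self.shown = list(ids)

def handle_query(query_vec, engine, grouper_view, mars_view):
    # ImageGrouper mediates every query: it queries the engine once and
    # pushes the same result to both front ends, keeping them synchronized.
    result = engine.search(query_vec)
    grouper_view.show(result)
    mars_view.show(result)
    return result
```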
5.5 The Benefits of Integration
The integration of the two systems has several benefits. First, the number of images that the original version of ImageGrouper can display at once is limited by the size of the monitor. To see more images, the user has to scroll the workspace, which is tiresome. Meanwhile, 3D MARS can display many images in the 3D virtual space simultaneously. This helps the user find relevant images quickly.
Next, in the integrated system, even if the user is using the Query-by-Groups method on ImageGrouper, the result is displayed in 3D MARS as well. The result images are located based on their distances to the query vector under three different criteria. When s/he clicks a group on the grouper, all images in the group are highlighted in 3D MARS. This is a very useful feature because the user can see how images in the
group are related in the visual feature space.
[Figure 5-2 components: Wireless PC (ImageGrouper), Query Engine (image data), and CAVE (3D MARS) rendering on an SGI Onyx; both front ends load images from the Query Engine.]
Figure 5-2. Overview of the integrated system.
The image groups in ImageGrouper are usually created by the user based on his/her high-level semantic concepts. Therefore, the images in the group are not
necessarily close to one another in the feature space. Visualizing image groups in the 3D feature space
helps effective trial-and-error querying. If one of the images in a group is located far from the others (i.e., an outlier), removing this image may improve the query result. Finding such outliers in two-dimensional user interfaces is very difficult.
Chapter 6
Storage and Visual Features for Content-based Image Retrieval
In this chapter, we discuss additional research topics we are investigating. The first topic is data
structures for high-dimensional data. The second topic is feature extraction of local image features. In
addition, we discuss integration of text information with low-level image features.
6.1 Data Structure for High-Dimensional Data Access
6.1.1 Background
Content-based Image Retrieval (CBIR) systems need to store the metadata of the image files on the disks. This metadata is numerical, high-dimensional data made of low-level visual features (Appendix A). It is difficult to manage such data with traditional commercial database systems because those systems are designed for text and low-dimensional numerical data. Thus, many researchers have proposed architectures for indexing high-dimensional data [10][11][12][35][38][50][57][81][88][113][114]. Some systems index numerical data into B-tree-like hierarchical structures [8][10][38]. Other systems separate the data space into grid regions and construct a tree of the regions [72][81][92]. Most of the systems were designed to minimize the number of I/O accesses to the disks.
Few researchers, however, take real-time performance into account. Therefore, many systems consume an enormous amount of computation (CPU time) creating and traversing complicated trees while minimizing I/O access time. Beyer et al. [13] showed that in high dimensions (10-15), a simple linear scan algorithm outperforms those sophisticated index structures. This is because many sophisticated data structures do not scale to larger dimensionality. When the dimensionality gets higher, those systems have to traverse every node in the trees. Meanwhile, the linear scan requires a minimal amount of computation in any dimensionality. Therefore, Beyer et al. recommend that every researcher compare his/her new algorithms with the linear scan [13].
The advantages of the linear scan increase dramatically as the size of memory grows, because the linear scan can utilize as much memory as is available for caching the data on the disks. These days, the price of memory keeps falling. Therefore, it is very difficult to exceed the speed of the linear scan.
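A linear scan for k-nearest-neighbor search is only a few lines. This sketch (using NumPy, our choice) shows why its cost is a single O(n·d) pass regardless of dimensionality:

```python
import numpy as np

def knn_linear_scan(data, query, k):
    # One pass over all n vectors: compute every distance, keep the k smallest.
    dists = np.linalg.norm(data - query, axis=1)   # Euclidean distance to each point
    idx = np.argpartition(dists, k)[:k]            # indices of the k smallest (unordered)
    return idx[np.argsort(dists[idx])]             # order those k by distance
```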
Weber et al. [113] proposed an approximation-based data structure named VA-File. In VA-File, each datum on the disks is quantized into 4-6 bits per dimension. These data are used as an approximation of the original data. Because the size of the quantized data is much smaller, the system can hold all approximated data in main memory. For a nearest neighbor query, the system first scans every approximated datum in memory and finds candidate data. Then it loads only the data blocks where the candidate data are stored. Because many data are pruned in the first phase, I/O accesses are reduced in the second phase. In addition, it requires only a small amount of computation since it does not create a complicated data structure. Wu and Manjunath applied VA-File to interactive image retrieval systems [116].
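The two-phase idea can be illustrated as follows. This is a simplified sketch: real VA-File keeps per-cell lower and upper distance bounds rather than ranking by cell centers, so the candidate selection here is only a rough stand-in:

```python
import numpy as np

def quantize(data, bits=4):
    # Offline step: quantize each dimension into 2**bits cells.
    lo, hi = data.min(axis=0), data.max(axis=0)
    cells = (2 ** bits) - 1
    approx = np.round((data - lo) / (hi - lo + 1e-12) * cells).astype(np.uint8)
    return approx, lo, hi

def candidate_filter(approx, lo, hi, query, k, bits=4):
    # Phase 1: scan the small in-memory approximations and keep candidates;
    # phase 2 (not shown) would fetch only those exact vectors from disk.
    cells = (2 ** bits) - 1
    centers = lo + approx / cells * (hi - lo)       # reconstruct cell centers
    dists = np.linalg.norm(centers - query, axis=1)
    return list(np.argsort(dists)[: 4 * k])         # oversample to absorb quantization error
```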
While VA-File is very simple and outperforms the other sophisticated methods, it cannot utilize a large memory space. The size of the approximation file is proportional to the number of data in the original database. Therefore, even if more memory is available, VA-File cannot improve the search time further. As a result, when a large amount of memory is available and the dimensionality is large, the linear scan algorithm still outperforms VA-File [22]. In addition, because the approximated data are squeezed into a smaller region, each datum is not aligned on byte boundaries. Thus, the system requires many memory copy operations for various computations.
6.2 Preliminary Experiments
We are investigating new algorithms to overcome the bottleneck of VA-File [22]. First, we have implemented “cache-aware” versions of the X-Tree [10], VA-File, and the linear scan algorithm [22]. These customized versions are designed to utilize the available memory as much as possible. For example, our new implementation of the X-Tree caches data blocks and superblocks in memory. These improvements are especially important when relevance feedback is used for data retrieval, because during relevance feedback loops the same data are repeatedly accessed.
We then compared the real-time performance of K-Nearest Neighbor search. Figure 6-1 shows the result of our preliminary experiments. Here, K=50, Dimension=37, and the data size is 100,000 (occupying 28 MB of disk). The horizontal axis is the size of the memory available, and the vertical axis is the time required for K-Nearest Neighbor search. As shown in the figure, while the linear scan algorithm consistently improves its performance as the memory size increases, the performance of the X-Tree and VA-File does not improve. Therefore, when the available memory is small, VA-File outperforms the scan. However, with larger memory, the scan searches faster than VA-File.
In investigating data access methods for high-dimensional data, various factors need to be considered. First, we need to trade off I/O bus bandwidth against memory size/speed and CPU speed. If the CPU is very fast, we can afford a complicated algorithm. On the other hand, if I/O is fast enough, we may increase I/O accesses and reduce the computational complexity. In addition, the dimensionality and the distribution of the data change the performance dramatically.
6.3 Image Retrieval by Local Image Features
In the discussions of image retrieval in the previous chapters, we extracted only global image features from the entire image region. While these features work well in many situations, they are not always suitable. Even when the user is interested in an object in an image, the image features are often dominated by the background of the image. For example, if we extract the global feature of the picture in Figure 6-2, the gray background becomes the dominant color instead of the color of the cat. In order to use this picture to search for cat images, the system needs to extract local features related to the cat. Many researchers have
[Figure 6-1: K-Nearest Neighbor search time versus cache size (0.8-20 MB) for VA-File, X-Tree, and Scan.]
Figure 6-1. Comparison of K-Nearest Neighbor search time. K=50, Dimension=37; the number of original data is 100,000 (28 MB). Iterated 100,000 times.
proposed image retrieval systems with local image features [18][19][28][98][102][116]. The most popular approach is to extract features of objects by image segmentation [18][19][28][116]. For example, Blobworld [18][19] asks the user to select his/her region of interest from pre-segmented images. The system extracts blobs from an image based on color and texture similarity and then calculates the features of each blob.
While these segmentation-based approaches are very effective and intuitive for the users, they have several drawbacks. First, although image segmentation is done offline, segmentation and feature extraction on a large image database are prohibitively expensive. Since the number of segments in each image differs, search in the database is inefficient. Finally, image segmentation is not very robust or reliable.
We are investigating simpler and more efficient ways of image retrieval with local image features. In our approach, each image is divided into a 5 by 5 grid of blocks. Then, image features for each block are calculated (Figure 6-2). For retrieval, the system asks the user to indicate his/her interest by selecting the corresponding blocks of an image. In the case of Figure 6-2, the user selected the 2 by 2 blocks that enclose the cat (the blue region). The system approximates the image features of this 2 by 2 region from the features of the blocks in the region (Figure 6-3). In searching for similar images in the database, the system scans blocks of the same size. In this example, each image has 16 possible regions of 2 by 2 blocks.
Figure 6-2. Block-based image selection. In this example, the image is divided into 5 x 5 blocks. The user may be interested in the 2 x 2 region colored blue.
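A minimal sketch of the block-feature computation, using mean color as a stand-in for the thesis’s color/texture/structure features:

```python
import numpy as np

def block_features(image, grid=5):
    # Split an H x W x 3 image into grid x grid blocks and compute one
    # feature vector per block (here simply the mean color).
    h, w = image.shape[0] // grid, image.shape[1] // grid
    return np.array([[image[i*h:(i+1)*h, j*w:(j+1)*w].mean(axis=(0, 1))
                      for j in range(grid)] for i in range(grid)])

def region_feature(feats, top, left, size=2):
    # Approximate the feature of a size x size region of blocks by
    # averaging its block features (cf. Figure 6-3).
    return feats[top:top+size, left:left+size].mean(axis=(0, 1))
```

At search time, sliding the same-size region over each database image yields the 16 candidate 2 x 2 regions per image mentioned above.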
Smith et al. [98] proposed a similar approach for texture features. In their system, images are decomposed into quad-tree structures (Figure 6-4). For each node of the quad-tree, a wavelet transform is applied to extract local texture features. This system, however, searches only the same grid position in the database. Thus, it cannot deal with translated images.
6.4 Integration of Content-based and Keyword-based Image Retrieval
Despite the advances in content-based image retrieval research, the use of keyword information is very helpful. The “Annotation by Groups” method of ImageGrouper (Section 3.6 in Chapter 3) makes it easier to annotate many images with keywords. It also allows the user to create a semantic hierarchy of images.
[Figure 6-3: the block features F33, F34, F43, and F44 are combined into the approximated region feature F'.]
Figure 6-3. Approximating an image region from smaller blocks.
MPEG-7 data types [62] allow text annotation to be attached to multimedia objects. The standard supports various forms of text information, including keywords, free text, structured annotation, and so on [62]. Figure 6-5 shows an example of text annotation (keywords and free text). This textual information is stored in Extensible Markup Language (XML). Because this information is stored alongside the multimedia content descriptors, integration of content-based search and keyword-based search is desirable.
[Figure 6-4: an original image decomposed into a quad-tree by repeated wavelet transforms.]
Figure 6-4. Quad-Tree Decomposition [98].

<TextAnnotation>
  <KeywordAnnotation>
    <Keyword>Italy</Keyword>
    <Keyword>Cat</Keyword>
    <Keyword>Water</Keyword>
  </KeywordAnnotation>
  <FreeTextAnnotation xml:lang="en">
    A cat is drinking water from a tap.
    Taken in Vernazza, Italy in 2002.
  </FreeTextAnnotation>
</TextAnnotation>
Figure 6-5. Example of Keyword and Free Text Annotation of MPEG-7

Zhou et al. [122] proposed a primitive system for integrated retrieval. In their system, keyword information is integrated into the low-level feature vectors. In the extended feature vectors, each keyword is represented by a binary value that indicates the presence of the keyword in an image. Therefore, the
dimensionality of the feature vector is “the dimensionality of the low-level features + the number of keywords supported.” However, the system is not scalable because the number of keywords has to be fixed. It also infers relationships between keywords only from low-level visual features, without domain knowledge.
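The extended vector of this scheme can be sketched as follows; the vocabulary and feature values are illustrative, not from the cited system:

```python
import numpy as np

def extended_vector(visual_features, image_keywords, vocabulary):
    # One binary entry per vocabulary keyword is appended to the low-level
    # feature vector; the vocabulary must be fixed in advance, which is the
    # scalability limitation noted above.
    kw = np.array([1.0 if w in image_keywords else 0.0 for w in vocabulary])
    return np.concatenate([np.asarray(visual_features, dtype=float), kw])
```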
We will explore this problem further. We need to incorporate techniques from text information retrieval [5] and data mining [39] into content-based image retrieval systems.
Bibliography
[1] ACD Systems. “ACDSee” http://www.acdsystems.com.
[2] Anderson, E. et. al., “LAPACK Users' Guide,” Third Edition, Society for Industrial andApplied Mathematics, 1999.
[3] Apple Computer Inc. “iPhoto” http://www.apple.com/iphoto.
[4] Automatically Tuned Linear Algebra Software (ATLAS), http://math-atlas.sourceforge.net.
[5] Baeza-Yates, R. and Ribeiro-Neto, B. “Modern Information Retrieval.” Addison Wesley, 1999.
[6] Balabanovic, M., Chu, L.L. and Wolff, G.J. “Storytelling with Digital Photographs.” InCHI’00, 2000.
[7] Bates, M.J. “The design of browsing and berrypicking techniques for the on-line search inter-face.” Online Review, 13(5), pp. 407-431, 1989.
[8] Beckman, N., Kriegel, H.P., Schnneider, R. and Seeger, B., “The R*-Tree: An Efficient andRobust Access Method for Points and Rectangles,” In Proceedings of the ACM SIGMOD’90,1990.
[9] Bederson, B. B. “Quantum Treemaps and Bubblemaps for a Zoomable Image Browser.” HCILTech Report #2001-10, University of Maryland, College Park, MD 20742.
[10] Berchtold, S., Keim, D. and Kriegel, H-P., “The X-Tree: An Index Structure for High Dimen-sional Data.” In the 22nd VLDB Conference, 1996.
[11] Berchtold, S., Böhm, C., Jagadish, H.V., Kriegel, H-P. and Sander, J., “Independent Quantiza-tion: An Index Compression Technique for High-Dimensional Data Space,” In Proceedings ofICDE’00, 2000.
[12] Bernhard, E., Boser, I.G. and Vapnik, V., “A training algorithms for optimal margin classifiers,”In Proceedings of the 4th Workshop on Computational Learning Theory, pp. 144-152, San Mateo,CA, 1992.
97
[13] Beyer, K., Goldstein, J., Rmarkrishnan, R. and Shaft, U., “When Is “Nearest Neighbor” Mean-ingful?” In Proceeding of International Conference on Database Theory (ICDT’99), 1999.
[14] Brinkhoff, T., Kriegel, H-P., Schneider, R. and Seeer, B., “Multi-step Processing of SpatialJoins.” In SIGMOD Record (ACM Special Interest Group on Management of Data,) 23(2) 197-208, June 1994.
[15] Brown, M. H. and Hershberger, J., “Color and Sound in Algorithm Animation.” In IEEEComputer, Vol. 25, No. 12, December 1992.
[16] Burges, C., “A tutorial on support vector machines for pattern recognition,” In Data Miningand Knowledge Discovery, 2, pp.121-167, 1998.
[17] Card, S. K., Macinlay, J. D. and Shneiderman, B., “Readings in Information Visualization -Using Vision to Think,” Morgan Kaufmann, 1999.
[18] Carson, C. et. al., “Region-based Image Querying,” In Proceedings of IEEE Workshop onCBAIVL, June, 1997.
[19] Carson, C., Thomas, M. and Belongie, S., “Blobworld: A system for region -based image index-ing and retrieval,” University of California at Berkeley, 1999.
[20] Chen, J-Y., Bouman, C.A., and Dalton, J.C., “Heretical Browsing and Search of Large ImageDatabase,” In IEEE Transaction on Image Processing, Vol. 9, No. 3, pp. 442-455, March 2000.
[21] Chen, C., Gagaudakis, G. and Rosin, P., “Content-Based Image Visualization,” In Proceedingsof IEEE International Conference on Information Visualization (IV’00), 2000.
[22] Chen, T., Nakazato, M., and Huang, T. S., “Speeding up the Similarity Search in MultimediaDatabase.” In Proceedings of IEEE ICME2002, August, 2002.
[23] Chen, Y., Zhou, X. S. and Huang, T. S., “One-Class SVM for learning in image retrieval,” InProceedings of IEEE International Conference on Image Processing (ICIP’01), 2001.
[24] Cousins, S. B., et al. The Digital Library Integrated Task Environment (DLITE). In Proceedingsof 2nd ACM International Conference on Digital Libraries, 1997.
[25] Cox, I. J., Miller, M. L., Minka, T. P., Papathomas, T. V. and Yianilos, P. N. The Bayesian ImageRetrieval System, PicHunter: Theory, Implementation, and Psychophysical Experiments. InIEEE Transactions on Image Processing, Vol. 9, No. 1, January 2000.
[26] Duda, O. R., Hart, P. E. and Stork, D. G., “Patter Classification Second Edition”, Wiley-Inter-science, 2001.
98
[27] Faloutsos, C. and Ling, K., “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visual-ization of Traditional and Multimedia Datasets,” In Proceedings of ACM SIGMOD95, pp. 163-174, May 1995.
[28] Fauqueur, J. and Boujemaa, N., “Image Retrieval by regions: Coarse Segmentation and FineColor Description,” In Proceedings of Fifth International Conference on Visual Information Sys-tems (VISual 2002), HsinChu, Taiwan, March 11-13, 2002.
[29] Fielding et. al. “Hypertext Transfer Protocol -- HTTP/1.1,” RFC 2616. June 1999.
[30] Finkel, R. and Bentley, J., “Quad-trees: A Data Structure for Retrieval on Composite Keys. InACTA Informatica, 4(1) pp.1-9, 1974.
[31] Flickner, M. et al., “Query by image and video content: The QBIC system,” IEEE Computers,1995.
[32] Freeston, M., “The BANG file: A new kind of grid file,” In Proceedings of the ACM SIGMODInternational Conference on Management of Data, 1987.
[33] Fukunaga, K., “Introduction to Statistical Patter Recognition 2nd Edition,” Academic Press,Inc. 1990.
[34] Funt, B. V. and Finlayson, G. D., “Color constant color indexing,” Technical Report 91-09,School of Computing Science, SImon Fraser University, Vancouver, B.C., Canada, 1991.
[35] Gaede, V. and Günther, O., “Multidimensional Access Methods,” In ACM Computing Surveys,Vol. 30, No. 2, June 1998.
[36] Gonzales, R. C. and Woods, R. E., “Digital Image Processing,” Addison-Wesley, 1992.
[37] Gonzales, D. Digital Cameras are no longer just for the Digerati. New York Times, November25, 2001.
[38] Guttman, A. “R-trees: A Dynamic Index Structure for Spatial Searching,” In Proceedings of SIG-MOD’84, 1984.
[39] Han, J. and Kamber, M., “Data Mining Concepts and Techniques,” Morgan Kaufmann, 2001.
[40] Hearn, D. and Baker, M. P., “Computer Graphics C Version: Second Edition,” Prentice Hall,1997.
[41] Hearst, M. A. and Karadi, C., “Cat-a-cone: An interactive interface for specifying searches andviewing retrieval results using a large category hierarchy. In Proceeding of 20th Annual Interna-tional ACM SIGIR Conference, Philadelphia, PA, 1997.
[42] Hiroike, A. and Musha, Y., "Visualization for Similarity-Based Image Retrieval Systems," In IEEE Symposium on Visual Languages, 1999.
[43] Hiroike, A., Musha, Y., Sugimoto, A. and Mori, Y., "Visualization of Information Space to Retrieve and Browse Image Data," In Proceedings of Visual'99: Information and Information Systems, 1999.
[44] Howell, D. C., "Fundamental Statistics for the Behavioral Sciences, 4th Edition," Duxbury Press, 1999.
[45] Intel Corporation, "Intel Math Kernel Library," http://www.intel.com/software/products/mkl/.
[46] Ishikawa, Y., Subramanya, R. and Faloutsos, C., "MindReader: Querying databases through multiple examples," In Proceedings of the 24th VLDB Conference, 1998.
[47] iView Multimedia Ltd., "iView Media Pro," http://www.iview-multimedia.com.
[48] Jing, F., Li, M., Zhang, H-J. and Zhang, B., "Support Vector Machine for Region-based Image Retrieval," In IEEE International Conference on Multimedia & Expo, Baltimore, Maryland, July 2003.
[49] Jones, S., "Graphical Query Specification and Dynamic Result Previews for a Digital Library," In UIST'98, 1998.
[50] Katayama, N. and Satoh, S., "The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries," In Proceedings of ACM SIGMOD'97, 1997.
[51] Koskela, M., Laaksonen, J. and Oja, E., "MPEG-7 Descriptors in Content-Based Image Retrieval with PicSOM System," In Proceedings of the Fifth International Conference on Visual Information Systems (VISual 2002), HsinChu, Taiwan, March 11-13, 2002.
[52] Kruskal, J. B. and Wish, M., "Multidimensional Scaling," SAGE Publications, Beverly Hills, 1978.
[53] Kuchinsky, A., Pering, C., Creech, M. L., Freeze, D., Serra, B. and Gwizdka, J., "FotoFile: A Consumer Multimedia Organization and Retrieval System," In CHI'99, 1999.
[54] Massari et al., "Virgilio: A Non-Immersive VR System to Browse Multimedia Databases," In Proceedings of IEEE ICMCS'97, 1997.
[55] Laaksonen, J., Koskela, M. and Oja, E., "Content-based image retrieval using self-organizing maps," In Proceedings of the 3rd International Conference on Visual Information and Information Systems, 1999.
[56] Lagergren, E. and Over, P., "Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment," In ACM SIGIR'98, 1998.
[57] Lin, K-I., Jagadish, H. V. and Faloutsos, C., "The TV-tree: An index structure for high-dimensional data," In The VLDB Journal: The International Journal on Very Large Data Bases, 3(4), pp. 517-549, October 1994.
[58] Lu, G., "Multimedia Database Management Systems," Artech House, Inc., 1999.
[59] Müller, H. et al., "Automated Benchmarking in Content-based Image Retrieval," In Proceedings of the IEEE International Conference on Multimedia and Expo 2001, August 2001.
[60] Ma, W. Y. and Manjunath, B. S., "NeTra: A toolbox for navigating large image databases," In Multimedia Systems, Vol. 7, No. 3, pp. 184-198, 1999.
[61] Manjunath, B. S., Vasudevan, V. V. and Yamada, A., "Color and Texture Descriptors," In IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, June 2001.
[62] Manjunath, B. S. et al. (eds.), "Introduction to MPEG-7: Multimedia Content Description Interface," John Wiley & Sons, Ltd., 2002.
[63] Maybury, M. T. (ed.), "Intelligent Multimedia Information Retrieval," MIT Press, 1997.
[64] Miles, S., "Camera buyers increasingly focus on digital," CNET News.com, September 26, 2000.
[65] Minka, T. P. and Picard, R. W., "Interactive Learning with a 'Society of Models'," In Proceedings of IEEE CVPR'96, 1996.
[66] Moghaddam, B., Biermann, H. and Margaritis, D., "Defining Image Content with Multiple Regions-of-Interest," In Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries (CBAIVL'98), 1998.
[67] Nakazato, M., Manola, L. and Huang, T. S., "ImageGrouper: Search, Annotate and Organize Image by Groups," In Proceedings of the Fifth International Conference on Visual Information Systems (VISual 2002), HsinChu, Taiwan, March 11-13, 2002.
[68] Nakazato, M., Manola, L. and Huang, T. S., "ImageGrouper: a group-oriented user interface for content-based image retrieval and digital image arrangement," In Journal of Visual Languages and Computing, 14/4, pp. 363-386, August 2003.
[69] Nakazato, M. et al., "UIUC Image Retrieval System for JAVA," available at http://chopin.ifp.uiuc.edu:8080.
[70] Nakazato, M. and Huang, T. S., "3D MARS: Immersive Virtual Reality for Content-based Image Retrieval," In Proceedings of the IEEE International Conference on Multimedia and Expo 2001.
[71] Nakazato, M., Manola, L. and Huang, T. S., "ImageGrouper: A Group-Oriented User Interface for Content-based Image Retrieval and Digital Image Arrangement," In Journal of Visual Languages and Computing, 14/4, pp. 363-386, 2003.
[72] Nievergelt, J., Hinterberger, H. and Sevcik, K. C., "The Grid File: An Adaptable, Symmetric Multikey File Structure," In ACM Transactions on Database Systems, Vol. 9, No. 1, pp. 38-71, 1984.
[73] O'Day, V. L. and Jeffries, R., "Orienteering in an information landscape: how information seekers get from here to there," In INTERCHI'93, 1993.
[74] Pecenovic, Z., Do, M-N., Vetterli, M. and Pu, P., "Integrated Browsing and Searching of Large Image Collections," In Proceedings of the Fourth International Conference on Visual Information Systems, November 2000.
[75] Peng, J., Bhanu, B. and Qing, S., "Probabilistic feature relevance learning for content-based image retrieval," In Computer Vision and Image Understanding, 75(1/2), pp. 150-164, 1999.
[76] Peng, J., "Multi-class relevance feedback content-based image retrieval," In Computer Vision and Image Understanding, 90, pp. 42-67, 2003.
[77] Picard, R. W., Minka, T. P. and Szummer, M., "Modeling User Subjectivity in Image Libraries," In Proceedings of the IEEE International Conference on Image Processing (ICIP'96), Lausanne, September 1996.
[78] Porkaew, K., Chakrabarti, K. and Mehrotra, S., "Query Refinement for Multimedia Similarity Retrieval in MARS," Technical Report TR-MARS-99-05, University of California at Irvine, 1999.
[79] Porkaew, K., Chakrabarti, K. and Mehrotra, S., "Query Refinement for Multimedia Similarity Retrieval in MARS," In ACM Multimedia'99, Orlando, FL, October 1999.
[80] Red Hat, Inc., "Cygwin User's Guide," 2001.
[81] Robinson, J. T., "The K-D-B-Tree: A Search Structure for Large Multi-Dimensional Dynamic Indexes," In Proceedings of ACM SIGMOD'81, 1981.
[82] Rodden, K., Basalaj, W., Sinclair, D. and Wood, K., "Does Organization by Similarity Assist Image Browsing?" In CHI'01, 2001.
[83] Roussopoulos, N., Kelley, S. and Vincent, F., "Nearest Neighbor Queries," In Proceedings of ACM SIGMOD'95, 1995.
[84] Rubner, Y., Guibas, L. and Tomasi, C., "The earth mover's distance, multi-dimensional scaling, and color-based image retrieval," In Proceedings of the ARPA Image Understanding Workshop, May 1997.
[85] Rubner, Y., "Perceptual Metrics for Image Database Navigation," Ph.D. Thesis, Stanford University, May 1999.
[86] Rui, Y., Huang, T. S., Ortega, M. and Mehrotra, S., "Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval," In IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, September 1998.
[87] Rui, Y. and Huang, T. S., "Optimizing Learning in Image Retrieval," In Proceedings of IEEE CVPR, 2000.
[88] Samet, H., "Hierarchical Representations of Collections of Small Rectangles," In ACM Computing Surveys, Vol. 20, No. 4, December 1988.
[89] Santini, S. and Jain, R., "Integrated Browsing and Querying for Image Databases," In IEEE Multimedia, Vol. 7, No. 3, pp. 26-39, 2000.
[90] Scholkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J. and Williamson, R. C., "Estimating the Support of a High-Dimensional Distribution," Technical Report MSR-TR-99-87, Microsoft Research, 1999.
[91] Schvaneveldt, R. W., Durso, F. T. and Dearholt, D. W., "Network structures in proximity data," In The Psychology of Learning and Motivation, 24, G. Bower, Ed., Academic Press, 1989, pp. 249-284.
[92] Seeger, B. and Kriegel, H-P., "The Buddy-Tree: An Efficient and Robust Access Method for Spatial Data Base Systems," In Proceedings of the 16th International Conference on Very Large Data Bases, 1990.
[93] Sharp, "J-SH07," http://www.j-phone.com/f-e/j/products/j-sh07/back.html and http://www.sharp.co.jp/products/jsh07/index.html.
[94] Shneiderman, B. and Kang, H., "Direct Annotation: A Drag-and-Drop Strategy for Labeling Photos," In Proceedings of the IEEE International Conference on Information Visualization (IV'00), 2000.
[95] Singer, A. et al., "Tangible Progress: Less is More in Somewire Audio Spaces," In Proceedings of ACM CHI'99, May 1999.
[96] Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A. and Jain, R., "Content-based Image Retrieval at the End of the Early Years," In IEEE PAMI, Vol. 22, No. 12, December 2000.
[97] Smith, J. R. and Chang, S-F., "Transform features for texture classification and discrimination in large image databases," In Proceedings of the IEEE International Conference on Image Processing, 1994.
[98] Smith, J. R. and Chang, S-F., "Quad-Tree Segmentation for Texture-based Image Query," In Proceedings of the ACM 2nd International Conference on Multimedia, 1994.
[99] Smith, J. R. and Chang, S-F., "VisualSEEk: a fully automated content-based image query system," In ACM Multimedia'96, 1996.
[100] Smith, J. R. and Chang, S-F., "Local Color and Texture Extraction and Spatial Query," In IEEE International Conference on Image Processing, 1996.
[101] Smith, J. R. and Chang, S-F., "Tools and techniques for color image retrieval," In Symposium on Electronic Imaging: Science and Technology - Storage & Retrieval for Image and Video Databases IV, Vol. 2670, San Jose, CA, February 1996, IS&T/SPIE.
[102] Sridhar, V., Nascimento, M. A. and Li, X., "Region-Based Image Retrieval Using Multiple Features," In Proceedings of the Fifth International Conference on Visual Information Systems (VISual 2002), HsinChu, Taiwan, March 11-13, 2002.
[103] Strang, G. and Nguyen, T., "Wavelets and Filter Banks," Wellesley-Cambridge Press, 1997.
[104] Stricker, M., "Bounds for the discrimination power of color indexing techniques," In Storage and Retrieval for Image and Video Databases II, Vol. 2185 of SPIE Proceedings Series, pp. 15-24, February 1994.
[105] Stricker, M. and Orengo, M., "Similarity of Color Images," In Proceedings of SPIE, Vol. 2420 (Storage and Retrieval for Image and Video Databases III), SPIE Press, February 1995.
[106] Sun Microsystems, Inc., "Sun Performance Library User's Guide," 2000.
[107] Sural, S., Qian, G. and Pramanik, S., "Segmentation and Histogram Generation Using the HSV Color Space for Image Retrieval," In IEEE International Conference on Image Processing, Rochester, NY, 2002.
[108] Swain, M. J. and Ballard, D. H., "Color Indexing," In International Journal of Computer Vision, 7(1), pp. 11-32, 1991.
[109] Tian, G. Y. and Taylor, D., "Colour Image Retrieval Using Virtual Reality," In Proceedings of the IEEE International Conference on Information Visualization (IV'00), 2000.
[110] Tong, S. and Chang, E., "Support Vector Machine Active Learning for Image Retrieval," In Proceedings of the ACM International Conference on Multimedia, 2001.
[111] Vapnik, V., "Estimation of Dependences Based on Empirical Data," Springer-Verlag, 1982.
[112] Vasconcelos, N. and Lippman, A., "A Probabilistic Architecture for Content-based Image Retrieval," In Proceedings of Computer Vision and Pattern Recognition (CVPR 2000).
[113] Weber, R., Schek, H-J. and Blott, S., "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces," In Proceedings of the 24th VLDB Conference, New York, 1998.
[114] White, D. A. and Jain, R., "Similarity Indexing with the SS-tree," In Proceedings of the 12th ICDE, 1996.
[115] Wise, J. A. et al., "Visualizing the non-visual: Spatial analysis and interaction with information from text documents," In Proceedings of the Information Visualization Symposium '95, pp. 51-58, IEEE Computer Society Press, 1995.
[116] Wu, P. and Manjunath, B. S., "Adaptive nearest neighbor search for relevance feedback in large image databases," In Proceedings of ACM International Conference, Ottawa, Canada, October 2001.
[117] Wu, Y., Tian, Q. and Huang, T. S., "Discriminant-EM algorithm with application to image retrieval," In Proceedings of ICME 2000: 2000 IEEE International Conference on Multimedia and Expo, 2000.
[118] Zhang, L., Lin, F. and Zhang, B., "Support vector machine learning for image retrieval," In Proceedings of the International Conference on Image Processing, October 2001.
[119] Zhou, X. and Huang, T. S., "A Generalized Relevance Feedback Scheme for Image Retrieval," In Proceedings of SPIE Vol. 4210: Internet Multimedia Management Systems, November 6-7, 2000, Boston, MA, USA.
[120] Zhou, X. S. and Huang, T. S., "Edge-based structural features for content-based image retrieval," In Pattern Recognition Letters, Special issue on Image and Video Indexing, 2000.
[121] Zhou, X. S., Petrovic, N. and Huang, T. S., "Comparing Discriminating Transformations and SVM for Learning during Multimedia Retrieval," In ACM Multimedia'01, 2001.
[122] Zhou, X. S. and Huang, T. S., "Unifying Keywords and Visual Contents in Image Retrieval," In IEEE Multimedia, Vol. 9, No. 2, pp. 23-33, April-July 2002.
Appendix A
Image Features in the Systems
The visual image features in our systems consist of thirty-seven (37) dimensional numerical values drawn from three groups of features: color, texture, and edge structure. These features are extracted from the images and indexed in the metadata database offline. Although we used the same feature set for all systems in this thesis, our systems are not limited to these features. They are designed to work with any visual features.
A.1 Color Distribution
For the color features, the HSV color space is used. The HSV color space is a three-dimensional color representation consisting of Hue, Saturation, and Value. It is often depicted as an inverted cone (Figure A-1). The hue (H) is represented by an angle ranging from 0º to 360º. The angles 0º, 60º, 120º, 180º, 240º, and 300º represent red, yellow, green, cyan, blue, and magenta, respectively. The saturation (S) specifies the purity of a color and ranges from 0 to 1. The value (V) represents the brightness, or whiteness, of the color: V=0 means black, while V=1 means white. When S=0, the value of V represents a gray level. HSV has been a popular choice for image retrieval systems
including the MPEG-7 standard [61][62][101][107], because this color representation is intuitive for human perception [40][61].
Simple color histogram comparison [34][108] is sensitive to small color variations caused by lighting conditions [104]. Instead, we use the color distributions in the HSV space proposed by Stricker and Orengo [105]. In this method, the first three moments (the mean, the standard deviation,
and the third moment) of each of the HSV channels are extracted [105]. Suppose $p_{ij}$ is the value of the $i$-th color channel at the $j$-th pixel of an image; the three features are:

$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}$    (1-1)

$\sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \right)^{1/2}$    (1-2)

$s_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^3 \right)^{1/3}$    (1-3)

Figure A-1. The HSV color space.
They are calculated for each of H, S, and V. Therefore, the total number of color features is $3 \times 3 = 9$ for each image. Compared with color histogram comparison, the color moments are more robust against small color variations [105].
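As a concrete illustration, the three moments can be computed per channel in a few lines. The sketch below is ours, not from the thesis source code; it assumes the pixel values of one channel are given as a flat array, and uses a signed cube root so that a negative third moment is preserved.

```java
// Sketch: per-channel color moments (mean E_i, std dev sigma_i, and the
// cube root of the third moment s_i), following equations (1-1)-(1-3).
// Class and method names are illustrative, not from the actual system.
public class ColorMoments {
    public static double[] moments(double[] p) {
        int n = p.length;
        double mean = 0;
        for (double v : p) mean += v;
        mean /= n;
        double var = 0, skew = 0;
        for (double v : p) {
            double d = v - mean;
            var += d * d;
            skew += d * d * d;
        }
        double sigma = Math.sqrt(var / n);
        double s = Math.cbrt(skew / n);  // cbrt keeps the sign of the third moment
        return new double[] { mean, sigma, s };
    }
}
```

Running this once per H, S, and V channel yields the nine color features described above.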
A.2 Texture
For texture, we used wavelet-based texture features [97][98]. First, each image is passed through a wavelet filter bank [103], where it is decomposed into 10 de-correlated subbands (3 levels), as shown in Figure A-2. The upper-left image in the wavelet image is the lowest-frequency subband, and the lower-right image is the highest-frequency subband. For each subband, the standard deviation of the wavelet coefficients is extracted. Therefore, the total number of texture features is 10.
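To make the filter-bank step concrete, the sketch below applies one level of a 1-D Haar filter bank and measures a subband's standard deviation. This is a deliberate simplification of the system's 2-D, 3-level decomposition; the Haar choice and all names here are illustrative assumptions, not the actual wavelet used in the thesis.

```java
// One level of a 1-D Haar filter bank: split the signal into a low-pass
// (average) and a high-pass (difference) subband, each half the length.
public class HaarLevel {
    public static double[][] decompose(double[] x) {
        int half = x.length / 2;
        double[] low = new double[half], high = new double[half];
        double s = Math.sqrt(2.0);
        for (int k = 0; k < half; k++) {
            low[k]  = (x[2 * k] + x[2 * k + 1]) / s;
            high[k] = (x[2 * k] - x[2 * k + 1]) / s;
        }
        return new double[][] { low, high };
    }

    // Texture feature for one subband: standard deviation of its coefficients.
    public static double stdDev(double[] band) {
        double mean = 0;
        for (double v : band) mean += v;
        mean /= band.length;
        double var = 0;
        for (double v : band) var += (v - mean) * (v - mean);
        return Math.sqrt(var / band.length);
    }
}
```

In the real system the decomposition is applied in two dimensions and recursively on the low-frequency subband, yielding the 10 subbands of Figure A-2.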
Figure A-2. The wavelet texture features.

A.3 Edge Structures
For the edge structures, we apply the Water-Filling Algorithm [120] to the edge maps of the images. We first pass each original image through a Sobel filter, followed by a thinning operation [36], to generate its corresponding edge map. From the edge map, eighteen (18) elements are extracted. These features include the longest edge length, the histogram of edge lengths, and the number of forks on the edges.
Appendix B
Implementation Details of ImageGrouper and Query Engine
The source code of ImageGrouper (Chapter 3), the Query Engine (Chapter 4), and the traditional GUI for image retrieval will be published under the Illinois Open Source License. In this section, the implementation details of ImageGrouper and the Query Engine are explained. In addition, brief instructions for building the systems are given.
B.1 ImageGrouper
B.1.1 Structure of ImageGrouper User Interface
The current version of ImageGrouper is implemented in Java2 with the Swing API. The user interface has a layered structure (Figure B-1) that consists of the following five components:
• Main Panel (MainPanel.java)
The base panel on which the other panels are placed. During instantiation, the Main Panel creates the other panels and places them on itself (Figure B-1). Therefore, this panel is not visible to the users. This module also creates and sends queries and loads images via the Query Proxy (Section B.2.2).
• Grid Panel (GridPanel.java)
The panel that displays the query results in a grid. A scroll bar is attached on the right; this panel is placed in a scroll pane (JScrollPane) created by the Main Panel. Mouse events that occur on the grid are handled by the Grid Listener class (GridListener.java).
• Palette Panel (PalletePanel.java)
The workspace where the user composes queries by creating image groups. Images are dragged onto it from the Grid Panel. Once an image is dragged and dropped on the palette, it cannot be moved back to the grid.
• Glass Panel (GlassPanel.java)
A transparent panel used for dragging images from the Grid to the Palette (discussed in Section B.1.2). This panel covers the entire region of both the Grid Panel and the Palette Panel.
• Control Field (ControlField.java)
A panel holding buttons, such as the query button, and text input fields for keyword search. Most user actions on this panel are handled by the Main Panel.
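The layered construction described above can be sketched in Swing roughly as follows. The geometry, the use of JLayeredPane, and the class name are assumptions for illustration; the actual MainPanel.java code is not reproduced here.

```java
import javax.swing.*;

// Illustrative sketch of the five-component layered layout. The bounds
// are arbitrary; only the layering (glass above the others) matters.
public class MainPanelSketch {
    public static JLayeredPane build() {
        JLayeredPane main = new JLayeredPane();      // Main Panel (base)
        JPanel grid = new JPanel();                  // Grid Panel
        JPanel palette = new JPanel();               // Palette Panel
        JPanel control = new JPanel();               // Control Field
        JPanel glass = new JPanel();                 // Glass Panel (drag layer)
        glass.setOpaque(false);
        JScrollPane scroll = new JScrollPane(grid);  // only the grid scrolls
        scroll.setBounds(0, 0, 400, 300);
        palette.setBounds(0, 300, 400, 200);
        control.setBounds(0, 500, 400, 50);
        glass.setBounds(0, 0, 400, 500);             // covers grid + palette
        main.add(scroll, JLayeredPane.DEFAULT_LAYER);
        main.add(palette, JLayeredPane.DEFAULT_LAYER);
        main.add(control, JLayeredPane.DEFAULT_LAYER);
        main.add(glass, JLayeredPane.PALETTE_LAYER); // glass sits on top
        return main;
    }
}
```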
B.1.2 Image Drag Operation
The first version of ImageGrouper did not have layers: both the grid and the palette were drawn on a single panel (a subclass of Java Swing's JPanel). Therefore, all mouse events could be handled by the same component. In order to make only the grid panel scrollable, these components were later separated into independent panels. This approach also improved the modularity of the system. However, because the grid and the palette are independent components, the mouse events have to be handled separately. This made image dragging complicated, because the mouse events are sent to different components depending on the location of the mouse cursor. Moreover, since each panel has its own coordinate system (with the upper-left corner as the origin), the coordinates of an image have to be recalculated whenever the image moves from one panel to the other. Furthermore, the scroll offset of the grid has to be taken into account.
In order to eliminate the overhead of mouse drag operations, a transparent panel (GlassPanel.java) is introduced. This transparent panel is used as a temporary workspace for mouse dragging operations. When the system starts, the Glass Panel is inactive and does not receive any events. When the user starts dragging an image on the Grid Panel, the system activates the Glass Panel and transfers the image to the same location on that panel (Figure B-2 (1)). While the user is dragging the image, it is painted on the glass (Figure B-2 (2)). When the user releases the mouse button over the palette, the image is dropped on the Palette Panel (Figure B-2 (3)). If the image is still over the grid, it is returned to its original location in the grid. Once the image is dropped on the palette, it cannot be moved back to the grid; from then on it is drawn on the Palette Panel even while being dragged. When an image is moved
Figure B-1. Layered structure of the ImageGrouper user interface.
between the panels, the mouse location is translated into the coordinate system of the new panel. This approach achieves a much smoother image dragging operation because the mouse coordinates need to be translated only at the beginning and the end of the drag operation.
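The one-time coordinate translation at the start and end of a drag can be done with Swing's standard conversion utility. Below is a minimal sketch under assumed geometry; the bounds and class names are illustrative, not the actual ImageGrouper layout.

```java
import javax.swing.*;
import java.awt.*;

public class DragTranslate {
    // Translate a point from the grid's coordinate system to the glass
    // panel's, as must happen when a drag starts.
    public static Point gridToGlass(JComponent grid, JComponent glass, Point p) {
        return SwingUtilities.convertPoint(grid, p, glass);
    }

    // Tiny hierarchy for illustration: the grid is offset 40 px down inside
    // the parent, and the glass covers the whole parent.
    public static Point demo() {
        JPanel parent = new JPanel(null);
        JPanel grid = new JPanel(null);
        JPanel glass = new JPanel(null);
        parent.add(grid);
        parent.add(glass);
        grid.setBounds(0, 40, 400, 300);
        glass.setBounds(0, 0, 400, 340);
        return gridToGlass(grid, glass, new Point(10, 10));
    }
}
```

Here a point at (10, 10) in the grid maps to (10, 50) on the glass, because the grid itself sits 40 pixels below the glass panel's origin.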
B.2 Query Engine
B.2.1 Overview of the Query Engine
In this section, we explain the implementation of the Query Engine. The Query Engine receives a
query from a user interface and then returns the result. The query results include a list of relevant
images and their auxiliary information, such as filenames. Our engine is versatile in that the same engine can handle requests from different user interfaces, including ImageGrouper and 3D MARS. The system dynamically selects a search algorithm and a result format based on the request type. For example,
when the engine receives a query from 3D MARS, it also returns the 3D locations of the result images. Meanwhile, the engine needs to process group information when communicating with ImageGrouper. The system will be published under an open source license.
Figure B-2. Image Dragging from the Grid to the Palette.
The current system is implemented in C++ for maximum computation speed. Query algorithms such as Group Biased Discriminant Analysis (Section 4.3) involve a large number of vector and matrix calculations. In order to improve the speed of the system, we used a linear algebra package called CLAPACK [2]. This library is commonly used for scientific computation. CLAPACK and its Fortran version LAPACK (Linear Algebra PACKage) separate out the basic vector and matrix computations as the Basic Linear Algebra Subprograms (BLAS), so that the system can take advantage of the vector processing units of different CPUs or of shared-memory parallel processors. Various implementations of BLAS (or its C version
CBLAS) have been developed. We chose an implementation known as Automatically Tuned Linear
Algebra Software (ATLAS) [4]. ATLAS is a very popular open source implementation of CBLAS that
is used by many commercial systems. For SPARC Solaris machines, we used Sun Performance Library
[106].
Because LAPACK and BLAS were originally designed for Fortran 77, the functions in CLAPACK and CBLAS have cryptic names and require many arguments. In order to make coding the Query Engine easier, we provide wrapper functions in a separate source file (matfunc.cc) and a header file (mathfunc.h), so that the user can easily compile the system with other linear algebra packages, such as the Intel Math Kernel Library [45]. The functions in this source file range from simple vector and matrix calculations to higher-level operations such as covariance matrices and generalized eigenvectors. Our matfunc functions are simple and generic, which makes it easier for the user to implement other algorithms. The details of the search algorithms used in our Query Engine are described in Chapter 4.
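The wrapper idea can be illustrated in a few lines. The actual matfunc wrappers are C++ over CBLAS; the sketch below transposes the same idea to Java, with a hand-rolled stand-in for a BLAS-style dgemv routine. All names and signatures here are hypothetical, used only to show how a friendly call hides the cryptic one.

```java
public class MatFunc {
    // Stand-in for a BLAS-style routine: y = alpha*A*x + beta*y, with A an
    // m-by-n matrix stored column-major with leading dimension lda.
    static void dgemvRaw(int m, int n, double alpha, double[] a, int lda,
                         double[] x, double beta, double[] y) {
        for (int i = 0; i < m; i++) {
            double acc = 0;
            for (int j = 0; j < n; j++) acc += a[j * lda + i] * x[j];
            y[i] = alpha * acc + beta * y[i];
        }
    }

    // Friendly wrapper in the spirit of matfunc: y = A * x, nothing else
    // to remember about scaling factors or leading dimensions.
    public static double[] matVec(double[] a, int m, int n, double[] x) {
        double[] y = new double[m];
        dgemvRaw(m, n, 1.0, a, m, x, 0.0, y);
        return y;
    }
}
```

Swapping the hand-rolled routine for a tuned BLAS call changes only the wrapper body, not the callers, which is exactly the portability the matfunc layer provides.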
B.2.2 System Configurations
The Query Engine and the user interfaces can run in two configurations. The first is the Client-Server configuration, in which the Query Engine (server) and the user interface (client) are located on different machines and communicate with each other over the network. The second is the Standalone configuration, in which the Query Engine and the user interface reside on the same machine. The same Query Engine can be used for both configurations. The main component of the engine is stored in a shared library named libgroup.so on UNIX, libgroup.dylib on Mac OS X, and group.dll (a DLL) on Windows. No recompilation is required to switch between the two configurations.
The client and the server do not communicate directly. Instead, the client dynamically loads a proxy component (a query proxy) to connect with the engine. Different proxies connect a client with a server in different ways. Every proxy, however, presents a common interface to the clients (defined in QuerySender.java), so that the clients can communicate with any server in the same manner.
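This proxy arrangement can be sketched as follows. The interface name follows QuerySender.java, but the method signature and the two stub classes are assumptions for illustration, not the actual API.

```java
// Common interface every query proxy implements (the name follows
// QuerySender.java; the signature below is an assumption).
interface QuerySender {
    String sendQuery(String request);
}

// Standalone configuration: would call the engine in libgroup.so via JNI.
class LocalQuerySenderSketch implements QuerySender {
    public String sendQuery(String request) {
        return "local-result-for:" + request;   // stub in place of a JNI call
    }
}

// Client-server configuration: would send the request to the servlet over HTTP.
class RemoteQuerySenderSketch implements QuerySender {
    public String sendQuery(String request) {
        return "remote-result-for:" + request;  // stub in place of an HTTP round trip
    }
}

public class ProxyDemo {
    // The client sees only the QuerySender interface, so either proxy works.
    public static String run(QuerySender proxy, String request) {
        return proxy.sendQuery(request);
    }
}
```

Because the client codes against the interface alone, switching configurations is a matter of which proxy class gets loaded at startup.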
B.2.3 Client-Server Configuration
Figure B-3 shows the overview of the Client-Server configuration. In this configuration, the Query
Engine server and the user interface client communicate via HTTP [29] over TCP/IP. The server is implemented as a Java Servlet that works with the Apache web server. When the Servlet starts, it loads the body of the Query Engine stored in libgroup.so. The engine then loads the metadata and waits for query requests. The Servlet parses the HTTP request from a client and passes it to the engine. The Servlet
part of the system is written in Java (GroupImageServlet.java), and the main engine component is written in C++. They are connected via the Java Native Interface (JNI), as defined in GroupImageServlet.h and GroupImageServletImp.cc.
Since the server is implemented as a Servlet, the user can send a query from a web browser by typing
the address of the servlet with the query parameters such as:
http://chopin.ifp.uiuc.edu:8080/servlets/GroupImageServlet?Type=Query&Id=3436&Rel=1.0&Id=3478&Param=15

In this example, the user is requesting the 15 images most similar to the query example images (here, IDs 3436 and 3478). The user then sees the result as a list of image IDs.
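A servlet handling such a request has to collect repeated parameters (here, two Id values) in order. The following is a minimal sketch of that parsing step; the class is hypothetical, not the actual GroupImageServlet code.

```java
import java.util.*;

public class QueryParams {
    // Parse a raw query string such as
    // "Type=Query&Id=3436&Rel=1.0&Id=3478&Param=15" into a multimap;
    // repeated keys (Id, Rel) are kept in arrival order.
    public static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String val = eq < 0 ? "" : pair.substring(eq + 1);
            params.computeIfAbsent(key, k -> new ArrayList<>()).add(val);
        }
        return params;
    }
}
```

In a real servlet this bookkeeping is largely provided by the Servlet API, but the sketch shows why repeated Id parameters must be treated as an ordered list rather than a single value.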
On the client side, the user interface sends queries to the server via a query proxy. This proxy translates a query request from the client into an HTTP message and sends it to the server. The proxy then receives the result from the server and passes it back to the client. The query proxy is implemented in Java (in RemoteQuerySender.java, a subclass of QuerySender) for ImageGrouper and the traditional user interface (Chapter 3), and in C++ for 3D MARS (Chapter 2).
One advantage of this configuration is that multiple user interface clients can communicate with a
single engine simultaneously.
Figure B-3. Client-Server configuration.
B.2.4 Standalone Configuration
Figure B-4 shows the structure of the Standalone configuration. The user interface and the Query Engine reside on the same machine. This configuration has a three-layer structure. When the user interface starts, it loads a query proxy named LocalQuerySender. Then, it loads the shared library libgroup.so and initializes a query engine. The proxy is implemented in Java (LocalQuerySender.java) and communicates with the Query Engine via the Java Native Interface. Because there is no network overhead, this configuration has a faster query response than the Client-Server configuration.
B.3 Building Instructions
B.3.1 Directory Structure
The directory structure of the main system package is as follows:
lib/access/
lib/shared/
server/
grouper/
tradgui/

Figure B-4. Standalone configuration with a local query engine.
Each directory contains the source code and makefiles. The Query Engine is stored in the server directory. ImageGrouper and the traditional GUI are stored in grouper and tradgui, respectively. The query proxy for the Standalone configuration (LocalQuerySender.class) is automatically copied into the classes subdirectories of tradgui and grouper.

The lib directory stores libraries that are shared by all components. The data formats exchanged between the client and the server are defined in the shared subdirectory. The template of the query proxy (Section B.2.2) is also defined there. These libraries are required to compile the other components. The query proxy for client-server communication is defined in the access subdirectory. When these libraries are compiled and installed, the makefiles copy the archived libraries (shared.jar and ServletAccess.jar) into the grouper and tradgui directories.
B.3.2 Building Systems
The included makefiles support Solaris (with the Sun Performance Library), Mac OS X, and Windows (Cygwin) at the time of writing. Users of these systems should be able to compile the system without any problems. On other systems, the user may need to edit the makefiles slightly.
B.3.2.1 Additional Libraries
In order to build the systems, the user needs to obtain linear algebra packages such as CLAPACK and CBLAS, as discussed in Section B.2.1. They are not part of the main distribution of the system. While users can easily compile these libraries themselves, compiled binaries for Solaris, Mac OS X, and Cygwin are available from the author on request. Compiled binaries of ATLAS for various platforms, including Linux and AIX, are also available from ATLAS's official web site at http://math-atlas.sourceforge.net/.
For Cygwin systems, the user also needs a Java wrapper, which is included in the cbir_support_cygwin subpackage. For more details, see README_Cygwin.txt in the top-level directory of the main distribution.
B.3.2.2 Setting Java Parameter
Because lib, grouper, and tradgui are Java2 programs, they can be compiled with the same common makefile on all platforms. The location of the Java SDK, however, varies from system to system. Therefore, the user needs to set an environment variable to match the configuration of the local system. For example, if Java is installed in /usr/local/bin, set the JAVAC environment variable from a console as follows:
% setenv JAVAC /usr/local/bin/javac
Alternatively, the user can set the variable in the makefile in the main directory.
B.3.2.3 Modifying Server Makefile
The source code in the server subdirectory is written in standard C++ and Java. The main makefile automatically chooses a suitable platform-specific makefile based on the OSTYPE environment variable. Currently, makefiles for OSTYPE = solaris (Solaris), darwin (Mac OS X), and cygwin (Windows) are included. The user may need to modify some variables in these makefiles so that the locations of the linear algebra packages match the local environment. In particular, LIBDIR specifies the location of the CLAPACK libraries, and ATLAS_TOP specifies the location of ATLAS.
B.3.2.4 Compiling
To compile all components, simply type:
% gmake all
In order to compile a specific component, for example only the lib directory, type:
% gmake lib
B.3.3 Running Systems
To run ImageGrouper and the traditional GUI, the user has to change several parameters in the HTML files, in particular the image file location and the metadata file location.
B.3.3.1 Image File Location
For example, to run the systems with images stored in /workspace/video_databases/corel/t1/jpg, the "imagebase" parameter needs to be changed to:
<PARAM NAME = imagebase VALUE = "file:///workspace/video_databases/corel/t1/jpg">
When the user uses an image database on a web server, it can be changed to, for example:
<PARAM NAME = imagebase VALUE = "http://myserver.com:8080/grouper/classes/corel/jpg">
B.3.3.2 Metadata File Location
To run the system locally on a PC, the metadata files have to be copied onto the local hard disk, and the "metapath" parameter has to be set to their location, such as:
<PARAM NAME = metapath VALUE = "/Users/myaccount/corel/">
B.3.3.3 Server URL
For the Client-Server configuration, the URL of the server (the Java Servlet) has to be specified as well. For example:
<PARAM NAME = servlet VALUE = "http://chopin.ifp.uiuc.edu:8080/servlets/GroupImageServlet">
B.3.3.4 Starting Systems
Small shell scripts (or batch files) are provided to run ImageGrouper and the traditional GUI. To run a system from a UNIX terminal, the user types:
% ./runclient.sh (for remote client)
or
% ./runlocal.sh (for local standalone client)
In Microsoft Windows,
> RUNCLIENT.BAT (for remote client)
or
> RUNLOCAL.BAT (for standalone)
In order to run the system in the Standalone configuration, the server has to be compiled as well.