
Automatic annotation of image databases based on implicit crowdsourcing, visual concept modeling and evolution




Klimis Ntalianis & Nicolas Tsapatsoulis & Anastasios Doulamis & Nikolaos Matsatsinis

© Springer Science+Business Media, LLC 2012

Abstract In this paper a novel approach for automatically annotating image databases is proposed. Unlike most current schemes, which are based solely on spatial content analysis, the proposed method properly combines several innovative modules for semantically annotating images. In particular it includes: (a) a GWAP-oriented interface for optimized collection of implicit crowdsourcing data, (b) a new unsupervised visual concept modeling algorithm for content description and (c) a hierarchical visual content display method for easy data navigation, based on graph partitioning. The proposed scheme can be easily adopted by any multimedia search engine, providing an intelligent way to annotate even completely non-annotated content or to correct wrongly annotated images. The proposed approach currently provides very interesting results on limited-size standard and generic datasets, and it is expected to add significant value especially to the billions of non-annotated images existing on the Web. Furthermore, expert annotators can gain important knowledge about new user trends, language idioms and styles of searching.

Keywords Implicit crowdsourcing · User feedback · Visual concept modeling · Clickthrough data · Automatic image annotation

Multimed Tools Appl
DOI 10.1007/s11042-012-0995-2

K. Ntalianis (*)
Department of Marketing, Technological Educational Institute of Athens, Agiou Spyridonos str., 122 10 Egaleo, Athens, Greece
e-mail: [email protected]

N. Tsapatsoulis
Department of Communication and Internet Studies, Cyprus University of Technology, 3603 Limassol, Cyprus
e-mail: [email protected]

A. Doulamis · N. Matsatsinis
Department of Production Engineering and Management, Technical University of Crete, 73100 Chania, Greece

A. Doulamis
e-mail: [email protected]

N. Matsatsinis
e-mail: [email protected]

1 Introduction

The number of images in on-line repositories (e.g. the Web, file servers, databases, etc.) grows rapidly every day. Currently, less than 10% of web multimedia files are professionally annotated, since it is practically impossible to annotate such huge amounts of content manually. For this reason, most images have no textual description or, at best, a poor one, while a part of them is wrongly annotated. Thus it is extremely difficult to efficiently search and retrieve them.

To overcome this problem, in recent years more than 200 content-based retrieval systems have been developed [33], the majority of which are based on low-level features. In particular they can be classified into two main categories: (a) those that mine semantics by analyzing associated textual information, such as annotations, assigned keywords, captions, alternative (alt) text in html pages or surrounding text and (b) those that extract low-level visual features such as color, motion and texture. Methods of the first category depend on laborious annotation, while methods of the second usually cannot capture semantics effectively. In general, neither a single low-level feature nor a combination of multiple low-level features has explicit semantic meaning. In addition, the similarity measures between visual features do not necessarily match human perception, and thus retrieval results of low-level approaches are generally unsatisfactory and often unpredictable.

Some characteristic approaches in the aforementioned directions use techniques that automatically annotate multimedia content by applying a series of image/video processing algorithms, properly combined with intelligent tools, and then extract the semantic meaning of the content [36, 51]. However, the main drawback of these methods is that they support only a limited vocabulary for the annotation, e.g., indoor/outdoor content, and they lead to high error rates, especially in cases of complicated content. Specialized research approaches have also been proposed for content of specific types, such as sports, news etc. [4, 50]. An attention model for image annotation that exploits human perception of content is reported in [38]. Another approach is to semantically tag multimedia content through multiple textual annotations [14]. However, these approaches cannot be applied to generic, highly unstructured content.

Additionally, relevance feedback algorithms have also been used for improving multimedia content retrieval; a survey can be found in [9]. In [31] it has been shown that CBIR systems supporting relevance feedback can benefit from long-term or inter-query learning, where relevance feedback interactions are logged and then periodically mined for latent structure to aid subsequent queries. However, the main functionality of such an algorithm is to modify either the query or the ranking similarity metric, in order to achieve personalized image retrieval. Therefore, the purpose of a relevance feedback algorithm is not content annotation but personalization. On the other hand, over the last decade research has moved towards automatically acquired (from the Web) data sources to be used for training concept classifiers or for retagging [29, 32, 53, 56]. Such data sources include content that has been annotated with user-defined tags (e.g., Picasa, Flickr, Yahoo! Video, Youtube etc.) as well as images and videos annotated with keywords that have been automatically extracted from the surrounding text of the corresponding Web pages. However, concept classifiers have most of the problems of traditional CBIR systems.

With the growth of search engines, search logs have increased dramatically, since millions of user-search engine interaction sessions are recorded every day. This idea of implicit crowdsourcing can be used in place of traditional relevance feedback mechanisms to serve several purposes. Joachims et al. [23] discovered that implicit and explicit relevance judgments are not as far apart as they were thought to be. This innovative finding opened a new way, where implicit relevance judgments were considered


as training data for various machine learning-based improvements to information retrieval [30, 52]. Using implicit crowdsourcing data has several advantages: it can be easily collected, it can be used as training data in various tasks, and its collection introduces no additional cognitive burden on the users performing the queries.

Furthermore, other interesting approaches include the works of [15] and [16], where the LSA algorithm has been applied to search logs in order to build a semantic space for indexing images. Finally, regarding the use of topic models, very few works have investigated their application on implicit crowdsourcing data. A characteristic approach is introduced in [28], where a probabilistic latent semantic analysis search personalization model is proposed. The model is trained on implicit relevance judgments from clickthrough data in text-based searches.

Most of the aforementioned techniques use clickthrough data just to select proper content for visual concept learning. However, concept learning alone is not enough for effectively searching multimedia databases in the long term; a textual annotation of each file should be created and maintained. To the best of the authors' knowledge, very few works incorporate implicit crowdsourcing for the textual annotation of multimedia content. In our previous works [34, 35], we focused on automatic human action recognition in video streams. In particular, action analysis and modeling was performed for walking, jogging, running, boxing, hand waving, hand clapping and dancing. Then, for the candidate video streams, key-frames were extracted and a user-interaction environment was set up, so that users could implicitly evaluate human actions.

In this paper we extend our previous works and focus on image databases, proposing a novel GWAP-like method [2, 19, 42] that introduces implicit crowdsourcing, properly combined with visual concept modeling and evolution. In particular, the main contributions of this paper include: (a) database images are split into three sets (Correctly Annotated-CA, Wrongly Annotated-WA and Non-Annotated-NA), so that, on the one hand, the evolution of each set can be individually observed and, on the other hand, the ability of the scheme to diminish set NA can be investigated, (b) the feature vector of each image is constructed by incorporating the novel visual concept distance factor, which indicates the distance between the current visual content representation and the visual models of the tags (see Section 4), (c) the concept of stability is introduced in the visual models updating mechanism, which takes into consideration the selection frequency of each individual image, based on a likelihood updating factor and (d) in the GWAP interface a novel hierarchical visual content display algorithm is proposed that is based on graph partitioning and aims at accelerating the annotation process.

In particular, for each image a feature vector is initially constructed that contains four different types of information: (a) visual descriptors that characterize the visual content of the image, (b) textual descriptors expressing the current textual annotation of the image, (c) a probabilistic score of the assigned textual information and (d) visual descriptors of the global visual concept of the textual class to which the image belongs. This feature vector is properly used and adapted during image retrieval and ranking, according to implicit crowdsourcing data. In this way, query keywords are assigned to selected (clicked) files according to an accumulative scheme, while non-annotated files are also retrieved in every iteration, based on vector similarity criteria. The effectiveness, robustness, scalability and flexibility of the proposed system are evaluated in real-world user-interaction settings.

The rest of this paper is organized as follows: Section 2 provides a summary of previous work on crowdsourcing annotation, while Section 3 sets the theoretical background. In Section 4 the proposed Human-Machine Interaction concept and its processes are analyzed, and Section 5 focuses on the visual concept model creation and adaptation methodology. Section 6 describes an efficient hierarchical ranking algorithm based on content similarity criteria. Experimental results are provided in Section 7, exhibiting the


perspective and capabilities of the proposed scheme, while in Section 8 conclusions are outlined and future research directions are discussed.

2 Previous work on crowdsourcing annotation

Although crowdsourcing annotation [17] is a fairly recent development, it is recognized as a growing and burgeoning research area, as evidenced by several works that have produced overviews of these methods from different perspectives. Crowdsourcing has magnetized the interest of several researchers and companies since, among other reasons, it is a very attractive solution to the problem of cheaply and quickly acquiring annotations. To sense the potential of crowdsourcing, consider the following observation [1]: a crowd of 5,000 people playing an appropriately designed computer game 24 h a day could be made to label all images on Google (425,000,000 images in 2005) in a matter of just 31 days! Even though such a game would have tremendous impact, generally it has become more common to acquire crowdsourcing annotations either from a pool of volunteer annotators or through a micro-task market such as Amazon's Mechanical Turk (AMT). AMT is an online labor market where workers are paid small amounts of money to complete small tasks (in our case, non-expert labelers annotate content). The design of the system is as follows: one is required to have an Amazon account to either submit tasks for annotation or to annotate submitted tasks. These Amazon accounts are anonymous, but are referenced by a unique Amazon ID. A Requester can create a group of Human Intelligence Tasks (or HITs), each of which is a form composed of an arbitrary number of questions. The user requesting annotations for the group of HITs can specify the number of unique annotations per HIT they are willing to pay for, as well as the reward payment for each individual HIT. Annotators (variously referred to as Workers or Turkers) may then carry out the tasks of their choice. Finally, after each HIT has been completed, the Requester has the option of approving the work and optionally giving a bonus to individual workers. There is a two-way communication channel between the task designer and the workers mediated by Amazon, and Amazon handles all financial transactions [18, 24, 46].

Even though currently AMT is probably the most characteristic category of crowdsourcing, there are also other forms that provide different motivations to achieve the end goal of annotation [55]. Quinn et al. [37] describe a taxonomy of "Distributed Human Computation", dividing crowdsourcing into 7 genres: Games With a Purpose (GWAP), Mechanized Labor ("MTurk"), Wisdom of the Crowds (WotC), Dual-Purpose Work, Grand Search, Human-based Genetic Algorithms and Knowledge Collection from Volunteer Contributors. These genres are decomposed along 6 dimensions, and future directions towards utilizing crowdsourcing to solve other computational problems are also discussed. In parallel, Yuen et al. [58] also surveyed crowdsourcing applications, categorizing them into 6 classes: initiatory human computation, distributed human computation, social game-based human computation with volunteers, paid engineers and online players. In this survey crowdsourcing is analyzed from a social gaming perspective, differentiating the classes based on game structure and mechanisms. Performance aspects of such systems are also examined, with references to primary studies that describe methods for best measuring the reliability of results coming from crowdsourcing frameworks.

Aside from these two surveys that examined crowdsourcing in a wider framework, studies have also analyzed specific theoretical aspects relevant to annotation tasks. In [2] general design principles for GWAPs (Games With A Purpose) are set and analytically described. It is also argued that in GWAPs the main motivator is fun, where annotation tasks are designed to provide entertainment to the human subject over the course of short sessions. Towards this direction three game templates are proposed, where the basic rules and winning


conditions are defined for each template in a way that makes it interesting for players to perform the intended computation. Metrics are also proposed to measure GWAP computation success in terms of maximizing the utility obtained per player-hour spent. Similarly, in [21] existing game-theoretic models are surveyed in order to be incorporated into crowdsourcing environments, outlining challenges towards advancing theory that can enable better design. On the other hand, WotC deployments allow members of the general public to collaborate to build a public resource, to predict event outcomes, or to estimate difficult-to-guess quantities [49]. Wikipedia, the most well-known fielded WotC application, has different motivators that have changed over time. Initially, altruism and indirect benefit were factors: people contributed articles to Wikipedia to help others but also to build a resource that would ultimately help themselves. As Wikipedia matured, the prestige of being a regular contributor or editor also narrowed the ranks of contributors from the crowd to a more stable, formalized group [48].

Several other works have been presented, illustrating the advantages and disadvantages of crowdsourcing frameworks and environments. In particular, in [44] it is shown that crowdsourced annotators are not as effective individually as experts, but when non-expert opinions are aggregated together, it is possible to produce high-quality annotations. This work thus establishes the merit of aggregating annotations, suggesting that using a large number of untrained annotators can yield annotations of quality comparable to those produced by a smaller number of trained annotators on multiple-choice labeling tasks. In [39] the problem of training a supervised learning system in the absence of ground-truth data is addressed, when all that is available is noisy label information from non-expert annotators. The authors suggest that having effective annotators is more important than data coverage, and emphasize the use of multiple annotations for each item, in conjunction with weights for annotators based on their agreement with the induced ground truth. The aim of their paper is to estimate the sensitivity and specificity of each of the annotators, and also to annotate unlabeled examples. The work of Raykar et al. should be distinguished from the earlier work described in [7], where the problem of establishing a ground truth for a set of noisy labels is addressed, rather than the further problem of annotating unlabeled examples. In that work, and for multi-valued annotations, the individual annotator accuracy is modeled by a confusion matrix. In [43] the difficulty of performance evaluation in tasks where annotations are available from multiple annotators, but no ground truth is available as a reference, is also analyzed. The authors try to integrate the opinions of many experts to determine a gold standard, also proposing that annotator consensus can be used as a proxy to measure annotation quality. Annotator quality is also modeled in [40], where it is shown as well how repeated and selective labeling increases the overall labeling quality on synthetic data. Furthermore, in [47] a method for combining prioritized lists obtained from different annotators is proposed. Annotator consistency to obtain ground truth has also been used in the context of paired games and CAPTCHAs [3]. In [57] two issues are considered, the difficulty of non-expert annotation and the ability of annotators, while in [54] a system is proposed which actively asks for the image labels that are the most informative and cost-effective.

Overall, in the recent past, the online recruitment of anonymous annotators has brought to the foreground new perspectives on distributed human computation. Usually each of these non-expert annotators provides results of medium quality, while the obtained annotations may be noisy and might require additional validation or scrutiny. Furthermore, the whole process costs money, the setup of each HIT is time-consuming, and the current methods cannot provide an annotation solution at large scale (e.g. Web-scale content annotation). To overcome the aforementioned problems, in this paper the novel idea of implicit crowdsourcing is thoroughly investigated, where content annotation is attempted by combining implicit human computation (when users interact with a search environment) and machine learning


algorithms. To the best of our knowledge there has been little work on implicit crowdsourcing models, and based on the proposed setup we try to answer the following questions: Is it possible to annotate content by implicit crowdsourcing on a Web-scale basis? How does one handle differences among users in terms of the quality of the implicit annotations they provide? How useful are noisy annotations for creating content models? Are the days of highly-paid expert annotators over?

3 Theoretical background

In this section we set the theoretical background of the current work. The following notation is used throughout the paper. We denote by Ii the i-th image (i = 1, …, NI) in the image database. The total number of images in the database is denoted by NI. The set of existing tags used for the annotation of images is represented by T = {T1, …, TNT}. Tj indicates the j-th tag, while the total number of tags is denoted by NT. Obviously T is the union of the taggings of all annotated images. Note, however, that there are no duplicate entries in T, i.e., Tj ≠ Ti for i ≠ j.

Every tag Tj is modeled by the low-level characteristics of the annotated images having that tag in their tagging. As a result the matrix $V = [V_1 \mid V_2 \mid \ldots \mid V_{N_T}]$ represents the visual models of the tags in T (Vj is therefore the visual model of tag Tj). $V_j = [V_j^1 \mid V_j^2 \mid \ldots \mid V_j^{N_C^j}]$ is a matrix whose columns correspond to the centroids of clusters of low-level vectors si of those images tagged with Tj (in this example there are $N_C^j$ different clusters for the term Tj). It is expected that images tagged with Tj may have totally different visual appearances, because this tag can be used in different contexts. For example, the tag "puma" may be used in images showing the animal or shoes and clothing of the corresponding athletic brand.

The visual description of the i-th image (MPEG-7 visual descriptors), representing its low-level characteristics, is denoted by $s_i \in \Re^L$, while the set of tags assigned to the i-th image is represented by the set $t_i = \{t_{i1}, \ldots, t_{iN_T^i}\}$. Obviously $t_i \subseteq T$ (it is a subset of the available tags) and $T = \bigcup_{i=1}^{N_I} t_i$, while $t_i = \emptyset$ for some i (not all images in the database are textually annotated). We denote by $p_i = \{p_{i1}, \ldots, p_{iN_T^i}\}$ the likelihoods that the above terms correctly describe the semantic content of the i-th image.

Finally, we denote by ci the vector showing the distance of the visual representation si of the i-th image to the existing visual models of the tags. In particular, the k-th element of vector ci is computed by:

$$c_{ik} = \min_j \, \| s_i - V_k^j \| \qquad (1)$$

and indicates the distance of the visual representation si to the visual model of the k-th tag. It should be noted here that although not all images in the database are textually annotated, it is always straightforward to compute the vector si for every image.
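To make Eq. 1 concrete, the sketch below computes the distance from an image's descriptor to the closest centroid of every tag model, i.e. the full vector ci. It is a minimal illustration assuming NumPy arrays; the variable names are ours, not the paper's:

```python
import numpy as np

def distance_to_tag_models(s_i, V):
    """Compute c_i: for each tag k, the distance from the image's
    low-level vector s_i to the closest cluster centroid of that
    tag's visual model V[k] (an L x N_C^k matrix of centroids)."""
    return np.array([min(np.linalg.norm(s_i - V_k[:, j])
                         for j in range(V_k.shape[1]))
                     for V_k in V])

# Example: L = 4 visual features, two tags with 3 and 2 clusters.
rng = np.random.default_rng(0)
V = [rng.random((4, 3)), rng.random((4, 2))]
s_i = rng.random(4)
print(distance_to_tag_models(s_i, V))  # one distance per tag
```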

4 Implicit crowdsourcing: the proposed HMI concept

In Fig. 1 an overview of the overall system is provided. Firstly, we assume an image database containing a large amount of data. Secondly, we assume that some of the images of the database are Correctly Annotated and belong to set CA, some others are Wrongly Annotated (set WA)


and the rest are Not Annotated at all (set NA). The aim of the proposed system is to correctly annotate the images of sets WA and NA, while keeping unaltered, or correctly enriching, the annotations of images belonging to set CA. For this reason, in this paper clickthrough data of informed users and visual concept models are properly combined.

In particular, each image of the database is characterized by a set of four main elements (a compact per-image record combining them is sketched after this list):

- The vector si containing visual descriptors that characterize the visual content of each specific image Ii [10].
- A set ti containing the current textual annotation of image Ii (as already mentioned, for images belonging to the NA set, ti is an empty set).
- A vector pi containing the probabilities of the assigned textual information (keywords in the ti set). It can be easily deduced that the length of vector pi is equal to the cardinality of set ti and, thus, for images belonging to the NA set it is a null vector.
- A vector ci indicating the distances of the visual representation si to the existing visual models of the tags. The individual elements of ci are computed with the aid of Eq. 1, while its length is equal to the cardinality of set T.
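The four elements above can be collected into a simple per-image record; the following sketch is purely illustrative (the field names are hypothetical, not the paper's):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ImageRecord:
    s: np.ndarray                          # visual descriptors (length L)
    t: set = field(default_factory=set)    # textual tags; empty for NA images
    p: dict = field(default_factory=dict)  # tag -> likelihood the tag is correct
    c: np.ndarray = None                   # distances to all tag models (Eq. 1)
```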

The idea of creating visual models V for concepts (tags) is further illustrated by the following example: for simplicity, let us suppose that the set ti of a specific image Ii contains just the word "tiger". The database may have several other images with the word "tiger" in their respective t-vectors. The question is: how can we create a set of vectors Vj that visually characterizes the global concept "tiger"? To do so, in this paper we initially gather all images Ii containing the word "tiger" in their t-vectors, to create the "tiger" class.

Fig. 1 System overview: the image database (split into sets CA, WA and NA), query submission, the ranking mechanism, user interaction, and the modeling process that produces the visual models V1, V2, …, VNT for the set of textual annotations T = {T1, T2, …, TNT}


Afterwards the "tiger" class is partitioned by a clustering algorithm, which is applied on the respective s-vectors of the images belonging to the class. Then, for each cluster, a representative vector is computed. Finally, the representative vectors of all clusters are gathered together to provide the Vj-matrix of the "tiger" class. In this way the constructed Vj-matrix visually models the respective textual concept. Here it should be mentioned that each cluster may contain a different number of images. For this reason, even though one representative is extracted from each cluster, the cluster's population is taken into consideration during retrieval.
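As an illustration, the construction of the V-matrix for one tag could be sketched as follows, using scikit-learn's KMeans as a stand-in for the (unspecified) clustering algorithm and keeping each cluster's population for use during retrieval; the function name and the fixed cluster count are our own assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_visual_model(s_vectors, n_clusters=5):
    """Cluster the s-vectors of all images tagged with a term and
    return (centroids, cluster_sizes): the columns of the V-matrix
    and each cluster's population."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(s_vectors)
    sizes = np.bincount(labels, minlength=n_clusters)
    return km.cluster_centers_.T, sizes  # centroids as columns, like V_j

# Example: 100 images tagged "tiger", 16-dimensional descriptors.
rng = np.random.default_rng(1)
V_tiger, sizes = build_visual_model(rng.random((100, 16)), n_clusters=4)
print(V_tiger.shape, sizes)  # (16, 4) and the four cluster populations
```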

Having created a visual model for each one of the used tags and having computed the visual description si for each image Ii, we then set up an environment that takes advantage of implicit user interaction. In particular, clickthrough data is taken into consideration: users submit textual queries on the database and may select some of the retrieved results. Usually users examine the results and download relevant content if some of the retrieved data is within their information needs and preferences. Otherwise, they submit another query for further processing. As a result, users implicitly vote on the retrieved content. The clicked content is marked as highly relevant to the user's query, while non-selected content is marked as irrelevant.

In the proposed system, when a user submits a textual query, images from two main categories are retrieved. The images of the first category have t-vectors that contain words included in the textual query. The images of the second category belong to two subcategories: (a) the first subcategory includes images from set NA and (b) the second subcategory includes images whose t-vectors differ from the textual query. Images from both subcategories are visually similar to cluster representatives of the models corresponding to the tags of the textual query (assuming that the tags in the query are already used in the image database). The more cluster representatives a visual model contains, the more results are retrieved. In the next section an analytical description of the mathematical formulation of this process is provided, while in the experimental results section the percentages of both categories and subcategories are properly set.

After retrieving results from both categories, users implicitly interact with the system and their clickthrough data is recorded. In particular, every time a user selects an image, the tags contained in the textual query are accumulated into the image's t-vector. In parallel, the visual models of the query tags are updated as described in Section 5.1. Experimental results exhibit the quality of the proposed method, which leads to significantly faster annotation rates and efficient automatic content description compared to a random retrieval scheme.

5 Visual modeling of textual semantics via clickthroughs

Let us assume that a user accessing the image database submits a query $t^Q = \{t_1^Q, \ldots, t_{N_T^Q}^Q\}$ containing $N_T^Q$ terms (tags). These tags may or may not belong to the set T of tags already in use. For every query term $t_k^Q$:

If $t_k^Q \notin T$, then M randomly selected images are retrieved and presented to the user. The t-vector (set) of the selected images is expanded to include the term $t_k^Q$, and the set of already used tags T is also expanded by adding term $t_k^Q$. The images selected by the user are clustered according to their s-vectors to create the visual model $V_{t_k^Q}$ corresponding to the term $t_k^Q$.

If $t_k^Q \in T$, then the images for which $t_k^Q \in t_i$ (say $t_k^Q = t_{im}$) are found and ranked according to the values of the likelihoods $p_{im}$ of the matched tags. The K of them with the highest likelihood are selected and returned to the user. In parallel, the L images with the lowest values $c_{il}$ are found (that is, the images whose visual representation is closest to the visual model $V_{t_k^Q}$ according to Eq. 1). The combination of images selected on the basis of textual information and visual similarity is presented to the user; thus both the textual description and the visual similarity are taken into account during retrieval, making the method less prone to erroneously tagged images.
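A minimal sketch of this retrieval step is given below; the record layout and the default values of K and L are our own illustrative choices:

```python
def retrieve_for_term(term, records, tag_index, K=100, L=60):
    """Each record is a dict: {'t': set of tags, 'p': {tag: likelihood},
    'c': list of distances to every tag's visual model (Eq. 1)}.
    tag_index is the position of `term` in the global tag set T."""
    # K images already annotated with the term, ranked by likelihood p.
    annotated = [r for r in records if term in r['t']]
    by_text = sorted(annotated, key=lambda r: r['p'].get(term, 0.0),
                     reverse=True)[:K]
    # Candidates for new annotation (NA images and differently tagged
    # ones), ranked by visual closeness to the term's model.
    others = [r for r in records if term not in r['t']]
    by_visual = sorted(others, key=lambda r: r['c'][tag_index])[:L]
    return by_text + by_visual
```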

5.1 Visual models updating

Images selected by a user based on a query term $t_k^Q$ (submitted to the system as previously described) are used to update the visual model of the corresponding term ($V_{t_k^Q}$). Given that $V_{t_k^Q}$ is composed of several columns (corresponding to cluster centroids of the images tagged by the term $t_k^Q$), we first find the column that best matches a selected image. We then use the feature vector of this image (say $s_i$) to alter the corresponding column of the model, say $V_{t_k^Q}^j$, according to the following formula:

$$V_{t_k^Q}^j = (1 - \gamma)\,V_{t_k^Q}^j + \gamma \, s_i, \qquad 0 < \gamma < 1 \qquad (2)$$

The distance ci of the visual representation si of every image to the updated model is then recomputed (this means that only the affected element of vector ci is recalculated). Furthermore, the likelihoods $p_{im}$ of the matched terms $t_{im}$ in every image selected by the user are increased, while those of non-selected images are decreased.
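A compact transcription of the Eq. 2 update, assuming the model's centroids are stored as the columns of a NumPy matrix and gamma is a fixed learning rate:

```python
import numpy as np

def update_model(V_t, s_i, gamma=0.1):
    """Eq. 2: pull the best-matching centroid (column of V_t) towards
    the descriptor s_i of an image the user selected."""
    j = int(np.argmin(np.linalg.norm(V_t - s_i[:, None], axis=0)))
    V_t[:, j] = (1.0 - gamma) * V_t[:, j] + gamma * s_i
    return V_t
```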

Let us assume that $p_{im} = n_s / n_p$, where $n_p$ is the number of times the particular image was presented to a user based on the tag $t_{im}$ and $n_s$ is the number of times this image was selected by the user (obviously $n_s \le n_p$). Once a presented image is selected by the user, the likelihood $p_{im}$ is increased:

$$p'_{im} = \frac{n_s + 1}{n_p + 1} \qquad (3)$$

while in case of no selection it is decreased:

$$p'_{im} = \frac{n_s - 1}{n_p + 1}, \qquad n_s > 1 \qquad (4)$$

Equations 3 and 4 show that the likelihood updating factor is actually a function of time. The fewer the times an image has been presented to a user, the greater the importance of his/her decision to select it or not. On the other hand, frequent selection of an image for a particular query term makes the corresponding tag a relatively stable annotation term for it.
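In code, the likelihood bookkeeping of Eqs. 3 and 4 amounts to per-(image, tag) counters; the following is a direct sketch under that assumption:

```python
def update_likelihood(n_s, n_p, selected):
    """Eqs. 3-4: n_p = times the image was presented for this tag,
    n_s = times it was selected. Returns updated counts and likelihood."""
    n_p += 1
    if selected:
        n_s += 1          # Eq. 3: p' = (n_s + 1) / (n_p + 1)
    elif n_s > 1:
        n_s -= 1          # Eq. 4: p' = (n_s - 1) / (n_p + 1)
    return n_s, n_p, n_s / n_p
```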

6 Hierarchical visual content display

In this section, we propose a novel algorithm that is used to display the retrieved visual content so that users can easily navigate it and, especially, select (annotate) data of interest. The main concept is to group the relevant content into visually semantic categories by applying image analysis techniques on the image data, estimate representative clusters, then associate non-annotated visual content with the relevant clusters and finally relate them into "representative categories", so that the relevant retrieved visual content is progressively displayed to the users, providing a spherical preview of the diversity of the relevant data. To address this, we initially apply weakly supervised learning strategies that construct groups of similar data and detect within each cluster a set of representative image frames. Conventional unsupervised strategies, such as the k-means approach [3], create clusters automatically, without the use of training knowledge, by exploiting the inherent properties of the visual data. In contrast, in the proposed architecture we rely on a small, approximate set of supervised data, progressively created through users' feedback on query selections, and the clusters are then incrementally reformed as fresh knowledge is received and new information is taken into consideration. The adoption of such a weakly supervised learning scheme is due to the fact that a purely unsupervised scheme generally finds it difficult to capture the high variations of the inherent data properties (depending on the features used to represent the visual content, a variety of groups may be apt for a cluster due to the high fluctuations of the image content, even for data belonging to the same semantic type). Thus, the proposed method initially approximates roughly the main groups of a visual semantic cluster, based on user interaction and implicit evaluation of the queries in accordance with users' selections, and then the clustering contours are incrementally updated as new data are received.

6.1 Hierarchical visual displays via graph partitioning

Let us denote in the following by $\sigma = \{\ldots, (v_i, o_i), \ldots\}$ a set that contains a small number of implicitly categorized visual data according to the profile of the "average user" or the average profile within a social group. That is, upon submission of a visual query, say $v_q$, the user, without knowing it, assesses the retrievals, since he/she selects some of the relevant data for viewing while ignoring others. In this way, we can construct sample pairs that contain the feature vectors $v_i$ used to describe the content of the selected image data and a desired output $o_i$ which relates these elements with the visual category of the submitted query, i.e., $o_i = o_q$. For the retrieved data that are ignored by the user, the desired category is instead set to be irrelevant to the submitted one.

Forming this set of data for several different queries submitted by the users, we construct a small indicative training set that includes visual data along with their visual semantic categories as implicitly identified by the user. Then, we apply a graph partitioning method [57] for the construction of the "visual clusters", based on the concept of spectral clustering partitioning [10, 54]. The goal of the proposed method is twofold. On the one hand, we need to maximize the coherence among the visual data assigned to a semantic category, while simultaneously we need to minimize the coherence among the inter-clustered data. The latter objective is of critical importance, since in this way we ascertain the maximum discrimination among the clusters.

Based on the above requirements, we infer two optimization criteria, one per requirement: (a) balancing of the clusters' cardinality, since incoherent data will be forced to construct new clusters, and (b) minimization of conflicts, so that the incoherence among the created clusters is as high as possible. Using Eq. 5 one can express the non-conflict degree among tasks that request different resources as the sum of the non-conflict degrees of all tasks that have reserved the m-th cluster with the rest, normalized over the sum of non-conflict degrees between tasks reserved for the m-th cluster and all tasks pending in the system. The corresponding equation is:

$$CH_m = \frac{\sum_{i \in C_m,\, j \in C_m} coh_{i,j}}{\sum_{i \in C_m,\, j \in W} coh_{i,j}} \qquad (5)$$

where we denote by $C_m$ the m-th cluster and by W the union of all clusters. Parameter $coh_{i,j}$ is a metric that denotes the coherence between two samples i and j in the sense of a correlation degree. In this paper the cross-correlation criterion is used as the coherence metric between the two samples. In addition, to take into account the training knowledge σ, we modify the calculation of $coh_{i,j}$ so that it takes as input the desired outputs $o_i$. Two samples that belong to the same semantic category, as indicated by the majority of users' selections, are strengthened to greater correlation degrees even if their visual content is not so related. Instead, samples that are assigned by the average user to different categories are forced to a lower correlation degree even if they are strongly related in the visual domain. This is achieved by multiplying the correlation criterion with a factor indicating the similarity of the desired content. Continuous multiplication factors can be used to indicate relations among different semantic categories as well.

Similarly, the maximization of the incoherence of the m-th cluster corresponds to the inter-cluster coherence $ICH_m$, which can be expressed as

$$ICH_m = \frac{\sum_{i \in C_m,\, j \notin C_m} coh_{i,j}}{\sum_{i \in C_m,\, j \in W} coh_{i,j}} \qquad (6)$$

The denominator of the previous equations is set so as to avoid trivial solutions of clusters containing only one element, as is proven in [54]. Summing over all data and clusters, one can derive the coherence and incoherence degrees CH and ICH respectively. Assuming that the number of clusters is constant, one can relate CH and ICH with this number, and then the maximization of (5) with a simultaneous minimization of (6) becomes equivalent to the optimization of one of the previous two criteria. However, optimization of Eqs. 5 or 6 is an NP-complete problem. Even for the toy case of two clusters, the aforementioned optimization is a tedious problem. However, we can overcome this difficulty by transforming the problem into a matrix-based representation. Then, an approximate solution in the discrete space can be found using concepts derived from eigenvalue analysis. This is achieved via a spectral clustering methodology [10, 12]. In particular, we can represent the problem via a matrix and then optimize the solution in the continuous domain by solving a generalized eigenvalue problem. Then, we discretize the solution by optimally projecting the continuous solution to the discrete domain, so as to derive the best approximate solution.
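For illustration, the sketch below approximates this pipeline with scikit-learn's SpectralClustering over a precomputed coherence matrix; the supervision-aware weighting is reduced to two scalar factors (boost, damp), which is our simplification rather than the paper's exact formulation:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def weakly_supervised_clusters(features, labels, n_clusters=3,
                               boost=2.0, damp=0.5):
    """Coherence matrix from cross-correlation of feature vectors,
    reweighted by the implicit training knowledge, then spectrally
    partitioned. labels[i] is the implicitly assigned category or None."""
    X = features - features.mean(axis=1, keepdims=True)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    coh = np.abs(X @ X.T)  # cross-correlation as the coherence metric
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if labels[i] is not None and labels[j] is not None:
                # Strengthen same-category pairs, weaken cross-category ones.
                coh[i, j] *= boost if labels[i] == labels[j] else damp
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed',
                            random_state=0)
    return sc.fit_predict(coh)
```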

6.2 Weakly supervised visual partitioning

The aforementioned clustering algorithm is static, in the sense that the training visual data is taken into account once to adapt the clustering. In this paper, however, we propose a weakly supervised method that updates the clustering contours as new data arrive at the system, without exploiting the user's feedback any more. The following approach is proposed: since new queries are submitted and new retrievals are formed, we need to modify the relations among the data, both because users' preferences may be dynamic and because a small training set cannot be representative of the visual world's complexity. The straightforward approach of adopting a huge training set to classify the data would result in enormous computational costs, which would be unaffordable for real-life application cases. For this reason, in this paper we introduce an incremental clustering approach by inducing alterations to both the degree matrix and the eigenvector matrices involved in the optimization process. As is proved in [26], it is possible to approximate the increments of the eigenvalues and the eigenvectors without needing to re-solve the generalized eigenvalue problem. More specifically, [26] provides equations to compute the updated eigenvalues and eigenvectors as a function of the changes of the affinity matrix. The equations refer to a single element's change, but when multiple changes occur, the equations can be applied iteratively. This fact allows us to refine the generalized eigenvector matrices involved in [5, 8, 10, 12, 20, 41] directly and hence to calculate the new, updated solution using an inexpensive, on-line approach. The exact subprocess of the incremental update is described by the following steps:

1. Get informed about the modified relevant/irrelevant data via the new submission of queries.
2. Update the generalized eigenvalue matrices according to the new incoming data.
3. Iteratively refine the above-mentioned matrices as sets of new samples are received as inputs to the system.
4. Apply a cluster-based algorithm to the updated solution and get the new clusters.
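As a rough sketch of step 2, first-order matrix perturbation theory gives the eigenvalue and eigenvector increments for a small symmetric change dA of the affinity matrix without re-solving the full problem. Note that this simplified form assumes a standard (not generalized) symmetric eigenproblem with well-separated eigenvalues; [26] provides the exact update equations used by the method:

```python
import numpy as np

def perturb_eigs(eigvals, eigvecs, dA):
    """First-order increments for a symmetric perturbation dA of A:
    d(lambda_k) ~= v_k^T dA v_k
    d(v_k)      ~= sum_{m != k} (v_m^T dA v_k) / (lambda_k - lambda_m) v_m
    Assumes orthonormal eigenvector columns and distinct eigenvalues."""
    new_vals, new_vecs = eigvals.copy(), eigvecs.copy()
    for k in range(len(eigvals)):
        v_k = eigvecs[:, k]
        new_vals[k] += v_k @ dA @ v_k
        for m in range(len(eigvals)):
            if m != k:
                coef = (eigvecs[:, m] @ dA @ v_k) / (eigvals[k] - eigvals[m])
                new_vecs[:, k] += coef * eigvecs[:, m]
    return new_vals, new_vecs
```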

7 Experimental results

In this section annotation results and performance evaluation are presented for the proposed system. Here it should be mentioned that it is very difficult to objectively evaluate automatic annotation systems, mainly for three reasons: (a) annotation opinions for every specific object differ among annotators (no consensus), (b) there is no straightforward way to determine and select characteristic content for testing and (c) there is no well-defined methodology to determine the proper number of test files that can provide stable results. Nevertheless, literature results during the last decade have been based either on Corel (Corel5k [13], Corel30k [6], PSU [27]) or on photo sharing sites (freefoto, flickr, photobucket etc.). However, as stated in [27], performance achieved in the case of the Corel database is optimistic, because the Corel images are known to be highly clustered; that is, images in the same category are sometimes extraordinarily alike. In the real world, annotating images with the same semantics is harder due to the lack of such high visual similarity. Another limitation of this case is that annotation words are assigned on a category basis for the Corel database. The words for a whole category are taken as ground truth for the annotation of every image in this category, and these annotations may not be complete for a particular image.

Apart from the aforementioned mainly qualitative issues, there are also quantitative ones, concerning evaluation metrics. Usually image annotation performance is evaluated by comparing the automatically generated captions with the human-produced ground truth. In this direction several interesting methodologies have been proposed in the literature [13, 27], such as average precision, average recall, number of words with non-zero recall (NZR), mean average precision (ranked retrieval), mean average precision for NZR (ranked retrieval), percentage of images correctly categorized and percentage of the ground-truth annotations that match the computer annotations.

By taking into consideration all previously mentioned issues and based both on the evaluation philosophy of [27] and on the nature of the designed architecture, in this paper the


performance of the proposed scheme is tested in terms of precision and recall over a general-purpose image database, containing 51000 images that were selected from www.flickr.com. These images had been manually annotated by flickr's users and were assigned between 1 and 27 keywords (5.3 keywords on average), all of which were gathered to create a dictionary of 7814 records. Afterwards, 184 university students volunteered to interact on-line with our typical content retrieval and presentation interface [34] for a period of one semester (almost 4 months). The users were well informed about the experiment and the recording procedure of their clickthrough data. They were also asked to have a look at the created dictionary and were advised not to click in a random way but to make informed selections.

During initialization of our experiments, the image set was split into 8 non-overlapping, randomly selected subsets of doubling size (SB1: 200, SB2: 400, SB3: 800, SB4: 1600, SB5: 3200, SB6: 6400, SB7: 12800 and SB8: 25600), so that we could observe the evolution of the process at increasing scale. Furthermore, since our main aim was to study the evolution of non-annotated content, 6 different initialization scenarios (Scni) were designed for different compositions of sets CA and NA, while WA was kept constant and equal to 10% (Scn1: CA=80%, NA=10%; Scn2: CA=70%, NA=20%; Scn3: CA=60%, NA=30%; Scn4: CA=50%, NA=40%; Scn5: CA=40%, NA=50%; Scn6: CA=30%, NA=60%). Here it should be mentioned that the images of each subset SBi were randomly split into sets CA, NA and WA for each of the 6 different scenarios, abiding by the following methodology: (a) all images of each SB were classified to set NA at least for one session, (b) images classified to set CA kept their manual annotation (ground-truth keywords), (c) for images classified to set NA the manual annotation was discarded and (d) for images classified to set WA the manual annotation was discarded and 1 to 5 random keywords that did not exist in the created dictionary were assigned to them.
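The initialization methodology can be summarized by the following sketch (illustrative only; the junk-keyword generator merely stands in for keywords outside the created dictionary):

```python
import random

def initialize_subset(images, ca_frac, na_frac, seed=0):
    """Split one subset SBi into CA / NA / WA for a scenario.
    images: list of (image_id, ground_truth_keywords); WA takes the
    remainder (10% in all scenarios of the paper)."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_ca, n_na = int(ca_frac * n), int(na_frac * n)
    CA = {img: set(kws) for img, kws in shuffled[:n_ca]}        # keep ground truth
    NA = {img: set() for img, _ in shuffled[n_ca:n_ca + n_na]}  # discard annotation
    WA = {img: {f"junk{rng.randrange(10**6)}"                   # 1-5 keywords not
                for _ in range(rng.randint(1, 5))}              # in the dictionary
          for img, _ in shuffled[n_ca + n_na:]}
    return CA, NA, WA
```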

For comparison purposes, the whole experiment was repeated twice under exactly the same initialization conditions for sets CA, NA and WA. The first time (first experimentation phase), for each different SBi and Scni, images of set WA were not retrieved for irrelevant keywords, while images of set NA were retrieved at random, irrespective of the search keywords a user used. The second time (second experimentation phase), images were retrieved using the proposed method, taking into consideration the visual concept modeling technique for content ranking. For each SBi, the number of recorded sessions that were kept was set equal to three times the size of SBi (e.g. for each Scni of SB1, 600 sessions (3×200) were kept, thus 3600 for the 6 different scenarios and 7200 after repetition of the whole experiment). As a result, in total 918000 recorded sessions were kept for the first and another 918000 for the second execution of the whole experiment.

Another challenge faced by the proposed system concerned content presentation issues. In real-world search environments it is very difficult to compactly and meaningfully present large quantities of files so that users can easily select the relevant ones. As a result, files are usually presented in multiple pages and, according to studies [22], users are not willing to browse more than 10 pages. In practice this means that for large sets (e.g. with more than 3000 images) several files may never be viewed and thus never be selected/annotated. For these reasons the novel hierarchical visual content display mechanism described in Section 6 was used, which gives all files the opportunity to be viewed, while at the same time grouping semantically similar content. In the specific experiments, in each query just 200 images were retrieved and presented in 10 consecutive pages. The set of 200 files was produced in each query by properly mixing images from the three image sets (CA, NA, WA) of each SBi. After extensive experimentation and based on users' comments, we settled on the mixing analogy of 50-40-10, where 50% of the retrieved results came


from CA (having t-vectors that contained words of the textual query), 40% from NA and 10% from WA. In experiments where CA surpassed 50%, the annotation evolution slowed down noticeably, while for contributions of CA less than 30% the users complained of cognitively incoherent results. Here it should also be mentioned that the content of each set was hierarchically retrieved and mixed in each results page using the same mixing analogy (50-40-10). In case a set was exhausted for a specific query, the hierarchical visual content display algorithm continued with the next set. With this setup an efficient system tuning was achieved, able to illustrate the capability of the proposed methodology mainly in annotating non-annotated content.
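A sketch of the per-page mixing, assuming ranked candidate lists for each set and 20 images per page (the page size and fallback order are our own assumptions; the paper fixes only the 50-40-10 analogy):

```python
def mix_results(ca_ranked, na_ranked, wa_ranked, page_size=20,
                shares=(0.5, 0.4, 0.1)):
    """Fill one results page with 50% CA, 40% NA and 10% WA images,
    topping the page up from the remaining pools when a set is exhausted."""
    pools = [list(ca_ranked), list(na_ranked), list(wa_ranked)]
    quotas = [int(page_size * s) for s in shares]
    page = []
    for pool, quota in zip(pools, quotas):
        page.extend(pool[:quota])
        del pool[:quota]
    for pool in pools:  # fallback when a set ran short
        while len(page) < page_size and pool:
            page.append(pool.pop(0))
    return page
```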

In particular, during the first (second) experimentation phase, users selected 1.3 (2.9) images per session on average, used 5557 (4984) different keywords during searching, and about 24% (63%) of the non-annotated images were correctly annotated on average, receiving 3.9 (3.4) keywords. In this paper, by the term "correctly annotated" we mean that at least one automatically assigned keyword matched the manual annotation of the respective image. Here it should be mentioned that a significant number of images were also properly annotated using different keywords. Nine of the images correctly annotated during the second experimentation phase are presented in Fig. 2 together with their assigned keywords. The keyword order of appearance corresponds to the weights in decreasing order, according to the incorporated accumulative process. Regarding the agreement of the proposed system with the assigned keywords of flickr, we have noticed that the manual annotation of flickr is much more analytical and provides more keywords, compared to the recorded sessions. For example, in the case of the third image of Fig. 2 the tags of flickr were "Pyramid, egypt, giza, desert, interestingness, Gíze, Gizan, Gizeh, Caire, pirámides, necropolis, Cairo, Гизe, Egipt". As can be observed, most of the words refer to the real content of the image in different languages. Even though only the word "Egypt" is the same for the proposed approach, the words "pyramids" and "Gizah" have the same semantic meaning as the words "pyramid, pirámides" and "giza, Gíze, Gizan, Gizeh" respectively. In Fig. 3(a) and (b) the number of images that are assigned the same keywords, both by flickr.com and either by the proposed system (Ph2 (PSC)) or by the scheme of the first experimentation phase (Ph1), is presented. As can be observed, most frequently 1 or 2 words are common in both cases, while 2471 (5709) images do not share any common annotation keywords during Ph1 (Ph2). The complete distributions of the numbers of images correctly annotated by the proposed system and by the scheme of the first experimentation phase are presented in Tables 1 and 2 respectively.

Now, regarding the evolution of the annotation process, there were differences for each SBi and Scni. In particular, in subsets with small numbers of images and for scenarios with prevailing NA sets (Fig. 4(a)), the annotation evolution was faster compared to cases of prevailing CA sets. However, for the recorded sessions, the more prevailing the CA set, the more images were finally annotated. This can be explained by considering the limited presentation space and the tendency of users to make informed selections. Analytical results for both experimentation phases and for all sets are presented in Fig. 4. Due to space limitations and to better visualize the results for both phases, only Scn1, Scn4 and Scn6 are presented for subsets SB1, SB4, SB6 and SB8. As can be observed in Fig. 4(a)–(d), for small sets (e.g. SB1) the difference between the two phases (random and proposed) is less significant. However, when the set size increases, the performance difference becomes clearer. This is due to the fact that the visualization space is limited and the random retrieval method, in contrast to the proposed scheme, does not group visually similar images, a practice that provides better annotation results. Furthermore, for large sets (e.g. SB8) the annotation rate


for both methods seems to decelerate, which probably means that the remaining NA content is not interesting. On the other hand, for small sets (e.g. SB1) almost all NA images are annotated by the proposed scheme. This can be explained by the fact that users do not have many choices when searching the same limited subset again and again, so it is more likely that they also select content that is not of their direct interest. Here it should also be mentioned that, based on Fig. 4(a)–(d), if the experiments had continued further (and more sessions had been recorded), it is anticipated that more images would have been correctly annotated, even though the proposed method does not guarantee that all images will finally be annotated.

Fig. 2 Nine of the 26531 images that were correctly annotated during the second experimentation phase. Associated keywords are presented in weight order; keywords in brackets were manually assigned by flickr's users and used as ground truth:
1. Office, PC, work (desk, office, n95)
2. Beach, sea, exotic, vacation (Meeru, Maldives, paradise, beach, exotic, tropic)
3. pyramids, Egypt, Gizah (Pyramid, egypt, giza, desert, interestingness, Gíze, Gizan, Gizeh, Caire, pirámides, necropolis, Cairo, Гизe, Egipt)
4. spaghetti, food (spaghetti, meatballs, sauce)
5. Mountain, snow, landscape (Mountain, Haines, Alaska)
6. Flowers, garden, nice (Garden, leisure, cottage, cottage garden, summertime)
7. Horse, sports, leisure (Hungary, rider, horse, Olympus, c-740, riding-school, equestrian)
8. Protest, crowd (protest, Madison, WI, firefighters)
9. Submarine, sea, war, military (submarine, Tugboat, Cabrillo National Monument, San Diego Bay)

Furthermore, the proposed scheme can be beneficial to the WA set as well. In particular, during the first (second) experimentation phase about 2% (0.7%) or 561 (196) images moved from the CA to the WA set on average, while about 3.6% (16.3%) of images moved from the WA to the CA set on average, meaning that the proposed scheme shows a clear (even though slow) tendency not only to annotate non-annotated images but also to correct the annotations of wrongly annotated ones. This is due to the fact that it groups images of similar content, even if they have different annotation keywords (for images of CA and WA).

Table 1 Distribution of the number of correctly added keywords by implicit crowdsourcing – Ph2 (PSC)

# of keywords    0     1      2     3     4     5     6    7    8    9    10
# of images      5709  10316  6126  4513  2450  1418  645  548  322  161  32
Percentage (%)   15.3  27.6   16.4  12.1  6.6   3.8   1.7  1.5  0.9  0.4  0.1

Fig. 3 Same annotation keywords for flickr.com and (a) the proposed scheme (PSC), where 1 and 2 words are encountered in 16025 images, while 8, 9 and 10 words in just 515 images, and (b) the scheme of the first experimentation phase, where 1 and 2 words are encountered in 6067 images, while 8, 9 and 10 words in just 141 images


Table 2 Distribution of the number of correctly added keywords by implicit crowdsourcing – Ph1

# of keywords    0     1     2     3     4     5    6    7    8    9   10
# of images      2471  3602  2465  1591  1012  492  174  169  101  36  4
Percentage (%)   20.4  29.7  20.3  13.1  8.3   4.1  1.4  1.4  0.8  0.3 0.0

Fig. 4 Evolution of annotation for both phases (Ph1: Phase 1, Ph2 (PSC): Phase 2 – proposed scheme) for sets (a) SB1, (b) SB4, (c) SB6 and (d) SB8. For each set, three scenarios are presented: Scn1, Scn4 and Scn6

To further evaluate the annotation performance of the proposed scheme, the mean precision and recall metrics are used [13], averaged per word for each annotated file. In particular, let A be the number of images annotated with a given word by implicit crowdsourcing, B the number of images correctly annotated with that word and C the number of images having that word in the ground-truth annotation (manual annotation by flickr's users). Then precision = B/A and recall = B/C. In Figs. 5 and 6 it can be observed that the proposed scheme surpasses the performance of the scheme used during the first experimentation phase, providing 44.03% average precision and 46.01% average recall, while Ph1 provides 31.84% and 32.55% respectively. Here it should be mentioned that in reality the precision of the proposed scheme is much higher than 44%, since users made informed selections. This is also evident from the 5709 files that had zero words in common; in this case too, for most files, users assigned words similar to the ground truth. The same situation occurred where relevant-to-content words were used by the users which, however, did not agree with the words used by flickr's users. Nevertheless, semantic precision and recall have not been considered and/or defined in the literature; such concepts could provide a clearer evaluation of the proposed and similar content annotation schemes.
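In code, the per-word metrics reduce to simple set operations (a sketch; `assigned` and `ground_truth` are hypothetical mappings from each word to sets of image ids):

```python
def word_precision_recall(word, assigned, ground_truth):
    """precision = B/A, recall = B/C for one word, where
    A = images annotated with the word by implicit crowdsourcing,
    B = images correctly annotated with it,
    C = images carrying it in the ground-truth annotation."""
    A = assigned.get(word, set())
    C = ground_truth.get(word, set())
    B = A & C
    precision = len(B) / len(A) if A else 0.0
    recall = len(B) / len(C) if C else 0.0
    return precision, recall
```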

Finally, it should be mentioned that it is very difficult to evaluate and accordingly weight the perception of users so as to effectively filter out wrong selections, something which is out of the scope of this paper. However, in our experiments this issue was partly confronted by recording and accumulating all opinions, which eventually led to annotation based on the average user's preferences, selections and understanding.



8 Conclusions and future work

Millions of text queries are submitted every day to perform image searches over the Internet. However, since a large number of images are not annotated, content retrieval is a very challenging task. To ensure easy sharing and effective searching of this extremely vast number of online images, efficient systems for automatic keyword annotation should be implemented.

Towards this direction, in this paper we have proposed an efficient scheme for automatically annotating image databases based on implicit crowdsourcing. The whole iterative process depends on recording and analyzing user queries and the respective

Fig. 5 Evolution of the mean precision averaged per word for each annotated file (Ph1: Phase 1, Ph2 (PSC): Phase 2 – proposed scheme)

Fig. 6 Evolution of the mean recall averaged per word for each annotated file (Ph1: Phase 1, Ph2 (PSC): Phase 2 – proposed scheme)


images that are selected. In particular, each time a user submits a query and selects some of the retrieved images, each selected image is assigned the query keywords according to an accumulative scheme. In this way user perception is implicitly incorporated for automatically annotating images. To achieve our goal we have also proposed a content modeling algorithm and an efficient ranking module, which works within the framework of a typical search window. The performance of the proposed scheme has been extensively evaluated using 51000 images from flickr.com. Experimental results illustrate the promising performance of the proposed approach, which led to the annotation of 32240 images, or about 63% of the total set, after 918000 recorded sessions. The number of annotated images presents an increasing tendency, which however does not guarantee that all images will eventually be annotated. Furthermore, the proposed scheme performs better than the most sophisticated of the random retrieval methods tested during the first experimentation phase (where images belonging to the NA set are retrieved at random, while content from the WA set is not retrieved). In terms of mean precision and recall averaged per word, the proposed scheme also surpassed the performance of the first experimentation phase scheme, providing about 44% average precision and about 46% average recall, while the Ph1 scheme provided about 32% in both metrics. Moreover, according to the assigned keywords, in reality the proposed scheme works even better, since assigned words may not be identical to the ground-truth keywords but in most cases have similar meanings.

Another merit of the proposed scheme is that it is able to correct even erroneously annotated files: in our experiments about 16% of the images belonging to set WA moved to set CA. It should be mentioned, however, that the performance of the proposed scheme heavily depends on the selections of the participating users. In our experiments users were well informed about the ground-truth keyword dictionary and were also asked to make careful selections, so as to avoid introducing significant noise.

In the future it would be very interesting to test the proposed scheme in an even more realistic setup, over larger databases. In the performed experiments the sets CA, NA and WA were well defined, something that does not hold in real-world conditions. Another important direction includes the examination of algorithms and methodologies for noise reduction (random clicks, user mistakes etc.) during implicit crowdsourcing. Furthermore, innovative social media concepts can also be examined, such as those described in [11, 25, 45]. According to these ideas, maximization of the commonality of interests between users within social groups is attempted, and influence patterns in topic communities are also investigated. As a result, in the framework of automatic annotation, a social group can provide "local" annotations and an aggregation of all "local" annotations can serve as a "global" annotation, as sketched below. Additionally, since semantic precision and recall metrics have not been extensively examined in the literature, it would also be interesting to investigate these concepts, so that a clearer and more integrated evaluation of the proposed and other similar schemes can be performed. Finally, other modalities (e.g. text within the image, surrounding HTML text etc.) could also be taken into consideration for feature vector formulation in different kinds of image databases.
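The following fragment illustrates one possible reading of the local-to-global aggregation idea. It is purely a sketch of the future-work concept, not part of the proposed system, and simply sums per-group keyword weights; weighting groups by size or reliability would be a natural refinement:

```python
from collections import Counter

def global_annotation(local_annotations, top_k=5):
    """Merge per-group ("local") keyword weights for one image
    into a single "global" annotation by summing the weights."""
    merged = Counter()
    for group_votes in local_annotations:
        merged.update(group_votes)
    return [w for w, _ in merged.most_common(top_k)]

# Two hypothetical social groups annotating the same image
print(global_annotation([Counter(sea=4, boat=1), Counter(sea=2, ship=3)]))
# -> ['sea', 'ship', 'boat']
```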

Acknowledgments This work falls under the Cyprus Research Promotion Foundation's Framework Programme for Research, Technological Development and Innovation 2008 (DESMI 2008), co-funded by the Republic of Cyprus and the European Regional Development Fund, and specifically under Grant ANTHRO/0308 (BIE)/04.


References

1. Ahn LV, Dabbish L (2004) Labeling images with a computer game. In: Proceedings of CHI, pp 319–326
2. Ahn LV, Dabbish L (2008) Designing games with a purpose. Commun ACM 51(8):58–67
3. Ahn LV, Maurer B, McMillen C, Abraham D, Blum M (2008) reCAPTCHA: human-based character recognition via web security measures. Science 321(5895):1465–1468
4. Assfalg J, Bertini M, Colombo C, Del Bimbo A (2002) Semantic annotation of sports videos. IEEE Multimed 9(2):52–60
5. Bach FR, Jordan MI (2004) Learning spectral clustering. In: Thrun S, Saul L, Schoelkopf B (eds) Advances in Neural Information Processing Systems (NIPS) 16 (long version)
6. Carneiro G, Chan AB, Moreno PJ, Vasconcelos N (2006) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410
7. Dawid A, Skene A (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J R Stat Soc Ser C Appl Stat 28(1):20–28
8. Delias P, Doulamis A, Doulamis N, Matsatsinis N (2011) Optimizing resource conflicts in workflow management systems. IEEE Trans Knowl Data Eng 23(3):417–432
9. Doulamis N, Doulamis A (2006) Evaluation of relevance feedback schemes in content-based retrieval systems. Signal Process Image Commun 21(4):334–357
10. Doulamis A, Doulamis N, Kollias S (2000) Non-sequential video content representation using temporal variation of feature vectors. IEEE Trans Consum Electron 46(3):758–768
11. Doulamis N, Dragonas J, Doulamis A, Miaoulis G, Plemenos D (2009) Machine learning and pattern analysis methods for profiling in a declarative collaborative framework. Stud Comput Intell 240:189–206
12. Duda RO, Hart P, Stork D (2004) Pattern classification, 2nd edn. Wiley-Interscience
13. Feng S, Manmatha R, Lavrenko V (2004) Multiple Bernoulli relevance models for image and video annotation. In: Proc IEEE Conf on Computer Vision and Pattern Recognition
14. Gao S, Wang D-H, Lee C-H (2006) Automatic image annotation through multi-topic text categorization. In: Proceedings of the 2006 IEEE ICASSP, vol 2, Toulouse, France, May 2006
15. He X, King O, Ma W-Y, Li M, Zhang H-J (2003) Learning a semantic space from user's relevance feedback for image retrieval. IEEE Trans Circuits Syst Video Technol 13(1):39–48
16. Heisterkamp D (2002) Building a latent-semantic index of an image database from patterns of relevance feedback. In: Proceedings of the 16th ICPR, pp 134–137
17. Howe J (2008) Crowdsourcing: why the power of the crowd is driving the future of business, 1st edn. Crown Business
18. Hsueh P-Y, Melville P, Sindhwani V (2009) Data quality from crowdsourcing: a study of annotation selection criteria. In: Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, Boulder, Colorado, June 2009
19. http://www.gwap.com/gwap/, Oct 2011
20. Huazhong N, Wei X, Yun C, Yihong G, Huang TS (2010) Incremental spectral clustering by efficiently updating the eigen-system. Pattern Recogn 43(1):113–127
21. Jain S, Parkes DC (2009) The role of game theory in human computation systems. In: ACM KDD Workshop on Human Computation, pp 58–61
22. Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, pp 133–142
23. Joachims T, Granka L, Pang B, Hembrooke H, Gay G (2005) Accurately interpreting clickthrough data as implicit feedback. In: Proceedings of the 28th Annual International ACM SIGIR Conference, pp 154–161
24. Kaisser M, Hearst M, Lowe JB (2008) Evidence for varying search results summary lengths. In: Proceedings of ACL
25. Karamolegkos PN, Patrikakis CZ, Doulamis ND, Vlacheas PT, Nikolakopoulos IG (2009) An evaluation study of clustering algorithms in the scope of user communities assessment. Comput Math Appl 58(8)
26. Karypis G, Kumar V (1995) Analysis of multilevel graph partitioning. In: ACM Conference on Supercomputing. ISBN 0-89791-816-9
27. Li J, Wang J (2003) Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans Pattern Anal Mach Intell 25(9):1075–1088
28. Lin C, Xue G-R, Zeng H-J, Yu Y (2005) Using probabilistic latent semantic analysis for personalized web search. In: Web Technologies Research and Development – Proceedings of the 7th Asia-Pacific Web Conference, LNCS vol 3399, pp 707–717
29. Liu D, Yan S, Hua X-S, Zhang H-J (2011) Image retagging using collaborative tag propagation. IEEE Trans Multimed 13(4):702–712
30. Macdonald C, Ounis I (2009) Usefulness of quality click-through data for training. In: Proceedings of the 2009 Workshop on Web Search Click Data, pp 75–79
31. Muller H, Pun T, Squire D (2004) Learning from user behavior in image retrieval: application of market basket analysis. Int J Comput Vis 56(1–2):65–77
32. Natsev A, Jiang W, Merler M, Smith JR, Tesic J, Xie L, Yan R (2008) IBM Research TRECVID-2008 video retrieval system. In: Proc of TRECVID 2008
33. Nesvadba J (2007) From push-based passive content consumption to pull-based content experiences. Panel presentation at the 8th IEEE WIAMIS, Santorini, Greece
34. Ntalianis K, Doulamis A, Tsapatsoulis N, Doulamis N (2010) Human action analysis, annotation and modeling in video streams based on implicit user interaction. Multimed Tools Appl 50(1):199–225
35. Ntalianis KS, Tsapatsoulis N, Doulamis A, Doulamis N (2008) Event recognition in video streams based on user cognition. In: Proc of the 1st ACM Int Workshop on Analysis and Retrieval of Events/Actions and Workflows in Video Streams, Vancouver, Canada, October 2008
36. Petridis K, Kompatsiaris I, Strintzis MG, Bloehdorn S, Handschuh S, Staab S, Simou N, Tzouvaras V, Avrithis Y (2004) Knowledge representation for semantic multimedia content analysis and reasoning. In: Proceedings of the 2004 European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, London, UK, pp 33–46
37. Quinn AJ, Bederson BB (2009) A taxonomy of distributed human computation. Technical Report, University of Maryland, College Park
38. Rapantzikos K, Tsapatsoulis N, Avrithis Y, Kollias S (2007) Bottom-up spatiotemporal visual attention model for video analysis. IET Image Process 1(2):237–248
39. Raykar V, Yu S, Zhao L, Jerebko A, Florin C, Valadez G, Bogoni L, Moy L (2009) Supervised learning from multiple experts: whom to trust when everyone lies a bit. In: Proc 26th Annual International Conference on Machine Learning (ICML'09), pp 889–896
40. Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? Improving data quality and data mining using multiple, noisy labelers. In: KDD 2008
41. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
42. Siorpaes K, Hepp M (2008) Games with a purpose for the semantic web. IEEE Intell Syst 23(3):50–60
43. Smyth P, Fayyad U, Burl M, Perona P, Baldi P (1995) Inferring ground truth from subjective labeling of Venus images. In: Advances in Neural Information Processing Systems (NIPS'95)
44. Snow R, O'Connor B, Jurafsky D, Ng A (2008) Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, pp 254–263
45. SocIoS: FP7 STREP project, September 2010 – February 2013, http://www.sociosproject.eu/
46. Sorokin A, Forsyth D (2008) Utility data annotation via Amazon Mechanical Turk. In: IEEE Workshop on Internet Vision at CVPR 2008
47. Spain M, Perona P (2008) Some objects are more equal than others: measuring and predicting importance. In: ECCV
48. Suh B, Convertino G, Chi EH, Pirolli P (2009) The singularity is not near: slowing growth of Wikipedia. In: Proceedings of the 5th International Symposium on Wikis and Open Collaboration (WikiSym'09), New York, USA, pp 1–10
49. Sunstein CR (2006) Infotopia: how many minds produce knowledge. Oxford University Press. ISBN 0195189280
50. Tsapatsoulis N, Petridis S (2007) Classifying images from athletics based on spatial relations. In: Proc of the 2nd International Workshop on Semantic Media Adaptation and Personalization, pp 92–97, Dec 2007
51. Tseng VS, Su J-H, Huang J-H, Chen C-J (2008) Integrated mining of visual features, speech features, and frequent patterns for semantic video annotation. IEEE Trans Multimed 10(2):260–267
52. Tsikrika T, Diou C, de Vries AP, Delopoulos A (2009) Image annotation using clickthrough data. In: Proceedings of the 8th International Conference on Image and Video Retrieval
53. Ulges A, Koch M, Schulze C, Breuel T (2008) Learning TRECVID'08 high-level features from YouTube. In: Proc of TRECVID 2008
54. Vijayanarasimhan S, Grauman K (2009) What's it going to cost you? Predicting effort vs. informativeness for multi-label image annotations. In: CVPR, pp 2262–2269
55. Wang A, Hoang CDV, Kan M-Y (2010) Perspectives on crowdsourcing annotations for natural language processing. Technical Report, School of Computing, National University of Singapore
56. Wang X-J, Ma W-Y, Li X (2006) Exploring statistical correlations for image retrieval. Multimed Syst 11(4):340–351
57. Whitehill J, Ruvolo P, Wu T, Bergsma J, Movellan J (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: NIPS
58. Yuen M-C, Chen L-J, King I (2009) A survey of human computation systems. In: IEEE International Conference on Computational Science and Engineering, pp 723–728

Klimis S. Ntalianis (S'99) was born in Athens, Greece, in 1975. He received the Diploma degree and the Ph.D. degree in electrical and computer engineering, both from the National Technical University of Athens (NTUA), Athens, Greece, in 1998 and 2002 respectively. His Ph.D. studies were supported by the National Scholarships Foundation and the Institute of Communications and Computer Systems of the NTUA. During the last decade Dr. Ntalianis has received six prizes for his academic achievements. He is the author of more than 80 scientific articles, and his research interests include multimedia annotation systems, 3-D image processing, video organization and multimedia security. He is currently an Assistant Professor at the Department of Marketing of the Technological Educational Institute of Athens.

Nicolas Tsapatsoulis graduated from the Department of Electrical and Computer Engineering of the National Technical University of Athens in 1994 and received his Ph.D. degree in 2000 from the same university. He worked at the School of Electrical and Computer Engineering of the National Technical University of Athens (2000–2002) as a Research Assistant, being responsible for the project "ORESTEIA" IST-2000-26091, and at the Institute of Communications and Computer Systems, Athens, Greece, as a Class C' Researcher (2002–2003). During the academic year 2003–2004 he was a Visiting Lecturer at the Computer Science Department of the University of Cyprus, and in the academic year 2004–2005 he served as a Visiting Assistant Professor at the same department. Since September 2005 he has worked as Technical Manager for the CRPF project OPTOPOIHSH: Development of knowledge-based Visual Attention models for Perceptual Video Coding. In 2006 he was elected Assistant Professor at the Cyprus University of Technology in the field of multimedia content analysis. He is a member of the Technical Chambers of Greece and Cyprus and a member of the IEEE Signal Processing and Computer societies. He has published fourteen papers in international journals, eight papers in books and more than 47 papers in proceedings of international conferences. His research has been recognized by the international research community through more than 212 citations. He served as Technical Program Co-Chair for the VLBV'01 workshop and for the Biometrics workshop of 2009.

Anastasios D. Doulamis received the Diploma degree in Electrical and Computer Engineering from the National Technical University of Athens (NTUA) in 1995 with the highest honors. In 2000 he received the Ph.D. degree in electrical and computer engineering from the NTUA. From 1996 to 2000 he was with the Image, Video and Multimedia Lab of the NTUA as a research assistant. From 2001 to 2002 he served his mandatory duty in the Greek Army, in the computer center department of the Hellenic Air Force, and in 2002 he joined the NTUA as a senior researcher. His Ph.D. thesis was supported by the Bodosakis Foundation Scholarship. Since 2006 he has been an Assistant Professor at the Technical University of Crete in the area of multimedia systems. Dr. Doulamis has received several awards and prizes during his studies, including recognition as the Best Greek Student in the field of engineering at the national level in 1995, the Best Graduate Thesis Award in the area of electrical engineering jointly with N. Doulamis in 1996, and several prizes from the National Technical University of Athens, the National Scholarship Foundation and the Technical Chamber of Greece. In 1997 he was awarded the NTUA Medal as Best Young Engineer. In 2000 he received the Best Ph.D. Thesis Award from the Thomaidion Foundation jointly with N. Doulamis. In 2001 he served as technical program chairman of VLBV'01. He has also served on the program committees of several international conferences and workshops. He is a reviewer for IEEE journals and conferences as well as for other leading international journals. He is the author of more than 200 papers in the above areas, in leading international journals and conferences. His research interests include non-linear analysis, neural networks, multimedia content description and intelligent techniques for video processing.


Nikolaos Matsatsinis (M'02) received his B.A. in Physics from the Aristotle University of Thessaloniki (1980) and his Ph.D. in Intelligent Decision Support Systems from the Technical University of Crete (1995).

He has over twenty-eight years of experience in information and decision systems development. He is a full Professor of Information and Intelligent Decision Support Systems at the Department of Production Engineering and Management of the Technical University of Crete, Greece. He is Chairman of the Department, Director of the Decision Support Systems Lab and President of the Hellenic Operational Research Society (HELORS). He has contributed as scientific or project coordinator to over forty scientific projects. Professor Matsatsinis is a member of the international advisory boards of three scientific journals, belongs to several scientific groups and has rich teaching experience. He is the author or co-author of fifteen books and over sixty-eight articles in international scientific journals and books. His research interests fall into the areas of Decision Support Systems, Artificial Intelligence and Multi-Agent Systems, e-Business, e-Marketing, Multicriteria Decision Analysis, Group Decision Support Systems and Workflow Management Systems.

Prof. Matsatsinis is a member of the Institute of Electrical and Electronics Engineers (IEEE); member of the American Association for Artificial Intelligence (AAAI); member of IFIP and CEPIS; member of the International Consortium for Electronic Business; member of INFORMS, EURO, the Euro Working Group on Multicriteria Decision Aid and the Euro Working Group on Decision Support Systems; and member of the European Federation for Information Technology in Agriculture, Food and the Environment (EFITA).
