Prediction of Search Targets From Fixations in Open-World Settings

Hosnieh Sattar1,2, Sabine Müller1, Mario Fritz2, Andreas Bulling1

1Perceptual User Interfaces Group, 2Scalable Learning and Perception Group, Max Planck Institute for Informatics, Saarbrücken, Germany
{sattar,smueller,mfritz,bulling}@mpi-inf.mpg.de

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

Previous work on predicting the target of visual search from human fixations ([2], [1]) only considered closed-world settings. In this work we go beyond the state of the art by studying search target prediction in an open-world setting in which we no longer assume that we have fixation data to train for the search targets. We present a dataset containing fixation data of 18 users searching for natural images from three image categories within synthesised image collages of about 80 images.

Figure 1: Experiments conducted in this work. In the closed-world experiment we aim to predict which target image (here $Q_2$) out of a candidate set of five images $Q_{train} = Q_{test}$ the user is searching for by analysing fixations $F_i$ on an image collage $C$. In the open-world experiments we aim to predict $Q_i$ on the whole $Q_{test}$.

Given a query image $Q \in \mathcal{Q}$ and a stimulus collage $C \in \mathcal{C}$, during a search task participants $P \in \mathcal{P}$ perform fixations $F(C, Q, P) = \{(x_i, y_i, a_i),\; i = 1, \ldots, N\}$, where each fixation is a triplet of positions $x_i, y_i$ in screen coordinates and appearance $a_i$ at the fixated location. To recognise search targets we aim to find a mapping from fixations to query images (Figure 1). We use a bag-of-visual-words featurisation $\phi$ of the fixations. We interpret fixations as key points around which we extract local image patches. These are clustered into a visual vocabulary $V$ and accumulated in a count histogram. This leads to a fixed-length vector representation of dimension $|V|$, commonly known as a bag of words. Therefore, our recognition problem can more specifically be expressed as:

$$\phi(F(C, Q, P), V) \mapsto Q \in \mathcal{Q} \qquad (1)$$
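As a concrete illustration of the featurisation $\phi$, the sketch below builds a bag-of-words histogram from patches around fixation locations. It is a minimal sketch under stated assumptions: raw pixel patches stand in for the local descriptor, the vocabulary is a k-means clustering, and names such as `featurise_fixations`, the vocabulary size, and the patch size (chosen from the window sizes shown in Figure 2) are illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, vocab_size=60, seed=0):
    """Cluster local patch descriptors into a visual vocabulary V (size is an assumption)."""
    return KMeans(n_clusters=vocab_size, random_state=seed).fit(descriptors)

def bow_histogram(descriptors, vocabulary):
    """phi(.): map a set of patch descriptors to an L1-normalised |V|-dimensional count histogram."""
    words = vocabulary.predict(descriptors)                       # nearest visual word per patch
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def featurise_fixations(collage, fixations, vocabulary, patch_size=41):
    """phi(F(C,Q,P), V): extract a patch around each fixation (x_i, y_i) and build a BoW histogram."""
    half = patch_size // 2
    h, w = collage.shape[:2]
    patches = []
    for x, y, _a in fixations:                                    # fixation triplet (x_i, y_i, a_i)
        x = int(np.clip(x, half, w - half - 1))                   # keep the patch inside the collage
        y = int(np.clip(y, half, h - half - 1))
        patch = collage[y - half:y + half + 1, x - half:x + half + 1]
        patches.append(patch.reshape(-1))                         # raw pixels as a simple descriptor
    return bow_histogram(np.asarray(patches, dtype=float), vocabulary)
```

Given a collage and its recorded fixations, `featurise_fixations(collage, fixations, vocabulary)` yields the fixed-length vector used in the formulation above.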

In our new open-world setting, we no longer assume that we observe fixations to train for test queries. Therefore $Q_{test} \cap Q_{train} = \emptyset$. The main challenge that arises from this setting is to develop a learning mechanism that can predict over a set of classes that is unknown at training time. To circumvent the problem of training for a fixed number of search targets, we propose to encode the search target into the feature vector, rather than considering it a class that is to be recognised. This leads to a formulation where we learn compatibilities between observed fixations and query images:

$$(F(C, Q_i, P), Q_j) \mapsto Y \in \{0, 1\} \qquad (2)$$

Training is performed by generating data points of all pairs of $Q_i$ and $Q_j$ in $Q_{train}$ and assigning a compatibility label $Y$ accordingly:

$$Y = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad (3)$$
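A minimal sketch of how this pairwise labelling could be generated, assuming the training queries are simply indexed; the function name and the example are illustrative, not from the paper.

```python
import numpy as np

def compatibility_labels(n_queries):
    """Enumerate all ordered pairs (Q_i, Q_j) over Q_train and assign Y per Eq. (3):
    Y = 1 when the fixations' true search target Q_i equals the candidate query Q_j, else 0."""
    pairs, labels = [], []
    for i in range(n_queries):            # i indexes the search target behind the fixations
        for j in range(n_queries):        # j indexes the candidate query image
            pairs.append((i, j))
            labels.append(1 if i == j else 0)
    return pairs, np.asarray(labels)

# Example: 5 training queries give 25 pairs, of which 5 are positive and 20 negative.
pairs, labels = compatibility_labels(5)
assert len(pairs) == 25 and labels.sum() == 5
```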

We do not have fixations for the query images. Therefore, we introduce a sampling strategy $S$ which still allows us to generate a bag-of-words representation for a given query. We stack the representation of the fixations and the query images. This leads to the following learning problem:

$$\begin{pmatrix} \phi(F(C, Q_i, P), V) \\ \phi(S(Q_j)) \end{pmatrix} \mapsto Y \in \{0, 1\} \qquad (4)$$
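The sketch below shows one plausible instantiation of the sampling strategy $S$: patch locations are sampled on a regular grid over the query image and featurised with the same vocabulary used for the fixations, after which the two histograms are stacked as in Equation (4). The grid sampling, the parameter values, and the function names are assumptions for illustration; the paper's actual sampling strategy may differ. `vocabulary` is assumed to be the fitted k-means model from the earlier sketch.

```python
import numpy as np

def sample_query_bow(query_image, vocabulary, patch_size=41, grid_step=20):
    """S(Q_j): sample patches on a regular grid over the query image (an assumed choice)
    and build a BoW histogram with the same vocabulary as the fixations."""
    h, w = query_image.shape[:2]
    half = patch_size // 2
    patches = [query_image[y - half:y + half + 1, x - half:x + half + 1].reshape(-1)
               for y in range(half, h - half, grid_step)
               for x in range(half, w - half, grid_step)]
    words = vocabulary.predict(np.asarray(patches, dtype=float))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def stacked_feature(fixation_bow, query_bow):
    """Stacked representation of Eq. (4): [phi(F(C,Q_i,P), V); phi(S(Q_j))]."""
    return np.concatenate([fixation_bow, query_bow])
```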

We learn a model for the problem by training a single binary SVM classifier $B$ according to the labelling as described above. At test time we find the query image describing the search target by

$$Q = \arg\max_{Q_j \in Q_{test}} B\!\left(\begin{pmatrix} \phi(F_{test}, V) \\ \phi(S(Q_j)) \end{pmatrix}\right) \qquad (5)$$
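For concreteness, a hedged sketch of the training and test-time steps: a binary SVM is fitted on the stacked features, and at test time the candidate in $Q_{test}$ with the highest decision score is returned. The abstract only states that a single binary SVM is used, so the linear kernel, the regularisation constant, and the helper names here are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_compatibility_svm(X, Y):
    """Fit the single binary classifier B on stacked features X with compatibility labels Y."""
    return LinearSVC(C=1.0).fit(X, Y)          # linear kernel and C=1.0 are assumptions

def predict_search_target(svm, fixation_bow, candidate_query_bows):
    """Eq. (5): score every candidate Q_j in Q_test and return the index of the best-scoring one."""
    scores = [svm.decision_function(np.concatenate([fixation_bow, q]).reshape(1, -1))[0]
              for q in candidate_query_bows]
    return int(np.argmax(scores))
```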

In both closed-world and open-world evaluation we distinguish between within-participant and cross-participant predictions. In the “within participant” condition we predict the search target for each participant individually using their own training data. In the closed-world setting accuracies were well above chance for all participants for the Amazon book covers (average accuracy 75%) and the O’Reilly book covers (average accuracy 69%). Accuracies were lower for mugshots but still above chance level (average accuracy 30%, chance level 20%). In the open-world setting the average performance of all participants in each group was 70.33% for Amazon, 59.66% for O’Reilly, and 50.83% for mugshots (chance level 20%).

[Figure 2 panels: prediction accuracy (40%–85%) plotted against window size (21×21 to 61×61) for the optimum, average, and minimum k.]
Figure 2: Open-world evaluation results showing mean and standard deviation of cross-participant prediction accuracy for Amazon book covers (left), O’Reilly book covers (middle), and mugshots (right). Results are shown with (straight lines) and without (dashed lines) using the proposed sampling approach around fixation locations. The chance level is indicated with the dashed line.

In contrast, for the “cross participant” condition we predict the search target across participants. The “cross participant” condition is more challenging as the algorithm has to generalise across users. In the closed-world setting, prediction accuracies were best for Amazon book covers, followed by O’Reilly book covers and mugshots. Accuracies were between 61% ± 2% and 78% ± 2% for Amazon and O’Reilly book covers but only around chance level for mugshots. In the open-world setting the model achieves an accuracy of 75% for Amazon book covers, which is significantly higher than chance at 50%. For O’Reilly book covers accuracy reaches 55% and for mugshots we reach 56%. Figure 2 summarises the cross-participant prediction accuracies.
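To make the two evaluation conditions concrete, the sketch below expresses them as two different ways of splitting data by participant. The abstract does not specify the exact cross-validation scheme, so the per-participant hold-out and the leave-one-participant-out grouping used here are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, train_test_split

def within_participant_splits(X, y, participant_ids, test_size=0.3, seed=0):
    """Assumed within-participant protocol: train and test on the same participant's own data."""
    for p in np.unique(participant_ids):
        mask = participant_ids == p
        yield train_test_split(X[mask], y[mask], test_size=test_size, random_state=seed)

def cross_participant_splits(X, y, participant_ids):
    """Assumed cross-participant protocol: hold out one participant, train on all others."""
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=participant_ids):
        yield X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```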

In this paper we demonstrated how to predict the search target during visual search from human fixations in an open-world setting. We showed that this formulation is effective for search target prediction from human fixations.

References

[1] Ali Borji, Andreas Lennartz, and Marc Pomplun. What do eyes reveal about the mind?: Algorithmic inference of search targets from fixations. Neurocomputing, 2014.

[2] Gregory J. Zelinsky, Yifan Peng, and Dimitris Samaras. Eye can read your mind: Decoding gaze fixations to reveal categorical search targets. Journal of Vision, 13(14):10, 2013.