
An Interactive Method to Improve Crowdsourced Annotations

Shixia Liu, Changjian Chen, Yafeng Lu, Fangxin Ouyang, Bin Wang


Fig. 1. LabelInspect: (a) the confusion visualization to reveal the confusion degree between different classes; (b) the instance visualization to illustrate the uncertain labels in context; (c) the worker visualization to demonstrate worker reliability; (d) the validation trail to display the number of validated and influenced instances at each validation step; (e) images.

Abstract— In order to effectively infer correct labels from noisy crowdsourced annotations, learning-from-crowds models have introduced expert validation. However, little research has been done on facilitating the validation procedure. In this paper, we propose an interactive method to assist experts in verifying uncertain instance labels and unreliable workers. Given the instance labels and worker reliability inferred from a learning-from-crowds model, candidate instances and workers are selected for expert validation. The influence of verified results is propagated to relevant instances and workers through the learning-from-crowds model. To facilitate the validation of annotations, we have developed a confusion visualization to indicate the confusing classes for further exploration, a constrained projection method to show the uncertain labels in context, and a scatter-plot-based visualization to illustrate worker reliability. The three visualizations are tightly integrated with the learning-from-crowds model to provide an iterative and progressive environment for data validation. Two case studies were conducted that demonstrate our approach offers an efficient method for validating and improving crowdsourced annotations.

Index Terms—Crowdsourcing, learning-from-crowds, interactive visualization, focus + context

1 INTRODUCTION

The quality of training data has proven to be a critical factor for the success of supervised and semi-supervised learning [36, 38, 43, 57]. However, labeling a large dataset is expensive and demanding [54]. Therefore, researchers resort to crowdsourcing, which distributes micro labeling tasks to crowds, in order to acquire labeled data [36, 57].

• S. Liu, C. Chen, F. Ouyang, and B. Wang are with the School of Software, Tsinghua University. Email: {shixia, wangbins}@tsinghua.edu.cn, {ccj17, oyfx15}@mails.tsinghua.edu.cn.

• Y. Lu is with Arizona State University. Email: [email protected].

Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publication xx xxx. 201x; date of current version xx xxx. 201x. For information on obtaining reprints of this article, please send e-mail to: [email protected]. Digital Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx

This mechanism, while being efficient and cost-effective, often falls short of high-quality results when the task is complicated and requires proven skills or specialties. As a result, crowdsourced annotations may be noisy and poor in quality and usually require additional validation [58]. Correspondingly, there have been some initial efforts to introduce additional expert labels into learning-from-crowds algorithms [27, 36].

Although the performance of the aforementioned methods represents the state-of-the-art for improving crowdsourced labels, they use trusted labels without considering the effort and cost of acquiring them. Data scientists (expert labelers) often have to spend significant time validating the data for better model accuracy, especially when judging the reliability of a worker or evaluating which verifications can bring in a relatively large gain. Moreover, for objects that are difficult to distinguish, data scientists need to consult domain experts who are familiar with the field. Generally, difficulties arise from the following:


Fig. 2. Objects that are hard to distinguish: (a) Irish Wolfhound; (b) Scottish Deerhound; (c) house sparrow; (d) tree sparrow.

(1) two categories of highly similar-looking objects, for example, the Irish Wolfhound and Scottish Deerhound in Figs. 2(a) and 2(b); (2) the key features for differentiating various objects are partially hidden, for example, in Figs. 2(c) and 2(d), the key features to distinguish these two types of sparrows are the colors of the crown and cheek, part of which is hidden in these two images; (3) some noisy images, which do not belong in the categories of interest, are mixed into the dataset, for example, an image with a Spanish sparrow is contained in the dataset of the house sparrow and tree sparrow. To support an efficient and effective verification, it is necessary to better illustrate the annotation results and minimize the workload of both data scientists and domain experts. However, little work has been done to explore a method of acquiring expert validation efficiently and how to take advantage of such feedback for further annotation improvement.

To facilitate the validation of crowdsourced data, we have developed an interactive visual analytics tool, LabelInspect, to improve the annotated results from crowdsourcing. A demo of the prototype is available at http://visgroup.thss.tsinghua.edu.cn/label.

LabelInspect combines a learning-from-crowds model [36] with interactive visualizations to provide experts quick access to the most uncertain instance labels and unreliable workers for verification. Expert verifications of instance labels and spammers mutually influence each other by the mutual reinforcement graph model [18, 37], and are propagated to other instances and workers using the learning-from-crowds model [36]. LabelInspect includes a confusion visualization, an instance visualization, and a worker visualization as its main visualizations (Fig. 1). The confusion visualization employs a confusion matrix to measure the confusion rate between classes. In the instance visualization, a novel constrained t-SNE projection is proposed in an attempt to show the uncertain labels in the context of other relevant instances and reveal their influence on each other based on similarities. In the worker visualization, which is utilized to illustrate the reliability of the workers, we propose a class-level spammer score calculation algorithm and combine it with a scatter plot. The three visualizations are tightly coordinated through the underlying models to reflect the validation effects on the relevant instances and workers.

In summary, the main contributions of our work are:

• A visual analytics tool that allows data scientists to verify crowdsourced instance labels and unreliable workers in a convenient and fast manner.

• An iterative and progressive verification procedure that propagates the influence of expert validations on both instance labels and worker reliabilities to recommend the most informative ones for further verification. Results have demonstrated a large performance improvement.

• A novel instance visualization based on a well-formulated constrained t-SNE, a worker visualization conveying both global and partial spammer behaviors, and a confusion visualization. These visualizations work together to effectively identify the most informative instances and most unreliable workers, and illustrate the influence of each verification on other instances and workers.

2 RELATED WORK

Our work is related to integrating machine learning techniques into visual analytics [14, 20, 38, 39, 42, 55, 63], in particular, combining interactive visualization with learning techniques to improve data quality. In the field of visualization, many methods have been proposed to obtain high-quality data. Most existing relevant work can be classified into three branches: visual data processing, visual data retrieval, and interactive labeling.

Visual data processing. To support data processing, Kandel et al. [30] proposed Wrangler, which is a visual analytics system focusing on creating data transformations. Maciejewski et al. [47] employed the Box-Cox transformation to transform data into the best normalized distribution for effective visualization and analysis. Other than data transformation, some methods have focused on anomaly detection and outlier removal. For example, the system developed by Kandel et al., Profiler [31], coupled automated anomaly detection with linked summary visualizations and interactive data exploration to help analysts assess data quality. Hao et al. [22] supported noise removal prior to building prediction models for seasonal time series data. Wilkinson [61] developed a distributed algorithm for detecting outliers in big data and visually illustrated the outliers for further inspection. In terms of crowdsourced data, Willett et al. [62] introduced a crowd-assisted clustering method to identify and consolidate redundant responses. Park et al. [50] developed a visual analytics platform to analyze medical crowdsourced data. In contrast to the above visual data processing techniques, LabelInspect considers the relationship between worker reliability and instance label uncertainty, and integrates the mutual reinforcement graph model and learning-from-crowds model to propagate expert verifications in different phases.

Visual data retrieval. Data retrieval is an effective way to obtain data of interest for data-related applications. Several methods have been proposed for visual data retrieval. For example, Krause et al. [34] presented a visual analytics system, COQUITO, which supports cohort construction by iteratively updating queries through a visual interface. Liu et al. [37] developed a microblog retrieval system, MutualRanker, which considers posts, users, and hashtags together as integral for retrieving and ranking data. It presented the results on a graph visualization along with its uncertainty calculated from a mutual reinforcement graph model. Lu et al. [45] utilized semantic similarity to obtain relevant textual records based on an interactively built concept map on a force-directed layout. Wall et al. [59] developed Podium to interactively rank and retrieve multi-variate data based on users' subjective preferences, which are obtained by capturing users' drag interactions on a table view. All the aforementioned methods aim to acquire relevant data without consideration of labels, which thus cannot be directly extended to improve the quality of labeled data. Therefore, we tightly integrate visual analytics with the learning-from-crowds model to provide an iterative and progressive environment for verifying uncertain instance labels and unreliable workers.

Interactive labeling. Many visualization researchers have also proposed employing active or incremental learning to help users interactively label a variety of data. Heimerl et al. [23] presented a method that incorporates active learning and interactive labeling for document classification. Paiva et al. [49] supported interactive labeling by helping users interpret misclassified instances on a Neighbor Joining tree and a similarity layout view. This concept of interactively labeling data as part of a model learning process is also supported by other visualization works [4, 6, 7, 9, 25]. An experimental study conducted by Bernard et al. [5] demonstrated the promise of visual-interactive labeling. It showed that visual-interactive labeling usually outperforms active learning when class distributions are well separated by dimension reduction. Different from these methods, our work focuses on labeling crowdsourced data, which has specific data quality issues. The labeler not only needs to label instances but also analyze worker behavior.


Fig. 3. LabelInspect overview: the data modeling module estimates correct labels from crowdsourced annotations and updates the labeling results based on expert validation; the visualization module allows analysts to interactively validate instance labels and spammer workers.

In addition, s/he often needs to understand the influence within instance labels or workers, as well as the mutual influence between them. Therefore, LabelInspect consists of three highly coordinated visualizations to facilitate the labeling process.

In the field of machine learning, researchers have introduced expert validation into learning-from-crowds models [27, 36]. These methods have demonstrated the effectiveness of leveraging expert labels to reduce the labor involved and improve the performance. However, they all assume expert labels can be acquired easily. This is not true in many real-world applications, in which experts have to repetitively browse the data and carefully compare crowdsourced labels to provide useful validations. Consequently, developing an effective validation procedure is of theoretical and practical significance.

3 DESIGN OF LABELINSPECT

In this section, we introduce the task analysis process and collected tasks, as well as the visual analytics tool built on these tasks.

3.1 Task Analysis

This research is motivated by our own research projects that employ supervised and semi-supervised learning as an analysis method. Because data quality is critical for data-driven tasks [1], in these projects, after collecting labeled data from Amazon Mechanical Turk (AMT), we often spent a lot of time and effort cleaning the data, including discovering the most problematic instance labels and spammers. This process is tedious and requires domain expertise in complex cases.

In order to identify the main scenarios in the aforementioned validation process, we conducted three interviews with four data scientists. Two of them are R&D researchers from Inventec, whose job responsibility is to validate the “pass/fail” labels of boards along the product line (A1, A2). The other two are senior PhD students who often collect and validate image labels for crowdsourcing-related research (A3, A4). For brevity's sake, we refer to data scientists as analysts and to domain experts as experts throughout the rest of the paper. The interviews were semi-structured with a focus on probing the participants' validation needs and processes by asking the following questions.

• How do you improve the quality of crowdsourced annotations?
• What are the major steps in the validation process?
• What type of information helps the validation process?

In the first interview, the analysts said that they needed to understand the crowdsourced labels and the labeling behavior of workers from different perspectives, so that they can quickly identify the most problematic data and suspicious workers. For objects that are easily confused with each other, they usually consult with a domain expert who knows the data. In such cases, an intuitive visualization is preferred to help explain the reason for the confusion to the domain expert within the proper context. Based on the interviews and our previous research experience, we identified the following tasks:

T1 - Inspect the confusion degree among different categories and select the most confusing ones for detailed examination. This is especially important when multiple-category labeled data is considered.

T2 - Understand the instance distribution under different category labels, examine the confused instances between categories in context, and then quickly identify the most problematic labels.

T3 - Analyze the reliability of each worker, form an assumption of the spammer degree of the worker, verify the formed assumption by checking the corresponding annotation results, and recognize spammers in order to improve the labeling accuracy.

T4 - Check the influence within instances/workers, as well as mutual influence between them. If the label of an instance is modified, the analyst wants to understand its influence on other instances and the workers who annotated this instance, and vice versa.

T5 - Examine the validation trail to understand the influence of each validation on other instances/workers. The validation trail displays the number of instances validated by the analyst in each propagation round and the number of instances whose label estimates have been changed with these verifications. By checking the trail, an analyst is able to decide when to stop a validation process or roll back to a previous state. For example, if the recent influence is very small, it might be one indication to stop the validation process.

3.2 System Overview

Guided by the above tasks, we designed a visual analytics tool, LabelInspect, to help analysts validate crowdsourced labels and acquire high-quality training data. It consists of two modules: data modeling and interactive visualization. They work together to provide an iterative and progressive solution for data validation.

As shown in Fig. 3, the data modeling module focuses on collecting the crowdsourced annotations and estimating correct labels from noisy crowdsourced data using a learning-from-crowds model. In our implementation, we take image data as an example to illustrate the basic idea of the proposed method. We collected raw data from ImageNet and obtained the crowdsourced labels from AMT. The estimated labels were derived using the max-margin-majority-voting (M3V)-based model [36]. After several verifications from the analyst are fed back, the M3V-based model incrementally updates the labeling results to start a new round of validation. The visualization module consists of three coordinated visualizations to enable analysts to interactively validate crowdsourced data and improve the labeling accuracy. By examining the confusion between different classes in the confusion visualization, the analyst can pick out the most confused ones to start the validation process (T1). Then the instance visualization is employed to show the uncertain labels in the context of other relevant instances (T2). The experts first verify several seed instances and/or workers, and then candidate instances and workers are selected for further validation. The influence of verified results is propagated to relevant instances and workers through the M3V-based model (T4). In the worker visualization, a scatter plot is employed to illustrate worker reliability (T3). By checking the spammer score and labeling results, the analyst can identify the highly reliable workers and spammers. The validation of workers will also propagate to other data via the M3V-based model. The three visualizations are closely coordinated through a set of analysis methods, such as candidate selection for further verifications (T4), validation propagation to reflect the validation impact on the relevant instances/workers (T4), and exploration of the validation trail to roll back to a previous validation step (T5).

4 DATA MODELING

The current implementation uses two image datasets, a dog dataset with 4 categories and a bird dataset with 10 categories from ImageNet [16]. The dog dataset already contains the crowdsourced annotations [65]. For the bird dataset, we utilize AMT to collect crowdsourced annotations. Each worker labels an image at most once, and each image is labeled 12 times. We exclude workers with a low number of annotations (less than 60 labels) or low accuracy (less than 20% on the gold standard images). The resulting crowdsourced dataset is available at http://cgcad.thss.tsinghua.edu.cn/shixia/crowd.zip. For each dataset, we first perform feature extraction, then employ a learning-from-crowds model to infer the initial true labels and incrementally update the labeling results based on validation.

Feature Extraction. Previous studies have shown that feature vectors extracted from deep convolutional neural networks (CNNs) trained on a large labeled dataset, such as the ImageNet ILSVRC 2012 dataset [51], can represent the image very well and be used in different pattern recognition tasks [17, 64]. As a result, for the dog dataset, whose categories are all covered by the ILSVRC 2012 dataset [51], we employ the outputs of the last but one fully-connected layer of a pre-trained VGG-16 [52] as the feature vectors. In particular, we utilize the pre-trained Keras VGG-16 model on the ILSVRC 2012 dataset [12]. The aforementioned feature vectors work well for the dog dataset since all 4 categories in this dataset are included in the ILSVRC 2012 dataset.
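As a rough illustration of this step (not code from the paper), the sketch below extracts a 4096-dimensional feature per image, assuming TensorFlow/Keras and the standard pre-trained VGG-16, whose penultimate fully connected layer is named "fc2"; the helper name and preprocessing details are our assumptions.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Use the penultimate fully connected layer ("fc2") of VGG-16
# pre-trained on ILSVRC 2012 as a 4096-d feature extractor.
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_feature(path):
    # Load and preprocess a single image, then return its feature vector.
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]
```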

However, the bird dataset includes images of several species that are not contained in the ILSVRC 2012 dataset, thus the extracted features with the Keras model do not differentiate them well. To solve this problem, we employ a fine-tuning method [21] to adapt the pre-trained VGG-16 model to our task. In particular, we replace all the fully connected layers of the Keras VGG-16 model with three fully connected layers and update the model weights with a set of additional training samples. This sample set consists of 800 images whose labels are inferred by the learning-from-crowds model with the least uncertainty and 10,672 images of the 10 bird species from ImageNet, resulting in a training set of 8,030 images and a validating set of 3,442 images. We also use the output of the last but one fully-connected layer of the fine-tuned model as the feature vector of each image.

The Learning-From-Crowds Model. We employ a semi-supervised learning-from-crowds method, the M3V-based model [36], to estimate the true labels from noisy crowdsourced annotations and gradually integrate the expert labels, which are confirmed by the analyst via a step-by-step validation process. The detailed formulation and solution of this model can be found in [36]. The advantage of using this method is that it introduces additional expert labels into the learning-from-crowds framework and incrementally updates the learning model as new expert labels arrive. This method is also utilized to infer the initial true labels from the crowdsourced annotations.

5 VISUALIZATION

LabelInspect consists of three major visualizations. The confusion visualization illustrates the confusion degree among classes and helps select a few classes to start (T1). The instance visualization uses a constrained t-SNE projection to visualize the instances of interest and highlights the most informative ones using uncertain glyphs (T2). A scatter plot is employed in the worker visualization to demonstrate worker reliability (T3). Instance and worker visualizations are coordinated so that verifications on either visualization will update both of them to reflect their mutual influence (T4). In addition, a validation trail and clear images are provided to aid the validation process (T5).

5.1 Confusion Visualization

The exploration of the confusion matrix has been supported in Talbot et al.'s EnsembleMatrix [55] to understand and compare multiple classifiers. In our approach, the concept of the confusion matrix is extended and used to explore the relationship between worker labeling and model estimates. The confusion visualization (Fig. 1(a)) is developed to illustrate whether workers confuse two or more classes. This is important because in a multi-class labeling task, identifying a few (often two) mixed classes can dramatically reduce the workload and difficulty of validating the annotations. In order to effectively visualize possible confusion patterns, we calculate a worker confusion matrix and reorder it to reveal block patterns where each block can be considered to be one subset of the confusing classes.

The worker confusion matrix represents the mismatch between the workers' labeling results and model-inferred results. It is an m×m matrix where m is the number of classes. Each column represents the inferred class from the M3V-based model, and each row represents the label assigned by the workers. The cell in the ith row and jth column represents the percentage of times that instances inferred to be the column class c_j have been labeled as the row class c_i. We denote this percentage as LabelP(c_i, c_j). For example, in Fig. 1A, the model-inferred house sparrows are labeled as “house sparrow” 61% of the time, and as “tree sparrow” 29%. Generally, LabelP(c_i, c_j) and LabelP(c_j, c_i) are not equal, and we take their average to be the similarity between c_i and c_j.
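As a rough illustration (not code from the paper), the sketch below computes LabelP and the averaged class similarity from a list of crowd annotations and the model-inferred labels; the data structures and function name are our assumptions.

```python
import numpy as np

def worker_confusion_matrix(annotations, inferred_labels, m):
    # annotations: iterable of (instance_id, worker_class) pairs over all workers.
    # inferred_labels: dict mapping instance_id -> class inferred by the model.
    # Cell (i, j) is LabelP(c_i, c_j): the fraction of annotations assigning
    # class i to instances the model infers to be class j.
    counts = np.zeros((m, m))
    for instance_id, worker_class in annotations:
        counts[worker_class, inferred_labels[instance_id]] += 1
    label_p = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)
    similarity = (label_p + label_p.T) / 2  # symmetric class-class similarity
    return label_p, similarity
```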

Once the worker confusion matrix has been calculated, a matrix reordering algorithm is developed to reveal the block pattern visually. A reordered matrix with a block pattern shows coherent rectangular areas (blocks), and each block indicates a set of confusion classes. As the Optimal-Leaf-Ordering (OLO) algorithm [2] has been shown to be optimal for demonstrating blocks [3], we have employed this reordering algorithm.
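One way to obtain such an ordering, assuming SciPy's hierarchical clustering utilities (our choice of library, not necessarily the paper's implementation), is to turn the class similarity into a distance matrix and apply optimal leaf ordering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import squareform

def reorder_classes(similarity):
    # Convert the symmetric class similarity into distances and compute an
    # ordering in which confusing classes end up adjacent (block pattern).
    dist = 1.0 - similarity
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    Z = optimal_leaf_ordering(linkage(condensed, method="average"), condensed)
    return leaves_list(Z)  # permutation of class indices
```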

Our confusion visualization uses a single-hue sequential color scheme to encode each cell according to its label percentage value so that the groups are easily identified as dark blocks. A qualitative color scheme is used to distinguish classes that are labeled along the left and bottom sides of the matrix. This visualization supports class selection for the instance inspection, and the percentage values update in each validation round to cue the analyst on the validation progress.

5.2 Instance Visualization

Once we have the classes selected, the instance visualization will show all the instances inferred to be in these classes, and the analysts can explore the instance labels and work on instance-level validation.

Visual Encoding. This visualization is designed as a circular-based constraint layout (Fig. 1(b)), where the outer arcs represent classes and the length is proportional to the number of instances. Instances are plotted within the circle. Same as the confusion visualization, the color represents the class.

Our underlying model will recommend some uncertain instances that are encoded using uncertain glyphs, while the remaining instances are encoded as simple dots with colors indicating their inferred classes. The uncertain glyph, as shown in Fig. 1B, consists of a centered dot indicating the inferred class, the surrounding colored arcs denoting the distribution of worker assignments, arrowhead markers pointing towards corresponding class arcs (marked by the workers) on the outer circle, and an optional snapshot providing a quick spot of the instance (Fig. 1C). In our implementation, 20 uncertain instances are selected by the MRG model (Sec. 5.4). The position of an instance is determined by a constrained t-SNE projection, which is explained later in this section. The snapshot position is calculated using a clutter-aware label-layout algorithm [48] to ensure that the snapshot is close to its instance while not cluttered with other snapshots. Given these snapshots, analysts can quickly judge whether the prediction of the model is accurate.

Constrained t-SNE Projection. We propose a constrained t-SNE projection for instance visualization so that 1) similar instances are grouped together; 2) instances tend to plot towards their inferred class and the most uncertain ones are pulled by all their possible class labels; and 3) instances try to keep their previous positions when being updated.

Although t-SNE [46] maintains the relationships between instances in the low-dimensional space, it does not consider the relationships between the crowdsourced labels and the inferred labels. As a result, it is difficult to utilize t-SNE to clearly reveal label errors. A supervised t-SNE is used in UTOPIAN [13] where the distances between the same-class instances are decreased by a parameter. In addition, TopicLens [33] proposed a guided t-SNE where the group centroids on the projected 2D space are first calculated with a small set of instances and then the centroids are used to add constraints for additional data points. These methods consider instance labels or projection stability to some extent. However, they do not control the layout position of the groups and cannot meet our need that certain instances are close to their class arcs while uncertain instances are projected to the center. In addition to the class-level readability supported by the guided t-SNE, we also want to maintain instance-level stability so that the same instance will be close to its previous position after each update.

Therefore, we have developed a novel constrained t-SNE projection that balances readability and stability for the instance layout. Readability helps preserve neighborhoods and clusters (overall readability), reveals class association, and accentuates abnormal labels (class readability). Stability helps control the position of the instance in the iterative process and provides a stable context. Thus, for m classes and n instances, we minimize the following cost function,

f_{cost} = \alpha \cdot \mathrm{KL}(P \,\|\, Q) + \beta \cdot \mathrm{KL}(P^c \,\|\, Q^c) + (1 - \alpha - \beta) \cdot \mathrm{KL}(P^s \,\|\, Q^s), \quad (1)

where KL(· ‖ ·) is the Kullback-Leibler (KL) divergence between two distributions. The first KL-divergence is the same as the original t-SNE [46] for projecting the n instances and preserving the overall readability. The second KL-divergence is to maintain the class readability. The last KL-divergence is to improve stability in the iterative validation process. α, β ∈ [0, 1] are two weights to balance them.
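For concreteness, a minimal sketch of evaluating Eq. 1 from precomputed similarity distributions is shown below; the normalization details and default weights are our assumptions, not taken from the paper.

```python
import numpy as np

def kl(P, Q, eps=1e-12):
    # KL divergence between two similarity distributions (normalized to sum to 1).
    P = P / P.sum()
    Q = Q / Q.sum()
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

def projection_cost(P, Q, Pc, Qc, Ps, Qs, alpha=0.5, beta=0.3):
    # Weighted sum of the three KL terms in Eq. 1:
    # overall readability, class readability, and stability.
    return (alpha * kl(P, Q)
            + beta * kl(Pc, Qc)
            + (1 - alpha - beta) * kl(Ps, Qs))
```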

We introduce m class constraint arcs, one for each class, and n virtual stability constraint points, one for each instance. P^c, Q^c ∈ R^{n×m} are the joint probability distributions that measure pairwise similarities between the instances and constraint arcs in the high-dimensional space and low-dimensional space, respectively.

(P^c)_{ij} = \begin{cases} \exp(-d^h_{ij} / 2\sigma^2) \cdot I(c_j \in L_i) & e_i \in U \\ \exp(-d^h_{ij} / 2\sigma^2) \cdot I(c_j = l_i) & e_i \notin U \end{cases}, \quad (2)

(Q^c)_{ij} = (1 + \| d^p_{ij} \|^2)^{-1}, \quad (3)

where I(·) is an indicator function, d^h_{ij} is an estimated distance between instance e_i and the arc of class c_j in the high-dimensional space, and d^p_{ij} is the distance between instance e_i and the nearest point on the arc of class c_j in the projected space. l_i represents the inferred label of instance e_i, L_i is the set of labels that instance e_i has been assigned by the crowd workers, and U is a set of uncertain instances. To estimate d^h_{ij}, we build one hypersphere in the high-dimensional space for each class. This hypersphere uses the class's barycenter as its centroid and the distance from the centroid to the farthest point as its radius. Then d^h_{ij} is defined as the shortest distance from instance e_i to the hypersphere.
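A small sketch of estimating d^h_{ij} under this hypersphere definition follows (variable and function names are ours):

```python
import numpy as np

def class_hypersphere_distances(X, inferred, classes):
    # X: (n, d) feature vectors; inferred: (n,) inferred class per instance.
    # d^h_{ij}: distance from instance i to the hypersphere of class j, where the
    # hypersphere is centered at the class barycenter with radius equal to the
    # distance from the centroid to its farthest member.
    D = np.zeros((X.shape[0], len(classes)))
    for j, c in enumerate(classes):
        members = X[inferred == c]
        center = members.mean(axis=0)
        radius = np.linalg.norm(members - center, axis=1).max()
        dist_to_center = np.linalg.norm(X - center, axis=1)
        D[:, j] = np.maximum(dist_to_center - radius, 0.0)  # zero inside the hypersphere
    return D
```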

Given the way we define P^c, uncertain instances will have multiple class constraints decided by the labels assigned by the workers. These constraints add the tendency to project such instances to the middle/center of their confusion classes. For other instances, instead, only one class constraint from the inferred class will be added, and they will be pulled towards their inferred class arc.

For the stability constraint, the virtual point takes the same position as the instance in the high-dimensional space and the last projected position of the instance as its 2D position. P^s, Q^s ∈ R^{n×1} are the joint probability distributions that measure pairwise similarities between the instances and stability constraint points in the high-dimensional space and low-dimensional space, respectively. (P^s)_i ∈ [0, 1] represents the strength of the stability constraint added to instance e_i. In our implementation, P^s changes after every propagation.

(P^s)_i = \begin{cases} 0 & e_i \text{ is validated} \\ 0.5 & e_i \text{ is not validated, but its label has changed} \\ 1 & \text{otherwise} \end{cases}, \quad (4)

Stability constraints for validated instances are set to 0 so that these instances are able to be freely attracted to the confirmed class. We set a middle value, 0.5, for the instances whose inferred labels have changed because of the propagation. In this way, their positions tend to be closer to the arc of their newly inferred class but not too far away from their previous positions for easy tracking. Other instances stick to their previous positions. Q^s is the similarity in the 2D space between instances and their previous positions. We use y'_i to denote the last projected position of instance e_i, and Q^s is defined as: (Q^s)_i = (1 + \| y_i - y'_i \|^2)^{-1}.

Flow of Influence. Once an instance label is verified, the MRG model will assign a high prior saliency to this instance and re-rank all instances to reselect the 20 uncertain instances apart from the validated ones. This new set of uncertain instances is displayed using instance glyphs. In addition, the influence flow from the verified instance to the 5 most influenced candidate instances is displayed as a flow map, as illustrated in Fig. 1I. The flow map helps to expose the internal status of the validation propagation and inform the analyst of the effect. The layout of the flow paths from one source to multiple targets is based on the flow map layout [10] and the edge bundling developed by Holten et al. [26]. The width of the flow between two instances is proportional to their similarity value, which denotes the influence degree from the verified instance to the other instances on the flow.

5.3 Worker Visualization

The inference of instance labels and worker reliability mutually influence each other in crowdsourced data validation. Therefore, we have developed a worker visualization to illustrate worker reliability and enable the validation of crowdsourced data at the worker level.

Overview. The worker visualization only plots workers who have labeled the instances associated with the selected confusion classes. A scatter plot is implemented where the x-axis is the spammer score and the y-axis is the accuracy according to model-inferred labels. One dot represents one worker, where the size is proportional to the number of instances s/he has labeled. The accuracy of a worker is measured by the percentage of his labeled instances that match the model predictions. The spammer score is calculated by an approach that combines the spammer detection method proposed by Hung et al. [27] with block detection to support finding partial confusing spammers, which is introduced below. Similar to uncertain instances, the MRG model recommends K candidate workers and highlights them on the scatter plot after a spammer is confirmed. In our implementation (Fig. 1(c)), K = 8, and these highlighted dots all have high spammer scores and low accuracy.
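A minimal matplotlib sketch of this encoding follows (the size scaling is an assumption on our part, not taken from the paper):

```python
import matplotlib.pyplot as plt

def plot_workers(spammer_scores, accuracies, num_labeled):
    # x: spammer score, y: accuracy w.r.t. model-inferred labels,
    # dot size proportional to the number of instances the worker labeled.
    sizes = [20 + 3 * n for n in num_labeled]
    plt.scatter(spammer_scores, accuracies, s=sizes, alpha=0.7)
    plt.xlabel("Spammer score")
    plt.ylabel("Accuracy (vs. inferred labels)")
    plt.show()
```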

An individual confusion matrix is coordinated with the scatter plot to reveal the worker's labeling performance (Fig. 1D). The design of this matrix is the same as that of the worker confusion matrix except that the individual confusion matrix only analyzes the data for one worker on the selected confusion classes. The coordination between the scatter plot and the individual confusion matrix is useful for revealing patterns in a worker's labeling performance. For example, the matrices shown in Fig. 4 indicate two different confusion patterns. Fig. 4(a) indicates that the worker has a problem distinguishing house sparrows from tree sparrows, but is able to label falcons and fish eagles pretty well, while Fig. 4(b) shows the opposite pattern for another worker.


Fig. 4. Two different confusion patterns: (a) the worker has a problem distinguishing house sparrows from tree sparrows, but has a high labeling accuracy on the classes “falcon” and “fish eagle”; (b) the opposite pattern for another worker.


As such, the analyst can verify for which classes a worker is reliable or unreliable. Given this verification, the MRG model is able to update candidate instances and workers. A flow map is drawn to highlight the most influenced workers. If a worker is identified as reliable, the M3V-based model will increase the worker's prior reliability score. If a worker is identified as a spammer for a few classes, we remove the most likely incorrect labels assigned by the worker, and then, as suggested by the analyst, we lower the prior of the reliability of the worker.

Spammer Score. Hung et al. [27] calculated the spammer score as the distance from the prediction matrix to its closest rank-one approximation for random and uniform spammers. However, for multi-class labeling tasks, a spammer may only confuse a subset of the classes rather than all of them. Comparing the worker's individual confusion matrix to its rank-one approximation cannot detect these partial spammers. To solve this problem, we developed a block-based spammer detection method, which calculates the spammer score in the following two steps:

Step 1: Block Detection is used to identify the block pattern in a worker's confusion matrix. Matrix reordering, as described in Sec. 5.1, is first applied to put blocks, if any, along the diagonal. We denote this reordered matrix as L. In our work, a block indicates that a subset of the classes are difficult to separate by the workers, and therefore, we consider all blocks to be square. Detecting these squared blocks is equivalent to finding segments along the ordered classes such that the sum of the density in each block indexed by each segment is maximized. We denote B(i, j) as the block from class c_i to class c_j in the ordered matrix L. The density of B(i, j) is defined as:

\mathrm{density}(i, j) = \begin{cases} \left( \sum_{x=i}^{j} \sum_{y=i}^{j} L[x, y] - \sum_{x=i}^{j} L[x, x] \right) / \big( (j - i)(j - i + 1) \big) & \text{if } i < j \\ 0 & \text{if } i = j \end{cases} \quad (5)

where L[x, y] is the value in the xth row and yth column of L. We remove the diagonal values and consider only the “confusion” part (off-diagonal cells). We solve this as a time-series segmentation problem using a dynamic programming approach [24].
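A sketch of Eq. 5 and the dynamic-programming segmentation, written as we understand the description (the paper's exact recurrence may differ), is shown below.

```python
import numpy as np

def block_density(L, i, j):
    # Eq. 5: density of the off-diagonal "confusion" mass in the square block
    # spanning classes i..j (inclusive) of the reordered matrix L.
    if i == j:
        return 0.0
    block = L[i:j + 1, i:j + 1]
    return (block.sum() - np.trace(block)) / ((j - i) * (j - i + 1))

def detect_blocks(L):
    # Choose contiguous segments along the diagonal so that the summed
    # block densities are maximal (time-series-segmentation-style DP).
    m = L.shape[0]
    best = np.full(m + 1, -np.inf)
    best[0] = 0.0
    cut = np.zeros(m + 1, dtype=int)
    for j in range(1, m + 1):
        for i in range(j):
            score = best[i] + block_density(L, i, j - 1)
            if score > best[j]:
                best[j], cut[j] = score, i
    blocks, j = [], m
    while j > 0:  # backtrack the chosen segments
        blocks.append((cut[j], j - 1))
        j = cut[j]
    return blocks[::-1]
```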

Step 2: Aggregation is then performed to aggregate the spammer score, s(w), for worker w, based on the detected number of blocks, k. The goal is to assign the highest spammer score to one-block workers who mixed all classes, and the lowest spammer score to m-block workers who separated all classes. For middling workers, their score depends on the number of blocks and how confused they are within each block. As such, the spammer score for worker w is defined as:

s(w) = \frac{1}{k} \sum_{i=1}^{k} \min_{M_w(i)} \| M_w - M_w(i) \|_F,

where M_w is the individual confusion matrix for worker w, and M_w(i) is the rank-i approximation matrix to M_w, using the Frobenius norm. This problem is solved using singular value decomposition [19].
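Since the best rank-i approximation under the Frobenius norm is given by the truncated SVD, s(w) can be computed as in the sketch below (our illustration, not the paper's code):

```python
import numpy as np

def spammer_score(M_w, k):
    # Average Frobenius distance from the worker's individual confusion matrix
    # to its best rank-i approximations, i = 1..k (truncated SVD).
    U, s, Vt = np.linalg.svd(M_w, full_matrices=False)
    total = 0.0
    for i in range(1, k + 1):
        approx = (U[:, :i] * s[:i]) @ Vt[:i, :]
        total += np.linalg.norm(M_w - approx, "fro")
    return total / k
```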

Crowd Workers Characterization. Previous studies have characterized crowd workers into five types in terms of their expertise, including reliable workers, normal workers, sloppy workers, uniform spammers, and random spammers [32, 27]. These types are specifically defined for binary classification. For multi-class classification, in addition to these five types, we identify a sixth type, partial spammers, who have a certain ability to label a subset of the tasks, but behave as random or uniform spammers for the rest. Fig. 5 illustrates the characteristics of the six types of workers by using the scatter plot and the individual confusion matrix.

• Reliable workers have high accuracy and low spammer scores. Their confusion matrices have very high values in the diagonal cells and very low values in the off-diagonal cells (Fig. 5A).

• Normal workers have relatively high accuracy and low spammer scores, but are not as accurate as reliable workers. This is evidenced by the fact that their matrices have relatively high values in the diagonal cells and relatively low values in the off-diagonal cells (Fig. 5B).

• Sloppy workers have low accuracy and medium spammer scores. Their labeling is not very reliable in general, and it is hard to find a high-quality subset of tasks. Accordingly, in the confusion matrices, the values of the diagonal cells are a little higher than those of the off-diagonal cells. However, the difference is trivial (Fig. 5C).


Fig. 5. Illustration of six different worker types on the worker visualization.

• Partial spammers have medium accuracy and relatively high spammer scores. Their confusion matrices show obvious block patterns, and they either can label some classes accurately or separate instances between class families (e.g., a sparrow is only mislabeled as another type of sparrow, but not as an eagle or owl) (Fig. 5D).

• Random spammers have low accuracy and high spammer scores. They often carelessly assign labels to all instances. Thus, the corresponding confusion matrices demonstrate no clear patterns. Typically, the values of the diagonal cells are similar to those of the off-diagonal cells (Fig. 5E).

• Uniform spammers also have low accuracy and high spammer scores, similar to the random spammers. These workers usually assign the same label to all instances. As a result, their confusion matrices have an all-one row with the other cells being 0 (Fig. 5F).

Random spammers and uniform spammers are considered to be the most unreliable workers. Though they have different patterns in their individual confusion matrices, they are all plotted towards the lower right corner of the scatter plot (Fig. 5).

5.4 Interactive and Iterative Exploration Environment

This section introduces the coordination of the three major visualizations. It includes an MRG model that uses the mutually influencing instance and worker information to select candidates, a detail view to assist instance validation and record historical verifications, and a set of interactions that link these visualizations.

Mutual Reinforcement Graph Model. In order to help analysts validate data efficiently, uncertain instances and workers are selected by the system based on instance label uncertainty and worker unreliability. This candidate selection can be formulated as a ranking problem. A straightforward solution is to leverage the PageRank algorithm [8]. However, the PageRank algorithm only influences items in the same graph, while the MRG model has advantages in terms of leveraging both the relationships within the instances or workers, and the mutual influence between them. As a result, we employ the MRG model to rank the items in the instance graph and worker graph. The relationship between these two graphs indicates which instances have been labeled by a worker. The instance graph has each node represent one instance, and each pair of nodes is connected by an edge whose weight is the similarity between them. The worker graph is built in the same way, while worker similarity is calculated using SimRank [29] on a bipartite instance-worker graph. We build the instance-worker bipartite graph as described in [28]: (i) each worker w is represented by a node; (ii) each instance e is represented by m nodes (m is the number of classes): e(c_1), e(c_2), ..., e(c_m); (iii) an edge (w, e(c_i)) is added if worker w labels instance e as class c_i.
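As an illustration of this graph construction, the sketch below builds the bipartite graph and derives worker-worker similarities with NetworkX's SimRank implementation (our choice of library; the paper does not specify one):

```python
import networkx as nx

def worker_similarities(annotations):
    # annotations: iterable of (worker_id, instance_id, class_id) triples.
    # Each instance is split into one node per class it was labeled as, and an
    # edge (w, e(c_i)) is added when worker w labels instance e as class c_i.
    G = nx.Graph()
    for w, e, c in annotations:
        G.add_edge(("worker", w), ("instance", e, c))
    sim = nx.simrank_similarity(G)
    # Keep only worker-worker similarity entries.
    return {(u[1], v[1]): s
            for u, row in sim.items() if u[0] == "worker"
            for v, s in row.items() if v[0] == "worker"}
```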

The MRG model employs a similar method to PageRank [8] to model the mutual influence between workers and instances, as well as the influence within each graph. We define R_e and R_w as the ranking score vectors for instances and workers, respectively. The initial scores for all items are the same (1/n_t, where n_t is the total number of instances and workers), and they are calculated iteratively through the following equation:

\begin{bmatrix} R_e \\ R_w \end{bmatrix} = \lambda \begin{bmatrix} \alpha_{ee} M_{ee} & \alpha_{we} M_{we} \\ \alpha_{ew} M_{ew} & \alpha_{ww} M_{ww} \end{bmatrix} \begin{bmatrix} R_e \\ R_w \end{bmatrix} + (1 - \lambda) \begin{bmatrix} V_e \\ V_w \end{bmatrix} \quad (6)

where M_{xy} denotes the affinity matrix from x to y, and x, y can be instances or workers. α_{xy} is used to make a trade-off of the mutual reinforcement strength among instances and workers. λ is the damping factor for determining the importance of a worker or an instance. V_e and V_w are the prior saliency (uncertainty and spammer score) of instances and workers, respectively.
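A minimal sketch of iterating Eq. 6 to convergence follows; the default weights, damping factor, and stopping rule are assumptions on our part.

```python
import numpy as np

def mrg_rank(M_ee, M_we, M_ew, M_ww, V_e, V_w,
             alphas=(0.5, 0.5, 0.5, 0.5), lam=0.85, iters=100, tol=1e-8):
    # Iterate Eq. 6 until the joint ranking vector [R_e; R_w] converges.
    a_ee, a_we, a_ew, a_ww = alphas
    n_e, n_w = len(V_e), len(V_w)
    A = np.block([[a_ee * M_ee, a_we * M_we],
                  [a_ew * M_ew, a_ww * M_ww]])
    V = np.concatenate([V_e, V_w])
    R = np.full(n_e + n_w, 1.0 / (n_e + n_w))  # uniform initial scores
    for _ in range(iters):
        R_new = lam * (A @ R) + (1 - lam) * V
        if np.linalg.norm(R_new - R, 1) < tol:
            R = R_new
            break
        R = R_new
    return R[:n_e], R[n_e:]  # instance scores, worker scores
```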

In the progressive validation process, the ranking scores are updated by increasing the prior saliency of verified instances and workers. Once the ranking scores change, the instance visualization and worker visualization employ the MRG model to update their uncertain candidates.

The Detail View and Interactions. In order to assist instance validation, we have developed a detail view for analysts to inspect instances and check their validation histories. A set of rich interactions is also provided to assist the validation process.

Instance Inspection in Detail. We allow the analyst to check the selected instance(s) in detail to confirm his/her validation. The analyst can click on one candidate instance or select a few using a lasso selection in the instance visualization, and the detail view will display the images associated with these instances. Similarly, the analyst can click on a worker on the scatter plot and this will bring the images labeled by this worker into the detail view. Clicking on a cell of the worker's prediction matrix will further filter down the images to those associated with the selected class. The analyst can enlarge an instance image and crop the image if s/he thinks the feature can be improved by removing some background objects or focusing more on a local area. The cropped image will be fed back to the feature extraction step, which triggers a linked update of the model, the instance and worker visualizations.

Validation Trail and Roll Back. We have developed a validation trail to show the numbers of validated instances and the instances whose labels are changed by the verifications. Fig. 1(d) shows the validation trail after a few rounds of validation and propagation. Each round is represented by two bars, where the left bar represents the number of verified instances and the right bar indicates the number of impacted instances. Analysts can click on the bars to recall which instances are verified and impacted. These instances will be displayed in the detail view as images. In addition, they are allowed to roll back to a previous validation status by clicking on “redo” under each validation bar.

Coordination between Instance and Worker Validations. A core feature of LabelInspect is its coordination between instance and spammer validation. We support a tightly linked interactive exploration of the instance and worker visualizations. For example, when an instance is hovered over, the workers who labeled this instance will be highlighted. When one or multiple instances are selected, displayed workers will be filtered down to those who have labeled any of these instances, and the dot for each worker will also change to a pie chart. The dot size matches the number of instances the worker labeled among the selected ones, and the area of each sector represents the ratio of the class labels the worker assigned to these instances. The color of each sector matches the color of the corresponding class. On the other hand, mousing over a cell of a worker's confusion matrix will highlight all instances labeled by him/her in the instance visualization. Given these linked interactions, the analyst can explore instance labels and worker reliabilities conjointly. Validation on either side will update the uncertain candidates on both instances and workers through the MRG model, and the propagation of analysts' validations will also update both instance label estimates and workers' spammer scores.

6 CASE STUDIES

We conducted two case studies to demonstrate how to effectively use LabelInspect to improve the quality of crowdsourced annotations from both the instance and worker perspectives.

6.1 Bird Image Validation

In this case study, we collaborated with A1 and A3 to show how LabelInspect helps improve crowdsourced annotations with lower estimation accuracy. Initially, A1 indicated that the model estimation accuracy of the bird dataset was relatively low (accuracy: 82.05%), and wanted to improve it by using LabelInspect.

Overview (T1, T2). To begin with, A1 looked at the confusion visualization (Fig. 1(a)) and quickly found that among the 10 species, most of the disagreements involved two species in one block. There were 5 pairs of confusion classes in total. Furthermore, the confusion degrees between these 5 pairs were relatively low. To further confirm this observation, A1 selected all 10 species and turned to the instance visualization. As shown in Fig. 1(b), only a few instances were projected to the center of the view, and the majority of the confusions were within each confusion class pair. Among the 5 pairs, the sparrow pair (“house sparrow” and “tree sparrow”), the sandpiper pair (“pectoral sandpiper” and “spotted sandpiper”), and the swan pair (“bewick's swan” and “mute swan”) were confused the most often. Therefore, A1 decided to start his validation with the sparrow pair (the ‘tree sparrow’ class in olive and the ‘house sparrow’ class in purple) since it involved the most confusion.

Starting with Instance Analysis (T2). With “house sparrow” and “tree sparrow” selected, the olive and purple dots were often mixed in the instance visualization. A1 looked at the snapshots of the uncertain instances recommended by the tool and he found that many of those in the center were misclassified by the M3V-based model. For example, the 5 instances circled in Fig. 6A were estimated to be “tree sparrow” (olive), however, they should have been “house sparrow” (purple). Therefore, A1 corrected their labels. It was very interesting to find that these 5 house sparrow instances were all misclassified as “tree sparrow” by the model. A1 wanted to understand why this happened. He then selected these 5 instances using the lasso selection in the instance visualization to examine how the workers labeled them. Given the selected instances, the scatter plot (Fig. 1(c)) in the worker visualization filtered for any workers that had labeled these 5 instances, and each of the remaining workers was then represented as a pie chart displaying the ratio of the labels they assigned to these instances.

Focusing on Spammer Analysis (T3, T4). The worker with the highest accuracy and lowest spammer score (worker A in Fig. 1E) caught A3's attention because he noticed that 2 instances were mislabeled by this worker though s/he had a high inferred reliability. Worker A's confusion matrix (Fig. 1D) indicated that his/her labeling results mostly matched with the estimates of the model, as the diagonal cells of the matrix had high values but off-diagonal cells had low values. Recall that the 5 mislabeled instances were all recognized as class ‘tree sparrow’ (olive) by this worker and the model. A3 hypothesized that this worker and the model must have both mislabeled many house sparrows (purple) as tree sparrows (olive). A3 then loaded the corresponding images into the detail view. These images were all estimated to be tree sparrow by the model and labeled as “tree sparrow” by worker A. A3 checked the first 8 images and found 4 of them should have been “house sparrow.” He also noticed that one image (Fig. 1J) was confusing because the key features (the colors of crown and cheek) were hidden. A3 consulted an expert (an ornithologist) and confirmed it to be “house sparrow.” This finding verified his hypothesis that worker A and the model both mislabeled many house sparrows as tree sparrows, and worker A's accuracy needed to be lowered.

Given the fact that 5 out of the 8 checked instances were labeled wrong by worker A, A3 doubted worker A's reliability. To quickly examine the labels of the remaining images labeled as “tree sparrow” by both the model and worker A, A3 hovered over the row name “tree sparrow” on the worker's confusion matrix to highlight these images in the instance visualization. As shown in Fig. 6(a), these images were projected in both areas of the two species. Similarly, the instances labeled as “house sparrow” by worker A were also projected evenly on the space. Accordingly, A3 verified worker A as a random partial spammer between “house sparrow” and “tree sparrow.” Then we utilized the two-step strategy (Sec. 5.3) to process worker A.

The spammer identification of worker A triggered an update in the MRG model that recommended several other workers needed to be investigated. As shown on the flow map (Fig. 1H), workers B (Fig. 1F) and C (Fig. 1G) were recommended but their spammer scores were not as high as other recommended workers. Being interested in these two workers, A3 checked their confusion matrices and some labeled instances.

Fig. 6. Instance visualizations of the selected classes: (a) "house sparrow" and "tree sparrow"; (b) "Norfolk Terrier" and "Norwich Terrier"; (c) "Irish Wolfhound" and "Scottish Deerhound."

Following a similar process to the analysis of worker A, A3 quickly confirmed that B and C were very likely to have the same spammer behavior as worker A. They tended to misidentify "tree sparrow" and "house sparrow," but their labeling results still matched the model's estimates to some extent. Therefore, A3 verified them as random partial spammers with regard to these two species. After verifying 11 instances and 3 spammers, A3 finished his first round of validation and propagated his verifications using the M3V-based model. At that point, the labeling accuracy of the bird dataset had increased from 82.05% to 83.15%.

Detecting More Spammers After Propagation (T4, T5). After the first round of propagation, A3 found that the spammer score of one unverified worker had increased a great deal. A3 checked this worker using a similar method as for worker A and verified that s/he was also a random spammer. A3 also noticed that the spammer score of worker A had changed from 0.089 to 0.603. However, this score was still not high given that worker A had now been verified as a random partial spammer for these two classes. Therefore, A3 checked the uncertain instances that were recommended by the tool and had been labeled by worker A, because by doing so A3 could very likely find more instances mislabeled by the M3V-based model. Similarly, A3 also checked some instance labels that had been suggested by the tool and labeled by other spammers. Following this strategy, A3 verified 16 instances until the spammer scores of the verified spammers became higher. During the validation process, A3 found two images that were hard to distinguish, as shown in Figs. 2(c-d), because they lacked the key features necessary to differentiate "house sparrow" and "tree sparrow." A3 then asked the expert to identify the labels of these two images and validated them with the expert labels. According to the validation trail, the propagation had also become stable, which indicated that not many instances were still being influenced by the changed labels. A3 was now satisfied with the current validation between "house sparrow" and "tree sparrow." In total, 4 spammers were identified and 33 instances were verified, and the accuracy improved to 84.45%.

Validating Other Classes. After finishing the validation of the two sparrow-related classes, A3 turned to the other confusion pairs. Similar cases were found in which some spammers had labels matching the model, so their spammer scores were low in the beginning. Following an exploration similar to that of the sparrow-related classes, A3 validated 9 instances for the eagle/falcon species, 8 for the owls, 7 for the sandpipers, and 3 for the swans. Correspondingly, the accuracy improved to 86.10%, 86.30%, 87.50%, and 89.55%. As shown in Table 1, compared with the M3V-based model (84.50%), LabelInspect achieved a higher accuracy when validating the same number of instances (3.0% validation effort). The validation effort is measured by the proportion of validated instances among all instances. To reach an accuracy of 89.55%, LabelInspect requires considerably less validation effort (3.0%) than the M3V-based model (more than 10.0%), which reduces expert effort by 70.0% [36].

Table 1. Accuracy (%) of the M3V-based model at different validation efforts.

Data   1.0%   2.5%   3.0%   5.0%   7.5%   10.0%
Bird   82.65  83.90  84.50  87.45  88.55  89.05
Dog    88.88  90.00  90.50  92.00  92.65  93.00

This is due to the fact that LabelInspect helped A1 and A3 select more representative instances and spammers by leveraging the coordinated visualizations. For example, A3 commented, "Worker inspection, instance verifications, and their mutual influence help enhance my validation effectiveness," whereas the M3V-based model [36] only considers instance label validation.
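As a quick check of these effort figures, the validation effort is simply the fraction of instances the expert verified, and the reported reduction compares the effort LabelInspect needed to reach 89.55% with the effort the M3V-based model needs for comparable accuracy (Table 1). A minimal sketch, assuming the 33 + 9 + 8 + 7 + 3 = 60 verified instances from this case study and the roughly 2,000 bird images mentioned in the Discussion:

```python
# Validation effort = verified instances / all instances.
validated, total = 33 + 9 + 8 + 7 + 3, 2000     # counts from the bird case study (approximate total)
effort_labelinspect = validated / total          # 0.03, i.e., 3.0%
effort_m3v = 0.10                                # effort the M3V-based model needs for ~89% accuracy (Table 1)
reduction = 1 - effort_labelinspect / effort_m3v # 0.70, i.e., 70% less expert effort
print(f"effort: {effort_labelinspect:.1%}, reduction: {reduction:.0%}")
```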

6.2 Dog Image Validation

This case study is used to demonstrate that LabelInspect can also improve crowdsourced annotations that already have a higher estimation accuracy. For this case study, we collaborated with A2 and A4. The dog dataset has four dog species labeled with AMT, and the results estimated by the M3V-based model have a relatively high accuracy (88.13%). However, validation is still desired by A4 to meet the high accuracy requirement of his classification task.

Overview (T1). Starting with the confusion visualization, A2 saw a clear block pattern that indicated the terrier-related classes and the hound-related classes were separable, while there seemed to be confusion between "Norfolk Terrier" and "Norwich Terrier," and between "Irish Wolfhound" and "Scottish Deerhound." A2 decided to start with the classes "Norfolk Terrier" and "Norwich Terrier."

Confirming Model-Inferred Labels (T2, T3). The instance visualization (Fig. 6(b)) showed that these two species were separated well. Judging from the image snapshots with the uncertain glyphs, it seemed that most of the label estimates were correct. However, A2 looked at the glyphs in the center and found that instance A (Fig. 6B) was labeled as "Norfolk Terrier" (blue) by many workers, while the model estimated it to be "Norwich Terrier" (orange). Noticing this contradiction, A2 clicked on the instance to check it. The worker visualization showed that although fewer workers had labeled it as "Norwich Terrier," they in general had high accuracy and low spammer scores. A2 found that this instance was recommended by the tool because many unreliable workers had mislabeled it while the model's estimate was reliable. Therefore, A2 confirmed the label of instance A and of two other instances like it (Fig. 6C and Fig. 6D). In addition, A2 also verified some other instances recommended by MRG. For the terrier-related classes, 7 instances were validated and the accuracy was improved to 90.75%.

Focusing on Instance Analysis (T1, T4). Next, A4 selected the classes "Irish Wolfhound" and "Scottish Deerhound" for further validation. The instance visualization of these two types of hounds showed a high level of confusion (Fig. 6(c)).


The recommended instances were all estimated to be "Scottish Deerhound" (red). However, A4 spotted many label errors by looking at the image snapshots. A4 speculated that the M3V-based model was inclined to assign the label "Scottish Deerhound" (red) to the instances, which also aligned with the fact that the red arc was much longer than the green arc. To investigate the underlying cause, A4 focused on validating instances that were estimated to be "Scottish Deerhound" (red). A4 started with instance B (Fig. 6E), which was mixed with many green instances. Checking the image, A4 found that instance B should be "Irish Wolfhound" and corrected it. The 5 most influenced instances, indicated by the flow map, were then checked by A4 consecutively, and they were all mislabeled. A4 propagated his validation after correcting these 6 labels. One instance moved a long way in the projected space, and A4 examined it closely. This instance was also labeled wrong, and A4 then assigned "Scottish Deerhound" to it. By repeating the above validation procedure, 13 instances were validated and the accuracy was improved to 93.00%. In total, 20 instances were validated (validation effort: 2.5%) to achieve an accuracy of 93.00%. As shown in Table 1, to achieve the same accuracy, the M3V-based model requires considerably more validation effort (10.0%) than LabelInspect. Thus, our tool reduces expert effort by 75.0%. A4 believed that this accuracy was acceptable for his classification task, so he stopped the verification process.

7 DISCUSSION

Our visual analytics method has presented evidence that well-designed interactive visualizations provide an effective data validation environment, which is consistent with previous findings that in many cases, visual analytics methods work better than pure machine-learning-based methods [15, 40, 41, 44, 56]. For the bird dataset, the initial accuracy of the label estimates learned by the M3V-based model was not high (82.05%). This led to a slight misjudgment in which several spammers did not receive high spammer scores. The first case study demonstrated that LabelInspect compensated for the M3V-based model greatly and improved the labeling accuracy effectively by defining partial spammers and allowing the analyst to investigate both workers and instances. On the other hand, the dog dataset has relatively high labeling quality, and the initial estimation accuracy was high (88.13%). For this type of dataset, the spammer score measures worker reliability quite well, and the analyst focused more on investigating and validating instances. The second case study demonstrated that iterative uncertain candidate selection and validation propagation effectively improved the labeling accuracy even on top of a high-quality dataset.

In addition to conducting two case studies, we also shared our system with three analysts from different research institutes. Overall, LabelInspect was well received by the analysts. They confirmed the usefulness of the tool for validating crowdsourced annotations and improving data quality. The visualization design was praised, and one analyst who played with the system for about an hour commented, "I can quickly find that 'tree sparrow' and 'house sparrow' are badly separated," and he found the spammers with a quick glance. The flow map on the instance visualization hinted at the newly identified uncertain instances and the most influenced instances. He liked the combination of the instance recommendation and the model propagation, and commented, "Whenever I fail to find more to validate, I use the propagate button, which seems to work pretty well." Another analyst, who had spent over two weeks tuning the parameters of the M3V-based model without the help of LabelInspect to improve the labeling accuracy from 82.05% to 89.05%, commented, "LabelInspect is very effective and useful in helping me improve the data quality of crowdsourced annotations. It only took me two hours to improve the data quality to a satisfactory level, which usually takes me one or more weeks." One crowdsourcing expert explicitly said, "This is valuable work," and he liked the worker visualization a lot, especially the idea of validating spammers and indicating on which classes they are likely to be wrong. "Validating spammer workers is a remarkable function in this tool, and identifying partial spammers is very useful. It saves more effort than validating one image at a time and provides more benefits."

Although LabelInspect was positively received by the analysts, they suggested several directions to improve this tool.

Generalization. In our implementation, image data was taken as an example to illustrate the basic idea of interactive data validation. The analysts expressed the need to employ LabelInspect to validate other types of crowdsourced data. Our method can easily be extended to handle other types of data with only minor changes to the visualization components. For example, for textual data, the snapshot in the instance visualization can be replaced with a keyword-based text summarization, and the detail view can display a multimedia-based summary generated by Document Cards [53], which includes the important keywords and images of a document. Beyond crowdsourced data validation, the novel concepts proposed in this paper can also be generalized to other usages. The constrained t-SNE projection is a visualization technique that can be used for generalized iterative analysis of uncertain training data, and the spammer score proposed in this paper, together with the worker visualization, can also be generalized to any crowdsourced labeling analysis. In addition, the method for integrating worker analysis and instance analysis can be generalized to multi-facet labeling tasks.

Scalability. The bird dataset has about 2,000 images, and LabelInspect can respond to each interaction in real time. However, when the number of instances becomes large (e.g., tens of thousands of instances), the instance projection and the model propagation will suffer from increased computational cost. Potential solutions are to employ a sampling technique (e.g., blue noise sampling [11, 39]) in the t-SNE-based projection, as sketched below, and to develop a scalable label propagation algorithm.
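As a rough illustration of the first mitigation, one could project only a sample of the instances. The sketch below uses plain random sampling and an off-the-shelf t-SNE as stand-ins for the blue noise sampling cited above and LabelInspect's constrained projection; the function name is hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_with_sampling(features, n_samples=2000, random_state=0):
    """Project only a sample of instances to keep the layout responsive.

    features : (n_instances, n_features) array of instance features.
    Returns the sampled indices and their 2-D t-SNE coordinates.
    """
    rng = np.random.default_rng(random_state)
    n = features.shape[0]
    idx = rng.choice(n, size=min(n_samples, n), replace=False)  # simple random sample
    coords = TSNE(n_components=2, random_state=random_state).fit_transform(features[idx])
    return idx, coords
```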

The instance visualization will also become cluttered when the number of instances increases, making it difficult to detect useful patterns from the visualization. One possible solution is to combine a density-based visualization [35, 60] with the t-SNE projection to reveal overall patterns, important instances, and outliers.

Insufficient handling of wrong validation. Currently, our method regards the analysts' validations as ground truth. However, they may not always be correct, and a wrong validation can lead to a decline in accuracy. One possible solution to avoid accepting a wrong validation is to examine the gap between the model-centered estimation and the human-centered analysis. If the gap is large, we could ask the analyst to re-check his/her validation based on a visual explanation that illustrates why the model recommended a different prediction; a minimal sketch of this check follows. Other, more effective solutions to this problem are still open to investigation.
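A minimal sketch of such a gap check, assuming the data layout, function name, and threshold below (the posterior probabilities would come from the learning-from-crowds model):

```python
def flag_doubtful_validations(validations, model_posteriors, threshold=0.05):
    """Return expert validations that the model strongly disputes, for re-checking.

    validations      : dict mapping instance id -> expert-assigned class
    model_posteriors : dict mapping instance id -> dict of class -> probability
    threshold        : below this model probability, the validation is flagged
    """
    doubtful = []
    for instance_id, expert_label in validations.items():
        p = model_posteriors[instance_id].get(expert_label, 0.0)
        if p < threshold:  # large gap between model estimate and expert label
            doubtful.append((instance_id, expert_label, p))
    # Sort so the largest disagreements are re-checked first.
    return sorted(doubtful, key=lambda item: item[2])
```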

8 CONCLUSION

In this paper, we have developed LabelInspect, a visual analytics tool that supports crowdsourced data validation for preparing labeled training data. LabelInspect incorporates learning models and visualizations and enables mutually influencing validation of both instance-level and worker-level data. The new techniques proposed include a constrained t-SNE projection for instance visualization, a block-based spammer score calculation, and an integrated system that incorporates experts' verifications into an iterative and progressive validation process. We conducted two case studies to demonstrate the effectiveness of LabelInspect, and the results indicate that the label accuracy was improved successfully.

Although LabelInspect has been shown to be effective, further work on the following aspects is still promising. First, addressing the above-mentioned limitations in terms of scalability and expert trust would be critical for promoting this tool to wider usage. Second, to strengthen the analysis of worker behavior, reputation-based worker/instance filtering needs to be further studied. Third, it might be beneficial to allow multiple experts to work together on validating the same datasets to obtain a more accurately labeled dataset.

ACKNOWLEDGMENTS

This research was funded by the National Key R&D Program of China (No. SQ2018YFB100002), the National Natural Science Foundation of China (Nos. 61761136020, 61672308), and Microsoft Research Asia. The authors would like to thank Liwang Zhou for implementing part of the learning-from-crowds algorithm and the anonymous reviewers for their valuable comments. B. Wang is the corresponding author.


REFERENCES

[1] C. Arbesser, F. Spechtenhauser, T. Muhlbacher, and H. Piringer. Visplause: Visual data quality assessment of many time series using plausibility checks. IEEE Transactions on Visualization and Computer Graphics, 23(1):641–650, 2017.

[2] Z. Bar-Joseph, D. K. Gifford, and T. S. Jaakkola. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics, 17(suppl 1):S22–S29, 2001.

[3] M. Behrisch, B. Bach, N. Henry Riche, T. Schreck, and J.-D. Fekete. Matrix reordering methods for table and network visualization. Computer Graphics Forum, 35(3):693–716, 2016.

[4] M. Behrisch, F. Korkmaz, L. Shao, and T. Schreck. Feedback-driven interactive exploration of large multidimensional data supported by visual classifier. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 43–52, 2014.

[5] J. Bernard, M. Hutter, M. Zeppelzauer, D. Fellner, and M. Sedlmair. Comparing visual-interactive labeling with active learning: An experimental study. IEEE Transactions on Visualization and Computer Graphics, 24(1):298–308, 2018.

[6] J. Bernard, D. Sessler, A. Bannach, T. May, and J. Kohlhammer. A visual active learning system for the assessment of patient well-being in prostate cancer research. In The Workshop on Visual Analytics in Healthcare, pages 1:1–1:8, 2015.

[7] T. Blascheck, F. Beck, S. Baltes, T. Ertl, and D. Weiskopf. Visual analysis and coding of data-rich user behavior. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 141–150, 2016.

[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.

[9] P. Bruneau and B. Otjacques. An interactive, example-based, visual clustering system. In Proceedings of the International Conference on Information Visualisation, pages 168–173, 2013.

[10] K. Buchin, B. Speckmann, and K. Verbeek. Flow map layout via spiral trees. IEEE Transactions on Visualization and Computer Graphics, 17(12):2536–2544, 2011.

[11] H. Chen, W. Chen, H. Mei, Z. Liu, K. Zhou, W. Chen, W. Gu, and K.-L. Ma. Visual abstraction and exploration of multi-class scatterplots. IEEE Transactions on Visualization and Computer Graphics, 20(12):1683–1692, 2014.

[12] F. Chollet et al. Keras. https://keras.io/, 2018. Last accessed 2018-03-31.

[13] J. Choo, C. Lee, C. K. Reddy, and H. Park. Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics, 19(12):1992–2001, 2013.

[14] J. Choo and S. Liu. Visual analytics for explainable deep learning. IEEE Computer Graphics and Applications, 38(4):84–92, 2018.

[15] W. Cui, S. Liu, L. Tan, C. Shi, Y. Song, Z. Gao, H. Qu, and X. Tong. Textflow: Towards better understanding of evolving topics in text. IEEE Transactions on Visualization and Computer Graphics, 17(12):2412–2421, 2011.

[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[17] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning, pages 647–655, 2014.

[18] Y. Duan, Z. Chen, F. Wei, M. Zhou, and H.-Y. Shum. Twitter topic summarization by ranking tweets using social influence and content quality. In Proceedings of the International Conference on Computational Linguistics, pages 763–780, 2012.

[19] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.

[20] A. Endert, W. Ribarsky, C. Turkay, B. Wong, I. Nabney, I. D. Blanco, and F. Rossi. The state of the art in integrating machine learning into visual analytics. Computer Graphics Forum, 36(8):458–486, 2017.

[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[22] M. C. Hao, H. Janetzko, S. Mittelstadt, W. Hill, U. Dayal, D. A. Keim, M. Marwah, and R. K. Sharma. A visual analytics approach for peak-preserving prediction of large seasonal time series. Computer Graphics Forum, 30(3):691–700, 2011.

[23] F. Heimerl, S. Koch, H. Bosch, and T. Ertl. Visual classifier training for text document retrieval. IEEE Transactions on Visualization and Computer Graphics, 18(12):2839–2848, 2012.

[24] J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmaki, and H. T. Toivonen. Time series segmentation for context recognition in mobile devices. In Proceedings of the IEEE International Conference on Data Mining, pages 203–210, 2001.

[25] B. Hoferlin, R. Netzel, M. Hoferlin, D. Weiskopf, and G. Heidemann. Inter-active learning of ad-hoc classifiers for video visual analytics. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 23–32, 2012.

[26] D. Holten and J. J. Van Wijk. Force-directed edge bundling for graph visualization. Computer Graphics Forum, 28(3):983–990, 2009.

[27] N. Q. V. Hung, D. C. Thang, M. Weidlich, and K. Aberer. Minimizing efforts in validating crowd answers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 999–1014, 2015.

[28] S. Jagabathula, L. Subramanian, and A. Venkataraman. Reputation-based worker filtering in crowdsourcing. In Proceedings of the Advances in Neural Information Processing Systems, pages 2492–2500, 2014.

[29] G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543, 2002.

[30] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3363–3372, 2011.

[31] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In Proceedings of the International Conference on Advanced Visual Interfaces, pages 547–554, 2012.

[32] G. Kazai, J. Kamps, and N. Milic-Frayling. Worker types and personality traits in crowdsourcing relevance labels. In Proceedings of the ACM International Conference on Information and Knowledge Management, pages 1941–1944, 2011.

[33] M. Kim, K. Kang, D. Park, J. Choo, and N. Elmqvist. TopicLens: Efficient multi-level visual topic exploration of large-scale document collections. IEEE Transactions on Visualization and Computer Graphics, 23(1):151–160, 2017.

[34] J. Krause, A. Perer, and H. Stavropoulos. Supporting iterative cohort construction with visual temporal queries. IEEE Transactions on Visualization and Computer Graphics, 22(1):91–100, 2016.

[35] H. Liao, Y. Wu, L. Chen, and W. Chen. Cluster-based visual abstraction for multivariate scatterplots. IEEE Transactions on Visualization and Computer Graphics (accepted), 2018.

[36] M. Liu, L. Jiang, J. Liu, X. Wang, J. Zhu, and S. Liu. Improving learning-from-crowds through expert validation. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2329–2336, 2017.

[37] M. Liu, S. Liu, X. Zhu, Q. Liao, F. Wei, and S. Pan. An uncertainty-aware approach for exploratory microblog retrieval. IEEE Transactions on Visualization and Computer Graphics, 22(1):250–259, 2016.

[38] M. Liu, J. Shi, K. Cao, J. Zhu, and S. Liu. Analyzing the training processes of deep generative models. IEEE Transactions on Visualization and Computer Graphics, 24(1):77–87, 2018.

[39] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu. Towards better analysis of deep convolutional neural networks. IEEE Transactions on Visualization and Computer Graphics, 23(1):91–100, 2017.

[40] S. Liu, W. Cui, Y. Wu, and M. Liu. A survey on information visualization: recent advances and challenges. The Visual Computer, 30(12):1373–1393, 2014.

[41] S. Liu, X. Wang, C. Collins, W. Dou, F. Ouyang, M. El-Assady, L. Jiang, and D. Keim. Bridging text visualization and mining: A task-driven survey. IEEE Transactions on Visualization and Computer Graphics (accepted), 2018.

[42] S. Liu, X. Wang, M. Liu, and J. Zhu. Towards better analysis of machine learning models: A visual analytics perspective. Visual Informatics, 1(1):48–56, 2017.

[43] S. Liu, J. Xiao, J. Liu, X. Wang, J. Wu, and J. Zhu. Visual diagnosis of tree boosting methods. IEEE Transactions on Visualization and Computer Graphics, 24(1):163–173, 2018.

[44] S. Liu, M. X. Zhou, S. Pan, Y. Song, W. Qian, W. Cai, and X. Lian. Tiara: Interactive, topic-based visual text summarization and analysis. ACM Transactions on Intelligent Systems and Technology, 3(2):25, 2012.

[45] Y. Lu, H. Wang, S. Landis, and R. Maciejewski. A visual analytics framework for identifying topic drivers in media events. IEEE Transactions on Visualization and Computer Graphics (accepted), 2018.

[46] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[47] R. Maciejewski, A. Pattath, S. Ko, R. Hafen, W. S. Cleveland, and D. S. Ebert. Automated box-cox transformations for improved visual encoding. IEEE Transactions on Visualization and Computer Graphics, 19(1):130–140, 2013.

[48] Y. Meng, H. Zhang, M. Liu, and S. Liu. Clutter-aware label layout. In Proceedings of the IEEE Pacific Visualization Symposium, pages 207–214, 2015.

[49] J. G. S. Paiva, W. R. Schwartz, H. Pedrini, and R. Minghim. An approach to supporting incremental visual data classification. IEEE Transactions on Visualization and Computer Graphics, 21(1):4–17, 2015.

[50] J. H. Park, S. Nadeem, S. Mirhosseini, and A. Kaufman. C2A: Crowd consensus analytics for virtual colonoscopy. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 21–30, 2016.

[51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[52] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Computing Research Repository, abs/1409.1556, 2014.

[53] H. Strobelt, D. Oelke, C. Rohrdantz, A. Stoffel, D. A. Keim, and O. Deussen. Document cards: A top trumps visualization for documents. IEEE Transactions on Visualization and Computer Graphics, 15(6):1145–1152, 2009.

[54] J. Sun, F. Wang, J. Hu, and S. Edabollahi. Supervised patient similarity measure of heterogeneous patient records. ACM SIGKDD Explorations Newsletter, 14(1):16–24, 2012.

[55] J. Talbot, B. Lee, A. Kapoor, and D. S. Tan. EnsembleMatrix: Interactive visualization to support machine learning with multiple classifiers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1283–1292, 2009.

[56] G. K. Tam, V. Kothari, and M. Chen. An analysis of machine- and human-analytics in classification. IEEE Transactions on Visualization and Computer Graphics, 23(1):71–80, 2017.

[57] T. Tian and J. Zhu. Max-margin majority voting for learning from crowds. In Proceedings of the Advances in Neural Information Processing Systems, pages 1621–1629, 2015.

[58] P. Wais, S. Lingamneni, D. Cook, J. Fennell, B. Goldenberg, D. Lubarov, D. Marin, and H. Simons. Towards building a high-quality workforce with mechanical turk. In NIPS Workshop on Computational Social Science and the Wisdom of Crowds, pages 1–5, 2010.

[59] E. Wall, S. Das, R. Chawla, B. Kalidindi, E. T. Brown, and A. Endert. Podium: Ranking data using mixed-initiative visual analytics. IEEE Transactions on Visualization and Computer Graphics, 24(1):288–297, 2018.

[60] X. Wang, S. Liu, J. Liu, J. Chen, J. Zhu, and B. Guo. TopicPanorama: A full picture of relevant topics. IEEE Transactions on Visualization and Computer Graphics, 22(12):2508–2521, 2016.

[61] L. Wilkinson. Visualizing big data outliers through distributed aggregation. IEEE Transactions on Visualization and Computer Graphics, 24(1):256–266, 2018.

[62] W. Willett, S. Ginosar, A. Steinitz, B. Hartmann, and M. Agrawala. Identifying redundancy and exposing provenance in crowdsourced data analysis. IEEE Transactions on Visualization and Computer Graphics, 19(12):2198–2206, 2013.

[63] J. Xia, W. Chen, Y. Hou, W. Hu, X. Huang, and D. S. Ebert. Dimscanner: A relation-based visual exploration approach towards data dimension inspection. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 81–90, 2016.

[64] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In Proceedings of the International Conference on Computer Vision, pages 2018–2025, 2011.

[65] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In Proceedings of the Advances in Neural Information Processing Systems, pages 2195–2203, 2012.