

Adaptive Result Inference for Collecting Quantitative Data with Crowdsourcing

Hailong Sun, Member, IEEE, Kefan Hu, Yili Fang, Yangqiu Song

Abstract—In quantitative crowdsourcing, workers are asked to provide numerical answers. Different from categorical crowdsourcing, result aggregation in quantitative crowdsourcing combines all workers' answers computationally instead of merely choosing one from a set of candidate answers. Therefore, existing result aggregation models for categorical crowdsourcing tasks cannot be used in quantitative crowdsourcing. Moreover, worker ability often varies during the crowdsourcing process with changes in workers' skill, willingness, effort, etc. In this work, we propose a probabilistic model that characterizes the quantitative crowdsourcing problem while accounting for the change of worker ability, so as to achieve better quality control. The dynamic worker ability is obtained with the Kalman Filter and Smoother. We design an Expectation-Maximization based inference algorithm and a dynamic worker filtering algorithm to compute the aggregated crowdsourcing result. Finally, we conducted experiments with real data on CrowdFlower, and the results showed that our approach can effectively rule out low-quality workers dynamically and obtain more accurate results at lower cost.

Index Terms—Quantitative crowdsourcing, crowdsensing, quality control, result inference.


1 INTRODUCTION

Crowdsourcing has been successfully used to collect data for a plethora of applications with the power of the crowd. As a result, many crowdsourcing systems [1] have been developed to deal with various tasks. For instance, in tasks such as image labeling, audio recognition, video annotation and sentiment analysis, crowdsourcing has become an important approach to generating training sets for machine learning algorithms in data analytics; in data sensing tasks, crowdsourcing, often known as crowdsensing [2] in this context, is used to collect data from various sources. Essentially, the effectiveness of crowdsourcing depends on the quality of task results under certain costs, which is known as the quality control problem. Furthermore, given a task and a set of candidate answers provided by workers, the quality of a task result is completely determined by result inference, the process of generating the task result by aggregating the candidate answers. As crowdsourcing is significantly affected by worker ability, task difficulty, incentives and other factors, the quality of an individual candidate answer is usually unreliable. As a result, it is non-trivial to obtain high-quality results from a set of unreliable answers. Therefore, result inference is one of the prominent challenges faced by crowdsourcing.

In practice, result inference is closely related to the characteristics of crowdsourcing tasks. Different tasks usually require specially designed result inference. Most existing crowdsourcing efforts focus on classification tasks, or categorical crowdsourcing tasks, in which a worker is asked to provide a label and the final result is selected from all the submitted answers. For instance, in typical sentiment analysis tasks [3], a worker is asked to choose a label from {"positive", "neutral", "negative"} and the final result

• H. Sun, K. Hu, and Y. Fang are with the School of Computer Science and Engineering, Beihang University, Beijing, China, 100191. Y. Song is with the Department of Computer Science, Hong Kong University of Science and Technology. Corresponding author: Hailong Sun. E-mail: [email protected]

is determined within the same answer set. The challenge of result inference for classification tasks centres on how to measure the weight of each candidate answer. Meanwhile, quantitative crowdsourcing tasks, e.g., counting the objects shown in a picture [4], [5] and peer grading [6], require workers to give a quantitative estimate as opposed to a discrete label, and the final result is computed by combining all the reported numerical values. For instance, given a task of recognizing the number of people in a picture, workers may provide an answer set such as {12, 13, 100, 2, 19}. The final result may be none of the candidate answers provided. Hence quantitative crowdsourcing tasks call for different approaches to effectively infer task results. Furthermore, as worker ability directly influences the quality of workers' answers and hence the quality of the crowdsourced results, it is of great importance to properly incorporate worker ability into result inference. In crowdsourcing, workers' performance in task processing can change dynamically, which is confirmed in [7] with the UCI data set. On the one hand, worker ability should normally improve in the long run, because a worker understands the tasks and learns the required expertise better after processing more and more tasks. On the other hand, if a worker has been continuously processing crowdsourcing tasks for a period of time, her ability may drop due to distraction, tiredness, or other factors. In recent years, considering the dynamic change of worker ability in result inference has drawn a lot of attention from the crowdsourcing research community [7], [8], [9], [10].

In this work, we aim to study the result inference problem for quantitative crowdsourcing tasks. There have been many works [11], [12], [13] on discovering the truth in categorical crowdsourcing tasks. Workers are usually modelled as classifiers whose accuracy is subject to several factors, including worker ability, bias, task difficulty, etc. The final output of a task is then inferred with algorithms that evaluate which candidate answer provided by the workers is most probably correct. However, inferring the result of a quantitative crowdsourcing


task involves effective aggregation of worker answers instead of selecting one of the candidate answers. Therefore, the existing methods for categorical crowdsourcing cannot be applied to quantitative crowdsourcing. Moreover, quantitative crowdsourcing tasks are more sensitive to worker ability, because the range of an answer set depends largely on worker ability. Hence, it is of significant importance to consider the change of worker ability when processing quantitative crowdsourcing tasks. Few efforts have studied quantitative crowdsourcing tasks; [14], [15] are the most recent works in this regard, but neither considers the dynamic change of worker ability. In summary, there are two challenges in result inference for quantitative crowdsourcing: first, how to model the process of answer generation and result aggregation; second, how to incorporate the changing worker ability into result inference.

We propose a generative model to characterize quantitative crowdsourcing task processing, in which the latent truth, worker bias, worker ability and workers' estimates are considered. In particular, we incorporate the dynamic worker ability into our model. Intuitively, worker ability is a time-varying variable. In this work, worker ability is modelled as a linear dynamic system based on the time of task completion, and we apply the Kalman Filter and Smoother to track the dynamic change of worker ability. We then design an EM-based algorithm to infer the task result, and an algorithm to filter out low-ability workers. We conducted a set of experiments on CrowdFlower, in which workers were asked to count the objects shown in a set of pictures. Comparison with the manually obtained ground truth confirms the effectiveness of our method. In addition, we performed extensive simulations to study the factors that may affect the effectiveness of our adaptive result inference approach. We summarize our contributions as follows:

• We design a model in which dynamic worker ability is considered in the process of quantitative crowdsourcing. To the best of our knowledge, this is the first effort to take the dynamic reliability of workers into account for better aggregating the answers in quantitative crowdsourcing.

• We investigate the precision drift of workers and use a Kalman Filter to track this drift, integrating it into our unsupervised probabilistic model, which traces workers' precision changes during inference and attains more accurate results.

• We propose a method that dynamically selects qualified workers for task participation, thereby reducing the cost of the crowdsourcing system and improving the accuracy of the aggregated results.

• We conducted both real-world experiments on CrowdFlower and simulations. The experimental results confirmed that considering the dynamic change of worker ability helps improve the result quality of quantitative crowdsourcing.

The rest of the paper is structured as follows. We describe the problem and our framework in Section 2. Section 3 presents our adaptive result inference model. We introduce the worker filtering algorithm in Section 4. Section 5 shows the experimental results, and Section 6 discusses the proposed method in the context of categorical crowdsourcing. We describe related work in Section 7 and conclude in Section 8.

[Figure: workers on a crowdsourcing platform process quantitative tasks (e.g., blood cell counting, people counting); for each time slot t, the worker responses feed result inference and worker-ability estimation, which drive dynamic worker filtering and the filtering decision.]
Fig. 1: Framework of adaptive quantitative crowdsourcing.

[Figure: deviation vs. task sequence for two sample workers; (a) Worker 1 (ID 31001292), (b) Worker 2 (ID 26557959).]
Fig. 2: Worker precision varies over time.

2 PROBLEM DESCRIPTION

We first briefly describe the problem. There are $N$ quantitative crowdsourcing tasks for $M$ workers to process. For task $i$, we use $\mu_i$ to denote the latent ground truth and $r_{i,j}$ to denote the reported estimate from worker $j$, where $r_{i,j}$ is a numerical value. Finally, we denote by $Q_j$ the question set answered by worker $j$, and by $U_i$ the set of workers providing answers to question $i$. For each task, our goal is to obtain an aggregated result, based on all the reported answers, that is as close to the ground truth as possible.

Furthermore, Figure 1 presents the processing framework of adaptive quantitative crowdsourcing proposed in this work. The quantitative tasks (e.g., counting objects) are published to public crowdsourcing platforms to be processed by crowd workers, and the workers' responses are collected. Periodically, we alternate between performing result inference with the EM algorithm and estimating the ability of each worker. Then, according to the worker-ability estimates, we employ an algorithm to filter out low-quality workers so as to improve the result quality.

3 DYNAMIC AGGREGATION MODEL

3.1 Model Design

In practice, $r_{i,j}$ can be affected by many factors. Here we mainly consider three: worker bias, worker ability, and the latent truth $\mu_i$.

Worker Bias. Bias $b_j$ reflects a worker's tendency to either underestimate or overestimate in a quantitative task, which is actually the mean distance between a worker's answer and the corresponding ground truth.

[Figure: graphical model with observed responses $r_{i,j}$; latent truth $\mu_i$ with hyperparameters $u$, $v$; worker bias $b_j$; and time-varying precision states $\ldots, \phi_{j,t-1}, \phi_{j,t}, \ldots$; plates over $N$ questions and $M$ workers.]
Fig. 3: The graphical model of the RID model.

Worker Ability. Worker ability is a factor determining how precise the answers provided by a worker can be. We thus use worker precision $\phi_j$ to represent worker ability; it reflects the closeness of a worker's answers to the corresponding truth values on average. $\phi_j$ is the inverse variance of a worker's answers. Worker precision is subject to her skill fluency, expertise, willingness, etc., and thus may not be a fixed value. In Figure 2, we plot the deviation value ($r_{i,j} - \mu_i - b_j$) of two randomly sampled workers from our CrowdFlower dataset described in Section 5. We can observe that even after the bias is removed, the deviation of the workers' answers from the ground truth is still constantly changing, which confirms our argument. Due to space limitations, we do not plot the ability change of all 286 workers involved in our experiments. Note that the two workers presented here are not cherry-picked; in fact, all the workers exhibit changing ability, but the pattern of change varies a lot across workers. In Section 5, we further show the ability-tracking results of the top five workers who completed the most tasks. Therefore, different from existing research, we model worker precision as a random variable that changes dynamically with time.

With the above analysis, we present our RID (Result Inference with Dynamic worker ability) model, shown in Fig. 3. The shaded node represents the observed response variable, and the blank nodes represent the latent truth $\mu_i$, worker bias $b_j$, and precision $\phi_{j,t}$ at the current time $t$, where $\phi_{j,t}$ is modelled as a latent state in an LDS, transitioning from the previous state $\phi_{j,t-1}$. The plates describe the replication of the two major components in the graphical model.

Latent truth. $\mu_i$ is modelled with a Gaussian prior:

$\mu_i \sim \mathcal{N}(u, v)$, (1)

where $u$ and $v$ are two hyperparameters representing the mean and variance, respectively.

The generation of a worker's answer at a certain time $t$ is then modelled as:

$r_{i,j} \mid \mu_i, b_j, \phi_{j,t} \sim \mathcal{N}(r_{i,j} \mid \mu_i + b_j, 1/\phi_{j,t})$, (2)

where $\phi_{j,t}$ is a latent state that depends on the previous state $\phi_{j,t-1}$. Thus the aggregated result of each task is inferred by considering worker precision at the time the answers were provided. In this way, we hope to improve the quality of the aggregated crowdsourcing results. We explain how $\phi_{j,t}$ is modelled and computed in the following subsection.
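For concreteness, the following minimal Python sketch samples one answer under Eqs. (1)-(2); the function name, the NumPy usage, and the example values are our own illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_answer(u, v, b_j, phi_jt, rng=rng):
    """Sample one worker answer under the generative model of Eqs. (1)-(2).

    u, v   : prior mean and variance of the latent truth mu_i
    b_j    : worker j's bias
    phi_jt : worker j's precision at time t (1/phi_jt is the variance)
    """
    mu_i = rng.normal(u, np.sqrt(v))                      # Eq. (1): mu_i ~ N(u, v)
    r_ij = rng.normal(mu_i + b_j, np.sqrt(1.0 / phi_jt))  # Eq. (2)
    return mu_i, r_ij

mu_i, r_ij = sample_answer(u=20.0, v=9.0, b_j=1.5, phi_jt=0.25)
```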

3.2 Dynamic Computing of Worker Precision

As worker precision is affected by multiple factors, it is very challenging to identify all the factors and accurately model their influence separately. Instead, we are concerned with the collective influence of all the factors. In this regard, [9] employs an LDS (Linear Dynamic System) [16] to solve the categorical crowdsourcing problem in citizen science applications. Likewise, in this work we use an LDS to describe the dynamic change of worker precision for quantitative crowdsourcing problems. We use $\phi_{j,t}$ to denote the hidden state of the LDS corresponding to the precision of worker $j$ at time $t$. The state transition is defined as follows:

$\phi_{j,t} = \phi_{j,t-1} + w$, (3)

where $w$ is a variable following a Gaussian distribution:

$w \sim \mathcal{N}(0, \lambda^2)$. (4)

This indicates that worker precision may drift slightly and randomly from its previous state, with the change sampled from a Gaussian distribution whose parameter $\lambda$ controls the variation of the precision drift.

We need to compute the observed worker precision from the workers' answers and their biases. Since $r_{i,j}$ is drawn from the Gaussian distribution $\mathcal{N}(\mu_i + b_j, 1/\phi_{j,t})$, we can derive an estimate of the worker precision by taking the expectation of $|r_{i,j} - \mu_i - b_j|$:

$\mathbb{E}(|r_{i,j} - \mu_i - b_j|) = \dfrac{\sqrt{2}}{\sqrt{\pi}\,\phi_j}$. (5)

Thus $\phi_{j,t}$ can be estimated by:

$\phi_{j,t}^2 = \dfrac{2}{\pi (r_{i,j,t} - \mu_i - b_j)^2}$. (6)

At time $t$ we assume that the truth value $\mu_i$ of each question has already been aggregated; the inference process is introduced in the next subsection. With $\mu_i$, the emission function of the precision $\phi_{j,t}$ can be defined as:

$o_{i,j} = \phi_{j,t} + v = \dfrac{\sqrt{2}}{\sqrt{\pi}\,|r_{i,j} - \mu_i - b_j|} + v$, (7)

where $o_{i,j}$ represents the observation of the latent state $\phi_{j,t}$ and $v$ is white noise following a Gaussian distribution $\mathcal{N}(0, \gamma^2)$.

We can then give the inference solution using the Kalman Filter [17], an estimator for LDS. Assume that at $t-1$, the precision of a worker is denoted by $\phi_{t-1|t-1}$, meaning that it is the maximum posterior value given the observation $o_{t-1}$. With $\phi_{t-1|t-1}$, we can predict the precision at the next time slot $t$ as:

$\phi_{t|t-1} \sim \mathcal{N}(\phi_{t-1|t-1}, P_{t|t-1})$, (8)

where $P_{t|t-1}$ is the variance corresponding to the prediction $\phi_{t|t-1}$, updated from the variance $P_{t-1|t-1}$ of $\phi_{t-1|t-1}$:

$P_{t|t-1} = P_{t-1|t-1} + \lambda$. (9)

At time $t$, we use $o_{i,j}$ as the observation of the Kalman Filter. For each worker who gives responses at time $t$, we update her precision by:

$\phi_{t|t} = \phi_{t|t-1} + K_t(o_{i,j} - \phi_{t|t-1})$, (10)

where $K_t$ is the Kalman gain, defined by:

$K_t = P_{t|t-1}(P_{t|t-1} + \gamma)^{-1}$. (11)

The variance of $\phi_{t|t}$ is updated by:

$P_{t|t} = (1 - K_t)P_{t|t-1}$. (12)

For each worker, the above process iterates over (8)-(12) from $t = 1$ to $T$, where $T$ is the last time slot in which the worker provides an answer.
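As a concrete illustration, here is a minimal scalar Kalman Filter over a worker's precision states, Eqs. (8)-(12), in Python; the function name, initial values, and array layout are assumptions made for this sketch.

```python
import numpy as np

def kalman_filter(obs, phi0, P0, lam, gamma):
    """Forward Kalman Filter over a worker's precision states, Eqs. (8)-(12).

    obs   : observations o_1..o_T from the emission function, Eq. (7)
    phi0  : initial precision estimate; P0: its variance
    lam   : process-noise term added in Eq. (9)
    gamma : observation-noise term in the Kalman gain, Eq. (11)
    Returns filtered means phi_{t|t}, variances P_{t|t}, and the one-step
    prediction variances P_{t|t-1} needed later by the smoother.
    """
    T = len(obs)
    phi_f, P_f, P_pred = np.empty(T), np.empty(T), np.empty(T)
    phi_prev, P_prev = phi0, P0
    for t in range(T):
        phi_pred = phi_prev                            # random-walk transition, Eq. (3)
        P_pred[t] = P_prev + lam                       # Eq. (9)
        K = P_pred[t] / (P_pred[t] + gamma)            # Eq. (11)
        phi_f[t] = phi_pred + K * (obs[t] - phi_pred)  # Eq. (10)
        P_f[t] = (1.0 - K) * P_pred[t]                 # Eq. (12)
        phi_prev, P_prev = phi_f[t], P_f[t]
    return phi_f, P_f, P_pred
```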

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4

The Kalman Filter uses a forward recursion, in which $\phi_{j,t|t}$ is calculated only from previous and current observations. We therefore also run a Kalman Smoother [18] over all observations to obtain the optimal posterior state $\phi_{j,t|N}$, where $N$ denotes all observations. At time slot $t$, we update $\phi_{j,t|N}$ backward recursively from subsequent steps, given the observations from $t+1$ to $T$. At $t$, we have $\phi_{t|t}$ from (10) and $P_{t|t}$ from (12). The optimal posterior estimate $\phi_{t|N}$, given a known $\phi_{t+1|N}$, is derived as:

$\phi_{t|N} = \phi_{t|t} + J_t(\phi_{t+1|N} - \phi_{t|t})$. (13)

We update the posterior variance $P_{t|N}$ from $P_{t+1|N}$ of the backward time step $t+1$ as:

$P_{t|N} = P_{t|t} + J_t^2(P_{t+1|N} - P_{t+1|t})$, (14)

where $J_t$ is defined as:

$J_t = P_{t|t}P_{t+1|t}^{-1}$. (15)

Running (13)-(15) backward from $T$ to $1$ yields the optimal posterior estimate of the worker precision at each $t$.
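A matching sketch of the backward smoothing pass, Eqs. (13)-(15), reusing the quantities returned by the `kalman_filter` sketch above; this follows the standard Rauch-Tung-Striebel form, written under the same assumptions.

```python
def kalman_smoother(phi_f, P_f, P_pred):
    """Backward Kalman Smoother, Eqs. (13)-(15), run after kalman_filter.

    phi_f, P_f : filtered means/variances phi_{t|t}, P_{t|t}
    P_pred     : one-step prediction variances, P_pred[t+1] = P_{t+1|t}
    Returns the smoothed estimates phi_{t|N} and variances P_{t|N}.
    """
    T = len(phi_f)
    phi_s, P_s = phi_f.copy(), P_f.copy()                     # phi_{T|N} = phi_{T|T}
    for t in range(T - 2, -1, -1):
        J = P_f[t] / P_pred[t + 1]                            # Eq. (15)
        phi_s[t] = phi_f[t] + J * (phi_s[t + 1] - phi_f[t])   # Eq. (13)
        P_s[t] = P_f[t] + J**2 * (P_s[t + 1] - P_pred[t + 1]) # Eq. (14)
    return phi_s, P_s
```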

3.3 Model Inference

We now describe how to perform inference with our RID model; the goal of the inference is to find the truth $\mu_i$ for each crowdsourcing task. As described in the previous subsection, we treat $\mu_i$ as a latent variable and define $\Theta = \{\phi_j, b_j\}$ as the parameter set of our model. The likelihood of the observed worker responses $D = \{r_{i,j}\}$ can thus be represented as follows:

$p(D \mid \Theta, \mu) = \prod_{i=1}^{N} p(\mu_i) \prod_{j \in U_i} p(r_{i,j} \mid \phi_{j,time(j,i)}, \mu_i, b_j)$, (16)

where $time(j, i)$ is the mapping function that returns the time slot in which worker $j$ provided a response to task $i$.

To infer the unknown parameters in Equation 16, it is natural to consider a maximum likelihood estimation algorithm. However, since both $\mu_i$ and $\phi_{j,time(j,i)}$ are unobservable, exact inference is impossible. In this case, the Expectation-Maximization (EM) algorithm [19] is often used to iteratively learn each variable in the model. The EM algorithm consists of an E step (Expectation) and an M step (Maximization): the E step estimates the log-likelihood function determined by Equation 16, and the M step solves for the unknown parameters by maximum likelihood estimation. The algorithm iterates over these two steps until convergence.

E step. For each task, given $\Theta^{(n-1)}$ from the $(n-1)$th iteration, we compute the posterior probability of its latent truth $\mu_i$ as:

$p(\mu_i \mid \Theta^{(n-1)}, D_{U_i}) \propto p(\mu_i)\, p(D_{U_i} \mid \Theta^{(n-1)}, \mu_i) = \mathcal{N}(\mu_i \mid u, v) \prod_{j \in U_i} \mathcal{N}(r_{i,j} \mid \mu_i + b_j, 1/\phi_{j,time(j,i)})$, (17)

where $D_{U_i}$ is the response set of task $i$. The posterior mean of $\mu_i$ at this iteration is computed as:

$\mu_i^{(n)} = \mu_N = \dfrac{u/v + \sum_{j \in U_i} \phi_{j,time(j,i)} (r_{i,j} - b_j)}{1/v + \sum_{j \in U_i} \phi_{j,time(j,i)}}$. (18)
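In code, the E-step update of Eq. (18) is a one-line precision-weighted average; a minimal sketch, assuming Eq. (18) takes the standard Gaussian-posterior form reconstructed above:

```python
import numpy as np

def e_step_truth(u, v, r, b, phi):
    """Posterior mean of the latent truth mu_i, Eq. (18).

    u, v : prior mean and variance of mu_i
    r    : answers r_{i,j} from the workers in U_i
    b    : those workers' biases b_j
    phi  : their precisions phi_{j,time(j,i)}
    """
    r, b, phi = np.asarray(r), np.asarray(b), np.asarray(phi)
    return (u / v + np.sum(phi * (r - b))) / (1.0 / v + np.sum(phi))
```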

M step. We re-estimate the model parameters given the expectation of the latent truth $\mu_i^{(n)}$ calculated in the E step, maximizing the posterior probability:

$\Theta^{(n)} = \arg\max_{\Theta} f(\Theta, \Theta^{(n-1)}) = Q(\Theta, \Theta^{(n-1)}) + \mathrm{const}$, (19)

where $Q(\Theta, \Theta^{(n-1)}) = \mathbb{E}[\ln p(D, \mu \mid \Theta)]$. For each worker, we use gradient ascent to obtain the optimal solution of $b_j \in \Theta^{(n)}$:

$b_j^* = \dfrac{\sum_{i \in Q_j} (r_{i,j} - \mu_i)}{|Q_j|}$. (20)

We then update each worker's precision at every time slot using the Kalman Filter and Smoother via (7), (10), and (13). First, we use $\mu_i^{(n)}$ from the E step together with $b_j^*$ to derive the observation vector, sorted in time order, according to (7). The obtained vector is then used to estimate the latent states of the LDS, i.e., the worker precision $\phi_{j,t}$ in each time slot. Next, we apply the Kalman Filter to forwardly derive the preliminary estimates $\phi_{t|t}$ from $t = 1$ to $T$ by recursively applying (10). To obtain the optimal estimates, we then use the Kalman Smoother to backwardly derive the final estimates $\phi_{t|N}$ with (13).

We iterate over the E and M steps until every parameter converges, obtaining the estimates of $\mu_i$, $b_j^*$, and $\phi_{j,t}$.
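Putting the pieces together, the following is a sketch of the full EM loop, reusing the `kalman_filter` and `kalman_smoother` sketches above; the data layout (a list of (task, worker, time slot, answer) tuples), the initialization, and the fixed iteration count are all assumptions, not the paper's specification.

```python
import numpy as np

def rid_em(answers, u, v, lam, gamma, n_iter=20):
    """EM loop sketch for the RID model.

    answers : list of (task_id, worker_id, time_slot, value) tuples
    Alternates the E step, Eq. (18), with the M step: the bias update,
    Eq. (20), followed by Kalman Filter/Smoother precision tracking.
    """
    tasks = sorted({a[0] for a in answers})
    workers = sorted({a[1] for a in answers})
    mu = {i: u for i in tasks}
    bias = {j: 0.0 for j in workers}
    phi = {(j, t): 1.0 for (_, j, t, _) in answers}  # precision per (worker, slot)

    for _ in range(n_iter):
        # E step: posterior mean of each latent truth, Eq. (18)
        for i in tasks:
            rows = [a for a in answers if a[0] == i]
            num = u / v + sum(phi[(j, t)] * (r - bias[j]) for _, j, t, r in rows)
            den = 1.0 / v + sum(phi[(j, t)] for _, j, t, _ in rows)
            mu[i] = num / den
        # M step: bias, Eq. (20), then precision via Eqs. (7), (10), (13)
        for j in workers:
            rows = sorted((a for a in answers if a[1] == j), key=lambda a: a[2])
            bias[j] = np.mean([r - mu[i] for i, _, _, r in rows])
            obs = np.array([np.sqrt(2.0 / np.pi) / max(abs(r - mu[i] - bias[j]), 1e-6)
                            for i, _, _, r in rows])   # emission, Eq. (7)
            f, Pf, Pp = kalman_filter(obs, phi0=1.0, P0=1.0, lam=lam, gamma=gamma)
            s, _ = kalman_smoother(f, Pf, Pp)
            for (_, _, t, _), val in zip(rows, s):
                phi[(j, t)] = val
    return mu, bias, phi
```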

4 DYNAMIC WORKER FILTERING

In crowdsourcing, requesters need to provide workers with a certain amount of reward, and the more workers are hired for a task, the more cost it incurs. Meanwhile, more workers can help improve the quality of crowdsourcing results, because the probability of including high-ability workers increases as more workers participate. Therefore, it is widely believed that there is a tradeoff between cost and result quality. However, when many low-ability workers are hired, the increased cost does not benefit the result quality at all. In our RID model, we can use the Kalman Filter and Smoother to dynamically compute worker precision, which brings an opportunity to filter out the low-ability workers. It is thus possible to obtain high-quality aggregation results at lower cost.

As worker precision often changes, we periodically track each worker's performance on her latest $L$ tasks and dismiss disqualified workers. We define the following metric to evaluate a worker's recent performance:

$L\text{-}Precision(j) = \log \prod_{i=1}^{L} f(r_{i,j} \mid \mu_i + b_j, \phi_{j,time(j,i)})$, (21)

where $f(x \mid a, b)$ is the Gaussian PDF, defined as:

$f(x \mid a, b) = \dfrac{1}{b\sqrt{2\pi}}\, e^{-\frac{(x-a)^2}{2b^2}}$. (22)

This function measures the performance of worker $j$ on the latest $L$ tasks: the higher the value, the more the worker can contribute. We apply $L\text{-}Precision$ to develop our worker filtering method, RID-Filter, shown in Algorithm 1. We keep a pool of workers to be examined after every $L$ tasks. The $L\text{-}Precision$ value of each worker is calculated with (21), and workers whose value falls below a pre-defined threshold are filtered out.
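A sketch of the score and the threshold test; since Eq. (22) parameterizes $f$ by a standard deviation, we assume the precision passed in Eq. (21) enters as the deviation $1/\phi$, which is one reading of the paper's notation.

```python
import numpy as np

def l_precision(r, mu, b_j, phi, eps=1e-12):
    """Recent-performance score of worker j over her latest L tasks, Eq. (21).

    r, mu, phi : the worker's last L answers, the aggregated truths of those
                 tasks, and the worker's precision at the answering times.
    """
    r, mu, phi = np.asarray(r), np.asarray(mu), np.asarray(phi)
    sigma = 1.0 / phi                         # assumed: deviation = 1/phi
    pdf = np.exp(-((r - (mu + b_j)) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(np.log(pdf + eps)))   # log of the product, Eq. (21)

def keep_workers(scores, threshold):
    """Workers whose L-Precision stays above the pre-defined threshold."""
    return {j for j, s in scores.items() if s >= threshold}
```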


Algorithm 1 RID-Filter

Input: Crowdsourced quantity estimates $r_{i,j}$
Output: Quantitative truth $\mu_i$

1: Initialize $u$, $v$, $\mu_i$, and $b_j$
2: for every $L$ completed tasks do
3:   {EM algorithm}
4:   while RID has not converged do
5:     {E step}
6:     for each question $i$ do
7:       Calculate $\mu_i^{(n)}$ according to (18)
8:     end for
9:     {M step}
10:    for each worker $j$ do
11:      Calculate $b_j^*$ according to (20)
12:      for each time slot $t$ do
13:        Calculate $\phi_{t|N}$ according to (13)
14:      end for
15:    end for
16:    $n \Leftarrow n + 1$
17:  end while
18:  Calculate the $L\text{-}Precision$ of each worker by (21)
19:  Filter out disqualified workers with the pre-defined threshold
20: end for
21: return the aggregated truth

TABLE 1: Overview of the crowdsourcing tasks.

Dataset                         Mall   Road
Total task number               400    400
Avg. num. of people per image   31     28
Task redundancy                 8      8
Minimum tasks per worker        15     15
Total participating workers     130    156
Total answers                   3200   3200

5 EXPERIMENTAL EVALUATION

5.1 Datasets and Crowdsourcing Tasks

We used two public datasets [20], each containing 400 pictures taken by cameras installed in shopping malls and on roads, respectively. The pictures are frames extracted from recorded video. We shuffled all the pictures thoroughly so as to eliminate the apparent timing correlations among them, and published them in batches to CrowdFlower1, a well-known crowdsourcing platform. Online workers were asked to report their estimates of the number of people in each picture and received a monetary reward for each completed task. In total we recruited 286 workers. The platform distributed pictures to workers sequentially, and each picture was processed redundantly by at least 8 workers. In the end, 6,400 reports were received, as shown in Table 1.

5.2 Compared Approaches

We compared the performance of the RID model with five other methods for aggregating quantitative estimates. As object-counting results are integer values, our model can only approximately characterize the answer generation process, which is nonetheless sufficient for showing the effectiveness of RID. Note that we did not compare with methods used for categorical tasks, because the result aggregation mechanism is completely different, and applying such methods (e.g., majority voting) to our tasks would be meaningless.

1. www.crowdflower.com

[Figure: RMSE vs. number of workers for the six methods; (a) Road dataset, (b) Mall dataset.]
Fig. 4: Performance of RID.

(1) Average. The mean of the collected values is taken as the final aggregated result, which can be greatly affected by outliers.

(2) Median. The median of the collected quantitative answers is taken as the output, which is expected to be robust against outliers.

(3) LOF (Local Outlier Factor) [21]. This is a fusion algorithm based on outlier detection. It first identifies and rules out unreliable reports using the LOF score, then computes the result by averaging the remaining reported values. The LOF score determines whether a reported value is an outlier by comparing the local density of an object with that of its neighbours; the nearest-neighbour parameter $k$ was set to 3.

(4) Max-Trust [22]. This is an extension of Covariance Intersection (CI) [23]; it treats the trustworthiness of each worker as a parameter in a Gaussian likelihood model and jointly outputs the fused quantitative truth and the workers' trustworthiness. However, it does not incorporate the dynamic change of worker bias and precision.

(5) RIS Model. To demonstrate the benefits of considering the change of worker ability, we implemented the RIS (Result Inference with Static worker ability) model, which differs from RID only in that it does not consider the dynamic change of worker ability. The inference method of RIS is the same as that of RID, except that in the M step we update $\phi_j$ by:

$\phi_j = \dfrac{|Q_j|/2 + \alpha - 1}{\sum_{i \in Q_j} (r_{i,j} - \mu_i - b_j)^2/2 + \beta}$, (23)

where $\phi_j$ is modelled with a Gamma distribution with parameters $\alpha$ and $\beta$. We set $\alpha = 1$ and $\beta = 1$, meaning that we have no prior knowledge of $\phi_j$.
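For reference, Eq. (23) transcribes directly into code; the residuals run over worker $j$'s answered tasks $Q_j$:

```python
import numpy as np

def ris_precision(r, mu, b_j, alpha=1.0, beta=1.0):
    """Static precision update of the RIS baseline, Eq. (23).

    r, mu : worker j's answers and the aggregated truths of those tasks
    alpha, beta : Gamma-prior hyperparameters (1 and 1 mean no prior knowledge)
    """
    resid = np.asarray(r) - np.asarray(mu) - b_j
    return (len(resid) / 2.0 + alpha - 1.0) / (np.sum(resid ** 2) / 2.0 + beta)
```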

5.3 Evaluation Metrics

We mainly used the root mean square error (RMSE) to measure the accuracy of the predicted quantitative estimates, defined as follows:

$\mathrm{RMSE} = \sqrt{\dfrac{1}{N} \sum_{i} (\hat{\mu}_i - \mu_i)^2}$, (24)

where $\hat{\mu}_i$ represents the estimated result of task $i$ and $\mu_i$ is the ground truth for task $i$. Note that the ground truth was provided with the original dataset.
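Eq. (24) in code, for completeness:

```python
import numpy as np

def rmse(estimates, truths):
    """Root mean square error between aggregated results and ground truth, Eq. (24)."""
    e = np.asarray(estimates) - np.asarray(truths)
    return float(np.sqrt(np.mean(e ** 2)))
```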

5.4 Experimental Results

Result aggregation w/o worker filtering. We first compared the RID model with the five other approaches without filtering out low-ability workers. We ran all the approaches and computed the RMSE values.


[Figure: RMSE of RID-Filter vs. number of filtered-out workers; (a) Road dataset, (b) Mall dataset.]
Fig. 5: Performance of RID-Filter.

[Figure: $1/\phi$ vs. task sequence for the top five workers (w1-w5); (a) Road dataset, (b) Mall dataset.]
Fig. 6: Tracking of worker ability ($1/\phi_j$).

The results are shown in Fig. 4, which illustrates how the RMSE of the six methods changes as the number of workers decreases. Each time we randomly removed the answers of 10 workers, while the total number of tasks was fixed at 400 throughout the experiment. Fig. 4(a) and (b) show the results on the two datasets respectively. The RIS model, which takes worker bias into consideration, attains 28.5% and 34.1% lower RMSE on average than the Max-Trust method on the two datasets respectively; Max-Trust considers only workers' trustworthiness, not their bias. This further confirms the existence of worker bias, and shows that RIS is able to learn the bias and reduce its impact on the aggregated result. Median, though simple, is quite competitive and outperforms both LOF and Max-Trust. LOF performs worse than Max-Trust, which indicates that merely removing outliers is not enough for discovering the truth. Not surprisingly, Average performs the worst. Our RID model performs the best among the six methods, outperforming the RIS model by 10.5% and 9.7% on average on the two datasets respectively. In fact, the advantage of RID over RIS is subject to several factors such as the total working time and task difficulty, for which we present another set of simulated results later.

We can also observe that as the number of workers decreases, the RMSE of Max-Trust and LOF increases dramatically. As shown in Fig. 4(a), when the number of workers decreases from 116 to 96, the RMSE of LOF and Max-Trust increases by 24% and 25% respectively, while the performance of the RID and RIS models is relatively more stable, their RMSE increasing by only 8% and 12% respectively. This illustrates the robustness of our approach.

Result aggregation with worker filtering. Next we evaluated the performance of our RID-Filter algorithm, which is designed to obtain more accurate aggregation results at lower cost by dynamically filtering out low-quality workers. Since CrowdFlower does not support filtering specific workers during a crowdsourcing process, we could not run RID-Filter directly on CrowdFlower. Instead, we ran our algorithm on the collected data and simulated worker filtering by ruling out the answers provided by the workers that would have been filtered. RID-Filter can be easily deployed once crowdsourcing platforms such as AMT or CrowdFlower support the filtering of specific workers.

[Figure: deviation and precision ($1/\phi$) vs. task sequence for four workers; (a) Worker 1, (b) Worker 2, (c) Worker 3, (d) Worker 4.]
Fig. 7: Tracking of the deviation and ability of workers.

Fig. 5 shows the RMSE values of RID-Filter as different numbers of the lowest-quality workers are filtered out. We also plot the results of RID-R-Filter as a baseline, which randomly filters out the same number of workers as RID-Filter and applies RID to the remaining answers. The horizontal dashed line shows the RMSE value without filtering out any worker. RID-Filter consistently outperforms RID-R-Filter, because random filtering can rule out high-quality workers. In comparison with RID, when fewer than 40 of the lowest-quality workers are filtered out, RID-Filter outperforms RID on both datasets. Moreover, RID-Filter performs best when 30 and 20 of the lowest-quality workers are filtered out on the two datasets respectively. This indicates that it is not always worthwhile to filter out many workers, because the wisdom-of-the-crowd effect is weakened when few workers remain. Furthermore, we can conclude that RID-Filter obtains the same result quality as RID, but at lower cost.

Tracking of worker ability. Fig. 6 plots the reciprocal of the precision (i.e., $1/\phi_j$) exhibited by the top five workers who completed the most tasks on each of the two datasets. A higher value indicates lower worker ability. Three of them completed 60 tasks, while the other seven completed 45 tasks. Throughout the task processing, worker ability keeps changing. Moreover, the ten workers show different patterns of ability change, which demonstrates that assuming a particular learning curve or changing pattern may not be correct in practice. Furthermore, Fig. 7 presents the deviation and precision exhibited across tasks. Due to space limitations, we only show the four workers who completed 45 tasks on the Mall dataset.

Model efficiency. Next we studied the efficiency of our approach. All the algorithms were run on a laptop with a 2.7 GHz CPU and 4 GB RAM. We conducted these experiments on both datasets; due to space limits, we only show the results on the Mall dataset. Fig. 8(a) shows that our approach converges after 5 iterations, similar to RIS. Because the likelihood values of RIS are much smaller than those of RID, we plotted them with a scale factor of 19 for better visual effect. Fig. 8(b) plots the running time of the different approaches under different numbers of workers. Unsurprisingly, the RID model is the most time-consuming, owing to the dynamic computation of worker ability. We can also observe that the time cost increases linearly with the number of workers, which is consistent with the complexity of the inference algorithm and indicates that our approach can be applied to large-scale crowdsourcing tasks.


[Figure: (a) convergence rate; (b) running time vs. number of workers.]
Fig. 8: Model efficiency.

[Figure: RMSE of RID vs. RIS under (a) varying workload and (b) varying $\lambda$.]
Fig. 9: How RID outperforms RIS.

5.5 Understanding the Performance Gap Between RID and RIS

Our experimental results have demonstrated that RID outperforms RIS thanks to the tracking of worker precision drift. We now further study the underlying factors that affect the performance gain of RID over RIS. Specifically, we conducted simulations to examine two important factors: the workload of a worker, and $\lambda$ in Equation (4), which controls the variation of the precision drift. Both factors influence the extent of the change of worker ability. Intuitively, the performance gap between RID and RIS should increase when the worker ability changes more.

We generated 200 workers whose bias and initial precision were randomly assigned. Each simulated task, with an average truth value of 20, was assigned to 8 workers. We employed RID and RIS respectively to infer the truth values and computed the corresponding RMSE.

First, we investigated the impact of a worker's workload. Intuitively, the more tasks a worker processes, the more chances there are for her ability to change, and thus the more performance gain RID should achieve. The parameter $\lambda$ was set to 1. The answer to task $i$ from worker $j$ at time $t$ was simulated as:

$Ans_{j,t} = \mathrm{sample}(\mathcal{N}(\mu_i + b_j, 1/\phi_{j,t}))$. (25)

All workers were assigned the same number of tasks, varying from 30 to 130. For each workload setting, we conducted 10 simulations and reported the average, so as to eliminate the impact of randomness. The results are shown in Fig. 9(a): as the workload increases, the RMSE of RIS degrades quickly, while RID remains relatively stable.
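A sketch of this simulation set-up; the bias and initial-precision distributions below are our own assumptions, since the paper only states that they were randomly assigned.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_answers(n_workers=200, n_tasks=30, redundancy=8, lam=1.0):
    """Generate simulated worker answers per Eqs. (3) and (25).

    Workers receive a random bias and initial precision (assumed spreads);
    each task, with truth averaging 20, is answered by `redundancy` workers,
    with precision drifting between answers as in Eq. (3).
    """
    bias = rng.normal(0.0, 2.0, n_workers)             # assumed bias spread
    phi = np.abs(rng.normal(1.0, 0.3, n_workers))      # assumed initial precision
    records = []
    for i in range(n_tasks):
        mu_i = rng.normal(20.0, 3.0)                   # truth values average 20
        for j in rng.choice(n_workers, redundancy, replace=False):
            phi[j] = max(phi[j] + rng.normal(0.0, lam), 1e-3)       # Eq. (3)
            r = rng.normal(mu_i + bias[j], np.sqrt(1.0 / phi[j]))   # Eq. (25)
            records.append((i, j, r, mu_i))
    return records
```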

Second, since $\lambda$ controls the variance of the Gaussian distribution governing the precision drift, a larger $\lambda$ implies that worker precision is subject to greater change. Fig. 9(b) shows that the performance gap between RID and RIS grows as $\lambda$ increases.

6 DISCUSSION

We discuss some limitations and possible extensions of this work.

Extension to categorical crowdsourcing tasks. In this work, the RID model demonstrated higher accuracy and robustness than existing methods in real-world experiments, and RID-Filter largely decreased cost by filtering out low-quality workers while maintaining accuracy. However, the RID model is designed for aggregating quantitative crowdsourcing tasks and cannot be directly used to solve categorical crowdsourcing tasks. The reason is that in the RID model, workers' responses are modelled with a Gaussian distribution conditioned on the truth and the workers' bias and precision, whereas for categorical tasks the answers are discrete values that cannot be modelled with a Gaussian distribution; moreover, for most kinds of categorical tasks, bias is not an appropriate worker property. To extend the current work to support dynamic aggregation of categorical crowdsourcing tasks, we can still use an LDS to model workers' changing accuracy, while leveraging other models (e.g., the BCC model [24]) to model the response-generating process.

Improvement of the RID-Filter algorithm. In the RID-Filter algorithm, once a worker is identified as disqualified, she is no longer considered for later tasks. As worker ability is dynamically affected by various factors, it may improve after a certain period of time; in this sense, filtered workers may need to be reconsidered for future tasks. However, it is non-trivial to determine when it is appropriate to reconsider eliminated workers. If a worker's ability changes frequently and drastically, the worker is highly unreliable, and reconsidering her is not helpful; in that case, the LDS model employed in this work may fail to function properly, because noise dominates the system. To answer whether it is helpful to reconsider a filtered worker, we need to improve our dynamic worker ability model, which we leave to future work.

7 RELATED WORK

In recent years, the significance of crowdsourcing has been witnessed through its successful application to many complex problems, e.g., text processing, image classification, voice recognition, video annotation, etc. The core problem in crowdsourcing lies in how to obtain high-quality results given possibly noisy answers from unknown workers.

Most existing works mainly concern categorical crowdsourcing problems, in which workers are expected to provide accurate classification labels for certain objects. In [11], the authors discuss several typical models for obtaining consensus in categorical classification problems, such as majority voting, ZenCrowd [25], DS & Naive Bayes [26], and GLAD [12]. In recent years, targeting specific problems, new approaches [27], [28] considering more detailed factors and domain-specific features have been proposed to get the best of the wisdom of the crowd. However, these approaches cannot be directly used to solve the quantitative truth-finding problem addressed in this paper.

Nowadays quantitative crowdsourcing is receiving more attention in many areas, including citizen science projects [29], databases [30], big data analysis [14], crowd sensing [15], peer grading [6] in MOOCs, and so on. Some works simply use average- or median-based approaches to aggregate results, while a few more advanced models have been designed. For instance, Local Outlier Factor (LOF) [21] identifies unreliable reports and removes these outliers, then fuses the remaining data items by averaging; Max-Trust [22] merges workers' responses by learning the precision of each participant; [6], [14] further introduce more parameters, including worker bias, precision, and task difficulty, into their models.

Recently, some crowdsourcing research efforts have begun to focus on the characteristics of crowd workers. For instance, Zhang et al. [31] studied the online crowdsourcing communities formed around a real incident in China, revealing the typical structure and role characteristics of crowd workers. However, no studies have considered the dynamics of worker ability in quantitative crowdsourcing, and there are only a few efforts investigating the dynamic change of worker ability in categorical crowdsourcing. [7] proposed a method to track the change of workers' accuracy with a particle filter, selectively querying the most reliable workers to guarantee result quality. Furthermore, [8] improves the model of [7] by considering an offset variable and a decision-reject option strategy for binary classification problems. [9] proposed a dynamic Bayesian model to simultaneously infer the categorical truth in citizen science problems. [10] considers the improvement of worker ability by assuming a learning-curve model, where worker skills improve gradually as more tasks are processed. However, these methods cannot be used in quantitative truth-finding scenarios, because their probabilistic modelling assumptions are designed for discrete labels. To the best of our knowledge, our work is the first attempt to dynamically track worker ability and incorporate it into quantitative crowdsourcing problems.

8 CONCLUSION

Quantitative crowdsourcing is an important category of crowdsourcing tasks, but it has not yet received enough attention. In this work, we are mainly concerned with building a model to achieve better aggregation results for quantitative crowdsourcing tasks. In particular, we consider the dynamic change of worker ability, which has a prominent influence on the final result. We design a graphical model to characterize the quantitative crowdsourcing process, in which worker bias, worker ability, and the latent truth are considered. Worker ability is further modelled as a linear dynamic system, and we use the Kalman Filter and Smoother to compute it dynamically in the process of crowdsourcing. We then design an EM-based algorithm to infer the latent truth of a quantitative task, and further provide a worker filtering algorithm to rule out low-ability workers so as to ensure the quality of results at lower cost. Finally, we evaluated our approach with counting tasks on two public datasets via a public crowdsourcing platform.

Future work points to three directions. First, we will further study how different factors affect workers' performance so as to better understand the change of worker ability. Second, on the basis of the improved worker-ability model, we aim to improve the worker filtering algorithm in terms of determining when to eliminate a worker and when to reconsider an eliminated worker if necessary. Third, beyond quantitative crowdsourcing tasks, we will adapt our methods to solve general crowdsourcing tasks.

ACKNOWLEDGMENTS

This work was supported in part by the China 973 program (No. 2015CB358700 and No. 2014CB340304), in part by the National Key Research and Development Program of China (2016YFB1000804), and in part by the NSFC program (61421003).

REFERENCES

[1] A. Doan, R. Ramakrishnan, and A. Y. Halevy, "Crowdsourcing systems on the world-wide web," Commun. ACM, vol. 54, no. 4, pp. 86–96, Apr. 2011.

[2] T. Liu, Y. Zhu, Q. Zhang, and A. V. Vasilakos, "Stochastic optimal control for participatory sensing systems with heterogenous requests," IEEE Trans. Computers, vol. 65, no. 5, pp. 1619–1631, 2016.

[3] H. Wu, H. Sun, Y. Fang, K. Hu, Y. Xie, Y. Song, and X. Liu, "Combining machine learning and crowdsourcing for better understanding commodity reviews," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 4220–4221.

[4] J. Aslam, S. Lim, X. Pan, and D. Rus, "City-scale traffic estimation from a roving sensor network," in Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems, 2012, pp. 141–154.

[5] H. Wang, D. Lymberopoulos, and J. Liu, "Local business ambience characterization through mobile audio sensing," in Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 293–304.

[6] C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller, "Tuned models of peer assessment in MOOCs," in Proceedings of the 6th International Conference on Educational Data Mining, 2013.

[7] P. Donmez, J. G. Carbonell, and J. G. Schneider, "A probabilistic framework to learn from multiple annotators with time-varying accuracy," in SIAM International Conference on Data Mining, 2010.

[8] H. J. Jung, Y. Park, and M. Lease, "Predicting next label quality: A time-series model of crowdwork," in Proceedings of the Second AAAI Conference on Human Computation and Crowdsourcing, 2014.

[9] E. Simpson, S. Roberts, I. Psorakis, and A. Smith, "Dynamic Bayesian combination of multiple imperfect classifiers," in Decision Making and Imperfection. Springer, 2013, pp. 1–35.

[10] S. Pan, K. Larson, J. Bradshaw, and E. Law, "Dynamic task allocation algorithm for hiring workers that learn," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016, pp. 3825–3831.

[11] A. Sheshadri and M. Lease, "SQUARE: A benchmark for research on computing crowd consensus," in First AAAI Conference on Human Computation and Crowdsourcing, 2013.

[12] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. R. Movellan, "Whose vote should count more: Optimal integration of labels from labelers of unknown expertise," Advances in Neural Information Processing Systems, pp. 2035–2043, 2009.

[13] T. Han, H. Sun, Y. Song, Y. Fang, and X. Liu, "Incorporating external knowledge into crowd intelligence for more specific knowledge acquisition," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 1541–1547.

[14] R. W. Ouyang, L. Kaplan, P. Martin, A. Toniolo, M. Srivastava, and T. J. Norman, "Debiasing crowdsourced quantitative characteristics in local businesses and services," in Proceedings of the 14th International Conference on Information Processing in Sensor Networks. ACM, 2015, pp. 190–201.

[15] M. Venanzi, W. T. L. Teacy, A. Rogers, and N. R. Jennings, "Bayesian modelling of community-based multidimensional trust in participatory sensing under data sparsity," in Proceedings of the 24th International Conference on Artificial Intelligence, ser. IJCAI'15, 2015, pp. 717–724.

[16] R. E. Kalman, "Mathematical description of linear dynamical systems," Journal of the Society for Industrial and Applied Mathematics, vol. 1, no. 2, pp. 152–192, 1963.

[17] G. Welch and G. Bishop, "An introduction to the Kalman Filter," University of North Carolina at Chapel Hill, Tech. Rep., 1995.

[18] M. Shuster, "A simple Kalman Filter and Smoother for spacecraft attitude," Journal of the Astronautical Sciences, 1989, pp. 89–106.

[19] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.

[20] A. B. Chan, M. Morrow, and N. Vasconcelos, "Analysis of crowded scenes using holistic properties," in Performance Evaluation of Tracking and Surveillance Workshop at CVPR, 2009, pp. 101–108.

[21] M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, vol. 29, no. 2, pp. 93–104, 2000.

[22] M. Venanzi, A. Rogers, and N. R. Jennings, "Trust-based fusion of untrustworthy information in crowdsourcing applications," in Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, 2013, pp. 829–836.

[23] S. J. Julier and J. K. Uhlmann, "A non-divergent estimation algorithm in the presence of unknown correlations," in Proceedings of the American Control Conference, 1997.

[24] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi, "Community-based Bayesian aggregation models for crowdsourcing," in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 155–164.

[25] G. Demartini, D. E. Difallah, and P. Cudre-Mauroux, "ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking," in Proceedings of the 21st International Conference on World Wide Web, ser. WWW '12, 2012, pp. 469–478.

[26] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, pp. 20–28, 1979.

[27] Y. Sun, A. Singla, D. Fox, and A. Krause, "Building hierarchies of concepts via crowdsourcing," in Proceedings of the 24th International Conference on Artificial Intelligence, 2015, pp. 844–851.

[28] E. D. Simpson, M. Venanzi, S. Reece, P. Kohli, J. Guiver, S. J. Roberts, and N. R. Jennings, "Language understanding in the wild: Combining crowdsourcing and machine learning," in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 992–1002.

[29] A. M. Smith, S. Lynn, and C. J. Lintott, "An introduction to the Zooniverse," in First AAAI Conference on Human Computation and Crowdsourcing, 2013.

[30] G. Li, J. Wang, Y. Zheng, and M. J. Franklin, "Crowdsourced data management: A survey," IEEE Trans. Knowl. Data Eng., vol. 28, no. 9, pp. 2296–2319, 2016.

[31] Q. Zhang, D. D. Zeng, F. Y. Wang, R. Breiger, and J. A. Hendler, "Brokers or bridges? Exploring structural holes in a crowdsourcing system," Computer, vol. 49, no. 6, pp. 56–64, June 2016.

Hailong Sun received the BS degree in computer science from Beijing Jiaotong University in 2001 and the PhD degree in computer software and theory from Beihang University in 2008. He is an Associate Professor in the School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests include software development, crowd computing/crowdsourcing and distributed computing. He is a member of the ACM and the IEEE.

Kefan Hu received the BS degree from HeFei University of Technology in 2014. He is currently a Master's student in the School of Computer Science and Engineering, Beihang University, Beijing, China. His main research interest is crowdsourcing.

Yili Fang is currently a PhD candidate in the School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests mainly include crowd computing/crowdsourcing, social computing and decision science.

Yangqiu Song received the BEng and PhD degrees from the Department of Automation, Tsinghua University, China, in Jul. 2003 and Jan. 2009, respectively. He joined the Department of CSE at HKUST as an assistant professor in Jul. 2016. Before that, he was an assistant professor at the Lane Department of CSEE at WVU (2015-2016); a post-doc researcher at UIUC (2013-2015); a post-doc researcher at HKUST and visiting researcher at Huawei Noah's Ark Lab, Hong Kong (2012-2013); an associate researcher at Microsoft Research Asia (2010-2012); and a staff researcher at IBM Research-China (2009-2010). His research interests include machine learning algorithms with applications to knowledge engineering, information retrieval, and visualization. He is a member of the IEEE.