Creating a Data Collection for Evaluating Rich Speech Retrieval
Maria Eskevich (1), Gareth J.F. Jones (1), Martha Larson (2), Roeland Ordelman (3)
(1) Centre for Digital Video Processing, Centre for Next Generation Localisation, School of Computing, Dublin City University, Dublin, Ireland
(2) Delft University of Technology, Delft, The Netherlands
(3) University of Twente, The Netherlands

Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)


DESCRIPTION

We describe the development of a test collection for the investigation of speech retrieval beyond the identification of relevant content. This collection focuses on satisfying user information needs for queries associated with specific types of speech acts. The collection is based on an archive of Internet video from the video sharing platform blip.tv and was provided by the MediaEval benchmarking initiative. A crowdsourcing approach is used to identify segments in the video data which contain speech acts, to create a description of the video containing the act, and to generate search queries designed to refind this speech act. We describe and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform. We highlight the challenges of constructing this dataset, including the selection of the data source, the design of the crowdsourcing task, and the specification of queries and relevant items.


Creating a Data Collection for Evaluating Rich Speech Retrieval
Maria Eskevich and Gareth J.F. Jones (Dublin City University), Martha Larson (Delft University of Technology), Roeland Ordelman (University of Twente)

Outline
- MediaEval benchmark
- MediaEval 2011 Rich Speech Retrieval Task
- What is crowdsourcing?
- Crowdsourcing in the development of speech and language resources
- Development of an effective crowdsourcing task
- Comments on results
- Conclusion
- Future work: Brave New Task at MediaEval 2012

MediaEval
- The Multimedia Evaluation benchmarking initiative
- Evaluates new algorithms for multimedia access and retrieval
- Emphasizes the "multi" in multimedia: speech, audio, visual content, tags, users, context
- Innovates new tasks and techniques focusing on the human and social aspects of multimedia content

MediaEval 2011 Rich Speech Retrieval (RSR) Task
- Task goal: the information to be found is a combination of the required audio and visual content and the speaker's intention
- Two segments may have identical transcripts (Transcript 1 = Transcript 2) yet carry different meanings (Meaning 1 ≠ Meaning 2), because they correspond to different speech acts (Speech act 1 ≠ Speech act 2)
- Conventional retrieval matches only the transcripts; extended speech retrieval must also distinguish the speech acts
MediaEval 2011 Rich Speech Retrieval (RSR) Task: the ME10WWW dataset
- Videos from the Internet video sharing platform blip.tv (1974 episodes, 350 hours)
- Automatic Speech Recognition (ASR) transcripts provided by LIMSI and Vocapia Research
- No queries and relevant items -> collect user-generated queries and user-generated relevant items for the retrieval experiment -> collect via crowdsourcing

What is crowdsourcing?
- Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process.
- Factors to take into account: a sufficient number of workers, the level of payment, clear instructions, and possible cheating

Crowdsourcing in the development of speech and language resources
- Crowdsourcing suits simple/straightforward natural language processing tasks: work by non-expert crowdsource workers is of a similar standard to that performed by experts for translation and translation assessment, transcription of the native language, word sense disambiguation, and temporal annotation [Snow et al., 2008]
- Research question at the collection creation stage: can untrained crowdsource workers undertake extended tasks which require them to be creative?

Crowdsourcing with Amazon Mechanical Turk
- A task is referred to as a Human Intelligence Task, or HIT
- Crowdsourcing procedure: HIT initiation (the requester uploads a HIT), work (workers carry out the HIT), and review (the requester reviews the completed work and confirms payment to the worker at the previously set rate; the requester also has the option of paying more, as a bonus)
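As a concrete illustration of the review step above, a requester can script approval and bonus payment through the MTurk API. This is only a minimal sketch using today's boto3 client, not the tooling used for the original collection; the HIT ID, feedback text, and bonus amount are placeholder values.

```python
import boto3

# MTurk lives in us-east-1; a sandbox endpoint would be used for testing.
mturk = boto3.client("mturk", region_name="us-east-1")

def review_hit(hit_id, bonus_amount=None, bonus_reason="Well-formed speech act annotation"):
    """Approve all submitted assignments for a HIT and optionally pay a bonus."""
    response = mturk.list_assignments_for_hit(HITId=hit_id, AssignmentStatuses=["Submitted"])
    for assignment in response["Assignments"]:
        # Review step: confirm the previously set payment.
        mturk.approve_assignment(
            AssignmentId=assignment["AssignmentId"],
            RequesterFeedback="Thanks for completing the task.",
        )
        # Optional extra payment (the bonus mentioned above).
        if bonus_amount:
            mturk.send_bonus(
                WorkerId=assignment["WorkerId"],
                BonusAmount=bonus_amount,  # a string, e.g. "0.10"
                AssignmentId=assignment["AssignmentId"],
                Reason=bonus_reason,
            )

review_hit("EXAMPLE_HIT_ID", bonus_amount="0.10")  # hypothetical HIT ID
```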
Information expected from the worker to create a test collection for the RSR Task
- Speech act type: expressives (apology, opinion), assertives (definition), directives (warning), commissives (promise)
- Time of the labeled speech act: beginning and end
- An accurate transcript of the labeled speech act
- Queries to refind this speech act: a full-sentence query and a short web-style query
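Taken together, these fields amount to one record per accepted HIT. The sketch below is a hypothetical representation of such a record with a basic sanity check; the field names and the validation rule are illustrative assumptions, not the collection's actual schema.

```python
from dataclasses import dataclass

SPEECH_ACT_TYPES = {"apology", "opinion", "definition", "warning", "promise"}

@dataclass
class SpeechActQuery:
    """One worker submission: a labeled speech act plus queries to refind it."""
    video_id: str
    speech_act: str      # one of SPEECH_ACT_TYPES
    start_sec: float     # beginning of the labeled segment
    end_sec: float       # end of the labeled segment
    transcript: str      # worker-provided accurate transcript of the segment
    sentence_query: str  # full-sentence query
    web_query: str       # short web-style query

    def is_plausible(self) -> bool:
        """Cheap structural check before any manual assessment."""
        return (
            self.speech_act in SPEECH_ACT_TYPES
            and 0 <= self.start_sec < self.end_sec
            and bool(self.transcript.strip())
            and bool(self.sentence_query.strip())
            and bool(self.web_query.strip())
        )
```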
Data management for Amazon MTurk
- ME10WWW videos vary in length -> starting points for longer videos, approximately 7 minutes apart, are calculated:

  Data set | Episodes | Starting points
  Dev      |      247 |             562
  Test     |     1727 |            3278
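The roughly 7-minute spacing of starting points can be derived directly from each episode's duration. The function below is one plausible way to compute it; the exact rule used to produce the 562 and 3278 starting points above is not spelled out in the slides, so treat this as an assumption.

```python
def starting_points(duration_sec: float, spacing_sec: float = 7 * 60) -> list[float]:
    """Return playback starting points roughly spacing_sec apart for one episode."""
    points, t = [], 0.0
    while t < duration_sec or not points:  # short episodes still get one point at 0
        points.append(t)
        t += spacing_sec
    return points

# Example: a 25-minute episode yields starting points at 0, 7, 14 and 21 minutes.
print(starting_points(25 * 60))  # [0.0, 420.0, 840.0, 1260.0]
```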
Crowdsourcing experiment
- Worker expectations: reward vs. work, and the per-hour rate
- Pilot HIT: pilot wording, paid $0.11 plus a bonus per speech act type
- Workers' feedback: the reward is not worth the work; the task is too complicated
- Revised HIT: reworded instructions, added examples, paid $0.19 plus a bonus (0-21 $), workers suggest the bonus size (and we mention that we are a non-profit organization)
- Outcome: the reward is worth the work, the task is comprehensible, and workers are not greedy

HIT example
- Pilot: "Please watch the video and find a short portion of the video (a segment) that contains an interesting quote. The quote must fall into one of these six categories ..."
- Revised: "Imagine that you are watching videos on YouTube. When you come across something interesting you might want to share it on Facebook, Twitter or your favorite social network. Now please watch this video and search for an interesting video segment that you would like to share with others because it is (an apology, a definition, an opinion, a promise, a warning)."

Results
- Number of collected queries per speech act (shown as a chart in the slides)
- Prices: dev set, $40 per 30 queries; test set, $80 per 50 queries

Results assessment
- Number of accepted HITs = number of collected queries
- No overlap of workers between the dev and test sets
- Creative work invites creative cheating: copying and pasting the provided examples (-> examples should be pictures, not texts) and choosing the option "no speech act found in the video" (-> manual assessment by the requester is needed)
- Workers rarely find noteworthy content later than the third minute after the starting point in the video
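Two of the checks above lend themselves to simple scripting: verifying that no worker appears in both the dev and test sets, and flagging submissions that merely paste the provided examples. The sketch below is illustrative only; the field names and the matching rule are assumptions, and the manual assessment mentioned above is still required.

```python
def worker_overlap(dev_submissions, test_submissions):
    """Return worker IDs that appear in both the dev and the test set."""
    dev_workers = {s["worker_id"] for s in dev_submissions}
    test_workers = {s["worker_id"] for s in test_submissions}
    return dev_workers & test_workers

def copies_of_examples(submissions, example_texts, min_len=20):
    """Flag submissions whose transcript or queries were pasted from the HIT's examples."""
    examples = [e.lower() for e in example_texts]
    flagged = []
    for s in submissions:
        fields = (s["transcript"], s["sentence_query"], s["web_query"])
        if any(e in f.lower() for f in fields for e in examples if len(e) >= min_len):
            flagged.append(s)
    return flagged
```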
Conclusions
- It is possible to crowdsource extensive and complex tasks to support speech and language resources
- Use concepts and vocabulary familiar to the workers
- Pay attention to the technical issues of watching the video
- Preprocess the video into smaller segments
- Creative work demands a higher reward level, or simply a more flexible system
- There is a high level of wastage due to task complexity

MediaEval 2012 Brave New Task: Search and Hyperlinking
- Use scenario: a user is searching for a known segment in a video collection. Because the information in the segment might not be sufficient for the user's information need, he or she also wants links to other related video segments which may help to satisfy the information need related to this video.
- Sub-tasks:
  - Search: finding suitable video segments based on a short natural language query
  - Linking: defining links to other relevant video segments in the collection
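For the Search sub-task, the simplest conceivable baseline is lexical retrieval over the ASR transcripts of candidate segments. The sketch below uses TF-IDF with cosine similarity purely to illustrate the task setup; it is not the benchmark's official baseline, and the segment fields are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search_segments(query: str, segments: list[dict], top_k: int = 5):
    """Rank video segments by TF-IDF cosine similarity between the query and ASR text.

    Each segment is a dict like {"video_id": ..., "start": ..., "end": ..., "asr_text": ...}.
    """
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    matrix = vectorizer.fit_transform([s["asr_text"] for s in segments])
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix).ravel()
    ranked = sorted(zip(scores, segments), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]
```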
Thank you for your attention! Welcome to MediaEval 2012! http://multimediaeval.org