
Low-level and high-level models of perceptual compensation for reverberation

Amy V. Beeston and Guy J. Brown
Department of Computer Science, University of Sheffield, UK
{a.beeston, g.brown}@dcs.shef.ac.uk

Background
• Perceptual constancy allows us to compensate for our surroundings and overcome the distortions of naturally reverberant environments.
• Prior listening in reverberant rooms improves speech perception [1, 2].
• Compensation is disrupted when the reverberation applied to a test word and its preceding context is incongruous [1].
• Here we develop low-level and high-level computational models of perceptual compensation in speech identification tasks.

Low-level model
• The auditory efferent system has been implicated in controlling dynamic range [5].
• The mean-to-peak ratio (MPR) of the wideband speech envelope updates the attenuation (ATT) applied to the nonlinear pathway of the dual-resonance nonlinear (DRNL) filterbank [6]; a sketch follows the model diagrams below.
• This helps to recover dips in the temporal envelope, e.g., of a reverberated /t/.

High-level model
• Compensation for reverberation is viewed as an acoustic model selection process: analysis of the speech preceding TEST informs the selection of an appropriate acoustic model.
• Performance is optimal when the reverberation of the context and test word match.
• The wrong model is selected in mismatched CW/TEST reverberation conditions: confusions increase.

[Figure: high-level model. Stimulus → MFCC features → likelihood weighting → decoded phone sequence. Likelihoods from the near model, p(x(t)|λ_n), and the far model, p(x(t)|λ_f), are weighted according to the MPR of the envelope, which detects the reverberation conditions.]

[Figure: low-level model. Afferent path: stimulus → outer-middle ear → DRNL filterbank → hair cell → spectro-temporal excitation pattern → recogniser. Efferent path: the MPR, computed from the excitation pattern summed over frequency (Σ_f), sets the ATT applied to the DRNL.]
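
The efferent loop in the low-level diagram above can be summarised in a few lines. Below is a minimal sketch assuming frame-based processing of the wideband envelope and a linear MPR-to-attenuation mapping; the frame length, dB range and anchor MPR values are illustrative assumptions, not the published model parameters (see [6] for the DRNL itself).

```python
import numpy as np

def efferent_attenuation(env, frame_len=1024, att_max_db=30.0,
                         mpr_dry=0.19, mpr_far=0.32):
    """Sketch of the efferent feedback: the mean-to-peak ratio (MPR) of the
    wideband speech envelope sets the attenuation (ATT) applied to the
    nonlinear pathway of the DRNL filterbank. All constants here are
    illustrative assumptions rather than the published model parameters."""
    n_frames = len(env) // frame_len
    att_db = np.zeros(n_frames)
    for i in range(n_frames):
        frame = env[i * frame_len:(i + 1) * frame_len]
        mpr = frame.mean() / (frame.max() + 1e-12)  # mean-to-peak ratio
        # More reverberation -> filled envelope dips -> higher MPR ->
        # stronger attenuation of the DRNL nonlinear path.
        w = np.clip((mpr - mpr_dry) / (mpr_far - mpr_dry), 0.0, 1.0)
        att_db[i] = w * att_max_db
    return att_db  # per-frame ATT (dB) fed back to the DRNL nonlinear path
```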

• ASR features: 12 MFCCs + deltas + accelerations.
• Feature vectors for 'near' and 'far' reverberated utterances were concatenated for training, to provide matching state segmentation for the likelihood-weighting scheme.
• The feature vectors were subsequently split into separate 'near' and 'far' models for decoding.
• The combined near-far observation state likelihood is a weighted sum of the parallel likelihoods in the log domain:

log p(x(t)|λ_{n,f}) = α(t) log p(x(t)|λ_n) + (1 − α(t)) log p(x(t)|λ_f)

• α(t) is adjusted dynamically using a near/far classifier based on the MPR metric: α(t) → 0 if reverberant; α(t) → 1 if dry (see the sketch below).
• The model reproduces the main confusions evident in the human data (Φ² < 0.1).
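
The weighting rule itself is a one-liner; a minimal sketch follows. The linear mapping from MPR to α and its anchor values are our illustrative assumptions (the poster specifies only a near/far classifier based on the MPR metric).

```python
import numpy as np

def combined_log_likelihood(log_p_near, log_p_far, alpha):
    """log p(x(t)|lambda_{n,f}) = a(t) log p(x(t)|lambda_n)
                                + (1 - a(t)) log p(x(t)|lambda_f),
    applied per frame to the parallel near/far state likelihoods."""
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * np.asarray(log_p_near) + (1.0 - alpha) * np.asarray(log_p_far)

def alpha_from_mpr(mpr, mpr_dry=0.19, mpr_far=0.32):
    """Near/far weight from the MPR metric: alpha -> 1 for dry speech
    (low MPR), alpha -> 0 for reverberant speech (high MPR). The linear
    interpolation between the two anchor values is an assumption."""
    return float(np.clip((mpr_far - mpr) / (mpr_far - mpr_dry), 0.0, 1.0))
```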


Perceptual experiments
[Figure: listening room; source-receiver distance varied.]
• Watkins demonstrated a 'sir/stir' category boundary shift for reverberated test words in response to the reverberation distance of the preceding context [1].
• He imposed the temporal envelope of 'stir' on 'sir' to give the impression of a /t/ stop at one end of an 11-step interpolated continuum of test words.
• He recorded impulse responses (IRs) of a room (volume 183.6 m³) at 'near' (0.32 m) and 'far' (10 m) distances, and independently reverberated the test and context.
• The /t/ in a reverberated test word was more likely to be identified if the preceding context speech was similarly reverberated [1].

• We replicated and extended Watkins' findings using natural speech, 20 talkers (male and female), and a wider range of consonants (/p/, /t/, /k/) [3].
• 80 utterances of the form CW1 CW2 TEST CW3 were taken from the Articulation Index Corpus [4], each containing context words (CW) and the TEST word SIR, SKUR, SPUR or STIR.
• Utterances were low-pass filtered (8th-order Butterworth; see the filter sketch below) to assess frequency-dependent characteristics of compensation. Results are shown for the 4 kHz condition.
• CW and TEST were independently reverberated using Watkins' IRs [1] to give the impression of speech at different distances in a room, e.g., near CW - far TEST.
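
For reference, the band-limiting step can be reproduced with SciPy; a minimal sketch under the assumption of zero-phase filtering (the poster does not state the filter implementation):

```python
from scipy.signal import butter, sosfiltfilt

def lowpass(x, fs, cutoff_hz=4000.0, order=8):
    """8th-order Butterworth low-pass (here the 4 kHz condition), applied
    zero-phase; second-order sections keep the high-order filter stable."""
    sos = butter(order, cutoff_hz, btype='low', fs=fs, output='sos')
    return sosfiltfilt(sos, x)
```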

• Compensation for reverberation was observed: increased reverberation on TEST increased the confusion rate, but errors reduced when CW was similarly reverberated.
• Reverberation caused particular confusions: most errors at near-far were test words mistaken for 'sir'.
• Compensation reduced mistaken 'sir' responses at far-far, but confusions persisted between 'skur', 'spur' and 'stir'.
• Pooling responses as 'sir' vs 'not sir', comparably to Watkins' data:

              # sir   # not sir
    near-far    36       44
    far-far     19       61

  This gave a significant chi-squared value (Bonferroni corrected), χ² = 8.007, p = 0.023; a check of this value follows below.
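
The quoted statistic can be checked directly from the pooled table above (no Yates correction). Multiplying the raw p by a Bonferroni factor of 5 reproduces p = 0.023, though the number of comparisons is our inference, not stated on the poster.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[36, 44],    # near-far: # sir, # not sir
                  [19, 61]])   # far-far
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), p)  # 8.007, raw p ~= 0.0047 (x5 Bonferroni ~= 0.023)
```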

Human listener responses (rows: presented word; columns: response; N = 20 per row):

          |      near-near       |      near-far        |       far-far
          | sir skur spur stir   | sir skur spur stir   | sir skur spur stir
    sir   |  19   0    0    1    |  18   0    0    2    |  16   1    1    2
    skur  |   0  20    0    0    |   3  15    0    2    |   0  16    0    4
    spur  |   0   1   18    1    |   7   2   10    1    |   2   1   14    3
    stir  |   0   0    0   20    |   8   1    1   10    |   1   0    0   19

Low-level model results (sir-stir task)
• A good match to Watkins' sir-stir listener data was obtained using a simple template recogniser [7].
• Effect of reverberation on the test word: the category boundary shifts up (more 'sir's).
• Compensation (forward reverberation cases): the boundary shifts back (more 'stir's).
• Matching human listeners, compensation is abolished for reverse reverberation.

[Figure: category boundary (# sir) vs context distance (m), for test words at 0.32 m and 10 m; four panels: forward speech/forward reverberation, reverse speech/forward reverberation, reverse speech/reverse reverberation, forward speech/reverse reverberation; quantised model results compared with human results.]

Low-level (efferent) model confusions, with Pearson's Φ² against the corresponding human row:

          |       near-near         |        near-far         |        far-far
          | sir  k   p   t    Φ²    | sir  k   p   t    Φ²    | sir  k   p   t    Φ²
    sir   |  16  0   1   3  0.0564  |   5 12   0   3  0.4887  |  11  3   2   4  0.0731
    skur  |   0 13   1   6  0.2121  |   1 12   3   4  0.1250  |   3 12   1   4  0.1143
    spur  |   0  3  11   6  0.1565  |   1 14   5   0  0.4042  |   1 10   7   2  0.2558
    stir  |   0  1   2  17  0.0811  |   2  4   3  11  0.1612  |   5  5   1   9  0.3060

• Simplified low-level model, with the efferent circuit engaged (at fixed ATT) in 'far' context cases.
• ASR features: 13 DCT-transformed auditory features + deltas + accelerations.
• Pearson's phi-squared metric denotes the (here, lack of) similarity with the human results, comparing each row of the confusion matrices as a 2 x 4 contingency table [8]; a sketch follows below.
• For identical distributions, Φ² = 0; for non-overlapping distributions, Φ² = 1.
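
A minimal sketch of the metric, computed directly so that empty response categories (expected count 0) contribute nothing; that zero-cell handling is our assumption about the implementation in [8].

```python
import numpy as np

def phi_squared(row_a, row_b):
    """Pearson's phi-squared between two confusion-matrix rows, treated as
    a 2 x 4 contingency table: phi^2 = chi^2 / N, giving 0 for identical
    and 1 for non-overlapping response distributions."""
    table = np.array([row_a, row_b], dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    mask = expected > 0  # empty response categories contribute nothing
    chi2 = (((table - expected) ** 2)[mask] / expected[mask]).sum()
    return chi2 / n

# Human vs efferent-model 'sir' row in the far-far condition (tables above):
print(round(phi_squared([16, 1, 1, 2], [11, 3, 2, 4]), 4))  # 0.0731
```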

High-level model confusions, with Pearson's Φ² against the corresponding human row:

          |       near-near         |        near-far         |        far-far
          | sir  k   p   t    Φ²    | sir  k   p   t    Φ²    | sir  k   p   t    Φ²
    sir   |  16  0   0   4  0.0514  |  18  0   1   1  0.0333  |  14  1   2   3  0.0167
    skur  |   0 19   0   1  0.0256  |   3 17   0   0  0.0531  |   2 16   0   2  0.0667
    spur  |   1  0  17   2  0.0590  |   5  1  14   0  0.0583  |   3  0  16   1  0.0583
    stir  |   1  1   1  17  0.0811  |   8  3   0   9  0.0513  |   0  0   0  20  0.0256

MPR
• The mean-to-peak ratio (MPR) is tested as a metric to quantify reverberation (see the sketch below).
• Reverberation fills dips in the temporal envelope, so the dynamic range reduces.
• MPR increases with reverberation; it is inversely proportional to dynamic range.
[Figure: near and far temporal envelopes; near: MPR = 0.19, far: MPR = 0.32.]
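
A minimal sketch of the metric; Hilbert-envelope extraction with a short smoothing window is our assumption about how the wideband envelope is obtained.

```python
import numpy as np
from scipy.signal import hilbert

def mean_to_peak_ratio(x, fs, smooth_ms=10.0):
    """Mean-to-peak ratio (MPR) of the wideband temporal envelope. As
    reverberation fills the dips, the envelope mean rises relative to its
    peak, so MPR increases (e.g. ~0.19 'near' vs ~0.32 'far' above)."""
    env = np.abs(hilbert(x))                         # temporal envelope
    win = max(1, int(fs * smooth_ms / 1000.0))
    env = np.convolve(env, np.ones(win) / win, mode='same')
    return float(env.mean() / env.max())
```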

ASR
• A hidden Markov model (HMM) recogniser was implemented using HTK [9].
• HMMs were initially trained on TIMIT, then adapted to a subset of the AI corpus.
• 39 monophone models + silence model [10].
• AI corpus prompts were expanded to phone sequences using the CMU dictionary [11].
• Semi-forced alignment: the recogniser identified the TEST word only, with all context words fixed (see the sketch below).
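
The 'semi-forced' constraint amounts to fixing every context word from the prompt while leaving the TEST slot free over the four alternatives. A minimal sketch follows; with HTK such a per-utterance word network would typically be built with HParse and decoded with HVite, though that tooling detail is our assumption.

```python
TEST_WORDS = ("SIR", "SKUR", "SPUR", "STIR")

def semi_forced_grammar(prompt_words):
    """Build an HParse-style grammar in which every context word is pinned
    to the prompt and only the TEST slot carries a four-way alternative."""
    slot = " | ".join(TEST_WORDS)
    words = [f"( {slot} )" if w == "TEST" else w for w in prompt_words]
    return "( " + " ".join(words) + " )"

print(semi_forced_grammar(["CW1", "CW2", "TEST", "CW3"]))
# -> ( CW1 CW2 ( SIR | SKUR | SPUR | STIR ) CW3 )
```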

Discussion
• The high-level computer model replicates compensation for reverberation in the AI corpus speech identification task.
• The efferent model results are consistent with the proposal that auditory processes controlling dynamic range might contribute to the reverberant 'sir/stir' distinction.
• The efferent model helps to recover dips in the temporal envelope, but not the more complex acoustic-phonetic cues needed for /p/, /t/, /k/ identification.
• Lack of training data may have contributed to the poor performance of the efferent-based model on the AI corpus task (for the high-level model, we adapted the recogniser on the AI corpus test material).
• Future work will add frequency-dependent processing, since recent perceptual data suggest that constancy operates within individual frequency bands [12, 3]. We will also address the recent findings of [13] concerning compensation with silent contexts.

Acknowledgements
Funded by EPSRC. Thanks to Ray Meddis for the DRNL code, to Tim Jürgens for help with the phi-squared metric, and to Kalle Palomäki, Tony Watkins, Hynek Hermansky and Simon Makin for helpful discussion.

References
1. AJ Watkins (2005). J. Acoust. Soc. Am. 118 (1) 249-262
2. EJ Brandewie & P Zahorik (2010). J. Acoust. Soc. Am. 128 (1) 291-299
3. AV Beeston, GJ Brown, AJ Watkins & SJ Makin (2011). Int. J. Audiology 50 (10) 771-772
4. J Wright (2005). Articulation Index. Linguistic Data Consortium
5. JJ Guinan (2006). Ear Hear. 27 (6) 589-607
6. RT Ferry & R Meddis (2007). J. Acoust. Soc. Am. 122 (6) 3519-3526
7. AV Beeston & GJ Brown (2010). Proc. Interspeech, Makuhari, 2462-2465
8. T Jürgens & T Brand (2009). J. Acoust. Soc. Am. 125 (5) 2635-2648
9. Hidden Markov model toolkit (HTK). http://htk.eng.cam.ac.uk
10. KF Lee & HW Hon (1989). IEEE Trans. Acoust., Speech, Signal Process. 37 (11) 1641-1648
11. Carnegie Mellon University (CMU) pronunciation dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict
12. AJ Watkins, SJ Makin & AP Raimond (2010). In Binaural Processing and Spatial Hearing. Danavox Jubilee Foundation, 371-380
13. JB Nielsen & T Dau (2010). J. Acoust. Soc. Am. 128 (5) 3088-3094