Wideband Speech Communications for Automotive: the Good, the … · 2019-02-22

Wideband Speech Communications for Automotive: the Good, the Bad, and the Ugly Scott Pennock and Phil Hetherington QNX Software Systems

Abstract Wideband (50-7000 Hz) speech communications brings improvements over traditional narrowband (300-3400 Hz) communications: it can increase intelligibility and comprehension, reduce driver distraction, create a better “sense of presence”, make it easier to identify the far-end talker, and enable spatial auditory displays (the “Good”).

Unfortunately, wideband communications also has some drawbacks: people are more sensitive to wideband echo, some echo cancellers may have difficulties with wideband signals, and the frequency range allows more noise and echo to be transmitted along with the speech (the “Bad”). Further, the switch to wideband won’t happen overnight: neither the standards community nor the telecommunications industry has addressed interoperability with existing narrowband systems — issues such as maintaining consistent loudness and quality over mixed connections (the “Ugly”).

To address the issues raised by the advent of wideband communications, vehicle platforms will require good electro-acoustic design, as well as high-performance acoustic echo cancellation (AEC) and noise reduction (NR) algorithms; and mixed narrowband-wideband connections will require reliable bandwidth extension techniques. This paper reviews some of the main benefits, challenges and unresolved issues with wideband speech communications in an automotive environment.

Introduction We could have sworn the lab was haunted. As we walked into the room we could hear two of our colleagues talking to each other. Then we noticed that only one person other than ourselves was in the room. It took us a few seconds to realize that the other colleague was in a different room, talking over a perfectly tuned prototype of a conference speaker-phone. The sense of presence was so real that it sounded and felt as though we were all in the same room — even after we became aware of the “trick” being played!

Current telephone calls don’t sound like we are face-to-face because the telephone network and terminals band-limit speech from about 50-10000 Hz down to 300-3400 Hz (Denes and Pinson, 1993). Today, however, the historical reasons that limited speech transmission to narrowband (NB) transmission no longer apply. Emerging technologies, such as wideband (WB) speech coders and VoIP, are beginning to enable speech communications at bandwidths that more accurately carry the full range of spoken sound.

Advantages of Wideband In an automotive environment, wideband (WB) speech can reduce driver distraction, makes speech easier to understand in the presence of vehicle noise, and reduces the fatigue that comes from trying to understand a degraded voice signal. WB speech also makes the person speaking on the far end of the connection sound more as though he or she were in the vehicle, and consequently easier to identify and communicate with.¹

Challenges of Wideband Along with its benefits, WB speech also presents some challenges. People are more sensitive to WB echo and noise than to NB echo and noise, and echo cancellers have a more difficult time cancelling the higher-frequency echoes associated with WB. Further, neither the telecommunications industry nor the standards community has addressed the interoperability issues between WB systems and existing NB systems. Until these issues are addressed, users of WB terminals may experience inconsistent loudness and quality between calls.

Solutions Fortunately, the benefits of WB speech communication in automobiles far outweigh its drawbacks and, most importantly, these drawbacks can be addressed through considered use of appropriate technologies: careful vehicle electro-acoustic design to minimize noise and echo picked up by the microphone; high-performance acoustic echo cancellation (AEC) and noise reduction (NR) algorithms to sufficiently reduce echo and noise levels; and reliable bandwidth extension techniques to minimize undesirable side effects of mixed narrowband-wideband connections.

¹ In this paper, “wideband speech” refers to speech communications with a bandwidth of 50-7000 Hz. Although some speech energy exists above 7 kHz, extending the high-frequency cut-off beyond 7 kHz provides only minimal further improvements to speech communications performance and quality. “Super-wideband” (50-14000 Hz) does produce a small noticeable improvement if compared directly to wideband speech. However, most listeners will not be able to detect a difference between “super-wideband” and “full-band” (20-20000 Hz) speech communications.

The Good, Part I: Better Task Performance Wideband communication improves speech comprehension, talker identification, and speech localization. These benefits alone make a compelling argument for deploying WB speech in the vehicle. Moreover, because WB reduces the effort the driver must make to understand the person or persons speaking at the far end of a connection, it may also help reduce driver distraction.

Speech comprehension WB speech helps improve speech comprehension in two ways. First, it helps the human auditory system separate the speech signal from background noise — a process known as auditory streaming. Second, compared to NB speech, WB speech provides more cues to the phonetic identity of sounds within the stream. See Figure 1 below.

Auditory streaming WB speech more effectively separates the speech signal from background noise than does NB speech, because it contains information that supports auditory streaming. This information includes localization and comodulation information.

Localization information is information about the source of the sound (Where is it coming from? and Who is speaking?) that helps a person stream speech more effectively, and hence understand it. Levitt and Rabiner (1967) showed that intelligibility of speech in background noise increases when speech is localized separately from the noise. The ability to make this distinction is sometimes called “the cocktail party effect”.


WB speech is likely to be localized better than NB speech because it contains more low- and high-frequency information. Low frequencies cause time differences at the two ears that provide clues to the source’s location. High frequencies provide further clues: unlike low-frequency sound, which wraps around the listener’s head and reaches the far ear largely unattenuated, high-frequency sound does not wrap around the head and is attenuated at the far ear.

These level and spectral differences between what the listener hears in each ear offer further clues to the location of a sound source, helping the listener localize the sound source and, thus, identify who is speaking — ultimately, reducing the effort required to maintain the conversation.
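As a rough illustration of the interaural time differences mentioned above, the following sketch evaluates Woodworth's spherical-head approximation. This formula is not taken from this paper; it is a common textbook model used here as an assumption. The head radius (8.75 cm), the speed of sound (343 m/s), and the function name are likewise assumed for illustration.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average head radius (not from the paper)
SPEED_OF_SOUND_M_S = 343.0   # assumed speed of sound in air at about 20 °C


def itd_seconds(azimuth_deg):
    """Interaural time difference for a distant source, Woodworth approximation.

    azimuth_deg is the source angle from straight ahead, valid for 0 to 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))


for angle in (0, 30, 60, 90):
    print(f"{angle:>2} degrees: {itd_seconds(angle) * 1e6:.0f} microseconds")
```

Under these assumptions the time difference grows from zero for a source straight ahead to roughly 0.66 ms for a source directly to one side, which is the kind of cue the listener can use to place a talker.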

Studies have shown that comodulation of additional frequency bands of the signal can help people understand speech in noisy environments by improving auditory streaming (Wright, 1990; Carrell and Opie, 1992). Hall and Grose (1990) further demonstrated that adding comodulated noise bands can help separate the noise from the signal. The extra speech energy in both the low- and high-frequency bands of WB speech helps the listener separate the speech signal from the noise.

Identifying speech sounds WB speech improves the intelligibility of received speech by providing more cues to the identity of phonemes (units of sound used to form meaningful constructs), syllables, and words in the speech stream. Figure 2 below shows a “difference spectrogram” of the extra information provided by WB speech. To calculate the spectrogram, the researcher subtracted an NB spectrogram from the WB spectrogram of a person saying “The juice of lemons makes fine punch.” Note that, compared to NB speech, WB speech offers the listener more temporal information as well as more frequency information.

The high frequencies in WB speech can help users discriminate fricative consonants such as /s/ and /f/, which is required in order to make the distinction between, for instance, the words “six” and “fix”. Diethorn et al. (2005) showed that WB speech reduces confusions between /s/ and /f/, and that the benefits of WB speech increase in noisy conditions.

Figure 1: Auditory streaming of speech. Shapes represent phonetic units that the user has recognized. Dotted lines show information available in wideband speech, but unavailable in narrowband speech. Wideband speech helps users distinguish between fricatives: /f/ and /s/, for example, and thus between words such as “fix” and “six”.

Figure 3 below shows that when going from NB to WB speech, the probability of a correct response went from 0.91 to 0.99 at a signal-to-noise ratio (SNR) of 24 dB, from 0.70 to 1.00 at an SNR of 12 dB, and from 0.52 to 0.99 at an SNR of 0 dB. WB speech clearly provides a task performance advantage, especially in noisy conditions such as those found in an automobile.
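The size of the advantage at each noise level is easier to see as error rates. The short sketch below only restates the probabilities quoted above as percentages and subtracts them; the variable and function names are illustrative and not part of the original study.

```python
# Probability of a correct /s/ versus /f/ response, in percent, as quoted above.
# Keys are SNR in dB; values are (narrowband, wideband).
correct_pct = {24: (91, 99), 12: (70, 100), 0: (52, 99)}


def summarize(results):
    """Return {snr_db: (nb_error, wb_error, gain)} in percentage points."""
    summary = {}
    for snr_db, (nb, wb) in results.items():
        summary[snr_db] = (100 - nb, 100 - wb, wb - nb)
    return summary


for snr_db, (nb_err, wb_err, gain) in summarize(correct_pct).items():
    print(f"SNR {snr_db:>2} dB: NB error {nb_err}%, WB error {wb_err}%, gain {gain} points")
```

The gain grows from 8 percentage points at 24 dB to 47 points at 0 dB, which is the sense in which the benefit increases in noise.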

The extra temporal information provided by WB speech also gives more cues to speech sound identity. Moore (1997) summarizes some related studies. For instance, WB speech provides more information related to rapid changes in spectrum, amplitude attack, and the delay between onset of sound and start of voicing, all of which give clues to phonetic identity (Stevens, 1980; Blumstein and Stevens, 1979).

Further, WB speech provides more envelope information, which helps listeners distinguish voiced from unvoiced sounds, the voiceless stops /p t k/ from other consonants, and consonant-to-vowel relative amplitudes (Van Tasell et al., 1987).

The envelope also carries prosodic information about sentence structure (intonation) and speech sound duration (stress), which provide context that helps the listener identify speech sounds. WB speech provides more complete information related to these temporal aspects of speech sounds than does NB speech, and thus improves the listener’s chances of making correct phonetic identification.

Reducing driver distraction Driver distraction occurs when non-driving tasks cause reaction time delays in time-critical driving tasks. These delays can produce driving errors and, most importantly, accidents that would have been avoided had there been no distraction.

WB speech can help reduce driver distraction in several ways. It can reduce the load on the driver’s working memory; it can reduce the amount of attention the driver requires to comprehend speech; and it can reduce driver fatigue.

Working memory WB speech reduces a driver’s need to use working memory to understand what he is hearing. When a speech signal becomes degraded (NB speech in noise, for instance), a person resorts to higher-level cognitive mechanisms to help decipher what is being said (Kjellberg, 2004). Uncertainty about the current speech sounds (for instance, phoneme, syllable, or word) engages working memory to further analyze and compare the current sounds to the speech sounds that preceded and followed in an attempt to eliminate ambiguities. This uncertainty also engages higher-level knowledge of syntactic, semantic, and contextual information.

Figure 2: A difference spectrogram comparing the information provided by narrowband and wideband speech.

In essence, with a degraded signal, speech comprehension shifts from an automatic “bottom-up” process to more of a “top-down” process that requires more integration of current, recently processed, and stored information in the listener’s working memory. Since a person’s working memory is a shared resource with limited processing throughput and temporary storage, the speech comprehension task can interfere with driving tasks, causing delays in response times — driver distraction.

Attention WB speech can also reduce driver distraction by reducing the amount of attention the driver requires for speech comprehension. Compared to a driver who is listening to NB speech, a driver listening to WB speech has more cognitive capacity available for monitoring driving-related tasks, and, all other factors being equal, is less distracted.

Fatigue Finally, WB speech can reduce driver distraction by reducing fatigue. Fatigue is the gradual reduction in cognitive functioning due to continuous exposure to stimuli. Fatigue can be accelerated by the extra cognitive processing demands required for speech comprehension when signals are degraded. In 1995, the AT&T Technical Journal reported that “customers find that 7-kHz bandwidth speech is far more pleasant and less fatiguing to listen to than telephone bandwidth speech” (Cox et al., 1995). When a driver becomes cognitively fatigued, his ability to monitor and quickly respond to driving-related tasks is impaired.

Speech localization and intelligibility WB speech helps the driver benefit from spatial auditory displays to effect speech localization; that is, to accurately render the target position of the voice. It does this by providing more low- and high-frequency information (as described above) that the driver’s perceptual mechanisms use to localize sound in the automobile, and, thereby, to determine who is speaking, and to more easily understand the conversation. It also helps separate telephony speech from application prompts.

Speech localization Speech localization is defined as the identification by the listener of the place from which a speaker’s voice is coming. In an automobile, this place, or location, may be a door panel, the dashboard, etc.

Spatial audio displays Spatial audio displays are used to control speech localization. These audio displays spatially separate multiple parties in a conference call, causing each speaker’s voice to be perceived as coming from a different location. With a spatial auditory display, for example, caller A may appear to be coming from the left and caller B from the right.

Figure 3: Effects of speech bandwidth and SNR on /s/ to /f/ confusions.

Figure 4: Spatial audio display being used to localize different talkers and application prompts from different positions within the vehicle, helping the listener determine who is speaking and thus more easily understand the conversation.

The Good, Part II: User preference Users prefer WB speech and judge it to be of higher quality than NB speech. In the presence of vehicle noise, they can expend less listening effort and can listen to speech at a more comfortable loudness level.

Other factors, including user expectations and transmission of talker characteristics, may also play a role in the preference for WB speech.

Better quality To assess preference for WB speech, researchers typically ask listeners to rate speech quality, then compare the listeners’ ratings to determine preference.

ITU-T Recommendation G.107 (08/2008), a transmission planning tool based on many subjective quality tests, indicates that, compared to NB, WB improves speech quality by 29%.

Further, a field study of WB speech quality with 150 T-Mobile subscribers found that about 70% of trial participants preferred WB speech (Kälvemark and Kornblad, 2006). Ratings indicated that users preferred WB speech because it performed better in noisy environments and increased the sense of talker presence. Users often describe WB speech as “warmer,” “clearer,” “more natural,” and creating a better “sense of presence.” The AT&T Technical Journal noted that “low frequency enhancement (50-200 Hz)”, such as that provided by WB speech, “contributes to increased naturalness and speaker presence” (Jayant et al., 1990).

Listening effort Listening effort refers to the listener’s awareness of the cognitive load during speech comprehension. In quiet, listeners may report that no effort is required to understand speech. However, as the signal becomes degraded, listeners report that they have to exert effort to understand speech.

All other things being equal, listeners prefer a system that requires less listening effort. WB speech can reduce listening effort by reducing the cognitive load required for speech comprehension (as described above).



Loudness level Loudness level refers to the perceived level of speech. At high loudness levels, speech peaks can cause discomfort for the listener.

Listeners prefer a system with a more comfortable loudness level, all other things being equal. WB speech creates a more comfortable loudness level by reducing the level of speech peaks that can cause discomfort.

WB speech peak levels are lower than NB speech peak levels because with WB speech overall loudness is achieved across frequencies (that is, critical bands) instead of by increasing the level of only a narrow bandwidth (Zwicker and Fastl, 1990a). That is, the energy (the loudness) is distributed across a wide acoustic spectrum, which does not cause the sort of discomfort that the same energy does when confined to a narrow bandwidth — in its most unpleasant forms, in a shriek or a whistle.

Other factors influencing preference Listener preference is likely determined by more than just how easy it is to hear a sound, and how aesthetically pleasing the sound is. Communicating nuances through transmission of information related to talker attributes, and user expectations of quality also play a role in determining listener preferences.

Communicating nuances Intelligibility in a conversation does not only depend on differentiating between talkers and on clear reception of the sound units that convey the literal meanings of words and sentences. It also depends on a wealth of other factors. Intonation, for example, is often essential to distinguishing a statement (“You are at home.”) from a question (“You are at home?”), which in most English dialects are distinguishable by their, respectively, falling and rising intonations.

Similarly, useful clues about a talker’s physical characteristics (age), emotional state (nervousness, sincerity) and psychological traits (intelligence, personality) are also communicated through speech. This information can help listeners more accurately assess the situation on the far end of the telephone connection, and increases the “sense of presence”.

Studies testing the relationship between WB speech and these sorts of factors are, unfortunately, lacking. It is not unreasonable, however, to suggest that, because WB speech delivers to the listener speech signals that more closely approximate face-to-face communications, it provides listeners with a wealth of clues that improve speech communications performance and preference.

User expectations Like secondary clues about the far-end speaker’s characteristics, state, and intentions, user expectations are also a probable factor in determining listeners’ preferences for WB speech. Preference for WB speech is likely to increase as expectations are increased. Increased expectations will be driven by greater availability and usage of WB systems.

History has a similar example. When speakerphones first appeared on the market, they were “half-duplex”: when one end of the call was talking, the other end would often be unable to interrupt, or it was unintelligible. Surprisingly, users rated the quality of such systems as high, for the simple reason that their expectations were low. The technology was new and users were just happy not to have to hold a handset against their heads.

As better, “full-duplex” speakerphones were introduced, user expectations of quality increased and “half-duplex” speakerphones became less acceptable. Users now expected that both ends of the connection should be able to speak and be heard simultaneously. They became more sensitive to the impairments introduced by “half-duplex” speakerphones, and became dissatisfied with their performance. In short, as users gain more exposure to the higher quality of WB speech, their expectations will increase. And, as was the case with “half-duplex” speakerphones, users will also become sensitized to the impairments introduced by NB speech. These factors will likely increase the preference, and demand, for WB speech.

The Bad WB speech has many benefits, but it is not without its challenges. Chief among these are that people are more sensitive to WB echo and noise than to NB echo and noise, and that echo cancellers have more difficulty cancelling higher-frequency echo than lower-frequency echo. Fortunately, acoustic engineers and vehicle designers can overcome these challenges with a combination of high-performance signal enhancement algorithms and careful electro-acoustic design of the vehicle platform.

Sensitivity to wideband echo and noise People are more sensitive to WB echo for a number of reasons. First, high-frequency echo falls into a region where the ear is most sensitive to sound (Fletcher, 1995a). Second, echo in the new frequency regions of WB (50-300 Hz, 3400-7000 Hz) adds to the loudness of the echo heard in the traditional NB region (Zwicker and Fastl, 1990a). Third, one’s own voice does not mask high-frequency echo as effectively as it masks low-frequency echo (Gierlich et al., 2008).

People are also more sensitive to WB noise than to NB noise. First, a lot of low-frequency vehicle noise normally gets filtered out by the NB network. Although people are less sensitive to low frequencies, this lower sensitivity is offset by the high levels of low-frequency vehicle noise. In other words, the overall level of low-frequency noise in a moving vehicle means that the problem of low-frequency noise in WB remains. Second, loudness increases dramatically as the noise bandwidth increases (Zwicker and Fastl, 1990a). The total number of critical bands (that is, frequency bands of the human auditory system) increases from around 13 (NB) to 20 (WB). About three additional critical bands exist between 50-300 Hz, and about four exist between 3400-7000 Hz (Zwicker and Fastl, 1990b). Loudness increases dramatically because total loudness is the sum of specific loudness at each critical band.
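The critical-band counts above can be checked approximately with a frequency-to-Bark conversion. The sketch below uses the Zwicker and Terhardt closed-form approximation of the critical-band rate; the choice of that formula is an assumption made here for illustration and is not stated in this paper, and the function names are hypothetical.

```python
import math


def hz_to_bark(freq_hz):
    """Critical-band rate in Bark (Zwicker-Terhardt approximation, assumed here)."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)


def bands_between(low_hz, high_hz):
    """Approximate number of critical bands spanned by the range low_hz to high_hz."""
    return hz_to_bark(high_hz) - hz_to_bark(low_hz)


nb_bands = bands_between(300, 3400)     # traditional narrowband
wb_bands = bands_between(50, 7000)      # wideband
low_extra = bands_between(50, 300)      # added low-frequency region
high_extra = bands_between(3400, 7000)  # added high-frequency region

print(f"NB spans about {nb_bands:.1f} Bark, WB about {wb_bands:.1f} Bark")
print(f"Low extension about {low_extra:.1f} Bark, high extension about {high_extra:.1f} Bark")
```

With this approximation the narrowband and wideband spans come out close to the 13 and 20 bands quoted above. The split between the low and high extensions is only approximate and depends on the formula chosen.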

Cancellation of high-frequency echo Several factors make high-frequency echo more difficult to cancel than low-frequency echo.

First, the amplitude of high-frequency speech sounds is much lower than it is for low-frequency speech sounds; for example, the phonetic power of /s/ is about 17dB lower than /a/ (Fletcher, 1995b). This weakness of the excitation signal at high frequencies makes it harder to drive the echo canceller to convergence than when cancelling low-frequency echo. The lower amplitudes at high frequencies present more of a problem for time-series echo cancellers than for frequency-domain echo cancellers.
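The effect of weak high-frequency excitation on a time-domain canceller can be sketched with a small simulation. The code below is a minimal illustration under stated assumptions, not the algorithm used by any product described in this paper: it runs a textbook normalized LMS (NLMS) filter on a noise-free synthetic echo, once with white excitation and once with excitation tilted toward low frequencies by a one-pole low-pass filter. The echo path, filter length, pole value, and step size are all arbitrary choices made for the example.

```python
import math
import random


def nlms_misalignment_db(excitation, echo_path, mu=1.0, eps=1e-9):
    """Run NLMS on a noise-free echo; return final normalized misalignment in dB."""
    taps = len(echo_path)
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    for sample in excitation:
        history = [sample] + history[:-1]
        echo = sum(h * x for h, x in zip(echo_path, history))
        estimate = sum(w * x for w, x in zip(weights, history))
        error = echo - estimate
        power = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / power for w, x in zip(weights, history)]
    num = sum((h - w) ** 2 for h, w in zip(echo_path, weights))
    den = sum(h * h for h in echo_path)
    return 10.0 * math.log10(num / den + 1e-300)


rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# One-pole low-pass (pole at 0.95, an assumed value) leaves little
# energy at high frequencies, loosely mimicking the tilt of speech.
tilted = []
state = 0.0
for s in white:
    state = 0.95 * state + s
    tilted.append(state)

echo_path = [0.5, -0.4, 0.3, -0.2, 0.15, -0.1, 0.05, -0.02]  # hypothetical echo path

white_db = nlms_misalignment_db(white, echo_path)
tilted_db = nlms_misalignment_db(tilted, echo_path)
print(f"white excitation:  {white_db:.1f} dB misalignment")
print(f"tilted excitation: {tilted_db:.1f} dB misalignment")
```

After the same number of samples, the filter driven by the tilted signal is left with a much larger misalignment than the one driven by white noise, because the weakly excited high-frequency components of the echo path are learned slowly.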

Second, distortion in the echo path (which can be caused by loudspeaker mounting or microphone saturation) usually creates energy in the higher frequencies. This distortion passes through the linear cancellation and Non-Linear Processor (NLP) attenuator of many echo cancellers, because it is classified as driver’s speech instead of echo. The far-end talker then hears the distortion as echo. This high-frequency distortion can also prevent the echo canceller from training correctly because the canceller will falsely detect double-talk whenever the far end is talking.



Finally, high-frequency speech sounds are less frequent in speech than are low-frequency sounds; and the amplitude of these training signals is significantly lower. An echo canceller, therefore, has a harder time driving its coefficients to convergence.

High performance algorithms Fortunately, acoustic processing systems in the automotive environment can use high-performance Acoustic Echo Cancellation (AEC) and Noise Reduction (NR) algorithms to resolve the problems presented by WB echo and noise.

Echo cancellation at high frequencies In order to compensate for the user’s higher sensitivity to WB speech echo, AEC algorithms for WB speech must reduce the level of echo even more than do AEC algorithms for NB speech; and they must do so without introducing noticeable level fluctuations to the near-end talker’s speech. These algorithms must also be effective despite the poor excitation signal presented by high-frequency echo, as well as provide robust double-talk detectors so that high-frequency distortion does not prevent the echo canceller from training, as described above.

WB noise reduction NR algorithms for WB connections must also reduce noise levels more than NR algorithms for NB connections do. They must also be able to handle WB speech signals and track background noise even during active speech.

Careful electro-acoustic design Careful electro-acoustic design of vehicle platforms is essential to achieving optimal speech quality. This requirement is especially true for platforms that will support WB speech, which leaves little room for error. Signal enhancement algorithms have theoretical limits; once signals are corrupted at the acoustic interface, they can only do so much to clean things up.

System designers must, therefore, maximize the Speech-to-Noise Ratio (SpNR) at the output of the microphone, reduce loudspeaker-to-microphone coupling, and maintain linearity of the echo path. They can achieve these objectives through careful and precise design of the vehicle cabin and electronics, and through proper selection, placement, orientation, and mounting of the transducers, taking into account challenges such as the linearity problems associated with some lower speech frequencies — the “rattling” door panel.

The Ugly Assuming that high-performance algorithms and careful electro-acoustic vehicle platform design have adequately solved the problems specific to WB speech, there still remains what may be — at least in the short term — the most difficult obstacle to overcome: interoperability. The transition from NB to WB telephony will be a long one. Hence, to be ultimately successful, any WB speech system must adequately address the problem of interoperability with current NB systems.

Interoperability issues Neither the standards community nor the telecommunications industry has adequately addressed the issues related to interoperability of WB systems with existing NB systems. These issues include problems with maintaining consistent loudness and quality over mixed connections. Figure 5 below illustrates the interoperability issues with mixed WB/NB connections.

Inconsistent loudness Some users of WB terminals will find it slightly annoying that the loudness level drops when the far end uses an NB terminal. NB signals sound quieter to the human ear than do WB signals because they have less speech energy spread across fewer critical bands.

Unfortunately, a system cannot use a single volume setting for both WB and mixed WB/NB connections. Even standard root mean square (RMS) and peak-based Automatic Gain Control (AGC) algorithms do not solve the problem because they are unable to take into account the perceptual effect of loudness additivity across critical bands. The solution to this problem, then, is to use algorithms that measure perceived loudness to control automatic gain adjustments.
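Why an RMS-based AGC falls short can be illustrated with a deliberately simplified loudness model. The sketch below is a toy under stated assumptions, not a standardized loudness measure: it spreads a signal's power evenly over a number of critical bands, applies an assumed compressive exponent of 0.3 in each band (roughly a doubling of loudness per 10 dB), and sums the results. The band counts of 13 and 20 come from the discussion of critical bands earlier in this paper; the exponent and function names are assumptions.

```python
import math

LOUDNESS_EXPONENT = 0.3  # assumed compressive exponent, not from the paper


def toy_loudness(total_power, n_bands):
    """Sum of per-band loudness with power spread evenly across n_bands."""
    per_band_power = total_power / n_bands
    return n_bands * per_band_power ** LOUDNESS_EXPONENT


def extra_gain_db(n_bands_narrow, n_bands_wide):
    """Gain in dB a narrower signal needs, beyond equal RMS, to match toy loudness."""
    ratio = (n_bands_wide / n_bands_narrow) ** ((1 - LOUDNESS_EXPONENT) / LOUDNESS_EXPONENT)
    return 10.0 * math.log10(ratio)


# Two signals normalized to the same power, as an RMS-based AGC would do.
nb_loudness = toy_loudness(1.0, 13)
wb_loudness = toy_loudness(1.0, 20)
print(f"equal RMS, toy loudness: NB {nb_loudness:.2f}, WB {wb_loudness:.2f}")
print(f"extra NB gain for equal toy loudness: {extra_gain_db(13, 20):.1f} dB")
```

In this toy model, equalizing RMS still leaves the 13-band signal sounding quieter than the 20-band signal, and a few decibels of additional gain are needed to match them. A loudness-based AGC of the kind suggested above would derive that correction from a proper perceptual model rather than from this simplification.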

Note that for users of NB terminals, the problem does not exist. They hear no loudness differences, because NB terminals limit WB signals to the bands supported by NB signals. WB and NB signals output through a NB terminal have about the same speech energy, and hence the same perceived loudness.

Inconsistent quality Users of WB terminals may also be put off by the lower quality of calls where the far end is on an NB terminal instead of a WB terminal. The NB impairments will not be new, but after experiencing the better quality of WB speech, users will have become more sensitive to, and probably less tolerant of, the limitations of NB.

Further, the contrast effect will contribute to making quality differences more noticeable. The contrast effect occurs when a user listens to two different levels of quality within a short period of time. For example, this effect would occur when the user of a WB terminal ends a call to a WB terminal, then immediately calls an NB terminal. It could also occur on a conference call where both WB and NB terminals are connected.

To reduce the inconsistent quality heard on WB terminals, system designers can use BandWidth Extension (BWE) techniques. BWE takes an NB speech signal and reconstructs the missing low- and high-frequency information. In essence, it turns NB speech into WB speech. Of course, BWE cannot make NB speech sound equivalent to WB speech. It can, however, perceptibly reduce inconsistencies in speech quality.

Figure 5: Interoperability issues with mixed WB/NB communications.

Users of WB terminals will not be the only ones affected by WB-NB interoperability issues. Just like half-duplex speakerphone users a generation ago, NB terminal users may quickly become dissatisfied with the quality of their terminals. Exposure to WB speech will increase their expectations over time, as it will make them more aware of the inherent limitations of NB speech.

To reduce the differences between WB and NB speech experienced by users of both NB and WB terminals, system designers can use High Frequency Encoding (HFE) techniques. HFE takes the high-frequency speech energy (above 3400 Hz) normally removed by an NB system and lowers its frequency so that it is carried through an NB system. HFE has been shown to increase intelligibility and can reduce the differences heard between WB and NB systems.

Long transition period Companies such as Cisco and Avaya have already started to deploy WB speech on business terminals and enterprise networks. It has gained acceptance on the Internet with PC-based VoIP phones such as Skype, and France Telecom has begun deploying it. Despite these promising signs, the transition to WB speech communications will probably be gradual and lengthy—and automobile platforms will not be an exception.

WB speech has not yet taken off in automotive, though when it does, it may well do so very quickly. The automotive industry is already well-positioned to introduce WB speech. Vehicle audio systems are already WB-capable because they have to support high-bandwidth audio signals from the radio and from infotainment applications. Wideband microphones are readily available and easy to drop in, and several WB speech coders are already standardized.

There are several reasons why WB has not yet become established on automotive platforms. Firstly, service providers and automotive OEMs do not appear to fully understand how WB speech can help differentiate their products, for instance, by making them safer or better sounding. Secondly, wireless service providers are reluctant to deploy WB speech service. And, finally, consumers are not demanding it, because they do not yet appreciate how WB speech can improve the quality of their in-vehicle communications.

Though the transition to WB speech is inevitable, it may well be a long one. Mixed WB/NB connections will be around for some time, as will the need to implement interoperability solutions: legacy network equipment will probably continue to serve certain areas for many years yet, and NB terminals won’t die — they will just fade away!

Successful Deployment Though further research that directly measures the effects of speech bandwidth in the automotive environment is needed, work done thus far clearly supports the view that human perceptual mechanisms use the extra frequency and temporal information provided in a WB signal to improve task performance, and that people prefer WB speech. These findings suggest that WB speech will likely become a huge differentiator for automotive OEMs and wireless service providers. Those who wait to make the transition will find themselves forced to react — and react quickly — when the market truly takes off.

Successful deployment of WB speech depends on attention to the design of vehicle platforms and use of high-performance speech enhancement algorithms, such as AEC and NR. Although interoperability issues will mean that the introduction of WB speech will not be completely smooth, these issues will necessarily be resolved: acoustic engineers already have the tools they will need. We look forward to the day when we are all haunted by WB speech, as we were when we walked into that lab.

Acknowledgements Thanks to Mark Fallat for creating the difference spectrogram, to Shree Paranjpe for reviewing this paper, to Gary Elko for sharing his thoughts on challenges presented by high frequency echo, and to Joe Hall for providing the figure on the effects of speech bandwidth.

References Blumstein, S. E., Stevens, K. N. (1979). “Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants”. Journal of the Acoustical Society of America, 66, 1001-1017.

Carrell, T. D., Opie, J. M. (1992). “The effect of amplitude comodulation on object formation in sentence perception”. Perception and Psychophysics, 52, 437-445.

Cox, R. V., Kroon, P., Chen, J., Thorkildsen, R., O’Dell, K.M., Isenberg, D. S. (1995). “Speech Coders: From Idea to Product”. AT&T Technical Journal: Voice and Audio Processing, 74(2), 21.

Denes, P. B., Pinson, E. N. (1993). “The Acoustic Characteristics of Speech”. The Speech Chain (2nd ed.) (pp. 141-142). New York, NY: W. H. Freeman and Company.

Diethorn, E. J., Elko, G. W., Hall, J. L. (2005). “Some Aspects of Wideband Speech in Enterprise Telephony”. 2nd Workshop on Wideband Speech Quality in Terminals and Networks: Assessment and Prediction, 22nd and 23rd June 2005, Mainz, Germany.

Gierlich, H. W., Poschen, S., Kettler, F., Raake, A., Spors, S., Geier, M. (2008). “Echo Perception in Wideband Telecommunications Scenarios — Comparison to E-Model’s Narrowband Echo Findings”. ITU-T workshop on "From Speech to Audio: bandwidth extension, binaural perception", 10-12 September 2008, Lannion, France.

Hall, J. W., Grose, J. H. (1990). “Comodulation masking release and auditory grouping”. Journal of the Acoustical Society of America, 88, 119-125.

Hall, J. W., Grose, J. H., Mendoza, L. (1995). “Across Channel Processes in Masking”. In B. C. J. Moore (Ed.), Hearing (pp. 243-261). San Diego, CA: Academic Press, Inc.

Fletcher, H. (1995a). “Loudness”. In J. B. Allen (Ed.), Speech and Hearing in Communication (p. 188). Woodbury, NY: Acoustical Society of America.

Fletcher, H. (1995b). “Acoustical Speech Powers”. In J. B. Allen (Ed.), Speech and Hearing in Communication (pp. 82-84). Woodbury, NY: Acoustical Society of America.

ITU-T Recommendation G.107 (08/2008), The E-model, a computational model for use in transmission planning.

Jayant, N. S., Lawrence, V. B., Prezas, D. P. (1990). “Coding of Speech and Wideband Audio”. AT&T Technical Journal: Speech Technologies, 69(5), 35.

Kälvemark, A., Kornblad, A. (2006). “Voice Quality Consumer Trial”. Ericsson Consumer & Enterprise Lab. Internet web site: http://www.ericsson.com/technology/tech_articles/amr_files/presentation_voice_uality.pdf

Kjellberg, A. (2004). “Effects of reverberation time on the cognitive load in speech communication: Theoretical considerations”. Noise Health, 7, 11-21.

Levitt, H. and Rabiner, L. R. (1967). “Binaural release from masking for speech and gain in intelligibility”. Journal of the Acoustical Society of America, 42, 601-608.

Moore, B. C. J. (1997). “The search for invariant acoustic cues and the multiplicity of cues”. An Introduction to the Psychology of Hearing (4th ed.) (pp. 296-301). San Diego, CA: Academic Press, Inc.

Stevens, K. N. (1980). “Acoustic correlates of some phonetic categories”. Journal of the Acoustical Society of America, 68, 836-842.

Van Tasell, D. J., Soli, S. D., Kirby, V. M. and Widin, G. P. (1987). “Speech waveform cues for consonant recognition”. Journal of the Acoustical Society of America, 82, 1152-1161.


Wright, B. A. (1990). “Comodulation detection differences with multiple signal bands”. Journal of the Acoustical Society of America, 87, 293-303.

Zwicker, E., Fastl, H. (1990a). “Models of Loudness”. Psychoacoustics: Facts and Models (pp. 197-214). New York, NY: Springer-Verlag.

Zwicker, E., Fastl, H. (1990b). “Critical-Band Rate Scale”. Psychoacoustics: Facts and Models (p. 142). New York, NY: Springer-Verlag.

About QNX Software Systems QNX Software Systems is the leading global provider of innovative embedded technologies, including middleware, development tools, and operating systems. The component-based architectures of the QNX® Neutrino® RTOS, QNX Momentics® Tool Suite, and QNX Aviage® middleware family together provide the industry’s most reliable and scalable framework for building high-performance embedded systems. Global leaders such as Cisco, Daimler, General Electric, Lockheed Martin, and Siemens depend on QNX technology for vehicle telematics and infotainment systems, industrial robotics, network routers, medical instruments, security and defense systems, and other mission- or life-critical applications. The company is headquartered in Ottawa, Canada, and distributes products in over 100 countries worldwide.

www.qnx.com © 2010 QNX Software Systems GmbH & Co. KG, a subsidiary of Research In Motion Limited. All rights reserved. QNX, Momentics, Neutrino, Aviage, Photon and Photon microGUI are trademarks of QNX Software Systems GmbH & Co. KG, which are registered trademarks and/or used in certain jurisdictions, and are used under license by QNX Software Systems Co. All other trademarks belong to their respective owners. 302140 MC411.78
