


We Can Hear You with Wi-Fi!

Guanhua Wang, Student Member, IEEE, Yongpan Zou, Student Member, IEEE, Zimu Zhou, Student Member, IEEE, Kaishun Wu, Member, IEEE, Lionel M. Ni, Fellow, IEEE

Abstract—Recent literature advances Wi-Fi signals to “see” people’s motions and locations. This paper asks the following question: Can Wi-Fi “hear” our talks? We present WiHear, which enables Wi-Fi signals to “hear” our talks without deploying any devices. To achieve this, WiHear needs to detect and analyze fine-grained radio reflections from mouth movements. WiHear solves this micro-movement detection problem by introducing the Mouth Motion Profile, which leverages partial multipath effects and wavelet packet transformation. Since Wi-Fi signals do not require line-of-sight, WiHear can “hear” people’s talks within the radio range. Further, WiHear can simultaneously “hear” multiple people’s talks by leveraging MIMO technology. We implement WiHear on both the USRP N210 platform and commercial Wi-Fi infrastructure. Results show that, within our pre-defined vocabulary, WiHear achieves a detection accuracy of 91% on average for a single individual speaking no more than 6 words, and up to 74% for no more than 3 people talking simultaneously. Moreover, the detection accuracy can be further improved by deploying multiple receivers at different angles.

Index Terms—Wi-Fi Radar, Micro-motion Detection, Moving Pattern Recognition, Interference Cancelation

I. INTRODUCTION

Recent research has pushed the limit of ISM (Industrial, Scientific and Medical) band radiometric detection to a new level, including motion detection [11], gesture recognition [38], localization [10], and even classification [14]. We can now detect motions through walls and recognize human gestures, or even detect and locate tumors inside human bodies [14]. By detecting and analyzing signal reflections, these techniques enable Wi-Fi to “SEE” target objects.

Can we use Wi-Fi signals to “HEAR” talks? It is commonsensical to give a negative answer. For many years, the ability to hear people talk could only be achieved by deploying acoustic sensors closely around the target individuals. This costs a lot and has a limited sensing and communication range. Further, it incurs detection delay because the sensor must first record the sound, process it, and then transmit it to the receiver. In addition, the sound cannot be decoded when the surroundings are too noisy.

This paper presents WiHear (Wi-Fi Hearing), which explores the potential of using Wi-Fi signals to HEAR people talk and to deliver the talking information to the detector at the same time. This has many potential applications: 1) WiHear introduces a new way to hear people talk without deploying any acoustic sensors. Further, it still works well even when the surroundings are noisy.

G. Wang, Y. Zou, Z. Zhou, and L. Ni are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China (email: {gwangab, yzouad, zzhouad, ni}@cse.ust.hk).

K. Wu is with the College of Computer Science and Software Engineering, Shenzhen University (email: wu@szu.edu.cn).

2) WiHear brings a new interactive interface between humans and devices, which enables devices to sense and recognize more complicated human behaviors (e.g., mood) at negligible cost. WiHear makes devices “smarter”. 3) WiHear can help millions of disabled people issue simple commands to devices with only mouth motions instead of complicated and inconvenient body movements.

How can we manage Wi-Fi hearing? It sounds impossible at first glance, as Wi-Fi signals cannot detect or memorize any sound. The key insight is similar to radar systems. WiHear locates the mouth of an individual, and then recognizes his words by monitoring the signal reflections from his mouth. By recognizing mouth moving patterns, WiHear can extract talking information in the same way as lip reading. Thus, WiHear introduces a micro-motion detection scheme that most previous literature cannot achieve, and this minor-movement detection capability is comparable to that of Leap Motion [1]. The closest works are WiSee [38] and WiVi [11], which can only detect more notable motions such as moving arms or legs using Doppler shifts or ISAR (inverse synthetic aperture radar) techniques.

To transform the above high-level idea into a practical system, we need to address the following challenges:

(1) How to detect and extract tiny signal reflections from the mouth only? Movements of surrounding people, and other facial movements (e.g., winks) from the target user, may affect radio reflections more significantly than mouth movements do. It is challenging to cancel these interferences from the received signals while retaining the information from the tiny mouth motions.

To address this issue, WiHear first leverages MIMO beamforming to focus on the target’s mouth, which reduces the irrelevant multipath effects introduced by omnidirectional antennas. Such avoidance of irrelevant multipath enhances WiHear’s detection accuracy, since the impact of other people’s movements does not dominate when the radio beam is located on the target individual. Further, since for a specific user the frequency and pattern of winking are relatively stable, WiHear exploits interference cancelation to remove the periodic fluctuation caused by winking.

(2) How to analyze the tiny radio reflections without any change to current Wi-Fi signals? Recent advances harness customized modulation such as Frequency-Modulated Carrier Waves (FMCW) [10]. Others, like [19], use ultra-wideband signals and large antenna arrays to achieve precise motion tracking. Moreover, since mouth motions induce negligible Doppler shifts, approaches like WiSee [38] are inapplicable.

WiHear can be easily implemented on commercial Wi-Fi devices. We introduce mouth motion profiles, which partially leverage multipath effects caused by mouth movements.


Fig. 1. Illustration of vowels and consonants [37] that WiHear can detect and recognize: (a) æ, (b) u, (c) s, (d) v, (e) l, (f) m, (g) O, (h) e, (i) w. © Gary C. Martin.

Traditional wireless motion detection focuses on movements of arms or body, which can be simplified as a rigid body. Therefore they remove all the multipath effects. However, mouth movement is a non-rigid motion process. That is, when pronouncing a word, different parts of the mouth (e.g., jaws and tongue) have different moving speeds and directions. We thus cannot regard the mouth movements as a whole. Instead, we need to leverage multipath to capture the movements of different parts of the mouth.

In addition, since naturally only one individual talks at a time during a conversation, the above difficulties concern only a single speaking individual. How to recognize multiple individuals talking simultaneously is another big challenge. The reason for this extension is that, in public areas like airports or bus stations, multiple talks happen simultaneously. WiHear enables hearing multiple individuals’ simultaneous talks using MIMO technology. We let the senders form multiple radio beams located on different targets. Thus, we can regard the target group of people as the senders of the signals reflected from their mouths. By implementing a receiver with multiple antennas and enabling MIMO technology, WiHear can decode multiple senders’ talks simultaneously.

Summary of results: We implemented WiHear on both USRP N210 [8] and commercial Wi-Fi products. Fig. 1 depicts some syllables (vowels and consonants) that WiHear can recognize¹. Overall, WiHear can recognize 14 different syllables and 33 trained and tested words. Further, WiHear can correct recognition errors by leveraging related context information. In our experiments, we collect training and testing samples at roughly the same location with the same link pairs. All the experiments are per-person trained and tested.

¹Jaw and tongue movement based lip reading can only recognize 30%∼40% of the whole vocabulary of English [24].

For single-user cases, WiHear achieves an average detection accuracy of 91% in correctly recognizing sentences made up of no more than 6 words, and it works in both line-of-sight (LOS) and non-line-of-sight (NLOS) scenarios. With the help of MIMO technology, WiHear can differentiate up to 3 individuals talking simultaneously with accuracy up to 74%. For through-wall detection of a single user, the accuracy is up to 26% with one link pair, and 32% with 3 receivers at different angles. In addition, based on our experimental results, the detection accuracy can be further improved by deploying multiple receivers at different angles.

Contributions: We summarize the main contributions of WiHear as follows:

• WiHear exploits the radiometric characteristics of mouth movements to analyze micro-motions in a non-invasive and device-free manner. To the best of our knowledge, this is the first effort to use Wi-Fi signals to hear people talk via PHY layer CSI (Channel State Information) on off-the-shelf WLAN infrastructure.

• WiHear achieves lip reading and speech recognition in LOS and NLOS scenarios. WiHear also has the potential of speech recognition in through-wall scenarios with relatively low accuracy.

• WiHear introduces the mouth motion profile, using partial multipath effects and discrete wavelet packet transformation to achieve lip reading with Wi-Fi.

• We simultaneously differentiate multiple individuals’ talks using MIMO technology.

In the rest of this paper, we first summarize related work in Section II and provide background on channel state information in Section III, followed by an overview in Section IV. Sections V and VI detail the system design. Section VII extends WiHear to recognize multiple talks. We present the implementation and performance evaluation in Section VIII, discuss the limitations in Section IX, and conclude in Section X.

II. RELATED WORK

The design of WiHear is closely related to the following two categories of research.

Vision/Sensor based Motion Sensing. The flourishing of smart devices has spurred demand for new human-device interaction interfaces. Vision and sensors are among the prevalent ways to detect and recognize motions.

Popular vision-based approaches include Xbox Kinect [2] and Leap Motion [1], which use RGB hybrid cameras and depth sensing for gesture recognition. A slightly different approach has been grounded into commercial products called Vicon systems [3]. These systems achieve precise motion tracking using cameras by detecting and analysing markers placed on the human body, which requires instrumenting both the environment and the target human body. Yet they are limited to the field of view and are sensitive to lighting conditions. Thermal imaging [35] acts as an enhancement in dim lighting conditions and non-line-of-sight scenarios at the cost of extra infrastructure. Vision has also been employed for lip reading. [26] and [25] present a combination of acoustic speech and mouth movement images to achieve higher accuracy of automatic speech recognition in noisy environments.


[34] presents a vision-based lip reading system and compares viewing a person’s facial motion from the profile and front views. [23] shows the possibility of recovering sound from silent video.

Another thread exploits various wearable sensors or handheld devices. Skinput [30] uses acoustic sensors to detect on-body tapping locations. Agrawal et al. [12] enable writing in the air by holding a smartphone with embedded sensors. TEXIVE [15] leverages smartphone sensors to detect driving and texting simultaneously.

WiHear is motivated by these precise motion detection systems, yet aims to harness the ubiquitously deployed Wi-Fi infrastructure, and works non-intrusively (without on-body sensors) and through-wall.

Wireless-based Motion Detection and Tracking. WiHear builds upon recent research that leverages radio reflections from human bodies to detect, track, and recognize motions [42]. WiVi [11] introduced through-wall motion imaging using MIMO nulling [36]. WiTrack [10] implemented an FMCW (Frequency Modulated Carrier Wave) 3D motion tracking system at a granularity of 10 cm. WiSee [38] recognizes gestures via Doppler shifts. AllSee [33] achieves low-power gesture recognition on customized RFID tags.

Device-free human localization systems locate a person by analyzing his impact on wireless signals received by pre-deployed monitors, while the person carries no wireless-enabled devices [52]. The underlying wireless infrastructure varies, including RFID [53], Wi-Fi [52], and ZigBee [48], and the signal metrics range from coarse signal strength [52] [48] to finer-grained PHY layer features [50] [51].

Adopting a similar principle, WiHear extracts and interprets reflected signals, yet differs in that WiHear targets finer-grained motions of the lips and tongue. Since the micro-motions of the mouth produce negligible Doppler shifts and amplitude fluctuations, WiHear exploits beamforming techniques and wavelet analysis to focus on and zoom in on the characteristics of mouth motions only. Also, WiHear is tailored for off-the-shelf WLAN infrastructure and is compatible with current Wi-Fi standards. We envision WiHear as an initial step towards centimetre-order motion detection (e.g., finger tapping) and higher-level human perception (e.g., inferring mood from speech pacing).

III. BACKGROUND ON CHANNEL STATE INFORMATION

In typical cluttered indoor environments, signals often propagate to the receiver via multiple paths. Such multipath effects create varying path loss across frequencies, known as frequency diversity [39]. Frequency diversity depicts the small-scale spectral structure of wireless channels, and has been adopted for fine-grained location distinction [54], motion detection [49] and localization [44].

Conventional MAC layer RSSI provides only a single-valued signal strength indicator. Modern multi-carrier radios such as OFDM measure frequency diversity at the granularity of subcarriers, and store the information in the form of Channel State Information (CSI).

Fig. 2. Framework of WiHear: the AP beamforms towards the target user, and the laptop receiver performs mouth motion profiling (filtering, partial multipath removal, wavelet transform, noise removal, profile building) followed by learning-based lip reading (segmentation of vowels and consonants, feature extraction, classification and error correction).

Each CSI depicts the amplitude and phase of a subcarrier:

$H(f_k) = \|H(f_k)\|\, e^{j\sin(\angle H)} \qquad (1)$

where $H(f_k)$ is the CSI at the subcarrier with central frequency $f_k$, and $\angle H$ denotes its phase. Leveraging the off-the-shelf Intel 5300 network card with a publicly available driver [28], a group of CSIs $H(f)$ over $K = 30$ subcarriers is exported to upper layers.

$H(f) = [H(f_1), H(f_2), \ldots, H(f_K)] \qquad (2)$

Recent WLAN standards (e.g., 802.11n/ac) also exploit MIMO techniques to boost capacity via spatial diversity. We thus involve spatial diversity to further enrich channel measurements. Given $M$ receiver antennas and $N$ transmitter antennas, we obtain an $M \times N$ matrix of CSIs $\{H_{mn}(f)\}_{M \times N}$, where each element $H_{mn}(f)$ is defined as in Equation 2.
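To make the structure of these measurements concrete, the following minimal NumPy sketch (our illustration, not part of WiHear itself) arranges the CSIs of an M × N MIMO link into a complex array and reads out the per-subcarrier amplitudes and phases of Equations 1 and 2; the array shapes, packet count, and random placeholder values are assumptions.

```python
import numpy as np

M, N, K = 3, 2, 30   # receive antennas, transmit antennas, subcarriers (Intel 5300 exports K = 30)
T = 1000             # number of received packets (illustrative)

# csi[t, m, n, k] stands in for H_mn(f_k) measured at packet t; random complex
# values are used here as placeholders for driver-exported measurements.
rng = np.random.default_rng(0)
csi = rng.standard_normal((T, M, N, K)) + 1j * rng.standard_normal((T, M, N, K))

# Equation (2): the CSI vector of one spatial stream at one time instant.
H_f = csi[0, 0, 0, :]                  # H(f) = [H(f_1), ..., H(f_K)]
amplitude = np.abs(H_f)                # ||H(f_k)|| in Equation (1)
phase = np.angle(H_f)                  # phase of H(f_k)
print(amplitude.shape, phase.shape)    # (30,) (30,)
```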

In a nutshell, PHY layer CSI portrays the finer-grained spectral structure of wireless channels. Spatial diversity provided by multiple antennas further expands the dimensions of channel measurements. While RSSI-based device-free human detection systems mostly make binary decisions on whether a person is present along the link [48] or resort to multiple APs to fingerprint a location [52], TagFree utilizes the rich feature space of CSI to identify different objects with only a single AP.

IV. WIHEAR OVERVIEW

WiHear is a wireless system that enables commercial OFDM (Orthogonal Frequency Division Multiplexing) Wi-Fi devices to hear people talk. Fig. 2 illustrates the framework of WiHear. It consists of a transmitter and a receiver for single-user lip reading. The transmitter can be configured with either two (or more) omnidirectional antennas on current mobile devices or one directional antenna (easily changeable) on current APs (access points). The receiver only needs one antenna to capture radio reflections. WiHear can be extended to multiple APs or mobile devices to support multiple simultaneous users.

The WiHear transmitter sends Wi-Fi signals towards the mouth of a user using beamforming. The WiHear receiver extracts and analyzes reflections from mouth motions.


Fig. 3. Illustration of the locating process: the AP sweeps its beam over time slots T1, T2, and T3 towards the target user, and the laptop receiver feeds back the best time slot.

The receiver interprets mouth motions in two steps:

1) Wavelet-based Mouth Motion Profiling. WiHear sanitizes received signals by filtering out-band interference and partially eliminating multipath. It then constructs mouth motion profiles via discrete wavelet packet decomposition.

2) Learning-based Lip Reading. Once WiHear extracts mouth motion profiles, it applies machine learning to recognize pronunciations, and translates them via classification and context-based error correction.

At the current stage, WiHear can only detect and recognize human talks if the user performs no other movements while speaking. We envision that the combination of device-free localization [50] and WiHear may achieve continuous Wi-Fi hearing for mobile users. Regarding irrelevant human interference or ISM band interference, WiHear can tolerate irrelevant human motions 3 m away from the link pair without dramatic performance degradation.

V. MOUTH MOTION PROFILING

The first step of WiHear is to construct the Mouth Motion Profile from received signals.

A. Locating on Mouth

Due to the small size of the mouth and the weak extent of its movements, it is crucial to concentrate maximum signal power towards the direction of the mouth. In WiHear, we exploit MIMO beamforming techniques to locate and focus on the mouth, thus both introducing less irrelevant multipath propagation and magnifying signal changes induced by mouth motions [20]. We assume the target user does not move while he speaks.

The locating process works in two steps:

1) The transmitter sweeps its beam for multiple rounds while the user repeats a predefined gesture (e.g., pronouncing [æ] once per second). The beam sweeping is achieved via a simple rotator made of stepper motors similar to that in [55]. We adjust the beam directions in both azimuth and elevation as in [56]. Meanwhile, the receiver searches for the time when the gesture pattern is most notable during each round of sweeping. With trained samples (e.g., the waveform of [æ] for the target user), the receiver can compare the collected signals with the trained samples.

Fig. 4. The impact of a wink (as denoted in the dashed red box).

It then chooses the time stamp at which the collected signals share the highest similarity with the trained samples.

2) The receiver sends the selected time stamp back to the transmitter, and the transmitter then adjusts and fixes its beam accordingly. After each round of sweeping, the transmitter gets the time stamp feedback to adjust the emission angle of the radio beam. The receiver may also provide further feedback to the transmitter during the analyzing process to refine the direction of the beam. In the example shown in Fig. 3, after the transmitter sweeps the beam for several rounds, the receiver sends back time slot 3 to the transmitter.

Based on our experimental results, the whole locating process usually costs around 5-7 seconds, which is acceptable for real-world deployment. We define correct locating as the mouth being within the beam’s coverage. More precisely, since the horizontal angle of our radio beam is roughly 120°, our directional antenna rotates 120° per second. Thus we basically sweep the radio beam for around 2 rounds, after which it can locate the correct direction. For single-user scenarios, we tested 20 times with 3 failures, and thus the accuracy is around 85%. For multi-user scenarios, we define correct locating as all users’ mouths being within the radio beams. We tested with 3 people for 10 times with 2 failures, and thus the accuracy is around 80%.
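One simple way to realize the receiver-side search in step 1 is to slide the trained reference waveform (e.g., the waveform of [æ] for the target user) over the stream collected during a sweep and report the time at which the normalized correlation peaks. The sketch below is our own illustration of such a matching step; the 100 Hz sampling rate and the synthetic signals are assumptions.

```python
import numpy as np

def best_match_time(sweep_signal, template, fs=100.0):
    """Return the time (s) in the sweep at which the trained template
    matches the received waveform best (normalized cross-correlation)."""
    t = (template - template.mean()) / (template.std() + 1e-9)
    best_score, best_idx = -np.inf, 0
    for i in range(len(sweep_signal) - len(t)):
        w = sweep_signal[i:i + len(t)]
        w = (w - w.mean()) / (w.std() + 1e-9)
        score = float(np.dot(w, t)) / len(t)
        if score > best_score:
            best_score, best_idx = score, i
    return best_idx / fs, best_score

# Example: a 10 s sweep sampled at 100 Hz, with a 1 s trained template.
rng = np.random.default_rng(1)
sweep = rng.standard_normal(1000)
template = sweep[420:520].copy()          # pretend the gesture occurred at t = 4.2 s
print(best_match_time(sweep, template))   # ~ (4.2, high score)
```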

B. Filtering Out-Band Interference

As the speed of human speech is low, signal changes caused by mouth motion in the temporal domain are often within 2-5 Hz [47]. Therefore, we apply band-pass filtering on the received samples to eliminate out-band interference.

In WiHear, considering the trade-off between computational complexity and functionality, we adopt a third-order Butterworth IIR band-pass filter [21], whose frequency response is defined by Equation 3. The Butterworth filter is designed to have a maximally flat frequency response in the pass band and to roll off towards zero in the stop band, which ensures the fidelity of signals in the target frequency range while greatly removing out-band noise. The gain of an n-order Butterworth filter is:

$G^2(\omega) = |H(j\omega)|^2 = \dfrac{G_0^2}{1 + (\omega/\omega_c)^{2n}} \qquad (3)$

where $G(\omega)$ is the gain of the Butterworth filter; $\omega$ represents the angular frequency; $\omega_c$ is the cutoff frequency; $n$ is the order of the filter (in our case, $n = 3$); and $G_0$ is the DC gain.

Specifically, since the normal speaking rate is 150-300 syllables/minute [47], we set the cutoff frequencies to 60/60 Hz and 300/60 Hz (i.e., 1-5 Hz) to cancel the DC component (corresponding to static reflections) and high-frequency interference.


In practice, as the radio beam may not be narrow enough, a common low-frequency interference is caused by winking. As shown in Fig. 4, however, the frequency of winking is lower than 1 Hz (0.25 Hz on average). Thus, most reflections from winking are also eliminated by the filtering.
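As a concrete illustration of this filtering step, the sketch below builds a third-order Butterworth band-pass filter with the 1-5 Hz (60/60-300/60 Hz) pass band described above using SciPy and applies it to one subcarrier amplitude stream. The 100 Hz sampling rate mirrors the 100 packets/s collection rate used later in Section VIII; the zero-phase filtfilt call and the synthetic input are our own choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_mouth_motion(x, fs=100.0, low=60/60, high=300/60, order=3):
    """3rd-order Butterworth band-pass keeping the 1-5 Hz mouth-motion band."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="bandpass")
    return filtfilt(b, a, x)   # zero-phase filtering to avoid extra delay

# Synthetic subcarrier amplitude: DC + 3 Hz mouth motion + 0.25 Hz wink + noise.
fs = 100.0
t = np.arange(0, 10, 1 / fs)
x = 1.0 + 0.2 * np.sin(2 * np.pi * 3 * t) \
    + 0.3 * np.sin(2 * np.pi * 0.25 * t) \
    + 0.05 * np.random.randn(t.size)
y = bandpass_mouth_motion(x, fs)   # DC and wink components are strongly attenuated
```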

C. Partial Multipath Removal

Unlike previous work (e.g., [10]), where multipath reflections are eliminated thoroughly, WiHear performs partial multipath removal. The rationale is that mouth motions are non-rigid compared with arm or leg movements. It is common for the tongue, lips, and jaws to move in different patterns and sometimes deform in shape. Consequently, a group of multipath reflections with similar delays may all convey information about the movements of different parts of the mouth. Therefore, we need to remove reflections with long delays (often due to reflections from surroundings), and retain those within a delay threshold (corresponding to non-rigid movements of the mouth).

WiHear exploits the CSI of commercial OFDM-based Wi-Fi devices to conduct partial multipath removal. CSI represents a sampled version of the channel frequency response at the granularity of subcarriers. An IFFT (Inverse Fast Fourier Transform) is first applied to the collected CSI to approximate the power delay profile in the time domain [43]. We then empirically remove multipath components with delays over 500 ns [31], and convert the remaining power delay profile back to the frequency-domain CSI via an FFT (Fast Fourier Transform). Since for a typical indoor channel the maximum excess delay is usually less than 500 ns [31], we set it as the initial value. The maximum excess delay of the power delay profile is defined as the temporal extent of the multipath components above a particular threshold. The delay threshold is empirically selected and adjusted based on the training and classification process (Section VI). More precisely, if we cannot obtain a well-trained waveform (i.e., one that is easy to classify as a group) for a specific word/syllable, we empirically adjust the multipath threshold value.
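A minimal sketch of the partial multipath removal on a single 30-subcarrier CSI vector: transform to an approximate power delay profile with an IFFT, zero the taps whose delay exceeds the 500 ns threshold, and transform back with an FFT. The uniform subcarrier spacing assumed for mapping tap indices to delays is an illustrative simplification (the 30 subcarriers reported by the Intel 5300 are not uniformly spaced in practice).

```python
import numpy as np

def partial_multipath_removal(csi, subcarrier_spacing_hz=1.25e6, max_delay_s=500e-9):
    """Remove multipath components whose delay exceeds max_delay_s.

    csi: complex array of K subcarrier responses (e.g., K = 30).
    Assumes uniformly spaced subcarriers for the IFFT approximation.
    """
    K = len(csi)
    pdp = np.fft.ifft(csi)                      # approximate power delay profile
    tap_delay = np.arange(K) / (K * subcarrier_spacing_hz)
    pdp[tap_delay > max_delay_s] = 0            # drop long-delay reflections
    return np.fft.fft(pdp)                      # back to frequency-domain CSI

# Example on a random CSI vector.
rng = np.random.default_rng(2)
csi = rng.standard_normal(30) + 1j * rng.standard_normal(30)
csi_clean = partial_multipath_removal(csi)
```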

D. Mouth Motion Profile Construction

After filtering and partial multipath removal, we obtain a sequence of cleaned CSIs. Each CSI represents the phases and amplitudes of a group of 30 OFDM subcarriers. To reduce computational complexity while keeping the temporal-spectral characteristics, we select a single representative value for each time slot.

We apply identical and synchronous sliding windows on all subcarriers and compute a coefficient C for each of them in each time slot. The coefficient C is defined as the peak-to-peak value of each subcarrier within a sliding window. Since we have filtered the high-frequency components, there is little dramatic fluctuation caused by interference or noise [21]. Thus the peak-to-peak value can represent human talking behaviors. We also compute another metric, the mean signal strength in each time slot for each subcarrier. The mean values of all subcarriers enable us to pick several subcarriers (in our case, ten) that are the most centralized ones, by analyzing the distribution of mean values in each time slot.

Fig. 5. Discrete wavelet packet transformation: the input X[n] is recursively passed through low-pass (L[n]) and high-pass (H[n]) filters, each followed by downsampling by 2.

Among the chosen subcarriers, based on the coefficient C calculated within each time slot, we pick the waveform of the subcarrier with the maximum C. By sliding the window on each subcarrier synchronously, we pick a series of waveform segments from different subcarriers and assemble them into a single waveform by arranging them one by one. We define the assembled CSIs as a Mouth Motion Profile.

Some may argue that this peak-to-peak value may be dominated by environmental changes. However, first of all, we have filtered the high-frequency components. In addition, as mentioned in the introduction and previous sections, we keep the surrounding environment static during our experiments, so irrelevant signal fluctuations caused by the environment are unlikely. Furthermore, the sliding window we use is 200 ms (the duration of the sliding window can be changed according to different people’s speaking patterns). These three factors ensure that, in most scenarios, our peak-to-peak value is dominated by mouth movements. Further, we use the information of all 30 subcarriers to remove irrelevant multipath while keeping partial multipath, as described in Section V-C. Thus we do not waste any information collected from the PHY layer.
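The construction described above can be summarized by the following sketch (our own simplification): within each 200 ms window, compute the peak-to-peak coefficient C per subcarrier, restrict attention to the ten subcarriers whose window means are most centralized, and stitch together the windowed waveform of the subcarrier with the largest C. The interpretation of "most centralized" as closeness to the median is an assumption.

```python
import numpy as np

def mouth_motion_profile(amps, fs=100.0, win_s=0.2, n_keep=10):
    """amps: array of shape (T, K) with filtered amplitudes of K subcarriers.
    Returns the assembled mouth motion profile (1-D array of length T)."""
    T, K = amps.shape
    w = int(win_s * fs)
    profile = np.empty(T)
    for start in range(0, T, w):
        seg = amps[start:start + w]                              # one sliding window
        means = seg.mean(axis=0)
        # keep the n_keep subcarriers closest to the median mean ("most centralized")
        keep = np.argsort(np.abs(means - np.median(means)))[:n_keep]
        c = seg[:, keep].max(axis=0) - seg[:, keep].min(axis=0)  # peak-to-peak C
        best = keep[np.argmax(c)]
        profile[start:start + w] = seg[:, best]                  # stitch the segment
    return profile

profile = mouth_motion_profile(np.random.randn(1000, 30))
```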

E. Discrete Wavelet Packet Decomposition

WiHear performs discrete wavelet packet decomposition on the obtained Mouth Motion Profiles as input for the learning-based lip reading.

The advantages of wavelet analysis are twofold: 1) It facilitates signal analysis in both the time and frequency domains. This attribute is leveraged in WiHear to analyze the motions of different parts of the mouth (e.g., jaws and tongue) in different frequency bands, because each part of the mouth moves at a different pace. It also helps WiHear locate the time periods of different parts of the mouth motion when a specific pronunciation happens. 2) It achieves fine-grained multi-scale analysis.


In WiHear, the mouth motions when pronouncing some syllables share a lot in common (e.g., [e] and [i]), which makes them difficult to distinguish. By applying the discrete wavelet packet transform to the original signals, we can capture the tiny differences, which benefits our classification process.

Here we first introduce the Discrete Wavelet Transform (DWT). As with the Fourier transform, where the signal is decomposed into a linear combination of the basis functions if the signal is in the space spanned by the basis, wavelet decomposition also decomposes a signal into a combination of a series of expansion functions, as given by Equation 4:

$f(t) = \sum_k a_k \phi_k(t) \qquad (4)$

where $k$ is an integer index of the finite or infinite sum, the $a_k$ are the expansion coefficients, and the $\phi_k(t)$ are the expansion functions, or the basis. If the basis is chosen appropriately, there exists another set of basis functions $\tilde{\phi}_k(t)$ which is orthogonal to $\phi_k(t)$.

The inner product of these two sets of functions is given by Equation 5:

$\langle \phi_i(t), \tilde{\phi}_j(t) \rangle = \int \phi_i(t)\,\tilde{\phi}_j^*(t)\,dt = \sigma_{ij} \qquad (5)$

With the orthonormal property, it is easy to find the coefficients by Equation 6:

$\langle f(t), \tilde{\phi}_k(t) \rangle = \int f(t)\,\tilde{\phi}_k^*(t)\,dt = \int \Big(\sum_{k'} a_{k'} \phi_{k'}(t)\Big)\,\tilde{\phi}_k^*(t)\,dt = \sum_{k'} a_{k'} \sigma_{k'k} = a_k \qquad (6)$

We can rewrite this as Equation 7:

$a_k = \langle f(t), \tilde{\phi}_k(t) \rangle = \int f(t)\,\tilde{\phi}_k^*(t)\,dt \qquad (7)$

For the signal we want to analyze, we apply a particular basis satisfying the orthogonality property to that signal, and it is then easy to find the expansion coefficients $a_k$. Fortunately, the coefficients concentrate on a few critical values, while the others are close to zero.

Discrete wavelet packet decomposition is based on the well-known discrete wavelet transform (DWT), where a discrete signal $f[n]$ is approximated by a combination of expansion functions (the basis):

$f[n] = \frac{1}{\sqrt{M}} \sum_k W_\phi[j_0, k]\,\phi_{j_0,k}[n] + \frac{1}{\sqrt{M}} \sum_{j=j_0}^{\infty} \sum_k W_\psi[j, k]\,\psi_{j,k}[n] \qquad (8)$

where $f[n]$ represents the original discrete signal, defined on $[0, M-1]$ with $M$ points in total. $\phi_{j_0,k}[n]$ and $\psi_{j,k}[n]$ are both discrete functions defined on $[0, M-1]$, called the wavelet basis. Usually, the basis sets $\{\phi_{j_0,k}[n]\}_{k \in \mathbb{Z}}$ and $\{\psi_{j,k}[n]\}_{(j,k) \in \mathbb{Z}^2,\, j \geq j_0}$ are chosen to be orthogonal to each other for the convenience of obtaining the wavelet coefficients in the decomposition process, which means:

Fig. 6. Extracted features of pronouncing different vowels and consonants (signal amplitude over time in ms): (a) æ, (b) u, (c) s, (d) v, (e) l, (f) m, (g) O, (h) e, (i) w.

$\langle \phi_{j_0,k}[n], \psi_{j,m}[n] \rangle = \delta_{j_0,j}\,\delta_{k,m} \qquad (9)$

In discrete wavelet decomposition, the initial step splits the original signal into two parts, the approximation coefficients (i.e., $W_\phi[j_0, k]$) and the detail coefficients (i.e., $W_\psi[j, k]$). The following steps then recursively decompose the approximation coefficients and the detail coefficients into two new parts each, using the same strategy as in the initial step. This offers the richest analysis: the complete binary tree of the decomposition procedure is produced, as shown in Fig. 5.

The wavelet packet coefficients at each level can be computed using the following equations:

$W_\phi[j_0, k] = \frac{1}{\sqrt{M}} \sum_n f[n]\,\phi_{j_0,k}[n] \qquad (10)$

$W_\psi[j, k] = \frac{1}{\sqrt{M}} \sum_n f[n]\,\psi_{j,k}[n], \quad j \geq j_0 \qquad (11)$

where $W_\phi[j_0, k]$ refers to the approximation coefficients and $W_\psi[j, k]$ to the detail coefficients, respectively.

The efficacy of the wavelet transform relies on choosing a proper wavelet basis. We apply an approach that aims at maximizing the discriminating ability of the discrete wavelet packet decomposition, in which a class separability function is adopted [40]. We applied this method to all wavelets in the Daubechies, Coiflets, and Symlets families and obtained their class separability respectively. Based on their classification performance, a Symlet wavelet filter of order 4 is selected.
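As an illustration of this step, the sketch below decomposes a mouth motion profile with a 4-level discrete wavelet packet transform using the order-4 Symlet ('sym4') filter via PyWavelets, yielding 2^4 = 16 sub-waveforms ordered from low to high frequency (matching the 16 sub-waveforms per profile used for feature extraction in Section VI). The decomposition level and boundary mode are our choices, not prescribed above.

```python
import numpy as np
import pywt

def wavelet_packet_features(profile, wavelet="sym4", level=4):
    """Decompose a mouth motion profile into 2**level wavelet packet sub-waveforms."""
    wp = pywt.WaveletPacket(data=profile, wavelet=wavelet, mode="symmetric",
                            maxlevel=level)
    nodes = wp.get_level(level, order="freq")         # 16 nodes, low to high frequency
    return {node.path: node.data for node in nodes}   # node path -> coefficient waveform

subbands = wavelet_packet_features(np.random.randn(1024))
print(len(subbands))   # 16
```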


VI. LIP READING

The next step of WiHear is to recognize and translate the extracted signal features into words. To this end, WiHear detects the changes between pronouncing adjacent vowels and consonants by machine learning, and maps the patterns to words using automatic speech recognition. That is, WiHear builds a wireless-based pronunciation dictionary for an automatic speech recognition system [22]. To make WiHear an automatic and real-time system, we need to address the following issues: segmentation, feature extraction and classification.

A. Segmentation

The segmentation process includes inner-word segmentation and inter-word segmentation.

For inner-word segmentation, each word is divided into multiple phonetic events [32]. WiHear then uses the training samples of each syllable (e.g., sibilants and plosive sounds) to match the parts of the word, and uses the combination of syllables to recognize the word.

For inter-word segmentation, since there is usually a short interval (e.g., 300 ms) between pronouncing two successive words, WiHear detects the silent interval to separate words. Specifically, we first compute the finite difference (i.e., the sample-to-sample difference) of the obtained signal, which we refer to as Sdif. Next we apply a sliding window to the Sdif signal. Within each time slot, we compute the absolute mean value of the signal in that window to determine whether the window is active or not; that is, by comparing against a dynamically computed threshold, we determine whether the user is speaking within the time period that the sliding window covers. In our experiments, the threshold is set to 0.75 times the standard deviation of the differential signal across the whole process of pronouncing a certain word. This metric identifies the time slots when the signal changes rapidly, indicating the process of pronouncing a word.
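A minimal sketch of this inter-word segmentation rule: differentiate the profile, slide a window over the differenced signal Sdif, and mark a window as active when its absolute mean exceeds 0.75 times the standard deviation of Sdif. The window length, sampling rate, and the way adjacent active windows are merged into segments are our own choices.

```python
import numpy as np

def inter_word_segments(profile, fs=100.0, win_s=0.2, k=0.75):
    """Return (start, end) sample indices of active (speaking) regions."""
    s_dif = np.diff(profile)                 # sample-to-sample difference
    thresh = k * s_dif.std()
    w = int(win_s * fs)
    active = [np.abs(s_dif[i:i + w]).mean() > thresh
              for i in range(0, len(s_dif) - w, w)]
    segments, start = [], None
    for i, a in enumerate(active):           # merge consecutive active windows
        if a and start is None:
            start = i * w
        elif not a and start is not None:
            segments.append((start, i * w))
            start = None
    if start is not None:
        segments.append((start, len(s_dif)))
    return segments

print(inter_word_segments(np.random.randn(1000)))
```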

B. Feature Extraction

After signal segmentation, we obtain wavelet profiles for different pronunciations, each with 16 4th-order sub-waveforms from high-frequency to low-frequency components. To avoid the well-known “curse of dimensionality” [27], we apply a Multi-Cluster/Class Feature Selection (MCFS) scheme [18] to extract representative features from the wavelet profiles and reduce the number of sub-waveforms. Compared with other feature selection methods like SVM, MCFS can produce an optimal feature subset by considering possible correlations between different features, which better conforms to the characteristics of the dataset. Running the same dataset with SVM costs 3-5 minutes, whereas processing it with MCFS only takes around 5 seconds. MCFS works as follows.

Firstly, an m-nearest-neighbor graph is constructed from the original dataset P. For each point pi, once its m nearest neighbors are determined, weighted edges are assigned between pi and each of its neighbors. We define the weight matrix W for the edge connecting node i and node j as:

$W_{i,j} = e^{-\frac{\|p_i - p_j\|^2}{\varepsilon}} \qquad (12)$

Fig. 7. Feature extraction of multiple human talks with ZigZag decoding on a single Rx antenna: (a) representative signals of user 1 speaking; (b) representative signals of user 2 speaking.

Secondly, MCFS solves the following generalized eigenproblem:

$Lv = \lambda A v \qquad (13)$

where $A$ is a diagonal matrix with $A_{ii} = \sum_j W_{ij}$, and the graph Laplacian $L$ is defined as $L = A - W$. $V$ is defined as $V = [v_1, v_2, \ldots, v_K]$, where the $v_k$ are the eigenvectors of Equation 13 corresponding to the smallest eigenvalues.

Thirdly, given $v_k$, a column of $V$, MCFS searches for a relevant feature subset by minimizing the fitting error:

$\min_{\alpha_k} \|v_k - P^T \alpha_k\|^2 + \gamma |\alpha_k| \qquad (14)$

where $\alpha_k$ is an $M$-dimensional vector and $|\alpha_k| = \sum_{j=1}^{M} |\alpha_{k,j}|$ represents the L1-norm of $\alpha_k$.

Finally, for every feature $j$, MCFS defines the MCFS score for the feature as:

$\mathrm{Score}(j) = \max_k |\alpha_{k,j}| \qquad (15)$

where $\alpha_{k,j}$ is the $j$-th element of the vector $\alpha_k$, and all the features are sorted by their MCFS scores in descending order.
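The sketch below strings the four MCFS steps together on a small feature matrix: heat-kernel weights on an m-nearest-neighbor graph (Equation 12), the generalized eigenproblem Lv = λAv (Equation 13), an L1-regularized regression per eigenvector (Equation 14), and the max-|α| score (Equation 15). It uses scikit-learn's Lasso as the L1 solver and is a simplified stand-in for the implementation of [18]; the parameter values are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import Lasso

def mcfs_scores(P, n_clusters=5, m=5, eps=1.0, gamma=0.01):
    """P: data matrix of shape (n_samples, n_features). Returns one MCFS score per feature."""
    n = P.shape[0]
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):                                     # m-nearest-neighbor graph
        nbrs = np.argsort(d2[i])[1:m + 1]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / eps)            # Equation (12)
    W = np.maximum(W, W.T)                                 # symmetrize the graph
    A = np.diag(W.sum(axis=1))
    L = A - W
    vals, vecs = eigh(L, A)                                # Equation (13)
    V = vecs[:, 1:n_clusters + 1]                          # skip the trivial eigenvector
    scores = np.zeros(P.shape[1])
    for k in range(V.shape[1]):                            # Equation (14), per eigenvector
        alpha = Lasso(alpha=gamma, max_iter=5000).fit(P, V[:, k]).coef_
        scores = np.maximum(scores, np.abs(alpha))         # Equation (15)
    return scores

scores = mcfs_scores(np.random.randn(60, 16))
print(np.argsort(scores)[::-1][:5])    # indices of the five highest-scoring features
```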

Fig. 6 shows the features selected by MCFS w.r.t. the mouth motion reflections in Fig. 1, which differ for each pronunciation.

C. Classification

For a specific individual, the speed and rhythm of speaking each word share similar patterns. We can thus directly compare the similarity of the current signals and previously sampled ones by generalized least squares [16].

For scenarios where the user speaks at different speeds in a specific place, we can use dynamic time warping (DTW) [46] to classify the same word spoken at different speeds into the same group. DTW overcomes local or global shifts of a time series in the time domain and calculates an intuitive distance between two time-series waveforms. For more information, we refer the reader to [41], which describes DTW in detail.


Fig. 8. Floor plan of the testing environment.

Further, for people who share similar speaking patterns, we can also use DTW to enable word recognition with only one training individual.
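For completeness, a textbook dynamic-programming DTW distance between two 1-D profiles (a simple reference implementation, not the optimized variant discussed in [41]):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(len(x)*len(y)) dynamic time warping distance between 1-D series."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The same word spoken more slowly still yields a small distance.
fast = np.sin(np.linspace(0, 2 * np.pi, 50))
slow = np.sin(np.linspace(0, 2 * np.pi, 80))
print(dtw_distance(fast, slow))
```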

For other unknown scenarios (e.g., different environments), the fine-grained analysis of the wavelet transform means that any small change in the environment leads to very poor performance for a single classifier [45]. Therefore, instead of using a single classifier, we explore a more advanced machine learning scheme: Spectral Regression Discriminant Analysis (SRDA) [17]. SRDA is based on the popular Linear Discriminant Analysis (LDA) yet mitigates its computational redundancy. We use this scheme to classify the test signals in order to recognize and match them to the corresponding mouth motions.

D. Context-based Error Correction

So far we have only explored direct word recognition from mouth motions. However, since the spoken pronunciations are correlated, we can leverage context-aware approaches widely used in automatic speech recognition [13] to improve recognition accuracy. As a toy example, when WiHear detects “you” and “ride”, if the next word is “horse”, WiHear can automatically distinguish and recognize “horse” instead of “house”. Thus we can easily reduce the mistakes in recognizing words with similar mouth motion patterns, and further improve recognition accuracy. Therefore, after applying machine learning to classify signal reflections and map them to their corresponding mouth motions, we use context-based error correction to further enhance our lip reading recognition.
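As a toy illustration of this idea (ours, not the actual language model used), a simple bigram score over the candidate words returned by the classifier is enough to prefer “horse” over “house” after “ride”; the counts below are invented for the example.

```python
# Toy bigram counts; in practice these would come from a language model or corpus.
BIGRAM = {("ride", "horse"): 120, ("ride", "house"): 2,
          ("you", "ride"): 80, ("open", "door"): 150}

def correct(prev_word, candidates):
    """Pick the candidate with the highest bigram count given the previous word."""
    return max(candidates, key=lambda w: BIGRAM.get((prev_word, w), 0))

print(correct("ride", ["house", "horse"]))   # -> 'horse'
```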

VII. EXTENDING TO MULTIPLE TARGETS

In a conversation, it is common that only one person is talking at a time. Therefore it seems sufficient for WiHear to track one individual at a time. To support debates and discussions, however, WiHear needs to be extended to track multiple talks simultaneously.

A natural approach is to leverage MIMO techniques. As shown in previous work [38], we can use spatial diversity to recognize multiple talks (often from different directions) at a receiver with multiple antennas.

Fig. 9. Experimental scenario layouts: (a) line-of-sight; (b) non-line-of-sight; (c) through-wall, Tx side; (d) through-wall, Rx side; (e) multiple Rx; (f) multiple link pairs.

Here we also assume that people stay still while talking. To simultaneously track multiple users, we first let each of them perform a unique predefined gesture (e.g., Person A repeatedly speaks [æ], Person B repeatedly speaks [h], etc.). Then we locate the radio beams on them. The detailed beam locating process is described in Section V-A. After locating, WiHear’s multi-antenna receiver can detect their talks simultaneously by leveraging the spatial diversity of the MIMO system.

However, due to the additional power consumption of multiple RF chains [33] and the physical size of multiple antennas, we explore an alternative approach called ZigZag cancelation to support multiple talks with only one receiving antenna. The key insight is that, in most circumstances, multiple people do not begin pronouncing each word at exactly the same time. After we recognize the beginning of the first word of one user, we can predict the word he is likely saying. When the second person starts his first word in the middle of the first person speaking his first word, we rely on the earlier part of the first person’s word to predict its remaining part, cancel that remaining part from the received signal, and recognize the second person’s speech. We repeat this process back and forth. Thus we can achieve multiple hearing without deploying additional devices.

Fig. 7(a) and Fig. 7(b) depict the speech of two users, respectively. After segmentation and classification, each word is encompassed in a dashed red box. As shown, the three words from user 1 have different starting and ending times compared with those of user 2. Taking the first word of the two users as an example, we first recognize the beginning part of user 1 speaking word 1, and then use the predicted ending part of user 1’s word 1 to cancel it from the combined signals of user 1’s and user 2’s first words. Thus we use one antenna to simultaneously decode two users’ words.
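The cancelation idea can be sketched as follows, under simplifying assumptions of our own (the remainder of user 1's word is perfectly predicted by a trained template and perfectly aligned): subtract the template from the mixed signal over the overlap so that user 2's word is exposed.

```python
import numpy as np

def zigzag_cancel(mixed, template1, start1):
    """Subtract the predicted remainder of user 1's word (trained template,
    starting at sample index start1 of the mixed signal) to expose user 2."""
    residual = mixed.copy()
    end1 = min(start1 + len(template1), len(mixed))
    residual[start1:end1] -= template1[:end1 - start1]
    return residual

# Two overlapping "words": user 2 starts halfway through user 1's word.
t = np.linspace(0, 1, 200)
word1 = np.sin(2 * np.pi * 3 * t)
word2 = np.sign(np.sin(2 * np.pi * 2 * t))
mixed = np.zeros(300)
mixed[:200] += word1
mixed[100:300] += word2
recovered2 = zigzag_cancel(mixed, word1, 0)[100:300]   # approximately word2
```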

VIII. IMPLEMENTATION AND EVALUATION

We implement WiHear on both commercial Wi-Fi infrastructure and USRP N210 [8], and evaluate its performance in typical indoor scenarios.


Fig. 10. The commercial hardware testbed: the receiver uses a 3-antenna Intel 5300 NIC; the transmitter is a TP-Link N750 (TL-WDR4300) router with a TL-ANT2406A directional antenna.

A. Hardware Testbed

We use a TP-LINK N750 series TL-WDR4300 wireless router as the transmitter, and a desktop with a 3.20 GHz Intel(R) Pentium 4 CPU and 2 GB RAM equipped with an Intel 5300 NIC (Network Interface Controller) as the receiver. As shown in Fig. 10, the transmitter is fitted with TL-ANT2406A directional antennas [4] (beam width: horizontal 120°, vertical 90°) and operates in IEEE 802.11n AP mode in the 2.4 GHz band. The receiver has 3 working antennas, and its firmware is modified as in [29] to report the original CSI to upper layers.

During the measurement campaign, the receiver continuously pings the AP at a rate of 100 packets per second, and we collect CSIs for 1 minute during each measurement. The collected CSIs are then stored and processed at the receiver.

For the USRP implementation, we use the GNURadio software platform [5] and implement WiHear as a 2 × 2 MU-MIMO system with 4 USRP N210 [8] boards and XCVR2450 daughterboards, which operate in the 2.4 GHz range. We use the IEEE 802.11 OFDM standard [9], which has 64 subcarriers (48 for data). We connect the USRP N210 nodes via Gigabit Ethernet to our laboratory PCs, which are all equipped with a quad-core 3.2 GHz processor and 3.3 GB memory, and run Ubuntu 10.04 with the GNURadio software platform [5]. Since USRP N210 boards cannot support multiple daughterboards, we combine two USRP N210 nodes with an external clock [7] to build a two-antenna MIMO node. We use the other two USRP N210 nodes as clients.

B. Experimental Scenarios

We conduct the measurement campaign in a typical office environment and run our experiments with 4 people (1 female and 3 males). We conduct measurements in a relatively open lobby area covering 9 m × 16 m, as shown in Fig. 8. During our experiments, we always keep the distance between the radio and the user within roughly 2 m. To evaluate WiHear’s ability to achieve LOS, NLOS and through-wall speech recognition, we extensively evaluate WiHear’s performance in the following 6 scenarios (shown in Fig. 9).

• Line-of-sight. The target person is in the line-of-sight range between the transmitter and the receiver.

• Non-line-of-sight. The target person is not at a line-of-sight position, but is within the radio range between the transmitter and the receiver.

• Through-wall, Tx side. The receiver and the transmitter are separated by a wall (roughly 6 inches). The target person is on the same side as the transmitter.

• Through-wall, Rx side. The receiver and the transmitter are separated by a wall (roughly 6 inches). The target person is on the same side as the receiver.

• Multiple Rx. One transmitter and multiple receivers are on the same side of a wall. The target person is within the range of these devices.

• Multiple link pairs. Multiple link pairs work simultaneously on multiple individuals.

Due to the high detection complexity of analyzing mouth motions, for practical reasons the following experiments are per-person trained and tested. Further, we tested two different types of directional antennas, namely the TL-ANT2406A and the TENDA-D2407. With roughly the same locations of users and link pairs, we found that WiHear does not need training per commercial Wi-Fi device. However, for devices that differ substantially, such as USRPs and commercial Wi-Fi devices, we recommend per-device training and testing.

C. Lip Reading Vocabulary

As previously mentioned, lip reading can only recognize a subset of the vocabulary [24]. WiHear can correctly classify and recognize the following syllables (vowels and consonants) and words.

Syllables: [æ], [e], [i], [u], [s], [l], [m], [h], [v], [O], [w], [b], [j], [S].

Words: see, good, how, are, you, fine, look, open, is, the, door, thank, boy, any, show, dog, bird, cat, zoo, yes, meet, some, watch, horse, sing, play, dance, lady, ride, today, like, he, she.

We note that it is unlikely that arbitrary words or syllables can be recognized by WiHear. However, we believe the vocabulary of the above words and syllables is sufficient for simple commands and conversations. To further improve the recognition accuracy and extend the vocabulary, one can leverage techniques like Hidden Markov Models and Linear Predictive Coding [16], which is beyond the scope of this paper.

D. Automatic Segmentation Accuracy

We mainly focus on two aspects of segmentation accuracy in the LOS and NLOS scenarios of Fig. 9(a) and Fig. 9(b): inter-word and inner-word. Our tests consist of speaking sentences with the number of words varying from 3 to 6. For inner-word segmentation, due to its higher complexity, we speak 4-9 syllables in one sentence. We test on both USRP N210 and commercial Wi-Fi devices. Based on our experimental results, we found that the LOS (i.e., Fig. 9(a)) and NLOS (i.e., Fig. 9(b)) scenarios achieve similar accuracy. Given this, we average the LOS and NLOS performance as the final results, and the evaluations in the following subsections follow the same rule.


Fig. 11. Automatic segmentation accuracy for (a) inner-word segmentation on commercial devices; (b) inter-word segmentation on commercial devices; (c) inner-word segmentation on USRP; (d) inter-word segmentation on USRP. The correct segmentation rates (diagonals of the confusion matrices) are: (a) 0.91, 0.87, 0.79, 0.74, 0.65, 0.61 for 4-9 syllables; (b) 0.96, 0.88, 0.81, 0.77 for 3-6 words; (c) 0.89, 0.85, 0.75, 0.69, 0.60, 0.53 for 4-9 syllables; (d) 0.91, 0.84, 0.78, 0.73 for 3-6 words.

Fig. 12. Classification performance (accuracy (%) vs. number of words; commercial and USRP N210).

Fig. 13. Training overhead (accuracy (%) vs. quantity of training samples; word-based and syllable-based).

Fig. 11 shows the inner-word and inter-word segmentation accuracy. The correct rate of inter-word segmentation is higher than that of inner-word segmentation. The main reason is that for inner-word segmentation, we directly use the waveform of each vowel or consonant to match the test waveform; different segmentations lead to different combinations of vowels and consonants, some of which do not even exist. In contrast, inter-word segmentation is relatively easy since a silent interval separates two adjacent words.
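
To make the role of these silent intervals concrete, the following minimal Python sketch (our own illustration, not WiHear's actual implementation; the window size and threshold are hypothetical) segments a mouth-motion waveform at near-silent gaps using short-time energy:

import numpy as np

def segment_words(waveform, win=50, silence_thresh=0.2, min_len=30):
    # Short-time energy over a sliding window.
    energy = np.convolve(waveform ** 2, np.ones(win) / win, mode="same")
    active = energy > silence_thresh          # True where the mouth is moving
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                          # candidate word onset
        elif not a and start is not None:
            if i - start >= min_len:           # drop spurious short bursts
                segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

# Toy example: two motion bursts separated by a silent gap give two segments.
t = np.linspace(0, 1.5, 1500)
toy = np.concatenate([np.sin(20 * t), np.zeros(1000), np.sin(20 * t)])
print(segment_words(toy))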

When comparing commercial devices and USRPs, we find that the overall segmentation performance of commercial devices is slightly better than that of USRPs. The key reason may be the number of antennas on the receiver: the receiver NIC of the commercial devices has 3 antennas, whereas the MIMO-based USRP N210 receiver has only two receiving antennas. Thus the commercial receiver may have richer information and spatial diversity than the USRP N210 receiver.

E. Classification Accuracy

Fig. 12 depicts the recognition accuracy on both USRP N210s and commercial Wi-Fi infrastructure in LOS (i.e., Fig. 9(a)) and NLOS (i.e., Fig. 9(b)). We again average the LOS and NLOS performance for each kind of device. All correctly segmented words are used for classification. We define correct detection as correctly recognizing the whole sentence, and we do not use context-based error correction here. As shown in Fig. 12, the commercial Wi-Fi infrastructure achieves 91% accuracy on average for sentences of no more than 6 words. In addition, with multiple receivers deployed, WiHear can achieve 91% on average for sentences of fewer than 10 words, which is further discussed in Section 7.8.

Results show that the accuracy of the commercial Wi-Fi infrastructure with directional antennas is much higher than that of the USRP devices. The overall USRP accuracy for 6-word sentences is around 82%. The key reasons are two-fold: 1) the USRP N210 uses omni-directional antennas, which may introduce more irrelevant multipath; 2) the commercial Wi-Fi receiver has one more antenna, which provides one more dimension of spatial diversity.

Since overall commercial Wi-Fi devices perform better than USRP N210, we mainly focus on commercial Wi-Fi devices in the following evaluations.

F. Training Overhead

WiHear requires a training process before recognizing human talks. We evaluate the training process in the LOS and NLOS scenarios of Fig. 9(a) and Fig. 9(b), and then average the performance. Fig. 13 shows the training overhead of WiHear. For each word or syllable, we present the quantity of training samples and the corresponding recognition accuracy. Overall, for the same number of training samples, the word-based scheme achieves higher accuracy than the syllable-based scheme. Given this result, we empirically choose a quantity of training samples between 50 and 100, which provides good recognition accuracy with acceptable training overhead.

However, the training overhead of the word-based scheme is much larger than that of the syllable-based one, because the number of syllables in a language is limited while the number of words is huge.


Fig. 14. With/without context-based error correction (accuracy (%) vs. quantity of words).

Fig. 15. Performance with multiple Rx (accuracy (%) vs. quantity of words; 1-Rx, 2-Rx, 3-Rx).

Fig. 16. Performance of multiple users with multiple link pairs (accuracy (%) vs. quantity of words; 1-Pair, 2-Pair, 3-Pair).

Fig. 17. Performance of zigzag decoding for multiple users (accuracy (%) vs. quantity of words; 1-User, 2-User, 3-User).

Fig. 18. Performance of two through-wall scenarios (accuracy (%) vs. quantity of words; wall on the Tx side and wall on the Rx side).

Fig. 19. Performance of through wall with multiple Rx (accuracy (%) vs. quantity of words; 1-Rx, 2-Rx, 3-Rx).

We should therefore make a trade-off between syllable-based and word-based recognition.

G. Impact of Context-based Error Correction

We evaluate the importance of context-based error correction in the LOS and NLOS scenarios of Fig. 9(a) and Fig. 9(b), and then average the performance. We compare WiHear's recognition accuracy with and without context-based error correction. We divide the quantity of words into 3 groups, namely fewer than 3 words (i.e., <3), 4 to 6 words (i.e., 4-6), and more than 6 but fewer than 10 words (i.e., 6<). By testing different quantities of words in each group, we average the performance as the group's recognition accuracy. The following sections follow the same rule.

As shown in Fig. 14, the performance drops dramatically without context-based error correction. In particular, for sentences of more than 6 words, context-based error correction achieves a 13% performance gain over the uncorrected case. This is because the longer the sentence, the more context information can be exploited for error correction.
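
As an illustration of how sentence context can be exploited, the sketch below re-ranks per-word classifier candidates with a toy bigram model; the bigram scores, weighting factor, and vocabulary pairs are hypothetical, and the paper's actual correction scheme may differ:

from itertools import product

# Hypothetical bigram plausibility scores over the WiHear vocabulary.
BIGRAM = {("how", "are"): 0.9, ("are", "you"): 0.9, ("horse", "are"): 0.05}

def correct(candidates, lam=0.5):
    # candidates: one list of (word, classifier_score) pairs per word slot.
    best, best_score = None, float("-inf")
    for seq in product(*candidates):
        words = [w for w, _ in seq]
        score = sum(s for _, s in seq)
        score += lam * sum(BIGRAM.get((a, b), 0.0)
                           for a, b in zip(words, words[1:]))
        if score > best_score:
            best, best_score = words, score
    return best

# "horse" narrowly beats "how" per word, but the context flips the decision.
print(correct([[("horse", 0.55), ("how", 0.50)], [("are", 0.9)], [("you", 0.9)]]))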

Even with context-based error correction, the detection accuracy still tends to drop for longer sentences. The main problem is segmentation. For the syllable-based technique, it is obviously hard to segment the waveforms. For the word-based technique, even though a short interval often exists between two successive words, the waveform magnitudes during these silent intervals are not strictly zero, so some of them may be regarded as part of the waveforms of adjacent words. This can cause incorrect word segmentation and decrease the detection accuracy. Hence the detection accuracy depends on the number of words; the performance in the following parts suffers from the same issue.

H. Performance with Multiple Receivers

Here we analyze the radiometric impacts of human talks from different perspectives (i.e., scenarios like Fig. 9(e)). Specifically, to enhance recognition accuracy, we collect CSI from different receivers at multiple angles of view.

Based on our experiments, even though each NIC receiver has 3 antennas, the spatial diversity is not significant enough. In other words, the impacts of mouth motions on different links of one NIC are quite similar, possibly because the antennas are placed close to each other. Thus we propose to use multiple receivers for better spatial diversity. As shown in Fig. 20, the same person pronouncing the word "GOOD" has different radiometric impacts on the received signals from different perspectives (at angles of 0°, 90° and 180°).

With WiHear receiving signals from different perspectives, we can build up the Mouth Motion Profile with the additional dimensions contributed by different receivers, which enhances the performance and improves recognition accuracy. As depicted in Fig. 15, with multi-dimensional (3 in our case) training data, WiHear achieves 87% accuracy even when the user speaks more than 6 words, which keeps the overall accuracy at 91% across all three word-count groups. Given this, when high Wi-Fi hearing accuracy is required, we recommend deploying more receivers with different views.
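
One plausible way to exploit this multi-receiver diversity, sketched below under our own assumptions (simple feature concatenation and a nearest-neighbor rule rather than the paper's exact classifier), is to fuse the per-receiver mouth motion profiles before classification:

import numpy as np

def fuse(features_per_rx):
    # features_per_rx: list of (n_samples, d_i) arrays, one per receiver.
    return np.hstack(features_per_rx)

def nearest_neighbor(train_X, train_labels, query):
    dists = np.linalg.norm(train_X - query, axis=1)
    return train_labels[int(np.argmin(dists))]

# Toy example: three receivers (0, 90, 180 degrees), 8 features each.
rng = np.random.default_rng(0)
train = fuse([rng.normal(size=(20, 8)) for _ in range(3)])   # 20 training words
labels = ["good" if i % 2 == 0 else "see" for i in range(20)]
query = fuse([rng.normal(size=(1, 8)) for _ in range(3)])[0]
print(nearest_neighbor(train, labels, query))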

I. Performance for Multiple Targets

Here we present WiHear's performance for multiple targets. We use 2 and 3 pairs of transceivers to simultaneously target 2 and 3 individuals, respectively (i.e., scenarios like Fig. 9(f)). As shown in Fig. 16, compared with a single target, the overall performance decreases as the number of targets increases. Further, the performance drops dramatically when each user speaks more than 6 words. However, the overall performance is acceptable.


Fig. 20. Example of different views for pronouncing words: (a) GOOD 0°; (b) GOOD 90°; (c) GOOD 180° (amplitude vs. time in ms).

The highest accuracy for 3 users simultaneously speaking fewer than 3 words is 74%, and even the worst case achieves nearly 60% accuracy with 3 users speaking more than 6 words at the same time.

For ZigZag cancelation decoding, since the NIC [29] has 3 antennas, we enable only one antenna for our measurement. As depicted in Fig. 17, the performance drops more severely than with multiple link pairs. The worst case (i.e., 3 users, more than 6 words) achieves less than 30% recognition accuracy. Thus we recommend using the ZigZag cancelation scheme with no more than 2 users, each speaking fewer than 6 words; otherwise, we increase the number of link pairs to ensure the overall performance.

J. Through Wall Performance

We tested two through-wall scenarios, with the target on the Tx side (Fig. 9(c)) and on the Rx side (Fig. 9(d)). As shown in Fig. 18, although the recognition accuracy is low (around 18% on average), it is acceptable compared with the probability of random guessing (i.e., 1/33 ≈ 3%). Performance with the target on the Tx side is better.

We believe that implementing interference nulling as in [11] could improve the performance; unfortunately, this cannot be achieved with commercial Wi-Fi products. An alternative approach is to leverage spatial diversity with multiple receivers. As shown in Fig. 19, with 2 and 3 receivers we can analyze signals from different perspectives when the target is on the Tx side. In particular, with 3 receivers the maximum accuracy gain is 7%. With training samples from different views, multiple receivers can enhance through-wall performance.

K. Resistance to Environmental Dynamics

We evaluate the influence of other ISM-band interference and irrelevant human movements on the detection accuracy of WiHear. We test these two kinds of interference in both the LOS and NLOS scenarios depicted in Fig. 9(a) and Fig. 9(b). Since the results of the two scenarios are highly similar, we only depict the environmental effects in the NLOS scenario in Fig. 21.

As shown in Fig. 21, one user repeatedly speaks a 4-word sentence. For each of the following 3 cases, we collect the radio sequences of the repeated 4-word sentence 30 times and draw the combined waveform in Fig. 21. For the first case, we keep the surroundings stable.

Fig. 21. Illustration of WiHear's resistance to environmental dynamics: (a) waveform of a 4-word sentence without interference from ISM-band signals or irrelevant human motions; (b) impact of irrelevant human movements; (c) impact of ISM-band interference (amplitude vs. time in s).

With the pre-trained waveform of each word, as shown in Fig. 21(a), we can easily recognize the 4 words the user speaks. For the second case, we let three men stroll randomly in the room while always keeping at least 3 m away from WiHear's link pair. As shown in Fig. 21(b), the words can still be correctly detected even though the waveform is looser than that in Fig. 21(a); this looseness may be the effect of irrelevant human motions. For the third case, we use a mobile phone to communicate with an AP (e.g., surfing online) and keep both 3 m away from WiHear's link pair. As shown in Fig. 21(c), the generated waveform fluctuates slightly compared with that in Fig. 21(a); this fluctuation may be the effect of ISM-band interference.


Based on the above results, we conclude that WiHear is resistant to ISM-band interference and irrelevant human motions 3 m or more away, without significant degradation of recognition performance. For interference within 3 m of the transceivers, the interference sometimes dominates the WiHear signal fluctuation; the performance is then unacceptable (usually around 5% accuracy), and we leave this case as future work.

IX. DISCUSSION

So far we have assumed that people do not move while they speak. It is possible that a person talks while walking. We believe the combination of device-free localization techniques [50] and WiHear would enable real-time tracking and continuous hearing. We leave this as a future direction.

Generally, people share similar mouth movements when pronouncing the same syllables or words. Given this, we may achieve Wi-Fi hearing via DTW (details in Section 5.3) with training data from one person and testing on another individual. We leave this as part of future work.
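
For reference, the sketch below computes a minimal DTW distance between two 1-D waveforms using the textbook dynamic-programming recursion; Section 5.3 may rely on an accelerated variant such as FastDTW [41], so this is only an illustration:

import numpy as np

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic time warping on 1-D sequences.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A template and a time-stretched copy of it still match closely.
template = np.sin(2 * np.pi * 3 * np.linspace(0, 1, 100))
stretched = np.sin(2 * np.pi * 3 * np.linspace(0, 1, 140))
print(dtw_distance(template, stretched))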

The longer the distance between the target person and the directional antenna, the larger the noise and interference. For long-range Wi-Fi hearing, we recommend grid parabolic antennas such as the TL-ANT2424B [6] to accurately target the user for better performance.

To support real-time processing, we use the CSI of only one subchannel to reduce the computational complexity. Since we found that the radiometric impact of mouth motions is similar across subchannels, we may safely select one representative subchannel without sacrificing much performance. However, the full potential of the complete CSI information remains under-explored.
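
A simple way to pick such a representative subchannel, assuming the CSI is available as an amplitude matrix of packets by subcarriers (an assumption about the data layout, not the paper's exact procedure), is to choose the subcarrier whose waveform agrees most with the others:

import numpy as np

def representative_subcarrier(csi_amp):
    # csi_amp: (n_packets, n_subcarriers) array of CSI amplitudes.
    corr = np.corrcoef(csi_amp.T)       # subcarrier-by-subcarrier correlation
    mean_corr = corr.mean(axis=1)       # average agreement with all others
    return int(np.argmax(mean_corr))

# Toy example: 30 subcarriers sharing one motion trend plus noise.
rng = np.random.default_rng(1)
base = np.sin(np.linspace(0, 20, 500))
csi = np.stack([base + 0.1 * rng.normal(size=500) for _ in range(30)], axis=1)
print("selected subcarrier:", representative_subcarrier(csi))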

X. CONCLUDING REMARKS

This paper presents WiHear, a novel system that enables Wi-Fi signals to hear talks. WiHear is compatible with existing Wi-Fi standards and can easily be extended to commercial Wi-Fi products. To achieve lip reading, WiHear introduces a novel scheme for sensing and recognizing micro-motions (e.g., mouth movements). WiHear consists of two key components: the mouth motion profile for extracting features, and learning-based signal analysis for lip reading. Further, the mouth motion profile is the first effort that leverages partial multipath effects to capture the impact of whole mouth motions on radio signals. Extensive experiments demonstrate that, with correct segmentation, WiHear achieves a recognition accuracy of 91% for a single user speaking no more than 6 words and up to 74% for hearing no more than 3 users simultaneously.

WiHear may have many application scenarios. Since Wi-Fi signals do not require LOS, and even though the current experimental results are not yet promising, we believe WiHear has the potential to "hear" people talk through walls and doors within the radio range. In addition, WiHear can "understand" people talking, which can extract more complex information from talks (e.g., mood) than gesture-based interfaces such as the Xbox Kinect [2]. Further, WiHear can also help disabled people issue simple commands to devices with mouth movements

instead of inconvenient body gestures. We can also extend WiHear to motion detection on hands. Since WiHear can easily be extended to commercial products, we envision it as a practical solution for Wi-Fi hearing in real-world deployment.

REFERENCES

[1] Leap Motion, https://www.leapmotion.com/.
[2] Xbox Kinect, http://www.xbox.com/en-US/kinect.
[3] Vicon, http://www.vicon.com.
[4] TP-LINK 2.4GHz 6dBi Indoor Directional Antenna, http://www.tp-link.com/en/products/details/?categoryid=2473&model=TL-ANT2406A#over.
[5] GNU software defined radio, http://www.gnu.org/software/gnuradio.
[6] TP-LINK 2.4GHz 24dBi Grid Parabolic Antenna, http://www.tp-link.com/en/products/details/?categoryid=2474&model=TL-ANT2424B#over.
[7] Oscilloquartz SA, OSA 5200B GPS Clock. http://www.oscilloquartz.com.
[8] Universal Software Radio Peripheral. Ettus Research LLC, http://www.ettus.com.
[9] Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. IEEE Std 802.11, 2012.
[10] Fadel Adib, Zach Kabelac, Dina Katabi, and Robert C. Miller. 3D tracking via body radio reflections. In USENIX NSDI, 2014.
[11] Fadel Adib and Dina Katabi. See through walls with Wi-Fi! In ACM SIGCOMM, 2013.
[12] Sandip Agrawal, Ionut Constandache, Shravan Gaonkar, Romit Roy Choudhury, Kevin Cave, and Frank DeRuyter. Using mobile phones to write in air. In ACM MobiSys, 2011.
[13] Meghdad Aynehband, Amir Masoud Rahmani, and Saeed Setayeshi. COAST: Context-aware pervasive speech recognition system. In IEEE International Symposium on Wireless and Pervasive Computing, 2011.
[14] Dinesh Bharadia, Kiran Raj Joshi, and Sachin Katti. Full duplex backscatter. In ACM HotNets, 2013.
[15] Cheng Bo, Xuesi Jian, Xiang-Yang Li, Xufei Mao, Yu Wang, and Fan Li. You're Driving and Texting: Detecting Drivers Using Personal Smart Phones by Leveraging Inertial Sensors. In ACM MobiCom, 2013.
[16] Jeremy Bradbury. Linear Predictive Coding. McGraw-Hill.
[17] Deng Cai, Xiaofei He, and Jiawei Han. SRDA: An Efficient Algorithm for Large-Scale Discriminant Analysis. IEEE Transactions on Knowledge and Data Engineering, 2008.
[18] Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised Feature Selection for Multi-Cluster Data. In ACM SIGKDD, 2010.
[19] Gregory L. Charvat, Leo C. Kempel, Edward J. Rothwell, Christopher M. Coleman, and Eric L. Mokole. A through-dielectric radar imaging system. IEEE Transactions on Antennas and Propagation, 58(8), 2010.
[20] L. J. Chu. Physical limitations of omni-directional antennas. Journal of Applied Physics, 1948.
[21] Gabe Cohn, Dan Morris, Shwetak N. Patel, and Desney S. Tan. Humantenna: Using the body as an antenna for real-time whole-body interaction. In ACM SIGCHI, 2012.
[22] Martin Cooke, Phil Green, Ljubomir Josifovski, and Ascension Vizinho. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication, 2001.
[23] Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham Mysore, Fredo Durand, and William T. Freeman. The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (Proc. SIGGRAPH), 33(4):79:1–79:10, 2014.
[24] Barbara Dodd and Ruth Campbell. Hearing by Eye: The Psychology of Lip-reading. Lawrence Erlbaum Associates, 1987.
[25] Paul Duchnowski, Martin Hunke, Dietrich Busching, Uwe Meier, and Alex Waibel. Toward movement-invariant automatic lip-reading and speech recognition. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, 1995.
[26] Paul Duchnowski, Uwe Meier, and Alex Waibel. See me, hear me: Integrating automatic speech recognition and lip-reading. In Proc. Int. Conf. Spoken Lang. Process., 1994.
[27] Iffat A. Gheyas and Leslie S. Smith. Feature Subset Selection in Large Dimensionality Domains. Pattern Recognition, 2010.
[28] Daniel Halperin, Wenjun Hu, Anmol Sheth, and David Wetherall. Predictable 802.11 Packet Delivery from Wireless Channel Measurements. In Proceedings of ACM SIGCOMM Conference (SIGCOMM), 2010.
[29] Daniel Halperin, Wenjun Hu, Anmol Sheth, and David Wetherall. Predictable 802.11 packet delivery from wireless channel measurements. In ACM SIGCOMM, 2010.


[30] Chris Harrison, Desney Tan, and Dan Morris. Skinput: Appropriating the body as an input surface. In ACM SIGCHI, 2010.
[31] Yunye Jin, Wee Seng Soh, and Wai Choong Wong. Indoor localization with channel impulse response based fingerprint and nonparametric regression. IEEE Transactions on Wireless Communications, 9(3), 2010.
[32] Holger Junker, Paul Lukowicz, and Gerhard Troster. On the automatic segmentation of speech signals. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 1987.
[33] Bryce Kellogg, Vamsi Talla, and Shyamnath Gollakota. Bringing gesture recognition to all devices. In USENIX NSDI, 2014.
[34] Kshitiz Kumar, Tsuhan Chen, and Richard M. Stern. Profile view lip reading. In Proceedings of the 2007 International Conference on Acoustics, Speech, and Signal Processing, 2007.
[35] Eric Larson, Gabe Cohn, Sidhant Gupta, Xiaofeng Ren, Beverly Harrison, Dieter Fox, and Shwetak N. Patel. HeatWave: Thermal imaging for surface user interaction. In ACM SIGCHI, 2011.
[36] Kate Ching-Ju Lin, Shyamnath Gollakota, and Dina Katabi. Random access heterogeneous MIMO networks. In ACM SIGCOMM, 2011.
[37] Gary C. Martin. Preston Blair phoneme series. http://www.garycmartin.com/mouth_shapes.html.
[38] Qifan Pu, Sidhant Gupta, Shyamnath Gollakota, and Shwetak Patel. Whole-home gesture recognition using wireless signals. In ACM MobiCom, 2013.
[39] Theodore Rappaport. Wireless Communications: Principles and Practice. Prentice Hall PTR, 2nd edition, 2001.
[40] Naoki Saito and Ronald R. Coifman. Local discriminant bases and their applications. Journal of Mathematical Imaging and Vision, 5(4):337–358, 1995.
[41] Stan Salvador and Philip Chan. Toward accurate dynamic time warping in linear time and space. Intell. Data Anal., 11(5), October 2007.
[42] Markus Scholz, Stephan Sigg, Hedda R. Schmidtke, and Michael Beigl. Challenges for device-free radio-based activity recognition. In Workshop on Context Systems, Design, Evaluation and Optimisation, 2011.
[43] Souvik Sen, Jeongkeun Lee, Kyu-Han Kim, and Paul Congdon. Avoiding multipath to revive in-building WiFi localization. In ACM MobiSys, 2013.
[44] Souvik Sen, Bozidar Radunovic, Romit Roy Choudhury, and Tom Minka. You are Facing the Mona Lisa: Spot Localization using PHY Layer Information. In Proceedings of ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2012.
[45] Marina Skurichina and Robert P. W. Duin. Bagging, Boosting and the Random Subspace Method for Linear Classifiers. Pattern Analysis and Applications, 5(2):121–135, 2002.
[46] Jue Wang and Dina Katabi. Dude, where's my card? RFID positioning that works with multipath and non-line of sight. In ACM SIGCOMM, 2013.
[47] J. R. Williams. Guidelines for the use of multimedia in instruction. In Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, 1998.
[48] J. Wilson and N. Patwari. Radio Tomographic Imaging with Wireless Networks. IEEE Transactions on Mobile Computing, 9(5):621–632, 2010.
[49] Jiang Xiao, Kaishun Wu, Youwen Yi, Lu Wang, and L. M. Ni. FIMD: Fine-grained Device-free Motion Detection. In Proceedings of IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2012.
[50] Jiang Xiao, Kaishun Wu, Youwen Yi, Lu Wang, and L. M. Ni. Pilot: Passive Device-free Indoor Localization Using Channel State Information. In IEEE ICDCS, 2013.
[51] Zheng Yang, Zimu Zhou, and Yunhao Liu. From RSSI to CSI: Indoor Localization via Channel Response. ACM Computing Surveys, 46(2):25:1–25:32, 2013.
[52] Moustafa Youssef, Matthew Mah, and Ashok Agrawala. Challenges: Device-free Passive Localization for Wireless Environments. In ACM MobiCom, 2007.
[53] Daqiang Zhang, Jingyu Zhou, Minyi Guo, Jiannong Cao, and Tianbao Li. TASA: Tag-Free Activity Sensing Using RFID Tag Arrays. IEEE Transactions on Parallel and Distributed Systems, 22(4):558–570, 2011.
[54] Junxing Zhang, Mohammad H. Firooz, Neal Patwari, and Sneha K. Kasera. Advancing Wireless Link Signatures for Location Distinction. In Proceedings of ACM International Conference on Mobile Computing and Networking (MobiCom), 2008.
[55] Weile Zhang, Xia Zhou, Lei Yang, Zengbin Zhang, Ben Y. Zhao, and Haitao Zheng. 3D Beamforming for Wireless Data Centers. In ACM HotNets, 2011.
[56] X. Zhou, Z. Zhang, Y. Zhu, Y. Li, S. Kumar, A. Vahdat, B. Zhao, and H. Zheng. Mirror mirror on the ceiling: flexible wireless links for data centers. In ACM SIGCOMM, 2012.

Guanhua Wang is a first-year Computer Science Ph.D. student in the AMPLab at UC Berkeley, advised by Prof. Ion Stoica. He received the M.Phil. degree in Computer Science and Engineering from the Hong Kong University of Science and Technology in 2015, advised by Prof. Lionel M. Ni, and the B.Eng. degree in Computer Science from Southeast University, China, in 2012. His main research interests include big data and networking.

Yongpan Zou received the B.Eng. degree in Chemical Machinery from Xi'an Jiaotong University, Xi'an, China. Since 2013, he has been a Ph.D. student in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology (HKUST). His current research interests mainly include wearable/mobile computing and wireless communication.

Zimu Zhou is currently a Ph.D. candidate in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. He received his B.E. degree in 2011 from the Department of Electronic Engineering of Tsinghua University, Beijing, China. His main research interests include wireless networks and mobile computing. He is a student member of IEEE and ACM.

Kaishun Wu is currently a distinguished professor at Shenzhen University. Previously, he was a research assistant professor in the Fok Ying Tung Graduate School at the Hong Kong University of Science and Technology (HKUST). He received the Ph.D. degree in computer science and engineering from HKUST in 2011. He received the Hong Kong Young Scientist Award in 2012. His research interests include wireless communication, mobile computing, wireless sensor networks and data center networks.

Lionel M. Ni is Chair Professor in the Department of Computer and Information Science and Vice Rector of Academic Affairs at the University of Macau. Previously, he was Chair Professor of Computer Science and Engineering at the Hong Kong University of Science and Technology. He received the Ph.D. degree in electrical and computer engineering from Purdue University in 1980. A fellow of IEEE and the Hong Kong Academy of Engineering Science, Dr. Ni has chaired over 30 professional conferences and has received eight awards for authoring outstanding papers.