RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS€¦ · Subset of TIDigits speech recognition corpus 10 speakers X 11 samples X 2 pronunciations = 220 total samples. SPEECH ANALYSIS USING

GYROPHONERECOGNIZING SPEECH FROM GYROSCOPE

SIGNALSYan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

(1) Stanford University, (2) National Research & Simulation Center, Rafael Ltd.

Presented at Usenix Security '14 and Black Hat Europe '140

ABOUT MESecurity research leader

National Research & Simulation Center

part of Rafael – Advanced Defense Systems Ltd.

High-end research and consulting services

Adjunct researcher and lecturer

The Technion (Israel Institute of Technology)

MICROPHONE ACCESS

REQUIRES PERMISSIONS

GYROSCOPE ACCESS

DOES NOT REQUIRE PERMISSIONS

GYROSCOPE ACCESS BY A BROWSERJAVASCRIPT CODE EXAMPLE

if(window.DeviceOrientationEvent) { window.addEventListener('deviceorientation', function(event) { var Xrotation = event.alpha; var Yrotation = event.beta; var Zrotation = event.gamma; });}

ANY WEBSITE CAN ACCESS THE PHONE'S GYRO

MEMS GYROSCOPESMajor vendors:

STMicroelectronics (Galaxy, iPhones up to 5s)

InvenSense (Google Nexus, iPhone 6)

GYROSCOPES ARE SUSCEPTIBLE TO SOUND

50 HZ TONE POWER SPECTRAL DENSITY

GYROSCOPES ARE SUSCEPTIBLE TO SOUND

50-100 HZ CHIRP POWER SPECTRAL DENSITY(MEASURED USING JAVASCRIPT)

GYROSCOPES ARE (LOUSY, BUT STILL)MICROPHONES

Feature Microphone Gyroscope

Sample Width 16 bits 16 bits

Self Noise ~20dB ~70dB

Sample Freq. 44.1KHz 200Hz

THIS SAMPLING FREQUENCY IS VERY LOW

SPEECH RANGE (FUNDAMENTAL FREQ.)Adult male 85 - 180 Hz

Adult female 165 - 255 Hz

THE EFFECTS OF LOW SAMPLING FREQUENCYSPEECH SAMPLED AT 8,000HZ

0:00

SPEECH SAMPLED AT 200HZ

0:00

EXPERIMENTAL SETUPRoom. Simple speakers.

Smartphone.

Subset of TIDigits speech

recognition corpus

10 speakers X 11 samples X 2

pronunciations = 220 total

samples

SPEECH ANALYSIS USING A SINGLEGYROSCOPE

Gender identification

Speaker identification

Isolated word recognition

RECOGNITION ALGORITHMSSTANDARD FEATURES

Mel-Frequency Cepstral

Coefficients

Spectral centroid

RMS energy

Short-Time Fourier Transform

STANDARD CLASSIFIERSMulti-class SVM

Gaussian Mixture

Model

Dynamic Time Warping

WE CAN SUCCESSFULLY IDENTIFY GENDER

NEXUS 4 84% (DTW)

GALAXY S III 82% (SVM)

Random guess probability is 50%

A GOOD CHANCE TO IDENTIFY THE SPEAKER

Mixed Female/Male 50% (DTW)

Female speakers 45% (DTW)

Male speakers 65% (DTW)Nex

us 4

Random guess probability is 20% for one gender and 10% for amixed set

ISOLATED WORDS RECOGNITIONSPEAKER INDEPENDENTMixed Female/Male 17% (DTW)

Female speakers 26% (DTW)

Male speakers 23% (DTW)Nex

us 4


ISOLATED WORDS RECOGNITIONSPEAKER DEPENDENT

Confusion matrix corresponds to the DTW results


HOW CAN WE LEVERAGE EAVESDROPPINGSIMULTANEOUSLY ON TWO DEVICES?

(Tested for the case of speaker dependent word recognition)

EVALUATION

Single device Two devices

Exhibits improvement over using a single device

Using even more devices might yield even better results

Not a proper non-uniform reconstruction

DEFENSES

DEFENSES

Restrict the sampling frequency even further

Access to high sampling rate should require a special

permission

Low-pass filter the raw samples

Acoustic masking

(won't help against vibration of the surface)

CONCLUSIONThe are many sensors in modern smartphone devices

They can be exploited in various unexpected ways

Giving applications direct access to hardware is dangerous

THANK YOU VERY MUCH

QUESTIONS?CRYPTO.STANFORD.EDU/GYROPHONE

http://crypto.stanford.edu/gyrophone

Documents

RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS€¦ · Subset of TIDigits speech recognition corpus 10 speakers X 11 samples X 2 pronunciations = 220 total samples. SPEECH ANALYSIS USING