62
Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu , Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner ACM UbiComp’13 September 10 th , 2013

Crowd++: Unsupervised Speaker Count with Smartphones

  • Upload
    thetis

  • View
    23

  • Download
    0

Embed Size (px)

DESCRIPTION

Crowd++: Unsupervised Speaker Count with Smartphones. Chenren Xu , Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner. ACM UbiComp’13 September 10 th , 2013 . Scenario 1: Dinner time, where to go?. I will choose this one!. - PowerPoint PPT Presentation

Citation preview

Page 1: Crowd++: Unsupervised Speaker Count with Smartphones

Crowd++: Unsupervised Speaker Count with Smartphones

Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner

ACM UbiComp’13September 10th, 2013

Page 2: Crowd++: Unsupervised Speaker Count with Smartphones

2Chenren Xu [email protected]

Scenario 1: Dinner time, where to go?

I will choose this one!

Page 3: Crowd++: Unsupervised Speaker Count with Smartphones

3Chenren Xu [email protected]

Scenario 2: Is your kid social?

Page 4: Crowd++: Unsupervised Speaker Count with Smartphones

4Chenren Xu [email protected]

Scenario 2: Is your kid social?

Page 5: Crowd++: Unsupervised Speaker Count with Smartphones

5Chenren Xu [email protected]

Scenario 3: Which class is attractive?

She will be my choice!

Page 6: Crowd++: Unsupervised Speaker Count with Smartphones

6Chenren Xu [email protected]

Solutions?

Speaker count! Dinner time, where to go?

Find the place where has most people talking!

Is your kid social? Find how many (different) people they talked with!

Which class is more attractive? Check how many students ask you questions!

Microphone + microcomputer

Page 7: Crowd++: Unsupervised Speaker Count with Smartphones

7Chenren Xu [email protected]

The era of ubiquitous listening

GlassNecklace

Watch

Phone

Brace

Page 8: Crowd++: Unsupervised Speaker Count with Smartphones

8Chenren Xu [email protected]

What we already have

Next thing: Voice-centric sensing

Page 9: Crowd++: Unsupervised Speaker Count with Smartphones

9Chenren Xu [email protected]

Voice-centric sensing

Stressful

BobFamily life Alice

Speech recognition

Emotion detection

2

3

Speakeridentification

Speaker count

Page 10: Crowd++: Unsupervised Speaker Count with Smartphones

10Chenren Xu [email protected]

How to count?

ChallengeNo prior knowledge of speakers

Background noise

Speech overlap

Energy efficiency

Privacy concern

Page 11: Crowd++: Unsupervised Speaker Count with Smartphones

11Chenren Xu [email protected]

How to count?

Challenge SolutionNo prior knowledge of speakers Unique features extraction

Background noise Frequency-based filter

Speech overlap Overlap detection

Energy efficiency Coarse-grained modeling

Privacy concern On-device computation

Page 12: Crowd++: Unsupervised Speaker Count with Smartphones

12Chenren Xu [email protected]

Overview of Crowd++

Page 13: Crowd++: Unsupervised Speaker Count with Smartphones

13Chenren Xu [email protected]

Speech detection

Pitch-based filter Determined by the vibratory frequency of the vocal folds

Human voice statistics: spans from 50 Hz to 450 Hz

f (Hz)50 450

Human voice

Page 14: Crowd++: Unsupervised Speaker Count with Smartphones

14Chenren Xu [email protected]

Speaker features

MFCC Speaker identification/verification

Alice or Bob, or else?

Emotion/stress sensing Happy, or sad, stressful, or fear, or anger?

Speaker counting No prior information

Supervised

Unsupervised

Page 15: Crowd++: Unsupervised Speaker Count with Smartphones

15Chenren Xu [email protected]

Speaker features

Same speaker or not? MFCC + cosine similarity distance metric

θMFCC 1

MFCC 2

We use the angle θ to capture the distance between speech segments.

Page 16: Crowd++: Unsupervised Speaker Count with Smartphones

16Chenren Xu [email protected]

Speaker features

MFCC + cosine similarity distance metric

θs Bob’s MFCC in speech segment 1

Bob’s MFCC in speech segment 2

Alice’s MFCC in speech segment 3

θd

θd > θs

Page 17: Crowd++: Unsupervised Speaker Count with Smartphones

17Chenren Xu [email protected]

Speaker features

MFCC + cosine similarity distance metric

histogram of θs histogram of θd

1 second speech segment

2-second speech segment

3-second speech segment

10-second speech segment

.

.

.

10 seconds is not natural in conversation!

We use 3-second for basic speech unit.

Page 18: Crowd++: Unsupervised Speaker Count with Smartphones

18Chenren Xu [email protected]

Speaker features

MFCC + cosine similarity distance metric

3-second speech segment

3015

Page 19: Crowd++: Unsupervised Speaker Count with Smartphones

19Chenren Xu [email protected]

Speaker features

Pitch + gender statistics

(Hz)

Page 20: Crowd++: Unsupervised Speaker Count with Smartphones

20Chenren Xu [email protected]

Same speaker or not?

Same speaker

Different speakers

Not sure

IF MFCC cosine similarity score < 15

AND

Pitch indicates they are same gender

ELSEIF MFCC cosine similarity score > 30

OR

Pitch indicates they are different genders

ELSE

Page 21: Crowd++: Unsupervised Speaker Count with Smartphones

21Chenren Xu [email protected]

Conversation example

Speaker A

Speaker B

Speaker C

Conversation

Overlap Overlap Silence

Time

Page 22: Crowd++: Unsupervised Speaker Count with Smartphones

22Chenren Xu [email protected]

Phase 1: pre-clustering Merge the speech segments from same speakers

Unsupervised speaker counting

Page 23: Crowd++: Unsupervised Speaker Count with Smartphones

23Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Same speaker

Page 24: Crowd++: Unsupervised Speaker Count with Smartphones

24Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Merge

Page 25: Crowd++: Unsupervised Speaker Count with Smartphones

25Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Keep going …

Page 26: Crowd++: Unsupervised Speaker Count with Smartphones

26Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Same speaker

Page 27: Crowd++: Unsupervised Speaker Count with Smartphones

27Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Merge

Page 28: Crowd++: Unsupervised Speaker Count with Smartphones

28Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Keep going …

Page 29: Crowd++: Unsupervised Speaker Count with Smartphones

29Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Same speaker

Page 30: Crowd++: Unsupervised Speaker Count with Smartphones

30Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Merge

Page 31: Crowd++: Unsupervised Speaker Count with Smartphones

31Chenren Xu [email protected]

Unsupervised speaker counting

Ground truth

Segmentation

Pre-clustering

Merge

Pre-clustering

.

.

.

Page 32: Crowd++: Unsupervised Speaker Count with Smartphones

32Chenren Xu [email protected]

Phase 1: pre-clustering Merge the speech segments from same speakers

Phase 2: counting Only admit new speaker when its speech segment is

different from all the admitted speakers.

Dropping uncertain speech segments.

Unsupervised speaker counting

Page 33: Crowd++: Unsupervised Speaker Count with Smartphones

33Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 0

Page 34: Crowd++: Unsupervised Speaker Count with Smartphones

34Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 1

Admit first speaker

Page 35: Crowd++: Unsupervised Speaker Count with Smartphones

35Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 1

Counting: 1Not sure

Page 36: Crowd++: Unsupervised Speaker Count with Smartphones

36Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 1

Counting: 1Drop it

Page 37: Crowd++: Unsupervised Speaker Count with Smartphones

37Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 1

Counting: 1Different speaker

Page 38: Crowd++: Unsupervised Speaker Count with Smartphones

38Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 1Admit new speaker!

Page 39: Crowd++: Unsupervised Speaker Count with Smartphones

39Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Same speaker

Page 40: Crowd++: Unsupervised Speaker Count with Smartphones

40Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Different speaker

Same speaker

Page 41: Crowd++: Unsupervised Speaker Count with Smartphones

41Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Merge

Page 42: Crowd++: Unsupervised Speaker Count with Smartphones

42Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Not sure

Page 43: Crowd++: Unsupervised Speaker Count with Smartphones

43Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Not sure

Not sure

Page 44: Crowd++: Unsupervised Speaker Count with Smartphones

44Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Drop

Page 45: Crowd++: Unsupervised Speaker Count with Smartphones

45Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Different speaker

Page 46: Crowd++: Unsupervised Speaker Count with Smartphones

46Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 2

Counting: 1

Different speaker

Different speaker

Page 47: Crowd++: Unsupervised Speaker Count with Smartphones

47Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 3

Counting: 1

Admit new speaker!

Page 48: Crowd++: Unsupervised Speaker Count with Smartphones

48Chenren Xu [email protected]

Unsupervised speaker counting

Counting: 2

Counting: 3

Counting: 1

Ground truth

We have three speakers in this conversation!

Page 49: Crowd++: Unsupervised Speaker Count with Smartphones

49Chenren Xu [email protected]

Evaluation metric

Error count distance The difference between the estimated count and the

ground truth.

Page 50: Crowd++: Unsupervised Speaker Count with Smartphones

50Chenren Xu [email protected]

Benchmark results

Page 51: Crowd++: Unsupervised Speaker Count with Smartphones

51Chenren Xu [email protected]

Benchmark results

Phone’s position on the table does not matter much

Page 52: Crowd++: Unsupervised Speaker Count with Smartphones

52Chenren Xu [email protected]

Benchmark results

Phones on the table are better than inside the pockets

Page 53: Crowd++: Unsupervised Speaker Count with Smartphones

53Chenren Xu [email protected]

Benchmark results

Collaboration among multiple phones provide better results

Page 54: Crowd++: Unsupervised Speaker Count with Smartphones

54Chenren Xu [email protected]

Large scale crowdsourcing effort

120 users from university and industry contribute

109 audio clips of 1034 minutes in total.

Private indoor Public indoor Outdoor

Page 55: Crowd++: Unsupervised Speaker Count with Smartphones

55Chenren Xu [email protected]

Large scale crowdsourcing results

Sample number

Error count distance

Private indoor 40 1.07

Public indoor 44 1.35

Outdoor 25 1.83

The error count results in all environments are reasonable.

Page 56: Crowd++: Unsupervised Speaker Count with Smartphones

56Chenren Xu [email protected]

Computational latency

Latency(msec)

HTCEVO 4g

SamsungGalaxy S2

SamsungGalaxy S3

GoogleNexus 4

GoogleNexus 7

MFCC 42.90 36.71 24.41 22.86 23.14

Pitch 102.71 80.36 58.11 47.93 58.33

Count 175.16 150.47 89.01 83.53 70.23

Total 320.77 267.54 171.53 154.32 151.7It takes less than 1 minute to process

a 5-minute conversation.

Page 57: Crowd++: Unsupervised Speaker Count with Smartphones

57Chenren Xu [email protected]

Energy efficiency

Page 58: Crowd++: Unsupervised Speaker Count with Smartphones

58Chenren Xu [email protected]

Use case 1: Crowd estimation

Group 1 Group 2Group 1

Group 2Group 3

Page 59: Crowd++: Unsupervised Speaker Count with Smartphones

59Chenren Xu [email protected]

Use case 2: Social log

Home

Class

Meeting

Lunch

Research Research

Sleeping Research Dinner

Ph.D student life snapshot before UbiComp’13 submission

His paper is accepted!

Page 60: Crowd++: Unsupervised Speaker Count with Smartphones

60Chenren Xu [email protected]

Use case 3: Speaker count patterns

Different talks show very different speaker count patterns as time goes.

Page 61: Crowd++: Unsupervised Speaker Count with Smartphones

61Chenren Xu [email protected]

Conclusion

Smartphones can count the number of speakers

with reasonable accuracies in different

environments.

Crowd++ can enable different social sensing

applications.

Page 62: Crowd++: Unsupervised Speaker Count with Smartphones

62Chenren Xu [email protected]

Thank you

Chenren XuWINLAB/ECE

Rutgers University

Sugang LiWINLAB/ECE

Rutgers University

Gang LiuCRSS

UT Dallas

Yanyong ZhangWINLAB/ECE

Rutgers University

Emiliano MiluzzoAT&T LabsResearch

Yih-Farn ChenAT&T Labs Research

Jun LiInterdigital

Communication

Bernhard FirnerWINLAB/ECE

Rutgers University