Getting Value Out of Chat Data
WHAT TO DO WHEN YOUR DATA IS NOISY, SPARSE, AND SHORT

Daniel Shank, Data Scientist, Talla at MLconf SF 2017


Introduction

Contact: [email protected]


Talla

NLP for internal business use cases

Smart knowledge management

Hiring!


What is “chat data”?

USER2: USER3 do you have new new cal on your Talla account already? Looks like it’s not available for me yet. Would be nice if we could also get inbox support enabled since it’s so much better than gmail. cc USER1
USER3: USER2 I realized that after I typed this that I was using my personal gmail when I updated to the new changes. I looked on Talla and I didn’t see the same option to update to new calendar yet.
USER4: USER2 I just enabled Inbox for our domain
USER4: new calendar is set to letting google decide when to roll it out, but it looks like we can also enable it as an option now
USER4: I've now set that to be available as well. These may take some time to show up
USER1: USER2 its been enabled for awhile.
USER1: (inbox)
USER1: and the new calendar is enabled, soon as google decides you are allowed to have it.
USER2: Thanks USER1 USER4


Things similar to chat data

Sequential interactions

Forum posts

Some email

IT ticketing system interactions

Short text

Associated with a user

Possibly directed at another user

Highly context dependent


Problems with chat

Increasing number of data sources

In theory contains lots of valuable information

In practice data is unlabeled

“Water, water, everywhere, but not a drop to drink.”


Goal: Issue detection and matching

People get help through chat platforms

Extract that data and automate the process

USER1’s interaction should help USER3!

USER1: Hi, does anyone know if we have patriot’s day off?
USER2: Yeah USER1, we do.
USER1: Thanks!
…
USER3: Hey, do we get patriot’s day off?


Automating knowledge delivery

Find issues or questions that people have

Match new issues to pre-existing ones

Serve the appropriate response or answer

Extracting answers is very hard

Focus on matching and search


Overview

Jumpstart ML: Active Learning

Topic modeling

Dimensionality Reduction and Representations


Find questions and analyze

Use patterns to find questions

Has ‘?’ token

Has a question word

Not too hard

Good start for finding past issues
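The two patterns above can be sketched as a small heuristic. This is only an illustration: the word list and the choice to check just the first word of the message are assumptions, not details given in the talk.

```python
import re

# Assumed word list; the slide only says "has a question word".
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "can", "could", "do", "does", "did", "is", "are", "will", "would",
}

def looks_like_question(message: str) -> bool:
    """Return True if the message has a '?' token or opens with a question word."""
    # Tokenize into lowercase words and standalone '?' tokens.
    tokens = re.findall(r"[a-z']+|\?", message.lower())
    if not tokens:
        return False
    return "?" in tokens or tokens[0] in QUESTION_WORDS
```

Checking only the opening word keeps replies such as "Yeah USER1, we do." from being flagged just because they contain "do"; a looser rule would catch more questions at the cost of more false positives.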


Problems with extracted questions

Most questions need context to understand. e.g.:

“What is it?”

“Can I use her personal email?”

Intent varies:

Want information

Do this thing for me

Huh?


Only some questions make sense out of context

“Who is she?” “What is that?” “Will that fix my computer?”

Anaphora—it, that

Pronouns—He, she, etc

“What day is it?”, “Where am I?”

Answer depends on time, person asking

Requires more involved data model


Questions have different intents

“Performative” – Please help me? ex:

hi can you please help me reset my 2 factor authentication on salesforce?

“Informational” – What is it?

what's the pl code?

“Navigational” – How do I do this?

how do i record a vidyo meeting?


Can we write special case rules?

Borderline cases

is there a way to find out the size of an hbase table? – User asks “Is there (a way…)” to get directions

can anyone tell me where i find the out of stock request report? – User asks someone to give them information

Many variants

Alternative is to label data and use supervised learning
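To make the difficulty concrete, here is a minimal rule-based sketch. The three regular expressions are hypothetical rules written to fit the example messages from the intent slide; they are not rules from the talk.

```python
import re

# Hypothetical hand-written rules, one per intent, tried in order.
INTENT_RULES = [
    ("performative", re.compile(r"\b(can|could) you\b|\bplease\b")),
    ("navigational", re.compile(r"^how (do|can) i\b")),
    ("informational", re.compile(r"^what('s| is)\b")),
]

def rule_based_intent(message: str) -> str:
    """Return the first intent whose pattern matches, else 'unknown'."""
    text = message.lower().strip()
    for intent, pattern in INTENT_RULES:
        if pattern.search(text):
            return intent
    return "unknown"

examples = [
    "hi can you please help me reset my 2 factor authentication on salesforce?",
    "what's the pl code?",
    "how do i record a vidyo meeting?",
    "is there a way to find out the size of an hbase table?",
    "can anyone tell me where i find the out of stock request report?",
]
for message in examples:
    print(rule_based_intent(message), "<-", message)
```

The first three messages match a rule, while both borderline cases fall through to "unknown". Each new phrasing would need another special case, which is the argument for labeling data and training a classifier instead.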


We want to label data, but…

Managing crowdworkers:

Expensive

Time consuming

Can’t be used unless data is safely anonymous

Will the model work afterwards?


Active Learning makes labeling more efficient

More value for your time

Can use with crowd workers or without

Good for chat:

Models train fast

Quick to annotate

Supervised learning with little labeled data

[Diagram: a loop connecting Annotate, Train/Predict, and Get data]


How it works (roughly)

Annotate $D_0 \subset D$

Train your model on $D_0$

Predict labels on remaining data ($D \setminus D_0$)

Choose more data, $D_1 \subset D \setminus D_0$

Choice of $D_1$ is based on label predictions

Repeat

???

Profit!
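The loop above can be sketched in plain Python. Everything specific here is an assumption made for illustration: the tiny Naive Bayes classifier, the binary labels, and the use of uncertainty sampling (picking the items whose predicted probability is closest to 0.5) as the rule for choosing the next batch. The talk only says the choice is based on label predictions.

```python
import math
from collections import Counter

def train(labeled):
    """Fit a tiny multinomial Naive Bayes on (text, label) pairs, labels 0 or 1."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in labeled:
        counts[label].update(text.lower().split())
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def prob_positive(model, text):
    """Return P(label = 1 | text) under the model, using add-one smoothing."""
    counts, docs, vocab = model
    log_score = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (sum(docs.values()) + 2))
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (total + len(vocab) + 1))
        log_score[label] = score
    # Turn the two log scores into a probability for class 1.
    return 1.0 / (1.0 + math.exp(log_score[0] - log_score[1]))

def active_learning(pool, annotate, seed_size=2, batch_size=1, rounds=3):
    """Annotate D0, train, predict on the rest, pick the least certain items, repeat."""
    labeled = [(t, annotate(t)) for t in pool[:seed_size]]  # D0
    unlabeled = list(pool[seed_size:])                      # D - D0
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Uncertainty sampling: predictions closest to 0.5 come first.
        unlabeled.sort(key=lambda t: abs(prob_positive(model, t) - 0.5))
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled += [(t, annotate(t)) for t in batch]         # D1, D2, ...
        model = train(labeled)
    return model, labeled, unlabeled
```

In practice `annotate` would be a person (or crowd worker) supplying a label; in a quick experiment it can be a stand-in function such as `lambda text: int("?" in text)`.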


Where we are

Jumpstart ML: Active Learning

Topic modeling

Dimensionality Reduction and Representations


More to data than questions or intent

What do people talk about?

What kind of issues are common?

Are there clear lines defining topics?

Finding problem areas

Strategic thinking about what to tackle


Know Your Data

Read some of it (if you can)

Learn the context

Cluster and overview


Clustering or modeling chat topics

LDA, LSA, NMF, others

Human supervision necessary for interpretation (boo!)

Messages short, so chat is hard

Larger documents have broader topic distributions

We expect messages to be about fewer topics


Using LDA with Chat

| α = 0.5 | α = 0.1 | α = 0.05 | α = 0.03 |
|---|---|---|---|
| know; does; link | database; jermaine; running | file; area; bank | free; jermaine; database |
| did; try; work | online; palace; sorry | mean; try; screen | user; hi; email |
| send; test; agent | try; user; free | did; ok; want | client; server; user |
| look; able; mean | user; client; error | error; server; user | ok; did; update |
| online; help; screen | mean; app; does | whats; agent; end | mean; user; file |
| hi; palace; property | shall; working; process | client; property; user | online; user; change |
| email; error; just | emails; kelly; time | online; user; update | mandy; wrong; chance |
| user; issue; want | did; ok; property | palace; live; test | owner; end; invoice |
| client; need; check | ticket; whats; right | run; right; check | want; error; agent |
| owner; report; password | check; chloe; duncan | emails; know; link | live; palace; try |
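One way to see why a small α suits short messages: assuming α here is the symmetric Dirichlet prior on per-document topic proportions (the conventional meaning in LDA, though the slide does not spell it out), smaller values push each document toward a single dominant topic. The sketch below samples topic mixtures directly from that prior; it does not fit LDA, and the topic count and sample size are arbitrary choices.

```python
import random

def sample_topic_mixture(alpha, num_topics, rng):
    """Draw one document's topic proportions from a symmetric Dirichlet(alpha)."""
    # A Dirichlet sample is a set of independent Gamma draws, normalized to sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_dominant_share(alpha, num_topics=10, num_docs=2000, seed=0):
    """Average weight of the largest topic across sampled documents."""
    rng = random.Random(seed)
    shares = (max(sample_topic_mixture(alpha, num_topics, rng))
              for _ in range(num_docs))
    return sum(shares) / num_docs

for alpha in (0.5, 0.1, 0.05, 0.03):
    share = mean_dominant_share(alpha)
    print(f"alpha={alpha}: mean share of the dominant topic = {share:.2f}")
```

As α shrinks, the dominant topic takes up more of each sampled mixture, which matches the expectation that a single chat message is about fewer topics than a long document.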


Where we are

Jumpstart ML: Active Learning

Topic modeling

Dimensionality Reduction and Representations


Why do dimensionality reduction?

We want to improve our supervised learning techniques

Chat data is even more sparse than many NL datasets

Good representations can help search and similarity models

Off the shelf representations are good

Off the shelf + custom representations are better


Setting up methods for learning

Word2vec, NMF, even LDA

Most methods equivalent*

Chat has no clear document barriers

Methods assume either continuous context or separate documents

Using messages as contexts too sparse


Choosing a context

Representations are influenced by context choice

Figure out your goal

Choose context where words are associated in a way helpful for your goal

For our purposes: Words should be similar if they occur together in issues people have


Using a time-based context window

Window before each question

Problem statement and questions should be related

USER2: Can I email this form, or do I have to print it out?
USER1: You need to drop the form off in person
USER2: OK, sure.
USER1: Great.
USER2: Where can I get access to the printers?
…


Keywords are extracted from recent history

USER2: Can I email this form, or do I have to print it out?
USER1: You need to drop the form off in person
USER2: OK, sure.
USER1: Great.
USER2: Where can I get access to the printers?
…
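A minimal sketch of the time-based window, under several assumptions: every message carries a timestamp, the window is a fixed number of seconds, and keywords are simply the non-stopword tokens. The `Message` type, the stopword list, the 300-second default, and the timestamps attached to the example chat are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"i", "a", "the", "to", "do", "or", "it", "can", "this", "you",
             "in", "off", "out", "have", "need", "where", "get", "ok", "sure"}

@dataclass
class Message:
    timestamp: float  # seconds; assumed to be available for every message
    user: str
    text: str

def context_windows(messages, is_question, window_seconds=300.0):
    """Pair each question with the messages sent shortly before it.

    Assumes `messages` is sorted by timestamp.
    """
    pairs = []
    for i, msg in enumerate(messages):
        if is_question(msg.text):
            earliest = msg.timestamp - window_seconds
            history = [m for m in messages[:i] if m.timestamp >= earliest]
            pairs.append((msg, history))
    return pairs

def keywords(question, history):
    """Collect non-stopword tokens from the question and its recent history."""
    tokens = []
    for m in history + [question]:
        tokens += re.findall(r"[a-z]+", m.text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# The example chat, with made-up timestamps.
chat = [
    Message(0, "USER2", "Can I email this form, or do I have to print it out?"),
    Message(20, "USER1", "You need to drop the form off in person"),
    Message(40, "USER2", "OK, sure."),
    Message(45, "USER1", "Great."),
    Message(200, "USER2", "Where can I get access to the printers?"),
]
pairs = context_windows(chat, lambda text: "?" in text)
```

With these timestamps, the printer question picks up the earlier exchange about the form as its context, so words like "form", "print", and "printers" end up in the same window.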


Similarity from resulting representations

‘printer’

['printer', 'choice', 'fuji', 'xerox', 'settings', 'sequence', 'default', 'rollover', 'driver', 'takes', 'smaller', 'main']

‘issue’

['issue', 'resolved', 'helping', 'experiencing', 'companies', 'related', 'assuming', 'reported', 'double', 'site', 'saw', 'causing', 'understand', 'sorted', 'logging', 'heard']

‘ssh’

['ssh', 'config', 'dhcp', 'ping', 'reconnect', 'jpg', 'webconsole', 'coats', 'lab', 'browsers', 'instances', 'bypass']
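Lists like these come from ranking words by similarity in the learned representation space. The sketch below shows that lookup step only, and it swaps in raw co-occurrence counts for the learned representations (Word2vec, NMF, or LDA) mentioned earlier. The toy contexts are invented for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def cooccurrence_vectors(contexts):
    """Map each word to counts of the other words that share a context with it."""
    vectors = defaultdict(Counter)
    for words in contexts:
        for a, b in permutations(set(words), 2):
            vectors[a][b] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, top_n=3):
    """Return the top_n (word, score) pairs ranked by cosine similarity."""
    target = vectors.get(word, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda s: (-s[1], s[0]))[:top_n]

# Invented keyword contexts, one list per question window.
toy_contexts = [
    ["printer", "driver", "settings"],
    ["printer", "xerox", "driver"],
    ["ssh", "config", "ping"],
    ["ssh", "ping", "reconnect"],
]
vectors = cooccurrence_vectors(toy_contexts)
print(most_similar("printer", vectors))
```

Because the contexts are built around issues, words that appear in the same kinds of problems end up close together, while words from unrelated issues (here, the printer and ssh groups) share no context and score zero.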


Final Thoughts...

Tip of the iceberg

Understand how people interact

What information can we extract?

Can we escape our corpus?


Thank you everyone!

thanks

['heaps', 'great', 'perfect', 'fantastic']