53
AUTORITAS Classification of Spanish by Regions Universitat Politècnica de València, Apr 24th

Smart Listening - MUIinf

Embed Size (px)

Citation preview

Page 1: Smart Listening - MUIinf

AUTORITAS

Classification of Spanish by RegionsUniversitat Politècnica de València, Apr 24th

Page 2: Smart Listening - MUIinf

consulting, s.a.au

torit

as

2

“When you’re able to see the subtle, it’s easy to win”

Sun Tzu

Page 3: Smart Listening - MUIinf

consulting, s.a.au

torit

as

3

Smart Listening

http

://hd

wal

l.co/

face

book

-wal

lpap

er-d

igg/

DATA -> INFORMATION -> KNOWLEDGE -> INTELLIGENCE

Tools and Methods

Human Interpretation

Strategic Application

Page 4: Smart Listening - MUIinf

consulting, s.a.au

torit

as

4

How does Autoritas do it?

Technology

Method

Team

Page 5: Smart Listening - MUIinf

consulting, s.a.au

torit

as

5

DECISION MAKING

STRATEGY

INFORATIONRETRIEVAL

FILTERING, CLEANING AND NOISE DELETION

KNOWLEDGEEXTRACTION

SmartListening

The Smart Cycle

Page 6: Smart Listening - MUIinf

consulting, s.a.au

torit

as

6

THINK, THINK, THINK......since thought precedes

actionDECISION MAKING

STRATEGY

INFORATIONRETRIEVAL

FILTERING, CLEANING AND NOISE DELETION

KNOWLEDGEEXTRACTION

SmartListening

Page 7: Smart Listening - MUIinf

consulting, s.a.au

torit

as

7

Mi Entorno

What do you expect from Internet / Social

Networks for you Business

Objectives?

Page 8: Smart Listening - MUIinf

consulting, s.a.au

torit

as

8

Mi Entorno

Where is yourROI?

Page 9: Smart Listening - MUIinf

consulting, s.a.au

torit

as

9

Catching the bad

Police Investigation Bureau

¡They have it very clear!

Page 10: Smart Listening - MUIinf

consulting, s.a.au

torit

as

10

Demandprediction

Tourist Agency

Page 11: Smart Listening - MUIinf

consulting, s.a.au

torit

as

11

Measuring Reputation in risk management

Insurance Companies

Page 12: Smart Listening - MUIinf

consulting, s.a.au

torit

as

12

Internet as Information source

Marketing Agencies

Page 13: Smart Listening - MUIinf

consulting, s.a.au

torit

as

13

Governments

OpenGovernment

Participation

Collaboration Transparency

Smar

t List

ening

Page 14: Smart Listening - MUIinf

consulting, s.a.au

torit

as

14

To improve customer service

Telcos

Page 15: Smart Listening - MUIinf

consulting, s.a.au

torit

as

15

Social content generation

Audience meassurement

Mass media

Page 16: Smart Listening - MUIinf

consulting, s.a.au

torit

as

16

And yours?Think, think, think...

Page 17: Smart Listening - MUIinf

consulting, s.a.au

torit

as

17

“Truth is out there”X-Files

DECISION MAKING

STRATEGY

INFORATIONRETRIEVAL

FILTERING, CLEANING AND NOISE DELETION

KNOWLEDGEEXTRACTION

SmartListening

Page 18: Smart Listening - MUIinf

consulting, s.a.au

torit

as

18

OBJECTIVE: To retrieve all that we

should retrieve and/but without retrieving nothing

that we should not retrieve

Page 19: Smart Listening - MUIinf

consulting, s.a.au

torit

as

19

Information Sources (channels)

Page 20: Smart Listening - MUIinf

consulting, s.a.au

torit

as

20

Concept-query mapping

IntelligenceConcepts Queries

1 N

Brands

People

Products

Campaigns

Competitors

Referrals

CriticsTopics

Countries

Page 21: Smart Listening - MUIinf

consulting, s.a.au

torit

as

21

1 N

A real example: Spanish public television

5 brands10 presenters10 programas10 campañas

6.856 queries8 channels

Concept-query mapping

IntelligenceConcepts Queries

Page 22: Smart Listening - MUIinf

consulting, s.a.au

torit

as

22

API vs. Crawling

Page 23: Smart Listening - MUIinf

consulting, s.a.au

torit

as

23

information = data - noise

DECISION MAKING

STRATEGY

INFORATIONRETRIEVAL

FILTERING, CLEANING AND NOISE DELETION

KNOWLEDGEEXTRACTION

SmartListening

Page 24: Smart Listening - MUIinf

consulting, s.a.au

torit

as

24

Why is there noise?

Users lie, plagiarize or say foolishness...

Page 25: Smart Listening - MUIinf

consulting, s.a.au

torit

as

25

For example...

findability relevance

Page 26: Smart Listening - MUIinf

consulting, s.a.au

torit

as

26

Advertisements without relevance for the contents

Latest news section that distorts the

semantics of the page

Usable contents

Usable content retrieval

Page 27: Smart Listening - MUIinf

consulting, s.a.au

torit

as

27

ORDEN

If the url includes the date, it’s easy to know it

That’s relative. Is this url from july or

january:http://xxx/07/01/2010/

crawler-403-forbidden.html

The importance (and difficulty) of obtaining the right date

Page 28: Smart Listening - MUIinf

consulting, s.a.au

torit

as

28

Englishestoy sin internet ¬¬ fuuuuck!!!

Finnish... euskocaja, como euskolabel, euskotren, euskomueble... XDDD

PortugueseFlowah Powah!

GermanVierrrrrrrrrrrrnes, egunon!!

Language Models vs. n-Gramms vs. Machine Learning

Language filtering

Page 29: Smart Listening - MUIinf

consulting, s.a.au

torit

as

29

Source geography vs. contents geography vs. profile geography

Geography filtering

Page 30: Smart Listening - MUIinf

consulting, s.a.au

torit

as

30

resultados

elimina url prescindibles

filtra palabras

marca url’s como SpamElimina url’s

Quita de la ‘vista’ los antitesauros

filtra #hastags

filtra influenciadores

filtra localizaciones

“Limpia, fija y da explendor”Lema de la RAE

80%

of

the

wor

kloa

d

Page 31: Smart Listening - MUIinf

consulting, s.a.au

torit

as

31

I believe I solved the equation:

Page 32: Smart Listening - MUIinf

consulting, s.a.au

torit

as

32

Information Retrieval Evaluation...

...in Science

- Do you want that all retrieved content is good? -> Precision

- Don’t you want to leave anything without retrieving? -> Recall

Page 33: Smart Listening - MUIinf

consulting, s.a.au

torit

as

33

7.000 retrieved 54 wrong

99.23% precision

3.000 retrieved50 not retrieved

98.36% recall

Information Retrieval Evaluation...

...in the Industry

We have a lack of credibility!!

Page 34: Smart Listening - MUIinf

consulting, s.a.au

torit

as

34

Example

QUERIES: Mallorca, Menorca, Ibiza, Formentera, Baleares, ...

METHOD: 1.200 random tweets / 3 annotators

30%

70%

LabeledNot labeled

70

80

90

100

Profile Detector Combination

94%

82%

85%

79%

87%85%

PrecisionRecall

Touristic project: 4.575.096 tweets

1.372.528

3.202.567

672.539 wrongly retrieved204.419 good but not retrieved1.372.528 that we don’t know...

Page 35: Smart Listening - MUIinf

consulting, s.a.au

torit

as

354

Bring order to the information......to analyse it and

get value

DECISION MAKING

STRATEGY

INFORATIONRETRIEVAL

FILTERING, CLEANING AND NOISE DELETION

KNOWLEDGEEXTRACTION

SmartListening

Page 36: Smart Listening - MUIinf

consulting, s.a.au

torit

as

36

...to ask questions...

...and to know which new questions to ask

Page 37: Smart Listening - MUIinf

consulting, s.a.au

torit

as

37

ORDENWHAT are people talking about, the consonant, the programming language or the galician telecommunication company?

Semant

ic ambigu

ity in m

illions o

f docum

ents!!

Page 38: Smart Listening - MUIinf

consulting, s.a.au

torit

as

38

WHEN? -> Crisis management

When are things happening?

In real

time!!

And let

remem

ber the

issue

of date

s ident

ification

...

Page 39: Smart Listening - MUIinf

consulting, s.a.au

torit

as

39

WHERE/FROM WHERE is people talking about?

Aprox. 2% of

geotag

ged con

tents!!

Page 40: Smart Listening - MUIinf

consulting, s.a.au

torit

as

40

ORDENHOW? -> Not only sentiment analysis

Polarity is onlye one dimension:- Emotions- Motivations- Values- SWOTAll of them answer the question “how?”

Page 41: Smart Listening - MUIinf

consulting, s.a.au

torit

as

41

ORDEN

An example: “The risk premium in Spain is 235”Positive, negative, neutral o none?

Page 42: Smart Listening - MUIinf

consulting, s.a.au

torit

as

An example: “The risk premium in Spain is 235”Positive, negative, neutral o none?

42

ORDEN

My question: For whom?- For the president of the country?- For the leader of the opposition?- For the director of the Spanish Bank?- For the foreigner investor?- For the national capitalist?- For whom has a morgage?

Subject

ivity in

transmitte

r,

and in

receiver

!!

Page 43: Smart Listening - MUIinf

consulting, s.a.au

torit

as

43

ORDENWHO? -> Social Network Analysis

If I want to transmit a successful message, who can help me?

If there is a crisis, who do I have to keep an eye on?

Page 44: Smart Listening - MUIinf

consulting, s.a.au

torit

as

44

ORDENWHY -> Author Profiling

EMOTIONAL PROFILE

GENDERAGE GROUP

NATIVE LANGUAGE

... political ideology, religious beliefs, and much more!

Page 45: Smart Listening - MUIinf

consulting, s.a.au

torit

as

45

Big Data is both the solution and the problem......but mainly the opportunity

80~120 tw/sec

=4.800~7.200tw/min

=288.000~432.000 tw/hour

=6.912.000~10.368.000 tw/day

‣Retrieval & Storing‣Evolution‣Words & Topics‣Tagging‣Hashtags‣People‣Locations‣Brands‣Sentiment & Emotion‣Users & Relationships‣Influence‣Gender & Age‣Language Variety‣...

Page 46: Smart Listening - MUIinf

consulting, s.a.au

torit

as

46

Precision vs. computational cost

0

375000

750000

1125000

1500000

Approach 1 Approach 4 Approach 7 Approach 10 Approach 13 Approach 16 Approach 19

10.2

6 m

inut

es28

.83

min

utes

38.3

1 m

inut

es54

.03

min

utes

1.04

hou

rs1.

09 h

ours

2.66

hou

rs3.

25 h

ours

4.66

hou

rs4.

86 h

ours

5.08

hou

rs5.

13 h

ours

6.37

hou

rs6.

56 h

ours

6.82

hou

rs17

.88

hour

s4.

44 d

ays

5.19

day

s6.

68 d

ays

9.90

day

s11

.78

days

0

10

20

30

40

Approach 1 Approach 4 Approach 7 Approach 10 Approach 13 Approach 16 Approach 19

29.06

34.72

39.86

24.20

28.34 27.98

35.06

25.1427.03

38.58

31.25 32.34

7.87

24.6522.87

27.03

32.22

24.67

33.04

15.40

19.82

%

AGE AND GENDER IDENTIFICATION ~240K AUTHORS

Page 47: Smart Listening - MUIinf

consulting, s.a.au

torit

as

47

Classification of Spanish by Regions

Autoritas Case Study

Page 48: Smart Listening - MUIinf

consulting, s.a.au

torit

as

48

On the Internet there are no boundaries...

...except the language

Page 49: Smart Listening - MUIinf

consulting, s.a.au

torit

as

49

Show me how you talk...

...and I tell you where you are

HispaBlogsTRAINING TEST

450 200x 5 varietiesx 5 varieties

https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs

Page 50: Smart Listening - MUIinf

consulting, s.a.au

torit

as

50

State-of-the-artAutomatic Identification of Language Varieties: The Case of Portuguese.

Zampieri, M., Gebrekidan-Gebre, B.In Proceedings of the Conference on Natural Language Processing 2012

Corpus (1000 documents from newsletters): 2 regional variationsFeatures: word and character n-gramsML Algorithm: Language probability distributions with log-likelihood function for probability estimationEvaluation method: 50/50 splitAccuracy:

Word uni-grams: 99.6%Word bi-grams: 91.2%Character 4-grams: 99.8%

Automatic Identification of Arabic Language Varieties and Dialects in Social Media.Sadat, F., Kazemi, F., Farzindar, A.

In Proceeding of the 1st. International Workshop on Social Media Retrieval and Analysis SoMeRa 2014

Corpus (blogs and forum documents): 6 regional variationsFeatures: character n-gramsML Algorithm: Markov language model vs. Naïve BayesEvaluation method: 50/50 splitAccuracy: 98% (78% F-measure)

Page 51: Smart Listening - MUIinf

consulting, s.a.au

torit

as

51

Our approach: compared results

0

20

40

60

80

100

char n-grams

word n-gr

ams

TF/IDF n-grams

Probabilist

ic FW

Skip-grams

72.20%

58.52%

39.32%

52.74%51.51%

77.70%

91.69%

79.26%

70.15%69.22%

10-fold cross-validation Supplied test set

Page 52: Smart Listening - MUIinf

consulting, s.a.au

torit

as

52

Your turn...

- Course work

- Master project

- Scholarship, job positions, collaborations

- ...

Page 53: Smart Listening - MUIinf

Argentina - Brasil - Chile - España - México - Panamá - UK

@[email protected]