View
320
Download
0
Embed Size (px)
Citation preview
AUTORITAS
Classification of Spanish by RegionsUniversitat Politècnica de València, Apr 24th
consulting, s.a.au
torit
as
2
“When you’re able to see the subtle, it’s easy to win”
Sun Tzu
consulting, s.a.au
torit
as
3
Smart Listening
http
://hd
wal
l.co/
face
book
-wal
lpap
er-d
igg/
DATA -> INFORMATION -> KNOWLEDGE -> INTELLIGENCE
Tools and Methods
Human Interpretation
Strategic Application
consulting, s.a.au
torit
as
4
How does Autoritas do it?
Technology
Method
Team
consulting, s.a.au
torit
as
5
DECISION MAKING
STRATEGY
INFORATIONRETRIEVAL
FILTERING, CLEANING AND NOISE DELETION
KNOWLEDGEEXTRACTION
SmartListening
The Smart Cycle
consulting, s.a.au
torit
as
6
THINK, THINK, THINK......since thought precedes
actionDECISION MAKING
STRATEGY
INFORATIONRETRIEVAL
FILTERING, CLEANING AND NOISE DELETION
KNOWLEDGEEXTRACTION
SmartListening
consulting, s.a.au
torit
as
7
Mi Entorno
What do you expect from Internet / Social
Networks for you Business
Objectives?
consulting, s.a.au
torit
as
8
Mi Entorno
Where is yourROI?
consulting, s.a.au
torit
as
9
Catching the bad
Police Investigation Bureau
¡They have it very clear!
consulting, s.a.au
torit
as
10
Demandprediction
Tourist Agency
consulting, s.a.au
torit
as
11
Measuring Reputation in risk management
Insurance Companies
consulting, s.a.au
torit
as
12
Internet as Information source
Marketing Agencies
consulting, s.a.au
torit
as
13
Governments
OpenGovernment
Participation
Collaboration Transparency
Smar
t List
ening
consulting, s.a.au
torit
as
14
To improve customer service
Telcos
consulting, s.a.au
torit
as
15
Social content generation
Audience meassurement
Mass media
consulting, s.a.au
torit
as
16
And yours?Think, think, think...
consulting, s.a.au
torit
as
17
“Truth is out there”X-Files
DECISION MAKING
STRATEGY
INFORATIONRETRIEVAL
FILTERING, CLEANING AND NOISE DELETION
KNOWLEDGEEXTRACTION
SmartListening
consulting, s.a.au
torit
as
18
OBJECTIVE: To retrieve all that we
should retrieve and/but without retrieving nothing
that we should not retrieve
consulting, s.a.au
torit
as
19
Information Sources (channels)
consulting, s.a.au
torit
as
20
Concept-query mapping
IntelligenceConcepts Queries
1 N
Brands
People
Products
Campaigns
Competitors
Referrals
CriticsTopics
Countries
consulting, s.a.au
torit
as
21
1 N
A real example: Spanish public television
5 brands10 presenters10 programas10 campañas
6.856 queries8 channels
Concept-query mapping
IntelligenceConcepts Queries
consulting, s.a.au
torit
as
22
API vs. Crawling
consulting, s.a.au
torit
as
23
information = data - noise
DECISION MAKING
STRATEGY
INFORATIONRETRIEVAL
FILTERING, CLEANING AND NOISE DELETION
KNOWLEDGEEXTRACTION
SmartListening
consulting, s.a.au
torit
as
24
Why is there noise?
Users lie, plagiarize or say foolishness...
consulting, s.a.au
torit
as
25
For example...
findability relevance
consulting, s.a.au
torit
as
26
Advertisements without relevance for the contents
Latest news section that distorts the
semantics of the page
Usable contents
Usable content retrieval
consulting, s.a.au
torit
as
27
ORDEN
If the url includes the date, it’s easy to know it
That’s relative. Is this url from july or
january:http://xxx/07/01/2010/
crawler-403-forbidden.html
The importance (and difficulty) of obtaining the right date
consulting, s.a.au
torit
as
28
Englishestoy sin internet ¬¬ fuuuuck!!!
Finnish... euskocaja, como euskolabel, euskotren, euskomueble... XDDD
PortugueseFlowah Powah!
GermanVierrrrrrrrrrrrnes, egunon!!
Language Models vs. n-Gramms vs. Machine Learning
Language filtering
consulting, s.a.au
torit
as
29
Source geography vs. contents geography vs. profile geography
Geography filtering
consulting, s.a.au
torit
as
30
resultados
elimina url prescindibles
filtra palabras
marca url’s como SpamElimina url’s
Quita de la ‘vista’ los antitesauros
filtra #hastags
filtra influenciadores
filtra localizaciones
“Limpia, fija y da explendor”Lema de la RAE
80%
of
the
wor
kloa
d
consulting, s.a.au
torit
as
31
I believe I solved the equation:
consulting, s.a.au
torit
as
32
Information Retrieval Evaluation...
...in Science
- Do you want that all retrieved content is good? -> Precision
- Don’t you want to leave anything without retrieving? -> Recall
consulting, s.a.au
torit
as
33
7.000 retrieved 54 wrong
99.23% precision
3.000 retrieved50 not retrieved
98.36% recall
Information Retrieval Evaluation...
...in the Industry
We have a lack of credibility!!
consulting, s.a.au
torit
as
34
Example
QUERIES: Mallorca, Menorca, Ibiza, Formentera, Baleares, ...
METHOD: 1.200 random tweets / 3 annotators
30%
70%
LabeledNot labeled
70
80
90
100
Profile Detector Combination
94%
82%
85%
79%
87%85%
PrecisionRecall
Touristic project: 4.575.096 tweets
1.372.528
3.202.567
672.539 wrongly retrieved204.419 good but not retrieved1.372.528 that we don’t know...
consulting, s.a.au
torit
as
354
Bring order to the information......to analyse it and
get value
DECISION MAKING
STRATEGY
INFORATIONRETRIEVAL
FILTERING, CLEANING AND NOISE DELETION
KNOWLEDGEEXTRACTION
SmartListening
consulting, s.a.au
torit
as
36
...to ask questions...
...and to know which new questions to ask
consulting, s.a.au
torit
as
37
ORDENWHAT are people talking about, the consonant, the programming language or the galician telecommunication company?
Semant
ic ambigu
ity in m
illions o
f docum
ents!!
consulting, s.a.au
torit
as
38
WHEN? -> Crisis management
When are things happening?
In real
time!!
And let
remem
ber the
issue
of date
s ident
ification
...
consulting, s.a.au
torit
as
39
WHERE/FROM WHERE is people talking about?
Aprox. 2% of
geotag
ged con
tents!!
consulting, s.a.au
torit
as
40
ORDENHOW? -> Not only sentiment analysis
Polarity is onlye one dimension:- Emotions- Motivations- Values- SWOTAll of them answer the question “how?”
consulting, s.a.au
torit
as
41
ORDEN
An example: “The risk premium in Spain is 235”Positive, negative, neutral o none?
consulting, s.a.au
torit
as
An example: “The risk premium in Spain is 235”Positive, negative, neutral o none?
42
ORDEN
My question: For whom?- For the president of the country?- For the leader of the opposition?- For the director of the Spanish Bank?- For the foreigner investor?- For the national capitalist?- For whom has a morgage?
Subject
ivity in
transmitte
r,
and in
receiver
!!
consulting, s.a.au
torit
as
43
ORDENWHO? -> Social Network Analysis
If I want to transmit a successful message, who can help me?
If there is a crisis, who do I have to keep an eye on?
consulting, s.a.au
torit
as
44
ORDENWHY -> Author Profiling
EMOTIONAL PROFILE
GENDERAGE GROUP
NATIVE LANGUAGE
... political ideology, religious beliefs, and much more!
consulting, s.a.au
torit
as
45
Big Data is both the solution and the problem......but mainly the opportunity
80~120 tw/sec
=4.800~7.200tw/min
=288.000~432.000 tw/hour
=6.912.000~10.368.000 tw/day
‣Retrieval & Storing‣Evolution‣Words & Topics‣Tagging‣Hashtags‣People‣Locations‣Brands‣Sentiment & Emotion‣Users & Relationships‣Influence‣Gender & Age‣Language Variety‣...
consulting, s.a.au
torit
as
46
Precision vs. computational cost
0
375000
750000
1125000
1500000
Approach 1 Approach 4 Approach 7 Approach 10 Approach 13 Approach 16 Approach 19
10.2
6 m
inut
es28
.83
min
utes
38.3
1 m
inut
es54
.03
min
utes
1.04
hou
rs1.
09 h
ours
2.66
hou
rs3.
25 h
ours
4.66
hou
rs4.
86 h
ours
5.08
hou
rs5.
13 h
ours
6.37
hou
rs6.
56 h
ours
6.82
hou
rs17
.88
hour
s4.
44 d
ays
5.19
day
s6.
68 d
ays
9.90
day
s11
.78
days
0
10
20
30
40
Approach 1 Approach 4 Approach 7 Approach 10 Approach 13 Approach 16 Approach 19
29.06
34.72
39.86
24.20
28.34 27.98
35.06
25.1427.03
38.58
31.25 32.34
7.87
24.6522.87
27.03
32.22
24.67
33.04
15.40
19.82
%
AGE AND GENDER IDENTIFICATION ~240K AUTHORS
consulting, s.a.au
torit
as
47
Classification of Spanish by Regions
Autoritas Case Study
consulting, s.a.au
torit
as
48
On the Internet there are no boundaries...
...except the language
consulting, s.a.au
torit
as
49
Show me how you talk...
...and I tell you where you are
HispaBlogsTRAINING TEST
450 200x 5 varietiesx 5 varieties
https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
consulting, s.a.au
torit
as
50
State-of-the-artAutomatic Identification of Language Varieties: The Case of Portuguese.
Zampieri, M., Gebrekidan-Gebre, B.In Proceedings of the Conference on Natural Language Processing 2012
Corpus (1000 documents from newsletters): 2 regional variationsFeatures: word and character n-gramsML Algorithm: Language probability distributions with log-likelihood function for probability estimationEvaluation method: 50/50 splitAccuracy:
Word uni-grams: 99.6%Word bi-grams: 91.2%Character 4-grams: 99.8%
Automatic Identification of Arabic Language Varieties and Dialects in Social Media.Sadat, F., Kazemi, F., Farzindar, A.
In Proceeding of the 1st. International Workshop on Social Media Retrieval and Analysis SoMeRa 2014
Corpus (blogs and forum documents): 6 regional variationsFeatures: character n-gramsML Algorithm: Markov language model vs. Naïve BayesEvaluation method: 50/50 splitAccuracy: 98% (78% F-measure)
consulting, s.a.au
torit
as
51
Our approach: compared results
0
20
40
60
80
100
char n-grams
word n-gr
ams
TF/IDF n-grams
Probabilist
ic FW
Skip-grams
72.20%
58.52%
39.32%
52.74%51.51%
77.70%
91.69%
79.26%
70.15%69.22%
10-fold cross-validation Supplied test set
consulting, s.a.au
torit
as
52
Your turn...
- Course work
- Master project
- Scholarship, job positions, collaborations
- ...
Argentina - Brasil - Chile - España - México - Panamá - UK