TEXT MINING NAMES IN ‘BIG DATA’ TO
RECOGNIZE TURKISH MIGRATION TRENDS
NamSor Applied Onomastics
2014-05-30
Names Data Mining is just a Tool
Zeynep Değirmencioğlu
Şükrü Kaya
Şükrü Saracoğlu
Elian Carsenat
Hüseyin Yıldız
Mahmut Yıldırım
Fatih Öztürk
Mehmet Bölükbaşı
Mehmet Yılmaz
Elif Yıldırım
Ahmet Yıldırım
Mustafa Yücedağ
Mustafa Uzunyılmaz
Fatih Kılıç
Fatih Yılmaz
Murat Yıldırım
Hüseyin Kılıç
Oğuzhan Yıldız
Mevlüt Çavuşoğlu
… (Source: Freebase)
What’s in a name? What’s a name?
Elian Carsenat
@ElianCarsenat (Twitter)
tioulpanov (Skype)
NamSor.com
Onomastics = the science of proper names
Onoma != Residence != Nationality
Source: OECD
NamSor sorts names : functions, use cases
Functions:
1. Name Ling. Classification
2. Name Transliteration & Matching
3. Named Entity Extraction, Parsing
Use cases:
Multilingual Text Mining
Control Watch Lists
Social Networks Analytics
Geo demographics
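To illustrate the "Name Transliteration & Matching" function, here is a minimal sketch of accent-insensitive name matching using only the Python standard library. It is not NamSor's method, which the slides do not describe; the helper names `fold_name` and `same_name` and the hand-written mapping for the Turkish dotless ı are assumptions made for this example.

```python
import unicodedata

# Assumption: letters that have no Unicode decomposition are folded by hand.
# The Turkish dotless "ı" is one of them; "İ" is listed for symmetry.
EXTRA_FOLDS = str.maketrans({"ı": "i", "İ": "I"})

def fold_name(name: str) -> str:
    """Return a lowercase, accent-free key for loose name matching."""
    decomposed = unicodedata.normalize("NFKD", name.translate(EXTRA_FOLDS))
    # Drop the combining marks (cedilla, breve, diaeresis, ...) left by NFKD.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and collapse runs of whitespace.
    return " ".join(stripped.lower().split())

def same_name(a: str, b: str) -> bool:
    """True when two spellings fold to the same key."""
    return fold_name(a) == fold_name(b)

print(fold_name("Şükrü Saracoğlu"))
print(same_name("Mahmut Yıldırım", "MAHMUT YILDIRIM"))
```

A key of this kind only handles spelling variants within the Latin alphabet; matching across scripts (for example Arabic to Latin) would need a real transliteration step.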
NamSor supervised learning
FN LN
Mette Andersen
Lene Andersson
Eva Arndt-Riise
Heidi Astrup
Mie Augustesen
Margot Bærentzen
Louise Bager Nørgaard
Marie Bagger Rasmussen
Yutta Barding
Ulla Barding-Poulsen
FN LN
Xian Dongmei
Zheng Dongmei
Jin Dongxiang
Xu Dongxiang
Li Dongxiao
Qin Dongya
Li Dongying
Han Duan
Li Duihong
Jiang Fan
Training set : Athletes
Step 1 – Learn stereotypes
Data set : Inventors
birgitta agerberth
birgitte l. eriksen
bitao gong
bitten thorengaard
biwang jiang
Step 2 – Classify
bitao gong
biwang jiang

birgitta agerberth
birgitte l. eriksen
bitten thorengaard
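The two steps can be sketched with a toy classifier. This is an illustration only: the slides do not say which model NamSor uses, so the character-trigram scoring with linear interpolation smoothing, the class `TrigramNameClassifier`, and the labels "A" and "B" (standing for the two athlete lists in the training set) are all assumptions made here.

```python
import math
from collections import Counter

def trigrams(name):
    """Lowercase the name, pad it with start/end markers, list its character trigrams."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class TrigramNameClassifier:
    """Toy model: per-label trigram frequencies, interpolated with a uniform floor."""

    def __init__(self, floor_weight=0.1):
        self.floor_weight = floor_weight  # weight given to the uniform floor
        self.counts = {}                  # label -> Counter of trigrams
        self.vocab = set()                # every trigram seen in training

    def fit(self, labelled_names):
        # Step 1: learn the "stereotypes" (trigram counts) of each label.
        for name, label in labelled_names:
            grams = trigrams(name)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def score(self, name, label):
        counts = self.counts[label]
        total = sum(counts.values())
        uniform = 1.0 / len(self.vocab)
        lam = self.floor_weight
        # Unseen trigrams get the same small floor under every label.
        return sum(
            math.log((1 - lam) * counts[g] / total + lam * uniform)
            for g in trigrams(name)
        )

    def predict(self, name):
        # Step 2: classify by picking the label with the highest score.
        return max(self.counts, key=lambda label: self.score(name, label))

# Training set: the two athlete lists from the slide (labels are hypothetical).
LIST_A = ["Mette Andersen", "Lene Andersson", "Eva Arndt-Riise", "Heidi Astrup",
          "Mie Augustesen", "Margot Bærentzen", "Louise Bager Nørgaard",
          "Marie Bagger Rasmussen", "Yutta Barding", "Ulla Barding-Poulsen"]
LIST_B = ["Xian Dongmei", "Zheng Dongmei", "Jin Dongxiang", "Xu Dongxiang",
          "Li Dongxiao", "Qin Dongya", "Li Dongying", "Han Duan",
          "Li Duihong", "Jiang Fan"]

training = [(n, "A") for n in LIST_A] + [(n, "B") for n in LIST_B]
clf = TrigramNameClassifier().fit(training)

# Data set: the inventor names from the slide.
inventors = ["birgitta agerberth", "birgitte l. eriksen", "bitao gong",
             "bitten thorengaard", "biwang jiang"]
groups = {name: clf.predict(name) for name in inventors}
for name, label in groups.items():
    print(label, name)
```

With twenty training names this only separates two very different naming traditions; a realistic classifier needs far more data and many more classes.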
Accuracy is measurable: ~80%
The very first backtesting, on the onomastics of 150,000 Olympic Games athletes
Country       TOTAL   PERF
Japan          3794    97%
Mongolia        260    93%
Greece         1576    92%
Lithuania       262    89%
Italy          4150    89%
Poland         2818    88%
South Korea    2180    87%

Top 5 onomastic classes per country (count):
Japan:        Japan 3686, Indonesia 4, Sri Lanka 3, Nigeria 3, Congo (B) 3
Mongolia:     Mongolia 243, Iraq 2, Japan 1, Mali 1, Kazakhstan 1
Greece:       Greece 1444, Italy 14, Georgia 6, Romania 5, Great Britain 5
Lithuania:    Lithuania 234, Namibia 3, Greece 3, Latvia 3, Russia 2
Italy:        Italy 3675, Spain 81, Portugal 80, France 29, Austria 26
Poland:       Poland 2486, Czechoslovakia 46, Czech Republic 38, Slovakia 34, Austria 22
South Korea:  South Korea 1901, North Korea 209, Chinese Taipei 10, Equatorial Guinea 6, China 5
Euro athletes (excl. Anglo & Latin).
Breakdown accuracy 84%
Ex- Yugoslavia athletes
Breakdown accuracy 75%
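The PERF figures for the seven countries listed can be checked from the counts on the slide: each is the number of athletes assigned to their own country's onomastic class divided by that country's total. A short sketch follows; the variable names are mine, and the pooled figure covers only these seven best-performing countries, not the ~80% overall result.

```python
# (total athletes, athletes classified under their own country), from the slide.
backtest = {
    "Japan":       (3794, 3686),
    "Mongolia":    (260, 243),
    "Greece":      (1576, 1444),
    "Lithuania":   (262, 234),
    "Italy":       (4150, 3675),
    "Poland":      (2818, 2486),
    "South Korea": (2180, 1901),
}

# Per-country accuracy = correct / total.
accuracy = {c: correct / total for c, (total, correct) in backtest.items()}
for country, share in accuracy.items():
    print(f"{country:<12} {share:.0%}")

# Pooled accuracy over these seven countries only.
pooled = (sum(correct for _, correct in backtest.values())
          / sum(total for total, _ in backtest.values()))
print(f"pooled       {pooled:.1%}")
```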
Decrypting identity across space/time:
India Geodemographics (1914)
Source: Commonwealth WWI Casualties
Unsupervised learning is
fine-grained: Country/Region,…
Ex. Russian Federation
In progress :
Syrian names (backtesting)
Onoma                  Count
Syria                    201
Saudi Arabia              20
Iraq                       8
Kuwait                     4
United Arab Emirates       3
Egypt                      3
Qatar                      2
Bahrain                    2
Sudan                      2
Lebanon                    2
Algeria                    1
Oman                       1
Grand Total              249
الحريري طاهر
سليمان العيدة عبدالغفار
شحادة عبدالغفار
الأسعد قاسم
حموده مؤمن
الجراد محمد مفلح
الحروب نزار
سليمان العيدة نزار
الحراكي أسامة
الصغير أنس
الهبول خالد
عبد الواحد وفيق
يونس إسراء
نزهة رشا
وهبة محمد زكريا
بركات كمال
اللو محمد عيد
[…]
Syrian names are recognized at ~80%.
The other names may in fact be non-Syrian, or generic to the Arab world.
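The ~80% figure follows directly from the Onoma counts of the backtest: 201 of the 249 names fall in the Syria class. A quick check, with variable names chosen for this example:

```python
# Onoma counts from the Syrian names backtest.
onoma_counts = {
    "Syria": 201, "Saudi Arabia": 20, "Iraq": 8, "Kuwait": 4,
    "United Arab Emirates": 3, "Egypt": 3, "Qatar": 2, "Bahrain": 2,
    "Sudan": 2, "Lebanon": 2, "Algeria": 1, "Oman": 1,
}

grand_total = sum(onoma_counts.values())
syria_share = onoma_counts["Syria"] / grand_total
print(grand_total, f"{syria_share:.1%}")
```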
What can you dig up with this tool?
Mining 5M names to recognize gender, broken down by nationality/likely origin
Mining 1M names to map Diasporas
Source: Twitter
Mining 3M Geo-Tweets
Population flows on Twitter
Source          Target  Type      Id   Onoma          Weight
United Kingdom  France  Directed  16   Great Britain  37
Spain           France  Directed  55   Spain          14
United States   France  Directed  75   Great Britain  12
Turkey          France  Directed  79   Turkey         11
Brazil          France  Directed  87   Portugal       10
United Kingdom  France  Directed  112  Ireland        9
Italy           France  Directed  152  Italy          7
Switzerland     France  Directed  226  France         5
Belgium         France  Directed  247  France         5
United Kingdom  France  Directed  258  France         5
Mexico          France  Directed  287  Spain          4
Ireland         France  Directed  317  Great Britain  4
United Kingdom  France  Directed  333  Italy          4
United States   France  Directed  375  France         4
Source: Twitter
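An edge list like this one is easy to aggregate. The sketch below sums the Weight column by source country and by onomastic class for the flows into France. Reading Weight as a count of observed flows is my assumption; the slide does not define the column.

```python
from collections import Counter

# (source, target, onoma, weight) rows copied from the edge list.
edges = [
    ("United Kingdom", "France", "Great Britain", 37),
    ("Spain", "France", "Spain", 14),
    ("United States", "France", "Great Britain", 12),
    ("Turkey", "France", "Turkey", 11),
    ("Brazil", "France", "Portugal", 10),
    ("United Kingdom", "France", "Ireland", 9),
    ("Italy", "France", "Italy", 7),
    ("Switzerland", "France", "France", 5),
    ("Belgium", "France", "France", 5),
    ("United Kingdom", "France", "France", 5),
    ("Mexico", "France", "Spain", 4),
    ("Ireland", "France", "Great Britain", 4),
    ("United Kingdom", "France", "Italy", 4),
    ("United States", "France", "France", 4),
]

by_source = Counter()
by_onoma = Counter()
for source, target, onoma, weight in edges:
    by_source[source] += weight
    by_onoma[onoma] += weight

print("By source country:", by_source.most_common(3))
print("By onoma:", by_onoma.most_common(3))
```

Grouping by onoma rather than by source separates, for example, flows from the United Kingdom carrying British names from those carrying French, Irish or Italian names.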
Mining 150k names in Patents to see
where the Turkish ‘brain juice’ flows
Mining names : a word of caution
Can ‘Big Data’ answer any question?
Trash in, Gold out ? Yes, to some extent
Beware of biases induced by the data source itself
Data access limitations / privacy issues
Open Data vs. Free APIs vs. Commercial Databases
Still, tools make the impossible possible
Originating FDI leads
NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.
What is the idea behind it: “As recently as 1986 Ireland was one of the poorest countries in the European
Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct
Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have
attracted huge amounts of money from America – due largely to a century of personal and familial ties –
and they have used this money to build factories”.
A successful approach which Milda Darguzaite, the Managing Director of Invest Lithuania, considers relevant
for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian
origin living abroad, there are a good many personal and familial ties to be leveraged to attract new
investment projects to the country. NamSor name recognition software helped discover those ties.
Recognizing names and their origin in global professional databases allows Investment Promotion Agencies
to identify potentially interesting high profile contacts in different countries / industrial sectors and reach out
to them. Another method to accelerate the origination of new leads is to better understand and leverage
the existing network of foreign businessmen in the country itself.
NamSor™ filters data from millions of meaningless elements to a few dozen actionable names.
Domas Girtavicius, a Senior consultant at Invest Lithuania, said "we were impressed by the accuracy of the
name recognition software: it reliably predicts the country of origin and the number of false positives is fully
manageable". Elian Carsenat, the founder of NamSor™, said "searching for names in the Big Data is like
seeking a gold needle in a haystack: doable once the right tool exists".
Conclusions
We recognize names in any language, any place, any database; we can classify and we can sort
An onomastic class is not a ‘hard fact’ like a place of birth or a nationality, but it is accurate and fine-grained
As a statistics tool, it might be debatable. But as a data-mining tool, it’s sharp, simple and efficient: it can help find research directions and discover trends
We see use cases in Migration research; Education & Skills; Labour & Social Affairs; Territorial Development/FDI; Science & Innovation
Thank you!
http://fdimagnet.com/
http://namsor.com/
July 2013, Embassy of Lithuania in Paris
+33 6 52 77 99 07
Twitter @NamsSor_com