Spatio-temporal linkage of real and virtual identity

DESCRIPTION

This presentation outlines initial work on linking identities in the real and virtual worlds.

Muhammad Adnan (and Paul Longley), University College London

Geodemographics

• “Analysis of people by where they live [places]” (Sleight, 1993:3)

• Social similarity, not locational proximity

[Diagram labels: Home, Address, Person, Area]

Identity of individuals in the real world

• Name (Forename & Surname)

• Surnames have geographic concentrations

• Prospects for linkage with socio-economic data

• E.g. Analysing the socio-economic circumstances of different ethnic groups

An example – gbnames.publicprofiler.org

[Figure panels: Longley, Cheshire]

An example – Output Area Classification

[Figure panels: Kingston upon Hull, Hereford]

A socio-economic and ethnic classification

Wu

Source: Cheshire and Longley (2011)

Courtesy: James Cheshire

Wordle.net

The European scale

16 countries.

400 million people.

5.95 million unique surnames

Courtesy: James Cheshire

Onomap classification

[Diagram: forename and surname lists drawn from the UK Electoral Roll. Example name: Pablo Mateos]

Surnames: Garcia, Pérez, Sánchez, Rodríguez, ...
Forenames: Juan, Rosa, Marta, ...

– Several iterations until self-contained cluster is exhausted
– Cluster assigned a cultural, ethnic & linguistic Onomap type
– Probability of ethnicity assigned to each name

Mateos et al (2007) CASA Working Paper 116

Forename-Surname clustering (based on Hanks and Tucker, 2000)
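
The iterative expansion described on this slide can be sketched roughly as follows. This is a minimal illustration, assuming the input is a list of (forename, surname) pairs; the function name `grow_cluster` and the sample register are hypothetical, and the later steps (assigning an Onomap type and a probability to each name) are not shown.

```python
from collections import defaultdict

def grow_cluster(pairs, seed_surname):
    """Alternate between surnames and forenames until no new names appear."""
    forenames_of = defaultdict(set)   # surname  -> forenames seen with it
    surnames_of = defaultdict(set)    # forename -> surnames seen with it
    for forename, surname in pairs:
        forenames_of[surname].add(forename)
        surnames_of[forename].add(surname)

    surnames, forenames = {seed_surname}, set()
    frontier = {seed_surname}
    while frontier:  # one pass per iteration; stops when the cluster is exhausted
        new_forenames = set().union(*(forenames_of[s] for s in frontier)) - forenames
        forenames |= new_forenames
        new_surnames = set().union(*(surnames_of[f] for f in new_forenames)) - surnames
        surnames |= new_surnames
        frontier = new_surnames
    return surnames, forenames

# Hypothetical register entries, for illustration only.
sample_pairs = [
    ("Pablo", "Mateos"), ("Juan", "Mateos"), ("Juan", "Garcia"),
    ("Rosa", "Garcia"), ("Rosa", "Pérez"), ("Marta", "Pérez"),
    ("Marta", "Sánchez"), ("Paul", "Longley"),
]
cluster_surnames, cluster_forenames = grow_cluster(sample_pairs, "Mateos")
```

With this sample, the cluster grown from "Mateos" reaches Garcia, Pérez and Sánchez through shared forenames, while "Longley" stays outside it because no forename links the two groups.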

WorldNames CEL clusters

Source: Mateos et al (2011)

Uncertainty and virtual identity

• Identity increasingly shaped by online activities
  – => value may be leveraged from the fusion of physical and virtual data sources
• Data fusion and generalisation to relate physical and virtual properties
• Use of residence alongside activity patterns and social network information

Most of us have virtual identities

• Email address; social media accounts

• People use different procedures and providers to establish virtual identities

• Harvesting these data has interesting potential applications
  – Cyber crime
  – Cyber geodemographics (Facebook has already started this)

• Facebook data mining engine
  – Analyses the words you use and tailors advertisements accordingly

Starting Point

http://worldnames.publicprofiler.org

• Worldnames holds data for approximately 1 billion people in around 28 countries

• Approximately 1.6 million unique users have visited the website since 2008

• Worldnames has been archiving ‘Surname search’, ‘Email Address’, ‘Gender’, and ‘IP Address’ for searches over the past 6 months
  – c. 175,000 records: email validation
  – 150,000 usable ‘IP Address’ entries
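
The slides do not say how records were validated, so the following is only a sketch of one plausible filter. It assumes a loose syntactic check on the email address and treats an IP address as "usable" when it parses and is publicly routable; both criteria and the function names `is_plausible_email` and `is_usable_ip` are assumptions, not the authors' method.

```python
import ipaddress
import re

# Deliberately loose pattern: something@something.tld, no spaces, one '@'.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")

def is_plausible_email(text):
    """Syntactic check only; it does not prove the mailbox exists."""
    return EMAIL_PATTERN.fullmatch(text.strip()) is not None

def is_usable_ip(text):
    """Assumed meaning of 'usable': parses and is not private or reserved."""
    try:
        address = ipaddress.ip_address(text.strip())
    except ValueError:
        return False
    return address.is_global
```

A private address such as 192.168.0.1 is rejected here because it cannot be geolocated meaningfully; a stricter or looser rule may well have been used in practice.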

IP Address to Latitude/Longitude conversion

http://quova.com

An API to convert “IP addresses” to their corresponding latitude / longitude values

A search for an IP Address in UCL (128.40.214.196)
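
The Quova API itself is not reproduced here. As a rough illustration of what such a conversion does, the sketch below looks an address up in a small local table of network ranges. The table contents are assumptions for demonstration: the range assigned to UCL and the approximate central London coordinates are illustrative values, and `GEO_TABLE` and `locate` are hypothetical names.

```python
import ipaddress

# Hypothetical lookup table: (network range, (latitude, longitude)).
# The range and coordinates below are illustrative assumptions.
GEO_TABLE = [
    (ipaddress.ip_network("128.40.0.0/16"), (51.52, -0.13)),
]

def locate(ip_text):
    """Return (lat, lon) for the first matching range, or None if unknown."""
    address = ipaddress.ip_address(ip_text)
    for network, coordinates in GEO_TABLE:
        if address in network:
            return coordinates
    return None

ucl_location = locate("128.40.214.196")
```

A commercial service works from a far larger and regularly updated table, and the accuracy of the returned coordinates varies by provider and region.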

Top Countries

Website was searched from 155 countries over the past 6 months.

Country           Searches
UNITED STATES     76708
UNITED KINGDOM    21892
CANADA            8154
GERMANY           7158
ITALY             4058
AUSTRALIA         2978
BRAZIL            2440
FRANCE            2028
ARGENTINA         1958
SPAIN             1830
NEW ZEALAND       1236
NETHERLANDS       1074
GREECE            1040
SWITZERLAND       992
BELGIUM           940
POLAND            880
AUSTRIA           874
MEXICO            834
IRELAND           710
SWEDEN            630

[Maps by region: UK and Ireland; Europe; North America; South America; India, China, Japan, Singapore]

Popular Surname Searches

Surname     Searches
SMITH       708
JONES       306
JOHNSON     258
ANDERSON    224
WILLIAMS    222
MILLER      218
MARTIN      202
WILSON      194
BROWN       194
MOORE       188
THOMAS      178
TAYLOR      170
CLARK       164
LEE         160
ROBERTS     156
DAVIS       152
CAMPBELL    144
LEWIS       138
HARRIS      138
MITCHELL    136

Popular Email Domains

Email domain      Count
GMAIL.COM         31842
HOTMAIL.COM       22098
YAHOO.COM         15542
AOL.COM           5550
COMCAST.NET       2696
HOTMAIL.CO.UK     1948
MSN.COM           1624
WEB.DE            1522
YAHOO.CO.UK       1290
GMX.DE            1260
SBCGLOBAL.NET     1246
BTINTERNET.COM    860
HOTMAIL.IT        844
VERIZON.NET       798
GOOGLEMAIL.COM    742
LIVE.COM          742
COX.NET           708
ATT.NET           632
MAILINATOR.COM    616
LIBERO.IT         616
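
A tally like the one above can be produced by counting the part of each address after the ‘@’. The sketch below is one simple way to do it, assuming addresses arrive as plain strings; the function name `domain_counts` and the sample addresses are hypothetical.

```python
from collections import Counter

def domain_counts(emails):
    """Count email domains, upper-cased so GMAIL.COM and gmail.com merge."""
    counts = Counter()
    for email in emails:
        local, separator, domain = email.strip().rpartition("@")
        if separator and local and domain:  # skip strings without a usable '@'
            counts[domain.upper()] += 1
    return counts

# Hypothetical sample, for illustration only.
sample_emails = ["a@gmail.com", "b@GMAIL.com", "c@web.de", "malformed"]
top_domains = domain_counts(sample_emails).most_common(2)
```

Grouping the same counts by surname or by country only requires keeping one `Counter` per group.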

Popular Email Domains by Surnames

Smith (English):    GMAIL.COM, YAHOO.COM, HOTMAIL.COM, AOL.COM, MAILINATOR.COM
Jones (Welsh):      GMAIL.COM, HOTMAIL.COM, YAHOO.COM, COMCAST.NET, GOOGLEMAIL.COM
Johnson (English):  GMAIL.COM, HOTMAIL.COM, YAHOO.COM, MSN.COM, VERIZON.NET
Perez (Spanish):    GMAIL.COM, HOTMAIL.COM, YAHOO.ES, CHARTER.NET, GRANDECOM.NET
Gupta (Indian):     GMAIL.COM, HOTMAIL.COM, YAHOO.COM, GOOGLEMAIL.COM, INDIATIMES.COM
Meyer (German):     GMAIL.COM, HOTMAIL.COM, YAHOO.COM, AOL.COM, GMX.DE

Popular Email Domains by Country

UK:       GMAIL.COM, HOTMAIL.COM, HOTMAIL.CO.UK, YAHOO.CO.UK, YAHOO.COM
USA:      GMAIL.COM, YAHOO.COM, HOTMAIL.COM, AOL.COM, COMCAST.NET
France:   HOTMAIL.FR, GMAIL.COM, HOTMAIL.COM, YAHOO.FR, LAPOSTE.NET
Germany:  WEB.DE, GMX.DE, T-ONLINE.DE, YAHOO.DE, GMAIL.COM
Brazil:   HOTMAIL.COM, GMAIL.COM, YAHOO.COM.BR, IG.COM.BR, BOL.COM.BR
Japan:    YAHOO.COM, YAHOO.CO.JP, GMAIL.COM, HOTMAIL.COM, MSN.COM

Top GoogleMail.com users

Top Surnames: BINDER, WATKINS, WHITE, WOODS, ROBINSON, SLEEMAN, BENNETT, RITCHIE, SHARP, ROLLINGS

GoogleMail.com users

• Surname ‘Binder’

[Maps: Germany, Switzerland]

GoogleMail.com users

• Surname ‘Blackbourn’

[Map: New Zealand]

Who uses their surname as part of their email address?

• Approximately 40% of users have their surname as part of their email address
  – abbie.harper@hotmail.com (Surname: Harper)
  – helmut.kempe@inode.at (Surname: Kempe)
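
A share like the 40% figure can be estimated by checking whether the searched surname occurs in the part of the address before the ‘@’. The sketch below assumes a plain case-insensitive substring test, which is only one possible matching rule; the function names and the third sample record are hypothetical.

```python
def surname_in_address(email, surname):
    """True if the surname appears in the local part (before '@')."""
    local_part = email.split("@", 1)[0].lower()
    return surname.lower() in local_part

def share_with_surname(records):
    """Fraction of (email, surname) records where the surname is in the address."""
    if not records:
        return 0.0
    hits = sum(surname_in_address(email, surname) for email, surname in records)
    return hits / len(records)

sample_records = [
    ("abbie.harper@hotmail.com", "Harper"),
    ("helmut.kempe@inode.at", "Kempe"),
    ("sunny99@example.com", "Lee"),  # hypothetical record with no match
]
sample_share = share_with_surname(sample_records)
```

Substring matching over-counts short surnames (for example "Lee" inside "sleeman"), so a stricter rule that splits the local part on dots and digits may be preferable.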

• Top Countries

[Bar chart, axis 0 to 50; bar values not recoverable. Countries in chart order:]
SOUTH AFRICA, SLOVENIA, UNITED KINGDOM, IRELAND, INDIA, MALAYSIA, PORTUGAL, GERMANY, COSTA RICA, AUSTRIA, LUXEMBOURG, BELGIUM, CANADA, NEW ZEALAND, AUSTRALIA, CHINA, TURKEY, CROATIA, SWITZERLAND, UNITED STATES

Who uses long email addresses?

• Grand mean email length of 8 characters
  – Number of characters on the left side of ‘@’
  – USA, Canada, United Kingdom, and other European countries

• People from South American countries and India have long email addresses (average length: 13 characters)

• South Indians have longer email addresses than North Indians

Country      Email address                           Characters before ‘@’
BRAZIL       ANA.ARAUJO3909@CREASP.ORG.BR            14
CHILE        BYRON.DELGADO.INOSTROZA@HOTMAIL.COM     23
URUGUAY      DIEGOJAVIERZEBALLOS@GMAIL.COM           19
INDIA        GANGULYDEEPANJAN@HOTMAIL.COM            16
ARGENTINA    AGUSTINAREYNOZO@GMAIL.COM               15
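
The length measure used here, the number of characters to the left of ‘@’, and its average per country can be computed along these lines. This is a minimal sketch assuming records are (country, email) pairs; the function names and the second sample address are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def local_part_length(email):
    """Number of characters on the left side of '@'."""
    return len(email.split("@", 1)[0])

def mean_length_by_country(records):
    """Average local-part length for each country in (country, email) records."""
    lengths = defaultdict(list)
    for country, email in records:
        lengths[country].append(local_part_length(email))
    return {country: mean(values) for country, values in lengths.items()}

sample_records = [
    ("BRAZIL", "ANA.ARAUJO3909@CREASP.ORG.BR"),
    ("BRAZIL", "abcdef@example.com"),  # hypothetical address
]
brazil_mean = mean_length_by_country(sample_records)["BRAZIL"]
```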

What else we can infer from email addresses

• Internet service provider
  – A.GOODEVE@AOL.COM
  – BERRYMANL@BTINTERNET.COM
  – CARL@VALLEYWISP.NET (Person lives in a rural area of northeast Oregon)

• Country of origin
  – A.HAKIM26@YAHOO.FR
  – CBARNES@MEDIAWORKS.CO.NZ

• Probable temporal aspects
  – ABBY527@OPTONLINE.NET
  – BERZINSKY102@YAHOO.COM
  – C.JOHNSTON2@BTINTERNET.COM

• Probable forename of a person
  – BEVERLY.RICHARDS@YAHOO.COM
  – BJORN.SOBRY@HOTMAIL.COM
  – BRANDAN.HOLMES@HOTMAIL.COM

• How up to date someone is with technology
  – ALEXANDER.BREUSCH@GMAIL.COM
  – WILLIAM.NEALON@GOOGLEMAIL.COM

• Professional affiliations
  – CHRIS@IEEE.ORG

• Work locations
  – DOUG.GOODMAN@FOUNDATION.ORG.UK
  – GRL@KCS.ORG.UK
  – ERM43@CAM.AC.UK

• Studying
  – RTRIPOLI@STUDENT.UMASS.EDU
  – CBALIN01@STUDENTS.BBK.AC.UK
  – KATHERINE.LITTEN@STUDENT.KIRKWOOD.EDU
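
Several of these inferences can be approximated with simple string rules. The sketch below is illustrative only: the heuristics (a two-letter final label read as a country code, a leading "student" label read as a student account, trailing digits kept as a possible temporal hint, and a "first.last" pattern read as a probable forename) are assumptions, and `email_clues` is a hypothetical name.

```python
import re

def email_clues(email):
    """Collect rough, heuristic clues from one email address."""
    local_part, _, domain = email.lower().rpartition("@")
    labels = domain.split(".")
    clues = {"domain": domain}

    # Assumption: a two-letter final label is a country-code domain.
    if len(labels[-1]) == 2:
        clues["country_code"] = labels[-1]

    # Assumption: a leading 'student' or 'students' label marks a student account.
    if labels[0] in ("student", "students"):
        clues["student"] = True

    # Trailing digits may hint at a year, an age, or just a disambiguator.
    digits = re.search(r"(\d+)$", local_part)
    if digits:
        clues["trailing_digits"] = digits.group(1)

    # Assumption: 'first.last' with letters only puts the forename first.
    parts = local_part.split(".")
    if len(parts) == 2 and all(part.isalpha() for part in parts):
        clues["probable_forename"] = parts[0]
    return clues
```

None of these rules is reliable on its own: many people outside a country use its domains, and digits in an address frequently carry no meaning at all.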

Conclusion and future work

• There are some interesting patterns found in the study of email addresses
  – some problems (accuracy of geocoding techniques)

• Prospect of data linkage of data coded to unit postcode level
  – cluster analysis and data mining techniques

• Future work may involve the data mining of Facebook and Twitter data
  – issues of generalisation

• Visualisation of the data

Any Questions?

Thanks for Listening

A research agenda

1 Acquire relevant real and virtual data sources and devise DBMS
2 Devise GB-wide classification of NICT usage at neighbourhood scale
3 Devise GB-wide classification of social network traffic
4 Develop enhanced worldnames site to harvest real and virtual user data
5 Undertake text analysis of worldnames user data and use to link classifications (2) and (3)
6 Devise, implement and analyse social networking application and cybergeodemographic classification
