Geodemographics: Open tools and methods. Dr. Muhammad Adnan, Department of Geography, University College London. Web: http://www.uncertaintyofidentity.com Email: [email protected] Twitter: @gisandtech

Geodemographics: Open tools and methods


DESCRIPTION

This presentation gives an overview of geodemographic classifications and explains why open tools and methods are needed to create them. It also describes the challenges involved in creating real-time geodemographic classifications and the use of social media data for geodemographic applications.


Page 1: Geodemographics: Open tools and methods

Geodemographics: Open tools and methods

Dr. Muhammad Adnan

Department of Geography, University College London

Web: http://www.uncertaintyofidentity.com

Email: [email protected]

Twitter: @gisandtech

Page 2: Geodemographics: Open tools and methods

Lecture Outline

• Geodemographic Classification

• Problems with the Geodemographic Classifications

• Real-time bespoke Geodemographic Classifications

• GeodemCreator: Software for creating Geodemographic Classifications

• Social Media data for Geodemographics

Page 3: Geodemographics: Open tools and methods

Geodemographics

• “Analysis of people by where they live” or “locality marketing”

(Sleight, 1993:3)

[Diagram labels: Home Address, Person, Area]

Page 4: Geodemographics: Open tools and methods

Steps in Creating a Geodemographic Classification

• Variable Selection

• Transformation of the Data

• Standardisation of the Data

• Clustering of the Data (k-means)

• Naming the clusters
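
The five steps above can be sketched end to end in plain Python. This is only a minimal illustration, not the method behind any particular classification: the area values, the variable names (`pct_over_65`, `pct_renting`, `density`), the choice of a log transformation, the deterministic farthest-point start for k-means and the cluster names are all assumptions made for the example.

```python
import math

# Hypothetical input: three made-up variables for six made-up areas.
areas = {
    "Area1": {"pct_over_65": 30.0, "pct_renting": 12.0, "density": 4.0},
    "Area2": {"pct_over_65": 28.0, "pct_renting": 40.0, "density": 5.0},
    "Area3": {"pct_over_65": 32.0, "pct_renting": 25.0, "density": 6.0},
    "Area4": {"pct_over_65": 8.0, "pct_renting": 30.0, "density": 90.0},
    "Area5": {"pct_over_65": 10.0, "pct_renting": 15.0, "density": 110.0},
    "Area6": {"pct_over_65": 9.0, "pct_renting": 45.0, "density": 100.0},
}
names = sorted(areas)

# 1. Variable selection (here: keep two of the three variables).
selected = ["pct_over_65", "density"]
raw = [[areas[a][v] for v in selected] for a in names]

# 2. Transformation (assumed choice: log(1 + x) to reduce skew).
transformed = [[math.log1p(v) for v in row] for row in raw]

# 3. Standardisation (z-scores, column by column).
def zscore_columns(rows):
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

standardised = zscore_columns(transformed)

# 4. Clustering (k-means with a deterministic farthest-point start).
def kmeans(rows, k, iters=20):
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda r: min(math.dist(r, c) for c in centroids)))
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(r, centroids[j]))
                  for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans(standardised, k=2)

# 5. Naming the clusters (placeholder names; in practice chosen by
#    inspecting each cluster's profile).
cluster_names = {labels[0]: "Older, low density",
                 labels[3]: "Younger, high density"}
classification = {a: cluster_names[lab] for a, lab in zip(names, labels)}
print(classification)
```

In a real classification the naming step is a manual, interpretive exercise; the dictionary above only stands in for it.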

Page 5: Geodemographics: Open tools and methods

Data – Census + Other

Experian: Mosaic
• Census data: 54%
• Non-Census data: 46%

CACI: Acorn
• Census data: 30%
• Non-Census data: 70%

ONS Output Area Classification (2001 and 2011)
• Census data: 100%

Page 6: Geodemographics: Open tools and methods

Standardising the data

• Z-Scores
  • Widely used variable normalisation technique
  • Can create outliers in the datasets

• Range Standardisation
  • Standardises values to a range of 0 to 1
  • Can erase interesting patterns in the data

• Principal Component Analysis (PCA)
  • Reduces the dimensions of a data set
  • Focuses on the part of the data set having maximum variance
  • Can erase interesting patterns in the data
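
The three techniques on this slide can be illustrated with a short sketch in plain Python. The input values are made up, and the PCA part only finds the first principal component by power iteration, which is a simplification of what a statistics package would do.

```python
import math

def z_scores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def range_standardise(values):
    """Rescale so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def first_principal_component(rows, iters=200):
    """Leading eigenvector of the covariance matrix, by power iteration.

    Assumes the start vector [1, 1, ...] is not orthogonal to that
    eigenvector, which holds for the positively correlated data below.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / n for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(iters):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec

# Made-up variable with one extreme area.
pct_unemployed = [2.0, 3.0, 4.0, 5.0, 40.0]
z = z_scores(pct_unemployed)
r = range_standardise(pct_unemployed)
print([round(v, 2) for v in z])
print([round(v, 2) for v in r])

# Made-up two-variable data lying along the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(rows)
print([round(v, 3) for v in pc1])
```

With the extreme value present, range standardisation squeezes the other four areas into a narrow band near 0, which is one way patterns in the data can be lost.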

Page 7: Geodemographics: Open tools and methods

Segmentations are created by cluster analysis

[Data table: one row per area (Area1, Area2, …, Area6, …) and one column per variable (V1, V2, …, V6, …)]

[Scatter plot: Variable 1 against Variable 2, with the points grouped into Cluster 1, Cluster 2 and Cluster 3]

Page 8: Geodemographics: Open tools and methods

Output of Cluster Analysis

Areas   Cluster
Area1   1
Area2   1
Area3   2
Area4   1
Area5   3
Area6   3
…

[Map: 2001 OAC (around Greater London)]

Page 9: Geodemographics: Open tools and methods

Naming the clusters

• 2011 OAC has 8 super groups

1. Rural Residents

2. Cosmopolitans

3. Ethnic Mix

4. Blue Collar Neighbourhoods

5. Multicultural Metropolitans

6. Suburbanites

7. Hard-Pressed Households

8. Urbanites

Page 10: Geodemographics: Open tools and methods

But geodemographic classifications have some problems!

Page 11: Geodemographics: Open tools and methods

Does one size fit all?

• Most geodemographic classifications divide areas into a specified number of categories
  • 2011 OAC divides the Output Areas in the UK into 8 broad categories

• Do these categories account for all the characteristics of the population?

• Do we need to create bespoke small area classifications?
  • Geodemographic categories only apply to a particular area

Page 12: Geodemographics: Open tools and methods

Closed Methods

• Commercial geodemographic classifications (e.g. Mosaic, Acorn) use closed methods
  • Data sources used?
  • Weighting of the variables?
  • Data standardisation techniques employed?
  • Clustering algorithm applied?

• We need open methods and clear documentation of the geodemographic classifications
  • 2001 OAC
  • 2001 LOAC (London’s Output Area Classification)
  • 2011 OAC

Page 13: Geodemographics: Open tools and methods

Public Consultation

• Users of the classification cannot modify it or give feedback

• Users should have the control to modify the classification through their feedback

• UCL’s E-Society Classification

Page 14: Geodemographics: Open tools and methods

Public Consultation

Feedback

Page 15: Geodemographics: Open tools and methods

Real time Geodemographics

Page 16: Geodemographics: Open tools and methods

Need for real time Geodemographics

• Current classifications are created using static data sources

• The rate and scale of current population change is making large surveys (such as the census) increasingly redundant
  • Significant hidden value in transactional data

• Data is increasingly available in near real time, e.g. the ONS (Office for National Statistics) NESS API

• Social media data is available in real time

Page 17: Geodemographics: Open tools and methods

What are real time Geodemographics?

[Diagram labels: Real time feeds of data; Online specification of inputs; Clustering; Visualisation; Specification, Estimation, Testing]

Page 18: Geodemographics: Open tools and methods

Computational challenges

• Integration of large and possibly disparate databases
  • E.g. NHS data; Census data

• Data normalisation and optimisation for fast transactions
  • Minimising the computational time of clustering algorithms (very important!)

• Common protocol
  • XML (SOAP)

• Use of non-traditional data sources (Singleton, 2008)
  • E.g. Flickr, Facebook, Twitter

Page 19: Geodemographics: Open tools and methods

Important Challenge: Selection of clustering algorithm

• K-Means
• PAM (Partitioning Around Medoids)
• CLARA (Clustering Large Applications)
• GA (Genetic Algorithm)

Page 20: Geodemographics: Open tools and methods

k-means

• Widely used clustering algorithm for geodemographics
  • Attempts to find cluster centroids by minimising the within-cluster sum of squared distances.
  • K-means is unstable because its result depends on the initial seed assignment.

• Sensitive to outliers in the data set.

• Creating a geodemographic classification requires running the algorithm multiple times.
  • 10,000 times (Singleton, 2008)
  • Computationally expensive in a real time environment.
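
The dependence on initial seeds, and the usual remedy of running the algorithm many times and keeping the best result, can be sketched as follows. This is an illustrative plain-Python k-means on made-up 2-D points, with 20 restarts rather than the 10,000 cited above; the function name `kmeans_wss` and all parameter values are assumptions for the example.

```python
import math
import random

def kmeans_wss(points, k, seed, iters=30):
    """One k-means run from random initial seeds; returns (centroids, WSS)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster sum of squares: squared distance to the nearest centroid.
    wss = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, wss

# Made-up data: four loose groups of 30 points each.
data_rng = random.Random(42)
centres = [(0, 0), (5, 0), (0, 5), (5, 5)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in centres for _ in range(30)]

# Restart from 20 different seeds and keep the run with the lowest WSS.
runs = [kmeans_wss(points, k=4, seed=s) for s in range(20)]
wss_values = [w for _, w in runs]
best_centroids, best_wss = min(runs, key=lambda run: run[1])
print(f"WSS ranged from {min(wss_values):.1f} to {max(wss_values):.1f}")
```

Each restart repeats the full assignment and update loop, so the cost grows linearly with the number of restarts, which is why this approach is expensive in a real time setting.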

Page 21: Geodemographics: Open tools and methods

An example of bad clustering result (K-means)

Page 22: Geodemographics: Open tools and methods

An example of bad clustering result (K-means)

Page 23: Geodemographics: Open tools and methods

An example of bad clustering result (K-means)

Page 24: Geodemographics: Open tools and methods

Alternate Clustering Algorithms

• PAM (Partitioning around medoids)
• CLARA (Clustering Large Applications)
• GA (Genetic Algorithm)

Page 25: Geodemographics: Open tools and methods

Alternate Clustering Algorithms…

• PAM (Partitioning around medoids)
  • It tries to minimise the sum of dissimilarities between the data points and their cluster centres (medoids).
  • Less sensitive to outliers than K-means.
  • Cannot handle larger data sets.

• Produces better results than k-means for smaller data sets.
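
A minimal sketch of the idea behind PAM, assuming Euclidean distance as the dissimilarity and a random start followed by a simple swap search (a simplification of the full algorithm). The points and the function names `total_cost` and `pam` are made up for illustration.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k, seed=0):
    """Swap-based k-medoids: medoids are indices of actual data points."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing medoid i with this non-medoid point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels, best

# Made-up data: two small, well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
medoids, labels, cost = pam(points, k=2)
print("medoid points:", [points[m] for m in medoids])
print("labels:", labels)
```

Every candidate swap re-evaluates the cost over all points, so the work grows roughly with the square of the number of points per pass. That is consistent with the slide's remark that PAM cannot handle larger data sets.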

Page 26: Geodemographics: Open tools and methods

Alternate Clustering Algorithms…

• CLARA (Clustering Large Applications)
  • It draws multiple samples of the data set, applies PAM to each sample and returns the best result.
  • Can handle large data sets as it operates on samples rather than on the actual data set.

• Could be a better choice for creating classifications on the fly.
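
The sampling idea behind CLARA can be sketched as below: run a small PAM on each random sample, then score the resulting medoids against the full data set and keep the best. This is an illustrative plain-Python version on made-up points; the sample count, sample size and function names are assumptions, and the inner `pam` is a simplified swap search.

```python
import math
import random

def cost(points, medoid_points):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoid_points) for p in points)

def pam(points, k):
    """Small swap-based PAM; returns the chosen medoid points."""
    medoids = list(points[:k])
    best = cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in points:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                c = cost(points, trial)
                if c < best - 1e-12:
                    medoids, best, improved = trial, c, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)      # cluster the sample only
        c = cost(points, medoids)     # but score on the full data set
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

# Made-up data: 600 points around three centres.
data_rng = random.Random(7)
centres = [(0, 0), (6, 0), (3, 6)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in centres for _ in range(200)]

medoids, total = clara(points, k=3)
print("medoids:", [(round(x, 2), round(y, 2)) for x, y in medoids])
```

Because PAM only ever sees 40 points at a time, the expensive swap search stays small regardless of how large the full data set is; only the final scoring touches every point.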

Page 27: Geodemographics: Open tools and methods

Alternate Clustering Algorithms…

• GA (Genetic Algorithm)
  • It is inspired by models of biological evolution. It produces results through a breeding procedure.
  • Creates hierarchies of generations and then merges the hierarchies into homogeneous groups having similar characteristics.

• Can be time consuming due to the creation of generation hierarchies.

Page 28: Geodemographics: Open tools and methods

Comparing computational efficiency (Z-scores)

[Figure panels: OA (Output Area) level results; LSOA (Lower Super Output Area) level results; Ward level results]

Page 29: Geodemographics: Open tools and methods

Algorithm Stability (w.r.t. Computational time)

[Figure: three line charts, one per algorithm, each with Time (s) on the y-axis (0 to 3.5) and x-axis ticks running from 1 to just under 100]
• K-means: running k-means on OA (Output Area) 120 times on each iteration
• CLARA: running CLARA on OA (Output Area) 120 times on each iteration
• GA: running GA on OA (Output Area) 120 times on each iteration
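
A comparison of this kind needs a way to record the run time of each repeated call. The following is a minimal timing harness, not the setup used for the charts: `time_runs` is a hypothetical helper, and `workload` is a stand-in (it sorts a shuffled list) that would be replaced by a call to the clustering algorithm being measured.

```python
import random
import statistics
import time

def time_runs(fn, n_runs):
    """Return the wall-clock duration in seconds of each of n_runs calls."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

def workload():
    # Stand-in task, NOT a clustering algorithm.
    values = list(range(20000))
    random.shuffle(values)
    values.sort()

times = time_runs(workload, n_runs=30)
print(f"mean {statistics.mean(times):.4f} s, "
      f"st. dev. {statistics.stdev(times):.4f} s")
```

A small standard deviation relative to the mean indicates stable run times; a large one suggests the time depends strongly on the starting conditions of each run.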

Page 30: Geodemographics: Open tools and methods

Bespoke Real-time Geodemographics

[Diagram labels: Bespoke Requests; Data; Realtime Measurement]

• Specify inputs and weights
• Data normalisation
• Clustering
• Visualisation
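
One common way to apply user-specified weights is to multiply each standardised variable by its weight before clustering, so that heavily weighted variables dominate the distance calculation. The sketch below assumes this approach (the slides do not say how weights are applied); the function name `apply_weights` and the values are made up.

```python
import math

def apply_weights(rows, weights):
    """Scale each column of already-standardised data by its weight."""
    return [[v * w for v, w in zip(row, weights)] for row in rows]

# Three made-up areas described by two standardised variables.
a, b, c = [0.0, 0.0], [1.0, 0.0], [0.0, 1.5]

# Unweighted: area a is closer to b than to c.
print(math.dist(a, b), math.dist(a, c))

# Weighting the first variable three times as heavily reverses that.
wa, wb, wc = apply_weights([a, b, c], weights=[3.0, 1.0])
print(math.dist(wa, wb), math.dist(wa, wc))
```

With Euclidean distance, a weight of w multiplies that variable's contribution to the squared distance by w squared, so weights should be chosen with that in mind.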

Page 31: Geodemographics: Open tools and methods

GeodemCreator: Software for creating Geodemographic Classifications in near real-time.

Page 32: Geodemographics: Open tools and methods

GeodemCreator

• Allows users to create Geodemographic Classifications
• Users have control over how a Geodemographic Classification is created (Open Methods!)

Page 33: Geodemographics: Open tools and methods

Building a Geodemographic Classification

• Step-1: Choose a dataset

Page 34: Geodemographics: Open tools and methods

Building a Geodemographic Classification

• Step-2: Check Correlation of the Variables
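
A correlation check of this kind can be sketched as follows. This is not GeodemCreator's implementation: it simply computes Pearson correlations between made-up variables and flags pairs above an assumed threshold of 0.8, since highly correlated variables carry largely the same information into the clustering.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up variables for five areas.
variables = {
    "pct_unemployed": [4.0, 6.0, 9.0, 12.0, 15.0],
    "pct_no_car": [10.0, 14.0, 21.0, 27.0, 33.0],
    "pct_over_65": [22.0, 9.0, 30.0, 12.0, 18.0],
}

THRESHOLD = 0.8  # assumed cut-off for "highly correlated"
names = list(variables)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(variables[first], variables[second])
        if abs(r) > THRESHOLD:
            flagged.append((first, second, r))

for first, second, r in flagged:
    print(f"{first} and {second} are highly correlated (r = {r:.2f})")
```

A flagged pair is a prompt to consider dropping or combining one of the two variables in the selection step that follows.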

Page 35: Geodemographics: Open tools and methods

Building a Geodemographic Classification

• Step-3: Select variables

Page 36: Geodemographics: Open tools and methods

Building a Geodemographic Classification

• Step-4: Specify ‘number of clusters’ and ‘spatial area’

[Screenshot labels: Number of Clusters; Spatial Area]

Page 37: Geodemographics: Open tools and methods

Building a Geodemographic Classification

• Step-5: Build the Classification

Page 38: Geodemographics: Open tools and methods

Building a Geodemographic Classification

• Output – Cluster Numbers

Page 39: Geodemographics: Open tools and methods

Building a Geodemographic Classification

• Output

Page 40: Geodemographics: Open tools and methods

Social Media data for Geodemographics

Page 41: Geodemographics: Open tools and methods

Why do we need Social Media data for Geodemographics?

• Traditional geodemographic classifications are based on Census data
  • Night time geography

• These classifications do not identify where the population is during the day time

• We do not know about the social links between different people

• A solution is to combine Social Media data with traditional data sources

Page 42: Geodemographics: Open tools and methods

Geodemographics

• “Analysis of people by where they live” or “locality marketing”

Social Media Geodemographics

• “Analysis of people by where they live, travel, and who they communicate with”

Page 43: Geodemographics: Open tools and methods

Social Media Geodemographics

• Who: Ethnicity, Gender, and Age of social media users

• Where: Where social media conversations are happening and who is leading them
  • Intelligence about where people are located and what they are doing

• When: What time of day conversations happen

Page 44: Geodemographics: Open tools and methods

Twitter (www.twitter.com)

• Online social-networking and microblogging service
  • Launched in 2006

• Users can send messages of 140 characters or less

• Approximately 200 million active users

• 350 million tweets daily

• In 2012, UK and London were ranked 4th and 3rd, respectively, in terms of the number of posted tweets

Page 45: Geodemographics: Open tools and methods

Data available through the Twitter API

• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text

Page 46: Geodemographics: Open tools and methods

Page 47: Geodemographics: Open tools and methods

Analysing Names on Twitter

• Some examples of NAME variations on Twitter

Real Names

Kevin Hodge

Andre Alves

Jose de Franco

Carolina Thomas, Dr.

Prof. Martha Del Val

Fabíola Sanchez Fernandes

Fake Names

Castor 5.

WHAT IS LOVE?

MysticMind

KIRILL_aka_KID

Vanessa

Petuna
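
The distinction between the two columns above can be approximated with a simple rule-based filter. This is a rough heuristic for illustration only, not the method used in the talk: the honorific list, the particle list and the rule that a real name needs at least two capitalised alphabetic words are all assumptions.

```python
TITLES = {"dr", "prof", "mr", "mrs", "ms"}      # assumed honorifics to ignore
PARTICLES = {"de", "del", "da", "van", "von"}   # assumed name particles

def looks_like_real_name(name):
    """Heuristic: two or more alphabetic, capitalised words after
    dropping honorifics; particles such as 'de' may be any case."""
    tokens = [t.strip(".,") for t in name.split()]
    tokens = [t for t in tokens if t and t.lower() not in TITLES]
    if len(tokens) < 2:
        return False
    for token in tokens:
        if not token.isalpha():
            return False
        if token.lower() in PARTICLES:
            continue
        if not (token[0].isupper() and token[1:].islower()):
            return False
    return True

real_examples = ["Kevin Hodge", "Andre Alves", "Jose de Franco",
                 "Carolina Thomas, Dr.", "Prof. Martha Del Val",
                 "Fabíola Sanchez Fernandes"]
fake_examples = ["Castor 5.", "WHAT IS LOVE?", "MysticMind",
                 "KIRILL_aka_KID", "Vanessa", "Petuna"]

for name in real_examples + fake_examples:
    print(f"{name!r}: {looks_like_real_name(name)}")
```

The two-word rule reproduces the grouping on this slide, where forename-only entries such as "Vanessa" sit in the fake column, but it would also reject genuine users who give only one name.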

Page 48: Geodemographics: Open tools and methods

Tweeting Activity by different Ethnic Groups

[Figure panels: English, Italian, Pakistani, Indian, Turkish, Greek, Bangladeshi, Spanish, German, French, Portuguese, Sikh]

Page 49: Geodemographics: Open tools and methods

Genders of Twitter Users

Page 50: Geodemographics: Open tools and methods

Age distribution of Twitter Users vs 2011 Census

Page 51: Geodemographics: Open tools and methods

Summary

• Geodemographics is the analysis of people by where they live

• But generalised geodemographic classifications have some problems
  – We need bespoke classifications for smaller areas

• Real-time geodemographic classification is one way to create bespoke classifications

• Methods of creating current classifications are not open
  – We need Open tools and Open methods for geodemographics

• Social media data can be used for geodemographic classifications