Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Leveraging mobile network big data for development
This work was carried out with the aid of a grant from the Interna5onal Development Research Centre, O>awa, Canada.
Sriganesh Lokanathan, LIRNEasia IIT-‐Delhi, New Delhi 22 August 2014
About LIRNEasia (www.lirneasia.net)
• We are a regional think tank based in Colombo, working across South Asia, South East Asia and the Pacific Small Island States
• Our mission – Catalyzing policy change through research to improve people’s lives in the emerging Asia Pacific by facilita5ng their use of hard and soS infrastructures through the use of knowledge, informa5on and technology
2
What is Big Data?
3
4
Big Data
VARIETY Different forms
of data
VALUE Poten@al value
of data
VOLUME Size of data
VELOCITY Rate of change
of data
VERACITY Uncertainty of
data
The Vs
What has facilitated the rise of big data?
• Vast drops in the cost of storing and retrieving informa5on
• Exponen5al growth in computer power and memory – data can reside in persistent memory instead of disk and tape
• Major improvements in techniques for performing machine learning and reasoning
5
Sources of big data • Administra5ve data
– E.g. digi5zed medical records, insurance records, tax records, etc.
• Commercial transac5ons – E.g. Bank transac5ons, credit card purchases, supermarket purchases,
online purchases, etc.
• Sensors and tracking devices – E.g. road and traffic sensors, climate sensors, equipment &
infrastructure sensors, mobile phones, satellite/ GPS devices, etc.
• Online ac5vi5es/ social media – E.g. online search ac5vity, online page views, blogs/ FB/ twi>er posts,
online audio/ video/ images, etc.
6
If we want comprehensive coverage of the popula5on, what are the sources of big
data in developing economies? • Administra5ve data?
– E.g. digi5zed medical records, insurance records, tax records, etc.
• Commercial transac5ons? – E.g. Bank transac5ons, credit card purchases, supermarket purchases,
online purchases, etc.
• Sensors and tracking devices? – E.g. road and traffic sensors, climate sensors, equipment &
infrastructure sensors, mobile phones, satellite/ GPS devices, etc.
• Online ac5vi5es/ social media? – E.g. online search ac5vity, online page views, blogs/ FB/ twi>er posts,
online audio/ video/ images, etc.
7
Currently only mobile network big data has the widest possible popula5on coverage
8
LIRNEasia’s Big Data for Development (BD4D) Research
• LIRNEasia has nego5ated access to historical and anonymized telecom network meta-‐data from mul5ple operators in Sri Lanka
• In the current research cycle we are – conduc5ng exploratory research on answering a few social science
ques5ons related to mobility and connectedness – developing a framework with privacy and self-‐regulatory guidelines for
the collec5on, use and sharing of mobile phone data.
• h>p://lirneasia.net/projects/bd4d/
9
The data sets • Mul5ple mobile operators in Sri Lanka have provided LIRNEasia access to 4 different types of meta-‐data: – Call Detail Records (CDRs)
• Records of calls, SMS-‐es, Internet access
– Air5me top-‐up records
• Data sets do not include any Personally Iden5fiable Informa5on (PII). – All phone numbers are anonymized and – LIRNEasia does not maintain any mappings of iden5fiers to original phone numbers
10
What are some types of big data captured by mobile network operators?
• Call Detail Record (CDR) – Records of all calls made and received by a person created mainly for
the purposes of billing – Similar records exist for all SMS-‐es sent and received as well as for all
Internet sessions
– The Cell ID in turn has a lat-‐long posi5on associated with it.
11
• Air@me reload records – Records of all air5me reloads performed by prepaid SIMs – Each row corresponds to a record of one person’s ac5vity:
12
What are some types of big data captured by mobile network operators (contd.)?
The overall process of leveraging mobile network big data for development
13
Mobile network big data (CDRs, Internet access usage, air5me recharge
records)
Construct behavioral variables
(i) Mobility variables (ii) Social variables (iii) Consump5on
variables
Other data sources
(i) Census data (ii) HIES data (iii) Survey maps (iv) Transporta5on
schedules (v) ++++
Insights
(i) Urban planning (ii) Crisis
management & DRR
(iii) Health monitoring & planning
(iv) Socio-‐economic monitoring & planning
Analytics
Why are we doing this work?
• To bring 5mely evidence into the policy making process in developing economies
14
An example of 5mely policy-‐relevant evidence
15
• The image on the leS depicts rela5ve density of people in Colombo city and the surrounding regions at 1300 compared to 0000 (midnight the previous day) on a normal weekday.
• The yellow to red colors depict areas whose density has increased rela5ve to midnight. The blue color depicts areas whose density has decreased rela5ve to midnight (the darker the blue, the greater the loss in density). The clear areas are those where the overall density has not changed.
Our findings closely match results from expensive & infrequent transporta5on surveys
16
Using mobile network big data we can understand the mobility pa>erns of the popula5on
17 Volume of people
Low High
Understanding land use characteris5cs
• People leave digital traces when they use communica5on devices.
• Mobile communica5on pa>erns at different loca5ons can be leveraged to classify them into land-‐use categories.
18
The diurnal pa>erns of users connected to base sta5ons revealed interes5ng pa>erns
19
Base station 1 Base station 2
How do we leverage this insight?
• Time series of each base sta5on is normalized to a (0-‐1) range
• Euclidean distance between two 5me series is used to cluster base sta5ons in an unsupervised manner using k-‐means algorithm
20
Distribu5on of base sta5ons in Colombo district
6
What does this reveal?
• Cluster 1 exhibits pa>erns consistent with a commercial area
• Cluster 2 exhibits pa>erns
consistent with a less commercial and more residen5al characteris5cs (or possibly mixed)
8
A closer look at base sta5ons in each cluster
7
Cluster 1 examples Cluster 2 examples
Our results show how the Central Business District (CBD) has expanded
24
Seethawaka Export Processing Zone (EPZ)
11
Seethawaka EPZ Photo ©Senanayaka Bandara - Panoramio
So how can mobile network big data help transporta5on and urban policy
• Can help us understand popula5on mobility at a very high resolu5on (both temporally as well as spa5ally) – VLR data is very useful for understanding traffic conges5on
• Can help us understand land-‐use characteris5cs • Can provide insights in-‐between infrequent and costly surveys and land use census and in some cases even replace them
26
Can mobile network big data help us understand Sri Lanka’s communi5es?
27
The geo-‐spa5al distribu5on of Sri Lanka’s social networks?
• Each link represents the raw number of outgoing and incoming calls between two DSDs • Divisional Secretariat
Division (DSD) is a third level administra5ve division; 331 in total in LK
9 No. of calls
Low High
A different picture emerges when call volume is normalized by popula5on
● Strongly connected
components are visible
10 No. of calls
Low High
Iden5fying communi5es: methodology
• The social network is segregated such that overlapping connec5ons between communi5es are minimized.
• Strength of a community is determined by modularity • Modularity Q = (edges inside the community) –
(expected number of edges inside the community)
M. E. J.-‐Newman, Michele-‐Girvan, “Finding and evalua5ng community structure in networks”, Physical Review E, APS, Vol. 69, No. 2, p. 1-‐16, 204.
12
The resultant communi5es are centered around geographic neighborhoods
● For Sri Lanka, the op5mal number of communi5es discovered by the algorithm was 11
31
How much do these communi5es mesh with
exis5ng administra5ve boundaries?
● Southern and Northern provinces have the highest similarity to their respec5ve provincial boundaries.
14
● Colombo district is clustered as a single community and Gampaha is merged with North Western Province
15
How much do these communi5es mesh with exis5ng administra5ve boundaries (contd.)?
Zooming into a community
● Community detection is done within identified communities
● These communities
are less less related to DSD boundaries in Colombo district
16
Implica5ons for public policy?
• Administra5ve boundaries based on history and geography may not reflect current community structures
35
Analy5cal challenges
36
Challenge Solu@on(s) Data is biased towards frequent users
• Understand and adjust for selec5on bias
Data sparsity • Interpola5on techniques • Probability based models
Different tower densi5es • Different scale of analyses depending on region
Valida5ng results • Using other data sources e.g. data from Dept. of Census and Sta5s5cs, transporta5on survey data, etc.
In sum..
• Mobile network big data has many uses for developmental policy: – Helping us understand human mobility at a fine temporal and geo-‐spa5al scale
– Helping us understand social connectedness and community structures
• But it is not without analy5cal challenges – Incorpora5ng other data (even ‘small data’) can improve the results
37
Thank you
• More informa5on: – h>p://lirneasia.net/projects/bd4d/
38