27
1 Inferring land use from mobile phone activity Jameson L. Toole (MIT) Michael Ulm (AIT) Dietmar Bauer (AIT) Marta C. Gonzalez (MIT) UrbComp 2012 Beijing, China Sunday, August 12, 2012

Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

1

Inferring land use from mobile phone activity

Jameson L. Toole (MIT)Michael Ulm (AIT)Dietmar Bauer (AIT)Marta C. Gonzalez (MIT)

UrbComp 2012Beijing, China

Sunday, August 12, 2012

Page 2: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

2

The Big Questions

Can land use be predicted from mobile phone activity?

Can mobile phone data help measure temporal patterns of land use?

Are these patterns robust across cities?

Sunday, August 12, 2012

Page 3: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

3

Data Scarcity Methods Outputs

Abundance

Traditional Data:• Zoning Maps

Novel Data:• Mobile phone activity

Geographic Boundaries (Ratti et al 2010)

Intra-city Mobility (Cho et al 2011)

Inter-city Mobility

(Gonzalez et al 2008) (Wang et al 2009) (Simini 2012)

Unsupervised Classification(Reades 2009) (Calabrese 2010) (Soto 2011)(Zheng 2012)

Statistical Physics

Machine learning

Sensing Urban Systems

Sunday, August 12, 2012

Page 4: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

4

2. Location-based measures, analysing accessibility atlocations, typically on a macro-level. The measuresdescribe the level of accessibility to spatially distrib-uted activities, such as !the number of jobs within30 min travel time from origin locations’. More com-plex location-based measures explicitly incorporatecapacity restrictions of supplied activity characteris-tics to include competition e!ects. Location-basedmeasures are typically used in urban planning andgeographical studies.

3. Person-based measures, analysing accessibility at theindividual level, such as !the activities in which anindividual can participate at a given time’. This type

of measure is founded in the space–time geographyof H!agerstrand (1970) that measures limitations onan individual’s freedom of action in the environment,i.e. the location and duration of mandatory activities,the time budgets for flexible activities and travelspeed allowed by the transport system.

4. Utility-based measures, analysing the (economic) ben-efits that people derive from access to the spatiallydistributed activities. This type of measure has itsorigin in economic studies.

Table 1 presents a matrix of perspectives on accessi-bility and components. The table shows each perspective

location andcharacteristics of

infrastructure

passenger andfreight travel

demandsupply

Transport component

Temporal component Individual component

Accessibility toopportunities

needs, abilitiesopportunities

time restrictions

availableopportunities

travel time, costs, effort

• income, gender, educational level• vehicle ownership, etc.

• opening hours of shops and services• available time for activities

travel demand

available time

locations andcharacteristics

of opportunities

demand,competition

supply

Land-use component

locations andcharacteristics

of demand

= direct relationship = indirect relationship = feedback loop

Fig. 1. Relationships between components of accessibility.

Table 1Perspectives on accessibility and components

Measure Component

Transport component Land-use component Temporal component Individual component

Infrastructure-basedmeasures

Travelling speed; vehicle-hours lost in congestion

Peak-hour period; 24-hperiod

Trip-based stratification, e.g.home-to-work, business

Location-based measures Travel time and or costsbetween locations ofactivities

Amount and spatialdistribution of the demandfor and/or supply ofopportunities

Travel time and costs maydi!er, e.g. between hoursof the day, between daysof the week, or seasons

Stratification of thepopulation (e.g. by income,educational level)

Person-based measures Travel time betweenlocations of activities

Amount and spatialdistribution of suppliedopportunities

Temporal constraints foractivities and time avail-able for activities

Accessibility is analysed atindividual level

Utility-based measures Travel costs betweenlocations of activities

Amount and spatialdistribution of suppliedopportunities

Travel time and costs maydi!er, e.g. between hoursof the day, between daysof the week, or seasons

Utility is derived at theindividual or homogeneouspopulation group level

K.T. Geurs, B. van Wee / Journal of Transport Geography 12 (2004) 127–140 129

Land-use

supply

Demand

Travel

supply

Demand

TemporalComponent

opening hours of shops and services

available time for activities

IndividualComponent

income, gender, education level

vehicle ownership, etc.

adapted from:

availabletime

travel demand

Land use Travel Behaviorand

where a person wants

to go

when a person can

travel

a persons ability to

travel

a persons desire to

travel

Static GIS data

New micro level data from mobile phones+ = Dynamic Land Use

Sunday, August 12, 2012

Page 5: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

DataZoning Data Mobile Phone

• GIS shapefiles at the parcel level

• Provided by MASSDOT

• Aggregated to 5 zoning classification: Residential, Commercial, Industrial, Parks, Other

• ~600,000 Users in the Boston Metropolitan Area (Pop: ~3 million, Mkt Share: 30%-50%)

• 1 month of Call Detail Records (CDR) for voice calls and text messages.

• Each event geo-tagged with (Lat, Lon) triangulated from nearby towers

Grid• Partition the region into 200m X 200m cells

• Each cell is labeled with the most prevalent zoning type within

• All mobile phone events are aggregated hourly into an average week

5

Grid provides privacy

Sunday, August 12, 2012

Page 6: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

6Sunday, August 12, 2012

Page 7: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

ZONING DATA

7

Mon Tue Wed Thu Fri Sat Sun20406080

100120

Abs.

Act

.

Mon Tue Wed Thu Fri Sat Sun

0

1

Mon Tue Wed Thu Fri Sat Sun

0

0.2

Hour

Parks Other

CBD

40km

Sunday, August 12, 2012

Page 8: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

8

Results: Absolute Activity

Mon Tue Wed Thu Fri Sat Sun20406080

100120

Abs.

Act

.

Mon Tue Wed Thu Fri Sat Sun

0

1

Mon Tue Wed Thu Fri Sat Sun

0

0.2

HourResidential Commercial Industrial Conservation Other

Mon Tue Wed Thu Fri Sat Sun20406080

100120

Abs.

Act

.

Mon Tue Wed Thu Fri Sat Sun

0

1

Mon Tue Wed Thu Fri Sat Sun

0

0.2

Hour

Parks Other

Zoned use & Dynamic Population

Downtown Boston is mostly zoned as OTHER thus more activity is related to more people.

Sunday, August 12, 2012

Page 9: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

9

Results: Normalized Activity

Mon Tue Wed Thu Fri Sat Sun20406080

100120

Abs.

Act

.

Mon Tue Wed Thu Fri Sat Sun

0

1

Mon Tue Wed Thu Fri Sat Sun

0

0.2

HourResidential Commercial Industrial Conservation Other

Mon Tue Wed Thu Fri Sat Sun20406080

100120

Abs.

Act

.

Mon Tue Wed Thu Fri Sat Sun

0

1

Mon Tue Wed Thu Fri Sat Sun

0

0.2

Hour

Parks Other

V. DESCRIPTIVE STATISTICS

We first examine the relationship between mobile phone activity and land use at the

macro, city-wide scale. Figure 2 displays time series of mobile phone activity summed over

all cells of a given zoning classification. Examining absolute counts reveals that some zoning

classifications have much higher activity than others. Residential areas have the lowest

average absolute activity despite being by far the most prominent zoning classification.

This is likely due to the spatial distribution population. The downtown area of a city has

many more individuals and phones, but is not zoned as residential. Whichever land use

dominates in this region will have higher activity counts. Because of this, we develop two

types of normalization. The first rescales time series such that each has zero mean and

unit variance (commonly known as z-scoring). Each time series now displays how much

activity deviates from average activity levels in units of standard deviations. This allows

comparison between cells that may have different absolute activity levels. Mathematically,

the normalized activity of cell (i,j) is given by:

anormij (t) =aabsij (t)− µaabsij

σaabsij

(1)

From the z-scored time series, we see the largest signal in the data is the circadian rhythm

of the city. Residents wake up, go to sleep, and wake again the next day. The rise and fall

of activity in each zone, however, is not the result of users moving into and out of a region,

but rather, is due to an uneven distribution of phone use across the day. To account for

this, during each hour, we subtract the average normalized activity of the entire city from

the activity at each given cell. Residual activity can be interpreted as the amount of mobile

phone activity in a region, at a given time, relative to the expected mobile phone activity

at that hour. Mathematically, the residual activity is calculated as follows:

aresij (t) = anormij (t)− anorm(t) (2)

where anorm(t) is the normalized activity averaged over all cells at each particular time.

Averaging the residual time series for each zoning classification reveals patterns related

to travel behavior. The most notable signal is the inverse relationship between residual

activity in residential and commercial areas. Residential areas have higher residual activity

6

Zoned use & Dynamic Population

Similarities are related to how individuals use mobile phones, not how they move.

Sunday, August 12, 2012

Page 10: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

Mon Tue Wed Thu Fri Sat SunHour

Residential Industrial Parks Other

10

Results: Residual Activity

Inverse relationship between RESIDENTIAL and COMMERCIAL

Weekends are quantitatively different than weekdays.

V. DESCRIPTIVE STATISTICS

We first examine the relationship between mobile phone activity and land use at the

macro, city-wide scale. Figure 2 displays time series of mobile phone activity summed over

all cells of a given zoning classification. Examining absolute counts reveals that some zoning

classifications have much higher activity than others. Residential areas have the lowest

average absolute activity despite being by far the most prominent zoning classification.

This is likely due to the spatial distribution population. The downtown area of a city has

many more individuals and phones, but is not zoned as residential. Whichever land use

dominates in this region will have higher activity counts. Because of this, we develop two

types of normalization. The first rescales time series such that each has zero mean and

unit variance (commonly known as z-scoring). Each time series now displays how much

activity deviates from average activity levels in units of standard deviations. This allows

comparison between cells that may have different absolute activity levels. Mathematically,

the normalized activity of cell (i,j) is given by:

anormij (t) =aabsij (t)− µaabsij

σaabsij

(1)

From the z-scored time series, we see the largest signal in the data is the circadian rhythm

of the city. Residents wake up, go to sleep, and wake again the next day. The rise and fall

of activity in each zone, however, is not the result of users moving into and out of a region,

but rather, is due to an uneven distribution of phone use across the day. To account for

this, during each hour, we subtract the average normalized activity of the entire city from

the activity at each given cell. Residual activity can be interpreted as the amount of mobile

phone activity in a region, at a given time, relative to the expected mobile phone activity

at that hour. Mathematically, the residual activity is calculated as follows:

aresij (t) = anormij (t)− anorm(t) (2)

where anorm(t) is the normalized activity averaged over all cells at each particular time.

Averaging the residual time series for each zoning classification reveals patterns related

to travel behavior. The most notable signal is the inverse relationship between residual

activity in residential and commercial areas. Residential areas have higher residual activity

6

Zoned use & Dynamic Population

Sunday, August 12, 2012

Page 11: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

11

Mon Tue Wed Thu Fri Sat Sun20406080

100120

Abs.

Act

.

Mon Tue Wed Thu Fri Sat Sun

0

1

Mon Tue Wed Thu Fri Sat Sun

0

0.2

Hour

Parks Other

Zoned use & Dynamic Population

static population

phone use

movement

Sunday, August 12, 2012

Page 12: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

Absolute Activity

12

Results:

Low Activity High Activity

Sunday, August 12, 2012

Page 13: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

13

Methods - Supervised Learning: Random Decision Tree

Contents

1. 11.1. 1

1

Each classifier, h(x,Θk), casts a weighted vote for every input. The weightedmajority of these votes is used to make predictions, c.

1.1.

1

= c

Inferring Land Use

Contents

1. 11.1. 1

1

(1) Θ13 < y3

1.1.

1

Contents

1. 11.1. 1

1

(1) Θ11 < y1

1.1.

1

Contents

1. 11.1. 1

1

(1) Θ12 < y2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c3 c4

1.1.

1

Sunday, August 12, 2012

Page 14: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

14

Random Forest:

6 L. BREIMAN

generated as the counts in N boxes resulting from N darts thrown at random at the boxes,where N is number of examples in the training set. In random split selection ! consists ofa number of independent random integers between 1 and K . The nature and dimensionalityof ! depends on its use in tree construction.

After a large number of trees is generated, they vote for the most popular class. We callthese procedures random forests.

Definition 1.1. A random forest is a classifier consisting of a collection of tree-structuredclassifiers {h(x, !k), k = 1, . . .} where the {!k} are independent identically distributedrandom vectors and each tree casts a unit vote for the most popular class at input x.

1.2. Outline of paper

Section 2 gives some theoretical background for random forests. Use of the Strong Lawof Large Numbers shows that they always converge so that overfitting is not a problem.We give a simplified and extended version of the Amit and Geman (1997) analysisto show that the accuracy of a random forest depends on the strength of the indivi-dual tree classifiers and a measure of the dependence between them (see Section 2 fordefinitions).

Section 3 introduces forests using the random selection of features at each node todetermine the split. An important question is how many features to select at each node. Forguidance, internal estimates of the generalization error, classifier strength and dependenceare computed. These are called out-of-bag estimates and are reviewed in Section 4. Section5 and 6 give empirical results for two different forms of random features. The first usesrandom selection from the original inputs; the second uses random linear combinations ofinputs. The results compare favorably to Adaboost.

The results turn out to be insensitive to the number of features selected to split each node.Usually, selecting one or two features gives near optimum results. To explore this and relateit to strength and correlation, an empirical study is carried out in Section 7.

Adaboost has no random elements and grows an ensemble of trees by successive reweight-ings of the training set where the current weights depend on the past history of the ensembleformation. But just as a deterministic random number generator can give a good imitationof randomness, my belief is that in its later stages Adaboost is emulating a random forest.Evidence for this conjecture is given in Section 8.

Important recent problems, i.e., medical diagnosis and document retrieval, often have theproperty that there are many input variables, often in the hundreds or thousands, with eachone containing only a small amount of information. A single tree classifier will then haveaccuracy only slightly better than a random choice of class. But combining trees grownusing random features can produce improved accuracy. In Section 9 we experiment on asimulated data set with 1,000 input variables, 1,000 examples in the training set and a 4,000example test set. Accuracy comparable to the Bayes rate is achieved.

In many applications, understanding of the mechanism of the random forest “black box”is needed. Section 10 makes a start on this by computing internal estimates of variableimportance and binding these together by reuse runs.

Contents

1. 11.1. 1

1

(1) Θ13 < y3

1.1.

1

Contents

1. 11.1. 1

1

(1) Θ11 < y1

1.1.

1

Contents

1. 11.1. 1

1

(1) Θ12 < y2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c3 c4

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c3 c4

1.1.

1

... ...

Contents

1. 11.1. 1

1

(1) Θk1 < z1

1.1.

1

Contents

1. 11.1. 1

1

(1) Θk2 < z2

1.1.

1

Contents

1. 11.1. 1

1

(1) Θk3 < z3

1.1.

1

(Breiman 2001)

Contents

1. 11.1. 1

1

Each classifier, h(x,Θk), casts a weighted vote for every input. The weightedmajority of these votes is used to make predictions, c.

x is a T × 1 vectorΘk is a m× 1 vector, (m < T )

(1) err =N −

�Ni I(ci = j)

N1.1.

1

Methods - Supervised Learning: Random Forest

Inferring Land Use

Sunday, August 12, 2012

Page 15: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

15

Random Forest:Contents

1. 11.1. 1

1

(1) Θ13 < y3

1.1.

1

Contents

1. 11.1. 1

1

(1) Θ11 < y1

1.1.

1

Contents

1. 11.1. 1

1

(1) Θ12 < y2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c3 c4

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c1 c2

1.1.

1

Contents

1. 11.1. 1

1

(1) c3 c4

1.1.

1

... ...

Contents

1. 11.1. 1

1

(1) Θk1 < z1

1.1.

1

Contents

1. 11.1. 1

1

(1) Θk2 < z2

1.1.

1

Contents

1. 11.1. 1

1

(1) Θk3 < z3

1.1.

1

(Breiman 2001)

Contents

1. 11.1. 1

1

Each classifier, h(x,Θk), casts a weighted vote for every input. The weightedmajority of these votes is used to make predictions, c.

x is a T × 1 vectorΘk is a m× 1 vector, (m < T )

(1) err =N −

�Ni I(ci = j)

N1.1.

1

Contents

1. 11.1. 1

1

Each classifier, h(x,Θk), casts a weighted vote for every input. The weightedmajority of these votes is used to make predictions, c.x is a T × 1 vectorΘk is a m× 1 vector, (m < T )

(1) perf =

�Ni I(ci = j)

N1.1.

1

{c2 c2 c3 c2 c1 c2 c1 c4 }... ...

6 L. BREIMAN

generated as the counts in N boxes resulting from N darts thrown at random at the boxes,where N is number of examples in the training set. In random split selection ! consists ofa number of independent random integers between 1 and K . The nature and dimensionalityof ! depends on its use in tree construction.

After a large number of trees is generated, they vote for the most popular class. We callthese procedures random forests.

Definition 1.1. A random forest is a classifier consisting of a collection of tree-structuredclassifiers {h(x, !k), k = 1, . . .} where the {!k} are independent identically distributedrandom vectors and each tree casts a unit vote for the most popular class at input x.

1.2. Outline of paper

Section 2 gives some theoretical background for random forests. Use of the Strong Lawof Large Numbers shows that they always converge so that overfitting is not a problem.We give a simplified and extended version of the Amit and Geman (1997) analysisto show that the accuracy of a random forest depends on the strength of the indivi-dual tree classifiers and a measure of the dependence between them (see Section 2 fordefinitions).

Section 3 introduces forests using the random selection of features at each node todetermine the split. An important question is how many features to select at each node. Forguidance, internal estimates of the generalization error, classifier strength and dependenceare computed. These are called out-of-bag estimates and are reviewed in Section 4. Section5 and 6 give empirical results for two different forms of random features. The first usesrandom selection from the original inputs; the second uses random linear combinations ofinputs. The results compare favorably to Adaboost.

The results turn out to be insensitive to the number of features selected to split each node.Usually, selecting one or two features gives near optimum results. To explore this and relateit to strength and correlation, an empirical study is carried out in Section 7.

Adaboost has no random elements and grows an ensemble of trees by successive reweight-ings of the training set where the current weights depend on the past history of the ensembleformation. But just as a deterministic random number generator can give a good imitationof randomness, my belief is that in its later stages Adaboost is emulating a random forest.Evidence for this conjecture is given in Section 8.

Important recent problems, i.e., medical diagnosis and document retrieval, often have theproperty that there are many input variables, often in the hundreds or thousands, with eachone containing only a small amount of information. A single tree classifier will then haveaccuracy only slightly better than a random choice of class. But combining trees grownusing random features can produce improved accuracy. In Section 9 we experiment on asimulated data set with 1,000 input variables, 1,000 examples in the training set and a 4,000example test set. Accuracy comparable to the Bayes rate is achieved.

In many applications, understanding of the mechanism of the random forest “black box”is needed. Section 10 makes a start on this by computing internal estimates of variableimportance and binding these together by reuse runs.

} =

6 L. BREIMAN

generated as the counts in N boxes resulting from N darts thrown at random at the boxes,where N is number of examples in the training set. In random split selection ! consists ofa number of independent random integers between 1 and K . The nature and dimensionalityof ! depends on its use in tree construction.

After a large number of trees is generated, they vote for the most popular class. We callthese procedures random forests.

Definition 1.1. A random forest is a classifier consisting of a collection of tree-structuredclassifiers {h(x, !k), k = 1, . . .} where the {!k} are independent identically distributedrandom vectors and each tree casts a unit vote for the most popular class at input x.

1.2. Outline of paper

Section 2 gives some theoretical background for random forests. Use of the Strong Lawof Large Numbers shows that they always converge so that overfitting is not a problem.We give a simplified and extended version of the Amit and Geman (1997) analysisto show that the accuracy of a random forest depends on the strength of the indivi-dual tree classifiers and a measure of the dependence between them (see Section 2 fordefinitions).

Section 3 introduces forests using the random selection of features at each node todetermine the split. An important question is how many features to select at each node. Forguidance, internal estimates of the generalization error, classifier strength and dependenceare computed. These are called out-of-bag estimates and are reviewed in Section 4. Section5 and 6 give empirical results for two different forms of random features. The first usesrandom selection from the original inputs; the second uses random linear combinations ofinputs. The results compare favorably to Adaboost.

The results turn out to be insensitive to the number of features selected to split each node.Usually, selecting one or two features gives near optimum results. To explore this and relateit to strength and correlation, an empirical study is carried out in Section 7.

Adaboost has no random elements and grows an ensemble of trees by successive reweight-ings of the training set where the current weights depend on the past history of the ensembleformation. But just as a deterministic random number generator can give a good imitationof randomness, my belief is that in its later stages Adaboost is emulating a random forest.Evidence for this conjecture is given in Section 8.

Important recent problems, i.e., medical diagnosis and document retrieval, often have theproperty that there are many input variables, often in the hundreds or thousands, with eachone containing only a small amount of information. A single tree classifier will then haveaccuracy only slightly better than a random choice of class. But combining trees grownusing random features can produce improved accuracy. In Section 9 we experiment on asimulated data set with 1,000 input variables, 1,000 examples in the training set and a 4,000example test set. Accuracy comparable to the Bayes rate is achieved.

In many applications, understanding of the mechanism of the random forest “black box”is needed. Section 10 makes a start on this by computing internal estimates of variableimportance and binding these together by reuse runs.

c = mode(^

6 L. BREIMAN

generated as the counts in N boxes resulting from N darts thrown at random at the boxes,where N is number of examples in the training set. In random split selection ! consists ofa number of independent random integers between 1 and K . The nature and dimensionalityof ! depends on its use in tree construction.

After a large number of trees is generated, they vote for the most popular class. We callthese procedures random forests.

Definition 1.1. A random forest is a classifier consisting of a collection of tree-structuredclassifiers {h(x, !k), k = 1, . . .} where the {!k} are independent identically distributedrandom vectors and each tree casts a unit vote for the most popular class at input x.

1.2. Outline of paper

Section 2 gives some theoretical background for random forests. Use of the Strong Lawof Large Numbers shows that they always converge so that overfitting is not a problem.We give a simplified and extended version of the Amit and Geman (1997) analysisto show that the accuracy of a random forest depends on the strength of the indivi-dual tree classifiers and a measure of the dependence between them (see Section 2 fordefinitions).

Section 3 introduces forests using the random selection of features at each node todetermine the split. An important question is how many features to select at each node. Forguidance, internal estimates of the generalization error, classifier strength and dependenceare computed. These are called out-of-bag estimates and are reviewed in Section 4. Section5 and 6 give empirical results for two different forms of random features. The first usesrandom selection from the original inputs; the second uses random linear combinations ofinputs. The results compare favorably to Adaboost.

The results turn out to be insensitive to the number of features selected to split each node.Usually, selecting one or two features gives near optimum results. To explore this and relateit to strength and correlation, an empirical study is carried out in Section 7.

Adaboost has no random elements and grows an ensemble of trees by successive reweight-ings of the training set where the current weights depend on the past history of the ensembleformation. But just as a deterministic random number generator can give a good imitationof randomness, my belief is that in its later stages Adaboost is emulating a random forest.Evidence for this conjecture is given in Section 8.

Important recent problems, i.e., medical diagnosis and document retrieval, often have theproperty that there are many input variables, often in the hundreds or thousands, with eachone containing only a small amount of information. A single tree classifier will then haveaccuracy only slightly better than a random choice of class. But combining trees grownusing random features can produce improved accuracy. In Section 9 we experiment on asimulated data set with 1,000 input variables, 1,000 examples in the training set and a 4,000example test set. Accuracy comparable to the Bayes rate is achieved.

In many applications, understanding of the mechanism of the random forest “black box”is needed. Section 10 makes a start on this by computing internal estimates of variableimportance and binding these together by reuse runs.

})^

Methods - Supervised Learning: Random Forest

Inferring Land Use

Sunday, August 12, 2012

Page 16: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

16

Results:

TABLE II: Random forest classification results. The threshold refers the total number

of phone calls required in each cell over the month of data collected to be considered for

classification. Total accuracy is defined as the fraction of correctly classified cells. The

share refers to the percentage of cells actually zoned for each type of use. Element (i, j) of

the cross tabulation matrix can be interpreted as the fraction of actual zoned uses of type

i that were classified as as use j by the random forest. Thus the high percentages in the

Res column can be interpreted as the algorithm heavily favoring residential use due to its

overwhelming share of overall uses.

Confusion Matrix

Thresh Total Accuracy Weights Res Com Ind Prk Oth

500 0.538 ( 0.60 0.10 0.10 0.10 0.10) 0.62 0.21 0.15 0.01 0.01

0.30 0.48 0.19 0.00 0.02

0.33 0.27 0.38 0.00 0.02

Res Com Ind Prk Oth 0.52 0.26 0.18 0.02 0.02

Share: 0.74 0.09 0.08 0.04 0.05 0.37 0.28 0.25 0.00 0.10

12

Res Com

Ind Prk Oth

Inferring Land Use

Sunday, August 12, 2012

Page 17: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

17

Results:

TABLE II: Random forest classification results. The threshold refers the total number

of phone calls required in each cell over the month of data collected to be considered for

classification. Total accuracy is defined as the fraction of correctly classified cells. The

share refers to the percentage of cells actually zoned for each type of use. Element (i, j) of

the cross tabulation matrix can be interpreted as the fraction of actual zoned uses of type

i that were classified as as use j by the random forest. Thus the high percentages in the

Res column can be interpreted as the algorithm heavily favoring residential use due to its

overwhelming share of overall uses.

Confusion Matrix

Thresh Total Accuracy Weights Res Com Ind Prk Oth

500 0.538 ( 0.60 0.10 0.10 0.10 0.10) 0.62 0.21 0.15 0.01 0.01

0.30 0.48 0.19 0.00 0.02

0.33 0.27 0.38 0.00 0.02

Res Com Ind Prk Oth 0.52 0.26 0.18 0.02 0.02

Share: 0.74 0.09 0.08 0.04 0.05 0.37 0.28 0.25 0.00 0.10

12

= 1++++

Res Com

Ind Prk Oth

Inferring Land Use

Sunday, August 12, 2012

Page 18: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

18

Results:

TABLE III: Random forest classification results. In this case, residential land has been

removed from consideration. The algorithm is now able to correctly predict much large

fractions of scarce land types

Confusion Matrix

Thresh Total Accuracy Weights Res Prk Ind Con Oth

500 0.396 ( N/A 0.30 0.30 0.20 0.20) N/A N/A N/A N/A N/A

N/A 0.50 0.19 0.11 0.19

N/A 0.27 0.37 0.12 0.24

Res Com Ind Prk Oth N/A 0.31 0.18 0.29 0.21

Share: 0.00 0.33 0.31 0.16 0.20 N/A 0.26 0.24 0.15 0.34

14

Inferring Land Use

Sunday, August 12, 2012

Page 19: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

19

Classification Error Analysis

Mon Tue Wed Thu Fri Sat Sun

Residential

Mon Tue Wed Thu Fri Sat Sun

Commercial

Mon Tue Wed Thu Fri Sat Sun

Industrial

Mon Tue Wed Thu Fri Sat Sun

Parks

Mon Tue Wed Thu Fri Sat Sun

Other

I II III

I II III I II III

I II III I II III

Group I

Group II

Group III

Cells correctly predicted to be a

Cells of a given use incorrectly

Cells of a different use incor-rectly predicted to be a given

Classification Error Analysis

Mon Tue Wed Thu Fri Sat Sun

Residential

Mon Tue Wed Thu Fri Sat Sun

Commercial

Mon Tue Wed Thu Fri Sat Sun

Industrial

Mon Tue Wed Thu Fri Sat Sun

Parks

Mon Tue Wed Thu Fri Sat Sun

Other

I II III

I II III I II III

I II III I II III

Group I

Group II

Group III

Cells correctly predicted to be a

Cells of a given use incorrectly

Cells of a different use incor-rectly predicted to be a given

Classification Error Analysis

Mon Tue Wed Thu Fri Sat Sun

Residential

Mon Tue Wed Thu Fri Sat Sun

Commercial

Mon Tue Wed Thu Fri Sat Sun

Industrial

Mon Tue Wed Thu Fri Sat Sun

Parks

Mon Tue Wed Thu Fri Sat Sun

Other

I II III

I II III I II III

I II III I II III

Group I

Group II

Group III

Cells correctly predicted to be a

Cells of a given use incorrectly

Cells of a different use incor-rectly predicted to be a given

Classification Error AnalysisInferring Land Use

Sunday, August 12, 2012

Page 20: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

20

Robustness Across Cities

Cross Tabulation

Thresh Avg. Error Cutoffs Res Com Ind Par Oth

100 0.468 ( 0.50 0.12 0.12 0.12 0.12) 0.66 0.09 0.08 0.05 0.12

0.53 0.15 0.10 0.05 0.16

0.52 0.10 0.19 0.05 0.14

Res Com Ind Prk Oth 0.59 0.10 0.11 0.06 0.13

Share: 0.63 0.09 0.09 0.09 0.10 0.55 0.12 0.10 0.06 0.17

60

Sunday, August 12, 2012

Page 21: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

21

Robustness Across Cities

Mon Tue Wed Thu Fri Sat Sun20

40

60

80Ab

s. A

ct.

City−wide Time Series | Chicago LBS Data

Mon Tue Wed Thu Fri Sat Sun−0.5

0

0.5

Nor

m. A

ct.

Mon Tue Wed Thu Fri Sat Sun

−0.1−0.05

00.050.1

Hour

Res

. Act

.

Sunday, August 12, 2012

Page 22: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

Inferring Land Use

22

Summary: • Introduced a way to reconcile and normalize traditional and new data sources

• Used temporal activity patterns to identify land uses

• Performed error analysis to determine why classifications failed

• Similar results obtained for different cities

Implications: • Mobile phone activity can be radically different depending on zoned land use.

• Traditional zoning maps should be augmented to account for real-time use.

• Dynamic land use patterns seem relatively consistent despite differences in history and regulatory classification

Sunday, August 12, 2012

Page 23: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

Future Contributions

23

• Use more sophisticated feature spaces for zoning classification.

• Introduce new data sources.

• Measure urban system dynamics in multiple cities around the globe to identify similarities and differences in behavior.

• Apply normalization metrics to other data sources.

• Combine our understanding of temporal land use with measurements of human mobility to develop better models of intra-city travel behavior.

Sunday, August 12, 2012

Page 24: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

Thank You!

24

Questions?

Sunday, August 12, 2012

Page 25: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

25

Errors

Sunday, August 12, 2012

Page 26: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

26

Correlation with Static Population

Sunday, August 12, 2012

Page 27: Inferring land use from mobile phone activityurbcomp2012/Presentations/UrbComp12_Paper09... · We first examine the relationship between mobile phone activity and land use at the

27

Mon Tue Wed Thu Fri Sat Sun

Residential

Mon Tue Wed Thu Fri Sat Sun

Commercial

Mon Tue Wed Thu Fri Sat Sun

Industrial

Mon Tue Wed Thu Fri Sat Sun

Parks

Mon Tue Wed Thu Fri Sat Sun

Other

I II III

I II III I II III

I II III I II III

Group I

Group II

Group III

Cells correctly predicted to be a

Cells of a given use incorrectly

Cells of a different use incor-rectly predicted to be a given

Classification Error Analysis

Sunday, August 12, 2012