Published online 13 June 2012 Inferring social network ...sjrob/Pubs/psorakis_roberts_interface_2012.pdf · network of social afﬁliations needs to be inferred through proxies such

J. R. Soc. Interface (2012) 9, 3055–3066

*Author for c

Electronic sup10.1098/rsif.2

doi:10.1098/rsif.2012.0223Published online 13 June 2012

Received 20 MAccepted 15 M

Inferring social network structure inecological systems from spatio-

temporal data streamsIoannis Psorakis1,2,*, Stephen J. Roberts1, Iead Rezek1

and Ben C. Sheldon2

1Pattern Analysis and Machine Learning Research Group, and 2Edward Grey Institute,University of Oxford, Oxford, UK

We propose a methodology for extracting social network structure from spatio-temporal data-sets that describe timestamped occurrences of individuals. Our approach identifies temporalregions of dense agent activity and links are drawn between individuals based on their co-occurrences across these ‘gathering events’. The statistical significance of these connectionsis then tested against an appropriate null model. Such a framework allows us to exploit thewealth of analytical and computational tools of network analysis in settings where the under-lying connectivity pattern between interacting agents (commonly termed the adjacencymatrix) is not given a priori. We perform experiments on two large-scale datasets (greaterthan 106 points) of great tit Parus major wild bird foraging records and illustrate the useof this approach by examining the temporal dynamics of pairing behaviour, a process thatwas previously very hard to observe. We show that established pair bonds are maintainedcontinuously, whereas new pair bonds form at variable times before breeding, but are charac-terized by a rapid development of network proximity. The method proposed here is general,and can be applied to any system with information about the temporal co-occurrence ofinteracting agents.

Keywords: network analysis; spatio-temporal data streams;animal social networks

1. INTRODUCTION

We use the terms graph or network to describe the sim-plified version of the pattern of interactions in a system,such as an animal population, where nodes areindividual entities and edges represent some form ofassociation, interaction, similarity or behavioural corre-lation between nodes. In the same way that a map is asimplified (though useful) version of a landscape, a net-work describes the topology of a real-world system byfocusing on the connectivity patterns of its individualcomponents [1].

The key motivation for employing network analysistools is that the web of interconnections betweenindividuals can provide insights into the underlyingmechanisms that govern the system under study [2]. Forexample, in an ecological context, the position and roleof animals in the network may have important fitnessconsequences [3] both for the individual and thepopulation as a whole [4]. Additionally, the networkparadigm gives us the flexibility to look at the system atvarious resolutions and model any type of interaction;sexual, cooperative, competitive, etc [4].

orrespondence ([email protected]).

plementary material is available at http://dx.doi.org/012.0223 or via http://rsif.royalsocietypublishing.org.

arch 2012ay 2012 3055

Despite the advantages of the network paradigm andthe wealth of analytical and computational tools for net-work analysis [5–8], the problem of capturing any givensystem as a graph is not always trivial. Not all systemspossess an obvious ‘web-like’ structure (such as theInternet), where the interconnections between participat-ing entities are apparent from direct observation(computers that are connected through physical cables).Additionally, collected data (from field studies, sensorobservations, World Wide Web, etc.) may not capturethe associations between the observed agents, thus norelational structure can be directly defined. For example,in systems, such as animal populations, the underlyingnetwork of social affiliations needs to be inferred throughproxies such as the behaviour (mobility patterns, foraginghabits, etc.) of individual animals.

This work focuses on the problem of finding theunderlying social network structure of a populationthat can only be observed through the spatial trajectoriesof its individual members. We use as a case study asetting where individual wild birds are marked withtransponder devices and through appropriate logginghardware we are able to identify their position at varioussites in their natural habitat. The observation datacollected in this manner consist of a long stream of time-stamped records, where no obvious interaction or social

This journal is q 2012 The Royal Society

mailto:[email protected]

http://dx.doi.org/10.1098/rsif.2012.0223



http://rsif.royalsocietypublishing.org

http://rsif.royalsocietypublishing.org

http://crossmark.crossref.org/dialog/?doi=10.1098/rsif.2012.0223&domain=pdf&date_stamp=2012-06-13

Table 1. Sample format of our data.

bird ID timestamp location ID

N199642 1/9/2007 10.02:15 (am) 1aTE80535 1/9/2007 10.02:30 (am) 1aV260952 1/9/2007 10.02:30 (am) 2bV260952 1/9/2007 10.02:45 (am) 2bN199642 1/9/2007 10.12:15 (am) 1c. . . . . . . . .

3056 Inferring social network structure I. Psorakis et al.

affiliation is apparent. By assuming that social structureis a latent factor that affects the way birds visit locations(in the sense that socially affiliated individuals have simi-lar mobility patterns), we propose a methodology thatextracts a social network from such a spatio-temporaldata stream. Although we demonstrate our method inan ecological context, our approach can be generalizedto any setting where agents perform timestamped‘check-ins’ at various locations.

The paper is organized as follows. In §2, we outlineour experiment settings and discuss our data format. In§3, we present our contribution, which is a methodologyfor extracting network structure from timestampedobservation data. In §4, we apply our method to thewild bird dataset and show that the extracted networksreflect actual processes that take place in the population,by focusing on mating pair formation. We conclude thispaper in §5 by discussing the next steps of our research,both in terms of method development and data collec-tion extensions. The Matlab code that implements themethods presented in the paper is made available1

to the community.

2. DATA COLLECTION

This work lies within the context of a large ongoingstudy of the great tit Parus major population atWytham Woods near Oxford, UK. Thousands of indi-vidual birds are marked with transponders and a gridof sensor-enabled locations generates hundreds of thou-sands of records each winter. At each one of the 67locations in the forest, there is a feeder that acts as anattraction point for foraging individuals. By placingappropriate logging hardware at the feeder, we areable to record the presence of each individual bird.Owing to equipment constraints, there were only 16 log-gers available at any time, and these were thus rotatedaround the 67 locations following a structured random-ized design, so that each of eight approximately equallysized sections of the site always had two active loggersin it. More details on our experiment set-up is providedin the electronic supplementary material.

The data generated from this scheme consist of along stream of timestamped observations as shown intable 1. Each row represents a single record that cap-tures the ID of the bird along with the time andlocation where the foraging event took place. In thisformat, shown in table 1, our data stream is only atransactions table in a relational database context,which restricts our analysis to a handful of relativelysimple counting operations such as finding the totalappearances of a given bird, total birds that visited aspecific feeder, etc.

What we are interested in is to find an appropriatemapping of this spatio-temporal stream to a relationalspace, where social affiliations between individuals arerevealed by the similarity of their feeder visitation pat-terns. We seek to characterize the overall social networkof the population of marked birds and explore theability of this approach to recover relationships betweenmated pairs of individuals observed independently

1See http://www.robots.ox.ac.uk/~parg/software.html.

J. R. Soc. Interface (2012)

during breeding season data collection. We further wishto explore the temporal dynamics of the formation ofmated pairs. In biological terms, the process by whichpairs of individuals develop relationships that lead tomating is poorly understood in most natural populations,since the majority of work involves observations of pairs atthe time of breeding, after pair formation has occurred. Asa consequence, we have little knowledge of when suchrelationships form, and when they become distinguishablefrom other social relationships between individuals.

In §3, we introduce a method, based on the abovegoals, that extracts network structure given suchspatio-temporal data. In §4, we present the applicationof this approach to the P. major dataset.

3. NETWORK INFERENCE FROMSPATIO-TEMPORAL DATA

3.1. The time-window problem

A typical approach for building a network from datasuch as those presented in §2 would involve discretizingthe stream using a fixed aggregation or time window Dtand assuming that if two individuals are recordedwithin an interval Dt then there is a link betweenthem in the network [9–13]. The most obvious problemwith this approach is that of finding the appropriate sizefor the time window. An inappropriately small Dt maylead to a network that does not capture important con-nections, while a very large Dt would overload the graphwith ‘junk’ links.

Using our wild-bird data as an example, we take asingle day’s worth of observations (in a format similarto the one shown in table 1) and split that stream intotime intervals of size Dt. We then place links betweenthe N individual birds (nodes) based on the number oftimes they were recorded within a temporal distance ofDt. We seek to examine the changes that take place inthe network as we vary the time window size by monitor-ing the network load (NL), which is the fraction of M linksin the network over all possible pair combinations12ðN 2 � NÞ of N nodes.2 We can see in figure 1a thatNL increases along with the size of Dt, because morelinks are placed between nodes. An example of how net-work topology changes for various selections of timewindow size is shown in figure 1b, while Krings et al.[13] have performed similar experimentation considering

2In this example and throughout this paper we are consideringnetworks that are undirected with nodes that have no self-edges.

http://www.robots.ox.ac.uk/~parg/software.html

http://www.robots.ox.ac.uk/~parg/software.html

500 1000 1500 20000

0.2

0.4

0.6

0.8

1.0

Δt size (s)

netw

ork

load

(N

L)

Δt = 10 s, NL = 0.1103

Δt = 60 s, NL = 0.2091

Δt = 600 s, NL = 0.4483

Δt = 10 s, M = 95 Δt = 60 s, M = 180

Δt = 600 s, M = 978

(a)

(b)

Figure 1. (a) We plot the network load for various time window sizes, spanning from 10 s to half an hour. We can see thatespecially for early increases of Dt, there is a large inclusion of links in the network. We also mark three cases of different timewindow sizes (dashed vertical line) and show in (b) how the graph topology changes based on the Dt value. (Online version incolour.)

Inferring social network structure I. Psorakis et al. 3057

more network metrics such as average degree, averageweight, clustering coefficients, etc.

Between all these different network topologies thatresult from varying Dt, there is no direct way of showingwhich one is the most appropriate. Additionally, even ifwe had some prior knowledge on the appropriate timewindow size or even a specific quality function for find-ing its optimum value, we have still made the strongassumption that Dt is fixed throughout the datastream. This corresponds to the belief that the ‘inter-action radius’ between individuals is constant acrossour observation period and is not affected by temporalchanges in the overall system.

In the current work, we shall pursue a differentapproach for building networks from spatio-temporalrecords, which exploits the inhomogeneous density pro-file of our data stream thus avoiding schemes such asmultiple runs [13] in order to select an appropriate Dt.This methodology, which we will call GMMEvents(Gaussian mixture model for event streams) is comple-mented with an appropriate null model that allows us


to distinguish between links that denote social tie andthe ones that result from coincidence.

3.2. Identification of gathering events

Let our spatio-temporal data D, a sample of which weshowed in table 1, be represented in the formD ¼ fbz ; tz ; ‘zgZz¼1, where Z is the total number ofrecords or tuples in our database (e.g. the number ofrows of table 1). If we take a single tuple fbz ; tz ; ‘zg,we read it as ‘the bird bz appeared at time tz at the feed-ing location ‘z ’. Note that ftzgZz¼1 denotes event time;therefore, for every timestamp tz, there is a correspond-ing bird appearance bz. Additionally, given a specificbird i out of total N birds, there can be many recordsz for which bz ¼ i, as a single individual may appearmany times in the data. Our goal is to find an appropri-ate mapping from the stream D to an adjacency matrixA [ RN�N , where aij = 0 denotes a link between birdsi and j. To keep the notation uncluttered, from now onwe will focus on the case of a single location and show

event

0 1000

obse

rvat

ion

2000 3000 4000 5000time line (s)

6000 7000 8000 9000 10 000

1event2

event3

event4

Figure 2. We plot bird arrivals as recorded at a specific location over the course of 3 h period. We can see that the visitation profileis temporally focused, consisting of bursts of bird activity. Our goal is to identify such regions of increased observation density andexamine which individuals participate in these gathering events. (Online version in colour.)

freq

uenc

y of

d(t

z)

d (tz) size (s)

106

105

104

103

102

10

1102 103 104

Figure 3. We calculate the time difference dðtzÞ ¼ tz � tz�1 between every pair of consecutive observations at each location in ourtwo data streams (seasons 2007–2008 and 2008–2009) and plot the histogram of those values on a logarithmic scale. The dðtzÞthat refer to pairs where z 2 1 is the last observation of day d 2 1 and z the first observation of day d have been omitted, in orderto avoid bias in the results (there is no bird feeding activity during night-time). Open circles, dataset 2007–2008; plus symbols,dataset 2008–2009. (Online version in colour.)


later that results can be easily generalized to themulti-site case.

Consider the plot of figure 2, which illustrates howbird arrivals at a particular feeding location arespread throughout a small sample of our observationtimeline. Each stem represents an actual sensor captureof a specific bird bz at time tz. We can see that therecords are not uniformly spread across time, but theyare ‘packed’ in small observation-dense regions.Indeed, if we take the whole data stream and extractthe histogram of the time differences dðtzÞ ¼ tz � tz�1

between every pair of consecutive observations, asseen in figure 3, we find a broad power-law tail withexponent ≃2.5 for dðtzÞ . 800. This non-Poissoniandecay of inter-record timestamps, along with the factthat most dðtzÞ take small values, implies that theobservation profile comprises temporally focused


bursts of recording activity, which can be seen asflocks of foraging individuals.

Our main hypothesis is that birds not only visit thefeeder as part of such small flocks but also have a pre-ference to the members of the flock they choose toforage with. Such regions of increased observation den-sity can be viewed as K gathering events of sociallyaffiliated birds. We seek to cluster our Z observationsin a way such that closely appearing individuals,based on their arrival timestamp tz, are assigned tothe same gathering event k.

We perform this clustering scheme using a Gaussianmixture model, with an appropriate configuration thatallows us to automatically infer the effective numberK of events/clusters (see the electronic supplementarymaterial). The result is described by an observation-to-cluster responsibility matrix G [ RZ�K , where Z is

data

dens

itypr

ofile

1

2

3

4

individuals events

(a) (b) the bird-to-bird socialnetwork

the bird-to-event bipartitenetwork

Figure 4. Our method identifies gathering events from the bursts in our observation stream as seen in (a). Then individuals areassigned to such events creating a bipartite network. In part (b), we recover the bird-to-bird social network, via an appropriateone-mode projection, based on the co-participation of individuals to these events. (Online version in colour.)


the total number of bird observations, K the number ofclusters and the elements fgz1; gz2; . . . ; gzKg of each rowdenote a membership score of a single observation z toan event k.

As there is a one-to-many correspondence between agiven bird i and timestamps tz, a single bird can berecorded many times in the observation stream or, inother words, there are many tuples ftz ; bzg for whichbz ¼ i. Therefore, we seek to map the observation-to-clus-ter matrix G [ RZ�K to a bird-to-cluster matrixB [ RN�K . We start by taking each row gz ¼ fgzkgKk¼1of G and set the largest element to 1 and all the othersto zero. This allows us to describe participation scoresgzk , and all the other measures we derive from them, asinteger-valued occurrences. For each individual birdi [ f1; . . . ;Ng, we identify the subset Zi of rows gz ofG that correspond to observations regarding i. We thusset each row bi of B as the sum bi ¼

Pz[Zi

gz . Theresulting matrix B [ RN�K can be seen as a represen-tation of a bipartite or two-mode network that is agraph with two types of nodes; N birds and K events,as shown in figure 4a. Each element bik denotes thenumber of times each bird was observed at a specificforaging group.

3.3. Building the social network

The bipartite network we extracted in §3b and shown infigure 4a describes the event participation structure ofthe bird population, which is the weighted allocationof N birds to K foraging events, encoded byB [ RN�K . Although this finding is important byitself, as it allows us to quantify the structure of suchsmall foraging groups in terms of the number, individ-ual characteristics, relatedness of their members etc,we seek to move one step further and extract the


bird-to-bird social network based on the mutualparticipation of individuals to such events.

Therefore, we seek to define an appropriate one-modeprojection B [ RN�K ! A [ RN�N , shown in figure 4,so that a link aij between a pair i, j in the resultingnetwork will express how strongly the two birds foragetogether. We start by defining co-occurrence of individ-uals i and j as the number of times they were recorded inthe same foraging group. Thus, given the event member-ship profiles bi and bj for i and j, respectively, we definethe total co-occurrences aij as aij ¼

PKk¼1 minðbik; b jkÞ,

where K is the number of foraging groups and aij is effec-tively the link weight between i and j in the resultingsocial network described by the adjacency matrixA [ RN�N . Note that other association indices, suchas the ones presented in Ginsberg & Young [14] can beused depending on the problem context.

3.4. Co-occurrences: social tie versuscoincidence

The next issue we seek to address is the statistical sig-nificance of the extracted link weights. Building theadjacency matrix A [ RN�N in the manner describedin §3a makes the very strong assumption that if twoindividuals participate in the same gathering event,they have some form of social affiliation. This assump-tion, known in the animal social network literature asthe Gambit of the Group (GoG) [15], may lead us toadjacency matrices encumbered with ‘junk’ links, pro-duced by co-occurrences that happened by chance.Such coincidences are also frequent in settings wherethere are natural peak-hours in the data collectionperiod and also when the sensor hardware act as attrac-tion points, as, for example, the bird feeders in ourstudy. Hence, we seek to define an appropriate null


model that describes how ‘statistically surprising’ agiven link weight would be, if there was no underlyingsocial preference in the foraging habits of the bird popu-lation. From previous sections, we have discussed thatobservations occur in bursts (as shown in figure 2)that denote small foraging groups of birds that arrivetogether at the feeders. This is captured by the bird-to-event matrix B [ RN�K , where each element bik inthe row vector bi denotes the number of times bird iappeared at the gathering event k.

Consider each row vector bi as a draw from a multi-nomial distribution Mðni;piÞ, with parametersni ¼

PKk¼1 bik and pik ¼ bik=ni. The values of the par-

ameter vector fpikgKk¼1 can be viewed as a preferenceprofile of a bird i to each foraging event k. If our hypoth-esis that social affiliation between birds affects eventmembership holds, then closely interacting birds i, j willhave similar preference profiles pi and pj .

Let us now propose an element shuffling s of pi sothat pi ! sðpiÞ and draw a new event occurrence vectorbð0Þi from the multinomial distribution Mðni;sðpiÞÞ.Performing this permutation and sampling scheme inde-pendently for all birds i [ f1; . . . ;Ng leads to a newbird-to-event bipartite network described byBð0Þ [ RN�K . This new matrix B (0) preserves manykey characteristics of the original data, among themthe event membership structure, because bird appear-ances remain concentrated in K regions of increasedobservation density. Quantities such as the number ofindividuals N, and the total records ni, of bird i in thedata are also retained.

The key difference introduced in B (0) is that,although a bird’s uneven participation preference piacross foraging groups is preserved (as the permutedsðpiÞ has the same entropy as pi), the shuffling s

‘breaks’ all correlations between bi and bj induced bylatent social affiliation between individuals i and j.In other words, under our null model, birds stillforage in small groups, but with no social preferenceto which other members of the group they will foragewith. We repeat this process R-times and for each gen-erated bird-to-event matrix B (0) we extract the bird-to-bird matrix A(0) using the same one-mode projectionpresented in §3c. By generating multiple instances ofA(0) in this manner, we are effectively drawing samplesfrom the ensemble or family of graphs Gð0Þ that containsall possible network configurations generated by thenull model. Our goal is to examine if our observednetwork A is an unlikely case of Gð0Þ.

The randomization process generates R values of theweight of each link between i and j. From the histogram,we get the empirical distribution Pðaij jH0Þ that denotesthe probability of having a link of weight aij given thatthe null hypothesis H0 holds. We examine how statisti-cally surprising is each observed link aij by performinga hypothesis test, given an appropriate significancelevel a, by examining the likelihood p ¼ Pðx � aij jH0Þof co-occurrences as large as aij . Note that the keypoint of a null model is that co-occurrences happenbetween individuals, but not as a result of an underlyingsocial structure. In other words, the links in A(0) areindependent under H0, hence PðAjH0Þ ¼

Qij Pðaij jH0Þ.


Thus, our significance test lies in examining how wellthis independence assumption can explain the observedco-occurrences encoded in each link of A.

3.5. Integrating information from multiplelocations

We briefly expand on our graph inference scheme to themulti-location setting. For each record ftz ; bz ; ‘zg in ourdata stream, we now have an additional term‘z [ f1; . . . ;Lg that denotes the index of the locationwhere observation z took place.

We start by segmenting our data D ¼ ftz ; bz ; ‘zgZz¼1into L streams, so that each Dð‘Þ contains records refer-ring only to location ‘. For each Dð‘Þ, we perform thenetwork extraction process as presented in §3b,c leadingto L adjacency matrices Að‘Þ [ RN‘�N‘ , where N‘ � N ,the subset of birds recorded at location ‘. Significancetests, as described in §3d, are performed independentlyfor each ‘, in order to preserve the unique visitation andlocation load statistics of each site.

Each matrix Að‘Þ [ RN‘�N‘ generated in this schemecaptures a subset of the overall connectivity profile inthe population. As the interpretation of link weightis the number of co-occurrences between twoindividuals, the overall aij is simply the summationaij ¼

PL‘¼1 að‘Þij over multiple sites.

In §4, we will demonstrate how these methodologiesare applied to the wild-bird dataset described in §2.

4. RESULTS

4.1. Application on the wild-bird dataset

We apply GMMEvents on the dataset of wild-bird fora-ging records presented in §2. Our observations consist oftwo main streams: D7,8 that covers the activity ofN7,8 ¼ 770 birds from August 2007 to March 2008 andD8,9 that spans from August 2008 to March 2009 andcontains N8,9 ¼ 753 birds.

Instead of applying our method to the whole two-season data stream directly, we start by breaking itdown into 24 h segments. Our aim is to produce a collec-tion of network snapshots that would allow us to studythe day-by-day changes in the population’s sociality. Anexample of the observation data is shown in figure 5a,where we can see the isolated observation-rich regions(blue stem lines) that refer to each particular day. Notethat the night period (no-observation zones in betweendays) acts as a natural separator in our data stream, asno bird foraging activity takes place during that time.

We proceed by breaking down each daily segmentof our data into sub-streams that correspond to Ldifferent feeding locations, shown in figure 5b for 9September 2007. We then apply GMMEvents to eachlocation ‘ separately, as co-occurrences need to bedefined both in terms of temporal and spatial proxi-mity. On each one of those feeder-specific streams forthat day, our method identifies bursts in the obser-vation density profile and builds a bipartite networkBð‘Þ between birds and gathering events, as shown infigure 5c. The weight of each link bð‘Þik denotes thenumber of times bird i appeared in the gathering

timeline

12

34

56

78

910

1112

13

8-Sep-2007

obse

rvat

ion

35462000 35468000timeline (s)

35473000 3547900035456000

9-Sep-2007 10-Sep-2007 11-Sep-2007(a)

(b)

(c) (d)

observation density profile at each feeding location, during 9-Sep-2007

observation density profile across 4 days

Figure 5. In (a), we show a segment of our data stream profile for a duration of 4 days. We pick a single day ‘data-chunk’ ofobservations and break it down into separate streams that refer to bird records at each particular location, as shown in (b).For each location-specific stream, we use our method to identify gathering events, as shown in coloured nodes on the right ofthe bipartite graph in (c). We assign birds (black nodes on the left of the graph) into such events based on their participationstrength. We project the bird-to-event bipartite graph of (c) into an one-mode network based on co-occurrences in gatheringevents, as shown in (d). We remove any links (marked with double lines) that can be explained away by the null model.(Online version in colour.)


event k. Based on §3c, we then perform one-mode pro-jection of this bipartite network into a bird-to-birdsocial network, shown in figure 5d, described by theadjacency matrix Að‘Þ. The weight of each link að‘Þij ¼PK

k minðbð‘Þik ; bð‘Þjk Þ denotes the total number of co-

occurrences between bird i and j across all K gatheringevents that took place at location ‘. The statisticalimportance of each að‘Þij is then tested againstthe null model we formulated in §3d, where alllinks below the significance threshold (marked withdouble lines in figure 5d) are removed. For our signifi-cance test, we used R ¼ 104 samples of the nullensemble along with a standard a ¼ 0.05 importancethreshold.

We repeat this process for all L locations and basedon §3e, we combine all site-specific adjacency matricesAð‘Þ to a single one At that captures the population-wide social structure on the given day T. An exampleis shown in figure 6, where we have summarized the


subgraphs (such as the one shown in figure 5c) fromall L ¼ 13 locations shown in figure 5b into a single,global network that describes wild-bird social organiz-ation on 9 September 2007. We repeat the process forall T 24 h segments of our data stream, we get a stackof adjacency matrices fAtgTt¼1 that represent dailysnapshots of the wild-bird social network.

From an implementation perspective, GMMEventsruns L times for each day-segment of the data stream.For each location ‘, R randomizations of the bird-to-event incidence matrix B are generated and for eachone we perform one-mode projection in order to samplethe weight distributions for each link pair i, j. Althoughit may appear computationally prohibitive for large data-sets, our method is able to analyse 2 years’ worth of datathat correspond to about 1 million observations inapproximately 6 h, run on a modern 8-core machineunder a Matlab implementation. This is due to the factthat our method itself is executed on multiple small

Figure 6. The Wytham woods Parus major wild bird socialnetwork at 9 September 2007, with N ¼ 240 nodes, M ¼ 491edges, created by integrating all location-specific subgraphsshown in figure 5d. Note that not all 770 birds of the 2007–2008 season have been recorded during that day and also indi-viduals no connections have been removed from the network.(Online version in colour.)


sub-streams (that refer to different locations per day) andcan be directly parallelized. Our R randomization/sampling schemes are also independent by definition, sothey can run concurrently on different processing units.More details on computational issues are discussed inthe electronic supplementary material and our sourcecode documentation.

4.2. Using GMMEvents to study the dynamics ofmating pair formation

In this section, we examine the validity of the graphs weextracted in §4a using GMMEvents. As the ground truthnetwork is not available to us in such settings, wecannot directly compare our inferred topologies withsome form of given solution. Although tests on simu-lated data streams have been performed (see theelectronic supplementary material), our aim is to exam-ine how well our dynamic network reflects meaningfulquantities from our application-domain perspective.

We make use of an additional dataset, compiled froman independent field study at Wytham woods, whichprovides wild-bird mating records for each season. Suchpedigree dataset logs the IDs of individuals that formeda breeding pair each year. Some bird pairs persist overseveral seasons while others last for only 1 year owingto either divorce or fatalities. We assume that if theextracted network structure is valid, then breeding indi-viduals will be closely connected, either in terms of adirect link or being in the same social circle. Althoughlooking for direct links between mated individuals is anobvious choice, it is a very strict case and thus very sen-sitive to missing data and noise. Therefore, our approachis to examine if breeding pairs belong more frequently


and consistently than random into social circles thatdenote birds with similar foraging patterns.

Our first objective is to identify such social circles inour population. In figure 6, where we have visualizedthe network structure of the wild-bird population for aspecific day, we can see certain regions in the graph(shown in different colour) where nodes are more denselyconnected with their immediate neighbours than the restof the population. Such ‘hot-spots’ of increased link pres-ence are called modules or communities in the networkanalysis jargon [7,16]. For each daily network describedby At, we extract such communities using a non-negativematrix factorization (NMF) approach [17].

We find that the majority of mated pairs in networkcommunities are connected through a direct link in77.26 per cent of cases for the 2007–2008 data and71.57 per cent of cases for the 2008–2009 data. Reach-ability through a path of two links is reported for the14.74 per cent of cases in 2007–2008 and 17.06 percent of cases in 2008–2009. The average path lengthbetween two members, for the cases where both ofthem are observed in the data, is 1.33 (2007–2008)and 1.46 (2008–2009) with median value of 1 in bothdatasets. Finally, there are still cases (8% in 2007–2008 and 11.37% in 2008–2009) of pairs where theirgeodesic distance spans from three to six edges butstill belong to the same community.

We monitor bird membership within these groupsusing a binary matrix Ct , where each element cijt ¼ 1denotes that birds i, j appeared in the same communityat day t. This leads us to a new collection of co-member-ship matrices fCtgTt¼1 that encode temporal changes inthe way birds participate with each other in communities.From a summation across t, we get a matrix CðsÞ [ RN�N

where each element cðsÞij denotes the total number of daysin the season where the pair i, j participated in thesame community. In figure 7, we plot a histogram ofall co-membership values (y-axis on a logarithmic scale)based on two matrices C(s) that refer to bird co-member-ship values in field seasons 2007–2008 and 2008–2009,respectively. We can see that for both seasons, the vastmajority of pairs have never participated in thesame group and the distribution is heavily skewed. Thisimplies a strong preferential mechanism in the population,where random individuals rarely belong to the samesocial circle.

We now examine if the above distribution holds forcertain sub-category of pairs in the network, which weknow a priori are connected with actual social ties.This prior information is provided by the pedigree data-set we mentioned previously, which gives a list of nodedyads i, j that denote breeding individuals. In this list,we also distinguish between mated pairs that wereformed during our observation season, called newpairs, and others that already existed before, calledold pairs. In figure 8, we plot the cumulative distri-butions FðcijÞ, where cij are values co-membershipmatrix C (s) and i, j can be (a) any node pair (blue cir-cles, stem), (b) a new pair (green squares, stem) and (c)old pair (red triangles, stem). In figure 8a, we plot thedistributions that refer to the 2007–2008 season, withN ¼ 217 individuals, from which we have 49 new pairsand 20 old pairs. For season 2008–2009, shown in

log(

no. p

airs

)

10

5

0 5 10 15 20 25 30 35c

i,j: no. days where a pair (i, j) is observed in the same community(s)

Figure 7. We plot the co-membership values of CðsÞ on a base-2 logarithmic scale. Each value (x-axis) denotes the total number ofdays a random pair is observed in the same community. We can see that CðsÞ is sparse and the vast majority co-membership valuesare zero. This shows that if we pick a random dyad in the population, it will most probably be never seen in the same social circle.Asterisks with continuous line, season 2007–2008; open square with continuous line, season 2008–2009. (Online version in colour).

no. days where both i, j belong to the same community

no. days where both i, j belong to the same community

F(c

ij)

= P

(x ≤

cij)

1.0

0.8

0.6

0.4

0.2

F(c

ij)

= P

(x ≤

cij)

1.0

0.8

0.6

0.4

0.2

0 5 10 15 20 25 30

F(cij|{i,j} = random pair)

F(cij|{i,j} = new pair)

F(cij|{i,j} = old pair)

F(cij|{i,j} = random pair)

F(cij|{i,j} = new pair)

F(cij|{i,j} = old pair)

0 5 10 15 20 25 30 35 40

(a)

(b) cummulative co–membership distributions, for season 2008–2009

cummulative co–membership distributions, for season 2007–2008

Figure 8. We plot the cumulative co-membership distributions for three different dyad types: random pairs, mating pairs formedin previous seasons and pairs that formed in the current season. Although for the majority of random bird pairs in the network co-membership values are concentrated around zero, breeding individuals tend to participate much more frequently into the sameflocks. (Online version in colour.)


figure 8b, we have N ¼ 203 individuals that include 48new pairs and 10 old pairs.

We can see that for both seasons presented in figure 8,the distributions that refer to mated pairs differ


significantly from the one for random ones, with p-valuesless than 10215 under a Kolmogorov–Smirnov test [18]with 5 per cent precision level for both seasons. In contrastto the random case, where values cij are mostly zero,

a = 0.05

1.0

0.8

0.6

0.4

p-va

lue

in K

S te

st

0.2

0Aug Sep Oct Nov Dec Jan Feb

Aug Sep Oct Nov Dec Jan Feb

a = 0.05

1.0

0.8

0.6

0.4

p-va

lue

in K

S te

st

0.2

0

(b)

(a) co-membership distribution comparison, season 2007–2008

co-membership distribution comparison, season 2008–2009

Figure 9. We compare the co-membership distributions Pðcij jfi; jg ¼ random pairÞ versus Pðcij jfi; jg ¼ old pairÞ (red triangles,line) and Pðcij jfi; jg ¼ random pairÞ versus Pðcij jfi; jg ¼ new pairÞ (green squares, line) in a month-by-month basis, using a Kol-mogorov–Smirnov test. Values above the proposed a ¼ 0.05 significance threshold imply that the two distributions undercomparison are similar. We can see that from very early in the year old pairs differentiate themselves from random, by starting toparticipate frequently in the same communities. On the other hand, members of new pairs in the beginning of the year treat eachother as random, while preferential mechanism that makes them flock together, starts to build-up during early winter. (a,b) Triangleswith solid line, random versus old pairs; squares with solid line, random versus new pairs. (Online version in colour.)


co-membership for mated pairs achieves larger values, thusdenoting stronger and consistent graph proximity. Thedifferences between old and new pairs are also revealedbetween their respective cumulative distributions (greensquares, stem and red triangles, stem), where old pairsachieve higher co-membership values owing to the factthat they existed before new pairs where formed, thusthey had more opportunities during the season toparticipate in the same foraging flocks.

We have already seen that co-membership distributionsdiffer between various pair types. We will now examinewhen that differentiation takes place duringthe observation season. We start by breaking down theobservation period into eight months. For each month,we used the respective daily networks in order to findthe three co-membership distributions of interest. Wethen compared Pðcij jfi; jg ¼ random pairÞ versusPðcij jfi; jg ¼ old pairÞ and Pðcij jfi; jg ¼ random pairÞversus Pðcij jfi; jg ¼ new pairÞ, by calculating the p-valueunder a Kolmogorov–Smirnov test with a proposedsignificance level 0.05. In figure 9, we can see that at


the beginning of the season, new pairs have similarco-membership patterns to random ones, as they havenot been formed at such early point. But as we movethrough the year, this similarity drops and fromthe ‘cloud’ of random associations, breeding relationshipsemerge.Onthe other hand, oldpairs thathavebeenalreadyformed from previous seasons have a consistent non-random co-membership pattern, even from very earlypoints in the season.

5. DISCUSSION AND FUTURE WORK

The network paradigm is a powerful tool for studyingreal-world complex systems. As there is an extensive tool-set of methods and algorithms for network analysis, inthis work we have focused on the problem of constructingthe network in the first place. In many applications, thecollected data capture the behaviour of the system insome manner, like the spatial trajectories of participatingagents, but not the underlying relations between them.


We address this issue by assuming that mobility patternsof individuals may be correlated based on some form ofunderlying social connection. By identifying obser-vation-dense regions in the data stream, which can beseen as gathering events of affiliated individuals, wepropose a methodology of drawing links betweenagents based on their co-participation into those events.

Traditional approaches [9–12] to constructing socialnetworks from spatio-temporal data involve discretizingthe observation stream based on some fixed timewindow Dt and drawing links between individuals whenthey lie within such ‘interaction-radius’. Our methodovercomes the practical difficulties of such time-slicingapproach in cases when we have no prior knowledge ofhow big or small the time window size should be, thushaving to perform multiple runs across various Dt andselect the appropriate one based on some ad hoc qualityfunction. Additionally, we have proposed an appropriatenull model, which allows us to examine if the co-occur-rence of individuals into gathering events are a resultof a latent social tie, or coincidence. Our null modelretains the ‘bursty’ nature of the data stream butbreaks all correlations between the individuals’ appear-ance patterns through an appropriate randomization.

We applied GMMEvents into two large-scale data-sets that provide wild-bird foraging records. Weshowed that the inferred network topologies reflectmating pair formation events that take place in thepopulation, where breeding individuals tend tobelong into the same foraging groups more oftenthat random dyads. We also showed that the dynamicsof community structure in the system reveal hownewly formed pairs initially have a random-like behav-iour, while as we approach the mating season theystart to participate more often than random into thesame communities.

The communities identified here are based on tem-poral occurrence at feeding stations, and while thedata analysed here are extensive, they are incomplete,as observations are made for only a proportion of thetime, and only for feeding-related activity. While morecomplete data would be expected to result in more com-pletely connected communities (both in terms of linknumber and connection strength), it is not necessarilythe case that all communities would ultimately befully connected. For example, communities might becomprised of pairs of individuals that avoided eachother (e.g. territorial males, competing females) relativeto the other members of the community, even thoughthey have links via other individuals. As expected forindividuals linked via a network, there is a variety ofdirect and indirect ways that individuals within andbetween communities might influence each other. Inthe case of the present network, we might expect thatan important source of direct effects lies in the flow ofinformation between community members about thepresence of food, but such information will also spreadindirectly to other individuals via network linksbetween communities [19]. Numerous other effectsmight also be considered. For example, like many ani-mals, small passerine birds give alarm calls that alertother individuals about the presence of predators [20].While the individuals in the same community may be


expected to be nearest to a focal individual, otherlinked communities may also be influenced directly bythis sort of behaviour, and the overall inter-communitynetwork may serve as a hypothesis for the likelihood ofsuch effects being transmitted between individuals. Sofar all feeding sites have been analysed in isolationuntil the last stage. Site-specific network adjacencymatrices are extracted and tested for significancewhence they are all combined to one single adjacencymatrix. An alternative to this spatial aggregation oversites would be temporal aggregation. In this approach,temporal data could be aggregated and behaviour andfeeding sites analysed directly. While such an analysismay account for popular feeders, it would not achievethe high temporal resolution of the existing approach.For instance, using a temporal aggregation strategy, agroup of birds feeding in the morning and one in theevening would all be treated as one single group whenthe times of their feeding site visitation clearly suggestotherwise. A proper resolution of this conflict mayrequire a full spatio-temporal clustering stage andanother bespoke hypothesis test to detect both spatiallyand temporally insignificant events. Such a multi-variate approach would alleviate the necessity toaccount for spatial correlations during hypothesis test-ing which otherwise would be extremely hard toextract from data. Thus, in our future work, we willfocus on a full spatio-temporal analysis of bird behav-iour and the development of clustering models thatcombine data of different characteristics, such as thebursty behavioural and the continuous spatial data.

The next stages of our research consist of two mainmodules. From the perspective of the model, we seekto extend the way we define the link aijt between twoindividuals at time t so that we take into accountprior knowledge from previous observations. This hasthe advantage of capturing the uncertainty over thelink weight, detect abrupt changes in the network top-ology and handling missing observations in a principledmanner. From an ecological point of view, we currentlyrun an improved scheme of our data collection, wherewe have sensors at each feeding location. This gives usthe advantage of looking at the data at much greaterresolutions thus having a more accurate view of theoverall bird population’s foraging patterns.

Although the methodology we presented is applied toanimal observation records, it can be extended to anysystem where agents perform check-ins at certain locationsand such observations are not uniformly spread in datastream, but temporally focused. We believe methodologiesand theoretical results derived from the study of animalsocial networks will benefit the wider field of networkanalysis, as individuals can be monitored from thebeginning to the end of their lifespan, there are no privacyissues associated with data collection and understandingthe dynamics of animal interactions provides an insightinto the behaviour and evolution of complex systems.

The authors are grateful to Simon Evans, Ada Grabowska andparticularly Teddy Wilkin for data collection, and to theNational Environment Research Council (NE/D011744/1)and the European Research Council (AdG 250164) forfunding aspects of the work. Ioannis Psorakis is funded by a


Microsoft Research European PhD scholarship, for which weare most grateful. Additionally, we gratefully acknowledgefunding from the UK Research Council for project ‘Orchid’,grant EP/I011587/1. Finally, we warmly thank the reviewersfor their constructive comments and evaluation.

REFERENCES

1 Rosvall, M. 2006 Information horizons in a complex world.2 Newman, M. E. J. 2010 Networks: an introduction.

Oxford, UK: Oxford University Press.3 Wey, T., Blumstein, D. T., Shen, W. & Jordan, F. 2008

Social network analysis of animal behaviour: a promisingtool for the study of sociality, Anim. Behav. 75,333–344. (doi:10.1016/j.anbehav.2007.06.020)

4 Krause, J., Lusseau, D. & James, R. 2009 Animal socialnetworks: an introduction. Behav. Ecol. Sociobiol. 63,967–973. (doi:10.1007/s00265-009-0747-0)

5 Buchanan, M. & Caldarelli, G. 2010 A networked world.Phys. World 23, 22–24.

6 Barrat, A., Barthelemy, M. & Vespignani, A. 2004 Model-ing the evolution of weighted networks. Phys. Rev. E 70,066149. (doi:10.1103/PhysRevE.70.066149)

7 Fortunato, S. 2010 Community detection in graphs. Phys.Rep. 486, 75–174. (doi:10.1016/j.physrep.2009.11.002)

8 Newman, M. E. J. The structure and function of complexnetworks. SIAM Rev. 45, 167–256. (doi:10.1137/S003614450342480)

9 Lauw, H. W., Lim, E. P., Pang, H. & Tan, T. T. 2005Social network discovery by mining spatio-temporalevents. Comput. Math. Organ. Theory 11, 97–118.(doi:10.1007/s10588-005-3939-9)

10 Whitehead, H. 2008 Analyzing animal societies: quantitat-ive methods for vertebrate social analysis. Chicago, IL:Chicago University Press.


11 Oh, K. P. & Badyaev, A. V. 2010 Structure of social net-works in a passerine bird: consequences for sexual selectionand the evolution of mating strategies. Am. Nat. 176,E80–E89. (doi:10.1086/655216)

12 Gero, S., Engelhaupt, D., Rendell, L. & Whitehead, H.2009 Who Cares? Between-group variation in alloparentalcaregiving in sperm whales. Behav. Ecol. 20, 838. (doi:10.1093/beheco/arp068)

13 Krings, G., Karsai, M., Bernharsson, S., Blondel, V. D. &Saramaki, J. 2012 Effects of time window size and place-ment on the structure of aggregated networks. (http://arxiv.org/abs/1202.1145v1)

14 Ginsberg, J. R. & Young, T. P. 1992 Measuring associ-ation between individuals or groups in behaviouralstudies. Anim. Behav. 35, 1454–1469.

15 Whitehead, H. & Dufault, S. 1992 Techniques for analyz-ing vertebrate social structure using identified individuals:review and recommendations. Adv. Study Behav. 28,33–74. (doi:10.1016/S0065-3454(08)60215-6)

16 Porter, M. A., Onnela, J. P. & Mucha, P. J. 2009 Commu-nities in networks. Not. Am. Math. Soc. 56, 1082–1097(1164–1166).

17 Psorakis, I., Roberts, S., Ebden, M. & Sheldon, B. 2011Overlapping community detection using Bayesian non-negative matrix factorization. Phys. Rev. E 83, 066 114.(doi:10.1103/PhysRevE.83.066114)

18 Lilliefors, H. W. 1967 On the Kolmogorov-Smirnov test fornormality with mean and variance unknown. J. Am. Stat.Assoc. 62, 399–402.

19 Franz, M. & Nunn, C. L. 2009 Network-based diffusionanalysis: a new method for detecting social learning.Proc. R. Soc. B 276, 1829–1836. (doi:10.1098/rspb.2008.1824)

20 Zuberbuhler, K. 2009 Survivor signals: the biology andpsychology of animal alarm calling. Adv. Study Behav.40, 277–322. (doi:10.1016/S0065-3454(09)40008-1)

http://dx.doi.org/doi:10.1016/j.anbehav.2007.06.020

http://dx.doi.org/doi:10.1007/s00265-009-0747-0

http://dx.doi.org/doi:10.1103/PhysRevE.70.066149

http://dx.doi.org/doi:10.1016/j.physrep.2009.11.002

http://dx.doi.org/doi:10.1137/S003614450342480

http://dx.doi.org/doi:10.1137/S003614450342480

http://dx.doi.org/doi:10.1007/s10588-005-3939-9

http://dx.doi.org/doi:10.1086/655216

http://dx.doi.org/10.1093/beheco/arp068

http://dx.doi.org/10.1093/beheco/arp068

http://arxiv.org/abs/1202.1145v1



http://dx.doi.org/doi:10.1016/S0065-3454(08)60215-6

http://dx.doi.org/doi:10.1103/PhysRevE.83.066114

http://dx.doi.org/doi:10.1098/rspb.2008.1824

http://dx.doi.org/doi:10.1098/rspb.2008.1824

http://dx.doi.org/doi:10.1016/S0065-3454(09)40008-1

Documents

Published online 13 June 2012 Inferring social network ...sjrob/Pubs/psorakis_roberts_interface_2012.pdf · network of social afﬁliations needs to be inferred through proxies such