The Complexity of Data: Computer Simulation and “Everyday” Social Science

DEPARTMENT OF SOCIOLOGY


Edmund Chattoe-Brown


Plan of talk

• Simulation as a confusing term.

• A simple (but revealing) example.

• The importance of data collection: Simulation methodology.

• Where does complexity fit into all this?

• A more challenging example: DrugChat.

• Conclusions.

Simulation as a confusing term

• Not “gaming” or “role playing”: Student United Nations.

• Not system dynamics, discrete event simulation, analogue simulation and so on, though these are ancestors.

• Not simulation as discussed by Bourdieu, whatever that is.

• Instrumental versus descriptive simulation: Not just a technical tool (doing the same sums quicker) but a distinctive way of understanding (explaining) social behaviour.

• A social process described as a computer programme rather than a narrative or a statistical/mathematical model.

• Other disciplines, other approaches: Experiments, time series, documents/content analysis, GIS.

Spatial segregation (Schelling)

• Agents live on a square grid (like a US city) so each has a maximum of eight neighbours.

• There are two “types” of agents (red and green) and some spaces in the grid are vacant. Initially agents and vacancies are distributed randomly.

• All agents decide what to do in the same very simple way.

• Each agent has a preferred proportion (PP) of neighbours of its own kind. A PP of 0.5 means you want at least half your neighbours to be your own kind, but you would be happy with all of them being so, i.e. PP is a minimum.

• If an agent is in a position that satisfies its PP then it does nothing; otherwise it moves to an unoccupied position chosen at random.

• A time period is defined as the time it takes for each agent (chosen in random order to avoid non-robust patterns) to “take a turn” at deciding and possibly moving.
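The rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the talk: the function names, the grid size, the vacancy rate and the treatment of agents with no neighbours (assumed content) are all assumptions made for the example.

```python
import random

def neighbours(grid, r, c):
    """Return the occupants of the (up to eight) cells around (r, c)."""
    size = len(grid)
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < size and 0 <= cc < size:
                if grid[rr][cc] is not None:
                    found.append(grid[rr][cc])
    return found

def is_satisfied(grid, r, c, pp):
    """True if at least a proportion pp of the agent's neighbours share its type."""
    near = neighbours(grid, r, c)
    if not near:  # assumption: an agent with no neighbours is content
        return True
    same = sum(1 for n in near if n == grid[r][c])
    return same / len(near) >= pp

def make_grid(size, vacancy, rng):
    """Random initial state: red ('R'), green ('G') and vacant (None) cells."""
    return [[None if rng.random() < vacancy else rng.choice("RG")
             for _ in range(size)] for _ in range(size)]

def step(grid, pp, rng):
    """One time period: every agent takes a turn in random order. Returns moves made."""
    size = len(grid)
    cells = [(r, c) for r in range(size) for c in range(size)]
    agents = [(r, c) for r, c in cells if grid[r][c] is not None]
    empty = [(r, c) for r, c in cells if grid[r][c] is None]
    rng.shuffle(agents)
    moves = 0
    for r, c in agents:
        if not empty or is_satisfied(grid, r, c, pp):
            continue
        k = rng.randrange(len(empty))      # unoccupied position chosen at random
        i, j = empty[k]
        grid[i][j], grid[r][c] = grid[r][c], None
        empty[k] = (r, c)                  # the vacated cell becomes available
        moves += 1
    return moves

# Illustrative run: 20 x 20 grid, 20% vacant, PP = 0.5 (all hypothetical values).
rng = random.Random(1)
grid = make_grid(20, 0.2, rng)
for period in range(30):
    if step(grid, 0.5, rng) == 0:  # stop once nobody wants to move
        break
```

Edge and corner agents simply have fewer than eight neighbours here; a wrapped (toroidal) grid would be an equally defensible choice.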

Initial random state

Two questions

• What is the smallest PP (i.e. a number between 0 and 1) that will produce clusters?

• What happens when the PP is 1?

Simple individuals but complex system

[Figure: “Individual Desires and Collective Outcomes”. Horizontal axis: % Similar Wanted (Individual). Vertical axis: % Similar Achieved (Social). Series plotted: % similar, % unhappy.]

Counter-intuitive macro (social) results from simple micro interactions. A non-linear system.
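One way to explore the relationship between what individuals want and what the system achieves is to sweep PP and record the two aggregate measures named in the figure. The sketch below is an assumption-laden illustration (grid size, vacancy rate, number of periods, seed and the function names are all invented for the example), not a reproduction of the plotted results.

```python
import random

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def share_same(grid, r, c):
    """Proportion of an agent's occupied neighbours sharing its type (None if isolated)."""
    size = len(grid)
    near = [grid[r + dr][c + dc] for dr, dc in OFFSETS
            if 0 <= r + dr < size and 0 <= c + dc < size
            and grid[r + dr][c + dc] is not None]
    if not near:
        return None
    return sum(n == grid[r][c] for n in near) / len(near)

def run(pp, size=20, vacancy=0.2, periods=50, seed=0):
    """Run a Schelling-style model; return (% similar achieved, % unhappy) at the end."""
    rng = random.Random(seed)
    grid = [[None if rng.random() < vacancy else rng.choice("RG")
             for _ in range(size)] for _ in range(size)]
    cells = [(r, c) for r in range(size) for c in range(size)]
    empty = [(r, c) for r, c in cells if grid[r][c] is None]
    for _ in range(periods):
        agents = [(r, c) for r, c in cells if grid[r][c] is not None]
        rng.shuffle(agents)
        moved = False
        for r, c in agents:
            share = share_same(grid, r, c)
            if share is None or share >= pp or not empty:
                continue                      # satisfied (or nowhere to go)
            k = rng.randrange(len(empty))
            i, j = empty[k]
            grid[i][j], grid[r][c] = grid[r][c], None
            empty[k] = (r, c)
            moved = True
        if not moved:
            break
    shares = [share_same(grid, r, c) for r, c in cells if grid[r][c] is not None]
    known = [s for s in shares if s is not None]
    if not known:
        return 0.0, 0.0
    similar = 100 * sum(known) / len(known)
    unhappy = 100 * sum(s < pp for s in known) / len(known)
    return similar, unhappy

# Sweep the individual preference and print the collective outcome for each value.
for tenth in range(0, 11):
    pp = tenth / 10
    similar, unhappy = run(pp)
    print(f"PP={pp:.1f}  % similar={similar:5.1f}  % unhappy={unhappy:5.1f}")
```

Averaging over several seeds per PP value would give a smoother curve; a single seed is used here only to keep the example short.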

Deconstructing this example

• Clearly unrealistic in many senses: Property values, decision processes, unstructured space, communication, neighbourhood knowledge.

• However, not unrealistic in the important sense that the simulation contains no arbitrary parameters and agents operate on plausible local knowledge. The only “parameters” in the model are individual PP values (measured by experiment? Already in surveys: Mare.)

• The simulation also generates unintended consequences (PP=1) and patterns that were not “built in”. For example, is the distribution of empty sites random or buffering? This emergence allows the possibility of genuine falsification and has heuristic fertility: What does compatibility of desires mean? When does it occur?

• We need two sorts of data: Quantitative (what patterns are we trying to explain?) and qualitative (what social processes create these?)

Aside …

• It is very clear that we need the “complexity approach” because we are not very good at deducing how complex systems work “in either direction” (micro to macro or vice versa).

• But what is the complexity approach in this context? Is it a set of methods, a set of subject areas, a family of interesting models/results, a way of looking at problems or all of the above?

• How does “the complexity approach” compare with “the sociology approach” or “the physics approach?”

• Should complexity be more than simulation calibrated on real data? If so, what?

• IMO, the main problem with complexity is “where’s the data?”

Quantitative data collection approach

• Collect survey data: Cross sectional, time series or whatever.

• Choose a model and accept/reject it on grounds of statistical fit.

• Model coefficients are “results” conditional on acceptable model.

• In what sense do models explain observed patterns? (If we find a correlation between income and academic success of a particular size, what have we really learnt?)

• Technical problems: Explanatory range depends on sample size.

• Basic problem doesn’t go away even with “fancier” techniques like time series/multi-level modelling: A description isn’t an explanation.

• Rarely heuristically fertile.

Deriving a quantitative coefficient

[Figure: Number of strikes (units) plotted against Unemployment (millions). Marked values: 1 and 2 (unemployment), 50 and 80 (strikes).]

Quantitative example

• “The most important empirical findings of this study can be summarized as follows:

• … there is a moderate tendency for individuals with higher service class origins to be more likely than others to enrol in PhD programmes.

• …

• The estimated effect of class drops to zero when controlling for parents’ education and employment in research or higher education.

• The overall implication of these findings is that the transition from graduate to doctoral studies is influenced by social origins to a considerable degree. Thus, the notion that such effects disappear at transitions at higher educational levels - due either to changes over the life course or to differential social selection - is not supported.” (Mastekaasa, Acta Sociologica, 2006, 49(4), pp. 448-449.)

Translating back into simulation …

• Agents start with particular attributes (like being red or green and having a particular PP in Schelling). These might include things like IQ and motivation.

• They undergo a long sequence of social interactions in institutional contexts, being influenced by parents, peers and teachers in classroom, playground, public library and so on. They also make choices and operate within institutional contexts (like rules for “streaming” or school allocation by catchment).

• The quantitative approach described here tries to link “late” attributes (starting a PhD) to “early” ones (parental occupation) in the hope that regularities in social life support this.

• Is parental occupation an attribute or a process?

Qualitative data collection approach

• Collect data (cognitive, behavioural, structural) by observation and questioning.

• Try (though surprisingly rarely) to induce a pattern from the data: Example of the “addiction cycle”, compared with accounts of drug use based on amount (frequency) and type.

• Result is rich coherent narrative(s): What heroin addiction means from the inside and in a particular context.

• Are the results generalisable? (What is N?)

• Can we correctly envisage the overall consequences of complex social interaction sequences presented using narratives? (Compare Schelling case again.)

• Often heuristically fertile (“addiction cycle”).

Qualitative example

• “Turkish interviewees do not include themselves when they are evaluating the status of ‘Turkish women’ in general. While referring to ‘Turkish women’, most Turkish interviewees use the pronoun ‘they’:

• Turkish women are more home-oriented. I think that they are left in the backstage because they do not have education, because they are not given equal opportunities with men. (T3)

• One of the Turkish interviewees stated that it was difficult for her to answer the questions related to her status ‘as a woman’, because:

• I don’t think of myself as a Turkish women, but as a Turkish person. I mean I never think about what kind of role I have in the society as a woman. (T1)

• Most Norwegian interviewees, on the other hand, identify with ‘Norwegian women’ in general, and they refer to ‘Norwegian women’ as ‘we’:

• I think that in a way Norwegian women, that is we, at least have our rights on paper. We have equal rights for education and we have good welfare arrangements … (N1)” (Sümer, Acta Sociologica, 1998, 41(1), p. 122)

Translating back into simulation …

• Agents choose “appropriate” actions on the basis of perceived identity.

• A range of identities is “given” to agents by biological difference (skin colour) and social structure (“mother”, “worker”).

• Identities are made more salient by patterns of social interaction and socialisation. For example, perhaps a Turkish upbringing stresses female identities that are traditional (mother) or liberal (worker) and de-stresses the existence of a separate “woman’s identity” while a Norwegian upbringing stresses that identity as the underpinning of both work and child-rearing.

• Clearly this simulation needs to be much more cognitive, contextual and detailed than the Schelling example.

What is going on here?

• Qualitative research tells us how people interact and make decisions within environments but can’t usually tell us what large scale patterns result.

• Quantitative research tells us what the large scale patterns are but may not really explain them. (Inability to reason about complexity may result in naïve attribution, i.e. clusters are taken as evidence of xenophobia.)

• Simulation shows how we might bridge the gap between the levels of description with a “generative” social theory expressed as a computer programme. (Coleman “boat”.)

How are we doing with complexity?

• Large number of elements which interact dynamically.

• Interaction rich (mutual influence between significant numbers of elements).

• Non-linearity.

• Interaction short range and each element ignorant of the behaviour of the system as a whole. [2OE on clusters?]

• Interaction loops.

• Open system far from equilibrium requiring energy input. [?]

• Has a history.

• Source: Compressed losslessly from Cilliers, Complexity and Postmodernism, pp. 3-5.

Different kinds of “difficulty”

• Difficult patterns: Chaos, self-organised criticality. (Mathematical strand: We are studying formal systems, we don’t need data.)

• Difficult mental processes: Reflexivity, self-awareness, subconscious motives. (Social theory strand: We are too embedded in these systems and our reflections on them to bracket anything off as objective data.)

• Difficult social systems: Rich context, negotiated roles, complex artefacts. (Ethnographic strand: The world is too complex for general theories.)

Degrees of similarity in Schelling

• Predict exact positions of clusters?

• Predict that there will be clusters at all?

• Predict spatial stability of clusters?

• Predict the size distribution (or separation) of clusters?

• Predict (for three “types”) that clusters will be separated/nested?

• Predict that most cosmopolitan agents will form perimeters of clusters?

• Predict that empty sites will be randomly distributed for cosmopolitan agents but form buffer zones for more xenophobic agents? (“Looking at the holes”: A heuristic idea, “vacancy chains”.)
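Several of these questions need an operational definition of a “cluster” before simulated and real patterns can be compared. As a hedged sketch, the code below treats a cluster as a set of same-type agents connected through the eight-cell neighbourhood (one possible definition among many) and returns the size distribution. The example grid and the function name are invented for illustration.

```python
from collections import Counter

def cluster_sizes(grid):
    """Sizes of connected same-type clusters (eight-neighbour adjacency), largest first."""
    size = len(grid)
    seen = set()
    sizes = []
    for r in range(size):
        for c in range(size):
            if grid[r][c] is None or (r, c) in seen:
                continue
            # Flood fill outwards from this agent over neighbours of the same type.
            kind, stack, count = grid[r][c], [(r, c)], 0
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                count += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < size and 0 <= nx < size
                                and (ny, nx) not in seen
                                and grid[ny][nx] == kind):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            sizes.append(count)
    return sorted(sizes, reverse=True)

# Hypothetical 4 x 4 grid: 'R' red, 'G' green, None vacant.
example = [
    ["R", "R", None, "G"],
    ["R", None, "G", "G"],
    [None, "G", "G", None],
    ["R", None, None, "R"],
]

sizes = cluster_sizes(example)          # [5, 3, 1, 1]
size_distribution = Counter(sizes)      # how many clusters of each size
```

The same flood fill applied to vacant cells instead of agents would give a first handle on the “looking at the holes” idea.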

Ideal simulation methodology

• Choose a target system: Ethnic segregation in cities.

• Build a simulation of the target system and calibrate it, typically on micro level data: Ethnography and experiments? How do agents make relocation decisions and where do they go?

• Run simulation and look for regularities and their preconditions: Do we observe clusters (always, never, only with high PP, fixed, identical, moving) and buffer zones?

• Compare these regularities with statistical data on real residential patterns. What effective similarity tests do we have?

• If there is a “good” match then we haven’t yet falsified the claim that the simulation “generates” the target system and therefore explains it (a progressive process of course).

The Gilbert and Troitzsch “box”

Case Study I: DrugChat

• A reimplementation of Agar’s DrugTalk for the DTI Foresight Programme.

• Based on ethnographic data but generates some qualitatively realistic aggregate data.

• Problematises both the “attribute” based approach to social regularity and the “transition probability” based approach to modelling.

Assumptions I

• Networks: Many have few ties and few have many.

• Types: Non-users, users and addicts. (Distinguished by patterns of behaviour not level of use.)

• Choice based on attitudes to risk (ATR: fixed and normally distributed around 50) and to drugs (ATD: varies by experience and social influence, initialised at 50).

• System driven by “arrival” of drug doses: Addicts get few doses with high probability, users get more doses with lower probability and non-users get few doses with very low probability.

Assumptions II

• Choice simply compares ATR and ATD (but addicts don’t choose).

• “Stash”: Users share all bar one dose with friends (“partying”) while addicts don’t share.

• Drug use experience is evaluated on each dose and can be good or bad. Counts kept of these update ATD. Early experiences have more impact than late ones and bad experiences more impact than good.

• After 5 doses, addiction occurs (physiology).
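To make the individual-level assumptions concrete, here is a rough sketch of a single agent. It is not the DrugChat code. Only the values 50 and 5 come from the slides; the direction of the ATR/ATD comparison (use when ATD exceeds ATR), the spread of ATR, the probability of a good experience and the size of the ATD updates are all assumptions chosen for illustration. Sharing of the “stash” and social influence are left out.

```python
import random

ADDICTION_THRESHOLD = 5  # from the slides: addiction occurs after 5 doses

class Agent:
    def __init__(self, rng):
        self.atr = rng.gauss(50, 15)  # fixed; the spread of 15 is an assumption
        self.atd = 50.0               # initialised at 50, as on the slides
        self.doses_used = 0
        self.good = 0                 # running counts of good and bad experiences
        self.bad = 0

    @property
    def addicted(self):
        return self.doses_used >= ADDICTION_THRESHOLD

    def wants_dose(self):
        """Addicts do not choose; others compare ATD with ATR (direction assumed)."""
        return self.addicted or self.atd > self.atr

    def take_dose(self, rng, p_good=0.7):
        """Use one dose, record a good or bad experience and update ATD."""
        self.doses_used += 1
        good = rng.random() < p_good      # p_good is a hypothetical parameter
        weight = 10.0 / self.doses_used   # early experiences have more impact
        if good:
            self.good += 1
            self.atd += weight
        else:
            self.bad += 1
            self.atd -= 2 * weight        # bad experiences have more impact
        self.atd = max(0.0, min(100.0, self.atd))

# Illustrative use: offer one agent a dose in each of ten periods.
rng = random.Random(3)
agent = Agent(rng)
for period in range(10):
    if agent.wants_dose():
        agent.take_dose(rng)
```

Even this stripped-down version shows why attributes alone may predict status poorly: whether an agent ends up addicted depends on the order of its experiences as well as on its fixed ATR.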

Assumptions III

• Addict communication is ignored but status as addicts has strong negative effect on friends.

• Current users have a direct “congruence” influence via drug experience (good or bad).

• Non current users and non users only influence slightly through “gossip” - telling “drug stories” (total counts of good and bad experiences across all friends used to update ATD).

• Clearly a complicated system: Is it a complex one?

Aggregate properties

The statistical approach

Reading these outputs

• Producing an “S curve” is very weak support for the simulation assumptions. Too many other assumption sets produce it too. (Back to issue of qualitative similarity.)

• Because this simulation is only broadly empirical, the failure to predict user status on ATR does not “disprove” the statistical approach. It only shows how systems at a particular level of “complicatedness” (in fact not very high) may break down relationships between attributes which statistical approaches rely on.

Aside …

• The Caulkins model also has three states: User, non-user and addict and assumes that there are fixed transition rates between states.

• These TRs are for NU to U, U to A, U to NU and A to NU. The only behavioural restriction on the TRs is that A to NU is assumed to be smaller than U to NU.

• This model is fitted to real data.

• What happens if we use the DrugChat simulation to calculate transition probabilities of the Caulkins kind?
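For comparison, a constant transition rate model of the kind described on this slide can be written in a few lines. This is a generic discrete-time sketch, not the Caulkins model as published: the rate values are hypothetical and chosen only to respect the one stated restriction (A to NU smaller than U to NU).

```python
import math

# Hypothetical per-period transition rates; only the ordering A->NU < U->NU is from the slide.
RATES = {"NU->U": 0.05, "U->A": 0.02, "U->NU": 0.10, "A->NU": 0.03}

def step(shares, rates):
    """Advance population shares (non-users, users, addicts) by one period."""
    nu, u, a = shares
    nu_to_u = rates["NU->U"] * nu
    u_to_a = rates["U->A"] * u
    u_to_nu = rates["U->NU"] * u
    a_to_nu = rates["A->NU"] * a
    return (nu - nu_to_u + u_to_nu + a_to_nu,
            u + nu_to_u - u_to_a - u_to_nu,
            a + u_to_a - a_to_nu)

def simulate(shares, rates, periods):
    """Return the list of shares from the initial state through each period."""
    path = [shares]
    for _ in range(periods):
        shares = step(shares, rates)
        path.append(shares)
    return path

# Start with everyone a non-user and follow the shares for 50 periods.
path = simulate((1.0, 0.0, 0.0), RATES, 50)
```

The key modelling commitment is visible in `step`: the same four rates apply in every period, whatever the history or social context of the people in each state.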

Transition probabilities in DrugChat

Reading this output

• Again, DrugChat is not calibrated well enough to prove that it is “right” and Caulkins et al. are “wrong”.

• However, this output (not only are transition probabilities not constant but they change sign!) does suggest that constant transition probabilities are not likely to be a very effective approximation in social systems with even a rather low level of “complicatedness”. (The Caulkins model doesn’t even work in the simplified world of DrugChat.)

• Should we start asking questions about how likely different approaches are to work and how we would go about establishing this? (Hendry and model reductions.)
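Checking whether transition probabilities are constant only requires the state of each simulated agent in each period. The sketch below shows one way to compute per-period transition proportions from such histories; the four short histories are invented for illustration and are not DrugChat output.

```python
from collections import Counter

def transition_rates(histories):
    """For each period t, the share of agents in each state who have moved to a
    different state by t + 1, keyed by (from_state, to_state)."""
    periods = len(histories[0]) - 1
    result = []
    for t in range(periods):
        origin = Counter(h[t] for h in histories)
        moves = Counter((h[t], h[t + 1]) for h in histories if h[t] != h[t + 1])
        result.append({pair: n / origin[pair[0]] for pair, n in moves.items()})
    return result

# Hypothetical state histories for four agents over four periods
# (NU = non-user, U = user, A = addict).
histories = [
    ["NU", "U", "U", "A"],
    ["NU", "NU", "U", "NU"],
    ["NU", "U", "NU", "NU"],
    ["NU", "NU", "NU", "U"],
]

for t, rates in enumerate(transition_rates(histories)):
    print(t, rates)
```

If a constant-rate model were a good approximation, the values for each (from, to) pair would be roughly flat across periods, up to sampling noise.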

Simulated biography

Reading this output

• Initially there is little information in the system. ATD=50.

• Then the agent has two bad experiences with drugs.

• By then, much gossip and experience is reporting good things about the drug which is true “on average” before its addictive nature is recognised.

• This promotes more use, each time with mixed results.

• Unfortunately by this point, addiction has kicked in.

• This particular agent becomes addicted despite several bad drug experiences via social influence.

What are we doing here?

• Collecting different kinds of data from the simulated system which can be compared not only with real data but with underlying assumptions of various theoretical approaches (simple statistical models, models based on “stocks and flows”). Access to multiple kinds of data allows stronger falsification of methods and models.

• Reflecting (at least broadly) on where we might get the kinds of data we need to calibrate the model properly (behavioural, cognitive, physiological, institutional, structural) within the context of existing methods.

Why is this a good idea?

• Simulated systems recognise and can represent different kinds of social “difficulty” - which may include various things people intend by complexity (reflexivity, chaotic output) but also make their “ontological” status clearer. (Is this “difficulty” in the heads of individuals, in their processes of interaction or what?)

• However, unlike a lot of complexity theory (albeit for different reasons) there is an “old fashioned” commitment to integrating data and theory and to explaining across levels of description. This may work better using the new approach too.

Conclusions

• Complexity needs to think very carefully about what “kind of thing” it is if it is going to survive after the “fad” phase.

• Simulation has tools to offer the approach which (at least in principle) tap into the methods and data of social science. (I haven’t talked about the physical sciences but I think some of the same arguments go through.)

• Simulation of Innovation: A Node (SIMIAN): ESRC funded under NCRM for three years with Professor Nigel Gilbert (Sociology @ Surrey) to train and do methodologically innovative research. A good time for collaboration?

Now read on?

• Gilbert and Troitzsch (2005) Simulation for the Social Scientist, second edition (Open University Press). [Examples/resources online. All examples in NetLogo.]

• Journal of Artificial Societies and Social Simulation: http://jasss.soc.surrey.ac.uk/ [Free, fully peer reviewed, interdisciplinary and only online.]

• Chattoe (2006) ‘Using Simulation to Develop and Test Functionalist Explanations: A Case Study of Dynamic Church Membership’, British Journal of Sociology, 57(3), September, pp. 379-397.

• Chattoe and Hamill (2005) ‘It’s Not Who You Know – It’s What You Know About People You Don’t Know That Counts: Extending the Analysis of Crime Groups as Social Networks’, British Journal of Criminology, 45(6), pp. 860-876.

• Chattoe, Hickman and Vickerman (2005) Foresight: Drugs Futures 2025? Modelling Drug Use, Office of Science and Technology, Department of Trade and Industry. [Available from the presenter or online.]
