Juan-Camilo Cárdenas Universidad de los Andes Jim Murphy University of Alaska Anchorage Experimental Methods in Social Ecological Systems

Agenda Day 1
Noon-12:15   Welcome, introductions
12:15-1:15   Play Game #1 (CPR: 1 species vs. 4 species)
1:15-2:00    Debrief Game #1 and other results from the field
2:00-2:15    Break
2:15-3:15    Game #2 (Beans game)
3:15-4:00    Debrief Game #2
4:00-4:15    Break
4:15-5:00    Basics of experimental design
Homework for Day 2: Think of an interesting question or problem to be worked on in groups tomorrow.

Agenda Day 2
8:30-9:15    Designing and running experiments in the field
9:15-10:15   Classwork: work in groups solving experimental design problems
10:15-10:30  Break
10:30-11:15  Discussion of group solutions
11:15-Noon   Begin designing your own experiment (form groups based on best ideas proposed)
Noon-1:00    Lunch
1:00-1:30    Continue designing your own experiment (work in groups)
1:30-2:30    Present designs
2:30-3:00    Feedback: how could we make this workshop better?

Materials online We will create a web site with materials from the workshop. Please give us your email address (write neatly!!) and we will send you a link when it is ready.

Why run experiments?

Types of experiments
1. Speaking to Theorists
  Test a theory or discriminate between theories
    Compare theoretical predictions with experimental observations
    Does non-cooperative game theory accurately predict aggregate behavior in an unregulated CPR?
  Explore the causes of a theory's failure
    If what you observe in the lab differs from theory, try to figure out why.
    Communication increases cooperation in a CPR even though it is "cheap talk". Why?
    Is my experiment designed correctly?
    What caused the failure?
  Theory stress tests (boundary experiments)

Types of experiments (cont.)
2. Searching for Facts
  Establish empirical regularities as a basis for new theory
    In most sciences, new theories are often preceded by much observation.
    "I keep noticing this. What's going on here?"
  The Double Auction
    Years of experimental data showed its efficiency even though no formal models had been developed to explain why this was the case.
  Behavioral Economics
    Many experiments identify anomalies, but a theory to explain them has not yet been developed.

Types of experiments (cont.)
3. Whispering in the Ears of Princes
  Evaluate policy proposals
    Alternative institutions for auctioning emissions permits
    Allocating space shuttle resources
  Test bed for new institutions
    Electric power markets
    Water markets
    Pollution permits
    FCC spectrum licenses

Basics of Experimental Design

Baseline static CPR game
Common pool resource experiment
  Social dilemma: individual vs. group interests
  Benefits to cooperation
  Incentives to not cooperate
Field experiments in rural Colombia
  Groups of 5 people
  Decide how much to extract/harvest from a shared natural resource

Subjects choose a level of extraction from 0 to 8, ranging from low harvest levels (conservative) to high harvest levels.

Payoffs also depend on choices of other 4 group members

Group earnings largest if all choose 1

Strong incentives to harvest more than 1

Nash equilibrium: All choose 6 Social optimum: All choose 1
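The gap between the Nash equilibrium and the social optimum can be illustrated with a small numerical sketch. The payoff function below is hypothetical: it is not the payoff table used in the field experiments, and its coefficients (29, 2, 5) are assumptions chosen only so that, with groups of 5 and choices from 0 to 8, the symmetric Nash equilibrium is 6 and the social optimum is 1.

```python
# Hypothetical quadratic CPR payoff. The coefficients are illustrative
# assumptions, chosen so that Nash = 6 and social optimum = 1 for groups of 5.
N_PLAYERS = 5
CHOICES = range(0, 9)  # extraction levels 0..8


def payoff(own, others_total):
    """Earnings of one player: private gain from harvesting plus a return
    on whatever the whole group leaves unharvested."""
    group_total = own + others_total
    private = 29 * own - 2 * own ** 2
    shared = 5 * (N_PLAYERS * 8 - group_total)
    return private + shared


def best_response(others_total):
    """Extraction level that maximizes own payoff given the others' total."""
    return max(CHOICES, key=lambda x: payoff(x, others_total))


def symmetric_group_earnings(x):
    """Total group earnings when all players choose the same level x."""
    return N_PLAYERS * payoff(x, (N_PLAYERS - 1) * x)


nash = best_response(others_total=4 * 6)
social_optimum = max(CHOICES, key=symmetric_group_earnings)
print(nash, social_optimum)
```

In this sketch each player's best response does not depend on what the others do, which is a simplification; richer payoff structures would make the best response vary with the group total.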

Comment on payoff tables
The early CPR experiments typically used payoff tables.
  We don't live in a world of payoff tables
  Frames how a person should think about the game
  A lot of numbers, hard to read
  Too abstract?
More recent CPR experiments use richer ecological contexts, e.g., managing a fishery is different from managing an irrigation system.

Objective
To explore the interaction between:
  Formal regulations imposed on a community to conserve local natural resources
  Informal, non-binding verbal agreements to do the same

Possible 2x3 factorial design

                       External Enforcement
Communication    None         Low           Medium
No               Baseline     Low           Medium
Yes              Comm Only    Low + Comm    Medium + Comm

Groups of N=5 participants
Play 10 rounds of one of the 6 treatments
Enforcement
  Individual harvest quota = 1 (social optimum)
  Exogenous probability of audit
  Fine (per unit violation) if caught exceeding quota
Participants paid based on cumulative earnings in all 10 rounds
These 2 treatments have been conducted ad nauseam. Are they necessary?
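A minimal sketch of how the six cells of this design could be enumerated and how groups could be randomly assigned to them. The number of groups (18) and the fixed seed are assumptions made for illustration only; they do not come from the study.

```python
import itertools
import random

COMMUNICATION = ["No", "Yes"]
ENFORCEMENT = ["None", "Low", "Medium"]


def treatment_cells():
    """All combinations of the two factors (2 x 3 = 6 cells)."""
    return list(itertools.product(COMMUNICATION, ENFORCEMENT))


def assign_groups(group_ids, seed=0):
    """Randomly assign groups to cells, keeping cell sizes as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(group_ids)
    rng.shuffle(shuffled)
    cells = treatment_cells()
    return {g: cells[i % len(cells)] for i, g in enumerate(shuffled)}


# 18 hypothetical groups, which gives 3 groups per cell.
assignment = assign_groups(range(1, 19))
for group, (comm, enforcement) in sorted(assignment.items()):
    print(group, comm, enforcement)
```

Shuffling first and then dealing groups into cells in rotation keeps the design balanced while leaving which group lands in which cell to chance.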

Baselines and replication
Replication
  In any experimental science, it is important for key results to be replicated to test robustness.
  Link to previous research. Is your sample unique?
Baseline or control group
  The baseline treatment also gives us a basis for evaluating the effects of each treatment.
  In any experimental study, it is crucial to think carefully about the relevant control!

Alternative design
Stage 1: Baseline CPR (5 rounds)
Stage 2: one of the 5 remaining treatments (5 rounds)
  Comm Only
  Low
  Low + Comm
  Medium
  Medium + Comm
Advantage: Having all groups play the Stage 1 baseline facilitates a clean comparison across groups.
Disadvantage: Fewer rounds of the Stage 2 treatments. Enough time to converge?
Disadvantage(?): All Stage 2 decisions are conditioned upon having already played a baseline.

Optimal sample size

                       External Enforcement
Communication    None         Low           Medium
No               Baseline     Low           Medium
Yes              Comm Only    Low + Comm    Medium + Comm

Groups of N=5 participants
How many groups per treatment cell?

John List's notes on sample size
Also see: John A. List, Sally Sadoff, and Mathis Wagner, "So you want to run an experiment, now what? Some simple rules of thumb for optimal experimental design," Experimental Economics (2011) 14:439-457.

Some Design Insights
A. 0 (control) / 1 (treatment), equal outcome variances
B. 0/1 treatment, unequal outcome variances
C. Treatment intensity: no longer binary
D. Clusters

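Some Design Rules of Thumb for Differences in Means in between-subject experiments
Assume that $X_0 \sim N(\mu_0, \sigma_0^2)$ and $X_1 \sim N(\mu_1, \sigma_1^2)$, and that the minimum detectable effect is $\mu_1 - \mu_0 = \delta$.
$H_0: \mu_0 = \mu_1$ and $H_1: \mu_1 - \mu_0 = \delta$.
We need the difference in sample means $\bar{X}_1 - \bar{X}_0$ to satisfy:
1. Significance level (probability of Type I error) $= \alpha$:
   $\dfrac{\bar{X}_1 - \bar{X}_0}{\sqrt{\sigma_0^2/n_0 + \sigma_1^2/n_1}} = t_{\alpha/2}$
2. Power (1 - probability of Type II error) $= 1 - \beta$:
   $\dfrac{(\bar{X}_1 - \bar{X}_0) - \delta}{\sqrt{\sigma_0^2/n_0 + \sigma_1^2/n_1}} = -t_{\beta}$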

Standard Case

Power
A. Our usual approach stems from the standard regression model: under a true null, what is the probability of observing the coefficient that we observed?
B. Power calculations are quite different. They ask: if the alternative hypothesis is true, what is the probability that the estimated coefficient lies outside the 95% CI defined under the null?
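The question in point B can be sketched numerically. The function below is an illustration under stated assumptions, not the authors' calculation: it uses a normal (z) approximation, assumes equal and known outcome variances, and expresses the true effect in standard deviations. The function name and parameters are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sample(n_per_cell, effect_in_sd, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two cell means,
    assuming equal, known outcome variances and n_per_cell observations each."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Mean of the test statistic when the alternative is true.
    shift = effect_in_sd * (n_per_cell / 2) ** 0.5
    # Probability that the statistic falls outside +/- z_crit.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


print(round(power_two_sample(16, 1.0), 2))
```

When the true effect is zero the function returns alpha itself, which is the Type I error rate; as the effect or the sample grows, the returned probability rises toward 1.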

Sample Sizes for Differences in Means (Equal Variances)
Solving equations 1 and 2 assuming equal variances $\sigma_0^2 = \sigma_1^2 = \sigma^2$:
  $n_0^* = n_1^* = n^* = 2\,(t_{\alpha/2} + t_{\beta})^2 \left(\dfrac{\sigma}{\delta}\right)^2$
Note that the necessary sample size:
  Increases rapidly with the desired significance level ($t_{\alpha/2}$) and power ($t_{\beta}$).
  Increases proportionally with the variance of outcomes ($\sigma^2$).
  Decreases inversely proportionally with the square of the minimum detectable effect size ($\delta$).
Sample size depends on the ratio of effect size to standard deviation. Hence, effect sizes can just as easily be expressed in standard deviations.

The standard is to use $\alpha = 0.05$ and power of 0.80 ($\beta = 0.20$).
So if we want to detect a one-standard-deviation change using the standard approach, we would need:
  $n = 2\,(1.96 + 0.84)^2 \times (1)^2 = 15.68$ observations in each cell
A 1/2 std. dev. change is detectable with $4 \times 15.68 \approx 63$ observations per cell (about 64 if 15.68 is first rounded up to 16).
n = 30 seems to be the magic number in many experimental studies: it detects a change of about 0.70 std. dev.
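The arithmetic on this slide can be reproduced with a short sketch. The function names are hypothetical, and the critical values come from the standard normal distribution rather than the rounded 1.96 and 0.84, so the results differ slightly from the slide's figures.

```python
import math
from statistics import NormalDist


def n_per_cell(effect_in_sd, alpha=0.05, power=0.80):
    """n = 2 * (z_{alpha/2} + z_beta)^2 * (sigma / delta)^2, with the effect
    expressed in standard deviations (delta / sigma)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for power = 0.80
    return 2 * (z_alpha + z_beta) ** 2 / effect_in_sd ** 2


def detectable_effect(n, alpha=0.05, power=0.80):
    """Smallest effect (in standard deviations) detectable with n per cell."""
    return math.sqrt(n_per_cell(1.0, alpha, power) / n)


for d in (1.0, 0.5):
    print(d, round(n_per_cell(d), 1))
print(round(detectable_effect(30), 2))
```

Halving the detectable effect quadruples the required sample, which is why small effects become expensive so quickly.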

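Sample Size Rules of Thumb
If $\alpha = 0.05$ and $\beta = 0.20$ requires n subjects, then:
  $\alpha = 0.05$ and $\beta = 0.05$ requires 1.65 × n
  $\alpha = 0.01$ and $\beta = 0.20$ requires 1.49 × n
  $\alpha = 0.01$ and $\beta = 0.05$ requires 2.27 × n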
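These multipliers follow from the ratio of squared critical-value sums, since the variance and effect size cancel. A minimal sketch, with a hypothetical function name and unrounded normal critical values (so the results match the slide only up to rounding):

```python
from statistics import NormalDist


def relative_sample_size(alpha, beta, base_alpha=0.05, base_beta=0.20):
    """Sample size at (alpha, beta) as a multiple of the size needed at the
    baseline (alpha = 0.05, beta = 0.20), holding sigma and delta fixed."""
    z = NormalDist()

    def critical_sum(a, b):
        return z.inv_cdf(1 - a / 2) + z.inv_cdf(1 - b)

    return (critical_sum(alpha, beta) / critical_sum(base_alpha, base_beta)) ** 2


for alpha, beta in [(0.05, 0.05), (0.01, 0.20), (0.01, 0.05)]:
    print(alpha, beta, round(relative_sample_size(alpha, beta), 2))
```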

Example from a recent undergrad research project
A local homeless shelter was conducting a fundraising campaign. They asked us to replicate List's study about the effects of matching contributions.
The shelter wanted the same 4 treatments as in List (no match, 1:1, 2:1, and 3:1) to test whether high match ratios would increase contributions.
A local oil company agreed to donate up to $5000 to provide a match for money donated.

Fundraising example
The shelter had funds to send out 16,000 letters to high-income women in Anchorage who had never donated before.
The expected response rate was about 3 to 4% (n ≈ 480-640).
Question: How many treatments should we run if we expect about 500 responses?
  They said a meaningful treatment effect would be ~$25.
  The standard deviation from previous campaigns was ~$100.

Sample size With only 500 expected responses, we could only conduct 2 treatments.
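One way to arrive at this conclusion, assuming the equal-variance rule of thumb with a significance level of 0.05 and power of 0.80 (an assumption; the slide does not state which values were used), is to compute the responses needed per treatment for a $25 effect with a $100 standard deviation and compare that with the 500 expected responses. The function name is hypothetical.

```python
from statistics import NormalDist


def n_per_cell(delta, sigma, alpha=0.05, power=0.80):
    """Observations per treatment: 2 * (z_{alpha/2} + z_beta)^2 * (sigma/delta)^2."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return 2 * z_sum ** 2 * (sigma / delta) ** 2


needed = n_per_cell(delta=25, sigma=100)      # responses needed per treatment
expected_responses = 500
affordable_treatments = expected_responses / needed
print(round(needed), round(affordable_treatments, 2))
```

With roughly 250 responses needed per treatment, 500 responses support about two treatments rather than the four the shelter asked for.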

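Sample Sizes for Differences in Means (unequal variances)
Another rule of thumb: if the outcome variances are not equal, then:
  $\dfrac{\pi_1^*}{\pi_0^*} = \dfrac{\sigma_1}{\sigma_0}$
where $\pi_0^*$ and $\pi_1^*$ are the optimal shares of the total sample in the control and treatment groups. That is, the ratio of the optimal proportions of the total sample in the control and treatment groups is equal to the ratio of the standard deviations.
Example: Communication tends to reduce the variance, so perhaps fewer groups in this treatment.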
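A minimal sketch of this allocation rule. The function name, the total of 60, and the standard deviations are hypothetical values chosen for illustration.

```python
def optimal_split(total_n, sd_control, sd_treatment):
    """Split total_n so that n_treatment / n_control = sd_treatment / sd_control."""
    share_treatment = sd_treatment / (sd_control + sd_treatment)
    n_treatment = round(total_n * share_treatment)
    return total_n - n_treatment, n_treatment


# Hypothetical numbers: the treatment halves the standard deviation.
n_control, n_treatment = optimal_split(total_n=60, sd_control=2.0, sd_treatment=1.0)
print(n_control, n_treatment)
```

The noisier arm gets the larger share of the sample, because more observations are needed there to estimate its mean with the same precision.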

Treatment levels

                       External Enforcement
Communication    None         Low           Medium           High
No               Baseline     Low           Medium           High
Yes              Comm Only    Low + Comm    Medium + Comm    High + Comm

How many levels of enforcement do we need? Do we need 3 levels of enforcement?

What about Treatment Levels?
Assume that you are interested in understanding the intensity of treatment:
  Level of enforcement (e.g., audit probability)
Assume that the outcome variance is equal across the various cells.
How should you allocate the sample if the audit probability could be anywhere between 0 and 1?
For simplicity, say X = 25%, 50%, or 75%.
Assume that you have 1000 subjects available.

Reconsider what we are doing: Y = XB + e
One goal in this case is to derive the most precise estimate of B by using exogenous variation in X.
Recall that the variance of the estimate of B is var(e) / (n · var(X)); its standard error is the square root of this.

Rules of Thumb
Linear: 1/2 of the sample @ X=25%, 0 @ X=50%, 1/2 @ X=75%
Quadratic: 1/4 @ X=25%, 1/2 @ X=50%, 1/4 @ X=75%
Intuition: The test for a quadratic effect compares the mean of the outcomes at the extremes to the mean of the outcome at the midpoint.
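Because the precision of the estimate of B improves as var(X) grows, the linear case can be checked by comparing var(X) across allocations. The sketch below is illustrative; the function name and the two allocations compared are assumptions, not part of the slides.

```python
def variance_of_x(allocation):
    """Variance of X for an allocation mapping each level of X to its share
    of the sample (shares must sum to 1)."""
    mean = sum(x * share for x, share in allocation.items())
    return sum(share * (x - mean) ** 2 for x, share in allocation.items())


equal_thirds = {0.25: 1 / 3, 0.50: 1 / 3, 0.75: 1 / 3}
extremes_only = {0.25: 0.5, 0.50: 0.0, 0.75: 0.5}

# The variance of the estimate of B is proportional to 1 / var(X),
# so a larger var(X) means a more precise estimate.
print(variance_of_x(equal_thirds), variance_of_x(extremes_only))
```

Placing half the sample at each extreme gives 1.5 times the variance in X of an equal three-way split, but it leaves no observations at the midpoint, so a quadratic effect could not be tested.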

Intra-cluster Correlation
What happens when the level of randomization differs from the unit of observation?
  Think of randomization at the village level, or at the store level, while outcomes are observed at the individual level.
Classic example: comparing two textbooks
  Randomization over classrooms
  Observations at individual level
Another example: To test the robustness of results, you may want to conduct the experiments in multiple communities.
  How do you allocate treatments across communities, especially if the number of participants per village is small?
  In our Colombian enforcement study, we replicated the entire design in three regions.
  In a separate CPR experiment in Russia, we visited 3 communities in one region. Each treatment was conducted once in each community.
    We are assuming that the differences across communities are small.
    We cannot make cross-community comparisons.

Intracluster Correlation
Real Sample Size: $\mathrm{RSS} = mk / CE$
  $m$ = number of subjects in a cluster
  $k$ = number of clusters
  $CE = 1 + \rho\,(m - 1)$
  $\rho$ = intracluster correlation coefficient $= s_B^2 / (s_B^2 + s_w^2)$
  $s_B^2$ = variance between clusters
  $s_w^2$ = variance within clusters

Intracluster Correlation
What does $\rho = 0$ mean?
  No correlation of responses within a cluster
  No need to adjust optimal sample sizes
What does $\rho = 1$ mean?
  All responses within a cluster are identical
  Large adjustment needed: RSS is reduced to the number of clusters
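The two limiting cases can be checked directly from the RSS formula. A minimal sketch, with a hypothetical function name and a hypothetical design of 10 clusters of 20 subjects:

```python
def real_sample_size(m, k, rho):
    """RSS = m * k / CE, where CE = 1 + rho * (m - 1)."""
    correction = 1 + rho * (m - 1)
    return m * k / correction


# Hypothetical design: 10 clusters of 20 subjects each.
print(real_sample_size(20, 10, rho=0.0))  # no within-cluster correlation
print(real_sample_size(20, 10, rho=1.0))  # identical responses within a cluster
```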

Example
Pilot testing confirms our suspicion, yielding $\rho = 0.04$.
They wish to detect a 1/10 std. dev. change.
Using the standard approach, what should the sample size equal?

$\rho = 0$: What is n?
Sample size formula: $n = 2\,(t_{\alpha/2} + t_{\beta})^2 (\sigma/\delta)^2$
n = 1568 at each level; 3136 total.

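Example
$\mathrm{RSS} = mk/CE = 784 \times 4 / (1 + 0.04\,(784 - 1)) \approx 97$!
What is the required sample size?
$n = 2\,(t_{\alpha/2} + t_{\beta})^2 \times 100\,(1 + 783 \times 0.04) = 15.68 \times 3232$ (note that with $\rho = 0$: $15.68 \times 100$)
$= 50{,}678$ at each incentive level!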
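The numbers in this example can be re-derived step by step. The sketch below uses the slide's own inputs (m = 784, k = 4, an intracluster correlation of 0.04, and the rounded constant 15.68); only the variable and function names are introduced here.

```python
def design_effect(m, rho):
    """Correction factor CE = 1 + rho * (m - 1)."""
    return 1 + rho * (m - 1)


m, k, rho = 784, 4, 0.04

# Effective number of independent observations in the clustered sample.
rss = m * k / design_effect(m, rho)

# Per-level sample size without clustering: 2 * (1.96 + 0.84)^2 * (1 / 0.1)^2.
n_unclustered = 15.68 * 100

# Inflate by the design effect to keep the same power under clustering.
n_clustered = n_unclustered * design_effect(m, rho)

print(round(rss), round(n_clustered))
```

Even a correlation as small as 0.04 matters when clusters are large, because the correction grows with the cluster size m.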

Randomized factorial design
Advantages
  Independence among the factor variables
  Can explore interactions between factors
Disadvantages
  Number of treatments grows quickly with an increase in the number of factors or levels within a factor
  Example: Conduct the experiment in multiple communities and use community as a treatment variable

Fractional factorial design
Say we want to add informal sanctions with a 3:1 ratio ("I can pay $3 to reduce your earnings by $1").
  1 new factor with 2 levels
To run all combinations would require 2x2x2 = 8 treatments.
  Assume the optimal sample size per cell is 6 groups of 5 people (30 total per cell).
  8 treatments x 30 people/cell = 240 people
  Assume you can only recruit about half that (~120).
  You could run only 3 groups per cell (15 people), but you would lose power/significance.
Solution: conduct a balanced subset of treatments

                       External Enforcement
Communication    Low           Medium
No               Low           Medium
Yes              Low + Comm    Medium + Comm

Fractional factorial design
If you are considering this approach, there are a few different design options depending upon the effects you want to capture, the number of treatments, etc. This is just one example!
[Figure: design diagram with axes Communication, External Enforcement, Sanctions]

Fractional factorial design Advantage: dramatically reduces the number of trials Disadvantage: achieves balance by systematically confounding some direct effects with some interactions. It may not be serious, but you will lose the ability to analyze all of the different possible interactions.
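The trade-off can be made concrete with a standard half-fraction of a 2x2x2 design. This is a textbook construction offered as a sketch; it is not necessarily the subset shown on the slides, and the -1/+1 coding of each factor's two levels is an assumption made for illustration.

```python
import itertools

FACTORS = ("communication", "enforcement", "sanctions")

# Full 2 x 2 x 2 design, coding each factor's two levels as -1 and +1.
full_design = list(itertools.product((-1, 1), repeat=3))

# Half fraction: keep the runs where the product of the three codes is +1.
half_fraction = [run for run in full_design if run[0] * run[1] * run[2] == 1]


def is_balanced(design):
    """True if every factor appears equally often at each of its two levels."""
    return all(sum(run[i] for run in design) == 0 for i in range(len(FACTORS)))


def aliased(design, main, pair):
    """True if a main-effect column equals the interaction column of the other
    two factors in every run, so the two effects cannot be separated."""
    return all(run[main] == run[pair[0]] * run[pair[1]] for run in design)


print(len(full_design), len(half_fraction), is_balanced(half_fraction))
print(aliased(half_fraction, 0, (1, 2)))
```

The four retained treatments are balanced on every factor, but within them the communication effect is indistinguishable from the enforcement-by-sanctions interaction, which is the kind of confounding the slide describes.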

Nuisance Variables
Other factors of little or no primary interest that can also affect decisions. These nuisance effects could be significant.
Common examples
  Gender, age, nationality (most socio-economic variables)
  Selection bias
    Recruitment: open to whoever shows up vs. random selection
  Experience
    Participated in previous experiments
  Learning
    Concern in multi-round experiments
  Non-experiment interactions
    People talking before an experiment while waiting to start
    In a community, people may hear about the experiment from others

Confounded variables
Confounding occurs when the effects of two independent variables are intertwined so that you cannot determine which of the variables is responsible for the observed effect.
Example: What are some potential confounds when comparing the Baseline with Low?

                       External Enforcement
Communication    None         Low           Medium
No               Baseline     Low           Medium
Yes              Comm Only    Low + Comm    Medium + Comm

Another design approach
If trying to identify factors that influence decisions, try adding them one at a time.
Imposing a fine for non-compliance differs from the baseline CPR in multiple ways. Possible confounds:
  FRAME
    The simple existence of a quota may send a signal about expected behavior, independent of any audits or fines.
  GUILT = FRAME + audit
    Getting audited may generate feelings of guilt because the individual is privately reminded about anti-social choices.
  FINE = FRAME + GUILT (audit) + fine for violations
    Are people responding to the expected penalty? Or are they responding to the frame from the quota?

3 Sources of variability
1. Conditions of interest (wanted)
2. Measurement error (unwanted)
   People can make mistakes, misunderstand instructions, or make typos.
3. Experimental material and process (unwanted)
   No two people are identical, and their responses to the same situation may not be the same, even if your theory predicts otherwise.

Design in a nutshell Isolate the effects of interest Control what you can Randomize the rest

Some Practical Advice

Some thoughts in no particular order
Think carefully about your research question
  Formulate testable hypotheses grounded in theory
  How does your idea contribute to the literature?
Think carefully about possible results and how they would be interpreted
  What if results are consistent with theory/expectations?
  What if they are not?
  Be prepared for either possibility
Prepare code for data analysis BEFORE running experiments
  Forces you to think carefully about what your data will look like, and what you want to get out of it.

Some thoughts on data analysis
Are your data discrete, binary or continuous?
  Multinomial logit, ordered probit, logit, Poisson, linear
Repeated observations or one-shot decisions?
  Random effects, hierarchical mixed models, nonparametrics

More thoughts
Subject payments and salience
  One distinguishing feature of economic experiments is that subjects are paid based on their decisions and possibly the decisions of others.
  Must pay enough for subjects to take the experiment seriously
  Avoid tournaments
    E.g., giving a bonus to the person who earns the most money
  Typically pay in cash; some field experiments may use another medium
Never use deception!
Keep earnings and decisions private

Instructions
Think carefully about every word in your instructions
Framing effects
  "Your partner" in the UG (ultimatum game) or "your opponent"?
  Could frame the UG as an offer to sell at a price
Using examples
  I used the example of a $14/$6 split. Does that suggest proposers should take more than half?
  What if I used a 10/10 split? Or 6/14?
  Could give multiple examples
Experiment length
  Be aware that people get tired and bored

Other stuff
Strategy method
  Hot vs. cold decisions
Paying for just one round in a multi-round game
AB-BA designs for within-subject comparisons
Playing multiple games and paying for just one
Factor levels should allow for enough distance between hypotheses
  Social optimum is that people will harvest 10% of the fish; Nash equilibrium predicts 15%.
  Nash equilibrium & social optimum should be farther apart
