Silico-paleontology with graph databases
Rooting through the relics of digital evolution

Nic McPhee & David Donatucci (w/ Thomas Helmuth)

Division of Science and Mathematics
University of Minnesota, Morris
Morris, Minnesota, USA

May 2015
Genetic Programming Theory and Practice
University of Michigan
Ann Arbor, MI

McPhee & Donatucci (UMN Morris) Graph database analysis of GP dynamics May 2015, GPTP, Ann Arbor MI 1 / 26

The Big Picture

Genetic programming clearly works.
But we rarely know why or how.
Databases allow examination of the internal interactions of a run.
Graph databases are better suited for this than relational databases.
Silico-paleontology can help us understand and improve our tools.

Outline

1 What do we know? (And how do we talk about it?)

2 Using a graph database

3 Let’s go exploring!

4 Conclusions

What do we know? (And how do we talk about it?)

We throw so much away

We keep/see/share so little

EC research has the potential to generate huge amounts of data.

What do we normally do with that data?

We normally throw it away – and paleontologists weep!

https://www.flickr.com/photos/blmoregon/14566767645/

https://www.flickr.com/photos/nicmcphee/1323950471

Summary results are highly lossy

Oooh – a table of results!

              Treatment
Problem     L     T     I
RSWN       55    13    17
SYL        22     1     2
SLB        75    19    10
NTZ        57    15     7

These show successes on 4 problems for 3 different treatments

L seems to be winning

But why?!?!?

What’s actually happening in all those matings and crossovers and mutations that makes the difference?

Plots are better (but can still obscure details)

Let’s draw pretty pictures

[Figure: error.diversity (y-axis, 0.00 to 1.00) vs. generation (x-axis, 0 to 300), in two panels: lexicase (top) and tourney (bottom).]

So much more data!

Diversity over time across all the runs.

L’s diversity (top) is consistently higher than T (bottom).

That might be important (and supports some hypotheses).
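
The slides do not say how the plotted quantity is computed. Here is a minimal sketch, assuming error.diversity is the fraction of distinct error vectors in a population. That definition is an assumption, not something stated in this deck, and the function name and toy population are hypothetical.

```python
def error_diversity(error_vectors):
    """Fraction of distinct error vectors in a population.

    ASSUMPTION: this definition is not given in the slides; it is one
    plausible reading of the "error.diversity" axis label.
    """
    distinct = {tuple(errors) for errors in error_vectors}
    return len(distinct) / len(error_vectors)


# Hypothetical population of 4 individuals scored on 3 test cases.
population = [
    [0, 1, 5],
    [0, 1, 5],   # same behavior as the first individual
    [2, 0, 0],
    [9, 9, 9],
]
diversity = error_diversity(population)   # 3 distinct vectors out of 4
```

Under this reading, a value near 1.0 means almost every individual behaves differently on the test cases, and a sharp drop means many individuals share the same error vector.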

Still, this mushes all the runs together.

And that likely obscures interesting things.

Can we zoom in to individual runs?

Zooming in

[Figure: error.diversity (y-axis, ticks 0.2 to 0.8) vs. generation (x-axis, 0 to 75) for a single run.]

Focusing on one successful L run now.

Three big diversity changes:

First 15 generations have a sharp drop then steep rise
Around generation 40 a sharp drop and rise
Sharp drop at end just before a solution is found

What’s happening at those sections of the run?

We want to be able to dig through a run and see what happened.

Using a graph DB

Goals

We want to store and analyze all the individuals and their relationships.

Ancestry relationships are naturally modeled with a graph

So graph databases seem a natural tool for the relationship part.

www.hokstad.com/family-tree-using-graphviz-and-ruby

[Figure 1: Distribution of fitness, genealogies and root lineages in the population graph. Panels: (a) Distribution of fitness values; (b) Genealogies in the last generation; (c) Root lineages in the last generation; (d) Genealogy of the best individual; (e) Root lineage of the best individual. Color scale: fitness value (Pearson’s R²), 0.0 to 1.0.]

[Burlacu et al., 2013]

Neo4j graph database

Part of the new-ish NoSQL movement
  Neo4j’s initial release was 2007
  Started to take off in 2010

Represent individuals as nodes
Represent parent-child relationships as edges

Easy to represent complex relationships
Easy to search for relationships
Efficient recursive queries, esp. compared to traditional databases

http://neo4j.com
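
To make the node-and-edge model concrete, here is a small in-memory sketch in Python rather than Neo4j. The individuals, the "generation:index" names, and the helper `ancestors` are all hypothetical, chosen only to show why an unbounded "all ancestors" question is a natural graph walk.

```python
from collections import defaultdict

# Hypothetical individuals named "generation:index".
# Each pair is (parent, child), i.e. one parent-child edge.
parent_of_edges = [
    ("0:001", "1:001"),
    ("0:002", "1:001"),   # two parents, as after a crossover
    ("0:002", "1:002"),   # one parent, as after a mutation
    ("1:001", "2:001"),
]

children = defaultdict(set)   # parent -> set of children
parents = defaultdict(set)    # child -> set of parents
for p, c in parent_of_edges:
    children[p].add(c)
    parents[c].add(p)


def ancestors(individual):
    """All ancestors of an individual, found by walking parent edges."""
    seen = set()
    stack = [individual]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen
```

In a relational table of (parent, child) rows the same question needs a recursive self-join; in a graph store it is a single path traversal.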

Cypher query language

Neo4j uses the Cypher query language.

Fundamental elements of Cypher queries:

START
MATCH
WHERE
RETURN

Uses "ASCII art" to describe relationships:

(p)-->(c)

(p)-[r:PARENT_OF]->(c)

Can model (complex) paths

Find Nic’s parents:

(Nic)<-[:PARENT_OF]-(p)

Find all Nic’s grandparents:

(Nic)<-[:PARENT_OF*2]-(gp)

Find all Nic’s ancestors at most 5 steps back:

(Nic)<-[:PARENT_OF*1..5]-(a)

Find all Nic’s siblings:

(Nic)<-[:PARENT_OF]-()-[:PARENT_OF]->(s)
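
For readers without a Neo4j instance at hand, the four patterns above can be mimicked on a plain edge list. This is only a sketch: the family members other than Nic and the helper names are hypothetical, and it illustrates what the patterns match rather than how Neo4j evaluates them.

```python
# Hypothetical family: each pair is (parent, child), one PARENT_OF edge.
PARENT_OF = [
    ("Ann", "Nic"), ("Bob", "Nic"),
    ("Ann", "Sue"), ("Bob", "Sue"),
    ("Cat", "Ann"), ("Dan", "Ann"),
]


def parents(person):
    """Like (person)<-[:PARENT_OF]-(p)."""
    return {p for p, c in PARENT_OF if c == person}


def children(person):
    """Like (person)-[:PARENT_OF]->(c)."""
    return {c for p, c in PARENT_OF if p == person}


def ancestors_between(person, min_steps, max_steps):
    """Like (person)<-[:PARENT_OF*min..max]-(a): everyone reached by
    following between min_steps and max_steps edges backwards."""
    found = set()
    frontier = {person}
    for step in range(1, max_steps + 1):
        frontier = {p for x in frontier for p in parents(x)}
        if step >= min_steps:
            found |= frontier
    return found


def siblings(person):
    """Like (person)<-[:PARENT_OF]-()-[:PARENT_OF]->(s): other children
    of any of the person's parents."""
    return {s for p in parents(person) for s in children(p)} - {person}
```

Here `ancestors_between("Nic", 2, 2)` plays the role of the grandparent pattern and `ancestors_between("Nic", 1, 5)` the role of the `*1..5` pattern.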

Let’s go exploring!

Setup

What are we exploring?

Tom Helmuth provided a lot of data:
  A number of program synthesis problems taken from intro computing texts
  Three different selection mechanisms: lexicase, tournament, and implicit fitness sharing (IFS)
  All using the Clojush implementation of Lee Spector’s PushGP system
  https://github.com/lspector/Clojush

Population size 1,000; ≤ 300 generations
See [Helmuth and Spector, 2015] for more.

We used the batch-import tool and custom scripts to import into Neo4j.
https://github.com/jexp/batch-import

Only just the beginning

We have data from hundreds of runs
Currently a very “by hand” process
Definitely learned valuable things about:

  The behavior of lexicase
  Role of alternation (a type of crossover) in PushGP
  Impact of test cases on evolutionary dynamics

We’ll look at results from two runs:
  Both successful on replace-space-with-newline problem
  One using lexicase (sol’n found in 88 gens)
  One using tournament selection (sol’n found in 151 gens)

Comparing the end-games

How did we construct a winner?

How is a winner constructed at the end of a run?

This query finds all ancestors of a winner (zero total_error) going back at most 8 steps:

MATCH (w) WHERE w.total_error = 0
MATCH (p)-->(c)-[*0..7]->(w)
RETURN DISTINCT id(p), id(c);

8 steps is fairly arbitrary; returns a small enough set to visualize.
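
To show what the query returns, here is a rough Python equivalent over a list of (parent, child) edges. The function name, the toy edges, and the `max_steps` parameter are hypothetical; the loop mirrors the pattern in which `c` is 0 to 7 steps above a winner and `p` is a parent of `c`.

```python
def ancestry_edges(edges, winners, max_steps=8):
    """Return the (parent, child) edges that lie within max_steps
    parent-steps above any winner."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)

    result = set()
    frontier = set(winners)            # nodes 0 steps from a winner
    for _ in range(max_steps):
        next_frontier = set()
        for c in frontier:
            for p in parents.get(c, ()):
                result.add((p, c))     # one row of "RETURN id(p), id(c)"
                next_frontier.add(p)
        frontier = next_frontier
    return result


# Hypothetical lineage: g0 -> g1 -> g2 -> w, plus a second parent x of g2.
toy_edges = [("g0", "g1"), ("g1", "g2"), ("g2", "w"), ("x", "g2")]
two_steps = ancestry_edges(toy_edges, {"w"}, max_steps=2)
```

Collecting edges into a set plays the role of `DISTINCT`, since an edge can lie on several paths to a winner.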

Comparing the end-games

The ancestries of the winner(s) look very different

Tournament selection: single winner w/ high branching factor
Lexicase: 45 winners w/ much lower branching factor

[Figure: two ancestry graphs. Tournament selection: Gen 142 to Gen 150. Lexicase: Gen 79 to Gen 87, with labeled individuals 80:220, 82:447, 83:047, 83:124, 83:619, 84:319, 85:086, 86:261, 87:719, 87:941, 87:947, and “42 Other Winners”.]

Lexicase selection

[Figure: lexicase ancestry graph, Gen 79 to Gen 87.]

A number of observations:
  45(!) “winning” individuals
  Individual “86:261” is (a) parent of all 45
  Individual “86:261” is a parent of 934 (of 1,000) individuals in the next generation

Seriously?!? 934 offspring?!?

Turns out to be an extreme case of a common phenomenon with lexicase

Nodes marked with diamonds all had at least 100 offspring

Shaded diamonds also have at least 5 offspring that are winners or ancestors of winners

What’s the total error (fitness) of “86:261”?

4,034(!)
Bottom quartile!
But had 934 offspring!

Failed to return on 4 cases (error 1,000 each)
Got 2 other answers wrong (error 17 each)
Terrible total error, but perfect on 194 of 200 tests
Great for lexicase!
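
Why is that error profile great for lexicase? The slides do not spell out the algorithm, so what follows is a sketch of lexicase selection as it is commonly described: shuffle the test cases, then repeatedly keep only the candidates that are best on the next case. The rival individual, the positions of the failing cases, and the RNG seed are hypothetical.

```python
import random


def lexicase_select(population, rng):
    """Pick one parent by lexicase selection.

    population maps a name to its list of per-test-case errors
    (lower is better, 0 is perfect).
    """
    candidates = list(population)
    cases = list(range(len(population[candidates[0]])))
    rng.shuffle(cases)                       # fresh case order per selection
    for case in cases:
        best = min(population[name][case] for name in candidates)
        candidates = [n for n in candidates if population[n][case] == best]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)            # remaining ties broken at random


# Error vector with the shape reported for "86:261". Which of the 200 cases
# failed is not given in the slides, so the positions here are hypothetical.
errors_86_261 = [1000] * 4 + [17] * 2 + [0] * 194
total_error = sum(errors_86_261)             # 4 * 1000 + 2 * 17

# Hypothetical rival: slightly wrong on every case, total error only 200.
population = {"86:261": errors_86_261, "rival": [1] * 200}

rng = random.Random(0)
wins = sum(lexicase_select(population, rng) == "86:261" for _ in range(1000))
```

In this toy population “86:261” is selected whenever the first shuffled case is one of its 194 perfect cases, so it should win roughly 97% of selections even though its total error is about 20 times worse than the rival’s.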

What’s the total error (fitness) of “85:086”?

100,000!
Rank 971 out of 1,000
But had 180 offspring

Got all the “print” cases
Failed to return value for all 100 “return” cases (error 1,000 each)
Terrible total error, but perfect on 100 of 200 tests
Fine for lexicase
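
For contrast, a back-of-the-envelope calculation suggests how rarely a rank-971 individual would be chosen by tournament selection on total error. This is a sketch under stated assumptions: all ranks are distinct, entrants are drawn uniformly with replacement, and the tournament size of 7 is hypothetical (the slides do not give it).

```python
from fractions import Fraction


def tournament_win_probability(rank, pop_size, tournament_size):
    """Probability that the individual at `rank` (1 = best) is the best of
    `tournament_size` entrants drawn uniformly with replacement.

    It wins exactly when every entrant is ranked `rank` or worse and at
    least one entrant is the individual itself.
    """
    at_least = pop_size - rank + 1    # individuals ranked `rank` or worse
    worse = pop_size - rank           # individuals ranked strictly worse
    return Fraction(at_least**tournament_size - worse**tournament_size,
                    pop_size**tournament_size)


# Rank 971 of 1,000 as on the slide; tournament size 7 is an assumption.
p_rank_971 = tournament_win_probability(971, 1000, 7)
```

With these assumptions the probability is on the order of 10^-12 per tournament, so 180 offspring would be essentially impossible; lexicase rewards the 100 perfect “print” cases instead of the total.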

High proportion of mutations:
  Roughly half the offspring in this graph created via mutation
  Probably why there’s less branching

Tournament selection

[Figure: ancestry graph of the tournament-selection winner, Gen 142 to Gen 150.]

Much broader: 42 ancestors of a winner for tournament 9 gens back; 14 for lexicase
About two-thirds created via crossover, so more branching than lexicase

Number of ancestors of “winners” over time

Gens from winner   Lexicase   Tournament
        1               4            2
        2               6            4
        3               7            6
        4               6           10
        5               7           13
        6               9           20
        7              10           30
        8              14           33
        9              14           42
       10              22           63
      ...             ...          ...
       18              58          297
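
A table like this can be produced by counting, for each number of generations back, the distinct ancestors at exactly that distance from the winners. The sketch below assumes that reading of the table (suggested by the lexicase column dipping from 7 to 6) and uses a hypothetical function name and toy edges.

```python
def ancestors_per_generation(edges, winners, max_gens):
    """counts[k-1] is the number of distinct ancestors exactly k
    parent-steps above the winners, for k = 1..max_gens."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)

    counts = []
    frontier = set(winners)
    for _ in range(max_gens):
        # Step one generation back, merging shared ancestors via the set.
        frontier = {p for c in frontier for p in parents.get(c, ())}
        counts.append(len(frontier))
    return counts


# Hypothetical pedigree: w has parents a and b; a has parent c; b has c and d.
toy_edges = [("a", "w"), ("b", "w"), ("c", "a"), ("c", "b"), ("d", "b")]
toy_counts = ancestors_per_generation(toy_edges, {"w"}, 3)
```

Because shared ancestors are counted once, heavy reuse of a few parents keeps the counts low, which is the pattern in the lexicase column.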

12 most fecund individuals

Lexicase   Tournament
   934         24
   657         23
   594         23
   590         21
   433         20
   326         20
   297         19
   294         19
   285         19
   283         18
   279         18
   271         18
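
Offspring counts like these fall out of the parent-child edges directly. This is a small sketch assuming each edge is a (parent, child) pair and that a child is counted once per parent; the function name and toy edges are hypothetical.

```python
from collections import Counter


def most_fecund(edges, n=12):
    """Return the n parents with the most distinct children,
    as (parent, offspring_count) pairs in descending order."""
    # set() collapses repeated (parent, child) pairs, e.g. if the same
    # parent was recorded twice for one child.
    counts = Counter(parent for parent, _child in set(edges))
    return counts.most_common(n)


# Hypothetical edges: "a" has children x and y, "b" has child y.
toy_edges = [("a", "x"), ("a", "y"), ("b", "y"), ("a", "x")]
top_two = most_fecund(toy_edges, n=2)
```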

Conclusions

Still early days, but we can definitely see some useful things:
  Differences in ways selection mechanisms work
  Support for hypotheses (e.g., Tom’s paper)
  Evidence for importance of crossover in PushGP
  Impact of test cases on evolutionary dynamics

Future Work
  Automate more of the work
  Examine more runs/problems/etc.
  Explore how to include this “on-line”

Thanks!

Thank you for your time and attention!

Thanks to M. Kirbie Dramdahl (University of Minnesota, Morris), and to Lee Spector’s Computational Intelligence group (Hampshire College) for ideas and feedback.

Contacts: [email protected]

[email protected]

[email protected]

Questions?

References

Burlacu, B., Affenzeller, M., Kommenda, M., Winkler, S., and Kronberger, G. (2013). Visualization of genetic lineages and inheritance information in genetic programming. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO ’13 Companion, pages 1351–1358, New York, NY, USA. ACM.

Helmuth, T. and Spector, L. (2015). General program synthesis benchmark suite. In Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation, GECCO ’15, New York, NY, USA. ACM.
