Data Collection (Lecture 16), the gamedesigninitiative at cornell university

Lecture 16 - Cornell University



The Rise of Big Data

• Big data is changing game design
    • Can gather data from a huge number of players
    • Can use that data to inform future content

• What can we do with all that data?
    • What types of questions can we answer?
    • How does it affect our business model?

• How do we collect all of this data?
    • What are the technical challenges?
    • What are the legal/ethical challenges?


Logging Game Data

[Figure: diagram with the labels Query, Log, and Data Store]

Issues with Data Collection

• You have to record the data
    • Capture data from the game state
    • Do it so it does not hurt performance

• You have to store the data somewhere
    • Lots of players means a lot of data
    • Use database or Google-style storage solution

• You have to move the data to the datastore
    • Requires an internet connection (while playing?)

Instrumenting Activity

[Figure: a Single Player Game (Client only) beside an Online Game (Server and Client); "LOG?" marks the places where logging could happen.]


You Need to Access the Data Log


[Figure annotations: Battery & Connectivity; Server Load]

[Figure, final build: both locations are marked "LOG".]

Even This is Not Enough

• Disks (for logs) are slow
    • SSD: 500 MB/s
    • Standard disk: 200 MB/s
    • Or about 8 MB/frame

• Memory (for state) is fast
    • RAM: ~15000 MB/s
    • Or about 250 MB/frame

• Cannot write fast enough!
    • Asynchronous logging?

[Figure: the Update and Draw steps work on the State, which is written to the Log]

Solution: Limit What You Collect

• Way too much data to collect everything
    • Example: 500 objects with 10 integer-size fields
    • 20 KB/frame or 1.2 MB/s or 72 MB/minute
    • Console games have more objects, played longer

• What would you do with it even if you could?
    • How do you search through all this data?
    • Just saying “use a database” is not enough
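The arithmetic behind these figures can be checked with a short sketch. It assumes 4-byte integers and a 60 frames-per-second loop; neither number is stated on the slide, but both are consistent with the 20 KB/frame and 1.2 MB/s figures. The constant names are illustrative.

```python
# Assumptions (not stated on the slide): 4-byte integers, 60 frames per second.
BYTES_PER_INT = 4
FRAMES_PER_SECOND = 60

objects = 500
fields_per_object = 10

bytes_per_frame = objects * fields_per_object * BYTES_PER_INT
bytes_per_second = bytes_per_frame * FRAMES_PER_SECOND
bytes_per_minute = bytes_per_second * 60

print(f"{bytes_per_frame / 1e3:.0f} KB/frame")    # 20 KB/frame
print(f"{bytes_per_second / 1e6:.1f} MB/s")       # 1.2 MB/s
print(f"{bytes_per_minute / 1e6:.0f} MB/minute")  # 72 MB/minute
```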

Aside: Asynchronous Still Desirable

• I/O has problems with jitter
    • Throughput is not constant
    • Might be in use elsewhere

• Update loop also has jitter
    • May finish before 16 ms
    • “Sleeps” until next frame

• Remove jitter with a budget
    • Use time at end of update
    • Write as much as possible
    • Will keep up in the long run

[Figure: Update and Draw touch the Game State every frame; the Log State step runs periodically, with a budget]
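One way to read the budget idea is sketched below. It is a single-threaded illustration, not the lecture's implementation: `BudgetedLogger`, `run_frame`, and `FRAME_SECONDS` are hypothetical names, and the 60 frames-per-second target is an assumption taken from the 16 ms figure. Records are queued cheaply during the update, and whatever time is left in the frame is spent writing them out.

```python
import time
from collections import deque

FRAME_SECONDS = 1.0 / 60.0  # assumed target: 60 fps, roughly 16 ms per frame


class BudgetedLogger:
    """Queues log records and writes them only in leftover frame time."""

    def __init__(self, sink, clock=time.perf_counter):
        self.pending = deque()
        self.sink = sink      # any callable that accepts one record
        self.clock = clock    # injectable so the sketch can be tested

    def log(self, record):
        # Cheap during update: just remember the record.
        self.pending.append(record)

    def flush_until(self, deadline):
        # Write as many records as fit before the deadline.
        written = 0
        while self.pending and self.clock() < deadline:
            self.sink(self.pending.popleft())
            written += 1
        return written


def run_frame(logger, update, draw, clock=time.perf_counter):
    frame_start = clock()
    update()
    draw()
    # Spend whatever remains of the frame on logging instead of sleeping.
    return logger.flush_until(frame_start + FRAME_SECONDS)
```

Records that do not fit in one frame stay queued, so a slow frame only delays logging; as long as the average frame has spare time, the queue drains in the long run.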

The Classic Tradeoff

Performance

• Only collect what you need
    • Overcoming challenges
    • Significant game events
    • Unusual actions/interactions

• Instrumentation is complex
    • Identify the source of event
    • Find this location in code
    • Add log statement there

Extensibility

• Who knows what I need?
    • Resource data can get large
    • Player position might matter
    • Eventually have full state

• How do I get more data?
    • Always log it all (no)
    • Ask programmer to add more instrumentation(?)


How Do We Solve this Problem?


How Do We Process All this Data?

• How you query the data is more important
    • Example: SQL, Map Reduce, custom set-up
    • This often determines data format for you

• In general, use a uniform schema for the data
    • All data should have the same attributes
    • Avoid key-value blobs; make attributes structured
    • This means you could use SQL (even if you do not)

Is SQL Suitable for Analytics?

• Game state is (often) naturally relational
    • Objects are a long list of attributes
    • Store each game object as a row
    • Lots of libraries can convert objects to SQL

key  player  x    y    health  damage  range  flying
1    1       12   342  100     50      20     false
2    1       43   12   100     50      20     false
3    2       123  90   50      30      50     true
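A minimal sketch of "one game object per row", using Python's standard `sqlite3` module and the three rows from the table above. The `Unit` class and the `gamedata` table name are assumptions made for illustration. The column names `key` and `range` are quoted because they can collide with SQL keywords.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Unit:  # hypothetical game object with the attributes from the table
    key: int
    player: int
    x: int
    y: int
    health: int
    damage: int
    range: int
    flying: bool


units = [
    Unit(1, 1, 12, 342, 100, 50, 20, False),
    Unit(2, 1, 43, 12, 100, 50, 20, False),
    Unit(3, 2, 123, 90, 50, 30, 50, True),
]

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE gamedata (
        "key" INTEGER PRIMARY KEY, player INTEGER, x INTEGER, y INTEGER,
        health INTEGER, damage INTEGER, "range" INTEGER, flying INTEGER
    )
""")
# Each object becomes one row; booleans are stored as 0 or 1.
db.executemany("INSERT INTO gamedata VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
               [astuple(u) for u in units])

# Because every row has the same attributes, aggregates are one query away.
rows = db.execute("""
    SELECT player, COUNT(*), AVG(health)
    FROM gamedata GROUP BY player ORDER BY player
""").fetchall()
print(rows)  # [(1, 2, 100.0), (2, 1, 50.0)]
```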


Or Maybe Not


Can Still Do This Relationally

Player Table
player  …  inventory
1       …  10
2       …  11
3       …  12

Item Table
item  …  value
20    …  250
21    …  1000
22    …  50
23    …  100
24    …  5000

Join Table
player  item  qty
1       21    1
1       20    3
1       22    10
2       21    1
2       24    1

But how easy is this to process?
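To give a feel for the processing cost, here is a sketch that answers one simple question (the total value each player holds) from the item and join tables. It uses Python's standard `sqlite3` module; the table names `item` and `holding` are hypothetical, and the columns elided by "…" on the slide are left out rather than guessed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item (item INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE holding (player INTEGER, item INTEGER, qty INTEGER);
    INSERT INTO item VALUES (20, 250), (21, 1000), (22, 50), (23, 100), (24, 5000);
    INSERT INTO holding VALUES (1, 21, 1), (1, 20, 3), (1, 22, 10), (2, 21, 1), (2, 24, 1);
""")

# Even this simple question needs a join plus an aggregate.
totals = db.execute("""
    SELECT h.player, SUM(h.qty * i.value)
    FROM holding AS h JOIN item AS i ON i.item = h.item
    GROUP BY h.player
    ORDER BY h.player
""").fetchall()
print(totals)  # [(1, 2250), (2, 6000)]
```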

Analytics Involve Time-Sensitive Data

• Each event/data needs to be time-stamped
    • Time it happened in seconds/milliseconds/etc.
    • Typically use an int, as it makes grouping easier
    • Time attribute is an integral part of queries

key  player  x    y    health  damage  range  flying  time
1    1       12   342  100     50      20     false   1209823
2    1       43   12   100     50      20     false   1309875
3    2       123  90   50      30      50     true    1724304

Example: Aggregating Daily Prices

• Average price of each item, each day
    • Want to group together item transactions per day
    • Use SEC_PER_DAY to convert time stamp to days
    • Relatively simple aggregate query

-- SEC_PER_DAY is a constant (86400 if time is stored in seconds)
SELECT item, avg(price), time/SEC_PER_DAY
FROM gamedata
GROUP BY item, time/SEC_PER_DAY
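The daily-average query can be tried with Python's standard `sqlite3` module. The transactions below are made up for illustration, time stamps are assumed to be integer seconds, and `SEC_PER_DAY` is passed in as a bound parameter. This sketch groups by item as well as by day so that each item gets its own average.

```python
import sqlite3

SEC_PER_DAY = 86400  # assumes time stamps are stored in seconds

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE gamedata (item TEXT, price INTEGER, time INTEGER)")
db.executemany("INSERT INTO gamedata VALUES (?, ?, ?)", [
    ("sword", 100, 10),             # day 0
    ("sword", 200, 20),             # day 0
    ("potion", 10, 30),             # day 0
    ("sword", 400, 86400 + 5),      # day 1
])

# Integer division turns a time stamp into a day number.
daily = db.execute("""
    SELECT item, time / :spd AS day, AVG(price)
    FROM gamedata
    GROUP BY item, time / :spd
    ORDER BY day, item
""", {"spd": SEC_PER_DAY}).fetchall()
print(daily)  # [('potion', 0, 10.0), ('sword', 0, 150.0), ('sword', 1, 400.0)]
```

Storing time as an integer is what makes the day number a plain integer division.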

Hard Example: Quest Order

• Return first quest completed after tutorial level
    • Want a quest that is completed after tutorial
    • Want no other quest completed in between
    • This is known as a time-series query

[Figure: timeline of completions: Tutorial, Quest 1, Quest 2]

Time Series in SQL

• Return all pairs of consecutive time events

SELECT D1.name, D2.name
FROM Data AS D1, Data AS D2
WHERE D2.time > D1.time
EXCEPT
SELECT D1.name, D2.name
FROM Data AS D1, Data AS D2, Data AS S
WHERE S.time > D1.time AND D2.time > S.time
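The query can be run as written with Python's standard `sqlite3` module. The three events and their time stamps are hypothetical, chosen to match the Tutorial / Quest 1 / Quest 2 timeline. The sketch also assumes that event names are unique: because `EXCEPT` compares names only, repeated names could remove pairs that really are consecutive.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Data (name TEXT, time INTEGER)")
db.executemany("INSERT INTO Data VALUES (?, ?)",
               [("Tutorial", 10), ("Quest 1", 25), ("Quest 2", 40)])

# All "later" pairs, minus the pairs that have some event S strictly between.
pairs = db.execute("""
    SELECT D1.name, D2.name
    FROM Data AS D1, Data AS D2
    WHERE D2.time > D1.time
    EXCEPT
    SELECT D1.name, D2.name
    FROM Data AS D1, Data AS D2, Data AS S
    WHERE S.time > D1.time AND D2.time > S.time
""").fetchall()

first_after_tutorial = [later for earlier, later in pairs if earlier == "Tutorial"]
print(sorted(pairs))         # [('Quest 1', 'Quest 2'), ('Tutorial', 'Quest 1')]
print(first_after_tutorial)  # ['Quest 1']
```

Note the cost: the second half is a three-way self-join, so the work grows roughly with the cube of the number of events before the database optimizes anything.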

NoSQL is not Much Help

• Time series is a correlation of actions
    • Events X that happen after Y
    • Correlation requires a DB join

• NoSQL means Map Reduce
    • Map: Sort data into key, values
    • Reduce: Aggregate values of same keys
    • Example: Average price of an item

• Joins are hard in Map Reduce
    • Google gets scalability by “cheating”
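The average-price example can be sketched as three plain Python functions. This is a single-process illustration of the map and reduce steps, not a distributed framework; the function names and the sample transactions are made up.

```python
from collections import defaultdict


def map_phase(transactions):
    # Map: emit one (key, value) pair per record; here key = item, value = price.
    for item, price in transactions:
        yield item, price


def shuffle(pairs):
    # Group values by key (a framework does this between map and reduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: aggregate all values that share a key.
    return {key: sum(values) / len(values) for key, values in groups.items()}


transactions = [("sword", 100), ("potion", 10), ("sword", 200), ("potion", 30)]
averages = reduce_phase(shuffle(map_phase(transactions)))
print(averages)  # {'sword': 150.0, 'potion': 20.0}
```

Each key is reduced on its own, which is why a per-item aggregate fits well and a correlation between different events (a join) does not.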

Alternative: Data Stream Systems

• Given set of event streams
    • Database with time stamps
    • Can only read data in order
    • Can only read data once

• Idea: Get events as they occur
    • But this is a little unrealistic
    • Delay to create the event

• Also have a pattern query
    • Matches event/set of events
    • Outputs match as new event
    • Called a composite event

[Figure: several Event Streams feed an Event Pattern Query; the output is also an Event Stream]

Using Data Stream Systems

[Figure: two set-ups. External Stream System: the Game State sends an Event Stream to a separate Data Stream System, which applies the Pattern. Internal Stream System: the Pattern is applied to the Event Stream inside the game.]

Using Data Stream Systems

External Event System

• Game acts as base stream
    • Matching done outside
    • Simple but needs bandwidth

• Several commercial options
    • Streambase
    • Microsoft StreamInsight
    • Oracle CEP

• Fewer open source options
    • Lot of academic prototypes

Internal Event System

• Game handles all matching
    • Event streams in game state
    • Bandwidth only for output

• You must implement system
    • Maybe use open source
    • But always read licenses!

• Why do this? Complex AI
    • Just like rule matching!
    • Bioshock Infinite does this

Did I Mention A LOT of Data?

• Amount of data might be too much for one log
    • Just one player can create a significant amount

• Data probably in different categories
    • Examples: Items sold, quests finished

• Put the data in appropriate databases
    • Each might be on a different server
    • Data streams can send data to the right place
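A minimal sketch of sending each event to the right place. `CategoryRouter` and the event fields are hypothetical, and plain Python lists stand in for databases that might live on different servers.

```python
class CategoryRouter:
    """Sends each event to the sink registered for its category."""

    def __init__(self):
        self.sinks = {}      # category name -> callable that stores one event
        self.unrouted = []   # events whose category has no registered sink

    def register(self, category, sink):
        self.sinks[category] = sink

    def route(self, event):
        sink = self.sinks.get(event["category"])
        if sink is None:
            self.unrouted.append(event)
        else:
            sink(event)


# Lists stand in for the per-category databases.
items_sold, quests_finished = [], []
router = CategoryRouter()
router.register("item_sold", items_sold.append)
router.register("quest_finished", quests_finished.append)

for event in [
    {"category": "item_sold", "item": 21, "price": 1000, "time": 1209823},
    {"category": "quest_finished", "quest": "Tutorial", "time": 1309875},
    {"category": "item_sold", "item": 20, "price": 250, "time": 1724304},
]:
    router.route(event)

print(len(items_sold), len(quests_finished))  # 2 1
```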

Data Streams and Categorization

[Figure: diagram with the labels Log, Data Stream System, and Query]

Classic Event Patterns

• Selection: P(e) (event e has property P)
    • Can be applied to base or composite events
    • Common property for composite events: duration

• Ordering: E1:E2 (E2 happened after E1)

• Sequences: E1;E2 (E2 is the first event after E1)
    • Ranges over all events fed into the pattern query
    • Sometimes want to exclude certain matches to E2
    • E1 ;P E2 (E2 is first event after E1 satisfying P)
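The three patterns can be written as small functions over in-memory lists, which makes their differences easy to see. This is a sketch under assumptions: events are dictionaries with hypothetical `name` and `time` fields, the time stamps are made up, and a real stream system would process events incrementally instead of scanning lists.

```python
def select(stream, prop):
    """Selection P(e): keep the events that satisfy the property."""
    return [e for e in stream if prop(e)]


def ordering(e1, e2):
    """Ordering E1:E2: every pair where the E2 event is later than the E1 event."""
    return [(a, b) for a in e1 for b in e2 if b["time"] > a["time"]]


def sequence(e1, e2, prop=lambda b: True):
    """Sequence E1 ;P E2: pair each E1 event with the first later E2 event satisfying P."""
    pairs = []
    for a in e1:
        later = [b for b in e2 if b["time"] > a["time"] and prop(b)]
        if later:
            pairs.append((a, min(later, key=lambda b: b["time"])))
    return pairs


def names(pairs):
    return [(a["name"], b["name"]) for a, b in pairs]


E1 = [{"name": f"a{t}", "time": t} for t in (2, 4, 7)]
E2 = [{"name": f"x{t}", "time": t} for t in (1, 5, 6, 7)]

print([e["name"] for e in select(E2, lambda e: e["time"] > 5)])  # ['x6', 'x7']
print(len(ordering(E1, E2)))                                     # 6
print(names(sequence(E1, E2)))                                   # [('a2', 'x5'), ('a4', 'x5')]
print(names(sequence(E1, E2, lambda b: b["time"] >= 6)))         # [('a2', 'x6'), ('a4', 'x6')]
```

Ordering returns every later pair, while a sequence returns at most one pair per E1 event.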

Patterns and Base Streams

Base Stream from Application (newer events at the bottom)
(a1,b1,c1,…,t1)
(a2,b2,c2,…,t2)
(a3,b3,c3,…,t3)
(a4,b4,c4,…,t4)
(a5,b5,c5,…,t5)
(a6,b6,c6,…,t6)
(a7,b7,c7,…,t7) …

Filter Predicate P(x)
• Boolean expression on the event tuple (easy to compute)

Derived Stream “P” (one for each predicate)
(a1,b1,c1,…,t1)
(a3,b3,c3,…,t3)
(a6,b6,c6,…,t6)
(a7,b7,c7,…,t7) …

Sequencing Streams

Input Stream E1
(a2,b2,…,t2)
(a4,b4,…,t4)
(a7,b7,…,t7) …

Input Stream E2
(x1,y1,…,t1)
(x5,y5,…,t5)
(x6,y6,…,t6)
(x7,y7,…,t7) …

Event Pattern: E1;E2

Derived Stream E1;E2
(a2,…,x5,…,t2,t5)
(a4,…,x5,…,t4,t5)

Note on the figure: events have to be stored until a match is found.

Some Words on Time Stamps

• All events must have a time stamp
    • Specific time it is generated
    • Crucial for time series

• Pattern output is an event!
    • New stream to be reused
    • Often pick latest time stamp

• But duration is important
    • Good candidate for P(x)
    • Allows us to timeout
    • Solution: intervals

[Figure: Base Events (B) e1, e2, e3, e4 and Composite Events (B;B) e1;e2, e2;e3, e3;e4. Still lossy. Why?]

Ordering Streams

Input Stream E1
(a2,b2,…,t2)
(a4,b4,…,t4)
(a7,b7,…,t7) …

Input Stream E2
(x1,y1,…,t1)
(x5,y5,…,t5)
(x6,y6,…,t6)
(x7,y7,…,t7) …

Event Pattern: E1:E2

Derived Stream E1:E2
(a2,…,t2,t5)  (a4,…,t4,t5)
(a2,…,t2,t6)  (a4,…,t4,t6)
(a2,…,t2,t7)  (a4,…,t4,t7)

O(n^2) blow-up. Need timeouts!
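A sketch of how a timeout bounds the blow-up. It assumes the two input streams have been merged into one time-ordered list, that events are dictionaries with hypothetical `stream` and `time` fields, and that composite events are reported as (start, end) intervals so their duration is available. The function name and the sample times are illustrative.

```python
from collections import deque


def ordered_pairs(events, is_first, is_second, timeout):
    """Yield (start, end) intervals for E1:E2 matches that fit in `timeout`."""
    pending = deque()  # E1 events that are still young enough to match
    for event in events:  # single in-order pass over the merged stream
        now = event["time"]
        while pending and now - pending[0]["time"] > timeout:
            pending.popleft()  # timed out: drop it, it can never match again
        if is_second(event):
            for first in pending:
                if first["time"] < now:  # E2 must be strictly later than E1
                    yield (first["time"], now)
        if is_first(event):
            pending.append(event)


merged = [
    {"stream": "E2", "time": 1}, {"stream": "E1", "time": 2},
    {"stream": "E1", "time": 4}, {"stream": "E2", "time": 5},
    {"stream": "E2", "time": 6}, {"stream": "E1", "time": 7},
    {"stream": "E2", "time": 7},
]
in_e1 = lambda e: e["stream"] == "E1"
in_e2 = lambda e: e["stream"] == "E2"

print(list(ordered_pairs(merged, in_e1, in_e2, timeout=3)))
# [(2, 5), (4, 5), (4, 6), (4, 7)]
print(len(list(ordered_pairs(merged, in_e1, in_e2, timeout=float("inf")))))
# 6
```

With the timeout, the stored list holds only events from the last few time units, so both memory and output stay proportional to the window instead of the whole history.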

Managing Patterns

• Analysis => lots of patterns
    • Each play action has pattern
    • Was purpose of the system

• Time Series => data storage
    • Have to store the first part
    • Drop at match (or timeout)

• Idea: Each pattern has a list
    • Stores uncompleted part
    • Complex time series?

• Pattern 1: P (nothing to store)
• Pattern 2: P;Q (have to store)
• Pattern 3: R ;Q S (have to store)
• Pattern 4: P;Q;R (must store, must store)

Solution: Automata

Pattern: T<30s((P ;Q R) : S)

[Figure: nondeterministic automaton for the pattern. Edges are labeled with the input streams P, R, and S and with tests such as True, Q, and < 30s; states hold stored tuples such as (a3,…,t3), (a4,…,t4), (a7,…,t7), and (a1,b2,…,t1,t2).]

Each edge has:
• Input event stream
• Test predicate
• (Concatenation op)

For each edge:
1. Take each tuple from out-state
2. Concatenate with stream tuple
3. Test if predicate is true
4. If so, move tuple to in-state
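The four per-edge steps can be sketched as follows. This is an illustration under assumptions, not the lecture's system: `Edge`, `PatternAutomaton`, and the state names are hypothetical, and the example pattern is a simplified "P, then R, then S, all within 30 seconds of P" instead of the exact pattern on the slide.

```python
from collections import defaultdict


class Edge:
    def __init__(self, source, target, stream, test):
        self.source = source    # out-state: where stored tuples are taken from
        self.target = target    # in-state: where extended tuples are moved to
        self.stream = stream    # name of the input event stream
        self.test = test        # predicate over the concatenated tuple


class PatternAutomaton:
    def __init__(self, edges, start, accept):
        self.edges, self.start, self.accept = edges, start, accept
        self.stored = defaultdict(list)   # state -> partial matches
        self.stored[start].append(())     # the empty match always waits at the start

    def feed(self, stream, event):
        moves = []
        for edge in self.edges:
            if edge.stream != stream:
                continue
            for partial in self.stored[edge.source]:   # 1. take each tuple
                candidate = partial + (event,)           # 2. concatenate
                if edge.test(candidate):                 # 3. test predicate
                    moves.append((edge, partial, candidate))
        matches = []
        for edge, partial, candidate in moves:           # 4. move to in-state
            if edge.source != self.start and partial in self.stored[edge.source]:
                self.stored[edge.source].remove(partial)
            if edge.target == self.accept:
                matches.append(candidate)
            else:
                self.stored[edge.target].append(candidate)
        return matches


def within_30s(candidate):
    return candidate[-1]["time"] - candidate[0]["time"] < 30


automaton = PatternAutomaton(
    edges=[
        Edge("start", "seen_p", "P", lambda c: True),
        Edge("seen_p", "seen_pr", "R", within_30s),
        Edge("seen_pr", "accept", "S", within_30s),
    ],
    start="start",
    accept="accept",
)

found = []
for stream, t in [("P", 0), ("R", 10), ("S", 20), ("P", 100), ("R", 140), ("S", 150)]:
    found += automaton.feed(stream, {"stream": stream, "time": t})

print([[e["time"] for e in match] for match in found])  # [[0, 10, 20]]
```

The second P (at time 100) never completes because its R arrives 40 seconds later. This sketch leaves that tuple in its state forever; a fuller version would also evict tuples once they can no longer satisfy the time bound.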

Events and Data-Driven Development

• Designers should not program these automata
    • Essentially want a nice query language
    • There are some simple ones to implement

• May already have automata framework for AI
    • Allow designers to specify AI behavior in script file
    • Compile into automaton when necessary

• Your event system can piggyback on this
    • Similar system (except for the tuple storage)
    • But need a different (custom) compiler

The Main Challenge: Negation

• Common pattern:
    • Find a sequence A;B with no C in between
    • Example: Quest started/done with none in between

• Classic idea: Regular expressions
    • A(; ¬C)*;B (Keep checking that not C until find B)
    • This is not quite right, however. Why?
    • Bigger issue: should not store “not” events

• Messy technical challenge; no clean solutions
    • Purpose of contexts in Irrational’s technical talk
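One simple way to handle this particular case without storing any "not" events is to let a C event cancel the pending A events. The sketch below is only that special case, offered as an assumption-laden illustration; it is not a general solution to negation, and the function name and event fields are hypothetical.

```python
def sequence_without(events, is_a, is_b, is_c):
    """Match A;B pairs that have no C between them, in one in-order pass."""
    pending = []   # A events not yet matched or cancelled
    matches = []
    for event in events:
        if is_c(event):
            pending.clear()              # a C cancels every waiting A
        elif is_b(event):
            matches.extend((a, event) for a in pending)
            pending.clear()              # B is the first B after those A events
        elif is_a(event):
            pending.append(event)
    return matches


events = [
    {"name": "A", "time": 1},
    {"name": "C", "time": 2},   # cancels the A at time 1
    {"name": "B", "time": 3},
    {"name": "A", "time": 4},
    {"name": "B", "time": 5},
]
named = lambda n: (lambda e: e["name"] == n)

result = sequence_without(events, named("A"), named("B"), named("C"))
print([(a["time"], b["time"]) for a, b in result])  # [(4, 5)]
```

Only the positive events (the waiting A events) are ever stored; the C events are consumed the moment they arrive.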

Putting this All Together

• We need to record the data fast
    • Requires a custom logging system in game
    • Have to be choosy about what data is recorded
    • Need to be flexible to respond to designers

• We need to process the data fast
    • Depends on the choice of query languages
    • Need support for aggregation and time-series
    • Data stream systems are ideal for this


Brainstorm: Mobile Data Collection
