14
Version Traveler: Fast and Memory-Efficient Version Switching in Graph Processing Systems Xiaoen Ju Dan Williams Hani Jamjoom Kang G. Shin University of Michigan IBM T.J. Watson Research Center Abstract Multi-version graph processing, where each version cor- responds to a snapshot of an evolving graph, is a com- mon scenario in large-scale graph processing. Straight- forward application of existing graph processing sys- tems often yields suboptimal performance due to high version-switching cost. We present Version Traveler (VT), a graph processing system featuring fast and mem- ory-efficient version switching. VT achieves fast ver- sion switching by (i) representing differences among ver- sions as deltas and (ii) constructing the next version by integrating the in-memory graph representation of the current version with the delta(s) relating the two ver- sions. Furthermore, VT maintains high computation performance and memory compactness. Our evaluation using multi-version processing workloads with realis- tic datasets shows that VT outperforms PowerGraph— running 23x faster with a 15% memory overhead. VT is also superior to four multi-version processing systems, achieving up to 90% improvement when jointly consid- ering processing time and resource consumption. 1 Introduction Multi-version graph processing is an important and com- mon scenario in big data analytics. In such a scenario, each version corresponds to a snapshot of an evolving graph; a graph processing system iterates over all input versions and applies a user-defined algorithm to them, one at a time. Multi-version graph processing enables the analysis of characteristics embedded across differ- ent versions. For example, computing the shortest dis- tance between two users across multiple versions of a social network captures the varying closeness between them [26]. Computing the centrality scores of scientific researchers across multiple versions of a co-authorship graph demonstrates their evolving impact [17]. A key element in multi-version graph processing is ef- ficient arbitrary local version switching. Version switch- ing refers to the preparation of the next to-be-processed version after computation completes on the current ver- sion. Such a procedure can be arbitrary, because the se- quence of to-be-processed versions cannot be predeter- mined by the underlying system. It is local in that the next version commonly resides in the vicinity of the cur- rent version, demonstrating version locality. Arbitrary local version switching has not been fully addressed before, in particular, from the perspective of the entire multi-version processing workflow. Due to version unawareness, mainstream systems, such as Pregel [23] and PowerGraph [11], perform version switching by discarding the in-memory representation of the current version and loading the next version in its entirety from persistent storage. Such an approach in- curs substantial version switching time. Existing multi- version processing systems [10, 17, 19, 22] expedite the version-switching procedure by graph-delta integration. Specifically, the next version is constructed by integrat- ing the current version with the delta representing the difference between the two versions. Albeit efficient, they either lack the support for arbitrary local switch- ing [10, 19], incur high neighbor access penalty during computation [17], or lead to high memory overhead [22]. Towards efficient support for arbitrary local version switching, a system must balance contradicting require- ments among three design dimensions: extensibility, compactness, and neighbor access efficiency. Extensibil- ity refers to the easiness of creating a new graph version by extending the current one. Compactness refers to the memory overhead related to version-switching support. Neighbor access efficiency refers to the speed of neigh- bor access by a graph computing engine. We present Version Traveler (VT), a multi-version graph processing system enabling fast arbitrary local ver- sion switching. From a holistic view, VT balances the requirements in all three dimensions of the design space. VT consists of two novel components: (i)a hybrid-com-

IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

Vers

ion

Trav

eler

:Fas

tand

Mem

ory-

Effi

cien

tVe

rsio

nSw

itchi

ngin

Gra

phPr

oces

sing

Syst

ems

Xia

oen

JuD

anW

illia

ms†

Han

iJam

joom

†K

ang

G.S

hin

Uni

vers

ityof

Mic

higa

n† I

BM

T.J.

Wat

son

Res

earc

hC

ente

r

Abs

trac

t

Mul

ti-ve

rsio

ngr

aph

proc

essi

ng,w

here

each

vers

ion

cor-

resp

onds

toa

snap

shot

ofan

evol

ving

grap

h,is

aco

m-

mon

scen

ario

inla

rge-

scal

egr

aph

proc

essi

ng.

Stra

ight

-fo

rwar

dap

plic

atio

nof

exis

ting

grap

hpr

oces

sing

sys-

tem

sof

ten

yiel

dssu

bopt

imal

perf

orm

ance

due

tohi

ghve

rsio

n-sw

itchi

ngco

st.

We

pres

ent

Vers

ion

Trav

eler

(VT)

,agr

aph

proc

essi

ngsy

stem

feat

urin

gfa

stan

dm

em-

ory-

effic

ient

vers

ion

switc

hing

.V

Tac

hiev

esfa

stve

r-si

onsw

itchi

ngby

(i)r

epre

sent

ing

diff

eren

cesa

mon

gve

r-si

ons

asde

ltas

and

(ii)

cons

truc

ting

the

next

vers

ion

byin

tegr

atin

gth

ein

-mem

ory

grap

hre

pres

enta

tion

ofth

ecu

rren

tve

rsio

nw

ithth

ede

lta(s

)re

latin

gth

etw

ove

r-si

ons.

Furt

herm

ore,

VT

mai

ntai

nshi

ghco

mpu

tatio

npe

rfor

man

cean

dm

emor

yco

mpa

ctne

ss.

Our

eval

uatio

nus

ing

mul

ti-ve

rsio

npr

oces

sing

wor

kloa

dsw

ithre

alis

-tic

data

sets

show

sth

atV

Tou

tper

form

sPo

wer

Gra

ph—

runn

ing

23x

fast

erw

itha

15%

mem

ory

over

head

.V

Tis

also

supe

rior

tofo

urm

ulti-

vers

ion

proc

essi

ngsy

stem

s,ac

hiev

ing

upto

90%

impr

ovem

entw

hen

join

tlyco

nsid

-er

ing

proc

essi

ngtim

ean

dre

sour

ceco

nsum

ptio

n.

1In

trod

uctio

n

Mul

ti-ve

rsio

ngr

aph

proc

essi

ngis

anim

port

anta

ndco

m-

mon

scen

ario

inbi

gda

taan

alyt

ics.

Insu

cha

scen

ario

,ea

chve

rsio

nco

rres

pond

sto

asn

apsh

otof

anev

olvi

nggr

aph;

agr

aph

proc

essi

ngsy

stem

itera

tes

over

alli

nput

vers

ions

and

appl

ies

aus

er-d

efine

dal

gori

thm

toth

em,

one

ata

time.

Mul

ti-ve

rsio

ngr

aph

proc

essi

ngen

able

sth

ean

alys

isof

char

acte

rist

ics

embe

dded

acro

ssdi

ffer

-en

tve

rsio

ns.

For

exam

ple,

com

putin

gth

esh

orte

stdi

s-ta

nce

betw

een

two

user

sac

ross

mul

tiple

vers

ions

ofa

soci

alne

twor

kca

ptur

esth

eva

ryin

gcl

osen

ess

betw

een

them

[26]

.C

ompu

ting

the

cent

ralit

ysc

ores

ofsc

ient

ific

rese

arch

ers

acro

ssm

ultip

leve

rsio

nsof

aco

-aut

hors

hip

grap

hde

mon

stra

tes

thei

revo

lvin

gim

pact

[17]

.A

key

elem

enti

nm

ulti-

vers

ion

grap

hpr

oces

sing

isef

-

ficie

ntar

bitr

ary

loca

lver

sion

switc

hing

.Ver

sion

switc

h-in

gre

fers

toth

epr

epar

atio

nof

the

next

to-b

e-pr

oces

sed

vers

ion

afte

rco

mpu

tatio

nco

mpl

etes

onth

ecu

rren

tver

-si

on.

Such

apr

oced

ure

can

bear

bitr

ary,

beca

use

the

se-

quen

ceof

to-b

e-pr

oces

sed

vers

ions

cann

otbe

pred

eter

-m

ined

byth

eun

derl

ying

syst

em.

Itis

loca

lin

that

the

next

vers

ion

com

mon

lyre

side

sin

the

vici

nity

ofth

ecu

r-re

ntve

rsio

n,de

mon

stra

ting

vers

ion

loca

lity.

Arb

itrar

ylo

cal

vers

ion

switc

hing

has

not

been

fully

addr

esse

dbe

fore

,in

part

icul

ar,

from

the

pers

pect

ive

ofth

een

tire

mul

ti-ve

rsio

npr

oces

sing

wor

kflow

.D

ueto

vers

ion

unaw

aren

ess,

mai

nstr

eam

syst

ems,

such

asPr

egel

[23]

and

Pow

erG

raph

[11]

,pe

rfor

mve

rsio

nsw

itchi

ngby

disc

ardi

ngth

ein

-mem

ory

repr

esen

tatio

nof

the

curr

ent

vers

ion

and

load

ing

the

next

vers

ion

inits

entir

ety

from

pers

iste

ntst

orag

e.Su

chan

appr

oach

in-

curs

subs

tant

ialv

ersi

onsw

itchi

ngtim

e.E

xist

ing

mul

ti-ve

rsio

npr

oces

sing

syst

ems

[10,

17,1

9,22

]ex

pedi

teth

eve

rsio

n-sw

itchi

ngpr

oced

ure

bygr

aph-

delta

inte

grat

ion.

Spec

ifica

lly,t

hene

xtve

rsio

nis

cons

truc

ted

byin

tegr

at-

ing

the

curr

ent

vers

ion

with

the

delta

repr

esen

ting

the

diff

eren

cebe

twee

nth

etw

ove

rsio

ns.

Alb

eit

effic

ient

,th

eyei

ther

lack

the

supp

ort

for

arbi

trar

ylo

cal

switc

h-in

g[1

0,19

],in

cur

high

neig

hbor

acce

sspe

nalty

duri

ngco

mpu

tatio

n[1

7],o

rlea

dto

high

mem

ory

over

head

[22]

.To

war

dsef

ficie

ntsu

ppor

tfo

rar

bitr

ary

loca

lve

rsio

nsw

itchi

ng,a

syst

emm

ustb

alan

ceco

ntra

dict

ing

requ

ire-

men

tsam

ong

thre

ede

sign

dim

ensi

ons:

exte

nsib

ility

,co

mpa

ctne

ss,a

ndne

ighb

orac

cess

effic

ienc

y.E

xten

sibi

l-ity

refe

rsto

the

easi

ness

ofcr

eatin

ga

new

grap

hve

rsio

nby

exte

ndin

gth

ecu

rren

tone

.C

ompa

ctne

ssre

fers

toth

em

emor

yov

erhe

adre

late

dto

vers

ion-

switc

hing

supp

ort.

Nei

ghbo

rac

cess

effic

ienc

yre

fers

toth

esp

eed

ofne

igh-

bora

cces

sby

agr

aph

com

putin

gen

gine

.W

epr

esen

tVe

rsio

nTr

avel

er(V

T),

am

ulti-

vers

ion

grap

hpr

oces

sing

syst

emen

ablin

gfa

star

bitr

ary

loca

lver

-si

onsw

itchi

ng.

From

aho

listic

view

,V

Tba

lanc

esth

ere

quir

emen

tsin

allt

hree

dim

ensi

ons

ofth

ede

sign

spac

e.V

Tco

nsis

tsof

two

nove

lcom

pone

nts:

(i)a

hybr

id-c

om-

Page 2: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

pressed-sparse-row (CSR) graph supporting fast deltaintegration while preserving compactness and neighboraccess speed during graph computation, and (ii) a ver-sion delta cache that stores the deltas in an easy-to-inte-grate and compact format. Conceptually, the hybrid-CSRgraph represents the common subgraph shared amongmultiple versions in the CSR format. As a result, thesubgraph is compactly stored in memory and yields highneighbor access speed—both known advantages of theCSR format. Differences among versions are absorbedby a hierarchical vector-of-vectors (VoV) representationand placed in the delta cache, leading to high version-switching speed thanks to its ability to overcome CSR’slack of extensibility.

We have implemented Version Traveler inside Pow-erGraph [11] by augmenting its graph representationlayer with VT’s hybrid-CSR graph and version deltacache. Our evaluation with realistic graphs shows thatVT significantly outperforms PowerGraph in multi-ver-sion processing: VT runs 23x faster with a mere 15%memory overhead. VT also outperforms designs pro-posed in state-of-the-art multi-version processing sys-tems, such as log delta, bitmap delta, and multi-version-array, achieving up to 90% improvement when jointlyconsidering processing time and resource consumption.

The contributions of this paper include:• the formulation of the arbitrary local version

switching problem in the context of multi-versiongraph processing,• a method for arbitrary local version switching with

a holistic view, considering neighbor access speed,version switching speed, and compactness,• the design of Version Traveler, a graph processing

system balancing the above three requirements withtwo novel components, and• the demonstration of VT’s superior performance

compared to state-of-the-art graph processing sys-tems via extensive evaluation.

2 Multi-Version Graph Processing

In this section, we first discuss the characteristics ofmulti-version graph processing workloads, followed bya discussion on its workflow. We then summarize relatedwork, analyze the design space for efficient multi-versiongraph processing systems, and discuss challenges.

2.1 Workload Characteristics

In multi-version graph processing, version switchingcommonly demonstrates randomness and locality. Ver-sion switching is arbitrary, in that the next version mayprecede or succeed the current version in the graph evo-

lution.1 Such a switching sequence may be dynamic, un-able to be predetermined by the graph processing sys-tem. Version switching is local, in that the next ver-sion commonly resides within the vicinity—in terms ofsimilarity—of the current one in the graph evolution.

We exemplify the demand for arbitrary local versionswitching with three examples. First, suppose we needto identify the cause of the varying distance between twousers in a social network [26]. For simplicity, assumethat the distances in versions i and k are different andthat the distance changes only once along the evolutionfrom versions i to k. If, after processing version j = i+k

2 ,a binary-search-style exploration algorithm finds that thedistance in that version remains the same as that in ver-sion k but differs from that in version i, then the algo-rithm would invoke another iteration of shortest distancecomputation for version m = i+ j

2 . The switching fromversions j to m is arbitrary for the supporting graph pro-cessing system. In terms of locality, although the searchmay oscillate between versions with high dissimilarityat the beginning, the version locality increases exponen-tially with the progress of the execution.

Second, in interactive big data analytics, an analystmay rerun an algorithm on a graph version, after digest-ing the results of the previous execution. Which versionshould be processed next depends on the analyst’s un-derstanding of the existing results, as well as his/her do-main knowledge and intuition. This leads to arbitraryversion switching from the perspective of the graph pro-cessing system. As for locality, such analysis commonlyfollows a refinement procedure, where significant effortsare required to zoom in and conduct in-depth analysis ona cluster of versions within the vicinity of each other.

Third, in a collaborative data analytics environment,both datasets and computing power are shared amongusers [5]. Individual tasks—each targeting a graphversion—can be combined by the processing system,leading to multi-version graph processing. Since the nextrequest may be enqueued during the processing of thecurrent version and may target a version preceding orsucceeding the current version in the graph evolution,version switching is arbitrary. Regarding locality, in-dependently-submitted tasks may target similar versions.Such is the case, for example, where various algorithmsare employed to capture and understand trending eventsin an evolving social network.

2.2 WorkflowA typical multi-version graph processing workflow is di-vided into multiple iterations. In each iteration, an ar-bitrary local graph version is processed. Systems de-

1More broadly, in a non-linear graph evolution scenario [5], the nextversion may reside in a different branch than the current version.

Page 3: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

Loadgraph

Compute

(a)

Warm Loop

Loaddelta

Loadgraph

Integrate graph/delta

Compute

(b)

Delta cache

Figure 1: Version switching workflow, (a) with and(b) without the use of deltas

signed for individual graph processing tasks are unableto recognize or take advantage of the evolution relationamong versions. Treating each version as a standalonegraph, such systems first fully load the version from per-sistent storage into memory and then execute a user-de-fined graph algorithm over it (cf. Figure 1a).

When versions of a working set share a sub-stantial common subgraph, working with deltas—representations of the differences between graphversions—can be more efficient. Figure 1b shows themulti-version processing workflow with deltas. After thefirst version is loaded and processed, switching to a sub-sequent version can be achieved by integrating the cur-rent version in memory with deltas relating the currentand the next versions. In general, deltas are much smallerthan full graphs [26, 31]. As a result, they can be cachedin memory, further improving the efficiency of versionswitching.

2.3 Related Work on Graph/Delta Designs

We focus the discussion of related work on in-mem-ory graph and delta representations, because they deter-mine the efficiency of the switching loop (cf. Figure 1b).2

For graphs, we specifically focus on representations re-lated to neighbors of vertices, because they differentiategraphs from regular table-form datasets.

As a result, we exclude other active research direc-tions, such as programming paradigms [11, 14, 15, 20,23, 24, 30, 32], out-of-core processing [19, 27], loadbalancing [16], failure recovery [28], streaming [10],dataflow-based processing [9,12], and performance eval-uation [21]. We also exclude work within the broadscope of multi-version processing but not dedicated toin-memory graph/delta design in the context of arbi-trary local version switching. Examples are stream-ing processing [10, 19],3 parallel multi-version process-

2Both graphs and deltas may have different representations in mem-ory and on disk. We focus on in-memory representations, due to theirsignificance in the warm loop of the version switching workflow.

3Streaming processing is a special case of multi-version graph pro-

ing [13], multi-version algorithm design framework [26],and multi-version dataset management [5, 6]. 4

We study related work by asking three questions:• Does it provide high computation performance? In

particular, does it support fast access of the neigh-bors of a vertex?5

• Does it support fast version switching?• Does it store graphs and deltas compactly?

Graph. We study two common graph representa-tions: compressed sparse row (CSR), adopted in Pow-erGraph [11] and GraphX [12], and a vector-of-vectors(VoV) design, adopted in Giraph [1].

In CSR (cf. Figures 2b and 2c), all neighbors arepacked in an array. A pointer array maintains the addressof the first neighbor of each vertex. The set of neighborsfor vertex i is thus marked by the values of vertices i andi+1 in the pointer array. This representation enables fastaccess to a vertex’s neighbors. Version switching is slow,however. This is because modifying a vertex’s neighboraffects pointers and neighbors of all vertices followingthe one being modified.

As for VoV (cf. Figures 2d and 2e), the first-level vec-tor functions as the pointer array in CSR, locating theneighbors of a vertex according to the vertex id. Eachsecond-level vector represents the neighbors of a vertex.This format also supports fast neighbor access. In ad-dition, the neighbors of a vertex can be modified with-out affecting other vertices, thus enabling fast versionswitching. Its shortcoming is the memory overhead dueto maintaining auxiliary information, such as the startand end positions of each vertex’s neighbors.6

Delta. Previous work has used a compact log-formatstructure to represent deltas in streaming processing [10].A log delta consists of an array of log entries, each spec-ifying an edge via its source and destination vertex ids(and an optional edge id) and whether the edge shouldbe added or removed (i.e., an opcode). Log deltas are

cessing, where version switching is always forward and versions areonly processed once. They are insufficient for the general multi-versionscenario, where switching is arbitrary and a cached version is repeat-edly accessed by multiple algorithms.

4Such work investigates the tradeoff between storage space andrecreation speed of a dataset version, focusing on organizing versionson disk instead of in-memory data structure optimization.

5In this paper, we equate computation performance with neighboraccess efficiency for two reasons. First, computation related to graphalgorithms affects all systems in the same way and is out of scope.Second, in the computation stage, a system supports neighbor accessand vertex/edge data access. Assuming the storage of data in sequencecontainers and their identical impact on all systems, computation per-formance is determined only by neighbor access efficiency.

6Such overhead is non-trivial. For example, a 24-byte per-vectoroverhead (cf. Figure 2d, “start,” “end of contents,” and “end of storage”pointers each consume 8 bytes) amounts to a 40% overhead for repre-senting the entire out-neighbor array for the Amazon08 dataset [7, 8],assuming 4-byte vertex/edge ids.

Page 4: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

(a)

0 1

23

(b)

1 3 0 2 3 2

pointer array

neighbor array

dst_vid

src_vid

1 3 0 2 3 3 2

(c) (d)

1 3 …

pointer array

dst_vid

src_vid

startend of contentsend of storage

0 2 3 2

(e)

1 3 0 2 3 3 2

Figure 2: Graph representations: (a) illustrates two versions of a graph. A circle represents a vertex (vertex idinside) and an arrow represents an edge (edge id omitted). The first version consists of solid-arrow edges. Thesecond version has one more edge (illustrated by a dashed arrow). (b) and (c) demonstrate the CSR representa-tion of the out-edges of the two versions. (d) and (e) demonstrate the vector of vectors format. For clarity, eachelement in the neighbor array in (b)–(e) shows only the destination vertex id and omits the edge id.

compact and have no negative impact on the neighboraccess efficiency during graph processing. This is be-cause log deltas are conceptually applicable to all graphrepresentations as-is. During graph-delta integration, logdeltas are fully absorbed in the graph version. The cost ofgraph-delta integration is high, however, because all logentries in deltas relating the current and the next versionsneed to be applied during version switching.

Alternatively, a system could co-design graph anddelta representations to minimize the integration cost.For example, affected neighbor vectors of a VoV graphmay be copied and updated in a delta, reducing versionswitching to simple and fast vector pointer updates butlosing compactness. LLAMA [22] partially mitigates thecompactness loss by separating modified neighbors re-lated to a version into a dedicated consecutive area in theneighbor array, avoiding copying unmodified neighbors.CSR’s pointer array is transformed to a two-level transla-tion table. The first level consists of per-version indirec-tion tables, each bookkeeping a set of second-level pagesassociated with a version. A second-level page containsa series of vertex records—equivalent to a fragment ofCSR’s pointer array—with each record indicating thestart of a vertex’s neighbors.7 LLAMA’s version switch-ing incurs nearly zero time cost: conceptually, only anindirection table pointer needs to be updated. Its useof page-level copy-on-write for the second-level pagesholding vertex records, nevertheless, requires the copyof an entire page even if only one vertex in the page hasa modified neighborhood, hindering its compactness.

GraphPool [17] maintains the union of edges across allversions in the graph. Its deltas are per-version bitmapsover the graph’s edge array, where a bitmap’s n-th bit in-dicates the existence of the corresponding edge in thatversion. Version switching is simple: a bitmap pointer isadjusted to point to the next version. In the computation

7Neighbors belonging to the same vertex but stored in separateareas—each containing per-version modifications—are concatenatedvia continuation records such that only one start position needs to bemaintained for a vertex’s neighbors per version.

stage, however, this approach requires bitmap checkingfor determining whether an edge exists in the current ver-sion, incurring neighbor access penalty.

2.4 Design Dimensions and Challenges

Summarizing lessons learned from related work, wepoint out that the design of graph and delta must bal-ance between three dimensions: extensibility, compact-ness, and neighbor access efficiency.

Extensibility. Efficient version switching requires that adelta be easily integrated with a graph. From a data struc-ture perspective, it requires that the neighbors of a vertexbe easily extended to reflect the evolution from one ver-sion to another. This, in turn, requires that either the datastructure representing the neighbors of a vertex supportefficient modification (i.e., insertion and removal) or thecollection of the neighbors of a version be easily replacedby that of another version.

Compactness. Compact graph and delta representa-tions enable caching a large number of versions, lead-ing to low delta cache miss rate and high version switch-ing efficiency. Moreover, real-world large graphs com-monly have millions of vertices and millions or billionsof edges, making the compactness of the neighborhooddata structure a primary requirement.

Access Efficiency. A common and crucial operation dur-ing computation is to access a complete collection ofneighbors for a vertex. Fast neighbor access requires lim-iting the number of lookups in the integrated graph/deltadata structure. Ideally, only one lookup is sufficient forlocating the first neighbor of a vertex. The remainingneighbors can then be accessed sequentially.

The main design challenge is to carefully balance therequirements from the above three dimensions and co-design graph and delta representations such that they areextensible, compact, and efficient in neighbor access.Achieving the balance is difficult, as witnessed by exist-

Page 5: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

Delta Cache

srcvid

prev csrend

revert value

neighborschain

Hybrid CSR

dst vid

src vid

x1

delta indicator

offset

bit 31 bits 30-0

1 0 0 0

1 01 0

3-Version Graph

version 0

version 1

version 2

1 10 1000

0

0 2

0 2 3

0

2

3

1

Vertex Delta Array

Pointer Array

Neighbor Array

ModifiedNeighbor

Array

𝛿01

𝛿12

Delta Entry Format

Figure 3: Hybrid graph representation

ing designs, because those requirements commonly leadto contradicting design choices.

3 Version Traveler

We introduce Version Traveler (VT), a graph processingsystem that features a graph/delta co-design achievingcompactness, extensibility, and access efficiency. Resid-ing in the core of VT are two innovative components—a hybrid graph and a hybrid delta cache—bringing to-gether fast neighbor access and compactness of CSR andhigh extensibility of VoV.

3.1 Hybrid Graph

VT’s hybrid graph augments CSR in a way that achievesextensibility while remaining compact and efficient inneighbor access (cf. Figure 3). It avoids costly in-placemodification to CSR’s neighbor array by storing verticeswith a modified neighborhood in a version delta cache.

CSR’s neighbor array is created during the loading ofthe first version—also referred to as the root version—and then remains constant. Each subsequent version isloaded into the delta cache, in the form of a series of ver-tex delta entries. Each entry contains information relatedto the updated neighbors of a vertex, as well as meta-data to support neighbor access and version switching(cf. Figure 3). VT reserves a delta indicator bit in eachentry of CSR’s pointer array to indicate the placementof a vertex’s neighbors for the current version: in CSR’sneighbor array or in the vertex delta cache.

Neighbor Access. In a conventional CSR, neighbors ofvertex vid are bounded by the pointers of vertices vid andvid + 1. For VT’s hybrid CSR, neighbor access may bedirected to either CSR’s neighbor array or the neighborsfield of a delta entry, depending on whether the neigh-bors are stored (cf. access neighbors in Algorithm 1).Each delta entry maintains the end position of the pre-ceding vertex’s neighbors in CSR’s neighbor array (inthe prev csr end field), such that the end position of vidcan be determined regardless of whether neighbors of

Algorithm 1 Neighbor Access and Delta Application1: procedure ACCESS NEIGHBORS(vid)2: if csr ptrs[vid].in delta = true then3: return cache[csr ptrs[vid]].nbrs4: else if csr ptrs[vid +1].in delta = false then5: return csr nbrs[csr ptrs[vid], csr ptrs[vid +1]]6: else return csr nbrs[csr ptrs[vid],7: cache[csr ptrs[vid +1]].prev csr end]8: procedure DELTA APPLICATION(δi j,opcode)9: for e in δi j do

10: if opcode = apply then . apply δi j11: csr ptrs[e.src vid]← offset(e)12: csr ptrs[e.src vid].in delta← true13: else csr ptrs[e.src vid]← e.revert value . revert δi j

vid + 1 are stored in CSR’s neighbor array or the deltacache (cf. lines 4–7).

Delta Application and Reversion. VT performsversion switching by applying or reverting deltas(cf. delta application in Algorithm 1). When ap-plying δi j to switch from versions i to j, VT iteratesover entries belonging to δi j in the delta cache and, foreach entry, updates the corresponding entry (according tothe src vid field) in the CSR pointer array with the deltaentry’s offset in the delta array and sets the delta indi-cator bit. Reverting δi j consists of restoring the revertvalue field—which contains the saved value for versioni’s CSR pointer entry—to the corresponding entry in theCSR pointer array for each entry in δi j.

Example. We use a 3-version graph in Figure 3 to illus-trate neighbor access and version switching. The hybridCSR in Figure 3 represents the state of version 2. Thethree out-neighbors of vertex 1 can be accessed from itsdelta entry (δ12). For vertex 0, neighbor access requiresobtaining the start position from its CSR pointer, due toits cleared delta indicator bit, and the end position fromprev csr end of vertex 1’s delta entry. The difference be-tween the two is 0 (0− 0 = 0), indicating that vertex 0has no out-neighbors. In order to switch from versions2 back to 1, VT reverts δ12, which has only one entryrelated to vertex 1. Its revert value field, of which thedelta indicator bit is set and the offset is 0, is restoredto vertex 1’s CSR pointer entry. After reversion, vertex

Page 6: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

version 3

version 2

version 1

version 0

non-sharing

version 3version 2version 1version 0

sharing

Figure 4: Illustration of the concept of Sharing

srcvid

dstvid

eidnext vid

addition log format

typesrcvid

offsetdst vid eidlayer offset

removal log format

Figure 5: Delta log format

1’s CSR pointer entry will point to the first entry in thedelta array, which corresponds to vertex 1’s delta entryfor version 1.

3.2 Hybrid DeltaFor simplicity, in Section 3.1, we assume that a delta en-try maintains the entire neighbors of a vertex (cf. Fig-ure 3). This is memory-inefficient for vertices with alarge number of neighbors and small amount of per-version modifications, due to numerous redundant copiesof neighbors. To improve compactness, we propose twocomplementary solutions: Sharing and Chaining. Shar-ing preserves access efficiency and trades extensibilityfor compactness. Chaining preserves extensibility andtrades access efficiency for compactness.

3.2.1 Sharing

Concept. Sharing reduces the memory footprint bymerging a vertex’s delta entries spanning multiple ver-sions into one shared entry. Figure 4 shows an examplewith four versions of a vertex, each adding one neighborto its base. When they share a delta entry, there existsonly one neighbor vector, containing the neighbors of thevertex related to the current version being processed. Thechallenge is to compactly specify how the shared vectoris modified during version switching. VT maintains thisinformation in addition and removal delta logs.

Delta Representation. In the Sharing mode, VT doesnot create delta cache entries with copies of modifiedneighbor arrays. Instead, it creates log entries: speci-fying the neighbors it would have added to or removedfrom the neighbor array in an addition or removal log.Each entry in the addition log array (cf. Figure 5) con-tains the source and target vids of the added edge, aswell as the edge id. Logs associated with the same vertex(the source vid in the out-neighbor case) are continuouslystored. A next vid field indicates the start position of the

version 3

version 2

version 1

version 0

non-chaining chaining

CSR pointer

leap-over chaining

Figure 6: Illustration of the concept of Chaining

rm-sec indicator

unused

rm-sec pointer

rm-sec pointer

vid

eid

vid

eid

removal preamble new-neighbor section

layer

offset

layer

offset

removal section

Figure 7: Neighbor vector format

logs associated with a subsequent vertex. The format ofa removal log entry—gray fields apart—is similar. Itsoffset field refers to the offset within the neighbor arraywhere the removal should take place.

Neighbor Access. Sharing has no effect on neighbor ac-cess. When multiple versions of a vertex share a neigh-bor vector, the CSR pointer header points to the samedelta entry containing the vector for all versions.

Delta Application/Reversion. In the Sharing mode,version switching resorts to log-based delta applica-tion/reversion, similar to streaming processing [10, 19].During delta application, for an neighbor addition, theneighbor is simply appended to the end of the neighborvector. For a removal, VT removes the neighbor at theoffset according to the offset value in the removal log en-try. Delta reversion follows the inverse procedures.

3.2.2 Chaining

Concept. Chaining refers to the representation of a ver-tex’s neighbors with a chain of vectors, each containing asubset of neighbors and capturing the difference betweenthe version associated with it and its base version. In Fig-ure 6, with each version chained onto its base, only oneneighbor needs to be maintained per version. Redundantcopies are eliminated, improving compactness. Exten-sibility remains the same: to switch from versions 1 to2, for example, we need to adjust only the CSR pointerto version 2’s delta entry, regardless of whether that en-try’s neighbor vector is chained onto another. Access ef-ficiency decreases in Chaining, because of the need toswitch among multiple neighbor vectors. Chaining im-poses two new challenges to delta design: chaining be-yond the base, called Leap-Over Chaining, and removalfrom ancestors, called indirect removal.

Delta Representation. Leap-Over Chaining intends toaccelerate neighbor access. In Figure 6, with each ver-

Page 7: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

sion chained onto its base, the neighbor access for laterversions leads to considerable performance hit, limit-ing Chaining to a small set of adjacent versions in thegraph evolution relation. Leap-Over Chaining enablesthe chaining of a delta entry on an indirect ancestor ver-sion. For instance, version 3 in Figure 6 can be chainedonto version 0.

To support Chaining, in particular Leap-Over Chain-ing, we introduce a chaining field to the delta entry for-mat (cf. Figure 3). When Chaining is disabled,8 the entryis a standalone entry with a complete copy of neighbors.When Chaining is in use, VT saves a pointer to the en-try upon which the current one is based along the chainto the latter’s chaining field. Similar to CSR pointers, achaining pointer uses its most significant bit to indicatewhether the offset is for CSR or for the delta array.

To support indirect removal, a neighbor vector is di-vided into two sections: a new-neighbor section and aremoval section (cf. Figure 7). An element in the new-neighbor section represents a new neighbor added to thevertex in the current version. An element in the removalsection corresponds to a removed element, with layer in-dicating the neighbor vector in the chain where the re-moval should take place and offset the position of theto-be-removed neighbor within the vector. To improvecompactness, we overload the first element in the new-neighbor section: it is marked with a special flag if theremoval section exists, in which case the second elementcontains a pointer to the removal section;9 otherwise itcontains the first added neighbor.

Effect on Removal Log. Since Chaining introduces theseparation of new-neighbor and removal sections, Shar-ing’s removal log format needs to be adjusted (cf. grayfields in Figure 5). Specifically, a type field is added todifferentiate the two sections. A {layer, offset} pair ina removal section is stored similarly to a {vid, eid} pairin a new-neighbor section. During delta application, ifa removal takes place in the current vertex’s new-neigh-bor section, then the corresponding {vid, eid} pair is re-moved. Otherwise, the {layer, offset} pair is inserted tothe current vertex’s removal section. The inverse proce-dure achieves delta reversion.

Neighbor Access. Given a vertex, VT first locates itsentry via the CSR pointer header. If the neighbors arestored in a delta array entry whose Chaining mode isturned on, then VT iteratively accesses neighbors storedin entries along the chain, skipping neighbors that nolonger exist in the current version using the removal sec-tions. Otherwise, it follows the common neighbor accessprocedure, as described in Section 3.1.

Delta Application/Reversion. Except for Chaining’s ef-8That is, setting all bits in chaining to one.9The first two elements are also referred to as removal preamble.

fect on removal logs, both delta application and reversionfollow the description in Section 3.1.

3.2.3 Relationship between Chaining and Sharing

Chaining and Sharing are similar in that they both aim atreducing memory consumption by storing only the dif-ference among versions. Sharing is a good choice whenthe size of neighbors is large and the delta size is small. Alarge neighborhood leads to a considerable gain in com-pactness over a full-neighbor-copy approach, whereas asmall delta entails a moderate cost for log-based versionswitching. Chaining is useful when both the sizes ofneighbors and delta are large. Similar to Sharing, a largeneighbor size leads to a substantial gain in compactnessfor Chaining. A large delta entails Chaining’s superiorityto Sharing, due to the avoidance of the latter’s costly log-based version switching procedure.

Another way to compare the two is when the con-catenation of neighbors occurs. Chaining performs theconcatenation in a chain at the computation stage. Shar-ing performs the concatenation at the version switchingstage. Due to the different delta formats used in Chain-ing and Sharing, the concatenation in Chaining is lighter-weight than that in Sharing. As for the number of con-catenation performed for a vertex, the concatenation inChaining needs to take place when a vertex’s neighborsare accessed. The cost of concatenation in Chaining isthus magnified if a vertex is iteratively processed by analgorithm. The concatenation in Sharing is, in contrast,guaranteed to be once per vertex per version switching.

VT supports Sharing and Chaining as operationmodes, complementing the default full-neighbor-copymode (referred as Full mode). It enables them when theestimated cost of version switching and the potential im-pact on the computation stage are justified by the amountof memory saving. The current VT implementation sup-ports flexible threshold-based policies: when creating anew delta for a vertex, VT feeds the number of neigh-bors in its base version and the current delta size relatedto that vertex to a configurable policy arbitrator function,which determines the activation of Sharing or Chaining.

3.3 Implementation

We implement VT by integrating it with Power-Graph [11], replacing the latter’s graph representationwith VT’s hybrid CSR graph and delta/log arrays. VToperates seamlessly with PowerGraph’s computation en-gine layer, thanks to its full support of the same com-putation-stage graph abstraction viewed from an engine.This also demonstrates VT’s broad applicability to exist-ing graph processing systems.

Page 8: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

0 0.2 0.4 0.6 0.8

1 1.2 1.4

0 10 20 30 40 50 60 70 80 90 100

swit

chin

g (m

s)

number of δ

(a)FullChainingSharing

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

0 10 20 30 40 50 60 70 80 90 100

acce

ss (m

s)

number of δ

(c)

FullChainingSharing

0 5

10 15 20 25 30

0 10 20 30 40 50 60 70 80 90 100

mem

ory

(MB)

number of δ

(e)

FullChaining

Sharing

0 5

10 15 20 25 30 35

0 10 20 30 40 50 60 70 80 90 100

swit

chin

g (m

s)

number of δ

(b)FullChainingSharing

0 0.5

1 1.5

2 2.5

3

0 10 20 30 40 50 60 70 80 90 100

acce

ss (m

s)

number of δ

(d)FullChainingSharing

0

5

10

15

20

0 10 20 30 40 50 60 70 80 90 100

mem

ory

(MB)

number of δ

(f)

FullChainingSharing

Figure 8: Microbenchmark results (top row for additions and bottom row for removals)

4 Evaluation

We first demonstrate the three-way tradeoff among ex-tensibility, compactness, and access efficiency, showingthe relative advantage of Full, Chaining, and Sharing.We then compare the performance of VT against Pow-erGraph and several multi-version reference designs.

4.1 Microbenchmark

Design. The goal of microbenchmarking is to evaluatethe relative effectiveness of Full, Sharing, and Chainingin balancing the three-way tradeoff. Since the overalltradeoff on a graph is the accumulative effect of the sametradeoff on each vertex, we conduct microbenchmarkingfrom a vertex’s perspective. Trends in the microbench-mark results are applicable to varying graph sizes, giventhe accumulative nature of the per-vertex tradeoff.

We construct a graph with 1000 identical high-degreestars. For each star, only the center vertex has a non-empty set of out-neighbors—default to 1000. Each cen-ter vertex thus provides the opportunity for an in-depthstudy of the per-vertex tradeoff. To evaluate it in a multi-version scenario, we create two versions: a source ver-sion and a target version. The target version differs fromthe source by randomly adding or removing out-neigh-bors of center vertices.

Two key factors related to a delta are its size and theratio of additions to removals. Prior work shows that thedifference between consecutive versions is commonlywithin 1% of the graph size [26]. For each star, giventhe default 1000 edges in the base version, we vary thedelta size from 1 to 100, corresponding to 0.1% to 10%of the size of the star. The total vertices and edges in thegraph thus vary between 0.9 to 1.1 million. We also fixthe operation types in a delta: a delta consists of eitheredge additions or edge removals.

We evaluate extensibility by measuring the versionswitching time from the source version to the target ver-

sion, neighbor access efficiency by measuring the timefor iterating through all the out-neighbors of center ver-tices in the target version, and compactness by measuringthe memory used for maintaining the graph connectivityinformation of both versions. All measurements are con-ducted on a host with 8 3GHz vCPUs and 60GB memory.

Version Switching. Figures 8a and 8b compare the ver-sion switching performance. The performance of Fulland Chaining is comparable and remains constant, re-gardless of the edge modification types in deltas or thedelta size, because both approaches require adjustingonly CSR pointer values for center vertices. The costof Sharing linearly grows with the delta size, due to theneed to parse a log array whose size is proportional tothe delta size. Comparing edge additions with removals,the cost of the former is significantly lower than the lat-ter. This is because, additions translate to appendingneighbor records to the end of the neighbor vector andremovals involve data movement within the vector.

Access. Figures 8c and 8d compare the neighbor accessspeed. Full and Sharing perform equally well for bothadditions and removals. Since the cost of neighbor ac-cess is proportional to the neighborhood size, it linearlyincreases and decreases with the size of delta in the casesof addition and removal, respectively. Chaining leads tothe worst performance in both cases, due to its cost ofindirection. The cost is moderate in the case of edgeadditions, because there is one and only one indirectionduring neighbor access—that is, the switching from thenewly added neighbors to the existing ones—regardlessof the delta size. The cost of indirection becomes signif-icant for edge removals, because each removal separatesa previously continuous neighbor range into two, intro-ducing one more indirection during neighbor access. Thecost thus linearly grows with the delta size.

Memory. Figures 8e and 8f show the memory footprint.In both addition and removal cases, Chaining and Shar-ing lead to significant memory savings comparing with

Page 9: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

Table 1: Graphs, algorithms, and reference designs

dataset V (M) E (M) descriptionAmazon08 0.7 5.2 similarity among booksDblp11 1.0 6.7 scientific collaborationWiki13 4.2 101.4 English WikipediaLivejournal 5.4 79.0 friendship in LiveJournal social networkTwitter 41.7 1468.4 Twitter follower graphFacebook 0.1 1.6 friendship in regional Facebook networkGitHub 1.0 5.7 collaboration in software development

algorithm descriptionnop access neighbor and returnbipart max matching in a bipartite graphcc identify connected componentsPageRank compute rank of each vertexsssp single-source shortest pathtc triangle count

ref. design descriptioncsr use CSR graph and log deltalog use VoV graph and log deltabitmap maintain union of neighbors in all versions in VoV

graph and use bitmap deltam-array use multi-version-array graph/delta

Full. Intuitively, the cost of Full linearly grows with thedelta size in the addition case and linearly decreases inthe removal case. Our measurements, however, showmostly constant memory footprints in both cases, dueto (1) the capacity doubling effect and (2) no capacityreduction upon removal in the vector implementation inour testbed (glibc 2.15). For Chaining and Sharing, thememory footprint grows with the size of delta, regardlessof the type of edge modifications. This is because, forboth additions and removals, Chaining needs to maintainthe modifications either in the neighbor vector (for addi-tions) or in the removal section (for removals). Similarly,Sharing maintains the modifications in the log arrays.

4.2 Macrobenchmark

Reference Designs. We compare VT with Power-Graph [11]—a high-performance system targeting in-dividual graph processing—and four reference multi-version processing system designs (cf. Table 1) re-flecting different combinations of graph and delta for-mats. Specifically, we evaluate CSR+log, VoV+log,VoV+bitmap, and multi-version-array. They mirror de-sign choices made in PowerGraph, Giraph [1], Graph-Pool [17], and LLAMA [22], and are abbreviated to csr,log, bitmap, and m-array, respectively.

Workloads. Table 1 summarizes the datasets and al-gorithms. The Facebook [31] and GitHub graphs are

collected as dynamically evolving graphs. The remain-ing five graphs are collected as static graphs [7, 8], forwhich deltas need to be created. Since deltas among con-secutive versions are commonly within 1% of the graphsize [26], we vary the delta size from 0.01% to 1%. Weselect δ = 0.1% as a middle ground and show most of theevaluation results with this configuration. The total num-ber of cached versions n varies broadly from 1 to 100.Unless otherwise specified, we use uniform add-onlydeltas: each delta consists of edge additions uniformlydistributed over a graph. Graphs evolve linearly: versioni is created by iteratively applying δ j, j+1, j = 0 . . . i− 1to the root version (i.e., version 0). Version switching islocal, in that all versions are within the range of nδ fromthe root version. Version switching is arbitrary. That is,the next version j is selected independently to the currentversion i, may precede or succeed i, and do not need tobe consecutive to i.

Since machines with large memory and many coresbecome popular and affordable [25,29], VT’s evaluationfocuses on single-host setting. The elimination of inter-host communication cost in graph processing stage fur-ther highlights the effect of neighbor access efficiency.All measurements except those related to the Twittergraph [18] are performed on a host with 8 3GHz vCPUsand 60GB memory. Twitter-related workloads run on ahost with 32 2.5GHz vCPUs and 244GB memory.

Metrics. The requirements on extensibility, compact-ness, and access efficiency naturally lead to the use oftime and memory consumption as two basic metrics. Inaddition, inspired by the resource-as-a-service model inthe economics of cloud computing [4], we introduce apenalty function as a third metric: p = (ts + tc)α ×mβ .The penalty p is a function of the version switching timets, the computation time tc, and memory consumption m.α and β are weights associated with time and memoryresource. If the per-time-unit monetary cost is deter-mined only by memory consumption, then assigning 1to both parameters equates the penalty with the per-taskmonetary cost. We use α = 1 and β = 1 in our evalu-ation. When appropriate, we report penalty score p inthe form of utility improvement: the improvement of VTover a reference system ref is calculated as pre f−pvt

pre f.

Delta Preparation. For each system/workload setting,deltas corresponding to versions accessed in that work-load are populated in memory, according to the deltadesign employed by that system, before the start of theworkload. For VT, we employ a threshold-based policy(cf. Section 3.2.3), determining the delta format accord-ing to the number of neighbors in its source version andswitching from Full to Sharing to Chaining as the num-ber increases. We sample the threshold space for Full-Sharing and Sharing-Chaining transitions and report the

Page 10: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

0 2 4 6 8

powergraph powergraphwith cache

vt 0 0.5 1 1.5 2

tim

e (s

)

mem

ory

(GB)

timememory

Figure 9: Comparison of VT and PowerGraph

lowest penalty score.

Comparison with PowerGraph. We evaluate the per-formance of PowerGraph by running SSSP on Ama-zon08 with ten 0.1% δ s in two scenarios. First, we mea-sure the performance of PowerGraph as-is, with a graphversion loaded from persistent storage in its entirety atthe beginning of each task. Second, we augment Power-Graph with full-version caching, storing each version inthe working set as a full graph copy in memory.

Figure 9 shows that VT significantly outperformsPowerGraph in both scenarios. VT’s processing speed ison a par with that of PowerGraph with full-graph cach-ing and is 23x faster than that of PowerGraph withoutcaching. This is due mainly to PowerGraph’s substantialloading time when caching is disabled. VT’s memoryfootprint is close to that of PowerGraph without caching(incurring a 15% overhead) and is only 12% of that ofPowerGraph with full-graph caching—a 7.3x enhance-ment. Overall, VT improves utility by 86% and 95% overPowerGraph with and without caching, respectively.

Comparison with Multi-Version Designs. Figure 10compares four multi-version designs with VT, executingnop on Amazon08 with ten 0.1% δ s. Csr incurs pro-hibitive switching cost, due to CSR’s low extensibility.It, nevertheless, yields the highest performance and hasthe smallest memory footprint. Both log and bitmap con-sume more memory than VT. Bitmap incurs a computa-tion-stage penalty due to bitmap checking. Log’s switch-ing cost is 7.4x that of VT.

Regarding m-array, its neighbor access time and ver-sion switching time are significantly shorter than theother designs. Its memory consumption, however, ismuch higher than the other. It is important to note that m-array’s superior performance is an outcome of efficientimplementation of LLAMA, not a result of the multi-ver-sion-array design. This is because, after a version be-comes ready for processing, all things being equal, csrshould yield the highest neighbor access performance forthe nop workload. The difference between m-array andcsr is then due to the framework-related overhead: m-array is measured with LLAMA and csr—as well as theother designs—is measured with PowerGraph-based im-plementation. Had we ported m-array to PowerGraph,its performance would be at best on a par with csr, andthus also close to VT.

M-array’s high memory consumption is a result of themulti-version-array design. For Amazon08 with 0.1%δ s, each version contains 5.2K new edges. Uniformlydistributed, those edges affect 5.2K vertices’ neighbor-hood. In LLAMA, with a 16-byte vertex record10 and a4KB page, the entire vertex record array for the root ver-sion spans 2.7K pages, which is also the expected num-ber of pages affected when the 5.2K vertices with modi-fied neighborhood are uniformly distributed. This yieldsa 100% memory overhead in terms of per-version vertexrecord array—because the entire 2.7K pages containingthe root multi-version array need to be copied for eachversion—and a 21.5% overhead when the entire graphconnectivity structure (with neighbor arrays) is consid-ered. Such an overhead is prohibitively expensive forlarge graphs. In contrast, VT has a smaller footprintfor the root version and, more importantly, incurs onlya 0.6% per-version overhead for the graph connectivitystructure in its Chaining mode.

Figure 10 confirms our expectation on the advantagesand shortcomings of existing designs. Given csr’s lowextensibility and m-array’s high memory consumption,we focus on comparing VT with log and bitmap for therest of the evaluation.

Comparison with log and bitmap. Figure 11 summa-rizes the results comparing VT with log and bitmap, eachwith 10 δ s of size 0.1%. VT consistently outperformsboth systems in all but one case. Except the Twitter-SSSP workload, VT runs 2–17% faster in average per-version processing time and achieves 17–34% memorysaving and 19–40% utility improvement.

Running SSSP over the Twitter graph, VT runs 88%faster than log but 19% slower than bitmap. This is be-cause, given the size of the dataset, the configuration ofthe supporting hardware, and the characteristics of thealgorithm, the difference in version switching dominatesthe overall processing efficiency. Log falls far behindVT, due to the former’s need of log replaying during ver-sion switching. VT’s delta application, although efficientand highly parallelized, is still a heavier-weight opera-tion compared to bitmap. Combining time and memoryconsumption, the net effect is that VT outperforms log by90% and is on a par with bitmap in utility improvement.

Varying Deltas. We compare VT with log and bitmapby executing SSSP on Amazon08, varying the size ofdelta from 0.01% to 1% and the number of deltas from10 to 100. Fixing the delta size to 0.1% and varying thenumber of deltas from 10 to 100, we observe that VT’sutility gain remains high with respect to log and bitmap(cf. Figure 12a). Compared to log, VT’s memory saving

10A vertex record consists of version id, offset into the neighbor ar-ray, number of new edges for the current version, and an optional out-degree, each occupying 4 bytes.

Page 11: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

0

0.04

0.08

0.12

csr log bitmapm-array

vt

acce

ss (s

)

(a)

10-6

10-3

1

103

csr log bitmapm-array

vtswit

chin

g (s

)

(b)

0.1

1

10

csr log bitmapm-array

vtmem

ory

(GB)

(c)

Figure 10: Comparison with existing graph/delta designs

-20 0

20 40 60 80

100

amazon-sssp

dblp-sssplivejournal-sssp

wiki-sssptwitter-sssp

amazon-bipart

amazon-ccamazon-tc

amazon-pagerank

amazon-sssp

dblp-sssplivejournal-sssp

wiki-sssptwitter-sssp

amazon-bipart

amazon-ccamazon-tc

amazon-pagerank

(%)time-savingmem-savingutil-improve

bitmaplog

Figure 11: Comparison of VT with log and bitmap across all datasets and algorithms with 10 0.1% δ s

reduces with the increasing number of deltas, because thememory consumption of the delta cache grows with thenumber of deltas, gradually neutralizing the benefit of theuse of hybrid CSR. VT’s gain due to the reduction of ver-sion switching time increases with the number of deltas,however. Overall, with these opposite trends, VT’s utilitygain remains high. Compared to bitmap, VT’s memorysaving remains high, because of bitmap’s need to main-tain per-version bitmaps. VT’s saving in processing timereduces, however, because the impact of bitmap’s sav-ing in version switching time increases with the num-ber of deltas, compensating for bitmap’s neighbor-accessslowdown in the computation stage. The overall effect ofthese opposite trends is VT’s constantly high utility gainwith respect to bitmap across a wide range of versions.

Fixing the number of deltas to 10 and varying the sizeof delta from 0.01% to 1%, we observe that VT’s util-ity gain gradually reduces (cf. Figure 12b). Compared tolog, VT’s gain peaks at δ = 0.01%, thanks to its efficientgraph-delta representation. VT’s gains for δ = 0.1% andδ = 1% are similar: larger deltas reduce VT’s advantagein memory representation but amplify its reduction ofversion switching cost. Compared to bitmap, VT’s util-ity gain remains high for δ = 0.01% and δ = 0.1%, butdrops significantly for δ = 1%. Note that, the maximumdistances among versions—in terms of dissimilarity—are the same for 100 0.1% δ (in Figure 12a) and 10 1%δ (in Figure 12b). Yet, VT’s gain with respect to bitmapis much higher in the former case. This is because VT’smemory saving is more significant when bitmap needs tomaintain a larger number of per-version bitmaps in orderto track the neighbor-version relation.

Skewed and Add/Remove Workloads. Figure 13 com-pares VT’s performance across three types of workloads,all with ten 0.1% δ s. The first is a uniformly distributedadd-only delta type, same as those used throughout theevaluation. The second is a skewed add-only delta, in

which the probability of adding a new edge to a vertexis proportional to the latter’s degree in the root version.The third is a mixed add/remove delta type, with eachdelta maintaining a removals/total operations ratio vary-ing from 0.1% to 10%. VT consistently outperforms logand bitmap in all the three workloads.

Effectiveness of Optimization. Figure 14 summarizesthe effectiveness of Sharing and Chaining. Reusing theworkload of SSSP-Amazon with ten 0.1% δ s, we firstenforce a fixed delta format, measuring the performanceof Full, Sharing, and Chaining individually. We thencombine Full and one of the two optimization approachesand report the minimum achievable penalty. All resultsare then normalized to those of VT. The effectiveness ofSharing and Chaining is demonstrated by the superior-ity (in terms of penalty) of a combined delta preparationstrategy (e.g., Full-Sharing) to both approaches when ap-plied individually (e.g., Full and Sharing). It is alsodemonstrated by VT’s superiority—with all three deltaformats combined—to the five alternatives.

Realistic Evolving Workloads. We compare VT withlog and bitmap, using two 10-version graphs generatedfrom the evolving Facebook friendship and GitHub col-laboration graphs,11 respectively. Figure 15 shows theirevolution trends. Specifically, we choose 10 consecu-tive days towards the end of the collected periods for thetwo graphs12 and combine newly established friendship/collaboration relations in each day into a delta. Friend-ship/collaboration relations existing before that 10-day

11In the GitHub graph, the collaboration (i.e., edge) between twousers (i.e., vertices) is established when they start to work on at leastone shared repository. The initial state of the graph is set to empty. Itsevolution spans between March 2011 and July 2015. We generate thisgraph via the use of GitHub API [2] and GitHub Archive [3].

12For the Facebook graph, daily delta size reduces drastically to-wards the end of the collected period, which might be caused by limi-tations of the collection method. We avoid those anomalies when cre-ating the multi-version graph for evaluation.

Page 12: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

0

20

40

10 20 50 100 10 20 50 100

(%)

(a)number ofversions

time-savingmemory-savingutil-improve

bitmaplog

0

20

40

0.01 0.1 1.0 0.01 0.1 1.0

(%)

(b)δ size

time-savingmemory-savingutil-improve

bitmaplog

Figure 12: Varying number of δ s and δ size. (a) Fixing δ = 0.1% and varying number of versions between 10and 100. (b) Fixing number of versions to 10 and varying δ between 0.01% and 1.0%.

0

20

40

uni skewrm-0.1%

rm-1%rm-10%

uni skewrm-0.1%

rm-1%rm-10%

(%) time-savingmem-savingutil-improve

bitmaplog

Figure 13: Uniform, skewed, and add/rm deltas

95 100 105 110 115 120

full sharechain

full-sharefull-chain

vt

rati

o (%

) timememorypenalty

Figure 14: Effectiveness of optimization, with 100.1% δ s (normalized to VT)

period then form the root versions of the two graphs. Fig-ure 16 shows that VT outperforms both log and bitmapwhen executing SSSP over these two graphs, improvingutility by 10.38–24.83%.

4.3 Discussion: Locality Revisited

At the core of VT lies the concept of locality. The effec-tiveness of VT depends on the high version access local-ity in multi-version workloads. Quantifying the localityof version access patterns, nevertheless, is difficult. Inthis paper, we express locality in terms of a range nδ—defined by the number of deltas n and the size of deltaδ—within which arbitrary version switching takes place.We have shown that VT achieves superior performancefor a wide range of nδ configurations (cf. Figure 12),with respect to state of the art. Yet, VT’s performance,as well as its relative gain with respect to other systems,needs careful investigation for other access patterns.

For example, for workloads featuring high computa-tion-to-version-switching ratio and forward-only switch-ing, we expect either log or a single-version system toperform the best. For such workloads, the significanceof the computation-stage performance outweighs that ofversion switching. For example, when a large set of al-gorithms are applied to a loaded version, any versionswitching except the first one becomes a self-switchingoperation, incurring almost zero cost thanks to the mem-ory management of supporting operating systems. In ad-dition, forward-switching nullifies the need to preservethe graph representation of a version after it is processed.

0 1500 3000 4500 6000

0 500 1000 1500 2000

edge

s(i

n th

ousa

nd)

days

facebookgithub

Figure 15: Evolution trends of a regional Facebookfriendship graph and a GitHub collaboration graph

0 10 20 30

facebookgithub facebook

github

(%) time-savingmem-savingutil-improve

bitmaplog

Figure 16: Performance of VT, log, and bitmap onFacebook and GitHub graphs

Thus a system optimized purely for high computationperformance is favorable.13 To efficiently handle suchworkloads, VT needs to be extended to support directmodification to the CSR, bypassing the shadowing ef-fect of the delta cache. More importantly, the switchingbetween existing operating modes of VT and this newdirect modification mode, as well as other modes poten-tially devised in future work, requires a thorough inves-tigation of the switching policies.

5 Conclusions

In this paper, we conducted a systematic investigationof the caching design space in multi-version graph pro-cessing scenarios, decomposing it into three dimensions:neighbor access efficiency, extensibility, and compact-ness. Our solution, Version Traveler, balances require-ments from all three dimensions, achieving fast andmemory-efficient version switching. It significantly out-performs PowerGraph and is superior to four multi-version reference designs.

Acknowledgments

We thank the anonymous reviewers and our shepherd,Indranil Gupta, for their feedback.

13The relative merit of log to a single-version system is determined,in this case, by whether constructing the next version by modifying thecurrent one is less costly than building it from scratch.

Page 13: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

References

[1] Apache Giraph. http://giraph.apache.org.Retrieved in Apr. 2016.

[2] Github api. https://developer.github.com/

v3/. Retrieved in Apr. 2016.

[3] Github archive. https://www.githubarchive.

org/. Retrieved in Apr. 2016.

[4] AGMON BEN-YEHUDA, O., BEN-YEHUDA, M.,SCHUSTER, A., AND TSAFRIR, D. The resource-as-a-service (raas) cloud. In Proceedings of the 4thUSENIX Workshop on Hot Topics in Cloud Ccom-puting (Berkeley, CA, USA, 2012), HotCloud’12,USENIX Association.

[5] BHARDWAJ, A., BHATTACHERJEE, S., CHAVAN,A., DESHPANDE, A., ELMORE, A. J., MADDEN,S., AND PARAMESWARAN, A. G. DataHub: Col-laborative Data Science & Dataset Version Man-agement at Scale. In 7th Biennial Conference onInnovative Data Systems Research (2015), CIDR’15.

[6] BHATTACHERJEE, S., CHAVAN, A., HUANG,S., DESHPANDE, A., AND PARAMESWARAN, A.Principles of dataset versioning: Exploring therecreation/storage tradeoff. Proc. VLDB Endow. 8,12 (Aug. 2015), 1346–1357.

[7] BOLDI, P., ROSA, M., SANTINI, M., AND VI-GNA, S. Layered label propagation: A multireso-lution coordinate-free ordering for compressing so-cial networks. In Proceedings of the 20th interna-tional conference on World Wide Web (2011), ACMPress.

[8] BOLDI, P., AND VIGNA, S. The WebGraph frame-work I: Compression techniques. In Proc. of theThirteenth International World Wide Web Confer-ence (WWW 2004) (Manhattan, USA, 2004), ACMPress, pp. 595–601.

[9] BU, Y., BORKAR, V., JIA, J., CAREY, M. J., ANDCONDIE, T. Pregelix: Big(ger) graph analytics ona dataflow engine. Proceedings of the VLDB En-dowment 8, 2 (2014), 161–172.

[10] CHENG, R., HONG, J., KYROLA, A., MIAO, Y.,WENG, X., WU, M., YANG, F., ZHOU, L., ZHAO,F., AND CHEN, E. Kineograph: Taking the pulseof a fast-changing and connected world. In Pro-ceedings of the 7th ACM European Conference onComputer Systems (New York, NY, USA, 2012),EuroSys ’12, ACM, pp. 85–98.

[11] GONZALEZ, J. E., LOW, Y., GU, H., BICK-SON, D., AND GUESTRIN, C. Powergraph:Distributed graph-parallel computation on naturalgraphs. In Proceedings of the 10th USENIX Con-ference on Operating Systems Design and Imple-mentation (Berkeley, CA, USA, 2012), OSDI’12,USENIX Association, pp. 17–30.

[12] GONZALEZ, J. E., XIN, R. S., DAVE, A.,CRANKSHAW, D., FRANKLIN, M. J., AND STO-ICA, I. Graphx: Graph processing in a distributeddataflow framework. In 11th USENIX Symposiumon Operating Systems Design and Implementation(OSDI 14) (Broomfield, CO, Oct. 2014), OSDI ’14,USENIX Association, pp. 599–613.

[13] HAN, W., MIAO, Y., LI, K., WU, M., YANG, F.,ZHOU, L., PRABHAKARAN, V., CHEN, W., ANDCHEN, E. Chronos: A graph engine for temporalgraph analysis. In Proceedings of the 8th ACM Eu-ropean Conference on Computer Systems (2014),EuroSys ’14.

[14] HOQUE, I., AND GUPTA, I. Lfgraph: Simple andfast distributed graph analytics. In Proceedings ofthe First ACM SIGOPS Conference on Timely Re-sults in Operating Systems (New York, NY, USA,2013), TRIOS ’13, ACM, pp. 9:1–9:17.

[15] KANG, U., TSOURAKAKIS, C. E., AND FALOUT-SOS, C. Pegasus: A peta-scale graph miningsystem implementation and observations. In Pro-ceedings of the 2009 Ninth IEEE InternationalConference on Data Mining (Washington, DC,USA, 2009), ICDM ’09, IEEE Computer Society,pp. 229–238.

[16] KHAYYAT, Z., AWARA, K., ALONAZI, A.,JAMJOOM, H., WILLIAMS, D., AND KALNIS, P.Mizan: A system for dynamic load balancing inlarge-scale graph processing. In Proceedings of the8th ACM European Conference on Computer Sys-tems (New York, NY, USA, 2013), EuroSys ’13,ACM, pp. 169–182.

[17] KHURANA, U., AND DESHPANDE, A. Efficientsnapshot retrieval over historical graph data. InProceedings of the 2013 IEEE International Con-ference on Data Engineering (ICDE 2013) (Wash-ington, DC, USA, 2013), ICDE ’13, IEEE Com-puter Society, pp. 997–1008.

[18] KWAK, H., LEE, C., PARK, H., AND MOON, S.What is twitter, a social network or a news media?In Proceedings of the 19th International Confer-ence on World Wide Web (New York, NY, USA,2010), WWW ’10, ACM, pp. 591–600.

Page 14: IBM T.J. Watson Research Center Kang G. Shin Version ... · system balancing the above three requirements with two novel components, and the demonstration of VT’s superior performance

[19] KYROLA, A., BLELLOCH, G., AND GUESTRIN,C. Graphchi: Large-scale graph computation onjust a pc. In Proceedings of the 10th USENIX Con-ference on Operating Systems Design and Imple-mentation (Berkeley, CA, USA, 2012), OSDI’12,USENIX Association, pp. 31–46.

[20] LOW, Y., GONZALEZ, J., KYROLA, A., BICK-SON, D., GUESTRIN, C., AND HELLERSTEIN,J. M. Graphlab: A new parallel framework for ma-chine learning. In UAI (2010), UAI ’10, pp. 340–349.

[21] LU, Y., CHENG, J., YAN, D., AND WU, H. Large-scale distributed graph computing systems: An ex-perimental evaluation. Proceedings of the VLDBEndowment 8, 3 (2014).

[22] MACKO, P., MARATHE, V., MARGO, D., ANDSELTZER, M. LLAMA: Efficient graph analyticsusing large multiversioned arrays. In Proceedingsof the 2015 IEEE International Conference on DataEngineering (ICDE 2015) (Washington, DC, USA,April 2015), ICDE ’15, IEEE Computer Society.

[23] MALEWICZ, G., AUSTERN, M. H., BIK, A. J.,DEHNERT, J. C., HORN, I., LEISER, N., ANDCZAJKOWSKI, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010ACM SIGMOD International Conference on Man-agement of Data (New York, NY, USA, 2010),SIGMOD ’10, ACM, pp. 135–146.

[24] NGUYEN, D., LENHARTH, A., AND PINGALI, K.A lightweight infrastructure for graph analytics. InProceedings of the Twenty-Fourth ACM Symposiumon Operating Systems Principles (New York, NY,USA, 2013), SOSP ’13, ACM, pp. 456–471.

[25] PEREZ, Y., SOSIC, R., BANERJEE, A., PUT-TAGUNTA, R., RAISON, M., SHAH, P., ANDLESKOVEC, J. Ringo: Interactive graph analyticson big-memory machines. In Proceedings of the2015 ACM SIGMOD International Conference onManagement of Data (New York, NY, USA, 2015),SIGMOD ’15, ACM, pp. 1105–1110.

[26] REN, C., LO, E., KAO, B., ZHU, X., ANDCHENG, R. On querying historical evolving graphsequences. Proceedings of the VLDB Endowment4, 11 (2011), 726–737.

[27] ROY, A., MIHAILOVIC, I., AND ZWAENEPOEL,W. X-stream: Edge-centric graph processing usingstreaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating SystemsPrinciples (New York, NY, USA, 2013), SOSP ’13,ACM, pp. 472–488.

[28] SHEN, Y., CHEN, G., JAGADISH, H., LU, W.,OOI, B. C., AND TUDOR, B. M. Fast failurerecovery in distributed graph processing systems.Proceedings of the VLDB Endowment 8, 4 (2014).

[29] SHUN, J., AND BLELLOCH, G. E. Ligra: Alightweight graph processing framework for sharedmemory. In Proceedings of the 18th ACM SIG-PLAN Symposium on Principles and Practice ofParallel Programming (New York, NY, USA,2013), PPoPP ’13, ACM, pp. 135–146.

[30] TIAN, Y., BALMIN, A., CORSTEN, S. A.,TATIKONDA, S., AND MCPHERSON, J. From“think like a vertex” to “think like a graph”. Pro-ceedings of the VLDB Endowment 7, 3 (2013).

[31] VISWANATH, B., MISLOVE, A., CHA, M., ANDGUMMADI, K. P. On the evolution of user in-teraction in facebook. In Proceedings of the 2ndACM SIGCOMM Workshop on Social Networks(WOSN’09) (August 2009).

[32] ZHOU, C., GAO, J., SUN, B., AND YU, J. X.Mocgraph: Scalable distributed graph processingusing message online computing. Proceedings ofthe VLDB Endowment 8, 4 (2014).