
Flink Meetup #8

Data flow vs. procedural programming: How to put your algorithms into Flink

June 23, 2015
Mikio L. Braun (@mikiobraun)


Programming how we're used to

● Computing a sum
● Tools at our disposal:
  – variables
  – control flow (loops, if)
  – function calls as the basic piece of abstraction

def computeSum(a):
    sum = 0
    for i in range(len(a)):
        sum += a[i]
    return sum


Data Analysis Algorithms

Let's consider centering: subtract the mean x̄ = (1/n) Σᵢ xᵢ from every data point. In code this becomes

def centerPoints(xs):
    sum = xs[0].copy()
    for i in range(1, len(xs)):
        sum += xs[i]
    mean = sum / len(xs)
    for i in range(len(xs)):
        xs[i] -= mean
    return xs

or, with NumPy, even just

xs - xs.mean(axis=0)


Don't use for-loops

● Put your data into a matrix
● Don't use for loops


Least Squares Regression

● Compute w = (X'X + λI)⁻¹ X'y
● Which becomes

def lsr(X, y, lam):
    d = X.shape[1]
    C = X.T.dot(X) + lam * np.eye(d)
    w = np.linalg.solve(C, X.T.dot(y))
    return w

What you learn is thinking in matrices: breaking down computations in terms of matrix algebra.


Basic tools

● Basic procedural programming paradigm
● Variables
● Ordered arrays and efficient functions on those

Advantages:
  – very familiar
  – close to the math
Disadvantages:
  – hard to scale


Parallel Data Flow

Often you have stuff like

for i in someSet:
    map x[i] to y[i]

which is inherently easy to scale.
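For illustration, a minimal sketch in plain Scala (not Flink yet; the data is made up): the loop body depends only on the current element, so it can be written as a map and parallelized without changing the logic.

val xs = Vector(1.0, 2.0, 3.0, 4.0)

val ys    = xs.map(x => x * x)      // the loop, written as a map
val ysPar = xs.par.map(x => x * x)  // the same map on a parallel collection (Scala 2.11/2.12 standard library)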


New Paradigm

● Basic building block is an (unordered) set
● Basic operations are inherently parallel


Computing, Data Flow Style

Computing a sum:

sum(xs) = xs.reduce((x, y) => x + y)

Computing a mean:

mean(xs) = xs.map(x => (x, 1))
             .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
             .map(xc => xc._1 / xc._2)
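As a sanity check, the same idea can be tried on a plain Scala collection (a List standing in for the distributed set; the example values are made up), which makes the (value, count) bookkeeping concrete:

val xs = List(1.0, 2.0, 3.0, 4.0)

val sum = xs.reduce((x, y) => x + y)                          // 10.0

val (total, count) = xs.map(x => (x, 1))
                       .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val mean = total / count                                      // 2.5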


Apache Flink

● Data Flow system
● Basic building block is a DataSet[X]
● For execution, sets up all computing nodes and streams the data through them


Apache Flink: Getting Started

● Use the Scala API
● Minimal project with Maven (build tool) or Gradle
● Use an IDE like IntelliJ
● Always import org.apache.flink.api.scala._
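A minimal end-to-end sketch of such a project's job class (the object name and example data are assumptions; it assumes a Flink 0.9-era Scala API on the classpath):

import org.apache.flink.api.scala._

// Minimal Flink job: compute the sum of a small DataSet and print it.
object SumJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val xs: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0, 4.0)

    val sum = xs.reduce((x, y) => x + y)

    // In Flink 0.9+, print() also triggers execution of the data flow;
    // older versions need an explicit env.execute().
    sum.print()
  }
}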


Centering (First Try)

def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)

def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.map(x => x - mean)
}

You cannot nest DataSet operations!


Sorry, restrictions apply.

● Variables hold (lazy) computations
● You can't work with sets within the operations
● Even if the result is just a single element, it's a DataSet[Elem]
● So what to do?
  – cross joins
  – broadcast variables


Centering (Second Try)

Works, but seems excessive because the mean is copied to each data element.

def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)

def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.crossWithTiny(mean).map(xm => xm._1 - xm._2)
}


Broadcast Variables

● Side information sent to all worker nodes
● Can be a DataSet
● Gets accessed as a Java collection


class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
    extends RichMapFunction[T, O] {

  var broadcastVariable: B = _

  @throws(classOf[Exception])
  override def open(configuration: Configuration): Unit = {
    broadcastVariable = getRuntimeContext
      .getBroadcastVariable[B]("broadcastVariable")
      .get(0)
  }

  override def map(value: T): O = {
    fun(value, broadcastVariable)
  }
}
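The slides call a mapWithBcVar helper without showing its definition. A minimal sketch of how such a helper could be wired up, assuming an implicit class (the object and class names here are invented for illustration) that plugs the BroadcastSingleElementMapper above into Flink's withBroadcastSet mechanism:

import scala.reflect.ClassTag
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._

object DataSetExtensions {
  // Hypothetical helper: attaches the second DataSet as the broadcast variable
  // that BroadcastSingleElementMapper reads in open().
  implicit class RichDataSetOps[T](ds: DataSet[T]) {
    def mapWithBcVar[B, O: TypeInformation: ClassTag](bcVar: DataSet[B])
                                                     (fun: (T, B) => O): DataSet[O] =
      ds.map(new BroadcastSingleElementMapper[T, B, O](fun))
        .withBroadcastSet(bcVar, "broadcastVariable")
  }
}

With import DataSetExtensions._ in scope, the centering and regression snippets below read as written.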



Centering (Third Try)

def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)

def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.mapWithBcVar(mean)((x, m) => x - m)
}


Intermediate Results pattern

In procedural code:

x = someComputation()
y = someOtherComputation()
z = someComputationOn(dataSet, x)
result = moreComputationOn(y, z)

In data flow code, intermediate results are passed in as broadcast variables:

val x = someDataSetComputation()
val y = someOtherDataSetComputation()

val z = dataSet.mapWithBcVar(x)((d, x) => …)

val result = anotherDataSet.mapWithBcVar((y, z)) { (d, yz) =>
  val (y, z) = yz
  …
}


Matrix Algebra

● No ordered sets per se in Data Flow context.


Vector operations by explicit joins

● Encode the vector (a1, a2, …, an) as the set {(1, a1), (2, a2), …, (n, an)}
● Addition:

a.join(b).where(0).equalTo(0)
 .map(ab => (ab._1._1, ab._1._2 + ab._2._2))

after the join: {((1, a1), (1, b1)), ((2, a2), (2, b2)), …}
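A minimal runnable sketch of this join-based addition, with small example vectors assumed for illustration:

import org.apache.flink.api.scala._

object VectorAdd {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Vectors a = (1, 2, 3) and b = (10, 20, 30), encoded as (index, value) pairs.
    val a = env.fromElements((1, 1.0), (2, 2.0), (3, 3.0))
    val b = env.fromElements((1, 10.0), (2, 20.0), (3, 30.0))

    // Join on the index, then add the values component-wise.
    val sum = a.join(b).where(0).equalTo(0)
               .map(ab => (ab._1._1, ab._1._2 + ab._2._2))

    sum.print()   // e.g. (1,11.0), (2,22.0), (3,33.0)
  }
}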


Back to Least Squares Regression

Two operations: computing X'X and X'Y

def lsr(xys: DataSet[(DenseVector, Double)]) = {
  val XTX = xys.map(xy => xy._1.outer(xy._1)).reduce(_ + _)
  val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)

  XTX.mapWithBcVar(XTY) { (xtx, xty) =>
    val weight = xtx \ xty
    weight
  }
}


Summary and Outlook

● Procedural vs. Data Flow
  – basic building blocks: elementwise operations on unordered sets
  – operations can't be nested
  – combine intermediate results via broadcast vars
● Iterations (see the sketch below)
● Beware of TypeInformation implicits.
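Iterations are only listed as an outlook item here; a minimal sketch of Flink's bulk iteration API (the example data and the halving step function are assumptions):

import org.apache.flink.api.scala._

object IterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val start: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0)

    // The step function maps the current DataSet to the one fed into the next round;
    // here it just halves every element, repeated ten times.
    val result = start.iterate(10) { current =>
      current.map(x => x / 2.0)
    }

    result.print()
  }
}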