
SimpleFlow: A Non-iterative, Sublinear Optical Flow Algorithm

Michael W. Tao, U.C. Berkeley

Jiamin Bai, U.C. Berkeley

Sylvain Paris, Adobe Inc.

Pushmeet Kohli, Microsoft Research

Abstract

We present a simple non-iterative algorithm for computing optical flow that is designed specifically for parallel architectures. Its key distinguishing characteristic is that, unlike conventional flow algorithms that depend on complex optimization, our method produces good results using only local averaging operations. This allows different regions of the image to be computed independently, which makes parallel computation efficient and simple. As a result, our algorithm is able to estimate optical flow for very high-resolution images (4096 x 2034) within seconds, which is not even possible for existing algorithms. Our algorithm also maintains a probabilistic estimate of the flow at every pixel instead of just a single vector. Furthermore, it allows aggressive subsampling, which results in a sublinear worst-case time complexity. We investigate the performance behavior of our algorithm on multicore CPUs and GPUs and provide an analysis of the hardware nuances that affect it.

1. Introduction

Optical flow describes the motion in a video, i.e. it describes how each point in the scene moves from one frame to the next [3]. It is an essential component of most video applications, e.g. [11, 16]. Several aspects make this problem particularly challenging: (1) Occlusions: points can appear or disappear between two frames. (2) Aperture problem: in regions of uniform appearance, local cues do not provide any information about the motion, and only partial flow information can be recovered along one-dimensional structures.

Traditional methods for optical flow operate by matching pixels from one frame to the next based on their color. However, since there are usually many pixels of the same color in a frame, color alone is not a discriminative feature and thus leads to erroneous results. To increase the discriminative power, researchers match blocks or groups of pixels [17, 13]. However, determining the size of the blocks is challenging because not all points within a block have the same displacement (optical flow) from one frame to the next.

The above-mentioned challenges associated with computing the optical flow have traditionally been addressed by propagating local evidence for particular flow values across the image, so that ambiguous regions get resolved by nearby corners and textured areas. This strategy is generally implemented using a probabilistic model such as a Markov Random Field (MRF) defined over the entire image, e.g. [10]. While the MRF formulation of the problem is principled and can produce accurate results, it is often impractical because of the difficulty of finding the maximum a posteriori (MAP) solution of the MRF.

Computing the MAP solution of the MRF involves minimizing an energy function defined over the flow variables, which is an NP-hard problem. However, many algorithms, such as graph cuts [4, 5], Loopy Belief Propagation (LBP) [26, 20], and Tree-Reweighted message passing (TRW) [14, 25], have been proposed in the literature for computing good approximate solutions. The runtime complexity of these algorithms is super-linear in the number of variables in the problem as well as in the number of labels each variable can take, which makes them extremely computationally expensive for optical flow.

Recently, efficient methods have been proposed, such as [27, 7, 12]. These techniques rely on local operations that can be performed on the GPU. To propagate information across the image domain, these schemes are iterative. While this may not be an issue with low-resolution footage, it can be a serious handicap when processing high-definition videos, because many iterations are required to reach distant points. Further, with these large datasets, processing every pixel becomes a costly operation.

In comparison, our algorithm, which we name SimpleFlow, is iteration-free. It runs in a classical multiscale fashion, and at each scale pixels are processed independently and only once. This makes our scheme highly data-parallel and ensures a linear complexity with respect to the number of pixels. The lack of a global optimization routine allows us to easily parallelize computation for a variety of parallel architectures. Furthermore, when we upsample the data in the multiscale pyramid, we can afford to interpolate some of the flow vectors from their neighbors instead of analyzing the frame data. While our algorithm's complexity remains linear for each pixel, in practice it exhibits sublinear behavior because interpolation is virtually free.


Figure 1. Panels: (a) frame Ft, (b) frame Ft+1, (c) without p smoothing, (d) with p smoothing, (e) with vector smoothing, (f) with refinements, (g) ground truth. With frames Ft (a) and Ft+1 (b), the flow vectors are noisy without smoothing (c). Using a simple bilateral filter on p produces smoother flow (d). By applying a weighted filter on the flow vectors, we achieve subpixel flow (e). Additional refinements, such as subpixel estimation, sublinear optimization, and multiscale processing, produce results (f) close to the ground truth (g).

However, at the cost of dropping computation and interpolating from neighboring pixels, we introduce branches into the instruction stream. This has varying impacts on different parallel architectures, which we investigate in later sections.

The design of our method is simple. We express the classical color-invariance assumption using probabilities derived from pixel-wise color differences. The key aspect of our approach is that we manipulate these probabilities instead of flow vectors [23]. With this representation, the smoothness of the flow field is enforced by locally averaging probabilities, and accounting for edges is achieved using an edge-aware scheme such as the bilateral filter [24, 19], which can be computed in linear time [18, 22, 28]. Finally, subpixel accuracy is achieved by fitting a quadratic model to the probability distribution.

Notation We consider two successive frames Ft and Ft+1. We use (x, y) for pixel positions and (u, v) for flow vectors; that is, we seek to estimate u and v at each pixel such that the scene point at (x, y) in Ft is visible at (x + u, y + v) in Ft+1. Although, strictly speaking, u and v depend on (x, y), for the sake of clarity we write (u, v) instead of (u(x, y), v(x, y)) when possible. We use Ft(x, y) to denote the RGB color of pixel (x, y) in Ft.

2. A Single-scale Algorithm

For the sake of clarity, we first describe a version of our algorithm that operates at a single scale.

Figure 2. Panels: (a) pixel from Ft, (b) pixels from Ft+1, (c) resulting p. We want to match the pixel from Ft (a) to a pixel in a pixel group from Ft+1 (b). The resulting p represents the probability that each pixel in (b) matches the pixel in (a): brighter areas represent closer matches, while darker areas represent poorer matches (c).

A simple likelihood model We follow the classical constant-color assumption; that is, we seek flow vectors (u, v) such that Ft(x, y) and Ft+1(x + u, y + v) are similar. We model this requirement by a simple energy term:

e(x, y, u, v) = \|F_t(x, y) - F_{t+1}(x + u, y + v)\|^2    (1)

that corresponds to the likelihood probability p ∝ exp(−e) [23]. This term is simple to compute: it is the squared difference between two RGB vectors. Further, it captures nontrivial information about the local structure of the video, such as the fact that edges suffer from a one-dimensional ambiguity (Fig. 2). p also naturally represents the reliability of the information that it embeds; for instance, if p is nearly uniform, it provides only weak evidence about the flow, whereas peaked distributions indicate reliable data. Because it is a point-wise computation, p often provides noisy and ambiguous information, but combining cues across pixels yields a rich representation of the flow.
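For concreteness, a minimal NumPy sketch of this likelihood computation is given below; the window radius r, the absence of border handling, and all identifiers are illustrative assumptions rather than the implementation described in this paper.

    import numpy as np

    def likelihood(Ft, Ft1, x, y, r):
        # Eq. 1 for one pixel of Ft against a (2r+1) x (2r+1) window of
        # Ft+1: squared RGB differences e, and matching probabilities
        # p proportional to exp(-e). Assumes the window lies inside the frame.
        window = Ft1[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
        e = np.sum((window - Ft[y, x].astype(np.float64)) ** 2, axis=-1)
        return e, np.exp(-e)

For 8-bit RGB values, one would typically rescale e before exponentiating to avoid underflow.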

Local validity as smoothness prior We also assume that the flow is locally smooth. However, we do not rely on a pairwise term such as \|(u_1, v_1) - (u_2, v_2)\|^2, as is often used in the literature [3]. The drawback of such pairwise terms is that they link the flow estimates at several pixels, generating dependencies between unknowns that make the optimization problem harder to solve. Instead, we require that a flow vector (u_0, v_0) at a pixel (x_0, y_0) be a good explanation of the motion at (x_0, y_0) as well as at the other pixels in a neighborhood N_0. A simple way to express this assumption is

(u_0, v_0) = \arg\max_{(u, v) \in \Omega} \prod_{(x, y) \in N_0} p(x, y, u, v)

where Ω is the set of possible (u, v) vectors that we consider. By pooling information from all the pixels in N_0, this yields a smoother and cleaner estimate of the flow (Fig. 1(c)). We can better understand this behavior in the negative log-likelihood (or energy cost) domain, where we try to find the lowest-cost flow vector for the pixel (x_0, y_0):

(u_0, v_0) = \arg\min_{(u, v) \in \Omega} \sum_{(x, y) \in N_0} e(x, y, u, v).

Our smoothness prior amounts to smoothing the likelihood term instead of adding a pairwise term. As a consequence, finding a minimizer remains simple: since there is no interaction between the unknowns, the solution at a pixel does not depend on the solution at its neighbors.

The box-filter approach does not differentiate between pixels within N_0. We add weights to account for how pixels relate to each other. In this paper, we apply two weighting functions: w_d for the distance between pixels and w_c for the color difference. We propose the following log-likelihood:

E(x_0, y_0, u, v) = \sum_{(x, y) \in N_0} w_d \, w_c \, e(x, y, u, v)    (2)

with

w_d = \exp(-\|(x_0, y_0) - (x, y)\|^2 / 2\sigma_d)
w_c = \exp(-\|F_t(x_0, y_0) - F_t(x, y)\|^2 / 2\sigma_c)

As shown in Figure 1(d), adding weights improves the results. Further, the weights correspond to the bilateral filter weights [19], which makes it possible to compute E with optimized schemes, e.g. [18, 8, 2, 1, 22, 28].
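The weighted aggregation of Eq. 2 can also be written directly as a double loop. This brute-force sketch costs O(|N_0| · |Ω|) per pixel, which is why the fast bilateral-filter schemes cited above matter in practice; names and the neighborhood radius rn are our assumptions.

    import numpy as np

    def smoothed_energy(Ft, e, x0, y0, rn, sigma_d, sigma_c):
        # Eq. 2: bilaterally weighted sum of the energies e over the
        # neighborhood N0. e has shape (H, W, |Omega|); returns the
        # |Omega|-dimensional vector E(x0, y0, ., .).
        E = np.zeros(e.shape[-1])
        f0 = Ft[y0, x0].astype(np.float64)
        for y in range(y0 - rn, y0 + rn + 1):
            for x in range(x0 - rn, x0 + rn + 1):
                wd = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2.0 * sigma_d))
                wc = np.exp(-np.sum((Ft[y, x] - f0) ** 2) / (2.0 * sigma_c))
                E += wd * wc * e[y, x]
        return E  # the flow estimate is the Omega entry minimizing E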

Detecting occlusions To detect occlusions, we compare the forward flow (uf, vf) from t to t + 1 and the backward flow (ub, vb) from t + 1 to t. Ideally, one should be the opposite of the other. One can either test this equality to get a binary detector or compute the difference ‖(uf, vf) − (−ub, −vb)‖ for a continuous estimator. When the computed difference is high, we conclude that the pixel is occluded between the frames.
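A sketch of this forward-backward consistency test follows; the nearest-neighbor sampling of the backward flow and the threshold are our simplifications.

    import numpy as np

    def occlusion_mask(fwd, bwd, thresh):
        # fwd[y, x] = (uf, vf) maps frame t -> t+1; bwd maps t+1 -> t.
        # A pixel is flagged occluded when the backward flow, sampled
        # where the pixel lands, fails to undo the forward flow.
        H, W, _ = fwd.shape
        ys, xs = np.mgrid[0:H, 0:W]
        # nearest-neighbor lookup of the backward flow at (x + uf, y + vf)
        xt = np.clip(np.rint(xs + fwd[..., 0]), 0, W - 1).astype(int)
        yt = np.clip(np.rint(ys + fwd[..., 1]), 0, H - 1).astype(int)
        diff = np.linalg.norm(fwd + bwd[yt, xt], axis=-1)
        return diff > thresh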

Practical implementation For each pixel (x_0, y_0), our prototype implementation computes e on n × n windows centered on (x_0, y_0); i.e., we compute the color difference between Ft(x_0, y_0) and every pixel in an n × n window in Ft+1. This produces an n²-dimensional vector e at each pixel. Then, we compute E by applying a bilateral filter on these e vectors, using the frame data Ft to define the color weights w_c. This is known as cross- or joint-bilateral filtering [9, 21] and can be sped up with optimized data structures, e.g. [18, 8, 2, 1]. Finally, we estimate the flow as the (u_0, v_0) vector that minimizes E. We found it useful to further regularize the result by applying a bilateral filter on the flow vectors. For this operation, we discard the occluded pixels and use the weights w_d and w_c with an additional weight w_r that represents how reliable our flow estimate is at (x, y):

w_r(x, y) = \operatorname{mean}_{(u, v) \in \Omega} e(x, y, u, v) - \min_{(u, v) \in \Omega} e(x, y, u, v)
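Given an energy volume e laid out as in the earlier sketches, w_r is a one-liner:

    import numpy as np

    def reliability(e):
        # w_r: gap between the mean and the minimum energy over Omega.
        # A peaked distribution (large gap) marks a confident estimate;
        # a flat one (small gap) marks an unreliable pixel.
        return e.mean(axis=-1) - e.min(axis=-1)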

Discussion This scheme is not efficient on its own, because we need to consider large n × n windows to be able to recover large motions. We address this issue in the next section with a multiscale approach. Representing the motion by the likelihood p is related to the work of Rosenberg and Werman [23]. While they already underscore the usefulness and richness of this probability distribution, they use it for tracking discrete points. In comparison, we focus on a dense flow field and use these probabilities in conjunction with edge-aware filtering and multiscale processing. Although we use the bilateral filter to aggregate local evidence about the flow, we believe that other edge-aware schemes would perform equivalently. We see the exploration of other options as a possible direction for future work. Most importantly, our algorithm is simple: a series of color differences followed by an edge-aware filter.

3. A Multiscale Sublinear Algorithm

This section describes how we make our simple algorithm much more efficient using an adaptive multiscale strategy. First, we show how to structure the algorithm described in the previous section in a multiscale fashion. We then describe how its runtime complexity can be made sublinear with respect to the frame resolution of the video.

3.1. Multiscale Flow Estimation

We construct an image pyramid for each frame, in which each level is coarser than the previous one by a factor of two; i.e., the ℓ-th level has a resolution 2^ℓ times lower than the original frame. At the coarsest level of the pyramid, we estimate the flow using the scheme described in Section 2. We now explain how to compute the flow at level ℓ, assuming the flow at level ℓ + 1 is known.
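A minimal pyramid constructor is sketched below; the 2 × 2 box averaging is our assumption, as the paper does not specify its downsampling filter.

    import numpy as np

    def build_pyramid(frame, levels):
        # Level l has 2**l times lower resolution than the original.
        pyr = [frame.astype(np.float64)]
        for _ in range(levels - 1):
            f = pyr[-1]
            h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2
            f = f[:h, :w]  # crop to even size before 2x2 box averaging
            pyr.append(0.25 * (f[0::2, 0::2] + f[1::2, 0::2] +
                               f[0::2, 1::2] + f[1::2, 1::2]))
        return pyr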

We generate an initial estimate (u_ℓ, v_ℓ) of the flow by upsampling the flow from the previous level (u_{ℓ+1}, v_{ℓ+1}) using joint-bilateral upsampling [15], followed by a multiplication by 2 to account for the change in resolution.


Figure 3. Panels: (a) frame Ft, (b) frame Ft+1, (c) optical flow result, (d) 480×270 optimization, (e) 960×540 optimization, (f) 1920×1080 optimization. Our sublinear scheme runs a full flow estimation only at a few key pixels at each scale. In this HD example (1920 × 1080), we use three scales (d, e, f). Black indicates where we run a full estimation. Brighter shades correspond to regions where we estimate the flow with linear interpolation on 2 × 2, 4 × 4, and 16 × 16 windows, respectively. As illustrated in this example, our scheme concentrates most of the computation in discontinuous areas, thereby drastically reducing the required computational effort (Fig. 4).

We then follow our standard flow computation process described in Section 2, with the only change that we now aggregate probabilities in a neighborhood N centered on (x_0 + u, y_0 + v). For the flow vector at (x_0, y_0), we select the (u_0, v_0) that minimizes the log-likelihood E(x_0, y_0, u, v).
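A sketch of the initialization step follows; nearest-neighbor resizing stands in here for the joint-bilateral upsampling [15] actually used, and the factor of 2 accounts for the change in resolution.

    import numpy as np

    def upsample_flow(flow, new_h, new_w):
        # Initial estimate for the finer level: resize the coarse flow
        # field (shape (h, w, 2)) and scale the vectors by 2.
        h, w, _ = flow.shape
        ys = np.minimum((np.arange(new_h) * h) // new_h, h - 1)
        xs = np.minimum((np.arange(new_w) * w) // new_w, w - 1)
        return 2.0 * flow[ys[:, None], xs[None, :]]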

[Figure 4: log-log plot of running time (seconds) versus number of pixels. Curves and fitted complexities: CPU sublinear (8 threads), O(N^0.91); CPU linear (8 threads), O(N^0.98); GPU sublinear, O(N^0.78); GPU linear, O(N^1.00); LDOF, O(N^1.63).]

Figure 4. Log-log plot of the running times with respect to the video size. Without our sublinear scheme, our algorithm exhibits linear complexity (in purple). Our sublinear scheme (in green) greatly reduces the running time and, most importantly, yields a practical complexity on the order of O(N^0.78). Note that the other methods exhibit linear or worse complexity.

3.2. A Sublinear Scheme

The pyramid scheme described in the previous section is a standard multiscale approach and has linear complexity with respect to the number of pixels¹. We now describe our adaptive multiscale strategy, which tries to isolate image regions where the flow changes slowly. For pixels in such regions, we can replace the computationally expensive energy minimization operation by a simple interpolation operation. Although this is not strictly speaking a sublinear scheme, since we still process each pixel, it behaves as one because interpolation has a negligible cost.

For each layer, we estimate a flow irregularity map that indicates where the flow is smooth and where it varies more. At each pixel (x_0, y_0), we compute the irregularity value as

\max_{(x, y) \in N_0} \|(u(x, y), v(x, y)) - (u(x_0, y_0), v(x_0, y_0))\|.

During upscaling, if this value is above a threshold τ, we run the full pipeline on the corresponding upscaled pixels. Otherwise, we compute the flow at the four corners of the patch and find the flow at the other pixels by interpolation: we use a planar fit on the four corners and compute flow values for the other pixels from the derived plane. We recursively find smooth flow regions as we upscale, which effectively yields constant-time computation to upsample these areas. We found that this four-corner scheme is fast and works well, but when more accuracy is desirable, one could instead estimate the flow at more points and fit a higher-order function to them. Evaluating this option is left as future work.
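The following sketch illustrates both halves of the decision: a dense irregularity map (computed with wrap-around borders for brevity) and the cheap path, where bilinear interpolation from the four corner flows stands in for the planar fit described above.

    import numpy as np

    def irregularity(flow, r):
        # Max flow deviation within an r-neighborhood of each pixel.
        H, W, _ = flow.shape
        out = np.zeros((H, W))
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                shifted = np.roll(flow, (dy, dx), axis=(0, 1))
                out = np.maximum(out, np.linalg.norm(shifted - flow, axis=-1))
        return out

    def fill_patch(corners, n):
        # Cheap path, taken when irregularity < tau: fill an n x n patch
        # from its four corner flows. corners has shape (2, 2, 2).
        t = np.linspace(0.0, 1.0, n)
        top = corners[0, 0] + t[:, None] * (corners[0, 1] - corners[0, 0])
        bot = corners[1, 0] + t[:, None] * (corners[1, 1] - corners[1, 0])
        return top[None, :, :] + t[:, None, None] * (bot[None, :, :] - top[None, :, :])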

¹The total number of variables in a pyramid with a scaling factor of r is equal to 1 + 1/r + 1/r² + ... = r/(r − 1). Since we reduce the resolution by half in both dimensions, we get a scaling factor of 4 (r = 4); thus, the pyramid size is 4/3 that of the image.


[Figure 5: pipeline diagram. Frames Ft and Ft+1 feed a forward-flow pass and a backward-flow pass; each pass iterates Generate Likelihood Model → Local Smoothness Prior → Argmax → Upscale, adds Subpixel Refinement at the final layer, and the two flows are combined in the Occlusion stage.]

Speed-up factor over the single-threaded CPU implementation:

                    Generate   Smoothness   Upscale   Subpixel   Total
    8 Threads CPU       4.42         3.18      4.78       5.06    4.62
    8 Cores CPU         6.97         3.88      6.06       7.48    5.95
    GPU                33.87         4.49    101.56      81.84   29.60

Figure 5. The different stages of our pipeline; the stages highlighted in yellow are parallelized. The table gives the speed-up factor of the 8-thread CPU implementation and of the GPU implementation over the single-threaded CPU implementation, for each stage and for the whole pipeline.

4. Parallel Architectures and Implementation

We investigated the performance behavior of our algorithm on multicore CPUs and GPUs. Targeting CPUs and GPUs is essential because the growth in the number of processing cores is a major trend, which our algorithm can leverage well thanks to its pixel-independent operations. Our algorithm is specifically designed for such parallel hardware because, in most stages, the result for each pixel can be computed independently of its neighbors. Since CPUs and GPUs have very different architectures, we examined the benefits of a fast general-purpose processor versus a group of slower processors. Summarizing the performance gain of our algorithm on high-definition videos, Figure 4 shows that, compared to one of the state-of-the-art optical flow algorithms, proposed by Brox and Malik [6], we scale much better, making flow estimation feasible on HD videos.

4.1. CPUs

We used OpenMP to implement multithreading on CPUs. OpenMP is a multi-platform shared-memory parallel programming API for C/C++. It lets us mark parallel loops and sections as independent, and it dynamically assigns jobs to threads. We tested our code on two different machines. The first is an Intel i7 920 at 2.67 GHz with 8 GB of RAM; this is a 4-core machine with Hyper-Threading, allowing 8 threads to run simultaneously. The second is a dual Intel Xeon E5520 at 2.27 GHz with 6 GB of RAM; this is an 8-core machine.

4.2. GPUs

We used Nvidia's CUDA 4.0 environment to run many threads on the GPU's cores. CUDA is a parallel computing architecture developed by Nvidia that allows general-purpose computation on their GPUs. The GPU we used is an Nvidia GTX 285 with 1 GB of RAM. It has 240 programmable cores clocked at 1.44 GHz, allowing up to 240 threads to run simultaneously. Threads are grouped together, and each group is called a block. The threads in each block are executed in lockstep; that is, if one thread stalls, all the other threads stall as well.

4.3. Parallel Implementation

As described in the earlier sections, our proposed algorithm has several stages. In Fig. 5, we illustrate the complete pipeline as well as the stages that are amenable to parallel processing.

Generate Likelihood Model When we generate the likelihood model for each pixel, we perform a point-wise comparison of each pixel in Ft against a neighborhood in Ft+1. Since the likelihood result for a pixel in Ft does not depend on any value computed in this stage, the result for every pixel can be computed in parallel.

Local Smoothness Prior In this stage, we smooth the likelihood term computed for a pixel in Ft by considering a neighborhood around it. Again, the result for each pixel does not depend on any result computed in this stage, thus allowing all pixels to be processed in parallel.

Argmax In this stage, we estimate the displacement that best explains the underlying motion in the scene. In the naive implementation, the computation can be parallelized. However, the sublinear scheme we propose skips the argmax in smooth regions and instead interpolates known values from other pixels; the result for each pixel therefore becomes dependent on the results of neighboring pixels, making this stage nontrivial to parallelize. We compute this stage without any parallelization because the computation is very light and insignificant compared to the other stages.

Occlusion To detect occlusions between frames, we simply compare the estimated displacements from the forward flow and the backward flow. If the flows are not opposites of each other, we conclude that an occlusion has occurred. This stage is very lightweight and fast. We chose not to compare the performance of this stage between the CPU and the GPU, because the GPU implementation would require transferring the data to GPU RAM; since the memory transfer itself would dominate the compute time, we decided against the comparison.

Upscale The estimated displacement for a smaller layer is upscaled to serve as an initial guess for the displacement at a larger layer. To compute this, we use joint-bilateral upsampling. Since the result for each pixel is independent of its neighbors, this stage can be computed in parallel.

Subpixel Refinement In this stage, we refine our final motion estimate by regularizing the output with a bilateral filter. Since the result of the bilateral filter for each pixel does not depend on any result computed in this stage, this stage can also be computed in parallel.

5. Results

We conduct a number of experiments to evaluate the performance of our method and its different variants. The data used for the experiments comprises a number of image pairs, including some from the Middlebury optical flow benchmark. The parameter values used were: σ_c = 25.5, σ_d = 4.1, τ = 10.0, N a 4 × 4 window, and Ω an 18 × 18 window.

Accuracy Figure 12 compares the accuracy obtained with and without multiscale, subpixel refinement, and sublinear upsampling. The RMSE is measured as the average L2 distance between the resulting vectors and the ground truth provided by the Middlebury dataset, over all pixels in the image. We can see that the results improve significantly with the proposed techniques. Without the refinements, errors propagate due to large displacements between frames and quantized motion vectors. Our sublinear method produces results similar to those of the linear-complexity algorithm.
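This metric, as defined above (a mean L2 endpoint distance, despite the RMSE name), is straightforward to compute:

    import numpy as np

    def flow_error(flow, gt):
        # Mean L2 distance between estimated and ground-truth flow
        # vectors over all pixels, as reported in Figure 12.
        return float(np.mean(np.linalg.norm(flow - gt, axis=-1)))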

Best Running Time We also verify that our complete scheme actually behaves sublinearly. Figure 4 shows a log-log plot of its running time for various resolutions, obtained by downsampling the same HD 1920x1080 video. In such a plot, the slope of the curve indicates the complexity of the algorithm. We measure a slope of 0.78, which means that, empirically, our algorithm has a complexity of O(N^0.78), with N the number of pixels in a frame. The typical running time for the Middlebury examples, at a typical resolution of 584x388, was 188 seconds; our best timing was achieved with the GPU implementation, at 1.326 seconds.
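The exponent is obtained by fitting a line in log-log space; the sketch below shows the procedure on made-up sample points, not on the measured timings.

    import numpy as np

    # runtime ~ C * N**k means log(t) = k*log(N) + log(C), so the
    # empirical complexity exponent k is the slope of a linear fit.
    N = np.array([3e4, 1e5, 3e5, 1e6, 3e6])   # placeholder pixel counts
    t = np.array([0.9, 2.1, 4.9, 12.0, 28.0])  # placeholder seconds
    k, logC = np.polyfit(np.log(N), np.log(t), 1)
    print(f"empirical complexity: O(N^{k:.2f})")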

6. Performance Analysis

In this analysis, the timings compared are obtained by computing optical flow over a pair of images of size 2048 × 1152. As expected, given the pixel-independent algorithm we propose, we observe significant speed-ups on parallel architectures. The results are summarized in the table in Fig. 5. We observe an overall speed-up of 4.62 times on the 8-thread, 4-core Intel i7 machine compared to a single-threaded implementation. For the 8-core Intel Xeon machine, we observe an overall speed-up of 5.95 times compared to a single-core implementation. For the GPU implementation, the overall speed-up is more significant, at 29.60 times over a single-threaded implementation. An interesting phenomenon is that, for the Local Smoothness Prior stage, the speed-up factor is significantly lower than for its counterparts, especially in the GPU implementation.

Caching. The reason the Local Smoothness Prior stage attains much lower speed-ups lies in the data it operates on: the probability map of each pixel. The array is arranged such that the probability map for each pixel is contiguous in memory; the probability map has 25 entries in our implementation. Since the smoothing at larger layers may access different displacements for neighboring pixels, cache locality suffers. The size of the probability map also decreases the number of probability maps that fit in cache, leading to more frequent cache eviction and thrashing. This phenomenon is especially prominent in the GPU implementation because of the smaller cache available to the threads.


[Figure 6: bar chart of running time (seconds, 0-160) broken down by stage (Subpixel Refinement, Upscale, Argmax, Local Smoothness Prior, Generate Likelihood Model) for three implementations: CPU sublinear with 1 thread, CPU sublinear with 8 threads, and GPU sublinear.]

Figure 6. Breakdown of timings for the different stages across three implementations. While most stages attain reasonable speed-ups, the Local Smoothness Prior does not scale as well.

Although caching poses a large problem for the Local Smoothness Prior subroutine, we maximized cache locality. In our design analysis, we had two variations of the code: in the first, for each pixel, we compute the smoothed entries for the entire probability map; in the second, for each entry in the probability map, we compute the smoothed entries for every pixel. Because of its contiguous blocks of access, the first variation improves our timing by a factor of four. This shows that the subroutine is very sensitive to cache hits and misses.

As suggested by Figure 5, the forward and backward flows needed to check for consistency and occlusions could be run in parallel. In practice, however, because of the limited cache shared by the threads, this causes more cache conflicts: running different operations simultaneously loses spatial and temporal cache locality. The total running time increased by 38 percent.
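The two loop orders are sketched below in Python purely to show the access patterns; the measured implementation is in C/C++, and the 5-point average and array names are our stand-ins.

    import numpy as np

    def smooth_pixel_major(p, out):
        # Variation 1: for each pixel, process its whole probability map.
        # Each step reads contiguous K-entry runs of the C-ordered
        # (H, W, K) array -- the cache-friendly order.
        H, W, K = p.shape
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                out[y, x] = (p[y, x] + p[y, x - 1] + p[y, x + 1] +
                             p[y - 1, x] + p[y + 1, x]) / 5.0

    def smooth_entry_major(p, out):
        # Variation 2: for each probability-map entry, sweep all pixels.
        # Consecutive reads are K elements apart, wasting cache lines.
        H, W, K = p.shape
        for k in range(K):
            for y in range(1, H - 1):
                for x in range(1, W - 1):
                    out[y, x, k] = (p[y, x, k] + p[y, x - 1, k] +
                                    p[y, x + 1, k] + p[y - 1, x, k] +
                                    p[y + 1, x, k]) / 5.0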

Sublinear behavior. The sublinearity enhancements propagated through upscaling let us heavily reduce computation. However, the proposed method introduces many branches into the instruction stream, which increases the average cycles per instruction. We see this trade-off in Figure 4, where the sublinear scheme reduces efficiency at smaller resolutions, especially in the CPU implementation: branching outweighs the reduction in computation when the reduction is small. Moreover, this poses a particularly large problem on GPUs, because threads within a block are executed in lockstep.

[Figure 7: bar chart of running time (seconds, 0-2.5) of the Local Smoothness Prior for each of the 8 blocks in the GPU implementation.]

Figure 7. Large variation in the timings of the Local Smoothness Prior across different blocks of the image in the GPU implementation. The variation is due to the different divergence behaviors of different regions of the image.

If a thread within a block stalls due to a mispredicted branch, the other threads are forced to stall as well. This behavior varies across regions of the image: regions with similar displacements tend to experience similar branching patterns, allowing fewer overall stalls for a block, while regions with differing displacements experience varying branching patterns that cause all the threads in a block to stall frequently. This is demonstrated in Figure 7, where different regions of the image show a large variation in performance due to the lockstep behavior.

Architectural Performance Comparisons. In Figures 8 and 9, we can see that our algorithm speeds up as threads and cores are added, and the speed-up scales well with the number of threads. An interesting comparison lies in the switch from four cores running eight threads to eight physical cores running eight threads. As seen in Figure 5, our algorithm benefits modestly from the additional core resources: thanks to the increased L1 cache, all stages of our pipeline see a small improvement. As discussed previously, the smoothness stage gains less because of its discontinuous memory accesses with large strides. Because the smoothness prior consumes the largest amount of time, we see only modest overall gains compared to using four cores; optimizing the smoothness prior is a priority for future work, as it is a large limiting factor. Nevertheless, the graphs also show that, for every core count, the running time scales sublinearly with the number of pixels, making our algorithm powerful for higher-resolution images.

We quantitatively assess the performance of the CPU against the GPU by dividing the speed-up ratio by the ratio of each hardware metric, normalized to the CPU.


[Figure 8: log-log plot of running time (seconds) versus number of pixels for the sublinear CPU implementation with 1 to 8 threads on the Intel i7 machine.]

Figure 8. Log-log plots of timings as we scale the image size for various numbers of threads on the Intel i7 machine. The timings grow monotonically with image size, and the marginal benefit of additional threads decreases, because the machine has only 4 physical cores.

The timings are for computing optical flow on a 2048 × 1152 image. The total energy consumption is 4140 joules for the CPU and 910 joules for the GPU; this is not surprising, as the GPU was 6.4 times faster for the computation. Taking the number of transistors into consideration, the GPU performs 3.3 times better than the CPU. This is expected, as the GPU is built specifically for parallel computation, while the CPU is a general-purpose processor. If we consider the number of cores, the CPU performs 10 times better than the GPU. The GPU also achieves a similar performance advantage, of 3.6 times over the CPU, when we take the die area into consideration.
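These ratios can be re-derived directly from the specifications in Figure 10:

    # Re-deriving the normalized ratios of Fig. 11 from the Fig. 10 specs
    # (timings in seconds, power in watts).
    cpu = {"time": 31.844, "power": 130, "transistors": 731e6,
           "cores": 4, "area": 263.0}
    gpu = {"time": 4.977, "power": 183, "transistors": 1400e6,
           "cores": 240, "area": 470.0}

    speedup = cpu["time"] / gpu["time"]                                   # ~6.4
    energy_cpu = cpu["time"] * cpu["power"]                               # ~4140 J
    energy_gpu = gpu["time"] * gpu["power"]                               # ~910 J
    per_transistor = speedup / (gpu["transistors"] / cpu["transistors"])  # ~3.3
    per_core = speedup / (gpu["cores"] / cpu["cores"])                    # ~0.1
    per_area = speedup / (gpu["area"] / cpu["area"])                      # ~3.6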

7. Discussion and Conclusion

We presented a simple sublinear method for optical flow. A key property of our approach is that we do not resort to optimization to propagate local information across the image. Instead, we average local probability distributions computed from standard color differences. The fact that we recover an accurate flow field from such simple cues suggests that they actually contain more information about the scene than their simplicity suggests at first. The local aspect of our scheme is also the key component that enables sublinear computation. We believe that this property is critical for processing high-resolution footage

[Figure 9: log-log plot of running time (seconds) versus number of pixels for the sublinear CPU implementation with 1 to 8 cores on the Intel Xeon machine.]

Figure 9. Log-log plots of timings as we scale the image size for various numbers of cores on the Intel Xeon machine. The timings grow monotonically with image size, and the marginal benefit of additional cores decreases. However, because this machine has 8 physical cores, its performance is better than 8 threads on the Intel i7 machine with simultaneous multithreading.

                                        CPU     GPU
    Timing (seconds)                 31.844   4.977
    Power consumption (watts)           130     183
    Number of transistors (millions)    731    1400
    Number of cores                       4     240
    Area (mm²)                          263     470

Figure 10. Specifications for the Intel i7 processor and the Nvidia GTX 285. The timings are for computing optical flow on a 2048 × 1152 image.

                                        CPU     GPU
    Speed-up normalized to CPU          1.0     6.4
    Energy consumption (joules)        4140     910
    Performance per transistor          1.0     3.3
    Performance per core                1.0     0.1
    Performance per area                1.0     3.6

Figure 11. Performance normalized to the CPU, computed as the ratio of timings divided by the ratio of each metric. For example, the GPU delivers 3.3 times the performance of the CPU per transistor.

such as HD videos and movie sequences in their original format. Since local operations are used only at the compute-intensive stages of our approach, the algorithm is massively parallelizable, allowing performance benefits that scale well with computational power.


[Figure 12: flow results for three Middlebury sequences (columns), with per-panel RMSE:
    (a-c) single-scale, no subpixel: .855 / 2.07 / .995;
    (d-f) single-scale with subpixel: .751 / 2.03 / .993;
    (g-i) multiscale, subpixel, linear: .121 / .517 / .443;
    (j-l) multiscale, subpixel, sublinear: .119 / .519 / .423;
    (m-o) ground truth.]

Figure 12. Results on datasets from the Middlebury benchmark [3]. Using a single scale yields inaccurate results because we do not capture large motions (first row). Using large search windows would address this issue, but at a prohibitive computational cost. Adding subpixel refinement improves the results (second row), but the accuracy remains limited because of the large motions. Our multiscale approach (third row) captures these motions and greatly improves the results. Using our sublinear scheme (fourth row) yields similar accuracy while speeding up the computation by a factor of 10 (Fig. 4).


We explored multicore CPUs and GPUs for our algorithm and found that the speed-up factor on the CPU depends largely on the number of cores. Although the speed-up factor on the GPU is much greater than on the CPU, lockstep execution within blocks severely limits the potential speed-up. One interesting direction to pursue is to detect threads with similar branching behavior and group them into the same block, maximizing computational efficiency. Because our algorithm's resource profile suits GPUs well, GPUs scale very well with it, making the algorithm behave even more sublinearly. Our speed-up potential is enormous, since we are using a GPU that is two generations old.

References

[1] A. Adams, J. Baek, and A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 2010. Proceedings of the Eurographics conference.

[2] A. Adams, N. Gelfand, J. Dolson, and M. Levoy. Gaussian KD-trees for fast high-dimensional filtering. ACM Transactions on Graphics, 28(3), 2009. Proceedings of the ACM SIGGRAPH conference.

[3] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski. A database and evaluation methodology for optical flow. In Proceedings of the International Conference on Computer Vision, 2007.

[4] E. Boros and P. Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 2002.

[5] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 2001.

[6] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. PAMI, 2010.

[7] A. Bruhn, J. Weickert, T. Kohlberger, and C. Schnorr. A multigrid platform for real-time motion computation with discontinuity-preserving variational methods. International Journal of Computer Vision, 70(3), 2006.

[8] J. Chen, S. Paris, and F. Durand. Real-time edge-aware image processing with the bilateral grid. ACM Transactions on Graphics, 26(3), 2007. Proceedings of the ACM SIGGRAPH conference.

[9] E. Eisemann and F. Durand. Flash photography enhancement via intrinsic relighting. ACM Transactions on Graphics, 23(3), July 2004. Proceedings of the ACM SIGGRAPH conference.

[10] B. Glocker, H. Heibel, N. Navab, P. Kohli, and C. Rother. TriangleFlow: Optical flow with triangulation-based higher-order likelihoods. In Proceedings of the European Conference on Computer Vision, 2010.

[11] D. B. Goldman, C. Gonterman, B. Curless, D. Salesin, and S. M. Seitz. Video object annotation, navigation, and composition. In Proceedings of the ACM Symposium on User Interface Software and Technology, 2008.

[12] P. Gwosdek, H. Zimmer, S. Grewenig, A. Bruhn, and J. Weickert. A highly efficient GPU implementation for variational optic flow based on the Euler-Lagrange framework. In Proceedings of the Workshop for Computer Vision with GPUs, 2010.

[13] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17, 1981.

[14] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568-1583, 2006.

[15] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Transactions on Graphics, 26(3), 2007. Proceedings of the ACM SIGGRAPH conference.

[16] D. Kuettel, M. D. Breitenstein, L. Van Gool, and V. Ferrari. What's going on? Discovering spatio-temporal dependencies in dynamic scenes. In Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE, 2010.

[17] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the Imaging Understanding Workshop, 1981.

[18] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision, 2009.

[19] S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 2009.

[20] J. Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3):241-288, 1986.

[21] G. Petschnigg, M. Agrawala, H. Hoppe, R. Szeliski, M. Cohen, and K. Toyama. Digital photography with flash and no-flash image pairs. ACM Transactions on Graphics, 23(3), July 2004. Proceedings of the ACM SIGGRAPH conference.

[22] F. Porikli. Constant time O(1) bilateral filtering. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 2008.

[23] Y. Rosenberg and M. Werman. Representing local motion as a probability distribution matrix applied to object tracking. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 1997.

[24] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proceedings of the International Conference on Computer Vision, pages 839-846. IEEE, 1998.

[25] M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697-3717, 2005.

[26] Y. Weiss and W. Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 2001.

[27] M. Werlberger, T. Pock, and H. Bischof. Motion estimation with non-local total variation regularization. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 2010.

[28] Q. Yang, K.-H. Tan, and N. Ahuja. Real-time O(1) bilateral filtering. In Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE, 2009.