"Making OpenCV Code Run Fast," a Presentation from Intel

Copyright © 2017 Intel Corporation

Vadim Pisarevsky, Software Engineering Manager, Intel Corp.

May 2017

Making OpenCV Code Run Fast


OpenCV at a glance

What                     The most popular computer vision library: http://opencv.org
License                  BSD
Supported languages      C/C++, Java, Python
Size                     >950 K lines of code
SourceForge statistics   13.6 M downloads (does not include GitHub traffic)
GitHub statistics        >7500 forks, >4000 patches merged during 6 years
                         (~2.5 patches per working day before Intel,
                         ~5 patches per working day at Intel)
Accelerated with         SSE, AVX, NEON, IPP, MKL, OpenCL, CUDA,
                         parallel_for_, OpenVX, Halide (planned)
Current versions         2.4.13.2 (2016 Dec), 3.2 (2016 Dec)
Upcoming releases        2.4.14 (2017), 3.3 (2017 Jun)


OpenCV, CV & Hardware Evolution 2000 => 2017

OpenCV
  2000: OpenCV 1.0 alpha; C API, 1 module, Windows
  2017: OpenCV 3.2; C++ API; 30+30 modules, Windows/Linux/Android/iOS/QNX, etc.

CPU
  2000: 32-bit single-core, ~1 GFlop
  2017: 32/64-bit many-core, 300+ GFlops, ~100 GFlops in a cellphone!

GPU as accelerator
  2000: -
  2017: OpenCL, CUDA; 0.5-1+ TFlops

Other accelerators
  2000: FPGA (manually coded)
  2017: OpenCL-capable FPGA, various DSPs, etc.

Vision algorithms
  2000: Traditional vision, simple image processing, detection & tracking, contours; “empirical, low-profile computer vision”
  2017: Sophisticated traditional vision, 3D vision, computational photography, deep learning, hybrid algorithms; “learning-based, extensive computer vision”

Cameras, sensors
  2000: Analog surveillance cameras (recording only), webcams
  2017: Computer vision in every cellphone, every street crossing, every mall, coming to every car; 3D sensors, lidars, etc.

Computing model
  2000: Desktop
  2017: Edge, Cloud, Fog; Desktop for R&D only


OpenCV Acceleration Options

[Diagram: acceleration paths arranged around OpenCV and the OpenCV HAL]

• CUDA modules
• OpenVX (immediate mode): OpenCV optimized for custom hardware
• Universal intrinsics: NEON/SSE/AVX2…
• Carotene HAL: OpenCV optimized for ARM CPU
• IPP, MKL: OpenCV optimized for x86/x64 CPU
• OpenVX (graphs): OpenCV optimized for custom hardware
• T-API: OpenCL; GPU-optimized OpenCV
• Halide scripts: any Halide-supported hardware

Diagram legend: user-programmable tools; collections of fixed functions; active development area


T-API: heterogeneous compute with OpenCV is easy!

• OpenCV 3.x includes T-API by default:
  • Asynchronous: can run GPU & CPU code in parallel
  • 100s of open-source OpenCL kernels

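CPU version (cv::Mat):

    #include "opencv2/opencv.hpp"

    using namespace cv;

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;               // expects an image path
        Mat img, gray;
        img = imread(argv[1], IMREAD_COLOR);
        if (img.empty()) return 1;            // file missing or unreadable
        imshow("original", img);
        cvtColor(img, gray, COLOR_BGR2GRAY);
        GaussianBlur(gray, gray, Size(7, 7), 1.5);
        Canny(gray, gray, 0, 50);
        imshow("edges", gray);
        waitKey();
        return 0;
    }

T-API version (cv::UMat); the only change to the pipeline is the type of the intermediate buffer:

    #include "opencv2/opencv.hpp"

    using namespace cv;

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;               // expects an image path
        Mat img; UMat gray;                   // UMat lets OpenCV use OpenCL
        img = imread(argv[1]);
        if (img.empty()) return 1;            // file missing or unreadable
        imshow("original", img);
        cvtColor(img, gray, COLOR_BGR2GRAY);
        GaussianBlur(gray, gray, Size(7, 7), 1.5);
        Canny(gray, gray, 0, 50);
        imshow("edges", gray);                // automatic sync point
        waitKey();
        return 0;
    }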


T-API: under the hood

Very little “boilerplate code” (just ~30 lines of code):

    void mykernel(cv::InputArray input, cv::OutputArray output, params …) {
    }

Control flow inside the function:

• Use OpenCL? If yes, get clmem (use zero-copy if possible)
• Retrieve/compile the OpenCL kernel & “enqueue” it
• Enqueued successfully? If yes, finish
• Otherwise: retrieve cv::Mat and run the C++ code
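The try-accelerated-then-fall-back flow above can be sketched in plain C++ without OpenCV or OpenCL. This is only an illustration of the control flow, not OpenCV's actual implementation; `Buffer`, `AccelKernel`, `CpuKernel`, `run_with_fallback` and `scale_by_two_cpu` are hypothetical names invented for the sketch.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for an image buffer; not an OpenCV type.
using Buffer = std::vector<float>;

// An accelerated kernel reports failure by returning false, e.g. when the
// device is missing or the kernel could not be compiled or enqueued.
using AccelKernel = std::function<bool(const Buffer&, Buffer&)>;
// The reference CPU implementation is assumed to always succeed.
using CpuKernel = std::function<void(const Buffer&, Buffer&)>;

enum class Path { Accelerated, Cpu };

// Try the accelerated branch first; fall back to the CPU branch when
// acceleration is disabled, unavailable, or fails at run time.
Path run_with_fallback(bool use_accel,
                       const AccelKernel& accel,
                       const CpuKernel& cpu,
                       const Buffer& input,
                       Buffer& output) {
    if (use_accel && accel && accel(input, output)) {
        return Path::Accelerated;
    }
    cpu(input, output);
    return Path::Cpu;
}

// Example CPU kernel: multiply every element by two.
void scale_by_two_cpu(const Buffer& input, Buffer& output) {
    output.resize(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = 2.0f * input[i];
    }
}
```

A caller passes both branches and does not need to know which one ran; the returned `Path` is there only for logging or profiling.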


T-API execution model

• Supports multiple devices

• Asynchronous execution with no explicit synchronization required


T-API showcase: Pedestrian Detector

Pipeline:

• Capture video frame
• Per-frame detector
  • Feature Pyramid Builder: build pyramid => RGB2Luv => HOG feature maps => integrals of HOG maps
  • Sliding window + cascade classifier
  • Non-maxima suppression (filtering out duplicates)
• Optical flow-based tracker: do temporal filtering, follow pedestrians, detect new ones

Performance profile of per-frame detector (CPU):

• Feature Pyramid Builder: 65%
• Classifier + non-max: 35%

• Feature Pyramid Builder is the ideal “kernel” to optimize:

• Expensive

• Regular, easy to parallelize & vectorize

• Reusable (e.g., for cars)
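The "integrals" stage is a good example of the regular work the Feature Pyramid Builder does: once a summed-area table exists, the sum over any rectangle costs four lookups, which is what makes a sliding-window classifier affordable. A minimal plain-C++ sketch follows; it is not the detector's code, and `IntegralImage` is a hypothetical name chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed-area table with one extra row and column of zeros, so that
// rectangle sums need no special cases at the image border.
struct IntegralImage {
    std::size_t width = 0;            // width of the source image
    std::size_t height = 0;           // height of the source image
    std::vector<std::int64_t> table;  // (height + 1) x (width + 1) entries

    // Build from a row-major single-channel 8-bit image.
    IntegralImage(const std::vector<std::uint8_t>& pixels,
                  std::size_t w, std::size_t h)
        : width(w), height(h), table((w + 1) * (h + 1), 0) {
        assert(pixels.size() == w * h);
        const std::size_t stride = w + 1;
        for (std::size_t y = 0; y < h; ++y) {
            std::int64_t row_sum = 0;  // running sum of the current row
            for (std::size_t x = 0; x < w; ++x) {
                row_sum += pixels[y * w + x];
                table[(y + 1) * stride + (x + 1)] =
                    table[y * stride + (x + 1)] + row_sum;
            }
        }
    }

    // Sum of pixels in the half-open rectangle [x0, x1) x [y0, y1).
    std::int64_t sum(std::size_t x0, std::size_t y0,
                     std::size_t x1, std::size_t y1) const {
        assert(x0 <= x1 && x1 <= width && y0 <= y1 && y1 <= height);
        const std::size_t stride = width + 1;
        return table[y1 * stride + x1] - table[y0 * stride + x1]
             - table[y1 * stride + x0] + table[y0 * stride + x0];
    }
};
```

Each table row depends only on the row above it and on a running row sum, so the construction is simple to vectorize or to split into row and column passes.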


Feature Pyramid Builder optimization with T-API

• Duplicate CPU branch
• Make OpenCL-compatible copy (cv::UMat) for each internal buffer (cv::Mat)
• Use available OpenCL-optimized funcs (e.g. cv::resize, cv::integral)
• Create OpenCL kernels for other parts (RGB2Luv, HOG): ~700 LoC
• Debug-Profile-Optimize: repeat until happy

Part                      CPU time, ms   OCL time, ms   CPU time, ms   OCL time, ms   Acceleration   Acceleration
                          (1080p)        (1080p)        (720p)         (720p)         (1080p)        (720p)
All                       200            140            107            87             42%            23%
Feature Pyramid Builder   130            70             60             40             85%            50%

Test machine: Core i5 (Skylake), 2-core 2.5 GHz, Intel HD530 GPU
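The acceleration columns are consistent with (CPU time / OCL time - 1), and the drop in total time equals the time saved in the Feature Pyramid Builder alone (200 - 140 = 130 - 70 at 1080p, 107 - 87 = 60 - 40 at 720p). The sketch below reproduces that arithmetic. The formula is inferred from the numbers in the table, not stated in the slides, and the function names are hypothetical.

```cpp
#include <cassert>

// Acceleration in percent, as the table appears to define it:
// how much faster the OpenCL branch is relative to the CPU branch.
double acceleration_percent(double cpu_ms, double ocl_ms) {
    assert(cpu_ms > 0.0 && ocl_ms > 0.0);
    return (cpu_ms / ocl_ms - 1.0) * 100.0;
}

// Predicted whole-pipeline time when only one stage gets faster
// and every other stage takes the same time as before.
double total_after_stage_speedup(double total_ms,
                                 double stage_before_ms,
                                 double stage_after_ms) {
    assert(stage_before_ms <= total_ms);
    return total_ms - (stage_before_ms - stage_after_ms);
}
```

For example, `total_after_stage_speedup(200, 130, 70)` gives 140, the measured 1080p total, and `acceleration_percent(60, 40)` gives 50. The other table entries agree with the formula to within one percentage point.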


Halide: write once, schedule everywhere!

• Many acceleration options are available (CPU, GPU, DSPs, FPGA, etc.)
• Coding kernels with native tools is a huge investment and maintenance cost
• Long time to market
• Big commitment because of low portability
• OpenCV cannot be optimized for every single accelerator
• OpenCL is neither performance-portable nor easy to use
• Let’s generate OpenCL or LLVM code automatically from a high-level algorithm description!
• Let’s separate the platform-agnostic algorithm description from the platform-specific “pragmas” (vectorization, tiling, …)!

Halide! (http://halide-lang.org)

[Diagram: Halide compilation flow]

Algorithm description (Function 1, Function 2, …)
  => CPU scheduler (tiling, vectorization, pipelining) => CPU code (SSE, AVX…, NEON)
  => GPU scheduler (tiling, vectorization, pipelining) => GPU code (OpenCL, CUDA)
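The algorithm/schedule split can be illustrated in plain C++. This is deliberately not Halide code and uses no Halide API. It only shows that one description of what to compute can be run under different loop orders with identical results. `PixelFunc`, `realize_row_major`, `realize_tiled` and `gradient` are hypothetical names for the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// "Algorithm": what to compute at pixel (x, y). It says nothing about
// loop order, tiling, or vectorization.
using PixelFunc = std::function<int(std::size_t x, std::size_t y)>;

// "Schedule" 1: plain row-major traversal.
std::vector<int> realize_row_major(const PixelFunc& f,
                                   std::size_t width, std::size_t height) {
    std::vector<int> out(width * height);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            out[y * width + x] = f(x, y);
    return out;
}

// "Schedule" 2: the same computation visited tile by tile, which usually
// improves cache locality for larger images.
std::vector<int> realize_tiled(const PixelFunc& f,
                               std::size_t width, std::size_t height,
                               std::size_t tile) {
    std::vector<int> out(width * height);
    if (tile == 0) tile = 1;  // guard against an endless loop
    for (std::size_t ty = 0; ty < height; ty += tile) {
        for (std::size_t tx = 0; tx < width; tx += tile) {
            const std::size_t y_end = std::min(ty + tile, height);
            const std::size_t x_end = std::min(tx + tile, width);
            for (std::size_t y = ty; y < y_end; ++y)
                for (std::size_t x = tx; x < x_end; ++x)
                    out[y * width + x] = f(x, y);
        }
    }
    return out;
}

// Example algorithm: a simple gradient pattern.
int gradient(std::size_t x, std::size_t y) {
    return static_cast<int>(x + 10 * y);
}
```

Changing the schedule never touches `gradient`. In Halide the schedule is likewise written separately from the algorithm, and the compiler generates the loops.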


Halide: first impressions & results

• Same code for CPU & GPU
• Halide includes a very efficient loop-handling engine
• Almost any known DNN can be implemented entirely in Halide
• The language is quite limited (insufficient to cover OpenVX 1.0)
• In some cases the produced code is inefficient
• The whole infrastructure is immature

Plans

• Halide backend in the OpenCV DNN module (in progress)
• Extend the language (if operator, etc.)
• Improve performance of the generated code
• Fix/improve the infrastructure (nicer frontend, better support for offline compilation)

Kernel                                  OpenCV, ms (CPU)   Halide, ms (CPU)   Halide, ms (GPU)
RGB=>Gray                               0.44               0.54 (-20%)        0.58 (-25%)
Canny                                   3.3                1.4+2 (-3%)        2.4+2 (-25%)
DNN: AlexNet                            29 (w. MKL)        24 (+20%)          47 (-40%)
DNN: ENet (512x256)                     ~250 (w. MKL)      60 (+320%)         44 (+470%)
HOG-based pedestrian detector (1080p)   200                75+70 (+38%)       140 – 700 ms

Percentages compare against the OpenCV CPU time. The values are consistent with (OpenCV time / Halide time - 1), so a positive number means Halide is faster.


OpenCV + OpenVX

• OpenVX-based HAL in OpenCV
  ✓ [Done] Immediate-mode OpenVX calls to accelerate simple functions:
    • cv::boxFilter(const cv::Mat&, …) => vxuBox3x3(vx_image, …) etc.
    • Tested with Khronos’ sample implementation and Intel IAP
  • [TBD] Graphs for DNN acceleration
✓ [Done] Mixing OpenVX + OpenCV at user app level
  • vx_image <=> cv::Mat, OpenVX C++ wrappers, sample code:
    https://github.com/opencv/opencv/tree/master/samples/openvx
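For reference, the operation being dispatched in the boxFilter example is simple: a 3x3 box filter replaces each pixel with the mean of its 3x3 neighborhood. A minimal plain-C++ sketch follows. It is neither OpenCV's nor OpenVX's implementation; the border handling (border pixels copied unchanged) and the rounding (truncating integer division) are assumptions of the sketch, and `box3x3` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 box (mean) filter on a row-major 8-bit single-channel image.
// Sketch assumptions: border pixels are copied unchanged and the mean is
// truncated by integer division; real libraries define their own rules.
std::vector<std::uint8_t> box3x3(const std::vector<std::uint8_t>& src,
                                 std::size_t width, std::size_t height) {
    assert(src.size() == width * height);
    std::vector<std::uint8_t> dst(src);  // start from a copy (keeps borders)
    if (width < 3 || height < 3) return dst;  // no interior pixels
    for (std::size_t y = 1; y + 1 < height; ++y) {
        for (std::size_t x = 1; x + 1 < width; ++x) {
            unsigned sum = 0;
            for (std::size_t dy = 0; dy < 3; ++dy)
                for (std::size_t dx = 0; dx < 3; ++dx)
                    sum += src[(y + dy - 1) * width + (x + dx - 1)];
            dst[y * width + x] = static_cast<std::uint8_t>(sum / 9);
        }
    }
    return dst;
}
```

A function this small and regular is a natural candidate for a HAL replacement: the caller keeps using the same high-level call while a vendor-optimized version runs underneath.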


OpenCV Acceleration Options Comparison

HAL functions
  + Get used automatically (zero effort); vendor-specific implementation is possible
  - Little coverage (mostly image processing); usually CPU-only

HAL intrinsics
  + Super-flexible, widely applicable and widely available
  - Low-level, CPU only

T-API
  + Can potentially deliver top speed
  - OpenCL is not performance-portable; lots of expertise needed

OpenVX
  + Can be tailored for any hardware (CPU, GPU, DSP, FPGA)
  - Inflexible, not easy to use, difficult to extend

Halide
  + Decent performance; relatively easy to use
  - Not as flexible as OpenCL or C++

[Chart: Performance vs. Ease-of-use; plotted options: HAL functions, HAL intrinsics, Halide, T-API (custom), T-API (built-in), OpenVX (graphs), OpenVX (graphs for DNN)]

[Chart: Flexibility vs. Coverage; plotted options: HAL functions, HAL intrinsics, Halide, T-API (custom), T-API (built-in), OpenVX (graphs)]


Summary

• Modern OpenCV provides several acceleration paths
• Custom kernels are essential for user apps; existing OpenCV (and OpenVX) functionality is not enough
• Universal intrinsics (http://docs.opencv.org/master/df/d91/group__core__hal__intrin.html) are the best solution for CPU
• T-API (OpenCL; http://opencv.org/platforms/opencl.html) is the way to go for GPU acceleration
• Halide looks very promising and can become a viable alternative to plain C++ and OpenCL for “regular” algorithms; OpenCV 3.3 will include a Halide-accelerated deep learning module


Resources

• OpenCV: http://opencv.org
• Intel CV SDK: https://software.intel.com/en-us/computer-vision-sdk - the home of Intel-optimized OpenCV & OpenVX
• Halide: http://halide-lang.org
• Insights on the OpenCV 3.x feature roadmap, EVS2016 talk by Gary Bradski: https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-opencv
