
"Efficient Convolutional Neural Network Inference on Mobile GPUs," a Presentation from Imagination Technologies


Page 1: Efficient Convolutional Neural Network Inference on Mobile GPUs

Paul Brasnett, Imagination Technologies

May 3, 2016

Copyright © 2016 Imagination Technologies

Page 2: Overview

• About Imagination Technologies
• PowerVR GPUs
• Case study: Implementing Convolutions
• Performance Analysis
• Conclusions
• Resources

Page 3: About Imagination Technologies

• Imagination Technologies is a leading IP supplier for multimedia, processors and communications
• More than 8bn units containing Imagination IP shipped

[Diagram: an SoC fabric connecting PowerVR Graphics & GPU Compute Processors, PowerVR Video Processors, PowerVR Vision Processors, Ensigma Communications Processors and MIPS Processors]

Page 4: What is a Mobile GPU?

• A mobile GPU is optimised for high performance at low power

Page 5: What is a Mobile GPU?

• A mobile GPU is optimised for high performance at low power
• Target markets include mobile devices, automotive, consumer multimedia, wearables, Internet of Things and augmented reality

Page 6: Why Mobile GPUs for Vision Processing?

• CPUs can deliver high peak/burst performance
  • But generate large amounts of heat
• PowerVR mobile GPUs provide
  • Lowest-power FP16 & integer pipelines
  • Local memory for highly efficient data access for compute operations
  • Power-saving features such as gating of non-compute parts of the GPU for efficient compute operation

Page 7: Why Mobile GPUs for Vision Processing?

Performance relative to CPU (CPU = 100%):

  Benchmark                  CPU    PowerVR Series6
  Provence (raytracing)      100%    265%
  Particle Simulation – 32k  100%    407%
  Particle Simulation – 4k   100%    517%
  Julia Set                  100%    963%
  Ambient Occlusion          100%   1126%
  Denoise                    100%    482%
  Gaussian Blur              100%    383%

Page 8: Moving the CNN Workload to the GPU

[Block diagram: the CPU (CPU0, CPU1, few threads, large cache) enqueues a compute kernel through the host interface to the PowerVR GPU (graphics and compute). A coarse grain scheduler feeds multiprocessors (Unified Shading Clusters), each with residency slots, a common store, a compute store, a scheduler and a texture processing unit, backed by an L2 / system level cache and a system memory interface; CPU and GPU share unified system memory.]

Page 9: Evolution of Mobile GPU

• PowerVR Series 6 GPU → PowerVR Series 7 GPU → PowerVR Series 8 GPU

Page 10: Evolution of Mobile GPU

• New APIs: OpenCL 1.2, OpenCV, OpenVX, Vulkan, OpenCL 2.0

Page 11: GPU Dominates Compute in Modern SoCs

• The mobile GPU is increasingly dominating compute performance in SoCs

[Illustrative diagram only, to show relative CPU/GPU size]

Page 12: Why CNNs?

• State-of-the-art performance
• Rapid development cycles
• Range of vision tasks
  • Classification
  • Localisation
  • Other applications…

[Image: camera localisation example from PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, Kendall, A., Grimes, M., Cipolla, R., ICCV 2015]

Page 13: What is a CNN?

• CNN architecture building blocks: Convolution, Activation, Normalization, Pooling, Fully Connected, Soft Max
• CNN example network: Image → Convolution → Activation → Pooling → Normalization → … → Convolution → Activation → Pooling → Fully Connected → Soft Max

Page 14: CNN Object Classification

• Training — Offline
  • Architecture + data → CNN library (compute + time) → model coefficients

Page 15: CNN Object Classification

• Training — Offline
  • Architecture + data → CNN library (compute + time) → model coefficients
• Inference — Online
  • Uses the trained architecture and model coefficients

Page 16: CNN Object Classification

• Training — Offline
  • Architecture + data → CNN library (compute + time) → model coefficients
• Inference — Online
  • Image + architecture + model coefficients → CNN library (compute, on the mobile GPU) → classification

Page 17: Where is the Cost in CNN Inference?

[Pie chart: FLOPs by layer type for AlexNet (Convolution, Normalisation, Pooling, Fully Connected); convolution accounts for the large majority of the FLOPs]

Page 18: Matrix Multiply — Naïve

• Create as many work-items as there are elements in the output matrix
• Each work-item reads its row of A and its column of B and produces one dot product
• Requires a large number of accesses to global memory

[Diagram: A × B = C]
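
A minimal sketch of such a naïve kernel, assuming row-major layouts and explicit dimension arguments (illustrative only, not the implementation used in the deck):

    // One work-item per output element: C[row][col] = dot(row of A, column of B).
    // Every multiply-accumulate reads both operands straight from global memory.
    __kernel void matmul_naive(__global const float *A,  // M x K, row-major
                               __global const float *B,  // K x N, row-major
                               __global float       *C,  // M x N, row-major
                               int M, int N, int K)
    {
        const int row = get_global_id(0);
        const int col = get_global_id(1);
        if (row >= M || col >= N) return;

        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // 2K global reads per work-item
        C[row * N + col] = acc;
    }

Launched over a 2D NDRange of at least M x N work-items, this performs one global read per operand per multiply, which is exactly the memory traffic the tiling approach on the following slides attacks.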

Page 19: OpenCL Memory Model

• The OpenCL memory model closely maps to GPU architecture
  • Private memory — per work-item
  • Local memory — shared within a work-group
  • Global memory / constant memory — visible to all work-groups
  • Host memory — typically shared between CPU and GPU on a mobile SoC
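
As a hedged illustration of how these spaces appear in OpenCL C, the toy kernel below touches each one; it is not taken from the presentation:

    __constant float scale = 2.0f;              // constant memory: read-only, all work-groups

    __kernel void memory_spaces(__global const float *in,  // global memory: all work-groups
                                __global float *out,
                                __local float *tile)       // local memory: one work-group
    {
        const int gid = get_global_id(0);
        const int lid = get_local_id(0);

        float x = in[gid];                      // private memory: this work-item only
        tile[lid] = x;                          // stage into the work-group's local memory
        barrier(CLK_LOCAL_MEM_FENCE);           // make the tile visible to the whole group

        out[gid] = tile[lid] * scale;
    }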

Page 20: Matrix Multiply — Tiling Approach

• Work-items load A data into private memory

[Diagram: A × B = C]

Tiling approach based on Volkov and Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra", SC 2008

Page 21: Matrix Multiply — Tiling Approach

• Work-items load A data into private memory
• Work-groups load B data into local memory
• Each work-item reads from local memory and produces a dot product
• Significantly reduces global memory accesses

[Diagram: A × B = C]

Tiling approach based on Volkov and Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra", SC 2008
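
A sketch of this tiling scheme, under stated assumptions (row-major layouts; M, N and K multiples of the tile width WG; a 2D NDRange of global size (M, N/WG) with local size (WG, 1)); it illustrates the idea rather than reproducing the deck's code:

    // Each work-group computes a WG x WG tile of C. Work-item lid owns one row of
    // the tile: it keeps WG partial sums in private memory while the current B
    // sub-block is staged once into local memory and shared by the whole group.
    #define WG 16

    __kernel __attribute__((reqd_work_group_size(WG, 1, 1)))
    void matmul_tiled(__global const float *A,   // M x K, row-major
                      __global const float *B,   // K x N, row-major
                      __global float       *C,   // M x N, row-major
                      int M, int N, int K)
    {
        const int lid     = get_local_id(0);
        const int row     = get_group_id(0) * WG + lid;  // row of C owned by this item
        const int colBase = get_group_id(1) * WG;        // first column of the C tile

        __local float Btile[WG][WG];   // B sub-block shared within the work-group
        float acc[WG];                 // private accumulators, one per output column
        for (int j = 0; j < WG; ++j) acc[j] = 0.0f;

        for (int k0 = 0; k0 < K; k0 += WG) {
            // Co-operative load: work-item lid copies row (k0 + lid) of the B block.
            for (int j = 0; j < WG; ++j)
                Btile[lid][j] = B[(k0 + lid) * N + colBase + j];
            barrier(CLK_LOCAL_MEM_FENCE);

            // Stream this work-item's slice of A through private memory and
            // accumulate against the shared B block: one global read of A per
            // WG multiply-accumulates, instead of one per multiply.
            for (int k = 0; k < WG; ++k) {
                const float a = A[row * K + k0 + k];
                for (int j = 0; j < WG; ++j)
                    acc[j] += a * Btile[k][j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        for (int j = 0; j < WG; ++j)
            C[row * N + colBase + j] = acc[j];
    }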

Page 22: Matrix Multiply — OpenCL Tips

• Choose the work-group size to fit the GPU; 32 work-items is typically a good choice for PowerVR GPUs
• Read multiple items (e.g. 4 or 8) into private memory at a time to optimise memory transfers
• Consider the half data type in place of float; most PowerVR platforms provide up to 2x the flops (see the sketch after this list)
• Define the work-group size at compile time:
  __attribute__((reqd_work_group_size(SIZE, 1, 1)))
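
A hypothetical fragment pulling these tips together: a compile-time work-group size, 4-wide vector loads, and half storage with float accumulation. It assumes the cl_khr_fp16 extension and a K that is a multiple of 4; the kernel is an illustration, not PowerVR reference code:

    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    // Dot product of two length-K rows stored as half.
    __kernel __attribute__((reqd_work_group_size(32, 1, 1)))  // fixed at compile time
    void row_dot_half(__global const half *A, __global const half *B,
                      __global half *out, int K)
    {
        const int row = get_global_id(0);
        float acc = 0.0f;                           // accumulate in float for accuracy
        for (int k = 0; k < K; k += 4) {
            // Read 4 items at a time to make better use of the memory bus.
            half4 a = vload4(0, A + row * K + k);
            half4 b = vload4(0, B + row * K + k);
            acc += dot(convert_float4(a), convert_float4(b));
        }
        out[row] = (half)acc;                       // half storage halves the bandwidth
    }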

Page 23: Matrix Multiply — Tiling Approach

[Chart: time (s), log scale from 0.1 to 1000, against matrix size for the naïve and tiled matrix multiply implementations]

Page 24: CNN Classification: AlexNet & GoogLeNet

                                   AlexNet   GoogLeNet
  Model coefficients (millions)    60         5.5
  Operations (billions)             1.3       3.1
  Top-5 error rate (%)             18.2      10.07

• Model coefficients drive bandwidth; operations drive compute

Page 25: Performance Analysis — CNN Inference

• Time consumed by layer type

[Pie charts: share of inference time across Convolutions, Pooling, Normalisation and Fully Connected layers. GoogLeNet, reference time*: 1.36; AlexNet, reference time*: 1.00]

Page 26: Performance Analysis — GPU v CPU*

[Bar chart: relative FPS performance on AlexNet, higher is better (scale 0 to 14): GPU, PowerVR 2-cluster GPU (480MHz) v CPU, ARM A15 (1.6GHz)]

* CPU results based on Caffe (with ATLAS)

Page 27: Efficiency Analysis — GPU v CPU

[Bar chart: relative efficiency on AlexNet, higher is better (scale 0 to 3.5): GPU, PowerVR 2-cluster GPU (480MHz) v CPU, ARM A15 (1.6GHz)]

Page 28: Conclusions

• Mobile GPUs are widely available in a range of SoCs across numerous markets today
• Compared to mobile CPUs, PowerVR mobile GPUs offer
  • up to 3x higher efficiency and
  • up to 12x higher performance for CNN deployment
• Newer CNN architectures with smaller fully connected layers help to make more efficient use of compute resources
• PowerVR GPUs scale to allow for higher levels of performance and lower power for current and future generations of vision-enabled products
• COME & SEE THE DEMO DURING THE NEXT BREAK

Page 29: Resources

• PowerVR GPU Compute
  • https://imgtec.com/tools/powervr-gpu-compute/
• Guide to writing OpenCL
  • http://blog.imgtec.com/powervr/a-quick-guide-to-writing-opencl-kernels-for-rogue
• PowerVR Imaging Framework
  • http://blog.imgtec.com/powervr/powervr-imaging-framework-sdk
• PowerVR CNN Demo
  • See our stand
• OpenCL Tutorial
  • https://handsonopencl.github.io/