Emulated Digital CNN-UM Implementation of a 3-dimensional Ocean Model on FPGAs

University of Veszprém, Department of Image Processing and Neurocomputing

Zoltán Nagy, Péter Szolgay

Nagy 2 MAPLD 2005/153

Introduction

• Cellular Neural/Nonlinear Networks Universal Machine (CNN-UM)
• Ocean modeling
• Results
• Conclusions

Cellular Neural/Nonlinear Networks (CNN)

• 2 or N dimensional grid
• Locally connected
• Analog processing elements
• State value is continuous in time

Structure of a CNN cell

• uij input

• xij state

• yij output

• zij constant bias

• Aij,kl feedback template

• Bij,kl feed-forward template

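$$C_x \dot{x}_{ij}(t) = -\frac{1}{R_x} x_{ij}(t) + \sum_{kl \in S_{ij}} A_{ij,kl}\, y_{kl}(t) + \sum_{kl \in S_{ij}} B_{ij,kl}\, u_{kl}(t) + z_{ij}$$

[Figure: circuit model of a CNN cell with input u_ij, state x_ij, capacitor C_x, resistors R_x and R_y, and current sources I_xu(ij,kl), I_xy(ij,kl), and I_yx]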
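As a rough illustration of the cell dynamics, the sketch below integrates the state equation C_x dx_ij/dt = -x_ij/R_x + Σ A_ij,kl y_kl + Σ B_ij,kl u_kl + z_ij with forward Euler steps. Several things here are assumptions not taken from the slides: the 3x3 neighborhood, space-invariant templates, zero-valued boundary cells, the standard piecewise-linear output function, and the function names themselves.

```python
import numpy as np

def cnn_output(x):
    """Standard piecewise-linear CNN output (assumed): y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def correlate3x3(img, template):
    """Sum template[k, l] * img[i + k - 1, j + l - 1], treating cells outside the grid as zero."""
    padded = np.pad(img, 1, mode="constant")
    rows, cols = img.shape
    out = np.zeros((rows, cols), dtype=float)
    for k in range(3):
        for l in range(3):
            out += template[k, l] * padded[k:k + rows, l:l + cols]
    return out

def euler_step(x, u, A, B, z, dt=0.1, C=1.0, R=1.0):
    """One forward Euler step of the CNN state equation for every cell at once."""
    feedback = correlate3x3(cnn_output(x), A)   # sum of A_ij,kl * y_kl
    feedforward = correlate3x3(u, B)            # sum of B_ij,kl * u_kl
    dxdt = (-x / R + feedback + feedforward + z) / C
    return x + dt * dxdt
```

With all templates set to zero the state simply decays toward zero at the rate 1/(R C), which is a quick sanity check on the sign of the leak term.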

CNN-UM implementations

• Software simulation
  - Easy to implement
  - Slow, even when using processor-specific instructions
• Emulated digital VLSI
  - Specialized digital architecture
  - Selectable computing precision (Castle architecture: 1, 6, 12 bit)
  - Orders of magnitude faster than software simulation
  - Long design time
• Analog VLSI
  - Huge computing power (~TeraOP/s)
  - Low accuracy (7-8 bit)
  - Noise and temperature sensitivity

Structure of the Falcon emulated digital CNN-UM

• Mixer
  - Contains cell values for the next updates
• Memory unit
  - Contains a belt of the cell array
• Template memory
• Arithmetic unit
• Processors can be connected on a grid
  - Linear speedup

[Figure: block diagram of the Falcon core processor (memory unit, mixer unit, template memory, arithmetic unit) with inputs StateIn, ConstIn, TmpselIn, LeftIn, LeftInNew, RightIn and outputs StateOut, ConstOut, TmpselOut, RightOut, RightOutNew, LeftOut; core processors connected through input, output, and control lines]

Structure of the arithmetic unit

• Cell update in row wise order

• Cycle time depends on template size

• Fully pipelined

[Figure: arithmetic unit with three multipliers (Mult); labeled signals S1, S2, S3, T1, T2, T3, g_ij, and x_ij]

Configurable parameters

• State, template, and constant width between 2 and 64 bits
• Number of templates
• Size of the templates
• Width of the cell array slice
• Number of layers
• Number and arrangement of the processor cores

Example: Solution of a simple PDE on CNN

• The wave equation
• Spatial discretization
• 2-layer CNN

[Figure: two-layer CNN with layers d and v coupled by the inter-layer templates A12 and A21]
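One way to picture the two-layer structure is to split the wave equation ∂²d/∂t² = c²∇²d into ∂d/∂t = v and ∂v/∂t = c²∇²d, so that layer v is driven by a Laplacian template applied to layer d and layer d is driven by layer v itself. The sketch below follows that reading. The split, the 5-point Laplacian, the zero boundary cells, and the symplectic Euler update (chosen here for stability) are assumptions; the slides do not state the template values or the integrator, and which coupling is called A12 or A21 is not recoverable from them.

```python
import numpy as np

# 5-point discrete Laplacian, used as the template that couples layer d into layer v.
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def apply_template(layer, template):
    """Apply a 3x3 space-invariant template, treating cells outside the grid as zero."""
    padded = np.pad(layer, 1, mode="constant")
    rows, cols = layer.shape
    out = np.zeros((rows, cols), dtype=float)
    for k in range(3):
        for l in range(3):
            out += template[k, l] * padded[k:k + rows, l:l + cols]
    return out

def wave_step(d, v, c=1.0, h=1.0, dt=0.1):
    """Advance both layers by one time step (v first, then d using the new v)."""
    v_new = v + dt * (c / h) ** 2 * apply_template(d, LAPLACIAN)
    d_new = d + dt * v_new
    return d_new, v_new

# Usage: a single displaced cell spreads to its four neighbors.
d = np.zeros((5, 5))
d[2, 2] = 1.0
v = np.zeros((5, 5))
d, v = wave_step(d, v)
```

After one step with the default parameters the center cell drops to 1 - 4·dt² and each direct neighbor rises to dt², which matches a hand calculation of the discretized equations.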

Ocean models

• Barotropic model
• Baroclinic models
  - z-coordinate model
  - σ-coordinate model
  - Isopycnal model
• Fine resolution models
  - Real-time forecast
  - Fishing industry
  - Search and rescue
• Coarse resolution models
  - Long term predictions
  - Climate modeling

The Princeton Ocean Model (POM)

• Sigma coordinate model
  - Vertical coordinate is scaled on the water column depth
• Second moment turbulence closure sub-model
  - Provides vertical mixing coefficients
• Solution technique: Mode splitting
  - Internal mode (3D)
    o Vertical structure equations
    o Implicit solution
  - External mode (2D)
    o Vertically integrated equations
    o Explicit solution (Leapfrog method)
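For readers unfamiliar with the leapfrog method named for the external mode, the sketch below shows the basic three-level update y(n+1) = y(n-1) + 2·dt·f(y(n)) on a toy harmonic oscillator. It is only an illustration of the time-stepping idea: the function names, the forward Euler start-up step, and the test problem are assumptions, and no time filtering or ocean physics is included.

```python
import numpy as np

def leapfrog(f, y0, dt, steps):
    """Integrate dy/dt = f(y) for `steps` steps with the leapfrog (midpoint) scheme."""
    y_prev = np.asarray(y0, dtype=float)
    if steps == 0:
        return y_prev
    # The scheme needs two time levels, so the first step uses forward Euler.
    y_curr = y_prev + dt * f(y_prev)
    for _ in range(steps - 1):
        y_next = y_prev + 2.0 * dt * f(y_curr)
        y_prev, y_curr = y_curr, y_next
    return y_curr

def oscillator(y):
    """Toy right-hand side: position y[0], velocity y[1], unit angular frequency."""
    return np.array([y[1], -y[0]])

# Usage: integrate from t = 0 to t = 1; the exact solution is (cos t, -sin t).
y_end = leapfrog(oscillator, [1.0, 0.0], dt=0.01, steps=100)
```

Because the update is explicit, each step needs only values from the two previous time levels, which is what makes it a natural fit for a template-based array processor.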

Governing equations of the external (2D) mode

• ux, uy mass transport
• η free surface elevation
• Ω angular rotation of the Earth
• Θ latitude
• H depth of the ocean
• g gravitational acceleration
• τw, τb wind and bottom stress
• A lateral viscosity

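$$\frac{d u_x}{dt} = -gH\frac{\partial \eta}{\partial x} + 2\Omega\sin\Theta\, u_y + \tau_{wx} - \tau_{bx} + A\nabla^2 u_x$$

$$\frac{d u_y}{dt} = -gH\frac{\partial \eta}{\partial y} - 2\Omega\sin\Theta\, u_x + \tau_{wy} - \tau_{by} + A\nabla^2 u_y$$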

Solution on CNN

• Spatial discretization on a uniform grid
• 3-layer CNN structure
• Non-linear template required for advection
• Cannot be solved on analog VLSI CNN chips
• Solvable on the modified Falcon architecture
  - Support of non-linearity
  - Specialized cell model

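[Equation: template A_ij,x for the x-direction mass transport; the recoverable 3x3 pattern, with signs and scale factors lost in extraction, is (0 0 0 / 1 0 1 / 0 0 0)]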
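To make "spatial discretization on a uniform grid" concrete, the sketch below applies a centered-difference template to the elevation layer and scales it by -gH, which is how a pressure-gradient term of the form -gH ∂η/∂x could be evaluated with a 3x3 template. Everything specific here is an assumption for illustration: the term itself is inferred from the variable list (g, H, η), the column index is taken as the x direction, boundary cells are treated as zero, and the function names are hypothetical. The non-linear advection template is not covered.

```python
import numpy as np

# Centered-difference template for d/dx: only the west and east neighbors are nonzero.
DDX = np.array([[0.0, 0.0, 0.0],
                [-1.0, 0.0, 1.0],
                [0.0, 0.0, 0.0]])

def apply_template(layer, template):
    """Apply a 3x3 space-invariant template, treating cells outside the grid as zero."""
    padded = np.pad(layer, 1, mode="constant")
    rows, cols = layer.shape
    out = np.zeros((rows, cols), dtype=float)
    for k in range(3):
        for l in range(3):
            out += template[k, l] * padded[k:k + rows, l:l + cols]
    return out

def ddx(layer, h):
    """Second-order centered approximation of the x derivative on a grid with spacing h."""
    return apply_template(layer, DDX) / (2.0 * h)

def pressure_gradient_x(eta, depth, g=9.81, h=1.0):
    """Evaluate -g * H * d(eta)/dx cell by cell (depth may vary over the grid)."""
    return -g * depth * ddx(eta, h)

# Usage: a surface that rises by 0.01 per cell over a 100-deep basin.
eta = 0.01 * np.tile(np.arange(6.0), (5, 1))
depth = np.full((5, 6), 100.0)
term = pressure_gradient_x(eta, depth)
```

Only interior cells are meaningful in this usage, because the zero padding stands in for boundary conditions that the slides do not specify.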

The modified arithmetic unit of the Falcon architecture

[Figure: modified arithmetic unit; labeled signals include f_ij, u_y,ij, recH_ij, the template A_ij, and u_x at cells (i,j), (i-1,j), (i+1,j), (i,j-1), and (i,j+1)]

Area requirements

[Figure: area requirements versus precision (bit), 10 to 36 bits; series Mult18x18 and 18kbit BRAM]

Implementation on FPGA

• Complicated arithmetic unit

• Fixed-point number representation

• Configurable precision

• High level hardware description language required (e.g. Handel-C)
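The effect of a fixed-point representation with configurable precision can be explored in software before committing to hardware. The sketch below models a signed fixed-point format in plain Python; the split into total and fractional bits, the saturation on overflow, the truncating multiply, and all function names are assumptions made for illustration, since the slides do not describe the number format in detail.

```python
def saturate(raw, total_bits):
    """Clamp an integer to the range of a signed two's-complement word."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, raw))

def to_fixed(value, frac_bits, total_bits):
    """Quantize a real value to a signed fixed-point integer with frac_bits fractional bits."""
    return saturate(int(round(value * (1 << frac_bits))), total_bits)

def from_fixed(raw, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return raw / (1 << frac_bits)

def fixed_mul(a, b, frac_bits, total_bits):
    """Multiply two fixed-point numbers, dropping the extra fractional bits of the product."""
    return saturate((a * b) >> frac_bits, total_bits)

# Usage: 0.5 * 0.25 in a 16-bit word with 8 fractional bits.
a = to_fixed(0.5, 8, 16)
b = to_fixed(0.25, 8, 16)
product = from_fixed(fixed_mul(a, b, 8, 16), 8)
```

Rerunning a simulation with different `total_bits` and `frac_bits` values is one simple way to see how the solution error changes with precision.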

Performance

[Figure: speedup compared to an Athlon64 2GHz versus precision (bit), 10 to 34 bits, for the XC2V1000, XC2V6000, and XC4VSX55]

[Figure: number of processors versus precision (bit), 10 to 36 bits, for the XC2V1000, XC2V6000, and XC4VSX55]

The Seamount problem

Results after 72 hours

[Figure: two panels, circulation pattern and elevation; horizontal axis X (km) from 0 to 2000; a scale factor of 10^-3 appears on one axis]

Error of the solution

[Figure: error of the mass transport uy versus precision (bit), 10 to 34 bits; logarithmic error axis from 1.0E-05 to 1.0E+00; series Case1 to Case6]

[Figure: error of the mass transport ux versus precision (bit), 10 to 34 bits; logarithmic error axis from 1.0E-05 to 1.0E+00; series Case1 to Case6]

Error of the solution

[Figure: error of the elevation versus precision (bit), 10 to 34 bits; logarithmic error axis from 1.0E-07 to 1.0E-02; series Case1 to Case6]

Memory requirements of the internal (3D) equations

• Extended memory hierarchy
  - New level stores 3 cross sectional slices from the 3D array
    o Large memory required (e.g. 512x512x64 sized grid, 3x512x64 elements per state variable)
    o Cannot be stored on-chip
    o Off-chip storage requires huge I/O bandwidth
• Processor array should be used
  - The 3D array is divided between the processors
  - Optimal data set for on chip storage: 2048 elements per cross sectional slice (512x32x64 sized grid per processor)
  - Each processor located on a separate FPGA
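The sizes quoted above follow from simple arithmetic, sketched below. The helper names are hypothetical, the 24-bit word width is only an example value, and the processor count assumes the 512x512x64 grid is split along one horizontal dimension only, which the slides do not state explicitly.

```python
def slice_elements(width, depth):
    """Number of elements in one cross sectional slice of the 3D array."""
    return width * depth

def buffer_elements(width, depth, slices=3):
    """Elements held when `slices` cross sectional slices are stored per state variable."""
    return slices * slice_elements(width, depth)

def buffer_bits(width, depth, word_bits, slices=3):
    """Storage in bits for the slice buffer at a given word width."""
    return buffer_elements(width, depth, slices) * word_bits

# Full 512x512x64 grid: each slice is 512x64 elements, and three are buffered.
full_buffer = buffer_elements(512, 64)            # 98304 elements per state variable

# Per-processor 512x32x64 sub-grid: each slice is 32x64 elements.
per_processor_slice = slice_elements(32, 64)      # 2048 elements

# Assumed split along one horizontal dimension: 512 / 32 sub-grids.
processors = 512 // 32                            # 16 processors

# Example storage for the full-width buffer with 24-bit words (assumed width).
full_buffer_bits = buffer_bits(512, 64, 24)       # 2359296 bits
```

Under these assumptions the per-processor buffer is 16 times smaller than the full-width one, which is the motivation for dividing the array between processors.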

Solution of the internal (3D) equations

• Implicit solution
  - Fixed-point solution
    o Requires large precision to avoid rounding errors
    o Seems to be impractical
  - Floating-point solution
    o Requires large area (especially add/sub)
• Explicit solution
  - Smaller timestep
  - Simpler arithmetic unit

Conclusions

• Ocean modeling using emulated digital CNN is very promising
• Moderate precision is required in 2D mode
  - 1% accuracy using 24 bits
• Expected speedup (compared to an Athlon64 2GHz microprocessor)
  - 80 times on our RC200 prototyping board
  - 3700 times on the largest available FPGA
