University of Veszprém, Department of Image Processing and Neurocomputing
Emulated Digital CNN-UM Implementation of a 3-dimensional Ocean Model on FPGAs
Zoltán Nagy, Péter Szolgay
Nagy 2 MAPLD 2005/153
Introduction
• Cellular Neural/Nonlinear Networks Universal Machine (CNN-UM)
• Ocean modeling
• Results
• Conclusions
Cellular Neural/Nonlinear Networks (CNN)
• 2 or N dimensional grid
• Locally connected
• Analog processing elements
• State value is continuous in time
Structure of a CNN cell
• uij input
• xij state
• yij output
• zij constant bias
• Aij,kl feedback template
• Bij,kl feed-forward template
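$$C_x \dot{x}_{ij}(t) = -\frac{1}{R_x} x_{ij}(t) + \sum_{kl \in S_r(ij)} A_{ij,kl}\, y_{kl}(t) + \sum_{kl \in S_r(ij)} B_{ij,kl}\, u_{kl}(t) + z_{ij}$$
[Figure: equivalent circuit of a CNN cell with input u_ij, state x_ij, output y_ij, bias z_ij, capacitor C_x, resistors R_x and R_y, and controlled current sources I_xu(ij,kl), I_xy(ij,kl) and I_yx.]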
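The cell equation can be stepped numerically, which is what a software or emulated digital implementation has to do. The following is a minimal sketch, not the authors' implementation. It assumes forward Euler time stepping, a 3x3 neighbourhood, the usual piecewise-linear CNN output function, zero values outside the array, and a bias z that is the same for every cell. None of these choices is stated on the slides, and all function names are hypothetical.

```python
def cnn_output(x):
    """Piecewise-linear output y = 0.5 * (|x + 1| - |x - 1|) (assumed)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def template_sum(template, field, i, j):
    """Weighted sum over the 3x3 neighbourhood of cell (i, j).

    Cells outside the array contribute zero (an assumed boundary condition).
    """
    rows, cols = len(field), len(field[0])
    total = 0.0
    for k in (-1, 0, 1):
        for l in (-1, 0, 1):
            if 0 <= i + k < rows and 0 <= j + l < cols:
                total += template[k + 1][l + 1] * field[i + k][j + l]
    return total


def cnn_euler_step(x, u, A, B, z, dt, C_x=1.0, R_x=1.0):
    """One forward Euler step of C_x dx/dt = -x/R_x + sum(A*y) + sum(B*u) + z."""
    y = [[cnn_output(v) for v in row] for row in x]
    return [
        [
            x[i][j] + dt / C_x * (-x[i][j] / R_x
                                  + template_sum(A, y, i, j)
                                  + template_sum(B, u, i, j)
                                  + z)
            for j in range(len(x[0]))
        ]
        for i in range(len(x))
    ]
```

With all-zero templates the state simply relaxes towards R_x * z, which is a quick sanity check for the update.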
CNN-UM implementations
• Software simulation
  o Easy to implement
  o Slow, even when using processor-specific instructions
• Emulated digital VLSI
  o Specialized digital architecture
  o Selectable computing precision (Castle architecture: 1, 6, 12 bit)
  o Orders of magnitude faster than software simulation
  o Long design time
• Analog VLSI
  o Huge computing power (~TeraOP/s)
  o Low accuracy (7-8 bit)
  o Noise and temperature sensitivity
Structure of the Falcon emulated digital CNN-UM
• Mixer
  o Contains cell values for the next updates
• Memory unit
  o Contains a belt of the cell array
• Template memory
• Arithmetic unit
• Processors can be connected on a grid
  o Linear speedup
[Figure: block diagram of a Falcon core processor (memory unit, mixer unit, template memory, arithmetic unit) with StateIn/StateOut, ConstIn/ConstOut and TmpselIn/TmpselOut ports and the neighbour ports LeftIn, LeftInNew, LeftOut, RightIn, RightOut, RightOutNew; next to it, an array of nine core processors connected by input, output and control lines.]
Structure of the arithmetic unit
• Cell update in row wise order
• Cycle time depends on template size
• Fully pipelined
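[Figure: pipeline of the arithmetic unit. Three multipliers take the state inputs S1-S3 and the template inputs T1-T3; their products pass through adders, shift registers, an accumulator (ACC) and registers Reg1-Reg4, together with the inputs g_ij and x_ij.]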
Configurable parameters
• State, template and constant width between 2 and 64 bits
• Number of templates
• Size of the templates
• Width of the cell array slice
• Number of layers
• Number and arrangement of the processor cores
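To illustrate what a configurable word width means for the arithmetic, the sketch below rounds a real value to a signed fixed-point word and multiplies two such words. It is only an illustration: the split into `total_bits` and `frac_bits`, the saturation on overflow and the truncating multiply are assumptions, since the slides state only that widths between 2 and 64 bits can be selected.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize value to a signed two's-complement fixed-point integer.

    total_bits and frac_bits are hypothetical parameters chosen for this
    sketch; they are not taken from the Falcon design.
    """
    if not 2 <= total_bits <= 64:
        raise ValueError("width must be between 2 and 64 bits")
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    # Saturate on overflow instead of wrapping around (assumed behaviour).
    return max(lowest, min(highest, scaled))


def from_fixed(raw, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


def fixed_mul(a, b, frac_bits):
    """Multiply two fixed-point integers and rescale the product.

    The right shift discards the extra fractional bits (rounds towards
    minus infinity).
    """
    return (a * b) >> frac_bits
```

For example, with 8 bits in total and 6 fractional bits, 0.5 is stored as 32, and `fixed_mul(32, 32, 6)` gives 16, which `from_fixed` maps back to 0.25.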
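Example: Solution of a simple PDE on CNN
• The wave equation
$$\frac{d^2 u}{dt^2} = c^2 \frac{\partial^2 u}{\partial x^2}, \quad 0 < x < l,\ t > 0$$
$$u(0,t) = 0, \quad u(l,t) = 0, \quad t \ge 0$$
$$u(x,0) = f(x), \quad 0 \le x \le l$$
$$u_t(x,0) = g(x), \quad 0 \le x \le l$$
• Spatial discretization
• 2 layer CNN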
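$$A_{21} = \frac{c^2}{\Delta x^2}\begin{bmatrix} 1 & -2 & 1 \end{bmatrix}, \qquad A_{12} = 1$$
[Figure: two coupled CNN layers, layer d and layer v, connected by the templates A_12 and A_21.]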
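The two-layer scheme for the wave equation can be sketched in plain Python. This is a hedged illustration under several assumptions: layer d is taken to hold the displacement u and layer v its time derivative, the problem is one-dimensional with fixed ends, the second space derivative uses the three-point stencil (1, -2, 1) scaled by c^2/dx^2, and the time stepping is a semi-implicit (symplectic) Euler update chosen here for stability. The slides do not state the integrator, and the function names are hypothetical.

```python
import math


def wave_step(d, v, c, dx, dt):
    """Advance layers d and v by one time step with fixed (zero) end points."""
    n = len(d)
    coeff = (c / dx) ** 2
    new_v = v[:]
    for i in range(1, n - 1):
        # Layer v is driven by layer d through the stencil c^2/dx^2 * (1, -2, 1).
        new_v[i] = v[i] + dt * coeff * (d[i - 1] - 2.0 * d[i] + d[i + 1])
    # Layer d is driven by layer v through a single coupling weight of 1.
    new_d = [d[i] + dt * new_v[i] for i in range(n)]
    new_d[0] = new_d[-1] = 0.0
    return new_d, new_v


def simulate(n=51, length=1.0, c=1.0, dt=0.001, steps=1000):
    """Start from one half sine wave at rest and return layer d after `steps`."""
    dx = length / (n - 1)
    d = [math.sin(math.pi * i * dx / length) for i in range(n)]
    v = [0.0] * n
    for _ in range(steps):
        d, v = wave_step(d, v, c, dx, dt)
    return d
```

With the default arguments the run covers half an oscillation period (t = l/c), so the initial half sine wave should come back close to its mirror image.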
Ocean models
• Barotropic model
• Baroclinic models
  o z-coordinate model
  o σ-coordinate model
  o isopycnal model
• Fine resolution models
  o Real-time forecast
  o Fishing industry
  o Search and rescue
• Coarse resolution models
  o Long term predictions
  o Climate modeling
The Princeton Ocean Model (POM)
• Sigma coordinate model
  o Vertical coordinate is scaled on the water column depth
• Second moment turbulence closure sub-model
  o Provides vertical mixing coefficients
• Solution technique: Mode splitting
  o Internal mode (3D)
    o Vertical structure equations
    o Implicit solution
  o External mode (2D)
    o Vertically integrated equations
    o Explicit solution (Leapfrog method)
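The leapfrog method named for the external mode advances the solution from the value two steps back, using the derivative at the current step. The sketch below shows the bare scheme on a small test problem. It is not POM code: the forward Euler start-up step and the harmonic oscillator test problem are assumptions made for illustration.

```python
def leapfrog(f, y0, dt, steps):
    """Integrate dy/dt = f(y) with y[n+1] = y[n-1] + 2*dt*f(y[n]).

    The first step uses forward Euler to obtain y[1] (an assumed start-up
    choice). Returns the state after `steps` steps, with steps >= 1.
    """
    prev = list(y0)
    curr = [p + dt * k for p, k in zip(prev, f(prev))]
    for _ in range(steps - 1):
        nxt = [p + 2.0 * dt * k for p, k in zip(prev, f(curr))]
        prev, curr = curr, nxt
    return curr


def oscillator(y):
    """Test problem: dq/dt = p, dp/dt = -q (unit-frequency oscillator)."""
    q, p = y
    return [p, -q]
```

Calling `leapfrog(oscillator, [1.0, 0.0], 0.001, 1000)` integrates to t = 1, where the exact solution is q = cos(1) and p = -sin(1).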
Governing equations of the external (2D) mode
• ux, uy mass transport
• η free surface elevation
• Ω angular rotation of the Earth
• Θ latitude
• H depth of the ocean
• g gravitational acceleration
• τw, τb wind and bottom stress
• A lateral viscosity
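$$\frac{du_x}{dt} = 2\Omega\sin\Theta\, u_y - gH\frac{\partial \eta}{\partial x} + \tau_{wx} - \tau_{bx} + A\nabla^2 u_x - \frac{u_x}{H}\frac{\partial u_x}{\partial x} - \frac{u_y}{H}\frac{\partial u_x}{\partial y}$$
$$\frac{du_y}{dt} = -2\Omega\sin\Theta\, u_x - gH\frac{\partial \eta}{\partial y} + \tau_{wy} - \tau_{by} + A\nabla^2 u_y - \frac{u_x}{H}\frac{\partial u_y}{\partial x} - \frac{u_y}{H}\frac{\partial u_y}{\partial y}$$
$$\frac{d\eta}{dt} = -\frac{\partial u_x}{\partial x} - \frac{\partial u_y}{\partial y}$$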
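As a hedged illustration of how the external-mode terms combine at a single grid cell, the sketch below evaluates the right-hand sides for the x mass transport and for the elevation from the cell and its four neighbours. The signs follow the usual shallow-water convention (Coriolis, pressure gradient, wind minus bottom stress, lateral viscosity, advection), because the extracted slide text does not preserve them. The central differences, the equal grid spacing in both directions, the index convention and all function and parameter names are assumptions.

```python
def ux_rhs(ux, uy, eta, H, i, j, dx, f, g=9.81, A=0.0, tau_wx=0.0, tau_bx=0.0):
    """Right-hand side of du_x/dt at interior cell (i, j).

    Assumptions: x runs along the first index, y along the second, the grid
    spacing dx is the same in both directions, and f = 2*Omega*sin(Theta).
    """
    def ddx(a):
        return (a[i + 1][j] - a[i - 1][j]) / (2.0 * dx)

    def ddy(a):
        return (a[i][j + 1] - a[i][j - 1]) / (2.0 * dx)

    laplacian = (ux[i + 1][j] + ux[i - 1][j] + ux[i][j + 1] + ux[i][j - 1]
                 - 4.0 * ux[i][j]) / dx ** 2
    coriolis = f * uy[i][j]
    pressure = -g * H[i][j] * ddx(eta)
    # Non-linear term: the transport itself scales the difference stencil.
    advection = -(ux[i][j] * ddx(ux) + uy[i][j] * ddy(ux)) / H[i][j]
    return coriolis + pressure + tau_wx - tau_bx + A * laplacian + advection


def eta_rhs(ux, uy, i, j, dx):
    """Right-hand side of d(eta)/dt at interior cell (i, j)."""
    return -((ux[i + 1][j] - ux[i - 1][j])
             + (uy[i][j + 1] - uy[i][j - 1])) / (2.0 * dx)
```

An ocean at rest with a flat surface and no wind gives zero for both functions, which is a convenient first check before adding forcing.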
Solution on CNN
• Spatial discretization on a uniform grid
• 3-layer CNN structure
• Non-linear template required for the advection term
• Cannot be solved on analog VLSI CNN chips
• Solvable on the modified Falcon architecture
  o Support of non-linearity
  o Specialized cell model
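$$u_{x,ij}\frac{\partial u_{x,ij}}{\partial x} \;\rightarrow\; A_{x,ij} = \frac{u_{x,ij}}{2\Delta x}\begin{bmatrix} 0 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$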
The modified arithmetic unit of the Falcon architecture
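[Figure: dataflow graph of the modified arithmetic unit, built from multipliers, adders and subtractors. Labelled inputs: f_ij, u_y,ij, recH_ij, w_x,ij, A_ij, gH_ij, neighbour values at (i-1,j) and (i+1,j), and u_x at the cell (ij) and at its four neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1).]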
Implementation on FPGA
• Complicated arithmetic unit
• Fixed-point number representation
• Configurable precision
• High level hardware description language required (e.g. Handel-C)
[Figure: Area requirements. Series: Mult18x18 and 18kbit BRAM; vertical axis 0 to 70; horizontal axis: precision (bit), 10 to 36.]
Performance
[Figure: Speedup compared to an Athlon64 2GHz. Logarithmic speedup axis (1 to 10000) versus precision (bit), 10 to 34, for XC2V1000, XC2V6000 and XC4VSX55.]
[Figure: Number of processors (0 to 30) versus precision (bit), 10 to 36, for XC2V1000, XC2V6000 and XC4VSX55.]
The Seamount problem
Results after 72 hours
[Figure: two panels over X (km) and Y (km), each axis from 0 to 2000. Left: circulation pattern (colour scale 0.2 to 1.6). Right: elevation (colour scale -1.5 to 1.5, x 10^-3).]
Error of the solution
[Figure: Error of the mass transport uy. Logarithmic error axis (1.0E-05 to 1.0E+00) versus precision (bit), 10 to 34, for Case1 to Case6.]
[Figure: Error of the mass transport ux. Logarithmic error axis (1.0E-05 to 1.0E+00) versus precision (bit), 10 to 34, for Case1 to Case6.]
Error of the solution
[Figure: Error of the elevation. Logarithmic error axis (1.0E-07 to 1.0E-02) versus precision (bit), 10 to 34, for Case1 to Case6.]
Memory requirements of the internal (3D) equations
• Extended memory hierarchy
  o New level stores 3 cross sectional slices from the 3D array
    o Large memory required (e.g. 512x512x64 sized grid, 3x512x64 elements per state variable)
    o Cannot be stored on-chip
    o Off-chip storage requires huge I/O bandwidth
• Processor array should be used
  o The 3D array is divided between the processors
  o Optimal data set for on chip storage: 2048 elements per cross sectional slice (512x32x64 sized grid per processor)
  o Each processor located on a separate FPGA
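The memory figures quoted for the 3D case can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the slide. The helper names are hypothetical, and the processor count is derived here on the assumption that the 512x512x64 grid is split evenly along its second dimension into 512x32x64 sub-grids.

```python
def slice_elements(ny, nz):
    """Elements in one cross-sectional slice of an nx x ny x nz grid."""
    return ny * nz


def buffered_elements(ny, nz, slices=3):
    """Elements held when `slices` neighbouring cross sections are stored."""
    return slices * slice_elements(ny, nz)


# Whole grid, 512x512x64: three slices of 512x64 elements per state variable.
full_grid_buffer = buffered_elements(512, 64)

# Sub-grid of 512x32x64 per processor: one slice holds 32x64 elements.
per_processor_slice = slice_elements(32, 64)

# Derived under the even-split assumption, not stated on the slide.
processors_needed = 512 // 32
```

Here `full_grid_buffer` evaluates to 98304 elements per state variable and `per_processor_slice` to the 2048 elements quoted above.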
Solution of the internal (3D) equations
• Implicit solution
  o Fixed-point solution
    o Requires large precision to avoid rounding errors
    o Seems to be impractical
  o Floating-point solution
    o Requires large area (especially add/sub)
• Explicit solution
  o Smaller timestep
  o Simpler arithmetic unit
Conclusions
• Ocean modeling using emulated digital CNN is very promising
• Moderate precision is required in 2D mode
  o 1% accuracy using 24 bits
• Expected speedup (compared to an Athlon64 2GHz microprocessor)
  o 80 times on our RC200 prototyping board
  o 3700 times on the largest available FPGA