Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
CES – Chair for Embedded Systems
ces.itec.kit.edu
Low Power Design of the Next-Generation
High Efficiency Video Coding
Authors: Muhammad Shafique, Jörg Henkel
ces.itec.kit.edu 2 Shafique @ ASPDAC, Jan. 2014
Outline
Introduction to the High Efficiency Video Coding (HEVC)
HEVC Analysis
complexity, memory access, thermal
Power-Efficient HEVC System Design
Conclusion
ces.itec.kit.edu 3 Shafique @ ASPDAC, Jan. 2014
High Efficiency Video Coding (HEVC)
Ultra-HD (or supervision)
7680×4320 ≈ 33 million pixels per frame
By 2017: 80% ─ 90% global internet traffic
New video compression standards/techniques required
JCT-VC’s High Efficiency Video Coding (HEVC)
~2× compression efficiency compared to H.264
0
0.4
0.8
1.2
1.6
1 2 3
Time
Bitrate
(a)
Basketball Kimono PeopleOnStreet
(b)
No
rmal
ized
0
2E+11
4E+11
6E+11
8E+11
1E+12
1.2E+12
1.4E+12
1 2 3
HEVC
H.264/AVC
Mem
ory
BW
. [G
B/s
]
0.5
1.5
1
3
2
2.5
0
HD720 HD1080 2K
Full HD @ 30fps
1 second ≈ 712 Mbits
1 hour ≈ 2.4 Tbits
ces.itec.kit.edu 4 Shafique @ ASPDAC, Jan. 2014
Challenges for Developing
HEVC-based Multimedia Systems
Compute
Complexity
Power
Efficiency
Thermal
Management
Content-Awareness,
HW-SW Collaboration,
Many-core Systems
Parallelization
Workload Balancing,
Arch.-Awareness,
Power Budgeting
Accelerator Design,
Content-Awareness,
Power- Gating
Thermal Analysis,
Configurations,
Content-Adaptive
Challenges & Requirements
Video Memory
Memory Hierarchy
Design,
Content-Aware
ces.itec.kit.edu 5 Shafique @ ASPDAC, Jan. 2014
HEVC Overview: Encoding Flow
Intra Prediction
Inter Prediction
Recursive CU/PU
Size Reduction
‾ Transform and
Quatization
Inverse Transform and Quantization
Recursive TU
Size Reduction
+
Deblocking and SAO Filter
Decoded Picture Buffer
Output Reconstructed
Video
Bitstream Headers
CABAC Entropy Coder
Output Bitstream
Input Video in CTUs
ces.itec.kit.edu 6 Shafique @ ASPDAC, Jan. 2014
HEVC Overview: Slices and Tiles
Slice 0
Slice 1
Slice 2
Slice 3
Tile 0 Tile 1 Tile 2
Tile 3 Tile 4 Tile 5
Core0
f0
F0 FM-1
T0 T1 TK-1
Core1
f1
CoreK-1
fK-1
TK-1
T0 T1
GOP0 F0GOP0
HEVC Parallel Encoding
ces.itec.kit.edu 7 Shafique @ ASPDAC, Jan. 2014
HEVC Overview: Tree-Block Structure
0 1
2 3
4 5
6 7
8 9
10 11
12 13
1415 16
17 18
19 20
21
Example PU
Configuration
0 1
2 3
Tested TU
Configurations
...
Example CU
Configuration
CTU0 CTU1 CTU2
8×8
32×32
16×16
4×4
64×64
CTU
ces.itec.kit.edu 8 Shafique @ ASPDAC, Jan. 2014
CTU Distribution
ces.itec.kit.edu 9 Shafique @ ASPDAC, Jan. 2014
HEVC Overview: Intra and Inter Prediction
Vertical Angular Predictors
Ho
rizo
nta
l A
ng
ula
r P
red
icto
rs
0: Planar
1: DC
2log 2 2
02
M i
iiN
─ HEVC-Intra: ~2.56× more mode decisions than H.264
─ HEVC-Inter: ~2.2× more complex than H.264
2log 3 2
013 2
M i
i
64×64
32×32
32×32
16×16
16×16
HEVC Intra Prediction HEVC Inter Prediction
ces.itec.kit.edu 10 Shafique @ ASPDAC, Jan. 2014
HEVC Overview: Motion Estimation
Block Matching (BM) or Motion Estimation (ME)
Compression by searching temporal neighbors
High energy/time, high compression efficiency (H.264-Inter, HEVC-Inter)
Search
Window Current Frame
Current
Block
Best
Matching
Motion
Vector
Previous Frame
Current Frame Reference Frame Residue Frame
ces.itec.kit.edu 11 Shafique @ ASPDAC, Jan. 2014
HEVC Overview: Search Data Fetching
External
Memory
(DRAM)
On-Chip
Memory
(SRAM)
Block
Matching
External Memory Bus
Reference Frame
Current Frame Current
Block
Search
Window
High leakage
High dynamic
Very high
leakage High bus
power
A memory subsystem with low power
consumption and high efficiency is
required…
ces.itec.kit.edu 12 Shafique @ ASPDAC, Jan. 2014
Outline
Introduction to the High Efficiency Video Coding (HEVC)
HEVC Analysis
complexity, memory access, thermal
Power-Efficient HEVC System Design
Conclusion
ces.itec.kit.edu 13 Shafique @ ASPDAC, Jan. 2014
HEVC Analysis: Computational Complexity
CU/PU Partitioning
0
20
40
60
80
100
BasketballDrill832×480
ParkScene1920×1080
PeopleOnStreet2560×1600
64×64 32×32 16×16 8×8 4×4Pe
rcen
tage
Are
a
Low Variance Regions
High Variance Regions
Large partitions for low-variance and homogeneous image areas and vice-versa
Smooth texture (due to larger QP or resolution) is usually captured by
larger sized PUs
Early PU size prediction may provide
significant reduction in computational
and energy requirements
ces.itec.kit.edu 14 Shafique @ ASPDAC, Jan. 2014
HEVC Analysis: CTU Distribution
ces.itec.kit.edu 15 Shafique @ ASPDAC, Jan. 2014
HEVC Analysis: Memory Accesses
Memory Access for Motion Estimation
Memory accesses of HEVC ≈ 3.86× of H.264
Most of the on-chip memory is wasted (leakage power)
0%
25%
50%
75%
100%
H.264 HEVC
0
20
40
60
80
100
Keiba BasketballDrill RaceHorses KristenAndSara
(a)
Pe
rce
nta
ge
Uti
lizat
ion
Median75 %
25 %
Maximum
Minimum
(b)
─ Only a part of the full search window is utilized
─ Adapting the search window size at run-time provides
increased potential for leakage power savings
ces.itec.kit.edu 16 Shafique @ ASPDAC, Jan. 2014
Using a thermal camera setup
Copyright: © Chair for Embedded Systems (CES),
Karlsruhe Institute of Technology (KIT), Germany
DIAS Pyroview thermal camera operates
at 50Hz with spatial resolution of 50 µm
Peltier Based
Cooling
IR Camera
CPU chip
Thermal map
Water-cooling unit to cool
down the thermoelectric device
Voltage
supply Linux Ubuntu
kernel A bottom view
Water heat sink
Thermoelectric
device
Copper plate
Thermal pad
Intel Atom 45nm dual-core processor
(1.8 GHz) Src: Intel
ces.itec.kit.edu 17 Shafique @ ASPDAC, Jan. 2014
Temperature Measurements for HEVC
[RaceHorses@37QP vs. 22QP]
Temp max.: 53.0°C
Temp min.: 35.0 °C
Temp avg.: 49.0 °C
Temp max.: 55.0°C
Temp min.: 36.0 °C
Temp avg.: 53.0 °C
Copyright: © Chair for Embedded Systems (CES),
Karlsruhe Institute of Technology (KIT), Germany hevcDTM @ DATE‘14
ces.itec.kit.edu 18 Shafique @ ASPDAC, Jan. 2014
HEVC Analysis: Temperature
40
45
50
55
60
1350 1400 1450 1500 1550
Tem
peratu
re ( C
)
Time (sec)
Keiba (1.8 GHz)
Basketball (1.35 GHz)
62 ºC
56 ºC
50 ºC
44 ºC
40
45
50
55
60
1000 1050 1100 1150
Tem
per
atu
re ( C
)
Time (sec)
Keiba Basketball
62 ºC
60 ºC
58 ºC
56 ºC
54 ºC
52 ºC
50 ºC
48 ºC
46 ºC
44 ºC
Frequency Dependence Content Dependence
So What is Required?
Interplay between Software and Hardware needs
to be explored for power/energy optimization
1. Optimized Algorithms for Fast Intra- and Inter-
Prediction
2. Energy-Efficient Hardware Accelerators
3. Energy-Efficient Video Memory Heirarchy
4. Content-Adaptive Power Management
ces.itec.kit.edu 19 Shafique @ ASPDAC, Jan. 2014
Outline
Introduction to the High Efficiency Video Coding (HEVC)
HEVC Analysis
complexity, memory access, thermal
Power-Efficient HEVC System Design
Conclusion
ces.itec.kit.edu 20 Shafique @ ASPDAC, Jan. 2014
Power Efficient HEVC Design:
Hardware Architecture
Video Tile
Formation
Data Analysis
and Statistics
Adaptive
Workload
Budgeting
Complexity
Reduction
Scheme
Energy to
Quality
TradeoffApplication
Driven
Adaptive
Power/
Thermal
Manager
HEVC
Encoding
Intra/Inter
HEVC Software Layer
CTR
CTR
CTR
CTR
CTR
CTR
CTR
CTR
CTR
Battery
HEVC Hardware
Processing
Architecture
Feedback
Monitors to
Software
...
...
...Off-chip
DRAM
ces.itec.kit.edu 21 Shafique @ ASPDAC, Jan. 2014
Analysis and Statistics
0
200
400
600
800
1000
1200
1400
1600
1800
2000
0 20 40 60 80 100 120 140 160 180 200
Fre
qu
en
cy
Variance
8x816x1632x3264x64
1
10
100
1000
10000
100000
1 10 100 1000 10000 100000
Fre
qu
en
cy
Distortion
8x816x1632x3264x64
PD
F
Variance
0 20 40 60 140 160 180 20080 100 120 101 100 1000 10000 100000
Distortion
PD
F
Parameter Value SAD SSE SATD Kbps
Max. CU
Depth
4 771 263 51 3001.7
3 659.15 229.08 42.1 3320.9
2 372.92 153.37 28.6 3363.1
Search
Range
64 771 263 51 3001.7
32 553.92 263.26 50.9 3080.44
16 472.37 262.78 52.82 3738.83
AMP 1 771 263 51 3001.7
0 665.74 237.1 44.27 3072.92
Variance and Motion based Classification
ces.itec.kit.edu 22 Shafique @ ASPDAC, Jan. 2014
Complexity Reduction: PU Size Estimation
CTU variance computation at 4×4
Recursive 4 neighbors merge
PU Map (PUM)
PU Map Above (PUMA)
HEVC CTU Compressor
1
2
0
1
1
n
i x
i
v xn
, {1,2,3,4}
, {1,2,3,4} OR
c i i
c Th i i Th
v CombineVariances v
if v v v v
MergeBlocks
2
2
log4 1ln
1 220
Th v
Hv QP
Rayleigh CDF
Analysis
Empirical
Analysis
µv = Mean of variance curve
Δ = CDF threshold (0.8)
H = Size of PU to combine
ces.itec.kit.edu 23 Shafique @ ASPDAC, Jan. 2014
Time Savings and Video Quality Results
Sequence Class Size BD-PSNR BD-Rate
Traffic A 4K -0.03048 0.5611
BasketballDrive B 1080p -0.04966 3.1834
BasketballDrill C WVGA -0.05175 1.0846
BQSquare D WQVGA -0.03365 0.3802
RaceHorses D WQVGA -0.03009 0.4482
Johnny E 720p -0.08711 2.1241
BasketballDrillText F WVGA -0.05827 1.1123
0
0.2
0.4
0.6
0.8
1
1.2
Traffic BasketballDrive BasketballDrill BQSquare Johnny
RDO Proposed
No
rma
lize
d T
ime
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Traffic BasketballDrive BasketballDrill BQSquare RaceHorses JohnnyTraffic BasketballDrive
BasketballDrill
BQSquare RaceHorses Johnny BasketballDrillText
39.544.3
37.8 37.042.1
57.4
39.5
ces.itec.kit.edu 24 Shafique @ ASPDAC, Jan. 2014
Tile Mapping and Parallelization
Output
Tile Formation and Maximum Workload Estimator
Workload(per core)
Frequency (per CPU)
Offline Tuning
Video Input
Cores Frame Rate fpCPUs Max freq. fmax
Core0
User bit-rate tolerance n
Workload Manager
Intra Mode Prediction
Core Frequency Selector
Workload Allocator
Monitoring Unit
Threshold Generator
Workload Adaptation
Total Intra Angles (θ)
Core1
Core2 Core3
CoreK-2 CoreK-1
...
50
60
70
80
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
Tile 0 Tile 1
Tile 3 Tile 4
Tim
e [
mse
c]
Frame
Workload is not equal for tiles
ces.itec.kit.edu 25 Shafique @ ASPDAC, Jan. 2014
HEVC Thermal Management
Application-Driven DTM
HEVC
Encoder
Application
driven DTM
Core0
Sensor
Core1
Sensor
Extract Motion
Intensity
Frequency
scaling
NO
Execute HM
YES
Tcurrent > Tcritical
ces.itec.kit.edu 26 Shafique @ ASPDAC, Jan. 2014
HEVC Thermal Management
40
45
50
55
60
No DTM DTM 54ºC DTM 50ºC DTM 46ºC
Tem
per
atu
re (
ºC)
Max Average Min
0
50
100
150
PSNR (dB) Bit rate (kbps)
No DTM Our 54°C Our 50°C Our 46°C
56 ºC
54 ºC
52 ºC
50 ºC
48 ºC
46 ºC
44 ºC
42 ºC
40 ºC
38 ºC
ces.itec.kit.edu 27 Shafique @ ASPDAC, Jan. 2014
HEVC Thermal Management
Keiba BasketballDrill
40
45
50
55
60
0 10 20
Pea
k t
emp
eratu
re (
ºC)
# Frames
No DTM DTM 54ºC
DTM 50ºC DTM 46ºC40
45
50
55
60
0 10 20
Pea
k t
emp
eratu
re (
ºC)
# Frames
No DTM DTM 54ºC
DTM 50ºC DTM 46ºC
ces.itec.kit.edu 28 Shafique @ ASPDAC, Jan. 2014
CTR
CTR
CTR
CTR
CTR
CTR
CTR
CTR
CTR
Battery
HEVC Hardware
Processing
Architecture
Feedback
Monitors to
Software
...
...
...
Power Efficient HEVC Design:
Hardware Architecture
ces.itec.kit.edu 29 Shafique @ ASPDAC, Jan. 2014
Hardware Accelerators
HW0 HW2N/8
CTU Row
8 PPC
M
HW1 HW2Intra Predictor
0
200
400
600
800
1000
1 2 3 4 5 6
Slice LUTs (luma)
Slice LUTs (chroma)
Slice registers (luma)
Slice registers (chroma)
Occupied Slices (luma)
Occupied Slices (chroma)
Legend:
Number of datapaths in parallel
0
200
400
600
800
1000
1 2 3 4 5 6
Slice LUTs (luma)
Slice LUTs (chroma)
Slice registers (luma)
Slice registers (chroma)
Occupied Slices (luma)
Occupied Slices (chroma)
Legend:
Number of datapaths in parallel
ces.itec.kit.edu 30 Shafique @ ASPDAC, Jan. 2014
AMBER: Memory Subsystem
External Memory holds the current frame
High density, low read and write power
On-chip SRAM memory (FIFO) holds only the current block
High read and write speed and low dynamic write power
Hides latencies from HEVC engine
SRAM Block FIFO
Power Control
External Memory
(Current Frame)
External Memory
Controller
On-chip Current Data (Block) SRAM
HEVC Video Compression Control
Block Matching
Engine
++++
+
++
HEV
C E
nco
de
r (T
ran
sfo
rm lo
op
)
MRAM Buffers(N Reference Frames) Reference
Write Master
Current Read Master- Reads current frame data
- Writes SRAM Buffer -Low write amount
-Fast Write
Reference Read Master- Reads reference frames
- Low latency read
ces.itec.kit.edu 31 Shafique @ ASPDAC, Jan. 2014
AMBER: MRAM Reference Buffers
One MRAM buffer holds a full reference frame
Each column (sector) of reference buffer is power-gated
Reference read and write masters read and write data to
the MRAM buffer
WW
H Reference F1 H Reference FN
MRAM Power Gate Control
Reference Read Master
Ro
w B
uff
er
SRAM FIFO
Re
fere
nce
Wri
te
Mas
ter
Block MatchingHEVC Encoder
MRAM Reference Buffers
ces.itec.kit.edu 32 Shafique @ ASPDAC, Jan. 2014
AMBER: Reference Buffer Power Management
Observation: Not all of the search window is used
Block matching algorithm accesses only a small percentage of
reference buffer sectors
Power-gate unused sectors
Reduce leakage
Block Matching
Turned OFF
Turned ON
s1 s2
1 x min
2 x max
s CU s
s CU sPrediction of Unused Sectors is based on:
1. Self-Organizing Map
2. Content Properties
ces.itec.kit.edu 33 Shafique @ ASPDAC, Jan. 2014
0
0.5
1
1.5
2
Search Window
AMBER
Power Consumption (4 reference frames) P
ow
er
[W]
Keiba China
Speed
Four
People
Basketball
Drive
People
129×129 193×193 257×257
Keiba ChinaSpeed FourPeople BasketballDrive People
832x480 1024x768 1280x720 1920x1080 2560x1600
─ Increasing the number of reference frames improves the
power consumption of the AMBER system compared to the
search window approach
ces.itec.kit.edu 34 Shafique @ ASPDAC, Jan. 2014
Conclusion
Comprehensive analysis of HEVC
Architecture, power, thermal and complexity
Challenges posed by HEVC
Architectural (memory, reconfiguration, accelerators)
Power/thermal (power-gating, configuration control)
Complexity (parallelization, many-core, workload balancing)
Both Hardware and Software need to be optimized while
leveraging the application-specific knowledge
Our approach
Adaptive complexity management
Video tiling, workload budgeting, CU/PU partitioning
Power and thermal aware HEVC configuration
Hybrid video memory hierarchy with content-driven power-gating
ces.itec.kit.edu 35 Shafique @ ASPDAC, Jan. 2014
ces265: Multi-threaded HEVC Encoder
Open-source
C++ based
Multithreading via
pthread API
One thread of ces265
≈ 13.2× faster than
HM-9.2
Web
http://ces.itec.kit.edu/ces265/
Download
https://sourceforge.net/projects/ces265/
HEVC-Intra Encoder’s Top
Workload Allocator
GOP Compressor
Tile Compressor Threads
Slice Compressor
WorkloadQueue
Tile Formation and Workload
Curtailing
YUV ReadWrite
CTU Compressor
Workload Manager
Encoder Statistics
Sniper many-core x86 simulator
Simulator statistics McPAT power simulator Power statistics
Proposed HEVC Intra Encoder
SystemConfiguration
ces.itec.kit.edu 36 Shafique @ ASPDAC, Jan. 2014
Acknowledgement
Muhammad
Usman Karim
Khan
Daniel
Palomino
Claudio M.
Diniz
Felipe
Sampaio
ces.itec.kit.edu 37 Shafique @ ASPDAC, Jan. 2014
Thank you!
Questions?
Web: http://ces.itec.kit.edu/ces265/
Download: https://sourceforge.net/projects/ces265/