
Energy-Efficient System Design for Mobile Processing Platforms

by

Rahul Rithe

B.Tech., Indian Institute of Technology Kharagpur (2008)
S.M., Massachusetts Institute of Technology (2010)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2014 (signature redacted)

Certified by: Anantha P. Chandrakasan, Joseph F. and Nancy P. Keithley Professor of Electrical Engineering, Thesis Supervisor (signature redacted)

Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students (signature redacted)


Energy-Efficient System Design for Mobile Processing Platforms

by

Rahul Rithe

Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

Portable electronics has fueled the rich emergence of multimedia applications that have led to the exponential growth in content creation and consumption. New energy-efficient integrated circuits and systems are necessary to enable the increasingly complex augmented-reality applications, such as high-performance multimedia, "big-data" processing and smart healthcare, in real-time on mobile platforms of the future. This thesis presents an energy-efficient system design approach with algorithm, architecture and circuit co-design for multiple application areas.

A shared transform engine, capable of supporting multiple video coding standards in real-time with ultra-low power consumption, is developed. The transform engine, implemented using 45 nm CMOS technology, supports Quad Full-HD (4k×2k) video coding with reconfigurable processing for the H.264 and VC-1 standards at 0.5 V and operates down to 0.3 V to maximize energy-efficiency. Algorithmic and architectural optimizations, including matrix factorization, transpose memory elimination and data dependent processing, achieve significant savings in area and power consumption.

A reconfigurable processor for computational photography is presented. An efficient implementation of the 3D bilateral grid structure supports a wide range of non-linear filtering applications, including high dynamic range imaging, low-light enhancement and glare reduction. The processor, implemented using 40 nm CMOS technology, enables real-time processing of HD images, while operating down to 0.5 V and achieving 280× higher energy-efficiency compared to software implementations on state-of-the-art mobile processors. A scalable architecture enables 8× energy scalability for the same throughput performance, while trading off output resolution for energy.

Widespread use of medical imaging techniques has been limited by factors such as size, weight, cost and complex user interfaces. A portable medical imaging platform for accurate objective quantification of skin condition progression, using robust computer vision techniques, is presented. Clinical validation shows 95% accuracy in progression assessment. Algorithmic optimizations, reducing the memory bandwidth and computational complexity by over 80%, pave the way for energy-efficient hardware implementation to enable real-time portable medical imaging.

Thesis Supervisor: Anantha P. Chandrakasan
Title: Joseph F. and Nancy P. Keithley Professor of Electrical Engineering


Acknowledgments

Since the first time I came to MIT in August 2008 and navigated my way to 38-107, trying to make sense of MIT's (still) incomprehensible building numbering system, it has been a wonderful journey of exploration - filled with numerous challenges and exciting rewards of scientific discovery.

I have been fortunate to have had exceptional advisors and mentors to guide me through this journey. I am extremely grateful to my advisor, Prof. Anantha Chandrakasan, for being a great mentor, role model and a constant source of inspiration. I learned from Anantha that conducting great research is a process that involves working in collaboration with researchers, industry partners and funding agencies, while constantly pushing the boundaries of the state-of-the-art. The collaborative research environment that Anantha has fostered in the lab not only motivated me to produce great results but also afforded the opportunities to work with graduate and undergraduate students and learn how to mentor and motivate others in realizing their full potential as researchers. I learned invaluable lessons in organization and management, inspired by Anantha's visionary leadership of EECS while managing a large research group. Thank you Anantha for giving me the freedom to explore my interests and helping me grow both professionally and personally throughout my graduate studies at MIT!

I am thankful to the members of my Ph.D. thesis committee, Prof. William Freeman, Prof. Li-Shiuan Peh and Prof. Vivienne Sze, for their advice, feedback and support. Prof. Freeman's advice on the computer vision related work for medical imaging was extremely valuable. I would like to thank Vivienne for her help and support throughout my graduate work at MIT - first as a senior graduate student and then as a faculty member at MIT - from helping me learn digital design to long discussions about research and reviewing paper drafts. I am extremely grateful to Prof. Frédo Durand for several valuable discussions on topics ranging from research to photography to career options.

I had the privilege of working with Dr. Dennis Buss, chief scientist at Texas Instruments and visiting scientist at MIT, during my master's research. I am immensely thankful to Dennis for all the insightful discussions over the last six years on topics ranging from research and industry collaboration to the past, present and future of the semiconductor industry.


The work was made possible by the generous support of our industry partners. I would like to acknowledge the Foxconn Technology Group, Texas Instruments and the MIT Presidential Fellowship for providing funding support and the TSMC University Shuttle Program for chip fabrication.

I consider teaching to be an integral part of the graduate experience and I am grateful to Prof. Harry Lee for giving me the rare opportunity to serve as a recitation instructor for the undergraduate 'Circuits and Electronics' class. I would like to thank Prof. Harry Lee, Prof. Karl Berggren, Prof. John Kassakian and Prof. Khurram Afridi for helping me further my passion for teaching and enhance my abilities as a teacher.

One of the best things about MIT is the people you get to interact and work with day-to-day. I would like to thank Chih-Chi Cheng and Mahmut Sinangil for working long hours with me on the video coding project. I am extremely thankful to Priyanka Raina, Nathan Ickes and Srikanth Tenneti for their tremendous help in bringing the computational photography project from an idea to a live demonstration platform. It has been a great experience for me to work with two 'SuperUROP' students - Michelle Chen and Qui Nguyen - on the smartphone-based medical imaging platform and I am thankful to them for being such enthusiastic collaborators. I would also like to thank Dr. Vaneeta Sheth from the Brigham and Women's Hospital for bringing her dermatology expertise to our medical imaging work and conducting a pilot study to demonstrate its effectiveness during treatment.

When I first arrived at MIT, I could not have imagined a work environment better than what Ananthagroup has offered me over the last six years. It has been an absolute pleasure to work with all the members of Ananthagroup - past and present. The diverse set of expertise, thoughtful discussions and "procrastination circles" have helped create the best workplace for research. All work and no play is no fun. I would like to thank Masood Qazi for teaching me everything I know about playing squash and for those amazing trips to Burdick's for the best hot chocolate ever! I would also like to thank the members of the "Ananthagroup Tennis Club" - Arun, Phil and Nachiket - for quite a few evenings well spent, braving wind, rain and cold on the tennis courts.

Margaret Flaherty, our administrative assistant, is the reason everything in 38-107 runs so smoothly. I would like to thank Margaret for her relentless work and attention to detail.


Saurav Bandyopadhyay, Rishabh Singh and I went to IIT Kharagpur together and continued our journey at MIT together, including that first crammed flight from Delhi to Boston. I am extremely thankful to Saurav and Rishabh for being such great friends over the years.

The foundation of my work rests on the unconditional love and support from my family. The pride and joy of my late grandparents, Nirmalabai and Namdevrao Wankhade, in every one of my achievements over the years has been and will continue to be a constant source of inspiration for me. The love of my grandfather, Panjabrao Rithe, for education and the hardships he endured for it have been the driving force for me on this academic journey. The steadfast belief of my parents, Rajani and Jagdish Rithe, and my sister Bhagyashree, their support through all my endeavors and their encouragement to follow my dreams have made this journey from a small village in India to the present moment possible. And for that I am eternally grateful!

Rahul Rithe

Cambridge, MA

01 MAY 2014


Contents

1 Introduction
  1.1 Mobile Computing Challenges
  1.2 Energy-Efficient System Design
    1.2.1 Parallel Processing
    1.2.2 Application Specific Processing
    1.2.3 Reconfigurable Hardware
    1.2.4 Low-Voltage Circuits
  1.3 Thesis Contributions

2 Transform Engine for Video Coding
  2.1 Transform Engine Design
    2.1.1 Integer Transform: H.264/AVC & VC-1
    2.1.2 Matrix Factorization for Hardware Sharing
    2.1.3 Eliminating Transpose Memory
    2.1.4 Data Dependent Processing
  2.2 Future Video Coding Standards
  2.3 Statistical Methodology for Low-Voltage Design
  2.4 Implementation
  2.5 Measurement Results
  2.6 Summary and Conclusions


3 Reconfigurable Processor for Computational Photography
  3.1 Bilateral Filtering
    3.1.1 Bilateral Grid
  3.2 Bilateral Filter Engine
    3.2.1 Grid Assignment
    3.2.2 Grid Filtering
    3.2.3 Grid Interpolation
    3.2.4 Memory Management
    3.2.5 Scalable Grid
  3.3 Applications
    3.3.1 High Dynamic Range Imaging
    3.3.2 Glare Reduction
    3.3.3 Low-Light Enhanced Imaging
  3.4 Low-Voltage Operation
    3.4.1 Statistical Design Methodology
    3.4.2 Multiple Voltage Domains
  3.5 Memory Bandwidth Optimization
  3.6 Measurement Results
    3.6.1 Energy Scalable Processing
    3.6.2 Energy Efficiency
  3.7 System Integration
  3.8 Summary and Conclusions

4 Portable Medical Imaging Platform
  4.1 Skin Conditions - Diagnosis & Treatment
    4.1.1 Clinical Assessment: Current Approaches
    4.1.2 Quantitative Dermatology
  4.2 Skin Condition Progression: Quantitative Analysis
    4.2.1 Color Correction


    4.2.2 Contour Detection
    4.2.3 Progression Analysis
    4.2.4 Auto-tagging
    4.2.5 Skin Condition Progression: Summary
  4.3 Experimental Results
    4.3.1 Clinical Validation
    4.3.2 Progression Quantification
    4.3.3 Auto-tagging Performance
    4.3.4 Energy-Efficient Processing
    4.3.5 Limitations
  4.4 Mobile Application
  4.5 Multispectral Imaging: Future Work
  4.6 Summary and Conclusions

5 Conclusions and Future Directions
  5.1 Summary of Contributions
    5.1.1 Video Coding
    5.1.2 Computational Photography
    5.1.3 Medical Imaging
  5.2 Conclusions
  5.3 Future Directions
    5.3.1 Computational Photography and Computer Vision
    5.3.2 Portable Medical Imaging

A Integer Transform
  A.1 H.264/AVC Integer Transform
  A.2 VC-1 Integer Transform

B Clinical Pilot Study for Vitiligo Progression Analysis
  B.1 Subjects for Pilot Study


  B.2 Progression Analysis

Acronyms

Bibliography


List of Figures

1-1 Evolution of computing and multimedia processing. (Analytical Engine: London Science Museum)
1-2 Processor feature scaling and Performance/Watt trends. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
1-3 Processor energy/operation scaling with performance. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
1-4 Energy efficiency of processors: from CPUs to ASICs.
1-5 Delay scaling with VDD. Corner delay scales by 15×, whereas total delay (corner + 3σ stochastic delay) scales by 36×.
2-1 Hardware architecture of the even component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
2-2 Hardware architecture of the odd component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
2-3 Column-wise 1D transform: 8×8 data is processed over four clock cycles, C0 to C3: columns 0 and 7 in C0, 1 and 6 in C1, 2 and 5 in C2, 3 and 4 in C3. Two transformed columns are generated in each clock cycle.


2-4 Row-wise 1D transform: Partial products for all 64 coefficients are computed in each clock cycle, using the 2×8 data obtained by transposing the two columns generated by the 1D column-wise transform. The partial products are stored in the output buffer. At the end of four clock cycles, the output buffer contains the complete 2D transformed output.
2-5 Hardware architecture of the (a) even and (b) odd component. Std = {0: H.264, 1: VC-1}.
2-6 Histogram of the prediction residue for a number of test sequences.
2-7 Correlation between input switching activity and system switching activity. The plot also shows linear regression for the data. Measured correlation is 0.83.
2-8 Switching activity and power consumption in the transform as a function of DC bias applied to the input data.
2-9 Hardware architecture of the even component for the shared 8×8 transform for H.264, VC-1 and HEVC. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-1.
2-10 Hardware architecture of the odd component for the shared 8×8 transform for H.264, VC-1 and HEVC. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-2.
2-11 Switching activity in the HEVC transform as a function of DC bias applied to the input data.
2-12 Delay PDF of a representative timing path at 0.5 V. The STA estimate of the global corner delay is 14.1 ns, the 3σ delay estimate using Gaussian SSTA is 23.2 ns and the 3σ delay estimate using Monte-Carlo analysis is 31.8 ns.
2-13 Graphic illustration in xi-space of the convolution integral, and the operating point.
2-14 Delay PDF of a representative timing path at 0.5 V, estimated using Gaussian SSTA, Monte-Carlo and OPA.


2-15 Typical timing path.
2-16 OPA based statistical design methodology for low voltage operation.
2-17 Block diagram of the 2D transform engine design.
2-18 Die photo and design statistics of the fabricated IC.
2-19 Measured power consumption and frequency scaling with VDD for different transform implementations. (a) Frequency scaling with VDD, (b) power consumption while operating at the frequency shown in (a).
2-20 Power consumption for transform modules with and without transpose memory, with and without shared architecture for H.264 and VC-1.
2-21 Switching activity and power consumption in the transform as a function of DC bias applied to the input data.
3-1 System block diagram for the reconfigurable computational photography processor.
3-2 Comparison of Gaussian filtering and bilateral filtering. Bilateral filtering effectively reduces noise while preserving scene details.
3-3 Construction of a 3D bilateral grid from a 2D image.
3-4 Architecture of the bilateral filtering engine. Grid scalability is achieved by gating processing engines and SRAM banks.
3-5 Architecture of the grid assignment engine.
3-6 Architecture of the convolution engine for grid filtering.
3-7 Architecture of the interpolation engine. Trilinear interpolation is implemented as three pipelined stages of linear interpolations.
3-8 Memory management by task scheduling.
3-9 Camera curves that map the pixel intensity values onto the incident exposure.
3-10 HDR creation module.
3-11 HDR image scaled to 8 bit/pixel/color for displaying on LDR media. (HDR radiance map courtesy Paul Debevec [121].)


3-12 Processing flow for HDR creation and tone-mapping for displaying HDR images on LDR media.
3-13 Tone-mapped HDR image. (HDR radiance map courtesy Paul Debevec [121].)
3-14 Processor configuration for HDR imaging.
3-15 Input low-dynamic range images: (a) under-exposed image, (b) normally exposed image, (c) over-exposed image. Output image: (d) tone-mapped HDR image.
3-16 Contrast adjustment module. Contrast is increased or decreased depending on the adjustment factor.
3-17 Processing flow for glare reduction.
3-18 Processor configuration for glare reduction.
3-19 (a) Input image with glare. (b) Output image with reduced glare.
3-20 Processing flow for low-light enhancement.
3-21 Processor configuration for low-light enhancement.
3-22 Generating a mask representing regions with high scene details.
3-23 Merging flash and no-flash images with shadow correction.
3-24 (a) Image with flash, (b) image without flash, (c) no-flash base layer, (d) flash detail layer, (e) edge mask, (f) low-light enhanced output.
3-25 Input images: (a) image with flash, (b) image without flash. Output image: (c) low-light enhanced image.
3-26 Comparison of the image quality performance from the proposed approach with that of [138] and [139]. (a) Output from our approach, (b) output from [138], (c) output from [139], (d) difference image between (a) and (b), amplified 5×, (e) difference image between (a) and (c), amplified 5×.
3-27 Delay PDF of a representative timing path from the computational photography processor at 0.5 V. The STA estimate of the global corner delay is 21.9 ns; the 3σ delay estimate using OPA is 36.1 ns.


3-28 Separate voltage domains for logic and memory. Level shifters are used to transition between domains.
3-29 Memory bandwidth and estimated power consumption for 2D bilateral filtering, 3D bilateral grid and bilateral grid with memory management using task scheduling.
3-30 Die photo of the testchip. Highlighted boxes indicate SRAMs. HDR, CR and SC refer to HDR create, contrast reduction and shadow correction modules respectively.
3-31 Processor performance: trade-off of energy vs. performance for varying VDD.
3-32 Processor area (number of gates) and power breakdown.
3-33 Energy scalable processing. Grid resolution vs. energy trade-off at 0.9 V.
3-34 Energy/resolution scalable processing. HDR imaging outputs for (a) grid block size: 16×16, intensity levels: 16, (b) grid block size: 128×128, intensity levels: 16, (c) grid block size: 16×16, intensity levels: 4, (d) grid block size: 128×128, intensity levels: 4.
3-35 Energy/resolution scalable processing. Low-light enhancement outputs for (a) grid block size: 16×16, intensity levels: 16, (b) grid block size: 128×128, intensity levels: 16, (c) grid block size: 16×16, intensity levels: 4, (d) grid block size: 128×128, intensity levels: 4.
3-36 Energy efficiency of processors ranging from CPUs and mobile processors to FPGAs and ASICs.
3-37 Processor integration with external memory, camera and display.
3-38 Printed circuit board and system integration with camera and display.


4-1 Standardized assessments for estimating the degree of pigmentation to derive the Vitiligo Area Scoring Index. At 100% depigmentation, no pigment is present; at 90%, specks of pigment are present; at 75%, the depigmented area exceeds the pigmented area; at 50%, the depigmented and pigmented areas are equal; at 25%, the pigmented area exceeds the depigmented area; and at 10%, only specks of depigmentation are present. (Figure reproduced with permission from [167])
4-2 Processing flow for skin lesion progression analysis.
4-3 Color correction by histogram matching. Images captured with normal room lighting (a) and with color chart white-balance calibration (b). Images after color correction and contrast enhancement (c) of images in (a).
4-4 Level set segmentation. (a) Original image with intensity inhomogeneity and initialization of the level set function. (b) Homogeneous image obtained at the end of iterations and the corresponding level set function.
4-5 Narrowband implementation of level set segmentation. LSM variables are tracked only for pixels that fall within a narrow band defined around the zero level set in the current iteration.
4-6 Number of pixels processed using the narrowband implementation over 50 LSM iterations.
4-7 Lesion segmentation using K-means.
4-8 Contour evolution for lesion segmentation using narrowband LSM.
4-9 SIFT feature matching performed on the highlighted narrow band of pixels in the vicinity of the contour.
4-10 Color correction for a sequence of images by R, G, B histogram modification. (a) Original image sequence, (b) color corrected image sequence. The lesion color changes due to phototherapy.
4-11 Image segmentation using LSM for lesion contour detection despite intensity/color inhomogeneities in the image.


4-12 Image registration based on matching features with respect to the reference image at the beginning of the treatment.
4-13 Sequence of images during treatment. (a) Images captured with normal room lighting. (b) Processed image sequence.
4-14 Image registration through feature matching. (a) Images of a lesion from different camera angles, (b) images after contour detection and alignment. Area matches to 98% accuracy and pixel overlap to 97% accuracy.
4-15 Progression analysis. (a) Artificial image sequence with known area change, created from a lesion image. (b) Image sequence after applying scaling, rotation and perspective mismatch. (c) Output image sequence after lesion alignment and fill factor computation.
4-16 Memory bandwidth and estimated power consumption for full image LSM and SIFT compared to the optimized narrowband implementations of LSM and SIFT.
4-17 Image segmentation fails to accurately identify lesion contours where the lesions don't have well defined boundaries.
4-18 Architecture of the mobile application with cloud integration.
4-19 User interface of the mobile application. (Contributed by Michelle Chen and Qui Nguyen.)
4-20 A conceptual diagram of the portable imaging module for multispectral polarized light imaging.
5-1 Secure cloud-based medical imaging platform.
B-1 Progression of skin lesions over time. Lesion contours are identified from the color corrected images and the lesions are aligned using SIFT feature matching to determine the fill factor.


List of Tables

2.1 Separable 2D transform definitions for H.264/AVC and VC-1
2.2 Row-wise transform computations for even-odd components over four clock cycles
2.3 Full-chip Timing Analysis
2.4 Transform engines implemented in this design
2.5 Measurement results for implemented transform modules
2.6 Overheads and advantages of proposed ideas
2.7 Performance comparison of proposed approach with previous publications
3.1 Setup/Hold Timing Analysis at 0.5 V
3.2 Performance comparison with mobile processor implementations at 0.9 V
4.1 Summary of clinical assessment and quantitative dermatology approaches
4.2 Bit Width Representations of LSM Variables
4.3 Performance enhancement through algorithmic optimizations
B.1 Demographics of the subjects for clinical study
B.2 Progression of Skin Lesions During Treatment


Chapter 1

Introduction

In 1837, Charles Babbage proposed the concept of the Analytical Engine [1], the first Turing-complete computer with an arithmetic logic unit, control flow and integrated memory. Had it been completely built, the Analytical Engine would have been vast and would have needed to be operated by a steam engine [2]. The idea of computing devices that are astronomically more powerful and yet can fit in the palm of a person's hand, while operating on tiny batteries built into the devices themselves, would have been unthinkable. Integrated circuits, driven by the semiconductor process scaling following "Moore's Law" [3] and "Dennard Scaling" [4] over the last half century, have transformed computing through exponential enhancements in performance, power efficiency and cost.

Today we are moving ever closer to the era of all computing being mobile. The vision of ubiquitous computing [5] and portable wireless terminals for real-time multimedia access and processing, heralded by the Xerox ParcTab [6] and the InfoPad [7,8], has become a reality with the emergence of portable multimedia devices like smartphones and tablets. We are surrounded by computing devices that form the "internet of things" - gateways to the hyper-connected world.

The exponential growth in computing has fueled advances in increasingly complex multimedia processing applications - from the first color photograph, created by Thomas Sutton and James Clerk Maxwell in 1861 based on Maxwell's three-color method¹ [9], to modern day multimedia processing capabilities that have enabled real-time High Definition (HD) video, computational photography, computer vision and graphics, and biological and biomedical imaging. Figure 1-1 shows the evolution of computing and multimedia processing.

Figure 1-1: Evolution of computing and multimedia processing. (Analytical Engine: London Science Museum)

Next generation mobile platforms will need to extend these capabilities multifold to enable efficient multimedia processing, natural user interfaces through gesture and speech recognition, real-time interpretation and "big data" inference from sensors interfacing with the world, and to provide portable smart healthcare solutions for continuous health monitoring.

¹The three-color method forms the foundation of virtually all color imaging techniques to this day.


Regardless of the specific functionality, these applications have a common set of challenges. They are computationally complex, typically require large non-linear filter kernels or large block sizes for processing (64×64 or more) and have high memory size and bandwidth requirements. To support real-time performance (1080p at 30 frames per second (fps)), the throughput requirements for such applications can exceed 1 TOPS. The processing is often non-localized, with data dependencies across multiple rows in an image or even across multiple frames in a video sequence. Many algorithms are iterative, such as in image deblurring or segmentation, which limits parallelism and places further constraints on real-time performance. This presents the significant challenge of meeting a high computing performance requirement while ensuring ultra-low power operation, to be efficient on battery-operated mobile devices.
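As a rough sanity check on that number (the per-pixel operation count here is purely an illustrative assumption): 1080p at 30 fps corresponds to 1920 × 1080 × 30 ≈ 62 Mpixels/s, and an algorithm performing on the order of 16,000 operations per pixel, plausible for the large kernels mentioned above since a 128×128 kernel alone touches over 16,000 samples per output pixel, requires 62 M × 16,000 ≈ 10¹² operations per second, i.e. 1 TOPS.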

1.1 Mobile Computing Challenges

The energy budget of a mobile platform is constrained by its battery capacity. While processing power has increased exponentially, battery energy density has followed a roughly linear trajectory [10]. Over the last 15 years, processor performance has increased by 100×, transistor count by 1000×, whereas battery capacity has increased only by a factor of 2.6 [11]. At the same time, even as the number of transistors has followed the exponential growth of "Moore's Law", and continues to do so with process scaling and 3D integration, we are no longer able to achieve exponential gains in performance per watt of power consumption from process scaling alone, due to the lack of operating voltage scaling [12]. Figure 1-2 shows these trends over the last 40 years [13].

Figure 1-2: Processor feature scaling and Performance/Watt trends. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)

The lack of significant energy density enhancements in batteries, combined with flattening performance enhancements per unit of power consumption, has led to a major challenge in mobile computing. Energy has become the key limiting factor in scaling computing performance on mobile platforms. The significant performance enhancements needed to enable high complexity applications on future mobile platforms will only be achievable through significant enhancements in the energy-efficiency of such systems.

1.2 Energy-Efficient System Design

Fine-grained parallelism and low voltage operation are powerful tools for low-power design that take advantage of the exponential scaling in transistor costs to trade off silicon area for lower power consumption [14-17]. Technology scaling, circuit topologies and architecture trends are aligning to take advantage of these trade-offs for severely energy-constrained applications on mobile platforms.

1.2.1 Parallel Processing

Parallel processing has become a cornerstone of low-power digital design [14] because of its remarkable ability, when coupled with voltage scaling, to enhance energy efficiency at no overall performance cost. It allows each individual processing engine or core to operate at less than its peak performance, which enables the operating voltage to be scaled down and achieves a super-linear reduction in energy per operation. Figure 1-3 shows the normalized energy/op scaling vs. performance for processors over 20 years. For applications that support data parallelism, a processor can have two processing engines, each running at half the required performance, that together achieve the same throughput as a single processing engine running at the full performance. But due to the super-linear scaling in energy per operation as performance is lowered, the two engines combined consume less power than one engine running at full performance.
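As a minimal sketch of this trade-off (not from the thesis), the following uses the first-order switching-power model P = αCV²f; the assumption that halving the frequency permits scaling VDD from 1.0 V to roughly 0.7 V is a stand-in value for illustration only.

```python
# Minimal sketch: two half-rate cores at a reduced supply voltage vs. one
# full-rate core at nominal voltage, under P = C * VDD^2 * f (normalized).
# The 0.7 V figure is an assumed, illustrative voltage for half-rate operation.

def dynamic_power(vdd, freq, cap=1.0):
    """Switching power of one core: P = C * VDD^2 * f (normalized units)."""
    return cap * vdd**2 * freq

p_single = dynamic_power(vdd=1.0, freq=1.0)        # one core, full rate
p_parallel = 2 * dynamic_power(vdd=0.7, freq=0.5)  # two cores, half rate each

print(f"single core: {p_single:.2f}")    # 1.00
print(f"two cores  : {p_parallel:.2f}")  # 0.49 -> same throughput, ~2x less power
```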

Figure 1-3: Processor energy/operation scaling with performance. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)

Over the last decade, the transition from single core to multi-core processing, taking advantage of parallelism, allowed us to continue to scale overall system performance without increasing the energy budget. However, it is also evident from Figure 1-3 that continuing to reduce peak performance to increase energy efficiency has diminishing returns: moving between low energy points causes large shifts in performance for small energy changes. This puts a limit on the performance enhancements achievable from multi-core processing alone.


1.2.2 Application Specific Processing

The maximum performance enhancement achievable through parallelism is further limited by "Amdahl's Law" [18], which states that the speedup of a program using parallel processing is limited by the time needed for the sequential fraction of the program. If 50% of the processing involved in an algorithm is sequential, then the maximum performance enhancement achievable through parallelism cannot exceed 2× the performance of a single core processor. Achieving significantly higher performance enhancements requires a reformulation of the problem with algorithmic design and optimization that reduces computational complexity and enables highly parallel processing by minimizing sequential dependencies. The energy-efficiency achievable through parallelism is often limited by the energy spent in memory accesses. A 16 bit data access consumes about 5 pJ of energy from on-chip SRAM and about 160 pJ of energy from external DRAM. This compares to about 160 fJ of energy consumed by a 16 bit add operation [19]. Algorithmic optimizations can also significantly enhance processing locality, enabling a large number of computations per memory access and amortizing the energy cost. This approach is inherently application specific.
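The sketch below (not part of the thesis) puts numbers on both limits, using Amdahl's Law and the access energies quoted above; the operations-per-access counts are illustrative assumptions.

```python
# Amdahl's Law and memory-energy amortization, in illustrative terms.
# Energy numbers from the text: 160 fJ per 16-bit add, 5 pJ per on-chip SRAM
# access, 160 pJ per external DRAM access. Ops-per-access values are assumed.

def amdahl_speedup(parallel_fraction, n_cores):
    """Upper bound on speedup for a given parallelizable fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

print(amdahl_speedup(0.5, n_cores=1_000_000))  # ~2.0: 50% sequential caps at 2x

E_ADD, E_SRAM, E_DRAM = 160e-15, 5e-12, 160e-12  # joules

def energy_per_op(ops_per_access, access_energy):
    """Energy per add, with one data access shared by ops_per_access operations."""
    return E_ADD + access_energy / ops_per_access

print(energy_per_op(1, E_DRAM))    # ~1.6e-10 J/op: the DRAM access dominates
print(energy_per_op(100, E_SRAM))  # ~2.1e-13 J/op: locality amortizes the cost
```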

A general purpose processor spends a significant amount of resources on the control and memory overhead associated with each computation. The high cost of programmability is reflected in the relatively small fraction of energy (2-5%) spent in actual computation as opposed to control (45-50%) and memory access (40-45%) [13]. This makes software implementations of high-complexity applications extremely inefficient. Maximizing energy efficiency necessitates a significant reduction in this overhead by minimizing the control complexity and amortizing the cost of memory accesses over several computations.

Application specific hardware implementations provide the best solutions to trade off programmability for high energy-efficiency and take full advantage of algorithmic optimizations. Figure 1-4 shows the energy-efficiency of processors with different architectures, from CPUs to ASICs, where an operation is defined as a 16 bit addition.

Figure 1-4: Energy efficiency of processors: from CPUs to ASICs.

Processor | Description
1 | Intel Sandy Bridge [20]
2 | Intel Ivy Bridge [21]
3 | 24-Core Programmable Processor [22]
4 | Multimedia DSP [23]
5 | Mobile Processors [24,25]
6 | GPGPU Application Processor [26]
7 | Object Recognition ASIC [27]
8 | SVD ASIC [28]
9 | Video Decoder ASIC [29]

Hardware implementations minimize the control requirement, maximize processing data locality that allows a large number of computations per memory access, take advantage of spatial and temporal parallelism to reduce memory size and bandwidth, and enable deep pipelines with flexible bit-widths. Application specific hardware implementations are the key to achieving exponential enhancements in performance without increasing the energy budget.


1.2.3 Reconfigurable Hardware

Flexibility in implementing various applications after the hardware has been fabricated is a desirable feature. However, depending on the architecture used to provide flexibility, there can be a 2 to 3 orders of magnitude difference in energy-efficiency between these implementations, as seen from Figure 1-4.

Fully customized hardware implementations are well suited for applications that have well defined standards, such as video coding. Most desktop and mobile processors today have embedded hardware accelerators for video coding. However, it is impractical to develop hardware implementations for every iteration of an algorithm in areas such as computer vision and biomedical signal processing, where the algorithms are constantly evolving. Even for standardized applications, the existence of multiple competing standards makes it difficult to develop individual hardware implementations for all the standards. For example, it is impractical for most application processors to implement individual video coding accelerators for more than ten video coding standards with more than 20 different coding profiles. Dedicated video coding engines, such as IVA-HD [30], support multiple video coding standards through a reconfigurable architecture that implements optimized core functional units, such as motion estimation, transform and entropy coding engines, and uses a configurable pipeline with distributed control.

A closer examination of these areas reveals that it may not be necessary to develop hardware accelerators for each individual algorithm. A vast number of computational photography and computer vision applications, for example, use a well defined set of functions, such as non-linear filtering [31], Gaussian or Laplacian pyramids [32,33], the Scale Invariant Feature Transform (SIFT) [34], Histogram of Oriented Gradients (HoG) [35] or Haar features [36]. These functions are well established and form the foundation of the OpenCV library [37] used for software implementations of almost all computer vision applications. A hardware implementation with highly optimized processing units supporting such functions, and the ability to activate these processing units and configure the datapaths based on the application requirements, provides a very attractive alternative that maintains high energy-efficiency while supporting a large class of applications.

An important aspect of reconfigurable implementations is architecture scalability. The use of individual processing units, as well as the amount of parallelism within each unit, is application specific. Video coding with 4k×2k resolution at 60 fps has a 20× higher performance requirement than 720p at 30 fps. Different processing block sizes or filter kernels (4×4 to 128×128 or more) result in different optimal configurations in a parallel processor. Scalable architectures also enable us to explore energy vs. output quality trade-offs, where the user can determine the amount of energy spent in processing depending on the desired output for the specific application. The ability to effectively turn off processing units and memory banks, through clock and power gating when not in use, is key to minimizing energy that is simply wasted by the system. This thesis demonstrates examples of efficient reconfigurable and scalable hardware implementations for video coding and computational photography applications.

1.2.4 Low-Voltage Circuits

For parallelism to yield enhancements in energy-efficiency, it must be coupled with voltage scaling. The power consumption of CMOS digital circuits operating at voltage $V_{DD}$, frequency $f$ and driving a load modeled as a capacitance $C$ is given by:

$$P_{total} = P_{switching} + P_{leakage} = \alpha\, C\, V_{DD}^{2}\, f + I_{leakage}\, V_{DD}$$

where $\alpha$ is the switching activity of a logic gate and $I_{leakage}$ is the leakage current.

For varying performance requirements, scaling frequency alone provides only a linear scaling in power consumption in the switching-power dominated region of operation. However, scaling VDD along with the frequency, to match the peak performance of the processor, provides a cubic scaling in power consumption. To take full advantage of Dynamic Voltage-Frequency Scaling (DVFS) [38], circuit implementations must be capable of operating across a wide voltage range, from nominal VDD down to the minimum energy point, which typically occurs near or below the threshold voltage (VT) and minimizes the energy per operation [39].
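A minimal sketch of this comparison (not from the thesis), using the power model above with normalized, stand-in values for αC and the leakage current, and assuming frequency can scale linearly with VDD:

```python
# Frequency-only scaling vs. DVFS under P = a*C*VDD^2*f + Ileak*VDD.
# All constants are normalized, illustrative stand-ins.

A_C, I_LEAK = 1.0, 0.05

def total_power(vdd, freq):
    """P_total = alpha*C*VDD^2*f + I_leakage*VDD (normalized units)."""
    return A_C * vdd**2 * freq + I_LEAK * vdd

# Target: half the peak performance (f = 0.5 instead of 1.0).
print(total_power(vdd=1.0, freq=0.5))  # ~0.55: frequency-only, roughly linear savings
print(total_power(vdd=0.5, freq=0.5))  # ~0.15: DVFS, close to cubic savings
```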

When VDD is reduced to the range of 0.5 V, statistical variations in the transistor threshold voltage become an important factor in determining logic performance. Random Dopant Fluctuations (RDF) are a dominant source of variations at low voltage, causing random, local threshold voltage shifts [40-42]. Local variations have long been known in analog design and in SRAM design [43,44]. With technology scaling, they have become a major concern for digital design as well. At nominal voltage, local variations in VT may result in 5%-10% variation in the logic timing. However, at low voltage, these variations can result in timing path delays with standard deviation comparable to the global corner delay, and must be accounted for during timing closure in order to ensure a robust, manufacturable design. Figure 1-5 shows the delay of a 28 nm CMOS logic gate as the voltage is lowered from 1 V to 0.5 V. The nominal delay scales by a factor of 15, but taking into account stochastic variations, the total 3σ delay scales by a factor of 36.

Figure 1-5: Delay scaling with VDD. Corner delay scales by 15×, whereas total delay (corner + 3σ stochastic delay) scales by 36×.

Typically, reliability at low voltage is achieved by over-designing the system with large design margins to account for variations. Such design margins have a significant energy cost [12].

This thesis demonstrates low-voltage design using statistical static timing analysis techniques that minimize the overhead of large design margins to account for variations, while ensuring reliable low-voltage operation with 3σ confidence.
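As a toy illustration of the gap between a fixed corner estimate and a statistical 3σ estimate (the path length, nominal gate delay and per-gate σ below are stand-in assumptions, not values from the testchips):

```python
# Monte-Carlo estimate of a 3-sigma path delay under random local variation,
# compared against a nominal corner estimate. All values are illustrative.
import random

N_GATES, T_NOM, SIGMA = 20, 1.0, 0.25  # assumed path length, delay, variation

def path_delay():
    """Sum of per-gate delays with independent Gaussian variation."""
    return sum(random.gauss(T_NOM, SIGMA) for _ in range(N_GATES))

samples = sorted(path_delay() for _ in range(100_000))
t_3sigma = samples[int(0.99865 * len(samples))]  # ~3-sigma quantile
corner = N_GATES * T_NOM

# Independent per-gate variations partially average out along the path: the
# 3-sigma path delay exceeds the nominal sum, but by far less than adding a
# 3-sigma margin to every gate would suggest, which is the over-design a
# statistical methodology avoids.
print(f"corner: {corner:.1f}  3-sigma Monte-Carlo: {t_3sigma:.1f}")
```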

1.3 Thesis Contributions

The broad focus of this thesis is to address the challenges of implementing high-complexity applications with high-performance requirements on mobile platforms through a comprehensive view of system design, where algorithms are designed and optimized to enhance processing locality and enable highly parallel architectures that can be implemented using low-power low-voltage circuits to achieve maximally energy-efficient systems.

This is accomplished by starting with application areas and exploring key features that form the basis of a wide array of functionalities in that area. The algorithms underlying these features are optimized for hardware implementation, considering trade-offs that reduce computational complexity and memory requirements. Parallel architectures with reconfigurability and scalability are developed to support real-time performance at low frequencies. Finally, circuits are implemented to provide a wide voltage-frequency operating range and ensure minimum energy operation.

The main contributions of this thesis are in the following areas:

• Shared Transform Engine for Video Coding: A shared transform engine for the H.264 and VC-1 video coding standards that supports Quad Full-HD (4k×2k) resolution at 30 fps is presented in Chapter 2. The transform engine is a critical part of the video encoding/decoding process. High coding efficiency often comes at the cost of increased complexity in the transform module. This work explores algorithmic optimizations where a larger transform matrix (8×8 or larger) is factorized into multiple small (2×2) matrices that can be computed much more efficiently. The factorization can also be formulated in such a way that the Discrete Cosine Transform (DCT) based transform matrices corresponding to multiple video coding standards result in the same factors. This is key to achieving an efficient shared implementation. The size of the transpose memory for the 2D transform becomes a key concern for large transforms. Architectural schemes to eliminate an explicit transpose memory and reuse an output buffer to save area and power are explored. Data dependent processing is used to further reduce the power consumption of the transform engine by lowering switching activity. Both the forward and inverse integer transforms are implemented to support encoding as well as decoding operations. The proposed techniques are demonstrated through a testchip, implemented using 45 nm CMOS technology. Statistical circuit design techniques ensure a wide operating range and reliable operation down to 0.3 V. The testchip is used to benchmark different implementations of transform engines, such as a reconfigurable implementation vs. individual implementations for the two standards and implementations with and without transpose memory, and to evaluate the different architectures for power and area efficiency.

• Reconfigurable Processor for Computational Photography: A wide array of computational photography applications, such as High Dynamic Range (HDR) imaging, low-light enhancement, tone management and video enhancement, rely on non-linear filtering techniques such as bilateral filtering. Chapter 3 presents the development of a reconfigurable architecture for multiple computational photography applications. Algorithmic optimizations, leveraging the bilateral grid structure, are explored to transform an inefficient non-linear filtering operation into an efficient linear filtering operation with significant reductions in computational and memory requirements. Algorithm-architecture co-design enables a highly parallel and scalable architecture that can be configured to implement various functionalities, including HDR imaging, low-light enhancement and glare reduction. Memory management techniques are explored to minimize the external DRAM bandwidth and power consumption. The scalable architecture enables users to explore energy/resolution trade-offs for energy-scalable processing. The proposed techniques are demonstrated through a testchip, implemented using 40 nm CMOS technology. Careful design for low-voltage operation ensures reliable operation down to 0.5 V, while achieving real-time performance. The comprehensive system design approach from algorithms to circuits enables a 280× enhancement in energy-efficiency compared to implementations on commercial mobile processors.

• Portable Platform for Medical Imaging: Medical imaging techniques are important tools in the diagnosis and treatment of various skin conditions. Widespread use of such imaging techniques has been limited by factors such as size, weight, cost and complex user interfaces. Treatments for skin conditions require reliable outcome measures to compare studies and to assess changes over time. Chapter 4 presents the development of a portable medical imaging platform for accurate objective quantification of skin lesion progression. Computer vision techniques are extended and enhanced to identify lesion contours in images captured using smartphones and to quantify the progression through feature matching. The approach is validated through a pilot study in collaboration with the Brigham and Women's Hospital. Algorithmic optimizations are explored to improve software run-time performance, memory bandwidth and power consumption. These optimizations pave the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.


Chapter 2

Transform Engine for Video Coding

Multimedia applications, such as video playback, have become prevalent in portable multi-

media devices. Video accounted for 53% of the mobile data traffic in 2013 and is expected

to increase 14x between 2013 and 2018, accounting for 69% of total mobile data traffic

by 2018 [45]. Such applications present the unique challenge of high-performance require-

ment while ensuring ultra-low power operation, to be efficient on battery-operated mobile

devices. Low-power hardware implementations targeted to a specific standard, such as

application processors for H.264 video encoding [46] and decoding [47,48], have been

proposed. A universal media player requires supporting multiple video coding standards.

High power and area cost of dedicated video encoding/decoding for each standard necessi-

tates the development of a shared architecture for multi-standard video coding. Dedicated

video coding engines supporting multiple standards have recently been proposed using re-

configurable architectures. The IVA-HD video coding engine [30] supports encoding and

decoding for multiple standards, such as H.264, H.263, MPEG 4, MPEG 1/2, WMV9,

VC-1, MJPEG and AVS. It implements optimized core functional units, such as motion

estimation, transform and entropy coding engines, and uses a configurable pipeline with

distributed control to achieve programmability for the different standards. A multi-format


video codec application processor, supporting H.264, H.263, MPEG 4, MPEG 2, VC-1

and VP8, is proposed in [49]. Hardwired logic is combined with a dedicated ARMv5

architecture CPU to provide programmability for supporting multiple standards.

Energy efficiency of circuits is a critical concern for portable multimedia applications. It

is important not only to optimize functionality but also achieve low energy per operation.

Dynamic Voltage-Frequency Scaling (DVFS) is an important technique for reducing power

consumption while achieving high peak computational performance [50]. The energy

efficiency of digital circuits is maximized at very low supply voltages, near or below the

transistor threshold voltage, such as 0.5 V [51]. This makes the ability to operate at low

voltage (VDD < 0.5 V) a key component of achieving low power operation. This work

explores power reduction techniques at various stages, such as algorithms, architectures

and circuits. Combining aggressive voltage scaling, by operating at VDD ≈ 0.5 V, and

increased parallelism and pipelining, by processing 16 pixels in each clock cycle, provides

an effective way of reducing power while achieving high performance, such as 4k x 2k Quad-

Full HD (3840 x 2160) video coding at 30 frames per second (fps), at low frequency.

The transform engine is a critical part of the video encoding/decoding process. High coding ef-

ficiency often comes at a cost of increased complexity in the transform module, such as

variable size transforms (4x4, 8x8, 8x4, 4x8, etc.) as well as hierarchical transform,

where Discrete Cosine Transform (DCT) coefficients are further encoded using Hadamard

transform. DCT is the most commonly used transform in video and image coding appli-

cations. DCT has excellent energy compaction property, which leads to good compression

efficiency of the transform. However, the irrational numbers in the transform matrix make

its exact implementation with finite precision hardware impossible, leading to a drift (dif-

ference between reconstructed video frames in encoder and decoder) between forward and

inverse transform coefficients. Recent video coding standards, such as H.264/AVC [52,53]

and VC-1 [54-56] use a variation of the DCT, known as integer transform, where the

transform matrix is an integer approximation of the DCT. This allows exact computation

of inverse transform using integer arithmetic and also allows implementation using addi-


tions and shifts, without any multiplications [57]. H.264/AVC and VC-1 also use variable

size transforms, such as 8x8 and 4x4 in H.264/AVC (High profile) and 8x8, 8x4, 4x8

and 4x4 in VC-1 (Advanced profile), to more effectively exploit the spatial correlation and

improve coding efficiency. Construction of computationally efficient integer transform ma-

trices is proposed in [58], which allows implementation using 16 bit arithmetic with rate

distortion performance similar to 32 bit or floating point DCT implementations.

Recent research has focused on efficient implementation of the integer transforms. Matrix

decomposition is used to implement 4x4 and 8x8 integer transforms for VC-1 in [59]. A

hardware sharing scheme for inverse integer transforms of H.264, MPEG-4 and VC-1 using

delta coefficient matrix is proposed in [60]. Matrix decomposition with sparse matrices and

matrix offset computations is proposed in [61] for a shared 1D inverse integer transform

of H.264 and VC-1. Matrix decomposition and transform symmetry is used to develop a

computationally efficient approach for 1D 8x8 inverse transform for VC-1 in [62]. Similar

ideas are used to achieve a shared architecture for 1D 8 x 8 forward and inverse transforms

of H.264 in [63]. A circuit architecture that can be applied to standards such as MPEG

1/2/4, H.264 and VC-1 is proposed in [64] based on similarity of 4x4 and 8x8 DCT

matrices.

In this work, a shared transform for H.264/AVC and VC-1 video coding standards is

proposed [65]. Forward integer transform and inverse integer transform are both imple-

mented to support encoding as well as decoding operations. We also propose a scheme

to eliminate an explicit transpose memory, which is required in 2D transform implemen-

tation, to save area and power. This work also explores data dependent processing to

further reduce the power consumption of the transform engine.

2.1 Transform Engine Design

This section explores the ideas of matrix factorization for hardware sharing, eliminating

an explicit transpose memory in 2D transform and data dependent processing to reduce

Page 40: Appendix A Integer Transform

switching activity, to achieve a shared transform engine for H.264/AVC and VC-1 video

coding standards. The objective is to design a transform engine that can support video

coding with Quad Full-HD (QFHD) resolution at 30 fps, with very low power consump-

tion.

2.1.1 Integer Transform: H.264/AVC & VC-1

H.264/AVC uses the 4x4 transform in the baseline and main profiles and both 4x4 and 8x8 transforms in the high profile. VC-1 uses 4x4, 4x8, 8x4 and 8x8 transforms in the advanced profile. The transform matrices for H.264/AVC and VC-1 standards are defined

in Appendix A.

The 4x4 transform matrices for H.264 and VC-1, as well as the 8x8 transform matrices,

are structurally identical. This allows us to generate a unified 4x4 transform matrix and

a unified 8x8 transform matrix for H.264 and VC-1, as defined by eq. (2.1) and eq. (2.2)

respectively.

$$T_4 = \begin{bmatrix} \alpha & \beta & \alpha & \gamma \\ \alpha & \gamma & -\alpha & -\beta \\ \alpha & -\gamma & -\alpha & \beta \\ \alpha & -\beta & \alpha & -\gamma \end{bmatrix} \quad (2.1)$$

H.264: $\alpha = 1$, $\beta = 1$, $\gamma = 1/2$ and VC-1: $\alpha = 17$, $\beta = 22$, $\gamma = 10$.


$$T_8 = \begin{bmatrix}
a & b & f & c & a & d & g & e \\
a & c & g & -e & -a & -b & -f & -d \\
a & d & -g & -b & -a & e & f & c \\
a & e & -f & -d & a & c & -g & -b \\
a & -e & -f & d & a & -c & -g & b \\
a & -d & -g & b & -a & -e & f & -c \\
a & -c & g & e & -a & b & -f & d \\
a & -b & f & -c & a & -d & g & -e
\end{bmatrix} \quad (2.2)$$

H.264: $a = 8$, $b = 12$, $c = 10$, $d = 6$, $e = 3$, $f = 8$, $g = 4$

VC-1: $a = 12$, $b = 16$, $c = 15$, $d = 9$, $e = 4$, $f = 16$, $g = 6$.

The separable 2D transforms are defined as given in Table 2.1, where m = {8, 4} and n

= {8, 4}, X is the prediction residue and Y is the transformed data.

Table 2.1: Separable 2D transform definitions for H.264/AVC and VC-1

         Forward Transform                                          Inverse Transform
H.264    $T_m^T \cdot X_{m \times m} \cdot T_m$                     $T_m \cdot Y_{m \times m} \cdot T_m^T$
VC-1     $(T_m^T \cdot X_{m \times n} \cdot T_n) \circ N_{m \times n}$    $(T_m \cdot Y_{m \times n} \cdot T_n^T)/1024$

The scaling factors in transform definitions can be absorbed in the quantization process.

This work focuses on implementing the transform matrix computations.


2.1.2 Matrix Factorization for Hardware Sharing

Transform matrices for H.264/AVC and VC-1 have identical structure, as shown in eq. (2.1)

and eq. (2.2). In this section, we will exploit this fact to design a shared transform engine

for H.264/AVC and VC-1.

The 8x8 transform matrix can be decomposed into two 4x4 matrices using even-odd

decomposition [66], given by eq. (2.3).

$$T_8 = B_8 \cdot M_8 \cdot P_8 \quad (2.3)$$

where,

$$M_8 = \begin{bmatrix}
a & f & a & g & 0 & 0 & 0 & 0 \\
a & g & -a & -f & 0 & 0 & 0 & 0 \\
a & -g & -a & f & 0 & 0 & 0 & 0 \\
a & -f & a & -g & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & b & c & d & e \\
0 & 0 & 0 & 0 & c & -e & -b & -d \\
0 & 0 & 0 & 0 & d & -b & e & c \\
0 & 0 & 0 & 0 & e & -d & c & -b
\end{bmatrix} \quad (2.4)$$

P8 is a permutation matrix that has zero computational complexity and B 8 can be im-

plemented using 8 adders.


$$P_8 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \quad \text{and} \quad B_8 = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0
\end{bmatrix} \quad (2.5)$$
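The decomposition can be checked numerically. The following is a minimal NumPy sketch (a software model for verification only; the hardware realizes $B_8$ with 8 adders and $P_8$ with wiring) that builds $T_8$ of eq. (2.2) with the H.264 parameter set and confirms eq. (2.3):

```python
import numpy as np

# Unified 8x8 transform matrix of eq. (2.2), H.264 parameter set.
a, b, c, d, e, f, g = 8, 12, 10, 6, 3, 8, 4
T8 = np.array([
    [a,  b,  f,  c,  a,  d,  g,  e],
    [a,  c,  g, -e, -a, -b, -f, -d],
    [a,  d, -g, -b, -a,  e,  f,  c],
    [a,  e, -f, -d,  a,  c, -g, -b],
    [a, -e, -f,  d,  a, -c, -g,  b],
    [a, -d, -g,  b, -a, -e,  f, -c],
    [a, -c,  g,  e, -a,  b, -f,  d],
    [a, -b,  f, -c,  a, -d,  g, -e]])

# P8: gather even-indexed then odd-indexed inputs (zero computational cost).
P8 = np.eye(8)[[0, 2, 4, 6, 1, 3, 5, 7]]

# M8: block-diagonal even/odd 4x4 components of eq. (2.4).
Me = np.array([[a, f, a, g], [a, g, -a, -f], [a, -g, -a, f], [a, -f, a, -g]])
Mo = np.array([[b, c, d, e], [c, -e, -b, -d], [d, -b, e, c], [e, -d, c, -b]])
M8 = np.block([[Me, np.zeros((4, 4))], [np.zeros((4, 4)), Mo]])

# B8: butterfly stage (8 adders) -- sums on top, reversed differences below.
I4, J4 = np.eye(4), np.fliplr(np.eye(4))
B8 = np.block([[I4, I4], [J4, -J4]])

assert np.array_equal(B8 @ M8 @ P8, T8)   # eq. (2.3) holds
```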

We propose further factorization of the even and odd components of $M_8$ to achieve hardware sharing between the H.264 and VC-1 matrices. The factorization scheme is derived in such a way that the H.264 and VC-1 matrices result in the maximum number of common factors.

The even component of H.264 is factorized as shown in eq. (2.6).

$$H_e = \begin{bmatrix} 8 & 8 & 8 & 4 \\ 8 & 4 & -8 & -8 \\ 8 & -4 & -8 & 8 \\ 8 & -8 & 8 & -4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & -1 \\ 1 & 0 & -1 & 0 \end{bmatrix} \cdot 4 \begin{bmatrix} 2 & 0 & 2 & 0 \\ 2 & 0 & -2 & 0 \\ 0 & 2 & 0 & 1 \\ 0 & 1 & 0 & -2 \end{bmatrix} = F_{1e} \cdot 4F_{2e} \quad (2.6)$$

The even component of VC-1 is factorized as shown in eq. (2.7).

$$V_e = \begin{bmatrix} 12 & 16 & 12 & 6 \\ 12 & 6 & -12 & -16 \\ 12 & -6 & -12 & 16 \\ 12 & -16 & 12 & -6 \end{bmatrix} = F_{1e} \cdot \left( 6F_{2e} + 4F_{3e} \right), \quad \text{where} \quad F_{3e} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix} \quad (2.7)$$

Similarly, we propose factorizing the odd component for H.264 as shown in eq. (2.8).

$$H_o = \begin{bmatrix} 3 & -6 & 10 & -12 \\ 6 & -12 & 3 & 10 \\ 10 & -3 & -12 & -6 \\ 12 & 10 & 6 & 3 \end{bmatrix} = \left( 2F_{2o} + 3F_{3o} \right) \cdot F_{1o} \quad (2.8)$$

where

$$F_{1o} = \begin{bmatrix} 1 & 0 & 0 & -4 \\ 0 & 1 & 4 & 0 \\ 0 & 4 & -1 & 0 \\ 4 & 0 & 0 & 1 \end{bmatrix}, \quad F_{2o} = \begin{bmatrix} 0 & 1 & -1 & 0 \\ -1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}, \quad F_{3o} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

And the odd component for VC-1 is factorized as shown in eq. (2.9).

$$V_o = \begin{bmatrix} 4 & -9 & 15 & -16 \\ 9 & -16 & 4 & 15 \\ 15 & -4 & -16 & -9 \\ 16 & 15 & 9 & 4 \end{bmatrix} = \left( 3F_{2o} + 4F_{3o} \right) \cdot F_{1o} \quad (2.9)$$

Notice that the major factors, $F_{1e}$ and $F_{2e}$, are common between the even components of H.264 and VC-1. The factor $F_{3e}$ for VC-1 is a very sparse matrix and has very little computational complexity. Similarly, all the factors, $F_{1o}$, $F_{2o}$ and $F_{3o}$, are common between the odd components of H.264 and VC-1. This factorization allows us to maximize hardware sharing between the even as well as odd components of H.264 and VC-1.
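These factorizations can also be verified numerically. A short NumPy sketch (software verification only), using the factor matrices as reconstructed above:

```python
import numpy as np

# Factors shared by the H.264 and VC-1 components (eqs. 2.6-2.9).
F1e = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, -1], [1, 0, -1, 0]])
F2e = np.array([[2, 0, 2, 0], [2, 0, -2, 0], [0, 2, 0, 1], [0, 1, 0, -2]])
F3e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, -1]])
F1o = np.array([[1, 0, 0, -4], [0, 1, 4, 0], [0, 4, -1, 0], [4, 0, 0, 1]])
F2o = np.array([[0, 1, -1, 0], [-1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
F3o = np.array([[1, 0, 0, 0], [0, 0, -1, 0], [0, -1, 0, 0], [0, 0, 0, 1]])

He = F1e @ (4 * F2e)                # eq. (2.6): H.264 even component
Ve = F1e @ (6 * F2e + 4 * F3e)      # eq. (2.7): VC-1 even component
Ho = (2 * F2o + 3 * F3o) @ F1o      # eq. (2.8): H.264 odd component
Vo = (3 * F2o + 4 * F3o) @ F1o      # eq. (2.9): VC-1 odd component

assert He[0].tolist() == [8, 8, 8, 4] and Ho[0].tolist() == [3, -6, 10, -12]
assert Ve[0].tolist() == [12, 16, 12, 6] and Vo[0].tolist() == [4, -9, 15, -16]
```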

The hardware architecture for the shared implementation of the even component for H.264 and VC-1, using the factorization defined by eq. (2.6) and eq. (2.7), is shown in Figure 2-1. The architecture for the odd component, using the factorization defined by eq. (2.8) and eq. (2.9), is shown in Figure 2-2. A column of input data is represented as:

$$[x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7]^T \quad (2.10)$$

Reconfigurability is achieved by using multiplexers to program the datapath, enabled by

a flag indicating the standard (H.264 or VC-1) being used.

The shared 4x4 transform for H.264 and VC-1 is achieved in a similar manner, as defined by eq. (2.11), where $T_4$ is defined by eq. (2.1).

$$T_4^{H} = \left( F_{1e} \cdot F_{2e} \right) \gg 1 \quad \text{and} \quad T_4^{V} = F_{1e} \cdot \left( 8F_{2e} + 4F_{3e} + F_4 \right) \quad (2.11)$$

where $F_{1e}$, $F_{2e}$ and $F_{3e}$ are defined in eq. (2.6) and eq. (2.7), and $F_4$ in eq. (2.12).

$$F_4 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & -1 & 0 \\ 0 & 2 & 0 & 2 \\ 0 & 2 & 0 & -2 \end{bmatrix} \quad (2.12)$$
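The shared 4x4 construction can be checked the same way. In the sketch below (software verification only) the "$\gg 1$" of eq. (2.11) is written as a division by 2, since the factor of 1/2 is absorbed into the quantization rather than implemented as an integer shift:

```python
import numpy as np

F1e = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, -1], [1, 0, -1, 0]])
F2e = np.array([[2, 0, 2, 0], [2, 0, -2, 0], [0, 2, 0, 1], [0, 1, 0, -2]])
F3e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, -1]])
F4  = np.array([[1, 0, 1, 0], [1, 0, -1, 0], [0, 2, 0, 2], [0, 2, 0, -2]])

TH4 = (F1e @ F2e) / 2                   # H.264 4x4 of eq. (2.11)
TV4 = F1e @ (8 * F2e + 4 * F3e + F4)    # VC-1 4x4 of eq. (2.11)

# First rows follow eq. (2.1): (alpha, beta, alpha, gamma).
assert TH4[0].tolist() == [1, 1, 1, 0.5]     # H.264: alpha=1, beta=1, gamma=1/2
assert TV4[0].tolist() == [17, 22, 17, 10]   # VC-1: alpha=17, beta=22, gamma=10
```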

2.1.3 Eliminating Transpose Memory

Conventional row-column decomposition uses the same 1D transform architecture for both row and column operations. This requires the use of a transpose memory between the row-wise 1D transform and the column-wise 1D transform. The transpose memory can be a significant part of the total area (as high as 48% of the gate count in one of the benchmarked designs) and power consumed by the 2D transform.

We propose an approach that avoids the transpose memory by using separate designs for the row-wise and column-wise 1D transforms and using the output buffer to store intermediate data. By giving the output buffer enough ports to read and write 2D data, referred to as a 2D output buffer, an explicit transposition is avoided.


Figure 2-1: Hardware architecture of the even component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.

Figure 2-2: Hardware architecture of the odd component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.

In this implementation, we spread the processing of an 8x8 block over 4 clock cycles. In each cycle, we process 8x2 data, i.e., two columns (0 and 7, 1 and 6, 2 and 5, 3 and 4) from the 8x8 input, to obtain two transformed columns, as shown in Figure 2-3.

Figure 2-3: Column-wise 1D transform: 8x8 data is processed over four clock cycles, C0 to C3: columns 0 and 7 in C0, 1 and 6 in C1, 2 and 5 in C2, 3 and 4 in C3. Two transformed columns are generated in each clock cycle.

For the row-wise computation, an entire row (transposed column) is not available in each clock cycle without using a transpose memory. To overcome this problem, we compute only partial products of all 8x8 coefficients in each clock cycle and store them in the 8x8 output buffer, as shown in Figure 2-4.

The processing in Figure 2-3 and Figure 2-4 is shown as direct inner product for simplic-

ity. The implementation performs the same processing using the matrix decomposition

approach, described in Section 2.1.2.

Figure 2-4: Row-wise 1D transform: Partial products for all 64 coefficients are computed in each clock cycle, using the 2x8 data obtained by transposing the two columns generated by the column-wise 1D transform. The partial products are stored in the output buffer. At the end of four clock cycles, the output buffer contains the complete 2D transformed output.

Over four clock cycles, we add and accumulate the results for all 8x8 coefficients in the output buffer with 64 reads/writes each cycle, so that at the end of the fourth clock cycle we get the complete result for the entire 8x8 block. The partial products computed in each clock cycle, for the column vector $[u_{00}, u_{01}, u_{02}, u_{03}, u_{04}, u_{05}, u_{06}, u_{07}]^T$, are shown in Table 2.2.

These partial products are generated by the hardware architectures shown in Figure 2-5.

The appropriate coefficients are selected by the multiplexers in each clock cycle.


Table 2.2: Row-wise transform computations for even-odd components over four clock cycles

Even component (H.264 / VC-1):

Clk   H.264 / VC-1       H.264 / VC-1        H.264 / VC-1       H.264 / VC-1
C0    8u00 / 12u00       8u00 / 12u00        8u00 / 12u00       8u00 / 12u00
C1    4u06 / 6u06        -8u06 / -16u06      8u06 / 16u06       -4u06 / -6u06
C2    8u02 / 16u02       4u02 / 6u02         -4u02 / -6u02      -8u02 / -16u02
C3    8u04 / 12u04       -8u04 / -12u04      -8u04 / -12u04     8u04 / 12u04

Odd component (H.264 / VC-1):

Clk   H.264 / VC-1       H.264 / VC-1        H.264 / VC-1       H.264 / VC-1
C0    -12u07 / -16u07    10u07 / 15u07       -6u07 / -9u07      3u07 / 4u07
C1    3u01 / 4u01        6u01 / 9u01         10u01 / 15u01      12u01 / 16u01
C2    10u05 / 15u05      3u05 / 4u05         -12u05 / -16u05    6u05 / 9u05
C3    -6u03 / -9u03      -12u03 / -16u03     -3u03 / -4u03      10u03 / 15u03
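The dataflow can be modeled in a few lines. The following behavioral NumPy sketch (not the RTL; it assumes the forward-transform convention $Y = T^T \cdot X \cdot T$ of Table 2.1 and uses a random matrix to stand in for the transform of eq. (2.2)) shows how the four-cycle partial-product accumulation reproduces the 2D transform without a transpose memory:

```python
import numpy as np

def transform_2d_no_transpose(X, T):
    """Behavioral sketch of the transpose-memory-free 2D transform of
    Section 2.1.3: Y = T^T . X . T accumulated over four 'clock cycles'."""
    Y = np.zeros((8, 8), dtype=np.int64)        # 2D output buffer
    for i, j in [(0, 7), (1, 6), (2, 5), (3, 4)]:
        U = T.T @ X[:, [i, j]]                  # column-wise 1D transform, 8x2
        # Row-wise pass as partial products: each transformed column k
        # contributes the rank-1 term U[:, k] * T[k, :] to all 64
        # coefficients, which is add-accumulated into the output buffer.
        Y += np.outer(U[:, 0], T[i, :]) + np.outer(U[:, 1], T[j, :])
    return Y                                    # complete after four cycles

rng = np.random.default_rng(0)
T = rng.integers(-16, 17, (8, 8))               # stand-in transform matrix
X = rng.integers(-128, 128, (8, 8))             # 8x8 prediction residue block
assert np.array_equal(transform_2d_no_transpose(X, T), T.T @ X @ T)
```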

2.1.4 Data Dependent Processing

In addition to processing optimizations, it is also important to take into account the nature of the input data to achieve further power savings. By exploiting the characteristics of the data

being processed, architectures can be designed to minimize switching activity, optimize

pipeline bit widths and perform variable number of operations per block [67]. Application-

specific SRAM designs for video coding applications that exploit the correlation of storage

data and signal statistics to reduce the bit-line switching activity and consequently the

energy consumption are proposed in [68,69].

The transform engine operates on the 8 bit prediction residue. Figure 2-6 shows the histogram

of the prediction residue for a number of test sequences. This analysis shows that more

than 80% of the prediction residue lies in the range -32 to +32. Due to 2's complement

processing, a large number of bits are flipped every time a number changes from a small

negative value to a small positive value. At the input, this results in high switching activity around zero.

Figure 2-5: Hardware architecture of the (a) even and (b) odd component. Std = {0: H.264, 1: VC-1}.

Switching activity at the input propagates through the system, though the

effect is different at different nodes. For example, a node implementing functionality

similar to XOR shows high switching activity, whereas other nodes show significantly low

switching activity. Because of this, different input patterns affect the system switching

activity differently. Overall, we observe that high switching activity at the input results

in a high switching activity for the entire system.

Figure 2-6: Histogram of the prediction residue for a number of test sequences

Figure 2-7 shows the correlation between switching activity at the input and the system

switching activity for 150 different input sequences. Zero input switching activity refers

to no bits changing at the input and 1 refers to all the input bits switching simultaneously

from 0 to 1 or 1 to 0. For the system switching activity, 0 refers to no activity, which

corresponds to leakage power, and 1 refers to maximum power consumption. The plot

shows a strong correlation of 0.83 between input switching activity and system switching

activity. This indicates that there is a significant benefit to the system switching activity

by reducing the input switching activity.

Figure 2-7: Correlation between input switching activity and system switching activity. The plot also shows a linear regression for the data. The measured correlation is 0.83.

In order to reduce the switching activity, we pre-process the input data by adding a fixed

DC bias to the prediction residue. To accommodate the added bias, the dynamic

range is increased from 8 bit to 9 bit. The DC bias shifts the input histogram to the

right. For example, for a DC bias of 32, more than 80% of the input data falls within

0 to 64. Thus fewer than 6 LSBs are flipped during most operations, reducing the overall

switching activity. Note that the DC bias only affects the DC coefficient in the transform

output. This can be easily corrected by subtracting a corresponding bias from the DC

coefficient at the output. Figure 2-8 shows the reduction in switching activity and power

as a function of DC bias values, despite the one bit increase in bit width, for different

video sequences.

Figure 2-8: Switching activity and power consumption in the transform as a function of the DC bias applied to the input data

On average, the switching activity and power consumption reach a minimum for DC

bias of about 64 and then start to increase again. This is because as a higher DC bias is

applied, more MSBs start switching, partially offsetting the effect of reduction in switching

activity in the LSBs. Data dependent processing scheme has less than 5% hardware cost

and reduces the average switching activity by 30% and average power by 15% for the DC

bias of 64.


2.2 Future Video Coding Standards

The ideas proposed in this work have general applicability beyond H.264/AVC and VC-1

video coding standards. In this section, we will look at applying these ideas to the 8 x8

transform of the next generation video coding standard High-Efficiency Video Coding

(HEVC) [70].

The HEVC standard recommendation [70] defines the 8x8 1D transform as given by

eq. (2.13).

$$T_8 = \begin{bmatrix}
64 & 89 & 83 & 75 & 64 & 50 & 36 & 18 \\
64 & 75 & 36 & -18 & -64 & -89 & -83 & -50 \\
64 & 50 & -36 & -89 & -64 & 18 & 83 & 75 \\
64 & 18 & -83 & -50 & 64 & 75 & -36 & -89 \\
64 & -18 & -83 & 50 & 64 & -75 & -36 & 89 \\
64 & -50 & -36 & 89 & -64 & -18 & 83 & -75 \\
64 & -75 & 36 & 18 & -64 & 89 & -83 & 50 \\
64 & -89 & 83 & -75 & 64 & -50 & 36 & -18
\end{bmatrix} \quad (2.13)$$

Notice that the structure of this transform matrix is the same as that of the generalized matrix for H.264/AVC and VC-1, defined in eq. (2.2), where: $a = 64$, $b = 89$, $c = 75$, $d = 50$, $e = 18$, $f = 83$, $g = 36$.

The idea of matrix decomposition for hardware sharing, as described in Section 2.1.2,

can be applied to eq. (2.13) as well. Extension of even-odd decomposition for HEVC

transform to reduce hardware complexity is described in [71]. Even-Odd decomposition,

performed as defined in eq. (2.3), gives the even and odd components for the 8x8 HEVC

matrix, defined by eq. (2.14) and eq. (2.15) respectively.


$$HEVC_e = \begin{bmatrix} 64 & 83 & 64 & 36 \\ 64 & 36 & -64 & -83 \\ 64 & -36 & -64 & 83 \\ 64 & -83 & 64 & -36 \end{bmatrix} \quad (2.14)$$

$$HEVC_o = \begin{bmatrix} 18 & -50 & 75 & -89 \\ 50 & -89 & 18 & 75 \\ 75 & -18 & -89 & -50 \\ 89 & 75 & 50 & 18 \end{bmatrix} \quad (2.15)$$

The even and odd components can be further factorized as given by eq. (2.16) and

eq. (2.17) respectively.

$$HEVC_e = F_{1e} \cdot \left( 32F_{2e} + 4F_{4e} + 15F_{3e} \right) \quad (2.16)$$

$$HEVC_o = \left( 15F_{2o} + 22F_{3o} + F_{4o} \right) \cdot F_{1o} + 5F_{5o} \quad (2.17)$$

Notice that the factors $F_{1e}$, $F_{2e}$, $F_{3e}$, $F_{1o}$, $F_{2o}$ and $F_{3o}$ are the same as those defined in eq. (2.6), eq. (2.7), eq. (2.8) and eq. (2.9) for the H.264 and VC-1 factorization. $F_{4e}$, $F_{4o}$ and $F_{5o}$, defined by eq. (2.18), eq. (2.19) and eq. (2.20) respectively, are extremely sparse matrices.

$$F_{4e} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{bmatrix} \quad (2.18)$$


$$F_{4o} = \begin{bmatrix} 0 & 0 & 0 & -1 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \quad (2.19)$$

$$F_{5o} = \begin{bmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \quad (2.20)$$

Since most of the factors for the HEVC transform matrix are the same as those for H.264

and VC-1, it is possible to achieve an efficient hardware implementation with shared

architecture between H.264, VC-1 and HEVC, as shown in Figure 2-9 and Figure 2-10,

for even and odd components respectively.

This demonstrates that matrix factorization can be extended to standards beyond H.264

and VC-1 to achieve shared hardware implementations for multiple standards.
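As before, the HEVC factorization can be verified numerically with a short NumPy sketch (software check only), reusing the factor matrices reconstructed above:

```python
import numpy as np

# Shared factors from the H.264/VC-1 factorization (eqs. 2.6-2.9).
F1e = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, -1], [1, 0, -1, 0]])
F2e = np.array([[2, 0, 2, 0], [2, 0, -2, 0], [0, 2, 0, 1], [0, 1, 0, -2]])
F3e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, -1]])
F1o = np.array([[1, 0, 0, -4], [0, 1, 4, 0], [0, 4, -1, 0], [4, 0, 0, 1]])
F2o = np.array([[0, 1, -1, 0], [-1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
F3o = np.array([[1, 0, 0, 0], [0, 0, -1, 0], [0, -1, 0, 0], [0, 0, 0, 1]])
# HEVC-only sparse factors (eqs. 2.18-2.20).
F4e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 1], [0, 1, 0, -1]])
F4o = np.array([[0, 0, 0, -1], [0, -1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]])
F5o = np.array([[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]])

HEVCe = F1e @ (32 * F2e + 4 * F4e + 15 * F3e)        # eq. (2.16)
HEVCo = (15 * F2o + 22 * F3o + F4o) @ F1o + 5 * F5o  # eq. (2.17)

assert HEVCe.tolist() == [[64, 83, 64, 36], [64, 36, -64, -83],
                          [64, -36, -64, 83], [64, -83, 64, -36]]
assert HEVCo.tolist() == [[18, -50, 75, -89], [50, -89, 18, 75],
                          [75, -18, -89, -50], [89, 75, 50, 18]]
```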

The identical structure of the transform matrix, as given by eq. (2.2), for H.264, VC-1

and HEVC arises because of the symmetric nature of coefficients in the DCT, which forms

the basis of transforms in all of these standards. As long as a video coding standard uses

transform based on DCT, it will always result in a matrix such as eq. (2.2). Transform

matrices for different standards are multiples of each other with slight variations and can

be factorized into very similar factors to maximize sharing.

The idea of eliminating an explicit transpose memory in 2D transform, as described in

Section 2.1.3, is equally applicable to HEVC. The processing, over four clock cycles, can

be done in the same way as used for H.264 and VC-1, with the results accumulated in the

output buffer.

Tr-ansform Engine for Video Coding58

Figure 2-9: Hardware architecture of the even component for the shared 8x8 transform for H.264, VC-1 and HEVC. Std = {0: VC-1, 1: H.264, 2: HEVC}. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-1.

Data dependent processing, as described in Section 2.1.4, is independent of the video

coding standard being used. Since the nature of the input data (the prediction residue),

as shown in Figure 2-6, is the same for HEVC as for H.264 and VC-1, we can use data

dependent processing to reduce switching activity and power consumption in HEVC trans-

form engine as well. Figure 2-11 shows results of switching activity simulations for the

HEVC transform architecture proposed above. We consistently observe data dependent

processing resulting in an average 25% reduction in switching activity, demonstrating the

applicability of this idea beyond H.264 and VC-1.

Figure 2-10: Hardware architecture of the odd component for the shared 8x8 transform for H.264, VC-1 and HEVC. Std = {0: VC-1, 1: H.264, 2: HEVC}. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-2.

Figure 2-11: Switching activity in the HEVC transform as a function of the DC bias applied to the input data

It should also be noted that the ideas of even-odd decomposition and matrix factorization as well as eliminating an explicit transpose memory can be applied to transform matrices

of larger sizes such as 16x 16 and 32x32. The ideas proposed in this work can potentially

be extended to future video coding standards that use DCT based transforms.


The benefits of these optimizations become even more significant for larger size transforms. For example, for the 32x32 transform in HEVC [71], the transform weights are 8 bit wide as opposed to 5 bit in H.264 [57]. In addition, each 1D coefficient computation requires 32 add-multiply operations as opposed to 8 add-multiply operations. This leads to 6.4x higher complexity per pixel in the HEVC transform compared to H.264. The 32x32

HEVC transform also requires 16x larger transpose memory compared to 8x8 transform

in H.264. A hardware implementation of the HEVC decoder, proposed in [72], shows that

the transform module constitutes about 17% of the decoder area and power consumption.

This indicates that the area and power savings achieved by the ideas proposed in this work

can be significant towards achieving a low power video encoder/decoder implementation

for future video coding standards, such as HEVC.

2.3 Statistical Methodology for Low-Voltage Design

The performance of logic circuits is highly sensitive to variation in threshold voltage (VT)

at low voltages, and this variation can also result in functional failures at the extremes of VT variation.

For minimum geometry transistors, threshold voltage variation of 25 mV to 50 mV is

typical. At nominal VDD such as 1 V or 1.2 V, local variations in threshold voltage may

result in 5% to 10% variation in the logic timing. However, for low voltage operation

(VDD ≈ 0.5 V), these variations can result in timing path delays with a standard deviation

comparable to the global corner delay, and must be accounted for during timing closure

in order to ensure a robust, manufacturable design.

This challenge has been recognized [42,73,74] and circuit design techniques for low-voltage

operation have begun to take into account Statistical Static Timing Analysis (SSTA) ap-

proaches for estimating circuit performance [75]. A logic gate design methodology ac-

counting for global process corners that identifies logic gates with severely asymmetric

pullup/pulldown networks is proposed in [76]. Nominal delay and delay variability mod-

els valid in both above and subthreshold regions are proposed in [77]. A transistor sizing


methodology to manage the trade-off between reducing variability and minimizing energy

overhead is proposed in [78]. Most of these statistical approaches make the assumption

that the impact of variations on circuit performance can be modeled as a Gaussian distri-

bution. This assumption is usually accurate at nominal voltage [79,80], but fails to capture

the non-linear impact of variations on circuit performance at low-voltage that results in

highly non-Gaussian delay distributions. This phenomenon is depicted in Figure 2-12,

which shows the delay Probability Density Function (PDF) of a representative path at

0.5 V, estimated using Gaussian SSTA and Monte-Carlo analysis. Static Timing Analysis

(STA) estimates the global corner delay for the path to be 14.1 ns. Modeling the impact

of variations using Gaussian SSTA results in a 3σ delay estimate of 23.1 ns. However, Monte-Carlo analysis suggests that Gaussian SSTA is not adequate to fully capture the impact of variations and results in a 3σ delay estimate of 31.8 ns.

Figure 2-12: Delay PDF of a representative timing path at 0.5 V. The STA estimate of the global corner delay is 14.1 ns, the 3σ delay estimate using Gaussian SSTA is 23.2 ns and the 3σ delay estimate using Monte-Carlo analysis is 31.8 ns.

Performing large Monte-Carlo simulations for processor designs with millions of transistors

is impractical. We use a computationally efficient approach, called the Operating Point

Analysis (OPA) [81], that can perform accurate path-based timing analysis in the regime

where delay is a highly non-linear function of the random variables and/or the PDFs of


the random variables are non-Gaussian. OPA provides an approximation to the fσ value of a random variable $D$, when $D$ is a linear or non-linear function $D(x_1, x_2, \ldots, x_N)$ of random variables $x_i$, which can be Gaussian or non-Gaussian. The fσ operating point is the point in $x_i$-space where the joint probability density function of the $x_i$ is maximum, subject to the constraint that $D(x_1, x_2, \ldots, x_N) = D_{f\sigma}$. In other words, the operating point represents the most likely combination of random variables $x_i$ that results in the fσ delay for the logic gate or the timing path. Figure 2-13 illustrates the convolution integrand and the operating point, where delay is a non-linear function of two variables. A transcendental relationship is established between the unknown operating point and the unknown fσ delay, and this equation is solved iteratively.
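As a toy illustration of the operating-point idea (this is not the OPA algorithm of [81], which solves the transcendental equation directly): for independent standard-Gaussian mismatch variables, the most likely point on a delay level set is the one closest to the origin, so a first-order estimate of the fσ delay is the largest delay reachable within $\|x\| \le f$. A minimal SciPy sketch with a hypothetical two-variable delay function:

```python
import numpy as np
from scipy.optimize import minimize

f = 3.0  # sigma multiple of interest

def delay(x):
    # Hypothetical non-linear delay vs. two standard-Gaussian mismatch
    # variables; the exponentials mimic the strong VT sensitivity of
    # near-threshold logic (illustrative only).
    return 10.0 + 4.0 * np.exp(0.5 * x[0]) + 2.0 * np.exp(0.8 * x[1])

# Maximize the delay subject to staying within f standard deviations of
# the nominal point; the maximizer is the operating point.
res = minimize(lambda x: -delay(x), x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda x: f**2 - x @ x}])
print("operating point:", res.x, " f-sigma delay estimate:", delay(res.x))
```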

Figure 2-13: Graphic illustration in $x_i$-space of the convolution integral, and the operating point.

The methodology, developed in [82], is summarized below.

Standard Cell Library Characterization

For the 45 nm process used in this work, Random Dopant Fluctuations (RDF) induced lo-

cal variations were modeled by two compact model parameters for each transistor. These

transistor random variables (also called mismatch parameters) are statistically indepen-


dent with approximately Gaussian PDF. The OPA approach is applicable for any local

variations given by a compact model of transistor mismatch parameters. The goal of cell

characterization is to predict the delay PDF for each arc of each cell. An arc is defined as

input rise or fall, input slew rate and output capacitance. At nominal voltage, cell delay is

approximately linear in the transistor random variables, with the result that the cell delay

is approximately Gaussian. However, at 0.5 V, cell delay is highly non-linear in transistor

random variables, with the result that the cell delay has a non-Gaussian PDF.

OPA is used to perform stochastic characterization of the standard cell library at VDD =

0.5 V. This characterization ensures functionality and quantifies the performance of

standard cells at VDD = 0.5 V. Standard cells that fail the functionality or do not satisfy

the performance requirement are not used in the design. The functionality and setup/hold

performance of flip-flops are also verified using the cell characterization approach.

Timing Path Analysis

The goal of timing path analysis is to compute the 3σ (or in general fσ) stochastic delay of a timing path. OPA is used, along with the pre-characterized standard cell library, to determine the 3σ setup/hold performance for individual paths from the design at 0.5 V.

Figure 2-14 shows the PDF computed using OPA superimposed on the PDF computed using Monte-Carlo, for the path analyzed in Figure 2-12 at 0.5 V. Monte-Carlo analysis results in a 3σ delay estimate of 31.8 ns. OPA shows excellent agreement with Monte-Carlo, with a 3σ delay estimate of 30.7 ns.

Full-Chip Timing Closure

Given the size of the design, it is not practical to analyze each path individually to deter-

mine the 3σ setup/hold performance. At nominal voltage, paths that fail the setup/hold

requirement are determined using the corner-based analysis and timing closure is achieved

by performing setup/hold fix on these paths. However, at low voltage, it is not possible


Figure 2-14: Delay PDF of a representative timing path at 0.5 V, estimated using Gaussian SSTA, Monte-Carlo and OPA.

to consider only the paths that fail the setup/hold requirement in the corner analysis and

determine their 3σ setup/hold performance, since a path with a larger corner delay need

not have a larger stochastic variation.

A three phase approach, outlined below, is used to reduce the number of paths that need

to be analyzed for setup/hold constraints using OPA analysis.

1. All paths are analyzed with traditional STA using the corner delay plus the 3σ stochastic delay for each cell. This is a pessimistic analysis, so those paths that pass this analysis can be removed from further consideration.

2. The paths that did not pass during the first phase are re-analyzed, this time using

OPA for the launch and capture clock paths, as defined in Figure 2-15, and STA

with the corner delay plus the 3σ stochastic delay for cells in the data paths. Again,

this is a pessimistic analysis and any paths that pass during this phase need no

further consideration.

3. Lastly, the few remaining paths are analyzed using OPA for the entire path.

Figure 2-15: Typical timing path.

The paths that fail the 3σ setup or hold performance test are optimized to fix the

setup/hold violations. This process is repeated until all the timing paths in the design meet the 3σ setup and hold performance computed using OPA. Setup/hold fixing using OPA ensures that cells that are very sensitive to VT variations are not used in the critical paths. Table 2.3 shows statistics on the number of paths analyzed during each phase of timing closure, for both setup and hold analysis of the entire chip.

Table 2.3: Full-chip Timing Analysis

Setup Analysis @ 25 MHz
Phase   Data Path    Clock Path   Paths Analyzed   Worst Slack   % Fail
1       STA (+3σ)    STA (-3σ)    20k              -14.2 ns      5%
2       STA (+3σ)    OPA          1k               -3.2 ns       9%
3       OPA          OPA          87               -0.2 ns       12%
Paths requiring fixing (before timing closure): 10

Hold Analysis
Phase   Data Path    Clock Path   Paths Analyzed   Worst Slack   % Fail
1       STA (-3σ)    STA (+3σ)    20k              -11.2 ns      7%
2       STA (-3σ)    OPA          1.4k             -2.5 ns       8%
3       OPA          OPA          112              -0.1 ns       14%
Paths requiring fixing (before timing closure): 16


The overall statistical design methodology can be summarized as shown in Figure 2-16.

Figure 2-16: OPA based statistical design methodology for low voltage operation.

2.4 Implementation

In this work, we implemented ten different versions of the transform engine, listed in

Table 2.4, and compared their relative performance.

All transforms have been implemented to complete an 8 x 8 transform over 4 clock cycles.


Table 2.4: Transform engines implemented in this design

Tr. Type   Description
HVF8       Shared 8x8 forward transform without transpose memory
HVI8       Shared 8x8 inverse transform without transpose memory
HVF8TM     Shared 8x8 forward transform with transpose memory
HVI8TM     Shared 8x8 inverse transform with transpose memory
HF8        8x8 forward transform for H.264 without transpose memory
HI8        8x8 inverse transform for H.264 without transpose memory
VF8        8x8 forward transform for VC-1 without transpose memory
VI8        8x8 inverse transform for VC-1 without transpose memory
HVF4       Shared 4x4 forward transform
HVI4       Shared 4x4 inverse transform

In this design, the output buffer has been implemented as a register bank of size 8 x 8 with

each element being 8 bit wide. The architecture of the 2D transform engine, along with

the output buffer, is shown in Figure 2-17.

Figure 2-18 shows the die photo of the IC fabricated using commercial 45 nm CMOS

technology. The gate counts in Figure 2-18 include the output buffer as well.

The proposed shared transform engine design uses separate 1D transforms for column

and row-wise computations and does not use a transpose memory. The 1D column and

row-wise transforms are designed using the shared architectures described in Section 2.1.2

and 2.1.3 respectively. The 2D output buffer is used to store intermediate data.

The shared transform modules with transpose memory are implemented using the shared

1D transform architecture described in Section 2.1.2 for both column and row wise trans-

forms.

Figure 2-17: Block diagram of the 2D transform engine design

Figure 2-18: Die photo and design statistics of the fabricated IC. Design statistics: active area 1.5 mm2, 45 nm technology, 96 I/O pads. Gate counts: HVF8 44.7k, HVI8 45.1k, HVF8TM 66.5k, HVI8TM 66.8k, HF8 30.9k, HI8 31.6k, VF8 35.6k, VI8 35.8k, HVF4 18.8k, HVI4 18.9k.

Each 1D transform processes 8x2 data in each clock cycle and a 16x8 transpose memory, which constitutes 48% of the gate count, is used to allow operation in ping-pong

mode to achieve a throughput of 8 x 8 2D transform over 4 cycles. An alternative approach

to achieve the same throughput is to process 8 x 4 data in each clock cycle and use an 8 x 8

transpose memory. This has not been implemented on chip, however synthesis results

show 15% higher overall gate count for this approach.


2.5 Measurement Results

The shared architecture for the 8x8 transform (HVF8/HVI8) is able to achieve a 25% reduction in area compared to the combined area of the individual 8x8 transforms for H.264 (HF8/HI8) and VC-1 (VF8/VI8). Eliminating the explicit transpose memory helps save 23% area compared to the implementation that uses a transpose memory (HVF8TM/HVI8TM). The decoder only uses inverse transforms. The encoder requires both forward and inverse transforms, thus doubling the area savings due to hardware sharing.

Figure 2-19 shows the measured power consumption and frequency for different transform modules as a function of VDD.

Figure 2-19: Measured power consumption and frequency scaling with VDD for different transform implementations. (a) Frequency scaling with VDD, (b) power consumption while operating at the frequency shown in (a).

All the transform modules implemented on this chip have been verified to be operational

to support video encoding/decoding with Quad-Full HD (3840 x 2160) resolution at 30 fps.

The shared 8x8 transform is able to achieve video encoding/decoding in both H.264 and

VC-1 with 3840 x 2160 (QFHD) resolution at 30 fps, while operating at 25 MHz frequency

at 0.52 V. The module is also able to achieve 1080p (Full-HD) at 30 fps, while operating at

6.3 MHz at 0.41 V and 720p (HD) at 30 fps, while operating at 2.8 MHz at 0.35 V.

Measurement results for all the modules are summarized in Table 2.5.

Table 2.5: Measurement results for implemented transform modules

Transform   QFHD@30fps (25 MHz)     1080p@30fps (6.3 MHz)   720p@30fps (2.8 MHz)
Type        VDD (V)  Power (µW)     VDD (V)  Power (µW)     VDD (V)  Power (µW)
HVF8        0.52     214            0.41     79             0.35     43
HVI8        0.53     218            0.42     81             0.36     44
HVF8TM      0.50     270            0.40     95             0.33     51
HVI8TM      0.49     268            0.40     94             0.33     50
HF8         0.51     175            0.41     67             0.34     35
HI8         0.50     172            0.40     66             0.33     34
VF8         0.51     189            0.41     70             0.35     38
VI8         0.51     188            0.41     70             0.34     37
HVF4        0.49     127            0.39     55             0.33     31
HVI4        0.48     124            0.40     54             0.33     30

Figure 2-20 compares the power consumption of shared transform without transpose mem-

ory, shared transform with transpose memory and individual transform implementations

for H.264 and VC-1. While supporting Quad Full-HD resolution, eliminating explicit

transpose memory helps reduce power consumption of the 8x8 transform by 26%.

Figure 2-20: Power consumption for transform modules with and without transpose memory, with and without shared architecture for H.264 and VC-1

Data dependent processing affects different architectures differently because of varying

degrees of correlation between input switching activity and system switching activity.

Figure 2-21 shows the switching activity and power consumption for different transform

modules as a function of the input DC bias. We observe a reduction in switching activity

by 25%-30% across the modules, resulting in a 15%-20% power saving.

Figure 2-21: Switching activity and power consumption in the transform as a function of the DC bias applied to the input data

Table 2.6 summarizes the overheads and advantages of the three key ideas proposed in this

work. Applying DC bias requires 16 adders (with one fixed input, i.e. the DC bias) that

cause 5% increase in area and 4% increase in power. But it helps reduce the switching

activity by 30%, which results in a 15% overall power saving for the design. Hardware

sharing requires 26 additional 2:1 multiplexers that consume 9% area and 6% power.

But sharing helps us implement the H.264 and VC-1 transforms using 78 adders and

62 multiplexers (including the overhead), as opposed to 126 adders and 60 multiplexers

for individual H.264 and VC-1 implementations, which reduces the overall area by 25%.

The scheme for eliminating transpose memory requires us to access 8x8 data in each

clock cycle for row-wise transform computations. This increases the data accesses by 4x

for the row-wise computations, as opposed to the implementation that uses a transpose

memory. The increased data accesses lead to 7% increase in power consumption. However,

the ability to eliminate transpose memory saves 23% area and 26% power. Overall, the

proposed design optimizations help reduce the power consumption by about 40%, despite

the overhead.

Table 2.6: Overheads and Advantages of proposed ideas

Feature                                   Overhead                  Advantage
Data Dependent Processing                 5% area, 4% power         30% reduction in switching activity, 15% reduction in power
Hardware Sharing between H.264 and VC-1   9% area, 6% power         25% reduction in area
Transpose Memory Elimination              4x data access, 7% power  23% reduction in area, 26% reduction in power

Table 2.7 shows a performance comparison of the proposed approach for 2D transform

implementation with some previously published approaches. The comparison shows that

the proposed approach achieves a significant reduction in power compared to the previous

approaches. Assuming a roughly 4x scaling in power due to technology scaling from

180 nm to 45 nm, the architectural techniques proposed in this work achieve a reduction

in power consumption by over 45x compared to [83] and 68x compared to [84], while achieving the same throughput at VDD = 0.52 V.

Table 2.7: Performance comparison of proposed approach with previous publications

                      Huang'08 [83]   Fan'11 [84]    Wang'11 [85]      Chen'11 [86]    This Work (Low VDD)   This Work (Nominal VDD)
Technology            180 nm          180 nm         130 nm            180 nm          45 nm                 45 nm
Gates                 39.8k           95.1k          23.1k             17.7k           44.7k                 44.7k
Parallelism           8x              8x             8x                4x              16x                   16x
Throughput            400M pixels/s   400M pixels/s  800M pixels/s     1000M pixels/s  400M pixels/s         4640M pixels/s
Frequency             50 MHz          50 MHz         100 MHz           250 MHz         25 MHz                290 MHz
Voltage               1.8 V           1.8 V          1.2 V             1.8 V           0.52 V                1.0 V
Power                 38.7 mW         58.01 mW       -                 54 mW           214 µW                4.1 mW
Supported Standards   MPEG, H.264     MPEG, H.264,   MPEG, H.264,      H.264           H.264, VC-1           H.264, VC-1
                                      AVS, VC-1      AVS, VC-1
Transform Type        Forward         Inverse        Forward, Inverse  Inverse         Forward, Inverse      Forward, Inverse

2.6 Summary and Conclusions

The ability to perform very high-resolution video encoding/decoding for multiple stan-

dards at ultra-low voltage to achieve low power operation is critical in multimedia devices.

In this work, we have developed a shared architecture for H.264/AVC and VC-1 transform

engine. Similarity between the structure of transform matrices is exploited to perform

matrix decomposition to maximize hardware sharing. The shared architecture helps to


save more than 30% hardware compared to total hardware requirement of individual

H.264/AVC and VC-1 transform implementations. An approach to eliminate an explicit

transpose memory is demonstrated, by using a 2D output buffer and separately designing

the row-wise and column-wise 1D transforms. This helps us reduce the area by 23% and

save power by 26% compared to the implementation that uses transpose memory. We

have demonstrated that data dependent processing can help reduce the switching activity

by more than 30% and further reduce power consumption. The implementation is able to

support Quad-Full HD (3840 x 2160) video encoding/decoding at 30 fps while operating

at 0.52 V.

The ideas of matrix factorization for hardware sharing, eliminating transpose memory

and data dependent processing could potentially be extended to other coding standards

as well. As bigger block sizes such as 32x32 and 64x64 are explored in future video

coding standards like HEVC, these ideas could lead to even higher savings in area and

power requirement of the transform engine, allowing their efficient implementation in

multi-standard multimedia devices.

Exploration of the ideas proposed in this work leads to the following conclusions.

1. Reconfigurable hardware architectures that implement optimized core functional

units for a class of applications, such as video coding, and enable configurable data-

paths with distributed control, are key to supporting efficient processing for multiple

applications. Algorithmic optimizations that reframe the algorithms are important

for enabling hardware reconfigurability.

2. Data dependent processing can be a powerful tool in reducing system power con-

sumption. By exploiting the characteristics of the data being processed, architec-

tures can be designed to minimize switching activity, optimize pipeline bit widths

and perform variable number of operations per block. The reduction in computations

and switching activity has a direct impact on the system power consumption.

3. Memory size and power consumption can have a significant impact on system effi-


ciency. Architectural approaches that trade off small increases in logic complexity for significant reductions in memory size and power consumption can provide optimal system design solutions.

4. Low-voltage operation of circuits is important to provide wide voltage/frequency

operating range and attain minimum energy operation. Global and local variations

have a significant impact on circuit performance at low voltages. This impact cannot be fully captured with corner-based STA or Gaussian SSTA techniques. Statis-

tical design approaches that take into account the non-linear impact of variations

on circuit performance at low-voltage must be used to ensure reliable low-voltage

operation.


Chapter 3

Reconfigurable Processor for Computational Photography

Computational photography is transforming digital photography by significantly enhanc-

ing and extending the capabilities of a digital camera. The field encompasses a wide range

of techniques such as High Dynamic Range (HDR) imaging [87], low-light enhancement

[138,139], panorama stitching [88], image deblurring [89] and light field photography [90],

that allow users to not just capture a scene flawlessly, but also reveal details that could

otherwise not be seen.

Non-linear filtering techniques, such as bilateral filtering [31,91,92], anisotropic diffusion

[93,94] and optimization [95,96], form a significant part of computational photography.

The behaviors of such techniques have been well studied and characterized [97-102]. These

techniques have a wide range of applications, including denoising [103,104], HDR imaging

[87], low-light enhancement [138,139], tone management [105,106], video enhancement

[107,108] and optical flow estimation [109,110]. The high computational complexity of

such multimedia processing applications necessitates fast hardware implementations [111,

112] to enable real-time processing in an energy-efficient manner.

Recent research has focused on specialized image sensors to capture information that is


not captured by a regular CMOS image sensor. An image sensor with multi-bucket pixels

is proposed in [113] to enable time multiplexed exposure that improves the image dynamic

range and detects structured light illumination. A back-illuminated stacked CMOS sensor

is proposed in [114] that uses spatially varying pixel exposures to support HDR imaging.

An approach to reduce the temporal readout noise in an image sensor is proposed in [115]

to improve low-light-level imaging. However, computational photography applications

using regular CMOS image sensors that are currently used in the commercial cameras

have so far remained software based. Such CPU/GPU based implementations lead to

high energy consumption and typically do not support real-time processing.

This work implements a reconfigurable multi-application processor for computational pho-

tography by exploring power reduction techniques at various design stages - algorithms,

architectures and circuits. The algorithms are optimized to reduce the computational

complexity and memory requirement. A parallel and pipelined architecture enables high

throughput while operating at low frequencies, which allows real-time processing on HD

images. Circuit design for low voltage operation ensures reliable performance down to

0.5 V.

The reconfigurable hardware implementation performs HDR imaging, low-light enhanced

imaging and glare reduction, as shown in Figure 3-1. The filtering engine can also be ac-

cessed from off-chip and used with other applications. The input images are pre-processed

for the specific functions. The core of the processing unit are two bilateral filter engines

that operate in parallel and decompose an image into a low frequency base layer and a

high frequency detail layer. Each bilateral filter uses further parallelism within it. The

choice of two parallel engines is based on the throughput requirements for real-time pro-

cessing and the amount of memory bandwidth available to keep all the engines active.

The processor is able to access 8 pixels per clock cycle and each filtering engine is capable

of processing 4 pixels per clock cycle. Bilateral filtering is performed using a bilateral grid

structure [116] that converts an input image into a three dimensional data structure and

filters it by convolving with a three dimensional Gaussian kernel.

Figure 3-1: System block diagram for the reconfigurable computational photography processor

Parallel processing allows enhanced throughput while operating at low frequency and low voltage. The bilateral

filtered images are post processed to generate the outputs for the specific functions.

This chapter describes bilateral filtering and its efficient implementation using the bilat-

eral grid. A scalable hardware architecture for the bilateral filter engine is described in

Section 3.2. Implementation of HDR imaging, low-light enhancement and glare reduction

using bilateral filtering is discussed in Section 3.3. The challenges of low voltage operation

and approaches to address process variation are described in Section 3.4. The significance

of architectural optimizations for reducing external memory bandwidth and power consumption, crucial to enhancing the system's energy-efficiency, is described in Section 3.5.

Section 3.6 provides measurement results for the testchip.

3.1 Bilateral Filtering

Bilateral filtering is a non-linear filtering technique that traces its roots to the non-linear

Gaussian filters proposed in [31] for edge-preserving diffusion. It takes into account the

difference in the pixel intensities as well as the pixel locations while assigning weights, as


opposed to linear Gaussian filtering that assigns filter weights based solely on the pixel

locations [91,92]. For an image I at pixel position p, the bilateral filtered output, I_B, is

defined by eq. (3.1).

I_B(p) = (1/W(p)) Σ_{n=-N}^{N} G_s(n) · G_I(I(p) − I(p−n)) · I(p−n)    (3.1)

where

W(p) = Σ_{n=-N}^{N} G_s(n) · G_I(I(p) − I(p−n))

The output value at each pixel in the image is a weighted average of the values in a

neighborhood, where the weight is the product of a Gaussian on the spatial distance

(G_s) with standard deviation σ_s, and a Gaussian on the pixel intensity/range difference
(G_I) with standard deviation σ_r. In linear Gaussian filtering, on the other hand, the

weights are determined solely by the spatial term. In bilateral filtering, the range term

GI(I(p) - I(p - n)) ensures that only those pixels in the vicinity that have similar in-

tensities contribute significantly towards filtering. This avoids blurring across edges and

results in an output that effectively reduces the noise while preserving the scene details.

Figure 3-2 compares Gaussian filtering and bilateral filtering in reducing image noise and

preserving details.
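As a concrete reference, eq. (3.1) can be evaluated directly in a few lines of Python. This is purely an illustrative sketch of the mathematics, with function and parameter names of our choosing, not the processor's implementation:

```python
import numpy as np

def bilateral_filter(img, sigma_s=4.0, sigma_r=25.0, radius=8):
    """Direct evaluation of eq. (3.1) for a 2D grayscale float image."""
    h, w = img.shape
    out = np.zeros_like(img)
    # Spatial Gaussian G_s over the (2*radius+1)^2 window, computed once.
    ax = np.arange(-radius, radius + 1)
    dx, dy = np.meshgrid(ax, ax)
    g_s = np.exp(-(dx**2 + dy**2) / (2.0 * sigma_s**2))
    pad = np.pad(img, radius, mode='edge')
    for y in range(h):
        for x in range(w):
            window = pad[y:y + 2*radius + 1, x:x + 2*radius + 1]
            # Range Gaussian G_I on the intensity difference to the center pixel;
            # this term must be recomputed for every pixel.
            g_i = np.exp(-(window - img[y, x])**2 / (2.0 * sigma_r**2))
            weights = g_s * g_i
            out[y, x] = np.sum(weights * window) / np.sum(weights)  # divide by W(p)
    return out
```

The per-pixel recomputation of the range kernel in the inner loop is exactly what makes the direct form slow, which motivates the grid-based approach described next.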

However, non-linear filtering is inefficient and slow to implement because the filter kernel

is spatially variant and needs to be recomputed for filtering every pixel. In addition, most

computational photography applications require large filter kernels, 64 x 64 or more. A di-

rect implementation of bilateral filtering can take several minutes to process HD images on

a CPU. Faster approaches for bilateral filtering have been proposed. A separable approx-

imation of the bilateral filter is proposed in [117] that speeds up processing and improves

efficiency for applications that use small filter kernels, such as denoising. Optimization

techniques have been proposed that reduce the processing time by filtering subsampled

versions of the image with discrete intensity kernels and reconstructing the filtered results using linear interpolation [87,118]. A fast approach to bilateral filtering based on a box spatial kernel, which can be iterated to yield smooth spatial falloff, is proposed in [119]. However, real-time processing of HD images requires further speed-up.

Figure 3-2: Comparison of Gaussian filtering and bilateral filtering. Bilateral filtering effectively reduces noise while preserving scene details.

3.1.1 Bilateral Grid

The bilateral grid structure for fast bilateral filtering is proposed in [116], where the pro-

cessing complexity is reduced by down-sampling the image for filtering. But to preserve

the details while down-sampling, a third intensity dimension is added so that pixels with

very different intensities, within a block being down-sampled, are assigned to different in-

tensity levels, thus preserving the intensity differences. This results in a three dimensional

structure. Creating a 3D bilateral grid and processing it requires a large amount of storage (65 MB for a 10 megapixel image). In this work, we implement bilateral filtering using a reconfigurable grid. To translate the grid structure efficiently into hardware, we convert it into a data structure. The storage requirement is reduced to 21.5 kB by scheduling

the filtering engine tasks so that only two grid rows need to be stored at a time. The

implementation is flexible to allow varying grid sizes for energy/resolution scalable image

processing.

The bilateral grid structure used by this chip is constructed as follows. The input image is partitioned into blocks of size σ_s × σ_s pixels and a histogram of pixel intensity values is generated for each block. Each histogram has 256/σ_r bins, where each bin corresponds to an intensity level in the grid. This results in a 3D representation of the 2D image, as shown in Figure 3-3. Each grid cell (i, j, r) stores the number of pixels in a block corresponding to that intensity bin (W_{i,j,r}) and their summed intensity (I_{i,j,r}). To provide flexibility in grid creation and processing, the processor supports block sizes ranging from 16 × 16 to 128 × 128 pixels with 4 to 16 intensity bins in the histogram.
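The grid construction can be summarized by the following Python sketch, illustrative only, assuming an image whose sides are multiples of σ_s (the names are ours, not the chip's):

```python
import numpy as np

def create_bilateral_grid(img, sigma_s=16, sigma_r=16):
    """Build the 3D grid: cell (i, j, r) holds the pixel count (weight) and
    summed intensity of the pixels of block (i, j) falling in intensity bin r.
    img: 2D uint8 array with sides that are multiples of sigma_s."""
    h, w = img.shape
    bins = 256 // sigma_r                 # number of intensity levels
    gh, gw = h // sigma_s, w // sigma_s   # grid height (rows) and width (cols)
    intensity = np.zeros((gh, gw, bins))
    weight = np.zeros((gh, gw, bins))
    for i in range(gh):
        for j in range(gw):
            block = img[i*sigma_s:(i+1)*sigma_s, j*sigma_s:(j+1)*sigma_s]
            r = np.minimum(block // sigma_r, bins - 1)  # bin index per pixel
            for level in range(bins):
                mask = (r == level)
                weight[i, j, level] = mask.sum()          # histogram count
                intensity[i, j, level] = block[mask].sum()  # summed intensity
    return intensity, weight
```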


Figure 3-3: Construction of a 3D bilateral grid from a 2D image

The bilateral grid has two key advantages:

* Aggressive down-sampling: The size of the blocks (σ_s × σ_s) used while creating
the grid and the number of intensity bins (256/σ_r) determine the amount by which
the image is down-sampled. σ_s controls smoothing and σ_r controls the extent of edge-
preservation. Most computational photography applications only require a coarse
grid resolution. The hardware implementation merges blocks of 16 × 16 to 128 × 128
pixels into 4 to 16 grid cells. This significantly reduces the number of computations
required for processing as well as the amount of on-chip storage required.


* Built-in edge awareness: Two pixels that are spatially adjacent but have very

different intensities end up far apart in the grid along the intensity dimension. When the grid is filtered level-by-level using a 3D linear Gaussian kernel, only the intensity levels that are near each other influence the filtering, and the levels that are far apart do not contribute to each other's filtering. Without any downsampling (σ_s = σ_r = 1),

this operation is identical to performing bilateral filtering on the 2D image. Filtering

a down-sampled grid using a 3D Gaussian kernel provides a good approximation to

bilateral filtering the image for most computational photography applications.

3.2 Bilateral Filter Engine

Intensity levels in the bilateral grid can be processed in parallel. This enables a highly

parallel architecture, where 256/σ_r intensity levels are created, filtered and interpolated

in a parallel and pipelined manner. The bilateral filter engine using the bilateral grid is

implemented as shown in Figure 3-4. It consists of three components - the grid assign-

ment engine, the grid filtering engine and the grid interpolation engine. The spatial and

intensity down-sampling factors, σ_s and σ_r, are programmed by the user at the start of the processing. The image is scanned pixel by pixel in a block-wise manner. The size of the block is scalable from 16 × 16 pixels (σ_s = 16) to 128 × 128 pixels (σ_s = 128). Depending on the intensity of the input pixel, it is assigned to one of the intensity bins. The number of intensity bins is also scalable from 4 (σ_r = 64) to 16 (σ_r = 16). As the data structure is

stored on-chip, the different intensity levels in the grid can be processed in parallel. This

enables a highly parallel architecture for processing.

3.2.1 Grid Assignment

The pixels are assigned to the appropriate grid cells by the grid assignment engines. The

hardware has 16 Grid Assignment (GA) engines that can operate in parallel to process 16

intensity levels in the grid, but 4, 8 or 12 grid assignment engines can be activated if the grid uses fewer intensity levels.

Figure 3-4: Architecture of the bilateral filtering engine. Grid scalability is achieved by gating processing engines and SRAM banks.

Figure 3-5 shows the architecture of the grid assignment

engine. For each pixel from each block, its intensity is compared with the boundaries

of the intensity bins using digital comparators. If the pixel intensity is within the bin

boundaries, it is assigned to that intensity bin. Intensities of all the pixels assigned to a

bin are summed by an accumulator. A weight counter maintains the count of number of

pixels assigned to the bin. Both the summed intensity and weight are stored for each bin

in on-chip memory.


Figure 3-5: Architecture of the grid assignment engine.


3.2.2 Grid Filtering

The Convolution (Conv) engine, shown in Figure 3-6, convolves the grid intensities and

weights with a 3 x 3 x 3 Gaussian kernel, which is equivalent to bilateral filtering in the

image domain, and returns the normalized intensity. The convolution is performed by

multiplying the 27 coefficients of the filter kernel with the 27 grid cells and adding them

using a 3-stage adder tree. The intensity and weight are convolved in parallel and the

convolved intensity is normalized with the convolved weight by using a fixed point divider

to make sure that there is no intensity scaling during filtering. The filter coefficients are

programmable to enable filtering operations of different types, including non-separable

filters, to be performed using the same reconfigurable hardware. The coefficients are

programmed by the user at the beginning of the processing; otherwise the default 3 × 3 × 3

Gaussian kernel is used. The hardware has 16 convolution engines that can operate in

parallel to filter a grid with 16 intensity levels. But 4, 8 or 12 of them can be activated if

fewer intensity levels are used in the grid.
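A software model of the convolution engine's filter-and-normalize step might look as follows. This is a sketch under the assumption that assigned cells hold summed intensity and weight, with boundary cells replicated in all three dimensions (an assumption of ours for brevity):

```python
import numpy as np
from itertools import product

def filter_grid(intensity, weight, kernel):
    """Convolve summed intensity and weight with a 3x3x3 kernel, then
    normalize, mirroring the convolution engine's datapath."""
    padded_i = np.pad(intensity, 1, mode='edge')  # replicate boundary cells
    padded_w = np.pad(weight, 1, mode='edge')
    out = np.zeros_like(intensity)
    gh, gw, gr = intensity.shape
    for i, j, r in product(range(gh), range(gw), range(gr)):
        cube_i = padded_i[i:i+3, j:j+3, r:r+3]
        cube_w = padded_w[i:i+3, j:j+3, r:r+3]
        conv_i = np.sum(kernel * cube_i)  # 27 multiplies + 3-stage adder tree
        conv_w = np.sum(kernel * cube_w)
        # Normalization cancels the kernel gain; a fixed-point divider on chip.
        out[i, j, r] = conv_i / conv_w if conv_w > 0 else 0.0
    return out

# A separable 3-tap Gaussian extended to 3D, as a stand-in default kernel.
g = np.array([0.25, 0.5, 0.25])
kernel_3d = g[:, None, None] * g[None, :, None] * g[None, None, :]
```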


Figure 3-6: Architecture of the convolution engine for grid filtering.


Reconfigurable Processor for Computational Photography

3.2.3 Grid Interpolation

The interpolation engine, shown in Figure 3-7, reconstructs the filtered 2D image from the

filtered grid. The filtered intensity value at pixel (x, y) is obtained by trilinear interpolation of the 2 × 2 × 2 filtered grid values surrounding the location (x/σ_s, y/σ_s, I_xy/σ_r). Trilinear

interpolation is equivalent to performing linear interpolations independently across each

of the three dimensions of the grid. To meet throughput requirements, the interpolation

engine is implemented as three pipelined stages of linear interpolations. The output value

I_BF(x, y) is calculated from the filtered grid values F using four parallel linear interpolations along the i dimension, given by eq. (3.2):

F_{j,r} = F_{i,j,r} · w_1^i + F_{i+1,j,r} · w_2^i
F_{j+1,r} = F_{i,j+1,r} · w_1^i + F_{i+1,j+1,r} · w_2^i
F_{j,r+1} = F_{i,j,r+1} · w_1^i + F_{i+1,j,r+1} · w_2^i
F_{j+1,r+1} = F_{i,j+1,r+1} · w_1^i + F_{i+1,j+1,r+1} · w_2^i    (3.2)

followed by two parallel linear interpolations along the j dimension, given by eq. (3.3):

F_r = F_{j,r} · w_1^j + F_{j+1,r} · w_2^j
F_{r+1} = F_{j,r+1} · w_1^j + F_{j+1,r+1} · w_2^j    (3.3)

followed by an interpolation along the r dimension, given by eq. (3.4):

I_BF(x, y) = F_r · w_1^r + F_{r+1} · w_2^r    (3.4)

The interpolation weights, given by eq. (3.5), are computed based on the output pixel location (x, y), the intensity I_xy of the original pixel in the input image at location (x, y), and the grid cell index (i, j, r):

w_1^i = i + 1 − x/σ_s;    w_2^i = x/σ_s − i
w_1^j = j + 1 − y/σ_s;    w_2^j = y/σ_s − j
w_1^r = r + 1 − I_xy/σ_r;    w_2^r = I_xy/σ_r − r    (3.5)

The pixel location (x, y) and the grid cell index (i, j, r) are maintained in internal counters.

The original pixel intensity I_xy is read from the DRAM in chunks of 32 pixels per read

request to fully utilize the memory bandwidth.
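The three pipelined interpolation stages map directly onto code; the following sketch evaluates eqs. (3.2)-(3.5) for one output pixel, assuming the grid is stored as filtered[row, col, level] (names are illustrative):

```python
def interpolate_pixel(filtered, img, x, y, sigma_s=16, sigma_r=16):
    """Trilinear interpolation per eqs. (3.2)-(3.5); returns the bilateral
    filtered value at pixel (x, y). filtered: 3D array (rows, cols, levels)."""
    gh, gw, levels = filtered.shape
    fx, fy, fr = x / sigma_s, y / sigma_s, img[y, x] / sigma_r
    i, j, r = int(fx), int(fy), int(fr)
    i1, j1, r1 = min(i + 1, gw - 1), min(j + 1, gh - 1), min(r + 1, levels - 1)
    # Weights per eq. (3.5): w1 weights the lower cell index, w2 the upper.
    w1i, w2i = i + 1 - fx, fx - i
    w1j, w2j = j + 1 - fy, fy - j
    w1r, w2r = r + 1 - fr, fr - r
    # Stage 1: four interpolations along the i (column) dimension, eq. (3.2).
    f_jr = filtered[j, i, r] * w1i + filtered[j, i1, r] * w2i
    f_j1r = filtered[j1, i, r] * w1i + filtered[j1, i1, r] * w2i
    f_jr1 = filtered[j, i, r1] * w1i + filtered[j, i1, r1] * w2i
    f_j1r1 = filtered[j1, i, r1] * w1i + filtered[j1, i1, r1] * w2i
    # Stage 2: two interpolations along the j (row) dimension, eq. (3.3).
    f_r = f_jr * w1j + f_j1r * w2j
    f_r1 = f_jr1 * w1j + f_j1r1 * w2j
    # Stage 3: one interpolation along the r (intensity) dimension, eq. (3.4).
    return f_r * w1r + f_r1 * w2r
```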


Figure 3-7: Architecture of the interpolation engine. Trilinear interpolation is implemented as

three pipelined stages of linear interpolations.

The assigned and filtered grid cells are stored in the on-chip memory. The last three assigned blocks are stored in a temporary buffer and the two previous rows of grid blocks are stored in the SRAM. The last two filtered blocks are stored in the temporary buffer and one filtered grid row is stored in the SRAM. The on-chip SRAM can store up to 256 blocks per row with 16 intensity levels.


3.2.4 Memory Management

The grid processing tasks are scheduled to minimize local storage requirements and mem-

ory traffic. Figure 3-8 shows the memory management scheme by task scheduling. Grid

processing is performed cell-by-cell in a row-wise manner. The last three blocks are stored

in the temporary buffer and the last two rows are stored in the SRAM. Once a 3x3x3

block is available, the convolution engine begins filtering the grid. When block A, shown

in Figure 3-8, is being assigned, the convolution engine is filtering block F. As filtering

proceeds to the next block in the row, the first assigned block, stored in the SRAM,

becomes redundant and is replaced by the first assigned block in the temporary buffer.

The last two filtered blocks are stored in the temporary buffer and the previous row of filtered blocks is stored in the SRAM. As 2 × 2 × 2 filtered blocks become available, the

interpolation engine begins reconstructing the output 2D image. When block F, shown

in Figure 3-8, is being filtered, the interpolation engine is reconstructing the output 2D

image from block I. As interpolation proceeds to the next block in the row, the first fil-

tered block, stored in the SRAM, becomes redundant and is replaced by the first filtered

block in the temporary buffer. Boundary rows and columns are replicated for processing

boundary cells. This scheduling scheme allows processing without storing the entire grid.

Only two assigned grid rows and one filtered grid row need to be stored locally at a time.

Memory management reduces the memory requirement to 21.5 kB for processing a 10

megapixel image and allows processing grids of arbitrary height using the same amount

of on-chip memory.
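The schedule can be modeled with two small rolling buffers. This is a minimal sketch under the assumption of per-row callbacks for assignment, filtering and interpolation, with boundary replication omitted (all names are ours):

```python
from collections import deque

def process_grid(num_rows, assign_row, filter_window, interp_window):
    """Rolling-buffer schedule of Figure 3-8: at most two assigned grid rows
    and one filtered grid row are resident, plus the rows being produced."""
    assigned = deque(maxlen=3)  # rows r-2, r-1 in SRAM + row r being assigned
    filtered = deque(maxlen=2)  # one filtered row in SRAM + row being filtered
    for r in range(num_rows):
        assigned.append(assign_row(r))  # grid assignment for row r
        if len(assigned) == 3:          # a 3-row window is available
            filtered.append(filter_window(tuple(assigned)))  # filter row r-1
        if len(filtered) == 2:          # a 2-row window is available
            interp_window(tuple(filtered))  # reconstruct output pixels
```

Appending a new row to a full deque evicts the oldest one, which models the SRAM row becoming redundant and being overwritten by the contents of the temporary buffer.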

3.2.5 Scalable Grid

Energy-efficiency is the key concern in processing on mobile platforms. The ability to

trade-off computational quality for energy is highly desirable, making algorithm struc-

tures and systems that enable this trade-off extremely useful to explore [120]. A user performing computational photography on a mobile device might choose to trade off output resolution for energy, depending on the current state of the battery and the energy requirement for the task.

Figure 3-8: Memory management by task scheduling.

This trade-off could also be made based on the intended usage

for the image. For example, if the output image is intended for use on social media or

web-based applications, a lower resolution, such as 2 megapixels, might be most appropriate, whereas for generating high-quality prints, the user would like to achieve the highest resolution possible. This makes an architecture that enables energy-scalable processing

extremely valuable.

We develop an architecture that enables the energy vs. quality trade-off by scaling the

size of the bilateral grid to support the desired output resolution. The size of the grid is

determined by the image size and the downsampling factors. For an image of size I_W × I_H pixels with the spatial and intensity/range downsampling factors σ_s and σ_r respectively, the grid width (G_W) and height (G_H) are given by eq. (3.6) and the number of grid cells (N_G) is given by eq. (3.7).

G_W = I_W / σ_s;    G_H = I_H / σ_s    (3.6)

N_G = G_W × G_H × (256 / σ_r)    (3.7)

The number of computations as well as storage depends directly on the size of the grid.

Selecting the downsampling factors the same as the standard deviations of the spatial

and intensity/range Gaussians in the bilateral filter (eq. (3.1)) provides a good trade-

off between the output quality and processing complexity. The choice of downsampling

factors is guided by the image content and the application. Most applications work well with a coarse grid resolution on the order of 32 pixels with 8 to 12 intensity bins. If the image has high spatial details, a smaller σ_s would result in better preservation of those details in the output. Similarly, a smaller σ_r would help preserve fine intensity details. The grid size is configurable by adjusting σ_s from 16 to 128, which scales the block size from 16 × 16 to 128 × 128 pixels, and σ_r from 16 to 64, which scales the number of intensity levels from 16 to 4. For a 10 megapixel (4096 × 2592) image, the number of grid cells scales from 663552 (σ_s = 16, σ_r = 16) to 2592 (σ_s = 128, σ_r = 64). The architecture achieves energy scalability by activating only the required number of hardware units for a given grid resolution.
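Eqs. (3.6)-(3.7) make the cost of a configuration easy to evaluate; the following few lines reproduce the grid-cell counts quoted above for a 4096 × 2592 image:

```python
def grid_cells(iw, ih, sigma_s, sigma_r):
    """Number of grid cells per eqs. (3.6)-(3.7)."""
    return (iw / sigma_s) * (ih / sigma_s) * (256 / sigma_r)

print(grid_cells(4096, 2592, 16, 16))   # 663552.0 cells, finest grid
print(grid_cells(4096, 2592, 128, 64))  # 2592.0 cells, coarsest grid
```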

The 21.5 kB of on-chip SRAM is used to store two rows of created grid cells and one row of filtered grid cells. The SRAM is implemented as 8 banks supporting a maximum of 256 cells in each row of the grid with 16 intensity levels, corresponding to the worst case of σ_s = 16, σ_r = 16. Each bank is power gated to save energy when a lower resolution grid is used. Only one bank is used when σ_s = 128 and all 8 banks are used when σ_s = 16. The bilateral filter engine achieves scalability by activating only the required number of processing engines and SRAM banks, and power gating the remaining engines and memory banks, for the desired grid resolution.

3.3 Applications

The testchip has two bilateral filter engines, each processing 4 pixels/cycle. The processor

performs HDR imaging, low-light enhanced imaging and glare reduction using the bilateral

filter engines.


3.3.1 High Dynamic Range Imaging

The range of intensities captured in an image is limited by the resolution of the image

sensor. Typically, image sensors use 8 bits/pixel resolution, which limits the dynamic

range of intensities captured in an image to 256 : 1. On the other hand, the range of

intensities we encounter in the real world spans 5 to 6 orders of magnitude. HDR imaging

is a technique for capturing a greater dynamic range between the brightest and darkest

regions of an image than a traditional digital camera. It is done by capturing multiple

images of the same scene with varying exposure levels, such that the low exposure images

capture the bright regions of the scene well without loss of detail and the high exposure

images capture the dark regions of the scene. These differently exposed images are then

combined together into a high dynamic range image, which more faithfully represents the

brightness values in the scene.

HDR Creation

The first step in HDR imaging is to create a composite HDR image, from multiple differ-

ently exposed images, which represents the true scene radiance value at each pixel of the

image [121]. The true scene radiance value at each pixel is recovered from the recorded

intensity I and the exposure time Δt as follows. The exposure E is defined as the product of sensor irradiance R (which is the amount of light hitting the camera sensor and is proportional to the scene radiance) and the exposure time Δt. The intensity I is a nonlinear function of the exposure E, given by eq. (3.8).

function of the exposure E, given by eq. 3.8.

I = f(E) = f(R × Δt)    (3.8)

We can then obtain the sensor irradiance as given by eq. (3.9), where g = log f⁻¹:

log(R) = g(I) − log(Δt)    (3.9)

Page 92: Appendix A Integer Transform

Reconfigurable Processor for Computational Photography

The mapping g is known as the camera curve [121]. Figure 3-9 shows the camera curves

for the RGB color channels of a typical camera sensor.


Figure 3-9: Camera curves that map the pixel intensity values on to the incident exposure.

The HDR creation module, shown in Figure 3-10, takes values of a pixel from three different exposures (I_E1, I_E2, I_E3) and generates an output pixel which represents the true scene radiance value (I_HDR) at that location. Since we are working with a finite range of discrete pixel values (8 bits/color), the camera curves are stored as combinational look-up tables (LUTs) to enable fast access.

Figure 3-10: HDR creation module.

The true (log) exposure values are obtained from the pixel

intensities using the camera curves, followed by exposure time correction to obtain (log)

scene radiance. The three resulting (log) radiance values obtained from the three images

represent the radiance values of the same location in the scene. A weighted average of

these three values is taken to obtain the final (log) radiance value. The weighting function gives a higher weight to the exposures in which the pixel value is closer to the middle of the response function, thus avoiding high contributions from images where the pixel value is saturated. Finally, an exponentiation is performed to get the final radiance value (16

bits/pixel/color). Processing is performed in the log domain for two reasons. The human

visual system responds to the ratio of intensities rather than the absolute difference in

intensities. This can be represented effectively in the log domain. Also, it simplifies the

computations to additions and subtractions instead of multiplications and divisions.
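A compact software model of the HDR creation step, assuming a precomputed 256-entry log camera curve g and a simple hat-shaped weighting function (both illustrative assumptions of ours), is:

```python
import numpy as np

def merge_hdr(images, exposure_times, g):
    """Merge differently exposed 8-bit images into a radiance map, following
    eqs. (3.8)-(3.9). g: 256-entry LUT for the log camera curve."""
    # Hat weights favor mid-range pixel values and suppress saturated ones.
    w = np.minimum(np.arange(256), 255 - np.arange(256)).astype(np.float64)
    num = np.zeros(images[0].shape)
    den = np.zeros(images[0].shape)
    for img, dt in zip(images, exposure_times):
        log_radiance = g[img] - np.log(dt)  # eq. (3.9): log R = g(I) - log(dt)
        num += w[img] * log_radiance        # weighted average in the log domain
        den += w[img]
    return np.exp(num / np.maximum(den, 1e-6))  # exponentiate to linear radiance
```

Note how the log-domain formulation turns the per-exposure correction into a subtraction, matching the two reasons given above for processing in the log domain.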

Tone Mapping

High dynamic range images (16 bits/pixel/color) cannot be properly displayed on low dynamic range media (8 bits/pixel/color), which constitute almost all the displays that are commonly used. Figure 3-11 shows how an HDR image would appear on a Low Dynamic Range (LDR) display if it is simply scaled from 16 bits/pixel/color to 8 bits/pixel/color.

Properly preserving the dynamic range, captured in the HDR image, while displaying on

the LDR media requires tone mapping that compresses image dynamic range through con-

trast adjustment [87,94,122-124]. In this work, we leverage the local contrast adjustment

based tone-mapping approach proposed in [87] and implement two-stage decomposition

[125,126] using bilateral filtering in hardware.

Figure 3-12 shows the processing flow for HDR imaging, including HDR creation and tone-

mapping. The 16 bit/pixel/color HDR image is split into intensity and color channels. A

low-frequency base layer is created by bilateral filtering the HDR intensity in log domain

and a high-frequency detail layer is created by dividing the log intensity by the base layer.

Figure 3-11: HDR image scaled to 8 bits/pixel/color for displaying on LDR media. (HDR radiance map courtesy Paul Debevec [121].)

Figure 3-12: Processing flow for HDR creation and tone-mapping for displaying HDR images on LDR media.

The dynamic range of the base layer is compressed by a scaling factor in the log

domain. The scaling factor is user programmable to control the base contrast and achieve

a desired look for the image. By default, a scaling factor of 5 is used. The detail layer is

untouched to preserve the details and the colors are scaled linearly to 8 bit/pixel/color.

Merging the compressed base layer, the detail layer and the color channels results in a tone-mapped HDR image (I_TM). Figure 3-13 shows the tone-mapped version of the image shown in Figure 3-11.

Figure 3-13: Tone-mapped HDR image. (HDR radiance map courtesy Paul Debevec [121].)
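The tone-mapping flow of Figure 3-12 reduces to a short sketch around any edge-preserving filter; the base_scale of 5 mirrors the default scaling factor mentioned above, and the helper names are ours:

```python
import numpy as np

def tone_map(radiance, bilateral_filter, base_scale=5.0):
    """Two-layer tone mapping: compress the bilateral-filtered base layer in
    the log domain, keep the detail layer untouched. bilateral_filter is any
    edge-preserving filter (e.g., the grid pipeline)."""
    log_i = np.log10(np.maximum(radiance, 1e-9))
    base = bilateral_filter(log_i)      # low-frequency base layer
    detail = log_i - base               # detail = log intensity minus base
    compressed = base / base_scale      # compress only the base contrast
    out = 10.0 ** (compressed + detail)
    return out / out.max()              # normalize for an 8-bit display
```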

Figure 3-14 shows the hardware configuration for HDR imaging. The hardware performs

HDR imaging by activating the HDR Create module for pre-processing that merges three

LDR exposures into one 16 bit/pixel/color HDR image and the Contrast Adjustment

module for post-processing that performs contrast scaling and merging of the intensity

and color data. Both bilateral grids are configured to perform filtering in an interleaved

manner, where each grid processes alternate blocks in parallel. The processor also pre-

serves the 16 bit/pixel/color HDR image in external memory, which could be tone-mapped

using a different software or hardware implementation.

Figure 3-15 shows a set of input low-dynamic range exposures that capture different ranges

of intensities in the scene and the tonemapped HDR output image.


Figure 3-14: Processor configuration for HDR imaging.

Figure 3-15: Input low-dynamic range images: (a) under-exposed image, (b) normally exposed image, (c) over-exposed image. Output image: (d) tonemapped HDR image.


3.3.2 Glare Reduction

Images captured with a bright light source in or near the field of view are affected signif-

icantly by glare, which reduces contrast, color vibrance and often leads to loss of scene

details due to pixel saturation. The effect of veiling glare on HDR imaging in image

capture and display is measured in [127]. An approach to quantify the presence of veiling

glare and related optical artifacts, and reducing glare through deconvolution by a mea-

sured glare spread function, is proposed in [128]. Glare removal in HDR imaging, by

estimating a global glare spread function for a scene based on fitting a radially-symmetric

polynomial to the fall-off of light around bright pixels, is proposed in [129]. Glare is mod-

eled as a 4D ray-space phenomenon in [130] and an approach to remove glare by outlier

rejection in ray-space is proposed.

In this work, we address the effects of glare by improving the contrast and enhancing

colors in the captured image. This process is similar to performing a single image HDR

tone-mapping operation, with the exception that the contrast is increased instead of

compressed. Programmability of the contrast adjustment module, shown in Figure 3-16,

allows us to achieve this by simply using a different contrast adjustment factor than the

one used for HDR imaging.


Figure 3-16: Contrast adjustment module. Contrast is increased or decreased depending on the

adjustment factor.


Figure 3-17 shows the processing flow for glare reduction.


Figure 3-17: Processing flow for glare reduction.

The input image is split into intensity and color channels. A low-frequency base layer

and a high-frequency detail layer are created by bilateral filtering the intensity. The

contrast of the base layer is enhanced using the contrast adjustment module, which is

also used in HDR tone-mapping. The adjustment factor is user programmable to achieve the desired look for the output image. An adjustment factor of 0.25 is used as the default for glare reduction. The scaled color data is merged with the contrast enhanced base

layer and the detail layer to obtain a glare reduced output image. Figure 3-18 shows the

processor configuration for glare reduction.

Figure 3-19 shows an input image with glare and the glare reduced output image. Glare

reduction recovers details that are white-washed in the original image and enhances the

image colors and contrast.


Figure 3-18: Processor configuration for glare reduction.


Figure 3-19: (a) Input image with glare. (b) Output image with reduced glare.


3.3.3 Low-Light Enhanced Imaging

Photography in low-light situations is a challenging task due to a number of conflicting

requirements. Capturing dynamic scenes without blurring requires short exposure times.

However, the inadequate amount of light entering the image sensor in this short duration

results in images that are dark, noisy and lacking details. A possible solution to this

problem is to use a flash to add artificial light to the scene. This addresses the problems

of brightness, noise and lack of details, while enabling small exposure times to avoid

blurring. However, use of the flash defeats the original purpose of creating a realistic

representation of the scene in the photograph. The artificial light of the flash destroys the

natural scene ambience. It makes objects near the flash appear extremely bright, while

objects that are beyond the range of the flash appear very dark. In addition, it introduces

unpleasant artifacts due to flash shadows.

Combining the information captured in images of the same scene with flash (high details

and low noise) and without flash (natural scene ambience) in quick succession provides

a possible way to avoid the limitations of an image with flash or an image without flash

alone. Using flash and no-flash images to estimate ambient illumination and using that

information for color balancing is proposed in [131]. Creating enhanced images by pro-

cessing a stack of registered images of the same scene is proposed in [132]. This approach

allows users to combine multiple images, captured under varying lighting conditions, and

blend them in a spatially varying manner. Acquiring multiple images with different levels

of flash intensity, including no flash, and subsequently adjusting the flash level by linearly

interpolating among these images is proposed in [133]. Images of the same scene, captured

with different aperture and focal-length settings, but not with different flash settings, are

merged by interpolating between the settings in [134]. Approaches for synthetically re-

lighting scenes using sequences of images captured under varying lighting conditions have

also been proposed [135-137].

In this work, we implement an approach for low-light enhancement, similar to the approaches proposed in [138] and [139], that merges two images captured in quick succession,

one taken without flash (I_NF) and one with flash (I_F). The main difference between our

approach and [138,139] lies in flash shadow treatment and how that affects the overall fil-

tering operation. The large scale features in an image can be considered as representative

of the scene illumination and the details of the scene albedo [140]. Both the images with

and without flash are decomposed into a large scale base layer and a high frequency detail

layer through bilateral filtering. To preserve the natural scene ambience in the output,

the large scale base layer from the no-flash image is selected. This layer is merged with

the detail layer from the flash image to achieve high details and low-noise in the output.

However, flash shadows need to be considered during the merging process and treated to

avoid artifacts in the final output.

The approach used by [138] assumes that the regions in flash shadow should appear exactly

the same in both the flash and the no-flash image. So any regions where the differences

in intensities between the flash and no-flash image are small are considered as shadow

regions. A shadow mask representing such regions is created and the details from the

flash detail layer are only added to the no-flash base layer in the regions not covered by

the mask. This approach avoids flash shadows but regions that are farther away from

the flash and do not receive sufficient illumination are also detected as shadows. Since no

details are added in these regions, large areas of the image often tend to appear smooth

and lacking details.

The approach in [139] makes a similar assumption to detect flash shadows. The regions

where the differences in intensities between the flash and no-flash image are the lowest

are considered to be the umbra of the flash shadows. The gradients at the flash shadow

boundaries are then analyzed to determine the penumbra regions. The shadow mask,

consisting of the umbra and the penumbra regions is then used to exclude shadow pixels

from bilateral filtering. In this scheme, while filtering a pixel, only the pixels in its

neighborhood that are outside the shadow region are used and the pixels in the shadow

region receive no weight. This approach also assigns colors from the flash image to the final output. For the shadow regions, local color correction is performed that copies colors

from illuminated regions in the flash image. Since this approach requires a specialized

type of bilateral filtering that takes into account the shadow mask, it can not be easily

implemented using the bilateral grid.

To address these challenges, we took an algorithm/architecture co-design approach and

developed a technique that decouples the core filtering operation from the shadow correc-

tion operation. This enables us to perform bilateral filtering efficiently using the bilateral

grid and correct for flash shadows as a post-processing step. Figure 3-20 shows the processing flow for low-light enhancement. The RGB color channels are processed independently and merged in the end to generate the final output. Figure 3-21 shows the processor configuration for low-light enhancement.

Figure 3-20: Processing flow for low-light enhancement.

The bilateral grid is used to decompose both images into base and detail layers. The scene

ambience is captured in the base layer of the no-flash image and details are captured in

the detail layer of the flash image. In this mode, one bilateral filter engine is configured

to perform bilateral filtering on the flash image and the other to perform cross-bilateral

filtering, given by eq. (3.10), on the no-flash image using the flash image. The location

of the grid cell is determined by the flash image and the intensity value is determined by

the no-flash image.


Figure 3-21: Processor configuration for low-light enhancement.

I_CB(p) = (1/W(p)) Σ_{n=-N}^{N} G_s(n) · G_I(I_F(p) − I_F(p−n)) · I_NF(p−n)    (3.10)

where

W(p) = Σ_{n=-N}^{N} G_s(n) · G_I(I_F(p) − I_F(p−n))    (3.11)
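In grid form, cross-bilateral filtering only changes the assignment step: the bin index comes from the flash image while the accumulated intensity comes from the no-flash image. A minimal sketch (names are ours):

```python
import numpy as np

def cross_bilateral_grid(flash, no_flash, sigma_s=16, sigma_r=16):
    """Grid assignment for cross-bilateral filtering per eq. (3.10):
    the grid cell is chosen from the flash image intensity, while the
    accumulated value comes from the no-flash image. 2D uint8 arrays."""
    h, w = flash.shape
    bins = 256 // sigma_r
    gh, gw = h // sigma_s, w // sigma_s
    intensity = np.zeros((gh, gw, bins))
    weight = np.zeros((gh, gw, bins))
    for y in range(h):
        for x in range(w):
            i, j = y // sigma_s, x // sigma_s
            r = min(flash[y, x] // sigma_r, bins - 1)  # bin from flash image
            intensity[i, j, r] += no_flash[y, x]       # value from no-flash
            weight[i, j, r] += 1
    return intensity, weight  # filter and interpolate as for a regular grid
```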

Shadow Correction

A shadow correction module is implemented which merges the details from the flash

image with the base layer of the cross-bilateral filtered no-flash image and corrects for the

flash shadows to avoid artifacts in the output image. The shadow correction algorithm was

developed in collaboration with Srikanth Tenneti. Instead of detecting the flash shadows

and attempting to avoid those while adding details, we create a mask representing regions

with high details in the scene. This is done by detecting edges that appear in the bilateral

filtered no-flash image, which preserves the scene details but avoids spurious edges due

to noise. Figure 3-22 shows the mask generation process. Gradients are computed at

each pixel for blocks of 4x4 pixels. If the gradient at a pixel is higher than the average


gradient for that block, the pixel is assigned as an edge pixel. This results in a binary

mask that highlights all the strong edges in the scene but no false edges due to the flash

shadows.

Figure 3-22: Generating a mask representing regions with high scene details.
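The mask generation of Figure 3-22 can be sketched as follows. This is a simplified model: the gradient operator and boundary handling are our choices, not the chip's exact datapath:

```python
import numpy as np

def detail_mask(base, block=4):
    """Binary mask of high-detail regions: a pixel is an edge pixel if its
    gradient magnitude exceeds the mean gradient of its 4x4 block.
    base: bilateral-filtered no-flash image (2D float array)."""
    gy, gx = np.gradient(base)
    grad = np.abs(gx) + np.abs(gy)  # cheap gradient magnitude
    h, w = grad.shape
    mask = np.zeros((h, w), dtype=bool)
    # Full blocks only; boundary remainders are ignored for brevity.
    for by in range(0, h - h % block, block):
        for bx in range(0, w - w % block, block):
            tile = grad[by:by+block, bx:bx+block]
            mask[by:by+block, bx:bx+block] = tile > tile.mean()
    return mask  # a linear smoothing filter is then applied to avoid seams
```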

The details from the flash image are added to the filtered no-flash image, as shown in

Figure 3-23, only in the regions represented by the mask. A linear filter is used to smooth the mask to ensure that the resulting image does not have discontinuities. This implementation of the shadow correction module handles shadows effectively to produce low-light enhanced images without artifacts.

Figure 3-23: Merging flash and no-flash images with shadow correction.

Figure 3-24 shows a set of flash and no-flash images, the no-flash base layer from bilateral

filtering, the flash detail layer, the edge mask, created using the process described in

Figure 3-22, and the low-light enhanced output image.

Figure 3-25 shows a set of flash and no-flash images and the low-light enhanced output

image. The enhanced output effectively reduces noise while preserving the natural look

and scene details, and avoiding artifacts due to flash shadows.

Figure 3-26 compares the output from our approach with that from [138] and [139].

Our approach and the approach in [138] use colors from the no-flash image for the final

output. The approach in [139] uses the colors from the flash image for the output. Our

approach achieves output quality comparable to the previous approaches, as indicated by

the difference images. Decoupling the shadow correction process from the core bilateral

filtering process enables efficient implementation using the bilateral grid.

Figure 3-24: (a) Image with flash, (b) image without flash, (c) no-flash base layer, (d) flash detail layer, (e) edge mask, (f) low-light enhanced output.

Figure 3-25: Input images: (a) image with flash, (b) image without flash. Output image: (c) low-light enhanced image.

Figure 3-26: Comparison of the image quality performance from the proposed approach with that of [138] and [139]. (a) Output from our approach, (b) output from [138], (c) output from [139], (d) difference image between (a) and (b), amplified 5×, (e) difference image between (a) and (c), amplified 5×.


3.4 Low-Voltage Operation

In addition to algorithm optimizations and highly-parallel processor architectures, the

third key component of energy-efficient system design is implementing such architectures

using ultra-low power circuits. The energy consumed by a digital circuit can be minimized

by operating at the optimal VDD, which requires the ability to operate at low voltage

[17,38,141,142].

3.4.1 Statistical Design Methodology

In this work, we use a statistical timing analysis approach, similar to the Operating Point

Analysis (OPA) based statistical design methodology outlined in Section 2.3, to ensure

reliable low-voltage operation. One important difference in the approach, however, is that

the transistor random variables corresponding to local variations, also known as mismatch

parameters, were not available from the foundry for the 40 nm CMOS library used in this

design. In absence of the mismatch parameters, we used the global corner delays, that

model the impact of global variations, to estimate the impact of local variations. The

typical corner delay provides the nominal delay for the standard cell. The best and worst corner delays are used to model the −3σ and the +3σ global corner delays respectively. At low voltage, the impact of local variations is comparable to that of global variations [81]. As a result, we use the standard deviation (σ) obtained from the global corner delays to model the impact of local variations as a Gaussian Probability Density Function (PDF) with

the mean delay given by the global corner delay. A subset of the standard cells from the

40 nm CMOS logic library are analyzed in this manner to model the impact of variations

at 0.5 V. These models of standard cell variations are then used to perform setup/hold

analysis for the timing paths in the processor.
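The substitution of global corner spread for the missing mismatch data amounts to a small helper; a sketch under the assumption that the ±3σ corners bound a Gaussian delay distribution:

```python
def local_variation_pdf(best, typical, worst):
    """Approximate a standard cell's delay PDF at low voltage in the absence
    of foundry mismatch parameters: the best/worst corners are read as the
    -3sigma/+3sigma points, so sigma ~ (worst - best) / 6, with the corner
    delay taken as the mean. Purely illustrative."""
    sigma = (worst - best) / 6.0
    return typical, sigma  # (mean, sigma) of the assumed Gaussian delay PDF

# Example: a hypothetical cell with 80/100/140 ps best/typical/worst delays.
mean, sigma = local_variation_pdf(80.0, 100.0, 140.0)
print(f"mean = {mean} ps, sigma = {sigma:.1f} ps, 3-sigma = {mean + 3*sigma:.1f} ps")
```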

The setup/hold timing closure for the processor, with a 3σ performance requirement at 0.5 V, is performed using the OPA based approach. The PDF of delay at 0.5 V for a representative path from the design, computed using the models of standard cell variations as described above, is shown in Figure 3-27. The global corner delay for this path is 21.9 ns. However, after accounting for the local variations, OPA estimates the 3σ delay to be 36.1 ns. Note that even if the standard cell delay PDFs are modeled as Gaussians, the timing path delay PDF can be non-Gaussian.

Figure 3-27: Delay PDF of a representative timing path from the computational photography processor at 0.5 V. The STA estimate of the global corner delay is 21.9 ns; the 3σ delay estimate using OPA is 36.1 ns.

Table 3.1 shows statistics on the number of paths analyzed for both setup and hold analysis

of the chip. Setup/hold fixing using OPA ensures that cells that are very sensitive to V_T variations are not used in the critical paths. This helps improve the 3σ performance at 0.5 V by 32%, from 17 MHz to 25 MHz. The OPA analysis for timing paths ensures

reliable functionality at 0.5 V.

Table 3.1: Setup/Hold Timing Analysis at 0.5 V

Phase | Data Path | Clock Path | Paths Analyzed | Worst Slack | % Fail

Setup Analysis @ 25 MHz
1 | STA (+3σ) | STA (−3σ) | 95k | −10.7 ns | 3.6%
2 | STA (+3σ) | OPA | 3.4k | −2.9 ns | 1.5%
3 | OPA | OPA | 52 | −0.05 ns | 13.4%
Paths requiring fixing (before timing closure): 7

Hold Analysis
1 | STA (−3σ) | STA (+3σ) | 95k | −8.2 ns | 2.8%
2 | STA (−3σ) | OPA | 2.7k | −1.8 ns | 2.4%
3 | OPA | OPA | 65 | −0.13 ns | 13.8%
Paths requiring fixing (before timing closure): 9

3.4.2 Multiple Voltage Domains

SRAM based on the six transistor (6T) cell is the most common form of embedded memory in processor design. However, low-voltage operation of 6T SRAM faces significant challenges from process variations, bit cell stability and sensing. Threshold voltage variations among transistors that constitute the 6T cell significantly degrade the read/write stability of the bit cell, especially at low voltages [143]. To ensure that the memory will operate reliably as the logic voltage is scaled down, we use separate voltage domains for

logic and memory. This allows us to operate the memory at the nominal voltage of 0.9 V,

while scaling the logic voltage down to 0.5 V. Voltage level shifters, capable of transition-

ing between 0.5 V and 0.9 V, are used to transition the signals between the logic and

memory voltage domains. Figure 3-28 shows the logic and memory voltage domains in

the processor and the level shifters used to transition between the domains. The logic

domain is operated at voltage V_DDL and the memory domain is operated at voltage V_DDM.

3.5 Memory Bandwidth Optimization

The target external memory consists of two 64 M × 16 bit DDR2 DRAM modules with a

burst length of 8. The processor generates 23 bit addresses for accessing the DRAM that

are divided as: 13 bit row address, 3 bit bank address and 7 bit column address. A 32 bit

wide 266 MHz DDR2 memory controller is implemented using Xilinx XC5VLX50 FPGA.


Figure 3-28: Separate voltage domains for logic and memory. Level shifters are used to transition

between domains.

We use a modified version of the Xilinx MIG DRAM controller which supports a lazy

pre-charge policy. Hence, a row is only pre-charged when an access is made to a different

row in the same bank. The 256 bit DDR2 interface is connected to the 64 bit processor

interface through asynchronous FIFOs. This enables the processor to work with any 256

bit DRAM system as well as allows the processor and memory to operate at different

frequencies.

The goal of memory optimization is to reduce the memory size and bandwidth required to

support real-time operation. To process 1080p images (1920 x 1080 at 30 fps) in real-time,

a naive bilateral filtering implementation in the 2D image domain with a 64 × 64 filter kernel and a 4 kB cache to store 64 × 64 pixels (8 bits each), the DRAM bandwidth is:

BW_2D Bilateral = (1080 × 30 × 64 × 64 + 1919 × 1080 × 30 × 64) × 3 colors = 11.5 GB/s    (3.12)

64 × 64 pixels are accessed for the first element in each row and cached in the buffer.

For subsequent pixels in the same row, only the next 64 pixels need to be accessed. The

processing for RGB color channels is performed independently.

Algorithmic optimizations that leverage the bilateral grid structure to perform bilateral filtering in the 3D grid domain, with 16 × 16 pixel blocks, 16 intensity levels and a 3 × 3 × 3 filter kernel, reduce the bandwidth requirement to:

BW_3D Grid = BW_Grid Creation + BW_Grid Filtering + BW_Grid Interpolation    (3.13)

where,

BW_Grid Creation = BW_Image Read + BW_Grid Write
  = (1920 × 1080 × 30 + (1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 4 B/level) × 3 colors
  = 222.5 MB/s    (3.14)

BW_Grid Filtering = BW_Grid Read + BW_Filtered Grid Write
  = ((1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 4 B/level × (3 × 3 × 3 kernel)
    + (1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 1 B/level) × 3 colors
  = 1212.5 MB/s    (3.15)

BW_Grid Interpolation = BW_Filtered Grid Read + BW_Output Image Write
  = ((1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 1 B/level + 1920 × 1080 × 30) × 3 colors
  = 189.1 MB/s    (3.16)

Combining the bandwidth requirements for grid creation, filtering and interpolation, from

eq. (3.14), eq. (3.15) and eq. (3.16), the total bandwidth requirement for processing the

3D bilateral grid, from eq. (3.13), is:

BW_3D Grid = 222.5 MB/s + 1212.5 MB/s + 189.1 MB/s = 1624.1 MB/s    (3.17)

The significant downsampling and reduction in computational complexity enabled by

the bilateral grid, compared to bilateral filtering in the 2D image domain, provides a

bandwidth reduction of 86% - from 11.5 GB/s to 1.6 GB/s.

Architectural optimizations and the memory management approach, described in Sec-

tion 3.2.4, that uses task scheduling and the 21.5 kB on-chip SRAM as a cache for in-

termediate data, further reduce the memory bandwidth. This approach only requires

reading the original image and writing back the filtered output, resulting in the band-

width requirement of:

BW_Processor = BW_Image Read + BW_Output Image Write
  = 1920 × 1080 × 30 × 3 colors + 1920 × 1080 × 30 × 3 colors
  = 356 MB/s    (3.18)

The memory management approach enables processing in the 3D grid domain while storing only two rows of created grid blocks and one row of filtered grid blocks, without

having to create an entire grid before processing. This data can be stored efficiently on-

chip using SRAM and avoid a significant number of off-chip DRAM accesses, reducing

the memory bandwidth by 97% compared to bilateral filtering in the 2D image domain -

from 11.5 GB/s to 356 MB/s.
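The bandwidth arithmetic of eqs. (3.12)-(3.18) is easy to reproduce; the following sketch assumes the byte widths used above (4 B per assigned grid cell, 1 B per filtered cell) and binary megabytes:

```python
MB = 1024 * 1024  # the MB/s figures above follow binary megabytes

def bw_1080p30(block=16, levels=16, cell_bytes=4, kernel=27):
    """Reproduce eqs. (3.13)-(3.18) for 1080p30 RGB under the stated assumptions."""
    pix = 1920 * 1080 * 30          # pixels per second, one color channel
    cells = pix // (block * block)  # grid cells touched per second
    create = (pix + cells * levels * cell_bytes) * 3                # eq. (3.14)
    filt = (cells * levels * cell_bytes * kernel
            + cells * levels * 1) * 3                               # eq. (3.15)
    interp = (cells * levels * 1 + pix) * 3                         # eq. (3.16)
    grid_total = create + filt + interp                             # eq. (3.17)
    scheduled = (pix + pix) * 3                                     # eq. (3.18)
    return create / MB, filt / MB, interp / MB, grid_total / MB, scheduled / MB

print([round(v, 1) for v in bw_1080p30()])
# [222.5, 1212.5, 189.1, 1624.1, 356.0]
```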

Based on the number of memory accesses, we can estimate the memory power using a

memory power consumption model [144]. The memory size is optimized for the specific

implementation. For example, for 2D bilateral filtering implementation and our Grid and

task scheduling implementation, the DRAM only stores an input image and an output

image, which requires 12 MB of memory. Whereas, the 3D Grid implementation with-

out task scheduling requires storing the created and filtered grid as well as the input

and output images, which requires 13.7 MB of memory. Figure 3-29 shows the memory

bandwidth and estimated power consumption for 2D bilateral filtering, after algorithmic

optimizations with the 3D bilateral grid, and after architectural optimizations involving

memory management with task scheduling. The bilateral grid reduces the memory power

consumption by 75% - from 697 mW to 175 mW. Architectural optimizations with mem-

ory management further reduce the memory power to 108 mW, an overall savings of 85% compared to bilateral filtering in the 2D image domain. The memory power does not scale linearly with the bandwidth because of the standby power consumption of the memory. This comparison demonstrates the significance of algorithm/architecture co-design and of considering trade-offs for optimizing power consumption not only for the processor core but for the system as a whole, including external memory and communication costs.

Figure 3-29: Memory bandwidth and estimated power consumption for 2D bilateral filtering, 3D bilateral grid, and bilateral grid with memory management using task scheduling.

3.6 Measurement Results

The testchip, shown in Figure 3-30, is implemented in 40 nm CMOS technology with an active area of 1.1 mm × 1.1 mm, 1.94 million transistors and 21.5 kB of SRAM. The

processor is verified to be operational from 25 MHz at 0.5 V to 98 MHz at 0.9 V with

SRAMs operating at 0.9 V.

This chip is designed to function as an accelerator core as part of a larger microprocessor

system, utilizing the system's existing DRAM resources. For standalone testing of this chip, a 32 bit wide 266 MHz DDR2 memory controller was implemented using a Xilinx XC5VLX50 FPGA. The performance vs. energy trade-off of the testchip for a range of V_DD is shown in Figure 3-31. For the best image quality settings, grid block size 16 × 16 with 16 intensity levels, the processor is able to operate from 25 MHz at 0.5 V with 2.3 mW power consumption to 98 MHz at 0.9 V with 17.8 mW power consumption.

Figure 3-30: Die photo of the testchip. Highlighted boxes indicate SRAMs. HDR, CR and SC refer to the HDR create, contrast reduction and shadow correction modules respectively.

Figure 3-31: Processor performance: trade-off of energy vs. performance for varying V_DD.

The processing run-time scales linearly with the image size with 60 megapixels/second

processing at 0.9 V. Figure 3-32 shows the area and power breakdowns of the processor for

the bilateral filter engines and the pre-processing and post-processing modules. The power

breakdown is obtained from post-layout simulations. The shadow correction module is

power gated during HDR imaging and the HDR creation and contrast adjustment modules

are power gated during low-light enhancement.


Figure 3-32: Processor area (number of gates) and power breakdown.


3.6.1 Energy Scalable Processing

The grid scalability, described in Section 3.2.5, provides a trade-off between grid resolution

and the amount of energy required for processing. Figure 3-33 demonstrates this trade-off

at 0.9 V for grid block size varying from 16 x 16 pixels to 128 x 128 pixels and the number

of intensity levels varying from 4 to 16.


Figure 3-33: Energy scalable processing. Grid resolution vs. energy trade-off at 0.9 V.

The energy consumption has a roughly linear dependence on the number of grid intensity

levels. This is because the number of active processing engines and memory banks is

proportional to the number of intensity levels, which results in an approximately linear

scaling in power consumption while the processing run-time remains unchanged. The

energy consumption varies roughly quadratically with the grid block size, because the

number of blocks to process decreases quadratically with the downsampling factor (same

as the block size). This results in an approximately quadratic scaling in run-time while the

processing power consumption remains unaffected. A combination of these grid scaling

parameters enables processing energy scalability from 0.19 mJ to 1.37 mJ per megapixel

at 0.9 V.

The energy vs. image quality trade-off is depicted by a comparison of output images for


different grid configurations, for HDR imaging and low-light enhancement in Figure 3-34

and Figure 3-35 respectively. The impact of intensity downsampling on the image quality

is much more significant than spatial downsampling because the edge-preserving nature

of the bilateral grid depends on the number of intensity levels.

Figure 3-34: Energy/resolution scalable processing. HDR imaging outputs for (a) grid block size: 16 x 16, intensity levels: 16 (13.7 mJ), (b) grid block size: 128 x 128, intensity levels: 16 (4.2 mJ), (c) grid block size: 16 x 16, intensity levels: 4 (6.4 mJ), (d) grid block size: 128 x 128, intensity levels: 4 (1.9 mJ).


Figure 3-35: Energy/resolution scalable processing. Low-light enhancement outputs for (a) grid block size: 16 x 16, intensity levels: 16 (13.7 mJ), (b) grid block size: 128 x 128, intensity levels: 16 (4.2 mJ), (c) grid block size: 16 x 16, intensity levels: 4 (6.4 mJ), (d) grid block size: 128 x 128, intensity levels: 4 (1.9 mJ).

3.6.2 Energy Efficiency

Image processing pipelines typically involve a complex set of interconnected operations, where each processing stage has large data dependencies. These operations do not automatically lend themselves to spatial and temporal parallelism. Several memory read/write operations are required for every stage of processing, often making the cost of memory accesses higher than the cost of computations [145]. This makes it difficult to achieve efficient software implementations without significant effort to manually optimize the code, including decisions regarding memory access patterns and the order of processing. Significant effort is also required to enhance processing locality and parallelism using intrinsics and other low-level programming techniques [146,147].

Table 3.2 shows a comparison of the processor performance with implementations on

other mobile processors at 0.9 V. Software that replicates the functionality of the testchip

and maintains identical image quality is implemented on the mobile processors. The

implementations are optimized for multi-threading and multi-core processing as well as


taking advantage of available GPU resources on the processors. Processing runtime and power consumption during software execution are measured. The processor achieves more than 5.2x faster performance than the fastest software implementation and consumes more than 40x lower power than the most power-efficient one, resulting in an energy reduction of more than 280x compared to software implementations on some of the recent mobile processors, while maintaining the same output image quality.

Table 3.2: Performance comparison with mobile processor implementations at 0.9 V.

Processor                   Technology (nm)   Frequency (MHz)   Power (mW)   Runtime* (s)   Energy* (mJ)
Intel Atom [148]            32                1800              870          4.96           4315
Qualcomm Snapdragon [24]    28                1500              760          5.19           3944
Samsung Exynos [25]         32                1700              1180         4.05           4779
TI OMAP [149]               45                1000              770          6.47           4981
This Work                   40                98                17.8         0.771          13.7

*Image size: 10 megapixels
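Each energy entry is simply the product of the measured power and runtime; as a consistency check,

\[
E = P \times t : \quad 1180~\mathrm{mW} \times 4.05~\mathrm{s} \approx 4779~\mathrm{mJ}, \qquad 17.8~\mathrm{mW} \times 0.771~\mathrm{s} \approx 13.7~\mathrm{mJ},
\]

giving energy ratios between roughly 288x (Snapdragon) and 364x (OMAP) relative to this work, consistent with the more than 280x reduction quoted above.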

To make software implementations more efficient and easier to implement without significant manual tuning, the Halide image processing language [150] proposes decoupling the algorithm definition from its execution strategy and automating the search for optimized mappings of the pipelines to parallel processors and memory hierarchies. An optimizing compiler generates higher-performance implementations from an algorithm and a schedule described using Halide. We compared the processing performance using Halide with a C implementation and the hardware implementation of our processor. A moderately optimized implementation generated using Halide for an ARM core, running on a Qualcomm Snapdragon processor [24], was able to process a 10 megapixel image in 2.1 seconds. With better optimization, the runtime could be reduced even further. This compares with 4.05 seconds for the manually optimized C implementation running on the same processor. The hardware implementation completed the processing in 771 ms. Halide provided significant performance gains while making the software easier to implement.


It is also useful to quantify the energy-efficiency of processors in terms of operations

performed per second per unit of power consumption (MOPS/mW), which highlights the

trade-offs associated with different architectures. Figure 3-36 shows such a comparison for

processors ranging from fully-programmable CPUs and mobile processors to FPGAs and

ASICs. An operation is defined as a 16 bit add computation.


Figure 3-36: Energy efficiency of processors ranging from CPUs and mobile processors to FPGAs

and ASICs.

Processor Description

1 Intel Sandy Bridge [20]

2 Intel Ivy Bridge [21]

3 Multimedia DSP [23]

4 Mobile Processors [24,25]

5 GPGPU Application Processor [26]

6 FPGA with hierarchical interconnects [151]

7 SVD ASIC [28]

8 Video Decoder ASIC [29]

9 Multi-Granularity FPGA [152]

10 This work (0.5 V)


The significant enhancement in processing speed as well as the reduction in power consumption achieved by the hardware implementation in this work, resulting in 2 to 3 orders of magnitude higher energy-efficiency, can be attributed to several factors.

1. The algorithmic and architectural optimizations maximize data locality and enable spatial and temporal parallelism, which helps maximize the number of computations performed per memory access. This amortizes the cost of memory accesses over computations and reduces the memory bandwidth. Even an optimized software implementation has very limited control over the processing architecture and memory management strategies of a general purpose processor, which limits how close it can come to an optimal implementation.

2. The high amount of parallelism enabled by algorithm/architecture co-design facilitates real-time performance while operating at less than 100 MHz, compared to other processors operating at higher than 1 GHz frequency.

3. The hardware implementation allows careful pipelining with flexible bit widths, which preserves the full resolution of fixed-point computations at each stage of the pipeline, whereas software implementations are restricted to fixed 32-bit or 64-bit operations. Attempting to adapt bit widths to match the required resolution at pipeline stages often degrades performance on these cores instead of enhancing it, because it introduces additional typecasting operations in software processing.

4. Hardware implementations tailored to the specific applications avoid the significant overhead of a control unit that is essential in a general purpose processor to configure the processing units and complex memory hierarchies. The performance and power overhead of just the instruction fetch and decode unit can be significant. Even with an optimized software implementation, it is hard to avoid uneven pipelines and variable memory latencies, resulting in stalls that prevent optimal resource utilization.


5. The ability to scale voltage and frequency is key to ensuring minimum energy consumption for the desired performance. The active power consumption of circuits scales quadratically with voltage. Circuits that are able to operate reliably down to near-threshold voltage enable minimum energy point operation for maximizing efficiency. General purpose processors rarely provide such flexibility to optimize energy and performance requirements.

3.7 System Integration

The processor is integrated, as shown in Figure 3-37, with external DDR2 memory, a camera and a display. A 32 bit wide 266 MHz DDR2 memory controller and a USB interface for communicating with a host PC are implemented using a Xilinx XC5VLX50 FPGA. A software application, running on the host PC, is developed for processor configuration, image capture, processing activation and result display.

Figure 3-37: Processor integration with external memory (256 MB, 32 bit DDR2), camera and display.

The Printed Circuit Board (PCB) that integrates the processor, memory and interfaces

is shown in Figure 3-38, along with a setup that connects to a camera and display. The

system provides a portable platform for live computational photography.



Figure 3-38: Printed circuit board and system integration with camera and display.

3.8 Summary and Conclusions

In this work, we developed a reconfigurable processor for computational photography that enables real-time processing in an energy efficient manner. The processor performs HDR imaging, low-light enhancement and glare reduction using a scalable bilateral grid. Algorithmic optimizations that leverage the 3D bilateral grid structure map the computationally complex non-linear filtering operation onto an efficient linear filtering operation in the 3D grid domain; they significantly reduce the computational complexity and memory requirement, enhance processing locality and enable a highly parallel architecture. Architectural optimizations exploit parallelism to enable high-throughput real-time performance while operating at low frequency, and achieve hardware scalability to enable energy vs. output quality trade-offs for energy/resolution scalable processing. Through algorithm/architecture co-design, an approach for low-light enhancement and flash shadow correction that enables efficient implementation using the bilateral grid architecture is developed. Circuit design for low voltage operation ensures reliable performance down to 0.5 V, enabling a wide voltage operating range for voltage/frequency scaling and achieving minimum energy operation for the desired performance.

The processor is implemented using 40 nm CMOS technology and verified to be operational from 98 MHz at 0.9 V with 17.8 mW power consumption to 25 MHz at 0.5 V with 2.3 mW power consumption. At 0.9 V, it can process up to 60 megapixels/s. The scalability of the architecture enables processing from 0.19 mJ/megapixel to 1.37 mJ/megapixel for different grid configurations at 0.9 V, trading off output quality for energy. The processor achieves a 280x energy reduction compared to identical software implementations on recent mobile processors. The energy scalable implementation proposed in this work enables efficient integration into portable multimedia devices for real-time computational photography.

Based on the system design approach, from algorithms to circuit implementation, adopted

in this work, the following conclusions can be drawn.

1. Hardware oriented algorithmic reframing is key to efficient implementation. The efficiency gains achievable for a system through architectural and circuit optimizations are limited if the algorithm requires sequential processing with large data dependencies. The significant reduction in computational complexity, memory size and bandwidth, achieved through algorithmic transformation from inefficient non-linear filtering in the image domain to efficient linear filtering in the 3D grid domain, demonstrates the significance of algorithmic trade-offs in system design.

2. Scalable architectures, with efficient clock and power gating, enable energy vs. performance/quality trade-offs that are extremely desirable for mobile processing. This energy-scalable processing allows the user to determine the energy usage for a task, based on the battery state or intended usage for the output.

3. Memory management - both on-chip memory size and off-chip memory bandwidth -

is critical to maximizing the system energy-efficiency. Reduction in external memory

bandwidth from 11.5 GB/s to 356 MB/s and the corresponding power consumption

from 697 mW to 108 mW, through algorithm/architecture co-design, careful task

scheduling and use of on-chip SRAM cache, demonstrates this effect.

4. Low-voltage circuit operation is important to enable voltage/frequency scaling and

attain minimum energy operation for the desired performance.


Chapter 4

Portable Medical Imaging Platform

Medical imaging techniques play a crucial role in the diagnosis and treatment of numerous medical conditions. Traditionally, medical diagnostic systems have been restricted to sophisticated clinical environments due to the cost, size and expertise required to operate such equipment. Recent advances in computational photography and computer vision, coupled with efficient high-performance processing on portable multimedia devices, provide a unique opportunity for high quality and highly capable medical imaging systems to become much more portable and cost efficient. Image processing techniques such as High Dynamic Range (HDR) imaging, contrast enhancement, image segmentation and registration could be used to ease the requirements on the high-precision optical front-ends that make such equipment bulky and expensive, and enable digital cameras and smartphones to be used for medical imaging. The proliferation of connected portable devices presents an opportunity for making sophisticated medical imaging systems available to small clinics and individuals in rural areas and emerging countries, enabling early diagnosis and better treatment outcomes.


4.1 Skin Conditions - Diagnosis & Treatment

Skin conditions are among the top five leading causes of nonfatal disease burden globally [153] and can have a significant negative impact on the quality of life. Chronic skin conditions are often easily visible and can be characterized by multiple features including pigmentation, erythema, scale or other secondary features. Vitiligo is one such common condition, found in up to 2% of the worldwide population [154]. The disease is characterized by loss of pigment in the skin, hair and mucous membranes, caused in part by autoimmune destruction of epidermal melanocytes [155,156]. Due to its appearance on visible areas of the skin, Vitiligo can have a significant negative impact on the quality of life of affected children and adults.

4.1.1 Clinical Assessment: Current Approaches

Treatments of skin conditions aim to arrest disease progression and induce repigmentation of affected skin. Several surgical and non-surgical treatments, such as topical immunomodulators, phototherapy, and surgical grafting and transplantation, are available [157,158]. However, diagnosis is primarily based on visual clinical evaluation. Dermoscopy [159,160] is a noninvasive technique that aids visual observation by allowing clinicians to perform direct microscopic examination of diagnostic features in pigmented skin lesions and visualization of pigmented cutaneous lesions in vivo [161,162]. Commercially available dermoscopy tools, such as DermLite [163], aim to improve the ease and accuracy of visual evaluations by providing magnification, LED lighting and polarizing filters to enhance the field of view and reduce glare and shadows. However, reliable objective outcome measures, to allow for comparison of studies and to accurately assess changes over time, are currently lacking [164-166]. Several tissue lesions can be identified based on measurable features extracted from a lesion, making accurate quantification of lesion features essential in clinical practice.

Current outcome measures include the Physician's Global Assessment (PGA), which grades patient improvement based on broad categories of percentage repigmentation over time (0-25%, 25-50%, 50-75% and 75-100%), and the Vitiligo Area and Severity Index (VASI) [167], which measures percentage repigmentation graded over area of involvement summed over the body sites involved. Figure 4-1, reproduced with permission from [167], shows an example of VASI assessment.


Figure 4-1: Standardized assessments for estimating the degree of pigmentation to derive the Vitiligo Area Scoring Index. At 100% depigmentation, no pigment is present; at 90%, specks of pigment are present; at 75%, the depigmented area exceeds the pigmented area; at 50%, the depigmented and pigmented areas are equal; at 25%, the pigmented area exceeds the depigmented area; and at 10%, only specks of depigmentation are present. (Figure reproduced with permission from [167])


These outcome measures rely on subjective clinical assessment through visual observation; they cannot exclude inter-observer bias and have limited accuracy, reproducibility and quantifiability. Two recent studies [165,166] conclude that the current outcome measures have poor methodological quality and unclear clinical relevance, and that they lack consensus among clinicians, researchers and patients. Recent studies have begun using image analysis to evaluate treatment efficacy, but these trials rely on investigator-defined boundaries of skin lesions, which can be biased, and these programs require user involvement to analyze each image separately, which can be time-consuming [168,169]. An objective measurement tool that accurately quantifies repigmentation could overcome these limitations and serve as a diagnostic tool for dermatologists. Image processing techniques can be applied to identify skin lesions and extract their features, allowing much more accurate determination of disease progression. The ability to more objectively quantify change over time will significantly improve the physician's ability to perform clinical trials and determine the efficacy of therapies.

4.1.2 Quantitative Dermatology

Algorithms for quantitative dermatology are being developed. A framework to detect and label moles in skin images is proposed in [170]. The method searches the image for skin regions using a non-parametric skin detection scheme and uses difference of Gaussian filters to find possible mole candidates. A trained Support Vector Machine (SVM) is used to classify the candidates as moles. An approach for registering micro-level features in high-resolution face images is proposed in [171]. The approach registers features in images captured with different light polarizations by approximating the face surface as a collection of quasi-planar skin patches and estimates spatially varying homographies using feature matching and quasiconvex optimization. A supervised learning technique to automatically detect acne-like lesions and enable computer-assisted counting of acne lesions in skin images is proposed in [172]; it models skin regions by a six dimensional vector using temporal and spatial features, and detects the separating boundary between the patch images. Quantitative assessment of wound healing through dimensional measurements and tissue classification is proposed in [173]. The approach computes a 3D model from multiple views of the wound. Tissue classification is performed from color and texture region descriptors computed after unsupervised segmentation. Principal component analysis followed by image segmentation is used in [174] to analyze and determine areas of skin that have undergone repigmentation during the treatment of Vitiligo. This approach converts an RGB image into images that represent skin areas due to melanin and haemoglobin and determines the change in area of such regions over time. All the images taken over time are assumed to be accurately registered with respect to each other and to have uniform color profiles. A technique for melanocytic lesion segmentation based on image thresholding is proposed in [175]. Thresholding schemes work well when the lesion and background skin have distinct intensity and color profiles. However, their accuracy is limited when the image has intensity and/or color inhomogeneities.

Table 4.1 summarizes the current approaches for clinical assessment and recent work in

quantitative dermatology.

A review of automated analysis techniques for pigmented skin lesions [176], applied to dermoscopic and clinical images, finds that even though several approaches for analyzing individual lesions have been proposed, there is a scarcity of approaches for automating lesion change detection. The study concludes that computer-aided diagnosis systems based on individual pigmented skin lesion image analysis cannot yet be used to provide the best diagnostic results.

In this work, we develop a system for skin lesion detection and progression analysis and apply it to clinical images for Vitiligo, obtained from ten different subjects during treatment. Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as the clinical pilot study in collaboration with the Brigham and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Image segmentation is used to accurately determine the lesion contours in an image, and a registration scheme using feature matching is implemented to align a sequence of images of a lesion. A progress metric called fill factor, which accurately quantifies repigmentation of skin lesions, is proposed.

Table 4.1: Summary of clinical assessment and quantitative dermatology approaches

Reference   Description
[160,162]   Dermoscopy - Microscopic examination of diagnostic features in pigmented skin lesions.
[163]       DermLite - Commercial dermoscope providing magnification, LED lighting and polarizing filters.
[165]       PGA - Patient improvement based on broad categories of percentage repigmentation over time (0-25%, 25-50%, 50-75% and 75-100%).
[167]       VASI - Percentage repigmentation, based on visual observation, graded over area of involvement summed over body sites involved.
[170]       A framework, based on difference of Gaussian filters and a trained SVM, to detect and label moles in skin images.
[171]       Registration of micro-level features, using feature matching and quasiconvex optimization, in high-resolution face images captured with different light polarizations.
[172]       A supervised learning technique, using temporal and spatial features, to automatically detect and count acne-like lesions in images.
[173]       Quantitative assessment of wound healing by computing a 3D model from multiple views of the wound and tissue classification based on color and texture region descriptors.
[174]       Principal component analysis and image segmentation of images captured with standardized lighting and alignment to determine repigmented skin areas during treatment.
[175]       Melanocytic lesion segmentation based on image thresholding.



4.2 Skin Condition Progression: Quantitative Analysis

The focus of this work is on developing a system for lesion detection and progression analysis of skin conditions, based not only on standardized clinical imaging but also on images captured by patients at home, using smartphone or digital cameras, without any standardization. The main contributions of this work are: leveraging algorithmic techniques from different areas of computer vision, such as color correction, image segmentation and feature matching; optimizing and modifying them to enhance accuracy for the skin imaging application and to reduce computational and memory complexity for efficient and fast implementation; and developing an easy-to-use automated mobile system that can be used by patients as well as doctors for frequent monitoring of skin conditions.

The overall processing flow, from the non-standardized image sequence to quantification

of progression, is summarized in Figure 4-2.

The progress of a skin lesion is recorded by capturing images of the lesion at regular intervals of time. This is done for all lesions located on different body areas. Color correction is performed by adjusting R, G, B histograms to neutralize the effects of varying lighting and enhance the contrast. A Level Set Method (LSM) based image segmentation approach is implemented to identify the lesion contours. In the vicinity of the lesion contours, Scale Invariant Feature Transform (SIFT) based feature detection is performed to identify key features of the lesion. The first set of images of the skin lesions is manually tagged based on the lesions' locations on the body. For all future images, the tagging is performed automatically by comparing the features from the new image with the previous set of images for all skin lesions. Once the new image is tagged to a specific lesion, it is registered with the first image in the sequence for that lesion, using pre-computed SIFT features. The warped lesion contours are computed after alignment and their area is compared to the area of the first lesion in the sequence to determine the fill factor, which indicates the change in area and quantifies the progress over time.



Figure 4-2: Processing flow for skin lesion progression analysis.


4.2.1 Color Correction

Accurate color information of skin lesions is significant for dermatology diagnosis and

treatment [177,178]. However, different lighting conditions and non-uniform illumination

during image capture often lead to images with varying color profiles. Having a consistent


color profile in the images captured over time is important both for visual comparison and for accurately determining the progression over time. Some approaches for color normalization in dermatological applications have proposed normalizing the color profiles of the instruments to match images captured with different devices, with users characterizing and calibrating the color response [179]. An approach to build color normalization filters by analyzing features in a large data set of images for a skin condition is proposed in [180]; it extracts image features from the inside, outside, and peripheral regions of the tumor and builds multiple regression models with statistical feature selection.

We developed a color correction scheme that automatically corrects for color variations and enhances image contrast using color histograms. Histogram equalization is typically used to enhance contrast in intensity images. However, performing histogram equalization on the R, G and B color channels independently brings the color peaks into alignment and results in an image that closely resembles one captured in a neutral lighting environment. For an image I, the color histogram for channel c (R, G or B) is modified by adjusting the pixel color values I_c(x, y) to span the entire dynamic range D, as given by eq. (4.1).

I^{mod}_c(x, y) = (I_c(x, y) − I^l_c) / (I^u_c − I^l_c) × D    (4.1)

where I^u_c and I^l_c represent the upper and lower limits of the histogram.

The approach can be summarized as follows:

1. Compute histograms for the R, G and B color channels.

2. Determine the upper and lower limits of the R, G and B histograms as the +2σ limit (I^u_c, above the intensity of 97.8% of pixels) and the −2σ limit (I^l_c, below the intensity of 97.8% of pixels). This avoids histogram skewing due to long tails and results in better peak alignment.

3. Expand the R, G, B histograms to occupy the entire dynamic range (D) of 0 to 255 using eq. (4.1), as sketched after this list.
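As an illustration, the following is a minimal sketch of this per-channel stretch in Python/NumPy, assuming an 8-bit RGB image; the ±2σ limits are taken as the 2.2nd and 97.8th intensity percentiles of each channel, matching step 2 above.

```python
import numpy as np

def color_correct(img):
    """Stretch each R, G, B histogram to the full 0-255 range, per eq. (4.1).

    img: H x W x 3 uint8 array. The +/-2 sigma limits I_c^u and I_c^l are
    approximated by the 97.8th and 2.2nd intensity percentiles per channel.
    """
    out = np.empty_like(img)
    for c in range(3):
        ch = img[..., c].astype(np.float32)
        lo, hi = np.percentile(ch, (2.2, 97.8))        # I_c^l and I_c^u
        stretched = (ch - lo) / max(hi - lo, 1.0) * 255.0
        out[..., c] = np.clip(stretched, 0, 255).astype(np.uint8)
    return out
```

Stretching each channel independently is what brings the three color peaks into alignment; a joint luminance-only stretch would enhance contrast but leave the color cast intact.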


Figure 4-3 shows the performance of this approach for images of two different skin lesions.

The approach achieves performance comparable to white-balance calibration with a color

chart, while also enhancing the contrast to make the lesion more prominent.


Figure 4-3: Color correction by histogram matching. Images captured with normal room lighting (a) and with color chart white-balance calibration (b). Images after color correction and contrast enhancement (c) of the images in (a).


4.2.2 Contour Detection

Accurately determining the contours of skin lesions is critical to diagnosis and treatment, as the contour shape is often an important feature in determining the skin condition. It is also important for determining the response to treatment and the progress over time. Due to non-uniform illumination, skin curvature and camera perspective, the images tend to have intensity and color variations within lesions. This makes it difficult for segmentation algorithms that rely on intensity/color uniformity to accurately identify the lesion contours. Segmentation approaches for images with intensity bias have been proposed [181-184]. A level set approach is proposed in [184] that models the distribution of intensity belonging to each tissue as a Gaussian distribution with spatially varying mean and variance and creates a level set formulation by defining a maximum likelihood objective function. An LSM based approach called distance regularized level set evolution was proposed in [185] and extended in [186] to a region-based image segmentation scheme that can take into account intensity inhomogeneities. Based on a model of images with intensity inhomogeneities, the approach in [186] derives a local intensity clustering property of the image intensities, and defines a local clustering criterion function for the image intensities in a neighborhood of each point.

We leverage the level set method for image segmentation of [186], which provides good accuracy in lesion segmentation with intensity/color inhomogeneities. However, this approach has very high computational complexity and memory requirements, as described below. We develop an efficient and accurate narrowband implementation that significantly reduces the computational complexity and memory requirement. A distance regularized level set function, similar to that proposed in [185], is used to update the level set values during iterations. Our implementation only performs updates to the level set function, and the related variables (energy function, bias field, etc., defined below), for a small subset of pixels that fall within a narrow band around the current segmentation contour in an iteration. This limits the computations and memory accesses to this small subset of pixels, instead of the entire image. The following section describes the approach in further detail.

Level Set Method for Segmentation

The original image I with non-uniform intensity profile is modeled as a combination of the

homogeneous image J and a bias field b that captures all the intensity inhomogeneities in

I, as given by eq. (4.2).

I = bJ + n    (4.2)

where n is additive zero-mean Gaussian noise.

A Level Set Function (LSF) φ(x) is defined for every pixel x in the image. The image is segmented into two regions Ω_1 and Ω_2 based on the values of the level set function in these regions, such that:

Ω_1 = {x : φ(x) > 0},  Ω_2 = {x : φ(x) < 0}    (4.3)

The segmentation contours are represented by the 'zero level set' {x : φ(x) = 0}. The level set function is initialized over the entire image and iteratively evolved to achieve the final segmentation.

The unknown homogeneous image J is modeled by two constants c_1 and c_2 in regions Ω_1 and Ω_2 respectively. An energy function F(φ, {c_1, c_2}, b) is defined over Ω_1, Ω_2, c_1, c_2 and b. The optimal regions Ω_1 and Ω_2 are obtained by minimizing the energy F in a variational framework. The energy minimization is performed in an iterative manner with respect to one variable at a time, while the other variables are held at their values from the previous iteration. The iterative process is implemented numerically using a finite difference scheme [185].

This process iteratively converges to the homogeneous image J and the corresponding level set function φ(x), as shown by the sketch in Figure 4-4.


Figure 4-4: Level set segmentation. (a) Original image with intensity inhomogeneity and initialization of the level set function. (b) Homogeneous image obtained at the end of iterations and the corresponding level set function.

The iterative process achieves accurate segmentation despite intensity inhomogeneities. However, it requires storage and update of the level set function φ(x), the bias field b, the homogeneous image model {c_1, c_2} and the corresponding energy function F(φ, {c_1, c_2}, b) for every pixel in each iteration. Bit widths for representing this data are given in Table 4.2; this results in a 42 bits/pixel requirement for the level set approach.

Variable Bit Width

I(x) 8 bits/pixel

J(x) 8 bits/pixel

b(x) 8 bits/pixel

O(x) 2 bits/pixel

F(O, {ci, c 2}, b) 16 bits/pixel

a 2 megapixel (1920 x 1080) image requires storing 11 MB of data and updating it in each

iteration. On-chip SRAM in a processor is typically not suited to such large memory

requirement, necessitating an external DRAM for storing these variables. To process a 2

megapixel image with 50 LSM iterations in one second requires the memory bandwidth

of:


BW_{LSM} = BW^I_{read} + BW^J_{read/write} + BW^φ_{read/write} + BW^b_{read/write} + BW^F_{read/write}
         = 1920 × 1080 × (8 + 2 × 8 + 2 × 2 + 2 × 8 + 2 × 16) bits × 50 iterations/s
         = 985 MB/s    (4.4)

To enable energy efficient implementations and real-time processing, we need to optimize

the algorithm and reduce the computational and memory requirements.

Narrowband LSM

We develop a narrowband implementation of the approach, where instead of storing and

updating the LSM variables for all the pixels in the image in each iteration, we only need

to track a small subset of pixels that fall within a narrow band defined around the zero

level set, as depicted in Figure 4-5.


Figure 4-5: Narrowband implementation of level set segmentation. LSM variables are tracked

only for pixels that fall within a narrow band defined around the zero level set in the current

iteration.

The narrowband implementation is achieved by limiting the computations to a narrow band around the zero level set [185]. The LSF at a pixel x = (i, j) in the image is denoted by φ_{i,j}, and a set of zero-crossing pixels is determined as the pixels (i, j) such that either φ_{i+1,j} and φ_{i−1,j} or φ_{i,j+1} and φ_{i,j−1} have opposite signs. If the set of zero-crossing pixels is denoted by Z, the narrowband B is constructed as given by eq. (4.5).

B = ∪_{(i,j)∈Z} N_{i,j}    (4.5)

where N_{i,j} is a 5 x 5 pixel window centered on pixel (i, j). The 5 x 5 window was experimentally determined to provide a good trade-off between computational complexity and quality of the results.

The LSF based segmentation using the narrow band can be summarized as follows.

1. Initialize the LSF to φ^0_{i,j}, where φ^k_{i,j} denotes the LSF value at pixel (i, j) during iteration k. Construct the narrowband B^0 using eq. (4.5).

2. Update the LSF on the narrowband using a finite difference scheme [185] as φ^{k+1}_{i,j} = φ^k_{i,j} + Δt · L(φ^k_{i,j}), where Δt is the time step of the iteration and L(φ^k_{i,j}) approximates the evolution term ∂φ/∂t.

3. Determine the set of zero-crossing pixels of φ^{k+1} and update the narrowband B^{k+1} using eq. (4.5), as sketched after this list.

4. For pixels (i, j) that are part of the updated narrowband B^{k+1} but were not part of the narrowband B^k, set φ^{k+1}_{i,j} = 3 if φ^k_{i,j} > 0 and φ^{k+1}_{i,j} = −3 otherwise.

5. Continue iterations until the narrowband stops changing (B^{k+1} = B^k = B^{k−1}) or the limit on maximum iterations is reached. The set of zero-crossing pixels at the end of the iterations represents the segmentation contour.
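As an illustration of steps 1 and 3, the zero-crossing set Z and the narrow band B of eq. (4.5) can be computed with simple array operations. The following NumPy sketch is an illustrative reconstruction, not the implementation used in this work.

```python
import numpy as np

def narrowband(phi, half_window=2):
    """Boolean mask of the narrow band B around the zero level set.

    A pixel (i, j) is a zero crossing if its vertical or horizontal
    neighbors have LSF values of opposite sign; B is the union of
    (2*half_window + 1)^2 windows (5 x 5 here) centered on the zero
    crossings, per eq. (4.5).
    """
    s = np.sign(phi)
    zc = np.zeros(phi.shape, dtype=bool)
    zc[1:-1, :] |= (s[2:, :] * s[:-2, :]) < 0     # phi_{i+1,j} vs phi_{i-1,j}
    zc[:, 1:-1] |= (s[:, 2:] * s[:, :-2]) < 0     # phi_{i,j+1} vs phi_{i,j-1}
    band = np.zeros_like(zc)
    for i, j in np.argwhere(zc):                  # union of N_{i,j} windows
        band[max(i - half_window, 0):i + half_window + 1,
             max(j - half_window, 0):j + half_window + 1] = True
    return band
```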

The narrowband approach significantly reduces the computational costs as well as the memory requirements of LSM segmentation. Figure 4-6 shows the number of pixels processed for five 2 megapixel images of skin lesions over 50 LSM iterations using the narrowband implementation. On average, 400,000 pixels are processed per iteration. Compared to the 2 million pixels processed per iteration using the original LSM, this represents an 80% reduction in the processing cost and reduces the average memory bandwidth to 197 MB/s.
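The 197 MB/s figure follows directly from the fraction of pixels touched per iteration:

\[
BW_{\mathrm{narrowband}} \approx \frac{4 \times 10^{5}}{2 \times 10^{6}} \times 985~\mathrm{MB/s} \approx 197~\mathrm{MB/s}
\]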



Figure 4-6: Number of pixels processed using the narrowband implementation over 50 LSM iterations.

Two-Step Segmentation

This narrowband implementation, however, has one important limitation. We perform updates on the LSM variables only for the pixels in the small neighborhood of the segmentation contour in the current iteration. If the LSF is not properly initialized, it is possible for the energy function to get trapped in a local minimum, resulting in inaccurate segmentation. This can be avoided by starting with a good initialization. We achieve this by using a 2-step approach:

* Step 1: A simple segmentation technique such as thresholding or K-means is used. This step is computationally very efficient and generates segmentation contours that are not completely accurate but serve as a good starting point for our narrowband LSM implementation; a sketch of this step follows the list.

* Step 2: Contours generated in Step 1 are used to initialize the LSF. Narrowband LSM then iteratively refines these contours to achieve the final segmentation.
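A minimal sketch of the Step 1 initialization, using a two-cluster K-means on pixel intensities; this assumes the depigmented lesion forms the brighter of the two clusters, and the boundary of the returned mask seeds the LSF for Step 2.

```python
import numpy as np

def kmeans_init(gray, iters=20):
    """Step 1: two-cluster K-means on intensities of a 2D grayscale image.

    Returns a boolean lesion mask; its boundary provides the initial
    contours for the narrowband LSM refinement of Step 2.
    """
    lo, hi = float(gray.min()), float(gray.max())
    centers = np.array([lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)])
    for _ in range(iters):
        labels = np.abs(gray[..., None] - centers).argmin(axis=-1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = gray[labels == k].mean()   # update cluster means
    return labels == np.argmax(centers)   # brighter cluster taken as lesion
```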

Figure 4-7 shows the segmentation achieved by K-means in Step 1 for a skin lesion. Using

these contours to initialize the LSM iterations, Figure 4-8 shows the evolution of contours

during LSM iterations in Step 2.


Figure 4-7: Lesion segmentation using K-means.

Figure 4-8: Contour evolution for lesion segmentation using narrowband LSM, from the initial contours through 10, 20, 30, 40 and 50 iterations (final contours).

4.2.3 Progression Analysis

The ability to accurately determine the progression of a skin condition over time is an important aspect of diagnosis and treatment. In this work, we capture images of the same skin lesions using a handheld digital camera over an extended period of time during treatment and analyze them to determine the progress. However, the lesion contours determined in individual images cannot be directly compared, as the images typically have scaling, orientation and perspective mismatch.

We propose an image registration scheme based on SIFT feature matching [34] for progression analysis. The skin surface typically does not have significant features that could be

detected and matched across images by SIFT. However, the lesion boundary creates distinct features due to the transition in color and intensity from the regular skin to the lesion. To further highlight these features, we superimpose the identified contour onto the original image before feature detection. The lesion contours change over time as the treatment progresses; however, this change is typically slow and non-uniform. Repigmentation often occurs within the lesion, and some parts of the contour shrink while others remain the same. Performing SIFT results in several matching features corresponding to the areas of the lesion that have not changed significantly. Matching SIFT features over large images can be computationally expensive. Also, on relatively featureless skin surfaces, most useful SIFT features are concentrated around the lesion contour, where there is change in intensity and color. To take advantage of this, we restrict feature matching using SIFT to a narrow band of pixels in the neighborhood of the contour, defined in the same manner as the narrow band in Section 4.2.2 by eq. (4.5). Figure 4-9 shows a pair of images of the same lesion with some of the matching SIFT features identified on them.

Figure 4-9: SIFT feature matching performed on the highlighted narrow band of pixels in the vicinity of the contour.
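A sketch of this restriction using OpenCV, assuming the detected contour is available as a binary mask; SIFT detection accepts an optional pixel mask, so the narrow band can be passed in directly. The function name and window size here are illustrative.

```python
import cv2
import numpy as np

def sift_near_contour(gray, contour_mask, half_window=2):
    """Detect and describe SIFT features only in a band around the contour.

    gray: 8-bit grayscale image; contour_mask: uint8 mask, nonzero on the
    segmentation contour. The band is the dilation of the contour by a
    (2*half_window + 1) square window, mirroring eq. (4.5).
    """
    kernel = np.ones((2 * half_window + 1,) * 2, np.uint8)
    band = cv2.dilate(contour_mask, kernel)       # 5 x 5 neighborhood union
    sift = cv2.SIFT_create()
    return sift.detectAndCompute(gray, band)      # (keypoints, descriptors)
```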

This significantly speeds up the processing by reducing the number of computations and the memory requirement, while providing significant features near the contour that can be matched across images. For a 2 megapixel image, instead of performing SIFT feature detection over 2 million pixels, this approach requires processing only 250,000 pixels on average - a reduction of 88%. This also reduces the memory requirement from 2 MB to about 250 kB, which could be efficiently implemented as on-chip SRAM instead of external DRAM.

SIFT is performed only once on any given image, the first time it is analyzed. The SIFT features for the image are stored in the database and used for subsequent analyses. Once the SIFT features are determined in all the images in a sequence, we identify matching features across images using Random Sample Consensus (RANSAC) [187] and compute homography transforms that map every image in the sequence to the first image. The homographies are used to warp images in the sequence to align with the first image. Lesion contours in the warped images can then be used to compare the lesions and determine the progression over time. The lesion area, confined by the warped contours, is determined

for each image in the sequence. We define a quantitative metric called fill factor (F_t) at time t as the change in area of the lesion with respect to the reference (the first image, captured before the beginning of the treatment), given by eq. (4.6).

F_t = 1 − A_t / A_0    (4.6)

where A_t is the lesion area at time t and A_0 is the lesion area in the reference image.
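With the reference lesion mask and the warped lesion mask at time t represented as boolean arrays in the reference frame, eq. (4.6) reduces to a ratio of pixel counts; a minimal sketch:

```python
import numpy as np

def fill_factor(ref_mask, warped_mask):
    """Fill factor F_t = 1 - A_t / A_0 of eq. (4.6), with areas in pixels.

    ref_mask: lesion mask of the reference image (t = 0); warped_mask:
    lesion mask at time t after homography warping into the reference frame.
    """
    a0 = np.count_nonzero(ref_mask)
    at = np.count_nonzero(warped_mask)
    return 1.0 - at / max(a0, 1)
```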

A limitation of the narrowband approach for feature matching is that it can be difficult to determine a significant number of matching features if the lesion contours in two subsequent images have changed dramatically. If the images are only collected during clinical visits that are usually more than a month apart, it is possible to have significant changes in the lesion contours, depending on the skin condition. The images collected for this work, as part of the pilot study for Vitiligo, were usually a month apart, and the approach worked well for these images. A goal of this work is to facilitate frequent image collection by enabling patients to capture images at home, achieving accurate feature matching between subsequent images as well as frequent feedback for doctors and patients.


4.2.4 Auto-tagging

Many skin conditions typically result in lesions in multiple body areas. For a patient or a doctor to be able to keep track of the various lesions, it is important to be able to classify the lesions based on the body areas and maintain individual databases for a sequence of images of each lesion. In this work, we implement a scheme where the subject needs to manually identify the lesions only once, during initial setup, and all future instances of the same lesion are automatically classified and entered into the database for analysis. The auto-tagging approach was developed in collaboration with undergraduate researchers Michelle Chen and Qui Nguyen.

The well-studied problem of image classification is similar, and several image classification techniques exist that may adapt well to this application. In an image classification problem, there are several predefined classes of images into which unknown images must be classified. In this case, the different affected areas on the body can be thought of as the classes, and we would like to classify new photographs taken by the patient. A large body of research exists on image classification. Most approaches generally involve determining distinctive features of each class and then comparing the features of unknown images to these known patterns to determine their likely classifications. SIFT features are a very popular option for general image classification, because they are resistant to changes in scale and transformations [34]. The SIFT descriptors have high discriminative power, while at the same time being robust to local variations [188]. SIFT has been shown to significantly outperform descriptors such as pixel intensities [189,190], edge points [191] and steerable pyramids [192]. Features such as the Harris-Affine detector [193] and geometric blur descriptors [194] have also emerged as alternatives to SIFT for image matching and classification.

Furthermore, given the nature of skin images, where a lightly colored lesion is surrounded by darker skin with very few other features present, the main distinguishing feature of each image is simply the shape of the lesion. This enables us to use descriptors designed for shape recognition, such as shape contexts [195]. The accuracy of shape context matching is strongly correlated with the accuracy of the segmentation used to determine the lesion contour. The accuracy of segmentation increases for darker skin types, especially in the presence of intensity inhomogeneities.

In this work, we implemented and analyzed both SIFT based and shape context based

classification. One important difference between classic image classification algorithms

and the approach that we used in this work is how the definitive features of each class are

determined. In classic image classification, there are a large number of training examples

that can be used to determine the distinctive pattern of features for each class, and

machine learning techniques are often used to do this. In our case, however, there are

only a few examples per class, and because the lesions change over time, older examples

are less relevant. As a result, we do not use machine learning to combine the examples.

Instead, we use the features of the most recent photograph in each class to represent that

class.

At the beginning of the treatment, all skin lesions are photographed and manually tagged based on the body areas. An image of lesion i captured at time t is denoted by L_i^t. The images (L_i^0) are processed to perform color correction and contour detection, as described in Sections 4.2.1 and 4.2.2.

SIFT-based classification

SIFT features are computed for each image and stored along with the image as S_i^0. When a new image (L^1) is captured at time t = 1, the same processing is performed to determine the contour and SIFT features S^1. SIFT features for the new image (S^1) are compared with those determined earlier (S_i^0) to find matches using a two-nearest-neighbor approach. The largest set of inliers (I_{i,j}) with N_{i,j} elements and the total symmetric transfer error (e_{i,j}) (normalized over the range [0, 1]) for every combination {S_i^0, S^1} are determined using RANSAC. The image (L^1) is then classified as belonging to lesion i if that i maximizes the matching criterion M_{i,j} defined by eq. (4.7).

M_{i,j} = N_{i,j} (1 + λ(1 − e_{i,j}))    (4.7)

where λ is a constant, set to 0.2 in this work. The homography H_i^{0,1} corresponding to the best match is stored for later use in progression analysis.
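A sketch of this matching step with OpenCV, assuming the SIFT keypoints and descriptors are already stored per lesion. The ratio-test threshold and the error normalization constant are illustrative assumptions, and the error term is a simplified stand-in for the full symmetric transfer error.

```python
import cv2
import numpy as np

LAM = 0.2  # weighting constant lambda from eq. (4.7)

def match_score(kp_ref, desc_ref, kp_new, desc_new):
    """Return (M, H): the criterion of eq. (4.7) and the best-fit homography."""
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(desc_new, desc_ref, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 4:
        return 0.0, None
    src = np.float32([kp_new[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_ref[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inl = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # RANSAC inliers
    if H is None or inl.sum() < 4:
        return 0.0, None
    n = int(inl.sum())                                       # N_{i,j}
    proj = cv2.perspectiveTransform(src, H)
    err = np.linalg.norm((proj - dst).reshape(-1, 2), axis=1)
    e = min(err[inl.ravel() > 0].mean() / 5.0, 1.0)          # crude [0,1] error
    return n * (1.0 + LAM * (1.0 - e)), H                    # eq. (4.7)
```

The new image is then tagged to the lesion i whose stored feature set maximizes this score.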

Shape context based classification

Shape context descriptors [195], which for a given point on the contour encode the distribution of the relative locations of all the other points, are computed for each image and stored along with the image as SC_i^0. When a new image (L^1) is captured at time t = 1, the same processing is performed to determine the contour and shape context descriptors SC^1. Shape context descriptors for the new image (SC^1) are compared with those determined earlier (SC_i^0) to find the minimum cost matching between these points, using the difference between the shape context descriptors as the cost of matching two points. Finally, a thin plate spline transformation [196,197] between the two contours is computed using the minimum cost matching. The overall difference between two lesion images is then represented as a combination of the cost of the matching (SC^{cost}_{i,j}) and the size of the transformation (T_{i,j}). The image (L^1) is then classified as belonging to lesion i if that i maximizes the matching criterion M_{i,j} defined by eq. (4.8).

M_{i,j} = −(SC^{cost}_{i,j} + k × T_{i,j})    (4.8)

where k is a constant that determines how much the size of the transformation is weighted relative to the cost of the matching. In this work, k is set to 10.

The same process is applied for tagging any future image L^t by comparing it against the previously captured set of images L_i^{t−1}.
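A compact sketch of the shape context descriptor and the SC^{cost} term of eq. (4.8), assuming each contour is sampled as an N x 2 point array. The log-polar binning parameters are illustrative, and the thin plate spline term T_{i,j} is omitted for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_contexts(pts, r_bins=5, t_bins=12):
    """Log-polar histograms of relative point positions, one per point."""
    d = pts[:, None, :] - pts[None, :, :]                 # pairwise offsets
    r = np.linalg.norm(d, axis=-1)
    log_r = np.log10(np.clip(r / np.median(r[r > 0]), 0.125, 2.0))
    theta = np.arctan2(d[..., 1], d[..., 0])
    descs = []
    for i in range(len(pts)):
        keep = np.arange(len(pts)) != i                   # exclude the point itself
        h, _, _ = np.histogram2d(
            log_r[i, keep], theta[i, keep], bins=(r_bins, t_bins),
            range=[[np.log10(0.125), np.log10(2.0)], [-np.pi, np.pi]])
        descs.append(h.ravel() / max(h.sum(), 1.0))
    return np.array(descs)

def sc_cost(contour_a, contour_b):
    """SC^cost of eq. (4.8): chi-squared cost of the minimum-cost matching."""
    a, b = shape_contexts(contour_a), shape_contexts(contour_b)
    chi2 = 0.5 * (((a[:, None] - b[None, :]) ** 2)
                  / (a[:, None] + b[None, :] + 1e-9)).sum(-1)
    rows, cols = linear_sum_assignment(chi2)              # optimal assignment
    return chi2[rows, cols].sum()
```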


4.2.5 Skin Condition Progression: Summary

The overall processing, involving image tagging, lesion contour detection and progression analysis, can be summarized as follows.

* Initial Setup

1. Manually tag images (L_i^0) based on the location i of the lesion.

2. Perform color correction and segmentation to determine the lesion contours (C_i^0).

3. Compute SIFT features (S_i^0) in the vicinity of the lesion contour (C_i^0). Store C_i^0 and S_i^0 for future analysis.

4. Shape context based tagging: Compute shape context features (SC_i^0) on the lesion contour (C_i^0). Store SC_i^0 for future analysis.

* Subsequent Analysis

1. For an image L^t captured at time t, perform color correction and contour detection (C^t).

2. Compute SIFT features (S^t) in the vicinity of the lesion contour (C^t).

3. Perform feature matching for every combination {S_i^{t−1}, S^t} and tag L^t to lesion i using eq. (4.7). Store the best-match homography H_i^{t−1,t} for further analysis.

4. Shape context based tagging: Perform shape context matching for every combination {SC_i^{t−1}, SC^t} and tag L^t to lesion i using eq. (4.8).

5. Using the pre-computed contours (C_i^t) and homographies (H_i^{t−1,t}), register the sequence of n images of the same lesion captured over time to the first image (L_i^0).

6. Compare the areas of the warped lesion contours to determine the progression over time and compute the fill factor (F_t) using eq. (4.6).


4.3 Experimental Results

4.3.1 Clinical Validation

Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as the clinical pilot study in collaboration with the Brigham and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Ten subjects ages 18 years and older with a dermatologist diagnosis of Vitiligo were recruited by Dr. Vaneeta Sheth. Subjects had a variety of skin phototypes and disease characteristics. As this was a pilot study, no standardized intervention was performed. Rather, subjects were treated with standard therapies used for Vitiligo based on clinical characteristics and patient preference. Further subject-specific details, along with the various treatment modalities, are outlined in Appendix B. Photographs of skin lesions were taken at the beginning of treatment and during at least two subsequent clinical follow-up visits, using normal room lighting and a handheld digital camera.

4.3.2 Progression Quantification

The approach to analyze the individual images and determine the progress over time is

implemented using MATLAB.

For a sequence of images of a skin lesion captured over time, we process each image to

perform color correction and contrast enhancement. Figure 4-10 shows a sequence of

images with their R, G, B histograms and the outputs after color correction.

The color corrected images are then processed to perform lesion contour detection. Figure 4-11 shows a sequence of images with the detected contours overlaid. LSM based image segmentation accurately detects the lesion boundaries despite intensity/color inhomogeneities in the image.



Figure 4-10: Color correction for a sequence of images by R, G, B histogram modification. (a) Original image sequence. (b) Color corrected image sequence. The lesion color changes due to phototherapy.

Figure 4-11: Image segmentation using LSM for lesion contour detection despite intensity/color inhomogeneities in the image.

Feature matching is performed across images to correct for scaling, orientation and perspective mismatch. A homography transform, computed based on the matching features, is used to warp all the images in a sequence with respect to the first image, which is used as a reference. Figure 4-12 shows a sequence of warped images. The warped lesions


are compared with respect to the reference lesion at the beginning of the treatment to

determine the progress over time in terms of the fill factor.

Figure 4-12: Image registration based on matching features with respect to the reference image at the beginning of the treatment. Fill factors: Nov'12 = 0, Mar'13 = 27%, Jul'13 = 51%, Sep'13 = 57%.
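A minimal sketch of this registration step using OpenCV [37]; the 0.75 ratio-test and 5.0 pixel RANSAC thresholds are illustrative defaults, not values taken from this work:

    import cv2
    import numpy as np

    def register_to_reference(ref_gray, img_gray, img):
        sift = cv2.SIFT_create()
        k1, d1 = sift.detectAndCompute(ref_gray, None)
        k2, d2 = sift.detectAndCompute(img_gray, None)
        # Ratio-test matching of SIFT descriptors (image -> reference).
        matches = cv2.BFMatcher().knnMatch(d2, d1, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        src = np.float32([k2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([k1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        # RANSAC rejects outlier correspondences before estimating the homography.
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = ref_gray.shape
        return cv2.warpPerspective(img, H, (w, h)), H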

A sequence of captured and processed images of a different skin lesion from another subject is shown in Figure 4-13 and the fill factor is computed by comparing the warped lesions.

Figure 4-13: Sequence of images during treatment. (a) Images captured with normal room lighting. (b) Processed outputs after contour detection and alignment. Fill factors: Nov'12 = 0, Dec'12 = 6%, Jan'13 = 16%, Feb'13 = 22%.


The approach for image registration is independently validated by analyzing images of the

same skin lesion captured from different camera angles. Contour detection is performed

on the individual images that are then aligned by feature matching. Figure 4-14 shows

one such comparison. The aligned lesions are compared in terms of their area as well as the number of pixels that overlap. Analysis of 100 images from 25 lesions, with four real and artificial camera angles each, shows a 96% accuracy in area and 95% accuracy in pixel overlap.

Figure 4-14: Image registration through feature matching. (a) Images of a lesion from different camera angles. (b) Images after contour detection and alignment. Area matches to 98% accuracy and pixel overlap to 97% accuracy.

To validate the progression analysis, we take one image each from 50 different lesions and

artificially generate a sequence of 4 images for each lesion with known change in area. We

then apply rotation, scaling and perspective mismatch to the new images. This artificial

sequence is used as input to our system, which determines the lesion contours, aligns the

sequence and computes the fill factor. We compare the fill factor with the known change

in area from the artificial sequence and also compute pixel overlap between the lesions

identified on the original sequence (before adding mismatch) and those on the processed

sequence. Figure 4-15 shows one such comparison. Analysis of 200 images from 50 such

sequences shows a 95% accuracy in fill factor computation and pixel overlap.
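A sketch of how such an artificial sequence might be generated; the erosion-based area change and the ±3% corner jitter are illustrative choices, not the exact procedure used in this work:

    import cv2
    import numpy as np

    def shrink_lesion(mask, iterations):
        # Erode the binary lesion mask to create a known, measurable area reduction.
        kernel = np.ones((3, 3), np.uint8)
        return cv2.erode(mask, kernel, iterations=iterations)

    def add_mismatch(img, rng=np.random):
        # Apply a random perspective change to emulate camera-angle mismatch.
        h, w = img.shape[:2]
        src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
        jitter = rng.uniform(-0.03, 0.03, (4, 2)) * [w, h]
        H = cv2.getPerspectiveTransform(src, (src + jitter).astype(np.float32))
        return cv2.warpPerspective(img, H, (w, h))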


Figure 4-15: Progression analysis. (a) Artificial image sequence with known area change, created from a lesion image (fill factors: 0, 6%, 13%, 18%, 23%, 30%). (b) Image sequence after applying scaling, rotation and perspective mismatch. (c) Output image sequence after lesion alignment and fill factor computation (computed fill factors: 0, 8%, 14%, 21%, 25%, 31%; pixel overlap: 100%, 97%, 98%, 96%, 96%, 97%).

The proposed approach is used to analyze 174 images corresponding to 50 skin lesions

from ten subjects to determine the progression over time. The progression of multiple

lesions during treatment, as well as a detailed analysis of progression for all ten patients

in the clinical study is presented in Appendix B.

4.3.3 Auto-tagging Performance

Performance of the auto-tagging technique is evaluated by analyzing images of twenty

lesions from ten subjects with five images in each sequence. The first twenty images,

captured at the beginning of the treatment, are manually tagged. The auto-tagging tech-

niques using SIFT and Shape Contexts are used to classify the remaining 80 images.

For each technique, the performance is evaluated as follows. For each image, we use the

technique to calculate the similarity between that image and all images from the previous

timestep, defined by the matching criteria in eq. (4.7) or eq. (4.8). If the image from the

previous timestep with the highest similarity is from the same set, then the technique


classifies the image correctly. Otherwise, the classification is incorrect.
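A minimal sketch of this evaluation loop; the (lesion_id, timestep, data) tuples and the similarity callable standing in for eq. (4.7) or eq. (4.8) are illustrative:

    def tagging_accuracy(images, similarity):
        # images: list of (lesion_id, timestep, data) tuples; similarity(a, b)
        # stands in for the matching criteria of eq. (4.7) or eq. (4.8).
        to_tag = [im for im in images if im[1] > 0]
        correct = 0
        for lesion_id, t, data in to_tag:
            previous = [p for p in images if p[1] == t - 1]
            # Classification is correct if the most similar image from the
            # previous timestep belongs to the same lesion.
            best = max(previous, key=lambda p: similarity(data, p[2]))
            correct += (best[0] == lesion_id)
        return correct / len(to_tag)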

The SIFT based classification approach is able to accurately tag 70 of the 80 images,

achieving an accuracy of 87%. The shape context based approach is able to accurately

classify 72 of the 80 images, achieving an accuracy of 90%. The images in this test data

set were captured one to three months apart, which resulted in significant changes in the

lesion contours for some of the test images. If the contours are significantly different, it

becomes difficult for both SIFT based feature matching and shape context matching to

identify enough matching features for robust classification. Enabling more frequent data

collection, where adjacent images have far fewer changes in the lesion shape, will further

help enhance the accuracy of tagging.

The processing steps for tagging using SIFT are part of the steps necessary for contour detection and progression analysis, so this approach adds very little overhead while achieving good accuracy. The shape context based approach requires computing shape contexts and transforms, but this is a small overhead (less than 5%) in the overall processing.

4.3.4 Energy-Efficient Processing

The algorithmic optimizations outlined in Section 4.2.2 and Section 4.2.3 for segmen-

tation and SIFT based progression analysis, respectively, provide significant reductions in computational complexity, memory size and memory bandwidth requirements.
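The core idea behind the narrowband optimization is to confine the level set update to a thin band around the current contour instead of the full image. A conceptual sketch follows; it is not the exact update of Section 4.2.2, and the band width is an illustrative parameter:

    import numpy as np

    def narrowband_update(phi, force, dt=0.1, width=3.0):
        # phi: level set function over the image; force: per-pixel evolution term.
        # Only pixels within `width` of the zero level set (the contour) are
        # updated, shrinking both computation and the working set per iteration.
        band = np.abs(phi) < width
        phi = phi.copy()
        phi[band] += dt * force[band]
        return phi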

We can estimate the reduction in processing complexity through a comparison of run-times

for the different implementations. Three different implementations with full LSM and full

SIFT, narrowband LSM and full SIFT, and narrowband LSM and narrowband SIFT, are

created in MATLAB. All three implementations are run on a computer with 2.4 GHz

Intel Core i5 processor and 8 GB 1600 MHz DDR3 memory. Run-times are determined as the average of fifty runs of the same implementation for processing two 2 megapixel images.

Table 4.3 shows the comparison of run-times for the different implementations. The

narrowband LSM implementation enhances the performance by 62% compared to full


Table 4.3: Performance enhancement through algorithmic optimizations.

Segmentation     | Feature Matching | Run Time | Power  | Energy
Full LSM         | Full SIFT        | 11.4 sec | 20.6 W | 235 J
Narrowband LSM   | Full SIFT        |  4.3 sec | 21.2 W |  91 J
Narrowband LSM   | Narrowband SIFT  |  3.1 sec | 20.2 W |  63 J

LSM. The narrowband SIFT implementation improves the performance by 28% compared to full SIFT. A combination of both results in a 73% performance enhancement compared to full LSM and SIFT. The power consumption of the CPU during processing is measured using Intel Power Gadget [198]. The algorithmic optimizations result in a 73% reduction in the overall energy consumption (from 20.6 W x 11.4 sec, approximately 235 J, down to 20.2 W x 3.1 sec, approximately 63 J).

For processing a 2 megapixel image in one second, based on the number of memory ac-

cesses, we can estimate the memory power using a memory power consumption model [144].

Figure 4-16 shows the memory bandwidth and estimated power consumption for processing with full LSM segmentation and SIFT feature matching compared with the optimized narrowband LSM segmentation and narrowband SIFT feature matching. The algorithmic optimizations leading to the narrowband implementations of both LSM segmentation and SIFT feature matching result in an 80% reduction in memory bandwidth and a 45% reduction in memory power. These algorithmic enhancements pave the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.

Figure 4-16: Memory bandwidth and estimated power consumption for full image LSM and SIFT compared to the optimized narrowband implementations of LSM and SIFT.

4.3.5 Limitations

The performance of the system depends on several factors, including image exposure, skin

type, location of the lesion, etc. For example, it is harder to accurately segment and align

lesions that may not have well defined boundaries, such as a lesion that wraps around a finger or a foot. Figure 4-17 shows an example where segmentation fails to identify

the right lesion contours. Capturing multiple images of the lesion, each zoomed-in on a

narrow patch, could help improve the performance in such cases.

Figure 4-17: Image segmentation fails to accurately identify lesion contours where the lesions do not have well defined boundaries (images from November'12 and January'13).

All the data analyzed in this work is based on image collection that happens only during the patient's visits to the doctor. Such visits may be far apart (a month or more) and the lesions may have changed too significantly to accurately determine matching features


between the new image and the previously collected image. One of the goals of the

mobile application is to enable patients to frequently capture and analyze images, even

outside clinical visits. Frequent data collection and analysis would not only enhance the

performance of the system further, but also provide doctors near real-time feedback on

the response to treatment that could be used to tailor the treatment for best outcomes.

The approach is validated for the Vitiligo skin condition, but it has general applicability and

could be extended to other skin conditions as well.

4.4 Mobile Application

A key objective of this work is to enable patients to perform imaging and progression analysis of skin lesions at home, much more frequently than is possible when usage is limited to

dermatologists in a clinical environment. The ability to perform imaging and analysis

using mobile devices, such as smartphones, is important towards achieving this goal.

Along with undergraduate researchers Michelle Chen and Qui Nguyen, we are developing a

mobile application for the Android platform that enables image capture of the skin lesions

and provides a simple user interface to analyze the images. The analysis is performed using

a cloud-based system that integrates with the mobile application. Figure 4-18 shows the

architecture of the mobile application and cloud integration.

Figure 4-18: Architecture of the mobile application with cloud integration.


The application allows the user to capture images of the skin lesion using the built-in

camera on the mobile device. The images are uploaded to the cloud server. The first

time that a patient uses the application, they are asked to label each image manually. For

all subsequent usage, the images are tagged automatically based on the labels originally

provided by the user. The tag is suggested to the user for confirmation to prevent misla-

beling in cases where auto-tagging might result in a wrong classification. A database of

all the images, organized according to the tags and the date of capture, is maintained in

the cloud server. The user can select a region to analyze, which activates processing on

the cloud server. After the processing is complete, the results are retrieved and displayed

on the mobile device as an animation sequence that takes the user through all the images

of that lesion, warped to align with the first image in the sequence, and shows the pro-

gression in terms of the corresponding fill factors. Figure 4-19 shows some of the screens

that form the user interface of the application that is currently under development.
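A minimal sketch of the client side of this flow; the server URL and endpoint names are hypothetical, and the actual interface of the application under development may differ:

    import requests

    BASE = "https://example-cloud-server/api"  # hypothetical endpoint

    def upload_image(path, tag=None):
        # Upload a captured image; the server responds with a suggested tag
        # that the user is asked to confirm before it is stored.
        with open(path, "rb") as f:
            r = requests.post(f"{BASE}/images", files={"image": f},
                              data={"tag": tag or ""})
        return r.json()

    def analyze_lesion(tag):
        # Trigger contour detection, alignment and fill factor computation in
        # the cloud, then retrieve the progression results for display.
        return requests.post(f"{BASE}/analyze", json={"tag": tag}).json()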

4.5 Multispectral Imaging: Future Work

Medical imaging techniques are important tools in diagnosis and treatment of various skin

conditions, including skin cancers such as melanoma. Defining the true border of skin le-

sions and detecting their features are critical for dermatology. Imaging techniques such

as multi-spectral imaging with polarized light provide non-invasive tools for probing the

structure of living epithelial cells in situ without the need for tissue removal. Light polariza-

tion also makes it possible to distinguish between single backscattering from epithelial-cell

nuclei and multiply scattered light. Polarized light imaging gives relevant information on

the borders of skin lesions that are not visible to the naked eye. Many skin conditions

typically originate in the superficial regions of the skin (epidermal basement membrane)

where polarized light imaging is most effective [199].

A number of polarized light imaging systems have been used in clinical imaging [199-201].

However, widespread use of these systems has been limited by their complexity and cost.


Figure 4-19: User interface of the mobile application: initial screen, image capture, manual tagging on first usage, auto-tagging with user confirmation on subsequent usage, selection of a region to analyze, and progression display. (Contributed by Michelle Chen and Qui Nguyen.)

Some of the commercially available Dermlite [163] systems are useful for eliminating glare

and shadows from the field of view but do not provide information on the backscattered

degree of polarization and superficial light scattering. More complex systems based on

confocal microscopy [202] trade off portability and cost for high resolution and depth

information.


We envision a portable imaging module with multispectral polarized light for medical

imaging that could serve as an optical front-end for the skin imaging and analysis sys-

tem developed in this work. A conceptual diagram of the imaging module is shown in

Figure 4-20. The imaging module could function as an attachment to a smartphone and

Figure 4-20: A conceptual diagram of the portable imaging module for multispectral polarized light imaging, with cross polarization and multispectral illumination.

augment the built-in camera by enabling image capture under different spectral wave-

lengths, ranging from infrared to ultraviolet, and light polarization. The multispectral

illumination could be created using LEDs of varying wavelengths that are triggered sequen-

tially and synchronized with the camera to capture a stack of images of the same lesion

under different wavelength illumination. The synchronization could be achieved through

a wired or wireless interface, such as USB or Bluetooth, with the smartphone.

The images captured under multispectral illumination provide a way to optically dissect

a skin lesion by analyzing the features visible under different wavelengths: for example, surface pigmentation using blue light, superficial vascularity under yellow light, and deeper pigmentation and vascularity with the deeper-penetrating red light [203]. Such a

device could enable early detection of skin conditions, even before the lesions fully man-

ifest on the skin surface, as well as more accurate diagnosis and treatment by providing

dermatologists with far more details of the lesion morphology than are visible under white

light illumination.


4.6 Summary and Conclusions

In this work, we developed and implemented a system for identifying skin lesions and

determining the progression of the skin condition over time. The approach is applied to

clinical images of skin lesions captured using a handheld digital camera during the course

of treatment.

This work leverages computer vision techniques, such as SIFT feature matching and LSM

image segmentation, and makes application specific modifications, such as color/contrast

enhancement, contour based feature detection and contour detection in the presence of intensity/color inhomogeneities. We developed a system that integrates all of these aspects into a seamless flow and enables lesion detection and progression analysis of skin conditions, based not only on standardized clinical imaging but also on images captured by patients at home, using smartphones or digital cameras, without any standardization.

The algorithmic enhancements and optimizations with the narrowband implementations

of level set segmentation and SIFT feature matching help improve the software run-time

performance by over 70% and CPU energy consumption by 73%. These optimizations

also reduce the estimated memory bandwidth requirement by 80% and memory power

consumption by 45%. These optimizations pave the way for energy-efficient hardware

implementations that could enable real-time processing on mobile platforms.

Based on the images of skin lesions obtained from the pilot study, in collaboration with

the Brigham and Women's Hospital, the results indicate that the lesion segmentation and

progression analysis approach is able to effectively handle images captured under varying

lighting conditions without the need for specialized imaging equipment. R, G, B histogram

matching and expansion neutralizes the effect of lighting variations while also enhancing

the contrast to make the skin lesions more prominent. LSM based segmentation accurately

identifies the lesion contours despite intensity/color inhomogeneities in the image. Feature

matching using SIFT effectively corrects for scaling, orientation and perspective mismatch

in camera angles for a sequence of images captured over time and aligns the lesions


that can then be compared to determine progress over time. The fill factor provides

objective quantification of the progression with 95% accuracy, representing a significant

improvement over the current subjective outcome metrics such as the Physician's Global

Assessment and VASI that have assessment variability of more than 25%.

Based on the analysis of existing assessment techniques and the contributions of this work,

the following conclusions can be drawn:

1. The current assessment techniques for skin conditions are primarily based on sub-

jective clinical assessment by physicians. Lack of quantification tools also has a

significant impact on patient compliance. There is a significant need for quantita-

tive dermatology approaches to aid doctors in determining important lesion features

and accurately tracking progression over time, as well as giving patients confidence

that a treatment is having the desired impact.

2. A diverse set of computer vision functionalities need to be integrated to enable skin

imaging and analysis without any standardization in image capture. Application

requirements, such as image segmentation in presence of intensity/color inhomo-

geneities and feature matching on relatively featureless skin surfaces, pose important

challenges. This work leverages recent approaches in level set methods and feature

matching, and enhances them for robustness with application specific modifications.

3. The algorithms have high computational complexity and memory requirements. Effi-

cient software and hardware implementations require algorithmic optimizations that

significantly reduce the processing complexity without sacrificing accuracy. The nar-

rowband image segmentation and feature matching approaches proposed in this work

achieve this objective. These algorithmic optimizations could enable efficient hard-

ware implementations for real-time analysis on mobile devices.

4. It is important to have a simple tool with reproducible results. The proposed sys-

tem is demonstrated to achieve these goals through a pilot study for Vitiligo. This approach provides a significant tool for accurate and objective assessment of the


progress with impact on patient compliance. The precise quantification of progres-

sion would enable physicians to perform an objective follow-up study and test the

efficacy of therapeutic procedures for best outcomes.

5. Combining efficient mobile processing with portable optical front-ends that enable

enhanced image acquisition, such as multispectral imaging, polarized lighting and

macro/microscopic imaging, will be key to developing portable medical imaging

systems. Such devices could be deployed widely at low cost for early detection and

monitoring of diseases in rural areas and emerging countries.


Chapter 5

Conclusions and Future Directions

The energy cost of processor programmability is very high due to the overhead associated

with supporting a fine-grained instruction set compared to the actual cost of computa-

tion. As we go from CPUs and DSPs to FPGAs and ASICs, we progressively reduce this

overhead and trade off programmability to gain energy-efficiency [204]. It is important to

note, however, that the energy cost is ultimately determined by the desired operation and

underlying algorithms. An algorithm that requires high precision floating point opera-

tions to maintain functionality and accuracy will not be able to achieve energy-efficiency

comparable to one that can be implemented using small bit-width fixed point opera-

tions. The same is true of the performance enhancement that a parallel architecture could achieve, which is bounded by Amdahl's Law. The energy requirement of an algorithm with

large data dependencies will be dominated by the cost of memory accesses. Even a highly

optimized hardware implementation will not significantly improve the energy-efficiency of

such a system. The development of a system that maximizes energy-efficiency must be-

gin with algorithms - often reframing the problem and optimizing processing without

changing functionality or impacting accuracy - and co-designing algorithms and architec-

tures.
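For reference, Amdahl's Law bounds the speedup of a workload whose parallelizable fraction is p when run on N processing elements:

    \[ S(N) = \frac{1}{(1-p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1-p} \]

so the serial fraction, like the memory-bound fraction, ultimately caps the achievable gain regardless of the degree of parallelism.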


5.1 Summary of Contributions

This thesis demonstrates the significance of the co-design approach for mobile platforms

through energy-efficient system design for multiple application areas.

5.1.1 Video Coding

Reconfigurability is key to enabling a class of closely related functionalities efficiently in

hardware. Algorithmic rearrangements and optimizations for transform matrix computa-

tions were key to developing a reconfigurable transform engine for multiple video coding

standards. The optimizations maximized hardware sharing and minimized the amount

of computations required to implement large transform matrices. The shared transform

resulted in 30% hardware saving compared to total hardware requirement of individual

H.264/AVC and VC-1 transform implementations. Algorithmic modifications for data

dependent processing to optimize pipeline bit widths and reduce switching activity of

the system reduced the power consumption by 15%. Moving away from conventional

2D transform architectures, an approach to eliminate an explicit transpose memory was

demonstrated, by reusing the output buffer to store intermediate data and separately

designing the row-wise and column-wise 1D transforms. It helped reduce the area by 23%

and power by 26% compared to the implementation using transpose memory.

Low-voltage circuit design using statistical performance analysis ensured reliable oper-

ation down to 0.35 V. The transform engine was demonstrated to support video en-

coding/decoding in both H.264 and VC-1 standards with Quad Full-HD (3840 x 2160)

resolution at 30 fps, while operating at 25 MHz, 0.52 V and consuming 214 pW of power.

This provided a 250x higher power efficiency while supporting the same throughput as

the previous state-of-the-art ASIC implementations. The design provided efficient perfor-

mance scalability with 1080p (1920 x 1080) at 30 fps, while operating at 6.3 MHz, 0.41 V

with 79 µW of power consumption, and 720p (1280 x 720) at 30 fps, while operating at 2.8 MHz, 0.35 V with 43 µW of power consumption.


The ideas of matrix factorization for hardware sharing, eliminating transpose memory

and data dependent processing have general applicability. As bigger block sizes such as

32x32 and 64x64 are explored in new video coding standards like HEVC, these ideas

could lead to even higher savings in area and power requirement of the transform engine,

allowing their efficient implementation in multi-standard multimedia devices.

5.1.2 Computational Photography

The importance of reframing algorithms for efficient hardware implementations is clearly

demonstrated by the optimizations, leveraging the 3D bilateral grid, that led to significant

reductions in computational complexity, memory size and bandwidth, while preserving

the output quality. The bilateral grid implementation enhanced processing locality by

reducing the data dependencies from multiple image rows to a few grid blocks in the

neighborhood, and enabled highly parallel processing.

Architectural optimizations exploiting parallelism, with two bilateral filter engines oper-

ating in parallel and each supporting 16 x parallel processing, enabled high throughput

real-time performance while operating at less than 100 MHz frequency. Combining al-

gorithmic optimizations, parallelism and processing data locality with careful memory

management, helped reduce the external memory bandwidth by 97% - from 5.6 GB/s

to 165.9 MB/s and the DDR2 memory power consumption by 74% - from 380 mW to

99 mW. Through algorithm/architecture co-design, an approach for low-light enhance-

ment and flash shadow correction was developed that enables efficient implementation

using the bilateral grid architecture.

Circuit design for low-voltage operation and multiple voltage domains enabled the pro-

cessor to achieve a wide operating range - from 25 MHz at 0.5 V with 2.3 mW power con-

sumption to 98 MHz at 0.9 V with 17.8 mW power consumption. Co-designing algorithms,

architectures and circuits, enabled the processor to achieve 280 x higher energy-efficiency

compared to software implementations with identical functionality on state-of-the-art mo-


bile processors. A scalable architecture, with clock and power gating, enabled users to

perform energy/resolution scalable processing and was demonstrated to achieve energy

scalability from 0.19 mJ/megapixel to 1.37 mJ/megapixel for different grid configurations

at 0.9 V, while trading-off output quality for energy.

5.1.3 Medical Imaging

The current assessment techniques for skin conditions are primarily based on subjective

clinical assessment by physicians. The algorithmic enhancements that extended com-

puter vision techniques - from image segmentation in presence of inhomogeneities to

feature matching on relatively featureless surfaces - were key to developing a system for

objective quantification of skin condition progression. The system achieved robust per-

formance in clinical validation with 95% accuracy, representing a significant improvement

over the current subjective outcome metrics such as the Physician's Global Assessment

and VASI that have assessment variability of more than 25%. Algorithmic optimizations

with the narrowband implementations of level set segmentation and SIFT feature match-

ing helped improve the software run-time performance and CPU energy consumption by

over 70%. These optimizations also reduced the estimated memory bandwidth require-

ment by 80% and memory power consumption by 45%. These optimizations pave the way

for energy-efficient hardware implementations that could enable real-time processing on

mobile platforms.

5.2 Conclusions

This thesis focuses on addressing the challenges of implementing high-complexity applica-

tions with high-performance requirements on mobile platforms through a comprehensive

view of system design, where algorithms are designed and optimized to enhance processing

locality and enable highly parallel architectures that can be implemented using low-power


low-voltage circuits to achieve maximally energy-efficient systems. The investigation in

this thesis for multiple application areas leads to the following conclusions.

1. Application Specific Processing: With the performance per watt gains due to

technology scaling saturating and the tight energy constraints of mobile platforms,

energy-efficiency is the key bottleneck in scaling performance. Application specific

hardware units that trade-off programmability for high energy-efficiency are becom-

ing an increasingly important part of processor architectures. Hardware-optimized

algorithm design is crucial to maximizing performance and efficiency gains.

2. Reconfigurable Architectures: A hardware implementation with highly opti-

mized processing units supporting core functionalities in a class of applications (ex-

ample: computational photography or video coding) and the ability to activate these

processing units and configure the datapaths based on the application requirements,

provides a very attractive alternative to individual hardware implementations for

each algorithm or application, one that maintains high energy-efficiency while support-

ing a class of applications.

3. Scalable Architectures: Scalable architectures, with efficient clock and power

gating, enable energy vs. performance/quality trade-offs that are extremely desirable

for mobile processing. This energy-scalable processing allows the user to determine

the energy usage for a task, based on the battery state or intended usage for the

output.

4. Data Dependent Processing: Data dependent processing can be a powerful tool

in reducing system power consumption. Applications such as multimedia processing

have high data dependency, where intensities of pixels in an image, pixel blocks

in consecutive frames in a video sequence or utterances in a speech sequence are

highly correlated. By exploiting the characteristics of the data being processed,

architectures can be designed to minimize switching activity, optimize pipeline bit

widths and perform a variable number of operations per block [67]. The reduction in


number of computations and switching activity has a direct impact on the system

power consumption.

5. Low-Voltage Circuit Design: Low-voltage circuit operation is important to en-

able voltage/frequency scaling and attain minimum energy operation for the desired

performance. Variations play a key role in determining circuit performance for low-

voltage operation. The non-linear impact of local variations on performance must

be taken into account to ensure a robust design at low-voltage.

6. Memory Bandwidth and Power: External memory bandwidth and power con-

sumption are a key bottleneck in achieving maximally efficient systems for data in-

tensive applications. If the power consumption of the external memory and the

interface between the memory and the processor is the dominant source of system

power consumption, optimizing the processor alone adds very little to the system ef-

ficiency. New technology solutions such as embedded DRAM [205, 206], which enable

DRAM integration onto the processor die, can play a crucial role in maximizing the

system energy-efficiency by minimizing the cost of memory accesses while enabling

significantly higher bandwidths.

5.3 Future Directions

5.3.1 Computational Photography and Computer Vision

With recent advances in photography, incorporating computer vision and computational

photography techniques, we have just begun to scratch the surface of what cameras of

the future could achieve. For example, embedded computer vision aspires to enable

an ever expanding range of applications such as image and video search, scene recon-

struction, 3D scanning and modeling. Enabling such applications requires a proces-

sor capable of sustained computational loads and memory bandwidths, while operating

within the tight constraints of low power mobile platforms. Chapter 3 presents the al-


gorithm/architecture/circuit co-design approach, as it relates to a set of computational

photography applications. Such a comprehensive system design approach will be essential

to enable computational photography for embedded vision and video processing on mo-

bile devices. This opens up a new dimension in video processing with possibilities such as

lightfield video, where the video could be manipulated in real time during playback - re-

focusing frames and changing viewpoints. New research in image sensors [207,208], along

with multi-sensor arrays, could be coupled with energy efficient processing to realize excit-

ing new possibilities for future generation cameras and smartphones, in applications such

as 3D image and video capture, depth sensing, multi-view video and gesture control.

Combining the ability to interpret very complex real 3D environments using computa-

tional photography, with object and feature recognition techniques from computer vision,

and natural human interfaces such as gesture and speech recognition, is key to mak-

ing a truly immersive environment, like the Holodeck, a reality [209]. The performance

and energy constraints of such a system would necessitate novel architectural and circuit

design innovations. Many of the underlying algorithms in computational photography

and computer vision are still in a nascent stage, which requires reconfigurability and

programmability in the hardware implementations. For example, an efficient processor for

OpenCV [37], the library of programming functions for computer vision, could dramat-

ically transform the way computer vision applications are implemented. The challenges

of such processors would lie in implementing computationally complex and memory in-

tensive hardware primitives while ensuring flexibility for new software innovations to be

realized.

5.3.2 Portable Medical Imaging

Proliferation of connected portable devices and cloud computing provides us a unique

opportunity to revolutionize the delivery of affordable primary health care. A secure and

portable medical imaging platform is a key milestone in making this goal a reality. Com-


putational imaging is becoming an integral part of portable devices such as smartphones.

Extending this functionality for medical imaging applications will enable portable non-

invasive medical monitoring. A cloud based service can then allow the patient and the

doctor to share this medical database and perform image analysis to help with the diag-

nosis and monitor the progress. Strong security guarantees are essential to ensure that

patient-doctor confidentiality is respected by such services. Strong cryptographic primi-

tives like homomorphic encryption [210] provide potential ways to enable secure processing

in the encrypted domain, which would ensure user privacy and protect patient data. Fig-

ure 5-1 shows the conceptual representation of such a cloud-based processing platform.

One of the major challenges in using this approach is the extremely high computational

Figure 5-1: Secure cloud-based medical imaging platform. Images captured by the patient and in the clinic are encrypted and stored in a secure database; lesion features are processed in the encrypted domain, and the results are decrypted for display, diagnosis and treatment.

complexity and memory requirement of processing in the encrypted domain. This makes

software-based processing extremely inefficient and real-time operation impractical. Op-

timized encryption algorithms with efficient hardware implementations would be essential

to make secure real-time processing a reality.

The work presented in Chapter 4 provides a foundation for developing efficient hard-

ware implementations to integrate medical imaging in mobile devices. This would enable

real-time processing of hundreds of images, captured over time, to provide doctors and pa-

tients immediate feedback that could be used to determine the future course of treatment.

The enormous performance and energy advantages that efficient hardware implementa-

tions provide could be used to transform medical imaging applications, such as Optical

Coherence Tomography (OCT), Magnetic Resonance Imaging (MRI) and Computed To-


mography (CT) scan reconstruction, and shift the analysis from bulky GPU clusters to

portable devices. Such systems could significantly enhance medical imaging and finally

bring the Tricorder from the realms of science fiction to reality!

The intersection of cutting-edge algorithms, massively-parallel architectures with special-

ized reconfigurable accelerators and ultra-low power circuits is ripe for exploration. The

future of technology innovation will be defined by societal imperatives such as affordable

healthcare, energy-efficiency and security, and the biggest challenge of this era will be

to revolutionize these fields just as the era of CMOS scaling revolutionized computing,

communication and consumer entertainment. In just a decade, the relationship among

our daily activities, our data, and the mediums of content creation and consumption will

be radically different. This thesis attempts to define the challenges and propose system

design solutions to help build the technologies that will define this relationship.


Appendix A

Integer Transform

The most commonly used transform in video and image coding applications is the Discrete Cosine Transform (DCT). The DCT has an excellent energy compaction property, which leads

to good compression efficiency of the transform. However, the irrational numbers in the

transform matrix make its exact implementation impossible, leading to a drift between

forward and inverse transform coefficients.

H.264/AVC as well as VC-1 video coding standards use a variation of the DCT, known as

Integer transform. In these transforms, the transform matrices are defined to have only

integers. This makes exact inverse possible using integer arithmetic.

The following sections describe the definitions of integer transforms for H.264/AVC and

VC-1 video coding standards.

A.1 H.264/AVC Integer Transform

The separable 2-D 8x8 forward transform for H.264/AVC can be written as:

    F_8 = H_8^T · X_{8x8} · H_8    (A.1)


and the separable 2-D 8x8 inverse transform can be written as:

    I_8 = H_8 · Y_{8x8} · H_8^T    (A.2)

where the 1-D 8x8 integer transform matrix for H.264/AVC is defined as:

    H_8 =
    [  8  12   8  10   8   6   4   3
       8  10   4  -3  -8 -12  -8  -6
       8   6  -4 -12  -8   3   8  10
       8   3  -8  -6   8  10  -4 -12
       8  -3  -8   6   8 -10  -4  12
       8  -6  -4  12  -8  -3   8 -10
       8 -10   4   3  -8  12  -8   6
       8 -12   8 -10   8  -6   4  -3 ]    (A.3)

Similarly, the separable 2-D 4x4 forward transform for H.264/AVC can be written as:

    F_4 = H_4^T · X_{4x4} · H_4    (A.4)

and the separable 2-D 4x4 inverse transform can be written as:

    I_4 = H_4 · Y_{4x4} · H_4^T    (A.5)

where the 1-D 4x4 transform matrix for H.264/AVC is defined as:

    H_4 =
    [ 1  2  1  1
      1  1 -1 -2
      1 -1 -1  2
      1 -2  1 -1 ]    (A.6)
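As a quick sanity check of this exact-inverse property, the following NumPy sketch applies the 4x4 forward and inverse transforms; the per-coefficient scaling by the column norms of H_4 stands in for the scaling that the standard folds into quantization:

    import numpy as np

    H4 = np.array([[1,  2,  1,  1],
                   [1,  1, -1, -2],
                   [1, -1, -1,  2],
                   [1, -2,  1, -1]])

    X = np.random.randint(-128, 128, (4, 4))
    F = H4.T @ X @ H4                    # forward transform, eq. (A.4)
    norms = np.diag(H4.T @ H4)           # orthogonal columns: norms = [4, 10, 4, 10]
    Y = F / np.outer(norms, norms)       # per-coefficient scaling
    Xr = H4 @ Y @ H4.T                   # inverse transform, eq. (A.5)
    assert np.allclose(Xr, X)            # exact recovery, no drift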


A.2 VC-1 Integer Transform

VC-1 uses 8 x 8, 8 x 4, 4 x 8 and 4 x 4 transforms.

The 2-D separable m x n forward integer transform for VC-1, where m = 8, 4 and n = 8, 4,

is given as:

    F_{mxn} = (V_m^T · X_{mxn} · V_n) ∘ N_{mxn}    (A.7)

where ∘ denotes element-wise scaling by the normalization matrix N_{mxn}.

And the m x n inverse integer transform for VC-1 is given as:

    I_{mxn} = (V_m · Y_{mxn} · V_n^T) / 1024    (A.8)

The denominator is chosen to be the power of 2 closest to the squared norm of the basis

functions (288, 289 and 292) of the 1-D transformation.

In order to preserve one extra bit of precision, the 1-D transform operations are performed as:

    D_{mxn} = (Y_{mxn} · V_n^T) / 16    and    I_{mxn} = (V_m · D_{mxn}) / 64    (A.9)

so that the overall scaling is 16 x 64 = 1024, consistent with eq. (A.8).

The 1-D transform matrix is defined as:

    V_8 =
    [ 12  16  16  15  12   9   6   4
      12  15   6  -4 -12 -16 -16  -9
      12   9  -6 -16 -12   4  16  15
      12   4 -16  -9  12  15  -6 -16
      12  -4 -16   9  12 -15  -6  16
      12  -9  -6  16 -12  -4  16 -15
      12 -15   6   4 -12  16 -16   9
      12 -16  16 -15  12  -9   6  -4 ]    (A.10)


and the 1-D 4x4 inverse transform matrix is defined as:

    V_4 =
    [ 17  22  17  10
      17  10 -17 -22
      17 -10 -17  22
      17 -22  17 -10 ]    (A.11)


Appendix B

Clinical Pilot Study for Vitiligo

Progression Analysis

B.1 Subjects for Pilot Study

Institutional Review Board approval was obtained for data analysis (MIT Protocol Num-

ber: 1301005500) as well as the clinical pilot study in collaboration with the Brigham

and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Ten subjects ages 18

years and older with a dermatologist diagnosis of vitiligo were recruited by Dr. Vaneeta

Sheth. Subjects had a variety of skin phototypes and disease characteristics, as outlined

in Table B.1. As this was a pilot study, no standardized intervention was performed.

Rather, subjects were treated with standard therapies used for vitiligo based on clinical

characteristics and patient preference.


Table B.1: Demographics of the subjects for clinical study.

Subject | Age (Years) | Gender | Ethnicity | Vitiligo Phenotype | Treatment Modalities
1  | 21 | F | Hispanic | Acrofacial | None
2  | 59 | M | African-American | Non-segmental vitiligo | NBUVB*, oral corticosteroids
3  | 57 | M | Caucasian, Native American | Non-segmental vitiligo | NBUVB
4  | 29 | M | Caucasian | Mucosal/genital | Topical calcineurin inhibitor, NBUVB
5  | 43 | M | Greek | Acrofacial | NBUVB
6  | 27 | M | Caucasian | Nonsegmental/common vitiligo | NBUVB
7  | 46 | F | South Asian | Acrofacial | Topical corticosteroids
8  | 43 | F | African-American | Segmental | NBUVB, topical corticosteroids
9  | 35 | M | Caucasian | Nonsegmental/common vitiligo | NBUVB
10 | 43 | F | South Asian | Segmental | NBUVB, topical immunomodulators, topical bimatoprost

*NBUVB: Narrow-band Ultraviolet B

B.2 Progression Analysis

The proposed approach is used to analyze 174 images corresponding to 50 skin lesions

from ten subjects to determine the progression over time. Figure B-1 shows the progres-

sion of five lesions through 20 images captured during treatment. A detailed analysis of

progression for all ten patients in the clinical study is presented in Table B.2.



Figure B-1: Progression of skin lesions over time. Lesion contours are identified from the color corrected images and the lesions are aligned using SIFT feature matching to determine the fill factor. Fill factor progressions for the five lesions shown: (0, 11%, 25%, 36%), (0, 3%, 19%, 28%), (0, -9%, -2%, 17%), (0, 7%, 4%, 13%), (0, 17%, 26%, 59%).


Table B.2: Progression of Skin Lesions During Treatment. Fill factors are in percent; a dash indicates no measurement at that visit.

Subject 1 (Dec'12, Jan'13, Jun'13):
  Left Hand: 0, -2, -19
  Right Hand: 0, 1, -9

Subject 2 (Nov'12, Dec'12, Jan'13, Feb'13, Mar'13):
  Chest: 0, 8, 24, 63, 78
  Left Elbow: 0, 2, 9, 17, 31
  Right Elbow: 0, 4, 17, 26, 59

Subject 3 (Nov'12, Dec'12, Jan'13, Feb'13):
  Left Popliteal Fossa: 0, 5, 9, 10
  Left Wrist: 0, 4, 10, 16
  Right Popliteal Fossa: 0, 3, 5, 11
  Right Antecubital Fossa: 0, 6, 16, 22
  Right Forearm: 0, 11, 25, 36
  Right Wrist: 0, 9, 25, 28

Subject 4 (Dec'12, Mar'13, May'13, Jun'13, Jul'13, Oct'13):
  Left Foot: 0, -3, 2, 5, 13, 17
  Left Hand: 0, 1, 5, 14, 21, 22
  Left Knee: 0, 7, 1, 14, 17, 26
  Right Hand: 0, 2, 3, 24, 33, 39
  Right Foot: 0, 2, 4, -5, 6, 18
  Right Knee: 0, 0, 3, 7, 19, 52

Subject 5 (Jan'13, Feb'13, Mar'13):
  Genital: 0, 2, 5

Subject 6 (Apr'13, May'13, Jun'13, Jul'13):
  Left Eye: 0, 3, 19, 28
  Left Neck: 0, 3, 4, 6
  Left Preauricular: 0, 6, 53, -

Subject 7 (May'13, Jul'13, Sep'13):
  Left Forehead: 0, 2, -2
  Left Hand: 0, 1, 3
  Right Forehead: 0, 3, 7
  Right Hand: 0, -2, 1

Subject 8 (Jun'13, Jul'13, Aug'13, Oct'13, Nov'13):
  Forehead: 0, 2, 3, 16, -
  Left Temple: 0, -83, -91, -, 46
  Right Temple: 0, -16, -11, 84, 95

Subject 9 (Jun'13, Jul'13, Sep'13):
  Left Cutaneous Lower Lip: 0, 4, 7
  Right Oral Commissure: 0, -4, -8
  Right Cutaneous Upper Lip: 0, 2, 21
  Right Preauricular: 0, 8, 86

Subject 10 (Nov'12, Mar'13, Jul'13, Sep'13):
  Right Cheek: 0, 27, 52, 57


Acronyms

ASIC Application Specific Integrated Circuit

BW Bandwidth

CC Camera Curves

CMOS Complementary Metal Oxide Semiconductor

Conv Convolution

CPU Central Processing Unit

CT Computed Tomography

DCT Discrete Cosine Transform

DRAM Dynamic Random Access Memory

DSP Digital Signal Processor

DVFS Dynamic Voltage-Frequency Scaling

FIFO First In First Out


FPGA Field Programmable Gate Array

fps frames per second

GA Grid Assignment

GPGPU General Purpose Graphics Processing Unit

GPU Graphics Processing Unit

HD High Definition

HDR High Dynamic Range

HEVC High-Efficiency Video Coding

HoG Histogram of Oriented Gradients

IC Integrated Circuit

LDR Low Dynamic Range

LED Light Emitting Diode

LSB Least Significant Bit

LSF Level Set Function

LSM Level Set Method

LUT Look-Up Table

MBPS Megabytes per second


MRI Magnetic Resonance Imaging

MSB Most Significant Bit

NBUVB Narrow-band Ultraviolet B

OCT Optical Coherence Tomography

OPA Operating Point Analysis

OPS Operations Per Second

PC Personal Computer

PCB Printed Circuit Board

PDF Probability Density Function

PGA Physician's Global Assessment

QFHD Quad Full-HD

RANSAC Random Sample Consensus

RDF Random Dopant Fluctuations

SIFT Scale Invariant Feature Transform

SRAM Static Random Access Memory

SSTA Statistical Static Timing Analysis

STA Static Timing Analysis


SVD Singular Value Decomposition

SVM Support Vector Machine

VASI Vitiligo Area and Severity Index


Bibliography

[1] C. Babbage, "On the mathematical powers of the calculating engine," Original

Manuscript: Museum of History of Science, Oxford, December 1837.

[2] B. Collier, "The little engines that could've: The calculating machines of Charles Bab-

bage," Doctoral dissertation, Harvard University, August 1970.

[3] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, pp. 114-

117, April 1965.

[4] R. H. Dennard, F. Gaensslen, H. Yu, L. Rideout, E. Bassous, and A. LeBlanc, "Design of

ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-

State Circuits, vol. SC-9, pp. 256-268, October 1974.

[5] M. Weiser, "The computer for the 21st century," Scientific American, vol. 265, pp. 94-104,September 1991.

[6] R. Want, W. Schilit, N. Adams, R. Gold, K. Petersen, D. Goldberg, J. Ellis, and M. Weiser,"An overview of the ParcTab ubiquitous computing experiment," IEEE Personal Com-

munications, vol. 2, pp. 28-43, December 1995.

[7] R. Broderson, "Infopad - an experiment in system level design and integration," Design

Automation Conference, pp. 313-314, 1997.

[8] A. Chandrakasan, A. Burstein, and R. W. Brodersen, "A low power chipset for portable

multimedia applications," International Solid-State Circuits Conference, pp. 82-83, 1994.

[9] J. C. Maxwell, "Experiments on color, as perceived by the eye, with remarks on color-

blindness," Transactions of the Royal Society of Edinburgh, vol. 21, no. 2, pp. 275-298,1855.

[10] J. A. Paradiso and T. Starner, "Energy scavenging for mobile and wireless electronics,"

IEEE Pervasive Computing, vol. 4, pp. 18-27, January 2005.

[11] Y. Miyabe, "Smart life solutions: from home to city," International Solid-State Circuits

Conference, pp. 12-17, 2013.

[12] R. H. Dennard, J. Cai, and A. Kumar, "A perspective on today's scaling challenges and

possible future directions," Solid-State Electronics, vol. 51, pp. 518-525, April 2007.

[13] M. Horowitz, "Computing's energy problem (and what we can do about it)," International

Solid-State Conference, pp. 10-14, 2014.


[14] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design,"

IEEE Journal of Solid-State Circuits, vol. 27, pp. 473-484, April 1992.

[15] B. Davari, R. H. Dennard, and G. G. Shahidi, "CMOS scaling for high performance and

low-power-the next ten years," Proceedings of the IEEE, vol. 83, pp. 595-606, April 1995.

[16] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein, "Scaling, power,

and the future of CMOS," IEEE International Electron Devices Meeting, pp. 7-15, 2005.

[17] K. Itoh, "Adaptive circuits for the 0.5-V nanoscale CMOS era," IEEE International Solid

State Circuits Conference, pp. 14-20, 2009.

[18] G. M. Amdahl, "Validity of the single processor approach to achieving large-scale com-

puting capabilities," AFIPS Spring Joint Computer Conference, pp. 483-485, 1967.

[19] W. Dally, "The path to high-efficiency computing," Computational Sciences and Engineering Conference, [online] http://computing.ornl.gov/workshops/SMC13/presentations/3-SMC_0913_Dally.pdf, 2013.

[20] M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A fully integrated multi-CPU,

GPU and memory controller 32nm processor," International Solid-State Circuits Confer-

ence, pp. 264-265, 2011.

[21] S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey, S. Sarkar, S. Siers,

I. Stolero, and A. Subbiah, "A 22nm IA multi-CPU and GPU system-on-chip," Interna-

tional Solid-State Circuits Conference, pp. 56-57, 2012.

[22] P. Ou, J. Zhang, H. Quan, Y. Li, M. He, Z. Yu, X. Yu, S. Cui, J. Feng, S. Zhu, J. Lin,

M. Jing, X. Zeng, and Z. Yu, "A 65nm 39GOPS/W 24-core processor with 11Tb/s/W

packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execu-

tion array," International Solid-State Circuits Conference, pp. 56-57, 2013.

[23] G. Gammie, N. Ickes, M. E. Sinangil, R. Rithe, J. Gu, A. Wang, H. Mair, S. Datla,

B. Rong, S. Honnavara-Prasad, L. Ho, G. Baldwin, D. Buss, A. P. Chandrakasan, and

U. Ko, "A 28nm 0.6V low-power DSP for mobile applications," International Solid-State

Circuits Conference, pp. 132-133, 2011.

[24] "Dragonboard Snapdragon S4 plus APQ8060A mobile development board," [online]

https: //developer .qualcomm. com/mobile-development/development-devices/

dragonboard.

[25] "Samsung Exynos 5 dual Arndale board," [online] http: //www. arndaleboard. org/

wiki/index .php/Main.Page.

[26] Y. Park, C. Yu, K. Lee, H. Kim, Y. Park, C. Kim, Y. Choi, J. Oh, C. Oh, G. Moon,

S. Kim, H. Jang, J. A. Lee, C. Kim, and S. Park, "72.5GFlops 240Mpixel/s 1080p 60fps

multi-format video codec application processor enabled with GPGPU for fused multimedia

application," International Solid-State Circuits Conference, pp. 160-161, 2013.

[27] J. Park, I. Hong, G. Kim, Y. Kim, K. Lee, S. Park, K. Bong, and H. J. Yoo, "A

646GOPS/W multi-classifier many-core processor with cortex-like architecture for super-

resolution recognition," International Solid-State Circuits Conference, pp. 168-169, 2013.


[28] D. Markovic, R. W. Brodersen, and B. Nikolic, "A 70GOPS, 34mW multi-carrier MIMO

chip in 3.5mm2 ," IEEE Symposium on VLSI Circuits, pp. 158-159, 2006.

[29] C. T. Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan, "A 249Mpixel/s

HEVC video-decoder chip for quad full HD applications," International Solid-State Cir-

cuits Conference, pp. 162-163, 2013.

[30] M. Mehendale, S. Das, M. Sharma, M. Mody, R. Reddy, J. Meehan, H. Tamama, B. Carl-

son, and M. Polley, "A true multistandard, programmable, low-power, full HD video-codec

engine for smartphone SoC," International Solid-State Circuits Conference, pp. 226-227,

2012.

[31] V. Aurich and J. Weule, "Non-linear gaussian filters performing edge preserving diffusion,"

Springer Berlin Heidelberg, pp. 538-545, 1995.

[32] P. J. Burt, "Fast algorithms for estimating local image properties," Computer Vision,

Graphics, and Image Processing, vol. 21, pp. 368-382, March 1983.

[33] P. J. Burt and E. H. Adelson, "The laplacian pyramid as a compact image code," IEEE

Transactions on Communication, vol. 31, pp. 532-540, April 1983.

[34] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Jour-

nal of Computer Vision, vol. 60, pp. 91-110, February 2004.

[35] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer

Vision and Pattern Recognition Conference, pp. 886-893, 2005.

[36] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features,"

Computer Vision and Pattern Recognition Conference, pp. 511-518, 2001.

[37] "OpenCV: Open source computer vision," [online] http: //opencv. org/.

[38] A. P. Chandrakasan, D. C. Daly, D. F. Finchelstein, J. Kwong, Y. K. Ramadass, M. E.

Sinangil, V. Sze, and N. Verma, "Technologies for ultradynamic voltage scaling," Proceed-

ings of the IEEE, vol. 98, pp. 191-214, February 2010.

[39] B. Calhoun, A. Wang, and A. Chandrakasan, "Modeling and sizing for minimum en-

ergy operation in subthreshold circuits," IEEE Journal of Solid-State Circuits, vol. 40,

pp. 1778-1786, September 2005.

[40] A. Asenov, "Random dopant induced threshold voltage lowering and fluctuations in sub-

0.1 pm MOSFET's: A 3-D "atomistic" simulation study," IEEE Transactions on Electron

Devices, vol. 45, pp. 2505-2513, December 1998.

[41] P. Andrei and I. Mayergoyz, "Random doping-induced fluctuations of subthreshold charac-

teristics in MOSFET devices," Solid-State Electronics, vol. 47, pp. 2055-2061, November

2003.

[42] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, "Analysis and mitigation of variability

in subthreshold design," International Symposium on Low Power Electronics and Design,

pp. 20-25, 2005.


[43] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, "Matching properties of

MOS transistors," IEEE Journal of Solid-State Circuits, vol. 24, pp. 1433-1440, October

1989.

[44] B. H. Calhoun and A. P. Chandrakasan, "Static noise margin variation for sub-threshold

SRAM in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 41, pp. 1673-1679,

July 2006.

[45] "Cisco visual networking index: Global mobile data traffic forecast update, 2013-2018,"

[online] http: //www. cisco. com/c/en/us/solut ions/collateral/service-provider/

visual-networking-index-vni/white _papercll-5 208 6 2 .html.

[46] Y. K. Lin, D. W. Li, C. C. Lin, T. Y. Kuo, S. J. Wu, W. C. Tai, W. C. Chang, and T. S.

Chang, "A 242mW 10mm 2 1080p H.264/AVC high-profile encoder chip," International

Solid-State Circuits Conference, pp. 314-315, 2008.

[47] D. F. Finchelstein, V. Sze, M. E. Sinangil, Y. Koken, and A. P. Chandrakasan, "A low-

power 0.7-V H.264 720p video decoder," IEEE Asian Solid-State Circuits Conference,

pp. 173-176, 2008.

[48] K. Yu, M. Takahashi, T. Maeda, H. Hara, H. Arakida, H. Yamamoto, Y. Hagiwara, T. Fu-

jita, M. Watanabe, T. Shimazawa, Y. Ohara, T. Miyamori, M. Hamada, and Y. Oowaki,

"A 222mW H.264 full-HD decoding application processor with x512b stacked dram in

40nm," International Solid-State Circuits Conference, pp. 326-327, 2010.

[49] Y. Park, C. Yu, K. Lee, H. Kim, Y. Park, C. Kim, Y. Choi, J. Oh, C. Oh, G. Moon,

S. Kim, H. Jang, J. A. Lee, C. Kim, and S. Park, "72.5GFlops 240Mpixel/s 1080p 60fps

multi-format video codec application processor enabled with GPGPU for fused multimedia

application," International Solid-State Circuits Conference, pp. 160-161, 2013.

[50] T. Burd and R. Brodersen, "Design issues for dynamic voltage scaling," IEEE International Symposium on Low Power Electronics and Design, pp. 9-14, 2000.

[51] B. H. Calhoun and A. P. Chandrakasan, "Characterizing and modeling minimum energy operation for subthreshold circuits," IEEE International Symposium on Low Power Electronics and Design, pp. 90-95, 2004.

[52] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services."

[53] T. Wiegand and G. J. Sullivan, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003.

[54] SMPTE 421M, "VC-1 compressed video bitstream format and decoding process."

[55] H. Kalva and J. Lee, "The VC-1 video coding standard," IEEE Multimedia, vol. 14,

pp. 88-91, October 2007.

[56] S. Srinivasan, P. Hsu, T. Holcomb, K. Mukerjee, S. L. Regunathan, B. Lin, J. Liang, M.-C.

Lee, and J. Ribas-Corbera, "Windows Media Video 9: Overview and applications," Signal

Processing: Image Communication, vol. 19, pp. 851-875, October 2004.


[57] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 598-603, July 2003.

[58] S. Srinivasan, S. Regunathan, and B. Lin, "Computationally efficient transforms for video

coding," IEEE International Conference on Image Processing, pp. 11-14, 2005.

[59] S. Srinivasan and J. Liang, "Fast video codec transform implementations," U.S. Patent

20050256916, November 2005.

[60] S. Lee and K. Cho, "Design of transform and quantization circuit for multi-standard

integrated video decoder," IEEE Workshop on Signal Processing Systems, pp. 181-186,

2007.

[61] C.-P. Fan and G.-A. Su, "Efficient low-cost sharing design of fast 1-D inverse integer transform algorithms for H.264/AVC and VC-1," IEEE Signal Processing Letters, vol. 15, pp. 926-929, 2008.

[62] C.-P. Fan and G.-A. Su, "Efficient fast 1-D 8x8 inverse integer transform for VC-1 application," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, pp. 584-590, April 2009.

[63] G.-A. Su and C.-P. Fan, "Cost effective hardware sharing architecture for fast 1D 8x8

forward and inverse integer transforms of H.264/AVC high profile," IEEE Asia Pacific

Conference on Circuits and Systems, pp. 1332-1335, 2008.

[64] S. Lee and K. Cho, "Design of high-performance transform and quantization circuit for unified video codec," IEEE Asia Pacific Conference on Circuits and Systems, pp. 1450-1453, 2008.

[65] R. Rithe, C. C. Cheng, and A. Chandrakasan, "Quad full-HD transform engine for dual-standard low-power video coding," IEEE Asian Solid-State Circuits Conference, pp. 401-404, 2011.

[66] W.-H. Chen, C. Smith, and S. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Transactions on Communications, vol. 25, pp. 1004-1009, September 1977.

[67] T. Xanthopoulos and A. P. Chandrakasan, "A low-power IDCT macrocell for MPEG-2 MP@ML exploiting data distribution properties for minimal activity," IEEE Journal of Solid-State Circuits, vol. 34, pp. 693-703, May 1999.

[68] H. Fujiwara, K. Nii, H. Noguchi, J. Miyakoshi, Y. Murachi, Y. Morita, H. Kawaguchi, and M. Yoshimoto, "Novel video memory reduces 45% of bitline power using majority logic and data-bit reordering," IEEE Transactions on Very Large Scale Integration Systems, vol. 16, pp. 620-627, June 2008.

[69] M. E. Sinangil and A. P. Chandrakasan, "Application-specific SRAM design using output prediction to reduce bit-line switching activity and statistically gated sense amplifiers for up to 1.9x lower energy/access," IEEE Journal of Solid-State Circuits, vol. 49, pp. 107-117, January 2014.


[70] "ITU-T recommendation H.265 and ISO/IEC 23008-2: High Efficiency Video Coding,"

[online] http: //www. itu. int/ITU-T/recommendations/rec .aspx?rec=11885, 2013.

[71] M. Budagavi, A. Fuldseth, G. Bjontegaard, V. Sze, and M. Sadafale, "Core transform

design in the high efficiency video coding (HEVC) standard," IEEE Journal of Selected

Topics in Signal Processing, vol. 7, pp. 1029-1041, December 2013.

[72] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan, "A 249-Mpixel/s

HEVC video-decoder chip for 4k ultra-HD applications," IEEE Journal of Solid-State

Circuits, vol. 49, pp. 61-72, January 2014.

[73] K. J. Kuhn, "Reducing variation in advanced logic technologies: Approaches to process and

design for manufacturability of nanoscale CMOS," IEEE International Electron Devices

Meeting, pp. 471-474, 2007.

[74] L. Cheng, P. Gupta, C. Spanos, K. Qian, and L. He, "Physically justifiable die-level modeling of spatial variation in view of systematic across wafer variability," Design Automation Conference, pp. 104-109, 2009.

[75] D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer, "Statistical timing analysis: From

basic principles to state of the art," IEEE Transactions on Computer-Aided Design of

Integrated Circuits and Systems, vol. 27, pp. 589-607, April 2008.

[76] A. Wang and A. Chandrakasan, "A 180-mV subthreshold FFT processor using a minimum

energy design methodology," IEEE Journal of Solid-State Circuits, vol. 40, pp. 310-319,

January 2005.

[77] Y. Cao and L. T. Clark, "Mapping statistical process variations toward circuit performance variability: An analytical modeling approach," ACM IEEE Design Automation Conference, pp. 658-663, 2005.

[78] J. Kwong, Y. K. Ramadass, N. Verma, and A. P. Chandrakasan, "A 65 nm sub-Vt microcontroller with integrated SRAM and switched capacitor DC-DC converter," IEEE Journal of Solid-State Circuits, vol. 44, pp. 115-126, January 2009.

[79] H. Mahmoodi, S. Mukhopadhyay, and K. Roy, "Estimation of delay variations due to random-dopant fluctuations in nanoscale CMOS circuits," IEEE Journal of Solid-State Circuits, vol. 40, pp. 1787-1796, September 2005.

[80] S. Sundareswaran, J. A. Abraham, A. Ardelea, and R. Panda, "Characterization of standard cells for intra-cell mismatch variations," International Symposium on Quality Electronic Design, pp. 213-219, 2008.

[81] R. Rithe, S. Chao, J. Gu, A. Wang, S. Datla, G. Gammie, D. Buss, and A. Chandrakasan, "The effect of random dopant fluctuations on logic timing at low voltage," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, pp. 911-924, May 2012.

[82] R. Rithe, "SSTA design methodology for low voltage operation," Master's thesis, Massachusetts Institute of Technology, 2010.

[83] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2D transform architecture with

unique kernel for multi-standard video applications," IEEE International Symposium on

Circuits and Systems, pp. 21-24, 2008.


[84] C. P. Fan, C. H. Fang, C. W. Chang, and S. J. Hsu, "Fast multiple inverse transforms with low-cost hardware sharing design for multistandard video decoding," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, pp. 517-521, August 2011.

[85] K. Wang, J. Chen, W. Cao, Y. Wang, L. Wang, and J. Tong, "A reconfigurable multi-transform VLSI architecture supporting video codec design," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, pp. 432-436, July 2011.

[86] Y.-H. Chen, T.-Y. Chang, and C.-W. Lu, "A low-cost and high-throughput architecture for H.264/AVC integer transform by using four computation streams," IEEE International Symposium on Integrated Circuits, pp. 380-383, 2011.

[87] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," ACM Transactions on Graphics, vol. 21, pp. 257-266, July 2002.

[88] M. Brown and D. G. Lowe, "Automatic panoramic image stitching using invariant features," International Journal of Computer Vision, vol. 74, pp. 59-73, August 2007.

[89] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, "Efficient marginal likelihood optimization in blind deconvolution," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2657-2664, 2011.

[90] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, "Light-field photography with a handheld plenoptic camera," Stanford University Computer Science Tech Report, April 2005.

[91] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," IEEE International Conference on Computer Vision, pp. 839-846, 1998.

[92] S. M. Smith and J. M. Brady, "SUSAN - a new approach to low level image processing," International Journal of Computer Vision, vol. 23, pp. 45-78, May 1997.

[93] P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 629-639, July 1990.

[94] J. Tumblin and G. Turk, "LCIS: A boundary hierarchy for detail-preserving contrast reduction," ACM SIGGRAPH Conference, pp. 83-90, 1999.

[95] A. Levin, A. Rav-Acha, and D. Lischinski, "Spectral matting," IEEE Transactions on

Pattern Analysis and Machine Intelligence, vol. 30, pp. 1699-1712, October 2008.

[96] D. Lischinski, Z. Farbman, M. Uyttendaele, and R. Szeliski, "Interactive local adjustment

of tonal values," ACM Transactions on Graphics, vol. 25, pp. 646-653, March 2006.

[97] N. Sochen, R. Kimmel, and A. M. Bruckstein, "Diffusions and confusions in signal and image processing," Journal of Mathematical Imaging and Vision, vol. 14, pp. 237-244, May 2001.

[98] M. Elad, "On the bilateral filter and ways to improve it," IEEE Transactions on Image

Processing, vol. 11, pp. 1141-1151, October 2002.

[99] J. van de Weijer and R. van den Boomgaard, "On the equivalence of local-mode finding, robust estimation and mean-shift analysis as used in early vision tasks," International Conference on Pattern Recognition, pp. 927-930, 2002.


[100] D. Barash and D. Comaniciu, "A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift," Image and Vision Computing, vol. 22, pp. 73-81, January 2004.

[101] A. Buades, B. Coll, and J.-M. Morel, "Neighborhood filters and PDE's," Numerische Mathematik, vol. 105, pp. 1-34, November 2006.

[102] P. Mrazek, J. Weickert, and A. Bruhn, "On robust estimation and smoothing with spatial and tonal kernels," Springer Geometric Properties for Incomplete Data, vol. 31, pp. 335-352, 2006.

[103] M. Aleksic, M. Smirnov, and S. Goma, "Novel bilateral filter approach: Image noise

reduction with sharpening," Proceedings of the SPIE, vol. 6069, pp. 141-147, May 2006.

[104] C. Liu, W. T. Freeman, R. Szeliski, and S. Kang, "Noise estimation from a single image,"

IEEE Computer Vision and Pattern Recognition Conference, pp. 901-908, 2006.

[105] S. Bae, S. Paris, and F. Durand, "Two-scale tone management for photographic look,"

ACM Transactions on Graphics, vol. 25, pp. 637-645, July 2006.

[106] M. Elad, "Retinex by two bilateral filters," Scale-Space Conference, pp. 217-229, 2005.

[107] E. Bennett and L. McMillan, "Video enhancement using per-pixel virtual exposures," ACM Transactions on Graphics, vol. 24, pp. 845-852, July 2005.

[108] H. Winnemoller, S. C. Olsen, and B. Gooch, "Real-time video abstraction," ACM Transactions on Graphics, vol. 25, pp. 1221-1226, August 2006.

[109] J. Xiao, H. Cheng, H. Sawhney, C. Rao, and M. Isnardi, "Bilateral filtering based optical flow estimation with occlusion detection," European Conference on Computer Vision, pp. 211-224, 2006.

[110] P. Sand and S. Teller, "Particle video: Long-range motion estimation using point trajectories," International Journal of Computer Vision, vol. 80, pp. 72-91, January 2008.

[111] J.-H. Woo, J.-H. Sohn, H. Kim, and H.-J. Yoo, "A 195 mW, 9.1 Mvertices/s fully programmable 3-D graphics processor for low-power mobile devices," IEEE Journal of Solid-State Circuits, vol. 43, pp. 2370-2380, July 2008.

[112] F. Sheikh, S. K. Mathew, M. A. Anders, H. Kaul, S. K. Hsu, A. Agarwal, R. K. Krishnamurthy, and S. Borkar, "A 2.05 Gvertices/s 151 mW lighting accelerator for 3D graphics vertex and pixel shading in 32 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 48, pp. 128-139, January 2013.

[113] G. Wan, X. Li, G. Agranov, M. Levoy, and M. Horowitz, "CMOS image sensors with multi-bucket pixels for computational photography," IEEE Journal of Solid-State Circuits, vol. 47, pp. 1031-1042, April 2012.

[114] S. Sukegawa, T. Umebayashi, T. Nakajima, H. Kawanobe, K. Koseki, I. Hirota, T. Haruta, M. Kasai, K. Fukumoto, T. Wakano, K. Inoue, H. Takahashi, T. Nagano, Y. Nitta, T. Hirayama, and N. Fukushima, "A 1/4-inch 8Mpixel back-illuminated stacked CMOS image sensor," IEEE International Solid-State Circuits Conference, pp. 484-485, 2013.


[115] Y. Chen, Y. Xu, Y. Chae, A. Mierop, X. Wang, and A. Theuwissen, "A 0.7e-rms-temporal-readout-noise CMOS image sensor for low-light-level imaging," IEEE International Solid-State Circuits Conference, pp. 384-385, 2012.

[116] J. Chen, S. Paris, and F. Durand, "Real time edge-aware image processing with the bilateral grid," ACM Transactions on Graphics, vol. 26, July 2007.

[117] T. Q. Pham and L. J. van Vliet, "Separable bilateral filtering for fast video preprocessing," IEEE International Conference on Multimedia and Expo, pp. 4-8, 2005.

[118] S. Paris and F. Durand, "A fast approximation of the bilateral filter using a signal processing approach," International Journal of Computer Vision, vol. 81, pp. 24-52, January 2009.

[119] B. Weiss, "Fast median and bilateral filtering," ACM Transactions on Graphics, vol. 25, pp. 519-526, July 2006.

[120] A. Sinha, A. Wang, and A. Chandrakasan, "Energy scalable system design," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, pp. 135-145, April 2002.

[121] P. E. Debevec and J. Malik, "Recovering high dynamic range radiance maps from photographs," ACM Conference on Computer Graphics and Interactive Techniques, pp. 369-378, 1997.

[122] G. W. Larson, H. Rushmeier, and C. Piatko, "A visibility matching tone reproduction operator for high dynamic range scenes," IEEE Transactions on Visualization and Computer Graphics, vol. 3, pp. 291-306, October 1997.

[123] J. DiCarlo and B. Wandell, "Rendering high dynamic range images," Proceedings of the

SPIE: Image Sensors, pp. 392-401, 2000.

[124] J. Cohen, C. Tchou, T. Hawkins, and P. Debevec, "Real-time high-dynamic range texture

mapping," Eurographics Workshop on Rendering, pp. 313-320, October 2001.

[125] D. J. Jobson, Z. U. Rahman, and G. A. Woodell, "A multi-scale retinex for bridging the

gap between color images and the human observation of scenes," IEEE Transactions on

Image Processing, vol. 6, pp. 965-976, July 1997.

[126] S. N. Pattanaik, J. A. Ferwerda, M. D. Fairchild, and D. P. Greenberg, "A multiscale model of adaptation and spatial vision for realistic image display," ACM SIGGRAPH Conference, pp. 287-298, 1997.

[127] J. J. McCann and A. Rizzi, "Veiling glare: The dynamic range limit of HDR images,"

Human Vision and Electronic Imaging XII, SPIE, vol. 6492, 2007.

[128] E. V. Talvala, A. Adams, M. Horowitz, and M. Levoy, "Veiling glare in high dynamic

range imaging," ACM Transactions on Graphics, vol. 26, July 2007.

[129] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, "High dynamic range imaging - acquisition, display and image-based lighting," Morgan Kaufmann Publishers, 2006.

[130] R. Raskar, A. Agrawal, C. A. Wilson, and A. Veeraraghavan, "Glare aware photography: 4D ray sampling for reducing glare effects of camera lenses," ACM Transactions on Graphics, vol. 27, pp. 56:1-56:10, August 2008.


[131] J. M. DiCarlo, F. Xiao, and B. A. Wandell, "Illuminating illumination," Color Imaging

Conference, pp. 27-34, 2001.

[132] M. F. Cohen, A. Colburn, and S. Drucker, "Image stacks," MSR Technical Report, vol. 40,

July 2003.

[133] H. Hoppe and K. Toyama, "Continuous flash," MSR Technical Report, vol. 63, October

2003.

[134] K. Toyama and B. Schoelkopf, "Interactive images," MSR Technical Report, vol. 64, December 2003.

[135] P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar, "Acquiring the

reflectance field of the human face," ACM SIGGRAPH Conference, pp. 145-156, 2000.

[136] V. Masselus, P. Dutre, and F. Anrys, "The free-form light stage," Eurographics Rendering

Symposium, pp. 247-256, 2002.

[137] D. Akers, F. Losasso, J. Klingner, M. Agrawala, J. Rick, and P. Hanrahan, "Conveying

shape and features with image-based relighting," IEEE Visualization, pp. 349-354, 2003.

[138] G. Petschnigg, M. Agrawala, H. Hoppe, R. Szeliski, M. Cohen, and K. Toyama, "Digital photography with flash and no-flash image pairs," ACM Transactions on Graphics, vol. 23, pp. 664-672, August 2004.

[139] E. Eisemann and F. Durand, "Flash photography enhancement via intrinsic relighting,"

ACM Transactions on Graphics, vol. 23, pp. 673-678, August 2004.

[140] B. M. Oh, M. Chen, J. Dorsey, and F. Durand, "Image-based modeling and photo editing,"

ACM SIGGRAPH Conference, 2001.

[141] A. Wang and A. P. Chandrakasan, "A 180mV FFT processor using subthreshold circuit techniques," IEEE International Solid-State Circuits Conference, pp. 292-293, 2004.

[142] S. Sridhara, M. DiRenzo, S. Lingam, S.-J. Lee, R. Blazquez, J. Maxey, S. Ghanem, Y.-H. Lee, R. Abdallah, P. Singh, and M. Goel, "Microwatt embedded processor platform for medical system-on-chip applications," IEEE Symposium on VLSI Circuits, pp. 15-16, 2010.

[143] M. Qazi, M. E. Sinangil, and A. P. Chandrakasan, "Challenges and directions for low-voltage SRAM," IEEE Design & Test of Computers, vol. 28, pp. 32-43, January 2011.

[144] "Intel Atom processor Z2760," [online] http://www.micron.com/products/dram/

ddr2-sdram.

[145] A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy, "The Frankencamera: An experimental platform for computational photography," ACM Transactions on Graphics, vol. 29, pp. 29:1-29:12, July 2010.

[146] "Intel integrated performance primitives," [online] https ://software. intel. com/

en-us/intel-ipp.

[147] "The OpenMP api specification for parallel programming," [online] http: //openmp. org/.


[148] "DDR2 SDRAM system-power calculator," [online] http: //www. intel. com/content/

www/us/en/processors/atom/atom-z2760-datasheet.html.

[149] "Pandaboard: Open OMAP 4 mobile software development platform," [online] http:

//pandaboard. org/content/platform.

[150] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide:

A language and compiler for optimizing parallelism, locality, and recomputation in image

processing pipelines," ACM SIGPLAN Conference on Programming Language Design and

Implementation, pp. 519-530, 2013.

[151] C. C. Wang, F. L. Yuan, H. Chen, and D. Markovic, "A 1.1 GOPS/mW FPGA with hierarchical interconnect fabric," IEEE International Symposium on VLSI Circuits, pp. 136-137, 2011.

[152] C. C. Wang, F. L. Yuan, T. H. Yu, and D. Markovic, "A multi-granularity FPGA with hierarchical interconnects for efficient and flexible mobile computing," International Solid-State Circuits Conference, pp. 460-461, 2014.

[153] R. J. Hay, N. E. Johns, H. C. Williams, I. W. Bolliger, R. P. Dellavalle, D. J. Margolis,

R. Marks, L. Naldi, M. A. Weinstock, S. K. Wulf, C. Michaud, C. J. L. Murray, and

M. Naghavi, "The global burden of skin disease in 2010: An analysis of the prevalence

and impact of skin conditions," Journal of Investigative Dermatology, November 2013.

[154] P. E. Grimes, "New insight and new therapies in vitiligo," The Journal of the American

Medical Association, vol. 293, pp. 730-735, February 2005.

[155] A. Alikhan, L. M. Felsten, M. Daly, and V. Petronic-Rosic, "Vitiligo: a comprehensive overview part I. Introduction, epidemiology, quality of life, diagnosis, differential diagnosis, associations, histopathology, etiology, and work-up," Journal of the American Academy of Dermatology, vol. 65, pp. 473-491, September 2011.

[156] K. Ezzedine, H. W. Lim, T. Suzuki, I. Katayama, I. Hamzavi, C. C. Lan, B. K. Goh, T. Anbar, C. S. de Castro, A. Y. Lee, D. Prasad, N. V. Geel, I. C. L. Poole, N. Oiso, L. Benzekri, R. Spritz, Y. Gauthier, S. K. Hann, M. Picardo, and A. Taieb, "Revised classification/nomenclature of vitiligo and related issues: The vitiligo global issues consensus conference," Pigment Cell and Melanoma Research, vol. 25, pp. E1-13, May 2012.

[157] D. J. Gawkrodger, A. D. Ormerod, L. Shaw, I. Mauri-Sole, M. E. Whitton, M. J. Watts,

A. V. Anstey, J. Ingham, and K. Young, "Guideline for the diagnosis and management of

vitiligo," British Journal of Dermatology, vol. 159, pp. 1051-1076, November 2008.

[158] R. M. Halder and J. L. Chappell, "Vitiligo update," Seminars in cutaneous medicine and

surgery, vol. 28, pp. 86-92, June 2009.

[159] G. C. do Carmo and M. R. e Silva, "Dermoscopy: basic concepts," International Journal

of Dermatology, vol. 47, pp. 712-719, July 2008.

[160] M. E. Vestergaard, P. Macaskill, P. E. Holt, and S. W. Menzies, "Dermoscopy compared

with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of

studies performed in a clinical setting," The British Journal of Dermatology, vol. 159,

pp. 669-676, September 2008.


[161] W. Stolz, P. Bilek, M. Landthaler, T. Merkle, and O. Braun-Falco, "Skin surface microscopy," The Lancet, vol. 334, pp. 864-865, October 1989.

[162] R. P. Braun, H. S. Rabinovitz, M. Oliviero, A. W. Kopf, and J. H. Saurat, "Dermoscopy

of pigmented skin lesions," Journal of the American Academy of Dermatology, vol. 52,

pp. 109-121, January 2005.

[163] "Dermlite," [online] http://dermlite.com/.

[164] U. Gonzalez, M. Whitton, V. Eleftheriadou, M. Pinart, J. Batchelor, and J. Leonardi-Bee,

"Guidelines for designing and reporting clinical trials in vitiligo," Archives of Dermatology,

vol. 147, pp. 1428-1436, December 2011.

[165] V. Eleftheriadou, K. S. Thomas, M. E. Whitton, J. M. Batchelor, and J. C. Ravenscroft,

"Which outcomes should we measure in vitiligo? Results of a systematic review and a

survey among patients and clinicians on outcomes in vitiligo trials," The British Journal

of Dermatology, vol. 167, pp. 804-814, October 2012.

[166] C. Vrijman, M. L. Homan, J. Limpens, W. van der Veen, A. Wolkerstorfer, C. B. Terwee,

and P. I. Spuls, "Measurement properties of outcome measures for vitiligo: A systematic

review," Archives of Dermatology, vol. 17, pp. 1-8, September 2012.

[167] I. Hamzavi, H. Jain, D. McLean, J. Shapiro, H. Zeng, and H. Lui, "Parametric modeling

of narrowband UV-B phototherapy for vitiligo using a novel quantitative tool: The vitiligo

area scoring index," Archives of Dermatology, vol. 140, pp. 677-683, June 2004.

[168] T. S. Oh, O. Lee, J. E. Kim, S. W. Son, and C. H. Oh, "Quantitative method for measuring therapeutic efficacy of the 308 nm excimer laser for vitiligo," Skin Research and Technology, vol. 18, pp. 347-355, August 2012.

[169] M. W. L. Homan, A. Wolkerstorfer, M. A. Sprangers, and J. L. van der Veen, "Digital image analysis vs. clinical assessment to evaluate repigmentation after punch grafting in vitiligo," Journal of the European Academy of Dermatology and Venereology, vol. 27, pp. 235-238, February 2013.

[170] T. S. Cho, W. T. Freeman, and H. Tsao, "A reliable skin mole localization scheme," in

International Conference on Computer Vision, pp. 1-8, IEEE, 2007.

[171] S. K. Madan, K. J. Dana, and O. G. Cula, "Quasiconvex alignment of multimodal skin images for quantitative dermatology," in Computer Vision and Pattern Recognition Workshops, pp. 117-124, IEEE, 2009.

[172] S. K. Madan, K. J. Dana, and O. Cula, "Learning-based detection of acne-like regions using time-lapse features," in Signal Processing in Medicine and Biology Symposium, pp. 1-6, IEEE, 2011.

[173] H. Wannous, Y. Lucas, and S. Treuillet, "Enhanced assessment of the wound-healing process by accurate multiview tissue classification," IEEE Transactions on Medical Imaging, vol. 30, pp. 315-326, February 2011.

[174] H. Nugroho, M. H. A. Fadzil, V. V. Yap, S. Norashikin, and H. H. Suraiya, "Determination

of skin repigmentation progression," IEEE International Conference of the Engineering in

Medicine and Biology Society, pp. 3442-3445, 2007.


[175] F. Peruch, F. Bogo, M. Bonazza, V. M. Cappelleri, and E. Peserico, "Simpler, faster,

more accurate melanocytic lesion segmentation through MEDS," IEEE Transactions on

Biomedical Engineering, vol. 61, pp. 557-565, February 2014.

[176] K. Korotkov and R. Garcia, "Computerized analysis of pigmented skin lesions: A review,"

Artificial Intelligence in Medicine, vol. 56, pp. 69-90, October 2012.

[177] R. J. Friedman, D. S. Rigel, and A. W. Kopf, "Early detection of malignant melanoma:

The role of physician examination and self-examination of the skin," CA: A Cancer Journal

for Clinicians, vol. 35, pp. 130-151, May 1985.

[178] J. Chen, J. Stanley, R. H. Moss, and W. V. Stoecker, "Color analysis of skin lesion regions

for melanoma discrimination in clinical images," Skin Research and Technology, vol. 9,

pp. 94-104, May 2003.

[179] C. Grana, G. Pellacani, and S. Seidenari, "Practical color calibration for dermoscopy, applied to a digital epiluminescence microscope," Skin Research and Technology, vol. 11, pp. 242-247, November 2005.

[180] H. Iyatomi, M. E. Celebi, G. Schaefer, and M. Tanaka, "Automated color normalization for dermoscopy images," in International Conference on Image Processing, pp. 4357-4360, IEEE, 2010.

[181] M. Styner, C. Brechbuhler, G. Szekely, and G. Gerig, "Parametric estimate of intensity inhomogeneities applied to MRI," IEEE Transactions on Medical Imaging, vol. 19, pp. 153-165, March 2000.

[182] J. Milles, Y. Zhu, G. Gimenez, C. Guttmann, and I. Magnin, "MRI intensity nonuniformity

correction using simultaneously spatial and gray-level histogram information," Journal of

Computerized Medical Imaging and Graphics, vol. 31, pp. 81-90, March 2007.

[183] U. Vovk, F. Pernus, and B. Likar, "Review of methods for correction of intensity inhomogeneity in MRI," IEEE Transactions on Medical Imaging, vol. 26, pp. 405-421, March 2007.

[184] K. Zhang, L. Zhang, and S. Zhang, "A variational multiphase level set approach to simultaneous segmentation and bias correction," in International Conference on Image Processing, pp. 4105-4108, IEEE, 2010.

[185] C. Li, C. Xu, C. Gui, and M. D. Fox, "Distance regularized level set evolution and its application to image segmentation," IEEE Transactions on Image Processing, vol. 19, pp. 3243-3254, December 2010.

[186] C. Li, R. Huang, Z. Ding, C. Gatenby, D. N. Metaxas, and J. C. Gore, "A level set method

for image segmentation in the presence of intensity inhomogeneities with application to

MRI," IEEE Transactions on Image Processing, vol. 20, pp. 2007-2016, July 2011.

[187] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, pp. 381-395, June 1981.


[188] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1615-1630, October 2005.

[189] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," IEEE Conference on Computer Vision and Pattern Recognition, pp. 524-531, 2005.

[190] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 712-727, April 2008.

[191] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2169-2178, 2006.

[192] J. J. Kivinen, E. B. Sudderth, and M. Jordan, "Learning multiscale representation of

natural scenes using Dirichlet processes," IEEE Conference on Computer Vision, pp. 1-8,

2007.

[193] K. Grauman and T. Darrell, "Efficient image matching with distributions of local invariant

features," IEEE Conference on Computer Vision and Pattern Recognition, pp. 627-634,

2005.

[194] A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning globally-consistent local distance functions for shape-based image retrieval and classification," IEEE Conference on Computer Vision, pp. 1-8, 2007.

[195] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using

shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24,

pp. 509-522, April 2002.

[196] J. Duchon, "Splines minimizing rotation-invariant semi-norms in Sobolev spaces," Constructive Theory of Functions of Several Variables, vol. 571, pp. 85-100, 1977.

[197] J. Meinguet, "Multivariate interpolation at arbitrary points made simple," Journal of

Applied Mathematics and Physics, vol. 5, pp. 439-468, 1979.

[198] "Intel power gadget," [online] https ://software. intel. com/en-us/articles/

intel-power-gadget-20.

[199] S. L. Jacques, J. C. Ramella-Roman, and K. Lee, "Imaging skin pathology with polarized light," Journal of Biomedical Optics, vol. 7, pp. 329-340, July 2002.

[200] M. H. Smith, P. Burke, A. Lompado, E. Tanner, and L. W. Hillman, "Mueller matrix imaging polarimetry in dermatology," Proceedings of SPIE, 2000.

[201] J. A. Muccini, N. Kollias, S. B. Phillips, R. R. Anderson, A. J. Sober, M. J. Stiller, and L. A. Drake, "Polarized light photography in the evaluation of photoaging," Journal of the American Academy of Dermatology, vol. 33, pp. 765-769, November 1995.


[202] R. Langley, M. Rajadhyaksha, P. Dwyer, A. Sober, T. Flotte, and R. R. Anderson, "Confocal scanning laser microscopy of benign and malignant melanocytic skin lesions in vivo," Journal of the American Academy of Dermatology, vol. 45, pp. 365-376, September 2001.

[203] D. Kapsokalyvas, N. Bruscino, D. Alfieri, V. de Giorgi, G. Cannarozzo, T. Lotti, and F. S.

Pavone, "Imaging of human skin lesions with the multispectral dermoscope," Proceedings

of the SPIE, 2010.

[204] R. W. Brodersen, "Low power design, past and future," 2014.

[205] E. J. Fluhr, J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogenmiller, F. Malgioglio, R. Nett, J. Paredes, J. Pille, D. Plass, R. Puri, P. Restle, D. Shan, K. Stawiasz, Z. T. Deniz, D. Wendel, and M. Ziegler, "POWER8: A 12-core server-class processor in 22nm SOI with 7.6Tb/s off-chip bandwidth," IEEE International Solid-State Circuits Conference, pp. 96-97, 2014.

[206] N. Kurd, M. Chowdhury, E. Burton, T. P. Thomas, C. Mozak, B. Boswell, M. Lal, A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T. M. Wilson, M. Merten, S. Chennupaty, W. Gomes, and R. Kumar, "Haswell: A family of IA 22nm processors," IEEE International Solid-State Circuits Conference, pp. 112-113, 2014.

[207] A. Wang, P. R. Gill, and A. Molnar, "An angle-sensitive CMOS imager for single-sensor

3D photography," International Solid-State Circuits Conference, pp. 412-413, 2011.

[208] W. Kim, Y. Wang, I. Ovsiannikov, S. H. Lee, Y. Park, C. Chung, and E. Fossum, "A

1.5 Mpixel RGBZ CMOS image sensor for simultaneous color and range image capture,"

International Solid-State Circuits Conference, pp. 392-393, 2012.

[209] L. T. Su, "Architecting the future through heterogeneous computing," International Solid-

State Circuits Conference, pp. 8-11, 2011.

[210] C. Gentry, "Computing arbitrary functions of encrypted data," Communications of the

ACM, vol. 53, pp. 97-105, March 2010.
