Video audio

Video & Audio Representation and Coding

Sistemi Multimediali - DIS 2011

5.1 Types of Video Signals

Component Video

• Component video: higher-end video systems make use of three separate video signals for the red, green, and blue image planes. Each color channel is sent as a separate video signal.

(a) Most computer systems use component video, with separate signals for R, G, and B.

(b) For any color separation scheme, component video gives the best color reproduction, since there is no "crosstalk" between the three channels.

(c) This is not the case for S-Video or composite video, discussed next.

• Component video, however, requires more bandwidth and good synchronization of the three components.

2 Li & Drew


Composite Video: 1 Signal

• Composite video: color ("chrominance") and intensity ("luminance") signals are mixed into a single carrier wave.

(a) Chrominance is a composition of two color components (I and Q, or U and V).

(b) In NTSC TV, for example, I and Q are combined into a chroma signal, and a color subcarrier is then employed to put the chroma signal at the high-frequency end of the signal shared with the luminance signal.

(c) The chrominance and luminance components can be separated at the receiver end, and then the two color components can be further recovered.

(d) When connecting to TVs or VCRs, composite video uses only one wire, and the video color signals are mixed, not sent separately. The audio and sync signals are additions to this one signal.

• Since color and intensity are wrapped into the same signal, some interference between the luminance and chrominance signals is inevitable.


S-Video: 2 Signals

• S-Video (separated video, or Super-video, e.g., in S-VHS): as a compromise, S-Video uses two wires, one for luminance and another for a composite chrominance signal.

• As a result, there is less crosstalk between the color information and the crucial gray-scale information.

• The reason for placing luminance into its own part of the signal is that black-and-white information is most crucial for visual perception.

– In fact, humans are able to differentiate spatial resolution in grayscale images with a much higher acuity than for the color part of color images.

– As a result, we can send less accurate color information than must be sent for intensity information: we can only see fairly large blobs of color, so it makes sense to send less color detail.


5.2 Analog Video

• An analog signal f(t) samples a time-varying image. So-called "progressive" scanning traces through a complete picture (a frame) row-wise for each time interval.

• In TV, and in some monitors and multimedia standards as well, another system, called "interlaced" scanning, is used:

(a) The odd-numbered lines are traced first, and then the even-numbered lines are traced. This results in "odd" and "even" fields; two fields make up one frame.

(b) In fact, the odd lines (starting from 1) end up at the middle of a line at the end of the odd field, and the even scan starts at a half-way point.


• Table 5.2 gives a comparison of the three major analog broadcast TV systems.

Table 5.2: Comparison of Analog Broadcast TV Systems

TV System | Frame Rate (fps) | # of Scan Lines | Total Channel Width (MHz) | Bandwidth Allocation (MHz): Y | I or U | Q or V
NTSC      | 29.97            | 525             | 6.0                       | 4.2                           | 1.6    | 0.6
PAL       | 25               | 625             | 8.0                       | 5.5                           | 1.8    | 1.8
SECAM     | 25               | 625             | 8.0                       | 6.0                           | 2.0    | 2.0


5.3 Digital Video

• The advantages of digital representation for video are many. For example:

(a) Video can be stored on digital devices or in memory, ready to be processed (noise removal, cut and paste, etc.) and integrated into various multimedia applications;

(b) Direct access is possible, which makes nonlinear video editing achievable as a simple, rather than a complex, task;

(c) Repeated recording does not degrade image quality;

(d) Ease of encryption and better tolerance to channel noise.


CCIR Standards for Digital Video

• CCIR is the Consultative Committee for International Radio, and one of the most important standards it has produced is CCIR-601, for component digital video.

– This standard has since become standard ITU-R-601, an international standard for professional video applications, adopted by certain digital video formats including the popular DV video.


HDTV (High Definition TV)

• The main thrust of HDTV (High Definition TV) is not to increase the "definition" in each unit area, but rather to increase the visual field, especially in its width.

(a) The first generation of HDTV was based on an analog technology developed by Sony and NHK in Japan in the late 1970s.

(b) MUSE (MUltiple sub-Nyquist Sampling Encoding) was an improved NHK HDTV with hybrid analog/digital technologies that was put in use in the 1990s. It has 1,125 scan lines, interlaced (60 fields per second), and a 16:9 aspect ratio.

(c) Since uncompressed HDTV will easily demand more than 20 MHz bandwidth, which will not fit in the current 6 MHz or 8 MHz channels, various compression techniques are being investigated.

(d) It is also anticipated that high-quality HDTV signals will be transmitted using more than one channel, even after compression.


• A brief history of HDTV evolution:

(a) In 1987, the FCC decided that HDTV standards must be compatible with the existing NTSC standard and be confined to the existing VHF (Very High Frequency) and UHF (Ultra High Frequency) bands.

(b) In 1990, the FCC announced a very different initiative, i.e., its preference for a full-resolution HDTV, and it was decided that HDTV would be simultaneously broadcast with the existing NTSC TV and eventually replace it.

(c) Witnessing a boom of proposals for digital HDTV, the FCC made a key decision to go all-digital in 1993. A "grand alliance" was formed that included four main proposals, by General Instruments, MIT, Zenith, and AT&T, and by Thomson, Philips, Sarnoff and others.

(d) This eventually led to the formation of the ATSC (Advanced Television Systems Committee), responsible for the standard for TV broadcasting of HDTV.

(e) In 1995 the U.S. FCC Advisory Committee on Advanced Television Service recommended that the ATSC Digital Television Standard be adopted.


• The standard supports the video scanning formats shown in Table 5.4. In the table, "I" means interlaced scan and "P" means progressive (non-interlaced) scan.

Table 5.4: Advanced Digital TV formats supported by ATSC

# of Active Pixels per Line | # of Active Lines | Aspect Ratio | Picture Rate
1,920                       | 1,080             | 16:9         | 60I 30P 24P
1,280                       | 720               | 16:9         | 60P 30P 24P
704                         | 480               | 16:9 & 4:3   | 60I 60P 30P 24P
640                         | 480               | 4:3          | 60I 60P 30P 24P


• For video, MPEG-2 is chosen as the compression standard. For audio, AC-3 is the standard. It supports the so-called 5.1 channel Dolby surround sound, i.e., five surround channels plus a subwoofer channel.

• The salient differences between conventional TV and HDTV:

(a) HDTV has a much wider aspect ratio of 16:9 instead of 4:3.

(b) HDTV moves toward progressive (non-interlaced) scan. The rationale is that interlacing introduces serrated edges to moving objects and flickers along horizontal edges.


• The FCC has planned to replace all analog broadcast services with digital TV broadcasting by the year 2009. The services provided will include:

– SDTV (Standard Definition TV): the current NTSC TV or higher.

– EDTV (Enhanced Definition TV): 480 active lines or higher, i.e., the third and fourth rows in Table 5.4.

– HDTV (High Definition TV): 720 active lines or higher.


6.1 Digitization of Sound

What is Sound?

• Sound is a wave phenomenon like light, but is macroscopic and involves molecules of air being compressed and expanded under the action of some physical device.

(a) For example, a speaker in an audio system vibrates back and forth and produces a longitudinal pressure wave that we perceive as sound.

(b) Since sound is a pressure wave, it takes on continuous values, as opposed to digitized ones.

(c) Even though such pressure waves are longitudinal, they still have ordinary wave properties and behaviors, such as reflection (bouncing), refraction (change of angle when entering a medium with a different density) and diffraction (bending around an obstacle).

(d) If we wish to use a digital version of sound waves, we must form digitized representations of audio information.


Digitization

• Digitization means conversion to a stream of numbers, and preferably these numbers should be integers for efficiency.

• Fig. 6.1 shows the 1-dimensional nature of sound: amplitude values depend on a 1D variable, time. (Note that images depend instead on a 2D set of variables, x and y.)

Fig. 6.1: An analog signal: continuous measurement of pressure wave.


• The graph in Fig. 6.1 has to be made digital in both time and amplitude. To digitize, the signal must be sampled in each dimension: in time, and in amplitude.

(a) Sampling means measuring the quantity we are interested in, usually at evenly spaced intervals.

(b) The first kind of sampling, using measurements only at evenly spaced time intervals, is simply called sampling. The rate at which it is performed is called the sampling frequency (see Fig. 6.2(a)).

(c) For audio, typical sampling rates are from 8 kHz (8,000 samples per second) to 48 kHz. This range is determined by the Nyquist theorem, discussed later.

(d) Sampling in the amplitude or voltage dimension is called quantization. Fig. 6.2(b) shows this kind of sampling.
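The two steps above, sampling in time and quantization in amplitude, can be illustrated with a minimal sketch. The function name `sample_and_quantize`, the 1 kHz test tone, and the choice of signed 8-bit codes are assumptions made for illustration, not part of the slides.

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, bits, n_samples):
    """Sample a unit-amplitude sine wave and quantize each sample to a signed integer."""
    max_code = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                # sampling: evenly spaced time instants
        x = math.sin(2 * math.pi * freq_hz * t)   # continuous-valued amplitude in [-1, 1]
        codes.append(round(x * max_code))     # quantization: nearest integer code
    return codes

# One period of a 1 kHz tone sampled at 8 kHz with 8 bits per sample.
codes = sample_and_quantize(freq_hz=1000, sample_rate_hz=8000, bits=8, n_samples=8)
print(codes)
```

Raising `sample_rate_hz` gives more samples per period, while raising `bits` gives finer amplitude steps; the two choices are independent.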



Signal to Noise Ratio (SNR)

• The ratio of the power of the correct signal to the power of the noise is called the signal to noise ratio (SNR), a measure of the quality of the signal.

• The SNR is usually measured in decibels (dB), where 1 dB is a tenth of a bel. The SNR value, in units of dB, is defined in terms of base-10 logarithms of squared voltages, as follows:

$$ SNR = 10 \log_{10} \frac{V_{signal}^2}{V_{noise}^2} = 20 \log_{10} \frac{V_{signal}}{V_{noise}} \qquad (6.2) $$

(a) The power in a signal is proportional to the square of the voltage. For example, if the signal voltage V_signal is 10 times the noise voltage, then the SNR is 20 · log10(10) = 20 dB.

(b) In terms of power, if the power from ten violins is ten times that from one violin playing, then the ratio of power is 10 dB, or 1 B.

(c) To remember: for power ratios the multiplier is 10; for signal voltage ratios it is 20.
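The two worked examples above can be checked numerically. This is a small sketch of Eq. (6.2); the function names are illustrative assumptions.

```python
import math

def snr_db_from_voltage(v_signal, v_noise):
    """SNR in dB from a voltage ratio: 20 * log10(V_signal / V_noise)."""
    return 20 * math.log10(v_signal / v_noise)

def ratio_db_from_power(p_signal, p_noise):
    """Level difference in dB from a power ratio: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(p_signal / p_noise)

print(snr_db_from_voltage(10, 1))   # signal voltage 10 times the noise voltage
print(ratio_db_from_power(10, 1))   # ten violins versus one violin
```

Both forms agree because power is proportional to the square of the voltage, so squaring the voltage ratio doubles the logarithm.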



• The usual levels of sound we hear around us are described in terms of decibels, as a ratio to the quietest sound we are capable of hearing. Table 6.1 shows approximate levels for these sounds.

Table 6.1: Magnitude levels of common sounds, in decibels

Sound                   | Level (dB)
Threshold of hearing    | 0
Rustle of leaves        | 10
Very quiet room         | 20
Average room            | 40
Conversation            | 60
Busy street             | 70
Loud radio              | 80
Train through station   | 90
Riveter                 | 100
Threshold of discomfort | 120
Threshold of pain       | 140
Damage to ear drum      | 160


Audio Filtering

• Prior to sampling and A/D conversion, the audio signal is also usually filtered to remove unwanted frequencies. The frequencies kept depend on the application:

(a) For speech, typically 50 Hz to 10 kHz is retained, and other frequencies are blocked by a band-pass filter that screens out lower and higher frequencies.

(b) An audio music signal will typically contain from about 20 Hz up to 20 kHz.

(c) At the D/A converter end, high frequencies may reappear in the output: because of sampling and then quantization, the smooth input signal is replaced by a series of step functions containing all possible frequencies.

(d) So at the decoder side, a low-pass filter is used after the D/A circuit.


Audio Quality vs. Data Rate

• The uncompressed data rate increases as more bits are used for quantization. Stereo doubles the bandwidth needed to transmit a digital audio signal.

Table 6.2: Data rate and bandwidth in sample audio applications

Quality   | Sample Rate (kHz) | Bits per Sample | Mono / Stereo | Data Rate, uncompressed (kB/sec) | Frequency Band (kHz)
Telephone | 8                 | 8               | Mono          | 8                                | 0.200-3.4
AM Radio  | 11.025            | 8               | Mono          | 11.0                             | 0.1-5.5
FM Radio  | 22.05             | 16              | Stereo        | 88.2                             | 0.02-11
CD        | 44.1              | 16              | Stereo        | 176.4                            | 0.005-20
DAT       | 48                | 16              | Stereo        | 192.0                            | 0.005-20
DVD Audio | 192 (max)         | 24 (max)        | 6 channels    | 1,200 (max)                      | 0-96 (max)
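The data rates in the Telephone through DAT rows follow directly from sample rate × bits per sample × number of channels. The sketch below reproduces them; the function name is an illustrative assumption, and the DVD Audio row is left out because its listed maxima are not a simple product of the other columns.

```python
def data_rate_kB_per_s(sample_rate_khz, bits_per_sample, channels):
    """Uncompressed data rate in kilobytes per second (1 kB taken as 1,000 bytes)."""
    return sample_rate_khz * bits_per_sample * channels / 8

rows = {
    "Telephone": (8, 8, 1),
    "FM Radio": (22.05, 16, 2),
    "CD": (44.1, 16, 2),
    "DAT": (48, 16, 2),
}
for name, (rate, bits, ch) in rows.items():
    print(name, round(data_rate_kB_per_s(rate, bits, ch), 1))
```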


6.2 MIDI: Musical Instrument Digital Interface

• Use the sound card's defaults for sounds: ⇒ use a simple scripting language and hardware setup called MIDI.

• MIDI Overview

(a) MIDI is a scripting language: it codes "events" that stand for the production of sounds. E.g., a MIDI event might include values for the pitch of a single note, its duration, and its volume.

(b) MIDI is a standard adopted by the electronic music industry for controlling devices, such as synthesizers and sound cards, that produce music.

(c) The MIDI standard is supported by most synthesizers, so sounds created on one synthesizer can be played and manipulated on another synthesizer and sound reasonably close.

(d) Computers must have a special MIDI interface, but this is incorporated into most sound cards. The sound card must also have both D/A and A/D converters.


MIDI Concepts

• MIDI channels are used to separate messages.

(a) There are 16 channels, numbered from 0 to 15. The channel forms the last 4 bits (the least significant bits) of the message.

(b) Usually a channel is associated with a particular instrument: e.g., channel 1 is the piano, channel 10 is the drums, etc.

(c) Nevertheless, one can switch instruments midstream, if desired, and associate another instrument with any channel.

• System messages

(a) There are several other types of messages, e.g., a general message for all instruments indicating a change in tuning or timing.

(b) If the first 4 bits are all 1s, then the message is interpreted as a system common message.

• A synthetic musical instrument usually responds to MIDI messages by simply ignoring any "play sound" message that is not for its channel.

– If several messages are for its channel, then the instrument responds, provided it is multi-voice, i.e., can play more than a single note at once.


• It is easy to confuse the term voice with the term timbre. The latter is MIDI terminology for which instrument is being emulated, e.g., a piano as opposed to a violin: it is the quality of the sound.

(a) An instrument (or sound card) that is multi-timbral is one that is capable of playing many different sounds at the same time, e.g., piano, brass, drums, etc.

(b) On the other hand, the term voice, while sometimes used by musicians to mean the same thing as timbre, is used in MIDI to mean every different timbre and pitch that the tone module can produce at the same time.

• Different timbres are produced digitally by using a patch: the set of control settings that define a particular timbre. Patches are often organized into databases, called banks.


• The value of a MIDI status byte is between 128 and 255; each of the data bytes is between 0 and 127. Actual MIDI bytes are 10-bit on the wire, including one start bit and one stop bit.

Fig. 6.8: Stream of 10-bit bytes; for typical MIDI messages, these consist of {Status byte, Data Byte, Data Byte} = {Note On, Note Number, Note Velocity}


Hardware Aspects of MIDI

• The MIDI hardware setup consists of a 31.25 kbps serial connection. Usually, MIDI-capable units are either input devices or output devices, not both.

• A traditional synthesizer is shown in Fig. 6.10:

Fig. 6.10: A MIDI synthesizer


• The physical MIDI ports consist of 5-pin connectors for IN and OUT, as well as a third connector called THRU.

(a) MIDI communication is half-duplex.

(b) MIDI IN is the connector via which the device receives all MIDI data.

(c) MIDI OUT is the connector through which the device transmits all the MIDI data it generates itself.

(d) MIDI THRU is the connector by which the device echoes the data it receives from MIDI IN. Note that only the MIDI IN data is echoed by MIDI THRU; all the data generated by the device itself is sent via MIDI OUT.


• A typical MIDI sequencer setup is shown in Fig. 6.11:

Fig. 6.11: A typical MIDI setup

Structure of MIDI Messages

• MIDI messages can be classified into two types, channel messages and system messages, as in Fig. 6.12:

Fig. 6.12: MIDI message taxonomy


• A. Channel messages: can have up to 3 bytes:

(a) The first byte is the status byte (the opcode, as it were); it has its most significant bit set to 1.

(b) The 4 low-order bits identify which channel this message belongs to (for 16 possible channels).

(c) The 3 remaining bits hold the message. For a data byte, the most significant bit is set to 0.

• A.1. Voice messages:

(a) This type of channel message controls a voice, i.e., sends information specifying which note to play or to turn off, and encodes key pressure.

(b) Voice messages are also used to specify controller effects such as sustain, vibrato, tremolo, and the pitch wheel.

(c) Table 6.3 lists these operations.


Table 6.3: MIDI voice messages

Voice Message      | Status Byte | Data Byte 1     | Data Byte 2
Note Off           | &H8n        | Key number      | Note Off velocity
Note On            | &H9n        | Key number      | Note On velocity
Poly. Key Pressure | &HAn        | Key number      | Amount
Control Change     | &HBn        | Controller num. | Controller value
Program Change     | &HCn        | Program number  | None
Channel Pressure   | &HDn        | Pressure value  | None
Pitch Bend         | &HEn        | MSB             | LSB

(&H indicates hexadecimal, and "n" in the status byte hex value stands for a channel number. All values are in 0..127 except Controller number, which is in 0..120.)
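The byte layout described above (a status byte with its top bit set, 3 message bits and 4 channel bits, followed by data bytes in 0..127) can be sketched as follows. The function names, the key number 60, and the velocity 100 are arbitrary illustrative choices; the 10 bits per byte and the 31.25 kbps line rate are the figures quoted earlier in these slides.

```python
def note_on(channel, key, velocity):
    """Build the 3 bytes of a Note On voice message (status byte &H9n)."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be in 0..15")
    if not (0 <= key <= 127 and 0 <= velocity <= 127):
        raise ValueError("data bytes must be in 0..127")
    status = 0x90 | channel              # top bit 1, message bits 001, low 4 bits = channel
    return bytes([status, key, velocity])

def parse_status(status):
    """Split a status byte into (3-bit message code, 4-bit channel)."""
    if status < 0x80:
        raise ValueError("most significant bit is 0: this is a data byte")
    return (status >> 4) & 0x07, status & 0x0F

message = note_on(channel=0, key=60, velocity=100)
print(message.hex())

# Time on the wire: 3 bytes, 10 bits each (8 data bits plus start and stop), at 31,250 bits/s.
seconds = len(message) * 10 / 31250
print(seconds)
```

At this rate a three-byte voice message occupies the line for just under one millisecond, which bounds how many note events a single MIDI cable can carry per second.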


General MIDI

• General MIDI is a scheme for standardizing the assignment of instruments to patch numbers.

(a) A standard percussion map specifies 47 percussion sounds.

(b) Where a "note" appears on a musical score determines what percussion instrument is being struck: a bongo drum, a cymbal.

(c) Other requirements for General MIDI compatibility: a MIDI device must support all 16 channels; a device must be multitimbral (i.e., each channel can play a different instrument/program); a device must be polyphonic (i.e., each channel is able to play many voices); and there must be a minimum of 24 dynamically allocated voices.

• General MIDI Level 2: an extended General MIDI has recently been defined, with a standard .smf ("Standard MIDI File") format that includes extra character information, such as karaoke lyrics.


MIDI to WAV Conversion

• Some programs, such as early versions of Premiere, cannot include .mid files; instead, they insist on .wav format files.

(a) Various shareware programs exist for approximating a reasonable conversion between MIDI and WAV formats.

(b) These programs essentially consist of large lookup files that try to substitute pre-defined or shifted WAV output for MIDI messages, with inconsistent success.


7.1 Introduction

• Compression: the process of coding that will effectively reduce the total number of bits needed to represent certain information.

Fig. 7.1: A General Data Compression Scheme.


Introduction (cont'd)

• If the compression and decompression processes induce no information loss, then the compression scheme is lossless; otherwise, it is lossy.

• Compression ratio:

$$ \text{compression ratio} = \frac{B_0}{B_1} \qquad (7.1) $$

B_0: number of bits before compression
B_1: number of bits after compression


7.2 Basics of Information Theory

• The entropy η of an information source with alphabet S = {s_1, s_2, . . . , s_n} is:

$$ \eta = H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} \qquad (7.2) $$

$$ \eta = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (7.3) $$

p_i: probability that symbol s_i will occur in S.

log_2 (1/p_i): indicates the amount of information (self-information as defined by Shannon) contained in s_i, which corresponds to the number of bits needed to encode s_i.


Distribution of Gray-Level Intensities

Fig. 7.2: Histograms for Two Gray-level Images.

• Fig. 7.2(a) shows the histogram of an image with uniform distribution of gray-level intensities, i.e., ∀i p_i = 1/256. Hence, the entropy of this image is:

$$ \eta = \log_2 256 = 8 \qquad (7.4) $$

• Fig. 7.2(b) shows the histogram of an image with two possible values. Its entropy is 0.92.
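Both entropy values can be reproduced with a few lines implementing Eq. (7.2). The slides do not state the two probabilities behind the second histogram; the sketch assumes 1/3 and 2/3, which is one pair that yields approximately 0.92.

```python
import math

def entropy(probs):
    """Entropy in bits: sum of p_i * log2(1 / p_i), skipping zero-probability symbols."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = [1 / 256] * 256        # Fig. 7.2(a): every gray level equally likely
two_level = [1 / 3, 2 / 3]       # assumed probabilities for a two-valued image

print(entropy(uniform))
print(round(entropy(two_level), 2))
```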



Entropy and Code Length

• As can be seen in Eq. (7.3), the entropy η is a weighted sum of the terms log_2 (1/p_i); hence it represents the average amount of information contained per symbol in the source S.

• The entropy η specifies the lower bound for the average number of bits to code each symbol in S, i.e.,

$$ \eta \le \bar{l} \qquad (7.5) $$

where \bar{l} is the average length (measured in bits) of the codewords produced by the encoder.


7.3 Run-Length Coding

• Memoryless source: an information source that is independently distributed. Namely, the value of the current symbol does not depend on the values of the previously appeared symbols.

• Instead of assuming a memoryless source, Run-Length Coding (RLC) exploits the memory present in the information source.

• Rationale for RLC: if the information source has the property that symbols tend to form continuous groups, then each such symbol and the length of its group can be coded.
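The rationale can be made concrete with a minimal run-length coder that stores each run as a (symbol, run length) pair. This is an illustrative sketch operating on strings; the function names and the sample input are assumptions, and practical schemes pack the pairs into bits in format-specific ways.

```python
from itertools import groupby

def rlc_encode(symbols):
    """Collapse each run of identical symbols into a (symbol, run_length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(symbols)]

def rlc_decode(pairs):
    """Expand (symbol, run_length) pairs back into the original string."""
    return "".join(symbol * length for symbol, length in pairs)

encoded = rlc_encode("AAAABBBCCD")
print(encoded)
print(rlc_decode(encoded))
```

The gain depends entirely on the source: a string with no repeated neighbors produces one pair per symbol and grows rather than shrinks.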



7.4 Variable-Length Coding (VLC)

Shannon-Fano Algorithm: a top-down approach

1. Sort the symbols according to the frequency count of their occurrences.

2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol.

An example: coding of "HELLO"

Frequency count of the symbols in "HELLO":

Symbol | H | E | L | O
Count  | 1 | 1 | 2 | 1


Huffman Coding

ALGORITHM 7.1 Huffman Coding Algorithm: a bottom-up approach

1. Initialization: put all symbols on a list sorted according to their frequency counts.

2. Repeat until the list has only one symbol left:

(1) From the list, pick two symbols with the lowest frequency counts. Form a Huffman subtree that has these two symbols as child nodes and create a parent node.

(2) Assign the sum of the children's frequency counts to the parent and insert it into the list such that the order is maintained.

(3) Delete the children from the list.

3. Assign a codeword for each leaf based on the path from the root.


Fig. 7.5: Coding Tree for "HELLO" using the Huffman Algorithm.

Huffman Coding (cont'd)

In Fig. 7.5, new symbols P1, P2, P3 are created to refer to the parent nodes in the Huffman coding tree. The contents of the list are illustrated below:

After initialization: L H E O
After iteration (a): L P1 H
After iteration (b): L P2
After iteration (c): P3
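A compact sketch of the bottom-up procedure follows, using a priority queue in place of the sorted list. The function name and the tie-breaking rule (insertion order) are assumptions; ties may be broken differently than in Fig. 7.5, so individual codewords can differ from the figure while the total coded length stays the same. The input is assumed to be non-empty.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol of a non-empty text to its Huffman codeword."""
    counts = Counter(text)
    # Heap entries are (count, tie_breaker, tree); a tree is a symbol or a (left, right) pair.
    heap = [(count, i, symbol) for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)      # the two lowest frequency counts
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (t1, t2)))   # parent carries the summed count
        next_id += 1

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node: descend into both children
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: the path from the root is the codeword
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("HELLO")
total_bits = sum(len(codes[ch]) for ch in "HELLO")
print(codes, total_bits)
```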



Properties of Huffman Coding

1. Unique Prefix Property: no Huffman code is a prefix of any other Huffman code, which precludes any ambiguity in decoding.

2. Optimality: minimum redundancy code, proved optimal for a given data model (i.e., a given, accurate, probability distribution):

• The two least frequent symbols will have the same length for their Huffman codes, differing only at the last bit.

• Symbols that occur more frequently will have shorter Huffman codes than symbols that occur less frequently.

• The average code length for an information source S is strictly less than η + 1. Combined with Eq. (7.5), we have:

$$ \eta \le \bar{l} < \eta + 1 \qquad (7.6) $$


7.7 Lossless Image Compression

• Approaches of Differential Coding of Images:

– Given an original image I(x, y), using a simple difference operator we can define a difference image d(x, y) as follows:

d(x, y) = I(x, y) − I(x − 1, y) (7.9)

or use the discrete version of the 2-D Laplacian operator to define a difference image d(x, y) as

d(x, y) = 4 I(x, y) − I(x, y − 1) − I(x, y + 1) − I(x + 1, y) − I(x − 1, y) (7.10)

• Due to the spatial redundancy that exists in normal images I, the difference image d will have a narrower histogram and hence a smaller entropy, as shown in Fig. 7.9.
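The effect of Eq. (7.9) on the histogram can be demonstrated on a synthetic image. The sketch below is not the "Barb" experiment of Fig. 7.9: the smooth ramp image, the `image[y][x]` indexing, and the decision to keep the first column unchanged are all assumptions made so the example is self-contained.

```python
import math
from collections import Counter

def entropy_of(values):
    """Entropy in bits per value, estimated from the histogram of the values."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def difference_image(image):
    """d(x, y) = I(x, y) - I(x - 1, y); the first column (x = 0) is kept as is."""
    return [[row[x] - row[x - 1] if x > 0 else row[x] for x in range(len(row))]
            for row in image]

def flatten(image):
    return [value for row in image for value in row]

# Synthetic 64 x 64 image whose intensity changes slowly and smoothly.
image = [[2 * x + y for x in range(64)] for y in range(64)]
diff = difference_image(image)

print(entropy_of(flatten(image)))   # broad histogram: many distinct gray levels
print(entropy_of(flatten(diff)))    # narrow histogram: almost every difference is 2
```

Because the first column is retained, the original image can be rebuilt exactly by a running sum along each row, so the transformation itself loses nothing.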



Fig. 7.9: Distributions for Original versus Derivative Images. (a,b): Original gray-level image and its partial derivative image; (c,d): Histograms for original and derivative images. (This figure uses a commonly employed image called "Barb".)


8.1 Introduction

• Lossless compression algorithms do not deliver compression ratios that are high enough. Hence, most multimedia compression algorithms are lossy.

• What is lossy compression?

– The compressed data is not the same as the original data, but a close approximation of it.

– It yields a much higher compression ratio than that of lossless compression.


8.2 Distortion Measures

• The three most commonly used distortion measures in image compression are:

– mean square error (MSE) σ²,

$$ \sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2 \qquad (8.1) $$

where x_n, y_n, and N are the input data sequence, reconstructed data sequence, and length of the data sequence, respectively.

– signal to noise ratio (SNR), in decibel units (dB),

$$ SNR = 10 \log_{10} \frac{\sigma_x^2}{\sigma_d^2} \qquad (8.2) $$

where σ_x² is the average square value of the original data sequence and σ_d² is the MSE.

– peak signal to noise ratio (PSNR),

$$ PSNR = 10 \log_{10} \frac{x_{peak}^2}{\sigma_d^2} \qquad (8.3) $$
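The three measures translate directly into code. The sketch below follows Eqs. (8.1) to (8.3); the function names, the default peak value of 255 (8-bit samples), and the four-sample test sequences are illustrative assumptions. All three functions assume the sequences differ somewhere, since the logarithms are undefined when the MSE is zero.

```python
import math

def mse(x, y):
    """Mean square error between original x and reconstruction y, Eq. (8.1)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def snr_db(x, y):
    """SNR in dB: average squared original value over the MSE, Eq. (8.2)."""
    sigma_x2 = sum(a * a for a in x) / len(x)
    return 10 * math.log10(sigma_x2 / mse(x, y))

def psnr_db(x, y, peak=255):
    """PSNR in dB: squared peak value over the MSE, Eq. (8.3)."""
    return 10 * math.log10(peak ** 2 / mse(x, y))

original = [100, 120, 140, 160]
reconstructed = [101, 119, 141, 159]     # every sample is off by exactly 1

print(mse(original, reconstructed))
print(round(snr_db(original, reconstructed), 2))
print(round(psnr_db(original, reconstructed), 2))
```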


Spatial Frequency and DCT

• Spatial frequency indicates how many times pixel values change across an image block.

• The DCT formalizes this notion with a measure of how much the image contents change in correspondence to the number of cycles of a cosine wave per block.

• The role of the DCT is to decompose the original signal into its DC and AC components; the role of the IDCT is to reconstruct (re-compose) the signal.


Definition of DCT:

Given an input function f(i, j) over two integer variables i and j (a piece of an image), the 2D DCT transforms it into a new function F(u, v), with integer u and v running over the same range as i and j. The general definition of the transform is:

$$ F(u, v) = \frac{2\, C(u)\, C(v)}{\sqrt{MN}} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \cos\frac{(2i+1)\, u \pi}{2M} \cdot \cos\frac{(2j+1)\, v \pi}{2N} \cdot f(i, j) \qquad (8.15) $$

where i, u = 0, 1, . . . , M − 1; j, v = 0, 1, . . . , N − 1; and the constants C(u) and C(v) are determined by

$$ C(\xi) = \begin{cases} \frac{\sqrt{2}}{2} & \text{if } \xi = 0, \\ 1 & \text{otherwise.} \end{cases} \qquad (8.16) $$

2D Discrete Cosine Transform (2D DCT):

$$ F(u, v) = \frac{C(u)\, C(v)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} \cos\frac{(2i+1)\, u \pi}{16} \cos\frac{(2j+1)\, v \pi}{16}\, f(i, j) \qquad (8.17) $$

where i, j, u, v = 0, 1, . . . , 7, and the constants C(u) and C(v) are determined by Eq. (8.16).

2D Inverse Discrete Cosine Transform (2D IDCT):

The inverse function is almost the same, with the roles of f(i, j) and F(u, v) reversed, except that now C(u)C(v) must stand inside the sums:

$$ \tilde{f}(i, j) = \sum_{u=0}^{7} \sum_{v=0}^{7} \frac{C(u)\, C(v)}{4} \cos\frac{(2i+1)\, u \pi}{16} \cos\frac{(2j+1)\, v \pi}{16}\, F(u, v) \qquad (8.18) $$

where i, j, u, v = 0, 1, . . . , 7.
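Eqs. (8.17) and (8.18) can be transcribed almost literally. The sketch below is a direct, unoptimized implementation meant only to make the formulas concrete; the function names and the flat test block are assumptions. For a block of constant value 100, only the DC coefficient F(0, 0) is nonzero, and its value is (1/2)/4 · 64 · 100 = 800.

```python
import math

def C(xi):
    """Normalization constant of Eq. (8.16)."""
    return math.sqrt(2) / 2 if xi == 0 else 1.0

def cos_term(k, n):
    """cos((2k + 1) * n * pi / 16), the cosine factor shared by both transforms."""
    return math.cos((2 * k + 1) * n * math.pi / 16)

def dct2_8x8(f):
    """Forward 8 x 8 DCT, Eq. (8.17); f and the result are indexed [row][column]."""
    return [[C(u) * C(v) / 4 * sum(cos_term(i, u) * cos_term(j, v) * f[i][j]
                                   for i in range(8) for j in range(8))
             for v in range(8)] for u in range(8)]

def idct2_8x8(F):
    """Inverse 8 x 8 DCT, Eq. (8.18); note that C(u)C(v) sits inside the sums."""
    return [[sum(C(u) * C(v) / 4 * cos_term(i, u) * cos_term(j, v) * F[u][v]
                 for u in range(8) for v in range(8))
             for j in range(8)] for i in range(8)]

flat = [[100.0] * 8 for _ in range(8)]
F = dct2_8x8(flat)
restored = idct2_8x8(F)
print(round(F[0][0], 6))
```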


The DCT is a linear transform:

In general, a transform T (or function) is linear, iff

$$ T(\alpha p + \beta q) = \alpha T(p) + \beta T(q) \qquad (8.21) $$

where α and β are constants, and p and q are any functions, variables or constants. From the definition in Eq. 8.17 or 8.19, this property can readily be proven for the DCT because it uses only simple arithmetic operations.


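The Cosine Basis Functions

• Functions B_p(i) and B_q(i) are orthogonal, if

$$ \sum_{i} \left[ B_p(i) \cdot B_q(i) \right] = 0 \quad \text{if } p \ne q \qquad (8.22) $$

• Functions B_p(i) and B_q(i) are orthonormal, if they are orthogonal and

$$ \sum_{i} \left[ B_p(i) \cdot B_q(i) \right] = 1 \quad \text{if } p = q \qquad (8.23) $$

• It can be shown that:

$$ \sum_{i=0}^{7} \left[ \cos\frac{(2i+1)\, p \pi}{16} \cdot \cos\frac{(2i+1)\, q \pi}{16} \right] = 0 \quad \text{if } p \ne q $$

$$ \sum_{i=0}^{7} \left[ \frac{C(p)}{2} \cos\frac{(2i+1)\, p \pi}{16} \cdot \frac{C(q)}{2} \cos\frac{(2i+1)\, q \pi}{16} \right] = 1 \quad \text{if } p = q $$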
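The two identities can be verified numerically for all 64 pairs (p, q). This is a small check rather than a proof; the names `basis` and `inner` are illustrative assumptions.

```python
import math

def C(xi):
    """Normalization constant of Eq. (8.16)."""
    return math.sqrt(2) / 2 if xi == 0 else 1.0

def basis(p):
    """Scaled cosine basis function B_p(i) = C(p)/2 * cos((2i + 1) p pi / 16), i = 0..7."""
    return [C(p) / 2 * math.cos((2 * i + 1) * p * math.pi / 16) for i in range(8)]

def inner(p, q):
    """Sum over i of B_p(i) * B_q(i)."""
    return sum(a * b for a, b in zip(basis(p), basis(q)))

worst = max(abs(inner(p, q) - (1.0 if p == q else 0.0))
            for p in range(8) for q in range(8))
print(worst)   # largest deviation from the ideal 0 / 1 pattern, limited by rounding error
```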


Fig. 8.9: Graphical Illustration of 8 × 8 2D DCT basis.


2D Separable Basis

• The 2D DCT can be separated into a sequence of two 1D DCT steps:

$$ G(i, v) = \frac{1}{2}\, C(v) \sum_{j=0}^{7} \cos\frac{(2j+1)\, v \pi}{16}\, f(i, j) \qquad (8.24) $$

$$ F(u, v) = \frac{1}{2}\, C(u) \sum_{i=0}^{7} \cos\frac{(2i+1)\, u \pi}{16}\, G(i, v) \qquad (8.25) $$

• It is straightforward to see that this simple change saves many arithmetic steps. The number of iterations required is reduced from 8 × 8 to 8 + 8.
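The row-column decomposition of Eqs. (8.24) and (8.25) can be checked against the direct double sum of Eq. (8.17). The sketch below is illustrative: the function names and the arbitrary test block are assumptions, and no attempt is made at a fast implementation.

```python
import math

def C(xi):
    """Normalization constant of Eq. (8.16)."""
    return math.sqrt(2) / 2 if xi == 0 else 1.0

def cos_term(k, n):
    return math.cos((2 * k + 1) * n * math.pi / 16)

def dct2_direct(f):
    """Eq. (8.17): each coefficient sums over all 8 x 8 input samples."""
    return [[C(u) * C(v) / 4 * sum(cos_term(i, u) * cos_term(j, v) * f[i][j]
                                   for i in range(8) for j in range(8))
             for v in range(8)] for u in range(8)]

def dct2_separable(f):
    """Eqs. (8.24) and (8.25): a 1D DCT along each row, then along each column."""
    G = [[0.5 * C(v) * sum(cos_term(j, v) * f[i][j] for j in range(8))
          for v in range(8)] for i in range(8)]
    return [[0.5 * C(u) * sum(cos_term(i, u) * G[i][v] for i in range(8))
             for v in range(8)] for u in range(8)]

block = [[(13 * i + 7 * j) % 32 for j in range(8)] for i in range(8)]   # arbitrary test data
direct = dct2_direct(block)
separable = dct2_separable(block)
max_diff = max(abs(direct[u][v] - separable[u][v]) for u in range(8) for v in range(8))
print(max_diff)
```

Each output of the separable version needs two sums of 8 terms, with the intermediate G shared across a whole column of outputs, instead of one sum of 64 terms.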


9.1 The JPEG Standard

• JPEG is an image compression standard that was developed by the "Joint Photographic Experts Group". JPEG was formally accepted as an international standard in 1992.

• JPEG is a lossy image compression method. It employs a transform coding method using the DCT (Discrete Cosine Transform).

• An image is a function of i and j (or conventionally x and y) in the spatial domain. The 2D DCT is used as one step in JPEG in order to yield a frequency response which is a function F(u, v) in the spatial frequency domain, indexed by two integers u and v.


Observations for JPEG Image Compression

• The effectiveness of the DCT transform coding method in JPEG relies on 3 major observations:

Observation 1: Useful image contents change relatively slowly across the image, i.e., it is unusual for intensity values to vary widely several times in a small area, for example, within an 8 × 8 image block.

• Much of the information in an image is repeated, hence "spatial redundancy".

Observation 2: Psychophysical experiments suggest that humans are much less likely to notice the loss of very high spatial frequency components than the loss of lower frequency components.

• The spatial redundancy can be reduced by largely reducing the high spatial frequency contents.

Observation 3: Visual acuity (accuracy in distinguishing closely spaced lines) is much greater for gray ("black and white") than for color.

• Chroma subsampling (4:2:0) is used in JPEG.


Fig. 9.1: Block diagram for JPEG encoder.


9.1.1 Main Steps in JPEG Image Compression

• Transform RGB to YIQ or YUV and subsample color.

• DCT on image blocks.

• Quantization.

• Zig-zag ordering and run-length encoding.

• Entropy coding.


DCT on image blocks

• Each image is divided into 8 × 8 blocks. The 2D DCT is applied to each block image f(i, j), with output being the DCT coefficients F(u, v) for each block.

• Using blocks, however, has the effect of isolating each block from its neighboring context. This is why JPEG images look choppy ("blocky") when a high compression ratio is specified by the user.


Quantization

$$ \hat{F}(u, v) = \mathrm{round}\left( \frac{F(u, v)}{Q(u, v)} \right) \qquad (9.1) $$

• F(u, v) represents a DCT coefficient, Q(u, v) is a "quantization matrix" entry, and \hat{F}(u, v) represents the quantized DCT coefficients which JPEG will use in the succeeding entropy coding.

– The quantization step is the main source of loss in JPEG compression.

– The entries of Q(u, v) tend to have larger values towards the lower right corner. This aims to introduce more loss at the higher spatial frequencies, a practice supported by Observations 1 and 2.

– Tables 9.1 and 9.2 show the default Q(u, v) values obtained from psychophysical studies with the goal of maximizing the compression ratio while minimizing perceptual losses in JPEG images.


Table 9.1: The Luminance Quantization Table

16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

Table 9.2: The Chrominance Quantization Table

17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99


Fig. 9.2: JPEG compression for a smooth image block.

An 8 × 8 block from the Y image of "Lena", f(i, j):

200 202 189 188 189 175 175 175
200 203 198 188 189 182 178 175
203 200 200 195 200 187 185 175
200 200 200 200 197 187 187 187
200 205 200 200 195 188 187 175
200 200 200 200 200 190 187 175
205 200 199 200 191 187 187 175
210 200 200 200 188 185 187 186

Its DCT coefficients, F(u, v):

515  65 -12   4   1   2  -8   5
-16   3   2   0   0 -11  -2   3
-12   6  11  -1   3   0   1  -2
 -8   3  -4   2  -2  -3  -5  -2
  0  -2   7  -5   4   0  -1  -4
  0  -3  -1   0   4   1  -1   0
  3  -2  -3   3   3  -1  -1   3
 -2   5  -2   4  -2   2  -3   0

9.1.2 Four Commonly Used JPEG Modes

• Sequential Mode: the default JPEG mode, implicitly assumed in the discussions so far. Each gray-level image or color image component is encoded in a single left-to-right, top-to-bottom scan.

• Progressive Mode.

• Hierarchical Mode.

• Lossless Mode.


9.1.3 A Glance at the JPEG Bitstream


Fig. 9.6: JPEG bitstream.


9.2 The JPEG2000 Standard
• Design Goals:
– To provide a better rate-distortion tradeoff and improved subjective image quality.
– To provide additional functionalities lacking in the current JPEG standard.

• The JPEG2000 standard addresses the following problems:
– Lossless and Lossy Compression: There is currently no standard that can provide superior lossless compression and lossy compression in a single bitstream.


– Low Bit-rate Compression: The current JPEG standard offers excellent rate-distortion performance at mid and high bit-rates. However, at bit-rates below 0.25 bpp, subjective distortion becomes unacceptable. This is important if we hope to receive images on our web-enabled ubiquitous devices, such as web-aware wristwatches.

– Large Images: The new standard will allow image resolutions greater than 64K by 64K without tiling. It can handle image sizes up to 2^32 − 1.

– Single Decompression Architecture: The current JPEG standard has 44 modes, many of which are application specific and not used by the majority of JPEG decoders.


– Transmission in Noisy Environments: The new standard will provide improved error resilience for transmission in noisy environments such as wireless networks and the Internet.

– Progressive Transmission: The new standard provides seamless quality and resolution scalability from low to high bit-rate. The target bit-rate and reconstruction resolution need not be known at the time of compression.

– Region of Interest Coding: The new standard allows the specification of Regions of Interest (ROI), which can be coded with higher quality than the rest of the image. One might like to code the face of a speaker with more quality than the surrounding furniture.


– Computer Generated Imagery: The current JPEG standard is optimized for natural imagery and does not perform well on computer generated imagery.

– Compound Documents: The new standard offers metadata mechanisms for incorporating additional non-image data as part of the file. This might be useful for including text along with imagery, as one important example.

• In addition, JPEG2000 is able to handle up to 256 channels of information, whereas the current JPEG standard is only able to handle three color channels.


Properties of JPEG2000 Image Compression
• Uses the Embedded Block Coding with Optimized Truncation (EBCOT) algorithm, which partitions each subband (LL, LH, HL, HH) produced by the wavelet transform into small blocks called “code blocks”.

• A separate scalable bitstream is generated for each code block ⇒ improved error resilience.

Fig. 9.7: Code block structure of EBCOT.


Main Steps of JPEG2000 Image Compression
• Embedded block coding and bitstream generation.

• Post compression rate distortion (PCRD) optimization.

• Layer formation and representation.


Region of Interest Coding in JPEG2000
• Goal:
– Particular regions of the image may contain important information, and thus should be coded with better quality than others.

• Usually implemented using the MAXSHIFT method, which scales up the coefficients within the ROI so that they are placed into higher bit-planes.

• During the embedded coding process, the resulting bits are placed in front of the non-ROI part of the image. Therefore, given a reduced bit-rate, the ROI will be decoded and refined before the rest of the image.
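The scaling idea behind MAXSHIFT can be sketched as follows. This is a simplified illustration, not the codec itself: it assumes a flat list of integer (already quantized) coefficients and a parallel Boolean ROI mask, and it picks the shift s as the smallest value with 2^s larger than every background magnitude. The function names are hypothetical.

```python
def maxshift_encode(coeffs, roi_mask):
    """Shift ROI coefficients up by s bit-planes so they sit above all background bits."""
    background = [abs(c) for c, in_roi in zip(coeffs, roi_mask) if not in_roi]
    max_bg = max(background, default=0)
    s = max_bg.bit_length()  # smallest s such that 2**s > max_bg
    shifted = [c * (1 << s) if in_roi else c
               for c, in_roi in zip(coeffs, roi_mask)]
    return shifted, s


def maxshift_decode(shifted, s):
    """Any magnitude >= 2**s can only come from the ROI, so no mask is needed here."""
    restored = []
    for c in shifted:
        if abs(c) >= (1 << s):
            sign = -1 if c < 0 else 1
            restored.append(sign * (abs(c) >> s))
        else:
            restored.append(c)
    return restored


coeffs = [13, -7, 3, 0, -2, 9]
roi_mask = [False, True, True, False, False, True]
shifted, s = maxshift_encode(coeffs, roi_mask)
print(s, shifted)
print(maxshift_decode(shifted, s))
```

Because every non-zero ROI magnitude now exceeds every background magnitude, a bit-plane coder that works from the most significant plane downward reaches the ROI bits first, which matches the ordering described above.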


Fig. 9.11: Region of interest (ROI) coding of an image using a circularly shaped ROI. (a) 0.4 bpp, (b) 0.5 bpp, (c) 0.6 bpp, and (d) 0.7 bpp.


Fig. 9.12: Performance comparison for JPEG and JPEG2000 on different image types. (a): Natural images.



Fig. 9.13: Comparison of JPEG and JPEG2000. (a) Original image.

Fig. 9.13 (Cont’d): Comparison of JPEG and JPEG2000. (b) JPEG (left) and JPEG2000 (right) images compressed at 0.75 bpp. (c) JPEG (left) and JPEG2000 (right) images compressed at 0.25 bpp.


9.3 The JPEG-LS Standard
• JPEG-LS is the current ISO/ITU standard for lossless or “near lossless” compression of continuous tone images.
• It is part of a larger ISO effort aimed at better compression of medical images.
• Uses the LOCO-I (LOw COmplexity LOssless COmpression for Images) algorithm proposed by Hewlett-Packard.
• Motivated by the observation that complexity reduction is often more important than small increases in compression offered by more complex algorithms.

Main Advantage: Low complexity!


10.1 Introduction to Video Compression

• A video consists of a time-ordered sequence of frames, i.e., images.

• An obvious solution to video compression would be predictive coding based on previous frames. Compression proceeds by subtracting images: subtract in time order and code the residual error.

• It can be done even better by searching for just the right parts of the image to subtract from the previous frame.


10.2 Video Compression with Motion Compensation

• Consecutive frames in a video are similar: temporal redundancy exists.

• Temporal redundancy is exploited so that not every frame of the video needs to be coded independently as a new image. Instead, the difference between the current frame and other frame(s) in the sequence is coded; it has small values and low entropy, which is good for compression.

• Steps of video compression based on Motion Compensation (MC):

1. Motion Estimation (motion vector search).

2. MC-based Prediction.

3. Derivation of the prediction error, i.e., the difference.


Motion Compensation

• Each image is divided into macroblocks of size N x N.
- By default, N = 16 for luminance images. For chrominance images, N = 8 if 4:2:0 chroma subsampling is adopted.
• Motion compensation is performed at the macroblock level.
- The current image frame is referred to as the Target frame.
- A match is sought between the macroblock in the Target frame and the most similar macroblock in previous and/or future frame(s) (referred to as Reference frame(s)).
- The displacement of the reference macroblock to the target macroblock is called a motion vector MV.
- Figure 10.1 shows the case of forward prediction, in which the Reference frame is taken to be a previous frame.


• MV search is usually limited to a small immediate neighborhood: both horizontal and vertical displacements are in the range [−p, p]. This makes a search window of size (2p + 1) x (2p + 1).

Fig. 10.1: Macroblocks and Motion Vector in Video Compression.


10.3 Search for Motion Vectors

• The difference between two macroblocks can then be measured by their Mean Absolute Difference (MAD):

MAD(i, j) = \frac{1}{N^2} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \left| C(x + k, y + l) - R(x + i + k, y + j + l) \right|    (10.1)

N: size of the macroblock,
k and l: indices for pixels in the macroblock,
i and j: horizontal and vertical displacements,
C(x + k, y + l): pixels in the macroblock in the Target frame,
R(x + i + k, y + j + l): pixels in the macroblock in the Reference frame.

• The goal of the search is to find a vector (i, j) as the motion vector MV = (u, v), such that MAD(i, j) is minimum:

(u, v) = [ (i, j) | MAD(i, j) is minimum, i ∈ [−p, p], j ∈ [−p, p] ]    (10.2)


Sequential Search

• Sequential search: sequentially search the whole (2p + 1) x (2p + 1) window in the Reference frame (also referred to as Full search).
- A macroblock centered at each of the positions within the window is compared to the macroblock in the Target frame pixel by pixel, and their respective MAD is then derived using Eq. (10.1).
- The vector (i, j) that offers the least MAD is designated as the MV (u, v) for the macroblock in the Target frame.
- The sequential search method is very costly: assuming each pixel comparison requires three operations (subtraction, absolute value, addition), the cost for obtaining a motion vector for a single macroblock is (2p + 1) · (2p + 1) · N² · 3 ⇒ O(p² N²). For p = 15 and N = 16 this is 31 · 31 · 256 · 3 = 738,048 operations.


PROCEDURE 10.1 Motion-vector: sequential-search

begin
  min_MAD = LARGE_NUMBER;   /* Initialization */
  for i = −p to p
    for j = −p to p
    {
      cur_MAD = MAD(i, j);
      if cur_MAD < min_MAD
      {
        min_MAD = cur_MAD;
        u = i;   /* Get the coordinates for MV. */
        v = j;
      }
    }
end
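Procedure 10.1 can be sketched in Python as follows. This is a minimal sketch rather than a codec implementation: frames are assumed to be lists of rows indexed as frame[y][x], (x, y) is taken as the top-left corner of the macroblock as in Eq. (10.1), candidates that fall outside the Reference frame are skipped (an assumed boundary policy), and the names mad and sequential_search are illustrative.

```python
def mad(target, ref, x, y, i, j, n):
    """Eq. (10.1): mean absolute difference for displacement (i, j)."""
    total = 0
    for k in range(n):          # k: horizontal index inside the macroblock
        for l in range(n):      # l: vertical index inside the macroblock
            total += abs(target[y + l][x + k] - ref[y + j + l][x + i + k])
    return total / (n * n)


def sequential_search(target, ref, x, y, n, p):
    """Full search of the (2p + 1) x (2p + 1) window; returns ((u, v), min MAD)."""
    height, width = len(ref), len(ref[0])
    min_mad = float("inf")
    u = v = 0
    for i in range(-p, p + 1):
        for j in range(-p, p + 1):
            inside = (0 <= x + i and x + i + n <= width
                      and 0 <= y + j and y + j + n <= height)
            if not inside:
                continue
            cur = mad(target, ref, x, y, i, j, n)
            if cur < min_mad:
                min_mad, u, v = cur, i, j
    return (u, v), min_mad


# Toy data: a 16 x 16 Reference frame with a distinct value at every pixel,
# and a Target frame whose content is the Reference displaced by (-2, 1).
SIZE = 16
ref = [[SIZE * row + col for col in range(SIZE)] for row in range(SIZE)]
target = [[ref[(row + 1) % SIZE][(col - 2) % SIZE] for col in range(SIZE)]
          for row in range(SIZE)]

mv, err = sequential_search(target, ref, x=6, y=6, n=4, p=3)
print(mv, err)
```

With this toy data the search returns the displacement (−2, 1) with a MAD of zero, since every pixel value in the Reference frame is distinct and only one candidate can match exactly.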


2D Logarithmic Search

• Logarithmic search: a cheaper version that is suboptimal but still usually effective.

• The procedure for 2D Logarithmic Search of motion vectors takes several iterations and is akin to a binary search:
- As illustrated in Fig. 10.2, initially only nine locations in the search window are used as seeds for a MAD-based search; they are marked as ‘1’.
- After the one that yields the minimum MAD is located, the center of the new search region is moved to it and the step-size (“offset”) is reduced to half.
- In the next iteration, the nine new locations are marked as ‘2’, and so on.
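The iteration described above might be sketched as below. Several details are assumptions made for the sketch and are not stated on the slide: the initial offset is taken as ⌈p/2⌉, the loop stops after the pass with offset 1, candidates outside [−p, p] or outside the frame are ignored, and frames are lists of rows indexed as frame[y][x] with (x, y) the top-left corner of the macroblock.

```python
import math


def mad(target, ref, x, y, i, j, n):
    """Eq. (10.1); returns infinity when the candidate leaves the Reference frame."""
    height, width = len(ref), len(ref[0])
    if not (0 <= x + i and x + i + n <= width
            and 0 <= y + j and y + j + n <= height):
        return float("inf")
    total = sum(abs(target[y + l][x + k] - ref[y + j + l][x + i + k])
                for k in range(n) for l in range(n))
    return total / (n * n)


def log_search(target, ref, x, y, n, p):
    """2D logarithmic search: nine candidates per pass, offset halved each pass."""
    cu, cv = 0, 0                          # current center of the search region
    offset = max(1, math.ceil(p / 2))      # assumed initial step size
    while True:
        best = None
        for di in (-offset, 0, offset):
            for dj in (-offset, 0, offset):
                i, j = cu + di, cv + dj
                if abs(i) > p or abs(j) > p:
                    continue
                cur = mad(target, ref, x, y, i, j, n)
                if best is None or cur < best[0]:
                    best = (cur, i, j)
        best_mad, cu, cv = best            # move the center to the best candidate
        if offset == 1:
            break
        offset = math.ceil(offset / 2)
    return (cu, cv), best_mad


# Toy data: the true displacement (2, -2) lies on the first coarse grid,
# so the search is guaranteed to land on it here.
SIZE = 16
ref = [[SIZE * row + col for col in range(SIZE)] for row in range(SIZE)]
target = [[ref[(row - 2) % SIZE][(col + 2) % SIZE] for col in range(SIZE)]
          for row in range(SIZE)]

mv, err = log_search(target, ref, x=6, y=6, n=4, p=3)
print(mv, err)
```

The example is deliberately easy. In general only a subset of the window is visited, so the result can differ from the full-search minimum, which is the sense in which the method is suboptimal.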


Hierarchical Search

• The search can benefit from a hierarchical (multiresolution) approach in which an initial estimate of the motion vector is obtained from images with a significantly reduced resolution.

• Figure 10.3: a three-level hierarchical search in which the original image is at Level 0, images at Levels 1 and 2 are obtained by down-sampling from the previous levels by a factor of 2, and the initial search is conducted at Level 2. Since the size of the macroblock is smaller and p can also be proportionally reduced, the number of operations required is greatly reduced.

Fig. 10.3: A Three-level Hierarchical Search for Motion Vectors.


Hierarchical Search (Cont'd)

• Given the estimated motion vector (u^k, v^k) at Level k, a 3 x 3 neighborhood centered at (2 ∙ u^k, 2 ∙ v^k) at Level k − 1 is searched for the refined motion vector.

• The refinement is such that at Level k − 1 the motion vector (u^{k−1}, v^{k−1}) satisfies:

(2u^k − 1 ≤ u^{k−1} ≤ 2u^k + 1, 2v^k − 1 ≤ v^{k−1} ≤ 2v^k + 1)

• Let (x_0^k, y_0^k) denote the center of the macroblock at Level k in the Target frame. The hierarchical motion vector search for the macroblock centered at (x_0^0, y_0^0) in the Target frame starts with a search at the coarsest level and applies this refinement level by level down to Level 0.
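A possible sketch of the three-level search is given below. It is not the textbook procedure verbatim and rests on stated assumptions: down-sampling simply keeps every other pixel (no low-pass filter), (x0, y0) is used as the top-left corner of the macroblock rather than its center, the coarsest level uses a full search with p scaled down by the same factor as the image, and all names are illustrative.

```python
import math


def mad(target, ref, x, y, i, j, n):
    """Eq. (10.1); returns infinity when the candidate leaves the Reference frame."""
    height, width = len(ref), len(ref[0])
    if not (0 <= x + i and x + i + n <= width
            and 0 <= y + j and y + j + n <= height):
        return float("inf")
    total = sum(abs(target[y + l][x + k] - ref[y + j + l][x + i + k])
                for k in range(n) for l in range(n))
    return total / (n * n)


def downsample(frame):
    """Halve the resolution by keeping every other pixel (assumed method)."""
    return [row[::2] for row in frame[::2]]


def hierarchical_search(target, ref, x0, y0, n, p, levels=3):
    # Build the image pyramids: index 0 is the original, higher indices are coarser.
    t_pyr, r_pyr = [target], [ref]
    for _ in range(levels - 1):
        t_pyr.append(downsample(t_pyr[-1]))
        r_pyr.append(downsample(r_pyr[-1]))

    def best(k, candidates):
        scale = 2 ** k
        return min(candidates,
                   key=lambda d: mad(t_pyr[k], r_pyr[k], x0 // scale,
                                     y0 // scale, d[0], d[1], n // scale))

    top = levels - 1
    pk = math.ceil(p / 2 ** top)           # reduced search range at the top level
    u, v = best(top, [(i, j) for i in range(-pk, pk + 1)
                      for j in range(-pk, pk + 1)])
    for k in range(top - 1, -1, -1):       # refine in a 3 x 3 neighborhood of (2u, 2v)
        u, v = best(k, [(2 * u + di, 2 * v + dj)
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)])
    return u, v


# Toy data: 32 x 32 frames, Target content displaced by (4, -4).
SIZE = 32
ref = [[SIZE * row + col for col in range(SIZE)] for row in range(SIZE)]
target = [[ref[(row - 4) % SIZE][(col + 4) % SIZE] for col in range(SIZE)]
          for row in range(SIZE)]

mv = hierarchical_search(target, ref, x0=12, y0=12, n=8, p=4)
print(mv)
```

In this example the estimate goes from (1, −1) at Level 2 to (2, −2) at Level 1 and (4, −4) at Level 0, evaluating 9 candidates per level instead of the 81 positions of a full search at Level 0.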


10.4 H.261

• H.261: an earlier digital video compression standard; its principle of MC-based compression is retained in all later video compression standards.
- The standard was designed for videophone, video conferencing and other audiovisual services over ISDN.
- The video codec supports bit-rates of p x 64 kbps, where p ranges from 1 to 30 (hence also known as p * 64).
- It requires that the delay of the video encoder be less than 150 msec so that the video can be used for real-time bidirectional video conferencing.


ITU Recommendations & H.261 Video Formats

• H.261 belongs to the following set of ITU recommendations for visual telephony systems:

1. H.221: Frame structure for an audiovisual channel supporting 64 to 1,920 kbps.
2. H.230: Frame control signals for audiovisual systems.
3. H.242: Audiovisual communication protocols.
4. H.261: Video encoder/decoder for audiovisual services at p x 64 kbps.
5. H.320: Narrow-band audiovisual terminal equipment for p x 64 kbps transmission.


Table 10.2 Video Formats Supported by H.261



Fig. 10.4: H.261 Frame Sequence.



H.261 Frame Sequence

• Two types of image frames are defined: Intra-frames (I-frames) and Inter-frames (P-frames):
- I-frames are treated as independent images. A transform coding method similar to JPEG is applied within each I-frame, hence “Intra”.
- P-frames are not independent: they are coded by a forward predictive coding method (prediction from a previous P-frame is allowed, not just from a previous I-frame).
- Temporal redundancy removal is included in P-frame coding, whereas I-frame coding performs only spatial redundancy removal.
- To avoid propagation of coding errors, an I-frame is usually sent a couple of times in each second of the video.

• Motion vectors in H.261 are always measured in units of full pixels and have a limited range of ± 15 pixels, i.e., p = 15.


Intra-frame (I-frame) Coding

Fig. 10.5: I-frame Coding.

• Macroblocks are of size 16 x 16 pixels for the Y frame, and 8 x 8 for the Cb and Cr frames, since 4:2:0 chroma subsampling is employed. A macroblock consists of four Y, one Cb, and one Cr 8 x 8 blocks.

• For each 8 x 8 block a DCT transform is applied; the DCT coefficients then go through quantization, zigzag scan, and entropy coding.
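The zigzag scan step can be illustrated with a short sketch. It assumes the conventional JPEG-style zigzag order and a block stored as block[row][col]; the function names are illustrative only.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zigzag order."""
    def key(pos):
        row, col = pos
        diagonal = row + col
        # Odd anti-diagonals are walked downward (row increasing),
        # even ones upward (column increasing).
        return (diagonal, row if diagonal % 2 else col)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def zigzag_scan(block):
    """Flatten a square block into a 1D list following the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Label each position with its raster index so the scan order is visible.
block = [[8 * r + c for c in range(8)] for r in range(8)]
scanned = zigzag_scan(block)
print(scanned[:10])
```

The scan visits low-frequency coefficients first, so after quantization the non-zero values cluster at the front of the list and the trailing zeros form long runs that entropy coding handles cheaply.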


Inter-frame (P-frame) Predictive Coding

• Figure 10.6 shows the H.261 P-frame coding scheme based on motion compensation:
- For each macroblock in the Target frame, a motion vector is allocated by one of the search methods discussed earlier.
- After the prediction, a difference macroblock is derived to measure the prediction error.
- Each of the 8 x 8 blocks of this difference macroblock goes through the DCT, quantization, zigzag scan, and entropy coding procedures.


• The P-frame coding encodes the difference macroblock (not the Target macroblock itself).

• Sometimes, a good match cannot be found, i.e., the prediction error exceeds a certain acceptable level.
- The MB itself is then encoded (treated as an Intra MB), and in this case it is termed a non-motion compensated MB.

• For a motion vector, the difference MVD is sent for entropy coding:

MVD = MV_Preceding − MV_Current    (10.3)
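Eq. (10.3) can be illustrated with a small sketch. The predictor used for the first macroblock is assumed here to be (0, 0), which the slide does not specify, and the function names are hypothetical.

```python
def encode_mvd(motion_vectors):
    """Eq. (10.3): MVD = MV_preceding - MV_current, component by component."""
    preceding = (0, 0)  # assumed predictor for the first macroblock
    mvds = []
    for mv in motion_vectors:
        mvds.append((preceding[0] - mv[0], preceding[1] - mv[1]))
        preceding = mv
    return mvds


def decode_mvd(mvds):
    """Invert Eq. (10.3): MV_current = MV_preceding - MVD."""
    preceding = (0, 0)
    motion_vectors = []
    for d in mvds:
        mv = (preceding[0] - d[0], preceding[1] - d[1])
        motion_vectors.append(mv)
        preceding = mv
    return motion_vectors


mvs = [(3, -1), (4, -1), (4, 0), (2, 0)]
mvds = encode_mvd(mvs)
print(mvds)
print(decode_mvd(mvds))
```

Neighboring macroblocks tend to move together, so the differences are mostly small values, which are cheaper to entropy code than the motion vectors themselves.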


Fig. 10.6: H.261 P-frame Coding Based on Motion Compensation.


11.1 Overview
• MPEG: Moving Pictures Experts Group, established in 1988 for the development of digital video.

• It is appropriately recognized that proprietary interests need to be maintained within the family of MPEG standards:

– This is accomplished by defining only a compressed bitstream that implicitly defines the decoder.

– The compression algorithms, and thus the encoders, are completely up to the manufacturers.


11.2 MPEG-1
• MPEG-1 adopts the CCIR601 digital TV format, also known as SIF (Source Input Format).
• MPEG-1 supports only non-interlaced video. Normally, its picture resolution is:

– 352 × 240 for NTSC video at 30 fps
– 352 × 288 for PAL video at 25 fps
– It uses 4:2:0 chroma subsampling

• The MPEG-1 standard is also referred to as ISO/IEC 11172. It has five parts: 11172-1 Systems, 11172-2 Video, 11172-3 Audio, 11172-4 Conformance, and 11172-5 Software.


Motion Compensation in MPEG-1
• Motion Compensation (MC) based video encoding in H.261 works as follows:

– In Motion Estimation (ME), each macroblock (MB) of the Target P-frame is assigned a best matching MB from the previously coded I or P frame: the prediction.

– Prediction error: the difference between the MB and its matching MB, sent to the DCT and its subsequent encoding steps.

– The prediction is from a previous frame: forward prediction.


Fig 11.1: The Need for Bidirectional Search.

The MB containing part of a ball in the Target frame cannot find a good matching MB in the previous frame because half of the ball was occluded by another object. A match, however, can readily be obtained from the next frame.


Motion Compensation in MPEG-1 (Cont’d)
• MPEG introduces a third frame type, B-frames, and its accompanying bi-directional motion compensation.
• The MC-based B-frame coding idea is illustrated in Fig. 11.2:

– Each MB from a B-frame will have up to two motion vectors (MVs) (one from the forward and one from the backward prediction).

– If matching in both directions is successful, then two MVs will be sent and the two corresponding matching MBs are averaged (indicated by ‘%’ in the figure) before comparing to the Target MB for generating the prediction error.

– If an acceptable match can be found in only one of the reference frames, then only one MV and its corresponding MB will be used from either the forward or backward prediction.
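The averaging rule for B-frame macroblocks might look like the sketch below. Macroblocks are assumed to be small lists of rows, the average is rounded to the nearest integer (an assumed rounding rule), and the function names are illustrative.

```python
def predict(fwd_mb=None, bwd_mb=None):
    """Average both matches when available, otherwise use the single match."""
    if fwd_mb is not None and bwd_mb is not None:
        return [[(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
                for f_row, b_row in zip(fwd_mb, bwd_mb)]
    if fwd_mb is not None:
        return [row[:] for row in fwd_mb]
    if bwd_mb is not None:
        return [row[:] for row in bwd_mb]
    raise ValueError("at least one matching macroblock is required")


def prediction_error(target_mb, fwd_mb=None, bwd_mb=None):
    """Difference between the Target MB and its (possibly averaged) prediction."""
    pred = predict(fwd_mb, bwd_mb)
    return [[t - q for t, q in zip(t_row, p_row)]
            for t_row, p_row in zip(target_mb, pred)]


# A 2 x 2 toy "macroblock" keeps the arithmetic easy to follow.
target_mb = [[100, 102], [98, 101]]
fwd_mb = [[96, 100], [97, 99]]
bwd_mb = [[104, 103], [99, 104]]

print(prediction_error(target_mb, fwd_mb, bwd_mb))
print(prediction_error(target_mb, fwd_mb=fwd_mb))
```

In this toy case the averaged prediction leaves a much smaller residual than the forward match alone, which is the benefit bidirectional prediction aims for.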


Fig 11.2: B-frame Coding Based on Bidirectional Motion Compensation.

Fig 11.3: MPEG Frame Sequence.

Fig 11.5: Layers of MPEG-1 Video Bitstream.


11.3 MPEG-2
• MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps.
• Defines seven profiles aimed at different applications:

– Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2, Multiview.

– Within each profile, up to four levels are defined (Table 11.5).
– The DVD video specification allows only four display resolutions: 720×480, 704×480, 352×480, and 352×240, a restricted form of the MPEG-2 Main profile at the Main and Low levels.


Table 11.5: Profiles and Levels in MPEG-2

Level      | Simple | Main | SNR Scalable | Spatially Scalable | High | 4:2:2 | Multiview
High       |        | *    |              |                    | *    |       |
High 1440  |        | *    |              | *                  | *    |       |
Main       | *      | *    | *            |                    | *    | *     | *
Low        |        | *    | *            |                    |      |       |

Table 11.6: Four Levels in the Main Profile of MPEG-2

Level      | Max. Resolution | Max fps | Max pixels/sec | Max coded Data Rate (Mbps) | Application
High       | 1,920 × 1,152   | 60      | 62.7 × 10^6    | 80                         | film production
High 1440  | 1,440 × 1,152   | 60      | 47.0 × 10^6    | 60                         | consumer HDTV
Main       | 720 × 576       | 30      | 10.4 × 10^6    | 15                         | studio TV
Low        | 352 × 288       | 30      | 3.0 × 10^6     | 4                          | consumer tape equiv.


Supporting Interlaced Video
• MPEG-2 must support interlaced video as well, since this is one of the options for digital broadcast TV and HDTV.
• In interlaced video each frame consists of two fields, referred to as the top-field and the bottom-field.

– In a Frame-picture, all scanlines from both fields are interleaved to form a single frame, which is then divided into 16×16 macroblocks and coded using MC.

– If each field is treated as a separate picture, then it is called a Field-picture.
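The relation between a Frame-picture and its two fields can be shown with a tiny sketch. It assumes that the top-field holds the even-numbered scanlines (counting from 0) and the bottom-field the odd-numbered ones; the function names are illustrative.

```python
def split_fields(frame):
    """Separate an interlaced frame into (top_field, bottom_field)."""
    return frame[0::2], frame[1::2]


def interleave_fields(top_field, bottom_field):
    """Rebuild the Frame-picture by alternating scanlines from both fields."""
    frame = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame


# A 16-line "macroblock column": each scanline is labeled with its line number.
frame = [[line] * 16 for line in range(16)]
top, bottom = split_fields(frame)
print(len(top), len(bottom))
print([row[0] for row in top])
```

Splitting a 16 × 16 macroblock this way yields two 16 × 8 parts, one per field, which is the unit that field prediction in a Frame-picture operates on.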


Fig. 11.6: Field pictures and Field-prediction for Field-pictures in MPEG-2. (a) Frame-picture vs. Field-pictures, (b) Field Prediction for Field-pictures.


Five Modes of Predictions
• MPEG-2 defines Frame Prediction and Field Prediction as well as five prediction modes:

1. Frame Prediction for Frame-pictures: Identical to MPEG-1 MC-based prediction methods in both P-frames and B-frames.

2. Field Prediction for Field-pictures: A macroblock size of 16 × 16 from Field-pictures is used. For details, see Fig. 11.6(b).


3. Field Prediction for Frame-pictures: The top-field and bottom-field of a Frame-picture are treated separately. Each 16 × 16 macroblock (MB) from the target Frame-picture is split into two 16 × 8 parts, each coming from one field. Field prediction is carried out for these 16 × 8 parts in a manner similar to that shown in Fig. 11.6(b).

4. 16×8 MC for Field-pictures: Each 16×16 macroblock (MB) from the target Field-picture is split into top and bottom 16 × 8 halves. Field prediction is performed on each half. This generates two motion vectors for each 16×16 MB in the P-Field-picture, and up to four motion vectors for each MB in the B-Field-picture. This mode is good for a finer MC when motion is rapid and irregular.


5. Dual-Prime for P-pictures: First, Field prediction from each previous field with the same parity (top or bottom) is made. Each motion vector mv is then used to derive a calculated motion vector cv in the field with the opposite parity, taking into account the temporal scaling and vertical shift between lines in the top and bottom fields. For each MB the pair mv and cv yields two preliminary predictions. Their prediction errors are averaged and used as the final prediction error. This mode mimics B-picture prediction for P-pictures without adopting backward prediction (and hence with less encoding delay). This is the only mode that can be used for either Frame-pictures or Field-pictures.


Alternate Scan and Field DCT
• Techniques aimed at improving the effectiveness of the DCT on prediction errors, only applicable to Frame-pictures in interlaced videos:

– Due to the nature of interlaced video, the consecutive rows in the 8×8 blocks are from different fields, so there is less correlation between them than between the alternate rows.

– Alternate scan recognizes the fact that in interlaced video the vertically higher spatial frequency components may have larger magnitudes, and thus allows them to be scanned earlier in the sequence.

• In MPEG-2, Field_DCT can also be used to address the same issue.


Fig 11.7: Zigzag and Alternate Scans of DCT Coefficients for Progressive and Interlaced Videos in MPEG‐2.



12.1 Overview of MPEG-4
• MPEG-4: a newer standard. Besides compression, it pays great attention to issues about user interactivities.
• MPEG-4 departs from its predecessors in adopting a new object-based coding:

– Offering a higher compression ratio, also beneficial for digital video composition, manipulation, indexing, and retrieval.

– Figure 12.1 illustrates how MPEG-4 videos can be composed and manipulated by simple operations on the visual objects.

• The bit-rate for MPEG-4 video now covers a large range, between 5 kbps and 10 Mbps.


Fig. 12.1: Composition and Manipulation of MPEG-4 Videos.


Overview of MPEG-4 (Cont’d)
• MPEG-4 (Fig. 12.2(b)) is an entirely new standard for:

(a) Composing media objects to create desirable audiovisual scenes.

(b) Multiplexing and synchronizing the bitstreams for these media data entities so that they can be transmitted with guaranteed Quality of Service (QoS).

(c) Interacting with the audiovisual scene at the receiving end. MPEG-4 provides a toolbox of advanced coding modules and algorithms for audio and video compression.


Fig. 12.2: Comparison of interactivities in MPEG standards: (a) reference models in MPEG-1 and 2 (interaction in dashed lines supported only by MPEG-2); (b) MPEG-4 reference model.


Overview of MPEG-4 (Cont’d)
• The hierarchical structure of MPEG-4 visual bitstreams is very different from that of MPEG-1 and -2; it is very much video object-oriented.

Fig. 12.3: Video Object Oriented Hierarchical Description of a Scene in MPEG-4 Visual Bitstreams.


Overview of MPEG-4 (Cont’d)
1. Video-object Sequence (VS): delivers the complete MPEG-4 visual scene, which may contain 2-D or 3-D natural or synthetic objects.
2. Video Object (VO): a particular object in the scene, which can be of arbitrary (non-rectangular) shape corresponding to an object or background of the scene.
3. Video Object Layer (VOL): facilitates a way to support (multi-layered) scalable coding. A VO can have multiple VOLs under scalable coding, or have a single VOL under non-scalable coding.
4. Group of Video Object Planes (GOV): groups Video Object Planes together (optional level).
5. Video Object Plane (VOP): a snapshot of a VO at a particular moment.


12.2 Object‐based Visual Coding in MPEG‐4

VOP‐based vs. Frame‐based Coding

• MPEG‐1 and ‐2 do not support the VOP concept, and hence their coding method is referred to as frame‐based (also known as Block‐based coding).

• Fig. 12.4 (c) illustrates a possible example in which both potential matches yield small prediction errors for block-based coding.

• Fig. 12.4 (d) shows that each VOP is of arbitrary shape and ideally will obtain a unique motion vector consistent with the actual object motion.


Fig. 12.4: Comparison between Block‐based Coding and Object‐based Coding.



VOP-based Coding
• MPEG-4 VOP-based coding also employs the Motion Compensation technique:

– An Intra-frame coded VOP is called an I-VOP.
– The Inter-frame coded VOPs are called P-VOPs if only forward prediction is employed, or B-VOPs if bi-directional predictions are employed.

• The new difficulty for VOPs: they may have arbitrary shapes, so shape information must be coded in addition to the texture of the VOP.

Note: texture here actually refers to the visual content, that is, the gray-level (or chroma) values of the pixels in the VOP.


VOP-based Motion Compensation (MC)
• MC-based VOP coding in MPEG-4 again involves three steps:

(a) Motion Estimation.
(b) MC-based Prediction.
(c) Coding of the prediction error.

• Only pixels within the current (Target) VOP are considered for matching in MC.

• To facilitate MC, each VOP is divided into many macroblocks (MBs). MBs are by default 16 × 16 in luminance images and 8 × 8 in chrominance images.


• MPEG-4 defines a rectangular bounding box for each VOP (see Fig. 12.5 for details).

• The macroblocks that are entirely within the VOP are referred to as Interior Macroblocks. The macroblocks that straddle the boundary of the VOP are called Boundary Macroblocks.

• To help match every pixel in the target VOP and meet the mandatory requirement of rectangular blocks in transform coding (e.g., DCT), a pre-processing step of padding is applied to the Reference VOPs prior to motion estimation.

Note: Padding only takes place in the Reference VOPs.


Fig. 12.5: Bounding Box and Boundary Macroblocks of VOP.



I. Padding
• For all Boundary MBs in the Reference VOP, Horizontal Repetitive Padding is invoked first, followed by Vertical Repetitive Padding.

Fig. 12.6: A Sequence of Paddings for Reference VOPs in MPEG-4.

• Afterwards, for all Exterior Macroblocks that are outside of the VOP but adjacent to one or more Boundary MBs, extended padding will be applied.


Example 12.1: Repetitive Paddings

Fig. 12.7: An example of Repetitive Padding in a boundary macroblock of a Reference VOP: (a) Original pixels within the VOP, (b) After Horizontal Repetitive Padding, (c) Followed by Vertical Repetitive Padding.


Shape Coding
• MPEG-4 supports two types of shape information, binary and gray scale.
• Binary shape information can be in the form of a binary map (also known as a binary alpha map) that is of the same size as the rectangular bounding box of the VOP.
• A value ‘1’ (opaque) or ‘0’ (transparent) in the bitmap indicates whether the pixel is inside or outside the VOP.
• Alternatively, the gray-scale shape information actually refers to the transparency of the shape, with gray values ranging from 0 (completely transparent) to 255 (opaque).


I. Binary Shape Coding
• BABs (Binary Alpha Blocks): to encode the binary alpha map more efficiently, the map is divided into 16 × 16 blocks.
• It is the boundary BABs that contain the contour and hence the shape information for the VOP; they are the subject of binary shape coding.
• Two bitmap-based algorithms:

(a) Modified Modified READ (MMR).

(b) Context-based Arithmetic Encoding (CAE).
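The split of a binary alpha map into BABs can be sketched as follows. The labels "transparent", "opaque" and "boundary" are naming assumptions made for this sketch (the slide only names the boundary BABs), and the map is assumed to be a list of rows containing 0s and 1s.

```python
def classify_babs(alpha, size=16):
    """Label each size x size block of a binary alpha map."""
    height, width = len(alpha), len(alpha[0])
    labels = []
    for by in range(0, height, size):
        row_labels = []
        for bx in range(0, width, size):
            values = {alpha[y][x]
                      for y in range(by, min(by + size, height))
                      for x in range(bx, min(bx + size, width))}
            if values == {0}:
                row_labels.append("transparent")   # entirely outside the VOP
            elif values == {1}:
                row_labels.append("opaque")        # entirely inside the VOP
            else:
                row_labels.append("boundary")      # contains part of the contour
        labels.append(row_labels)
    return labels


# A 32 x 48 bounding box whose object occupies columns 16..39.
alpha = [[1 if 16 <= x < 40 else 0 for x in range(48)] for y in range(32)]
labels = classify_babs(alpha)
print(labels)
```

Only the blocks labeled as boundary carry contour information, so they are the ones a bitmap-based coder such as MMR or CAE needs to encode pixel by pixel.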


Modified Modified READ (MMR)
• MMR is basically a series of simplifications of the Relative Element Address Designate (READ) algorithm.
• The READ algorithm starts by identifying five pixel locations in the previous and current lines:

– a0: the last pixel value known to both the encoder and decoder;
– a1: the transition pixel to the right of a0;
– a2: the second transition pixel to the right of a0;
– b1: the first transition pixel whose color is opposite to a0 in the previously coded line; and
– b2: the first transition pixel to the right of b1 on the previously coded line.


II. Gray-scale Shape Coding
• The gray-scale here is used to describe the transparency of the shape, not the texture.

• Gray-scale shape coding in MPEG-4 employs the same technique as in the texture coding described above.

– Uses the alpha map and block-based motion compensation, and encodes the prediction errors by DCT.

– The boundary MBs need padding as before since not all pixels are in the VOP.


Static Texture Coding
• MPEG-4 uses wavelet coding for the texture of static objects.
• The coding of subbands in MPEG-4 static texture coding is conducted in the following manner:

– The subbands with the lowest frequency are coded using DPCM. Prediction of each coefficient is based on three neighbors.

– Coding of other subbands is based on a multiscale zero-tree wavelet coding method.

• The multiscale zero-tree has a Parent-Child Relation tree (PCR tree) for each coefficient in the lowest frequency sub-band to better track locations of all coefficients.

• The degree of quantization also affects the data rate.


Sprite Coding
• A sprite is a graphic image that can freely move around within a larger graphic image or a set of images.
• To separate the foreground object from the background, we introduce the notion of a sprite panorama: a still image that describes the static background over a sequence of video frames.

– The large sprite panoramic image can be encoded and sent to the decoder only once at the beginning of the video sequence.
– When the decoder receives separately coded foreground objects and parameters describing the camera movements thus far, it can reconstruct the scene in an efficient manner.
– Fig. 12.10 shows a sprite which is a panoramic image stitched from a sequence of video frames.


Fig. 12.10: Sprite Coding. (a) The sprite panoramic image of the background, (b) the foreground object (piper) in a blue‐screen image, (c) the composed video scene. Piper image courtesy of Simon Fraser University Pipe Band.



Global Motion Compensation (GMC)
• “Global”: overall change due to camera motions (pan, tilt, rotation and zoom). Without GMC this will cause a large number of significant motion vectors.
• There are four major components within the GMC algorithm:

– Global motion estimation
– Warping and blending
– Motion trajectory coding
– Choice of LMC (Local Motion Compensation) or GMC.


12.3 Synthetic Object Coding in MPEG-4

2D Mesh Object Coding
• 2D mesh: a tessellation (or partition) of a 2D planar region using polygonal patches:

– The vertices of the polygons are referred to as nodes of the mesh.
– The most popular meshes are triangular meshes where all polygons are triangles.
– The MPEG-4 standard makes use of two types of 2D mesh: uniform mesh and Delaunay mesh.
– 2D mesh object coding is compact. All coordinate values of the mesh are coded in half-pixel precision.
– Each 2D mesh is treated as a mesh object plane (MOP).


Fig. 12.11: 2D Mesh Object Plane (MOP) Encoding Process



I. 2D Mesh Geometry Coding
• MPEG-4 allows four types of uniform meshes with different triangulation structures.

Fig. 12.12: Four Types of Uniform Meshes.


• Definition: If D is a Delaunay triangulation, then any of its triangles t_n = (P_i, P_j, P_k) ∈ D satisfies the property that the circumcircle of t_n does not contain in its interior any other node point P_l.

• A Delaunay mesh for a video object can be obtained in the following steps:

1. Select boundary nodes of the mesh: A polygon is used to approximate the boundary of the object.

2. Choose interior nodes: Feature points, e.g., edge points or corners, within the object boundary can be chosen as interior nodes for the mesh.

3. Perform Delaunay triangulation: A constrained Delaunay triangulation is performed on the boundary and interior nodes with the polygonal boundary used as a constraint.
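The empty-circumcircle property in the definition can be checked numerically. The sketch below uses the standard in-circle determinant test on integer coordinates; it only verifies a given triangulation and does not construct a (constrained) Delaunay mesh. All names are illustrative.

```python
def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of triangle abc."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # Multiplying by the orientation makes the test independent of vertex order.
    return det * orientation(a, b, c) > 0


def is_delaunay(points, triangles):
    """Check that no triangle's circumcircle contains another node in its interior."""
    for tri in triangles:
        a, b, c = (points[i] for i in tri)
        for idx, d in enumerate(points):
            if idx not in tri and in_circumcircle(a, b, c, d):
                return False
    return True


# Four nodes of a convex quadrilateral and its two possible triangulations.
points = [(0, 0), (4, 0), (0, 4), (3, 3)]
split_a = [(0, 1, 2), (1, 3, 2)]   # diagonal between nodes 1 and 2
split_b = [(0, 1, 3), (0, 3, 2)]   # diagonal between nodes 0 and 3
print(is_delaunay(points, split_a), is_delaunay(points, split_b))
```

Only the second triangulation satisfies the definition: in the first one, node (3, 3) falls inside the circumcircle of the triangle (0, 0), (4, 0), (0, 4).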


I. Face Object Coding and Animation
• MPEG-4 has adopted a generic default face model, which was developed by the VRML Consortium.
• Face Animation Parameters (FAPs) can be specified to achieve desirable animations: deviations from the original “neutral” face.
• In addition, Face Definition Parameters (FDPs) can be specified to better describe individual faces.
• Fig. 12.16 shows the feature points for FDPs. Feature points that can be affected by animation (FAPs) are shown as solid circles, and those that are not affected are shown as empty circles.


Fig. 12.16: Feature Points for Face Definition Parameters (FDPs). (Feature points for teeth and tongue not shown.)


II. Body Object Coding and Animation

• MPEG‐4 Version 2 introduced body objects, which are a natural extension to face objects.

• Working with the Humanoid Animation (H‐Anim) Group in the VRML Consortium, a generic virtual human body with default posture is adopted.

– The default posture is a standing posture with feet pointing to the front, arms on the side and palms facing inward.

– There are 296 Body Animation Parameters (BAPs). When applied to any MPEG‐4 compliant generic body, they will produce the same animation.



– A large number of BAPs are used to describe joint angles connecting different body parts: spine, shoulder, clavicle, elbow, wrist, finger, hip, knee, ankle, and toe. This yields 186 degrees of freedom for the body, and 25 degrees of freedom for each hand alone.

– Some body movements can be specified in multiple levels of detail.

• For specific bodies, Body Definition Parameters (BDPs) can be specified for body dimensions, body surface geometry, and optionally, texture.

• The coding of BAPs is similar to that of FAPs: quantization and predictive coding are used, and the prediction errors are further compressed by arithmetic coding.
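The quantization and predictive coding stages can be sketched as follows. This is a minimal illustration, assuming a uniform quantizer and a predictor that simply reuses the previous quantized value; the step size, the predictor, and the function names are assumptions, and the final arithmetic coding of the prediction errors is omitted.

```python
import math


def encode_parameters(values, step):
    """Quantize each value and return the integer prediction errors."""
    errors = []
    previous_level = 0                       # predictor state (assumed to start at 0)
    for value in values:
        level = math.floor(value / step + 0.5)   # uniform quantization
        errors.append(level - previous_level)    # error against the previous level
        previous_level = level
    return errors


def decode_parameters(errors, step):
    """Rebuild the quantized values from the prediction errors."""
    values = []
    level = 0
    for error in errors:
        level += error
        values.append(level * step)
    return values


# A slowly changing joint angle (in degrees) over five frames.
angles = [10.0, 12.0, 13.0, 13.0, 11.0]
errors = encode_parameters(angles, step=1.0)
restored = decode_parameters(errors, step=1.0)
```

After the first frame the prediction errors are small integers clustered around zero, which is the kind of skewed distribution an entropy coder such as arithmetic coding can exploit.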



12.4 MPEG‐4 Object types, Profiles and Levels

• The standardization of Profiles and Levels in MPEG‐4 serves two main purposes:

(a) ensuring interoperability between implementations

(b) allowing testing of conformance to the standard

• MPEG‐4 not only specified Visual profiles and Audio profiles, but it also specified Graphics profiles, Scene description profiles, and one Object descriptor profile in its Systems part.

• Object type is introduced to define the tools needed to create video objects and how they can be combined in a scene.



Table 12.1: Tools for MPEG‐4 Natural Visual Object Types



Table 12.2: MPEG‐4 Natural Visual Object Types and Profiles


• For “Main Profile”, for example, only Object Types “Simple”, “Core”, “Main”, and “Scalable Still Texture” are supported.


Table 12.3: MPEG‐4 Levels in Simple, Core, and Main Visual Profiles



12.6 MPEG‐7

• The main objective of MPEG‐7 is to serve the need of audio‐visual content‐based retrieval (or audiovisual object retrieval) in applications such as digital libraries.

• Nevertheless, it is also applicable to any multimedia applications involving the generation (content creation) and usage (content consumption) of multimedia data.

• MPEG‐7 became an International Standard in September 2001, with the formal name Multimedia Content Description Interface.



Applications Supported by MPEG‐7

• MPEG‐7 supports a variety of multimedia applications. Its data may include still pictures, graphics, 3D models, audio, speech, video, and composition information (how to combine these elements).

• These MPEG‐7 data elements can be represented in textual format, or binary format, or both.

• Fig. 12.17 illustrates some possible applications that will benefit from the MPEG‐7 standard.



Fig. 12.17: Possible Applications using MPEG‐7.



MPEG‐7 and Multimedia Content Description

• MPEG‐7 has developed Descriptors (D), Description Schemes (DS) and a Description Definition Language (DDL). The following are some of the important terms:

– Feature: a characteristic of the data.
– Description: a set of instantiated Ds and DSs that describes the structural and conceptual information of the content, the storage and usage of the content, etc.
– D: definition (syntax and semantics) of the feature.
– DS: specification of the structure and relationship between Ds and between DSs.
– DDL: syntactic rules to express and combine DSs and Ds.

• The scope of MPEG‐7 is to standardize the Ds, DSs and DDL for descriptions. The mechanism and process of producing and consuming the descriptions are beyond the scope of MPEG‐7.



Descriptor (D)

• The descriptors are chosen based on a comparison of their performance, efficiency, and size. Low‐level visual descriptors for basic visual features include:

– Color
∗ Color space. (a) RGB, (b) YCbCr, (c) HSV (hue, saturation, value), (d) HMMD (HueMaxMinDiff), (e) 3D color space derivable by a 3 × 3 matrix from RGB, (f) monochrome.
∗ Color quantization. (a) Linear, (b) nonlinear, (c) lookup tables.
∗ Dominant colors.
∗ Scalable color.
∗ Color layout.
∗ Color structure.
∗ Group of Frames/Group of Pictures (GoF/GoP) color.



– Texture
∗ Homogeneous texture.
∗ Texture browsing.
∗ Edge histogram.

– Shape
∗ Region‐based shape.
∗ Contour‐based shape.
∗ 3D shape.



– Motion
∗ Camera motion (see Fig. 12.18).
∗ Object motion trajectory.
∗ Parametric object motion.
∗ Motion activity.

– Localization
∗ Region locator.
∗ Spatiotemporal locator.

– Others
∗ Face recognition.
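Two of the color items listed above lend themselves to a short worked example: a color space derived from RGB by a 3 × 3 matrix, and linear color quantization. The sketch below assumes the widely used full‐range BT.601/JFIF coefficients for RGB to YCbCr and a uniform quantizer over 8‐bit values; the coefficients, the +128 chroma offset, and the function names are assumptions and are not taken from the MPEG‐7 specification.

```python
# 3 x 3 matrix mapping (R, G, B) to (Y, Cb, Cr); assumed BT.601/JFIF coefficients.
RGB_TO_YCBCR = [
    [0.299, 0.587, 0.114],
    [-0.168736, -0.331264, 0.5],
    [0.5, -0.418688, -0.081312],
]


def rgb_to_ycbcr(rgb):
    """Apply the matrix, then shift the chroma components to be non-negative."""
    r, g, b = rgb
    y, cb, cr = (row[0] * r + row[1] * g + row[2] * b for row in RGB_TO_YCBCR)
    return (y, cb + 128.0, cr + 128.0)


def quantize_linear(value, levels, max_value=255):
    """Map a value in [0, max_value] to one of `levels` equal-width bins."""
    index = int(value * levels / (max_value + 1))
    return min(index, levels - 1)


white = rgb_to_ycbcr((255, 255, 255))    # luminance near 255, chroma near 128
bins = [quantize_linear(v, 4) for v in (0, 128, 255)]
```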



Fig. 12.18: Camera motions: pan, tilt, roll, dolly, track, and boom.



Description Scheme (DS)

• Basic elements
– Datatypes and mathematical structures.
– Constructs.
– Schema tools.

• Content Management
– Media Description.
– Creation and Production Description.
– Content Usage Description.

• Content Description
– Structural Description.



A Segment DS, for example, can be implemented as a class object. It can have five subclasses: Audiovisual segment DS, Audio segment DS, Still region DS, Moving region DS, and Video segment DS. The subclass DSs can recursively have their own subclasses.

– Conceptual Description.

• Navigation and access
– Summaries.
– Partitions and Decompositions.
– Variations of the Content.

• Content Organization
– Collections.
– Models.

• User Interaction
– UserPreference.
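The Segment DS described above, with its five subclasses, maps naturally onto a class hierarchy. The following is a minimal sketch of that idea only: the class names mirror the DS names, while the `label` attribute, the `subsegments` list, and the `add` and `walk` methods are hypothetical additions for illustration and are not part of MPEG‐7.

```python
class SegmentDS:
    """Base class standing in for the Segment DS."""

    def __init__(self, label):
        self.label = label
        self.subsegments = []        # hypothetical: children from a decomposition

    def add(self, segment):
        """Attach a child segment and return it for chaining."""
        self.subsegments.append(segment)
        return segment

    def walk(self, depth=0):
        """Yield (depth, segment) pairs in depth-first order."""
        yield depth, self
        for child in self.subsegments:
            yield from child.walk(depth + 1)


class AudiovisualSegmentDS(SegmentDS):
    pass


class AudioSegmentDS(SegmentDS):
    pass


class StillRegionDS(SegmentDS):
    pass


class MovingRegionDS(SegmentDS):
    pass


class VideoSegmentDS(SegmentDS):
    pass


# A made-up description: a program containing a video segment with one moving region.
program = AudiovisualSegmentDS("news program")
video = program.add(VideoSegmentDS("video track"))
video.add(MovingRegionDS("anchor person"))
tree = [(depth, type(seg).__name__, seg.label) for depth, seg in program.walk()]
```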



Fig. 12.19: MPEG‐7 video segment.



Fig. 12.20: A video summary.



Description Definition Language (DDL)

• MPEG‐7 adopted the XML Schema Language initially developed by the WWW Consortium (W3C) as its Description Definition Language (DDL). Since XML Schema Language was not designed specifically for audiovisual contents, some extensions are made to it:

– Array and matrix data types.

– Multiple media types, including audio, video, and audiovisual presentations.

– Enumerated data types for MimeType, CountryCode, RegionCode, CurrencyCode, and CharacterSetCode.

– Intellectual Property Management and Protection (IPMP) for Ds and DSs.



12.7 MPEG‐21

• The development of the newest standard, MPEG‐21: Multimedia Framework, started in June 2000, and was expected to become an International Standard by 2003.

• The vision for MPEG‐21 is to define a multimedia framework to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities.

• The seven key elements in MPEG‐21 are:

– Digital item declaration: to establish a uniform and flexible abstraction and interoperable schema for declaring Digital items.

– Digital item identification and description: to establish a framework for standardized identification and description of digital items regardless of their origin, type or granularity.



– Content management and usage: to provide an interface and protocol that facilitate the management and usage (searching, caching, archiving, distributing, etc.) of the content.

– Intellectual property management and protection (IPMP): to enable contents to be reliably managed and protected.

– Terminals and networks: to provide interoperable and transparent access to content with Quality of Service (QoS) across a wide range of networks and terminals.

– Content representation: to represent content in an adequate way for pursuing the objective of MPEG‐21, namely “content anytime anywhere”.

– Event reporting: to establish metrics and interfaces for reporting events (user interactions) so as to understand performance and alternatives.

