
A

SEMINAR REPORT

ON

THE INTEL MMX TECHNOLOGY

Submitted in partial fulfillment for the requirement of the award for the

Degree of Bachelor in Technology

In

Electronics & Communication Engineering

Submitted By:                Submitted to:

Renu Kanwar                  Mr. Yogendra boti (H.O.D.)

B. Tech. (8th Sem.)          Department of ECE

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

MARWAR ENGINEERING COLLEGE & RESEARCH CENTRE, JODHPUR

(RAJASTHAN)

RAJASTHAN TECHNICAL UNIVERSITY, KOTA (RAJASTHAN)

(2010-2011)


Introduction:-

Intel's MMX™ technology [1, 2] is an extension to the basic Intel Architecture (IA)
designed to improve performance of multimedia and communication algorithms. The
technology includes new instructions and data types, which achieve new levels of
performance for these algorithms on host processors.

MMX technology exploits the parallelism inherent in many of these algorithms. Many
of these algorithms exhibit the property of "fixed" computation on a large data set.

The definition of MMX technology evolved from earlier work in the i860™
architecture [3]. The i860 architecture was the industry's first general purpose
processor to provide support for graphics rendering. The i860 processor provided
instructions that operated on multiple adjacent data operands in parallel, for
example, four adjacent pixels of an image.

After the introduction of the i860 processor, Intel explored extending the i860
architecture in order to deliver high performance for other media applications, for
example, image processing, texture mapping, and audio and video decompression.
Several of these algorithms naturally lent themselves to SIMD processing. This effort
laid the foundation for similar support for Intel's mainstream general purpose
architecture, IA.

The MMX technology extension was the first major addition to the instruction set
since the Intel386™ architecture. Given the large installed software base for the IA,
a significant extension to the architecture required special attention to backward
compatibility and design issues.

MMX technology provides benefits to the end user by improving the performance of
multimedia-rich applications by a factor of 1[...]



This paper provides insight into the process and considerations used to define the
MMX technology. It also provides specifics on MMX instructions that were added to
the IA, as well as the approach taken to add this significant capability without
adding a new software-visible architectural state.

The paper also presents application examples that show the usage and benefits of
MMX instructions. Data showing the performance benefits for the applications is
also presented.


ACKNOWLEDGEMENT

Theoretical knowledge is improved through seminar preparation, as it contributes
significantly to the student's understanding and gives him first-hand knowledge of
the complexities of the engineering arena.

First, I would like to thank the almighty and my parents, who gave me their valuable
support and blessings to complete this minor project. I would also like to thank
Er. V. K. Bhansali (Director, M.E.C.R.C., Jodhpur) & Er. Yogendra boti (Head of
Department, Electronics & Communication Engineering) for their encouragement &
appreciation.

An accomplishment of any significance depends on the synergy and cooperation of
resources, both material and human. I express my heartfelt gratitude to all those
who have contributed directly or indirectly in this endeavor.

My first and foremost regards are for my family members, who patiently and
painstakingly helped me out in every way they could.

INDEX

S. No.  Content
01      Definition process
02      Basic concepts
03      Packed data format
04      Conditional execution
05      Saturating arithmetic
06      Fixed point arithmetic
07      Repositioning of data elements within packed data format
08      Data alignment
09      Features

For example, for a motion estimation algorithm, data is naturally organized in 16
rows, with each row containing only 16 bytes of data. In this case, operating on
more than 16 data elements at a time will require reformatting the input data.
Design considerations involve issues such as the practical width of the data path
and how many times functional units will replicate.

Given that current Intel processors already have 64-bit data paths (for example,
floating-point data paths, as well as a data path between the integer register file
and memory subsystem due to dual load/store capability in the Pentium processor), we
chose the width of MMX data types to be 64 bits.

Conditional Execution:-

Operating on multiple data operands using a single instruction presents an
interesting issue. What happens when a computation is only done if the operand value
passes some conditional check? For example, in an absolute value calculation, only
if the number is already negative do we perform a 2's complement on it:

    for i = 1, 100
        if a[i] < 0 then b[i] = -a[i] else b[i] = a[i]
    ; Absolute value calculation

There are different approaches possible, and some are simpler than others. Using a
branch approach does not work well for two reasons: first, a branch-based solution is
slower because of the inherent branch misprediction penalty, and second, because of
the need to convert packed data types to scalars.

Direct conditional execution support does not work well for the IA, since it requires
three independent operands (source, source/destination, and predicate vector).
Keeping with the philosophy of performance and simplicity, we chose a simpler
solution. The basic idea was to convert a conditional execution into a conditional
assignment.

Conditional assignment, in turn, can be implemented through different approaches. One
approach would be to provide the flexibility of specifying a dynamically generated
mask with an assignment instruction. Such an approach would have required defining
instructions with three operands (source, source/destination, and mask). Here also,
we adopted a solution that is more amenable to higher performance designs.

Compare operations in MMX technology result in a bit mask corresponding to the
length of the operands. For example, a compare operation operating on packed byte
operands produces byte-wide masks. These masks can then be used in conjunction with
logical operations to achieve conditional assignment.

Consider the following example:

    If True
        Ra := Rb else Ra := Rc

Let us say register Rx contains all 1's if the condition is true and all 0's if the
condition is false. Then we can compute Ra with the following logical expression:

    Ra = (Rb AND Rx) OR (Rc ANDNOT Rx)

This approach works for operations with a register as the destination. Conditional
assignment to memory can be implemented as a sequence of load, conditional
assignment, and store. We rejected more efficient support for conditional stores for
two reasons: first, the support requires three source operands, which does not map
well to high-performance architectures, and second, the benefit of such support is
dependent on support from the platform for efficient partial transfers.
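The logical expression for Ra can be tried out in ordinary C. The following is a minimal sketch, not MMX code: a 64-bit integer stands in for a register, and the function name select_by_mask is made up for illustration.

```c
#include <stdint.h>

/* Conditional assignment without a branch:
 *     Ra = (Rb AND Rx) OR (Rc ANDNOT Rx)
 * "mask" plays the role of Rx: all 1's in every element where the
 * condition is true, all 0's where it is false. */
uint64_t select_by_mask(uint64_t rb, uint64_t rc, uint64_t mask)
{
    return (rb & mask) | (rc & ~mask);
}
```

Because the mask is applied bit by bit, one call handles eight bytes, four words, or two double words at once, depending only on how the mask was generated.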


The MMX instruction set contains a packed compare instruction that generates a bit
mask, enabling data-dependent calculations to be executed without branch instructions
and to be executed on several data elements in parallel. The bit mask result of the
packed compare instruction has all 1's in elements where the relation tested for is
true and all 0's otherwise (see Figure 1).
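As an illustration of how a packed compare feeds a conditional assignment, here is a hedged sketch of the absolute value calculation on four 16-bit elements. It emulates the lanes with a plain C loop; the function names and the LANES constant are assumptions for this example, not part of the instruction set.

```c
#include <stdint.h>

#define LANES 4  /* four 16-bit words fill one 64-bit packed operand */

/* Emulated packed compare "greater than": each result element is all 1's
 * (0xFFFF) where a[i] > b[i] and all 0's otherwise. */
void packed_cmpgt_words(const int16_t a[LANES], const int16_t b[LANES],
                        uint16_t mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFFu : 0x0000u;
}

/* Absolute value through compare + mask: negative elements take the
 * 2's complement, the others are passed through unchanged. */
void packed_abs_words(const int16_t a[LANES], int16_t out[LANES])
{
    const int16_t zero[LANES] = {0, 0, 0, 0};
    uint16_t neg[LANES];

    packed_cmpgt_words(zero, a, neg);            /* all 1's where 0 > a[i] */
    for (int i = 0; i < LANES; i++) {
        uint16_t v = (uint16_t)a[i];
        uint16_t negated = (uint16_t)(0u - v);   /* 2's complement of v */
        out[i] = (int16_t)((negated & neg[i]) | (v & (uint16_t)~neg[i]));
    }
}
```

The selection step inside the loop has no data-dependent branch, which is the property the packed compare is meant to provide.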

Saturating Arithmetic:-

Operand sizes typically used in multimedia are small (for example, 8 bits for
representing a color component). An 8-bit number allows only 2[...]



There may be cases where an application wants to examine the occurrence of an
overflow in a computation. Providing a flag to indicate this (i.e., indicating
whether or not the value was saturated) would have been desirable. However, we
decided against providing this flag, since we did not want to add any additional new
states to the architecture to preserve the backward compatibility. Our analysis also
showed that it was not critical to provide this information in most applications. If
needed, an application can determine if saturation was encountered by comparing the
result of a computation with the maximum and minimum value; typically, saturation is
the correct behavior.
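To make the difference between saturating and wraparound results concrete, the sketch below emulates both for single elements in portable C. The function names are chosen for this example only; a packed instruction would apply the same rule to every element of the operand.

```c
#include <stdint.h>

/* Unsigned 8-bit add that clamps at 255 instead of wrapping. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* The same add with ordinary wraparound, shown for contrast. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Signed 16-bit add that clamps at the minimum and maximum value. */
int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

For a color component, 200 + 100 saturates to 255 (the brightest shade), whereas the wraparound result 44 would turn a bright pixel dark. Comparing a result against the maximum or minimum value, as described above, shows that a result sits at the limit, which is how an application would check for saturation without a flag.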

Fixed-Point Arithmetic:-

Media applications involve working on fraction values, for example, the use of a
weighting coefficient in filtering, averaging, etc. One way to support operations on
fraction values is to provide SIMD operations for floating-point operands. However,
floating-point units are hardware intensive. Also, for several media applications,
even precision of 10 to 12 binary bits and dynamic range of 4 to 6 bits are
sufficient. Industry-standard floating-point (IEEE FP) requires a minimum of 23 bits
of precision. Looking at application requirements and the trade-off of performance
and design complexity leads to the use of a fixed-point arithmetic paradigm for
several media applications. Note that some of the computations may still require the
dynamic range and the precision supported by IEEE floating-point, for example,
geometry transformation for state-of-the-art 3D applications.

In fixed-point computation, from the point of view of the processor architecture,
computations are done on integer values, but programmer/applications interpret the
integer values as fraction values. Some number of leading bits (determined by the
application) are interpreted as an integer, while the remaining bits of the value are
interpreted as a fraction. It is the application's responsibility to perform
appropriate shifts in order to scale the number.
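The scaling responsibility described above can be sketched in a few lines of C. The split of a 16-bit value into 8 integer bits and 8 fraction bits (FRAC_BITS) is an assumption made for this example; an application is free to choose a different split.

```c
#include <stdint.h>

#define FRAC_BITS 8  /* assumed format: 8 integer bits, 8 fraction bits */

/* Convert between a real value and its fixed-point integer encoding. */
int16_t to_fixed(double x)   { return (int16_t)(x * (1 << FRAC_BITS)); }
double  to_double(int16_t q) { return (double)q / (1 << FRAC_BITS); }

/* Multiply two fixed-point values. The processor only sees an integer
 * multiply; the product carries 2 * FRAC_BITS fraction bits, so the
 * application shifts right to restore the chosen format.
 * Note: right-shifting a negative value is implementation-defined in C;
 * this sketch is only exercised with non-negative operands. */
int16_t fixed_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * b;
    return (int16_t)(wide >> FRAC_BITS);
}
```

For instance, 1.5 is stored as 384 and 0.25 as 64; their integer product 24576 shifted right by 8 gives 96, which reads back as 0.375.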

Repositioning of Data Elements Within Packed Data Format:-

The packed data format presents one other issue. There are several cases where
elements of packed data may be required to be repositioned within the packed data,
or the elements of two packed data operands may need to be merged. There are cases
where either input or the desired output representation of a data may not be ideal
for maximizing computation throughput. For example, it may be preferable to compute
on color components of a pixel in "planar format" while the input may be in "packed
format."


There are also situations where one needs to perform intermediate computations in a
wider format (perhaps packed word format), while the result is presented in packed
byte format.

In the above cases, there is a need to extract some elements of a packed data type
and write them into a different position in the packed result.

One general solution to this issue is to provide an instruction that takes two packed
data operands and allows merging of their bytes in any arbitrary order into the
destination packed data operand. However, such a general solution is expensive to
implement. This solution essentially will require a full crossbar connection.

In the MMX technology architecture, we defined an instruction that requires a
relatively easy swizzle network and yet allows the efficient repositioning and
combining of elements from packed data operands in most cases.

The instruction unpack takes two packed data operands and merges them as shown in
Figure 2.

The unpack instruction can be used for a variety of efficient repositioning of data
elements, including data replication, within packed data. For example, consider
converting a color representation from packed form (i.e., for each pixel, four
consecutive bytes represent R, G, B and Alpha values) to planar format (i.e., four
consecutive bytes represent the red component of four consecutive pixels).
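Since Figure 2 is not reproduced here, the following sketch emulates the low-order byte form of unpack (the PUNPCKLBW instruction) with byte arrays, where index 0 is the least significant byte. The function name is made up for this example, and the output array must not overlap the inputs.

```c
#include <stdint.h>

/* Emulated "unpack low bytes": interleave the four low-order bytes of a
 * with the four low-order bytes of b, giving a0 b0 a1 b1 a2 b2 a3 b3. */
void unpack_low_bytes(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}
```

Unpacking against an all-zero operand widens each byte to a 16-bit word, which is one way to move into the wider intermediate format mentioned earlier; unpacking an operand with itself replicates each element.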

Data Alignment:-

Use of packed data also presents data alignment issues. In some cases the data may be
aligned on its natural boundary and not on the size of the packed data operand. For
example, in a motion estimation routine, the 16x16 block is aligned at an arbitrary
byte boundary and not at a 64-bit boundary. Therefore, in some cases there is a need
to support efficient access of unaligned data for media applications. One approach is
to support unaligned accesses directly in hardware, which generally does not work
well with the high-performance cache design. Alternatively, one can limit memory
accesses to aligned data and extract out the desired data from the accessed data
using explicit instructions.

MMX technology includes logical shift-left and shift-right operations on 64 bits.
These instructions enable using a sequence of Shift left, Shift right, and Or
operations to assemble the desired bytes from the aligned data that encompasses the
desired bytes.
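The shift-and-or sequence can be sketched as follows. This is an illustrative emulation assuming little-endian byte numbering (byte 0 is the least significant byte of the lower aligned word) and a byte offset between 0 and 7; the function name is hypothetical.

```c
#include <stdint.h>

/* Assemble an unaligned 64-bit value from the two aligned 64-bit words
 * that encompass it. "lo" is the aligned word at the lower address,
 * "hi" the next one; byte_offset (0..7) is where the wanted data starts. */
uint64_t extract_unaligned(uint64_t lo, uint64_t hi, unsigned byte_offset)
{
    unsigned shift = 8u * byte_offset;
    if (shift == 0)
        return lo;                 /* already aligned; also avoids a 64-bit
                                      shift, which is undefined in C */
    return (lo >> shift) | (hi << (64u - shift));
}
```

With an offset of 3, the upper five bytes of the first word are shifted down and the lower three bytes of the second word are shifted up, and the Or merges them into one packed operand.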

Features:-

MMX technology features include:

New data types built by packing independent data elements together into one
register.

An enhanced instruction set that operates on all independent data elements in a
register, using parallel SIMD fashion.

New 64-bit MMX registers that are mapped on the IA floating-point registers.

Full IA compatibility.

New Data Types:-

MMX technology introduces four new data types: three packed data types and a new
64-bit entity. Each element within the packed data types is an independent
fixed-point integer. The architecture does not specify the place of the fixed point
within the elements, because it is the user's responsibility to control its place
within each element throughout the calculation. This adds a burden on the user, but
it also leaves a large amount of flexibility to choose and change the precision of
fixed-point numbers during the course of the application in order to fully control
the dynamic range of values.

The following four data types are defined (see Figure 3):

Packed byte           8 bytes packed into 64 bits
Packed word           4 words packed into 64 bits
Packed double word    2 double words packed into 64 bits
Packed quad word      64 bits
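One way to picture the four data types is as different views of the same 64 bits. The union below is a sketch for illustration; the type name packed64 and the helper splat_byte are not part of MMX technology.

```c
#include <stdint.h>

/* Four views of one 64-bit packed operand. */
typedef union {
    uint8_t  b[8];   /* packed byte:        8 x 8 bits  */
    uint16_t w[4];   /* packed word:        4 x 16 bits */
    uint32_t d[2];   /* packed double word: 2 x 32 bits */
    uint64_t q;      /* quad word:          1 x 64 bits */
} packed64;

/* Replicate one byte value into all eight byte elements. */
packed64 splat_byte(uint8_t x)
{
    packed64 v;
    for (int i = 0; i < 8; i++)
        v.b[i] = x;
    return v;
}
```

Whatever the view, the operand occupies exactly 8 bytes, which matches the 64-bit data path width chosen for MMX data types.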

Enhanced Instruction Set:-

MMX technology defines a rich set of instructions that perform parallel operations on
multiple data elements packed into 64 bits (8x8-bit, 4x16-bit, or 2x32-bit fixed
point integer data elements). We view the MMX technology instruction set as an
extension of the basic operations one would perform on a single datum in the SIMD
domain. Instructions that operate on packed bytes were defined to support frequent
image operations that involve 8-bit pixels or one of the 8-bit color components of
24/32-bit pixels (Red, Green, Blue, Alpha channel). We defined full support for
packed word (16-bit) data types. This is because we found 16-bit data to be a
frequent data type in many multimedia algorithms (e.g., MODEM, Audio) and serves as
the higher precision backup for operations on byte data. A basic instruction set is
provided for packed doubleword data types to support operations that need
intermediate higher precision than 16 bits and a variety of 3D graphics algorithms.
Because MMX technology is a 64-bit capability, new instructions to support 64 bits
were added, such as 64-bit memory moves or 64-bit logical operations.

Overall [...]

Table 1 summarizes the instructions introduced by MMX technology:

As the MMX registers are mapped over the floating-point registers, applications that
use MMX technology have 16 registers to use. Eight are the MMX registers, each 64
bits in size, that hold packed data, and eight are integer registers, which can be
used for different operations like addressing, loop control, or any other data
manipulation. MMX data values reside in the low order 64 bits (the mantissa) of the
IA 80-bit floating-point registers (see Figure 4).

The exponent field of the corresponding floating-point register (bits 64-78) and the
sign bit (bit 79) are set to ones (1's), making the value in the register a NaN (Not
a Number) or infinity when viewed as a floating-point value. This helps to reduce
confusion by ensuring that an MMX data value will not look like a valid
floating-point value. MMX instructions only access the low-order 64 bits of the
floating-point registers and are not affected by the fact that they operate on
invalid floating-point values.
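The register layout described above can be modeled with a small struct. This is only a software picture of the 80-bit register image; the type and function names are assumptions for this sketch.

```c
#include <stdint.h>

/* Software model of one 80-bit floating-point register. */
typedef struct {
    uint64_t mantissa;   /* bits 0-63: where the MMX data value lives   */
    uint16_t sign_exp;   /* bits 64-78 exponent, bit 79 sign            */
} fp_reg80;

/* Writing an MMX value fills the mantissa and sets exponent and sign
 * bits to all 1's. */
fp_reg80 mmx_write(uint64_t value)
{
    fp_reg80 r = { value, 0xFFFFu };
    return r;
}

/* MMX instructions only look at the low-order 64 bits. */
uint64_t mmx_read(fp_reg80 r)
{
    return r.mantissa;
}

/* An all-1's exponent is how a floating-point reader recognizes
 * a NaN or infinity. */
int looks_like_nan_or_infinity(fp_reg80 r)
{
    return (r.sign_exp & 0x7FFFu) == 0x7FFFu;
}
```

Any value written through mmx_write therefore reads back unchanged through mmx_read, while a floating-point view of the same register never mistakes it for an ordinary number.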

The dual usage of the floating-point registers does not preclude applications from
using both MMX code and floating-point code. Inside the application, the MMX code and
floating-point code should be encapsulated in separate code sequences. After one
sequence completes, the floating-point state is reset and the next sequence can
start. The need to use floating-point data and MMX (fixed-point integer) data at the
same time is infrequent.

At a given time in an application, data being operated upon is usually of one type.
This enabled us to use the floating-point registers to store the MMX technology
values and achieve our full backward compatibility goal.

Preserving Full Backward Compatibility:-

One of the important requirements for MMX technology was to enable use of MMX
instructions in applications without requiring any changes in the IA system software.
An additional requirement was that an application should be able to utilize
performance benefits of MMX technology in a seamless fashion, i.e., it should be able
to employ MMX instructions in part of the application without requiring the whole of
the application to be MMX technology-aware.

Primary backward compatibility requirements and their implications are:

Applications using MMX instructions should work on all existing multitasking and
non-multitasking operating systems. This requires that MMX technology should not add
any new architecturally visible states or events (exceptions).

Existing applications that do not use MMX instructions should run unchanged. This
requires that MMX technology should not redefine the behavior of any existing IA
32-bit instructions. Only those undefined opcodes that are not relied on for causing
illegal exceptions by existing software should be used to define MMX instructions.
Also, MMX instructions should only affect the IA 32-bit state when in use.

Existing applications should be able to utilize MMX technology without being
required to make the whole application MMX technology-aware. It should be possible to
employ MMX instructions within a procedure in an existing application without
requiring any changes in the rest of the application. This requires that MMX
instructions work well within the context of existing IA calling conventions for
procedure calls.

It should be possible to run an application even on an older generation of processors
that does not support MMX technology. Using dynamically linked libraries (DLLs) for
MMX and non-MMX technology processors is an easy way to do this.

MMX instructions should be semantically compatible with other IA instructions, i.e.,
it should be easy to support new MMX instructions in existing assemblers. They should
also have minimal impact on the instruction decoder. Another aspect of this is that
MMX instructions should not require programmers to think in new ways regarding the
basic behavior of instructions. For example, addressing modes and the availability of
operations with memory should conceptually work the same.

No New State:-

The MMX technology state overlaps with the Floating-Point state. Overlapping the MMX
state with the FP stack presented an interesting challenge. For performance reasons,
as well as for ease of implementation for some microarchitectures, we wanted to allow
the accessing of the MMX registers in a flat register model. We needed to enable
overlapping MMX registers with the FP stack while still allowing a flat register
access model for MMX instructions. This was accomplished by enforcing a fixed
relationship between the logical and physical registers for the FP stack when
accessed via MMX instructions. Additionally, every MMX instruction makes the whole
MMX register file valid. This is different from the floating-point stack model, where
new stack entries are made valid only if the instruction specifies a "push"
operation.

MMX instructions themselves do not update FP instruction state registers (for
example, FP opcode (FOP), FP Data selector (FDS), FP IP (FIP), etc.). The FP
instruction state is used only by FP exception handlers. Since MMX instructions do
not create any computation exceptions, this state is really not meaningful for MMX
instructions. Additionally, not updating these states eliminates the complexity of
maintaining this state for MMX technology implementations. Therefore, we made a
decision to let the FP instruction state register point to the last FP instruction
executed, even though future MMX instructions will update the FP stack and TAG
register. Eventually, when an FP instruction is executed, all of the FP instruction
state gets updated. Therefore, FP exception handlers always see consistent FP
instruction state.

No New Exceptions:-

MMX instructions can be viewed as new non-IEEE floating-point instructions that do
not generate computation exceptions. However, similar to FP instructions, they do
report any pending FP exceptions. For compatibility with existing software, it is
critical that any pending FP exception is reported to the software prior to execution
of any MMX instruction which could update the FP state.

At the point of raising the pending FP exception, the FP exception state still points
to the last FP instruction creating the FP condition. Therefore, the fact that the
exception gets reported by an MMX instruction instead of an FP instruction is
transparent to the FP exception handler.

Additional exceptions that are pertinent to MMX technology are memory exceptions,
device-not-available (DNA - INT 7) exceptions, and FP emulation exceptions.

Handling of memory exceptions in general does not depend on the opcode of the
instruction causing the exception. Therefore, MMX technology exceptions do not cause
a malfunction of any memory access-related exception handler. Our extensive
compatibility verification validated this further.

A DNA exception is caused when the TS bit in CR0 is set and any instruction that
could modify the FP state is issued. This includes execution of an MMX instruction
when the TS bit is set. In this case, similar to the FP case, a DNA exception is
invoked. The response to this exception is to save the FP state and free it up for
use by future FP/MMX instructions. This exception handler also does not have a use
for the opcode of the instruction causing this exception.

When the CR0.EM bit is set, a floating-point instruction causes an FP emulation
exception. In this case, instead of using FP hardware, FP functionality is supported
via software emulation. Since the MMX technology architecture state overlaps with the
FP architecture state, the issue arises as to the correct behavior for MMX
instructions when the CR0.EM bit is set.

Causing an emulation exception for MMX instructions when CR0.EM is set is not the
right behavior, since the existing FP emulator does not know about MMX instructions.
Therefore, the first natural choice seemed to be to ignore CR0.EM for MMX technology.
However, this choice has a problem. Ignoring CR0.EM for MMX instructions would result
in two separate contexts for the FP Stack and TAG words: one context in the emulator
memory for FP and one context in the hardware for MMX instructions. This leads to an
architectural inconsistency between the cases when CR0.EM is set and when it is not
set.

We had to find some other logical way to deal with this without defining any new
exceptions. We chose to define the CR0.EM = 1 case to result in an illegal opcode
exception. Thus, essentially, when CR0.EM is set, the MMX technology architecture
extension is disabled.
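The rules in this section can be summarized as a small decision function. This is a sketch of the behavior described above, with the checks ordered EM first, then TS, then pending FP exceptions; the enum and function names are invented for this example.

```c
/* Possible outcomes when an MMX instruction is about to execute. */
typedef enum {
    MMX_EXECUTES,                /* no exception, instruction runs           */
    MMX_ILLEGAL_OPCODE,          /* CR0.EM = 1: the extension is disabled    */
    MMX_DEVICE_NOT_AVAILABLE,    /* CR0.TS = 1: DNA (INT 7) exception        */
    MMX_PENDING_FP_EXCEPTION     /* an earlier FP exception is reported      */
} mmx_outcome;

/* Each flag is nonzero when the corresponding condition holds. */
mmx_outcome mmx_dispatch(int cr0_em, int cr0_ts, int fp_exception_pending)
{
    if (cr0_em)
        return MMX_ILLEGAL_OPCODE;
    if (cr0_ts)
        return MMX_DEVICE_NOT_AVAILABLE;
    if (fp_exception_pending)
        return MMX_PENDING_FP_EXCEPTION;
    return MMX_EXECUTES;
}
```

Every outcome in the enum is an exception that already existed in the architecture, which is the point of the section: no new exception had to be defined.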

Choice of Opcodes for MMX Instructions:-

The MMX instruction opcodes were chosen after extensive analysis of the undefined
opcode map. We had to make sure that the available opcodes were really unused. This
required ensuring that no software was relying on the illegal opcode fault behavior
of these opcodes. Intel was already working with software vendors to ensure that they
relied only on one specific encoding, 0FFF, to cause an illegal opcode fault. Other
encodings may cause an illegal exception fault in future implementations.

Except for a few cases, we found that software was using only the prescribed encoding
for causing a program-controlled invalid opcode fault.


Only address prefixes are defined to be meaningful for MMX instructions. Use of a
Repeat, Lock, or Data prefix is illegal for MMX instructions. The address prefix has
the same behavior as for any other instruction.

Use of FP DLL Model for MMX Code:-

To enable common multimedia applications for processors with and without MMX
technology, we chose to promote the Dynamic Linked Library (DLL) model as the primary
model to support MMX instructions.

In the DLL model, depending upon whether the processor provides MMX technology
support in hardware (the processor CPUID provides this information), the appropriate
version of the media library function is linked dynamically.
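The selection step can be sketched with a function pointer in plain C. To keep the example portable, the CPUID query is replaced by a has_mmx flag that the caller supplies (on real hardware the MMX feature flag is reported in bit 23 of EDX for CPUID function 1), and both library versions are written in ordinary C; all names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Common signature for both versions of a media library routine. */
typedef void (*brighten_fn)(uint8_t *pixels, size_t n, uint8_t delta);

static uint8_t sat_add(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Version for processors without MMX technology: one pixel at a time. */
void brighten_scalar(uint8_t *pixels, size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++)
        pixels[i] = sat_add(pixels[i], delta);
}

/* Stand-in for the MMX version: works in groups of eight bytes, the
 * amount one packed-byte instruction would handle, then the leftovers. */
void brighten_packed(uint8_t *pixels, size_t n, uint8_t delta)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (int lane = 0; lane < 8; lane++)
            pixels[i + lane] = sat_add(pixels[i + lane], delta);
    for (; i < n; i++)
        pixels[i] = sat_add(pixels[i], delta);
}

/* Pick the version once, based on what the processor reports. */
brighten_fn select_brighten(int has_mmx)
{
    return has_mmx ? brighten_packed : brighten_scalar;
}
```

The caller only ever sees the function pointer, so the rest of the application does not need to know which version was chosen.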

MMX technology DLLs suggest the same guidelines as those of FP DLLs. The primary
guidelines are:

At the end of a DLL, leave the floating-point registers in the correct state for the
calling procedure. This generally means leaving the floating-point stack empty unless
a procedure has a return value. This also means that the caller should check for and
handle any FP exceptions that it might have generated.

Do not assume that the floating-point state remains the same across procedures. The
callee can typically assume that, at entry, the FP stack is empty unless there is
some set convention for parameter passing. Note that nothing in the MMX technology
architecture depends on these guidelines for functional correctness.

MMX technology can be used in any other usage models. MMX technology provides an
instruction to clear all of FP state with a single instruction (the EMMS
instruction). If some DLL is written to return with the FP stack only partially
empty, one needs to use a combination of EMMS and floating-point loads to create the
correct FP stack state.

Clean the state of MMX with the EMMS instruction.
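The effect of MMX instructions and of EMMS on the floating-point tag word can be modeled in a few lines. This is a software sketch, not processor code: the struct and function names are assumptions, and the tag encoding used (two bits per register, 00 for valid and 11 for empty) follows the IA floating-point tag word.

```c
#include <stdint.h>

/* Minimal model of the FP stack bookkeeping shared with MMX. */
typedef struct {
    uint16_t tag_word;      /* 2 bits per register: 00 valid, 11 empty */
    uint8_t  top_of_stack;  /* which physical register is ST(0)        */
} fp_stack_state;

/* Any MMX instruction other than EMMS: every register becomes valid and
 * the logical-to-physical mapping is fixed (top of stack = 0). */
void after_mmx_instruction(fp_stack_state *s)
{
    s->tag_word = 0x0000;
    s->top_of_stack = 0;
}

/* EMMS: mark all registers empty so floating-point code can start
 * from a clean stack. */
void emms(fp_stack_state *s)
{
    s->tag_word = 0xFFFF;
}

int stack_is_empty(const fp_stack_state *s)
{
    return s->tag_word == 0xFFFF;
}
```

In this model a routine that runs MMX code and then returns without calling emms leaves the stack looking full, which is why the guidelines ask for the state to be cleaned before returning to floating-point code.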

Performance Advantage:-

We will analyze the performance enhancement due to MMX technology through an example
of a matrix-vector multiplication very much like the one in Figure [...]

[...] multimedia and communications applications used in basic mathematical
primitives like matrix multiply and filters.

A multiply-accumulate operation (MAC) is defined as the product of two operands added
to a third operand (the accumulator). This operation requires two loads (operands of
the multiplication operation), a multiply, and an add (to the accumulator). MMX
technology does not support three operand instructions; therefore, it does not have a
full MAC capability. On the other hand, the packed multiply-add instruction (PMADDWD)
is defined, which computes four 16-bit x 16-bit multiplies generating four 32-bit
products and does two 32-bit adds (out of the four needed). A separate packed add
double word (PADDD) adds the two 32-bit results of the packed multiply-add to another
MMX register, which is used as an accumulator.

For this performance example, we will assume both input vectors to be of length 16
elements, each element in the vectors being signed 16 bits. Accumulation will be
performed in 32-bit precision. The Pentium processor, for example, would have to
process each of the operations one at a time in a sequential fashion. This amounts to
32 loads, 16 multiplies, and 15 additions, a total of 63 instructions. Assuming we
perform 4 MACs (out of the 16) per iteration, we need to add 12 instructions for loop
control (3 instructions per iteration: increment, compare, branch), and one
instruction for storing the result. The total is 76 instructions. Assuming all data
and instructions are in the on-chip caches and that exiting the loop will incur one
branch misprediction, the integer assembly optimized version of this code (utilizing
both pipelines) takes just over 200 cycles on a Pentium processor microarchitecture.
The cycle count is dominated by the integer multiply being a non-pipelined 11-cycle
operation. Under the same conditions, but assuming the data is in a floating-point
format, the floating-point optimized assembly version executes in 74 cycles. The
floating-point version is faster (assuming the data is in floating-point format),
since the floating-point multiply takes three cycles to execute and is a pipelined
unit.


MMX technology, on the other hand, computes four elements at a time. This reduces the
instruction count to eight loads, four PMADDWD instructions, three PADDD
instructions, one store instruction, and three additional instructions (overhead due
to packed data types), totaling 19 instructions. Performing loop unrolling of four
PMADDWD instructions eliminates the need to insert any loop control instructions.
This is because four PMADDWDs already perform all the 16 required MACs. The MMX
instruction count is four times less than when using integer or floating-point
operations! With the same assumptions as above, on a Pentium processor with MMX
technology, an MMX technology-optimized assembly version of the code, utilizing both
pipelines, will execute in only 12 cycles.
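The structure of the MMX version of the 16-element dot product can be followed in this portable emulation. It is a sketch only: pmaddwd and paddd here are C functions that mimic what the instructions of the same name compute, and dot16 is a made-up name for the whole kernel.

```c
#include <stdint.h>

/* Emulated packed multiply-add: four 16x16-bit multiplies, with adjacent
 * 32-bit products summed pairwise into two 32-bit results.
 * (The single overflow case, all four inputs equal to -32768, is not
 * handled in this sketch.) */
void pmaddwd(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Emulated packed add double word: two independent 32-bit adds. */
void paddd(int32_t acc[2], const int32_t x[2])
{
    acc[0] += x[0];
    acc[1] += x[1];
}

/* Dot product of two 16-element vectors of signed 16-bit values:
 * four packed multiply-adds cover all 16 multiply-accumulates. */
int32_t dot16(const int16_t a[16], const int16_t b[16])
{
    int32_t acc[2] = {0, 0};
    int32_t partial[2];

    for (int k = 0; k < 16; k += 4) {
        pmaddwd(a + k, b + k, partial);
        paddd(acc, partial);
    }
    return acc[0] + acc[1];   /* final combine of the two halves */
}
```

The loop is written compactly here; in the instruction count above it is fully unrolled, the first packed multiply-add result serves directly as the accumulator (hence three PADDD instructions rather than four), and the final combine of the two halves is part of the packed data type overhead.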

Continuing the above example, assume a 16x16 matrix is multiplied by a 16-element
vector. This operation is built of 16 Vector-Dot-Products (VDP) of length 16.
Repeating the same exercise as before, and assuming a loop unrolling that performs
four VDPs each iteration, the regular Pentium processor code will total
4*(4*76+3) = 1228 instructions. Using MMX technology will require
4*(4*19+3) = 316 instructions. The MMX instruction count is 3.9 times less than when
using regular operations. The best regular code implementation (floating-point
optimized version) takes just under 1200 cycles to complete, in comparison to 207
cycles for the MMX code version.
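The instruction counts quoted in this section can be rechecked with a few lines of arithmetic. The helper names below are made up for this check; the individual terms are the ones listed in the text.

```c
/* Instructions for one 16-element dot product on the regular processor:
 * 32 loads + 16 multiplies + 15 adds + 12 loop control + 1 store. */
int scalar_vdp_instructions(void)
{
    return 32 + 16 + 15 + 12 + 1;
}

/* Instructions for the same dot product with MMX technology:
 * 8 loads + 4 PMADDWD + 3 PADDD + 1 store + 3 packed-data overhead. */
int mmx_vdp_instructions(void)
{
    return 8 + 4 + 3 + 1 + 3;
}

/* 16x16 matrix times vector: 4 loop iterations, each performing 4 dot
 * products plus 3 loop control instructions. */
int matrix_instructions(int per_vdp)
{
    return 4 * (4 * per_vdp + 3);
}
```

With 76 and 19 instructions per dot product, the matrix totals come to 1228 and 316, a ratio of about 3.9; the ratio is slightly below 4 because the three loop control instructions per iteration are the same in both versions.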

Intel has introduced two processor families with MMX technology: the Pentium
processor with MMX technology and the Pentium II processor. The performance of both
processors was compared on the Intel Media Benchmark (IMB). Figure 6 and Table 2
compare the Pentium processor with MMX technology and the Pentium II processor
against the Pentium processor and the Pentium Pro processor.


The floating point registers:-

1. Floating point is processed by eight 80 bit registers ST(0), ST(1), …ST(7) in
   the floating point unit.
2. When doing floating point arithmetic these registers are organized in a stack.
3. Programming floating point is quite different from programming integer
   arithmetic.
4. Floating point calculations are done using 80 bits, even when the program
   specifies storing 32 or 64 bit data values.

Advantages of using the floating point registers in MMX:-

1. The registers already exist. Only logic had to be added to the chip.
2. The operating system already knows about the floating point registers.
3. When a computer is switched from one program to another, the state (registers)
   of the current program must be saved so the state can be restored when the
   program becomes the active program once again.
4. The floating point registers are automatically saved as part of the state of a
   program.



Conclusion:-

MMX technology implements a high-performance technique that enhances the performance
of Intel Architecture microprocessors for media applications. The core algorithms in
these applications are compute intensive. These algorithms perform operations on a
large amount of data, use small data types, and provide many opportunities for
parallelism. These algorithms are a natural fit for SIMD architecture. MMX technology
defines a general purpose and easy-to-implement set of primitives to operate on
packed data types.

MMX technology, while delivering a performance boost to media applications, is fully
compatible with the existing application and operating system base.

MMX technology is general by design and can be applied to a variety of software media
problems. Some examples of this variety were described in this paper. Future
media-related software technologies for use on the Intranet and Internet should
benefit from MMX technology.

Pentium processors with MMX technology provide a significant performance boost
(approximately 4x for some of the kernels) for media applications. Performance gains
from the technology will scale well with an increased processor operating frequency
and future microarchitectures.


References:-

[1] A. Peleg, U. Weiser, MMX Technology Extension to the Intel Architecture, IEEE
Micro, Vol. 16, No. 4, August 1996, pp. 42-
