8/12/2019 ForntPAGE.doc
1/29
A
SEMINAR REPORT
ON
THE INTEL MMX TECHNOLOGY
Submitted in partial fulfillment for the requirement of the award for the
Degree of Bachelor in Technology
In
Electronics & Communication Engineering

Submitted By:
Renu Kanwar
B. Tech. (8th Sem.)

Submitted to:
Mr. Yogendra boti (H.O.D.)
Department of ECE

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING
MARWAR ENGINEERING COLLEGE & RESEARCH CENTRE, JODHPUR
(RAJASTHAN)
RAJASTHAN TECHNICAL UNIVERSITY, KOTA (RAJASTHAN)
(2010-2011)
Introduction:-
Intel's MMX™ technology [1, 2] is an extension to the basic Intel Architecture (IA)
designed to improve performance of multimedia and communication algorithms. The
technology includes new instructions and data types, which achieve new levels of
performance for these algorithms on host processors.
MMX technology exploits the parallelism inherent in many of these algorithms. Many
of these algorithms exhibit the property of "fixed" computation on a large
data set.
The definition of MMX technology evolved from earlier work in the i860™
architecture [3]. The i860 architecture was the industry's first general purpose
processor to provide support for graphics rendering. The i860 processor provided
instructions that operated on multiple adjacent data operands in parallel, for example,
four adjacent pixels of an image.
After the introduction of the i860 processor, Intel explored extending the i860
architecture in order to deliver high performance for other media applications, for
example, image processing, texture mapping, and audio and video decompression.
Several of these algorithms naturally lent themselves to SIMD processing. This effort
laid the foundation for similar support for Intel's mainstream general purpose
architecture, IA.
The MMX technology extension was the first major addition to the instruction set
since the Intel386™ architecture. Given the large installed software base for the IA, a
significant extension to the architecture required special attention to backward
compatibility and design issues.
MMX technology provides benefits to the end user by improving the performance of
multimedia-rich applications by a factor of 1.[...]
This paper provides insight into the process and considerations used to define the
MMX technology. It also provides specifics on MMX instructions that were added to
the IA, as well as the approach taken to add this significant capability without adding a
new software-visible architectural state.
The paper also presents application examples that show the usage and benefits of
MMX instructions. Data showing the performance benefits for the applications is
also presented.
ACKNOWLEDGEMENT
Theoretical knowledge is improved through seminar preparation, as it
contributes significantly to the student's understanding and gives
first-hand knowledge of the complexities of the engineering arena.
First, I would like to thank the Almighty and my parents, who gave me
their valuable support and blessings to complete this minor project. I
would also like to thank Er. V. K. Bhansali (Director, M.E.C.R.C., Jodhpur)
& Er. Yogendra boti (Head of Department, Electronics & Communication
Engineering) for their encouragement & appreciation.
An accomplishment of any significance depends on the synergy and
cooperation of resources, both material and human. I express my heartfelt
gratitude to all those who have contributed directly or indirectly to this
endeavor.
My first and foremost regards are for my family members, who
patiently and painstakingly helped me out in every way they could.
INDEX
S. No.  Content
01  Definition process
02  Basic concepts
03  Packed data format
04  Conditional execution
05  Saturating arithmetic
06  Fixed point arithmetic
07  Repositioning of data elements within packed data format
08  Data alignment
09  Features

[...] For example,
for a motion estimation algorithm, data is naturally organized in 16 rows, with each
row containing only 16 bytes of data. In this case, operating on more than 16 data
elements at a time will require reformatting the input data. Design considerations
involve issues such as the practical width of the data path and how many times
functional units will replicate.
Given that current Intel processors already have 64-bit data paths (for example,
floating-point data paths, as well as a data path between the integer register file and
memory subsystem due to dual load/store capability in the Pentium processor), we
chose the width of MMX data types to be 64 bits.
Conditional Execution:-
Operating on multiple data operands using a single instruction presents an interesting
issue. What happens when a computation is only done if the operand value passes
some conditional check? For example, in an absolute value calculation, only if the
number is already negative do we perform a 2's complement on it:
for i = 1, 100
if a[i] < 0 then b[i] = -a[i] else b[i] = a[i]
; Absolute value calculation
There are different approaches possible, and some are simpler than others. Using a
branch approach does not work well for two reasons: first, a branch-based solution is
slower because of the inherent branch misprediction penalty, and second, because of
the need to convert packed data types to scalars.
Direct conditional execution support does not work well for the IA, since it requires
three independent operands (source, source/destination, and predicate vector). Keeping
with the philosophy of performance and simplicity, we chose a simpler solution. The
basic idea was to convert a conditional execution into a conditional assignment.
Conditional assignment, in turn, can be implemented through different approaches. One
approach would be to provide the flexibility of specifying a dynamically generated
mask with an assignment instruction. Such an approach would have required defining
instructions with three operands (source, source/destination, and mask). Here also, we
adopted a solution that is more amenable to higher performance designs.
Compare operations in MMX technology result in a bit mask corresponding to the
length of the operands. For example, a compare operation operating on packed byte
operands produces byte-wide masks. These masks then can be used in conjunction with
logical operations to achieve conditional assignment.
Consider the following example:
If True
Ra := Rb else Ra := Rc
Let us say register Rx contains all 1's if the condition is true and all 0's if the
condition is false. Then we can compute Ra with the following logical expression:
Ra = (Rb AND Rx) OR (Rc ANDNOT Rx)
This approach works for operations with a register as the destination. Conditional
assignment to memory can be implemented as a sequence of load, conditional
assignment, and store. We rejected more efficient support for conditional stores for
two reasons: first, the support requires three source operands, which does not map well
to high-performance architectures, and second, the benefit
of such support is dependent on support from the platform for efficient partial
transfers.
The MMX instruction set contains a packed compare instruction that generates a bit
mask, enabling data-dependent calculations to be executed without branch instructions
and to be executed on several data elements in parallel. The bit mask result of the
packed compare instruction has all 1's in elements where the relation tested for is true
and all 0's otherwise (see Figure 1).
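The mask-based conditional assignment can be sketched in portable C. This is only an illustration, not MMX code: it emulates four 16-bit lanes with ordinary arrays, and the names `packed_cmpgt`, `packed_select` and `packed_abs` are hypothetical helpers introduced for this sketch. It applies the expression Ra = (Rb AND Rx) OR (Rc ANDNOT Rx) to the absolute value calculation, assuming a two's complement machine.

```c
#include <stdint.h>

#define LANES 4  /* four 16-bit words in one 64-bit packed value */

/* Emulates a packed signed "greater than" compare: each lane of the mask
   becomes all 1's (0xFFFF) where a > b and all 0's otherwise. */
void packed_cmpgt(const int16_t a[LANES], const int16_t b[LANES],
                  uint16_t mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFFu : 0x0000u;
}

/* Ra = (Rb AND Rx) OR (Rc ANDNOT Rx), applied lane by lane. */
void packed_select(const int16_t rb[LANES], const int16_t rc[LANES],
                   const uint16_t rx[LANES], int16_t ra[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint16_t bits = (uint16_t)(((uint16_t)rb[i] & rx[i]) |
                                   ((uint16_t)rc[i] & (uint16_t)~rx[i]));
        ra[i] = (int16_t)bits;  /* assumes two's complement conversion */
    }
}

/* Branch-free absolute value: b[i] = (a[i] < 0) ? -a[i] : a[i].
   Like a wrapping negate, -32768 maps to itself on typical compilers. */
void packed_abs(const int16_t a[LANES], int16_t b[LANES])
{
    const int16_t zero[LANES] = {0, 0, 0, 0};
    int16_t negated[LANES];
    uint16_t is_negative[LANES];

    for (int i = 0; i < LANES; i++)
        negated[i] = (int16_t)(0 - a[i]);

    packed_cmpgt(zero, a, is_negative);        /* mask is set where 0 > a[i] */
    packed_select(negated, a, is_negative, b); /* pick -a[i] or a[i] per lane */
}
```

Both candidate results are computed for every lane and the mask picks one, so no lane needs a branch.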
Saturating Arithmetic:-
Operand sizes typically used in multimedia are small (for example, 8 bits for
representing a color component). An 8-bit number allows only 2[...]
There may be cases where an application wants to examine the occurrence of an
overflow in a computation. Providing a flag to indicate this (i.e., indicating whether or
not the value was saturated) would have been desirable. However, we decided against
providing this flag, since we did not want to add any additional new states to the
architecture, to preserve the backward compatibility. Our analysis also showed that it
was not critical to provide this information in most applications. If needed, an
application can determine if saturation was encountered by comparing the result of a
computation with the maximum and minimum value; typically, saturation is the
correct behavior.
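To illustrate the difference between a wrap-around and a saturating result, and the comparison-based check mentioned above, here is a minimal C sketch for unsigned 8-bit lanes. The function names are hypothetical and the eight-element arrays merely stand in for one packed-byte operand; it is not the MMX implementation itself.

```c
#include <stdint.h>

/* Unsigned saturating add of one 8-bit lane: clamp to 255 instead of wrapping. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Wrap-around add of one 8-bit lane, shown only for comparison. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturating add over eight byte lanes. Returns nonzero if any lane equals
   the maximum value afterwards. Since there is no saturation flag, this
   comparison is how software can detect that saturation may have occurred;
   a lane that is exactly 255 without overflow is reported as well. */
int packed_add_sat_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    int hit_max = 0;
    for (int i = 0; i < 8; i++) {
        out[i] = add_sat_u8(a[i], b[i]);
        if (out[i] == UINT8_MAX)
            hit_max = 1;
    }
    return hit_max;
}
```

For a color component, 200 + 100 wraps to 44 (a dark value) but saturates to 255 (full brightness), which is why saturation is usually the behavior an image routine wants.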
Fixed-Point Arithmetic:-
Media applications involve working on fraction values, for example, the use of a
weighting coefficient in filtering, averaging, etc. One way to support operations on
fraction values is to provide SIMD operations for floating-point operands. However,
floating-point units are hardware intensive. Also, for several media applications, even
precision of 10 to 12 binary bits and dynamic range of 4 to 6 bits are sufficient.
Industry-standard floating-point (IEEE FP) requires a minimum of 23 bits of
precision. Looking at application requirements and the trade-off of performance and
design complexity leads to the use of a fixed-point arithmetic paradigm for several
media applications. Note that some of the computations may still require the dynamic
range and the precision supported by IEEE floating-point, for example, geometry
transformation for state-of-the-art 3D applications.
In fixed-point computation, from the point of view of the processor architecture,
computations are done on integer values, but programmers/applications interpret the
integer values as fraction values. Some number of leading bits (determined by the
application) are interpreted as an integer, while the remaining bits of the value are
interpreted as a fraction. It is the application's responsibility to perform appropriate
shifts in order to scale the number.
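The scaling responsibility described above can be made concrete with a small C sketch. The split of a 16-bit value into 8 integer bits and 8 fraction bits (`FRAC_BITS`) is an assumption chosen for illustration, since the architecture leaves the position of the fixed point to the application; the helper names are hypothetical.

```c
#include <stdint.h>

#define FRAC_BITS 8  /* assumed format: 8 integer bits, 8 fraction bits */

/* Convert a real value to fixed point by scaling with 2^FRAC_BITS. */
int16_t to_fixed(double x)
{
    return (int16_t)(x * (1 << FRAC_BITS));
}

/* Interpret a fixed-point integer as a real value again. */
double to_double(int16_t q)
{
    return (double)q / (1 << FRAC_BITS);
}

/* The processor only sees an integer multiply. The raw product carries
   2 * FRAC_BITS fraction bits, so the application shifts right by
   FRAC_BITS to return to the original format. Non-negative operands are
   assumed here, because right-shifting a negative value is
   implementation-defined in C. */
int16_t fixed_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> FRAC_BITS);
}
```

With this format, 1.5 is stored as 384 and 0.25 as 64; their integer product 24576 shifted right by 8 gives 96, which reads back as 0.375.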
Repositioning of Data Elements Within Packed Data Format:-
The packed data format presents one other issue. There are several cases where
elements of packed data may be required to be repositioned within the packed data, or
the elements of two packed data operands may need to be merged. There are cases
where either input or the desired output representation of a data may not be ideal for
maximizing computation throughput. For example, it may be preferable to compute on
color components of a pixel in "planar format" while the input may be in "packed
format."
There are also situations where one needs to perform intermediate computations in
wider format (perhaps packed word format), while the result is presented in
packed byte format.
In the above cases, there is a need to extract some elements of a packed data type and
write them into a different position in the packed result.
One general solution to this issue is to provide an instruction that takes two packed
data operands and allows merging of their bytes in any arbitrary order into the
destination packed data operand. However, such a general solution is expensive to
implement. This solution essentially will require a full crossbar connection.
In the MMX technology architecture, we defined an instruction that requires a
relatively easy swizzle network, and yet allows the efficient repositioning and
combining of elements from packed data operands in most cases.
The instruction unpack takes two packed data operands and merges them as shown in
Figure 2.
The unpack instruction can be used for a variety of efficient repositioning of data
elements, including data replication, within packed data. For example, consider
converting a color representation from packed form (i.e., for each pixel, four
consecutive bytes represent R, G, B, and Alpha values) to planar format (i.e., four
consecutive bytes represent the red component of four consecutive pixels).
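As a rough illustration of what an unpack-style merge does, the following C sketch interleaves the four low-order bytes of two 64-bit operands. It is modeled on the byte form of the "unpack low" operation; the function name `unpack_low_bytes` is hypothetical, and the 64-bit integers merely stand in for MMX registers.

```c
#include <stdint.h>

/* Interleave the four low-order bytes of a and b. Listed from the least
   significant byte upward, the result is a0, b0, a1, b1, a2, b2, a3, b3. */
uint64_t unpack_low_bytes(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t from_a = (a >> (8 * i)) & 0xFFu;
        uint64_t from_b = (b >> (8 * i)) & 0xFFu;
        result |= from_a << (16 * i);      /* even byte positions come from a */
        result |= from_b << (16 * i + 8);  /* odd byte positions come from b */
    }
    return result;
}
```

Two common uses follow from the same primitive: unpacking against zero widens bytes into words for higher-precision intermediate results, and unpacking an operand with itself replicates each byte.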
Data Alignment:-
Use of packed data also presents data alignment issues. In some cases, the data may be
aligned on its natural boundary and not on the size of the packed data operand. For
example, in a motion estimation routine, the 16x16
block is aligned at an arbitrary byte boundary and not at a 64-bit boundary. Therefore,
in some cases there is a need to support efficient access of unaligned data for media
applications. One approach is to support unaligned
accesses directly in hardware, which generally does not work well with the
high-performance cache design. Alternatively, one can limit memory accesses to aligned
data and extract out the desired data from the accessed data using explicit instructions.
MMX technology includes logical shift-left and shift-right operations on 64 bits.
These instructions enable using a sequence of Shift left, Shift right, and Or operations
to assemble the desired bytes from the aligned data that encompasses the desired bytes.
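The shift-and-OR sequence can be sketched in C as follows. The sketch assumes little-endian byte numbering and that the two aligned 64-bit words surrounding the wanted data have already been loaded; `extract_unaligned` is a hypothetical name, not an MMX instruction.

```c
#include <stdint.h>

/* lo is the aligned 64-bit word at the lower address and hi is the next
   aligned word. offset (0..7) is the byte position inside lo at which the
   wanted 8 bytes start. The result is assembled with one right shift,
   one left shift and one OR. */
uint64_t extract_unaligned(uint64_t lo, uint64_t hi, unsigned offset)
{
    if (offset == 0)
        return lo;  /* already aligned; also avoids an undefined shift by 64 */
    return (lo >> (8 * offset)) | (hi << (8 * (8 - offset)));
}
```

Only aligned loads touch memory in this scheme, so the cache never has to serve an access that straddles a 64-bit boundary.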
Features:-
MMX technology features include:
- New data types built by packing independent data elements together into one
  register.
- An enhanced instruction set that operates on all independent data elements in a
  register, using parallel SIMD fashion.
- New 64-bit MMX registers that are mapped on the IA floating-point registers.
- Full IA compatibility.
New Data Types:-
MMX technology introduces four new data types: three packed data types and a new
64-bit entity. Each element within the packed data types is an independent fixed-point
integer. The architecture does not specify the place of the
fixed point within the elements, because it is the user's responsibility to control its
place within each element throughout the calculation. This adds a burden on the user,
but it also leaves a large amount of flexibility to choose and change the precision of
fixed-point numbers during the course of the application in order to fully control the
dynamic range of values.
The following four data types are defined (see Figure 3):
- Packed byte: 8 bytes packed into 64 bits
- Packed word: 4 words packed into 64 bits
- Packed double word: 2 double words packed into 64 bits
- Packed quad word: 64 bits
Enhanced Instruction Set:-
MMX technology defines a rich set of instructions that perform parallel operations on
multiple data elements packed into 64 bits (8x8-bit, 4x16-bit, or 2x32-bit fixed-point
integer data elements). We view the MMX technology instruction set as an extension
of the basic operations one would perform on a single datum in the
SIMD domain. Instructions that operate on packed bytes were defined to support
frequent image operations that involve 8-bit pixels or one of the 8-bit color
components of 24/32-bit pixels (Red, Green, Blue, Alpha channel). We
defined full support for packed word (16-bit) data types. This is because we found
16-bit data to be a frequent data type in many multimedia algorithms (e.g., MODEM,
Audio), and it serves as the higher precision backup for operations on byte data.
A basic instruction set is provided for packed doubleword data types to support
operations that need intermediate higher precision than 16 bits and a variety of 3D
graphics algorithms. Because MMX technology is a 64-bit capability, new instructions
to support 64 bits were added, such as 64-bit memory moves or 64-bit logical
operations.
Overall [...]
Table 1 summarizes the instructions introduced by MMX technology:
As the MMX registers are mapped over the floating-point registers, applications that
use MMX technology have 16 registers to use. Eight are the MMX registers, each 64
bits in size, that hold packed data, and eight are integer registers, which can be used for
different operations like addressing, loop control, or any other data manipulation.
MMX data values reside in the low-order 64 bits (the mantissa) of the IA 80-bit
floating-point registers (see Figure 4).
The exponent field of the corresponding floating-point register (bits 64-78) and the
sign bit (bit 79) are set to ones (1's), making the value in the register a NaN (Not a
Number) or infinity when viewed as a floating-point value. This helps to reduce
confusion by ensuring that an MMX data value will not look like a valid floating-point
value. MMX instructions only access the low-order 64 bits of the floating-point
registers and are not affected by the fact that they operate on invalid floating-point
values.
The dual usage of the floating-point registers does not preclude applications from
using both MMX code and floating-point code. Inside the application, the MMX
code and floating-point code should be encapsulated in separate code sequences. After
one sequence completes, the floating-point state is reset and the next sequence can
start. The need to use floating-point data and MMX (fixed-point integer) data at the
same time is infrequent.
At a given time in an application, data being operated upon is usually of one type.
This enabled us to use the floating-point registers to store the MMX technology values
and achieve our full backward compatibility goal.
Preserving Full Backward Compatibility:-
One of the important requirements for MMX technology was to enable use of MMX
instructions in applications without requiring any changes in the IA system software.
An additional requirement was that an application should be able to utilize
performance benefits of MMX technology in a seamless fashion, i.e., it should be able
to employ MMX instructions in part of the application
without requiring the whole of the application to be MMX technology-aware.
Primary backward compatibility requirements and their implications are:
- Applications using MMX instructions should work on all existing multitasking
  and non-multitasking operating systems. This requires that MMX technology
  should not add any new architecturally visible states or events (exceptions).
- Existing applications that do not use MMX instructions should run unchanged.
  This requires that MMX technology should not redefine the behavior of any
  existing IA 32-bit instructions. Only those undefined opcodes that are not relied
  on for causing illegal exceptions by existing software should be used to define
  MMX instructions. Also, MMX instructions should only affect the IA 32-bit
  state when in use.
- Existing applications should be able to utilize MMX technology without being
  required to make the whole application MMX technology-aware. It should be
  possible to employ MMX instructions within a procedure in an existing
  application without requiring any changes in the rest of the application. This
  requires that MMX instructions work well within the context of existing IA
  calling conventions for procedure calls.
- It should be possible to run an application even on an older generation of
  processors that does not support MMX technology. Using dynamically linked
  libraries (DLLs) for MMX and non-MMX technology processors is an easy way
  to do this.
- MMX instructions should be semantically compatible with other IA
  instructions, i.e., it should be easy to support new MMX instructions in existing
  assemblers. They should also have minimal impact on the instruction decoder.
  Another aspect of this is that MMX instructions should not require
  programmers to think in new ways regarding the basic behavior of instructions.
  For example, addressing modes and the availability of operations with memory
  should conceptually work the same.
No New State:-
The MMX technology state overlaps with the Floating-Point state. Overlapping the
MMX state with the FP stack presented an interesting challenge. For performance
reasons, as well as for ease of implementation for some microarchitectures, we wanted
to allow the accessing of the MMX registers in a flat register model. We needed to
enable overlapping MMX registers with the FP stack while still allowing a flat register
access model for MMX instructions. This was accomplished by enforcing a fixed
relationship between the logical and physical registers for the FP stack when accessed
via MMX instructions. Additionally, every MMX instruction makes the whole MMX
register file valid. This is different from the floating-point stack model, where new
stack entries are made valid only if the instruction specifies a "push" operation.
MMX instructions themselves do not update FP instruction state registers (for
example, FP opcode FOP, FP Data selector FDS, FP IP FIP, etc.). The FP instruction
state is used only by FP exception handlers. Since MMX instructions do not create any
computation exceptions, this state is really not meaningful for MMX instructions.
Additionally, not updating these states eliminates the complexity of maintaining this
state for MMX technology implementations. Therefore, we made a decision to let the
FP instruction state register point to the last FP instruction executed, even though
future MMX instructions will update the FP stack and TAG register. Eventually, when
an FP instruction is executed, all of the FP instruction state gets updated. Therefore,
FP exception handlers always see consistent FP instruction state.
No New Exceptions:-
MMX instructions can be viewed as new non-IEEE floating-point instructions that do
not generate computation exceptions. However, similar to FP instructions, they do
report any pending FP exceptions. For compatibility with existing software, it is
critical that any pending FP exception is reported to the software prior to execution of
any MMX instruction which could update the FP state.
At the point of raising the pending FP exception, the FP exception state still points to
the last FP instruction creating the FP condition. Therefore, the fact that the exception
gets reported by an MMX instruction instead of an FP instruction is transparent to the
FP exception handler.
Additional exceptions that are pertinent to MMX
technology are memory exceptions, device-not-available (DNA - INT 7) exceptions,
and FP emulation exceptions.
Handling of memory exceptions in general does not depend on the opcode of the
instruction causing the exception. Therefore, MMX technology exceptions do not
cause a malfunction of any memory access-related exception handler. Our extensive
compatibility verification validated this further.
A DNA exception is caused when the TS bit in CR0 is set and any other instruction
that could modify the FP state is issued. This includes execution of an MMX
instruction when the TS bit is set. In this case, similar to the FP case, a DNA exception
is invoked. The response to this exception is to save the FP state and free it up for use
by future FP/MMX instructions. This exception handler also does not have a use for
the opcode of the instruction causing this exception.
When the CR0.EM bit is set, a floating-point instruction causes an FP emulation
exception. In this case, instead of using FP hardware, FP functionality is supported via
software emulation. Since the MMX technology architecture state overlaps with the
FP architecture state, the issue arises as to the correct behavior for MMX instructions
when the CR0.EM bit is set.
Causing an emulation exception for MMX instructions when CR0.EM is set is not the
right behavior, since the existing FP emulator does not know about MMX
instructions. Therefore, the first natural choice seemed to be to ignore CR0.EM for MMX
technology. However, this choice has a problem. Ignoring CR0.EM for MMX
instructions would result in two separate contexts for the FP Stack and TAG words:
one context in the emulator memory for FP and one context in the hardware for MMX
instructions. This leads to an architectural inconsistency between the cases when
CR0.EM is set and when it is not set.
We had to find some other logical way to deal with this without defining any new
exceptions. We chose to define the CR0.EM = 1 case to result in an illegal opcode
exception. Thus, essentially, when CR0.EM is set, the MMX technology architecture
extension is disabled.
Choice of Opcodes for MMX Instructions:-
The MMX instruction opcodes were chosen after extensive analysis of the undefined
opcode map. We had to make sure that the available opcodes were really unused. This
required ensuring that no software was relying on the illegal opcode fault behavior of
these opcodes. Intel was already working with software vendors to ensure that they
relied only on one specific encoding, 0FFF, to cause an illegal opcode fault. Other
encodings may cause an illegal exception fault in future implementations.
Except for a few cases, we found that software was using only the prescribed encoding
for causing a program-controlled invalid opcode fault.
Only address prefixes are defined to be meaningful for MMX instructions. Use of a
Repeat, Lock, or Data prefix is illegal for MMX instructions. The address prefix has
the same behavior as for any other instruction.
Use of FP DLL Model for MMX Code:-
To enable common multimedia applications for processors with and without MMX
technology, we chose to promote the Dynamic Linked Library (DLL) model as
the primary model to support MMX instructions.
In the DLL model, depending upon whether the processor provides MMX technology
support in hardware (the processor CPUID provides this information), the appropriate
version of the media library function is linked dynamically.
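The selection step can be sketched in C as a function-pointer dispatch. The sketch relies on one documented fact, that CPUID leaf 1 reports MMX support in bit 23 of EDX; everything else is an assumption made for illustration. The EDX value is passed in as a parameter so the example stays portable, the routine names are hypothetical, and `dot_mmx_stub` is only a placeholder standing in for an MMX-optimized library routine.

```c
#include <stdint.h>

#define CPUID_EDX_MMX_BIT 23u  /* CPUID leaf 1: EDX bit 23 reports MMX support */

typedef int32_t (*dot_fn)(const int16_t *a, const int16_t *b, int n);

/* Plain scalar version, usable on any processor. */
int32_t dot_scalar(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Placeholder for the MMX-optimized version. A real library would contain
   MMX instructions here; this stub just reuses the scalar code. */
int32_t dot_mmx_stub(const int16_t *a, const int16_t *b, int n)
{
    return dot_scalar(a, b, n);
}

/* Test the MMX feature flag in the EDX value returned by CPUID leaf 1. */
int has_mmx(uint32_t cpuid_leaf1_edx)
{
    return (int)((cpuid_leaf1_edx >> CPUID_EDX_MMX_BIT) & 1u);
}

/* Choose the routine once; callers then use the returned pointer. */
dot_fn select_dot(uint32_t cpuid_leaf1_edx)
{
    return has_mmx(cpuid_leaf1_edx) ? dot_mmx_stub : dot_scalar;
}
```

In the DLL model the same decision is made at load time by picking which library to link, so the calling application does not need to know which version it received.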
MMX technology DLLs suggest the same guidelines as those of FP DLLs. The primary
guidelines are:
- At the end of a DLL, leave the floating-point registers in the correct state for the
  calling procedure. This generally means leaving the floating-point stack
  empty, unless a procedure has a return value. This also means that the caller
  should check for and handle any FP exceptions that it might have generated.
- Do not assume that the floating-point state remains the same across procedures.
  The callee can typically assume that at entry the FP stack is empty, unless there
  is some set convention for parameter passing.
Note that nothing in the MMX technology architecture depends on these guidelines
for functional correctness.
MMX technology can be used in any other usage models. MMX technology provides
an instruction to clear all of the FP state with a single instruction (the EMMS
instruction). If some DLL is written to return with the FP stack only partially empty,
one needs to use a combination of EMMS and floating-point loads to create the correct
FP stack state.
Clean the state of MMX with the EMMS instruction.
Performance Advantage:-
We will analyze the performance enhancement due to MMX technology through an
example of a matrix-vector multiplication, very much like the one in Figure [...]
multimedia and communications applications used in basic mathematical primitives
like matrix multiply and filters.
A multiply-accumulate operation (MAC) is defined as the product of two operands
added to a third operand (the accumulator). This operation requires two loads
(operands of the multiplication operation), a multiply, and an add (to the
accumulator). MMX technology does not support three-operand instructions;
therefore, it does not have a full MAC capability. On the other hand, the packed
multiply-add instruction (PMADDWD) is defined, which computes four 16-bit x
16-bit multiplies generating four 32-bit products and does two 32-bit adds (out of the
four needed). A separate packed add double word (PADDD) adds the two 32-bit results
of the packed multiply-add to another MMX register, which is used as an accumulator.
For this performance example, we will assume both input vectors to be the length of
16 elements, each element in the vectors being signed 16 bits. Accumulation will be
performed in 32-bit precision. The Pentium processor, for example, would have to
process each of the operations one at a time in a sequential fashion. This amounts to
32 loads, 16 multiplies, and 15 additions, a total of 63 instructions. Assuming we
perform 4 MACs (out of the 16) per iteration, we need to add 12 instructions for loop
control (3 instructions per iteration: increment, compare, branch) and one instruction
for storing the result. The total is 76 instructions. Assuming all data and instructions
are in the on-chip caches and that exiting the loop will incur one branch misprediction,
the integer assembly optimized version of this code (utilizing both pipelines) takes just
over 200 cycles on a Pentium processor microarchitecture. The cycle count is
dominated by the integer multiply being a non-pipelined 11-cycle operation. Under the
same conditions, but assuming the data is in a floating-point format, the floating-point
optimized assembly version executes in 74 cycles. The floating-point version is faster
(assuming the data is in floating-point format), since the floating-point multiply
takes three cycles to execute and is a pipelined unit.
MMX technology, on the other hand, computes four elements at a time. This reduces
the instruction count to eight loads, four PMADDWD instructions, three PADDD
instructions, one store instruction, and three additional instructions (overhead due to
packed data types), totaling 19 instructions. Performing loop unrolling of four
PMADDWD instructions eliminates the need to insert any loop control instructions.
This is because four PMADDWDs already perform all the 16 required MACs. The
MMX instruction count is four times less than when using integer or floating-point
operations! With the same assumptions as above, on a Pentium processor with MMX
technology, an MMX technology-optimized assembly version of the code, utilizing
both pipelines, will execute in only 12 cycles.
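The data flow of the 16-element dot product can be sketched in portable C. The two helpers emulate the arithmetic of PMADDWD and PADDD on small arrays; their names and the array-based representation of a packed operand are assumptions for illustration, and the sketch says nothing about cycle counts. Inputs are assumed small enough that the 32-bit sums do not overflow.

```c
#include <stdint.h>

/* Emulates packed multiply-add on four signed 16-bit lanes:
   out[0] = a0*b0 + a1*b1 and out[1] = a2*b2 + a3*b3, each 32 bits wide. */
void pmaddwd_emul(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Emulates packed add on two 32-bit lanes, used here as the accumulator step. */
void paddd_emul(int32_t acc[2], const int32_t x[2])
{
    acc[0] += x[0];
    acc[1] += x[1];
}

/* 16-element dot product: four multiply-add steps cover all 16 MACs.
   The loop starts from a zero accumulator for clarity, whereas an unrolled
   version would use the first multiply-add result as the accumulator and
   so need only three packed adds. */
int32_t dot16(const int16_t a[16], const int16_t b[16])
{
    int32_t acc[2] = {0, 0};
    for (int i = 0; i < 16; i += 4) {
        int32_t partial[2];
        pmaddwd_emul(&a[i], &b[i], partial);
        paddd_emul(acc, partial);
    }
    return acc[0] + acc[1];  /* final add of the two 32-bit halves */
}
```

Each multiply-add step retires four multiplies and two of the four additions, which is where the drop from 63 arithmetic and load instructions to a handful of packed instructions comes from.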
Continuing the above example, assume a 16x16 matrix is multiplied by a 16-element
vector. This operation is built of 16 Vector-Dot-Products (VDP) of length 16.
Repeating the same exercise as before, and assuming a loop unrolling that performs
four VDPs each iteration, the regular Pentium processor code will total
4*(4*76+3) = 1228 instructions. Using MMX technology will require
4*(4*19+3) = 316 instructions. The MMX instruction count is 3.9 times less than
when using regular operations. The best regular code implementation (floating-point
optimized version) takes just under 1200 cycles to complete, in comparison to 207
cycles for the MMX code version.
Intel has introduced two processor families with MMX technology: the Pentium
processor with MMX technology and the Pentium II processor. The performance of
both processors was compared on the Intel Media Benchmark (IMB) [...] Figure 6 and
Table 2 compare the Pentium processor with MMX technology and the
Pentium II processor against the Pentium processor and the Pentium Pro processor.
The floating point registers:-
1. Floating point is processed by eight 80-bit registers, ST(0), ST(1), ... ST(7), in
the floating point unit.
2. When doing floating point arithmetic, these registers are organized in a
stack.
3. Programming floating point is quite different from programming integer
arithmetic.
4. Floating point calculations are done using 80 bits even when the program
specifies storing 32- or 64-bit data values.
Advantages of using the floating point registers in MMX:-
1. The registers already exist. Only logic had to be added to the chip.
2. The operating system already knows about the floating point registers.
3. When a computer is switched from one program to another, the state
(registers) of the current program must be saved so the state can be restored
when the program becomes the active program once again.
4. The floating point registers are automatically saved as part of the state of a
program.
Conclusion:-
MMX technology implements a high-performance technique that enhances the
performance of Intel Architecture microprocessors for media applications. The core
algorithms in these applications are compute intensive. These algorithms perform
operations on a large amount of data, use small data types, and provide many
opportunities for parallelism. These algorithms are a natural fit for SIMD architecture.
MMX technology defines a general purpose and easy-to-implement set of primitives
to operate on packed data types.
MMX technology, while delivering a performance boost to media applications, is fully
compatible with the existing application and operating system base.
MMX technology is general by design and can be applied to a variety of software
media problems. Some examples of this variety were described in this paper. Future
media-related software technologies for use on the Intranet and Internet should benefit
from MMX technology.
Pentium processors with MMX technology provide a significant performance boost
(approximately 4x for some of the kernels) for media applications. Performance gains
from the technology will scale well with an increased processor operating frequency
and future microarchitectures.
References:-
[1] A. Peleg, U. Weiser, MMX Technology Extension to the
Intel Architecture, IEEE Micro, Vol. 16, No. 4, August
1996, pp. 42-