276
Advanced Micro Devices AMD64 Technology AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions Publication No. Revision Date 43479 3.04 November 2009

128-Bit and 256-Bit XOP and FMA4 Instructions

  • Upload
    vanphuc

  • View
    250

  • Download
    1

Embed Size (px)

Citation preview

Page 1: 128-Bit and 256-Bit XOP and FMA4 Instructions

Advanced Micro Devices

AMD64 Technology

AMD64 ArchitectureProgrammer’s Manual

Volume 6:128-Bit and 256-Bit

XOP and FMA4 Instructions

Publication No. Revision Date

43479 3.04 November 2009

Page 2: 128-Bit and 256-Bit XOP and FMA4 Instructions

TrademarksAMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.MMX is a trademark of Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of theirrespective companies.

©2009 Advanced Micro Devices, Inc. All rights reserved.

The contents of this document are provided in connection with Advanced MicroDevices, Inc. (“AMD”) products. AMD makes no representations or warranties withrespect to the accuracy or completeness of the contents of this publication andreserves the right to make changes to specifications and product descriptions atany time without notice. The information contained herein may be of a preliminaryor advance nature and is subject to change without notice. No license, whetherexpress, implied, arising by estoppel or otherwise, to any intellectual property rightsis granted by this publication. Except as set forth in AMD’s Standard Terms andConditions of Sale, AMD assumes no liability whatsoever, and disclaims anyexpress or implied warranty, relating to its products including, but not limited to, theimplied warranty of merchantability, fitness for a particular purpose, or infringementof any intellectual property right.

AMD’s products are not designed, intended, authorized or warranted for use ascomponents in systems intended for surgical implant into the body, or in other appli-cations intended to support or sustain life, or in any other application in which thefailure of AMD’s product could create a situation where personal injury, death, orsevere property or environmental damage may occur. AMD reserves the right todiscontinue or make changes to its products at any time without notice.

Page 3: 128-Bit and 256-Bit XOP and FMA4 Instructions

3

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Contents

Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11

1 New 128-Bit and 256-Bit Instructions . . . . . . . . . . . . . . . . . . . . . . . . . .251.1 New Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .251.2 Opcode Byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .281.3 Destination XMM registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .291.4 Four-Operand Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .291.5 Three-Operand Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .291.6 Two Operand Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .301.7 XOP Integer Multiply (Add) and Accumulate

Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .311.8 Packed Integer Horizontal Add and Subtract . . . . . . . . . . . . . . . . . . . . .331.9 Vector Conditional Moves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .341.10 Packed Integer Rotates and Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .341.11 Packed Integer Comparison and Predicate Generation . . . . . . . . . . . . . .351.12 Fraction Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .36

2 AMD XOP and FMA4 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . .392.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .392.2 Operand Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .402.3 Instruction Reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41

VFMADDPD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .42VFMADDPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46VFMADDSD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50VFMADDSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53VFMADDSUBPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .56VFMADDSUBPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60VFMSUBADDPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .64VFMSUBADDPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .68VFMSUBPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72VFMSUBPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76VFMSUBSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80VFMSUBSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .84VFNMADDPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88VFNMADDPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91VFNMADDSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .94VFNMADDSS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .97VFNMSUBPD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .100VFNMSUBPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .104VFNMSUBSD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .108VFNMSUBSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112VFRCZPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .116VFRCZPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .119VFRCZSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .122

Page 4: 128-Bit and 256-Bit XOP and FMA4 Instructions

4

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

AMD Confidential-Advance Information

VFRCZSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .126VPCMOV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .130VPCOMB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133VPCOMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .136VPCOMQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .139VPCOMUB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .142VPCOMUD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .145VPCOMUQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .148VPCOMUW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151VPCOMW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154VPERMIL2PD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .157VPERMIL2PS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .163VPHADDBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .169VPHADDBQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .171VPHADDBW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .173VPHADDDQ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .175VPHADDUBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177VPHADDUBQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .179VPHADDUBW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .181VPHADDUDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183VPHADDUWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .185VPHADDUWQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .187VPHADDWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .189VPHADDWQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .191VPHSUBBW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .193VPHSUBDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .195VPHSUBWD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .197VPMACSDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .199VPMACSDQH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .202VPMACSDQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .205VPMACSSDD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .208VPMACSSDQH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211VPMACSSDQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .214VPMACSSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .217VPMACSSWW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .220VPMACSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .223VPMACSWW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .226VPMADCSSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .229VPMADCSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .232VPPERM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .235VPROTB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .239VPROTD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .242VPROTQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .245VPROTW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .248VPSHAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .251VPSHAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254VPSHAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .257

Page 5: 128-Bit and 256-Bit XOP and FMA4 Instructions

5

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

VPSHAW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260VPSHLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .263VPSHLD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .266VPSHLQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .269VPSHLW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .272

Page 6: 128-Bit and 256-Bit XOP and FMA4 Instructions

6

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

AMD Confidential-Advance Information

Page 7: 128-Bit and 256-Bit XOP and FMA4 Instructions

Tables 7

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Tables

Table 1-1. VEX.pp Prefix Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28Table 1-2. Operand Element Size—OES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29Table 1-3. Operand Configurations for FMA4,V PCMOV and VPPERM Instructions29Table 1-4. Operand Configurations for Three Operand Instructions . . . . . . . . . . . . . .30Table 1-5. Immediate Operand Values for Unsigned Vector Comparison Operations35Table 2-1. VPCOMB Comparison Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133Table 2-2. VPCOMD Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .136Table 2-3. VPCOMQ Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .139Table 2-4. VPCOMUB Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . .142Table 2-5. VPCOMUD Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . .145Table 2-6. VPCOMUQ Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . .148Table 2-7. VPCOMUW Comparison Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . .151Table 2-8. VPCOMW Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154Table 2-9. Selector and Source Selected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .158Table 2-10. Interaction of Selector Match Bit and Immediate Operand Match Field .159Table 2-11. Selector and Source Selected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .164Table 2-12. Interaction of Selector Match Bit and Immediate Operand Match Field .165Table 2-13. VPPERM Control Byte. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .236

Page 8: 128-Bit and 256-Bit XOP and FMA4 Instructions

8 Tables

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

Page 9: 128-Bit and 256-Bit XOP and FMA4 Instructions

Revision History 9

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Revision History

Date Revision Description

November 2009 3.04Removed #UD CR0.EM exception from all exception tables.Added VPERMIL2PD and VPERMIL2PS instructions.Corrected many small factual errors and typos.

Page 10: 128-Bit and 256-Bit XOP and FMA4 Instructions

10 Revision History

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

Page 11: 128-Bit and 256-Bit XOP and FMA4 Instructions

11

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Preface

About This BookThe instructions described in this book are part of a multivolume work entitled the AMD64Architecture Programmer’s Manual. The following table lists each volume and its order number.

AudienceThis document is intended for all programmers writing application or system software for a processorthat implements the AMD64 architecture.

OrganizationVolumes 3 through 6 describe the AMD64 architecture’s instruction set in detail. Together, they covereach instruction’s mnemonic syntax, opcodes, functions, affected flags, and possible exceptions.

The AMD64 instruction set is divided into seven subsets:

• General-purpose instructions

• System instructions

• 128-bit media instructions

• 64-bit media instructions

• x87 floating-point instructions

• 128-bit and 256-bit XOP media instructions

Several instructions belong to—and are described identically in—multiple instruction subsets.

Title Order No.

Volume 1: Application Programming 24592

Volume 2: System Programming 24593

Volume 3: General-Purpose and System Instructions 24594

Volume 4: 128-Bit Media Instructions 26568

Volume 5: 64-Bit Media and x87 Floating-Point Instructions 26569

Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions 43479

Page 12: 128-Bit and 256-Bit XOP and FMA4 Instructions

12

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

This volume describes the 128-bit and 256-bit XOP and FMA4 instruction extensions. The index at theend cross-references topics within this volume. For other topics relating to the AMD64 architecture,and for information on instructions in other subsets, see the tables of contents and indexes of the othervolumes.

DefinitionsMany of the following definitions assume an in-depth knowledge of the legacy x86 architecture. See “Related Documents” on page 22 for descriptions of the legacy x86 architecture.

Terms and Notation

In addition to the notation described below, “Opcode-Syntax Notation” in Volume 3 describes notation relating specifically to opcodes.

1011bA binary value—in this example, a 4-bit value.

F0EAhA hexadecimal value—in this example a 2-byte value.

[1,2)A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this case, 2).

7–4A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.

128-bit media instructionsInstructions that use the 128-bit XMM registers. These are a combination of the SSE and SSE2 instruction sets.

64-bit media instructionsInstructions that use the 64-bit MMX registers. These are primarily a combination of MMX™ and 3DNow!™ instruction sets, with some additional instructions from the SSE and SSE2 instruction sets.

16-bit modeLegacy mode or compatibility mode in which a 16-bit address size is active. See legacy mode and compatibility mode.

32-bit modeLegacy mode or compatibility mode in which a 32-bit address size is active. See legacy mode and compatibility mode.

Page 13: 128-Bit and 256-Bit XOP and FMA4 Instructions

13

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

64-bit modeA submode of long mode. In 64-bit mode, the default address size is 64 bits and new features, such as register extensions, are supported for system and application software.

#GP(0)Notation indicating a general-protection exception (#GP) with error code of 0.

absoluteSaid of a displacement that references the base of a code segment rather than an instruction pointer. Contrast with relative.

ASIDAddress space identifier.

biased exponentThe sum of a floating-point value’s exponent and a constant bias for a particular floating-point data type. The bias makes the range of the biased exponent always positive, which allows reciprocation without overflow.

byteEight bits.

clearTo write a bit value of 0. Compare set.

compatibility modeA submode of long mode. In compatibility mode, the default address size is 32 bits, and legacy 16-bit and 32-bit applications run without modification.

commitTo irreversibly write, in program order, an instruction’s result to software-visible storage, such as a register (including flags), the data cache, an internal write buffer, or memory.

CPLCurrent privilege level.

CR0–CR4A register range, from register CR0 through CR4, inclusive, with the low-order register first.

CR0.PE = 1Notation indicating that the PE bit of the CR0 register has a value of 1.

directReferencing a memory location whose address is included in the instruction’s syntax as an immediate operand. The address may be an absolute or relative address. Compare indirect.

Page 14: 128-Bit and 256-Bit XOP and FMA4 Instructions

14

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

dirty dataData held in the processor’s caches or internal buffers that is more recent than the copy held in main memory.

displacementA signed value that is added to the base of a segment (absolute addressing) or an instruction pointer (relative addressing). Same as offset.

doublewordTwo words, or four bytes, or 32 bits.

double quadwordEight words, or 16 bytes, or 128 bits. Also called octword.

DS:rSIThe contents of a memory location whose segment address is in the DS register and whose offset relative to that segment is in the rSI register.

EFER.LME = 0Notation indicating that the LME bit of the EFER register has a value of 0.

effective address sizeThe address size for the current instruction after accounting for the default address size and any address-size override prefix.

effective operand sizeThe operand size for the current instruction after accounting for the default operand size and any operand-size override prefix.

elementSee vector.

exceptionAn abnormal condition that occurs as the result of executing an instruction. The processor’s response to an exception depends on the type of the exception. For all exceptions except 128-bit media SIMD floating-point exceptions and x87 floating-point exceptions, control is transferred to the handler (or service routine) for that exception, as defined by the exception’s vector. For floating-point exceptions defined by the IEEE 754 standard, there are both masked and unmasked responses. When unmasked, the exception handler is called, and when masked, a default response is provided instead of calling the handler.

FF /0Notation indicating that FF is the first byte of an opcode, and a subfield in the second byte has a value of 0.

Page 15: 128-Bit and 256-Bit XOP and FMA4 Instructions

15

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

flushAn often ambiguous term meaning (1) writeback, if modified, and invalidate, as in “flush the cache line,” or (2) invalidate, as in “flush the pipeline,” or (3) change a value, as in “flush to zero.”

GDTGlobal descriptor table.

GIFGlobal interrupt flag.

IDTInterrupt descriptor table.

IGNIgnore. Field is ignored.

indirectReferencing a memory location whose address is in a register or other memory location. The address may be an absolute or relative address. Compare direct.

IRBThe virtual-8086 mode interrupt-redirection bitmap.

ISTThe long-mode interrupt-stack table.

IVTThe real-address mode interrupt-vector table.

LDTLocal descriptor table.

legacy x86The legacy x86 architecture. See “Related Documents” on page 22 for descriptions of the legacy x86 architecture.

legacy modeAn operating mode of the AMD64 architecture in which existing 16-bit and 32-bit applications and operating systems run without modification. A processor implementation of the AMD64 architecture can run in either long mode or legacy mode. Legacy mode has three submodes, real mode, protected mode, and virtual-8086 mode.

long modeAn operating mode unique to the AMD64 architecture. A processor implementation of the AMD64 architecture can run in either long mode or legacy mode. Long mode has two submodes, 64-bit mode and compatibility mode.

Page 16: 128-Bit and 256-Bit XOP and FMA4 Instructions

16

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

lsbLeast-significant bit.

LSBLeast-significant byte.

main memoryPhysical memory, such as RAM and ROM (but not cache memory) that is installed in a particular computer system.

mask(1) A control bit that prevents the occurrence of a floating-point exception from invoking an exception-handling routine. (2) A field of bits used for a control purpose.

MBZMust be zero. If software attempts to set an MBZ bit to 1, a general-protection exception (#GP) occurs.

memoryUnless otherwise specified, main memory.

ModRMA byte following an instruction opcode that specifies address calculation based on mode (Mod), register (R), and memory (M) variables.

moffsetA 16, 32, or 64-bit offset that specifies a memory operand directly, without using a ModRM or SIB byte.

msbMost-significant bit.

MSBMost-significant byte.

multimedia instructionsA combination of 128-bit media instructions and 64-bit media instructions.

octwordSame as double quadword.

offsetSame as displacement.

Page 17: 128-Bit and 256-Bit XOP and FMA4 Instructions

17

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

overflowThe condition in which a floating-point number is larger in magnitude than the largest, finite, positive or negative number that can be represented in the data-type format being used.

packedSee vector.

PAEPhysical-address extensions.

physical memoryActual memory, consisting of main memory and cache.

probeA check for an address in a processor’s caches or internal buffers. External probes originate outside the processor, and internal probes originate within the processor.

protected modeA submode of legacy mode.

quadwordFour words, or eight bytes, or 64 bits.

reservedFields marked as reserved may be used at some future time.To preserve compatibility with future processors, reserved fields require special handling when read or written by software.Reserved fields may be further qualified as MBZ, RAZ, SBZ or IGN (see definitions).Software must not depend on the state of a reserved field, nor upon the ability of such fields to return to a previously written state.If a reserved field is not marked with one of the above qualifiers, software must not change the state of that field; it must reload that field with the same values returned from a prior read.

RAZRead as zero (0), regardless of what is written.

real-address modeA submode of legacy mode with 16-bit addressing and operand size and a simple form of segmentation, lacking the segment and privilege protection mechanisms of protected mode. See real mode.

real modeA short name for real-address mode, a submode of legacy mode.

Page 18: 128-Bit and 256-Bit XOP and FMA4 Instructions

18

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

relativeReferencing with a displacement (also called offset) from an instruction pointer rather than the base of a code segment. Contrast with absolute.

REXAn instruction prefix that specifies a 64-bit operand size and provides access to additional registers.

RIP-relative addressingAddressing relative to the 64-bit RIP instruction pointer.

setTo write a bit value of 1. Compare clear.

SIBA byte following an instruction opcode that specifies address calculation based on scale (S), index (I), and base (B).

SIMDSingle instruction, multiple data. See vector.

SSEn and SSSEnVarious extensions to the SSE instruction set. See 128-bit media instructions and 64-bit media instructions.

sticky bitA bit that is set or cleared by hardware and that remains in that state until explicitly changed by software.

TOPThe x87 top-of-stack pointer.

TSSTask-state segment.

underflowThe condition in which a floating-point number is smaller in magnitude than the smallest nonzero, positive or negative number that can be represented in the data-type format being used.

vector(1) A set of integer or floating-point values, called elements, that are packed into a single operand. Most of the 128-bit and 64-bit media instructions use vectors as operands. Vectors are also called packed or SIMD (single-instruction multiple-data) operands. (2) An index into an interrupt descriptor table (IDT), used to access exception handlers. Compare exception.

Page 19: 128-Bit and 256-Bit XOP and FMA4 Instructions

19

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

virtual-8086 modeA submode of legacy mode.

VMCBVirtual machine control block.

VMMVirtual machine monitor.

wordTwo bytes, or 16 bits.

x86See legacy x86.

Registers

In the following list of registers, the names are used to refer either to a given register or to the contents of that register:

AH–DHThe high 8-bit AH, BH, CH, and DH registers. Compare AL–DL.

AL–DLThe low 8-bit AL, BL, CL, and DL registers. Compare AH–DH.

AL–r15BThe low 8-bit AL, BL, CL, DL, SIL, DIL, BPL, SPL, and R8B–R15B registers, available in 64-bit mode.

BPBase pointer register.

CRnControl register number n.

CSCode segment register.

eAX–eSPThe 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers. Compare rAX–rSP.

EBPExtended base pointer register.

Page 20: 128-Bit and 256-Bit XOP and FMA4 Instructions

20

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

EFERExtended features enable register.

eFLAGS16-bit or 32-bit flags register. Compare rFLAGS.

EFLAGS32-bit (extended) flags register.

eIP16-bit or 32-bit instruction-pointer register. Compare rIP.

EIP32-bit (extended) instruction-pointer register.

FLAGS16-bit flags register.

GDTRGlobal descriptor table register.

GPRsGeneral-purpose registers. For the 16-bit data size, these are AX, BX, CX, DX, DI, SI, BP, and SP. For the 32-bit data size, these are EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP. For the 64-bit data size, these include RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, and R8–R15.

IDTRInterrupt descriptor table register.

IP16-bit instruction-pointer register.

LDTRLocal descriptor table register.

MSRModel-specific register.

r8–r15The 8-bit R8B–R15B registers, or the 16-bit R8W–R15W registers, or the 32-bit R8D–R15D registers, or the 64-bit R8–R15 registers.

rAX–rSPThe 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers, or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers, or the 64-bit RAX, RBX, RCX, RDX, RDI, RSI, RBP, and RSP

Page 21: 128-Bit and 256-Bit XOP and FMA4 Instructions

21

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

registers. Replace the placeholder r with nothing for 16-bit size, “E” for 32-bit size, or “R” for 64-bit size.

RAX64-bit version of the EAX register.

RBP64-bit version of the EBP register.

RBX64-bit version of the EBX register.

RCX64-bit version of the ECX register.

RDI64-bit version of the EDI register.

RDX64-bit version of the EDX register.

rFLAGS16-bit, 32-bit, or 64-bit flags register. Compare RFLAGS.

RFLAGS64-bit flags register. Compare rFLAGS.

rIP16-bit, 32-bit, or 64-bit instruction-pointer register. Compare RIP.

RIP64-bit instruction-pointer register.

RSI64-bit version of the ESI register.

RSP64-bit version of the ESP register.

SPStack pointer register.

SSStack segment register.

Page 22: 128-Bit and 256-Bit XOP and FMA4 Instructions

22

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

TPRTask priority register (CR8), a new register introduced in the AMD64 architecture to speed interrupt management.

TRTask register.

XMM0–XMM15The 128-bit XMM registers; each is the lower half of a corresponding 256-bit YMM register.

YMM0–YMM15The 256-bit YMM registers; the lower half of each of these is the corresponding 128-bit XMMregister.

Endian Order

The x86 and AMD64 architectures address memory using little-endian byte-ordering. Multibyte values are stored with their least-significant byte at the lowest byte address, and they are illustrated with their least significant byte at the right side. Strings are illustrated in reverse order, because the addresses of their bytes increase from right to left.

Related Documents• Peter Abel, IBM PC Assembly Language and Programming, Prentice-Hall, Englewood Cliffs, NJ,

1995.

• Rakesh Agarwal, 80x86 Architecture & Programming: Volume II, Prentice-Hall, Englewood Cliffs, NJ, 1991.

• AMD, AMD-K6™ MMX™ Enhanced Processor Multimedia Technology, Sunnyvale, CA, 2000.

• AMD, 3DNow!™ Technology Manual, Sunnyvale, CA, 2000.

• AMD, AMD Extensions to the 3DNow!™ and MMX™ Instruction Sets, Sunnyvale, CA, 2000.

• Don Anderson and Tom Shanley, Pentium Processor System Architecture, Addison-Wesley, New York, 1995.

• Nabajyoti Barkakati and Randall Hyde, Microsoft Macro Assembler Bible, Sams, Carmel, Indiana, 1992.

• Barry B. Brey, 8086/8088, 80286, 80386, and 80486 Assembly Language Programming, Macmillan Publishing Co., New York, 1994.

• Barry B. Brey, Programming the 80286, 80386, 80486, and Pentium Based Personal Computer, Prentice-Hall, Englewood Cliffs, NJ, 1995.

• Ralf Brown and Jim Kyle, PC Interrupts, Addison-Wesley, New York, 1994.

• Penn Brumm and Don Brumm, 80386/80486 Assembly Language Programming, Windcrest McGraw-Hill, 1993.

• Geoff Chappell, DOS Internals, Addison-Wesley, New York, 1994.

Page 23: 128-Bit and 256-Bit XOP and FMA4 Instructions

23

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

• Chips and Technologies, Inc. Super386 DX Programmer’s Reference Manual, Chips and Technologies, Inc., San Jose, 1992.

• John Crawford and Patrick Gelsinger, Programming the 80386, Sybex, San Francisco, 1987.

• Cyrix Corporation, 5x86 Processor BIOS Writer's Guide, Cyrix Corporation, Richardson, TX, 1995.

• Cyrix Corporation, M1 Processor Data Book, Cyrix Corporation, Richardson, TX, 1996.

• Cyrix Corporation, MX Processor MMX Extension Opcode Table, Cyrix Corporation, Richardson, TX, 1996.

• Cyrix Corporation, MX Processor Data Book, Cyrix Corporation, Richardson, TX, 1997.

• Ray Duncan, Extending DOS: A Programmer's Guide to Protected-Mode DOS, Addison Wesley, NY, 1991.

• William B. Giles, Assembly Language Programming for the Intel 80xxx Family, Macmillan, New York, 1991.

• Frank van Gilluwe, The Undocumented PC, Addison-Wesley, New York, 1994.

• John L. Hennessy and David A. Patterson, Computer Architecture, Morgan Kaufmann Publishers, San Mateo, CA, 1996.

• Thom Hogan, The Programmer’s PC Sourcebook, Microsoft Press, Redmond, WA, 1991.

• Hal Katircioglu, Inside the 486, Pentium, and Pentium Pro, Peer-to-Peer Communications, Menlo Park, CA, 1997.

• IBM Corporation, 486SLC Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT, 1993.

• IBM Corporation, 486SLC2 Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT, 1993.

• IBM Corporation, 80486DX2 Processor Floating Point Instructions, IBM Corporation, Essex Junction, VT, 1995.

• IBM Corporation, 80486DX2 Processor BIOS Writer's Guide, IBM Corporation, Essex Junction, VT, 1995.

• IBM Corporation, Blue Lightning 486DX2 Data Book, IBM Corporation, Essex Junction, VT, 1994.

• Institute of Electrical and Electronics Engineers, IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985.

• Institute of Electrical and Electronics Engineers, IEEE Standard for Radix-Independent Floating-Point Arithmetic, ANSI/IEEE Std 854-1987.

• Muhammad Ali Mazidi and Janice Gillispie Mazidi, 80X86 IBM PC and Compatible Computers, Prentice-Hall, Englewood Cliffs, NJ, 1997.

• Hans-Peter Messmer, The Indispensable Pentium Book, Addison-Wesley, New York, 1995.

• Karen Miller, An Assembly Language Introduction to Computer Architecture: Using the Intel Pentium, Oxford University Press, New York, 1999.

Page 24: 128-Bit and 256-Bit XOP and FMA4 Instructions

24

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

• Stephen Morse, Eric Isaacson, and Douglas Albert, The 80386/387 Architecture, John Wiley & Sons, New York, 1987.

• NexGen Inc., Nx586 Processor Data Book, NexGen Inc., Milpitas, CA, 1993.

• NexGen Inc., Nx686 Processor Data Book, NexGen Inc., Milpitas, CA, 1994.

• Bipin Patwardhan, Introduction to the Streaming SIMD Extensions in the Pentium III, www.x86.org/articles/sse_pt1/ simd1.htm, June, 2000.

• Peter Norton, Peter Aitken, and Richard Wilton, PC Programmer’s Bible, Microsoft Press, Redmond, WA, 1993.

• PharLap 386|ASM Reference Manual, Pharlap, Cambridge MA, 1993.

• PharLap TNT DOS-Extender Reference Manual, Pharlap, Cambridge MA, 1995.

• Sen-Cuo Ro and Sheau-Chuen Her, i386/i486 Advanced Programming, Van Nostrand Reinhold, New York, 1993.

• Jeffrey P. Royer, Introduction to Protected Mode Programming, course materials for an onsite class, 1992.

• Tom Shanley, Protected Mode System Architecture, Addison Wesley, NY, 1996.

• SGS-Thomson Corporation, 80486DX Processor SMM Programming Manual, SGS-Thomson Corporation, 1995.

• Walter A. Triebel, The 80386DX Microprocessor, Prentice-Hall, Englewood Cliffs, NJ, 1992.

• John Wharton, The Complete x86, MicroDesign Resources, Sebastopol, California, 1994.

• Web sites and newsgroups:

- www.amd.com- news.comp.arch- news.comp.lang.asm.x86- news.intel.microprocessors- news.microsoft

Page 25: 128-Bit and 256-Bit XOP and FMA4 Instructions

New 128-Bit and 256-Bit Instructions 25

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

1 New 128-Bit and 256-Bit Instructions

This release of the AMD64 architecture covers the XOP and FMA4 instruction set extensions. These128-bit and 256-bit instructions complement the AMD64 128-bit media instructions deescribed indetail in the AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit Media Instructions,order# 26568. This document describes new instructions that are designed to:

Improve performance by increasing the work per instruction andreduce the need to copy and move around register operands.

These instruction set extensions include:

Floating-point multiply accumulate instructionsFloating-point fraction extractInteger horizontal add instructionsInteger multiply accumulate instructionsByte permutation and bit granularity conditional move instructionsPacked integer compare and individual-partition shift/rotate instructions

These instructions all use the new XOP instruction format, which takes advantage of the three- andfour-operand non-destructive capability, 256-bit operand size, and instruction length efficiencyprovided by this encoding. These instructions operate on either the lower 128- or full 256-bits of thenew YMM registers. Context handling of the YMM register set is supported by the newXSAVE/XRSTOR instructions in conjunction with the XSETBV and XGETBV instructions. Supportfor YMM context handling must be provided by the operating system and must be indicated by settingCR4.OSXSAVE to 1.

Support for the new instructions is indicated by use of the CPUID instruction:

XOP—ECX bit 11 as returned by CPUID function 8000_0001h. FMA4—ECX bit 16 as returned by CPUID function 8000_0001h.

Attempting to execute these instructions causes a #UD exception either if they are not present in thehardware or if operating system support for YMM context switching is not indicated by settingCR4.OSXSAVE to 1.

1.1 New Instruction FormatThe XOP instructions utilize a new three-byte XOP prefix preceding the opcode byte. This prefixreplaces the use of the 0F, 66, F2 and F3 prefix bytes and the REX prefix and encodes additionalinformation as well. The FMA4 instructions utilize the new AVX VEX prefix which provides similarencoding capabilities.

Figure 1-1 shows the byte order of the instruction format.

Page 26: 128-Bit and 256-Bit XOP and FMA4 Instructions

26 New 128-Bit and 256-Bit Instructions

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

Figure 1-1. Instruction Byte-Order

1.1.1 Legacy Prefix

The optional legacy prefixes include operand-size override, address-size override, segment override,Lock and REP prefixes. For additional information, see section 1.2, “Instruction Prefixes” in theAMD64 Architecture Programmer’s Manual Volume 3: General Purpose and System Instructions,order#24594.

1.1.2 Three-byte Prefix Format

The format of the three-byte form of the XOP and FMA4 instruction prefixes is shown in Figure 1-2.

XOP Prefix( 3 byte) Opcode ModRM SIB

xxyyzz

Displacement1, 2, or 4 Bytes

Immediate1 Byte Legacy

[Prefix]

Page 27: 128-Bit and 256-Bit XOP and FMA4 Instructions

New 128-Bit and 256-Bit Instructions 27

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Figure 1-2. Three-byte XOP Format

Prefix Byte 0

Byte 0 of the XOP prefix is set to 8Fh. This signifies an XOP prefix only in conjunction with themmmmm field of the following byte being greater than or equal to 8; if the mmmmm field is less than8 then these two bytes are a form of the POP instruction rather than an XOP prefix.

Prefix Byte 1

Byte 1 of the XOP prefix has four fields.

R Bit (Prefix Byte 1, Bit 7). This bit provides a one bit extension of the ModRM.reg field in 64-bitmode, permitting access to all 16 YMM/XMM and GPR registers. In 32-bit protected andcompatibility modes, this bit must be set to 1. This bit is the bit-inverted equivalent of the REX.R bit.

X Bit (Prefix Byte 1, Bit 6). This bit provides a one bit extension of the SIB.index field in 64-bitmode, permitting access to 16 YMM/XMM and GPR registers. In 32-bit protected and compatibilitymodes, this bit must be set to 1. This bit is the bit-inverted equivalent of the REX.X bit.

Byte 0 Byte 1 Byte 27 0 7 5 4 0 7 6 3 2 1 0

8F R X B mmmmm W vvvv L pp

Bit Mnemonic Description

Byt

e 0

7–0 8Fh XOP Prefix Byte for 3-byte XOP Prefix

Byt

e 1

7 R Inverted one bit extension to ModRM.reg field

6 X Inverted one bit extension of the SIB index field

5 BInverted one bit extension of the ModRM r/m field or the SIB base field

4–0 mmmmmXOP opcode map select: 08h—instructions with immediate byte;09h—instructions with no immediate;

Byt

e 2

7 W

Default operand size override for a general pur-pose register to 64-bit size in 64-bit mode; oper-and configuration specifier for certain XMM/YMM-based operations.

6–3 vvvvSource or destination register specifier in inverted 1’s complement format.

2 L Vector length for XMM/YMM-based operations.

1–0 ppSpecifies whether there's an implied 66, F2, or F3 opcode extension

Page 28: 128-Bit and 256-Bit XOP and FMA4 Instructions

28 New 128-Bit and 256-Bit Instructions

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

B Bit (Prefix Byte 1, Bit 5). This bit provides a one-bit extension of either the ModRM.r/m field tospecify a GPR or XMM register or to the SIB base field to specify a GPR. This permits access to 16registers. In 32-bit protected and compatibility modes, this bit is ignored. This bit is the bit-invertedequivalent of the REX.B bit and is available only in the 3-byte prefix format.

mmmmm (Prefix Byte 1, Bits 4–0. A five bit field encoding a one- or two-byte opcode prefix.

Prefix Byte 2

Byte 2 of the three-byte prefix has four fields.

W Bit (Prefix Byte 2, Bit 7). The meaning of the W bit is opcode specific. This bit toggles sourceoperand order or is ignored, depending upon the opcode.

vvvv (Prefix Byte 2, Bits 6–3). Encodes a source XMM or YMM register in inverted 1s complementform.

L (Prefix Byte 2, Bit 2). If L is 0, encodes a vector length of 128-bits or indicates scalar operands; ifL is 1, the vector length is 256-bits. The register operands for a given instruction are either all 128-bitXMM registers or all 256-bit YMM registers.

pp (Prefix Byte 2, Bits 1–0). Specifies an implied 66, F2, or F3 opcode extension as defined by thefollowing table.

1.2 Opcode ByteThe format of the opcode byte is shown in Figure 1-3. For most instructions, the operand element size(OES) is specified by the two least-significant opcode bits, as shown in Table 1-2.

Figure 1-3. Opcode Byte Format

Table 1-1. VEX.pp Prefix Mapping

pp Implied Prefix

00b None

01b 66h

10b F3h

11b F2h

7 2 1 0

Opcode OES

Page 29: 128-Bit and 256-Bit XOP and FMA4 Instructions

New 128-Bit and 256-Bit Instructions 29

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

1.3 Destination XMM registersThe destination of XOP and FMA4 instructions may be a 128-bit XMM register or a 256-bit YMMregister. When a 128-bit result is written to a destination XMM register, the upper 128 bits of thecorresponding YMM register are cleared.

1.4 Four-Operand InstructionsSome new instructions require three input operands and one destination register. This is accomplishedby using the Prefix.vvvv field and Imm8[7:4] along with the MODRM.reg and MODRM.r/m fields.

VPCMOV is an example of a four operand instruction:

VPCMOV dest, src1, src2, src3; dest = (src1 & src3) | (src2 & ~src3)

The first operand is the destination operand and is an XMM or YMM register addressed by theModRM.reg field.

The second, third and fourth operands are sources. The first source operand is an XMM registerspecified by the vvvv field. The second and third source operands are specified by the MODRM.r/mand Imm8[7:4] fields, respectively, when VEX.W is set to 0. The FMA4, VPCMOV and VPPERMinstructions provide the option of swapping the second and third source operands by setting W to 1, asshown in Table 1-3. This allows either the second data operand or the control operand to be memorybased.

1.5 Three-Operand InstructionsSome instructions have two source operands and a destination operand.

Table 1-2. Operand Element Size—OES

Opcode.OES Integer Operation Floating-PointOperation

00 Byte PS01 Word PD10 Doubleword SS11 Quadword SD

Table 1-3. Operand Configurations for FMA4,V PCMOV and VPPERM Instructions

XOP.W dest src1 src2 src30 ModRM.reg VEX/XOP.vvvv modrm.r/m imm8[7:4]1 ModRM.reg VEX/XOP.vvvv imm8[7:4] ModRM.r/m

Page 30: 128-Bit and 256-Bit XOP and FMA4 Instructions

30 New 128-Bit and 256-Bit Instructions

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

VPROTB is an example of a three operand instruction:

VPROTB dest, src, count dest = src << | >> count

The first operand is the destination operand, and is an XMM register addressed by the ModRM.regfield. The second and third operands are source operands. One source operand is an XMM registeraddressed by the XOP.vvvv field, the other source operand is an XMM register or memory operandaddressed by the ModRM.r/m field.

For certain instructions, in the three-operand format the XOP.W bit determines which source operandis specified by which operand field, as shown in Table 1-4.

Table 1-4. Operand Configurations for Three Operand Instructions

1.6 Two Operand InstructionsTwo-operand instructions use the normal ModRM-based operand assignment. For most instructions,the first operand is the destination, addressed by the ModRM.reg field and the second operand is eitheran XMM or YMM register or a memory operand, as determined by the ModRM.mod field. For all ofthese instructions, the XOP.vvvv field is not applicable and must be set to 1111b.

The VFRCZPD instruction is an example of a two operand instruction.

VFRCZPD xmm1, xmm2/mem128

VEX.W dest src count0 ModRM.reg ModRM.r/m VEX.vvvv1 ModRM.reg VEX.vvvv ModRM.r/m

Page 31: 128-Bit and 256-Bit XOP and FMA4 Instructions

New 128-Bit and 256-Bit Instructions 31

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

1.7 XOP Integer Multiply (Add) and AccumulateInstructions

The multiply and accumulate and multiply, add and accumulate instructions operate on and producepacked signed integer values. These instructions allow the accumulation of results from (possibly)many iterations of similar operations without a separate intermediate addition operation to update theaccumulator register.

1.7.1 Saturation

Some instructions limit the result of an operation to the maximum or minimum value representable bythe data type of the destination—an operation known as saturation. Many of the integer multiply andaccumulate instructions saturate the cumulative results of the multiplication and addition(accumulation) operations before writing the final results to the destination (accumulator) register.

Note, however, that not all multiply and accumulate instructions saturate results. (For furtherdiscussion of saturation, see the AMD64 Architecture Programmer’s Manual Volume 1: ApplicationProgramming, order# 24592.)

1.7.2 Multiply and Accumulate Instructions

The operation of a typical XOP integer multiply and accumulate instruction is shown in Figure 1-4 onpage 32.

The multiply and accumulate instructions operate on and produce packed signed integer values. Theseinstructions first multiply the value in the first source operand by the corresponding value in thesecond source operand. Each signed integer product is then added to the corresponding value in thethird source operand, which is the accumulator and is identical to the destination operand. The resultsmay or may not be saturated prior to being written to the destination register, depending on theinstruction.

Page 32: 128-Bit and 256-Bit XOP and FMA4 Instructions

32 New 128-Bit and 256-Bit Instructions

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

Figure 1-4. Operation of Multiply and Accumulate Instructions

The XOP instruction extensions provide the following integer multiply and accumulate instructions.

VPMACSSWW—Packed Multiply Accumulate Signed Word to Signed Word with SaturationVPMACSWW—Packed Multiply Accumulate Signed Word to Signed WordVPMACSSWD—Packed Multiply Accumulate Signed Word to Signed Doubleword withSaturationVPMACSWD—Packed Multiply Accumulate Signed Word to Signed DoublewordVPMACSSDD—Packed Multiply Accumulate Signed Doubleword to Signed Doubleword withSaturationVPMACSDD—Packed Multiply Accumulate Signed Doubleword to Signed DoublewordVPMACSSDQL—Packed Multiply Accumulate Signed Low Doubleword to Signed Quadwordwith SaturationVPMACSSDQH—Packed Multiply Accumulate Signed High Doubleword to Signed Quadwordwith SaturationVPMACSDQL—Packed Multiply Accumulate Signed Low Doubleword to Signed QuadwordVPMACSDQH—Packed Multiply Accumulate Signed High Doubleword to Signed Quadword

src1

127 96 95 64 63 32 31 0

src2

src3127 96 95 64 63 32 31 0

(saturate)

dest127 96 95 64 63 32 31 0

multiply

add

multiply

add

(saturate)

multiplymultiply

add

add(accumulate)(accumulate)

(accumulate)(accumulate)

(saturate) (saturate)

127 96 95 64 63 32 31 0

Page 33: 128-Bit and 256-Bit XOP and FMA4 Instructions

New 128-Bit and 256-Bit Instructions 33

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

1.7.3 Integer Multiply, Add and Accumulate Instructions

The operation of the multiply, add and accumulate instructions is illustrated in Figure 1-5.

The multiply, add and accumulate instructions first multiply each packed signed integer value in thefirst source operand by the corresponding packed signed integer value in the second source operand.The odd and even adjacent resulting products are then added. Each resulting sum is then added to thecorresponding packed signed integer value in the third source operand.

Figure 1-5. Operation of Multiply, Add and Accumulate Instructions

The XOP instruction set provides the following integer multiply, add and accumulate instructions.

VPMADCSSWD—Packed Multiply Add and Accumulate Signed Word to Signed Doublewordwith SaturationVPMADCSWD—Packed Multiply Add and Accumulate Signed Word to Signed Doubleword

1.8 Packed Integer Horizontal Add and SubtractThe packed horizontal add and subtract signed byte instructions successively add adjacent pairs ofsigned integer values from the source XMM register or 128-bit memory operand and pack the (sign-extended) integer result of each addition in the destination.

VPHADDBW—Packed Horizontal Add Signed Byte to Signed Word

127 112 111 96 95 80 79 64 63 48 47 32 31 16 15 0 src2

127 112 111 96 95 80 79 64 63 48 47 32 31 16 15 0

src3127 96 95 64 63 32 31 0

multiplymultiply

multiplymultiply

multiply multiply

multiplymultiply

add

add

dest127 96 95 64 63 32 31 0

(saturate)

add add

add

(saturate)

add

(saturate)

add

(saturate)

add

[accumulate][accumulate]

[accumulate][accumulate]

src1

Page 34: 128-Bit and 256-Bit XOP and FMA4 Instructions

34 New 128-Bit and 256-Bit Instructions

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

VPHADDBD—Packed Horizontal Add Signed Byte to Signed DoublewordVPHADDBQ—Packed Horizontal Add Signed Byte to Signed QuadwordVPHADDDQ—Packed Horizontal Add Signed Doubleword to Signed QuadwordVPHADDUBW—Packed Horizontal Add Unsigned Byte to WordVPHADDUBD—Packed Horizontal Add Unsigned Byte to DoublewordVPHADDUBQ—Packed Horizontal Add Unsigned Byte to QuadwordVPHADDUWD—Packed Horizontal Add Unsigned Word to DoublewordVPHADDUWQ—Packed Horizontal Add Unsigned Word to QuadwordVPHADDUDQ—Packed Horizontal Add Unsigned Doubleword to QuadwordVPHADDWD—Packed Horizontal Add Signed Word to Signed DoublewordVPHADDWQ—Packed Horizontal Add Signed Word to Signed QuadwordVPHSUBBW—Packed Horizontal Subtract Signed Byte to Signed WordVPHSUBWD—Packed Horizontal Subtract Signed Word to Signed DoublewordVPHSUBDQ—Packed Horizontal Subtract Signed Doubleword to Signed Quadword

1.9 Vector Conditional MovesXOP instructions include vector conditional move instructions:

VPCMOV—Vector Conditional MovesVPPERM—Packed Permute Bytes

The VPCMOV instruction implements the C/C++ language ternary ‘?’ operator a bit level. Thisinstruction operates on individual bits and requires a bitwise predicate in one XMM or YMM registerand the two source operands in two more XMM or YMM registers.

The VPPERM instruction performs vector permutation on a packed array of 32 bytes composed of two16-byte input operands. The VPPERM instruction replaces each destination byte with 00h, FFh, or oneof the 32 bytes of the packed array. A byte selected from the array may have an additional operationsuch as NOT or bit reversal applied to it, before it is written to the destination. The action for eachdestination byte is determined by a corresponding control byte. The VPPERM instruction allowseither the second 16-byte input array or the control array to be memory based, per the XOP.W bit.

1.10 Packed Integer Rotates and ShiftsThese instructions rotate/shift the elements of the vector in the first source YMM or 128-bit memoryoperand by the amount specified by a control byte. The rotates and shifts differ in the way they handlethe control byte.

Page 35: 128-Bit and 256-Bit XOP and FMA4 Instructions

New 128-Bit and 256-Bit Instructions 35

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

1.10.1 Packed Integer Shifts

The packed integer shift instructions shift each element of the vector in the first source XMM or 128-bit memory operand by the amount specified by a control byte contained in the least significant byte ofthe corresponding element of the second source operand. The result of each shift operation is returnedin the destination XMM register. This allows load-and-shift from memory operations, with either thesource operand or the shift-count operand being memory-based, as indicated by the XOP.W bit. TheXOP instruction set provides the following packed integer shift instructions:

VPSHLB—Packed Shift Logical BytesVPSHLW—Packed Shift Logical WordsVPSHLD—Packed Shift Logical DoublewordsVPSHLQ—Packed Shift Logical QuadwordsVPSHAB—Packed Shift Arithmetic BytesVPSHAW—Packed Shift Arithmetic WordsVPSHAD—Packed Shift Arithmetic DoublewordsVPSHAQ—Packed Shift Arithmetic Quadwords

1.10.2 Packed Integer Rotate

There are two variants of the packed integer rotate instructions. The first is identical to that describedabove (see “Packed Integer Shifts”). In the second variant, the control byte is supplied as an 8-bitimmediate operand that specifies a single rotate amount for every element in the first source operand.The XOP instruction set provides the following packed integer rotate instructions:

VPROTB—Packed Rotate BytesVPROTW—Packed Rotate WordsVPROTD—Packed Rotate DoublewordsVPROTQ—Packed Rotate Quadwords

1.11 Packed Integer Comparison and Predicate GenerationThe XOP comparison instructions compare packed integer values in the first source XMM registerwith corresponding packed integer values in the second source XMM register or 128-bit memory. Thetype of comparison is specified by the immediate-byte operand. The resulting predicate is placed in thedestination XMM register. If the condition is true, all bits in the corresponding field in the destinationregister are set to 1s; otherwise all bits in the field are set to 0s.

Table 1-5. Immediate Operand Values for Unsigned Vector Comparison Operations

Immediate Operand Byte Comparison Operation

Bits 7:3 Bits 2:0

Page 36: 128-Bit and 256-Bit XOP and FMA4 Instructions

36 New 128-Bit and 256-Bit Instructions

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

The integer comparison and predicate generation instructions compare corresponding packed signedor unsigned bytes in the first and second source operands and write the result of each comparison in thecorresponding element of the destination. The result of each comparison is a value of all 1s (TRUE) orall 0s (FALSE). The type of comparison is specified by the three low-order bits of the immediate-byteoperand. The XOP instruction set provides the following integer comparison instructions.

VPCOMUB—Compare Vector Unsigned BytesVPCOMUW—Compare Vector Unsigned WordsVPCOMUD—Compare Vector Unsigned DoublewordsVPCOMUQ—Compare Vector Unsigned QuadwordsVPCOMB—Compare Vector Signed BytesVPCOMW—Compare Vector Signed WordsVPCOMD—Compare Vector Signed DoublewordsVPCOMQ—Compare Vector Signed Quadwords

1.12 Fraction ExtractThe fraction extract instructions isolate the fractional portions of vector or scalar floating pointoperands. The result of _PD and _PS instructions is a vector of integer numbers. The result of _SD and_SS instructions is always a scalar integer number. XOP provides the following fraction extractinstructions:

VFRCZPD—Extract Fraction Packed Double-Precision Floating-PointVFRCZPS—Extract Fraction Packed Single-Precision Floating-PointVFRCZSD— Extract Fraction Scalar Double-Precision Floating-PointVFRCZSS— Extract Fraction Scalar Single-Precision Floating Point

The VFRCZPD and VFRCZPS instructions extract the fractional portions of a vector of double-/single-precision floating-point values in an XMM or YMM register or a 128- or 256-bit memorylocation and write the results in the corresponding field in the destination register.

00000b

000b Less Than

001b Less Than or Equal

010b Greater Than

011b Greater Than or Equal

100b Equal

101b Not Equal

110b False

111b True

Page 37: 128-Bit and 256-Bit XOP and FMA4 Instructions

New 128-Bit and 256-Bit Instructions 37

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

The VFRCZSS and VFRCZSD instructions extract the fractional portion of the single-/double-precision scalar floating-point value in an XMM register or 32- or 64-bit memory location and writesthe result in the lower element of the destination register. The upper elements of the destination XMMregister are unaffected by the operation, while the upper 128 bits of the corresponding YMM registerare cleared to zeros.

Page 38: 128-Bit and 256-Bit XOP and FMA4 Instructions

38 New 128-Bit and 256-Bit Instructions

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

Page 39: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference 39

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

2 AMD XOP and FMA4 Instructions

The following section describes the complete set of XOP 128-media instructions. Instructions arelisted alphabetically by mnemonic.

2.1 NotationThe notation used to denote the size and type of source and destination operands in both mnemonicsand opcodes is discussed in detail in Section 2.5, “Notation,” on page 37 in the AMD64 ArchitectureProgrammer’s Manual Volume 3: General Purpose and System Instructions. Mnemonic conventionsthat are idiosyncratic to the XOP instruction set have been included in Chapter 1, “New 128-Bit and256-Bit Instructions”, in this document.

2.1.1 Opcode Syntax

Opcode specification for the XOP and FMA4 instruction sets, with their two, three and four operand syntax, requires a slightly different approach from that used to specify the opcodes for previous generation 64- and 128-bit instructions (documented in the AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit Media Instructions, order# 26568, and AMD64 Architecture Programmer’s Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions, order# 26569). In the following pages, opcodes are specified using the order of fields and bits as they occur in a complete opcode specification as outlined in Section 1.1, “New Instruction Format,” on page 25. The following opcode specification is typical:

Most of the terms and symbols used in the following pages are defined in Section 1.1, “NewInstruction Format,” on page 25. The following notations and convention are used in this volume, inaddition to the opcode notational conventions specified in Section 2.5.2, “Opcode Syntax,” on page 39

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCMOV ymm1, ymm2, ymm3/mem256, ymm4 8F RXB.08 0.src1.1.00 A2 /r /imm[7:4]

assembly language representation XOPprefix

3-bit field representing R, X, B bit values

W bitvvvv field

L bitpp field

opcoderegister/memory type specifier

immediate operand5-bit encoding for opcode prefix

Page 40: 128-Bit and 256-Bit XOP and FMA4 Instructions

40 Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

in the AMD64 Architecture Programmer’s Manual Volume 3: General Purpose and SystemInstructions:

cntrControl bits (for comparison instructions); immediate byte bits 3–0.

is4Destination register specifier; immediate byte bits 7:4.

RXB

Bit field specifying the R, X and B bit values. Specified in one’s complement form.

VEX.WThe meaning of the W bit is opcode specific. This bit toggles source operand order or is ignored,depending upon the opcode.

VEX.LVector length specifier

VEX.vvvv

Additional operand register specifier.

XOPIndicates the XOP prefix byte (8Fh).

2.2 Operand SpecificationThe packed values in a operand are numbered starting with 0, which is considered to be even-numbered.

Page 41: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference 41

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

2.3 Instruction Reference

Page 42: 128-Bit and 256-Bit XOP and FMA4 Instructions

42 VFMADDPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed double-precision floating-point value in the first source by the correspondingpacked double-precision floating-point value in the second source, then adds each product to thecorresponding packed double-precision floating-point value in the third source and writes the roundedresults to the destination register.

The VFMADDPD instruction requires four operands:

VFMADDPD dest, src1, src2, src3 dest = (src1* src2) + src3

The 128-bit version multiplies each of the two double-precision values in the first source XMMregister by its corresponding double-precision value in the second source. It then adds eachintermediate product to the corresponding double-precision value in the third source and places theresult in the destination XMM register.

The 256-bit version multiplies each of the four double-precision values in the first source YMMregister by its corresponding double-precision value in the second source. It then adds each product tothe corresponding double-precision value in the third source and places the results in the destinationYMM register.

If VEX.W is 0, the second source is either a register or memory and the third source is a register. IfVEX.W is 1, the second source is a register and the third source is a register or memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. Theresults of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFMADDPD instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMADDPD Multiply and Add Packed Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMADDPD xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 69 /r /is4

VFMADDPD ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 69 /r /is4

VFMADDPD xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 69 /r /is4

VFMADDPD ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 69 /r /is4

Page 43: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDPD 43

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMADDPS, VFMADDSD, VFMADDSS

rFLAGS Affected

None

127 64 63 0

255 192 191 128

127 64 63 0

255 192 191 128

127 64 63 0255 128

255

192 191

128

VEX.L=0

127 64 63 0

255 192 191 128

VEX.L=1

0s

VEX.L=1

mul mul

mul mul

VEX.L=1

add add

add add

rnd rnd rnd rnd

VEX.L=1

dest = xmm1 | ymm1

src1 = xmm2 | ymm2 src2 = xmm3/mem128 | ymm/mem256

VFMADDPD

src3 = xmm4/mem128 | ymm4/mem256

Page 44: 128-Bit and 256-Bit XOP and FMA4 Instructions

44 VFMADDPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

Page 45: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDPD 45

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 46: 128-Bit and 256-Bit XOP and FMA4 Instructions

46 VFMADDPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed single-precision floating-point value in the first source by the correspondingsingle-precision floating-point value in the second source, then adds each product to the correspondingpacked single-precision floating-point value in the third source and writes the rounded results to thedestination register.

The VFMADDPS instruction requires four operands:

VFMADDPS dest, src1, src2, src3 dest = src1* src2 + src3

The 128-bit version multiplies each of the four single-precision values in the first source XMMregister by its corresponding single-precision value in the second source. It then adds each product tothe corresponding single-precision value in the third source and places the results in the destinationXMM register.

The 256-bit version multiplies each of the eight single-precision values in the first source YMMregister by its corresponding double-precision value in the second source. It then adds each product tothe corresponding double-precision value in the third source and places the results in the destinationYMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. Theresults of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFMADDPS instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMADDPS Multiply and Add Packed Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMADDPS xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 68 /r /is4

VFMADDPS ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 68 /r /is4

VFMADDPS xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 68 /r /is4

VFMADDPS ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 68 /r /is4

Page 47: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDPS 47

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMADDPD, VFMADDSD, VFMADDSS

rFLAGS Affected

None

127 6463 0

255 192191 128

255 128VEX.L=0

VEX.L=1

0s

VEX.L=1

mul

VEX.L=1

add

rnd rndrnd

VEX.L=1

dest = xmm1 | ymm1

src1 = xmm2 | ymm2 src2 = xmm3/mem128 | ymm4/mem256

VFMADDPS

31329596

159160223224

add

rnd

mul

rnd

mul

add

mul

add

rnd

add

rnd

mul

addrnd

255 192191 128159160223224255 192191 128159160223224

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

255 192191 128159160223224

mul mul

add

mul

add

src3 = xmm4/mem128 | ymm4/mem256

Page 48: 128-Bit and 256-Bit XOP and FMA4 Instructions

48 VFMADDPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point Exceptions

Page 49: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDPS 49

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Invalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 50: 128-Bit and 256-Bit XOP and FMA4 Instructions

50 VFMADDSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the double-precision floating-point value in the low-order quadword of the first source bythe double-precision floating-point value in the low-order quadword of the second source, then addsthe product to the double-precision floating-point value in the low-order quadword of the third source.The low-order quadword result is written to the destination.

The VFMADDSD instruction requires four operands:

VFMADDSD dest, src1, src2, src3 dest = src1* src2 + src3

If VEX.W is 0, the second source is either a register or 64-bit memory location and the third source isa register. If VEX.W is 1, the second source is a a register and the third source is a register or 64-bitmemory location.

The destination is an XMM register. When the result is written to the destination XMM register, theupper quadword of the destination register (bits 64–127) and the upper 128-bits of the correspondingYMM register are cleared to zeros.

The intermediate product is not rounded; the infinitely precise product is used in the addition. Theresult of the addition is rounded, as specified by the rounding mode in MXCSR.

The VFMADDSD instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMADDSD Multiply and Add Scalar Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMADDSD xmm1, xmm2, xmm3/mem64, xmm4 C4 RXB.03 0.xsrc1.0.01 6B /r /is4

VFMADDSD xmm1, xmm2, xmm3, xmm4/mem64 C4 RXB.03 1.xsrc1.0.01 6B /r /is4

Page 51: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDSD 51

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMADDPD, VFMADDPS, VFMADDSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

127 64 63 0

127 64 63 0

255 128

127 64 63 0

0s

mul

add

rnd

dest = xmm1

src1 = xmm2 src2 = xmm3/mem64

src3 = xmm4/mem64

VFMADDSD

0s

Page 52: 128-Bit and 256-Bit XOP and FMA4 Instructions

52 VFMADDSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Page 53: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDSS 53

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies the single-precision floating-point value in the low-order doubleword of the first source bythe low-order single-precision floating-point value in the second source, then adds the product to thelow-order single-precision floating-point value in the third source. The low-order doubleword result iswritten to the destination.

The VFMADDSS instruction requires four operands:

VFMADDSS dest, src1, src2, src3 dest = src1* src2 + src3

If VEX.W is 0, the second source is either a register or 32-bit memory location and the third source isa register. If VEX.W is 1, the second source is a a register and the third source is a register or 32-bitmemory location.

The destination is an XMM register. When the result is written to the destination XMM register, theupper three doublewords of the destination register (bits 32–127) and the upper 128-bits of thecorresponding YMM register are cleared to zeros.

The intermediate product is not rounded; the infinitely precise product is used in the addition. Theresult of the addition is rounded, as specified by the rounding mode in MXCSR.

The VFMADDSS instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMADDSS Multiply and Add Scalar Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMADDSS xmm1, xmm2, xmm3/mem32, xmm4 C4 RXB.03 0.xsrc1.0.01 6A /r /is4

VFMADDSS xmm1, xmm2, xmm3, xmm4/mem32 C4 RXB.03 1.xsrc1.0.01 6A /r /is4

Page 54: 128-Bit and 256-Bit XOP and FMA4 Instructions

54 VFMADDSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VFMADDPD, VFMADDPS, VFMADDSD

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 128

0sdest = xmm1

src1 = xmm2 src2 = xmm3/mem32

src3 = xmm4/mem32

VFMADDSS

31329596

mul

rnd

127 6463 031329596

127 6463 031329596

add

127 6463 031329596

0s0s0s

Page 55: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDSS 55

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Page 56: 128-Bit and 256-Bit XOP and FMA4 Instructions

56 VFMADDSUBPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed double-precision floating-point value in the first source by the correspondingpacked double-precision floating-point value in the second source. Adds each odd-numbered double-precision floating-point value in the third source to the corresponding infinite-precision intermediateproduct; subtracts each even-numbered double-precision floating-point value in the third source fromits corresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the two double-precision floating-point values in the firstsource by its corresponding value in the second source. The low-order double-precision floating-pointvalue in the third source is subtracted from its corresponding infinite-precision product and the high-order double-precision floating-point value in the third source is added to its corresponding product.The results of these operations are placed in their corresponding positions in the destination.

The 256-bit version multiplies each of the four double-precision floating-point values in first source byits corresponding double-precision value in the second source. The even-numbered double-precisionvalues in the third source are subtracted from their corresponding infinite-precision intermediateproducts and the odd-numbered double-precision values in the third source are added to theircorresponding infinite precision intermediate products. The results of these operations are placed intheir corresponding positions in the destination.

The first source is an XMM register or a YMM register, depending on the vector size, as determined byVEX.L.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the final additionand subtraction operation(s). The results of the addition and subtraction operations are rounded, asspecified by the rounding mode in MXCSR.

The VFMADDSUBPD instruction is an FMA4 instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMADDSUBPD Multiply with Alternating Add/Subtract of Packed Double-Precision Floating-Point

Page 57: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDSUBPD 57

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMADDSUBPD, VFMSUBADDPD, VFMSUBADDPS

rFLAGS Affected

None

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMADDSUBPD xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 5D /r /is4

VFMADDSUBPD ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 5D /r /is4

VFMADDSUBPD xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 5D /r /is4

VFMADDSUBPD ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 5D /r /is4

127 64 63 0

255 192 191 128

127 64 63 0

255 192 191 128

127 64 63 0255 128

255

192 191

128

VEX.L=0

127 64 63 0

255 192 191 128

VEX.L=1

0s

VEX.L=1

mul mul

mul mul

VEX.L=1

add

add

rnd rnd rnd rnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFMADDSUBPD

sub

sub

Page 58: 128-Bit and 256-Bit XOP and FMA4 Instructions

58 VFMADDSUBPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

Page 59: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDSUBPD 59

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/–zero was multiplied by +/–infinityX +infinity was added to –infinityX +infinity was subtracted from +infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 60: 128-Bit and 256-Bit XOP and FMA4 Instructions

60 VFMADDSUBPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed single-precision floating-point value in the first source by the correspondingpacked single-precision floating-point value in the second source. Adds each odd-numbered single-precision floating-point value in the third source to the corresponding infinite-precision intermediateproduct; subtracts each even-numbered single-precision floating-point value in the third source fromits corresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the four single-precision floating-point values in first source byits corresponding single-precision value in the second source. The even-numbered single-precisionvalues in the third source are subtracted from their corresponding infinite-precision intermediateproducts and the odd-numbered single-precision values in the third source are added to theircorresponding infinite precision intermediate products. The results of these operations are placed intheir corresponding positions in the destination.

The 256-bit version multiplies each of the eight single-precision floating-point values in first source byits corresponding single-precision value in the second source. The even-numbered single-precisionvalues in the third source are subtracted from their corresponding infinite-precision intermediateproducts and the odd-numbered single-precision values in the third source are added to theircorresponding infinite precision intermediate products. The results of these operations are placed intheir corresponding positions in the destination.

The first source is either an XMM register or a YMM register, depending on the vector size, asdetermined by VEX.L.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise intermediate products are used in theaddition and subtraction operations. The results of the addition and subtraction operations are rounded,as specified by the rounding mode in MXCSR.

The VFMADDSUBPS instruction is an FMA4 instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMADDSUBPS Multiply with Alternating Add/Subtract of Packed Single-Precision Floating-Point

Page 61: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDSUBPS 61

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMADDSUBPD, VFMSUBADDPD, VFMSUBADDPS

rFLAGS Affected

None

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMADDSUBPS xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 5C /r /is4

VFMADDSUBPS ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 5C /r /is4

VFMADDSUBPS xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 5C /r /is4

VFMADDSUBPS ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 5C /r /is4

127 6463 0

255 192191 128

255 128VEX.L=0

VEX.L=1

0s

VEX.L=1

mul

VEX.L=1

rnd rndrnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFMADDSUBPS

31329596

159160223224

sub

rnd

mul

rnd

mul

add

mul

rnd

add

rnd

mul

subrnd

255 192191 128159160223224

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

255 192191 128159160223224

mul mul

add

mul

add

sub sub

127 6463 031329596

Page 62: 128-Bit and 256-Bit XOP and FMA4 Instructions

62 VFMADDSUBPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

Page 63: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMADDSUBPS 63

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/–zero was multiplied by +/–infinityX +infinity was added to –infinityX +infinity was subtracted from +infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 64: 128-Bit and 256-Bit XOP and FMA4 Instructions

64 VFMSUBADDPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed double-precision floating-point value in the first source by the correspondingpacked double-precision floating-point value in the second source. Adds each even-numbered double-precision floating-point value in the third source to the corresponding infinite-precision intermediateproduct; subtracts each odd-numbered double-precision floating-point value in the third source fromits corresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the two double-precision floating-point values in the firstsource by its corresponding value in the second source. The high-order double-precision floating-pointvalue in the third source is subtracted from its corresponding infinite-precision product and the low-order double-precision floating-point value in the third source is added to its corresponding product.The results of these operations are placed in their corresponding positions in the destination.

The 256-bit version multiplies each of the four double-precision floating-point values in first source byits corresponding double-precision value in the second source. The odd-numbered double-precisionvalues in the third source are subtracted from their corresponding infinite-precision intermediateproducts and the even-numbered double-precision values in the third source are added to theircorresponding infinite precision intermediate products. The results of these operations are placed intheir corresponding positions in the destination.

The first source is either an XMM register or a YMM register, depending on the vector size, asdetermined by VEX.L.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source operand is a register ormemory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the two infinitely precise intermediate products are used inthe addition. The results of the addition and subtraction operations are rounded, as specified by therounding mode in MXCSR.

The VFMSUBADDPD instruction is an FMA4 instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMSUBADDPD Multiply with Alternating Subtract/Add of Packed Double-Precision Floating-Point

Page 65: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBADDPD 65

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMADDSUBPD, VFMADDSUBPS, VFMSUBADDPS

rFLAGS Affected

None

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMSUBADDPD xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 5F /r /is4

VFMSUBADDPD ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 5F /r /is4

VFMSUBADDPD xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 5F /r /is4

VFMSUBADDPD ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 5F /r /is4

127 64 63 0

255 192 191 128

127 64 63 0

255 192 191 128

127 64 63 0255 128

255

192 191

128

VEX.L=0

127 64 63 0

255 192 191 128

VEX.L=1

0s

VEX.L=1

mul mul

mul mul

VEX.L=1

add

add

rnd rnd rnd rnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFMSUBADDPD

sub

sub

Page 66: 128-Bit and 256-Bit XOP and FMA4 Instructions

66 VFMSUBADDPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

Page 67: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBADDPD 67

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinityX +infinity was subtracted from +infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 68: 128-Bit and 256-Bit XOP and FMA4 Instructions

68 VFMSUBADDPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed single-precision floating-point value in the first source by the correspondingpacked single-precision floating-point value in the second source. Adds each even-numbered single-precision floating-point value in the third source to the corresponding infinite-precision intermediateproduct; subtracts each odd-numbered single-precision floating-point value in the third source from itscorresponding product. Finally, writes the results to the destination.

The 128-bit version multiplies each of the four single-precision floating-point values in first source byits corresponding single-precision value in the second source. The odd-numbered single-precisionvalues in the third source are subtracted from their corresponding infinite-precision intermediateproducts and the even-numbered single-precision values in the third source are added to theircorresponding infinite precision intermediate products. The results of these operations are placed intheir corresponding positions in the destination.

The 256-bit version multiplies each of the eight single-precision floating-point values in first source byits corresponding single-precision value in the second source. The odd-numbered single-precisionvalues in the third source are subtracted from their corresponding infinite-precision intermediateproducts and the even-numbered single-precision values in the third source are added to theircorresponding infinite precision intermediate products. The results of these operations are placed intheir corresponding positions in the destination.

The first source is either an XMM register or a YMM register, depending on the vector size, asdetermined by VEX.L.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. Theresults of the additions and subtracts are rounded, as specified by the rounding mode in MXCSR.

The VFMSUBADDPS instruction is an FMA4 instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMSUBADDPS Multiply with Alternating Subtract/Add of Packed Single-Precision Floating-Point

Page 69: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBADDPS 69

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMADDSUBPD, VFMADDSUBPS, VFMSUBADDPD, VFMSUBADDPS

rFLAGS Affected

None

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMSUBADDPS xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 5E /r /is4

VFMSUBADDPS ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 5E /r /is4

VFMSUBADDPS xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 5E /r /is4

VFMSUBADDPS ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 5E /r /is4

127 6463 0

255 192191 128

255 128VEX.L=0

VEX.L=1

0s

VEX.L=1

mul

VEX.L=1

rnd rndrnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFMSUBADDPS

31329596

159160223224

sub

rnd

mul

rnd

mul

add

mul

rnd add

rnd

mul

sub

rnd255 192191 128159160223224

255 192191 128159160223224

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

255 192191 128159160223224

mul mul

add

mul

add

sub sub

Page 70: 128-Bit and 256-Bit XOP and FMA4 Instructions

70 VFMSUBADDPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

Page 71: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBADDPS 71

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinityX +infinity was subtracted from +infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 72: 128-Bit and 256-Bit XOP and FMA4 Instructions

72 VFMSUBPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each of the packed double-precision floating-point values in the first source by itscorresponding packed double-precision floating-point value in the second source, then subtracts thecorresponding packed double-precision floating-point values in the third source from the intermediateproducts of the multiplication. The results are written to the destination register.

The VFMSUBPD instruction requires four operands:

VFMSUBPD dest, src1, src2, src3 dest = src1* src2 - src3

The 128-bit version multiplies two packed double-precision floating-point values in the first source,by their corresponding packed double-precision floating point values in the second source, producingtwo intermediate products. The two double precision floating-point values in the third source aresubtracted from the intermediate products of the multiplication and the remainders are placed in thedestination XMM register.

The 256-bit version multiplies four packed double-precision floating-point values in the first source bytheir corresponding packed double-precision floating point values in the second source, producingfour intermediate products. The four double-precision floating-point values in the third source aresubtracted from the intermediate products of the multiplication and the remainders are placed in thedestination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the two infinitely precise products are used in thesubtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFMSUBPD instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMSUBPD Multiply and Subtract Packed Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMSUBPD xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 6D /r /is4

VFMSUBPD ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 6D /r /is4

VFMSUBPD xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 6D /r /is4

VFMSUBPD ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 6D /r /is4

Page 73: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBPD 73

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMSUBPS, VFMSUBSD, VFMSUBSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

255 192 191 128

127 64 63 0

255 192 191 128

127 64 63 0255 128

255

192 191

128

VEX.L=0

127 64 63 0

255 192 191 128

VEX.L=1

0s

VEX.L=1

mul mul

mul mul

VEX.L=1

sub sub

sub sub

rnd rnd rnd rnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFMSUBPD

Page 74: 128-Bit and 256-Bit XOP and FMA4 Instructions

74 VFMSUBPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinityX -infinity was subtracted from -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 75: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBPD 75

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 76: 128-Bit and 256-Bit XOP and FMA4 Instructions

76 VFMSUBPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each of the packed single-precision floating-point values in the first source by itscorresponding packed single-precision floating-point value in the second source, then subtracts thecorresponding packed single-precision floating-point values in the third source from the products. Thefour results are written to the destination register.

The VFMSUBPS instruction requires four operands:

VFMSUBPS dest, src1, src2, src3 dest = src1* src2 - src3

The 128-bit version multiplies four packed single-precision floating-point values in the first source bytheir corresponding packed single-precision floating point values in the second source, producing fourintermediate products. The four single-precision floating-point values in the third source aresubtracted from the intermediate products of the multiplication and the remainders are placed in thedestination XMM register.

The 256-bit version multiplies eight packed single-precision floating-point values in the first source bytheir corresponding packed single-precision floating point values in the second source, producingeight intermediate products. The eight single-precision floating-point values in the third source aresubtracted from the intermediate products of the multiplication and the remainders are placed in thedestination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When writing to a 128-bit XMM destination register, the upper128 bits of the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the two infinitely precise products are used in thesubtraction. The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFMSUBPS instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMSUBPS Multiply and Subtract Packed Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMSUBPS xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 6C /r /is4

VFMSUBPS ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 6C /r /is4

VFMSUBPS xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 6C /r /is4

VFMSUBPS ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 6C /r /is4

Page 77: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBPS 77

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMSUBPD, VFMSUBSD, VFMSUBSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 192191 128

255 128

VEX.L=0

VEX.L=1

0s

VEX.L=1

mul

VEX.L=1

sub

rnd rndrnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFMSUBPS

31329596

159160223224

sub

rnd

mul

rnd

mul

sub

mul

rnd

sub

rnd

mul

subrnd

255 192191 128159160223224

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

255 192191 128159160223224

mul mul

sub

mul

sub

127 6463 031329596

sub

Page 78: 128-Bit and 256-Bit XOP and FMA4 Instructions

78 VFMSUBPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinityX -infinity was subtracted from -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 79: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBPS 79

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 80: 128-Bit and 256-Bit XOP and FMA4 Instructions

80 VFMSUBSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the double-precision floating-point value in the low-order quadword of the first source bythe double-precision floating-point value in the low-order quadword of the second source, thensubtracts the double-precision floating-point value in the low-order quadword of the third source fromthe intermediate product. The low-order quadword result is written to the destination.

The VFMSUBSD instruction requires four operands:

VFMSUBSD dest, src1, src2, src3 dest = src1* src2 - src3

If VEX.W is 0, the second source is either a register or 64-bit memory location and the third source isa register. If VEX.W is 1, the second source is a register and the third source is a register or 64-bitmemory location.

The destination is an XMM register. When the result is written to the destination XMM register, theupper quadword of the destination register (bits 64–127) and the upper 128-bits of the correspondingYMM register are cleared to zeros.

The intermediate product is not rounded; the infinitely precise product is used in the subtraction. Theresult of the subtraction is rounded, as specified by the rounding mode in MXCSR.

The VFMSUBSD instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMSUBSD Multiply and Subtract Scalar Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMSUBSD xmm1, xmm2, xmm3/mem64, xmm4 C4 RXB.03 0.xsrc1.0.01 6F /r /is4

VFMSUBSD xmm1, xmm2, xmm3, xmm4/mem64 C4 RXB.03 1.xsrc1.0.01 6F /r /is4

Page 81: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBSD 81

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMSUBPD, VFMSUBPS, VFMSUBSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

127 64 63 0

127 64 63 0

255 128

127 64 63 0

0s

mul

sub

rnd

dest = xmm

src1 = xmm src2 = xmm | mem64

src3 = mem64 | xmm

VFMSUBSD

0s

Page 82: 128-Bit and 256-Bit XOP and FMA4 Instructions

82 VFMSUBSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinityX -infinity was subtracted from -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 83: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBSD 83

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 84: 128-Bit and 256-Bit XOP and FMA4 Instructions

84 VFMSUBSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the single-precision floating-point value in the low-order doubleword of the first source bythe single-precision floating-point value in the low-order doubleword of the second source, thensubtracts the single-precision floating-point value in the low-order doubleword of the third sourcefrom the product. The low-order doubleword result is written to the destination.

The VFMSUBSS instruction requires four operands:

VFMSUBSS dest, src1, src2, src3 dest = src1* src2 - src3

If VEX.W is 0, the second source is either a register or 32-bit memory location and the third source isa register. If VEX.W is 1, the second source is a register and the third source is a register or 32-bitmemory location.

The destination is an XMM register. When the result is written to the destination XMM register, theupper three doublewords of the destination register (bits 32–127) and the upper 128-bits of thecorresponding YMM register are cleared to zeros.

The intermediate product is not rounded; the infinitely precise product is used in the subtraction. Theresult of the subtraction is rounded, as specified by the rounding mode in MXCSR.

The VFMSUBSS instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFMSUBSS Multiply and Subtract Scalar Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFMSUBSS xmm1, xmm2, xmm3/mem32, xmm4 C4 RXB.03 0.xsrc1.0.01 6E /r /is4

VFMSUBSS xmm1, xmm2, xmm3, xmm4/mem32 C4 RXB.03 1.xsrc1.0.01 6E /r /is4

Page 85: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBSS 85

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFMSUBPD, VFMSUBPS, VFMSUBSD

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 128

0sdest = xmm

src1 = xmm src2 = xmm | mem32

src3 = mem32 | xmm

VFMSUBSS

31329596

mul

rnd

127 6463 031329596

127 6463 031329596

sub

127 6463 031329596

0s0s0s

Page 86: 128-Bit and 256-Bit XOP and FMA4 Instructions

86 VFMSUBSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/–zero was multiplied by +/– infinityX +infinity was added to –infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 87: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFMSUBSS 87

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 88: 128-Bit and 256-Bit XOP and FMA4 Instructions

88 FNMADDPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each of the packed double-precision floating-point values in the first source by thecorresponding packed double-precision floating-point values in the second source, then negates theproducts and adds them to the corresponding packed double-precision floating-point values in thethird source. The results are written to the destination register.

The VFNMADDPD instruction requires four operands:

VFNMADDPD dest, src1, src2, src3 dest = – (src1* src2) + src3

The 128-bit version multiplies the two double-precision values in the first source XMM register by thecorresponding double-precision values in the second source, which can be either an XMM register or a128-bit memory location. It then negates each product and adds it to the corresponding double-precision value in the third source. The results are then placed in the destination XMM register.

The 256-bit version multiplies the four double-precision values in the first source YMM register by thefour double-precision values in the second source, which can be either a YMM register or a 256-bitmemory location. It then negates each product and adds it to the corresponding double-precision valuein the third source. The results are then placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. Theresults of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFNMADDPD instruction is an FMA4 instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMADDPD Negative Multiply and Add Packed Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMADDPD xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 79 /r /is4

VFNMADDPD ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 79 /r /is4

VFNMADDPD xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 79 /r /is4

VFNMADDPD ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 79 /r /is4

Page 89: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference FNMADDPD 89

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFNMADDPS, VFNMADDSD, VFNMADDSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

255 192 191 128

127 64 63 0

255 192 191 128

127 64 63 0255 128

255

192 191

128

VEX.L=0

127 64 63 0

255 192 191 128

VEX.L=1

0s

VEX.L=1

mul mul

mul mul

VEX.L=1

add add

add add

rnd rnd rnd rnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFNMADDPD

neg neg neg neg

Page 90: 128-Bit and 256-Bit XOP and FMA4 Instructions

90 FNMADDPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Page 91: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference FNMADDPS 91

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies each of the packed single-precision floating-point values in first source by thecorresponding packed single-precision floating-point value in the second source, then negates theproducts and adds them to the corresponding packed single-precision floating-point values in the thirdsource. The results are written to the destination register.

The VFNMADDPS instruction requires four operands:

VFNMADDPS dest, src1, src2, src3 dest = – (src1* src2) + src3

The 128-bit version multiplies the four single-precision values in the first source XMM register by thecorresponding single-precision values in the second source, which can be either an XMM register or a128-bit memory location. It then negates each product and adds it to the corresponding single-precision value in the third source. The results are then placed in the destination XMM register.

The 256-bit version multiplies the eight single-precision values in the first source YMM register by theeight single-precision values in the second source, which can be either a YMM register or a 256-bitmemory location. It then negates each product and adds it to the corresponding single-precision valuein the third source. The result is then placed in the destination YMM register.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. Theresults of the addition are rounded, as specified by the rounding mode in MXCSR.

The FNMADDPS instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMADDPS Negative Multiply and Add Packed Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMADDPS xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 78 /r /is4

VFNMADDPS ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 78 /r /is4

VFNMADDPS xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 78 /r /is4

VFNMADDPS ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 78 /r /is4

Page 92: 128-Bit and 256-Bit XOP and FMA4 Instructions

92 FNMADDPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VFNMADDPD, VFNMADDSD, VFNMADDSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 192191 128

255 128

VEX.L=0

VEX.L=1

0s

VEX.L=1

mul

VEX.L=1

add

rnd rndrnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFNMADDPS

31329596

159160223224

add

rnd

mul

rnd

mul

add

mul

add

rnd

add

rnd

mul

addrnd

255 192191 128159160223224

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

255 192191 128159160223224

mul mul

add

mul

add

neg neg

neg neg

neg neg

neg neg

127 6463 031329596

Page 93: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference FNMADDPS 93

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Page 94: 128-Bit and 256-Bit XOP and FMA4 Instructions

94 VFNMADDSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the double-precision floating-point value in the low-order quadword of the first source bythe double-precision floating-point value in the low-order quadword of the second source, thennegates the product and adds it to the double-precision floating-point value in the low-order quadwordof the third source. The low-order quadword result is written to the destination register.

The VFNMADDSD instruction requires four operands:

VFNMADDSD dest, src1, src2, src3 dest = – (src1* src2) + src)

The first source is an XMM register indicated by VEX.vvvv.

If VEX.W is 0, the second source is either a register or 64-bit memory location and the third source isa register. If VEX.W is 1, the second source is a register and the third source is a register or 64-bitmemory location.

The destination is always an XMM register. When the result is written to the destination XMMregister, the high quadword of the destination register (bits 64–127) and the upper 128-bits of thecorresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. Theresults of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFNMADDSD instruction is an FMA4 instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMADDSD Negative Multiply and Add Scalar Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMADDSD xmm1, xmm2, xmm3/mem64, xmm4 C4 RXB.03 0.xsrc1.0.01 7B /r /is4

VFNMADDSD xmm1, xmm2, xmm3, xmm4/mem64 C4 RXB.03 1.xsrc1.0.01 7B /r /is4

Page 95: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMADDSD 95

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFNMADDPD, VFNMADDPS, VFNMADDSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

127 64 63 0

127 64 63 0

255 128

127 64 63 0

0s

mul

add

rnd

dest = xmm

src1 = xmm src2 = xmm | mem64

src3 = mem64 | xmm

VFNMADDSD

0s

neg

Page 96: 128-Bit and 256-Bit XOP and FMA4 Instructions

96 VFNMADDSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Page 97: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMADDSS 97

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies the single-precision floating-point value in the low-order doubleword of the first source bythe single-precision floating-point value in the low-order doubleword of the second source, thennegates the product and adds it to the single-precision floating-point value in the low-orderdoubleword of the third source. The low-order doubleword result is written to the destination.

The VFNMADDSS instruction requires four operands:

VFNMADDSS dest, src1, src2, src3 dest = - (src1* src2) + src3

If VEX.W is 0, the second source is either a register or 32-bit memory location and the third source isa register. If VEX.W is 1, the second source is a register and the third source is a register or 32-bitmemory location.

The destination is always an XMM register. When the result is written to the destination XMMregister, the upper three doublewords of the destination register (bits 32–127) and the upper 128-bits ofthe corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the addition. Theresults of the addition are rounded, as specified by the rounding mode in MXCSR.

The VFNMADDSS instruction is an FMA4 instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMADDSS Negative Multiply and Add Scalar Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMADDSS xmm1, xmm2, xmm3/mem32, xmm4 C4 RXB.03 0.xsrc1.0.01 7A /r /is4

VFNMADDSS xmm1, xmm2, xmm3, xmm4/mem32 C4 RXB.03 1.xsrc1.0.01 7A /r /is4

Page 98: 128-Bit and 256-Bit XOP and FMA4 Instructions

98 VFNMADDSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VFNMADDPD, VFNMADDPS, VFNMADDSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 128

0sdest = xmm

src1 = xmm src2 = xmm | mem32

src3 = mem32 | xmm

VFNMADDSS

31329596

mul

rnd

127 6463 031329596

127 6463 031329596

add

127 6463 031329596

0s0s0s

neg

Page 99: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMADDSS 99

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Page 100: 128-Bit and 256-Bit XOP and FMA4 Instructions

100 VFNMSUBPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each of the packed double-precision floating-point values in the first source by thecorresponding packed double-precision floating-point value in the second source, then subtracts thecorresponding packed double-precision floating-point value in the third source from the negatedinterim products.The results are written to the destination register.

The VFNMSUBPD instruction requires four operands:

VFNMSUBPD dest, src1, src2, src3 dest = – (src1* src2) - src3

The 128-bit version multiplies each of the two double-precision values in the first source XMMregister by its corresponding double-precision value in the second source, which can be either anXMM register or a 128-bit memory location. It then subtracts the corresponding double-precisionvalue in the third source from the negated interim product. The results are then placed in thedestination XMM register.

The 256-bit version multiplies each of the four double-precision values in the first source YMMregister by its corresponding double-precision value in the second source, which can be either a YMMregister or a 256-bit memory location. It then subtracts the corresponding double-precision value in thethird source from the negated interim product. The results are then placed in the destination YMMregister.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction.The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBPD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMSUBPD Negative Multiply and Subtract Packed Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMSUBPD xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 7D /r /is4

VFNMSUBPD ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 7D /r /is4

VFNMSUBPD xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 7D /r /is4

VFNMSUBPD ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 7D /r /is4

Page 101: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBPD 101

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFNMSUBPS, VFNMSUBSD, VFNMSUBSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

255 192 191 128

127 64 63 0

255 192 191 128

127 64 63 0255 128

255

192 191

128

VEX.L=0

127 64 63 0

255 192 191 128

VEX.L=1

0s

VEX.L=1

mul mul

mul mul

VEX.L=1

sub sub

sub sub

rnd rnd rnd rnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFNMSUBPD

neg neg neg neg

Page 102: 128-Bit and 256-Bit XOP and FMA4 Instructions

102 VFNMSUBPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/–zero was multiplied by +/–infinityX +infinity was added to –infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 103: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBPD 103

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 104: 128-Bit and 256-Bit XOP and FMA4 Instructions

104 VFNMSUBPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each of the packed single-precision floating-point values in the first source by thecorresponding packed single-precision floating-point value in the second source, then subtracts thecorresponding packed single-precision floating-point values in the third source from the negatedproducts. The results are written to the destination register.

The VFNMSUBPS instruction requires four operands:

VFNMSUBPS dest, src1, src2, src3 dest = – (src1* src2) – src3

The 128-bit version multiplies each of the four single-precision values in the first source XMMregister by its corresponding single-precision value in the second source, which can be either an XMMregister or a 128-bit memory location. It then subtracts the corresponding single-precision value in thethird source from the negated interim product. The results are then placed in the destination XMMregister.

The 256-bit version multiplies each of the eight single-precision values in the first source YMMregister by its corresponding single-precision value in the second source, which can be either a YMMregister or a 256-bit memory location. It then subtracts the corresponding single-precision value in thethird source from the negated interim product. The results are then placed in the destination YMMregister.

If VEX.W is 0, the second source is either a register or memory location and the third source is aregister. If VEX.W is 1, the second source is a register and the third source is a register or memorylocation.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction.The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBPS instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMSUBPS Negative Multiply and Subtract Packed Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMSUBPS xmm1, xmm2, xmm3/mem128, xmm4 C4 RXB.03 0.xsrc1.0.01 7C /r /is4

VFNMSUBPS ymm1, ymm2, ymm3/mem256, ymm4 C4 RXB.03 0.ysrc1.1.01 7C /r /is4

VFNMSUBPS xmm1, xmm2, xmm3, xmm4/mem128 C4 RXB.03 1.xsrc1.0.01 7C /r /is4

VFNMSUBPS ymm1, ymm2, ymm3, ymm4/mem256 C4 RXB.03 1.ysrc1.1.01 7C /r /is4

Page 105: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBPS 105

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFNMSUBPD, VFNMSUBSD, VFNMSUBSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 192191 128

255 128

VEX.L=0

VEX.L=1

0s

VEX.L=1

mul

VEX.L=1

sub

rnd rndrnd

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/ymm/mem

src3 = mem/xmm/ymm

VFNMSUBPS

31329596

159160223224

sub

rnd

mul

rnd

mul

sub

mul

sub

rnd

sub

rnd

mul

subrnd

255 192191 128159160223224

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

255 192191 128159160223224

mul mul

sub

mul

sub

neg neg neg neg neg neg neg neg

127 6463 031329596

Page 106: 128-Bit and 256-Bit XOP and FMA4 Instructions

106 VFNMSUBPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/–zero was multiplied by +/–infinityX +infinity was added to –infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 107: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBPS 107

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 108: 128-Bit and 256-Bit XOP and FMA4 Instructions

108 VFNMSUBSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the double-precision floating-point value in the low-order quadword of the first source bythe double-precision floating-point value in the low-order quadword of the second source, thensubtracts the double-precision floating-point value in the low-order quadword of the third source fromthe negated interim product.The low-order quadword result is written to the destination.

The VFNMSUBSD instruction requires four operands:

VFNMSUBSD dest, src1, src2, src3 dest = – (src1* src2) – src3

The first source is an XMM register.

If VEX.W is 0, the second source is either a register or 64-bit memory location and the third source isa register. If VEX.W is 1, the second source is a register and the third source is a register or 64-bitmemory location.

The destination is always an XMM register indicated by VEX.vvvv. All unaffected bits of thedestination XMM register (bits 64–127) and its corresponding YMM register (bits 128–255) arecleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction.The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBSD instruction is an FMA4 instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMSUBSD Negative Multiply and Subtract Scalar Double-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMSUBSD xmm1, xmm2, xmm3/mem64, xmm4 C4 RXB.03 0.xsrc1.0.01 7F /r /is4

VFNMSUBSD xmm1, xmm2, xmm3, xmm4/mem64 C4 RXB.03 1.xsrc1.0.01 7F /r /is4

Page 109: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBSD 109

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFNMSUBPD, VFNMSUBPS, VFNMSUBSS

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

127 64 63 0

127 64 63 0

255 128

127 64 63 0

0s

mul

sub

rnd

dest = xmm

src1 = xmm src2 = xmm | mem64

src3 = mem64 | xmm

VFNMSUBSD

0s

neg

Page 110: 128-Bit and 256-Bit XOP and FMA4 Instructions

110 VFNMSUBSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/-zero was multiplied by +/- infinityX +infinity was added to -infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 111: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBSD 111

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 112: 128-Bit and 256-Bit XOP and FMA4 Instructions

112 VFNMSUBSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the single-precision floating-point value in the low-order doubleword of the first source bythe single-precision floating-point value in the low-order doubleword of the second source, thensubtracts the single-precision floating-point value in the low-order doubleword of the third sourcefrom the negated product. The low-order doubleword result is written to the destination.

The VFNMSUBSS instruction requires four operands:

VFNMSUBSS dest, src1, src2, src3 dest = - (src1* src2) - src3

If VEX.W is 0, the second source is either a register or 32-bit memory location and the third source isa register. If VEX.W is 1, the second source is a register and the third source is a register or 32-bitmemory location.

The destination is always an XMM register indicated by VEX.vvvv. All unaffected bits of thedestination XMM register (bits 32–127) and its corresponding YMM register (bits 128–255) arecleared to zeros.

The intermediate products are not rounded; the infinitely precise products are used in the subtraction.The results of the subtraction are rounded, as specified by the rounding mode in MXCSR.

The VFNMSUBSS instruction is an FMA4 instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VFNMSUBSS Negative Multiply and Subtract Scalar Single-Precision Floating-Point

Mnemonic Encoding

VEX RXB.mmmmm W.vvvv.L.pp Opcode

VFNMSUBSS xmm1, xmm2, xmm3/mem32, xmm4 C4 RXB.03 0.xsrc1.0.01 7E /r /is4

VFNMSUBSS xmm1, xmm2, xmm3, xmm4/mem32 C4 RXB.03 1.xsrc1.0.01 7E /r /is4

Page 113: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBSS 113

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VFNMSUBPD, VFNMSUBPS, VFNMSUBSD

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 128

0sdest = ymm

src1 = xmm src2 = xmm | mem32

src3 = mem32 | xmm

VFNMSUBSS

31329596

mul

rnd

127 6463 031329596

127 6463 031329596

sub

127 6463 031329596

0s0s0s

neg

Page 114: 128-Bit and 256-Bit XOP and FMA4 Instructions

114 VFNMSUBSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X FMA4 instructions are only recognized in protected mode.

X The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.

X The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value.X +/–zero was multiplied by +/–infinityX +infinity was added to –infinityX –infinity was subtracted from –infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Page 115: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFNMSUBSS 115

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Overflow exception (OE) X A rounded result was too large to fit into the format of the destination operand.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Precision exception (PE) X A result could not be represented exactly in the desti-nation format.

Exception Real Virtual8086 Protected Cause of Exception

Page 116: 128-Bit and 256-Bit XOP and FMA4 Instructions

116 VFRCZPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Extracts the fractional portion of each double-precision floating-point value in a source register ormemory location and writes the resulting values in the corresponding elements of the destinationregister. The fractional results are precise.

If XOP.L is 0, the source is an XMM register or 128-bit memory location; If XOP.L is 1, the source is aYMM register or 256-bit memory location.

The destination is always either an XMM register or a YMM register, depending on the vector size, asdetermined by the value of XOP.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or adenormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero isa function of MXCSR.RC:

If the source value is QNaN, it is written to the destination with no exception generated. If the sourcevalue is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) ismasked. If the source value is an integer, the instruction returns zero. The sign of the instruction resultis the same as the input.

The VFRCZPD instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VFRCZPD Extract Fraction Packed Double-Precision Floating-Point

MXCSR.RC Result

Round down -0

Round to nearest +0

Round up +0

Round toward zero +0

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VFRCZPD xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 81 /r

VFRCZPD ymm1, ymm2/mem256 8F RXB.09 0.1111.1.00 81 /r

Page 117: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFRCZPD 117

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPS, VFRCZSS, VFRCZSD

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 64 63 0

255 192 191 128

127 64 63 0255 128

255

192 191

128

VEX.L=0

VEX.L=1

0s

VEX.L=1

dest = xmm/ymm

src1 = xmm/mem128 | ymm/mem256

VFRCZPD

extract extract extract extract

Page 118: 128-Bit and 256-Bit XOP and FMA4 Instructions

118 VFRCZPD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

X VEX.W was set to 1.X VEX.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value or infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Page 119: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFRCZPS 119

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Extracts the fractional portion of each of the single-precision floating-point values in a source registeror memory location and writes the resulting values to the corresponding elements of the destinationregister. The fractional results are exact.

IfXOP.L is 0, the source is an XMM register or 128-bit memory location; If XOP.L is 1, the source is aYMM register or 256-bit memory location.

The destination is always an XMM register or a YMM register, depending on the vector size, asdetermined by the value of XOP.L. When the destination is a 128-bit XMM register, the upper 128 bitsof the corresponding YMM register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or adenormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero isa function of MXCSR.RC:

If the source value is QNaN, it is written to the destination with no exception generated. If the sourcevalue is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) ismasked. If the source value is an integer, the instruction returns zero. The sign of the instruction resultis the same as the input.

The VFRCZPS instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VFRCZPS Extract Fraction Packed Single-Precision Floating-Point

MXCSR.RC Result

Round down -0

Round to nearest +0

Round up +0

Round toward zero +0

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VFRCZPS xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 80 /r

VFRCZPS ymm1, ymm2/mem256 8F RXB.09 0.1111.1.00 80 /r

Page 120: 128-Bit and 256-Bit XOP and FMA4 Instructions

120 VFRCZPS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPD, VFRCZSS, VFRCZSD

rFLAGS Affected

None

MXCSR Flags Affected

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

127 6463 0

255 192191 128

255 128VEX.L=0

VEX.L=1

0s

VEX.L=1

dest = xmm | ymm

src1 = xmm/mem128 | ymm/mem256

VFRCZPS

31329596

159160223224

255 192191 128159160223224

extract extract

extract extract extract

extract extract

extract

127 6463 031329596

Page 121: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFRCZPS 121

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

X VEX.W was set to 1.X VEX.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point ExceptionsInvalid-operation excep-tion (IE)

X A source operand was an SNaN value or infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Page 122: 128-Bit and 256-Bit XOP and FMA4 Instructions

122 VFRCZSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Extracts the fractional portion of the double-precision floating-point value in the low-order quadwordof an XMM register or 64-bit memory location and writes the result in the low-order quadword of thedestination XMM register. The fractional results are precise.

When the result is written to the destination XMM register, the upper quadword of the destinationregister and the upper 128-bits of the corresponding YMM register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or adenormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero isa function of MXCSR.RC:

If the source value is QNaN, it is written to the destination with no exception generated. If the sourcevalue is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) ismasked. If the source value is an integer, the instruction returns zero. The sign of the instruction resultis the same as the input.

The VFRCZSD instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VFRCZSD Extract Fraction Scalar Double-Precision Floating-Point

MXCSR.RC Result

Round down -0

Round to nearest +0

Round up +0

Round toward zero +0

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VFRCZSD xmm1, xmm2/mem64 8F RXB.09 0.1111.0.00 83 /r

Page 123: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFRCZSD 123

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPS, VFRCZPD, VFRCZSS

rFLAGS Affected

None

127 64 63 0

127 64 63 0

255 128

0sdest = xmm

src1 = xmm | mem64

VFRCZSD

0s

extract

Page 124: 128-Bit and 256-Bit XOP and FMA4 Instructions

124 VFRCZSD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

X VEX.W was set to 1.X VEX.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point Exceptions

Page 125: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFRCZSD 125

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Invalid-operation excep-tion (IE)

X A source operand was an SNaN value or infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Exception Real Virtual8086 Protected Cause of Exception

Page 126: 128-Bit and 256-Bit XOP and FMA4 Instructions

126 VFRCZSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Extracts the fractional portion of the single-precision floating-point value in the low-order doublewordof an XMM register or 32-bit memory location and writes the result in the low-order doubleword in thedestination XMM register. The fractional results are precise.

When the result is written to the destination XMM register, the upper three doublewords of thedestination register and the upper 128-bits of the corresponding YMM register are cleared to zeros.

The upper 224 bits of the YMM destination register are cleared to zeros.

The rounding mode indicated in the MXCSR is ignored unless the input is an integer, a zero, or adenormal value that is coerced to zero by MXCSR.DAZ, in which case the sign of the resultant zero isa function of MXCSR.RC:

If the source value is QNaN, it is written to the destination with no exception generated. If the sourcevalue is infinity, the instruction returns an indefinite value when the invalid-operation exception (IE) ismasked. If the source value is an integer, the instruction returns zero. The sign of the instruction resultis the same as the input.

The VFRCZSS instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VFRCZSS Extract Fraction Scalar Single-Precision Floating Point

MXCSR.RC Result

Round down -0

Round to nearest +0

Round up +0

Round toward zero +0

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VFRCZSS xmm1, xmm2/mem32 8F RXB.09 0.1111.0.00 82 /r

Page 127: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFRCZSS 127

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, VFRCZPS, VFRCZPD, VFRCZSD

rFLAGS Affected

None

127 6463 0

255 1280s

dest = xmm

src1 = xmm/mem32

VFRCZSS

31329596

extract

0s 0s 0s127 6463 031329596

Page 128: 128-Bit and 256-Bit XOP and FMA4 Instructions

128 VFRCZSS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

Exceptions

MM FZ RC PM UM OM ZM DM IM DAZ PE UE OE ZE DE IE

M M M

17 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT = 0.See SIMD Floating-Point Exceptions, below, for details.

X VEX.W was set to 1.X VEX.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.SIMD Floating-Point Exception, #XF

X There was an unmasked SIMD floating-point excep-tion while CR4.OSXMMEXCPT=1.See SIMD Floating-Point Exceptions, below, for details.

SIMD Floating-Point Exceptions

Page 129: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VFRCZSS 129

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Invalid-operation excep-tion (IE)

X A source operand was an SNaN value or infinity

Denormalized-operand exception (DE)

X A source operand was a denormal value.

Underflow exception (UE)

X A rounded result was too small to fit into the format of the destination operand.

Exception Real Virtual8086 Protected Cause of Exception

Page 130: 128-Bit and 256-Bit XOP and FMA4 Instructions

130 VPCMOV Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Moves bits of either the first source or the second source into their corresponding positions in thedestination, depending on the value of the corresponding selector bit in the selector. If the selector bitis set to 1, the corresponding bit in the first source is moved to the destination; otherwise, thecorresponding bit from the second source is moved to the destination.

This instruction directly implements the C-language ternary “?” operation on each of the source bits.

Arbitrary bit-granular predicates can be constructed by any number of methods, or loaded as constantsfrom memory. The VPCMOV instruction may use the results of any SSE instructions as the predicatein the selector. VPCMPEQB (VPCMPGTB), VPCMPEQW (VPCMPGTW), VPCMPEQD(VPCMPGTD) and VPCMPEQQ (VPCMPGTQ) compare bytes, words, doublewords, quadwordsand integers, respectively, and set the predicate in the destination register to masks of 1s and 0saccordingly. VCMPPS (VCMPSS) and VCMPPD (VCMPSD) compare word and doublewordfloating-point source values, respectively, and provide the predicate for the floating-point instructions.

The VPCMOV instruction requires four operands:

VPCMOV dest, src1, src2, selector

The vector size is determined by the value of VEX.L. All moves are 128 bits in length if XOP.L iscleared to 0 and 256 bits in length if XOP.L is set to 1. The sources are the same size as the destination.

The first source (src1) is always an XMM or YMM register specified by XOP.vvvv.

This instruction supports operand configuration using XOP.W. When XOP.W is 0, the second source(src2) is an XMM or YMM register or 128- or 256-bit memory location specified by MODRM.rm andthe selector is an XMM or YMM register specified by imm8[7:4]. When XOP.W is 1, the secondsource (src2) is an XMM or YMM register specified by imm8[7:4] and selector is an XMM or YMMregister or 128- or 256-bit memory location specified by MODRM.rm.

The destination (dest) is always either an XMM register or a YMM register, depending on the vectorsize, as determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper128 bits of the corresponding YMM register are cleared to zeros.

The VPCMOV instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCMOV Vector Conditional Moves

Page 131: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCMOV 131

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

)

Related Instructions

VPCOMUB, VPCOMUD, VPCOMUQ, VPCOMUW, VCMPPD, VCMPPS

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCMOV xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 A2 /r imm[7:4]

VPCMOV ymm1, ymm2, ymm3/mem256, ymm4 8F RXB.08 0.ysrc1.1.00 A2 /r imm[7:4]

VPCMOV xmm1, xmm2, xmm3, xmm4/mem128 8F RXB.08 1.xsrc1.0.00 A2 /r imm[7:4]

VPCMOV ymm1, ymm2, ymm3, ymm4/mem256 8F RXB.08 1.ysrc1.1.00 A2 /r imm[7:4]

127 0

255 128

127 0

255 128

127 0255 128

255 128

VEX.L=0

127 0

255 128

VEX.L=1

0s

VEX.L=1 VEX.L=1

VEX.L=1

dest = xmm/ymm

src1 = xmm/ymm src2 = xmm/mem128 | ymm/mem256

src3 = mem/xmm/ymm

VPCMOV

select select

select select

selectorsselectors

Page 132: 128-Bit and 256-Bit XOP and FMA4 Instructions

132 VPCMOV Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 133: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMB 133

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Compares corresponding packed signed bytes in the first and second sources and writes the result ofeach comparison in the corresponding byte of the destination. The result of each comparison is an 8-bitvalue of all 1s (TRUE) or all 0s (FALSE).

The VPCOMB instruction requires four operands:

VPCOMB dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the comparisonresults are written to the destination XMM register, the upper 128 bits of the corresponding YMMregister are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 2-1.The VPCOMB instruction with an appropriate value of imm8 is aliased to the following mnemonics tofacilitate coding.

The VPCOMB instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMB Compare Vector Signed Bytes

Table 2-1. VPCOMB Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTB 0 Less Than

VPCOMLEB 1 Less Than or Equal

VPCOMGTB 2 Greater Than

VPCOMGEB 3 Greater Than or Equal

VPCOMEQB 4 Equal

VPCOMNEQB 5 Not Equal

VPCOMFALSEB 6 False

VPCOMTRUEB 7 True

Page 134: 128-Bit and 256-Bit XOP and FMA4 Instructions

134 VPCOMB Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMW, VPCOMD, VPCOMQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMB xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 CC /r /imm8

07153147 23637995

128255

39557187103111119127

0s

07153147 23637995 39557187103111119127

07153147 23637995 39557187103111119127

compare

compare

02

cond

VPCOMB

16 groups of all zeros or all ones

16 Comparisons

imm8

src1 = xmm2

src2 = xmm3/mem128

dest = xmm1

Page 135: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMB 135

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 136: 128-Bit and 256-Bit XOP and FMA4 Instructions

136 VPCOMD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Compares corresponding packed signed doublewords in the first and second sources and writes theresult of each comparison in the corresponding doubleword of the destination. The result of eachcomparison is a 32-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMD instruction requires four operands:

VPCOMD dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the results of thecomparisons are written to the destination XMM register, the upper 128 bits of the correspondingYMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 2-2.The VPCOMD instruction with an appropriate value of imm8 is aliased to the following mnemonics tofacilitate coding.

The VPCOMD instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMD Compare Vector Signed Doublewords

Table 2-2. VPCOMD Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTD 0 Less Than

VPCOMLED 1 Less Than or Equal

VPCOMGTD 2 Greater Than

VPCOMGED 3 Greater Than or Equal

VPCOMEQD 4 Equal

VPCOMNEQD 5 Not Equal

VPCOMFALSED 6 False

VPCOMTRUED 7 True

Page 137: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMD 137

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMD xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 CE /r /imm8

031326395

128255

6496127

0s

0316395 326496127

0316395 3264127

compare

02

cond

VPCOMD

four groups of all zeros or all ones

compare

96

Four Comparisons

src1 = xmm2

dest = xmm1

imm8

src2 = xmm3/mem128

Page 138: 128-Bit and 256-Bit XOP and FMA4 Instructions

138 VPCOMD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 139: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMQ 139

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Compares corresponding packed signed quadwords in the first and second sources and writes theresult of each comparison in the corresponding quadword of the destination. The result of eachcomparison is a 64-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMQ instruction requires four operands:

VPCOMQ dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the result iswritten to the destination XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 2-3.The VPCOMQ instruction with an appropriate value of imm8 is aliased to the following mnemonics tofacilitate coding.

The VPCOMQ instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMQ Compare Vector Signed Quadwords

Table 2-3. VPCOMQ Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTQ 0 Less Than

VPCOMLEQ 1 Less Than or Equal

VPCOMGTQ 2 Greater Than

VPCOMGEQ 3 Greater Than or Equal

VPCOMEQQ 4 Equal

VPCOMNEQQ 5 Not Equal

VPCOMFALSEQ 6 False

VPCOMTRUEQ 7 True

Page 140: 128-Bit and 256-Bit XOP and FMA4 Instructions

140 VPCOMQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMQ xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 CF /r imm8

63

12825564127

0s

06364127

6364127

compare

02

cond

VPCOMQ

two groups of all zeros or all ones

compare

src1 = xmm2

dest = xmm1

src2 = xmm3/mem128

imm8

0

all zeros or all ones all zeros or all ones0

Page 141: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMQ 141

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 142: 128-Bit and 256-Bit XOP and FMA4 Instructions

142 VPCOMUB Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Compares corresponding packed unsigned bytes in the first and second sources and writes the result ofeach comparison in the corresponding byte of the destination. The result of each comparison is an 8-bitvalue of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUB instruction requires four operands:

VPCOMUB dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the result iswritten to the destination XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 2-4.The VPCOMUB instruction with an appropriate value of imm8 is aliased to the following mnemonicsto facilitate coding.

The VPCOMUB instruction is an XOP instruction. The presence of this instruction set is indicated bya CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMUB Compare Vector Unsigned Bytes

Table 2-4. VPCOMUB Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTUB 0 Less Than

VPCOMLEUB 1 Less Than or Equal

VPCOMGTUB 2 Greater Than

VPCOMGEUB 3 Greater Than or Equal

VPCOMEQUB 4 Equal

VPCOMNEQUB 5 Not Equal

VPCOMFALSEUB 6 False

VPCOMTRUEUB 7 True

Page 143: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMUB 143

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMUB xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 EC /r imm8

07153147 23637995

128255

39557187103111119127

0s

07153147 23637995 39557187103111119127

07153147 23637995 39557187103111119127

compare

compare

02

cond

VPCOMUB

16 groups of all zeros or all ones

16 Comparisons

imm8

src1 = xmm2

src2 = xmm3/mem128

dest = xmm1

Page 144: 128-Bit and 256-Bit XOP and FMA4 Instructions

144 VPCOMUB Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 145: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMUD 145

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Compares corresponding packed unsigned doublewords in the first and second sources and writes theresult of each comparison in the corresponding doubleword of the destination. The result of eachcomparison is a 32-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUD instruction requires four operands:

VPCOMUD dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the results arewritten to the destination XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown Table 2-5. TheVPCOMUD instruction with an appropriate value of imm8 is aliased to the following mnemonics tofacilitate coding.

The VPCOMUD instruction is an XOP instruction. The presence of this instruction set is indicated bya CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMUD Compare Vector Unsigned Doublewords

Table 2-5. VPCOMUD Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTUD 0 Less Than

VPCOMLEUD 1 Less Than or Equal

VPCOMGTUD 2 Greater Than

VPCOMGEUD 3 Greater Than or Equal

VPCOMEQUD 4 Equal

VPCOMNEQUD 5 Not Equal

VPCOMFALSEUD 6 False

VPCOMTRUEUD 7 True

Page 146: 128-Bit and 256-Bit XOP and FMA4 Instructions

146 VPCOMUD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPCOMUB, VPCOMUW, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMUD xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 EE /r imm8

031326395

128255

6496127

0s

0316395 326496127

0316395 3264127

compare

02

cond

VPCOMUD

four groups of all zeros or all ones

compare

96

Four Comparisons

src1 = xmm2

dest = xmm1

src2 = xmm3/mem128

imm8

Page 147: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMUD 147

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 148: 128-Bit and 256-Bit XOP and FMA4 Instructions

148 VPCOMUQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Compares corresponding packed unsigned quadwords in the first and second sources and writes theresult of each comparison in the corresponding quadword of the destination. The result of eachcomparison is a 64-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUQ instruction requires four operands:

VPCOMUQ dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the results arewritten to the destination XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as shown in Table 2-6.The VPCOMUQ instruction with an appropriate value of imm8 is aliased to the following mnemonicsto facilitate coding.

The VPCOMUQ instruction is an XOP instruction. The presence of this instruction set is indicated bya CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMUQ Compare Vector Unsigned Quadwords

Table 2-6. VPCOMUQ Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTUQ 0 Less Than

VPCOMLEUQ 1 Less Than or Equal

VPCOMGTUQ 2 Greater Than

VPCOMGEUQ 3 Greater Than or Equal

VPCOMEQUQ 4 Equal

VPCOMNEQUQ 5 Not Equal

VPCOMFALSEUQ 6 False

VPCOMTRUEUQ 7 True

Page 149: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMUQ 149

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMUQ xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 EF /r imm8

63

128255

64127

0s

06364127

6364127

compare

02

cond

VPCOMUQ

two groups of all zeros or all ones

compare

src1 = xmm2

dest = xmm1

src2 = xmm3/mem128

imm8

0

all zeros or all ones all zeros or all ones0

Page 150: 128-Bit and 256-Bit XOP and FMA4 Instructions

150 VPCOMUQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 151: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMUW 151

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Compares corresponding packed unsigned words in the first and second sources and writes the resultof each comparison in the corresponding word of the destination. The result of each comparison is a16-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMUW instruction requires four operands:

VPCOMUW dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the results arewritten to the destination XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as defined in Table 2-7.The VPCOMUW instruction with an appropriate value of imm8 is aliased to the following mnemonicsto facilitate coding.

The VPCOMUW instruction is an XOP instruction. The presence of this instruction set is indicated bya CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMUW Compare Vector Unsigned Words

Table 2-7. VPCOMUW Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTUW 0 Less Than

VPCOMLEUW 1 Less Than or Equal

VPCOMGTUW 2 Greater Than

VPCOMGEUW 3 Greater Than or Equal

VPCOMEQUW 4 Equal

VPCOMNEQUW 5 Not Equal

VPCOMFALSEUW 6 False

VPCOMTRUEUW 7 True

Page 152: 128-Bit and 256-Bit XOP and FMA4 Instructions

152 VPCOMUW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPCOMUB, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMW, VPCOMD, VPCOMQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMUW xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 ED /r imm8

0153147 16637995

128255

3248648096111112127

0s

015316395 32648096111112127

0153147 16637995 3248648096111112127

compare

02

cond

VPCOMUW

8 groups of all zeros or all ones

8 Comparisons

imm8

src1 = xmm2

src2 = xmm3/mem128

dest = xmm1

474879 16

compare

Page 153: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMUW 153

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 154: 128-Bit and 256-Bit XOP and FMA4 Instructions

154 VPCOMW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Compares corresponding packed signed words in the first and second sources and writes the result ofeach comparison in the corresponding word of the destination. The result of each comparison is a 16-bit value of all 1s (TRUE) or all 0s (FALSE).

The VPCOMW instruction requires four operands:

VPCOMW dest, src1, src2, comp

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the results arewritten to the destination XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field and the second source(src2) is an XMM register or 128-bit memory location specified by the MODRM.rm field.

The comp type is specified by the three low-order bits of an immediate-byte, as defined in Table 2-8.The VPCOMW instruction with an appropriate value of imm8 is aliased to the following mnemonicsto facilitate coding.

The VPCOMW instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPCOMW Compare Vector Signed Words

Table 2-8. VPCOMW Comparison Operations

Mnemonic Implied Value of imm8

Comparison Operation

VPCOMLTW 0 Less Than

VPCOMLEW 1 Less Than or Equal

VPCOMGTW 2 Greater Than

VPCOMGEW 3 Greater Than or Equal

VPCOMEQW 4 Equal

VPCOMNEQW 5 Not Equal

VPCOMFALSEW 6 False

VPCOMTRUEW 7 True

Page 155: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPCOMW 155

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPCOMUB, VPCOMUW, VPCOMUD, VPCOMUQ, VPCOMB, VPCOMD, VPCOMQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPCOMW xmm1, xmm2, xmm3/mem128, imm8 8F RXB.8 0.xsrc1.0.00 CD /r imm8

0153147 16637995 3248648096111112127

015316395 32648096111112127

0153147 16637995 3248648096111112127

compare

02

cond

VPCOMW

8 groups of all zeros or all ones

8 Comparisons

imm8

src1 = xmm2

src2 = xmm3/mem128

dest = xmm1

474879 16

compare

1282550s

Page 156: 128-Bit and 256-Bit XOP and FMA4 Instructions

156 VPCOMW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 157: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPERMIL2PD 157

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Loads each quadword element of the destination register with either a quadword value from one of thesource registers, or with zero. The first and second source operands contain the quadword values tocopy from, and the third source operand and an immediate byte operand control which sourcequadword to copy and whether each destination quadword is cleared to zero.

There are XMM and YMM versions of this instruction, as determined by the value of VEX.L. Bothversions require four operands:

VPERMIL2PD dest, src1, src2, selector, imm8

Selector Operand. The selector operand is either a register or memory operand. The selectoroperand for the 128-bit version of this instruction is divided into two quadword selector elements; thatfor the 256-bit version is divided into four quadword selector elements. Each selector operanddetermines the value moved to its positionally corresponding element in the destination XMM orYMM register.

The bit field of each quadword selector element determines the source element selected and the valueactually written to the destination register element. See Figure 2-1.

Figure 2-1. 64-Bit Selector Element Format

Bit 0 and bits 63–4 of the quadword selector element are ignored. Selector element fields are:

Select—Bits 2–1 of each quadword element of the selector determine which source value is copiedinto the corresponding quadword element of the destination register. The source value selected byeach bit pattern is defined in Table 2-9: Match—Bit 3. The values of the Match bit and the Match field of the immediate byte operanddetermine whether each quadword in the destination operand is written with zero. See table 2-4 fordetails. See Table 2-10 for details.

VPERMIL2PD Permute Two-Source Double-Precision Floating-Point Values

63 4 3 2 1 0

Ignored M Sel I

Bits Mnemonic Description63-4 Ignored

3 M Match2-1 Sel Select0 Ignored

Page 158: 128-Bit and 256-Bit XOP and FMA4 Instructions

158 VPERMIL2PD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Immediate Operand. The immediate byte layout is illustrated in Figure 2-2.

Figure 2-2. 8-Bit Immediate Operand Format

Source Register Select—Bits 7–4. In 64-bit mode, if VEX.W is 0, the 4-bit src3 field identifies thethird source register; if VEX.W is 1, SRC identifies the second source register. (See source registerdiscussion below.) In non-64-bit mode, only bits [6:4] identify the register, and imm8[7] is ignored.Match to Zero—Bits 1–0. The two-bit M2Z field interacts with the Match bit of the selectorelement to determine the double-precision floating-point value that is written to the correspondingdoubleword element in the destination operand. For details on the interaction between theimmediate operand Match to Zero field and the Selector element Match bit, see Table 2-10 below.

Interaction between Selector Match Bit and Immediate Operand Match Field. Theresults of combining the selector match bit and the immediate operand match field are shown inTable 2-10.

Table 2-9. Selector and Source Selected

SelectorSource Selected for Quadwords

0 and 1

Source Selected for Quadwords

2 and 300b src1[63:0] src1[191:128]01b src1[127:64] src1[255:192]10b src2[63:0] src2[191:128]11b src2[127:64] src2[255:192]

7 4 3 2 1 0

SRC Ignore M2Z

Bits Mnemonic Description7–4 SRC Source Register Select3-2 Ignore1–0 M2Z Match to Zero

Page 159: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPERMIL2PD 159

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Bits 7-4. If VEX.W is 0, this field identifies the third source register; if VEX.W is 1, this fieldidentifies the second source register.In non-64-bit mode, only bits [6:4] identify the register, and imm8[7] is ignored.Bits 1–0. The two-bit M2Z field interacts with the Match bit of the selector element to determinethe double-precision floating-point value that is written to the corresponding quadword element inthe destination operand. For details on the interaction between the immediate operand Match toZero field and the Selector element Match bit, see Table 2-10 above.

Source Operands. The 128-bit version of this instruction forms a single source operand byconcatenation of the first source XMM register and the second source, which can be either an XMMregister or a 128-bit memory location, to form a single operand partition consisting of four 64-bitdouble-precision floating point values.

The 256-bit version forms two source operands by concatenation of the first source YMM register andthe second source, which can be either an XMM register or a 128-bit memory location, to form twooperand partitions each containing four 64-bit double-precision floating point values.

If VEX.W is 0, the second source is either an XMM or a YMM register or memory location and thethird source is a register. If VEX.W is 1, the second source is a register and the third source is a registeror a memory location. (See “Source Register Select,” above for a discussion of the use of theimmediate byte to select the second or third source register.)

Destination Register. The destination is always either an XMM register or a YMM register,depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bitXMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPERMIL2PD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Table 2-10. Interaction of Selector Match Bit and Immediate Operand Match Field

Immediate M2Z Field

Selector Match Bit Value Loaded into Destination Quadword

0Xb X Operand value according to Table 2-9.

10b 0 Operand value according to Table 2-9.

10b 1 Zero

11b 0 Zero

11b 1 Operand value according to Table 2-9.

Page 160: 128-Bit and 256-Bit XOP and FMA4 Instructions

160 VPERMIL2PD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Mnemonic Encoding

VEX RXB.mmmmmW.vvvv.L.pp Opcode

VPERMIL2PD xmm1, xmm2, xmm3/mem128, xmm4, imm8 C4 RXB.03 0.xsrc1.0.00 49 /r imm8

VPERMIL2PD xmm1, xmm2, xmm3, xmm4/mem128, imm8 C4 RXB.03 1.xsrc1.0.00 49 /r imm8

VPERMIL2PD ymm1, ymm2, ymm3/mem256, ymm4, imm8 C4 RXB.03 0.ysrc1.1.00 49 /r imm8

VPERMIL2PD ymm1, ymm2, ymm3, ymm4/mem256, imm8 C4 RXB.03 1.ysrc1.1.00 49 /r imm8

Page 161: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPERMIL2PD 161

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

123

0 or src 0 or src

06364127

06364127

VPERMIL2PD

src2

dest128255

0s

06364127

src1

06364127selector

1212 33

10

imm8

0 or src 0 or src

06364127

06364127src2

dest

06364127src1

06364127

selector

10imm8

0 or src 0 or src

VPERMIL2PD

128-Bit

256-Bit

128191192255

128191192255

191192255

123

128191192255

0 or src 0 or src

128

zero or src zero or src

select select

select select select select

123 123

zero or src zero or src

Page 162: 128-Bit and 256-Bit XOP and FMA4 Instructions

162 VPERMIL2PD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPERM2F128, VPERMIL2PD, VPERMIL2PS, VPERMILPD, VPERMILPS, VPPERM

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.

Page 163: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPERMIL2PS 163

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Loads each doubleword element of the destination register with either a doubleword value from one ofthe source registers, or with zero. The first and second source operands contain the doubleword valuesto copy from, and the third source operand and imm8 control which source doubleword to copy, andwhether each destination doubleword is cleared to zero.

There are XMM and YMM versions of this instruction, as determined by the value of VEX.L. Bothversions require four operands:

VPERMIL2PS dest, src1, src2, selector, imm8

Selector Operand. The selector operand is either a register or memory operand. The selectoroperand for the 128-bit version of this instruction is divided into four doubleword selector elements;that for the 256-bit version is divided into eight doubleword selector elements. Each selector operanddetermines the value moved to its positionally corresponding element in the destination XMM orYMM register.

The bit field of each doubleword selector element determines the source element selected and thevalue actually written to the destination register element. See Figure 2-3.

Figure 2-3. 32-Bit Selector Element Format

Bits 31–4 of the doubleword selector element are ignored. Selector element fields are:

Select—Bits 2–0 of each doubleword element of the selector determine which source value iscopied into the corresponding doubleword element of the destination register. The source valueselected by each bit pattern is defined in Table 2-11: Match—Bit 3. The values of the Match bit and the Match field of the immediate byte operanddetermine whether each doubleword in the destination operand is written with zero. See table 2-4for details

VPERMIL2PS Permute Two-Source Single-Precision Floating-Point Values

31 4 3 2 1 0

Ignored M Sel

Bits Mnemonic Description31-4 Ignored

3 M Match2-0 Sel Select

Page 164: 128-Bit and 256-Bit XOP and FMA4 Instructions

164 VPERMIL2PS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Immediate Operand. The immediate byte layout is illustrated in Figure 2-4.

Figure 2-4. 8-Bit Immediate Operand Format

Source Register Select—Bits 7–4. In 64-bit mode, if VEX.W is 0, the 4-bit src3 field identifies thethird source register; if VEX.W is 1, SRC identifies the second source register. (See source registerdiscussion below.) In non-64-bit mode, only bits [6:4] identify the register, and imm8[7] is ignored.Match to Zero—Bits 1–0. The two-bit M2Z field interacts with the Match bit of the selectorelement to determine the single-precision floating-point value that is written to the correspondingdoubleword element in the destination operand. For details on the interaction between theimmediate operand Match to Zero field and the Selector element Match bit, see Table 2-12 below.

Interaction between Selector Match Bit and Immediate Operand Match Field. Theresults of combining the selector match bit and the immediate operand match field are shown inTable 2-12.

Table 2-11. Selector and Source Selected

SelectorSource Selected for Doublewords

0, 1, 2, 3

Source Selected for Doublewords

4, 5, 6, 7000b src1[31:0] src1[159:128]001b src1[63:32] src1[191:160]010b src1[95:64] src1[223:192]011b src1[127:96] src1[255:224]100b src2[31:0] src2[159:128]101b src2[63:32] src2[191:160]110b src2[95:64] src2[223:192]111b src2[127:96] src2[255:224]

7 6 5 4 3 2 1 0

src3 Ignore M2Z

Bits Mnemonic Description7–4 SRC Source Register Select3-2 Ignore1–0 M2Z Match to Zero

Page 165: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPERMIL2PS 165

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Source Operands. The 128-bit version of this instruction forms a single source operand byconcatenation of the first source XMM register and the second source, which can be either an XMMregister or a 128-bit memory location, to form a single operand partition consisting of eight 32-bitsingle-precision floating-point values.

The 256-bit version forms two source operands by concatenation of the lower 128-bits of the firstsource with the lower 128-bits of the second source and concatenation of the upper 128-bits of the firstsource with the upper 128-bits of the second source, which forms two operand partitions eachconsisting of eight 32-bit single precision floating-point values. The first source is a YMM register andthe second source is either a YMM register or a 256-bit memory location.

If VEX.W is 0, the second source is either an XMM or a YMM register or memory location and thethird source is a register. If VEX.W is 1, the second source is a register and the third source is a registeror a memory location. (See “Source Register Select,” above for a discussion of the use of theimmediate byte to select the second or third source register.)

Destination Register. The destination is always either an XMM register or a YMM register,depending on the vector size, as determined by the value of VEX.L. When the destination is a 128-bitXMM register, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The VPERMIL2PS instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Table 2-12. Interaction of Selector Match Bit and Immediate Operand Match Field

Immediate M2Z Field

Selector Match Bit Value Loaded into Destination Quadword

0Xb X Operand value according to Table 2-11.

10b 0 Operand value according to Table 2-11.

10b 1 Zero

11b 0 Zero

11b 1 Operand value according to Table 2-11.

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPERMIL2PS xmm1, xmm2, xmm3/mem128, xmm4, imm8 C4 RXB.03 0.xsrc1.0.00 48 /r imm8

VPERMIL2PS xmm1, xmm2, xmm3, xmm4/mem128, imm8 C4 RXB.03 1.xsrc1.0.00 48 /r imm8

VPERMIL2PS ymm1, ymm2, ymm3/mem256, ymm4, imm8 C4 RXB.03 0.ysrc1.1.00 48 /r imm8

VPERMIL2PS ymm1, ymm2, ymm3, ymm4/mem256, imm8 C4 RXB.03 1.ysrc1.1.00 48 /r imm8

Page 166: 128-Bit and 256-Bit XOP and FMA4 Instructions

166 VPERMIL2PS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

06364127

VPERMIL2PS

src2

dest128255

0s

06364127src1

02 sel

3

10imm8

0 or src0 or src

128-Bit

31329596

31329596

zero or src

zero or src

zero or src

zero or src

06364127 31329596

select

select

select

02 sel

302 sel

3 02 sel

3

0 or src0 or src

select

src3

Page 167: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPERMIL2PS 167

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

063

6412

7

VPER

MIL

2PS

src3

dest

063

6412

7

src1 1

0

imm

8

0 or

src

0 or

src

256-

Bit

3132

9596

3132

9596

063

6412

731

3295

96

sel

ect

sel

ect

sel

ect

0 or

src

0 or

src

sel

ect

sel

ect

sel

ect

sel

ect

sel

ect

zer

oor sr

c

zer

oor sr

c

zer

oor sr

c

zer

oor sr

c

zer

oor sr

c

zer

oor sr

c

zer

oor sr

c

zer

oor sr

c

128

191

192

255

159

160

223

224

128

191

192

255

159

160

223

224

0 or

src

0 or

src

128

191

192

255

159

160

223

224

0 or

src

0 or

src

0 2 sel

3 0 2 sel

3 0 2 sel

30 2 sel

3 0 2 sel

3 0 2 sel

3 0 2 sel

30 2 sel

3

src2

Page 168: 128-Bit and 256-Bit XOP and FMA4 Instructions

168 VPERMIL2PS Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPERM2F128, VPERMIL2PD, VPERMIL2PS, VPERMILPD, VPERMILPS, VPPERM

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.

Page 169: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDBD 169

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds four successive 8-bit signed integer values from the source and packs the sign-extended resultsof the additions in the corresponding doubleword in the destination.

This instruction takes two operands:

VPHADDBD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHADDBD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDBW, VPHADDBQ, VPHADDWD, VPHADDWQ, VPHADDDQ

rFLAGS Affected

None

VPHADDBD Packed Horizontal Add Signed Byte to Signed Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDBD xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 C2 /r

07153147 23637995 39557187103111119127

0316395

128255

127

0’s

add add

src = xmm2mem128

dest = xmm1

VPHADDBD

add

96 64 32

add add

add

add add

add

add add

add

Page 170: 128-Bit and 256-Bit XOP and FMA4 Instructions

170 VPHADDBD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 171: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDBQ 171

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds eight successive 8-bit signed integer values from the source and packs the sign-extended resultsof the additions in the corresponding quadword in the destination.

This instruction takes two operands:

VPHADDBQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHADDBQ instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.

Related Instructions

VPHADDBW, VPHADDBD, VPHADDWD, VPHADDWQ, VPHADDDQ

VPHADDBQ Packed Horizontal Add Signed Byte to Signed Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDBQ xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 C3 /r

07153147 23637995 39557187103111119127

063

128255

127

0’s

add add

src = xmm2/mem128

dest = xmm1

VPHADDBQ

add

64

add add

add

add

add add

add

add add

add

add

Page 172: 128-Bit and 256-Bit XOP and FMA4 Instructions

172 VPHADDBQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

rFLAGS Affected

None

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 173: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDBW 173

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds each adjacent pair of 8-bit signed integer values from the source and packs the sign-extended 16-bit integer result of each addition in the corrresponding word element of the destination.

This instruction takes two operands:

VPHADDBW dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination XMM register is written, the upper 128 bits are cleared to zeros.

The PHADDBW instruction is an XOP instruction. The presence of this instruction set is indicated bya CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDBD, VPHADDBQ, VPHADDWD, VPHADDWQ, VPHADDDQ

rFLAGS Affected

None

MXCSR FLAGS Affected

None

VPHADDBW Packed Horizontal Add Signed Byte to Signed Word

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDBW xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 C1 /r

VPHADDBW

add

07153147 23637995128 39557187103111119127

0153147637995

128255

111127

0s

add add add add add add add

dest = xmm1

src = xmm2mem128

Page 174: 128-Bit and 256-Bit XOP and FMA4 Instructions

174 VPHADDBW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 175: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDDQ 175

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds each adjacent pair of signed doubleword integer values in the source and packs the sign-extended sums of each additions in the corresponding quadword in the destination register.

This instruction takes two operands:

VPHADDDQ dest, src

The source is an XMM register or 128-bit memory location and the destination is an XMM register. .When the destination XMM register is written, the upper 128 bits of the corresponding YMM registerare cleared to zeros.

The VPHADDDQ instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDBW, VPHADDBD, VPHADDBQ, VPHADDWD, VPHADDWQ

rFLAGS Affected

None

MXCSR FLAGS Affected

None

VPHADDDQ Packed Horizontal Add Signed Doubleword to Signed Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDDQ xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 CB /r

095 63127 64 313296

063127 64

128255

add add

0s

src = xmm2/mem128

dest = xmm1

VPHADDDQ

Page 176: 128-Bit and 256-Bit XOP and FMA4 Instructions

176 VPHADDDQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 177: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDUBD 177

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds four successive 8-bit unsigned integer values from the source and packs the results of theadditions in the corresponding doubleword in the destination.

This instruction takes two operands:

VPHADDUBD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHADDUBD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDUBW, VPHADDUBQ, VPHADDUWD, VPHADDUWQ, VPHADDUDQ

rFLAGS Affected

None

VPHADDUBD Packed Horizontal Add Unsigned Byte to Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDUBD xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 D2 /r

VPHADDUBD

07153147 23637995 39557187103111119127

0316395

128255

127

0s

add add

src = xmm2/mem128

dest = xmm1

add

96 64 32

add add

add

add add

add

add add

add

Page 178: 128-Bit and 256-Bit XOP and FMA4 Instructions

178 VPHADDUBD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 179: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDUBQ 179

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds eight successive 8-bit unsigned integer values from the second source and packs the results ofthe additions in the corresponding quadword in the destination.

This instruction takes two operands:

VPHADDUBQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination XMM register is written, the upper 128 bits of the corresponding YMM registerare cleared to zeros.

The PHADDUBQ instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUWD, VPHADDUWQ, VPHADDUDQ

VPHADDUBQ Packed Horizontal Add Unsigned Byte to Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDUBQ xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 D3 /r

PHADDUBQ

07153147 23637995 39557187103111119127

063

128255

127

0s

add add

src = xmm2/mem128

dest = xmm1

add

64

add add

add

add

add add

add

add add

add

add

Page 180: 128-Bit and 256-Bit XOP and FMA4 Instructions

180 VPHADDUBQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

rFLAGS Affected

None

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 181: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDUBW 181

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds each adjacent pair of 8-bit unsigned integer values from the source and packs the 16-bit integerresults of each addition in the corresponding word in the destination.

This instruction takes two operands:

VPHADDUBW dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination XMM register is written, the upper 128 bits of the corresponding YMM registerare cleared to zeros.

The VPHADDUBW instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDUBD, VPHADDUBQ, VPHADDUWD, VPHADDUWQ, VPHADDUDQ

rFLAGS Affected

None

MXCSR FLAGS Affected

None

VPHADDUBW Packed Horizontal Add Unsigned Byte to Word

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDUBWD xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 D1 /r

VPHADDUBW

add

07153147 23637995 39557187103111119127

0153147637995

128255

111127

0s

add add add add add add add

dest = xmm1

src = xmm2/mem128

Page 182: 128-Bit and 256-Bit XOP and FMA4 Instructions

182 VPHADDUBW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 183: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDUDQ 183

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds each adjacent pair of 32-bit unsigned integer values from the source and packs the results of eachaddition in the corresponding quadword in the destination.

This instruction takes two operands:

VPHADDUDQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHADDUDQ instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUBQ, VPHADDUWD, VPHADDUWQ

rFLAGS Affected

None

VPHADDUDQ Packed Horizontal Add Unsigned Doubleword to Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDUDQ xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 DB /r

VPHADDDQ095 63127 64 313296

063127 64

128255

add add

0s

src = xmm2/mem128

dest = xmm1

Page 184: 128-Bit and 256-Bit XOP and FMA4 Instructions

184 VPHADDUDQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 185: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDUWD 185

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds each adjacent pair of 16-bit unsigned integer values from the source and packs the results of eachaddition in the corresponding doubleword in the destination.

This instruction takes two operands:

VPHADDUWD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHADDUWD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.

Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUBQ, VPHADDUWQ, VPHADDUDQ

rFLAGS Affected

None

MXCSR FLAGS Affected

None

VPHADDUWDPacked Horizontal Add Unsigned Word to Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDUWD xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 D6 /r

PHADDUWD

0111 95 63127 161564 3132484780 7996112

128255095 63127 64 313296

add add add add

src = xmm2/mem128

dest = xmm10s

Page 186: 128-Bit and 256-Bit XOP and FMA4 Instructions

186 VPHADDUWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 187: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDUWQ 187

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds four successive 16-bit unsigned integer values from the source and packs the results of theadditions in the corresponding quadword element in the destination.

This instruction takes two operands:

VPHADDUWQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHADDUWQ instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDUBW, VPHADDUBD, VPHADDUBQ, VPHADDUWD, VPHADDUDQ

rFLAGS Affected

None

VPHADDUWQ Packed Horizontal Add Unsigned Word to Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDUWQ xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 D7 /r

095 63127 64 313296

063127 64128255

add add

0s

dest = xmm2/mem128

dest = xmm1

VPHADDDQ

Page 188: 128-Bit and 256-Bit XOP and FMA4 Instructions

188 VPHADDUWQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 189: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDWD 189

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds each adjacent pair of 16-bit signed integer values from the source and packs the sign-extendedresults of the addition in the corresponding doubleword in the destination).

This instruction takes two operands:

VPHADDWD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination XMM register is written, the upper 128 bits or the corresponding YMM registerare cleared to zeros.

The VPHADDWD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDBW, VPHADDBD, VPHADDBQ, VPHADDWQ, VPHADDDQ

rFLAGS Affected

None

VPHADDWD Packed Horizontal Add Signed Word to Signed Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDWD xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 C6 /r

0111 95 63127 161564 3132484780 7996112

128255095 63127 64 313296

add add add add

src = xmm2/mem128

dest = xmm1

PHADDWD

0s

Page 190: 128-Bit and 256-Bit XOP and FMA4 Instructions

190 VPHADDWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 191: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHADDWQ 191

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Adds four successive 16-bit signed integer values from the second source and packs the sign-extendedresults of each addition in the corresponding quadword in the destination.

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination XMM register is written, the upper 128 bits of the corresponding YMM registerare cleared to zeroes.

The VPHADDWQ instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

Related Instructions

VPHADDBW, VPHADDBD, VPHADDBQ, VPHADDWD, VPHADDDQ

rFLAGS Affected

None

MXCSR FLAGS Affected

None

VPHADDWQ Packed Horizontal Add Signed Word to Signed Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHADDWQ xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 C7 /r

0111 95 63127 161564 3132484780 7996112

128255 063127

add

src = xmm2/mem128

dest = xmm1

64

add

add add

add

add

VPHADDWD

0s

Page 192: 128-Bit and 256-Bit XOP and FMA4 Instructions

192 VPHADDWQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 193: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHSUBBW 193

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Subtracts the most significant signed integer byte from the least significant signed integer byte of eachword element in the source and packs the sign-extended 16-bit integer results of each subtraction in thedestination.

This instruction takes two operands:

VPHSUBBW dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHSUBBW instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.

Related Instructions

VPHSUBWD, VPHSUBDQ

rFLAGS Affected

None

VPHSUBBW Packed Horizontal Subtract Signed Byte to Signed Word

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHSUBBW xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 E1 /r

VPHSUBBW

subtract

07153147 23637995 39557187103111119127

0153147637995128255

111127

0s

src = xmm2/mem128

dest = xmm1

subtract subtract subtract

subtract subtract

subtract subtract

Page 194: 128-Bit and 256-Bit XOP and FMA4 Instructions

194 VPHSUBBW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR FLAGS Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 195: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHSUBDQ 195

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Subtracts the most significant signed integer doubleword from the least significant signed integerdoubleword of each quadword in the source and packs the sign-extended 64-bit integer result of eachsubtraction in the corresonding quadword element of the destination.

This instruction takes two operands:

VPHSUBDQ dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHSUBDQ instruction is an XOP instruction. The presence of this instruction set is indicated bya CPUID feature bit. (See the CPUID Specification, order# 25481.

Related Instructions

VPHSUBBW, VPHSUBWD

rFLAGS Affected

None

MXCSR FLAGS Affected

None

VPHSUBDQ Packed Horizontal Subtract Signed Doubleword to Signed Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHSUBDQ xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 E3 /r

095 63127 64 313296

063127 64128255

subtract

0s

src = xmm2/mem128

dest = xmm1

VPHSUBDQ

subtract

Page 196: 128-Bit and 256-Bit XOP and FMA4 Instructions

196 VPHSUBDQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 197: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPHSUBWD 197

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Subtracts the most significant signed integer word from the least significant signed integer word ofeach doubleword from the source and packs the sign-extended 32-bit integer result of each subtractionin the destination.

This instruction takes two operands:

VPHSUBWD dest, src

The destination is an XMM register and the source is an XMM register or 128-bit memory location.When the destination register is written, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The VPHSUBWD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.

Related Instructions

VPHSUBBW, VPHSUBDQ

rFLAGS Affected

None

MXCSR FLAGS Affected

None

VPHSUBWD Packed Horizontal Subtract Signed Word to Signed Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPHSUBWD xmm1, xmm2/mem128 8F RXB.09 0.1111.0.00 E2 /r

0111 95 63127 161564 3132484780 7996112

128255095 63127 64 313296

src = xmm2/mem128

dest = xmm1

VPHSUBWD

subtract subtract subtract subtract

0s

Page 198: 128-Bit and 256-Bit XOP and FMA4 Instructions

198 VPHSUBWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 199: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSDD 199

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies each packed 32-bit signed integer value in the first source by the corresponding packed 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to thecorresponding packed 32-bit signed integer value in the third source. The four resulting 32-bit sumsare stored in the destination.

The VPMACSDD instruction requires four operands:

VPMACSDD dest, src1, src2, src3 dest = src1* src2 + src3

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destinationregister is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When the third source designates the same XMM register as the destination, the XMM registerbehaves as an accumulator.

No saturation is performed on the sum. If the result of the multiplication causes non-zero values to beset in the upper 32 bits of the 64 bit product, they are ignored. If the result of the add overflows, thecarry is ignored (neither the overflow nor carry bit in rFLAGS is set). In both cases, only the signedlow-order 32 bits of the result are written to the destination.

The VPMACSDD instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.

VPMACSDD Packed Multiply Accumulate Signed Doubleword to Signed Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSDD xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 9E /r /is4

Page 200: 128-Bit and 256-Bit XOP and FMA4 Instructions

200 VPMACSDD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

VPMACSDD

127 6463 0

255 128VEX.L=0 0s

dest = xmm1/ymm

src1 = xmm2 src2 = xmm3/mem128

src3 = xmm4/ymm

31329596

mul

add add

mul

add

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

mul

add

mul

Page 201: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSDD 201

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.X XOP.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 202: 128-Bit and 256-Bit XOP and FMA4 Instructions

202 VPMACSDQH Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the second 32-bit signed integer value of the first source by the second 32-bit signed integervalue in the second source, then adds the 64-bit signed integer product to the low-order 64-bit signedinteger value in the third source. Simultaneously, multiplies the fourth 32-bit signed integer value ofthe first source by the fourth 32-bit signed integer value in the second source, then adds the 64-bitsigned integer product to the second 64-bit signed integer value in the third source.The results arewritten to the destination register.

The VPMACSDQH instruction requires four operands:

VPMACSDQH dest, src1, src2, src3 dest = src1* src2 + src3

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destinationregister is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) isan XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source(src3) is an XMM register specified by imm8[7:4].

When the third source designates the same XMM register as th destination register, the XMM registerbehaves as an accumulator.

No saturation is performed on the sum. If the result of the add overflows, the carry is ignored (neitherthe overflow nor carry bit in rFLAGS is set).

The VPMACSDQH instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMACSDQH Packed Multiply Accumulate Signed High Doubleword to Signed Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSDQH xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 9F /r /is4

Page 203: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSDQH 203

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

63127

255064

127

0s dest = xmm1

src3 = xmm4095 63127 64 313296

095 63127 64 313296

multiply

063127 64

add

addmultiply

VPMACSDQH

src2 = xmm3/mem128

src1 = xmm2

Page 204: 128-Bit and 256-Bit XOP and FMA4 Instructions

204 VPMACSDQH Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 205: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSDQL 205

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies the low-order 32-bit signed integer value of the first source by the low-order 32-bit signedinteger value in the second source, then adds the 64-bit signed integer product to the low-order 64-bitsigned integer value in the third source. Simultaneously, multiplies the third 32-bit signed integervalue of the first source by the corresponding 32-bit signed integer value in the second source, thenadds the 64-bit signed integer product to the second 64-bit signed integer value in the third source. Theresults are written to the destination (register.

TheVPMACSDQL instruction requires four operands:

VPMACSDQL dest, src1, src2, src3 dest = src1* src2 + src3

The destination register is a YMM register addressed by the MODRM.reg field. When the destinationregister is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

No saturation is performed on the sum. If the result of the add overflows, the carry is ignored (neitherthe overflow nor carry bit in rFLAGS is set). Only the low-order 64 bits of each result are written in thedestination.

The VPMACSDQL instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMACSDQL Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSDQL xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.8 0.xsrc1.0.00 97 /r /is4

Page 206: 128-Bit and 256-Bit XOP and FMA4 Instructions

206 VPMACSDQL Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

63127

255064

127

0s dest = xmm1

src3 = xmm4

095 63127 64 313296

095 63127 64 313296

multiply

063127 64

add

addmultiply

VPMACSDQL

src2 = xmm3/mem128

src1 = xmm2

Page 207: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSDQL 207

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 208: 128-Bit and 256-Bit XOP and FMA4 Instructions

208 VPMACSSDD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed 32-bit signed integer value in the first source by the corresponding packed 32-bit signed integer value in the second source, then adds each 64-bit signed integer product to thecorresponding packed 32-bit signed integer value in the third source. The saturated results are writtento the destination register.

The VPMACSSDD instruction requires four operands:

VPMACSSDD dest, src1, src2, src3 dest = src1* src2 + src3

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

Out of range results of the addition are saturated to fit into a signed 32-bit integer. For each packedvalue in the destination, if the value is larger than the largest signed 32-bit integer, it is saturated to7FFF_FFFFh, and if the value is smaller than the smallest signed 32-bit integer, it is saturated to8000_0000h.

The VPMACSSDD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMACSSDD Packed Multiply Accumulate Signed Doubleword to Signed Doubleword with Saturation

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSSDD xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 8E /r /is4

Page 209: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSSDD 209

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

VPMACSSDD

127 6463 0

255 128VEX.L=0 0s

sat

dest = xmm1

src1 = xmm2 src2 = xmm3/mem128

src3 = xmm4/ymm

31329596

mul

add add

mul

add

255 192191 128159160223224

127 6463 031329596

127 6463 031329596

mul

add

mul

satsat

sat

Page 210: 128-Bit and 256-Bit XOP and FMA4 Instructions

210 VPMACSSDD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 211: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSSDQH 211

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies the second 32-bit signed integer value of the first source by the second 32-bit signed integervalue in the second source, then adds the 64-bit signed integer product to the low-order 64-bit signedinteger value in the third source. Simultaneously, multiplies the fourth 32-bit signed integer value ofthe first source by the fourth 32-bit signed integer value in the second source, then adds the 64-bitsigned integer product to the high-order 64-bit signed integer value in the third source. The saturatedresults are written to the destination register.

The PMACSSDQH instruction requires four operands:

VPMACSSDQH dest, src1, src2, src3 dest = src1* src2 + src3

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

Out of range results of the addition are saturated to fit into a signed 64-bit integer. For each packed value in the destination, if the value is larger than the largest signed 64-bit integer, it is saturated to 7FFF_FFFF_FFFF_FFFFh, and if the value is smaller than the smallest signed 64-bit integer, it is saturated to 8000_0000_0000_0000h.

The VPMACSSDQH instruction is an XOP instruction. The presence of this instruction set is indicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMACSSDQH Packed Multiply Accumulate Signed High Doubleword to Signed Quadword with Saturation

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSSDQH xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 8F /r is4

Page 212: 128-Bit and 256-Bit XOP and FMA4 Instructions

212 VPMACSSDQH Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

63127

255064

127

0s dest = xmm1

src3 = xmm4095 63127 64 313296

095 63127 64 313296

063127 64

multiply

VPMACSSDQH

src2 = xmm3/mem128

src1 = xmm2

saturate saturate

add

add

multiply

Page 213: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSSDQH 213

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 214: 128-Bit and 256-Bit XOP and FMA4 Instructions

214 VPMACSSDQL Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies the low-order 32-bit signed integer value of the first source by the low-order 32-bit signedinteger value in the second source, then adds the 64-bit signed integer product to the low-order 64-bitsigned integer value in the third source. Simultaneously, multiplies the third 32-bit signed integervalue of the first source by the third 32-bit signed integer value in the second source, then adds the 64-bit signed integer product to the high-order 64-bit signed integer value in the third source. Thesaturated results are written to the destination register.

The VPMACSSDQL instruction requires four operands:

VPMACSSDQL dest, src1, src2, src3 dest = src1* src2 + src3

The destination (dest) register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

Out of range results of the addition are saturated to fit into a signed 64-bit integer. For each packed value in the destination, if the value is larger than the largest signed 64-bit integer, it is saturated to 7FFF_FFFF_FFFF_FFFFh, and if the value is smaller than the smallest signed 64-bit integer, it is saturated to 8000_0000_0000_0000h.

The VPMACSSDQL instruction is an XOP instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.

VPMACSSDQL Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword with Saturation

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

PMACSSDQL xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 87 /r /is4

Page 215: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSSDQL 215

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

63127

255064

128

0s dest = xmm1

src3 = xmm4095 63127 64 313296

095 63127 64 313296

multiply

063127 64

multiply

VPMACSSDQL

src2 = xmm3/mem128

src1 = xmm2

saturate saturate

add

add

Page 216: 128-Bit and 256-Bit XOP and FMA4 Instructions

216 VPMACSSDQL Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 217: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSSWD 217

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies the odd-numbered packed 16-bit signed integer values in the first source by thecorresponding packed 16-bit signed integer values in the second source, then adds the 32-bit signedinteger products to the corresponding packed 32-bit signed integer values in the third source. Thesaturated results are written to the destination register.

The VPMACSSWD instruction requires four operands:

VPMACSSWD dest, src1, src2, src3 dest = src1* src2 + src3

The destinationa (dest) is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv field; the second source (src2) isan XMM register or 128-bit memory location specified by the MODRM.rm field; and the third source(src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

Out of range results of the addition are saturated to fit into a signed 32-bit integer. For each packedvalue in the destination, if the value is larger than the largest signed 32-bit integer, it is saturated to7FFF_FFFFh, and if the value is smaller than the smallest signed 32-bit integer, it is saturated to8000_0000h.

The VPMACSSWD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.

VPMACSSWD Packed Multiply Accumulate Signed Word to Signed Doubleword with Saturation

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSSWD xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 86 /r /is4

Page 218: 128-Bit and 256-Bit XOP and FMA4 Instructions

218 VPMACSSWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

src1 = xmm2

dest = xmm1

015

1631

3247

4863

6479

8095

96111

112127

multiply

multiply

multiply

0151631324748636479809596111112127

multiply

031

3263

6495

96127

add

add

add

add

255 1280s

031

3263

6495

96127

VPMACSSWD

src2 = xmm3/mem128

src3 = xmm4

saturatesaturatesaturate saturate

Page 219: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSSWD 219

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 220: 128-Bit and 256-Bit XOP and FMA4 Instructions

220 VPMACSSWW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed 16-bit signed integer value in the first source by its corresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signed integer products to thecorresponding packed 16-bit signed integer value in the third source. The saturated results are writtento the destination register.

The VPMACSSWW instruction requires four operands:

VPMACSSWW dest, src1, src2, src3 dest = src1* src2 + src3

The destination register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When src3 and dest designate the same XMM register, this register behaves as an accumulator.

Out of range results of the addition are saturated to fit into a signed 16-bit integer. For each packedvalue in the destination, if the value is larger than the largest signed 16-bit integer, it is saturated to7FFFh, and if the value is smaller than the smallest signed 16-bit integer, it is saturated to 8000h.

The VPMACSSWW instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.

VPMACSSWW Packed Multiply Accumulate Signed Word to Signed Word with Saturation

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

PMACSSWW xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 85 /r /is4

Page 221: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSSWW 221

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL,VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

src1 = xmm2

src2 = xmm3/mem128

dest = xmm1

0151631324748636479809596111112127

multiply

multiply

multiply

multiply add

add

multiply

multiply

multiply

0151631324748636479809596111112127

multiply

add

add

add

add

add

015

1631

3247

4863

6479

8095

96111

112127

VPMACSSWW

add

multiply

0151631324748636479809596111112127255 128

0s

saturate saturate saturate saturatesaturate saturate saturate saturate

src3 = xmm4

Page 222: 128-Bit and 256-Bit XOP and FMA4 Instructions

222 VPMACSSWW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 223: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSWD 223

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies each odd-numbered packed 16-bit signed integer value in the first source by thecorresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signedinteger products to the corresponding packed 32-bit signed integer value in the third source. The fourresults are written to the destination register.

The VPMACSWD instruction requires four operands:

VPMACSWD dest, src1, src2, src3 dest = src1* src2 + src3

The destination (dest) register is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

If the result of the add overflows, the carry is ignored (neither the overflow nor carry bit in rFLAGS isset). Only the low-order 32 bits of the result are written in the destination.

The VPMACSWD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMACSWD Packed Multiply Accumulate Signed Word to Signed Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSWD xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 96 /r /is4

Page 224: 128-Bit and 256-Bit XOP and FMA4 Instructions

224 VPMACSWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSSDD, VPMACSDO, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

src1 = xmm2

src3 = xmm/mem128

015

1631

3247

4863

6479

8095

96111

112127

multiply

multiply

multiply

0151631324748636479809596111112127

multiply

031

3263

6495

96127

add

add

add

add

031

3263

6495

96127

128255

0s

src4 = xmm

dest = xmm1

VPMACSWD

Page 225: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSWD 225

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 226: 128-Bit and 256-Bit XOP and FMA4 Instructions

226 VPMACSWW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed 16-bit signed integer value in the first source by the corresponding packed 16-bit signed integer value in the second source, then adds each 32-bit signed integer product to thecorresponding packed 16-bit signed integer value in the third source. The eight results are written tothe destination register.

The VPMACSWW instruction requires four operands:

VPMACSWW dest, src1, src2, src3 dest = src1* src2 + src3

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the destination XMM register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source (src1) is an XMM register specified by the XOP.vvvv fields; the second source (src2)is an XMM register or 128-bit memory location specified by the MODRM.rm field; and the thirdsource (src3) is an XMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

No saturation is performed on the sum. If the result of the multipliplication causes non-zero values tobe set in the upper 16 bits of the 32 bit result, they are ignored. If the result of the add overflows, thecarry is ignored (neither the overflow nor carry bit in rFLAGS is set). In both cases, only the signedlow-order 16 bits of the result are written in the destination.

The VPMACSWW instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMACSWW Packed Multiply Accumulate Signed Word to Signed Word

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMACSWW xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 95 /r /is4

Page 227: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMACSWW 227

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPMACSSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

src1 = xmm src3 = xmm0

1516

3132

4748

6364

7980

9596

111112

127

multiply

multiply

multiply

multiply add

add

multiply

multiply

multiply

015

1631

3247

4863

6479

8095

96111

112127

multiply

add

add

add

add

add

add

015

1631

3247

4863

6479

8095

96111

112127

015

1631

3247

4863

6479

8095

96111

112127 src2 = xmm/mem128

dest = xmm

VPMACSWW

128255

0s

Page 228: 128-Bit and 256-Bit XOP and FMA4 Instructions

228 VPMACSWW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 229: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMADCSSWD 229

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Multiplies each packed 16-bit signed integer value in the first source by the corresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signed integer products of the even-odd adjacent words. Each resulting sum is then added to the corresponding packed 32-bit signedinteger value in the third source. The four results are written to the destination (accumulator) register.

The VPMADCSSWD instruction requires four operands:

VPMADCSSWD dest, src1, src2, src3 dest = src1* src2 + src3

The destination register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv fields; the second source is an XMMregister or 128-bit memory location specified by the MODRM.rm field; and the third source is anXMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

Out of range results of the addition are saturated to fit into a signed 32-bit integer. For each packed value in the destination, if the value is larger than the largest signed 32-bit integer, it is saturated to 7FFF_FFFFh, and if the value is smaller than the smallest signed 32-bit integer, it is saturated to 8000_0000h.

The VPMADCSSWD instruction is an XOP instruction. The presence of this instruction set isindicated by a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMADCSSWD Packed Multiply, Add and Accumulate Signed Word to Signed Doubleword with Saturation

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPMADCSSWD xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 A6 /r /is4

Page 230: 128-Bit and 256-Bit XOP and FMA4 Instructions

230 VPMADCSSWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

src1 = xmm2

src2 = xmm3/mem128

src3 = xmm40

1516

3132

4748

6364

7980

9596

111112

127

multiply

multiply

multiply

multiplyadd

multiply

multiply

multiply

015

1631

3247

4863

6479

8095

96111

112127

multiply

add

add

031

3263

6495

96127

saturatesaturatesaturatesaturate

add

031

3263

6495

96127

dest = xmm1128255

0s

VPMADCSSWD

add

add

add

add

Page 231: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMADCSSWD 231

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 232: 128-Bit and 256-Bit XOP and FMA4 Instructions

232 VPMADCSWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Multiplies each packed 16-bit signed integer value in the first source by the corresponding packed 16-bit signed integer value in the second source, then adds the 32-bit signed integer products of the even-odd adjacent words together and adds their sum to the corresponding packed 32-bit signed integervalues in the third source. The four results are written to the destination register.

The VPMADCSWD instruction requires four operands:

VPMADCSWD dest, src1, src2, src3 dest = src1* src2 + src3

The destination register is an XMM register addressed by the MODRM.reg field. When the destination register is written, the upper 128 bits of the corresponding YMM register are cleared to zeros.

The first source is an XMM register specified by the XOP.vvvv fields, the second source is an XMMregister or 128-bit memory location specified by the MODRM.rm field; and the third source is anXMM register specified by imm8[7:4].

When src3 designates the same XMM register as the dest register, the XMM register behaves as anaccumulator.

No saturation is performed on the sum. If the result of the addition overflows, the carry is ignored(neither the overflow nor carry bit in rFLAGS is set). Only the signed 32-bits of the result are writtento the destination.

The VPMADCSWD instruction is an XOP instruction. The presence of this instruction set is indicatedby a CPUID feature bit. (See the CPUID Specification, order# 25481.)

VPMADCSWD Packed Multiply Add and Accumulate Signed Word to Signed Doubleword

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

PMADCSWD xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.08 0.xsrc1.0.00 B6 /r /is4

Page 233: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPMADCSWD 233

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPMACSSWW, VPMACSWW, VPMACSSWD, VPMACSWD, VPMACSSDD, VPMACSDD, VPMACSSDQL, VPMACSSDQH, VPMACSDQL, VPMACSDQH, VPMADCSSWD

rFLAGS Affected

None

MXCSR Flags Affected

None

src1 = xmm2

src2 = xmm3/mem128

src3 = xmm40151631324748636479809596111112127

multiply

multiply

multiply

multiplyadd

multiply

multiply

multiply

015

1631

3247

4863

6479

8095

96111

112127

multiply

add

add

0313263649596127

add

031

3263

6495

96127

dest = xmm1128255

0s

VPMADCSWD

add

add

add

add

Page 234: 128-Bit and 256-Bit XOP and FMA4 Instructions

234 VPMADCSWD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.W was set to 1.X XOP.L was set to 1.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 235: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPPERM 235

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Selects 16 of the 32-packed bytes in the two sources and optionally applies a logical transformation toeach selected byte before it is stored to its specified position in the destination XMM register.

The VPPERM instruction requires four operands:

VPPERM dest, src1, src2, selector

The 32-byte source consists of the concatenation of the second source (src2) and the first source (src1).

The third source operand (selector) contains 16 control bytes, each specifying the source byte and thelogical operation to perform on that byte before the result is stored in the destination register. The orderof the bytes in the destination register is the same as that of the selector bytes in the selector.

The src1 operand is always an XMM register specified by XOP.vvvv

This instruction supports operand source configuration using XOP.W. When XOP.W is 0, src2 is anXMM register or 128-bit memory location specified by MODRM.rm and selector is an XMM registerspecified by imm8[7:4]. When XOP.W is 1, src2 is an XMM register specified by imm8[7:4] andselector is an XMM register or 128-bit memory location specified by MODRM.rm.

The destination (dest) is always an XMM register specified by MODRM.reg. When the result operandis written to the dest XMM register, the upper 128 bits of the corresponding YMM register are clearedto zeros.

For each byte of the 16-byte result, the corresponding selector byte is used as follows:

• Bits 4:0 of the selector selects the source byte to move from the 32 bytes from src2:src1.

• Bits 7:5 of the selector selects the logical operation to perform on the selected operand.

VPPERM Packed Permute Bytes

Page 236: 128-Bit and 256-Bit XOP and FMA4 Instructions

236 VPPERM Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Table 2-13. VPPERM Control Byte

TheVPPERM instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

Bits Description7:5 Defines the logical operation performed on the selected operand.

OP Operation

000 Source byte (no logical operation)

001 Invert source byte

010 Bit reverse of source byte

011 Bit reverse of inverted source byte

100 00h

101 FFh

110 Most significant bit of source byte replicated in all bit positions.

111 Invert most significant bit of source byte and replicate in all bit positions.

4:0 Selects the source byte to permuteSelector Source Selected Selector Source Selected

00000 src1[7:0] 10000 src2[7:0]

00001 src1[15:8] 10001 src2[15:8]

00010 src1[23:16] 10010 src2[23:16]

00011 src1[31:24] 10011 src2[31:24]

00100 src1[39:32] 10100 src2[39:32]

00101 src1[47:40] 10101 src2[47:40]

00110 src1[55:48] 10110 src2[55:48]

00111 src1[63:56] 10111 src2[63:56]

01000 src1[71:64] 11000 src2[71:64]

01001 src1[79:72] 11001 src2[79:72]

01010 src1[87:80] 11010 src2[87:80]

01011 src1[95:88] 11011 src2[95:88]

01100 src1[103:96] 11100 src2[103:96]

01101 src1[111:104] 11101 src2[111:104]

01110 src1[119:112] 11110 src2[119:112]

01111 src1[127:120] 11111 src2[127:120]

Mnemonic Encoding

Page 237: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPPERM 237

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPSHUFHW, VPSHUFD, VPSHUFLW, VPSHUFW, VPERMIL2PS, VPERMIL2PD

rFLAGS Affected

None

MXCSR Flags Affected

None

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPPERM xmm1, xmm2, xmm3, xmm4/mem128 8F RXB.8 1.xsrc1.0.00 A3 /r is4

VPPERM xmm1, xmm2, xmm3/mem128, xmm4 8F RXB.8 0.xsrc1.0.00 A3 /r is4

src1src2

1631

0 byte8112127

……d-1dd+1

src30 byte11415

……n-1nn+116173031

……m-1mm+1

dest0s

0 byte11415

……d-1dd+1

bits 4:0

(selector)

selector= src1[n]

VPPERM

logOp[d] bits 7:5

Page 238: 128-Bit and 256-Bit XOP and FMA4 Instructions

238 VPPERM Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled while MXCSR.MM=1.

Page 239: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPROTB 239

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Rotates each byte of the source by the amount specified in the signed value of the corresponding countbyte and writes the result in the corresponding byte of the destination.

There are two versions of the instruction, depending on the source of the count byte used for each 8-bitshift:

VPROTB dest, src, fixed-countVPROTB dest, src, variable-count-src

The destination (dest) operand of both versions of this instruction is an XMM register addressed by theMODRM.reg field. When the result of the rotation is written to the destination XMM register, theupper 128 bits of the corresponding YMM register are cleared to zeros.

The fixed-count version of this instruction rotates each byte element of the source (src) by the numberof bits specified by the immediate fixed-count byte. All byte elements of the source are rotated by thesame number of bits. The source is a 128-bit XMM register or memory location addressed by theMODRM.rm field.

The variable-count-src version of this instruction rotates each byte of the source by the amountspecified in the corresponding byte element in the variable-count-src, which is an XMM register or128-bit memory location.

The src and variable-count-src are configurable through XOP.W. If XOP.W is 0, the variable-count-src is an XMM register specified by XOP.vvvv and the src operand is an XMM register or 128-bitmemory location specified by MODRM.rm. If XOP.W is 1, the variable-count-src is an XMMregister or 128-bit memory location specified by MODRM.rm and the src operand is an XMM registerspecified by XOP.vvvv.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions). Thebits rotated out left of the most significant bit are rotated back in at the right end (least-significant bit)of the byte.

If the count value is negative, bits are rotated to the right (toward the least significant bit positions).The bits rotated to the right out of the least significant bit are rotated back in at the left end (most-significant bit) of the byte.

The VPROTB instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPROTB Packed Rotate Bytes

MnemonicEncoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPROTB xmm1, xmm2/mem128, xmm8 8F RXB.09 0.xcnt.0.00 90 /r

VPROTB xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 90 /r

VPROTB xmm1, xmm2/mem128, imm8 8F RXB.08 0.1111.0.00 C0 /r /ib

Page 240: 128-Bit and 256-Bit XOP and FMA4 Instructions

240 VPROTB Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPROTW, VPROTD, VPROTQ,VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW,VPSHAD, VPSHAQ

rFLAGS Affected

None

imm8

127 0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

dest = xmm

src1127 0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

rotate

rotate

16 rotates

… … … … … … … … … … … … … … … … …

src2 = 16 count bytes

0

dest = xmm

count

127 0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

127 0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

rotate

rotate

16 rotates

… … … … … … … … … … … … … … … … …

… … … … … … … … … … … … … … … … …

src1

7

256 128

0s

count count

256 128

0s

VPROTB

xmm if VEX.W = 1xmm/mem128 if VEX.W = 0

xmm/mem128 if VEX.W = 0mem128 if VEX.W = 1

Page 241: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPROTB 241

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.X XOP.vvvv was not 1111b for immediate count form of

instruction (opcode C0h).Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 242: 128-Bit and 256-Bit XOP and FMA4 Instructions

242 VPROTD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Rotates each of the four doublewords of the source operand by the amount specified in the signedvalue of the corresponding count byte and writes the result in the corresponding doubleword of thedestination.

There are two variants of this instruction, depending on the source of the count byte used for each doubleword shift:

VPROTD dest, src, fixed-countVPROTD dest, src, variable-count

The dest operand of both versions of this instruction is an XMM register addressed by theMODRM.reg field. When the 128-bit result operand is written to the dest register, the upper 128 bits ofthe corresponding YMM register are cleared to zeros.

The fixed count version of this instruction rotates each doubleword of the source operand by thenumber of bits specified by the immediate fixed-count byte operand. All doubleword elements of thesource operand are rotated by the same number of bits. The src is anXMM register or memory locationaddressed by the MODRM.rm field.

The variable count version of this instruction rotates each doubleword of the source by the amountspecified in the low order byte of the corresponding doubleword of the variable-count operand vector.

The src and variable-count operand vector are configurable through XOP.W. If XOP.W is 0, the src isan XMM register or 128-bit memory location specified by the MODRM.rm field and the variable-count operand vector is an XMM register specified by XOP.vvvv. If XOP.W is 1, the src operand is anXMM register specified by XOP.vvvv and the variable-count operand is an XMM register or 128-bitmemory location specified by the MODRM.rm field.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions). Thebits rotated out to the left of the most significant bit of each source doubleword operand are rotatedback in at the right end (least-significant bit) of the doubleword.

If the count value is negative, bits are rotated to the right (toward the least significant bit positions).The bits rotated to the right out of the least significant bit of each source doubleword operand arerotated back in at the left end (most-significant bit) of the doubleword.

The VPROTD instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPROTD Packed Rotate Doublewords

Page 243: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPROTD 243

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPROTB, VPROTW, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW,VPSHAD, VPSHAQ

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPROTD xmm1, xmm2/mem128, xmm3 8F RXB.09 0.xcnt.0.00 92 /r

VPROTD xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 92 /r

VPROTD xmm1, xmm2/mem128, imm8 8F RXB.08 0.1111.0.00 C2 /ib

countimm8

xmm/mem128127 6463 0

255 1280s

dest = xmm

31329596

rotate

rotate

rotate

rotate

127 6463 031329596

count

src2 = variable count127 6463 0

255 1280s

dest = xmm

xmm/mem128 if VEX.W = 031329596 127 6463 031329596

rotate

rotate

rotate

rotate

127 6463 031329596

count

count

count

count

mem128 if VEX.W = 1xmm if VEX.W = 1xmm/mem128 if VEX.W = 0

VPROTD

count

count

count

src1

src1

Page 244: 128-Bit and 256-Bit XOP and FMA4 Instructions

244 VPROTD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.X XOP.vvvv was not 1111b for immediate count form of

instruction (opcode C2h).Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 245: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPROTQ 245

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Rotates each of the quadwords of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding quadword of the destination.

There are two variants of this instruction, depending on the source of the count byte used for each quadword shift:

VPROTQ dest, src, fixed-countVPROTQ dest, src, variable-count

The dest operand of both versions of this instruction is an XMM register addressed by theMODRM.reg field. When the 128-bit result is written to the dest XMM register, the upper 128 bits ofthe corresponding YMM register are cleared to zeros.

The fixed count version of this instruction rotates each quadword in the source by the number of bitsspecified by the immediate fixed-count byte operand. All quadword elements of the source are rotatedby the same number of bits. The src is a 128-bit XMM register or memory location addressed by theMODRM.rm field.

The variable count version of this instruction rotates each quadword of the source by the amountspecified in the low order byte of the corresponding quadword of the variable-count operand.

The src and variable-count are configurable through XOP.W. If XOP.W is 0, the src is an XMMregister or 128-bit memory location specified by MODRM.rm and the count is an XMM registerspecified by XOP.vvvv. If XOP.W is 1, src is an XMM register specified by XOP.vvvv and thevariable-count is an XMM register or 128-bit memory location specified by MODRM.rm.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions) ofthe operand element. The bits rotated out to the left of the most significant bit of the word element arerotated back in at the right end (least-significant bit).

If the count value is negative, operand element bits are rotated to the right (toward the least significantbit positions). The bits rotated to the right out of the least significant bit are rotated back in at the leftend (most-significant bit) of the word element.

The VPROTQ instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPROTQ Packed Rotate Quadwords

Page 246: 128-Bit and 256-Bit XOP and FMA4 Instructions

246 VPROTQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

PROTB, PROTW, PROTD, PSHLB, PSHLW, PSHLD, PSHLQ, PSHAB, PSHAW, PSHAD, PSHAQ

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPROTQ xmm1, xmm2/mem128, xmm3 8F RXB.09 0.xcnt.0.00 93 /r

VPROTQ xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 93 /r

VPROTQ xmm1, xmm2/mem128, imm8 8F RXB.08 0.1111.0.00 C3 /ib

VPROTQ

countimm8

xmm/mem128127 6463 0

255 1280s

dest = xmm

rotate

rotate

127 6463 0

count

count

src1

src2 = variable count127 6463 0

255 1280s

dest = xmm

xmm/mem128 if VEX.W = 0127 6463 0

rotate

rotate

127 64 63 0

count

count

mem128 if VEX.W = 1xmm if VEX.W = 1xmm/mem128 if VEX.W = 0

src1

Page 247: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPROTQ 247

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.X XOP.vvvv was not 1111b for immediate count form of

instruction (opcode C3h).Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 248: 128-Bit and 256-Bit XOP and FMA4 Instructions

248 VPROTW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Rotates each of the eight words of the source operand by the amount specified in the signed value ofthe corresponding count byte and writes the result in the corresponding word of the destination.

There are two variants of this instruction, depending on the source of the count byte used for each word shift:

VPROTW dest, src, fixed-countVPROTW dest, src, variable-count

The dest operand of both versions of this instruction is a YMM register addressed by the MODRM.regfield. When the 128-bit result operand is written to the dest XMM register, the upper 128 bits of thecorresponding YMM register are cleared to zeros.

The fixed count version of this instruction rotates each word of the source operand by the number ofbits specified by the immediate fixed-count byte operand. All word elements of the source operand arerotated by the same number of bits. The src operand is a 128-bit YMM register or memory locationaddressed by the MODRM.rm field.

The variable count version of this instruction rotates each word of the source operand by the amountspecified in the low order byte of the corresponding word of the variable-count operand.

The src and count operands are configurable through XOP.W. If XOP.W is 0, the src operand is anXMM register or 128-bit memory location specified by MODRM.rm and the count operand is anXMM register specified by XOP.vvvv. If XOP.W is 1, the src operand is an XMM register specifiedby XOP.vvvv and the variable-count operand is an XMM register or 128-bit memory locationspecified by MODRM.rm.

If the count value is positive, bits are rotated to the left (toward the more significant bit positions) . Thebits rotated out to the left of the most significant bit of an element are rotated back in at the right end(least-significant bit) of the word element.

If the count value is negative, bits are rotated to the right (toward the least significant bit positions) ofthe element. The bits rotated to the right out of the least significant bit of an element are rotated back inat the left end (most-significant bit) of the word element.

The PROTW instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPROTW Packed Rotate Words

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPROTW xmm1, xmm2/mem128, xmm3 8F RXB.09 0.xcnt.0.00 91 /r

VPROTW xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 91 /r

VPROTW xmm1, xmm2/mem128, imm8 8F RXB.08 0.1111.0.00 C1 /r /ib

Page 249: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPROTW 249

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

PROTB, PROTD, PROTQ, PSHLB, PSHLW, PSHLD, PSHLQ, PSHAB, PSHAW, PSHAD, PSHAQ

rFLAGS Affected

None

127 0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

src1

015

1631

3247

4863

6479

8095

96111

112127

dest

015

1631

3247

4863

6479

8095

96111

112127

255 128

rotate

count count

count count

count count

count

count

src2(xmm/mem)

(xmm)

(xmm)

rotaterotate

rotaterotate

rotaterotate

rotate

src1

015

1631

3247

4863

6479

8095

96111

112127

dest015

1631

3247

4863

6479

8095

96111

112127

255 128

rotate

(xmm)

(xmm)

rotaterotate

rotaterotate

rotaterotate

rotate

0 7

count countimm8

0s

0s

VPROTW

Page 250: 128-Bit and 256-Bit XOP and FMA4 Instructions

250 VPROTW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.X XOP.vvvv was not 1111b for immediate count form of

instruction (opcode C1h).Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 251: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHAB 251

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Shifts each signed byte of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding byte of the destination.

The count byte for each 8-bit shift is an 8-bit signed two's-complement value in the corresponding byteelement of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the byte.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).The most significant bit (sign bit) is replicated and shifted in at the left end (most-significant bit) of thebyte.

The VPSHAB instruction requires three operands:

VPSHAB dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the results arewritten to the destination XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

If XOP.W is 0, the count is an XMM register specified by XOP.vvvv and the src is an XMM register or128-bit memory location specified by MODRM.rm. If XOP.W is 1, the count is an XMM register or128-bit memory location specified by MODRM.rm and the src operand is an XMM register specifiedby XOP.vvvv.

The VPSHAB instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHAB Packed Shift Arithmetic Bytes

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHAB xmm1, xmm2/mem128, xmm3 8F RXB.09 0.xcnt.0.00 98 /r

VPSHAB xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 98 /r

Page 252: 128-Bit and 256-Bit XOP and FMA4 Instructions

252 VPSHAB Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAW,VPSHAD, VPSHAQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

src

0s

127

127

127

255

00

0128

shift

shift

16 count bytes

count

count

VPSHAB

Page 253: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHAB 253

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Exception Real Virtual8086 Protected Cause of Exception

Page 254: 128-Bit and 256-Bit XOP and FMA4 Instructions

254 VPSHAD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Shifts each of the four signed doublewords of the source operand by the amount specified in the signedvalue of the corresponding count byte and writes the result in the corresponding doubleword of thedestination.

The count byte for each doubleword shift is an 8-bit signed two's-complement value located in thelow-order byte of the corresponding doubleword element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the doubleword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).The most significant bit (sign bit) is replicated and shifted in at the left end (most-significant bit) of thedoubleword.

The VPSHAD instruction requires three operands:

VPSHAD dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bitresult is written to the dest XMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is an XMM registerspecified by XOP.vvvv and the src is an XMM register or memory location specified by MODRM.rm.If XOP.W is 1, the count is an XMM register or memory location specified by MODRM.rm and the srcis an XMM register specified by XOP.vvvv.

The VPSHAD instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHAD Packed Shift Arithmetic Doublewords

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHAD xmm1, xmm2/mem128, xmm3 8F RXB.09 0.xcnt.0.00 9A /r

VPSHAD xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 9A /r

Page 255: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHAD 255

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB,VPSHAW, VPSHAQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.

0 7 32 39 64 71 96 103

src = xmm/mem

031326364

9596127

dest = xmm

0313263649596127

shift

shift

shiftshift

count

count

count

count

count = xmm/mem127

255 128

VPSHAD

0s

Page 256: 128-Bit and 256-Bit XOP and FMA4 Instructions

256 VPSHAD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Exception Real Virtual8086 Protected Cause of Exception

Page 257: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHAQ 257

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Shifts the two quadwords of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding quadword of the destination.

The count byte for each quadword shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding quadword element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the quadword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).The most significant bit is replicated and shifted in at the left end (most-significant bit) of thequadword.

The shift amount is stored in two’s-complement form. The count is modulo 64.

The VPSHAQ instruction requires three operands:

VPSHAQ dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bitresult operand is written to the dest XMM register, the upper 128 bits of the corresponding YMMregister are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMMregister specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified byMODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified byMODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHAQ instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHAQ Packed Shift Arithmetic Quadwords

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHAQ xmm1, xmm2/mem128, xmm3 8F RXB.09 0.xcnt.0.00 9B /r

VPSHAQ xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 9B /r

Page 258: 128-Bit and 256-Bit XOP and FMA4 Instructions

258 VPSHAQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB,VPSHAW, VPSHAD

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.

127 0 7 64 71

src106364127

dest06364127255 128

shift

shift

count

count

count(xmm/mem)

(xmm)

(ymm)0s

Page 259: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHAQ 259

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enable.

Exception Real Virtual8086 Protected Cause of Exception

Page 260: 128-Bit and 256-Bit XOP and FMA4 Instructions

260 VPSHAW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Shifts each of the eight words of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding signed word of the destination.

The count byte for each word shift is an 8-bit signed two's-complement value located in the low-orderbyte of the corresponding word element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the word.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).The most significant bit (signed bit) is replicated and shifted in at the left end (most-significant bit) ofthe word.

The shift amount is stored in two’s-complement form. The count is modulo 16.

The VPSHAW instruction requires three operands:

VPSHAW dest, src, count

The destination (dest) is a YMM register addressed by the MODRM.reg field. When the 128-bit resultoperand is written to the destination XMM register, the upper 128 bits of the corresponding YMMregister are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMMregister specified by XOP.vvvv and the src operand is a 128-bit XMM register or memory locationspecified by MODRM.rm. If XOP.W is 1, the count operand is a 128-bit XMM register or memorylocation specified by MODRM.rm and the src operand is a 128-bit XMM register specified byXOP.vvvv.

The VPSHAW instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHAW Packed Shift Arithmetic Words

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHAW xmm1, xmm2/mem128, xmm3 8F RXB.09 0.xcnt.0.00 99 /r

VPSHAW xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 99 /r

Page 261: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHAW 261

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHLQ, VPSHAB,VPSHAD, VPSHAQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.

127 0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

src1015

1631

3247

4863

6479

8095

96111

112127

dest015

1631

3247

4863

6479

8095

96111

112127128

shiftshift

shiftshift

shiftshift

shiftshift

count count

count count

count count

count

count

count(xmm/mem)

(xmm)

(ymm)0s

Page 262: 128-Bit and 256-Bit XOP and FMA4 Instructions

262 VPSHAW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Exception Real Virtual8086 Protected Cause of Exception

Page 263: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHLB 263

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Shifts each byte of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding byte of the destination.

The count byte for each byte shift is an 8-bit signed two's-complement value located in the thecorresponding byte element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the byte.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).Zeros are shifted in at the left end (most-significant bit) of the byte.

The VPSHLB instruction requires three operands:

VPSHLB dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bitresult is written to the destination XMM register, the upper 128 bits of the corresponding YMMregister are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMMregister specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified byMODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified byMODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLB instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHLB Packed Shift Logical Bytes

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHLB xmm1, xmm2/mem128, xmm3 8F RXB.9 0.xcnt.0.00 94 /r

VPSHLB xmm1, xmm2, xmm3/mem128 8F RXB.9 1.xsrc.0.00 94 /r

Page 264: 128-Bit and 256-Bit XOP and FMA4 Instructions

264 VPSHLB Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLW, VPSHLD, VPSHLQ, VPSHAB, VPSHAW,VPSHAD, VPSHAQ

rFLAGS Affected

None

MXCSR Flags Affected

None

src

count

127 0

127 0

dest 0127

……16 counts…

shift shift

16 shifts

Page 265: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHLB 265

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.X VEX.vvvv was not 1111b.

Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 266: 128-Bit and 256-Bit XOP and FMA4 Instructions

266 VPSHLD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Shifts each doubleword of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding doubleword of the destination.

The count byte for each doubleword shift is an 8-bit signed two's-complement value located in thelow-order byte of the corresponding doubleword element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the doubleword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).Zeros are shifted in at the left end (most-significant bit) of the doubleword.

The shift amount is stored in two’s-complement form. The count is modulo 32.

The VPSHLD instruction requires three operands:

VPSHLD dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bitresult is written to the destination XMM register, the upper 128 bits of the corresponding YMMregister are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count is a 128-bit XMMregister specified by XOP.vvvv and the src is a 128-bit XMM register or memory location specified byMODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory location specified byMODRM.rm and the src operand is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLD instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHLD Packed Shift Logical Doublewords

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHLD xmm1, xmm3/mem128, xmm2 8F RXB.09 0.xcnt.0.00 96 /r

VPSHLD xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 96 /r

Page 267: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHLD 267

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLQ, VPSHAB, VPSHAW,VPSHAD, VPSHAQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

255 128

0 7 32 39 64 71 96 103

src10313263649596127

dest0313263649596127

shift

shift

shiftshift

count

count

count

count

count(xmm/mem)

(xmm)

(ymm)

127

0s

Page 268: 128-Bit and 256-Bit XOP and FMA4 Instructions

268 VPSHLD Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Exception Real Virtual8086 Protected Cause of Exception

Page 269: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHLQ 269

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Shifts the two quadwords of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding quadword of the destination.

The count byte for each quadword shift is an 8-bit signed two's-complement value located in the low-order byte of the corresponding quadword element of the count operand.

Bit 6 of the count byte is ignored.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the quadword.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).Zeros are shifted in at the left end (most-significant bit) of the quadword.

The VPSHLQ instruction requires three operands:

VPSHLQ dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bitresult is written to the dest YMM register, the upper 128 bits of the corresponding YMM register arecleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count operand is a 128-bitXMM register specified by XOP.vvvv and the src is a 128-bit XMM register or memory locationspecified by MODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory locationspecified by MODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLQ instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHLQ Packed Shift Logical Quadwords

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHLQ xmm1, xmm3/mem128, xmm2 8F RXB.09 0.xcnt.0.00 97 /r

VPSHLQ xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 97 /r

Page 270: 128-Bit and 256-Bit XOP and FMA4 Instructions

270 VPSHLQ Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Related Instructions

VPROTB, VPROTW, VPROTD, VPROTQ, VPSHLB, VPSHLW, VPSHLD, VPSHAB, VPSHAW,VPSHAD, VPSHAQ

rFLAGS Affected

None

MXCSR Flags Affected

None

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

127 0 7 64 71

src1063

64127

dest063

64127255 128

shift

shift

count

count

count(xmm/mem)

(xmm)

(ymm)0s

Page 271: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHLQ 271

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Exception Real Virtual8086 Protected Cause of Exception

Page 272: 128-Bit and 256-Bit XOP and FMA4 Instructions

272 VPSHLW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Shifts each of the eight words of the source operand by the amount specified in the signed value of thecorresponding count byte and writes the result in the corresponding word of the destination.

The count byte for each word shift is an 8-bit signed two's-complement value located in the low-orderbyte of the corresponding word element of the count operand.

If the count value is positive, bits are shifted to the left (toward the more significant bit positions).Zeros are shifted in at the right end (least-significant bit) of the word.

If the count value is negative, bits are shifted to the right (toward the least significant bit positions).Zeros are shifted in at the left end (most-significant bit) of the word.

The VPSHLW instruction requires three operands:

VPSHLW dest, src, count

The destination (dest) is an XMM register addressed by the MODRM.reg field. When the 128-bitresult is written to the destination XMM register, the upper 128 bits of the corresponding YMMregister are cleared to zeros.

The src and count are configurable through XOP.W. If XOP.W is 0, the count operand is a 128-bitXMM register specified by XOP.vvvv and the src is a 128-bit XMM register or memory locationspecified by MODRM.rm. If XOP.W is 1, the count is a 128-bit XMM register or memory locationspecified by MODRM.rm and the src is a 128-bit XMM register specified by XOP.vvvv.

The VPSHLW instruction is an XOP instruction. The presence of this instruction set is indicated by aCPUID feature bit. (See the CPUID Specification, order# 25481.)

VPSHLW Packed Shift Logical Words

Mnemonic Encoding

XOP RXB.mmmmm W.vvvv.L.pp Opcode

VPSHLW xmm1, xmm3/mem128, xmm2 8F RXB.09 0.xcnt.0.00 95 /r

VPSHLW xmm1, xmm2, xmm3/mem128 8F RXB.09 1.xsrc.0.00 95 /r

Page 273: 128-Bit and 256-Bit XOP and FMA4 Instructions

Instruction Reference VPSHLW 273

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Related Instructions

VPROTB, VPROLW, VPROTD, VPROTQ, VPSHLB, VPSHLD, VPSHLQ, VPSHAB, VPSHAW,VPSHAD, VPSHAQ

rFLAGS Affected

None

MXCSR Flags Affected

None

0 7 16 32 23 39 48 55 64 71 80 87 96 103 112 119

src1015

1631

3247

4863

6479

8095

96111

112127

dest015

1631

3247

4863

6479

8095

96111

112127255 128

shiftshift

shiftshift

shiftshift

shiftshift

count count

count count

count count

count

count

src2(ymm/mem)

(ymm)

(ymm)0s

Page 274: 128-Bit and 256-Bit XOP and FMA4 Instructions

274 VPSHLW Instruction Reference

AMD64 Technology Documentation Updates 43479—3.04—November 2009

Exceptions

Exception Real Virtual8086 Protected Cause of Exception

Invalid opcode, #UD

X X XOP instructions are only recognized in protected mode.

X The XOP instructions are not supported, as indicated by ECX bit 11 of CPUID function 8000_0001h.

X The operating-system XSAVE/XRSTOR support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.

X The operating-system YMM support bits XFEATURE_ENABED_MASK[2:1] were were not both set to 1.

X XOP.L was set to 1.Device not available, #NM

X The task-switch bit (TS) of CR0 was set to 1.

Stack, #SS X A memory address exceeded the stack segment limit or was non-canonical.

General protection, #GP X A memory address exceeded a data segment limit or was non-canonical.

X A null data segment was used to reference memory.Page fault, #PF X A page fault resulted from the execution of the instruc-

tion.Alignment Check, #AC X An unaligned memory reference was performed while

alignment checking was enabled.

Page 275: 128-Bit and 256-Bit XOP and FMA4 Instructions

275

43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates

Numerics16-bit mode ........................................................... 1232-bit mode ........................................................... 1264-bit mode ........................................................... 13Aaddressing

RIP-relative........................................................ 18Bbiased exponent..................................................... 13Ccommit .................................................................. 13compatibility mode ............................................... 13Ddirect referencing .................................................. 13displacements ........................................................ 14double quadword................................................... 14doubleword ........................................................... 14EeAX–eSP register.................................................. 19effective address size ............................................ 14effective operand size............................................ 14eFLAGS register ................................................... 20eIP register ............................................................ 20element .................................................................. 14endian order .......................................................... 22exceptions ............................................................. 14exponent ................................................................ 13Fflush....................................................................... 15IIGN ....................................................................... 15indirect .................................................................. 15Llegacy mode .......................................................... 15legacy x86 ............................................................. 15long mode.............................................................. 15LSB ....................................................................... 16lsb .......................................................................... 16Mmask ...................................................................... 16MBZ ...................................................................... 16modes

16-bit.................................................................. 1232-bit.................................................................. 1264-bit.................................................................. 13compatibility ...................................................... 13legacy ................................................................. 15long .................................................................... 15

protected............................................................ 17real..................................................................... 17virtual-8086....................................................... 19

moffset.................................................................. 16MSB ..................................................................... 16msb ....................................................................... 16MSR ..................................................................... 20Ooctword................................................................. 16offset..................................................................... 16overflow ............................................................... 17Ppacked .................................................................. 17PHADDUBD...................................................... 177protected mode ..................................................... 17Qquadword .............................................................. 17Rr8–r15 ................................................................... 20rAX–rSP ............................................................... 20RAZ ...................................................................... 17real address mode. See real modereal mode .............................................................. 17registers

eAX–eSP........................................................... 19eFLAGS ............................................................ 20eIP ..................................................................... 20r8–r15 ................................................................ 20rAX–rSP............................................................ 20rFLAGS............................................................. 21rIP...................................................................... 21

relative .................................................................. 18reserved ................................................................ 17revision history ....................................................... 9rFLAGS register ................................................... 21rIP register ............................................................ 21RIP-relative addressing ........................................ 18Sset ......................................................................... 18SSE ....................................................................... 18sticky bits ............................................................. 18TTSS ....................................................................... 18Uunderflow ............................................................. 18Vvector.................................................................... 18VFMADDPD ....................................................... 42

Page 276: 128-Bit and 256-Bit XOP and FMA4 Instructions

276

AMD64 Technology Documentation Updates 43479—Rev. 3.04—November 2009

VFMADDPS......................................................... 46VFMADDSD ........................................................ 50VFMADDSS................................................... 53, 97VFMADDSUBPD ................................................ 56VFMADDSUBPS ................................................. 60VFMSUBADDPD ................................................ 64VFMSUBADDPS ................................................. 68VFMSUBPD ......................................................... 72VFMSUBPS.......................................................... 76VFMSUBSD ......................................................... 80VFMSUBSS.......................................................... 84VFNMADDPD ..................................................... 88VFNMADDPS ...................................................... 91VFNMADDSD ..................................................... 94VFNMSUBPD .................................................... 100VFNMSUBPS..................................................... 104VFNMSUBSD .................................................... 108VFNMSUBSS..................................................... 112VFRCZPD........................................................... 116VFRCZPS ........................................................... 119VFRCZSD........................................................... 122VFRCZSS ........................................................... 126virtual-8086 mode ................................................. 19VPCMOV ........................................................... 130VPCOMB.................................................... 133, 136VPCOMQ ........................................................... 139VPCOMUB......................................................... 142VPCOMUD................................................. 142, 145VPCOMUQ......................................................... 148VPCOMUW................................................ 148, 151VPCOMW........................................................... 154VPERMIL2PD .................................................... 157VPERMIL2PS..................................................... 163VPHADDBD ...................................................... 169VPHADDBQ ...................................................... 171VPHADDBW ..................................................... 173VPHADDDQ ...................................................... 175VPHADDUBQ ................................................... 179VPHADDUBW................................................... 181VPHADDUDQ ................................................... 183VPHADDUWD .................................................. 185VPHADDUWQ .................................................. 187VPHADDWD ..................................................... 189VPHADDWQ ..................................................... 191VPHSUBBW ...................................................... 193VPHSUBDQ ....................................................... 195VPHSUBWD ...................................................... 197VPMACSDD ...................................................... 199VPMACSDQH ................................................... 202VPMACSDQL .................................................... 205VPMACSSDD .................................................... 208

VPMACSSDQL ................................................. 214VPMACSSQH .................................................... 211VPMACSSWD .................................................. 217VPMACSSWW.................................................. 220VPMACSWD..................................................... 223VPMACSWW .................................................... 226VPMADCSSWD................................................ 229VPMADCSWD .................................................. 232VPPERM ............................................................ 235VPROTB ............................................................ 239VPROTD ............................................................ 242VPROTQ ............................................................ 245VPROTW ........................................................... 248VPSHAB ............................................................ 251VPSHAD ............................................................ 254VPSHAQ ............................................................ 257VPSHAW ........................................................... 260VPSHLB ............................................................ 263VPSHLD ............................................................ 266VPSHLQ ............................................................ 269VPSHLW............................................................ 272