Advanced Computer Architecture 1



    TERM PAPER

    OF

    ADVANCED COMPUTER ARCHITECTURE

    ON

    USE OPENMP OR ANY OTHER API TO PARALLELIZE A MATRIX
    MULTIPLICATION OF 5000 BY 5000 IN C/C++ FOR A DUAL-CORE
    MACHINE. USE INTEL TOOLS TO ANALYZE THE CHANGE IN
    PERFORMANCE. (FOR MORE THAN ONE STUDENT TOGETHER)

    Submitted To:                 Submitted By:
    Ms. Pency                     Nakul Kumar
                                  Roll no.: A (07)
                                  Reg. no.: 10901295
                                  Section: RS1906


    TABLE OF CONTENTS:

    Acknowledgement
    Computer Architecture
    Matrix Multiplication
        • CREW Matrix Multiplication
        • EREW Matrix Multiplication
    Parallel Matrix Multiplication
        • Dumb
        • Standard
        • Single
        • Unsafe Single
        • Jagged
        • Jagged from C++
        • Stack Allocated
    Parallel Algorithms for Matrix Multiplication
    Optimizing the Matrix Multiplication
    Optimizing the Parallel Matrix Multiplication
    References


    ACKNOWLEDGEMENT

    THE SUCCESSFUL COMPLETION OF ANY TASK WOULD BE INCOMPLETE WITHOUT MENTIONING
    THE PEOPLE WHO MADE IT POSSIBLE. SO IT IS WITH GRATITUDE THAT I ACKNOWLEDGE
    THE HELP WHICH CROWNED MY EFFORTS WITH SUCCESS.

    LIFE IS A PROCESS OF ACCUMULATING AND DISCHARGING DEBTS, NOT ALL OF WHICH CAN
    BE MEASURED. I CANNOT HOPE TO DISCHARGE THEM WITH SIMPLE WORDS OF THANKS, BUT
    I CAN CERTAINLY ACKNOWLEDGE THEM.

    I OWE MY GRATITUDE TO MS. PENCY, LECTURER, LSM, FOR HER CONSTANT GUIDANCE AND
    SUPPORT.

    I WOULD ALSO LIKE TO THANK THE VARIOUS DEPARTMENT OFFICIALS AND STAFF WHO NOT
    ONLY PROVIDED ME WITH THE REQUIRED OPPORTUNITY BUT ALSO EXTENDED THEIR
    VALUABLE TIME; I HAVE NO WORDS TO EXPRESS MY GRATEFULNESS TO THEM.

    LAST BUT NOT LEAST, I AM VERY MUCH INDEBTED TO MY FAMILY AND FRIENDS FOR
    THEIR WARM ENCOURAGEMENT AND MORAL SUPPORT IN CONDUCTING THIS PROJECT WORK.

    NAKUL KUMAR


    Computer Architecture:

    Computer architecture, or digital computer organization, is the conceptual design and fundamental
    operational structure of a computer system. It is a blueprint and functional description of the
    requirements and design implementations for the various parts of a computer, focusing largely on the
    way the central processing unit (CPU) operates internally and accesses addresses in memory.

    Computer architecture comprises at least three main subcategories:

    Instruction set architecture: The instruction set architecture (ISA) is the abstract image of a computing
    system as seen by a machine-language programmer, including the instruction set, word size, memory
    address modes, processor registers, and address and data formats.

    Microarchitecture: Microarchitecture, also known as computer organization, is a lower-level, more
    concrete and detailed description of the system that covers how the constituent parts of the system are
    interconnected and how they interoperate in order to implement the ISA.

    System design: System design includes all of the other hardware components within a computing
    system, such as:

    • System interconnects such as computer buses and switches
    • Memory controllers and hierarchies
    • CPU off-load mechanisms such as direct memory access (DMA)
    • Issues like multiprocessing

    There are many types of computer architectures:

    • Quantum computer vs. chemical computer
    • Scalar processor vs. vector processor
    • Non-Uniform Memory Access (NUMA) computers
    • Register machine vs. stack machine
    • Harvard architecture vs. von Neumann architecture
    • Cellular architecture


    Matrix Multiplication:

    Matrix-matrix multiplication is a fundamental kernel, one which can achieve high efficiency in both
    theory and practice. First, some caveats and assumptions:

    • This material is for dense matrices, ones where there are few zeros, so the matrix is efficiently
      stored in a 2D array.

    • Distinguish between a matrix and an array. The first is a mathematical object, a rectangular
      arrangement of numbers usually indexed by an integer pair (i, j) [that starts indexing from 1,
      BTW]. The second is a computer data structure, which can be used to hold a matrix, and it
      might be indexed starting from 0, 1, or anything convenient.

    • A load-store analysis shows that the memory-reference-to-flop ratio for matrix-matrix multiply
      is O(1/n), and hence it should be implementable with near-peak performance on a cache-based
      serial computer (a one-line version of this analysis is sketched just after this list).

    • The BLAS function for matrix-matrix multiply is dgemm, which is faster to type.

    • There are "reduced order" algorithms (Strassen, Winograd) which use extra memory but
      compute the product in fewer than 2n³ flops; the exponent drops from 3 to roughly 2.8. Only the
      standard algorithm is considered here, because the reduced-order techniques can always be
      applied within a single process in the parallel versions. Also, the basic observation that BLAS
      matrix-matrix multiply has a memory-reference-to-flop ratio that goes to zero as n increases
      still holds.

    • Few modern applications really need matrix-matrix multiplication with dense matrices. It is
      more of a toy, and more of a diagnostic for a system than a useful kernel: if 85% of the
      theoretical peak performance cannot be achieved on a machine, then the machine is flawed in
      some way: OS, compiler, or hardware.
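
    To make the O(1/n) claim from the list above concrete, here is the one-line load-store argument (a
    back-of-the-envelope bound for n x n operands): the kernel performs about 2n³ flops but only has to
    touch about 3n² matrix elements, so

    \[
      \frac{\text{minimum memory traffic}}{\text{flops}}
      \approx \frac{3n^{2}}{2n^{3}}
      = \frac{3}{2n}
      = O\!\left(\frac{1}{n}\right).
    \]

    With blocking (discussed later), a real implementation can approach this bound.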

    The benchmark program repeatedly multiplies two different 100x100 matrices into a third matrix; the
    multiplication series is repeated 100 times. This time, I only tested Matrix Lab and TONS.

    Language              Time (seconds)
    Matrix Lab             9.2
    TONS (one CPU)        20.5
    TONS (two CPUs)       10.9

    That looks more like a reasonable result for Matrix Lab. The matrix multiplication I implemented is a
    quick hack, and it is far from fast. Obviously, Matrix Lab has efficient multiplication routines built in,
    so even though its virtual machine, or interpreter, is slow as molasses, Matrix Lab is twice as fast as
    TONS on one CPU.


    We scale almost linearly in performance as the extra CPU is taken into use. This is because the amount
    of work done in the two independent loops (which, of course, the loop is transformed into as we add
    the second CPU) is the same; neither node ever waits for the other.

    If we added a third CPU, it would never be taken into use. This code simply does not parallelize onto
    three or more CPUs with the current state of the TONS parallelizer. I do not see an easy way of
    parallelizing this program any further at the virtual machine opcode level without changing the order
    in which things happen, which we refrain from doing.

    We could, however, sometime in the future implement parallel versions of the instructions, so that if
    nodes were available, the matrix multiplication could run in parallel on several nodes. But there are two
    caveats. First, it is not "automatic parallelization" in the sense that the virtual machine code is
    parallelized; it is simply a matter of exchanging the actual implementations of the instructions for
    parallel ones. Secondly, implementing parallel matrix operations in C++ is well beyond the scope of
    this work. It is an area in which there has been a lot of research, and it should be fairly simple to plug
    in well-known, efficient parallel matrix routines once the actual free-node/busy-node communication
    is done.

    Matrix Multiplication Algorithm:

    The product of an m x n matrix A and an n x k matrix B is the m x k matrix C whose elements are:

        Cij = sum over s = 1..n of ais * bsj

    Procedure Matrix Multiplication
        For i := 1 to m do
            For j := 1 to k do
                Cij := 0
                For s := 1 to n do
                    Cij := Cij + ais * bsj
                End for
            End for
        End for
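
    For reference, a minimal C++ sketch of this sequential algorithm (the row-major std::vector storage
    and the function name are my choices, not something specified in the paper):

    #include <vector>

    // Multiply two n x n matrices stored in row-major order: C = A * B.
    // Plain i, j, s ordering, matching the pseudocode above.
    // C is assumed to be pre-sized to n * n elements.
    void matrix_multiply(const std::vector<double>& A,
                         const std::vector<double>& B,
                         std::vector<double>& C, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int s = 0; s < n; ++s) {
                    sum += A[i * n + s] * B[s * n + j];
                }
                C[i * n + j] = sum;
            }
        }
    }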

    CREW Matrix Multiplication:

    The algorithm uses n² processors arranged in a 2D array of size n x n. The overall complexity is O(n).

    Procedure CREW Matrix Multiplication
        For i := 1 to n do in parallel
            For j := 1 to n do in parallel
                Ci,j := 0
                For k := 1 to n do
                    Ci,j := Ci,j + ai,k * bk,j
                End for
            End for
        End for
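
    On an actual dual-core machine, the "one processor per (i, j) element" idea of the CREW algorithm
    maps naturally onto an OpenMP parallel loop, which is also what this term paper's task statement asks
    for. A minimal sketch, assuming the same row-major std::vector layout as the sequential version; the
    explicit thread count and schedule clause are illustrative choices:

    #include <omp.h>
    #include <vector>

    // CREW-style parallel multiply: every (i, j) element can be computed
    // independently, so the two outer loops are shared among the threads.
    void matrix_multiply_omp(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C, int n) {
        omp_set_num_threads(2);   // dual-core target from the assignment
        #pragma omp parallel for collapse(2) schedule(static)
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k) {
                    sum += A[i * n + k] * B[k * n + j];
                }
                C[i * n + j] = sum;
            }
        }
    }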

    EREW Matrix Multiplication:

    One advantage of the CREW model is that a memory location can be read by any number of
    processors at once. In the EREW model, one has to ensure that every processor reads from a memory
    location that is not being accessed by any other processor at the same time.

    Procedure EREW Matrix Multiplication
        For i := 1 to n do in parallel
            For j := 1 to n do in parallel
                Ci,j := 0
                For k := 1 to n do
                    lk := ((i + j + k) mod n) + 1
                    Ci,j := Ci,j + ai,lk * blk,j
                End for
            End for
        End for
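
    A hedged C++ illustration of the staggered index: each (i, j) "processor" starts its sweep over k at
    offset i + j, so at any given step no two of them read the same element of A or B. On a real
    shared-memory machine concurrent reads are harmless, so this only changes the read order, not the
    result (the 0-based indexing and names are mine):

    #include <vector>

    // EREW-style variant: the inner sweep over k is rotated by (i + j), the
    // 0-based form of lk = ((i + j + k) mod n) + 1 in the pseudocode above.
    void matrix_multiply_erew(const std::vector<double>& A,
                              const std::vector<double>& B,
                              std::vector<double>& C, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k) {
                    int lk = (i + j + k) % n;   // staggered read index
                    sum += A[i * n + lk] * B[lk * n + j];
                }
                C[i * n + j] = sum;
            }
        }
    }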

    CRCW Matrix Multiplication:

    The algorithm uses n³ processors and runs in O(1) time. When more than one processor attempts to
    write to the same memory location, the sum of the values is written to that location.

    For i := 1 to n do in parallel
        For j := 1 to n do in parallel
            For s := 1 to n do in parallel
                Ci,j := 0
                Ci,j := Ci,j + ai,s * bs,j
            End for
        End for
    End for
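
    The CRCW write rule above (concurrent writes combined by summation) behaves like a reduction. On
    a shared-memory machine, the closest OpenMP analogue is a sum reduction over the s loop; the sketch
    below is my own illustration of that correspondence for a single output element, not an efficient way
    to use a dual-core machine:

    #include <omp.h>
    #include <vector>

    // CRCW analogue for one element: the n partial products that the PRAM
    // model would "write concurrently" into Ci,j are combined here by an
    // OpenMP sum reduction.
    double crcw_style_element(const std::vector<double>& A,
                              const std::vector<double>& B,
                              int i, int j, int n) {
        double cij = 0.0;
        #pragma omp parallel for reduction(+ : cij)
        for (int s = 0; s < n; ++s) {
            cij += A[i * n + s] * B[s * n + j];
        }
        return cij;
    }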

    Parallel Matrix Multiplication:

    I brushed off some old benchmarking code used in my clustering application and decided to see what
    I could do using today's multi-core hardware. When writing computationally intensive algorithms, we
    have a number of considerations to evaluate. The best algorithms to parallelize (IMHO) are
    data-parallel algorithms without loop-carried dependencies.

    You may think nothing is special about matrix multiplication, but it actually points out a couple of
    performance implications of writing CLR applications. I originally wrote seven different
    implementations of matrix multiplication in C#; that's right, seven.


    • Dumb
    • Standard
    • Single
    • Unsafe Single
    • Jagged
    • Jagged from C++
    • Stack Allocated

    Dumb: double [N, N], real type: float64 [0..., 0...]

    The easiest way to do matrix multiplication is with a .NET multidimensional array and i, j, k ordering
    in the loops. The problems are twofold. First, the i, j, k ordering accesses memory in a hectic fashion,
    causing data from widely separated locations to be pulled in. Second, it uses a multidimensional array.
    Yes, the .NET multidimensional array is convenient, but it is very slow. Let's look at the C# and the
    IL it compiles to.

    C#:

    C[i, j] += A[i, k] * B[k, j];

    IL of the C#:

    ldloc.s i
    ldloc.s j
    call instance float64& float64[0...,0...]::Address(int32, int32)
    dup
    ldobj float64
    ldloc.1
    ldloc.s i
    ldloc.s k
    call instance float64 float64[0...,0...]::Get(int32, int32)
    ldloc.2
    ldloc.s k
    ldloc.s j
    call instance float64 float64[0...,0...]::Get(int32, int32)
    mul
    add
    stobj float64

    If you notice the ::Address and ::Get parts, these are method calls! Yes, when you use a
    multidimensional array, you are using a class instance, so every access, assignment, and read incurs
    the cost of a method call. When you are dealing with an N^3 algorithm, that is on the order of N^3
    method calls, making this implementation much slower than the other methods.


    Standard: double [N,N], real type float64[0..., 0...]

    This implementation rearranges the loop ordering to i, k, j in order to optimize memory access to the
    arrays. No other changes are made from the Dumb implementation. The Standard implementation is
    used as the base for all the other multidimensional implementations.
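
    In C/C++ terms the reordering looks like the sketch below (same row-major layout as the earlier
    examples; C is assumed to be zero-initialized). With i, k, j ordering the innermost loop walks row i of
    C and row k of B contiguously, and A[i][k] stays in a register:

    #include <vector>

    // i, k, j ordering: the innermost loop is a contiguous sweep over C and B,
    // and a_ik is loop-invariant.
    void matrix_multiply_ikj(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C, int n) {
        for (int i = 0; i < n; ++i) {
            for (int k = 0; k < n; ++k) {
                const double a_ik = A[i * n + k];
                for (int j = 0; j < n; ++j) {
                    C[i * n + j] += a_ik * B[k * n + j];
                }
            }
        }
    }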

    Single: double [N * N], real type float64 [ ]

    Instead of creating a multidimensional array, we create a single block of memory. The float64[] type
    is a block of memory rather than a class. The downside is that we have to calculate all the offsets
    manually.
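
    The manual offset arithmetic is the same i * N + j mapping already used in the C++ sketches above;
    a hedged one-liner of that mapping (the helper name is mine):

    #include <vector>

    // Element (i, j) of an N x N row-major matrix stored in one flat block.
    inline double& at(std::vector<double>& M, int i, int j, int N) {
        return M[i * N + j];   // offset = row * row-length + column
    }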

    Unsafe Single, real type float64[]

    This method is the same as the single-dimensional array version, except that pointers to the arrays are
    obtained with 'fixed' and the elements are accessed through those pointers in unsafe C#.

    Jagged, double[N][N], real type float64[][]

    This is the same implementation as Standard, except that we use arrays of arrays instead of a
    multidimensional array. It takes an extra step to initialize, but each row is a plain block of raw
    memory, which eliminates the method-call overhead. It is typically about 30% faster than the
    multidimensional array.

    Jagged from C++, double[N][N], real type float64[][]

    This is a bit more involved. When writing these algorithms in C#, we let the JIT compiler optimize
    for us. The C++ compiler is a lot better at optimization, but it does not run at execution time the way
    the JIT does. I ported the code from the jagged implementation to C++/CLI and enabled heavy
    optimization. Once compiled, I disassembled the DLL and converted the IL back to C#. The result is
    this implementation, which is harder to read but really fast.

    Stack Allocated, stackalloc double[N * N], real type float64*

    This implementation uses the rarely used stackalloc keyword. Using it is problematic, as you may get
    a StackOverflowException depending on your current stack usage.

    Parallel Algorithms for Matrix Multiplication:

    The matrix multiplication algorithm called DIMMA (Distribution-Independent Matrix Multiplication
    Algorithm) targets block-cyclic data distribution on distributed-memory concurrent computers. The
    algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap
    computation and communication effectively, and it exploits the LCM block concept to obtain the
    maximum performance of the sequential BLAS routine in each processor even when the block size is
    very small or very large. The algorithm has been implemented and compared with SUMMA on the
    Intel Paragon computer.

    A number of parallel formulations of the dense matrix multiplication algorithm have been developed.
    For an arbitrarily large number of processors, any of these algorithms or their variants can provide
    near-linear speedup for sufficiently large matrix sizes, and none of the algorithms can be clearly
    claimed to be superior to the others. In this paper we analyze the performance and scalability of a
    number of parallel formulations of the matrix multiplication algorithm and predict the conditions
    under which each formulation is better than the others. We present a parallel formulation for
    hypercube and related architectures that performs better than any of the schemes described in the
    literature so far for a wide range of matrix sizes and numbers of processors. The superior performance
    and the analytical scalability expressions for this algorithm are verified through experiments on the
    Thinking Machines Corporation's CM-5 parallel computer for up to 512 processors.

    A number of algorithms are currently available for multiplying two matrices A and B to yield the
    product matrix C = AB on distributed-memory concurrent computers [12, 16]. Two classic algorithms
    are Cannon's algorithm and Fox's algorithm. They are based on a P x P square processor grid with a
    block data distribution in which each processor holds a large consecutive block of data.

    Optimization:

    Optimization activities address the performance requirements of the system model. This includes
    changing algorithms to respond to speed or memory requirements, reducing multiplicities in
    associations to speed up queries, adding redundant associations for efficiency, rearranging execution
    orders, adding derived attributes to improve access time to objects, and opening up the architecture,
    that is, adding access to lower layers because of performance requirements.

    Optimizing the matrix multiplication:

    In the past few years, there have been significant developments in the area of distributed and parallel
    processing. Powerful new hardware architectures are being produced at a rapid rate, such as
    distributed-memory MIMD computers, which have provided enormous computing power to software
    engineers. These multiprocessors may provide a significant speed-up over the serial execution of an
    algorithm. However, this requires careful partitioning and allocation of data and control to the
    processor set. Matrix multiplication is a fundamental parallel algorithm which can be effectively
    executed on a distributed-memory multiprocessor and can show a significant improvement in speed-up
    over serial execution. Ideally, we should be able to achieve a linear speed-up with an increase in the
    number of processors, but in practice the speed-up is much less, and in fact increasing the number of
    processors beyond a certain point may result in degradation of the completion time. This degradation
    is caused by increased communication between modules. Therefore, the optimum speed-up is a
    function of the number of processors and the communication cost. To find the optimum performance,
    a user needs to experiment with all the available processors on a multiprocessor.

    In this paper, we studied the detailed performance of the parallel matrix multiplication algorithm. The
    study defines the factors that control the performance of this class of algorithms and shows how to use
    these factors to optimize the algorithm's execution time. Also, an analytic approach is described which
    can eliminate a trial-and-error method of determining the size of the processor set.

    Memory Hierarchy Optimizations

    Blocking:

    Blocking is a common divide-and-conquer technique for using the memory hierarchy effectively.
    Since the cache may only be large enough to hold a small piece of one matrix, data has already been
    kicked out of the cache before it is reused. The processor is thus continually forced to access slower
    levels of memory, decreasing the algorithm's performance. With blocking, however, each matrix is
    divided into blocks of smaller matrices, and the algorithm multiplies two submatrices, storing their
    product, before moving on to the next two submatrices. This better exploits cache locality, so that data
    in the cache can be reused before being replaced.
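
    A hedged C++ sketch of blocking, continuing the row-major layout used earlier (the tile size is an
    illustrative assumption and would be tuned so that three tiles fit in cache together; C is assumed
    zero-initialized):

    #include <algorithm>
    #include <vector>

    const int BLOCK_SIZE = 64;   // illustrative; tune to the cache size

    // Blocked (tiled) multiply, C += A * B, all n x n and row-major.  Each
    // (bi, bk, bj) step multiplies two BLOCK_SIZE sub-matrices so the data
    // they touch stays in cache while it is being reused.
    void matrix_multiply_blocked(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::vector<double>& C, int n) {
        for (int bi = 0; bi < n; bi += BLOCK_SIZE)
            for (int bk = 0; bk < n; bk += BLOCK_SIZE)
                for (int bj = 0; bj < n; bj += BLOCK_SIZE)
                    for (int i = bi; i < std::min(bi + BLOCK_SIZE, n); ++i)
                        for (int k = bk; k < std::min(bk + BLOCK_SIZE, n); ++k) {
                            const double a_ik = A[i * n + k];
                            for (int j = bj; j < std::min(bj + BLOCK_SIZE, n); ++j)
                                C[i * n + j] += a_ik * B[k * n + j];
                        }
    }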

    Copy Optimization:

    Copy optimization can help decrease the number of conflict cache misses. Conflict cache misses occur
    when multiple data items map to the same location in the cache; with blocking, this means that cached
    data may be prematurely kicked out of the cache. Such conflict misses can cause severe performance
    degradation when an array is accessed with a constant, non-unit stride. In the provided matrix-matrix
    multiplication implementation, the matrix A is accessed in this way (specifically, we access A in
    strides of 'lda'). This is a result of the way the matrix is stored: the matrices are stored in a
    one-dimensional array, with the first column occupying the first M entries of the array (where the
    matrix is M x M), the second column stored in the next M entries, and so on. Thus, consecutive
    elements in a matrix row are M entries apart in the array, and our matrix multiplication routine is
    forced to access the matrix A with an M-unit stride. To improve upon this, we re-order the matrix A
    so that row elements are stored in consecutive entries of the array (i.e., the first row is stored in the
    first M entries of the array, the second row in the next M entries, and so on). This re-ordering is
    sometimes called copy optimization. Now both A and B are accessed with unit stride.


    Inner Loop Optimizations

    Optimizations should focus on the places in the code where the most time is spent. In our matrix-matrix
    multiplication implementation, this is the innermost loop.

    Minimized Branches and Avoidance of Magnitude Compares

    According to [1], C 'do-while' loops tend to perform better than C 'for' loops because compilers tend
    to produce unnecessary loop-head branches for 'for' loops. Furthermore, as also noted in [1], it is often
    cheaper to do equality or inequality tests in loop conditions than magnitude-comparison tests. Thus, we
    translated the innermost 'for' loop into a 'do-while' loop and used pointer inequality rather than
    magnitude comparison to test for loop termination. The code below exemplifies this technique.

    The original code looks something like this:

    for (k = 0; k < BLOCK_SIZE; k++) { ... }

    is translated into something like this:

    end = &B_j[BLOCK_SIZE];
    do {
        ...
    } while (B_j != end);
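
    For concreteness, here is a hedged guess at what the full inner loop might look like after both changes,
    written as a self-contained helper. The names A_i and B_j, and the assumption that A has been
    row-copied while B is stored column-major (so both operands are walked with unit stride over the
    block), are mine, not the original source:

    // Innermost loop of the blocked multiply, written as a do-while terminated
    // by a pointer-equality test instead of an index comparison.  A_i points
    // into row i of the copied A, B_j into column j of B; block must be >= 1.
    double dot_block(const double* A_i, const double* B_j, int block) {
        const double* end = &B_j[block];
        double sum = 0.0;
        do {
            sum += *A_i++ * *B_j++;
        } while (B_j != end);
        return sum;
    }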

    Explicit Loop Unrolling

    Although the compiler option '-funroll-all-loops' is used in the provided Makefile, we decided to see
    whether unrolling the innermost loop by hand would improve upon the compiler's optimization.
    According to [1], explicitly unrolling loops can increase opportunities for other optimizations. We
    measured the performance of the matrix-matrix multiply routine with the innermost loop manually
    unrolled 2, 3, 4, and 6 times.
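
    As an illustration, here is a 4-way manually unrolled variant of the do-while dot product sketched
    above (the unroll factor and the assumption that the block length is a multiple of 4 are mine):

    // 4-way unrolled variant of dot_block; assumes block >= 4 and block % 4 == 0.
    double dot_block_unrolled4(const double* A_i, const double* B_j, int block) {
        const double* end = &B_j[block];
        double sum = 0.0;
        do {
            sum += A_i[0] * B_j[0]
                 + A_i[1] * B_j[1]
                 + A_i[2] * B_j[2]
                 + A_i[3] * B_j[3];
            A_i += 4;
            B_j += 4;
        } while (B_j != end);
        return sum;
    }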


    Optimizing the Parallel Matrix Multiplication:

    A parallel matrix multiplication method can help reduce the resource requirements for both memory
    and computation. A unique feature of our technique is its formulation of linear recurrences as matrix
    computations before exploiting their mathematical properties for more compact representations. Based
    on a general notion of closure for matrix multiplication, we present two classes of matrices that have
    compact representations: permutation matrices, and matrices whose elements are linearly related to
    each other. To validate the proposed method, we experiment with solving recurrences whose matrices
    have compact representations, using CUDA on an NVIDIA GeForce 8800 GTX GPU. The advantages
    of our technique are that it enables the computation of larger recurrences in parallel and that it provides
    good speedups of up to eleven times over the un-optimized parallel computations. Also, the memory
    usage can be as much as nine times lower than that of the un-optimized parallel computations. Our
    result confirms a promising approach for the adoption of more advanced parallelization techniques.



    References:

    • http://www.cs.wisc.edu/arch/www/people.html
    • http://www-unix.mcs.anl.gov/dbpp/text/node45.html
    • http://www.codeproject.com/useritems/System_Design.asp
    • http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/trackback/
    • http://www.roseindia.net/.../Java...Optimizing-Parallel.../Retrieval.html
    • http://www.informaworld.com ... Resources Newsletter
    • http://www.informaworld.com/smpp/content~content=a772397562
    • http://www.sciencedirect.com/science