Parallel Event Driven Simulation using GPU (CUDA) M.Sancar Koyunlu & Ervin Domazet Cmpe 436 TERM PROJECT



Page 1:

Parallel Event Driven Simulation

using GPU (CUDA)

M.Sancar Koyunlu & Ervin Domazet

Cmpe 436 TERM PROJECT

Page 2:

LOGIC SIMULATION

In Cycle Based Simulation*, the evaluation schedule of the gates in the design, for each step of the simulation, is determined once at the compilation time of the simulator.

Event Based Simulation has a more complicated scheduling policy: a gate is simulated only if at least one of its input values has changed.

*Alper Şen, Barış Aksanlı, Murat Bozkurt, "Speeding Up Cycle Based Logic Simulation Using Graphic Processing Units", http://cseweb.ucsd.edu/~baksanli/ijpp11.pdf

Page 3:

What is CUDA?

CUDA™ is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

Computing languages and APIs: C, C++, CUDA x86, Fortran, OpenCL, DirectCompute.

We started writing our code in C++. C++ support is very limited right now, so we had to port the entire code from C++ to C after finishing the sequential algorithm.

Page 4:

Example Applications

Identify hidden plaque in arteries: Heart attacks are the leading cause of death worldwide. Harvard Engineering, Harvard Medical School and Brigham & Women's Hospital have teamed up to use GPUs to simulate blood flow and identify hidden arterial plaque without invasive imaging techniques or exploratory surgery.

Analyze air traffic flow: The National Airspace System manages the nationwide coordination of air traffic flow. Computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently. Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds.

Page 5:

CUDA MOTIVATION

Page 6:

Page 7:

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation – exactly what graphics rendering is about – and therefore designed such that more transistors are devoted to data processing rather than data caching  and flow control, as schematically illustrated by Figure 1-2.  

Page 8:

What are we using CUDA for?

We tried to implement event-driven gate-level simulation on CUDA-supported GPUs.

Page 9:

CUDA BASICS

A CUDA program consists of host code and device code. The host code runs on the CPU and calls a CUDA kernel with a configuration. The configuration determines the number of threads that will run the device code on the GPU.

Page 10:

CUDA BASICS

How to pass parameters to CUDA?

Passing a single parameter is easy; it is just like passing a parameter to a function:

__global__ void VecAdd(float A, float B, float C) {
    int i = threadIdx.x;
    // some calculations
}

int main() {
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

Page 11:

CUDA BASICS

How to pass parameters to CUDA?

When passing pointers you have to be careful, because the addresses they hold are in your machine's memory, not on the GPU device. The cudaMalloc() function allocates memory on the CUDA device.

You can pass struct arrays to the CUDA device, but if they contain any pointers it will cause a problem. For example:

struct sampleStr {
    int * dsa;
};

You cannot initialize this struct on the CPU and send it to CUDA, because the inner pointer would still point to host memory.

Page 12:

How to send an array and fill it in CUDA?

__global__ void sample1(int* num) {
    num[threadIdx.x] = threadIdx.x;
}

int* deviceInt;
cudaMalloc((void**)&deviceInt, sizeof(int) * 10);
int* hostInt = new int[10];
sample1<<<1,10>>>(deviceInt);
cudaMemcpy(hostInt, deviceInt, sizeof(int) * 10, cudaMemcpyDeviceToHost);

Page 13:

Some important functions of CUDA that we used

__syncthreads(): synchronizes the threads within a block. On current GPUs, a thread block may contain up to 1024 threads.

int atomicExch(int* address, int val): reads the 32-bit or 64-bit word old located at address address in global or shared memory and stores val back to memory at the same address. These two operations are performed in one atomic transaction. The function returns old. We used this function to implement a lock mechanism (test-and-set) in our project.

Page 14:

MORE ATOMIC TRANSACTIONS

• atomicAdd()
• atomicSub()
• atomicExch()
• atomicMin()
• atomicMax()
• atomicInc()
• atomicDec()
• atomicCAS()
• atomicAnd()
• atomicOr()
• atomicXor()

Details in Cuda_C_Programming_Guide.pdf

Page 15:

THRUST LIBRARY

In our implementation of the device code we needed vectors (dynamic arrays). The C++ STL has a vector implementation, but it is missing in CUDA.

At this stage we ran across the Thrust library (http://code.google.com/p/thrust/), which offers STL-style functionality for CUDA. It turned out not to be what we expected: it speeds up STL-style algorithms using CUDA, rather than providing data structures to use inside device code. After that we had to mimic dynamic arrays using extra offset arrays.

Page 16:

This is just a summary of what we have learned about CUDA. If you are interested, there are lots of documents, slides, webinars, etc. at:

http://developer.nvidia.com/nvidia-gpu-computing-documentation

http://developer.nvidia.com/category/zone/cuda-zone

Page 17:

Sequential algorithm of 

Event Driven Simulation

Page 18:

Procedure:

1. Get the logic circuit
2. Get the input list
3. Simulate the circuit with default values (0) at the input gates
4. Start the sequential algorithm
5. Show the output

Page 19:

1. Get the logic circuit

• We will get the overall logic circuit from a file whose format is as follows:

<circuit>
  <gates>
    <gate>
      <name>AND1</name>
      <signal>0</signal>
      <delay>2</delay>
      <type>AND</type>
      <outGates>
        <name>OR1</name>
        <name>NOT1</name>
      </outGates>
      <inGates>
        <name>A</name>
        <name>B</name>
      </inGates>
    </gate>
    ...
    <gate>
      ...
    </gate>
  </gates>
</circuit>

The type of a gate can be:

• INPUT
• AND
• NAND
• OR
• NOR
• XOR
• XNOR
• FLIPFLOP
• NOT

Page 20:

2. Get the input list

• The input list will be taken from a file which has the following format:

<inputs>
  <input>
    <name>B</name>
    <time>2</time>
    <value>1</value>
  </input>
  <input>
    <name>A</name>
    <time>3</time>
    <value>1</value>
  </input>
</inputs>

Page 21:

3. Simulate the circuit with default values (0) at the input gates

• Our sequential algorithm assumes that the circuit, with its current values, is in a consistent state.

• In order to reach such a state, before calling the sequential algorithm we:
  o Assign boolean 0 to all gates
  o Starting from all INPUT gates, iterate through all the affected gates recursively
  o At an affected gate X, we do the following:
     Find all its input gates
     Get their current output values
     According to the type of gate X, apply its operation to the current input values and modify the output accordingly

Page 22:

4. Start the sequential algorithm

• The sequential algorithm uses a Future Event List (FEL) data structure, which helps us schedule events.

• The FutureEventList is basically an array of FutureEvent vectors.

• A FutureEvent holds:
  o the index of the gate
  o the new value of the gate
  o its time to change

• The size of the Future Event List is found with the following calculation:
  o size = MaximumDelay / time_increments + 1
  o time_increments = GCD(all delays)

• In addition to these data structures, the algorithm has a variable which keeps the current time.

Page 23:

4. Sequential algorithm procedure

1. If current_time == gate_change_time, get the input
2. Find all the affected gates
3. Iterate recursively over all of them and calculate their new output values
4. If the new value != old value, schedule an event in the (current_time + gate_delay)'th place of the FEL
5. Check whether there are events in FEL[current_time]
6. If so, get the next event, process it and continue from point 2
7. If not, print the current values and increment the current time:
   1. if there has been no update in the FEL for one full cycle, the algorithm terminates
   2. if there is an input at the current time, continue from point 1
   3. if not, continue from point 5

Page 24:

5. Show Output

• At every time increment, the current values of the gates are printed. The output format is as follows:

Time: 3
  (A-1)
    (AND1-0)
      (OR1-0)
        (XOR1-1)
        (NOT2-1)
      (NOT1-1)
        (XOR1-1)

Time: 4
  (A-1)
    (AND1-0)
      (OR1-1)
        (XOR1-1)
        (NOT2-0)
      (NOT1-1)
        (XOR1-1)

Page 25:

Parallel algorithm of 

Event Driven Simulation

Page 26:

Procedure:

1. Get the logic circuit
2. Get the input list
3. Simulate the circuit with default values (0) at the input gates
4. Start the parallel algorithm in CUDA
5. Show the output

Page 27:

Note: 

The only difference between the Parallel and Sequential Event Driven Simulation is the fourth step of the procedure list.

So we will now explain our solution for the parallel algorithm of Event Driven Simulation using CUDA. (The remaining steps are exactly the same as in the sequential part.)

Page 28:

Parallel algorithm in CUDA 

• As we explained at the beginning, a single CUDA block may have at most 1024 threads running.

• The algorithm starts with N+1 threads, where every single gate is assigned to a thread.

• Currently the algorithm assumes that the circuit has at most 1023 gates (including the inputs), since we are working in a single block.

Page 29:

Parallel algorithm in CUDA 

• We take the circuit from the host as a 1-dimensional BaseGate array.

• The connections between gates are referenced as indexes rather than pointers.

• The CUDA kernel will also take some other variables, such as the number of gates, the number of input gates, and a bool array for debugging purposes.

Page 30:

Parallel algorithm in CUDA 

• The remaining thread acts as a "controlling thread", which makes the necessary changes to the shared variables.

• Once again we have a Future Event List data structure. We initialize an array of FELs of size N (one per gate).

• The Future Event List data structure has a queue inside it, in which updates to a certain gate are scheduled.

Page 31:

Parallel algorithm in CUDA 

• In addition to the queue, the FEL has an "update" flag: if the corresponding gate has a scheduled update, this flag is set to True.

• There are 2 types of updates:
  o Updates regarding input gates: the new value and the time to change are enqueued in the FEL's queue.
  o Updates of the remaining gates: only the time to change is enqueued in the FEL's queue.

Page 32:

Parallel algorithm in CUDA 

• Moreover, every gate's FEL has a corresponding lock, since when processing an update, multiple threads may write to the same gate's FEL queue.

• At every time unit, we keep the current output values of the gates in a multi-dimensional boolean array of size (maxDelay + 1), so that no conflict occurs.

Page 33:

Parallel algorithm in CUDA 

• In this manner, when we process an update, we subtract its delay from the current time and take the result modulo (maxDelay + 1).

• This gives us the values of all gates at time (current_time - delay).

Page 34:

Parallel algorithm in CUDA 

• Besides this, we have a shared "change" variable, which is controlled by the controlling thread.

• At every time unit, the controlling thread passes over all gates, and if at least one of them has an update, it sets change to 1.

• On the other hand, if there is no change in any of the gates, the algorithm terminates by setting change to 0.

Page 35:

Overall Logic

 

Page 36:

Overall Logic

Parallel algorithm in CUDA 

Page 37:

Future Considerations:

• Write a module which randomly creates big circuits

• Test the parallel algorithm on those circuits

• Make the necessary modifications to the parallel code so that we can make use of multiple blocks, so that the number of gates is no longer limited to 1023

• Memory optimizations

Page 38:

Thank you for your attention!

Cmpe 436 TERM PROJECT

M.Sancar Koyunlu & Ervin Domazet