
Password Recovery by using MPI and CUDA


Page 1 PASSWORD RECOVERY BY USING MPI AND CUDA

WALCHAND COLLEGE OF ENGINEERING, SANGLI

AN AUTONOMOUS INSTITUTION

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

YEAR 2013-14

PROJECT REPORT

ON

Password Recovery by using MPI and CUDA

UNDER GUIDANCE OF

Prof. M. A. Shah

SUBMITTED BY:

Sr. No  NAME                      Roll No  Seat No
1       SHWETA CHAUDHARI          125      2011BCS216
2       BHIMASHANKAR DHAPPADHULE  130      W5097102
3       CHAITANYA BHUSARE         132      W53551


CERTIFICATE

2013 – 2014

WALCHAND COLLEGE OF ENGINEERING, SANGLI

AN AUTONOMOUS INSTITUTION

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

This is to certify that the project report entitled

“Password Recovery by using MPI and CUDA”

is a bonafide record of work carried out by the students listed below, in partial fulfillment of the requirements for the award of B.Tech in Computer Science and Engineering, under my supervision and guidance during the academic year 2013-14. The project report has been approved as it satisfies the academic requirements with respect to the project work prescribed for the above course.

Submitted By:

Prof. M. A. Shah.

(External Examiner) (Project Guide)

Sr. No  NAME                      Roll No  Seat No
1       SHWETA CHAUDHARI          125      2011BCS216
2       BHIMASHANKAR DHAPPADHULE  130      W5097102
3       CHAITANYA BHUSARE         132      W53551


ACKNOWLEDGEMENT

We are immensely glad to present the project report entitled

“Password Recovery by using MPI and CUDA”

Realizing our foremost duty, we express our deep gratitude and respect to our guide, Prof. M. A. Shah, for her uplifting tendency and for inspiring us to take up this project and carry it through successfully.

We are also grateful to Dr. B. F. Momin, Head of the Department of Computer Science & Engineering, for providing all the necessary facilities to carry out the project work; his encouragement has been a perpetual source of support.

We are grateful to the library personnel for offering all possible help in completing the project work. Last but not least, we are thankful to our colleagues and to all those who helped us, directly or indirectly, throughout this project work.


DECLARATION

We hereby declare that this project report entitled “Password Recovery by using MPI and CUDA” is the record of authentic work carried out by us during the academic year 2013-2014.

Date:

Place:

Sr. No  NAME                      Seat No      Signature
1       SHWETA CHAUDHARI          2011BCS216
2       BHIMASHANKAR DHAPPADHULE  W5097102
3       CHAITANYA BHUSARE         W53551


INDEX                                               PAGE NO

1. Introduction 6
1.1. Existing System and its Drawbacks 6
1.2. Literature Survey 7
1.3. Motivation 7
1.4. Scope 8
1.5. Problem Statement 8
2. Objectives of the Project 9
2.1. Significance of Undertaking the Project 9
3. Project Overview 10
3.1. System Architecture 10
3.2. MPI 10
3.3. CUDA 13
3.4. MPI + CUDA 13
3.5. Platforms Used 14
3.6. Inputs to the Algorithms 14
3.7. Flow Layout 16
3.8. Control Layout 17
3.9. Flowcharts of Algorithms 17
3.9.1. Divided Dictionary Algorithm 17
3.9.2. Divided Password Database Algorithm 18
3.9.3. Minimal Memory Algorithm 19
3.10. MPI Functions Used 20
3.11. CUDA Functions Used 21
4. Implementation 22
4.1. Formation of a Cluster 22
4.2. Installation of CUDA 23
4.3. Implementation of Algorithms 25
5. Experiments and Result Analysis 29
6. Conclusion 34
7. References 35


1. INTRODUCTION:

Electronic authentication is the process of confirming a user's identity as presented to an information system. A widely deployed method for confirming a user's identity is based on

passwords. A person submits her userid and password to a computer system. The computer

system processes the password with a hash function and compares the hash function output with

a hash value associated with that userid from a password database. If the two hash values match,

she is granted access to the system. Millions of people follow this protocol to confirm their

identity multiple times each day.

The hash functions used to authenticate users are called cryptographic hash functions.

These functions read a string of characters of any length, perform a hash operation on the string,

and then return an encoded, hashed value as a fixed length string. A good cryptographic hash

function is one-way, which means the original input string cannot be recovered by reversing the

function. This one-way characteristic has led to passwords being stored as hashed values instead

of in their plain-text format. Accordingly, even if the password database is compromised, the

plain-text passwords are still somewhat secure.

1.1. Existing System and its Drawbacks:

A dictionary-based password recovery application performs a large number of hash and

string comparison operations. A password dictionary is filled with familiar names, common

words, and simple character sequences. With a large dictionary or a large number of entries in a

password database, password recovery is computationally expensive. However, this type of

analysis has a high level of parallelism; testing one word from the dictionary is independent of

testing another. This makes dictionary-based password recovery well suited to an HPC environment such as MPI or CUDA. The parallelism can be taken even further and implemented in a hybrid MPI+CUDA environment, which allows for an initial coarse-grained division of the problem using MPI and a fine-grained division of the problem using CUDA. In the realm of

password recovery there are several approaches that have been deployed, each with their own

advantages and disadvantages.

One problem with password authentication is that some passwords are weak; they contain

little randomness from one character to the next. When people create their own passwords, it is

natural for them to choose words that are easy to remember. Passwords are sometimes based on

familiar names, common words, or are simple sequences like “12345.” Those types of passwords

are easily guessed by attackers.


1.2. Literature Survey:

There has been significant research in the areas relating password strength to system

security, improving the performance of password recovery systems, as well as high performance

distributed computing approaches to a variety of problems, including password recovery. We surveyed documents related to passwords and system security, serial dictionary recovery systems, coarse-grained HPC, fine-grained HPC, and hybrid HPC, and summarize the relevant literature below.

As we have gone through the passwords and system security, we have seen that one area of

research focuses on improving the strength of systems that use password-based authentication

protocols. System strength can be improved by understanding what makes one password better

than another password. measured password strength in terms of the search space required for

attackers to guess some of the passwords. It has been shown that dictionary attacks are most

effective at discovering weak passwords; dictionary word mangling is useful after the dictionary

has been fully tested. Finally, Markov-based techniques were most useful in breaking strong

passwords.

Performance of password recovery can also be improved by increasing the number of processors used. Prior work used a loosely coupled architecture based on volunteer computing to implement a dictionary attack against a private key ring generated by GnuPG software.

MPI and OpenMP provide an efficient way of coarsely dividing work between multiple

computers, cores, or threads where each individual unit performs calculations on a subset of the

data. On the other hand, CUDA is better suited to a more fine-grained parallelism of the problem; studies comparing multiple versions of the LU benchmark on multiple architectures illustrate this.

Coarse- and fine-grained HPC have normally been used independently of each other. However, they may be combined as a hybrid framework of parallelism. A coarse division of the problem is done at the CPU level, using MPI or OpenMP. This level divides the data or

calculations into smaller subsets, handling distribution of data and gathering of results. At the

next level GPUs can then apply a fine-grained type of parallelism to the smaller subset of the

problem.

1.3. Motivation:

High Performance Computing (HPC) is emerging at great speed in the world of technology:

• Energy costs of HPC systems are of growing importance.

• Existing power-saving technologies are largely unused in HPC.

• Intelligent use of existing technologies has the potential to significantly reduce power consumption while retaining performance.

• Efficient programming of clusters of shared-memory (SMP) nodes:

– Hierarchical system layout

– Hybrid programming seems natural:


• MPI between the nodes

• Shared memory programming inside of each SMP node

– MPICH2

– MPI-3 shared memory programming

CUDA offers the following advantages, which further motivated this work:

• High-level, C/C++/Fortran language extensions

• Scalability

• Thread-level abstractions

• Runtime library

1.4. Scope:

This MPI+CUDA approach allows for the division of a large dictionary or a password

database between multiple MPI nodes, which is then further divided among the GPUs of each

node. This allows for multiple GPUs to perform calculations at the same time. An analysis of the

algorithms was done to identify the different strengths and weaknesses with respect to execution

time and memory usage and help determine the most efficient data distribution strategy for a

hybrid MPI+CUDA computing environment.

1.5. Problem Statement

Design and implementation of multiple password recovery algorithms using a hybrid MPI and CUDA approach.


2. OBJECTIVES OF THE PROJECT:

Objectives for undertaking the project are:

We compare the memory usage and run times of multiple password recovery algorithms. We designed and implemented the algorithms using an MPI+CUDA hybrid approach to password recovery, with each algorithm based on the dictionary attack method.

2.1 Significance of undertaking the project:

The main significance of this work is as follows: to determine the most efficient use of a hybrid MPI+CUDA system for password recovery, three different algorithms were

developed. The algorithms are similar in that they are based on a dictionary approach to

password recovery. In the dictionary approach hash values for a list of words are computed.

Those computed hash values are then compared against the hash values stored in a password

database. If the computed hash value for a dictionary word matches the stored hash value for an

account in the database, then the dictionary word is the password for the account.

The algorithms differ primarily in how the dictionary and password database are

distributed among the GPU devices. The remainder of this section will describe the hash function

used, our dictionary and password database data, and the three algorithms.


3. PROJECT OVERVIEW:

3.1. System Architecture:

[Figure: system architecture diagram]

3.2. MPI:

MPI operates via function calls. C and C++ users should use something like this to make

use of MPI:

#include <mpi.h>
// ...
int rc;
// ...
rc = MPI_Bcast( ... );

Unlike OpenMP, you shouldn’t think of “parallel regions.” Your entire code is running in

parallel. You should exploit this. To initialize the MPI environment, you need to include some

function calls in your code. The following are typically used:

/* add in MPI startup routines */

/* 1st: launch the MPI processes on each node */
MPI_Init(&argc, &argv);

/* 2nd: request a process id, sometimes called a "rank", from
   the MPI master process, which has rank or tid == 0 */
MPI_Comm_rank(MPI_COMM_WORLD, &tid);

/* 3rd: this is often useful; get the number of processes
   launched by MPI */
MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

When the MPI program reaches that first MPI_Init, it passes the arguments it received

on the command line to the MPI launcher process, which starts new copies of the binary running

on the processors.

At this point in time, you have gone from a single process to NCPU (or N Cores)

coordinated processes. These processes do not share variables. Your job is to figure out what

process needs what data, when they need it, and get it to them.

The MPI_Comm_rank function call returns data on a specific process. A process id of 0

is the “master” process, the original process which did in fact launch all the others. The process

id or “rank” is used to identify the different processes in your program.

The MPI_Comm_size function call returns the number of threads or processes being

used for this program.

With this information, your program is now aware of the parallel environment. Note that you could run your code with NCPU=1, in which case you have effectively serial operation.

A way to visualize this is to imagine an implicit loop around your program, where you

have NCPU iterations of the loop. These iterations all occur at the same time, unlike an explicit

loop. The number of CPUs or cores is controlled by specifying a value to the mpirun or mpiexec

command using the -np argument. If it is not set, it could default to 1 or the number of

CPUs/Cores on your machine.

That done, let's parallelize hello.c. Let's put in the explicit function calls for the MPI processes, and add some additional “sanity checking”, so we know where and how the code ran. The original code looks like this:

#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("hello MPI user!\n");
    return(0);
}

The MPI version of the code looks like this:


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for gethostname() */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int tid, nthreads;
    char *cpu_name;

    /* add in MPI startup routines */
    /* 1st: launch the MPI processes on each node */
    MPI_Init(&argc, &argv);

    /* 2nd: request a process id, sometimes called a "rank", from
       the MPI master process, which has rank or tid == 0 */
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);

    /* 3rd: this is often useful; get the number of processes
       launched by MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

    /* on EVERY process, allocate space for the machine name */
    cpu_name = (char *)calloc(80, sizeof(char));

    /* get the machine name of this particular host ... well,
       at least the first 80 characters of it ... */
    gethostname(cpu_name, 80);

    printf("hello MPI user: from process = %i on machine=%s, of NCPU=%i processes\n",
           tid, cpu_name, nthreads);

    MPI_Finalize();
    return(0);
}

Unlike OpenMP, there is no “parallel region” to enclose; the entire code is a “parallel region”, but nothing is shared. Let us create a simple Makefile to handle building this (the Makefiles accompany the source code). The build produces a working binary executable called hello-mpi.exe. To run the binary, use the mpirun command below. In this example we are starting eight processes. Each process will be called hello-mpi.exe and they will communicate with each other using MPI.

/home/joe/local/bin/mpirun -np 8 -hostfile hostfile ./hello-mpi.exe

hello MPI user: from process = 0 on machine=pegasus-i, of NCPU=8 processes

hello MPI user: from process = 1 on machine=pegasus-i, of NCPU=8 processes

hello MPI user: from process = 3 on machine=pegasus-i, of NCPU=8 processes

hello MPI user: from process = 4 on machine=pegasus-i, of NCPU=8 processes

hello MPI user: from process = 5 on machine=pegasus-i, of NCPU=8 processes

hello MPI user: from process = 2 on machine=pegasus-i, of NCPU=8 processes

hello MPI user: from process = 6 on machine=pegasus-i, of NCPU=8 processes

hello MPI user: from process = 7 on machine=pegasus-i, of NCPU=8 processes

There are some additional arguments worth discussing. First, the -hostfile argument provides a

filename that tells mpirun the host or hosts on which to run the program. For this tutorial, we will

use a single hostfile with “localhost” as our only host. You can use any name for this file.

Second, since we have multiple cores we can adjust the number of processes to match the

number of cores by using the -np parameter.


3.3. CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and

programming model created by NVIDIA and implemented by the graphics processing

units (GPUs) that they produce. CUDA gives program developers direct access to the

virtual instruction set and memory of the parallel computational elements in CUDA GPUs.

CUDA provides both a low level API and a higher level API. CUDA works with all

Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line.

CUDA is compatible with most standard operating systems. Nvidia states that programs

developed for the G8x series will also work without modification on all future Nvidia video

cards, due to binary compatibility.

3.4. MPI + CUDA:

MPI, the Message Passing Interface, is a standard API for communicating data via

messages between distributed processes that is commonly used in HPC to build applications that

can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which

is designed for parallel computing on a single computer or node. There are many reasons for

wanting to combine the two parallel programming approaches of MPI and CUDA. A common

reason is to enable solving problems with a data size too large to fit into the memory of a single

GPU, or that would require an unreasonably long compute time on a single node. Another reason

is to accelerate an existing MPI application with GPUs or to enable an existing single-node

multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can

be achieved easily and efficiently.


3.5. Platforms Used:

Language Used: C, CUDA C, C++.

Environment: Two Ubuntu 12.04 machines, each with an NVIDIA GeForce GT 525M GPU having 96 CUDA cores.

The environment allows for an initial coarse-grained division of the problem using MPI and a fine-grained division of the problem using CUDA.

3.6. Inputs to the Algorithms:

The password recovery program assumes that some preprocessing of the input files has

been done. The dictionary file and user files require the first two lines of the files to be the

number of entries in the file and the maximum string length of the entries respectively. The

maximum string length for the user account is the maximum length of the user account names.

For example, the dictionary file looks like:

------------------------------------------------

<number of dictionary words>

<maximum string length of the dictionary words>

word1

word2

...

------------------------------------------------


The dictionary file is formatted so that each line of the file after the first two lines is a single dictionary word, as shown above.

The user file is formatted "<userName> : <X X X X X>", where <userName> is the user account name and <X X X X X> is the user account's SHA-1 hash value in hex format, separated into 32-bit blocks by spaces, as shown below:

----------------------------------------------------

...

user1: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d

...

----------------------------------------------------

Secure hash algorithm input: The Secure Hash Standard, which defines the Secure Hash Algorithm (SHA-1), produces a 160-bit message digest for a given data stream. In theory, it is

highly improbable that two messages will produce the same message digest. Therefore, this

algorithm can serve as a means of providing a "fingerprint" for a message.


3.7. Flow Layout:

Installation of MPICH2-1.4.1 → Formation of MPI cluster of 2 nodes → Installation of CUDA 5.5 → Formation of GPU cluster → Implementation of Algorithms → Testing → Result Analysis → End

The implemented algorithms are:

a. Divided Dictionary Algorithm
b. Divided Password Database Algorithm
c. Minimal Memory Algorithm


3.8. Control Layout:

1. First, install MPICH2-1.4.1 on the machine having the Nvidia GeForce GT 540M.
2. Then form an MPI cluster of all the nodes.
3. After that, install CUDA 5.5.
4. Formation of the GPU cluster is the next step.
5. Implement the algorithms:
   a. Divided Dictionary Algorithm
   b. Divided Password Database Algorithm
   c. Minimal Memory Algorithm
6. Testing is a necessary step for all projects.
7. Analysis of the results is carried out.

3.9. Flowcharts of Algorithms:

The main flowchart proceeds as follows:

1. Start.
2. Input the mode, the dictionary file name, and the password database file name.
3. If an argument is incorrect (NO branch), generate an error.
4. Otherwise (YES branch), generate or divide the dictionary / password database file according to the user input and continue at connector A.


From connector A, the selected mode determines which algorithm is called:

Mode 1: Divided Dictionary Algorithm
Mode 2: Divided Password Database Algorithm
Mode 3: Minimal Memory Algorithm

3.9.1 Divided Dictionary Algorithm:

1. Read the number of dictionary words and password database entries.
2. MPI divides the dictionary file according to the user's specification.
3. Load the divided dictionary files onto each GPU; copy the entire password database file onto each GPU.
4. Initialize threads and allocate GPU memory.
5. Calculate the hash value for each dictionary word and store it in global memory.
6. Compare these stored values with the password database hash values.
7. Display the matched results; the time required for loading, generating, and comparing hashes; and the memory usage.
8. Exit.


3.9.2 Divided Password Database Algorithm:

1. Read the number of dictionary words and password database entries.
2. MPI divides the password database file according to the user's specification.
3. Load the divided password database files onto each GPU; copy the entire dictionary file onto each GPU.
4. Initialize threads and allocate GPU memory.
5. Calculate the hash value for each dictionary word and store it in global memory.
6. Compare these stored values with the password database hash values.
7. Display the matched results; the time required for loading, generating, and comparing hashes; and the memory usage.
8. Exit.

3.9.3 Minimal Memory Algorithm:

1. Initialize a thread and allocate memory for a single dictionary word (continued at connector B).

3.10. MPI Functions Used:

SR. NO.  Function Name    Description
1        MPI_Init         Initializes the MPI execution environment.
2        MPI_Comm_size    Returns the total number of MPI processes in the specified communicator.
3        MPI_Comm_rank    Returns the rank of the calling MPI process within the specified communicator.
4        MPI_Send         Basic blocking send operation.
5        MPI_Recv         Receives a message and blocks until the requested data is available in the application buffer of the receiving task.
6        MPI_Barrier      Synchronization operation.
7        MPI_Bcast        Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
8        MPI_Scatter      Data movement operation. Distributes distinct messages from a single source task to each task in the group.

Continuation of the Minimal Memory flowchart (connector B):

2. Load a dictionary word and the hash values from the password database file.
3. Generate the hash value for the dictionary word.
4. Compare both values.
5. If they match (FOUND), display the matched results; the time required for loading, generating, and comparing hashes; and the memory usage. If NOT FOUND, continue with the next word.
6. Exit.


3.11. CUDA Functions Used:

SR. NO.  Function Name              Description
1        cudaGetDeviceCount()       Returns the number of devices with compute capability.
2        cudaGetDeviceProperties()  Returns all the properties of a device.
3        cudaSetDevice()            Sets the device on which the calling host thread executes.
4        cudaGetDevice()            Returns the current device for the calling host thread.
5        cudaMalloc()               Allocates GPU memory.
6        cudaMemcpy()               Copies count bytes from the memory area pointed to by src to dst.
7        cuDeviceGetCount()         Returns the number of devices on the node.
8        cuDeviceGet()              Returns a handle to a device.
9        cuDeviceGetName()          Returns the name of a device.
10       cudaGetLastError()         Returns the last error from a runtime call.


4. IMPLEMENTATION:

4.1 Formation of a Cluster:

Instructions for building MPICH2

1. Download mpich2-1.4.1p1.tar.gz to your account on Shale. 1.4.1p1 is the current stable

release. You can download it from

http://www.mcs.anl.gov/research/projects/mpich2/

2. Uncompress and extract the archive.

$ tar xvf mpich2-1.4.1p1.tar.gz

3. Make an installation directory where the final MPI system will reside.

$ mkdir /home/<your userid>/mpich2-install/

4. At this point you should change to one of the CUDA nodes: node1,

node2

5. Run the MPI configuration script from the archive directory. The following disables the

FORTRAN compiler, specifies g++ as the c++ compiler, enables shared libraries, and

specifies the location of the installation directory. The following should be on one line.

$ ./configure --disable-f77 --disable-fc CXX=g++ --enable-shared --prefix=/home/<your

userid>/mpich2-install 2>&1 | tee c.txt

The options for CXX and enable-shared were needed to get the system to run on Shale.

6. Run make to build MPICH2.

$ make 2>&1 | tee m.txt

7. Run the install script.

$ make install 2>&1 | tee mi.txt

8. Add the mpich2-install/bin/ directory to your path.

9. Test that your path is configured correctly. The output of the following commands should

refer to the bin directory of your installation directory. If they do not, return to step 8 and

move the mpich2-install/bin directory to the beginning of your path.

$ which mpicc

$ which mpiexec

10. Test that you can ssh to the MPI nodes (node1, node2) without having to enter a password.

11. Create a hosts file. This should be a text file like the mpd.hosts

file we used at the beginning of the semester. The file should contain:


node1

node2

12. Run a test program at mpich2-1.4.1p1/examples/cpi.

$ mpiexec -n 5 -f <path to hosts file> <path to cpi program>

13. Compile and run mpi_hello.cxx

$ mpicxx -g -Wall -o mpi_hello mpi_hello.cxx

$ mpiexec -n 4 -f hosts ./mpi_hello

4.2. Installation of CUDA:

STEP I – Driver installation

Download the Nvidia drivers for Linux from their website, making sure to select 32 or 64-bit

Linux based on your system.

Make sure the requisite tools are installed using the following command -

sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

Next, blacklist the required modules (so that they don’t interfere with the driver installation) -

sudo nano /etc/modprobe.d/blacklist.conf

Add the following lines to the end of the file, one per line, and save it when done -

blacklist amd76x_edac

blacklist vga16fb

blacklist nouveau

blacklist rivafb

blacklist nvidiafb

blacklist rivatv

In order to get rid of any Nvidia residuals, run the following command in a terminal -


sudo apt-get remove --purge nvidia*

Press Ctrl+Alt+F1 to switch to a text-based login, and switch to the directory which contains the

downloaded driver.

Run the following commands -

sudo service lightdm stop

chmod +x NVIDIA*.run

where NVIDIA*.run is the full name of your driver. Next, start the installation with -

sudo ./NVIDIA*.run

Reboot once the installation completes.

STEP II – CUDA toolkit installation

Download the CUDA toolkit. Navigate to the directory that contains the downloaded CUDA toolkit package, and run the following commands –

chmod +x cuda*.run

sudo ./cuda*.run

where cuda*.run is the full name of the downloaded CUDA toolkit. Accept the license that appears.

To make sure the necessary environment variables (PATH and LD_LIBRARY_PATH) are

modified every time you access a terminal, add the requisite lines (from the summary screen)

to the end of ~/.bashrc as follows -


32 bit systems -

export PATH=$PATH:/usr/local/cuda-5.0/bin

export LD_LIBRARY_PATH=/usr/local/cuda-5.0/lib

64 bit systems -

export PATH=$PATH:/usr/local/cuda-5.0/bin

export LD_LIBRARY_PATH=/usr/local/cuda-5.0/lib64:/lib

STEP III – CUDA samples installation and troubleshooting

While the installation of the samples should be straightforward (simply run the all-in-one toolkit installer), it’s often not that easy.

Determine if variants of libglut.so are present as follows -

sudo find /usr -name libglut\*

On my 64-bit installation of Ubuntu 12.04, this output the following text -

/usr/lib/x86_64-linux-gnu/libglut.so.3

/usr/lib/x86_64-linux-gnu/libglut.so.3.9.0

/usr/lib/x86_64-linux-gnu/libglut.a

/usr/lib/x86_64-linux-gnu/libglut.so

4.3. Implementation of Algorithms:

Our hypothesis was that each algorithm would require different amounts of memory and processing time due to differences in how they divide the dictionary and password database data. The scalability of the different algorithms was also of interest.

1) Divided Dictionary Algorithm: The divided dictionary algorithm (shown in Figure 1)

divides the larger dictionary file between the MPI nodes which then further divide the

dictionary words between the GPUs on each MPI node. Each GPU has a different portion of

the dictionary. The full password database file is then loaded onto each GPU. A kernel

function is repeatedly called for each of the user accounts to test for matches between the

password database hash and the dictionary word hashes. Any matches represent the

recovered password for the user account. The main steps of this algorithm are described

below.


a) Evenly divide the dictionary words among the MPI nodes. Each MPI node further

divides its words among its available GPU devices.

b) The GPU devices calculate hash values for each of its dictionary words and store

them in global memory.

c) The contents of the password database are copied to each GPU device. Each GPU

has the same account data.

d) Finally, the GPU tests for matches between the calculated hash value of each

dictionary word with the stored hash value from the password database.

Fig. 1. Divided Dictionary Algorithm. The dictionary of potential passwords is divided among the MPI nodes. Each MPI node subdivides its words among its available GPU devices. The same password database is loaded onto each GPU.

2) Divided Password Database Algorithm: The divided password database algorithm

(shown in Figure 2) evenly divides the password database file between the MPI nodes which

then further divide the user account data between the GPUs on each MPI node. The full

dictionary file is then loaded onto each GPU and a hash value for each word is calculated. A

kernel function is then repeatedly called for each of the user accounts to test for matches

between the password database hash value and the dictionary word hashes. Any matches

represent the recovered password for a given user account as outlined below.

a) Distribute all of the dictionary words to each MPI node and its available GPU

devices.

b) The GPU devices calculate hash values for each dictionary word.

c) Evenly divide the contents of the password database among each MPI node. Each

MPI node then divides its account data among the GPU devices.

d) Finally, the GPU tests for matches between the calculated hash value of each

dictionary word and the stored hash value from the password database.

Fig. 2. Divided Password Database Algorithm. The password database is divided among

the MPI nodes. Each MPI node subdivides its user account data among its available GPU

devices. The entire dictionary is loaded onto each GPU.
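As with the previous algorithm, steps (a)–(d) can be sketched as a single-process Python simulation (the worker loop stands in for MPI ranks and GPU devices; SHA-256 and the helper names are assumptions of this sketch, not the report's implementation):

```python
import hashlib

def word_hash(word):
    # SHA-256 stands in for the hash used by the real password database.
    return hashlib.sha256(word.encode()).hexdigest()

def split_evenly(items, n):
    # Round-robin split, mimicking the even division among MPI nodes / GPUs.
    return [items[i::n] for i in range(n)]

def divided_password_db(dictionary, password_db, n_workers):
    # a)-b) Every worker hashes the full dictionary.
    hashes = {word_hash(w): w for w in dictionary}
    # c) The user accounts are split among the workers.
    account_shards = split_evenly(list(password_db.items()), n_workers)
    recovered = {}
    for shard in account_shards:  # Conceptually parallel; serialized here.
        # d) Each worker tests only its own accounts against the hashes.
        for user, stored in shard:
            if stored in hashes:
                recovered[user] = hashes[stored]
    return recovered

dictionary = ["secret", "dragon", "letmein", "qwerty"]
password_db = {"alice": word_hash("dragon"), "bob": word_hash("letmein")}
print(divided_password_db(dictionary, password_db, 2))
```

The design trade-off is visible in the sketch: every worker duplicates the full set of dictionary hashes, so per-GPU memory does not shrink as GPUs are added; only the account-matching work is divided.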

3) Minimal Memory Algorithm: The minimal memory algorithm (shown in Figure 3)

repeatedly calculates the hash value for a dictionary word and then tests for a match in the

password database. The intent of this algorithm is to be a low-memory solution, and it differs

greatly from the other two algorithms. The minimal memory algorithm divides the

password database file among the MPI nodes, which then further divide the password

database among the GPUs on each MPI node. The choice to divide the password database

between the MPI nodes instead of the dictionary file is based on the assumption that the

password database will be significantly smaller than the dictionary file. The algorithm

processes the data as outlined below.

a) Evenly divide the password database among the MPI nodes. Each MPI node

further divides its user accounts among the GPU devices.

b) For each dictionary word,

i. Load and distribute the word among the GPUs.

ii. Calculate the word’s hash value.

iii. Compare the calculated hash value against the loaded user account hash

values.

Fig. 3. Minimal Memory Algorithm. The password database is divided evenly among the

MPI nodes and then subdivided among the GPU devices. One by one, a single word from the

dictionary is loaded onto the GPUs, hashed, and compared with the user account hash values.
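The streaming structure of steps (a) and (b) can be sketched in the same single-process Python style as the other two algorithms (again, the worker loop, helper names, and SHA-256 are illustrative assumptions):

```python
import hashlib

def word_hash(word):
    # SHA-256 stands in for the hash used by the real password database.
    return hashlib.sha256(word.encode()).hexdigest()

def split_evenly(items, n):
    # Round-robin split, mimicking the even division among MPI nodes / GPUs.
    return [items[i::n] for i in range(n)]

def minimal_memory(dictionary, password_db, n_workers):
    # a) The (smaller) password database is divided among the workers.
    account_shards = split_evenly(list(password_db.items()), n_workers)
    recovered = {}
    # b-i) Stream one dictionary word at a time; only a single word and its
    #      hash need to be resident per worker at any moment.
    for word in dictionary:
        h = word_hash(word)                # b-ii) hash the current word
        for shard in account_shards:       # Conceptually parallel; serial here.
            for user, stored in shard:     # b-iii) compare against accounts
                if stored == h:
                    recovered[user] = word
    return recovered

dictionary = ["secret", "dragon", "letmein", "qwerty"]
password_db = {"alice": word_hash("dragon"), "bob": word_hash("letmein")}
print(minimal_memory(dictionary, password_db, 2))
```

Unlike the first two algorithms, no table of dictionary hashes is ever built, which is what keeps memory use minimal at the cost of a kernel invocation per dictionary word.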

5. EXPERIMENTAL RESULTS

A. Divided Dictionary Algorithm

B. Minimal Memory Algorithm

C. Divided Password Algorithm

D. Divided Dictionary Algorithm on 2 GPUs

E. Divided Password Algorithm on 2 GPUs

F. Minimal Memory Algorithm on 2 GPUs

G. Dictionary Load Time

H. Comparison of Hashes

I. Execution Time of Program

J. Average GPU Memory Usage

6. CONCLUSION

This project demonstrates that significant scalability and performance gains are possible with

a hybrid HPC approach to password recovery using MPI+CUDA. These tools can be easy to use

and will likely be adopted by both sides in the arms race between system administrators and

hackers.

Of the three algorithms analyzed, the divided dictionary algorithm was the best approach for

distributing data to each GPU. Dividing the larger dictionary, rather than the smaller password

database, among the GPUs ensured performance gains as the number of GPUs increased.

In general, this project shows that MPI and CUDA can be combined to develop an efficient

password recovery system that can scale to multiple compute nodes with multiple GPUs.

Further, this work opens up many avenues for improving performance and scalability for large-

scale hybrid computing systems, as well as utilizing and testing modern improvements in GPU

technology.

7. REFERENCES

[1] MPICH [Online]. Available: http://www.mcs.anl.gov/research/projects/mpich2/

[2] NVIDIA Corporation. CUDA [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html

[3] M. de la Asunción, J. M. Mantas, M. J. Castro, and E. Fernández-Nieto, “An MPI-CUDA implementation of an improved Roe method for two-layer shallow water systems,”

[4] S. Pellicer, Y. Pan, and M. Guo, “Distributed MD4 password hashing with grid computing package BOINC,”

[5] S. Pennycook, S. Hammond, S. Jarvis, and G. Mudalige, “Performance

analysis of a hybrid MPI/CUDA implementation of the NAS-LU