
FASTER PASSWORD RECOVERY WITH MODERN GPUs

Andrey Belenko

Security Researcher, ElcomSoft Co. Ltd.

| Faster Password Recovery with Modern GPUs | June 14, 2011

WHO ARE WE

§Founded in 1990
§Privately owned
§Doing password recovery (software) since 1998
§HQ and development in Moscow, Russia
§Brought GPUs to password recovery in 2007
§5 US patents issued, more in queue
  –2 are about GPU-accelerated password recovery


WHO NEEDS PASSWORD RECOVERY?

§Ordinary users
  –Passwords of their own

§IT Departments
  –Passwords of the employees

§Security auditors, consultants and penetration testers
  –Customer/contractor passwords

§Law enforcement & government agencies
  –Passwords of suspects

§Hackers usually don’t!


WHY SPEED COUNTS?

§Users and IT Departments:
  –«We needed those passwords yesterday»

§Auditors, consultants and pentesters:
  –«Time is Money»

§Law Enforcement and investigators:
  –Legal time limits

The slow part


PASSWORD RECOVERY | The Loop

[Flow diagram: Generate trial password → Transform password (compute hash or encryption key) → Validate hash/key → Success; on Failure → Try next password and repeat]
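The loop on this slide can be sketched in a few lines of Python. This is only an illustration: the names `candidates`, `transform` and `recover`, the lowercase charset, and the use of a single unsalted SHA-1 as the transformation are assumptions made for the example, not details taken from the slides.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def candidates(charset, max_len):
    """Generate trial passwords: every string over charset up to max_len."""
    for length in range(1, max_len + 1):
        for chars in product(charset, repeat=length):
            yield "".join(chars)


def transform(password):
    """Transform password (assumed here: one unsalted SHA-1 digest)."""
    return hashlib.sha1(password.encode("utf-8")).digest()


def recover(target, charset=ascii_lowercase, max_len=3):
    """Generate, transform, validate; on failure try the next password."""
    for password in candidates(charset, max_len):
        if transform(password) == target:  # validate hash/key
            return password                # success
    return None                            # search space exhausted


stored = transform("gpu")  # stand-in for a hash taken from a protected file
print(recover(stored))
```

Real formats replace `transform` with a much slower key derivation, which is where nearly all of the running time goes.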


PASSWORD RECOVERY | The Slow Part

§Designed to be slow
  –50 ms verification time has no impact on usability but HUGE impact on password recovery performance

§Usually designed around well-known hash functions
  –MD5 (old days)
  –SHA-1 (most popular so far)
  –SHA-2 (still exotic)

§Thousands to millions of hash computations per password
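A rough calculation shows how much a 50 ms check costs an attacker. The 8-character lowercase search space and the single-threaded setting are assumptions chosen for illustration; only the 50 ms figure comes from the slide.

```python
VERIFY_SECONDS = 0.050              # 50 ms per verification, as on the slide
SECONDS_PER_YEAR = 365 * 24 * 3600


def rate_per_second(verify_seconds):
    """Passwords one worker can test per second."""
    return 1.0 / verify_seconds


def years_to_exhaust(charset_size, length, verify_seconds):
    """Time to try every password of exactly `length` characters."""
    keyspace = charset_size ** length
    return keyspace * verify_seconds / SECONDS_PER_YEAR


print(rate_per_second(VERIFY_SECONDS))                  # passwords per second
print(round(years_to_exhaust(26, 8, VERIFY_SECONDS)))   # years, one worker
```

A legitimate user pays the 50 ms once per login, while an exhaustive search pays it once per candidate.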


FAST PASSWORD RECOVERY | The CPU Way

Before the GPGPU era, most optimizations focused on:

§SIMD (MMX, SSE, AVX)

§Multi-core

§Distributed computing (think distributed.net)
  –Communication overhead
  –Difficult to manage
  –Not power-efficient


Done by GPU


FAST PASSWORD RECOVERY | The GPU Way

§Password recovery is an “embarrassingly parallel” workload
§Each processing unit verifies its own password, independently of other processing units
§Linear scalability in practice

[Flow diagram: Generate trial passwords → several Transform password steps running in parallel → Validate hashes/keys → Success; on Failure → Try next password and repeat]
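Because each trial password is transformed independently, the work maps directly onto a pool of workers. The sketch below uses CPU threads as a stand-in for GPU processing units; the salt, the iteration count and the function names are assumptions for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

SALT = b"demo-salt"   # assumed value, for illustration only
ITERATIONS = 1000     # assumed value, kept small so the example runs quickly


def transform(password):
    """Transform one password independently of all the others."""
    return hashlib.pbkdf2_hmac("sha1", password.encode("utf-8"), SALT, ITERATIONS)


def recover_parallel(trial_passwords, target, workers=4):
    """Transform a batch in parallel, then validate the resulting keys."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        keys = pool.map(transform, trial_passwords)  # results keep input order
        for password, key in zip(trial_passwords, keys):
            if key == target:
                return password
    return None


batch = ["letmein", "password", "qwerty", "dragon"]
print(recover_parallel(batch, transform("qwerty")))
```

No worker needs anything from any other worker, which is why adding processing units scales the throughput almost linearly.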


FAST PASSWORD RECOVERY | The GPU Way

[Diagram: Generate trial passwords → Compute keys from passwords → Validate keys, split between CPU and GPU, with Passwords[] and Keys[] crossing PCIe]


LIMITATIONS

§Works well for “slow” algorithms

§For “fast” algorithms PCIe becomes the bottleneck
  –e.g. for SHA-1 the theoretical limit is 8 GB/s / (20 bytes in + 20 bytes out) ≈ 214 million passwords per second (taking 8 GB/s as 8 × 2^30 bytes per second)

§Need to offload everything to the GPU
  –password generation and key validation on GPU are bigger challenges than crypto itself
  –especially so without OpenCL
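The PCIe ceiling for SHA-1 is a one-line division. The sketch assumes the bus bandwidth is 8 × 2^30 bytes per second, which is the reading that reproduces the slide's figure of about 214 million; the constant and function names are made up for the example.

```python
PCIE_BYTES_PER_SECOND = 8 * 2**30   # assumed: 8 GiB/s of usable bandwidth
BYTES_IN = 20                       # password data sent to the GPU
BYTES_OUT = 20                      # SHA-1 digest returned to the CPU


def pcie_limit(bandwidth, bytes_in, bytes_out):
    """Upper bound on passwords per second imposed by bus traffic alone."""
    return bandwidth / (bytes_in + bytes_out)


limit = pcie_limit(PCIE_BYTES_PER_SECOND, BYTES_IN, BYTES_OUT)
print(f"{limit / 1e6:.1f} million passwords per second")
```

The bound ignores transfer latency and protocol overhead, so a real system would land below it.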


ALTERNATIVE WAY

[Diagram: Initial password → Generate trial passwords → Passwords[] → Compute keys from passwords → Keys[] → Validate keys → Result; blocks are placed across CPU, PCIe and GPU]
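One way to generate passwords on the device from a single starting point is to treat each password as a number written in base len(charset), so that every thread can derive its own candidate from a start index plus its thread id. This is a sketch of that general idea, not necessarily how the authors implemented it; `index_to_password` and `batch` are hypothetical names.

```python
def index_to_password(index, charset, length):
    """Decode a counter into a fixed-length password (mixed-radix digits)."""
    base = len(charset)
    chars = []
    for _ in range(length):
        index, digit = divmod(index, base)
        chars.append(charset[digit])
    return "".join(reversed(chars))  # most significant character first


def batch(start_index, count, charset, length):
    """What `count` threads would each compute from start_index + thread id."""
    return [index_to_password(start_index + tid, charset, length)
            for tid in range(count)]


print(batch(0, 9, "abc", 2))
```

With this scheme only the start index has to cross the bus per batch, instead of the full list of passwords.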


PASSWORD RECOVERY

[Diagram: Generate trial passwords → Compute keys from passwords → Validate keys, split between CPU and GPU, with Passwords[] and Keys[] crossing PCIe]


OVERLAPPING CPU AND GPU

§In a straightforward implementation it may look like this:

[Timeline: CPU and GPU rows with Gen, Compute and Vfy running one after another, so each device waits for the other]

§But CPU and GPU can work simultaneously, so overlap their operations:

[Timeline: CPU and GPU rows where Vfy and Gen on the CPU run while the GPU is in Compute]

Profit!
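The overlap can be sketched with a single background worker standing in for the GPU: while it computes keys for one batch, the main thread verifies the previous batch and generates the next. The function names and the plain SHA-1 "kernel" are assumptions for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(candidates, size):
    """Gen: cut the candidate stream into fixed-size batches."""
    iterator = iter(candidates)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def compute_keys(batch):
    """Compute: stand-in for the GPU kernel (assumed: plain SHA-1)."""
    return [hashlib.sha1(p.encode("utf-8")).digest() for p in batch]


def verify(batch, keys, target):
    """Vfy: return the matching password in this batch, if any."""
    for password, key in zip(batch, keys):
        if key == target:
            return password
    return None


def recover_overlapped(candidates, target, batch_size=4):
    with ThreadPoolExecutor(max_workers=1) as gpu:
        in_flight = None  # (batch, future) currently being computed
        for batch in batches(candidates, batch_size):  # Gen overlaps Compute
            future = gpu.submit(compute_keys, batch)
            if in_flight is not None:
                # Vfy of the previous batch overlaps Compute of this one.
                found = verify(in_flight[0], in_flight[1].result(), target)
                if found is not None:
                    return found
            in_flight = (batch, future)
        if in_flight is not None:
            return verify(in_flight[0], in_flight[1].result(), target)
    return None


target = hashlib.sha1(b"gpu").digest()
print(recover_overlapped(["cpu", "apu", "gpu", "tpu", "npu"], target, batch_size=2))
```

The price of the overlap is that one extra batch is always in flight, so a hit is reported one batch later than in the sequential version.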


PERFORMANCE | PBKDF2-SHA1 x 10’000

Computations per second:

§AMD HD 6990: 50300
§NVIDIA GTX 590: 23500
§Intel i7-970: 3120


HEY, WHY NO 100X SPEEDUP?

Be fair!

§CPUs are not single core any more
  –Even Atoms are not

§Extended instruction sets were introduced for performance reasons
  –So why ignore them?

§Will usually get ~10x on comparable hardware for well-suited compute-bound tasks


CPU LAYOUT

§1.2 billion transistors
  –Most are L3/L2 caches

§Less than 10% are in execution and/or ALU units

[Die layout: Memory Controller, IO & QPI (two blocks), L3 Cache (two blocks), Queue, six Cores]


GPU LAYOUT

§3 billion transistors (2.5x)

§About 30% are execution and/or ALU units (3x)

§7.5x more transistors dedicated to execution units

§Core frequency is lower (~0.4x)

§3x estimated speedup

In a fair real-world comparison this GPU is 4x faster than the CPU on a compute-bound task
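The 3x estimate is the product of the three ratios on the slide. The sketch treats the CPU's "less than 10%" as exactly 10%, which is the assumption implied by the slide's 3x figure for execution-unit share; the constant names are made up for the example.

```python
TRANSISTOR_RATIO = 3.0 / 1.2    # GPU vs CPU transistor count
ALU_SHARE_RATIO = 0.30 / 0.10   # share of transistors in execution/ALU units
FREQUENCY_RATIO = 0.4           # GPU core clock relative to the CPU

# More execution transistors, each running at a lower clock.
execution_transistors = TRANSISTOR_RATIO * ALU_SHARE_RATIO
estimated_speedup = execution_transistors * FREQUENCY_RATIO

print(round(execution_transistors, 2), round(estimated_speedup, 2))
```

The estimate lands in the same range as the 4x observed in the fair comparison, which is the point of the slide: the gap comes from how the transistor budget is spent.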


HEY, WHY NO 100X SPEEDUP?

Be fair!

§CPUs are not single core any more
  –Even Atoms are not

§Extended instruction sets were introduced for performance reasons
  –So why ignore them?

§Will usually get ~10x on comparable hardware for well-suited tasks

In our case:
§SSE2 code + processor-specific compiler optimizations
§12 threads to fully utilize 6 cores + HT
§16x over high-end CPU
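The 16x figure can be checked against the PBKDF2-SHA1 chart. The assignment of the three chart values to the three devices is inferred from their order and from the 16x claim, so treat the mapping below as an assumption.

```python
# Computations per second from the PBKDF2-SHA1 x 10'000 chart
# (device-to-value mapping is inferred, not stated explicitly on the slide).
RATES = {
    "AMD HD 6990": 50300,
    "NVIDIA GTX 590": 23500,
    "Intel i7-970": 3120,
}


def speedup_over_cpu(rates, cpu="Intel i7-970"):
    """Ratio of every device's rate to the CPU's rate."""
    return {name: rate / rates[cpu] for name, rate in rates.items()}


for name, ratio in speedup_over_cpu(RATES).items():
    print(f"{name}: {ratio:.1f}x")
```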




WHY AMD IS SO FAST?

§Most password transformations are bound by integer performance
  –AMD cards exhibit awesome integer performance

§Many password transformations (=crypto) make heavy use of bit rotations (=cyclic shifts)
  –There is a special instruction for this!
  –Cyclic shift in 1 instruction instead of 3, up to 30% overall speedup in practice

§GPU code written in IL
  –Utilize all GPU devices under Windows
  –(Recent APP SDK versions allow this with OpenCL)


PERFORMANCE | bitalign

§AMD IL Specification, section 7.13:

Aligns bit data for video. This is a special instruction for multi-media video.

bitalign dst, src0, src1, src2
dst = (src0 >> src2.x) | (src1 << (32-src2.x))

§Can be used to implement cyclic bit shift in 1 instruction
  –VERY useful for many crypto algorithms

§Introduced in Evergreen

§Exposed at the IL level
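Passing the same register as both sources turns the bitalign formula into a rotation. The Python emulation below follows the formula quoted on the slide on 32-bit values; the helper names are hypothetical and the shift is assumed to be between 1 and 31.

```python
MASK32 = 0xFFFFFFFF


def bitalign(src0, src1, shift):
    """Emulate dst = (src0 >> shift) | (src1 << (32 - shift)) on 32 bits."""
    return ((src0 >> shift) | (src1 << (32 - shift))) & MASK32


def rotr_three_ops(x, n):
    """Cyclic right shift the usual way: shift, shift, or."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def rotr_bitalign(x, n):
    """Cyclic right shift as a single bitalign with src0 == src1."""
    return bitalign(x, x, n)


def rotl_bitalign(x, n):
    """A left rotation by n is a right rotation by 32 - n."""
    return bitalign(x, x, 32 - n)


print(hex(rotr_bitalign(0x80000001, 1)), hex(rotl_bitalign(0x80000001, 1)))
```

SHA-1 and MD5 rotate in every round, which is how replacing three instructions with one can add up to a noticeable overall gain.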


PERFORMANCE | Bitfield Insert

§AMD Evergreen ISA Reference, page 9-61:

BFI_INT dst, src0, src1, src2
dst = (src1 & src0) | (src2 & ~src0)

§This is vector bit select
dst_i = (mask_i != 0) ? arg1_i : arg2_i

§Very useful for accelerating various crypto algorithms
  –And especially for breaking them

§Introduced in Evergreen

§NOT exposed at the IL level
  –OpenCL bitselect() is not using it either
  –No documented way to emit this instruction directly
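Bit select maps directly onto the round functions of SHA-1 and SHA-2. The sketch emulates the operation on 32-bit values, taking the mask's complement as a bitwise NOT, and checks it against the textbook definitions of "choose" and "majority"; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def bfi_int(src0, src1, src2):
    """Per bit: take src1 where src0 has a 1, src2 where it has a 0."""
    return ((src1 & src0) | (src2 & ~src0)) & MASK32


def ch(x, y, z):
    """SHA "choose": x selects between y and z, one bit select."""
    return bfi_int(x, y, z)


def maj(x, y, z):
    """SHA "majority": where x and y differ z decides, otherwise y does."""
    return bfi_int(x ^ y, z, y)


def ch_reference(x, y, z):
    return ((x & y) | (~x & z)) & MASK32


def maj_reference(x, y, z):
    return (x & y) | (x & z) | (y & z)


# These three patterns cover all eight combinations of input bits.
x, y, z = 0xF0F0F0F0, 0xCCCCCCCC, 0xAAAAAAAA
print(ch(x, y, z) == ch_reference(x, y, z), maj(x, y, z) == maj_reference(x, y, z))
```

Written with plain AND/OR/NOT/XOR, each of these functions takes several instructions, so a single-instruction select shortens every round.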


WHY INTERMEDIATE LANGUAGE

§We chose IL over Brook+
  –OpenCL did not exist yet
  –Brook+ programming model was not quite suited for password recovery
  –ISA provided no significant benefit over IL

§“Early” OpenCL support couldn’t compete with IL either
  –Limited support for binary (pre-compiled) kernels
  –Limited support for multi-GPU in OpenCL
  –(Those issues seem to be fixed in APP SDK 2.4)

§AMD is going to deprecate CAL in next SDK (2.5)
  –IL will almost certainly be deprecated altogether
  –This is very bad news for us
  –Need to decide whether to go up (OpenCL) or down (ISA)
  –Morning Keynote mentioned FSAIL which seems like a great alternative!


WRITING IN INTERMEDIATE LANGUAGE

§IL doesn’t seem to be designed to be human-friendly
  –Use scripting languages to generate IL code
  –And handle platform-specific optimizations (e.g. emulate bitalign on older GPUs)

§Compile kernels at program build time
  –Avoids runtime compilation
  –Solves (partially) IP problem: no source code needs to be distributed
  –Need to provide new binaries for new devices

§Use CAL at runtime to load, configure and launch pre-compiled kernel
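A generator script of the kind described might pick the instruction sequence per target GPU. The sketch below is a toy: it emits pseudo-assembly, not valid AMD IL. The `bitalign` operand order follows the formula quoted earlier in the deck, while `SHR`, `SHL`, `OR` and the register names are placeholder mnemonics invented for the example.

```python
def emit_rotr(dst, src, n, has_bitalign):
    """Return pseudo-assembly lines for dst = rotate_right(src, n)."""
    if not 0 < n < 32:
        raise ValueError("n must be between 1 and 31")
    if has_bitalign:
        # Newer GPUs: one instruction with the same register as both sources.
        return [f"bitalign {dst}, {src}, {src}, {n}"]
    # Older GPUs: emulate with shift, shift, or (placeholder mnemonics).
    return [
        f"SHR tmp0, {src}, {n}",
        f"SHL tmp1, {src}, {32 - n}",
        f"OR {dst}, tmp0, tmp1",
    ]


for target_has_bitalign in (True, False):
    print("\n".join(emit_rotr("r0", "r1", 7, target_has_bitalign)))
```

Keeping the choice inside one emitter function means the hash kernels themselves are written once and regenerated for each device family.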


SCALABILITY

§Not all GPUs are equally powerful

§Program should scale nicely with number of processing cores in installed GPU
  –Query number of processors at runtime
  –Partition task proportionally to number of processors
  –Helps to reduce UI update “freezes”
  –Also helps to avoid TDR (Windows Timeout Detection and Recovery)
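One simple way to size each kernel launch is to scale it with the processor count so that every launch takes roughly the same, short time regardless of the GPU. The function name, the per-processor rate and the 0.1 second target are assumptions for illustration; in a real program the processor count would be queried from the device at runtime.

```python
def launch_size(num_processors, passwords_per_processor_per_second,
                target_seconds=0.1):
    """Passwords to hand to one kernel launch so it lasts about target_seconds."""
    size = int(num_processors * passwords_per_processor_per_second * target_seconds)
    return max(size, num_processors)  # at least one password per processor


# A small and a large GPU get batches that take about the same time to run.
print(launch_size(2, 1000), launch_size(24, 1000))
```

Short, evenly sized launches return control to the driver often enough to keep the display responsive.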


SCALABILITY

§8 GPUs are not uncommon today

§Program should scale nicely with number of GPUs
  –Query number of devices in system
  –Spawn thread for each device
  –Partition task as appropriate

§Speedup should be linear unless you hit PCIe bandwidth limits
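The three steps can be sketched with one host thread per device and a proportional split of the work. The device list is passed in rather than queried, and `partition`, `run_on_devices` and the `worker` callback are hypothetical names; a real worker would drive one GPU.

```python
import threading


def partition(total_items, weights):
    """Split total_items proportionally to weights; shares sum to total_items."""
    total_weight = sum(weights)
    shares = [total_items * w // total_weight for w in weights]
    # Hand out what integer division left over, one item per device.
    for i in range(total_items - sum(shares)):
        shares[i % len(shares)] += 1
    return shares


def run_on_devices(passwords, device_processors, worker):
    """Spawn one thread per device and give each its share of the passwords."""
    shares = partition(len(passwords), device_processors)
    results = [None] * len(shares)
    threads = []
    start = 0
    for device_id, share in enumerate(shares):
        chunk = passwords[start:start + share]
        start += share

        def run(device_id=device_id, chunk=chunk):
            results[device_id] = worker(device_id, chunk)

        thread = threading.Thread(target=run)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results


# Two devices, the first with three times as many processors as the second.
print(run_on_devices(list(range(100)), [3, 1], lambda dev, chunk: len(chunk)))
```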


Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.

QUESTIONS?