CMPE 511, Fall 20061 Hybrid (Software-Hardware) Dynamic Memory Allocator Prepared by Mustafa Özgür Akduran İstanbul, 2006 Boğaziçi Üniversitesi

CMPE 511, Fall 2006 1

Hybrid (Software-Hardware) Dynamic Memory Allocator

Prepared by Mustafa Özgür Akduran

İstanbul, 2006

Boğaziçi Üniversitesi

CMPE 511, Fall 2006 2

Outline

Introduction

Related Research

Proposed Hybrid Allocator

Complexity and Performance Comparison

Conclusion

References

CMPE 511, Fall 2006 3

Introduction

Need for Efficient Implementation of Memory Management Functions

Memory Usage Execution Performance

Modern Programming

Languages

Dynamic Memory Allocation (DMA)

Garbage Collection

CMPE 511, Fall 2006 4

Current Systems

Execution time spent on Memory Management is 42%. Still important researches on

Good execution performance Memory locality

How to get free chunks of memory ? Software Allocator Hardware Allocator

IntroductionDMM

Pure Software

Low Cost Allocator

Dynamic Memory Management

CMPE 511, Fall 2006 5

Software Allocator Different Search

Techniques to organize available chunks of free memory

Disadvantage Search could be in the

critical path of allocators causing a major performance bottleneck.

Hardware Allocator Parallel Search

Speed up Memory Allocation

Improve Performance Hide execution latency of

freeing objects Coalescing of free chunks

of memory

Disadvantage Potential Hardware

Complexity

Introduction

CMPE 511, Fall 2006 6

Introduction

A New Hybrid Software-Hardware Allocator

PHK (Poul-Henning Kamp) Allocation Algorithm used in

Free-BSD System

Chang’s Hardware Allocator

Aim is to balance the hardware complexity with performance by using both hardware and software together.

CMPE 511, Fall 2006 7

Related Research

PHK (Poul-Henning Kamp) Allocator

Two most popular general purpose open source allocator1. Doug Lea used in LINUX System2. PHK used in Free-BSD System

Difference between them is less than 3% for memory allocation intensive benchmarks in SPEC 2000 CPU.

PHK Allocator chosen bacause of its suitability for hardware/software co-design.

Free-BSD (Berkeley Software Distribution ) is an advanced operating system for x86 compatible (including Pentium® and Athlon™), architectures. It is derived from BSD, the version of UNIX® developed at the University of California, Berkeley. It is developed and maintained by a large team of individuals.

CMPE 511, Fall 2006 8

Related Research

PHK (Poul-Henning Kamp) Allocator Page based allocator Each page can only contain objects of one size For a large object sufficient number of pages allocated For small objects less than a half page, object size is padded to

the nearest power of 2 Allocator keeps a page directory for all allocated pages and at

the beginning of each small object page, bitmap of allocation information is created

While allocating small objects, PHK Allocator performs a linear search on the bitmap to find the first available chunk in that page

CMPE 511, Fall 2006 9

Related Research

Chang’s Hardware Allocator Based on Buddy System invented by Knuth The buddy memory allocation technique divides memory into

partitions to satisfy a memory request as suitably as possible This system makes use of splitting memory into halves to try to

give a best-fit Compared to the memory allocation techniques (such as paging)

that modern OS such as MS Windows and Linux use, the buddy memory allocation is relatively easy to implement, and does not have the hardware requirement of a memory management unit

Chang’s algorithm is a first method based on a binary OR-tree and a binary AND-tree.

CMPE 511, Fall 2006 10

Related Research Chang’s Hardware Allocator

Each leaf node of the OR-tree represents base size of the smallest unit of memory that can be allocated

The leaves of OR-tree together represent the entire memory

AND-tree has the same number of leaves as the OR-tree

Input of the AND-tree is generated by a complex interconnection network of the OR-tree

Or Gates

CMPE 511, Fall 2006 11

Related Research Chang’s Hardware Allocator

Or-Tree Determine if there is a large

enough space for allocation request

AND-Tree Find the beginning address

of that memory chunk Flip the bits corresponding

to the memory chunk in the bit-vector

Bit-vector

CMPE 511, Fall 2006 12

Related Research

CMPE 511, Fall 2006 13

Related Research

The interconnection between the OR-tree and the AND-tree is the most complex part of the Chang’s allocator

The interconnection has the same critical path delay as the OR/AND-tree

Final allocation result is produced by the output of the AND-tree through a set of multiplexers

The Hardware complexity, in terms of number of gates is

O(n logn)

# the memory chunks Critical path delay

CMPE 511, Fall 2006 14


Pure hardware allocators based on

buddy system

1. Complexity of the hardware increases with the size of the memory managed

2. Poor object locality

Software Allocators Poor execution performance

Problems of hardware-software only allocators

CMPE 511, Fall 2006 15


New Hybrid Allocator

1. Using small , fixed hardware to help manage the memory

2. Software portion which is based on PHK algorithm provides better object localities than buddy system

3. Hardware portion improves execution performance of the software portion

CMPE 511, Fall 2006 16


Software portion

Responsible for

1. Creating page indexes

2. For large sized objects (>half a page) does the allocation without any assistance from hardware

3. Allocation for a small sized object, it will locate the bitmap of a page with free memory and issues a search request to the hardware

Hardware portion

1. Search the page index (or bitmap) in parallel to find a free chunk

2. Mark the bitmap to indicate an allocation

CMPE 511, Fall 2006 17

Proposed Hybrid Allocator OR-tree responsible for determining if there is a free chunk in a page (similar to

Chang’s system) AND-tree will locate the position of the first free chunk in the page (similar to

Chang’s system) Because an OR-tree and an AND-tree are dedicated to one object size, complex

interconnections between OR and AND tree are not needed( unlike Chang’s)

CMPE 511, Fall 2006 18

Proposed Hybrid Allocator MUX uses opcode to select the address of the bit needed to be flipped.

If the opcode is “alloc” the address from the AND-tree will be chosen If the opcode is “free” the address from the request will be selected

D-latches are used as storage devices where the bitmap will be loaded from the page in accordance with the allocation size

DEMUX used to decode the address from the MUX

CMPE 511, Fall 2006 19

Proposed Hybrid Allocator Bit-flippers use the decoded address and the opcode to determine how to

flip a desired bit

Block Diagram of Proposed Hardware Component (For Page Size 4096 bytes and Object Size 16 bytes)

CMPE 511, Fall 2006 20

Proposed Hybrid Allocator Overall design of the system with

4096-byte pages For different object sizes, the hardware

needed to support the bit-map will be different

In our design, preselected object sizes are from 16-bytes to 2048-bytes and include hardware to support pages for these objects

MUX is used to select the hardware unit that will be responsible for supporting objects of a given size

The larger the object size, the smaller the amount of hardware needed to support the bit-maps indicating the availability of chunks in that page

CMPE 511, Fall 2006 21

Proposed Hybrid Allocator With 4096-byte pages, we have 8

different sized objects ranging from 16-bytes to 2048-bytes.

For allocating 2048-byte objects we need a tree

with two leaves 16-byte objects we need trees with

256 leaves For a 16-byte object we need only 255

AND/OR gates For overall system

1+3+7+15+31+63+127+255=502 AND gates and 502 OR gates are needed

Very small amount compared to billions of transistors available on modern processor chips

CMPE 511, Fall 2006 22


Complexity Comparison

Existing hardware allocator designs implement the buddy

system

The amount of hardware that is used to implement a buddy

allocator is dependent on the size of memory

That makes buddy system based allocators not scalable.

Our design has much lower hardware complexity than

Chang’s allocator. (Buddy System)

CMPE 511, Fall 2006 23


M: Total dynamic memory size P: Page sizeS: Smallest allocated object size

CMPE 511, Fall 2006 24


Performance Analysis

Hardware-assisted PHK allocator

Conventional CPU using SimpleScalar

simulation tool set V2.0

CMPE 511, Fall 2006 25


CMPE 511, Fall 2006 26


CMPE 511, Fall 2006 27

Complexity and Performance Comparison We show the reduced memory management execution cycles normalized to the

original execution cycles spent on memory management functions by software only allocator

Cfrac application shows the best performance improvement Ave.obj.size is 8 bytes which means that most pages allocated contain 256

objects Linear search in the software implementation for that many objects will be very

slow The hardware speeds up the search, leading to 76.2% normalized performance

improvement over the software only allocation Benchmark espresso with average object size of 250 bytes shows the least amount

of improvement using the hybrid allocator Pages allocated for espresso contain fewer than 20 objects Linear search of 20 objects is not significant, and the hardware allocator only shows 48.0%

nornalized performance improvement Other benchmarks have average object sizes of 16 bytes to 48 bytes, so the

performance gains are not significant as cfrac, but better than espresso On average, the Hybrid allocator reduces the memory management time by 58.9%.

The average overall execution speedup of our design when compared to a software only allocator implementation is 12.7%

CMPE 511, Fall 2006 28

Conclusion

Compared to Hardware only

allocators

1. Significantly lower hardware complexity

2. Lower critical path delays

3. Our design has a fixed hardware complexity which is dependent on the size of a memory page (not the total user memory being managed)

Our Design

Compared to Software only

allocators

1. Overall execution performance is 12.7% better on memory intensive benchmarks

2. Memory management efficiency improved by 58.9%

CMPE 511, Fall 2006 29

Conclusion

Future Work Exploring variable sized pages such that the number of allocated

objects are the same in each page

All the bitmaps will have the same number of bits

Thus, we need only one pair of AND-tree and Or-tree in the

design

That will further reduce the hardware complexity

This will also improve the memory management efficiency of

allocators for large objects

CMPE 511, Fall 2006 30

References [1] W.Li, S.P.Mohanty and K.Kavi, “A Page-based Hybrid (Software-

Hardware) Dynamic Memory Allocator” IEEE Computer Architecture Letters (accepted in July 2006 for future issue)

[2] J.M. Chang and E.F.Gehringer, “A High-Performance Memory Allocator for Object-oriented Systems”, IEEE Transactions on Computers, Mar. 1996, pp 357-366.

[3] P.H.Kamp.“Malloc(3)revisited”, http://phk.freebsd.dk/pubs/malloc.pdf

[4]D.E.Knuth, The Art of Computer Programming Vol.I: Fundamental Algorithms., Addison-Wesley, 1968.

[5]D.Burgerand, T.M.Austin, “The Simple Scalar Tool Set, V2.0”, Tech Report CS-1342, University of Wisconsin-Madison, Jun. 1997.

CMPE 511, Fall 2006 31

Questions ?

Documents

CMPE 511, Fall 20061 Hybrid (Software-Hardware) Dynamic Memory Allocator Prepared by Mustafa Özgür Akduran İstanbul, 2006 Boğaziçi Üniversitesi