Upload
alexis-broman
View
221
Download
2
Embed Size (px)
Citation preview
CMPE 511, Fall 2006 1
Hybrid (Software-Hardware) Dynamic Memory Allocator
Prepared by Mustafa Özgür Akduran
İstanbul, 2006
Boğaziçi Üniversitesi
CMPE 511, Fall 2006 2
Outline
Introduction
Related Research
Proposed Hybrid Allocator
Complexity and Performance Comparison
Conclusion
References
CMPE 511, Fall 2006 3
Introduction
Need for Efficient Implementation of Memory Management Functions
Memory Usage Execution Performance
Modern Programming
Languages
Dynamic Memory Allocation (DMA)
Garbage Collection
CMPE 511, Fall 2006 4
Current Systems
Execution time spent on Memory Management is 42%. Still important researches on
Good execution performance Memory locality
How to get free chunks of memory ? Software Allocator Hardware Allocator
IntroductionDMM
Pure Software
Low Cost Allocator
Dynamic Memory Management
CMPE 511, Fall 2006 5
Software Allocator Different Search
Techniques to organize available chunks of free memory
Disadvantage Search could be in the
critical path of allocators causing a major performance bottleneck.
Hardware Allocator Parallel Search
Speed up Memory Allocation
Improve Performance Hide execution latency of
freeing objects Coalescing of free chunks
of memory
Disadvantage Potential Hardware
Complexity
Introduction
CMPE 511, Fall 2006 6
Introduction
A New Hybrid Software-Hardware Allocator
PHK (Poul-Henning Kamp) Allocation Algorithm used in
Free-BSD System
Chang’s Hardware Allocator
Aim is to balance the hardware complexity with performance by using both hardware and software together.
CMPE 511, Fall 2006 7
Related Research
PHK (Poul-Henning Kamp) Allocator
Two most popular general purpose open source allocator1. Doug Lea used in LINUX System2. PHK used in Free-BSD System
Difference between them is less than 3% for memory allocation intensive benchmarks in SPEC 2000 CPU.
PHK Allocator chosen bacause of its suitability for hardware/software co-design.
Free-BSD (Berkeley Software Distribution ) is an advanced operating system for x86 compatible (including Pentium® and Athlon™), architectures. It is derived from BSD, the version of UNIX® developed at the University of California, Berkeley. It is developed and maintained by a large team of individuals.
CMPE 511, Fall 2006 8
Related Research
PHK (Poul-Henning Kamp) Allocator Page based allocator Each page can only contain objects of one size For a large object sufficient number of pages allocated For small objects less than a half page, object size is padded to
the nearest power of 2 Allocator keeps a page directory for all allocated pages and at
the beginning of each small object page, bitmap of allocation information is created
While allocating small objects, PHK Allocator performs a linear search on the bitmap to find the first available chunk in that page
CMPE 511, Fall 2006 9
Related Research
Chang’s Hardware Allocator Based on Buddy System invented by Knuth The buddy memory allocation technique divides memory into
partitions to satisfy a memory request as suitably as possible This system makes use of splitting memory into halves to try to
give a best-fit Compared to the memory allocation techniques (such as paging)
that modern OS such as MS Windows and Linux use, the buddy memory allocation is relatively easy to implement, and does not have the hardware requirement of a memory management unit
Chang’s algorithm is a first method based on a binary OR-tree and a binary AND-tree.
CMPE 511, Fall 2006 10
Related Research Chang’s Hardware Allocator
Each leaf node of the OR-tree represents base size of the smallest unit of memory that can be allocated
The leaves of OR-tree together represent the entire memory
AND-tree has the same number of leaves as the OR-tree
Input of the AND-tree is generated by a complex interconnection network of the OR-tree
Or Gates
CMPE 511, Fall 2006 11
Related Research Chang’s Hardware Allocator
Or-Tree Determine if there is a large
enough space for allocation request
AND-Tree Find the beginning address
of that memory chunk Flip the bits corresponding
to the memory chunk in the bit-vector
Bit-vector
CMPE 511, Fall 2006 12
Related Research
CMPE 511, Fall 2006 13
Related Research
The interconnection between the OR-tree and the AND-tree is the most complex part of the Chang’s allocator
The interconnection has the same critical path delay as the OR/AND-tree
Final allocation result is produced by the output of the AND-tree through a set of multiplexers
The Hardware complexity, in terms of number of gates is
O(n logn)
# the memory chunks Critical path delay
CMPE 511, Fall 2006 14
Proposed Hybrid Allocator
Pure hardware allocators based on
buddy system
1. Complexity of the hardware increases with the size of the memory managed
2. Poor object locality
Software Allocators Poor execution performance
Problems of hardware-software only allocators
CMPE 511, Fall 2006 15
Proposed Hybrid Allocator
New Hybrid Allocator
1. Using small , fixed hardware to help manage the memory
2. Software portion which is based on PHK algorithm provides better object localities than buddy system
3. Hardware portion improves execution performance of the software portion
CMPE 511, Fall 2006 16
Proposed Hybrid Allocator
Software portion
Responsible for
1. Creating page indexes
2. For large sized objects (>half a page) does the allocation without any assistance from hardware
3. Allocation for a small sized object, it will locate the bitmap of a page with free memory and issues a search request to the hardware
Hardware portion
1. Search the page index (or bitmap) in parallel to find a free chunk
2. Mark the bitmap to indicate an allocation
CMPE 511, Fall 2006 17
Proposed Hybrid Allocator OR-tree responsible for determining if there is a free chunk in a page (similar to
Chang’s system) AND-tree will locate the position of the first free chunk in the page (similar to
Chang’s system) Because an OR-tree and an AND-tree are dedicated to one object size, complex
interconnections between OR and AND tree are not needed( unlike Chang’s)
CMPE 511, Fall 2006 18
Proposed Hybrid Allocator MUX uses opcode to select the address of the bit needed to be flipped.
If the opcode is “alloc” the address from the AND-tree will be chosen If the opcode is “free” the address from the request will be selected
D-latches are used as storage devices where the bitmap will be loaded from the page in accordance with the allocation size
DEMUX used to decode the address from the MUX
CMPE 511, Fall 2006 19
Proposed Hybrid Allocator Bit-flippers use the decoded address and the opcode to determine how to
flip a desired bit
Block Diagram of Proposed Hardware Component (For Page Size 4096 bytes and Object Size 16 bytes)
CMPE 511, Fall 2006 20
Proposed Hybrid Allocator Overall design of the system with
4096-byte pages For different object sizes, the hardware
needed to support the bit-map will be different
In our design, preselected object sizes are from 16-bytes to 2048-bytes and include hardware to support pages for these objects
MUX is used to select the hardware unit that will be responsible for supporting objects of a given size
The larger the object size, the smaller the amount of hardware needed to support the bit-maps indicating the availability of chunks in that page
CMPE 511, Fall 2006 21
Proposed Hybrid Allocator With 4096-byte pages, we have 8
different sized objects ranging from 16-bytes to 2048-bytes.
For allocating 2048-byte objects we need a tree
with two leaves 16-byte objects we need trees with
256 leaves For a 16-byte object we need only 255
AND/OR gates For overall system
1+3+7+15+31+63+127+255=502 AND gates and 502 OR gates are needed
Very small amount compared to billions of transistors available on modern processor chips
CMPE 511, Fall 2006 22
Complexity and Performance Comparison
Complexity Comparison
Existing hardware allocator designs implement the buddy
system
The amount of hardware that is used to implement a buddy
allocator is dependent on the size of memory
That makes buddy system based allocators not scalable.
Our design has much lower hardware complexity than
Chang’s allocator. (Buddy System)
CMPE 511, Fall 2006 23
Complexity and Performance Comparison
M: Total dynamic memory size P: Page sizeS: Smallest allocated object size
CMPE 511, Fall 2006 24
Complexity and Performance Comparison
Performance Analysis
Hardware-assisted PHK allocator
Conventional CPU using SimpleScalar
simulation tool set V2.0
CMPE 511, Fall 2006 25
Complexity and Performance Comparison
CMPE 511, Fall 2006 26
Complexity and Performance Comparison
CMPE 511, Fall 2006 27
Complexity and Performance Comparison We show the reduced memory management execution cycles normalized to the
original execution cycles spent on memory management functions by software only allocator
Cfrac application shows the best performance improvement Ave.obj.size is 8 bytes which means that most pages allocated contain 256
objects Linear search in the software implementation for that many objects will be very
slow The hardware speeds up the search, leading to 76.2% normalized performance
improvement over the software only allocation Benchmark espresso with average object size of 250 bytes shows the least amount
of improvement using the hybrid allocator Pages allocated for espresso contain fewer than 20 objects Linear search of 20 objects is not significant, and the hardware allocator only shows 48.0%
nornalized performance improvement Other benchmarks have average object sizes of 16 bytes to 48 bytes, so the
performance gains are not significant as cfrac, but better than espresso On average, the Hybrid allocator reduces the memory management time by 58.9%.
The average overall execution speedup of our design when compared to a software only allocator implementation is 12.7%
CMPE 511, Fall 2006 28
Conclusion
Compared to Hardware only
allocators
1. Significantly lower hardware complexity
2. Lower critical path delays
3. Our design has a fixed hardware complexity which is dependent on the size of a memory page (not the total user memory being managed)
Our Design
Compared to Software only
allocators
1. Overall execution performance is 12.7% better on memory intensive benchmarks
2. Memory management efficiency improved by 58.9%
CMPE 511, Fall 2006 29
Conclusion
Future Work Exploring variable sized pages such that the number of allocated
objects are the same in each page
All the bitmaps will have the same number of bits
Thus, we need only one pair of AND-tree and Or-tree in the
design
That will further reduce the hardware complexity
This will also improve the memory management efficiency of
allocators for large objects
CMPE 511, Fall 2006 30
References [1] W.Li, S.P.Mohanty and K.Kavi, “A Page-based Hybrid (Software-
Hardware) Dynamic Memory Allocator” IEEE Computer Architecture Letters (accepted in July 2006 for future issue)
[2] J.M. Chang and E.F.Gehringer, “A High-Performance Memory Allocator for Object-oriented Systems”, IEEE Transactions on Computers, Mar. 1996, pp 357-366.
[3] P.H.Kamp.“Malloc(3)revisited”, http://phk.freebsd.dk/pubs/malloc.pdf
[4]D.E.Knuth, The Art of Computer Programming Vol.I: Fundamental Algorithms., Addison-Wesley, 1968.
[5]D.Burgerand, T.M.Austin, “The Simple Scalar Tool Set, V2.0”, Tech Report CS-1342, University of Wisconsin-Madison, Jun. 1997.
CMPE 511, Fall 2006 31
Questions ?