Implementing Advanced Intelligent Memory

Implementing Advanced Intelligent

Memory

Josep Torrellas, U of Illinois & IBM Watson Ctr.David Padua and Dan Reed, U of Illinois

September 1998

[email protected], [email protected], [email protected]

Technological Opportunity

We can fabricate a large silicon area of Merged Logic and Dram (MLD)

Question: How to exploit this capability best to advance computing?

Pieces of the Puzzle

• Today:

• In a couple of years: 512 Mbit MLD process at 0.18um

• Manufacturers:

256 Mbit MLD process with 0.25um

E.g. 2 IBM PowerPC 603 with 8KB I+D caches take 10% of the chip

Includes logic running at 200 MHz

IBM Cmos-7LD technology available Fall 98 Japanese manufacturers (NEC,Fujitzu) are in the lead

Key Applications Clamor for HW

• Data Mining (decision trees and neural networks)

All are Data Intensive Applications

• Plus the typical ones: MPEG, TPCD, speech recognition

• Computational Biology (DNA sequence matching)

• Financial Modeling (stock options, derivatives)

• Molecular Dynamics (short-range forces)

Our Solution: Principles

1. Extract high bandwidth from DRAM:> Many simple processing units

2. Run legacy codes w/ high performance:> Do not replace off-the-shelf uP in workstation

3. Small increase in cost over DRAM:> Simple processing units, still dense

> Take place of memory chip. Same interface as DRAM> Intelligent memory defaults to plain DRAM

4. General purpose:> Do not hardwire any algorithm. No special purpose

Architecture Proposed

P.Host

L1,L2 Cache

P.MemCache

P.ArrayDRAM

Network

Plain DRAM

FlexRAM

Proposed Work

• Design an architecture based on key IBM applications

• Demonstrate significant speedups on the applications

• Fabricate chips using IBM Cmos 7LD technology

• Build a workstation w/ an intelligent memory system

• Build a language and compiler for the intelligent memory

Example App: DNA Matching

BLAST code from NIH web site

sample DNA

database of DNA chains

Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chain

How the Algorithm Works

1. Pick 4 consecutive aminoacids from the sample

bbcf

2. Generate 50+ most-likely mutations

becf

Example App: DNA Matching

3. Compare them to every position in the database DNAsbecf

4. If match is found: try to extend itbecf sample DNA

becf

? ?database of DNA chains

P.Arrays

• Total of 64 per chip (90 mm )

• Organized as a ring, no need for a mesh

• SPMD engines, not SIMD. Cycling at 200 MHz

• 32-bit datapath, integer only, including MPY. 28 instruc.

2

1 Mbyte of DRAM memory. Can also access the memory of N and S neighbors

• Each P.Array

2 1-Kbyte row buffers to capture data locality

8 Kbyte of SRAM I-memory shared by 4 P.Arrays

P.Array Design

ROW Decoder

DRAM Block

Sense AMP/Col. Dec

Port 1

Port 0

Broadcast Bus

Sw

itch

es

Switches

Controller

Addr. Gen.

Inp

ut

Reg

.

Port 2

ALU

Instr. Mem

R.Reg.

P.Mem

• IBM 603 Power PC with 8 KB D + 8 KB I cache

• About 15 mm

• 200 MHz

• Also included: memory interface

2

DRAM Memory

• 512 Mbit (64 Mbyte) with 0.18um

• Memory access time at 200 MHz: 2 cycles for row buffer hit 4 cycles for miss

• Organized as 64 banks of 1 MB each (one per P.Array)

• 2.2V operating voltage

• Internal memory bandwidth: 102 Gbytes/s at 200 MHz

Chip Architecture

8Mb

Blo

ck

PA

rray

PA

rray

Mem

ory

Co

ntr

ol B

lock

Mem

ory

Co

ntr

ol B

lock

1MB

Blo

ck

1MB

Blo

ck

PA

rray

PA

rray

Mem

ory

Co

ntr

ol B

lock

512

row

x 4

k c

olu

mn

s

2Mb

Blo

ck

256k

B B

loc

k

256k

B B

loc

k

Mem

ory

Co

ntr

ol B

lock

Mutiplier8kB Instruction Memory (4-port SRAM) Basic Block Basic Block

Basic Block Basic Block Basic Block Basic Block



Broadcasting

Broadcasting

Basic Block(4 PArray,4MB DRAM,

8kB 4-Port SRAM,1 Multiplier)

Pmem

Basic Block

Mu

tiplier

8kB In

structio

n M

emo

ry (4-po

rt SR

AM

)

PArrayPArray

Memory Control Block


8Mb Block

512 row x 4k columns

2Mb Block

256kB Block

256kB Block

PArrayPArray



1MB Block

1MB Block

Language & Compiler

• High-level C-like explicitly parallel language that exposes the architecture

• Compiler that automatically translates it into structured assembly

• Libraries of Intelligent Memory Operations (IMOs) written in assembly

Intelligent Memory Ops

• General-purpose operations such as:

• Arithmetic/logic/symbolic array operations

• Domain-specific operations: e.g. FFT

• Set operations. Iterators over elements of a set

• Regular/irregular structure search and update (CAM operations)

Performance Evaluation

• Hardware performance monitoring embedded in the chip

• Software tools to extract and interpret performance info

Preliminary Results

0

20

40

60

80

100

120R

elat

ive

Exe

cuti

on T

ime

MPEG2 Chroma/Keying

Uniprocessor

1 FlexRAM

4 FlexRAM

Current Status

• Identified and wrote all applications

• Funds needed for: processor core (P.Mem) chip fabrication hardware and software engineers

• Designed architecture based on apps & IBM technology

• Conceived ideas behind language/compiler

• Need to do: chip layout and fabrication development of the compiler

Conclusion

• We have a handle on:

• A promising technology (MLD)

• Key applications of industrial interest

• Real chance to transform the computing landscape

Current Research Work

Josep Torrellas, U of Illinois & IBM Watson Ctr.

September 1998

[email protected]://iacoma.cs.uiuc.edu

Current Research Projects

• 1. Illinois Aggressive COMA (I-ACOMA): Scalable NUMA and COMA architectures

• 2. FlexRAM: Avanced Intelligent Memory

• 3. Speculative Parallelization Hardware

• 4. Database Workload characterization: TPC-C, TPC-D, Data mining

> Project 4 is also in collaboration with Intel Oregon > All projects are in collaboration with IBM Watson

Publications 1997 and 98 1.Architectural Advances in DSMs: A Possible Road Ahead by Josep Torrellas, Ninth SIAM Conference on Parallel Processing for Scientific Computing Spring 1999.

2.A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors by Venkata Krishnan and Josep Torrellas, International Conference on Parallel Architectures and Compilation Techniques (PACT), October 1998.

3.Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor by Venkata Krishnan and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.

4.Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs by David Koufaty and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.

5.Cache-Only Memory Architectures by Fredrik Dahlgren and Josep Torrellas, IEEE Computer Magazine, to appear 1998.

6.Executing Sequential Binaries on a Multithreaded Architecture with Speculation Support by Venkata Krishnan and Josep Torrellas, Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC'98), January 1998.

7.A Clustered Approach to Multithreaded Processors by Venkata Krishnan and Josep Torrellas, International Parallel Processing Symposium, March 1998.

8.Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors by Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.

9.Enhancing Memory Use in Simple Coma: Multiplexed Simple Coma by Sujoy Basu and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.

10.How Processor-Memory Integration Affects the Design of DSMs by Liuxi Yang, Anthony-Trung Nguyen, and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.

11.Efficient Use of Processing Transistors for Larger On-Chip Storage: Multithreading by Venkata Krishnan and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.

12.The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors by Pedro Trancoso, Josep-L. Larriba-Pey, Zheng Zhang, and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.

13.Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA by Zheng Zhang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.

14.Speeding up the Memory Hierarchy in Flat COMA Multiprocessors by Liuxi Yang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.

Documents

Implementing Advanced Intelligent Memory