Departamento de Informática y Sistemas - Librerías matriciales paralelasdis.um.es/~domingo/apuntes/HICC/0910/libreriasmatriciale... · 2010. 1. 13. · Sistemas lineales, Mínimos

Librerías matriciales paralelas

Antonio M. VidalMurcia 2010

Contenido• Librerías matriciales y arquitecturas• El modelo de paso de mensajes• MPI: una implementación del modelo de paso de

mensajes• La librería Scalapack. Filosofía y entorno

– BLACS– PBLAS– ScaLAPACK

• Otras librerías matriciales• La colección ACTS

¿Por qué usar librerías matriciales?

• Desde los principios de la computación científica se entendió la necesidad de utilizar librerías numéricas y matriciales

• Objetivos:– Legibilidad

– Eficiencia

– Portabilidad

– Fiabilidad: calidad de código

– No reprogramar códigos sencillos: organización por niveles

Librerías matriciales y arquitecturas•Computadores secuenciales: …1980Librerías: Legilibilidad, robustez, portabilidadLINPACK, EISPACK,…

•Operaciones vectorvector: BLAS 1 (19731977)

xj

aj* +xjaj b+xjaj

b•Aprovechamiento de las unidades funcionales pipeline: máquinas vectorialesOperaciones matrizvector: BLAS 2 (1988)

Librerías matriciales y arquitecturas•Optimización del uso de la jerarquía de memoriasUtilización de múltiples procesadores: Operaciones matrizmatriz. Organización por bloques.BLAS 3. (1990)

Memoria masiva (disco)

P: Procesador

C: Cache

Memoria central

P

C

P

C

P

C

•LAPACK: Librería secuencial/paralela (memoria compartida). (1992)Sistemas lineales, Mínimos cuadrados, Valores y vectores propios y singulares, Descomposiciones matriciales

Multiprocesadores y Multicomputadores

P2

Espacio único de direcciones: MEMORIA COMPARTIDA

P3P0 P1

Espacios de direcciones locales: MEMORIA DISTRIBUIDA

P0

M0

P1

M1

P2

M2

P3

M3

Red de Interconexión

Hacia 1990 empiezan a tomar protagonismo los multiprocesadores con memoria distribuida o multicomputadores

El Modelo de Paso de Mensajes (Kruskal, 1990)

• Abstracción del modelo físico de multiprocesador con memoria distribuida

Características• p procesadores RAM convenientemente modificados• Cada uno dispone de su propia memoria local con su propio espacio

de direcciones• No existe memoria común o no se puede usar para intercambiar

información• Cada procesador tiene un índice por el que se le reconoce y se

puede referenciar• Todos los procesadores ejecutan el mismo programa (SPMD)• Los códigos pueden particularizarse en función del índice

identificador del procesador• Los datos están en memoria local al empezar una ejecución y se

devuelven a una memoria local al acabar• Cualquier comunicación entre procesadores se hace a través de

paso de mensajes

El Modelo DRAM (DCM en Kruskal, 1990)

Características • Cada procesador dispone de un buffer donde almacenar un mensaje• Existen primitivas de comunicación:• Send(v,i) : Escribe el mensaje almacenado en la variable v en el

buffer del procesador Pi

• Receive(v) : Almacena el mensaje que haya en el buffer en la variable v y vacía el buffer

• Modelo DCM (Directed connected machine): El coste de enviar un mensaje de tamaño n es de la forma: nτ+β , siendo β el tiempo de latencia y τ el tiempo de envío de una palabra.

• Modelo SCM (Sparse connected machine): El coste de enviar un mensaje de tamaño n entre nodos conectados es de la forma: nτ+β, siendo β el tiempo de latencia y τ el tiempo de envío de una palabra.

• Pueden existir mecanismos de sincronización• El coste de ejecutar un algoritmo es el número de ciclos de reloj que

requiere su ejecución más el tiempo de comunicación

P0 P1 Pp2 Pp1

Reloj

Red de interconexión

Memoria local

Programa(0)

Datos(0)

Memoria localDatos(1)

…

Programa(1)

Datos(p2)

Memoria local

Programa(p2)

Datos(p1)Memoria local

Programa(p1)

P0 P1 P3P2Tiempo

tb=m/p tb=m/ptb=m/ptb=m/p

A(m/p:2m/p1,:) (b)

A(2m/p:3m/p1,:) (b)

A(3m/p:m1,:) (b)

Calcula x= A(0:m/p1,:) b

(A) (b)

(A) (b)

(A) (b)

Calcula x= Ab

Calcula x= Ab

Calcula x= Ab

x xx

Copiar a x

(m,n)(m,n)

(m,n)(m,n)

P0 P1 P2 P3


Programa (3)

n

Programa (2)Programa (1)Programa (0)

A bn

m

A bnA b nA b

A,b,m,nA,b,m,n

A,b,m,n

m m m

P0 P1 P2 P3


Programa (3)

n

Programa (2)Programa (1)Programa (0)

A bxn

m

A bxnA bx nA bx

m m m

P0 P1 P2 P3


Programa(3)

n

Programa(2)Programa(1)Programa(0)

A bxnA bx

m m mm

nA bx nA bx

x xx

Entornos de paso de mensajes

•Aparecen los entornos de paso de mensajes que facilitan la programación de estas máquinas

– PVM (Parallel Virtual Machine): 19891990– MPI (Message Passing Interface): 1994

•El entorno MPI se ha convertido en el estándar de facto del modelo de paso de mensajes•Ofrece las primitivas de comunicación necesarias para implementar las comunicaciones•Permiten el desarrollo software de herramientas eficientes de computación: Librerías Matriciales

Modelo de Programación en MPIModelo de Programación en MPI

prog_MPIprog_MPI prog_MPIprog_MPI prog_MPIprog_MPI prog_MPIprog_MPI

prog_MPIprog_MPI

0 1 2 3


prog_MPIprog_MPI

0 1 2 3


prog_MPIprog_MPI

0 1 2 3

17

Procesos y Comunicadores (I)

• En MPI, los procesos involucrados en la ejecución de un programa paralelo se identifican por una secuencia de enteros, que denominamos rangos.

• Si tenemos p procesos ejecutandose en un programa, estos tendrán rangos 0,1, ..., p1.

• Todas las operaciones de comunicación en MPI se realizan dentro de un espacio o ámbito, llamado comunicador.

• Un comunicador contiene un conjunto de procesos pertenecientes a una aplicación MPI.

• Cada proceso posee un rango dentro de cada comunicador.

18

INTRODUCCIÓNINTRODUCCIÓNProcesos y Comunicadores (II)Procesos y Comunicadores (II)

• Este rango permite identificar a todos los procesos que forman parte de un comunicador.

• El uso de diferentes comunicadores permitirá exclusivizar los mensajes enviados entre diferentes partes de un programa de forma más cómoda que con el uso de etiquetas.

19

INTRODUCCIÓNProcesos y Comunicadores (y III)

MPI_COMM_WORLD

0

12

34

5

6

78

COMM_1

0

12

0

COMM_3

21

3

COMM_2

1

02

01

2

COMM_4

Comunicación entre Comunicación entre ProcesosProcesos

Introducción a la Programación en MPIIntroducción a la Programación en MPI

Comunicación entre ProcesosComunicación entre ProcesosConstrucción de un mensaje (I)Construcción de un mensaje (I)

Los mensajes en MPI se definen mediante tres Parámetros. message: puntero al inicio del mensaje. count: número de elementos que posee el mensaje. type: tipo de datos de los elementos del mensaje.

message

12

count

3......

...

...count-1

message+count*sizeof(datatype)datatype

12

count

3......

...

...count-1

message+count*sizeof(datatype)

Memoria

datatype

22

Para hacer el envío de un mensaje en MPI es necesaria la siguiente información:

El rango del receptor.El rango del emisor.

int MPI_Send(void* message, int count, MPI_Datatype datatype,

int dest, int tag, MPI_Comm comm)

int MPI_Recv(void* message, int count, MPI_Datatype datatype,

int source, int tag, MPI_Comm comm, MPI_Status* status)

Las constantes MPI_ANY_SOURCE y MPI_ANY_TAG se utilizan para especificar cualquier fuente y cualquier etiqueta respectivamente.

El último argumento de MPI_Recv es una estructura de dos elementos que contiene el rango del emisor y la etiqueta del mensaje.

Una etiqueta.El comunicador.

Comunicación entre ProcesosOperaciones de Comunicación punto a punto (i)

MPI_Bcast

comm message

rank=1

message

rank=2

message

rank=3

message

rank=0Root=0Root=0

1...

count

1...

count

1...

count

1...

count

MPI_Reduce

comm rank=1

rank=2

rank=3

operand

rank=0Root=0Root=0

*result

*result

*result

opresult

operand

operand

operand

1...

count

1...

count

1...

count

1...

count

1...

count

MPI_Gather

commrank=1

rank=2

rank=3

send_buf

rank=0

root=0root=0

*recv_buf

*recv_buf

*recv_buf

send_buf

send_buf

send_bufrecv_buf

0

1

2

3

1...

send_count

1...

send_count

1...

send_count

1....

4*recv_count

1. . .

send_count

MPI_Scatter

commrank=1

rank=2

rank=3

recv_buf

rank=0

root=0root=0

*send_buf

*send_buf

*send_buf

recv_buf

recv_buf

recv_bufsend_buf

0

1

2

3

1...

recv_count

1...

recv_count

1...

recv_count

1....

4*send_count

1. . .

recv_count

Referencias

• MPIForum: http://www.mpiforum.org. Disponibles de manera electrónica todos los estándares ( MPI 1.1, MPI2)

• A User’s Guide to MPI. Peter S. Pacheco. Lectura más amena que el estándar, acompañado de ejemplos prácticos.

• An Introduction to Message Passing and Parallel Programming with MPI. Dr. Gerhard Wellein, Georg Hager

• A Survey of MPI Implementations. William Shapir. Historia de MPI, información sobre implementaciones existentes, herramientas, problemas, ejercicios prácticos,…

• Introducción a la Programación Paralela. F. Almeida et al. Dedica un capítulo al entorno MPI.

http://www.mpi-forum.org/




Librerías para el modelo de paso de mensajes La librería ScaLAPACK. (1995)

• Librería de rutinas de algebra lineal numérica, para ordenadores de memoria distribuida con paso de mensajes.

• Capacidades similares al LAPACK: Sistemas lineales, Mínimos cuadrados, Valores y vectores propios y singulares, Descomposiciones matriciales

• Objetivos:• Eficiencia• Escalabilidad• Fiabilidad• Portabilidad• Flexibilidad• Facilidad de uso

La librería Scalapack. Filosofía y entorno.Se trata de emplear la misma filosofía de diseño que en el LAPACK:

Núcleos computacionales básicos

Operaciones vectorvectorOperaciones matrizvectorOperaciones matrizmatriz

Rutinas de alto nivelResolución de sistemasCálculo de valores propiosCálculo de valores singularesDescomposiciones matriciales

Problema: Hay que diseñar algoritmos distribuidos para los núcleos computacionales básicos, por ello es necesario un nivel más dedicado a las comunicaciones específicas que se utilizan en estos programas

BLACS PBLAS ScaLAPACKBasic Linear Algebra Communication Subprograms

Scalable LAPACK Parallel Basic Linear Algebra Subprograms

Scalapack

Global

Local

Scalapack

Pblas

Lapack

Blas

Blacs

Entorno de pasode mensajes(PVM, MPI, ...)

Distribución cíclica 2D: Distribución ScaLAPACK

A00 A01 A02 A04A03

A10 A11 A12 A14A13

A20 A21 A22 A24A23

A30 A31 A32 A34A33

00 01 02

10 1211

00 01

10 11

00 01 02

10 1211

00 01

10 11

0

0

21

1

A00 A01 A02A04A03

A10 A11 A12A14A13

A20 A21 A22A24A23

A30 A31 A32A34A33

mxnA ℜ∈

A00 A01 A02 A04A03

A10 A11 A12 A14A13

A20 A21 A22 A24A23

A30 A31 A32 A34A33

tbm

tbn

Particionado

P0 P1 P2 P3 P4 P5

P00 P01 P02

P10 P11 P12

BLACS

Basic Linear Algebra Communications Subprograms

BLACS•Librería de comunicaciones para aplicaciones de algebra líneal,a ejecutarse en multiprocesadores con paso de mensajes.

•Puede funcionar indistintamente sobre PVM, MPI u otras librerías de paso de mensajes.

•Fácil de programar, portable (Clusters de PSs/Workstations, Thinking Machine CM5, IBM SP series, Cray T3X, Intel IPSC2, IPSC/860, Delta, Paragon, ...

BLACS: Malla de Procesos

Dados N procesos (0, ..., N1), generados con MPI o PVM, se distribuyen en una malla bidimensional:

0 1 2 3

0

1

0 1 2 3

7654

Los procesos se referenciarán por sus coordenadas en la malla.

BLACS: contextos

•Contexto en BLACS ≈ comunicador en MPI; habitualmente, malla equivaldrá a proceso

•Como en MPI, es un mecanismo para diseñar software “seguro”.

•Permiten1) Crear grupos de procesos2) Crear mallas solapadas y/o disjuntas.3) Aislar mallas para que no interfieran con otras.

BLACS: Ejemplo (1)// Obtener número de procesos //CALL BLACS_PINFO(iam, nprocs)

// Si estamos en pvm, crear máquina virtual //IF (nprocs.LT.1) THEN nprocs=... CALL BLACS_SETUP(iam, nprocs)ENDIF//Establecer dimensiones de malla de procesos//nrows=...ncols=...//Obtener contexto por defecto y crear la malla//CALL BLACS_GET(1,0,ctxt)CALL BLACS_GRIDINIT(ctxt, ‘Row_major’,nrows, ncols)

BLACS: EjemploCALL BLACS_GRIDINFO(ctxt,nrows,ncols,myrow, mycol)

// Si el proceso no está en la malla, terminar //IF (myrow.eq.1) GOTO 10...

//realizar el trabajo //...CALL BLACS_GRIDEXIT(ctxt)

10CALL BLACS_EXIT(0)

BLACS: operaciones en ámbitos.

Operaciones en las que participa un gupo de procesos determinados.

Dada una malla de procesos bidimensionales, los ámbitos naturales son:filacolumna malla

BLACS: subrutinas de comunicación punto a punto (1)

Envío / Recepción _xxSD2D(CTXT, [UPLO, DIAG], M, N, A, LDA, RDEST, CDEST) _xxRV2D(CTXT, [UPLO, DIAG], M, N, A, LDA, RSRC, CSRC)

xx (Tipo de matrices)xx (Tipo de matrices)

GE Rectangulares generalesGE Rectangulares generalesTR TrapezoidalesTR Trapezoidales

_ (Tipo de datos)_ (Tipo de datos)

I EnterosI EnterosS,D Reales SP/DPS,D Reales SP/DPC,Z Complejos SP/DPC,Z Complejos SP/DP

➙Envío de una (sub)matriz de un proceso a otro➙Envíos no bloqueantes➙Recepciones bloqueantes.

BLACS: subrutinas de comunicación punto a punto (2)

Ejemplo:CALL BLACS_GRIDINFO(ctxt,nrows,ncols,myrow, mycol)

if (myrow.eq.0).and.(mycol.eq.0) then call dgesd2d(ctxt, 5, 1, X, 5, 1, 0) call dgerv2d(ctxt, 5, 1, X, 5, 1, 0)else if (myrow.eq.1).and.(mycol.eq.0) then call dgesd2d(ctxt, 5, 1, X, 5, 0, 0) call dgerv2d(ctxt, 5, 1, X, 5, 0, 0)end if

BLACS: subrutinas de difusión (broadcast) (1)

Envío / Recepción _xxBS2D(CTXT, SCOPE,TOP,[UPLO, DIAG], M, N, A, LDA) _xxBR2D(CTXT,SCOPE,TOP,[UPLO, DIAG], M, N, A, LDA)

‘’ ‘’ por defectopor defecto‘‘Increasing Ring’Increasing Ring’‘‘l-tree’l-tree’

SCOPE (ámbito)SCOPE (ámbito)

‘‘Row’Row’‘‘Column’Column’‘‘All’All’

➙xx (Tipo de matrices)➙Envío de una (sub)matriz a un grupo de procesos➙Operación globalmente bloqueante.

TOP (topología)TOP (topología)

BLACS: subrutinas de difusión (broadcast) (2)

Ejemplo:CALL BLACS_GRIDINFO(ctxt,nrows,ncols,myrow, mycol)

if (mycol.eq.2) then if (myrow.eq.0) then call dgebs2d(ctxt, ‘COLUMN’, ‘ ‘, 5, 7, B(9,4), 500)else if (myrow.eq.1) call dgebr2d(ctxt, ‘COLUMN’, ‘ ‘, 5, 7, WORK, 5, 0, 2)else call dgebr2d(ctxt, ‘COLUMN’, ‘ ‘, 5, 7, B(9,4), 500,0,2)end if

BLACS: subrutinas de reducción

Suma/Máximo/Mínimo_GSUM2D(CTXT,SCOPE,TOP,M,N,A,LDA, RDEST,CDEST)_GMAX2D(CTXT,SCOPE,TOP,M,N,A,LDA,RA,CA,RCFLAG,RDEST,CDEST)_GMIN2D(CTXT,SCOPE,TOP,M,N,A,LDA,RA,CA,RCFLAG,RDEST,CDEST)

BLACS: subrutinas de soporte(I)

➨Inicialización de la mallaBLACS_PINFO(IAM, NPROCS)BLACS_SETUP(IAM, NPROCS)BLACS_GRIDINIT(CTXT, ORDER, NROW, NCOL)BLACS_GRIDMAP(CTXT, USERMAP, LDUMAP, NROW, NCOL)

➨Destrucción de la mallaBLACS_ GRIDEXIT(CTXT)BLACS_ EXIT(CONTINUE)BLACS_ ABORT(CTXT, ENUM)

BLACS: subrutinas de soporte (II)

➨Información

BLACS_GRIDINFO(CTXT, NROW, NCOL, ROW, COL2)BLACS_PNUM(CTXT, ROW, COL)BLACS_PCOORD(CTXT, PNUM, ROW, COL)BLACS_GET(CTXT, WHAT, VAL)BLACS_BARRIER(CTXT, SCOPE)

PBLAS

Parallel Basic Linear Algebra Subprograms

PBLAS Introducción

• Librería de rutinas que implementan versiones paralelas de las rutinas del BLAS para memoria distribuida con paso de mensajes

•Objetivos:•Portabilidad•Altas prestaciones y escalabilidad•Facilidad de uso

PBLAS Operaciones

•Nivel 1: operaciones VectorVector y ¬ a x + y

+

•Nivel 2: operaciones MatrizVector y ¬ a A x + b y

+

•Nivel 2: operaciones MatrizMatriz C ¬ a A B + b C

+

PBLAS Nomenclatura de las rutinas

yy (Tipo de matrices)yy (Tipo de matrices)

GE Rectangulares generalesGE Rectangulares generalesHE HermitianasHE HermitianasSY SimétricasSY SimétricasTR Triangulares TR Triangulares (otros en BLAS 2)(otros en BLAS 2)

x (Tipo de datos)x (Tipo de datos)

S RealesS RealesD Doble PrecisiónD Doble PrecisiónC ComplejosC ComplejosZ Complejos doblesZ Complejos dobles

zz (Operación)zz (Operación)

MV: producto matriz por vectorMV: producto matriz por vectorMM: producto matriz por matrizMM: producto matriz por matrizSV, SM: resolución de sistemas triangularesSV, SM: resolución de sistemas triangularesR,R2,RK,R2K: actualizaciones de rango 1, rango 2, ...R,R2,RK,R2K: actualizaciones de rango 1, rango 2, ...

La rutina de transposición de matrices es P_TRAN_

PBLAS Distribución de datos (1)

Distribución bidimensional cíclica por bloques

PBLAS Distribución de datos(2)

Propiedades:

• Engloba muchas otras distribuciones.• Permite obtener carga balanceada• Permite obtener altas prestaciones y escalabilidad

Es responsabilidad del programador distribuir las matrices entre todos los procesos antes de llamar a las subrutinas de PBLAS que operan con ellas.

Cada proceso tiene sus propios bloques que se almacenan en memoria local en columnas.

PBLAS Distribución de datos(3)

Descriptor de array:Todos los parámetros que describen una matriz distribuida se encapsulan en un descriptor de matriz (array de enteros)

PBLAS Argumentos de las subrutinas

Las llamadas son prácticamente igual a las de BLAS:

LLAMADA BLAS:CALL DGEMM(‘No transpose’, ‘No transpose’, M,K,N,ONE,A(IA,JA), LDA, LDA, B(IB,JB), LDB, ZERO, C(IC,JC), LDC)

LLAMADA PBLAS:CALL PDGEMV(‘No transpose’, ‘No transpose’, M,K,N, ONE, A, IA, JA, DESCA, B, IB, JB, DESCB, ZERO, C, IC JC, DESCC)

PBLAS Eficiencia y Escalabilidad

Es posible usar técnicas de solapado de computación y comunicación.

Tambien hay subrutinas que asignana una topología a un contexto:

PTOPSET(CTXT, OP, SCOPE, TOP)PTOPGET(CTXT, OP, SCOPE, TOP)

Ej: CALL PTOPSET(CTXT, ‘Broadcast’, ‘Row, ‘Sring’)

PBLAS Conclusiones

•PBLAS pretende facilitar la paralelización de código secuencial desarrollado con BLAS.

•Para facilitar su uso, se ha conservado el interfaz prácticamente igual

•Se utiliza BLACS para las comunicaciones (sobre MPI o PVM) habitualmente.

•La distribución de carga bidimensional cíclica por bloques asegura buen balance de carga y garantiza altas prestaciones y escalabilidad.

Scalable Linear Algebra Package

SCALAPACK

SCALAPACKNomenclatura de las rutinas

Básicamente igual al LAPACK formato Pxyyzz

yy (Tipo de matrices)yy (Tipo de matrices)

GE Rectangulares generalesGE Rectangulares generalesHE HermitianasHE HermitianasSY SimétricasSY SimétricasTR Triangulares ...TR Triangulares ...

x (Tipo de datos)x (Tipo de datos)

S RealesS RealesD Doble PrecisiónD Doble PrecisiónC ComplejosC ComplejosZ Complejos doblesZ Complejos dobles

zz (Operación)zz (Operación)

Clasificación:Subrutinas driverSubrutinas computacionalesSubrutinas auxiliares

SCALAPACKSubrutinas Driver

• Rutinas driver

✜ Resuelven un problema completo

✜ Sistemas múltiples de ecuaciones lineales: AX=B: PDGESV✜ Problemas de mínimos cuadrados: minx||Axb||2: PDGELS✜ Cálculo de valores propios: Ax=λx: PDSYEV✜ Descomposición en valores singulares: A=USVT: PDSYEV

Rutinas computacionales

Resuelven diferentes tareas computacionales

Algunas rutinas dependen de la salida de otras

Factorizaciones LU, Cholesky, LDLT, QR, ...

Estimar nº de condición

Cotas de error, ...

Rutinas auxiliares

•Subtareas de algoritmos orientados a bloques, cálculos de bajo nivel, ...

SCALAPACKSubrutinas Computacionales y Auxiliares

SCALAPACKEtapas para llamar a una rutina

1) Inicializar la malla de procesos (BLACS)

2) Distribuir la matriz en la malla2.1) Inicializar el descriptor de array

DESCINIT(DESCA,M,N,MB,NB,RSRC,CSRC,CTCX,LLD,INFO)2.2) Distribuir a cada proceso sus bloques locales

3) Llamar a la rutina

4) Liberar la malla de procesos (BLACS)

SCALAPACKPrestaciones

Factores que pueden afectar a las prestaciones:

• Topología• Tamaño de bloque• Dimensiones de la malla de procesos

Para un problema determinado es necesario estudiar los valores óptimos de estos parámetros

SCALAPACKConclusiones

Scalapack es la versión de LAPACK para ordenadores con memoria distribuida.

Librerías previas: (MPI o PVM), BLACS, PBLAS, BLAS, BLACS, LAPACK.

Su eficiencia depende de las implementaciones de BLACS y BLASEstado actual: Existen implementaciones buenas pero su utilización se hace complicada. No se han desarrollado todas las rutinas que tiene el LAPACK.

Información sobre ScaLAPACKhttp://www.netlib.org/scalapack/slug/index.html

http://www.netlib.org/scalapack/slug/index.html

Otras librerías matriciales

• PLAPACKhttp://www.cs.utexas.edu/users/plapack/Guide/

• FLAMEhttp://www.cs.utexas.edu/users/flame/

Robert A. van de Geijn

The University of Texas at Austin

PLAPACKTranslating the sequential algorithms to a parallel code requires careful manipulation of indices and parameters describing the data, its distribution to processors, and/or the communication required. It is this manipulation of indices that is highly errorprone, leading to bugs in parallel code. The Parallel Linear Algebra Package (PLAPACK) infrastructure attempts to overcome this complexity by providing a coding interface that mirrors the natural description of sequential dense linear algebra algorithms. To achieve this, we have adopted an ``object based'' approach to programming. This object based approach has already been popularized for high performance parallel computing by libraries like the Toolbox being developed at Mississippi State University [], the PETSc library at Argonne National Laboratory , and the MessagePassing Interface.

PLAPACKIn order to parallelize a linear algebra algorithm, we must start by partitioning and distributing the vectors and matrices involved. Traditionally, parallel dense linear algebra algorithms and libraries start by partitioning and distributing the matrix, with the distribution of vectors being an afterthought. This seems to make sense…While this appears to be convenient for the library, this approach creates an inherent conflict between the needs of the application and the library. It is the vectors in linear systems that naturally dictate the partitioning and distribution of work associated with (most) applications that lead to linear systems. Notice that in a typical application, the linear system is created to compute values for degrees of freedom, which have some spatial significance. In finite element or boundary element methods, we solve for force, stress, or displacement at points in space. For the application, it is thus more natural to partition the domain of interest into sub domains, like domain decomposition methods do, and assign those sub domains to nodes (processors). This is equivalent to partitioning the vectors and assigning the subvectors to nodes.

PLAPACK

The PLAPACK infrastructure uses a data distribution that starts by partitioning the vectors associated with the linear algebra problem and assigning the subvectors to nodes. The matrix distribution is then induced by the distribution of these vectors. This approach was chosen in an attempt to create more reasonable interfaces between applications and libraries. However, the surprising discovery has been that this approach greatly simplifies the implementation of the infrastructure, allowing much more generality (in future extensions of the infrastructure) while simultaneously reducing the amount of code required when compared to previous generation parallel dense linear algebra libraries

PLAPACK

PLAPACK

PLAPACK

PLAPACK

PLAPACK

PLAPACK•Conclusiones•Ventajas:

•Distribución de datos más próxima al problemas físico•Orientación a objetos (aunque no use un lenguaje orientado a objetos)•Objetos de álgebra líneal (views)•Un interfaz para rellenar matrices y vectores

•Inconvenientes:•Ciertas operaciones sencillas no previstas son complicadas•Su uso no está tan extendido como el ScaLAPACK

The objective of the FLAME project is to transform the development of dense linear algebra libraries from an art reserved for experts to a science that can be understood by novice and expert alike. Rather than being only a library, the project encompasses a new notation for expressing algorithms, a methodology for systematic derivation of algorithms, Application Program Interfaces (APIs) for representing the algorithms in code, and tools for mechanical derivation, implementation and analysis of algorithms and implementations.

FLAME (Formal Linear Algebra Methods Environment)

FLAME. The Science of Programming Matrix Computations. Robert A. van de Geijn and Enrique S. QuintanaOrtí. www.lulu.com. 2008.

FLAME• The FLAME project promotes the systematic

derivation of loopbased algorithms handinhand with the proof of their correctness. Key is the ability to identify the loopinvariant: the state to be maintained before and after each loop iteration, which then prescribes the loopguard, the initialization before the loop, how to progress through the operand(s), and the updates. To derive the shown algorithm for LU factorization one fills in the below "worksheet". In the greyshaded areas predicates appear that ensure the correctness of the algorithm. .

FLAME

La colección ACTS http://acts.nersc.gov

• The Advanced CompuTational Software (ACTS) Collection is a set of software tools for computational sciences. The starting point for the collection was the former Advanced Computational Testing and Simulation Toolkit Project. The purpose of the ACTS Collection is to accelerate the adoption and use of advanced computing by the Department of Energy programs for their missioncritical problems. While DOE has been motivated to develop the tools for its own programs, it also encourages their adoption and use by nonDOE computational efforts. The ACTS Collection Project disseminates information about the tools, both inside and outside the DOE community. In addition, it focuses on the installation, maintenance and support of the tools on DOE's computing facilities.

La colección ACTS

La colección ACTS

La colección ACTS

La colección ACTS

La colección ACTS

La colección ACTS http://acts.nersc.gov/tools.html

http://acts.nersc.gov/tools.html












Librerías: Aztechttp://www.cs.sandia.gov/CRF/aztec1.html

A Massively Parallel Iterative solver Library for

Solving Sparse Linear SystemsAztec is a library that provides algorithms for the

iterative solution of large sparse linear systems arising in scientific and engineering applications. It is a standalone package comprising a set of iterative solvers, preconditioners and matrixvector multiplication routines. Users are not required to provide their own matrixvector multiplication routines or preconditioners in order to solve a linear system.

Librerías: hyprehttps://computation.llnl.gov/casc/linear_solvers/sls_hypre.html

Scalable Linear Solvers• Hypre is a library for solving large, sparse linear

systems of equations on massively parallel computers.

• Scalable preconditioners (very large sparse linear systems, structured multigrid and elementbased algebraic multigrid).

• Implementation of a suit of common iterative methods (Krylovbased iterative methods, Conjugate Gradient and GMRES).

• Intuitive gridcentric interfaces (sparse matrices through interfaces, stencilbased structured/semistructured interfaces, finiteelement based unstructured interface, and a linear algebra based interface).

• Configuration Options (can be installed in several computer platforms. Include compilers, versions of MPI and BLAS,…).

• Dynamic configuration of parameters (for users with different levels of expertise).

• User defined interfaces for multiple languages (supports Fortran and C languages).

Librerías: OPT++http://csmr.ca.sandia.gov/projects/opt++

The user specifies the function f (and, when available, its first and second analytical derivatives) and the functions h and g. OPT++ provides many different solution algorithms including:

– various Newton methods;– a parallel Newton method;– a nonlinear conjugate gradient method– a parallel direct search method– a nonlinear interior point method.

Objectoriented nonlinear optimization package. It solves optimization problems of the form

Librerías:PETSc http://www.mcs.anl.gov/petsc/petscas/

• Portable, Extensible Toolkit for Scientific computation, provides sets of tools for the parallel (as well as serial), numerical solution of PDEs that require solving largescale, sparse nonlinear systems of equations.

Librerías: SuperLUhttp://crd.lbl.gov/~xiaoye/SuperLU/

•General purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations on highend computers. The library is written in C and is callable from either C or Fortran. •SuperLU contains a collection of three related subroutine libraries: sequential SuperLU for uniprocessors, the multithreaded version (SuperLU_MT) for mediumsize SMPs, and the MPI version (SuperLU_DIST) for large distributed memory machines. All these implementations are portable across many different platforms.•SuperLU is intended for use in largescale applications that require the solution of large sparse systems of linear equations. It is especially targeted for nonsymmetric, nondefinite systems.

Librerías:taohttp://www.mcs.anl.gov/research/projects/tao/

• The Toolkit for Advanced Optimization (TAO) focuses on largescale optimization software, including nonlinear least squares, unconstrained minimization, bound constrained optimization, and general nonlinear optimization.

• TAO offers an objectoriented based solution that provides a flexible optimization toolkit capable of addressing issues of portability, versatility

and scalability in many computational environments.

Librerías:Global Arrays http://www.emsl.pnl.gov/docs/global/ga.html

• The Global Arrays (GA) toolkit is a library for writing parallel programs that use large arrays distributed across processing nodes. The library has Fortran, C and C++ interfaces. There is also a python wrapper available to GA. Originally developed to support arrays as vectors and matrices (one or two dimensions), it now supports up to seven dimensions in Fortran and even more in C. GA offers two types of operations: collective operations and local operations.

• Collective operations require participation (and synchronization) of all processes. They include array creation, copying, and destruction. They also include dataparallel operations on arrays such as vector dot product and matrix multiplication.

• Local operations may be invoked independently by all processes. They include local access to array elements, and fetching and storing data from and to remote locations. (Contrary to MPI, GA does not require cooperation between sender and receiver to transfer data.)

Librerías: ATLAS http://mathatlas.sourceforge.net/

• ATLAS (Automatically Tuned Linear Algebra Software) is a tool for the automatic generation of optimized numerical software for modern computer architectures and compilers.

• This tool has initially focused on level three BLAS operations (matrixmatrix multiplications) and also a few routines from LAPACK that have high potential for optimization. Traditionally, the optimization of these routines has been a tedious, architecture dependent, hand coding process. Codes automatically generated by ATLAS have been able to meet and even exceed the performance of the vendor supplied, handoptimized BLAS, on a range of platforms.

• Other projects providing for automatic tuning are:– FFTW, for the Discrete Fourier Transform– UHFFT, for the Fast Fourier Transform– SPIRAL, for signal processing– Sparsity, for sparse matrixvector multiplication– PHiPAC, for fast matrixmatrix operations

•El uso de librerías da altas prestaciones representa una avance importante en el campo de la computación científica y en la resolución de problemas complejos de ingeniería. Su utilización se irá incrementando en años sucesivos.•Aportan ventajas indiscutibles:

– Legibilidad, eficiencia, robustez, fiabilidad, portabilidad, evitar reprogramar código,… •Algunas de las actuales han alcanzado su ciclo de vida y deben ser reestructuradas

– Scalapack Nuevo Lapack para memoria distribuida?•Los nuevos diseños deberían cubrir ciertas expectativas:

– Nuevos algoritmos, adaptación a nuevas arquitecturas, independencia de las arquitecturas, autooptimización, …

•Ya se están realizando esfuerzos en esta dirección (Flame, Atlas, ACTS,…)•Debería vencerse la principal dificultad para su plena incorporación al mundo científico y de la ingeniería: Dificultad de programación para usuarios no expertos

Librerías matriciales paralelasConclusiones

Documents

Departamento de Informática y Sistemas - Librerías matriciales paralelasdis.um.es/~domingo/apuntes/HICC/0910/libreriasmatriciale... · 2010. 1. 13. · Sistemas lineales, Mínimos