
HPC Data Analytic Appliance - Fujitsu

Author: Fujitsu



0 Copyright 2016 FUJITSU

Human Centric Innovation in Action

HPC – DA Data Analytic Appliance

When Big Data meets High Performance Computing

Fujitsu Systems Europe


1 © 2016 Fujitsu Technology Solutions

Two different goals

Big Data is about quickly delivering business value from a range of new and emerging data sources:

  • Social media data and location data generated by smartphones and other roaming devices
  • Public information available online and data from sensors embedded in cars, buildings and other objects
  • Structured or unstructured large data sets

Big Data extends traditional analytics, which focuses on analyzing historical data, to include real-time analytics of in-flight, transitory data.

High Performance Computing (HPC) impacts our daily lives in many ways:

  • HPC improves car safety by enabling simulated crash tests
  • HPC increases oil recovery through more accurate seismic modelling of oil reservoirs
  • HPC gives medics and scientists insight into the chemistry of our bodies to help develop new drugs and therapies
  • HPC optimizes the use of renewable energy sources through efficiently designed wind turbines

These are just a few examples.


2 © 2016 Fujitsu Technology Solutions

Big Data workflow

[Diagram: three stages, Collect, Analyze, Action]

Data collected:

  • Structured & unstructured data
  • Devices, sensors, Internet of Things
  • Social media, open & linked data


3 © 2016 Fujitsu Technology Solutions

High Performance Computing

[Diagram: three stages, Collect & Consolidate, Numerical model runs, Post processing]

Data collected and consolidated:

  • Structured & unstructured data
  • Devices, sensors, physical measurement
  • Large database cumulating science knowledge


4 © 2016 Fujitsu Technology Solutions

Big Data and HPC Infrastructure Architecture

[Diagram: data flows from data sources through an analytics or HPC platform to access]

Data states: Various Data, Consolidated Data, Distilled essence, Applied knowledge

Steps: Extract, collect; Cleanse, transform, compute; Analysis, visualization; Decide, act

Data sources: Databases, Application server, Web content, Sensor data

Analytics or HPC platform: Batch Processing, Event Processing, Dialog Processing

Access: Apps, Services, Queries, Visualization, Reporting, Messaging


5 © 2016 Fujitsu Technology Solutions

Merging HPC and Big Data: HPC-DA Appliance


6 © 2016 Fujitsu Technology Solutions

Hadoop and HPC (I)

  • Hadoop runs on commodity hardware, which means infrastructure costs can be lower than for HPC.
  • HPC clusters often use some form of acceleration hardware on the nodes. This requires additional programming in some cases and can provide a substantial speed-up for certain applications.
  • InfiniBand network connectivity, required for high throughput and low latency, is another highlight of HPC. Hadoop clusters typically use 10Gb Ethernet.
  • Hadoop works with HDFS (Hadoop Distributed File System); HPC works with, for example, the Lustre file system or FEFS (Fujitsu).

So how can scientists use Hadoop efficiently if two independent platforms sit side by side?
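The interconnect gap mentioned on this slide can be made concrete with a back-of-the-envelope estimate. The sketch below computes the ideal time to move a data set over a link at its nominal line rate. The 56 Gb/s figure is an assumed InfiniBand FDR 4x rate, not a number taken from these slides, and protocol overhead and latency are ignored, so real transfers will be slower.

```python
def ideal_transfer_seconds(size_bytes: float, link_gbit_per_s: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring overhead and latency."""
    if link_gbit_per_s <= 0:
        raise ValueError("link rate must be positive")
    bits = size_bytes * 8
    return bits / (link_gbit_per_s * 1e9)


ONE_TB = 1e12  # one decimal terabyte, in bytes

# 10Gb Ethernet is the rate named on the slide.
ethernet_s = ideal_transfer_seconds(ONE_TB, 10)
# 56 Gb/s is an ASSUMED InfiniBand rate (FDR 4x), used for illustration only.
infiniband_s = ideal_transfer_seconds(ONE_TB, 56)

print(f"10 Gb/s link: {ethernet_s:.0f} s per TB")
print(f"56 Gb/s link: {infiniband_s:.0f} s per TB")
```

Only the ratio between the two results matters here; the absolute values are upper bounds on achievable speed, not measurements.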


7 © 2016 Fujitsu Technology Solutions

Hadoop and HPC (II)

  • Hadoop runs on commodity hardware, which means it runs on any HPC hardware as well.
  • HPC clusters often use some form of acceleration hardware, which could benefit data analytics algorithms.
  • InfiniBand network connectivity, required for high throughput and low latency, is another highlight of HPC. Over it, Hadoop should transfer data even faster.
  • Hadoop works with HDFS (Hadoop Distributed File System); HPC works with, for example, the Lustre file system or FEFS (Fujitsu). HPC parallel file systems provide HDFS and Hadoop connectors to take advantage of ultra-large, high-throughput storage.


8 © 2016 Fujitsu Technology Solutions

HPC-DA Positioning

Compute
    HPC: Massive compute
    HPC-DA: Both massive and simple compute on a flexible HPC cluster
    Hadoop: Low in transformation and simple statistical queries, high in iterative machine learning algorithms; simple enough to enable spreadsheet programming, e.g. PRIMEFLEX for Hadoop

Programming Languages
    HPC: C, C++ and FORTRAN
    HPC-DA: All de-facto languages available over the HPC cluster
    Hadoop: Programming languages: Java/Scala/Python; data access languages: Pig, Hive (SQL-like) or Datameer (spreadsheet programming)

Data
    HPC: Structured
    HPC-DA: File system configuration adapted to mixed access
    Hadoop: Large and increasing amounts of data from various data sources, even of partly unknown structure and content

Data Storage & Processing
    HPC: Dedicated compute nodes and dedicated storage nodes; data is moved to computing
    HPC-DA: Data available everywhere through the parallel file system
    Hadoop: Compute and storage in one node; computing is moved to data

Parallel Distributed File System
    HPC: Lustre and BeeGFS (open source, POSIX compliant), FEFS (Fujitsu)
    HPC-DA: Lustre and BeeGFS deployed over the compute cluster with HDFS connector
    Hadoop: HDFS (open source)

Network
    HPC: InfiniBand (very high throughput and very low latency network)
    HPC-DA: InfiniBand as unique high-end interconnect
    Hadoop: 10Gb Ethernet

Cluster Resource Management
    HPC: SLURM
    HPC-DA: Single batch system, SLURM with Hadoop integration
    Hadoop: YARN

Data Redundancy
    HPC: RAID5 and above
    HPC-DA: Standard feature of the parallel file system
    Hadoop: Replication by choice of a replication factor
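The two redundancy schemes in the Data Redundancy row differ in how much raw capacity they leave usable. As a rough illustration, the sketch below compares replication with a given factor against a single RAID5 group. The replication factor of 3 matches the HDFS default; the 8-disk RAID5 group is an assumed example size, not a configuration taken from the slides.

```python
def usable_fraction_replication(replication_factor: int) -> float:
    """Fraction of raw capacity holding unique data when every block is stored N times."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return 1 / replication_factor


def usable_fraction_raid5(disks: int) -> float:
    """Fraction of raw capacity usable in one RAID5 group (one disk's worth of parity)."""
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks - 1) / disks


hdfs_default = usable_fraction_replication(3)  # HDFS default replication factor
raid5_group = usable_fraction_raid5(8)         # ASSUMED group size, for illustration

print(f"Replication x3: {hdfs_default:.1%} of raw capacity usable")
print(f"RAID5, 8 disks: {raid5_group:.1%} of raw capacity usable")
```

Capacity is only one side of the trade-off: replication also places copies on different nodes, which is what lets Hadoop move computing to the data.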


9 © 2016 Fujitsu Technology Solutions

HPC-DA Appliance

  • Non-intrusive deployment on an existing HPC platform
      The additional packages are compatible with standard openSSF or Fujitsu HCS cluster management tools. Normal HPC operation is kept untouched.

  • Runs compute-intensive and/or data analytic workloads through the same cluster management tool (SLURM)
      Standard batch jobs run under the control of the SLURM scheduler, with a fair share of the resources.

  • Takes advantage of the high-end performance of the HPC platform to speed up Hadoop
      Both the high-speed interconnect and the parallel file system bring unprecedented performance to boost Hadoop efficiency.
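To illustrate what "the same cluster management tool" could look like in practice, here is a minimal sketch that renders two SLURM batch scripts, one for a simulation and one for an analytics run, from the same template. The `#SBATCH` directives used (`--job-name`, `--nodes`, `--time`) are standard SLURM options; the helper `render_sbatch`, the job names and both commands are hypothetical placeholders, and the slides do not show how the appliance actually launches Hadoop.

```python
def render_sbatch(job_name: str, nodes: int, time_limit: str, command: str) -> str:
    """Return the text of a SLURM batch script that runs one command."""
    if nodes < 1:
        raise ValueError("a job needs at least one node")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Both commands are HYPOTHETICAL placeholders, not part of the appliance.
compute_job = render_sbatch("crash-sim", 4, "01:00:00", "srun ./crash_simulation")
analytics_job = render_sbatch("log-analytics", 4, "01:00:00", "./run_hadoop_job.sh")

print(compute_job)
print(analytics_job)
```

Each script would be handed to `sbatch`, so compute and analytics jobs wait in the same queue and are subject to the same fair-share policy.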


10 © 2016 Fujitsu Technology Solutions

TeraSort Benchmark figures

Data generation is the first phase of the TeraSort benchmark. Under strictly identical hardware conditions, including storage devices, the parallel file system solution shows far better throughput than standard Hadoop/HDFS, dramatically reducing the generation time.

The sort performance evaluation behaved consistently, reflecting the same clear advantage of using an integrated parallel file system. Again on a strictly identical hardware platform, including storage devices, the parallel file system solution shows a significant performance boost compared to standard Hadoop/HDFS.
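For readers unfamiliar with the benchmark, the phases can be sketched in a few lines. The toy version below mimics the TeraSort record layout (100-byte records ordered by their first 10 bytes) entirely in memory on one machine. It illustrates what the generation, sort and validation phases do; it is not the Hadoop implementation, it says nothing about performance, and the function names are made up for this sketch.

```python
import random

RECORD_SIZE = 100  # bytes per record, as in TeraSort
KEY_SIZE = 10      # leading bytes of each record form the sort key


def generate_records(count: int, seed: int = 0) -> list[bytes]:
    """Generation phase: produce `count` pseudo-random fixed-size records."""
    rng = random.Random(seed)
    return [bytes(rng.getrandbits(8) for _ in range(RECORD_SIZE)) for _ in range(count)]


def sort_records(records: list[bytes]) -> list[bytes]:
    """Sort phase: order records by their key only."""
    return sorted(records, key=lambda record: record[:KEY_SIZE])


def is_sorted(records: list[bytes]) -> bool:
    """Validation phase: check that keys never decrease."""
    return all(a[:KEY_SIZE] <= b[:KEY_SIZE] for a, b in zip(records, records[1:]))


records = generate_records(1000)
sorted_records = sort_records(records)
print("sorted correctly:", is_sorted(sorted_records))
```

In the real benchmark each phase reads from and writes to the cluster file system, which is why the choice between HDFS and a parallel file system affects both generation and sort times.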
