Copyright 2016 FUJITSU
Human Centric Innovation
in Action
HPC-DA: Data Analytic Appliance
When Big Data meets High Performance Computing
Fujitsu Systems Europe
© 2016 Fujitsu Technology Solutions
Big Data is about quickly delivering business value from a range of new and emerging data sources:
Social media data, and location data generated by smartphones and other roaming devices
Public information available online, and data from sensors embedded in cars, buildings and other objects
Structured or unstructured large data sets
Big Data extends traditional analytics, which focuses on analyzing historical data, to include real-time analytics of in-flight, transitory data.
Two different goals
High Performance Computing (HPC) impacts our daily lives in many ways:
HPC improves car safety by enabling simulated crash tests
HPC increases oil recovery through more accurate seismic modelling of oil reservoirs
HPC gives medics and scientists insight into the chemistry of our bodies to help develop new drugs and therapies
HPC optimizes the use of renewable energy sources through efficiently designed wind turbines
These are just a few examples.
Big Data workflow
Stages: Collect, Analyze, Action
Structured & unstructured data
Devices, sensors, Internet of Things
Social media, open & linked data
High Performance Computing
Stages: Collect & Consolidate, Numerical model runs, Post processing
Structured & unstructured data
Devices, sensors, physical measurement
Large database accumulating scientific knowledge
Big Data and HPC Infrastructure Architecture
Data stages: Various Data, Consolidated Data, Distilled essence, Applied knowledge
Steps: Extract, collect; Cleanse, transform, compute; Analysis, visualization; Decide, act
Layers: Data sources; Analytics or HPC platform; Access
Data sources: Databases, Application server, Web content, Sensor data
Analytics or HPC platform: Batch Processing, Event Processing, Dialog Processing
Access: Apps, Services, Queries, Visualization, Reporting, Messaging
Merging HPC and Big Data: HPC-DA Appliance
Hadoop and HPC (I)
Hadoop runs on commodity hardware, which means infrastructure costs can be lower than for HPC.
HPC clusters often use some form of acceleration hardware on the nodes, which requires additional programming in some cases and can provide substantial speed-up for certain applications.
InfiniBand network connectivity, which HPC requires for high throughput and low latency, is another highlight of HPC. Hadoop clusters typically use 10Gb Ethernet.
Hadoop works with HDFS (Hadoop Distributed File System); HPC works with, e.g., the Lustre file system or FEFS (Fujitsu).
So how can scientists use Hadoop efficiently if two independent platforms sit side by side?
Hadoop and HPC (II)
Hadoop runs on commodity hardware,
which means it runs on any HPC hardware as well.
HPC clusters often use some form of acceleration hardware,
which could benefit data analytics algorithms.
InfiniBand network connectivity, required for high throughput and low latency, is another highlight of HPC.
Hadoop should transfer data even faster.
Hadoop works with HDFS (Hadoop Distributed File System); HPC works with, e.g., the Lustre file system or FEFS (Fujitsu).
HPC parallel file systems provide HDFS and Hadoop connectors to take advantage of ultra-large, high-throughput storage.
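As a rough illustration of pointing Hadoop at storage other than HDFS, the sketch below renders a minimal `core-site.xml`. It is not the Lustre, BeeGFS or FEFS connector mentioned on the slide, whose settings are not given here. It only assumes the generic approach of letting Hadoop use a POSIX path that every node mounts at the same location, via the standard `fs.defaultFS` property. The helper name `core_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def core_site_xml(properties):
    """Render a Hadoop-style <configuration> document from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# fs.defaultFS is a standard Hadoop property. The file:/// scheme selects
# the local (POSIX) file system rather than an hdfs:// namenode, so job
# paths can refer to a shared parallel file system mount (assumption:
# the mount is visible at the same path on every node).
xml_text = core_site_xml({"fs.defaultFS": "file:///"})
print(xml_text)
```

A vendor connector would normally replace this with its own scheme and implementation class; those values should come from the connector's documentation.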
HPC-DA Positioning
Comparison of HPC, HPC-DA and Hadoop:

Compute
HPC: massive compute
HPC-DA: both massive and simple compute on a flexible HPC cluster
Hadoop: low in transformation and simple statistical queries, high in iterative machine learning algorithms; simple enough to enable spreadsheet programming, e.g. with PRIMEFLEX for Hadoop

Programming Languages
HPC: C, C++ and Fortran
HPC-DA: all de facto languages available over the HPC cluster
Hadoop: programming languages: Java/Scala/Python; data access languages: Pig, Hive (SQL-like) or Datameer (spreadsheet programming)

Data
HPC: structured
HPC-DA: file system configuration adapted to mixed access
Hadoop: large and increasing amounts of data from various data sources, even of partly unknown structure and content

Data Storage & Processing
HPC: dedicated compute nodes and dedicated storage nodes; data is moved to computing
HPC-DA: data available everywhere through the parallel file system
Hadoop: compute and storage in one node; computing is moved to data

Parallel Distributed File System
HPC: Lustre and BeeGFS (open source, POSIX compliant), FEFS (Fujitsu)
HPC-DA: Lustre and BeeGFS deployed over the compute cluster with HDFS connector
Hadoop: HDFS (open source)

Network
HPC: InfiniBand (very high throughput and very low latency network)
HPC-DA: InfiniBand as the single high-end interconnect
Hadoop: 10Gb Ethernet

Cluster Resource Management
HPC: SLURM
HPC-DA: single batch system, SLURM with Hadoop integration
Hadoop: YARN

Data Redundancy
HPC: RAID5 and above
HPC-DA: standard feature of the parallel file system
Hadoop: replication by choice of a replication factor
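The two redundancy schemes in the Data Redundancy comparison have very different capacity costs. The short calculation below makes that concrete. The 120 TB raw capacity and the 6-disk RAID5 group are illustrative assumptions, not figures from the slides. A replication factor of 3 is the HDFS default.

```python
def usable_with_replication(raw_tb: float, replication_factor: int) -> float:
    """Each block is stored replication_factor times, so usable = raw / r."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication_factor


def usable_with_raid5(raw_tb: float, disks_per_group: int) -> float:
    """RAID5 spends one disk's worth of capacity per group on parity."""
    if disks_per_group < 3:
        raise ValueError("RAID5 needs at least 3 disks per group")
    return raw_tb * (disks_per_group - 1) / disks_per_group


raw_tb = 120.0  # assumed raw capacity, for illustration only
replicated_tb = usable_with_replication(raw_tb, 3)
raid5_tb = usable_with_raid5(raw_tb, 6)
print(f"replication x3: {replicated_tb:.1f} TB usable")
print(f"RAID5, 6 disks: {raid5_tb:.1f} TB usable")
```

Under these assumptions, replication leaves one third of the raw capacity usable, while RAID5 leaves five sixths. This ignores performance and failure-domain differences between the two schemes.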
HPC-DA Appliance
Non-intrusive deployment on an existing HPC platform
The additional packages are compatible with standard openSSF or Fujitsu HCS cluster management tools
Normal HPC operation is left untouched
Runs compute-intensive and data analytics workloads through the same cluster management tool (SLURM)
Standard batch jobs run under the control of the SLURM scheduler, with a fair share of the resources
Takes advantage of the high-end performance of the HPC platform to speed up Hadoop
Both the high-speed interconnect and the parallel file system bring unprecedented performance to boost Hadoop efficiency
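The idea of submitting both kinds of work through one scheduler can be sketched as follows. The helper `slurm_script` is hypothetical and only builds the text of a batch script using standard `#SBATCH` directives. The job names, node counts, solver binary, `EXAMPLES_JAR` variable and `/pfs/...` paths are all assumptions for illustration. How the appliance actually makes Hadoop available inside a SLURM allocation is not described in the slides.

```python
def slurm_script(job_name, nodes, time_limit, commands):
    """Build the text of a SLURM batch script (to be submitted with sbatch)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
        "",
    ]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# A classic compute job (hypothetical solver binary and input file).
mpi_job = slurm_script(
    "crash-sim", 16, "04:00:00", ["srun ./solver input.dat"]
)

# A data analytics job in the same queue. Assumes the Hadoop integration
# makes the `hadoop` command usable inside the allocation, and that
# EXAMPLES_JAR points at the Hadoop MapReduce examples jar.
hadoop_job = slurm_script(
    "terasort",
    16,
    "01:00:00",
    ['hadoop jar "$EXAMPLES_JAR" terasort /pfs/tera/in /pfs/tera/out'],
)

print(mpi_job)
print(hadoop_job)
```

Both scripts go through the same submission path, so the scheduler's fair-share accounting covers simulation and analytics jobs alike.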
TeraSort Benchmark figures
Data generation is the first phase of the TeraSort benchmark.
On strictly identical hardware, including storage devices, the parallel file system solution shows far better throughput than standard Hadoop/HDFS, dramatically reducing the generation time.
The sort phase behaves consistently with this: the advantage of using an integrated parallel file system is also reflected in the sort performance evaluation.
Again on a strictly identical hardware platform, including storage devices, the parallel file system solution shows a significant performance boost compared to standard Hadoop/HDFS.
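For readers unfamiliar with the benchmark, the phases can be mimicked in a few lines. This is a single-process toy, not the Hadoop TeraGen/TeraSort code, and it produces no figures comparable to those on the slide. It only assumes the benchmark's record layout of 100-byte rows with a 10-byte key, and times generation and sorting separately, as the two phases above are measured.

```python
import os
import time

RECORD_SIZE = 100  # bytes per record, as in the TeraSort record layout
KEY_SIZE = 10      # leading bytes of each record used as the sort key


def generate(num_records):
    """Generation phase: produce random fixed-size records."""
    return [os.urandom(RECORD_SIZE) for _ in range(num_records)]


def sort_records(records):
    """Sort phase: order records by their key prefix."""
    return sorted(records, key=lambda r: r[:KEY_SIZE])


def is_sorted(records):
    """Validation phase: check that keys are in non-decreasing order."""
    return all(
        a[:KEY_SIZE] <= b[:KEY_SIZE] for a, b in zip(records, records[1:])
    )


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


records, gen_s = timed(generate, 10_000)
ordered, sort_s = timed(sort_records, records)
size_mb = len(records) * RECORD_SIZE / 1e6
print(f"generated {size_mb:.1f} MB in {gen_s:.4f} s, sorted in {sort_s:.4f} s")
print("valid:", is_sorted(ordered))
```

In the real benchmark both phases read and write through the cluster file system, which is why the choice between HDFS and a parallel file system shows up in the timings.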