
Hitachi Analytics Infrastructure for Cloudera Data Platform Private Cloud

Reference Architecture Guide

MK-SL-212-01
March 2022


Legal Notices

© 2022 Hitachi Vantara LLC. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including copying and recording, or stored in a database or retrieval system for commercial purposes without the express written permission of Hitachi, Ltd., or Hitachi Vantara LLC (collectively "Hitachi"). Licensee may make copies of the Materials provided that any such copy is: (i) created as an essential step in utilization of the Software as licensed and is used in no other manner; or (ii) used for archival purposes. Licensee may not make any other copies of the Materials. "Materials" mean text, data, photographs, graphics, audio, video, and documents.

Hitachi reserves the right to make changes to this Material at any time without notice and assumes no responsibility for its use. The Materials contain the most current information available at the time of publication.

Some of the features described in the Materials might not be currently available. Refer to the most recent product announcement for information about feature and product availability, or contact Hitachi Vantara LLC at https://support.hitachivantara.com/en_us/contact-us.html.

Notice: Hitachi products and services can be ordered only under the terms and conditions of the applicable Hitachi agreements. The use of Hitachi products is governed by the terms of your agreements with Hitachi Vantara LLC.

By using this software, you agree that you are responsible for:

1. Acquiring the relevant consents as may be required under local privacy laws or otherwise from authorized employees and other individuals; and

2. Verifying that your data continues to be held, retrieved, deleted, or otherwise processed in accordance with relevant laws.

Notice on Export Controls. The technical data and technology inherent in this Document may be subject to U.S. export control laws, including the U.S. Export Administration Act and its associated regulations, and may be subject to export or import regulations in other countries. Reader agrees to comply strictly with all such regulations and acknowledges that Reader has the responsibility to obtain licenses to export, re-export, or import the Document and any Compliant Products.

Hitachi and Lumada are trademarks or registered trademarks of Hitachi, Ltd., in the United States and other countries.

AIX, AS/400e, DB2, Domino, DS6000, DS8000, Enterprise Storage Server, eServer, FICON, FlashCopy, GDPS, HyperSwap, IBM, Lotus, MVS, OS/390, PowerHA, PowerPC, RS/6000, S/390, System z9, System z10, Tivoli, z/OS, z9, z10, z13, z14, z/VM, and z/VSE are registered trademarks or trademarks of International Business Machines Corporation.

Active Directory, ActiveX, Bing, Edge, Excel, Hyper-V, Internet Explorer, the Internet Explorer logo, Microsoft, the Microsoft corporate logo, the Microsoft Edge logo, MS-DOS, Outlook, PowerPoint, SharePoint, Silverlight, SmartScreen, SQL Server, Visual Basic, Visual C++, Visual Studio, Windows, the Windows logo, Windows Azure, Windows PowerShell, Windows Server, the Windows start button, and Windows Vista are registered trademarks or trademarks of Microsoft Corporation. Microsoft product screen shots are reprinted with permission from Microsoft Corporation.

All other trademarks, service marks, and company names in this document or website are properties of their respective owners.

Copyright and license information for third-party and open source software used in Hitachi Vantara products can be found at https://www.hitachivantara.com/en-us/company/legal.html.

Feedback

Hitachi Vantara welcomes your feedback. Please share your thoughts by sending an email message to [email protected]. To assist the routing of this message, use the paper number in the subject and the title of this white paper in the text.

Revision history

March 2, 2022:
■ Added support for Hitachi Advanced Server DS120 G2 and Hitachi Advanced Server DS220 G2.
■ Renamed CDP Private Cloud Plus Edition to CDP Private Cloud Data Services.

February 16, 2021: Initial release.


Accelerate the deployment of your analytics infrastructure by leveraging this reference architecture guide to implement Hitachi Analytics Infrastructure for Cloudera Data Platform Private Cloud. Use this guide to implement an architecture that maximizes the return on your investment.

Note: Testing of this configuration was done in a lab environment. Many factors that affect production environments cannot be predicted or duplicated in a lab environment. Follow the recommended practice of conducting proof-of-concept testing for acceptable results in a non-production, isolated test environment that otherwise matches your production environment before implementing this solution in production.

Key solution elements

These are the key hardware and software components that power this big data solution. You can create a scale-out configuration to power your Cloudera environment. For detailed component information, see Product descriptions.


This integrated big data infrastructure for advanced analytics uses the following:

■ Hitachi Advanced Server DS120 and Hitachi Advanced Server DS120 G2

  These are flexible 1U servers designed for optimal performance across multiple applications.

■ Hitachi Advanced Server DS220 and Hitachi Advanced Server DS220 G2

  These are flexible 2U servers designed for optimal performance across multiple applications.

■ Cloudera Data Platform (CDP)

  CDP is the merged solution of Cloudera Enterprise Data Hub and Hortonworks Data Platform. It is powered by Apache Hadoop and enables an enterprise data hub together with security, governance, management, support, and a commercial ecosystem. It consists of a base edition that provides traditional Hadoop deployments and a private cloud option that provides compute power in a containerized environment.

■ Cloudera Data Platform Private Cloud

  CDP Private Cloud is a hybrid platform that supports running in a private cloud and scaling to a public cloud as needed. It is designed to increase productivity while reducing costs.

■ Cisco Nexus 92348

  This 48-port 1 GbE switch provides the management network. It is used both as a leaf switch and as a spine switch.

■ Cisco Nexus 93180YC-E/FX

  This 48-port switch provides 10/25 GbE connectivity for intra-rack networks. It is used as the leaf switch for the data network. Designed with Cisco Cloud Scale technology, it supports highly scalable cloud architectures.

■ Cisco Nexus 9332

  This 100 GbE switch provides connectivity for inter-rack networks. It is used as the spine switch for the data network, supporting flexible migration options. It is ideal for highly scalable cloud architectures and enterprise data centers.

Cloudera Data Platform Private Cloud overview

Cloudera Data Platform Private Cloud is Cloudera's latest enterprise Apache Hadoop solution. It merges Cloudera Data Hub and Hortonworks Data Platform, and it provides cloud support. Visit the Cloudera website to see the most recent Cloudera Data Platform Private Cloud Reference Architecture.


CDP Private Cloud has two options:

■ CDP Private Cloud Base

  This provides traditional bare metal Apache Hadoop. It is also used to store the data that is used by CDP Private Cloud Data Services.

■ CDP Private Cloud Data Services

  This provides containerized compute processing that integrates with CDP Private Cloud Base. For easier use, Cloudera bundles software into Data Services or Experiences that solutions can be built around.

Note: Cloudera has renamed CDP Private Cloud Plus Edition to CDP Private Cloud Data Services.

The following figure shows the high-level components of CDP Private Cloud Base and CDP Private Cloud Data Services.

CDP Private Cloud Base

CDP Private Cloud Base follows the traditional Apache Hadoop architecture running on bare metal or VMware. It runs traditional workloads that are not containerized, and it provides storage for CDP Private Cloud Data Services.


This cluster consists of standard Cloudera Runtime, Cloudera Manager, and Cloudera Shared Data Experience (SDX).

■ Cloudera Runtime is the core open source software distribution within CDP Private Cloud Base. It includes about 50 open source projects.

■ Cloudera Manager is Cloudera's tool to manage both editions of CDP Private Cloud. It manages and monitors the software and processes, not the data.

■ Cloudera SDX provides unified security, governance, and metadata not only across CDP Private Cloud Base and CDP Private Cloud Data Services, but also across CDP Public Cloud with support for multiple cloud vendors.


The following is a partial list of the software components and modules that make up Cloudera Runtime:

■ Apache Hadoop Distributed File System

  Hadoop Distributed File System (HDFS) is a distributed, high-performance file system designed to run on commodity hardware.

■ Apache Hadoop Common

  These common utilities support the other Hadoop modules. This programming framework supports the distributed processing of large data sets.

■ Apache Hadoop YARN

  Apache Hadoop YARN is a framework for job scheduling and cluster resource management. It splits the following functionalities into separate daemons:

  ● ResourceManager interfaces with the client to track tasks and assigns tasks to NodeManagers.
  ● NodeManager launches tasks and tracks their status on worker nodes.

■ Apache HBase

  Apache HBase is a datastore built on top of HDFS.

■ Apache Hive

  Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

■ Apache Impala

  Apache Impala is a distributed SQL query engine for Apache Hadoop.

■ Apache Kafka

  Apache Kafka is a distributed streaming platform.

■ Apache Kudu

  Apache Kudu provides fast inserts, updates, and analytics on top of Apache Impala or Apache Spark.

■ Apache Knox

  Apache Knox is used to extend Apache Hadoop services outside the cluster without reducing security.

■ Apache Oozie

  Apache Oozie is a workflow scheduler system that manages Apache Hadoop jobs.

■ Apache Ozone

  Apache Ozone is a scalable, redundant, and distributed object store that can be used in place of or alongside HDFS. Cloudera is moving toward using Apache Ozone instead of HDFS.

■ Apache Ranger


  Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.

■ Apache Spark

  Apache Spark is a fast, general-purpose engine for large-scale data processing.

■ Apache Sqoop

  Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores, such as relational databases.

■ Apache ZooKeeper

  Apache ZooKeeper is a centralized service that provides configuration information, naming, synchronization, and group services over large clusters in distributed systems.

  ● ZooKeeper Master Node

    ZooKeeper is a high-availability system, where two or more nodes can connect to a ZooKeeper master node. The ZooKeeper master node controls the nodes to provide high availability.

  ● ZooKeeper Standby Master Node

    When ZooKeeper runs in a high-availability setup, several nodes can be configured as ZooKeeper master nodes, but only one of them can be active as the master node at any one time. The others are standby master nodes.

    If the active master node fails, the ZooKeeper cluster itself promotes one of the standby master nodes to be the active master node.

■ HUE

  HUE (Hadoop User Experience) is a web interface for analyzing data with Apache Hadoop.

CDP Private Cloud Base provides a secure data lake for CDP Private Cloud Plus Edition using HDFS and Ozone.

CDP Private Cloud Data Services

CDP Private Cloud Data Services provides a containerized environment for processing data. It is designed to scale to public clouds when more compute resources are needed. The main components of CDP Private Cloud Data Services are the Cloudera SDX Management Console, a container cluster, and Cloudera Data Services (experiences).


Cloudera Data Services (Experiences) are bundled software provided in containers from Cloudera as containerized workloads that solve specific problem cases. Currently there are three Cloudera Data Services available:

■ Cloudera Machine Learning (CML)

  CML provides an end-to-end workflow for machine learning. It flows from data engineering, to data scientists, to model training, to deployment, to inference, and as needed back to model training. It contains many popular tools including Spark, Scala, Python, and R.

■ Cloudera Data Warehouse (CDW)

  CDW provides for self-service creation of independent data warehouses and data marts that automatically scale up and down to meet varying workload demands.

■ Cloudera Data Engineering (CDE)

  CDE allows you to submit Spark jobs to an auto-scaling virtual cluster.

CDP Private Cloud Data Services runs either on top of the Red Hat OpenShift Container Platform or using Cloudera's Experiences Compute Service (ECS).

The OpenShift cluster is an independent cluster of machines that is not part of the CDP Private Cloud Base cluster. It processes data from CDP Private Cloud Base. An OpenShift cluster consists of master nodes, worker nodes, and storage.

Note: This reference architecture describes OpenShift at a high level. It is not a reference architecture for OpenShift, and it does not cover many of its details. Documentation for designing, deploying, configuring, and using an OpenShift cluster is available from Red Hat.

The Experiences Compute Service (ECS) is provided by Cloudera. It creates and manages an embedded Kubernetes infrastructure for use with CDP Private Cloud Experiences. The advantage over OpenShift is that it is simpler to install, configure, and manage. You only need to provide hosts on which to install the service, and Cloudera Manager sets up the Private Cloud Experiences cluster. Cloudera Manager also provides management and monitoring of the cluster.

CDP Private Cloud Base solution design

Use this detailed design to create an integrated infrastructure to implement your big data and business analytics solution using hardware and software from Hitachi Vantara and Cloudera. This design is modeled on the Cloudera Data Platform - Data Center Reference Architecture. Cloudera Data Platform - Data Center has been renamed to CDP Private Cloud Base.


The following are considered in the solution design:

■ Server architecture
  ● Master node
  ● Worker node
  ● Utility node
  ● Edge node
  ● Hitachi hardware management server
■ HDFS data node storage
  ● Heterogeneous storage
  ● Erasure encoding
■ Network architecture
  ● Switches
  ● Data network
  ● Client network
  ● Management network
■ Deployment options
  ● Visit the Cloudera website to see the current recommended software deployment options.
■ Rack configuration
  ● Single-rack configuration
  ● Multi-rack configuration

For large deployments, verify that your network meets your specific requirements.

Software architecture

This solution supports CDP Private Cloud Base 7.1 and higher and CDP Private Cloud Plus (Data Services) 1.0 and higher. The supported operating systems depend on the version of CDP that is deployed and the generation of Hitachi Advanced Server used. The following table shows the operating system, hardware, and CDP versions.

DSx20 generation | OS for CDP 7.1.6 and lower | OS for CDP 7.1.7 and higher
DSx20 G1 | RHEL 7.6 - RHEL 7.9, SLES 12 SP5 | RHEL 7.6 - RHEL 7.9, RHEL 8.2, SLES 12 SP5
DSx20 G2 | RHEL 7.8 - RHEL 7.9, SLES 12 SP5 | RHEL 7.6 - RHEL 7.9, RHEL 8.2, SLES 12 SP5


Server architecture

This solution uses multiple Hitachi Advanced Server DS120 G1, DS120 G2, DS220 G1, and DS220 G2 servers. The architecture supports using servers in multiple configurations. This guide does not list all possible options.

This solution uses the following node types:
■ Master node
■ Hadoop worker nodes
■ Edge node
■ Utility node


For CDP Private Cloud Base Edition, this reference architecture was tested with Red Hat Enterprise Linux 7.6 and Cloudera Data Platform Private Cloud Base 7.1.4.

■ Master node

  Master nodes control other processes on the network. The following master node types are used:

  ● Name node
  ● ZooKeeper node
  ● Spark master
  ● Hive MetaStore

  The following configurations list the standard hardware options for each chassis type for the master node servers. The number of master nodes depends on the specific implementation and software used. In general, use at least three master nodes for every 100 nodes of other types. The following are descriptions of the Hitachi Advanced Server G1 servers that can be used.

  Hitachi Advanced Server DS120 master node configuration:

  ● Server: Hitachi Advanced Server DS120
  ● CPU: From the Intel Xeon Scalable processor family. Default processors: 2 × Intel Xeon Gold 6226R (16C, 2.9 GHz, 150 W).
  ● Memory option: Default 256 GB (8 × 32 GB DDR4 DIMM).
  ● Network connections: 1 or 2 × Mellanox CX-4 LX EN 25 GbE dual-port SFP28; 1 GbE LOM management port.
  ● Disk controller: LSI 3516 RAID controller.
  ● Operating system devices: 2 × 128 GB MLC SATADOMs.
  ● Data disks: Up to 12 SFF SAS drives, or up to 8 SFF SAS or SSD drives and up to 4 NVMe drives. Default: 8 × 2.4 TB SFF SAS drives.

  Hitachi Advanced Server DS220 master node configuration:

  ● Server: One of the following chassis: Hitachi Advanced Server DS220 using an LFF chassis; Hitachi Advanced Server DS220 using an SFF chassis with 16 SAS or SATA drives and 8 NVMe drives; or Hitachi Advanced Server DS220 using an SFF chassis with up to 24 SAS or SATA drives. Default: Hitachi Advanced Server DS220 using an SFF chassis with 24 × 2.4 TB SAS drives.
  ● CPU: From the Intel Xeon Scalable processor family. Default processors: 2 × Intel Xeon Gold 6226R (16C, 2.9 GHz, 150 W).
  ● Memory option: Default 256 GB (8 × 32 GB DDR4 DIMM).
  ● Network connections: 1 or 2 × Mellanox CX-4 LX EN 25 GbE dual-port SFP28 (LP-MD2); 1 GbE LOM management port.
  ● Disk controller: SAS 3516-based RAID controller.
  ● Operating system devices: 2 × 480 GB SSDs in the rear cage.
  ● Disks: Multiple sizes of SAS and SATA storage devices, both large and small form factor. Default: 8 × 2.4 TB SFF SAS drives.

  When using Apache Ozone, it is recommended that there is at least one SSD or NVMe device on every Apache Ozone node.

■ Hadoop worker nodes

  Worker nodes (data nodes) are used to store and/or process data. These nodes have very diverse needs, with many different configurations and software packages running on them.

  The following are descriptions of the Hitachi Advanced Server G1 servers that can be used. For the complete set of current options, contact your Hitachi Vantara sales representative.

  Hitachi Advanced Server DS120 worker node configuration:

  ● Server: Hitachi Advanced Server DS120
  ● CPU: From the Intel Xeon Scalable processor family. Default processors: 2 × Intel Xeon Gold 6226R (16C, 2.9 GHz, 150 W).
  ● Memory option: Default 256 GB (8 × 32 GB DDR4 DIMM).
  ● Network connections: 1 or 2 × Mellanox CX-4 LX EN 25 GbE dual-port SFP28 (LP-MD2); 1 GbE LOM management port.
  ● Disk controller: SAS 3516 RAID controller.
  ● Operating system devices: 2 × 128 GB MLC SATADOM for the operating system.
  ● Data disks: Up to 12 SFF SAS or SSD drives, or up to 8 SFF SAS or SSD drives and up to 4 NVMe drives. Default: 12 × 2.4 TB SFF SAS drives.

  When deploying Apache Ozone on an HDFS node, it is recommended to have at least 1 or 2 SSDs or NVMe devices.

  Hitachi Advanced Server DS220 worker node configuration:

  ● Server: Hitachi Advanced Server DS220 using one of the following options: the 12 LFF chassis option; the 24 SFF SAS or SSD drive option; or the 16 SFF SAS or SSD drive and 8 NVMe drive option.
  ● CPU: From the Intel Xeon Scalable processor family. Default processors: 2 × Intel Xeon Gold 6226R (16C, 2.9 GHz, 150 W).
  ● Memory option: Default 256 GB (8 × 32 GB DDR4 DIMM).
  ● Network connections: 1 or 2 × Mellanox CX-4 LX EN 25 GbE dual-port SFP28 (LP-MD2); 1 GbE LOM management port.
  ● Disk controller: SAS 3516-based RAID controller.
  ● Operating system devices: 2 × 480 GB SATA SSDs in the rear cage.
  ● Disks:
    Storage devices for DS220 using LFF drives: SFF SAS drives, SATA SSDs, or SATA LFF drives. Default: 12 × 4 TB SATA LFF drives.
    Storage devices for DS220 with up to 16 SFF SAS or SATA drives and up to 8 NVMe drives: 16 SAS HDDs or SSDs and 8 NVMe drives. Default: 16 × 2.4 TB SAS drives.
    Storage devices for DS220 using 24 SFF drives: 24 SAS HDDs or SSDs. Default: 24 × 2.4 TB SAS drives.

■ Edge node

  An edge node resides on the client network and the data network to initiate processing in CDP Private Cloud Base Edition. This multi-homed node can do the following:

  ● Act as a gateway.
  ● Run software that needs access to both the Cloudera environment and the corporate systems.

  Depending on the work being performed and the other software running on the edge node, the configuration can vary significantly. In this solution, use an edge node in either a master node or a worker node configuration.

  The number of edge nodes depends on the software and how the nodes are used. The following are examples of edge nodes:

  ● Gateway node: allows access to the client and data networks, and runs Hadoop client processes.
  ● Pentaho node: transfers data from existing sources into Hadoop, and runs Hadoop data reports.

  Data transfer nodes, such as nodes running Apache Sqoop, should be edge nodes. The following are the hardware recommendations for edge nodes:


  ● Hitachi Advanced Server DS120: use the same configuration options listed for DS120 worker nodes.
  ● Hitachi Advanced Server DS220: use the same configuration options listed for DS220 worker nodes. When using LFF drives, the recommended size is 4 TB or greater.

■ Utility node

  A utility node runs support software and Cloudera software. Depending on the software hosted, a utility node can be an edge node and multihomed, or only connect to the data network.

  The number of utility nodes depends on the software and how the nodes are used. The following are examples:

  ● Cloudera Navigator
  ● Apache HUE
  ● Cloudera Manager
  ● Underlying database for management servers

  The following are the hardware recommendations for utility nodes:

  ● Hitachi Advanced Server DS120: use the same configuration options listed for DS120 worker nodes.
  ● Hitachi Advanced Server DS220: use the same configuration options listed for DS220 worker nodes. When using LFF drives, the recommended size is 4 TB or greater.

■ Hitachi hardware management server

  You can include an optional hardware management server in this solution. It allows access to the out-of-band management network. The following describes the hardware used for this server.

  ● Server: Hitachi Advanced Server DS120
  ● CPU: 2 × Intel 4210R (10C, 2.4 GHz, 100 W)
  ● Memory option: 64 GB (2 × 32 GB DIMMs)
  ● Network connections: 2 × Intel XXV710 10 GbE dual-port SFP28; 1 GbE LOM management port
  ● Disk controller: Intel VROC on the motherboard
  ● Disks: 2 × 128 GB SATADOM configured as RAID-1

Note: Hitachi Advanced Server G2 Intel Virtual RAID on CPU (VROC) does not support RAID for M.2 and NVMe devices with the out-of-box drivers for RHEL 7.x and SLES 12.

HDFS data node storage

For Hadoop Distributed File System (HDFS), Cloudera's current recommendation is to use SAS HDDs over SATA HDDs; however, SSDs provide better performance.

With significant price reductions over the past few years, SSDs present a viable option. The cost per terabyte for a 1 DWPD SSD can be less than for an SFF SAS drive. Large form factor SATA drives still provide the lowest cost per terabyte.

Hitachi Advanced Server DS220 offers large form factor drives up to 14 TB. For better performance and to lower the impact of a potential disk failure, the recommendation is to use disk sizes of 4 TB or less. Cloudera only supports nodes with 100 TB or less of storage and a maximum drive size of 8 TB.
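As a quick sizing check against these limits, using configurations that appear elsewhere in this guide:

  24 × 2.4 TB SFF SAS = 57.6 TB raw per node (within the 100 TB per-node limit)
  12 × 4 TB LFF SATA  = 48 TB raw per node (within the limit, with drives under the 8 TB maximum)
  12 × 14 TB LFF      = 168 TB raw per node (exceeds both the 100 TB node limit and the 8 TB drive limit)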

The following can influence your choice of drives and server types:

■ Storage per node or storage per rack

  With large drives, more data can be stored in a rack.

■ Cost per terabyte

  Large form factor drives are usually less expensive per terabyte than small form factor drives. Historically, hard disk drives are less expensive than solid state drives, but the price difference has been diminishing.

■ Performance

  For Hadoop Distributed File System and MapReduce, the bottleneck is usually storage performance. This is dealt with by having more drives or faster drives. If CPU performance is an issue, Hitachi Advanced Server DS120 allows for more compute power in a rack.

■ Software

  Software such as Spark benefits more from faster storage. Having solid state drives or NVMe drives for Spark temporary files and MapReduce temporary files can provide a significant improvement in performance. For Ozone data nodes, it is recommended to have at least one SSD.

■ Recovery

  When a data node or a disk goes down, the data replicates to other nodes. If the device is larger or has more data on it, this replication time increases. Recovery impacts network performance and can affect everything that is being done on the cluster.


When sizing the cluster, there must be enough space to accommodate the following:
■ Working files
■ Extra storage when nodes or drives are down

Plan on 80% free space for small clusters and 90% free space for large clusters. Two features that can significantly impact Hadoop Distributed File System sizing are the following:
■ Heterogeneous storage
■ Erasure encoding

Heterogeneous storage

Heterogeneous storage introduces the concept of storage classes and data temperature. It allows you to assign a storage device to a class of storage, or storage type. The classes are associated with data temperatures. The temperatures and classes are predefined in Hadoop and cannot be changed.

Software packages in Hadoop do not have to support all temperatures and classes. There could be temperatures and classes that are not used in Hadoop Distributed File System.

To have consistent performance, it is important that all devices assigned to a storage type have similar performance. There are four storage classes:

■ ARCHIVE

  This is your slower storage with data-dense devices. It is useful for rarely accessed data and usually costs less per terabyte. Some examples are an Amazon Simple Storage Service (S3) bucket, Hitachi Unified Compute Platform RS, large SATA drives, and SAN storage. The devices that can serve as archive storage have the widest performance variations.

■ DISK

  This is for your hard drives, and it is the default storage class. Even though hard drive variance in performance is small compared to the archive options, it is still recommended that you use similar devices.

■ SSD

  This is used for solid state drives or NVMe drives. It is for data that needs faster access.

■ RAM_DISK

  This is used for high-speed, single-node writes that you can afford to lose. An example of this is an in-memory file system.

Use a storage policy to describe which storage types and fallback types to use. The fallback type is what to use if there is no space on the requested storage types. For Hadoop Distributed File System, the fallback type is DISK.


Hadoop Distributed File System supports six storage policies:
■ Hot: All copies of the data are stored on DISK.
■ Cold: All copies of the data are stored on ARCHIVE.
■ Warm: One copy is stored on DISK and the rest of the copies are stored on ARCHIVE.
■ One_SSD: One copy is stored on SSD and the rest of the copies are stored on DISK.
■ Lazy_Persist: One copy is written to RAM_DISK. Later, all copies are persisted to DISK.
■ All_SSD: All copies are stored on SSD.

The following figure shows the mapping between device types and temperatures. It is important to note the following:
■ A directory is assigned to a storage policy, not to devices.
■ Devices can be used in multiple directories and have different storage policies.

Storage planning must take the different device types, storage policies, and multiple directories into consideration when sizing a system.
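The following is a minimal sketch of how storage types and policies are applied in HDFS; the mount points and directory names are hypothetical examples, not values prescribed by this reference architecture:

  # Tag each data directory with its storage type in dfs.datanode.data.dir
  # (set through Cloudera Manager in a CDP deployment); paths are examples only:
  #   [DISK]/hadoop/hdfs/data,[SSD]/hadoop/hdfs/ssd,[ARCHIVE]/hadoop/hdfs/archive

  # Assign a storage policy to a directory and confirm it:
  hdfs storagepolicies -setStoragePolicy -path /data/warm -policy WARM
  hdfs storagepolicies -getStoragePolicy -path /data/warm

  # List the policies supported by the cluster:
  hdfs storagepolicies -listPolicies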

Erasure encoding

Erasure encoding is Hadoop's term for RAID-like redundancy. With CDP Private Cloud Base, the default is the standard "3-times" replication (mirroring). Cloudera supports these methods for erasure encoding:

■ Standard replication
  ● Defaults to three copies of each storage block, replicated across three nodes.
  ● It uses 300% of the space needed to store one copy.
  ● For high availability at the rack level, there should be at least three racks to spread the replica set across.
  ● Data is local to processing, so in most cases it is faster.

■ Erasure encoding
  ● For everything except small files, it takes less space.
  ● It can lose more nodes without losing data.
  ● Three levels are supported: 6 data + 3 parity, 3 data + 2 parity, and 10 data + 4 parity.

For high availability at the rack level, each data device used in an erasure-encoded file system should be in a different rack. Processing is not co-located with the data, so there is more network traffic, which makes it slower. Only the Reed-Solomon (RS) code algorithm is supported.
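As a sketch, erasure coding in HDFS is enabled and applied per directory with the hdfs ec administration commands; the directory name below is a hypothetical example:

  # List the erasure coding policies known to the cluster:
  hdfs ec -listPolicies

  # Enable the Reed-Solomon 6 data + 3 parity policy and apply it to a directory:
  hdfs ec -enablePolicy -policy RS-6-3-1024k
  hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

  # Confirm which policy a directory uses:
  hdfs ec -getPolicy -path /data/cold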


The following table shows the space used for the different erasure encoding options when calculating the space for an individual file system.

Type | Minimum cluster size | Data durability (machines that can go down) | Storage efficiency | Storage needed for 100 TB of data (TB)
Single copy | 1 | 0 | 100% | 100
3 × replication | 3 | 2 | 33% | 300
RS (6,3) | 9 | 3 | 67% | 150
RS (3,2) | 5 | 2 | 60% | 167
RS (10,4) | 14 | 4 | 71% | 140

Note: Erasure encoding is set at the directory level, not at the disk level. This means that one disk can hold data with different replication strategies applied to it.

With small files, standard replication can take up less space than erasure encoding:
■ Using replication, files up to 128 MB require one block on three disks: 384 MB.
■ Using replication, files from 128 MB to 256 MB require two blocks on three disks: 768 MB.
■ Using replication, files from 256 MB to 384 MB require three blocks on three disks: 1152 MB.
■ Using replication, files from 384 MB to 512 MB require four blocks on three disks: 1536 MB.

The required number of blocks needed on a number of disks for replicated files is shown in the following figure.

When using erasure encoding with six data and three parity, the block size is still 128 MB:
■ Files up to 128 MB require one block on all nine disks: 1152 MB.
■ Files from 384 MB to 512 MB still require one block on nine disks: 1152 MB.
■ Files from 512 MB to 768 MB still require one block on nine disks: 1152 MB.

Both erasure encoding and the storage policy are applied to the directory. Not all combinations are supported. The following storage policies require a replication encoding policy: Warm, One_SSD, and Lazy_Persist.


Network configuration

This solution uses three logical networks: client, data, and management. There can be multiple network configurations, depending on the Apache Hadoop deployment.
■ Client network: client access to the edge nodes
■ Data network: communication between the nodes
■ Management network: hardware management

The network architecture uses the following components:
■ Switches
■ Data network
■ Client network
■ Management network

Switches

This solution uses the following switches:

■ Leaf data switches: Cisco Nexus 93180YC-E/FX

  These leaf data switches connect all nodes in a rack together. The leaf switches are uplinked to the spine data switches.

  Connect the two switches together using stacking. This enables both switches to act as one single logical switch. If one switch fails, there is still a path to the hosts.

■ Spine data switches: Cisco Nexus 9332

  Spine data switches interconnect leaf switches from different racks. Place these switches in a separate network rack.

■ Leaf and spine management switches: Cisco Nexus 92348

  A leaf management switch connects the management ports of the hardware to the management server. When there is more than one rack, use a spine switch to connect all management leaf switches together.

  Uplink the management network to the in-house management network.

Note: Other switches can be used, but check the network oversubscription rate based upon the selected switches.

Data network

Use the data network for communications between the nodes.

Provide redundancy with two network interfaces configured at the operating system level to use an active-active network-bonding mode.

With CDP Private Cloud Base, it is recommended to use 2 × 25 GbE or better NICs or ports, and an oversubscription ratio as close to 1:1 as possible.
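A minimal sketch of an active-active bond on RHEL 7 using NetworkManager follows; the interface names are examples only, and the bonding mode (shown here as LACP/802.3ad) is an assumption that must match the switch-side configuration:

  # Create the bond and attach the two 25 GbE ports to it (interface names are examples):
  nmcli con add type bond con-name bond0 ifname bond0 mode 802.3ad
  nmcli con add type bond-slave con-name bond0-port1 ifname ens1f0 master bond0
  nmcli con add type bond-slave con-name bond0-port2 ifname ens1f1 master bond0
  nmcli con up bond0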


The following table shows the correlation between network option oversubscription ratios and the number of nodes that can be in a rack. The more uplink ports you use, the more spine switches are required. The default configuration is four uplink ports, two per top-of-rack switch.

Target oversubscription | Uplink speed (GbE) | Total uplink ports | Maximum uplink bandwidth (Gb/s) | Network | Maximum 1U nodes | Maximum 2U nodes
1:1 | 100 | 4 | 400 | 25 GbE | 16 | 16
2:1 | 100 | 4 | 400 | 25 GbE | 32 | 18
3:1 | 100 | 4 | 400 | 25 GbE | 36 | 18
1:1 | 100 | 4 | 400 | 10 GbE | 36 | 18
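As a worked example of how the first row is derived (assuming 25 GbE of active bandwidth is counted per node):

  oversubscription = (nodes per rack × bandwidth per node) / total uplink bandwidth
                   = (16 × 25 GbE) / (4 × 100 GbE)
                   = 400 / 400
                   = 1:1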

Client network

The client network is an optional network used on edge nodes. Using a client network separates the Hadoop network traffic from the rest of the client network traffic. Cloudera recommends that only edge nodes are accessible from the client network.

Management network

The management network allows access to the nodes using the 1 GbE LAN on motherboard (LOM) interface. This network provides out-of-band monitoring and management of the servers.

You can uplink this network to the client management network.

Deployment options

The following table shows the nodes used to deploy the different components in a three-master-node configuration. This is a subset of the components that can be used in a CDP Private Cloud Base Edition solution. Software deployed on an edge node is specific to a deployment and is not shown in this table.

Note: The CDP Private Cloud Base Edition software deployed, and the configuration of the software, can change when it is used with CDP Private Cloud Plus Edition. Verify the requirements listed in the CDP Private Cloud Plus Edition documentation.

Component | Master node 1 | Master node 2 | Master node 3 | Utility nodes | Data nodes (multiple nodes)
ZooKeeper | ZooKeeper | ZooKeeper | ZooKeeper | |
Hadoop Distributed File System | Name node, Quorum Journal node | Name node, Quorum Journal node | Quorum Journal node | | Data nodes
YARN | Resource manager | Resource manager | History server | | Node manager
Hive | Meta Store | WebHCat | HiveServer | |
Ozone | Ozone manager | | Ozone Storage Container Manager | |
Oozie | | Oozie | | |
Management | Cloudera Agent | Cloudera Agent | Cloudera Agent | Cloudera Manager, Management Services, Cloudera Agent | Cloudera Agent
Miscellaneous | | | | Navigator, Database instances, KMS, HUE |
Spark | | | | | Runs on YARN or standalone


Rack configuration

When determining how many racks are needed, there are many deployment-specific considerations that must be evaluated and balanced for an individual deployment.

■ Storage per square foot: In this case, racks are filled up as much as possible.
■ Rack high availability: This is recommended to reduce the likelihood of a complete system shutdown.
  ● Nodes with the same data or processes need to be spread out across multiple racks. For standard Hadoop Distributed File System 3-times replication, you should have at least two racks. An erasure encoding of 6+3 would require at least three racks.
  ● A spine switch pair should have each switch in a different rack.
■ Software: When there are multiple racks, spread the different Hadoop components across the racks.
■ Performance: Nodes that work together should be as close as possible. If not in the same rack, they should be under the same spine switch.
■ Growth plans: When adding new nodes, they are placed in new racks, existing racks, or both.
■ Network oversubscription: The number of nodes in a rack can be limited by the network oversubscription ratio.
■ Power and heat design: The data center's power and heat requirements can increase the number of racks needed.


Single rack configuration

The following figure shows a Hitachi Advanced Server DS120 deployment and one using Hitachi Advanced Server DS220 with the following components:
■ Top-of-rack data and management switches
■ 3 master nodes
■ 1 edge node
■ 2 utility nodes
■ 1 hardware management node
■ 9 worker nodes
■ Advanced Server DS120 worker node configuration
  ● 2 Intel 4210 processors
  ● 368 GB RAM
  ● 12 × 1.8 TB SAS drives
■ Advanced Server DS220 worker node configuration
  ● 2 Intel 4210 processors
  ● 368 GB RAM
  ● 12 × 6 TB SATA drives


Multiple rack configuration

This example deployment shows a Hitachi Advanced Server DS120 solution and a similar Hitachi Advanced Server DS220 solution. The spine switch placement is not shown.


To provide high availability in a multi-rack system, spread the node types out across multiple racks. The following figure shows a three-rack Hitachi Advanced Server DS120 configuration.
■ First rack
  ● Top-of-rack data and management switches
  ● 2 master nodes
  ● 1 edge node
  ● 1 utility node
  ● 1 hardware management node
  ● 32 worker nodes
■ Second rack
  ● Top-of-rack data and management switches
  ● 1 edge node
  ● 2 master nodes
  ● 33 worker nodes
■ Third rack
  ● Top-of-rack data and management switches
  ● 1 master node
  ● 1 utility node
  ● 32 worker nodes
■ Advanced Server DS120 worker node configuration
  ● 2 Intel 4210 processors
  ● 368 GB RAM
  ● 12 × 1.8 TB SAS drives
■ Total resources
  ● 99 worker nodes
  ● 198 CPUs
  ● 36432 GB RAM
  ● 2138.4 TB raw data storage


The following figure shows a similar three-rack deployment using a Hitachi Advanced Server DS220 configuration.
■ First rack
  ● Top-of-rack data and management switches
  ● 2 master nodes
  ● 1 edge node
  ● 1 utility node
  ● 1 hardware management node
  ● 14 worker nodes
■ Second rack
  ● Top-of-rack data and management switches
  ● 2 master nodes
  ● 1 edge node
  ● 15 worker nodes
■ Third rack
  ● Top-of-rack data and management switches
  ● 1 master node
  ● 1 utility node
  ● 16 worker nodes
■ Advanced Server DS220 worker node configuration
  ● 2 Intel 4210 processors
  ● 368 GB RAM
  ● 12 × 1.8 TB SAS drives
■ Total resources
  ● 45 worker nodes
  ● 90 CPUs
  ● 16560 GB RAM
  ● 3420 TB raw data storage


CDP Private Cloud Data Service solution design

CDP Private Cloud Data Services can use either the Red Hat OpenShift Container Platform or the Experiences Compute Service (ECS) to run its Data Services. These can run on either bare metal machines or VMware virtual machines. This reference architecture does not cover using VMware.

Currently, the CDP Private Cloud Plus Edition OpenShift cluster must be dedicated to CDP use. CDP supports Red Hat CoreOS for the master and bootstrap nodes. The helper node can run CentOS or RHEL 7.


OpenShift cluster

Nodes

OpenShift has the following node types:
■ The master nodes run OpenShift services and Kubernetes services.
■ The worker nodes run Kubernetes Operators and CDP Private Cloud Experiences.
■ The helper nodes use Ansible to install and configure external services needed by the OpenShift cluster. These services are DNS, haproxy, bind, dhcpd, pxe, tftp, and an NFS server.
■ The bootstrap nodes are used to bootstrap the installation. After the cluster is running, they can be converted to worker nodes.

Cloudera's recommended minimum initial deployment is three master nodes and ten worker nodes. A good minimum deployment for a production cluster is 3 master nodes and 20 worker nodes. A cluster should contain one bootstrap node and one helper node. The following lists the recommended configurations for the OpenShift cluster. While not shown, DS220 options are available. Cloudera requires all worker nodes in the OpenShift cluster to have the same configuration.

■ Server: Hitachi Advanced Server DS120
■ CPU (from the Intel Xeon Scalable processor family):
  ● Master nodes: 2 × Intel 4210R processors (10-core, 2.2 GHz)
  ● Worker nodes: 2 × Intel Xeon Gold 6240R processors
  ● Helper node: 2 × Intel 4210R processors (10-core, 2.2 GHz)
  ● Bootstrap node: 2 × Intel Xeon Gold 6240R processors
■ Memory:
  ● Master nodes: 2 × 32 GB DIMMs
  ● Worker nodes: 16 × 32 GB DIMMs
  ● Utility node: 16 × 32 GB DIMMs
■ Network connections: 1 × Mellanox CX-4 LX EN 25 GbE dual-port SFP28; 1 GbE LOM management port
■ Disk controller: 1 × SAS 3516 RAID controller
■ Operating system devices: 2 × 128 GB MLC SATADOM for the operating system
■ Data disks:
  ● Master nodes: 2 × 960 GB SSD
  ● Worker nodes: 2 × 3.32 TB NVMe
  ● Helper node: 2 × 960 GB SSD
  ● Bootstrap node: 2 × 3.32 TB NVMe

CDP supports Red Hat CoreOS for the master and bootstrap nodes. The helper node runs either CentOS or RHEL 7.

Storage

Red Hat OpenShift needs storage. Unlike other OpenShift deployments, Cloudera does not require a lot of storage for the cluster. The primary storage is on the CDP Base cluster in HDFS or Ozone.

The OpenShift cluster has three types of storage: local, persistent volumes (PV), and NFS.

■ Local storage: CDW requires local temporary storage.
■ Persistent volumes:
  ● While OpenShift supports many PV storage options, Cloudera currently supports a limited set, with others to be added in the future.
  ● Red Hat Ceph Storage can use storage on the worker nodes or have its own nodes. If the storage is on the worker nodes, you might want to add drives to the nodes. Information about hardware configurations, compatibility, architecture, and deploying Red Hat Ceph Storage can be found on the Red Hat website.
■ NFS: CML requires NFS storage. It can be internal to the cluster and provisioned by the bootstrapping process, or it can be on external servers that are not part of the cluster.

The current storage requirements for the OpenShift cluster are as follows:
■ Control plane (manages the OpenShift cluster): 250 GB of PV storage.
■ CDW: 600 GB of local storage per executor (node), with 100 GB of PV storage per executor.
■ CML: 1 TB of PV storage per workspace (cluster of containers), with 1 TB or more of NFS storage per workspace.

Experiences Compute Service Cluster

CDP Private Cloud Experiences Compute Service deploys and manages an embedded Kubernetes infrastructure. This allows a simpler environment to be used than the OpenShift cluster.


When you create a Private Cloud Experiences cluster, two new services are added:

■ Experiences Compute Service (ECS) service. The ECS service has two roles:
  ● ECS Server: runs on a single host in the CDP Private Cloud Experiences cluster.
  ● ECS Agent: runs on all hosts except the host running the Server role in the CDP Private Cloud Experiences cluster.
■ Docker service. The Docker service has a single role:
  ● Docker Server: runs on all hosts in the CDP Private Cloud Experiences cluster.

The following table lists the recommendations for deploying an ECS cluster.

Component | Minimum | Recommended
Number of servers (bare metal or VMs) | 10 | 16
Cores per server | 16 | 48
RAM | 64 GB | 384 GB
Storage | 1 × 2 TB SSD | 1 × 4 TB SSD or NVMe
Network | 25 GbE or better | 25 GbE or better

The following are the software requirements for ECS:
■ Cloudera Manager 7.5.1
■ CDP Private Cloud Base 7.1.6 or 7.1.7
■ CentOS 7.8, Red Hat Enterprise Linux 7.8, or Oracle Linux 7.8

Engineering validation

There are many different configurations of both hardware and software that can be used in this solution. A basic validation of this solution was done in the lab environment at Hitachi Vantara. It involved deploying a basic Apache Hadoop Distributed File System in a small cluster of mixed node configurations. Physical servers were used for storage and processing purposes.


The following servers were used:
■ 4 × Hitachi Advanced Server DS120
  ● 4 drives for Hadoop Distributed File System
  ● 2 Intel 4210 processors
  ● 386 GB of memory
■ 1 × Hitachi Advanced Server DS220 using large form factor drives
  ● 4 drives for Hadoop Distributed File System
  ● 2 Intel 4210 processors
  ● 386 GB of memory
■ 1 × Hitachi Advanced Server DS220 using 24 small form factor drives
  ● 4 drives for Hadoop Distributed File System
  ● 2 Intel 4210R processors
  ● 386 GB of memory
■ Virtual machines for the master node

The following software was used:
■ Cloudera Data Platform 7.1.4
■ Red Hat Enterprise Linux 7.6

Operating system-level storage testing

Two basic disk performance tests were performed. The tests are based upon Cloudera's hardware verification tests. They were performed to get a basic understanding of the hardware.


The first set of tests used hdparm to validate disk read performance and dd to measure sequential write performance on SAS, SSD, and NVMe drives. Different drive models have different performance results. The following dd write results were measured:

■ SAS

  dd bs=16k count=1024000 if=/dev/zero of=/data2/img2.img conv=fdatasync
  1024000+0 records in
  1024000+0 records out
  16777216000 bytes (17 GB) copied, 88.4336 s, 190 MB/s

■ SSD

  dd bs=16k count=1024000 if=/dev/zero of=/ssd/img2.img conv=fdatasync
  1024000+0 records in
  1024000+0 records out
  16777216000 bytes (17 GB) copied, 41.3986 s, 405 MB/s

■ NVMe

  dd bs=16k count=1024000 if=/dev/zero of=/nvmetest/img2.img conv=fdatasync
  1024000+0 records in
  1024000+0 records out
  16777216000 bytes (17 GB) copied, 18.0212 s, 931 MB/s
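The hdparm read checks referenced above take the form below; the device name is a hypothetical example, and each data drive is tested in turn:

  # Buffered sequential read timing for one data drive:
  hdparm -t /dev/sdb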

Hadoop Distributed File System-level storage testing

The next set of tests was a basic verification of the read and write performance of the Hadoop Distributed File System. These tests used TestDFSIO.

■ Write
  ● Number of files: 300
  ● Total MB processed: 3072000
  ● Throughput MB/s: 8.35
  ● Average I/O rate MB/s: 8.58
  ● I/O rate standard deviation: 1.34
■ Read
  ● Number of files: 300
  ● Total MB processed: 3072000
  ● Throughput MB/s: 15.56
  ● Average I/O rate MB/s: 16.78
  ● I/O rate standard deviation: 5.11
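TestDFSIO runs of this shape can be reproduced with the MapReduce job-client tests JAR; the parcel path and exact option names below are assumptions that vary slightly between Hadoop releases, and the sizes match the totals above (300 files of 10 GB each):

  hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-*-tests.jar \
      TestDFSIO -write -nrFiles 300 -size 10GB
  hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-*-tests.jar \
      TestDFSIO -read -nrFiles 300 -size 10GB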


Processing testing

There are many software packages that process Hadoop Distributed File System data. The two most popular are MapReduce and Spark. TeraGen and TeraSort were run using both packages.

■ MapReduce

  The following are sample commands used for TeraGen and TeraSort. The commands were run for 1 TB data sets:

  yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar teragen \
      -Dmapreduce.job.maps=600 -Dmapreduce.map.memory.mb=4096 10000000000 \
      TS_input >> 1TBgen.txt

  yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar terasort \
      -Dmapreduce.job.maps=600 -Dmapreduce.map.memory.mb=4096 \
      -Dmapreduce.reduce.memory.mb=4096 -Dmapreduce.reduce.cpu.vcores=2 \
      TS_input TS_output > 1TBsort.txt

  TeraGen 1 TB CPU time spent (ms) = 20331300
  TeraSort 1 TB CPU time spent (ms) = 165554980

■ Spark

  The following commands are a sample of the commands used to generate and sort data using SparkBench:

  spark-submit --executor-memory 150G --driver-cores 10 --executor-cores 40 \
      sparkbench_2.11-1.0.14.jar generate 10000000000 /input1tb 40 > spark1tbgen.out

  nohup spark-submit --executor-memory 150G --driver-cores 10 --executor-cores 40 \
      sparkbench_2.11-1.0.14.jar sort /input1tb /output1tb > spark1tbsort.out

The results are the following:
■ 1 TB data generation: 9066 seconds
■ 1 TB data sort: 8115 seconds

Product descriptions

The following products are used in this solution.

Hitachi Advanced Server DS120 G2

With support for two Intel Xeon Scalable processors in just 1U of rack space, the Hitachi Advanced Server DS120 G2 delivers exceptional compute density. It provides flexible memory and storage options to meet the needs of converged and hyperconverged infrastructure solutions, as well as of dedicated application platforms such as internet of things (IoT) and data appliances.


The Intel Xeon Scalable processor family is optimized to address the growing demands on today's IT infrastructure. The server provides 32 slots for high-speed DDR4 memory, allowing up to 4 TB of memory capacity with RDIMM population (128 GB × 32) or 8 TB (512 GB × 16) of Intel Optane Persistent Memory. DS120 G2 supports up to 12 hot-pluggable, front-side-accessible 2.5-inch non-volatile memory express (NVMe), serial-attached SCSI (SAS), or serial-ATA (SATA) hard disk drives (HDD) or solid-state drives (SSD). The system also offers 2 onboard M.2 slots.

With these options, DS120 G2 can be flexibly configured to address both I/O performance and capacity requirements for a wide range of applications and solutions.

Hitachi Advanced Server DS220 G2

With a combination of two Intel Xeon Scalable processors and high storage capacity in a 2U rack-space package, Hitachi Advanced Server DS220 G2 delivers the storage and I/O to meet the needs of converged solutions and high-performance applications in the data center.

The Intel Xeon Scalable processor family is optimized to address the growing demands on today's IT infrastructure. The server provides 32 slots for high-speed DDR4 memory, allowing up to 4 TB of memory capacity with RDIMM population (128 GB × 32) or 8 TB (512 GB × 16) with Intel Optane Persistent Memory population.

DS220 G2 comes in three storage configurations to allow for end-user flexibility. The first configuration supports 24 2.5-inch non-volatile memory express (NVMe) drives, the second supports 24 2.5-inch serial-attached SCSI (SAS) or serial-ATA (SATA) drives and up to 8 NVMe drives, and the third supports 12 3.5-inch SAS or SATA drives and up to 8 NVMe drives. All the configurations support hot-pluggable, front-side-accessible drives as well as 2 optional 2.5-inch rear-mounted drives. The DS220 G2 delivers high I/O performance and high capacity for demanding applications and solutions.

Hitachi Advanced Server DS120

Optimized for performance, high density, and power efficiency in a dual-processor server, Hitachi Advanced Server DS120 delivers a balance of compute and storage capacity. This 1U rack-mounted server has the flexibility to power a wide range of solutions and applications.

The highly scalable memory supports up to 3 TB using 24 slots of high-speed DDR4 memory. Advanced Server DS120 is powered by the Intel Xeon Scalable processor family for complex and demanding workloads. There are flexible OCP and PCIe I/O expansion card options available. This server supports up to 12 small form factor storage devices with up to 4 NVMe drives.

This solution allows you to have a high CPU-to-storage ratio. This is ideal for balanced and compute-heavy workloads.

Multiple CPU and storage devices are available. Contact your Hitachi Vantara sales representative to get the latest list of options.

Hitachi Advanced Server DS220

With a combination of two Intel Xeon Scalable processors and high storage capacity in a 2U rack-space package, Hitachi Advanced Server DS220 delivers the storage and I/O to meet the needs of converged solutions and high-performance applications in the data center.


The Intel Xeon Scalable processor family is optimized to address the growing demands on today's IT infrastructure. The server provides 24 slots for high-speed DDR4 memory, allowing up to 3 TB of memory per node when 128 GB DIMMs are used. This server supports up to 12 large form factor storage devices and an additional 2 small form factor storage devices.

This server has three storage configuration options:
■ 12 large form factor storage devices and an additional 2 small form factor storage devices in the back of the chassis
■ 16 SAS or SATA drives, 8 NVMe drives, and an additional 2 small form factor storage devices in the back of the chassis
■ 24 SFF devices and an additional 2 SFF storage devices in the back of the chassis

Hitachi Unified Compute Platform Advisor

Hitachi Unified Compute Platform Advisor (UCP Advisor) is comprehensive cloud infrastructure management and automation software that enables IT agility and simplifies day 0-N operations for edge, core, and cloud environments. The fourth-generation UCP Advisor accelerates application deployment and drastically simplifies converged and hyperconverged infrastructure deployment, configuration, life cycle management, and ongoing operations with advanced policy-based automation and orchestration for private and hybrid cloud environments.

The centralized management plane enables remote, federated management for the entire portfolio of converged, hyperconverged, and storage data center infrastructure solutions to improve operational efficiency and reduce management complexity. Its intelligent automation services accelerate infrastructure deployment and configuration, significantly minimizing deployment risk and reducing provisioning time and complexity, automating hundreds of mandatory tasks.


Hitachi Vantara

Corporate Headquarters

2535 Augustine Drive

Santa Clara, CA 95054 USA

HitachiVantara.com | community.HitachiVantara.com

Contact Information

USA: 1-800-446-0744

Global: 1-858-547-4526

HitachiVantara.com/contact