DRIVESCALE-MAPR Reference Architecture
©2018 DriveScale Inc. All Rights Reserved.
Table of Contents
Glossary of Terms . . . . . . . . . . 4
Table 1: Glossary of Terms . . . . . . . . . . 4
1. Executive Summary . . . . . . . . . . 5
2. Audience and Scope . . . . . . . . . . 5
3. DriveScale Advantage . . . . . . . . . . 5
Flex your data center with DriveScale for Big Data workloads . . . . . . . . . . 5
4. MapR Advantage . . . . . . . . . . 6
5. Industry Standard Servers and JBODs . . . . . . . . . . 6
6. DriveScale-MapR Solution Overview . . . . . . . . . . 7
7. DriveScale Components Overview . . . . . . . . . . 8
7.1 The DriveScale Composer, Server Agents and DriveScale Central . . . . . . . . . . 8
7.2 DriveScale HDD Appliance . . . . . . . . . . 9
7.3 DriveScale Solution Conceptual Diagram . . . . . . . . . . 10
8. Reference Architecture Details . . . . . . . . . . 11
8.1 Physical Cluster Components and Configuration List . . . . . . . . . . 11
8.2 Logical Cluster Topology . . . . . . . . . . 12
8.3 Physical Cluster Topology . . . . . . . . . . 13
8.4 Cluster Management . . . . . . . . . . 14
8.5 Disk and Filesystem Layout . . . . . . . . . . 16
8.6 DriveScale-MapR OS Supportability/Compatibility Matrix . . . . . . . . . . 17
9. Rack Scalability . . . . . . . . . . 17
10. References . . . . . . . . . . 17
11. Bill of Materials . . . . . . . . . . 18
12. Conclusion . . . . . . . . . . 18
Glossary of Terms
Table 1: Glossary of Terms
Data Node: Worker nodes of the cluster to which the MapR-FS data is written.
HDD Appliance: The DriveScale HDD Appliance is a 1RU Ethernet-to-SAS adapter serving as a bridge between 10 Gbps Ethernet-connected compute resources and JBODs full of commodity disks.
DriveScale Central: A web-based user interface to the DriveScale cloud that performs DriveScale account management. DriveScale Central (DSC) is where you download the keys that enable installation of the DriveScale software and then set up your DriveScale Management Domains (DMDs): create a domain, select and configure the Composer nodes for the domain, and select a chassis (with its associated DriveScale HDD Appliance) for the domain.
DriveScale Composer: Software that creates composable infrastructure from a set of diskless servers and disk drives.
HDD: Hard disk drive.
MapR-FS: The MapR distributed file system.
High Availability: The configuration that addresses availability issues in a cluster. In a standard configuration, the Name Node is a single point of failure (SPOF): each cluster has a single Name Node, and if that machine or process becomes unavailable, the cluster as a whole is unavailable until the Name Node is restarted or brought up on a new host. The secondary Name Node does not provide failover capability. High availability enables running two Name Nodes in the same cluster, an active and a standby; the standby Name Node allows a fast failover to a new Name Node in the case of a machine crash or planned maintenance.
JBOD: Just a bunch of disks; a collection of hard disks that have not been configured to act as a redundant array of independent disks (RAID).
Job History Server: The process that archives job metrics and metadata. One per cluster.
MLAG: Multi-chassis Link Aggregation; the ability of two or more switches to act like a single switch when forming link bundles.
Master/Control/Administrator Node: The metadata master of MapR, essential for the integrity and proper functioning of the distributed filesystem.
NIC: Network interface card.
Node Manager: The process that starts application processes and manages resources on the Data Nodes.
PDU: Power distribution unit.
ToR: Top of rack.
ZooKeeper: A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
1. Executive Summary
DriveScale has engineered a next-generation Software Composable Infrastructure (SCI) solution designed to fundamentally change the way data center teams design, deploy, manage and consume hardware and software resources. DriveScale provides capabilities to IT operators to connect disaggregated pools of resources in an intelligent manner, and then to manage, modify and scale these resources over time. SCI results in a higher performance and more easily deployed infrastructure with fluid resources that can be used for modern Big Data workloads to significantly improve agility.
This document is a high-level design reference architecture guide for implementing MapR with DriveScale SCI on industry standard servers and JBODs. It introduces the high-level hardware and software components included in the stack, then describes each component individually. The reference architecture does not describe the MapR components or their applications.
2. Audience and Scope
This reference architecture guide is for Big Data and IT architects who are responsible for the design and deployment of MapR solutions on premises in data centers, as well as for Apache Hadoop administrators and architects, and data center architects or engineers who collaborate with specialists in that space.
3. DriveScale Advantage
Flex your data center with DriveScale for Big Data workloads
Big Data workloads have become an integral part of traditional data centers. They are designed to scale by adding more hardware resources to the same compute cluster. This flexibility is built into Big Data vendor software.
However, typical Big Data deployments have multiple limitations:
• Administrators can’t respond quickly to changing application stacks and data velocity.
• Deployments are over-provisioned with resources and under-utilized in order to meet service level guarantees.
• Multiple silos of hardware are created for each different application workload.
DriveScale’s rackscale SCI architecture provides solutions to all of these modern Big Data deployment limitations. With DriveScale’s SCI solution, administrators can flexibly deploy and manage independent pools of compute and storage resources at a lower cost without making changes to the application stack.
4. MapR Advantage
The MapR Converged Data Platform is a proven solution for delivering business value in data-driven companies. The MapR Converged Data Platform delivers speed, scale, and reliability, driving both operational and analytical workloads in a single platform. The MapR Platform is designed to deliver:
• High availability
• Ease of data integration
• Lower total cost of ownership
5. Industry Standard Servers and JBODs
The DriveScale solution works with any type of industry standard x86 server. Customers can customize the server with any compute (memory and CPU) configuration and purchase from existing OEMs and channels.
DriveScale recommends the purchase of high capacity JBODs (Just a Bunch of Disks) with dual hot-pluggable IO controllers (Expanders) and enough upstream bandwidth. The JBODs should also have dual hot-pluggable redundant power supplies. DriveScale has evaluated and tested various vendor offerings for redundancy, management functionality and performance. The table below lists DriveScale certified products with model numbers.
Table 2: DriveScale Certified JBODs
Dell: PowerVault MD3060e - 3.5" and 2.5", 60 bays, 4U, redundant expanders, 2 x 3 x mini-SAS 6G
Dell: Storage MD1280 - 3.5", 84 bays, 5U, redundant expanders, 2 x 3 x mini-SAS 6G
Hewlett Packard Enterprise: D6020 - 3.5", 70 bays, 5U, quad expanders, 4 x 2 x mini-SAS 12G
Hewlett Packard Enterprise: D6000 - 3.5", 70 bays, 5U, quad expanders, 4 x 2 x mini-SAS 6G
RAID Inc./Newisys: NDS-4600/4603 - 3.5", 60 bays, 4U, redundant expanders, 2 x 4 x mini-SAS 6G
RAID Inc./Newisys: NDS-2241 - 2.5", 24 bays, 2U, redundant expanders, 2 x 3 x mini-SAS 6G
RAID Inc./Newisys: NDS-4900 - 3.5", 90/96 bays, 4U, redundant expanders, 2 x 6 x mini-SAS-HD 12G
RAID Inc./Newisys: NDS-4900 - 3.5", 84 bays, 4U, redundant expanders, 2 x 5 x mini-SAS-HD 12G
Quanta (QCT): M6400H - 3.5", 60 bays, 4U, redundant expanders, 2 x 4 x mini-SAS 6G
Quanta (QCT): JB4602 - 3.5", 60 bays, 4U, redundant expanders, 2 x 4 x mini-SAS 12G
Promise Inc.: J5300s - 3.5", 12 bays, 2U, redundant expanders, 2 x 2 x mini-SAS-HD 12G
Promise Inc.: J5320s - 2.5", 24 bays, 2U, redundant expanders, 2 x 2 x mini-SAS-HD 12G
Promise Inc.: J5600 - 3.5", 16 bays, 3U, redundant expanders, 2 x 2 x mini-SAS-HD 12G
Promise Inc.: J5800 - 3.5", 24 bays, 4U, redundant expanders, 2 x 2 x mini-SAS-HD 12G
Lenovo: D3284 - 3.5", 84 bays, 5U, redundant expanders, 3 x 4 x mini-SAS-DS 12G
HGST: 4U60G2 - 3.5", 60 bays, 4U, redundant expanders, 2 x 4 x mini-SAS-HD 12G
HGST: Ultrastar Data102 - 102 bays, 4U, redundant expanders, 2 x 6 x mini-SAS-HD 12G
IBM: ESS JBOD Storage (5U84) - 3.5" and 2.5", 84 bays, 5U, redundant expanders, 2 x 3 x mini-SAS 12G
6. DriveScale-MapR Solution Overview
The DriveScale-MapR™ Big Data Solution is designed to address the complexity and silos that result from deploying different workloads on different clusters. The solution is designed with software defined composability as the primary goal. The composability lowers capex and opex costs, improves utilization by eliminating silos, and greatly simplifies the deployment of Big Data workload and analytics clusters.
Hadoop and other Apache projects are developed in Java and other programming languages by a global community of contributors. Yahoo, which has been the largest contributor to the project, uses Apache Hadoop extensively across its businesses. Core committers on the Hadoop project include employees from MapR, Cloudera, eBay, Facebook, Getopt, Hortonworks, Huawei, IBM, InMobi, INRIA, LinkedIn, Microsoft, Pivotal, Twitter, UC Berkeley, VMware, WANdisco, Yahoo, and many more individuals and organizations.
Hadoop deployments, other Apache projects, and 3rd party compute engines and custom apps for Big Data workloads are widely popular, but installing, configuring, and running a production cluster has challenges, including:
• Choosing the appropriate Big Data software distribution and extensions
• Installing the monitoring and management software
• Allocating Big Data services to physical nodes
• Selecting appropriate server hardware
• Rightsizing the storage configuration
• Implementing data locality
• Designing the network fabric
• Sizing and scaling the system
• Managing overall performance
These concerns are complicated by the need to understand the workloads running on the cluster, keep up with the fast-moving pace of core Apache projects, and manage a system designed to scale to thousands of nodes in a single instance.
The DriveScale-MapR Big Data Solution embodies all the hardware, software, resources and services needed to run a Big Data deployment as a single solution in a production environment. This end-to-end solution is specifically designed to accelerate large scale production while delivering the compute and storage performance needed. The solution components include the MapR Converged Data Platform, DriveScale software and hardware, industry standard servers, network switches, and JBODs built with standard disk drives.
These components span the entire solution stack:
• Reference architecture
• Optimized storage configurations
• Optimized network infrastructure
• MapR Converged Data Platform
This solution is designed to address the vast majority of Big Data and Apache Hadoop use cases, including but not limited to:
• Big data analytics
• ETL offload
• Data warehouse optimization
• Batch processing of unstructured data
• Big data visualization
• Search and predictive analysis
• Real-Time analytics and stream processing
7. DriveScale Components Overview
The DriveScale system is composed of the DriveScale Composer and Server Agents (software), DriveScale Central (a cloud service) and the DriveScale HDD Appliance (hardware).
7.1 The DriveScale Composer, Server Agents and DriveScale Central
a) The DriveScale Composer
The DriveScale Composer is the heart of the DriveScale SCI solution. The Composer holds the inventory of all resources, composes clusters of compute and storage resources via simple GUI control, monitors and manages clusters, and returns resources to pools for re-use when workloads have finished running.
• The server running the Composer software is called the Composer node. A typical deployment consists of three Composer nodes in a clustered configuration for high availability (HA).
• The Composer contains the configuration and information database for:
  - Inventory: DriveScale Composer nodes, DriveScale HDD Appliances, network switches, JBODs, chassis, disk drives and composable server nodes.
  - Configuration: node templates, cluster templates, configured clusters.
  - Composer Database: used as a message bus to communicate with the servers and drives.
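The Composer's pool-and-template model described above can be illustrated with a short Python sketch. All class names and fields below are hypothetical illustrations of the concept, not DriveScale's actual API:

```python
# Hypothetical sketch of the Composer's resource model: free pools of
# servers and drives are combined into a cluster from a node template.
# Names and fields are illustrative only, not a DriveScale API.
from dataclasses import dataclass, field

@dataclass
class NodeTemplate:
    name: str
    drives_per_node: int

@dataclass
class Inventory:
    free_servers: list = field(default_factory=list)
    free_drives: list = field(default_factory=list)

    def compose(self, template: NodeTemplate, node_count: int):
        """Bind free drives to free servers; return the composed nodes."""
        needed = node_count * template.drives_per_node
        if len(self.free_servers) < node_count or len(self.free_drives) < needed:
            raise RuntimeError("not enough free resources in the pool")
        nodes = []
        for _ in range(node_count):
            server = self.free_servers.pop()
            drives = [self.free_drives.pop()
                      for _ in range(template.drives_per_node)]
            nodes.append((server, drives))
        return nodes

# Pools matching this reference architecture: 5 servers, 60 JBOD drives.
inv = Inventory(free_servers=[f"srv{i}" for i in range(5)],
                free_drives=[f"hdd{i}" for i in range(60)])
cluster = inv.compose(NodeTemplate("data-node", drives_per_node=8),
                      node_count=5)
```

When a workload finishes and nodes are decommissioned, the servers and drives return to the free pools for re-use, which is the behavior the paragraph above describes.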
b) DriveScale Server Agents
The DriveScale Server Agent is installed on all servers to be composed. The Server Agent provides inventory to the DriveScale Composer and creates mappings between composed server nodes and disk drives.
c) DriveScale Central
DriveScale Central is a cloud-based portal that provides a repository for:
• Software distribution
• DriveScale keys
• Centralized log files
• User documentation
• License manager
7.2 DriveScale HDD Appliance
The DriveScale HDD Appliance is a 1U appliance with adapters that connect to servers via 10Gb Ethernet interfaces and to JBODs via SAS interfaces. The HDD Appliance software allows JBOD drives to be mapped to servers and used as local drives.
Figure 1: DriveScale HDD Appliance
7.3 DriveScale Solution Conceptual Diagram
Figure 2: DriveScale Cluster Components
8. Reference Architecture Details
8.1 Physical Cluster Components and Configuration List
Table 3: Cluster Physical Components List
Component | Configuration | Description | Quantity
DriveScale HDD Appliance | DHCP, jumbo frames enabled | 1U appliance with adapters to connect servers via Ethernet and JBODs via SAS. | 1
DriveScale HDD Appliance Controller | DHCP, jumbo frames enabled | Provides the data network. | 4 for each chassis
DriveScale Composer | DriveScale Composer running on a VM | Manages and configures nodes and clusters; stores the inventory/configuration repository of each component. | Min. 1; for HA, 3 Composers configured as master and slaves
Servers | Two-socket CPU and memory per the individual Hadoop cluster requirements | Commodity x86 servers that house all the NodeManager compute instances and DriveScale Server Agents. | Min. 1 name node + 3 data nodes
HDD for Servers | Two drives configured in RAID 1 | The internal drives should be used for the OS install. | 2 for each server
NICs | Dual-port 10 Gbps Ethernet NICs; the connector type (SFP+ or twinax) depends on the network design | Provides the data network. | Min. 1 for each server
JBOD | Default configuration | Houses the disk drives, with dual IO controllers. | Min. 1
HDD for JBOD | Default configuration | Disk drives that store data for the cluster. | Depends on the cluster requirements
ToR 10G switch | LLDP, MLAG, 9K jumbo frames configured | Provides data network connectivity. | 2 for each rack
ToR 1G switch | Default configuration | Provides management network connectivity. | 1 for each rack
MapR Installer | MapR Installer server running on a VM | Manages, configures and monitors the MapR cluster. | 1 for each environment
8.2 Logical Cluster Topology
The minimum requirements to build out the cluster are:
• 1 Administrator node
• 4 Data Nodes
• 1 DriveScale HDD Appliance
• 1 DriveScale Composer
• 2 10G switches
• 1 1G switch
• 1 JBOD chassis with disk drives
• 1 MapR Installer server
This reference architecture is built on one administrator node and four data nodes, with one JBOD holding 60 x 1 TB HDDs. The following table lists the server configurations and the number of drives used.
Table 4: Server Configuration

Component | Configuration | Description | Quantity
Controller/Administrator node | Two-socket 20-core CPU, 256 GB RAM, 10GbE Intel NIC, two internal HDDs for the OS and eight high-capacity HDDs mounted from the JBOD | Hosts the MapR node and agents, along with the DriveScale Server Agents. | 1
Data nodes | Two-socket 16-core CPU, 256 GB RAM, 10GbE Intel NIC, two internal HDDs for the OS and eight high-capacity HDDs mounted from the JBOD | House the MapR-FS nodes, ZooKeepers, CLDB, YARN NodeManagers and any additional required services, with DriveScale agents. | 4
Notes:
- Customers with higher (or lower) compute needs can acquire bigger (or smaller) data nodes configured with CPU and memory that fits the specific requirements of their applications.
- Similarly, depending on the data requirements, customers can add or remove disk drives to match the specific needs of their applications.
- Due to MapR’s distributed metadata model, all five nodes can be used for data and processing.
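Given the five nodes above, each mounting eight JBOD drives, usable capacity can be estimated with a quick back-of-the-envelope calculation. A sketch, assuming the 1 TB drives of this reference configuration and MapR-FS's default 3x replication:

```python
# Rough capacity estimate for the reference configuration:
# 5 nodes x 8 JBOD drives x 1 TB, with MapR-FS default 3x replication.
nodes = 5
drives_per_node = 8
drive_tb = 1.0
replication = 3

raw_tb = nodes * drives_per_node * drive_tb       # 40 TB raw
usable_tb = raw_tb / replication                  # ~13.3 TB usable
print(f"raw: {raw_tb:.0f} TB, usable (3x replication): {usable_tb:.1f} TB")
```

The same arithmetic applies when drives are added or removed per the notes above: usable capacity scales linearly with the number of drives mounted to the nodes.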
8.3 Physical Cluster Topology
Figure 3: DriveScale Lab Architecture with 1 HDD Appliance (4x Adapters in use), 1 JBOD, 1 Controller/Data Node and 4 Data Nodes
8.4 Cluster Management
This section details the steps for setting up a DriveScale-enabled Hadoop cluster using the MapR Installer server.
8.4.1 Setting up the DriveScale Cluster
The following tasks must be completed to set up the DriveScale solution before installing (or reusing an existing install of) the MapR Installer server:
1. Rack and install the DriveScale HDD Appliance using the supplied documentation.
2. Rack and install the JBOD using the documentation provided by the vendor.
3. Rack and install the servers using the documentation provided by the vendor.
4. Create a RAID 1 (or other RAID of your choice) configuration for the internal HDDs on each server, and install the OS on all servers.
5. Install and configure the DriveScale Composer on a VM or a standalone server.
6. Set up the HDD Appliance configuration from the Composer.
7. Install and configure the DriveScale Server Agents on the master and data nodes.
8. Create master/data node and cluster template with required disk drives using the Composer.
9. Create the cluster from the template using the Composer.
10. Ensure that the DriveScale cluster is up and running before proceeding.
Figure 4: Logical Cluster status from the Composer UI
Figure 5: Logical cluster details overview from the Composer UI
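Once the cluster is composed (step 10), each node should see its JBOD drives as ordinary local block devices. A minimal Python sketch of that sanity check, operating on the JSON output of "lsblk -J -d -o NAME,TYPE"; the sample data and the assumption of two internal OS disks are illustrative:

```python
# After composition, each data node should see its eight JBOD drives
# as local block devices alongside the two internal RAID-1 OS drives.
# This helper counts the data disks in lsblk JSON output; the sample
# below is illustrative, not captured from a real node.
import json

def count_data_disks(lsblk_json: str, os_disks: int = 2) -> int:
    """Disks visible to the node, minus the internal OS drive pair."""
    devices = json.loads(lsblk_json)["blockdevices"]
    disks = [d for d in devices if d["type"] == "disk"]
    return len(disks) - os_disks

# Fabricated sample: ten disks (sda..sdj) reported by lsblk.
sample = json.dumps({"blockdevices":
    [{"name": f"sd{chr(97 + i)}", "type": "disk"} for i in range(10)]})
print(count_data_disks(sample))  # 8: ten disks seen minus two OS drives
```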
Figure 6: Logical cluster server details from the Composer UI
8.4.2 Setting up the MapR Cluster
1. After the successful completion of the steps above, review the MapR prerequisites to ensure all requirements are met.
2. Install the MapR Installer server using the MapR Installer guide.
3. Launch the MapR Installer from a browser and follow the onscreen instructions to install the services required on your MapR cluster.
4. For this reference architecture, only the YARN and MapReduce services were set up.
Figure 7: Installed services details from MapR Installer after successful installation
Figure 8: Installed services details from MapR UI Dashboard
5. Ensure that the control and data nodes are up and running with the right assigned roles and storage.
Figure 9: Storage overview from MapR UI Dashboard
8.5 Disk and Filesystem Layout
Node/Role | Disk and Filesystem Layout | Description
Management/Master/YARN NodeManagers | MapR-FS | 2 TB drives are mounted from the JBODs.
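On each node, the mounted JBOD drives are handed to MapR-FS via MapR's disksetup utility, which consumes a text file listing one device path per line. A sketch of generating that list; the device names are examples and must be confirmed against each node's actual inventory before formatting anything:

```python
# Sketch: build the disk list consumed by MapR's disksetup utility,
# one device path per line. Device names (sdc..sdj) are illustrative;
# always confirm against lsblk on the node before formatting drives.
def disks_file_contents(first: str = "sdc", count: int = 8) -> str:
    start = ord(first[-1])
    devices = [f"/dev/sd{chr(start + i)}" for i in range(count)]
    return "\n".join(devices) + "\n"

contents = disks_file_contents()  # /dev/sdc .. /dev/sdj
# Written to e.g. /tmp/disks.txt, the file is then passed on each node:
#   /opt/mapr/server/disksetup -F /tmp/disks.txt
print(contents.splitlines()[0])  # /dev/sdc
```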
8.6 DriveScale-MapR OS Supportability/Compatibility Matrix
OS | Composer | Server Nodes | MapR
CentOS/RHEL 6.x | X | X | -
CentOS/RHEL 7.x | X | X | X
Ubuntu 14.04 | X | X | X
Ubuntu 16.04 | X | X | X
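The matrix above can be expressed as a quick programmatic lookup, which is handy when validating a planned deployment. A sketch; the table data simply mirrors section 8.6:

```python
# The section 8.6 support matrix as a lookup table. A full-stack
# deployment needs Composer, Server Agent and MapR support on one OS.
SUPPORT = {
    "CentOS/RHEL 6.x": {"composer": True, "server": True, "mapr": False},
    "CentOS/RHEL 7.x": {"composer": True, "server": True, "mapr": True},
    "Ubuntu 14.04":    {"composer": True, "server": True, "mapr": True},
    "Ubuntu 16.04":    {"composer": True, "server": True, "mapr": True},
}

def full_stack_supported(os_name: str) -> bool:
    row = SUPPORT.get(os_name)
    return bool(row) and all(row.values())

print(full_stack_supported("CentOS/RHEL 7.x"))  # True
print(full_stack_supported("CentOS/RHEL 6.x"))  # False: no MapR support
```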
9. Rack Scalability
Customers can scale beyond one rack to expand their compute and storage resources as application needs grow. Compute-to-storage ratios can be changed or maintained for new or existing racks. For every JBOD added, a new DriveScale HDD Appliance with four controllers must be added as well. Since disk drives are assigned to servers within the same rack, scaling is achieved simply by adding more racks with servers, DriveScale HDD Appliances, switches and JBODs.
Figure 10: Rack Scalability
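The per-rack quantities implied by section 9 and Table 3 can be captured in a small sketch. The server and JBOD counts per rack are deployment-specific inputs; the derived quantities follow the one-appliance-per-JBOD and two-ToR-switches-per-rack rules stated above:

```python
# Per-rack resource model: one HDD Appliance (with four controllers)
# per JBOD, an MLAG pair of 10G ToR switches and one 1G management
# switch per rack, per section 9 and Table 3.
def rack_bill(jbods_per_rack: int, servers_per_rack: int) -> dict:
    return {
        "servers": servers_per_rack,
        "jbods": jbods_per_rack,
        "hdd_appliances": jbods_per_rack,            # one per JBOD
        "appliance_controllers": 4 * jbods_per_rack,  # four per appliance
        "tor_10g_switches": 2,                        # MLAG pair
        "tor_1g_switches": 1,                         # management network
    }

print(rack_bill(jbods_per_rack=2, servers_per_rack=10))
```

Multiplying this bill by the number of racks gives the incremental hardware required as the deployment scales out.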
10. References
1. MapR Installer Prerequisites and Guidelines: http://maprdocs.mapr.com/home/AdvancedInstallation/c_install_prerequisites.html
2. MapR Installer setup: http://maprdocs.mapr.com/home/MapRInstaller.html
3. DriveScale racking and installation documentation, provided by DriveScale
4. YARN definition: http://searchdatamanagement.techtarget.com
11. Bill of Materials
Server Components | Quantity
Intel Xeon processor-based servers with dual- or quad-port 10GbE SFP+ NICs; exact CPU models, socket counts and memory are based on customer application needs | Depends on customer application needs

JBOD Components | Quantity
DriveScale certified JBODs | Depends on customer application needs
NL-SAS HDDs | Depends on customer application needs

Switch | Quantity
DriveScale certified 10GbE SFP+ switches | An even number of switches for a redundant switch fabric
1GBase-T switch | Based on the number of servers and JBODs in the configuration

DriveScale Components | Quantity
DriveScale HDD Appliance | One for each JBOD
DriveScale Adapter | Four for each HDD Appliance

Software | Version
CentOS | See section 8.6
DriveScale HDD Appliance | 1.4
MapR | 5.2
12. Conclusion
The DriveScale-MapR solution reference architecture guide is designed to provide an overview of the combined solution and the components employed in the solution. The reference architecture also outlines the advantages of compute and storage disaggregation with the DriveScale-MapR solution.
DriveScale, Inc 1230 Midas Way, Suite 210 Sunnyvale, CA 94085
Main: +1 (408) 849-4651 | www.drivescale.com
©2018 DriveScale Inc. All Rights Reserved.
ra.20171218.002.Rev001