CDP Overview

Date of publish: 2019-08-23

https://docs.cloudera.com/


Legal Notice

© Cloudera Inc. 2020. All rights reserved.

The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein.

Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release.

Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information.

Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs.

Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera.

Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners.

Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS. WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED ON COURSE OF DEALING OR USAGE IN TRADE.


Contents

Introduction to CDP

CDP Public Cloud
    Use cases
    CDP cloud services
    Interfaces
    Getting started
    Glossary

CDP Data Center
    CDP Data Center Tools


Introduction to CDP

Cloudera Data Platform is an integrated data platform with both public cloud and on-premises versions.

• CDP Public Cloud is a secure and governed public cloud service platform that offers a broad set of enterprise data cloud services with the key data analytics and artificial intelligence functionality that enterprises require.

• CDP Data Center is the on-premises version of Cloudera Data Platform. Data Center combines the best of Cloudera Enterprise Data Hub and Hortonworks Data Platform Enterprise along with new features and enhancements across the stack.

Related Information

• CDP Public Cloud
• CDP Data Center

CDP Public Cloud

Cloudera Data Platform (CDP) is a secure and governed cloud service platform that offers a broad set of enterprise data cloud services with the key data analytics and artificial intelligence functionality that enterprises require. CDP Cloud is a cloud form factor of CDP.

Addressing real-world business problems generally requires the application of multiple analytic functions working together on the same data; for example, autonomous vehicles require the application of both real-time data streaming and machine learning algorithms. CDP addresses this by offering multi-function data management and analytics that allow solving an enterprise’s most pressing data and analytic challenges in a streamlined fashion.

Hybrid and multi-cloud, CDP gives enterprises flexibility to operate with equivalent functionality on and off premises. Support for all major cloud providers helps you avoid vendor lock-in and allows you to take control over your enterprise’s data and future. Secure and compliant, CDP meets the strict data privacy, governance, data migration, and metadata management demands of large enterprises across all their environments.

Use cases

CDP cloud services address multiple use cases, for example, registering existing CDH, HDP, and HDF clusters; or spinning up Data Hub clusters and analyzing data in a cloud object store.

CDP cloud services address the following use cases:

• Register your existing CDH, HDP, or HDF clusters in order to burst or migrate a workload to your public cloud environment by replicating the data and creating a Data Hub cluster to host the workload.

• Spin up Data Hub clusters and then process and analyze your data in a cloud object store by using applications such as Spark, Hive LLAP, Hue, and Impala.

• Analyze queries and jobs for resource consumption and performance.

• Control which users can access which resources by creating and managing authorization policies.

• Use secure interfaces for all user-facing endpoints.

CDP cloud services

CDP Cloud consists of a number of cloud services designed to address specific enterprise data cloud use cases.


This includes Data Hub powered by Cloudera Runtime, self-service apps (Data Warehouse and Machine Learning), the administrative layer (Management Console), and SDX services (Data Lake, Data Catalog, Replication Manager, and Workload Manager).

Administrative layer

Management Console is a general service used by CDP administrators to manage, monitor, and orchestrate all of the CDP services from a single pane of glass across all environments. If you have deployments in your data center as well as in multiple public clouds, you can manage them all in one place: creating, monitoring, provisioning, and destroying services.

Data Hub

Data Hub is a service for launching and managing workload clusters powered by Cloudera Runtime (Cloudera’s new unified open source distribution including the best of CDH and HDP). This includes a set of cloud-optimized built-in templates for common workload types as well as a set of options allowing for extensive customization based on your enterprise’s needs.

Data Hub provides complete workload isolation and full elasticity so that every workload, every application, or every department can have its own cluster with a different version of the software, a different configuration, and running on different infrastructure. This enables a more agile development process.

Since Data Hub clusters are easy to launch and their lifecycle can be automated, you can create them on demand and, when you no longer need them, return the resources to the cloud.
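The create-on-demand, release-when-done lifecycle described above can be sketched in Python. This is only an illustration under stated assumptions: `FakeCloud` and `on_demand_cluster` are hypothetical names invented for this sketch and are not part of the CDP CLI, SDK, or any Cloudera API. The sketch only shows the shape of an automated lifecycle in which resources are always returned, even if the workload fails.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical stand-in for a cloud provider; tracks running clusters."""

    def __init__(self):
        self.running = set()

    def launch(self, name):
        self.running.add(name)

    def terminate(self, name):
        self.running.discard(name)


@contextmanager
def on_demand_cluster(cloud, name):
    """Launch a cluster for the duration of a workload, then release it."""
    cloud.launch(name)
    try:
        yield name
    finally:
        # Resources go back to the cloud even if the workload raised an error.
        cloud.terminate(name)


cloud = FakeCloud()
with on_demand_cluster(cloud, "nightly-etl") as cluster_name:
    print(f"running workload on {cluster_name}")
print(f"clusters still running: {len(cloud.running)}")
```

A context manager is used here so that the release step cannot be forgotten; a real automation would replace `FakeCloud` with calls to whichever interface (web console, CLI, or SDK) your organization uses.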

Self-service apps

Data Warehouse is a service for creating and managing self-service data warehouses for teams of data analysts. This service makes it easy for an enterprise to provision a new data warehouse and share a subset of the data with a specific team or department. The service is ephemeral, allowing you to quickly create data warehouses and terminate them once the task at hand is done.


Machine Learning is a service for creating and managing self-service Machine Learning workspaces. This enables teams of data scientists to develop, test, train, and ultimately deploy machine learning models for building predictive applications, all on the data under management within the enterprise data cloud.

SDX services

Shared Data Experience (SDX) is a suite of technologies that make it possible for enterprises to pull all their data into one place to be able to share it with many different teams and services in a secure and governed manner. There are four discrete services within SDX technologies: Data Lake, Data Catalog, Replication Manager, and Workload Manager.

Data Lake is a set of functionality for creating safe, secure, and governed data lakes, which provides a protective ring around the data wherever it is stored, be that in cloud object storage or HDFS. Data Lake functionality is subsumed by the Management Console service and related Cloudera Runtime functionality (Ranger, Atlas, Hive Metastore).

Data Catalog is a service for searching, organizing, securing, and governing data within the enterprise data cloud. Data Catalog is used by data stewards to browse, search, and tag the content of a data lake, create and manage authorization policies (by file, table, column, row, and so on), identify what data a user has accessed, and access the lineage of a particular data set.

Replication Manager is a service for copying, migrating, snapshotting, and restoring data between environments within the enterprise data cloud. This service is used by administrators and data stewards to move, copy, back up, replicate, and restore data in or between data lakes. This can be done for backup, disaster recovery, or migration purposes, or to facilitate dev/test in another virtual environment.

Workload Manager is a service for analyzing and optimizing workloads within the enterprise data cloud. This service is used by database and workload administrators to troubleshoot, analyze, and optimize workloads in order to improve performance and/or cost.

Related Information

• Management Console
• Data Hub
• Data Warehouse
• Machine Learning
• Data Catalog
• Replication Manager
• Workload Manager

Interfaces

There are three basic ways to access and use CDP cloud services: web interface, CLI client, and SDK.

Web interface

The CDP web interface provides a web-based, graphical user interface. As an admin user, you can use the web interface to register environments, manage users, and provision CDP service resources for end users. As an end user, you can use the web console to access CDP service web interfaces to perform data engineering or data analytics tasks.

Cloudera validates and tests against the latest version and supports recent versions of the following browsers:

• Google Chrome
• Mozilla Firefox
• Safari
• Microsoft Edge


CLI

If you prefer to work in a terminal window, you can download and configure the CDP client that gives you access to the CDP CLI tool. The CDP CLI allows you to perform the same actions as can be performed from the web console. Furthermore, it allows you to automate routine tasks such as cluster creation.

SDK

You can use the CDP SDK for Java to integrate CDP services with your applications. Use the CDP SDK to connect to CDP services, create and manage clusters, and run jobs from your Java application or other data integration tools that you may use in your organization.

Related Information

• Supported Browsers Policy


Getting started

Getting started steps in CDP Cloud depend on your use case.

Onboarding

Use case: Regardless of your use case, your first steps in CDP should involve synchronizing your identity provider in CDP so that your users can access CDP and are authorized to access specific resources within CDP.

For more information, refer to Getting started as an admin.

Burst to the cloud

Use case: You have an on-premises CDH cluster and you would like to migrate a workload to your public cloud environment by replicating the data and creating a Data Hub cluster to host the workload.

1. Register your on-premises cluster in CDP.
2. Use Workload Manager to generate a workload, data movement, and compute capacity plan.
3. Use Replication Manager to migrate your workload to S3.
4. Create a Data Mart cluster or spin up a Data Warehouse instance.
5. Use Impala to run queries on the migrated workload.

Born in the cloud

Use case: If you already have data in the cloud, you can provision Data Hub clusters to run your workloads.

1. Create a Data Engineering cluster.
2. Ingest the data into managed Hive tables.
3. Create a Data Mart cluster.
4. Use Hue to submit queries.
5. Use Tableau with the JDBC connector.
6. Use Workload Manager to analyze Impala queries and Spark jobs for resource consumption and performance.

Glossary

CDP Cloud documentation uses terminology related to enterprise data cloud and cloud computing.

CDP (Cloudera Data Platform) - CDP is a cloud service platform that consists of a number of services. It enables administrators to deploy CDP service resources and allows end users to process and analyze data by using these resources.

CDP CLI - Provides a command-line interface to access and manage CDP services and resources. The CDP CLI client can be downloaded from the CDP web console.

CDP web console - The web interface for accessing and managing CDP services and resources.

Cloudera Runtime - The open source software distribution within CDP that is maintained, supported, versioned, and packaged by Cloudera. Cloudera Runtime combines the best of CDH and HDP. Cloudera Runtime 7.0.0 is the first version.

Cluster - Also known as compute cluster, workload cluster, or Data Hub cluster. The cluster created by using the Data Hub service for running workloads. A cluster makes it possible to run one or more Cloudera Runtime components on some number of VMs and is associated with exactly one data lake.

Cluster definition - A reusable template in JSON format that can be used for creating multiple Data Hub clusters with identical cloud provider settings. Data Hub includes a few built-in cluster definitions and allows you to save your own cluster definitions. A cluster definition is not synonymous with a cluster template (blueprint), which primarily defines Cloudera Runtime services.

Cluster template - A reusable template in JSON format that can be used for creating multiple Data Hub clusters with identical Cloudera Runtime settings. It primarily defines the list of Cloudera Runtime services included and how their components are distributed on different host groups. Data Hub includes a few built-in cluster templates (blueprints) and allows you to save your own. A cluster template is not synonymous with a cluster definition, which primarily defines cloud provider settings.

Credential - Allows an administrator to configure access from CDP to a cloud provider account so that CDP can communicate with that account and provision resources within it. There is one credential per environment.

Data Catalog (service) - A CDP service used by data stewards to browse, search, and tag the content of a data lake, create and manage authorization policies, identify what data a user has accessed, and access the lineage of a particular data set.

Data Lake - A single logical store of data that provides a mechanism for storing, accessing, organizing, securing, and managing that data.

Data Lake cluster - A special cluster type that implements the Cloudera Runtime services (such as HMS, Ranger, Atlas, and so on) necessary to implement a data lake that further provides connectivity to a particular cloud storage service such as S3 or ADLS.

Data Hub (service) - A CDP service that administrators use to create and manage clusters powered by Cloudera Runtime.

Data Warehouse (service) - A CDP service for creating and managing self-service data warehouses for teams of data analysts.

Data warehouse - The output of the Data Warehouse service. Users access data warehouses through JDBC or standard business intelligence tools such as Tableau.

Environment - A logical environment defined with a specific virtual network and region on a customer’s cloud provider account. CDP service components such as Data Hub clusters, data warehouses, and so on, run in an environment.

Image catalog - Defines a set of images that can be used for provisioning Data Hub clusters. Data Hub includes a built-in image catalog with a set of built-in base and prewarmed images and allows you to register your own image catalog.

Machine Learning (service) - A CDP service that administrators use to create and manage Machine Learning workspaces and that data scientists use to develop, train, and deploy machine learning models.

Machine Learning workspace - The output of the Machine Learning service. Each workspace corresponds to a single cluster that can be accessed by end users.

Management Console (service) - A CDP service that allows an administrator to manage environments, users, and services, and to download and configure the CLI.

Recipe - A reusable script that can be used to perform a specific task on a specific resource.

Replication Manager (service) - A CDP service used by administrators and data stewards to move, copy, back up, replicate, and restore data in or between data lakes.

Service - A defined subset of CDP functionality that enables a CDP user to solve a specific problem related to their data lake (process, analyze, predict, and so on). Example services: Data Hub, Data Warehouse, Machine Learning.

Shared resources - A set of resources such as cloud credentials, recipes (custom scripts), and others that can be reused across multiple environments.

Workload Manager (service) - A CDP service used by database and workload administrators to troubleshoot, analyze, and optimize workloads in order to improve performance and/or cost.
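Several glossary terms describe relationships that are easier to see as a data model: an environment has one credential, a cluster is associated with exactly one data lake, a cluster definition carries cloud provider settings, and a cluster template carries the Cloudera Runtime services per host group. The following Python sketch models only those relationships. All class names, field names, and sample values are assumptions made for illustration; they do not reflect the actual JSON schema of cluster definitions or templates, nor any CDP API.

```python
from dataclasses import dataclass


@dataclass
class Credential:
    """Access from CDP to one cloud provider account (hypothetical fields)."""
    name: str
    cloud_provider: str


@dataclass
class Environment:
    """A virtual network and region on a cloud provider account."""
    name: str
    region: str
    virtual_network: str
    credential: Credential  # glossary: one credential per environment


@dataclass
class ClusterDefinition:
    """Reusable cloud provider settings (hypothetical keys)."""
    name: str
    cloud_settings: dict


@dataclass
class ClusterTemplate:
    """Reusable Cloudera Runtime settings: services placed on host groups."""
    name: str
    host_groups: dict  # host group name -> list of Runtime services


@dataclass
class Cluster:
    name: str
    environment: Environment
    data_lake: str  # glossary: a cluster has exactly one data lake
    definition: ClusterDefinition
    template: ClusterTemplate

    def services(self):
        """Return the distinct Runtime services across all host groups."""
        groups = self.template.host_groups.values()
        return sorted({service for group in groups for service in group})


def create_clusters(names, environment, data_lake, definition, template):
    """Reuse one definition and one template for several identical clusters."""
    return [Cluster(n, environment, data_lake, definition, template) for n in names]
```

Keeping the definition and the template as separate objects mirrors the glossary's point that the two are not synonymous: one can be swapped without touching the other.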


CDP Data Center

Overview of CDP Data Center

Cloudera Data Platform (CDP) Data Center is the on-premises version of Cloudera Data Platform. This new product combines the best of Cloudera Enterprise Data Hub and Hortonworks Data Platform Enterprise along with new features and enhancements across the stack. This unified distribution is a scalable and customizable platform where you can securely run many types of workloads.

CDP Data Center supports a variety of hybrid solutions where compute tasks are separated from data storage and where data can be accessed from remote clusters. This hybrid approach provides a foundation for containerized applications by managing storage, table schema, authentication, authorization, and governance.

CDP Data Center consists of a variety of components such as Apache HDFS, Apache Hive 3, Apache HBase, and Apache Impala, along with many other components for specialized workloads. You can select any combination of these services to create clusters that address your business requirements and workloads. Several pre-configured packages of services are also available for common workloads. These include:

Regular (Base) Clusters

Data Engineering - Process, develop, and serve predictive models.
Services included: HDFS, YARN, YARN Queue Manager, Ranger, Atlas, Hive, Hive on Tez, Spark, Oozie, Hue, and Data Analytics Studio

Data Mart - Browse, query, and explore your data in an interactive way.
Services included: HDFS, Ranger, Atlas, Hive, and Hue

Operational Database - Real-time insights for modern data-driven business.
Services included: HDFS, Ranger, Atlas, and HBase

Custom Services - Choose your own services. Services required by chosen services will automatically be included.

Compute Clusters

Data Engineering - Process, develop, and serve predictive models.
Services included: Spark, Oozie, Hive on Tez, Data Analytics Studio, HDFS, YARN, and YARN Queue Manager

Spark - Spark for Compute
Services included: Core Configuration, Spark, Oozie, YARN, and YARN Queue Manager

Data Mart - Impala for Compute
Services included: Core Configuration, Impala, and Hue

Streams Messaging (Simple) - Simple Kafka cluster for streams messaging
Services included: Kafka, Schema Registry, and ZooKeeper

Streams Messaging (Full) - Advanced Kafka cluster with monitoring and replication services for streams messaging
Services included: Kafka, Schema Registry, Streams Messaging Manager, Streams Replication Manager, Cruise Control, and ZooKeeper

Custom Services - Choose your own services. Services required by chosen services will automatically be included.
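The pre-configured packages listed above can be captured as a simple lookup table, which is handy when deciding which package covers a service you need. The Python sketch below transcribes the service lists from this section; the variable and function names (`BASE_CLUSTER_PACKAGES`, `COMPUTE_CLUSTER_PACKAGES`, `packages_with`) are assumptions made for this example and do not correspond to anything in Cloudera Manager.

```python
# Service lists transcribed from the package descriptions in this section.
BASE_CLUSTER_PACKAGES = {
    "Data Engineering": [
        "HDFS", "YARN", "YARN Queue Manager", "Ranger", "Atlas", "Hive",
        "Hive on Tez", "Spark", "Oozie", "Hue", "Data Analytics Studio",
    ],
    "Data Mart": ["HDFS", "Ranger", "Atlas", "Hive", "Hue"],
    "Operational Database": ["HDFS", "Ranger", "Atlas", "HBase"],
}

COMPUTE_CLUSTER_PACKAGES = {
    "Data Engineering": [
        "Spark", "Oozie", "Hive on Tez", "Data Analytics Studio",
        "HDFS", "YARN", "YARN Queue Manager",
    ],
    "Spark": ["Core Configuration", "Spark", "Oozie", "YARN", "YARN Queue Manager"],
    "Data Mart": ["Core Configuration", "Impala", "Hue"],
    "Streams Messaging (Simple)": ["Kafka", "Schema Registry", "ZooKeeper"],
    "Streams Messaging (Full)": [
        "Kafka", "Schema Registry", "Streams Messaging Manager",
        "Streams Replication Manager", "Cruise Control", "ZooKeeper",
    ],
}


def packages_with(service, packages):
    """Return the names of the packages that include the given service."""
    return sorted(name for name, services in packages.items() if service in services)


print(packages_with("Impala", COMPUTE_CLUSTER_PACKAGES))
print(packages_with("Ranger", BASE_CLUSTER_PACKAGES))
```

The Custom Services option is left out of the tables because its contents depend on what you select.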

When installing a CDP Data Center cluster, you install a single parcel called Cloudera Runtime that contains all of the components. For a complete list of the included components, see Cloudera Runtime Component Versions.

In addition to the Cloudera Runtime components, CDP Data Center includes powerful tools to help manage, govern, and secure your cluster.

Related Information

• Upgrade Guide
• Cloudera Manager
• Installation
• Apache Atlas
• Apache Ranger

CDP Data Center Tools

Cloudera Manager

CDP Data Center uses Cloudera Manager to manage one or more clusters and their configurations and to monitor cluster performance. You also use Cloudera Manager to manage installations, upgrades, maintenance workflows, encryption, access controls, and data replication. In future releases you will also be able to manage Cloudera Enterprise CDH clusters. You can also use Cloudera Manager to create a Virtual Private Cluster that allows you to separate compute resources from data storage and to share data storage among compute resources.

Apache Atlas

Also included in CDP Data Center is Apache Atlas, used to provide governance for your data. Apache Atlas serves as a common metadata store that is designed to exchange metadata both inside and outside of the Hadoop stack. Close integration of Atlas with Apache Ranger enables you to define, administer, and manage security and compliance policies consistently across all components of the Hadoop stack. For customers familiar with Cloudera Enterprise, Apache Atlas replaces Cloudera Navigator Metadata Server. It provides the following capabilities:

• Flexible metadata models
• Entity search using model attributes, classifications (tags), and free text
• Lineage across entities based on processes applied to the entities

Apache Ranger

Apache Ranger provides auditing, authentication, and authorization functionality for your CDP Data Center clusters.


Apache Ranger provides a centralized framework for collecting access audit history and reporting data, including filtering on various parameters. Ranger enhances audit information obtained from Hadoop components and provides insights through this centralized reporting capability.

Apache Ranger also manages access control through a user interface that ensures consistent policy administration across CDP Data Center components. Security administrators can define security policies at the database, table, column, and file levels, and can administer permissions for specific LDAP-based groups or individual users. Rules based on dynamic conditions such as time or geolocation can also be added to an existing policy rule. The Ranger authorization model is pluggable and can be easily extended to any data source using a service-based definition.

For customers familiar with Cloudera Enterprise, Apache Ranger replaces Sentry and Navigator Audit Server and also provides the following capabilities:

• Better fine-grained access controls:

    • Dynamic Row Filtering
    • Dynamic Column Masking
    • Attribute-based Access Control
    • SparkSQL fine-grained access control

• Rich policy features:

    • Allow/Deny constructs, custom policy conditions/context enrichers, time-bound policies, Atlas integration (for tag-based policies)

• Extensive Access Auditing with rich event metadata
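To make the ideas of allow/deny constructs, column masking, and row filtering more concrete, here is a toy Python sketch. It is not Apache Ranger's policy model, schema, or API: the `Policy` class, the `"*"` wildcard, the rule that a deny overrides an allow, and the `"****"` mask are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Toy policy on a (database, table, column) resource; "*" matches anything."""
    resource: tuple
    allow_groups: set = field(default_factory=set)
    deny_groups: set = field(default_factory=set)

    def matches(self, resource):
        return all(p in ("*", r) for p, r in zip(self.resource, resource))


def is_allowed(policies, user_groups, resource):
    """Assumption for this sketch: a matching deny overrides any allow."""
    matching = [p for p in policies if p.matches(resource)]
    if any(p.deny_groups & user_groups for p in matching):
        return False
    return any(p.allow_groups & user_groups for p in matching)


def mask_columns(row, masked_columns):
    """Column masking: hide the values of sensitive columns in one row."""
    return {k: ("****" if k in masked_columns else v) for k, v in row.items()}


def filter_rows(rows, predicate):
    """Row filtering: keep only the rows the caller is entitled to see."""
    return [row for row in rows if predicate(row)]


policies = [
    Policy(("sales", "orders", "*"), allow_groups={"analysts"}),
    Policy(("sales", "orders", "card_number"), deny_groups={"analysts"}),
]
print(is_allowed(policies, {"analysts"}, ("sales", "orders", "amount")))
print(is_allowed(policies, {"analysts"}, ("sales", "orders", "card_number")))
```

In this sketch a request with no matching allow policy is refused, so access has to be granted explicitly.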
