PRODUCT DOCUMENTATION
Pivotal Greenplum Database® Version 6.11
Pivotal Greenplum Database Documentation
Rev: A03
© 2020 VMware, Inc.

Version 6.11 © 2020 VMware, Inc.
Copyright
Copyright © 2020 VMware, Inc. or its affiliates. All Rights Reserved.
Revised October 2020 (6.11.2)
Pivotal Greenplum 6.11 Release Notes
This document contains pertinent release information about Pivotal Greenplum 6.11 releases. For previous versions of the release notes for Greenplum Database, go to Pivotal Greenplum Database Documentation. For information about Greenplum Database end of life, see Pivotal Greenplum Database end of life policy.
Pivotal Greenplum 6 software is available for download from VMware Tanzu Network.
Pivotal Greenplum 6 is based on the open source Greenplum Database project code.
Important: VMware does not provide support for open source versions of Greenplum Database. Only Pivotal Greenplum is supported by VMware.
Release 6.11.2
Release Date: 2020-10-2
Pivotal Greenplum 6.11.2 is a maintenance release that includes changes and resolves several issues.
Changed Features
Greenplum Database 6.11.2 includes these changes:
• Pivotal GPText version 3.4.5 is included, which includes bug fixes. See the GPText 3.4.5 Release Notes for more information.
• Pivotal Greenplum-Spark connector version 2.0.0 is included, which includes feature changes and bug fixes. See the Greenplum-Spark Connector 2.0.0 Release Notes for more information.
Resolved Issues
Pivotal Greenplum 6.11.2 resolves these issues:
30549 - Management and Monitoring
30795 - GPORCA
Fixed a problem where GPORCA did not utilize an index scan for certain subqueries, which could lead to poor performance for affected queries.
30878 - GPORCA
If a CREATE TABLE .. AS statement was used to create a table with non-legacy (jump consistent) hash algorithm distribution from a source table that used the legacy (modulo) hash algorithm, GPORCA would distribute the data according to the value of gp_use_legacy_hashops; however, it would set the table's distribution policy hash algorithm to the value of the original table. This could cause queries to give incorrect results if the distribution policy did not match the data distribution. This problem has been resolved.
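A hedged sketch of the affected pattern follows; the table names, columns, and settings are illustrative only and are not taken from the issue report:
SET gp_use_legacy_hashops = on;
CREATE TABLE legacy_src (a int, b int) DISTRIBUTED BY (a);  -- created with legacy (modulo) hashing
SET gp_use_legacy_hashops = off;
CREATE TABLE new_copy AS SELECT * FROM legacy_src DISTRIBUTED BY (a);
Before the fix, new_copy could be populated using one hash algorithm while its distribution policy recorded the other, so queries that relied on co-location of matching rows could return incorrect results.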
30903 - Metrics Collector
Workfile entries were sometimes freed prematurely, which could lead to the postmaster process being reset on segments and failures in query execution. This problem has been resolved.
30928 - GPORCA
If gp_use_legacy_hashops was enabled, GPORCA could crash when generating the query plan for certain queries that included an aggregate. This problem has been resolved.
174812955 - Query Execution
When executing a long query that contained multi-byte characters, Greenplum could incorrectly truncate the query string (removing multi-byte characters) and, if log_min_duration_statement was set to 0, could subsequently write an invalid symbol to segment logs. This behavior could cause errors in gp_toolkit and Command Center. This problem has been resolved.
Release 6.11.1
Release Date: 2020-09-17
Pivotal Greenplum 6.11.1 is a maintenance release that includes changes and resolves several issues.
Changed Features
Greenplum Database 6.11.1 includes this change:
• Greenplum Platform Extension Framework (PXF) version 5.15.1 is included, which includes changes and bug fixes. Refer to the PXF Release Notes for more information on release content and to access the PXF documentation.
Resolved Issues
Pivotal Greenplum 6.11.1 resolves these issues:
30751, 173714727 - Query Optimizer
Resolves an issue where a correlated subquery that contained at least one left or right outer join caused the Greenplum Database master to crash when the server configuration parameter optimizer_join_order was set to exhaustive2.
30880 - gpload
Fixed a problem where gpload operations would fail if a table column name included capital letters or special characters.
30901 - GPORCA
For queries that included an outer ref in a subquery, such as select * from foo where foo.a = (select foo.b from bar), GPORCA always used the results of the subquery after unnesting the outer reference. This could cause a crash or incorrect results if the subquery returned no rows, or if the subquery contained a projection with multiple values below the outer reference. To address this problem, all such queries now fall back to using the Postgres planner instead of GPORCA. Note that this behavior occurs for cases where GPORCA would have returned correct results, as well as for cases that could cause crashes or return incorrect results.
30913, 170824967 - gpfdists
A command that accessed an external table using the gpfdists protocol failed if the external table did not use an IP address when specifying a host system in the LOCATION clause of the external table definition. This issue is resolved in Greenplum 6.11.1.
174609237 - gpstart
gpstart was updated so that it does not attempt to start a standby master segment when that segment is unreachable, preventing an associated stack trace during startup.
Upgrading from Greenplum 6.x to Greenplum 6.11
Note: Greenplum 6 does not support direct upgrades from Greenplum 4 or Greenplum 5 releases, or from earlier Greenplum 6 Beta releases.
See Upgrading from an Earlier Greenplum 6 Release to upgrade your existing Greenplum 6.x software to Greenplum 6.11.0.
Release 6.11.0
Release Date: 2020-09-11
Pivotal Greenplum 6.11.0 is a minor release that includes changed features and resolves several issues.
Features
Greenplum Database 6.11.0 includes these new and changed features:
• GPORCA partition elimination has been enhanced to support a subset of lossy assignment casts that are order-preserving (increasing) functions, including timestamp::date and float::int. For example, GPORCA supports partition elimination when a partition column is defined with the timestamp datatype and the query contains a predicate such as WHERE ts::date = '2020-05-10' that performs a cast on the partitioned column (ts) to compare column data (a timestamp) to a date.
• PXF version 5.15.0 is included, which includes new and changed features and bug fixes. Refer to the PXF Release Notes for more information on release content and supported platforms, and to access the PXF documentation.
• Greenplum Command Center 6.3.0 and 4.11.0 are included, which include new workload management and other features, as well as bug fixes. See the Command Center Release Notes for more information.
• The DataDirect ODBC Drivers for Pivotal Greenplum were updated to version 07.16.0389 (B0562, U0408). This version introduces support for the following datatypes:
Greenplum Datatype ODBC Datatype
Resolved Issues
Pivotal Greenplum 6.11.0 resolves these issues:
30899 - Resource Groups
In some cases, when running queries were managed by resource groups, Greenplum Database generated a PANIC while managing runaway queries (queries that use an excessive amount of memory) because of locking issues. This issue is resolved.
30877 - VACUUM
In some cases, running VACUUM returned the error ERROR: found xmin <xid> from before relfrozenxid <frozen_xid>. The error occurred when a previously run VACUUM FULL was interrupted and aborted on a query executor (QE) and corrupted the catalog frozen XID information. This issue is resolved.
30870 - Segment Mirroring
In some cases, performing an incremental recovery of a Greenplum Database segment instance failed with the message requested WAL segment has already been removed because the recovery checkpoint was not created properly. This issue is resolved.
30858 - analyzedb
analyzedb failed if it attempted to update statistics for a set of tables and one of the tables was dropped and then recreated while analyzedb was running. analyzedb has been enhanced to better handle this situation.
30845 - Query Execution
Under heavy load when running multiple queries, some queries randomly failed with the error Error on receive from seg<ID>. The error was caused when Greenplum Database encountered a divide by 0 error while managing the backend processes that are used to run queries on the segment instances. This issue is resolved.
30761 - Postgres Planner
In some cases, Greenplum Database generated a PANIC when a DROP VIEW command was cancelled from the Greenplum Command Center. The PANIC was generated when Greenplum Database did not correctly handle the visibility of the relation. This issue is resolved.
30721 - gpcheckcat
Resolved a problem where gpcheckcat would fail with Missing or extraneous entries check errors if the gp_sparse_vector extension was installed.
30637 - Query Optimizer
For some queries against partitioned tables, GPORCA did not perform partition elimination when a predicate that included the partition column also performed an explicit cast. For example, GPORCA would not perform partition elimination when a partition column was defined with the timestamp datatype and the query contained a predicate such as WHERE ts::date = '2020-05-10' that performs a cast on the partitioned column (ts) to compare column data (a timestamp) to a date. GPORCA partition elimination has been improved to support the specified type of query. See Features.
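As a hedged illustration of the type of query that now benefits from partition elimination (the table definition is hypothetical; only the ts::date = '2020-05-10' predicate comes from this issue description):
CREATE TABLE sales (id int, amount numeric, ts timestamp)
DISTRIBUTED BY (id)
PARTITION BY RANGE (ts)
(START ('2020-01-01'::timestamp) END ('2021-01-01'::timestamp) EVERY (INTERVAL '1 month'));

SELECT * FROM sales WHERE ts::date = '2020-05-10';
With the enhancement described in Features, GPORCA can restrict this query to the partition covering May 2020 instead of scanning all partitions.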
10491 - Postgres Planner
For some queries that contain nested subqueries that do not specify a relation and that also contain nested GROUP BY clauses, Greenplum Database generated a PANIC. The PANIC was generated when Greenplum Database did not correctly manage the subquery. This is an example of the specified type of query.
SELECT * FROM (SELECT * FROM (SELECT c1, SUM(c2) c2 FROM mytbl GROUP BY c1) t2) t3
GROUP BY c2, ROLLUP((c1)) ORDER BY 1, 2;
This issue is resolved.
10561 - Server
Greenplum Database does not support altering the data type of a column that is defined as a distribution key or that has a constraint. When such a change was attempted, the error message did not clearly indicate the cause. The error message has been improved to provide more information.
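A minimal sketch of the kind of statement that triggers the error (the table and column names are hypothetical):
CREATE TABLE orders (order_id int, note text) DISTRIBUTED BY (order_id);
ALTER TABLE orders ALTER COLUMN order_id TYPE bigint;  -- rejected because order_id is the distribution key
The ALTER TABLE statement still fails; the improvement in this release is to the clarity of the error message, not to the underlying restriction.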
174505130 - Resource Groups
In some cases for a query managed by a resource group, the resource group cancelled the query with the message Canceling query because of high VMEM usage because the resource group incorrectly calculated the memory used by the query. This issue is resolved.
174353156 - Interconnect
In some cases when Greenplum Database uses proxies for interconnect communication (the server configuration parameter gp_interconnect_type is set to proxy), a Greenplum background worker process became an orphaned process after the postmaster process was terminated. This issue is resolved.
174205590 - Interconnect
When Greenplum Database uses proxies for interconnect communication (the server configuration parameter gp_interconnect_type is set to proxy), a query might have hung if the query contained multiple concurrent subplans running on the segment instances. The query hung when the Greenplum interconnect did not properly handle the communication among the concurrent subplans. This issue is resolved.
174483149 - Cluster Management - gpinitsystem
gpinitsystem now exports the MASTER_DATA_DIRECTORY environment variable before calling gpconfig, to avoid throwing warning messages when configuring system parameters on Greenplum Database appliances (DCA).
Upgrading from Greenplum 6.x to Greenplum 6.11
Note: Greenplum 6 does not support direct upgrades from Greenplum 4 or Greenplum 5 releases, or from earlier Greenplum 6 Beta releases.
See Upgrading from an Earlier Greenplum 6 Release to upgrade your existing Greenplum 6.x software to Greenplum 6.11.0.
Deprecated Features
Deprecated features will be removed in a future major release of Greenplum Database. Pivotal Greenplum 6.x deprecates:
• The gpsys1 utility.
• The analyzedb option --skip_root_stats (deprecated since 6.2). If the option is specified, a warning is issued stating that the option will be ignored.
• The server configuration parameter gp_statistics_use_fkeys (deprecated since 6.2).
• The server configuration parameter gp_ignore_error_table (deprecated since 6.0). To avoid a Greenplum Database syntax error, set the value of this parameter to true when you run applications that execute CREATE EXTERNAL TABLE or COPY commands that include the now removed Greenplum Database 4.3.x INTO ERROR TABLE clause.
• Specifying => as an operator name in the CREATE OPERATOR command (deprecated since 6.0).
• The Greenplum external table C API (deprecated since 6.0). Any developers using this API are encouraged to use the new Foreign Data Wrapper API in its place.
• Commas placed between a SUBPARTITION TEMPLATE clause and its corresponding SUBPARTITION BY clause, and between consecutive SUBPARTITION BY clauses in a CREATE TABLE command (deprecated since 6.0). Using this undocumented syntax will generate a deprecation warning message.
• The timestamp format YYYYMMDDHH24MISS (deprecated since 6.0). This format could not be parsed unambiguously in previous Greenplum Database releases, and is not supported in PostgreSQL 9.4.
• The createlang and droplang utilities (deprecated since 6.0).
• The pg_resqueue_status system view (deprecated since 6.0). Use the gp_toolkit.gp_resqueue_status view instead.
• The GLOBAL and LOCAL modifiers when creating a temporary table with the CREATE TABLE and CREATE TABLE AS commands (deprecated since 6.0). These keywords are present for SQL standard compatibility, but have no effect in Greenplum Database.
• Using WITH OIDS or oids=TRUE to assign an OID system column when creating or altering a table (deprecated since 6.0).
• Allowing superusers to specify the SQL_ASCII encoding regardless of the locale settings (deprecated since 6.0). This choice may result in misbehavior of character-string functions when data that is not encoding-compatible with the locale is stored in the database.
• The @@@ text search operator (deprecated since 6.0). This operator is currently a synonym for the @@ operator.
• The unparenthesized syntax for option lists in the VACUUM command (deprecated since 6.0). This syntax requires that the options to the command be specified in a specific order.
• The plain pgbouncer authentication type (auth_type = plain) (deprecated since 4.x).
Migrating Data to Greenplum 6
Note: Greenplum 6 does not support direct upgrades from Greenplum 4 or Greenplum 5 releases, or from earlier Greenplum 6 Beta releases.
See Migrating Data from Greenplum 4.3 or 5 for guidelines and considerations for migrating existing Greenplum data to Greenplum 6, using standard backup and restore procedures.
Known Issues and Limitations
Pivotal Greenplum 6 has these limitations:
• Upgrading a Greenplum Database 4 or 5 release, or Greenplum 6 Beta release, to Greenplum 6 is not supported.
• MADlib, GPText, and PostGIS are not yet provided for installation on Ubuntu systems.
• Greenplum 6 is not supported for installation on DCA systems.
• Greenplum for Kubernetes is not yet provided with this release.
The following table lists key known issues in Pivotal Greenplum 6.x.
Table 1: Key Known Issues in Pivotal Greenplum 6.x
Issue Category Description
N/A Backup/Restore Restoring the Greenplum Database backup for a table fails in Greenplum 6 versions earlier than version 6.10 when a replicated table has an inheritance relationship to/from another table that was assigned via an ALTER TABLE ... INHERIT statement after table creation.
Workaround: Use the following SQL commands to determine if Greenplum Database includes any replicated tables that inherit from a parent table, or if there are replicated tables that are inherited by a child table:
SELECT inhrelid::regclass FROM pg_inherits, gp_distribution_policy dp WHERE inhrelid=dp.localoid AND dp.policytype='r';
SELECT inhparent::regclass FROM pg_inherits, gp_distribution_policy dp WHERE inhparent=dp.localoid AND dp.policytype='r';
If these queries return any tables, you may choose to run gprestore with the --on-error-continue flag to not fail the entire restore when this issue is hit. Or, you can specify the list of tables returned by the queries to the --exclude-table-file option to skip those tables during restore. You must recreate and repopulate the affected tables after restore.
N/A Spark Connector
This version of Greenplum is not compatible with Greenplum-Spark Connector versions earlier than version 1.7.0, due to a change in how Greenplum handles distributed transaction IDs.
N/A PXF Starting in 6.x, Greenplum does not bundle cURL and instead loads the system-provided library. PXF requires cURL version 7.29.0 or newer. The officially-supported cURL for the CentOS 6.x and Red Hat Enterprise Linux 6.x operating systems is version 7.19.*. Greenplum Database 6 does not support running PXF on CentOS 6.x or RHEL 6.x due to this limitation.
Workaround: Upgrade the operating system of your Greenplum Database 6 hosts to CentOS 7+ or RHEL 7+, which provides a cURL version suitable to run PXF.
29703 Loading Data from External Tables
Due to limitations in the Greenplum Database external table framework, Greenplum Database cannot log the following types of errors that it encounters while loading data:
• data type parsing errors
• unexpected value type errors
• data type conversion errors
• errors returned by native and user-defined functions
LOG ERRORS returns error information for data exceptions only. When it encounters a parsing error, Greenplum terminates the load job, but it cannot log and propagate the error back to the user via gp_read_error_log().
Workaround: Clean the input data before loading it into Greenplum Database.
30594 Resource Management
Resource queue-related statistics may be inaccurate in certain cases. VMware recommends that you use the resource group resource management scheme that is available in Greenplum 6.
30522 Logging Greenplum Database may write a FATAL message to the standby master or mirror log stating that the database system is in recovery mode when the instance is synchronizing with the master and Greenplum attempts to contact it before the operation completes. Ignore these messages and use gpstate -f output to determine if the standby successfully synchronized with the Greenplum master; the command returns Sync state: sync if it is synchronized.
30537 Postgres Planner
The Postgres Planner generates a very large query plan that causes out of memory issues for the following type of CTE (common table expression) query: the WITH clause of the CTE contains a partitioned table with a large number of partitions, and the WITH reference is used in a subquery that joins another partitioned table.
Workaround: If possible, use the GPORCA query optimizer. With the server configuration parameter optimizer=on, Greenplum Database attempts to use GPORCA for query planning and optimization when possible and falls back to the Postgres Planner when GPORCA cannot be used. Also, the specified type of query might require a long time to complete.
170824967 gpfdists For Greenplum Database 6.x, a command that accesses an external table that uses the gpfdists protocol fails if the external table does not use an IP address when specifying a host system in the LOCATION clause of the external table definition. This issue is resolved in Greenplum 6.11.1.
n/a Materialized Views
By default, certain gp_toolkit views do not display data for materialized views. If you want to include this information in gp_toolkit view output, you must redefine a gp_toolkit internal view as described in Including Data for Materialized Views.
168957894 PXF The PXF Hive Connector does not support using the Hive* profiles to access Hive transactional tables.
Workaround: Use the PXF JDBC Connector to access Hive.
168548176 gpbackup When using gpbackup to back up a Greenplum Database 5.7.1 or earlier 5.x release with resource groups enabled, gpbackup returns a column not found error for t6.value AS memoryauditor.
164791118 PL/R PL/R cannot be installed using the deprecated createlang utility, and displays the error:
createlang: language installation failed: ERROR: no schema has been selected to create in
Workaround: Use CREATE EXTENSION to install PL/R, as described in the documentation.
N/A Greenplum Client/Load Tools on Windows
The Greenplum Database client and load tools on Windows have not been tested with Active Directory Kerberos authentication.
Differences Compared to Open Source Greenplum Database
Pivotal Greenplum 6.x includes all of the functionality in the open source Greenplum Database project and adds:
• Product packaging and installation script
• Support for QuickLZ compression. QuickLZ compression is not provided in the open source version of Greenplum Database due to licensing restrictions.
• Support for data connectors:
  • Greenplum-Spark Connector
  • Greenplum-Informatica Connector
  • Greenplum-Kafka Integration
  • Greenplum Streaming Server
• Data Direct ODBC/JDBC Drivers
• gpcopy utility for copying or migrating objects between Greenplum systems
• Support for managing Greenplum Database using Pivotal Greenplum Command Center
• Support for full text search and text analysis using Pivotal GPText
• Greenplum backup plugin for DD Boost
• Backup/restore storage plugin API
Platform Requirements
This topic describes the Pivotal Greenplum Database 6 platform and operating system software requirements.
Important: Pivotal Support does not provide support for open source versions of Greenplum Database. Only Pivotal Greenplum Database is supported by Pivotal Support.
• Operating Systems
• Client Tools
• Extensions
• Data Connectors
• GPText
• Greenplum Command Center
• Hadoop Distributions
Operating Systems
Pivotal Greenplum 6 runs on the following operating system platforms:
• Red Hat Enterprise Linux 64-bit 7.x (See the following Note.)
• Red Hat Enterprise Linux 64-bit 6.x
• CentOS 64-bit 7.x
• CentOS 64-bit 6.x
• Ubuntu 18.04 LTS
• Oracle Linux 64-bit 7, using the Red Hat Compatible Kernel (RHCK)
Important: Significant Greenplum Database performance degradation has been observed when enabling resource group-based workload management on RedHat 6.x and CentOS 6.x systems. This issue is caused by a Linux cgroup kernel bug. This kernel bug has been fixed in CentOS 7.x and Red Hat 7.x systems.
If you use RedHat 6 and the performance with resource groups is acceptable for your use case, upgrade your kernel to version 2.6.32-696 or higher to benefit from other fixes to the cgroups implementation.
Note: For Greenplum Database installed on Red Hat Enterprise Linux 7.x or CentOS 7.x prior to 7.3, an operating system issue might cause Greenplum Database to hang when running large workloads. The issue is caused by Linux kernel bugs.
RHEL 7.3 and CentOS 7.3 resolve the issue.
Greenplum Database server supports TLS version 1.2 on RHEL/CentOS systems, and TLS version 1.3 on Ubuntu systems.
Software Dependencies
Greenplum Database 6 requires the following software packages on RHEL/CentOS 6/7 systems (the packages are installed automatically as dependencies when you install the Pivotal Greenplum Database RPM package):
• apr
• apr-util
• bash
• bzip2
• curl
• krb5
• libcurl
• libevent (or libevent2 on RHEL/CentOS 6)
• libxml2
• libyaml
• zlib
• openldap
• openssh
• openssl
• openssl-libs (RHEL7/CentOS7)
• perl
• readline
• rsync
• R
• sed (used by gpinitsystem)
• tar
• zip
Greenplum Database 6 client software requires these operating system packages:
• apr
• apr-util
• libyaml
• libevent
On Ubuntu systems, Greenplum Database 6 requires the following software packages, which are installed automatically as dependencies when you install Greenplum Database with the Debian package installer:
• libapr1
• libaprutil1
• bash
• bzip2
• krb5-multidev
• libcurl3-gnutls
• libcurl4
• libevent-2.1-6
• libxml2
• libyaml-0-2
• zlib1g
• libldap-2.4-2
• openssh-client
• openssl
• perl
• readline
• rsync
• sed
• tar
• zip
• net-tools
• less
• iproute2
Greenplum Database 6 uses Python 2.7.12, which is included with the product installation (and not installed as a package dependency).
Important: SSL is supported only on the Greenplum Database master host system. It cannot be used on the segment host systems.
Important: For all Greenplum Database host systems, SELinux must be disabled. You should also disable firewall software, although firewall software can be enabled if it is required for security purposes. See Disabling SELinux and Firewall Software.
Java
Greenplum 6 supports these Java versions for PL/Java and PXF:
• Open JDK 8 or Open JDK 11, available from AdoptOpenJDK
• Oracle JDK 8 or Oracle JDK 11
Hardware and Network
The following table lists minimum recommended specifications for hardware servers intended to support Greenplum Database on Linux systems in a production environment. All host servers in your Greenplum Database system must have the same hardware and software configuration. Greenplum also provides hardware build guides for its certified hardware platforms. It is recommended that you work with a Greenplum Systems Engineer to review your anticipated environment to ensure an appropriate hardware configuration for Greenplum Database.
Table 2: Minimum Hardware Requirements
Minimum CPU: Any x86_64 compatible CPU
Minimum Memory: 16 GB RAM per server
Disk Space Requirements:
• 150MB per host for Greenplum installation
• Approximately 300MB per segment instance for metadata
• Appropriate free space for data, with disks at no more than 70% capacity
Network: NIC bonding is recommended when multiple interfaces are present. Pivotal Greenplum can use either IPV4 or IPV6 protocols.
Storage
The only file system supported for running Greenplum Database is the XFS file system. All other file systems are explicitly not supported by Pivotal.
Greenplum Database is supported on network or shared storage if the shared storage is presented as a block device to the servers running Greenplum Database and the XFS file system is mounted on the block device. Network file systems are not supported. When using network or shared storage, Greenplum Database mirroring must be used in the same way as with local storage, and no modifications may be made to the mirroring scheme or the recovery scheme of the segments.
Other features of the shared storage such as de-duplication and/or replication are not directly supported by Pivotal Greenplum Database, but may be used with support of the storage vendor as long as they do not interfere with the expected operation of Greenplum Database at the discretion of Pivotal.
Greenplum Database can be deployed to virtualized systems only if the storage is presented as block devices and the XFS file system is mounted for the storage of the segment directories.
Greenplum Database is supported on Amazon Web Services (AWS) servers using either Amazon instance store (Amazon uses the volume names ephemeral[0-20]) or Amazon Elastic Block Store (Amazon EBS) storage. If using Amazon EBS storage, the storage should be a RAID of Amazon EBS volumes mounted with the XFS file system for it to be a supported configuration.
Data Domain Boost
Pivotal Greenplum 6.0.0 supports Data Domain Boost for backup on Red Hat Enterprise Linux. This table lists the versions of Data Domain Boost SDK and DDOS supported by Pivotal Greenplum 6.x.
Table 3: Data Domain Boost Compatibility
Pivotal Greenplum | Data Domain Boost | DDOS
6.x | 3.3 | 6.1 (all versions), 6.0 (all versions)
Note: In addition to the DDOS versions listed in the previous table, Pivotal Greenplum supports all minor patch releases (fourth digit releases) later than the certified version.
Tools and Extensions Compatibility
• Client Tools
• Extensions
• Data Connectors
• GPText
• Greenplum Command Center
Client Tools
Greenplum Database 6 releases a Clients tool package on various platforms that can be used to access Greenplum Database from a client system. The Greenplum 6 Clients tool package is supported on the following platforms:
• Red Hat Enterprise Linux x86_64 6.x (RHEL 6)
• Red Hat Enterprise Linux x86_64 7.x (RHEL 7)
• Ubuntu 18.04 LTS
• Windows 10 (32-bit and 64-bit)
• Windows 8 (32-bit and 64-bit)
• Windows Server 2012 (32-bit and 64-bit)
• Windows Server 2012 R2 (32-bit and 64-bit)
• Windows Server 2008 R2 (32-bit and 64-bit)
The Greenplum 6 Clients package includes the client and loader programs provided in the Greenplum 5 packages plus the addition of database/role/language commands and the Greenplum-Kafka Integration and Greenplum Streaming Server command utilities. Refer to Greenplum Client and Loader Tools Package for installation and usage details of the Greenplum 6 Client tools.
Extensions
This table lists the versions of the Pivotal Greenplum Extensions that are compatible with this release of Greenplum Database 6.
Table 4: Pivotal Greenplum 6 Extensions Compatibility
Component | Package Version | Additional Information
PL/Java | 2.0.2 | Supports Java 8 and 11.
Python Data Science Module Package | 2.0.2 |
R Data Science Library Package | 2.0.2 |
PL/Container | 2.1.2 | Python 3.7
GreenplumR | 1.1.0 | Supports R 3.6+.
MADlib Machine Learning | 1.17, 1.16 | Support matrix at MADlib FAQ.
PostGIS Spatial and Geographic Objects | 2.5.4+pivotal.3, 2.5.4+pivotal.2, 2.5.4+pivotal.1, 2.1.5+pivotal.2-2 |
• Fuzzy String Match Extension
• PL/Python Extension
• pgcrypto Extension
Data Connectors
• Greenplum Platform Extension Framework (PXF) v5.15.0 - PXF, integrated with Greenplum Database 6, provides access to Hadoop, object store, and SQL external data stores. Refer to Accessing External Data with PXF in the Greenplum Database Administrator Guide for PXF configuration and usage information.
• Greenplum-Kafka Integration - The Pivotal Greenplum-Kafka Integration provides high speed, parallel data transfer from a Kafka cluster to a Pivotal Greenplum Database cluster for batch and streaming ETL operations. It requires Kafka version 0.11 or newer for exactly-once delivery assurance. Refer to the Pivotal Greenplum-Kafka Integration Documentation for more information about this feature.
• Greenplum Streaming Server v1.4.1 - The Pivotal Greenplum Streaming Server is an ETL tool that provides high speed, parallel data transfer from Informatica, Kafka, and custom client data sources to a Pivotal Greenplum Database cluster. Refer to the Pivotal Greenplum Streaming Server Documentation for more information about this feature.
• Greenplum Informatica Connector v1.0.5 - The Pivotal Greenplum Informatica Connector supports high speed data transfer from an Informatica PowerCenter cluster to a Pivotal Greenplum Database cluster for batch and streaming ETL operations.
• Greenplum Spark Connector v1.6.2 - The Pivotal Greenplum Spark Connector supports high speed, parallel data transfer between Greenplum Database and an Apache Spark cluster using Spark's Scala API.
• Progress DataDirect JDBC Drivers v5.1.4.000223 - The Progress DataDirect JDBC drivers are compliant with the Type 4 architecture, but provide advanced features that define them as Type 5 drivers.
• Progress DataDirect ODBC Drivers v7.1.6 (07.16.0389) - The Progress DataDirect ODBC drivers enable third party applications to connect via a common interface to the Pivotal Greenplum Database system.
Note: Pivotal Greenplum 6 does not support the ODBC driver for Cognos Analytics V11.
Connecting to IBM Cognos software with an ODBC driver is not supported. Greenplum Database supports connecting to IBM Cognos software with the DataDirect JDBC driver for Pivotal Greenplum. This driver is available as a download from Pivotal Network.
GPText
Pivotal Greenplum Database 6 is compatible with Pivotal Greenplum Text version 3.3.1 and later. See the Greenplum Text documentation for additional compatibility information.
Greenplum Command Center
Pivotal Greenplum Database 6.8 and later are compatible only with Pivotal Greenplum Command Center 6.2 and later. See the Greenplum Command Center documentation for additional compatibility information.
Hadoop Distributions
Greenplum Database provides access to HDFS with the Greenplum Platform Extension Framework (PXF).
PXF can use Cloudera, Hortonworks Data Platform, MapR, and generic Apache Hadoop distributions. PXF bundles all of the JAR files on which it depends, including the following Hadoop libraries:
Table 5: PXF Hadoop Supported Platforms
PXF Version | Hadoop Version | Hive Server Version | HBase Server Version
5.15.0, 5.14.0, 5.13.0, 5.12.0, 5.11.1, 5.10.1 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2
5.8.2 | 2.x | 1.x | 1.3.2
5.8.1 | 2.x | 1.x | 1.3.2
Note: If you plan to access JSON format data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution.
Introduction to Greenplum
High-level overview of the Greenplum Database system architecture.
Greenplum Database stores and processes large amounts of data by distributing the load across several servers or hosts. A logical database in Greenplum is an array of individual PostgreSQL databases working together to present a single database image. The master is the entry point to the Greenplum Database system. It is the database instance to which users connect and submit SQL statements. The master coordinates the workload across the other database instances in the system, called segments, which handle data processing and storage. The segments communicate with each other and the master over the interconnect, the networking layer of Greenplum Database.
Greenplum Database is a software-only solution; the hardware and database software are not coupled. Greenplum Database runs on a variety of commodity server platforms from Greenplum-certified hardware vendors. Performance depends on the hardware on which it is installed. Because the database is distributed across multiple machines in a Greenplum Database system, proper selection and configuration of hardware is vital to achieving the best possible performance.
This chapter describes the major components of a Greenplum Database system and the hardware considerations and concepts associated with each component: The Greenplum Master, The Segments and The Interconnect. Additionally, a system may have optional ETL Hosts for Data Loading and Greenplum Performance Monitoring for monitoring query workload and performance.
The Greenplum Master
The master is the entry point to the Greenplum Database system. It is the database server process that accepts client connections and processes the SQL commands that system users issue. Users connect to Greenplum Database through the master using a PostgreSQL-compatible client program such as psql or ODBC.
The master maintains the system catalog (a set of system tables that contain metadata about the Greenplum Database system itself); however, the master does not contain any user data. Data resides only on the segments. The master authenticates client connections, processes incoming SQL commands, distributes the workload between segments, coordinates the results returned by each segment, and presents the final results to the client program.
Because the master does not contain any user data, it has very little disk load. The master needs a fast, dedicated CPU for data loading, connection handling, and query planning. Extra space is often necessary for landing load files and backup files, especially in production environments. Customers may decide to also run ETL and reporting tools on the master, which requires more disk space and processing power.
Master Redundancy
You may optionally deploy a backup or mirror of the master instance. A backup master host serves as a warm standby if the primary master host becomes nonoperational. You can deploy the standby master on a designated redundant master host or on one of the segment hosts.
The standby master is kept up to date by a transaction log replication process, which runs on the standby master host and synchronizes the data between the primary and standby master hosts. If the primary master fails, the log replication process shuts down, and an administrator can activate the standby master in its place. When the standby master is active, the replicated logs are used to reconstruct the state of the master host at the time of the last successfully committed transaction.
Since the master does not contain any user data, only the system catalog tables need to be synchronized between the primary and backup copies. When these tables are updated, changes automatically copy over to the standby master so it is always synchronized with the primary.
Figure 1: Master Mirroring in Greenplum Database
The Segments
In Greenplum Database, the segments are where data is stored and where most query processing occurs. User-defined tables and their indexes are distributed across the available segments in the Greenplum Database system; each segment contains a distinct portion of the data. Segment instances are the database server processes that serve segments. Users do not interact directly with the segments in a Greenplum Database system, but do so through the master.
In the reference Greenplum Database hardware configurations, the number of segment instances per segment host is determined by the number of effective CPUs or CPU cores. For example, if your segment hosts have two dual-core processors, you may have two or four primary segments per host. If your segment hosts have three quad-core processors, you may have three, six, or twelve segments per host. Performance testing will help decide the best number of segments for a chosen hardware platform.
Segment Redundancy
When you deploy your Greenplum Database system, you have the option to configure mirror segments. Mirror segments allow database queries to fail over to a backup segment if the primary segment becomes unavailable. Mirroring is a requirement for Pivotal-supported production Greenplum Database systems.
A mirror segment must always reside on a different host than its primary segment. Mirror segments can be arranged across the hosts in the system in one of two standard configurations, or in a custom configuration you design. The default configuration, called group mirroring, places the mirror segments for all of a host's primary segments on one other host. Another option, called spread mirroring, spreads mirrors for each host's primary segments over the remaining hosts. Spread mirroring requires that there be more hosts in the system than there are primary segments on the host. On hosts with multiple network interfaces, the primary and mirror segments are distributed equally among the interfaces. Figure 2: Data Mirroring in Greenplum Database shows how table data is distributed across the segments when the default group mirroring option is configured.
Figure 2: Data Mirroring in Greenplum Database
Segment Failover and Recovery
When mirroring is enabled in a Greenplum Database system, the system automatically fails over to the mirror copy if a primary copy becomes unavailable. A Greenplum Database system can remain operational if a segment instance or host goes down only if all portions of data are available on the remaining active segments.
If the master cannot connect to a segment instance, it marks that segment instance as invalid in the Greenplum Database system catalog. The segment instance remains invalid and out of operation until an administrator brings that segment back online. An administrator can recover a failed segment while the system is up and running. The recovery process copies over only the changes that were missed while the segment was nonoperational.
If you do not have mirroring enabled and a segment becomes invalid, the system automatically shuts down. An administrator must recover all failed segments before operations can continue.
Example Segment Host Hardware Stack
Regardless of the hardware platform you choose, a production Greenplum Database processing node (a segment host) is typically configured as described in this section.
The segment hosts do the majority of database processing, so the segment host servers are configured in order to achieve the best performance possible from your Greenplum Database system. Greenplum Database's performance will be as fast as the slowest segment server in the array. Therefore, it is important to ensure that the underlying hardware and operating systems that are running Greenplum Database are all running at their optimal performance level. It is also advised that all segment hosts in a Greenplum Database array have identical hardware resources and configurations.
Segment hosts should also be dedicated to Greenplum Database operations only. To get the best query performance, you do not want Greenplum Database competing with other applications for machine or network resources.
The following diagram shows an example Greenplum Database segment host hardware stack. The number of effective CPUs on a host is the basis for determining how many primary Greenplum Database segment instances to deploy per segment host. This example shows a host with two effective CPUs (one dual-core CPU). Note that there is one primary segment instance (or primary/mirror pair if using mirroring) per CPU core.
Figure 3: Example Greenplum Database Segment Host Configuration
Example Segment Disk Layout
Each CPU is typically mapped to a logical disk. A logical disk consists of one primary file system (and optionally a mirror file system) accessing a pool of physical disks through an I/O channel or disk controller. The logical disk and file system are provided by the operating system. Most operating systems provide the ability for a logical disk drive to use groups of physical disks arranged in RAID arrays.
Figure 4: Logical Disk Layout in Greenplum Database
Depending on the hardware platform you choose, different RAID configurations offer different performance and capacity levels. Greenplum supports and certifies a number of reference hardware platforms and operating systems. Check with your sales account representative for the recommended configuration on your chosen platform.
The Interconnect
The interconnect is the networking layer of Greenplum Database. When a user connects to a database and issues a query, processes are created on each of the segments to handle the work of that query. The interconnect refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication relies. The interconnect uses a standard 10 Gigabit Ethernet switching fabric.
By default, Greenplum Database interconnect uses UDP (User Datagram Protocol) with flow control for interconnect traffic to send messages over the network. The Greenplum software does the additional packet verification and checking not performed by UDP, so the reliability is equivalent to TCP (Transmission Control Protocol), and the performance and scalability exceed that of TCP. For information about the types of interconnect supported by Greenplum Database, see server configuration parameter gp_interconnect_type in the Greenplum Database Reference Guide.
Interconnect Redundancy
A highly available interconnect can be achieved by deploying dual 10 Gigabit Ethernet switches on your network, and redundant 10 Gigabit connections to the Greenplum Database master and segment host servers.
Network Interface Configuration
A segment host typically has multiple network interfaces dedicated to Greenplum interconnect traffic. The master host typically has additional external network interfaces in addition to the interfaces used for interconnect traffic.
Depending on the number of interfaces available, you will want to distribute interconnect network traffic across the number of available interfaces. This is done by assigning segment instances to a particular network interface and ensuring that the primary segments are evenly balanced over the number of available interfaces.
This is done by creating separate host address names for each network interface. For example, if a host has four network interfaces, then it would have four corresponding host addresses, each of which maps to one or more primary segment instances. The /etc/hosts file should be configured to contain not only the host name of each machine, but also all interface host addresses for all of the Greenplum Database hosts (master, standby master, segments, and ETL hosts).
With this configuration, the operating system automatically selects the best path to the destination. Greenplum Database automatically balances the network destinations to maximize parallelism.
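For example, a minimal /etc/hosts fragment for a segment host sdw1 with two interconnect interfaces might look like the following; the IP addresses are placeholders, and the sdw1-1 and sdw1-2 interface naming convention is described in Configuring Your Systems:
192.0.2.11    sdw1-1
192.0.2.12    sdw1-2
Each interface address can then be referenced when mapping primary segment instances to a particular network interface.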
Figure 5: Example Network Interface Architecture
Switch Configuration
When using multiple 10 Gigabit Ethernet switches within your Greenplum Database array, evenly divide the number of subnets between each switch. In this example configuration, if we had two switches, NICs 1 and 2 on each host would use switch 1 and NICs 3 and 4 on each host would use switch 2. For the master host, the host name bound to NIC 1 (and therefore using switch 1) is the effective master host name for the array. Therefore, if deploying a warm standby master for redundancy purposes, the standby master should map to a NIC that uses a different switch than the primary master.
Figure 6: Example Switch Configuration
ETL Hosts for Data Loading
Greenplum supports fast, parallel data loading with its external tables feature. By using external tables in conjunction with Greenplum Database's parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth from their Greenplum Database system. Many production systems deploy designated ETL servers for data loading purposes. These machines run the Greenplum parallel file server (gpfdist), but not Greenplum Database instances.
One advantage of using the gpfdist file server program is that it ensures that all of the segments in your Greenplum Database system are fully utilized when reading from external table data files.
The gpfdist program can serve data to the segment instances at an average rate of about 350 MB/s for delimited text formatted files and 200 MB/s for CSV formatted files. Therefore, you should consider the following options when running gpfdist in order to maximize the network bandwidth of your ETL systems:
• If your ETL machine is configured with multiple network interface cards (NICs) as described in Network Interface Configuration, run one instance of gpfdist on your ETL host and then define your external table definition so that the host name of each NIC is declared in the LOCATION clause (see CREATE EXTERNAL TABLE in the Greenplum Database Reference Guide). This allows network traffic between your Greenplum segment hosts and your ETL host to use all NICs simultaneously. A sketch of such an external table definition appears after Figure 8.
Figure 7: External Table Using Single gpfdist Instance with Multiple NICs
• Run multiple gpfdist instances on your ETL host and divide your external data files equally between each instance. For example, if you have an ETL system with two network interface cards (NICs), then you could run two gpfdist instances on that machine to maximize your load performance. You would then divide the external table data files evenly between the two gpfdist programs.
Figure 8: External Tables Using Multiple gpfdist Instances with Multiple NICs
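The following is a minimal sketch of the single-gpfdist approach described in the first option above; the host names (etl1-1, etl1-2), port, file paths, and column list are hypothetical and would be adapted to your environment:
CREATE EXTERNAL TABLE ext_sales (id int, amount numeric, ts timestamp)
LOCATION ('gpfdist://etl1-1:8081/sales/*.txt',
          'gpfdist://etl1-2:8081/sales/*.txt')
FORMAT 'TEXT' (DELIMITER '|');
Because each LOCATION entry uses a host address bound to a different NIC on the same ETL host, the segment hosts spread their reads across both interfaces.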
Greenplum Performance Monitoring
Greenplum Database includes a dedicated system monitoring and management database, named gpperfmon, that administrators can install and enable. When this database is enabled, data collection agents on each segment host collect query status and system metrics. At regular intervals (typically every 15 seconds), an agent on the Greenplum master requests the data from the segment agents and updates the gpperfmon database. Users can query the gpperfmon database to see the stored query and system metrics. For more information see the "gpperfmon Database Reference" in the Greenplum Database Reference Guide.
Greenplum Command Center is an optional web-based performance monitoring and management tool for Greenplum Database, based on the gpperfmon database. Administrators can install Command Center separately from Greenplum Database.
Figure 9: Greenplum Performance Monitoring Architecture
Estimating Storage Capacity
To estimate how much data your Greenplum Database system can accommodate, use these measurements as guidelines. Also keep in mind that you may want to have extra space for landing backup files and data load files on each segment host.
Calculating Usable Disk Capacity
To calculate how much data a Greenplum Database system can hold, you have to calculate the usable disk capacity per segment host and then multiply that by the number of segment hosts in your Greenplum Database array. Start with the raw capacity of the physical disks on a segment host that are available for data storage (raw_capacity), which is:
disk_size * number_of_disks
Account for file system formatting overhead (roughly 10 percent) and the RAID level you are using. For example, if using RAID-10, the calculation would be:
(raw_capacity * 0.9) / 2 = formatted_disk_space
For optimal performance, do not completely fill your disks to capacity, but run at 70% or lower. So with this in mind, calculate the usable disk space as follows:
formatted_disk_space * 0.7 = usable_disk_space
Using only 70% of your disk space allows Greenplum Database to use the other 30% for temporary and transaction files on the same disks. If your host systems have a separate disk system that can be used for temporary and transaction files, you can specify a tablespace that Greenplum Database uses for the files. Moving the location of the files might improve performance depending on the performance of the disk system.
Once you have formatted RAID disk arrays and accounted for the maximum recommended capacity (usable_disk_space), you will need to calculate how much storage is actually available for user data (U). If using Greenplum Database mirrors for data redundancy, this would then double the size of your user data (2 * U). Greenplum Database also requires some space be reserved as a working area for active queries. The work space should be approximately one third the size of your user data (work space = U/3):
With mirrors: (2 * U) + U/3 = usable_disk_space
Without mirrors: U + U/3 = usable_disk_space
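As a worked example, assume a hypothetical segment host with 24 disks of 1 TB each configured as RAID-10 (the disk counts and sizes are illustrative only):
24 * 1000 GB = 24000 GB raw_capacity
(24000 GB * 0.9) / 2 = 10800 GB formatted_disk_space
10800 GB * 0.7 = 7560 GB usable_disk_space
With mirrors: (2 * U) + U/3 = 7560 GB, so U is approximately 3240 GB of user data per segment host.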
Guidelines for temporary file space and user data space assume a typical analytic workload. Highly concurrent workloads or workloads with queries that require very large amounts of temporary space can benefit from reserving a larger working area. Typically, overall system throughput can be increased while decreasing work area usage through proper workload management. Additionally, temporary space and user space can be isolated from each other by specifying that they reside on different tablespaces.
In the Greenplum Database Administrator Guide, see these topics:
• Managing Performance for information about workload management
• Creating and Managing Tablespaces for information about moving the location of temporary and transaction files
• Monitoring System State for information about monitoring Greenplum Database disk space usage
Calculating User Data Size
As with all databases, the size of your raw data will be slightly larger once it is loaded into the database. On average, raw data will be about 1.4 times larger on disk after it is loaded into the database, but could be smaller or larger depending on the data types you are using, table storage type, in-database compression, and so on.
• Page Overhead - When your data is loaded into Greenplum Database, it is divided into pages of 32KB each. Each page has 20 bytes of page overhead.
• Row Overhead - In a regular 'heap' storage table, each row of data has 24 bytes of row overhead. An 'append-optimized' storage table has only 4 bytes of row overhead.
• Attribute Overhead - For the data values themselves, the size associated with each attribute value is dependent upon the data type chosen. As a general rule, you want to use the smallest data type possible to store your data (assuming you know the possible values a column will have).
• Indexes - In Greenplum Database, indexes are distributed across the segment hosts as is table data. The default index type in Greenplum Database is B-tree. Because index size depends on the number of unique values in the index and the data to be inserted, precalculating the exact size of an index is impossible. However, you can roughly estimate the size of an index using these formulas.
B-tree: unique_values * (data_type_size + 24 bytes)
Bitmap: (unique_values * number_of_rows * 1 bit * compression_ratio / 8) + (unique_values * 32)
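For example, a rough worked estimate for a B-tree index on an integer column (a 4-byte data type) with 1,000,000 unique values, using the B-tree formula above (the row counts are illustrative only):
1,000,000 * (4 + 24 bytes) = 28,000,000 bytes, or roughly 28 MB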
Calculating Space Requirements for Metadata and Logs
On each segment host, you will also want to account for space for Greenplum Database log files and metadata:
• System Metadata — For each Greenplum Database segment instance (primary or mirror) or master instance running on a host, estimate approximately 20 MB for the system catalogs and metadata.
• Write Ahead Log — For each Greenplum Database segment (primary or mirror) or master instance running on a host, allocate space for the write ahead log (WAL). The WAL is divided into segment files of 64 MB each. At most, the number of WAL files will be:
2 * checkpoint_segments + 1
You can use this to estimate space requirements for WAL. The default checkpoint_segments setting for a Greenplum Database instance is 8, meaning (2 * 8 + 1) * 64 MB = 1088 MB of WAL space allocated for each segment or master instance on a host.
• Greenplum Database Log Files — Each segment instance and the master instance generates database log files, which will grow over time. Sufficient space should be allocated for these log files, and some type of log rotation facility should be used to ensure that log files do not grow too large.
• Command Center Data — The data collection agents utilized by Command Center run on the same set of hosts as your Greenplum Database instance and utilize the system resources of those hosts. The resource consumption of the data collection agent processes on these hosts is minimal and should not significantly impact database performance. Historical data collected by the collection agents is stored in its own Command Center database (named gpperfmon) within your Greenplum Database system. Collected data is distributed just like regular database data, so you will need to account for disk space in the data directory locations of your Greenplum segment instances. The amount of space required depends on the amount of historical data you would like to keep. Historical data is not automatically truncated. Database administrators must set up a truncation policy to maintain the size of the Command Center database.
Configuring Your Systems
Describes how to prepare your operating system environment for Greenplum Database software installation.
Perform the following tasks in order:
1. Make sure your host systems meet the requirements described in Platform Requirements.
2. Disable SELinux and firewall software.
3. Set the required operating system parameters.
4. Synchronize system clocks.
5. Create the gpadmin account.
Unless noted, these tasks should be performed for all hosts in your Greenplum Database array (master, standby master, and segment hosts).
The Greenplum Database host naming convention for the master host is mdw and for the standby master host is smdw.
The segment host naming convention is sdwN where sdw is a prefix and N is an integer. For example, segment host names would be sdw1, sdw2 and so on. NIC bonding is recommended for hosts with multiple interfaces, but when the interfaces are not bonded, the convention is to append a dash (-) and number to the host name. For example, sdw1-1 and sdw1-2 are the two interface names for host sdw1.
For information about running Greenplum Database in the cloud, see Cloud Services in the Pivotal Greenplum Partner Marketplace.
Important: When data loss is not acceptable for a Greenplum Database cluster, Greenplum master and segment mirroring is recommended. If mirroring is not enabled then Greenplum stores only one copy of the data, so the underlying storage media provides the only guarantee for data availability and correctness in the event of a hardware failure.
Kubernetes enables quick recovery from both pod and host failures, and Kubernetes storage services provide a high level of availability for the underlying data. Furthermore, virtualized environments make it difficult to ensure the anti-affinity guarantees required for Greenplum mirroring solutions. For these reasons, mirrorless deployments are fully supported with Greenplum for Kubernetes. Other deployment environments are generally not supported for production use unless both Greenplum master and segment mirroring are enabled.
Note: For information about upgrading Pivotal Greenplum Database from a previous version, see the Greenplum Database Release Notes for the release that you are installing.
Note: Automating the configuration steps described in this topic and Installing the Greenplum Database Software with a system provisioning tool, such as Ansible, Chef, or Puppet, can save time and ensure a reliable and repeatable Greenplum Database installation.
Disabling SELinux and Firewall Software
For all Greenplum Database host systems running RHEL or CentOS, SELinux must be disabled. Follow these steps:
1. As the root user, check the status of SELinux:
# sestatus
SELinux status: disabled
2. If SELinux is not disabled, disable it by editing the /etc/selinux/config file. As root, change the value of the SELINUX parameter in the config file as follows:
SELINUX=disabled
3. If the System Security Services Daemon (SSSD) is installed on your systems, edit the SSSD configuration file and set the selinux_provider parameter to none to prevent SELinux-related SSH authentication denials that could occur even with SELinux disabled. As root, edit /etc/sssd/sssd.conf and add this parameter:
selinux_provider=none
4. Reboot the system to apply any changes that you made and verify that SELinux is disabled.
For information about disabling SELinux, see the SELinux documentation.
You should also disable firewall software such as iptables (on systems such as RHEL 6.x and CentOS 6.x ), firewalld (on systems such as RHEL 7.x and CentOS 7.x), or ufw (on Ubuntu systems, disabled by default).
If you decide to enable iptables with Greenplum Database for security purposes, see Enabling iptables (Optional).
Follow these steps to disable iptables:
1. As the root user, check the status of iptables:
# /sbin/chkconfig --list iptables
If iptables is disabled, the command output is:
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
2. If necessary, execute this command as root to disable iptables:
/sbin/chkconfig iptables off
You will need to reboot your system after applying the change.
3. For systems with firewalld, check the status of firewalld with the command:
# systemctl status firewalld
* firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
4. If necessary, execute these commands as root to disable firewalld:
# systemctl stop firewalld.service
# systemctl disable firewalld.service
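On Ubuntu systems, ufw is disabled by default; if it has been enabled, this hedged sketch shows how you might check and disable it as root:

# ufw status
# ufw disable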
For more information about configuring your firewall software, see the documentation for the firewall or your operating system.
Recommended OS Parameter Settings
Greenplum requires that certain Linux operating system (OS) parameters be set on all hosts in your Greenplum Database system (masters and segments).
In general, the following categories of system parameters need to be altered:
• Shared Memory - A Greenplum Database instance will not work unless the shared memory segment for your kernel is properly sized. Most default OS installations have the shared memory values set too low for Greenplum Database. On Linux systems, you must also disable the OOM (out of memory) killer. For information about Greenplum Database shared memory requirements, see the Greenplum Database server configuration parameter shared_buffers in the Greenplum Database Reference Guide.
• Network - On high-volume Greenplum Database systems, certain network-related tuning parameters must be set to optimize network connections made by the Greenplum interconnect.
• User Limits - User limits control the resources available to processes started by a user's shell. Greenplum Database requires a higher limit on the allowed number of file descriptors that a single process can have open. The default settings may cause some Greenplum Database queries to fail because they will run out of file descriptors needed to process the query.
More specifically, you need to edit the following Linux configuration settings:
• The hosts File
• The sysctl.conf File
• System Resources Limits
• XFS Mount Options
• Disk I/O Settings
• Read-ahead values
• Disk I/O scheduler disk access
• Transparent Huge Pages (THP)
• IPC Object Removal
• SSH Connection Threshold
The hosts File
Edit the /etc/hosts file and make sure that it includes the host names and all interface address names for every machine participating in your Greenplum Database system.
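For example, a hypothetical /etc/hosts fragment for a small cluster; the IP addresses shown are placeholders and the interface names follow the unbonded-NIC convention described earlier:

192.0.2.10   mdw
192.0.2.11   smdw
192.0.2.21   sdw1 sdw1-1
192.0.2.22   sdw1-2
192.0.2.31   sdw2 sdw2-1
192.0.2.32   sdw2-2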
The sysctl.conf File
The sysctl.conf parameters listed in this topic are for performance, optimization, and consistency in a wide variety of environments. Change these settings according to your specific situation and setup.
Set the parameters in the /etc/sysctl.conf file and reload with sysctl -p:
# kernel.shmall = _PHYS_PAGES / 2 # See Shared Memory Pages
kernel.shmall = 197951838
# kernel.shmmax = kernel.shmall * PAGE_SIZE
kernel.shmmax = 810810728448
kernel.shmmni = 4096
vm.overcommit_memory = 2 # See Segment Host Memory
vm.overcommit_ratio = 95 # See Segment Host Memory
net.ipv4.ip_local_port_range = 10000 65535 # See Port Settings
kernel.sem = 500 2048000 200 4096
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_max_syn_backlog = 4096
Shared Memory Pages
Greenplum Database uses shared memory to communicate between postgres processes that are part of the same postgres instance. kernel.shmall sets the total amount of shared memory, in pages, that can be used system wide. kernel.shmmax sets the maximum size of a single shared memory segment in bytes.
Set kernel.shmall and kernel.shmmax values based on your system's physical memory and page size. In general, the value for both parameters should be one half of the system physical memory.
Use the operating system variables _PHYS_PAGES and PAGE_SIZE to set the parameters.
kernel.shmall = ( _PHYS_PAGES / 2)
kernel.shmmax = ( _PHYS_PAGES / 2) * PAGE_SIZE
To calculate the values for kernel.shmall and kernel.shmmax, run the following commands using the getconf command, which returns the value of an operating system variable.
$ echo $(expr $(getconf _PHYS_PAGES) / 2)
$ echo $(expr $(getconf _PHYS_PAGES) / 2 \* $(getconf PAGE_SIZE))
As a best practice, we recommend setting the calculated values in the /etc/sysctl.conf file. For example, a host system has 1583 GB of memory installed and returns these values: _PHYS_PAGES = 395903676 and PAGE_SIZE = 4096. These would be the kernel.shmall and kernel.shmmax values:
kernel.shmall = 197951838
kernel.shmmax = 810810728448
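If you want to script this calculation, a minimal sketch (run as root; it assumes you are appending new entries to /etc/sysctl.conf rather than editing existing ones) is:

# echo "kernel.shmall = $(expr $(getconf _PHYS_PAGES) / 2)" >> /etc/sysctl.conf
# echo "kernel.shmmax = $(expr $(getconf _PHYS_PAGES) / 2 \* $(getconf PAGE_SIZE))" >> /etc/sysctl.conf
# sysctl -p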
If the Greenplum Database master host has a different shared memory configuration than the segment hosts, the _PHYS_PAGES and PAGE_SIZE values might differ, and the kernel.shmall and kernel.shmmax values on the master host will differ from those on the segment hosts.
Segment Host Memory
The vm.overcommit_memory Linux kernel parameter is used by the OS to determine how much memory can be allocated to processes. For Greenplum Database, this parameter should always be set to 2.
vm.overcommit_ratio is the percent of RAM that is used for application processes and the remainder is reserved for the operating system. The default is 50 on Red Hat Enterprise Linux.
For vm.overcommit_ratio tuning and calculation recommendations with resource group-based resource management or resource queue-based resource management, refer to Options for Configuring Segment Host Memory in the Greenplum Database Administrator Guide. Also refer to the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide.
Port Settings
To avoid port conflicts between Greenplum Database and other applications during Greenplum initialization, make a note of the port range specified by the operating system parameter net.ipv4.ip_local_port_range. When initializing Greenplum using the gpinitsystem cluster configuration file, do not specify Greenplum Database ports in that range. For example, if net.ipv4.ip_local_port_range = 10000 65535, set the Greenplum Database base port numbers to these values.
PORT_BASE = 6000
MIRROR_PORT_BASE = 7000
For information about the gpinitsystem cluster configuration file, see Initializing a Greenplum Database System.
For Azure deployments with Greenplum Database, avoid using port 65330; add the following line to sysctl.conf:
net.ipv4.ip_local_reserved_ports=65330
System Memory
For host systems with more than 64GB of memory, these settings are recommended:
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
vm.dirty_background_bytes = 1610612736 # 1.5GB
vm.dirty_bytes = 4294967296 # 4GB
For host systems with 64GB of memory or less, remove vm.dirty_background_bytes and vm.dirty_bytes and set the two ratio parameters to these values:
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
Increase vm.min_free_kbytes to ensure PF_MEMALLOC requests from network and storage drivers are easily satisfied. This is especially critical on systems with large amounts of system memory. The default value is often far too low on these systems. Use this awk command to set vm.min_free_kbytes to a recommended 3% of system physical memory:
awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print "vm.min_free_kbytes =", $2 * .03;}' /proc/meminfo >> /etc/sysctl.conf
Do not set vm.min_free_kbytes to higher than 5% of system memory as doing so might cause out of memory conditions.
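After reloading the settings with sysctl -p, you can confirm the resulting value; a quick check as root:

# sysctl vm.min_free_kbytes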
System Resources Limits
Set the following parameters in the /etc/security/limits.conf file:
* soft nofile 524288
* hard nofile 524288
* soft nproc 131072
* hard nproc 131072
For Red Hat Enterprise Linux (RHEL) and CentOS systems, parameter values in the /etc/security/limits.d/90-nproc.conf file (RHEL/CentOS 6) or /etc/security/limits.d/20-nproc.conf file (RHEL/CentOS 7) override the values in the limits.conf file. Ensure that any parameters in the
override file are set to the required value. The Linux module pam_limits sets user limits by reading the values from the limits.conf file and then from the override file. For information about PAM and user limits, see the documentation on PAM and pam_limits.
Execute the ulimit -u command on each segment host to display the maximum number of processes that are available to each user. Validate that the return value is 131072.
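For example, after logging in again so the new limits apply, the checks below should return the values configured above (524288 open file descriptors and 131072 processes):

$ ulimit -n
524288
$ ulimit -u
131072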
XFS Mount Options
XFS is the preferred data storage file system on Linux platforms. Use the mount command with the following recommended XFS mount options for RHEL and CentOS systems:
rw,nodev,noatime,nobarrier,inode64
The nobarrier option is not supported on Ubuntu systems. Use only the options:
rw,nodev,noatime,inode64
See the mount manual page (man mount opens the man page) for more information about using this command.
The XFS options can also be set in the /etc/fstab file. This example entry from an fstab file specifies the XFS options.
/dev/data /data xfs nodev,noatime,nobarrier,inode64 0 0
Disk I/O Settings
• Read-ahead value
Each disk device file should have a read-ahead (blockdev) value of 16384. To verify the read-ahead value of a disk device:
# /sbin/blockdev --getra devname
To set the read-ahead value of a disk device:

# /sbin/blockdev --setra bytes devname
For example, this command sets the read-ahead value for the disk /dev/sdb:

# /sbin/blockdev --setra 16384 /dev/sdb
See the manual page (man) for the blockdev command for more information about using that command (man blockdev opens the man page).
Note: The blockdev --setra command is not persistent. You must ensure the read-ahead value is set whenever the system restarts. How to set the value will vary based on your system.
One method to set the blockdev value at system startup is by adding the /sbin/blockdev --setra command in the rc.local file. For example, add this line to the rc.local file to set the read-ahead value for the disk sdb.
/sbin/blockdev --setra 16384 /dev/sdb
On systems that use systemd, you must also set the execute permissions on the rc.local file to enable it to run at startup. For example, on a RHEL/CentOS 7 system, this command sets execute permissions on the file.
# chmod +x /etc/rc.d/rc.local
Restart the system to have the setting take effect.
• Disk I/O scheduler
The Linux disk I/O scheduler for disk access supports different policies, such as CFQ, AS, and deadline.
The deadline scheduler option is recommended. To specify a scheduler until the next system reboot, run the following:
# echo schedulername > /sys/block/devname/queue/scheduler
For example, to set the deadline scheduler for the disk sdb:

# echo deadline > /sys/block/sdb/queue/scheduler
Note: Using the echo command to set the disk I/O scheduler policy is not persistent, therefore you must ensure the command is run whenever the system reboots. How to run the command will vary based on your system.
One method to set the I/O scheduler policy at boot time is with the elevator kernel parameter. Add the parameter elevator=deadline to the kernel command in the file /boot/grub/grub.conf, the GRUB boot loader configuration file. This is an example kernel command from a grub.conf file on RHEL 6.x or CentOS 6.x. The command is on multiple lines for readability.
kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/
    elevator=deadline crashkernel=128M@16M quiet console=tty1
    console=ttyS1,115200 panic=30 transparent_hugepage=never
initrd /initrd-2.6.18-274.3.1.el5.img
To specify the I/O scheduler at boot time on systems that use grub2 such as RHEL 7.x or CentOS 7.x, use the system utility grubby. This command adds the parameter when run as root.
# grubby --update-kernel=ALL --args="elevator=deadline"
After adding the parameter, reboot the system.
This grubby command displays kernel parameter settings.
# grubby --info=ALL
For more information about the grubby utility, see your operating system documentation. If the grubby command does not update the kernels, see the Note at the end of the section.
Transparent Huge Pages (THP)
Disable Transparent Huge Pages (THP) as it degrades Greenplum Database performance. RHEL 6.0 or higher enables THP by default. One way to disable THP on RHEL 6.x is by adding the parameter transparent_hugepage=never to the kernel command in the file /boot/grub/grub.conf, the GRUB boot loader configuration file. This is an example kernel command from a grub.conf file. The command is on multiple lines for readability:
kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/
    elevator=deadline crashkernel=128M@16M quiet console=tty1
    console=ttyS1,115200 panic=30 transparent_hugepage=never
initrd /initrd-2.6.18-274.3.1.el5.img
On systems that use grub2 such as RHEL 7.x or CentOS 7.x, use the system utility grubby. This command adds the parameter when run as root.
# grubby --update-kernel=ALL --args="transparent_hugepage=never"
After adding the parameter, reboot the system.
For Ubuntu systems, install the hugepages package and execute this command as root:
# hugeadm --thp-never
This cat command checks the state of THP. The output indicates that THP is disabled.
$ cat /sys/kernel/mm/*transparent_hugepage/enabled
always [never]
For more information about Transparent Huge Pages or the grubby utility, see your operating system documentation. If the grubby command does not update the kernels, see the Note at the end of the section.
IPC Object Removal
Disable IPC object removal for RHEL 7.2 or CentOS 7.2, or Ubuntu. The default systemd setting RemoveIPC=yes removes IPC connections when non-system user accounts log out. This causes the Greenplum Database utility gpinitsystem to fail with semaphore errors. Perform one of the following to avoid this issue.
• When you add the gpadmin operating system user account to the master node in Creating the Greenplum Administrative User, create the user as a system account.
• Disable RemoveIPC. Set this parameter in /etc/systemd/logind.conf on the Greenplum Database host systems.
RemoveIPC=no
The setting takes effect after restarting the systemd-logind service or rebooting the system. To restart the service, run this command as the root user.
service systemd-logind restart
SSH Connection Threshold
Certain Greenplum Database management utilities, including gpexpand, gpinitsystem, and gpaddmirrors, use secure shell (SSH) connections between systems to perform their tasks. In large Greenplum Database deployments, cloud deployments, or deployments with a large number of segments per host, these utilities may exceed the hosts' maximum threshold for unauthenticated connections. When this occurs, you receive errors such as: ssh_exchange_identification: Connection closed by remote host.
To increase this connection threshold for your Greenplum Database system, update the SSH MaxStartups and MaxSessions configuration parameters in one of the /etc/ssh/sshd_config or /etc/sshd_config SSH daemon configuration files.
If you specify MaxStartups and MaxSessions using a single integer value, you identify the maximum number of concurrent unauthenticated connections (MaxStartups) and maximum number of open shell, login, or subsystem sessions permitted per network connection (MaxSessions). For example:
MaxStartups 200
MaxSessions 200
If you specify MaxStartups using the "start:rate:full" syntax, you enable random early connection drop by the SSH daemon. start identifies the maximum number of unauthenticated SSH connection attempts allowed. Once start number of unauthenticated connection attempts is reached, the SSH daemon refuses rate percent of subsequent connection attempts. full identifies the maximum number of unauthenticated connection attempts after which all attempts are refused. For example:
MaxStartups 10:30:200
MaxSessions 200
Restart the SSH daemon after you update MaxStartups and MaxSessions. For example, on a CentOS 6 system, run the following command as the root user:
# service sshd restart
For detailed information about SSH configuration options, refer to the SSH documentation for your Linux distribution.
Note: If the grubby command does not update the kernels of a RHEL 7.x or CentOS 7.x system, you can manually update all kernels on the system. For example, to add the parameter transparent_hugepage=never to all kernels on a system.
1. Add the parameter to the GRUB_CMDLINE_LINUX line in the GRUB configuration file /etc/default/grub.
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet transparent_hugepage=never"
GRUB_DISABLE_RECOVERY="true"
2. As root, run the grub2-mkconfig command to update the kernels.
# grub2-mkconfig -o /boot/grub2/grub.cfg
3. Reboot the system.
Synchronizing System Clocks
You should use NTP (Network Time Protocol) to synchronize the system clocks on all hosts that comprise your Greenplum Database system. See www.ntp.org for more information about NTP.
NTP on the segment hosts should be configured to use the master host as the primary time source, and the standby master as the secondary time source. On the master and standby master hosts, configure NTP to point to your preferred time server.
To configure NTP
1. On the master host, log in as root and edit the /etc/ntp.conf file. Set the server parameter to point to your data center's NTP time server. For example (if 10.6.220.20 was the IP address of your data center's NTP server):
server 10.6.220.20
2. On each segment host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the master host, and the second server parameter to point to the standby master host. For example:
server mdw prefer
server smdw
3. On the standby master host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the primary master host, and the second server parameter to point to your data center's NTP time server. For example:
server mdw prefer
server 10.6.220.20
4. On the master host, use the NTP daemon to synchronize the system clocks on all Greenplum hosts. For example, using gpssh:
# gpssh -f hostfile_gpssh_allhosts -v -e 'ntpd'
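To confirm that each host is synchronizing against the expected time source, you might query NTP peer status on all hosts; this is a hedged sketch that reuses the same gpssh host file and assumes the ntpq utility is installed:

# gpssh -f hostfile_gpssh_allhosts -v -e 'ntpq -p'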
Creating the Greenplum Administrative User
Create a dedicated operating system user account on each node to run and administer Greenplum Database. This user account is named gpadmin by convention.
Important: You cannot run the Greenplum Database server as root.
The gpadmin user must have permission to access the services and directories required to install and run Greenplum Database.
The gpadmin user on each Greenplum host must have an SSH key pair installed and be able to SSH from any host in the cluster to any other host in the cluster without entering a password or passphrase (called "passwordless SSH"). If you enable passwordless SSH from the master host to every other host in the cluster ("1-n passwordless SSH"), you can use the Greenplum Database gpssh-exkeys command-line utility later to enable passwordless SSH from every host to every other host ("n-n passwordless SSH").
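As an illustration only (the host names are the conventional mdw/smdw/sdwN names used in this guide), 1-n passwordless SSH from the master host could be set up by copying the gpadmin public key to each host; the later Enabling Passwordless SSH step covers this in full:

$ su - gpadmin
$ ssh-copy-id smdw
$ ssh-copy-id sdw1
$ ssh-copy-id sdw2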
You can optionally give the gpadmin user sudo privilege, so that you can easily administer all hosts in the Greenplum Database cluster as gpadmin using the sudo, ssh/scp, and gpssh/gpscp commands.
The following steps show how to set up the gpadmin user on a host, set a password, create an SSH key pair, and (optionally) enable sudo capability. These steps must be performed as root on every Greenplum Database cluster host. (For a large Greenplum Database cluster you will want to automate these steps using your system provisioning tools.)
Note: See Example Ansible Playbook for an example that shows how to automate the tasks of creating the gpadmin user and installing the Greenplum Database software on all hosts in the cluster.
1. Create the gpadmin group and user.
Note: If you are installing Greenplum Database on RHEL 7.2 or CentOS 7.2 and want to disable IPC object removal by creating the gpadmin user as a system account, provide both the -r option (create the user as a system account) and the -m option (create a home directory) to the useradd command. On Ubuntu systems, you must use the -m option with the useradd command to create a home directory for a user.
This example creates the gpadmin group, creates the gpadmin user as a system account with a home directory and as a member of the gpadmin group, and creates a password for the user.
# groupadd gpadmin
# useradd gpadmin -r -m -g gpadmin
# passwd gpadmin
New password: <changeme>
Retype new password: <changeme>
Note: Make sure the gpadmin user has the same user id (uid) and group id (gid) numbers on each host to prevent problems with scripts or services that use them for identity or permissions. For example, backing up Greenplum databases to some networked filesystems or storage appliances could fail if the gpadmin user has different uid or gid numbers on different segment hosts. When you create the gpadmin group and user, you can use the groupadd -g option to specify a gid number and the useradd -u option to specify the uid number. Use the command id gpadmin to see the uid and gid for the gpadmin user on the current host.
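For example, a hedged sketch that pins the group and user ids to 1001 on every host (1001 is illustrative; use any free id that is identical cluster-wide):

# groupadd -g 1001 gpadmin
# useradd -u 1001 -r -m -g gpadmin gpadmin
# id gpadmin
uid=1001(gpadmin) gid=1001(gpadmin) groups=1001(gpadmin)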
2. Switch to the gpadmin user and generate an SSH key pair for the gpadmin user.
$ su gpadmin
$ ssh-keygen -t rsa -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/home/gpadmin/.ssh/id_rsa):
Created directory '/home/gpadmin/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
At the passphrase prompts, press Enter so that SSH connections will not require entry of a passphrase.
3. (Optional) Grant sudo access to the gpadmin user.
On Red Hat or CentOS, run visudo and uncomment the %wheel group entry.
%wheel ALL=(ALL) NOPASSWD: ALL
Make sure you uncomment the line that has the NOPASSWD keyword.
Add the gpadmin user to the wheel group with this command.
# usermod -aG wheel gpadmin
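A quick, hedged way to confirm the sudo configuration for gpadmin (it assumes the wheel entry above is in effect):

$ su - gpadmin
$ sudo whoami
root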
Installing the Greenplum Database Software
Describes how to install the Greenplum Database software binaries on all of the hosts that will comprise your Greenplum Database system, how to enable passwordless SSH for the gpadmin user, and how to verify the installation.
Perform the following tasks in order:
1. Installing Greenplum Database
2. Enabling Passwordless SSH
3. Confirm the software installation.
4. Next Steps
Installing Greenplum Database
You must install Greenplum Database on each host machine of the Greenplum Database system. Pivotal distributes the Greenplum Database software as a downloadable package that you install on each host system with the operating system's package management system. You can download the package from Pivotal Network.
Before you begin installing Greenplum Database, be sure you have completed the steps in Configuring Your Systems to configure each of the master, standby master, and segment host machines for Greenplum Database.
Important: After installing Greenplum Database, you must set Greenplum Database environment variables. See Setting Greenplum Environment Variables.
See Example Ansible Playbook for an example script that shows how you can automate creating the gpadmin user and installing the Greenplum Database.
Follow these instructions to install Greenplum Database.
Important: You require sudo or root user access to install from a pre-built RPM or DEB file.
1. Download and copy the Greenplum Database package to the gpadmin user's home directory on the master, standby master, and every segment host machine. The distribution file name has the format greenplum-db-<version>-<platform>.rpm for RHEL, CentOS, and Oracle Linux systems, or greenplum-db-<version>-<platform>.deb for Ubuntu systems, where <platform> is similar to rhel7-x86_64 (Red Hat 7 64-bit).
Note: For Oracle Linux installations, download and install the rhel7-x86_64 distribution files.
2. With sudo (or as root), install the Greenplum Database package on each host machine using your system's package manager software.
The yum or apt command automatically installs software dependencies, copies the Greenplum Database software files into a ver