Pivotal Greenplum Database
-
Upload
others
-
View
26
-
Download
0
Embed Size (px)
Citation preview
Version 6.11© 2020 VMware, Inc.
Copyright Release Notes
Copyright © 2020 VMware, Inc. or its affiliates. All Rights
Reserved.
Revised October 2020 (6.11.2)
Validating Your
Systems...................................................................................................................
62 Validating Network
Performance............................................................................................
62 Validating Disk I/O and Memory
Bandwidth...........................................................................63
Referencing IP
Addresses....................................................................................................667
Running Backend Server
Programs.....................................................................................667
Chapter 9: About the Tools
Package..................................................
1616
Chapter 10: Installing the Client and Loader Tools
Package............ 1617 Supported
Platforms......................................................................................................................1618
Installation
Procedure....................................................................................................................1619
About Your
Installation..................................................................................................................
1620 Running the UNIX Tools
Installer.................................................................................................
1621
Chapter 13: Using the Client and Loader
Tools................................. 1625
Prerequisites..................................................................................................................................
1626
Chapter 15: DataDirect ODBC Drivers for
Greenplum....................... 1632
Prerequisites..................................................................................................................................
1633 Supported Client
Platforms...........................................................................................................
1634 Installing on Linux
Systems..........................................................................................................
1635
14
Pivotal Greenplum 6.11 Release Notes
This document contains pertinent release information about Pivotal
Greenplum 6.11 releases. For previous versions of the release notes
for Greenplum Database, go to Pivotal Greenplum Database
Documentation. For information about Greenplum Database end of
life, see Pivotal Greenplum Database end of life policy.
Pivotal Greenplum 6 software is available for download from VMware
Tanzu Network.
Pivotal Greenplum 6 is based on the open source Greenplum Database
project code.
Important: VMware does not provide support for open source versions
of Greenplum Database. Only Pivotal Greenplum is supported by
VMware.
15
Release 6.11.2 Release Date: 2020-10-2
Pivotal Greenplum 6.11.2 is a maintenance release that includes
changes and resolves several issues.
Changed Features Greenplum Database 6.11.2 includes these
changes:
• Pivotal GPText version 3.4.5 is included, which includes bug
fixes. See the GPText 3.4.5 Release Notes for more
information.
• Pivotal Greenplum-Spark connector version 2.0.0 is included,
which includes feature changes and bug fixes. See the
Greenplum-Spark Connector 2.0.0 Release Notes for more
information.
Resolved Issues Pivotal Greenplum 6.11.2 resolves these
issues:
30549 - Management and Monitoring
30795 - GPORCA
Fixed a problem where GPORCA did not utilize an index scan for
certain subqueries, which could lead to poor performance for
affected queries.
30878 - GPORCA
If a CREATE TABLE .. AS statement was used to create a table with
non-legacy (jump consistent) hash algorithm distribution from a
source table that used the legacy (modulo) hash algorithm, GPORCA
would distribute the data according to the value of
gp_use_legacy_hashops; however, it would set the table's
distribution policy hash algorithm to the value of the original
table. This could cause queries to give incorrect results if the
distribution policy did not match the data distribution. This
problem has been resolved.
30903 - Metrics Collector
Workfile entries were sometimes freed prematurely, which could lead
to the postmaster process being reset on segments and failures in
query execution. This problem has been resolved.
30928 - GPORCA
If gp_use_legacy_hashops was enabled, GPORCA could crash when
generating the query plan for certain queries that included an
aggregate. This problem has been resolved.
174812955 - Query Execution
When executing a long query that contained multi-byte characters,
Greenplum could incorrectly truncate the query string (removing
multi-byte characters) and, if log_min_duration_statement was set
to 0, could subsequent write an invalid symbol to segment logs.
This behavior could cause errors in gp_toolkit and Command Center.
This problem has been resolved.
16
Release 6.11.1 Release Date: 2020-09-17
Pivotal Greenplum 6.11.1 is a maintenance release that includes
changes and resolves several issues.
Changed Features Greenplum Database 6.11.1 includes this
change:
• Greenplum Platform Extension Framework (PXF) version 5.15.1 is
included, which includes changes and bug fixes. Refer to the PXF
Release Notes for more information on release content and to access
the PXF documentation.
Resolved Issues Pivotal Greenplum 6.11.1 resolves these
issues:
30751, 173714727 - Query Optimizer
Resolves an issue where a correlated subquery that contained at
least one left or right outer join caused the Greenplum Database
master to crash when the server configuration parameter
optimizer_join_order was set to exhaustive2.
30880 - gpload
Fixed a problem where gpload operations would fail if a table
column name included capital letters or special characters.
30901 - GPORCA
For queries that included an outer ref in a subquery, such as
select * from foo where foo.a = (select foo.b from bar), GPORCA
always used the results of the subquery after unnesting the outer
reference. This could cause a crash or incorrect results if the
subquery returned no rows, or if the subquery contained a
projection with multiple values below the outer reference. To
address this problem, all such queries now fall back to using the
Postgres planner instead of GPORCA. Note that this behavior occurs
for cases where GPORCA would have returned correct results, as well
as for cases that could cause crashes or return incorrect
results.
30913, 170824967 - gpfdists
A command that accessed an external table using the gpfdists
protocol failed if the external table did not use an IP address
when specifying a host system in the LOCATION clause of the
external table definition. This issue is resolved in Greenplum
6.11.1.
174609237 - gpstart
gpstart was updated so that it does not attempt to start a standby
master segment when that segment is unreachable, preventing an
associated stack trace during startup.
Upgrading from Greenplum 6.x to Greenplum 6.11 Note: Greenplum 6
does not support direct upgrades from Greenplum 4 or Greenplum 5
releases, or from earlier Greenplum 6 Beta releases.
See Upgrading from an Earlier Greenplum 6 Release to upgrade your
existing Greenplum 6.x software to Greenplum 6.11.0.
17
Release 6.11.0 Release Date: 2020-09-11
Pivotal Greenplum 6.11.0 is a minor release that includes changed
features and resolves several issues.
Features Greenplum Database 6.11.0 includes these new and changed
features:
• GPORCA partition elimination has been enhanced to support a
subset of lossy assignment casts that are order-preserving
(increasing) functions, including timestamp::date and float::int.
For example, GPORCA supports partition elimination when a partition
column is defined with the timestamp datatype and the query
contains a predicate such as WHERE ts::date == '2020-05-10' that
performs a cast on the partitioned column (ts) to compare column
data (a timestamp) to a date.
• PXF version 5.15.0 is included, which includes new and changed
features and bug fixes. Refer to the PXF Release Notes for more
information on release content and supported platforms, and to
access the PXF documentation.
• Greenplum Command Center 6.3.0 and 4.11.0 are included, which
include new workload management and other features, as well as bug
fixes. See the Command Center Release Notes for more
information.
• The DataDirect ODBC Drivers for Pivotal Greenplum were updated to
version 07.16.0389 (B0562, U0408). This version introduces support
for the following datatypes:
Greenplum Datatype ODBC Datatype
30899 - Resource Groups
In some cases when running queries are managed by resource groups,
Greenplum Database generated a PANIC when managing runaway queries
(queries that use an excessive amount of memory) because of locking
issues. This issue is resolved.
30877 - VACUUM
In some cases, running VACUUM returns ERROR: found xmin <xid>
from before relfrozenxid <frozen_xid>. The error was caused
when a previously run VACUUM FULL was interrupted and aborted on a
query executor (QE) and corrupted catalog frozen XID information.
This issue is resolved.
30870 - Segment Mirroring
In some cases, performing an incremental recovery of a Greenplum
Database segment instance failed with the message requested WAL
segment has already been removed because the recovery checkpoint
was not created properly. This issue is resolved.
30858 - analyzedb
18
analyzedb failed if analyzedb attempted to update statistics for a
set of tables and one of the tables was dropped and then recreated
while analyzedb was running. analyzedb has been enhanced better
handle the specified situation.
30845 - Query Execution
Under heavy load when running multiple queries, some queries
randomly failed with the error Error on receive from seg<ID>.
The error was caused when Greenplum Database encountered a divide
by 0 error while managing the backend processes that are used to
run queries on the segment instances. This issue is resolved.
30761 - Postgres Planner
In some cases, Greenplum Database generated a PANIC when a DROP
VIEW command was cancelled from the Greenplum Command Center. The
PANIC was generated when Greenplum Database did not correctly
handle the visibility of the relation.
30721 - gpcheckcat
Resolved a problem where gpcheckcat would fail with Missing or
extraneous entries check errors if the gp_sparse_vector extension
was installed.
30637 - Query Optimizer
For some queries against partitioned tables, GPORCA did not perform
partition elimination when a predicate that includes the partition
column also performs an explicit cast. For example, GPORCA would
not perform partition elimination when a partition column is
defined with the timestamp datatype and the query contains a
predicate such as WHERE ts::date == '2020-05-10' that performs a
cast on the partitioned column (ts) to compare column data (a
timestamp) to a date. GPORCA partition elimination has been
improved to support the specified type of query. See
Features.
10491 - Postgres Planner
For some queries that contain nested subqueries that do not specify
a relation and also contain a nested GROUP BY clauses, Greenplum
Database generated a PANIC. The PANIC was generated when Greenplum
Database did not correctly manage the subquery correctly. This is
an example of the specified type or query.
SELECT * FROM (SELECT * FROM (SELECT c1, SUM(c2) c2 FROM mytbl
GROUP BY c1 ) t2 ) t3 GROUP BY c2, ROLLUP((c1)) ORDER BY 1,
2;
This issue is resolved.
10561 - Server
Greenplum Database does not support altering the datatype of a
column defined as a distribution key or with a constraint. When
attempting to change the datatype, the error message did not
clearly indicate the cause. The error message has been altered to
provide more information.
174505130 - Resource Groups
In some cases for a query managed by resource group, the resource
group cancelled the query with the message Canceling query because
of high VMEM usage because the resource group calculated the
incorrect memory used by the query. This issue is resolved.
174353156 - Interconnect
In some cases when Greenplum Database uses proxies for interconnect
communication (the server configuration parameter
gp_interconnect_type is set to proxy), a Greenplum background
worker process became an orphaned process after the postmaster
process was terminated. This issue is resolved.
174205590 - Interconnect
19
When Greenplum Database uses proxies for interconnect communication
(the server configuration parameter gp_interconnect_type is set to
proxy), a query might have hung if the query contains multiple
concurrent subplans running on the segment instances. The query
hung when the Greenplum interconnect did not properly handle the
communication among the concurrent subplans. This issue is
resolved.
174483149 - Cluster Management - gpinitsystem
gpinitsystem now exports the MASTER_DATA_DIRECTORY environment
variable before calling gpconfig, to avoid throwing warning
messages when configuring system parameters on Greenplum Database
appliances (DCA).
Upgrading from Greenplum 6.x to Greenplum 6.11 Note: Greenplum 6
does not support direct upgrades from Greenplum 4 or Greenplum 5
releases, or from earlier Greenplum 6 Beta releases.
See Upgrading from an Earlier Greenplum 6 Release to upgrade your
existing Greenplum 6.x software to Greenplum 6.11.0.
Pivotal Greenplum 6.11 Release Notes Release Notes
20
Deprecated Features Deprecated features will be removed in a future
major release of Greenplum Database. Pivotal Greenplum 6.x
deprecates:
• The gpsys1 utility. • The analzyedb option --skip_root_stats
(deprecated since 6.2).
If the option is specified, a warning is issued stating that the
option will be ignored. • The server configuration parameter
gp_statistics_use_fkeys (deprecated since 6.2). • The server
configuration parameter gp_ignore_error_table (deprecated since
6.0).
To avoid a Greenplum Database syntax error, set the value of this
parameter to true when you run applications that execute CREATE
EXTERNAL TABLE or COPY commands that include the now removed
Greenplum Database 4.3.x INTO ERROR TABLE clause.
• Specifying => as an operator name in the CREATE OPERATOR
command (deprecated since 6.0). • The Greenplum external table C
API (deprecated since 6.0).
Any developers using this API are encouraged to use the new Foreign
Data Wrapper API in its place. • Commas placed between a
SUBPARTITION TEMPLATE clause and its corresponding
SUBPARTITION
BY clause, and between consecutive SUBPARTITION BY clauses in a
CREATE TABLE command (deprecated since 6.0).
Using this undocumented syntax will generate a deprecation warning
message. • The timestamp format YYYYMMDDHH24MISS (deprecated since
6.0).
This format could not be parsed unambiguously in previous Greenplum
Database releases, and is not supported in PostgreSQL 9.4.
• The createlang and droplang utilities (deprecated since 6.0). •
The pg_resqueue_status system view (deprecated since 6.0).
Use the gp_toolkit.gp_resqueue_status view instead. • The GLOBAL
and LOCAL modifiers when creating a temporary table with the CREATE
TABLE and
CREATE TABLE AS commands (deprecated since 6.0).
These keywords are present for SQL standard compatibility, but have
no effect in Greenplum Database. • Using WITH OIDS or oids=TRUE to
assign an OID system column when creating or altering a table
(deprecated since 6.0). • Allowing superusers to specify the
SQL_ASCII encoding regardless of the locale settings
(deprecated
since 6.0).
This choice may result in misbehavior of character-string functions
when data that is not encoding- compatible with the locale is
stored in the database.
• The @@@ text search operator (deprecated since 6.0).
This operator is currently a synonym for the @@ operator. • The
unparenthesized syntax for option lists in the VACUUM command
(deprecated since 6.0).
This syntax requires that the options to the command be specified
in a specific order. • The plain pgbouncer authentication type
(auth_type = plain) (deprecated since 4.x).
Pivotal Greenplum 6.11 Release Notes Release Notes
21
Migrating Data to Greenplum 6 Note: Greenplum 6 does not support
direct upgrades from Greenplum 4 or Greenplum 5 releases, or from
earlier Greenplum 6 Beta releases.
See Migrating Data from Greenplum 4.3 or 5 for guidelines and
considerations for migrating existing Greenplum data to Greenplum
6, using standard backup and restore procedures.
Pivotal Greenplum 6.11 Release Notes Release Notes
22
Known Issues and Limitations Pivotal Greenplum 6 has these
limitations:
• Upgrading a Greenplum Database 4 or 5 release, or Greenplum 6
Beta release, to Greenplum 6 is not supported.
• MADlib, GPText, and PostGIS are not yet provided for installation
on Ubuntu systems. • Greenplum 6 is not supported for installation
on DCA systems. • Greenplum for Kubernetes is not yet provided with
this release.
The following table lists key known issues in Pivotal Greenplum
6.x.
Table 1: Key Known Issues in Pivotal Greenplum 6.x
Issue Category Description
N/A Backup/Restore Restoring the Greenplum Database backup for a
table fails in Greenplum 6 versions earlier than version 6.10 when
a replicated table has an inheritance relationship to/from another
table that was assigned via an ALTER TABLE ... INHERIT statement
after table creation.
Workaround: Use the following SQL commands to determine if
Greenplum Database includes any replicated tables that inherit from
a parent table, or if there are replicated tables that are
inherited by a child table:
SELECT inhrelid::regclass FROM pg_inherits, gp_distribution_policy
dp WHERE inhrelid=dp.localoid AND dp.policytype='r'; SELECT
inhparent::regclass FROM pg_inherits, gp_distribution_policy dp
WHERE inhparent=dp.localoid AND dp.policytype= 'r';
If these queries return any tables, you may choose to run gprestore
with the -–on-error-continue flag to not fail the entire restore
when this issue is hit. Or, you can specify the list of tables
returned by the queries to the -–exclude-table-file option to skip
those tables during restore. You must recreate and repopulate the
affected tables after restore.
N/A Spark Connector
This version of Greenplum is not compatible with Greenplum-Spark
Connector versions earlier than version 1.7.0, due to a change in
how Greenplum handles distributed transaction IDs.
N/A PXF Starting in 6.x, Greenplum does not bundle cURL and instead
loads the system-provided library. PXF requires cURL version 7.29.0
or newer. The officially-supported cURL for the CentOS 6.x and Red
Hat Enterprise Linux 6.x operating systems is version 7.19.*.
Greenplum Database 6 does not support running PXF on CentOS 6.x or
RHEL 6. x due to this limitation.
Workaround: Upgrade the operating system of your Greenplum Database
6 hosts to CentOS 7+ or RHEL 7+, which provides a cURL version
suitable to run PXF.
Pivotal Greenplum 6.11 Release Notes Release Notes
23
29703 Loading Data from External Tables
Due to limitations in the Greenplum Database external table
framework, Greenplum Database cannot log the following types of
errors that it encounters while loading data:
• data type parsing errors • unexpected value type errors • data
type conversion errors • errors returned by native and user-defined
functions
LOG ERRORS returns error information for data exceptions only. When
it encounters a parsing error, Greenplum terminates the load job,
but it cannot log and propagate the error back to the user via gp_
read_error_log().
Workaround: Clean the input data before loading it into Greenplum
Database.
30594 Resource Management
Resource queue-related statistics may be inaccurate in certain
cases. VMware recommends that you use the resource group resource
management scheme that is available in Greenplum 6.
30522 Logging Greenplum Database may write a FATAL message to the
standby master or mirror log stating that the database system is in
recovery mode when the instance is synchronizing with the master
and Greenplum attempts to contact it before the operation
completes. Ignore these messages and use gpstate -f output to
determine if the standby successfully synchronized with the
Greenplum master; the command returns Sync state: sync if it is
synchronized.
30537 Postgres Planner
The Postgres Planner generates a very large query plan that causes
out of memory issues for the following type of CTE (common table
expression) query: the WITH clause of the CTE contains a
partitioned table with a large number partitions, and the WITH
reference is used in a subquery that joins another partitioned
table.
Workaround: If possible, use the GPORCA query optimizer. With the
server configuration parameter optimizer=on, Greenplum Database
attempts to use GPORCA for query planning and optimization when
possible and falls back to the Postgres Planner when GPORCA cannot
be used. Also, the specified type of query might require a long
time to complete.
170824967 gpfidsts For Greenplum Database 6.x, a command that
accesses an external table that uses the gpfdists protocol fails if
the external table does not use an IP address when specifying a
host system in the LOCATION clause of the external table
definition. This issue is resolved in Greenplum 6.11.1.
n/a Materialized Views
By default, certain gp_toolkit views do not display data for
materialized views. If you want to include this information in gp_
toolkit view output, you must redefine a gp_toolkit internal view
as described in Including Data for Materialized Views.
168957894 PXF The PXF Hive Connector does not support using the
Hive* profiles to access Hive transactional tables.
Workaround: Use the PXF JDBC Connector to access Hive.
Pivotal Greenplum 6.11 Release Notes Release Notes
24
Issue Category Description
168548176 gpbackup When using gpbackup to back up a Greenplum
Database 5.7.1 or earlier 5.x release with resource groups enabled,
gpbackup returns a column not found error for t6.value AS
memoryauditor.
164791118 PL/R PL/R cannot be installed using the deprecated
createlang utility, and displays the error:
createlang: language installation failed: ERROR: no schema has been
selected to create in
Workaround: Use CREATE EXTENSION to install PL/R, as described in
the documentation.
N/A Greenplum Client/Load Tools on Windows
The Greenplum Database client and load tools on Windows have not
been tested with Active Directory Kerberos authentication.
Pivotal Greenplum 6.11 Release Notes Release Notes
25
Differences Compared to Open Source Greenplum Database
Pivotal Greenplum 6.x includes all of the functionality in the open
source Greenplum Database project and adds:
• Product packaging and installation script • Support for QuickLZ
compression. QuickLZ compression is not provided in the open source
version of
Greenplum Database due to licensing restrictions. • Support for
data connectors:
• Greenplum-Spark Connector • Greenplum-Informatica Connector •
Greenplum-Kafka Integration • Greenplum Streaming Server
• Data Direct ODBC/JDBC Drivers • gpcopy utility for copying or
migrating objects between Greenplum systems • Support for managing
Greenplum Database using Pivotal Greenplum Command Center • Support
for full text search and text analysis using Pivotal GPText •
Greenplum backup plugin for DD Boost • Backup/restore storage
plugin API
26
Installing and Upgrading Greenplum Release Notes
27
Platform Requirements This topic describes the Pivotal Greenplum
Database 6 platform and operating system software
requirements.
Important: Pivotal Support does not provide support for open source
versions of Greenplum Database. Only Pivotal Greenplum Database is
supported by Pivotal Support.
• Operating Systems
• Client Tools • Extensions • Data Connectors • GPText • Greenplum
Command Center
• Hadoop Distributions
Operating Systems Pivotal Greenplum 6 runs on the following
operating system platforms:
• Red Hat Enterprise Linux 64-bit 7.x (See the following Note.) •
Red Hat Enterprise Linux 64-bit 6.x • CentOS 64-bit 7.x • CentOS
64-bit 6.x • Ubuntu 18.04 LTS • Oracle Linux 64-bit 7, using the
Red Hat Compatible Kernel (RHCK)
Important: Significant Greenplum Database performance degradation
has been observed when enabling resource group-based workload
management on RedHat 6.x and CentOS 6.x systems. This issue is
caused by a Linux cgroup kernel bug. This kernel bug has been fixed
in CentOS 7.x and Red Hat 7.x systems.
If you use RedHat 6 and the performance with resource groups is
acceptable for your use case, upgrade your kernel to version
2.6.32-696 or higher to benefit from other fixes to the cgroups
implementation.
Note: For Greenplum Database that is installed on Red Hat
Enterprise Linux 7.x or CentOS 7.x prior to 7.3, an operating
system issue might cause Greenplum Database that is running large
workloads to hang in the workload. The Greenplum Database issue is
caused by Linux kernel bugs.
RHEL 7.3 and CentOS 7.3 resolves the issue.
Greenplum Database server supports TLS version 1.2 on RHEL/CentOS
systems, and TLS version 1.3 on Ubuntu systems.
Software Dependencies Greenplum Database 6 requires the following
software packages on RHEL/CentOS 6/7 systems which are installed
automatically as dependencies when you install the Pivotal
Greenplum Database RPM package):
• apr
28
• apr-util • bash • bzip2 • curl • krb5 • libcurl • libevent (or
libevent2 on RHEL/CentOS 6) • libxml2 • libyaml • zlib • openldap •
openssh • openssl • openssl-libs (RHEL7/Centos7) • perl • readline
• rsync • R • sed (used by gpinitsystem) • tar • zip
Greenplum Database 6 client software requires these operating
system packages:
• apr • apr-util • libyaml • libevent
On Ubuntu systems, Greenplum Database 6 requires the following
software packages, which are installed automatically as
dependencies when you install Greenplum Database with the Debian
package installer:
• libapr1 • libaprutil1 • bash • bzip2 • krb5-multidev •
libcurl3-gnutls • libcurl4 • libevent-2.1-6 • libxml2 • libyaml-0-2
• zlib1g • libldap-2.4-2 • openssh-client • openssh-client •
openssl • perl • readline • rsync • sed • tar • zip
Installing and Upgrading Greenplum Release Notes
29
• net-tools • less • iproute2
Greenplum Database 6 uses Python 2.7.12, which is included with the
product installation (and not installed as a package
dependency).
Important: SSL is supported only on the Greenplum Database master
host system. It cannot be used on the segment host systems.
Important: For all Greenplum Database host systems, SELinux must be
disabled. You should also disable firewall software, although
firewall software can be enabled if it is required for security
purposes. See Disabling SELinux and Firewall Software.
Java Greenplum 6 supports these Java versions for PL/Java and
PXF:
• Open JDK 8 or Open JDK 11, available from AdoptOpenJDK • Oracle
JDK 8 or Oracle JDK 11
Hardware and Network The following table lists minimum recommended
specifications for hardware servers intended to support Greenplum
Database on Linux systems in a production environment. All host
servers in your Greenplum Database system must have the same
hardware and software configuration. Greenplum also provides
hardware build guides for its certified hardware platforms. It is
recommended that you work with a Greenplum Systems Engineer to
review your anticipated environment to ensure an appropriate
hardware configuration for Greenplum Database.
Table 2: Minimum Hardware Requirements
Minimum CPU Any x86_64 compatible CPU
Minimum Memory 16 GB RAM per server
Disk Space Requirements • 150MB per host for Greenplum installation
• Approximately 300MB per segment instance for
meta data • Appropriate free space for data with disks at no
more than 70% capacity
NIC bonding is recommended when multiple interfaces are
present
Pivotal Greenplum can use either IPV4 or IPV6 protocols.
Storage The only file system supported for running Greenplum
Database is the XFS file system. All other file systems are
explicitly not supported by Pivotal.
Greenplum Database is supported on network or shared storage if the
shared storage is presented as a block device to the servers
running Greenplum Database and the XFS file system is mounted on
the block device. Network file systems are not supported. When
using network or shared storage, Greenplum
30
Database mirroring must be used in the same way as with local
storage, and no modifications may be made to the mirroring scheme
or the recovery scheme of the segments.
Other features of the shared storage such as de-duplication and/or
replication are not directly supported by Pivotal Greenplum
Database, but may be used with support of the storage vendor as
long as they do not interfere with the expected operation of
Greenplum Database at the discretion of Pivotal.
Greenplum Database can be deployed to virtualized systems only if
the storage is presented as block devices and the XFS file system
is mounted for the storage of the segment directories.
Greenplum Database is supported on Amazon Web Services (AWS)
servers using either Amazon instance store (Amazon uses the volume
names ephemeral[0-20]) or Amazon Elastic Block Store (Amazon EBS)
storage. If using Amazon EBS storage the storage should be RAID of
Amazon EBS volumes and mounted with the XFS file system for it to
be a supported configuration.
Data Domain Boost Pivotal Greenplum 6.0.0 supports Data Domain
Boost for backup on Red Hat Enterprise Linux. This table lists the
versions of Data Domain Boost SDK and DDOS supported by Pivotal
Greenplum 6.x.
Table 3: Data Domain Boost Compatibility
Pivotal Greenplum Data Domain Boost DDOS
6.x 3.3 6.1 (all versions)
6.0 (all versions)
Note: In addition to the DDOS versions listed in the previous
table, Pivotal Greenplum supports all minor patch releases (fourth
digit releases) later than the certified version.
Tools and Extensions Compatibility • Client Tools • Extensions •
Data Connectors • GPText • Greenplum Command Center
Client Tools Greenplum Database 6 releases a Clients tool package
on various platforms that can be used to access Greenplum Database
from a client system. The Greenplum 6 Clients tool package is
supported on the following platforms:
• Red Hat Enterprise Linux x86_64 6.x (RHEL 6) • Red Hat Enterprise
Linux x86_64 7.x (RHEL 7) • Ubuntu 18.04 LTS • Windows 10 (32-bit
and 64-bit) • Windows 8 (32-bit and 64-bit) • Windows Server 2012
(32-bit and 64-bit) • Windows Server 2012 R2 (32-bit and 64-bit) •
Windows Server 2008 R2 (32-bit and 64-bit)
The Greenplum 6 Clients package includes the client and loader
programs provided in the Greenplum 5 packages plus the addition of
database/role/language commands and the Greenplum-Kafka Integration
and Greenplum Streaming Server command utilities. Refer to
Greenplum Client and Loader Tools Package for installation and
usage details of the Greenplum 6 Client tools.
Installing and Upgrading Greenplum Release Notes
31
Extensions This table lists the versions of the Pivotal Greenplum
Extensions that are compatible with this release of Greenplum
Database 6.
Table 4: Pivotal Greenplum 6 Extensions Compatibility
Component Package Version Additional Information
PL/Java 2.0.2 Supports Java 8 and 11.
Python Data Science Module Package
2.0.2
R Data Science Library Package 2.0.2
PL/Container 2.1.2
Python 3.7
GreenplumR 1.1.0 Supports R 3.6+.
MADlib Machine Learning 1.17, 1.16 Support matrix at MADlib
FAQ.
PostGIS Spatial and Geographic Objects
2.5.4+pivotal.3, 2.5.4+pivotal.2, 2. 5.4+pivotal.1,
2.1.5+pivotal.2-2
• Fuzzy String Match Extension • PL/Python Extension • pgcrypto
Extension
Data Connectors • Greenplum Platform Extension Framework (PXF)
v5.15.0 - PXF, integrated with Greenplum Database
6, provides access to Hadoop, object store, and SQL external data
stores. Refer to Accessing External Data with PXF in the Greenplum
Database Administrator Guide for PXF configuration and usage
information.
• Greenplum-Kafka Integration - The Pivotal Greenplum-Kafka
Integration provides high speed, parallel data transfer from a
Kafka cluster to a Pivotal Greenplum Database cluster for batch and
streaming ETL operations. It requires Kafka version 0.11 or newer
for exactly-once delivery assurance. Refer to the Pivotal
Greenplum-Kafka Integration Documentation for more information
about this feature.
• Greenplum Streaming Server v1.4.1 - The Pivotal Greenplum
Streaming Server is an ETL tool that provides high speed, parallel
data transfer from Informatica, Kafka, and custom client data
sources to a
32
Pivotal Greenplum Database cluster. Refer to the Pivotal Greenplum
Streaming Server Documentation for more information about this
feature.
• Greenplum Informatica Connector v1.0.5 - The Pivotal Greenplum
Informatica Connector supports high speed data transfer from an
Informatica PowerCenter cluster to a Pivotal Greenplum Database
cluster for batch and streaming ETL operations.
• Greenplum Spark Connector v1.6.2 - The Pivotal Greenplum Spark
Connector supports high speed, parallel data transfer between
Greenplum Database and an Apache Spark cluster using Spark’s Scala
API.
• Progress DataDirect JDBC Drivers v5.1.4.000223 - The Progress
DataDirect JDBC drivers are compliant with the Type 4 architecture,
but provide advanced features that define them as Type 5
drivers.
• Progress DataDirect ODBC Drivers v7.1.6 (07.16.0389) - The
Progress DataDirect ODBC drivers enable third party applications to
connect via a common interface to the Pivotal Greenplum Database
system.
Note: Pivotal Greenplum 6 does not support the ODBC driver for
Cognos Analytics V11.
Connecting to IBM Cognos software with an ODBC driver is not
supported. Greenplum Database supports connecting to IBM Cognos
software with the DataDirect JDBC driver for Pivotal Greenplum.
This driver is available as a download from Pivotal Network.
GPText Pivotal Greenplum Database 6 is compatible with Pivotal
Greenplum Text version 3.3.1 and later. See the Greenplum Text
documentation for additional compatibility information.
Greenplum Command Center Pivotal Greenplum Database 6.8 and later
are compatible only with Pivotal Greenplum Command Center 6.2 and
later. See the Greenplum Command Center documentation for
additional compatibility information.
Hadoop Distributions Greenplum Database provides access to HDFS
with the Greenplum Platform Extension Framework (PXF).
PXF can use Cloudera, Hortonworks Data Platform, MapR, and generic
Apache Hadoop distributions. PXF bundles all of the JAR files on
which it depends, including the following Hadoop libraries:
Table 5: PXF Hadoop Supported Platforms
PXF Version Hadoop Version Hive Server Version HBase Server
Version
5.15.0, 5.14.0, 5.13.0, 5. 12.0, 5.11.1, 5.10.1
2.x, 3.1+ 1.x, 2.x, 3.1+ 1.3.2
5.8.2 2.x 1.x 1.3.2
5.8.1 2.x 1.x 1.3.2
Note: If you plan to access JSON format data stored in a Cloudera
Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop
distribution.
33
Introduction to Greenplum High-level overview of the Greenplum
Database system architecture.
Greenplum Database stores and processes large amounts of data by
distributing the load across several servers or hosts. A logical
database in Greenplum is an array of individual PostgreSQL
databases working together to present a single database image. The
master is the entry point to the Greenplum Database system. It is
the database instance to which users connect and submit SQL
statements. The master coordinates the workload across the other
database instances in the system, called segments, which handle
data processing and storage. The segments communicate with each
other and the master over the interconnect, the networking layer of
Greenplum Database.
Greenplum Database is a software-only solution; the hardware and
database software are not coupled. Greenplum Database runs on a
variety of commodity server platforms from Greenplum-certified
hardware vendors. Performance depends on the hardware on which it
is installed. Because the database is distributed across multiple
machines in a Greenplum Database system, proper selection and
configuration of hardware is vital to achieving the best possible
performance.
This chapter describes the major components of a Greenplum Database
system and the hardware considerations and concepts associated with
each component: The Greenplum Master, The Segments and The
Interconnect. Additionally, a system may have optional ETL Hosts
for Data Loading and Greenplum Performance Monitoring for
monitoring query workload and performance.
Installing and Upgrading Greenplum Release Notes
34
The Greenplum Master The master is the entry point to the Greenplum
Database system. It is the database server process that accepts
client connections and processes the SQL commands that system users
issue. Users connect to Greenplum Database through the master using
a PostgreSQL-compatible client program such as psql or ODBC.
The master maintains the system catalog (a set of system tables
that contain metadata about the Greenplum Database system itself),
however the master does not contain any user data. Data resides
only on the segments. The master authenticates client connections,
processes incoming SQL commands, distributes the work load between
segments, coordinates the results returned by each segment, and
presents the final results to the client program.
Because the master does not contain any user data, it has very
little disk load. The master needs a fast, dedicated CPU for data
loading, connection handling, and query planning because extra
space is often necessary for landing load files and backup files,
especially in production environments. Customers may decide to also
run ETL and reporting tools on the master, which requires more disk
space and processing power.
Master Redundancy You may optionally deploy a backup or mirror of
the master instance. A backup master host serves as a warm standby
if the primary master host becomes nonoperational. You can deploy
the standby master on a designated redundant master host or on one
of the segment hosts.
The standby master is kept up to date by a transaction log
replication process, which runs on the standby master host and
synchronizes the data between the primary and standby master hosts.
If the primary master fails, the log replication process shuts
down, and an administrator can activate the standby master in its
place. When an the standby master is active, the replicated logs
are used to reconstruct the state of the master host at the time of
the last successfully committed transaction.
Since the master does not contain any user data, only the system
catalog tables need to be synchronized between the primary and
backup copies. When these tables are updated, changes automatically
copy over to the standby master so it is always synchronized with
the primary.
Figure 1: Master Mirroring in Greenplum Database
The Segments In Greenplum Database, the segments are where data is
stored and where most query processing occurs. User-defined tables
and their indexes are distributed across the available segments in
the Greenplum Database system; each segment contains a distinct
portion of the data. Segment instances are the
Installing and Upgrading Greenplum Release Notes
35
database server processes that serve segments. Users do not
interact directly with the segments in a Greenplum Database system,
but do so through the master.
In the reference Greenplum Database hardware configurations, the
number of segment instances per segment host is determined by the
number of effective CPUs or CPU core. For example, if your segment
hosts have two dual-core processors, you may have two or four
primary segments per host. If your segment hosts have three
quad-core processors, you may have three, six or twelve segments
per host. Performance testing will help decide the best number of
segments for a chosen hardware platform.
Segment Redundancy When you deploy your Greenplum Database system,
you have the option to configure mirror segments. Mirror segments
allow database queries to fail over to a backup segment if the
primary segment becomes unavailable. Mirroring is a requirement for
Pivotal-supported production Greenplum Database systems.
A mirror segment must always reside on a different host than its
primary segment. Mirror segments can be arranged across the hosts
in the system in one of two standard configurations, or in a custom
configuration you design. The default configuration, called group
mirroring, places the mirror segments for all primary segments on a
host on one other host. Another option, called spread mirroring,
spreads mirrors for each host's primary segments over the remaining
hosts. Spread mirroring requires that there be more hosts in the
system than there are primary segments on the host. On hosts with
multiple network interfaces, the primary and mirror segments are
distributed equally among the interfaces. Figure 2: Data Mirroring
in Greenplum Database shows how table data is distributed across
the segments when the default group mirroring option is
configured.
Figure 2: Data Mirroring in Greenplum Database
Segment Failover and Recovery
When mirroring is enabled in a Greenplum Database system, the
system automatically fails over to the mirror copy if a primary
copy becomes unavailable. A Greenplum Database system can remain
operational
Installing and Upgrading Greenplum Release Notes
36
if a segment instance or host goes down only if all portions of
data are available on the remaining active segments.
If the master cannot connect to a segment instance, it marks that
segment instance as invalid in the Greenplum Database system
catalog. The segment instance remains invalid and out of operation
until an administrator brings that segment back online. An
administrator can recover a failed segment while the system is up
and running. The recovery process copies over only the changes that
were missed while the segment was nonoperational.
If you do not have mirroring enabled and a segment becomes invalid,
the system automatically shuts down. An administrator must recover
all failed segments before operations can continue.
Example Segment Host Hardware Stack Regardless of the hardware
platform you choose, a production Greenplum Database processing
node (a segment host) is typically configured as described in this
section.
The segment hosts do the majority of database processing, so the
segment host servers are configured in order to achieve the best
performance possible from your Greenplum Database system. Greenplum
Database's performance will be as fast as the slowest segment
server in the array. Therefore, it is important to ensure that the
underlying hardware and operating systems that are running
Greenplum Database are all running at their optimal performance
level. It is also advised that all segment hosts in a Greenplum
Database array have identical hardware resources and
configurations.
Segment hosts should also be dedicated to Greenplum Database
operations only. To get the best query performance, you do not want
Greenplum Database competing with other applications for machine or
network resources.
The following diagram shows an example Greenplum Database segment
host hardware stack. The number of effective CPUs on a host is the
basis for determining how many primary Greenplum Database segment
instances to deploy per segment host. This example shows a host
with two effective CPUs (one dual-core CPU). Note that there is one
primary segment instance (or primary/mirror pair if using
mirroring) per CPU core.
Installing and Upgrading Greenplum Release Notes
37
Figure 3: Example Greenplum Database Segment Host
Configuration
Example Segment Disk Layout Each CPU is typically mapped to a
logical disk. A logical disk consists of one primary file system
(and optionally a mirror file system) accessing a pool of physical
disks through an I/O channel or disk controller. The logical disk
and file system are provided by the operating system. Most
operating systems provide the ability for a logical disk drive to
use groups of physical disks arranged in RAID arrays.
Figure 4: Logical Disk Layout in Greenplum Database
Depending on the hardware platform you choose, different RAID
configurations offer different performance and capacity levels.
Greenplum supports and certifies a number of reference hardware
platforms and operating systems. Check with your sales account
representative for the recommended configuration on your chosen
platform.
Installing and Upgrading Greenplum Release Notes
38
The Interconnect The interconnect is the networking layer of
Greenplum Database. When a user connects to a database and issues a
query, processes are created on each of the segments to handle the
work of that query. The interconnect refers to the inter-process
communication between the segments, as well as the network
infrastructure on which this communication relies. The interconnect
uses a standard 10 Gigabit Ethernet switching fabric.
By default, Greenplum Database interconnect uses UDP (User Datagram
Protocol) with flow control for interconnect traffic to send
messages over the network. The Greenplum software does the
additional packet verification and checking not performed by UDP,
so the reliability is equivalent to TCP (Transmission Control
Protocol), and the performance and scalability exceeds that of TCP.
For information about the types of interconnect supported by
Greenplum Database, see server configuration parameter
gp_interconnect_type in the Greenplum Database Reference
Guide.
Interconnect Redundancy A highly available interconnect can be
achieved by deploying dual 10 Gigabit Ethernet switches on your
network, and redundant 10 Gigabit connections to the Greenplum
Database master and segment host servers.
Network Interface Configuration A segment host typically has
multiple network interfaces designated to Greenplum interconnect
traffic. The master host typically has additional external network
interfaces in addition to the interfaces used for interconnect
traffic.
Depending on the number of interfaces available, you will want to
distribute interconnect network traffic across the number of
available interfaces. This is done by assigning segment instances
to a particular network interface and ensuring that the primary
segments are evenly balanced over the number of available
interfaces.
This is done by creating separate host address names for each
network interface. For example, if a host has four network
interfaces, then it would have four corresponding host addresses,
each of which maps to one or more primary segment instances. The
/etc/hosts file should be configured to contain not only the host
name of each machine, but also all interface host addresses for all
of the Greenplum Database hosts (master, standby master, segments,
and ETL hosts).
With this configuration, the operating system automatically selects
the best path to the destination. Greenplum Database automatically
balances the network destinations to maximize parallelism.
Installing and Upgrading Greenplum Release Notes
39
Figure 5: Example Network Interface Architecture
Switch Configuration When using multiple 10 Gigabit Ethernet
switches within your Greenplum Database array, evenly divide the
number of subnets between each switch. In this example
configuration, if we had two switches, NICs 1 and 2 on each host
would use switch 1 and NICs 3 and 4 on each host would use switch
2. For the master host, the host name bound to NIC 1 (and therefore
using switch 1) is the effective master host name for the array.
Therefore, if deploying a warm standby master for redundancy
purposes, the standby master should map to a NIC that uses a
different switch than the primary master.
Installing and Upgrading Greenplum Release Notes
40
Figure 6: Example Switch Configuration
ETL Hosts for Data Loading Greenplum supports fast, parallel data
loading with its external tables feature. By using external tables
in conjunction with Greenplum Database's parallel file server
(gpfdist), administrators can achieve maximum parallelism and load
bandwidth from their Greenplum Database system. Many production
systems deploy designated ETL servers for data loading purposes.
These machines run the Greenplum parallel file server (gpfdist),
but not Greenplum Database instances.
One advantage of using the gpfdist file server program is that it
ensures that all of the segments in your Greenplum Database system
are fully utilized when reading from external table data
files.
The gpfdist program can serve data to the segment instances at an
average rate of about 350 MB/s for delimited text formatted files
and 200 MB/s for CSV formatted files. Therefore, you should
consider the following options when running gpfdist in order to
maximize the network bandwidth of your ETL systems:
• If your ETL machine is configured with multiple network interface
cards (NICs) as described in Network Interface Configuration, run
one instance of gpfdist on your ETL host and then define your
external table definition so that the host name of each NIC is
declared in the LOCATION clause (see CREATE EXTERNAL TABLE in the
Greenplum Database Reference Guide). This allows network traffic
between your Greenplum segment hosts and your ETL host to use all
NICs simultaneously.
Installing and Upgrading Greenplum Release Notes
41
Figure 7: External Table Using Single gpfdist Instance with
Multiple NICs
• Run multiple gpfdist instances on your ETL host and divide your
external data files equally between each instance. For example, if
you have an ETL system with two network interface cards (NICs),
then you could run two gpfdist instances on that machine to
maximize your load performance. You would then divide the external
table data files evenly between the two gpfdist programs.
Figure 8: External Tables Using Multiple gpfdist Instances with
Multiple NICs
Greenplum Performance Monitoring Greenplum Database includes a
dedicated system monitoring and management database, named
gpperfmon, that administrators can install and enable. When this
database is enabled, data collection
Installing and Upgrading Greenplum Release Notes
42
agents on each segment host collect query status and system
metrics. At regular intervals (typically every 15 seconds), an
agent on the Greenplum master requests the data from the segment
agents and updates the gpperfmon database. Users can query the
gpperfmon database to see the stored query and system metrics. For
more information see the "gpperfmon Database Reference" in the
Greenplum Database Reference Guide.
Greenplum Command Center is an optional web-based performance
monitoring and management tool for Greenplum Database, based on the
gpperfmon database. Administrators can install Command Center
separately from Greenplum Database.
Figure 9: Greenplum Performance Monitoring Architecture
Installing and Upgrading Greenplum Release Notes
43
Estimating Storage Capacity To estimate how much data your
Greenplum Database system can accommodate, use these measurements
as guidelines. Also keep in mind that you may want to have extra
space for landing backup files and data load files on each segment
host.
Calculating Usable Disk Capacity To calculate how much data a
Greenplum Database system can hold, you have to calculate the
usable disk capacity per segment host and then multiply that by the
number of segment hosts in your Greenplum Database array. Start
with the raw capacity of the physical disks on a segment host that
are available for data storage (raw_capacity), which is:
disk_size * number_of_disks
Account for file system formatting overhead (roughly 10 percent)
and the RAID level you are using. For example, if using RAID-10,
the calculation would be:
(raw_capacity * 0.9) / 2 = formatted_disk_space
For optimal performance, do not completely fill your disks to
capacity, but run at 70% or lower. So with this in mind, calculate
the usable disk space as follows:
formatted_disk_space * 0.7 = usable_disk_space
Using only 70% of your disk space allows Greenplum Database to use
the other 30% for temporary and transaction files on the same
disks. If your host systems have a separate disk system that can be
used for temporary and transaction files, you can specify a
tablespace that Greenplum Database uses for the files. Moving the
location of the files might improve performance depending on the
performance of the disk system.
Once you have formatted RAID disk arrays and accounted for the
maximum recommended capacity (usable_disk_space), you will need to
calculate how much storage is actually available for user data (U).
If using Greenplum Database mirrors for data redundancy, this would
then double the size of your user data (2 * U). Greenplum Database
also requires some space be reserved as a working area for active
queries. The work space should be approximately one third the size
of your user data (work space = U/3):
With mirrors: (2 * U) + U/3 = usable_disk_space
Without mirrors: U + U/3 = usable_disk_space
Guidelines for temporary file space and user data space assume a
typical analytic workload. Highly concurrent workloads or workloads
with queries that require very large amounts of temporary space can
benefit from reserving a larger working area. Typically, overall
system throughput can be increased while decreasing work area usage
through proper workload management. Additionally, temporary space
and user space can be isolated from each other by specifying that
they reside on different tablespaces.
In the Greenplum Database Administrator Guide, see these
topics:
• Managing Performance for information about workload management •
Creating and Managing Tablespaces for information about moving the
location of temporary and
transaction files • Monitoring System State for information about
monitoring Greenplum Database disk space usage
Installing and Upgrading Greenplum Release Notes
44
Calculating User Data Size As with all databases, the size of your
raw data will be slightly larger once it is loaded into the
database. On average, raw data will be about 1.4 times larger on
disk after it is loaded into the database, but could be smaller or
larger depending on the data types you are using, table storage
type, in-database compression, and so on.
• Page Overhead - When your data is loaded into Greenplum Database,
it is divided into pages of 32KB each. Each page has 20 bytes of
page overhead.
• Row Overhead - In a regular 'heap' storage table, each row of
data has 24 bytes of row overhead. An 'append-optimized' storage
table has only 4 bytes of row overhead.
• Attribute Overhead - For the data values itself, the size
associated with each attribute value is dependent upon the data
type chosen. As a general rule, you want to use the smallest data
type possible to store your data (assuming you know the possible
values a column will have).
• Indexes - In Greenplum Database, indexes are distributed across
the segment hosts as is table data. The default index type in
Greenplum Database is B-tree. Because index size depends on the
number of unique values in the index and the data to be inserted,
precalculating the exact size of an index is impossible. However,
you can roughly estimate the size of an index using these
formulas.
B-tree: unique_values * (data_type_size + 24 bytes)
Bitmap: (unique_values * number_of_rows * 1 bit * compression_ratio
/ 8) + (unique_values * 32)
Calculating Space Requirements for Metadata and Logs On each
segment host, you will also want to account for space for Greenplum
Database log files and metadata:
• System Metadata — For each Greenplum Database segment instance
(primary or mirror) or master instance running on a host, estimate
approximately 20 MB for the system catalogs and metadata.
• Write Ahead Log — For each Greenplum Database segment (primary or
mirror) or master instance running on a host, allocate space for
the write ahead log (WAL). The WAL is divided into segment files of
64 MB each. At most, the number of WAL files will be:
2 * checkpoint_segments + 1
You can use this to estimate space requirements for WAL. The
default checkpoint_segments setting for a Greenplum Database
instance is 8, meaning 1088 MB WAL space allocated for each segment
or master instance on a host.
• Greenplum Database Log Files — Each segment instance and the
master instance generates database log files, which will grow over
time. Sufficient space should be allocated for these log files, and
some type of log rotation facility should be used to ensure that to
log files do not grow too large.
• Command Center Data — The data collection agents utilized by
Command Center run on the same set of hosts as your Greenplum
Database instance and utilize the system resources of those hosts.
The resource consumption of the data collection agent processes on
these hosts is minimal and should not significantly impact database
performance. Historical data collected by the collection agents is
stored in its own Command Center database (named gpperfmon) within
your Greenplum Database system. Collected data is distributed just
like regular database data, so you will need to account for disk
space in the data directory locations of your Greenplum segment
instances. The amount of space required depends on the amount of
historical data you would like to keep. Historical data is not
automatically truncated. Database administrators must set up a
truncation policy to maintain the size of the Command Center
database.
Configuring Your Systems

Describes how to prepare your operating system environment for Greenplum Database software installation.

Perform the following tasks in order:

1. Make sure your host systems meet the requirements described in Platform Requirements.
2. Disable SELinux and firewall software.
3. Set the required operating system parameters.
4. Synchronize system clocks.
5. Create the gpadmin account.
Unless noted, these tasks should be performed for all hosts in your
Greenplum Database array (master, standby master, and segment
hosts).
The Greenplum Database host naming convention for the master host
is mdw and for the standby master host is smdw.
The segment host naming convention is sdwN where sdw is a prefix
and N is an integer. For example, segment host names would be sdw1,
sdw2 and so on. NIC bonding is recommended for hosts with multiple
interfaces, but when the interfaces are not bonded, the convention
is to append a dash (-) and number to the host name. For example,
sdw1-1 and sdw1-2 are the two interface names for host sdw1.
For information about running Greenplum Database in the cloud, see Cloud Services in the Pivotal Greenplum Partner Marketplace.
Important: When data loss is not acceptable for a Greenplum
Database cluster, Greenplum master and segment mirroring is
recommended. If mirroring is not enabled then Greenplum stores only
one copy of the data, so the underlying storage media provides the
only guarantee for data availability and correctness in the event
of a hardware failure.
Kubernetes enables quick recovery from both pod and host failures,
and Kubernetes storage services provide a high level of
availability for the underlying data. Furthermore, virtualized
environments make it difficult to ensure the anti-affinity
guarantees required for Greenplum mirroring solutions. For these
reasons, mirrorless deployments are fully supported with Greenplum
for Kubernetes. Other deployment environments are generally not
supported for production use unless both Greenplum master and
segment mirroring are enabled.
Note: For information about upgrading Pivotal Greenplum Database
from a previous version, see the Greenplum Database Release Notes
for the release that you are installing.
Note: Automating the configuration steps described in this topic
and Installing the Greenplum Database Software with a system
provisioning tool, such as Ansible, Chef, or Puppet, can save time
and ensure a reliable and repeatable Greenplum Database
installation.
Disabling SELinux and Firewall Software

For all Greenplum Database host systems running RHEL or CentOS, SELinux must be disabled. Follow these steps:

1. As the root user, check the status of SELinux:

# sestatus
SELinux status: disabled
2. If SELinux is not disabled, disable it by editing the
/etc/selinux/config file. As root, change the value of the SELINUX
parameter in the config file as follows:
SELINUX=disabled
3. If the System Security Services Daemon (SSSD) is installed on your systems, edit the SSSD configuration file and set the selinux_provider parameter to none to prevent SELinux-related SSH authentication denials that could occur even with SELinux disabled. As root, edit /etc/sssd/sssd.conf and add this parameter:
selinux_provider=none
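For example, a minimal sketch of the placement of this parameter (the section name domain/default is an assumption; use the domain section that exists in your sssd.conf):

[domain/default]
selinux_provider=none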
4. Reboot the system to apply any changes that you made and verify
that SELinux is disabled.
For information about disabling SELinux, see the SELinux
documentation.
You should also disable firewall software such as iptables (on systems such as RHEL 6.x and CentOS 6.x), firewalld (on systems such as RHEL 7.x and CentOS 7.x), or ufw (on Ubuntu systems, disabled by default).
If you decide to enable iptables with Greenplum Database for
security purposes, see Enabling iptables (Optional).
Follow these steps to disable iptables:
1. As the root user, check the status of iptables:
# /sbin/chkconfig --list iptables
If iptables is disabled, the command output is:
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
2. If necessary, execute this command as root to disable
iptables:
/sbin/chkconfig iptables off
You will need to reboot your system after applying the change.

3. For systems with firewalld, check the status of firewalld with the command:
# systemctl status firewalld
* firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
4. If necessary, execute these commands as root to disable
firewalld:
# systemctl stop firewalld.service
# systemctl disable firewalld.service
For more information about configuring your firewall software, see
the documentation for the firewall or your operating system.
Recommended OS Parameter Settings

Greenplum requires that certain Linux operating system (OS) parameters be set on all hosts in your Greenplum Database system (masters and segments).
In general, the following categories of system parameters need to
be altered:
• Shared Memory - A Greenplum Database instance will not work
unless the shared memory segment for your kernel is properly sized.
Most default OS installations have the shared memory values set too
low for Greenplum Database. On Linux systems, you must also disable
the OOM (out of memory) killer. For information about Greenplum
Database shared memory requirements, see the Greenplum Database
server configuration parameter shared_buffers in the Greenplum
Database Reference Guide.
• Network - On high-volume Greenplum Database systems, certain
network-related tuning parameters must be set to optimize network
connections made by the Greenplum interconnect.
• User Limits - User limits control the resources available to
processes started by a user's shell. Greenplum Database requires a
higher limit on the allowed number of file descriptors that a
single process can have open. The default settings may cause some
Greenplum Database queries to fail because they will run out of
file descriptors needed to process the query.
More specifically, you need to edit the following Linux configuration settings:

• The hosts File
• The sysctl.conf File
• System Resources Limits
• XFS Mount Options
• Disk I/O Settings
  • Read-ahead values
  • Disk I/O scheduler disk access
• Transparent Huge Pages (THP)
• IPC Object Removal
• SSH Connection Threshold
The hosts File

Edit the /etc/hosts file and make sure that it includes the host names and all interface address names for every machine participating in your Greenplum Database system.
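For example, a sketch of /etc/hosts entries that follows the naming conventions described above (the IP addresses are illustrative placeholders, not prescribed values):

172.16.1.1 mdw
172.16.1.2 smdw
172.16.1.3 sdw1-1
172.16.2.3 sdw1-2
172.16.1.4 sdw2-1
172.16.2.4 sdw2-2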
The sysctl.conf File

The sysctl.conf parameters listed in this topic are for performance, optimization, and consistency in a wide variety of environments. Change these settings according to your specific situation and setup.

Set the parameters in the /etc/sysctl.conf file and reload with sysctl -p:
# kernel.shmall = _PHYS_PAGES / 2 # See Shared Memory Pages
kernel.shmall = 197951838
# kernel.shmmax = kernel.shmall * PAGE_SIZE
kernel.shmmax = 810810728448
kernel.shmmni = 4096
vm.overcommit_memory = 2 # See Segment Host Memory
vm.overcommit_ratio = 95 # See Segment Host Memory
net.ipv4.ip_local_port_range = 10000 65535 # See Port Settings
kernel.sem = 500 2048000 200 4096
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_max_syn_backlog = 4096
Shared Memory Pages
Greenplum Database uses shared memory to communicate between
postgres processes that are part of the same postgres instance.
kernel.shmall sets the total amount of shared memory, in pages,
that can be used system wide. kernel.shmmax sets the maximum size
of a single shared memory segment in bytes.
Set kernel.shmall and kernel.shmmax values based on your system's physical memory and page size. In general, the value for both parameters should be one half of the system physical memory.
Use the operating system variables _PHYS_PAGES and PAGE_SIZE to set
the parameters.
kernel.shmall = ( _PHYS_PAGES / 2) kernel.shmmax = ( _PHYS_PAGES /
2) * PAGE_SIZE
To calculate the values for kernel.shmall and kernel.shmmax, run the following commands using the getconf command, which returns the value of an operating system variable.

$ echo $(expr $(getconf _PHYS_PAGES) / 2)
$ echo $(expr $(getconf _PHYS_PAGES) / 2 \* $(getconf PAGE_SIZE))
As a best practice, we recommend you set the following values in the /etc/sysctl.conf file using calculated values. For example, a host system has 1583 GB of memory installed and returns these values: _PHYS_PAGES = 395903676 and PAGE_SIZE = 4096. These would be the kernel.shmall and kernel.shmmax values:

kernel.shmall = 197951838
kernel.shmmax = 810810728448

If the Greenplum Database master host has a different shared memory configuration than the segment hosts, the _PHYS_PAGES and PAGE_SIZE values might differ, and the kernel.shmall and kernel.shmmax values on the master host will differ from those on the segment hosts.
Segment Host Memory
The vm.overcommit_memory Linux kernel parameter is used by the OS
to determine how much memory can be allocated to processes. For
Greenplum Database, this parameter should always be set to 2.
vm.overcommit_ratio is the percent of RAM that is used for
application processes and the remainder is reserved for the
operating system. The default is 50 on Red Hat Enterprise
Linux.
For vm.overcommit_ratio tuning and calculation recommendations with resource group-based resource management or resource queue-based resource management, refer to Options for Configuring Segment Host Memory in the Greenplum Database Administrator Guide. Also refer to the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide.
Port Settings
To avoid port conflicts between Greenplum Database and other
applications during Greenplum initialization, make a note of the
port range specified by the operating system parameter
net.ipv4.ip_local_port_range. When initializing Greenplum using the
gpinitsystem cluster configuration file, do not specify Greenplum
Database ports in that range. For example, if
net.ipv4.ip_local_port_range = 10000 65535, set the Greenplum
Database base port numbers to these values.
PORT_BASE = 6000
MIRROR_PORT_BASE = 7000
For information about the gpinitsystem cluster configuration file,
see Initializing a Greenplum Database System.
For Azure deployments with Greenplum Database, avoid using port 65330; add the following line to sysctl.conf:
net.ipv4.ip_local_reserved_ports=65330
System Memory

For host systems with more than 64GB of memory, these settings are recommended:

vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
vm.dirty_background_bytes = 1610612736 # 1.5GB
vm.dirty_bytes = 4294967296 # 4GB
For host systems with 64GB of memory or less, remove
vm.dirty_background_bytes and vm.dirty_bytes and set the two ratio
parameters to these values:
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
Increase vm.min_free_kbytes to ensure PF_MEMALLOC requests from
network and storage drivers are easily satisfied. This is
especially critical on systems with large amounts of system memory.
The default value is often far too low on these systems. Use this
awk command to set vm.min_free_kbytes to a recommended 3% of system
physical memory:
awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print "vm.min_free_kbytes =", $2 * .03;}' /proc/meminfo >> /etc/sysctl.conf
Do not set vm.min_free_kbytes to higher than 5% of system memory as
doing so might cause out of memory conditions.
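For example (an illustrative calculation, not a prescribed value), a host with 256 GB of RAM reports a MemTotal of about 268435456 kB in /proc/meminfo, so the awk command above would append approximately:

vm.min_free_kbytes = 8053064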
System Resources Limits

Set the following parameters in the /etc/security/limits.conf file:

* soft nofile 524288
* hard nofile 524288
* soft nproc 131072
* hard nproc 131072
For Red Hat Enterprise Linux (RHEL) and CentOS systems, parameter values in the /etc/security/limits.d/90-nproc.conf file (RHEL/CentOS 6) or /etc/security/limits.d/20-nproc.conf file (RHEL/CentOS 7) override the values in the limits.conf file. Ensure that any parameters in the
override file are set to the required value. The Linux module
pam_limits sets user limits by reading the values from the
limits.conf file and then from the override file. For information
about PAM and user limits, see the documentation on PAM and
pam_limits.
Execute the ulimit -u command on each segment host to display the
maximum number of processes that are available to each user.
Validate that the return value is 131072.
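Similarly, you can validate the open file descriptor limit with ulimit -n; with the limits.conf values above applied and a fresh login session, the expected output is:

$ ulimit -n
524288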
XFS Mount Options

XFS is the preferred data storage file system on Linux platforms. Use the mount command with the following recommended XFS mount options for RHEL and CentOS systems:

rw,nodev,noatime,nobarrier,inode64

The nobarrier option is not supported on Ubuntu systems. Use only these options:

rw,nodev,noatime,inode64
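For example, this command mounts an XFS file system with the recommended options (the device /dev/data and mount point /data are illustrative, matching the fstab example below):

# mount -t xfs -o rw,nodev,noatime,nobarrier,inode64 /dev/data /data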
See the mount manual page (man mount opens the man page) for more
information about using this command.
The XFS options can also be set in the /etc/fstab file. This
example entry from an fstab file specifies the XFS options.
/dev/data /data xfs nodev,noatime,nobarrier,inode64 0 0
Disk I/O Settings

• Read-ahead value

Each disk device file should have a read-ahead (blockdev) value of 16384.

To verify the read-ahead value of a disk device:

# /sbin/blockdev --getra devname

To set the read-ahead value for a disk device:

# /sbin/blockdev --setra bytes devname

For example, this command sets the read-ahead value for the disk /dev/sdb:

# /sbin/blockdev --setra 16384 /dev/sdb
See the manual page (man) for the blockdev command for more
information about using that command (man blockdev opens the man
page).
Note: The blockdev --setra command is not persistent. You must
ensure the read-ahead value is set whenever the system restarts.
How to set the value will vary based on your system.
One method to set the blockdev value at system startup is by adding the /sbin/blockdev --setra command in the rc.local file. For example, add this line to the rc.local file to set the read-ahead value for the disk sdb.

/sbin/blockdev --setra 16384 /dev/sdb
On systems that use systemd, you must also set the execute
permissions on the rc.local file to enable it to run at startup.
For example, on a RHEL/CentOS 7 system, this command sets execute
permissions on the file.
# chmod +x /etc/rc.d/rc.local
Restart the system to have the setting take effect.

• Disk I/O scheduler

The Linux disk I/O scheduler for disk access supports different policies, such as CFQ, AS, and deadline.

The deadline scheduler option is recommended. To specify a scheduler until the next system reboot, run the following:

# echo schedulername > /sys/block/devname/queue/scheduler

For example, this command sets the deadline scheduler for the disk sdb:

# echo deadline > /sys/block/sdb/queue/scheduler
Note: Using the echo command to set the disk I/O scheduler policy
is not persistent, therefore you must ensure the command is run
whenever the system reboots. How to run the command will vary based
on your system.
One method to set the I/O scheduler policy at boot time is with the
elevator kernel parameter. Add the parameter elevator=deadline to
the kernel command in the file /boot/grub/grub.conf, the GRUB boot
loader configuration file. This is an example kernel command from a
grub.conf file on RHEL 6.x or CentOS 6.x. The command is on
multiple lines for readability.
kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/
elevator=deadline crashkernel=128M@16M quiet console=tty1
console=ttyS1,115200 panic=30 transparent_hugepage=never initrd
/initrd-2.6.18-274.3.1.el5.img
To specify the I/O scheduler at boot time on systems that use grub2
such as RHEL 7.x or CentOS 7.x, use the system utility grubby. This
command adds the parameter when run as root.
# grubby --update-kernel=ALL --args="elevator=deadline"
After adding the parameter, reboot the system.
This grubby command displays kernel parameter settings.
# grubby --info=ALL
For more information about the grubby utility, see your operating
system documentation. If the grubby command does not update the
kernels, see the Note at the end of the section.
Transparent Huge Pages (THP)

Disable Transparent Huge Pages (THP) as it degrades Greenplum Database performance. RHEL 6.0 or higher enables THP by default. One way to disable THP on RHEL 6.x is by adding the parameter transparent_hugepage=never to the kernel command in the file /boot/grub/grub.conf, the GRUB boot loader configuration file. This is an example kernel command from a grub.conf file. The command is on multiple lines for readability:
kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/
elevator=deadline crashkernel=128M@16M quiet console=tty1
console=ttyS1,115200 panic=30 transparent_hugepage=never initrd
/initrd-2.6.18-274.3.1.el5.img
On systems that use grub2 such as RHEL 7.x or CentOS 7.x, use the
system utility grubby. This command adds the parameter when run as
root.
# grubby --update-kernel=ALL --args="transparent_hugepage=never"
After adding the parameter, reboot the system.
For Ubuntu systems, install the hugepages package and execute this
command as root:
# hugeadm --thp-never
This cat command checks the state of THP. The output indicates that THP is disabled.

$ cat /sys/kernel/mm/*transparent_hugepage/enabled
always [never]
For more information about Transparent Huge Pages or the grubby
utility, see your operating system documentation. If the grubby
command does not update the kernels, see the Note at the end of the
section.
IPC Object Removal

Disable IPC object removal for RHEL 7.2 or CentOS 7.2, or Ubuntu. The default systemd setting RemoveIPC=yes removes IPC connections when non-system user accounts log out. This causes the Greenplum Database utility gpinitsystem to fail with semaphore errors. Perform one of the following to avoid this issue.
• When you add the gpadmin operating system user account to the
master node in Creating the Greenplum Administrative User, create
the user as a system account.
• Disable RemoveIPC. Set this parameter in /etc/systemd/logind.conf
on the Greenplum Database host systems.
RemoveIPC=no
The setting takes effect after restarting the systemd-logind service or rebooting the system. To restart the service, run this command as the root user.
service systemd-logind restart
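On hosts where the service command is not available, the systemctl equivalent should achieve the same result (an alternative sketch, not part of the original procedure):

# systemctl restart systemd-logind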
SSH Connection Threshold

Certain Greenplum Database management utilities, including gpexpand, gpinitsystem, and gpaddmirrors, use secure shell (SSH) connections between systems to perform their tasks. In large Greenplum Database deployments, cloud deployments, or deployments with a large number of segments per host, these utilities may exceed the hosts' maximum threshold for unauthenticated connections. When this occurs, you receive errors such as: ssh_exchange_identification: Connection closed by remote host.

To increase this connection threshold for your Greenplum Database system, update the SSH MaxStartups and MaxSessions configuration parameters in one of the /etc/ssh/sshd_config or /etc/sshd_config SSH daemon configuration files.
If you specify MaxStartups and MaxSessions using a single integer
value, you identify the maximum number of concurrent
unauthenticated connections (MaxStartups) and maximum number of
open shell, login, or subsystem sessions permitted per network
connection (MaxSessions). For example:
MaxStartups 200
MaxSessions 200
If you specify MaxStartups using the "start:rate:full" syntax, you
enable random early connection drop by the SSH daemon. start
identifies the maximum number of unauthenticated SSH connection
attempts allowed. Once start number of unauthenticated connection
attempts is reached, the SSH daemon refuses rate percent of
subsequent connection attempts. full identifies the maximum number
of unauthenticated connection attempts after which all attempts are
refused. For example:
MaxStartups 10:30:200
MaxSessions 200
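With this example setting, once 10 unauthenticated connections are pending, sshd refuses roughly 30 percent of new attempts, with the refusal probability increasing as the backlog grows until all attempts are refused at 200 pending connections; confirm the exact semantics against the sshd_config man page for your distribution.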
Restart the SSH daemon after you update MaxStartups and
MaxSessions. For example, on a CentOS 6 system, run the following
command as the root user:
# service sshd restart
For detailed information about SSH configuration options, refer to
the SSH documentation for your Linux distribution.
Note: If the grubby command does not update the kernels of a RHEL 7.x or CentOS 7.x system, you can manually update all kernels on the system. For example, to add the parameter transparent_hugepage=never to all kernels on a system:

1. Add the parameter to the GRUB_CMDLINE_LINUX line in the file /etc/default/grub.
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet transparent_hugepage=never"
GRUB_DISABLE_RECOVERY="true"
2. As root, run the grub2-mkconfig command to update the
kernels.
# grub2-mkconfig -o /boot/grub2/grub.cfg
3. Reboot the system.
Synchronizing System Clocks

You should use NTP (Network Time Protocol) to synchronize the system clocks on all hosts that comprise your Greenplum Database system. See www.ntp.org for more information about NTP.
NTP on the segment hosts should be configured to use the master
host as the primary time source, and the standby master as the
secondary time source. On the master and standby master hosts,
configure NTP to point to your preferred time server.
To configure NTP

1. On the master host, log in as root and edit the /etc/ntp.conf file. Set the server parameter to point to your data center's NTP time server. For example (if 10.6.220.20 was the IP address of your data center's NTP server):

server 10.6.220.20
2. On each segment host, log in as root and edit the /etc/ntp.conf
file. Set the first server parameter to point to the master host,
and the second server parameter to point to the standby master
host. For example:
server mdw prefer
server smdw
3. On the standby master host, log in as root and edit the
/etc/ntp.conf file. Set the first server parameter to point to the
primary master host, and the second server parameter to point to
your data center's NTP time server. For example:
server mdw prefer
server 10.6.220.20
4. On the master host, use the NTP daemon to synchronize the system clocks on all Greenplum hosts. For example, using gpssh:
# gpssh -f hostfile_gpssh_allhosts -v -e 'ntpd'
Creating the Greenplum Administrative User

Create a dedicated operating system user account on each node to run and administer Greenplum Database. This user account is named gpadmin by convention.
Important: You cannot run the Greenplum Database server as
root.
The gpadmin user must have permission to access the services and
directories required to install and run Greenplum Database.
The gpadmin user on each Greenplum host must have an SSH key pair
installed and be able to SSH from any host in the cluster to any
other host in the cluster without entering a password or passphrase
(called "passwordless SSH"). If you enable passwordless SSH from
the master host to every other host in the cluster ("1-n
passwordless SSH"), you can use the Greenplum Database gpssh-exkeys
command-line utility later to enable passwordless SSH from every
host to every other host ("n-n passwordless SSH").
You can optionally give the gpadmin user sudo privilege, so that
you can easily administer all hosts in the Greenplum Database
cluster as gpadmin using the sudo, ssh/scp, and gpssh/gpscp
commands.
The following steps show how to set up the gpadmin user on a host,
set a password, create an SSH key pair, and (optionally) enable
sudo capability. These steps must be performed as root on every
Greenplum Database cluster host. (For a large Greenplum Database
cluster you will want to automate these steps using your system
provisioning tools.)
Note: See Example Ansible Playbook for an example that shows how to
automate the tasks of creating the gpadmin user and installing the
Greenplum Database software on all hosts in the cluster.
1. Create the gpadmin group and user.
Note: If you are installing Greenplum Database on RHEL 7.2 or
CentOS 7.2 and want to disable IPC object removal by creating the
gpadmin user as a system account, provide both the -r option
(create the user as a system account) and the -m option (create a
home directory) to the useradd command. On Ubuntu systems, you must
use the -m option with the useradd command to create a home
directory for a user.
This example creates the gpadmin group, creates the gpadmin user as
a system account with a home directory and as a member of the
gpadmin group, and creates a password for the user.
# groupadd gpadmin
# useradd gpadmin -r -m -g gpadmin
# passwd gpadmin
New password: <changeme>
Retype new password: <changeme>
Note: Make sure the gpadmin user has the same user id (uid) and group id (gid) numbers on each host to prevent problems with scripts or services that use them for identity or permissions. For example, backing up Greenplum databases to some networked file systems or storage appliances could fail if the gpadmin user has different uid or gid numbers on different segment hosts. When you create the gpadmin group and user, you can use the groupadd -g option to specify a gid number and the useradd -u option to specify the uid number. Use the command id gpadmin to see the uid and gid for the gpadmin user on the current host.
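For example, this sketch creates the group and user with explicit ids (the value 1001 is illustrative; choose a uid and gid that are unused on every host):

# groupadd -g 1001 gpadmin
# useradd -u 1001 -r -m -g gpadmin gpadmin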
2. Switch to the gpadmin user and generate an SSH key pair for the
gpadmin user.
$ su gpadmin
$ ssh-keygen -t rsa -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/home/gpadmin/.ssh/id_rsa):
Created directory '/home/gpadmin/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
At the passphrase prompts, press Enter so that SSH connections will not require entry of a passphrase.

3. (Optional) Grant sudo access to the gpadmin user.

On Red Hat or CentOS, run visudo and uncomment the %wheel group entry.
%wheel ALL=(ALL) NOPASSWD: ALL
Make sure you uncomment the line that has the NOPASSWD
keyword.
Add the gpadmin user to the wheel group with this command.
# usermod -aG wheel gpadmin
Installing the Greenplum Database Software

Describes how to install the Greenplum Database software binaries on all of the hosts that will comprise your Greenplum Database system, how to enable passwordless SSH for the gpadmin user, and how to verify the installation.

Perform the following tasks in order:

1. Installing Greenplum Database
2. Enabling Passwordless SSH
3. Confirm the software installation.
4. Next Steps
Installing Greenplum Database

You must install Greenplum Database on each host machine of the Greenplum Database system. Pivotal distributes the Greenplum Database software as a downloadable package that you install on each host system with the operating system's package management system. You can download the package from Pivotal Network.
Before you begin installing Greenplum Database, be sure you have
completed the steps in Configuring Your Systems to configure each
of the master, standby master, and segment host machines for
Greenplum Database.
Important: After installing Greenplum Database, you must set
Greenplum Database environment variables. See Setting Greenplum
Environment Variables.
See Example Ansible Playbook for an example script that shows how
you can automate creating the gpadmin user and installing the
Greenplum Database.
Follow these instructions to install Greenplum Database.
Important: You require sudo or root user access to install from a
pre-built RPM or DEB file.
1. Download and copy the Greenplum Database package to the gpadmin
user's home directory on the master, standby master, and every
segment host machine. The distribution file name has the format
greenplum-db-<version>-<platform>.rpm for RHEL, CentOS,
and Oracle Linux systems, or
greenplum-db-<version>-<platform>.deb for Ubuntu
systems, where <platform> is similar to rhel7-x86_64 (Red Hat
7 64-bit).
Note: For Oracle Linux installations, download and install the rhel7-x86_64 distribution files.

2. With sudo (or as root), install the Greenplum Database package on each host machine using your system's package manager software.
The yum or apt command automatically installs software
dependencies, copies the Greenplum Database software files into a
ver