19
air planet people Parallel Python Tools for Handling Big Climate Data Sheri Mickelson Kevin Paul 2018 Workshop on Developing Python Frameworks for Earth System Sciences October 30, 2018 NCAR/CISL/TDD

Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

air • planet • people

Parallel Python Tools for Handling Big Climate Data

Sheri MickelsonKevin Paul

2018 Workshop on Developing Python Frameworks for Earth System SciencesOctober 30, 2018

NCAR/CISL/TDD

Page 2: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

CESM’s CMIP5 Workflow

2

Model Run

Publication

Post-Processing

CESM Model Run

Time Series Conversion

(NCO)

CMOR

Diagnostics(NCO/NCL)

Push to ESGF

Page 3: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Lessons We Learned From CMIP5

3

CESM was the first model to complete their simulations, but the last to complete publication.

Why?• All of the post-processing was serial and it took a long time to run• Workflow was error prone and was time consuming to debug• Too much human intervention was needed between post-processing

steps and time was wasted• There was only one person who knew the status of all of the

experiments

Page 4: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Motivating FactorCMIP5 vs CMIP6

4

CMIP5

• 25 Experiments• Timeline: 3 years• Output size: 800TB• Published size: 200TB

CMIP6

• 102 Experiments• Timeline: 1 year• Output size: 8PB (estimate)

• Published size: 2PB (estimate)

http://www.bbc.com/earth/story/20170510-terrifying-20m-tall-rogue-waves-are-actually-real

Page 5: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

New CESM/CMIP6 Workflow

5

Model Run

Publication

Post-Processing

CESM Model Run

Time Series Conversion

(PyReshaper)

New Data Compliance

Tool (PyConform)

Re-Designed Diagnostics

(PyAverager)

Push to ESGF

Aut

omat

ed W

orkf

low

Usi

ng C

ylc

ExperimentsUpdate

Their Status in Run Database

The focus of this talk

Page 6: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Data Compliance

6

Taking model output and processing it into experiment compliant data

Page 7: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Previous Version Used for CMIP5

7

• Used Fortran and CMOR to read in the raw output, do all conversions, and write out complaint files

• Serial, no parallelization

SlowRigid and Hard to Expand

Error Prone

Page 8: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

PyConform: New Version Used for CMIP6

8

• Uses PythonnetCDF4, numpy, dreqPy, cf_units, pyNGL

• Parallelization done with MPI4Py

• A three step process

Faster Flexible User Interface

(16x to 38x speedup over Fortran method)

Page 9: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

First Step

9

Users need to create a text file with definitions that describe how to map model variables to requested variables

Examples:cfc11global=f11vmrch4=vinth2p(CH4, hyam, hybm, plev, PS, P0)mc=CMFMC+CMFMCDZMsiage=siage

Page 10: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Second Step

10

Then users run the iconform tool that matches the definitions to its variable information within the CMIP6 Data Request

The Data Request lists variable requirements:• Units• Dimensions• Descriptions• Positive Attribute on Vertical Dimensions• And a lot more …

Page 11: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Sample Portion of a PyConform Input File

11

"ua": {"attributes": {

"_FillValue": "1e+20",

"cell_measures": "area: areacella",

"cell_methods": "time: mean",

"comment": "\"Eastward\" indicates a vector component which is positive when directed eastward (negative westward). Wind is defined as a two-dimensional (horizontal) air velocity vector, with no vertical component. (Vertical motion in the atmosphere has the standard name upward_air_velocity.)",

"description": "\"Eastward\" indicates a vector component which is positive when directed eastward (negative westward). Wind is defined as a two-dimensional (horizontal) air velocity vector, with no vertical component. (Vertical motion in the atmosphere has the standard name upward_air_velocity.)",

"frequency": "mon",

"id": "ua",

"long_name": "Eastward Wind",

"mipTable": "Amon",

"out_name": "ua",

"prov": "Amon ((isd.003))",

"realm": "atmos",

"standard_name": "eastward_wind",

"time": "time",

"time_label": "time-mean",

"time_title": "Temporal mean",

"title": "Eastward Wind",

"type": "real",

"units": "m s-1",

"variable_id": "ua"},"datatype": "real",

"definition": "vinth2p(U,hyam,hybm,plev, PS,P0)",

Page 12: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Third Step

12

Then users run the xconform tool that generates requested variables based on the input specifications

“x = X1 + X2”Read:X1[i]

Read:X2[i]

Evaluate:(X1+X2)[i]

Map:iàj

Validate:> minimum< maximum

dimensions = [j]etcetera

Write:x[j] File

“y = X1 - X2”

Read:X1[i]

Read:X2[i]

Evaluate:(X1-X2)[i]

Map:iàj

Validate:> minimum< maximum

dimensions = [j]etcetera

Write:y[j] File

Page 13: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

13

• Is a subclass of the maskedArray in NumPy

• Additional features that were needed above the masked array class:

- Automatic Unit Conversion- Automatic Dimension Handling- Automatic Handling of the Positive Attribute

Page 14: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

14

“X = X1 + X2”Units:

K + Units: K

“X” = Units: K

“X = X1 + X2”Units:

K + Units: C

“X” =

Units: K

Convert to K before operation is performed

“X = X1 * X2”Units:

kg * Units: m-2

“X” = Units: kg m-2

* Must be cf compliant units

Units: K + Units:

K

Units: K =

Units: K = =Units:

kg m-2

Extra features we needed to generate the data correctly:Automatic Unit Handling

Page 15: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

15

“X = X1 + X2”

+

“X” =

“X = X1 + X2”

Switch the dimensions before operation is performed

Dims: TimeLatLonLev

=

Dims: TimeLatLonLev

Dims: TimeLatLonLev

Dims: TimeLatLonLev

+

“X” =

Dims: TimeLatLonLev

=

Dims: TimeLatLonLev

Dims: LevLatLon

Time

Dims: TimeLatLonLev

Extra features we needed to generate the data correctly:Automatic Dimension Handling

Dims: TimeLatLonLev

“X = X1 + X2”

+=Dims: TimeLatLon

Dims: TimeLatLon

Will give an error

Dims: TimeLatLonLev

Dims: TimeLatLonLev

+

Page 16: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

16

“X = X1 + X2”Pos: up + Pos:

up

“X” = Pos: up

“X = X1 + X2”Pos: up + Pos:

down

“X” =

Pos: up

Convert before operation is performed

Pos: up + Pos:

up

Pos: up = Pos:

up =

Extra features we needed to generate the data correctly:Automatic Handling of the Positive Attribute

(flipping the vertical dimension)

Page 17: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Switching to Xarray/Dask

17

• We are working on a new version that uses xarray

• While we no longer need the ability to handle dimension reordering, we still need functionality to handle the unit conversion and the flipping of the vertical dimension

• We will also need to evaluate the performance

Page 18: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Moving Forward ….

18

• We are currently using PyConform in its current form for our CMIP6 output

• We are looking at a redesign of the internal data structures to use new capabilities that didn’t exist when we started the project

• Performance and usability are key for this tool and we will move in those directions

Page 19: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

air • planet • people

Questions

Contact: mickelso .at. ucar.eduhttps://github.com/NCAR/PyConform

Big Data

http://m.sweetclipart.com/halloween-pumpkin-pail-with-candy/