
Page 1 (source: UH, gabriel/courses/cosc6339_f18/BDA_16_DataFormats.pdf)

COSC 6339

Big Data Analytics

Data Formats –

HDF5 and Parquet files

Edgar Gabriel

Fall 2018

File Formats - Motivation

• Use-case: Analysis of all flights in the US between 2004-2008 using Apache Spark

File Format            File Size   Processing Time
csv                    3.4 GB      525 sec
json                   12 GB       2245 sec
Hadoop sequence file   3.7 GB      1745 sec
parquet                0.55 GB     100 sec

Page 2

Scientific data libraries

• Handle data on a higher level

• Provide additional information typically not available in

flat data files (Metadata)

– Size and type of data structure

– Data format

– Name

– Units

• Two widely used libraries available

– NetCDF

– HDF-5

HDF-5

• Hierarchical Data Format (HDF) developed since 1988

at NCSA (University of Illinois)

– http://hdf.ncsa.uiuc.edu/HDF5/

• Has gone through a long history of changes; the current version, HDF-5, has been available since 1999

• HDF-5 supports

– Very large files

– Parallel I/O interface

– Fortran, C, Java, Python bindings

Page 3

HDF-5 dataset

• Multi-dimensional array of basic data elements

• A dataset consists of

– Header + data

• Header consists of

– Name

– Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes

– Dataspace: defines size and shape of a multidimensional

array. Dimensions can be fixed or unlimited.

– Storage layout: defines how multidimensional arrays are

stored in file. Can be contiguous or chunked.

Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
   GROUP "tempseries" {
      DATASET "height" {
         DATATYPE { "H5T_STD_I32BE" }
         DATASPACE { ARRAY ( 4 ) ( 4 ) }
         DATA {
            0, 50, 100, 150
         }
         ATTRIBUTES "units" {
            DATATYPE { "undefined string" }
            DATASPACE { ARRAY ( 0 ) ( 0 ) }
            DATA {
               unable to print
            }
         }
      }
      DATASET "temperature" {
         DATATYPE { "H5T_IEEE_F32BE" }
         DATASPACE { ARRAY ( 3, 8, 4 ) ( H5S_UNLIMITED, 8, 4 ) }
         DATA {…}

Page 4

Storage layout: contiguous vs. chunked

[Figure: an 8×8 array holding the values 1–64, shown twice. In the contiguous layout the values are stored as one linear sequence; in the chunked layout the array is split into 4×4 chunks, each of which is stored contiguously in the file.]

Advantages and disadvantages of chunking

• Accessing rows and columns requires the same number of accesses

• Data can be extended in all dimensions

• Efficient storage of sparse arrays

• Can improve caching
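As a sketch of how a chunked layout is requested in practice, the h5py snippet below (file and dataset names are made up for illustration) stores an 8×8 array in 4×4 chunks, matching the figure above:

```python
import numpy as np
import h5py

# Store an 8x8 array in 4x4 chunks (file and dataset names are hypothetical)
with h5py.File('chunked_demo.h5', 'w') as f:
    d = f.create_dataset('grid', shape=(8, 8), dtype='i4', chunks=(4, 4))
    d[:] = np.arange(1, 65).reshape(8, 8)

with h5py.File('chunked_demo.h5', 'r') as f:
    d = f['grid']
    print(d.chunks)    # (4, 4)
    # Reading a whole column touches only the two chunks in its column band
    print(list(d[:, 0]))
```

With a contiguous layout the same column read would touch all eight rows of the file; with 4×4 chunks it touches only two chunks.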

HDF-5 API

• HDF-5 naming convention

– All API functions start with an H5

– The next character identifies category of functions

• H5F: functions handling files

• H5G: functions handling groups

• H5D: functions handling datasets

• H5S: functions handling dataspaces

• H5A: functions handling attributes

• An HDF-5 group is a collection of data sets

– Comparable to a directory in a UNIX-like file system
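A small h5py sketch of how these categories surface in the Python binding: a group (H5G), a dataset inside it (H5D), and an attribute (H5A). The layout mirrors the tempseries example shown earlier; the 'm' unit string is an assumption.

```python
import numpy as np
import h5py

with h5py.File('tempseries.h5', 'w') as f:
    g = f.create_group('tempseries')            # H5G: like a directory
    height = g.create_dataset('height', data=np.array([0, 50, 100, 150],
                                                      dtype='i4'))
    height.attrs['units'] = 'm'                 # H5A: attribute (unit is assumed)
    # maxshape with None makes the first dimension unlimited
    g.create_dataset('temperature', shape=(3, 8, 4),
                     maxshape=(None, 8, 4), dtype='f4')

with h5py.File('tempseries.h5', 'r') as f:
    print(sorted(f['tempseries'].keys()))       # ['height', 'temperature']
```

Note that h5py switches the unlimited dataset to chunked storage automatically, since resizable datasets require chunking.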

Page 5

h5py

• Python interface to the HDF5 binary data format

• Uses NumPy and Python abstractions such as dictionary

and NumPy array syntax

Reading and Writing an HDF-5 file

using h5py

import numpy as np

import h5py

MyData = np.random.random(size=(100,20))

h5f = h5py.File('data.h5', 'w')

h5f.create_dataset('dataset_1', data=MyData)

h5f.close()

h5f = h5py.File('data.h5','r')

MyData = h5f['dataset_1'][:]

h5f.close()

Page 6

Setting datatypes and compression in

h5py

import numpy as np
import h5py

arr = np.arange(100000)                       # example data to store

f = h5py.File('integer_8.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i8')
d[:] = arr
f.close()

arr = np.random.random(size=100)              # example data to store

f = h5py.File('float.hdf5', 'w')
d = f.create_dataset('dataset', (100,), dtype='f16', compression="gzip")
d[:] = arr
f.close()

Parquet files

• Columnar data representation

• Available to many projects in the Hadoop ecosystem

• Built on the record shredding and assembly algorithm

described in the Dremel paper

Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva

Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of

Web-Scale Datasets”

https://ai.google/research/pubs/pub36632.pdf

• Supports compression and efficient encoding schemes.

Page 7

Record vs. column oriented data

Image source: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of Web-Scale Datasets” https://ai.google/research/pubs/pub36632.pdf

Example

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}

Page 8

• Schema can be seen as a tree with leaves being

primitive types.

• A column is created for each leaf

• For this example we end up with 6 columns:

DocId

Links.Backward

Links.Forward

Name.Language.Code

Name.Language.Country

Name.Url

• As some of the fields are repeated fields, we need

extra pieces of information to be stored along with the

data to allow re-assembling the records together.

Repetition and definition levels

• Repetition level tells us at what field in the field’s

path the value has repeated.

• Definition level specifies how many fields in the field's path that could be undefined (because they are optional or repeated) are actually present in the record.

• Only repeated fields increment the repetition level

• Only non-required fields increment the definition level

Page 9

Image source: https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper

DocId: 10

Links

Forward: 20

Forward: 40

Forward: 60

Name

Language

Code: 'en-us'

Country: 'us'

Language

Code: 'en'

Url: 'http://A'

Name

Url: 'http://B'

Name

Language

Code: 'en-gb'

Country: 'gb'

DocId: 20

Links

Backward: 10

Backward: 30

Forward: 80

Name

Url: 'http://C'

Page 10

DocId: 10

Links

Forward: 20

Forward: 40

Forward: 60

Name

Language

Code: 'en-us'

Country: 'us'

Language

Code: 'en'

Url: 'http://A'

Name

Url: 'http://B'

Name

Language

Code: 'en-gb'

Country: 'gb'

R = 0 (current repetition level)

DocId: 10, R:0, D:0

Links.Backward: NULL, R:0, D:1 (no value defined so D < 2)

Links.Forward: 20, R:0, D:2

R = 1 (we are repeating 'Links.Forward' of level 1)

Links.Forward: 40, R:1, D:2

R = 1 (we are repeating 'Links.Forward' again of level 1)

Links.Forward: 60, R:1, D:2

Back to the root level: R=0

Name.Language.Code: en-us, R:0, D:2

Name.Language.Country: us, R:0, D:3

R = 2 (we are repeating 'Name.Language' of level 2)

Name.Language.Code: en, R:2, D:2

Name.Language.Country: NULL, R:2, D:2 (no value defined so D < 3)

Name.Url: http://A, R:0, D:2

R = 1 (we are repeating 'Name' of level 1)

Name.Language.Code: NULL, R:1, D:1 (Only Name is defined so D = 1)

Name.Language.Country: NULL, R:1, D:1

Name.Url: http://B, R:1, D:2

R = 1 (we are repeating 'Name' again of level 1)

Name.Language.Code: en-gb, R:1, D:2

Name.Language.Country: gb, R:1, D:3

Name.Url: NULL, R:1, D:1 (only Name is defined so D = 1)

Image source: https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper

DocId: 20, R:0, D:0

Links.Backward: 10, R:0, D:2

Links.Backward: 30, R:1, D:2

Links.Forward: 80, R:0, D:2

Name.Language.Code: NULL, R:0, D:1

Name.Language.Country: NULL, R:0, D:1

Name.Url: http://C, R:0, D:2

DocId: 20

Links

Backward: 10

Backward: 30

Forward: 80

Name

Url: 'http://C'


Page 11

Resulting Columns stored in Parquet

Image source: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of Web-Scale Datasets” https://ai.google/research/pubs/pub36632.pdf

Parquet Glossary

• Row group: A logical horizontal partitioning of the data

into rows.

– A row group consists of a column chunk for each column

in the dataset.

– Max. size buffered in memory while writing

– No physical structure that is guaranteed for a row group.

• Column chunk: A chunk of the data for a particular

column.

– guaranteed to be contiguous in the file.

• Page: Column chunks are divided up into pages.

– conceptually an indivisible unit (in terms of compression

and encoding).

Page 12

Example file

• A file consists of one or more row groups; here, N columns are split into M row groups

• A row group contains exactly one column chunk per column.

• Column chunks contain one or more pages.

Image source: https://parquet.apache.org/documentation/latest/

• Reading columns is straightforward

• Record level API to integrate with existing row-based

engines (Hive, Pig, M/R)

– Repetition level 0 indicates new record

– Repetition/Definition level capture the structure

– One column per leaf in the schema

• Unit of parallelization

– MapReduce - File/Row Group

– IO - Column chunk

– Encoding/Compression - Page

Page 13

Encodings

• Bit encoding

– Small integers encoded in minimum bits required

– Useful for repetition and definition levels

• Run-length encoding

– sequences in which the same data value occurs in many

consecutive data elements are stored as a single data value and

count

– Useful for definition levels of sparse columns

• Dictionary encoding

– searches for matches between the text to be compressed and a

set of strings contained in a 'dictionary'

– When the encoder finds a match, it substitutes a reference to

the string's position in the data structure.

– Useful for a small (~60k) set of values
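Toy pure-Python sketches of the last two encodings; these illustrate the idea only and are not Parquet's actual byte-level formats:

```python
def run_length_encode(values):
    """Collapse runs of equal consecutive values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

# Definition levels of a sparse column compress well with run-length encoding
print(run_length_encode([0, 0, 0, 1, 1, 0]))        # [(0, 3), (1, 2), (0, 1)]
# A low-cardinality string column compresses well with dictionary encoding
print(dictionary_encode(['us', 'gb', 'us', 'us']))  # (['us', 'gb'], [0, 1, 0, 0])
```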

Encodings

• Delta encoding (new in parquet 2)

Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635

Page 14

Formats comparison (I)

Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635

Formats comparison (II)

Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635

Page 15

Parquet implementations

• Java implementation

– sources can be built using mvn package.

– current stable version is available from Maven

Central.

• C++ sources

– Based on Apache Thrift (a software stack with a

code generation engine to build services that work

efficiently and seamlessly between numerous

languages, including C++, Java, Python, … )