
Page 1 (source: UH, gabriel/courses/cosc6339_f18/BDA_16_DataFormats.pdf)

COSC 6339

Big Data Analytics

Data Formats –

HDF5 and Parquet files

Edgar Gabriel

Fall 2018

File Formats - Motivation

• Use-case: Analysis of all flights in the US between 2004-2008 using Apache Spark

File Format            File Size   Processing Time
csv                    3.4 GB      525 sec
json                   12 GB       2245 sec
Hadoop sequence file   3.7 GB      1745 sec
parquet                0.55 GB     100 sec

Page 2

Scientific data libraries

• Handle data on a higher level

• Provide additional information typically not available in

flat data files (Metadata)

– Size and type of data structure

– Data format

– Name

– Units

• Two widely used libraries available

– NetCDF

– HDF-5

HDF-5

• Hierarchical Data Format (HDF) developed since 1988

at NCSA (University of Illinois)

– http://hdf.ncsa.uiuc.edu/HDF5/

• Has gone through a long history of changes; the current version, HDF-5, has been available since 1999

• HDF-5 supports

– Very large files

– Parallel I/O interface

– Fortran, C, Java, Python bindings

Page 3

HDF-5 dataset

• Multi-dimensional array of basic data elements

• A dataset consists of

– Header + data

• Header consists of

– Name

– Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes

– Dataspace: defines size and shape of a multidimensional

array. Dimensions can be fixed or unlimited.

– Storage layout: defines how multidimensional arrays are

stored in file. Can be contiguous or chunked.

Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
   GROUP "tempseries" {
      DATASET "height" {
         DATATYPE { "H5T_STD_I32BE" }
         DATASPACE { ARRAY ( 4 ) ( 4 ) }
         DATA {
            0, 50, 100, 150
         }
         ATTRIBUTES "units" {
            DATATYPE { "undefined string" }
            DATASPACE { ARRAY ( 0 ) ( 0 ) }
            DATA {
               unable to print
            }
         }
      }
      DATASET "temperature" {
         DATATYPE { "H5T_IEEE_F32BE" }
         DATASPACE { ARRAY ( 3, 8, 4 ) ( H5S_UNLIMITED, 8, 4 ) }
         DATA {…}

Page 4

Storage layout: contiguous vs. chunked

[Figure: an 8×8 array holding the values 1–64, shown twice. In the contiguous layout the values are stored as one linear sequence; in the chunked layout the array is split into 4×4 chunks, each of which is stored contiguously in the file.]

Advantages and disadvantages of chunking

• Accessing rows and columns requires the same number of accesses

• Data can be extended in all dimensions

• Efficient storage of sparse arrays

• Can improve caching
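As a sketch of how a chunked layout is requested in practice, the h5py snippet below (file and dataset names are made up for illustration) stores an 8×8 array in 4×4 chunks, matching the figure above:

```python
import numpy as np
import h5py

# Store an 8x8 array in 4x4 chunks (file and dataset names are hypothetical)
with h5py.File('chunked_demo.h5', 'w') as f:
    d = f.create_dataset('grid', shape=(8, 8), dtype='i4', chunks=(4, 4))
    d[:] = np.arange(1, 65).reshape(8, 8)

with h5py.File('chunked_demo.h5', 'r') as f:
    d = f['grid']
    print(d.chunks)    # (4, 4)
    # Reading a whole column touches only the two chunks in its column band
    print(list(d[:, 0]))
```

With a contiguous layout the same column read would touch all eight rows of the file; with 4×4 chunks it touches only two chunks.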

HDF-5 API

• HDF-5 naming convention

– All API functions start with an H5

– The next character identifies category of functions

• H5F: functions handling files

• H5G: functions handling groups

• H5D: functions handling datasets

• H5S: functions handling dataspaces

• H5A: functions handling attributes

• An HDF-5 group is a collection of data sets

– Comparable to a directory in a UNIX-like file system
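A small h5py sketch of how these categories surface in the Python binding: a group (H5G), a dataset inside it (H5D), and an attribute (H5A). The layout mirrors the tempseries example shown earlier; the 'm' unit string is an assumption.

```python
import numpy as np
import h5py

with h5py.File('tempseries.h5', 'w') as f:
    g = f.create_group('tempseries')            # H5G: like a directory
    height = g.create_dataset('height', data=np.array([0, 50, 100, 150],
                                                      dtype='i4'))
    height.attrs['units'] = 'm'                 # H5A: attribute (unit is assumed)
    # maxshape with None makes the first dimension unlimited
    g.create_dataset('temperature', shape=(3, 8, 4),
                     maxshape=(None, 8, 4), dtype='f4')

with h5py.File('tempseries.h5', 'r') as f:
    print(sorted(f['tempseries'].keys()))       # ['height', 'temperature']
```

Note that h5py switches the unlimited dataset to chunked storage automatically, since resizable datasets require chunking.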

Page 5

h5py

• Python interface to the HDF5 binary data format

• Uses NumPy and Python abstractions such as dictionary

and NumPy array syntax

Reading and Writing an HDF-5 file

using h5py

import numpy as np

import h5py

MyData = np.random.random(size=(100,20))

h5f = h5py.File('data.h5', 'w')

h5f.create_dataset('dataset_1', data=MyData)

h5f.close()

h5f = h5py.File('data.h5','r')

MyData = h5f['dataset_1'][:]

h5f.close()

Page 6

Setting datatypes and compression in

h5py

import numpy as np
import h5py

arr = np.arange(100000)                       # example data to store

f = h5py.File('integer_8.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i8')
d[:] = arr
f.close()

arr = np.random.random(size=100)              # example data to store

f = h5py.File('float.hdf5', 'w')
d = f.create_dataset('dataset', (100,), dtype='f16', compression="gzip")
d[:] = arr
f.close()

Parquet files

• Columnar data representation

• Available to many projects in the Hadoop ecosystem

• Built on the record shredding and assembly algorithm

described in the Dremel paper

Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva

Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of

Web-Scale Datasets”

https://ai.google/research/pubs/pub36632.pdf

• Supports compression and efficient encoding schemes.

Page 7

Record vs. column oriented data

Image source: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of Web-Scale Datasets” https://ai.google/research/pubs/pub36632.pdf

Example

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}

Page 8

• Schema can be seen as a tree with leaves being

primitive types.

• A column is created for each leaf

• For this example we end up with 6 columns:

DocId

Links.Backward

Links.Forward

Name.Language.Code

Name.Language.Country

Name.Url

• As some of the fields are repeated fields, we need

extra pieces of information to be stored along with the

data to allow re-assembling the records together.

Repetition and definition levels

• Repetition level tells us at what field in the field’s

path the value has repeated.

• Definition level specifies how many fields in the field's path that could be undefined (because they are optional or repeated) are actually present in the record.

• Only repeated fields increment the repetition level

• Only non-required fields increment the definition level

Page 9

Image source: https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper

DocId: 10

Links

Forward: 20

Forward: 40

Forward: 60

Name

Language

Code: 'en-us'

Country: 'us'

Language

Code: 'en'

Url: 'http://A'

Name

Url: 'http://B'

Name

Language

Code: 'en-gb'

Country: 'gb'

DocId: 20

Links

Backward: 10

Backward: 30

Forward: 80

Name

Url: 'http://C'

Page 10

DocId: 10

Links

Forward: 20

Forward: 40

Forward: 60

Name

Language

Code: 'en-us'

Country: 'us'

Language

Code: 'en'

Url: 'http://A'

Name

Url: 'http://B'

Name

Language

Code: 'en-gb'

Country: 'gb'

R = 0 (current repetition level)

DocId: 10, R:0, D:0

Links.Backward: NULL, R:0, D:1 (no value defined so D < 2)

Links.Forward: 20, R:0, D:2

R = 1 (we are repeating 'Links.Forward' of level 1)

Links.Forward: 40, R:1, D:2

R = 1 (we are repeating 'Links.Forward' again of level 1)

Links.Forward: 60, R:1, D:2

Back to the root level: R=0

Name.Language.Code: en-us, R:0, D:2

Name.Language.Country: us, R:0, D:3

R = 2 (we are repeating 'Name.Language' of level 2)

Name.Language.Code: en, R:2, D:2

Name.Language.Country: NULL, R:2, D:2 (no value defined so D < 3)

Name.Url: http://A, R:0, D:2

R = 1 (we are repeating 'Name' of level 1)

Name.Language.Code: NULL, R:1, D:1 (Only Name is defined so D = 1)

Name.Language.Country: NULL, R:1, D:1

Name.Url: http://B, R:1, D:2

R = 1 (we are repeating 'Name' again of level 1)

Name.Language.Code: en-gb, R:1, D:2

Name.Language.Country: gb, R:1, D:3

Name.Url: NULL, R:1, D:1 (only Name is defined so D = 1)

Image source: https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper

DocId: 20, R:0, D:0

Links.Backward: 10, R:0, D:2

Links.Backward: 30, R:1, D:2

Links.Forward: 80, R:0, D:2

Name.Language.Code: NULL, R:0, D:1

Name.Language.Country: NULL, R:0, D:1

Name.Url: http://C, R:0, D:2

DocId: 20

Links

Backward: 10

Backward: 30

Forward: 80

Name

Url: 'http://C'


Page 11

Resulting Columns stored in Parquet

Image source: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis: “Dremel: Interactive Analysis of Web-Scale Datasets” https://ai.google/research/pubs/pub36632.pdf

Parquet Glossary

• Row group: A logical horizontal partitioning of the data

into rows.

– A row group consists of a column chunk for each column

in the dataset.

– Max. size buffered in memory while writing

– No physical structure that is guaranteed for a row group.

• Column chunk: A chunk of the data for a particular

column.

– guaranteed to be contiguous in the file.

• Page: Column chunks are divided up into pages.

– conceptually an indivisible unit (in terms of compression

and encoding).

Page 12

Example file

• A file consists of one or more row groups; here, N columns are split into M row groups

• A row group contains exactly one column chunk per column.

• Column chunks contain one or more pages.

Image source: https://parquet.apache.org/documentation/latest/

• Reading columns is straightforward

• Record level API to integrate with existing row-based

engines (Hive, Pig, M/R)

– Repetition level 0 indicates new record

– Repetition/Definition level capture the structure

– One column per leaf in the schema

• Unit of parallelization

– MapReduce - File/Row Group

– IO - Column chunk

– Encoding/Compression - Page

Page 13

Encodings

• Bit encoding

– Small integers encoded in minimum bits required

– Useful for repetition and definition levels

• Run-length encoding

– sequences in which the same data value occurs in many

consecutive data elements are stored as a single data value and

count

– Useful for definition levels of sparse columns

• Dictionary encoding

– searches for matches between the text to be compressed and a

set of strings contained in a 'dictionary'

– When the encoder finds a match, it substitutes a reference to

the string's position in the data structure.

– Useful for a small (~60k) set of values
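Toy pure-Python sketches of the last two encodings; these illustrate the idea only and are not Parquet's actual byte-level formats:

```python
def run_length_encode(values):
    """Collapse runs of equal consecutive values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

# Definition levels of a sparse column compress well with run-length encoding
print(run_length_encode([0, 0, 0, 1, 1, 0]))        # [(0, 3), (1, 2), (0, 1)]
# A low-cardinality string column compresses well with dictionary encoding
print(dictionary_encode(['us', 'gb', 'us', 'us']))  # (['us', 'gb'], [0, 1, 0, 0])
```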

Encodings

• Delta encoding (new in parquet 2)

Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635

Page 14

Formats comparison (I)

Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635

Formats comparison (II)

Image source: Julien Le Dem, Nong Li:”Efficient Data Storage for Analytics with Apache Parquet 2.0”, https://www.slideshare.net/cloudera/hadoop-summit-36479635

Page 15

Parquet implementations

• Java implementation

– sources can be built using mvn package.

– current stable version is available from Maven

Central.

• C++ sources

– Based on Apache Thrift (a software stack with a

code generation engine to build services that work

efficiently and seamlessly between numerous

languages, including C++, Java, Python, … )