Efficient Object Model in Java Slides by Zheng Shao, Facebook Part of Apache Hadoop Hive Project

Hive Object Model


This set of slides describes the efficient Java object model in Hive.


Page 1: Hive Object Model

Efficient Object Model in Java

Slides by Zheng Shao, Facebook

Part of Apache Hadoop Hive Project

Page 2: Hive Object Model

Object Inspector

Page 3: Hive Object Model

On-disk Data Format

▪ Single on-disk format systems

▪ Simplicity

▪ Multiple on-disk format systems

▪ Ease-of-use

▪ Ease-of-integration

▪ Flexibility: better trade-offs between space, performance, etc.

▪ Hive allows multiple on-disk formats

Page 4: Hive Object Model

Example Multiple on-disk Formats

▪ File Format:

▪ Row-based

▪ Column-based

▪ Block-based

▪ Row format:

▪ Text-based

▪ Binary-based

▪ Customized

▪ Index format

Page 5: Hive Object Model

In-memory Data Format

▪ Single in-memory format systems

▪ Simplicity: Simpler code

▪ Multiple in-memory format systems

▪ Ease-of-integration: other systems may use their own format

▪ Performance:

▪ Multiple on-disk formats/external formats + efficient loading ⇒ multiple in-memory formats

▪ Hive allows multiple in-memory formats

Page 6: Hive Object Model

Example Multiple in-memory Formats

▪ Integer:

▪ Integer

▪ IntWritable

▪ LazyInteger

▪ String:

▪ String

▪ Text

Page 7: Hive Object Model

Multiple In-memory Format Design Patterns

▪ Object-oriented:

▪ A single interface/base class for Integer

▪ Multiple derived classes

▪ Delegation:

▪ data stored in object

▪ format/operations stored in objectInspector

▪ a pair of object and objectInspector represents a data unit

▪ It’s possible to wrap either one up to conform to the other’s pattern.

Page 8: Hive Object Model

Multiple In-memory Format Design Patterns

▪ In OO, we need an interface HiveInteger to represent integers

▪ Make the Integer and IntWritable classes both implement it.

▪ However, the Integer class is final (it cannot be extended) and does not implement HiveInteger

▪ We need to do a conversion every time we exchange data with a UDF, a SerDe (Thrift), or other libraries (unless they know HiveInteger, which is a bad assumption to make in an open system).

▪ Delegation is a better idea because:

▪ For Integer, we have a JavaIntegerObjectInspector

▪ For IntWritable, we have a WritableIntegerObjectInspector

▪ We convert params and return values only if necessary
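The delegation pattern above can be sketched roughly as follows. This is a minimal illustration, not Hive's actual API: `MutableInt` is a hypothetical stand-in for Hadoop's IntWritable so that the example compiles on its own, and `IntInspector`, `JavaIntInspector` and `MutableIntInspector` are made-up names.

```java
// Hypothetical names throughout; a sketch of delegation, not Hive's API.

/** Stand-in for Hadoop's IntWritable: a mutable, reusable int holder. */
final class MutableInt {
    int value;
    MutableInt(int value) { this.value = value; }
}

/** The inspector knows how to read an int out of an opaque Object. */
interface IntInspector {
    int get(Object o);
}

/** Reads data stored as java.lang.Integer. */
final class JavaIntInspector implements IntInspector {
    public int get(Object o) { return (Integer) o; }
}

/** Reads data stored as MutableInt. */
final class MutableIntInspector implements IntInspector {
    public int get(Object o) { return ((MutableInt) o).value; }
}

final class DelegationDemo {
    /** An operation written once against (object, inspector) pairs. */
    static int add(Object a, IntInspector aOI, Object b, IntInspector bOI) {
        return aOI.get(a) + bOI.get(b);
    }

    public static void main(String[] args) {
        // Neither argument is converted to a common class before the call.
        int sum = add(Integer.valueOf(40), new JavaIntInspector(),
                      new MutableInt(2), new MutableIntInspector());
        System.out.println(sum); // prints 42
    }
}
```

The data classes stay untouched (Integer remains final and unaware of the inspector), which is the property the OO approach cannot provide.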

Page 9: Hive Object Model

Delegation Method List

▪ General methods:

▪ isNull(object o)

▪ hashCode(object o)

▪ compare(object o)

▪ clone(object o)

▪ Primitive Objects:

▪ primitive getValue(object o)

▪ String Objects:

▪ String getString(object o)

▪ Text getText(object o)

▪ List Objects:

▪ getListSize(object o)

▪ getListElement(object o)

▪ getList(object o)

▪ Map Objects:

▪ getMapSize(object o)

▪ getValueForKey(object o)

▪ getMap(object o)

▪ Struct Objects:

▪ getStructField(object o)

▪ getStructAsAList(object o)
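As a rough illustration of how a few of the struct and map methods in this list might look when the data is held in plain Java collections, here is a small sketch. The class names `ListStructInspector` and `JavaMapInspector`, and the choice to look fields up by name, are assumptions made for this example and are not taken from Hive.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical inspector for a struct stored as a plain java.util.List. */
final class ListStructInspector {
    private final List<String> fieldNames;

    ListStructInspector(String... fieldNames) {
        this.fieldNames = Arrays.asList(fieldNames);
    }

    boolean isNull(Object o) { return o == null; }

    /** Returns the named field, or null if the struct itself is null. */
    Object getStructField(Object o, String name) {
        int index = fieldNames.indexOf(name);
        if (index < 0) throw new IllegalArgumentException("no field " + name);
        return isNull(o) ? null : ((List<?>) o).get(index);
    }

    List<?> getStructAsAList(Object o) { return (List<?>) o; }
}

/** Hypothetical inspector for a map stored as java.util.Map. */
final class JavaMapInspector {
    int getMapSize(Object o) { return ((Map<?, ?>) o).size(); }
    Object getValueForKey(Object o, Object key) { return ((Map<?, ?>) o).get(key); }
}

final class StructDemo {
    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("a", "av");
        // The data is just a List; the structure lives in the inspector.
        Object row = Arrays.<Object>asList(m, 23, "abcd");
        ListStructInspector rowOI = new ListStructInspector("a", "b", "d");
        Object field = rowOI.getStructField(row, "a");
        System.out.println(new JavaMapInspector().getValueForKey(field, "a")); // prints av
    }
}
```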

Page 10: Hive Object Model

SerDe

Page 11: Hive Object Model

Where is SerDe?

[Diagram: the path of data through the Mapper and the Reducer of a Hive job. Elements shown: File on HDFS, Map Output File, User Script, Stream, Writable, Hierarchical Object, Hive Operator. Layers labeled: FileFormat / Hadoop Serialization, SerDe, ObjectInspector.]

Examples shown in the diagram:

▪ File contents: text rows (imp 1.0 3 54 / Imp 0.2 1 33 / clk 2.2 8 212 / Imp 0.7 2 22) or thrift_record<…>

▪ Writable: Text(‘imp 1.0 3 54’) // UTF8 encoded, or BytesWritable(\x3F\x64\x72\x00)

▪ Java Object: object of a Java class

▪ Standard Object: uses ArrayList for struct and array, HashMap for map

▪ LazyObject: lazily deserialized

Page 12: Hive Object Model

SerDe, ObjectInspector and TypeInfo

[Diagram: a SerDe converts between a Writable and a Hierarchical Object (deserialize, serialize) and returns an ObjectInspector (getOI). ObjectInspector1 is labeled getType, getFieldOI and getStructField; ObjectInspector2 is labeled getType, getMapValueOI and getMapValue; ObjectInspector3 is labeled getType and is paired with a String Object. getType leads to TypeInfo nodes: a struct of map<string,string>, int, list of struct<int,int>, and string.]

Examples shown in the diagram:

▪ Writable: BytesWritable(\x3F\x64\x72\x00) or Text(‘a=av:b=bv 23 1:2=4:5 abcd’)

▪ Hierarchical Object as a Java class: class HO { HashMap<String, String> a, Integer b, List<ClassC> c, String d; } class ClassC { Integer a, Integer b; }

▪ Hierarchical Object as a standard object: List ( HashMap(“a” → “av”, “b” → “bv”), 23, List(List(1,null),List(2,4),List(5,null)), “abcd”)

▪ Struct field: HashMap(“a” → “av”, “b” → “bv”), declared as HashMap<String, String> a

▪ Map value: “av”

Page 13: Hive Object Model

LazySimpleSerDe components

[Diagram: two parallel trees over byte[](‘a=av:b=bv 23 1:2=4:5 abcd’). Labels: Hierarchical Object / LazyObject; One Per SerDe instance; LazyObjectInspector; Singleton.]

▪ LazyObject tree: a LazyStruct containing a LazyMap (LazyString keys and values), a LazyInteger, a LazyArray of LazyStruct (LazyInteger fields), and a LazyString, all referring to the same byte[] data

▪ ObjectInspector tree: LazyStructOI(" ") containing LazyMapOI(":", "=") with LazyStringOI, LazyIntegerOI, LazyArrayOI(":") with LazyStructOI("="), and LazyStringOI; StandardIntegerOI also appears in the diagram

Page 14: Hive Object Model

LazyPrimitive

▪ LazyString/LazyInteger

▪ setAll(byte[] data, int start, int length)

▪ LazyString: parse the data and create a String object

▪ LazyInteger: parse the data and create an Integer object

▪ getObject() – returns the corresponding String/Integer object

▪ Future

▪ Replace String/Integer with Text/IntWritable

▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
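A minimal sketch of the setAll/getObject contract for an integer, assuming the field is stored as ASCII decimal digits. `LazyIntSketch` is a hypothetical class written for this example, not Hive's LazyInteger; it reads the digits straight from the byte range without building an intermediate String, and returning null for malformed input is a choice made here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a lazy integer holder; not Hive's LazyInteger. */
final class LazyIntSketch {
    private Integer value;   // null when the bytes are not a valid int

    /** Parses ASCII digits directly from the byte range. */
    void setAll(byte[] data, int start, int length) {
        value = parse(data, start, length);
    }

    Integer getObject() { return value; }

    private static Integer parse(byte[] data, int start, int length) {
        if (length <= 0) return null;
        int i = start, end = start + length;
        boolean negative = data[i] == '-';
        if (negative && ++i == end) return null;   // a lone "-" is not a number
        long result = 0;                           // long, so overflow is detectable
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) return null;
            result = result * 10 + digit;
            if (result > (long) Integer.MAX_VALUE + 1) return null;
        }
        if (negative) result = -result;
        if (result > Integer.MAX_VALUE) return null;
        return (int) result;
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.US_ASCII);
        LazyIntSketch field = new LazyIntSketch();
        field.setAll(row, 10, 2);                  // the bytes "23"
        System.out.println(field.getObject());     // prints 23
    }
}
```

The same instance can be pointed at another byte range with a further setAll call, which is what makes reusing the holder across rows possible.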

Page 15: Hive Object Model

LazyNonPrimitive

▪ LazyStruct/LazyArray/LazyMap

▪ setAll(byte[] data, int start, int length)

▪ Remember data, start and length, and set parsed to false.

▪ getStructField/getArrayElement/getMapValue

▪ If not parsed yet, parse the bytes and remember the starting position of each field/element/key/value

▪ For Struct/Array, do setAll on the corresponding LazyObject and return it

▪ For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
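The lazy behaviour described above might look roughly like this for a struct with a single-byte separator. `LazyStructSketch` is a hypothetical class, and returning fields as Strings is a simplification for this example; the slides describe calling setAll on a child lazy object instead. The `parseCount` field exists only to make the laziness observable.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a lazily parsed struct; not Hive's LazyStruct. */
final class LazyStructSketch {
    private final byte separator;
    private final int fieldCount;
    private final int[] starts;   // starts[i] = offset of field i; starts[i + 1] - 1 = its end
    private byte[] data;
    private int start, length;
    private boolean parsed;
    int parseCount;               // how many times the bytes were actually scanned

    LazyStructSketch(char separator, int fieldCount) {
        this.separator = (byte) separator;
        this.fieldCount = fieldCount;
        this.starts = new int[fieldCount + 1];
    }

    /** Only remembers the byte range; no scanning happens here. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** One pass over the bytes, recording where each field begins. */
    private void parse() {
        int end = start + length, field = 0;
        starts[0] = start;
        for (int i = start; i < end && field < fieldCount - 1; i++) {
            if (data[i] == separator) starts[++field] = i + 1;
        }
        // The last field found runs to the end; fields never found get a negative length.
        while (field < fieldCount) starts[++field] = end + 1;
        parsed = true;
        parseCount++;
    }

    /** Returns the field's text, or null if the row has too few fields. */
    String getFieldAsString(int index) {
        if (!parsed) parse();
        int from = starts[index];
        int len = starts[index + 1] - 1 - from;
        if (len < 0) return null;
        return new String(data, from, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.UTF_8);
        LazyStructSketch struct = new LazyStructSketch(' ', 4);
        struct.setAll(row, 0, row.length);                // nothing parsed yet
        System.out.println(struct.getFieldAsString(3));   // prints abcd
    }
}
```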

Page 16: Hive Object Model

Why another SerDe?

▪ Functionality:

▪ MetadataTypedColumnSetSerDe can only deal with String columns

▪ DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.

▪ Efficiency:

▪ Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows

Page 17: Hive Object Model

Features of LazySimpleSerDe

▪ Functionality:

▪ Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated

▪ Fully support all nested types (Map Key must be primitive)

▪ Efficiency:

▪ Fully support lazy deserialization - only deserialize the field (and create Objects) when asked.

▪ Reuse multiple-levels of LazyObjects.

▪ Read numbers without UTF-8 decoding

▪ (TODO) Fully reuse objects - IntWritable for Integer, Text for String

▪ (TODO) Write numbers without UTF-8 encoding

Page 18: Hive Object Model

Profiling result of a mapper

▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression)

▪ 22%: Operator.close

▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)

▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)

▪ 50%: Operator.forward

▪ |-18%: Text.decode (from LazySerDe)

▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding)

▪ | |- 5%: toString() (where we create the string object)

▪ |- 3%: LazyStruct.parse (the code that searches for separators in the row)

▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)

▪ |- 8%: GroupByOperator.processHashAggr

▪ |- 3%: HashMap.get() in GroupByOperator

▪ * Performance Data from Rodrigo Schmidt

Page 19: Hive Object Model

TypeInfo String specification

▪ Why not Thrift?

▪ Hard to parse

▪ Simple Syntax

▪ Type: PrimitiveType | MapType | ArrayType | StructType

▪ PrimitiveType: int | bigint | tinyint | smallint | double | string

▪ MapType: map<Type, Type>

▪ ArrayType: array<Type>

▪ StructType: struct< [Name : Type]+ >

▪ Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
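To show that the grammar is easy to parse, here is a small recursive-descent sketch. `TypeStringParser` and its `Node` class are hypothetical names invented for this example and are not Hive's TypeInfo classes; the sketch assumes struct fields are separated by commas, as in the example type string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical recursive-descent parser for the type-string grammar. */
final class TypeStringParser {
    private static final List<String> PRIMITIVES =
        Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");

    /** A parsed type: its name, child types, and field names for structs. */
    static final class Node {
        final String name;
        final List<String> fieldNames = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final String s;
    private int pos;

    private TypeStringParser(String s) { this.s = s; }

    static Node parse(String typeString) {
        TypeStringParser p = new TypeStringParser(typeString.replace(" ", ""));
        Node n = p.type();
        if (p.pos != p.s.length()) throw p.error("trailing input");
        return n;
    }

    private Node type() {
        String word = word();
        Node n = new Node(word);
        if (PRIMITIVES.contains(word)) return n;
        if (word.equals("array")) {
            expect('<');
            n.children.add(type());
        } else if (word.equals("map")) {
            expect('<');
            n.children.add(type());
            expect(',');
            n.children.add(type());
        } else if (word.equals("struct")) {
            expect('<');
            do {                       // one or more name:type pairs
                n.fieldNames.add(word());
                expect(':');
                n.children.add(type());
            } while (accept(','));
        } else {
            throw error("unknown type '" + word + "'");
        }
        expect('>');
        return n;
    }

    private String word() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) throw error("expected a name");
        return s.substring(begin, pos);
    }

    private boolean accept(char c) {
        if (pos < s.length() && s.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void expect(char c) {
        if (!accept(c)) throw error("expected '" + c + "'");
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos + " in " + s);
    }
}
```

Calling `TypeStringParser.parse("map<string,array<int>>")` yields a `map` node whose two children are `string` and `array`; an unknown type name is rejected with an IllegalArgumentException.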

Page 20: Hive Object Model

Future Works

Page 21: Hive Object Model

Future Works of ObjectInspector

▪ Delegate all methods described earlier

▪ isNull(), hashCode(), compare(), etc. are not delegated yet

▪ Support UNION data type: HIVE-537

Page 22: Hive Object Model

Future Works of SerDe

▪ LazyBinarySerDe: HIVE-553

▪ A binary-format sortable SerDe: serialized sorting order is the same as deserialized sorting order

▪ A binary-format compact SerDe: saving space
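One common way to get a serialized sorting order that matches the deserialized order, shown here for a single int, is to write the value big-endian with its sign bit flipped. This is a generic technique offered only as an illustration; the slides do not say how the planned sortable SerDe encodes values, and `SortableIntCodec` is a hypothetical name.

```java
/** Hypothetical order-preserving int encoding; not taken from Hive. */
final class SortableIntCodec {
    /** Big-endian bytes with the sign bit flipped, so negatives sort first. */
    static byte[] encode(int v) {
        int u = v ^ 0x80000000;
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    static int decode(byte[] b) {
        int u = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
              | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return u ^ 0x80000000;
    }

    /** Unsigned lexicographic comparison, as a raw byte comparator would do. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // -5 < 3 as ints, and the encoded bytes compare the same way.
        System.out.println(compareBytes(encode(-5), encode(3)) < 0); // prints true
    }
}
```

With such an encoding, sorting can compare the serialized bytes directly without deserializing each key.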