This set of slides describes the efficient Java object model in Hive.
Efficient Object Model in Java
Slides by Zheng Shao, Facebook
Part of Apache Hadoop Hive Project
Object Inspector
On-disk Data Format
▪ Single on-disk format systems
  ▪ Simplicity
▪ Multiple on-disk format systems
  ▪ Ease of use
  ▪ Ease of integration
  ▪ Flexibility: better trade-offs between space, performance, etc.
▪ Hive allows multiple on-disk formats
Example Multiple On-disk Formats
▪ File format:
  ▪ Row-based
  ▪ Column-based
  ▪ Block-based
▪ Row format:
  ▪ Text-based
  ▪ Binary-based
  ▪ Customized
▪ Index format
In-memory Data Format
▪ Single in-memory format systems
  ▪ Simplicity: simpler code
▪ Multiple in-memory format systems
  ▪ Ease of integration: other systems may use their own format
  ▪ Performance:
    ▪ Multiple on-disk or external formats, combined with efficient loading, call for multiple in-memory formats
▪ Hive allows multiple in-memory formats
Example Multiple In-memory Formats
▪ Integer:
  ▪ Integer
  ▪ IntWritable
  ▪ LazyInteger
▪ String:
  ▪ String
  ▪ Text
Multiple In-memory Format Design Patterns
▪ Object-oriented:
  ▪ A single interface/base class for Integer
  ▪ Multiple derived classes
▪ Delegation:
  ▪ Data is stored in the object
  ▪ Format/operations are stored in the ObjectInspector
  ▪ A pair of object and ObjectInspector represents a data unit
▪ It’s possible to wrap either one so that it conforms to the other’s pattern.
Multiple In-memory Format Design Patterns
▪ In OO, we need an interface HiveInteger to represent integers
  ▪ Make the Integer and IntWritable classes all implement it.
  ▪ However, the Integer class is final (not extendable) and does not implement HiveInteger
  ▪ We need to do a conversion every time we exchange data with a UDF, a SerDe (Thrift), or other libraries (unless they know HiveInteger, which is a bad assumption to make in an open system).
▪ Delegation is a better idea because
  ▪ For Integer, we have a JavaIntegerObjectInspector
  ▪ For IntWritable, we have a WritableIntegerObjectInspector
  ▪ We convert params and return values only if necessary
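The delegation idea can be illustrated with a minimal Java sketch. Everything in it is hypothetical and is not Hive's actual API: IntInspector, JavaIntInspector and MutableIntInspector are made-up names, and java.util.concurrent.atomic.AtomicInteger stands in for a mutable holder such as IntWritable so that the sketch has no Hadoop dependency.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical inspector interface: knowledge about the format lives here,
// not in the data object itself.
interface IntInspector {
    int get(Object o);
}

// Reads a plain java.lang.Integer, which is final and cannot be made to
// implement a Hive-specific interface.
final class JavaIntInspector implements IntInspector {
    public int get(Object o) { return (Integer) o; }
}

// Reads a mutable holder. AtomicInteger is only a stand-in for a class
// like IntWritable.
final class MutableIntInspector implements IntInspector {
    public int get(Object o) { return ((AtomicInteger) o).get(); }
}

final class DelegationSketch {
    // Code written against (object, inspector) pairs never converts its
    // input, whatever the in-memory representation is.
    static int sum(Object[] values, IntInspector[] inspectors) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += inspectors[i].get(values[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        Object[] values = { Integer.valueOf(3), new AtomicInteger(4) };
        IntInspector[] inspectors = { new JavaIntInspector(), new MutableIntInspector() };
        System.out.println(sum(values, inspectors)); // prints 7
    }
}
```

Neither data class had to change or be wrapped; supporting a new representation only requires a new inspector.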
Delegation Method List
▪ General methods:
  ▪ isNull(object o)
  ▪ hashCode(object o)
  ▪ compare(object o)
  ▪ clone(object o)
▪ Primitive Objects:
  ▪ primitive getValue(object o)
▪ String Objects:
  ▪ String getString(object o)
  ▪ Text getText(object o)
▪ List Objects:
  ▪ getListSize(object o)
  ▪ getListElement(object o)
  ▪ getList(object o)
▪ Map Objects:
  ▪ getMapSize(object o)
  ▪ getValueForKey(object o)
  ▪ getMap(object o)
▪ Struct Objects:
  ▪ getStructField(object o)
  ▪ getStructAsAList(object o)
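As a rough illustration of how a few of these methods could look, here is a sketch over "standard objects" (a List for a struct, a Map for a map). The class names are hypothetical, and the extra key and index parameters are assumptions, since the list above shows only the object argument.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down inspectors. Not Hive's real class hierarchy.
final class StringInspectorSketch {
    String getString(Object o) { return (String) o; }
}

final class MapInspectorSketch {
    final StringInspectorSketch valueInspector = new StringInspectorSketch();
    int getMapSize(Object o) { return ((Map<?, ?>) o).size(); }
    // Assumption: the key is passed as a second argument.
    Object getValueForKey(Object o, Object key) { return ((Map<?, ?>) o).get(key); }
}

final class StructInspectorSketch {
    // In this sketch only field 0 (a map of strings) has an inspector.
    final MapInspectorSketch firstFieldInspector = new MapInspectorSketch();
    // Assumption: fields are addressed by position.
    Object getStructField(Object o, int index) { return ((List<?>) o).get(index); }
}

final class InspectorWalkSketch {
    // Walks struct -> map field -> string value using inspectors only.
    static String lookup(Object row, StructInspectorSketch rowInspector, String key) {
        Object map = rowInspector.getStructField(row, 0);
        MapInspectorSketch mapInspector = rowInspector.firstFieldInspector;
        Object value = mapInspector.getValueForKey(map, key);
        return value == null ? null : mapInspector.valueInspector.getString(value);
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("a", "av");
        a.put("b", "bv");
        List<Object> row = Arrays.<Object>asList(a, 23, "abcd");
        System.out.println(lookup(row, new StructInspectorSketch(), "a")); // prints av
    }
}
```

The caller never casts the row or its fields itself; every access goes through the inspector that matches the representation.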
SerDe
Where is SerDe?
[Diagram: data flow through the Mapper and the Reducer. Labels: File on HDFS, Map Output File, User Script, Stream, Writable, Hierarchical Object, Hive Operator. FileFormat / Hadoop Serialization sits between files and Writables; SerDe sits between Writables and Hierarchical Objects.]
ObjectInspector
[Diagram: how SerDe and ObjectInspector relate to the data at each level.
▪ Rows on disk: text lines such as "imp 1.0 3 54", "Imp 0.2 1 33", "clk 2.2 8 212", "Imp 0.7 2 22", or thrift_record<…>
▪ Writable: Text(‘imp 1.0 3 54’) (UTF-8 encoded) or BytesWritable(\x3F\x64\x72\x00)
▪ Hierarchical Object, one of:
  ▪ Java Object: object of a Java class
  ▪ Standard Object: uses ArrayList for struct and array, HashMap for map
  ▪ LazyObject: lazily deserialized
▪ SerDe methods: deserialize, serialize, getOI
▪ ObjectInspector1 methods: getType, getFieldOI, getStructField
▪ ObjectInspector2 methods: getType, getMapValueOI, getMapValue]
SerDe, ObjectInspector and TypeInfo
[Diagram: one row seen at each level.
▪ Writable: Text(‘a=av:b=bv 23 1:2=4:5 abcd’) or BytesWritable(\x3F\x64\x72\x00)
▪ Hierarchical Object as a Java object: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } class ClassC { Integer a; Integer b; }
▪ Hierarchical Object as a standard object: List(HashMap(“a” → “av”, “b” → “bv”), 23, List(List(1, null), List(2, 4), List(5, null)), “abcd”)
▪ TypeInfo tree: a struct of map (string to string), int, list of struct (int, int), and string
▪ Drill-down example: the whole row, then its field a, HashMap(“a” → “av”, “b” → “bv”), then the String object “av”
▪ Other labels: getType, ObjectInspector3, TypeInfo]
LazySimpleSerDe components
[Diagram: LazySimpleSerDe components for the row byte[](‘a=av:b=bv 23 1:2=4:5 abcd’).
▪ Hierarchical Object / LazyObject: a LazyStruct whose fields are a LazyMap (LazyString keys and values), a LazyInteger, a LazyArray of LazyStructs (each holding LazyIntegers), and a LazyString, over a shared byte[] data
▪ LazyObjectInspector: LazyStructOI(“ “), LazyMapOI(“:”, ”=“), LazyArrayOI(“:”), LazyStructOI(“=“), LazyIntegerOI, LazyStringOI, StandardIntegerOI
▪ Other labels: One per SerDe instance, Singleton]
LazyPrimitive
▪ LazyString/LazyInteger
  ▪ setAll(byte[] data, int start, int length)
    ▪ LazyString: parse the data and create a String object
    ▪ LazyInteger: parse the data and create an Integer object
  ▪ getObject(): returns the corresponding String/Integer object
▪ Future
  ▪ Replace String/Integer with Text/IntWritable
  ▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
LazyNonPrimitive
▪ LazyStruct/LazyArray/LazyMap
  ▪ setAll(byte[] data, int start, int length)
    ▪ Remember data, start and length, and set parsed to false.
  ▪ getStructField/getArrayElement/getMapValue
    ▪ If not parsed yet, parse the bytes and remember the starting position of each field/element/key/value
    ▪ For Struct/Array, do setAll on the corresponding LazyObject and return it
    ▪ For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
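The setAll / parse-on-first-access behaviour described above can be sketched as follows. This is a simplified, hypothetical class, not Hive's LazyStruct: it handles one level of string fields only and returns a String directly instead of calling setAll on a child LazyObject. The parseCalls counter exists only to make the laziness observable.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical lazily parsed struct of string fields split by one separator byte.
final class LazyStructSketch {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    private int[] fieldStart;   // fieldStart[i] = offset of field i, plus one sentinel entry
    private int fieldCount;
    int parseCalls;             // demonstration only

    LazyStructSketch(byte separator) { this.separator = separator; }

    // Remember the byte range and defer all work.
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    // Scan for separators once and remember where each field starts.
    private void parse() {
        parseCalls++;
        int end = start + length;
        int count = 1;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) count++;
        }
        if (fieldStart == null || fieldStart.length < count + 1) {
            fieldStart = new int[count + 1];   // reuse the array when it is large enough
        }
        int f = 0;
        fieldStart[f++] = start;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) fieldStart[f++] = i + 1;
        }
        fieldStart[f] = end + 1;   // sentinel, as if a separator followed the last field
        fieldCount = count;
        parsed = true;
    }

    // Only the requested field is turned into an object.
    String getStructField(int index) {
        if (!parsed) parse();
        if (index < 0 || index >= fieldCount) return null;
        int from = fieldStart[index];
        int len = fieldStart[index + 1] - 1 - from;
        return new String(data, from, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = "imp 1.0 3 54".getBytes(StandardCharsets.UTF_8);
        LazyStructSketch struct = new LazyStructSketch((byte) ' ');
        struct.setAll(row, 0, row.length);
        System.out.println(struct.getStructField(3)); // prints 54
    }
}
```

Calling setAll again with the next row resets the parsed flag while keeping the same object and offset array, which is the reuse the slides aim for.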
Why another SerDe?
▪ Functionality:
  ▪ MetadataTypedColumnSetSerDe can only deal with String columns
  ▪ DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
▪ Efficiency:
  ▪ Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
Features of LazySimpleSerDe
▪ Functionality:
  ▪ Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated
  ▪ Fully supports all nested types (map keys must be primitive)
▪ Efficiency:
  ▪ Fully supports lazy deserialization: a field is deserialized (and objects are created) only when asked for.
  ▪ Reuses multiple levels of LazyObjects.
  ▪ Reads numbers without UTF-8 decoding
  ▪ (TODO) Fully reuse objects: IntWritable for Integer, Text for String
  ▪ (TODO) Write numbers without UTF-8 encoding
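Reading a number without UTF-8 decoding means parsing the ASCII digits straight from the byte range instead of first building a String. A minimal sketch, assuming plain decimal integers with an optional sign; LazyIntSketch and parseInt are hypothetical names, not Hive's code:

```java
import java.nio.charset.StandardCharsets;

final class LazyIntSketch {
    // Parses data[start .. start+length) as a decimal int without creating
    // a String or running a charset decoder.
    static int parseInt(byte[] data, int start, int length) {
        if (length <= 0) throw new NumberFormatException("empty input");
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') i++;
        if (i == end) throw new NumberFormatException("no digits");
        long value = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("unexpected byte at offset " + i);
            }
            value = value * 10 + digit;
            // 2^31 is still allowed here because it is valid after negation.
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("overflow");
            }
        }
        if (negative) value = -value;
        if (value > Integer.MAX_VALUE) throw new NumberFormatException("overflow");
        return (int) value;
    }

    public static void main(String[] args) {
        byte[] row = "imp 1.0 3 54".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseInt(row, 10, 2)); // prints 54
    }
}
```

This works on the raw bytes because the digits and sign characters are single-byte ASCII in UTF-8.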
Profiling result of a mapper
▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪ 22%: Operator.close
▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
▪ 50%: Operator.forward
▪ |-18%: Text.decode (from LazySerDe)
▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding)
▪ | |- 5%: toString() (where we create the string object)
▪ |- 3%: LazyStruct.parse (the code that searches for separators in the row)
▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
▪ |- 8%: GroupByOperator.processHashAggr
▪ |- 3%: HashMap.get() in GroupByOperator
▪ * Performance Data from Rodrigo Schmidt
TypeInfo String specification
▪ Why not Thrift?
  ▪ Hard to parse
▪ Simple syntax
  ▪ Type: PrimitiveType | MapType | ArrayType | StructType
  ▪ PrimitiveType: int | bigint | tinyint | smallint | double | string
  ▪ MapType: map<Type, Type>
  ▪ ArrayType: array<Type>
  ▪ StructType: struct< [Name : Type]+ >
▪ Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
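A grammar this small can be handled by a short recursive-descent parser. The sketch below is illustrative only and is not Hive's TypeInfo parser: TypeNode and TypeStringParser are hypothetical names, and the comma between struct fields is an assumption taken from the example string, since the grammar line does not state the separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parse-tree node for a type string.
final class TypeNode {
    final String kind;                                   // primitive name, or "map", "array", "struct"
    final List<String> fieldNames = new ArrayList<>();   // struct field names only
    final List<TypeNode> children = new ArrayList<>();
    TypeNode(String kind) { this.kind = kind; }
}

final class TypeStringParser {
    private static final List<String> PRIMITIVES =
        Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");
    private static final List<String> COMPOUND = Arrays.asList("map", "array", "struct");
    private final String s;
    private int pos;

    private TypeStringParser(String s) { this.s = s; }

    static TypeNode parse(String text) {
        TypeStringParser p = new TypeStringParser(text.replace(" ", ""));
        TypeNode node = p.type();
        if (p.pos != p.s.length()) throw new IllegalArgumentException("trailing input at " + p.pos);
        return node;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private TypeNode type() {
        String word = word();
        if (PRIMITIVES.contains(word)) return new TypeNode(word);
        if (!COMPOUND.contains(word)) throw new IllegalArgumentException("unknown type: " + word);
        TypeNode node = new TypeNode(word);
        expect('<');
        if (word.equals("array")) {
            node.children.add(type());
        } else if (word.equals("map")) {
            node.children.add(type());
            expect(',');
            node.children.add(type());
        } else {
            do {                                  // struct: one or more Name:Type pairs
                node.fieldNames.add(word());
                expect(':');
                node.children.add(type());
            } while (tryConsume(','));
        }
        expect('>');
        return node;
    }

    private String word() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) throw new IllegalArgumentException("expected a name at " + pos);
        return s.substring(begin, pos);
    }

    private void expect(char c) {
        if (!tryConsume(c)) throw new IllegalArgumentException("expected '" + c + "' at " + pos);
    }

    private boolean tryConsume(char c) {
        if (pos < s.length() && s.charAt(pos) == c) { pos++; return true; }
        return false;
    }
}
```

For the example string above, TypeStringParser.parse returns an array node whose single child is a map from string to a struct with fields a, b and c.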
Future Work
Future Work of ObjectInspector
▪ Delegate all methods described earlier
  ▪ isNull(), hashCode(), compare(), etc. are not delegated yet
▪ Support the UNION data type: HIVE-537
Future Work of SerDe
▪ LazyBinarySerDe: HIVE-553
▪ A binary-format sortable SerDe: serialized sorting order is the same as deserialized sorting order
▪ A binary-format compact SerDe: saving space
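To make "serialized sorting order is the same as deserialized sorting order" concrete, here is one well-known way to encode a 32-bit integer so that unsigned byte-by-byte comparison matches numeric order: write it big-endian with the sign bit flipped. This is a generic technique shown under that assumption; the slides do not say which encoding the planned SerDe uses, and SortableIntSketch is a hypothetical name.

```java
final class SortableIntSketch {
    // Big-endian bytes with the sign bit flipped, so negative values
    // compare lower than positive ones as unsigned bytes.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    static int decode(byte[] b) {
        int flipped = ((b[0] & 0xFF) << 24)
                    | ((b[1] & 0xFF) << 16)
                    | ((b[2] & 0xFF) << 8)
                    | (b[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    // Unsigned lexicographic comparison, the kind a raw byte comparator performs.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(compareBytes(encode(-5), encode(3)) < 0); // prints true
    }
}
```

With such an encoding, keys can be sorted as raw bytes without deserializing them first.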