
Ghislain Fourny

Big Data for Engineers, Spring 2018

11. Data Models


CSV (Comma separated values)

ID,Last name,First name,Theory
1,Einstein,Albert,"General, Special Relativity"
2,Gödel,Kurt,"""Incompleteness"" Theorem"

This is syntax

ID  Last name  First name  Theory
1   Einstein   Albert      General, Special Relativity
2   Gödel      Kurt        "Incompleteness" Theorem

This is a data model


Syntax vs. Data Models

Physical view (syntax):

ID,Last name,First name,Theory
1,Einstein,Albert,"General, Special Relativity"
2,Gödel,Kurt,"""Incompleteness"" Theorem"

Logical view (data model):

ID  Last name  First name  Theory
1   Einstein   Albert      General, Special Relativity
2   Gödel      Kurt        "Incompleteness" Theorem
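The step from the physical view to the logical view can be sketched with Python's standard csv module. This is only an illustration: the variable names are made up here and are not part of the lecture.

```python
import csv
import io

# Physical view: the raw CSV syntax, as it would be stored in a file.
csv_text = (
    'ID,Last name,First name,Theory\n'
    '1,Einstein,Albert,"General, Special Relativity"\n'
    '2,Gödel,Kurt,"""Incompleteness"" Theorem"\n'
)

# Logical view: a table, modeled here as a header plus a list of rows.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

for record in records:
    # Quotes and escaped quotes belong to the syntax, not to the values.
    print(dict(zip(header, record)))
```

Note that the quoting around "General, Special Relativity" and the doubled quotes around "Incompleteness" disappear once the text has been parsed: they exist only in the syntax.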


Syntax vs. Data Models

Physical view (syntax):

<a>
  <d e="f"/>
  <c>This is <b>text</b>.</c>
</a>

Logical view (data model):

a
  d (attribute e = f)
  c
    "This is "
    b
      "text"
    "."
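As a rough sketch, the same move from syntax to tree can be reproduced with Python's xml.etree.ElementTree. The helper `describe` is hypothetical, and ElementTree stores text in `text` and `tail` fields rather than as separate text nodes, so its tree is only an approximation of the logical view in the figure.

```python
import xml.etree.ElementTree as ET

# Physical view: a string of characters.
xml_text = '<a><d e="f"/><c>This is <b>text</b>.</c></a>'

# Logical view: a tree of elements, attributes and text.
root = ET.fromstring(xml_text)


def describe(element, depth=0):
    """Return indented lines describing an element and its descendants."""
    lines = ["  " * depth + f"element {element.tag} attributes={element.attrib}"]
    if element.text:
        lines.append("  " * (depth + 1) + f"text {element.text!r}")
    for child in element:
        lines.extend(describe(child, depth + 1))
        if child.tail:
            # Text that follows a child element inside the same parent.
            lines.append("  " * (depth + 1) + f"text {child.tail!r}")
    return lines


print("\n".join(describe(root)))
```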


Edge vs. Node labeling

[Figure: two small trees side by side. In the first, the labels foo and bar are on the edges. In the second, the labels foobar, foo and bar are on the nodes.]


XML Data models

Information Set (Infoset): http://www.w3.org/TR/xml-infoset/

Post Schema-Validation Infoset (PSVI): http://www.w3.org/TR/xmlschema11-1/

XQuery and XPath Data Model (XDM): http://www.w3.org/TR/xpath-datamodel/


JSON Data Models

"original" (implicit) JSON Data Model: http://www.json.org/

JSON Schema Data Model: https://www.ietf.org/archive/id/draft-wright-json-schema-01.txt

JSONiq Data Model (JDM): http://www.jsoniq.org/docs/JSONiqExtensionToXQuery/html/section-jsoniq-data-model.html

HTML/XML Data model

Document Object Model (DOM): http://www.w3.org/TR/REC-DOM-Level-1/

XML Information Set

Information Set

<?xml version="1.0" encoding="UTF-8"?>
<dc:metadata xmlns:dc="http://www.systems.ethz.ch">
  <title xml:lang="en" year="2008">Systems Group</title>
  <publisher>ETH Zurich</publisher>
</dc:metadata>

The 11 XML Information Items

- Document
- Element
- Attribute
- Processing Instruction
- Character
- Comment
- Namespace
- Unexpanded Entity Reference
- DTD
- Unparsed Entity
- Notation


Document Information Items

Document Information Item
[children] Element Information Item
[document element] Element Information Item metadata
[notations] <empty>
[unparsed entities] <empty>
[base URI] file:///Users/bigdata/Documents/info.xml
[character encoding scheme] UTF-8
[standalone] <no value>
[version] 1.0

[Figure: tree with the document node doc and its child element metadata]

Element Information Items

Element Information Item
[namespace name] http://www.systems.ethz.ch
[local name] metadata
[prefix] dc
[children] Element Information Items
[attributes] <empty>
[namespace attributes] Attribute Information Item xmlns:dc
[in-scope namespaces] Namespace Information Items
[base URI] file:///Users/bigdata/Documents/info.xml
[parent] Document Information Item

[Figure: tree with doc at the root, metadata below it, and title and publisher as children of metadata]

Attribute Information Items

Attribute Information Item
[namespace name] empty
[local name] year
[prefix] empty
[normalized value] 2008
[specified] true
[attribute type] <no value>
[references] unknown
[owner element] Element Information Item

[Figure: the attribute year attached to its owner element title]
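Several of these properties can be inspected with Python's standard xml.dom.minidom. The DOM is a different API from the Infoset, so the sketch below only shows the properties that happen to line up (namespace name, local name, prefix, attribute value, owner element); it is not an Infoset implementation.

```python
from xml.dom import minidom

# The example document from the Information Set slide, as bytes.
document = minidom.parseString(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<dc:metadata xmlns:dc="http://www.systems.ethz.ch">'
    b'<title xml:lang="en" year="2008">Systems Group</title>'
    b'<publisher>ETH Zurich</publisher>'
    b'</dc:metadata>'
)

# Element properties: [namespace name], [local name], [prefix].
metadata = document.documentElement
print(metadata.namespaceURI, metadata.localName, metadata.prefix)

# Attribute properties: [local name], [normalized value], [owner element].
title = metadata.getElementsByTagName("title")[0]
year = title.getAttributeNode("year")
print(year.localName, year.value, year.ownerElement.tagName)
```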

XML Infoset - the tree

doc
  metadata (namespace attribute xmlns:dc; in scope: dc -> systems)
    title (attributes lang, year; in scope: dc -> systems)
      Systems Group
    publisher (in scope: dc -> systems)
      ETH Zurich

Post-Schema-Validation Infoset

Infoset + Types = Post-Schema-Validation Infoset (PSVI)

XPath and XQuery Data Model


XDM: Sequences of Items

(item1, item2, item3, item4, item5, item6)

XDM: Sequence of one item

item = (item)

XDM: Sequences are flat

((item1, item2), item3) = (item1, item2, item3)
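Flatness can be mimicked in Python with a small helper. This is only an analogy under stated assumptions: `seq` is a hypothetical function, Python tuples stand in for XDM sequences, and Python itself still distinguishes a value from a one-element tuple, which XDM does not.

```python
def seq(*items):
    """Build a flat sequence: nested sequences are spliced in, never nested."""
    flat = []
    for item in items:
        if isinstance(item, tuple):
            # A sequence inside a sequence contributes its items directly.
            flat.extend(seq(*item))
        else:
            flat.append(item)
    return tuple(flat)


print(seq(seq(1, 2), 3))   # nesting is lost
print(seq(seq(1)))         # a sequence of a one-item sequence is that one item
```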

XDM: Items

- Atomic
- Node

XDM: Seven Kinds of XML Nodes

§ Document node
§ Element node
§ Attribute node
§ Text node
§ Comment node
§ Processing instruction node
§ Namespace node

XDM vs. Infoset

[Figure: an Infoset mapped to an XDM instance; the label xs:untyped appears on the XDM side]


XDM: New Items in 3.0 and 3.1

- Functions
- Maps
- Arrays

[Figure: example items holding the strings lorem, ipsum, dolor, sit, amet]

XDM and Querying

[Figure: Expression at the center, surrounded by the keywords and operators for, let, where, order by, return, if, then, else, while, any, every, exit with, = and +]

Types


Type Systems

Almost all type systems (Java, SQL, PSVI, JDM, Protocol buffers, Avro, Parquet, and so on) share the following properties:

- Distinction between atomic types and structured types

- Same categories of atomic types

- Lists and maps as structured types

- Sequence type cardinalities

Types (General)

Atomic Types vs. Structured Types


Strings (character sequences with monoid structure)

"foo"                     f o o

"Zurich"                  Z u r i c h

"Ilsebill salzte nach."   I l s e b i l l ␣ s a l z t e ␣ n a c h .
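"Monoid structure" means that strings come with an associative operation (concatenation) and a neutral element (the empty string). A minimal Python check of those two laws on the example strings, with `concat` and `EMPTY` as illustrative names:

```python
def concat(left: str, right: str) -> str:
    """The monoid operation on strings: concatenation."""
    return left + right


EMPTY = ""  # the neutral element

a, b, c = "foo", "Zurich", "Ilsebill salzte nach."

# Associativity: grouping does not matter.
assert concat(concat(a, b), c) == concat(a, concat(b, c))

# Identity: the empty string changes nothing on either side.
assert concat(EMPTY, a) == a == concat(a, EMPTY)

# A string is a sequence of characters.
print(list(a))
```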


Interval-based integer types (exist as signed and unsigned)

- 8-bit (Java's byte)
- 16-bit (Java's short)
- 32-bit (Java's int)
- 64-bit (Java's long)
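The interval covered by each of these types follows from its bit width. A short sketch, assuming two's complement for the signed variants (as in Java); `integer_range` is a hypothetical helper:

```python
def integer_range(bits: int, signed: bool = True) -> tuple:
    """Return (min, max) of a fixed-width integer type with the given bit count."""
    if signed:
        # Two's complement: one more negative value than positive values.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


for bits in (8, 16, 32, 64):
    print(bits, integer_range(bits), integer_range(bits, signed=False))
```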

Arbitrary precision decimals (and integers)

Any precision and scale

3141592653.5897932384626433832795

Float and Double (IEEE 754 standard)

                     single precision    double precision
Size                 32 bits             64 bits
Significant digits   ca. 7               ca. 15
Example              3141592000          3141592653.58979
Range                10^-37 to 10^37     10^-307 to 10^308
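The difference between arbitrary-precision decimals and the two IEEE 754 formats can be observed directly in Python. In this sketch, `decimal.Decimal` plays the role of the arbitrary-precision type, Python's `float` is a double, and a round trip through `struct` simulates single precision.

```python
import struct
from decimal import Decimal

literal = "3141592653.5897932384626433832795"

# Arbitrary precision: every digit of the literal is kept.
exact = Decimal(literal)

# Double precision: about 15 significant digits survive.
as_double = float(literal)

# Single precision: pack into 32 bits and read back, about 7 digits survive.
as_single = struct.unpack("<f", struct.pack("<f", as_double))[0]

print(exact)
print(repr(as_double))
print(repr(as_single))
```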


Booleans

TRUE, t, true, y, yes, on, 1

FALSE, f, false, n, no, off, 0

Dates and times

Date

Time

Timestamp

Duration

Dates (Gregorian calendar)

Year + Month + Day

2017 August 1st (AD)

Times

Hours + Minutes + Seconds

10:31:15.109378

Timestamps

Year + Month + Day + Hours + Minutes + Seconds

2017 August 1st (AD), 10:31:15.109378
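Python's standard datetime module has the same three building blocks, which makes the composition concrete. The values are the ones used on the slides; the variable names are illustrative.

```python
from datetime import date, time, datetime

# Date: year + month + day.
d = date(2017, 8, 1)

# Time: hours + minutes + seconds (with microseconds).
t = time(10, 31, 15, 109378)

# Timestamp: a date combined with a time.
ts = datetime.combine(d, t)

print(d.isoformat(), t.isoformat(), ts.isoformat())
```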


Duration kinds

Year, Month | Day, Hour, Minute, Second

Example (year-month duration): 2 years and 4 months

Example (day-time duration): 3 hours and 14 minutes

Atomic Types

- Strings
- Numbers
- Booleans
- Dates and Times
- Time Intervals
- Binaries
- Null

Lexical space vs. value space

Value space   Lexical space
1             "1", "01", ...
4             "4", "04", "100b", ...
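The mapping from lexical space to value space is a function from strings to values, and several strings may map to the same value. A small sketch, assuming that the trailing "b" in "100b" marks a binary literal; `integer_value` is a hypothetical helper.

```python
def integer_value(lexical: str) -> int:
    """Map a lexical form (a string) to the integer value it denotes."""
    if lexical.endswith("b"):
        # Assumption: a trailing "b" marks binary notation, as in "100b".
        return int(lexical[:-1], 2)
    return int(lexical, 10)


for form in ("1", "01", "4", "04", "100b"):
    print(repr(form), "->", integer_value(form))
```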

Subtypes

[Figure: the subtype's value space is contained in the supertype's value space]

Structured Types

Data Structure                     Examples
Associative Arrays (a.k.a. maps)   JSON Object, Protobuf Message, Set of XML Attributes
Ordered Lists                      JSON Array, XML Element, Protobuf repeated field



Cardinality

How many?      Common sign   Common adjective
One                          required
Zero or more   *
Zero or one    ?             optional
One or more    +
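Each cardinality amounts to a lower and an upper bound on the number of occurrences. A minimal sketch of that reading; the table `CARDINALITIES` and the function `satisfies` are illustrative names, and the empty string stands for "no sign".

```python
# sign -> (minimum occurrences, maximum occurrences or None for unbounded)
CARDINALITIES = {
    "": (1, 1),      # one (required)
    "*": (0, None),  # zero or more
    "?": (0, 1),     # zero or one (optional)
    "+": (1, None),  # one or more
}


def satisfies(count: int, sign: str) -> bool:
    """Check whether a number of occurrences is allowed by a cardinality sign."""
    low, high = CARDINALITIES[sign]
    return count >= low and (high is None or count <= high)


for sign in CARDINALITIES:
    print(repr(sign), [n for n in range(4) if satisfies(n, sign)])
```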

JSON Data Model


JSON Values

Atomic values

- Strings
- Numbers
- Booleans
- Null

Structured values (recursion: they contain values)

- Objects: String-to-Value map
- Arrays: List of values


Tree-based visual model

{
  "foo" : true,
  "bar" : [
    {
      "foobar" : "foo"
    },
    null
  ]
}

object
  foo: true
  bar: array
    object
      foobar: "foo"
    null
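The tree can also be produced programmatically. The sketch below parses the example with Python's standard json module and walks the result recursively; `tree` is a hypothetical helper written for this illustration.

```python
import json

text = '{"foo": true, "bar": [{"foobar": "foo"}, null]}'
document = json.loads(text)


def tree(value, label="", depth=0):
    """Return indented lines describing a JSON value as a tree."""
    prefix = "  " * depth + (f"{label}: " if label else "")
    if isinstance(value, dict):
        lines = [prefix + "object"]
        for key, child in value.items():
            lines.extend(tree(child, key, depth + 1))
    elif isinstance(value, list):
        lines = [prefix + "array"]
        for child in value:
            lines.extend(tree(child, "", depth + 1))
    else:
        # Atomic values: strings, numbers, booleans, null.
        lines = [prefix + json.dumps(value)]
    return lines


print("\n".join(tree(document)))
```

Objects and arrays are the only places where the function recurses, which mirrors the recursion in the data model.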

Validation

Validation: The Pipeline

Document -> Well-Formedness -> Validation
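The first stage of the pipeline can be sketched with Python's standard library: a document is well-formed if an XML parser accepts it. The second stage, validation against a schema, needs a schema-aware library that the standard library does not provide, so it is left out here. `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<foo>This is text.</foo>"))   # tags are balanced
print(is_well_formed("<foo><bar></foo>"))           # <bar> is never closed
```

A well-formed document is not necessarily valid: validity is only defined relative to a schema.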

On the oXygen Cheat Sheet

Validity Well-Formedness


Validation vs. Annotation

Validation

Annotation


XML Schema

Empty Schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
</xs:schema>

Simple Scenario

Schema (schema.xsd):

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>

Instance (file.xml):

<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="schema.xsd">
  This is text.
</foo>


Simple Scenario

Schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:integer"/>
</xs:schema>

Instance:

<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="schema.xsd">
  142857
</foo>

Simple Types: Built-in

Strings           string, anyURI, QName
Numbers           decimal, integer, float, double, long, int, short, byte, positiveInteger, nonNegativeInteger, ..., unsignedLong, unsignedInt, ...
Booleans          boolean
Dates and Times   dateTime, time, date, gYearMonth, gMonthDay, gYear, gMonth, gDay, dateTimeStamp
Time Intervals    duration, yearMonthDuration, dayTimeDuration
Binaries          hexBinary, base64Binary
Null              -

Dates

2014-12-02 (date)

2014-12-02T10:15:00Z (dateTime)

01:15:00-08:00 (time)

Durations

P1Y2MT3H (1 year, 2 months and 3 hours)
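The lexical form of a duration can be taken apart with a regular expression. The sketch below is a simplified reading of the notation, not a complete implementation of xs:duration: it ignores negative durations, and `parse_duration` is a hypothetical helper.

```python
import re

# P [nY] [nM] [nD] [T [nH] [nM] [nS]], every component optional.
DURATION = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?"
)


def parse_duration(lexical: str) -> dict:
    """Split a duration such as P1Y2MT3H into its components."""
    match = DURATION.fullmatch(lexical)
    # "P" alone or a dangling "T" carries no component and is rejected.
    if match is None or lexical.endswith(("P", "T")):
        raise ValueError(f"not a duration: {lexical!r}")
    return {
        name: float(text) if name == "seconds" else int(text)
        for name, text in match.groupdict(default="0").items()
    }


print(parse_duration("P1Y2MT3H"))
```

The letter M is read as months before the T and as minutes after it, which is why the T separator is needed.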


User-defined types

- Restriction
- Union (not atomic)
- List (not atomic)

Restriction

Schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="myFixedLengthString">
    <xs:restriction base="xs:string">
      <xs:length value="3"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="foo" type="myFixedLengthString"/>
</xs:schema>

Instance:

<?xml version="1.0" encoding="UTF-8"?>
<foo
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="schema.xsd">ZRH</foo>


List

Schema:

<xs:simpleType name="myList">
  <xs:list itemType="xs:string"/>
</xs:simpleType>

Instance:

<foo>foo bar foobar</foo>

Union

Schema:

<xs:simpleType name="myUnion">
  <xs:union memberTypes="xs:integer xs:boolean"/>
</xs:simpleType>

Instance:

<foo>true</foo>
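What the three constructions mean for a value can be mimicked in a few lines of Python. This is a hand-written sketch, not a schema validator: the function names are hypothetical, and only the three examples from the slides (a string of length 3, a whitespace-separated list of strings, a union of integer and boolean) are covered.

```python
import re


def check_fixed_length_string(lexical: str, length: int = 3) -> str:
    """Restriction: a string whose length is exactly the given value."""
    if len(lexical) != length:
        raise ValueError(f"expected length {length}: {lexical!r}")
    return lexical


def parse_list(lexical: str) -> list:
    """List: items are separated by whitespace."""
    return lexical.split()


def parse_integer_or_boolean(lexical: str):
    """Union: try the member types in order, integer first, then boolean."""
    text = lexical.strip()
    if re.fullmatch(r"[+-]?[0-9]+", text):
        return int(text)
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"neither integer nor boolean: {lexical!r}")


print(check_fixed_length_string("ZRH"))
print(parse_list("foo bar foobar"))
print(parse_integer_or_boolean("true"))
```

Because the integer member is tried first, a lexical form such as "1" is read as an integer here, not as a boolean.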


Complex Types

Empty:

<foo/>

Simple Content:

<foo>text</foo>

Complex Content:

<foo>
  <a/>
  <b/>
</foo>

Mixed Content:

<foo>
  Text
  <a/>
  Text
  <b/>
</foo>

Complex content

Schema:

<xs:complexType name="complexContent">
  <xs:sequence>
    <xs:element name="twotofour" type="xs:string" minOccurs="2" maxOccurs="4"/>
    <xs:element name="zeroorone" type="xs:boolean" minOccurs="0" maxOccurs="1"/>
  </xs:sequence>
</xs:complexType>

Instance:

<foo>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <zeroorone>true</zeroorone>
</foo>


Empty content

Schema:

<xs:complexType name="emptyType">
  <xs:sequence/>
</xs:complexType>

Instance:

<foo/>

Simple content

Schema:

<xs:complexType name="dateCountry">
  <xs:simpleContent>
    <xs:extension base="xs:date">
      <xs:attribute name="country" type="xs:string"/>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>

Instance:

<foo country="Switzerland">2014-12-02</foo>

Mixed content

Schema:

<xs:complexType name="mixedContent" mixed="true">
  <xs:sequence>
    <xs:element name="b" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

Instance:

<foo>Some text and some <b>bold</b> text.</foo>

Simple type on attributes

Schema:

<xs:complexType name="withAttribute">
  <xs:sequence/>
  <xs:attribute name="country"
                type="xs:string"
                default="Switzerland"/>
</xs:complexType>

Instance:

<foo country="Switzerland"/>

Named Types

Schema:

<xs:complexType name="empty">
  <xs:sequence/>
</xs:complexType>
<xs:element name="c" type="empty">
</xs:element>

Instance:

<c/>

Anonymous Types

Schema:

<xs:element name="c">
  <xs:complexType>
    <xs:sequence/>
  </xs:complexType>
</xs:element>

Instance:

<c/>

No namespaces

Schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>

Instance:

<?xml version="1.0" encoding="UTF-8"?>
<foo
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="schema.xsd">
  This is text.
</foo>

With namespaces

Schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.example.com/bigdata"
    xmlns:big="http://www.example.com/bigdata">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>

Instance:

<?xml version="1.0" encoding="UTF-8"?>
<big:foo
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.com/bigdata schema.xsd"
    xmlns:big="http://www.example.com/bigdata">
  This is text.
</big:foo>

Warning: named types with namespaces

Schema:

<xs:complexType name="empty">
  <xs:sequence/>
</xs:complexType>
<xs:element name="c" type="big:empty">
</xs:element>

Instance:

<big:c/>

@name: implicitly in the target namespace, always unprefixed

Bonus material: The Schema of Schemas

<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.w3.org/2001/XMLSchema">
  <xs:element name="schema" id="schema">
    <xs:complexType>
      <xs:complexContent>
        ..
      </xs:complexContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="element" type="xs:topLevelElement" id="element"/>
  <xs:element name="simpleType" type="xs:topLevelSimpleType" id="simpleType"/>
  <xs:element name="complexType" type="xs:topLevelComplexType" id="complexType"/>
  <xs:complexType name="element" abstract="true">
    <xs:complexContent>
      ..
    </xs:complexContent>
  </xs:complexType>
</xs:schema>

Alternate data models and validation formats

Protocol buffers

message Person {
  required string last_name = 1;
  repeated string first_name = 2;
  optional Title title = 3;
  optional Person boss = 4;
}
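The three field labels correspond to the cardinalities seen earlier: required is exactly one, repeated is zero or more, optional is zero or one. As a loose analogy only (this is not how Protocol Buffers generates code), the message can be mirrored by a Python dataclass. The `Title` enum and its members are assumptions, since the slide does not define that type.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Title(Enum):
    """Hypothetical stand-in for the Title type, which the slide leaves undefined."""
    DR = 1
    PROF = 2


@dataclass
class Person:
    last_name: str                                        # required: exactly one
    first_name: List[str] = field(default_factory=list)   # repeated: zero or more
    title: Optional[Title] = None                         # optional: zero or one
    boss: Optional["Person"] = None                       # optional and recursive


# Made-up sample data.
boss = Person(last_name="Doe", first_name=["Jane"], title=Title.PROF)
employee = Person(last_name="Roe", first_name=["Richard", "Max"], boss=boss)
print(employee)
```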

Avro

fields     values
name       ETH
canton     ZH
students   20,000

{
  "type" : "map",
  "name" : "university",
  "fields" : [
    { "name" : "name", "type" : "string" },
    { "name" : "canton", "type" : "cantonal-code" },
    { "name" : "students", "type" : "long" }
  ]
}