The Synthetic Encoding: AN EFFICIENT APPROACH TO IDENTIFY DAG NODES

[email protected], some rights reserved

Synthetic Encoding


DESCRIPTION

Synthetic encoding is a state-of-the-art solution for data lineage. It is related to Dryad and Spark.


Page 1: Synthetic Encoding

The Synthetic Encoding

AN EFFICIENT APPROACH TO IDENTIFY DAG NODES

[email protected], some rights reserved

Page 2: Synthetic Encoding

Outline

• Prelude

• Canonical Encoding

• Digested Encoding

• Synthetic Encoding

• Cryptanalysis

• Application

• Related work

[email protected], 2013 - 2014

Page 3: Synthetic Encoding

Prelude

A BRIEF INTRODUCTION

Page 4: Synthetic Encoding

Directed Acyclic Graph (Wikipedia)

A directed graph may be used to represent a network of processing elements; in this formulation, data enters a processing element through its incoming edges and leaves the element through its outgoing edges. Examples of this include the following:

• In electronic circuit design, a combinational logic circuit is an acyclic system of logic gates that computes a function of an input, where the input and output of the function are represented as individual bits.

• Dataflow programming languages describe systems of values that are related to each other by a directed acyclic graph. When one value changes, its successors are recalculated; each value is evaluated as a function of its predecessors in the DAG.

• In compilers, straight line code (that is, sequences of statements without loops or conditional branches) may be represented by a DAG describing the inputs and outputs of each of the arithmetic operations performed within the code; this representation allows the compiler to perform common subexpression elimination efficiently.

• In most spreadsheet systems, the dependency graph that connects one cell to another if the first cell stores a formula that uses the value in the second cell must be a directed acyclic graph. Cycles of dependencies are disallowed because they cause the cells involved in the cycle to not have a well-defined value. Additionally, requiring the dependencies to be acyclic allows a topological order to be used to schedule the recalculations of cell values when the spreadsheet is changed.


Page 5: Synthetic Encoding

Example DAG as an Archetype

• The archetype

• Arithmetic example

• ((a+b)*(a+b))*(c+d)*d

• Aliases

• Node#1

• Node#2

• Node#3

• Node#4

[Figure: the example DAG. Data nodes a, b, c, d; combinators +, +, *, * labeled Node#1 and Node#2 (bottom), Node#3, and Node#4 (top). Legend: Data, Combinator.]

Page 6: Synthetic Encoding

Alias IS NOT Identity

• Arbitrary aliasing

• Node#1 -> Node#a

• Node#2 -> Node#b

• Node#3 -> Node#c

• Node#4 -> Node#d

[Figure: the same example DAG with the combinators relabeled Node#a and Node#b (bottom), Node#c, and Node#d (top). Legend: Data, Combinator.]

Page 7: Synthetic Encoding

Or else…

• Arbitrary aliasing

• Node#1 -> Node#3

• Node#2 -> Node#4

• Node#3 -> Node#2

• Node#4 -> Node#1

[Figure: the same example DAG with the combinators relabeled Node#3 and Node#4 (bottom), Node#2, and Node#1 (top). Legend: Data, Combinator.]

Page 8: Synthetic Encoding

Identity of DAG nodes

• No false positive

• Isomorphic

• Nodes with the same identity should not have different semantics

• No false negative

• Distinctive

• Nodes with different identities should not have the same semantics

• Identity Encoding

• Universal

• Invariant across DAGs

• Presumed to exist

[Figure: the example DAG with the combinator identities still unknown: A=? (top *), B=? (*), C=? and D=? (the two + nodes). Data nodes a, b, c, d. Legend: Data, Combinator.]

Page 9: Synthetic Encoding

Application of Identity Encodings

• Structural Analysis

• Profiling

• Detect hot spots

• Detect critical paths

• Caching

• Memoizing with identity

• Cache(identity(node)) == Apply(node)

• Verification

• Replay the computation later

• Formal verification of semantically identical combinators
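The caching rule Cache(identity(node)) == Apply(node) can be sketched as a memo table keyed by identity. This is a minimal sketch, not the author's implementation: the tuple representation of nodes, the `ENV` values, the `OPS` table, and the `identity` function (SHA-256 over an operator and its child identities) are all assumptions made for illustration.

```python
import hashlib
import math

# Hypothetical node model: a data node is a string, a combinator is
# a tuple (operator, child, child, ...).
C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")

ENV = {"a": 1, "b": 2, "c": 3, "d": 4}   # assumed input data
OPS = {"+": sum, "*": math.prod}          # assumed combinator semantics


def identity(node):
    """Assumed identity: SHA-256 over the operator and the child identities."""
    if isinstance(node, str):
        payload = "data:" + node
    else:
        op, *children = node
        payload = "op:" + op + "|" + "|".join(identity(c) for c in children)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}
evaluations = 0  # counts combinator evaluations that missed the cache


def apply(node):
    """Evaluate a node, reusing any result already cached under its identity."""
    global evaluations
    key = identity(node)
    if key in cache:
        return cache[key]
    if isinstance(node, str):
        value = ENV[node]
    else:
        op, *children = node
        value = OPS[op]([apply(c) for c in children])
        evaluations += 1
    cache[key] = value
    return value


result = apply(A)
```

With these inputs the shared node a+b is evaluated once and then served from the cache, so only four combinator evaluations happen for the whole DAG. The sketch recomputes identities on every call; a real system would presumably store them alongside the nodes.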


Page 10: Synthetic Encoding

Interpretations

• Mathematics: embedding a graph into a linear form

• Engineering: flattening a DAG into a string

• Feasibility

• Arbitrarily long strings for graph encodings

• Trade-off and fault tolerance

Page 11: Synthetic Encoding

Canonical Encoding

IRREDUCIBLE COMPUTATION

Page 12: Synthetic Encoding

Canonical Encoding

• A

• ((a+b)*(a+b))*(c+d)*d

• B

• (a+b)*(a+b)

• C

• a+b

• D

• c+d
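The canonical encodings listed above can be reproduced by printing each node as the expression it computes. The following is a minimal sketch under an assumed representation (data nodes as strings, combinators as tuples); the helper names `_expr` and `canonical` are hypothetical. It does not normalize operand order, so a+b and b+a would still encode differently.

```python
# Hypothetical node model: a data node is a string, a combinator is
# a tuple (operator, child, child, ...).
C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")


def _expr(node):
    """Fully parenthesized expression for a node."""
    if isinstance(node, str):
        return node
    op, *children = node
    return "(" + op.join(_expr(c) for c in children) + ")"


def canonical(node):
    """Canonical encoding: the expression with its outermost parentheses dropped."""
    text = _expr(node)
    return text if isinstance(node, str) else text[1:-1]


for name, node in (("A", A), ("B", B), ("C", C), ("D", D)):
    print(name, "=", canonical(node))
```

The encoding grows with the size of the expanded expression, which is what motivates the fixed-length encodings on the following slides.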

[Figure: the example DAG with combinators labeled A (top *), B (*), C and D (the two + nodes). Data nodes a, b, c, d. Legend: Data, Combinator.]

Page 13: Synthetic Encoding

Interpretations

• Mathematics: Isomorphism

• Engineering: Script for the evaluation

Page 14: Synthetic Encoding

Digested Encoding

NAÏVE OPTIMIZATION

Page 15: Synthetic Encoding

Digested Encoding

• A

• md(((a+b)*(a+b))*(c+d)*d)

• B

• md((a+b)*(a+b))

• C

• md(a+b)

• D

• md(c+d)

[Figure: the example DAG with combinators labeled A (top *), B (*), C and D (the two + nodes). Data nodes a, b, c, d. Legend: Data, Combinator.]

where md() is a one-way function, such as MD5 or SHA-1.
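A digested encoding can be sketched as a hash of the canonical expression string. The sketch below assumes the same tuple representation of nodes and swaps in SHA-256 for md(); the slide only asks for some one-way function. The names `canonical` and `digested` are hypothetical.

```python
import hashlib

# Hypothetical node model: a data node is a string, a combinator is
# a tuple (operator, child, child, ...).
C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")


def _expr(node):
    """Fully parenthesized expression for a node."""
    if isinstance(node, str):
        return node
    op, *children = node
    return "(" + op.join(_expr(c) for c in children) + ")"


def canonical(node):
    """Expression string with the outermost parentheses dropped."""
    text = _expr(node)
    return text if isinstance(node, str) else text[1:-1]


def digested(node):
    """Digested encoding: md(canonical expression), with SHA-256 standing in for md()."""
    return hashlib.sha256(canonical(node).encode("utf-8")).hexdigest()


print("A =", digested(A))
```

Every identity now has a fixed length, but computing it for a node still requires building the whole expression beneath that node.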


Page 16: Synthetic Encoding

Interpretations

• Comparison to Canonical Encoding

• Fixed length

• False positive

• Mathematics: Quasi-isomorphism

• Engineering: Digest of the computation

Page 17: Synthetic Encoding

Synthetic Encoding

FINITE INDUCTION WITH RECURSIVE SEMANTICS

Page 18: Synthetic Encoding

Synthetic Encoding

• A

• md([*, B, D, d])

• B

• md([*, C, C])

• C

• md([+, a, b])

• D

• md([+, c, d])

[Figure: the example DAG with combinators labeled A (top *), B (*), C and D (the two + nodes). Data nodes a, b, c, d. Legend: Data, Combinator.]

where md() is a one-way function, such as MD5 or SHA-1.
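The recursive definition above can be sketched directly: a node's identity is the digest of its operator followed by the identities of its children. This is an illustrative sketch, not the author's code. The tuple representation, the JSON serialization of the list, and the use of SHA-256 for md() are assumptions; data nodes stand for themselves, as on the slide.

```python
import hashlib
import json

# Hypothetical node model: a data node is a string, a combinator is
# a tuple (operator, child, child, ...).
C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")


def md(parts):
    """One-way function over a list of strings (SHA-256 over its JSON form)."""
    return hashlib.sha256(json.dumps(parts).encode("utf-8")).hexdigest()


def synthetic(node):
    """Synthetic encoding: md([operator, identity of each child...])."""
    if isinstance(node, str):
        return node  # data nodes appear as themselves, as in md([+, a, b])
    op, *children = node
    return md([op] + [synthetic(c) for c in children])


ids = {name: synthetic(node) for name, node in (("A", A), ("B", B), ("C", C), ("D", D))}
```

Each digest is computed over a short list whose length depends only on the number of children, not on the size of the expression below the node.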


Page 19: Synthetic Encoding

Interpretations

• Comparison to Canonical Encoding

• Fixed length

• False positive

• Comparison to Digested Encoding

• Better Locality

• Propagative Error (false positive)

• Mathematics: Quasi-isomorphism

• Engineering: Digest of the node dependency

Page 20: Synthetic Encoding

Refinement

• Reduce hash collisions by

• Encoding the cardinality of the node

• Encoding the depth of the node

• Using a longer digest, such as SHA-512

• Example

• A = 4-12-md([*, B, D, d])

• B = 3-7-md([*, C, C])

• C = 2-3-md([+, a, b])

• D = 2-3-md([+, c, d])

• Data

• x' = 1-1-md([x])
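The numeric prefixes in the example are consistent with reading them as depth followed by cardinality, where depth is one more than the deepest child and cardinality is one plus the sum of the children's cardinalities (shared nodes counted once per use). That reading is inferred from the numbers, not stated on the slide. A minimal sketch under that assumption, with SHA-256 standing in for md() and a hypothetical `refined` helper:

```python
import hashlib
import json

# Hypothetical node model: a data node is a string, a combinator is
# a tuple (operator, child, child, ...).
C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")


def md(parts):
    """One-way function over a list of strings (SHA-256 over its JSON form)."""
    return hashlib.sha256(json.dumps(parts).encode("utf-8")).hexdigest()


def refined(node):
    """Return (depth, cardinality, 'depth-cardinality-digest') for a node."""
    if isinstance(node, str):
        return 1, 1, "1-1-" + md([node])          # data node: x' = 1-1-md([x])
    op, *children = node
    encoded = [refined(c) for c in children]
    depth = 1 + max(d for d, _, _ in encoded)
    cardinality = 1 + sum(n for _, n, _ in encoded)
    # Assumption: children contribute their refined identities to the digest.
    digest = md([op] + [ident for _, _, ident in encoded])
    return depth, cardinality, f"{depth}-{cardinality}-{digest}"


for name, node in (("A", A), ("B", B), ("C", C), ("D", D)):
    depth, cardinality, ident = refined(node)
    print(name, depth, cardinality, ident[:24] + "...")
```

Under this reading, two nodes can only share an identity if their depth, their cardinality, and their digest all coincide.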


Page 21: Synthetic Encoding

Cryptanalysis

IMPLICATION OF FALSE POSITIVE

Page 22: Synthetic Encoding

Metaphor

• Proton should decay

• half-life > 6.6×10^33 years

• at 90% confidence level

• via antimuon decay (*)

• Can be safely ignored

• For experiments

• For non-GUT theory

• Except GUTs POCs

• Hot topic of the 1980s

• Kamioka/Super-K

• Failed to observe proton decay

Page 23: Synthetic Encoding

Case Study – SHA2 Family

• Unavoidable collisions

• Pigeonhole principle

• Yet no known collisions

• Birthday Attack

• 2^(L/2) evaluations

• 2^128 for SHA-256

• Attack Attempt

• 41-round SHA-256 out of 64 rounds with time complexity of 2^253.5 and space complexity of 2^16

• 42-round SHA-256 with time complexity of 2^251.7 and space complexity of 2^12
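The 2^(L/2) figure for the birthday attack follows from the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(L+1)) for the chance of at least one collision among n random L-bit digests. The sketch below evaluates that approximation; the function name `collision_probability` is hypothetical, and the formula assumes an ideal hash with uniformly distributed outputs.

```python
import math


def collision_probability(samples, bits):
    """Approximate chance of at least one collision among `samples` random `bits`-bit digests."""
    space = 2 ** bits
    exponent = samples * (samples - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), accurate for tiny exponents


# Around 2^128 evaluations the collision chance for a 256-bit digest becomes substantial.
print(collision_probability(2 ** 128, 256))
# At 2^64 evaluations it is still negligible.
print(collision_probability(2 ** 64, 256))
```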


Page 24: Synthetic Encoding

Security of Synthetic Encoding

• Immune to birthday attack

• Data is not arbitrary

• So what?

• Resistant to collisions

• Data is not arbitrary

• Structural dependency

• Cardinality collision

• Depth collision

• High-order digest

• NSA backdoor?

• Attack Surface

• Data nodes

• Cardinality = 1

• Depth = 1

• Possible path

• Select data to pollute

• Generate fake data

• Upload fake data to system

• Trigger computation

• Requirements

• Prescient knowledge

• Access to data ingestion

Page 25: Synthetic Encoding

Application

NOW WHAT?

Page 26: Synthetic Encoding

Advantage

• Minimal representation of computation

• Exact data lineage

• Trivial de-duplication

Page 27: Synthetic Encoding

Big Data

• Data Warehouse

• ETL

• Analytical Computing

• Profiling

• Cache

• In-Memory Computing

• Failover

• Replication


Page 28: Synthetic Encoding

Related Works

REINVENT THE WHEEL?

Page 29: Synthetic Encoding

Research-driven Technology

MICROSOFT RESEARCH

• Dryad

• Nectar

BERKELEY AMPLAB

• Spark

• Tachyon


Page 30: Synthetic Encoding

Statement

• This work was related to previous work since 2005

• This work was influenced by both Dryad and Spark

• But it was done without knowledge of Nectar or Tachyon