55
FP A Framework for Synthesizing Data Profiles Saswat Padhi 1 Prateek Jain 2 Daniel Perelman 3 Oleksandr Polozov 4 Sumit Gulwani 3 Todd Millstein 1 1 University of California, Los Angeles, CA 2 Microso Research Lab, India 3 Microso Corporation, Redmond, WA 4 Microso Research, Redmond, WA (OOPSLA) Contributed during an internship with PROSE team at Microso

FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

FlashProfileA Framework for Synthesizing Data Profiles

Saswat Padhi 1 † Prateek Jain 2 Daniel Perelman 3

Oleksandr Polozov 4 Sumit Gulwani 3 Todd Millstein 1

1 University of California, Los Angeles, CA2 Microso� Research Lab, India

3 Microso� Corporation, Redmond, WA4 Microso� Research, Redmond, WA

(OOPSLA)

† Contributed during an internship with PROSE team at Microso�

Page 2: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

The Challenges of “Big” Data

High Volume> 2.5 M TB of data generated every day!

High Velocity∼ 4 M Google searches, ∼ 1/2 M tweets,> 1 K Amazon shipments ... per minute!

High Variety90 % of generated data is unstructured!Data may be incomplete, inconsistent,may contain multiple formats ...

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 2 / 15

Page 3: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

The Challenges of “Big” Data

High Volume> 2.5 M TB of data generated every day!

High Velocity∼ 4 M Google searches, ∼ 1/2 M tweets,> 1 K Amazon shipments ... per minute!

High Variety90 % of generated data is unstructured!Data may be incomplete, inconsistent,may contain multiple formats ...

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 2 / 15

Page 4: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

State of The Art

Reference ID

PMC5079771

doi: 10.1016/S1387-

7003(03)00113-8

ISBN: 2-287-34069-6...

.

.

....

ISBN: 0-006-08903-1

PMC9473786...

.

.

....

doi:

10.13039/100005795

ISBN: 1-158-23466-X

not_available

PMC9035311

Microsoft SSDT: (does not describe all data formats)I doi: +10\.\d\d\d\d\d/\d+ (110)I .∗ (113)I ISBN: 0-\d\d\d-\d\d\d\d\d-\d (204)I PMC\d+ (1024)

Ataccama One: (coarse grained, no constants & fixed-width pa�erns)I W_W (5)I W: N.N/LN-N(N)N-D (11)I W: D-N-N-L (34)I W: N.N/N (110)I W: D-N-N-D (267)I WN (1024)

FlashProfile:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)I ‘PMC’ D7 (1024)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 3 / 15

Page 5: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

State of The Art

Reference ID

PMC5079771

doi: 10.1016/S1387-

7003(03)00113-8

ISBN: 2-287-34069-6...

.

.

....

ISBN: 0-006-08903-1

PMC9473786...

.

.

....

doi:

10.13039/100005795

ISBN: 1-158-23466-X

not_available

PMC9035311

Microsoft SSDT: (does not describe all data formats)I doi: +10\.\d\d\d\d\d/\d+ (110)I .∗ (113)I ISBN: 0-\d\d\d-\d\d\d\d\d-\d (204)I PMC\d+ (1024)

Ataccama One: (coarse grained, no constants & fixed-width pa�erns)I W_W (5)I W: N.N/LN-N(N)N-D (11)I W: D-N-N-L (34)I W: N.N/N (110)I W: D-N-N-D (267)I WN (1024)

FlashProfile:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)I ‘PMC’ D7 (1024)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 3 / 15

Page 6: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

State of The Art

Reference ID

PMC5079771

doi: 10.1016/S1387-

7003(03)00113-8

ISBN: 2-287-34069-6...

.

.

....

ISBN: 0-006-08903-1

PMC9473786...

.

.

....

doi:

10.13039/100005795

ISBN: 1-158-23466-X

not_available

PMC9035311

Microsoft SSDT: (does not describe all data formats)I doi: +10\.\d\d\d\d\d/\d+ (110)I .∗ (113)I ISBN: 0-\d\d\d-\d\d\d\d\d-\d (204)I PMC\d+ (1024)

Ataccama One: (coarse grained, no constants & fixed-width pa�erns)I W_W (5)I W: N.N/LN-N(N)N-D (11)I W: D-N-N-L (34)I W: N.N/N (110)I W: D-N-N-D (267)I WN (1024)

FlashProfile:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)I ‘PMC’ D7 (1024)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 3 / 15

Page 7: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

State of The Art

Reference ID

PMC5079771

doi: 10.1016/S1387-

7003(03)00113-8

ISBN: 2-287-34069-6...

.

.

....

ISBN: 0-006-08903-1

PMC9473786...

.

.

....

doi:

10.13039/100005795

ISBN: 1-158-23466-X

not_available

PMC9035311

Microsoft SSDT: (does not describe all data formats)I doi: +10\.\d\d\d\d\d/\d+ (110)I .∗ (113)I ISBN: 0-\d\d\d-\d\d\d\d\d-\d (204)I PMC\d+ (1024)

Ataccama One: (coarse grained, no constants & fixed-width pa�erns)I W_W (5)I W: N.N/LN-N(N)N-D (11)I W: D-N-N-L (34)I W: N.N/N (110)I W: D-N-N-D (267)I WN (1024)

FlashProfile:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)I ‘PMC’ D7 (1024)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 3 / 15

Page 8: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Can We Do Even Be�er?

Reference ID

PMC5079771

doi: 10.1016/S1387-

7003(03)00113-8

ISBN: 2-287-34069-6...

.

.

....

ISBN: 0-006-08903-1

PMC9473786...

.

.

....

doi:

10.13039/100005795

ISBN: 1-158-23466-X

not_available

PMC9035311

Allowing domain-experts to profile with custom pa�erns:I ‘not_available’ (5)I ‘doi:’ + DOI (121)I ‘ISBN:’ ISBN10 (301)I ‘PMC’ D7 (1024)

Interactive refinement to gradually drill into data:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ ISBN10 (301)I ‘PMC’ D7 (1024)

Default profile from FlashProfile:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)I ‘PMC’ D7 (1024)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 4 / 15

Page 9: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Can We Do Even Be�er?

Reference ID

PMC5079771

doi: 10.1016/S1387-

7003(03)00113-8

ISBN: 2-287-34069-6...

.

.

....

ISBN: 0-006-08903-1

PMC9473786...

.

.

....

doi:

10.13039/100005795

ISBN: 1-158-23466-X

not_available

PMC9035311

Allowing domain-experts to profile with custom pa�erns:I ‘not_available’ (5)I ‘doi:’ + DOI (121)I ‘ISBN:’ ISBN10 (301)I ‘PMC’ D7 (1024)

Interactive refinement to gradually drill into data:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ ISBN10 (301)I ‘PMC’ D7 (1024)

Default profile from FlashProfile:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)I ‘PMC’ D7 (1024)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 4 / 15

Page 10: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Can We Do Even Be�er?

Reference ID

PMC5079771

doi: 10.1016/S1387-

7003(03)00113-8

ISBN: 2-287-34069-6...

.

.

....

ISBN: 0-006-08903-1

PMC9473786...

.

.

....

doi:

10.13039/100005795

ISBN: 1-158-23466-X

not_available

PMC9035311

Allowing domain-experts to profile with custom pa�erns:I ‘not_available’ (5)I ‘doi:’ + DOI (121)I ‘ISBN:’ ISBN10 (301)I ‘PMC’ D7 (1024)

Interactive refinement to gradually drill into data:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ ISBN10 (301)I ‘PMC’ D7 (1024)

Default profile from FlashProfile:I ‘not_available’ (5)I ‘doi:’ + ‘10.1016/’ U D4 ‘-’ D4 ‘(’ D2 ‘)’ D5 ‘-’ D (11)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-X’ (34)I ‘doi:’ + ‘10.13039/’ D+ (110)I ‘ISBN:’ D ‘-’ D3 ‘-’ D5 ‘-’ D (267)I ‘PMC’ D7 (1024)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 4 / 15

Page 11: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 12: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 13: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 14: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 15: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 16: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 17: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 18: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 19: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 20: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Key Challenges

The space of profiles is large and inherently ambiguous

Should { ‘1817’, ‘1813?’ } be generalized to a pa�ern, or { ‘1817’, ‘1907’ }?

I prior tools have a fixed bias for { ‘1817’, ‘1907’ }

We allow users to disambiguate, by defining custom domain-specific pa�erns

I for example, a user may define a pa�ern 1800s (= the regex 18.*)

I Exponentially many ways of partitioning a given set of strings

I Exponentially many ways of generalizing strings to a pa�ern

. Clustering, with similarity ≈ Pa�ern score

. E�icient synthesis of complex pa�erns

Inductive program synthesis to the rescue!

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 5 / 15

Page 21: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Main Contributions

An application of a supervised learning technique (inductive programsynthesis) to the unsupervised learning problem of syntactic profiling.

We present:

I a definition for syntactic profiling as a pa�ern-aware clustering problem

I a technique using inductive program synthesis

I practical optimizations for fast, approximate profiling

I FlashProfile, and evaluation of its performance and accuracy

I profile-guided interaction for traditional PBE workflows

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 6 / 15

Page 22: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Main Contributions

An application of a supervised learning technique (inductive programsynthesis) to the unsupervised learning problem of syntactic profiling.

We present:

I a definition for syntactic profiling as a pa�ern-aware clustering problem

I a technique using inductive program synthesis

I practical optimizations for fast, approximate profiling

I FlashProfile, and evaluation of its performance and accuracy

I profile-guided interaction for traditional PBE workflows

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 6 / 15

Page 23: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Main Contributions

An application of a supervised learning technique (inductive programsynthesis) to the unsupervised learning problem of syntactic profiling.

We present:

I a definition for syntactic profiling as a pa�ern-aware clustering problem

I a technique using inductive program synthesis

I practical optimizations for fast, approximate profiling

I FlashProfile, and evaluation of its performance and accuracy

I profile-guided interaction for traditional PBE workflows

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 6 / 15

Page 24: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Main Contributions

An application of a supervised learning technique (inductive programsynthesis) to the unsupervised learning problem of syntactic profiling.

We present:

I a definition for syntactic profiling as a pa�ern-aware clustering problem

I a technique using inductive program synthesis

I practical optimizations for fast, approximate profiling

I FlashProfile, and evaluation of its performance and accuracy

I profile-guided interaction for traditional PBE workflows

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 6 / 15

Page 25: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Main Contributions

An application of a supervised learning technique (inductive programsynthesis) to the unsupervised learning problem of syntactic profiling.

We present:

I a definition for syntactic profiling as a pa�ern-aware clustering problem

I a technique using inductive program synthesis

I practical optimizations for fast, approximate profiling

I FlashProfile, and evaluation of its performance and accuracy

I profile-guided interaction for traditional PBE workflows

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 6 / 15

Page 26: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Main Contributions

An application of a supervised learning technique (inductive programsynthesis) to the unsupervised learning problem of syntactic profiling.

We present:

I a definition for syntactic profiling as a pa�ern-aware clustering problem

I a technique using inductive program synthesis

I practical optimizations for fast, approximate profiling

I FlashProfile, and evaluation of its performance and accuracy

I profile-guided interaction for traditional PBE workflows

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 6 / 15

Page 27: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Overview of FlashProfile

ApproximationParameters

Custom Pa�erns

Number ofPa�erns

Pa�ernSynthesizer

ClusteringProcedure

Profile

Dataset

FlashProfile provides:

I Support for user-defined pa�ernsI Support for arbitrary constants and fixed-width pa�ernsI Interactive refinement of profilesI Control over accuracy vs. performance trade-o�

FlashProfile is publicly-available as a cross-platform C# library (Matching.Text),as part of the Microso� PROSE SDK.

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 7 / 15

Page 28: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Overview of FlashProfile

ApproximationParameters

Custom Pa�erns

Number ofPa�erns

Pa�ernSynthesizer

ClusteringProcedure

Profile

Dataset

FlashProfile provides:I Support for user-defined pa�erns

I Support for arbitrary constants and fixed-width pa�ernsI Interactive refinement of profilesI Control over accuracy vs. performance trade-o�

FlashProfile is publicly-available as a cross-platform C# library (Matching.Text),as part of the Microso� PROSE SDK.

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 7 / 15

Page 29: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Overview of FlashProfile

ApproximationParameters

Custom Pa�erns

Number ofPa�erns

Pa�ernSynthesizer

ClusteringProcedure

Profile

Dataset

FlashProfile provides:I Support for user-defined pa�ernsI Support for arbitrary constants and fixed-width pa�erns

I Interactive refinement of profilesI Control over accuracy vs. performance trade-o�

FlashProfile is publicly-available as a cross-platform C# library (Matching.Text),as part of the Microso� PROSE SDK.

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 7 / 15

Page 30: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Overview of FlashProfile

ApproximationParameters

Custom Pa�erns

Number ofPa�erns

Pa�ernSynthesizer

ClusteringProcedure

Profile

Dataset

FlashProfile provides:I Support for user-defined pa�ernsI Support for arbitrary constants and fixed-width pa�ernsI Interactive refinement of profiles

I Control over accuracy vs. performance trade-o�

FlashProfile is publicly-available as a cross-platform C# library (Matching.Text),as part of the Microso� PROSE SDK.

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 7 / 15

Page 31: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Overview of FlashProfile

ApproximationParameters

Custom Pa�erns

Number ofPa�erns

Pa�ernSynthesizer

ClusteringProcedure

Profile

Dataset

FlashProfile provides:I Support for user-defined pa�ernsI Support for arbitrary constants and fixed-width pa�ernsI Interactive refinement of profilesI Control over accuracy vs. performance trade-o�

FlashProfile is publicly-available as a cross-platform C# library (Matching.Text),as part of the Microso� PROSE SDK.

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 7 / 15

Page 32: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Overview of FlashProfile

ApproximationParameters

Custom Pa�erns

Number ofPa�erns

Pa�ernSynthesizer

ClusteringProcedure

Profile

Dataset

FlashProfile provides:I Support for user-defined pa�ernsI Support for arbitrary constants and fixed-width pa�ernsI Interactive refinement of profilesI Control over accuracy vs. performance trade-o�

FlashProfile is publicly-available as a cross-platform C# library (Matching.Text),as part of the Microso� PROSE SDK.

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 7 / 15

Page 33: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profiling via Clustering

I Pa�ern-Aware Partitioning. Clustering: Agglomerative hierarchical clustering. Objective: Minimize the cost of describing partitions. Similarity: Minimum cost of describing 2 strings

I Dependencies. A pa�ern learner L. A cost function C

I Optimizations. Approximate similarity using previous pa�erns. Profiling small chunks→ Full profile

.+

‘1’ .+

‘1’ D3

‘18’ D2

1813 · · · 1898

‘190’ D

1900 · · · 1903

‘18’ D2 ‘?’

1850? · · · 1875?

‘?’

?

Empty

ϵ

. .. .... . . . .

. .... . . . .

. .... . .

Suggested

Refined

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 8 / 15

Page 34: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profiling via Clustering

I Pa�ern-Aware Partitioning. Clustering: Agglomerative hierarchical clustering. Objective: Minimize the cost of describing partitions. Similarity: Minimum cost of describing 2 strings

I Dependencies. A pa�ern learner L. A cost function C

I Optimizations. Approximate similarity using previous pa�erns. Profiling small chunks→ Full profile

.+

‘1’ .+

‘1’ D3

‘18’ D2

1813 · · · 1898

‘190’ D

1900 · · · 1903

‘18’ D2 ‘?’

1850? · · · 1875?

‘?’

?

Empty

ϵ

. .. .... . . . .

. .... . . . .

. .... . .

Suggested

Refined

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 8 / 15

Page 35: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profiling via Clustering

I Pa�ern-Aware Partitioning. Clustering: Agglomerative hierarchical clustering. Objective: Minimize the cost of describing partitions. Similarity: Minimum cost of describing 2 strings

I Dependencies. A pa�ern learner L. A cost function C

I Optimizations. Approximate similarity using previous pa�erns. Profiling small chunks→ Full profile

.+

‘1’ .+

‘1’ D3

‘18’ D2

1813 · · · 1898

‘190’ D

1900 · · · 1903

‘18’ D2 ‘?’

1850? · · · 1875?

‘?’

?

Empty

ϵ

. .. .... . . . .

. .... . . . .

. .... . .

Suggested

Refined

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 8 / 15

Page 36: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profiling via Clustering

I Pa�ern-Aware Partitioning. Clustering: Agglomerative hierarchical clustering. Objective: Minimize the cost of describing partitions. Similarity: Minimum cost of describing 2 strings

I Dependencies. A pa�ern learner L. A cost function C

I Optimizations. Approximate similarity using previous pa�erns. Profiling small chunks→ Full profile

.+

‘1’ .+

‘1’ D3

‘18’ D2

1813 · · · 1898

‘190’ D

1900 · · · 1903

‘18’ D2 ‘?’

1850? · · · 1875?

‘?’

?

Empty

ϵ

. .. .... . . . .

. .... . . . .

. .... . .

Suggested

Refined

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 8 / 15

Page 37: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Pa�ern Synthesis

I A Language LFP:Pa�ern P [s] := Empty(s)

| P[SuffixAfter(s, α)

]Atom α := Classnc | RegExr

| Functf | Consts

I A Pa�ern Learner LFP

. recursively reduces a synthesis problem

. sound and complete over a given set of atoms

I A Cost Function CFP

. tradeo� between specificity and simplicity

. weighted sum of costs of individual atoms

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 9 / 15

Page 38: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Pa�ern Synthesis

I A Language LFP:Pa�ern P [s] := Empty(s)

| P[SuffixAfter(s, α)

]Atom α := Classnc | RegExr

| Functf | Consts

I A Pa�ern Learner LFP

. recursively reduces a synthesis problem

. sound and complete over a given set of atoms

I A Cost Function CFP

. tradeo� between specificity and simplicity

. weighted sum of costs of individual atoms

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 9 / 15

Page 39: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Pa�ern Synthesis

I A Language LFP:Pa�ern P [s] := Empty(s)

| P[SuffixAfter(s, α)

]Atom α := Classnc | RegExr

| Functf | Consts

I A Pa�ern Learner LFP

. recursively reduces a synthesis problem

. sound and complete over a given set of atoms

I A Cost Function CFP

. tradeo� between specificity and simplicity

. weighted sum of costs of individual atoms

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 9 / 15

Page 40: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Pa�ern Synthesis

I A Language LFP:Pa�ern P [s] := Empty(s)

| P[SuffixAfter(s, α)

]Atom α := Classnc | RegExr

| Functf | Consts

I A Pa�ern Learner LFP

. recursively reduces a synthesis problem

. sound and complete over a given set of atoms

I A Cost Function CFP

. tradeo� between specificity and simplicity

. weighted sum of costs of individual atoms

(see our paper for details)

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 9 / 15

Page 41: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profile-Guided Interaction for PBE

Traditional PBE Interaction Profile-Guided Interaction

Users typically provide their desired

outputs sequentially

System proactively requests outputs for

syntactically discrepant inputs

Birthdays Years8/20 ’921986 June 073/24 ’881994 November 23

.

.

.

13-08-83

4/21 ’7924-11-91

Birthdays Years8/20 ’92

1986 June 07

3/24 ’88

1994 November 23...

13-08-83

4/21 ’79

24-11-91

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 10 / 15

Page 42: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profile-Guided Interaction for PBE

Traditional PBE Interaction Profile-Guided Interaction

Users typically provide their desired

outputs sequentially

System proactively requests outputs for

syntactically discrepant inputs

Birthdays Years8/20 ’92 1992 1

1986 June 07 1986 2

3/24 ’88 1988 3

1994 November 23 1994...

13-08-83 1983 4

4/21 ’79 197924-11-91 1991

Birthdays Years8/20 ’92

1986 June 07

3/24 ’88

1994 November 23...

13-08-83

4/21 ’79

24-11-91

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 10 / 15

Page 43: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profile-Guided Interaction for PBE

Traditional PBE Interaction Profile-Guided Interaction

Users typically provide their desired

outputs sequentially

System proactively requests outputs for

syntactically discrepant inputs

Birthdays Years8/20 ’92 1992 1

1986 June 07 1986 2

3/24 ’88 1988 3

1994 November 23 1994...

13-08-83 1983 4

4/21 ’79 197924-11-91 1991

Birthdays Years8/20 ’92

1986 June 07

3/24 ’88

1994 November 23...

13-08-83

4/21 ’79

24-11-91

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 10 / 15

Page 44: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profile-Guided Interaction for PBE

Traditional PBE Interaction Profile-Guided Interaction

Users typically provide their desired

outputs sequentially

System proactively requests outputs for

syntactically discrepant inputs

Birthdays Years8/20 ’92 1992 1

1986 June 07 1986 2

3/24 ’88 1988 3

1994 November 23 1994...

13-08-83 1983 4

4/21 ’79 197924-11-91 1991

Birthdays Years8/20 ’92

1986 June 07

3/24 ’88

1994 November 23...

13-08-83

4/21 ’79

24-11-91

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 10 / 15

Page 45: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profile-Guided Interaction for PBE

Traditional PBE Interaction Profile-Guided Interaction

Users typically provide their desired

outputs sequentially

System proactively requests outputs for

syntactically discrepant inputs

Birthdays Years8/20 ’92 1992 1

1986 June 07 1986 2

3/24 ’88 1988 3

1994 November 23 1994...

13-08-83 1983 4

4/21 ’79 197924-11-91 1991

Birthdays Years8/20 ’92 1992 3

1986 June 07 1986 2

3/24 ’88 1988

1994 November 23 1994...

13-08-83 1983 1

4/21 ’79 1979

24-11-91 1991

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 10 / 15

Page 46: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Profile-Guided Interaction for PBE

Traditional PBE Interaction Profile-Guided Interaction

Users typically provide their desired

outputs sequentially

System proactively requests outputs for

syntactically discrepant inputs

Birthdays Years8/20 ’92 1992 1

1986 June 07 1986 2

3/24 ’88 1988 3

1994 November 23 1994...

13-08-83 1983 4

4/21 ’79 197924-11-91 1991

Birthdays Years8/20 ’92 1992 3

1986 June 07 1986 2

3/24 ’88 1988

1994 November 23 1994...

13-08-83 1983 1

4/21 ’79 1979

24-11-91 1991

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 10 / 15

Page 47: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

�ality of Generated Profiles

20% Profile P̃Dataset S

80%Dataset S

80%Randomly sampled

from other datasets

P̃tn

fp

tp

fn

}Quality(P̃) = F1 Score

�ality of profiles generated by FlashProfile

Ataccama One Microso� SSDT

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 11 / 15

Page 48: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

�ality of Generated Profiles

20% Profile P̃Dataset S

80%Dataset S

80%Randomly sampled

from other datasets

P̃tn

fp

tp

fn

}Quality(P̃) = F1 Score

�ality of profiles generated by FlashProfile

Ataccama One Microso� SSDT

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 11 / 15

Page 49: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

End-to-End Profiling Performance

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 12 / 15

Page 50: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Related WorkI Microso� SQL Server Data Tools (SSDT) [ https://docs.microsoft.com/en-us/sql/ssdt ]

. Recognizes constants and fixed-width atoms.

. Not extensible. No refinement. Profiles are sometimes not comprehensive.

I Ataccama One [ https://one.ataccama.com/ ]

. Comprehensive profiles. Recognizes fixed-width atoms.

. A small fixed set of atoms. No refinement. Does not recognize constants.

I Trifacta Wrangler [ https://cloud.trifacta.com ]

. Recognizes fixed-width atoms. Generates readable profiles.

. Not extensible. No refinement. Does not recognize constants.

I Google OpenRefine [ http://openrefine.org/ ]

. No pa�erns, only clusters based on character-wise similarity.

I Potter’s Wheel [ Vijayshankar Raman and Joseph M. Hellerstein. VLDB 2001 ]

. Extensible set of atoms.

. Only learns the most-frequent pa�ern and shows outliers, not a profile.

I LearnPADS++ [ Kathleen Fisher et al. SIGMOD 2008 ; Kenny Q. Zhu et al. PADL 2012 ]

. Not extensible. No refinement. Generates C-style structures.

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 13 / 15

Page 51: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Conclusion & Future Work

A novel composition of hierarchical clustering and program synthesistechniques for e�icient pa�ern-based data profiling

Future Work:

I Automatically selecting costs for atoms. Machine-learnt costs to maximize the quality of profiles

I Alternate approximation strategies with be�er guarantees. Compute the overall “goodness” of an approximation and refine if needed

I Identify and classify semantic entities as well. For example, combine with named-entity recognition (NER) techniques

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 14 / 15

Page 52: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Conclusion & Future Work

A novel composition of hierarchical clustering and program synthesistechniques for e�icient pa�ern-based data profiling

Future Work:

I Automatically selecting costs for atoms. Machine-learnt costs to maximize the quality of profiles

I Alternate approximation strategies with be�er guarantees. Compute the overall “goodness” of an approximation and refine if needed

I Identify and classify semantic entities as well. For example, combine with named-entity recognition (NER) techniques

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 14 / 15

Page 53: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Conclusion & Future Work

A novel composition of hierarchical clustering and program synthesistechniques for e�icient pa�ern-based data profiling

Future Work:

I Automatically selecting costs for atoms. Machine-learnt costs to maximize the quality of profiles

I Alternate approximation strategies with be�er guarantees. Compute the overall “goodness” of an approximation and refine if needed

I Identify and classify semantic entities as well. For example, combine with named-entity recognition (NER) techniques

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 14 / 15

Page 54: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Conclusion & Future Work

A novel composition of hierarchical clustering and program synthesistechniques for e�icient pa�ern-based data profiling

Future Work:

I Automatically selecting costs for atoms. Machine-learnt costs to maximize the quality of profiles

I Alternate approximation strategies with be�er guarantees. Compute the overall “goodness” of an approximation and refine if needed

I Identify and classify semantic entities as well. For example, combine with named-entity recognition (NER) techniques

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 14 / 15

Page 55: FlashProfile - A Framework for Synthesizing Data Profiles · FlashProfile A Framework for Synthesizing Data Profiles Saswat Padhi1† Prateek Jain2 Daniel Perelman3 Oleksandr Polozov4

Publicly-Available Artifacts

I The Matching.Text NuGet package:https://www.nuget.org/packages/Microsoft.ProgramSynthesis.Matching.Text/

I Documentation for Matching.Text library:https://microsoft.github.io/prose/documentation/matching-text/intro/

I OOPSLA artifacts (a C# app showing Matching.Text API usage):https://github.com/SaswatPadhi/FlashProfileDemo

I Contact: [email protected]

S. Padhi et al. FlashProfile SPLASH 2018 (OOPSLA) • 15 / 15