From Dirt to Shovels: Automatic Tool Generation from Ad Hoc Data Kenny Zhu Princeton University with Kathleen Fisher, David Walker and Peter White


A System Admin’s Life

Web Server Logs…
System Logs…
Application Configs…
User Emails
Script Outputs and more…

Automatically Generate Tools from Data!

XML converter, data profiler, grapher, etc.

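Architecture

[Diagram: system architecture, labeled LearnPADS. Components shown: Raw Data; Format Inference (Tokenization, Structure Discovery, Format Refinement, Scoring Function); Data Description; PADS Compiler; generated tools (Profiler producing an Analysis Report, XML converter producing XML).]

Format inference stages:
Tokenization
Structure Discovery
Format Refinement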

Simple End-to-End

Data sources:

    “0, 24”
    “foo, 16”
    “bar, end”

Description:

    Punion payload {
      Pint32 i;
      PstringFW(3) s2;
    };

    Pstruct source {
      '\"'; payload p1; ","; payload p2; '\"';
    }

XML output:

    <source>
      <payload> <int><val>0</val></int> </payload>
      <payload> <int><val>24</val></int> </payload>
    </source>
    <source>
      <payload> <string><val>bar</val></string> </payload>
      <payload> <string><val>end</val></string> </payload>
    </source>

Tokenization

Parse strings; convert to symbolic tokens
Basic token set skewed towards systems data
► Int, string, date, time, URLs, hostnames …
A config file allows users to define their own new token types via regular expressions

“0, 24”      --tokenize-->  “ INT , INT ”
“foo, 16”    --tokenize-->  “ STR , INT ”
“bar, end”   --tokenize-->  “ STR , STR ”
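The tokenization step above can be sketched in a few lines of Python. This is a minimal illustration, not the LearnPADS tokenizer: the token table TOKEN_SPECS, its two base types, and the function name tokenize are assumptions made for the example. Each entry is a (name, regular expression) pair, which mirrors the idea of user-defined token types in a config file.

```python
import re

# Hypothetical token table: (name, regex). Earlier entries take priority.
# A user-supplied config could append further pairs (dates, URLs, ...).
TOKEN_SPECS = [
    ("INT", r"-?\d+"),
    ("STR", r"[A-Za-z]+"),
    ("WS", r"\s+"),      # whitespace is dropped
    ("PUNCT", r"."),     # any other single character stands for itself
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))


def tokenize(line):
    """Convert one data source (a line of text) into symbolic tokens."""
    tokens = []
    for match in MASTER.finditer(line):
        kind = match.lastgroup
        if kind == "WS":
            continue
        # Punctuation keeps its literal text; other tokens keep only their type.
        tokens.append(match.group() if kind == "PUNCT" else kind)
    return tokens


for source in ['"0, 24"', '"foo, 16"', '"bar, end"']:
    print(tokenize(source))
# ['"', 'INT', ',', 'INT', '"']
# ['"', 'STR', ',', 'INT', '"']
# ['"', 'STR', ',', 'STR', '"']
```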

Structure Discovery: Overview

Top-down, divide-and-conquer algorithm:
► Compute various statistics from tokenized data
► Guess a top-level description
► Partition tokenized data into smaller chunks
► Recursively analyze and compute descriptions from smaller chunks



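[Figure: structure discovery on the tokenized sources “ INT , INT ”, “ STR , INT ”, “ STR , STR ”. First step: the candidate structure so far is a struct of quote, unknown (?), comma, unknown (?), quote; the sources are partitioned into the chunks INT / STR / STR and INT / INT / STR for the two unknown fields. Later step: each unknown field becomes a union whose branches (INT, STR) are discovered recursively.]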

Structure Discovery: Details

Compute frequency distribution histogram for each token.
(And recompute at every level of recursion.)

“ INT , INT ”
“ STR , INT ”
“ STR , STR ”

[Figure: one histogram per token (Quote, Comma, Integer, String). x-axis: number of occurrences per source (1, 2); y-axis: percentage of sources (0 to 100).]
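A per-token frequency histogram can be computed as in the sketch below. It is an illustration under stated assumptions: the function name token_histograms is hypothetical, each source is a list of tokens, and the zero-occurrence column is kept in the histogram (whether the authors' implementation does the same is not stated on the slide).

```python
from collections import Counter


def token_histograms(sources):
    """Map each token to {occurrences per source: fraction of sources}."""
    tokens = {tok for src in sources for tok in src}
    hists = {}
    for tok in sorted(tokens):
        # How many sources contain this token 0 times, 1 time, 2 times, ...
        counts = Counter(src.count(tok) for src in sources)
        hists[tok] = {k: v / len(sources) for k, v in sorted(counts.items())}
    return hists


sources = [
    ['"', "INT", ",", "INT", '"'],
    ['"', "STR", ",", "INT", '"'],
    ['"', "STR", ",", "STR", '"'],
]
hists = token_histograms(sources)
for tok, hist in hists.items():
    print(tok, hist)
```

On the three example sources, the quote appears exactly twice in every source and the comma exactly once, while INT and STR are each spread evenly over zero, one and two occurrences.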


Cluster tokens with similar histograms into groups
Similar histograms
► tokens with strong regularity coexist in same description component
► use symmetric relative entropy to measure similarity
Only the “shape” of the histogram matters
► normalize histograms by sorting columns in descending size
► result: comma & quote in one group, int & string in another

[Figure: token histograms; x-axis: number of occurrences per source (1, 2); y-axis: percentage of sources (0 to 100).]
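The clustering step can be sketched as follows. This is a sketch, not the authors' code: it assumes one common symmetric form of relative entropy (each histogram is compared against the midpoint of the two), a greedy grouping strategy, and a hypothetical threshold of 0.01 bits. The exact formula and thresholds are in the paper.

```python
import math


def normalize(hist, width):
    """Keep only the shape: sort columns in descending size and pad with zeros."""
    cols = sorted(hist.values(), reverse=True)
    cols += [0.0] * (width - len(cols))
    total = sum(cols)
    return [c / total for c in cols]


def relative_entropy(p, q):
    """Relative entropy of p with respect to q, in bits (q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_relative_entropy(p, q):
    """Average of the relative entropies of p and q against their midpoint."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * relative_entropy(p, mid) + 0.5 * relative_entropy(q, mid)


def cluster(hists, threshold=0.01):
    """Greedily group tokens whose normalized histograms are close."""
    width = max(len(h) for h in hists.values())
    norm = {tok: normalize(h, width) for tok, h in hists.items()}
    groups = []
    for tok, vec in norm.items():
        for group in groups:
            if symmetric_relative_entropy(vec, norm[group[0]]) < threshold:
                group.append(tok)
                break
        else:
            groups.append([tok])
    return groups


# Histograms for the running example: {occurrences per source: fraction of sources}.
hists = {
    '"': {2: 1.0},
    ",": {1: 1.0},
    "INT": {0: 1 / 3, 1: 1 / 3, 2: 1 / 3},
    "STR": {0: 1 / 3, 1: 1 / 3, 2: 1 / 3},
}
print(cluster(hists))  # [['"', ','], ['INT', 'STR']]
```

Sorting the columns before comparing is what makes the quote (always two occurrences) and the comma (always one) look identical: both reduce to a single full-height column.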


Classify the groups into:
► Structs == Groups with high coverage & low “residual mass”
► Arrays == Groups with high coverage, sufficient width & high “residual mass”
► Unions == Other token groups

Pick group with strongest signal to divide and conquer

More mathematical details in the paper

Struct involving comma, quote identified in histogram above

Overall procedure gives good starting point for refinement

[Figure: token histograms; x-axis: number of occurrences per source (1, 2); y-axis: percentage of sources (0 to 100).]
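The classification rules above might look like the sketch below. Everything numeric here is an assumption: the thresholds MIN_COVERAGE, MAX_MASS and MIN_WIDTH are hypothetical, and the definitions of coverage, residual mass and width are plausible readings of the slide rather than the paper's exact formulas.

```python
# Hypothetical thresholds; the values used by the authors are in the paper.
MIN_COVERAGE = 0.9
MAX_MASS = 0.1
MIN_WIDTH = 3


def coverage(hist):
    """Fraction of sources in which the token occurs at least once."""
    return sum(frac for count, frac in hist.items() if count > 0)


def residual_mass(hist):
    """Mass lying outside the tallest column (assumed definition)."""
    return 1.0 - max(hist.values())


def width(hist):
    """Number of distinct non-zero occurrence counts (assumed definition)."""
    return sum(1 for count, frac in hist.items() if count > 0 and frac > 0)


def classify(group):
    """Classify a cluster, given the histograms of its tokens."""
    if all(coverage(h) >= MIN_COVERAGE for h in group):
        if all(residual_mass(h) <= MAX_MASS for h in group):
            return "struct"
        if all(width(h) >= MIN_WIDTH for h in group):
            return "array"
    return "union"


quote_comma = [{2: 1.0}, {1: 1.0}]
int_str = [{0: 1 / 3, 1: 1 / 3, 2: 1 / 3}, {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}]
print(classify(quote_comma))  # struct
print(classify(int_str))      # union
```

With these assumed definitions, the quote and comma group is classified as a struct, which agrees with the slide, and the INT and STR group falls through to a union because its tokens are missing from a third of the sources.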

Format Refinement

Reanalyze source data with aid of rough description and obtain functional dependencies and constraints

Rewrite format description to:
simplify presentation
► merge & rewrite structures
improve precision
► add constraints (uniqueness, ranges, functional dependencies)
fill in missing details
► find completions where structure discovery bottoms out
► refine base types (integer sizes, array sizes, separators and terminators)

Rewriting is guided by local search that optimizes an information-theoretic score (more details in the paper)

Refinement: Simple Example

Data sources: “0, 24”, “foo, beg”, “bar, end”, “0, 56”, “baz, middle”, “0, 12”, “0, 33”, …

[Figure: refinement pipeline, flattened from the tree diagrams on the slide.]

structure discovery:
    struct { “ ; union { int ; str } ; , ; union { int ; str } ; ” }

tagging/table gen:
    struct { “ ; union (id1) { int (id3) ; str (id4) } ; , ; union (id2) { int (id5) ; str (id6) } ; ” }

    id1  id2  id3  id4  id5  id6
    1    1    0    ---  24   ---
    2    2    ---  foo  ---  beg
    ...

constraint inference:
    Constraints: id3 = 0, id1 = id2

rule-based structure rewriting:
    struct { “ ; union { struct { 0 ; , ; int } ; struct { str ; , ; str } } ; ” }

Greater accuracy: first int is 0; no “int, str”
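The constraint inference step in this example can be sketched as a scan over the tagged table. This is a simplified illustration, assuming only two kinds of constraint (constant columns and pairwise equal columns); the function name infer_constraints is hypothetical. The rows are derived from the seven data sources listed on the slide, with 1 and 2 as the tags of the int and str union branches and None for an absent field.

```python
from itertools import combinations


def infer_constraints(columns, rows):
    """Find constant columns and pairs of columns that always agree."""
    constraints = []
    for i, name in enumerate(columns):
        # A column is constant if it holds one value wherever it is present.
        values = {row[i] for row in rows if row[i] is not None}
        if len(values) == 1:
            constraints.append(f"{name} = {values.pop()}")
    for i, j in combinations(range(len(columns)), 2):
        if all(row[i] == row[j] for row in rows):
            constraints.append(f"{columns[i]} = {columns[j]}")
    return constraints


columns = ["id1", "id2", "id3", "id4", "id5", "id6"]
rows = [
    (1, 1, 0, None, 24, None),            # "0, 24"
    (2, 2, None, "foo", None, "beg"),     # "foo, beg"
    (2, 2, None, "bar", None, "end"),     # "bar, end"
    (1, 1, 0, None, 56, None),            # "0, 56"
    (2, 2, None, "baz", None, "middle"),  # "baz, middle"
    (1, 1, 0, None, 12, None),            # "0, 12"
    (1, 1, 0, None, 33, None),            # "0, 33"
]
print(infer_constraints(columns, rows))  # ['id3 = 0', 'id1 = id2']
```

The constraint id1 = id2 says the two unions always take the same branch, which is what licenses rewriting them into a single union of two structs.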

Evaluation

Benchmark Formats

Data source              Chunks  Bytes   Description
1967Transactions.short   999     70929   Transaction records
MER_T01_01.cvs           491     21731   Comma-separated records
Ai.3000                  3000    293460  Web server log
Asl.log                  1500    279600  Log file of Mac ASL
Boot.log                 262     16241   Mac OS boot log
Crashreporter.log        441     50152   Original crashreporter daemon log
Crashreporter.log.mod    441     49255   Modified crashreporter daemon log
Sirius.1000              999     142607  AT&T phone provision data
Ls-l.txt                 35      1979    Command ls -l output
Netstat-an               202     14355   Output from netstat -an
Page_log                 354     28170   Printer log from CUPS
quarterlypersonalincome  62      10177   Spreadsheet
Railroad.txt             67      6218    US railroad info
Scrollkeeper.log         671     66288   Log from cataloging system
Windowserver_last.log    680     52394   Log from Mac LoginWindow server
Yum.txt                  328     18221   Log from package installer Yum

Available at http://www.padsproj.org/

[Figure: Training Time vs. Training Size]

[Figure: Training Accuracy vs. Training Size]

Conclusions

We are able to produce XML and statistical reports fully automatically from ad hoc data sources.

We’ve tested on approximately 15 real, mostly systemsy data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is good success.

For papers, online demos & PADS software, see our website at:

http://www.padsproj.org/

LearnPADS On the Web

End

Related Work

Most common domains for grammar inference:
► XML/HTML
► natural language

Systems that focus on ad hoc data are rare, and those that do don’t support the PADS tool suite:
► Rufus system ’93, TSIMMIS ’94, Potter’s Wheel ’01

Top-down structure discovery
► Arasu & Garcia-Molina ’03 (extracting data from web pages)

Grammar induction using MDL & grammar rewriting search
► Stolcke and Omohundro ’94 “Inducing probabilistic grammars...”
► T. W. Hong ’02, Ph.D. thesis on information extraction from web pages
► Higuera ’01 “Current trends in grammar induction”
► Garofalakis et al. ’00 “XTRACT for inferring DTDs”

Scoring Function

Finding a function to evaluate the “goodness” of a description involves balancing two ideas:
a description must be concise
► people cannot read and understand enormous descriptions
a description must be precise
► imprecise descriptions do not give us much useful information

Note the trade-off:
► increasing precision (good) usually increases description size (bad)
► decreasing description size (good) usually decreases precision (bad)

Minimum Description Length (MDL) Principle:
Normalized Information-theoretic Scores

Transmission Bits = BitsForDescription(T) + BitsForData(D given T)
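A toy calculation illustrates the trade-off behind the transmission-bits formula. All costs below are assumptions chosen for the example, not the encoding used by the authors: a coarse description "32-bit integer" is charged 8 bits for the description and 32 bits per record, while a more precise description "integer in the range 0 to 99" is charged 8 bits plus two 32-bit bounds for the description and log2(100) bits per record.

```python
import math


def transmission_bits(description_bits, data_bits_per_record, num_records):
    """Total cost = bits for the description + bits for the data given it."""
    return description_bits + data_bits_per_record * num_records


# Hypothetical costs for two candidate descriptions of the same integer field.
COARSE = {"description_bits": 8, "data_bits_per_record": 32}
PRECISE = {"description_bits": 8 + 2 * 32, "data_bits_per_record": math.log2(100)}

for n in (1, 1000):
    coarse = transmission_bits(num_records=n, **COARSE)
    precise = transmission_bits(num_records=n, **PRECISE)
    better = "precise" if precise < coarse else "coarse"
    print(f"{n} records: coarse={coarse:.1f} bits, precise={precise:.1f} bits -> {better}")
```

With a single record the larger description does not pay for itself, but over a thousand records the per-record savings dominate, so the more precise description gets the lower (better) score.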
