From Dirt to Shovels: Automatic Tool Generation from Ad Hoc Data
Kenny Zhu
Princeton University
with Kathleen Fisher, David Walker and Peter White
A System Admin’s Life
Web Server Logs…
System Logs…
Application Configs…
User Emails
Script Outputs and more…
Automatically Generate Tools from Data!
XML converter, data profiler, grapher, etc.
Architecture
[Figure: LearnPADS architecture. Raw Data enters Format Inference (Tokenization, Structure Discovery, Format Refinement, with a Scoring Function), which produces a Data Description. The PADS Compiler turns the description into tools: a Profiler producing an Analysis Report and an XML converter producing XML.]
Simple End-to-End
Data sources:
“0, 24”
“foo, 16”
“bar, end”
Description:
Punion payload {
  Pint32 i;
  PstringFW(3) s2;
};
Pstruct source {
  '\"'; payload p1; ","; payload p2; '\"';
}
XML output:
<source>
  <payload> <int><val>0</val></int> </payload>
  <payload> <int><val>24</val></int> </payload>
</source>
<source>
  <payload> <string><val>bar</val></string> </payload>
  <payload> <string><val>end</val></string> </payload>
</source>
Tokenization
Parse strings; convert to symbolic tokens
Basic token set skewed towards systems data
► Int, string, date, time, URLs, hostnames …
A config file allows users to define their own new token types via regular expressions
“0, 24” tokenizes to “ INT , INT ”
“foo, 16” tokenizes to “ STR , INT ”
“bar, end” tokenizes to “ STR , STR ”
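The tokenization step can be sketched in a few lines of Python. This is only an illustration, assuming a tiny token table with INT, STR, QUOTE, COMMA and whitespace; the table, the regular expressions and the function name `tokenize` are hypothetical and are not the LearnPADS token set or its config-file format.

```python
import re

# Hypothetical token table (name, regex); earlier entries win on ties.
# The real basic token set also covers dates, times, URLs, hostnames, etc.
TOKEN_SPECS = [
    ("INT", r"-?\d+"),
    ("STR", r"[A-Za-z]+"),
    ("QUOTE", r'"'),
    ("COMMA", r","),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPECS))


def tokenize(line):
    """Return (token_name, lexeme) pairs for one source line, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = MASTER.match(line, pos)
        if match is None:
            raise ValueError(f"no token matches at position {pos}: {line[pos:]!r}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for source in ['"0, 24"', '"foo, 16"', '"bar, end"']:
        print(source, [name for name, _ in tokenize(source)])
```

A user-defined token type would correspond to one more entry in the table, which is roughly what the config file of regular expressions provides.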
Structure Discovery: Overview
Top-down, divide-and-conquer algorithm:
► Compute various statistics from tokenized data
► Guess a top-level description
► Partition tokenized data into smaller chunks
► Recursively analyze and compute descriptions from smaller chunks
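[Figure: one discovery step on the sources “ INT , INT ”, “ STR , INT ”, “ STR , STR ”. Candidate structure so far: a struct containing the quote and comma literals with two undetermined fields (?). The chunks left for the first field are INT, STR, STR; for the second, INT, INT, STR.]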
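[Figure: the next discovery step. The first undetermined field, with chunks INT, STR, STR, becomes a union with an INT branch and a still-undetermined branch covering STR, STR. The second field, with chunks INT, INT, STR, remains to be analyzed.]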
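The partition step of the divide-and-conquer algorithm can be sketched as follows. This is a simplified illustration, assuming the struct tokens have already been chosen and that every source contains them the same number of times; the function name `partition` and its `struct_tokens` parameter are hypothetical.

```python
def partition(sources, struct_tokens):
    """Split each tokenized source at the struct tokens and group chunks by position.

    sources: list of token-name lists, one per source line.
    struct_tokens: token names assumed to form the top-level struct.
    Returns one column of chunks per gap between struct tokens.
    """
    columns = None
    for tokens in sources:
        chunks, current = [], []
        for tok in tokens:
            if tok in struct_tokens:
                chunks.append(current)  # close the chunk that ends at this literal
                current = []
            else:
                current.append(tok)
        chunks.append(current)  # trailing chunk after the last literal
        if columns is None:
            columns = [[] for _ in chunks]
        if len(chunks) != len(columns):
            raise ValueError("source does not fit the candidate struct")
        for column, chunk in zip(columns, chunks):
            column.append(chunk)
    return columns or []


if __name__ == "__main__":
    sources = [
        ["QUOTE", "INT", "COMMA", "INT", "QUOTE"],
        ["QUOTE", "STR", "COMMA", "INT", "QUOTE"],
        ["QUOTE", "STR", "COMMA", "STR", "QUOTE"],
    ]
    for column in partition(sources, {"QUOTE", "COMMA"}):
        print(column)
```

Each non-empty column would then be analyzed recursively in the same way, which is where the unions over INT and STR come from.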
Structure Discovery: Details
Compute frequency distribution histogram for each token.
(And recompute at every level of recursion).
[Figure: histograms for the Quote, Comma, Integer and String tokens over the sources “ INT , INT ”, “ STR , INT ”, “ STR , STR ”. x-axis: number of occurrences per source (1, 2); y-axis: percentage of sources (0 to 100).]
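As a rough sketch of the histogram computation for the three example sources, assuming each source is already a list of token names (the function name `token_histograms` and the token names are hypothetical):

```python
from collections import Counter


def token_histograms(sources):
    """Map each token to {occurrences per source: fraction of sources}."""
    vocabulary = {tok for tokens in sources for tok in tokens}
    raw = {tok: Counter() for tok in vocabulary}
    for tokens in sources:
        counts = Counter(tokens)
        for tok in vocabulary:
            # A Counter returns 0 for tokens that are absent from this source.
            raw[tok][counts[tok]] += 1
    total = len(sources)
    return {tok: {n: c / total for n, c in hist.items()} for tok, hist in raw.items()}


if __name__ == "__main__":
    sources = [
        ["QUOTE", "INT", "COMMA", "INT", "QUOTE"],
        ["QUOTE", "STR", "COMMA", "INT", "QUOTE"],
        ["QUOTE", "STR", "COMMA", "STR", "QUOTE"],
    ]
    for token, hist in sorted(token_histograms(sources).items()):
        print(token, hist)
```

On this data the quote appears twice in every source and the comma once in every source, while INT and STR are each spread evenly over zero, one and two occurrences.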
Structure Discovery: Details
Cluster tokens with similar histograms into groups
Similar histograms
► tokens with strong regularity coexist in same description component
► use symmetric relative entropy to measure similarity
Only the “shape” of the histogram matters
► normalize histograms by sorting columns in descending size
► result: comma & quote in one group, int & string in another
[Figure: histograms for Quote, Comma, Integer and String; x-axis: number of occurrences per source (1, 2); y-axis: percentage of sources (0 to 100).]
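The similarity measure can be sketched as below. The slide does not give the formula, so this assumes one common symmetric form, the average relative entropy of each histogram to their midpoint; the function names and the fixed `width` padding are illustrative assumptions.

```python
import math


def normalize(hist, width):
    """Keep only the shape: sort column heights in descending order, pad with zeros."""
    columns = sorted(hist.values(), reverse=True)
    return columns + [0.0] * (width - len(columns))


def relative_entropy(p, q):
    """Relative entropy of p with respect to q in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_relative_entropy(p, q):
    """Assumed symmetric form: average relative entropy of p and q to their midpoint."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * relative_entropy(p, mid) + 0.5 * relative_entropy(q, mid)


if __name__ == "__main__":
    quote = normalize({2: 1.0}, 3)
    comma = normalize({1: 1.0}, 3)
    integer = normalize({0: 1 / 3, 1: 1 / 3, 2: 1 / 3}, 3)
    print(symmetric_relative_entropy(quote, comma))
    print(symmetric_relative_entropy(quote, integer))
```

After normalization the quote and comma histograms have the same shape even though their columns sit at different occurrence counts, so their distance is zero and they fall in one group; the integer histogram is flat and lands elsewhere.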
Structure Discovery: Details
Classify the groups into:
► Structs == Groups with high coverage & low “residual mass”
► Arrays == Groups with high coverage, sufficient width & high “residual mass”
► Unions == Other token groups
Pick group with strongest signal to divide and conquer
More mathematical details in the paper
Struct involving comma, quote identified in histogram above
Overall procedure gives good starting point for refinement
[Figure: histograms for Quote, Comma, Integer and String; x-axis: number of occurrences per source (1, 2); y-axis: percentage of sources (0 to 100).]
Format Refinement
Reanalyze source data with aid of rough description and obtain functional dependencies and constraints
Rewrite format description to:
simplify presentation
► merge & rewrite structures
improve precision
► add constraints (uniqueness, ranges, functional dependencies)
fill in missing details
► find completions where structure discovery bottoms out
► refine base types (integer sizes, array sizes, separators and terminators)
Rewriting is guided by local search that optimizes an information-theoretic score (more details in the paper)
Refinement: Simple Example
Data: “0, 24” “foo, beg” “bar, end” “0, 56” “baz, middle” “0, 12” “0, 33” …
Structure discovery:
struct { “  union { int; str }  ,  union { int; str }  ” }
Tagging/table generation:
struct { “  union (id1) { int (id3); str (id4) }  ,  union (id2) { int (id5); str (id6) }  ” }
id1  id2  id3  id4  id5  id6
1    1    0    ---  24   ---
2    2    ---  foo  ---  beg
...
Constraint inference:
id3 = 0
id1 = id2
Rule-based structure rewriting:
struct { “  union { struct { 0 , int }; struct { str , str } }  ” }
Greater accuracy: first int is 0; no “int, str”
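The two constraints in this example, a constant column and a pair of equal columns, can be found with a simple scan over the table. The sketch below is an assumption about how such a check might look, not the refinement code itself; `infer_constraints` is a hypothetical name, missing cells are modeled as `None`, and the last two rows are derived from the example data in the same way as the first two.

```python
from itertools import combinations


def infer_constraints(rows, columns):
    """Find constant columns and pairs of columns that agree on every row.

    rows: list of dicts mapping column id to a value, or None where the cell is empty.
    """
    constants = {}
    for col in columns:
        present = {row[col] for row in rows if row[col] is not None}
        if len(present) == 1:
            constants[col] = present.pop()
    equalities = [
        (a, b)
        for a, b in combinations(columns, 2)
        if all(row[a] == row[b] for row in rows)
        and any(row[a] is not None for row in rows)
    ]
    return constants, equalities


if __name__ == "__main__":
    columns = ["id1", "id2", "id3", "id4", "id5", "id6"]
    rows = [
        dict(zip(columns, [1, 1, 0, None, 24, None])),       # "0, 24"
        dict(zip(columns, [2, 2, None, "foo", None, "beg"])),  # "foo, beg"
        dict(zip(columns, [2, 2, None, "bar", None, "end"])),  # "bar, end"
        dict(zip(columns, [1, 1, 0, None, 56, None])),       # "0, 56"
    ]
    print(infer_constraints(rows, columns))
```

The equality between the two union tags is what lets the rewriting step merge the two unions into a single union of structs.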
Evaluation
Benchmark Formats
Data source              Chunks  Bytes   Description
1967Transactions.short   999     70929   Transaction records
MER_T01_01.cvs           491     21731   Comma-separated records
Ai.3000                  3000    293460  Web server log
Asl.log                  1500    279600  Log file of MAC ASL
Boot.log                 262     16241   Mac OS boot log
Crashreporter.log        441     50152   Original crashreporter daemon log
Crashreporter.log.mod    441     49255   Modified crashreporter daemon log
Sirius.1000              999     142607  AT&T phone provision data
Ls-l.txt                 35      1979    Command ls -l output
Netstat-an               202     14355   Output from netstat -an
Page_log                 354     28170   Printer log from CUPS
quarterlypersonalincome  62      10177   Spread sheet
Railroad.txt             67      6218    US Rail road info
Scrollkeeper.log         671     66288   Log from cataloging system
Windowserver_last.log    680     52394   Log from Mac LoginWindow server
Yum.txt                  328     18221   Log from package installer Yum
Available at http://www.padsproj.org/
[Figure: Training Time vs. Training Size]
[Figure: Training Accuracy vs. Training Size]
Conclusions
We are able to produce XML and statistical reports fully automatically from ad hoc data sources.
We’ve tested on approximately 15 real, mostly systemsy data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is good success.
For papers, online demos & PADS software, see our website at:
http://www.padsproj.org/
LearnPADS On the Web
End
Related Work
Most common domains for grammar inference:
► xml/html
► natural language
Systems that focus on ad hoc data are rare, and those that do don’t support the PADS tool suite:
► Rufus system ’93, TSIMMIS ’94, Potter’s Wheel ’01
Top-down structure discovery
► Arasu & Garcia-Molina ’03 (extracting data from web pages)
Grammar induction using MDL & grammar rewriting search
► Stolcke and Omohundro ’94 “Inducing probabilistic grammars...”
► T. W. Hong ’02, Ph.D. thesis on information extraction from web pages
► Higuera ’01 “Current trends in grammar induction”
► Garofalakis et al. ’00 “XTRACT for inferring DTDs”
Scoring Function
Finding a function to evaluate the “goodness” of a description involves balancing two ideas:
a description must be concise
► people cannot read and understand enormous descriptions
a description must be precise
► imprecise descriptions do not give us much useful information
Note the trade-off:
► increasing precision (good) usually increases description size (bad)
► decreasing description size (good) usually decreases precision (bad)
Minimum Description Length (MDL) Principle:
Normalized Information-theoretic Scores
Transmission Bits = BitsForDescription(T) + BitsForData(D given T)