Learning to Generate Pseudo-code from Source Code
using Statistical Machine Translation
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
IEEE/ACM ASE, November 13, 2015
15/11/13 Copyright (C) 2015 by Yusuke Oda, AHC-Lab, IS, NAIST 2
Summary of This Study
● This presentation summarizes the key techniques used in the Pseudogen tool. [Fudaba+2015]
● Goal:
– Generating natural language sentences which describe the behavior of each statement in source code.
– We call these output sentences "pseudo-code."
● Approach:
– We use 2 different frameworks of statistical machine translation (SMT).
Contribution of Pseudo-code
● Assisting code reading: Pseudo-code aids code reading for programming beginners.
● Debugging: Programmers can double-check their code through pseudo-code.
Source code: if x / 5 == 0:
Pseudo-code in natural language: if x divided by 5 is 0
Fix: if x % 5 == 0:
Pseudo-code in This Study
● Line-to-line Assumption
– Each statement in source code can be written as one natural-language phrase with the same meaning.
● This assumption represents a minimal relationship between programming and natural language.
– We ignore more complicated cases so far (e.g. snippets, functions, documents).
Python → English (to be generated)
if x % 5 == 0: → if x is divisible by 5,
    y = 'foo' → assign a string 'foo' to y.
else: → if not,
    print('bar') → print a string 'bar' to the output stream.
Related Work for Sentence Generation
● Rule-based methods e.g. [Buse+ '08], [Sridhara+ '10], [Sridhara+ '11], [Moreno+ '13]
– Can use detailed information, but the rules are costly to maintain.
– Example: for os.print(msg), search the rule table (os.print(・) → print ・ to output stream, msg → message) and combine the matched rules into "print message to output system".
● Data (IR)-based methods e.g. [Haiduc+ '10], [Eddy+ '13], [Wong+ '13], [Rodeghero+ '14]
– Can use large corpora from the real world, but search errors sometimes occur.
– Example: for os.print(msg), search the knowledge base and propose the stored entry "print message to output system".
Statistical Machine Translation (SMT)
● Key idea: Combining the good parts of rule-based and data-based methods.
1. Training: Extract transformation rules between two languages from a large corpus.
2. Generating: Search for an accurate combination of rules for the input.
● Merits
1. Automated: Most translation rules are obtained automatically.
2. Scalable: Increasing the size of the corpus improves translation quality.
● We used 2 different SMT frameworks:
1. Phrase-based machine translation (PBMT)
2. Tree-to-string machine translation (T2SMT)
[Figure: Training: Corpus → Translator. Generating: Source sentence → Translator → Target sentence.]
Phrase-based Machine Translation (PBMT)
● Use token strings to generate output.
Python: if x % 5 == 0:
1. Tokenize: if | x | % | 5 | == | 0 | :
2. Select Phrase Pairs: if → if, x → x, % 5 → by 5, == 0 : → is divisible
3. Reorder: if | x | is divisible | by 5
4. Synthesize Target Sentence
English: if x is divisible by 5
● Pro: Simple method, we only need tokenizers.
● Con: Cannot capture source structures.
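The four PBMT steps can be sketched in a few lines of Python. This is only a toy illustration, not the decoder used in the study: the phrase table is written by hand from the phrase pairs on the slide, the segmentation is a greedy longest match, and the reordering is passed in as a fixed list, whereas a real PBMT decoder searches over segmentations and orders using model scores. The names PHRASE_TABLE, segment and translate are hypothetical.

```python
# Toy phrase table (assumption): entries copied from the slide's example.
PHRASE_TABLE = {
    ("if",): ["if"],
    ("x",): ["x"],
    ("%", "5"): ["by", "5"],
    ("==", "0", ":"): ["is", "divisible"],
}


def segment(tokens, table, max_len=3):
    """Step 2: greedy longest-match segmentation into known source phrases."""
    phrases, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in table:
                phrases.append(candidate)
                i += n
                break
        else:
            raise KeyError(f"no phrase covers {tokens[i]!r}")
    return phrases


def translate(tokens, table, order):
    """Steps 3 and 4: reorder the phrases, then join their translations."""
    phrases = segment(tokens, table)
    words = []
    for index in order:
        words.extend(table[phrases[index]])
    return " ".join(words)


# Step 1 (tokenization) is done by hand here.
tokens = ["if", "x", "%", "5", "==", "0", ":"]
print(translate(tokens, PHRASE_TABLE, order=[0, 1, 3, 2]))
# prints "if x is divisible by 5"
```

The fixed order [0, 1, 3, 2] moves "is divisible" in front of "by 5", which is the reordering shown in step 3.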
Tree-to-string Machine Translation (T2SMT)
● Use syntax trees to generate output.
Python: if x % 5 == 0:
1. Parse: if → (cmp, body); cmp → (binop, ==, 0); binop → (x, %, 5)
2. Select Subtrees: the if subtree gives "if X", the cmp subtree gives "Y is divisible by Z" for X, and the leaves give Y = x and Z = 5
3. Synthesize Target Sentence
English: if x is divisible by 5
● Pro: Can capture source structures.
● Con: Complicated method, we need tree treatment.
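A minimal sketch of the tree-to-string idea, assuming Python 3.8 or later and the standard ast module. The two rules below are written by hand to mirror the slide's example; in T2SMT such rules are extracted from a corpus and scored, so this is an illustration of applying subtree rules top-down, not the study's translator. The function name describe is hypothetical.

```python
import ast


def describe(node):
    """Translate a tiny subset of Python AST nodes into English, top-down."""
    if isinstance(node, ast.If):
        # Rule: "if X :" -> "if X"
        return "if " + describe(node.test)
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        left, op, right = node.left, node.ops[0], node.comparators[0]
        # Rule: "Y % Z == 0" -> "Y is divisible by Z"
        if (isinstance(op, ast.Eq)
                and isinstance(left, ast.BinOp)
                and isinstance(left.op, ast.Mod)
                and isinstance(right, ast.Constant)
                and right.value == 0):
            return f"{describe(left.left)} is divisible by {describe(left.right)}"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(type(node).__name__)


statement = ast.parse("if x % 5 == 0:\n    pass").body[0]
print(describe(statement))
# prints "if x is divisible by 5"
```

Because the rule matches a whole subtree (a comparison whose left side is a modulo operation), it captures structure that a flat token sequence does not expose.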
Training Process of SMT Methods
[Figure: training pipeline.
Source corpus + target corpus → making word alignment → alignment (token-level relationship) → rule extraction → translation rules & stats (phrase-level relationship) → combine features → translation model.
Target corpus → making language model → target language model (evaluates fluency of output).]
Word Alignment
● Making word alignment (token-level relationship)
– Using a statistical model.
[Figure: alignment matrix between the Python tokens "if x % 5 == 0 :" and the English words "if x is divisible by 5".]
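The slide does not name the statistical model. As one well-known possibility, the sketch below uses IBM Model 1 trained with EM (without a NULL word) on a three-sentence toy corpus; both the choice of model and the corpus are assumptions made for illustration, and the names ibm_model1 and align are hypothetical.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t(target word | source token) with EM (IBM Model 1, no NULL)."""
    t = defaultdict(lambda: 1.0)  # uniform start; only ratios matter
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source tokens.
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalize the expected counts per source token.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return dict(t)


def align(src, tgt, t):
    """Link each target word to its most probable source token."""
    return [(e, max(src, key=lambda f: t.get((e, f), 0.0))) for e in tgt]


# Toy parallel corpus (assumption): tokenized Python paired with English.
pairs = [
    ("if x % 5 == 0 :".split(), "if x is divisible by 5".split()),
    ("if y % 2 == 0 :".split(), "if y is divisible by 2".split()),
    ("y = 5".split(), "assign 5 to y".split()),
]
table = ibm_model1(pairs)
for english, python_token in align(pairs[0][0], pairs[0][1], table):
    print(english, "<-", python_token)
```

With so little data the links are unreliable; the point is only that alignments come from co-occurrence statistics rather than hand-written rules.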
Rule Extraction (PBMT)
● Making word alignment (token-level relationship)
– Using a statistical model.
● Extract phrase pairs according to aligned words.
[Figure: alignment matrix between the Python tokens "if x % 5 == 0 :" and the English words "if x is divisible by 5".]
Extracted phrase pairs (Python → English):
== 0 : → is divisible
x % 5 == → x is divisible by 5
if x → if x
% 5 → by 5
5 == 0 → is divisible by 5
...and so on
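Phrase pairs are usually extracted with a consistency check: a source span and a target span form a pair only if no alignment link leaves the pair. The sketch below implements a simplified version of that check (it does not extend target spans over unaligned words). The alignment links are an assumption made for this example, since the matrix on the slide is not recoverable, and extract_phrases is a hypothetical name.

```python
def extract_phrases(src, tgt, links, max_len=4):
    """Return phrase pairs consistent with `links`, a set of (src_i, tgt_j)."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            js = [j for (i, j) in links if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # Reject if a target word in [j1, j2] links outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in links):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = "if x % 5 == 0 :".split()
tgt = "if x is divisible by 5".split()
# Assumed links: if-if, x-x, %-by, 5-5, ==-is, 0-divisible; ":" is unaligned.
links = {(0, 0), (1, 1), (2, 4), (3, 5), (4, 2), (5, 3)}

for python_phrase, english_phrase in sorted(extract_phrases(src, tgt, links)):
    print(f"{python_phrase} -> {english_phrase}")
```

Under these links the span "5 ==" is rejected, because its target range also covers "divisible" and "by", which are linked to tokens outside the span.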
Rule Extraction (T2SMT)
● Given word alignments, tree-to-string rules are extracted according to aligned words and the source parse tree.
[Figure: the parse tree of "if x % 5 == 0 :" aligned with "if x is divisible by 5". Taking the cmp/binop subtree for "x % 5 == 0" (aligned with "x is divisible by 5") and removing the aligned leaves x and 5 yields the rule: cmp/binop "X % Y == 0" → "X is divisible by Y".]
SMT for Pseudo-code Generation
Requirements for SMT Methods
PBMT
● Tokenizer for natural language
– Use NLP tools.
● English: Stanford Tokenizer
● Japanese: MeCab
● Tokenizer for programming language
– Use the tokenizer provided by the programming language itself.
T2SMT
● Tokenizer for natural language
– Same as PBMT
● Parser for programming language
– The parser should generate parse trees that include all tokens as leaf nodes, to be used for word alignment.
– But most programming languages provide only an AST parser.
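For Python, "the tokenizer provided by the programming language itself" is the standard tokenize module. The sketch below shows one way to get the token strings of a single statement; the helper name python_tokens and the choice of which token types to drop are assumptions for illustration.

```python
import io
import tokenize


def python_tokens(line):
    """Return the visible tokens of one Python statement."""
    # Layout-only tokens carry no text that could be aligned to words.
    skipped = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
               tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}
    readline = io.StringIO(line).readline
    return [token.string
            for token in tokenize.generate_tokens(readline)
            if token.type not in skipped]


print(python_tokens("if x % 5 == 0:"))
# prints ['if', 'x', '%', '5', '==', '0', ':']
```

The tokenizer works purely lexically, so a statement header without its body can still be tokenized.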
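Problem of AST
• Problem: Mismatching of token nodes.
– There are redundant nodes.
– Some words in natural language are aligned to inner nodes in the AST.
[Figure: AST of "if x % 5 == 0:" next to the English "if x is divisible by 5". If has the children test (Compare) and body (Body); Compare has left (BinOp), ops[0] (==) and comparators[0] (Num, n = 0); BinOp has left (Name, id = x, ctx = Load), op (%) and right (Num, n = 5). It is unclear which nodes the English words should be aligned to.]
Our approach: Applying simple transformation rules to avoid token mismatching.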
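The mismatch can be observed directly with Python's ast module. The sketch below, assuming Python 3.8 or later (where the slide's Num nodes appear as Constant), collects the primitive values stored in the tree and the node type names; leaf_values is a hypothetical helper.

```python
import ast


def leaf_values(node):
    """Collect identifiers and constants stored as plain values in the AST."""
    values = []
    for child in ast.walk(node):
        for _, value in ast.iter_fields(child):
            if isinstance(value, (str, int)) and not isinstance(value, bool):
                values.append(value)
    return values


statement = ast.parse("if x % 5 == 0:\n    pass").body[0]
node_types = {type(n).__name__ for n in ast.walk(statement)}

print(sorted(str(v) for v in leaf_values(statement)))  # only x, 5 and 0
print(sorted(node_types))
```

Only x, 5 and 0 appear as values. The tokens "if", "%" and "==" exist only as the node types If, Mod and Eq, and Load is a node with no corresponding token at all, which is the redundancy the slide refers to.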
Parse-like Tree (1): Head Insertion
1. Insert HEAD leaves (= label of each node).
[Figure: AST of "if x % 5 == 0:" with a HEAD leaf added under each node; the HEAD leaves carry the labels If, Compare, BinOp, Name, Num and Num.]
Parse-like Tree (2): Pruning
1. Insert HEAD leaves (= label of each node).
2. Delete redundant nodes.
[Figure: the AST with HEAD leaves from step 1, from which the redundant nodes are deleted.]
Parse-like Tree (3): Simplification
1. Insert HEAD leaves (= label of each node).
2. Delete redundant nodes.
3. Integrate some nodes.
[Figure: the pruned tree, without ctx/Load, Body and the HEAD leaves of Compare and BinOp. The remaining HEAD leaves belong to If, Name, Num and Num; each Name or Num node is integrated with its value into a single leaf: x, 5, 0.]
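Parse-like Tree (4): Final Tree
• Finally, we obtain the parse-like tree below.
[Figure: final parse-like tree. If has HEAD (If) and test (Compare); Compare has left (BinOp), ops[0] (==) and comparators[0] (0); BinOp has left (x), op (%) and right (5). The leaves are aligned with the English "if x is divisible by 5".]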
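The three transformations can be approximated over Python's ast module as follows. This is a simplified sketch under stated assumptions, not the authors' rule set: it requires Python 3.8 or later, adds a HEAD leaf only for If (the one HEAD that survives in the final tree) instead of inserting all HEAD leaves and pruning afterwards, and uses a tiny hand-written operator table. OP_TOKENS, PRUNED_FIELDS, parse_like and leaves are hypothetical names.

```python
import ast

OP_TOKENS = {ast.Mod: "%", ast.Eq: "=="}           # assumed, deliberately tiny
PRUNED_FIELDS = {"ctx", "body", "orelse", "kind"}  # treated as redundant here


def parse_like(node):
    """Convert an AST node into a leaf string or a (label, children) tuple."""
    # Integration: Name, Constant and operator nodes become one token leaf.
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if type(node) in OP_TOKENS:
        return OP_TOKENS[type(node)]

    label = type(node).__name__
    children = []
    # Head insertion: the keyword is exposed as a HEAD leaf.
    if isinstance(node, ast.If):
        children.append(("HEAD", [label.lower()]))
    for name, value in ast.iter_fields(node):
        if name in PRUNED_FIELDS:  # pruning
            continue
        items = value if isinstance(value, list) else [value]
        for k, item in enumerate(items):
            if isinstance(item, ast.AST):
                edge = f"{name}[{k}]" if isinstance(value, list) else name
                children.append((edge, [parse_like(item)]))
    return (label, children)


def leaves(tree):
    """Read the leaves of a parse-like tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1] for leaf in leaves(child)]


statement = ast.parse("if x % 5 == 0:\n    pass").body[0]
print(leaves(parse_like(statement)))
# prints ['if', 'x', '%', '5', '==', '0']
```

After the transformation every source token except the colon is a leaf, so the tree can be word-aligned in the same way as a token string.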
Experiments
Corpus Summaries
● We gathered 2 corpora with different language pairs.
1. Python-to-English
– Python ... Extracted from the Django framework
– English ... Handmade by 1 human
– Amount ... 18,805 pairs
– Usage ... 17,000 for training, 1,805 for evaluation
2. Python-to-Japanese
– Python ... Extracted from student code for programming exercises
– Japanese ... Handmade by 1 human
– Amount ... 722 pairs
– Usage ... 10-fold cross-validation (9/10 for training, 1/10 for evaluation)
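A 10-fold split of the 722 Python-to-Japanese pairs can be sketched as below. The slide does not say whether the pairs were shuffled or how the folds were formed, so the contiguous folds here are an assumption, and k_fold_splits is a hypothetical helper.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


for fold, (train, test) in enumerate(k_fold_splits(722), start=1):
    print(fold, len(train), len(test))
```

Each pair is used for evaluation exactly once and for training in the other nine folds.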
Evaluated Methods
Method           Framework        Input data structure
PBMT             Phrase-based     Token strings generated from tokenize module
Raw-T2SMT        Tree-to-string   AST generated from ast module
Modified-T2SMT   Tree-to-string   Parse-like tree (AST with transformation rules)
Evaluation Setting
● We examined 2 points:
1. Intrinsic evaluation: Translation quality
– Apply evaluation metrics used in machine translation studies.
● Automatic evaluation: BLEU
● Human evaluation: Acceptability
2. Extrinsic evaluation: Code understanding
– Examine our generator in an actual task: read Python code with pseudo-code, answer its readability on a scale from 0 to 5, and record the reading time.
● Python + no pseudo-code
● Python + generated pseudo-code
● Python + human-written pseudo-code
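For reference, BLEU combines clipped n-gram precisions (n = 1 to 4) with a brevity penalty. The sketch below is a plain corpus-level version with one reference per sentence and no smoothing; it illustrates the metric and is not the evaluation script used in the study. The names ngrams and corpus_bleu are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matched[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += sum(h.values())
    if min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = "if x is divisible by 5".split()
print(corpus_bleu([reference], [reference]))
print(corpus_bleu(["if x divided by 5 is 0".split()], [reference]))
```

An exact match scores 1.0. The second hypothesis shares no trigram with the reference, so without smoothing its score is 0.0, which is why BLEU is normally reported over a whole test set rather than per sentence.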
Results: Intrinsic Evaluation
● BLEU and Acceptability show the same tendency:
Modified-T2SMT > Raw-T2SMT > PBMT
● The Modified-T2SMT method has the best performance in all settings.
– 72% of test samples achieve the highest Acceptability (= grammatically correct & fluent)
Automatic Evaluation: BLEU [Papineni et al. 2002]
Generator        BLEU% (English)   BLEU% (Japanese)
PBMT             25.71             51.67
Raw-T2SMT        49.74             55.66
Modified-T2SMT   54.08             62.88
(do not compare scores between English and Japanese)
Human Evaluation: Acceptability [Goto et al. 2013] (Python-Japanese)
[Figure: cumulative Acceptability (grades 1 to 5) for each method; the values 50%, 63% and 72% are marked for PBMT, Raw-T2SMT and Modified-T2SMT respectively.]
Results: Code Understanding
● Generated pseudo-code can improve code readability compared with no pseudo-code.
● But reading time increases.
– This comes from generation errors (human-written pseudo-code does not increase reading time).
Code Readability and Reading Time (Python-Japanese, Modified-T2SMT)
Group                     Pseudo-code     Readability (6-grade Likert)   Mean reading time [s]
Experienced (8 people)    No              2.55                           41.37
Experienced (8 people)    Generated       2.71                           46.48
Experienced (8 people)    Human-written   3.05                           35.65
Inexperienced (6 people)  No              1.32                           24.99
Inexperienced (6 people)  Generated       1.81                           39.52
Inexperienced (6 people)  Human-written   2.10                           24.97
Conclusion / Future Work
● Summary:
– We generate natural language sentences (which we call pseudo-code) from source statements using statistical machine translation (SMT).
– For the tree-to-string (T2SMT) method, we apply transformation rules to make a parse-like tree.
● Results:
– SMT can generate acceptable sentences.
● 54% BLEU in English, 62% BLEU and 72% highest Acceptability in Japanese
– Generated sentences can aid code readability.
● However, reading takes longer than with human-written pseudo-code. There is still room for improvement.
● Future Work:
– Considering more complicated generation
● Input: snippets, functions, classes
● Output: multiple sentences, documents
– Applying to more language pairs
– Automated preprocessing