
Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation


Page 1

Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation

Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura

IEEE/ACM ASE, November 13, 2015

Page 2

Summary of This Study

● This presentation introduces the key techniques used in the Pseudogen tool [Fudaba+2015].

● Goal:

– Generating natural language sentences which describe the behavior of each statement in source code.

– We call these output sentences "pseudo-code."

● Approach:

– We used two different frameworks of statistical machine translation (SMT).

Page 3

Contribution of Pseudo-code

● Pseudo-code aids code reading for programming beginners.

● Programmers can double-check their code through pseudo-code.

(Figure: two use cases. Assisting code reading: source code is shown alongside pseudo-code in natural language. Debugging: the source code "if x / 5 == 0:" yields the pseudo-code "if x divided by 5 is 0"; spotting the mismatch with the intended behavior, the programmer fixes the code to "if x % 5 == 0:".)

Page 4

Pseudo-code in This Study

● Line-to-line Assumption

– Each statement in source code can be written as one phrase in natural language with the same meaning.

● This assumption represents a minimal relationship between programming and natural language.

– For now, we ignore more complicated cases (e.g. snippets, functions, documents).

Python                      English (to be generated)
if x % 5 == 0: (body)   →   if x is divisible by 5,
    y = 'foo'           →   assign a string 'foo' to y.
(if ...) else: (body)   →   if not,
    print('bar')        →   print a string 'bar' to the output stream.

Page 5

Related Work for Sentence Generation

● Rule-based methods, e.g. [Buse+ '08], [Sridhara+ '10], [Sridhara+ '11], [Moreno+ '13]

– Can use detailed information, but the rule tables are costly to maintain.

(Figure: a rule-based system looks up entries such as os.print(・) → "print ・ to output stream" and msg → "message" in its rule table and combines them into "print message to output system"; a data-based system searches its knowledge base for code similar to os.print(msg) and proposes the stored description.)

● Data (IR)-based methods, e.g. [Haiduc+ '10], [Eddy+ '13], [Wong+ '13], [Rodeghero+ '14]

– Can use large corpora from the real world, but sometimes suffer from search errors.

Page 6

Statistical Machine Translation (SMT)

Page 7

Statistical Machine Translation (SMT)

● Key idea: combine the good parts of rule-based and data-based methods.

1. Training: extract transformation rules between the two languages from a large corpus.

2. Generating: search for an accurate combination of rules for the input.

● Merit

1. Automated: Most translation rules are automatically obtained.

2. Scalable: Increasing the size of the corpus improves translation quality.

● We used two different SMT frameworks:

1. Phrase-based machine translation (PBMT)

2. Tree-to-string machine translation (T2SMT)

(Figure: in training, a translator is learned from the parallel corpus; in generation, it maps a source sentence to a target sentence.)

Page 8

Phrase-based Machine Translation (PBMT)

● Uses token strings to generate output.

Python: if x % 5 == 0:
English: if x is divisible by 5

1. Tokenize: if | x | % | 5 | == | 0 | :

2. Select phrase pairs: "if" → "if", "x" → "x", "% 5" → "by 5", "== 0 :" → "is divisible"

3. Reorder: "is divisible" and "by 5" are swapped to match English word order.

4. Synthesize target sentence: if x is divisible by 5

+ Simple method: we only need tokenizers.
- Cannot capture source structures.
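To make these four steps concrete, here is a minimal sketch in Python. The phrase table and the reordering step are hand-written for this single example and are purely hypothetical; a real PBMT system learns phrase pairs from the corpus and scores reorderings statistically.

    import io
    import tokenize

    # Toy phrase table for this one example (hypothetical; a real system
    # learns these pairs from the parallel corpus).
    PHRASE_TABLE = {
        ("if",): ["if"],
        ("x",): ["x"],
        ("%", "5"): ["by", "5"],
        ("==", "0", ":"): ["is", "divisible"],
    }

    def pbmt_translate(src):
        # 1. Tokenize the Python statement with Python's own tokenizer.
        toks = [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
                if t.string.strip()]
        # 2. Select phrase pairs, greedily preferring the longest match.
        phrases, i = [], 0
        while i < len(toks):
            for n in range(len(toks) - i, 0, -1):
                key = tuple(toks[i:i + n])
                if key in PHRASE_TABLE:
                    phrases.append(PHRASE_TABLE[key])
                    i += n
                    break
            else:
                phrases.append(toks[i:i + 1])  # pass unknown tokens through
                i += 1
        # 3. Reorder: swap the last two phrases to match English word order
        #    (a real system scores many permutations with a language model).
        phrases[-2], phrases[-1] = phrases[-1], phrases[-2]
        # 4. Synthesize the target sentence.
        return " ".join(w for p in phrases for w in p)

    print(pbmt_translate("if x % 5 == 0:"))  # -> if x is divisible by 5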

Page 9

Tree-to-string Machine Translation (T2SMT)

● Uses syntax trees to generate output.

Python: if x % 5 == 0:
English: if x is divisible by 5

1. Parse: build a parse tree for the statement, e.g. (if (cmp (binop x % 5) == 0) : body).

2. Select subtrees: match tree fragments to target templates, e.g. "if X :" → "if X" and "(binop Y % Z) == 0" → "Y is divisible by Z".

3. Synthesize target sentence: if x is divisible by 5

+ Can capture source structures.
- Complicated method: we need tree handling.
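As a contrast, here is a hypothetical tree-to-string sketch for the same example: we parse with Python's ast module and hand-code the two rules from the figure ("if X" and "Y is divisible by Z"); a real T2SMT system extracts such rules automatically, as shown on page 13.

    import ast

    # Two hand-coded tree-to-string rules (hypothetical; a real system
    # extracts thousands of rules from an aligned corpus).
    def t2smt_translate(src):
        stmt = ast.parse(src).body[0]
        # Rule: "if <test>: ..." -> "if " + translation of <test>
        if isinstance(stmt, ast.If):
            return "if " + translate_expr(stmt.test)
        raise NotImplementedError("no rule for this statement")

    def translate_expr(e):
        # Rule: "(Y % Z) == 0" -> "Y is divisible by Z"
        if (isinstance(e, ast.Compare)
                and isinstance(e.ops[0], ast.Eq)
                and isinstance(e.comparators[0], ast.Constant)
                and e.comparators[0].value == 0
                and isinstance(e.left, ast.BinOp)
                and isinstance(e.left.op, ast.Mod)):
            y = translate_expr(e.left.left)
            z = translate_expr(e.left.right)
            return f"{y} is divisible by {z}"
        if isinstance(e, ast.Name):       # variables translate to their names
            return e.id
        if isinstance(e, ast.Constant):   # literals translate to their values
            return str(e.value)
        raise NotImplementedError("no rule for this expression")

    print(t2smt_translate("if x % 5 == 0:\n    pass"))  # -> if x is divisible by 5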

Page 10

Training Process of SMT Methods

(Figure: training pipeline. From the source and target corpora, word alignment captures token-level relationships; rule extraction then derives translation rules and statistics (phrase-level relationships), which are combined with other features into the translation model. Separately, a target language model is trained to evaluate the fluency of the output.)

Page 11

Word Alignment

● Making word alignment (token-level relationships)

– Using a statistical model.

(Figure: alignment links between the Python tokens "if x % 5 == 0 :" and the English tokens "if x is divisible by 5".)
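The slide does not name the particular alignment model, so treat the following as illustrative: a minimal sketch of one classic choice, IBM Model 1, trained with EM over (Python, English) token pairs.

    from collections import defaultdict

    def ibm_model1(pairs, iterations=10):
        # t[(e, f)] approximates P(target token e | source token f).
        t = defaultdict(lambda: 1.0)  # flat initialization
        for _ in range(iterations):
            count = defaultdict(float)  # expected co-occurrence counts (E-step)
            total = defaultdict(float)
            for f_sent, e_sent in pairs:
                for e in e_sent:
                    z = sum(t[(e, f)] for f in f_sent)  # normalization
                    for f in f_sent:
                        c = t[(e, f)] / z
                        count[(e, f)] += c
                        total[f] += c
            for (e, f), c in count.items():  # M-step: re-estimate t
                t[(e, f)] = c / total[f]
        return t

    # Two toy sentence pairs; real training uses thousands of pairs.
    pairs = [("if x % 5 == 0 :".split(), "if x is divisible by 5".split()),
             ("if x % 3 == 0 :".split(), "if x is divisible by 3".split())]
    t = ibm_model1(pairs)
    # English "5" should come to prefer the source token "5" over, say, "%".
    print(round(t[("5", "5")], 3), round(t[("5", "%")], 3))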

Page 12

Rule Extraction (PBMT)

● Given the word alignment, phrase pairs are extracted according to the aligned words, as sketched below.

Extracted pairs for "if x % 5 == 0 :" / "if x is divisible by 5":

if x → if x
% 5 → by 5
== 0 : → is divisible
5 == 0 → is divisible by 5
x % 5 == → x is divisible by 5
...and so on
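The standard extraction heuristic keeps every phrase pair that is consistent with the alignment (no link crosses the pair's boundary). A sketch, using hypothetical alignment links for the running example:

    def extract_phrases(src, tgt, links, max_len=4):
        # Extract phrase pairs consistent with the word alignment:
        # a pair is kept only if no alignment link leaves it.
        pairs = []
        for i1 in range(len(src)):
            for i2 in range(i1, min(i1 + max_len, len(src))):
                # Target span covered by links from src[i1..i2].
                tgt_idx = [t for s, t in links if i1 <= s <= i2]
                if not tgt_idx:
                    continue
                j1, j2 = min(tgt_idx), max(tgt_idx)
                # Consistency: every link into tgt[j1..j2] must start in src[i1..i2].
                if all(i1 <= s <= i2 for s, t in links if j1 <= t <= j2):
                    pairs.append((" ".join(src[i1:i2 + 1]),
                                  " ".join(tgt[j1:j2 + 1])))
        return pairs

    src = "if x % 5 == 0 :".split()
    tgt = "if x is divisible by 5".split()
    # Hypothetical (source index, target index) links; "0" and ":" unaligned.
    links = [(0, 0), (1, 1), (2, 4), (3, 5), (4, 2), (4, 3)]
    for pair in extract_phrases(src, tgt, links):
        print(pair)  # includes ("if x", "if x"), ("% 5", "by 5"), ...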

Page 13

Rule Extraction (T2SMT)

● Given the word alignment, tree-to-string rules are extracted according to the aligned words and the source parse tree.

(Figure: the subtree covering "x % 5 == 0" (cmp/binop) aligns to "x is divisible by 5"; abstracting the aligned leaves x and 5 into variables yields the rule "X % Y == 0" → "X is divisible by Y".)

Page 14

SMT for Pseudo-code Generation

Page 15

Requirements for SMT Methods

PBMT:

● Tokenizer for natural language
– Use NLP tools (English: Stanford Tokenizer; Japanese: MeCab).

● Tokenizer for programming language
– Use the tokenizer provided by the programming language itself.

T2SMT:

● Tokenizer for natural language
– Same as PBMT.

● Parser for programming language
– The parser should generate parse trees that include all tokens as leaf nodes, to be used for word alignment.
– But most programming languages provide only an AST parser.
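A quick look at the two input representations using Python's standard library; note that the AST output contains operator nodes like Mod and Eq instead of the surface tokens "%", "==", and ":", which is exactly the mismatch discussed on the next page.

    import ast
    import io
    import tokenize

    code = "if x % 5 == 0:\n    pass"
    # Token strings for PBMT, from the language's own tokenizer.
    print([t.string for t in tokenize.generate_tokens(io.StringIO(code).readline)
           if t.string.strip()])
    # AST for T2SMT, from the language's own parser.
    print(ast.dump(ast.parse(code)))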

Page 16

Problem of AST

● Problem: mismatching of token nodes.

(Figure: the Python AST for "if x % 5 == 0:" — If(test=Compare(left=BinOp(Name x, Mod, Num 5), ops=[Eq], comparators=[Num 0]), body=...) — cannot be aligned directly to the English tokens "if x is divisible by 5".)

– There are redundant nodes (e.g. the Load context under Name).

– Some words in natural language are aligned to inner nodes in the AST.

● Our approach: apply simple transformation rules to avoid token mismatching.

Page 17

Parse-like Tree (1): Head Insertion

1. Insert HEAD leaves (= the label of each node).

(Figure: every AST node — If, Compare, BinOp, Name, Num — gains an extra HEAD leaf carrying its own label.)

Page 18

Parse-like Tree (2): Pruning

1. Insert HEAD leaves (= the label of each node).

2. Delete redundant nodes.

(Figure: redundant nodes, such as the Load context and the Body wrapper, are removed from the tree.)

Page 19

Parse-like Tree (3): Simplification

1. Insert HEAD leaves (= the label of each node).

2. Delete redundant nodes.

3. Integrate some nodes.

(Figure: single-child wrapper nodes such as Name and Num are merged away, leaving the leaves x, 5, and 0 directly under their parents.)

Page 20

Parse-like Tree (4): Final Tree

● Finally, we obtain the parse-like tree below.

(Figure: the final parse-like tree for "if x % 5 == 0:" — an If node with HEAD leaf "if" and a Compare/BinOp subtree whose leaves are x, %, 5, ==, 0 — which can now be aligned token-by-token with the English "if x is divisible by 5".)
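As an illustration, a hypothetical sketch of steps 1 and 2 (head insertion and pruning) on a Python AST; step 3's node integration and the restoration of operator tokens such as "%" are omitted for brevity, and the pruning list is an assumption.

    import ast

    PRUNE_FIELDS = {"ctx", "body", "orelse"}  # assumed list of redundant fields

    def to_parse_like_tree(node):
        # Step 1: insert the node's label as a HEAD leaf.
        # Step 2: prune redundant fields (e.g. Load contexts).
        if isinstance(node, ast.AST):
            children = [("HEAD", type(node).__name__)]
            for field, value in ast.iter_fields(node):
                if field in PRUNE_FIELDS or value is None:
                    continue
                values = value if isinstance(value, list) else [value]
                children += [to_parse_like_tree(v) for v in values]
            return children
        return node  # plain leaves: identifiers, numbers, strings

    stmt = ast.parse("if x % 5 == 0:\n    pass").body[0]
    print(to_parse_like_tree(stmt))
    # -> nested lists with HEAD leaves for If, Compare, BinOp, ...
    #    and plain leaves x, 5, 0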

Page 21

Experiments

Page 22

Corpus Summaries

● We gathered two corpora with different language pairs.

1. Python-to-English

– Python ... extracted from the Django framework
– English ... hand-written by one human
– Amount ... 18,805 pairs
– Usage ... 17,000 for training, 1,805 for evaluation

2. Python-to-Japanese

– Python ... extracted from student code for a programming exercise
– Japanese ... hand-written by one human
– Amount ... 722 pairs
– Usage ... 10-fold cross-validation (9/10 for training, 1/10 for evaluation)

Page 23

Evaluated Methods

Method         | Framework      | Input data structure
PBMT           | Phrase-based   | Token strings generated by the tokenize module
Raw-T2SMT      | Tree-to-string | AST generated by the ast module
Modified-T2SMT | Tree-to-string | Parse-like tree (AST with transformation rules)

Page 24

Evaluation Setting

● We examined two points: intrinsic evaluation (translation quality) and extrinsic evaluation (code understanding).

Intrinsic evaluation: translation quality

● Apply evaluation metrics used in machine translation studies.
– Automatic evaluation: BLEU
– Human evaluation: Acceptability

Extrinsic evaluation: code understanding

● Examine our generator in an actual task: subjects read Python code with its pseudo-code, answer a readability question (0-5 scale), and we record the reading time.

● Three conditions: Python + no pseudo-code, Python + generated pseudo-code, Python + human-written pseudo-code.
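For reference, a hypothetical illustration of BLEU scoring with NLTK (the slides do not specify the exact evaluation scripts); multiplying by 100 gives percentages like those reported on the next page.

    from nltk.translate.bleu_score import corpus_bleu

    # One list of references per sample, plus the generated hypotheses.
    refs = [["if x is divisible by 5".split()],
            ["assign a string 'foo' to y .".split()]]
    hyps = ["if x is divisible by 5".split(),   # exact match
            "assign 'foo' to y .".split()]      # partial match
    print(corpus_bleu(refs, hyps))  # corpus-level BLEU in [0, 1]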

Page 25

Results: Intrinsic Evaluation

● BLEU and Acceptability show the same tendency:

Modified-T2SMT > Raw-T2SMT > PBMT

● The Modified-T2SMT method has the best performance in all settings.

– 72% of test samples achieve the highest Acceptability grade (= grammatically correct & fluent).

Automatic Evaluation: BLEU% [Papineni et al. 2002]

Generator      | English | Japanese
PBMT           | 25.71   | 51.67
Raw-T2SMT      | 49.74   | 55.66
Modified-T2SMT | 54.08   | 62.88

(do not compare scores between English and Japanese)

Human Evaluation: Acceptability [Goto et al. 2013] (Python-Japanese)

(Figure: cumulative Acceptability chart, grades 1-5; the share of outputs reaching the highest grade is 50% for PBMT, 63% for Raw-T2SMT, and 72% for Modified-T2SMT.)

Page 26

Results: Code Understanding

● Generated pseudo-code can improve code readability compared with no pseudo-code.

● But reading time increases.

– This comes from generation errors (oracle, i.e. human-written, pseudo-code decreases reading time).

Code Readability and Reading Time (Python-Japanese, Modified-T2SMT)

Group                    | Pseudo-code   | Readability (6-grade Likert) | Mean Reading Time [s]
Experienced (8 people)   | No            | 2.55 | 41.37
                         | Generated     | 2.71 | 46.48
                         | Human-written | 3.05 | 35.65
Inexperienced (6 people) | No            | 1.32 | 24.99
                         | Generated     | 1.81 | 39.52
                         | Human-written | 2.10 | 24.97

Page 27

Conclusion / Future Work

● Summary:

– Generating natural language sentences (which we call pseudo-code) from source statements using statistical machine translation (SMT).

– For the tree-to-string (T2SMT) method, we apply transformation rules to make a parse-like tree.

● Results:

– SMT can generate acceptable sentences.

● 54% BLEU in English; 62% BLEU and 72% highest Acceptability in Japanese.

– Generated sentences can aid code readability.

● However, reading time is slower than with human-written pseudo-code; there is still room for improvement.

● Future Work:

– Considering more complicated generation

● Input: snippets, functions, classes

● Output: multiple sentences, documents

– Applying to more language pairs

– Automating preprocessing