Prague, 28 March 2018
Signe Oksefjell Ebeling
Using the BAWE and VESPA corpora to
explore learners’ and native-speakers’ use of
recurrent word-combinations across
academic disciplines
Outline
• Introducing BAWE and VESPA
– Content
– Encoding of corpus texts
• Study of recurrent word-combinations
– Focus, background, aims
– Material and method
– Hypotheses
– Analysis I: Comparing native and non-native speakers
– Analysis II: Comparing disciplines
– Summary of findings
• Concluding remarks
Introducing BAWE: British Academic Written English corpus
• Compiled as part of the ESRC-funded project An investigation of genres
of assessed writing in British Higher Education (2004-2007)
• Collaboration between Oxford Brookes, Reading and Warwick
Universities
• Collection of 2,800 student assignments at UG and Masters level –
assessed at 2:1 level (60% +), amounting to around 6.5 million running
words.
• Four disciplinary groupings: Arts & Humanities, Life Sciences, Physical
Sciences, Social Sciences
Disciplinary group | Disciplines
Arts & Humanities | Archaeology, Classics, Comp. American Literature, English (literature), History, Linguistics, Philosophy
Life Sciences | Planning, Agriculture, Biological Sciences, Food Sciences, Health, Medicine, Psychology
Social Sciences | Anthropology, Business, Economics, HLTM, Law, Politics, Publishing, Sociology
Physical Sciences | Architecture, Chemistry, Computer Science, Cybernetics/Electronics, Engineering, Mathematics, Physics
Introducing VESPA: Varieties of English for Specific Purposes dAtabase
• The VESPA learner corpus project was initiated in 2008 by the
Centre for English Corpus Linguistics at the Catholic University of
Louvain. Ongoing and ad-hoc funded.
• Corpus of English for Specific Purposes texts written by L2 writers
from various mother tongue (L1) backgrounds:
– Dutch, French, German, Norwegian, Spanish, Swedish
• English L2 texts in a wide range of disciplines (linguistics, business,
medicine, law, biology, literature, etc.), genres (essays, reports, MA
dissertations) and levels (from first-year students to MA (PhD)
students).
• Aim: 500,000 words per discipline per mother tongue
BAWE and VESPA: Text encoding issues
• In order to make the corpora more suitable for linguistic research, the
corpus files have been encoded and stored in a standardised format
(XML; TEI-conformant).
• Metadata (and property rights) from contributors, i.e. students
• Principles of mark-up:
– Keep the structure of the document as close to the original as
possible;
– Mark up elements relevant to our research;
– Should be cost effective.
• Interactive tagging of Word documents
– Facilitates the tagging process;
– Ensures reliability and consistency;
– Makes possible exclusion of annotated material when searching
the corpus.
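The last point, excluding annotated material when searching, can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that text which is not the student's own is wrapped in elements named head, quote and mentioned; the actual tag set in BAWE and VESPA may differ, and collect_text and count_tokens are hypothetical helpers, not part of either corpus toolkit.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names for material that is not the student's own text.
EXCLUDED_TAGS = frozenset({"head", "quote", "mentioned"})


def collect_text(element, excluded=EXCLUDED_TAGS):
    """Return the text of `element`, skipping excluded child elements.

    Text that follows an excluded element (its tail) is kept, because it
    belongs to the surrounding, student-written sentence.
    """
    parts = [element.text or ""]
    for child in element:
        if child.tag not in excluded:
            parts.append(collect_text(child, excluded))
        parts.append(child.tail or "")
    # Join with spaces so that adjacent elements do not fuse into one token.
    return " ".join(parts)


def count_tokens(xml_string, excluded=EXCLUDED_TAGS):
    """Count whitespace-separated tokens outside excluded elements."""
    root = ET.fromstring(xml_string)
    return len(collect_text(root, excluded).split())


sample = (
    "<text><head>Essay title</head>"
    "<p><s>The word <mentioned>dog</mentioned> is a noun.</s></p></text>"
)
print(count_tokens(sample, excluded=frozenset()))  # all tokens
print(count_tokens(sample))                        # student text only
```

Joining with spaces is a simplification: punctuation that directly follows an excluded element would be counted as a separate token.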
Encoding: From MS Word document to TEI-conformant XML text
Macros enabled (Microsoft Word)
Running the macros (VESPA: man)
After running the macros (VESPA: man)
After running VESPA: man → VESPA: auto
Automatic post-processing
The Perl script is used to finalize the formatting process:
• Implement TEI XML structure;
• Normalize hyphens, dashes, quotes, etc.;
• Transform the pseudo-tags that were inserted in the Word files into
genuine XML TEI-conformant tags;
• Import the metadata (i.e. learner profiles) from an external
spreadsheet;
• Mark-up of sentence and paragraph boundaries;
• Mark-up of footnotes and endnotes;
• Numbering of <s> and <p>.
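Some of these steps can be sketched in a few lines. The slides describe a Perl script; the Python below is not that script, only an illustration of normalizing punctuation, converting pseudo-tags and numbering <p> and <s>. The pseudo-tag syntax [[quote]] ... [[/quote]], the naive sentence splitter and the choice to number sentences continuously across paragraphs are all assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape

# Typographic characters mapped to plain ASCII equivalents (illustrative subset).
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

# Hypothetical pseudo-tag syntax such as [[quote]] ... [[/quote]].
PSEUDO_TAG = re.compile(r"\[\[(/?)(\w+)\]\]")


def normalize(text):
    """Normalize quotes and dashes, escape XML characters, convert pseudo-tags."""
    text = text.translate(PUNCT_MAP)
    text = escape(text)  # &, < and > become entities before real tags are added
    return PSEUDO_TAG.sub(r"<\1\2>", text)


def number_paragraphs(paragraphs):
    """Wrap each paragraph in <p n="..."> and each sentence in <s n="...">."""
    out = []
    s_count = 0
    for p_no, para in enumerate(paragraphs, start=1):
        # Naive sentence split: ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", para.strip())
        marked = []
        for sent in sentences:
            s_count += 1
            marked.append(f'<s n="{s_count}">{sent}</s>')
        body = "".join(marked)
        out.append(f'<p n="{p_no}">{body}</p>')
    return "\n".join(out)


print(normalize("\u201cA\u201d \u2013 [[quote]]B & C[[/quote]]"))
print(number_paragraphs(["One. Two.", "Three."]))
```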
After post-processing
Particularly important
• Annotation of text that is not the students’ own, such
as “mentioned items” (e.g. linguistic examples), is
essential.
• There is a clear effect on the number of words when
excluding marked-up text that is not the students’
own (incl. header, mentioned items, block quotes,
etc.)
VESPA-NO | Tokens (overall) | Tokens (excluding annotated material) | Proportion of tokens excluded
BUS | 64,376 | 62,115 | 3.5%
LIN | 563,599 | 427,137 | 24.2%
LIT | 117,483 | 101,708 | 13.4%
TOTAL | 745,458 | 590,960 | 20.7%
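The proportions in the table follow directly from the two token counts. A small sketch, using only the figures given above (the function name proportion_excluded is introduced here for illustration):

```python
# Token counts from the VESPA-NO table: (overall, excluding annotated material).
TOKENS = {
    "BUS": (64_376, 62_115),
    "LIN": (563_599, 427_137),
    "LIT": (117_483, 101_708),
}


def proportion_excluded(overall, remaining):
    """Share of tokens removed when annotated material is excluded, in percent."""
    return 100 * (overall - remaining) / overall


for name, (overall, remaining) in TOKENS.items():
    print(f"{name}: {proportion_excluded(overall, remaining):.1f}%")

total_overall = sum(o for o, _ in TOKENS.values())
total_remaining = sum(r for _, r in TOKENS.values())
print(f"TOTAL: {proportion_excluded(total_overall, total_remaining):.1f}%")
```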
Study of recurrent word-combinations: Native speakers of English vs. learners of English
Students of linguistics vs. students of business
(Ebeling & Hasselgård, 2015)
Background
• The study of recurrent word-combinations is rewarding
“because they give insights into important aspects of the
phraseology used by writers in different contexts” (Scott &
Tribble 2006: 132).
• Although not all such combinations are of phraseological
interest (cf. Altenberg 1998), they serve as a useful
starting point for an investigation of how student writers
apply them across disciplines.
• “Bundles occur and behave in dissimilar ways in different
disciplinary environments.” (Hyland 2008: 20)
Research questions & aims
• What discourse functions do the recurrent
word-combinations have?
• To what extent are the same patterns and
functions used by learners and native
speakers?
• Is L1 background or discipline more decisive
for the use of recurrent word-combinations
and their functions?
Material
Corpus | Linguistics: Texts | Linguistics: Words | Business: Texts | Business: Words
VESPA-NO (L2) | 239 | 267,855 | 70 | 47,335
BAWE (L1, BrE) | 76 | 167,437 | 64 | 141,249
The Linguistics and Business sections of BAWE
and VESPA-NO
Mark-up of VESPA and BAWE excludes e.g. footnotes, block quotes and
headlines. Exclusion of mentioned items possible in VESPA only. (Similar
items in BAWE have been manually removed.)
Method: N-gram extraction
• Inspired by Stubbs & Barth’s (2003) study on recurrent phrases as text-type discriminators.
• Extract the 100 most frequent 3- and 4-grams in each sub-corpus, using WordSmith Tools
• Focus on 3- and 4-grams
– based on Altenberg’s (1998) findings that the majority of recurrent word-combinations cluster as 2-, 3-, or 4-grams, some as 5-grams, and very few as 6-grams;
– and on Stubbs & Barth’s (2003) findings that three-word and four-word chains are better text-type discriminators than e.g. two-word or five-word chains.
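The extraction step can be approximated outside WordSmith Tools. The sketch below is a plain Python stand-in, not a reproduction of WordSmith's behaviour: the tokenizer is a simplification, n-grams are allowed to cross sentence boundaries, and the function names are chosen for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def top_ngrams(texts, n, top=100):
    """Return the `top` most frequent n-grams across `texts` with their counts.

    N-grams are counted within each text only, so no n-gram spans two texts.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(top)


# Toy input, not corpus data.
toy_texts = ["On the other hand it is. On the other hand we see."]
print(top_ngrams(toy_texts, 4, top=1))
```

Calling top_ngrams once with n=3 and once with n=4 for each sub-corpus would give the four lists of 100 items that the analysis compares.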
Functional classification, adapted from Moon (1998: 217-218)
Category | Function | Description | Example
ideational experiential | informational | stating proposition, conveying information | of the brain
 | situational | relating to extralinguistic context, responding to situation | as in tager flusberg
interpersonal | evaluative | conveying speaker’s evaluation and attitude | is likely to
 | modalizing | conveying truth values, advice, requests, etc. | we can see
textual | organizational | organizing text, signalling discourse structure | in this paper
Differs from Moon (1998) in taking the organizational function out
of the ideational, and instead operating with a textual-
organizational function (more in line with Halliday (e.g. 2004))
Hypotheses
• The types of n-grams may differ between learners
and native speakers (cf. Hyland 2008: 7 f, 20);
• The types of n-grams may differ across disciplines
(cf. Hyland 2008: 20);
• Linguistics students will use more organizational n-
grams than business students (cf. Hasselgård 2013);
• Learners will be more visible authors in their texts,
which may also show up in their recurrent word-combinations
(cf. Paquot et al. 2013).
Comparing L1 groups: learners vs. native speakers - Linguistics
Function | BAWE-ling 3-grams | VESPA-ling 3-grams | p-value | BAWE-ling 4-grams | VESPA-ling 4-grams | p-value
Informational | 46 | 57 | 0.157 (p > 0.05) | 42 | 49 | 0.394 (p > 0.05)
Situational | 1 | 0 | | 4 | 0 | 0.129 (p > 0.05)
Evaluative | 24 | 8 | 0.004 (p < 0.01) | 29 | 15 | 0.026 (p < 0.05)
Modalizing | 16 | 9 | 0.199 (p > 0.05) | 11 | 14 | 0.669 (p > 0.05)
Organizational | 13 | 26 | 0.032 (p < 0.05) | 14 | 22 | 0.198 (p > 0.05)
Total | 100 | 100 | | 100 | 100 |
Native speakers: significantly more evaluative 3- and 4-grams
Learners: significantly more organizational 3-grams
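The slides do not name the significance test behind the p-values. As a hedged illustration, the sketch below assumes a chi-square test with Yates' continuity correction on a 2×2 table (n-grams with a given function vs. the rest, out of 100 per sub-corpus); under that assumption it appears to reproduce, for example, the p-value reported for informational 3-grams (46 vs. 57). The function name is hypothetical.

```python
import math


def yates_chi_square_p(count_a, count_b, total_a=100, total_b=100):
    """Two-sided p-value for a 2x2 table, chi-square with Yates' correction.

    The table is [[count_a, total_a - count_a], [count_b, total_b - count_b]].
    """
    a, b = count_a, total_a - count_a
    c, d = count_b, total_b - count_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero; the test is undefined")
    # Yates' correction: shrink |ad - bc| by n/2, but never below zero.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Informational 3-grams: BAWE-ling 46 vs. VESPA-ling 57 (out of 100 each).
print(round(yates_chi_square_p(46, 57), 3))
# Evaluative 3-grams: BAWE-ling 24 vs. VESPA-ling 8.
print(round(yates_chi_square_p(24, 8), 3))
```

Rows where neither group has any n-grams of a type (0 vs. 0) leave the test undefined, which is why the function raises an error in that case.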
Comparing L1 groups: learners vs. native speakers - Business
Function | BAWE-bus 3-grams | VESPA-bus 3-grams | p-value | BAWE-bus 4-grams | VESPA-bus 4-grams | p-value
Informational | 64 | 73 | 0.223 (p > 0.05) | 65 | 80 | 0.027 (p < 0.05)
Situational | 0 | 0 | | 1 | 1 |
Evaluative | 8 | 6 | 0.782 (p > 0.05) | 12 | 4 | 0.068 (p > 0.05)
Modalizing | 9 | 1 | 0.023 (p < 0.05) | 2 | 4 | 0.679 (p > 0.05)
Organizational | 19 | 20 | 1 (p > 0.05) | 20 | 11 | 0.118 (p > 0.05)
Total | 100 | 100 | | 100 | 100 |
Native speakers: (significantly) more modalizing (but small numbers)
Learners: significantly more informational 4-grams
Shared n-grams across the L1 groups (frequencies and examples): Linguistics
Function | 3-grams | 4-grams
informational | 16: that there are, the number of, the use of, part of the… | 7: and the use of, at the end of, by the use of…
situational | 0 | 0
evaluative | 7: in the same, meaning of the, the fact that… | 7: as well as the, in the same way, it is important to…
modalizing | 6: be found in, can also be, can be seen, can be used… | 5: can be found in, can be seen in, it is possible to…
organizational | 7: an example of, in this case, in this essay, looking at the… | 6: an example of this, example of this is, in this case the…
36% of 3-grams and 25% of 4-grams are shared.
Shared n-grams across the L1 groups (full list): Business
Function | 3-grams | 4-grams
informational | a lot of, that they are |
situational | | at the same time
evaluative | is important to, it is important, the importance of | it is important to
modalizing | |
organizational | based on the, in order to, there is a, one of the | on the other hand
9% of 3-grams and 3% of 4-grams are shared.
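The shared percentages can be read as the overlap between two top-100 lists: since each list has 100 items, the number of shared n-grams equals the percentage. A minimal sketch of that computation follows; the helper name and the two short lists are made up for illustration and are not the actual corpus lists.

```python
def shared_ngrams(list_a, list_b):
    """Return the n-grams found in both lists and their share of list_a in percent.

    Assumes each list contains every n-gram at most once.
    """
    shared = sorted(set(list_a) & set(list_b))
    return shared, 100 * len(shared) / len(list_a)


# Toy lists, not the actual top-100 lists from BAWE or VESPA-NO.
toy_a = ["in order to", "a lot of", "we can see", "this is a"]
toy_b = ["in order to", "a lot of", "it is clear", "the nature of"]

shared, percent = shared_ngrams(toy_a, toy_b)
print(shared, percent)
```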
Features typical of the Norwegian learners
• Function
– Ideational and textual
• Generally use more informational n-grams than the native speakers
• Use slightly more organizational n-grams than the native speakers
• Form
– n-grams with author presence (as hypothesized):
• i will look at; in this paper i; i would like to; i will discuss; we can see that
– some n-grams are overrepresented:
• when it comes to
Features typical of the native speakers
• Function
– Ideational and interpersonal
• Use fewer informational n-grams than the learners, but this is still the predominant function
• Generally use more evaluative and modalizing n-grams than the Norwegian learners
• Form
– Non-personal (self) projection (e.g. it is clear that, it is argued that)
– Complex noun phrases (e.g. the majority of the, the nature of the, as a result of)
– N-grams that reflect passive voice (e.g. it can be seen)
Comparing disciplines: learners
Function | VESPA-ling 3-grams | VESPA-bus 3-grams | p-value | VESPA-ling 4-grams | VESPA-bus 4-grams | p-value
Informational | 57 | 73 | 0.026 (p < 0.05) | 49 | 80 | 9.287e-06 (p < 0.0001)
Situational | 0 | 0 | | 0 | 1 |
Evaluative | 9 | 7 | 0.794 (p > 0.05) | 15 | 4 | 0.016 (p < 0.05)
Modalizing | 9 | 1 | 0.023 (p < 0.05) | 14 | 4 | 0.026 (p < 0.05)
Organizational | 25 | 19 | 0.393 (p > 0.05) | 22 | 11 | 0.057 (p > 0.05)
Total | 100 | 100 | | 100 | 100 |
Business: significantly more informational
Linguistics: significantly more evaluative/modalizing
Comparing disciplines: native speakers
Function | BAWE-ling 3-grams | BAWE-bus 3-grams | p-value | BAWE-ling 4-grams | BAWE-bus 4-grams | p-value
Informational | 46 | 64 | 0.016 (p < 0.05) | 42 | 65 | 0.002 (p < 0.01)
Situational | 1 | 0 | | 4 | 1 | 0.365 (p > 0.05)
Evaluative | 25 | 9 | 0.005 (p < 0.01) | 29 | 12 | 0.005 (p < 0.01)
Modalizing | 16 | 9 | 0.199 (p > 0.05) | 11 | 2 | 0.022 (p < 0.05)
Organizational | 12 | 18 | 0.322 (p > 0.05) | 14 | 20 | 0.347 (p > 0.05)
Total | 100 | 100 | | 100 | 100 |
Business: Significantly more informational (as in VESPA), more organizational
grams (unlike VESPA, but not significant).
Linguistics: Significantly more evaluative/modalizing (as in VESPA)
Shared n-grams across the disciplines (full list): learners
Function | 3-grams | 4-grams
informational | |
situational | |
evaluative | | it is important to
modalizing | we can see | I would like to, we can see that
organizational | in this essay, it comes to, on the other, the other hand, when it comes | at the same time, in this essay I, on the other hand, the other hand is, this essay I will, when it comes to
6% of 3-grams and 9% of 4-grams are shared.
Shared n-grams across the disciplines (frequencies and examples): native speakers
Function | 3-grams | 4-grams
informational | 18: a number of, it is a, part of the, such as the, that it is … | 8: at the end of, in the form of, the nature of the…
situational | 0 | 0
evaluative | 6: as well as, due to the, is important to, the fact that, the importance of… | 5: as well as the, it is clear that, it is important to, the fact that the …
modalizing | 4: be able to, can be seen, it can be, need to be | 1: to be able to
organizational | 5: a result of, as a result, in terms of, in this case, one of the | 3: a result of the, as a result of, on the other hand
33% of 3-grams and 17% of 4-grams are shared.
Features typical of linguistics
• Function
– Ideational and interpersonal
• Predominantly informational (most overlap between the L1 user
groups)
• Topic-specific
• Generally more evaluative and modalizing n-grams than the
business students
• Form
– Complex noun phrases (e.g. at the end of, by the use of, in the
case of)
– N-grams with can predominate in the modalizing function (can be
found in, can be seen in, can also be)
Features typical of business
• Function
– Ideational
• Highly informational and topic-specific, even more so
than was the case for linguistics
• Some organizational features and very few of the others
• Form
– Hardly any overlapping n-grams across the two L1
groups, apart from evaluative grams including
importan* (is important to, it is important, the
importance of)
Hypotheses revisited
• The types of n-grams may differ between learners and native speakers
– Yes, and to a greater extent among business students than among linguistics students.
• The types of n-grams may differ across disciplines
– Yes, and to a greater extent among the learners than among the native speakers.
• Linguistics students will use more organizational n-grams than business students
– This is a slight tendency in the NNS material, but the difference is not significant.
– Hasselgård (2013) found a more widespread use of metadiscourse in linguistics than in business. But this is only one type of organizational grams.
• Recurrent word-combinations will reveal learners as more visible authors in their texts
– Learners used more n-grams involving a first-person pronoun than the native speakers in both disciplines. However, this was not such a salient feature as expected.
Summary of findings: functional types of
n-grams
• The n-gram approach and the classificatory framework enabled us to
identify differences between both disciplines and L1 groups;
• The ideational/informational grams are typical for both L1 groups and
both disciplines – above 50% in all subcorpora except NS linguistics;
– Not surprising, since academic disciplines have been found to be highly
informational and we are dealing with novice academic writers.
• Situational n-grams were rare in all the corpora (found only in NS
linguistics);
• Evaluative and modalizing n-grams were more frequent in linguistics
than in business in both L1 groups;
• Organizational n-grams were more frequent in linguistics in VESPA
and more frequent in business in BAWE.
Summary of findings: Learners vs. native speakers
• Far fewer overlapping n-grams across the disciplines among the learners than among the native speakers;
• More overlapping n-grams between learners and native speakers in linguistics than in business: the linguistics papers are more similar across the L1 groups than the business papers;
• Some features of the distribution of functional types of n-grams mark learners off from native speakers in both linguistics and business:
– The learners in both disciplines have fewer modalizing and evaluative grams.
– A slight tendency for the learners to use more informational n-grams (significant only with 4-grams in business).
– In linguistics, the learners have more organizational grams than native speakers, but in business they have slightly fewer.
Summary of findings: Learners vs.
native speakers (cont.)
• Differences in the form of the n-grams
– Learners: more n-grams involving 1st person pronouns
– Native speakers: more n-grams with non-personal
projection (extraposition)
– Native speakers: more n-grams suggesting complex
noun phrases and verb phrases with passive voice
Summary of findings: Disciplinary
differences (linguistics vs. business)
• The discipline comparison involved more statistically
significant differences than the NS/NNS comparison.
• There are more overlapping n-grams between the
corpora in linguistics than in business.
• Linguistics students have fewer informational n-grams
than business students (across L1 backgrounds).
• Linguistics students have more evaluative and modalizing
n-grams than business students.
Further work
• Analysis of token frequency of n-grams;
• Inclusion of more material from the business
discipline (esp. learners);
• Comparison with other disciplines
• Comparison with other L1 backgrounds
• Comparison with ‘specialist’ writing in the
same disciplines.
Applications
• Disciplinary and L1-specific use of n-grams could feed
into EAP courses & teaching materials.
– Functional types of n-grams that differ greatly across L1
background (e.g. modal / evaluative; clausal vs. nominal n-
grams)
– Functional (and structural) types of n-grams that are typical for
academic disciplines
• Findings indicate a greater need for more explicit
instruction among the learners in business.
Concluding remarks
• Challenges of corpus comparability
– Business BAWE ≈ Business VESPA-NO
– Assessed writing (BAWE) vs. non-assessed (VESPA)
– Other variables (e.g. proficiency level of learners; task
prompts)
• Shown potential of comparable corpora of novice writing
– Across L1 and L2 writers
– Across disciplines
• Add more L1 backgrounds to the VESPA family!
Works consulted
Ädel, A. & U. Römer (2012) Research on advanced student writing across disciplines and levels. Introducing the Michigan Corpus of Upper-level Student Papers. International Journal of Corpus Linguistics 17:1, 3-34.
Altenberg, B. (1998) On the phraseology of spoken English: The evidence of recurrent word-combinations. In A.P. Cowie (ed.), Phraseology. Theory, Analysis, and Applications. Oxford: Oxford University Press. 101-122.
Biber, D., S. Johansson, G. Leech, S. Conrad & E. Finegan (1999) Longman Grammar of Spoken and Written English. London: Longman.
Ebeling, S.O. (2011) Recurrent word-combinations in English student essays. Nordic Journal of English Studies 10:1, 49-76.
Ebeling, S.O. & H. Hasselgård (2015) Learners' and native speakers' use of recurrent word-combinations across disciplines. Bergen Language and Linguistics Studies (BeLLS), 87-106.
Ebeling, S.O. & A. Heuboeck (2007) Encoding document information in a corpus of student writing: The British Academic Written English corpus. Corpora 2:2, 241-256.
Halliday, M.A.K. (2004) An Introduction to Functional Grammar. 3rd ed., revised by C.M.I.M. Matthiessen. London: Arnold.
Hasselgård, H. (2012) Facts, ideas, questions, problems, and issues in advanced learners’ English. Nordic Journal of English Studies 11:1, 22-54.
Hasselgård, H. (2013) Metadiscourse in novice academic English. Paper given at ICAME34, Santiago de Compostela.
Heuboeck, A., J. Holmes & H. Nesi (2008) The BAWE Corpus Manual. University of Warwick, University of Reading, Oxford Brookes University.
Hyland, K. (2008) As can be seen. Lexical bundles and disciplinary variation. English for Specific Purposes 27, 4-21.
Moon, R. (1998) Fixed Expressions and Idioms in English. A Corpus-based Approach. Oxford: Clarendon Press.
Paquot, M., S.O. Ebeling, A. Heuboeck & L. Valentin (2015) The VESPA tagging manual v2.3. CECL, Université catholique de Louvain.
Paquot, M., H. Hasselgård & S.O. Ebeling (2013) Writer/reader visibility in learner writing across genres. A comparison of the French and Norwegian components of the ICLE and VESPA learner corpora. In S. Granger, G. Gilquin & F. Meunier (eds), Twenty Years of Learner Corpus Research: Looking back, Moving ahead. Corpora and Language in Use - Proceedings 1, Louvain-la-Neuve: Presses universitaires de Louvain, 377-388.
Scott, M. & C. Tribble (2006) Textual Patterns. Key Words and Corpus Analysis in Language Education. Amsterdam / Philadelphia: John Benjamins.
Stubbs, M. & I. Barth (2003) Using recurrent phrases as text-type discriminators. A quantitative method and some findings. Functions of Language 10:1, 61-104.
Corpora
BAWE: http://www.coventry.ac.uk/research/research-directories/current-projects/2015/british-academic-written-english-corpus-bawe/
BAWE was developed at the Universities of Warwick, Reading and
Oxford Brookes under the directorship of Hilary Nesi and Sheena
Gardner (formerly of the Centre for Applied Linguistics [previously
called CELTE], Warwick), Paul Thompson (Department of Applied
Linguistics, Reading) and Paul Wickens (Westminster Institute of
Education, Oxford Brookes), with funding from the ESRC (RES-000-23-0800).
VESPA: http://www.uclouvain.be/en-cecl-vespa.html
VESPA-NO: http://www.hf.uio.no/ilos/english/services/vespa/