The SLALS Corpus Handbook - University of Reading · 2008-01-28 · The purpose of this handbook is to explain how to make the connection to the Midwich server, and how to make the


The SLALS Corpus Handbook

Using the corpus resources in the

School of Linguistics and Applied Language Studies

Contents:

Introduction
1 Making a connection to the corpus server
2 Installing SARA
3 Installing the ICECUP 3 program
4 Using the BNC
5 Using ICECUP 3 to investigate ICE-GB
6 Using Wordsmith Tools

SLALS Corpus Handbook (v. 1) 1 PT 01/10/2003

Introduction

The School of Linguistics and Applied Language Studies has a number of corpora available for linguistic research. These corpora are kept on a server owned by SLALS. The server, which is named ‘Midwich’, can be accessed from any computer on the university network. Please note: Midwich cannot be accessed from outside the university network.

The corpora currently available on Midwich are: the British National Corpus (BNC), the International Corpus of English British Component (ICE-GB), and the ICAME collection, which includes:

Written
    Brown Corpus
    Lancaster-Oslo-Bergen Corpus (LOB)
    Freiburg-LOB (FLOB)
    Freiburg-Brown (Frown)
    Kolhapur Corpus (India)
    Australian Corpus of English (ACE)
    Wellington Corpus (New Zealand)
    The International Corpus of English - East African component

Spoken
    London Lund Corpus
    Lancaster/IBM Spoken English Corpus (SEC)
    Corpus of London Teenage Language (COLT)
    Wellington Spoken Corpus (New Zealand)
    The International Corpus of English - East African component

Historical
    The Helsinki Corpus of English Texts: Diachronic Part
    The Helsinki Corpus of Older Scots
    Corpus of Early English Correspondence
    The Newdigate Newsletters
    Lampeter Corpus
    Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET)
    Polytechnic of Wales Corpus
    Lancaster Parsed Corpus (LOB)

A detailed explanation of the ICAME corpus collection will be placed in the ICAME folder on Midwich. Further corpora will be added in the coming months.

A corpus is simply a data bank, a collection of texts. To search the data bank, you need to use a corpus analysis programme. When you investigate the British National Corpus, you will probably use the programme called ‘SARA’. In this handbook you will find information about how to install SARA on the computer that you are using (if it is not already installed) and the basics of how to use SARA. If you are going to use the ICE-GB corpus, you will similarly find installation instructions, and a brief introduction. Please note that you will only need to install either programme once.

There are other corpus analysis programmes that you can use. The computers in the SLALS Corpus Facility room (Ground Floor of the HumSS building) and in Room 181 (HumSS) have WordSmith Tools installed and you can also download a free concordancer (ConcApp) from the Internet (http://www.edict.com.hk/PUB/concapp/). To use either of these programmes with the SLALS corpus collection, you will first need to make a connection to the Midwich server (see instructions in the next section), and then you will be able to open the files in the corpus folders on the Z drive. The corpus analysis programmes must be on the computer that you are using.

It is useful to bear in mind, as you read through these instructions, that the corpus is in one place (one machine), but the analysis programme is in another.

[Figure: the collections of text on the corpus server, connected to the corpus analysis programme on your computer]

When you want to work with the corpora, you will first of all need to make a connection from the machine that you are working on to the corpus server, and then you will need to install the programme on your machine (if you haven’t got it already) and open it. Installation is needed only once, but you will need to connect and open the programme every time that you want to work with the corpora:

1 Connect to the corpus server
2 Open the corpus analysis programme

In the following pages, you will learn how to make a connection to Midwich. This is essential reading. You will then learn how to install two programmes on to your computer. You only need to read these sections if you have to install the programmes. The final 3 sections look at how to use 3 programmes: SARA (for the BNC); ICECUP 3 (for ICE-GB); Wordsmith Tools (other corpora).

The purpose of this handbook is to explain how to make the connection to the Midwich server, and how to make the installations necessary if you are to work with the corpora. The handbook also contains a quick introduction to some of the major features of the corpus analysis programmes. It does not, however, in any way aim to provide a comprehensive overview of the programmes, nor of the range of approaches to corpus analysis. For either of these, you will need to consult the Help files attached to the programmes and refer to the relevant literature.

Section 1: Making a connection to the corpus server (Midwich)

These instructions assume that you are using a computer on the university network, running either Windows 2000 or Windows XP. It is not possible, unfortunately, to connect to the server on a machine running earlier versions of Windows (for example, Windows 98); if you need to upgrade your machine, consult the School Computing Officer, Gerry Latawiec. The software will not run on any operating system except Windows.

1 Right-click on the ‘My computer’ (or equivalent) icon on the desktop, and choose ‘Map Network Drive’:

2 Choose Drive Z, and type on the line for ‘Folder’: \\midwich\corpora

3 Click ‘Finish’. In the next window, enter ‘study’ as the user name and ‘corpora’ for the password, then click ‘OK’. You should now see the Corpora folder:

4 You can now move on to the second stage: either installing or opening a corpus analysis programme. In the next section, we will look at how to install the SARA programme for use with the BNC.
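
The same mapping can also be made from a script rather than through the ‘Map Network Drive’ dialog. The following is only a minimal sketch in Python, assuming a Windows machine on the university network; it builds a call to the standard Windows `net use` command with the drive letter, share, user name and password given in the steps above. The function names are illustrative and are not part of any SLALS software.

```python
import platform
import subprocess


def build_net_use_command(drive="Z:", share=r"\\midwich\corpora",
                          user="study", password="corpora"):
    """Return the argument list for the Windows `net use` command."""
    # General form: net use <drive> <share> <password> /user:<name>
    return ["net", "use", drive, share, password, f"/user:{user}"]


def map_corpus_drive():
    """Map the corpus share to drive Z; only attempted on Windows."""
    if platform.system() != "Windows":
        raise OSError("Drive mapping with 'net use' is only available on Windows.")
    subprocess.run(build_net_use_command(), check=True)


if __name__ == "__main__":
    # Show the command that would be run, without running it.
    print(" ".join(build_net_use_command()))
```

Keeping the command construction separate from the call that runs it makes it possible to inspect the command on any machine before trying it on the network.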

Section 2: Installing SARA

You can find out whether you have the BNC SARA programme on your computer by going to the ‘Program Files’ folder and looking for a folder in there called ‘Sara-32’. If this folder does not exist, you need to install SARA and configure it. If you do have the folder, proceed to Section 4, or to ‘Opening Sara’ later in this section. [If you are using a computer in one of the university computer labs, first open the N: drive and create a folder called: Program Files]

1 Open Internet Explorer, and type in: http://www.hcu.ox.ac.uk/SARA/saraclient.exe
2 When prompted, choose ‘Save’ and place the programme in your Program Files folder. [If you are using a computer in one of the university computer labs, save the programme to your N: drive, in the ‘Program Files’ folder]
3 Close Internet Explorer.
4 Go to the Program Files folder. Locate the saraclient programme and double-click on it to open it. The install programme will begin. Follow the on-screen instructions. On the 3rd screen you will be asked where the files should be placed – the default (which is shown) is C:\Program Files\sara-32. If you are using a computer in a computer lab, you have to change the ‘C’ at the beginning of the address to ‘N’, so that it becomes: N:\Program Files\sara-32
5 After clicking ‘Yes’ for the third screen, you will see a message saying that the folder does not exist and asking whether you want the folder to be created – click ‘OK’.
6 When you have completed the installation, you will be asked if you want to open the programme. Choose NOT to. You have now created a folder called Sara-32 in your Program Files folder.
7 Go to the Start menu, and choose Run. Type: write.exe, and then click OK.
8 When WordPad has opened, select Open from the File menu. At the bottom of the Open window, change the type of files to open to ‘All documents’.
9 Navigate to the Sara-32 folder inside the Program Files folder, and then select a file called ‘corpus.prm’. Open it. The contents of the file are as follows:

10 Make the following changes to the file [NOTE: bnc must be in lower case, and Etc and Texts begin with upper case]:

Change the ETC line to: ETC=Z:\bnc\Etc\
Change the IDX line to: IDX=Z:\bnc\index\
Change the TXT line to: TXT=Z:\bnc\Texts\

[If you are using a computer in a university computer lab, make the following changes:

Change the ETC line to: ETC=Z:\bnc\Etc\
Change the IDX line to: IDX=Z:\bnc\index\
Change the TXT line to: TXT=Z:\bnc\Texts\
Change the ACC line to: ACC=N:\Program Files\Sara32\

] These can be seen in the screen shots below:

11 Choose ‘Save As’ from the ‘File’ menu.
12 Give the file the name: “midwich.prm” You must type this exactly as you see it, with the double quotation marks at the beginning and end [but not in bold].
13 Then click on ‘Save’.
14 Close the WordPad programme.

Opening Sara

1 Go to the Sara programme. You can open this by clicking on the ‘BNC’ (or ‘BNC Online’) icon on the desktop – if you can’t see an icon on the desktop, go to the ‘Sara-32’ folder in the ‘Program Files’ folder, and double-click on ‘Sara32’. You should see this screen appear:

2 In the ‘About SARA’ window, click on ‘Menu’. You will see a menu of servers appear. If you can see a server named ‘Midwich’, click ‘OK’ and return to ‘About SARA’, then click ‘OK’ again. This should open the connection to the Midwich server. If there is no server named ‘Midwich’ in the menu of servers, or if you failed to make a connection to the server, do the following:

3 In the ‘Menu’ window, click on ‘Add’.

4 This will open the ‘Server Properties’ window. Type ‘Midwich’ in the Name box, and then click on the ‘Local’ check box. Next click on ‘Browse’.

5 Navigate to the ‘Sara-32’ folder (in C:\Program Files\, or in N:\Program Files\ if you are in a computer lab), and select the ‘midwich.prm’ file.
6 Click on ‘Open’. When you return to the ‘Server Properties’ window, click on ‘OK’.
7 In the ‘Server list’ window, click on the ‘Midwich’ folder icon, and then click on ‘Set Default’ in the bottom right corner. You should see a small arrow appear next to the ‘Midwich’ line:

8 Make sure that the word ‘Midwich’ is still highlighted and then click ‘OK’. After a while (sometimes this can take much more than a minute!), the window will disappear and the SARA window will open up. It will look something like this:

To learn how to use some of the main features of SARA, move now to Section 4. To learn how to install the ICECUP 3 programme for use with the ICE-GB corpus, read Section 3.
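
The hand edit of ‘corpus.prm’ described earlier in this section (changing the ETC, IDX and TXT lines and saving the result as ‘midwich.prm’) can also be pictured as a small text transformation. The sketch below is illustrative only: it assumes that the file consists of plain KEY=VALUE lines, which is how the lines appear in the instructions, and the sample input in the usage lines is invented rather than copied from a real ‘corpus.prm’.

```python
def point_to_midwich(prm_text, acc_path=None):
    """Return prm_text with the ETC, IDX and TXT lines pointing at drive Z.

    If acc_path is given (the computer-lab case), the ACC line is changed too.
    Lines with any other key are left untouched.
    """
    new_values = {
        "ETC": "Z:\\bnc\\Etc\\",
        "IDX": "Z:\\bnc\\index\\",
        "TXT": "Z:\\bnc\\Texts\\",
    }
    if acc_path is not None:
        new_values["ACC"] = acc_path

    out = []
    for line in prm_text.splitlines():
        key, sep, _ = line.partition("=")
        key = key.strip()
        if sep and key in new_values:
            out.append(f"{key}={new_values[key]}")
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "ETC=C:\\somewhere\\\nIDX=C:\\somewhere\\\nTXT=C:\\somewhere\\"
    print(point_to_midwich(sample))
```

The result would then be saved under the name midwich.prm in the Sara-32 folder, as in steps 11 to 14.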

Section 3: Installing the ICECUP 3 Program

The ICE-GB corpus is fully tagged for part-of-speech (POS) and it is also parsed. It is an excellent resource for investigating the grammatical features of modern-day British English. To investigate the ICE-GB corpus, you will need to install the dedicated corpus analysis programme called ICECUP 3. If you are using a free-standing computer, you should follow the instructions below. If you are using a computer on the university network in one of the university computer labs, you can simply make a connection to Midwich (see Section 1) and then copy the folder called ‘DOWNLOAD’ from the ‘ICECUP’ folder in the Z drive into your N drive. The new folder in your N drive must have the path: N:\DOWNLOAD

Installing to a C drive

Firstly, you need to download the programme from the web.

1 Open your browser, and type in the following URL: http://www.ucl.ac.uk/english-usage/ice-gb/sampler/ice-gb-c.exe
2 Save the programme to your ‘Program Files’ folder.
3 Close your browser.
4 Go to the Start menu, and choose ‘Run’.
5 In the ‘Run’ window, choose ‘Browse’ and then navigate to ice-gb-c.exe in your Program Files folder. Click ‘Open’.
6 In the ‘Run’ window, add a space followed by a hyphen and a lower case d after the information already entered. The complete line should look like this: "C:\Program Files\ice-gb-c.exe" -d
7 Click ‘OK’.
8 This will create a folder called ‘DOWNLOAD’ in your Program Files folder. Go to the DOWNLOAD folder, and open it. Double-click on the ‘Install’ icon, and choose to install the ICECUP programme by clicking on the ‘Install’ button.
9 At the end of the process, you will be able to see the ICECUP programme opening page. Close the program by selecting ‘Exit’ from the ‘Corpus’ menu.

In the next stage, you need to instruct the computer where the corpus files are located. To do this, you need to change a file in the ‘Windows’ folder.

Setting the paths

1 Go to the Start menu, and choose Run.
2 Type: write.exe, and then click OK.
3 When WordPad has opened, select Open from the File menu. At the bottom of the Open window, change the type of files to open to ‘All documents’.
4 Navigate to the Windows folder in the C: drive, and then select a file called ‘icecup3’. Open it. You need to change the contents of the file so that it looks as follows:

[paths]
corp: Z:\ICE-GB\DATA
indx: Z:\ICE-GB\INDEX
lex: Z:\ICE-GB\LEXICAL
node: Z:\ICE-GB\NODAL
mark: Z:\ICE-GB\MARKUP
var: Z:\ICE-GB\VARS
head: Z:\ICE-GB\TEXT
data: C:\ICECUP3\ICEDATA
filt: C:\ICECUP3\ICEDATA
save: C:\output
help: C:\ICECUP3\icecup3.hlp
gets: C:\ICECUP3\iceget.hlp

You must enter the information exactly as it appears above.

5 When you have finished, save the file and close the programme.

Opening the ICECUP 3 program

The ICECUP 3 programme has been placed in a new folder on your C drive called ‘ICECUP3’. Open this and double-click on the ‘ICECUP’ icon. This should open the ICECUP programme. If it does not, a possible reason is that you did not copy the list of paths correctly – check that file and make sure that you did not leave out any letters or strokes, and did not type in lower case instead of upper case. Another possibility is that you have not yet opened a connection to the Z drive (see Section 1 above).

When ICECUP opens you should see that there are 500 texts in the corpus – if you don’t, and the number given is 20, then you are looking at the sample corpus, not the full corpus. You need to go through the previous steps (in ‘Setting the paths’) again.
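
Because a mistyped path is the most likely reason for seeing the 20-text sample corpus, it can help to check the paths file mechanically. The following Python sketch is an assumption-laden illustration, not part of ICECUP: it treats the file as a [paths] header followed by "key: value" lines, as shown in ‘Setting the paths’, and reports which of the seven corpus entries do not point to the ICE-GB folder on drive Z. The function names are hypothetical.

```python
# The seven entries that should point at the full corpus on the Z drive.
CORPUS_KEYS = ("corp", "indx", "lex", "node", "mark", "var", "head")


def parse_paths(text):
    """Parse 'key: value' lines into a dict, skipping the [paths] header."""
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        # Split on the first colon only, since values such as Z:\... contain one.
        key, sep, value = line.partition(":")
        if sep:
            paths[key.strip()] = value.strip()
    return paths


def corpus_keys_not_on_z(paths):
    """Return the corpus keys that are missing or do not start with Z:\\ICE-GB\\."""
    # The comparison is case-sensitive because the handbook asks for exact case.
    return [key for key in CORPUS_KEYS
            if not paths.get(key, "").startswith("Z:\\ICE-GB\\")]


if __name__ == "__main__":
    example = "[paths]\ncorp: Z:\\ICE-GB\\DATA\nindx: Z:\\ICE-GB\\INDEX\n"
    print(corpus_keys_not_on_z(parse_paths(example)))
```

An empty list would suggest that all seven corpus entries point to the Z drive; any key that is listed is worth re-checking in the file.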

Section 4: Using the BNC

The BNC

A corpus is basically a collection of texts - a data bank. The texts in the BNC are called BNC documents. The BNC contains over 4000 documents and 100 million words; the proportion of written text to transcribed speech is 9:1. Each document is made up of two parts: a header and a body. The header contains information about the text that follows: its provenance, the creators of the text, etc. The body section contains text which has been coded using SGML (Standard Generalized Markup Language). The tags are placed in angle brackets, as in the following: <p>. The <p> tag indicates the beginning of a paragraph.

For an explanation of the tags, look at the BNC Handbook (Aston and Burnard, 1998), which can be found in the SLALS Library. All that is worth mentioning here is that the whole corpus has been tagged for part-of-speech (POS) following the CLAWS tagset, and information on the tags used appears in the appendix to this document. Each word is preceded by a tag of the form: <w XXX>, where XXX stands for a particular code used to identify a part of speech. Adjective codes begin AJ, adverbs AV, nouns NN, verbs V (lexical verbs = VV; verb ‘be’ = VB; verb ‘do’ = VD; verb ‘have’ = VH), and prepositions PRP, to name a few. Here is an example, adapted from the BNC:

<w NN1>Cybernetics <w VVZ>takes <w NN1>linguistics <w CJS>as <w DPS>its <w NN1>model <w NN1>discourse<c PUN>, <w CJC>and <w AT0>the <w NN1>analogy <w PRP>between <w NN1-AJ0>narrative <w CJC>and <w NN1>programming <w VVZ>provides <w AT0>the <w AJ0>underlying <w NN1>conceit <w PRF>of <w AT0>the <w NN1>novel<c PUN>.
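
To make the tag format concrete, here is a minimal Python sketch that splits a tagged string of this kind into (code, word) pairs. It assumes only what the example above shows, namely that each token is introduced by a <w CODE> or <c CODE> tag; it is not how SARA itself reads the corpus, and the function names are hypothetical.

```python
import re

# A <w ...> or <c ...> tag, its code, and the text up to the next tag.
TOKEN = re.compile(r"<([wc]) ([^>]+)>([^<]*)")


def parse_tagged(text):
    """Return a list of (code, form) pairs from BNC-style tagged text."""
    return [(code.strip(), form.strip())
            for _kind, code, form in TOKEN.findall(text)]


def forms_with_prefix(tokens, prefix):
    """Return the word forms whose code starts with prefix, e.g. 'NN' for nouns."""
    return [form for code, form in tokens if code.startswith(prefix)]


if __name__ == "__main__":
    sample = "<w NN1>Cybernetics <w VVZ>takes <w NN1>linguistics<c PUN>."
    tokens = parse_tagged(sample)
    print(tokens)
    print(forms_with_prefix(tokens, "NN"))
```

Matching on the start of the code mirrors the way the codes are grouped in the text above (AJ for adjectives, NN for nouns, and so on).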

SARA

The programme for searching the BNC is SARA (which is an acronym for SGML Aware Retrieval Application). The following pages present an introduction to some of the features of SARA, but, as with the tagsets, you are recommended to use the BNC Handbook to find out what all the features of the SARA programme are.

To search the corpus, you have to form a query. SARA provides a sophisticated range of options for forming queries:

word: you can search for a particular word (a string of letters)
phrase: you can search for a phrase
POS: allows you to define the POS that you are looking for in a string
pattern: you can search for particular patterns
SGML: allows you to search for particular tagged features
query builder: allows you to combine any of the other options in a complex query
CQL: Corpus Query Language

There is a button for each of these query types as shown in the toolbar. If you place your cursor over a button, you should see a small box with the name of the button appear. Try this and identify the seven query buttons.

Word query

Let us look first at a simple word query.

Click on the Word Query button.

Type in the word 'corpus', and then click on Lookup. A list of words that appear under this heading in the BNC index will appear. By clicking on any of the words you can see how often that word form appears in the BNC.

Choose the word 'corpuses' and then click OK. The programme will now request data from the database (on the BNC server) and send the results to you.

Displaying results

If you can see only one concordance line, go to the Query menu and choose Concordance. You should then see nine lines as in the screen shot below:

SARA offers you a range of options for how to show results, how many to show, what extra information can be found, how to manipulate the results, and how to save them.

In the Query menu:

The Edit option allows you to change your search term.

The Sort option allows you to sort the concordance lines by the 1st word to the left of the search term, or to the right.

The Thin option allows you to delete lines. To select a line, double-click on it or single-click and then press the Space bar. When you have selected your lines, choose Thin – Selection to delete all the unselected lines, and Thin – Reverse Selection to delete all the selected lines.

Options determines how the results will be shown. The format options are Plain, POS and SGML. Try each of these. You will find that the POS format shows each POS in a different colour. If you right click on a word, an explanation of the POS category will be shown. The range options are: automatic, sentence, paragraph. This determines how much co-text will be shown.

Query text opens a frame at the top which shows what your query is. This is useful when you begin to form more complex queries, and you need either to learn the query syntax or to remind yourself of the parameters of your query.

Annotation gives you a space to write notes about the query results.

Listing allows you to save the results as an XML file (see section on Saving below).

Source provides information about the source text. It is easier to access this from the Bibliographic Data button (next to the Question Mark button on the row of buttons at the top of the window).

It is also important to set your own preferences when you begin a long session of work. Go to the View menu and choose Preferences.

This determines the default display for any query results. You can choose to increase the number of lines that you download (be aware though that large quantities of data can be overwhelming, and can also take a long time). You can also set the format and scope, and it is advisable to tick the Concordance box, so that you will always be shown the list of concordance lines first.

A POS query

In a Word query, if you look for the word 'sweet' you will receive concordance lines containing sweet as a noun, as an adjective, and possibly also as an adverb or verb. If you want to look only at examples of sweet as an adjective, you should use a POS query.

Click on the POS query button. In the L-word box, type 'sweet'.

Click in the POS box below, and a list of POS tags will appear.

Select AJ0. You will see an explanation of the tag appear to the right: ‘general adjective’. Click on the next POS tag and read the explanation.

Choose AJ0 and AJ0/NN1 (this tag indicates that the automatic tagger was uncertain whether the word was an adjective or noun; the BNC is not fully post-edited, so there are many cases of ambiguity left, and also errors) and click OK.

The ‘Too many solutions’ box will appear. Choose to view one solution per text.

Note that you only get 100 lines. If you want more you will need to change the number in the ‘Too many solutions’ box (or change your Preferences).

Query builder

In spoken English, who uses adjectival 'sweet' most, males or females? And in what senses?

To answer these questions, we can use a combination of the POS query with SGML queries.

First, click on the query builder button. This is what you should see:

Click in the span box (the black rectangle, with <bncDoc> written in it). From the drop-down menu choose SGML.

In the top option box, scroll down the list of options until you find the u element (u stands for utterance). Click on this.

A new list of options (Attributes) will appear. These are the different attributes of the u element. You are going to choose who_sex. Click Add, and then you will see a choice of values. Choose m (for male) and click OK.

In the Query Builder, click on the empty red rectangle (the right-hand one). Select Edit -> POS.

In the L-word slot, enter sweet, and then click once in the Part of Speech box area. A list of options will appear. Select AJ0 and click OK.

You will return to the query builder box. Note that ‘Query is OK’ appears at the bottom of the box to tell you that the syntax of the query is good. Click OK.

Click on the Page/Line Mode button so that all the concordance lines will be shown (this is only necessary if you can see just one line of concordance data; if you can already see many lines, skip this step).

Some of the concordance lines may feature nominal sweet and you will need to delete these. Double-click on any lines that contain nominal sweet, and when they are all selected, go to the Query menu, choose Thin -> Reverse selection.

Now repeat the process, but this time search for uses of adjectival sweet by female speakers.

You should have at least two query windows open. If you want to move from one to the other, go to the Window menu, and choose which window you want to look at.

How many uses of adjectival sweet do you find in each set of queries? Write your findings in the box below, and for each query set, include notes on the meanings of adjectival sweet for that group. You can see what words immediately precede or succeed the word ‘sweet’ by going to Query – Sort and choosing to sort the words first by the left (to see what words precede) and then by the right (to see which words succeed).
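
To see why sorting helps, it may be useful to picture what a sort by the left or right neighbour does to a set of concordance lines. The Python sketch below is an illustration only and does not reproduce SARA's own sorting; it assumes each concordance line has already been split into left context, search term and right context, and the example lines are invented, not taken from the BNC.

```python
def sort_by_left(lines):
    """Sort (left, keyword, right) tuples by the word just before the keyword."""
    def key(line):
        left_words = line[0].split()
        return left_words[-1].lower() if left_words else ""
    return sorted(lines, key=key)


def sort_by_right(lines):
    """Sort (left, keyword, right) tuples by the word just after the keyword."""
    def key(line):
        right_words = line[2].split()
        return right_words[0].lower() if right_words else ""
    return sorted(lines, key=key)


if __name__ == "__main__":
    # Invented concordance lines, for illustration only.
    lines = [
        ("she is so", "sweet", "to me"),
        ("a very", "sweet", "tea"),
        ("", "sweet", "dreams"),
    ]
    for left, keyword, right in sort_by_right(lines):
        print(f"{left:>12} {keyword} {right}")
```

After sorting, lines that share the same neighbouring word sit next to each other, which makes recurring combinations easier to spot.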

What conclusions can you draw from the evidence that you have available to you from the BNC? What extra information do you need in order to interpret your results?

You may wish to find out how many male speaker utterances there are in the BNC, compared to the number of female speaker utterances (this is clearly an imperfect comparison as the length of utterances is not equal, but it suffices as an example). You can use an SGML Query. Click on the SGML Query button, choose ‘u’ (as above), then, for the attribute ‘who.sex’ choose ‘m’ (for male) and then, for the attribute ‘who.age’ choose 0 and 1. These values, according to the User Reference Manual, indicate:

0: Under 15 years
1: 15 to 24 years
2: 25 to 34 years
3: 35 to 44 years
4: 45 to 59 years
5: Over 59 years
X: Unknown

Now click ‘OK’. The query will take a few seconds to resolve. You should see a box like the following:

This tells you that there are 66788 utterances in the BNC by males at, and under, the age of 24. Click Cancel. (You don’t need to look at the utterances, you only need to know how many there are). Repeat the same procedure for older male speakers, for older female speakers, and for younger female speakers.
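
Once you have an utterance count for each group, the raw number of hits for adjectival sweet can be turned into a rate, so that groups of different sizes can be compared. The Python sketch below shows one simple way of doing this, as hits per 10,000 utterances; this normalisation is an illustrative choice rather than something prescribed by the handbook, and the hit count in the usage lines is hypothetical. Only the figure of 66788 utterances comes from the text above.

```python
def hits_per_10k(hits, utterances):
    """Return the number of hits per 10,000 utterances for one speaker group."""
    if utterances <= 0:
        raise ValueError("utterances must be a positive number")
    return hits / utterances * 10_000


if __name__ == "__main__":
    young_male_utterances = 66788   # from the SGML query described above
    young_male_hits = 50            # hypothetical count, for illustration only
    rate = hits_per_10k(young_male_hits, young_male_utterances)
    print(f"{rate:.2f} hits per 10,000 utterances")
```

As noted above, utterances differ in length, so a rate of this kind is still only a rough basis for comparison.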

Saving results

You can save results as SARA files (.sqy) or as .xml files that you can open in a word processor, edit and print.

To save as .sqy, choose Save from the File menu, and give a meaningful name to the query (such as sweet_male_young). It is useful to set up your own conventions, such as placing the search term in first position, with parameters following.

To save as XML, go to the Query menu and choose Listing. You will be prompted to name the file and choose a folder to save the document in.

You can open an XML document from within Word. The XML document contains a certain amount of tagging: each line, for example, will have the <HIT> tag with an identifying number. The advantage of an XML file is that you have more flexibility over the data than if you try to print directly from within SARA.

References

Aston, G. & L. Burnard (1998) The BNC Handbook Edinburgh: Edinburgh University Press

BNC User Reference Guide http://www.natcorp.ox.ac.uk/World/HTML/

Section 5: Using ICECUP 3 to investigate ICE-GB

The International Corpus of English GB Component (ICE-GB) is a one million word corpus attempting to represent general English in Great Britain. It contains a sampling of English, with 200 written texts and 300 spoken texts. The corpus is fully annotated for part-of-speech and also at clause level. There is a dedicated software programme for use with the corpus: ICECUP 3. ICECUP 3 is a very powerful corpus search programme and there is no space here to explain the full capability of the programme. If you want to experiment further you should try reading the ‘Help’ menu pages, which provide a lucid and comprehensive explanation of the features of the ICECUP programme. For further details, and ideas of what questions can be explored through use of the corpus, consult:

Wallis, S., G. Nelson and B. Aarts (2002) Exploring natural language: working with the British component of the International Corpus of English Amsterdam: Benjamins.

In the following pages, a brief introduction to the interface is given.

Exploring ICE-GB

1 Open ICECUP 3 (the programme is in the ‘ICECUP3’ folder in your C – or N – drive). In ICECUP, you will find that there are many buttons and options in the menus. You are going to learn about some but not all of these. On the first screen, you can see the following:

2 Some of these words may look rather opaque. Let’s take a simple one first: Text. Click on this button, and the following window will open:

In this box, you can specify a word or words to search for. Type in the word ‘poorly’ and then click on ‘OK’. You may have to wait a few seconds for the results of your search to appear on the screen. You will (eventually) see a screen like the following.

At the bottom left of the screen (see the pink oval) you can see the current line number (in this screen shot, 1) and the total number of lines (8). In the row of small buttons at the top, the second and third buttons (see the orange oval), increase and decrease the font size. Further along the row of buttons you can see buttons labelled F, C and f. Click on each of these in turn and see what happens. Write down what codes appear in the box below:

Button    Examples of codes that appear
F
C
f

Now, look at the white button immediately to the left of the F button. If you place your cursor over this button, you will see a screen tip saying ‘No concordancing’. Click on this button. What happens? Write your answer in the box below:

In ICECUP, you can look at the tree diagrams for each line and also make queries about syntactic patterns. First, double-click on the first line. You will see a new window appear, like the following:

At the bottom of the screen, you can see the line, and above that is a tree diagram showing the analysis of the sentence into parts of speech, phrases and clauses. If you right-click on any of the category codes, a help window will open with an explanation of the code. After you have read the contents, close the window. What do the following mean? Write the simple one- to three-word name in the column on the right:

SU
NP
NPHD
CS
MVB

Now, close the ‘Spy’ window. You should be able to see the list of concordance lines again. Next you are going to learn about ‘Fuzzy Tree Fragments’. This is a special term used in the ICE corpus and not used in any other context. You can find out about Fuzzy Tree Fragments (FTFs) by using the ICECUP Help menu, which contains a number of useful tutorials.

1. Click on ‘Help’ (top right corner of the screen).
2. Select ‘Editing Fuzzy Tree Fragments’.
3. Try reading the explanation on this page, as it is useful, but if you find it impossible to understand, skip to the bottom of the page.
4. Select ‘A simple FTF’.
5. Follow the explanation and construct an FTF following the instructions. You may find that some of the terms are difficult to understand, but you should still be able to follow the instructions without fully understanding what you are doing.
6. When you reach the end of the page and have constructed the FTF, close the ‘Help’ window and click on the ‘Start’ button in the top right of the FTF window. You should now see a new window opening with the results of your query.

Questions: How many examples of a clause containing a subject – verbal – adverbial sequence can you find? Is it possible to find clauses in which some other element appears between the S and the V or between the V and the A? Give an example:
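To make the idea behind these questions more concrete, the following Python sketch shows one way a partial tree pattern could be matched against a parsed clause. It is only an illustration and not how ICECUP itself works: the `Node` class, the functions `matches_sequence` and `find_clauses`, the simplified labels and the example clause are all hypothetical. The `adjacent` switch corresponds to the second question: with `adjacent=False`, other elements may appear between the subject, the verbal and the adverbial.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a toy parse tree: function label, category label, children."""
    function: str
    category: str
    children: List["Node"] = field(default_factory=list)


def matches_sequence(clause: Node, functions: List[str], adjacent: bool) -> bool:
    """Return True if the clause's children contain `functions` in order.

    adjacent=True  -> the elements must follow one another with nothing between.
    adjacent=False -> other elements may intervene (the 'fuzzy' reading).
    """
    labels = [child.function for child in clause.children]
    if adjacent:
        n = len(functions)
        return any(labels[i:i + n] == functions
                   for i in range(len(labels) - n + 1))
    # Consume the labels left to right, so the order of `functions` is respected.
    remaining = iter(labels)
    return all(wanted in remaining for wanted in functions)


def find_clauses(root: Node, functions: List[str], adjacent: bool) -> List[Node]:
    """Walk the whole tree and collect every clause node that matches."""
    found = []
    if root.category == "CL" and matches_sequence(root, functions, adjacent):
        found.append(root)
    for child in root.children:
        found.extend(find_clauses(child, functions, adjacent))
    return found


# Hypothetical clause with the shape subject, adverbial, verbal, adverbial
# (for instance "She often sings beautifully"). The labels are simplified.
tree = Node("PU", "CL", [
    Node("SU", "NP"),
    Node("A", "AVP"),
    Node("VB", "VP"),
    Node("A", "AVP"),
])

strict = find_clauses(tree, ["SU", "VB", "A"], adjacent=True)
loose = find_clauses(tree, ["SU", "VB", "A"], adjacent=False)
print(len(strict), len(loose))
```

In this toy clause an adverbial sits between the subject and the verbal, so the strict pattern finds nothing while the looser pattern finds the clause.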

WordSmith Tools is a suite of tools for corpus analysis. It can be used with most of the corpora in the SLALS collection. The programme is installed on the computers in the Corpus Facility room and in Room 181. It can be installed on staff machines by request. As it is a licensed program, the School cannot give copies to students for use on other computers.

Using Wordsmith Tools

[The following is taken from a worksheet developed for use with secondary school teachers. The tasks given are simply quick introductions to the features of WordSmith Tools. The programme is capable of more sophisticated analysis, and you should look at the ‘Help’ facility to find out more about the capabilities of the program.]

In this worksheet, you are going to learn the basics of using Wordsmith Tools. You will learn to:

• Choose texts
• Make a concordance search
• Sort the results
• Make a word list
• Compare two wordlists

1 Go to C:/wsmith and double-click on wshell.exe. This will open the Wordsmith Controller. Go to the Tools menu and choose ‘Concord’.
2 In the top left corner of the Concord window, you will see:

3 Click on the leftmost button, which looks like a miniature Brazilian flag (this is the Start button). You will then see the following window:

4 First, click on the Choose Texts Now button.

5 Navigate to: Z:\icame\TEXTS\LOB_plain
6 Select the All button in the lower right side, and then click on OK.

You have now selected all the texts from the LOB corpus. Next you are going to make a concordance search.

1 You will have returned to the Getting started window. Click on Specify Search word and enter the word ‘success’ in the search word box, then click on Go now!!
2 You will see a set of concordance lines appear in a new window. How many lines are there? Write the answer here:
3 Next, you are going to sort the concordance lines. The row of buttons below the start button looks like this:

4 The third button from the left is the Sort button. Click on this, and then in the following window, choose 1R as the first sort and 2R as the second sort:

5 Click OK. Look through the lines. Which words most often appear to the right of ‘success’? Write them below:
6 Go to the Sort button again, and this time sort the lines by the first word to the left. Which words do you most often find to the left of ‘success’?
7 You have looked at the LOB corpus, which contains texts from the 1960s. Now you are going to do the same search in the FLOB corpus, a set of texts from the 1990s. Click on the Start button, and then, in the Choose Texts window, click on the Clear previous button.

8 This action clears the previous choice. If you don’t do this, you will end up with both sets of text in your selection. When you have cleared the previous choice, select the FLOB_plain folder, choose All and then click on OK.
9 You don’t need to change the search word, so simply select Start concordance. Then repeat the same set of actions. Answer the following questions:

• How many lines are there?
• Which words most often appear to the right of ‘success’? Write them below:
• Which words do you most often find to the left of ‘success’?
• Comparing the two sets of concordance lines, do you feel that the uses of the word ‘success’ are broadly the same?
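For readers who want to see what a concordance search and a 1R/2R sort involve, here is a minimal Python sketch of the same idea. It is not WordSmith Tools code: the function names `concordance`, `context_word` and `sort_lines`, the simple tokenisation and the sample sentences are all assumptions made for illustration. To try it on a real corpus you would read your own text files into a string in place of `sample`.

```python
import re
from typing import List, Tuple

# One concordance line: (left context, node word, right context)
Line = Tuple[List[str], str, List[str]]


def concordance(text: str, word: str, span: int = 5) -> List[Line]:
    """Find every occurrence of `word` with up to `span` words either side."""
    tokens = re.findall(r"[A-Za-z']+", text)  # assumed, very simple tokeniser
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines


def context_word(line: Line, position: str) -> str:
    """Return the word at a position such as '1R' or '2L' ('' if there is none)."""
    left, _, right = line
    offset = int(position[:-1])
    if position.endswith("R"):
        words, index = right, offset - 1
    else:
        words, index = left, len(left) - offset
    return words[index].lower() if 0 <= index < len(words) else ""


def sort_lines(lines: List[Line], first: str, second: str) -> List[Line]:
    """Sort lines alphabetically by one context position, then by another."""
    return sorted(
        lines,
        key=lambda ln: (context_word(ln, first), context_word(ln, second)),
    )


# Invented sample text, standing in for a set of corpus files
sample = (
    "The success of the plan surprised everyone. Her success in business was "
    "well known. They had no success at all. The success of the film was huge."
)

lines = sort_lines(concordance(sample, "success"), "1R", "2R")
for left, node, right in lines:
    print(f"{' '.join(left):>30}  {node}  {' '.join(right)}")
```

Sorting on 1R groups together the lines that share the same word immediately to the right of the search word, which is what makes recurring patterns easy to spot. Calling `sort_lines(lines, "1L", "2L")` would sort by the words to the left instead.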

In the preceding exercise you looked at how the word ‘success’ was used in the LOB and the FLOB corpora. Think of a word or expression that you would like to investigate in these two corpora, and record the results of your exploration below:

Making a word list

1 Close the Concord windows by clicking on the X in the top right corner of each window. In the Controller window, go to the Tools menu and choose Wordlist. You will see the following set of buttons in the Word list window:

2 Once again, click on the Start button. You will have one of the corpora selected already (probably the FLOB corpus). In the window (shown below) choose Make a word list now:

This will create a word list of the FLOB corpus, in three parts: statistical, alphabetical and frequency order. The frequency list shows the words in order of frequency, in terms of raw counts and of percentages; the statistical list provides overall information about the files, such as average sentence length and type/token ratio.

3 Save the wordlists as flob.lst
4 Go to the Start menu and change your selection of texts to the LOB corpus texts. Then make a word list of the LOB texts and save as lob.lst
5 Finally, go to the Comparison menu and choose Compare two wordlists. On the left side choose lob and on the right choose flob. Then click ‘OK’.
6 You will see a new window appear with a list of words and figures in columns to the right of the words. The words at the top of the list show which words are key to the LOB corpus (the concept of ‘keyness’ is explained in the WordSmith Tools Help pages; simply put, a word is ‘key’ to a particular corpus if it occurs with greater relative frequency in the corpus than it does in a corpus that it is compared to).
7 Look closely at the first twenty:

• How many are function words?
• How many are specific to a particular register of English?
• How many of them do not appear to be words?

Scroll to the bottom of the list and look at the 20 most key words in the FLOB corpus. Answer the same questions. Do you think that the differences are due to changes in language use over time, or to some other factors?
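The word list and comparison steps above can be illustrated with a short Python sketch. It builds a frequency list, works out a type/token ratio, and ranks words by a log-likelihood score, a statistic often used in corpus linguistics for this kind of comparison. Treat it as an illustration of the idea of keyness rather than a description of how WordSmith Tools calculates its figures. The function names, the tokenisation and the two tiny sample texts are assumptions, and the sketch assumes both texts are non-empty.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def word_list(text: str) -> Counter:
    """Count how often each word form occurs (assumed, very simple tokeniser)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def type_token_ratio(freqs: Counter) -> float:
    """Number of different words divided by the total number of words."""
    tokens = sum(freqs.values())
    return len(freqs) / tokens if tokens else 0.0


def log_likelihood(a: int, b: int, total_a: int, total_b: int) -> float:
    """Score for a word seen `a` times in corpus A and `b` times in corpus B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    score = 0.0
    if a:
        score += a * math.log(a / expected_a)
    if b:
        score += b * math.log(b / expected_b)
    return 2 * score


def keywords(study: Counter, reference: Counter) -> List[Tuple[str, float]]:
    """Rank words: top = key to `study`, bottom = key to `reference`."""
    n_study, n_ref = sum(study.values()), sum(reference.values())
    scored = []
    for word in set(study) | set(reference):
        a, b = study[word], reference[word]
        # Negative scores mark words that are relatively more frequent
        # in the reference corpus.
        sign = 1 if a / n_study >= b / n_ref else -1
        scored.append((word, sign * log_likelihood(a, b, n_study, n_ref)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented miniature 'corpora', standing in for the LOB and FLOB word lists
lob = word_list("the wireless was on and the wireless was loud")
flob = word_list("the computer was on and the computer was slow")

print(round(type_token_ratio(lob), 2))
ranking = keywords(lob, flob)
print(ranking[0], ranking[-1])
```

Words that occur at the same rate in both texts, such as ‘the’, score zero and fall in the middle of the ranking; only the words that distinguish one text from the other reach the top or the bottom of the list.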

If you find that the Midwich server is not working, or if it is responding very slowly, please inform either Gerry Latawiec ([email protected]) or Paul Thompson ([email protected]), so that the server can be checked and, if necessary, rebooted.
