A Handbook to the Lancaster Speech, Writing and Thought Presentation Spoken Corpus
1. Introduction
The Lancaster Speech, Writing and Thought Presentation Spoken Corpus has been constructed as part of an AHRB-funded research project entitled ‘A corpus-based study of Speech, Writing and Thought Presentation in contemporary spoken British English’ (henceforth the SW&TP Spoken project). The aim of the project is to investigate the forms and functions of Speech, Writing and Thought Presentation (henceforth SW&TP) in a corpus of conversations and oral narratives, using a tagset developed by Mick Short, Elena Semino, Martin Wynne
and Jonathan Culpeper at Lancaster University, and based on the cline of categories proposed in Leech and Short (1981). We will be conducting quantitative and qualitative analyses of our data and comparing our findings with those of
our previous research project investigating SW&TP in written texts. (A full list of papers resulting from the SW&TP Written project is available on the project’s ‘publications’ page.)
1.1 Research questions
1.1.1 Over-arching research question
- What are the forms and functions of SW&TP in contemporary spoken British English?
1.1.2 Theoretical/methodological questions
- To what extent can models and categories originally developed for the analysis of SW&TP in written texts be applied to speech?
- To what extent can the model developed during the SW&TP Written project be applied to a spoken corpus?
- What modifications are needed to our existing annotation scheme?
1.1.3 Analytical questions
- What are the relative frequencies of different categories of SW&TP in the spoken corpus as a whole?
- Are there differences in the frequency of categories depending on variables such as the nature and context of the interaction or the identity of the speaker?
- How can these quantitative differences be related to the functions that different categories of SW&TP can perform in discourse?
- How do the frequencies in the spoken corpus compare with those in our written corpus?
2. The composition of the corpus
2.1 The data
The texts that form our corpus are drawn from two sources: 1) the spoken demographic section of the
British National Corpus (BNC -
World edition); and 2) an oral history archive stored in the Centre for North West Regional Studies (CNWRS) at Lancaster University.
We chose to construct a corpus of approximately 250,000 words in order to make it comparable with the existing SW&TP written corpus. The CNWRS archives and the BNC obviously provide a far larger body of data than we required, and so we opted to select 120 ‘chunks’ (60 from the BNC and 60 from the CNWRS archives) of approximately 2000 words each (as this was the size of the texts in the written corpus), providing 240,000 words in total. We also decided that the chunks would not be stopped at exactly 2000 words, but would be allowed to run on a little, preferably to the end of discussion of a particular topic. This would give us the remaining 10,000 words needed to make our corpus approximately 250,000 words in size. Below is a description of the CNWRS and BNC data.
2.2 The CNWRS data
The CNWRS data consists of two archives: 1) The ‘Family and Social Life’ archive; and 2) The ‘Childhood and Schooling’ archive.
A diagram detailing the CNWRS texts in our corpus is available via the accompanying link.
2.2.1 The ‘Family and Social Life’ archive
This archive was compiled and organised by Elizabeth Roberts, Emeritus Reader in Social History at Lancaster University. The data was collected in the 1970s and 1980s by Elizabeth Roberts and Lucinda Beier, and the archive consists of 250 hours of interviews, stored on audiocassettes and reel tapes, with accompanying transcripts. The interviews fall into two main groups:
1) Interviews done in the mid-1970s, with respondents recalling the period 1890-1940.
2) Interviews done in the mid-1980s, with respondents recalling the period 1940-1970.
The interviewees lived in Preston, Barrow or Lancaster at the time of interview, but we made the decision to use only the records relating to Barrow and Lancaster, for reasons explained in section 3.1.1.
2.2.2 The ‘Childhood and Schooling’ archive
This archive was compiled and organised by Penny Summerfield, formerly of Lancaster University and now Professor of Modern History at the University of Manchester. Collected in the 1980s, the data in this archive consists of approximately 200 hours of interviews on audiocassette, with accompanying transcripts. The interviewees recall their years at secondary schools between 1920 and 1950 in Lancaster and Morecambe, Preston, Blackburn, Burnley and Clitheroe.
The respondents in both these archives were all born prior to the early 1950s, and their memories provide an irreplaceable collection of personal retrospectives, many very particular to the North West. The oldest contributors were born in the 1880s, and their recollections of their own childhood and their parents’ life and times describe ordinary life as far back as the mid-19th Century. Overall, then, the archives provide a perspective on some 150 years of ordinary life. Of particular interest to us is the way in which many respondents tell stories or narrate events in their lives in ways which include the presentation of their own or others’ speech, thought and writing.
2.2.3 How the CNWRS archives are organised
2.2.3.1 The Family and Social Life Archive
This archive consists of two sections: 1) records for the interviews relating to 1890-1940; and 2) records for the interviews relating to 1940-1970. The archive is divided this way because the latter interviews were carried out approximately ten years after the first. Within each of these two sections, the records are arranged by the town in which the interviews took place, giving three blocks, for Preston, Barrow and Lancaster. Within each town, the interviews are filed according to the initial of the respondent’s surname and the initial of the town in which the respondent was living at the date of interview. Where a respondent’s surname begins with the same letter as that of another interviewee, a number is used to differentiate them. For example:
- ‘Mrs B3B’ is a married woman whose surname begins with ‘B’ and who lives in Barrow. She is the third respondent from that town with a surname beginning with ‘B’.
- ‘Mr R2L’ is a man whose surname begins with ‘R’, and who lives in Lancaster. He is the second respondent from Lancaster to have a surname beginning with ‘R’.
In addition to the interview transcripts, the archive contains basic biographical details of each interviewee, and a record of the numbers of all tapes related to an individual respondent. So, for example, Mr R2L may have interviews recorded on reel tape 145, and cassettes 500 and 790. There is also a ‘Subject File’, which consists of an alphabetical list of topics discussed in the interviews. This is provided mainly to facilitate socio-historical research, and is cross-referenced to show which individuals discuss a given topic on which tapes.
2.2.3.2 The Childhood and Schooling Archive
The transcripts for these interviews are organised by school and by gender. In addition, there are index cards, ‘Interview Checklists’ (giving details of the interviewees’ names, pseudonyms if requested, addresses and how many hours of interview they gave), and additional details about the data-gathering process. There is also a Biographical Index, giving basic biographical information, and a card-index detailing topics covered in the interviews. The tapes, however, are not labelled other than with the respondent’s name or pseudonym.
Because of the danger of confusing pseudonyms and real names, we decided on an alternative method of reference for this archive. The tapes were numbered according to the alphabetical order of the respondents’ surnames (regardless of whether the name on the tape was real or false). To differentiate them from the Family and Social Life archive tapes, their numbers were preceded by ‘C’ to indicate that they belonged to the Childhood and Schooling Archive.
2.3 BNC data
Approximately 10 million words of spoken data are available from the BNC, and are divided into two main categories: 1) demographic and 2) context-governed texts. Demographic texts are transcriptions of spontaneous natural conversations made and recorded by members of the public. These conversations were all recorded around 1991 - 92, and there are equal numbers of male and female speakers. In addition, there is a roughly equal distribution of speakers across age ranges and social class groups. The context-governed texts, on the other hand, do not constitute spontaneous natural speech. They are transcriptions of recordings made at specific types of meetings and events. There are roughly equal quantities of speech divided across four main categories, these being educational/informative events (e.g. lectures, news broadcasts), business events (e.g. interviews, trade union meetings), institutional/public events (e.g. sermons, parliamentary proceedings) and leisure events (e.g. after-dinner speeches, radio phone-ins). We decided to use only material from the spoken demographic section of the BNC, as this would allow us to contrast spontaneous speech with the elicited data of the CNWRS archives. Since the BNC data was collected in the early 1990s and the CNWRS data in the 1970s and 1980s, we also left open the possibility of studying diachronic developments in speech.
A diagram detailing the BNC texts in our corpus is available via the accompanying link.
In the following section we describe the criteria we used in selecting material for our corpus from both the BNC and the CNWRS archives.
3. Selecting texts for our corpus
To provide as balanced a range of data as possible, we took 60 chunks from the BNC (discussed in section 2.3), 30 from the Childhood and Schooling archive, and 30 from the Family and Social Life archive (15 from the 1890-1940 batch and 15 from the 1940-1970 batch), with an approximately even split of male and female respondents throughout. We also tried to obtain as wide an age range as possible amongst the CNWRS respondents, so that it might later be possible for us to consider potential evidence of diachronic changes in usage. We decided not to make a further distinction between social classes, as the overwhelming majority of the CNWRS respondents are working class, and so we did not have enough respondents from other social classes to make this difference representative. Similarly, if we had tried to select BNC data according to respondents’ social class we would not have had enough examples from each group to form a representative data set.
3.1 Selecting texts from the CNWRS archives
Our selection of texts from the CNWRS archives was in part restricted by the generally poor quality of the transcripts and audio tapes. In the following section we describe the criteria developed for selecting suitable CNWRS texts to include in our corpus.
3.1.1 Criteria for selecting texts from the CNWRS archives
First, many of the transcripts were dot-matrix printed, and some of these were indistinct. We photocopied and darkened these to make them suitable for scanning, but some were still insufficiently clear and could not be used without being completely re-transcribed. The older transcripts were typed on typewriters, with varying degrees of legibility due to dirty keys, mistyping, etc. These were usually dark enough to scan, but because of the poor legibility of the original transcripts, the resulting scanned file would contain a large number of errors, which all had to be corrected by hand.
Similarly, much of the audio material was not in good condition; the older reel tapes were starting to deteriorate, some of the cassettes used had, even initially, been of poor quality, and many were recorded without regard to recording levels. Because of this, sometimes respondents were virtually inaudible, and sometimes the recordings were made at such a high level that we could not remove the distortion from the captured sound file without damaging the comprehensibility of the interview. Additionally, many had been recorded in the respondents’ homes, resulting in such phenomena as traffic noises from outside, clocks chiming, people ringing doorbells and many helpful spouses bringing in tea! All of these factors meant that we could not use the affected sound files as part of our corpus.
With regard to the respondents, many of the interviews in both archives include several participants. We decided only to use one-to-one interviews in our corpus, since the BNC files would constitute multiple-participant conversations.
Secondly, many of the respondents in the 1940-1970 batch were children of those in the 1890-1940 batch, and some of the respondents in both batches were related by blood or marriage to other respondents. We decided that we should, if possible, select our files only from those records where the respondent was not related in any way to any other respondent, so that possible similarities of speech style due to close and frequent interaction between family members would be eliminated, and the range of usage styles maximised. The poor quality of some of the material also led us to develop a set of criteria for what to exclude when selecting files from the CNWRS archive. These were as follows:
- Any records for which audio material was unavailable or of too poor a quality to be used.
- Any records which included interviews with 3 or more participants (including the interviewer).
- Any records where the respondent seemed to be in any way reluctant to disclose information or had requested not to be included.
- Any records where the respondent was a relative of another respondent.
- Those records for which the transcript remained seriously illegible, even after photocopying.
- Those transcripts with very thin, torn or creased pages which might be difficult to copy or scan.
- Multiple-interview recordings - Because the majority of the respondents in the CNWRS archives had been interviewed more than once, indeed some as many as six or seven times, we decided that a further criterion would be that only one chunk should be selected from any single individual, in order to standardise the sample and maximise the range of speakers represented.
- Preston files - When selecting data from the CNWRS archive material we discovered that the ‘Family and Social Life’ transcripts relating to Preston were summaries of what the participants had said, rather than accurate transcriptions. Although this was deemed suitable for the purposes of historical research by the historians who compiled the archive, it was not accurate enough for linguistic analysis. Additionally, one of the original transcribers had converted all instances of the presentation of Direct Speech into Indirect Speech, clearly affecting those areas in which we would be most interested and making it difficult even to see how to identify them clearly. We therefore decided not to select any samples from the Preston files.
Once we had decided which files to exclude from our data, we then set about finding suitable sound and transcription chunks from each archive. This meant first looking through the catalogues of each archive in order to select potentially suitable candidates, taking into account our criteria for selection relating to age, gender, etc.
3.1.1.1 Transcriptions
We found by examining the SW&TP written corpus that SW&TP occurs more frequently in longer extracts than in shorter ones. Therefore, after the initial selection, we examined each transcript for ‘long turns’, on the basis that these might provide a richer distribution of the kind of features we were seeking to examine. In fact, this meant excluding any records where the turn-taking was for the most part of a very brief question and answer format, or where long turns only occurred widely spread out through the text, so that a 2000 word chunk would not be able to include more than one or two such turns.
Once we had identified potentially rich interviews for our purposes, we photocopied each respondent’s transcript and passed the copy to members of the research team so that they could look for suitable extracts. The photocopying also meant that we were not examining the original transcripts and so did not have to remove files from the archives for long periods of time.
Once the transcripts had been decided on, we scanned them, and then reduced the resulting electronic file to the ‘chunk’ required for inclusion in our corpus. Scanning was done using Omni-Page Pro, and the scanned files were saved as Word documents: these were labelled <xxxWtr.doc>, where ‘xxx’ is the number of the original file. We then reduced the contents of these files to the selected chunk and re-saved these as <xxxWch.doc>. We also put together a further file for each interview, labelled <xxxH.doc>, which contained as much biographical information as possible about the interview participants, in preparation for inclusion in the header of the eventual marked-up corpus files.
Once the chunks had been selected, we then had to deal with the problem of the relatively poor quality of the original transcriptions. These had been done by non-linguists and in many instances we found that the original transcriptions of the CNWRS interviews were not accurate enough for linguistic analysis. It was therefore necessary for us to partially or wholly re-transcribe the interview chunks to produce an accurate orthographic transcription. At this point it is worth noting the unusual elements found in some of the original transcripts, and the decisions we made concerning these.
3.1.1.2 Irregularities and oddities in the original transcriptions
The original transcriptions were made for an oral history research project, and were not transcribed by linguists. As a result there are various irregularities and inconsistencies in the original transcripts which make them unsuitable, as they stand, for the purpose of linguistic analysis.
The first decision we took was on how to deal with punctuation in the original transcriptions. The original transcribers had punctuated as they saw fit, and, due to the nature of the original project, without regard for linguistic transcription conventions. The speech, then, is divided into what the original transcribers deemed to be ‘sentences’, with commas, full-stops etc. dividing these. It is also the case that the OCR scanning process unavoidably produces some corruptions of the original text. Hence, a full-stop may appear as a comma, a comma as a letter ‘g’ or ‘y’, inverted commas as an asterisk, etc. We decided to retain the original punctuation as shown in the ‘hard copy’ transcription, except in those cases where it was necessary to re-transcribe a section due to the inaccuracy of the original transcription, or where we had to newly transcribe stretches of interaction that had been omitted or summarised. In these cases, the only punctuation that we added was full-stops where we felt a sentence boundary would most likely exist in a written form of the interview. We did this to prevent the text from becoming impossibly difficult to read and understand.
This was the major issue to be resolved. There were, however, other anomalies which we had to deal with. Some of the original transcribers, for reasons quite unknown to us, and indeed to the original researchers, had placed inverted commas around all instances of the words ‘yes’ and ‘no’. We removed this punctuation to avoid it appearing that these words were instances of direct speech. Another transcriber had used a capital letter at the beginning of the line every time a new page was started, regardless of whether this was the beginning of a ‘sentence’ or not. In each case we converted these to lower case letters where necessary. Finally, many transcribers had misspelled words, particularly proper nouns. All these misspellings were corrected.
3.1.1.3 Transcribing normal non-fluency features
With regard to producing a more accurate orthographic transcription, it was necessary to transcribe normal non-fluency features, which were often missing from the original transcriptions. We kept this as simple as possible, with three possible transcriptions available: ‘er’, ‘erm’ and ‘um’, which we felt adequately described the range of normal non-fluency features present in the data. We did not transcribe non-verbal events such as laughter and coughing, since i) they are unlikely to affect the presentation of speech, thought and writing, and ii) our time was necessarily limited. However, it would be a relatively straightforward task to include this information at a later date if required.
3.1.1.4 Sound archives
In addition to producing electronic copies of the interview transcripts, we also digitised their related audio tapes. We did this using an application called CoolEdit, which allowed us to convert the original tapes to <*.wav> files. The <*.wav> files were recorded as 16-bit files in mono, in order that they should be in a form suitable for later time-alignment.
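As a minimal illustrative check (not part of the project workflow described here; it uses Python's standard wave module and a file named according to the chunk-naming scheme described below), a captured file's format can be verified as follows:

import wave

def check_wav_format(path):
    # A captured sound file should be 16-bit (2 bytes per sample) and mono.
    with wave.open(path, 'rb') as w:
        sample_width = w.getsampwidth()
        channels = w.getnchannels()
    ok = (sample_width == 2 and channels == 1)
    print(path, 'is 16-bit mono' if ok else 'needs re-capturing')
    return ok

# Example (assuming the chunk file exists locally):
# check_wav_format('136ch.wav')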
Our intention initially was to capture each whole tape, so that it could later be ‘chunked’, and the original whole sound file could be kept for future reference. However, because sound files can only be captured in real time, time constraints meant that we were only able to capture that section of each tape that related to the extract from the transcript.
There were, as previously stated, many recordings which were not suitable for capture due to low quality of the recordings. Additionally, odd discrepancies came to light during the process of digitisation, such as audio material which was incomplete, or tapes that were incorrectly labelled and did not relate to the supposed accompanying transcription. In these cases the tapes were not used. CoolEdit does provide some post-capture processing features, though it was not possible in every case to provide ‘clean’ sound files, since the use of more than one or two post-capture processes tended to reduce the sound quality of the speech to an unacceptable level.
With regard to storing the <*.wav> files, we followed a similar process to that applied with the transcriptions. In those instances where the whole original audio tape was digitised, the resulting file would be labelled as <xxx.wav>; the relevant chunk from that file would be labelled <xxxch.wav>.
3.1.1.5 Storing the CNWRS data to await mark-up
Once the digitisation and scanning of the transcripts was complete, we stored the files for each individual chunk in separate folders, and named each folder with the original CNWRS tape number that had been used to generate the chunk filenames. So, for example, folder 136 contained 136Wch.doc, 136H.doc and 136ch.wav.
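As an illustrative sketch only (not a tool used on the project), this folder convention could be checked automatically with a short script such as the following; the folder name '136' follows the example given above:

import os

EXPECTED_SUFFIXES = ['Wch.doc', 'H.doc', 'ch.wav']

def check_chunk_folder(folder):
    # The folder name is the original CNWRS tape number, e.g. '136'.
    tape = os.path.basename(folder.rstrip(os.sep))
    # Report any of the three expected chunk files that are missing.
    return [tape + suffix for suffix in EXPECTED_SUFFIXES
            if not os.path.exists(os.path.join(folder, tape + suffix))]

print(check_chunk_folder('136'))   # [] if 136Wch.doc, 136H.doc and 136ch.wav are all present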
3.1.1.6 The 60 texts from the CNWRS archive
The table below shows the identifying file numbers of the 60 CNWRS files from which our texts were taken:
RESPONDENTS BY SEX

Family and Social Life Archive     Childhood and Schooling Archive
Male        Female                 Male        Female
105         120                    C002        C087
123         132                    C008        C090
125         136                    C009        C102
138         147                    C017        C104
148         150                    C025        C107
161         169                    C027        C115
184         177                    C029        C126
506         197                    C030        C131
531         500                    C036        C135
545         511                    C041        C137
560         530                    C047        C146
562         538                    C055        C149
588         550                    C061        C152
787         580                    C063        C174
-           683                    C196        C185
-           766                    -           -

Table 1 CNWRS files from which extracts were taken for the SW&TP spoken corpus
3.2 Selecting texts from the BNC
We decided to concentrate primarily on dialogic texts from the spoken demographic section of the BNC, in order that the BNC files would be significantly different from the CNWRS texts we had selected. (Using the BNC’s system of classification, texts from the CNWRS archives would be considered context-governed data.)
We chose texts from the BNC that cover all age ranges (using the BNC’s classification system), with an equal division between male and female speakers, and when dividing the texts into these categories we took the age and sex of the respondent as the determining factors for categorisation. We also concentrated solely on face-to-face interaction (so we did not use transcripts of radio phone-ins, for example), and we used only those texts which constitute spontaneous, unscripted data.
3.2.1 Criteria for selection of BNC texts
There are 153 texts in the spoken demographic section of the BNC, all of which are spontaneous, natural conversations on a wide variety of topics. We took 60 chunks from these texts, each taken from a different file in the BNC World edition.
Six age groupings exist in the BNC classification system - 0-14, 15-24, 25-34, 35-44, 45-59, and 60 plus – and these are all represented in our choice of texts. Thus, there are 10 chunks per age group in our corpus. There is an equal division between male and female speakers, which means that in each age group we have five texts from male speakers and five from female speakers. We decided that it was not practicable to choose texts representing the variety of social class groups, since this would have resulted in there not being enough texts in each division to be representative. Additionally, we had already decided that social class was an inappropriate distinction with the CNWRS data.
The first stage in selecting suitable chunks to use was to divide the spoken texts in the BNC into the different age groups. We then searched the texts for common reporting verbs using the BNCweb query facility. Where the query returned favourable results, we then examined that area of the text in question manually, to see if it was likely to yield rich data. So, in addition to the reporting verbs picked up in the electronic search, we also looked for further examples of SW&TP in close proximity to these. And again, as with the CNWRS data, we favoured longer turns on the basis that narrative report and SW&TP were more likely to appear in longer stretches of text (we made this decision on the basis of examining the earlier SW&TP written corpus). Once we had chosen suitable chunks, these were then passed round members of the project team for closer inspection, and would be accepted or rejected at this stage.
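By way of illustration only (the searching itself was done with BNCweb; the verb list and sample text below are assumptions, not a record of the queries we ran), the following Python sketch shows how a plain-text transcript might be scanned for common reporting verbs so that the surrounding stretch of text can then be inspected manually:

import re

# Illustrative list only; not the full set of reporting verbs searched for.
REPORTING_VERBS = ['said', 'says', 'told', 'asked', 'thought', 'wrote']

def candidate_stretches(text, window=200):
    # Yield (verb, surrounding context) for each reporting-verb hit.
    pattern = re.compile(r'\b(' + '|'.join(REPORTING_VERBS) + r')\b', re.IGNORECASE)
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        yield match.group(1), text[start:end]

sample = 'then he said well I thought you were coming, so I told him no'
for verb, context in candidate_stretches(sample):
    print(verb, '->', context)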
3.2.2 BNC Sound files
In addition to having sound files corresponding to the CNWRS transcripts, we
have also attempted to acquire the related sound files for the BNC texts in
our corpus. However, these have proved rather more difficult to get hold of.
The problem lies with the cataloguing of the original cassettes on which the
Spoken Demographic data was recorded, and with the fact that these do not
correspond exactly to their related transcripts. Nevertheless, despite the
difficulties it is possible to locate the extracts from our corpus, and work
on this is ongoing.
The cassettes for the Spoken Demographic section of
the BNC are stored in the National Sound Archive of the British Library in
London, in cardboard boxes marked ‘C897’. Transcription logs to
accompany these are stored in the same place in black ring binders, and are
useful in locating the relevant extracts. To correlate a transcript with a
cassette it is first necessary to look in the header of that transcript for
which the sound file is required. In the recording statement (<RecordingStmt>)
there are a series of recording elements. There is one recording element for
each division of the sound file that accompanies a particular corpus text.
Following the ‘n’ attribute in the recording element is a 6-digit
number. The first four digits of this number indicate the cassette number on
which the sound file can be found (NB there may be numerous cassettes for
each text file). Each cassette is sub-divided into various sections, usually
corresponding to a change of scene/location (indicated by the <div>
tag). These sub-divisions are marked by the final two digits of the six-digit number, i.e. the two digits following the four that indicate the cassette. Unfortunately, these numbers
are not recorded on the sleeves of the audio cassettes, but are present in
the ‘n’ attribute value of the recording element. The first four digits
of the ‘n’ attribute value can be used to locate all the cassettes that
correspond to the BNC file in question. On the cassette boxes, these four
digits are prefixed by the letters ‘OG’. Once all the cassettes for a
particular file have been located, the next step is to identify which of
these is most likely to contain the relevant extract. There are various
pointers that can help here. First of all, in the Responsibility Statement
of the SW&TP file for which a corresponding BNC sound file is being
sought there is a record of the s-units that were removed from the original
BNC text (see section 4.3.2). If the s-unit numbers are low (e.g. 025-140)
it is likely that the extract will be contained on the first cassette.
Secondly, the number of participants in an exchange can be used to delimit a
search. The participants in the transcript held in our corpus can be checked
against the relevant BNC transcription log, and from this it is possible to
work out which cassettes can be excluded from the search; for example, if an
extract contains just two participants then all those cassettes containing
three or more participants can be excluded. Finally, the notes in the
transcription log can be useful. These contain the first line of each
sub-division of a sound file, making it possible to identify particular
sections when listening to the cassette. Once a probable cassette has been
selected, it is a case of fast-forwarding through it and stopping at various
points to try and locate the relevant extract. The ‘find’ facility in
Notepad or Wordpad can be used to search in the transcript for words
mentioned on the cassette, and by doing this it is possible with a little
effort to locate the corresponding sound file.
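The numbering scheme just described can also be expressed programmatically. The following Python sketch is purely illustrative (the regular expression makes assumptions about how the recording elements and their 'n' attribute values appear in the header, and the sample values are invented); it lists the cassette label and sub-division implied by each recording element:

import re

def cassette_locations(header_text):
    # Each recording element carries a six-digit 'n' value: the first four
    # digits give the cassette number, the last two the sub-division.
    locations = []
    for n_value in re.findall(r'<recording\b[^>]*\bn="(\d{6})"', header_text):
        cassette, subdivision = n_value[:4], n_value[4:]
        locations.append(('OG' + cassette, subdivision))   # 'OG' prefix as on the cassette boxes
    return locations

# Hypothetical header fragment with invented 'n' values.
sample_header = '<recording n="123401"/> <recording n="123402"/>'
for cassette, division in cassette_locations(sample_header):
    print('cassette', cassette, 'sub-division', division)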
Having
said this, there are still problems with the cataloguing of both the
cassettes and the transcripts. It is sometimes the case that a transcript
will contain sections of conversation that do not appear on the cassette. It
seems this may be because the transcripts were not put together in the
correct order when they were transferred to electronic files. Similarly, it
may be that a cassette contains sections of conversation that do not appear
on the transcript. This happens particularly in those cassettes where there
are long pauses between turns, where it is likely that a transcriber
unknowingly fast-forwarded past these parts in an effort to find the
conversation on the tape and complete the transcription more quickly.
3.2.3 The 60 texts from the BNC
The table below shows the identifying file numbers of the BNC files from which our 60 texts were taken:
RESPONDENTS BY AGE AND SEX

Age group    Male                           Female
0-14         KBR, KNY, KP9, KPA, KSW        KP2, KP3, KPG, KPY, KST
15-24        KBM, KCM, KD6, KPF, KSV        KBY, KCE, KDL, KDX, KPV
25-34        KBG, KC6, KCA, KDA, KDP        KB6, KBF, KBU, KC5, KCG
35-44        KBD, KCY, KD0, KD7, KE3        KB3, KB9, KBJ, KBL, KCD
45-59        KB1, KBK, KC1, KCF, KDN        KB7, KB8, KBE, KCN, KCV
60+          KB2, KBA, KBB, KBS, KC2        KB0, KBC, KC0, KC9, KPM

Table 2 BNC files from which extracts were taken for the SW&TP spoken corpus
4. Mark-up of the texts
The 120 files in our corpus are all marked up using TEI (Text Encoding Initiative)-conformant SGML in order to create a shareable archive, compatible with other corpora and concordancing packages. The SGML mark-up allows the corpus to be searched using concordancing programs such as WordSmith Tools and SARA. In this section we describe the conventions that we used when marking up our data. This section concentrates specifically on the mark-up of the corpus texts ready for SW&TP annotation. Section 5 discusses the application of the SW&TP tags to the data.
4.1 File headers
All the files in our corpus have a file header which contains bibliographical and other information about the file. It is impossible to describe this fully here, for reasons of space, but full details can be found in the TEI Guidelines at http://www.tei-c.org/P4X/HD.html. A brief description, though, of the header and the information it contains will be useful.
The header is divided into four main sections: 1) the file description, 2) the encoding description, 3) the profile description, and 4) the revision description. The file description contains full bibliographical details about the computer file itself, with which it is possible to catalogue the file in a library archive. The encoding description provides details about the types of tags that are used in the file, and can also be used to describe how the encoder resolved any problems that arose during tagging. For the non-corpus linguist, the profile description is perhaps of the most interest, since it is this description that contains classificatory and contextual information about the text itself. In the SW&TP spoken corpus, for example, the profile description contains personal details of the speakers, information about where the recording took place, the subject matter of the file, etc. Below is an example of the profile description from file no. 506 (a CNWRS file) in the SW&TP spoken corpus:
<particDesc>
<person id="PS202" n="24" sex="f" age="X">
<name>Lucinda Beier</name>
<occupation>Academic</occupation>
</person>
<person id="PS203" n="24" sex="m">
<name>Mr L3B</name>
<age>36</age>
<occupation>unknown</occupation>
<maritalStatus>married</maritalStatus>
<spousesOccupation>unknown</spousesOccupation>
<fathersOccupation>blastfurnaceman</fathersOccupation>
<mothersOccupation>seamstress</mothersOccupation>
</person>
</particDesc>
The tag <particDesc> stands for ‘participant description’ and, as can be seen in the example above, contains such information as the speaker’s age, sex, occupation etc. The <person> tag contains the speaker’s identity number (id) and the number of conversational turns they have (n). The <particDesc> section can, of course, be adapted to suit the purposes of the corpus; Sperberg-McQueen and Burnard (2001) provide advice on how to keep any such adaptations TEI-conformant.
With regard to the BNC texts in our corpus, because these come originally from files in the BNC World edition, the <particDesc> is copied wholesale from the original BNC World file header.
Finally, the revision description provides a history of changes made in the development of an electronic text. The purpose of this is to aid the development of subsequent versions of a file or corpus, and to record the history of a file. In the case of the CNWRS files, the revision description is used to explain how the original CNWRS data was digitised, OCR scanned, re-catalogued and marked-up.
We used a generic header as a template to add to marked-up chunks, and then adapted this to suit the file in question. For example, the name of the file is added to the header, and the various tags used in the file are counted, with the totals included within the encoding description.
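As a rough sketch only (we describe the step here, not the exact tool used to perform it; the sample string is hypothetical), tag totals of this kind could be produced with a short script that counts the opening tags listed in section 4.4:

import re
from collections import Counter

TAGS_OF_INTEREST = ['note', 'notelet', 'pause', 'sic', 'sptag',
                    'text', 'unclear', 'utterance', 'w']

def tag_counts(text):
    # Tally opening tags (with or without attributes) in a marked-up corpus file.
    counts = Counter()
    for tag in TAGS_OF_INTEREST:
        counts[tag] = len(re.findall(r'<' + tag + r'[\s>]', text))
    return counts

sample = '<utterance who="PS201" n="001">PS201: I know <pause dur="3"> he was a craftsman.</utterance>'
print(tag_counts(sample))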
4.2 Mark-up specific to the CNWRS texts
4.2.1. The ‘utterance’ element
In the CNWRS interviews we use the ‘utterance’ element to distinguish the speakers and their conversational turns. The ‘utterance’ element consists of two attributes: ‘who’ and ‘n’. An example is useful to explain the information contained within the tag:
<utterance who="PS201" n="001">PS201:
I know he was a craftsman at Gillows.
</utterance>

(148)
In the ‘utterance’ tag, the ‘who’ attribute identifies the speaker by means of the value following the equals sign: ‘PS201’ identifies the speaker as the interviewer Elizabeth Roberts. The attribute n="001" tells us that this is her first turn. The end tag </utterance>, which appears immediately after the speech, indicates the end of this utterance.
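Purely as an illustrative sketch (assuming the attribute order and quoting shown in the example above), utterances marked up in this way can be pulled out of a file with a few lines of Python:

import re

UTTERANCE = re.compile(
    r'<utterance\s+who="(?P<who>[^"]+)"\s+n="(?P<n>[^"]+)">(?P<speech>.*?)</utterance>',
    re.DOTALL)

def utterances(text):
    # Yield (speaker id, turn number, speech) for each marked-up utterance.
    for m in UTTERANCE.finditer(text):
        yield m.group('who'), m.group('n'), ' '.join(m.group('speech').split())

sample = '''<utterance who="PS201" n="001">PS201:
I know he was a craftsman at Gillows.
</utterance>'''

for who, n, speech in utterances(sample):
    print(who, n, speech)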
4.3 Mark-up specific to the BNC texts
4.3.1 The ‘u’ element
We used the same system as described in 4.2.1 with the BNC files in our corpus. The only difference here was that the BNC files already had person identifiers, and used <u> rather than our preferred <utterance>. We changed the <u> tags to <utterance> because <u> is also the tag used to indicate underlining in HTML; as a result, if one of our corpus files is opened in MS Word (e.g. for display purposes), the text is displayed underlined. Using <utterance> as opposed to <u> avoids this. The only other difference between the mark-up of the BNC and CNWRS texts is that the ‘utterance’ tag in the BNC texts does not include an ‘n’ attribute. This is because of the difficulty of interpreting, and therefore numbering, speakers’ turns in naturally occurring spontaneous conversation.
4.3.2 S-unit tags
One major problem in tagging the corpus files for SW&TP was that the large quantity of existing mark-up in the files made it very difficult to read the texts when applying the SW&TP tags. This was a particular problem with the BNC files which, in addition to the header, also included s-unit tags and part of speech tags for every word. S-units are used in the BNC to mark sentence-like units in spoken data. As an example, here is an extract from a tagged BNC World edition file:
Example 1
<utterance who=‘PS0E8’>PS0E8:
<s n="4240"><unclear> <w PRP>over <w DPS>her <w NN1>leotard<c PUN>, <w CJS>so <w PNP>we <w VVD>kept<c PUN>, <w PNP>we <w VVD>bought <w PNI-CRD>one<c PUN>, <w PNP>they <w VVD>said <w ITJ>oh<c PUN>, <w PNP>you <w VVB>know<c PUN>, <w PNP>you <w VVB>want <w AT0>a <w AV0>fairly <w AJ0>big <w PNI>one<c PUN>, <w CJS>so <w PNP>they <unclear><c PUN>, <w AV0>well <w PNP>it <w VVD-VVN>looked <w AJ0>alright <w PRP>across <w AV0>here<c PUN>, <w CJS>when <w PNP>we <w VVD>got <w PNP>it <w AV0>home<c PUN>, <w PNP>she <w VVD>tried <w PNP>it <w AVP-PRP>on<c PUN>, <w PNP>it <w VBD>was <w AV0>nearly <w AVP>down <w PRP>to <w DPS>her <w NN2>knees<c PUN>. <s n="4241"><w CJS>When <w PNP>we <w VVD>opened <w PNP>it <w AVP-PRP>up <w PNP>it <w VBD>was <w PRP>aged <w CRD>nine <w PRP>to <w CRD>eleven<c PUN>, <w CJC>and <w PNP>she <w VVD>said <w PNP>I <w VM0>ca<w XX0>n’t <w VVI>go <w PRP>in <w DT0>this<c PUN>. <s n="4242"><vocal desc=laugh> <w AV0>So <w PNP>we <w VHD>had <w TO0>to <w VVI>go <w AVP>back <w CJC>and <w VVI>swap <w PNP>it <w PRP>for <w CRD>seven <w PRP>to <w CRD>eight <pause></utterance>
(KCD)
Even before we had added our own SW&TP tags the data was difficult to read. Because of the difficulty of integrating our SW&TP tags with the existing mark-up, and to ensure that the BNC texts were statistically comparable with the CNWRS texts, we decided to strip out the s-unit tags. However, we did not remove the part of speech tags, as it is possible that these might prove useful at some later stage (if not for our project, then for future researchers using our corpus). We therefore decided to tag the files in MS Word, which enabled us to use a macro to suppress every tag in the file except the SW&TP tags. This increased the legibility of the files (which made them much easier to annotate for SW&TP), as can be seen from the example below, which is the same text as in Example 1, but with the s-units removed, the redundant tags suppressed and the SW&TP tags added:
Example 2
PS0E8: <sptag cat="A">
<unclear> over her leotard, so we kept, we bought one, </sptag> <sptag cat="xRS">they said </sptag> <sptag cat="xDS">oh, you know, you want a fairly big one, </sptag> <sptag cat="A">so they <unclear>, well it looked alright across here, when we got it home, she tried it on, it was nearly down to her knees. When we opened it up it was </sptag><sptag cat="FIW">aged nine to eleven, </sptag> <sptag cat="xRS"> and she said </sptag> <sptag cat="xDS">I can’t go in this.</sptag> <sptag cat="A"><vocal desc=laugh> So we had to go back and swap it for seven to eight <pause></sptag></utterance>
(KCD)
In addition to removing the s-unit tags we also removed <div> tags, used to mark divisions between different conversations, in order to validate our files using an SGML validator. We replaced these with a note in the text at the relevant point indicating that a division occurred.
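A rough sketch of this stripping step is given below (we actually carried out the work with an MS Word macro; the regular expressions here are assumptions based on the tag forms shown in Example 1, not the macro itself):

import re

def strip_for_annotation(text):
    # Remove s-unit tags such as <s n="4240">; part-of-speech tags are left in place.
    text = re.sub(r'<s n="\d+">', '', text)
    # Replace division tags with an in-text note, as described above.
    text = re.sub(r'<div[^>]*>', '<note>conversation division</note>', text)
    return text

example = '<s n="4241"><w CJS>When <w PNP>we <w VVD>opened <w PNP>it'
print(strip_for_annotation(example))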
4.4 A summary of the tags used to mark-up the corpus files
The following tags (and, where relevant, end tags) were used as we marked up our files:
<note></note>               Indicates a note from the encoder about a particular feature of the text that cannot be described with any of the other available tags.
<notelet>                   Indicates a note from the encoder about a particular feature of the text that cannot be described with any of the other available tags, and that is not scoped around a particular word or utterance.
<pause dur='x'>             Indicates a significant pause of at least 3 seconds; the length of the pause in seconds is noted in the 'dur' attribute.
<sic></sic>                 Indicates text reproduced from the original despite being apparently incorrect or inaccurate.
<sptag cat='x'></sptag>     Indicates the category of speech and/or thought and/or writing presentation inherent in a particular piece of discourse.
<text>                      Indicates the beginning of the spoken text.
<unclear>                   Indicates portions of the speech which are indistinct and cannot be transcribed.
<utterance></utterance>     Indicates an utterance.
<w>                         Indicates the part of speech of a word (BNC texts only).
5. Speech, Writing and Thought Presentation tagging
In this section we detail the system of tagging used to apply SW&TP annotation to the corpus data.
5.1 SW&TP categories
Here we show the acronyms used in the tags and their accompanying
definitions. For full definitions of the SW&TP categories see McIntyre et
al. (2003).
5.1.1 Main categories
Category    Definition
A           Anything other than SW&TP
RM          Report of Mention
RV          Minimal Report of Speech
RN          Minimal Report of Writing
RI          Minimal Report of Internal State
RS          Report of Speech
RW          Report of Writing
RT          Report of Thought
RSA         Report of Speech Act
RWA         Report of Writing Act
RTA         Report of Thought Act
IS          Indirect Speech
IW          Indirect Writing
IT          Indirect Thought
FIS         Free Indirect Speech
FIW         Free Indirect Writing
FIT         Free Indirect Thought
DS          Direct Speech
DW          Direct Writing
DT          Direct Thought
FDS         Free Direct Speech
FDW         Free Direct Writing
FDT         Free Direct Thought
The main changes in the tag-set since the SW&TP Written project are as follows:
- We have dispensed with the N element of the tags as a consequence of tagging oral texts. N previously stood for ‘narrative’ or ‘narrator’s’, and is therefore not applicable to spoken data. Hence, what in the written corpus would have been NRSA is in the spoken corpus simply RSA. Likewise, the single N tag, which was used in the written corpus to mark anything not tagged as SW&TP, is replaced in the spoken corpus by A – which refers to ‘[A]nything other than SW&TP’.
- We have changed the NW tag, used in the written corpus, to RN, to refer to the minimal report of writing - such as ‘I wrote to Eileen’ [KC2]. As in the written corpus, this tag is the equivalent in writing presentation terms of the tags RV and RI on the speech and thought scales. The change was necessary due to dispensing with the N element. Because of this we were left with the same tag - RW - to refer both to the ‘report of writing’ (i.e. a reporting clause of writing presentation preceding either the direct or indirect report of writing) and the minimal report of writing (such as the example given above of ‘I wrote to Eileen’ [KC2]). We therefore needed a different tag in order to distinguish between the two phenomena. We chose to use RN to tag the latter, the N being the only remaining consonant in the word ‘writing’ that is not used elsewhere in the tag-set.
- We have introduced a new tag, RM, to refer to ‘report of mention of language use’. This is used to tag instances where speakers refer to calling something or someone by a particular name. A prototypical example would be ‘[…] there was <sptag cat="xRM"> a character by the name name of Eddie Polo </sptag> <sptag cat="xRM"> in a serial called The Broken Coin. </sptag>‘.
5.1.2 Sub-categories
Sub-category    Definition
P               Extended topic
#               Flags problems for discussion, i.e. examples that we weren’t sure how to analyse
e               Embedded
g               Grammatical negative
a               Absence of category
h               Hypothetical
i               Inferred
q               With quote
r               Iterated
v               Interrogative
p               Imperative
u               Uncompleted
1/2/3 etc.      Level of embedding or number of repeated adjacent categories
The additions to the sub-category tags since the written corpus are g, a, r, v, p, u and the numbers to indicate levels of embedding and number of repeated adjacent categories. All new additions were to cope with particular phenomena we encountered in the spoken data.
5.2 The tagging process
John Heywood and Dan McIntyre tagged all the corpus texts for SW&TP. The tagged texts were then passed to Mick Short and Elena Semino individually for their comments and revisions. The project team discussed any remaining problems together and a final draft of each tagged file was produced following this discussion.
5.3 The tagging format
SW&TP categories are marked within the <sptag> category attribute
('cat'). We employ a 15-field category set, using ‘x’ as a placeholder for empty positions in order to assist concordancing.
Below is an example of an SW&TP tag and its component parts:
<sptag cat="x x x x x x x x x x x x x x x">

element:                              sptag
attribute:                            cat
attribute value [the SW&TP ‘tag’]:    "x x x x x x x x x x x x x x x" (fields 1-15)
Formally, the tag constitutes everything within angle brackets, though for
ease of reference in analysis we prefer to use the term 'tag' (as opposed to
'attribute value') to refer specifically to the SW&TP category itself.
The SW&TP tag, then, has 15 possible constituents (marked above by the
Xs), detailed in Table 3, below. (NB. We do not mark empty positions that follow the final category or sub-category
constituent.)
Field   Possible constituents   Definition of constituent
1       x A F                   Anything other than SW&TP; Free
2       x R I D                 Representation; Indirect; Direct
3       x S T W V I N M         Speech; Thought; Writing; Voice; Internal state; WritiNg; Mention
4       x A                     Act
5       x P                     toPic
6       x # 1 2 3 4             # = odd/interesting cases; numbers = number of repeated adjacent categories
7       x e                     embedded
8       x g a                   grammatical negative; absence of speech, thought and/or writing
9       x h                     hypothetical
10      x i                     inferred
11      x q                     quote
12      x r                     iterative
13      x v p                   interrogative; imperative
14      x u                     uncompleted
15      x 1 2 3 4               numbers = level of embedding

Table 3 Possible constituents of fields in the <sptag> category attribute
Some examples should help to make this clear:
- RIi (an inferred report of internal state) would be tagged formally as: <sptag cat="xRIxxxxxxi">
- RSAPg (a grammatically negative report of a speech act with an extended topic) would be tagged formally as: <sptag cat="xRSAPxxg">
- ISehu2 (second-level embedded, hypothetical, uncompleted indirect speech) would be tagged formally as: <sptag cat="xISxxxexhxxxxu2">
- FISeghvu3 (third-level embedded, grammatically negative, hypothetical, uncompleted, interrogative, free indirect speech) would be tagged formally as: <sptag cat="FISxxxeghxxxvu3">
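The mapping from these shorthand category labels to the padded 15-field attribute value is mechanical, and the following Python sketch (an illustration only, not a project tool; the field order follows Table 3) expands a shorthand such as 'FISeghvu3' into the formal attribute value, padding unused fields with 'x' and dropping empty positions after the final constituent:

# Possible constituents of each of the 15 fields (from Table 3); 'x' marks an
# empty position and is therefore not listed.
FIELDS = ['AF', 'RID', 'STWVINM', 'A', 'P', '#1234', 'e', 'ga',
          'h', 'i', 'q', 'r', 'vp', 'u', '1234']

def expand(shorthand):
    # Expand a shorthand SW&TP category (e.g. 'FISeghvu3') into the padded
    # attribute value used in the <sptag> 'cat' attribute.
    value, remaining = [], list(shorthand)
    for allowed in FIELDS:
        if remaining and remaining[0] in allowed:
            value.append(remaining.pop(0))
        else:
            value.append('x')
    if remaining:
        raise ValueError('could not place: ' + ''.join(remaining))
    # Empty positions after the final constituent are not marked.
    return ''.join(value).rstrip('x') or 'x'

for cat in ['RIi', 'RSAPg', 'ISehu2', 'FISeghvu3']:
    print(cat, '->', '<sptag cat="' + expand(cat) + '">')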
6. Further directions
We have recently begun the analysis of the corpus, and some
of our initial findings are reported in McIntyre et al. (2003). One
of the major results of our project so far, though, has been confirmation
that the model of speech, thought (and later) writing presentation suggested
by Leech and Short (1981) and developed by Short, Semino and Wynne in their
work on the Written Corpus, is applicable to spoken data, with few
modifications. Our work on the Spoken Corpus would seem to confirm the
robustness of the model of SW&TP that we are using.
Further research planned includes quantitative and qualitative analyses of
the corpus, comparisons between the Written and Spoken corpora, and
investigation into the role of prosody in SW&TP.
References
Leech, G. N. and Short, M. H. (1981) Style in Fiction. London: Longman.

McIntyre, D., Bellard-Thomson, C., Heywood, J., McEnery, A., Semino, E. and Short, M. (2003) 'The construction of a corpus to investigate the presentation of speech, thought and writing in written and spoken British English', in Archer, D., Rayson, P., Wilson, A. and McEnery, A. (eds) Proceedings of the Corpus Linguistics 2003 Conference. Lancaster University: UCREL Technical Papers 16, 513-22.