A Handbook to the Lancaster Speech, Writing and Thought Presentation Spoken Corpus
1. Introduction
The Lancaster Speech, Writing and Thought Presentation Spoken Corpus has been constructed as part of an AHRB-funded research project entitled ‘A corpus-based study of Speech, Writing and Thought Presentation in contemporary spoken British English’ (henceforth the SW&TP Spoken project). The aim of the project is to investigate the forms and functions of Speech, Writing and Thought Presentation (henceforth SW&TP) in a corpus of conversations and oral narratives, using a tagset developed by Mick Short, Elena Semino, Martin Wynne
and Jonathan Culpeper at Lancaster University, and based on the cline of categories proposed in Leech and Short (1981). We will be conducting quantitative and qualitative analyses of our data and comparing our findings with those of
our previous research project investigating SW&TP in written texts. (A full list of papers resulting from the SW&TP Written project is available on the project’s ‘publications’ page.)
1.1 Research questions
1.1.1 Over-arching research question
- What are the forms and functions of SW&TP in contemporary spoken British English?
1.1.2 Theoretical/methodological questions
- To what extent can models and categories originally developed for the analysis of SW&TP in written texts be applied to speech?
- To what extent can the model developed during the SW&TP Written project be applied to a spoken corpus?
- What modifications are needed to our existing annotation scheme?
1.1.3 Analytical questions
- What are the relative frequencies of different categories of SW&TP in the spoken corpus as a whole?
- Are there differences in the frequency of categories depending on variables such as the nature and context of the interaction or the identity of the speaker?
- How can these quantitative differences be related to the functions that different categories of SW&TP can perform in discourse?
- How do the frequencies in the spoken corpus compare with those in our written corpus?
2. The composition of the corpus
2.1 The data
The texts that form our corpus are drawn from two sources: 1) the spoken demographic section of the
British National Corpus (BNC -
World edition); and 2) an oral history archive stored in the Centre for North West Regional Studies (CNWRS) at Lancaster University.
We chose to construct a corpus of approximately 250,000 words in order to make it comparable with the existing SW&TP written corpus. The CNWRS archives and the BNC obviously provide a far larger body of data than we required, and so we opted to select 120 ‘chunks’ (60 from the BNC and 60 from the CNWRS archives) of approximately 2000 words each (as this was the size of the texts in the written corpus), providing 240,000 words in total. We also decided that the chunks would not be stopped at exactly 2000 words, but would be allowed to run on a little, preferably to the end of discussion of a particular topic. This would give us the remaining 10,000 words needed to make our corpus approximately 250,000 words in size. Below is a description of the CNWRS and BNC data.
2.2 The CNWRS data
The CNWRS data consists of two archives: 1) The ‘Family and Social Life’ archive; and 2) The ‘Childhood and Schooling’ archive.
A diagram detailing the CNWRS texts in our corpus is available via the accompanying link.
2.2.1 The ‘Family and Social Life’ archive
This archive was compiled and organised by Elizabeth Roberts, Emeritus Reader in Social History at Lancaster University. The data was collected in the 1970s and 1980s by Elizabeth Roberts and Lucinda Beier, and the archive consists of 250 hours of interviews, stored on audiocassettes and reel tapes, with accompanying transcripts. The interviews fall into two main groups:
1) Interviews done in the mid-1970s, with respondents recalling the period 1890-1940.
2) Interviews done in the mid-1980s, with respondents recalling the period 1940-1970.
The interviewees lived in Preston, Barrow or Lancaster at the time of interview, but we made the decision to use only the records relating to Barrow and Lancaster, for reasons explained in section 3.1.1.
2.2.2 The ‘Childhood and Schooling’ archive
This archive was compiled and organised by Penny Summerfield, formerly of Lancaster University and now Professor of Modern History at the University of Manchester. Collected in the 1980s, the data in this archive consists of approximately 200 hours of interviews on audiocassette, with accompanying transcripts. The interviewees recall their years at secondary schools between 1920 and 1950 in Lancaster and Morecambe, Preston, Blackburn, Burnley and Clitheroe.
The respondents in both these archives were all born prior to the early 1950s, and their memories provide an irreplaceable collection of personal retrospectives, many very particular to the North West. The oldest contributors were born in the 1880s, and their recollections of their own childhood and their parents’ life and times describe ordinary life as far back as the mid-19th Century. Overall, then, the archives provide a perspective on some 150 years of ordinary life. Of particular interest to us is the way in which many respondents tell stories or narrate events in their lives in ways which include the presentation of their own or others’ speech, thought and writing.
2.2.3 How the CNWRS archives are organised
2.2.3.1 The Family and Social Life Archive
This archive consists of two sections: 1) records for the interviews relating to 1890-1940; and 2) records for the interviews relating to 1940-1970. The archive is divided this way because the latter interviews were carried out approximately ten years after the first. Within each of these two sections, the records are arranged by the town in which the interviews took place, giving three blocks, for Preston, Barrow and Lancaster. Within each town, the interviews are filed according to the initial of the respondent’s surname and the initial of the town in which the respondent was living at the date of interview. Where a respondent’s surname begins with the same letter as that of another interviewee, a number is used to differentiate them. For example:
- ‘Mrs B3B’ is a married woman whose surname begins with ‘B’ and who lives in Barrow. She is the third respondent from that town with a surname beginning with ‘B’.
- ‘Mr R2L’ is a man whose surname begins with ‘R’, and who lives in Lancaster. He is the second respondent from Lancaster to have a surname beginning with ‘R’.
In addition to the interview transcripts, the archive contains basic biographical details of each interviewee, and a record of the numbers of all tapes related to an individual respondent. So, for example, Mr R2L may have interviews recorded on reel tape 145, and cassettes 500 and 790. There is also a ‘Subject File’, which consists of an alphabetical list of topics discussed in the interviews. This is provided mainly to facilitate socio-historical research, and is cross-referenced to show which individuals discuss a given topic on which tapes.
2.2.3.2 The Childhood and Schooling Archive
The transcripts for these interviews are organised by school and by gender. In addition, there are index cards, ‘Interview Checklists’ (giving details of the interviewees’ names, pseudonyms if requested, addresses and how many hours of interview they gave), and additional details about the data-gathering process. There is also a Biographical Index, giving basic biographical information, and a card-index detailing topics covered in the interviews. The tapes, however, are not labelled other than with the respondent’s name or pseudonym.
Because of the danger of confusing pseudonyms and real names, we decided on an alternative method of reference for this archive. The tapes were numbered according to the alphabetical order of the respondents’ surnames (regardless of whether the name on the tape was real or false). To differentiate them from the Family and Social Life archive tapes, their numbers were preceded by ‘C’ to indicate that they belonged to the Childhood and Schooling Archive.
2.3 BNC data
Approximately 10 million words of spoken data are available from the BNC, and are divided into two main categories: 1) demographic and 2) context-governed texts. Demographic texts are transcriptions of spontaneous natural conversations made and recorded by members of the public. These conversations were all recorded around 1991 - 92, and there are equal numbers of male and female speakers. In addition, there is a roughly equal distribution of speakers across age ranges and social class groups. The context-governed texts, on the other hand, do not constitute spontaneous natural speech. They are transcriptions of recordings made at specific types of meetings and events. There are roughly equal quantities of speech divided across four main categories, these being educational/informative events (e.g. lectures, news broadcasts), business events (e.g. interviews, trade union meetings), institutional/public events (e.g. sermons, parliamentary proceedings) and leisure events (e.g. after-dinner speeches, radio phone-ins). We decided to use only material from the spoken demographic section of the BNC, as this would allow us to contrast spontaneous speech with the elicited data of the CNWRS archives. Since the BNC data was collected in the early 1990s and the CNWRS data in the 1970s and 1980s, we also left open the possibility of studying diachronic developments in speech.
A diagram detailing the BNC texts in our corpus is available via the accompanying link.
In the following section we describe the criteria we used in selecting material for our corpus from both the BNC and the CNWRS archives.
3. Selecting texts for our corpus
To provide as balanced a range of data as possible, we took 60 chunks from the BNC (discussed in section 2.3), 30 from the Childhood and Schooling archive, and 30 from the Family and Social Life archive (15 from the 1890-1940 batch and 15 from the 1940-1970 batch), with an approximately even split of male and female respondents throughout. We also tried to obtain as wide an age range as possible amongst the CNWRS respondents, so that it might later be possible for us to consider potential evidence of diachronic changes in usage. We decided not to make a further distinction between social classes, as the overwhelming majority of the CNWRS respondents are working class, and so we did not have enough respondents from other social classes to make this difference representative. Similarly, if we had tried to select BNC data according to respondents’ social class we would not have had enough examples from each group to form a representative data set.
3.1 Selecting texts from the CNWRS archives
Our selection of texts from the CNWRS archives was in part restricted by the generally poor quality of the transcripts and audio tapes. In the following section we describe the criteria developed for selecting suitable CNWRS texts to include in our corpus.
3.1.1 Criteria for selecting texts from the CNWRS archives
First, many of the transcripts were dot-matrix printed, and some of these were indistinct. We photocopied and darkened these to make them suitable for scanning, but some were still insufficiently clear and could not be used without being completely re-transcribed. The older transcripts were typed on typewriters, with varying degrees of legibility due to dirty keys, mistyping, etc. These were usually dark enough to scan, but because of the poor legibility of the original transcripts, the resulting scanned file would contain a large number of errors, which all had to be corrected by hand.
Similarly, much of the audio material was not in good condition; the older reel tapes were starting to deteriorate, some of the cassettes used had, even initially, been of poor quality, and many were recorded without regard to recording levels. Because of this, sometimes respondents were virtually inaudible, and sometimes the recordings were made at such a high level that we could not remove the distortion from the captured sound file without damaging the comprehensibility of the interview. Additionally, many had been recorded in the respondents’ homes, resulting in such phenomena as traffic noises from outside, clocks chiming, people ringing doorbells and many helpful spouses bringing in tea! All of these factors meant that we could not use the affected sound files as part of our corpus.
With regard to the respondents, many of the interviews in both archives include several participants. We decided only to use one-to-one interviews in our corpus, since the BNC files would constitute multiple-participant conversations.
Secondly, many of the respondents in the 1940-1970 batch were children of those in the 1890-1940 batch, and some of the respondents in both batches were related by blood or marriage to other respondents. We decided that we should, if possible, select our files only from those records where the respondent was not related in any way to any other respondent, so that possible similarities of speech style due to close and frequent interaction between family members would be eliminated, and the range of usage styles maximised. The poor quality of some of the material also led us to develop a set of criteria for what to exclude when selecting files from the CNWRS archive. These were as follows:
- Any records for which audio material was unavailable or of too poor a quality to be used.
- Any records which included interviews with 3 or more participants (including the interviewer).
- Any records where the respondent seemed to be in any way reluctant to disclose information or had requested not to be included.
- Any records where the respondent was a relative of another respondent.
- Those records for which the transcript remained seriously illegible, even after photocopying.
- Those transcripts with very thin, torn or creased pages which might be difficult to copy or scan.
- Multiple-interview recordings - Because the majority of the respondents in the CNWRS archives had been interviewed more than once, indeed some as many as six or seven times, we decided that a further criterion would be that only one chunk should be selected from any single individual, in order to standardise the sample and maximise the range of speakers represented.
- Preston files - When selecting data from the CNWRS archive material we discovered that the ‘Family and Social Life’ transcripts relating to Preston were summaries of what the participants had said, rather than accurate transcriptions. Although this was deemed suitable for the purposes of historical research by the historians who compiled the archive, it was not accurate enough for linguistic analysis. Additionally, one of the original transcribers had converted all instances of the presentation of Direct Speech into Indirect Speech, clearly affecting those areas in which we would be most interested and making it difficult even to see how to identify them clearly. We therefore decided not to select any samples from the Preston files.
Once we had decided which files to exclude from our data, we then set about finding suitable sound and transcription chunks from each archive. This meant first looking through the catalogues of each archive in order to select potentially suitable candidates, taking into account our criteria for selection relating to age, gender, etc.
3.1.1.1 Transcriptions
We found by examining the SW&TP written corpus that SW&TP occurs more frequently in longer extracts than in shorter ones. Therefore, after the initial selection, we examined each transcript for ‘long turns’, on the basis that these might provide a richer distribution of the kind of features we were seeking to examine. In fact, this meant excluding any records where the turn-taking was for the most part of a very brief question and answer format, or where long turns only occurred widely spread out through the text, so that a 2000 word chunk would not be able to include more than one or two such turns.
Once we had identified potentially rich interviews for our purposes, we photocopied each respondent’s transcript and passed the copy to members of the research team so that they could look for suitable extracts. The photocopying also meant that we were not examining the original transcripts and so did not have to remove files from the archives for long periods of time.
Once the transcripts had been decided on, we scanned them, and then reduced the resulting electronic file to the ‘chunk’ required for inclusion in our corpus. Scanning was done using Omni-Page Pro, and the scanned files were saved as Word documents: these were labelled <xxxWtr.doc>, where ‘xxx’ is the number of the original file. We then reduced the contents of these files to the selected chunk and re-saved these as <xxxWch.doc>. We also put together a further file for each interview, labelled <xxxH.doc>, which contained as much biographical information as possible about the interview participants, in preparation for inclusion in the header of the eventual marked-up corpus files.
Once the chunks had been selected, we then had to deal with the problem of the relatively poor quality of the original transcriptions. These had been done by non-linguists and in many instances we found that the original transcriptions of the CNWRS interviews were not accurate enough for linguistic analysis. It was therefore necessary for us to partially or wholly re-transcribe the interview chunks to produce an accurate orthographic transcription. At this point it is worth noting the unusual elements found in some of the original transcripts, and the decisions we made concerning these.
3.1.1.2 Irregularities and oddities in the original transcriptions
The original transcriptions were made for an oral history research project, and were not transcribed by linguists. As a result there are various irregularities and inconsistencies in the original transcripts which make them unsuitable, as they stand, for the purpose of linguistic analysis.
The first decision we took was on how to deal with punctuation in the original transcriptions. The original transcribers had punctuated as they saw fit, and, due to the nature of the original project, without regard for linguistic transcription conventions. The speech, then, is divided into what the original transcribers deemed to be ‘sentences’, with commas, full-stops etc. dividing these. It is also the case that the OCR scanning process unavoidably produces some corruptions of the original text. Hence, a full-stop may appear as a comma, a comma as a letter ‘g’ or ‘y’, inverted commas as an asterisk, etc. We decided to retain the original punctuation as shown in the ‘hard copy’ transcription, except in those cases where it was necessary to re-transcribe a section due to the inaccuracy of the original transcription, or where we had to newly transcribe stretches of interaction that had been omitted or summarised. In these cases, the only punctuation that we added was full-stops where we felt a sentence boundary would most likely exist in a written form of the interview. We did this to prevent the text from becoming impossibly difficult to read and understand.
This was the major issue to be resolved. There were, however, other anomalies which we had to deal with. Some of the original transcribers, for reasons quite unknown to us, and indeed to the original researchers, had placed inverted commas around all instances of the words ‘yes’ and ‘no’. We removed this punctuation to avoid it appearing that these words were instances of direct speech. Another transcriber had used a capital letter at the beginning of the line every time a new page was started, regardless of whether this was the beginning of a ‘sentence’ or not. In each case we converted these to lower case letters where necessary. Finally, many transcribers had misspelled words, particularly proper nouns. All these misspellings were corrected.
3.1.1.3 Transcribing normal non-fluency features
With regard to producing a more accurate orthographic transcription, it was necessary to transcribe normal non-fluency features, which were often missing from the original transcriptions. We kept this as simple as possible, with three possible transcriptions available: ‘er’, ‘erm’ and ‘um’, which we felt adequately described the range of normal non-fluency features present in the data. We did not transcribe non-verbal events such as laughter and coughing, since i) they are unlikely to affect the presentation of speech, thought and writing, and ii) our time was necessarily limited. However, it would be a relatively straightforward task to include this information at a later date if required.
3.1.1.4 Sound archives
In addition to producing electronic copies of the interview transcripts, we also digitised their related audio tapes. We did this using an application called CoolEdit, which allowed us to convert the original tapes to <*.wav> files. The <*.wav> files were recorded as 16-bit files in mono, in order that they should be in a form suitable for later time-alignment.
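As a minimal illustrative check (not part of the project workflow described here; it uses Python's standard wave module and a file named according to the chunk-naming scheme described below), a captured file's format can be verified as follows:

import wave

def check_wav_format(path):
    # A captured sound file should be 16-bit (2 bytes per sample) and mono.
    with wave.open(path, 'rb') as w:
        sample_width = w.getsampwidth()
        channels = w.getnchannels()
    ok = (sample_width == 2 and channels == 1)
    print(path, 'is 16-bit mono' if ok else 'needs re-capturing')
    return ok

# Example (assuming the chunk file exists locally):
# check_wav_format('136ch.wav')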
Our intention initially was to capture each whole tape, so that it could later be ‘chunked’, and the original whole sound file could be kept for future reference. However, because sound files can only be captured in real time, time constraints meant that we were only able to capture that section of each tape that related to the extract from the transcript.
There were, as previously stated, many recordings which were not suitable for capture due to low quality of the recordings. Additionally, odd discrepancies came to light during the process of digitisation, such as audio material which was incomplete, or tapes that were incorrectly labelled and did not relate to the supposed accompanying transcription. In these cases the tapes were not used. CoolEdit does provide some post-capture processing features, though it was not possible in every case to provide ‘clean’ sound files, since the use of more than one or two post-capture processes tended to reduce the sound quality of the speech to an unacceptable level.
With regard to storing the <*.wav> files, we followed a similar process to that applied with the transcriptions. In those instances where the whole original audio tape was digitised, the resulting file would be labelled as <xxx.wav>; the relevant chunk from that file would be labelled <xxxch.wav>.
3.1.1.5 Storing the CNWRS data to await mark-up
Once the digitisation and scanning of the transcripts was complete, we stored the files for each individual chunk in separate folders, and named each folder with the original CNWRS tape number that had been used to generate the chunk filenames. So, for example, folder 136 contained 136Wch.doc, 136H.doc and 136ch.wav.
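As an illustrative sketch only (not a tool used on the project), this folder convention could be checked automatically with a short script such as the following; the folder name '136' follows the example given above:

import os

EXPECTED_SUFFIXES = ['Wch.doc', 'H.doc', 'ch.wav']

def check_chunk_folder(folder):
    # The folder name is the original CNWRS tape number, e.g. '136'.
    tape = os.path.basename(folder.rstrip(os.sep))
    # Report any of the three expected chunk files that are missing.
    return [tape + suffix for suffix in EXPECTED_SUFFIXES
            if not os.path.exists(os.path.join(folder, tape + suffix))]

print(check_chunk_folder('136'))   # [] if 136Wch.doc, 136H.doc and 136ch.wav are all present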
3.1.1.6 The 60 texts from the CNWRS archive
The table below shows the identifying file numbers of the 60 CNWRS files from which our texts were taken:
RESPONDENTS BY SEX

Family and Social Life Archive     Childhood and Schooling Archive
Male        Female                 Male        Female
105         120                    C002        C087
123         132                    C008        C090
125         136                    C009        C102
138         147                    C017        C104
148         150                    C025        C107
161         169                    C027        C115
184         177                    C029        C126
506         197                    C030        C131
531         500                    C036        C135
545         511                    C041        C137
560         530                    C047        C146
562         538                    C055        C149
588         550                    C061        C152
787         580                    C063        C174
-           683                    C196        C185
-           766                    -           -

Table 1 CNWRS files from which extracts were taken for the SW&TP spoken corpus
3.2 Selecting texts from the BNC
We decided to concentrate primarily on dialogic texts from the spoken demographic section of the BNC, in order that the BNC files would be significantly different from the CNWRS texts we had selected. (Using the BNC’s system of classification, texts from the CNWRS archives would be considered context-governed data.)
We chose texts from the BNC that cover all age ranges (using the BNC’s classification system), with an equal division between male and female speakers, and when dividing the texts into these categories we took the age and sex of the respondent as the determining factors for categorisation. We also concentrated solely on face-to-face interaction (so we did not use transcripts of radio phone-ins, for example), and we used only those texts which constitute spontaneous, unscripted data.
3.2.1 Criteria for selection of BNC texts
There are 153 texts in the spoken demographic section of the BNC, all of which are spontaneous, natural conversations on a wide variety of topics. We took 60 chunks from these texts, each taken from a different file in the BNC World edition.
Six age groupings exist in the BNC classification system - 0-14, 15-24, 25-34, 35-44, 45-59, and 60 plus – and these are all represented in our choice of texts. Thus, there are 10 chunks per age group in our corpus. There is an equal division between male and female speakers, which means that in each age group we have five texts from male speakers and five from female speakers. We decided that it was not practicable to choose texts representing the variety of social class groups, since this would have resulted in there not being enough texts in each division to be representative. Additionally, we had already decided that social class was an inappropriate distinction with the CNWRS data.
The first stage in selecting suitable chunks to use was to divide the spoken texts in the BNC into the different age groups. We then searched the texts for common reporting verbs using the BNCweb query facility. Where the query returned favourable results, we then examined that area of the text in question manually, to see if it was likely to yield rich data. So, in addition to the reporting verbs picked up in the electronic search, we also looked for further examples of SW&TP in close proximity to these. And again, as with the CNWRS data, we favoured longer turns on the basis that narrative report and SW&TP were more likely to appear in longer stretches of text (we made this decision on the basis of examining the earlier SW&TP written corpus). Once we had chosen suitable chunks, these were then passed round members of the project team for closer inspection, and would be accepted or rejected at this stage.
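By way of illustration only (the searching itself was done with BNCweb; the verb list and sample text below are assumptions, not a record of the queries we ran), the following Python sketch shows how a plain-text transcript might be scanned for common reporting verbs so that the surrounding stretch of text can then be inspected manually:

import re

# Illustrative list only; not the full set of reporting verbs searched for.
REPORTING_VERBS = ['said', 'says', 'told', 'asked', 'thought', 'wrote']

def candidate_stretches(text, window=200):
    # Yield (verb, surrounding context) for each reporting-verb hit.
    pattern = re.compile(r'\b(' + '|'.join(REPORTING_VERBS) + r')\b', re.IGNORECASE)
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        yield match.group(1), text[start:end]

sample = 'then he said well I thought you were coming, so I told him no'
for verb, context in candidate_stretches(sample):
    print(verb, '->', context)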
3.2.2 BNC Sound files
In addition to having sound files corresponding to the CNWRS transcripts, we
have also attempted to acquire the related sound files for the BNC texts in
our corpus. However, these have proved rather more difficult to get hold of.
The problem lies with the cataloguing of the original cassettes on which the
Spoken Demographic data was recorded, and with the fact that these do not
correspond exactly to their related transcripts. Nevertheless, despite the
difficulties it is possible to locate the extracts from our corpus, and work
on this is ongoing.
The cassettes for the Spoken Demographic section of
the BNC are stored in the National Sound Archive of the British Library in
London, in cardboard boxes marked ‘C897’. Transcription logs to
accompany these are stored in the same place in black ring binders, and are
useful in locating the relevant extracts. To correlate a transcript with a
cassette it is first necessary to look in the header of that transcript for
which the sound file is required. In the recording statement (<RecordingStmt>)
there are a series of recording elements. There is one recording element for
each division of the sound file that accompanies a particular corpus text.
Following the ‘n’ attribute in the recording element is a 6-digit
number. The first four digits of this number indicate the cassette number on
which the sound file can be found (NB there may be numerous cassettes for
each text file). Each cassette is sub-divided into various sections, usually
corresponding to a change of scene/location (indicated by the <div>
tag). These sub-divisions are marked by the final two digits of the six-digit number, i.e. the two digits following the four that indicate the cassette. Unfortunately, these numbers
are not recorded on the sleeves of the audio cassettes, but are present in
the ‘n’ attribute value of the recording element. The first four digits
of the ‘n’ attribute value can be used to locate all the cassettes that
correspond to the BNC file in question. On the cassette boxes, these four
digits are prefixed by the letters ‘OG’. Once all the cassettes for a
particular file have been located, the next step is to identify which of
these is most likely to contain the relevant extract. There are various
pointers that can help here. First of all, in the Responsibility Statement
of the SW&TP file for which a corresponding BNC sound file is being
sought there is a record of the s-units that were removed from the original
BNC text (see section 4.3.2). If the s-unit numbers are low (e.g. 025-140)
it is likely that the extract will be contained on the first cassette.
Secondly, the number of participants in an exchange can be used to delimit a
search. The participants in the transcript held in our corpus can be checked
against the relevant BNC transcription log, and from this it is possible to
work out which cassettes can be excluded from the search; for example, if an
extract contains just two participants then all those cassettes containing
three or more participants can be excluded. Finally, the notes in the
transcription log can be useful. These contain the first line of each
sub-division of a sound file, making it possible to identify particular
sections when listening to the cassette. Once a probable cassette has been
selected, it is a case of fast-forwarding through it and stopping at various
points to try and locate the relevant extract. The ‘find’ facility in
Notepad or Wordpad can be used to search in the transcript for words
mentioned on the cassette, and by doing this it is possible with a little
effort to locate the corresponding sound file.
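The numbering scheme just described can also be expressed programmatically. The following Python sketch is purely illustrative (the regular expression makes assumptions about how the recording elements and their 'n' attribute values appear in the header, and the sample values are invented); it lists the cassette label and sub-division implied by each recording element:

import re

def cassette_locations(header_text):
    # Each recording element carries a six-digit 'n' value: the first four
    # digits give the cassette number, the last two the sub-division.
    locations = []
    for n_value in re.findall(r'<recording\b[^>]*\bn="(\d{6})"', header_text):
        cassette, subdivision = n_value[:4], n_value[4:]
        locations.append(('OG' + cassette, subdivision))   # 'OG' prefix as on the cassette boxes
    return locations

# Hypothetical header fragment with invented 'n' values.
sample_header = '<recording n="123401"/> <recording n="123402"/>'
for cassette, division in cassette_locations(sample_header):
    print('cassette', cassette, 'sub-division', division)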
Having
said this, there are still problems with the cataloguing of both the
cassettes and the transcripts. It is sometimes the case that a transcript
will contain sections of conversation that do not appear on the cassette. It
seems this may be because the transcripts were not put together in the
correct order when they were transferred to electronic files. Similarly, it
may be that a cassette contains sections of conversation that do not appear
on the transcript. This happens particularly in those cassettes where there
are long pauses between turns, where it is likely that a transcriber
unknowingly fast-forwarded past these parts in an effort to find the
conversation on the tape and complete the transcription more quickly.
3.2.3 The 60 texts from the BNC
The table below shows the identifying file numbers of the BNC files from which our 60 texts were taken:
RESPONDENTS BY AGE AND SEX

Age group    Male                           Female
0-14         KBR, KNY, KP9, KPA, KSW        KP2, KP3, KPG, KPY, KST
15-24        KBM, KCM, KD6, KPF, KSV        KBY, KCE, KDL, KDX, KPV
25-34        KBG, KC6, KCA, KDA, KDP        KB6, KBF, KBU, KC5, KCG
35-44        KBD, KCY, KD0, KD7, KE3        KB3, KB9, KBJ, KBL, KCD
45-59        KB1, KBK, KC1, KCF, KDN        KB7, KB8, KBE, KCN, KCV
60+          KB2, KBA, KBB, KBS, KC2        KB0, KBC, KC0, KC9, KPM

Table 2 BNC files from which extracts were taken for the SW&TP spoken corpus
4. Mark-up of the texts
The 120 files in our corpus are all marked up using TEI (Text Encoding Initiative)-conformant SGML in order to create a shareable archive, compatible with other corpora and concordancing packages. The SGML mark-up allows the corpus to be searched using concordancing programs such as WordSmith Tools and SARA. In this section we describe the conventions that we used when marking up our data. This section concentrates specifically on the mark-up of the corpus texts ready for SW&TP annotation. Section 5 discusses the application of the SW&TP tags to the data.
4.1 File headers
All the files in our corpus have a file header which contains bibliographical and other information about the file. It is impossible to describe this fully here, for reasons of space, but full details can be found in the TEI Guidelines at http://www.tei-c.org/P4X/HD.html. A brief description, though, of the header and the information it contains will be useful.
The header is divided into four main sections: 1) the file description, 2) the encoding description, 3) the profile description, and 4) the revision description. The file description contains full bibliographical details about the computer file itself, with which it is possible to catalogue the file in a library archive. The encoding description provides details about the types of tags that are used in the file, and can also be used to describe how the encoder resolved any problems that arose during tagging. For the non-corpus linguist, the profile description is perhaps of the most interest, since it is this description that contains classificatory and contextual information about the text itself. In the SW&TP spoken corpus, for example, the profile description contains personal details of the speakers, information about where the recording took place, the subject matter of the file, etc. Below is an example of the profile description from file no. 506 (a CNWRS file) in the SW&TP spoken corpus:
<particDesc>
<person id="PS202" n="24" sex="f" age="X">
<name>Lucinda Beier</name>
<occupation>Academic</occupation>
</person>
<person id="PS203" n="24" sex="m">
<name>Mr L3B</name>
<age>36</age>
<occupation>unknown</occupation>
<maritalStatus>married</maritalStatus>
<spousesOccupation>unknown</spousesOccupation>
<fathersOccupation>blastfurnaceman</fathersOccupation>
<mothersOccupation>seamstress</mothersOccupation>
</person>
</particDesc>
The tag <particDesc> stands for ‘participant description’ and, as can be seen in the example above, contains such information as the speaker’s age, sex, occupation etc. The <person> tag contains the speaker’s identity number (id) and the number of conversational turns they have (n). The <particDesc> section can, of course, be adapted to suit the purposes of the corpus; Sperberg-McQueen and Burnard (2001) provide advice on how to keep any such adaptations TEI-conformant.
With regard to the BNC texts in our corpus, because these come originally from files in the BNC World edition, the <particDesc> is copied wholesale from the original BNC World file header.
Finally, the revision description provides a history of changes made in the development of an electronic text. The purpose of this is to aid the development of subsequent versions of a file or corpus, and to record the history of a file. In the case of the CNWRS files, the revision description is used to explain how the original CNWRS data was digitised, OCR scanned, re-catalogued and marked-up.
We used a generic header as a template to add to marked-up chunks, and then adapted this to suit the file in question. For example, the name of the file is added to the header, and the various tags used in the file are counted, with the totals included within the encoding description.
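As a rough sketch only (we describe the step here, not the exact tool used to perform it; the sample string is hypothetical), tag totals of this kind could be produced with a short script that counts the opening tags listed in section 4.4:

import re
from collections import Counter

TAGS_OF_INTEREST = ['note', 'notelet', 'pause', 'sic', 'sptag',
                    'text', 'unclear', 'utterance', 'w']

def tag_counts(text):
    # Tally opening tags (with or without attributes) in a marked-up corpus file.
    counts = Counter()
    for tag in TAGS_OF_INTEREST:
        counts[tag] = len(re.findall(r'<' + tag + r'[\s>]', text))
    return counts

sample = '<utterance who="PS201" n="001">PS201: I know <pause dur="3"> he was a craftsman.</utterance>'
print(tag_counts(sample))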
4.2 Mark-up specific to the CNWRS texts
4.2.1. The ‘utterance’ element
In the CNWRS interviews we use the ‘utterance’ element to distinguish the speakers and their conversational turns. The ‘utterance’ element consists of two attributes: ‘who’ and ‘n’. An example is useful to explain the information contained within the tag:
<utterance who="PS201" n="001">PS201:
I know he was a craftsman at Gillows.
</utterance>

(148)
In the ‘utterance’ tag, the ‘who’ attribute identifies the speaker by means of the value following the equals sign: ‘PS201’ identifies the speaker as the interviewer Elizabeth Roberts. The attribute n="001" tells us that this is her first turn. The end tag </utterance>, which appears immediately after the speech, indicates the end of this utterance.
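Purely as an illustrative sketch (assuming the attribute order and quoting shown in the example above), utterances marked up in this way can be pulled out of a file with a few lines of Python:

import re

UTTERANCE = re.compile(
    r'<utterance\s+who="(?P<who>[^"]+)"\s+n="(?P<n>[^"]+)">(?P<speech>.*?)</utterance>',
    re.DOTALL)

def utterances(text):
    # Yield (speaker id, turn number, speech) for each marked-up utterance.
    for m in UTTERANCE.finditer(text):
        yield m.group('who'), m.group('n'), ' '.join(m.group('speech').split())

sample = '''<utterance who="PS201" n="001">PS201:
I know he was a craftsman at Gillows.
</utterance>'''

for who, n, speech in utterances(sample):
    print(who, n, speech)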
4.3 Mark-up specific to the BNC texts
4.3.1 The ‘u’ element
We used the same system as described in 4.2.1 with the BNC files in our corpus. The only difference here was that the BNC files already had person identifiers, and used <u> rather than our preferred <utterance>. We changed the <u> tags to <utterance> because <u> is also the tag used to indicate underlining in HTML; as a result, if one of our corpus files is opened in MS Word (e.g. for display purposes), the text is displayed underlined. Using <utterance> as opposed to <u> avoids this. The only other difference between the mark-up of the BNC and CNWRS texts is that the ‘utterance’ tag in the BNC texts does not include an ‘n’ attribute. This is because of the difficulty of interpreting, and therefore numbering, speakers’ turns in naturally occurring spontaneous conversation.
4.3.2 S-unit tags
One major problem in tagging the corpus files for SW&TP was that the large quantity of existing mark-up in the files made it very difficult to read the texts when applying the SW&TP tags. This was a particular problem with the BNC files which, in addition to the header, also included s-unit tags and part of speech tags for every word. S-units are used in the BNC to mark sentence-like units in spoken data. As an example, here is an extract from a tagged BNC World edition file:
Example 1
<utterance who=‘PS0E8’>PS0E8:
<s n="4240"><unclear> <w PRP>over <w DPS>her <w NN1>leotard<c PUN>, <w CJS>so <w PNP>we <w VVD>kept<c PUN>, <w PNP>we <w VVD>bought <w PNI-CRD>one<c PUN>, <w PNP>they <w VVD>said <w ITJ>oh<c PUN>, <w PNP>you <w VVB>know<c PUN>, <w PNP>you <w VVB>want <w AT0>a <w AV0>fairly <w AJ0>big <w PNI>one<c PUN>, <w CJS>so <w PNP>they <unclear><c PUN>, <w AV0>well <w PNP>it <w VVD-VVN>looked <w AJ0>alright <w PRP>across <w AV0>here<c PUN>, <w CJS>when <w PNP>we <w VVD>got <w PNP>it <w AV0>home<c PUN>, <w PNP>she <w VVD>tried <w PNP>it <w AVP-PRP>on<c PUN>, <w PNP>it <w VBD>was <w AV0>nearly <w AVP>down <w PRP>to <w DPS>her <w NN2>knees<c PUN>. <s n="4241"><w CJS>When <w PNP>we <w VVD>opened <w PNP>it <w AVP-PRP>up <w PNP>it <w VBD>was <w PRP>aged <w CRD>nine <w PRP>to <w CRD>eleven<c PUN>, <w CJC>and <w PNP>she <w VVD>said <w PNP>I <w VM0>ca<w XX0>n’t <w VVI>go <w PRP>in <w DT0>this<c PUN>. <s n="4242"><vocal desc=laugh> <w AV0>So <w PNP>we <w VHD>had <w TO0>to <w VVI>go <w AVP>back <w CJC>and <w VVI>swap <w PNP>it <w PRP>for <w CRD>seven <w PRP>to <w CRD>eight <pause></utterance>
(KCD)
Even before we had added our own SW&TP tags the data was difficult to read. Because of the difficulty of integrating our SW&TP tags with the existing mark-up, and to ensure that the BNC texts were statistically comparable with the CNWRS texts, we decided to strip out the s-unit tags. However, we did not remove the part of speech tags, as it is possible that these might prove useful at some later stage (if not for our project, then for future researchers using our corpus). We therefore decided to tag the files in MS Word, which enabled us to use a macro to suppress every tag in the file except the SW&TP tags. This increased the legibility of the files (which made them much easier to annotate for SW&TP), as can be seen from the example below, which is the same text as in Example 1, but with the s-units removed, the redundant tags suppressed and the SW&TP tags added:
Example 2
PS0E8: <sptag cat="A">
<unclear> over her leotard, so we kept, we bought one, </sptag> <sptag cat="xRS">they said </sptag> <sptag cat="xDS">oh, you know, you want a fairly big one, </sptag> <sptag cat="A">so they <unclear>, well it looked alright across here, when we got it home, she tried it on, it was nearly down to her knees. When we opened it up it was </sptag><sptag cat="FIW">aged nine to eleven, </sptag> <sptag cat="xRS"> and she said </sptag> <sptag cat="xDS">I can’t go in this.</sptag> <sptag cat="A"><vocal desc=laugh> So we had to go back and swap it for seven to eight <pause></sptag></utterance>
(KCD)
In addition to removing the s-unit tags we also removed <div> tags, used to mark divisions between different conversations, in order to validate our files using an SGML validator. We replaced these with a note in the text at the relevant point indicating that a division occurred.
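A rough sketch of this stripping step is given below (we actually carried out the work with an MS Word macro; the regular expressions here are assumptions based on the tag forms shown in Example 1, not the macro itself):

import re

def strip_for_annotation(text):
    # Remove s-unit tags such as <s n="4240">; part-of-speech tags are left in place.
    text = re.sub(r'<s n="\d+">', '', text)
    # Replace division tags with an in-text note, as described above.
    text = re.sub(r'<div[^>]*>', '<note>conversation division</note>', text)
    return text

example = '<s n="4241"><w CJS>When <w PNP>we <w VVD>opened <w PNP>it'
print(strip_for_annotation(example))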
4.4 A summary of the tags used to mark-up the corpus files
The following tags (and, where relevant, end tags) were used as we marked up our files:
<note></note>               Indicates a note from the encoder about a particular feature of the text that cannot be described with any of the other available tags.
<notelet>                   Indicates a note from the encoder about a particular feature of the text that cannot be described with any of the other available tags, and that is not scoped around a particular word or utterance.
<pause dur='x'>             Indicates a significant pause of at least 3 seconds; the length of the pause in seconds is noted in the 'dur' attribute.
<sic></sic>                 Indicates text reproduced from the original despite being apparently incorrect or inaccurate.
<sptag cat='x'></sptag>     Indicates the category of speech and/or thought and/or writing presentation inherent in a particular piece of discourse.
<text>                      Indicates the beginning of the spoken text.
<unclear>                   Indicates portions of the speech which are indistinct and cannot be transcribed.
<utterance></utterance>     Indicates an utterance.
<w>                         Indicates the part of speech of a word (BNC texts only).
5. Speech, Writing and Thought Presentation tagging
In this section we detail the system of tagging used to apply SW&TP annotation to the corpus data.
5.1 SW&TP categories
Here we show the acronyms used in the tags and their accompanying
definitions. For full definitions of the SW&TP categories see McIntyre et
al. (2003).
5.1.1 Main categories
Category    Definition
A           Anything other than SW&TP
RM          Report of Mention
RV          Minimal Report of Speech
RN          Minimal Report of Writing
RI          Minimal Report of Internal State
RS          Report of Speech
RW          Report of Writing
RT          Report of Thought
RSA         Report of Speech Act
RWA         Report of Writing Act
RTA         Report of Thought Act
IS          Indirect Speech
IW          Indirect Writing
IT          Indirect Thought
FIS         Free Indirect Speech
FIW         Free Indirect Writing
FIT         Free Indirect Thought
DS          Direct Speech
DW          Direct Writing
DT          Direct Thought
FDS         Free Direct Speech
FDW         Free Direct Writing
FDT         Free Direct Thought
The main changes in the tag-set since the SW&TP Written project are as follows:
- We have dispensed with the N element of the tags as a consequence of tagging oral texts. N previously stood for ‘narrative’ or ‘narrator’s’, and is therefore not applicable to spoken data. Hence, what in the written corpus would have been NRSA is in the spoken corpus simply RSA. Likewise, the single N tag, which was used in the written corpus to mark anything not tagged as SW&TP, is replaced in the spoken corpus by A – which refers to ‘[A]nything other than SW&TP’.
- We have changed the NW tag, used in the written corpus, to RN, to refer to the minimal report of writing - such as ‘I wrote to Eileen’ [KC2]. As in the written corpus, this tag is the equivalent in writing presentation terms of the tags RV and RI on the speech and thought scales. The change was necessary due to dispensing with the N element. Because of this we were left with the same tag - RW - to refer both to the ‘report of writing’ (i.e. a reporting clause of writing presentation preceding either the direct or indirect report of writing) and the minimal report of writing (such as the example given above of ‘I wrote to Eileen’ [KC2]). We therefore needed a different tag in order to distinguish between the two phenomena. We chose to use RN to tag the latter, the N being the only remaining consonant in the word ‘writing’ that is not used elsewhere in the tag-set.
- We have introduced a new tag, RM, to refer to ‘report of mention of language use’. This is used to tag instances where speakers refer to calling something or someone by a particular name. A prototypical example would be ‘[…] there was <sptag cat="xRM"> a character by the name name of Eddie Polo </sptag> <sptag cat="xRM"> in a serial called The Broken Coin. </sptag>‘.
5.1.2 Sub-categories
Sub-category    Definition
P               Extended topic
#               Flags problems for discussion, i.e. examples that we weren’t sure how to analyse
e               Embedded
g               Grammatical negative
a               Absence of category
h               Hypothetical
i               Inferred
q               With quote
r               Iterated
v               Interrogative
p               Imperative
u               Uncompleted
1/2/3 etc.      Level of embedding or number of repeated adjacent categories
The additions to the sub-category tags since the written corpus are g, a, r, v, p, u and the numbers to indicate levels of embedding and number of repeated adjacent categories. All new additions were to cope with particular phenomena we encountered in the spoken data.
5.2 The tagging process
John Heywood and Dan McIntyre tagged all the corpus texts for SW&TP. The tagged texts were then passed to Mick Short and Elena Semino individually for their comments and revisions. The project team discussed any remaining problems together and a final draft of each tagged file was produced following this discussion.
5.3 The tagging format
SW&TP categories are marked within the <sptag> category attribute
('cat'). We employ a 15-field category set, using ‘x’ as a placeholder for empty positions in order to assist concordancing.
Below is an example of an SW&TP tag and its component parts:
<sptag cat="x x x x x x x x x x x x x x x">

element:                              sptag
attribute:                            cat
attribute value [the SW&TP ‘tag’]:    "x x x x x x x x x x x x x x x" (fields 1-15)
Formally, the tag constitutes everything within angle brackets, though for
ease of reference in analysis we prefer to use the term 'tag' (as opposed to
'attribute value') to refer specifically to the SW&TP category itself.
The SW&TP tag, then, has 15 possible constituents (marked above by the
Xs), detailed in Table 3, below. (NB. We do not mark empty positions that follow the final category or sub-category
constituent.)
Field   Possible constituents   Definition of constituent
1       x A F                   Anything other than SW&TP; Free
2       x R I D                 Representation; Indirect; Direct
3       x S T W V I N M         Speech; Thought; Writing; Voice; Internal state; WritiNg; Mention
4       x A                     Act
5       x P                     toPic
6       x # 1 2 3 4             # = odd/interesting cases; numbers = number of repeated adjacent categories
7       x e                     embedded
8       x g a                   grammatical negative; absence of speech, thought and/or writing
9       x h                     hypothetical
10      x i                     inferred
11      x q                     quote
12      x r                     iterative
13      x v p                   interrogative; imperative
14      x u                     uncompleted
15      x 1 2 3 4               numbers = level of embedding

Table 3 Possible constituents of fields in the <sptag> category attribute
Some examples should help to make this clear:
- RIi (an inferred report of internal state) would be tagged formally as: <sptag cat="xRIxxxxxxi">
- RSAPg (a grammatically negative report of a speech act with an extended topic) would be tagged formally as: <sptag cat="xRSAPxxg">
- ISehu2 (second-level embedded, hypothetical, uncompleted indirect speech) would be tagged formally as: <sptag cat="xISxxxexhxxxxu2">
- FISeghvu3 (third-level embedded, grammatically negative, hypothetical, uncompleted, interrogative, free indirect speech) would be tagged formally as: <sptag cat="FISxxxeghxxxvu3">
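The mapping from these shorthand category labels to the padded 15-field attribute value is mechanical, and the following Python sketch (an illustration only, not a project tool; the field order follows Table 3) expands a shorthand such as 'FISeghvu3' into the formal attribute value, padding unused fields with 'x' and dropping empty positions after the final constituent:

# Possible constituents of each of the 15 fields (from Table 3); 'x' marks an
# empty position and is therefore not listed.
FIELDS = ['AF', 'RID', 'STWVINM', 'A', 'P', '#1234', 'e', 'ga',
          'h', 'i', 'q', 'r', 'vp', 'u', '1234']

def expand(shorthand):
    # Expand a shorthand SW&TP category (e.g. 'FISeghvu3') into the padded
    # attribute value used in the <sptag> 'cat' attribute.
    value, remaining = [], list(shorthand)
    for allowed in FIELDS:
        if remaining and remaining[0] in allowed:
            value.append(remaining.pop(0))
        else:
            value.append('x')
    if remaining:
        raise ValueError('could not place: ' + ''.join(remaining))
    # Empty positions after the final constituent are not marked.
    return ''.join(value).rstrip('x') or 'x'

for cat in ['RIi', 'RSAPg', 'ISehu2', 'FISeghvu3']:
    print(cat, '->', '<sptag cat="' + expand(cat) + '">')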
6. Further directions
We have recently begun the analysis of the corpus, and some
of our initial findings are reported in McIntyre et al. (2003). One
of the major results of our project so far, though, has been confirmation
that the model of speech, thought (and later) writing presentation suggested
by Leech and Short (1981) and developed by Short, Semino and Wynne in their
work on the Written Corpus, is applicable to spoken data, with few
modifications. Our work on the Spoken Corpus would seem to confirm the
robustness of the model of SW&TP that we are using.
Further research planned includes quantitative and qualitative analyses of
the corpus, comparisons between the Written and Spoken corpora, and
investigation into the role of prosody in SW&TP.
References
Leech, G. N. and Short, M. H. (1981) Style in Fiction. London: Longman.

McIntyre, D., Bellard-Thomson, C., Heywood, J., McEnery, A., Semino, E. and Short, M. (2003) 'The construction of a corpus to investigate the presentation of speech, thought and writing in written and spoken British English', in Archer, D., Rayson, P., Wilson, A. and McEnery, A. (eds) Proceedings of the Corpus Linguistics 2003 Conference. Lancaster University: UCREL Technical Papers 16, 513-22.