A Handbook to the Lancaster
Speech, Writing and Thought Presentation Written Corpus
1. Introduction
This handbook details the construction of the Lancaster Speech, Writing and
Thought Presentation Written Corpus, and discusses some of the issues
involved in building and annotating it. For more detailed information see
Semino and Short (forthcoming).
We built a corpus of around 250,000 words
and annotated this for categories of speech and thought
presentation (also known as speech and thought reporting or representation)
using a tagset which has been developed by Mick Short, Elena Semino,
Jonathan Culpeper and Martin Wynne at Lancaster University. This tagset is
an extension of the model of speech and thought presentation (SW&TP)
proposed in Leech and Short (1981) which posits a continuum of categories
along an axis representing degrees of narrator's intervention.
Originally a pilot corpus of some 40,000 words of fiction texts was compiled
and annotated in 1994. A parallel pilot sample of 40,000 words of newspaper
texts was then added in 1994 and 1995. This work was done with funds provided
by the Faculty of Social Sciences at Lancaster University.
Following the award of a major British Academy research project grant, this
80,000-word pilot corpus was expanded in 1996 and 1997 to a nominally
240,000-word corpus. The fiction and newspaper sections were doubled in
size, and a new section of biography and autobiography texts was
added.
Analysis of the corpus is ongoing.
2. The composition of the corpus
2.1 Structure
There are approximately 250,000 words of text in the corpus. It is made
up of 120 samples of about 2,000 words each. The final count is somewhat in
excess of the nominal 240,000 because the texts were sampled so as to begin
and end at fairly 'natural' breaks, allowing a reader of the corpus text to
see enough of the relevant context to understand the narrative. We usually
preferred to find such a break after rather than before the 2,000-word mark,
but as close to it as possible (for more on sampling strategies see 2.3
below).
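The preference for ending a sample at a natural break just after the
2,000-word mark can be illustrated with a minimal Python sketch. The function
name choose_break, the representation of candidate breaks as word offsets,
and the fallback to the last earlier break are assumptions made for
illustration only; the sampling itself was a matter of manual judgement.

```python
def choose_break(break_offsets, target=2000):
    """Pick a sample end point from candidate 'natural' breaks.

    break_offsets: word offsets (hypothetical input) at which a chapter,
    section or paragraph boundary falls in the source text.
    """
    after = [b for b in break_offsets if b >= target]
    if after:
        # Preferred case: the closest break at or after the target length.
        return min(after)
    before = [b for b in break_offsets if b < target]
    if before:
        # Fallback (an assumption): the closest break before the target.
        return max(before)
    raise ValueError("no candidate breaks supplied")


# A break at 2,080 words is chosen over the nearer break at 1,950 words.
chosen = choose_break([1500, 1950, 2080, 2600])
```

Under these assumptions a sample slightly longer than 2,000 words is the
normal outcome, which is consistent with the total exceeding 240,000 words.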
The primary classification of the corpus is into three sections relating
to narrative genres. These genres are: (i) fiction, (ii) newspaper news
reports and (iii) biography and autobiography. There are a minimum of 80,000
words in each section.
Within each genre there is a division between 'serious' and 'popular' texts.
While such a division is inevitably difficult to some extent, this
classification was made on the basis of what would commonly be held to be
the case by the average educated reader. This will enable the testing of
such preconceptions by the analysis of the actual texts.
In the fiction and biography there is also a binary division (cutting
across the popular/serious division) between first and third person
narratives. In the biography this creates a division between biography and
autobiography.
2.2 List of texts sampled
Serious fiction
Amis, M. (1984) Money,
London: Penguin.
Atkinson, K. (1984) Behind
the Scenes at the Museum, London: Penguin.
Ballard, J.G. (1984) Empire
of the Sun, London: Panther.
Barnes, J. (1989) A
History of the World in 10½ Chapters, London: Picador.
Byatt, A.S. (1991) Possession,
London: Vintage.
Carter, A. (1967) The
Magic Toyshop, London: Heinemann.
Drabble, M. (1969) Jerusalem
the Golden, Harmondsworth: Penguin.
Fowles, J. (1963) The
Collector, London: Vintage
Gardam, J. (1992)
Queen of the Tambourine, London: Abacus.
Golding, W. (1980)
Rites of Passage, London: Faber & Faber.
Greene, G. (1943) Brighton
Rock, London: Penguin.
Huxley, A. (1928) Point
Counter Point, London: Chatto & Windus.
Lawrence, D.H. (1955)
‘Tickets Please’, in The Complete
Short Stories (Vol. II), London: Heinemann.
Lessing, D. (1974) The
Memoirs of a Survivor, London: Octagon.
Lowry, M. (1969) ‘Gin and
Goldenrod’, in Hear us O Lord from
Heaven thy Dwelling Place, Harmondsworth: Penguin.
Maugham, S. (1935) The
Moon and Sixpence, London: Heinemann.
Murdoch, I. (1961) A
Severed Head, London: Chatto & Windus.
Rushdie, S. (1995) The
Moor’s Last Sigh, London: Jonathan Cape.
Wells, H.G. (1953) Tono-Bungay,
London: Collins.
Woolf, V. (1919) Night
and Day, London: The Hogarth Press.
Popular fiction
Adler, E. (1986) Peach,
London: Hodder & Stoughton.
Bow, J. (1991) Jane's
Journey, Sussex: The Book Guild Ltd.
Burley, W.J. (1978) Wycliffe
and the Scapegoat, London: Gollancz.
Conran, S. (1982) Lace,
Harmondsworth: Penguin.
Cookson, C. (1984) Hamilton,
London: Heinemann.
Dibdin, M. (1991) Dirty
Tricks, London: Faber & Faber.
Francis, D. (1988) The
Edge, London: Michael Joseph.
Higgins, J. (1991) The
Eagle Has Flown, London: Pan.
Holt, V. (1991) Daughter
of Deceit, London: Harper Collins.
Lewis, T. (1992) Get
Carter, London: Allison & Busby.
MacLean, A. (1986) Santorini,
London: Collins.
Maitland, S. (1990) Three
Times Table, London: Chatto & Windus.
McDermid, V. (1992) Dead
Beat, London: Gollancz.
McDowell, C. (1991) A
Woman of Style, London: Century Group.
Nabb, M. (1989) Death
in Springtime, London: Fontana.
Peters, E. (1992) The
Holy Thief: The Nineteenth Chronicle of Brother Cadfael, London:
Headline.
Seymour, G. (1992) Archangel,
London: Fontana.
Smith, W. (1987) The
Eye of the Tiger, London: Heinemann.
Taylor, A. (1986) The
Raven on the Water, London: Harper Collins.
Thomson, R. (1991) The
Five Gates to Hell, London: Bloomsbury.
Popular (auto)biography
Bannister, J. (1994) Lara:
the story of a record-breaking year, London: Stanley Paul
Beck, S. (1995) Queen
of the Street: The Amazing Life of Julie Goodyear, London: Blake.
Bergan, R. (1991) Dustin
Hoffman, London: Virgin.
Black, C. (1985) Step
Inside, London: Dent.
Caine, M. (1992) What's
it all about?, London: Century.
Cherrington, J. (1993) On
the Smell of an Oily Rag: My Fifty Years in Farming, Ipswich: John
Farming Press Books.
Christie, L. (with Ward, T.)
(1989) Linford Christie: An Autobiography, London: Paul.
Dimbleby, J. (1994) The
Prince of Wales: A Biography, London: Little, Brown.
Dorman, L.S. and Rawlins,
C.L. (1990) Leonard Cohen: Prophet of
the Heart, London: Omnibus.
Henry, A. (1994) From
Zero to Hero: Damon Hill, Yeovil: Patrick Stephens Limited
Juby, K. (ed.) (1986) In
other words – David Bowie, London: Omnibus Press.
Miller, J. (with Brown, J.)
(1989) Former Soldier Seeks Employment,
London: MacMillan.
Milligan, S. (1976) Monty
– his part in my victory, London: Penguin.
Morton, A. (1993) Diana:
Her True Story, London: O’Mara.
Phoenix, P. (1983) Love,
Curiosity, Freckles and Doubt, London: Arlington Books.
Smith, J. (1988) The
Benny Hill Story, London: W H Allen.
Stokes, D. (with Dearsley,
L.) (1987) Joyful Voices, London:
Macdonald.
Stone, S. (1990) Kylie
Minogue: The Superstar Next Door, London: Omnibus.
Whitbread, F. (with Blue,
A.) (1988) Fatima, London: Pelham
Books.
Windsor, B. (with Flory, J.)
(1990) Barbara: The Laughter and Tears
of a Cockney Sparrow, London: Century.
Serious (auto)biography
Baker, K. (1993) The
Turbulent Years, London: Faber and Faber.
Critchley, J. (1995) A
Bag of Boiled Sweets, London: Faber and Faber.
Glasser, R. (1986) Growing
Up in the Gorbals, London: Chatto and Windus.
Isherwood, C. (1980) My
Guru and His Disciple, London: Magnum.
Kennedy, L. (1989) On
my way to the club: The Autobiography of Ludovic Kennedy, London:
Collins.
Lee, L. (1969) As
I Walked Out One Midsummer Morning, London: Andre Deutsch.
Worsthorne, P. (1993) Tricks
of Memory. An Autobiography: Peregrine Worsthorne, London: Weidenfeld
& Nicholson.
Spark, M. (1992) Curriculum
Vitae, London: Constable.
Stalker, J. (1988) Stalker,
London: Harrap.
Thatcher, M. (1993) The
Downing Street Years, London: Harper Collins.
Carpenter, H. (1983) W.
H. Auden, London: Unwin.
Adams, J. (1992) Tony
Benn, London: MacMillan.
Bragg, M. (1988) Rich
- The life of Richard Burton, London: Hodder and Stoughton.
Ponting, C. (1994) Churchill,
London: Sinclair-Stevenson.
Wilson, A.N. (1990) C.S
Lewis, a biography, London: Collins.
Sherry, N. (1989) The
Life of Graham Greene, London: Penguin.
Rose, J. (1990) Modigliani:
The Pure Bohemian, London: Constable.
Ackroyd, P. (1984) T.S.
Eliot, London: Hamilton.
Hodges, A. (1983) Alan
Turing: The Enigma of Intelligence, London: Unwin.
Callow, S. (1990) Vincent
Van Gogh - a Life, London: Allison & Busby.
Broadsheets (Serious newspapers)
The Daily Telegraph
The Guardian
The Independent
The Independent on Sunday
The Times
Tabloids (Popular newspapers)
The Express
The Mirror
The News of the World
The Star
The Sun
Today (samples from 1994 only, as the paper ceased publication before
the 1996 sample was taken)
2.3 Sampling Strategies
Fiction
The decision as to what
counted as 'high' literature was made by nine members of Lancaster
University's Stylistics Research Group, who were given a list of authors
whose works were available in electronic form in the Oxford Text Archive.
Authors whom six or more informants judged as 'high' literature were
selected. The extracts which we took constituted relatively independent
units (e.g. chapters, sections or short stories). The popular fiction
extracts consisted of eight 3rd-person narratives taken from the relevant
category of the British National Corpus, to which we added two 1st-person
narratives, so that we would have a greater range of narrative styles. Six
extracts were from romantic novels and four from action novels.
In the fiction section a further subdivision was made within each text type
between texts with first and third person narrators. This is paralleled in
the biography/autobiography section, where the autobiography texts are all
first person narratives and the biography texts are all third person
narratives.
News
All the press data was taken
from articles published in British national daily newspapers. Only
newspapers that were felt to be prototypical members of the broadsheet or
tabloid categories were selected. Newspapers were taken from the same or
consecutive days in four samples: 4-5 December 1994, 11-12 December 1994,
28-29 April 1996 and 12-13 May 1996. This enabled us to select articles that
covered the same story, and thereby facilitated comparisons between
different newspaper styles (work which we hope to carry out in later phases
of the project). News stories rather than editorials or magazine-style
articles were chosen so that the press data would be as similar as possible
in type to the narrative fiction data. The main criterion for selecting
articles was that they should appear in at least three newspapers.
(Auto)biography
It was less clear-cut how to
make a serious/popular distinction in the biography/autobiography section.
It was decided to rely on the perceived seriousness of the subject, so
politicians, serious writers and artists are considered 'serious' and TV
stars, royalty and sports people are considered 'popular'. At the same time
some attention was paid to the writing style of the biography in question so
as not to include problematic cases, such as particularly badly written
autobiographies of serious politicians, or highbrow biographies of pop
stars, for example.
2.4 Text markup
The following SGML elements were used to mark up the text:
div1 | text divisions: fiction (types: serious, popular), newspapers (types: broadsheet, tabloid), biography (types: serious, popular)
div2 | sample (c. 2,000 words)
div3 | (in newspapers) articles; other subdivisions of samples
edit | a note to the text editor indicating what stage of processing the text is at
head | a text heading (e.g. newspaper headline or a chapter heading)
header | bibliographical information and the list of speakers in the text
note | a note indicating additional information about the SW&TP tagging
p | paragraph break
pb | page break
sptag | Speech, Writing and Thought Presentation category tag
3. Speech, Writing and Thought Presentation annotation
3.1 SW&TP categories
Here we show the acronyms used in the tags and their accompanying
definitions. For full definitions of the SW&TP categories see Short et
al. (1998).
N | Narrative
NRS | Narrative Report of Speech
NRW | Narrative Report of Writing
NRT | Narrative Report of Thought
NI | Narrative Report of Internal State
NV | Narrative Report of Voice
NRSA | Narrative Report of Speech Act
NRWA | Narrative Report of Writing Act
NRTA | Narrative Report of Thought Act
NRSAP | Narrative Report of Speech Act with Topic
NRWAP | Narrative Report of Writing Act with Topic
NRTAP | Narrative Report of Thought Act with Topic
IS | Indirect Speech
IW | Indirect Writing
IT | Indirect Thought
FIS | Free Indirect Speech
FIW | Free Indirect Writing
FIT | Free Indirect Thought
DS | Direct Speech
DW | Direct Writing
DT | Direct Thought
FDS | Free Direct Speech
FDW | Free Direct Writing
FDT | Free Direct Thought
Affixes:
e | embedded
q | with quote
h | hypothetical
i | inferred (see section on NIi below)
+ | speech summary (not used)
e.g. NRSAPq is "Narrative Report of Speech Act with Topic with an embedded
quotation".
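The way a tag combines a base category with affixes can be sketched in
Python as follows. This is a hedged illustration, not part of the corpus
tools: the names split_tag and split_portmanteau are hypothetical, the
placement of "e" as a prefix and of "q", "h" and "i" as suffixes is inferred
from tags that appear in this handbook (eNRSAPQ, NRSAPq, NIi), and the
case-insensitive treatment of suffixes is an assumption prompted by the mix
of "q" and "Q" in the examples. The unused "+" affix is not handled.

```python
BASE_CATEGORIES = {
    "N", "NRS", "NRW", "NRT", "NI", "NV",
    "NRSA", "NRWA", "NRTA", "NRSAP", "NRWAP", "NRTAP",
    "IS", "IW", "IT", "FIS", "FIW", "FIT",
    "DS", "DW", "DT", "FDS", "FDW", "FDT",
}
# Suffix letters from the affix table; matched case-insensitively (assumption).
SUFFIXES = {"q": "with quote", "h": "hypothetical", "i": "inferred"}


def split_tag(tag):
    """Split one tag such as 'eNRSAPQ' into base category and affixes."""
    # Base categories are all upper case, so a leading lower-case 'e'
    # can only be the 'embedded' prefix.
    embedded = tag.startswith("e")
    rest = tag[1:] if embedded else tag
    # Try the longest possible base first so that 'NI' is not read as N + i.
    for length in range(len(rest), 0, -1):
        base, tail = rest[:length], rest[length:].lower()
        if base in BASE_CATEGORIES and all(ch in SUFFIXES for ch in tail):
            return {
                "base": base,
                "embedded": embedded,
                "affixes": [SUFFIXES[ch] for ch in tail],
            }
    raise ValueError(f"unrecognised tag: {tag!r}")


def split_portmanteau(value):
    """Split a possibly ambiguous value such as 'IS-IT' into its readings."""
    return [split_tag(part) for part in value.split("-")]


example = split_tag("NRSAPq")
```

Here example holds the base NRSAP with the single affix "with quote", and
split_portmanteau("IS-IT") would return one entry per possible reading.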
Notes
# | used to flag problems for discussion, i.e. things that we weren't sure
how to analyse. May be used in conjunction with a portmanteau tag to
indicate choices (see below).
- | (portmanteau tagging) used for genuine ambiguity, where it is preferable
to indicate two possible interpretations (e.g. IS-IT, NV-NRSA).
e | embedded SW&TP is indented on the page to make it easier to read.
Line breaks | all SW&TP tags are printed on a line on their own, to make it
easier to extract, sort, count etc.
Wordcounts | the unit used is the orthographic word, simply defined as a
string of alphanumeric characters surrounded by spaces or punctuation.
Hyphenated and contracted words count as one unit and genitive diacritics
are ignored, e.g. "she's a man-eater" (3 words).
Scare quotes | these are tagged with a <note>.
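The orthographic-word definition given under Wordcounts can be approximated
in Python as below. This is only a sketch under stated assumptions: the
function name count_words and the regular expression are illustrative, and
"genitive diacritics" is read here as apostrophes, so that a form such as
"Jed's" counts as a single word. SGML markup inside the text (for example
<hi> elements) is not stripped by this sketch.

```python
import re

# A word is a run of alphanumerics, optionally continued across an internal
# hyphen or apostrophe (hyphenated and contracted forms count as one unit).
WORD = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def count_words(text):
    """Count orthographic words in a stretch of plain text."""
    return len(WORD.findall(text))


print(count_words("she's a man-eater"))  # 3
```

Applied to the tagged examples in this handbook, the sketch reproduces the
recorded w values, e.g. 3 for "I told her." and 16 for the sentence
beginning "The Palace was keen that".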
3.2 Tagging Guidelines
When tagging, consideration
is taken of all three of these levels of analysis. For more details of how
this is done, it is necessary to refer to the guidelines for the tagging of
particular categories in the different text types.
Some
problems arise because there are fuzzy areas on the boundary between the
presentation of linguistic and non-linguistic acts. There are also mentions
of written language where the focus is not on the production or reception of
the text, but merely on its existence. Such cases were not annotated as
writing presentation.
3.2.1 NIi
The NI category was invented
to cover cases in fiction where an omniscient narrator is able to report on
the internal states of characters, e.g.:
Jed's heart lifted in his ribs.
(Rupert Thomson, The Five Gates to Hell)
For a moment she didn't know where she was.
(Graham Greene, Brighton Rock)
In
texts which are not fiction with an omniscient narrator, NI (and all
categories of thought presentation) are only used where the character in
question has access to the thoughts and internal states which are reported.
This usually means that only states and thoughts of the reporter are tagged
as NI (or thought).
Often
passages tagged as N-NI are in fact inferences based on what someone has
said, and may even be quite close to the form of the original utterance. For
example:
<sptag cat=N-NI who=S next=NRSAP whonext=B s=1 w=16>
The Palace was keen that the Prime Minister should continue until a
successor had been elected.
(Baker)
This is formally presented as the report of an internal state ('was keen
that'), but the
reader will infer that this is a report of something that was probably said
by a spokesman for the Palace. However it is impossible to tell what type of
speech report this might be. It could be FIS, if the original utterance was
something like 'The Palace is keen that...'; it could be NRS followed by IS
if the original utterance was something like 'The Prime Minister should
continue...'; or it could be NRSAP if this report is a summary of what was
said, possibly on different occasions and by different people with the words
here bearing little relation to the actual words said.
Given
the impossibility of classifying the type of speech report, such examples
are tagged as NIi, since they are formally presentation of internal states,
even though pragmatically they can function as speech presentation.
In
all cases where there is no omniscient narrator then, what is formally
presented as the narration of internal states or thought is tagged as
ambiguous between N and the relevant category of thought presentation, e.g.
N-NI, N-NRT, N-IT.
3.2.2 NRSAP
The analysis of the press
data highlighted the existence of particular variants of existing
categories, which appear to be typical of newspaper reporting. An example of
this is the use of extremely long and detailed NRSAs, such as those given
below:
Mr Major warned yesterday of the
dangers of Britain being left behind if a group of European Union members
pushed ahead with a single currency.
(The Independent on Sunday, "Blair Puts Labour Troops on Alert for Snap
Election")
Labour called last night for a streamlined Scandinavian style monarchy to
banish Britain's class-ridden society.
(The Daily Mirror, "Cut the Royals Down to Size")
In both cases the reporter
spells out the speech act that the original speaker is supposed to have
performed (warned, called), and then goes on to provide details of the
content of the utterance in the form of lengthy and complex noun phrases.
Clearly, such instances are not fully accounted for by the original
definition of the NRSA category, which aimed to capture those cases where
little more than the speech act is provided.
They
are therefore tagged as NRSAP, or 'narrator's representation of speech act
with topic', as in the following example:
<sptag cat=NRSAP who=M next=NRSA whonext=B s=0.67 w=18>
However, when he invited Beatrice Hastings to come and model for
him nude early on in their affair,
<sptag cat=NRSA who=B next=N s=0.07 w=2>
Modigliani objected
<sptag cat=N next=NRS whonext=M s=0.26 w=10>
and she failed to keep the appointment. This happened twice.
(June Rose, Modigliani)
3.2.3 NV
Both the fictional and
newspaper data contained instances of minimal speech presentation, which
could not easily be accounted for by Leech and Short's categories. Consider
the emboldened parts in the examples below:
"Don't you love
Barrie's plays?" she asked. "I'm so fond of them". She talked
on. Rampion made no comment.
(Aldous Huxley, Point Counter Point)
We spoke to vice madam Michaela Hamilton from Bullwell, Notts, who arranged
girls for a Hudson orgy at the Sanam curry house in Stoke.
(The News of the World, "Hudson Fixed Sex Orgies as his Charity Fund
Collapsed")
In both cases we are
informed that someone engaged in verbal activity, but we are not given any
explicit indication even as to what speech acts were performed, let alone
what the form and content of the utterances were. In other words, we are
faced with a form of speech presentation that is even more minimal, both
formally and functionally, than that captured by the NRSA category, where
the narrator specifies the illocutionary force of the utterance, and,
possibly, its topic. We classified instances like these as Narrator's Report
of Voice, and tagged them with the acronym NV.
3.3 The tagging process
All texts were tagged manually by the present author using a version of the
emacs text editor under the Unix operating system. All tagged texts were
then checked by both Mick Short and Elena Semino; any problems were
discussed in detail and the necessary changes made. In addition, others have
been involved in tagging and checking areas of the corpus, and numerous
further checks have been applied to ensure global consistency, to enforce
evolving guidelines and to identify and correct typographical and other
errors. Checking and refinement of the tagging is still in progress.
3.4 The tagging format
Formally an sptag can be
defined as follows:
<sptag cat=[tag](-[tag]) (who=[A-Z]) next=[tag](-[tag]) (whonext=[A-Z]) s=n(+n(+n)) w=n>
where
elements in round brackets are optional; tag is an SW&TP tag from
the tagset (e.g. N, NRS, FDS, etc); and n is a number. For example:
<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10>
'A criminal offence under the Defence of the Realm Act,'
<sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3>
I told her.
Here the tags tell us that there is a sequence of direct speech spoken by
speaker B which is 10 words long (comprising 77% of the sentence) followed
by a reporting clause reporting speech by speaker B which is 3 words long
(comprising 23% of the sentence).
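Because every sptag sits on a line of its own, the annotation can be read
back with a few lines of code. The following Python sketch parses the
attributes described above and pairs each tag with the text that follows
it. The names parse_sptag and iter_segments are hypothetical, the treatment
of a missing next attribute as an empty list is an assumption (some examples
in this handbook omit it), and lines beginning with </sptag are simply
skipped; this is not the project's own software.

```python
import re

# attribute=value pairs inside an sptag; values run up to whitespace or '>'.
ATTRIBUTE = re.compile(r"(\w+)=([^\s>]+)")


def parse_sptag(line):
    """Turn one '<sptag ...>' line into a dictionary of its attributes."""
    line = line.strip()
    if not (line.startswith("<sptag") and line.endswith(">")):
        raise ValueError(f"not an sptag line: {line!r}")
    attrs = dict(ATTRIBUTE.findall(line))
    return {
        "cat": attrs["cat"].split("-"),       # portmanteau tags give 2 items
        "who": attrs.get("who"),              # optional speaker code
        "next": attrs["next"].split("-") if "next" in attrs else [],
        "whonext": attrs.get("whonext"),
        "s": [float(part) for part in attrs["s"].split("+")],
        "w": int(attrs["w"]),
    }


def iter_segments(lines):
    """Yield (attributes, text) for each tagged stretch of text."""
    current, text = None, []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("</sptag"):
            continue                          # closing tags carry no text
        if stripped.startswith("<sptag"):
            if current is not None:
                yield current, " ".join(text)
            current, text = parse_sptag(stripped), []
        elif current is not None and stripped:
            text.append(stripped)
    if current is not None:
        yield current, " ".join(text)


sample = """\
<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10>
'A criminal offence under the Defence of the Realm Act,'
<sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3>
I told her.
""".splitlines()

segments = list(iter_segments(sample))
```

For the example above, segments contains two entries: a DS segment of 10
words and an NRS segment of 3 words, each paired with its text, which can
then be sorted or counted by category or speaker.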
3.5 Who's who
Where possible, speakers are explicitly named persons. However, it is
sometimes necessary to attribute SW&TP to somewhat vaguer entities, such as
groups of people or institutions, and sometimes the speaker is unknown.
Occasionally it has been necessary to indicate the medium rather than the
speaker, where this is the only information given, for example in The Prince
of Wales by Jonathan Dimbleby:
F is the avalanche bulletin
P is the Sun
in the examples:
<sptag cat=eNRSAPQ-eNRWAPQ level=2 who=F s=0.16 w=10>
the avalanche bulletin warned of 'a considerable local avalanche danger'
</sptag level=1>
<sptag cat=NRW who=P next=DW whonext=P s=0.88 w=7>
On 18 March, the <hi r=it>Sun</hi> headline read
<sptag cat=DW who=P next=NW whonext=X s=0.13+1 w=8>
'ACCUSED. Official: Charles DID cause the killer avalanche'.
First person narrators are always coded as B, and unknown speakers always as
X. The main protagonists of third person narratives have also been coded as
B. It may be preferable to change this so that all and only first person
narrators are B.
3.6 Boundary problems
Where the reported speech is a nominalised clause, it is tagged as NRSAP.
Independent clauses, "to" clauses and "-ing" clauses are tagged as IS.
Sometimes, however, semantic criteria are relevant; such cases are hashed
(flagged with #), notably all "how" clauses.
NQ
Tag the whole sentence as
NQ.
N-NRS
Also relates to DS-FDS
ambiguities. Where a passage is not formally a reporting clause but
functions to introduce SW&TP, a note has been inserted saying 'functions
as NRS'.
NRS-NRSA
Where an NRSA is syntactically embedded within an NRS, if possible it
should be disentangled and tagged separately, e.g.:
<sptag cat=NRS who=B next=NRSA whonext=B s=0.25 w=5>
I put forward the view
<sptag cat=NRSA who=B next=IS whonext=B s=0.4 w=8>
which I had earlier expressed to John Wakeham,
<sptag cat=IS who=B next=NRSAP whonext=B s=0.35 w=7>
that Margaret should not be so definite.
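The s values in this example are consistent with each segment's share of
the words in the sentence (5, 8 and 7 words out of 20). A short Python
sketch of that arithmetic follows; the function name sentence_shares is
hypothetical, and rounding to two decimal places is inferred from the
examples rather than stated in the guidelines.

```python
def sentence_shares(word_counts):
    """Return each segment's proportion of one sentence, to 2 decimals."""
    total = sum(word_counts)
    return [round(count / total, 2) for count in word_counts]


# Word counts of the three segments in the example above.
print(sentence_shares([5, 8, 7]))  # [0.25, 0.4, 0.35]
```

The same calculation gives 0.77 and 0.23 for a sentence split into segments
of 10 and 3 words.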
NRSAP-NRSAPQ
Q does not necessarily have
to be in quotation marks, or even formally marked at all, e.g.:
<sptag cat=N next=NRSAPQ whonext=G s=1+0.70 w=41>
When the boys came at her she attacked them with a ferocity that easily
overcame their theoretical advantages of strength and size. Her gifts of
war came down to her from some unknown ancestor; and though her adversaries
grabbed her hair
<sptag cat=NRSAPQ who=G next=N s=0.15 w=4>
and called her Jewess
Narrative/Thought in non-fiction
In general, what appears
formally to be thought presentation in non-fiction should be tagged as
ambiguous between N and the relevant category of TP.
N-NV-NRSA
N words: phone, news
NV words: interview, comments, chatted, talks, speak, conversation,
"a lie detector test", "cheering and singing",
"Robbie Williams made sure he was never knowingly underquoted".
NRSA words: questions, quiz, request, swore, warn, greet, bust-up, blasted,
complaints, threatening, "called off the search", "Police were called",
"further tests were ordered", "joined the condolences in a message to...",
condemned, security warnings, "urging/cheering the sides on",
"crack bad jokes", "delivered a defiant message".
Not speech events: "positive media coverage", "meeting",
"Richard had been permanently excluded from school".
N-NW
"I got you a flag with
'Champions' on last time,"
IS-FIS-DS
Decide
'Decide' is tagged as
ambiguous between narrative and thought presentation, e.g.:
<sptag cat=N-NRT next=N-IT s=0.17 w=4>
So the Mirror decided
<sptag cat=N-IT next=NRSA s=0.83 w=20>
to confront 48-year-old OJ...
4. Findings
The findings from our work on the Written Corpus are available in published
form, and a monograph on the project is due to be published in 2003 (Semino
and Short forthcoming).
References
Leech, G. N. and Short, M. H. (1981) Style in Fiction. London:
Longman.
Semino, E. and Short, M. (forthcoming) Corpus
Stylistics: A Corpus-based Study of Speech, Writing and Thought Presentation
in Narratives. London: Routledge.
Short, M., Wynne, M. and Semino, E. (1998) ‘Reading reports: discourse
presentation in a corpus of narratives, with special reference to news
reports.’ Anglistik & Englischunterricht. 39-65.