Introduction to part-of-speech annotation and the BNC Sampler
Part-of-speech (POS) tags are generally short codes of a few letters and numbers, in which the
first letter indicates the basic part of speech:
- N... typically indicates a noun
- V... typically indicates a verb
- J... typically indicates an adjective
The next letters often add further meaning:
- NP... often means a proper noun
- NN... often means an ordinary (common) noun
- VB... often means part of the verb BE
- VH... often means part of the verb HAVE
- VV... often means part of a lexical verb (e.g. play, run)
And at the end of a tag:
- ...1 often means singular noun
- ...2 often means plural noun
- ...0 often means base form verb
- ...I sometimes means infinitive of verb
- ...Z often means 3rd person singular verb
- ...D often means past tense verb
- ...G often means present participle of verb
- ...N often means past participle of verb
So you might like to try guessing the meaning of the following tags which are found in
today's corpus:
NN2 VVZ VBZ VVI VHN NP2 JJR
N.B. A full list of tags is called a "tagset".
Check the tagset here to see if you guessed correctly.
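The rules of thumb above can be turned into a small guessing routine. The following is a minimal sketch in Python, not part of CLAWS or any tagging software: the names `PREFIXES`, `NOUN_ENDINGS`, `VERB_ENDINGS` and `guess_tag` are hypothetical, and the code encodes only the patterns listed in this section, so it is no substitute for the full tagset. Anything the rules do not cover (for example the ending of JJR) is reported as not covered rather than guessed.

```python
# Heuristic decoder for POS tags, built ONLY from the rules of thumb
# listed in this section. It is an illustration, not the real tagset.

# Two-letter prefixes are more specific, so they are tried first.
PREFIXES = {
    "NP": "proper noun",
    "NN": "common noun",
    "VB": "part of the verb BE",
    "VH": "part of the verb HAVE",
    "VV": "part of a lexical verb",
    "N": "noun",
    "V": "verb",
    "J": "adjective",
}

# Endings that apply to noun tags (first letter N).
NOUN_ENDINGS = {"1": "singular", "2": "plural"}

# Endings that apply to verb tags (first letter V).
VERB_ENDINGS = {
    "0": "base form",
    "I": "infinitive",
    "Z": "3rd person singular",
    "D": "past tense",
    "G": "present participle",
    "N": "past participle",
}


def guess_tag(tag):
    """Return a best guess at the meaning of a tag, or say what is not covered."""
    # Find the longest known prefix that the tag starts with.
    candidates = sorted(PREFIXES, key=len, reverse=True)
    prefix = next((p for p in candidates if tag.startswith(p)), None)
    if prefix is None:
        return "unknown"

    meaning = PREFIXES[prefix]
    rest = tag[len(prefix):]

    # Pick the ending table that matches the basic part of speech.
    if tag[0] == "N":
        endings = NOUN_ENDINGS
    elif tag[0] == "V":
        endings = VERB_ENDINGS
    else:
        endings = {}

    if rest and rest[-1] in endings:
        return f"{meaning}, {endings[rest[-1]]}"
    if rest:
        return f"{meaning} (remaining '{rest}' not covered by these rules)"
    return meaning


# Try the tags from the guessing exercise above.
for example in "NN2 VVZ VBZ VVI VHN NP2 JJR".split():
    print(example, "->", guess_tag(example))
```

Because every rule is hedged with "often" or "sometimes", the output should be treated as a first guess and checked against the tagset itself.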
The BNC Sampler Corpus
The corpus we will be using today is a two-million-word corpus taken from the British National
Corpus (BNC). This smaller corpus is known as the BNC Sampler, and it is publicly available on CD-ROM. Its key features
are:
- it contains written and spoken data in almost equal portions
- it has been POS-tagged (by the CLAWS program) and the POS tags have been hand-corrected, so theoretically
there should be no mistakes
The tagset used for the BNC Sampler (and LOB, FLOB etc.) is known locally as the CLAWS "C7"
tagset. The following links to Lancaster sites provide more material on this tagset:
- A manual for the POS-tagging of the Sampler corpus, describing in detail how each POS-tag is used
- A full list of all tags in the C7 tagset
The table below outlines the structure of the BNC Sampler:
| Broad text category | WordSmith folder (under BNCsamp) | Text category and description | Number of words | Closest equivalent in Brown, Frown, LOB, FLOB |
| --- | --- | --- | --- | --- |
| Written | inform | "informative" writing | 781,801 | Sections A-J |
| Written | imag | "imaginative" writing | 231,173 | Sections K-R |
| Spoken | demog | informal conversation which has been demographically sampled across the population of the UK | 498,404 | none |
| Spoken | cg | speech recorded at specific locations for specific events, such as business meetings, public talks ("context-governed") | 499,998 | none |
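The word counts in the table can be summed to check the claim that the written and spoken parts are almost equal in size. Below is a minimal sketch in Python; the figures are copied from the table, while the names `WORD_COUNTS` and `totals_by_mode` are hypothetical and chosen only for this illustration.

```python
# Word counts copied from the table above, keyed by
# (broad text category, WordSmith folder).
WORD_COUNTS = {
    ("Written", "inform"): 781_801,
    ("Written", "imag"): 231_173,
    ("Spoken", "demog"): 498_404,
    ("Spoken", "cg"): 499_998,
}


def totals_by_mode(counts):
    """Sum the word counts for each broad text category."""
    totals = {}
    for (mode, _folder), words in counts.items():
        totals[mode] = totals.get(mode, 0) + words
    return totals


totals = totals_by_mode(WORD_COUNTS)
grand_total = sum(totals.values())

# Report each category's size and its share of the whole corpus.
for mode, words in totals.items():
    print(f"{mode}: {words:,} words ({words / grand_total:.1%} of the corpus)")
print(f"Total: {grand_total:,} words")
```

On these figures the written part comes to 1,012,974 words and the spoken part to 998,402 words, roughly 50.4% and 49.6% of a total of 2,011,376 words.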