Introduction to part-of-speech annotation and the BNC Sampler

Part-of-speech (POS) tags are generally codes of a few letters and numbers, in which the first letter has a basic part-of-speech meaning:

N... typically indicates a noun

V... typically indicates a verb

J... typically indicates an adjective

The next letters often add further meaning:

NP... often means a proper noun

NN... often means an ordinary (common) noun

VB... often means part of the verb BE

VH... often means part of the verb HAVE

VV... often means part of a lexical verb (e.g. play, run)

And at the end of a tag:

...1 often means singular noun

...2 often means plural noun

...0 often means base form verb

...I sometimes means infinitive of verb

...Z often means 3rd person singular verb

...D often means past tense verb

...G often means present participle of verb

...N often means past participle of verb
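The rules of thumb above can be turned into a small lookup, which is handy for checking your own guesses. The following is only a minimal sketch in Python: the function name guess_tag and the three dictionaries are made up for illustration, they encode nothing beyond the patterns listed above, and they are not part of CLAWS or of the official tagset documentation. Because the rules only hold "often", the output is a guess, not a definition.

```python
# Heuristic decoder built ONLY from the rules of thumb listed above.
# All names here are hypothetical; the real tagset has many more tags.

# Two-letter beginnings are checked first because they are more specific.
PREFIXES = {
    "NP": "proper noun",
    "NN": "ordinary (common) noun",
    "VB": "part of the verb BE",
    "VH": "part of the verb HAVE",
    "VV": "part of a lexical verb",
}

# Fallback: the first letter alone gives the basic part of speech.
FIRST_LETTER = {"N": "noun", "V": "verb", "J": "adjective"}

# Meanings of the final character of a tag.
SUFFIXES = {
    "1": "singular noun",
    "2": "plural noun",
    "0": "base form verb",
    "I": "infinitive of verb",
    "Z": "3rd person singular verb",
    "D": "past tense verb",
    "G": "present participle of verb",
    "N": "past participle of verb",
}


def guess_tag(tag):
    """Return a list of guesses about what a tag means (possibly empty)."""
    guesses = []
    if tag[:2] in PREFIXES:
        guesses.append(PREFIXES[tag[:2]])
    elif tag[:1] in FIRST_LETTER:
        guesses.append(FIRST_LETTER[tag[:1]])
    # Only read the last character as an ending when the tag has more than
    # two characters, so the second letter of a prefix is not counted twice.
    if len(tag) > 2 and tag[-1] in SUFFIXES:
        guesses.append(SUFFIXES[tag[-1]])
    return guesses


print(guess_tag("NN1"))  # ['ordinary (common) noun', 'singular noun']
```

Endings that the rules above do not mention are simply left unexplained by this sketch, so a short result means "look it up in the tagset" rather than "no further meaning".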

You might like to try guessing the meaning of the following tags, all of which occur in today's corpus:

NN2
VVZ
VBZ
VVI
VHN
NP2
JJR

N.B. A full list of tags is called a "tagset". Check the tagset here to see if you guessed correctly.

The BNC Sampler Corpus

The corpus we will be using today is a 2-million-word corpus taken from the British National Corpus (BNC). This smaller corpus is known as the BNC Sampler, and it is publicly available on CD-ROM. Its key features are:

it contains written and spoken data in almost equal portions

it has been POS-tagged (by the CLAWS program) and the POS tags have been hand-corrected, so in theory there should be no tagging mistakes

The tagset used for the BNC Sampler (and LOB, FLOB etc.) is known locally as the CLAWS "C7" tagset. The following links to Lancaster sites provide more material on this tagset:

A manual for the POS-tagging of the Sampler corpus, describing in detail how each POS-tag is used

A full list of all tags in the C7 tagset.

The table below outlines the structure of the BNC Sampler:

Broad text category	WordSmith folder (under BNCsamp)	Text category and description	Number of words	Closest equivalent in Brown, Frown, LOB, FLOB
Written	inform	"informative" writing	781,801	Sections A-J
Written	imag	"imaginative" writing	231,173	Sections K-R
Spoken	demog	informal conversation which has been demographically sampled across the population of the UK	498,404	none
Spoken	cg	speech recorded at specific locations for specific events, such as business meetings, public talks ("context-governed")	499,998	none
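
To check the claim that written and spoken data appear in almost equal portions, the word counts in the table can simply be added up. The short Python sketch below does this; the figures are copied from the table, while the variable and function names are made up for illustration.

```python
# Word counts copied from the table above, keyed by
# (broad text category, WordSmith folder).
WORD_COUNTS = {
    ("Written", "inform"): 781_801,
    ("Written", "imag"): 231_173,
    ("Spoken", "demog"): 498_404,
    ("Spoken", "cg"): 499_998,
}


def totals_by_broad_category(counts):
    """Sum the word counts for each broad text category."""
    totals = {}
    for (broad, _folder), n_words in counts.items():
        totals[broad] = totals.get(broad, 0) + n_words
    return totals


totals = totals_by_broad_category(WORD_COUNTS)
grand_total = sum(totals.values())

for broad, n_words in totals.items():
    share = n_words / grand_total
    print(f"{broad}: {n_words:,} words ({share:.1%})")
print(f"Total: {grand_total:,} words")
```

On these figures the written part comes to 1,012,974 words and the spoken part to 998,402 words, about 50.4% and 49.6% of a total of 2,011,376 words.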