Parsing: in depth

Not all parsing systems are the same. The two main differences are:

The number of constituent types which a system employs.
The way in which constituent types are allowed to combine with each other.

Despite these differences, the majority of parsing schemes are based on a form of context-free phrase structure grammar. Within this system an important distinction must be made between full parsing and skeleton parsing.

Full parsing aims to provide as detailed an analysis of the sentence structure as possible. Skeleton parsing is a less detailed approach: it tends to use a less finely distinguished set of syntactic constituent types and ignores, for example, the internal structure of certain constituent types. The two examples below show the differences.

Full parsing:

[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb is_BEZ Vzb] [Ns the_AT1 [NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ JJ+]NN/JJ&] heel_NN ,_, [Fr[Nq which_WDT Nq] [Vzp was_BEDZ shown_VBN Vzp] [Tn[Vn teamed_VBN Vn] [R up_RP R] [P with_INW [Np[JJ/JJ/NN& pointed_JJ ,_, [JJ- squared_JJ JJ-] ,_, [NN+ and_CC chisel_NN NN+]JJ/JJ/NN&] toes_NNS Np]P]Tn]Fr]Ns] ._. S]

This example was taken from the Lancaster-Leeds treebank.

The syntactic constituent structure is indicated by nested pairs of labelled square brackets, and the words have part-of-speech tags attached to them. The syntactic constituent labels used are:

& whole coordination
+ subordinate conjunct, introduced
- subordinate conjunct, not introduced
Fr relative phrase
JJ adjective phrase
Ncs noun phrase, count noun singular
Np noun phrase, plural
Nq noun phrase, wh-word
Ns noun phrase, singular
P prepositional phrase
R adverbial phrase
S sentence
Tn past participle phrase
Vn verb phrase, past participle
Vzb verb phrase, third person singular to be
Vzp verb phrase, passive third person singular
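
The labelled bracket notation described above can be read mechanically. The following is a minimal sketch in Python, assuming that every constituent opens with `[Label` and closes with the same `Label]`, and that every word has the form `word_TAG`. The names `parse_brackets` and `constituent_labels` are illustrative and are not part of any treebank tool. The sample string is a fragment in the same notation.

```python
import re

# One token is an opening "[Label", a closing "Label]", or a "word_TAG" item.
TOKEN = re.compile(
    r"\[(?P<open>[^\s\[\]]+)|(?P<close>[^\s\[\]]+)\]|(?P<word>[^\s\[\]]+)"
)


def parse_brackets(text):
    """Return the top-level nodes of a labelled bracketing.

    A constituent is a tuple (label, list_of_children);
    a word is a tuple (word, pos_tag).
    """
    root = []
    stack = [(None, root)]  # (label of the open constituent, its children)
    for match in TOKEN.finditer(text):
        if match.group("open"):
            children = []
            stack[-1][1].append((match.group("open"), children))
            stack.append((match.group("open"), children))
        elif match.group("close"):
            label, _ = stack.pop()
            if label != match.group("close"):
                raise ValueError(
                    f"closing label {match.group('close')!r} "
                    f"does not match opening label {label!r}"
                )
        else:
            # Split on the last underscore so hyphenated or punctuation
            # words such as "wine-glass_NN" or ",_," are handled.
            word, _, tag = match.group("word").rpartition("_")
            stack[-1][1].append((word, tag))
    if len(stack) != 1:
        raise ValueError("unclosed constituent: " + str(stack[-1][0]))
    return root


def constituent_labels(nodes):
    """List constituent labels in the order in which they open."""
    labels = []
    for head, rest in nodes:
        if isinstance(rest, list):  # constituents carry a list of children
            labels.append(head)
            labels.extend(constituent_labels(rest))
    return labels


example = "[N the_AT fruits_NN2 [P of_IO [N that_DD1 victory_NN1 N]P]N]"
tree = parse_brackets(example)
print(constituent_labels(tree))
```

Because the closing label is checked against the opening label, the sketch also serves as a simple consistency check: a bracketing whose labels do not pair up raises a `ValueError`.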

Skeleton parsing:

[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NNL1 N]P]N]P] [N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1 N]V]S&] ;_; and_CC [S+[N the_AT fruits_NN2 [P of_IO [N that_DD1 victory_NN1 N]P]N] [V can_VM immediately_RR be_VB0 seen_VVN [P in_II [N the_AT international_JJ community_NNJ [P of_IO [N scholars_NN2 N]P] [Fr that_CST [V has_VHZ graduated_VVN here_RL today_RT V]Fr]N]P]V]S+] ._.

This example was taken from the Spoken English Corpus.

The two examples are similar, but in the example of skeleton parsing all noun phrases are simply labelled with the letter N, whereas in the example of full parsing there are several types of noun phrase which are distinguished according to features such as plurality. The only constituent labels used in the skeleton parsing example are:

Fr relative clause
N noun phrase
P prepositional phrase
S& 1st main conjunct of a compound sentence
S+ 2nd main conjunct of a compound sentence
V verb phrase
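
The difference in granularity between the two label sets can be made concrete with a small lookup table. The sketch below is an illustration only: the mapping `FULL_TO_SKELETON` is an assumption built from the two label lists given here, not a conversion scheme used by either corpus, and labels with no obvious counterpart in the skeleton list are left unmapped.

```python
# Hypothetical mapping from the full-parsing labels listed earlier to the
# skeleton-parsing labels listed above. This is an assumption made for
# illustration, not a published conversion between the two schemes.
FULL_TO_SKELETON = {
    "Ncs": "N", "Np": "N", "Nq": "N", "Ns": "N",  # noun phrase subtypes
    "Vn": "V", "Vzb": "V", "Vzp": "V",            # verb phrase subtypes
    "Fr": "Fr",
    "P": "P",
}


def to_skeleton_label(label):
    """Return the coarser skeleton label, or None if none is assumed."""
    return FULL_TO_SKELETON.get(label)


noun_phrase_labels = ["Ncs", "Np", "Nq", "Ns"]
collapsed = {to_skeleton_label(label) for label in noun_phrase_labels}
print(len(noun_phrase_labels), "full labels collapse to", len(collapsed))
```

Under this assumed mapping, the four noun phrase labels of the full parsing example all reduce to the single label N, which is the contrast the two examples are meant to show.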

Constraint grammar

It is not always the case that a corpus is parsed using context-free phrase structure grammar. For example, the Birmingham Bank of English has been part-of-speech tagged and parsed using a form of dependency grammar known as constraint grammar (Karlsson et al. 1995).

Constraint grammar marks the grammatical functions of words within a sentence and the interdependencies between them, rather than identifying hierarchies of constituent phrase types. For example, a code with a forward pointing arrowhead (e.g. AN> ) indicates a premodifying word, in this case an adjective, while a code with a backward pointing arrowhead (e.g. <NOM-OF ) indicates a postmodifying word, in this case "of". The example below shows parsing using the Helsinki constraint grammar for English:

It has maintained its independence and present boundaries intact since 1815.

"<it>" "it" <*> <NonMod> PRON NOM SG3 SUBJ @SUBJ
"<has>" "have" <SVO> <SVOC/A> V PRES SG3 VFIN @+FAUXV
"<maintained>" "maintain" <Vcog> <SVO> <SVOC/A> PCP2 @-FMAINV
"<its>" "it" PRON GEN SG3 @GN>
"<independence>" "independence" <-Indef> N NOM SG @OBJ @NN>
"<and>" "and" CC @CC
"<present>" "present" <SVO> <P/in> <P/with> V INF @-FMAINV
            "present" A ABS @AN>
"<boundaries>" "boundary" N NOM PL @OBJ
"<intact>" "intact" A ABS @PCOMPL-O @<NOM
"<since>" "since" PREP @<NOM @ADVL
"<1815>" "1815" <1900> NUM CARD @<P
"<$.>"

On the line next to each word are three (or sometimes more) pieces of information. The first item, in double quotes, is the lemma of the word. It is followed by a part-of-speech code, which can consist of more than one string (e.g. N NOM PL). At the right-hand end of the line is a tag indicating the grammatical function of the word. These tags begin with @ and stand for:

@+FMAINV	finite main predicator
@-FMAINV	non-finite main predicator
@AN>		premodifying adjective
@CC		coordinator
@DN>		determiner
@GN>		premodifying genitive
@INFMARK>	infinitive marker
@NN>		premodifying noun
@OBJ		object
@PCOMPL-O	object complement
@PCOMPL-S	subject complement
@QN>		premodifying quantifier
@SUBJ		subject
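
The layout of a constraint grammar reading described above (lemma in double quotes, then codes, then @ function tags) can be split apart with a few lines of Python. This is a minimal sketch, assuming the fields of a reading are separated by whitespace and that the lemma itself contains no spaces. The function names `parse_reading` and `modifier_direction` are illustrative and do not belong to any constraint grammar toolkit.

```python
def parse_reading(reading):
    """Split one reading into (lemma, other codes, function tags)."""
    parts = reading.split()
    lemma = parts[0].strip('"')
    # Function tags are the items that begin with "@".
    function_tags = [p for p in parts[1:] if p.startswith("@")]
    # Everything else is part-of-speech and related coding.
    codes = [p for p in parts[1:] if not p.startswith("@")]
    return lemma, codes, function_tags


def modifier_direction(tag):
    """Apply the arrowhead convention described in the text to one @ tag."""
    if tag.endswith(">"):
        return "premodifying"   # forward-pointing arrowhead, e.g. @AN>
    if tag.startswith("@<"):
        return "postmodifying"  # backward-pointing arrowhead, e.g. @<NOM-OF
    return None                 # no arrowhead, e.g. @SUBJ


lemma, codes, tags = parse_reading('"it" PRON GEN SG3 @GN>')
print(lemma, codes, tags, [modifier_direction(t) for t in tags])
```

Only items beginning with @ are treated as function tags, so angle-bracketed codes such as <SVO> are kept with the other codes and are not mistaken for arrowheads.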
