ZCTC - corpus annotation

2. Corpus annotation

The ZCTC corpus is annotated using ICTCLAS2008, the latest release of the Chinese Lexical Analysis System developed by the Institute of Computing Technology, Chinese Academy of Sciences. This annotation tool, which relies on a large lexicon and a Hierarchical Hidden Markov Model, integrates word tokenisation, named entity identification, unknown word recognition and part-of-speech tagging. ICTCLAS2008 has been reported to achieve a precision rate of 98.54% for word tokenisation. The latest open tests have also given encouraging results, with precision rates of 98.13% for tokenisation and 94.63% for part-of-speech tagging. The application programming interface (API) of ICTCLAS2008 is publicly available at www.ictclas.org, while a compiled program is available at www.corpus4u.org.
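
For readers unfamiliar with the metric, the following is a minimal sketch of how a tokenisation precision rate can be computed. It is not the evaluation procedure behind the figures quoted above, which is not described here. It assumes the common definition (correctly segmented words divided by all words the system outputs), and the function names and the example sentence are hypothetical.

```python
def spans(tokens):
    """Convert a list of tokens into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def token_precision(system, gold):
    """Share of system tokens whose boundaries match a gold-standard token.

    Assumed definition: correct system tokens / all system tokens.
    """
    sys_spans, gold_spans = spans(system), spans(gold)
    if not sys_spans:
        return 0.0
    return len(sys_spans & gold_spans) / len(sys_spans)


# Hypothetical example: the system splits the last word in two.
gold = ["我", "喜欢", "语料库"]
system = ["我", "喜欢", "语料", "库"]
print(token_precision(system, gold))  # 0.5
```

Two of the four system tokens match the gold segmentation, so the precision in this toy case is 0.5.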

To ensure maximum comparability, a new release of the LCMC corpus (version 2.0) has been produced, retagged using the same tool. The part-of-speech tagset applied to the ZCTC and the new release of LCMC is listed below.

a adjective

ad adverbial use of adjective

ag adjectival morpheme

an nominal use of adjective

al adjectival formulaic expression

b modifier (non-predicate noun modifier)

bg noun modifier morpheme

bl noun modifying formulaic expression

c conjunction

cc coordinating conjunction

d adverb

dg adverbial morpheme

dl adverbial formulaic expression

e interjection

ew sentence-final punctuation (full stop, semi-colon, question mark, exclamation mark)

f space word

h prefix

k suffix

m numeral and quantifier

mg numeral and quantifier morpheme

mq numeral-classifier

n noun

ng nominal morpheme

nl nominal formulaic expression

nr person name

nr1 Chinese surname

nr2 Chinese first name

nrf transliterated foreign person name

nrj Japanese name

ns place name

nsf transliterated foreign place name

nt organisation name

nz other proper noun

o onomatopoeia

p preposition

pba preposition ba 把

pbei preposition bei 被

q classifier

qt temporal classifier

qv verbal classifier

r pronoun

rg pronominal morpheme

rr personal pronoun

ry interrogative pronoun

rys place interrogative pronoun

ryt temporal interrogative pronoun

ryv verbal interrogative pronoun

rz deictic pronoun

rzs place pronoun

rzt temporal pronoun

rzv verbal pronoun

s place word

t time word

tg time word morpheme

u auxiliary

ude1 的

ude2 地

ude3 得

udeng 等

udh 的话

uguo 过

ule 了

ulian 连

uls 来说、来讲、而言、说来

usuo 所

uyy 一样、一般、似的、般

uzhe 着

uzhi 之

v verb

vd adverbial use of verb

vf directional verb

vg verbal morpheme

vi intransitive verb

vl verbal formulaic expression

vn nominal use of verb

vshi 是

vx pro-verb

vyou 有

w symbols and punctuation

wb percentage and per mille signs: ％ and ‰ of full length; % of half length

wd full or half-length comma: ，,

wj full stop of full length: 。

wky closing brackets: ）〕］｝》】〗〉 of full length; ) ] } > of half length

wkz opening brackets: （〔［｛《【〖〈 of full length; ( [ { < of half length

wn full-length enumeration mark: 、

wp dash: —— －－ —— － of full length; --- ---- of half length

ws full-length ellipsis: …… …

wt full or half-length exclamation mark: ！ of full length; ! of half length

wyy full-length single or double closing quote: ” ’ 』

wyz full-length single or double opening quote: “ ‘ 『

x non-word character string

y particle

z descriptive word
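
Because most tags in the list above extend a single-letter category (for example nr1 under n, ude1 under u, wkz under w), fine-grained tags can be collapsed to their top-level class for coarse comparisons between corpora. The sketch below illustrates this under two assumptions that are not stated on this page: that tagged text is available as whitespace-separated word/tag items, and that ew (sentence-final punctuation) belongs with the w group even though it begins with e. The function names and the sample sentence are hypothetical.

```python
from collections import Counter

# Top-level categories copied from the single-letter tags in the list above.
TOP_LEVEL = {
    "a": "adjective", "b": "modifier", "c": "conjunction", "d": "adverb",
    "e": "interjection", "f": "space word", "h": "prefix", "k": "suffix",
    "m": "numeral and quantifier", "n": "noun", "o": "onomatopoeia",
    "p": "preposition", "q": "classifier", "r": "pronoun", "s": "place word",
    "t": "time word", "u": "auxiliary", "v": "verb",
    "w": "symbols and punctuation", "x": "non-word character string",
    "y": "particle", "z": "descriptive word",
}

# Assumption: ew is punctuation, so its first letter is overridden.
EXCEPTIONS = {"ew": "w"}


def top_level(tag):
    """Map a fine-grained tag such as 'nr1' to its top-level category name."""
    tag = tag.lower()
    key = EXCEPTIONS.get(tag, tag[:1])
    return TOP_LEVEL.get(key, "unknown")


def parse_tagged(line):
    """Split a line of assumed 'word/tag' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition keeps any '/' inside the word itself intact.
        word, sep, tag = item.rpartition("/")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


# Hypothetical tagged sentence, not taken from the corpus.
sample = "他/rr 把/pba 书/n 放/v 在/p 桌子/n 上/f 。/wj"
counts = Counter(top_level(tag) for _, tag in parse_tagged(sample))
print(counts)
```

Collapsing tags this way loses distinctions such as pba versus p, so it suits only coarse frequency comparisons.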