The ZCTC corpus is annotated using ICTCLAS2008, the latest release of the Chinese Lexical Analysis System developed by the Institute of Computing Technology, Chinese Academy of Sciences. This annotation tool, which relies on a large lexicon and a Hierarchical Hidden Markov Model, integrates word tokenisation, named entity identification, unknown word recognition and part-of-speech tagging. ICTCLAS2008 has been reported to achieve a precision rate of 98.54% for word tokenisation. The latest open tests have also given encouraging results, with precision rates of 98.13% for tokenisation and 94.63% for part-of-speech tagging. The application programming interface (API) of ICTCLAS2008 is publicly available at www.ictclas.org, while a compiled program is available at www.corpus4u.org.
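The precision figures above come from the tool's own evaluations, and the exact scoring procedure is not described here. As a rough illustration only, one common way to score tokenisation precision is to compare the character spans of predicted tokens with those of a reference segmentation. The function names and the sample sentence below are hypothetical and are not part of ICTCLAS2008.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def tokenisation_precision(predicted, gold):
    """Share of predicted tokens whose span also occurs in the reference."""
    pred, ref = spans(predicted), spans(gold)
    return len(pred & ref) / len(pred) if pred else 0.0


# Hypothetical example: the tagger wrongly splits the place name into two tokens.
gold = ["我们", "在", "北京", "学习"]
predicted = ["我们", "在", "北", "京", "学习"]

precision = tokenisation_precision(predicted, gold)
print(f"{precision:.2%}")
```

Here three of the five predicted tokens match a reference span, so the sketch reports a precision of 60%.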
To ensure maximum comparability, a new release of the LCMC corpus (version 2.0) has been produced, retagged with the same tool. The part-of-speech tagset applied to the ZCTC and the new release of LCMC is listed below.
a adjective
ad adverbial use of adjective
ag adjectival morpheme
an nominal use of adjective
al adjectival formulaic expression
b modifier (non-predicate noun modifier)
bg noun modifier morpheme
bl noun modifying formulaic expression
c conjunction
cc coordinating conjunction
d adverb
dg adverbial morpheme
dl adverbial formulaic expression
e interjection
ew sentence-final punctuation (full stop, semi-colon, question mark, exclamation mark)
f space word
h prefix
k suffix
m numeral and quantifier
mg numeral and quantifier morpheme
mq numeral-classifier
n noun
ng nominal morpheme
nl nominal formulaic expression
nr person name
nr1 Chinese surname
nr2 Chinese first name
nrf transliterated foreign person name
nrj Japanese name
ns place name
nsf transliterated foreign place name
nt organisation name
nz other proper noun
o onomatopoeia
p preposition
pba preposition ba 把
pbei preposition bei 被
q classifier
qt temporal classifier
qv verbal classifier
r pronoun
rg pronominal morpheme
rr personal pronoun
ry interrogative pronoun
rys place interrogative pronoun
ryt temporal interrogative pronoun
ryv verbal interrogative pronoun
rz deictic pronoun
rzs place pronoun
rzt temporal pronoun
rzv verbal pronoun
s place word
t time word
tg time word morpheme
u auxiliary
ude1 的
ude2 地
ude3 得
udeng 等
udh 的话
uguo 过
ule 了
ulian 连
uls 来说、来讲、而言、说来
usuo 所
uyy 一样、一般、似的、般
uzhe 着
uzhi 之
v verb
vd adverbial use of verb
vf directional verb
vg verbal morpheme
vi intransitive verb
vl verbal formulaic expression
vn nominal use of verb
vshi 是
vx pro-verb
vyou 有
w symbols and punctuation
wb percentage and permille signs: % and ‰ of full length; % of half length
wd full or half-length comma: , ,
wj full stop of full length: 。
wky closing brackets: ) 〕 ] } 》 】 〗 〉 of full length; ) ] } > of half length
wkz opening brackets: ( 〔 [ { 《 【 〖 〈 of full length; ( [ { < of half length
wn full-length enumeration mark: 、
wp dash: —— -- —— - of full length; --- ---- of half length
ws full-length ellipsis: …… …
wt full or half-length exclamation mark: ! of full length; ! of half length
wyy full-length single or double closing quote: ” ’ 』
wyz full-length single or double opening quote: “ ‘ 『
x non-word character string
y particle
z descriptive word
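
The tagset above can be applied when reading tagged text programmatically. The following is a minimal sketch, assuming the tagged output is plain text in which each token is written as word/tag and tokens are separated by whitespace; that layout, the sample sentence and the function names are assumptions for illustration and should be checked against the actual corpus files. It also assumes that the first letter of a tag names its top-level class (so nr1 falls under n), with ew handled as an exception because it is listed above as sentence-final punctuation rather than as a kind of interjection.

```python
from collections import Counter

# A small subset of the tagset listed above; extend as needed.
TAG_DESCRIPTIONS = {
    "n": "noun",
    "ns": "place name",
    "rr": "personal pronoun",
    "p": "preposition",
    "v": "verb",
    "wj": "full stop of full length",
}


def parse_tagged(line):
    """Split a line of word/tag tokens into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that a slash inside the word survives.
        word, sep, tag = item.rpartition("/")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


def coarse_category(tag):
    """Map a tag to its top-level class (assumption: the first letter)."""
    # ew is sentence-final punctuation, so group it with w rather than e.
    return "w" if tag == "ew" else tag[0]


# Hypothetical tagged sentence in the assumed word/tag layout.
sample = "我们/rr 在/p 北京/ns 学习/v 。/wj"
pairs = parse_tagged(sample)
coarse_counts = Counter(coarse_category(tag) for _, tag in pairs)

for word, tag in pairs:
    print(word, tag, TAG_DESCRIPTIONS.get(tag, "(not in subset)"))
```

Counting coarse categories in this way makes it straightforward to compare part-of-speech distributions between the ZCTC and the retagged LCMC, since both use the same tagset.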