The UCLA Chinese Corpus

The UCLA Written Chinese Corpus

We are pleased to announce the release of the second edition of the UCLA Written Chinese Corpus (UCLA2), which has been expanded to one million words.

The UCLA Written Chinese Corpus is designed as a Chinese counterpart to the FLOB and Frown corpora of British and American English for contrastive research, and as a recent update of the Lancaster Corpus of Mandarin Chinese (LCMC) for diachronic studies of possible changes in written Chinese over the past decade. This period is of special significance because of the impact of the Internet on language, especially on Chinese, which makes the corpus an excellent complement to LCMC.

The samples in the corpus were all collected from written modern Chinese texts available on the Internet during the period 2000-2012, though some texts may have been converted from paper-based publications in earlier years. Text categories are matched as closely as possible to the Brown corpus model, with some variations (e.g. adventure fiction) to accommodate Chinese characteristics, while the proportions of the different text categories may vary from those of the English counterparts and LCMC. The genres covered and their sample sizes (in tokens, including punctuation marks) in the two editions of the UCLA Written Chinese Corpus are shown in the table below.

Code	Genre	Tokens in first edition (UCLA1)	Tokens in second edition (UCLA2)
A	Press: reportage	84302	88933
B	Press: editorials	25155	58284
C	Press: reviews	32223	35018
D	Religion	5885	41308
E	Skills, trades and hobbies	8925	55662
F	Popular lore	24854	69664
G	Essays and biographies	71169	164686
H	Misc. (reports and official documents)	65705	65705
J	Academic prose	27652	150507
K	General fiction	40999	85912
L	Mystery and detective stories	85317	85912
M	Science fiction	60378	60378
N	Adventure stories	55253	58144
P	Romantic fiction	66849	66849
R	Humour	32968	32968
Total number of tokens		687634	1119930
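
The figures in the table above can be checked and compared with a few lines of Python. This is only an illustrative sketch: the token counts are copied from the table, while the names TOKENS, totals, shares and unchanged are hypothetical helpers introduced here and are not part of the corpus distribution.

```python
# Token counts copied from the table above: genre code -> (UCLA1, UCLA2).
TOKENS = {
    "A": (84302, 88933),
    "B": (25155, 58284),
    "C": (32223, 35018),
    "D": (5885, 41308),
    "E": (8925, 55662),
    "F": (24854, 69664),
    "G": (71169, 164686),
    "H": (65705, 65705),
    "J": (27652, 150507),
    "K": (40999, 85912),
    "L": (85317, 85912),
    "M": (60378, 60378),
    "N": (55253, 58144),
    "P": (66849, 66849),
    "R": (32968, 32968),
}


def totals(tokens):
    """Return the total token counts as (first edition, second edition)."""
    first = sum(u1 for u1, _ in tokens.values())
    second = sum(u2 for _, u2 in tokens.values())
    return first, second


def shares(tokens, edition):
    """Return each genre's proportion of one edition (0 = UCLA1, 1 = UCLA2)."""
    total = sum(counts[edition] for counts in tokens.values())
    return {code: counts[edition] / total for code, counts in tokens.items()}


def unchanged(tokens):
    """Return the genre codes whose token count is the same in both editions."""
    return [code for code, (u1, u2) in tokens.items() if u1 == u2]


if __name__ == "__main__":
    print(totals(TOKENS))
    print(unchanged(TOKENS))
    for code, share in shares(TOKENS, 1).items():
        print(f"{code}\t{share:.1%}")
```

The per-genre counts sum to the totals given in the last row of the table, and the comparison shows that categories H, M, P and R have the same token counts in both editions.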

The corpus is encoded in Unicode and is XML-compliant. Each corpus file is composed of a corpus header and a text body. The header gives general information about the corpus file. In the body, paragraphs, sentences and tokens are marked up, with each sentence numbered and each token annotated for part of speech.
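
As a rough illustration of how such a file might be read, the sketch below parses a small made-up sample with Python's standard xml.etree.ElementTree module. The element and attribute names used here (text, header, body, p, s with attribute n, w with attribute pos) and the POS labels are assumptions made for this example only; the actual names should be checked against a real corpus file and the POS tagset before use.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration. All element names, attribute names and
# POS labels below are assumptions, not the documented corpus schema.
SAMPLE = """<text id="X01">
  <header><title>sample file</title></header>
  <body>
    <p>
      <s n="1"><w pos="r">我</w><w pos="v">喜欢</w><w pos="n">语言学</w><w pos="w">。</w></s>
      <s n="2"><w pos="r">他</w><w pos="v">读</w><w pos="n">书</w><w pos="w">。</w></s>
    </p>
  </body>
</text>"""


def read_sentences(xml_text):
    """Return a list of (sentence number, [(token, pos), ...]) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # Collect every token in the sentence with its part-of-speech label.
        tokens = [(w.text, w.get("pos")) for w in s.iter("w")]
        sentences.append((s.get("n"), tokens))
    return sentences


if __name__ == "__main__":
    for number, tokens in read_sentences(SAMPLE):
        print(number, " ".join(f"{tok}/{pos}" for tok, pos in tokens))
```

For a real corpus file, ET.parse(path).getroot() could replace ET.fromstring, with the tag and attribute names adjusted to match the file.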

The UCLA Chinese Corpus is a product of the joint effort of Professor Hongyin Tao (University of California, Los Angeles) and Dr. Richard Xiao (UCREL, Lancaster University). Funding for this project was provided to Hongyin Tao by the UCLA Academic Senate during the academic years 2003-2005, while Richard Xiao was supported by the UK Economic and Social Research Council (Award Reference RES-000-23-0553). We are also obliged to Iris Li, Haiyong Liu, Hui Zhang, and Danjie Su for their assistance in data collection.

The UCLA2 corpus is distributed free of charge for use in non-profit-making research. For licensing information, please refer to the LCMC licence. A description of the POS tagset is also available. The corpus can be accessed online via the CQP web interface hosted at Beijing Foreign Studies University (username and password are both test, which allows access to the whole corpus and the full functionality of the tool).

The UCLA Chinese Corpus can be cited as:

Tao, Hongyin and Richard Xiao (2012) The UCLA Chinese Corpus (2nd edition). UCREL, Lancaster.

Tao, Hongyin and Richard Xiao (2007) The UCLA Chinese Corpus (1st edition). UCREL, Lancaster.

Disclaimer: We give no warranties that the UCLA corpus will be suitable for any particular purpose and accept no responsibility for any technical limitations of the corpus or software.

Created and maintained by Richard Xiao 2013