1. Aims
The corpus approach is increasingly being recognized as a useful tool for linguistic investigation. However, in-depth monolinguistic studies of Chinese have proved difficult in the past because of the general lack of publicly available balanced corpora. Indeed, most of the existing resources relating to Chinese (e.g. the PH Corpus, the PFR People’s Daily Corpus released by the Institute of Computational Linguistics, Peking University, and the corpora released by the LDC) are composed exclusively of newswire texts, or of newswire texts and official documents. Balanced corpora do exist, but they, too, are problematic. The Sinica Corpus, for example, represents the language used in Taiwan, and is therefore not representative of modern Mandarin Chinese as written on the mainland of China. Another corpus that does represent mainland Chinese (see Zhou & Yu 1997) is not publicly available.
The LCMC corpus seeks to enable in-depth monolinguistic studies by making a diverse range of text-types publicly available to academic researchers. LCMC, in combination with FLOB, also provides a sound basis for contrastive studies of Chinese and English, whether one wishes to compare the two languages as a whole or compare them by text type.
2. Sampling frame and text collection
In the LCMC corpus, the FLOB sampling frame is followed strictly except for two minor variations. The first variation relates to the sampling frame – we replaced western and adventure fiction (category N) with martial arts fiction. There are three reasons for this decision. Firstly, there is simply no western fiction in China; secondly, martial arts fiction is a type of adventure fiction that, in China especially, is both popular and important, and therefore should be represented; thirdly, the language used in martial arts fiction is a distinctive language type, and thus worthy of study in its own right. Most stories of this type, even though they were published recently, are under the influence of vernacular Chinese, i.e. modern Chinese styled to appear like classical Chinese. While the inclusion of this text type has made the tasks of POS tagging and post-editing more difficult, it may also make it possible to compare representations of vernacular Chinese and modern Chinese. The second variation relates to our decision to modify the FLOB sampling period slightly by including samples within ±2 years of 1991 when an insufficient number of texts for a given category were produced in 1991 (we assume that such a time span will not influence a language significantly). Texts produced within ±2 years of 1991 represent no more than one third of the 500 samples included in the LCMC corpus.
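The two constraints on the sampling period described above can be expressed as a simple check. The following is only an illustrative sketch, not part of the corpus tooling: the function names and the list of publication years are hypothetical, and the one-third limit is applied to the whole set of samples as stated in the text.

```python
# Illustrative sketch only: function names and input data are hypothetical.

TARGET_YEAR = 1991   # FLOB sampling year followed by LCMC
TOLERANCE = 2        # samples within +/-2 years are admitted when needed


def within_sampling_period(year: int) -> bool:
    """True if a text's publication year lies within +/-2 years of 1991."""
    return abs(year - TARGET_YEAR) <= TOLERANCE


def off_target_share_ok(years: list[int]) -> bool:
    """True if texts not published in 1991 itself make up no more than
    one third of all samples (integer arithmetic avoids rounding issues)."""
    off_target = sum(1 for y in years if y != TARGET_YEAR)
    return 3 * off_target <= len(years)


# Hypothetical distribution of 500 samples, 160 of them outside 1991.
years = [1991] * 340 + [1989] * 80 + [1993] * 80
print(all(within_sampling_period(y) for y in years))
print(off_target_share_ok(years))
```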
The LCMC corpus has been constructed using written Mandarin Chinese texts published in Mainland China to ensure some degree of textual homogeneity. It should be noted that plain written texts alone have been transcribed, with tables, illustrations, pictures, formulae and special symbols omitted and replaced with a gap element marked by the wording ‘omission’. Long citations from translated texts or texts produced outside the sampling period were also omitted so that the effect of translationese could be excluded and L1 quality guaranteed.
A small number of samples that conformed to our sampling frame were collected from the Internet. Most samples, however, were provided by the SSReader Digital Library in China. As each page of the electronic books in the library comes in PDG format, these pages were converted into text files using an OCR program provided by the digital library. This scanning process resulted in a 1-3% error rate, depending on the quality of the picture files. Each electronic text file was proofread and corrected independently by two native speakers of Mandarin Chinese so as to keep the transcribed raw texts as accurate as possible.
While the digital library has a very large collection of books, it does not provide newspapers. The only sources of newswire texts from the library are a dozen collections of news reports that won awards at various levels. These collections, however, represent newswire texts from more than eighty newspapers and television or broadcasting stations. The samples from these sources account for around two thirds of the texts for the press categories (A-C). The remaining third was sampled from newswire texts from Xinhua News Agency (excerpted from the PH Corpus). Considering that this is the most important and representative news provider in China, we believe that this proportion is justified.
3. Encoding and markup conventions
Unlike western languages such as English, whose characters can each be encoded in a single byte, Chinese requires two bytes for each character in encodings such as GB2312 and Big5. Currently there are three encoding systems for Chinese characters: GB2312 for simplified Chinese, Big5 for traditional Chinese, and Unicode. While the original texts were encoded in GB2312, we decided to convert the encoding into Unicode (UTF-8) for two reasons: (1) to ensure compatibility between non-Chinese operating systems and Chinese characters; and (2) to take advantage of the latest Unicode-compatible concordancers such as Xara version 1.0 and WordSmith Tools version 4.0.
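The conversion from GB2312 to UTF-8 described above can be illustrated with a minimal sketch. This is not the procedure actually used to build the corpus; the function name and the sample string are assumptions made for illustration, and only codecs from the Python standard library are used.

```python
# Illustrative sketch only: not the conversion tool used for LCMC.


def gb2312_to_utf8(data: bytes) -> bytes:
    """Decode GB2312-encoded bytes and re-encode the text as UTF-8."""
    text = data.decode("gb2312")   # raises UnicodeDecodeError on invalid input
    return text.encode("utf-8")


# Hypothetical sample: two simplified Chinese characters.
sample = "中文".encode("gb2312")
converted = gb2312_to_utf8(sample)

# Each of these characters takes two bytes in GB2312 and three in UTF-8.
print(len(sample), len(converted))
```

For whole files, the same effect can be obtained by opening the source with `encoding="gb2312"` and writing the output with `encoding="utf-8"`.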
To make our data more convenient for users whose operating system is earlier than Windows 2000 and lacks a language support pack, we have produced a Pinyin version of the LCMC corpus in addition to the standard version containing characters. While also encoded using UTF-8, the Pinyin version will be more compatible with older operating systems and concordancers. This version is also of assistance to users who can read Romanised Chinese but not Chinese characters.