Collocations
The idea of collocation is important in many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.
Given a text corpus, it is possible to determine empirically which pairs of words have a substantial amount of "glue" between them. Two of the most commonly encountered measures are mutual information and the Z-score. Both tests provide similar data: they compare the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that their co-occurrence is simply the result of chance. For example, the words riding and boots may occur as a joint event because they belong to the same multiword unit (riding boots), while the words formula and borrowed may occur together only as a one-off juxtaposition and have no special relationship. Each pair of words is given a score: the higher the score, the greater the degree of collocality.
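The comparison of observed and chance co-occurrence described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the exact procedure used in the studies cited here: the function name bigram_scores is hypothetical, only directly adjacent word pairs are counted, the chance expectation is taken to be f(x) * f(y) / N, mutual information is computed as log2(observed / expected), and the Z-score uses the simplified form (observed - expected) / sqrt(expected). Other formulations of both measures exist.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (mi, z)} for every adjacent word pair in tokens.

    Assumptions (not taken from the source text): pairs are adjacent
    words only, and the Z-score uses sqrt(expected) as its denominator.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (w1, w2), observed in pair_freq.items():
        # Co-occurrence count we would expect if w1 and w2 were independent.
        expected = word_freq[w1] * word_freq[w2] / n
        # Mutual information: log ratio of observed to chance co-occurrence.
        mi = math.log2(observed / expected)
        # Z-score: distance of observed from expected, in rough standard units.
        z = (observed - expected) / math.sqrt(expected)
        scores[(w1, w2)] = (mi, z)
    return scores
```

Calling bigram_scores on a tokenised corpus and sorting the result by either score gives a ranked list of candidate collocations. On very small samples the scores are unreliable, because mutual information in particular tends to overrate pairs built from rare words, so a minimum frequency cut-off is usually worth applying before ranking.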
Mutual information and the Z-score are useful in the following ways:
- They enable us to extract multiword units from corpus data, which can be used in lexicography and particularly specialist technical translation.
- We can group similar collocates of a word together to help identify its different senses. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words like investment, indicating the financial sense.
- We can discriminate the differences in usage between words which are similar. For example, Church et al. (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the subtle differences in collocation between the two words has a potentially important role, for example in helping students who learn English as a foreign language.
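The last two uses in the list above, grouping the collocates of a word and contrasting two similar words, can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the functions collocates_by_mi and distinctive_collocates are hypothetical, the window of span words on each side of the node word is an arbitrary choice, the chance expectation is approximated as f(node) * f(word) * 2 * span / N, and the threshold of 0 simply keeps collocates that occur more often than chance would predict. It is not the method of Church et al. (1991).

```python
import math
from collections import Counter


def collocates_by_mi(tokens, node, span=2, threshold=0.0):
    """Rank words found within `span` tokens of `node` by mutual information."""
    n = len(tokens)
    word_freq = Counter(tokens)

    # Count every word that appears in the window around each node occurrence.
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                near[tokens[j]] += 1

    ranked = []
    for word, observed in near.items():
        # Approximate chance count over the 2 * span slots around each node.
        expected = word_freq[node] * word_freq[word] * 2 * span / n
        mi = math.log2(observed / expected)
        if mi > threshold:
            ranked.append((word, mi))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def distinctive_collocates(tokens, word_a, word_b, span=2):
    """Return the collocates found only with word_a and only with word_b."""
    a = {w for w, _ in collocates_by_mi(tokens, word_a, span)}
    b = {w for w, _ in collocates_by_mi(tokens, word_b, span)}
    return a - b, b - a
```

With a large corpus, distinctive_collocates(tokens, "strong", "powerful") would return two sets of words, each associated with one of the pair but not the other, which is the kind of contrast the strong and powerful example describes. A realistic study would also filter out function words and low-frequency items.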
Read about the use of mutual information in parallel aligned corpora in Corpus Linguistics, Chapter 3, page 73.