Surveying End-User Needs for the
The Department of Linguistics at Lancaster University has been engaged in two recent EPSRC-funded projects that drew attention to the non-indigenous minority language communities in the UK by locating existing resources, investigating end-user needs and wants, examining basic technical issues and beginning to generate appropriate resources. That work identified a corresponding gap in the market for the indigenous minority languages of the British Isles and Ireland, or “BIMLs”: Cornish, (Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh. These languages are becoming increasingly widely used in both public and private life, and speech and language technology applications for them are now an increasingly urgent need. To develop such applications, basic language resources are required.
The LER-BIML project has three primary aims:
i) to survey the existing language engineering resources and tools for the BIMLs in question;
ii) to obtain information regarding end-user needs and demands in these areas;
iii) to investigate some of the particular technical issues that these BIMLs raise, principally in view of spoken corpus collection and annotation.
This workpackage concentrates on the second of these objectives.
A web questionnaire posted on the project website was judged the most effective way of capturing as wide a range of potential end-user needs for BIML resources as possible. Notice of the questionnaire was emailed to over fifteen Internet bulletin boards and mailing lists, including HUMANIST, CORPORA, TERMCELT and CELTLING. This ensured that the questionnaire reached all the BIML linguistic regions, as well as groups outside the British Isles and Ireland who work with the BIMLs as non-indigenous minority languages. The questionnaire focused on how different groups of users respond to language engineering resources and corpus construction for the BIMLs.
There were 128 responses, and 57 of the respondents were interested in receiving feedback from the survey. This would be provided by emailing them a copy of the report.
Scottish Gaelic had the highest demand for corpus resources; there was no demand at all for Ulster Scots[1]. There was strong interest in seeing more resources made available for Breton and Shetlandic, with individual requests for the Channel Island languages and Romany, amongst others.
A bilingual corpus was the most favoured corpus type, specifically one that would contain English alongside the BIMLs in question and would contain sentence-aligned translations of the same texts in each language.
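As an illustration of what a sentence-aligned bilingual corpus involves, the following minimal Python sketch pairs each English sentence with its translation and gives every pair a shared identifier. It is not taken from the LER-BIML project: the names `AlignedSentence` and `align_sentences` are hypothetical, the Welsh sentences are illustrative only, and the sketch assumes that both texts have already been split into sentences in matching order.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlignedSentence:
    """One aligned unit: the same sentence in English and in a BIML."""
    sentence_id: int
    english: str
    biml: str


def align_sentences(english: List[str], biml: List[str]) -> List[AlignedSentence]:
    """Pair sentences one-to-one by position (hypothetical helper).

    Assumes the two lists are already in matching order; real alignment
    usually also has to handle one-to-many and many-to-one cases.
    """
    if len(english) != len(biml):
        raise ValueError("Texts must contain the same number of sentences")
    return [
        AlignedSentence(sentence_id=i, english=en, biml=bl)
        for i, (en, bl) in enumerate(zip(english, biml), start=1)
    ]


# Illustrative data only: two short English sentences and Welsh equivalents.
pairs = align_sentences(
    ["Good morning.", "Thank you."],
    ["Bore da.", "Diolch."],
)
for pair in pairs:
    print(pair.sentence_id, pair.english, "|", pair.biml)
```

Keeping a shared identifier on each pair means a search hit in one language can be traced directly to its counterpart in the other, which is the main practical benefit of sentence-level alignment.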
Most respondents wanted to see an equal balance of written and spoken data built for the BIMLs, and for this to be done within general balanced corpora rather than in genre-specific corpora. Among genre-specific corpora, news, history and fiction proved the most popular areas of interest. Respondents thought it important, as an ideal, that all types of genre be made available for the individual languages. Suggestions other than those proposed on the questionnaire included arts and music, youth culture, environment, travel, technology, oral literature, folklore, food and drink, media and advertising, and religion and ethics.
As regards linguistic annotation of the data, most respondents would prefer just plain text. Of the methods of annotation on offer, however, part-of-speech annotation was the next most popular. Respondents would be happy with anything that could be made available, although there was special mention of IPA, metaphorical and dialect annotation. The question of textual mark-up returned the highest number of nil responses, but amongst those who were interested in seeing mark-up, HTML was the favourite.
The Internet was the favourite medium for receiving corpus data, with the CD a close second.
On the issue of the listed features and their perceived importance within a corpus, the most common response was ‘no opinion’. The only features which received a majority rating of ‘essential’ were the header elements ‘author’, ‘source of data’ and ‘language use’. The spoken data features were the only ones to attract ‘not wanted’ responses.
The majority of respondents were linguists rather than language engineers. Applications which the language engineers envisaged using the BIML data to build included frequency tables; speech synthesis and recognition; spelling, style, syntax and grammar checkers; bilingual dictionaries and lexica; and pedagogical tools. Questions the linguists wanted to explore with the data included effects of linguistic shift and borrowings; frequencies and variations of syntactic structures, dialect, registers and discourse; patterns of code switching across genres; reported versus actual usage; patterns of growth and decline; and reception by young people. Suggested support tools included concordances, checkers, search and recognition tools, glossaries, taggers, text aligners, and audio and video files.
Encouragingly, an overwhelming majority of respondents said they were very likely to be working with the BIMLs in the future.
The encouraging number of responses to the survey indicates that current activity and work in progress on the BIMLs is healthy and positive. The higher demand for Scottish Gaelic, Irish and Welsh most likely reflects the fact that the more specialised mailing lists catering specifically for these languages are in much wider use. A future survey of this type should ask respondents to indicate from which mailing list they received details of the questionnaire, to give a better overview of how the respondents are distributed. It would also be beneficial to establish the professional status of the respondents, to indicate from what type of linguistic background they are approaching the data. These factors could help explain why these results appear to go somewhat against expectation in light of the results of the previous survey of existing resources.
[1] This zero return could be explained by the presumption that the respondents identified more with the category Scots than Ulster Scots or, from a more negative point of view, by a simple lack of interest in this area.