Surveying Existing Resources for the
The Department of Linguistics at Lancaster University has been engaged in two recent EPSRC-funded projects that draw attention to the non-indigenous minority language communities in the UK by locating existing resources, investigating end-user needs and wants, examining basic technical issues and beginning to generate appropriate resources. This work identified a corresponding gap in the market for the indigenous minority languages of the British Isles and Ireland, or “BIMLs”: Cornish, (Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh. These languages are becoming increasingly widely used in both public and private life, and speech and language technology applications for them are becoming an increasingly urgent need. To develop such applications, basic language resources are required.
The LER-BIML project has three primary aims:
i) to survey the existing language engineering resources and tools for the BIMLs in question;
ii) to obtain information regarding end-user needs and demands in these areas;
iii) to investigate some of the particular technical issues that these BIMLs raise, principally with a view to spoken corpus collection and annotation.
This initial workpackage concentrates on the first of these objectives.
The BIMLs in question fall into the two language families of Celtic Indo-European and Scots. As it has not previously been general practice to treat these two families as a single cohesive group, there is no central governing body responsible for coordinating their resources and facilities and making them widely available. Consequently, work on these languages is widespread and yet thinly distributed. This paper is therefore concerned with the manner and extent of this distribution and with the level of current activity surrounding these languages.
It was evident from the outset that the Internet would be the main starting point for locating material and suitable contacts. Our main focus was intended to be text corpora and machine-readable texts, while also locating speech databases, term banks, lexicons and language analysis tools such as taggers and parsers. The Internet has the obvious benefit of supplying all texts in electronic form. Search engines proved a useful initial means of gauging the volume of material available for the BIMLs, and from there subsequent leads and contacts were followed up. It was apparent that most of this material would require sifting: whilst there is a wealth of information in English about the BIMLs, the interests of this project lie with resources directly available in the languages themselves[1].
Aside from MILLE and EMILLE, there are several other projects in progress that share themes with LER-BIML. These include CELT, an online database of ancient and contemporary Irish literary and historical texts; the National Corpus of Irish, incorporating 15 million words from a variety of contemporary books, newspapers, periodicals and discourse, marked up in accordance with the PAROLE encoding standards; and MELIN, which has produced dictionaries, grammars, spellcheckers and terminology lists for the initial four EU minority languages, Irish, Welsh, Catalan and Basque. Their sites offer good links to BIML data and other websites. The Oxford Text Archive has a catalogue of several thousand electronic texts and linguistic corpora in a range of languages, including standard reference works and monolingual and bilingual dictionaries. The Universities of Edinburgh and Glasgow have recently begun collaborative work on the SCOTS project, which aims to build a collection of electronic spoken and written texts for the languages of Scotland, incorporating Scottish English, Scots and Gaelic. Previous work in this field has included a one-million-word lexical database and frequency count for Welsh (CEG), developed at Bangor from a broad range of modern Welsh text types, and Briony Williams’s annotated speech database for Welsh and its application in speech technology at CSTR, University of Edinburgh.
In assessing the relative volume of BIML resources, it was clear that Welsh has the most widely available material. This is a direct result of the Welsh Language Act 1993, which states that the public sector must offer its services bilingually, something that much of the private sector has now also adopted. In locating BIML data we therefore took specific note of whether the material appeared solely in the original language or in bilingual format. One particularly good example of Welsh parallel text is to be found on university homepages. Sabhal Mór Ostaig, a Further Education College on Skye, offers a comprehensive index to key resources in Scottish Gaelic and is a primary gateway for BIML data. ACCAC has a parallel site with exhaustive links to Welsh-language and bilingual educational sites for ages 4-18. Dublin City University has a centre, FIONTAR, which administers academic programmes entirely through the medium of Irish and has a bilingual website outlining this. The Centre for Manx Studies kindly sent an extensive list of Manx resources, which included the main resources page developed by the Manx Language Officer at the Department of Education. This page offers links to short stories, the Manx Language Society’s newsletter Dhooraght and the magazine CARN, as well as dictionaries, grammars and glossaries.
There are interactive and self-teaching language courses available for all the BIMLs, ranging in style and intensity from the colloquial to more grammatically orientated instruction, and catering for all levels of learner. Most sites also supply vocabulary lists, phrase lists and glossaries.
The Irish, Ulster Scots and Welsh languages are all well represented by official Language Boards and Agencies, whose websites appear entirely in bilingual format and offer useful links. Whilst no such official bodies exist for the remaining BIMLs, there are organisations whose work is invaluable in promoting their respective languages, such as Agan Tavas for Cornish, Cli for Scottish Gaelic and Mannin.Org.Im for Manx. These sites include histories of the language, manuscripts, reference materials, news items and merchandise. Most of these pressure groups offer discussion fora and mailing lists, in English and in the language in question, where reports are archived for public reading. Webzines are particularly popular and are a very good source of BIML resources, as they form the focus of special interest groups with a targeted, loyal audience. Examples are An Gannas for Cornish speakers, Beo for Irish, Wir Leid for Scots and Ullans.com. They offer comment, stories, puzzles, quizzes, jokes, polls and reviews, amongst other things. The Mercator project, which serves as an information network for minority languages of the European Union, profiles Cornish, Scottish Gaelic, Irish and Welsh amongst these languages and directs users towards associated resources.
All Welsh council and parliamentary sites, including that of the National Assembly, are legally required to be presented in bilingual format, and the Scottish Parliament and Northern Ireland Executive are following their lead. Health Authorities in Wales that have developed their own websites also present them in this bilingual format. There is obviously much more “official” material available for Welsh than for any of the other BIMLs, but Gaelic and Irish are certainly increasing their profile.
Media resources are rich in BIML data, with various online newspapers, radio broadcasts and television schedules. Newspapers range from those presenting their entire content in the original language, like the Welsh weeklies Y Cymro and Golwg and the Irish weeklies Foinse and Lá, to those producing special reports, like An Phoblacht, Ireland’s leading weekly Republican newspaper (archived). BBC Online is available in Welsh, BBC Scotland has pages in Gaelic, BBC Cornwall produces a weekly five-minute audio news bulletin, and the Welsh and Irish stations S4C and TG4 respectively have bilingual sites. The recently launched BBC4 channel broadcasts programmes in Gaelic. RTE, the national Irish radio service, provides bulletins online, and Raidió na Gaeltachta has an extensive bilingual site which supplies audio downloads of its broadcasts 24 hours a day. There is a weekly text and audio review of Manx Radio. Search engines and web browsers can be used in Welsh, Gaelic and Irish: the Opera Web Browser is available in Irish, Scottish Gaelic and Welsh as well as Breton, and a detailed guide to Welsh software can be found at Meddal.
The Internet does not entirely replace the printed text, as much of the literary corpus is reproduced on various sites. Poetry and song lyrics are favourites, some with audio recordings, and are often used alongside language lessons. Short stories have been written by contributors to the webzines. There are several online book inquiry services and bookshops, including the Welsh Books Council, and the National Library of Wales has a bilingual website. Details of and adverts for film and music festivals and local events are often posted in the respective BIML and linked through webzines and special interest groups. The National Museums and Galleries of Wales website is in parallel format.
As regards religious texts, the Cornish Language Board has translated several books of the Bible into Cornish. Various excerpts have also been translated into Manx, and there are Manx, Scottish Gaelic and Welsh versions of the Book of Common Prayer.
Online dictionaries are widely available for all the BIMLs, and spellcheckers have been developed for Cornish, Scottish Gaelic, Irish, Manx and Welsh. Canolfan Bedwyr at the University of Bangor has published extensively in the area of specialist terminology dictionaries, most of which are online; these include, amongst others, glossaries of finance and education terms produced for the National Assembly for Wales.
There is a reasonably healthy volume of BIML resources available on the Internet but, as predicted, Welsh is the most prominent and prolific of these languages. This prominence is a direct result of the Welsh Language Act 1993, which requires all official material to be presented bilingually. As there is no such legislation for the other BIMLs, there is a strong bias amongst these languages towards popular entertainment, particularly webzines, and archaic literature, as they still rely primarily on specialist interest. They are, however, increasing their profile within more official and political contexts. The paucity of Scots and Ulster Scots resources in particular is probably best accounted for by the difficulty of drawing the boundary between what is a dialect of English and what is something entirely recognisable as “Scottish”. This is a much disputed issue.[2]
Bilingual text tends to be the most favoured format for presenting BIML data, as it caters for the interests of the native BIML speaker whilst also recognising the need to extend its resources to the non-BIML speaker. Much material exists on the subject of these languages, but in English. The positive results of this survey therefore support the project’s claims of the increasing profile of the BIMLs in the UK today.
[1] To avoid confusion the term “BIML data/material” will refer exclusively to material written in the languages themselves, rather than in English.
[2] In 2001, the UK Government signed and ratified the European Charter for Regional or Minority Languages. As a result, Scots is recognized as a regional language in Scotland and in Northern Ireland, except that in Northern Ireland it is referred to as Ulster-Scots. For most linguists, Scots is the national dialect of Scotland, bound up with present-day English, and as a national dialect it also has regional variants, including the variant in Ulster. For many non-linguists, Scots was once a language and therefore still is one, on the grounds both of national ideology and of the distinctiveness it still retains, which separates it from English. For some politicians and some activists, Scots is officially a regional language, which is a political and not a linguistic concept. However, because of the way the legislation is expressed, some politicians and some activists consider Ulster Scots a separate regional language in Northern Ireland.