This document contains a description of the structure of the corpus, as embodied in the filenames of the texts and the directory structure.
It also gives word counts for the various sections, and acknowledgements to the organisations and individuals who have allowed their texts to be used in the corpus.
The text is encoded as two-byte Unicode text. For more information on Unicode, see www.unicode.org.
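As a minimal sketch of reading such a file, the Python function below assumes that "two-byte Unicode" corresponds to what Python calls the "utf-16" codec (which uses a byte-order mark, when present, to detect byte order). That mapping is an assumption rather than something this document states, and the function name read_corpus_text is hypothetical.

```python
from pathlib import Path


def read_corpus_text(path, encoding="utf-16"):
    """Return the contents of one corpus file as a Python string.

    The default encoding is an assumption: "two-byte Unicode" is taken
    to mean UTF-16. Pass "utf-16-le" or "utf-16-be" explicitly if a file
    has no byte-order mark and is decoded incorrectly.
    """
    return Path(path).read_text(encoding=encoding)
```

If a file fails to decode, trying an explicit byte order is a reasonable first step before suspecting the file itself.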
The texts are marked up in SGML using level 1 CES compliant markup. Each file also includes a full header, which specifies the provenance of the text. In the spoken corpus, information about the speakers is also stored in the header.
There are three parts to the EMILLE Corpora. They are:
The EMILLE Spoken Corpus
The EMILLE-CIIL Monolingual Written Corpora
The EMILLE Parallel Corpus
However, for ease of use, the Spoken and Written Corpora are not organised separately in the directory structure. Instead, the spoken and written corpora for each language are grouped in a single directory, e.g. “hindi” for the Hindi spoken and written corpora, under the directory “monolingual”.
The parallel corpus is held separately, in the directory “parallel”.
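The layout described above can be walked with a short Python sketch. It assumes that every text file ends in ".txt" and that the second hyphen-separated element of the filename starts the text type ("w" or "s"); the function name monolingual_files and the corpus root path are hypothetical, and only the directory names "monolingual" and "hindi" come from this document.

```python
from pathlib import Path

# First letter of the text-type code mapped to a readable label.
KINDS = {"w": "written", "s": "spoken"}


def monolingual_files(corpus_root, language_dir):
    """Yield (kind, path) for every .txt file of one language.

    Spoken and written texts share one directory per language, so the
    kind is recovered from the filename rather than from the path.
    """
    base = Path(corpus_root) / "monolingual" / language_dir
    for path in sorted(base.rglob("*.txt")):
        parts = path.stem.split("-")
        code = parts[1] if len(parts) > 1 else ""
        yield KINDS.get(code, "unknown"), path
```

For example, `list(monolingual_files("/data/emille", "hindi"))` would list the Hindi spoken and written files together, each labelled with its kind.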
Many of the directories contain subdirectories, in which texts are classified by a) their provenance or b) their genre. This structure is also implicit in the filename of each text within the corpora. While the directory structure and the filenames usually imply the same structure, there are some ad hoc exceptions to this, mentioned below where relevant.
A further section of the corpus contains annotated data, as discussed below.
All files collected for the EMILLE Corpora have a filename in a standard format. Texts which have been incorporated from the CIIL Corpus have been given a filename in this standard format to render the two corpora compatible.
The filename consists of a series of codes chained together with hyphen characters. These codes specify the main language of the file, the source of the text, its subcategory in terms of subject matter if such information is available, and an identifying number. The name is generally of the format:
[Language]-[text type]-[datasource]-[subcategory]-[identifying number].txt
In the case of sources from which text was gathered on a periodical basis (i.e. the news websites in the written corpus, the radio programmes for the spoken corpus) the identifying number is a date. For other files it is simply an arbitrary distinguishing number.
For example:
hin-w-ranchi-news-01-03-22.txt
ben-s-cg-asiannet-02-07-23.txt
(See below for details on how language, text type and dates are signified.)
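The naming scheme can be unpicked mechanically, with two wrinkles: the spoken text-type codes (s-dem, s-cg) and the date identifiers (yy-mm-dd) themselves contain hyphens, so a plain split on "-" is not enough. The Python sketch below is one possible parser, not an official tool; the names CorpusName and parse_filename are hypothetical, and it assumes that non-date identifiers contain no hyphens.

```python
import re
from typing import NamedTuple, Tuple

# Longest codes first, so "s-dem" and "s-cg" are tried before "w".
TEXT_TYPES = ("s-dem", "s-cg", "w")
DATE_AT_END = re.compile(r"(?:^|-)(\d{2}-\d{2}-\d{2})$")


class CorpusName(NamedTuple):
    language: str
    text_type: str
    middle: Tuple[str, ...]  # source/genre and subcategory codes, if any
    identifier: str          # a yy-mm-dd date or another identifying code


def parse_filename(filename):
    """Split a corpus filename into its hyphen-chained components."""
    stem = filename[:-4] if filename.endswith(".txt") else filename
    language, _, rest = stem.partition("-")
    for text_type in TEXT_TYPES:
        if rest == text_type or rest.startswith(text_type + "-"):
            rest = rest[len(text_type):].lstrip("-")
            break
    else:
        raise ValueError(f"no known text type in {filename!r}")
    match = DATE_AT_END.search(rest)
    if match:
        # The identifier is a date, which spans three hyphenated fields.
        identifier = match.group(1)
        rest = rest[:match.start()]
    else:
        # Assumption: any other identifier is the last hyphen-free field.
        rest, _, identifier = rest.rpartition("-")
    middle = tuple(rest.split("-")) if rest else ()
    return CorpusName(language, text_type, middle, identifier)
```

Applied to the first example above, `parse_filename("hin-w-ranchi-news-01-03-22.txt")` separates the language "hin", the text type "w", the codes "ranchi" and "news", and the date "01-03-22".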
Files incorporated from the CIIL Corpus have a slightly different format. Since the CIIL Corpus data draws on a much wider variety of sources of data than the EMILLE Corpus, texts are sorted by their genre and subject matter, and uniquely identified by their original code in the CIIL Corpus, thus:
[Language]-[text type]-[genre]-[subcategory]-[CIIL Corpus code].txt
Some exceptions to this scheme are detailed in the discussion of the different parts of the corpora below. The major one is the Sinhala written corpus, which unlike the other languages is organised primarily by the category into which the text falls, and secondarily by the source it was gathered from.
The codes used for languages in the filenames and also within the mark-up of some of the corpus files are drawn from ISO-639. See the table below.
Language | Code
Hindi | hin
Bengali | ben
Punjabi | pun
Gujarati | guj
Urdu | urd
Tamil | tam
Sinhala | sin
Marathi | mar
Oriya | ori
Assamese | asm
Kashmiri | kas
Malayalam | mal
Kannada | kan
Telugu | tel
English | eng
Text type | Code
Written | w
Spoken (demographically sampled) | s-dem
Spoken (context governed) | s-cg
All dates in filenames and anywhere in the header or markup of files in the corpora are given in the format yy-mm-dd.
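A yy-mm-dd string can be converted to a calendar date with Python's standard library, as sketched below. The function name parse_corpus_date is hypothetical. Note that a two-digit year is inherently ambiguous: Python's %y directive maps 69-99 to 1969-1999 and 00-68 to 2000-2068, which is assumed here to suit the corpus dates.

```python
from datetime import date, datetime


def parse_corpus_date(text):
    """Convert a yy-mm-dd string, e.g. "01-03-22", to a datetime.date."""
    return datetime.strptime(text, "%y-%m-%d").date()
```

With this convention, "01-03-22" is read as 22 March 2001.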
Some annotated data in Hindi and Urdu is supplied alongside the corpus.
The data in Hindi consists of excerpts from the “Ranchi Express” data of the Hindi corpus (see below) annotated with anaphora analysis by Srija Sinha.
The data in Urdu consists of a copy of the Urdu written, spoken and parallel corpora annotated with morphosyntactic tags by the Urdu tagger created by Andrew Hardie.
The parallel corpus consists of 200,000 words of text in English and accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu.
The EMILLE Project would like to express our thanks to the UK Government and the various local authorities who generously gave us permission to incorporate their information leaflets into the parallel corpus.
For some texts, a parallel version was not available in one or more of the languages (either because the text was not translated in the first place, or because we were unable to locate a copy of the leaflet in the language in question). In this case, we have had translations made of the missing texts, either by employees at Lancaster or by an outside agency on our behalf. Where a translation is not “official” and has been produced by the EMILLE Project, this is indicated in the file header.
If two files are parallel to one another, their filenames are identical except for the language code at the start. So, ben-w-housing-value is parallel to guj-w-housing-value.
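Because parallel files differ only in the leading language code, the counterparts of any one file can be derived by swapping that code. The Python sketch below illustrates this; the function names are hypothetical, and the assumption that the English originals use the code "eng" in the same position is taken from the language table rather than stated for the parallel corpus specifically.

```python
# Languages of the parallel corpus, using the codes from the language table.
PARALLEL_LANGUAGES = ("eng", "hin", "ben", "pun", "guj", "urd")


def parallel_names(filename):
    """Map each parallel-corpus language code to the matching filename."""
    _, sep, rest = filename.partition("-")
    if not sep:
        raise ValueError(f"no language code in {filename!r}")
    return {lang: f"{lang}-{rest}" for lang in PARALLEL_LANGUAGES}


def are_parallel(name_a, name_b):
    """True if two distinct filenames differ only in their language code."""
    return name_a != name_b and name_a.partition("-")[2] == name_b.partition("-")[2]
```

A derived name is only a candidate: a few texts are missing in some languages, so the file should be checked for existence before use.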
Filenames in the parallel corpus are more consistent than in the written corpus: each parallel set of files is assigned to a category (in the examples given above, “housing”) and is given a unique identifier (like “value” above). This identifier is a word drawn from the title of the leaflet or summarising its contents.
The categories are:
consumer | Consumer issues
Note that the categories are not evenly represented in the parallel corpus.
There follows a list of the unique identifiers, together with the titles and publishers of the leaflets they represent (this information is also contained in the header of each relevant file).
Filename | Document Title | Publisher
attend | School attendance, information for parents | Department for Education and Employment
babies | Babies and Children BC1 | Department of Social Security
bloodsample | If a blood sample is being taken... | Department of Health
breast | Be Breast Aware | Department of Health
buyers | A Buyer's / Shopper's Guide | Office of Fair Trading
cancer | Womens Nationwide Cancer Control Campaign | Womens Nationwide Cancer Control Campaign/Department of Health
catering | Assured Safe Catering | Department of Health
childcare | Childcare career | Department for Education and Skills
compensation | A better deal for tenants: Your new right to compensation for improvements | Department of the Environment, Transport and the Regions
consent | About the consent form | Department of Health
cot | Reducing The Risk Of Cot Death | Department of Health
county | Here to help you | Lancashire County Council
crime | Victims of crime | The Home Office
discharge | Discharge From Hospital | Manchester City Council
donor | Life Don't Keep It To Yourself | Department of Health
drugs | Drugs A Parent's Guide | Department of Health
drugschild | Drugs and Solvents: You and Your Child | Department of Health
drugsprobs | Services for people with drugs problems | Manchester City Council
exclusion | Preventing social exclusion (summary) | The Cabinet Office
eye | Eye Sight Tests | Department of Health
financial | Financial help if you work or are looking for work | Department of Social Security
foodlaw | Food Law Inspections | Ministry of Agriculture Fisheries and Food/Department of Health
haccp | Practical Food Safety for Businesses | Department of Health, Ministry of Agriculture, Fisheries and Food, and the Central Office of Information
headlice | The prevention and treatment of headlice | Department of Health
hepatitis | Hepatitis B | Department of Health
hiv | Services for people with HIV/AIDS | Manchester City Council
homeschool | Home-School agreements, What every parent should know (ISBN 0855229098) | Department for Education and Skills
landlord | My landlord wants me out | Department for Transport, Local Government and the Regions
law | Health and Safety Law | Health and Safety Executive
learning | Services for people with learning disabilities | Manchester City Council
littleread | A little reading goes a long way | Department for Education and Skills
liverpool | Unhappy with social services? | Liverpool Social Services
looking | How to get help in looking after someone | Department of Health
manage | A better deal for tenants: Your new right to manage | Department of the Environment, Transport and the Regions
manchester | How To Get Help From Social Services | Manchester City Council
markets | Modern markets: confident consumers | Department of Trade and Industry
maternity | Maternity services | Department of Health
meningitis | Knowing about Meningitis and Septicaemia | Department of Health
mmr | MMR The Facts | Department of Health
nation | The Health of the Nation and You | Department of Health
nhs | Help with NHS costs | Department of Health
noise | Bothered by noise | Department for Environment, Food and Rural Affairs
older | Health and Healthy Living: A guide for older people | Department of Health
ombudsman | The Health Service Ombudsman for England | Office of the Health Service Ombudsman
patients | The Patient's Charter and You | Department of Health
permit | Work Permits (UK) General Information | The Home Office
pregnant | While You Are Pregnant | Department of Health
race | New Laws... Race equality | The Home Office
readwrite | Learning to read and write at home and at school | Department for Education and Skills
rent | Do you rent, or are you thinking of renting, from a private landlord? | Department for Transport, Local Government and the Regions
repair | Your new right to repair | Department of the Environment
residential | Choosing Residential and Nursing Home Care | Manchester City Council
retire | Retirement RM1 | Department of Social Security
rights | Your rights as a council tenant | Department of the Environment, Transport and the Regions
road | Teaching children road safety | Department of the Environment, Transport and the Regions
runaways | Consultation on Young Runaways (summary) | The Cabinet Office
scottish | Report of the Inquiry into the Liaison Arrangements between the Police, the Procurator Fiscal Service and the Crown Office and the Family of the Deceased Surjit Singh Chhokar in Connection with the Murder of Surjit Singh Chhokar and the Related Prosecutions (Dr Raj Jandoo). | Clerk of the Scottish Parliament/Scottish Parliament
senguide | Special Educational Needs (SEN) A guide for parents and carers | Department for Education and Skills
sentribunal | SEN tribunal: How to appeal | Department for Education and Skills
service | Work permits: Service and Standards | The Home Office
sick | Sick or disabled SD1 | Department of Social Security
solvents | Solvents A parent's guide | Department of Health
supporting | Supporting People and Sheltered Housing | Department for Transport, Local Government and the Regions
teeth | Healthy teeth for life | Health Education Board for Scotland/National Health Service
tenant | Tenant participation compacts: a guide for tenants | Department of the Environment, Transport and the Regions
training | Work permits for training schemes and work experience | The Home Office
transport | Making the Connections: Transport and Social Exclusion (summary) | The Cabinet Office
tuberculosis | TB - are you aware? | Department of Health
value | Best Value in Housing | Department of the Environment, Transport and the Regions
vitamin | Vitamin K | Department of Health
wage | The National Minimum Wage Report Summary | Low Pay Commission
warm | Keep Warm Keep Well | Department of Health
Three texts that we were not able to get transcribed are missing from the parallel corpus. They are: liverpool (Bengali), manchester (Hindi), and pregnant (Bengali).
The parallel corpus contains full sentence markup using the <s> element, which is not the case for the majority of files within the written corpus.
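The sentence markup makes it straightforward to pull out individual sentences. The Python sketch below uses regular expressions rather than a full SGML parser, so it rests on assumptions not confirmed by this document: that every <s> element has an explicit closing </s> tag, that <s> elements are not nested, and that entity references can be left undecoded. The function name sentences is hypothetical.

```python
import re

# Matches one <s ...>...</s> element; \b stops it matching e.g. <sp>.
S_ELEMENT = re.compile(r"<s\b[^>]*>(.*?)</s>", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def sentences(sgml_text):
    """Return the text of each <s> element, with inner tags stripped
    and runs of whitespace collapsed to single spaces."""
    return [
        " ".join(ANY_TAG.sub("", body).split())
        for body in S_ELEMENT.findall(sgml_text)
    ]
```

Aligning the sentence lists of two parallel files by position is a tempting next step, but it would only be a rough heuristic, since translations need not preserve sentence boundaries.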
As mentioned above, for ease of use the different parts of the Spoken Corpus are grouped with the same-language written texts.
Most of the data in these corpora is context-governed speech (transcripts of radio programmes from the BBC Asian Network). The Bengali and Hindi corpora also contain small amounts of demographically-sampled speech. This is indicated in the filename, as specified above (e.g. ben-s-dem-302.txt as opposed to ben-s-cg-asiannet-02-11-26.txt).
The context-governed files are further subdivided according to the radio programme from which they were derived. This is the third element of the filename. For Gujarati and Punjabi, there is no subdivision as all texts were derived from the BBC Asian Network Gujarati and Punjabi programmes.
The date of the original broadcast completes the filename. The names of the programmes as abbreviated in the filenames are listed and expanded upon below:
asiannet: The BBC Asian Network Language Programmes (all languages: broadcast nightly between 7.30 and 10 p.m.)
afternoon: The Afternoon Show with Navinder Bhogal (Hindi and Urdu: weekday afternoons 2-4 p.m.)
shamoly: The BBC Radio Lancashire Shamoly programme (Bengali: Sunday evenings)
jaltrang: The BBC Radio Lancashire Jaltrang programme (Urdu: Sunday evenings)
The size of the Spoken Corpus is as follows:
Language | Word count
Bengali | 442,000
Hindi | 588,000
Urdu | 512,000
Gujarati | 564,000
Punjabi | 521,000
Total | 2,628,000
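As a quick arithmetic check, the Python sketch below sums the per-language figures from the table. The rounded figures add up to 2,627,000, which is 1,000 short of the stated total; a plausible explanation is that the total was computed from unrounded counts, though this document does not say so.

```python
# Word counts copied from the Spoken Corpus table above.
SPOKEN_WORD_COUNTS = {
    "Bengali": 442_000,
    "Hindi": 588_000,
    "Urdu": 512_000,
    "Gujarati": 564_000,
    "Punjabi": 521_000,
}
STATED_TOTAL = 2_628_000

computed_total = sum(SPOKEN_WORD_COUNTS.values())
difference = STATED_TOTAL - computed_total

print(f"computed: {computed_total:,}")  # computed: 2,627,000
print(f"stated:   {STATED_TOTAL:,}")    # stated:   2,628,000
print(f"gap:      {difference:,}")      # gap:      1,000
```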
Filenames in the written corpus consist of codes for the language and text type, as specified above, followed by a word to specify the source of the data, possibly followed by a subcategory. The date of publication of the original text (in the case of news data) or a code number (for other files) is included at the end of the filename.
The written corpus incorporates the CIIL Corpus, created by the Central Institute of Indian Languages, Mysore in collaboration with the Indian Institute of Technology, Delhi, the Institute of Applied Language Sciences, Bhubaneshwar, and Aligarh Muslim University, Aligarh.
This corpus, originally encoded as ISCII text, has been re-encoded as Unicode with CES-compliant SGML markup, as per the data collected by the EMILLE Project, to allow simultaneous use of both datasets.
The filenames of texts from the CIIL Corpus differ slightly from the overall scheme. Because the CIIL Corpus was drawn from a much wider set of genres, and from many more sources than the EMILLE Project’s data, it was not considered suitable to classify them by source. Instead, they are classified by genre (category and subcategory). The filename is concluded by the code identifying that file in the original CIIL Corpus. These files are grouped in a directory entitled “miscellaneous” within the corpus structure (because the data derives from a wide miscellany of sources).
The EMILLE-CIIL Monolingual Written Corpora have a total size of approximately 93,530,000 words. The make-up of each of the fourteen language corpora is discussed below.
The EMILLE Project would like to express our thanks to the text providers who generously allowed us to gather data from their websites:
The Hindi corpus also contains data incorporated from the CIIL Corpus, originally gathered by the Indian Institute of Technology.
The contents of the Hindi written corpus are as follows (broken down by directory):
The Hindi written corpus contains a total of approximately 12,390,000 words.
The EMILLE Project would like to express our thanks to the text providers who generously allowed us to gather data from their websites:
The Bengali corpus also contains data incorporated from the CIIL Corpus, originally gathered by the Institute of Applied Language Sciences, Bhubaneshwar.
The contents of the Bengali written corpus are as follows (broken down by directory):
The Bengali written corpus contains a total of approximately 5,520,000 words.
The EMILLE Project would like to express our thanks to the text providers who generously allowed us to gather data from their websites:
We would also like to express our thanks to the staff of the Panchim periodical, who supplied us with copies of their text, and to the “Gurbani CD” project, who generously allowed us to use their CD as the source for the text of the Shree Guru Granth Sahib.
The Punjabi corpus also contains data incorporated from the CIIL Corpus, originally gathered by the Indian Institute of Technology.
Note that, unlike the other languages, the Punjabi written corpus contains data in more than one writing system. The data from the Panchim periodical is written in the Indo-Perso-Arabic (“Shahmukhi”) alphabet, whereas the remainder of the data is written in the Gurmukhi alphabet. The two different types of written data are held in separate directories.
The contents of the Punjabi written corpus are as follows (broken down by directory):
** Note that due to difficulties in the encoding of the text, it was not possible to obtain an accurate word-count for the data drawn from the Sanjh Savera website. The figure above therefore represents an educated estimate based on a sample of the data.
The Punjabi written corpus contains a total of approximately 15,600,000 words.
The EMILLE Project would like to express our thanks to the text providers who generously allowed us to gather data from their websites:
The Gujarati corpus also contains data incorporated from the CIIL Corpus, originally gathered by the Central Institute of Indian Languages.
The contents of the Gujarati written corpus are as follows (broken down by directory):
(** This is a very imprecise word-count.)
The Gujarati written corpus contains a total of approximately 12,150,000 words.
The Urdu written corpus consists of data incorporated from the CIIL Corpus, originally gathered by Aligarh Muslim University (approximately 1,640,000 words).
The EMILLE Project would like to express our thanks to the text providers who generously allowed us to gather data from their websites:
The Tamil corpus also contains data incorporated from the CIIL Corpus, originally gathered by the Central Institute of Indian Languages.
The contents of the Tamil written corpus are as follows (broken down by directory):
The Tamil written corpus contains a total of approximately 19,980,000 words.
The Sinhala corpus consists of data collected from the press and media of Sri Lanka by Vincent Halahakone. The EMILLE Project would like to express our thanks to the text providers who generously allowed their data to be incorporated into the corpus and to all the other parties and individuals who helped with the Sinhala text collection, including:
The Sinhala corpus is structured slightly differently to the other written corpora (and this is reflected in the filenames and directory structure). The primary classification is by text type (see below). The source is the secondary classification. Subsequent elements in the filename may indicate a further subcategory or an element of the title. As before, filenames end either with a code number or a date.
The contents of the Sinhala written corpus are as follows (broken down by directory):
The Sinhala written corpus contains a total of approximately 6,860,000 words.
The Marathi written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Central Institute of Indian Languages (approximately 2,210,000 words).
The Oriya written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Institute of Applied Language Sciences, Bhubaneshwar (approximately 2,730,000 words).
The Assamese written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Institute of Applied Language Sciences, Bhubaneshwar (approximately 2,620,000 words).
The Kashmiri written corpus consists of data incorporated from the CIIL Corpus, originally gathered by Aligarh Muslim University (approximately 2,270,000 words).
The Malayalam written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Central Institute of Indian Languages (approximately 2,350,000 words).
The Kannada written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Central Institute of Indian Languages (approximately 2,240,000 words).
The Telugu written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Central Institute of Indian Languages (approximately 3,970,000 words).
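The fourteen approximate word counts quoted in this document for the written corpora can be totalled with a short Python sketch. The quoted figures sum to 92,530,000, which is 1,000,000 below the overall figure of approximately 93,530,000 given earlier; the reason for the gap cannot be determined from this document, and all the figures are themselves approximations.

```python
# Approximate written-corpus word counts as quoted in this document.
WRITTEN_WORD_COUNTS = {
    "Hindi": 12_390_000,
    "Bengali": 5_520_000,
    "Punjabi": 15_600_000,
    "Gujarati": 12_150_000,
    "Urdu": 1_640_000,
    "Tamil": 19_980_000,
    "Sinhala": 6_860_000,
    "Marathi": 2_210_000,
    "Oriya": 2_730_000,
    "Assamese": 2_620_000,
    "Kashmiri": 2_270_000,
    "Malayalam": 2_350_000,
    "Kannada": 2_240_000,
    "Telugu": 3_970_000,
}
STATED_WRITTEN_TOTAL = 93_530_000

written_total = sum(WRITTEN_WORD_COUNTS.values())
written_gap = STATED_WRITTEN_TOTAL - written_total

print(f"languages: {len(WRITTEN_WORD_COUNTS)}")  # languages: 14
print(f"computed:  {written_total:,}")           # computed:  92,530,000
print(f"gap:       {written_gap:,}")             # gap:       1,000,000
```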