The header of a TEI-conformant text provides a structured description of its contents, analogous to the title page and front matter of a book. The component elements of a TEI header are intended to provide in machine-processable form all the information needed to make sensible use of the Corpus.
Every separate text in the British National Corpus (i.e. each <bncDoc> element) has its own header, referred to below as a text header. In addition, the corpus itself has a header, referred to below as the corpus header, containing information which is applicable to the whole corpus. Both corpus and text headers are represented by <teiHeader> elements.
The following sections describe the <teiHeader> element, as used within the BNC.
A TEI header contains a file description (section 5.1 The file description), an encoding description (section 5.2 The encoding description), a profile description (section 5.3 The profile description) and a revision description (section 5.4 The revision description). These are represented by the <fileDesc>, <encodingDesc>, <profileDesc> and <revisionDesc> elements respectively.
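The file description (<fileDesc>) is the first of the four main constituents of the header. It is intended to document an electronic file, i.e. (in the case of a corpus header) the whole corpus, or (in the case of a text header) any characteristics peculiar to an individual file within it. In each case, it contains the following five subdivisions: the title statement (<titleStmt>), the edition statement (<editionStmt>), the extent (<extent>), the publication statement (<publicationStmt>) and the source description (<sourceDesc>).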
Further detail for each of these is given in the following subsections.
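As a rough illustration of the structure outlined above, the following Python sketch checks that a header has its four main components and that its file description has the five subdivisions described in the following subsections. The sample header is an invented skeleton rather than real corpus data, missing_parts is a hypothetical helper name, and <profileDesc> is assumed to be the standard TEI name for the profile description element.

```python
import xml.etree.ElementTree as ET

# Invented, minimal header skeleton for illustration only (not real BNC data).
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt/><editionStmt/><extent/><publicationStmt/><sourceDesc/>
  </fileDesc>
  <encodingDesc/>
  <profileDesc/>
  <revisionDesc/>
</teiHeader>
"""

# Element names assumed for the four header components and five subdivisions.
HEADER_PARTS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]
FILEDESC_PARTS = ["titleStmt", "editionStmt", "extent",
                  "publicationStmt", "sourceDesc"]


def missing_parts(parent, expected):
    """Return the expected child element names that `parent` lacks."""
    present = {child.tag for child in parent}
    return [name for name in expected if name not in present]


header = ET.fromstring(SAMPLE_HEADER)
print(missing_parts(header, HEADER_PARTS))                     # []
print(missing_parts(header.find("fileDesc"), FILEDESC_PARTS))  # []
```

A non-empty result names the components that are absent, which may be a convenient sanity check before relying on any particular part of a header.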
The title statement (<titleStmt>) element of a BNC text contains one or more <title> elements, optionally followed by <author>, <editor>, or <respStmt> elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.
The content of the <title> element includes the title of the source, followed by the phrase "Sample containing about", the approximate word count for the sample, and further information about the text type and domain, all extracted from other parts of the header. This is followed by responsibility statements showing which of the BNC Consortium members was responsible for capturing the text originally.
The <respStmt> element is used to indicate each agency responsible for any significant effort in the creation of the text. Since responsibilities for data encoding and storage, and for enrichment, are the same for all texts, they are not repeated in each text header. The responsibility for original data capture and transcription varies text by text, and is therefore stated in each text header, in the following manner:
Author and editor information for the source from which a text is derived (e.g. the author of a book) is not included in the <fileDesc> element but in the <sourceDesc> element discussed below (5.1.5 The source description).
The <editionStmt> element is used to specify an edition for each file making up the corpus. It takes the same form in both the corpus header and individual text headers:
The <extent> element is used in each text header to specify the size of the text to which it is attached, as in the following example:
Word and sentence counts correspond to the numbers of <w> and <s> elements respectively.
The <publicationStmt> element is used to specify publication and availability information for an electronic text. It contains the following three elements:
The second identifier (of type old) is the old-style mnemonic or numeric code attached to BNC texts during the production of the corpus, which is still used to label the original printed source materials in the BNC Archive. The first, three-character code (of type bnc) is the standard BNC identifier. It is used both as the name of the file in which the text is stored and as the value of the xml:id attribute on the <bncDoc> element containing the whole text, and should always be used to cite the text. The code is a completely arbitrary identifier, and does not indicate anything about the nature of the text.
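The relationship between the standard identifier and the xml:id attribute can be sketched in Python as follows. This is only an illustration: the sample document and its values ("XYZ", "SampleOld") are invented, and the <idno> element name is an assumption, since the element holding the identifiers is not named above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample; "XYZ" and "SampleOld" are placeholders, and the <idno>
# element name is an assumption not stated in the surrounding text.
SAMPLE_DOC = """
<bncDoc xml:id="XYZ">
  <teiHeader>
    <fileDesc>
      <publicationStmt>
        <idno type="bnc">XYZ</idno>
        <idno type="old">SampleOld</idno>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
</bncDoc>
"""


def identifiers(doc):
    """Map each identifier type ('bnc', 'old') to its stripped text."""
    return {idno.get("type"): (idno.text or "").strip()
            for idno in doc.iter("idno")}


doc = ET.fromstring(SAMPLE_DOC)
ids = identifiers(doc)
# The standard identifier is expected to match the xml:id on <bncDoc>.
print(ids["bnc"] == doc.get(XML_ID))
```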
The <sourceDesc> element is used to supply bibliographic details for the original source material from which an electronic text derives. In the case of a BNC text, this might be a book, pamphlet, newspaper etc., or a recording. One of the following elements available within the <sourceDesc> will be used, as appropriate:
These elements are not used within the corpus header, which simply contains a note about the sources from which the corpus was derived, tagged as a <para> (paragraph). The headers of individual texts each contain one of the above elements to specify their source.
Context-governed spoken texts derived from broadcast or similar ‘published’ material may have either a recording statement or a bibliographic record as their source.
All bibliographic data supplied in the individual text headers is collected together and reproduced in section 10 List of Sources below.
The recording statement (<recordingStmt>) element contains one or more <recording> elements, each of which may carry the following attributes:
n | tape number.
date | date of the recording in standardized form.
time | time of day the recording was made.
type | kind of recording.
dur | duration of the recording in seconds.
The value of the n attribute here provides the number of the audio tape holding the original recording, as deposited with the British Library's Sound Archive in London.
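By way of illustration, the attributes listed above might be read as in the following Python sketch, which collects the tape numbers and sums the durations. The sample values are invented, and total_duration is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample values for illustration only (not real BNC data).
SAMPLE_STMT = """
<recordingStmt>
  <recording n="000101" date="1992-01-15" time="10:30" type="sample" dur="90"/>
  <recording n="000102" date="1992-01-16" dur="150"/>
</recordingStmt>
"""


def total_duration(stmt):
    """Sum the dur attribute (seconds) over all recordings that supply it."""
    return sum(int(rec.get("dur")) for rec in stmt.iter("recording")
               if rec.get("dur") is not None)


stmt = ET.fromstring(SAMPLE_STMT)
tapes = [rec.get("n") for rec in stmt.iter("recording")]
print(tapes, total_duration(stmt))
```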
In some cases the <recording> element has no content at all:
In other cases, further details are supplied within the <recording> element, as in the following example:
Each recording corresponds to a <div> (division) element within an <stext>. In that element, the identifier of the source recording is supplied as the value of a decls attribute. Thus, in the spoken text derived from the above-mentioned recordings, there will be a <div> element starting as follows:
The <bibl> element is also used to record bibliographic information for each non-spoken component of the BNC. In this case, its structure is constrained to contain only the following elements in the order specified:
During production of the BNC, the n attribute was used with both <author> and <imprint> elements to supply a six-letter code identifying the author or imprint concerned. The values used should be unique across the corpus, but this is not validated in the current release of the DTD.
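Since the DTD does not validate these codes, a user might wish to check them independently. The Python sketch below assumes that "unique" means each code is attached to only one name, which is an interpretation rather than something stated above; the <sources> wrapper, the names and the codes are all invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented bibliographic records; the <sources> wrapper, names and
# six-letter codes are placeholders, not real BNC data.
SAMPLE_SOURCES = """
<sources>
  <bibl><author n="SMITHJ">Smith, J</author></bibl>
  <bibl><author n="SMITHJ">Smith, J</author></bibl>
  <bibl><author n="JONESA">Jones, A</author></bibl>
  <bibl><author n="JONESA">Jones, B</author></bibl>
</sources>
"""


def conflicting_codes(root, tag="author"):
    """Return codes (n values) attached to more than one distinct name."""
    names_by_code = defaultdict(set)
    for elem in root.iter(tag):
        code = elem.get("n")
        if code:
            names_by_code[code].add((elem.text or "").strip())
    return {code: sorted(names) for code, names in names_by_code.items()
            if len(names) > 1}


print(conflicting_codes(ET.fromstring(SAMPLE_SOURCES)))
# {'JONESA': ['Jones, A', 'Jones, B']}
```

The same function could be applied to imprints by passing tag="imprint", provided the imprint name is held as text directly within that element, which is a further assumption.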
The <imprint> element is supplied for published texts only and contains the following elements in the order given:
Where ‘series’ information is available for a given title, this is not normally tagged distinctly. Instead the series title is given as part of the monographic title, usually preceded by a colon.
This level of bibliographic description has not been carried out with complete consistency across the current release of the corpus.
The second major component of the TEI header is the encoding description (<encodingDesc>). This contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus. It also contains reference information used throughout the corpus.
The <encodingDesc> element has the following six components:
In the BNC, one of each of these elements appears in the corpus header. Only the <tagsDecl> element appears in the individual text headers.
The <projectDesc> element for the corpus gives a brief description of the goals, organization and results of the BNC project. The <samplingDecl>, <editorialDecl> and <refsDecl> elements similarly supply brief prose descriptions of the sampling procedures, the editorial principles and the referencing system applied in the project. This information is also summarized elsewhere in this documentation.
The tagging declaration (<tagsDecl>) element is used slightly differently in corpus and in text headers. In the corpus header, it is used to list every element name actually used within the corpus, together with a brief description of its function. In text headers, it is used to specify the number of elements actually tagged within each text. In either case it consists of a <namespace> element, containing a number of <tagUsage> elements, which carry the following attributes:
gi | the name (generic identifier) of the element indicated by the tag.
occurs | specifies the number of occurrences of this element within the text.
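For a text header, the gi and occurs attributes lend themselves to a simple lookup table. The following Python sketch shows one possible way of building it; the fragment and its counts are invented, and tag_counts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented text-header fragment; the counts are placeholders.
SAMPLE_TAGSDECL = """
<tagsDecl>
  <namespace>
    <tagUsage gi="p" occurs="12"/>
    <tagUsage gi="s" occurs="40"/>
    <tagUsage gi="w" occurs="650"/>
  </namespace>
</tagsDecl>
"""


def tag_counts(tags_decl):
    """Map each element name (gi) to its occurrence count, if supplied."""
    return {usage.get("gi"): int(usage.get("occurs"))
            for usage in tags_decl.iter("tagUsage")
            if usage.get("occurs") is not None}


counts = tag_counts(ET.fromstring(SAMPLE_TAGSDECL))
print(counts["w"])  # 650
```

Entries without an occurs attribute are skipped, so applying the same function to a declaration that supplies no counts simply yields an empty table.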
In the corpus header, each <tagUsage> element contains a brief description of the element specified by its gi attribute; the occurs attribute is not supplied, as in the following extract:
In text headers, the <tagUsage> elements are empty, but the occurs attribute is always supplied, and indicates the number of such elements which appear within the text, as in the following example, taken from a typical written text:
The <refsDecl> element for the corpus header defines the approved format for references to the corpus. It takes the following form:
The <classDecl> element is used in the BNC corpus header to formally define several text classification schemes which are used in the corpus. Each scheme or taxonomy defines a number of code/description pairs, applicable to a text in the corpus. For example, the written domain taxonomy defines twelve subject domains ("Imagination", "Informative: natural science", "Informative: applied science" etc.) and each written text is assigned to one of them. Each taxonomy is defined in the corpus header, using the following elements:
The following is the <taxonomy> element defining the written domain classification system as it appears in the corpus header:
For a complete list of the taxonomies used in the BNC and the number of texts etc. classified according to them, refer to the corpus header and to chapter 1 Design of the corpus.
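The code/description pairs defined by a taxonomy can be read into a lookup table, as in the following Python sketch. The fragment is invented: the identifiers DOM, DOM1, DOM2 and DOM3 are placeholders rather than the codes used in the corpus header, and <catDesc> is assumed to be the element holding each description.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented fragment: the identifiers are placeholders, and <catDesc> is
# assumed to hold each category's description.
SAMPLE_TAXONOMY = """
<taxonomy xml:id="DOM">
  <category xml:id="DOM1"><catDesc>Imagination</catDesc></category>
  <category xml:id="DOM2"><catDesc>Informative: natural science</catDesc></category>
  <category xml:id="DOM3"><catDesc>Informative: applied science</catDesc></category>
</taxonomy>
"""


def code_descriptions(taxonomy):
    """Return a dict mapping each category identifier to its description."""
    return {cat.get(XML_ID): (cat.findtext("catDesc") or "").strip()
            for cat in taxonomy.iter("category")}


codes = code_descriptions(ET.fromstring(SAMPLE_TAXONOMY))
print(codes["DOM2"])  # Informative: natural science
```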
Each text is classified by means of a <catRef> element within the associated text header. Its target attribute lists the identifiers of all <category> elements applicable to that text. For example, the header of a written text assigned to the social science domain which has a corporate author will include a <catRef> element like the following:
A full list of all category codes can be found in a separate document, and the numbers of texts so classified in the current release of the corpus are provided in section 9.6 Text and genre classification codes.
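Resolving the codes listed in a target attribute against a table of categories might look like the Python sketch below. It assumes that the codes are separated by white space and strips a leading "#" in case they are written as pointers; both points are assumptions. The codes DOMX and AUTX are hypothetical stand-ins for the real category identifiers.

```python
import xml.etree.ElementTree as ET

# Hypothetical code table; the real codes and descriptions are defined in
# the corpus header and are not reproduced here.
CATEGORIES = {
    "DOMX": "social science domain",
    "AUTX": "corporate author",
}

SAMPLE_CATREF = '<catRef target="DOMX AUTX"/>'


def resolve(cat_ref, categories):
    """Expand the whitespace-separated target list into (code, description)."""
    # A leading '#' is stripped in case targets are written as pointers.
    codes = [t.lstrip("#") for t in cat_ref.get("target", "").split()]
    return [(code, categories.get(code, "unknown")) for code in codes]


print(resolve(ET.fromstring(SAMPLE_CATREF), CATEGORIES))
```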
Further information about the classification and categorization of individual texts is provided within the <textClass> element discussed below (5.3.5 Text classification).
The Xaira Specification element is used by the XAIRA indexing software to index the BNC. A brief description of its components is provided below; for full information, consult the Xaira documentation available from http://www.xaira.org/
n | in demographic texts, supplies the respondent number used to identify the batch of tapes.
The <creation> element is provided to record the date of publication for texts originally published separately, and any details concerning the origination of any spoken or written texts, whether or not covered elsewhere. It is supplied in every text header, although the details provided vary. As a minimum, a date (tagged with the standard <date> element) will be included; this gives the date on which the content of the text was first created. For a spoken text, this will be the same as the date of the recording; for a written text, it will normally be the date of first publication of the edition, which may not be the same as the date of publication of the copy used.
Note that the BNC contains modernized editions of some classic texts such as Defoe's Robinson Crusoe (FRX); the creation date specified here is that of the creation of the modernized version rather than the 18th c. original.
For imaginative works, the creation date is also the date used to classify the text (by means of the WRITIM category). For other written works, such as textbooks, which are likely to have been extensively revised since their first publication, the date used to classify the text will be that of the edition described in the <sourceDesc>, but the original date will also be recorded within the <creation> element.
The participant description (<particDesc>) element is used to provide information about speakers of texts transcribed for the BNC. It appears only within individual spoken text headers to define the participants specific to those texts.
It contains a series of <person> elements describing the participants whose speech is transcribed in this text.
The <person> element describes a single participant in a language interaction. It carries a number of attributes which are used to provide encoded values for some key aspects of the person concerned:
ageGroup | specifies the age group to which the participant belongs.
dialect | specifies the dialect or accent of a participant's speech, as identified by the respondent.
firstLang | specifies the country of origin of the participant, as identified by the respondent.
n | internal identifier.
educ | specifies the age at which the participant ceased full-time education.
soc | specifies the social class of the participant.
sex | specifies the sex of the participant.
role | describes the relationship or role of this participant with respect to the respondent.
xml:id | provides the unique identifier for this element.
The xml:id attribute is required for each participant whose speech is included in a text, and its value is unique within the corpus. Although a given individual will always have the same identifier within a single text, there is no way of identifying the same individual should they appear in different texts. Since all demographically sampled conversations collected by a single respondent are treated together as a single text, and respondents were recruited from many different social contexts, the probability of the same person being recorded by different respondents is rather low, though not completely impossible.
On many occasions the speaker of a given utterance cannot be identified. A special code is used to indicate an unknown speaker, but, for consistency, this is also made unique to each text. Thus, an "unknown speaker" in one text will have a different identifying code from an "unknown speaker" in another. As far as possible, different speakers are given different identifying codes, even where they cannot be identified with any confidence; thus there may be more than one "unidentified" speaker in the same text.
Where several speakers speak together, if they are identified, then all of the relevant codes are given; if however they are not, then a special "unknown speaker group" code is used.
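Because each participant carries a unique xml:id, the <person> elements of a text can conveniently be indexed by that identifier. The Python sketch below illustrates one way of doing so; the participants, their identifiers (P1 to P3) and most of the attribute values are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented participants; identifiers and attribute values are placeholders.
SAMPLE_PARTICDESC = """
<particDesc>
  <person xml:id="P1" sex="f" role="self"/>
  <person xml:id="P2" sex="m" role="friend"/>
  <person xml:id="P3" role="unspecified"/>
</particDesc>
"""


def participants(partic_desc):
    """Index <person> elements by xml:id, keeping their other attributes."""
    table = {}
    for person in partic_desc.iter("person"):
        attrs = dict(person.attrib)
        table[attrs.pop(XML_ID)] = attrs
    return table


people = participants(ET.fromstring(SAMPLE_PARTICDESC))
print(sorted(people))        # ['P1', 'P2', 'P3']
print(people["P2"]["role"])  # friend
```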
The following is an example of a <person> element:
In each case, the information provided is that given by the respondent and is taken from the log books issued to all participants in the demographic part of the corpus. It has not been normalized.
In the context-governed part of the corpus, however, there is no respondent, and relationship information must be deduced from the other information provided. The role attribute for <person> elements in these texts will usually have the value unspecified.
The <settingDesc> element is used to describe the context within which a spoken text takes place. It appears once in the header of each spoken text, and contains one or more <setting> elements for each distinct recording. The <setting> element carries the following attributes:
who | indicates the person, or group of people, to whom the element content is ascribed.
n | an internal identifier for a setting.
xml:id | provides the unique identifier for this element.
The <setting> element supplies additional details about the place, time of day, and other activities going on, using the following additional elements:
spont | level of spontaneity or informality of the context as assessed by the transcriber.
Text classification information is supplied by the <textClass> element, which appears once in the header of each text. Classifications may be represented using references to internally defined classifications provided in the <classDecl> element (such as the BNC classification scheme described in section 5.2.3 The reference and classification declarations), by reference to some other predefined classification system, or by an open set of keywords. All three methods are used in the BNC, using the following elements:
scheme | identifies the classification system or taxonomy in use.
A <catRef> element is provided in the header of each text. Its target attribute contains values for each of the classification codes defined in the corpus header. In each case, the classification code consists of a code used as the identifier of a <category> element within a <taxonomy> element defined in the corpus header. For example, ALLTIM1 indicates ‘dated 1960-1974’. A list of the values used is given in section 9.6 Text and genre classification codes below.
This taxonomy is that originally defined for selection and description of texts during the design of the corpus, as further discussed elsewhere. It is of course possible to classify the texts in many other ways, and no claim is made that this method is universally applicable or even generally useful, though it does serve to identify broadly distinct sub-parts of the corpus for investigation. The reader is also cautioned that, although an attempt has been made in the current edition of the corpus to correct the more egregious classification errors noted in the first edition, unquestionably many errors and inconsistencies remain. In particular, the categories WRILEV (perceived level of difficulty) and WRISTA (estimated circulation size) were incorrectly differentiated during the preparation of the corpus and cannot be relied on.
A <classCode> element is also provided for every text in the corpus. This contains the code assigned to the text in a genre-based analysis carried out at Lancaster University by David Lee since publication of the first edition of the BNC. Lee's scheme classes the texts more delicately in most cases, since it takes into account their topic or subject matter (see further 9.6 Text and genre classification codes below).
Lee's scheme is also used as the basis of a very simple categorization for each text, which is provided by means of the type attribute on its <text> or <stext> element. This categorization distinguishes six categories for written text (fiction, academic prose, non-academic prose, newspapers, other published, unpublished), and two for spoken text (conversation, other). It may prove a convenient way of distinguishing the major text types represented in the corpus: see further 9.1 XML tag usage by text type.
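A tally of texts by this simple categorization could be produced along the lines of the following Python sketch. It assumes that written texts are held in <text> and spoken texts in <stext>; the <corpus> wrapper is invented, and the type values shown are placeholders which are not necessarily the codes actually used in the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented documents; the <corpus> wrapper and the type values are
# placeholders, not necessarily the codes used in the corpus.
SAMPLE_CORPUS = """
<corpus>
  <bncDoc><text type="fiction"/></bncDoc>
  <bncDoc><text type="newspapers"/></bncDoc>
  <bncDoc><stext type="conversation"/></bncDoc>
  <bncDoc><text type="fiction"/></bncDoc>
</corpus>
"""


def type_counts(root):
    """Count documents by (mode, type), where mode is written or spoken."""
    counts = Counter()
    for doc in root.iter("bncDoc"):
        for tag, mode in (("text", "written"), ("stext", "spoken")):
            body = doc.find(tag)
            if body is not None:
                counts[(mode, body.get("type"))] += 1
    return counts


totals = type_counts(ET.fromstring(SAMPLE_CORPUS))
print(totals[("written", "fiction")])  # 2
```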
In the first release of the BNC, most texts were assigned a set of descriptive keywords, tagged as <term> elements within the <keywords> element. These terms were not taken from any particular descriptive thesaurus or closed vocabulary; the words or phrases used are those which seemed useful to the data preparation agency concerned, and are thus often inconsistent or even misleading. They have been retained unchanged in the present version of the BNC, pending a more thorough revision. In the World (second) Edition this set of keywords was complemented for most written texts by a second set, also tagged using a <keywords> element, but with a value for its source attribute of COPAC, indicating that the terms so tagged are derived from a different source. The source used was a major online library catalogue service (see http://www.copac.ac.uk). Like other public access catalogue systems, COPAC uses a well-defined controlled list of keywords for its subject indexing, details of which are not further given here.
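The two sets of keywords can be told apart by the source attribute, as the following Python sketch suggests. The terms are invented, and the sketch assumes that the original keyword set carries no source attribute; the label "original" used for that set is a hypothetical default, not a value found in the corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample; the terms are placeholders.
SAMPLE_TEXTCLASS = """
<textClass>
  <keywords><term>sample term one</term><term>sample term two</term></keywords>
  <keywords source="COPAC"><term>Sample subject heading</term></keywords>
</textClass>
"""


def terms_by_source(text_class, default="original"):
    """Group <term> text by the source attribute of each <keywords> set."""
    grouped = {}
    for keywords in text_class.iter("keywords"):
        # Sets lacking a source attribute are filed under the default label.
        source = keywords.get("source", default)
        terms = [(term.text or "").strip() for term in keywords.iter("term")]
        grouped.setdefault(source, []).extend(terms)
    return grouped


grouped_terms = terms_by_source(ET.fromstring(SAMPLE_TEXTCLASS))
print(grouped_terms["COPAC"])
```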
The revision description (<revisionDesc>) element is the fourth and final element of a standard TEI header. In the BNC, it consists of a series of <change> elements, which carry the following attributes:
date | supplies the date of the change in standard form, i.e. yyyy-mm-dd.
who | indicates the person, or group of people, to whom the element content is ascribed.
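Since the date attribute is given in yyyy-mm-dd form, the revision history can be sorted chronologically with standard date handling. The following Python sketch shows one way of doing this; the change records, names and notes are invented, and change_history is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Invented change log; dates, names and notes are placeholders.
SAMPLE_REVISIONS = """
<revisionDesc>
  <change date="2006-10-21" who="EditorB">Second sample change</change>
  <change date="2000-12-13" who="EditorA">First sample change</change>
</revisionDesc>
"""


def change_history(revision_desc):
    """Return (date, who, note) tuples sorted from oldest to newest."""
    rows = []
    for change in revision_desc.iter("change"):
        when = date.fromisoformat(change.get("date"))
        rows.append((when, change.get("who"), (change.text or "").strip()))
    return sorted(rows, key=lambda row: row[0])


history = change_history(ET.fromstring(SAMPLE_REVISIONS))
for when, who, note in history:
    print(when.isoformat(), who, note)
```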