4.1 Basic structure: spoken texts
The spoken material transcribed for the BNC is also organized into
‘texts’, which are subdivided into
‘divisions’, made up of <w>
and <mw>
elements grouped into <s>
elements in the same way as written
texts. However, there are a number of other elements specific to spoken
texts, and their hierarchic organization is naturally not the same as that of
written texts. For this reason, a different element (<stext>)
is used to represent a spoken text.
In demographically sampled spoken texts, each distinct conversation
recorded by a given respondent is treated as a distinct <div>
element. All the conversations from a single respondent are then
grouped together to form a single <stext>
element. Context-governed spoken texts, however, do not use the
<div>
element: the <stext>
element for a context-governed
text is composed only of <u>
elements, not grouped
into any unit smaller than the <stext>
itself.
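The two layouts described above can be handled uniformly when reading a text. The following Python sketch is illustrative only: the fragments are hypothetical and heavily abbreviated (real texts carry a header and many more attributes), and the function name conversations is an assumption introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated fragments; not real corpus data.
DEMOGRAPHIC = """<stext>
  <div><u who="A"><s n="1"><w>Hi </w></s></u></div>
  <div><u who="B"><s n="2"><w>Yes</w></s></u></div>
</stext>"""
CONTEXT_GOVERNED = """<stext>
  <u who="C"><s n="1"><w>Order</w></s></u>
</stext>"""

def conversations(stext):
    """Yield one list of <u> elements per conversation.

    Each <div> is treated as one conversation. If the text has no <div>
    children, the whole <stext> is treated as a single unit.
    """
    divs = stext.findall("div")
    if divs:
        for div in divs:
            yield div.findall("u")
    else:
        yield stext.findall("u")

demo = [[u.get("who") for u in conv]
        for conv in conversations(ET.fromstring(DEMOGRAPHIC))]
cg = [[u.get("who") for u in conv]
      for conv in conversations(ET.fromstring(CONTEXT_GOVERNED))]
print(demo)  # [['A'], ['B']]
print(cg)    # [['C']]
```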
The <s>
elements making up a spoken text are grouped not
into <p>
or other similar elements, but instead into <u>
elements. Each <u>
(utterance) element marks a stretch of
uninterrupted speech from a given speaker (see section 4.2 Utterances). Interspersed within and between <u>
elements, a variety of other
elements indicate paralinguistic phenomena noticed
by the transcribers (see section 4.3 Paralinguistic phenomena).
The methods and principles applied in transcription and
normalisation of speech are defined in a BNC working paper TGCW21
Spoken Corpus Transcription Guide, and have also been
described in subsequent publications (e.g. Crowdy 1994). The editorial tags
discussed in section 2.5 Editorial indications above are also used to
represent normalisation practice when dealing with transcribed speech.
4.2 Utterances
The term
utterance is used in the BNC to refer to a
continuous stretch of speech produced
by one participant in a
conversation, or by a group of participants. Structurally, the
corresponding element behaves in a similar way to the
<p>
element in a written text — it groups a sequence of
<s>
elements together.
- <u> (utterance) a stretch of speech usually preceded and followed by
  silence or by a change of speaker.
  who | indicates the person, or group of people, to whom the element content is ascribed.
The
who attribute is required on every
<u>
: its function is to
identify the person or group of people making the utterance, using the
unique code defined for that person in the appropriate section of the
header. A simple example follows:
 <u who="PS1LW">
  <s n="159">
   <w c5="ITJ" hw="mm" pos="INTERJ">Mm </w>
   <w c5="ITJ" hw="mm" pos="INTERJ">mm</w>
   <c c5="PUN">.</c>
  </s>
 </u>
<!-- F7F -->
The code
PS1LW used here will be specified as
the value for the
xml:id attribute of some
<person>
element within the header of the text from which this example is
taken. Where the speaker cannot be confidently identified, or where
there is more than one speaker, a special
code is used; see further discussion at
5.3.3.1 The person element.
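As a rough illustration of how a who value can be matched against the header, the Python sketch below looks up each utterance's speaker code among the <person> elements and rebuilds the utterance text. The wrapper element <doc>, the bare <header>, and the empty <person> are simplifying assumptions made for this sketch, not the real header layout; utterance_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document (structure simplified for illustration).
DOC = """<doc>
  <header><person xml:id="PS1LW"/></header>
  <stext>
    <u who="PS1LW"><s n="159"><w c5="ITJ" hw="mm" pos="INTERJ">Mm </w><w c5="ITJ" hw="mm" pos="INTERJ">mm</w><c c5="PUN">.</c></s></u>
  </stext>
</doc>"""

root = ET.fromstring(DOC)

# Index the declared participants by their identifier.
people = {p.get(XML_ID): p for p in root.iter("person")}

def utterance_text(u):
    """Concatenate the character data of all <w> and <c> descendants."""
    return "".join(el.text or "" for el in u.iter() if el.tag in ("w", "c"))

# For each utterance: speaker code, whether it is declared, and its text.
resolved = [(u.get("who"), u.get("who") in people, utterance_text(u))
            for u in root.iter("u")]
print(resolved)  # [('PS1LW', True, 'Mm mm.')]
```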
4.3 Paralinguistic phenomena
In transcribing spoken language, it is necessary to select from the
possibly very large set of distinct paralinguistic phenomena which might
be of interest. In the texts transcribed for the BNC, encoders were
instructed to mark the following such phenomena:
- voice quality: for example, whispering, laughing, etc., both as discrete events
  and as changes in voice quality affecting passages within an utterance.
- non-verbal but vocalised sounds: for example, coughs, humming noises etc.
- non-verbal and non-vocal events: for example passing lorries, animal noises, and other matters
  considered worthy of note.
- significant pauses: silence, within or between utterances, longer than was judged
  normal for the speaker or speakers.
- unclear passages: whole utterances or passages within them which were inaudible or
  incomprehensible for a variety of reasons.
- speech management phenomena: for example truncation, false starts, and correction.
- overlap: points at which more than one speaker was active.
Other aspects of spoken texts are not explicitly recorded in
the encoding, although their headers contain considerable amounts of
situational and participant information.
In many cases, because no standardized set of descriptions was predefined,
transcribers gave very widely differing accounts of the same
phenomena. An attempt has however been made to normalize the descriptions for
some of these elements in the BNC XML editions.
The elements used to mark these phenomena are listed below in
alphabetical order:
- <event> any phenomenon or occurrence, not necessarily vocalized or
  communicative, for example incidental noises or other events affecting
  communication.
  desc | provides a brief description of the event.
  dur | (duration) indicates the duration of the element in seconds.
- <pause> a pause either between or within utterances.
  dur | (duration) indicates the duration of the element in seconds.
- <shift> marks the point at which some paralinguistic feature of a series of
  utterances by any one speaker changes.
  new | specifies the new state of the paralinguistic feature specified.
- <trunc> contains one or more truncated words in transcribed speech.
- <unclear> contains a word, phrase, or passage which cannot be transcribed
  with certainty because it is illegible or inaudible in the source.
  dur | (duration) indicates the duration of the element in seconds.
- <vocal> (vocalized semi-lexical) any vocalized but not necessarily lexical phenomenon, for example
  voiced pauses, non-lexical backchannels, etc.
  desc | provides a brief description of the vocal event.
  dur | (duration) indicates the duration of the element in seconds.
  who | indicates the person, or group of people, to whom the element content is ascribed.
The value of the dur attribute is normally specified
only if it is greater than 5 seconds, and its accuracy is only
approximate.
With the exception of the <trunc>
element, which is a
special case of the editorial tags discussed in section 2.5 Editorial indications above, all of these elements are empty, and may appear anywhere within
a transcription.
The following example shows an event, several pauses and a patch of
unclear speech:
 <s n="5490">
  <event desc="radio on"/>
  <pause dur="34"/>
  <w c5="PNP" hw="you" pos="PRON">You </w>
  <w c5="VVN" hw="get" pos="VERB">got</w>
  <w c5="TO0" hw="ta" pos="PREP">ta </w>
  <unclear/>
  <w c5="NN1" hw="radio" pos="SUBST">Radio </w>
  <w c5="CRD" hw="two" pos="ADJ">Two </w>
  <w c5="PRP" hw="with" pos="PREP">with </w>
  <w c5="DT0" hw="that" pos="ADJ">that</w>
  <c c5="PUN">.</c>
 </s>
 <s n="5491">
  <pause dur="6"/>
  <w c5="AJ0" hw="bloody" pos="ADJ">Bloody </w>
  <w c5="NN1" hw="pirate" pos="SUBST">pirate </w>
  <w c5="NN1" hw="station" pos="SUBST">station </w>
  <w c5="VM0" hw="would" pos="VERB">would</w>
  <w c5="XX0" hw="not" pos="ADV">n't </w>
  <w c5="PNP" hw="you" pos="PRON">you</w>
  <c c5="PUN">?</c>
 </s>
<!-- KB2 -->
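Because dur is optional, any processing of pauses has to allow for its absence. The Python sketch below, offered only as an illustration, separates timed from untimed pauses and lists event descriptions. The fragment is a hypothetical abbreviation of the example above: the enclosing <u>, its who value, and the final untimed <pause/> are added for the sketch, and timed_pauses is an assumed helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical abbreviation of the example above (tagging attributes omitted).
FRAGMENT = """<u who="X">
  <s n="5490"><event desc="radio on"/><pause dur="34"/><w>You </w><unclear/><w>that</w><c>.</c></s>
  <s n="5491"><pause dur="6"/><w>Bloody </w><pause/></s>
</u>"""

def timed_pauses(root):
    """Return the durations, in seconds, of <pause> elements that carry dur."""
    return [float(p.get("dur")) for p in root.iter("pause")
            if p.get("dur") is not None]

root = ET.fromstring(FRAGMENT)
durations = timed_pauses(root)
# Pauses without dur are counted but contribute no duration.
untimed = sum(1 for p in root.iter("pause") if p.get("dur") is None)
events = [e.get("desc") for e in root.iter("event")]

print(durations, untimed, events)  # [34.0, 6.0] 1 ['radio on']
```

Since recorded durations are only approximate, totals derived in this way should be treated as indicative.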
Where the whole of an utterance is unclear, that is, where no speech
has actually been transcribed, the
<unclear>
element is used on
its own, with an optional
who attribute to indicate who
is speaking, if this is identifiable. For example:
 <u who="xx">
  <s>....</s>
 </u>
 <unclear who="yy"/>
 <u who="xx">
  <s>... </s>
 </u>
Here YY's remarks, whatever they are, are too unclear to be
transcribed, and so no
<u>
element is provided.
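One consequence is that a list of speaker turns built from <u> elements alone would miss such contributions. A minimal Python sketch, assuming a hypothetical fragment modelled on the schematic example above and an illustrative helper named turns:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the schematic example above.
FRAGMENT = """<stext>
  <u who="xx"><s><w>first </w></s></u>
  <unclear who="yy"/>
  <u who="xx"><s><w>second </w></s></u>
</stext>"""

def turns(stext):
    """List (kind, speaker) for each top-level <u> or <unclear>, in order."""
    return [(child.tag, child.get("who")) for child in stext
            if child.tag in ("u", "unclear")]

turn_list = turns(ET.fromstring(FRAGMENT))
print(turn_list)  # [('u', 'xx'), ('unclear', 'yy'), ('u', 'xx')]
```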
The values used for the
desc attribute of the
<event>
element are not constrained in the current version of
the corpus, and more than a thousand different values occur.
Some common examples follow:
 <event desc="laughter"/>
 <event desc="telephone noise"/>
A list of the most frequent values is given in
9.4 Event descriptions.
As noted above, a distinction is made between discrete vocal events,
such as laughter, and changes in voice quality, such as words which are
spoken in a laughing tone. The former are encoded using the
<vocal>
element, as in the following example:
 <u who="PS09T">
  <s n="4307">
   <vocal desc="laugh"/>
   <c c5="PUN">, </c>
   <w c5="PNP" hw="you" pos="PRON">you</w>
   <w c5="VM0" hw="will" pos="VERB">'ll </w>
   <w c5="VHI" hw="have" pos="VERB">have </w>
   <w c5="TO0" hw="to" pos="PREP">to </w>
   <w c5="VVI" hw="take" pos="VERB">take </w>
   <w c5="DT0-CJT" hw="that" pos="ADJ">that </w>
   <w c5="AVP-PRP" hw="off" pos="ADV">off </w>
   <w c5="AV0" hw="there" pos="ADV">there </w>
   <vocal desc="laugh"/>
   <w c5="ITJ" hw="yeah" pos="INTERJ">yeah </w>
   <w c5="PNP" hw="you" pos="PRON">you </w>
   <w c5="VM0" hw="can" pos="VERB">can </w>
   <pause/>
   <vocal desc="laugh"/>
   <pause/>
  </s>
 </u>
<!-- KC2 -->
The
<shift>
element is used instead where the laughter
indicates a change in voice quality, as in the following example:
 <u who="PS01V">
  <s n="4188">
   <w c5="CJC" hw="and" pos="CONJ">And </w>
   <w c5="UNC" hw="erm" pos="UNC">erm </w>
   <pause/>
   <w c5="CJC" hw="and" pos="CONJ">and </w>
   <w c5="AV0" hw="then" pos="ADV">then </w>
   <w c5="PNP" hw="we" pos="PRON">we </w>
   <w c5="VVD" hw="go" pos="VERB">went </w>
   <w c5="CJC" hw="and" pos="CONJ">and </w>
   <w c5="VVD" hw="get" pos="VERB">got </w>
   <w c5="DPS" hw="i" pos="PRON">my </w>
   <w c5="NN0" hw="fruit" pos="SUBST">fruit </w>
   <w c5="CJC" hw="and" pos="CONJ">and </w>
   <w c5="NN1" hw="veg" pos="SUBST">veg </w>
   <w c5="CJC" hw="and" pos="CONJ">and </w>
   <w c5="AV0" hw="then" pos="ADV">then </w>
   <w c5="PNP" hw="we" pos="PRON">we </w>
   <w c5="VVD" hw="go" pos="VERB">went </w>
   <w c5="PRP" hw="in" pos="PREP">in </w>
   <w c5="AJ0-NN1" hw="top" pos="ADJ">Top </w>
   <w c5="NP0" hw="marks" pos="SUBST">Marks </w>
   <w c5="CJC" hw="and" pos="CONJ">and </w>
   <w c5="VVD" hw="get" pos="VERB">got </w>
   <w c5="PNP" hw="they" pos="PRON">them </w>
   <shift new="laughing"/>
   <w c5="AV0" hw="so" pos="ADV">so </w>
   <w c5="PNP" hw="we" pos="PRON">we </w>
   <w c5="AV0" hw="never" pos="ADV">never </w>
   <w c5="VVD" hw="get" pos="VERB">got </w>
   <shift/>
   <w c5="PNP" hw="we" pos="PRON">we </w>
   <w c5="VVD" hw="go" pos="VERB">went </w>
   <w c5="AVP" hw="through" pos="ADV">through </w>
   <w c5="PRP" hw="for" pos="PREP">for </w>
   <w c5="AT0" hw="a" pos="ART">a </w>
   <w c5="NN1" hw="video" pos="SUBST">video </w>
   <w c5="AV0" hw="really" pos="ADV">really</w>
   <c c5="PUN">, </c>
   <w c5="AV0" hw="never" pos="ADV">never </w>
   <w c5="VVN-VVD" hw="get" pos="VERB">got </w>
   <w c5="AVP" hw="round" pos="ADV">round </w>
   <w c5="PRP" hw="to" pos="PREP">to </w>
   <w c5="VVG" hw="look" pos="VERB">looking </w>
   <w c5="PRP" hw="for" pos="PREP">for </w>
   <w c5="AT0" hw="a" pos="ART">a </w>
   <w c5="NN1" hw="video" pos="SUBST">video </w>
   <w c5="VDD" hw="do" pos="VERB">did </w>
   <w c5="PNP" hw="we" pos="PRON">we</w>
   <c c5="PUN">?</c>
  </s>
 </u>
<!-- KB2 -->
When no
new attribute is supplied on a
<shift>
element, the meaning is that the voice quality
indicated reverts to a normal or unmarked state. Hence, in this example,
the passage between the tags <shift new="laughing"/>
and <shift/>
is spoken with a laughing intonation.
A list of values currently used for the new
attribute is given below in section 9.2 Voice quality codes.
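Since <shift> is an empty milestone rather than a container, the voice quality of any given word has to be derived by tracking state while reading the utterance in document order. The Python sketch below illustrates one way of doing so on a hypothetical abbreviation of the example above. It assumes, for simplicity, that the state starts unmarked at the beginning of each utterance; words_with_voice_quality is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical abbreviation of the example above (tagging attributes omitted).
FRAGMENT = """<u who="PS01V"><s n="4188">
  <w>got </w><w>them </w>
  <shift new="laughing"/>
  <w>so </w><w>we </w><w>never </w><w>got </w>
  <shift/>
  <w>we </w><w>went </w>
</s></u>"""

def words_with_voice_quality(u):
    """Pair each word with the voice quality in force when it was spoken."""
    quality = None  # assumption: unmarked at the start of the utterance
    out = []
    for el in u.iter():  # iter() visits elements in document order
        if el.tag == "shift":
            # A missing new attribute yields None: back to the unmarked state.
            quality = el.get("new")
        elif el.tag == "w":
            out.append(((el.text or "").strip(), quality))
    return out

tagged = words_with_voice_quality(ET.fromstring(FRAGMENT))
laughing = [word for word, quality in tagged if quality == "laughing"]
print(laughing)  # ['so', 'we', 'never', 'got']
```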
4.4 Alignment of overlapping speech
By default it is assumed that the events represented in a
transcription are non-overlapping and that they are transcribed in
temporal sequence. That is, unless otherwise specified, it is implied
that the end of one utterance precedes the start of the one following
it in the text, perhaps with an interposed
<pause>
element.
Where this is not the case, the following element is used:
- <align> marks a temporal alignment point within transcribed speech.
  with | supplies an arbitrary identifier; all elements specifying the same value for this attribute are
  understood to be aligned with each other in time.
The with attribute of an <align>
element
may be thought of as identifying some point in time. Where two or
more <align>
elements specify the same value for this
attribute, their locations are assumed to be synchronised.
The following example demonstrates how this mechanism is used to
indicate that one speaker's attempt to take the floor has been
unsuccessful:
 <u who="PS6U5">
  <s n="485">
   <w c5="AJ0" hw="poor" pos="ADJ">Poor </w>
   <w c5="AJ0" hw="old" pos="ADJ">old </w>
   <w c5="NP0" hw="luxembourg" pos="SUBST">Luxembourg</w>
   <w c5="VHZ" hw="have" pos="VERB">'s </w>
   <w c5="VVN-AJ0" hw="beat" pos="VERB">beaten</w>
   <c c5="PUN">.</c>
  </s>
  <s n="486">
   <w c5="PNP" hw="you" pos="PRON">You </w>
   <w c5="PNP" hw="you" pos="PRON">you</w>
   <w c5="VHB" hw="have" pos="VERB">'ve </w>
   <w c5="PNP" hw="you" pos="PRON">you</w>
   <w c5="VHB" hw="have" pos="VERB">'ve </w>
   <w c5="AV0" hw="absolutely" pos="ADV">absolutely </w>
   <w c5="AV0" hw="just" pos="ADV">just </w>
   <w c5="VVN" hw="go" pos="VERB">gone </w>
   <w c5="AV0-AJ0" hw="straight" pos="ADV">straight </w>
   <align with="KNYLC01D"/>
   <w c5="PRP" hw="over" pos="PREP">over </w>
   <w c5="PNP" hw="it" pos="PRON">it </w>
  </s>
 </u>
 <u who="PS4YX">
  <s n="487">
   <align with="KNYLC01D"/>
   <w c5="PNP" hw="i" pos="PRON">I </w>
   <w c5="VHB" hw="have" pos="VERB">have</w>
   <w c5="XX0" hw="not" pos="ADV">n't</w>
   <c c5="PUN">.</c>
  </s>
 </u>
 <u who="PS6U5">
  <s n="488">
   <w c5="CJC" hw="and" pos="CONJ">and </w>
   <w c5="VVN" hw="forget" pos="VERB">forgotten </w>
   <w c5="AT0" hw="the" pos="ART">the </w>
   <w c5="AJ0" hw="poor" pos="ADJ">poor </w>
   <w c5="AJ0" hw="little" pos="ADJ">little </w>
   <w c5="NN1" hw="country" pos="SUBST">country</w>
   <c c5="PUN">.</c>
  </s>
 </u>
<!-- KNY -->
This encoding is the CDIF equivalent of what might be presented in a
conventional playscript as follows:
W0001: Poor old Luxembourg's beaten. You, you've, you've absolutely just
gone straight over it --
W0014: (interrupting) I haven't.
W0001: (at the same time) and forgotten the poor little country.
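Overlapping stretches can be recovered mechanically by grouping <align> elements on their with value. The Python sketch below is one possible approach, shown on a hypothetical abbreviation of the encoded example (tagging attributes omitted, enclosing <stext> added); the function name overlaps and the choice to collect only the text following each alignment point within the same utterance are assumptions of the sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical abbreviation of the encoded example above.
FRAGMENT = """<stext>
  <u who="PS6U5"><s n="486"><w>gone </w><w>straight </w><align with="KNYLC01D"/><w>over </w><w>it </w></s></u>
  <u who="PS4YX"><s n="487"><align with="KNYLC01D"/><w>I </w><w>have</w><w>n't</w><c>.</c></s></u>
  <u who="PS6U5"><s n="488"><w>and </w><w>forgotten </w></s></u>
</stext>"""

def overlaps(stext):
    """Map each alignment identifier to (speaker, text after the point) pairs."""
    points = defaultdict(list)
    for u in stext.iter("u"):
        current = None  # collects text after the most recent <align> in this <u>
        for el in u.iter():
            if el.tag == "align":
                current = []
                points[el.get("with")].append((u.get("who"), current))
            elif el.tag in ("w", "c") and current is not None:
                current.append(el.text or "")
    return {key: [(who, "".join(parts).strip()) for who, parts in pairs]
            for key, pairs in points.items()}

aligned = overlaps(ET.fromstring(FRAGMENT))
print(aligned)
# {'KNYLC01D': [('PS6U5', 'over it'), ('PS4YX', "I haven't.")]}
```

Any identifier shared by two or more entries marks a point at which the listed speakers were active simultaneously.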