The <xairaSpecification> supplied in the corpus header determines the behaviour of the XAIRA indexer, and hence of the XAIRA-indexed system delivered with the BNC. In this section, we document that specification as it applies to the BNC only. The information provided here is for reference purposes only, and is of no interest unless you are using the XAIRA system to index the BNC or a similar corpus. Note, however, that this document is not an exhaustive description of the capabilities of the XAIRA system: for more information on that, please consult the project web site (http://www.xaira.org/).
The <xairaSpecification> element is a member of the model.encodingPart class, and may therefore be included within the <encodingDesc> element of the TEI Header for any corpus. It is organized as a number of <xairaList> elements, each of which contains a number of <xairaItem> elements. Both of these latter elements have a type attribute which specifies more exactly the function of the item or list, by supplying one of a number of predefined codes, as further described in this section.
Attributes of <xairaList>:
type | indicates the function of this part of the specification.
Attributes of the <xairaItem> element:
type | indicates what is defined by this part of the specification.
ns | supplies the namespace within which the generic identifier is to be found.
ident | supplies an element's generic identifier, or one of the codes * (meaning all elements) or name() (meaning that the name of the referenced element is to be used rather than its value).
A XAIRA element specification consists of a <xairaList> of type elementSpec containing one or more <xairaItem> elements, one for each element that the Xaira indexer or client needs to be aware of. Elements which are not mentioned within the Xaira element specification may nevertheless appear within a corpus. When the indexer finds such an element, it will index it using all default options; the client will not have access to any explanatory text or gloss for such elements. Equally, the specification may include definitions for elements which do not appear within the corpus.
The attributes of an element are specified by an <attList> element embedded within the <xairaItem>, consisting of one <attDef> element for each attribute concerned.
The values which an attribute may take can be listed in a <valList> element within the <attDef>. The values A0, A1 etc. supplied by the ident attribute on <valItem> need not be unique across the corpus.
A type attribute may also be specified on the <valList> element to indicate whether the list of values it contains is exhaustive or exemplary; at present, however, Xaira does not use this information.
ident | supplies an element's generic identifier, or one of the codes * (meaning all elements) or name() (meaning that the name of the referenced element is to be used rather than its value).

ident | supplies an element's generic identifier, or one of the codes * (meaning all elements) or name() (meaning that the name of the referenced element is to be used rather than its value).
type | specifies the extensibility of the list of attribute values specified.
copyOf | supplies the identifier of a previously-defined value list to be used at this point.

ident | supplies an element's generic identifier, or one of the codes * (meaning all elements) or name() (meaning that the name of the referenced element is to be used rather than its value).
ns | supplies the namespace within which the generic identifier is to be found.
A Xaira key specification is used to define how the indexer should identify which parts of the input documents are to be regarded as lexical forms and what additional keys should be associated with those forms. Additional keys are used to distinguish otherwise identical forms in the index (for example, the same spelling with two different POS codes); they are also used to build up lemma schemes and regions, on which see below.
The key specification consists of a <xairaList> of type keySpec. If no specification is given, the indexer will assume default implicit tokenization is in force and no additional keys are defined.
If a key specification is supplied, it contains at least one <xairaItem type="form">, optionally followed by one or more <xairaItem type="addKey"> elements, each of which may contain a <desc> element to document its purpose, and should also contain a <valSource> element to specify an element or attribute within the corpus being indexed which is to be used as the source for the values to be used as a key.
In the BNC, the elements <w> and <c> delimit the forms which the indexer must index.
The <valSource> element specifies where the indexer is to find the value which is to be treated as the form part of the index entry. In both cases, it is found as element content, of a <c> or <w> element. Since no further information is given about where such elements are to be found, this will apply to every occurrence of a <w> or <c> element, irrespective of its context. Since no namespace is specified, the element is assumed to be in the current or default namespace.
For <w> and <c>, the BNC specification defines an additional key called c5, the value of which is supplied by the attribute also called c5, but only when that attribute is supplied on an element called <w> or <c>, at any point in the document structure. Other attributes called c5 (such as that on <mw>) will not be used for this purpose.
When an additional key value is required but no value is available, because the attribute or element specified does not exist or has no value, the literal content of the <defaultVal> element (XXX in the BNC specification) will be used instead. In the BNC, this should not happen, and this value should therefore not appear.
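As an illustration only, the following Python sketch mimics how an additional key could be derived from the c5 attribute of <w> and <c> elements, with a default value used when the attribute is missing or empty. It is not XAIRA's own implementation: the function name additional_keys, the sample fragment and its c5 values are all hypothetical, and only the standard xml.etree.ElementTree module is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; one <w> deliberately lacks c5 to show the default.
SAMPLE = (
    '<s n="1"><w c5="AT0">The </w><w c5="NN1">dog</w>'
    '<mw c5="AV0"><w>at </w><w c5="AV0">least</w></mw>'
    '<c c5="PUN">.</c></s>'
)

def additional_keys(xml_text, attr="c5", default="XXX"):
    """Return (form, key) pairs for each <w> or <c> element, in document order.

    The key is taken from `attr` on <w> and <c> only; the same attribute
    on any other element (such as <mw>) is ignored.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for el in root.iter():
        if el.tag in ("w", "c"):
            form = (el.text or "").strip()
            # Fall back to the default when the attribute is absent or empty.
            pairs.append((form, el.get(attr) or default))
    return pairs
```

Calling additional_keys(SAMPLE) pairs each form with its c5 value; the <w> without a c5 attribute receives the default, while the c5 attribute on <mw> plays no part.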
The caseFold attribute is used to specify that forms should be case folded before indexing, so that forms differing only in letter case will be stored identically.
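The effect of case folding on the index can be sketched in Python as follows. This is a hedged illustration, not XAIRA code: the function index_forms is hypothetical, and Python's str.casefold() is assumed as a stand-in for whatever folding rules the indexer actually applies.

```python
from collections import Counter

def index_forms(forms, case_fold=True):
    """Count occurrences of each form, optionally folding case first."""
    counts = Counter()
    for form in forms:
        # With folding, "The", "the" and "THE" share a single index entry.
        key = form.casefold() if case_fold else form
        counts[key] += 1
    return counts
```

With case_fold=True the forms "The", "the" and "THE" are stored under one entry; with case_fold=False each spelling keeps its own entry.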
The effect of this is to define an additional key called region, the value of which on a given form in the index will be one of the strings stext, teiHeader, wtext, or nowhere, depending on the location of the form being indexed. The name() identifier here indicates that it is the name of the associated elements which is to be used as the value of the key, rather than their content. If no <nameList> were provided, then the key generated would contain the name of the nearest ancestor element. This key is used in the subsequent region specification (see 11.4 Region Specification).
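The derivation of the region key can be pictured with the Python sketch below. It assumes that the name list contains teiHeader, wtext and stext (inferred from the key values listed above) and that nowhere is the value used when none of them encloses the form; the function region_keys and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed contents of the name list, inferred from the key values above.
REGION_NAMES = {"teiHeader", "wtext", "stext"}

# Hypothetical miniature document.
SAMPLE = (
    "<bncDoc><teiHeader><w>Title</w></teiHeader>"
    '<wtext><s n="1"><w>Hello</w><c>.</c></s></wtext>'
    "<w>stray</w></bncDoc>"
)

def region_keys(xml_text, names=REGION_NAMES, default="nowhere"):
    """Return (form, region) pairs; region is the nearest listed ancestor's name."""
    root = ET.fromstring(xml_text)
    results = []

    def walk(el, region):
        if el.tag in names:
            region = el.tag  # a listed element becomes the current region
        if el.tag in ("w", "c"):
            results.append(((el.text or "").strip(), region))
        for child in el:
            walk(child, region)

    walk(root, default)
    return results
```

Forms inside <teiHeader> or <wtext> receive the name of that element as their key, and a form outside all listed elements receives the default value.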
Any combination of additional keys may be used to form a lemma scheme. This enables the values of the nominated keys to be treated as alternate forms for the associated index entries. For example, occurrences of words such as "dogs", "dogged", "dogging" etc in the BNC all have the value "dog" for an additional key called "Headword". To distinguish verbal senses from nominal ones, this additional key would need to be combined with another key giving the part of speech (noun or verb) for each occurrence. The resulting lemma scheme would then distinguish forms of "dog (noun)" from forms of "dog (verb)".
Lemma schemes are defined within a <xairaList type="lemmaSpec"> element, containing one <xairaItem type="lemmaScheme"> for each scheme. This element contains an optional <desc>, followed by a <nameList> containing the names of the additional keys used to constitute the scheme. (The name of the additional key is the name supplied by the ident attribute when the key was defined.)
The lemma scheme defined for the BNC is called BNC; it is based on the combination of the values given by the additional keys Headword and pos, which were defined in the previous section.
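To make the idea of a lemma scheme concrete, the following Python sketch groups word forms under the combination of two additional keys. It is an illustration under stated assumptions: the function lemma_index, the sample occurrences and the pos values "noun" and "verb" are hypothetical and are not taken from the BNC index.

```python
from collections import defaultdict

# Hypothetical (form, headword, pos) triples.
OCCURRENCES = [
    ("dogs", "dog", "noun"),
    ("dogged", "dog", "verb"),
    ("dogging", "dog", "verb"),
    ("dog", "dog", "noun"),
]

def lemma_index(occurrences):
    """Map each (headword, pos) combination to the set of forms it covers."""
    index = defaultdict(set)
    for form, headword, pos in occurrences:
        index[(headword, pos)].add(form)
    return dict(index)
```

Looking up ("dog", "verb") in the result returns only the verbal forms, which is the distinction that a headword key alone could not make.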
A region is a collection of possibly discontinuous sections of a corpus defined by the XML tagging within it. For example, each BNC document contains a <teiHeader> element and either a <wtext> or an <stext> element. We say that all the parts of each document contained by a <teiHeader> element constitute one region. All the parts contained by either a <wtext> or a <stext> element constitute another region. Regions (unlike partitions) span document boundaries, and are not made up of whole texts but of defined parts of them.
A region is defined by means of a <xairaItem> of type region. The ident attribute on the <xairaItem> supplies a name for the region, which can be used by the client to limit searches to locations within the named region.
The definition of the region is contained within a <nameList>. It combines the name of a previously-defined additional key (region in the case of the BNC), which is tagged as an <ident> element, with a list of one or more values. Word occurrences whose region additional key has the value specified will be considered to fall within the region being defined. Since these values are element names, they are tagged within the <nameList> using the <gi> element.
The BNC defines two regions: headerOnly, for words occurring within the header, and textOnly, for words occurring within <wtext> or <stext> elements, as indicated by the values supplied for their respective region additional key.
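A region can thus be thought of as a named set of key values. The Python sketch below restricts a list of occurrences to one region; the mapping of headerOnly to teiHeader and of textOnly to wtext and stext follows the description above, while the function in_region and the sample data are hypothetical.

```python
# Region name -> values of the region additional key that fall inside it.
REGIONS = {
    "headerOnly": {"teiHeader"},
    "textOnly": {"wtext", "stext"},
}

# Hypothetical (form, region key) pairs as an indexer might record them.
OCCURRENCES = [
    ("Title", "teiHeader"),
    ("Hello", "wtext"),
    ("yeah", "stext"),
]

def in_region(occurrences, region_name, regions=REGIONS):
    """Return the forms whose region key belongs to the named region."""
    wanted = regions[region_name]
    return [form for form, key in occurrences if key in wanted]
```

A search limited to textOnly would in this sketch see "Hello" and "yeah" but not "Title".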
The index maps occurrences of index terms as defined in the previous section to locations in the corpus, which may be identified in a number of ways, additional to the internally-defined location system. This external referencing scheme is used by the system to label the context of occurrences found by the search program. Occurrences themselves are precisely located by the internal location scheme. Although the index contains information about the complete XPath location of occurrences within the corpus, the internal location scheme is highly optimized and cannot be used to support access via arbitrary XPaths or XQL queries.
The element from which the text identifier is derived also delimits a single ‘text’ in the corpus. This effectively limits the kinds of value which may be used to identify it: it must be an attribute value or a pseudo value; element content is not permitted.
The referencing specification for a Xaira index is given by a <xairaList type="refSpec">, containing exactly one <xairaItem type="textRef">, followed by one <xairaItem type="scopeRef"> and optionally one or more further <xairaItem type="unitRef"> elements. Each such <xairaItem> element contains a <valSource> element as defined above, to indicate where the value for the reference is to be obtained in the input document. It may also contain a <labelGen> element which further defines the parts of the document to which the reference applies and its format.
In the BNC, each <bncDoc> begins a new ‘text’, which is identified by the value of its xml:id attribute, and the scope for each query is to be a complete <s> element, identified by its n attribute. The reference is to be formatted with a dot between the two values.
This specification will produce references like ABC.123 for an <s> element with attribute n set to 123, found within a <bncDoc> element whose xml:id attribute has the value ABC.
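The construction of such references can be sketched in Python as follows. This is only an illustration of the text-plus-scope labelling described above, not the XAIRA implementation; the function references and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document.
SAMPLE = (
    '<bncDoc xml:id="ABC"><wtext>'
    '<s n="123"><w>Hello</w></s><s n="124"><w>again</w></s>'
    "</wtext></bncDoc>"
)

def references(xml_text, sep="."):
    """Build one reference per <s>: text identifier, separator, then n."""
    root = ET.fromstring(xml_text)
    refs = []
    for doc in root.iter("bncDoc"):      # each <bncDoc> starts a new text
        text_id = doc.get(XML_ID)
        for s in doc.iter("s"):          # each <s> is one query scope
            refs.append(f"{text_id}{sep}{s.get('n')}")
    return refs
```

For the sample above, the first <s> element is labelled ABC.123, matching the format described in the text.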
In addition to index terms derived from the lexical content of a corpus, a Xaira index also contains information about the occurrence of XML start- and end-tags within the corpus. This information is used to facilitate a number of search options: searching for non-lexical features, searching for lexical features within a given structural context, scoping co-occurrences of lexical or non-lexical features, etc.
By default an entry is made in the index for each occurrence of each tag, both start and end. This entry may also distinguish start-tag occurrences depending on the values of specified attributes supplied with them. (Note that this is independent of the use of such attribute values in the creation of index terms as described in the previous section).
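For example, a <head> element would by default give rise to index entries for <head> and </head>; where start-tags are distinguished by the value of the type attribute, there would be entries for <head>, <head type="sub"> and </head>.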
The content of every element found in a corpus is indexed by default, as are all of the tags, and all of their attributes. This behaviour may be modified by specifying explicit indexing policies for elements to which this default policy does not apply. An indexing policy may not be specified for elements or attributes which have been nominated as the sources for an additional key or reference, since these are indexed in a different way. Any indexing policy specified for such elements or attributes will be ignored by the indexer.
<category> element within a TEI-conformant <taxonomy> element.
For every element or attribute to which a non-default indexing policy applies, a <xairaItem type="indexPol"> appears within the <xairaList type="indexSpec"> element. This may contain either an <elementPolicy> or an <attributePolicy> element, depending on whether it relates to elements or attributes.
In the BNC, a non-default policy is specified for <revisionDesc>. Although <revisionDesc> elements will be visible in search results, they cannot be searched for, and a query for one, or for anything contained by one, will return no hits.
A non-default policy is also specified for <bibliography>. One occurrence of this element, declared in its own namespace, is necessary for a XAIRA system: it holds metadata relating to each text constituting the corpus. In the BNC this bibliographic information is copied from the text headers, which are also indexed in their own right. To avoid duplication of this content, the indexer is instructed to index only the structure of the bibliography but not its content.
The BNC describes each speaker by means of a <person> element, and also uses the attribute who to identify the speaker or speakers of each speech in the transcribed part of the corpus.
Since the value of the who attribute on <u> corresponds to the xml:id value of a <person> element, a join query can be effected. The XAIRA client can be configured to support queries in which the attributes age and soc appear to be attributes of the <u> element, their values being transferred from the <person> element whose xml:id value is equal to that given by the who attribute on <u>. The effect is as it would be if the age and soc attributes were supplied directly on each <u> element.
First, we declare a join-to policy for any xml:id attribute. Next we declare the join-from policy for the who attribute on the <u> element. As well as specifying which attribute carries the value required (who), we need additionally to supply the name of the element on which the corresponding join-to attribute should be found (<person>). Values are transferred when a match is found between the value for the who attribute and that of whichever attribute of the nominated element has been indexed with the join-to policy. Note that only one attribute of a given element may be indexed with the join-to policy and that the values of attributes indexed with the join-to policy must be unique within the specified element and attribute combination. Thus, there may be only one <person> element with the value ABC for its xml:id attribute, though the same value may appear on other attributes. If the value appears on the xml:id attribute of some other element, it will not be found with this join-to policy. Note that, since the globally-available xml:id attribute is used to hold the join-to attribute, its values must be unique across the whole corpus.
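The join described above can be sketched in Python as a simple lookup. This is a hedged illustration only: the function joined_attributes, the sample document and its attribute values are hypothetical, each who attribute is assumed to name a single speaker, and XAIRA's own mechanism may differ.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document with two speakers and two utterances.
SAMPLE = (
    "<bncDoc><teiHeader>"
    '<person xml:id="P1" age="3" soc="AB"/>'
    '<person xml:id="P2" age="5" soc="C1"/>'
    "</teiHeader><stext>"
    '<u who="P1"><s n="1"><w>Hello</w></s></u>'
    '<u who="P2"><s n="2"><w>Hi</w></s></u>'
    "</stext></bncDoc>"
)

def joined_attributes(xml_text, from_el="u", from_attr="who",
                      to_el="person", to_attr=XML_ID):
    """Return, for each join-from element, its own attributes plus those
    of the join-to element whose to_attr equals its from_attr value."""
    root = ET.fromstring(xml_text)
    targets = {}
    for el in root.iter(to_el):
        key = el.get(to_attr)
        if key is None:
            continue
        if key in targets:
            # Join-to values must be unique for the join to be well defined.
            raise ValueError(f"duplicate join-to value: {key}")
        targets[key] = el
    rows = []
    for el in root.iter(from_el):
        target = targets.get(el.get(from_attr))
        extra = {}
        if target is not None:
            extra = {k: v for k, v in target.attrib.items() if k != to_attr}
        rows.append({**el.attrib, **extra})
    return rows
```

Each <u> then appears to carry the age and soc values of the matching <person>; the uniqueness check reflects the requirement that join-to values must not be repeated.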
A taxonomy is a special kind of codebook, the purpose of which is to provide a set of defined codes to classify the texts making up a corpus. The BNC defines several different taxonomies as means of classifying its constituent texts, as further described in 5.2.3 The reference and classification declarations. The element or attribute within a particular text which identifies its classification, by referencing one or more codes within a taxonomy, is called its classifier.
Each distinct taxonomy for a corpus is defined by a TEI <taxonomy> element, within the corpus header. This defines the codes available for use and gives a gloss to them. Where, as is usual, the texts in a corpus are classified along more than one dimension (for example, by text type, by medium of distribution, by audience type etc.), a <taxonomy> must be defined for each dimension, rather than defining a single taxonomy with disjoint sets of children. Note that the classification codes used must be unique across the whole corpus, irrespective of the taxonomy to which they belong. This approach also enables the client to regard each taxonomy as defining a partition of the corpus.
The <catRef> element in each text header supplies a list of values for all the original selection and descriptive criteria, described in 1 Design of the corpus.
The type attribute on <wtext> and <stext> elements carries a broadbrush text-type categorization, derived from the other classification codes; see further 9.1 XML tag usage by text type.
As a Unicode system, XAIRA is able to handle data in any natural language or writing system. However, it is still necessary to specify the language or languages used in the corpus being indexed. This specification is performed by a <xairaList> of type langspec. This contains at least one <xairaItem type="defaultLang">, and optionally other <xairaItem type="langRules"> elements.