The Users Reference Guide for the British National Corpus contains a description of the design principles underlying the British National Corpus (BNC), and detailed information about the way in which it is encoded, covering both the markup conventions applied and the linguistic annotation with which the corpus was enriched.
This revised edition has been slightly reorganized and considerably expanded to provide a complete reference work for users of the corpus in its new XML form. The text of the manual is distributed in TEI-XML and HTML formats, and is also available from the BNC website at http://www.natcorp.ox.ac.uk/XMLedition/urg.html, from which updated versions may be obtained.
The material presented in this manual derives originally from a number of BNC Project internal documents, combining contributions from all the participants in the project (see further Acknowledgments); any errors introduced are the responsibility of the editor. Please send any comments or corrections to [email protected].
Section 2 Basic structure describes the basic structure of the BNC encoding scheme, in terms of the XML elements and attributes distinguished and the tags used to mark them. Section 3 Written texts describes features which are peculiar to written texts, and section 4 Spoken texts those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only. It should be noted that by no means all of the features described here will be present in every text of the corpus, nor, if present, will they necessarily be tagged.
Section 5 The header describes the structure of the detailed metadata associated with each text, in the form of the <teiHeader> element attached to each component of the corpus, and also to the whole corpus itself.
This is complemented in section 6 Wordclass Tagging in BNC XML by a detailed presentation of the linguistic annotation or wordclass tagging applied throughout the corpus. (This chapter is derived from the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging (Leech and Smith), originally distributed separately with BNC World.)
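As a rough illustration of how the header metadata and the wordclass annotation sit side by side in a single text, the following minimal Python sketch parses a small hand-written fragment. The fragment is hypothetical and is not an actual BNC text. Apart from <teiHeader>, the element and attribute names used here (bncDoc, wtext, s, w, c5, hw) and the function name summarize are assumptions made for the purposes of the example; the authoritative definitions are those given in sections 2 to 6 and in the formal specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment for illustration only; not a real BNC text.
# Element and attribute names other than teiHeader are assumptions.
SAMPLE = """<bncDoc xml:id="X00">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample text (illustrative only)</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <wtext>
    <s n="1"><w c5="AT0" hw="the">The </w><w c5="NN1" hw="corpus">corpus</w></s>
  </wtext>
</bncDoc>"""


def summarize(xml_text):
    """Return the header title and a list of (word, wordclass tag) pairs."""
    root = ET.fromstring(xml_text)
    # The header carries the metadata for this text.
    header = root.find("teiHeader")
    title = header.findtext("fileDesc/titleStmt/title")
    # Each word element is assumed to carry its wordclass tag in a c5 attribute.
    tags = [(w.text.strip(), w.get("c5")) for w in root.iter("w")]
    return title, tags


title, tags = summarize(SAMPLE)
print(title)
print(tags)
```

The same approach should extend to a real corpus file by reading it with ET.parse instead of parsing an inline string, provided the names above match those actually documented for the corpus.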
Section 7 Software for the BNC discusses briefly some ways of exploiting the BNC computationally. Section 9 Miscellaneous tables complements the metadata supplied in the header by listing and documenting several of the coded values used in the markup. A brief bibliography combining significant background readings about the BNC with works cited elsewhere in the manual is provided in section 8 References, and a complete list of all the original sources from which the corpus was compiled is given in section 10 List of Sources.
Section 11 The Xaira Specification documents suggested settings for those wishing to use the XAIRA system to index and query the BNC. The pre-built XAIRA index delivered as part of the BNC XML package was made using the XAIRA specification described in this section. This section is provided for the convenience of XAIRA users; it may be ignored if you are using some other software to search or manage the corpus.
Finally, a reference section (12 Formal Specification of the BNC XML schema) provides an alphabetical list of all XML elements and attributes used in the markup of the corpus, together with the model and attribute classes to which they belong, and macros used to simplify references to them. This specification conforms to the 2007 (P5) edition of the TEI Guidelines ([24]), with which it should be read in conjunction.
Creation of the corpus was funded by the UK Department of Trade and Industry and the Science and Engineering Research Council under grant number IED4/1/2184 (1991-1994), within the DTI/SERC Joint Framework for Information Technology. Additional funding was provided by the British Library and the British Academy.
After the completion of the first edition of the BNC, a phase of tagging improvement was undertaken at Lancaster University with funding from the Engineering and Physical Sciences Research Council (Research Grant No. GR/F 99847). This tagging enhancement project was led by Geoffrey Leech, Roger Garside and Tony McEnery. The main objective was to correct as many tagging errors as possible, using an enhanced version of CLAWS4. In addition, a new tool was developed (the Template Tagger) for ‘patching’ the corpus in such a way as to eliminate further sets of errors by rule. This tool was developed by Michael Pacey, building on a prototype written by Steven Fligelstone. The research team working on tagging improvement was Nicholas Smith (lead researcher), Martin Wynne and Paul Baker.
Correction and validation of the bibliographic and contextual information in all the BNC Headers was carried out at OUCS by Lou Burnard, with assistance at various stages from Andrew Hardie and Paul Groves, who helped check demographic details for all spoken texts, and in particular from David Lee, who checked bibliographic and classification information for the bulk of the written texts. Thanks are also due to the many users of the original version of the BNC who took the time to notify us of errors they found.
Thanks are due to Martin Wynne and Ylva Berglund who first suggested the idea of an XML version of a subset of the BNC. Production of that edition (BNC Baby) provided valuable experience in automatic conversion of the World edition. The bulk of the technical work involved in producing the XML edition was carried out by Tony Dodd and Lou Burnard, with assistance and advice from many BNC users and beta-testers worldwide, in particular Guy Aston, Andrew Hardie, Paul Rayson, and Sebastian Rahtz. Without their input the present revision would have been impossible.