Spoken English Corpus

The Spoken English Corpus (SEC) is a speech corpus: a collection of recordings of spoken British English compiled between 1984 and 1987. The corpus manual is available from ICAME.[1]

History

The Spoken English Corpus (SEC) project was supported jointly in 1984–85 by the Humanities Research Fund at Lancaster University and by IBM (UK) Ltd, and subsequently by IBM (UK) Ltd. The project was supported by Geoffrey Leech at Lancaster and Geoffrey Kaye at IBM. It was a collaboration between the Unit for Computer Research on the English Language (UCREL) at the University of Lancaster and the IBM Scientific Centre in Winchester.[2]

Compilation

The SEC comprises 53 recorded passages, mainly from the BBC, spoken in the accent usually referred to as Received Pronunciation (RP). The collection covers categories such as commentary, news broadcast, lecture, dialogue, poetry and propaganda.[3] The corpus contains 52,637 words in recordings totalling 339 minutes. The compilation of the corpus is described by Lita Taylor in her 1996 article "The Compilation of the Spoken English Corpus."[4]

Transcription

Knowles et al. (1996), A Corpus of Formal British English Speech, Routledge.

A system was devised for transcribing the intonation of the recorded material. Two transcribers, Gerry Knowles and Briony Williams, both supported by Lita Taylor, analysed the entire corpus. The transcription system is explained by Williams.[5] Brian Pickering conducted an experiment to assess the degree of agreement between the two transcribers, using a section of the corpus of around 1000 tone-units that both had transcribed independently.[6] Good agreement was found.
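
As a rough illustration of how agreement between two transcribers can be quantified, the sketch below computes raw percentage agreement and Cohen's kappa over two aligned label sequences. The source does not say which measure Pickering used, so the choice of kappa is an assumption, and the tone labels ("fall", "rise", "level") and function names are hypothetical.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of positions where both transcribers chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both transcribers used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for four tone-units, one list per transcriber.
transcriber_1 = ["fall", "rise", "fall", "level"]
transcriber_2 = ["fall", "rise", "rise", "level"]

print(percent_agreement(transcriber_1, transcriber_2))
print(cohens_kappa(transcriber_1, transcriber_2))
```

Kappa is lower than raw agreement here because some of the matches would be expected by chance alone.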

An important attribute of a modern corpus is that it is computer-readable: a corpus tends to reside on a hard disk rather than on a bookshelf. In presenting the corpus in book form, the authors have taken into account the needs of established corpus linguists and of those who are not yet familiar with corpora. Anyone who has the corpus on disk can make hard copies of most of the files; but without a special font to print the prosodic symbols, the prosodic texts will be either unprintable or unreadable. For this reason the prosodic version has been chosen for publication.

The printed transcription was prepared in its present form by Peter Alderson, who later took over as Speech Research Manager at IBM. The volume, entitled "A Corpus of Formal British English Speech: The Lancaster/IBM Spoken English Corpus", was first published by Longman in 1996 and later by Routledge in 2013. The book is available from online bookstores including Routledge and Book Depository, and in electronic format from Google Play Books.[7][8]

Other analyses

Grammatical tagging of each word, based on the CLAWS1 tagset, was added to the text of the SEC by an automatic process.[9][10] Because this tagging was in machine-readable form, grammatical and prosodic information in the texts could be related to each other. Subsequent work used probabilistic models to further develop the grammatical tagging and to produce automatic parsing techniques.[11]
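
A minimal sketch of what relating grammatical and prosodic information can look like in practice: given tokens that carry both a grammatical tag and a prosodic mark, count how often each tag is accented. The token format, the sample sentence, and the boolean accent flag are assumptions made for illustration and do not reproduce the SEC file format; the tags are written in the style of the CLAWS1 tagset.

```python
from collections import defaultdict

# Hypothetical tokens: (word, grammatical tag, carries an accent?)
tokens = [
    ("the", "AT", False),
    ("minister", "NN", True),
    ("of", "IN", False),
    ("state", "NN", False),
    ("resigned", "VBD", True),
]

def accent_rate_by_tag(tokens):
    """Return, for each grammatical tag, the fraction of its tokens that are accented."""
    totals = defaultdict(int)
    accented = defaultdict(int)
    for _word, tag, is_accented in tokens:
        totals[tag] += 1
        if is_accented:
            accented[tag] += 1
    return {tag: accented[tag] / totals[tag] for tag in totals}

rates = accent_rate_by_tag(tokens)
for tag, rate in sorted(rates.items()):
    print(tag, rate)
```

On a real corpus the same tabulation would show, for example, whether content-word tags attract accents more often than function-word tags.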

Anne Wichmann published her research on SEC intonation, "Intonation in Text and Discourse: Beginnings, middles, and ends" in 2000.[12]

Machine-Readable Spoken English Corpus (MARSEC)

Although the text and its associated tagging existed in machine-readable form, the recordings themselves existed only on tape. A collaboration between speech scientists at the Universities of Lancaster and Leeds in the United Kingdom, funded by the Economic and Social Research Council in 1992–94, set out to produce a version of the corpus that contained the recordings in digital form, time-linked to the text.[13] The principal researchers were Gerry Knowles and Tamas Varadi (Lancaster) and Peter Roach and Simon Arnfield (Leeds). The outline of the project is set out in Knowles,[14] and the automatic time-alignment is described by Roach and Arnfield.[15] The digitized recordings were stored on CD-ROM. The corpus was subsequently made available for download for research purposes from Leeds University, though this facility is no longer supported.[16]

Aix-MARSEC

The work on MARSEC in Lancaster and Leeds finished around 1995, but the corpus has since undergone considerable further development at the University of Aix-en-Provence, France, under the direction of Daniel Hirst.[17] The database consists of two major components: the digitized recordings from MARSEC and the annotations. Annotations have so far been undertaken at nine levels, including phonemes, syllables, words, stress feet, rhythm units and minor and major turn units. Two supplementary levels, the grammatical annotation by CLAWS and a Property Grammar system developed at Aix-en-Provence, are to be integrated soon.[18] A possible disadvantage of this treatment is that the corpus can only be searched using specially written scripts.[19] The database, together with tools, is available under GNU GPL licensing at the Aix-MARSEC project site.[20]
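
To give an idea of what a specially written search script over several annotation levels might involve, the sketch below models two levels as lists of time-stamped intervals and finds the units of one level that fall inside each unit of another. The `Interval` type, the sample words and timings, and the function name are hypothetical; the sketch does not reproduce the actual Aix-MARSEC file format or tools.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    start: float  # seconds from the start of the recording
    end: float
    label: str

def contained_units(outer_tier, inner_tier):
    """Pair each outer interval with the inner intervals lying entirely within it."""
    result = []
    for outer in outer_tier:
        inside = [
            inner for inner in inner_tier
            if inner.start >= outer.start and inner.end <= outer.end
        ]
        result.append((outer, inside))
    return result

# Hypothetical word and syllable levels for a short stretch of speech.
words = [Interval(0.0, 0.3, "good"), Interval(0.3, 0.8, "morning")]
syllables = [
    Interval(0.0, 0.3, "good"),
    Interval(0.3, 0.55, "mor"),
    Interval(0.55, 0.8, "ning"),
]

for word, sylls in contained_units(words, syllables):
    print(word.label, [s.label for s in sylls])
```

The same containment test extends to any pair of levels, for instance syllables within stress feet, provided both levels share a common time base.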

References

  1. ^ "MANUAL OF INFORMATION TO ACCOMPANY THE SEC CORPUS". korpus.uib.no. Retrieved 2020-10-15.
  2. ^ Leech, Geoffrey. (1996). "The Spoken English Corpus in its context." Foreword. Knowles, Gerard; Wichmann, Anne; Alderson, Peter, eds. (1996). Working with Speech. Longman. p. ix. ISBN 9780582045347.
  3. ^ Xiao, Richard; Tono, Yukio (2006). MacEnery, Tony (ed.). Corpus-Based Language Studies: An Advanced Resource Book. Taylor & Francis. p. 63. ISBN 9780415286220.
  4. ^ Taylor, Lita. (1996). "The Compilation of the Spoken English Corpus." Knowles, Gerard; Wichmann, Anne; Alderson, Peter, eds. (1996). Working with Speech. Longman. pp. 20–37. ISBN 9780582045347.
  5. ^ Williams, Briony. (1996). "The formulation of an intonation transcription system for British English." Knowles, Gerard; Wichmann, Anne; Alderson, Peter, eds. (1996). Working with Speech. Longman. pp. 38–57. ISBN 9780582045347.
  6. ^ Pickering, Brian. (1996). "Analysis of transcriber differences in the SEC." Knowles, Gerard; Wichmann, Anne; Alderson, Peter, eds. (1996). Working with Speech. Longman. pp. 61–86. ISBN 9780582045347.
  7. ^ "A Corpus of Formal British English Speech: The Lancaster/IBM Spoken English Corpus (Paperback) - Routledge". Routledge.com. Retrieved 2018-07-22.
  8. ^ "A Corpus of Formal British English Speech : Gerald Knowles : 9781138457768". www.bookdepository.com. Retrieved 2019-01-30.
  9. ^ Taylor, Lita. (1996). "The Compilation of the Spoken English Corpus." Knowles, Gerard; Wichmann, Anne; Alderson, Peter, eds. (1996). Working with Speech. Longman. p. 30. ISBN 9780582045347.
  10. ^ "UCREL CLAWS1 (LOB) Tagset". ucrel.lancs.ac.uk. Retrieved 2020-10-15.
  11. ^ Sampson, Geoffrey. (1987). "Probabilistic models of analysis." Garside, Roger; Sampson, Geoffrey; Leech, Geoffrey (1987). The Computational Analysis of English. Longman. ISBN 9780582291492.
  12. ^ "Intonation in Text and Discourse: Beginnings, Middles and Ends". Routledge & CRC Press. Retrieved 2020-10-15.
  13. ^ Roach, Peter; Knowles, Gerry; Varadi, Tamas; Arnfield, Simon (1993). "MARSEC: a MAchine-readable Spoken English Corpus". Journal of the International Phonetic Association. 23 (2): 47–54. doi:10.1017/s0025100300004849. ISSN 0025-1003. S2CID 145797962.
  14. ^ Knowles, G. "Converting a corpus into a relational database: SEC becomes MARSEC." Leech, Geoffrey; Myers, Greg; Thomas, Jenny (1995). Spoken English on Computer. Longman. pp. 208–219. ISBN 9780582250215.
  15. ^ Roach, Peter and Arnfield, Simon. "Linking prosodic transcription to the time dimension." Leech, Geoffrey; Myers, Greg; Thomas, Jenny (1995). Spoken English on Computer. Longman. pp. 149–160. ISBN 9780582250215.
  16. ^ "MARSEC: The Machine Readable Spoken English Corpus". www.reading.ac.uk. Retrieved 2020-10-15.
  17. ^ Hirst, Daniel; De Looze, Céline; Auran, Cyril; Bouzon, Caroline (27 July 2010). "Aix-MARSEC database". Archived from the original on 23 January 2010. Retrieved 15 April 2013.
  18. ^ Auran, Cyril; Bouzon, Caroline (2003). "Phonotactique prédictive et alignement automatique : application au corpus MARSEC et perspectives" [Predictive phonotactics and automatic alignment: application to the MARSEC corpus and prospects]. Travaux interdisciplinaires du laboratoire parole et langage d'Aix-en-Provence (in French). 22. Publications de l'Université de Provence: 33–63. Retrieved 15 April 2013.
  19. ^ Wichmann, Anne "Speech corpora and spoken corpora" Ludeling, Anke; Kyto, Merja (2006). Corpus Linguistics 1. Walter de Gruyter. p. 200. ISBN 9783110180435.
  20. ^ Hirst, Daniel. "Aix-MARSEC project". Archived from the original on 23 January 2010. Retrieved 15 April 2013.