(5.1) Computational/Corpus Linguistics

(5.1.1) Machine Learning Support for Evaluation and Quality Control

Hans van Halteren
University of Nijmegen, The Netherlands

Annotated material which is to be evaluated and possibly upgraded is used as training and test data for a machine learning system. The portion of the material for which the output of the machine learning system disagrees with the human annotation is then examined in detail. This portion is shown to contain a higher percentage of annotation errors than the material as a whole, and hence to be a suitable subset for limited quality improvement. In addition, the types of disagreement may identify the main inconsistencies in the annotation so that these can then be investigated systematically.


In many humanities projects today, we see that large textual resources are manually annotated with markup symbols, as these are deemed necessary for efficient future research with those resources. The reason that the annotation is applied manually is that there is, for the time being, no automatic procedure which can apply the annotation with an acceptable degree of correctness, typically because the annotation requires detailed knowledge of language or even of the world to which the resources refer.

The choice of human annotators may be unavoidable, but it is also one which has a severe disadvantage. Human annotators are unable to sustain the amount of concentration needed for correct annotation for the amounts of time needed to annotate the enormous amounts of data present (cf. e.g. Marcus et al. 1993; Baker 1997). Loss of concentration, even if only partial and temporary, is bound to lead to a loss of correctness in the annotation. Awareness of this problem has led to the use of quality control procedures in large scale annotation projects. Such procedures generally consist of spot checks by more experienced annotators or double blind annotation of a percentage of the material. The lessons learned from such checks lead to additional instruction of the annotators, and, if the observed errors are systematic and/or severe enough, to correction of previously annotated material. Even with excellent quality control measures during annotation, though, it is likely that the end result will not be fully correct, and the measure of correctness can, at most, be estimated from the observations made in quality control. Obviously, it would be enormously helpful if there were automatic procedures to support large scale evaluation and upgrade of annotated material.


Unfortunately, as mentioned above, automatic procedures are currently unable to deal with natural language to a sufficient degree to correctly apply most types of annotation. However, although automatic procedures cannot provide correctness, they are undoubtedly well-equipped to provide consistency. Now consistency and correctness are not the same, but both are desirable qualities and, unlike other pairs of desirable qualities such as high precision and recall, they are not in opposition. Complete correctness is bound to be consistent at some level of reference and complete consistency at a sufficiently deep level of reference is bound to be correct. More practically, a highly correct annotation can be assumed to agree most of the time with a highly consistent annotation, which means that disagreement between the two will tend to indicate instances with a high likelihood of error.

An example is provided by Van Halteren et al. (Forthcoming). One of the constructed wordclass taggers is trained and tested on Wall Street Journal material tagged with the Penn Treebank tagset. In comparison with the benchmark, the tagger provides the same tag in 97.23% of the cases. When the disagreements are checked manually for 1% of the corpus, it turns out that out of 349 disagreements, 97 are in fact errors in the benchmark. Unless this sample is an unfortunate coincidence, this means that we could remove about 10,000 errors by checking fewer than 40,000 words, a much less formidable task than checking the whole 1Mw corpus. In addition, 44% of the cases where the tagger is wrong appear to be caused by inconsistencies in the training data: e.g. the word "about" in contexts such as "about 20" or "about $20" is tagged as a preposition 648 times and as an adverb 612 times. Such observations are slightly harder to use systematically, but can again serve to adjust inconsistent and/or incorrect annotation.
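The second kind of observation, systematic inconsistency in the training data, can be surfaced automatically. As a rough illustration (not the procedure of the cited study), one can count the tag distribution of each word and flag frequent words whose dominant tag covers suspiciously few of their occurrences; the thresholds below are arbitrary:

```python
from collections import Counter, defaultdict

def find_inconsistent_tags(tagged_corpus, min_count=50, max_ratio=0.8):
    """Flag words whose tag distribution is split between two or more
    tags, a possible sign of inconsistent annotation."""
    tags_per_word = defaultdict(Counter)
    for word, tag in tagged_corpus:
        tags_per_word[word.lower()][tag] += 1
    suspects = []
    for word, tags in tags_per_word.items():
        total = sum(tags.values())
        top_tag, top_count = tags.most_common(1)[0]
        # A dominant tag covering less than max_ratio of a frequent
        # word's occurrences suggests systematic disagreement.
        if total >= min_count and top_count / total < max_ratio:
            suspects.append((word, total, dict(tags)))
    return sorted(suspects, key=lambda s: -s[1])

# Toy data reproducing the "about" pattern described above:
corpus = [("about", "IN")] * 648 + [("about", "RB")] * 612 + [("the", "DT")] * 100
for word, total, dist in find_inconsistent_tags(corpus):
    print(word, total, dist)
```

Running this on the toy data flags "about" (split 648/612 between two tags) while leaving the consistently tagged "the" alone.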

In principle, the use of such a comparison methodology is not limited to wordclass tagging. Any annotation task which can be expressed as classification on the basis of a (preferably small) number of information units (e.g. for wordclass tagging the information units could be the word, two disambiguated preceding classes and two undisambiguated following classes) can be handled by a machine learning system. Such a system attempts to identify regularities in the relation between the information units and the assigned class, and uses these regularities to classify previously unseen cases (cf. e.g. Langley 1996; Carbonell 1990). Several machine learning systems are freely available for research purposes, e.g. the memory-based learning system TiMBL (http://ilk.kub.nl) and the decision tree system C5.0 (http://www.rulequest.com). If we have a machine learning system and if we can translate the annotation task into a classification task, we can train the system on the annotated material and then compare the system's output with the human annotation. The instances where the two disagree can then (a) be used as prime candidates for rechecking correctness and (b) point to systematic inconsistencies to be reconsidered.
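The compare-and-flag procedure itself can be sketched in a few lines. The classifier here is a deliberately simple stand-in for a real machine learning system (a most-frequent-tag model, not TiMBL or C5.0), but the workflow of training on the annotation and then flagging disagreements is the one described above:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_tokens):
    """Learn the most frequent tag for each word; a deliberately
    simple stand-in for a real machine learning system."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def flag_disagreements(model, tagged_tokens):
    """Return positions where the learned model and the human
    annotation disagree: prime candidates for manual rechecking."""
    return [(i, word, gold, model[word])
            for i, (word, gold) in enumerate(tagged_tokens)
            if word in model and model[word] != gold]

data = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
        ("the", "DT"), ("dog", "NN"), ("a", "DT"),
        ("dog", "VB")]   # the last tag is a likely annotation error
model = train_unigram_tagger(data)
print(flag_disagreements(model, data))   # flags position 6
```

The flagged instances are exactly the subset a human checker would examine; as the abstract notes, that subset is enriched for genuine annotation errors.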

Overview of the Paper

Using various types of annotated material and machine learning systems, this paper will attempt to answer the following questions:


Baker, J. P. (1997). Consistency and Accuracy in Correcting Automatically Tagged Data. In R. Garside, G. Leech and A. P. McEnery (eds) Corpus Annotation. Addison Wesley Longman, London. 243-250.
Carbonell, J. (ed) (1990). Machine Learning: Paradigms and Methods. MIT Press, Cambridge, MA.
van Halteren, H., Daelemans, W. and Zavrel, J. (Forthcoming). Improving Accuracy in NLP through Combination of Machine Learning Systems.
Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann, Los Altos, CA.
Marcus, M., Santorini, B. and Marcinkiewicz, M. (1993). Building a large annotated Corpus of English: the Penn Treebank. Computational Linguistics 19(2). 313-330.

Return to ALLC/ACH Programme

(5.1.2) A Corpus-Based Methodology for Identifying Non-nominal "It": Rule-Based and Machine Learning Approaches

Richard Evans
University of Wolverhampton, UK

In this paper, seven uses of "it" are identified in English. These uses involve noun phrase anaphora (e.g. "Do not sweep the dust when dry, you will only recirculate IT."), verb phrase anaphora (e.g. "Raising money for your favourite charity can be fun. You can do IT on your own..."), reference to clauses (e.g. "Not every city would be suited to this approach, IT must be admitted."), reference to entire discourse segments (e.g. "Always use a tool for the job it was intended to do. Always use tools correctly. If IT feels very awkward, stop."), cataphoric reference to entities (e.g. "When IT fell, the glass broke."), pleonastic uses in which the pronoun has no reference and is used only due to a requirement of the grammar (e.g. "IT was {raining, 4 o'clock, All Saints' day, etc.}", "IT's recommended that...", "IT's easier to..."), and uses in idiomatic constructions (e.g. "I take IT you're going now."). Due to the absence of a suitable term in the literature, the term "non-nominal it" is used to identify all the cases in which "it" is not in an anaphoric relationship with a noun phrase in the text. Numerous researchers have so far proposed hand-crafted rule-based pattern-matching techniques to identify pleonastic "it". These methods have the drawback that they require recognition of potentially large and open-ended lists of trigger words and complex expressions in order to succeed. The goal here was to compare a rule-based method with a method devised to use machine learning to make the identification. It was hoped that information such as the position of the pronoun and its complex relation to the surrounding syntactic context would contribute to the accuracy of the identification. We implemented both methods. Corpora were constructed, annotated, and used to train and evaluate these programs, and a comparison was made between them.

The literature makes it easy to infer the importance of recognising non-nominal uses of "it" in the fields of anaphora resolution, information retrieval, machine translation and text summarisation. The task is especially crucial when it is considered that almost one third of the uses of "it" in our corpus of randomly selected texts were non-nominal.

In the full paper, the treatment of pleonastic "it" in surveys of English usage is reviewed, as is work by Paice & Husk (1987), Lappin & Leass (1994) and Denber (1998) on methods for automatic recognition of pleonastic "it". The application of machine learning to a different problem in linguistics is described in the review of Litman (1996) on the automatic classification of cue phrases. One of the methods in the present paper applies machine learning to the automatic identification of non-nominal "it".

A novel resource was required for this corpus-based research. A corpus was therefore constructed using 77 randomly selected texts from the BNC and stripped-down versions of the Susanne corpus. We implemented a software tool that facilitates SGML mark-up, by a human annotator, of the instances of "it" that appear in the corpus. Non-nominal uses of "it" are marked <PLEO ID="XX">it</PLEO>, whereas other instances are left unmarked by the annotator. On completion, the corpus contained 368,830 words, 3,171 occurrences of "it" and 1,025 non-nominal uses. A DTD was defined for the annotated corpus and the SGML-aware LT-Chunker (Mikheev 1996) was used to tokenise the corpus while preserving the prior mark-up. The tokenised file was then processed by a Perl program written to report the paragraph, sentence and word positions of the non-nominal instances of "it". This information was written to a data file and used to evaluate the methods implemented and described later.
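The effect of this mark-up can be illustrated with a small sketch. The whitespace tokenisation below is hypothetical (the actual processing used the SGML-aware LT-Chunker), but the <PLEO> tag format is the one described above:

```python
import re

def extract_pleo_instances(sgml_text):
    """Report each instance of 'it' by token position, together with
    whether it was marked non-nominal with a <PLEO> tag."""
    # Hypothetical tokenisation: isolate PLEO tags, then word tokens.
    tokens = re.findall(r'<PLEO ID="[^"]*">it</PLEO>|\w+', sgml_text)
    report = []
    for pos, tok in enumerate(tokens):
        if tok.startswith("<PLEO"):
            report.append((pos, "it", "non-nominal"))
        elif tok.lower() == "it":
            report.append((pos, "it", "unmarked"))
    return report

sample = ('Do not sweep the dust when dry , you will only recirculate it . '
          '<PLEO ID="01">it</PLEO> was raining .')
print(extract_pleo_instances(sample))
```

A position report of this kind is the information the data file described above needs to record for evaluation.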

We implemented a program based on Paice & Husk's (1987) method for recognition of pleonastic "it". In the first step, a plain text version of the corpus was tagged using Tapanainen & Jarvinen's (1997) SGML-blind FDG-Parser. The output from the tagger was converted to an SGML format by our software and then processed by our program based on Paice & Husk's pattern recognition method. In this way a classification was assigned to each instance of "it". Evaluation was performed by comparing the output of the program with the contents of the data file produced earlier.

A machine learning approach was also implemented. It exploits Daelemans' (1999) TiMBL memory-based learning method. TiMBL works by using a training file of feature-value vectors that have been given a classification: non-nominal or not. The training file was constructed by processing a plain text version of the annotated corpus with the FDG-Parser and the SGML conversion program. The SGML file was input to a program that described each instance of "it" as a vector of feature values. The features used in our approach were designed to describe the position of non-nominal instances, the lemmas of significant "following" words such as verbs and adjectives, as well as the relation of "it" to other structures in the text, such as prepositions and noun phrases. A thorough description appears in the full paper. The vectors associated with the instances were classified by comparison with the data file constructed earlier. The set of classified vectors made this way was then used as the training file. TiMBL classifies query vectors according to their similarity to the examples in the training file. The method of ten-fold cross-validation was used to obtain an evaluation of the average accuracy of the technique over our corpus.
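The core of a memory-based learner of this kind can be sketched in a few lines: classify a query vector by its overlap with stored training vectors. The feature vectors below are invented for illustration and are far simpler than the feature set described in the full paper:

```python
from collections import Counter

def overlap(v1, v2):
    """Number of matching feature values: the simple overlap metric
    used by memory-based learners such as TiMBL's IB1 variant."""
    return sum(a == b for a, b in zip(v1, v2))

def classify(train, query, k=1):
    """Classify a query vector by majority vote among the k training
    examples most similar to it."""
    ranked = sorted(train, key=lambda ex: -overlap(ex[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors for instances of "it":
# (position in sentence, lemma of following word, preceded by preposition?)
train = [
    ((0, "rain", False), "non-nominal"),
    ((0, "be", False),   "non-nominal"),
    ((5, "fall", True),  "nominal"),
    ((7, "see", True),   "nominal"),
]
print(classify(train, (0, "rain", True)))
```

Sentence-initial instances followed by weather verbs land near the pleonastic examples in memory, which is exactly the intuition behind the feature design described above.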

The results of our automatic evaluation showed no major differences in the level of accuracy between the two methods. However, it was noted that the method based on work by Paice & Husk was slightly more accurate on this data (78.81% vs. 78.68%) but had a stronger tendency to misclassify instances as non-nominal (false positives: 265 vs. 243). If false positives are undesirable to the user, then the machine learning approach is preferable.

Further experiments in which the classification of instances in the training set was extended using a 7-ary system, in accordance with the uses given in the introduction, showed some improvement in making the binary distinction between nominal and non-nominal uses. The classification accuracy rose to 78.74% and the number of false positive classifications fell to 209. Predictably, the detection rate for each of the different types of usage was low (50.35% on average).

Given that TiMBL relies on a training file, it will also be beneficial to extend our resource in size as well as in information content. The present file, with 3,171 instances, cannot be considered sufficiently large. The availability of a suitable resource for evaluation is also important for the application of optimisation techniques. Of course, non-nominal pronouns appear in languages other than English, and it would be valuable to create resources for those languages so that machine learning based methods for identifying non-nominal pronouns can be explored there as well.


Burnard, L. (1995). Users Reference Guide British National Corpus Version 1.0. Oxford University Computing Services, UK.
Daelemans, W. (1999). TiMBL: Tilburg Memory Based Learner version 2 Reference Guide, ILK Technical Report - ILK 99-01. Tilburg University, The Netherlands.
Denber, M. (1998). Automatic Resolution of Anaphora in English. Eastman Kodak Co., Imaging Science Division.
Lappin, S. and Leass, H. J. (1994). An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics 20(4).
Litman, D. J. (1996). Cue Phrase Classification Using Machine Learning. Journal of Artificial Intelligence Research 5, 53-94.
Mikheev, A. (1996). LT_CHUNK V 2.1. Language Technology Group, University of Edinburgh, UK.
Paice, C. D. and Husk, G. D. (1987). Towards the Automatic Recognition of Anaphoric Features in English Text: The Impersonal Pronoun 'It'. Computer Speech and Language 2, 109-132.
Sampson, G. (1995). English for the Computer: The SUSANNE Corpus and Analytic Scheme. Oxford University Press, UK.
Tapanainen, P. and Jarvinen, T. (1997). A Non-Projective Dependency Parser. Proceedings of the 5th Conference on Applied Natural Language Processing, 64-71. ACL.


(5.1.3) Binomials and the Computer: a Study in Corpus-Based Phraseology

Ourania Hatzidaki
University of Birmingham, UK

This paper presents the results of a large-scale corpus-based study (Hatzidaki 1999) of an important feature of English phraseology, namely binomial pairs (e.g. chalk and cheese, up and down, prim and proper, through and through, Laurel and Hardy). The purpose of the research was two-fold: firstly, to conduct, on the basis of a large and varied corpus of English textual data, a thorough and in-depth structural and functional analysis of a much-studied and yet not fully explored phraseological phenomenon; and secondly, to examine the hypothesis that the use of systematically collected samples of authentic language data results in more accurate and comprehensive descriptions of the form and function of linguistic phenomena than does the sole reliance on introspection. The analysis yielded an extensive and rigorous taxonomy of the various structural variations of binomials, as well as significant new information on their function in the communicative process.

Binomials, namely sequences of "two or more words or phrases belonging to the same grammatical category, having some semantic relationship and joined by some syntactic device such as 'and' or 'or'" (Bhatia 1994:143), have long been objects of interest for idiomatologists and stylisticians. The several existing studies of this phenomenon have mainly focussed on its marked occurrence in the works of certain literary authors such as Chaucer, Lydgate, Shakespeare, Swift, Shaw, etc. (see, respectively, Héraucourt 1939 and Potter 1972, Tilgner 1936, Nash 1958 and Gerritsen 1958, Milic 1967, and Ohmann 1962), as well as in English legal texts; the semantic and syntactic characteristics and idiosyncrasies of the various paired forms, especially the semantic relationship between the linked members of a binomial (synonymy, able and talented; antonymy, boys and girls; complementarity, bow and arrow; Malkiel 1959), or the notion of irreversibility, i.e. the tendency of binomials to occur in only one sequence, as in here and there and not *there and here, and the possible causes of this phenomenon (e.g. 'proximal before distal'; Cooper and Ross 1975); and the incidence of binomials in languages other than English (e.g. Fix 1985 for German; Abraham 1950 for French and Italian; Malkiel 1959 for Russian, Portuguese, Spanish, Ancient Greek and Latin; Gold 1991 for Yiddish; Koch 1983 for Arabic; Szpyra 1983 for Polish; etc.).

As opposed to literary studies, where binomials are treated as a flexible and interesting stylistic device which serves as a powerful means of expressing the authors' ideology and worldview, most studies of the occurrence of this feature in general language implicitly or explicitly regard binomials as a small and probably finite set of structurally and semantically idiosyncratic forms. Moreover, although many studies of the formal characteristics of binomials are available, there exists no comprehensive account of the full structural variability of the binomial pairs used by the average speaker, no detailed information on the distribution of the different patterns, and no organized taxonomy of forms. Finally, with the notable exception of studies of binomials as a distinctive feature of the language of the law which fulfils the requirements of legal draftsmanship for precision, clarity, unambiguity and all-inclusiveness (Mellinkoff 1963, Gustafsson 1984, Bhatia 1994), minimal attention has been given to the functions of binomials in non-literary language.

Crucially, with very few exceptions (notably Gustafsson 1975), previous treatises on binomials have been intuition-based. A glance at a general corpus, however, instantly reveals a number of new and interesting facts concerning this feature. Firstly, numerous paired forms emerge, which appear to have been modelled on an abstract dualistic structure of the A + link + B type, very few of which, however, represent familiar, idiomatic locutions such as the oft-quoted rough and ready and out and out: the majority of the couplets appearing in corpus data constitute novel sequences such as calm and united, gently and effectively, inflation and unemployment, etc., whose formation seems to be governed by the specific lexicogrammatical, discoursal and pragmatic rules pertaining to the production of the texts in which they are encountered. Secondly, although couplets are extremely varied in their structural details, they all seem to fall into a set of identifiable lexicogrammatical patterns. And thirdly, the occurrence of the various dualistic patterns in textual sources with different situational characteristics demonstrates substantial distributional fluctuations.

The above facts indicate that, in order to effectively account for the phenomenon of binomial pairing as it is observed in a corpus of texts, a new and more flexible data-driven framework needs to be devised. In the light of the data used in the present research, rather than a list of structurally and semantically peculiar couplets, binomials are analyzed as an abstract mechanism which speakers have at their disposal for the generation of a very wide range of paired types that serve a variety of important communicative purposes. As a theoretical model for the identification and extraction of binomials from the corpus and the classification of their various lexicogrammatical variants into a set of categories, we exploit the notion of phraseological frame or formal idiom, as posited and developed by Moon (1998:154f) and Fillmore, Kay & O'Connor (1988:505f). This, in very broad terms, represents an abstract structural formula which, as Fillmore et al. put it, 'serves as host' (ibid.:506) to institutionalized expressions as well as novel, spontaneously created forms.

Binomials emerge as a major frame which can be represented by means of the general formula A link B. Our data analysis, which results in the construction of a detailed and comprehensive data-driven taxonomy of binomial patterns, involves, firstly, the identification and extraction of the various binomial forms from our corpus of textual data; secondly, the devising of a prototypical system of abstract representations to which each extant pair is assigned on the basis of its lexicogrammatical attributes; thirdly, the detailed recording of any interesting lexicosemantic preferences displayed by the patterns (for instance their semantic prosodies; Louw 1993 and Sinclair 1996); and, finally, the calculation of the frequency of occurrence of each pattern in the corpus.
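The first step, extraction of candidate pairs matching the A link B frame, can be sketched as a purely lexical pass over the text. The actual study also checked that A and B share a grammatical category and classified the pairs into lexicogrammatical patterns, which this sketch omits:

```python
import re
from collections import Counter

def extract_binomials(text):
    """Extract and count candidate 'A link B' pairs joined by 'and'
    or 'or'. A crude lexical sketch: no check that A and B belong to
    the same grammatical category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("and", "or"):
            pairs[(tokens[i - 1], tokens[i], tokens[i + 1])] += 1
    return pairs

sample = ("The ministers discussed inflation and unemployment. "
          "We searched here and there, up and down, and here and there again.")
for (a, link, b), n in extract_binomials(sample).most_common():
    print(f"{a} {link} {b}: {n}")
```

Even this crude pass illustrates the point made above: alongside the familiar idiomatic couplets (here and there), it surfaces novel pairs (inflation and unemployment) generated by the same abstract frame.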

We also discuss in detail the important but rarely addressed issue of the function of binomials in the communicative process. Specifically, we examine the incidence of the various binomial patterns in each of the six subcorpora comprising our corpus (a set of written publications in book form, both fiction and non-fiction; a broadcasting medium; a semi-specialized periodical publication; two daily newspapers, a broadsheet and a tabloid; and a set of spontaneous and semi-spontaneous spoken texts), and seek explanations for the very substantial distributional perturbations. The main purpose of this exercise is to establish the nature and extent of the correlation between the form and structure of binomial patterns on the one side, and the extralinguistic and situational factors pertaining to each subcorpus on the other, and, thus, to determine the precise functions served by each binomial pattern in communication.

Our data strongly suggest that binomials constitute a phraseological device which makes a highly significant contribution to the communicative process. Our analysis demonstrates that, depending on their structure as well as the type of text in which they are encountered, binomials serve a wide range of communicative functions. For instance, it is shown that the abundant use of informationally dense binomials (e.g. government and parliament, political and monetary, commercial and investment banks) on the part of journalists serves most effectively the institutional requirements of the mass media for factuality, informativeness, precision, conciseness and stylistic uniformity (Crystal & Davy 1969, Tuchman 1978, van Dijk 1988, and elsewhere), whilst simultaneously disguising the highly fragmented process of production of news texts (Bell 1991).

On the other hand, the frequent employment of repetitive, vague or informationally sparse pairs in conversation (ages and ages, here and there, try and get) reflects the efforts of conversationalists in the face of the exigencies of real-time communication. In the context of unplanned talk, binomials act as a lexicalized and, therefore, elegant and well-integrated temporal space which speakers create automatically and with the minimum of cognitive effort whilst coping with delays in the formulation of thought and argument. Binomials in extemporaneous conversation act as a crucial discourse-cohesive device, which helps keep speech 'glued together' (Johnstone 1987), whilst minimizing the effect of fragmentation (Chafe 1982) created by phenomena such as false starts, random repetition (Norrick 1987), etc. At the same time, binomials may be used by speakers as a means of expressing emphasis and emotional involvement and of creating rhetorical presence (e.g. faster and faster, ringing and ringing).

On the whole, the corpus-based structural and situational analysis of binomials not only offers new and significant information on a well-known linguistic phenomenon, it also offers substantial empirical support for the hypothesis that phraseology plays a major part in the accomplishment of the communicative goals of speakers or writers (for a review of relevant studies, see Hatzidaki 1999).


Abraham, R.D. (1950). Fixed Order of Coordinates. Modern Language Journal 34, 276-287.
Bell, A. (1991). The Language of News Media. Blackwell, Oxford.
Bhatia, V. (1994). Cognitive structuring in legislative provisions. In J. Gibbons (ed) Language and the Law, Longman, London.
Chafe, W.L. (1982). Integration and Involvement in Speaking, Writing, and Oral Literature. In D. Tannen (ed) Spoken and Written Language. Ablex, New Jersey.
Cooper, W. E. and Ross, J. R. (1975). World Order. In R. E. Grossman, J. L. San and T. J. Vance (eds) Papers from the Parasession on Functionalism. Chicago Linguistic Society, Chicago.
Crystal, D. and Davy, D. (1969). Investigating English Style, Longmans, London.
Fillmore, C.J., Kay, P. and O'Connor, M.C. (1988). Regularity and Idiomaticity in Grammatical Constructions: The Case of Let Alone. Language 64(3), 501-538.
Fix, U. (1985). Wortpaare im heutigen Deutsch. Sprachpflege 34(8), 112-113.
Gerritsen, J. (1958). More Paired Words in Othello. English Studies 39, 212-214.
Gold, D.L. (1991). Reversible Binomials in Afrikaans, English, Esperanto, French, German, Hebrew, Italian, Judesmo, Latin, Lithuanian, Polish, Portuguese, Rumanian, Spanish and Yiddish. Orbis 36, 104-118.
Gustafsson, M. (1975). Binomial Expressions in Present-day English. Turun Yliopisto, Turku.
Gustafsson, M. (1984). The syntactic features of binomial expressions in legal English. Text 4(1-3), 123-141.
Hatzidaki, O. (1999). Part and Parcel: A Linguistic Analysis of Binomials and its Application to the Internal Characterization of Corpora, Ph.D. Thesis, University of Birmingham.
Héraucourt, W. (1939). Die Wertwelt Chaucers, Carl Winters, Heidelberg.
Johnstone, B. (1987). An Introduction. Text 7(3), 205-214.
Koch, B. J. (1983). Arabic Lexical Couplets and the Evolution of Synonymy. General Linguistics 23(1), 51-61.
Louw, B. (1993). Irony in the Text or Insincerity in the Writer - The Diagnostic Potential of Semantic Prosodies. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology, John Benjamins, Amsterdam.
Malkiel, Y. (1959). Studies in Irreversible Binomials. Lingua 8, 113-160.
Mellinkoff, D. (1963). The Language of the Law. Little, Brown & Co, Boston.
Milic, L. T. (1967). A Quantitative Approach to the Style of Jonathan Swift. Mouton & Co, The Hague.
Moon, R. (1998). Fixed Expressions and Idioms in English. Clarendon Press, Oxford.
Nash, W. (1958). Paired Words in Othello: Shakespeare's Use of a Stylistic Device. English Studies 39, 212-214.
Norrick, N. R. (1980). Semantic Relations and Motivation in Idioms. In E. Weigand and G. Tschauder (eds) Perspektive: Textintern Vol. 1, Niemeyer, Tübingen.
Ohmann, R. M. (1962). Shaw: The Style and the Man. Wesleyan University Press, Middletown.
Potter, S. (1972). Chaucer's Untransposable Binomials. In E. Ohmann, V. Vaananen and A. Kurvinen (eds) Studies Presented to Tauno F. Mustanoja on the occasion of his sixtieth birthday, Modern Language Society, Helsinki.
Sinclair, J. (1996). The Search for Units of Meaning. Textus 9, 75-106.
Tilgner, E. (1936). Die Aureate Terms als Stilelement bei Lydgate. Germanische Studien 182.
Tuchman, G. (1978). Making News. The Free Press, New York.
van Dijk, T. A. (1988). News Analysis. Lawrence Erlbaum Associates, New Jersey.


(5.3) Digital Resources

(5.3.1) ACAD - a Cambridge Alumni Database

John Dawson
University of Cambridge, UK

1. Introduction

From 1922 to 1927, Venn published the four volumes of Part I of Alumni Cantabrigienses, a biographical list of all known students, graduates and holders of office at the University of Cambridge, from the earliest times to 1751. This was followed, from 1940 to 1954, by the six volumes of Part II, covering 1752-1900. Subsequent archival research unearthed much more detail, and many more names, for the period up to 1500, and in 1963 Emden published his two-volume A Biographical Register of the University of Cambridge to 1500. Together, these twelve volumes cover approximately 180,000 names, with some overlap.

It goes without saying that all this information is of the utmost importance for historical research, covering as it does a large proportion of the religious, legal, administrative, medical, academic, and royal appointments in Britain, the Empire, and the Colonies, as well as many other countries. A good deal of social history is also included, albeit patchily.

However, all these publications have a great defect for research: there is no index. Also, since Venn's work, many corrections and additions to the information have come to light, and without the incorporation of this new material it is easy to be misled by the original books. So, to find all the Vicars of Trumpington mentioned by Venn and Emden requires an exhaustive search of twelve large volumes and many card indexes of Addenda and Corrigenda.

2. Database

We therefore set about the creation of an on-line database to make all this information accessible. Other sources, such as the Tripos Lists (lists of degrees awarded), and College Registers (especially those of the women's colleges, which were ignored by Venn) have been included. It is envisaged that the database will be made freely available for public searching on the World Wide Web, though it is not yet clear what mechanism for searching will be provided.

Similar projects, based on Emden's Registers for Oxford and Cambridge, were undertaken in the 1970s.[3] In those studies, the data was highly coded to allow easy cross-tabulation. Several important articles using results from the studies have appeared.[1][2][4] There is, however, no resource for Oxford comparable to that of Venn for the later period.

For many years we were unable to find a simple and reliable way to put the data into machine-readable form. Venn's books are in small hand-set type, printed on thick rough paper, and are full of italics, all of which proved completely intractable to the OCR packages available until recently.

By chance, just as we had found suitable technology to cope with Venn's printing, we discovered that Ancestry.com had already prepared machine-readable versions of most of the volumes of Part II.

Negotiations between Cambridge University Press (the copyright holders) and Ancestry.com soon led to an agreement to share this data, as their product and ours are for essentially different purposes, theirs being mainly accessed for genealogical information. Ancestry.com are also planning to put the remaining volumes of Venn into the computer, and to make the data available to us.

Emden's Biographical Register, the Tripos Lists, and the registers of the women's colleges have proved relatively easy to read using OCR and the services of an excellent methodical proofreader.

3. Structural Analysis

A typical entry from Emden looks like this (with references abbreviated):

Dawson, John (Dauson).*
Entered in C.L. ET 1484;
grace that study for 6 yr in C. and Cn.L. suffice for entry in Cn.L. gr. 1488-9;
Inc. C.L., adm. June 1490 [Ref_1];
R. of Debden, Essex, clk, adm. 17 May 1484;
till death [Ref_2].
Died 1492.
Will dated 10 Aug. 1492; proved 12 Feb. 1493 [Ref_3].
Requested burial in S. Michael's, Cambridge.

and has the following structure:

event 1
event 2 ...

where each event in general comprises:


The initial form of the database is an SGML-tagged text, from which subsequent databases and searching/sorting structures can easily be obtained.
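By way of illustration only, the Dawson entry above might be rendered in an SGML form along these lines (the tag names here are hypothetical, not those of the project's actual DTD):

```sgml
<entry>
  <name>Dawson, John <variant>Dauson</variant></name>
  <event><desc>Entered in C.L.</desc> <date>ET 1484</date></event>
  <event><desc>R. of Debden, Essex, clk</desc> <date>adm. 17 May 1484</date>
         <ref>Ref_2</ref></event>
  <event><desc>Died</desc> <date>1492</date></event>
</entry>
```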

My first attempts at analysis were written in Perl, a widely available string-handling language which allows complex regular expressions. (A regular expression is just a pattern which is used to match parts of the data and extract those parts which can vary.) It soon became apparent that regular expressions complex enough to recognise large-scale structures such as complete entries consumed too much memory in Perl, and the programs frequently failed.

At Cambridge we have a locally-written programmable text editor called NE which has good regular expression handling. It may seem a retrograde step to use a one-off local program like NE in preference to a widely used standard such as Perl, but in our case only the product (the database) is useful; the process used to make the product is different for each text analysed, so the ephemeral nature of the analysis programs is not significant.

Events will in general be split over several input lines, so it was first necessary to combine the lines of a complete paragraph into a single line, then to split them at punctuation such as semicolons, and to put references on separate lines.
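The reassembly step might be sketched as follows, here in Python rather than in NE or Perl; the bracketed reference form `[Ref_n]` is taken from the abbreviated sample entry above and is an assumption about the real data:

```python
import re

def reassemble(paragraph_lines):
    """Join the lines of one paragraph, split the result at
    semicolons into events, and give each trailing bracketed
    reference a line of its own."""
    text = " ".join(line.strip() for line in paragraph_lines)
    events = [part.strip() for part in text.split(";") if part.strip()]
    out = []
    for event in events:
        # Detach a trailing bracketed reference, if any.
        m = re.match(r"(.*?)\s*(\[[^\]]+\])\s*$", event)
        if m:
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(event)
    return out
```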

It was clear that some type of formal, structured, but readable output would be needed in the first instance. This could then be converted automatically as input to any required database package. SGML provides an adequate structure for these needs, and is widely used by publishers of machine-readable databases.

First attempts at analysis were very heuristic, but served to clarify the problems in my mind. Writing a DTD for the SGML structure was then very helpful, as it forced me to take decisions about nesting of fields, etc.

Initially, my regular expressions tried to match complete events, including place names and dates, but two problems arose: the programs ran out of time or store, or NE's regular expression processor found the structure too complex to analyse. Automatically pre-tagging identifiable structures such as dates and place names enabled simpler regular expressions to be written.
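The pre-tagging idea can be illustrated with a simplified sketch (the date pattern and the `<date>` tag name are illustrative assumptions, not the project's actual conventions):

```python
import re

# Month abbreviations as they appear in the sample entry (an assumption).
MONTH = r"(?:Jan|Feb|Mar|Apr|May|June|July|Aug|Sept|Oct|Nov|Dec)\.?"
DATE = re.compile(r"\b(?:\d{1,2}\s+" + MONTH + r"\s+)?\d{4}(?:-\d{1,2})?\b")

def pretag_dates(line):
    """Wrap date-like substrings in <date>...</date> so that the
    later, larger-scale regular expressions can treat each date
    as a single token."""
    return DATE.sub(lambda m: "<date>" + m.group(0) + "</date>", line)
```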

4. Results

A discussion of the complete DTD and the analytical processes used will be presented. Various types of results will be used to illustrate the processing, including complete updated entries amalgamated from all sources, and statistics about certain types of event such as religious appointments. The Figures (see below) are based on only part of the data (one volume of Venn), as the analysis is not yet complete. The Figures should be used with care, because although they represent approximately ten thousand individuals, they are constrained by having surnames beginning 'Abbey' to 'Challis', so a preponderance of one family attending one college may skew the results.

Figure 1 will show the range of ages at admission to all colleges, and holds no surprises. Age at admission is given in only about half of the entries in Venn. Most of the older admissions are men who have already been ordained.

Figure 2 will show the admissions to the two largest colleges, Trinity and St John's, between 1752 and 1900, and illustrates the general increase in size of all the colleges, and hence the whole university, in that period.

Figures 3 and 4 will show the admissions to other colleges during that period (except Downing, Selwyn, and the women's colleges, which were not founded until the nineteenth century). The outstanding feature of Figure 4 is the dramatic increase in admissions at Queens' College from 1821-1830.

Figure 5 will show the number of religious appointments (Curate, Vicar, or Rector) per county.

5. References

[1] Aston, T. H. (1977). 'Oxford's Medieval Alumni', Past and Present, 74.
[2] Aston, T. H., Duncan, G. D., and Evans, T. A. R. (1980). 'The Medieval Alumni of the University of Cambridge', Past and Present, 86.
[3] Evans, R. (1986). 'The Analysis by Computer of A.B. Emden's Biographical Registers of the Universities of Oxford and Cambridge'. In N. Bulst and J.-P. Genet (eds) Medieval Lives and the Historian: Studies in Medieval Prosopography. Medieval Institute Publications, Western Michigan University, Kalamazoo.
[4] Dobson, R. B. (1986). 'Recent Prosopographical Research in Late Medieval English History: University Graduates, Durham Monks, and York Canons'. In N. Bulst and J.-P. Genet (eds) Medieval Lives and the Historian: Studies in Medieval Prosopography. Medieval Institute Publications, Western Michigan University, Kalamazoo.


(5.3.2) New Models for Electronic Publishing

Ronald Tetreault
Dalhousie University, Canada

Roger Chartier rightly warns us that reading on a computer screen is not the same as reading a book when he says that

"electronic representation of texts completely changes the text's status: for the materiality of the book, it substitutes the immateriality of texts without a unique location [...] in place of the immediate apprehension of the whole work, made visible by the object that embodies it, it introduces a lengthy navigation in textual archipelagos that have neither shores nor borders" (18).

Hence any project in the electronic medium that lays claim to scholarly authority will require adaptations in both the presentation and the dissemination of its materials. My work on developing an electronic edition of Lyrical Ballads (by William Wordsworth and Samuel Taylor Coleridge, 1798-1805) helps illustrate how these changes may be realized. Bringing such an edition before the public has made it clear to me that the evolution of electronic publishing will necessitate innovations that are not just technological but also institutional.

On the technical side, Peter Shillingsburg has set out the minimum requirements for a scholarly electronic edition. Such works need to have 1) "a full accurate transcription and full digital image of each source edition," 2) "a webbing or networking of cross-references connecting variant texts, explanatory notes, contextual materials, and parallel texts," and 3) "a navigational system" so that readers can thread their way through this complex hypertext environment (Shillingsburg 28). In preparing an electronic Lyrical Ballads we have tried to meet these standards, first by painstakingly transcribing in text files each of the four lifetime editions of the collection, and by gathering together digital images of every page of every edition. These images will complement our e-texts by being presented in windows set side-by-side on the computer screen. Second, our e-texts will be encoded using TEI-conformant SGML to ensure the preservation of their complex logical structure and to enable hypertext linking that will create a webbing of cross-references and make possible their distribution in a networked environment. The scholars engaged in this project have recently formed a decided preference for online delivery over CD-ROM distribution. Not only is Internet dissemination more efficient and cost effective, it also permits easy correction of errors that are likely to plague any scholarly edition. More important, the fluid nature of the World Wide Web will allow us to take ready advantage of new standards and new forms of interface as they develop. We can constantly add new materials and we can explore new paradigms for their presentation that will help accomplish Shillingsburg's third goal of ready navigation through a multiplicity of variant texts. 
Lately, our new method of marshalling scholarly apparatus, which I have called "dynamic collation", has been enhanced by the addition of pop-up windows generated by JavaScript, which preview variant readings whenever the reader passes the cursor over a revised passage in the electronic text. This reconceptualization of how variant readings can be presented in the digital medium grounds in actual practice David Greetham's proposition that "dismembering scholarly apparatus" will be a consequence of the transition to the new medium (329).

The real challenge for online publication, however, is what might be called the lack of mature institutional structures on the Web. So long as anyone can publish anything on the WWW, the quality of its materials remains questionable. Certainly, Matthew Kirschenbaum is right to remind us that "publication entails a great deal more than simply the act of making public." Hitherto, some degree of authority has been conferred on a website by its association with an institutional host such as a university, as well as by the reputation of its author. But the imprimatur of an established publisher would be an even greater guarantee of the reliability of these immaterial texts. We are in the midst of negotiations that would see the electronic Lyrical Ballads published under the auspices of Cambridge University Press, through the good offices of the website of "Romantic Circles", a reputable online journal hosted by the University of Maryland. By reporting on the progress of these negotiations and their outcome, I hope to indicate ways in which both the Net and established publishers will need to evolve in an electronic world in order to guarantee standards of authority.

The future of electronic publishing depends upon us asking questions about more than technical standards. Online scholarly archives must also meet standards of peer evaluation, editorial practice, and institutional approval that have traditionally ensured the quality of print publications.


Chartier, Roger (1995). Forms and Meanings: Texts, Performances, and Audiences from Codex to Computer. University of Pennsylvania Press, Philadelphia.
Greetham, D. C. (1999). Theories of the Text. Oxford University Press, Oxford.
Kirschenbaum, Matthew (1998). "E-dissertations?" Online posting. 4 Oct. 1998. Humanist Discussion Group. 6 Oct. 1998 <humanist@kcl.ac.uk>.
Shillingsburg, Peter (1996). "Principles for Electronic Archives, Scholarly Editions, and Tutorials." In Richard J. Finneran (ed) The Literary Text in the Digital Age. The University of Michigan Press, Ann Arbor. 23-35.


(5.3.3) Academic Collaboration On Line: The SOL as a Case Study

Ross Scaife
Raphael Finkel
University of Kentucky, USA

William Hutton
College of William and Mary, USA

Elizabeth Vandiver
University of Maryland, USA

Patrick Rourke
Nashoba Valley Technical Vocational High School, USA

Despite the Herculean labors and remarkable achievements of the Perseus Project, today many essential resources for the study of the ancient Mediterranean world still remain accessible only to a few specially trained researchers, because they have never been translated into modern languages or provided with sufficiently convenient interpretive materials. Our current work represents the first step in an attempt to address that problem by engaging the efforts of scholars world-wide in the production of substantial translated and annotated texts that will be made available exclusively through the internet. The text with which we have chosen to begin is the Byzantine encyclopedia known as the Suda, a 10th century C.E. compilation of material on ancient literature, history, and biography. A massive work of about 32,000 entries, written in often dense Byzantine Greek prose, the Suda is nevertheless an invaluable source for many details that would otherwise be unknown to us about Greek and Roman antiquity, as well as an important text for the study of Byzantine intellectual history.

The sheer size of the Suda (the most up-to-date printed edition runs to four hefty and tightly-printed volumes) and its lack of literary charm are sufficient to explain why no individual scholar has committed his or her career to translating it. Many scholars, each taking responsibility for selected entries or series of entries, can get the job done more effectively. Moreover, the vast breadth of subject matter covered by the Suda would challenge the expertise of even the most widely competent modern scholar. By sharing the load, individual translators can focus on those entries from the Suda that pertain to their area of expertise, thus producing better translations and more informed annotations.

Begun in January of 1998, the Suda On Line (SOL) already involves the contributions (or promised contributions!) of nearly seventy scholars throughout the world. The general plan of the project is to assemble an SGML-encoded database, searchable and browsable on the web, with continuously improved annotations, bibliographies and hypertextual links to other electronic resources in addition to the core translation of entries in the Suda. Individual work becomes available on the web as soon as possible, with only the minimum necessary proofreading and editorial oversight. A diverse board of area specialists will eventually edit every entry, altering and improving the content as needed. The display of each entry will include an indication of the level of editorial scrutiny it has received. We want to encourage the greatest possible participation in the project and the smallest possible delay in presenting a high quality resource to a wide public readership.

Collaborative efforts always generate questions about how to allot proper credit to individual contributors. Given the searchable database format of the SOL it is a simple matter for translators and editors to print out their own peer-reviewed work for inclusion in, for example, promotion and tenure dossiers. Moreover, we anticipate that translators will establish hypertextual links directly from their on-line résumés to their contributions in the SOL.

Our choice of the web as the medium for publishing the SOL is crucial to the project's conception. This format has many advantages, of which accessibility and ease of use are perhaps the most obvious. Users can access the project's web page and search the database in various ways: by strings in a full text search, with Boolean combinations, by keyword, by translator, etc. The display for each entry includes the headword in English and Greek (options for displaying Greek suit the requirements of different systems), the translation, footnotes and other annotations, and bibliographical references (where available and/or appropriate). The SOL's interface automatically generates links to the complete Greek text of the entry from the database of the Thesaurus Linguae Graecae and to the relevant entries in Liddell and Scott's Greek Lexicon (via the Perseus Project); further links to both external and internal resources can be created in the text of the translations and the annotations.

The on-line format also allows for continuous editing and updating, which is crucial to our conception of the SOL as an evolving work forever subject to improvement by many hands. We believe that the specific ways in which we have enabled the process of editorial control and our plans for further enhancements are among the most sophisticated now available in any on-line scholarly resource: every aspect of communication among contributors is handled via web-based forms and dynamically generated e-mail. Any work goes onto the web immediately; individual authors do not have to wait for publication until the entire project is finished (as in the case of a print format). Thus, the project can be immediately useable even while the bulk of the work remains to be done. Furthermore, entries can be updated immediately whenever new information arises.

Perhaps the most exciting aspect of a web-based publication such as the SOL, however, is the potential for interoperability with other projects. The SOL is one of many projects involved in a consortial arrangement at the Stoa, which is actively exploring ways to promote the interconnection of distributed projects. Although our goal is to have as much annotation and documentation as possible within the SOL database itself, our translation takes advantage of the natural capacity of web-based documents to be linked with other sources of electronic information. We want the SOL to be one important model for a new generation of hypertext commentary on ancient texts; a SOL fully outfitted with links to other electronic resources will provide not only the Greek text of the Suda and its translation, but also a wide variety of links to other relevant Suda entries, to the ancient vitae of any major authors or other figures mentioned in the text, to all the testimonia, and to essays by various scholars (both public-domain and new essays written specifically for this project). The same model may be used for on-line commentaries for other ancient works, which may in turn be linked to relevant entries in the SOL. The prospects for the on-line production of true variorum editions are vast and exciting.

This copious annotation and hypertextuality will ensure that the on-line Suda is useful not only to classical scholars and historians, but to a much wider audience as well. Students at various levels will be able not merely to read the translated Suda entries but to understand their wider context. A bare translation would be of little use to most non-specialists, but a translation provided with a rich supply of links to other ancient works and to modern scholarship will open a whole world of information to the interested beginner and can still be a valuable research tool for the trained specialist.

The goal of the SOL is not just to be a useful tool for researchers, but to provide a sophisticated model for the kind of scholarship made possible by open source technology and the internet, scholarship that is cooperative rather than solitary, communal rather than proprietary, worldwide rather than localized and evolving rather than static. Accordingly we aim at two principal results: in addition to our development of the Suda On Line itself as a respected scholarly resource, we plan to make a generalized, well-documented version of our software freely available for other scholars to adapt for their own purposes.


Adler, A. (ed) (1928-1938). Suidae Lexicon (5 volumes). Stuttgart.
Perseus: An Evolving Digital Library of Ancient Greece <http://www.perseus.tufts.edu/>
The Stoa: A Consortium for Electronic Publication in the Humanities <http://www.stoa.org/>
The Suda On Line <http://www.stoa.org/sol/>
Thesaurus Linguae Graecae <http://www.uci.edu/~tlg>


(5.4) Posters & Demonstrations

(5.4.1) Dating Dickinson: an Experimental Approach to Stylochronometry

Constantina Stamou
Richard S. Forsyth
University of Luton, UK

Stylometry is the statistical analysis of literary style, whose two primary applications are authorship attribution and chronological problems. It originated in 1851 when Augustus de Morgan suggested that it is possible to settle authorship by determining if one text "does not deal in longer words" than another (Holmes, 1998). Stylometry is based upon the notion that it is possible to detect an author's 'signature' by examining quantifiable features of written texts. The only difference between the two applications is that attributional studies claim that certain features in an author's style are manipulated unconsciously and therefore remain fixed, whilst chronological studies support the idea that stylistic fingerprints evolve smoothly throughout an author's life. The apparent contradiction is resolved by the choice of features: attributional studies rely on features assumed to remain fixed, whilst chronological studies rely on features assumed to change over time.

Stylochronometry, a term used to cover the dating of texts from stylistic evidence, concerns itself with problems of specifying the sequence of composition of the works of a given author. Famous cases are the dating of Plato's dialogues, of certain of Shakespeare's plays, and of the New Testament scriptures, although in such cases the true chronology may never be known, since there is not enough external evidence to corroborate the stylometric findings.

Scientific approaches to chronology begin with the choice of a group of texts that are more or less securely dated, then proceed by applying stylometric methods to those variables which correlate best with the dates of the texts. Once the methods assign the correct dates to this initial test set, the final step is to employ the same methods on disputed cases. Such stylometric variables include high frequency words, function and common words, type-token ratio, vocabulary richness measures and others.
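The core procedure (fit on securely dated texts, then predict disputed ones) can be sketched with a single stylometric variable and an ordinary least-squares line; all numbers below are invented for illustration:

```python
def fit_line(xs, ys):
    """Least-squares line ys ~ a + b * xs for one stylometric variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b  # intercept, slope

# Hypothetical securely dated texts: one feature value and one year each.
feature = [0.10, 0.14, 0.18, 0.22]
year = [1595, 1600, 1605, 1610]

a, b = fit_line(feature, year)
estimate = a + b * 0.16   # estimated date for a text with feature value 0.16
```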

A famous example comes from Brainerd (1980) on the chronology of Shakespeare's plays. Examining the percentage occurrence of 120 lemmata, mainly high-frequency lexical items, combined with the average verse line length in words, the percentage of split lines and the type-token ratio, he concentrated initially on a group of plays with fairly accurate dates of composition. Since only 20 of the 120 lemmata proved to be useful discriminators for chronology, he used them to construct a function that would predict the dates of the control group. Once his method produced the desired results, the final step was to use it on those of Shakespeare's plays whose dates were disputed. Difficulties arose from the possibility of multiple authorship in certain cases, authorial revision at some stage, and the status of the manuscripts used for the preparation of the basic copy texts. However, multivariate statistics proved useful in detecting which plays were likely to be products of multiple authorship.

In poetry, though, it has until recently not been possible to date texts of less than 500 words in length. Forsyth (1999) at BSRU investigated a method of dating short pieces of text (averaging 114 words in length) and tested it on W.B. Yeats's work. This method, among others, will be used in our project, which aims at building on collaborative work begun by Dr Forsyth and Prof. Margaret Freeman of Valley College, California, on the investigation of chronological changes in the style of the American poet Emily Dickinson (1830-1886).

Born in Amherst, Massachusetts, Dickinson lived at her father's house most of her life and in her later years became a recluse. Because of her individualistic style, which, as is now accepted, set her ahead of her time, only 10 of her poems were published during her lifetime. Moreover, due to her difficult handwriting and her idiosyncratic punctuation, they were heavily edited, since the public was not yet prepared for her eccentric masterpieces. At the time of her death, 1775 poems were discovered arranged in 60 small packets. Her relatives then made efforts to have all the poems published, though the poetry was again heavily edited. Her reputation with the American public grew steadily, and in 1955 a complete edition of her work was published by Thomas H. Johnson, this time using her own punctuation and vocabulary. Today she is known for her startling originality, her bold experiments in prosody, her tragic vision, and the range of her intellectual and emotional explorations.

Johnson's edition provides approximate dates of composition (Johnson, 1961), based on a study of the changes in Dickinson's handwriting by Theodora Ward, who collaborated with Johnson. A few poems have precise dates, either because Dickinson sent them as parts of letters to various friends or because she mentions contemporary events.

Our investigation will initially concentrate on control authors that have securely dated works, such as Christina Rossetti and W.B. Yeats. Both poets lived in the 19th century as Dickinson did. It is proposed to utilise a feature-finding program developed by Forsyth & Holmes (1996) at BSRU, a tagger such as TOSCA from Nijmegen University, and a content analysis tool. Thus we will tap into linguistic information of different kinds - lexical, syntactic and semantic. Our aim is to detect the type of linguistic information that is useful for discriminating between the early and late works of our poets with the intention of using the techniques applied on the control authors to date Dickinson's work.

Laan (1995) argues that there is no hard evidence to suggest that authors have both a conscious and an unconscious aspect to their writing style, as stylometry assumes. On the other hand, as Laan (1995) admits, there remain possibilities such as the existence of both a stable and an adaptable part in an author's unconscious style, or the idea that some authors change the unconscious features of their style while others do not. The extent to which such claims are true has been investigated by Robinson (1992) and Keyser (1992), who both suggest proceeding from authors with known publication dates to authors with unknown publication dates.

Initial studies, to be reported at this conference, have investigated the idea that authors generally exhibit a trend towards decreasing complexity as they grow older. Using the Fog Index as a measure of the density of language, based on the proportion of long words and average sentence length, we have found equivocal results. But other measures do seem to show increased simplicity with time. We believe that this brings us a step closer to correct chronology.
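The Fog Index itself is simple to compute: 0.4 times the sum of the average sentence length in words and the percentage of 'complex' words of three or more syllables. A minimal sketch, using a crude vowel-group heuristic to approximate syllable counts:

```python
import re

def fog_index(text):
    """Gunning Fog Index: 0.4 * (average sentence length in words
    + percentage of 'complex' words of three or more syllables).
    Syllables are approximated by counting vowel groups."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = lambda w: max(1, len(re.findall(r"[aeiouy]+", w.lower())))
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```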


Brainerd, B. (1980) "The Chronology of Shakespeare's Plays: A Statistical Study". Computers and the Humanities 14, 221-230.
Forsyth, R. S. (1999) "Stylochronometry with Substrings Or: A Poet Young and Old". Literary and Linguistic Computing. 14(4), 1-11.
Forsyth, R.S and Holmes, D.I. (1996) "Feature Finding for Text Classification". Literary and Linguistic Computing. 11(4), 163-174.
Holmes, D. I. (1998) "The Evolution of Stylometry in Humanities Scholarship" Literary and Linguistic Computing. 13(3), 111-117.
Johnson, T.H. (ed) (1961) The Complete Poems of Emily Dickinson. Little, Brown and Company, Boston.
Keyser, P. (1992) "Stylometric Method and the Chronology of Plato's Works (review article)".Bryn Mawr Classical Review. 3(1), 58-74.
Laan, N.M. (1995) "Stylometry and Method: The Case of Euripides". Literary and Linguistic Computing. 10(4), 271-278.
Robinson, T.M (1992) "Plato and the Computer". Ancient Philosophy. 12, 375-382.


(5.4.2) A Comparison of Methods for the Attribution of Authorship of Popular Fiction

Fiona J. Tweedie
University of Glasgow, UK

Lisa Lena Opas-Hänninen
University of Joensuu, Finland


In this poster we present the stylistic analysis of a number of popular fiction genres. Popular fiction generally receives less academic attention than literature, but its ability to draw the reader into the story is noteworthy. In previous work (Opas and Tweedie, 1999a, 1999b) we have examined measures of stance in an attempt to quantify this degree of reader involvement. In this paper we turn to measures used to discriminate between authors, in order to find consistent differences between genres and authors.

Textual Sources

We have taken texts from three distinct sources: romance novels, detective novels and American short stories. Our total corpus is 590,000 words. We have analysed romance novels published between 1990 and 1996 from the Harlequin Presents and Regency Romance series. We have also analysed Danielle Steel's works, which are classified as women's fiction or 'cross-overs'. The romance texts make up 245,000 words. The detective fiction part of our corpus is made up of popular contemporary female authors published in the 1990s, i.e. Cornwell, Grafton, James, Leon, Peters, and Rendell. Where an author has created many detectives, we chose the best-known one to represent that author. Some of the detectives are male and others female and we expect them to express stance differently. These texts make up 295,000 words. Short stories were also taken from the works of Carver and Barth. These make up almost 50,000 words.


We will compare and contrast the results from three analyses of these texts. The analyses are based on methods used in determining authorship: the frequency of the most common words, letter frequency and measures of vocabulary richness. The data from each of these procedures is then used in a principal components analysis in order to identify the most important elements.

1) Word frequencies The use of principal components analysis of the most common words to determine authorship was proposed by Burrows in 1987 and has become an essential tool for stylistic analysis. Here, the most commonly-occurring forty words were employed. Their frequencies were measured and standardised for text length. A principal components analysis was then carried out and the texts plotted in the principal components space. The first two principal components accounted for 32.2% of the total variation. The first principal component separates the romantic Steel texts, with high negative scores, from the American short stories which have high positive scores. Detective stories by Sue Grafton and Patricia Cornwell also have high positive scores on this axis. The second principal component appears to act as a rough genre separator; romantic texts tend to have positive scores and all of the detective novels have negative scores. Consideration of a loadings plot indicates that the Steel texts use a high proportion of "she", "her" and "they", while the short stories use "at", "said" and "on". The Grafton and Cornwell texts are written in the first person and this is highlighted by their use of "me", "my" and "I".
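A minimal sketch of this Burrows-style procedure (tokenisation, per-1000-word standardisation, then PCA via singular value decomposition); any texts passed in would be stand-ins for the real corpus:

```python
import re
from collections import Counter
import numpy as np

def word_freq_matrix(texts, n_words=40):
    """One row per text, one column per common word, as
    occurrences per 1000 words (standardised for text length)."""
    tokens = [re.findall(r"[a-z']+", t.lower()) for t in texts]
    overall = Counter(w for tk in tokens for w in tk)
    common = [w for w, _ in overall.most_common(n_words)]
    counts = [Counter(tk) for tk in tokens]
    return np.array([[1000 * c[w] / len(tk) for w in common]
                     for c, tk in zip(counts, tokens)]), common

def pca_scores(X, k=2):
    """Scores of each text on the first k principal components."""
    Xc = X - X.mean(axis=0)          # centre the variables
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T             # project onto the top k axes
```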

2) Letter frequencies Ledger and Merriam use letter frequencies in their analysis of Shakespearean texts with remarkable success. Here we consider the relative frequencies of 'A' - 'Z', with capital and lower-case letters amalgamated. These 26 variables are then subjected to principal components analysis. The results are plotted in the first two dimensions of the principal components space, which account for 34.5% of the total variation. In this analysis the separation is not as good as in the word frequency analysis. The American texts tend to have high negative scores on the first principal component, while the texts by P. D. James have very high positive scores.
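The 26 input variables for this analysis are straightforward to compute; a minimal sketch:

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequencies of 'a'-'z', with capitals amalgamated
    into lower case, as a 26-element list."""
    letters = [c for c in text.lower() if 'a' <= c <= 'z']
    counts = Counter(letters)
    total = len(letters)
    return [counts[chr(i)] / total for i in range(ord('a'), ord('z') + 1)]
```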

3) Measures of Vocabulary Richness A great number of measures of vocabulary richness have been proposed. Tweedie and Baayen (1998) carry out a review of these measures and find that two, K and Z, contain the vast majority of the information from the author's vocabulary. Yule's K measures the 'repeat-rate' used by the author, while Orlov's Z measures vocabulary richness in the sense of the number of different words used. We therefore plot the texts in the K-Z plane. As might be expected, the American short stories are found to have a low repeat rate and high vocabulary richness. The Steel texts have a higher richness than the other romantic texts, but the detective and romantic fiction texts are not separated by this analysis.
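Yule's K has a closed form and is easy to compute from a text's frequency spectrum; Orlov's Z, by contrast, must be solved for numerically, so only K is sketched here:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K: 10^4 * (sum over m of m^2 * V(m) - N) / N^2, where
    V(m) is the number of word types occurring m times and N is the
    total number of tokens.  A higher K means a higher repeat rate."""
    N = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(m * m * Vm for m, Vm in freq_of_freqs.items())
    return 10000 * (s2 - N) / (N * N)
```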

3. Conclusions

These three analyses offer views of different facets of the style of popular romantic and detective fiction. The genres are most clearly separated when the most common words are used as data, while the letter frequency analysis is, not surprisingly, more affected by the particular names of heroes or heroines. The measures of vocabulary richness distinguish clearly between the more popular texts and the short stories. At the conference we shall also present the analysis of markers of stance, used in Opas and Tweedie (1999a, 1999b).


Burrows, J. F. (1987) Word-patterns and story-shapes: the statistical analysis of narrative style. Literary and Linguistic Computing 2(2): 61-70.
Ledger, G. and Merriam, T. (1994) Shakespeare, Fletcher and the Two Noble Kinsmen. Literary and Linguistic Computing 9(3): 235-248.
Opas, L. L. and Tweedie, F. J. (1999a) The Magic Carpet Ride: Reader Involvement in Romantic Fiction. Literary and Linguistic Computing 14(1):89-101.
Opas, L. L. and Tweedie, F. J. (1999b) Come into my World: Styles of Stance in Detective and Romantic Fiction. Abstracts of the ALLC/ACH conference 1999, Virginia, 247.
Tweedie, F. J. and Baayen, R. H. (1998) How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities 32(5): 323-352.


(5.4.3) Computational Methods for the Study of Multilingual Corpora

Silvia Hansen
University of the Saarland, Germany

Corpus linguistics is becoming increasingly important for translation studies (cf. Baker, 1995; Granger, 1999). In the past, the application of corpus linguistic methods was limited to the applied branch of this discipline. In particular, they were used in the fields of terminology, translation aids (e.g., to develop translation memories or machine translation programs), translation criticism and translation training (to improve the final product with the help of corpus-based contrastive analysis and the study of translationese). Recently, corpus linguistic methods have also been introduced into the theoretical and descriptive branches of translation studies. In particular, one issue that is receiving more and more attention is the question of translation as a particular text type (Baker, 1996; Laviosa-Braithwaite, 1996; Teich, 1999; Hansen, 1999). In this paper, I present the analysis of a corpus of translated texts and its comparison with a corpus of originals produced in the target language in order to investigate the universal features of translations (cf. Baker, 1996). Furthermore, on the basis of the analysis of universal features, I analyse the source language texts in order to see what has happened during the translation process. Thus, my aim is to identify both the universal features of translation (comparing the translation corpus with the originals of the target language) and, on this basis, the translation procedures (comparing the translation corpus with the originals of the source language). In particular, I discuss the use of various standard corpus tools, such as concordance programs, aligners and taggers, for the analysis of parallel and comparable corpora. The use of these tools is limited, however: only parts of speech and grammatical categories can be analysed with their help.
Thus, it is not possible to say anything about translation procedures, translation strategies or the translation process, because the results gained through standard corpus tools are quantitative values (cf. Hansen & Teich, 1999). But we need qualitative data, i.e., a linguistic description of the phenomena which occur in the translations, to test hypotheses concerning the translation process and the universal features of translations. In order to use the information provided by the standard corpus tools and to carry out deeper investigations, we need tools which are able to analyse more abstract linguistic categories. For this reason, we use the tool TATOE (http://www.darmstadt.gmd.de/~rostek/tatoe.htm), with which we annotate the corpus using Systemic Functional Linguistics (SFL; Halliday, 1978; Halliday, 1985). The systemic functional model, which allows the analysis of the relationships between the different linguistic levels (grammar, semantics, context), is used in various disciplines, e.g. language teaching, functional stylistics, grammatical text analysis, and computational linguistics (in the last discipline especially for automatic text generation; cf. Teich, 1995; Bateman, 1997). TATOE enables us to define systemic functional categories and, on this basis, to annotate the texts. These annotations make a systemic functional analysis of the parallel and comparable corpora possible, and thus a cross-linguistic description of the phenomena which occur in the texts. On this basis, hypotheses concerning the translation process and the universal features of translations can be tested and new ones can be generated.
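The kind of output a standard concordance program produces can be sketched as a minimal Key Word In Context (KWIC) routine (a toy illustration with an invented sentence, not one of the actual tools used in the study):

```python
def kwic(tokens, keyword, width=4):
    """Key Word In Context: list each occurrence of `keyword`
    with up to `width` tokens of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30} | {tok} | {right}")
    return lines

sample = "the translation follows the structure of the source text".split()
concordance = kwic(sample, "the")
```

Output of this kind is purely quantitative and distributional, which is exactly the limitation discussed above: the abstract categories of SFL are not recoverable from it.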


Baker, M. (1995). Corpora in translation studies: An overview and some suggestions for future research. Target 7(2):223-243, 1995.
Baker, M. (1996). Corpus-based translation studies: The challenges that lie ahead. In H. Somers (ed) Terminology, LSP and Translation: Studies in Language Engineering in Honour of Juan C. Sager, pp. 175-186. Benjamins, Amsterdam.
Bateman, J. (1997). KPML Development Environment: multilingual linguistic resource development and sentence generation. Deutsches Forschungszentrum Informationstechnik (GMD), Institut für Integrierte Publikations- und Informationssysteme (IPSI), Bonn (Birlinghoven).
Granger, S. (ed) (1999). Proceedings of the Symposium 'Contrastive Linguistics and Translation Studies: Empirical Approaches', Louvain-la-Neuve, Belgium, February 1999.
Halliday, M.A.K. (1978). Language as social semiotic. Edward Arnold, London.
Halliday, M.A.K. (1985). An introduction to Functional Grammar. Edward Arnold, London.
Hansen, S. (1999). A Contrastive Analysis of Multilingual Corpora (English-German). Diploma Thesis, University of the Saarland, Saarbrücken.
Hansen, S. and Teich, E. (1999). Kontrastive Analyse von Übersetzungskorpora: ein funktionales Modell [Contrastive analysis of translation corpora: a functional model]. In J. Gippert (ed) Sammelband der Jahrestagung der GLDV 99, pp. 311-322. Frankfurt a. Main.
Laviosa-Braithwaite, S. (1996). The English Comparable Corpus (ECC): A Resource and a Methodology for the Empirical Study of Translation. PhD Thesis, UMIST, Manchester.
Teich, E. (1995). Towards a methodology for the construction of multilingual resources for multilingual generation. Proceedings of the IJCAI workshop on multilingual generation, International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, August 1995, 136-148.
Teich, E. (1999). Towards a model for the description of cross-linguistic divergence and commonality in translation. In E. Steiner and C. Yallop (eds) Beyond content: Exploring translation and multilingual text production. Mouton de Gruyter, Berlin.


(5.4.4) A System for Dynamic Text Corpus Management (with an Example Corpus of the Russian Mass Media of the 1990s)

Grigori Sidorov
National Polytechnic Institute (IPN), Mexico

Anatoly Baranov
Mikhail Mikhailov
Russian Academy of Sciences, Russia

We present a system for text corpus processing which is built around the idea of a "dynamic text corpus". With its help a user can search for examples of usage (words, phrases, and even morphemes), build word lists and concordances, and compile his own subcorpora. The software was used in compiling a text corpus of the modern Russian mass media: a collection of texts from Russian newspapers and magazines of the 1990s with a total size of about 15 Mb. Each text in the corpus is classified by a set of parameters, including source, date, author(s), genre, and topic(s). These parameters are later used to generate subcorpora that conform to users' needs.


Corpus linguistics is the part of computational linguistics that deals with the problems of compilation, representation, and analysis of large text collections. One of the most complex problems in modern corpus linguistics is defining the principles of text corpus compilation. Ideally, a text corpus should meet the criterion of representativeness and at the same time be much smaller than the whole domain it represents. On the other hand, the representativeness of a text corpus is directly connected with the research objectives. For example, research on text macrostructure needs quite different parameters than sociolinguistic research or the description of the contexts of usage of a certain morpheme or word. The difficulty of reconciling statistical representativeness and user demands means that many existing corpora have no explicit and clear criteria for text selection. For example, there are no clear-cut criteria of item selection for the well-known Birmingham corpus of English texts; the situation is the same with the German text corpora. We suggest a definite strategy for text corpus compilation that allows a user to create his own subset of texts from a corpus for his own task (as a new subcorpus). We call the initial text corpus, which is the source for further manipulations and selection, plus the corresponding software, a dynamic text corpus. For the compilation of our corpus we used texts from the Russian mass media of the 1990s.

General strategy of initial text corpus compilation

Taking into account the requirement of representativeness, we paid special attention to choosing the most prominent mass media editions of different political orientations, which were of considerable importance for society during the period covered by the research (the 1990s), and to representing them in proportion to their popularity and significance. As the criterion of popularity we used the results of the last elections, in which, roughly speaking, 25 percent voted for the communists, 10 for the ultra left, 25 for the right, and 40 for the center. The second important factor in compiling the corpus was the quantity of texts: there should be enough texts to reflect the relevant features of the domain. The upper limit was connected only with pragmatic considerations, namely disk space and the speed of the service software. In our case, during the project, which took place in 1996-1998, we collected around 15 Megabytes of text. As stated above, different users have different tasks and expect different things from a text corpus. It is also necessary to take into account the fact that some users may not be linguists. Such users may be interested in the reflection of certain events in the mass media during a certain period, and it is probable that they would like to read whole texts and not just concordances. To accommodate these different requirements it is necessary to compile the text corpus not of extracts from the texts but of whole texts. The idea of using extracts (so-called sampling) was popular at the early stage of corpus linguistics, e.g., in the famous "Brown corpus", which consists of 1000-word-long text extracts. It is also necessary to take into account that linguists from different linguistic areas have different requirements of text corpora. For example, for morphological or syntactic research a 1 million word text corpus would be sufficient.
Sometimes it is even more convenient to use a relatively small corpus, because concordances of function words may occupy thousands of pages and most of the examples will be trivial. However, even for grammar research it seems reasonable to have texts of different structure and genre in the corpus. At the same time, the text corpus should be large enough to ensure the presence of rare words; only in this case is the corpus interesting for a lexicologist or a lexicographer. Thus, the task of the compilers of a text corpus is to take into account all the different and sometimes contradictory user requirements. We suggest allowing the user to construct his own subset of texts (his own corpus) from the dynamic text corpus. To ensure this possibility, each document has a certain search pattern which allows the software to filter the initial corpus and construct a corpus which fits the needs of the user.

Encoding of corpus units

After the analysis of the text data, the following parameters were chosen as corpus-forming:

  1. Source (the mass media printed editions);
  2. Author (about 1000 authors);
  3. Title of the article (1369 articles);
  4. Political orientation (left, ultra left, right, center);
  5. Genre (memoir, interview, critique, discussion, essay, reportage, review, article, feuilleton);
  6. Theme (internal policy, external policy, literature, arts, etc.; 39 themes in total);
  7. Date (exact date of publication; in our case, articles published during the 1990s).
The following printed editions (magazines and newspapers) were used: VEK, Druzba Narodov, Zavtra, Znamia, Izvestiya, Itogi, Kommunist, Literaturnaya gazeta, Molodaya gvardiya, Moskovskiy komsomolec, Moskovskie novosti, Nash sovremennik, Nezavisimaya gazeta, Novyi mir, Ogonyok, Rossiiskaya gazeta, Russki vestnik, Segodnya, Sobesednik, Sovetskaya Rossiya, Trud, Ekspert, Elementy, Evraziiskoe obozrenie. Every text in the corpus is characterized by a set of these features; at the current stage this classification was done manually. The best represented sources are the following: Vek (8%), Zavtra (14%), Itogi (11%), Literaturnaya gazeta (6%), Moskovskie novosti (8%), Novy mir (8%).
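Subcorpus selection along these parameters can be sketched as follows (the record layout and the sample entries are hypothetical illustrations, not the actual database schema of the system):

```python
from dataclasses import dataclass

@dataclass
class Text:
    """One corpus record with its corpus-forming parameters."""
    source: str
    author: str
    title: str
    orientation: str  # left, ultra left, right, center
    genre: str
    theme: str
    date: str         # exact date of publication
    body: str

def subcorpus(corpus, **criteria):
    """Select the texts whose metadata match all given parameter values."""
    return [t for t in corpus
            if all(getattr(t, field) == value for field, value in criteria.items())]

# Hypothetical sample records.
corpus = [
    Text("Zavtra", "A. Ivanov", "On policy", "left", "article",
         "internal policy", "1996-03-01", "..."),
    Text("Itogi", "B. Petrov", "Art today", "center", "review",
         "arts", "1997-08-15", "..."),
]
left_articles = subcorpus(corpus, orientation="left", genre="article")
```

Any combination of parameters yields a new subcorpus in the same way.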

Software description

A text corpus is incomplete and hard to work with without software that provides a user-friendly interface and allows different kinds of processing. A general problem of corpus software is selecting the texts to work with: if the user wants to deal with only certain parts of the corpus, he has to do it manually by choosing file names. This is typical of corpus software and it is not convenient. The other possibility, having all text files merged, simply does not allow any additional selection in the corpus. In our system, however, it is possible to select texts automatically using their feature sets: all the user has to do is describe his requirements for his own corpus. We should mention that the collection of texts with descriptions is only raw material, whereas in the traditional technology it is the final result. In the technology suggested in this article, the 'big corpus' is a source for the compilation of subcorpora answering the user's needs with greater accuracy. The initial text corpus is stored as a database where each text is a record and each parameter is a field; the texts of the articles are stored in a MEMO field. Importation of the manually marked articles into the database is performed by a special utility. On the basis of this information a user can create his own corpus by indicating a set of parameters. He does this by going through a sequence of dialogue routines, answering questions or choosing from lists. The resulting corpus is a text file containing the texts matching the selected parameters. The system offers the following main functions:

  1. Standard browsing of the texts and their parameters.
  2. Selection and ordering of texts according to the chosen parameters or their logical combinations. The system has a standard set of QBE queries which are translated automatically into SQL. The experienced users can write SQL queries directly.
  3. Generating a text corpus that is a subset of the initial corpus on the base of a stochastic choice and the given percentage for each parameter.
  4. Generating a user's text corpus.
  5. Browsing the user's text corpora and text processing: building concordances or word lists.
The program contains four standard variants of the initial corpus: the whole corpus and three proportional subsets, each containing 25% of the initial corpus, for the parameters source, theme, and genre.
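The stochastic generation of a proportional subset (function 3 above) can be sketched as stratified random sampling over one parameter (a simplified sketch; the grouping key, the sample records and the uniform percentage are illustrative, while the real system works over the database described earlier and accepts a percentage per parameter value):

```python
import random

def proportional_sample(texts, key, fraction, seed=0):
    """Draw a stochastic subset that keeps roughly `fraction` of the texts
    within each value of the chosen parameter (e.g. source, theme, genre)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    groups = {}
    for t in texts:
        groups.setdefault(t[key], []).append(t)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical records: 8 essays and 4 interviews.
texts = [{"id": i, "genre": g}
         for i, g in enumerate(["essay"] * 8 + ["interview"] * 4)]
quarter = proportional_sample(texts, "genre", 0.25)
```

Because sampling is done per group, the subset preserves the proportions of the full corpus for the chosen parameter.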


We have developed a system that implements dynamic text corpus management (the software is included in the notion of the dynamic corpus) and applied it to Russian mass media texts; the system itself is applicable to any corpus. All texts of the corpus are classified according to the parameters described above, and the system ensures easy corpus processing for the user. The corpus is representative with respect to the chosen parameters: all values and their combinations are represented in the corpus (except the impossible ones; e.g., the magazine Novy mir (a literary magazine) has no articles on finances, and the magazine Expert (a financial magazine) has no articles on literature).
