(9.1) Computational / Corpus Linguistics

(9.1.1) On Consensus between Tree-representations of Linguistic Data

Michel Juillard
Xuan Luong
Université Nice-Sophia Antipolis, France

1. Introduction

One of the aims of modern linguistics, particularly of the computational persuasion, is to infer, from the ever-growing mass of actual data available, the implicit, virtual organization underlying the apparent disorder and diversity of surface phenomena.

This ever-present crucial duality is also at work in computational linguistics where the chief question is how to reach, beyond the teeming, bristling surface of observed individual facts, for the latent abstract organisation, thus enabling the observer (i.e. the linguist) to gain access to knowledge that can be generalized.

2. Trees

Tree-representation is a powerful means of evincing the inherent structure of mutually dependent data.

Scholars in the main fields of taxonomy regularly and successfully avail themselves of tree-structures, e.g. genealogies, pedigrees and phylogenies.

Chomsky's syntagmatic trees have grown under every clime, but they are far from being the sole way of representing the dependence or independence of linguistic objects by means of a hierarchic tree in which clearly outlined categories are paired and embedded.

Frequently enough, modern linguists tend to be interested more in the relative closeness of objects than in their belonging to this or that closed class. Additive, as opposed to hierarchic, trees do away with watertight partitions between objects and lay the stress on notions such as proximity and opposition. Figure 1 illustrates this new way of representing textual data. The linguistic units under scrutiny here are modals and auxiliaries in a body of contemporary English poetry.

The information contained in figure 1 is rich and clear. The organisation of the whole structure rests on the notions of proximity and opposition. The present auxiliaries have and be are closely associated, while their past-form counterparts had, was and were form a distinct group in the opposite part of the tree. More generally, the top of the tree can be seen as gathering the past forms, whereas the present tense forms congregate in the bottom part. Deleting one of the edges of the structure breaks the tree down into two connected components. The case of should and would is interesting in that the two modals occupy an intermediate position between past and present, which reflects their specificities in the actual texts.

3. Going further with trees

The unrooted-tree representation (figure 5) makes conspicuous those properties of coordinating conjunctions that were, of course, impossible to discern in the table of occurrences, let alone in the lines of the original text (the complete poetry of Day Lewis).

The figure opposes the very tightly-knit pair but-and to the rest of the data which, in turn, form three other groups of coordinators (respectively either-or-which, neither-nor, then-yet-than) that are similar in behaviour, although more independent of each other than the previous two (and-but).

Figure 6 illustrates the behaviour of the same grammatical units in a body of contemporary English novels.

Without going into unnecessary detail, it is clear that this tree imposes unity or, at any rate, very close proximity on elements that evinced more independence in the previous tree (figure 5), neither-nor and either-or now forming more conspicuous pairs, while than becomes more closely associated with the structure as it teams up with but.

3.1. Fusion of the previous two trees

The question of course arises of the possibility of representing the two distinct sets of original data in one tree-figure, considering that they correspond to the same grammatical units at work in two provinces of literature (poetry and the novel) but in the same language and in the same period of time.

Since these two original sets of data are technically disparate, it is impossible to start from the initial numbers as such - for instance, by adding them up.

The only procedure available is to attempt to achieve a consensus by fusion of the original two trees by means of a new algorithm which we have just devised. Figure 7 is the product of this fusion algorithm.

It is interesting to observe first of all that this representation does sum up the information of trees 5 and 6. Not only are the properties of each single separate tree preserved, which is indeed a prerequisite, but there also emerges a more legible picture of the actual syntagmatic roles and affinities of the function-words under scrutiny here. The correlative conjunctions either-or and neither-nor are more satisfactorily grouped together, then enters into a close set with but and and, while the proximity of than and then on the tree is evocative of their common etymology, although they do not form a set stricto sensu.

3.2 The fusion algorithm

The fusion algorithm is derived from the topological properties of the tree.

Consider two trees A and B. Let VA and VB be the matrices of the corresponding neighbourhoods.

One notes as VAB the Cartesian product whose elements are (x,y) with x ∈ VA and y ∈ VB. We shall build on VAB a preorder induced by the preorders of the neighbourhood levels of VA and of VB, and define a neighbourhood relation on VAB compatible with the topologies of A and B. The fusion of the two trees shall ensue.

Preorder:

Two elements (x,y) and (u,v) of VAB are:

- ordered by the relation < if and only if x+y < u+v
- equivalent by the relation ≈ if and only if x+y = u+v

Neighbours:

A set G of elements of X is made up of neighbours if and only if all the pairs of distinct elements of G are:

- minimal by the relation < in VAB, and
- equivalent by the relation ≈ in VAB.

Algorithm

Calculate the neighbourhood matrices of A and of B.

Build VAB.

(iter): Look for the minimal elements of VAB. Use VAB in order to determine the neighbours.

- Each set of neighbours G is represented by a single one of its elements z. For each set G of k elements, remove from X the k-1 elements other than z, and delete the corresponding rows and columns of VAB.

If the numbers of rows and columns are still larger than 3, go to (iter); otherwise the algorithm ends.
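The abstract does not spell out the data layout of the neighbourhood matrices, so the following minimal Python sketch assumes that va and vb are symmetric matrices of integer neighbourhood levels over the same n leaves of trees A and B, and that collapsing a neighbour set simply deletes the corresponding rows and columns. It illustrates the loop above; it is not the authors' implementation.

def fuse(va, vb):
    """Collapse sets of mutual neighbours of V_AB until 3 or fewer leaves remain."""
    alive = list(range(len(va)))                      # leaf indices still in X
    while len(alive) > 3:
        # Preorder on V_AB: (x, y) < (u, v) iff x + y < u + v.
        sums = {(i, j): va[i][j] + vb[i][j]
                for a, i in enumerate(alive) for j in alive[a + 1:]}
        m = min(sums.values())
        minimal = [pair for pair, s in sums.items() if s == m]
        # Minimal pairs are mutually equivalent; merge overlapping pairs
        # into neighbour sets G.
        groups = []
        for i, j in minimal:
            for g in groups:
                if i in g or j in g:
                    g.update((i, j))
                    break
            else:
                groups.append({i, j})
        # Represent each G by one element z; drop the other k-1 elements,
        # i.e. delete their rows and columns from V_AB.
        for g in groups:
            z, *rest = sorted(g)
            for r in rest:
                if r in alive:
                    alive.remove(r)
    return alive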


(9.1.2) An Electronic Lexicon of Nominalizations: NOMLEX

Catherine Macleod
Ralph Grishman
Adam Meyers
New York University, USA

New York University (NYU) has recently completed NOMLEX, a dictionary containing detailed syntactic information about 1000 common English nominalizations. This dictionary, which has been developed for use in natural language processing, is freely available from NYU.

History:

Previous dictionary work at NYU includes COMLEX Syntax [Macleod'97], a large syntactic dictionary with detailed information on the syntactic properties of English words, especially with regard to complement structure. This dictionary is available through the Linguistic Data Consortium (LDC) and is presently being used by several natural language processing (NLP) groups.

Nominalizations present a special problem, in that one wants not only to analyze their syntactic structure, but also to relate them to corresponding verbal structures. This is essential for natural language interpretation in applications such as question answering and information extraction. For example, the answer to the question "How badly was the city bombed?" could be phrased "The city was destroyed completely" or "The destruction of the city was complete." These answers, though syntactically diverse, contain exactly the same information.

Dictionary structure:

The challenge in designing the NOMLEX entries is to provide, in reasonably compact form, all the lexically-specific syntactic information required to relate nominal arguments to the corresponding verbal arguments. This task is complicated because of the wide range of nominal argument structures, which makes a direct enumeration of all possible correspondences hopelessly unwieldy. In particular, the core arguments and oblique arguments behave differently in this respect.

The core arguments of the verb (subject, object, indirect object) may appear as possessive determiners (DET-POSS), noun-noun modifiers (N-N-MOD), or in a prepositional phrase commonly introduced by "of" (PP-OF). Examples of this are: "His death" (where "his" represents the one who dies and therefore the subject of the verb "die"), "the price adjustment" (where "price" is the object of the verb "adjust"), and "the discussion of the case" (where "case" is the object of the verb "discuss" and the analysis would be "X discuss the case").

More complex verbal complementation, such as sentential and verbal complements, is found following the nominalization and often retains the same structure. For instance, the complement "that he came" is the same for the nominalization as for the verb, as seen in the following examples.

"Someone report(ed) that he came."
"The report that he came."

Some verbal complements must have an introductory preposition in order to appear as nominalization complements. For example, "Someone questioned whether it was a wise plan." versus "The question of whether it was a wise plan". We encode all these possibilities in the lexical entries of the nominalizations. The NOMLEX entry for "appointment" is as follows.

(NOM :ORTH "appointment"
     :VERB "appoint"
     :PLURAL "appointments"
     :NOUN ((EXISTS))
     :NOM-TYPE ((VERB-NOM))
     :VERB-SUBJ ((N-N-MOD) (DET-POSS))
     :SUBJ-ATTRIBUTE ((COMMUNICATOR))
     :OBJ-ATTRIBUTE ((NHUMAN))
     :VERB-SUBC ((NOM-NP :OBJECT ((DET-POSS) (N-N-MOD) (PP-OF)))
                 (NOM-NP-PP :OBJECT ((DET-POSS) (N-N-MOD) (PP-OF))
                            :PVAL ("for" "to"))
                 (NOM-NP-TO-INF-OC :OBJECT ((DET-POSS) (PP-OF)))
                 (NOM-NP-AS-NP :OBJECT ((DET-POSS) (PP-OF)))))

This sample entry, in combination with a set of defaults, provides a good deal of information: for example, that the verbal subject of "appoint" may appear in the nominalization as a noun-noun modifier or a possessive determiner; that the object may appear as a possessive determiner, a noun-noun modifier or an of-phrase; and that "appointment" supports the verb's NP-PP (with "for" or "to"), NP-to-infinitive, and NP-as-NP complement patterns.

In addition, the prepositions in the verb complement are realized as the same prepositions in the nominalization complement. Due to space considerations, we must leave out much of the detail regarding the interpretation of this dictionary entry. In particular, we have a system of rules and defaults which prevents spurious ambiguity and allows us to keep the lexical entries compact. One example: because a by-PP is so often a marker of the subject of the verb, we treat this mapping as a default, marking nominalizations that do not allow it with NOT-PP-BY. For more detail about NOMLEX, please see our web site: <http://cs.nyu.edu/cs/projects/proteus/nomlex/index.html>
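
As an illustration of how such an entry and the PP-by default might be held and queried in software, here is a rough Python rendering. The field names and the NOT-PP-BY check are schematic inventions of ours; NOMLEX itself is distributed in the s-expression format shown above.

# Schematic rendering of the "appointment" entry (field names are ours).
appointment = {
    "orth": "appointment",
    "verb": "appoint",
    "verb_subj": ["N-N-MOD", "DET-POSS"],
    "features": [],                     # e.g. ["NOT-PP-BY"] when blocked
}

def subject_positions(entry):
    """Nominal positions that may realize the verbal subject."""
    positions = list(entry["verb_subj"])
    if "NOT-PP-BY" not in entry["features"]:
        positions.append("PP-BY")       # by-PP subject mapping as default
    return positions

print(subject_positions(appointment))   # ['N-N-MOD', 'DET-POSS', 'PP-BY']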

Using NOMLEX:

To demonstrate the utility of NOMLEX we constructed two NLP applications that depend on NOMLEX entries to analyze nominalization phrases.

  1. A program that transforms a nominalization phrase into one or more sentences, corresponding to the possible senses of the nominalization phrase. For example, "Rome's destruction of Carthage" has exactly one sense, which our program would paraphrase as "Rome destroyed Carthage". The program takes a grammatical analysis (a parse) of a nominalization phrase as input and uses NOMLEX to create grammatical analyses of the corresponding sentences, copying noun phrases and complement phrases from the various nominal positions (N-N-MOD, DET-POSS, post-noun, etc.) into the appropriate sentential positions (SUBJECT, OBJECT, etc.). Actual sentences are then generated from these parses. The output of this program could be used as input to any NLP application designed to operate on full sentences. (A toy sketch of this mapping follows this list.)
  2. A program which converts an information extraction pattern designed for sentences into a set of information extraction patterns designed for nominalization phrases ([Meyers'98]). For our purposes, an information extraction pattern is a pattern used to identify events in text and correctly mark the participants of these events. One such system extracts information about corporate hirings, firings, resignations, etc., including the identification of who left which company, who joined which company, the positions they left, the positions they attained, dates, etc. Our program converts a pattern that analyzes "IBM appointed Alice Smith as Vice President" into a pattern that analyzes: "IBM's appointment of Alice Smith as President", "Alice Smith's appointment by IBM", "IBM's Alice Smith appointment", "IBM's appointee", "The appointee of IBM", etc. NOMLEX helps determine where information identified by a sentence pattern would be found in a nominalization phrase. This would enable an information extraction tool to easily extract information from nominalizations.
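
A toy version of the mapping performed by the first program, restricted to phrases of the shape "X's N of Y", might look as follows in Python. The real program works on full parses and handles verb morphology properly, so the lexicon, the function, and the crude "-ed" past tense below are all simplifications of our own.

lexicon = {"destruction": {"verb": "destroy"}}   # toy stand-in for NOMLEX

def paraphrase(det_poss, nom, pp_of):
    """Map DET-POSS to SUBJECT and PP-OF to OBJECT, per the entry."""
    verb = lexicon[nom]["verb"]
    return f"{det_poss} {verb}ed {pp_of}"        # naive past tense

print(paraphrase("Rome", "destruction", "Carthage"))  # Rome destroyed Carthage
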
Process:

The menu-based entry program used for COMLEX was adapted to enter NOMLEX. As with COMLEX, the entry program gave us access to a large corpus of text (including the Brown Corpus and a large amount of newspaper text), and we had limited access to the British National Corpus (BNC) as well. Two half-time ELFs (Enterers of Lexical Features) worked for 5 months, and then one half-time ELF completed the lexicon in two years.

Concluding Remarks:

We have created a dictionary with the goal of solving a pervasive problem in NLP. Most grammatical analyses are designed to process sentences, but in order not to miss information they need to be applied to nominalization phrases as well. NOMLEX provides a bridge in the form of: (1) a set of rules and defaults; and (2) a dictionary record of idiosyncratic mappings between nominalizations and their related verbs. Before the creation of NOMLEX, developers and researchers had to adapt processes which handle sentences to enable them to handle nominalizations as well. Due to the high overhead of this endeavor, many systems do not handle nominalizations at all, or use simple but imprecise heuristics. NOMLEX makes it less expensive to fully integrate nominalizations into an NLP system.

Future work:

In order to make NOMLEX an even more useful resource, we plan to add support verbs to the entries. The use of support verbs changes the relationship of the nominal and verbal arguments. For example, in "his visit to Mary" the possessive determiner is the subject of "visit" (i.e. "he visited Mary"); in "he made a visit to Mary" the subject of the support verb "make" is now the subject of "visit" (i.e. "he visited Mary"). The demonstration that this is idiosyncratic and a matter for lexical interpretation is the appearance of "he had a visit from Mary", where the object of the preposition "from" is the subject of "visit" and the subject of the support verb becomes the object of "visit" (i.e. "Mary visited him"). We have been exploring Mel'cuk's notation as a means of capturing these relationships for NOMLEX.

Bibliography

Macleod, C., Grishman, R. and Meyers, A. (1997/1998) COMLEX Syntax: A Large Syntactic Dictionary for Natural Language Processing. Computers and the Humanities (CHUM), Kluwer Academic Publishers, Vol. 31 No. 6.
Meyers, A., Macleod, C., Yangarber, R., Grishman, R., Barrett, L., and Reeves, R. (1998). Using NOMLEX to Produce Nominalization Patterns for Information Extraction. Coling-ACL98 Workshop Proceedings: The Computational Treatment of Nominals.


(9.1.3) CAT : A Jurilinguistic Application of Automatic Speech to Text Transcription

Benjamin K. T'sou
K. K. Sin
Samuel W. K. Chan
Tom B. Y. Lai
Lawrence Y. L. Cheung
K. T. Ko
Gary K. K. Chan
City University of Hong Kong, Hong Kong SAR, China

1. Introduction

British rule in Hong Kong made English the only official language in the legal domain for over a century. It was not until the reversion of sovereignty to China in 1997 that Chinese also came to enjoy official status in the Judiciary of Hong Kong. Legal bilingualism in Hong Kong has brought an urgent need to create a Computer-Aided Transcription (CAT) system for Chinese on a par with the existing English CAT system. The production and retention of verbatim records of court proceedings is a cornerstone of the Common Law system, and the creation of such facilities is vital for the successful retention of the Common Law system in Hong Kong under the "One Country, Two Systems" principle, which brought about the creation of the Hong Kong Special Administrative Region of China. Court proceedings had been kept only in English until recently. There is thus an urgent demand for a Chinese CAT system to produce and maintain legally tenable records of court proceedings conducted in Cantonese, the predominant Chinese dialect in Hong Kong (T'sou 1993, Sin and T'sou 1994, Lun et al. 1995). The existing monolingual English CAT system has to be adapted in order to produce the appropriate court proceedings. Furthermore, since English will continue to be frequently used in court in addition to Cantonese, the ultimate Cantonese CAT must operate in parallel with the English CAT so that the existing contingent of court stenographers can switch from one to the other easily. This paper discusses the Jurilinguistic Engineering undertaken to develop a Cantonese CAT system, with special reference to the conversion of phonetically-based stenograph codes to Chinese text and other enhancement features.

2. Computer-Aided Transcription (CAT)

CAT is divided into three stages. First, the stenographer encodes speech into a sequence of phonetically-based shorthand codes, or stenograph codes, recorded via a stenograph machine. Second, the Automatic Transcription System (ATS) recovers the original text {c1, ..., cn} from the sequence of stenograph codes {s1, ..., sn}. Finally, a post-editing step is needed to correct typing and transcription errors.

There are two major constraints on the development of the Cantonese CAT system. First, there are many homophonous characters, which makes the conversion of phonetically-based stenograph codes into Chinese characters difficult. Cantonese (like Mandarin Chinese) is basically a monosyllabic language, and each logograph represents one syllable; homophony is thus a persistent problem in the language. Second, the design of the Cantonese CAT system must capitalize on the existing equipment and the stenographers' skills in English stenography so that they can switch from one environment to the other easily. The user interface, including keyboard design and input method, should be made consistent across the two CAT systems.

3. Ambiguity Resolution - Bigram Model

ATS converts a sequence of stenograph codes {s1, ..., sk} into a sequence of characters {c1, ..., ck}. The challenge of the conversion lies in the one-to-many relationship between a stenograph code si and the set of homophonous characters that si can encode. This is the homocode problem in the conversion from phonetic to textual representation. To resolve the ambiguity, we apply the bigram model (Bahl and Mercer 1976, Rabiner 1989, Waibel and Lee 1990, Charniak 1993), which has been extensively used in natural language modelling. The conversion procedure determines the most probable character sequence {c1, ..., ck} for the input stenograph code sequence {s1, ..., sk}; in terms of conditional probability, (1) should be maximized.

(1) P(c1, ..., ck | s1, ..., sk)

where {c1, ..., ck} denotes a sequence of k characters, and {s1, ..., sk} denotes a sequence of k input shorthand codes.

By making some approximation assumptions, the maximization of (1) is recast as the maximization of (2).

(2) ∏i=1,...,k P(ci | ci-1) · P(si | ci)

P(si | ci) and the bigram probability P(ci | ci-1) can be readily computed from the training data set. The Viterbi algorithm (Viterbi 1967) is implemented to efficiently compute the maximum value of (2).
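
A minimal Python sketch of this decoding step follows. It assumes cands[s] lists the homophonous characters for stenograph code s, p_emit[(s, c)] holds P(s | c), and p_big[(prev, c)] holds P(c | prev), all estimated from the training corpus; these names and the smoothing constant are illustrative, not taken from the system described.

import math

def viterbi(codes, cands, p_emit, p_big, start="<s>"):
    """Return the character sequence maximizing the product in (2)."""
    # paths maps each candidate last character to (log-prob, sequence so far)
    paths = {start: (0.0, [])}
    for s in codes:
        new_paths = {}
        for c in cands[s]:
            # pick the best predecessor for candidate c under the bigram model
            lp, seq = max(
                (lp + math.log(p_big.get((prev, c), 1e-12))
                    + math.log(p_emit.get((s, c), 1e-12)), seq)
                for prev, (lp, seq) in paths.items())
            new_paths[c] = (lp, seq + [c])
        paths = new_paths
    return max(paths.values())[1]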

To evaluate the accuracy of the ATS, we conducted transcription tests. Two prototypes were developed. The first, CAT2, implemented the bigram model for conversion as described above. A baseline model, CAT0, was also built for comparison; it converts by selecting, for each stenograph code si, the character ci with the highest P(si | ci) value.

We compiled a training corpus of about 0.85 million characters of authentic court proceedings to obtain the conditional probabilities necessary for computation. A testing corpus of about 0.15 million characters was used. After training, CAT0 and CAT2 achieve accuracies of about 78% and 92% respectively. The use of the bigram statistical model significantly improves the ambiguity resolution.

4. Enhancement Facilities

A consistent user interface for both Cantonese CAT and English CAT must be provided so that stenographers can easily operate in both Chinese and English. Two important features were offered to maintain consistency and to improve transcription efficiency.

4.1 "Arbitrary" (as defined in Glassbrenner and Sonntag (1986))

We have been assuming throughout that one keystroke corresponds to one syllable for the sake of simplicity. Nonetheless, such a requirement is not obligatory. A stenograph code may well represent a string of characters of any length. Although the English CAT system is syllable-based, it builds in functions to associate a unique user-defined stenograph code, or an "arbitrary", with a frequently used phrase or expression instead of a single syllable. "Arbitrary" is a critical feature for fast online recording as keystrokes can be significantly reduced.

Two requirements must be met in the incorporation of "arbitraries". First, each stenographer may not consistently use "arbitraries" even within the same recording session. An expression may be recorded using an "arbitrary" or be entered as a series of syllable-based stenograph codes. The system must be able to tolerate such variation. Second, the system must be flexible enough to allow the stenographer to create novel "arbitraries" at input time. The stenographer may invent new ad hoc codes at input time to speed up recording. The CAT system must be able to operate without defining the "arbitrary" before using it.

While "arbitraries" in the English CAT system are merely additional entries in the conversion dictionary, incorporating "arbitraries" into our conversion model is more complicated. A macro design is introduced which enables "arbitraries" to be fully integrated into the syllable-based scheme and our statistically-based ATS module. This is achieved by allowing the stenographer to define in an "arbitrary dictionary" an ad hoc macro for a stenograph code sequence plus its corresponding character string. The input is pre-processed so that the ad hoc code will be expanded into a sequence of syllable-based stenograph codes as defined. Subsequently, the expanded code will be subject to the statistically-based conversion. While the stenographer uses ad hoc codes at input time, the conversion procedure operates using the syllable-based code at transcription time.

4.2 Domain-Specific Transcription

In English CAT, automatic transcription is supported by special "Job dictionaries", containing specialized vocabularies. They can be dynamically activated depending on the case type recorded. Different case types have specific legal terms and lexical usage. For instance, chemical vocabulary in drug-trafficking offences is not likely to be found in fraud or traffic offences. Integrating all vocabularies in a training corpus of the Chinese CAT system may obscure the co-occurrence probabilities of some characters. To make the bigram probabilities more reliable, we exploit this domain-specificity of lexical items. Another test was conducted using transcripts pertaining to traffic offences. A training corpus and testing corpus of about 0.85 million and 0.15 million characters respectively were compiled. We observed that the system achieved a better transcription accuracy of about 95%. With this feature, the stenographer can choose a particular domain to work with before the automatic transcription. At present, the available categories include Assault, Civil, Robbery, and Traffic.

5. Conclusion

To summarize, the bigram statistical model has been applied to resolve ambiguity in stenograph code to Chinese conversion. Supplemented with the ad hoc codes and domain-specific transcription, the Cantonese CAT system offers a user-friendly environment that matches the English CAT. The resultant high transcription accuracy makes it viable to implement a Cantonese stenograph system on phonologically-based machines which are designed for English, and which can now accommodate both English and Cantonese as dictated by circumstances.

Acknowledgement

Support for the research reported here is provided mainly through the Research Grants Council of Hong Kong under Competitive Earmarked Research Grant CERG 9040326.

References

Bahl, L. R. and Mercer, R. L. (1976). "Part of Speech Assignment by a Statistical Algorithm." IEEE International Symposium on Information Theory, Ronneby, Sweden, June 1976.
Charniak, E. (1993). Statistical Language Learning. MIT Press, Cambridge, MA.
Glassbrenner, M. and Sonntag, G. A. (1986). Computer-Compatible Stenograph Theory. 2 vols. Stenograph Corporation, Illinois.
Linguistic Society of Hong Kong (1997). Yueyu Pinyin Zibiao (Cantonese Jyutping Transliteration Word List). Linguistic Society of HK, Hong Kong.
Lun, S., Sin, K. K., T'sou, B. K. and Cheng, T. A. (1997). "Diannao Fuzhu Yueyu Suji Fangan" (The Cantonese Shorthand System for Computer-Aided Transcription) (in Chinese). In B. H. Zhan (ed), Proceedings of the 5th International Conference on Cantonese and Other Yue Dialects. Jinan University Press, Guangzhou. pp. 217-227.
Rabiner, L. R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE. Reprinted in Waibel and Lee (1990).
Sin, K. K. and T'sou, B. K. (1994). "Hong Kong Courtroom Language: Some Issues on Linguistics and Language Technology." Paper presented at the Third International Conference on Chinese Linguistics, Hong Kong.
T'sou, B. K. (1993). "Some Issues on Law and Language in the Hong Kong Special Administrative Region (HKSAR) of China." In K. Prinsloo et al. (eds), Language, Law and Equality: Proceedings of the 3rd International Conference of the International Academy of Language Law (IALL). University of South Africa, Pretoria. pp. 314-331.
Viterbi, A. J. (1967). "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm." IEEE Transactions on Information Theory 13: 260-269.
Waibel, A., and Lee, K. F. (eds) (1990). Readings in Speech Recognition. Morgan Kaufmann, San Mateo, CA.


(9.3) Stylistics

(9.3.1) Shouting and Screaming: Manner and Noise Verbs in Communication

Margaret Urban
Josef Ruppenhofer
University of California, Berkeley, USA

In many languages, words can be used in different domains from those in which they originated. In English, sound verbs are commonly used in the context of human communication (1-4).

(1) ('Shut up, Doreen,'[MESSAGE]) (Silas[SPEAKER]) barked, his face contorted by a scowl.
(2) ('Darling,'[MESSAGE]) (Conrad[SPEAKER]) cooed as Lee entered the living room.
(3) ('He's a thief, Hilary,'[MESSAGE])(he[SPEAKER]) grated almost savagely.
(4) (Grandson Richard[SPEAKER]) rumbled (a reply[MESSAGE]).

However, not all sound verbs have communication uses; the ones that do are restricted as to the type of message and/or speaker they can occur with. The syntactic patterns of sound verbs used for communication are not the same patterns found with true communication verbs. Several researchers (Goossens 1995, Miller & Johnson-Laird 1976, Levin et al. 1997) have explored these phenomena, paying particular attention to which verbs have or lack communication uses. Here we propose a unified and expanded corpus-based account of these cross-domain extensions in terms of Frame Semantics (Fillmore 1982) and with reference to theories of metaphor. This analysis has implications for the further description of the relationships between frames (e.g., inheritance and blending), and the development of cross-domain uses of words.

The FrameNet Project (P.I. Charles J. Fillmore) is creating a lexical database with 3 linked components: the expanded lexicon, the Frame Database, and annotated example sentences (Baker et al. 1998). Files which represent senses of lexical items within a particular domain and frame (represented as domain/frame) are created, and constituents are annotated with the Frame Elements which are realized with respect to the target word. This annotation, and subsequent marking of phrase type and grammatical function, is further analyzed for the combinations of syntactic and semantic patterns realized in various senses. Even while still in progress, this project has become a valuable resource for lexical and other linguistic analysis. The authors, both researchers involved in all stages of the project, have examined annotated files for 201 verbs in perception/noise, 23 verbs in communication/manner, and 60 verbs in communication/noise (there are 314 communication verb files overall).
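
To make the shape of these annotations concrete, the following Python record sketches the information one annotated example carries; the field names and values are our own schematic illustration, not the FrameNet database's actual format.

# One annotated example for the target "barked" (cf. example (1) above).
example = {
    "target": "barked",
    "domain_frame": "communication/noise",
    "constituents": [
        # each constituent: Frame Element, phrase type, grammatical function
        {"text": "'Shut up, Doreen,'", "frame_element": "MESSAGE",
         "phrase_type": "Quote", "grammatical_function": "Complement"},
        {"text": "Silas", "frame_element": "SPEAKER",
         "phrase_type": "NP", "grammatical_function": "Subject"},
    ],
}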

Sound originates in the domain and frame of perception/noise. Since communication involves human verbal interaction, it necessarily overlaps with the sound domain. Two criteria determine which noise verbs can have communication uses. First, a noise verb is usable as a communication verb if the sound is produced by animate beings, especially animals (e.g. bark, yelp), but not when it is produced by objects (e.g. clink, thud). Nevertheless, some inanimate noises, such as rumble, are used for communication. It is possible that the physical profile of these sounds lends itself to a communication construal. Secondly, among animal sounds, imitative sounds (e.g. oink, quack) have no uses as communication verbs; the exact specification of the sound blocks the expression of a message (5).

(5) *(Mr. Baker[SPEAKER]) oinked (an invitation[MESSAGE]) across the table.

The noise verbs with communication uses (e.g. scream, bellow) do not behave like genuine manner of speech verbs (e.g. shout, whisper), differing from them both syntactically and semantically. We argue that this reflects the differences in the structure of the two domains and frames. In their home domain, noise verbs are usually intransitive, taking the sound SOURCE as subject (6-8).

(6) Somewhere behind her (a horn[SOURCE]) blared.
(7) (The long blades[SOURCE]) clashed and rang, their movement too fast for the eye to follow.
(8) The ducks began quacking and (the frogs[SOURCE]) croaking.

By comparison, communication verbs are normally transitive, with a SPEAKER subject and a MESSAGE object. ADDRESSEE and TOPIC prepositional phrases and MANNER adverbs frequently appear (9-11).

(9) ('How's the shop?'[MESSAGE]) mumbles (one balding sweating man[SPEAKER]) (to another[ADDRESSEE]).
(10) One of his body squires heard (him[SPEAKER]) whispering (about it[TOPIC]) (to his Gascon favourite[ADDRESSEE]).
(11) 'If (you[SPEAKER]) so much as whisper (a word[MESSAGE]) (about Dame Agatha[TOPIC]) (to the Lady Maeve[ADDRESSEE]), you will regret the day I ever plucked you out of Newgate!'

Consider a communication use of the sound verb 'snarl' ('Do you have to?' she snarled at him as he took out a cigarette). Whereas 'Do you have to?' [MESSAGE] and she [SPEAKER] look like canonical communication frame elements, (at him) is not a typical encoding of ADDRESSEE; compare the oddness of (I talked at John). The effect of (at him) is to make him seem more like the target of a directed sound emission, as in (The dog barked at me). The difference between real manner of speech verbs and communication uses of noise verbs can also be observed in terms of complementation patterns and their frequencies. For example, more quoted MESSAGEs are found with noise verbs than with manner of speech verbs. The pattern is the reverse for that-clause MESSAGEs. ADDRESSEEs are less common with noise verbs used for communication than with regular manner of speech verbs (12).

(12) (The housekeeper[SPEAKER]) left the room, muttering (about ingratitude[TOPIC]).

This difference can be exemplified statistically by the analysis of proportional samples of representative verbs from each domain and frame.

Noise verbs used for communication do not only differ from manner of speech verbs as a class, but also exhibit interesting differences among themselves. For instance, many verbs are specialized as to what kinds of speakers they accept: older people and females are better cacklers, while men and people in positions of authority are more likely to rumble, bellow, or grunt (13-15).

(13) ('I'll warrant he is!'[MESSAGE]) (the old lady[SPEAKER]) cackled unexpectedly.
(14) We passed (the police sentry who[SPEAKER]) grunted (a sleepy greeting[MESSAGE]).
(15) ('Off now then?'[MESSAGE]) chirped (the woman[SPEAKER]), dropping another sock.

Also, inasmuch as the manner of the speech act is being emphasized, the quoted [MESSAGE] component frequently contains an alphabetic representation supporting that emphasis (16).

(16) ('Th-that's b-blackmail,'[MESSAGE]) (she[SPEAKER]) spluttered.

This analysis shows that in these cross-domain uses, semantic and syntactic factors from both source and target domains play a role in determining the structure of the utterance. While the target domain supplies a syntactic structure, the source domain's semantics constrain the degree to which that syntactic structure can be exploited. Although some of the domains' interactions resemble metaphorical mappings, e.g. the SPEAKER-SOURCE correspondence, the relationship between the domains is not that of metaphor: both domains are concrete, rather than one being concrete and one abstract, and instead of being discrete domains they have something in common, i.e. the presence of a sound source. Nor are these simple cases of situational metonymy between speaking and producing sound. This kind of evidence and data can be used to describe further the complex interactions of frames which are evidenced in natural language: frame blends and inheritance, metaphor, complex frames, and other cross-domain uses. The synthesis of linguistic theory, lexicography, and work with large-scale corpora is necessary for significant coverage of the data. The frame semantic approach, with detailed lexical analysis, provides a semantically and syntactically informative account.

References

Baker, Collin F., Fillmore, Charles J., and Lowe, John B. (1998). The Berkeley FrameNet Project. COLING-ACL '98 Proceedings of the Conference, held August 10-14, 1998, in Montreal, Canada.
Fillmore, Charles J. (1982). Frame Semantics. In Linguistic Society of Korea (ed) Linguistics in the Morning Calm. 111-138. Hanshin, Seoul.
Goossens, Louis (1995). Metaphtonymy: The Interaction of Metaphor and Metonymy in Figurative Expressions for Linguistic Action. In Goossens et al. (eds) By Word of Mouth. John Benjamins Publishing Co., Amsterdam/Philadelphia.
Lakoff, George, and Johnson, Mark (1980). Metaphors We Live By. University of Chicago Press, Chicago/London.
Levin, Beth, Song, Grace, and Atkins, B. T. S. (1997). Making Sense of Corpus Data: A Case Study of Verbs of Sound. International Journal of Corpus Linguistics, Vol. 2, No. 1. 23-64.
Miller, George A., and Johnson-Laird, Philip (1976). Language and Perception. Harvard University Press, Cambridge, Mass.


(9.3.2) The Style-Marker Mapping Project: a Rationale and Progress Report

Joseph Rudman
Carnegie Mellon University, USA

Introduction

This paper explicates the what, why, and how of a substantially completed but ongoing project to identify and categorize all style-markers in written English that are quantifiable (e.g. type/token ratios, word length distributions, word length correlations, hapax legomena).
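
Three of the markers just named can be computed in a few lines of Python; the crude tokenizer and the exact definitions below are ours, chosen only to make the notion of a quantifiable style-marker concrete.

import re
from collections import Counter

def style_markers(text):
    """Type/token ratio, word-length distribution, hapax legomena count."""
    tokens = re.findall(r"[a-z']+", text.lower())   # crude tokenizer
    counts = Counter(tokens)
    return {
        "type_token_ratio": len(counts) / len(tokens),
        "word_length_distribution": Counter(len(t) for t in tokens),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
    }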

Section I treats the what - defining the project - and then addresses the why - the rationale for the project. Section II outlines the how. Section III gives a status report of the project.

Although the final mapping will have value in various disciplines (e.g. stylistics, corpus linguistics, computational linguistics, and computer science), the impetus for the project comes from non-traditional authorship attribution studies. Non-traditional attribution practitioners define style in the seemingly narrow framework of only those stylistic traits that are quantifiable.

The main hypothesis behind this project (and all non-traditional authorship attribution studies) is that every author has a verifiably unique style. If we look at style as an organism, style-markers are its genetic material - making this project analogous to the human genome project. This analogy is somewhat of a stretch because each style-marker is analyzed by an independent study, whereas all of the loci of an autoradiogram are obtained in one scientific analysis.

The identification of a quantifiable style-marker does not necessarily mean that that particular style-marker should be included in an authorship study (e.g. the orthography might be dictated by an editor or typesetter).

Section I

This section treats the what - defining the project - and then moves into the why of the project.

A short overview of what is in this section follows:

The style-marker mapping project is a study to identify every style-marker in written English that can be quantified. The project began in a preliminary fashion in 1983 when I started recording the various style-markers that were used in non-traditional authorship attribution studies so that I could use them in my studies of the canon of Daniel Defoe. The project continued in this vein (along with a few attempts on my part to come up with "new" style-markers) until five years ago when I realized the importance of identifying all of the quantifiable style-markers.

There is no one style-marker, or even combination of several style-markers, that has proven to be a definitive discriminator in all non-traditional authorship studies. What works in one case often does not work in others. Word length distributions and sentence length distributions are two examples of style-markers that seemingly work in some cases but not in others. The idea is to look at style as a combination of all of the quantifiable style-markers and then to do the analysis as if each style-marker were a locus in the autoradiogram (see RUDMAN for a more detailed explanation).

Another reason for using all of the quantifiable style-markers is to eliminate any charges of statistical cherry-picking.

Section II

This section treats the how. References to all of the literature and techniques will appear in the final report. For example, HOLMES, DELCOURT, and ELLIOTT AND VALENZA are three of the references under non-traditional authorship studies.

1) Search the literature.

2) Query the practitioners in all of the above fields.

3) Establish a clearinghouse on a web page that allows anyone to query the up-to-date mapping and allows anyone to suggest "new" quantifiable style-markers, which would be added by the curator. This will lead to a continually updated list. Negotiations are under way to make this site an extension of the Carnegie Mellon University English Web Site.

4) Use various strategies to identify new style-markers. This is where the innovative work supplements the drudge work.

Section III

This section gives a status report of the project and reports a timeline for its "completion." References will be given for all of the style-markers in the final report, e.g. MOSTELLER AND WALLACE is one of the function word references. Only a few representative examples of each section are listed for this abstract.

1) META-WORD

2) WORD

3) SUB-WORD

4) OTHER

Conclusion

The questions "Can this project ever be completed?" and "Is the number of style-markers infinite?" are addressed.

The success of this project will not solve all of the problems of non-traditional authorship studies. This project does not address the problems that gender, genre, time constraints, or conscious vs. unconscious style bring to the table. Nor does it treat the problem of lemmatization.

Identifying the style-markers is only a small part of the overall problems with non-traditional authorship attribution studies. The statistics that should be used in any study for each of these style-markers, and the statistics for combining all of the markers into a "final" answer, are the subject of another ongoing project.

Bibliography

Delcourt, Christian (1994). Stylometry. ALLC-ACH École Normale Supérieure de Lettres et Sciences Humaines, Paris, 14-15.04.
Elliott, Ward E.Y. and Valenza, Robert J. (1996). And Then There Were None: Winnowing the Shakespeare Claimants. CHum 30.3: 191-245.
Holmes, David I. (1985). The Analysis of Literary Style - A Review. J.R.Statist. Soc. A 148, Part 4: 328-341.
Mosteller, Frederick and Wallace, David L. (1984). Applied Bayesian And Classical Inference: The Case Of The "Federalist Papers." 2nd Edition. Springer-Verlag, New York.
Rudman, Joseph (1998). The State of Authorship Attribution Studies: Some Problems and Solutions. CHum 31: 351-365.


(9.3.3) Comparative Study of the Lexical and Orthographic Variety in the Mediæval Slavonic Psalter

Milena Dobreva
Institute of Mathematics and Informatics, Bulgaria

Monia Camuglia
University of Pisa, Italy

1. Introduction

1.1. Text variety in mediæval Slavonic texts

The mediæval Slavonic written tradition is characterised by a high level of variety on all linguistic levels. This is usually explained by the use of a vernacular language in the mediæval Slavonic texts: as texts spread to regions whose language differed substantially from that of the region where a text originated, the transmission left its mark on the copies. In this setting, the study of text variety is an important source of information about the synchronic and diachronic development of the Slavonic languages. However, until now the study of text variety has been done by traditional research methods based extensively on human collection and processing of scattered data. Taking into account the substantial number of mediæval Slavonic manuscripts, one can see that the bulk of the texts cannot physically be covered by traditional investigation.

1.2. Case Study

The Psalter has always had a leading role among the religious texts since its first appearance in the Slavonic countries, representing a very important source in the process of the development of the Slavonic written tradition.

Its importance is also linked to the didactic function it had in the Slavonic lands, where a new alphabet had recently been introduced; this created the need to educate scholars in the new literary environment. Owing to its rhythm and its wide diffusion, the Psalter was learned by heart as a way of becoming acquainted with the written culture.

That is why the Psalter has always been one of the most widespread texts [Karačorova 89]. Its tradition is considered constant by almost all researchers, despite the structural differences that allow us to speak of different typologies of Psalter (simple, with commentaries, etc.).

In the light of these considerations (wide dissemination and lower variety due to its religious use), we decided to investigate some textual peculiarities of the Psalter's transmission.

1.3. Experimental Setting

We focussed our attention on six manuscripts, taking the spatial and temporal setting of each manuscript as reference points. Linguistic peculiarities pointing to an origin in a certain region are referred to by the term 'revision'.

We used the following six manuscripts: the Sinaitic, Bolonian, Serbian, Norovian, Kievian and Gennadian Psalters.

We used as text excerpts 15 Psalms scattered through the Psalter (1, 39, 40, 41, 44, 45, 64, 73, 74, 75, 89, 91, 98, 102, 134; the exception is the Serbian Psalter where the 1st psalm was damaged).

2. Study of the lexical variety

We started with a study of the lexical variety in order to follow the tradition of the Psalter diatopically and diachronically, to set out the lexical variants and to verify the unity of its tradition.

We used for this study the DBT (Data Base Testuale), a program of textual analysis developed at the ICL-Pisa, which produces different types of information in real time. Among its functions, the DBT provides alphabetical and decreasing-frequency word lists, concordances, co-occurrences, indices locorum, lists of suffixes and endings, and a lemmatization procedure.

Comparing our witnesses, we can consider the chosen texts as divided into 4 blocks: 1) the Sinaitic Psalter; 2) the Bolonian and Serbian Psalters (the closest in time to the Sinaitic); 3) the Norovian Psalter; 4) the Kievian and Gennadian Psalters. Counting 115 variants against the Sinaitic Psalter, we noticed that the Norovian Psalter is the text that, more than the others, is detached from the Sinaitic; as Cesko says, it is "one of the first trials to normalize the literary language thanks to a correction based on the Greek model" [Cesko o77].

On the other side, the Bolonian and Serbian Psalters show a lower level of variants; moreover, they were written in a period very close to the creation of the Sinaitic Psalter.

Turning to the two Russian Psalters: the Norovian and Gennadian Psalters have almost the same number of variants (78 and 75), so one might think that they belong to the same revision. But this is not true: of the 78 variants present in the Norovian, 54 differ from those of the Gennadian, which in turn offers variants different from those of the Kievian only 28 times. This verifies the strong linguistic similarity between the two Russian texts and the atypicality of the Norovian Psalter with respect to this group of texts.

In conclusion, we have the Norovian Psalter vs. all the others, which can be readily accepted if we consider the basic homogeneity of all the examined manuscripts and the declared adherence of the Norovian to the Greek original.

This confirms the unity of the Psalter's tradition in the Slavonic countries.

3. Study of the orthographic variety

We also studied a lower textual level - the orthographic one - in order to investigate the correlation between revisions and quantitative data on orthographic features. The traditional approach to the study of orthographic variety is qualitative: decisions about the time when a text was written and the localization of its creation are based on the occurrences of specific orthographic features. Such features reflect the use of the jers, the nasal vowels, and groups containing -r- and -l- [OBG 1993].

We studied the relative frequencies of all letters of the Old Cyrillic alphabet and of all strings belonging to such characteristic groups. We aimed to check whether the qualitative characteristics relevant to the origin of a text also lead to differences in the quantitative study of the texts. The data were processed in STATISTICA for Windows. This data organization allowed us to choose a subset of texts, and some of the letters or groups, for analysis; from the viewpoint of the specialist in Slavonic studies, this means that we are able to study the usage of all or of selected graphemes in a chosen group of texts. We applied cluster analysis in order to check whether texts belonging to different revisions are grouped correctly.
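
The quantitative step can be pictured as follows in Python; the feature list and the plain relative-frequency definition are our own sketch of the procedure, whose actual processing was done in STATISTICA.

def grapheme_profile(text, features):
    """Relative frequency of each grapheme or feature string in one witness."""
    n = len(text)
    return [text.count(f) / n for f in features]

# Profiles for the six witnesses can then be compared pairwise (e.g. by
# Euclidean distance) and fed to any hierarchical clustering routine.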

The cross-study of the texts leads us to the conclusions summarised in the following section.

4. Conclusions

Our study showed the importance of conducting experiments on different linguistic levels.

The first approach was oriented towards a study of lexical variance aimed at investigating the distribution of synonyms in the different witnesses and at forming groups of similarity between the witnesses. This study confirmed the hypothesis about the unity of the Psalter's transmission amongst the mediæval Slavs.

The second approach was oriented towards a study of quantitative data on the distribution of orthographic characteristics.

The comparison of the results shows that the application of both approaches could help the specialists in mediæval Slavonic textual studies to acquire new data about the similarities and dissimilarities of their object of study. Moreover, the complexity of the studied field presupposes the use of methods applied to the different linguistic levels because they enlighten different aspects of the studied written tradition.

References

[Cesko o77] Cesko, E.V. (1982). Ob afonskoj redakcii slavjanskogo perevoda psaltyri v ee otnoshenii k drugim redakcijam, V: Jazyk i pismennost' srednebolgarskogo perioda. Moskva, p. 77 (in Russian).
[Karačorova 89] Karačorova, I. (1989). Kym vyprosa za Kirilo-Metodievskija starobylgarski prevod na Psaltira. V: Kirilo-Metodievski Studii, t. 6, S., str. 15-129 (in Bulgarian).
[OBG 1993] Duridanov, I. et al. (eds) (1993). Gramatika na starobalgarskija ezik, S., 1993 (in Bulgarian).


(9.4) Posters & Demonstrations

(9.4.1) Subject Specific Search Engines for the Humanities

Brian Hancock
Rutgers University, USA

The Internet has proven to be a vast and unruly resource. Students are happy to come to the library and use general engines such as AltaVista and directories such as Yahoo! to do their research on the Web, but often the results they generate are extraneous and from unreliable sources. To make matters worse, recent studies show the major engines index only a portion of the Internet, although Excite claims its recently-announced search engine will reach all the Web's pages. On a more modest scale, to help keep users on track with their Web searches, humanities librarians have begun to use subject-specific engines.

Subject-specific engines are a good tool for concentrating a search within the parameters of a particular discipline. These parameters naturally define a search and can help users obtain relevant results quickly. To help achieve this, these engines are set up with a variety of automated software packages such as Harvest and Ultraseek Server.

Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information across the Internet. It was originally developed at the University of Colorado by the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD) and is maintained by a group of volunteers at the University of Edinburgh. It is used to gather information from selected sites so that a user will receive only information relevant to the subject searched. For example, when searched for "Horace", a subject-specific engine for the classics like Index Antiquus will return results relating to the Roman poet. The users of this subject-specific system are assured that the hits returned will be relevant.

Harvest consists of two basic subsystems: the gatherer and the broker. The gatherer collects the information from sites selected by a person (in the case of Index Antiquus, sites relevant to the classics as determined by a librarian). This process is an evaluation using particular criteria - criteria such as accuracy, appropriateness, authority, organization, currency, and relevancy. To this list we must add stability. Even though a robot or link checker will help maintain links to active sites, you don't want the database changing radically every time it's renewed.

Once the information is returned to the local server, it is summarized and indexed. That is, it is stripped of any HTML code, and a database is created and indexed. The gatherer does not by itself automatically update the database; this is done by using cron to rerun the gatherer. This can be done at a specified time, every month for instance, and, because we are good Net citizens, early in the morning to keep the load down on host servers. The broker is basically the mechanism that accesses the database and returns the results; in other words, the broker is the search engine. The default engine for Harvest is glimpse, but you can use WAIS or SWISH if you wish. Because Harvest is open source, you can download the source and compile the software yourself.
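
A crontab entry of the following shape would do this; the path and script name are illustrative, not Harvest's documented defaults.

# rerun the gatherer at 3 a.m. on the first day of every month
0 3 1 * * /usr/local/harvest/gatherers/antiquus/RunGatherer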

Infoseek has now made its Ultraseek Server port to Linux available. We are currently running Ultraseek 3.1 on Red Hat Linux 6.1 and have also run it successfully on SuSE Linux 6.3. It is extremely easy to install, but because it is a commercial product you are not given the code. It can be used on an intranet or to collect documents from Web servers.

The Ultraseek Server automatically sends out its robot to selected sites and creates the index. The search interface is configurable and supports natural-language queries. Once the database is created, you can manage it remotely via your browser. The extensive documentation includes installation, administration, and customization guides. The version for Linux is available in i386 RPM or tarball format.

The presenter will demonstrate Index Antiquus running on Harvest and on Ultraseek Server. As indicated, Index Antiquus is a subject-specific search engine for classical and early medieval Latin texts. Interested parties will be invited to test Index Antiquus, and the presenter will be pleased to answer any questions, for instance about the use of this type of system for other humanities disciplines. The URL for Index Antiquus is <http://harvest.rutgers.edu:8765>.


(9.4.2) The Academic vs. Subject Corpus: Development of Criteria for the Teaching of ESP According to Lexical Needs in Spanish Polytechnic Courses

Alejandro Curado Fuentes
Universidad de Extremadura, Spain

The aim of this paper is to present the details gathered from the lexical analysis of English texts read in Information Science related majors in Spanish universities. In this textual collection, lexical items are arranged according to the notions of word frequency and range across text types and genres, or within given subject fields and topics in the information science and technology disciplines. The strength of these lexical combinations, based on their statistical M.I. (Mutual Information) measurement, is also quite pertinent to our study: the degree of collocation is thus assessed in the light of common coreness. That these patterns are more or less consistent in our corpus is a key characteristic to assess, so that a reference to the total number of texts and running words can be established. Finally, as the findings show that there exist representative lexical items for a limited or reduced number of texts, keywords must also be explored. For the observation of results drawn according to the three approaches mentioned - word frequency / range, collocations and keywords - the focus is placed on both the text and the subject-matter. This is essentially done to follow the priority of working with language and content from the ESP (English for Specific Purposes) perspective. As a consequence, a categorization is made regarding a specified kind of context - e.g. text types.
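
The M.I. score in question can be stated compactly; the following Python sketch computes it from raw corpus counts (the function and its argument names are ours).

import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI = log2( P(x,y) / (P(x) * P(y)) ), estimated from frequencies."""
    return math.log2((f_xy * n) / (f_x * f_y))

# e.g. a pair co-occurring 50 times in a corpus of 1,000,000 running words,
# its members occurring 400 and 500 times: MI is about 7.97.
print(mutual_information(50, 400, 500, 1000000))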

As in the case of genre, the environments of text and discourse are of prime importance for situating lexical items within the scope of academic linguistic competence. Text types are approached in relation to how text is organized and reflects coherence and cohesion, while the second setting - genre - registers the writer's inclination and intentionality to produce discourse for a community (e.g. academic). There are two other parameters - subject and topic - on which the distribution of the lexical items of our corpus is based. In their case, a framework based on content is provided, and the findings yield the core lexis according to thematic / conceptual fields.

a. Word frequency and range.

In this first division, the most frequent text type words are provided according to how recurrent they are across six sets of ten texts. These are grouped as follows:

 1. Definitions.
2. Descriptions.
3. Classifications.
4. Exemplifications.
5. Discussions.
6. Conclusions.

The samples are taken randomly to represent the rhetorical functions and sub-sections of genres with which the learner must cope and come to grips.[1] The relevant vocabulary analyzed from this perspective is classified as argumentative, procedural and discourse/grammar items, examined in demarcated domains such as distinctive subject fields and genres.[2]

Our immediate concern thus lies in having all the interrelationships among the subject fields represented visually in order to make the selection of text samples accordingly.

We offer figures which refer to the number of sources belonging to four specified disciplines in Informatics-related majors - pointed out by abbreviations (e.g. 'C.S.' stands for Computer Science and so forth).

In addition, in our corpus, capital letters refer to the codes used for the subjects/topics within disciplines as shown in the Appendices (Appendix 1). As will be observed, in addition to all the labels A - F, each single subject field is also represented individually by some texts (not shared with other studies).

There are more texts in the 'F' category, which the four scientific areas share - Computer Science, Information Science, Audio-visual Communication and Optical/Wave Communication. In contrast, the subject 'Communication Theory', included in the Audio-visual Communication, Information Science and Telecommunication programmes of study, comprises only four sources. In turn, Audio-visual Communication is the discipline with the smallest number of readings involved - only one text for each genre.3

Texts are selected using their overall distribution and length in the corpus as yardsticks. Thus, if up to five descriptions (out of a possible ten) are included in the 'F' or 'All disciplines' category, it is because such passages are quite common in these readings; in addition, these five samples are not as long as other types in this division, such as definitions. Finally, as has been pointed out, keeping the balance in relation to the entire corpus is a chief consideration.

So that the text-type findings based on frequency and range are framed by detailed knowledge of the subjects and topics involved, the distribution of text-type sources within each sub-category or label must also be provided.

The maximum number of texts encompassed is three - e.g. Descriptions on the topic of 'Information infrastructure' (category F6). This reflects both the larger number of readings in the corpus dealing with issues of this kind and the recurrence of this rhetorical function in sub-division F6. In contrast, where a given sub-category contains no samples, it is because the model was either less developed or absent altogether from the content of the texts (e.g. Conclusions in 'Perspectives on information' [F1], 'Media theory' [D2], 'Media documentation' [C2], 'Automated Knowledge-based systems' [B3], etc.).

A final comment must be made regarding the importance of keeping a balanced representation of the three academic genres - textbooks, reports and research articles - in the construction of the corpus. The intent of this arrangement is to offer a weighted basis for text selection and analysis. Since the purpose of this organizing procedure is to provide adequate ground for lexical sifting, the text-type sub-corpus should incorporate as many different language and content settings - i.e. contextual factors - as possible. In this sense, even text units as characteristic of a single genre as the Discussion and Conclusion sections of research articles can be located in the other two genres (e.g. a discussion appearing in a textbook on Communication Theory [E1] or a conclusion taken from a report on Software Programming [A1], as figure 4 shows).

The end results should thus be fitting for the design of both written and oral lexical activities and tasks that bring out the importance of academic and subject lexis, based on the analysis of texts common to different disciplines.

Notes

1. Only these six types are chosen because others are already covered: the discourse function signalling contrast, for instance, is contained within Discussions on five occasions, while illustrations are handled within Exemplifications. In turn, the two research-article sections included - Discussions and Conclusions - are given priority over Abstract, Introduction, Method and Results, since the latter are already drawn on, to a greater or lesser degree, for the compilation of Descriptions, Classifications, Definitions and Exemplifications (see the distribution of the text-type sub-corpus).
2. How certain genres and types can be characterized by core or subject-core lexis is described by, among others, Carter (1988, 1997).
3. The correspondence between the number of texts and the disciplines follows from the aim of assembling core language and subject matter: the smaller the number of samples, the more subject-specific the texts tend to be.


(9.4.3) Hyperfiction Reading Research: An Experiment in Method

Colin Gardner
University of Sheffield, UK

[1] Introduction

Hyperfiction reading experiments have been carried out as part of the author's PhD research. The experiment complements and, in significant ways, extends methods for critical metadiscursive commentary on non-linear literary texts, and can be considered a preliminary and partial response to the question posed in the title of an article by G. P. Landow: "What's a Critic to Do?" (Landow, 1994). Screen recording software has been used to log readers' navigation of hyperfiction and to generate data amenable to informal qualitative and quantitative analysis. Although this is an exploratory study, it is expected that the data will provide a basis for testing the hypothesis that analysis of navigational choices made within a well-defined context can be used to suggest how a reader may have interpreted the text. The data will be integrated within the various models of reading hyperfiction and contribute to the growing methodological corpus of research on this topic. Whilst the value of results deriving from this meta-interpretational analysis may be questioned, it should nevertheless provide a point of departure for an urgent and timely debate on the paths that technocriticism might take with regard to hypertext fictional narrative.

[2a] Analysis

Focusing on readings of a well-known hyperfiction, Michael Joyce's 'Afternoon: a Story', the experiment exploits the fact that reading online allows a reader's movement through a text to be monitored unobtrusively, recording where the reader has visited and for how long. Readers sometimes 'hover' around the screen with the mouse, and the screen recording software captures this activity for use by the analyst/critic. In Readingspace (the viewing application for Storyspace texts), words selected by the reader are highlighted with a red outline. This matters for the success of the meta-interpretation because the words a reader actually chooses, even where they do not activate a specific link, are significant in themselves as indicators of the reader's intention. In this way, the directions readers take through a fictional narrative may act as a kind of interpretative index for that reading. To this end, the experimental procedure requires analysis of each space visited to assess the relations between its semantic elements; this is referred to as "immediate context analysis". Not relevant to meta-interpretation of the reader, but nevertheless useful to the analyst/critic, is an analysis of the relation between link elements and their destination spaces: this is the "narratological context analysis". Finally, analysis of the pathways along which the reader can negotiate the structure reveals what may be called vortices, since recursal (returning to the same screen within the same reading) is subject to probability factors set up by the arrangement of links. This is referred to as "probability analysis", and its corresponding factors are centrifugal/centripetal.
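
By way of illustration, the recursal component of the probability analysis might be operationalised as follows (a Python sketch; the log format and space names are hypothetical stand-ins for data extracted from the screen recordings):

    from collections import Counter

    # A reading log: the ordered sequence of spaces one reader visited,
    # reconstructed from the screen recording.
    reading_log = ["begin", "yes", "I want to say", "yes", "begin", "yes"]

    def recursal_profile(log):
        # Count returns to already-visited spaces within a single reading.
        # Spaces with many returns behave centripetally (they draw the
        # reading back in); spaces never revisited behave centrifugally.
        visits = Counter(log)
        return {space: n - 1 for space, n in visits.items() if n > 1}

    print(recursal_profile(reading_log))  # {'begin': 1, 'yes': 2}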

[2b] Experimental factors

Two objections have to be contended with in this research: randomness and conformity. Between the two extremes of random selection of links by a reader and a single predetermined reading sequence, this study uses reading research to find out what relationship, if any, exists between the content of the narrative and the choices made by the reader. These factors might be deemed to involve an unacceptable indeterminacy of variables, or to lack the rigour and exactitude of more "scientific" analyses, let alone to contribute positively to the problem of hyperfiction criticism. Whilst it is acknowledged that no research can give a reliable or purely objective indication of the state of a reader's mind, the fact that engaged readers make conscious choices suggests that their decisions are amenable to objective analysis. What this research cannot, and does not, aim to do is recreate what is in the reader's mind in the course of navigation; rather, it attempts to augment, using the data available, the range of critical tools needed for critique of the hyperfictional literary text. Since the decisions writers make in their choice of words are open to stylistic analysis, it seems reasonable to suppose that the choices made by readers can be considered in the same way. Some choices will be more considered than others, so measures to guard against undue interpretation are implemented. For example, the reader profile between some screens might suggest a skimming mode, while in others the strategy may change to close reading and more considered responses. Decisions occurring within a skimming mode lead to a lower level of confidence in the meta-interpretational analysis than those in an intensive mode.
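
One simple way to implement such a confidence guard, sketched here under the assumption that per-screen dwell times can be read off the recordings (the threshold and figures are illustrative, not part of the actual experiment):

    # Hypothetical dwell times in seconds per screen, from the recording.
    dwell_times = {"begin": 4.2, "yes": 31.0, "I want to say": 48.5}

    SKIM_THRESHOLD = 10.0  # assumed cut-off, to be calibrated empirically

    def reading_mode(seconds):
        # Visits shorter than the threshold are treated as skimming, and
        # link choices made there receive less interpretative weight.
        return "skimming" if seconds < SKIM_THRESHOLD else "close reading"

    for space, t in dwell_times.items():
        print(space, "->", reading_mode(t))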

[3] Theoretical Perspectives

Although a strong argument can be made against anachronistic theoretical recontextualisations (Aarseth, 1997: 82ff), as in a return to the parlance of such theories in the notion, for example, of "form and content" (Ryan, 1997: 690), the powerful resonance of these ideas with issues prevalent in our own time can often prove irresistible. Such revisitations might also suggest an almost inevitable reconsideration of the material conditions of literature, observed by Aarseth to be "invisible" to most literary theory (1997: 164). The analyses outlined above (2a) draw terminology and ideas from existing models. Complementing a standard grammatical analysis of the spaces (for example, that found in the later chapters of Greenbaum and Quirk, 1990: 394ff) and Gérard Genette's 'Narrative Discourse: An Essay in Method' (1980) are Aarseth's (1997) "textonomy" and discussion of 'Afternoon', as well as J. Yellowlees Douglas' (1994) analysis and interpretation of her own readings of the work. There are perhaps models better suited to, or more concomitant with, the aspirations of this research, and it is anticipated that interested participants will be forthcoming with suggestions.

[4] Content of poster

1. Title and brief explanation of aims of experiment
2. Diagrams showing structure relevant to analysis
3. Charts plotting reading data
4. Separate key to diagrams and charts linked to interpretations
5. Interpretations linked to diagrams (handout)
6. Analysis of the results in bullet point format

[5] Bibliography

Aarseth, Espen J. (1997). Cybertext: Perspectives on Ergodic Literature. Johns Hopkins UP, Baltimore.
Alasuutari, Pertti (1995). Researching Culture: Qualitative Method and Cultural Studies. Sage, London.
Douglas, J. Yellowlees (1994). '"How Do I Stop This Thing?": Closure and Indeterminacy in Interactive Narratives'. In Landow (ed).
Genette, Gérard (1980). Narrative Discourse: An Essay in Method (tr. Jane E. Lewin). Cornell University Press, Ithaca.
Greenbaum, Sidney and Quirk, Randolph (1990). A Student's Grammar of the English Language. Longman, Harlow.
Landow, George P. (ed) (1994). Hyper/Text/Theory. Johns Hopkins UP, Baltimore.
Landow, George P. (1994). 'What's a Critic to Do? Critical Theory in the Age of Hypertext'. In Landow (ed).
Robson, Colin (1993). Real World Research: A Resource for Social Scientists and Practitioner-Researchers. Blackwell, Oxford.
Ryan, Marie-Laure (1997). 'Interactive Drama: Narrativity in a Highly Interactive Environment'. Modern Fiction Studies, 43:3, 677-707.


(9.4.4) "Hyperlectures": Teaching Postmodern Culture on the Web

Licia Calvi
Trinity College Dublin, Eire.

One of the major didactic difficulties, regardless of the kind of knowledge to be learnt, seems to be that of ensuring a correct "transfer" of information from the teacher to the learner. This "transfer", as Plato already illustrated, may be problematic because of what happens in the "head" of the learner, something which ultimately influences what the learner herself can understand - see also Ambroso (1999). To guarantee a successful information transfer, it is therefore necessary that the teacher adopt a series of strategies that take into account how information is processed by learners. Cognitive science may help in this respect, shedding new light on how information is received, processed, stored and ultimately retrieved when facing real-life situations - see, for instance, Dix et al. (1993). From these characteristics it is indeed possible to draw up different learner profiles and different learning models.

Traditionally, according to van der Veer (1990), the individual characteristics that HCI (Human-Computer Interaction) takes into account when designing interfaces and user-friendly systems include:

  1. learners' personality, mainly in terms of introvert or extrovert behaviour;
  2. the cognitive styles that students follow while learning, either a verbal or a visual style;
  3. the strategies they exploit for learning, namely a heuristic or systematic procedure;
  4. finally, the structures they use to represent the knowledge they have acquired. These structures ultimately reflect the different long-term memory modalities employed to process information, i.e., semantic or episodic structures.

Once this assumption is accepted, the teacher must satisfy all these requirements in order to support the learner in learning as efficiently as possible. IT (Information Technology) takes part in this dialogue by helping the teacher fully assume her role of "learning facilitator according to the learner's pace and modalities" (Ambroso 1999).

It is with this intention in mind that we have started to develop a series of hyperlectures on postmodern culture and philosophy, taught entirely online. The term "hyperlectures" was originally coined by Frode Ulvund (1997) to indicate "a multimedia lecture transferred on the internet" and later rephrased by Rob van Craenenburg (forthcoming) in his series of online lectures on culture. In the context of this presentation, it will be used to indicate courseware with two characteristics: conceptually, it is adaptive, and structurally, it is layered (see below). These hyperlectures originate from two separate courses, "Interactive Narrative" and "Technology and Culture", taught last year within the MSc in Multimedia Systems at TCD. As such, they have a dual objective. On the one hand, they are intended to show the development of hyperfiction, i.e. electronic fiction, from its early, still paper-based attempts to current computer-based works. According to some scholars (see, for instance, Landow 1997), there is indeed a close relationship between the theory of literature and the theory of hypertext: the former theorises the open work, which the latter ultimately embodies and tests. In this respect, understanding hyperfiction requires a definition both of the medium hypertext and of the more theoretical narrative principles implicit in any text: story formation, the notion of plot, the roles of author and reader, and the mutual actions they are required to perform. On the other hand, since hyperfiction has emerged and established itself in recent years, it becomes important to understand the dominant way of representing and thinking about reality, the major philosophical discourse of the present: what is referred to as postmodernism.

The hyperlectures on postmodern culture build on the expertise acquired through two previous experiences in developing Internet-based courseware: an adaptive courseware for teaching hypertext to MSc students in Computer Science, which is still in use and has been adopted by several universities in Europe, and an adaptive courseware for teaching business Italian to MA students in Romance Philology, developed for experimental purposes only. Whereas the former focused on setting up a methodology for knowledge acquisition on the Web that progressively takes into account each student's learning proficiency within a mainly technology-oriented education (Calvi and De Bra 1997), the latter examined the validity of the same framework in a more linguistic domain, combining the "situated" teaching of Italian with the acquisition of domain-specific knowledge, namely present-day Italian economic reality (Calvi, in press). What these Web-based courses have in common is the approach used in designing and developing them, i.e. the need to adapt to users. User adaptivity and customisation are a sort of rule zero for making efficient use of the Web and for ensuring that results are significant in terms of learning. This also means being able to cope with the problems raised by the globalisation of knowledge and by cultural differences, while preserving specific characteristics, in line with the safeguarding of minority languages and cultural heritage that has become so prominent in EU policies. In the investigation of the educational possibilities of hypertext for learning, the following issues have become relevant (see, for example, Brusilovsky 1996):

  1. the necessity to adapt information to users, i.e., to provide information according to users' learning needs, level of competence achieved thus far, goals, and preferences, in order to facilitate learning (a user-tailored approach to information presentation);
  2. the complementary requirement of fostering both textual and conceptual coherence in structuring information;
  3. the subsequent need to limit, if not avoid completely, cognitive overload while users process information, by appropriately determining the sequences of nodes and links shown to them.

The result is a sort of "guided pulling" (Perrin 1998), where it is up to the learner to choose the next step in the acquisition process in a genuinely autonomous way, because the system in no way hinders his/her exploratory behaviour. The process of progressively unveiling links does not, indeed, correspond to the common notion of "guided tours", although it is still the teacher/designer who determines not only how the content material is presented to students, but also how it is structured and how its concepts are interconnected.
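
As a minimal sketch of such prerequisite-based link unveiling (Python; the concept names and prerequisite structure are invented for illustration and do not reproduce the actual course model):

    # A link to a node is unveiled only once all of the node's
    # prerequisites have been read by this student.
    prerequisites = {
        "hypertext": [],
        "postmodernism": [],
        "hyperfiction": ["hypertext"],
        "afternoon": ["hyperfiction", "postmodernism"],
    }

    def visible_links(read_so_far):
        # Nodes still unread whose prerequisites are all satisfied.
        return [node for node, reqs in prerequisites.items()
                if node not in read_so_far
                and all(r in read_so_far for r in reqs)]

    print(visible_links(set()))          # ['hypertext', 'postmodernism']
    print(visible_links({"hypertext"}))  # ['postmodernism', 'hyperfiction']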

The postmodern course is now being followed by two sorts of students: students enrolled in a Master's programme in Multimedia Systems at TCD, who have a very heterogeneous academic background (from engineering to biology, from communication studies to sculpture and painting), although the programme is taught within the Computer Science department; and students enrolled in a Master's programme in Romance Philology at the University of Antwerp, who therefore have a mainly humanistic background. By the end of the course (in a couple of months, and therefore well before the conference is scheduled), data about their modes of interaction with the systems will become available for analysis. This will allow us to verify whether the chosen design guidelines have mapped onto actual use.

During the presentation we will describe the hyperlectures in detail, from the point of view of both their design rationale and their objectives, and will relate them to the previously acquired experience mentioned above, in order to outline the methodological and pedagogical implications of using hypertext in education.

The course can be found at the URL <wwwis.win.tue.nl/~calvi/pmw>.

References

Ambroso, S. (1999). "Nuovi orientamenti nella didattica dell'italiano come L2". Proceedings of the conference "La didattica dell'italiano lingua straniera oggi. Realtà e prospettive", VUB Press.
Brusilovsky, P. (1996). "Methods and Techniques of Adaptive Hypermedia". User Modeling and User-Adapted Interaction, 6(2-3): 87-129.
Calvi, L. (in press). "Business Italian via a Proficiency-Adapted CALL System". In D.P. O'Baoill (ed.), Special Issue of TEANGA 18, 1999.
Calvi, L. and De Bra, P. (1997). "A Proficiency-Adapted Framework for Information Browsing and Filtering in Educational Hypermedia Systems". User Modeling and User-Adapted Interaction, 7: 257-277.
Dix, A., Finlay, J., Abowd, G. and Beale, R. (1993). Human-Computer Interaction. Prentice Hall.
Landow, G. (1997). Hypertext 2.0. The Convergence of Contemporary Critical Theory and Technology. The Johns Hopkins University Press.
Perrin, M. (1998). "What Cognitive Science Tells us About the Use of New Technologies". Paper presented at the Conference "Languages for Specific Purposes and Academic Purposes--Integrating Theory into Practice", Dublin, 6-8 March 1998.
van Craenenburg, R. (forthcoming). Teaching Culture in a Multi-linear Environment. Intellect Books.
van der Veer, G. C. (1990). Human-Computer Interaction. Learning, Individual Differences, and Design Recommendations. Offsetdrukkerij Haveka B.V., Alblasserdam.
Ulvund, F. (1997). "Hyperlectures - Teaching on Demand". Paper originally presented at Cincinnati Symposium on Computers and History, University of Cincinnati, Cincinnati, OH, May 2-3 1997 and at the XIIth International Conference of the AHC in Glasgow, UK, June 30 - July 3 1997, online reference.
