TEI: Affordances and Restrictions

In 2004, Peter Robinson wrote that his research “appear[ed] to be leading us to a rather depressing conclusion: that the effort of making scholarly editions in digital form is not worth the candle”. Of course, digital techniques, software, hardware and their availability have improved vastly in the ten years that have passed. Perhaps more importantly, there has been, if not quite a sea change, at least a medium-sized lake of change within academia with regard to using digital tools. A slowly building community of self-identified Digital Humanities scholars (10 percent of humanities scholars, according to a recent article on Slate) is probably both symptom and cause of this gradual development.

Scholars such as Dr Barbara Bordalejo, whom Robinson cites as particularly influential to his 2004 article, are appearing at “traditional” Textual Scholarship conferences talking about “Computer Assisted Textual Analysis and the Re-Thinking of the Scholarly Edition”. In fact, 15 of the sessions at STS 2013 concerned themselves specifically with “digital editing”, “digital editions” or variants thereof. Though this is only one sample conference, hosted by a joint Textual Studies and Digital Humanities center, the increasing focus on digital editing is indicative of a rapidly growing interest in, acceptance of and use of digital tools within scholarly editing.

Arguably one of the longest-standing standards within Digital Humanities is the Text Encoding Initiative (TEI). Started in 1987 “to develop, maintain, and promulgate hardware- and software-independent methods for encoding humanities data in electronic form” (source), the TEI guidelines, today expressed in XML, have been developed by scholars and coders, constantly evolving to fit the needs of both users and digital technology. Widely accepted as “the encoding scheme of choice for the production of critical and scholarly editions of literary texts, for scholarly reference works and large linguistic corpora” (source), TEI appears, with its expansive guidelines, to cover most possible needs of any project that concerns itself with text markup.

As the project has evolved, each new set of guidelines (as of April 2014, we are on P5, version 2.6.0) has responded both to extensive research conducted by the humanists involved at the time and to the needs and shortcomings discovered by those using the guidelines. As a result, the TEI guidelines are now expansive enough to accommodate projects ranging from the Chinese Buddhist Electronic Text Association to Inscriptions of Roman Tripolitania to the Norwegian Dictionary’s 2014 electronic edition.

However, as with most improvements, this comes at a price. The current guidelines are so vast that they have, in some opinions, expanded past practical utility. To keep all the permitted tags and attributes in order, the guidelines also enforce a strict hierarchy, so as to ensure that every project conforms to the standard. In a modern XML editor such as oXygen, any encoder will quickly be told that “E [Jing] element “gender” [is] not allowed anywhere”. Any effort to bypass the guidelines means giving up the many stylesheets, formatting options and transformation options available in most XML editors. In other words, markup that diverges from the TEI guidelines is about as useful as markup in an XML language developed specifically for that project, which can of course be extremely useful, but is not broadly applicable, readable and understandable in the way TEI is today.
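To make the kind of check behind such an error message concrete, here is a minimal sketch of a content-model check in Python. It is not how Jing or the TEI schema actually work (real validation runs against a full RELAX NG schema that also constrains attributes, ordering and text content). The `ALLOWED_CHILDREN` table and the `find_violations` function are hypothetical names invented for this illustration, and the tiny content model is a drastic simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, drastically simplified content model: which child elements
# each parent may contain. The real TEI schema is far larger and also
# constrains attributes, ordering, and text content.
ALLOWED_CHILDREN = {
    "text": {"body"},
    "body": {"p"},
    "p": {"persName", "title"},
    "persName": set(),
    "title": set(),
}


def find_violations(element, allowed=ALLOWED_CHILDREN):
    """Return a message for every element the toy content model rejects."""
    problems = []
    for child in element:
        if child.tag not in allowed:
            # The element is unknown to the model altogether.
            problems.append(f'element "{child.tag}" not allowed anywhere')
        elif child.tag not in allowed.get(element.tag, set()):
            # The element exists in the model, but not under this parent.
            problems.append(
                f'element "{child.tag}" not allowed inside "{element.tag}"'
            )
        problems.extend(find_violations(child, allowed))
    return problems


SAMPLE = """
<text>
  <body>
    <p><persName>Orlando</persName> <gender>ambiguous</gender></p>
  </body>
</text>
"""

root = ET.fromstring(SAMPLE)
for message in find_violations(root):
    print(message)  # prints: element "gender" not allowed anywhere
```

The sample is perfectly well-formed XML. It is rejected only because the (toy) model has no place for a `gender` element, which is the situation the encoder faces in the editor.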

It is inherent to XML that, to be well formed, a document must have correctly nested elements: a child element must be closed within its parent element, as such:

<parent>
  <child>Orlando is a novel</child>
</parent>

This is no fault of TEI specifically, but simply the way the logic of XML (indeed of most markup languages) works. However, TEI goes to extreme lengths to specify which children may appear under which parents, what each element may contain (e.g. text, numbers, nothing) and which attributes may be attached to which elements. As mentioned above, this is a natural consequence of creating a usable standard. However, with a standard the size of the TEI, this quickly becomes unmanageable and frustrating when the tags needed are indeed available, just not for use in the way the encoder needs to use them.
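The difference between nested and crossed tags can be checked mechanically. The following is a minimal sketch using the XML parser in Python's standard library; the function name `is_well_formed` and the two sample strings are illustrative assumptions. Note that it tests only well-formedness, not whether the markup is valid against the TEI guidelines.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The child element opens and closes inside its parent.
nested = "<parent><child>Orlando is a novel</child></parent>"

# The parent closes while the child is still open, so the tags cross.
crossed = "<parent><child>Orlando is a novel</parent></child>"

print(is_well_formed(nested))   # True
print(is_well_formed(crossed))  # False
```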

Furthermore, as Paul Eggert has pointed out in his article “Text-encoding, Theories of the Text, and the ‘Work-Site’”,

“The TEI (1994) requirement, that documents be syntactically defined as an ordered hierarchy of content objects that cannot overlap, conflicts with the fact that, as many observers have pointed out, humanities texts typically require analysis involving text elements that do overlap” (426).

It is, then, not only gender, as discussed further in “Marking up gender in Orlando”, that cannot be defined by a set of hierarchically ordered elements, but also literary structure. Natural language, fiction, literature, poetry, all the work of authors and writers, is still a different language from that of computers. All automatic “understanding” of natural language has, at some point, been taught to the technology precisely by things like markup. And though there are some incredibly accurate natural language processors in existence (one has only to look at one’s related ads in the Gmail sidebar to see the work of Google’s natural language processor), markup is still necessary to tell the processor what a sentence is, what a paragraph is, what a name is, at least in the realm of creating anything useful for digital literary work.
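The overlap Eggert describes can be illustrated with a sentence that runs across two verse lines. The sketch below is only an illustration under stated assumptions: the sample text is invented, the helper `parses` is a hypothetical name, and the milestone approach shown (marking line starts with an empty `lb` element rather than wrapping lines in containers) is just one of several workarounds in use, chosen here because it is short.

```python
import xml.etree.ElementTree as ET

# A sentence (<s>) that starts in one verse line (<l>) and ends in the next.
# Encoding both as containers makes the tags cross, which is not well formed.
overlapping = (
    "<lg>"
    "<l>First line. <s>A sentence begins here</l>"
    "<l>and ends in the second line.</s></l>"
    "</lg>"
)

# Workaround: keep the sentence as the container and record each line start
# with an empty milestone element (<lb/>), which cannot overlap anything.
milestones = (
    "<p>"
    "<lb/><s>First line.</s> <s>A sentence begins here "
    "<lb/>and ends in the second line.</s>"
    "</p>"
)


def parses(xml_text):
    """Return True if the string is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(overlapping))  # False
print(parses(milestones))   # True

# Line starts can still be recovered by counting the milestones.
line_count = len(ET.fromstring(milestones).findall(".//lb"))
print(line_count)  # 2
```

The trade-off is visible in the second encoding: the sentences are first-class elements that can be selected and styled, while the lines survive only as points in the text that software has to reassemble.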

And it is precisely here that a crux lies within digital text editing for scholarly purposes. In 2005, Eggert laments that “digital texts are still serving only as surrogates for printed texts”, in part “because the inherited consensus on what texts are and how they function has not changed with the technology” (425). Though there has been significant development in the nine years since his article was published, the problem of rhetoric and logic in scholarly electronic editions is becoming apparent. If the textual data is to be useful beyond simply reading it as one would a print text (and basic searchability), it must be marked up, extensively and with great knowledge not only of TEI (or whatever language and guidelines are chosen for the markup), but also of the text itself. A key example of digital texts that evade Eggert’s concern, that editorial work is wasted as digital texts multiply and deteriorate with each new addition to the markup, is the Folger Digital Texts’ marked-up Shakespeare.

These texts have avoided both the need for further markup and the deterioration of the editors’ work, while remaining useful beyond being digitally accessible, authoritative texts, because their editors have marked up “everything” and made the markup publicly available. They are by no means the only Shakespearean XML available, as a quick Google search will reveal, but they are the most widely used and acknowledged in scholarly communities. Furthermore, the owners of these texts have not only made them publicly available for download; they also encourage creative use of the markup to create new and innovative ways to interact with the text, with great results.

Notably, however, these texts are not using TEI, steering away from the restrictive hierarchy and choosing instead to employ a more flexible XML schema written for their own purposes, which can easily be repurposed for other uses. It seems, then, that the use of TEI within the scholarly community has come down to a debate not dissimilar to that over electronic books: do the practical affordances (accessibility, ease of use, premade schemas, transformation scenarios, stylesheets and the authoritative statement that comes with using an established standard) outweigh the practical restrictions (lack of flexibility etc.) and the possibility that the work will soon be rendered inaccessible or obsolete in the midst of rapidly evolving technology?