Gary F. Simons, SIL International
Revised draft of 7 August 2002
EMELD Workshop on Digitizing Lexical Information
2-5 August 2002, Ypsilanti, MI
The EMELD proposal (section 3.1) lists the following among the objectives of the project:
The purpose of the workshop is to begin the process of developing best practice recommendations for the markup and metadata description of lexical resources. The purpose of this document is to lay out a roadmap for what needs to be developed in order to meet the objectives of the project. This is done by proposing requirements for the eventual solution and then enumerating consequent features of its implementation. But first I begin with some background definitions.
Before developing a system for encoding lexical resources, it is necessary to define the audience for that system. This was done, for instance, by Ide and others (1992) for the most widely known system of dictionary encoding, namely the TEI guidelines (Sperberg-McQueen and Burnard 2001b). The focus of the TEI guidelines is dictionaries that have already been published in print. The developers of the guidelines counted within the audience not only the lexicographer who creates dictionaries and the computational linguist who mines information from encoded dictionaries, but also the print historian who wants to study conventions of typesetting and layout. They identify three views of the dictionary and conclude that the markup must be able to encode all three views and the mappings among them. The three views are the typographic view, the textual view, and the lexical view.
In the EMELD project we can (and should) narrow the focus. Our aim is to give guidance to field linguists on how they should create electronically encoded lexical resources so as to maximize their long-term usefulness. In this context only the lexical view is of relevance. The typographic and textual views are not part of the information resource itself, but will be added by automated processes (using stylesheets) that tailor the published appearance to the needs of a given target audience.
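As a rough illustration of this separation, the following Python sketch stores only the lexical view of one entry and derives two presentations from it. It is a stand-in for a stylesheet, not a recommendation: in practice this role would normally be played by a stylesheet language. The element names (entry, headword, pos, gloss), the invented word form, and the two rendering functions are all assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical lexical-view encoding of one entry (invented data).
ENTRY = "<entry><headword>kula</headword><pos>noun</pos><gloss>house</gloss></entry>"


def render_print(entry):
    """Derive a compact, print-dictionary style line from the lexical view."""
    return "{} ({}) {}".format(
        entry.findtext("headword"), entry.findtext("pos"), entry.findtext("gloss")
    )


def render_html(entry):
    """Derive a presentational HTML paragraph from the same lexical view."""
    return "<p><b>{}</b> <i>{}</i> {}</p>".format(
        html.escape(entry.findtext("headword")),
        html.escape(entry.findtext("pos")),
        html.escape(entry.findtext("gloss")),
    )


entry = ET.fromstring(ENTRY)
print(render_print(entry))
print(render_html(entry))
```

The archived resource contains no typographic decisions; each presentation is computed on demand and can be changed without touching the resource.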
To create electronically encoded lexical resources we will use markup languages. A markup language, like a natural language, has a lexicon, syntax, and semantics. The following terms are used throughout this paper to refer to the descriptive artifacts that document these three aspects of markup:
- markup vocabulary
Enumerates the lexical inventory of markup: i.e., the set of elements and attributes that are used in marking up a resource. (In practice, the vocabulary is enumerated within the markup schema rather than in a separate document.)
- markup schema
Specifies the syntax of markup: i.e., a formal grammar defining constraints on where elements and attributes must or may occur with respect to embedding and relative order. (This is typically realized in an XML DTD or an XML Schema, though other mechanisms are emerging.)
- markup metaschema
Specifies the semantics of markup: i.e., a formal mapping from elements and attributes to the linguistic concepts they represent. (This area of markup is not as well developed as the syntactic area, but is beginning to be developed under the impetus of the so-called Semantic Web (W3C 2002).)
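To make the three terms concrete, here is a minimal Python sketch under stated assumptions. The element names, the occurrence bounds, and the concept labels are hypothetical and are not drawn from the TEI guidelines or from any EMELD specification. The vocabulary is modeled as a set of names, the schema as a table of ordered child elements with occurrence bounds, and the metaschema as a mapping from element names to linguistic concepts. A real schema would be expressed as an XML DTD or XML Schema rather than a Python dictionary.

```python
import xml.etree.ElementTree as ET

# Markup vocabulary: the inventory of element names (hypothetical).
VOCABULARY = {"lexicon", "entry", "headword", "pos", "sense", "gloss"}

# Markup schema: for each element, its permitted children in order,
# each with (name, minimum count, maximum count); None means unbounded.
SCHEMA = {
    "lexicon": [("entry", 0, None)],
    "entry": [("headword", 1, 1), ("pos", 0, 1), ("sense", 1, None)],
    "sense": [("gloss", 1, None)],
    "headword": [],
    "pos": [],
    "gloss": [],
}

# Markup metaschema: element name -> linguistic concept (labels are assumed).
METASCHEMA = {
    "headword": "citation form",
    "pos": "part of speech",
    "gloss": "translation equivalent",
}


def is_valid(element):
    """Check an element and its descendants against VOCABULARY and SCHEMA."""
    if element.tag not in VOCABULARY:
        return False
    children = list(element)
    i = 0
    for name, lo, hi in SCHEMA[element.tag]:
        count = 0
        while i < len(children) and children[i].tag == name:
            count += 1
            i += 1
        if count < lo or (hi is not None and count > hi):
            return False
    if i != len(children):  # leftover children out of order or not permitted here
        return False
    return all(is_valid(child) for child in children)


good = ET.fromstring(
    "<entry><headword>kula</headword><pos>noun</pos>"
    "<sense><gloss>house</gloss></sense></entry>"
)
bad = ET.fromstring(
    "<entry><pos>noun</pos><headword>kula</headword>"
    "<sense><gloss>house</gloss></sense></entry>"
)
print(is_valid(good))  # True
print(is_valid(bad))   # False: pos precedes headword
print(METASCHEMA["pos"])
```

Both entries use only names from the vocabulary; the second fails because it violates the ordering constraint in the schema. The metaschema plays no part in validation; it only records what each element means.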
In this presentation of requirements the individual requirements are set apart as numbered statements in order to facilitate discussion. Similarly, the consequent features are set out as subordinate statements that bear an identification letter, as in:
The first requirement deals with the need for longevity of access far into the future. This aspect of language documentation and description is covered in detail in Bird and Simons (2002); only a few key points are noted here:
Microsoft Word documents provide an example of a proprietary, binary format that is not acceptable for long-term preservation of information. Plain text documents formatted with line breaks and spaces are an example of a format that meets requirements a through c; so are tab- or comma-delimited representations of spreadsheets or data tables. But most lexical resources have a more complex structure involving hierarchy and cross-reference, so a more sophisticated representation is needed. Markup based on the XML standard meets all the above requirements and is now supported by such a wide variety of tools (both open and proprietary) that it has become the clear choice for archival formats. Those unfamiliar with XML are referred to the Text Encoding Initiative's "Gentle Introduction to XML" (Sperberg-McQueen and Burnard 2001a). But what should the nature of the markup vocabulary be?
HTML markup, when applied to lexical resources, is an example of presentational markup. Though it does have the features of longevity needed for an archival format, it does not offer linguists the ability to do automated processing of a linguistic nature, such as answering the query "What are the part-of-speech categories used in this lexicon?" For this purpose a markup vocabulary that specifically identifies the linguistic significance of each piece of information is needed. But simply having a markup vocabulary is not enough; for each lexical resource there is also a grammar that defines how the individual markup elements combine to form valid lexical descriptions.
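The part-of-speech query can be sketched in a few lines of Python once the markup is descriptive. The example below assumes a hypothetical vocabulary in which every part-of-speech label is wrapped in a pos element; the element names and the invented word forms are illustrative only. With presentational markup such as italics there would be no reliable way to write the same function, because italics may mark many things besides part of speech.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptively marked-up lexicon (invented data).
LEXICON = """
<lexicon>
  <entry>
    <headword>kula</headword>
    <pos>noun</pos>
    <gloss>house</gloss>
  </entry>
  <entry>
    <headword>tepa</headword>
    <pos>verb</pos>
    <gloss>walk</gloss>
  </entry>
  <entry>
    <headword>miro</headword>
    <pos>noun</pos>
    <gloss>river</gloss>
  </entry>
</lexicon>
"""


def pos_categories(xml_text):
    """Return the sorted set of part-of-speech labels used in the lexicon."""
    root = ET.fromstring(xml_text)
    return sorted({el.text.strip() for el in root.iter("pos") if el.text})


print(pos_categories(LEXICON))  # ['noun', 'verb']
```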
These consequences of requirement 3 thus mean that there will be multiple markup schemas, even in the context of best practice. In order to achieve interoperability of resources when there are multiple markup schemas we will need to introduce a meta-level in our approach to markup:
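One way to picture such a meta-level is the following Python sketch, offered as an assumption-laden illustration rather than a proposed design. Two resources use different schemas, and each is accompanied by a metaschema that maps its own element names onto shared concept labels. The resources, element names, concept labels (CitationForm, PartOfSpeech), and function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical resources that follow different markup schemas.
RESOURCE_A = (
    "<lexicon><entry><headword>kula</headword><pos>noun</pos></entry></lexicon>"
)
RESOURCE_B = (
    "<dictionary><lexeme><form>tepa</form><category>verb</category></lexeme>"
    "</dictionary>"
)

# One metaschema per resource: local element name -> shared concept label.
METASCHEMA_A = {"headword": "CitationForm", "pos": "PartOfSpeech"}
METASCHEMA_B = {"form": "CitationForm", "category": "PartOfSpeech"}


def values_for_concept(xml_text, metaschema, concept):
    """Collect the text of every element that the metaschema maps to concept."""
    tags = {tag for tag, mapped in metaschema.items() if mapped == concept}
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if el.tag in tags and el.text]


def query_all(resources, concept):
    """Run the same concept-level query over (xml_text, metaschema) pairs."""
    found = []
    for xml_text, metaschema in resources:
        found.extend(values_for_concept(xml_text, metaschema, concept))
    return found


resources = [(RESOURCE_A, METASCHEMA_A), (RESOURCE_B, METASCHEMA_B)]
print(query_all(resources, "PartOfSpeech"))  # ['noun', 'verb']
print(query_all(resources, "CitationForm"))  # ['kula', 'tepa']
```

The query is phrased once, in terms of concepts, and never mentions the element names of either schema; adding a third resource would require only its own metaschema.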
Finally, it is not enough that electronically encoded resources are created. They must also be found and used by others long into the future. This implies a final set of consequences having to do with archiving.
The Open Language Archives Community is already in place with an infrastructure that meets these needs, and EMELD will build on this infrastructure.
Taken together, the above requirements and the consequent features of implementation suggest the following shape for best practice:
| ||Best Practice for Resource Creation||What the Community Must Do to Support Best Practice|
|Lexical description||Archive resource as an XML document that is valid with respect to a descriptive markup schema that is supplied with the resource.||1. Document characteristics of best practice descriptive|
| || ||2. Recommend one or more markup schemas that meet these characteristics.|
| || ||3. Develop stylesheets for these schemas.|
|Metadescription for resource discovery||Provide OLAC metadata for the resource and deposit it with an OLAC data provider.||4. Define the OLAC metadata standard.|
| || ||5. Define the controlled vocabulary for identifying lexical resource types in <type.linguistic>.|
| || ||6. Develop a community service for resource discovery.|
|Metadescription for resource interoperation||Provide a metaschema for the resource.||7. Define a common ontology of the concepts of lexical|
| || ||8. Define the standard markup schema for a metaschema.|
| || ||9. Develop metaschemas for the schemas recommended in point 2 above.|
| || ||10. Develop a community service that uses metaschemas to provide interoperation across multiple lexical resources.|
When the 10 community action steps listed in the last column have been completed, the "formulation" part of the EMELD objectives listed at the outset of this paper will have been met for the area of lexicons. The "promulgation" part will require additional work in areas like documentation, dissemination, and training.
Bird, Steven and Gary Simons, 2002. Seven Dimensions of Portability for Language Documentation and Description, Proceedings of the Workshop on Portability Issues in Human Language Technologies, Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands. Available at: http://arxiv.org/abs/cs/0204020
Ide, Nancy and others, 1992. Principles for encoding machine readable dictionaries, EURALEX'92 Proceedings. Available at: http://www.cs.vassar.edu/~ide/papers/Euralex92.ps
Langendoen, D. Terence and others, 2002. Publications of the EMELD Arizona group. Available at: http://emeld.douglass.arizona.edu:8080/group.html.
Simons, Gary F., 1998. Using architectural processing to derive small, problem-specific XML applications from large, widely-used SGML applications, SIL Electronic Working Papers 1998-006. Available at: http://www.sil.org/silewp/1998/006/.
Sperberg-McQueen, C.M. and Lou Burnard, 2001a. A Gentle Introduction to XML. Chapter 2 of TEI P4: Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition. TEI Consortium. Available at: http://www.tei-c.org/P4X/SG.html
Sperberg-McQueen, C.M. and Lou Burnard, 2001b. Print Dictionaries. Chapter 12 of TEI P4: Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition. TEI Consortium. Available at: http://www.tei-c.org/P4X/DI.html
W3C, 2002. The Semantic Web, an activity of the World Wide Web Consortium. Home page: http://www.w3.org/2001/sw/.
This analysis of the shape of best practice makes it possible to offer more focus to the assignments for the three workgroups that will function during the workshop:
|Group I: Principles of Lexical Description||
|Group II: Markup of Lexical Entries (emphasis on ontological concepts)||
|Group III: Lexicon Macrostructure||