Development of the NTI Reader Dictionary

This page describes the design, sources, and building of the NTI Reader Dictionary. The NTI Reader Dictionary is a Chinese-English dictionary combining general modern Chinese vocabulary, literary Chinese vocabulary, and Buddhist terminology.

Goals

The goals of the NTI Reader dictionary are to:

  1. Support the NTI Reader by providing entries for the words in its texts in a way that is helpful to human users
  2. Build an index to help users find the most relevant texts for any given word
  3. Support corpus management and analysis software as a machine-readable dictionary
  4. Provide a general word-search resource for users of the NTI Reader website

Sources

To achieve these goals, the NTI Reader dictionary combines the functions of several traditional dictionaries, in particular those of a general literary Chinese-English dictionary and a Buddhist dictionary. Major sources for the NTI Reader dictionary are listed in the table below.

Table: Comparison of Major Sources by Type
Type | Source | Notes
Bilingual | A Chinese-English Dictionary (Giles 1892); CC-CEDICT (CC-CEDICT Project 2016); Buddhist Chinese-Sanskrit Dictionary (Hirakawa 1997) | Giles’ Chinese-English Dictionary is one of the most comprehensive literary Chinese to English dictionaries available, although it uses archaic English equivalents. It was written before vernacular Chinese became commonly used for Chinese text, which began around 1911 (Sun 2006, loc. 335-345).
Etymological | Ci Yuan 《辭源》 (Guangdong, Guangxi, Hunan, and Henan Ci Yuan Revision Committee 1983); A Student's Dictionary of Classical and Medieval Chinese (Kroll 2015); Gudai Hanyu Cidian 《古代汉语词典》 (Chen Fu Hua 2005); Gudai Hanyu Da Cidian 《古代汉语大词典》 (Wangjian Yin et al. 2007) | Etymological dictionaries describe the historic meaning and usage of a word in a modern language. Ci Yuan 《辭源》 is one of the most comprehensive sources of literary Chinese words with meanings described in modern Chinese.
Buddhist | The Princeton Dictionary of Buddhism (Buswell and Lopez 2014); Fo Guang Buddhist dictionary 《佛光大辭典》 (Fo Guang Shan 2000) | The main focus of Buddhist dictionaries is encyclopedic entries describing Buddhist concepts. What has been used in compilation of the NTI Reader is the headword and a small amount of notes.
Multilingual | Unicode Han Database (Unicode Consortium 2016b) | The Unicode Han Database contains entries for Han characters with Chinese, Japanese and Korean readings and basic equivalents in English. It is especially useful for rare characters and for its machine-readable format.
Gazetteers | Buddhist Studies Place Authority Databases (Zhang, Bo-Yong and Ge, Grace 2002) | A gazetteer contains a list of place names. The Buddhist Studies Place Authority Databases contains entries for many Buddhist temples mentioned in the Taishō.
Linguistic references | Outline of Classical Chinese Grammar (Pulleyblank 1995); Chinese: a Linguistic Introduction (Sun 2006); Chinese (Norman 1988) | Used for function words and understanding of syntactic categories.

Buddhist dictionaries have some aspects of multilingual dictionaries because they often include Chinese, Sanskrit, Pali and English. For example, the Princeton Dictionary of Buddhism entry for ‘dharma’ includes Sanskrit, Pali, Chinese, Japanese, Korean and Tibetan (Buswell and Lopez 2014, s.v. ‘dharma’).

Entries were added to the dictionary as they were encountered in reading Buddhist texts, as they were provided by external contributors, or as they were found through terminology extraction, as described in Introduction to Corpus Analysis of the Chinese Buddhist Canon.

The NTI Reader dictionary is built on several sources that allow direct incorporation, supplemented by extensive curation with reference to many other sources. The sources that allow direct incorporation are:

  1. The CC-CEDICT Project, under a Creative Commons Attribution-Share Alike 3.0 License (CC BY-SA 3.0).
  2. The Unihan database, under the Unicode® Terms of Use.
  3. Fo Guang Shan terminology, used with permission (Fo Guang Shan, 2015).
  4. Lancaster, for text titles, used with permission (Lancaster, 2004).

Direct incorporation of these sources has allowed relatively rapid compilation of a Chinese-English dictionary.
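To illustrate what direct incorporation of a source involves, CC-CEDICT is distributed as a plain-text file with one entry per line in the form `traditional simplified [pin1 yin1] /equivalent 1/equivalent 2/`. The following is a minimal sketch of parsing such a line. The function name `parse_cedict_line` and the sample line are assumptions made for illustration; this is not the NTI Reader's actual import code.

```python
import re
from typing import Optional

# One CC-CEDICT entry per line: traditional simplified [pinyin] /sense/sense/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line: str) -> Optional[dict]:
    """Parse one CC-CEDICT line; return None for comments and malformed lines."""
    if line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, senses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "english": senses.split("/"),
    }


# Illustrative sample line, written for this sketch rather than copied from the file
sample = "佛陀 佛陀 [Fo2 tuo2] /Buddha/"
entry = parse_cedict_line(sample)
```

Entries parsed this way would still need curation before use, since CC-CEDICT carries no part of speech, domain, or notes fields.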

Microstructure

Each dictionary entry is represented as a headword with a list of lexical units. A headword is generally understood as the word that a dictionary entry is organized around, which is often lemmatized to a basic grammatical form (Atkins and Rundell 2008, loc. 1353-1366). For the NTI Reader dictionary this is the Chinese text of the word. A lexical unit is a unit of vocabulary that is approximately the same as a word sense and may differ from other lexical units in part of speech and pronunciation (Crystal 2008, s.v. ‘lexis’). The English field of a single lexical unit may include several alternate translations called equivalents (Svensén 2009, p. 7). The project will describe the structure of the dictionary in detail, relating it to current lexicographical practice, for example, as described by Atkins and Rundell (2008) and Svensén (2009).
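The relationship between headwords, lexical units, and equivalents can be sketched in a few lines of Python. This is only an illustration: the row layout, the second lexical unit with id 90001, and the assumption that equivalents are separated by a slash are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rows: (headword id, lexical unit id, Chinese text, English field)
rows: List[Tuple[int, int, str, str]] = [
    (3618, 3618, "佛", "Buddha / Awakened One"),
    (3618, 90001, "佛", "Buddhist"),  # made-up second sense for illustration
]


def split_equivalents(english: str) -> List[str]:
    """Split the English field into individual equivalents (assumes '/' separators)."""
    return [part.strip() for part in english.split("/") if part.strip()]


def group_by_headword(rows) -> Dict[int, List[dict]]:
    """Collect lexical units under the headword id they belong to."""
    entries = defaultdict(list)
    for headword_id, unit_id, text, english in rows:
        entries[headword_id].append(
            {"id": unit_id, "text": text, "equivalents": split_equivalents(english)}
        )
    return dict(entries)


entries = group_by_headword(rows)
```

Splitting on a slash would break any equivalent that itself contains one, so a real implementation would need a more careful convention.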

The basic unit of information in the NTI Reader dictionary is a lexical unit. See the Style Guide for a description of semantic data from the perspective of an editor adding entries. The fields for each lexical unit are listed in the table below.

Table: Metadata for Lexical Units
Item | Description | Example
headword | A unique identifier for the headword. | 3618
id | A unique integer identifying the lexical unit. There may be multiple lexical units for each headword. The id of the first lexical unit is the same as the headword id. | 3618
simplified | Simplified Chinese text for the word. This is also the headword, which is constant for different lexical units. |
traditional | Traditional Chinese text for the word. If the traditional is the same as the simplified, then this field is omitted. | -
pinyin | Pronunciation written in Hanyu pinyin. |
english | One or more English equivalents for the Chinese word. | Buddha / Awakened One
grammar | Part of speech. | proper noun
concept_cn | The concept (in Chinese) that the word describes. See discussion of metadata below. | -
concept_en | The concept (in English) that the word describes. | -
domain_cn | The domain (in Chinese) that the word sense belongs to. See discussion of metadata below. | 佛教
domain_en | The domain (in English) that the word sense belongs to. | Buddhism
subdomain_cn | The child domain (in Chinese) that the word sense belongs to. | -
subdomain_en | The child domain (in English) that the word sense belongs to. | -
image | The name of an image file, mainly used for art and architecture. | -
notes | Abbreviated encyclopedic notes and other information. | Sanskrit: buddha … See 佛陀. (BL 'Buddha'; CCD '佛' 1; FE '佛' 1; Kroll '佛')
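For software that consumes the dictionary, the fields above map naturally onto a record type. The following sketch assumes that omitted values are stored as `-`, as in the Example column. The names `LexicalUnit` and `from_fields` are hypothetical, and the Chinese text and pinyin in the sample are filled in for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

EMPTY = "-"  # assumed marker for an omitted value


@dataclass
class LexicalUnit:
    headword: int
    id: int
    simplified: str
    traditional: Optional[str]
    pinyin: str
    english: str
    grammar: str
    concept_cn: Optional[str] = None
    concept_en: Optional[str] = None
    domain_cn: Optional[str] = None
    domain_en: Optional[str] = None
    subdomain_cn: Optional[str] = None
    subdomain_en: Optional[str] = None
    image: Optional[str] = None
    notes: Optional[str] = None

    def traditional_text(self) -> str:
        """Return the traditional form, falling back to the simplified form."""
        return self.traditional or self.simplified


def from_fields(fields: dict) -> LexicalUnit:
    """Build a LexicalUnit from string fields, mapping the empty marker to None."""
    cleaned = {k: (None if v == EMPTY else v) for k, v in fields.items()}
    cleaned["headword"] = int(cleaned["headword"])
    cleaned["id"] = int(cleaned["id"])
    return LexicalUnit(**cleaned)


unit = from_fields({
    "headword": "3618", "id": "3618",
    "simplified": "佛", "traditional": "-",  # Chinese text assumed for illustration
    "pinyin": "fó",                          # assumed for illustration
    "english": "Buddha / Awakened One", "grammar": "proper noun",
    "domain_cn": "佛教", "domain_en": "Buddhism",
})
```

Keeping the traditional field optional mirrors the rule that it is omitted when identical to the simplified form, while `traditional_text` gives callers a value they can always display.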

One challenge has been assigning parts of speech to lexical entries. The traditional parts of speech (noun, verb, adjective, etc.) are commonly understood but have problems in themselves and problems in application to literary Chinese. One of the problems with traditional parts of speech is overlapping categories. Culicover gives the example ‘glass window’ where the noun ‘glass’ modifies the noun ‘window’ (Culicover 2009, p. 45). The noun ‘glass’ is acting like an adjective in this phrase. Culicover gives more examples of nouns acting as adjectives, including ‘cotton dress,’ ‘career move’ and ‘desk chair.’ The theory of syntactic categories is one attempt to address this problem (Culicover 2009, p. 45). Despite the problem of overlapping categories, traditional parts of speech are still commonly included in dictionary entries.

Assigning parts of speech to literary Chinese is more problematic than assigning them to English. Norman notes that lack of word morphology and freedom of movement between word categories are problems in formal word class analysis in literary Chinese (Norman 1988, p. 87). However, it is not true that words in literary Chinese cannot be assigned a part of speech at all. Norman outlines a system of word classes similar to the word classes adopted in the NTI Reader (Norman 1988, pp. 88-94).

Metadata

The concept, domain, and subdomain fields are semantic labels that are used to tag dictionary entries. They help to define the kind of word and the context of its use. As an example of a concept, consider a word for a temple. The concept for the word would be '寺院 Temple' and other metadata should describe the location of the temple. This is what Harpring defines as an instance relationship (Harpring, 2010, pp. 39-40). The domain for most Buddhist words is '佛教 Buddhism'. Metadata can help users of the NTI Reader by removing ambiguity when discussing data and can also help to locate related concepts and keywords.

The domain and subdomain fields are hierarchically related to each other and analogous to the Broader Terms (BT) and Narrower Terms (NT) fields in the Library of Congress Subject Headings (Library of Congress 2016, ix). For example, the domain '佛教 Buddhism' is a parent of the subdomain '中国佛教 Chinese Buddhism' for the dictionary entry for 大乘起信論 Treatise on the Awakening of Faith in the Mahāyāna.

The domain values are restricted to the set in topics.txt. In this sense the domain labels form a controlled vocabulary. The possible values for concept and subdomain are not restricted to a limited set.
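A restriction like this can be checked mechanically. The sketch below assumes that the allowed labels are available one per line; the actual layout of topics.txt may differ, and the function names and sample data are hypothetical. Only the domain is checked, since concept and subdomain values are unrestricted.

```python
from typing import Iterable, List, Set


def load_domains(lines: Iterable[str]) -> Set[str]:
    """Collect allowed domain labels, assuming one label per non-empty line."""
    return {
        line.strip()
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def invalid_domains(units: Iterable[dict], allowed: Set[str]) -> List[dict]:
    """Return the lexical units whose domain is set but not in the allowed set."""
    return [u for u in units if u.get("domain_en") and u["domain_en"] not in allowed]


# Stand-in for the contents of topics.txt; the real file format may differ.
allowed = load_domains(["Buddhism", "Classical Chinese", "Places", ""])

units = [
    {"id": 1, "domain_en": "Buddhism", "subdomain_en": "Chinese Buddhism"},
    {"id": 2, "domain_en": "Budhism"},  # misspelled on purpose
    {"id": 3, "domain_en": None, "concept_en": "Temple"},
]
problems = invalid_domains(units, allowed)
```

Here only the unit with the misspelled domain is reported; the unit with no domain passes because an empty domain is not treated as an error in this sketch.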

Burdick et al. write, “The use of structured and/or tagged approaches to identify persons, themes, places, or features of a text provides a way to maximize the intellectual investigation of documents and to display these interpretations” (Burdick et al. 2012, p. 35). Metadata may conform to formal ontologies, such as the International Committee for Documentation (CIDOC) Conceptual Reference Model (Crofts et al. 2011), in order to use consistent terminology within and across information systems. The Getty Art & Architecture Thesaurus (AAT) uses such an ontology (Getty Research Institute).

Ontological metadata hierarchies may be shallow or deep. The AAT uses a deep taxonomy to classify concepts used in its descriptions of physical and digital assets. An example of this is the concept ‘buddhapadas’ (Buddha footprints) (Getty Research Institute, ID ‘300395464’, viewed 29 January 2017):


  .... Visual and Verbal Communication (hierarchy name) (G)
  ........ Visual Works (hierarchy name) (G)
  ................ <visual works by material or technique> (G)
  .................... sculpture (visual works) (G)
  ........................ <sculpture by subject type> (G)
  ............................ Buddhapadas (G)
  

One concern with deep hierarchies like this is the time needed for curation. The NTI Reader uses a simpler hierarchy, but one with no formal ontology. The main goals of the dictionary metadata in the NTI Reader are to (1) reduce lexical ambiguity and (2) facilitate curation. The data being described in the case of the NTI Reader dictionary is the dictionary vocabulary.

The domain field gives the subject domain that the word belongs to. For example, the domain for the word entry 海淀 Hǎidiàn ‘Haidian’ is “地方 Dìfāng ‘Places’”. The most important use of the domain is to distinguish modern Chinese from literary Chinese. Words that are used only in literary Chinese are labelled as ‘古文 Classical Chinese’. Other words are modern Chinese. The subdomain field narrows the domain. For example, one subdomain for the domain “佛教 Fójiào ‘Buddhism’” is “大乘佛教 Dàshèng Fójiào ‘Mahāyāna Buddhism’”.
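The two uses of the domain described here, separating literary from modern Chinese and grouping subdomains under their parent domain, can be sketched as follows. The function names and the sample units are hypothetical, and the sketch follows the text literally in treating every word outside the ‘Classical Chinese’ domain as modern Chinese.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

LITERARY_DOMAIN = "Classical Chinese"  # English half of the '古文 Classical Chinese' label


def register(unit: dict) -> str:
    """Label a lexical unit as literary or modern Chinese based on its domain."""
    return "literary" if unit.get("domain_en") == LITERARY_DOMAIN else "modern"


def subdomains_by_domain(units: Iterable[dict]) -> Dict[str, List[str]]:
    """Build a shallow two-level hierarchy: domain -> sorted list of subdomains."""
    tree = defaultdict(set)
    for unit in units:
        domain = unit.get("domain_en")
        if not domain:
            continue
        subdomains = tree[domain]  # creates the domain key if it is new
        if unit.get("subdomain_en"):
            subdomains.add(unit["subdomain_en"])
    return {domain: sorted(subs) for domain, subs in tree.items()}


# Illustrative units; the third is a made-up example of a literary-only word.
units = [
    {"text": "海淀", "domain_en": "Places"},
    {"text": "大乘起信論", "domain_en": "Buddhism", "subdomain_en": "Chinese Buddhism"},
    {"text": "矣", "domain_en": "Classical Chinese"},
]
registers = [register(u) for u in units]
tree = subdomains_by_domain(units)
```

Because the hierarchy has only two levels, a plain dictionary of lists is enough to represent it; a deep taxonomy such as the AAT would need a proper tree structure.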

Harpring defines a controlled vocabulary as “an information tool that contains standardized words and phrases used to refer to ideas, physical characteristics, people, places, events, subject matter, and many other concepts” (Harpring 2010, p. 1). Controlled vocabularies help in data categorization, indexing and retrieval. The concept, domain, and subdomain fields in the NTI Reader dictionary are more structured than simple labels but not as carefully defined as a controlled vocabulary. They resemble a controlled vocabulary in the sense that the concept, domain, and subdomain fields have some hierarchical structure. However, the system does not collect variants and synonyms and relate those back to an authoritative record, as in a controlled vocabulary (Harpring 2010, p. 12). In addition, there is inconsistency in the hierarchy. For example, the domain value ‘Buddhism’ can be understood as a domain but the domain value ‘Classical Chinese’ is a language and there is no corresponding ‘Modern Chinese’ value in the hierarchy.

Usage Examples

One way of helping users decide on a word sense is to give examples that illustrate the various senses. Many of the problems with polysemy come from commonly used words that are not specifically Buddhist terms. Finding good examples that do not confuse the user is a special topic in itself, and drawing examples for these words randomly from the corpus is unlikely to work well. The reasons are: (1) examples for non-Buddhist terms should not be drawn from a Buddhist source but rather from a representative source and (2) it is typically hard to find examples in the corpus that can be understood without specialist knowledge of the particular text or without referring to previous sections of the text. Good examples are self-contained and simple.

References

  1. Atkins, BTS, Rundell, M 2008, The Oxford Guide to Practical Lexicography, Oxford University Press, Oxford.
  2. Burdick, A, Drucker, J, Lunenfeld, P, Presner, T, Schnapp, J 2012, Digital_Humanities, MIT Press, Cambridge, Mass.
  3. Buswell, RE, Lopez, DS (eds) 2014, The Princeton Dictionary of Buddhism, Princeton University Press, Princeton.
  4. CC-CEDICT Project, CC-CEDICT, accessed 2 March 2014, http://cc-cedict.org/wiki/. See also the MDBG Chinese-English dictionary, an online dictionary website based on CC-CEDICT.
  5. Chen Fu Hua 陈复华 (ed.) 2005, Gudai Hanyu Cidian 《古代汉语词典》, China Commercial Press 商务印书馆, Beijing.
  6. Culicover, PW 2009, Natural Language Syntax, Oxford University Press, Oxford; New York.
  7. Fo Guang Shan, Fo Guang Dictionary of Buddhism 《佛光大辭典》, online Buddhist dictionary, accessed 2 March 2014.
  8. Gilliland, A 2008, “Setting the Stage”, in Martha Baca (ed.), Introduction to Metadata, Getty Research Institute, Los Angeles, pp. 1–19.
  9. Guangdong, Guangxi, Hunan, and Henan Ci Yuan Revision Committee 1983, Ci Yuan 《辭源》, revised ed., Commercial Press, Beijing.
  10. Giles, HA 1892, A Chinese-English Dictionary, B. Quaritch, London.
  11. Harpring, P 2010, Introduction to Controlled Vocabularies: Terminology for Art, Architecture, and other Cultural Works, Getty Research Institute, Los Angeles.
  12. Hirakawa, A 1997, Buddhist Chinese-Sanskrit Dictionary, The Reiyukai, Tokyo.
  13. Library of Congress 2016, Library of Congress Authorities, 38th ed., Library of Congress, Washington, D.C., accessed 21 April 2017, http://authorities.loc.gov.
  14. Library of Congress 2016, Library of Congress Subject Headings, 38th ed., Library of Congress, Washington, D.C., accessed 21 April 2017, http://id.loc.gov/authorities/subjects.html.
  15. Norman, J 1988, Chinese, Cambridge Language Surveys, Cambridge University Press, Cambridge [Cambridgeshire]; New York.
  16. Pulleyblank, EG 1995, Outline of Classical Chinese Grammar, UBC Press, Vancouver.
  17. Sun, C 2006, Chinese: A Linguistic Introduction, Kindle ed., Cambridge University Press, Cambridge, UK; New York.
  18. Unicode Consortium 2015, Unicode Han Database, http://unicode.org/reports/tr38/. Lookup is at http://www.unicode.org/charts/unihan.html.
  19. Wangjian Yin 王剑引, Liangjian Min 2007, Gudai Hanyu Da Cidian 《古代汉语大词典》, Ci Hai edition, Shanghai Commercial Press, Shanghai.
  20. Zhang, B-Y, Ge, G 2002, “Place Authority Database” 《地名規範資料庫》, Dharma Drum Buddhist College, viewed 11 March 2017, http://authority.dila.edu.tw/place/.