NTI Buddhist Text Reader

Development of the NTI Reader Dictionary

This page describes the design, sources, and building of the NTI Reader Dictionary. The NTI Reader Dictionary is a Chinese-English dictionary combining general modern Chinese vocabulary, literary Chinese vocabulary, and Buddhist terminology.



The goals of the NTI Reader dictionary are to:

  1. Support the NTI Reader by providing entries for the words in the texts in a way that is helpful to human users
  2. Build an index to help users find the most relevant texts for any given word
  3. Support corpus management and analysis software as a machine-readable dictionary
  4. Provide a general word-search resource for users of the NTI Reader website


In order to achieve these goals, the NTI Reader dictionary combines the function of several traditional dictionaries. In particular, it combines the function of a general literary Chinese-English dictionary and a Buddhist dictionary. Major sources for the NTI Reader dictionary are listed in the table below.

Table: Comparison of Major Sources by Type

Type: Bilingual
Sources: A Chinese-English Dictionary (Giles 1892); CC-CEDICT (CC-CEDICT Project 2016); Buddhist Chinese-Sanskrit Dictionary (Hirakawa 1997)
Notes: Giles’ Chinese-English Dictionary is one of the most comprehensive literary Chinese to English dictionaries available, although it uses archaic English equivalents. It was written before vernacular Chinese became commonly used for Chinese text, which began around 1911 (Sun 2006, loc. 335-345).

Type: Etymological
Sources: Ci Yuan 《辭源》 (Guangdong, Guangxi, Hunan, and Henan Ci Yuan Revision Committee 1983); A Student's Dictionary of Classical and Medieval Chinese (Kroll 2015); Gudai Hanyu Cidian 《古代汉语词典》 (Chen Fu Hua 2005); Gudai Hanyu Da Cidian 《古代汉语大词典》 (Wangjian Yin et al. 2007)
Notes: Etymological dictionaries describe the historic meaning and usage of a word in a modern language. Ci Yuan 《辭源》 is one of the most comprehensive sources of literary Chinese words with meanings described in modern Chinese.

Type: Buddhist
Sources: The Princeton Dictionary of Buddhism (Buswell and Lopez 2014); Fo Guang Buddhist dictionary 《佛光大辭典》 (Fo Guang Shan 2000)
Notes: The main focus of Buddhist dictionaries is encyclopedic entries describing Buddhist concepts. Only the headwords and a small amount of the notes have been used in compiling the NTI Reader dictionary.

Type: Multilingual
Sources: Unicode Han Database (Unicode Consortium 2016b)
Notes: The Unicode Han Database contains entries for Han characters with Chinese, Japanese and Korean readings and basic equivalents in English. It is especially useful for rare characters and for its machine-readable format.

Type: Gazetteers
Sources: Buddhist Studies Place Authority Databases (Zhang, Bo-Yong and Ge, Grace 2002)
Notes: A gazetteer contains a list of place names. The Buddhist Studies Place Authority Databases contains entries for many Buddhist temples mentioned in the Taishō

Type: Linguistic references
Sources: Outline of Classical Chinese Grammar (Pulleyblank 1995); Chinese: a Linguistic Introduction (Sun 2006); Chinese (Norman 1988)
Notes: Used for function words and for understanding syntactic categories.

Buddhist dictionaries share some features of multilingual dictionaries because they often include Chinese, Sanskrit, Pali and English. For example, the entry for ‘dharma’ in The Princeton Dictionary of Buddhism includes Sanskrit, Pali, Chinese, Japanese, Korean and Tibetan (Buswell and Lopez 2014, s.v. ‘dharma’).

Entries were added to the dictionary as they were encountered in reading Buddhist texts, as they were provided by external contributors, or as a result of terminology extraction, as described in Introduction to Corpus Analysis of the Chinese Buddhist Canon.

The NTI Reader dictionary was compiled by directly incorporating a few sources whose terms allow it and then curating the result extensively with reference to many other sources. The sources that allow direct incorporation include:

  1. The CC-CEDICT Project, under a Creative Commons Attribution-Share Alike 3.0 License (CCASE 3.0)
  2. The Unihan database, under the Unicode® Terms of Use
  3. Fo Guang Shan terminology, used with permission (Fo Guang Shan, 2015)
  4. Lancaster for text titles, used with permission (Lancaster, 2004)

Direct incorporation has allowed relatively rapid compilation of a Chinese-English dictionary.


Each dictionary entry is represented as a headword with a list of lexical units. A headword is generally understood as the word that a dictionary entry is organized around, which is often lemmatized to a basic grammatical form (Atkins and Rundell 2008, loc. 1353-1366). For the NTI Reader dictionary this is the Chinese text of the word. A lexical unit is a unit of vocabulary that is approximately the same as a word sense and may differ from other lexical units in part of speech and pronunciation (Crystal 2008, s.v. ‘lexis’). The English field of a single lexical unit may include several alternate translations called equivalents (Svensén 2009, p. 7). The project will describe the structure of the dictionary in detail, relating it to current lexicographical practice, for example, as described by Atkins and Rundell (2008) and Svensén (2009).
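The relationship between a headword, its lexical units, and their equivalents can be sketched as a small data structure. This is only an illustration, not the NTI Reader's own code: the class names, the split_equivalents helper, and the use of "/" as the separator between equivalents are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalUnit:
    """Roughly one word sense; it may carry several English equivalents."""
    pinyin: str
    grammar: str
    equivalents: list[str]


@dataclass
class Headword:
    """A dictionary entry: the Chinese text plus its lexical units."""
    text: str
    lexical_units: list[LexicalUnit] = field(default_factory=list)


def split_equivalents(english: str) -> list[str]:
    """Split an English field into equivalents (assumes '/' as separator)."""
    return [part.strip() for part in english.split("/") if part.strip()]


# Illustrative entry; the values are sample data for this sketch.
entry = Headword(text="佛")
entry.lexical_units.append(
    LexicalUnit(
        pinyin="fó",
        grammar="proper noun",
        equivalents=split_equivalents("Buddha / Awakened One"),
    )
)
print(entry.lexical_units[0].equivalents)
```

Keeping the equivalents as a list, rather than one string, makes it easier to search on a single English translation.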

The basic unit of information in the NTI Reader dictionary is a lexical unit. See the Style Guide for a description of semantic data from the perspective of an editor adding entries. The fields for each lexical unit are listed in the table below.

Table: Metadata for Lexical Units

Field: headword
Description: A unique identifier for the headword.
Example: 3618

Field: id
Description: A unique integer identifying the lexical unit. There are multiple lexical units for each headword. The id of the first lexical unit will be the same as the headword id.
Example: 3618

Field: simplified
Description: Simplified Chinese text for the word. This is also the headword, which is constant for different lexical units.

Field: traditional
Description: Traditional Chinese text for the word. If the traditional is the same as the simplified, then this field is omitted.
Example: -

Field: pinyin
Description: Pronunciation written in Hanyu pinyin.

Field: english
Description: One or more English equivalents for the Chinese word.
Example: Buddha / Awakened One

Field: grammar
Description: Part of speech.
Example: proper noun

Field: concept_cn
Description: The concept (in Chinese) that the word describes. See discussion of metadata below.
Example: -

Field: concept_en
Description: The concept (in English) that the word describes.
Example: -

Field: domain_cn
Description: The domain (in Chinese) that the word sense belongs to. See discussion of metadata below.
Example: 佛教

Field: domain_en
Description: The domain (in English) that the word sense belongs to.
Example: Buddhism

Field: subdomain_cn
Description: The child domain (in Chinese) that the word sense belongs to.
Example: -

Field: subdomain_en
Description: The child domain (in English) that the word sense belongs to.
Example: -

Field: image
Description: The name of an image file, mainly used for art and architecture.
Example: -

Field: notes
Description: Abbreviated encyclopedic notes and other information.
Example: Sanskrit: buddha … See 佛陀. (BL 'Buddha'; CCD '佛' 1; FE '佛' 1; Kroll '佛')
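Two rules stated in the table lend themselves to a short sketch: an omitted traditional form means it matches the simplified form, and the first lexical unit shares its id with the headword. The code below is a minimal illustration under stated assumptions, not the project's loader. The function names, the dictionary-based row, and the treatment of "-" as an empty value are hypothetical, and the simplified and pinyin values in the sample row are filled in only for illustration.

```python
EMPTY = "-"  # assumption: the placeholder the table shows for an omitted value


def normalize(row: dict) -> dict:
    """Return a copy of a raw row with placeholders and ids cleaned up."""
    unit: dict = {}
    for name, value in row.items():
        unit[name] = None if value in ("", EMPTY) else value
    # Ids are integers; convert them from their text form.
    unit["id"] = int(unit["id"])
    unit["headword"] = int(unit["headword"])
    # An omitted traditional form means it matches the simplified form.
    if unit.get("traditional") is None:
        unit["traditional"] = unit["simplified"]
    return unit


def is_first_unit(unit: dict) -> bool:
    """The first lexical unit of a headword shares the headword's id."""
    return unit["id"] == unit["headword"]


raw = {
    "headword": "3618", "id": "3618",
    "simplified": "佛",  # assumed for illustration
    "traditional": "-",
    "pinyin": "fó",      # assumed for illustration
    "english": "Buddha / Awakened One", "grammar": "proper noun",
    "concept_cn": "-", "concept_en": "-",
    "domain_cn": "佛教", "domain_en": "Buddhism",
    "subdomain_cn": "-", "subdomain_en": "-",
    "image": "-", "notes": "Sanskrit: buddha",
}
unit = normalize(raw)
print(unit["traditional"], unit["concept_cn"], is_first_unit(unit))
```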


The concept, domain and subdomain fields are semantic metadata fields that are used to tag dictionary entries. They help to define the kind of word and the context of its use. As an example of a concept (category), consider a word for a temple. The concept for the word would be '寺院 Temple', and other metadata should describe the location of the temple. This is what Harpring defines as an instance relationship (Harpring, 2010, pp. 39-40). The domain for most Buddhist words is '佛教 Buddhism'. Metadata can help users of the NTI Reader by removing ambiguity when discussing data and can also help them to locate related concepts and keywords.

The domain and subdomain fields are hierarchically related to each other and analogous to the Broader Terms (BT) and Narrower Terms (NT) fields in the Library of Congress Subject Headings (Library of Congress 2016, ix). For example, the domain '佛教 Buddhism' is the parent of the subdomain '中国佛教 Chinese Buddhism' in the dictionary entry for 大乘起信論 Treatise on the Awakening of Faith in the Mahāyāna.
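The broader and narrower relationship can be recovered from the entries themselves by collecting, for each domain, the subdomains that appear under it. The following is a sketch only. The narrower_terms function and the dictionary-shaped entries are assumptions made for this example, and the field names follow the metadata table.

```python
from collections import defaultdict


def narrower_terms(units):
    """Map each domain to the set of subdomains seen beneath it."""
    tree = defaultdict(set)
    for unit in units:
        domain = unit.get("domain_en")
        if not domain:
            continue
        children = tree[domain]  # registers the domain even without subdomains
        subdomain = unit.get("subdomain_en")
        if subdomain:
            children.add(subdomain)
    return dict(tree)


# Sample entries for illustration.
units = [
    {
        "english": "Treatise on the Awakening of Faith in the Mahāyāna",
        "domain_en": "Buddhism",
        "subdomain_en": "Chinese Buddhism",
    },
    {"english": "Buddha", "domain_en": "Buddhism", "subdomain_en": None},
]
print(narrower_terms(units))
```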

The domain values are restricted to the set in topics.txt. In this sense the domain labels form a controlled vocabulary. The possible values for concept and subdomain are not restricted to a limited set.
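A controlled vocabulary of this kind can be enforced with a simple membership check. The sketch below is hedged: the layout of topics.txt is not described on this page, so the code assumes one domain per line with the Chinese and English labels separated by a tab. The function names and the sample entries are hypothetical.

```python
def load_domains(lines):
    """Collect allowed Chinese domain labels.

    Assumes each line has the form 'Chinese<TAB>English'; the real
    layout of topics.txt may differ.
    """
    allowed = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        allowed.add(line.split("\t")[0])
    return allowed


def invalid_domains(units, allowed):
    """Return the ids of lexical units whose domain is not in the vocabulary."""
    return [
        unit["id"]
        for unit in units
        if unit.get("domain_cn") and unit["domain_cn"] not in allowed
    ]


# Inline stand-in for the contents of topics.txt (assumed layout).
TOPICS_SAMPLE = "佛教\tBuddhism\n地方\tPlaces\n古文\tClassical Chinese\n"
allowed = load_domains(TOPICS_SAMPLE.splitlines())

units = [
    {"id": 1, "domain_cn": "佛教"},
    {"id": 2, "domain_cn": "佛学"},  # hypothetical label outside the vocabulary
    {"id": 3, "domain_cn": None},    # no domain assigned, so not checked
]
print(invalid_domains(units, allowed))
```

Concept and subdomain values are left unchecked here because the text does not restrict them to a fixed set.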

Burdick et al. write, “The use of structured and/or tagged approaches to identify persons, themes, places, or features of a text provides a way to maximize the intellectual investigation of documents and to display these interpretations” (Burdick et al. 2012, p. 35). Metadata may conform to formal ontologies, such as the International Committee for Documentation (CIDOC) Conceptual Reference Model (Crofts et al. 2011), in order to use consistent terminology within and across information systems. The Getty Art & Architecture Thesaurus (AAT) uses such an ontology (Getty Research Institute).

Ontological metadata hierarchies may be shallow or deep. The AAT uses a deep taxonomy to classify the concepts used in its descriptions of physical and digital assets. An example of this is the concept ‘buddhapadas’ (Buddha footprints) (Getty Research Institute, ID ‘300395464’, viewed 29 January 2017):

  .... Visual and Verbal Communication (hierarchy name) (G)
  ........ Visual Works (hierarchy name) (G)
  ................ <visual works by material or technique> (G)
  .................... sculpture (visual works) (G)
  ........................ <sculpture by subject type> (G)
  ............................ Buddhapadas (G)

One concern with deep hierarchies like this is the time needed for curation. The NTI Reader uses a simpler hierarchy, but one with no formal ontology. The main goals of the dictionary metadata in the NTI Reader are to (1) reduce lexical ambiguity and (2) facilitate curation. The data being described in the case of the NTI Reader dictionary is the dictionary vocabulary. The domain field records the subject domain that the word belongs to. For example, the domain for the word entry 海淀 Hǎidiàn ‘Haidian’ is “地方 Dìfāng ‘Places’”. The most important use of the domain is to distinguish modern Chinese from literary Chinese. Words that are used only in literary Chinese are labelled as ‘古文 Classical Chinese’. Other words are treated as modern Chinese. The subdomain field narrows the domain. For example, one subdomain for the domain “佛教 Fójiào ‘Buddhism’” is “大乘佛教 Dàshèng Fójiào ‘Mahāyāna Buddhism’”.
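The distinction between literary and modern Chinese described above reduces to a check on the domain label. The sketch below only illustrates that rule. The language_variety function and the dictionary-shaped entries are assumptions for this example, and the entry for 曰 is a hypothetical sample rather than a quotation from the dictionary.

```python
LITERARY_DOMAIN_CN = "古文"  # the 'Classical Chinese' domain label


def language_variety(unit):
    """Label a lexical unit as literary or modern from its domain."""
    if unit.get("domain_cn") == LITERARY_DOMAIN_CN:
        return "literary"
    return "modern"


units = [
    {"simplified": "海淀", "domain_cn": "地方"},
    {"simplified": "曰", "domain_cn": "古文"},  # hypothetical sample entry
]
labels = {unit["simplified"]: language_variety(unit) for unit in units}
print(labels)
```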


