NTI Buddhist Text Reader

Description of the NTI Reader

This page describes what is included in the NTI Buddhist Text Reader (NTI Reader) on this web site. The NTI Reader presents the Taishō Shinshū Daizōkyō 《大正新脩大藏經》 (Taishō) version of the Chinese Buddhist canon. The content of the Taishō was downloaded from the digitized version on the CBETA web site.

User Profile for the NTI Reader

The primary user profile for the NTI Reader is Buddhist studies students, researchers, and translators who are proficient in English. A secondary user profile is Buddhist studies students, researchers, and translators who are proficient in Chinese and want to relate the Chinese text to English references or translate it into English.

The NTI Reader is designed to make reading in a second language (L2) easier. In the context of language comprehension, the NTI Reader can be understood as a language aid that helps users explore and read an L2 language. There is a substantial body of research on second language acquisition (SLA), including computer assisted language learning (CALL), for example, as described by Levy (2009) and Chapelle (2007). The goals of the NTI Reader are somewhat different from those of CALL, yet they do overlap. Vocabulary has been a major focus for CALL tools because acquiring the vocabulary of the L2 language is a large task for a language learner (Levy 2009, 771). The NTI Reader addresses vocabulary, for example, by providing metadata describing the domain and semantic types of words in both the L1 and L2 languages.

Elements of the NTI Reader

The following table lists the main elements of the NTI Reader.

Table: Main Elements of the NTI Reader
Element | Description
Table of contents | English metadata describing the structure of the Taishō, including volume numbers, Taishō text numbers, and text titles.
Colophon | English metadata describing each text, including titles, authors, translators, dates of translation, and table of contents of the text.
Main text | Web pages containing the Chinese text for the content of the Taishō.
Hyperlinks to dictionary definition | Each word in the Taishō is matched back to a dictionary definition. Mouse over a word for a brief definition.
Dictionary lookup | Searches for words in the Chinese-English dictionary, with links back to their use in the Taishō.

The content of the NTI Reader is the Chinese text from the 《大正新脩大藏經》 Taishō Shinshū Daizōkyō or ‘Taishō Revised Tripiṭaka’ (Takakusu Junjiro 1988), referred to here as the Taishō. The digitized version of volumes 1-55 was downloaded from the CBETA web site. The CBETA project made an agreement with the University of Tokyo for reproduction and free redistribution of volumes 1-55 and 85 (Huimin 2000). CBETA subsequently made the text freely available under a Creative Commons License. See the page About the Chinese Buddhist Canon for more background on the Chinese Buddhist canon itself.

Besides the metadata on the CBETA web site, much of the metadata is drawn from The Korean Buddhist Canon: A Descriptive Catalogue by Lancaster. See the page About the Chinese Buddhist Canon for more background on the Taishō and the Korean Canon. For more details on the content and structure of the metadata, see the page Metadata Used in the NTI Reader.

The NTI Reader has several features that are intended to make the text easier to read, including: (1) text segmentation; (2) automatic matching of words in the text to the dictionary, with a summary on mouseover and links to the dictionary definitions; and (3) highlighting of proper nouns, to facilitate comparison of different versions of a text using the proper nouns as markers. Text segmentation means that multi-character words are grouped together from the stream of Chinese characters. Finding word boundaries is a point of difficulty for non-native Chinese speakers because Chinese text has no spaces between words. Segmentation may be useful to native Chinese speakers as well because of the large number of multi-character words of Sanskrit origin in the canon.
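
As an illustration of text segmentation, the following is a minimal sketch using greedy forward maximum matching against a dictionary. This is one common dictionary-based approach and is an assumption here; this page does not state which algorithm the NTI Reader uses. The function name `segment`, the `max_len` parameter, and the tiny sample dictionary are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Split text into words by greedy longest match against a dictionary.

    Characters not covered by any dictionary entry become
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary, for illustration only.
SAMPLE_DICTIONARY = {"如是", "我聞", "一時", "舍衛國"}

print(segment("如是我聞一時佛在舍衛國", SAMPLE_DICTIONARY))
# ['如是', '我聞', '一時', '佛', '在', '舍衛國']
```

Longest-match-first is a simple way to keep multi-character terms such as transliterated proper nouns together, although it can choose the wrong boundary when dictionary entries overlap.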

The contents of the NTI Reader, including text, metadata, and dictionary, may be freely reused in accordance with the Creative Commons Attribution-Share Alike 3.0 License (CCASE 3.0).

Building the NTI Reader

The Chinese text reader for the Taishō was generated using a freely available tool, cnreader. The HTML pages generated were uploaded to this web site to make them available to web users. The source of the texts is the CBETA web site, which distributes the texts freely under a Creative Commons license. The NTI Reader Project is very thankful to CBETA for this wonderful contribution to the general public and subsequently makes the content of the NTI Reader freely available under an English version of the same license.

The cnreader tool analyzes all the text files making up volumes 1-55 of the Taishō, does text segmentation to find the words within the stream of characters, and generates HTML files with links for each word to a Chinese-English word definition. A user can mouse over a word to see a brief definition and can click on the word for the full detail on a separate HTML page.
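
The HTML generation step can be sketched roughly as below. This is not the cnreader implementation; it only illustrates wrapping each known word in a link whose `title` attribute gives a brief definition on mouseover. The function name `render_tokens`, the URL pattern `/words/<id>.html`, and the entry fields `id` and `gloss` are all assumptions made for the example.

```python
import html


def render_tokens(tokens, dictionary):
    """Render segmented tokens as HTML, linking known words to definition pages."""
    parts = []
    for token in tokens:
        entry = dictionary.get(token)
        if entry is None:
            # Tokens without a dictionary entry are emitted as escaped text.
            parts.append(html.escape(token))
            continue
        parts.append(
            '<a href="/words/{id}.html" title="{gloss}">{word}</a>'.format(
                id=entry["id"],
                gloss=html.escape(entry["gloss"], quote=True),
                word=html.escape(token),
            )
        )
    return "".join(parts)


# Hypothetical dictionary entry, for illustration only.
SAMPLE_ENTRIES = {
    "舍衛國": {"id": 1, "gloss": "Śrāvastī"},
}

print(render_tokens(["在", "舍衛國"], SAMPLE_ENTRIES))
# 在<a href="/words/1.html" title="Śrāvastī">舍衛國</a>
```

Escaping both the word and the gloss keeps the generated page well-formed even when a definition contains quotation marks or angle brackets.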

The collection of texts in this web site is managed as a text corpus. A text corpus is a term used in linguistics to mean a representative collection of texts used to study the linguistic characteristics of a language. Users of the site do not need to know anything about corpus analysis to use the site effectively, but more details about the corpus building and analysis aspects of the project can be found at Introduction to Corpus Analysis of the Chinese Buddhist Canon. The corpus of Chinese texts from the Taishō used on this site has 59,069,881 words with 33,200 unique words and 73,414,250 characters, according to the summary of the Corpus Analysis.
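
Summary figures of this kind (total words, unique words, and characters) can be computed from segmented text along the following lines. This is a minimal sketch, not the project's analysis code; the function name `corpus_summary` and the sample data are hypothetical, and the actual analysis may treat punctuation and unmatched characters differently.

```python
from collections import Counter


def corpus_summary(documents):
    """Return total words, unique words, and total characters.

    `documents` is a list of token lists, one per text, already segmented.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return {
        "words": sum(counts.values()),
        "unique_words": len(counts),
        "characters": sum(len(word) * n for word, n in counts.items()),
    }


# Hypothetical segmented texts, for illustration only.
sample = [["如是", "我聞"], ["如是", "佛", "說"]]
print(corpus_summary(sample))
# {'words': 5, 'unique_words': 4, 'characters': 8}
```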

Each corpus entry includes metadata about a primary source document, such as the source and links to plain text versions, English translations, vocabulary analysis, and other information. Each corpus entry pairs a text file with a separate Markdown file. This allows metadata to be added without changing the source text of the canon. Markdown is a plain text format that allows people who are not software or web design specialists to work easily with plain text that can be transformed to HTML later. Markdown consists of simple and intuitive patterns, such as headers and bullet lists, for making plain text more readable. This makes it easy to add annotations to primary source documents or copies of them.

The collection of primary texts and Markdown documents is managed through a set of tab-separated value files.
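
A tab-separated collection file of this kind might be read as in the sketch below. The column names (`source_file`, `annotation_file`, `title`), the file paths, and the function name `load_collection` are hypothetical; the actual file layout used by the project is not described on this page.

```python
import csv
import io

# Hypothetical layout: source file, annotation file, and title, separated by tabs.
SAMPLE_COLLECTION = (
    "source_file\tannotation_file\ttitle\n"
    "taisho/t0001.txt\ttaisho/t0001.md\tExample Text One\n"
    "taisho/t0002.txt\ttaisho/t0002.md\tExample Text Two\n"
)


def load_collection(handle):
    """Read corpus entries from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


entries = load_collection(io.StringIO(SAMPLE_COLLECTION))
for entry in entries:
    print(entry["title"], "->", entry["source_file"], "+", entry["annotation_file"])
```

Keeping the pairing of source text and annotation file in a plain tabular file means entries can be added or corrected in a spreadsheet or text editor without touching the canon text itself.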

Related pages: Introduction to Corpus Analysis of the Chinese Buddhist Canon

References

  1. Chapelle, C 2007, “Technology and Second Language Acquisition”, Annual Review of Applied Linguistics, vol. 27, pp. 98–114, viewed 30 December 2016, http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1039&context=engl_pubs.
  2. Huimin 2000, “CBETA Chinese Electronic Tripitaka (Taisho Edition)”, in Pacific Neighborhood Consortium (PNC) Annual Conference and Joint Meetings, PNC, Berkeley, California.
  3. Levy, M 2009, “Technologies in Use for Second Language Learning”, Modern Language Journal, vol. 93, pp. 769–782, https://www.researchgate.net/publication/227674494.