This page gives a description of what is included with the NTI Buddhist Text Reader (NTI Reader) on this web site. The NTI Reader presents the Taishō Shinshū Daizōkyō 《大正新脩大藏經》 (Taishō) version of the Chinese Buddhist canon. The content of the Taishō was downloaded from the digitized version on the CBETA web site.
The following table lists the main elements of the NTI Reader.
|Table of contents||English metadata describing the structure of the Taishō, including volume numbers, Taishō text numbers, and text titles.|
|Colophon||English metadata describing each text, including titles, authors, translators, dates of translation, and table of contents of the text.|
|Main text||Web pages containing the Chinese text for the content of the Taishō.|
|Hyperlinks to dictionary definition||Each word in the Taishō is matched back to a dictionary definition. Mouse-over for a brief definition.|
|Dictionary lookup||Searches for words in the Chinese-English dictionary with links back to use in the Taishō.|
The content of the NTI Reader is the Chinese text from the 《大正新脩大藏經》 Taishō Shinshū Daizōkyō or ‘Taishō Revised Tripiṭaka’ (Takakusu Junjiro 1988), referred to here as the Taishō. The digitized version of volumes 1-55 was downloaded from the CBETA web site. The CBETA project made an agreement with the University of Tokyo for reproduction and free redistribution of volumes 1-55 and 85 (Huimin 2000). CBETA subsequently made the text freely available under a Creative Commons License. See the page About the Chinese Buddhist Canon for more background on the Chinese Buddhist canon itself.
Besides the metadata on the CBETA web site, much of the metadata is drawn from The Korean Buddhist Canon: A Descriptive Catalogue by Lancaster. See the page About the Chinese Buddhist Canon for more background on the Taishō and the Korean Canon. For more details on the content and structure of the metadata see the page Metadata Used in the NTI Reader.
The NTI Reader has several features that are intended to make the text easier to read, including: (1) text segmentation, (2) automatic matching of words in the text to the dictionary, with a summary on mouse-over and links to the dictionary definitions, and (3) highlighting of proper nouns in order to facilitate comparison of different versions of a text using the proper nouns as markers. Text segmentation means that multi-character words are grouped together from the stream of Chinese characters. Segmentation is a point of difficulty for non-native Chinese speakers because Chinese text has no spaces between words. It may be useful to native Chinese speakers as well because of the large number of multi-character words of Sanskrit origin in the canon.
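The idea of text segmentation can be illustrated with a minimal sketch of greedy longest-match segmentation against a word list. This is an assumption for illustration only: this page does not state which algorithm the NTI Reader uses, and the function name `segment`, the tiny `WORDS` set, and the `MAX_WORD_LEN` bound are hypothetical.

```python
# Hypothetical mini-dictionary; a real dictionary would hold many more entries.
WORDS = {"如是", "我聞", "一時", "佛", "舍衛國", "菩薩", "般若波羅蜜多"}
MAX_WORD_LEN = max(len(w) for w in WORDS)


def segment(text, words=WORDS, max_len=MAX_WORD_LEN):
    """Split text into tokens by greedy longest match against a word set.

    Characters not covered by any dictionary word are emitted one at a time.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in words:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("如是我聞一時佛在舍衛國"))
# ['如是', '我聞', '一時', '佛', '在', '舍衛國']
```

Greedy matching is simple but can choose the wrong boundary when words overlap, so a production segmenter may need additional rules or statistics.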
The contents of the NTI Reader, including text, metadata, and dictionary, may be freely reused in accordance with the Creative Commons Attribution-Share Alike 3.0 License (CCASE 3.0).
The Chinese text reader for the Taishō was generated using cnreader, a freely available tool. The generated HTML pages were uploaded to this web site to make them available to web users. The source of the texts is the CBETA web site, which distributes the texts freely under a Creative Commons license. The NTI Reader Project is very thankful to CBETA for this wonderful contribution to the general public and, in turn, makes the content of the NTI Reader freely available under an English version of the same license.
The cnreader tool analyzes all the text files making up volumes 1-55 of the Taishō, does text segmentation to find the words within the stream of characters, and generates HTML files with links for each word to a Chinese-English word definition. A user can mouse over a word to see a brief definition and can click on the word for the full detail on a separate HTML page.
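The HTML generation step might look roughly like the following sketch, which wraps each known word in a link whose `title` attribute supplies the mouse-over definition. It is not cnreader's actual code: the `DICTIONARY` contents, the `render_tokens` function, and the `/words/<id>.html` URL scheme are all hypothetical.

```python
from html import escape

# Hypothetical dictionary entries: word -> (entry id, brief English definition).
DICTIONARY = {
    "如是": (1, "thus; so"),
    "我聞": (2, "I have heard"),
}


def render_tokens(tokens, dictionary=DICTIONARY):
    """Return an HTML fragment in which each known word links to its definition."""
    parts = []
    for token in tokens:
        entry = dictionary.get(token)
        if entry is None:
            # Unknown tokens are emitted as escaped plain text.
            parts.append(escape(token))
            continue
        entry_id, gloss = entry
        # The title attribute gives a brief definition on mouse-over;
        # the href (a hypothetical URL scheme) leads to the full entry.
        parts.append(
            f'<a href="/words/{entry_id}.html" title="{escape(gloss, quote=True)}">'
            f"{escape(token)}</a>"
        )
    return "".join(parts)


print(render_tokens(["如是", "我聞", "。"]))
```

Escaping both the token and the gloss keeps the output well-formed even if a definition contains quotes or angle brackets.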
The collection of texts on this web site is managed as a text corpus. A text corpus is a term used in linguistics to mean a representative collection of texts used to study the linguistic characteristics of a language. It is not necessary for users of the site to know anything about corpus analysis to use the site effectively, but more details about the corpus building and analysis aspects of the project can be found at Introduction to Corpus Analysis of the Chinese Buddhist Canon. The corpus of Chinese texts from the Taishō used on this site has 59,069,881 words with 33,200 unique words and 73,414,250 characters, according to the summary of the Corpus Analysis.
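Summary figures of this kind (total words, unique words, and characters) can be computed from segmented text along the following lines. This is only a sketch under stated assumptions: the `corpus_summary` function is hypothetical, and the actual Corpus Analysis may treat punctuation and unknown characters differently.

```python
from collections import Counter


def corpus_summary(documents):
    """Return (total words, unique words, total characters).

    `documents` is an iterable of already segmented texts (lists of words).
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    total_words = sum(counts.values())
    unique_words = len(counts)
    # Each occurrence of a word contributes its length in characters.
    total_chars = sum(len(word) * n for word, n in counts.items())
    return total_words, unique_words, total_chars


docs = [["如是", "我聞"], ["如是", "佛"]]
print(corpus_summary(docs))
# (4, 3, 7)
```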
Each corpus entry includes metadata about a primary source document, such as the source and links to plain text versions, English translations, vocabulary analysis, and other information. Each corpus entry has a text file that is accompanied by a separate markdown file. This allows metadata to be added without changing the source text of the canon. Markdown is a plain text format that allows people who are not software or web design specialists to work easily with plain text that can be transformed to HTML later. Markdown consists of simple and intuitive patterns, such as headers and bullet lists, for making plain text more readable. This makes it easy to add annotations to primary source documents or copies of them.
The collection of primary texts and markdown documents is managed through a set of tab-separated variable files.
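As a rough illustration, a tab-separated index of this kind could be read as shown below. The column names (`source_file`, `markdown_file`, `title`), the sample rows, and the `load_index` function are all hypothetical placeholders, since the real file layout is not described on this page.

```python
import csv
import io

# Hypothetical index contents; the real column layout is not given in the text.
SAMPLE_INDEX = (
    "source_file\tmarkdown_file\ttitle\n"
    "example-001.txt\texample-001.md\tExample Text One\n"
    "example-002.txt\texample-002.md\tExample Text Two\n"
)


def load_index(tsv_text):
    """Parse a tab-separated index into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


for entry in load_index(SAMPLE_INDEX):
    # Each row pairs a primary source text with its markdown annotation file.
    print(entry["source_file"], "->", entry["markdown_file"])
```

Keeping the index separate from the texts means entries can be added or re-titled without touching the source files themselves.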