This page explains some of the linguistic details of the corpus analysis of the Chinese Buddhist Canon included in the NTI Reader. The page may be useful to people wanting to do terminology extraction or those with an interest in computational linguistics. The page Description of the NTI Reader gives more details on the elements and construction of the NTI Reader.
The NTI Reader is built on the text in volumes 1-55 of the Taishō shinshū daizōkyō 《大正新脩大藏經》 (Taishō) version of the Chinese Buddhist canon, treated as a monolingual historic corpus. The entire body of the corpus is literary Chinese, in contrast with a bilingual parallel corpus, which matches texts in two languages. However, the metadata describing the corpus is in English to help users navigate the collection of texts.
As explained in the page Description of the NTI Reader, the cnreader tool analyzes the Taishō version of the Chinese Buddhist canon. In addition to generating text files for the text reader, the program analyzes the collection of texts as a corpus to find word frequencies, bigram frequencies, and collocations, and to extract example uses for words in the NTI Reader dictionary. The analysis results for the entire set of texts are shown in the page Corpus Analysis.
Source texts are marked with genre to allow the analysis to be broken down by genre. Example genres are āgama, jātaka, and prajñāpāramitā. The most interesting use of this is to identify keywords for specific genres. The genres are taken from the section titles of the Taishō.
The Routledge Dictionary of Language and Linguistics defines a corpus as “A finite set of concrete linguistic utterances that serves as an empirical basis for linguistic research” (Bussmann, s.v. 'corpus'). One of the uses of a corpus is that it gives a way to see how a word is used in many different contexts. This can help in confirming the correctness of word senses in a dictionary and can be a source of examples. Modern corpora, such as the British National Corpus, are taken from a variety of sources, such as newspapers, books and web pages, and are mostly intended for linguistic research rather than to be read for the value of the content. For example, corpora may be used to compile lists of commonly occurring multiword expressions and keywords for compilation of translation glossaries.
Corpora may be distinguished as monolingual, bilingual and multilingual varieties (McEnery and Hardie 2011, loc. 756-815). Examples, characteristics and primary uses of different kinds of corpora are summarized in Table 1.
|Type||Historic||Modern||Main characteristics||Main uses|
|Monolingual||Helsinki Corpus of English Texts, ARCHER (A Representative Corpus of Historical English Registers), Corpus of Early English Correspondence, NTI Reader||Brown Corpus, British National Corpus, Lancaster Corpus of Mandarin Chinese, Enron Corpus (emails)||Large volume||Linguistic study, lexicography, inference|
|Bilingual||BDK-SAT parallel corpus||English-Norwegian Parallel Corpus||Aligned||Translation, lexicography|
|Multilingual||Universal Dependencies (ancient Greek, Gothic, Latin, and Old Church Slavonic)||Universal Dependencies (36 modern languages)||POS tagged and parsed, not aligned||Training for automated translation|
Word frequency analysis is useful for getting an idea of the content of an untranslated text. This is important because there is more untranslated text than translated text in the Chinese Buddhist canon and other Chinese text collections. Even for translated texts a word frequency analysis can be useful for writing a commentary or doing a close reading. It can also be useful for comparing different translations for style and quality or for comparing different historic recensions of the same source document.
A frequency distribution of non-functional words can give an idea of the content of a text at a quick glance. A frequency distribution of functional words can give an idea of the style of a translator, the historic period in which the text was written, and the translation source (e.g., Sanskrit). A frequency distribution of proper nouns (not implemented yet) can give an idea of the principal actors in a text. A character frequency distribution may be more useful when you are missing vocabulary.
You need a good base vocabulary to do a word frequency analysis in Chinese. Since Chinese text does not have spaces between words, you need the vocabulary to find the boundaries between the words. Otherwise, you will simply end up with a character frequency distribution. The NTI Reader dictionary is essential for the word frequency analysis.
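The role of the vocabulary can be illustrated with a minimal sketch. It assumes greedy longest-match segmentation, which is one common approach and not necessarily what cnreader does. The `segment` function, the `max_len` parameter, and the toy vocabulary are all hypothetical stand-ins for the NTI Reader dictionary.

```python
from collections import Counter


def segment(text, vocabulary, max_len=4):
    """Split text into words by greedy longest match against a vocabulary.

    Assumption: this is an illustrative technique, not the cnreader algorithm.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Toy vocabulary standing in for the NTI Reader dictionary (hypothetical).
vocabulary = {"如是", "我聞", "一時"}
text = "如是我聞一時佛"

word_counts = Counter(segment(text, vocabulary))
# With an empty vocabulary the result degrades to a character distribution.
char_counts = Counter(segment(text, set()))
```

With the toy vocabulary the text is split into multi-character words, while with an empty vocabulary every token is a single character, which is the character frequency distribution described above.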
A word frequency analysis by genre is also done. This uses genre labels for texts according to their position in the Taishō. For example, the labels are Āgama for volumes 1-2, Jātaka and Avadāna for volumes 3-4, and Prajñāpāramitā for volumes 5-8.
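A sketch of how genre labels might be assigned by volume and used to group word counts is given here. Only the three volume ranges named above are included, and the names `GENRE_BY_VOLUME`, `genre_for_volume`, and `counts_by_genre` are hypothetical, not part of cnreader.

```python
from collections import Counter, defaultdict

# Only the volume ranges mentioned in the text; other volumes are omitted.
GENRE_BY_VOLUME = [
    (range(1, 3), "Āgama"),
    (range(3, 5), "Jātaka and Avadāna"),
    (range(5, 9), "Prajñāpāramitā"),
]


def genre_for_volume(volume):
    """Return the genre label for a Taishō volume number, or None if unmapped."""
    for volumes, genre in GENRE_BY_VOLUME:
        if volume in volumes:
            return genre
    return None


def counts_by_genre(texts):
    """Sum word counts per genre.

    texts: iterable of (volume_number, list_of_words) pairs (assumed shape).
    """
    totals = defaultdict(Counter)
    for volume, words in texts:
        genre = genre_for_volume(volume)
        if genre is not None:
            totals[genre].update(words)
    return totals
```

Comparing the per-genre counters against the counts for the whole corpus is one way that genre-specific keywords could then be identified.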
Word frequency analysis is also done for each individual text. Analysis of an individual text is referred to as content analysis rather than corpus analysis, which refers to the entire set of texts. For example, see the content analysis results for the Treatise on the Awakening of Faith in the Mahāyāna (T 1666).
In addition to an overall analysis of the whole corpus and each text within it, the analysis results can be viewed for each word. These include the texts in which the word is mentioned most frequently, collocations, and examples of word usage. The list of texts provides links to the specific locations within the Buddhist texts that use the given word most frequently. For example, looking at the page for the word 真如 zhēnrú tathatā or 'suchness' we find that the term is most frequently mentioned in Scroll 16 of the Large Sūtra on Perfection of Wisdom (T 220).
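Ranking the texts for a given word can be sketched as follows, assuming that a word count per text is already available. The function name `top_texts`, the placeholder identifiers `text_a` to `text_c`, and the counts are all made up for illustration and are not results from the corpus.

```python
from collections import Counter


def top_texts(word, counts_by_text, limit=3):
    """Return (text_id, count) pairs for the texts using a word most often."""
    ranked = sorted(
        (
            (counts.get(word, 0), text_id)
            for text_id, counts in counts_by_text.items()
            if counts.get(word, 0) > 0
        ),
        reverse=True,
    )
    return [(text_id, n) for n, text_id in ranked[:limit]]


# Hypothetical per-text word counts, not real corpus data.
counts_by_text = {
    "text_a": Counter({"真如": 5}),
    "text_b": Counter({"真如": 12, "法界": 3}),
    "text_c": Counter({"法界": 7}),
}
ranking = top_texts("真如", counts_by_text)
```

Here `ranking` lists `text_b` before `text_a`, and `text_c` is left out because it does not contain the word.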
A collocation is a habitual combination of words (Crystal 2008, s.v. 'collocation'). Examples in English are 'an auspicious occasion' and 'draw a conclusion'. Some collocations express sentiment. For example, the collocations 'a notorious thief' and 'habitual gambler' express negative sentiments. This is related to the concept of semantic prosody, which describes words that occur together with other words in a semantic set (Crystal 2008, s.v. 'semantic prosody'). Collocations are important in translation and will usually be translation units.
The lists of collocations in the NTI Reader are based on a bigram analysis of the corpus. To avoid the lists being dominated by common bigrams that arise from free selection with function words, bigrams containing function words are removed from the set.
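The filtering step can be sketched as below, assuming the text has already been segmented into words. The `FUNCTION_WORDS` set is a small illustrative sample, not the list actually used by the NTI Reader, and `collocation_candidates` is a hypothetical name.

```python
from collections import Counter

# Illustrative sample only; the real function word list is not given here.
FUNCTION_WORDS = {"之", "於", "而", "也", "以"}


def collocation_candidates(words, function_words=FUNCTION_WORDS):
    """Count adjacent word pairs, dropping pairs that contain a function word."""
    bigrams = Counter(zip(words, words[1:]))
    return Counter({
        pair: n
        for pair, n in bigrams.items()
        if not (set(pair) & function_words)
    })


words = ["菩薩", "摩訶薩", "之", "般若", "波羅蜜", "菩薩", "摩訶薩"]
candidates = collocation_candidates(words)
```

In this toy input the pair 菩薩 摩訶薩 is counted twice, while the two pairs containing 之 are dropped.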
Corpus analysis has also been used to identify words subsequently added to the NTI Reader dictionary. Other words were found in reading Buddhist texts or examining the bigram frequency list and then added to the dictionary after checking the definitions in one or more of the References.
The scan of the texts by the cnreader software identifies all the Chinese characters that are not in the dictionary. Word entries for these characters are then added from the corresponding entries in the Unihan database (Unicode Consortium 2016, 'Unihan Database Lookup') using the Jupyter notebook unihan.ipynb.
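Finding the characters that lack a dictionary entry might look like the following sketch. It is not the cnreader implementation. As a simplifying assumption it only checks the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and ignores the extension blocks. The function names are hypothetical.

```python
from collections import Counter


def unknown_characters(text, known_characters):
    """Count CJK characters in text that have no dictionary entry.

    Assumption: only the basic CJK Unified Ideographs block is checked.
    """
    return Counter(
        ch for ch in text
        if "\u4e00" <= ch <= "\u9fff" and ch not in known_characters
    )


def code_point(ch):
    """Format a character as a U+XXXX code point, the key style used by Unihan."""
    return f"U+{ord(ch):04X}"


# Toy set of known characters standing in for the dictionary (hypothetical).
missing = unknown_characters("如是我聞。", {"如", "是"})
keys = sorted(code_point(ch) for ch in missing)
```

The code points in `keys` could then be used to look up the corresponding Unihan entries.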
Titles for texts in the Taishō were extracted from CBETA, which is shared under Creative Commons, and from The Korean Buddhist Canon: A Descriptive Catalogue (Lancaster, 2004), which was used with permission. This was automated using the add_title.ipynb Jupyter notebook.
Bigrams were identified, ranked by frequency of occurrence, and compared against head words in Ding Fubao's 《丁福保佛學大辭典》 ‘Dictionary of Buddhist Studies’ (Ding, 2010). In the latest run there were 9,980 words identified with this method. This can be run from the Jupyter notebook term_extraction.ipynb.
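The general idea can be sketched as follows. This is not the content of term_extraction.ipynb. It assumes a bigram is formed by joining two adjacent tokens, and the head word set is a toy stand-in rather than data from the Ding Fubao dictionary. The name `extract_terms` is hypothetical.

```python
from collections import Counter


def extract_terms(tokens, headwords):
    """Rank bigrams of adjacent tokens by frequency, keeping known head words."""
    bigrams = Counter(a + b for a, b in zip(tokens, tokens[1:]))
    return [
        (term, n)
        for term, n in bigrams.most_common()
        if term in headwords
    ]


# Single characters are used as tokens here; segmented words would also work.
tokens = list("真如法界真如")
# Toy head word set standing in for the reference dictionary (hypothetical).
headwords = {"真如", "法界"}
terms = extract_terms(tokens, headwords)
```

Bigrams such as 如法 and 界真 occur in the toy input but are discarded because they are not in the head word set.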
For more (slightly older) material on word frequency analysis see the page Statistical Background.