Introduction to Corpus Analysis of the Chinese Buddhist Canon

This page explains some of the linguistic details of the corpus analysis of the Chinese Buddhist Canon included in the NTI Reader. The page may be useful to people wanting to do terminology extraction or those with an interest in computational linguistics. The page Description of the NTI Reader gives more details on the elements and construction of the NTI Reader.

The NTI Reader Corpus

The NTI Reader is built on the text in volumes 1-55 of the Taishō shinshū daizōkyō 《大正新脩大藏經》 (Taishō) version of the Chinese Buddhist canon, treated as a monolingual historic corpus. It is monolingual in that the entire body of the corpus is literary Chinese, in contrast with a bilingual parallel corpus, which matches texts in two languages. However, the metadata describing the corpus is in English to help users navigate the collection of texts.

As explained in the page Description of the NTI Reader, the cnreader tool analyzes the Taishō version of the Chinese Buddhist canon. In addition to generating text files for the text reader, the program analyzes the collection of texts as a corpus to find word frequencies, bigram frequencies, and collocations, and to extract example uses for words in the NTI Reader dictionary. The analysis results for the entire set of texts are shown in the page Corpus Analysis.

Source texts are marked with genre to allow the analysis to be broken down by genre. Example genres are āgama, jātaka, and prajñāpāramitā. The most interesting use of this is to identify keywords for specific genres. The genres are taken from the section titles of the Taishō.

What is Corpus Analysis?

The Routledge Dictionary of Language and Linguistics defines a corpus as “A finite set of concrete linguistic utterances that serves as an empirical basis for linguistic research” (Bussmann, s.v. 'corpus'). One of the uses of a corpus is that it gives a way to see how a word is used in many different contexts. This can help in confirming the correctness of word senses in a dictionary and can be a source of examples. Modern corpora, such as the British National Corpus, are taken from a variety of sources, such as newspapers, books and web pages, and are mostly intended for linguistic research rather than to be read for the value of their content. For example, corpora may be used to compile lists of commonly occurring multiword expressions and keywords for the compilation of translation glossaries.

Corpora may be distinguished as monolingual, bilingual and multilingual varieties (McEnery and Hardie 2011, loc. 756-815). Examples, characteristics and primary uses of different kinds of corpora are summarized in Table 1.

Table 1: Example corpora and uses

| | Historic | Modern | Main characteristics | Main uses |
| --- | --- | --- | --- | --- |
| Monolingual | Helsinki Corpus of English Texts, ARCHER (A Representative Corpus of Historical English Registers), Corpus of Early English Correspondence, NTI Reader | Brown Corpus, British National Corpus, Lancaster Corpus of Mandarin Chinese, Enron Corpus (emails) | Large volume | Linguistic study, lexicography, inference |
| Bilingual | BDK-SAT parallel corpus | English-Norwegian Parallel Corpus | Aligned | Translation, lexicography |
| Multilingual | Universal Dependencies (ancient Greek, Gothic, Latin, and Old Church Slavonic) | Universal Dependencies (36 modern languages) | POS tagged and parsed, not aligned | Training for automated translation |

Word Frequency Analysis

Word frequency analysis is useful for getting an idea of the content of an untranslated text. This is important because there is more untranslated text than translated text in the Chinese Buddhist canon and other Chinese text collections. Even for translated texts, a word frequency analysis can be useful for writing a commentary or doing a close reading. It can also be useful for comparing different translations for style and quality or for comparing different historic recensions of the same source document.

A frequency distribution of non-functional words can give an idea of the content of a text at a quick glance. A frequency distribution of functional words can give an idea of the style of a translator, the historic period in which the text was written, and the source language of the translation (e.g., Sanskrit). A frequency distribution of proper nouns (not implemented yet) can give an idea of the principal actors in a text. A character frequency distribution may be more useful when you are missing vocabulary.

You need a good base vocabulary to do a word frequency analysis in Chinese. Since Chinese text does not have spaces between words, you need the vocabulary to find the boundaries between the words. Otherwise, you will simply end up with a character frequency distribution. The NTI Reader dictionary is essential for the word frequency analysis.
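
The role of the vocabulary in finding word boundaries can be sketched in Python as follows. This is a minimal illustration that assumes a greedy longest-match strategy; this page does not state which segmentation algorithm cnreader uses, and the function name segment, the max_len parameter, and the three-entry vocabulary are hypothetical.

```python
from collections import Counter


def segment(text, vocabulary, max_len=4):
    """Split text into tokens by greedy longest match against vocabulary.

    Characters not covered by any vocabulary entry become single-character
    tokens, so an empty vocabulary yields a character frequency distribution.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical vocabulary, for illustration only.
vocabulary = {"如是", "我聞", "真如"}
text = "如是我聞真如"

word_counts = Counter(segment(text, vocabulary))  # counts of words
char_counts = Counter(segment(text, set()))       # falls back to characters
```

With the vocabulary supplied, the counts are over words; with an empty vocabulary, the same function can only count individual characters.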

A word frequency analysis by genre is also done. This uses genre labels for texts according to their position in the Taishō. For example, the labels are Āgama for volumes 1-2, Jātaka and Avadāna for volumes 3-4, and Prajñāpāramitā for volumes 5-8.
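
Labeling texts by volume and aggregating counts per genre might look like the sketch below. Only the three volume ranges given as examples above are included; the names GENRE_BY_VOLUME, genre_for_volume, and frequencies_by_genre are hypothetical and not taken from the cnreader code, and the input is assumed to be already segmented into words.

```python
from collections import Counter, defaultdict

# Volume ranges from the examples in the text; other volumes are omitted.
GENRE_BY_VOLUME = [
    (range(1, 3), "Āgama"),
    (range(3, 5), "Jātaka and Avadāna"),
    (range(5, 9), "Prajñāpāramitā"),
]


def genre_for_volume(volume):
    """Return the genre label for a Taishō volume number, or None if unmapped."""
    for volumes, genre in GENRE_BY_VOLUME:
        if volume in volumes:
            return genre
    return None


def frequencies_by_genre(texts):
    """Aggregate word counts per genre.

    texts is an iterable of (volume, tokens) pairs, where tokens is a list
    of already segmented words.
    """
    totals = defaultdict(Counter)
    for volume, tokens in texts:
        genre = genre_for_volume(volume)
        if genre is not None:
            totals[genre].update(tokens)
    return totals


# Made-up sample data, for illustration only.
sample = [(1, ["如是", "我聞"]), (2, ["如是"]), (5, ["般若"])]
totals = frequencies_by_genre(sample)
```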

Content Analysis

Word frequency analysis is also done for each individual text. Analysis of individual texts is referred to as content analysis rather than corpus analysis, which refers to analysis of the entire set of texts. For example, see the content analysis results for the Treatise on the Awakening of Faith in the Mahāyāna (T 1666).

Word Usage Analysis

In addition to an overall analysis of the whole corpus and each text within it, the analysis results can be viewed for each word. This includes the texts in which the word is mentioned most frequently, collocations, and examples of word usage. The list of texts provides links to the specific locations within the Buddhist texts that use the given word most frequently. For example, looking at the page for the word 真如 zhēnrú tathatā or 'suchness' we find that the term is most frequently mentioned in Scroll 16 of the Large Sūtra on Perfection of Wisdom (T 220).

Collocation Analysis

A collocation is a habitual combination of words (Crystal 2008, s.v. 'collocation'). Examples in English are 'an auspicious occasion' and 'draw a conclusion'. Some collocations express sentiment. For example, the collocations 'a notorious thief' and 'habitual gambler' express negative sentiments. This is related to the concept of semantic prosody, which describes words that occur together with other words in a semantic set (Crystal 2008, s.v. 'semantic prosody'). Collocations are important in translation and will usually be translation units.

The lists of collocations in the NTI Reader are based on a bigram analysis of the corpus. To avoid the lists being dominated by common bigrams that arise from free combination with function words, bigrams containing function words are removed from the set.
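
The bigram filtering described above can be sketched as follows. This is an illustration under stated assumptions rather than the cnreader implementation: the function name bigram_counts and the one-word function-word set are hypothetical, the input is assumed to be already segmented, and sentence boundaries are ignored.

```python
from collections import Counter


def bigram_counts(tokens, function_words):
    """Count adjacent word pairs, skipping pairs that contain a function word."""
    counts = Counter()
    for first, second in zip(tokens, tokens[1:]):
        if first in function_words or second in function_words:
            continue
        counts[(first, second)] += 1
    return counts


# Made-up token sequence and function-word set, for illustration only.
tokens = ["如來", "說", "法", "之", "如來", "說", "法"]
function_words = {"之"}

counts = bigram_counts(tokens, function_words)
ranked = counts.most_common()  # bigrams ordered by descending frequency
```

Here the pairs involving 之 are dropped, so only the two content-word bigrams remain in the ranking.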

Terminology Extraction

Corpus analysis has been used to identify words that were subsequently added to the NTI Reader dictionary. Other words were found while reading Buddhist texts or examining the bigram frequency list and were added to the dictionary after checking the definitions in one or more of the References.

The scan of the texts by the cnreader software identifies all the Chinese characters that are not in the dictionary. Word entries for these characters are then added from the corresponding entries in the Unihan database (Unicode Consortium 2016, 'Unihan Database Lookup') using the Jupyter notebook unihan.ipynb.
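
The scan for missing characters can be illustrated with the sketch below. It is not the cnreader or unihan.ipynb code: it assumes that a character counts as covered only when it has its own single-character headword, it checks only three of the CJK ideograph blocks in Unicode, and the names is_cjk_ideograph and unknown_characters are hypothetical. The Unihan lookup step itself is not shown.

```python
from collections import Counter

# A subset of the Unicode blocks that contain Chinese characters.
CJK_RANGES = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def is_cjk_ideograph(ch):
    """Return True if ch falls in one of the ranges listed above."""
    code = ord(ch)
    return any(start <= code <= end for start, end in CJK_RANGES)


def unknown_characters(text, headwords):
    """Count Chinese characters in text that have no headword of their own."""
    return Counter(
        ch for ch in text if is_cjk_ideograph(ch) and ch not in headwords
    )


# Made-up headwords and text, for illustration only.
headwords = {"如", "是", "如是"}
missing = unknown_characters("如是我聞。我", headwords)
```

Punctuation such as 。 falls outside the listed ranges and is therefore ignored.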

Titles for texts in the Taishō were extracted from CBETA, which is shared under Creative Commons, and from The Korean Buddhist Canon: A Descriptive Catalogue (Lancaster, 2004), which was used with permission. This was automated using the add_title.ipynb Jupyter notebook.

Bigrams were identified, ranked by frequency of occurrence, and compared against headwords in Ding Fubao's 《丁福保佛學大辭典》 Dictionary of Buddhist Studies (Ding, 2010). In the latest run at least 10,000 words were identified with this method. This can be run from the Jupyter notebook term_extraction.ipynb. Adding translations for these headwords to the dictionary is a main focus of current work.
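
The comparison of ranked bigrams against a reference headword list can be sketched as follows. This is not the content of term_extraction.ipynb: the function name candidate_terms, its parameters, and the sample data are hypothetical, and excluding words already in the NTI Reader dictionary is an assumption made for the illustration.

```python
from collections import Counter


def candidate_terms(bigram_counts, reference_headwords, known_words, min_count=1):
    """Return (term, count) pairs for bigrams that are reference headwords.

    Bigrams are visited in descending order of frequency. Terms already in
    known_words are skipped (an assumption made for this sketch).
    """
    candidates = []
    for (first, second), count in bigram_counts.most_common():
        term = first + second
        if (
            count >= min_count
            and term in reference_headwords
            and term not in known_words
        ):
            candidates.append((term, count))
    return candidates


# Made-up counts and word lists, for illustration only.
sample_counts = Counter({("真", "如"): 5, ("如", "來"): 3, ("我", "聞"): 2})
reference_headwords = {"真如", "如來"}
known_words = {"如來"}

new_terms = candidate_terms(sample_counts, reference_headwords, known_words)
```

In this made-up example only 真如 survives, because 如來 is already known and 我聞 is not a reference headword.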

For more (slightly older) material on word frequency analysis, see the page Statistical Background.


  1. Bussmann, H, Trauth, G (trans.) and Kazzazi, K (trans.) 2006, Routledge Dictionary of Language and Linguistics, Routledge, London.
  2. Crystal, D 2008, Dictionary of Linguistics and Phonetics, 6th edition (Kindle), Wiley-Blackwell, Malden, MA ; Oxford.
  3. Lancaster, LR 2004, The Korean Buddhist Canon: A Descriptive Catalogue, accessed 11 February 2017,
  4. McEnery, T and Hardie, A 2011, Corpus Linguistics: Method, Theory and Practice, Cambridge University Press, Cambridge.
  5. Unicode Consortium 2016, 'Unihan Unicode Database', accessed 11 February 2017,

