Statistical Background


This page gives some incomplete notes on statistical background for analyzing Chinese text and assessing the quality of corpus data. The statistical techniques described here are not currently used in the NTI Reader. Rather, simple counts based on unigram and bigram frequencies are used for word sense disambiguation by the NTI Reader in sentence-based dictionary lookup and HTML gloss generation.

Corpus Data

A corpus is a collection of texts, which may be used by linguists for studying the characteristics of language usage. (Bird, Natural Language Processing, 39) The NTI Buddhist Text Reader contains a corpus of Buddhist and classical Chinese texts. A small subset of the corpus is tagged with part-of-speech tags and glosses distinguishing word sense. The organization of the corpus is described on the page Text Management, Annotation, and Gloss on this web site. The tagging scheme is described on the page Part-of-Speech Tag Definitions. The raw data, including word frequencies, can be found at the NTI Buddhist Text Reader GitHub Project in the data/dictionary folder. They are tab delimited text files.

Word frequencies from the tagged subset of the corpus can be found in the files unigram.txt and bigram.txt. These are tab delimited UTF-8 files with no header row. The unigram.txt file lists single word frequencies in the tagged corpus. The structure of the unigram.txt file is:

          pos_tagged_text: The element text with POS tag and gloss in pinyin and English
          element_text:    The element text in traditional Chinese
          word_id:         Matching id in the word table (positive integer)
          frequency:       The frequency of occurrence of the word sense (positive integer)
        

The word_id matches the numeric id of a word in the dictionary file words.txt. Each entry in the words.txt file is a lexeme - a single dictionary entry with a single meaning. (Manning and Schütze, Statistical Natural Language Processing, 128)

The bigram.txt file lists frequencies of two-word combinations. The structure of the bigram.txt file is:

          pos_tagged_text: The element text with POS tag and gloss in pinyin and English
          previous_text:   The text of the previous word in traditional Chinese
          element_text:    The element text in traditional Chinese
          word_id:         Matching id in the word table (positive integer)
          frequency:       The frequency of occurrence of the word sense (positive integer)
        

A tab delimited plain text list of summary statistics for the corpus texts is included in the file corpus_stats.txt. The structure of the file is:

          source_name: the title of the text
          word_count: the number of words in the text
          character_count: the number of characters in the text
          unique_words: the number of unique words in the text
        

Software Tools

There are a number of software tools for statistical modeling that may be used for text analysis. The R programming language will be used here. For other languages, examples of basic statistical operations can be found at the pages Combinations and permutations and Statistics/Basic.

R Project for Statistical Computing

The R Project for Statistical Computing (http://www.r-project.org), or simply R, is an open source project, language, and platform. You can freely download and install R for Linux, Mac, and Windows from links in the project web site. The R project web site has links to introductory materials. I referred to the books by Knell (Knell, Introductory R: A Beginner’s Guide to Data Visualisation, Statistical Analysis and Programming in R) and Yau (Yau, R Tutorial with Bayesian Statistics Using OpenBUGS) in preparing the commands and scripts presented here.

R is a generic statistics platform. It is very handy for working through statistics problems in textbooks and trying things out, rather than using paper and a calculator or a spreadsheet. In addition, there are a number of packages for R specifically for text analysis. These can be found on the web page Available CRAN Packages By Name. Some examples are: koRpus: An R Package for Text Analysis, tau: Text Analysis Utilities, textcat: N-Gram Based Text Categorization, and tm: Text Mining Package.

After installing R, open the command line interpreter with the command R. The unigram.txt file can be read into a data frame using the read.table() function, as shown below. Change to the directory containing the file first.

          $ R
          . . .
          > names <- c("pos.tagged.text", "element.text", "word.id", "frequency")
          > unigram <- read.table("unigram.txt", header=FALSE, sep="\t", quote="\"", col.names=names, numerals ="allow.loss")
          > head(unigram)
                 pos.tagged.text element.text word.id frequency
          1   佛/NR[Fó | Buddha]           佛    3618       330
          2 有/VE[yǒu | to have]           有   32458       249
          3        於/P[yú | in]           於    1710       220
          4      不/AD[bù | not]           不     502       218
          5  人/NN[rén | person]           人     399       212
          6     諸/DT[zhū | all]           諸    3556       193
        

The $ prompt is shown before a shell command and the > prompt is shown before an R command. The c() function concatenates the arguments into a vector, which is assigned to the variable names using the assignment operator <-. The variable names is used for the column names later. The unigram.txt file is read into the unigram data frame with the read.table() function. The head() function prints out the first few lines of the data frame. The UTF-8 encoded Chinese characters are read in by R correctly. The NTI Reader text files are formatted to be loaded into MySQL, so there are a few differences from what the basic form of the read.table() function expects. The format for NULL values in MySQL is \N but R expects NA, so you will need to be careful if you depend on accurate representation of NULL values. In addition, the NTI Reader files do not have variable names in the first row.

The bigram.txt file can be read into a data frame in a similar way, as shown below.

          > binames <- c("pos.tagged.text", "previous.text", "element.text", "word.id", "frequency")
          > bigram <- read.table("bigram.txt", header=FALSE, sep="\t", quote="\"", col.names=binames, numerals ="allow.loss")
          > head(bigram)
                pos.tagged.text previous.text element.text word.id frequency
          1 故/NN[gù | purpose]            以           故    7115        41
          2  以/P[yǐ | because]            何           以   30648        41
          3  僧/NN[sēng | monk]            衆           僧    3366        40
          4    此/DT[cǐ | this]            於           此     624        37
          5  佛/NR[Fó | Buddha]            諸           佛    3618        34
          6   塔/NN[tǎ | stupa]            起           塔    3374        34

        

Data can be plotted in R using the plot() function, as shown below.

          > x <- seq(1, length(unigram$frequency))
          > plot(x, unigram$frequency, xlab="Rank", ylab="Frequency", pch=20, col="blue")
        

The seq() function generates a sequence of integers so that the frequency of each word will be plotted from most frequent at the left to least frequent at the right. The resulting plot is the word frequency versus rank, shown in Figure 1.


Figure 1: Word Frequency versus Rank

The plot in Figure 1 is informative in that it shows that only a handful of words have a high frequency, while the vast majority form a long tail with frequency 1. This is a general phenomenon of word frequencies in natural language. (Manning and Schütze Foundations of Statistical Natural Language Processing. Cambridge, 23-27) An early model for the relation between word frequency and rank is Zipf's law, which states that word frequency f is inversely proportional to rank r.

f ∝ 1/r

(Manning and Schütze Foundations of Statistical Natural Language Processing. Cambridge, 24)

To generate an image file for that plot and the others in this page, pull the project from GitHub, change to the top level directory in the project and type the command

          $ Rscript r/generate_images.R
        

This is a useful diagram because it shows the breadth and depth of the corpus at a glance. The tagged corpus has only about 1,600 unique words, and only about 200 of these have a substantial frequency.

You will need to install the R showtext package to properly display Chinese text on R generated graphics. Use the commands below to install showtext.

          > install.packages("showtext")
          > library(showtext)
        

You can make a bar plot of the frequency data using the barplot() function, as shown below.

          > freq <- as.vector(as.matrix(unigram[4])[1:8, 1])
          > freq.labels <- as.vector(as.matrix(unigram[2])[1:8, 1])
          > showtext.begin()
          > barplot(freq, names.arg=freq.labels, ylab="Frequency", col="steelblue")
          > showtext.end()
        

A subset of the fourth column of the unigram data frame is read into the variable freq, after being converted to a vector. The labels are the Chinese text for each word taken from column 2. The code above draws the chart in the active graphics window. To write it to an image file instead, open a png() device before plotting and call dev.off() afterwards, as sketched below; showtext.begin() needs to be called after the device is opened so that the Chinese labels render correctly. The chart generated is shown in Figure 2.
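
A minimal sketch of writing the same chart to a PNG file is shown below; the file name and image size are only examples. Newer versions of the showtext package name these functions showtext_begin() and showtext_end(), so adjust the calls to match the installed version.

          > png("unigram_barplot.png", width=600, height=400)
          > showtext.begin()
          > barplot(freq, names.arg=freq.labels, ylab="Frequency", col="steelblue")
          > showtext.end()
          > dev.off()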


Figure 2: Word Frequency Bar Plot for a Subset of the Data

Concepts

Probability

Suppose that we need to choose between one of two options and each option is equally likely. Then each option would have a 1/2 or 50% probability of being correct. For example, suppose we see the English word butter in a text and we need to decide whether the noun, meaning the butter in the fridge, is intended or the verb, as in please butter the toast, is intended. To find the probability of the intended sense you might look at the surrounding words, or you might guess that the noun meaning the dairy product is more likely in general. One of the goals of the NTI Reader is to help the user decide which is the most appropriate word sense and part of speech. These are classic problems in natural language processing and ones that remain challenging today. The technique used is to tag all the words in a subset of the corpus and tally up the different word senses found. It has been found that simply using the most common part of speech for words with multiple possible parts of speech results in selection of the correct part of speech about 90% of the time for English text. (Manning and Schütze, Statistical Natural Language Processing, 344) The success rate may be lower for Chinese due to the long historic period of written Chinese covered and the changing predominant word sense over time.

Example: Suppose that we wish to find the probability of choosing the right meaning of the character 法 (fǎ). 法 (fǎ) most often means the noun "law" or the noun "method" in modern Chinese. However, in Buddhist texts written in literary Chinese, 法 might mean the proper noun "Dharma" (the teachings of the Buddha), teachings in general, or the noun "dharma" (phenomenon). The word frequencies for each sense can be found from the unigram table using the following R commands.

          > fa <- subset(unigram, element.text=="法")
          > fa
                          pos.tagged.text element.text word.id frequency
          10          法/NN[fǎ | a dhárma]           法   17994        74
          1253          法/NR[Fǎ | Dhárma]           法    3509         1
          1376 法/NN[fǎ | a mental object]           法   32204         1
          1397          法/NN[fǎ | method]           法    3506         1
          > PrA = fa$frequency[1] / sum(fa$frequency)
          > PrA
          [1] 0.961039
        

The subset() function selects a subset of the unigram table with the value of element.text equal to "法". The total number of occurrences of 法 in the tagged corpus is sum(fa$frequency) = 74 + 1 + 1 + 1 = 77. So the probability of any one occurrence of 法 being the noun (NN) meaning "a dhárma" (a teaching) is 74/77 = 0.961. If we had no other information about the context of an occurrence of 法, then we would guess that it means a dhárma. ▢

The conditional probability of an event A given that we know an event B has already occurred is written Pr(A|B). It can be computed as

Pr(A|B) = Pr(A ∩ B) / Pr(B)

This assumes that Pr(B) > 0. (DeGroot and Morris, Probability and Statistics, 56)

Example: If we know the word before 法, it may give us a better chance of picking the correct word sense. This is an example of conditional probability. We can use the bigram table to compute the conditional probability of the word sense of 法. Suppose the word before is 說 shuō, "to say" in modern Chinese but more commonly "to teach" in literary Chinese. That would make the phrase 說法 "to teach a dhárma." Let A = the proper meaning of 法 in this instance is 法/NN[fǎ | a dhárma]. Let B = the word before 法 is 說. The conditional probability can be computed using the R commands below.

          > subset(subset(bigram, element.text=="法"), previous.text=="說")
                 pos.tagged.text previous.text element.text word.id frequency
          34 法/NN[fǎ | a dhárma]            說           法   17994         9
        

Pr(A|B) = 9 / 9 = 1.0

That is, whenever the word before 法 is 說 then the proper word sense is always 法/NN[fǎ | a dhárma]. The interpretation of this is that the previous word is a very good predictor of word sense, in this case. ▢

Two events A and B are independent if the occurrence of one does not affect the occurrence of the other. If this is true then

Pr(A ∩ B) = Pr(A) Pr(B)

(DeGroot and Morris, Probability and Statistics, 66)

A random variable is a real-valued function defined on a sample space S. (DeGroot and Morris, Probability and Statistics, 93)

The probability function of a discrete random variable X is the function

ƒ(x) = Pr(X = x)

(DeGroot and Morris, Probability and Statistics, 96) In linguistics discrete random variables, like word frequency, are more common, but continuous random variables may also be used, for example the rate of occurrence of a specific word per 1,000 words of text. The equivalent function for a continuous random variable is called the probability density function (pdf).

The cumulative distribution function (cdf) of a random variable X is

F(x) = Pr(X ≤ x) for -∞ < x < ∞

(DeGroot and Morris, Probability and Statistics, 108)

The quantile function F-1(p) of a random variable X is the inverse of the cdf, which is also the smallest value x with F(x) ≥ p. The variable p is the probability. F-1(p) is the p quantile of X, or the 100p percentile. (DeGroot and Morris, Probability and Statistics, 112) A quartile is found by sorting the data and then dividing it into four equal groups. The interquartile range is the middle two quartiles or, in other words, the range between the 25th and 75th percentiles. Quartile and range information can be found using the R function summary(). Quantiles of a data sample can be found using the R function quantile(). The interquartile range can be found using the R function IQR().

Example: Summary data for the word frequency data can be found as follows.

          > summary(unigram$frequency)
          Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
          1.000   1.000   1.000   4.357   3.000 137.000 
          > length(unigram$frequency)
          [1] 1646
        

The length() function gives the number of items in the data set, which shows that there are only about 1,600 unique words in the tagged corpus. ▢
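
The quartiles and the interquartile range themselves can also be computed directly with the quantile() and IQR() functions, for example:

          > quantile(unigram$frequency)
          > IQR(unigram$frequency)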

The joint probability function of two random variables X and Y is

ƒ(x, y) = Pr(X = x and Y = y)

(DeGroot and Morris, Probability and Statistics, 119)

The marginal probability function of X, for two discrete random variables X and Y with joint probability function ƒ, is found by summing over all possible values of y. In symbols,

ƒ1(x) = ∑All y ƒ(x, y)

(DeGroot and Morris, Probability and Statistics, 131)

Expectation

The expectation of a random variable is its mean. The expectation of a discrete random variable X with probability function f is defined as

E(X) = ΣAll x x ƒ(x)

(DeGroot and Morris, Probability and Statistics, 208) The median is another measure of central tendency, which separates the lower half from the upper half of a set of numbers. It can be more useful than the mean when dealing with small integers or highly skewed data. The mean of a vector of numbers can be found with the R function mean() and the median can be found with the function median().

Example: The values of mean and median for word frequency in the NTI Reader tagged corpus can be found with the R commands below.

          > mean(unigram$frequency)
          [1] 4.708101
          > median(unigram$frequency)
          [1] 1
        

This can give some idea of the adequacy of the size of the tagged corpus. A mean frequency of about 4.7 occurrences per word and a median of 1 suggest that the tagged corpus is rather small. More analysis is needed to understand what size would really be sufficient. ▢

Some distributions do not have a mean. The Cauchy distribution below is an example of a function with no mean.

ƒ(x) = 1/[π(1 + x2)] for -∞ < x < ∞

(DeGroot and Morris, Probability and Statistics, 210)

The variance of a random variable X with mean μ is defined as

Var(X) = E[(X - μ)2]

For a set of n observed values Xi with mean μ, the variance can be computed with the sum

Var(X) = (1/n) ∑all i(Xi - μ)2

The standard deviation is the square root of the variance. (DeGroot and Morris, Probability and Statistics, 226)

Example: The variance and standard deviation of word frequency in the NTI Reader tagged corpus can be found with the R commands below.

          > x <- unigram$frequency
          > v <- sum((x-mean(x))^2)/(length(x)) 
          > v
          [1] 140.0195
          > sqrt(v)
          [1] 11.83299
        

The variance is about 140.0 and the standard deviation about 11.8. ▢

The covariance of random variables X and Y with means μx and μy is defined as

Cov(X, Y) = E[(X - μx)(Y - μy)]

assuming that the expectation exists. (DeGroot and Morris, Probability and Statistics, 248) The sample covariance corrects for sampling from a larger population. It is given by the formula:

Covx, y = ∑(x - x̄)(y - ȳ)/(n - 1)

where x̄ and ȳ are the sample means and n - 1 is the degrees of freedom. (Knell, Introductory R, 211) The sample covariance can be computed using the R function cov().

The correlation of random variables X and Y with variances σx2 and σy2 is defined as

ρ(X, Y) = Cov(X, Y) / [σxσy]

(DeGroot and Morris, Probability and Statistics, 250) The correlation for a sample can be computed from the formula

r = Σ(x - x̄)(y - ȳ)/[(n - 1)(sx sy)]

where sx and sy are the sample standard deviations. (Knell, Introductory R, 212) The sample correlation can be computed using the R function cor().

Example: Let's find the correlation between text size and the number of unique words in different texts in the corpus. We can load the file corpus_stats.txt, containing the summary statistics of the corpus texts, print out the first row of data, print a summary of the distribution of unique words, find the correlation, and draw a plot with these R commands.

          > corpusstats <- read.table("../stats/corpus_stats.txt", sep="\t", quote="\"", numerals ="allow.loss", header=TRUE)
          > corpusstats[1,]
                             source.name word.count character.count unique.words
          1 Diamond Sūtra 金剛般若波羅蜜經       3407            5307          534
          > summary(corpusstats$unique.words)
          Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
          15.0   122.2   580.5   962.2  1661.0  5145.0 
          > cor(corpusstats$word.count, corpusstats$unique.words)
          [1] 0.7266394
          > plot(corpusstats$word.count, corpusstats$unique.words, xlab="Word Count", ylab="Unique Words", pch=17, col="blue")
        

This particular file does have column headers in the first row that are automatically used by R as column names in the data frame. The image generated is shown in Figure 3.


Figure 3: Variation of Unique Words with Text Size

From the correlation value r = 0.727 there is clearly a correlation between text length and number of unique words. However, from Figure 3 we can see that it is consistent but not linear. The number of unique words eventually begins to flatten out for really long texts. Since the set of words in a text may include proper nouns, such as person and place names, the number of unique words is practically unlimited. It is useful to consider the summary of unique words alongside the unigram data for the tagged corpus. The third quartile value is 1661, which is greater than the total number of unique words in the tagged corpus. So at least 25% of the texts in the corpus have more unique words than we have word frequency data for. Clearly, we are short of coverage on word frequency.

Entropy

Entropy is a measure of the amount of information in a distribution. The entropy H(X) of a discrete random variable X is defined as

H(X) = -∑x∈Xp(x)log2p(x)

where x is a value of the random variable X, which is one of a number of symbols in an alphabet X. Entropy is measured in bits. (Manning and Schütze. Statistical Natural Language Processing, 61; Cover and Thomas. Elements of Information Theory, 2012, ch. 2.1) For example, if the possible values for X are the symbols of an 8-letter alphabet, each equally likely to occur, then the entropy is

H(X) = -∑i=1..8(1/8)log2(1/8) = log2(8) = 3

So, a uniform distribution over the alphabet has 3 bits of information. In other words, the most efficient way of encoding the alphabet uses 3 bits per symbol.
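
As a rough illustration, the entropy of the word sense distribution in the tagged corpus can be estimated from the unigram frequencies by treating the relative frequencies as probabilities. This is only an estimate from the tagged sample, not the true entropy of the language.

          > p <- unigram$frequency / sum(unigram$frequency)
          > H <- -sum(p * log2(p))
          > H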

Manning and Schütze explain word sense disambiguation with entropy. The approach described selects the word sense that maximizes the mutual information with the surrounding words. (Manning and Schütze. Statistical Natural Language Processing, 239-241)

Markov Models

A stochastic process is a sequence of random variables X1, X2, ... at discrete points in time. A Markov chain is a stochastic process where the conditional distributions of all Xn+j depend only on Xn and not on earlier states. (DeGroot and Morris, Probability and Statistics, 188)

The transition distributions of a Markov chain are the conditional probabilities

pij = Pr(Xn+1=j|Xn=i)

where the random variables Xn can have k possible states. A transition matrix is a matrix P = [pij] made up of the conditional probabilities of the transition distributions. (DeGroot and Morris, Probability and Statistics, 190-192)

A hidden Markov model is a Markov model where the state of the model is not known directly. Only the expression of the state may be observed. There is a level of probability between the state of the model and the expression of the state. (Manning and Schütze, Statistical Natural Language Processing, 320-321)

Manning and Schütze describe using a Markov model for a part-of-speech (POS) bigram tagger. (Manning and Schütze, Statistical Natural Language Processing, 345-349) The state of the model is given by the POS tag ti for a word wi at position i in the text. The tags are the Brown Corpus tags, which only give the POS, not the specific meaning of a word, unlike the NTI Chinese corpus. For example, in move/VB the tag VB indicates a verb. We estimate the transition matrix with a training set of tagged data and use it to predict tags in a test set. The model is a hidden Markov model because in the test data we do not know the values of the tags, just their expression in particular words. The following assumptions are made:

  • The value of a tag only depends on the previous tag. This is the limited horizon assumption.
  • The words are independent of each other.
  • The sense of a word only depends on its tag. For example, we can assume that the noun butter means the butter in the fridge and there are no other nouns with a different meaning to this. This assumption may be valid most of the time for English but would probably not be valid for literary Chinese.

The probability Pr(tk|tj) of the kth tag tk in the tagset following the tag tj is

Pr(tk|tj) = C(tk, tj) / C(tj)

where C(tk, tj) is the count of tag tk following tj and C(tj) is the count of tag tj. The probability of word wl, the lth word in the lexicon, being associated with tag tj is

Pr(wl|tj) = C(wl : tj) / C(tj)

where C(wl : tj) is the count of word wl tagged as tj. The optimal series of tags for a sentence is found by optimizing the maximum likelihood estimator Pr(wl|tj) over all words in the training set. Since the computation can be computationally intensive, the Viterbi algorithm, which optimizes using memoization, is suggested. (Manning and Schütze, Statistical Natural Language Processing, 326-336) Code for a POS tagger based on a hidden Markov model can be found at LiteratePrograms (see POS_tagger_(Java)).
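
As a minimal sketch of estimating the tag transition probabilities Pr(tk|tj) from counts, suppose we have a vector tags holding the POS tag of each word of a training text in order. The tag sequence below is made up for illustration; the NTI Reader bigram.txt file does not record the previous word's tag directly, so in practice the tags would need to be extracted from the tagged corpus first.

          > tags <- c("NR", "VE", "NN", "P", "NN", "VE", "NN")    # hypothetical tag sequence
          > trans.counts <- table(head(tags, -1), tail(tags, -1)) # counts C(tj, tk)
          > trans.probs <- prop.table(trans.counts, margin=1)     # Pr(tk|tj) = C(tk, tj) / C(tj)
          > trans.probs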

The current version of the NTI Reader sentence-based dictionary lookup and HTML gloss generator uses word-sense disambiguation based on bigram frequency with fallback to unigram frequency. That is, rather than using a true statistical estimation technique, like the Markov model approach described here, the NTI Reader uses the most common bigram frequency to decide on word sense for a word in a sequence of words. When the reader has never seen a bigram sequence before, it falls back on the unigram frequency to select the word sense. Manning and Schütze describe combining bigram, unigram, and trigram Markov models using linear interpolation and other techniques. (Manning and Schütze, Natural Language Processing, 353-354) In fact, handling of previously unseen words is a major problem in using Markov models for text processing, even with large corpora. POS tags for unknown words in English text may be guessed based on capitalization and word endings, but that is not possible in Chinese text. (Manning and Schütze, Natural Language Processing, 351-352)

Manning and Schütze report accuracies of POS tagging in English of up to 95% to 97%. (Manning and Schütze, Natural Language Processing, 371) One would expect Chinese to be a good language for bigram hidden Markov model taggers because word order is fundamental to grammatical structure in Chinese. This may not be the case for Sanskrit, where word morphology is more important than word order.

Special Distributions

Bernoulli Distribution

The Bernoulli distribution for random variable X, which can only take the values 0 and 1, with parameter p (0 ≤ p ≤ 1) has the probabilities

Pr(X = 1) = p and Pr(X = 0) = 1 - p

A sequence of random variables with the Bernoulli distribution is called a sequence of Bernoulli trials. (DeGroot and Morris, Probability and Statistics, 276)

Binomial Distribution

The binomial distribution with integer parameter n and continuous parameter p (0 ≤ p ≤ 1) is defined as

ƒ(x|n,p) = (n ¦ x) px(1 - p)n-x for x = 0, 1, 2, ... n and 0 otherwise

where (n ¦ x) is the binomial coefficient n!/[x!(n - x)!]. The mean and variance are

E(X) = np

Var(X) = np(1 - p)

(DeGroot and Morris, Probability and Statistics, 277) The R function to compute the binomial probability is dbinom(). The commands to plot a binomial distribution with parameters n = 20 and p = 0.5 are shown below.

         > x <- seq(1, 20)
         > plot(x, dbinom(x, 20, 0.5), xlab="x", ylab="Frequency", pch=17, col="blue")
        

The plot() function takes a sequence of integers x from 1 to 20 and plots the binomial probability for each value. The graph generated is shown in Figure 4.


Figure 4: Binomial Distribution with Parameters n = 20 and p = 0.5

Poisson Distribution

The Poisson distribution for random variable X with mean λ is defined as

ƒ(x|λ) = e-λ λx/x! for x = 0, 1, 2, ... and 0 otherwise.

The variance of the Poisson distribution is also λ. (DeGroot and Morris, Probability and Statistics, 288-290) The dpois() R function can be used to compute values of the Poisson distribution. This is demonstrated with the R commands below, which generate a graph of the Poisson distribution with λ = 3.0.

         > x <- seq(1, 10)
         > plot(x, dpois(x, 3.0), xlab="x", ylab="Frequency", pch=17, col="blue")
        

The graph generated is shown in Figure 5.


Figure 5: Poisson Distribution with λ = 3.0

Normal Distribution

The normal distribution for the continuous random variable X with mean μ and standard deviation σ is defined as

ƒ(x|μ, σ) = [1/σ√(2π)] exp[-0.5((x - μ)/σ)2] for -∞ < x < ∞

The standard normal distribution has mean μ = 0 and standard deviation σ = 1. It is given by the equation

ƒ(z) = [1/√(2π)] exp(-0.5z2) for -∞ < z < ∞

(DeGroot and Morris, Probability and Statistics, 307) The standard normal distribution can be plotted using the curve() function in R, as shown below.

          > curve(dnorm(x), -3, 3, xlab="z", ylab="Probability Density", col="blue")
        

The graph generated is shown in Figure 6.


Figure 6: Standard Normal Distribution

A random variable X with mean μ and standard deviation σ can be transformed to the standard normal distribution with the equation

z = (x - μ)/σ

The cumulative distribution F(x) of a normal distribution can be computed in R with the function pnorm(q, mean = 0, sd = 1). The quantile function F-1(p) can be computed with qnorm(p, mean = 0, sd = 1).

Example: Following Example 5.6.4 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 308) the probability of a random variable from a normal distribution with mean 5 and standard deviation 2 being greater than 1 and less than 8 is

Pr(1 < X < 8) = Pr(X < 8) - Pr(X < 1)

This can be computed with the R command

          > pnorm(8, mean = 5, sd = 2) - pnorm(1, mean = 5, sd = 2)
          [1] 0.9104427
        

So the probability that X is greater than 1 and less than 8 is 0.91. ▢

The normal distribution is a good approximation for variables in many random processes. (DeGroot and Morris, Probability and Statistics, 303) The Central Limit Theorem states that the distribution of a sum Σi=1n Xi of n independent, identically distributed random variables will be approximately a normal distribution with mean nμ and variance nσ2, as n becomes large. (DeGroot and Morris, Probability and Statistics, 361)

Gamma Distribution

The gamma distribution for the continuous random variable X with parameters α and β is

ƒ(x|α, β) = [βα / Γ(α)] xα-1 e-βx for x > 0 or 0 otherwise

where Γ(α) is the gamma function. The mean of the gamma distribution is α/β and the variance is α/β2. (DeGroot and Morris, Probability and Statistics, 319-320)
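
The gamma density can be computed in R with the dgamma() function, using the shape and rate arguments for α and β. For example, the command below plots the gamma distribution with α = 2 and β = 1; the parameter values are chosen only for illustration.

          > curve(dgamma(x, shape=2, rate=1), 0, 10, xlab="x", ylab="Probability Density", col="blue")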

Exponential Distribution

The exponential distribution for the continuous random variable X with parameter β is

ƒ(x|β) = βe-βx for x > 0 or 0 otherwise

The exponential distribution is a special case of the gamma distribution with α = 1. The mean of the exponential distribution is 1/β and the variance is 1/β2. (DeGroot and Morris, Probability and Statistics, 321) The exponential distribution can be drawn with the R command below.

          > curve(0.8*exp(-0.8*x), 0, 5, xlab="x", ylab="Probability Density", col="blue")
        

The generated graph is shown in Figure 7.


Figure 7: Exponential Distribution (β = 0.8)

Beta Distribution

The beta distribution for the continuous random variable X with parameters α and β is

ƒ(x|α, β) = [Γ(α + β) / (Γ(α) Γ(β))] xα-1 (1 - x)β-1 for 0 < x < 1 or 0 otherwise

The mean of the beta distribution is

E(X) = α/(α + β)

The variance of the beta distribution is

Var(X) = αβ/[(α + β)2(α + β + 1)]

(DeGroot and Morris, Probability and Statistics, 328-329)
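
The beta density can be computed in R with the dbeta() function. For example, the command below plots the beta distribution with α = 2 and β = 5; the parameter values are chosen only for illustration.

          > curve(dbeta(x, shape1=2, shape2=5), 0, 1, xlab="x", ylab="Probability Density", col="blue")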

Multinomial Distribution

The multinomial distribution for the discrete random vector X = (X1, X2, ... Xk) having probabilities p = (p1, p2, ... pk) with n items selected is defined as

f(X|n, p) = [n!/(x1! x2! ... xk!)] p1x1 p2x2 ... pkxk

if x1 + x2 + ... xk = n, or 0 otherwise. (DeGroot and Morris, Probability and Statistics, 334) The multinomial distribution is appropriate for counts of items distributed over categories that are not necessarily numeric, for example, the frequencies of words with different part of speech values.
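
Multinomial probabilities can be computed in R with the dmultinom() function. For example, the probability of observing 6 nouns, 3 verbs, and 1 adverb in a sample of 10 words, if the underlying proportions were 0.5, 0.3, and 0.2 (made-up values for illustration), is

          > dmultinom(c(6, 3, 1), size=10, prob=c(0.5, 0.3, 0.2))
          [1] 0.070875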

Statistical Estimation

A statistical model is a collection of random variables, identification of probability distributions for the variables, and the set of parameters that the distributions require values for. Statistical inference is a probabilistic statement about a statistical model. (DeGroot and Morris, Probability and Statistics, 377-378) For example, the approximate date of a document may be inferred from the vocabulary in it. (Krippendorff, Content Analysis, 42) In a Buddhist text mention of copying sutras or description of devotional practices may help provide information for the date for the text. The data may be recorded as 0 (not present) and 1 (present) or as a word frequency.

A statistic is a function of a set of random variables. For example, mean, median, and variance are statistics. (DeGroot and Morris, Probability and Statistics, 382)

Parameters in probability distributions are usually unknown and need to be estimated with statistical methods. The parameters themselves can be considered either simply as unknown constants or as having probability distributions of their own. The prior distribution of a parameter θ is the probability distribution ζ(θ) assumed before experimental observations are applied to estimate its value. The posterior distribution ζ(θ|x1, ... xn) is the conditional distribution after the random variables X1, ... Xn have been observed. (DeGroot and Morris, Probability and Statistics, 385-387) The likelihood function fn(x|θ) is the joint probability function of the random variables x = (x1, ... xn), viewed as a function of the parameter θ. The likelihood function can be used to relate the prior and posterior distributions,

ζ(θ|x) ∝ ƒn(x|θ) ζ(θ)

The constant of proportionality can be found from equating the total probability to 1. (DeGroot and Morris, Probability and Statistics, 390)

A conjugate family of prior distributions is a family of possible distributions for ζ(θ) where the posterior distribution also belongs to the same family. The family of gamma distributions is a conjugate family of prior distributions for Poisson distributions of the random variables X1, ... Xn when the parameter θ is unknown. The family of normal distributions is itself a conjugate family of prior distributions for normal distributions of X1, ... Xn when the mean is unknown but the variance is known. The family of gamma distributions is also a conjugate family of prior distributions for exponential distributions of X1, ... Xn when the value of the parameter θ is unknown. (DeGroot and Morris, Probability and Statistics, 395-402)
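
As a small sketch of the gamma-Poisson conjugate pair, suppose the number of occurrences of a word in fixed-length text segments is modeled as Poisson with unknown rate θ, and a gamma prior with shape 2 and rate 1 is assumed. The counts and prior parameters below are made up for illustration. After observing the counts, the posterior is a gamma distribution with shape α + Σxi and rate β + n, and the prior and posterior can be compared by plotting them.

          > counts <- c(3, 5, 4, 6, 2)        # hypothetical word counts in five text segments
          > a <- 2; b <- 1                    # assumed gamma prior: shape a, rate b
          > a.post <- a + sum(counts)         # posterior shape
          > b.post <- b + length(counts)      # posterior rate
          > curve(dgamma(x, shape=a, rate=b), 0, 10, col="gray", xlab="theta", ylab="Density")
          > curve(dgamma(x, shape=a.post, rate=b.post), 0, 10, col="blue", add=TRUE)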

An estimator δ(X1, ... Xn) gives an estimate of the parameter θ using observed values of the data x = (x1, ... xn). A loss function L(θ, a) quantifies the effect of the difference between the estimate a of θ and the true value. A Bayes estimator δ*(x) minimizes the expected value of the loss function. (DeGroot and Morris, Probability and Statistics, 408-409)

The squared error loss function is defined as

L(θ, a) = (θ - a)2

When the squared error loss function is used the Bayes estimator is the posterior mean value of θ. (DeGroot and Morris, Probability and Statistics, 411)

Manning and Schütze describe word sense disambiguation with Bayesian classification using maximum likelihood estimators. The authors give the examples of ambiguous words in English, including duty (a task or a tax) and land (to stand on land or to land a plane). The decision is based on conditional probability, given the context of the word. Different contexts have been experimented with, including the object of a verb and the surrounding words. The decision rule for selecting a word sense is optimal in the sense that it minimizes the chance for error in a training data set. (Manning and Schütze, Statistical Natural Language Processing, 235-239)

Another approach to estimating parameters maximizes the probability of the observed data. A likelihood function fn(x|θ) is a joint pdf for continuous random variables, or a joint pf for discrete random variables, viewed as a function of the parameter θ. A maximum likelihood estimator maximizes the value of fn(x|θ). (DeGroot and Morris, Probability and Statistics, 418) This approach avoids the need to assume a probability distribution for θ. However, the drawback is that it may not always give a good estimate for θ, and sometimes the maximum likelihood estimate may not exist at all.

Sampling

A sampling distribution is the distribution of a statistic T of a set of random variables X = (X1, ... Xn) that are a sample of the random variable X with distribution having a parameter θ. A sampling distribution can be used to determine how good an estimate θ̂ of the parameter θ is. (DeGroot and Morris, Probability and Statistics, 465)

The chi-square χ2 distribution with m degrees of freedom is the gamma distribution with α = m/2 and β = ½. This can be written as

ƒ(x) = 1/(2m/2 Γ(m/2)) x(m/2)-1 e-x/2 for x > 0.

The mean of the χ2 distribution is E(X) = m. The variance is Var(X) = 2m. (DeGroot and Morris, Probability and Statistics, 469-470) The χ2 distribution can be computed with the R dchisq() function. The χ2 distribution with 5 degrees of freedom can be drawn in R using the command below.

          > curve(dchisq(x, df=5), 0, 20, xlab="x", ylab="Chi Square", col="blue")
        

The result is shown in Figure 8.


Figure 8: χ2 Distribution with 5 Degrees of Freedom

The χ2 distribution with two degrees of freedom is the same as the exponential distribution with parameter ½.

The cumulative χ2 distribution can be computed with the pchisq() R function.

Example: Following Example 8.2.3 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 470), the probability that X is less than 10.0 for a χ2 distribution with 10 degrees of freedom can be found using the R command

          > pchisq(c(10), df=10)
          [1] 0.5595067
        

This gives a probability of about 0.56, or 56%. ▢

If X1, ... Xn are a random sample from a normal distribution with mean μ and variance σ2 then the sample mean and variance have maximum likelihood estimators

μ̂ = X̄n
σ̂ = [(1/n) Σi=1n(Xi - X̄n)2]½

These estimators are independent random variables. The estimator for the sample mean X̄n has a normal distribution with mean μ and variance σ2/n. The quantity nσ̂2/σ2 has a χ2 distribution with n - 1 degrees of freedom.

The t distribution is useful when we need to use an estimate of the variance because we do not know the true variance. The t distribution is

Γ((m+1)/2)/[(mπ)½Γ(m/2)] (1 + x2/m)-(m+1)/2

If X1, ... Xn is a sample from a random variable X with mean μ and variance σ2, define the statistic

σ′ = [Σi=1n(Xi - X̄n)2/(n - 1)]½

Also define the random variable

U = n½(X̄n - μ)/σ′

U has a t distribution with n - 1 degrees of freedom. (DeGroot and Morris, Probability and Statistics, 480-482) The t distribution looks like the normal distribution but the tails do not converge to zero as quickly, especially for small values of n. Values of the pdf for the t distribution can be computed with the R dt() function. The command below draws a graph of the t distribution with 5 degrees of freedom.

          > curve(dt(x, df=5), -3, 3, xlab="z", ylab="Probability Density", col="blue")
        

The chart produced is shown in Figure 9.


Figure 9: t Distribution with 5 Degrees of Freedom

A confidence interval (A, B) for g(θ) with confidence coefficient γ is an interval with

Pr(A < g(θ) < B) ≥ γ

for a random sample X = (X1, ... Xn) from a distribution with parameter θ.

The confidence interval for the sample mean from a normal distribution with mean μ and variance σ2 is

A = X̄n - Tn-1-1((1 + γ)/2) σ′/n½
B = X̄n + Tn-1-1((1 + γ)/2) σ′/n½

where Tn(c) is the cdf of the t distribution with n degrees of freedom and Tn-1 is the quantile function. (DeGroot and Morris, Probability and Statistics, 486) The quantile function Tn-1(x) for a value x with n degrees of freedom can be computed with the R function qt(x, df = n).

Example: Computing Example 8.5.3 from DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 487), with a sample of 26 rain measurements with sample average 5.134 and σ′ = 1.60, the 95% confidence interval can be computed using the R commands

          > A <- 5.134 - qt(1.95/2, df=25) * 1.6/sqrt(25)
          > A
          [1] 4.474948
          > B <- 5.134 + qt(1.95/2, df=25) * 1.6/sqrt(25)
          > B
          [1] 5.793052
        

The 95% confidence interval is (4.47, 5.79). ▢

A one-sided confidence interval (A, ∞) has a statistic A, such that

Pr(A < g(θ)) ≥ γ

This is a 100γ percent one-sided confidence interval for g(θ). Similarly, the interval (-∞, B) is a 100γ percent one-sided confidence interval for g(θ) where

Pr(g(θ) < B) ≥ γ

For a normal distribution with mean μ and variance σ2 the confidence limits A and B can be computed as

A = X̄n - Tn-1-1(γ) σ′/n½
B = X̄n + Tn-1-1(γ) σ′/n½

(DeGroot and Morris, Probability and Statistics, 488-489)
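
Continuing the rain measurement example above with the same figures and the same computation as the two-sided example, a one-sided 95% lower confidence limit for the mean could be computed as sketched below, which gives a lower limit of roughly 4.59.

          > A <- 5.134 - qt(0.95, df=25) * 1.6/sqrt(25)
          > A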

An unbiased estimator δ(X) of a function g(θ) of a parameter θ has the same expectation as g(θ) for all values of θ. That is, Eθ[δ(X)] = g(θ) for all θ. (DeGroot and Morris, Probability and Statistics, 507) For example, the sample mean X̄n is an unbiased estimator of the true mean μ because the mean of X̄n is μ for all values of μ. The statistic

σ̂12 = (1/(n - 1)) ∑i=1n(Xi - X̄n)2

is an unbiased estimator of the variance. (DeGroot and Morris, Probability and Statistics, 508) σ̂12 is sometimes called the sample variance.

Example: The sample variance and sample standard deviation of word frequency in the NTI Reader tagged corpus can be found with the R commands below.

          > v <- var(unigram$frequency)
          > v
          [1] 140.1174
          > sqrt(v)
          [1] 11.83712
        

Compare this to the population variance 140.02 and standard deviation 11.83 computed above. ▢

Hypothesis Testing

Concepts

The null hypothesis H0 is the hypothesis that θ ∈ Ω0, where θ is the parameter of a probability distribution and Ω0 is a subset of the parameter space Ω. The alternative hypothesis H1 is the hypothesis that θ ∈ Ω1, where Ω1 is the complement of Ω0. (DeGroot and Morris, Probability and Statistics, 531-532)

A critical region S1 is a subset of the sample space S of a random vector X = (X1, ... Xn) for which the null hypothesis H0 is rejected. A test statistic T = r(X) defines a procedure where the null hypothesis H0 is rejected if T ∈ R, where R is a subset of the real numbers. R is the rejection region of the test. The critical region has the form S1 = {x: r(x) ∈ R}. (DeGroot and Morris, Probability and Statistics, 532-533)

A power function π(θ, δ) gives, as a function of the parameter θ, the probability that a test procedure δ will reject the null hypothesis. In symbols

π(θ, δ) = Pr(X ∈ S1)

where S1 is the critical region. (DeGroot and Morris, Probability and Statistics, 534)

A type I error is an erroneous choice to reject the null hypothesis. A type II error is an erroneous choice not to reject a false null hypothesis. (DeGroot and Morris, Probability and Statistics, 535)

The size α(δ) is defined as

α(δ) = sup θ ∈ Ω0 π(θ, δ)

where δ is a test and sup is the supremum. (DeGroot and Morris, Probability and Statistics, 535)

The p-value is the smallest level α0 that would result in rejection of the null hypothesis.

The likelihood ratio statistic Λ(x) is the largest value of the likelihood function in Ω0 compared to the entire parameter space Ω. In symbols

Λ(x) = [sup θ ∈ Ω0 ƒ(x | θ)] / [sup θ ∈ Ω ƒ(x | θ)]

(DeGroot and Morris, Probability and Statistics, 544)

A simple hypothesis is a choice from two alternative parameter values in the parameter space Ω = {θ0, θ1}. It has the form

H0: θ = θ0
H1: θ = θ1

The probability α of a type I error for a test procedure δ is α(δ) = Pr(Reject H0 | θ = θ0). The probability β of a type II error for a test procedure δ is β(δ) = Pr(Do not reject H0 | θ = θ1). (DeGroot and Morris, Probability and Statistics, 551)

A test procedure δ* is a uniformly most powerful test for the hypotheses

H0: θ ∈ Ω0
H1: θ ∈ Ω1

at the level of significance α0 if α(δ*) ≤ α0 and, for every other test δ with α(δ) ≤ α0,

π(θ, δ) ≤ π(θ, δ*) for all θ ∈ Ω1

(DeGroot and Morris, Probability and Statistics, 560)

A one sided alternative hypothesis has the form

H0: θ ≤ θ0
H1: θ > θ0

(DeGroot and Morris, Probability and Statistics, 562)

A two sided alternative hypothesis has the form

H0: θ = θ0
H1: θ ≠ θ0

(DeGroot and Morris, Probability and Statistics, 565)

An unbiased test is a test δ where

π(θ, δ) ≤ π(θ′, δ)

for every θ ∈ Ω0 and θ′ ∈ Ω1. (DeGroot and Morris, Probability and Statistics, 573)

A t test for the one sided hypothesis about the mean μ

H0: μ ≤ μ0
H1: μ > μ0

where the variance σ2 is unknown, rejects H0 if

U = n½ (X̄ - μ0)/σ′ ≥ c

and c is found from the t quantile function for level of significance α0. The opposite hypothesis H0: μ ≥ μ0 is rejected if U ≤ c. (DeGroot and Morris, Probability and Statistics, 576)

Example: We will work through Example 9.5.3 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 578) with R commands. The problem tests the hypothesis H0: μ ≥ 200 with a level of significance α0 = 0.1 for a random sample with n = 18, X̄ = 182.17, and σ′ = 72.22. The 0.1 quantile of the t distribution and the test statistic U can be found with the R commands

          > c = qt(0.1, df=17)
          > c
          [1] -1.333379
          > U = sqrt(18-1) * (182.17 - 200)/72.2
          >  U
          [1] -1.018213
        

Since U ≤ c is false, the null hypothesis μ ≥ 200 is not rejected. ▢

Manning and Schütze discuss using hypothesis testing with the t test for the occurrence of bigrams. Under the null hypothesis, the chance of the two individual words w1 and w2 coming together at any particular point in the text is Pr(w1w2) = Pr(w1)Pr(w2). The sample variance is needed in the t test, which is estimated as the variance σ2 = p(1 - p) of a Bernoulli trial. In this expression p is the word frequency or, in other words, the probability of the words occurring together. (Manning and Schütze, Statistical Natural Language Processing, 163-166)
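
A minimal sketch of this t test for the bigram 說法 is shown below. The count of the bigram and the count of 法 are taken from the earlier examples, but the total corpus size and the count of 說 are made-up values for illustration. A t value above roughly 2.58 would reject, at the 0.005 level, the null hypothesis that the two words occur together only by chance.

          > N <- 7000       # hypothetical total number of words in the tagged corpus
          > c.shuo <- 120   # hypothetical count of 說
          > c.fa <- 77      # count of 法 from the unigram example above
          > c.shuofa <- 9   # count of the bigram 說法 from the bigram example above
          > x.bar <- c.shuofa / N                     # observed rate of the bigram
          > mu <- (c.shuo / N) * (c.fa / N)           # expected rate if the words were independent
          > t.stat <- (x.bar - mu) / sqrt(x.bar / N)  # sample variance approximated by x.bar
          > t.stat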

A t test for the two sided hypothesis about the mean μ

H0: μ = μ0
H1: μ ≠ μ0

where the variance σ2 is unknown, rejects H0 if

|U| ≥ Tn-1-1(1 - α0/2)

where Tn-1-1 is the quantile function of the t distribution with n - 1 degrees of freedom. (DeGroot and Morris, Probability and Statistics, 582)

Evaluation Metrics in Natural Language Processing

Successful and unsuccessful prediction results are summarized in Table 1.

Table 1: Prediction Results
                                      Actual
                          Positive               Negative
Predicted   Positive      true positive (tp)     false positive (fp)
            Negative      false negative (fn)    true negative (tn)

Manning and Schütze define precision P as the proportion of predicted positive items that are correct:

P = tp/(tp + fp)

and recall R as the proportion of actual positive items that are correctly predicted:

R = tp/(tp + fn)

Precision and recall avoid the potential problem with accuracy that the number of negative events will be far greater than the number of positive events. That is, a system that successfully rejects many negative items may have high accuracy but not do very well on the positive events. (Manning and Schütze, Statistical Natural Language Processing, 268-269)
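
A minimal sketch of computing precision and recall from counts of prediction outcomes is shown below; the counts are made-up values for illustration.

          > tp <- 45; fp <- 5; fn <- 10   # hypothetical counts of prediction outcomes
          > precision <- tp / (tp + fp)
          > recall <- tp / (tp + fn)
          > precision
          [1] 0.9
          > recall
          [1] 0.8181818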

Comparing means from two distributions

To test a hypothesis comparing the means μ1 and μ2 of two populations X = X1, ... Xm and Y = Y1, ... Yn with the same variance σ2, the null and alternative hypotheses

H0: μ1 ≤ μ2
H1: μ1 > μ2

are used. The two-sample statistic U is defined as

U = (m + n - 2)½(X̄m - Ȳn) / [(1/m + 1/n)½(SX2 + SY2)½]

U has a t distribution with m + n - 2 degrees of freedom. In this statistic

X̄m = (1/m) ∑i=1mXi and Ȳn = (1/n) ∑i=1nYi

and

SX2 = ∑i=1m(Xi - X̄m)2 and SY2 = ∑i=1n(Yi - Ȳn)2

The null hypothesis H0 is rejected at the level of significance α0 if U ≥ Tm+n-2-1(1-α0). (DeGroot and Morris, Probability and Statistics, 588-589)

Example: We will work through Example 9.6.2 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 589-590) with R commands. The problem tests the hypothesis that the mean of rain from seeded clouds μ1 is greater than the mean of rain from unseeded clouds μ2. That is, we attempt to reject H0: μ1 ≤ μ2 at level of significance α0 = 0.01. The mean of the 26 measurements with seeded clouds is X̄m = 5.13. The mean of the 26 measurements with unseeded clouds is Ȳn = 3.99. Also, SX2 = 63.96 and SY2 = 67.39. The R commands to compute U and the critical value of Tm+n-2-1 are

          > U = sqrt(26+26-2)*(5.13-3.99)/(sqrt(1/26+1/26)*sqrt(63.96+67.39))
          >  U
          [1] 2.535984
          >  c = qt(0.99, df=26+26-2)
          >  c
          [1] 2.403272
        

Since U > c, the null hypothesis is rejected.▢

F Distribution

An F distribution arises for a variable X that combines two independent random variables Y and W with χ2 distributions with m and n degrees of freedom in the form

X = (Y/m)/(W/n)

X has an F distribution with m and n degrees of freedom, which is given by the formula

ƒ(x) = (Γ[(m + n)/2] mm/2nn/2) / [Γ(m/2)Γ(n/2)] · x(m/2)-1/(mx + n)(m+n)/2

for x > 0.

An F test compares the unknown variances of two random samples X1, ... Xm and Y1, ... Yn. The null and alternative hypotheses are

H0: σ12 ≤ σ22
H1: σ12 > σ22

In the test the statistic V is defined as

V = [SX2/(m - 1)] / [SY2/(n - 1)]

If V ≥ c, where c is determined from the level of significance α0, then H0 is rejected. (DeGroot and Morris, Probability and Statistics, 598-599) The quantile function for the F distribution can be computed with the R command qf(p, df1, df2).

Example: We will work through Example 9.7.3 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 600-601) with R commands. X1, ... X6 have unknown mean and variance. SX2 = 30 is computed from observations. Y1, ... Y21 also have unknown mean and variance and have SY2 = 40. The following R commands compute the statistic V and the quantile value for α0 = 0.025 and α0 = 0.05.

          > V = (30/(6-1))/(40/(21-1))
          > V
          [1] 3
          > qf(0.95, df1=5, df2=20)
          [1] 2.71089
          > qf(0.975, df1=5, df2=20)
          [1] 3.289056
        

Since V > 2.71 the null hypothesis is rejected at the α0 = 0.05 significance level but since V < 3.29 it is not rejected at the α0 = 0.025 level. ▢

Categorical Data

Categorical data is data where each observation is classified into one of several categories. (DeGroot and Morris, Probability and Statistics, 625)

χ2 goodness of fit test

The χ2 goodness of fit test measures how well the proportions of categories in a random sample match hypothesized proportions for the whole population. In this test there are k different categories, pi is the unknown probability that an observation falls into category i, and p10, ... pk0 are the hypothesized probabilities for each category. The hypothesis tested is

H0: pi = pi0 for i = 1, ... k
H1: pi ≠ pi0 for at least one i

That is, the null hypothesis is that the distribution of data amongst the different categories is explained by the probabilities given. The alternative hypothesis is that the data in at least one of the categories cannot be explained by the given probabilities. The χ2 statistic is defined as

Q = ∑i=1k(Ni - npi0)2 / npi0

where Ni are the numbers of each item in the sample and n is the total number in the sample. If H0 is true then Q converges to a χ2 distribution as n → ∞. (DeGroot and Morris, Probability and Statistics, 626) Smaller values for Q are less likely to result in rejection of the null hypothesis. The chisq.test R function can be used to perform this test.

Example: This example uses the χ2 goodness of fit test to compare the distribution counts for words grouped by part of speech in the tagged corpus and a document that is included in the tagged corpus. The most frequent few pronouns in the tagged corpus can be extracted with the following R command.

          > head(unigram[grep("/PN\\[", unigram$pos.tagged.text),])
                 pos.tagged.text element.text word.id frequency
          2     是/PN[shì | this]           是   17908       136
          12      所/PN[suǒ | it]           所   17100        66
          26        我/PN[wǒ | I]           我     321        45
          28     其/PN[qí | that]           其    1574        43
          33 云何/PN[yúnhé | how]         云何   29319        39
          62      汝/PN[rǔ | you]           汝    6690        22
        

The regular expression "/PN\\[" lets grep() match the pattern "/PN[" that encloses the part-of-speech tag. The R script pos_counts.R aggregates the frequencies of each separate part of speech in the tagged corpus and for the Heart Sutra and compares the counts with a χ2 test. It can be run with the following command.

          $ Rscript r/pos_counts.R 

          Chi-squared test for given probabilities

          data:  heart.count
          X-squared = 71.0523, df = 7, p-value = 9.052e-13
        

With a very small p-value, the output shows that the distribution of part-of-speech counts in the Heart Sutra is not well explained by the general distribution of part-of-speech counts in the tagged corpus. Small count values have been avoided by combining categories. The data are shown in Table 2.

Table 2: Predicted and Observed Word Counts for the Heart Sutra
Part of Speech          Observed Count    Predicted Count
Adjectives                     15               10.3
Existential verbs              21                5.5
Regular verbs                  32               36.9
Proper nouns                    7               19.8
Regular nouns                  63               51.2
Pronouns and numbers           10               23.8
Adverbs                        18               11.6
Other function words           15               21.8

There is a relatively higher proportion of existential verbs in the Heart Sutra, which is expected. ▢

χ2 Test for a Composite Hypothesis

The χ2 test for a composite hypothesis can be used to do a goodness-of-fit test against a distribution parameterized by a set of parameters θ = (θ1, ... θs). The probabilities for the categories are given by the functions π1(θ), ... πk(θ). The hypothesis is

H0: There is a θ ∈ Ω such that pi = πi(θ) for i = 1, ... k
H1: H0 is false

The Q statistic is defined as

Q = ∑i=1k[Ni - nπi(θ̂)]2 / nπi(θ̂)

If H0 is true then the cdf of Q will converge on the cdf of χ2 with k - s - 1 degrees of freedom, as n → ∞. (DeGroot and Morris, Probability and Statistics, 634-635)

Contingency Tables

A contingency table is a table that classifies each observation in two or more ways. A two-way contingency table classifies data in two ways. A contingency table can be used to test data independence using a χ2 test. If Nij is the count in each table cell then the row and column counts are

Ni+ = ∑j=1CNij
N+j = ∑i=1RNij

where C is the number of columns, R is the number of rows, and n is the total number of observations (the sum of all the cell counts). An estimator for the expected number in each cell, based on the marginal probabilities from the row and column counts, is

Êij = Ni+N+j/n

The statistic Q, defined as

Q = ∑i=1R∑j=1C(Nij - Êij)2 / Êij

has a χ2 distribution with (R - 1)(C - 1) degrees of freedom. (DeGroot and Morris, Probability and Statistics, 642-643)

Manning and Schütze describe the use of a contingency table with a χ2 test to test for the significance of collocations. (Manning and Schütze, Statistical Natural Language Processing, 169-172)

Example: This example tests the independence of the distribution of word counts for different parts of speech for three texts, The Heart Sutra, the Amitabha Sutra, and the Diamond Sutra. The R script pos_independence.R computes the word counts and invokes the χ2 test. It can be invoked as shown below.

          $ Rscript r/pos_independence.R 
                            Heart Amitabha Diamond
          Adjectives           15       86     140
          Existential Verbs    21       31     130
          Verbs                32      245     709
          Proper Nouns          7      194     404
          Nouns                63      361     942
          Pronouns              10      226     503
          Adverbs              18       69     248
          Function Words       15      164     510

            Pearson's Chi-squared test

          data:  data
          X-squared = 111.0789, df = 14, p-value < 2.2e-16
        

The low p-value in the results shows that the null hypothesis should be rejected. The distribution of word counts over parts of speech is not independent of the text, at the conventional 0.05 level of significance and even at stricter levels of significance. ▢

The χ2 test for data homogeneity is the same as the test for data independence. (DeGroot and Morris, Probability and Statistics, 649)

Nonparametric Methods

Nonparametric methods are methods that do not depend on data belonging to specific parametric families of distributions.

Kolmogorov-Smirnov Tests

Kolmogorov-Smirnov tests can be used to test observed values of continuous data against a parameterized distribution or to test the similarity of two sets of observed values.

A sample cdf, or sample distribution function, computed from a set of n observed values is Fn(x) = k/n, where k is the number of observations less than or equal to x.

The Kolmogorov-Smirnov test of observed data X1, ..., Xn against a particular continuous distribution with cdf F*(x) tests the hypotheses

H0: F(x) = F*(x), where F(x) is the cdf of the distribution that the data are drawn from
H1: H0 is false

If H0 is true then the statistic Dn* defined by

Dn* = sup_{-∞ < x < ∞} |Fn(x) - F*(x)|

that is, the largest difference between the sample cdf and F*(x), converges as

lim_{n→∞} Pr(n^{1/2} Dn* ≤ t) = H(t) = 1 - 2 ∑_{i=1}^{∞} (-1)^{i-1} e^{-2i²t²}

(DeGroot and Morris, Probability and Statistics, 658-660) The values of H(t) can be computed and the test performed with the R function ks.test(x, y, ...).
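As a minimal sketch, a one-sample test against a normal distribution might look like the following; the data values and the distribution parameters are made up for illustration.

          # Sketch: one-sample Kolmogorov-Smirnov test of hypothetical data
          # against a normal distribution with assumed mean and standard deviation.
          x <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.3, 5.8, 5.0, 5.6, 5.2)
          ks.test(x, "pnorm", mean = 5.3, sd = 0.4)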

The two-sample Kolmogorov-Smirnov test compares two sets of samples X1, ..., Xm and Y1, ..., Yn with cdfs F(x) and G(x). The hypotheses are

H0: F(x) = G(x)
H1: H0 is false

The test statistic Dmn is defined as

Dmn = sup_{-∞ < x < ∞} |Fm(x) - Gn(x)|

that is, the largest difference between the two sample cdfs, converges as

lim_{m→∞, n→∞} Pr[(mn/(m + n))^{1/2} Dmn ≤ t] = H(t)

(DeGroot and Morris, Probability and Statistics, 664) The values of H(t) can also be computed and the test performed with the R function ks.test(x, y, ...).
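A two-sample test can be run by passing both samples to ks.test(). The sketch below compares two small illustrative samples of per-scroll word frequencies; the values are made up for illustration.

          # Sketch: two-sample Kolmogorov-Smirnov test comparing two samples.
          x <- c(7.27, 5.76, 6.02, 3.00, 4.84, 5.92, 5.66)
          y <- c(5.89, 6.12, 4.27, 5.15, 4.22, 3.16, 4.28)
          ks.test(x, y)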

Robust Estimation

A contaminated normal distribution is a normal distribution with an additional component that is not normal. It has the form

ƒ(x) = (1 - ϵ)(2πσ²)^{-1/2} exp(-[x - μ]²/(2σ²)) + ϵg(x)

The function g(x) is the contaminating distribution. (DeGroot and Morris, Probability and Statistics, 668) A contaminated normal distribution may have thicker tails than a normal distribution due to outliers and so parametric methods assuming a normal distribution may not be accurate. In particular, the sample mean and sample variance may not be accurate, decreasing in accuracy with the degree of contamination.

A scale parameter for the distribution of a random variable X is a parameter σ such that the corresponding parameter for the random variable aX + b is aσ. The standard deviation is a scale parameter, but it may not exist for all distributions. However, the median and interquartile range (IQR) exist for all distributions. The median absolute deviation is defined as the median of |X - m|, where m is the median of X. The median absolute deviation is a scale parameter that exists for all distributions. For a symmetric distribution, the median absolute deviation is half the IQR. (DeGroot and Morris, Probability and Statistics, 670)
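The following sketch shows robust location and scale estimates in R for a small made-up sample containing an outlier. Note that the R function mad() scales the median absolute deviation by 1.4826 by default so that it is comparable to the standard deviation for normal data.

          # Sketch: robust estimates for data that may be contaminated by outliers.
          # The sample is made up and contains one outlier.
          x <- c(4.8, 5.1, 5.0, 4.9, 5.2, 5.3, 4.7, 15.0)
          mean(x)               # sensitive to the outlier
          median(x)             # robust location estimate
          IQR(x)                # interquartile range
          mad(x, constant = 1)  # raw median absolute deviation
          mad(x)                # scaled to be consistent with sd for normal data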

Linear Regression

Least-Squares

A least-squares line is a straight line fitted to sample data by minimizing the sum of the squared vertical deviations of the points from the line. A least-squares line is defined by the equation y = β̂0 + β̂1x, where

β̂0 = ȳ - β̂1x̄
β̂1 = ∑_{i=1}^{n} (xi - x̄)(yi - ȳ) / ∑_{i=1}^{n} (xi - x̄)²

and x̄ = (1/n)∑_{i=1}^{n} xi, ȳ = (1/n)∑_{i=1}^{n} yi. (DeGroot and Morris, Probability and Statistics, 692) The R function lm(y ~ x), amongst other things, performs a least-squares fit of y as a function of x.

Example: Let's use the R function lm(y ~ x) to work through Example 11.2.2 in DeGroot and Morris. (DeGroot and Morris, Probability and Statistics, 689-692) The data in this example relates the boiling point of water at different altitudes to the air pressure and is actually from Forbes. (Forbes, Further Experiments and Remarks on the Measurement of Heights by the Boiling Point of Water, 135-143) The array x contains measurements of temperature at which the water boils and the array y contains the air pressure measurements.

          > x <- c(194.5, 194.25, 197.9, 198.43, 199.45, 199.95, 200.93, 201.15, 201.35, 201.3, 203.55, 204.6, 209.47, 208.57, 210.72, 211.95, 212.18)
          > y <- c(20.79, 20.79, 22.4, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02, 24.015, 25.14, 26.57, 28.49, 27.76, 29.04, 29.879, 30.064)
          > least.sq <- lm(y ~ x)
          > least.sq
          Call:
          lm(formula = y ~ x)

          Coefficients:
          (Intercept)            x  
              -81.088        0.523
        

So β̂0 = -81.088, β̂1 = 0.523 and the equation of the line is y = -81.088 + 0.523x. (There are some minor rounding differences between DeGroot and Morris' input data and Forbes' original values.) ▢

Least squares can also be used to fit sample data to a function of many variables, as in the formula below.

y = β̂0 + β̂1x1 + ... + β̂kxk

The sum of the squared deviations is minimized, resulting in a set of simultaneous equations that can be solved for β̂0, β̂1, ..., β̂k. This method can be extended to fitting non-linear functions by letting x1, ..., xk be other functions of x, such as powers of x and log functions.

Example: We saw above that there was a correlation between text size and the number of unique words in different texts in the corpus. However, it was not linear. It turns out that we can get a more linear relation in log space. The following R commands do a linear regression analysis on the log of the word count versus the log of the count of unique words and plot the line of best fit on the graph with the data points.

          > x <- log(corpusstats$word.count)
          > y <- log(corpusstats$unique.words)
          > unique.lm <- lm(y ~ x)
          > plot(x, y, xlab="Log Word Count", ylab="Log Unique Words", pch=17, col="blue")
          > abline(unique.lm)
        

The function abline() takes the output of the linear regression and adds it to the plot. The graph generated is shown in Figure 10.


Figure 10: Linear Regression of Log Count of Unique Words versus Log of Text Length

Simple Linear Regression

A regression function of Y on X1, ..., Xk uses the random variables X1, ..., Xk as predictors of the random variable Y. The expected value of Y is

E(Y|x1, ..., xk) = β0 + β1x1 + ... + βkxk

where (x1, ..., xk) are observed values. (DeGroot and Morris, Probability and Statistics, 699)

Simple linear regression relates Y to a single random variable X. The relationship can be expressed as Y = β0 + β1x + ϵ. The variable ϵ has mean 0 and variance σ2. The assumptions of simple linear regression are

  1. The predictor is known. That is, there are observed values x1, ..., xn.
  2. For each i, the conditional distribution of Yi given xi is a normal distribution.
  3. Parameters β0 and β1 exist such that the conditional mean of Yi has the form β0 + β1xi.
  4. The variance σ2 of Yi for different values of xi is constant.
  5. The random variables Y1, ... , Yn are independent.

If these assumptions are met, then the joint conditional distribution of Y1, ..., Yn is

ƒ(y|x, β0, β1, σ²) = (2πσ²)^{-n/2} exp[-(1/(2σ²)) ∑_{i=1}^{n} (yi - β0 - β1xi)²]

The maximum likelihood estimate of the variance of Y is

σ̂² = (1/n) ∑_{i=1}^{n} (yi - β̂0 - β̂1xi)²

(DeGroot and Morris, Probability and Statistics, 700-701)

Statistical inference on simple linear regression can be made in a way similar to that for the parameters discussed above. The variable nσ̂²/σ² has a χ2 distribution with n - 2 degrees of freedom. (DeGroot and Morris, Probability and Statistics, 709)

A prediction interval for Y with coefficient 1 - α0 is given by

Ŷ ± T_{n-2}^{-1}(1 - α0/2) σ′ [1 + 1/n + (x - x̄)²/sx²]^{1/2}

where

sx² = ∑_{i=1}^{n} (xi - x̄)²

σ′ = (S²/(n - 2))^{1/2}

S² = ∑_{i=1}^{n} (Yi - β̂0 - β̂1xi)²

(DeGroot and Morris, Probability and Statistics, 710-716)
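In R, a prediction interval can be computed with predict() on a fitted lm object. The sketch below uses the least.sq fit from the Forbes example above; the new temperature value 205 is chosen only for illustration.

          # Sketch: 95% prediction interval for a new observation at x = 205,
          # using the least-squares fit least.sq from the Forbes example above.
          predict(least.sq, newdata = data.frame(x = 205),
                  interval = "prediction", level = 0.95)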

Residuals are the differences between observed and predicted values, ei = yi - ŷi, for i = 1, ..., n. The residuals should be randomly distributed. If the residuals show some kind of pattern, that may indicate that a straight line is not the best model for the data. The residuals are often plotted as a function of xi to see whether they exhibit a pattern; this is often called analysis of residuals. A normal quantile plot, or Q-Q plot, is a chart of the quantiles of the residuals versus the quantiles of the normal distribution. The plot should be a straight line. If not, the residuals may belong to a distribution other than the normal distribution.

Example: Let's find the residuals for the unique word count example above. Calling the function summary() on the linear regression object gives information about residuals, as shown in the R script below.

          > summary(unique.lm)

          Call:
          lm(formula = y ~ x)

          Residuals:
              Min      1Q  Median      3Q     Max 
          -1.2653 -0.2087  0.1326  0.2377  0.3945 

          Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
          (Intercept)  1.15877    0.14502    7.99 2.28e-11 ***
          x            0.68528    0.01936   35.39  < 2e-16 ***
          ---
          Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

          Residual standard error: 0.3426 on 68 degrees of freedom
          Multiple R-squared:  0.9485,  Adjusted R-squared:  0.9477 
          F-statistic:  1252 on 1 and 68 DF,  p-value: < 2.2e-16

          > plot(unique.lm)

        

This shows that the median residual is 0.1326. Besides the residual summary, a lot of other more advanced information is displayed as well, which I won't go into. The call plot(unique.lm) displays diagnostic plots, including a plot of the residuals, as shown in Figure 11.


Figure 11: Residuals of Log Text Length versus Log Unique Word Count

From Figure 11, it can be seen that the distribution of residuals is not really random. Therefore, a log-log relation between text size and unique word count is not a perfect fit. Several charts are generated by the plot(unique.lm) call above, including a Q-Q plot, which is shown in Figure 12.


Figure 12: QQ Plot for Regression Analysis of Log Text Length versus Log Unique Word Count

The Q-Q plot deviates from a straight line, showing that the distribution of errors is not perfectly normal. ▢

Knell explains the use of R for simple linear regression. (Knell, Introductory R, 222-260)

General Linear Model

The general linear model generalizes the simple model to multiple predictor variables zi = (zi0, ..., zi,p-1) with a vector of parameters β = (β0, ..., βp-1). A similar set of five assumptions is necessary. The dependent variables Yi for i = 1, ..., n have the same variance σ². The expected value of Yi is given by

E(Yi | zi, β) = zi0β0 + zi1β1 + ... + zi,p-1βp-1

The predictor variables are written in an n × p matrix Z = [zij] for i = 1, ..., n, j = 0, ..., p - 1, called the design matrix. The least-squares estimator β̂ of β is

β̂ = (Z′Z)^{-1}Z′Y

where Z′ is the transpose of Z. The mean of the dependent variable Y can be expressed in matrix form as

E(Y) = Zβ

(DeGroot and Morris, Probability and Statistics, 736-742)

The residuals are similar to simple linear regression,

ei = yi - ŷi = yi - zi0β̂0 - ... - zi,p-1β̂p-1

(DeGroot and Morris, Probability and Statistics, 749) You can fit a model with multiple predictors in R with the lm() function; the glm() function fits the broader class of generalized linear models.
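The sketch below fits a model with two predictors using lm() and then reproduces the coefficient estimates from the matrix formula β̂ = (Z′Z)^{-1}Z′Y; the data values are made up.

          # Sketch: a general linear model with two predictors (made-up data).
          z1 <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
          z2 <- c(2.1, 1.9, 3.5, 4.2, 5.5, 6.1)
          y  <- c(3.2, 4.1, 6.8, 8.0, 10.3, 11.9)
          fit <- lm(y ~ z1 + z2)      # least-squares fit with an intercept
          coef(fit)
          # The same estimates from the matrix formula (Z'Z)^-1 Z'Y, where the
          # first column of the design matrix Z is all ones for the intercept.
          Z <- cbind(1, z1, z2)
          solve(t(Z) %*% Z) %*% t(Z) %*% y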

Analysis of Variance

Analysis of variance (ANOVA) is an application of multiple regression where the design matrix has a special form. A one-way ANOVA layout tests whether the means μi of p different populations are the same. The variance σ² is assumed to be the same for all of the populations.

H0: μ1 = ... = μp
H1: H0 is false

If H0 is true then the test statistic

U2 = [SBetw2/(p - 1)] / [SResid2/(n - p)]

has an F distribution with degrees of freedom p - 1 and n - p. The total sum of squares

STot² = ∑_{i=1}^{p} ∑_{j=1}^{ni} (Yij - Ȳ++)²

is partitioned into two parts, the residual sum of squares and the between samples sum of squares:

STot2 = SResid2 + SBetw2

where the parts are defined as

SResid² = ∑_{i=1}^{p} ∑_{j=1}^{ni} (Yij - Ȳi+)²
SBetw² = ∑_{i=1}^{p} ni(Ȳi+ - Ȳ++)²

where

Ȳ++ = (1/n) ∑_{i=1}^{p} ∑_{j=1}^{ni} Yij = (1/n) ∑_{i=1}^{p} ni Ȳi+

(DeGroot and Morris, Probability and Statistics, 754-761) You can do an analysis of variance in R with the anova() function, which takes the output of a model-fitting function such as glm(), or with the aov() function, which takes a formula.

Example: Let's work through Example 11.6.5 in DeGroot and Morris using R. (DeGroot and Morris, Probability and Statistics, 755-760) There are p = 4 different kinds of hot dogs (Beef, Meat, Poultry, Specialty) with different calorie counts. We want to test whether the average calorie counts for the different kinds of hot dogs are the same.

          > type <- c("Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty") 
          > n = length(type)
          > p = 4
          >  calories <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111, 141, 153, 190, 157, 131, 149, 135, 132, 173, 191, 182, 190, 172, 147, 146, 139, 175, 136, 179, 153, 107, 195, 135, 140, 138, 129, 132, 102, 106, 94, 102, 87, 99, 107, 113, 135, 142, 86, 143, 152, 146, 144, 155, 170, 114, 191, 162, 146, 140, 187, 180)
          > hotdog.aov = aov(calories ~ type)
          > summary(hotdog.aov)
            Df Sum Sq Mean Sq F value   Pr(>F)    
            type         3  19454    6485    11.6 4.48e-06 ***
            Residuals   59  32995     559                     
            ---
            Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
          > qf(0.99, p - 1, n - p)
          [1] 4.132055
          > qf(0.999, p - 1, n - p)
          [1] 6.185016
        

The "formula" here relates the vector type of hot dog types to the vector calories of corresponding calorie counts. The test statistic is 11.6, which has a p-value of 4.48-6, which is rejected at most levels of significance, including α0 = 0.001. This is demonstrated by the calls to the F distribution quantile R function qf(p, df1, df2). Therefore, the H0 hypothesis, that the mean calorie counts are the same, is rejected. ▢

Example: This example looks at average word frequencies in Buddhist texts. The word frequencies of function words should be similar across texts of the same period because they do not depend greatly on the content of the text. One of the most frequently occurring function words in literary Chinese is the nominalizer 所 suǒ. 所 suǒ is placed in front of a verb to make the objects acted on by the verb the subject. (Pulleyblank, Outline of Classical Chinese Grammar, 68) It can be roughly understood as the English pronoun "it." The frequencies per thousand words of 所 suǒ, by scroll, in several texts are given in Table 3. The texts are Biographies of Eminent Monks 《高僧傳》 by Huijiao (497-554), Buddhist Monasteries in Luoyang 《洛陽伽藍記》 by Yang Xuanzhi (c. 547), and The Great Calming and Contemplation 《摩訶止觀》 by Zhi Yi (538-597).

Table 3: Frequencies per Thousand Words of 所 Suǒ
Text abbreviations: GSZ = Biographies of Eminent Monks, Luoyang = Buddhist Monasteries in Luoyang, MHZG = The Great Calming and Contemplation. Values are frequencies per thousand words, by scroll.

          Scroll    GSZ       Luoyang   MHZG
          1         7.27      6.66      5.89
          2         5.76      6.36      6.12
          3         6.02      7.45      4.27
          4         3.00      6.43      5.15
          5         4.84      4.99      4.22
          6         5.92                3.16
          7         5.66                4.28
          8         3.69                2.94
          9         4.79                3.35
          10        4.76                5.60
          11        3.88
          12        5.35
          13        5.61
          14        5.03
          Mean      5.11      6.38      4.49

A box plot generated with R is shown in Figure 13.


Figure 13: Box Plot of Frequencies per Thousand Words of 所 Suǒ

A test of whether the mean word frequencies of 所 suǒ for the three texts are equal is done with ANOVA using the R commands below.

          > gsz <- c(7.27, 5.76, 6.02, 3.00, 4.84, 5.92, 5.66, 3.69, 4.79, 4.76, 3.88, 5.35, 5.61, 5.03)
          > summary(gsz)
             Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
            3.000   4.768   5.190   5.113   5.735   7.270 
          > luoyang <- c(6.66, 6.36, 7.45, 6.43, 4.99)
          > summary(luoyang)
             Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
            4.990   6.360   6.430   6.378   6.660   7.450 
          > mhzg <- c(5.89, 6.12, 4.27, 5.15, 4.22, 3.16, 4.28, 2.94, 3.35, 5.60)
          > summary(mhzg)
             Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
            2.940   3.568   4.275   4.498   5.487   6.120 
          > texts <- c("gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "luoyang", "luoyang", "luoyang", "luoyang", "luoyang", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg")
          > frequencies <- c(gsz, luoyang, mhzg)
          > length(frequencies)
          [1] 29
          > frequencies.aov = aov(frequencies ~ texts)
          > summary(frequencies.aov)
                      Df Sum Sq Mean Sq F value Pr(>F)  
          texts        2  11.78   5.891   5.014 0.0144 *
          Residuals   26  30.55   1.175                 
          ---
          Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
          > p <- 3
          > n <- length(frequencies)
          > qf(0.99, p - 1, n - p)
          [1] 5.526335
          > qf(0.95, p - 1, n - p)
          [1] 3.369016
        

The output of the summary of the aov() result shows that the F value is 5.014. The computed values of the inverse of the cumulative F distribution at the α0 = 0.01 and 0.05 levels are F^{-1}(1 - 0.01) = 5.526 and F^{-1}(1 - 0.05) = 3.369. The ANOVA shows that the null hypothesis that the means are equal should be rejected at the 0.05 level of significance but not at the 0.01 level, since 3.369 < F value < 5.526. This is consistent with the p-value Pr(>F) = 0.0144 in the ANOVA output. ▢

Simulation

Simulation can be used to solve problems where there is no closed-form solution. Examples are computing means, variances, and quantiles for formulas that cannot be solved algebraically. The simulations usually make use of large numbers of randomly generated numbers. Statistical simulation is sometimes called Monte Carlo analysis because of the use of randomness. (DeGroot and Morris, Probability and Statistics, 791)

R functions can be used to generate random values from the different distributions discussed in this document. Figure 14 shows a histogram produced from the R function rnorm(10000), which generates 10,000 random numbers with a standard normal distribution.


Figure 14: Histogram of Random Values from a Standard Normal Distribution
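A histogram like Figure 14 could be produced with commands along the following lines; the number of bins and the labels are assumptions.

          # Sketch of commands that could produce a histogram like Figure 14.
          x <- rnorm(10000)   # 10,000 standard normal random values
          hist(x, breaks = 50, main = "Histogram of rnorm(10000)", xlab = "x")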

Information Theory

Information theory arose from communications theory in electrical engineering in connection with the study of transmission rates and error correction. (Cover and Thomas, 2012, ch. 1) Information theory was first formulated in its present form by Shannon in the 1940s. (Cover and Thomas, 2012, ch. 2) It was later applied to areas of computer science, such as data compression and data storage. Information theory concepts have since spread back into statistics and other fields like economics and physics. The definition of entropy above can be used as a basis for many other definitions that parallel concepts in probability and statistics. Some of these are described here.

Concepts

The joint entropy H(X, Y) of a pair of random variables is defined as

H(X, Y) = -∑x∈Xy∈Yp(x, y) log(p(x, y))

where (X, Y) is a pair of discrete random variables with probability distribution p(x, y). (Cover and Thomas, 2012, ch. 2.2)

The conditional entropy H(Y|X) is defined as

H(Y|X) = ∑_{x∈X} p(x) H(Y|X=x) = -∑_{x∈X} ∑_{y∈Y} p(x, y) log(p(y|x))

where (X, Y) is a pair of discrete random variables. (Cover and Thomas, 2012, ch. 2.2)

The chain rule for entropy is

H(X, Y) = H(X) + H(Y|X)

Described in words, the entropy of a pair of random variables is the sum of the entropy of one random variable and the conditional entropy of the other. (Cover and Thomas, 2012, ch. 2.2)

Relative entropy D(p||q) is a measure of the distance between the distributions of random variables p and q. In symbols, it is defined as

D(p||q) = ∑x∈Xp(x) log(p(x)/q(x))

(Cover and Thomas, 2012, ch. 2.3)

The mutual information I(X; Y) for two discrete random variables X and Y with joint distribution p(x, y) is defined as the relative entropy between the joint distribution and the product of the marginal distributions p(x)p(y)

I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log(p(x, y)/[p(x)p(y)])

(Cover and Thomas, 2012, ch. 2.3)

The relationship between mutual information and entropy is

I(X; Y) = H(X) - H(X|Y)

(Cover and Thomas, 2012, ch. 2.3)
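These quantities can be computed directly from a joint distribution. The following sketch uses a small hypothetical joint distribution p(x, y) and base-2 logarithms, so the results are in bits.

          # Sketch: entropy, conditional entropy, and mutual information for a
          # small hypothetical joint distribution p(x, y), in bits.
          p.xy <- matrix(c(0.125,  0.0625, 0.03125, 0.03125,
                           0.0625, 0.125,  0.03125, 0.03125,
                           0.0625, 0.0625, 0.0625,  0.0625,
                           0.25,   0,      0,       0), nrow = 4, byrow = TRUE)
          p.x <- rowSums(p.xy)              # marginal distribution of X
          p.y <- colSums(p.xy)              # marginal distribution of Y
          entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
          H.x  <- entropy(p.x)              # H(X)
          H.y  <- entropy(p.y)              # H(Y)
          H.xy <- entropy(as.vector(p.xy))  # H(X, Y)
          H.xy - H.x                        # H(Y|X) by the chain rule
          H.x - (H.xy - H.y)                # I(X; Y) = H(X) - H(X|Y)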

The concept of convex functions is important in many proofs in information theory. A function f(x) is convex on the interval (a, b) if for every pair of points x1, x2 in the interval and for 0 ≤ λ ≤ 1, the following condition holds

f(λx1 + (1-λ)x2) ≤ λf(x1) + (1-λ)f(x2)

(Cover and Thomas, 2012, ch. 2.6)

Markov chains, introduced above, are useful in information theory. Random variables X, Y, and Z form a Markov chain X → Y → Z if the joint probability function is

p(x, y, z) = p(x)p(y|x)p(z|y)

In words, this means that Z is conditionally independent of X given Y. (Cover and Thomas, 2012, ch. 2.8)

The data-processing inequality theorem states that if X → Y → Z form a Markov chain then

I(X; Y) ≥ I(X; Z)

In words, no manipulation of Y can increase the information that Y contains about X. (Cover and Thomas, 2012, ch. 2.8)

As explained above, a statistic is a function of a set of random variables. A sufficient statistic T(X) contains all the information in X about ϴ. In symbols, a statistic T(X) is sufficient relative to a family of probability functions {fϴ(x)} parameterized by ϴ if X is independent of ϴ given T(X), for any distribution on ϴ. In this case, ϴ → T(X) → X is a Markov chain. This is the case of equality in the data-processing inequality theorem:

I(ϴ; X) = I(ϴ; T(X))

(Cover and Thomas, 2012, ch. 2.9)

T(X) is a minimal sufficient statistic if it is a function of every other sufficient statistic. In this case, ϴ → T(X) → U(X) → X is a Markov chain for any other sufficient statistic U(X).

I(ϴ; X) = I(ϴ; T(X))

(Cover and Thomas, 2012, ch. 2.9)

Asymptotic Equipartition Property

The asymptotic equipartition property (AEP) states that, if X1, X2, ... are independent and identically distributed with probability function p(x), then

-(1/n) log p(X1, X2, ... , Xn) → H(X)

(Cover and Thomas, 2012, ch. 3.1) The AEP is analogous to the law of large numbers in statistics. Here the symbol → means convergence of random variables. A sequence of random variables X1, X2, ... converges to a random variable X with probability 1 if Pr{lim_{n→∞} Xn = X} = 1.
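The convergence can be illustrated with a small simulation; the source alphabet and probabilities below are made up.

          # Sketch: simulation of the AEP for an i.i.d. three-symbol source.
          p <- c(a = 0.5, b = 0.3, c = 0.2)
          H <- -sum(p * log2(p))            # entropy of the source, in bits
          n <- 10000
          x <- sample(names(p), n, replace = TRUE, prob = p)
          -(1/n) * sum(log2(p[x]))          # should be close to H
          H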

A typical set Aε(n) with respect to the probability function p(x) is the set of sequences (x1, x2, ..., xn) with the property

2^{-n(H(X)+ε)} ≤ p(x1, x2, ..., xn) ≤ 2^{-n(H(X)-ε)}

(Cover and Thomas, 2012, ch. 3.1)

The AEP has applications in data compression. By partitioning the set of strings into a typical set and a non-typical set and representing the strings in each set by an integer index, for a large sequence length n, the expected number of bits needed to represent a sequence is approximately nH(X). (Cover and Thomas, 2012, ch. 3.2)

Data Compression

A source code for a random variable is a mapping from the range of the random variable to a set of finite-length strings of symbols from an alphabet. Here the source code is given the symbol C, the random variable is X, the arity of the alphabet is D, C(x) is the codeword for the value x of the random variable X, D* is the set of finite-length strings of symbols from the alphabet, and l(x) is the length of C(x). The expected length L(C) of a source code is

L(C) = ∑x∈X p(x)l(x)

For example, if X is a random variable with x ∈ {1, 2, 3, 4}, and D = {0, 1}, then a possible source code is C(1) = 0, C(2) = 10, C(3) = 110, C(4) = 111. A code is nonsingular if every element of the range of X maps to a different string in D*. The example just given is nonsingular because each of the codewords is different. (Cover and Thomas, 2012, ch. 5.1)

The extension of a source code is the set of strings C(X) concatenated together. In symbols, the extension C* is the set of strings

C(x1)C(x2) ... C(xn)

If the extension is nonsingular then the code is referred to as uniquely decodable. This is important because it means that a compressed file can be made up of concatenated codewords without any separator between them. The example above is uniquely decodable because no codeword is a prefix of any other codeword. Such a source code is called a prefix code or an instantaneous code. It does not need any lookahead to resolve ambiguity in the string of symbols. (Cover and Thomas, 2012, ch. 5.1)
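As a small sketch, the expected length of the example code above can be computed for an assumed distribution of X; the probabilities below are made up.

          # Sketch: expected length of the code C(1) = 0, C(2) = 10, C(3) = 110,
          # C(4) = 111 under an assumed distribution for X.
          p <- c(0.5, 0.25, 0.125, 0.125)   # assumed probabilities for x = 1, ..., 4
          l <- c(1, 2, 3, 3)                # codeword lengths l(x)
          sum(p * l)                        # expected length L(C)
          -sum(p * log2(p))                 # entropy of X, a lower bound on L(C)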

References

  1. Amies, Alex. NTI Buddhist Text Reader GitHub Project, 2014. https://github.com/alexamies/buddhist-dictionary.
  2. Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 2009.
  3. Cover, Thomas M, and Joy A Thomas. 2012. Elements of Information Theory. New York, NY: John Wiley & Sons.
  4. Forbes, J. D. “Further Experiments and Remarks on the Measurement of Heights by the Boiling Point of Water.” Transactions of the Royal Society of Edinburgh 21 (1857): 135–43.
  5. DeGroot, Morris H., and Mark J. Schervish. Probability and Statistics. 4th edition. Boston: Pearson, 2011.
  6. Knell, Robert. Introductory R: A Beginner’s Guide to Data Visualisation, Statistical Analysis and Programming in R. United Kingdom: Self published, 2013.
  7. Krippendorff, Klaus H. Content Analysis: An Introduction to Its Methodology. 3rd edition. SAGE Publications, Inc, 2012.
  8. POS tagger (Java), 2006. http://en.literateprograms.org/POS_tagger_(Java), accessed 2015-09-19.
  9. Pulleyblank, Edwin G. Outline of Classical Chinese Grammar. Vancouver: UBC Press, 1995.
  10. Manning, Christopher D. and Schütze, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press, 1999.
  11. Mol, M., 2013. rosettacode.org, accessed 2015-09-20.
  12. Yau, Chi. R Tutorial with Bayesian Statistics Using OpenBUGS. Self, 2014. http://www.r-tutor.com/content/r-tutorial-ebook.
  13. The R Project for Statistical Computing (version Pumpkin Helmet). R Foundation, 2014. http://www.r-project.org.
