This page gives some incomplete notes on statistical background in analyzing Chinese text and analyzing quality of corpus data. The statistical techniques described here are currently not used in the NTI Reader. Rather, simple counts based on unigram and bigram frequencies are used for word sense disambiguation by the NTI Reader in sentence-based dictionary lookup and HTML gloss generation.
A corpus is a collection of texts, which may be used by linguists for studying the characteristics of language usage. (Bird, Natural Language Processing, 39) The NTI Buddhist Text Reader contains a corpus of Buddhist and classical Chinese texts. A small subset of the corpus is tagged with part-of-speech tags and gloss distinguishing word sense. The organization of the corpus is described on the page Text Management, Annotation, and Gloss on this web site. The tagging scheme is described on the page Part-of-Speech Tag Definitions. The raw data, including word frequencies, can be found at the NTI Buddhist Text Reader GitHub Project in the data/dictionary folder. They are tab delimited text files.
Word frequencies from the tagged subset of the corpus can found in the files unigram.txt and bigram.txt These are a tab delimited UTF-8 file with no header row. The unigram.txt file lists single word frequencies in the tagged corpus. The structure of the unigram.txt file is:
pos_tagged_text: The element text with POS tag and gloss in pinyin and English element_text: The element text in traditional Chinese word_id: Matching id in the word table (positive integer) frequency: The frequency of occurrence of the word sense (positive integer)
The word_id
matches the numeric id of a word in the dictionary file
words.txt.
Each entry in the words.txt file is a lexeme - a single dictionary entry with a single
meaning.
(Manning and Schütze, Statistical Natural Language Processing, 128)
The bigram.txt file lists frequencies of two-word combinations. The structure of the bigram.txt file is:
pos_tagged_text: The element text with POS tag and gloss in pinyin and English previous_text: The element text in traditional Chinese element_text: The element text in traditional Chinese word_id: Matching id in the word table (positive integer) frequency: The frequency of occurrence of the word sense (positive integer)
A tab delimited plain text list of summary statistics for the corpus texts is included in the file corpus_stats.txt. The structure of the file is
source_name: the title of the text word_count: the number of words in the text character_count: the number of characters in the text unique_words: the number of unique words in the text
There are a number of software tools for statistical modeling and may be used for text analysis. The R programming language will be used here. For other languages, examples of basic statistical operations can be found at the pages Combinations and permutations and Statistics/Basic.
The R Project for Statistical Computing (http://www.r-project.org), or simply R, is an open source project, language, and platform. You can freely download and install R for Linux, Mac, and Windows from links in the project web site. The R project web site has links to introductory materials. I referred to the books by Knell (Knell, Introductory R: A Beginner’s Guide to Data Visualisation, Statistical Analysis and Programming in R) and Yau (Yau, R Tutorial with Bayesian Statistics Using OpenBUGS) in preparing the commands and scripts presented here.
R is a generic statistics platform. It is very handy for working through statistics problems in textbooks and trying stuff out rather than using a piece of paper and calculator or a spreadsheet. In addition, there are a number of packages for R specifically for text analysis. These can be found in the web page Available CRAN Packages By Name. Some examples are: koRpus: An R Package for Text Analysis, tau: Text Analysis Utilities, textcat: N-Gram Based Text Categorization, and tm: Text Mining Package.
After installing R, open the command line interpreter with the command R
.
The unigram.txt file can be read in to a data frame using the read.table()
function, as shown below.
Change directories to the directory containing the file first.
$ R . . . > names <- c("pos.tagged.text", "element.text", "word.id", "frequency") > unigram <- read.table("unigram.txt", header=FALSE, sep="\t", quote="\"", col.names=names, numerals ="allow.loss") > head(unigram) pos.tagged.text element.text word.id frequency 1 佛/NR[Fó | Buddha] 佛 3618 330 2 有/VE[yǒu | to have] 有 32458 249 3 於/P[yú | in] 於 1710 220 4 不/AD[bù | not] 不 502 218 5 人/NN[rén | person] 人 399 212 6 諸/DT[zhū | all] 諸 3556 193
The $
prompt is shown before a shell command and the >
prompt is shown
before an R command.
The c()
function concatenates the arguments into a vector, which is assigned to the
variable names
using the assignment operator <-
.
The variable names
is used for the column names later.
The unigram.txt file into the unigram
data frame with the read.table()
function.
The head
function prints out the first few lines of the data frame.
The UTF-8 encoded Chinese characters are read in by R correctly.
The NTI Reader text files are formatted to be loaded into MySQL and there are a few differences with the
basic form of the read.table()
function.
The format for NULL
values in MySQL is \N
but R expects NA
.
So, you will need to be careful if you depend on accurate representation of NULL
values.
In addition, the NTI Reader files do not have variable names in the first row.
The bigram.txt file can be read in to a data frame in a similar way, as shown below.
> binames <- c("pos.tagged.text", "previous.text", "element.text", "word.id", "frequency") > bigram <- read.table("bigram.txt", header=FALSE, sep="\t", quote="\"", col.names=binames, numerals ="allow.loss") > head(bigram) pos.tagged.text previous.text element.text word.id frequency 1 故/NN[gù | purpose] 以 故 7115 41 2 以/P[yǐ | because] 何 以 30648 41 3 僧/NN[sēng | monk] 衆 僧 3366 40 4 此/DT[cǐ | this] 於 此 624 37 5 佛/NR[Fó | Buddha] 諸 佛 3618 34 6 塔/NN[tǎ | stupa] 起 塔 3374 34
Data can be plotted in R using the plot()
function, as shown below.
> x <- seq(1, length(unigram$frequency)) > plot(x, unigram$frequency, xlab="Rank", ylab="Frequency", pch=20, col="blue")
The seq()
function generates a sequence of integers so that the frequency
of each word will be plotted from most frequent at the left to least frequent at the right.
The resulting plot is the word frequency versus rank, shown in Figure 1.
Figure 1: Word Frequency versus Rank
The plot in Figure is informative in that it shows that only a handful of words have a high frequency and the vast majority is a long tail with frequency 1. This is a phenomenon in word frequencies for natural language in general. (Manning and Schütze Foundations of Statistical Natural Language Processing. Cambridge, 23-27) An early model for the relation between word frequency and rank is Zipf's law that states that word frequency f is inversely proportional to rank r.
f ∝ 1/r
(Manning and Schütze Foundations of Statistical Natural Language Processing. Cambridge, 24)
To generate an image file for that plot and the others in this page, pull the project from GitHub, change to the top level directory in the project and type the command
$ Rscript r/generate_images.R
This is a useful diagram because it shows the breadth and depth of the corpus at a glance. The corpus only has about 3,200 unique words and only has a substantial frequency for only about 200 of these words.
You will need to install the R showtext
package to properly display Chinese text on R generated graphics.
Use the commands below to install showtext
.
> install.packages("showtext") > library(showtext)
You can make a histogram plot of the frequency data using the barplot()
function, as shown below.
> freq <- as.vector(as.matrix(unigram[4])[1:8, 1]) > freq.labels <- as.vector(as.matrix(unigram[2])[1:8, 1]) > showtext.begin() > barplot(freq, names.arg=freq.labels, ylab="Frequency", col="steelblue") > showtext.end()
A subset of fourth column of the unigram data.frame is read into the variable freq
, after being converted to a vector.
The labels are the Chinese text for each word taken from column 2.
Rather than generate the graph in a window, the code above writes it to a png file.
The showtext.begin()
needs to be called before generating the graph.
When dev.off()
is called the file will be written.
The chart generated is shown in Figure 2.
Figure 2: Word Frequency Bar Plot for a Subset of the Data
Suppose that we need to choose between one of two options and each option was equally likely. Then each option would have a 1/2 or 50%!<(MISSING)strong>probability of being correct. For example, suppose we see the English word butter in a text and we need to decide whether the noun meaning the butter in the fridge is intended or the verb, as in please butter the toast, is intended. To find out the probability of the intended sense then you might look at the surrounding words or you might guess that the noun meaning a particular dairy product in the fridge is more likely, in general. One of the goals of the NTI Reader is to help the user decide which is the most appropriate word sense and part of speech. These are classic problems in natural language processing and ones that are still problematic today. The technique used is to tag all the words in a subset of the corpus and tally up the different word senses found. It has been found that simply using the most common part of speech of words with multiple possible parts of speech results in selection of the correct part of speech about 90%!o(MISSING)f the time, for English text. (Manning and Schütze, Statistical Natural Language Processing, 344) The success rate may be lower for Chinese due to the long historic period of written Chinese covered and the changing predominant word sense over time.
Example: Suppose that we wish to find the probability of choosing the right meaning of the character 法 (fǎ). 法 (fǎ) most often means the noun "law" or the noun "method" in modern Chinese. However, in Buddhist texts written in literary Chinese, 法 might mean the proper noun "Dharma" (the teachings of the Buddha), teachings in general, or the noun "dharma" (phenomenon). The word frequencies for each sense can be found from the unigram table using the following R commands.
> fa <- subset(unigram, element.text=="法") > fa pos.tagged.text element.text word.id frequency 10 法/NN[fǎ | a dhárma] 法 17994 74 1253 法/NR[Fǎ | Dhárma] 法 3509 1 1376 法/NN[fǎ | a mental object] 法 32204 1 1397 法/NN[fǎ | method] 法 3506 1 > PrA = fa$frequency[1] / sum(fa$frequency) > PrA [1] 0.961039
The subset()
function selects a subset of the unigram table with the value of element_text equal to "法".
The total number of occurrences of 法 in the tagged corpus is sum(fa$frequency) = 74 + 1 + 1 + 1 = 77
.
So the probability of any one occurrence of 法 being the noun (NN) meaning "a dhárma" (a teaching) is
74/77 = 0.961
.
If we had no other information about the context of a occurrence of 法, then we would guess that it would mean
a dhárma. ▢
The conditional probability of an event A given that we know an event B has already occurred is written Pr(A|B). It can be computed as
Pr(A|B) = Pr(A ∩ B) / Pr(B)
This assumes that Pr(B) > 0. (DeGroot and Morris, Probability and Statistics, 56)
Example: If we know the word before 法, it may give us a better chance at picking the correct word sense. This is an example of conditional probability We can use the bigram table to compute the conditional probability of the word sense of 法. Suppose the word before is 說 shuō "to say" in modern Chinese but more commonly "to teach" in literary Chinese. That would make the phrase be 說法 "to teach a dhárma." Let A = the proper meaning of 法 in this instance is 法/NN[fǎ | a dhárma]. Let B = the word before 法 is 說. The conditional probability can be computed using the R commands below.
> subset(subset(bigram, element.text=="法"), previous.text=="說") pos.tagged.text previous.text element.text word.id frequency 34 法/NN[fǎ | a dhárma] 說 法 17994 9
Pr(A|B) = 9 / 9 = 1.0
That is, whenever the word before 法 is 說 then the proper word sense is always 法/NN[fǎ | a dhárma]. The interpretation of this is that the previous word is a very good predictor of word sense, in this case. ▢
Two events A and B are independent if the occurrence of one does not affect the occurrence of the other. If this is true then
Pr(A ∩ B) = Pr(A) Pr(B)
(DeGroot and Morris, Probability and Statistics, 66)
A random variable is a real-valued function in a sample space S. (DeGroot and Morris, Probability and Statistics, 93)
The probability function of a discrete random variable X is the function
ƒ(x) = Pr(X = x)
(DeGroot and Morris, Probability and Statistics, 96) In linguistics discrete random variables, like word frequency, are more common but continuous random variables may also be used, for example the word frequency of a specific word per 1,000 words of text. The equivalent function for a continuous random variable is called the probability density function (pdf).
The cummulative distribution function (cdf) of a random variable X is
F(x) = Pr(X ≤ x) for -∞ < x < ∞
(DeGroot and Morris, Probability and Statistics, 108)
The quantile function F^{-1}(p) of a random variable X is the inverse of the cdf,
which is also the smallest value x with F(x) ≥ p. The variable p is the probability.
F^{-1}(p) is the p quantile of X or 100p percentile.
(DeGroot and Morris, Probability and Statistics, 112)
A quartile is found by sorting the data and then dividing it into four equal groups.
The interquartile range is the middle two quartiles or, in other words, the range
between the 25 and 75th percentiles.
Quartile and range information can be found using the R function summary()
.
The quantile can be found using the R function qt()
.
The interquartile range can be found using the R function IQR()
.
Example: Summary data for the word frequency data can be found as follows.
> summary(unigram$frequency) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 1.000 1.000 4.357 3.000 137.000 > length(unigram$frequency) [1] 1646
The length()
function gives the number of items in the data set, which shows that there
are only 3,200 unique words in the tagged corpus. ▢
The joint probability function of two random variables X and Y is
ƒ(x, y) = Pr(X = x and Y = y)
(DeGroot and Morris, Probability and Statistics, 119)
The marginal cdf of a joint probability function of two discrete random variables X and Y is summed over all possible values of y. In symbols,
ƒ_{1}(x) = ∑_{All y} ƒ(x, y)
(DeGroot and Morris, Probability and Statistics, 131)
The expectation or of a random variable is its mean. The expectation of a discrete random variable X with probability function f is defined as
E(X) = Σ_{All x} x ƒ(x)
(DeGroot and Morris, Probability and Statistics, 208)
The median is another measure of central tendency, which separates the lower half from
the upper half of a set of numbers. It can be more useful than the mean
when dealing with small integers or highly skewed data.
The mean of a vector of numbers can be found with the R function mean()
and the median
can be found with the function median()
.
Example: The values of mean and median for word frequency in the NTI Reader tagged corpus can be found with the R commands below.
> mean(unigram$frequency) [1] 4.708101 > median(unigram$frequency) [1] 1
This can give some idea of the adequacy of the size of the tagged corpus. A mean frequency of about 4.7 word occurrences and a median of 1 in the tagged corpus makes the tagged corpus seem kind of small. We need to do more analysis to understand what might be really sufficient. ▢
Some distributions do not have a mean. The Cauchy distribution below is an example of a function with no mean.
ƒ(x) = 1/[π(1 + x^{2})] for -∞ < x < ∞
(DeGroot and Morris, Probability and Statistics, 210)
The variance of a random variable X with mean μ is defined as
Var(X) = E[(X - μ)^{2}]
The variance of a discrete random variable X with mean μ can be computed with the sum
Var(X) = (1/n) ∑_{all i}(X_{i} - μ)^{2}
The standard deviation is the square root of the variance. (DeGroot and Morris, Probability and Statistics, 226)
Example: The variance and standard deviation of word frequency in the NTI Reader tagged corpus can be found with the R commands below.
> x >- unigram$frequency > v <- sum((x-mean(x))^2)/(length(x)) > v [1] 140.0195 > sqrt(v) [1] 11.83299
The variance is about 140.0 and the standard deviation about 11.8. ▢
The covariance of random variables X and Y with means μ_{x} and μ_{y} is defined as
Cov(X, Y) = E[(X - μ_{x})(Y - μ_{y})]
assuming that the expectation exists. (DeGroot and Morris, Probability and Statistics, 248) The sample covariance corrects for sampling from a larger population. It is given by the formula:
Cov_{x, y} = ∑(x - x̄)(y - ȳ)/(n - 1)
where x̄ and ȳ are the sample means and n - 1 is the degrees of freedom.
(Knell, Introductory R, 211)
The sample covariance can be computed using the R function cov()
.
The correlation of random variables X and Y with variances σ_{x}^{2} and σ_{y}^{2} is defined as
ρ(X, Y) = Cov(X, Y) / [σ_{x}σ_{y}]
(DeGroot and Morris, Probability and Statistics, 250) The correlation for a sample can be computed from the formula
r = Σ(x - x̄)(y - ȳ)/[(n - 1)(s_{x} s_{y})]
where s_{x} and s_{y} are the sample standard deviations.
(Knell, Introductory R, 212)
The sample correlation can be computed using the R function cor()
.
Example: Let's find the correlation between text size and the number of unique words in different texts in the corpus. We can load the file corpus_stats.txt, containing the summary statistics of the corpus texts, print out the first row of data, print a summary of the distribution of unique words, find the correlation, and draw a plot with these R commands.
> corpusstats <- read.table("../stats/corpus_stats.txt", sep="\t", quote="\"", numerals ="allow.loss", header=TRUE) > corpusstats[1,] source.name word.count character.count unique.words 1 Diamond Sūtra 金剛般若波羅蜜經 3407 5307 534 > summary(corpusstats$unique.words) Min. 1st Qu. Median Mean 3rd Qu. Max. 15.0 122.2 580.5 962.2 1661.0 5145.0 > cor(corpusstats$word.count, corpusstats$unique.words) [1] 0.7266394 > plot(corpusstats$word.count, corpusstats$unique.words, xlab="Word Count", ylab="Unique Words", pch=17, col="blue")
This particular file does have column headers in the first row that are automatically used by R as column names in the data frame. The image generated is shown in Figure 3.
Figure 3: Variation of Unique Words with Text Size
From the correlation value there r = 0.727 there is clearly a correlation between text length and number of unique words. However, from Figure 3 we can see that it is consistent but not linear. The number of unique words eventually begins to flatten out with really long texts. Since the set of words in a text may include proper nouns, such as person and place names, the number of unique words is practically unlimited. It is useful to consider the summary of unique words with the unigram data for the tagged corpus. The third quartile value is 1661, which is greater than the total number of unique words in the tagged corpus. So, there are at least 25%!o(MISSING)f the texts in the corpus that have more unique words than we have word frequency data for. Clearly, we are short of coverage on word frequency.
Entropy is a measure of the amount of information in a distribution. The entropy H(X) of a discrete random variable X is defined as
H(X) = -∑_{x∈X}p(x)log_{2}p(x)
where x is a value of the random variable X, which is one of a number of symbols in an alphabet X. Entropy is measured in bits. (Manning and Schütze. Statistical Natural Language Processing, 61; Cover and Thomas. Elements of Information Theory, 2012, ch. 2.1) For example, the possible values for X are the values in an 8 letter alphabet, each with equally likely probability of occuring, then the entropy is
H(X) = -∑_{i=1..8}(1/8)log_{2}(1/8) = log_{2}(8) = 3
So, a uniform distribution over the alphabet has 3 bits of information. In other words, the most efficient way of encoding the alphabet is with 3 bits.
Manning and Schütze explain word sense disambiguation with entropy. The approach described selects the word sense that maximizes the mutual entropy with the surrounding words. (Manning and Schütze. Statistical Natural Language Processing, 239-241)
A stochastic process is a sequence of random variables X_{1}, X_{2}, ... at discrete points in time. A Markov chain is a stochastic process where the conditional distributions of all X_{n+j} depend only on X_{n} and not only earlier states. (DeGroot and Morris, Probability and Statistics, 188)
The transition distributions of a Markov chain are the conditional probabilities
p_{ij} = Pr(X_{n+1}=j|X_{n}=i)
where the random variables X_{n} can have k possible states. A transition matrix is a matrix P = [p_{ij}] made up of the conditional probabilities of the transition distributions. (DeGroot and Morris, Probability and Statistics, 190-192)
A hidden Markov model is a Markov model where the state of the model is not known directly. Only the expression of the state may be observed. There is a level of probability between the state of the model and the expression of the state. (Manning and Schütze, Statistical Natural Language Processing, 320-321)
Manning and Schütze describe using a Markov model for a part-of-speech (POS) bigram tagger. (Manning and Schütze, Statistical Natural Language Processing, 345-349) The state of the model is given by the POS tag t_{i} for a word w_{i} at position i in the text. The tags are the Brown Corpus tags that only give the POS, not the specific meaning of a word, like the NTI Chinese corpus. For example, move/VB, the tag VB indictes a verb. We estimate the transition matrix with a training set of tagged data and use it to predict tags in a test set. The model is a hidden Markov model because in the test data we do not know the values of the tags, just their expression in particular words. The following assumptions are made
The probability Pr(t^{k}|t^{j}) of the kth tag t^{k} in the tagset following the tag t^{j} is
Pr(t^{k}|t^{j}) = C(t^{k}, t^{j}) / C(t^{j})
where C(t^{k}, t^{j}) is the count of tag t^{k} following t^{j} and C(t^{j}) is the count of tag t^{j}. The probability of word w^{l}, the lth word in the lexicon, being associated with tag t^{j} is
Pr(w^{l}|t^{j}) = C(w^{l} : t^{j}) / C(t^{j})
where C(w^{l} : t^{j}) is the count of word w^{l} tagged as t^{j}. The optimal series of tags for a sentence is found by optimizing the Maximum Likelihood Estimator Pr(w^{l}|t^{j}) over all words the training set. Since the computation can be computationally intensive, the Vibeti algorithm, which optimizes using memoization is suggested. (Manning and Schütze, Statistical Natural Language Processing, 326-336) Code for a POS Tagger based on a hidden Markov model can be found at LiteratePrograms (See POS_tagger_(Java)).
The current version of the NTI Reader sentence-based dictionary lookup and HTML gloss generator uses word-sense disambiguation based on bigram frequency with fallback to unigram frequency. That is, rather than using a true statistical estimation technique, like the Markov model approach described here, the NTI Reader uses the most common bigram frequency to decide on word sense for a word in a sequence of words. When the reader has never seen a bigram sequence before, it falls back on the unigram frequency to select the word sense. Manning and Schütze describe combining bigram, unigram, and trigram Markov models using linear interpolation and toher techniques. (Manning and Schütze, Natural Language Processing, 353-354) In fact, handling of previously unseen words is a major problem using Markov models for text processing, even with large corpuses. POS tags for unknown words in English text may be guessed based on capitalization and word ending but that is not possible in Chinese text. (Manning and Schütze, Natural Language Processing, 351-352)
Manning and Schütze report accuracies of POS tagging in English of up to 95%!t(MISSING)o 97%! (MISSING) (Manning and Schütze, Natural Language Processing, 371) One would expect Chinese to be a good language for bigram hidden Markov model taggers because word order is fundamental to grammatical structure in Chinese. This may not be the case for Sanskrit, where word morphology is more important than word order.
The Bernoulli distribution for random variable X, which can only take the values 0 and 1, with parameter p (0 ≤ p ≤ 1) has the probabilities
Pr(X = 1) = p and Pr(X = 0) = 1 - p
An sequence random variables with the Bernoulli distribution are called Bernoulli trials. (DeGroot and Morris, Probability and Statistics, 276)
The binomial distribution with integer parameter n and continuous parameter p (0 ≤ p ≤ 1) is defined as
ƒ(x|n,p) = (n ¦ x) p^{x}(1 - p)^{n-x} for x = 0, 1, 2, ... and 0 otherwise
where (n ¦ x) is the binomial coefficient n!/[x!(n - x)]. The mean and variance are
E(X) = np
Var(X) = np(1 - p)
(DeGroot and Morris, Probability and Statistics, 277)
The R function to compute the binomial probability is dbinom()
.
The commands to plot a binomial distribution with parameters n = 20 and p = 0.5 are shown below.
> x <- seq(1, 20) > plot(x, dbinom(x, 20, 0.5), xlab="x", ylab="Frequency", pch=17, col="blue")
The plot()
function takes a sequence of integers x
from 1 to 20
and plots the binomial probability for each value.
The graph generated is shown in Figure 4.
Figure 4: Binomial Distribution with Parameters n = 20 and p = 0.5
The Poisson distribution for random variable X with mean λ is defined as
ƒ(x|λ) = e^{-λ} λ^{x}/x! for x = 0, 1, 2, ... and 0 otherwise.
The variance of the Poisson distribution is also λ.
(DeGroot and Morris, Probability and Statistics, 288-290)
The dpois()
R function can be used to compute values of the Poisson distribution.
This is demonstrated with the R commands below, which generate a graph of the
Poisson distribution with λ = 3.0.
> x <- seq(1, 10) > plot(x, dpois(x, 3.0), xlab="x", ylab="Frequency", pch=17, col="blue")
The graph generated is shown in Figure 5.
Figure 5: Poisson Distribution with λ = 3.0
The normal distribution for the continuous random variable X with mean μ and standard deviation σ is defined as
ƒ(x|μ, σ) = [1/σ√(2π)] exp[-0.5((x - μ)/σ)^{2}] for -∞ < x < ∞
The standard normal distribution has mean μ = 0 and standard deviation σ = 1. It is given by the equation
ƒ(z|μ, σ) = [1/√(2π)] exp[-0.5((z)/σ)^{2}] for -∞ < z < ∞
(DeGroot and Morris, Probability and Statistics, 307)
The standard normal distribution can be plotted using the curve()
function in R, as shown below.
> curve(dnorm(x), -3, 3, xlab="z", ylab="Probability Density", col="blue")
The graph generated is shown in Figure 6.
Figure 6: Standard Normal Distribution
A random variable X with mean μ and standard deviation σ can be transformed with to the standard normal distribution with the equation
z = (x - μ)/σ
The cumulative distribution F(x) of a normal distribution can be computed in R with the function
pnorm(q, mean = 0, sd = 1
.
The quantile function F^{-1}(x) can be computed with and qnorm(p, mean = 0, sd = 1
.
Example: Following Example 5.6.4 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 308) the probability of a random variable from a normal distribution with mean 5 and standard deviation 2 being greater than 1 and less than 8 is
Pr(1 < X < 8) = Pr(X < 8) - Pr(X < 1)
This can be computed with R command
> pnorm(8, mean = 5, sd = 2) - pnorm(1, mean = 5, sd = 2) [1] 0.9104427
So the probability that X is greater than 1 and less than 8 is 0.91. ▢
The normal distribution is a good approximation for variables in many random processes. (DeGroot and Morris, Probability and Statistics, 303) The Central Limit Theorem states that the distribution of a sum of random variables Σ_{i=1}^{n} X_{i} with any distribution will be approximately the normal distribution with mean nμ and variance nσ^{2}, as n becomes large. (DeGroot and Morris, Probability and Statistics, 361)
The gamma distribution for the continuous random variable X with parameters α and β is
ƒ(x|α, β) = [β^{α} / Γ(α)] x^{α-1} e^{-βx} for x > 0 or 0 otherwise
where Γ(α) is the gamma function. The mean of the gamma distribution is α/β and the variance is α/β^{2}. (DeGroot and Morris, Probability and Statistics, 319-320)
The exponential distribution for the continuous random variable X with parameter β is
ƒ(x|β) = βe^{-βx} for x > 0 or 0 otherwise
The exponential distribution is a special case of the gamma distribution with α = 1. The mean of the exponential distribution is 1/β and the variance is 1/β^{2}. (DeGroot and Morris, Probability and Statistics, 321) The exponential distribution can be drawn with the R command below.
> curve(0.8*exp(-0.8*x), 0, 5, xlab="x", ylab="Probability Density", col="blue")
The generated graph is shown in Figure 7.
Figure 7: Exponential Distribution (β = 0.8)
The beta distribution for the continuous random variable X with parameters α and β is
ƒ(x|α, β) = [Γ^{β+α} / (Γ(α) Γ(β))] x^{α-1} (1 - x)^{β-1} for 0 < x < 1 or 0 otherwise
The mean of the beta distribution is
E(X) = α/(α + β)
The variance of the beta distribution is
Var(X) = αβ/[(α + β)^{2}(α + β + 1)]
(DeGroot and Morris, Probability and Statistics, 328-329)
The multinomial distribution for the discrete random vector X = (X_{1}, X_{2}, ... X_{k}) having probabilities p = (p_{1}, p_{2}, ... p_{k}) with n items selected is defined as
f(X|n, p) = [n!/(x_{1} x_{2} ... x_{k})] p_{1}^{x1} p_{2}^{x2} ... p_{k}^{xk}
if x_{1} + x_{2} + ... x_{k} = n or 0 otherwise. (DeGroot and Morris, Probability and Statistics, 334) The multinomial distribution is appropriate for distributions into sets that are not necessarily numbers, for example, the frequencies of words of different part of speech values.
A statistical model is a collection of random variables, identification of probability distributions for the variables, and the set of parameters that the distributions require values for. Statistical inference is a probabilistic statement about a statistical model. (DeGroot and Morris, Probability and Statistics, 377-378) For example, the approximate date of a document may be inferred from the vocabulary in it. (Krippendorff, Content Analysis, 42) In a Buddhist text mention of copying sutras or description of devotional practices may help provide information for the date for the text. The data may be recorded as 0 (not present) and 1 (present) or as a word frequency.
A statistic is a function of a set of random variables. For example, mean, median, and variance are statistics. (DeGroot and Morris, Probability and Statistics, 382)
Parameters in probability distributions are usually unknown and needed to be estimated with statistical methods. So the parameters themselves can be considered simply as unknown constants or to have probability distributions themselves. The prior distribution of a parameter θ is the probability distribution ζ(θ) assumed before experimental observations are applied to estimate its value. The posterior distribution ζ(θ|x_{1}, ... x_{n}) is the conditional distribution after the random variables X_{1}, ... X_{n} have been observed. (DeGroot and Morris, Probability and Statistics, 385-387) The likelihood function f_{n}(x|θ) is the joint probability function of the random variables x = (x_{1}, ... x_{n}) and the parameter θ. The likelihood function can be used to relate teh prior and posterior distributions,
ζ(θ|x) ∝ ƒ_{n}(x|θ) ζ(θ)
The constant of proportionality can be found from equating the total probability to 1. (DeGroot and Morris, Probability and Statistics, 390)
A conjugate family of prior distributions is a family of possible distributions for ζ(θ) where the posterior distribution also belongs to the same family. The family of gamma distributions is a conjugate family of prior distributions for Poisson distributions of the random variables X_{1}, ... X_{n} when the parameter θ is unknown. The family of normal distributions is itself a conjugate family of prior distributions for normal distributions of X_{1}, ... X_{n} when the mean is unknown but the variance is known. The family of gamma distributions is also a conjugate family of prior distributions for exponential distributions of X_{1}, ... X_{n} when the value of the parameter θ is unknown. (DeGroot and Morris, Probability and Statistics, 395-402)
An estimator δ(X_{1}, ... X_{n}) gives an estimate of the parameter θ using observed values of the data x = (x_{1}, ... x_{n}). A loss function L(θ, a) quantifies the effect of the difference between the estimate a of θ and the true value. A Bayes estimator δ*(x) minimizes the expected value of the loss function. (DeGroot and Morris, Probability and Statistics, 408-409)
The squared error loss function is defined as
L(θ, a) = (θ - a)^{2}
When the squared error loss function is used the Bayes estimator is the posterior mean value of θ. (DeGroot and Morris, Probability and Statistics, 411)
Manning and Schütze describe word sense disambiguation with Bayesian classification using maximum likelihood estimators. The authors give the examples of ambiguous words in English, including duty (a task or a tax) and land (to stand on land or to land a plane). The decision is based on conditional probability, given the context of the word. Different contexts have been experimented with, including the object of a verb and the surrounding words. The decision rule for selecting a word sense is optimal in the sense that it minimizes the chance for error in a training data set. (Manning and Schütze, Statistical Natural Language Processing, 235-239)
Another approach to estimating parameters maximizes the probability of observed data. A likelihood function f_{n}(x|θ) is a joint pdf for continuous random variables or joint pf for discrete random variables with the parameter θ considered part of the joint distribution. A maximum likelihood estimator maximizes the value of f_{n}(x|θ). (DeGroot and Morris, Probability and Statistics, 418) This approach avoids the need to assume a probability distribution for θ. However, the drawback is that is may not always give a good estimate for θ and sometimes it may not exist at all.
A sampling distribution is the distribution of a statistic T of a set of random variables X = (X_{1}, ... X_{n}) that are a sample of the random variable X with distribution having a parameter θ. A sampling distribution can be used to determine how good an estimate θ̂ of the parameter θ is. (DeGroot and Morris, Probability and Statistics, 465)
The chi-square χ^{2} distribution with m degrees of freedom is the gamma distribution with α = m/2 and β = ½. This can be written as
ƒ(x) = 1/(2^{m/2} Γ(m/2)) x^{(m/2)-1} e^{-x/2} for x > 0.
The mean of the χ^{2} distribution is E(X) = m.
The variance is Var(X) = 2m.
(DeGroot and Morris, Probability and Statistics, 469-470)
The χ^{2} distribution can computed with the R dchisq()
function.
The χ^{2} distribution with 5 degrees of freedom can be drawn in R
using the command below.
> curve(dchisq(x, df=5), 0, 20, xlab="x", ylab="Chi Square", col="blue")
The result is shown in Figure 8.
Figure 8: χ^{2} Distribution with 5 Degrees of Freedom
The χ^{2} distribution with two degrees of freedom is the same as the exponential distribution with parameter ½.
The cumulative χ^{2} distribution can be computed with the pchisq()
R function.
Example: Following the Example 8.2.3 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 470) The probability that X is less than 10.0 for a χ^{2} distribution with 10 degrees of freedom can be found using the R command
> pchisq(c(10), df=10) [1] 0.5595067
Giving the probability 0.55 or 56%! (MISSING)▢
If X_{1}, ... X_{n} are a random sample from a normal distribution with mean μ and variance σ^{2} then the sample mean and variance have maximum likelihood estimators
μ̂ = X̄_{n}
σ̂ = [(1/n) Σ_{i=1}^{n}(X_{i} - X̄_{n})^{2}]^{½}
These estimators are independent variables. The estimator for the sample mean X̄_{n} has a normal distribution with mean μ and variance σ/n. The quantity nσ̂^{2}/σ^{2} has a χ^{2} distribution with n - 1 degrees of freedom.
The t distribution is useful when we need to use an estimate of the variance because we do not know the true variance. The t distribution is
Γ((m+1)/2)/[(mπ)^{½}Γ(m/2)] (1 + x^{2}/m)^{-(m+1)/2}
If X_{1}, ... X_{n} is a sample for random variable X with with mean μ and variance σ^{2}. Define the statistic
σ′ = [Σ_{i=1}^{n}(X_{i} - X̄_{n})^{2}]^{½}
Also define the random variable
U = n^{½}(X̄ - μ)/σ
U has a t distribution with n - 1 degrees of freedom.
(DeGroot and Morris, Probability and Statistics, 480-482)
The t distribution looks like the normal distribution but the tails do not converge to zero as quickly,
especially for small values of n.
Values of the pdf for t distribution can be computed with the R dt
function.
The command below draws a graph of the t distribution with 5 degrees of freedom.
> curve(dt(x, df=5), -3, 3, xlab="z", ylab="Probability Density", col="blue")
The chart produced is shown in Figure 9.
Figure 9: t Distribution with 5 Degrees of Freedom
A confidence interval (A, B) for coefficient γ is an interval with
Pr(A < g(θ) < B) ≥ γ
for a random sample X = (X_{1}, ... X_{n}) from a distribution with parameter θ.
The confidence interval for the sample mean from a normal distribution with mean μ and variance σ^{2} is
A = X̄_{n} - T_{n-1}^{-1}((1 + γ)/2) σ′/n^{½}
B = X̄_{n} + T_{n-1}^{-1}((1 + γ)/2) σ′/n^{½}
where T_{n}(c) is the cdf of the t distribution with n degrees of freedom
and T_{n}^{-1} is the quantile function.
(DeGroot and Morris, Probability and Statistics, 486)
The quantile function T_{n}^{-1}(x) for a value x with n degrees fo freedom can be computed
with the R function qt(x, df = n)
.
Example: Computing Example 8.5.3 from DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 487) with a sample of 26 rain measurements with sample average 5.134 and estimated variance 1.60. The 95%!c(MISSING)onfidence interval can be computed using the R commands
> A <- 5.134 - qt(1.95/2, df=25) * 1.6/sqrt(25) > A [1] 4.474948 > B <- 5.134 + qt(1.95/2, df=25) * 1.6/sqrt(25) > B [1] 5.793052
The 95%!c(MISSING)onfidence interval is (4.47, 5.79). ▢
A one-sided confidence interval (A, ∞) has a statistic A, such that
Pr(A < g(θ)) ≥ γ
This is a 100γ percent one-sided confidence interval for g(θ). Similarly, the interval (-∞, B) is a 100γ percent one-sided confidence interval for g(θ) where
Pr(g(θ) < B) ≥ γ
For a normal distribution with mean μ and variance σ^{2} the confidence limits A and B can be computed as
A = X̄_{n} - T_{n-1}^{-1}(γ) σ′/n^{½}
B = X̄_{n} + T_{n-1}^{-1}(γ) σ′/n^{½}
(DeGroot and Morris, Probability and Statistics, 488-489)
An unbiased estimator δ(X) of a function g(θ) of a parameter θ has the same expectation as g(θ) or all values of θ. That is, E_{θ}[δ(X)] = g(θ) for all θ. (DeGroot and Morris, Probability and Statistics, 507) For example, the sample mean X̄_{n} is an unbiased estimate of the true mean μ because the mean of X̄_{n} is μ for all values of μ. The statistic
σ̂_{1}^{2} = (1/(n - 1)) ∑_{i=1}^{n}(X_{i} - X̄_{n})^{2}
is an unbiased estimator of the variance. (DeGroot and Morris, Probability and Statistics, 508) σ̂_{1}^{2} is sometimes called the sample variance.
Example: The sample variance and sample standard deviation of word frequency in the NTI Reader tagged corpus can be found with the R commands below.
> v <- var(unigram$frequency) > v [1] 140.1174 > sqrt(v) [1] 11.83712
Compare this to the population variance 140.20 and standard deviation 11.83 computed above. ▢
The null hypothesis H_{0} is the hypothesis that θ ∈ Ω_{0}, where θ is the parameter of a probability distribution and Ω_{0} is a subset of the parameter space Ω. The alternative hypothesis H_{1} is the hypothesis that θ ∈ Ω_{1}, where Ω_{1} is the complement of Ω_{0}. (DeGroot and Morris, Probability and Statistics, 531-532)
A critical region S_{1} is a subset of the sample space S of a random vector X = (X_{1}, ... X_{n}) for which the null hypothesis H_{0} is rejected. A test statistic T = r(X) defines a procedure where the null hypothesis H_{0} is rejected if T ∈ R, where R is a subset of the real numbers. R is the rejection region of the test. The critical region has the form S_{1} = {x: r(x) ∈ R}. (DeGroot and Morris, Probability and Statistics, 532-533)
A power function π(θ, δ) is the probability that a test procedure δ will reject the null hypothesis for all values of the parameter θ. In symbols
π(θ, δ) = Pr(X ∈ S_{1})
where S_{1} is the critical region. (DeGroot and Morris, Probability and Statistics, 534)
A type I error is an erroneous choice to reject the null hypothesis. A type II error is an erroneous choice not to reject a false null hypothesis. (DeGroot and Morris, Probability and Statistics, 535)
The size α(δ) is defined as
α(δ) = sup _{θ ∈ Ω0} π(θ, δ)
where δ is a test and sup is the supremum. (DeGroot and Morris, Probability and Statistics, 535)
The p-value is the smallest level α_{0} that would result in rejection of the null hypothesis.
The likelihood ratio statistic Λ(x) is the largest value of the likelihood function in Ω_{0} compared to the entire parameter space Ω. In symbols
Λ(x) = [sup _{θ ∈ Ω0} ƒ(x | θ)] / [sup _{θ ∈ Ω} ƒ(x | θ)]
(DeGroot and Morris, Probability and Statistics, 544)
A simple hypothesis is a choice from two alternative parameter values in the parameter space Ω = {θ_{0}, θ_{1}}. It has the form
H_{0}: θ = θ_{0}
H_{1}: θ = θ_{1}
The probability α of a type I error for a test procedure δ is α(δ) = Pr(Reject H_{0} | θ = θ_{0}). The probability β of a type II error for a test procedure δ is β(δ) = Pr(Do not reject H_{0} | θ = θ_{1}). (DeGroot and Morris, Probability and Statistics, 551)
A uniformly most powerful test is for a test procedure δ^{*} for the hypothesis
H_{0}: θ ∈ Ω_{0}
H_{1}: θ ∈ Ω_{1}
at the level of significance α_{0} if α(δ^{*}) ≤ α_{0} where
π(θ, δ) ≤ π(θ, δ^{*}) for all θ ∈ Ω_{1}
(DeGroot and Morris, Probability and Statistics, 560)
A one sided alternative hypothesis has the form
H_{0}: θ ≤ θ_{0}
H_{1}: θ > θ_{1}
(DeGroot and Morris, Probability and Statistics, 562)
A two sided alternative hypothesis has the form
H_{0}: θ = θ_{0}
H_{1}: θ ≠ θ_{1}
(DeGroot and Morris, Probability and Statistics, 565)
An unbiased test is a test δ where
π(θ, δ) ≤ π(θ′, δ)
for every θ ∈ Ω_{0} and θ′ ∈ Ω_{1}. (DeGroot and Morris, Probability and Statistics, 573)
A t test for the one sided hypothesis about the mean μ
H_{0}: μ ≤ μ_{0}
H_{1}: μ > μ_{0}
where the variance σ^{2} is unknown, rejects H_{0} if
U = n^{½} (X̄ - μ_{0})/σ′ ≥ c
and c is found from the t quantile function for level of significance α_{0}. The opposite hypothesis H_{0}: μ ≥ μ_{0} is rejected if U ≤ c. (DeGroot and Morris, Probability and Statistics, 576)
Example: We will work through Example 9.5.3 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 578) with R commands. The problem tests the hypothesis H_{0}: μ ≥ 200 with a level of significance α_{0} = 0.1 for a random sample with n = 18, X̄ = 182.17, and σ′ = 72.22. The 0.1 quantile of the t distribution and the test statistic U can be found with the R commands
> c = qt(0.1, df=17) > c [1] -1.333379 > U = sqrt(18-1) * (182.17 - 200)/72.2 > U [1] -1.018213
Since U ≤ c is false, the null hypothesis μ ≥ 200 is not rejected. ▢
Manning and Schütze discuss using hypothesis with the t test for the occurrence of bigrams. The chance of of the two individual words coming w^{1} and w^{2} together at any p articular point in the text is Pr(w^{1}w^{2}) = Pr(w^{1})Pr(w^{2}). The sample variance is needed in the t test, which is estimated as the variance σ^{2} = p(1 - p) of a Bernoulli trial. In this expression p is the word frequency or, in other words, the probabily of occurrence of the words together. (Manning and Schütze, Statistical Natural Language Processing, 163-166)
A t test for the two sided hypothesis about the mean μ
H_{0}: μ = μ_{0}
H_{1}: μ ≠ μ_{0}
where the variance σ^{2} is unknown, rejects H_{0} if
|U| ≥ T_{n-1}^{-1}(1 - α_{0}/2)
where T_{n-1>}^{-1} is the quantile function of the t distribution. (DeGroot and Morris, Probability and Statistics, 582)
Successful and unsuccessful prediction results are summarized in Table 1.
Actual | |||
---|---|---|---|
Positive | Negative | ||
Predicted | Positive | true positive tp | false positive fp |
Negative | true negative fn | false negative tn |
Manning and Schütze define precision P as the number of correctly predicted positive results:
P = tp/(tp + fp)
and recall R as the proportion of target items predicted
R = tp/(tp + fn)
Precision and recall avoid the potential problem with accuracy that the number of negative events will be far greater than the number of positive events. That is, a system that successfully rejects many negative items may have high accuracy but not do very well on the positive events. (Manning and Schütze, Statistical Natural Language Processing, 268-269)
To test a hypothesis comparing the means μ_{1} and μ_{0} of two populations X = X_{1}, ... X_{m} and Y = Y_{1}, ... Y_{n} with the same variance σ, the null and alternate hypotheses
H_{0}: μ_{1} ≤ μ_{2}
H_{1}: μ_{1} > μ_{2}
are used. The two-sample statistic U is defined as
U = (m + n - 2)^{½}(X̄_{m} - Ȳ_{n}) / [(1/m + 1/n)^{½}(S_{X}^{2} + S_{Y}^{2})^{½}]
U has a t distribution with m + n - 2 degrees of freedom. In this statistic
X̄_{m} = (1/m) ∑_{i=1}^{m}X_{i} and Ȳ_{n} = (1/n) ∑_{i=1}^{n}Y_{i}
and
S_{X}^{2} = ∑_{i=1}^{m}(X_{i} - X̄_{m})^{2} and S_{Y}^{2} = ∑_{i=1}^{n}(Y_{i} - Ȳ_{n})^{2}
The null hypothesis H_{0} is rejected at the level of significance α_{0} if U ≥ T_{m+n-2}^{-1}(1-α_{0}). (DeGroot and Morris, Probability and Statistics, 588-589)
Example: We will work through Example 9.6.2 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 589-590) with R commands. The problem tests the hypothesis that the mean of rain from seeded clouds μ_{1} is greater than the mean of rain from unseeded clouds μ_{2}. That is, we attempt to reject H_{0}: μ_{1} ≤ μ_{2} at level of significance α_{0} = 0.01. The mean of the 26 measurements with seeded clouds is X̄_{m} = 5.13. The mean of the 26 measurements with unseeded clouds is X̄_{n} = 3.99. Also, S_{X}^{2} = 63.96 and S_{Y}^{2} = 67.39. The R commands to compute U and critical value of T_{m+n-2}^{-1} are
> U = sqrt(26+26-2)*(5.13-3.99)/(sqrt(1/26+1/26)*sqrt(63.96+67.39)) > U [1] 2.535984 > c = qt(0.99, df=26+26-2) > c [1] 2.403272
Since U > c, the null hypothesis is rejected.▢
An F distribution for the variable X combining two random variables Y and W with χ^{2} distributions with m and n degrees of freedom in the form
X = (Y/m)/(W/n)
X has an F distribution, which is given by the formula
ƒ(x) = (Γ[(m + n)/2] m^{m/2}n^{n/2}) / [Γ(m/2)Γ(n/2)] · x^{(m/2)-1}/(mx + n)^{(m+n)/2}
for x > 0.
An F test compares the unknown variances of two random variables X_{1}, ... X_{n} and Y_{1}, ... Y_{n}. The null and alternate hypotheses are
H_{0}: σ_{1}^{2} ≤ σ_{2}^{2}
H_{1}: σ_{1}^{2} > σ_{2}^{2}
In the test the statistic V is defined as
V = [S_{X}^{2}/(m - 1)] / [S_{Y}^{2}/(n - 1)]
If V ≥ c, where c is determined from the level of significance α_{0}, then H_{0} is rejected.
(DeGroot and Morris, Probability and Statistics, 598-599)
The quantile function for the F distribution can be computed with the R command
qf(p, df1, df2
.
Example: We will work through Example 9.7.3 in DeGroot and Morris (DeGroot and Morris, Probability and Statistics, 600-601) with R commands. X_{1}, ... X_{6} have unknown mean and variance. S_{X}^{2} = 30 is computed from observations. Y_{1}, ... Y_{21} also have unknown mean and variance and have S_{Y}^{2} = 40. The following R commands compute the statistic V and the quantile value for α_{0} = 0.025 and α_{0} = 0.05.
> V = (30/(6-1))/(40/(21-1)) > V [1] 3 > qf(0.95, df1=5, df2=20) [1] 2.71089 > qf(0.975, df1=5, df2=20) [1] 3.289056
Since V > 2.71 the null hypothesis is rejected at the α_{0} = 0.05 significance level but since V < 3.29 it is not rejected at the α_{0} = 0.025 level. ▢
Categorical data is data where observations classifies the data into different categories. (DeGroot and Morris, Probability and Statistics, 625)
The χ^{2} goodness of fit test measures how well the proportions of categories in a random sample matches against proportions from the whole population. In this test there are k different categories and p_{i} is the proportion of each in the population. The proportions in the random sample belonging to each category are p_{1}^{0}, ... p_{k}^{0}. The hypothesis tested is
H_{0}: p_{i} = p_{i}^{0} for i = 1, ... k
H_{1}: p_{i} ≠ p_{i}^{0} for at least one i
That is, the null hypothesis is that the distribution of data amongst the different categories is explained by the probabilities given. The alternative hypothesis is that the data in at least one of the categories cannot be explained by the given probabilities. The χ^{2} statistic is defined as
Q = ∑_{i=1}^{k}(N_{i} - np_{i}^{0})^{2} / np_{i}^{0}
where N_{i} are the numbers of each item in the sample and n is the total number in the sample.
If H_{0} is true then Q converges to a χ^{2} distribution as n → ∞.
(DeGroot and Morris, Probability and Statistics, 626)
Smaller values for Q are less likely to result in rejection of the null hypothesis.
The chisq.test
R function can be used to perform this test.
Example: This example uses the χ^{2} goodness of fit test to compare the distribution counts for words grouped by part of speech in the tagged corpus and a document that is included in the tagged corpus. The most frequent few pronouns in the tagged corpus can be extracted with the following R command.
> head(unigram[grep("/PN\\[", unigram$pos.tagged.text),]) pos.tagged.text element.text word.id frequency 2 是/PN[shì | this] 是 17908 136 12 所/PN[suǒ | it] 所 17100 66 26 我/PN[wǒ | I] 我 321 45 28 其/PN[qí | that] 其 1574 43 33 云何/PN[yúnhé | how] 云何 29319 39 62 汝/PN[rǔ | you] 汝 6690 22
The regular expression "/PN\\["
helps the grep()
identify patterns like "/PN[""
encapsulating part-of-speech tags.
The R script pos_counts.R
aggregates the frequencies of each separate part of speech in the tagged corpus and for the Heart Sutra
and compares the counts with a χ^{2} test. It can be run with the following command.
$ Rscript r/pos_counts.R Chi-squared test for given probabilities data: heart.count X-squared = 71.0523, df = 7, p-value = 9.052e-13
With a very small p-value, the output shows that the distribution of part-of-speech counts in the Heart Sutra is not well explained by the general distribution of part-of-speech counts in the tagged corpus. Small count values have been avoided by combining categories. The data are shown in Table 2.
Part of Speech | Observed Count | Predicted Count |
---|---|---|
Adjectives | 15 | 10.3 |
Existential verbs | 21 | 5.5 |
Regular verbs | 32 | 36.9 |
Proper nouns | 7 | 19.8 |
Regular nouns | 63 | 51.2 |
Pronouns and numbers | 10 | 23.8 |
Adverbs | 18 | 11.6 |
Other function words | 15 | 21.8 |
There is a relatively higher proportion of existential verbs in the Heart Sutra, which is expected. ▢
The χ^{2} test for a composite hypothesis test can be used to do a goodness-of-fit test against a distribution parameterized by a set of parameters θ = (θ_{1}, ... θ_{s}). The probabilities for the categories are given by the functions π_{1}(θ), ... π_{k}(θ). The hypothesis is
H_{0}: There is a θ ∈ Ω such that p_{i} = θ_{i} H_{1}: H_{0} is false
The Q statistic is defined as
Q = ∑_{i=1}^{k}[N_{i} - nπ_{i}(θ̂)]^{2} / nπ_{i}(θ̂)
If H_{0} is true then the cdf of Q will converge on the cdf of χ^{2} with k - s - 1 degrees of freedom, as n → ∞. (DeGroot and Morris, Probability and Statistics, 634-635)
A contingency table is a table that classifies each observation in two or more ways. A two-way contingency table classifies data in two ways. A contingency table can be used to test data independence using a χ^{2} test. If N_{ij} is the count in each table cell then the row and column counts are
N_{i+} = ∑_{i=1}^{C}N_{ij}
N_{j+} = ∑_{i=1}^{R}N_{ij}
where C is the number of columns, R is the number of rows, and n is the total number of cells, excluding the row and cell counts. An estimator for the expected number in each cell, based on the marginal probabilities from the row and column counts is
Ê_{ij} = N_{i+}N_{j+}/n
The statistic Q is defined as
Q = ∑_{i=1}^{C}∑_{i=1}^{R}(N_{ij} - Ê_{ij})^{2} / Ê_{ij}
has a χ^{2} distribution with (R - 1)(C - 1) degrees of freedom. (DeGroot and Morris, Probability and Statistics, 642-643)
Manning and Schütze describe the use of a contingency table with a χ^{2} test to test for the significance of collocations. (Manning and Schütze, Statistical Natural Language Processing, 169-172)
Example: This example tests the independence of the distribution of word counts for different parts of speech for three texts, The Heart Sutra, the Amitabha Sutra, and the Diamond Sutra. The R script pos_independence.R computes the word counts and invokes the χ^{2} test. It can be invoked as shown below.
$ $ Rscript r/pos_independence.R Heart Amitabha Diamond Adjectives 15 86 140 Existential Verbs 21 31 130 Verbs 32 245 709 Proper Nouns 7 194 404 Nouns 63 361 942 Pronouns 10 226 503 Adverbs 18 69 248 Function Words 15 164 510 Pearson's Chi-squared test data: data X-squared = 111.0789, df = 14, p-value < 2.2e-16
The low p-value in results show that the null hypothesis should be rejected. The distribution of is not independent at the conventional 0.05 level of significance and even at more strict levels of significance. ▢
The χ^{2} test for data homogeneity is the same as the test for data independence. (DeGroot and Morris, Probability and Statistics, 649)
Nonparametric methods are methods that do not depend on data belonging to specific parametric families of distributions.
Kolmogorov-Smirnov tests can be used to test observed values of a continuous data against a parameterized distribution or to test the similarity of two sets of observed values.
A sample cdf or sample distribution function is a function of a set of observed values F_{n}(x) = k/n, where there are k observations less than or equal to x. The total number of observations is n.
The Kolmogorov-Smirnov test of observed data X_{1}, ... X_{n} against a particular continuous distribution F^{*}(x) tests the hypotheses
H_{0}: F^{*}(x) = F^{*}(x)
H_{1}: H_{0} is false
If H_{0} is true then the statistic D_{n}^{*}(x) defined by
D_{n}^{*}(x) = sup_{∞<x<∞}|F_{n}(x) - F^{*}(x)|
that is, the maximum difference for the observed value x_{n}, converges as
lim_{n→∞} Pr(n^{½}D_{n}^{*} ≤ t) = H(t) = 1 - 2∑(-1)^{i-1}e^{-2i2t2}
(DeGroot and Morris, Probability and Statistics, 658-660)
The values of H(t) can be computed and the test performed with the R function
ks.test(x, y, ...)
.
The two-sample Kolmogorov-Smirnovtest compares two sets of samples X_{1}, ... X_{m} and Y_{1}, ... Y_{n} with cdf's F(x) and G(x). The hypotheses are
H_{0}: F(x) = G(x)
H_{1}: H_{0} is false
The test statistic D_{mn} is defined as
D_{mn} = sup_{∞<x<∞}|F_{m}(x) - G^{n}(x)|
that is, the maximum difference for the observed values x_{n} and y_{n}, converges as
lim_{m→∞,n→∞} Pr[(mn/m + n)^{½}D_{mn} ≤ t] = H(t)
(DeGroot and Morris, Probability and Statistics, 664)
The values of H(t) can also be computed and the test performed with the R function
ks.test(x, y, ...)
.
A contaminated normal distribution is a normal distribution with an additional component that is not normal. It has the form
ƒ(x) = (1 - ϵ)(2πσ^{2})^{½}exp(-1/(2σ^{2})[x - μ]^{2}) + ϵg(x)
The function g(x) is the contaminating distribution. (DeGroot and Morris, Probability and Statistics, 668) A contaminated normal distribution may have thicker tails than a normal distribution due to outliers and so parametric methods assuming a normal distribution may not be accurate. In particular, the sample mean and sample variance may not be accurate, decreasing in accuracy with the degree of contamination.
A scale parameter for the distribution of a random variable X is a parameter σ where the corresponding parameter for the random variable aX + b is aσ. The mean and variance are scale parameters but they may not exist for all distributions. However, the median and interquartile range (IQR) do exist for all distributions. The median absolute deviation is defined as is the median of |X - m|. In this expression, m is the median of X. The median absolute deviation is a scale parameter that also exists for all functions. The median absolute deviation is half the IQR for symmetric distribution. (DeGroot and Morris, Probability and Statistics, 670)
A least-squares line is a method for fitting a line against sample data against a straight line, minimizing the squares of the vertical deviations from the line. A least-squares line is defined by the equation y = β̂_{0} + β̂_{1}x, where
β̂_{0} = ȳ - β̂_{1}x̄
β̂_{1} = ∑_{i=1}^{n}(y_{i} - ȳ) / ∑_{i=1}^{n}(x_{i} - x̄)^{2}
and x̄ - (1/n)∑_{i=1}^{n}x_{i}, ȳ - (1/n)∑_{i=1}^{n}y_{i}.
(DeGroot and Morris, Probability and Statistics, 692)
The R function lm(y ~ x)
, amongst other things, performs a least squares fit of y
as a function of x
.
Example: Let's use the R function lm(y ~ x)
to work through Example 11.2.2
in DeGroot and Morris.
(DeGroot and Morris, Probability and Statistics, 689-692)
The data in this example relates the boiling point of water at different altitudes
to the air pressure and is actually from Forbes.
(Forbes, Further Experiments and Remarks on the Measurement of Heights by the Boiling Point of Water,
135-143)
The array x contains measurements of temperature at which the water boils and the
array y contains the air pressure measurements.
> x <- c(194.5, 194.25, 197.9, 198.43, 199.45, 199.95, 200.93, 201.15, 201.35, 201.3, 203.55, 204.6, 209.47, 208.57, 210.72, 211.95, 212.18) > y <- c(20.79, 20.79, 22.4, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02, 24.015, 25.14, 26.57, 28.49, 27.76, 29.04, 29.879, 30.064) > least.sq <- lm(y ~ x) > least.sq Call: lm(formula = y ~ x) Coefficients: (Intercept) x -81.088 0.523
So β̂_{0} = -81.088, β̂_{1} = 0.523 and the equation of the line is y = -81.088 + 0.523x. (There are some minor rounding errors in DeGroot and Morris' input data that different from Forbes.) ▢
Least squares can be used to fit sample data against a function of many variables, as in the formula below.
y = β̂_{0} + β̂_{1}x_{1} + ... + β̂_{k}x_{k}
The sum of squares of the variations are minimized, resulting in a set of simultaneous equations, which can be solved for β̂_{0}, β̂_{1}, ... β̂_{k}. This method can be extended to solving non-linear equations by letting x_{1}, ... x_{k} be other functions of x, such as powers of x and log functions.
Example: We saw above that there was a correlation between text size and the number of unique words in different texts in the corpus. However, it was not linear. It turns out that we can get a more linear relation in log space. The following R commands do a linear regression analysis on the log of the log of the word count versus the log of the count of unique words and plot the line of best fit on the graph with the data points.
> x <- log(corpusstats$word.count) > y <- log(corpusstats$unique.words) > unique.lm <- lm(y ~ x) > plot(x, y, xlab="Log Word Count", ylab="Log Unique Words", pch=17, col="blue") > abline(unique.lm)
The function abline()
takes the output of the linear regression and adds
it to the plot.
The graph generated is shown in Figure 10.
Figure 10: Linear Regression of Log Count of Unique Words versus Log of Text Length
▢
A regression function of Y on X_{1}, ..., X_{k} uses the random variables X_{1}, ..., X_{k} as predictors of the random variable Y. The expected value of Y is
E(Y|x_{1}, ..., x_{k}) = β_{0} + β_{1}x_{1} + ... + β_{k}x_{k}
where (x_{1}, ..., x_{k}) are observed values. (DeGroot and Morris, Probability and Statistics, 699)
Simple linear regression relates Y to a single random variable X. The relationship can be expressed as Y = β_{0} + β_{1}x_{1} + ϵ. The variable ϵ has mean 0 and variance σ^{2}. The assumptions of simple linear regression are
If these assumptions are met, then the conditional distribution of Y is
ƒ(y|x, β_{0}, β_{1}, σ^{2}) = (1/(2πσ^{2}))^{n/2} exp[-(1/(2σ^{2}))∑_{i=1}^{n}(y_{i} - β_{0} - β_{1}x_{i})^{2}]
An MLE estimate of the variance of Y is
σ̂^{2} = 1/(n)∑_{i=1}^{n}(y_{i} - β̂_{0} - β̂_{1}x_{i})
(DeGroot and Morris, Probability and Statistics, 700-701)
Statistical inference on simple linear regression can be made in a similar to the parameters discussed above. The variable nσ̂^{2}/σ^{2} has a χ^{2} distribution with n - 2 degrees of freedom. (DeGroot and Morris, Probability and Statistics, 709)
A prediction interval for Y with coefficient 1 - α_{0} is given by
Ŷ ± T_{n-2}^{2}(1 - α_{0}) σ′[1 + 1/n + (x - x̄)^{2}/s_{x}^{2}]^{½}
where
s_{x}^{2} = (∑_{i=1}^{n}(x_{i} - x̄)^{2})^{½}
σ′ = (S^{2}/(n - 2))^{½}
S^{2} = ∑_{i=1}^{n}(Y_{i} - β_{0} - β_{1}x_{i})^{2}
(DeGroot and Morris, Probability and Statistics, 710-716)
Residuals are the difference between observed and predicted values e_{i} = y_{i} - ŷ_{i}, for i = 1, ..., n. The residuals should be randomly distributed. If there is some kind of pattern of the residuals that may indicate that a straight line is not the best model for the data. The residuals are often plotted as a function of x_{i} to see if they exhibit a pattern, which is often called analysis of residuals. A normal quantile plot or Q-Q plot, is a chart of the quantiles of the residuals versus the quantiles of the normal distribution. The plot should be a straight line. If not, the residuals may belong to some other kind of distribution than the normal distribution.
Example: Let's find the residuals for the unique word count example above.
Calling the function summary()
on the linear regression object gives information
about residuals, as shown in the R script below.
> summary(unique.lm) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -1.2653 -0.2087 0.1326 0.2377 0.3945 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.15877 0.14502 7.99 2.28e-11 *** x 0.68528 0.01936 35.39 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3426 on 68 degrees of freedom Multiple R-squared: 0.9485, Adjusted R-squared: 0.9477 F-statistic: 1252 on 1 and 68 DF, p-value: < 2.2e-16 > plot(unique.lm)
This shows that the median residual is 0.1326.
Besides the residual summary, a lot of other quite advanced information is displayed as well,
which I won't go into.
The plot(residual.object)
displays a plot of the residuals, as shown in Figure 11.
Figure 11: Residuals of Log Text Length versus Log Unique Word Count
From Figure 11, it can be seen that the distribution of residuals is not really random.
Therefore, a log-log relation between text size and unique word count is not perfect.
There are several charts generated from the plot(residual.object)
above,
including a Q-Q plot, which is shown in Figure 12.
Figure 12: QQ Plot for Regression Analysis of Log Text Length versus Log Unique Word Count
The Q-Q plot deviates from a straight line, showing that the distribution of errors is not perfectly normal. ▢
Knell explains the use of R for simple linear regression. (Knell, Introductory R, 222-260)
The general linear model generalizes the simple model to multiple predictor variables z_{i} = (z_{i0}, ... z_{ip-1}) with a vector of parameters β = (β_{0}, ..., β_{p-1}). A similar set of five assumptions is necessary. The dependent values Y_{i} for i = 1, ..., n have the same variance σ. An expected value of Y is given by
E(Y_{i}| z_{i}, β) = z_{i0}β_{0} + z_{i1}β_{1} + ... + x_{ip-1}β_{k}
The predictor variables are written in an n × p matrix Z = [z_{ij}] for i = 0, ...n, j = 0, ... p - 1, called the design matrix. The least squares estimator β̂ of β is
β̂ = (Z′Z)^{-1}Z′Y
where Z′ is the transpose of Z. The mean of the dependent variable Y can be expressed in matrix form as
E(Y) = Zβ
(DeGroot and Morris, Probability and Statistics, 736-742)
The residuals are similar to simple linear regression,
e_{i} = y_{i} - ŷ_{i} = y_{i} - z_{i0}β_{0} - ... z_{ip-1}β_{p-1}
(DeGroot and Morris, Probability and Statistics, 749)
You can do a generalized linear regression in R with the glm()
function.
Analysis of variance (ANOVA) is an application of multiple regression where the design matrix has a special form. An ANOVA one-way layout tests whether the means of samples μ_{i} from p different populations are the same. The variance σ is the same for all of the different populations.
H_{0}: μ_{1} = ... μ_{p}
H_{1}: H_{0} is false
If H_{0} is true then the test statistic
U2 = [S_{Betw}^{2}/(p - 1)] / [S_{Resid}^{2}/(n - p)]
has an F distribution with degrees of freedom p - 1 and n - p. The total sum of squares
S_{Tot}^{2} = ∑_{i=1}^{p}∑_{j=1}^{ni} (Y_{ij} - Ȳ_{++})^{2}
is partitioned into two parts, the residual sum of squares and the between samples sum of squares:
S_{Tot}^{2} = S_{Resid}^{2} + S_{Betw}^{2}
where the parts are defined as
S_{Resid}^{2} = ∑_{i=1}^{p}∑_{j=1}^{ni}
(Y_{ij} - Ȳ_{i+})^{2}
S_{Betw}^{2} = ∑_{i=1}^{p}^{ni(Ȳi+ - Ȳ++)2
}
where
Ȳ_{++} = (1/n)∑_{i=1}^{p}∑_{j=1}^{ni}Y_{ij} = (1/n)∑_{i=1}^{p}n_{i}Ȳ_{i+}
(DeGroot and Morris, Probability and Statistics, 754-761)
You can do an analysis of variance in R with the anova()
function, which takes the output
of the generalized linear regression function glm()
, or the anova()
function,
which takes a formula.
Example: Let's work through 11.6.5 in DeGroot and Morris using R. (DeGroot and Morris, Probability and Statistics, 755-760) There are p = 4 different kinds of hot dogs (Beef, Meat, Poultry, Specialty) with different calorie counts. We want to test whether the average calorie counts for the different kinds of hotdogs are the same.
> type <- c("Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Beef", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Meat", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Poultry", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty", "Specialty") > n = length(type) > p = 4 > calories <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111, 141, 153, 190, 157, 131, 149, 135, 132, 173, 191, 182, 190, 172, 147, 146, 139, 175, 136, 179, 153, 107, 195, 135, 140, 138, 129, 132, 102, 106, 94, 102, 87, 99, 107, 113, 135, 142, 86, 143, 152, 146, 144, 155, 170, 114, 191, 162, 146, 140, 187, 180) > hotdog.aov = aov(calories ~ type) > summary(hotdog.aov) Df Sum Sq Mean Sq F value Pr(>F) type 3 19454 6485 11.6 4.48e-06 *** Residuals 59 32995 559 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > qf(0.99, p - 1, n - p) [1] 4.132055 > qf(0.999, p - 1, n - p) [1] 6.185016
The "formula" here relates the vector type
of hot dog types to the vector calories
of corresponding calorie counts.
The test statistic is 11.6, which has a p-value of 4.48^{-6}, which is rejected at most levels of
significance, including α_{0} = 0.001.
This is demonstrated by the calls to the F distribution quantile R function qf(p, df1, df2)
.
Therefore, the H_{0} hypothesis, that the mean calorie counts are the same, is rejected.
▢
Example: This example looks at average word frequencies in Buddhist texts. The word frequencies of function words should be similar across texts of the same period because the do not depend greatly on the content of the text. One of the most frequently occuring function words in literary Chinese is the nominalizer 所 suǒ "it." 所 suǒ is placed in front of a verb to make the objects acted on by the verb the subject. (Pulleybank, Outline of Classical Chinese Grammar, 68) It can be roughly understood as the English pronoun "it." The frequencies per thousand words of 所 suǒ in several texts are given in Table 3. The texts are Biographies of Eminent Monks, Scroll《高僧傳》 by Huijiao (497-554), Buddhist Monasteries in Luoyang, Scroll 1 《洛陽伽藍記》 by Yang Xuanzhi (c. 547), and The Great Calming and Contemplation 《摩訶止觀》by Zhi Yi (538—597).
Text | Biographies of Eminent Monks | Buddhist Monasteries in Luoyang | Great Calming |
---|---|---|---|
Abbreviation | GSZ | Luoyang | MHZG |
Scroll | Frequency | Frequency | Frequency |
1 | 7.27 | 6.66 | 5.89 |
2 | 5.76 | 6.36 | 6.12 |
3 | 6.02 | 7.45 | 4.27 |
4 | 3.00 | 6.43 | 5.15 |
5 | 4.84 | 4.99 | 4.22 |
6 | 5.92 | 3.16 | |
7 | 5.66 | 4.28 | |
8 | 3.69 | 2.94 | |
9 | 4.79 | 3.35 | |
10 | 4.76 | 5.60 | |
11 | 3.88 | ||
12 | 5.35 | ||
13 | 5.61 | ||
14 | 5.03 | ||
Mean | 5.11 | 6.38 | 4.49 |
A box plot generated with R is shown in Figure 13.
Figure 13: Box Plot of Frequencies per Thousand Words of 所 Suǒ
A test of whether the means of word frequency for 所 suǒ for each of the three texts is done with ANOVA using the R commands below.
> gsz <- c(7.27, 5.76, 6.02, 3.00, 4.84, 5.92, 5.66, 3.69, 4.79, 4.76, 3.88, 5.35, 5.61, 5.03) > summary(gsz) Min. 1st Qu. Median Mean 3rd Qu. Max. 3.000 4.768 5.190 5.113 5.735 7.270 > luoyang <- c(6.66, 6.36, 7.45, 6.43, 4.99) > summary(luoyang) Min. 1st Qu. Median Mean 3rd Qu. Max. 4.990 6.360 6.430 6.378 6.660 7.450 > mhzg <- c(5.89, 6.12, 4.27, 5.15, 4.22, 3.16, 4.28, 2.94, 3.35, 5.60) > summary(mhzg) Min. 1st Qu. Median Mean 3rd Qu. Max. 2.940 3.568 4.275 4.498 5.487 6.120 > texts <- c("gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "gsz", "luoyang", "luoyang", "luoyang", "luoyang", "luoyang", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg", "mhzg") > frequencies <- c(gsz, luoyang, mhzg) > length(frequencies) [1] 29 > frequencies.aov = aov(frequencies ~ texts) > summary(frequencies.aov) Df Sum Sq Mean Sq F value Pr(>F) texts 2 11.78 5.891 5.014 0.0144 * Residuals 26 30.55 1.175 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > p <- 3 > n <- length(frequencies) > qf(0.99, p - 1, n - p) [1] 5.526335 > qf(0.95, p - 1, n - p) [1] 3.369016
The output of the summary of aov()
result shows that F value = 5.014.
The computed values of the inverse of the cumulative F distribution at the
α_{0} = 0.01 and 0.05 levels are
F^{-1}(1 - 0.01) = 5.526 and F^{-1}(1 - 0.05) = 3.369.
The ANOVA shows that the null hypothesis that the means are equal should be rejected at the 0.05
level of significance but not the 0.01 level, since
3.369 < F value < 5.526.
This is consistent with the ANOVA output for the power function Pr(>F) = 0.0144.
▢
Simulation can be used to solve problems where there is no closed-form solution. Examples are computing means, variances, and quantiles for formulas that cannot be solved algebraically. The simulations usually make use of large numbers of randomly generated numbers. Statistical simulation is sometimes called Monte Carlo analysis because of the use of randomness. (DeGroot and Morris, Probability and Statistics, 791)
R functions can be used to generate random values with the different distributions discussed in this
document.
Figure 14 shows a histogram produced from the R function rnorm(10000)
that produces
10,000 random numbers with a standard normal distribution.
Figure 14: Histogram of Random Values from a Standard Normal Distribution
Information theory has arisen from communications theory in electrical engineering in connection with the study of transmission rates and error correction. (Cover and Thomas, 2012, ch. 1) Information theory was first formulated in its present form by Shannon in the 1940's. (Cover and Thomas, 2012, ch. 2) It was later expanded to areas of computer science, such as data compression and data storage. Information theory concepts have since spread back into statistics and other fields like economics and physics. The definition of entropy above can be used as a basis for many other definitions that parallel concepts in probability and statistics. Some of these are described here.
The joint entropy H(X, Y) of a pair of random variables is defined as
H(X, Y) = -∑_{x∈X}∑_{y∈Y}p(x, y) log(p(x, y))
where (X, Y) is a pair of discrete random variables with probability distribution p(x, y). (Cover and Thomas, 2012, ch. 2.2)
The conditional entropy H(X|Y) is defined as
H(X|Y) = -∑_{x∈X}H(Y|X=x) = -∑_{x∈X}∑_{y∈Y}p(x, y) log(p(y|x))
where (X, Y) is a pair of discrete random variables. (Cover and Thomas, 2012, ch. 2.2)
The chain rule for entropy is
H(X, Y) = H(X) + H(Y|X)
Described in words, the entropy of a pair of random variables is the sum of the entropy of one random variable and the conditional entropy of the other. (Cover and Thomas, 2012, ch. 2.2)
Relative entropy D(p||q) is a measure of the distance between the distributions of random variables p and q. In symbols, it is defined as
D(p||q) = ∑_{x∈X}p(x) log(p(x)/q(x))
(Cover and Thomas, 2012, ch. 2.3)
The mutual information I(X; Y) for two discrete random variables X and Y with joint distribution p(x, y) is defined as the relative entropy between the joint distribution and the product of the marginal distributions p(x)p(y)
I(X; Y) = -∑_{x∈X}∑_{y∈Y}p(x, y) log(p(y, x)/[p(x)p(y)])
(Cover and Thomas, 2012, ch. 2.3)
The relationship between mutual information and entropy is
I(X; Y) = H(X) - H(X|Y)
(Cover and Thomas, 2012, ch. 2.3)
The concept of convex functions is important in many proofs in information theory. A function f(x) is convex of interval (a, b) if for every pair points x_{1}, x_{2} in the interval and for 0 ≤ λ ≤ 1, the following condition holds
f(λx_{1} + (1-λ)x_{2}) ≤ λx_{1} + (1-λ)x_{2}
(Cover and Thomas, 2012, ch. 2.6)
Markov chains, introduced above, are useful in information theory. Random variables X, Y, and Z form a Markov chain X → Y → Z if the joint probability function is
p(x, y, z) = p(x)px(y|x)p(z|y)
In words, this means that no manipulation of Y can increase the information that it contains about X. (Cover and Thomas, 2012, ch. 2.8)
The Data-processing inequality theorem states that, if X → Y → Z form a Markov chain then
I(X; Y) ≥ I(X; Z)
(Cover and Thomas, 2012, ch. 2.8)
As explained above, a statistic is a function of a set of random variables. A sufficient statistic T(X) contains all the information in X about ϴ. In symbols, a sufficient statistic is a function T(X) relative to a family of probability functions {f_{ϴ}(x)} parameterized by ϴ, if X is independent of ϴ for any distribution of ϴ, given T(X). In this case, ϴ → T(X) → X is a Markov chain. This is the case of equality in the data-processing inequality theorem:
I(ϴ; X) = I(ϴ; T(X))
(Cover and Thomas, 2012, ch. 2.9)
T(X) is a minimally sufficient statistic if it is a function of every other statistic. In this case, ϴ → T(X) → U(X) → X is a Markov chain.
I(ϴ; X) = I(ϴ; T(X))
(Cover and Thomas, 2012, ch. 2.9)
The asymptotic equipartition property (AEP) states that, if X_{1}, X_{2}, ... are independent and identicially distributed with probability density function p(x) then
-(1/n) log p(X_{1}, X_{2}, ... , X_{n}) → H(X)
(Cover and Thomas, 2012, ch. 3.1) The AEP is analogous the law of large numbers in statistics. Here the symbol → means convergence of random variables. A sequence of random variables X_{1}, X_{2}, ... converges to a random variable X if Pr{lim_{n→∞}X_{n} = X} = 1.
A typical set A^{(n)} with respect to probability density function p(x) is the set of sequences (x_{1}, x_{2}, ..., x_{n}) with the property
2^{-n(H(X)+ε)} ≤ p(x_{1}, x_{2}, ..., x_{n}) ≤ 2^{-n(H(X)-ε)}
(Cover and Thomas, 2012, ch. 3.1)
The AEP has applications in data compression. By partitioning a set of strings into a typical set and a non-typical set and representing the strings by an integer index, for a large number of elements n, the expected number of bits needed to represent the set is nH(X). (Cover and Thomas, 2012, ch. 3.2)
A source code for a random variable is a mapping from the range of the random variable to a set of finite length strings of symbols in an alphabet. Here the source code is givne the symbol C, the random variable X, the arity of the alphabet is D, C(x) is the codeword for the value x of the random variable X, D^{*} is the set of symbols C(X), and l(x) is the length of C(x). The expected length L(C) of a source code is
L(C) = ∑_{x∈X} p(x)l(x)
For example, if X is a random variable with x ∈ {1, 2, 3, 4}, and D = {0, 1}, then a possible source code is C(1) = 0, C(2) = 10, C(3) = 110, C(4) = 111. A code is nonsingular is every element of X maps to a different string in D^{*}. The example just given is nonsingular because each of the C(X) is different. (Cover and Thomas, 2012, ch. 5.1)
The
C(x_{1})C(x_{2}) ... C(x_{n})
If the extension is nonsingular then it is referred to as uniquely decodable. This is important because it means that a compressed file can be made up of concatenated symbols without any separator between them. The example above is uniquely decodable because no symbol is a prefix of any other symbol. Such as source code is called a prefix code or an instantaneous code. It does not need any lookahead to resolve ambiguity in the string of symbols. (Cover and Thomas, 2012, ch. 5.1)