In this post I'll try to put into words what I understand of information theory and its links with computer science, in particular with the objective of describing what I understand the "information-theoretical lower bound" to be in the context of data structures. Please bear in mind that I'm far from being an expert in the fields of information theory, communication theory and coding theory: if I got anything wrong, your comments are more than welcome: mail me!
Information? Entropy?
Consider this situation: you observe an event that can have two outcomes, one that confirms what you thought was natural given your experience and one that might surprise you. Which one do you think carries more information? The philosophical question of what information is might not be easy to answer but, in the formal setting of information theory, Claude Shannon gave a list of axioms to define what information is:
- An unavoidable outcome is unsurprising and carries no information;
- The less probable an outcome is, the more surprising it is and the more information it yields;
- If two independent events are measured separately, the total amount of information is the sum of information of the individual events.
As you can see, the two pivotal concepts here are surprise and probability, and their interplay is what defines information. From the list above we can think of information as a function that maps the probability of an outcome to a degree of surprise:
$$ \mathcal{I} : [0, 1] \rightarrow [0, \infty) $$
It turns out that the properties listed above are satisfied by essentially one function (up to a multiplicative scaling factor): given a base $b > 1$ and an outcome $x$ with probability $P(X = x)$, we define its self-information as
$$ \mathcal{I}(x) = -\log_b(P(X = x)) $$
which is a quantity expressed in units of information.
As we can see from the plot above, the value of $b$ changes the quantity of information expressed! For example, for $b = 2$, the self-information $\mathcal{I}$ of an event with low probability $p = 0.1$ is $-\log_2(0.1) \approx 3.32$, while for $b = 10$ its self-information is $-\log_{10}(0.1) = 1$. This aspect is captured by the multiplicative factor I mentioned earlier and is also linked to the unit of measure of information.
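These numbers are easy to verify directly in Julia; here `selfinfo` is just a throwaway helper written for illustration:

```julia
# Self-information of an outcome with probability p, in base b.
selfinfo(p; b = 2) = -log(b, p)

selfinfo(0.1)           # ≈ 3.32, in bits (b = 2)
selfinfo(0.1, b = 10)   # = 1.0, in base-10 units (hartleys)
```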
The meaning of this choice is, in my opinion, best explained when another quantity, strongly connected to self-information, is considered: entropy, which Shannon presented in the setting of communication with a precise mathematical definition: given a random variable $X$, we compute its entropy $\mathcal{H}(X)$ as the expectation of its self-information, that is
$$ \mathcal{H}(X) = \sum_{x \in X}P(X = x)\mathcal{I}(x) = - \sum_{x \in X} P(X = x) \log_b(P(X = x)) $$
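As a quick illustration, the expectation can be computed directly from a vector of outcome probabilities (`entropy` here is an ad-hoc sketch, not a library function):

```julia
# Entropy of a discrete distribution given as a vector of probabilities,
# i.e. the expected self-information in base b (2 by default).
function entropy(probs; b = 2)
    sum(-p * log(b, p) for p in probs if p > 0)
end

entropy([0.5, 0.5])    # 1.0 bit: a fair coin
entropy([0.9, 0.1])    # ≈ 0.47 bits: a biased coin is less surprising
entropy(fill(1/6, 6))  # ≈ 2.58 bits: a fair six-faced die
```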
Let's take it slow and figure out what this means. Imagine you want to use a means of communication (which we call a channel) to communicate the outcome $x$ of a realization of the random variable $X$. Suppose the channel allows you to express messages as strings of arbitrary length made out of symbols from a finite set, which we call the alphabet $\mathcal{A}$; for example, we might have that $$\mathcal{A} = \{0, \cdots, 9\}$$ is the set of decimal digits, or
$$ \mathcal{A} = \{a, b, \cdots, z\} $$
is the Latin alphabet; then, we can encode the possible outcomes we want to express into strings of symbols: for example, when throwing a six-faced die, we might map the face with one dot to $1$, the face with two dots to $2$ and so on. Once we have an encoding, we might ask ourselves whether it is the smallest possible encoding we can give: Shannon proved that there is no way to come up with an encoding (that is, a set of strings representing the possible outcomes) whose average length is less than the entropy of the event.
Theorem (Shannon's Source Coding Theorem). Let $X$ be a random variable and let $\mathcal{A}$ be an alphabet of $b$ symbols. Then, let
$$ C : X \rightarrow \mathcal{A}^{*} $$
be a uniquely decodable encoding, that is, an injective mapping of the outcomes[^1] of the realization of $X$ to strings drawn from the Kleene closure of the alphabet $\mathcal{A}$, such that any concatenation of codewords can still be decoded unambiguously. Then,
$$ E[|C(X)|] = \sum_{x \in X} |C(x)|P(X = x) \geq -\left(\sum_{x \in X}P(X = x)\log_b(P(X = x))\right) = \mathcal{H}(X) $$
In other words, the expected length of the encodings of outcomes cannot be less than the entropy of the event.
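To make this concrete, here is a quick check on a toy source over the binary alphabet (the distribution and the code are just an illustrative choice): with $P(a) = 1/2$ and $P(b) = P(c) = 1/4$, the prefix code $a \mapsto 0$, $b \mapsto 10$, $c \mapsto 11$ has an expected length equal to the entropy, and the theorem tells us that no uniquely decodable code can do better on average.

```julia
# A toy source and a uniquely decodable (in fact prefix-free) binary code.
probs = Dict('a' => 0.5, 'b' => 0.25, 'c' => 0.25)
code  = Dict('a' => "0", 'b' => "10", 'c' => "11")

# Expected code length: sum over the outcomes of P(x) * |C(x)|.
expected_length = sum(probs[x] * length(code[x]) for x in keys(probs))  # 1.5

# Entropy of the source, in bits.
H = -sum(p * log2(p) for p in values(probs))                            # 1.5

expected_length >= H  # true, with equality for this particular code
```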
Back to computer science
Coming back to computer science, we want to use the concepts coming from information theory to draw conclusions about the representation of data. How can we link the concept of entropy, which deals with the probability of an event, to something technically deterministic such as computer science?
Let's start with an example: we want to come up with a representation (a coding) of ASCII characters expressed as bits; in other words, our alphabet is $\mathcal{A} = \{0, 1\}$. When $b = 2$, the unit measured by self-information is called the bit. We have $n = 128$ possible characters to represent, each equiprobable in principle. Let $X$ be the event of seeing one of the $128$ characters. What is the entropy of this event?
$$ \begin{equation*} \begin{split} \mathcal{H}(X) &= E[\mathcal{I}(X)] = -\sum_{x \in X}P(X = x)\log_2(P(X = x))\\ &= -\sum_{x \in X}\frac{1}{128}\log_2(\frac{1}{128}) = -(128 \cdot \frac{1}{128} \cdot \log_2(\frac{1}{128})) \\ &= - (\log_2(1) - \log_2(128)) = 7 ~~ [\text{bit}] \end{split} \end{equation*} $$
and, in fact, we can come up with a representation using exactly $7$ bits to identify each character. Now, we want to represent strings of fixed length $l$ of ASCII characters: we have $128^l$ possible strings; what is the entropy of the event of seeing one of them?
$$ \begin{equation*} \begin{split} \mathcal{H}(X) &= E[\mathcal{I}(X)] = -\sum_{x \in X}P(X = x)\log_2(P(X = x))\\ &= -\sum_{x \in X}\frac{1}{128^l}\log_2(\frac{1}{128^l}) = -(128^l \cdot \frac{1}{128^l} \cdot \log_2(\frac{1}{128^l})) \\ &= - (\log_2(1) - \log_2(128^l)) = 7l ~~ [\text{bit}] \end{split} \end{equation*} $$
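Since all outcomes are equiprobable, the sums above collapse to the base-2 logarithm of the number of possible values, which is easy to double-check:

```julia
log2(128)    # 7.0 bits for a single ASCII character
log2(128^5)  # 35.0 bits for a string of 5 ASCII characters
7 * 5        # 35, i.e. l characters of 7 bits each
```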
The entropy obviously depends on the length of the string: for example, if we want to encode strings of $5$ characters, the entropy is $35$ bits, that is, we need on average $35$ bits to encode every possible string. Now imagine we want to use Cantor's pairing function, a bijection that maps a pair of integers to a single integer, to represent strings of five characters as a single number by applying it repeatedly.
I came up with these two functions in Julia:
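They boil down to a `cantor` pairing folded left-to-right over the code points of the string, starting from zero; `BigInt`s are needed because the value outgrows 64 bits almost immediately (this is a sketch of the idea, the details may differ):

```julia
# Cantor's pairing function: a bijection from pairs of naturals to naturals.
cantor(a, b) = (a + b) * (a + b + 1) ÷ 2 + b

# Fold the pairing over the code points of a string, starting from zero.
to_cantor(s) = foldl((acc, c) -> cantor(acc, big(Int(c))), s; init = big(0))

ceil(log2(to_cantor("Hello")))  # 169.0
```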
The result of `ceil(log2(to_cantor("Hello")))` is $169$: this means that to represent strings this way we need at least[^2] $169$ bits, which is, well, a bit more than $35$. Of course, a better representation is the juxtaposition of $5$ characters of $7$ bits each, for a total of $35$ bits, which is exactly the entropy we computed. This is to say that we can use the concept of entropy, which in the case of $n$ equiprobable outcomes simplifies to
$$ \mathcal{H}(X) = \log_2(n) $$
to characterize "how good" a representation we come up with really is. Of course, this is not all there is to it: another important aspect to consider is the use of the representation, that is, the operations we want to compute on the data. Would it be fair to consider "good", in some sense, a representation that uses little space but requires huge amounts of additional space and/or time to compute the operations required on the data, for example, representing generic trees in $4$ bits but taking more than $22$ billion years to read the label of the first leaf?
Implicit, succinct and compact data structures
The catch is that any consideration about a data structure must take into account the API it has to offer, without ever decompressing the data. Given data we want to represent whose entropy is $E$ bits, we say that a representation we come up with is
- implicit if it takes $E + \mathcal{O}(1)$ bits. An example is given by null-terminated strings, which encode a string of length $n$ in $(n + 1) \cdot 8 = E + \mathcal{O}(1)$ bits.
- succinct if it takes $E + o(E)$ bits. An example is given by length-prefixed strings of the form `{length: n, vec: [n]}`, which encode a string of length $n$ in $(n \cdot 8) + \log_2(n) = E + o(E)$ bits.
- compact if it takes $\mathcal{O}(E)$ bits.
All of this while still offering an efficient implementation of the data structure's API.
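To put some numbers on these definitions, here is a small sketch (the helper names are made up for the occasion, and the model is the one-byte-per-character one used above, so $E = 8n$): a null terminator adds a constant $8$ bits, while a length field adds roughly $\log_2(n)$ bits, which becomes negligible as $n$ grows.

```julia
raw_bits(n)             = 8n                       # the E = 8n bits of content
null_terminated_bits(n) = 8(n + 1)                 # implicit: E + O(1)
length_prefixed_bits(n) = 8n + ceil(Int, log2(n))  # succinct: E + o(E)

for n in (10, 1_000, 1_000_000)
    println(n, " chars: raw=", raw_bits(n),
            ", null-terminated=", null_terminated_bits(n),
            ", length-prefixed=", length_prefixed_bits(n))
end
```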
[^1]: Technically, it should map a source encoding of the outcomes and not the outcomes themselves.
[^2]: To give the precise number, we should compute the largest number a 5-character string can be mapped to and take the ceiling of the $\log_2$ of that number.