Entropy vs Randomness

For a long time, I confused entropy with randomness, assuming that something with high entropy must have been randomly generated, and vice versa. It seems like a small confusion, but it can lead to serious errors in fields like cryptography and information theory. While the two concepts are related, they aren't the same, and mixing them up makes it hard to understand how information really works.

To illustrate this, consider a puzzle. One of these two sequences is the fractional part of π written in binary (just as π is 3.14159... in decimal, it's 11.00100100... in binary). The other was generated by flipping a fair coin 100 times. Can you tell which is which?

A: sequence of 100 bits
B: sequence of 100 bits

Probably not. Both look equally "random", roughly the same number of 0s and 1s, no obvious patterns, no way to guess the next digit from the previous ones. And yet they are totally different: the coin flips are random, generated by an unpredictable physical process, while the digits of π are completely deterministic, computed by a formula, reproducible by anyone. (Sequence A is π.)

Despite being totally different from the "randomness" point of view, both sequences have the same entropy, roughly 1 bit per digit. Entropy measures information content: how much you learn by observing each digit. In both cases, each digit is roughly equally likely to be 0 or 1, so each one carries about 1 bit of information.
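We can check this empirically, at small scale. The sketch below extracts the leading fractional bits of π from double-precision math.pi (trustworthy only to about 48 bits, so this is an illustration, not a proof) and counts how balanced they are:

```python
import math

frac = math.pi - 3.0        # fractional part of pi: 0.14159...
bits = []
for _ in range(48):         # double precision limits us to ~48 reliable bits
    frac *= 2
    bit = int(frac)         # the next binary digit is the new integer part
    bits.append(bit)
    frac -= bit

print("".join(map(str, bits)))                       # 001001000011111101...
print("zeros:", bits.count(0), "ones:", bits.count(1))
```

The counts come out roughly even, exactly as the puzzle suggests.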

This is the counterintuitive distinction this post explores. Randomness is about how something was generated: whether the process produces outcomes that are unpredictable without additional information. Entropy is about how much information it contains: the average amount of uncertainty resolved by observing each outcome. They often go together, but they don't have to, and understanding when and why they diverge is crucial for reasoning about compression, cryptography, and information in general.

What is information?

Before we dive into formulas and definitions, let's build some intuition about what it means for a message to contain "information." Consider these two statements:

  • "The sun rose this morning."
  • "It snowed in Zurich in July."

The first statement is true, but it carries almost no information because you already knew the sun would rise. It happens every day with near certainty. The second statement, however, contains lots of information because it's unexpected and surprising. July snow in Zurich is rare, so learning that it happened resolves significant uncertainty about the weather.

This reveals a fundamental principle: information is inversely related to probability. Rare events contain more information than common ones. The more surprising something is, the more information it conveys when it happens. This intuition, that learning about unlikely outcomes tells you more than learning about likely ones, is the foundation of Shannon's mathematical theory of information.

Coin flips

Let's make this more concrete with coin flips. Consider three different coins:

  • Fair coin: Both heads and tails are equally likely (50/50 probability). Each flip is maximally surprising because you genuinely don't know what will happen. This means maximum information per flip. You learn the most possible from each outcome.
  • Biased coin (90% heads, 10% tails): Most flips will be heads, so when you observe heads, you gain little information because you mostly expected it. When tails appears, it's surprising and carries more information. On average, however, you get less information per flip than with a fair coin because most of the time the outcome is predictable.
  • Two-headed coin: Every flip is guaranteed to be heads. There's no surprise at all, no uncertainty to resolve. Zero information. Watching it flip tells you nothing you didn't already know.

Critical distinction: entropy measures the average information content of a source or distribution, not of any particular sequence. A pseudorandom number generator with a known seed has zero conditional entropy from the perspective of someone who knows that seed, even if its outputs look perfectly random. For an observer who doesn't know the seed, the remaining uncertainty comes entirely from the seed itself: once the seed is fixed, the sequence is deterministic.

This distinction is crucial: entropy is a property of distributions, not of individual sequences. A specific sequence of coin flips like HTHHTHT could come from either a fair coin or a heavily biased coin. The sequence itself doesn't have entropy. Only the process that generated it does. What matters is the probability distribution over possible outcomes, not the particular outcome you happened to observe.

The main distinction

These two concepts are related but fundamentally different, and confusing them leads to serious misunderstandings:

  • Randomness: the unpredictability of individual outcomes in a process. Examples: coin flips, atmospheric noise, quantum measurements.
  • Entropy: the average information content of a probability distribution. Examples: a fair coin yields 1 bit/flip; a biased coin (90/10) yields ≈ 0.47 bits/flip.

Let's look at four examples that illustrate the different combinations of randomness and entropy, demonstrating that these properties are independent:

1. High entropy, random: Fair dice

A fair six-sided die is both random and high-entropy. Each roll is unpredictable, and all six outcomes are equally likely, giving you $\log_2(6) \approx 2.58$ bits of information per roll. We have both maximum unpredictability (you can't predict the next roll) and maximum information content (each roll resolves the maximum amount of uncertainty about which of six equally likely possibilities occurred).

2. Low entropy, random: Biased dice

Imagine a weighted die that rolls 6 with 90% probability, and each of 1 through 5 with only 2% probability. Each roll is still random in the sense that you can't predict the exact outcome: the process is genuinely unpredictable. But the entropy is much lower, approximately $H \approx 0.70$ bits per roll, because most of the time you see what you expected (a 6), which tells you very little. Only occasionally does the die surprise you with a different number, and those rare events carry more information, but averaged over many rolls, you're learning far less per roll than you would with a fair die.
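Plugging both dice into the entropy formula (defined precisely in the next section) confirms the gap; a quick sketch:

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_die = [1 / 6] * 6
biased_die = [0.02] * 5 + [0.9]   # 6 comes up 90% of the time

print(f"fair die:   {entropy_bits(fair_die):.2f} bits/roll")    # ~2.58
print(f"biased die: {entropy_bits(biased_die):.2f} bits/roll")  # ~0.70
```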

3. High entropy, non-random: Digits of π

The digits of π are completely deterministic: there's nothing random about π whatsoever. Given sufficient computation, anyone can calculate the billionth digit of π, and everyone will get the same answer. Yet if you look at the binary representation of π (the fractional part), the digits appear random: each bit (0 or 1) occurs with roughly equal frequency, there are no obvious patterns, and knowing the previous bits doesn't help you predict the next one unless you actually know you're looking at π and can compute it. Empirically, long prefixes of π behave like a high-entropy source despite zero randomness in the generating process. So this is best understood as a practical/empirical example of high entropy without randomness, not a formal proof about all digits of π.

4. Low entropy, non-random: "AAAA..."

The simplest case: a string of repeated identical characters. This is completely predictable, clearly non-random (it's deterministically generated by the rule "repeat A"), and has minimal information content. Once you've seen the first character, you've learned everything. Every subsequent character tells you nothing new. This is low entropy in the most obvious way. Zero bits of information per character after the first one.

Shannon entropy

Now that we have the intuition, let's make it precise. In 1948, Claude Shannon introduced a mathematical definition of information and entropy that became the foundation of information theory and transformed our understanding of communication, compression, and computation.

Consider a discrete random variable $X$ that can take on values $x_1, x_2, \ldots, x_n$, each with probability $p_i = P(X = x_i)$. The Shannon entropy of $X$ is defined as:

\[ H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \]

Let's unpack this formula piece by piece:

  • The negative sign ensures entropy is positive, since $\log_2 p_i$ is negative when $0 < p_i < 1$ (logarithms of fractions are negative).
  • The base-2 logarithm means entropy is measured in bits (binary digits). We could use natural logarithm for nats or base-10 for hartleys, but bits are standard in computer science because they correspond to binary choices.
  • $\log_2(1/p_i)$ is the "information content" or "surprisal" of outcome $i$. Rare events (small $p_i$) have high information content, while common events (large $p_i$) have low information content.
  • We take the expected value over all possible outcomes, weighting each outcome's information content by its probability of occurring. This gives us the average information per outcome.
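The definition translates directly into code. A minimal sketch (the function name is my own):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum(p * log2(p)); terms with p == 0 contribute nothing by convention."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0
print(shannon_entropy([0.9, 0.1]))   # biased coin: ~0.469
print(shannon_entropy([1.0]))        # two-headed coin: 0.0
```

Skipping zero-probability terms implements the standard convention that $0 \log 0 = 0$: an outcome that never happens contributes nothing to the average.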

Why the logarithm?

The logarithm has a beautiful interpretation that makes the formula intuitive. If an outcome has probability $p = 1/8$, its information content is:

\[ \log_2(1/p) = \log_2(8) = 3 \text{ bits} \]

This makes perfect sense: to identify which of 8 equally likely possibilities occurred, you need to ask 3 yes/no questions. For example: "Is it in the first half?" (narrows to 4 possibilities), "Is it in the first half of that?" (narrows to 2), "Is it the first one?" (identifies it exactly). Each question is one bit of information, and the logarithm naturally captures this "number of binary questions needed" intuition. More generally, an event with probability $1/2^k$ carries exactly $k$ bits of information.
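This "number of yes/no questions" reading is easy to verify numerically (the helper name is my own):

```python
import math

def surprisal_bits(p):
    """Information content, in bits, of observing an outcome with probability p."""
    return math.log2(1 / p)

print(surprisal_bits(1 / 8))     # 3.0: three questions pinpoint one of 8 options
print(surprisal_bits(1 / 2))     # 1.0: one question settles a fair coin flip
print(surprisal_bits(1 / 1024))  # 10.0: a 1-in-1024 event carries 10 bits
```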

Extreme cases

Maximum entropy occurs when all $n$ outcomes are equally likely, with $p_i = 1/n$ for all $i$. In this case:

\[ H_{\text{max}} = -n \cdot \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n \]

For example, a fair coin ($n=2$) has $H = \log_2(2) = 1$ bit per flip. A fair six-sided die has $H = \log_2(6) \approx 2.58$ bits per roll. Uniform distributions maximize entropy because they maximize uncertainty. You have no basis for guessing one outcome over another.

Minimum entropy occurs when one outcome is certain, with $p_i = 1$ for some $i$ and all other probabilities zero. In this case:

\[ H_{\text{min}} = -1 \cdot \log_2(1) = 0 \text{ bits} \]

No surprise, no uncertainty, no information. Learning the outcome tells you nothing because you already knew what would happen with certainty.

Entropy Calculator

Experiment with different probability distributions to see how entropy changes. Entering your own probabilities, or revisiting the presets from earlier (fair coin, biased coin, two-headed coin), is a good way to develop intuition for how the shape of a distribution affects its information content.
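A minimal sketch of such a calculator, with the presets discussed above:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

presets = {
    "fair coin": [0.5, 0.5],
    "biased coin (90/10)": [0.9, 0.1],
    "two-headed coin": [1.0],
    "fair die": [1 / 6] * 6,
}

for name, probs in presets.items():
    print(f"{name:<20} {entropy(probs):.3f} bits")
```

Swap in your own probabilities (they must sum to 1) and watch how skewing the distribution toward one outcome drags the entropy down.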

Notice how the fair coin has exactly 1 bit of entropy: each flip gives you one bit of information, resolving one binary choice. The biased coin (90% heads) has only ~0.47 bits; you gain less than half a bit per flip on average because the outcome is mostly predictable. The two-headed coin has zero entropy: no matter how many times you flip it, you learn nothing at all.

Real-World Applications

Why does entropy matter? It's not just an abstract mathematical concept. It has profound practical implications across computer science, cryptography, and data science.

Data Compression

Compression algorithms like ZIP, gzip, and bzip2 work by exploiting low entropy. When data has patterns or redundancy (low entropy relative to its symbol space), it can be represented more compactly. Shannon's source coding theorem gives us a fundamental limit: if a source has entropy $H$ bits per symbol, it can theoretically be compressed to $H$ bits per symbol on average, but no further, no matter how clever your compression algorithm. This is a hard mathematical limit that no amount of engineering can overcome.

For example, English text has entropy around 1.5 bits per character when you account for letter frequencies, common words, and grammatical structure, even though each character occupies 8 bits (1 byte) in standard encoding. This enormous gap between the actual entropy and the storage size is why text compresses so well: there's massive redundancy in natural language. On the other hand, random or encrypted data has entropy close to 8 bits per byte because every bit is essentially unpredictable from the others. There's no pattern to exploit, so such data doesn't compress: it's already at its entropy limit.
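You can watch this limit in action with the standard-library zlib module: highly redundant input collapses to almost nothing, while unpredictable bytes don't shrink at all. A quick sketch:

```python
import os
import zlib

low_entropy = b"A" * 10_000        # one repeated symbol: almost no information
high_entropy = os.urandom(10_000)  # unpredictable bytes: ~8 bits of entropy per byte

print(len(zlib.compress(low_entropy)))   # a few dozen bytes
print(len(zlib.compress(high_entropy)))  # ~10,000 bytes, or slightly more
```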

Password Strength

When security experts talk about password "strength," they're really talking about entropy, though they don't always use that term explicitly. A password's entropy measures the number of bits of uncertainty an attacker faces when trying to guess it, which directly determines how many guesses they need to make on average.

For example, a random 8-character alphanumeric password (using lowercase, uppercase, and digits) has about 48 bits of entropy, calculated as $8 \times \log_2(62) \approx 48$ bits, since there are 62 possible characters per position. An attacker needs to try, on average, $2^{H-1}$ passwords to crack one with $H$ bits of entropy, so 48 bits means roughly $2^{47} \approx 140$ trillion guesses on average. Add special characters and you get more entropy; use a longer password and you get even more.
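The arithmetic is a one-liner; a sketch (the helper name is my own):

```python
import math

def password_entropy_bits(length, alphabet_size):
    """Entropy of a password drawn uniformly at random: length * log2(alphabet)."""
    return length * math.log2(alphabet_size)

print(password_entropy_bits(8, 62))   # ~47.6 bits: lowercase + uppercase + digits
print(password_entropy_bits(8, 94))   # ~52.4 bits: add printable ASCII symbols
print(password_entropy_bits(16, 62))  # ~95.3 bits: doubling length doubles entropy
```

Note that this formula only applies when every character is chosen uniformly at random; human-chosen passwords have far less entropy than their length suggests.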

Crucially, a password like "password123" might theoretically have been generated by a truly random process (imagine someone rolled dice and happened to get those exact letters), but it has very low entropy in practice because it's a common choice that appears in dictionaries and leaked password databases. An attacker would try it early in their guessing strategy, so it provides minimal security. Security depends on entropy (how hard the password is to guess given the attacker's knowledge), not on whether it was randomly generated.

Cryptographic Security

Cryptographically secure pseudorandom number generators (CSPRNGs) must have high entropy in their seed to be safe. If an attacker can predict your random numbers because your entropy source was weak (say, seeded with just the current timestamp), they can break your encryption, forge signatures, or compromise any security mechanism that depends on that randomness.

Operating systems maintain an entropy pool by collecting randomness from unpredictable events like keyboard timings, interrupt timing jitter, and hardware random number generators. On Linux, both /dev/random and /dev/urandom are backed by the kernel CSPRNG; the practical difference is mainly blocking behavior under some conditions. In modern systems after proper initialization, /dev/urandom is generally appropriate for cryptographic use. The main risk is very early boot on poorly seeded systems, where insufficient initial entropy can weaken security.
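In Python, for example, the standard-library secrets module draws from the OS CSPRNG and should be preferred over the random module for anything security-sensitive:

```python
import secrets

token = secrets.token_hex(16)    # 16 random bytes (128 bits) as hex, from the OS CSPRNG
pin = secrets.randbelow(10_000)  # uniform integer in [0, 10000), e.g. a one-time code

print(token)
print(f"{pin:04d}")
```

By contrast, Python's random module is a Mersenne Twister with a guessable seed: fine for simulations, unsafe for keys or tokens.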

Closing Thoughts

Let's recap the key insights from this exploration:

  • Entropy measures average information content. This is how much uncertainty or surprise a probability distribution contains, quantified in bits.
  • High entropy means high unpredictability of individual symbols on average, but this is fundamentally distinct from randomness, which is about the process that generates outcomes.
  • Randomness is about process and reproducibility (how something was generated and whether you can predict or reproduce it), while entropy is about information and uncertainty (how much you learn from observing outcomes, given the probability distribution).
  • These properties are independent: you can have high entropy without randomness (π digits), randomness without high entropy (biased coin), both (fair coin), or neither (constant sequence).
  • Practical applications span compression (exploiting low entropy), cryptography (requiring high entropy), password security (measuring in bits of entropy), and machine learning (using entropy for decision trees and loss functions).

The digits of π demonstrate the distinction clearly. They are completely deterministic with zero randomness in how they're generated. Everyone computing π gets the same digits. Yet empirically they behave like a high-entropy sequence because they are statistically close to uniform in tested prefixes. Conversely, a pseudorandom generator with a known seed can look random but has near-zero uncertainty for an observer who knows the seed, because the seed determines everything.

Understanding entropy helps us reason rigorously about security (how many bits of entropy does my password really have?), efficiency (what's the theoretical compression limit for this data?), and machine learning (how much information does this feature provide about the label?). It's one of the most powerful and fundamental concepts in computer science, mathematics, and information theory.
