Blog on Birthday Paradox

This is a short blog on the Birthday Paradox - a followup on my statistics blog really wanting to talk about an edge case my algorithm does not cover.

The Birthday Paradox

This is a bit of a misnomer, it isn't a paradox in itself, just an intuitive contradiction, if a term like that exists. In a room of only 23 people, the chance that two people share the same birthday is 50%. You heard that right, 50%, which appears to not make any sense. Like why? How does this even work? It's a 1/365 chance to share the same birthday right? Well not exactly, the math there isn't quite correct nor is it simple.

The likelihood that two people share the same birthday is difficult to compute, so like most statistics problems, a different way to think about this is thinking about the inverse, then subtracting it from 1. So basically, what is the likelihood of no two people sharing the same birthday? Well, this probability is fairly high, at (365 * 364)/(365 * 365), but what if we extend this to X people?

Permutations

Before I go into detail, I need to explain permutations. Permutations are a statistical way to count how many times you can select some amount of objects from a collection of objects. The actual formula is this:

The amount of ways you can select r objects from a collection of n objects is n factorial removing n - r factorial from the result. For example, picking 2 cards from a set of 52, I have 52 * 51 ways to select that card.

Tying Permutations into the Paradox

Permutations offer us a way to count a problem. In a room of X people, we have 365!/(365 - X)! ways to uniquely select birthdays, such that no two birthdays are the same. In order to calculate the probability of getting this compared to all 365 days of the year, we would divide that factorial result by 365 ** X. I coded this algorithm in Python to figure out this probability. The following code snippet does roughly what I describe, minus the factorial part in exchange for a for loop:

outcome_num: float = 365

running_total: float = 1 # running probability that we don't encounter at least two of the same outcome

for i in range(1, 23):
    running_total *= ((outcome_num - i) / outcome_num)


print(f'''
    Chances we get 2 same outcomes    : {round((1 - running_total) * 100, 2)}%
    Chances we get all unique outcomes: {round(running_total * 100, 2)}%
''')

In this snippet, I use a for loop and accumulator pattern to multiply the probabilities by a running total (initially 1.0). If you run this code snippet, you will find the chance that you get two differing outcomes decreases quite significantly as the more "people" are in the room. This is because we are selecting birthdays without replacement (hence unique birthdays), so the odds of selecting a unique birthday get much smaller the farther out we go. I was able to get the following output:

Why this is a problem for Hash Collisions

A Hash is a one way function that transforms an input of arbitrary size to an input of fixed size, normally used as fingerprints to verify the integrity of an input or for cryptographic signatures. Hashes have the following properties:

These two properties ensure that hash functions are moderately safe from collisions, where two different inputs produce the same hash.

For a practical application for this, check out my file integrity tool in Go, which stores hashes of files/folders in a SQL lite database, where you can compare hashes of those files/folders over time and see what changes/what needs to be monitored.

Now, for hash functions are very similar to the birthday paradox. The probability it's someone's birthday (1/365) is analogous to the probability that an input hashes to the given output (1/(characters that can go in the hash set ** length of hash). The number of people picked as a sample is analogous to how many inputs are hashed using the hash function.

So, hashing 1000 times means for hash functions with a few characters to pick from for the output, the chances two inputs hash to the same output is high, which is a problem for things like password storing and DNS queries, as those rely on hashes for secure storage and fingerprints.

Best practices

The best way to protect against a hash collision is to use a hash function with a lot of characters to pick from per entry, like SHA-256, and increase the length of the hashed output to extend the number of hashes required to get at 2 matching inputs as far as possible. This would make it fairly difficult for computers to brute force this problem, and make it easier to detect if they are trying to do so by giving you more time to respond!

Birthday Paradox and its Implications on Hash Collisions and Security

The Birthday Paradox

Permutations

Tying Permutations into the Paradox

Why this is a problem for Hash Collisions

Best practices