Pearson Hashing is a simple hash function that reads an arbitrary number of bytes and outputs a single byte. Instead of complicated algorithms, the Pearson hash uses a lookup table containing a permutation of all possible bytes.
In this article I will walk through how to implement the function, and how to extend its output into N bytes instead of just one.
Since the main part of the algorithm is the lookup table, let’s see how we might generate one.
Generating lookup tables
def make_lookup(rng): l = [] for i in range(256): while True: x = rng() if x not in l: l.append(x) break return l
This allows us to plug an RNG depending on our use case. For example we can choose Python’s random
module.
import random lookup = make_lookup(lambda: random.randint(0, 256)) # Inspect the first 10 elements lookup[:10]
[244, 1, 137, 131, 254, 122, 237, 144, 112, 42]
We can also use /dev/urandom
to generate our lookup table like this.
import os lookup = make_lookup(lambda: os.urandom(1)[0]) # Inspect the first 10 elements lookup[:10]
[219, 154, 235, 209, 172, 126, 127, 15, 70, 83]
Hashing (single-byte)
def hash_single(lookup, buf): h = 0 for byte in buf: h = lookup[h ^ byte] return h
hash_single(lookup, b"Hello, world!") hash_single(lookup, b"Hello, World!") hash_single(lookup, b"Arbitrary-length inputs")
22
219
91
Hashing (multi-byte)
Theory
There are two things that affect the output of this function, aside from the buffer we are hashing. The first one is the internal state, which is an 8-bit value. The second one is the lookup table.
If we change either of those, we will be able to get multiple bytes of output.
Different starting points
Let’s start with this method since it is easier.
Info With this method, you the maximum output size you will be able to get is 256 bytes. If more output is needed, initialize an empty state and feed the byte number to each state value before feeding the actual data.
def hash_multiple_starts(lookup, buf, size=1): state = list(range(size)) for byte in buf: for i in range(size): state[i] = lookup[state[i] ^ byte] return bytes(state)
hash_multiple_starts(lookup, b"Hi there!", 1).hex() hash_multiple_starts(lookup, b"Hi there!", 2).hex() hash_multiple_starts(lookup, b"Hi there!", 3).hex()
'25'
'25f7'
'25f723'
hash_multiple_starts(lookup, b"Hello, world!", 1).hex() hash_multiple_starts(lookup, b"Hello, world!", 2).hex() hash_multiple_starts(lookup, b"Hello, world!", 3).hex()
'16'
'16aa'
'16aa9e'
Multiple lookup tables
BITS = 64 BYTES = int(BITS / 8) lookups = [make_lookup(lambda: os.urandom(1)[0]) for _ in range(BYTES)]
def hash_multiple_lookups(lookups, buf): state = [0 for _ in lookups] for byte in buf: for i, s in enumerate(state): state[i] = lookups[i][s ^ byte] return bytes(state)
hash_multiple_lookups(lookups, b"Hello world").hex() hash_multiple_lookups(lookups, b"Hello World").hex()
'c7a28c34f3d67d47'
'79a1dd2b2829948a'
Use Cases
Randomness extractor
def biased_coin_flips(n): res = "" for _ in range(n): while True: val = os.urandom(1)[0] if val < 100: break if val < 90: res += 'H' else: res += 'T' return res biased_coin_flips(40)
'THHTHTHHTHHHHTHHHHHHHHHHHHHHHHHHHTHHHHHH'
Here’s a biased coin that results in Heads a lot more often than Tails. To be precise, 90% of all flips result in Heads. Let’s calculate how many flips we’ll need to extract 32 unbiased bits.
bits_per_flip = -math.log(0.9, 2) 32 / bits_per_flip
210.5220313267387
for _ in range(10): data = biased_coin_flips(211).encode('ascii') hash_multiple_starts(lookup, data, 4).hex()
'332b6166'
'caafb949'
'7418e373'
'b46f8495'
'10067eb6'
'fce7e552'
'7bf477e9'
'b18929fc'
'268538cc'
'd183a329'