Pearson Hashing

Simple hash function
Reading time: about 8 minutes

Pearson Hashing is a simple hash function that reads an arbitrary number of bytes and outputs a single byte. Instead of complicated algorithms, the Pearson hash uses a lookup table containing a permutation of all possible bytes.

In this article I will walk through how to implement the function, and how to extend its output into N bytes instead of just one.

Since the main part of the algorithm is the lookup table, let’s see how we might generate one.

Generating lookup tables

In [2]:
def make_lookup(rng):
    l = []
    for i in range(256):
        while True:
            x = rng()
            if x not in l:
                l.append(x)
                break
    return l

This allows us to plug an RNG depending on our use case. For example we can choose Python’s random module.

In [3]:
import random
lookup = make_lookup(lambda: random.randint(0, 256))

# Inspect the first 10 elements
lookup[:10]
Out [3]:
[244, 1, 137, 131, 254, 122, 237, 144, 112, 42]

We can also use /dev/urandom to generate our lookup table like this.

In [4]:
import os
lookup = make_lookup(lambda: os.urandom(1)[0])

# Inspect the first 10 elements
lookup[:10]
Out [4]:
[219, 154, 235, 209, 172, 126, 127, 15, 70, 83]

Hashing (single-byte)

In [5]:
def hash_single(lookup, buf):
    h = 0
    for byte in buf:
        h = lookup[h ^ byte]
    return h
In [6]:
hash_single(lookup, b"Hello, world!")
hash_single(lookup, b"Hello, World!")
hash_single(lookup, b"Arbitrary-length inputs")
Out [6]:
22
Out [6]:
219
Out [6]:
91

Hashing (multi-byte)

Theory

There are two things that affect the output of this function, aside from the buffer we are hashing. The first one is the internal state, which is an 8-bit value. The second one is the lookup table.

If we change either of those, we will be able to get multiple bytes of output.

Different starting points

Let’s start with this method since it is easier.

Info With this method, you the maximum output size you will be able to get is 256 bytes. If more output is needed, initialize an empty state and feed the byte number to each state value before feeding the actual data.

In [7]:
def hash_multiple_starts(lookup, buf, size=1):
    state = list(range(size))
    
    for byte in buf:
        for i in range(size):
            state[i] = lookup[state[i] ^ byte]
    
    return bytes(state)
In [8]:
hash_multiple_starts(lookup, b"Hi there!", 1).hex()
hash_multiple_starts(lookup, b"Hi there!", 2).hex()
hash_multiple_starts(lookup, b"Hi there!", 3).hex()
Out [8]:
'25'
Out [8]:
'25f7'
Out [8]:
'25f723'
In [9]:
hash_multiple_starts(lookup, b"Hello, world!", 1).hex()
hash_multiple_starts(lookup, b"Hello, world!", 2).hex()
hash_multiple_starts(lookup, b"Hello, world!", 3).hex()
Out [9]:
'16'
Out [9]:
'16aa'
Out [9]:
'16aa9e'

Multiple lookup tables

In [10]:
BITS = 64
BYTES = int(BITS / 8)

lookups = [make_lookup(lambda: os.urandom(1)[0]) for _ in range(BYTES)]
In [11]:
def hash_multiple_lookups(lookups, buf):
    state = [0 for _ in lookups]
    
    for byte in buf:
        for i, s in enumerate(state):
            state[i] = lookups[i][s ^ byte]
    
    return bytes(state)
In [12]:
hash_multiple_lookups(lookups, b"Hello world").hex()
hash_multiple_lookups(lookups, b"Hello World").hex()
Out [12]:
'c7a28c34f3d67d47'
Out [12]:
'79a1dd2b2829948a'

Use Cases

Randomness extractor

In [13]:
def biased_coin_flips(n):
    res = ""
    
    for _ in range(n):
        while True:
            val = os.urandom(1)[0]
            if val < 100:
                break
                
        if val < 90:
            res += 'H'
        else:
            res += 'T'
    return res

biased_coin_flips(40)
Out [13]:
'THHTHTHHTHHHHTHHHHHHHHHHHHHHHHHHHTHHHHHH'

Here’s a biased coin that results in Heads a lot more often than Tails. To be precise, 90% of all flips result in Heads. Let’s calculate how many flips we’ll need to extract 32 unbiased bits.

In [14]:
bits_per_flip = -math.log(0.9, 2)

32 / bits_per_flip
Out [14]:
210.5220313267387
In [15]:
for _ in range(10):
    data = biased_coin_flips(211).encode('ascii')
    hash_multiple_starts(lookup, data, 4).hex()
Out [15]:
'332b6166'
Out [15]:
'caafb949'
Out [15]:
'7418e373'
Out [15]:
'b46f8495'
Out [15]:
'10067eb6'
Out [15]:
'fce7e552'
Out [15]:
'7bf477e9'
Out [15]:
'b18929fc'
Out [15]:
'268538cc'
Out [15]:
'd183a329'

The following pages link here

Citation

If you find this work useful, please cite it as:
@article{yaltirakliwikipearsonhashing,
  title   = "Pearson Hashing",
  author  = "Yaltirakli, Gokberk",
  journal = "gkbrk.com",
  year    = "2024",
  url     = "https://www.gkbrk.com/wiki/pearson-hashing/"
}
Not using BibTeX? Click here for more citation styles.
IEEE Citation
Gokberk Yaltirakli, "Pearson Hashing", November, 2024. [Online]. Available: https://www.gkbrk.com/wiki/pearson-hashing/. [Accessed Nov. 12, 2024].
APA Style
Yaltirakli, G. (2024, November 12). Pearson Hashing. https://www.gkbrk.com/wiki/pearson-hashing/
Bluebook Style
Gokberk Yaltirakli, Pearson Hashing, GKBRK.COM (Nov. 12, 2024), https://www.gkbrk.com/wiki/pearson-hashing/

Comments

Comment by Alex
2024-03-18 at 22:00
Spam probability: 0.105%

Hello Gokberg. This is not good enough algorithm to generate a Pearson hashing permutation table. The reason is: permutation table should not map some input value to the same value. But you have no control over this, you only rely on the properties of the random number generator (which is in view of its sequence properties (especially filtered) may have some peculiarities). It is possible to construct a permutation table without weakness, in which all values are always completely permuted.

© 2024 Gokberk Yaltirakli