To judge the distribution, I’ll use the standard deviation: the more lopsided a set of counts, the higher it is.
In [2]:
statistics.stdev([5, 5, 6, 5, 5])
statistics.stdev([4, 4, 10, 4, 4])
Out [2]:
0.4472135954999579
Out [2]:
2.6832815729997477
In [3]:
def evaluate_hash(fn, values):
    # Hash every value into one of 256 buckets, then measure
    # how uneven the bucket counts are.
    counts = [0] * 256
    for val in values:
        h = fn(val)
        counts[h] += 1
    return statistics.stdev(counts)
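The next cell exercises unkeyed_hash, which isn’t shown in this excerpt. Since evaluate_hash expects outputs in the range 0–255, and a keyed variant called pearson_with_key appears later, a plausible sketch is Pearson hashing over a fixed permutation table (the table below is an assumption, so its outputs won’t match the ones shown):

import random

# Assumed: a fixed, pseudorandomly shuffled permutation of 0..255.
_rng = random.Random(0)
PEARSON_TABLE = list(range(256))
_rng.shuffle(PEARSON_TABLE)

def unkeyed_hash(message):
    # Pearson hashing: fold each byte of the message through
    # the permutation table, yielding a value in 0..255.
    h = 0
    for byte in message.encode():
        h = PEARSON_TABLE[h ^ byte]
    return h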
In [4]:
unkeyed_hash("Test message 1")
unkeyed_hash("Test message 2")
unkeyed_hash("Test message 3")
Out [4]:
168
Out [4]:
184
Out [4]:
236
Let’s see how good Python’s built-in CRC32 function is on “normal” input.
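The crc32 below can’t be the raw zlib.crc32, which takes bytes and returns a 32-bit integer; to land in evaluate_hash’s 256 buckets it presumably wraps it and keeps only the low byte. A minimal sketch of that assumed wrapper:

from zlib import crc32 as zlib_crc32

def crc32(value):
    # Assumed wrapper: encode the string and keep only the low byte
    # so the result indexes one of evaluate_hash's 256 buckets.
    return zlib_crc32(value.encode()) & 0xFF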
In [5]:
values = []
for i in range(5000):
    values.append(f"Test {i}")
evaluate_hash(crc32, values)
Out [5]:
1.4656190554459485
Now let’s try to skew it a little by choosing the values more carefully.
In [6]:
values = []
i = 0
while len(values) < 5000:
    val = f"Test {i}"
    # Keep only the values that CRC32 sends to bucket 5.
    if crc32(val) == 5:
        values.append(val)
    i += 1
evaluate_hash(crc32, values)
Out [6]:
312.5
Wow: every one of the 5,000 values landed in the same bucket. Any data structure or partitioning scheme that expects uniform output from its hash function would be devastated.
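The next cell uses pearson_with_key, which isn’t defined in this excerpt either. A minimal sketch of one plausible construction, assuming the key seeds the shuffle of the permutation table so that every key yields a distinct hash function:

import random

def pearson_with_key(key):
    # Assumed construction: the key seeds the shuffle, so different
    # keys produce different permutation tables and hash functions.
    table = list(range(256))
    random.Random(key).shuffle(table)

    def keyed_hash(message):
        h = 0
        for byte in message.encode():
            h = table[h ^ byte]
        return h

    return keyed_hash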
In [7]:
hash1 = pearson_with_key("Quentin Coldwater")
hash2 = pearson_with_key("Josh Hoberman")
values = []
i = 0
while len(values) < 5000:
    val = f"Test {i}"
    # Mine values that all collide into bucket 5 under hash1.
    if hash1(val) == 5:
        values.append(val)
    i += 1
evaluate_hash(hash1, values)
Out [7]:
312.5
A similar result. But these collisions were mined against hash1’s key, so let’s try a different key on the same values.
In [8]:
evaluate_hash(hash2, values)
Out [8]:
4.4860349757800115