Data Structures I - NTIN066 - Winter Semester 2015/2016 - 5th homework

Deadline: 19th January 2016.
How to submit: By sending an email to ds1@kam.mff.cuni.cz the required attachments.

The rules

The description of the problem

Implement two types of hash tables and two universal families of hash functions. Then compare their performance on two datasets consiting of Insertions and Finds. Insert operations do not create duplicate keys. If an insertion creating a non-unique key shall be performed do not change the table. Also these operations are counted in the performance statistics. The first dataset consists mainly of Insert operations (Find has only 20 %), the second one constists mainly of Find operations (contains 80 % of Find operations).

Implement linear probing exactly as told in the course. Also implement the course version of cuckoo hashing, i.e. use a single table. Choose the size of the table to be M - the number is described in the section Generating the data. In order to measure the performance of cuckoo hashing use the generator parameter C, which bounds the maximal load factor.

The families of hash functions

Let U = {0, ..., 2u - 1} be the universe and m=2k be the size of the hash table. Both of the systems map the universe to the space {0, ..., m - 1}. In this homework choose the universe U = {0, ..., 232 - 1}. However, the data generator generetates the values from the set {0, ..., 231 - 1}.

Tabulation. Let x be an element of the universe U. If we treat x as a bitstring then we can split it into c parts having the same length, i.e. x = x1x2...xc. For each part we randomly initialize a tabulation table. The table contains exactly 2u/c bit strings of length k. Then h(x) = T1(x1) ⊕ T2(x2) ⊕ ... ⊕ Tc(xc). In other words, each part of the bit string x determines an index in its tabulation table. The tables at the indices contain values from the set {0, ..., m - 1}. The value of the hash function equals to the XOR of these values. A fast implementation usually relies on a fixed value of c, choose c = 4.

Multiplicative function. Let x ∈ U and a ∈ U - {0} be a random number chosen uniformly from the universe. Then h(x) = ((a * x) mod |U|) / (|U| / m). For a 32-bit universe we can implement this function effectively by ignoring the overflow of the multiplication and using the bit shifts.

The results

For each dataset create two graphs, i.e. the results are four graphs. In the first graph plot the average number of operations caused by Insert or Find with respect to the load factor of the table. The second one contains the average running time (scale to miliseconds or microseconds according to your needs) of Insert or Find with respect to the load factor of the table. The operation count should determined as follows.

Linear Probing
The number of operations for Insert or Find equals 1 + the length of the sequence which was iterated to find the element or an empty cell.
Cuckoo hashing
Find requires one or two operations. Insert first tries to find the element, this requires one or two operations. When the inserted element is not found, insertion can make several "swaps". Also count the swaps from the subsequent rehash operations. The maximal rehash operations per one Insert set to be 15. In the case that an Insert requires more rehashes eliminate the run from the results. In the case that each run for a given load factor fails in this way this ignore the load factor in the graphs. You also may draw zero values for such load factors in the graphs.

Write down the report consisting of:

Generating the data

Use the program /afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw5/hashgen in the unix lab Rotunda. Run it with the following parameters:

The generator produces a text output. Each row describes one

A single invocation of the generator creates a sequence of runs ranging having load factor from 30 to 90 %,. (the maximal load factor is 55 %, if you used the parameter C). For each load factor there are 4 runs generated. The results shown in the graph should be averaged over these runs.

A sample of the input

# 2 4
I 1
I 2
I 1
I 3
F 1
# 3 4
I 2
I 3
I 7
I 3
I 2
F 2
F 0

Submitting your solution

Send an e-mail with subject of the form 'HW5-55973318' (i.e., HW5-student ID) with the following attachments:

The body of your e-mail shall have the following form:


Jmeno: Martin Babka
Pseudonym: Guybrush Threepwood
Tutorial section: Monday-12:20-odd