Data Structures I - NTIN066 - Winter Semester 2015/2016 - 5th homework

Deadline: 19th January 2016.
How to submit: By sending an email to ds1@kam.mff.cuni.cz the required attachments.

The rules

The whole code which you submit must be originally created by you without any "insipiration".
Do not share your solution so that your code can not become an "inspiration". The only exception is submitting the code.
The homework and the algorithms may be a subject of conversation. But respect the above rules about non-sharing. That means - dicussing is correct but write the code yourself!
Code in C/C++ (recommended), Java or C#.
Use standard constructs of your programming language. Do not use special library functions.
The solutions which violate the above rules shall be considered as wrong.

The description of the problem

Implement two types of hash tables and two universal families of hash functions. Then compare their performance on two datasets consiting of Insertions and Finds. Insert operations do not create duplicate keys. If an insertion creating a non-unique key shall be performed do not change the table. Also these operations are counted in the performance statistics. The first dataset consists mainly of Insert operations (Find has only 20 %), the second one constists mainly of Find operations (contains 80 % of Find operations).

Implement linear probing exactly as told in the course. Also implement the course version of cuckoo hashing, i.e. use a single table. Choose the size of the table to be M - the number is described in the section Generating the data. In order to measure the performance of cuckoo hashing use the generator parameter C, which bounds the maximal load factor.

The families of hash functions

Let U = {0, ..., 2^u - 1} be the universe and m=2^k be the size of the hash table. Both of the systems map the universe to the space {0, ..., m - 1}. In this homework choose the universe U = {0, ..., 2³² - 1}. However, the data generator generetates the values from the set {0, ..., 2³¹ - 1}.

Tabulation. Let x be an element of the universe U. If we treat x as a bitstring then we can split it into c parts having the same length, i.e. x = x₁x₂...x_c. For each part we randomly initialize a tabulation table. The table contains exactly 2^u/c bit strings of length k. Then h(x) = T₁(x₁) ⊕ T₂(x₂) ⊕ ... ⊕ T_c(x_c). In other words, each part of the bit string x determines an index in its tabulation table. The tables at the indices contain values from the set {0, ..., m - 1}. The value of the hash function equals to the XOR of these values. A fast implementation usually relies on a fixed value of c, choose c = 4.

Multiplicative function. Let x ∈ U and a ∈ U - {0} be a random number chosen uniformly from the universe. Then h(x) = ((a * x) mod |U|) / (|U| / m). For a 32-bit universe we can implement this function effectively by ignoring the overflow of the multiplication and using the bit shifts.

The results

For each dataset create two graphs, i.e. the results are four graphs. In the first graph plot the average number of operations caused by Insert or Find with respect to the load factor of the table. The second one contains the average running time (scale to miliseconds or microseconds according to your needs) of Insert or Find with respect to the load factor of the table. The operation count should determined as follows.

Linear Probing: The number of operations for Insert or Find equals 1 + the length of the sequence which was iterated to find the element or an empty cell.
Cuckoo hashing: Find requires one or two operations. Insert first tries to find the element, this requires one or two operations. When the inserted element is not found, insertion can make several "swaps". Also count the swaps from the subsequent rehash operations. The maximal rehash operations per one Insert set to be 15. In the case that an Insert requires more rehashes eliminate the run from the results. In the case that each run for a given load factor fails in this way this ignore the load factor in the graphs. You also may draw zero values for such load factors in the graphs.

Write down the report consisting of:

the four graphs described above. In each graph there should be 4 curves. In the graph with operations devide the overall number of operations by the count of Finds and all Insertions. Similarily in the graph of running times.
A comment summarizing the behaviour of the algorithms.

Generating the data

Use the program /afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw5/hashgen in the unix lab Rotunda. Run it with the following parameters:

-s XX, where XX are the last two digits of your student ID;
--short, for shorter data;
I, for a dataset mainly consisting of insertions;
F, for a dataset mainly consisting of finds;
C, restricts the load factor to at most 55 % for cuckoo hashing.

The generator produces a text output. Each row describes one

# N M: start of the run for the set of N values hashed into a table of size M - delete the old table and create a new one. The size of the set can be used to determine the load factor in order to draw the graphs. Moreover the size of the table M remains the same for each run. The value N determines the size of the set created by this run, i.e. the number of successful Insert operations.
I X: insertion of the key X;
F X: find the key X in the table.

A single invocation of the generator creates a sequence of runs ranging having load factor from 30 to 90 %,. (the maximal load factor is 55 %, if you used the parameter C). For each load factor there are 4 runs generated. The results shown in the graph should be averaged over these runs.

A sample of the input

# 2 4
I 1
I 2
I 1
I 3
F 1
# 3 4
I 2
I 3
I 7
I 3
I 2
F 2
F 0

Submitting your solution

Send an e-mail with subject of the form 'HW5-55973318' (i.e., HW5-student ID) with the following attachments:

source code used to get your results (in case it consists of multiple files, please use ZIP or TAR to pack them),
a PDF with plots and comments.

The body of your e-mail shall have the following form:


Jmeno: Martin Babka
Pseudonym: Guybrush Threepwood
Tutorial section: Monday-12:20-odd