Data Structures I - NTIN066 - Winter Semester 2015/2016 - 1st homework

Deadline: 27th October 2015. You can submit your time measurement until 3rd November 2015.
How to submit: By sending an email to ds1@kam.mff.cuni.cz with the source code in the attachment.
Input and output: Your solution has to read the data from the file data.txt and write the sorted data into the file data.out. Assume that the files are in the current directory.
System: The memory of the test computer is 6 GB, OS Linux.
Time limit: Your solution must finish sorting within 35 minutes.

The rules

The description of the problem

Create a program that will sort a file stored on a computer hard drive. Assume that the input file is stored in the current directory and named data.txt The file data.txt is a text file. Each row consists of two 63-bit unsigned values - a key and a value. In the sorted output file data.out the keys should be sorted in increasing order and for each key only the minimum value associated with the key should be output.

Testing and generating the data

Use the unix program /afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw1/gen-data available at Rotunda laboratory to generate the input file data.txt. Just redirect the standard output to the file data.txt since the generator outputs the data to the standard output. Then measure the running time of your solution using one of the computers u-pl0 to u-pl15. For measuring the time use the time utility.

The generated input file data.txt is about 45 GB large and the time limit for sorting is 35 minutes. A solution which does not finish within the time limit is considered to be incorrect. We will also test your program on different but similar data.

When running your program, save the input file data.txt, the outputfile data.out and all temporary files into /tmp. After finishing your work delete all your files in /tmp so that others can run their programs as well. In particular, do not forget to remove the input and output files. If you do not delete your files the machine may become unusable for others as others may not have sufficient access rights to delete your files.

Be considerate. Always check whether someone else is not using a particular computer. Use commands who and top or htop. If you see that someone else is performing computations on a given machine use another one or wait until he or she is done. Please report any bad behaviour or a unusable computer to ds1@kam.mff.cuni.cz. The lab administrator can clear /tmp directory, if necessary.

Submit your solution before the deadline to ds1@kam.mff.cuni.cz along with the details specified below.

Input example

96313 3992064775605989878
6024613951808161023 4071849752967919252
11417427497858712892 3671181978208670736
3894 2867422479361385327
96313 3992064775605758114
6437842131676518652 6510455770151352973

The output

3894 2867422479361385327
96313 3992064775605758114
6024613951808161023 4071849752967919252
6437842131676518652 6510455770151352973
11417427497858712892 3671181978208670736

Generating the test data

$/afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw1/gen-data -s XX >/tmp/data.txt

generates the whole input data, roughly 45 GB.

$/afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw1/gen-data -s XX --short >/tmp/data.txt

generates the data for debugging your program, roughly 3 GB.

The XX are the two last digits of your student ID.

The email format

When submitting using the email report us the measured time in the following format.

Subject should be 'HW1-55973318'. (i.e. HW1-<student number>).

Attach the source code file namedsorter-55973318.cpp.

Name: Martin Babka
Computer: u-pl7
CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
RAM: 6GB
Running time: real 33m41.704s, user 15m30.975s sys 1m26.892s
Date: 1. 11. 2015 15:28:50
XX: 37
Description of the algorithm: I used Huge-sort combined with Super-sort applied on the inverted elements.
Pseudonym for web results: I don't want to be listed.
Tutorial section: Monday-12:20-odd
Compile with: gcc -O3 sorter-55973318.cpp -o sorter-55973318

Submitting

If you measure the time after 27th October, just submit a second similar email with the measured values and without the source code. If you can not achieve the time limit but you are close, try using XX = 333.

Results

See the Czech version of this page.