1.1 Concepts – Hash Tables | unit 1 hashing

DSA

Hashing

1.1 Concepts – Hash Tables

Hash Table is a data structure which stores data in an associative manner. In a hash table, data is stored in an array format, where each data value has its own unique index value. Access of data becomes very fast if we know the index of the desired data.

Thus, it becomes a data structure in which insertion and search operations are very fast irrespective of the size of the data. Hash Table uses an array as a storage medium and uses hash technique to generate an index where an element is to be inserted or is to be located from.

Hashing

Hashing is a technique to convert a range of key values into a range of indexes of an array. We're going to use modulo operator to get a range of key values. Consider an example of hash table of size 20, and the following items are to be stored. Item are in the (key, value) format.

Hash Function

Fig1- Hash function

(1,20)

(2,70)

(42,80)

(4,25)

(12,44)

(14,32)

(17,11)

(13,78)

(37,98)

Sr. No.	Key	Hash	Array Index
1	1	1 % 20 = 1	1
2	2	2 % 20 = 2	2
3	42	42 % 20 = 2	2
4	4	4 % 20 = 4	4
5	12	12 % 20 = 12	12
6	14	14 % 20 = 14	14
7	17	17 % 20 = 17	17
8	13	13 % 20 = 13	13
9	37	37 % 20 = 17	17

Linear Probing

As we can see, it may happen that the hashing technique is used to create an already used index of the array. In such a case, we can search the next empty location in the array by looking into the next cell until we find an empty cell. This technique is called linear probing.

Sr. No.	Key	Hash	Array Index	After Linear Probing, Array Index
1	1	1 % 20 = 1	1	1
2	2	2 % 20 = 2	2	2
3	42	42 % 20 = 2	2	3
4	4	4 % 20 = 4	4	4
5	12	12 % 20 = 12	12	12
6	14	14 % 20 = 14	14	14
7	17	17 % 20 = 17	17	17
8	13	13 % 20 = 13	13	13
9	37	37 % 20 = 17	17	18

Basic Operations

Following are the basic primary operations of a hash table.

Search − Searches an element in a hash table.

Insert − inserts an element in a hash table.

delete − Deletes an element from a hash table.

DataItem

Define a data item having some data and key, based on which the search is to be conducted in a hash table.

struct DataItem {

int data;

int key;

};

Hash Method

Define a hashing method to compute the hash code of the key of the data item.

int hashCode(int key){

return key % SIZE;

}

Search Operation

Whenever an element is to be searched, compute the hash code of the key passed and locate the element using that hash code as index in the array. Use linear probing to get the element ahead if the element is not found at the computed hash code.

Example

struct DataItem *search(int key) {

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty

while(hashArray[hashIndex] != NULL) {

if(hashArray[hashIndex]->key == key)

return hashArray[hashIndex];

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

return NULL;

}

Insert Operation

Whenever an element is to be inserted, compute the hash code of the key passed and locate the index using that hash code as an index in the array. Use linear probing for empty location, if an element is found at the computed hash code.

Example

void insert(int key,int data) {

struct DataItem *item = (struct DataItem*) malloc(sizeof(struct DataItem));

item->data = data;

item->key = key;

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty or deleted cell

while(hashArray[hashIndex] != NULL && hashArray[hashIndex]->key != -1) {

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

hashArray[hashIndex] = item;

}

Delete Operation

Whenever an element is to be deleted, compute the hash code of the key passed and locate the index using that hash code as an index in the array. Use linear probing to get the element ahead if an element is not found at the computed hash code. When found, store a dummy item there to keep the performance of the hash table intact.

Example

struct DataItem* delete(struct DataItem* item) {

int key = item->key;

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty

while(hashArray[hashIndex] !=NULL) {

if(hashArray[hashIndex]->key == key) {

struct DataItem* temp = hashArray[hashIndex];

//assign a dummy item at deleted position

hashArray[hashIndex] = dummyItem;

return temp;

}

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

return NULL;

}

Hash Table Program in C

Hash Table is a data structure which stores data in an associative manner. In hash table, the data is stored in an array format where each data value has its own unique index value. Access of data becomes very fast, if we know the index of the desired data.

Implementation in C

#include <stdio.h>

#include <string.h>

#include <stdlib.h>

#include <stdbool.h>

#define SIZE 20

struct DataItem {

int data;

int key;

};

struct DataItem* hashArray[SIZE];

struct DataItem* dummyItem;

struct DataItem* item;

int hashCode(int key) {

return key % SIZE;

}

struct DataItem *search(int key) {

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty

while(hashArray[hashIndex] != NULL) {

if(hashArray[hashIndex]->key == key)

return hashArray[hashIndex];

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

return NULL;

}

void insert(int key,int data) {

struct DataItem *item = (struct DataItem*) malloc(sizeof(struct DataItem));

item->data = data;

item->key = key;

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty or deleted cell

while(hashArray[hashIndex] != NULL && hashArray[hashIndex]->key != -1) {

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

hashArray[hashIndex] = item;

}

struct DataItem* delete(struct DataItem* item) {

int key = item->key;

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty

while(hashArray[hashIndex] != NULL) {

if(hashArray[hashIndex]->key == key) {

struct DataItem* temp = hashArray[hashIndex];

//assign a dummy item at deleted position

hashArray[hashIndex] = dummyItem;

return temp;

}

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

return NULL;

}

void display() {

int i = 0;

for(i = 0; i<SIZE; i++) {

if(hashArray[i] != NULL)

printf(" (%d,%d)",hashArray[i]->key,hashArray[i]->data);

else

printf(" ~~ ");

}

printf("\n");

}

int main() {

dummyItem = (struct DataItem*) malloc(sizeof(struct DataItem));

dummyItem->data = -1;

dummyItem->key = -1;

insert(1, 20);

insert(2, 70);

insert(42, 80);

insert(4, 25);

insert(12, 44);

insert(14, 32);

insert(17, 11);

insert(13, 78);

insert(37, 97);

display();

item = search(37);

if(item != NULL) {

printf("Element found: %d\n", item->data);

} else {

printf("Element not found\n");

}

delete(item);

item = search(37);

if(item != NULL) {

printf("Element found: %d\n", item->data);

} else {

printf("Element not found\n");

}

If we compile and run the above program, it will produce the following result −

Output

~~ (1,20) (2,70) (42,80) (4,25) ~~ ~~ ~~ ~~ ~~ ~~ ~~ (12,44) (13,78) (14,32) ~~ ~~ (17,11) (37,97) ~~

Element found: 97

Element not found

Key takeaway

1.2 Hash Function

What is Hashing?

Hashing is the process of mapping large amount of data item to smaller table with the help of hashing function.

Hashing is also known as Hashing Algorithm or Message Digest Function.

It is a technique to convert a range of key values into a range of indexes of an array.

It is used to facilitate the next level searching method when compared with the linear or binary search.

Hashing allows to update and retrieve any data entry in a constant time O(1).

Constant time O(1) means the operation does not depend on the size of the data.

Hashing is used with a database to enable items to be retrieved more quickly.

It is used in the encryption and decryption of digital signatures.

What is Hash Function?

A fixed process converts a key to a hash key is known as a Hash Function.

This function takes a key and maps it to a value of a certain length which is called a Hash value or Hash.

Hash value represents the original string of characters, but it is normally smaller than the original.

It transfers the digital signature and then both hash value and signature are sent to the receiver. Receiver uses the same hash function to generate the hash value and then compares it to that received with the message.

If the hash values are same, the message is transmitted without errors.

What is Hash Table?

Hash table or hash map is a data structure used to store key-value pairs.

It is a collection of items stored to make it easy to find them later.

It uses a hash function to compute an index into an array of buckets or slots from which the desired value can be found.

It is an array of list where each list is known as bucket.

It contains value based on the key.

Hash table is used to implement the map interface and extends Dictionary class.

Hash table is synchronized and contains only unique elements.

hash table

Fig 2 – Hash table

The above figure shows the hash table with the size of n = 10. Each position of the hash table is called as Slot. In the above hash table, there are n slots in the table, names = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Slot 0, slot 1, slot 2 and so on. Hash table contains no items, so every slot is empty.

As we know the mapping between an item and the slot where item belongs in the hash table is called the hash function. The hash function takes any item in the collection and returns an integer in the range of slot names between 0 to n-1.

Suppose we have integer items {26, 70, 18, 31, 54, 93}. One common method of determining a hash key is the division method of hashing and the formula is:

Hash Key = Key Value % Number of Slots in the Table

Division method or reminder method takes an item and divides it by the table size and returns the remainder as its hash value.

Data Item	Value % No. of Slots	Hash Value
26	26 % 10 = 6	6
70	70 % 10 = 0	0
18	18 % 10 = 8	8
31	31 % 10 = 1	1
54	54 % 10 = 4	4
93	93 % 10 = 3	3

hash table

After computing the hash values, we can insert each item into the hash table at the designated position as shown in the above figure. In the hash table, 6 of the 10 slots are occupied, it is referred to as the load factor and denoted by, λ = No. of items / table size. For example, λ = 6/10.

It is easy to search for an item using hash function where it computes the slot name for the item and then checks the hash table to see if it is present.

Constant amount of time O(1) is required to compute the hash value and index of the hash table at that location.

Linear Probing

Take the above example, if we insert next item 40 in our collection, it would have a hash value of 0 (40 % 10 = 0). But 70 also had a hash value of 0, it becomes a problem. This problem is called as Collision or Clash. Collision creates a problem for hashing technique.

Linear probing is used for resolving the collisions in hash table, data structures for maintaining a collection of key-value pairs.

Linear probing was invented by Gene Amdahl, Elaine M. McGraw and Arthur Samuel in 1954 and analyzed by Donald Knuth in 1963.

It is a component of open addressing scheme for using a hash table to solve the dictionary problem.

The simplest method is called Linear Probing. Formula to compute linear probing is:

P = (1 + P) % (MOD) Table_size

For example,

hash table

If we insert next item 40 in our collection, it would have a hash value of 0 (40 % 10 = 0). But 70 also had a hash value of 0, it becomes a problem.

Linear probing solves this problem:

P = H(40)

44 % 10 = 0

Position 0 is occupied by 70. so we look elsewhere for a position to store 40.

Using Linear Probing:

P= (P + 1) % table-size

0 + 1 % 10 = 1

But, position 1 is occupied by 31, so we look elsewhere for a position to store 40.

Using linear probing, we try next position: 1 + 1 % 10 = 2

Position 2 is empty, so 40 is inserted there.

hash table

Key takeaway

A fixed process converts a key to a hash key is known as a Hash Function.

This function takes a key and maps it to a value of a certain length which is called a Hash value or Hash.

Hash value represents the original string of characters, but it is normally smaller than the original.

If the hash values are same, the message is transmitted without errors.

1.3 Basic Operations, Bucket, Collision, Probe, Synonym, Overflow, Open Hashing, Closed Hashing, Perfect Hash Function, Load Density, Full Table, Load Factor, Rehashing, Issues in Hashing

1.3.1 Basic Operations

Following are the basic primary operations of a hash table.

Search − Searches an element in a hash table.

Insert − inserts an element in a hash table.

delete − Deletes an element from a hash table.

DataItem

Define a data item having some data and key, based on which the search is to be conducted in a hash table.

struct DataItem {

int data;

int key;

};

Hash Method

Define a hashing method to compute the hash code of the key of the data item.

int hashCode(int key){

return key % SIZE;

}

Search Operation

Example

struct DataItem *search(int key) {

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty

while(hashArray[hashIndex] != NULL) {

if(hashArray[hashIndex]->key == key)

return hashArray[hashIndex];

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

return NULL;

}

Insert Operation

Example

void insert(int key,int data) {

struct DataItem *item = (struct DataItem*) malloc(sizeof(struct DataItem));

item->data = data;

item->key = key;

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty or deleted cell

while(hashArray[hashIndex] != NULL && hashArray[hashIndex]->key != -1) {

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

hashArray[hashIndex] = item;

}

Delete Operation

Example

struct DataItem* delete(struct DataItem* item) {

int key = item->key;

//get the hash

int hashIndex = hashCode(key);

//move in array until an empty

while(hashArray[hashIndex] !=NULL) {

if(hashArray[hashIndex]->key == key) {

struct DataItem* temp = hashArray[hashIndex];

//assign a dummy item at deleted position

hashArray[hashIndex] = dummyItem;

return temp;

}

//go to next cell

++hashIndex;

//wrap around the table

hashIndex %= SIZE;

}

return NULL;

}

1.3.2 Hash Buckets

In computing, a hash table [hash map] is a data structure that provides virtually direct access to objects based on a key [a unique String or Integer]. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. Here are the main features of the key used:

The key used can be your SSN, your telephone number, account number, etc

Must have unique keys

Each key is associated with–mapped to–a value

Hash buckets are used to apportion data items for sorting or lookup purposes. The aim of this work is to weaken the linked lists so that searching for a specific item can be accessed within a shorter timeframe.

Fig 3 – Hash Bucket

A hash table that uses buckets is actually a combination of an array and a linked list. Each element in the array [the hash table] is a header for a linked list. All elements that hash into the same location will be stored in the list. The hash function assigns each record to the first slot within one of the buckets. In case the slot is occupied, then the bucket slots will be searched sequentially until an open slot is found. In case a bucket is completely full, the record will get stored in an overflow bucket of infinite capacity at the end of the table. All buckets share the same overflow bucket. However, a good implementation will use a hash function that distributes the records evenly among the buckets so that as few records as possible go into the overflow bucket.

1.3.3 What is Collision?

Since a hash function gets us a small number for a key which is a big integer or string, there is a possibility that two keys result in the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called collision and must be handled using some collision handling technique.

What are the chances of collisions with large table?

Collisions are very likely even if we have big table to store keys. An important observation is Birthday Paradox. With only 23 persons, the probability that two people have the same birthday is 50%.

How to handle Collisions?

There are mainly two methods to handle collision:

1) Separate Chaining

2) Open Addressing

In this article, only separate chaining is discussed. We will be discussing Open addressing in the next post.

Separate Chaining:

The idea is to make each cell of hash table point to a linked list of records that have same hash function value.

Let us consider a simple hash function as “key mod 7” and sequence of keys as 50, 700, 76, 85, 92, 73, 101.

Fig 4 - Example

Advantages:
1) Simple to implement.

2) Hash table never fills up; we can always add more elements to the chain.

3) Less sensitive to the hash function or load factors.

4) It is mostly used when it is unknown how many and how frequently keys may be inserted or deleted.

Disadvantages:
1) Cache performance of chaining is not good as keys are stored using a linked list. Open addressing provides better cache performance as everything is stored in the same table.

2) Wastage of Space (Some Parts of hash table are never used)

3) If the chain becomes long, then search time can become O(n) in the worst case.

4) Uses extra space for links.

Performance of Chaining:

Performance of hashing can be evaluated under the assumption that each key is equally likely to be hashed to any slot of table (simple uniform hashing).

m = Number of slots in hash table

n = Number of keys to be inserted in hash table

Load factor α = n/m

Expected time to search = O(1 + α)

Expected time to delete = O(1 + α)

Time to insert = O(1)

Time complexity of search insert and delete is

O(1) if α is O(1)

1.3.4 Probing

In this section we will see what is linear probing technique in open addressing scheme. There is an ordinary hash function h´(x): U → {0, 1, . . ., m – 1}. In open addressing scheme, the actual hash function h(x) is taking the ordinary hash function h’(x) and attach some another part with it to make one linear equation.

h´(𝑥) = 𝑥 𝑚𝑜𝑑 𝑚

ℎ(𝑥, 𝑖) = (ℎ´(𝑥) + 𝑖)𝑚𝑜𝑑 𝑚

The value of i| = 0, 1, . . ., m – 1. So we start from i = 0, and increase this until we get one free space. So initially when i = 0, then the h(x, i) is same as h´(x).

Example

Suppose we have a list of size 20 (m = 20). We want to put some elements in linear probing fashion. The elements are {96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}

Hash Table

Hashing

We have all used a dictionary, and many of us have a word processor equipped with a limited dictionary, that is a spelling checker. We consider the dictionary, as an ADT. Examples of dictionaries are found in many applications, including the spelling checker, the thesaurus, the data dictionary found in database management applications, and the symbol tables generated by loaders, assemblers, and compilers.

In computer science, we generally use the term symbol table rather than dictionary, when referring to the ADT. Viewed from this perspective, we define the symbol table as a set of name-attribute pairs. The characteristics of the name and attribute vary according to the application. For example, in a thesaurus, the name is a word, and the attribute is a list of synonyms for the word; in a symbol table for a compiler, the name is an identifier, and the attributes might include an initial value and a list of lines that use the identifier.

Generally, we would want to perform the following operations on any symbol table:

(1) Determine if a particular name is in the table

(2) Retrieve the attributes of that name

(3) Modify the attributes of that name

(4) Insert a new name and its attributes

(5) Delete a name and its attributes

There are only three basic operations on symbol tables: searching, inserting, and deleting. The technique for those basic operations is hashing. Unlike search tree methods that rely on identifier comparisons to perform a search, hashing relies on a formula called the hash function. We divide our discussion of hashing into two parts: static hashing and dynamic hashing.

Static Hashing

Hash Tables

In static hashing, we store the identifiers in a fixed size table called a hash table. We use an arithmetic function, f, to determine the address, or location, of an identifier, x, in the table. Thus, f (x) gives the hash, or home address, of x in the table. The hash table ht is stored in sequential memory locations that are partitioned into b buckets, ht [0], …, ht [b-1]. Each bucket has s slots. Usually s = 1, which means that each bucket holds exactly one record. We use the hash function f (x) to transform the identifiers x into an address in the hash table. Thus, f (x) maps the set of possible identifiers onto the integers 0 through b-1.

Definition

The identifier density of a hash table is the ratio n/T, where n is the number of identifiers in the table. The loading density or loading factor of a hash table is =n /(sb).

Example 7.1: Consider the hash table ht with b = 26 buckets and s = 2. We have n = 10 distinct identifiers, each representing a C library function. This table has a loading factor, , of 10/52 = 0.19. The hash function must map each of the possible identifiers onto one of the number, 0-25. We can construct a fairly simple hash function by associating the letter, a-z, with the number, 0-25, respectively, and then defining the hash function, f (x), as the first character of x. Using this scheme, the library functions acos, define, float, exp, char, atan, ceil, floor, clock, and ctime hash into buckets 0, 3, 5, 4, 2, 0, 2, 5, 2, and 2, respectively.

Hashing Function

A hash function, f, transforms an identifier, x, into a bucket address in the hash table. We want to hash function that is easy to compute and that minimizes the number of collisions. Although the hash function we used in Example 7.1 was easy to compute, using only the first character in an identifier is bound to have disastrous consequences. We know that identifiers, whether they represent variable names in a program, word in a dictionary, or names in a telephone book, cluster around certain letters of the alphabet. To avoid collisions, the hash function should depend on all the characters in an identifier. It also should be unbiased. That is, if we randomly choose an identifier, x, from the identifier space (the universe of all possible identifiers), the probability that f(x) = i is 1/b for all buckets i. This means that a random x has an equal property a uniform hash function.

There are several types of uniform hash functions, and we shall describe four of them. We assume that the identifiers have been suitably transformed into a numerical equivalent.

Mid-square:

The middle of square hash function is frequently used in symbol table application. We compute the function fm by squaring the identifier and then using an appropriate number of bits from the middle of the square to obtain the bucket address. Since the middle bits of the square usually depend upon all the characters in an identifier, there is a high probability that different identifiers will produce different hash addresses, even when some of the characters are the same. The number of bits used to obtain the bucket address depends on the table size. If we use r bits, the range of the values is 2r. Therefore, the size of the hash table should be a power of 2 when we use this scheme.

Division:

This hash function is using the modulus (%) operator. We divide the identifier x by some number M and use the remainder as the hash address of x. The hash function is: fD (x) = x % M This gives bucket address that range from 0 to M-1, where M = the table size. The choice of M is critical. In the division function, if M is a power of 2, then fD (x) depends only on the least significant bits of x. Such a choice for M results in a biased use of the hash table when several of the identifiers in use have the same suffix. If M is divisible by 2, then odd keys are mapped to odd buckets, and even keys are mapped to even buckets. Hence, an even M results in a biased use of the table when a majority of identifiers are even or when majorities are odd.

Folding:

In this method, we partition the identifier x into several parts. All parts, except for the last one have the same length. We then add the parts together to obtain the hash address for x. There are two ways of carrying out this addition. In the first method, we shift all parts except for the last one, so that the least significant bit of each part lines up with the corresponding bit of the last part. We then add the parts together to obtain f(x). This method is known as shift folding. The second method, known as folding at the boundaries, reverses every other partition before adding.

Digit Analysis:

The last method we will examine, digit analysis, is used with static files. A static file is one in which all the identifiers are known in advance. Using this method, we first transform the identifiers into numbers using some radix, r. We then examine the digits of each identifier, deleting those digits that have the most skewed distributions. We continue deleting digits until the number of remaining digits is small enough to give an address in the range of the hash table. The digits used to calculate the hash address must be the same for all identifiers and must not have abnormally high peaks or valleys (the standard deviation must be small).

1.3.5 Overflow Handling

There are two methods for detecting collisions and overflows in a static hash table; each method using different data structure to represent the hash table.

Tow Methods:

Linear Open Addressing (Linear probing)

Chaining

Linear Open Addressing

When use linear open addressing, the hash table is represented as a one-dimensional array with indices that range from 0 to the desired table size-1. The component type of the array is a struct that contains at least a key field. Since the keys are usually words, we use a string to denote them. Creating the hash table ht with one slot per bucket is:

#define MAX_CHAR 10 /* max number of characters in an identifier */

#define TABLE_SIZE 13 /*max table size = prime number */

struct element

{

char key[MAX_CHAR];

/* other fields */

};

element hash_table[TABLE_SIZE];

Before inserting any elements into this table, we must initialize the table to represent the situation where all slots are empty. This allows us to detect overflows and collisions when we insert elements into the table. The obvious choice for an empty slot is the empty string since it will never be a valid key in any application.

Initialization of a hash table:

void init_table (element ht[ ])

{

short i;

for (i = 0; i < TABLE_SIZE; i ++)

ht [i].key[0] = NULL;

}

To insert a new element into the hash table we convert the key field into a natural number, and then apply one of the hash functions discussed in Hashing Function. We can transform a key into a number if we convert each character into a number and then add these numbers together. The function transform (below) uses this simplistic approach. To find the hash address of the transformed key, hash (below) uses the division method.

short transform (char *key)

{/* simple additive approach to create a natural number that is within the integer range */

short number = 0;

while (*key)

number += *key++;

return number;

}

short hash (char *key)

{/* transform key to a natural number, and return this result modulus the table size */

return (transform (key) % TABLE_SIZE);

}

To implement the linear probing strategy, we first compute f (x) for identifier x and them examine the hash table buckets ht[(f(x) + j) % TABLE_SIZE], 0  j  TABLE_SIZE in this order. Four outcomes result from the examination of a hash table bucket:

(1) The bucket contains x. In this case, x is already in the table. Depending on the application, we may either simply report a duplicate identifier, or we may update information in the other fields of the element.

(2) The bucket contains the empty string. In this case, the bucket is empty, and we may insert the new element into it.

(3) The bucket contains a nonempty string other than x. In this case we proceed to examine the next bucket.

(4) We return to the home bucket ht [f (x)] (j = TABLE_SIZE). In this case, the home bucket is being examined for the second time and all remaining buckets have been examined. The table is full and we report an error condition and exit.

Implementation of the insertion strategy:

void linear_insert (element item, element ht [ ] )

{ /* insert the key into the table using the linear probing technique, exit the function if the table is full */

short i, hash_value;

hash_value = hash (item.key);

i = hash_value;

while (strlen (ht [i].key)

{

if (! strcmp (ht [i].key, item.key)

{

cout << “Duplicate entry !\n”;

exit (1);

}

i = (i+1) % TABLE_SIZE;

if (i = = hash_value)

{

cout << “The table is full !\n”;

exit (1);

}

ht [i] = item;

}

Chaining

Linear probing and its variations perform poorly because inserting an identifier requires the comparison of identifiers with different hash values. To insert a new element we would only have to compute the hash address f (x) and examine the identifiers in the list for f (x). Since we would not know the sizes of the lists in advance, we should maintain them as linked chains. We now require additional space for a link field. Since we will have M lists, where M is the desired table size, we employ a head node for each chain. These head nodes only need a link field, so they are smaller than the other nodes. We maintain the head nodes in ascending order, 0,…….., M-1 so that we may access the lists at random. The C++ declarations required to create the chained hash table are:

#define MAX_CHAR 10 /* maximum identifier size*/

#define TABLE_SIZE 13 /* prime number */

#define IS_FULL (ptr) (!(ptr))

struct element

{

char key[MAX_CHAR];

/* other fields */

};

typedef struct list *lis_pointer;

struct list

{

element item;

list_pointer link;

};

list_pointer hash_table[TABLE_SIZE];

The function chain_insert (below) implements the chaining strategy. The function first computes the hash address for the identifier. It then examines the identifiers in the list for the selected bucket. If the identifier is found, we print an error message and exit. If the identifier is not in the list, we insert it at the end of the list. If list was empty, we change the head node to point to the new entry.

Implementation of the function chain_insert:

void chain_insert (element item, list_pointer ht[])

{ /* insert the key into the table using chaining */

short hash_value = hash (item.key);

list_pointer ptr, trail = NULL, lead = ht [hash_value];

for( ; lead; trail = lead, lead = lead->link)

{

if (!strcmp(lead->item.key, item.key))

{

cout << “The key is in the table \n”;

exit (1);

}

ptr = new struct list;

if (IS_FULL (ptr))

{

cout << “The memory is full \n”;

exit (1);

}

ptr->item = item;

ptr->link = NULL;

if (trail)

trail->link = ptr;

else

ht [hash_value] = ptr;

}

1.3.6 Open and Closed Hashing

Hashing

In this section, we consider a very efficient way to implement dictionaries. Recall that a dictionary is an abstract data type, namely, a set with the operations of searching (lookup), insertion, and deletion defined on its elements. The elements of this set can be of an arbitrary nature: numbers, characters of some alphabet, character strings, and so on. In practice, the most important case is that of records (student records in a school, citizen records in a governmental office, book records in a library).

Typically, records comprise several fields, each responsible for keeping a particular type of information about an entity the record represents. For example, a student record may contain fields for the student’s ID, name, date of birth, sex, home address, major, and so on. Among record fields there is usually at least one called a key that is used for identifying entities represented by the records (e.g., the student’s ID). In the discussion below, we assume that we have to implement a dictionary of n records with keys K1, K2, . . . , Kn.

Hashing is based on the idea of distributing keys among a one-dimensional array H [0..m − 1] called a hash table. The distribution is done by computing, for each of the keys, the value of some predefined function h called the hash function. This function assigns an integer between 0 and m − 1, called the hash address, to a key.

For example, if keys are nonnegative integers, a hash function can be of the form h(K) = K mod m; obviously, the remainder of division by m is always between 0 and m − 1. If keys are letters of some alphabet, we can first assign a letter its position in the alphabet, denoted here ord(K), and then apply the same kind of a function used for integers. Finally, if K is a character string

where C is a constant larger than every ord(ci).

In general, a hash function needs to satisfy somewhat conflicting require-ments:

A hash table’s size should not be excessively large compared to the number of keys, but it should be sufficient to not jeopardize the implementation’s time efficiency (see below).

A hash function needs to distribute keys among the cells of the hash table as evenly as possible. (This requirement makes it desirable, for most applications, to have a hash function dependent on all bits of a key, not just some of them.)

A hash function has to be easy to compute.

Obviously, if we choose a hash table’s size m to be smaller than the number of keys n, we will get collisions—a phenomenon of two (or more) keys being hashed into the same cell of the hash table (Figure 7.4). But collisions should be expected even if m is considerably larger than n (see Problem 5 in this section’s exercises). In fact, in the worst case, all the keys could be hashed to the same cell of the hash table. Fortunately, with an appropriately chosen hash table size and a good hash function, this situation happens very rarely. Still, every hashing scheme must have a collision resolution mechanism. This mechanism is different in the two principal versions of hashing: open hashing (also called separate chaining) and closed hashing (also called open addressing).

Open Hashing (Separate Chaining)

In open hashing, keys are stored in linked lists attached to cells of a hash table. Each list contains all the keys hashed to its cell. Consider, as an example, the following list of words:

A, FOOL, AND, HIS, MONEY, ARE, SOON, PARTED.

As a hash function, we will use the simple function for strings mentioned above, i.e., we will add the positions of a word’s letters in the alphabet and compute the sum’s remainder after division by 13.

We start with the empty table. The first key is the word A; its hash value is h(A) = 1 mod 13 = 1. The second key—the word FOOL—is installed in the ninth cell since (6 + 15 + 15 + 12) mod 13 = 9, and so on. The final result of this process is given in Figure 7.5; note a collision of the keys ARE and SOON because h(ARE) = (1 + 18 + 5) mod 13 = 11 and h(SOON) = (19 + 15 + 15 + 14) mod 13 = 11.

How do we search in a dictionary implemented as such a table of linked lists? We do this by simply applying to a search key the same procedure that was used for creating the table. To illustrate, if we want to search for the key KID in the hash table of Figure 7.5, we first compute the value of the same hash function for the key: h(KID) = 11. Since the list attached to cell 11 is not empty, its linked list may contain the search key. But because of possible collisions, we cannot tell whether this is the case until we traverse this linked list. After comparing the string KID first with the string ARE and then with the string SOON, we end up with an unsuccessful search.

In general, the efficiency of searching depends on the lengths of the linked lists, which, in turn, depend on the dictionary and table sizes, as well as the quality

of the hash function. If the hash function distributes n keys among m cells of the hash table about evenly, each list will be about n/m keys long. The ratio α = n/m, called the load factor of the hash table, plays a crucial role in the efficiency of hashing. In particular, the average number of pointers (chain links) inspected in successful searches, S, and unsuccessful searches, U, turns out to be

respectively, under the standard assumptions of searching for a randomly selected element and a hash function distributing keys uniformly among the table’s cells. These results are quite natural. Indeed, they are almost identical to searching sequentially in a linked list; what we have gained by hashing is a reduction in average list size by a factor of m, the size of the hash table.

Normally, we want the load factor to be not far from 1. Having it too small would imply a lot of empty lists and hence inefficient use of space; having it too large would mean longer linked lists and hence longer search times. But if we do have the load factor around 1, we have an amazingly efficient scheme that makes it possible to search for a given key for, on average, the price of one or two comparisons! True, in addition to comparisons, we need to spend time on computing the value of the hash function for a search key, but it is a constant-time operation, independent from n and m. Note that we are getting this remarkable efficiency not only as a result of the method’s ingenuity but also at the expense of extra space.

The two other dictionary operations—insertion and deletion—are almost identical to searching. Insertions are normally done at the end of a list (but see Problem 6 in this section’s exercises for a possible modification of this rule). Deletion is performed by searching for a key to be deleted and then removing it from its list. Hence, the efficiency of these operations is identical to that of searching, and they are all (1) in the average case if the number of keys n is about equal to the hash table’s size m.

Closed Hashing (Open Addressing)

In closed hashing, all keys are stored in the hash table itself without the use of linked lists. (Of course, this implies that the table size m must be at least as large as the number of keys n.) Different strategies can be employed for collision resolution. The simplest one—called linear probing—checks the cell following the one where the collision occurs. If that cell is empty, the new key is installed there; if the next cell is already occupied, the availability of that cell’s immediate successor is checked, and so on. Note that if the end of the hash table is reached, the search is wrapped to the beginning of the table; i.e., it is treated as a circular array. This method is illustrated in Figure 7.6 with the same word list and hash function used above to illustrate separate chaining.

To search for a given key K, we start by computing h(K) where h is the hash function used in the table construction. If the cell h(K) is empty, the search is unsuccessful. If the cell is not empty, we must compare K with the cell’s occupant: if they are equal, we have found a matching key; if they are not, we compare K with a key in the next cell and continue in this manner until we encounter either a matching key (a successful search) or an empty cell (unsuccessful search). For example, if we search for the word LIT in the table of Figure 7.6, we will get h(LIT) = (12 + 9 + 20) mod 13 = 2 and, since cell 2 is empty, we can stop immediately. However, if we search for KID with h(KID) = (11 + 9 + 4) mod 13 = 11, we will have to compare KID with ARE, SOON, PARTED, and A before we can declare the search unsuccessful.

Although the search and insertion operations are straightforward for this version of hashing, deletion is not. For example, if we simply delete the key ARE from the last state of the hash table in Figure 7.6, we will be unable to find the key SOON afterward. Indeed, after computing h(SOON) = 11, the algorithm would find this location empty and report the unsuccessful search result. A simple solution

is to use “lazy deletion,” i.e., to mark previously occupied locations by a special symbol to distinguish them from locations that have not been occupied.

The mathematical analysis of linear probing is a much more difficult problem than that of separate chaining. The simplified versions of these results state that the average number of times the algorithm must access the hash table with the load factor α in successful and unsuccessful searches is, respectively,

(and the accuracy of these approximations increases with larger sizes of the hash table). These numbers are surprisingly small even for densely populated tables, i.e., for large percentage values of α:

Still, as the hash table gets closer to being full, the performance of linear probing deteriorates because of a phenomenon called clustering. A cluster in linear probing is a sequence of contiguously occupied cells (with a possible wrapping). For example, the final state of the hash table of Figure 7.6 has two clusters. Clusters are bad news in hashing because they make the dictionary operations less efficient. As clusters become larger, the probability that a new element will be attached to a cluster increases; in addition, large clusters increase the probability that two clusters will coalesce after a new key’s insertion, causing even more clustering.

Several other collision resolution strategies have been suggested to alleviate this problem. One of the most important is double hashing. Under this scheme, we use another hash function, s(K), to determine a fixed increment for the probing sequence to be used after a collision at location l = h(K):

(l + s(K)) mod m, (l + 2s(K)) mod m, . . . . (7.6)

To guarantee that every location in the table is probed by sequence (7.6), the increment s(k) and the table size m must be relatively prime, i.e., their only common divisor must be 1. (This condition is satisfied automatically if m itself is prime.) Some functions recommended in the literature are s(k) = m − 2 − k mod (m − 2) and s(k) = 8 − (k mod 8) for small tables and s(k) = k mod 97 + 1 for larger ones.

Mathematical analysis of double hashing has proved to be quite difficult. Some partial results and considerable practical experience with the method suggest that with good hashing functions—both primary and secondary—double hashing is superior to linear probing. But its performance also deteriorates when the table gets close to being full. A natural solution in such a situation is rehashing: the current table is scanned, and all its keys are relocated into a larger table.

It is worthwhile to compare the main properties of hashing with balanced search trees—its principal competitor for implementing dictionaries.

Asymptotic time efficiency with hashing, searching, insertion, and deletion can be implemented to take (1) time on the average but (n) time in the very unlikely worst case. For balanced search trees, the average time efficiencies are (log n) for both the average and worst cases.

Ordering preservation Unlike balanced search trees, hashing does not assume existence of key ordering and usually does not preserve it. This makes hashing less suitable for applications that need to iterate over the keys in or-der or require range queries such as counting the number of keys between some lower and upper bounds.

Since its discovery in the 1950s by IBM researchers, hashing has found many important applications. In particular, it has become a standard technique for storing a symbol table—a table of a computer program’s symbols generated during compilation. Hashing is quite handy for such AI applications as checking whether positions generated by a chess-playing computer program have already been considered. With some modifications, it has also proved to be useful for storing very large dictionaries on disks; this variation of hashing is called extendible hashing. Since disk access is expensive compared with probes performed in the main memory, it is preferable to make many more probes than disk accesses. Accordingly, a location computed by a hash function in extendible hashing indicates a disk ad-dress of a bucket that can hold up to b keys. When a key’s bucket is identified, all its keys are read into main memory and then searched for the key in question.

1.3.7 Static Perfect Hashing

Definition of Perfect Hashing

Perfect hashing is defined as a model of hashing in which any set of n elements can be stored in a hash table of equal size and can have lookups performed in constant time. It was specifically invented and discussed by Fredman, Komlos and Szemeredi (1984) and has therefore been nicknamed as "FKS Hashing".

Definition of Static Hashing

Static Hashing defines another form of the hashing problem which permits users to accomplish lookups on a finalized dictionary set (that means all objects in the dictionary are final as well as not changing).

Application

Since static hashing needs that the database, its objects and reference remain the same its applications are limited. Databases which contain information which experiences rare change are also eligible as it would only require a full rehash of the whole database on rare occasion. Various examples of this hashing scheme include sets of words and definitions of specific languages, sets of significant data for an organization's personnel, etc.

Implementation

In the static case, we are provided a set with a total of p entries, each one associated with a unique key, ahead of time. Fredman, Komlós and Szemerédi select a first-level hash table with size s = 2(p-1) buckets. To construct, p entries are separated into q buckets by the top-level hashing function, where q = 2(p-1). Then for each bucket with r entries, a second-level table is allocated with r2 slots, and its hash function is chosen at random from a universal hash function set so that it becomes collision-free and stored alongside the hash table. If the hash function randomly chosen creates a table with collisions, a new hash function is randomly chosen until a collision-free table can be guaranteed. At last, with the collision-free hash, the r entries are hashed into the second-level table.

1.3.8 Load Factor and Rehashing

How hashing works:

For insertion of a key(K) – value(V) pair into a hash map, 2 steps are required:

K is converted into a small integer (called its hash code) using a hash function.

The hash code is used to find an index (hashCode % arrSize) and the entire linked list at that index(Separate chaining) is first searched for the presence of the K already.

If found, its value is updated and if not, the K-V pair is stored as a new node in the list.

Complexity and Load Factor

For the first step, time taken depends on the K and the hash function.
For example, if the key is a string “abcd”, then it’s hash function may depend on the length of the string. But for very large values of n, the number of entries into the map, length of the keys is almost negligible in comparison to n so hash computation can be considered to take place in constant time, i.e, O(1).

For the second step, traversal of the list of K-V pairs present at that index needs to be done. For this, the worst case may be that all the n entries are at the same index. So, time complexity would be O(n). But, enough research has been done to make hash functions uniformly distribute the keys in the array so this almost never happens.

So, on an average, if there are n entries and b is the size of the array there would be n/b entries on each index. This value n/b is called the load factor that represents the load that is there on our map.

This Load Factor needs to be kept low, so that number of entries at one index is less and so is the complexity almost constant, i.e., O(1).

Rehashing:

As the name suggests, rehashing means hashing again. Basically, when the load factor increases to more than its pre-defined value (default value of load factor is 0.75), the complexity increases. So to overcome this, the size of the array is increased (doubled) and all the values are hashed again and stored in the new double sized array to maintain a low load factor and low complexity.

Why rehashing?

Rehashing is done because whenever key value pairs are inserted into the map, the load factor increases, which implies that the time complexity also increases as explained above. This might not give the required time complexity of O(1).

Hence, rehash must be done, increasing the size of the bucketArray so as to reduce the load factor and the time complexity.

How Rehashing is done?

Rehashing can be done as follows:

For each addition of a new entry to the map, check the load factor.

If it’s greater than its pre-defined value (or default value of 0.75 if not given), then Rehash.

For Rehash, make a new array of double the previous size and make it the new bucketarray.

Then traverse to each element in the old bucketArray and call the insert() for each so as to insert it into the new larger bucket array.

Program to implement Rehashing:

// Java program to implement Rehashing

import java.util.ArrayList;

class Map<K, V> {

class MapNode<K, V> {

K key;

V value;

MapNode<K, V> next;

public MapNode(K key, V value)

{

this.key = key;

this.value = value;

next = null;

}

// The bucket array where

// the nodes containing K-V pairs are stored

ArrayList<MapNode<K, V> > buckets;

// No. of pairs stored - n

int size;

// Size of the bucketArray - b

int numBuckets;

// Default loadFactor

final double DEFAULT_LOAD_FACTOR = 0.75;

public Map()

{

numBuckets = 5;

buckets = new ArrayList<>(numBuckets);

for (int i = 0; i < numBuckets; i++) {

// Initialising to null

buckets.add(null);

}

System.out.println("HashMap created");

System.out.println("Number of pairs in the Map: " + size);

System.out.println("Size of Map: " + numBuckets);

System.out.println("Default Load Factor: " + DEFAULT_LOAD_FACTOR + "\n");

}

private int getBucketInd(K key)

{

// Using the inbuilt function from the object class

int hashCode = key.hashCode();

// array index = hashCode%numBuckets

return (hashCode % numBuckets);

}

public void insert(K key, V value)

{

// Getting the index at which, it needs to be inserted

int bucketInd = getBucketInd(key);

// The first node at that index

MapNode<K, V> head = buckets.get(bucketInd);

// First, loop through all the nodes present at that index

// to check if the key already exists

while (head != null) {

// If already present the value is updated

if (head.key.equals(key)) {

head.value = value;

return;

}

head = head.next;

}

// new node with the K and V

MapNode<K, V> newElementNode = new MapNode<K, V>(key, value);

// The head node at the index

head = buckets.get(bucketInd);

// the new node is inserted

// by making it the head

// and it's next is the previous head

newElementNode.next = head;

buckets.set(bucketInd, newElementNode);

System.out.println("Pair(" + key + ", " + value + ") inserted successfully.\n");

// Incrementing size

// as new K-V pair is added to the map

size++;

// Load factor calculated

double loadFactor = (1.0 * size) / numBuckets;

System.out.println("Current Load factor = " + loadFactor);

// If the load factor is > 0.75, rehashing is done

if (loadFactor > DEFAULT_LOAD_FACTOR) {

System.out.println(loadFactor + " is greater than " +

DEFAULT_LOAD_FACTOR);

System.out.println("Therefore Rehashing will be done.\n");

// Rehash

rehash();

System.out.println("New Size of Map: " + numBuckets + "\n");

}

System.out.println("Number of pairs in the Map: " + size);

System.out.println("Size of Map: " + numBuckets + "\n");

}

private void rehash()

{

System.out.println("\n***Rehashing Started***\n");

// The present bucket list is made temp

ArrayList<MapNode<K, V> > temp = buckets;

// New bucketList of double the old size is created

buckets = new ArrayList<MapNode<K, V> >(2 * numBuckets);

for (int i = 0; i < 2 * numBuckets; i++) {

// Initialised to null

buckets.add(null);

}

// Now size is made zero

// and we loop through all the nodes in the original bucket list(temp)

// and insert it into the new list

size = 0;

numBuckets *= 2;

for (int i = 0; i < temp.size(); i++) {

// head of the chain at that index

MapNode<K, V> head = temp.get(i);

while (head != null) {

K key = head.key;

V val = head.value;

// calling the insert function for each node in temp

// as the new list is now the bucketArray

insert(key, val);

head = head.next;

}

System.out.println("\n***Rehashing Ended***\n");

}

public void printMap()

{

// The present bucket list is made temp

ArrayList<MapNode<K, V> > temp = buckets;

System.out.println("Current HashMap:");

// loop through all the nodes and print them

for (int i = 0; i < temp.size(); i++) {

// head of the chain at that index

MapNode<K, V> head = temp.get(i);

while (head != null) {

System.out.println("key = " + head.key + ", val = " + head.value);

head = head.next;

}

System.out.println();

}

public class GFG {

public static void main(String[] args)

{

// Creating the Map

Map<Integer, String> map = new Map<Integer, String>();

// Inserting elements

map.insert(1, "Geeks");

map.printMap();

map.insert(2, "forGeeks");

map.printMap();

map.insert(3, "A");

map.printMap();

map.insert(4, "Computer");

map.printMap();

map.insert(5, "Portal");

map.printMap();

}

Output:

HashMap created

Number of pairs in the Map: 0

Size of Map: 5

Default Load Factor: 0.75

Pair(1, Geeks) inserted successfully.

Current Load factor = 0.2

Number of pairs in the Map: 1

Size of Map: 5

Current HashMap:

key = 1, val = Geeks

Pair(2, forGeeks) inserted successfully.

Current Load factor = 0.4

Number of pairs in the Map: 2

Size of Map: 5

Current HashMap:

key = 1, val = Geeks

key = 2, val = forGeeks

Pair(3, A) inserted successfully.

Current Load factor = 0.6

Number of pairs in the Map: 3

Size of Map: 5

Current HashMap:

key = 1, val = Geeks

key = 2, val = forGeeks

key = 3, val = A

Pair(4, Computer) inserted successfully.

Current Load factor = 0.8

0.8 is greater than 0.75

Therefore Rehashing will be done.

***Rehashing Started***

Pair(1, Geeks) inserted successfully.

Current Load factor = 0.1

Number of pairs in the Map: 1

Size of Map: 10

Pair(2, forGeeks) inserted successfully.

Current Load factor = 0.2

Number of pairs in the Map: 2

Size of Map: 10

Pair(3, A) inserted successfully.

Current Load factor = 0.3

Number of pairs in the Map: 3

Size of Map: 10

Pair(4, Computer) inserted successfully.

Current Load factor = 0.4

Number of pairs in the Map: 4

Size of Map: 10

***Rehashing Ended***

New Size of Map: 10

Number of pairs in the Map: 4

Size of Map: 10

Current HashMap:

key = 1, val = Geeks

key = 2, val = forGeeks

key = 3, val = A

key = 4, val = Computer

Pair(5, Portal) inserted successfully.

Current Load factor = 0.5

Number of pairs in the Map: 5

Size of Map: 10

Current HashMap:

key = 1, val = Geeks

key = 2, val = forGeeks

key = 3, val = A

key = 4, val = Computer

key = 5, val = Portal

The Sum of Functions Problem

Given a set A and a function f, whose domain contains A and which can be calculated in time O(1), find all non-trivial quadruples (x, y, z, w) of elements x, y, z, w ∈ A such that f(x) + f(y) = f(z) + f(w).

For a trivial solution, we may simply test each quadruple of distinct elements of the set A. We can do this in Θ(n 4) time, where n = |A|.

A more efficient solution can be achieved by sorting. Pick all pairs (x, y) ⊂ A, put them on a list L, and sort them by d x,y = f(x) + f(y). Now, all pairs of elements whose images under f sum up to a same value d will constitute a contiguous sublist of L. Therefore, they can be easily combined to produce the intended quadruples in constant time per quadruple. Owing to sorting, the overall time complexity of this approach is O(n 2 log n + q), in the worst case, where q is the number of quadruples in the output.1

Now we look into a hashing-based approach. We start with an empty hash map H. We consider each unordered pair (x, y) of distinct elements of A, obtaining d = f(x) + f(y), one pair at a time. If d is not a key stored in H, we insert the (key, value) pair (d, S d) in H, where S d is a collection (which can be implemented as a linked list, a hash set or whatever other structure) containing only the pair (x, y) initially. On the other hand, if d is already stored in H, then we simply add (x, y) to the non-empty collection S d which d already maps to in H. After all pairs of distinct elements have been considered, our hash map will contain non-empty collections S d whose elements are pairs (x, y) satisfying f(x) + f(y) = d. For each non-unitary collection Sd, we combine distinct pairs (x, y), (z, w) ∈ S d , two at a time, to produce our desired quadruples (x, y, z, w).

The pseudocode of the proposed hash-based solution.

Algorithm 1 Sum Functions (A, f).

As for the time complexity of the algorithm, each key d can be located or inserted in H in O(1) average time, and each of the Θ(n 2) pairs (x, y) of elements of A can be inserted in some list S d in O(1) time (assuming, for instance, a linked list implementation for S d ). Thus, the whole step of populating H can be performed in Θ(n 2) time on average. Now, each of the q quadruples that satisfy the equality can be obtained, by combining pairs in a same list, in O(1) time. Thus, the whole algorithm runs in Θ(n 2 + q) expected time.

If we tried to achieve the same performance by using an array and direct addressing, such array would have to be twice as large as the maximum element in the image of f. The use of hashing avoids that issue seamlessly.

The Permutation Bingo Problem

In the standard game of bingo, each of the contestants receives a card with some integers. The integers in each card constitute a subset of a rather small range A = [1, a]. Then, subsequent numbers are drawn at random from A, without replacement, until all numbers in the card of a contestant have been drawn. That contestant wins the prize.

In our variation of the game, the contestants receive an empty card. Subsequent numbers are drawn at random from some unknown, huge range B = [1, b], with replacement, forming a sequence S. The contestant who wins the game is the one who first spots a set W = {x, y, z} of three distinct numbers such that all 3! = 6 permutations of its elements appear among the subsequences of three consecutive elements of S. As an example, the sequence

S = (2, 19, 4, 1, 100, 1, 4, 19, 1, 4, 1, 19, 100, 192, 100, 4, 19, 2, 1, 19, 4)

is a minimal winning sequence. There is a set, namely W = {1, 4, 19}, such that every permutation of W appears as three consecutive elements of S. Even though the set B is huge, the distribution of the drawing probability along B is not uniform, and the game finishes in reasonable time almost surely.

We want to write a program to play the permutation bingo. Moreover, we want our program to win the game when it plays against other programs. In other words, we need an efficient algorithm to check, after each number is added to S, whether S happens to be a winning sequence.

One possible approach would be as follows. Keep S in memory all the time. After the n-th number is added, for all n ≥ 8 (a trivial lower bound for the size of a winning sequence), traverse S = (s 1, s 2, ..., s n ) entirely for each set R i = {s i , s i+1 , s i+2 } of three consecutive distinct elements (with 1≤ i ≤ n - 2), checking whether all permutations of R i appear as consecutive elements of S. Checking a sequence of size n this way demands Θ(n 2) time in the worst case.

We can do better by performing a single traversal of S, during which we add each subsequence of three consecutive distinct elements of S into a bin. To make sure that two permutations go to the same bin if and only if they have the same elements (in different orders), we may label each bin using the ascending sequence of its elements. Now, for each subsequence of three distinct elements in S, we compute the label of its bin, find the bin, and place the subsequence therein, making sure we do not add the same subsequence twice in a bin. Whenever a bin has size six, we have a winner. We can implement each bin labeled (x, y, z), with x < y < z, using an array of integers, or a linked list, or a hash set, or even a bit array if we care to compute the position of any given permutation in a total order of the permutations of those elements. As a matter of fact, the way we implement each bin is not important, since its size will never exceed six. Moreover, rather than building the bins from scratch by traversing the whole S each time a new element s n is added, we can keep the bins in memory and proceed to place only each freshly formed subsequence (s n-2 , s n-1 , s n ) into the appropriate bin - an approach which also allows us to keep but the two last elements of S in memory.

We still have a big problem, though. Given a label, how to find the bin among the (possibly huge number of) existing bins? If we maintain, say, an array of bins, then we will have to traverse that entire array until we find the desired bin. It is not even possible to attempt direct addressing here, for the range B of the numbers that can be drawn is not only unknown but also possibly huge - let alone the number of subsets of three such numbers! Even if we did know B, there would probably not be enough available memory for such an array. We know, however, that the game always ends in reasonable time, owing to an appropriate probability distribution along B enforced by the game dealer. Thus, the actual number of bins that will ever be used is small, and available memory to store only that number of bins shall not be an issue. We can therefore keep the bins in a linked list, or even in an array without direct addressing. When the sequence has size n, such list (or array) of bins will have size O(n), and the time it takes to traverse it all and find the desired bin will also be O(n).

The use of hashing can improve the performance even further. We use a hash map where the keys are the bin labels and the values are the bins themselves (each one a list, or bit array etc.), so every time a new element is added, we can find the appropriate bin and update its contents in expected constant time. Since we do not know the number of bins that will be stored in our hash map throughout the game, and we want to keep the load factor of the hash map below a certain constant, we implement the hash map over an initially small dynamic array and adopt a geometric expansion policy to resize it whenever necessary. By doing this, the amortized cost of inserting each new bin is also constant, and the algorithm processes each drawn number in overall O(1) expected time.

The Ferry Loading Problem

The Ferry Loading Problem (FLP) is not unknown to those used to international programming competitions. It appears in the chapter devoted to dynamic programming, the technique employed in the generally accepted, pseudo-polynomial solution. By using a clever backtracking approach aided by hashing, we obtain a simpler algorithm which is just as efficient, in the worst case. As a matter of fact, our algorithm may demand significantly less time and space, in practice, than the usual dynamic programming approach.

There is a ferry to transport cars across a river. The ferry comprises two lanes of cars throughout its length. Before boarding on the ferry, the cars wait on a single queue in their order of arrival. When boarding, each car goes either to the left or to the right lane at the discretion of the ferry operator. Whenever the next car in the queue cannot be boarded on either lane, the ferry departs, and all cars still in the queue should wait for the next ferry. Given the length L ∈ ℤ of the ferry and the lengths ℓ i ∈ ℤ of all cars i > 0 in the queue, the problem asks for the maximum number of cars that can be loaded onto the ferry.

A brute force approach could be as follows. Starting from k = 1, attempt all 2k distributions of the first k cars in the queue. If it is possible to load k cars on the ferry, proceed with the next value of k. Stop when there is no feasible distribution of cars for some k. Clearly, if the number n of cars loaded in the optimal solution is not ridiculously small, such an exponential-time algorithm is doomed to run for ages.

A dynamic programming strategy can be devised to solve the FLP in Θ(nL) time and space. This is not surprising, given the similarity of this problem with the Knapsack Problem, the Partition Problem, and the Bin Packing Problem. Such complexity, as usual, corresponds to the size of the dynamic programming table that must be filled with answers to subproblems, each one computed in constant time.

Apart from the dynamic programming approach, one could come up with a simple backtracking strategy as follows. Identify each feasible configuration of k cars by a tuple (x 1, ..., x k), where x i ∈ {left, right} indicates the lane occupied by the ith car, for i = 1, ..., k. Each configuration (x 1, ..., x k) leads to configurations (x 1, ..., x k, left) and (x 1, ..., x k, right), respectively, if there is room for the (k + 1)th car on the corresponding lane. Starting from the configuration (left) - the first car boards on the left lane, without loss of generality -, inspect all feasible configurations in backtracking fashion and report the size of the largest configuration that is found. It works fine. Unfortunately, such approach demands exponential time in the worst case.

However, we can perfectly well identify each configuration by a pair (k, s), where k is the number of cars on the ferry and s is the total length of the left lane that is occupied by cars. Each configuration (k, s) leads to at most two other configurations in the implicit backtracking graph:

•(k + 1, s + ℓ k+1 ), by placing the (k + 1)th car in the left lane, which is only possible if s + ℓ k+1 ≤ L; and

• (k + 1, ℓ), by placing the (k + 1)th car in the right lane, which is only possible if

http://www.scielo.br/img/revistas/pope/v35n2/1678-5142-pope-35-02-00423-gi1.jpg

- s ≤ L.

If neither placement of the (k + 1)th car is possible, backtrack. Notice that a same configuration (k + 1, s') may be reached by two distinct configurations (k, s 1) and (k, s 2). It happens precisely if |s 1 - s 2| = ℓ k+1 . However, we do not want to visit a same configuration twice, and therefore we must keep track of the visited configurations.

The algorithm searches the whole backtracking graph, starting by the root node (0, 0) - there are zero cars on the ferry, initially, occupying zero space on the left lane. The reported solution will be the largest k that is found among the visited configurations (k, s).

As for the time complexity, the whole search runs in time proportional to the number of nodes (configurations) in the backtracking graph. Since 0 ≤ k ≤ n, 0 ≤ s ≤ L, and both k and s are integers, such number is O(nL), bounding the worst-case time complexity of our algorithm if we make sure we can check whether a configuration has already been visited in O(1) time. If we use a 2-dimensional matrix to store the configurations, the problem is solved. However, by doing so, we demand Θ(nL) space, even though our matrix will most likely be absolutely sparse.

We can resort to hashing to solve that. By using a hash set to store the visited configurations, we can still perform each configuration lookup in (expected) constant time, and the required space will be proportional to the number O(nL) of configurations that are actually visited during the search. In practice, the O(nL) time and space required by our algorithm can be remarkably smaller than the Θ(nL) time and space of the dynamic programming solution, especially if there are many cars with the same size.

Key takeaway

Following are the basic primary operations of a hash table.

Search − Searches an element in a hash table.

Insert − inserts an element in a hash table.

delete − Deletes an element from a hash table.

1.4 Hash Functions- Properties of good Hash Function, Division, Multiplication, Extraction, Mid-square, Folding and Universal

Hash Function

Hash Function is used to index the original value or key and then used later each time the data associated with the value or key is to be retrieved. Thus, hashing is always a one-way operation. There is no need to "reverse engineer" the hash function by analyzing the hashed values.

Characteristics of Good Hash Function:

The hash value is fully determined by the data being hashed.

The hash Function uses all the input data.

The hash function "uniformly" distributes the data across the entire set of possible hash values.

The hash function generates complicated hash values for similar strings.

Some Popular Hash Function is:

1. Division Method:

Choose a number m smaller than the number of n of keys in k (The number m is usually chosen to be a prime number or a number without small divisors, since this frequently a minimum number of collisions).

The hash function is:

DAA Hash Function

For Example: if the hash table has size m = 12 and the key is k = 100, then h (k) = 4. Since it requires only a single division operation, hashing by division is quite fast.

2. Multiplication Method:

The multiplication method for creating hash functions operates in two steps. First, we multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA. Then, we increase this value by m and take the floor of the result.

The hash function is:

DAA Hash Function

Where "k A mod 1" means the fractional part of k A, that is, k A -⌊k A⌋.

3. Mid Square Method:

The key k is squared. Then function H is defined by

H (k) = L

Where L is obtained by deleting digits from both ends of k2. We emphasize that the same position of k2 must be used for all of the keys.

4. Folding Method:

The key k is partitioned into a number of parts k1, k2.... kn where each part except possibly the last, has the same number of digits as the required address.

Then the parts are added together, ignoring the last carry.

H (k) = k1+ k2+.....+kn

Example: Company has 68 employees, and each is assigned a unique four- digit employee number. Suppose L consist of 2- digit addresses: 00, 01, and 02....99. We apply the above hash functions to each of the following employee numbers:

3205, 7148, 2345

(a) Division Method: Choose a Prime number m close to 99, such as m =97, Then

H (3205) = 4, H (7148) = 67, H (2345) = 17.

That is dividing 3205 by 17 gives a remainder of 4, dividing 7148 by 97 gives a remainder of 67, dividing 2345 by 97 gives a remainder of 17.

(b) Mid-Square Method:

k = 3205 7148 2345

k2= 10272025 51093904 5499025

h (k) = 72 93 99

Observe that fourth & fifth digits, counting from right are chosen for hash address.

H (3205) = 32 + 50 = 82 H (7148) = 71 + 84 = 55

H (2345) = 23 + 45 = 68

Key takeaway

Characteristics of Good Hash Function:

The hash value is fully determined by the data being hashed.

The hash Function uses all the input data.

The hash function "uniformly" distributes the data across the entire set of possible hash values.

The hash function generates complicated hash values for similar strings.

1.5 Collision Resolution Strategies – Open Addressing and Chaining

Collision Resolution Techniques

Collision in hashing

In this, the hash function is used to compute the index of the array.

The hash value is used to store the key in the hash table, as an index.

The hash function can return the same hash value for two or more keys.

When two or more keys are given the same hash value, it is called a collision. To handle this collision, we use collision resolution techniques.

Collision resolution techniques

There are two types of collision resolution techniques.

Separate chaining (open hashing)

Open addressing (closed hashing)

Separate chaining

In this technique, a linked list is created from the slot in which collision has occurred, after which the new key is inserted into the linked list. This linked list of slots looks like a chain, so it is called separate chaining. It is used more when we do not know how many keys to insert or delete.

Time complexity

Its worst-case complexity for searching is o(n).

Its worst-case complexity for deletion is o(n).

Advantages of separate chaining

It is easy to implement.

The hash table never fills full, so we can add more elements to the chain.

It is less sensitive to the function of the hashing.

Disadvantages of separate chaining

In this, cache performance of chaining is not good.

The memory wastage is too much in this method.

It requires more space for element links.

Open addressing

Open addressing is collision-resolution method that is used to control the collision in the hashing table. There is no key stored outside of the hash table. Therefore, the size of the hash table is always greater than or equal to the number of keys. It is also called closed hashing.

The following techniques are used in open addressing:

Linear probing

Quadratic probing

Double hashing

Linear probing

In this, when the collision occurs, we perform a linear probe for the next slot, and this probing is performed until an empty slot is found. In linear probing, the worst time to search for an element is O(table size). The linear probing gives the best performance of the cache but its problem is clustering. The main advantage of this technique is that it can be easily calculated.

Disadvantages of linear probing

The main problem is clustering.

It takes too much time to find an empty slot.

Quadratic probing

In this, when the collision occurs, we probe for i2th slot in ith iteration, and this probing is performed until an empty slot is found. The cache performance in quadratic probing is lower than the linear probing. Quadratic probing also reduces the problem of clustering.

Double hashing

In this, you use another hash function, and probe for (i * hash 2(x)) in the ith iteration. It takes longer to determine two hash functions. The double probing gives the very poor the cache performance, but there has no clustering problem in it.

Key takeaway

Collision in hashing

In this, the hash function is used to compute the index of the array.

The hash value is used to store the key in the hash table, as an index.

The hash function can return the same hash value for two or more keys.

When two or more keys are given the same hash value, it is called a collision. To handle this collision, we use collision resolution techniques.

1.6 Hash Table Overflow – Open Addressing and Chaining, Extendible Hashing, Closed Addressing and Separate Chaining

Overflow Handling

An overflow occurs at the time of the home bucket for a new pair (key, element) is full.

We may tackle overflows by

Search the hash table in some systematic manner for a bucket that is not full.

Linear probing (linear open addressing).

Quadratic probing.

Random probing.

Eliminate overflows by allowing each bucket to keep a list of all pairs for which it is the home bucket.

Array linear list.

Chain.

Open addressing is performed to ensure that all elements are stored directly into the hash table, thus it attempts to resolve collisions implementing various methods.

Linear Probing is performed to resolve collisions by placing the data into the next open slot in the table.

Performance of Linear Probing

Worst-case find/insert/erase time is θ(m), where m is treated as the number of pairs in the table.

This occurs when all pairs are in the same cluster.

Problem of Linear Probing

Identifiers are tending to cluster together

Adjacent clusters are tending to coalesce

Increase or enhance the search time

Quadratic Probing

Linear probing searches buckets (H(x)+i2)%b; H(x) indicates Hash function of x

Quadratic probing implements a quadratic function of i as the increment

Examine buckets H(x), (H(x)+i2)%b, (H(x)-i2)%b, for 1<=i<=(b-1)/2

b is indicated as a prime number of the form 4j+3, j is an integer

Random Probing

Random Probing performs incorporating with random numbers.

H(x):= (H’(x) + S[i]) % b

S[i] is a table along with size b-1

S[i] is indicated as a random permutation of integers [1, b-1].

Open and Closed Hashing

Hashing

where C is a constant larger than every ord(ci).

In general, a hash function needs to satisfy somewhat conflicting require-ments:

A hash table’s size should not be excessively large compared to the number of keys, but it should be sufficient to not jeopardize the implementation’s time efficiency (see below).

A hash function has to be easy to compute.

Open Hashing (Separate Chaining)

In open hashing, keys are stored in linked lists attached to cells of a hash table. Each list contains all the keys hashed to its cell. Consider, as an example, the following list of words:

A, FOOL, AND, HIS, MONEY, ARE, SOON, PARTED.

In general, the efficiency of searching depends on the lengths of the linked lists, which, in turn, depend on the dictionary and table sizes, as well as the quality

Closed Hashing (Open Addressing)

is to use “lazy deletion,” i.e., to mark previously occupied locations by a special symbol to distinguish them from locations that have not been occupied.

(l + s(K)) mod m, (l + 2s(K)) mod m, . . . . (7.6)

It is worthwhile to compare the main properties of hashing with balanced search trees—its principal competitor for implementing dictionaries.

Extendible Hashing is a dynamic hashing method wherein directories, and buckets are used to hash data. It is an aggressively flexible method in which the hash function also experiences dynamic changes.

Main features of Extendible Hashing: The main features in this hashing technique are:

Directories: The directories store addresses of the buckets in pointers. An id is assigned to each directory which may change each time when Directory Expansion takes place.

Buckets: The buckets are used to hash the actual data.

Basic Structure of Extendible Hashing:

Frequently used terms in Extendible Hashing:

Directories: These containers store pointers to buckets. Each directory is given a unique id which may change each time when expansion takes place. The hash function returns this directory id which is used to navigate to the appropriate bucket. Number of Directories = 2^Global Depth.

Buckets: They store the hashed keys. Directories point to buckets. A bucket may contain more than one pointer to it if its local depth is less than the global depth.

Global Depth: It is associated with the Directories. They denote the number of bits which are used by the hash function to categorize the keys. Global Depth = Number of bits in directory id.

Local Depth: It is the same as that of Global Depth except for the fact that Local Depth is associated with the buckets and not the directories. Local depth in accordance with the global depth is used to decide the action that to be performed in case an overflow occurs. Local Depth is always less than or equal to the Global Depth.

Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then the bucket is split into two parts.

Directory Expansion: Directory Expansion Takes place when a bucket overflows. Directory Expansion is performed when the local depth of the overflowing bucket is equal to the global depth.

Basic Working of Extendible Hashing:

Step 1 – Analyze Data Elements: Data elements may exist in various forms eg. Integer, String, Float, etc. Currently, let us consider data elements of type integer. eg: 49.

Step 2 – Convert into binary format: Convert the data element in Binary form. For string elements, consider the ASCII equivalent integer of the starting character and then convert the integer into binary form. Since we have 49 as our data element, its binary form is 110001.

Step 3 – Check Global Depth of the directory. Suppose the global depth of the Hash-directory is 3.

Step 4 – Identify the Directory: Consider the ‘Global-Depth’ number of LSBs in the binary number and match it to the directory id.
Eg. The binary obtained is: 110001 and the global-depth is 3. So, the hash function will return 3 LSBs of 110001 viz. 001.

Step 5 – Navigation: Now, navigate to the bucket pointed by the directory with directory-id 001.

Step 6 – Insertion and Overflow Check: Insert the element and check if the bucket overflows. If an overflow is encountered, go to step 7 followed by Step 8, otherwise, go to step 9.

Step 7 – Tackling Over Flow Condition during Data Insertion: Many times, while inserting data in the buckets, it might happen that the Bucket overflows. In such cases, we need to follow an appropriate procedure to avoid mishandling of data.

First, Check if the local depth is less than or equal to the global depth. Then choose one of the cases below.

Case1: If the local depth of the overflowing Bucket is equal to the global depth, then Directory Expansion, as well as Bucket Split, needs to be performed. Then increment the global depth and the local depth value by 1. And, assign appropriate pointers.
Directory expansion will double the number of directories present in the hash structure.

Case2: In case the local depth is less than the global depth, then only Bucket Split takes place. Then increment only the local depth value by 1. And, assign appropriate pointers.

Step 8 – Rehashing of Split Bucket Elements: The Elements present in the overflowing bucket that is split are rehashed w.r.t the new global depth of the directory.

Step 9 – The element is successfully hashed.

Example based on Extendible Hashing: Now, let us consider a prominent example of hashing the following elements: 16,4,6,22,24,10,31,7,9,20,26.
Bucket Size: 3 (Assume)

Hash Function: Suppose the global depth is X. Then the Hash Function returns X LSBs.

Solution: First, calculate the binary forms of each of the given numbers.
16- 10000

4- 00100

6- 00110

22- 10110

24- 11000

10- 01010

31- 11111

7- 00111

9- 01001

20- 10100

26- 01101

Initially, the global-depth and local-depth is always 1. Thus, the hashing frame looks like this:

Inserting 16:

The binary format of 16 is 10000 and global-depth is 1. The hash function returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to the directory with id=0.

Inserting 4 and 6:

Both 4(100) and 6(110) have 0 in their LSB. Hence, they are hashed as follows:

Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed by directory 0 is already full. Hence, Over Flow occurs.

As directed by Step 7-Case 1, Since Local Depth = Global Depth, the bucket splits and directory expansion take place. Also, rehashing of numbers present in the overflowing bucket takes place after the split. And, since the global depth is incremented by 1, now, the global depth is 2. Hence, 16,4,6,22 is now rehashed w.r.t 2 LSBs.

[16(10000),4(100),6(110),22(10110)]

Notice that the bucket which was underflow has remained untouched. But, since the number of directories has doubled, we now have 2 directories 01 and 11 pointing to the same bucket. This is because the local-depth of the bucket has remained 1. And, any bucket having a local depth less than the global depth is pointed-to by more than one directory.

Inserting 24 and 10: 24(11000) and 10 (1010) can be hashed based on directories with id 00 and 10. Here, we encounter no overflow condition.

Inserting 31,7,9: All of these elements [ 31(11111), 7(111), 9(1001)] have either 01 or 11 in their LSBs. Hence, they are mapped on the bucket pointed out by 01 and 11. We do not encounter any overflow condition here.

Inserting 20: Insertion of data element 20 (10100) will again cause the overflow problem.

20 is inserted in bucket pointed out by 00. As directed by Step 7-Case 1, since the local depth of the bucket = global-depth, directory expansion (doubling) takes place along with bucket splitting. Elements present in overflowing bucket are rehashed with the new global depth. Now, the new Hash table looks like this:

Inserting 26: Global depth is 3. Hence, 3 LSBs of 26(11010) are considered. Therefore 26 best fits in the bucket pointed out by directory 010.

The bucket overflows, and, as directed by Step 7-Case 2, since the local depth of bucket < Global depth (2<3), directories are not doubled but, only the bucket is split and elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained.

Hashing of 11 Numbers is Thus Completed.

Key Observations:

A Bucket will have more than one pointer pointing to it if its local depth is less than the global depth.

When overflow condition occurs in a bucket, all the entries in the bucket are rehashed with a new local depth.

If Local Depth of the overflowing bucket

is equal to the global depth, only then the directories are doubled and the global depth is incremented by 1.

4. The size of a bucket cannot be changed after the data insertion process begins.

Advantages:

Data retrieval is less expensive (in terms of computing).

No problem of Data-loss since the storage capacity increases dynamically.

With dynamic changes in hashing function, associated old values are rehashed w.r.t the new hash function.

Limitations of Extendible Hashing:

The directory size may increase significantly if several records are hashed on the same directory while keeping the record distribution non-uniform.

Size of every bucket is fixed.

Memory is wasted in pointers when the global depth and local depth difference becomes drastic.

This method is complicated to code.

Key takeaway

An overflow occurs at the time of the home bucket for a new pair (key, element) is full.

We may tackle overflows by

Search the hash table in some systematic manner for a bucket that is not full.

Linear probing (linear open addressing).

Quadratic probing.

Random probing.

1.7 Skip List – Representation, Searching and Operations – Insertion, Removal

What is a skip list?

A skip list is a probabilistic data structure. The skip list is used to store a sorted list of elements or data with a linked list. It allows the process of the elements or data to view efficiently. In one single step, it skips several elements of the entire list, which is why it is known as a skip list.

The skip list is an extended version of the linked list. It allows the user to search, remove, and insert the element very quickly. It consists of a base list that includes a set of elements which maintains the link hierarchy of the subsequent elements.

Skip list structure

It is built in two layers: The lowest layer and Top layer.

The lowest layer of the skip list is a common sorted linked list, and the top layers of the skip list are like an "express line" where the elements are skipped.

Complexity table of the Skip list

S. No	Complexity	Average case	Worst case
1.	Access complexity	O(logn)	O(n)
2.	Search complexity	O(logn)	O(n)
3.	Delete complexity	O(logn)	O(n)
4.	Insert complexity	O(logn)	O(n)
5.	Space complexity	-	O(nlogn)

Working of the Skip list

Let's take an example to understand the working of the skip list. In this example, we have 14 nodes, such that these nodes are divided into two layers, as shown in the diagram.

The lower layer is a common line that links all nodes, and the top layer is an express line that links only the main nodes, as you can see in the diagram.

Suppose you want to find 47 in this example. You will start the search from the first node of the express line and continue running on the express line until you find a node that is equal a 47 or more than 47.

You can see in the example that 47 does not exist in the express line, so you search for a node of less than 47, which is 40. Now, you go to the normal line with the help of 40, and search the 47, as shown in the diagram.

Skip list in Data structure

Note: Once you find a node like this on the "express line", you go from this node to a "normal lane" using a pointer, and when you search for the node in the normal line.

Skip List Basic Operations

There are the following types of operations in the skip list.

Insertion operation: It is used to add a new node to a particular location in a specific situation.

Deletion operation: It is used to delete a node in a specific situation.

Search Operation: The search operation is used to search a particular node in a skip list.

Algorithm of the insertion operation

Insertion (L, Key)

local update [0...Max_Level + 1]

a = L → header

for i = L → level down to 0 do.

while a → forward[i] → key forward[i]

update[i] = a

a = a → forward[0]

lvl = random_Level()

if lvl > L → level then

for i = L → level + 1 to lvl do

update[i] = L → header

L → level = lvl

a = makeNode(lvl, Key, value)

for i = 0 to level do

a → forward[i] = update[i] → forward[i]

update[i] → forward[i] = a

Algorithm of deletion operation

Deletion (L, Key)

local update [0... Max_Level + 1]

a = L → header

for i = L → level down to 0 do.

while a → forward[i] → key forward[i]

update[i] = a

a = a → forward[0]

if a → key = Key then

for i = 0 to L → level do

if update[i] → forward[i] ? a then break

update[i] → forward[i] = a → forward[i]

free(a)

while L → level > 0 and L → header → forward[L → level] = NIL do

L → level = L → level - 1

Algorithm of searching operation

Searching (L, SKey)

a = L → header

loop invariant: a → key level down to 0 do.

while a → forward[i] → key forward[i]

a = a → forward[0]

if a → key = SKey then return a → value

else return failure

Example 1: Create a skip list, we want to insert these following keys in the empty skip list.

6 with level 1.

29 with level 1.

22 with level 4.

9 with level 3.

17 with level 1.

4 with level 2.

Ans:

Step 1: Insert 6 with level 1

Skip list in Data structure

Step 2: Insert 29 with level 1

Skip list in Data structure

Step 3: Insert 22 with level 4

Skip list in Data structure

Step 4: Insert 9 with level 3

Skip list in Data structure

Step 5: Insert 17 with level 1

Skip list in Data structure

Step 6: Insert 4 with level 2

Skip list in Data structure

Example 2: Consider this example where we want to search for key 17.

Skip list in Data structure

Ans:

Skip list in Data structure

Advantages of the Skip list

If you want to insert a new node in the skip list, then it will insert the node very fast because there are no rotations in the skip list.

The skip list is simple to implement as compared to the hash table and the binary search tree.

It is very simple to find a node in the list because it stores the nodes in sorted form.

The skip list algorithm can be modified very easily in a more specific structure, such as indexable skip lists, trees, or priority queues.

The skip list is a robust and reliable list.

Disadvantages of the Skip list

It requires more memory than the balanced tree.

Reverse searching is not allowed.

The skip list searches the node much slower than the linked list.

Applications of the Skip list

It is used in distributed applications, and it represents the pointers and system in the distributed applications.

It is used to implement a dynamic elastic concurrent queue with low lock contention.

It is also used with the QMap template class.

The indexing of the skip list is used in running median problems.

The skip list is used for the delta-encoding posting in the Lucene search.

Key takeaway

References:

1. Horowitz, Sahani, Dinesh Mehata, ―Fundamentals of Data Structures in C++‖, Galgotia Publisher, ISBN: 8175152788, 9788175152786.

2. M Folk, B Zoellick, G. Riccardi, ―File Structures‖, Pearson Education, ISBN:81-7758-37-5

3. Peter Brass, ―Advanced Data Structures‖, Cambridge University Press, ISBN: 978-1-107- 43982-5

4. A. Aho, J. Hopcroft, J. Ulman, ―Data Structures and Algorithms‖, Pearson Education, 1998, ISBN-0-201-43578-0.

5. Michael J Folk, ―File Structures an Object-Oriented Approach with C++‖, Pearson Education, ISBN: 81-7758-373-5.

6. Sartaj Sahani, ―Data Structures, Algorithms and Applications in C++‖, Second Edition, University Press, ISBN:81-7371522 X.

7. G A V Pai, ―Data Structures and Algorithms‖, The McGraw-Hill Companies, ISBN - 9780070667266.

8. Goodrich, Tamassia, Goldwasser, ―Data Structures and Algorithms in Java‖, Wiley Publication, ISBN: 9788126551903

Sign Up

Index

Notes

Highlighted

Underlined

Browse by Topics

Notes

Highlighted

Underlined