Hashing
From Real Software Documentation
Aim
We combine a simple data structure called a hash table with the simple linear list from an earlier lesson. The result is a data structure that is much faster than the binary-searched array.
Contents |
About Hash Tables
Hashing is a very simple, very widely used technique to speed up virtually any data structure. It must be combined with another data structure, because what it does is to separate the items to be stored and searched into many smaller groups; but we still have to store and search those groups somehow.
The reason this is fast is because we can split one large data structure into many smaller data structures. The splitting is fast, and takes the same amount of time, no matter how many groups we have.
The groups are made by devising a way of consistently generating a hash key from every possible value to be stored. This hash number should have the following properties:
- It should be reasonably fast to generate;
- It should be in a reasonably small range (in the range of the number of groups you want);
- It should minimize collisions (values that wind up in with the same key). This means that a very “random” set of values should be evenly distributed, but also a set of, for example, very similar values should generate very different hash numbers.
This second criterion is sufficiently open to interpretation to have produced a reasonable amount of debate in computer programming circles. You will find a large range of hashing functions in computer science textbooks and on the Internet. The algorithm we are using is a reasonably robust one found online. Unfortunately, while the hash function itself is reasonably simple, the math behind the choice of a hash function is actually reasonably complex. If you are interested in the issues involved in devising a hash function, consult a suitable computer science text. One of the best (for this and many other topics) is The Art of Computer Programming[note 1].
One thing that helps with almost any hashing function is to have a prime modulus (the number of groups is called the modulus of the hash function).
Activities and Procedures
- Open the project 21_Hashing.
- Examine the code in the Hash module, then the HashStringAccumulator and the ListHashStringAccumulator class.
Notice how simple this all really is.
Also make sure you notice how HashStringAccumulator and ListHashStringAccumulator work together: rather than an explicit event passing from one to the other, the subclass only has to set up the hash table’s contents. HashStringAccumulator can employ any StringAccumulator in the hash table polymorphically; so once the subclass has set up some kind of StringAccumulator in each of the elements of the hash table, its job is done. All of this means that all the subclass needs to supply is a constructor.
Also notice that both the HashStringAccumulator and the contents of its hash table are both StringAccumulators, because they both do the same thing[note 2].
The value of the HashMod constant is a reasonable one for a relatively small amount of data such as we have in this project. To make following what is happening easier, change it to a small value (say, 5). Run the project, trying various values of HashMod.
You might also like to try creating other subclasses of HashStringAccumulator and timing them.
Further Exercises
A good test of your understanding of hashing is to see if you can create a two-level hashing structure (hash tables containing hash tables, that then contain some other structure such as a list). Think for a while about how you might do this.
Bonus points if you stopped to think about that, and you realized that you need to use different HashMod values for the two different levels.
References
- Notes
- ↑ In this case, you would consult Volume 3, but the whole series is an absolute work of art. KNUTH, D E, THE ART OF COMPUTER PROGRAMMING, 2ND ED, ADDISON WESLEY, READING MASS, 1997 ISBN 0-201-89685-0.
- ↑ This is not an uncommon type of strategy. A similar thing you might do on occasion provides a solution to a common problem: you have a class that calls a method on a single object of some type. You later decide that you now want it to call that method on each of a group of such objects. Rather than modify the class definition, you can write an intermediary class of the same type as the single class your class is calling now, and have your new “adapter” class pass that call on to a whole group of objects.
