Designing a good hash code generation algorithm is a black art. There is no formal and objective way to determine what is the best algorithm for a given scenario. In fact, most of the developers do not know how exactly to implement efficiently hash functions. As you know, hash functions for a given type are implemented in .NET by overriding the method
In the other hand, testing a hash function is cumbersome. A good hash function should have the following properties:
The contract verifier adds two child tests to the test fixture. They evaluate the probability of collision and the uniform distribution goodness-of-fit. At this time, the avalanche test is not supported (perhaps in a future release?)
Consider the sample type below. It implements an awful hash function of the additive kind. Let's imagine that according to some imaginary specifications,
Object.GetHashCode()
. Most of us (including myself, I must admit) usually shake up and down the values by shifting and xoring them randomly without knowing exactly if the result will be good enough. It’s even worse: if the implementation is poor, your application will continue to work because a bad hash function does "only" affect performance. So you might even not be able to notice it immediately.In the other hand, testing a hash function is cumbersome. A good hash function should have the following properties:
- Low probability of collision; meaning that the odds to get two values that produces the same hash code should be as low as possible.
- Hash codes should be distributed uniformly.
- It should achieve "avalanche" by generating hash values wildly different if even a single bit is different in the input key.
The contract verifier adds two child tests to the test fixture. They evaluate the probability of collision and the uniform distribution goodness-of-fit. At this time, the avalanche test is not supported (perhaps in a future release?)
Consider the sample type below. It implements an awful hash function of the additive kind. Let's imagine that according to some imaginary specifications,
value
is a number between 0 and 9, and dayOfWeek
is... well, I just let you find out :)public class Foo
{
private readonly int value;
private readonly DayOfWeek dayOfWeek;
public Foo(int value, DayOfWeek dayOfWeek)
{
this.value = value;
this.dayOfWeek = dayOfWeek;
}
public override int GetHashCode()
{
return value + (int)dayOfWeek;
}
}
Now let's use the contract verifier to evaluate our poor implementation.
[TestFixture]
public class FooTest
{
[VerifyContract]
public readonly IContract HashCodeAcceptanceTests = new HashCodeAcceptanceContract<Foo>()
{
CollisionProbabilityLimit = CollisionProbability.Low,
UniformDistributionQuality = UniformDistributionQuality.Good,
DistinctInstances = DataGenerators.Join(
Enumerable.Range(0, 10),
Enum.GetValues(typeof(DayOfWeek)).Cast<DayOfWeek>())
.Select(o => new Foo(o.First, o.Second))
};
}
CollisionProbabilityLimit
and UniformDistributionQuality
are the expected confidence levels. These are probability values between 0 (good) and 1 (bad), but we use here some handy predefined constants to improve readability (more details here and here). You may want to let the default values of 5% if you are not sure.DistinctInstances
must be fed with an enumeration of distinct Foo
instances. It's important to understand that the entire evaluation of the hash function is based on the statistical population you decide to provide. Therefore the confidence in the test results will be as good as (or as bad as, for the pessimists) the quality and the representativeness of that population. Two types of scenarios are possible. Either the range of possibilities is finite and reasonably small, or the number of possible distinct instances is nearly infinite or insanely large. In the first case, it's better to provide to the contract verifier all the possible values. Thus you get an exact and complete stochastic evaluation of your hash function. In the second case, you will need to carefully select a representative subset of all the possible values. In both cases you can feed the contract verifier by using a static method which returns an enumeration.
// ...
DistinctInstances = GetDistinctInstances()
// ...
private static IEnumerable<Foo> GetDistinctInstances()
{
// ...
}
Or to provide an existing catalog if they can be retrieved from an external data source.
// ...
DistinctInstances = Foo.GetThemAll()
// ...
You may also use the powerful MbUnit data generation framework to easily combine and generate random or sequential values.
// ...
DistinctInstances = DataGenerators.Join(
Enumerable.Range(0, 10),
Enum.GetValues(typeof(DayOfWeek)).Cast<DayOfWeek>())
.Select(o => new Foo(o.First, o.Second))
// ...
If we run the tests, the report informs us that the tests have miserably failed. The probability of collision is terribly high and the distribution is far from being uniform.
Let's improve our hash function by using a simple implementation of Dan Bernstein's famous algorithm.
Let's improve our hash function by using a simple implementation of Dan Bernstein's famous algorithm.
public override int GetHashCode()
{
return 33 * value ^ dayOfWeek.GetHashCode();
}
Unsurprisingly, running the tests again makes the report look better :)
Mission accomplished!
Additional information and examples might be found in the Gallio Wiki.
Mission accomplished!
Additional information and examples might be found in the Gallio Wiki.
No comments:
Post a Comment