Today’s topic of study: Hashtables
My view of hash tables is heavily influenced by an oral doctoral exam question my boyfriend in college had: “why or why not would you use a random number generator to dither an image?”
There’s a whole lot of info packed in that question — what is “dithering” for one, but also what randomness actually means. And I’ll give you a hint on that last one: what does a random distribution look like in the Fourier Domain? Yeah, I’ll explain that one too.
Just real quick, dithering is a computer graphics term. Best way to explain it: you know how images in newspapers have dots to make different shades of gray? That’s a dither pattern. For newspapers, the dots are larger and larger for darker areas.
Color printers also use dither patterns. If you look closely at a printout on a color printer, you will see the tiny dots. These aren’t the same as a newspaper dither, where the dots gets larger and larger. There are more and more of them, but they are more scattered for each pixel, not clumped like for a newspaper.
How you dither can drastically affect the quality of the image.
How you dither is tied to the method of distributing the ink.
If I may take an aside, I worked at a company in the late 80’s that was building a color printer. It was one of the very first in the world. It was a dye sublimation printer — meaning the dye sublimated (or sunk down into) the paper. The colors, because of this, were very smooth and rich. The prints were amazing! My boyfriend back then (um, a different one :-> ) was writing the drivers for the printer, and got me a job there too. I worked on their dither patterns. Wow, did I learn a lot!
The obvious one would be to place, say, a dot in the center of the pixel for value 1, then maybe a dot in the center, then a dot in the upper right for value 2, and those plus lower left for 3, and so forth. Actually, the upper right and lower left, in a field of the same color, would touch, so you have to think a lot about how the pattern wrapped on itself as well.
But the bottom line is, the obvious idea was to place the dots as far away as possible.
The prints looked terrible by this method! What was going wrong!
As it turned out, the process of printing was a static charge process. There was a drum that would get charged by the different dots, then roll into a powdered ink, then roll the ink onto the paper. Well, a single dot charge was not quite enough to pick up any ink — two dots next to each other did much better. And so this “well-placed dots” algorithm mean for values 0-50% they didn’t print anything at all! So a dither pattern that more resembled a newspaper dither worked much better. (In the end, I combined the two a bit, by placing dots far away, but always in pairs.)
I had a lot of fun on that project and wrote a little program to allow me to develop my own dither patterns — you were presented a square of 8×8 and would click the dots in the order they should appear for darker and darker colors. I made all kinds of silly dither patterns: spirals, cats, and stuff like that. 🙂
A quick aside on dithering: I learned only recently that pointilism — the method of painting using dots of paint — has a special quality that the paintings look more bright than regular painting. I learned that this is because blank parts of the canvas are left white, and so the paintings have more light reflecting off of them. They really are physically brighter than paintings that cover the whole canvas!
From that, I noticed that today’s dither patterns tend not to be the same for all the color fields. The cyan, magenta and yellow inks do not overlap where they are printed. This takes advantage of the same effect: white paper is reflected back, and makes the image look brighter. Kind of amazing that it works that way!
Ok, so that’s dithering.
Now, I was stuck in the idea of an 8×8 grid to replace each pixel, but I learned about error diffusion in dithering. This is a method whereby you print your dot, then, as you move along printing your image, you keep track of how long it has been since you printed that dot, accumulate that amount, then print again when it’s about time for another dot. This does not produce a grid-like pattern. It is more blended across the whole image. Error diffusion is very cool.
I can now tell you this joke:
My friend, Mark said: “I don’t tan. I have freckles. When I’m in the sun, I just get more freckles.”
Me: “Ah. You have a dithered tan!” Yuk yuk. 🙂
And…so why wouldn’t a random number be just great for generating a dither pattern?
Let’s talk about the Fourier Domain.
The fourier domain is a transformation of a signal into a graph of its component frequencies. Sounds technical, but it’s not really. If you have a sine wave, just one wave, in the fourier domain, it’s a single spike, a single frequency. If you have a wavy sine wave, that’s two frequencies, and they show as two spikes in the Fourier Domain. And so forth.
I learned about the fourier domain by doing a FFT (fast fourier transform) on images in Advanced Computer Graphics class. (Boy, what a great class!!!)
(Note: I didn’t really understand the math itself at the time — I’d never even heard of the fourier domain anyway. I just implemented the algorithm and got the correct answers. I am hungering to go back and re-write that program, now that all the info has sunk in, and I understand it better.)
(Super geek info: We got two images back: one for the frequencies, and one for the phase — where in the sine wave the frequency was in the image, up or down or in the middle, etc. As a homework exercise, we were asked to swap the frequency image with another image and undo the transform, and to do the same with the phase image. The shocker was that swapping the frequency image did nothing, but the phase image caused the donor image to be the result! The phase has far more to do with an image than its frequencies! Wow did that blow my mind!!! Let me know if you want me to elaborate on this in another post. But it’s pretty geeky…)
But for random numbers, what do they look like in the fourier domain?
They are a flat line, all at the same amplitude. In short, they contain all frequencies.
For the lay person, basically, they clump. They clump big and small and all sizes in between. They clump at every scale.
And random clumpiness doesn’t look very good in an image.
A more technical way to put it is: randomness is not well distributed. This is an important thing to understand: for some things, you want them all nicely separated apart from each other.
Like birds on a wire. Well distributed.
For a hash table, you kinda want that too. You want each hash to be nicely far from the others. You kinda want all of them to be nicely separated from all the others.
Well distributed.
A hash table is quite an amazing sorting and retrieval method.
Imagine first that you have a bunch of PO boxes. You put each name on the box in alphabetical order. But now you have to put a letter in “Bea’s” mailbox. Which box is hers? You have to search through all the names to find Bea. That takes a long time.
How about this idea: take all the letters, assign numbers 1-26, then add all of Bea’s name letters together and give her that box! Super easy!
That is a hash.
It is O(1) — wow! All you do to store something is hash the key, and voila! In a single operation, you have the place to put that thing! Getting the letters out is the same — one operation. That’s very very fast.
So our hash algorithm is adding all the letters in a name together, and ba-da-bing! you go right to the box and get your stuff! No searching or slogging through all the boxes for the right name. In one move, you’ve got your stuff!
Hmm…but what happens when you have “Abe” and “Bea”? They have the same letters. When you add them, they crash. They will point to the same place in your hash table. Abe might get Bea’s stuff, and vice versa.
Well, you can simply put “Bea” on the first one, then move to the next one and put “Abe” and when you pop over in your one-operation, then you scan for a match.
Hmm. Of course, this is great when there are few crashes, but your O(1) operation can become O(n) pretty quick, if your hash algorithm isn’t…dun dun duuun!!! Well distributed.
So a nice well distributed hash algorithm is important.
The link below has an amazing discussion of various hashing algorithms. The super amazing thing is someone actually did a study of the various ones, and published their results of how well distributed each was. Sadly, the writer easily interchanges the words “random” and “well distributed” and those are absolutely not the same thing.
But they show some just amazing visualizations of how well distributed each algorithm is. Just look at the purdy rainbow images, and it should all be clear. (But then I really prefer things to be explained visually like that.)
Ok that’s my lesson for today! I’m going to start playing around with well distributed hash algorithms today!