Most serious SAS programmers are familiar with Looking Up Data With a Hash Object. Few of them are familiar with the hash function. It is not an absolute necessity to know about hash functions to use a hash object. However, the hash function takes a lot of the credit for why the hash object search algorithm is fast. Therefore, it deserves some attention. In this post, I briefly introduce what a hash function is. Also, I present an example of the MD5 and SHA256 SAS functions and discuss the pros and cons of the two.
The Hash Function
A hash function takes an input and produces a series of bytes, usually of fixed length, that identifies the particular input. The same input always gives the same output, and with a good hash function it is very unlikely that two different inputs give the same output. There are many different functions with very different underlying algorithms, and they are used for many different purposes. Operating systems use hash functions to verify the integrity of downloaded files. Databases use them to track changes.
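To make the idea concrete, here is a minimal sketch using the MD5 function. The variable names and the input strings are made up for illustration. It hashes the same string twice and a slightly different string once, then compares the results.

```sas
data _null_;
   /* MD5 returns a 16-byte binary digest */
   length digest1 digest2 digest3 $ 16;

   digest1 = md5('SAS hash object');
   digest2 = md5('SAS hash object');   /* same input as digest1  */
   digest3 = md5('SAS hash objects');  /* one extra character    */

   same   = (digest1 = digest2);       /* same input, same digest */
   differ = (digest1 ne digest3);      /* small change, new digest */

   /* $hex32. prints the 16 bytes as 32 hexadecimal characters */
   put digest1= $hex32.;
   put digest3= $hex32.;
   put same= differ=;
run;
```

The two flags in the log should both be 1: identical input gives an identical digest, while a one-character change gives a completely different one.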
Not surprisingly, the hash object also uses the concept of hash functions. When a hash object is declared and instantiated, key values are distributed into binary search trees (AVL trees). The hash object uses a hash function to do so. Furthermore, when a hash search is performed, SAS uses the same function to direct attention only to the relevant search tree. That is the reason that the hash search algorithm is so fast and scales well. Needless to say, this means that in the context of hash objects, hash function speed is of the essence. I write about the SAS hash object search algorithm in the post Comparing SAS Hash Object And Index Search Algorithm.
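The routing idea can be sketched in a few lines. This is only an illustration, not how SAS does it internally: the internal function is not public, so I use MD5 as a stand-in, and the bucket count of 16 and the key values are assumptions chosen for the example.

```sas
data _null_;
   length key $ 10 digest $ 16;

   /* Hypothetical number of buckets (search trees) */
   nbuckets = 2**4;

   /* 'alpha' appears twice on purpose */
   do key = 'alpha', 'beta', 'gamma', 'alpha';
      digest = md5(strip(key));

      /* Use the first digest byte (0-255) to pick a bucket */
      bucket = mod(rank(char(digest, 1)), nbuckets);

      put key= bucket=;
   end;
run;
```

Both occurrences of 'alpha' land in the same bucket. That is the property a lookup relies on: hashing the search key tells you which single tree to search, and the others can be ignored.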
MD5 vs SHA256
There are many hash functions out there. The most famous is probably the MD5 Function. We know that speed is very important from a hash object point of view, and MD5 is fast. However, it has known weaknesses. In fact, from a file integrity point of view, the MD5 function is considered cryptographically broken, because collisions can be constructed deliberately. Still, the chance of an accidental hash collision is very small. And in the context of hash objects, the consequences are not unbearable.
The SHA256 Function is another example of a well documented and thoroughly tested hash function. It is considered much more robust than MD5. However, it does lack the speed of the MD5 function. Let us look at an example. Below, I call the MD5 Function one million times and the SHA256 Function one million times.
data _null_;
   /* Time one million MD5 calls (10e5 = 1,000,000) */
   time=time();
   do i=1 to 10e5;
      string=uuidgen();
      hash=md5(string);
   end;
   elapsedtime=time()-time;
   put elapsedtime;

   /* Time one million SHA256 calls */
   time=time();
   do i=1 to 10e5;
      string=uuidgen();
      hash=sha256(string);
   end;
   elapsedtime=time()-time;
   put elapsedtime;
run;
You can see the timing results in the log. The MD5 calls take a little over a second while the SHA256 calls take 19 seconds to run. The SHA256 calls are substantially slower.
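Note that both loops above also spend time in uuidgen(). If you want to separate that cost from the cost of the hash functions themselves, one option is to time a baseline loop that only generates the strings. This is a sketch under that assumption. The variable names start and baseline are my own.

```sas
data _null_;
   /* uuidgen() returns a 36-character string by default */
   length string $ 36;

   /* Same loop as above, but without any hash function call */
   start = time();
   do i = 1 to 10e5;
      string = uuidgen();
   end;
   baseline = time() - start;

   put baseline=;
run;
```

Subtracting the baseline from each of the two timings gives a rough estimate of the time spent in MD5 and SHA256 alone.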
SAS does not reveal the internal hash function of the hash object, but we do know two things: it is very fast, and it does not have the vulnerabilities of the MD5 function.
In this post, I briefly introduced the concept of hash functions and why they are important from a hash object point of view. I also pointed out that speed is the most important thing to consider. Luckily, the internal hash function of the hash object is very fast. The hash function takes a large piece of the credit for the performance advantages of the hash object. Therefore, it is good to have some basic knowledge on the topic.
You can download the entire code from this post here.