I have written quite a lot about the SAS hash object. Almost all of the questions that I receive on the blog are on this topic. The popularity makes sense. Hash objects solve problems that no other SAS tool does. Also, the design of the hash object makes it very efficient, and the code is usually easy to maintain.
In this post, I will focus on the learning side, because the learning curve can be quite steep. For the beginner or intermediate Base SAS programmer, the syntax does not look like regular data step code. Below, I will present five tips to learn and understand the hash object in SAS. Furthermore, I will point you in the direction of the best literature and learning material available.
1. The Proper Literature
If you are serious about learning about hash objects in SAS, there are two books you must acquire:
- SAS Hash Object Programming Made Easy: A very nice introduction to the hash object. The book covers the hash object mainly as a lookup tool. It explains the object in easy-to-understand language. Even the beginner/intermediate SAS programmer will have no problem following the pace of the book.
- Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study: While the book above covers the hash object mainly as a lookup tool, this book takes it further. It takes a deeper dive into how and why the hash object is so efficient and covers quite a lot of topics unfamiliar to most programmers. The book is written by Don Henderson and Paul Dorfman. Both authors have written a few articles on the topic as well. Browse them at lexjansen.com. Also, Paul Dorfman has been active in both the old SAS-L Community and the official SAS Community. You will learn something from each and every one of his replies.
I recommend that you read both books. If you have some knowledge of the hash object already, I recommend that you skip the first one and go straight to the second.
2. The PDV and Host Variable Interaction
The key to understanding the SAS hash object is to understand the interaction between it and the PDV (Program Data Vector). The hash object can transfer data to and from the PDV. No other interactions between the data step and the hash object occur.
There exist three types of methods. Some transfer data from the PDV to the hash object. Some transfer data from the hash object to the PDV (such as in a Hash Object Lookup). Finally, some methods do not transfer data at all.
data _null_;
   declare hash h();
   h.defineKey('k');
   h.defineData('d');
   h.defineDone();

   k=1; d=2; h.add();   /* PDV --> Hash Object */
   k=2; d=4; h.add();   /* PDV --> Hash Object */

   k=1; rc=h.find();    /* Hash Object --> PDV */

   put k= +2 d=;
run;
You should familiarize yourself with the different types of hash object methods available. In the code above, I use the Add() Method to add the current values of the host variables in the PDV to the hash object. Furthermore, I use the Find() Method to search the hash object for the current value of the PDV host variable k (k=1). If the key exists, the associated data values are transferred into the PDV host variables. I have assembled a list of hash object methods and how they should be called in the blog post Assigned Vs Unassigned Hash Object Method Call in SAS.
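To make the three method types concrete, here is a minimal sketch that contrasts the Check() Method, which transfers nothing, with the Find() Method, which overwrites the host variable in the PDV. The setup mirrors the example above. The key value 99 is just an arbitrary value that is not in the hash object.

```sas
data _null_;
   declare hash h();
   h.defineKey('k');
   h.defineData('d');
   h.defineDone();

   k=1; d=2; h.add();        /* PDV --> Hash Object */

   d=.;                      /* blank out the host variable */
   rc=h.check();             /* no transfer: only reports whether k exists */
   put 'After check: ' rc= d=;

   rc=h.find();              /* Hash Object --> PDV: d is overwritten on success */
   put 'After find:  ' rc= d=;

   k=99;                     /* arbitrary key that was never added */
   rc=h.find();              /* lookup fails: rc is nonzero, d is left untouched */
   put 'Missing key: ' rc= d=;
run;
```

I use assigned calls (rc=...) for Check() and Find() here, so a failed lookup shows up as a nonzero return code that the program can test.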
3. The Hash Function
If you want to understand why hash objects are fast, you have to learn about hash functions. A hash function takes an input and produces a fixed-length series of bytes. The same input always gives the same output, and for a good hash function, different inputs almost always give different outputs. In a hash object context, this series of bytes is converted into a number that points to a binary search tree (more precisely, a self-balancing AVL tree). Then, when a hash search is performed, SAS uses the same hash function to direct attention only to the relevant search tree. That is the reason that the hash search algorithm is so fast and scales well. I compare the hash search to the index search in the blog post Comparing SAS Hash Object And Index Search Algorithm.
data _null_;
   string="Hash This!";
   one=md5(string);      /* 16-byte digest */
   two=sha256(string);   /* 32-byte digest */
   /* Each byte prints as two hexadecimal characters */
   put one $hex32. / two $hex64.;
run;
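Run the code above and check the log. That is what the hexadecimal representation of a hash function's output looks like. Note that MD5 returns a 16-byte digest (32 hexadecimal characters) and SHA256 returns a 32-byte digest (64 hexadecimal characters). I have previously written an entire post on the topic in MD5 and SHA256 Hash Function Example.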
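The step from "series of bytes" to "a number that points to a search tree" can be sketched as follows. This is an illustration of the idea only, not the hash function SAS uses internally: MD5 stands in for the real function, and the bucket count n_buckets and the sample keys are assumptions made for the example.

```sas
data _null_;
   length key $ 20 digest $ 16;
   n_buckets = 2**4;   /* hypothetical number of buckets (search trees) */

   do key = 'alpha', 'beta', 'gamma', 'alpha';
      digest = md5(strip(key));                /* series of bytes */
      /* Read the first 4 bytes as an unsigned integer and   */
      /* fold it into the range 0 .. n_buckets-1             */
      bucket = mod(input(substr(digest, 1, 4), pib4.), n_buckets);
      put key= bucket=;
   end;
run;
```

Notice in the log that the two occurrences of 'alpha' land in the same bucket. That determinism is what lets a lookup go straight to the one tree that can contain the key and ignore all the others.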
4. An In-Memory Structure
The hash object resides in memory. That is (partly) why it is so efficient. However, this also means that the amount of data you can store in a SAS hash object is no longer bounded by disk space. Instead, it is limited by the memory available to your SAS session.
This means that you have to be careful about the data you put into a hash object. Read in only the data that is absolutely necessary. I have previously written about the memory consumption of a hash object in the blog post How Much Memory Does SAS Hash Object Occupy?
Furthermore, I present a few techniques to limit the hash object size in the blog posts Three Basic Techniques to Reduce SAS Hash Object Size and Two Advanced Techniques to Reduce SAS Hash Object Size.
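As a rough sketch of both points, the step below loads only the two variables it needs and then uses the item_size and num_items attributes to estimate the size of the stored items. The data set sashelp.class is used purely for illustration, and the product is an approximation of the item storage only, not the total memory the hash object occupies.

```sas
data _null_;
   /* Define the host variables in the PDV without reading any rows */
   if 0 then set sashelp.class(keep=name age);

   /* Load only the variables that are actually needed */
   declare hash h(dataset: 'sashelp.class(keep=name age)');
   h.defineKey('name');
   h.defineData('age');
   h.defineDone();

   item_size    = h.item_size;    /* bytes per item */
   num_items    = h.num_items;    /* number of items stored */
   approx_bytes = item_size * num_items;

   put item_size= num_items= approx_bytes=;
   stop;   /* the SET statement never executes, so end the step explicitly */
run;
```

Try removing the keep= option and adding more data variables to see how quickly the item size grows.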
5. The Ability to Grow and Shrink at Run Time
The hash object is the only data structure in the SAS Data Step that can grow and shrink dynamically at run time. Unlike a SAS array, we do not have to specify the number of entries or the amount of memory at compile time. This is quite extraordinary and enables some very powerful applications, some of which I present in the blog post The SAS Hash Object as a Dynamic Placeholder.
As a small teaser, consider the code below.
data _null_;
   declare hash h();
   h.defineKey('k');
   h.defineData('d');
   h.defineDone();

   /* Grow: add ten items, one at a time */
   do k=1 to 10;
      d=k*2;
      h.add();
      num=h.num_items;
      put num=;
   end;

   /* Shrink: remove them again in reverse order */
   do k=10 to 1 by -1;
      h.remove();
      num=h.num_items;
      put num=;
   end;
run;
Here, I set up a hash object h. Then I fill it up with items one at a time and put the number of items in the SAS log. Next, I remove the items one at a time and again put the number of items in the log. As you can see, I can add and remove items from the hash object directly while the program executes. Quite cool.
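One practical consequence is that you can collect values without knowing in advance how many there will be. The sketch below counts the distinct values of age in sashelp.class. The data set and the name of the hash object (seen) are chosen for illustration only. With an array, I would have to guess the number of distinct values up front.

```sas
data _null_;
   declare hash seen();
   seen.defineKey('age');
   seen.defineDone();

   do until (eof);
      set sashelp.class(keep=age) end=eof;
      /* Add the key only if it is not already there */
      if seen.check() ne 0 then seen.add();
   end;

   n_distinct = seen.num_items;   /* the object grew to exactly this size */
   put n_distinct=;
   stop;
run;
```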
In this post, I present five tips to learn and understand the SAS hash object. The hash object can be quite hard to grasp at first. Also, the syntax does not look like ordinary data step programming. But once you get the hang of it, it unlocks a world of opportunities not offered by any other programming facility in SAS. See the related post 6 Reasons Why The SAS Hash Object Fails.
Did I miss something? Feel free to reach out if you have some learning materials or tips on the hash object that I missed here.
You can download the entire code from this post here.