Partial key lookups are not supported in the SAS hash object. When two key variables are specified in a hash object, we must use exactly two keys when we retrieve data from it. In this blog post, I will demonstrate a workaround to retrieve data from the SAS hash object. Without specifying all key variables.
In the examples to come, I will use the data below.
data have; input x $ y z v; datalines; A 1 1 100 A 1 2 200 A 2 3 300 A 2 4 400 B 3 5 500 B 3 6 600 B 4 7 700 B 4 8 800 ;
Partial Key Hash Lookup – The Problem
Suppose I create a hash object h with the key variables x and y. Now, I want to retrieve all the items from h for which x equals “B”. This “partial-key” lookup is not supported in the SAS hash object. You may think of a simplified syntax like below. However, this snippet yields an error. We must specify two keys.
A natural question arises: Why not just add this feature? Unfortunately, there is a good answer. When SAS inserts data into the hash object, it runs the key value through a hash function. The hash function distributes the key values evenly into buckets and into AVL search trees. When we perform a lookup, the same thing happens, and SAS knows exactly which bucket to search in. This is part of the reason why hash table lookup are very fast. However, when multiple keys exist in h, the values are concatenated and run through the hash function. That is why a partial key lookup is not possible. SAS does not know in advance which bucket to search in.
data _null_; if 0 then set have; dcl hash h(dataset : 'have', multidata : 'Y'); h.definekey('x', 'y'); h.definedata(all : 'Y'); h.definedone(); rc = h.find(key : "B"); run;
A Workaround – Multiple Hash Objects
We can create a workaround to the problem though. While a partial key lookup is not directly supported, we can emulate it. It is not a trivial task. Depending on the number of keys, we need a few extra hash objects to help us. Consider the code snippet below. We declare h as above. Furthermore, I create two extra hash objects. First, I create u to hold the unique key combinations of x and y in the input data. Notice, that I do not use the Multidata: “Y” argument tag. I declare a hash iterator and link it to u. Next, I create hx. This will serve as a mapping from x to y.
Next, I enumerate u and insert the unique mappings from x to y into hx. Now, we do the actual lookup. Suppose, I want to find all items in h for which x equals “B”. With my new setup, this is not too hard. First, I use a Do_Over Loop to find all values of y for which x equals “B”. When, I have all combinations of x and y where x equals “B”, the final lookup is easy. I use a second Do_Over call to get all the items in h where x equals “B”.
data _null_; if 0 then set have; dcl hash h(dataset : 'have', multidata : 'Y'); h.definekey('x', 'y'); h.definedata(all : 'Y'); h.definedone(); dcl hash u(dataset : 'have', ordered : 'Y'); u.definekey('x', 'y'); u.definedone(); dcl hiter ui('u'); dcl hash hx(multidata : 'Y'); hx.definekey('x'); hx.definedata('y'); hx.definedone(); do while (ui.next() = 0); hx.add(); end; x = "B"; do while (hx.do_over() = 0); do while (h.do_over() = 0); put (x y z v) (=); end; end; run;
Three Hash Object Keys
The technique is not limited to two key variables. We can apply this technique to an arbitrary number of keys. However, we must decide which keys we want to to partial key lookups for. In the example above, we can retrieve data in h for single-key values of x. Not for y. If we want the same ability for y, we must create another hash object mapping.
In the snippet below, I do exactly the same thing as above. However, here we have three key variables x, y and z. The same principle applies. Suppose, I want to retrieve all items in h for which x equals “B” and y equals 3. I create a mapping from x and y to z in the hash object hxy. Then, I enumerate all combinations of the two keys to get z. Finally, I can to the actual lookup in h.
Logically, this leaves us with the general plan for a partial key lookup.
- Create a hash object with the unique combination of all the original key variables in your master object.
- Decide for which variables or combination of variables you want to be able to do partial key lookups.
- Create a hash object for each variable(s) in step 2, and map them to the rest of the original keys in your master object.
- Enumerate the hash object from (1) with an iterator object and add the items to the hash object maps in (3).
- Finally, do the lookup. First, do the mapping of the partial keys to the rest of the keys with the objects in (3). Then, enumerate the master object to retrieve all the items of interest.
data _null_; if 0 then set have; dcl hash h(dataset : 'have', multidata : 'Y'); h.definekey('x', 'y', 'z'); h.definedata(all : 'Y'); h.definedone(); dcl hash u(dataset : 'have', multidata : 'Y', ordered : 'Y'); u.definekey('x', 'y', 'z'); u.definedone(); dcl hiter ui('u'); dcl hash hxy(multidata : "Y"); hxy.definekey("x", "y"); hxy.definedata("z"); hxy.definedone(); do while (ui.next() = 0); hxy.add(); end; x = "B"; y = 3; do while (hxy.do_over() = 0); do while (h.do_over() = 0); put (x y z v) (=); end; end; run;
In this post, we explore how to do partial key lookups in SAS hash objects. This is not a trivial task. We learn that this is not directly supported in SAS. Also, there is a good reason why. However, we can find a workaround with a little help from a few extra hash objects.
I first learned this technique in the article Better Hashing in SAS 9.2 by Robert Ray and Jason Secosky.
You can download the entire code from this post here.