Using the SAS Hash Object and Point= Option

The hash object provides the tools for extremely efficient and flexible table lookups. However, the hash object is an in-memory structure. Therefore, the amount of data we can read into the hash object is bounded by memory. This can be a problem when the lookup file is very large. Today, I will demonstrate a trick that blew my mind when I first saw it in the article HASH + POINT = KEY by Paul M. Dorfman and Lessia S. Shajenko.

In the lookup examples to come, I will use the following example data sets Master and Lookup. The Master data simply consists of a single key variable k. The Lookup data set contains the key variable k along with 10 character data variables d1-d10 of length 200. Consequently, the lookup data set is about 1GB.

data master (keep = k) lookup;
    do _N_ = 1 to 1e6;
        k = uuidgen();
        array d $ 200 d1-d10;
        do over d; d = uuidgen(); end;
        if rand('uniform') < .5 then output lookup;
        output master;
    end;
run;

The Usual Hash Object Lookup

First off, let us recap how a normal Hash Object Lookup looks like. In the data step below, I use the trick described in Read Many Variables into the SAS Hash Object to read all of the data variables into the hash object data portion. Next, I simply read the Master Data set in the Set Statement. Finally, I use the Find() Method to perform the lookup.

In the code below, I use the technique from How Much Memory Does SAS Hash Object Occupy?. This reveals that the hash object occupies about 0.9 GB memory.

data hash (drop=rc);
 
    if 0 then set lookup;
    array d $ 200 d1-d10;
 
    if _N_ = 1 then do;
        before=input(getoption('xmrlmem'),20.);
        declare hash h (dataset : "lookup");
        h.definekey ("k");
        do over d;
            h.definedata (vname(d));
        end;
        h.definedone ();
        after=input(getoption('xmrlmem'),20.);
 
        hashsize=before-after;
 
        put "Hash Object Takes Up:" hashsize sizekmg10.2;
    end;
 
    set master;
    call missing (of d:);
 
    rc = h.find();
run;

Using a Record Pointer and the Point= Option

Next, let us perform the same lookup as above. However, now I do not read any of the data variables d1-d10 into the SAS hash object. Instead, I first create an empty hash object with the same key k. However, I specify only a single data variable p. This will serve as a observations pointer later. Next, I use a Double DoW Loop. In the first DoW loop, I fill up the hash object from the Lookup data set. I increment p for each iteration and use the Add() Method to read in the observation. In the second DoW loop, I perform the lookup. However, the data portion is not read from the hash object. Instead, I lookup the observations pointer p. If p is found, I use the Set Statement with the Point= Option to read exactly the observation with the relevant data.

As before, I measure the memory footprint of the hash object. However this time, the SAS hash object takes up about 95MB of memory. I.e. a tenth of above. However, not surprisingly, the run time suffers a bit because this is not a pure memory lookup. Instead we rely on I/O operations for each successful lookup of p.

As a final exercise, try editing the data step at the top and create 100 variables instead of 10. What happens? On most systems, the memory consumption is too large for a pure hashing lookup. However, the memory footprint of the Hash+Point= technique will not change.

data hashpoint;
 
    declare hash h ();
    h.definekey ("k");
    h.definedata ("p");
    h.definedone ();
 
    before=input(getoption('xmrlmem'),20.);
    do p = 1 by 1 until (lr1);
        set lookup end=lr1;
        h.add();
    end;
    after=input(getoption('xmrlmem'),20.);
 
    hashsize=before-after;
 
    put "Hash Object Takes Up:" hashsize sizekmg10.2;
 
    do until (lr2);
        set master end=lr2;
        call missing (of d:);
        if h.find() = 0 then set lookup point=p;
        output;
    end;
 
run;

You can verify that the too results are identical with a simple Proc Compare call.

proc compare base=hash comp=hashpoint;
run;

Summary

In this post, I demonstrate a technique to drastically reduce the memory footprint of a SAS hash object. The key is to not read actual data values into the data portion of the SAS hash object. Instead, we read in observation pointers and use these in subsequent Point= Options in a Set Statement. This technique is slower than a pure hashing lookup. However, it can be a beneficial technique when data is very wide with many variables to be looked up.

For other memory saving techniques, see the blog posts Three Basic Techniques to Reduce SAS Hash Object Size and Two Advanced Techniques to Reduce SAS Hash Object Size.

You can download the entire code from this post here.