Last week, I blogged about Fuzzy Merging in SAS Using the Hash Object. I demonstrated a general method to perform fuzzy matching using the hash object. However, suppose we deal with numerical keys. Say we want to match to the closest value of some key in another data set. In that case, we can utilize the dynamic nature of the hash object better. This post shows you how.

I will use this example data in the snippets.

data one;
input n1;
datalines;
1.4 
9.3 
5.1 
3.8 
7   
2   
;
 
data two;
input n2 @@;
datalines;
9 1 5 3 4 6 8 2 10 7 
;

Nearest Neighbor Match Using the Hash Object

The goal is this. Match the closest value of n1 in one to n2 in two. Meaning that 9.3 is matched to 9 because no other number is closer. So the desired result looks like this

Obs  n2  closest 
1    9   9.3 
2    1   1.4 
3    5   5.1 
4    3   3.8 
5    4   3.8 
6    6   5.1 
7    8   7.0 
8    2   2.0 
9    10  9.3 
10   7   7.0

First, let us recall the approach from the previous post. For each observation in two, I iterate through the entire hash object h to find the closest value. The approach works. However it is not hard to see this will not scale well, since we iterate all entries of h for each observation read.

data want1(keep = n2 closest);
   if _N_ = 1 then do;
      dcl hash h(dataset : "one");
      h.definekey("n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
   n1 = .;
   c = constant("big");
 
   do while (i.next() = 0);
      if abs(n1 - n2) < c then do;
         c = abs(n1 - n2);
         closest = n1;
      end;
   end;
run;

Instead, let us utilize the fact that the SAS hash iterator lets us jump to a specific point in h using the Setcur Method. The Setcur Method is the only iterator method that takes a key as an argument. The logic in the code below is as follows. We read the data set one into a hash object h and use the ordered keyword to ensure that the entries are traversed in ascending order. I specify n1 as the key variable and link the hash iterator object i to h.

Next, I read the two data set sequentially. I use the Check Method to see if an exact match of n1 is found in one. If so, that is the nearest match and we are done.

If the Check Method fails, there is not an exact match. Therefore, we must find the nearest non-exact match. We start by adding the current value of n2 into h. We do so to be able to use the Setcur Method to put the iterator object between the two candidates of the nearest neighbor. Next, we use the Next and Prev iterator methods to retrieve the next and previous value in h respectively. The If-Then block and Choosen Function simply finds what value is closest.

Finally, we must remove the inserted value of n2 from h again. We can safely call the Remove Method unassigned because we added the value ourselves. So we know it is there.

/* Direct Access with the Setcur Method */
data want2(keep = n2 closest);
   if _N_ = 1 then do;
      if 0 then set one;
      dcl hash h(dataset : "one", ordered : "Y");
      h.definekey("n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
 
   if h.check(key : n2) = 0 then closest = n2;
 
   else do;
      h.add(key : n2, data : n2);
      if i.setcur(key : n2) = 0 then if i.prev() = 0 then pn1 = n1;
      if i.setcur(key : n2) = 0 then if i.next() = 0 then nn1 = n1;
 
      if      nmiss(nn1) then idx = 1;
      else if nmiss(pn1) then idx = 2;
      else idx = 1 + (n2 - pn1 > nn1 - n2);
 
      closest = choosen(idx, pn1, nn1);
 
      h.remove(key : n2);
   end;
run;

It is not hard to see that this approach will scale better than the Do_Over Approach. Because we utilize the direct accessing of the Setcur Method instead of traversing the entire object at each obvservation.

A Closest Value Match With Larger Data

Let us create larger versions of the input data and run the code again.

data one;
   do _N_ = 1 to 1000;
      n1 = .1 * ceil(rand('uniform') * 1e4);
      output;
   end;
run;
 
data two;
   do n2 = 1 to 1e5;
      output;
   end;
run;

The code below runs in under a second on the large data above. If you run the code at the top of the post on the same data it will be much slower than that. I do encourage you to so so.

data want(keep = n2 closest);
   if _N_ = 1 then do;
      if 0 then set one;
      dcl hash h(dataset : "one", ordered : "Y");
      h.definekey("n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
 
   if h.check(key : n2) = 0 then closest = n2;
 
   else do;
      h.add(key : n2, data : n2);
      if i.setcur(key : n2) = 0 then if i.prev() = 0 then pn1 = n1;
      if i.setcur(key : n2) = 0 then if i.next() = 0 then nn1 = n1;
 
      if      nmiss(nn1) then idx = 1;
      else if nmiss(pn1) then idx = 2;
      else idx = 1 + (n2 - pn1 > nn1 - n2);
 
      closest = choosen(idx, pn1, nn1);
 
      h.remove(key : n2);
   end;
run;

Summary

In this post, we investigate yet another approach to do nearest neighbor matching in SAS. Here, we use the hash object and the Setcur Method. This way, we avoid traversing the lookup data at each observation. I learned this approach from the article The SAS Hash Object in Action by Paul Dorfman Dorfman. There are a few other nice examples in there as well.

Also, read the related posts Closest Value Match in SAS Using the Hash Object – Part 2, Advanced Fuzzy Grouping Using the SAS Hash Object and A Seven of Nine Fuzzy Matching Problem.

You can download the entire code from this post here.