Fuzzy Merge in SAS with the Hash Object

There are several ways to combine data in SAS. However, when the keys are not exactly equal, things get more complicated. This is called a fuzzy merge/join. Today, I will demonstrate how to use the hash object to perform fuzzy merges in SAS. Obviously, the word “Fuzzy” is up for interpretation. However, it usually means “Closest” in the contexts that I see.

I will use the example data below in the examples to come.

data one;
input id x;
datalines;
1 16 
2 32 
3 65 
;
 
data two;
input id x y;
datalines;
1 10 1 
1 14 2 
1 25 3 
1 22 4 
2 30 1 
2 36 2 
2 43 3 
3 50 1 
3 61 2 
3 67 3 
;

A Fuzzy Merge Using the SAS Hash Object

First, let us see an example of how to do a fuzzy merge in SAS. Using the two data sets above, the goal is this. For each ID in one, find the y in two for which the distance between one.x and two.x is minimal. This kind of problem is usually dealt with using Proc SQL. Let us see how the hash object approach goes. First, we declare the hash object h. I rename x to xx in the hash object and specify Multidat : “Y” to allow multiple items per key value.

Next, I read one sequentially and initialize y and xx to missing and set c to a very large number. Then, I use the Do_Over Method to loop over all items in h for the current ID. If the distance between x and xx is smaller than c, then we update c to this distance and set yy = y. This will eventually give me the desired value of y.

data want(keep = id x y);
 
   if _N_ = 1 then do;
      dcl hash h(dataset : "two(rename = x = xx", multidata : "Y");
      h.definekey("id");
      h.definedata("xx", "y");
      h.definedone();
   end;
 
   set one;
   y = .; xx = .; c = constant("big");
 
   do while (h.do_over() = 0);
      if abs(x - xx) < c then do;
         c = abs(x - xx);
         yy = y;
      end;
   end;
 
   y = yy;
run;

Fuzzy Merging With Multiple Items Per ID

Suppose we have multiple observations per ID in one. Also, we do not want to match two different values in one to the same value in two. Most of the logic is the same. However, in the first Do_Over Loop, I use an index variable i to track the number of iterations. I use this vairable in a subsequent loop to find the same entry. Then, I use the Removedup Method to remove that item from the hash object. This means that we cannot match the item to a later value.

data one;
input id x;
datalines;
1 15 
1 16 
2 32 
2 33 
3 65 
3 66 
;
 
data two;
input id x y;
datalines;
1 10 1 
1 14 2 
1 25 3 
1 22 4 
2 30 1 
2 36 2 
2 43 3 
3 50 1 
3 61 2 
3 67 3 
;
 
/* Each obs in one can only match an obs in two once */
data want(keep = id x y);
 
   if _N_ = 1 then do;
      dcl hash h(dataset : "two(rename = x = xx", multidata : "Y");
      h.definekey("id");
      h.definedata("xx", "y");
      h.definedone();
   end;
 
   set one;
   y = .; xx = .; c = constant("big");
 
   do i = 1 by 1 while (h.do_over() = 0);
      if abs(x - xx) < c then do;
         c = abs(x - xx);
         yy = y;
         ii = i;
      end;
   end;
 
   do i = 1 by 1 while (h.do_over() = 0);
      if i = ii then do;
         h.removedup();
         leave;
      end;
   end;
 
   y = yy;
run;

Memory Management

Notice that the data in one and two is sorted by ID. We can utilize this fact for better memory management. I do so in a few steps. First, I do not use the Dataset: Argument Tag. Instead, I use a DoW Loop to read each by-group and read it into the hash object. Then, I use another DoW Loop to implement the same logic as above. Finally, I use the Clear() Method to clear the hash object to make room for the next by group.

data want(keep = id x y);
 
   dcl hash h(multidata : "Y");
   h.definekey("id");
   h.definedata("xx", "y");
   h.definedone();
 
   do until (last.id);
      set two(rename = x = xx);
      by id;
      h.add();
   end;
 
   do until (last.id);
      set one;
      by id;
      y = .; xx = .; c = constant("big");
 
      do i = 1 by 1 while (h.do_over() = 0);
         if abs(x - xx) < c then do;
            c = abs(x - xx);
            yy = y;
            ii = i;
         end;
      end;
 
      y = yy;
      output;
 
      do i = 1 by 1 while (h.do_over() = 0);
         if i = ii then do;
            h.removedup();
            leave;
         end;
      end;
   end;
 
   h.clear();
 
run;

Summary

In this post, we investigate how to do fuzzy merging using the SAS hash object. Usually, programmers do this using PROC SQL. However we can implement more specific logic using the hash object. For example, we demonstrate a queue logic where an item can not be matched more than once. This is quite hard in SQL.

Also, read the related posts
Nearest Neighbor Match in SAS – Part 1, Closest Value Match in SAS Using the Hash Object – Part 2 and Complicated Fuzzy Grouping With The SAS Hash Object.

You can download the entire code from this post here.