Advanced Fuzzy Grouping With The SAS Hash Object

Regularly, a user asks a question on the SAS Community regarding grouping of values from some fuzzy logic. Usually, this logic is not straight forward and the logic varies from question to question. However, the question almost always have one thing in common. The solution involves hash objects. In this post, I take two examples from the community and provide the problem, the solution and a short description thereof. I will not be too thorough in the explanations here. Rather, I will let the code snippets serve as templates for similar problems that readers may encounter.

Fuzzy Grouping With the Hash Object – Example 1

Let us take a look at the first example. In this problem, we have a data set with names like below. Some of the names are identical. Some differ. And some are somewhat similar, but not exactly. Like Michael and Michaell. The task is to group the names so that names that are somewhat similar are grouped together.

data have;
input Name:$20.;
datalines;
Michael  
Michel   
kurt     
kirt     
Michaell 
Benjamin 
kurt     
Mich     
;

Let us see how to approach the problem. First, we create two hash objects, h and hh. One is keyed by Comp (a variable used to compare names with). Notice the hash iterator linked to h. hh is keyed by group (the numeric value representing the assigned group). Notice that we specify the multidata:”Y” in hh.

Next, I read the input data one line at the time. I lookup the name in h for each input name. If it is found, I already have assigned a group to the name and we simply go to the end of the data step and output. If not, we go into the loop.

Inside the Do Loop, we start by iterating through h with the iterator object. I we do not encounter a name with a Complev Distance less than 2 (You can experiment here), I conclude that a new group must be created at the end of the loop and we add this name and group to h and hh.

If we do encounter a name with a Complev Distance less than 2, we know that we have encountered a similar name before. Therefore, we assign the same group to the current name and add it to h and hh with the group.

data want(keep = name group);
   length name $20 comp $20 group 8;
 
   if _n_ = 1 then do;
      declare hash h();
      h.definekey('comp');
      h.definedata('comp', 'group');
      h.definedone();
      declare hiter hi('h');
 
      declare hash hh(multidata:'y');
      hh.definekey('group');
      hh.definedata('group', 'comp');
      hh.definedone(); 
 
      _group=0;
   end;
 
   set have;
 
   rc = h.find(key:name);
 
   if rc ne 0 then do;
      rc = hi.first();
      do while (rc=0);
 
         if complev(name, comp) le 2 then do;
            rc = hh.find();
 
            do while (r ne 0);
               dist = complev(name, comp);
               hh.has_next(result : r);
 
               if r = 0 & dist <= 2 then do;
                  h.add(key : name, data : name, data : group);
                  hh.add();
                  output;
                  return;
               end;
 
               else if r ne 0 & dist le 2 then rc = hh.find_next();
 
               else if dist > 2 then leave;
 
            end;
         end;
 
         rc=hi.next();
 
      end;
 
      _group ++ 1;
      group = _group;
      h.add(key : name, data : name, data : group);
      hh.add(key : group, data : group, data : name);
   end;
 
   output;
run;

The result looks like this

Obs name     group 
1   Michael  1 
2   Michel   1  
3   kurt     2 
4   kirt     2 
5   Michaell 1 
6   Benjamin 3 
7   kurt     2 
8   Mich     4

This problem was posted at the SAS Community in the thread Fuzzy logic to locate duplicates in a table.

Fuzzy Grouping With the Hash Object – Example 2

Next, let us look at the second example. In this example we work with the data below. Here, we have a list of names and their birth dates. Some of the names have id’s assigned already. The task here is to assign a fitting id to all names. If a name has an id already and is somewhat similar to another name with the same birth data, they should have the same id.

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
5785 fbff João Simões Marques        12/05/2000 7
1234 abcd M Rita Costa Santos        01/01/2020 1
3333 uvwx M Rita Costa Santos        01/01/2020 1
5678 efgh Maria Rita C Santos        01/01/2020  
9101 ijkl Rita Costa Santos          01/01/2020 1
1111 mnop Maria Leonor Santos Silva  02/03/2001 2
2222 qrst Leonor Santos Silva        02/03/2001  
4444 yzab Leonor Santos Silva        30/08/1999  
6565 afgg Donald J Trump             01/01/1960  
2423 sgty Donald J Trump             01/01/1960  
9876 hgvb Pedro Costa Santos         05/09/1990 9
7865 jnbv Luís Miguel Silva          05/09/1990  
;

Let us look at the solution to the problem. First, we create two hash objects h1 and h2. Notice that we specify Multidata : “Y” in h2. H1 is keyed by name and birth_date. H2 only by birth_date. Notice that we fill the objects with data in a DoW Loop. We do this for two reasons. 1: We read data into the objects from the same data source. This way, we read the data once instead of twice when filling the objects. 2: This way we can keep track of the largest id already assigned to a name. This comes in handy later when we want to assign a new id to a name not already encountered.

Next, I read the input data sequentially. If the data does not already have an id, we must add it. Therefore, we start by looking up the id in h1. If the same name and birth_date exists in h1, we already have an id. If it does not we have to traverse all entries in h2 with the same birth_date. We do so with the Do_Over Method. If a somewhat similar name exist with the same birth_date, we take the id from that name.

At the bottom, we handle the case where none of the logic above gave us an id for the name. In this case, we add one to the maximum id until now. Assign it to the name, add it to the hash object and output.

data want(keep = num_A num_B name birth_date id);
   format num_A num_B name birth_date id;
   if _N_ = 1 then do;
      dcl hash h1 ();
      h1.definekey("name", "birth_date");
      h1.definedata("i");
      h1.definedone();
 
      dcl hash h2 (multidata : "Y");
      h2.definekey("birth_date");
      h2.definedata("n", "i");
      h2.definedone();
 
      do until (z);
         set have(rename=(id=i name=n) where = (i)) end = z;
         h1.ref();
         h2.ref();
         maxid = max(maxid, i);
      end;
   end;
 
   set have;
 
   if id = . then do;
      if h1.find() ne 0 then do;
         do while (h2.do_over() = 0);
            if complev(name, n) < 10 then do;
               id = i;
               h1.ref(key : n, key : birth_date, data : id);
            end;
         end;
      end;
      else id = i;
   end;
 
   if id = . then do;
      maxid + 1;
      id = maxid;
      h1.ref(key : name, key : birth_date, data : id);
   end;
 
run;

The result looks like this

Obs  num_A  num_B  name                       birth_date  id 
1    5785   fbff   João Simões Marques        12/05/2000  7 
2    1234   abcd   M Rita Costa Santos        01/01/2020  1 
3    3333   uvwx   M Rita Costa Santos        01/01/2020  1 
4    5678   efgh   Maria Rita C Santos        01/01/2020  1 
5    9101   ijkl   Rita Costa Santos          01/01/2020  1 
6    1111   mnop   Maria Leonor Santos Silva  02/03/2001  2 
7    2222   qrst   Leonor Santos Silva        02/03/2001  2 
8    4444   yzab   Leonor Santos Silva        30/08/1999  10 
9    6565   afgg   Donald J Trump             01/01/1960  11 
10   2423   sgty   Donald J Trump             01/01/1960  11 
11   9876   hgvb   Pedro Costa Santos         05/09/1990  9 
12   7865   jnbv   Luís Miguel Silva          05/09/1990  12

This problem was posted at the SAS Community in the thread How to find similar and equal values in one column and then another column.

Summary

In this post, we explore a common problem of fuzzy grouping in SAS. This is usually done in Proc SQL or the hash object. The examples in this post uses the hash object. Both examples are real life problems from the online SAS Community.

Also, read the related posts Fuzzy Merge in SAS with the Hash Object.

You can download the entire code from this post here.

Complicated Fuzzy Grouping With The SAS Hash Object

Fuzzy Grouping With the Hash Object – Example 1

Fuzzy Grouping With the Hash Object – Example 2

Summary