A Seven of Nine Fuzzy Matching Problem in SAS

In the last few posts, I write about Fuzzy Grouping, Fuzzy Merging and approaches to closest-value matching in Part 1 and Part 2. In them, we learn that the SAS hash object can be of help when we deal with fuzzy logic. And often it outperforms other common techniques such as Proc SQL.

Today, I will stay within the fuzzy theme. Recently, I stumbled upon an interesting thread on SAS-L. The problem was stated by Donald Henderson in 2009. Many good approaches where suggested. However, the most scalable solution used the hash object. No surprise. In this post, I will go through the original problem, the original Proc SQL solution and the hash object approach.

I encourage you to read the original thread and the many good suggestions in the original thread: Regex – Seven of Nine Matching.

Seven of Nine Matching – The Problem

The problem is this. Suppose we have the example data one and two below. For each ssn in one, we want to extract all the ssn in two where at least seven out of nine characters match in the correct positions.

data one;
input ssn $9.;
datalines;
123456789
987654321
121212121
343434343
;
 
data two;
input ssn $9.;
datalines;
123456789
987645321
121212121
132456789
232323232
;

This means that from the simple data above, the desired result looks like this.

Obs  ssn_one    ssn_two    MatchResults 
1    123456789  123456789  Exact 
2    123456789  132456789  Not Exact 
3    987654321  987645321  Not Exact 
4    121212121  121212121  Exact

A Proc SQL Approach

The original solution to the problem is in Proc SQL. It is simple, short and easy to understand. The key is in the Where Clause. Here, we use the Sum Function and check the i’th character of ssn1 and compare it to ssn2. If the Sum Function returns 7 or more, at least 7 of the characters match.

The only thing this solution lacks is scalability. As the log states: “The execution of this query involves performing one or more Cartesian product joins that can not be optimized.”. Naturally, we want to avoid doing catesian products.

proc sql;
   create table matches as
      select a.ssn as ssn_one,
             b.ssn as ssn_two,
             case when (a.ssn ne b.ssn) then 'Not Exact'
                                        else 'Exact    '
             end as MatchResults
      from one a, two b
      where sum(char(a.ssn, 1) = char(b.ssn, 1),
                char(a.ssn, 2) = char(b.ssn, 2),
                char(a.ssn, 3) = char(b.ssn, 3),
                char(a.ssn, 4) = char(b.ssn, 4),
                char(a.ssn, 5) = char(b.ssn, 5),
                char(a.ssn, 6) = char(b.ssn, 6),
                char(a.ssn, 7) = char(b.ssn, 7),
                char(a.ssn, 8) = char(b.ssn, 8),
                char(a.ssn, 9) = char(b.ssn, 9)
                ) ge 7;
quit;

A Hash Object Approach

Next let us take a look at the hash object approach.

First, we create a temporary array ss and two hash objects h and m.
Next, we use two Do Loops. From i = 1 to 8 and from j = i + 1 to 9.
For each set , we do two overall loops. From p = 1 to n1 (the number of obs in one) and from p = 1 to n2 (the number of obs in two).
In the first loop, we read the SAS data set one in the first overall iteration (i = 1 and j = 2) and save the ssn1 values in ss. Otherwise, I retrieve ssn1 from ss and set id = ssn1. Next, I set the i’th and j’th character of id to “*” and add it to the end of h.
In the second loop, we read the data set two in the first overall iteration (i = 1 and j = 2) and save the ssn2 values in ss. Else, we do the same thing as the in the first loop and set the i’th and j’th characters to “*”.
Next, we iterate over n. First, we look up the current value of id and n in h. If this is not found, then we leave the loop. Next, we check whether this pair of ssn1 and ssn2 has already bene encountered. If so, we leave the loop. Next, we check whether ssn1 = ssn2. If so, we have an exact match. Finally, we add the current pair of ssn1 and ssn2 to m. After each pair of , we clear h to the next iteration.

Run the code snippet below and verify that the result is identical to the Proc SQL approach.

data want(keep = ssn1 ssn2 MatchResults);
 
   array ss {2, 5000000} $ 9 _temporary_;
 
   dcl hash h(hashexp : 20);
   h.definekey ("id", "n");
   h.definedata("ssn1");
   h.definedone(); 
 
   dcl hash m(hashexp : 20);
   m.definekey("ssn1", "ssn2");
   m.definedone();
 
   do i = 1 to 8;
      do j = i + 1 to 9;
 
         do p = 1 to n1;
 
            if i = 1 and j = 2 then do;
               set one(rename = ssn = ssn1) nobs = n1;
               ss[1, p] = ssn1;
            end;
 
            else ssn1 = ss[1, p];
            id = ssn1;
 
            substr(id, i, 1) = "*";
            substr(id, j, 1) = "*";
 
            do n = 1 by 1 until (h.check() ne 0);
            end;
 
            h.add();
 
         end;
 
         do p = 1 to n2;
 
            if i = 1 and j = 2 then do;
               set two(rename = ssn = ssn2) nobs = n2;
               ss[2, p] = ssn2;
            end;
            else ssn2 = ss[2, p];
 
            id = ssn2;
 
            substr(id, i, 1) = "*";
            substr(id, j, 1) = "*";
 
            do n = 1 by 1;
               if h.find() ne 0 then leave;
               if m.check() = 0 then leave;
 
               if ssn1 = ssn2 then MatchResults = "Y";
               else                MatchResults = "N";
               output;
 
               m.add();
 
            end;
         end;
 
         h.clear();
 
      end;
   end;
 
   stop;
 
run;

Testing Scalability

THe only reason why we make the effort and apply the SAS hash object approach is scalability. We want to avoid cartesian joins and having to read data more than necessary. In the code snippet below, we create two sample data sets that are larger than above and run the same SAS code to test the scalability of the two approaches.

The Proc SQL approach takes about 30 seconds to run on my system. The hash object approach takes less than a second. Quite the difference.

/* Testing scalability on larger data */
data one two;
   length ssn $ 9;
   do _N_ = 1 to 1e4;
      ssn = put (ceil (ranuni(1) * 1e9), z9.); output one;
      ssn = put (ceil (ranuni(1) * 1e9), z9.); output two;
   end;
run;
 
/* Proc SQL */
proc sql;
   create table matches as
      select a.ssn as ssn_one,
             b.ssn as ssn_two,
             case when (a.ssn ne b.ssn) then 'Not Exact'
                                        else 'Exact    '
             end as MatchResults
      from one a, two b
      where sum(char(a.ssn, 1) = char(b.ssn, 1),
                char(a.ssn, 2) = char(b.ssn, 2),
                char(a.ssn, 3) = char(b.ssn, 3),
                char(a.ssn, 4) = char(b.ssn, 4),
                char(a.ssn, 5) = char(b.ssn, 5),
                char(a.ssn, 6) = char(b.ssn, 6),
                char(a.ssn, 7) = char(b.ssn, 7),
                char(a.ssn, 8) = char(b.ssn, 8),
                char(a.ssn, 9) = char(b.ssn, 9)
                ) ge 7;
quit; 
 
/* Hash Object */
data want(keep = ssn1 ssn2 MatchResults);
 
   array ss {2, 5000000} $ 9 _temporary_;
 
   dcl hash h(hashexp : 20);
   h.definekey ("id", "n");
   h.definedata("ssn1");
   h.definedone(); 
 
   dcl hash m(hashexp : 20);
   m.definekey("ssn1", "ssn2");
   m.definedone();
 
   do i = 1 to 8;
      do j = i + 1 to 9;
 
         do p = 1 to n1;
 
            if i = 1 and j = 2 then do;
               set one(rename = ssn = ssn1) nobs = n1;
               ss[1, p] = ssn1;
            end;
 
            else ssn1 = ss[1, p];
            id = ssn1;
 
            substr(id, i, 1) = "*";
            substr(id, j, 1) = "*";
 
            do n = 1 by 1 until (h.check() ne 0);
            end;
 
            h.add();
 
         end;
 
         do p = 1 to n2;
 
            if i = 1 and j = 2 then do;
               set two(rename = ssn = ssn2) nobs = n2;
               ss[2, p] = ssn2;
            end;
            else ssn2 = ss[2, p];
 
            id = ssn2;
 
            substr(id, i, 1) = "*";
            substr(id, j, 1) = "*";
 
            do n = 1 by 1;
               if h.find() ne 0 then leave;
               if m.check() = 0 then leave;
 
               if ssn1 = ssn2 then MatchResults = "Exact    ";
               else                MatchResults = "Not Exact";
               output;
 
               m.add();
 
            end;
         end;
 
         h.clear();
 
      end;
   end;
 
   stop;
 
run;

Summary

In this post we take up an old discusion from SAS-L on how to match two SAS data sets on 7 of 9 characters in a ssn. We discuss two different approaches ad test their scalability. It turns out that the hash object approach outperforms Proc SQL by quite a lot. Simply bacause we avoid cartesian products and unnecessary passes through the data. Other approaches are presented in the original thread. I encourage you to read it.

You can download the entire SAS code from this post here.