Regularly, a user asks a question on the SAS Community regarding grouping of values from some fuzzy logic. Usually, this logic is not straight forward and the logic varies from question to question. However, the question almost always have one thing in common. The solution involves hash objects. In this post, I take two examples from the community and provide the problem, the solution and a short description thereof. I will not be too thorough in the explanations here. Rather, I will let the code snippets serve as templates for similar problems that readers may encounter.
Fuzzy Grouping With the Hash Object – Example 1
Let us take a look at the first example. In this problem, we have a data set with names like below. Some of the names are identical. Some differ. And some are somewhat similar, but not exactly. Like Michael and Michaell. The task is to group the names so that names that are somewhat similar are grouped together.
data have; input Name:$20.; datalines; Michael Michel kurt kirt Michaell Benjamin kurt Mich ;
Let us see how to approach the problem. First, we create two hash objects, h and hh. One is keyed by Comp (a variable used to compare names with). Notice the hash iterator linked to h. hh is keyed by group (the numeric value representing the assigned group). Notice that we specify the multidata:”Y” in hh.
Next, I read the input data one line at the time. I lookup the name in h for each input name. If it is found, I already have assigned a group to the name and we simply go to the end of the data step and output. If not, we go into the loop.
Inside the Do Loop, we start by iterating through h with the iterator object. I we do not encounter a name with a Complev Distance less than 2 (You can experiment here), I conclude that a new group must be created at the end of the loop and we add this name and group to h and hh.
If we do encounter a name with a Complev Distance less than 2, we know that we have encountered a similar name before. Therefore, we assign the same group to the current name and add it to h and hh with the group.
data want(keep = name group); length name $20 comp $20 group 8; if _n_ = 1 then do; declare hash h(); h.definekey('comp'); h.definedata('comp', 'group'); h.definedone(); declare hiter hi('h'); declare hash hh(multidata:'y'); hh.definekey('group'); hh.definedata('group', 'comp'); hh.definedone(); _group=0; end; set have; rc = h.find(key:name); if rc ne 0 then do; rc = hi.first(); do while (rc=0); if complev(name, comp) le 2 then do; rc = hh.find(); do while (r ne 0); dist = complev(name, comp); hh.has_next(result : r); if r = 0 & dist <= 2 then do; h.add(key : name, data : name, data : group); hh.add(); output; return; end; else if r ne 0 & dist le 2 then rc = hh.find_next(); else if dist > 2 then leave; end; end; rc=hi.next(); end; _group ++ 1; group = _group; h.add(key : name, data : name, data : group); hh.add(key : group, data : group, data : name); end; output; run;
The result looks like this
Obs name group 1 Michael 1 2 Michel 1 3 kurt 2 4 kirt 2 5 Michaell 1 6 Benjamin 3 7 kurt 2 8 Mich 4
This problem was posted at the SAS Community in the thread Fuzzy logic to locate duplicates in a table.
Fuzzy Grouping With the Hash Object – Example 2
Next, let us look at the second example. In this example we work with the data below. Here, we have a list of names and their birth dates. Some of the names have id’s assigned already. The task here is to assign a fitting id to all names. If a name has an id already and is somewhat similar to another name with the same birth data, they should have the same id.
data have; infile datalines missover; input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id; format birth_date ddmmyy10.; datalines; 5785 fbff João Simões Marques 12/05/2000 7 1234 abcd M Rita Costa Santos 01/01/2020 1 3333 uvwx M Rita Costa Santos 01/01/2020 1 5678 efgh Maria Rita C Santos 01/01/2020 9101 ijkl Rita Costa Santos 01/01/2020 1 1111 mnop Maria Leonor Santos Silva 02/03/2001 2 2222 qrst Leonor Santos Silva 02/03/2001 4444 yzab Leonor Santos Silva 30/08/1999 6565 afgg Donald J Trump 01/01/1960 2423 sgty Donald J Trump 01/01/1960 9876 hgvb Pedro Costa Santos 05/09/1990 9 7865 jnbv Luís Miguel Silva 05/09/1990 ;
Let us look at the solution to the problem. First, we create two hash objects h1 and h2. Notice that we specify Multidata : “Y” in h2. H1 is keyed by name and birth_date. H2 only by birth_date. Notice that we fill the objects with data in a DoW Loop. We do this for two reasons. 1: We read data into the objects from the same data source. This way, we read the data once instead of twice when filling the objects. 2: This way we can keep track of the largest id already assigned to a name. This comes in handy later when we want to assign a new id to a name not already encountered.
Next, I read the input data sequentially. If the data does not already have an id, we must add it. Therefore, we start by looking up the id in h1. If the same name and birth_date exists in h1, we already have an id. If it does not we have to traverse all entries in h2 with the same birth_date. We do so with the Do_Over Method. If a somewhat similar name exist with the same birth_date, we take the id from that name.
At the bottom, we handle the case where none of the logic above gave us an id for the name. In this case, we add one to the maximum id until now. Assign it to the name, add it to the hash object and output.
data want(keep = num_A num_B name birth_date id); format num_A num_B name birth_date id; if _N_ = 1 then do; dcl hash h1 (); h1.definekey("name", "birth_date"); h1.definedata("i"); h1.definedone(); dcl hash h2 (multidata : "Y"); h2.definekey("birth_date"); h2.definedata("n", "i"); h2.definedone(); do until (z); set have(rename=(id=i name=n) where = (i)) end = z; h1.ref(); h2.ref(); maxid = max(maxid, i); end; end; set have; if id = . then do; if h1.find() ne 0 then do; do while (h2.do_over() = 0); if complev(name, n) < 10 then do; id = i; h1.ref(key : n, key : birth_date, data : id); end; end; end; else id = i; end; if id = . then do; maxid + 1; id = maxid; h1.ref(key : name, key : birth_date, data : id); end; run;
The result looks like this
Obs num_A num_B name birth_date id 1 5785 fbff João Simões Marques 12/05/2000 7 2 1234 abcd M Rita Costa Santos 01/01/2020 1 3 3333 uvwx M Rita Costa Santos 01/01/2020 1 4 5678 efgh Maria Rita C Santos 01/01/2020 1 5 9101 ijkl Rita Costa Santos 01/01/2020 1 6 1111 mnop Maria Leonor Santos Silva 02/03/2001 2 7 2222 qrst Leonor Santos Silva 02/03/2001 2 8 4444 yzab Leonor Santos Silva 30/08/1999 10 9 6565 afgg Donald J Trump 01/01/1960 11 10 2423 sgty Donald J Trump 01/01/1960 11 11 9876 hgvb Pedro Costa Santos 05/09/1990 9 12 7865 jnbv Luís Miguel Silva 05/09/1990 12
This problem was posted at the SAS Community in the thread How to find similar and equal values in one column and then another column.
In this post, we explore a common problem of fuzzy grouping in SAS. This is usually done in Proc SQL or the hash object. The examples in this post uses the hash object. Both examples are real life problems from the online SAS Community.
Also, read the related posts Fuzzy Merge in SAS with the Hash Object.
You can download the entire code from this post here.