There is a question that arises in the SAS Community once in a while. The provided solutions are often well-thought. Though advanced and long. In this post, I will present a simple version of the problem. Then, I will outline how the problem is usually approached. Finally, I will demonstrate how we can solve the problem with PROC OPTNET in the SAS/OR package.

The Problem – Connected Components

First, let us see a classic version of the problem. Consider the data set below. Here, I have a data set with two variables from and to. The values in the data set are connected. So in the first observation, A connects to B. In the third observation B connects to C. Consequently, A connects to C. I want to group connected values or ‘components’. The term component is from Operations Research where we can think of the data set below as a graph of connected components. The goal is to connect A, B and C and other connected components in the same group. And other connected components, which are note connected to any other components in any other group.

You can see a graphical representation of a similar problem in the Connected Components Section in the OPTNET Procedure.

data have;
infile datalines;
input from $ to $;
datalines;
A B
C D
B C
V W
W X
X Y
Y Z
K L
L M
M N
;

Two Well-Thought Common Approaches

First, let us take a look at a macro that creates the result for us. You can find the macro in the link How to find all connected components in a graph by PGStats. Without going into two much detail about the inner workings of the macro, all that is required is a simple macro call like below. This creates a data set with all the components (or nodes) in the graph. For each node, the variable node specifies with group the node belongs to.

%SubGraphs(have,out=want);
 
proc sort data=want;
   by clust;
run;

Behind the scenes, the macro utilizes the Dynamic Structure of the SAS Hash Object and a Hash Iterator to traverse the data set (or graph in OR terms).

A non-macro solutions is presented in the library contribution Find out who belong to the same household by Ksharp. Another well-writtes hash object approach, that utilizes the same components as above.

A PROC OPTNET Solution

Finally, let us explore how to utilize SAS/OR software to solve the problem. The Optnet Procedure has a feature designed exactly for such a purpose. Even though most questions like this do not emanate from operations research problems. Namely the Concomp Statement. The Concomp Statement invokes an algorithm designed to find connected components in a graph (or input data set in SAS terms). You can see a detailed explanation in the Connected Components Section in the OPTNET Procedure.

Let us see how to solve the problem with just a few lines. I specify the input data set in the data_links option and the desired output data in the out_nodes option. Be aware that the input data must fulfill a few requirements. For example, the two variables from and to must be present. Sometimes you need to rename them from other variables.

proc optnet data_links = have
            out_nodes  = want;
   concomp;
run;

The want data set looks like this. Exactly as desired. And in line with the results from the other approaches above.

node concomp 
A    1 
B    1 
C    1 
D    1 
V    2 
W    2 
X    2 
Y    2 
Z    2 
K    3 
L    3 
M    3 
N    3

Summary

In this post we have explored the problem of identifying connected components in a graph in SAS. There are several approaches to the problem. Most of which utilice hash objects and iterators. However, there is an alternative approach using PROC OPTNET that requires much less code as we have seen. I highly encourage you to explore the SAS PROC OPTNET Documentation and the features related to connected components.

You can download the entire code from this post here.