Creating Multilabel Formats in SAS with PROC FORMAT

In the blog post Writing User Defined Formats in SAS, I demonstrate how a custom format can replace long and ugly if-then-else statements. However, the advantages of a user-defined format do not stop there. Sometimes you encounter classification problems where the classes are not distinct from each other. This post demonstrates how to handle such overlapping classes with the Multilabel Option in the Format Procedure.

In the examples to come, I will use the example data below. The data is for demonstration purposes only.

data creditdata;
   array first_names{20} $20 _temporary_ ("Paul", "Allan", "Thomas", "Michael", "Chris", "David", "John", "Jerry", "James", "Robert",
                                          "William", "Richard", "Bob", "Daniel", "Paul", "George", "Larry", "Eric", "Charles", "Stephen");
   array last_names{20}$20 _temporary_ ("Smith", "Johnson", "Williams", "Jones", "Brown", "Miller", "Wilson", "Moore", "Taylor", "Hall",
                                        "Anderson", "Jackson", "White", "Harris", "Martin", "Thompson", "Robinson", "Lewis", "Walker", "Allen");
   call streaminit(123);
   do ID=1 to 1e5;
      first_name=first_names[ceil(rand("Uniform")*20)];
      last_name=last_names[ceil(rand("Uniform")*20)];
      creditrate=rand('integer', 1, 10);
      output;
   end;
 
   format ID z6.;
run;

Applying the Multilabel Option

Let us assume that the creditrate variable in the above data set represents the customer's credit rating. The smaller the value, the more creditworthy the customer is. Now, we want to categorize the credit ratings into buckets like ‘Strong Approval’, ‘Weak Approval’, ‘Approval’ and so on. However, note that, e.g., ‘Strong Approval’ and ‘Approval’ are not mutually exclusive categories. If you have a strong approval rating, you are still approved. This is hard to handle in an if-then-else statement, because each observation ends up in exactly one branch. However, the Multilabel Option in PROC FORMAT handles cases like this neatly.
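To make the difficulty concrete, here is a rough sketch of the data step route. The data set names one_label and two_labels and the variable bucket are made up for illustration, and the cut-offs are the same ones used in the format defined later in this post. An if-then-else chain gives every row a single label, so counting a customer in the broader bucket as well means writing each row twice.

```sas
/* Sketch only: an if-then-else chain assigns exactly one label per row */
data one_label;
   set creditdata;
   length bucket $20;
   if      creditrate <= 2 then bucket = 'Strong Approval';
   else if creditrate <= 6 then bucket = 'Weak Approval';
   else if creditrate <= 8 then bucket = 'Weak Decline';
   else                         bucket = 'Strong Decline';
run;

/* To also count each row in the broader bucket, every row is output twice */
data two_labels;
   set one_label;
   output;                                   /* row with the narrow label */
   if creditrate <= 6 then bucket = 'Approval';
   else                    bucket = 'Decline';
   output;                                   /* same row, broad label     */
run;
```

This works, but it doubles the data and hard-codes the classification in program logic, which is exactly what a format lets you avoid.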

In the Format Procedure below, I create the numeric format appr. In the format options, specified in parentheses after the format name and before the ranges, I use the Multilabel Option to allow for overlapping ranges. If I leave out this option, SAS issues an error in the log: “ERROR: These two ranges overlap: 1-2 and 1-6 (fuzz=1E-12).” I use the Notsorted Option to store the ranges in the order specified in PROC FORMAT, so that the later summary statistics can display them in that order. I use the Default= Option simply because I like to control the length of both formats and variables whenever I can.

proc format library=work;
   value appr (default=20 multilabel notsorted)
      1-2  = 'Strong Approval'
      3-6  = 'Weak Approval'
      1-6  = 'Approval'
      7-8  = 'Weak Decline'
      9-10 = 'Strong Decline'
      7-10 = 'Decline'
   ;
run;
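Before using the format, it can be worth checking what was actually stored. The sketch below is an addition, not part of the original program. It prints the format definition with the Fmtlib Option and then applies the format with the PUT function in a data step, using the illustrative variable name bucket. As far as I know, only procedures that support the MLF Option use all labels of a multilabel format. Elsewhere, such as in the data step, only the primary label is used for each value.

```sas
/* Print the stored definition of the appr format from the WORK catalog */
proc format library=work fmtlib;
   select appr;
run;

/* Outside MLF-aware procedures, a value maps to a single label only */
data _null_;
   length bucket $20;
   do creditrate = 1, 5, 9;
      bucket = put(creditrate, appr.);
      put creditrate= bucket=;   /* writes the value and its label to the log */
   end;
run;
```

If the log shows only one label per value, that is expected. The overlapping labels come into play in the summary procedures.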

Using the Multilabel Format in Summary Procedures

Next, let us put the multilabel format to work. I use PROC MEANS to calculate frequencies of credit approvals. Needless to say, you can calculate all kinds of descriptive statistics here. However, since the focus is on how the format is used rather than on the statistics, I will keep it simple.

In the Class Statement Options, I use the MLF Option to tell SAS to use all the labels of the multilabel format, including those with overlapping ranges. Next, I use the Preloadfmt Option and the Order=data Option to make sure that the procedure maintains the order specified in the above Format Procedure. In the Format Statement, I simply specify the created numeric appr format.

I am most comfortable with the Means Procedure. However, you can use other summary statistics procedures as well. It depends on the specific situation. You can see examples of other approaches in the article Creating and Using Multilabel Formats.

proc means data=creditdata n maxdec=1 nonobs;
   class creditrate / mlf preloadfmt order=data;
   format creditrate appr.;
run;
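As one example of another summary procedure, the same Class Statement Options should carry over to PROC TABULATE. The following is a minimal sketch under that assumption, not code from the original post. It counts observations per label and relies on the creditdata data set and the appr format created earlier.

```sas
/* Sketch: frequency per (overlapping) label with PROC TABULATE */
proc tabulate data=creditdata;
   class creditrate / mlf preloadfmt order=data;
   table creditrate, n;          /* one row per label, N as the only column */
   format creditrate appr.;
run;
```

Because ‘Approval’ and ‘Decline’ overlap the narrower buckets, the counts across all six rows add up to more than the number of observations in the data set.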

Summary

In this post, we have seen a simple example of writing and utilizing multilabel formats in SAS. We have seen that multilabel formats swiftly handle classification problems with overlapping ranges. Overlapping ranges are much harder to handle within data step logic, where if-then-else statements are probably the first choice of many programmers.

I do not hesitate to call the Format Procedure the most underused SAS procedure of all. I have previously written about Looking Up Data With PROC FORMAT and 5 Picture Format Options You Should Know.

You can download the entire code from this post here.