Three PROC SORT Options You Should Know
When we work with data in SAS, sorting is indedivable. Usually we use PROC SORT to perform the sort. However, sorting is probably the most resource intensive process in data management. Therefore it it important to know how to utilize the SORT Procedure so that you increase performance and decrease run time. In this post, I will present the SORTSIZE, TAGSORT and PRESORTED Options to the SORT Procedure that can help you increase performance and make the most of your PROC SORT runs in SAS.
I have previously written about perhaps the most utilized option in PROC SORT, the NODUPKEY Option, in the example Remove Duplicate Values In SAS. Therefore, I omit that one here.
In the examples to come, I will use the following example data set. The size of the TestData data set is about 1.5 GB.
data TestData; length string $500; string="SASnrd"; do i=1 to 3e6; x=rand('Integer', 1, 10); output; end; run;
SAS SORTSIZE Option
A sort operation in SAS roughly consists of three operations. Read input data, create intermediate data store for utility files and output the sorted data. The first two operations take place at the same time. The data is sorted as it is read in. Therefore, if PROC SORT has enough available memory, it performs the entire sort in-memory. This is called an internal sort and is very quick. An external sort happens when not enough RAM is available. Consequently, PROC SORT sets up temporary utility files on the hard disk. This means more I/O operations, which results in a slower overall process.
Now, we know that the data takes up about 1.5 GB. Therefore, we need more space in-memory to do an internal sort. Luckily, we can control how much memory is available to PROC SORT with the SORTSIZE Procedure Option. First let us purposely set SORTSIZE to less than the data set size and sort the data
proc sort data=TestData sortsize=1G; by x; run;
This sort took about 7.8 seconds. Next, let us set SORTSIZE larger than the data set size.
proc sort data=TestData sortsize=2G; by x; run;
This allows PROC SORT to perform an internal sort. Now, the sort took about 5.2 seconds of real time. A pretty sizable time saving, simply by considering the data set size against the memory available to PROC SORT. The SORTSIZE= Procedure Option temporarily overwrites the SORTSIZE System Option, which can be set only at start up.
To find out what the default SORTSIZE is in your SAS session run PROC OPTIONS like this
proc options option=(sortsize); run;
The TAGSORT Option
When we spedicy the TAGSORT procedure option, we do not read in the entire data set when the Sort Procedure is invoked. Instead, only the variables specified in the By Statement and the observation number is read and sorted. These are referred to as tags. When sort of the tags is done, the tags are used to retrievere the observations from the input data set to return the entire data set in sorted order.
Consequently, two passes of the input data is done. Intuitively, this is bad from a performance point of view. However, when the tag size is small compared to the joint length of the observation, we may see an improvement in run time. This happens because two passes of a small tagged data set requires less I/O operations than one pass of a large non-tagged data set. Also, when we suspect that the data is already somewhat sorted, the TAGSORT Option can increase performance. Let us sort the testdata data set by i with the TAGSORT Option.
proc sort data=TestData tagsort; by i; run;
This sort takes about 6.5 seconds. A similar run without the TAGSORT Option takes about 8 seconds. Remember that the TAGSORT run is faster for two reasons. The tag takes up a small amount of the joint length of the observation and the data is already sorted by the by-variable. Otherwise, the TAGSORT Option can drastically increase run time, so use it with caution.
The PRESORTED Option
In the TAGSORT example, we sort the data by the variable i. However this is a redundant sort because the data is already sorted by i. Though, the SORT Procedure does not know that. The PRESORTED Option lets the Sort Procedure perform a sequential check to ensure that the input data is sorted properly. If it is not, PROC SORT sorts the data. If it is, then PROC SORT merely copies the data. A nice example is sorting TestData by i. If we do a normal sort run with i as by-variable, the procedure merely sorts the data. However when we use the PRESORTED Option like this
proc sort data=TestData presorted; by i; run;
SAS first checks the order of the by-variables. In this case, where the data is actually already sorted accordingly, no sort is performed. Instead a note is written to the log: “Sort order of input data set has been verified.”. This indicates that the sort order is already correct and saves the entire sort algorithm run. Furthermore, it sets the VALIDATED Flag to 1. You can read more about why that is important in the blog post The Importance Of The SAS SORTED And VALIDATED Flags.
The SORT Procedure is one of the most utilized procedures in SAS. Therefore, the novice SAS programmer should familiarize herself with the many options that lets the procedure reach its full potential. This post presented three options, that should be part of every SAS programmers toolkit: The SORTSIZE, TAGSORT and PRESORTED Options. Furthermore, we have seen small examples of how these options can reduce sort run times by orders of magnitude.
I have previously written about Three Alternatives To PROC SORT. Also, if you are serious about speeding up PROC SORT, you have to consider Multi Threaded Processing and Using the BUFSIZE and BUFNO System Options.
You can download the entire code from this post here.