Most SAS users know how to sort data with PROC SORT. Fewer users know how this sorting affects the meta data that contains sorting information on the data set. In this post I will introduce two very important pieces of meta data information. The SORTED and VALIDATED flags. These are very important and nice to know, especially when working with large data sets.
The SAS Sortedby Option
The SORTEDBY Option tells SAS by what variable(s) a data set is sorted. When the SORTEDBY Option is set, the Sorted Flag in the Data Set metadata changes from NO to YES. You can check the value of the Flag in the PROC CONTENTS output to the right. The option can be set automatically by SAS or manually in a Data Step Statement like this
data SortedNotValidated(Sortedby=x); do x=1 to 10e7; output; end; run; proc contents data=SortedNotValidated; run;
Since you can manually chance the value of the SORTED Flag, you have to be careful. You can tell SAS that a data set is sorted by a variable, when actually it is not. Therefore, when you use the SORTEDBY Option, you have to be sure about the sort order. Using the SORTEDBY Option with a variable by which the data set is not actually sorted can cause serious problems.
The SAS Validated Flag
The Validated flag is more reliable than the Sorted Flag. It is a validation by SAS that the Sorted Flag is actually correct. Therefore, you can not manually set the Validated Flag. The Validated flag Can be set only by PROC SORT or by the Order By Clause in PROC SQL. When a sort order is validated, it is seen in the last part of the PROC CONTENTS Output.
proc sort data=sashelp.class out=class; by height; run; proc contents data=class; run;
To the right, you can see that the Sort Procedure sets both the Sorted and Validated flag to YES. Next, we shall see why this is important.
A Performance Example
In the blog post Using the _METHOD Option in PROC SQL I demonstrate how to display the operations performed under SQL execution. Let us see how the SQL Procedure operates differently given different values of the Sorted and Validated flags. In the three code snips to come, I run the exact same SQL query on three almost identical data sets. The only difference between them are the values of the two flags discussed.
Sorted=N and Validated=N
First, let us look at a simple join query where both flags are equal to NO. I create a data set with one variable X and join it on itself in the following SQL Procedure.
data NotSortedNotValidated; do x=1 to 10e7; output; end; run; proc sql noprint _method; create table test1 as select a.* from NotSortedNotValidated as a, NotSortedNotValidated as b where a.x=b.x; quit;
The SQL Procedure takes 1 minute and 33 seconds to run. The _METHOD Option reveals that the query does two sort operations (SQXSORT), one for each input table. This happens even though the data set is actually sorted by X, so the sort operations are redundant. But SAS soes not know that because neither flag is set to YES.
Sorted=Y and Validated=N
Next, let us manually set the Sorted flag to Yes. We use the SORTEDBY Option in the Data Step options and specify X as the variable by which the data is sorted. We then run the exact same query as above.
data SortedNotValidated(Sortedby=x); do x=1 to 10e7; output; end; run; proc sql noprint _method; create table test2 as select a.* from SortedNotValidated as a, SortedNotValidated as b where a.x=b.x; quit;
This takes 41 seconds to run. Less than half the time of the first run. The _METHOD Option reveals that the same operations were performed except the two sort operations of the input tables. SAS knows that the data is sorted properly, so it does not have to do it again. Therefore the reduction in run time.
Sorted=Y and Validated=Y
Finally, let us set both the Sorted and Validated flag to YES. I create a data set the same way as above and use Sortedby=x to set the Sorted Flag to YES. Next, I want to set the Validated Flag to YES. I want to use PROC SORT to do so. However, I do not want to actually sort the table because I know it is sorted properly already. I can use the PRESORTED Option in the procedure statement to simply verify the order in the data set. If the validation succeeds, no sort is performed. If the validation fails, SAS sorts the data set accordingly.
data SortedNotValidated(Sortedby=x); do x=1 to 10e7; output; end; run; /*options sortvalidate; Equivalent to the PRESORTED Option in PROC SORT */ proc sort data=SortedNotValidated presorted out=SortedValidated; by x; run; proc sql noprint _method; create table test3 as select a.* from SortedValidated as a, SortedValidated as b where a.x=b.x; quit;
When we run the SQL query this time it takes 36 seconds. A small reduction from the previous run. This reduction is dure to the fact that the SQL procedure does not have to check the sort order of the data sets. When the Validated flag is set, the SQL Procedure relies blindly on the sort being correct.
This post demonstrates the importance of the Sorted and Validated Flags. We have seen an example of why they are important where the run time of an SQL query was cut in half. Simply by specifying the correct values of the flags.
You can download the entire code from this post here.