The Importance Of SAS Index Centiles
In a previous blog post Using Indexes to Increase Performance in SAS, I demonstrate how to use a SAS index to decrease run time significantly. In the post, I draw a small subset from a large data set. I force SAS to use the index with the IDXNAME Option. Usually though, we do not use that option. Instead, SAS has to guess whether using it will increase performance or not. To do that, SAS uses centiles. The concept of centiles is extremely important in the context of SAS indexes. This post introduces centiles and shows an example of how to alter their behavior with PROC DATASETS.
SAS Index Centile Introduction
The word “centiles” comes from cumulative percentiles. A SAS index contains 21 centiles. Each centile represents the 0’th, 5’th, 10’th … 100’th percentile. Consequently, the first centile corresponds to the minimum value of the index variable(s). The second centile indicates that 5% of the data has index values below the centile value. Naturally, the last centile holds the maximum value.
Centiles gives you a nice overview of the distribution of your data. Furthermore, it gives you an idea of whether a variable qualifies as a good index variable or not. When we draw a subset of a data set with a variable specified in an index, SAS runs an internal algorithm to determine whether to use the binary index search or to process the data sequentially. SAS uses the centiles to do so.
When an indexed SAS data set is modified, the centiles should reflect that change. However, updating the centiles at every data modification is quite CPU costly. Therefore, you should consider how much the index should change before the centiles are updated. You can control how often the centiles update with the Updatecentiles Option in the Index Create Statement in PROC DATASETS like this.
proc datasets library=work nolist; modify MyData; index delete _all_; index create idx=(first_name last_name) / nomiss updatecentiles=5; run;quit;
Above, I have explicitly specified the Updatecentiles Option to the default value 5. This is usually a solid value for the option.
Refresh Index Centiles Example
Naturally, you do not have to wait for the specified amount of data to have changed for the centiles to update. You can use the Refresh Option in the Index Centiles Statement to force a refresh of the centiles.
proc datasets library=work nolist; modify MyData; index centiles idx / refresh; run;quit;
Review Index Centiles
You can review the status of the index centiles with PROC CONTENTS. Simply specify the centiles option in the procedure statement. You can see the result to the right.
proc contents data=MyData centiles; run;
In the HTML results you can see the index name, the Updatecentiles value and how much of that value has been changed so far. You can also the the number of unique values in the index. Finally, you can see the 21 centile values. This gives you a nice overview of the distribution of your data. Furthermore, it gives you an idea of whether the variable(s) construct a good index or not. In this case, each centile value is unique. This means that no value can return more than 5% of the data in a search with this index. Remember that this is exactly what a SAS index does best: Draw small samples from large data sets.
Centiles are very important when we consider indexing a data set or not. In this post, we have seen how to control the attributes of the centiles, and how to refresh and review them. PROC DATASETS is the only procedure that lets you work with index centiles specifically. Therefore I prefer to use PROC DATASETS whenever I create or modify an index. It simply provides more control.
This post serves as a brief introduction to the concept of centiles. If you want to read more, read chapter four of The Complete Guide To SAS Indexes by Michael A. Raithel.
You can download the entire code from this post here.