When you work with data in SAS, you will at some point have to deal with duplicate values. This post shows you a few ways to effectively deal with duplicate values in SAS using PROC SORT and the SQL Procedure.
First, let us create some small example data set. Below, I create the data set MyData with an ID variable and three variables var1, var2 and var3. This data set is for demonstration purposes in the examples to come only.
data MyData; input ID var1-var3; datalines; 1 1 2 3 1 1 2 3 1 2 3 4 2 1 2 3 2 6 4 2 2 1 2 3 3 2 1 4 4 1 2 1 4 7 4 2 ;
SAS PROC SQL Example
The PROC SQL way of removing duplicate values is intuitive and easy to understand. In the select clause, we use the DISTINCT keyword to account for exact duplicate observations in the data. A small example of removing duplicate observations from the example data set is specified below.
proc sql; create table NoDuplicates2 as select distinct * from Mydata; quit;
While the SQL solution is quick and intuitive, especially if you are familiar with the SQL language, it is not very flexible. For example, it is not very easy to remove duplicates by specific variables. Luckily there is another way that handles that situation swiftly.
PROC SORT Nodupkey Example
The Sort Procedure with the NODUPKEY option is the simplest and most common way of removing duplicate values in SAS. Simply specify the NODUPKEY option in the PROC SORT statement. In the BY statement, specify the variables by which you want to remove duplicates. In this case, we want to remove all exact duplicates, so we specify _ALL_ in the BY statement.
proc sort data=Mydata out=NoDuplicates1 nodupkey; by _all_; /* Or equivalently by ID var1-var3 */ run;
The Sort Procedure also lets you specify the variables by which you do not want duplicates. Let’s say that you do not want duplicate values for ID and Var1 only. Then simply specify these variables in the BY statement as such
proc sort data=Mydata out=NoDuplicates1 nodupkey; by ID var1; run;
Duplicate values are a part of life when you work with data. SAS programming gives you the tools to easily and efficiently remove duplicate values. Naturally, you have to carefully consider if removing the duplicates is the right solution to your problem before actually doing so.
The approach described here is the most popular among SAS programmers because it is simple and requires very little coding. However, remember that it does require a sort algorithm. Sort algorithms are CPU and time-consuming. So if you do not want the data sorted, see an alternative approach in the blog post Remove Duplicate Observations In SAS with the Hash Object. For more tips and code examples in data science in SAS, visit the Data Science Code Examples page.
You can download the entire code from this example here.