04 05 05 Removing Duplicate Observations Efficiently

The SORT procedure can be used to remove duplicate observations when it is

  • used with the NODUPKEY option
  • used with the NODUPRECS option
  • followed by FIRST. processing in the DATA step.

Generally, PROC SORT with the NODUPKEY option uses less I/O and CPU time than PROC SORT followed by a DATA step that uses FIRST. processing. Before viewing a comparative example, let's examine each of the techniques that are listed above.


Using the NODUPKEY Option

The NODUPKEY option checks for and eliminates observations that have duplicate BY-variable values. If you specify this option, then PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.


General form, PROC SORT with the NODUPKEY option:
PROC SORT DATA=SAS-data-set-name NODUPKEY;
where SAS-data-set-name is a valid SAS data set name.


Example

The SAS data set Company.Reorder contains two duplicated observations. Observation 9 is a duplicate of observation 1, and observation 7 is a duplicate of observation 2. The duplicate observations are removed when the data is sorted by the values of Product_Line and Product_Name and when the NODUPKEY option is used.

     proc sort data=company.reorder nodupkey;
        by product_line product_name;
     run;

SAS Data Set Company.Reorder
Obs Product_Line Product_Name Supplier_Name
1 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
2 Children Kids Children's Fleece Hat 3Top Sports
3 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc
4 Sports Sparkle Spray Blue CrystalClear Optics Inc
5 Outdoors Money Purse, Black Top Sports
6 Sports Mayday Serious Down Jacket Mayday Inc
7 Children Kids Children's Fleece Hat 3Top Sports
8 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing
9 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S


SAS Data Set Company.Reorder, Before Removing Duplicate Observations
Obs Product_Line Product_Name Supplier_Name
1 Children Kids Children's Fleece Hat 3Top Sports
2 Children Kids Children's Fleece Hat 3Top Sports
3 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
4 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
5 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing
6 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc
7 Outdoors Money Purse, Black Top Sports
8 Sports Mayday Serious Down Jacket Mayday Inc
9 Sports Sparkle Spray Blue CrystalClear Optics Inc


SAS Data Set Company.Reorder, Duplicate Observations Removed
Obs Product_Line Product_Name Supplier_Name
1 Children Kids Children's Fleece Hat 3Top Sports
2 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
3 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing
4 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc
5 Outdoors Money Purse, Black Top Sports
6 Sports Mayday Serious Down Jacket Mayday Inc
7 Sports Sparkle Spray Blue CrystalClear Optics Inc


Using the NODUPRECS Option

The NODUPRECS option also checks for and eliminates duplicate observations. However, unlike the NODUPKEY option, the NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.


General form, PROC SORT with the NODUPRECS option:
PROC SORT DATA=SAS-data-set-name NODUPRECS;

where SAS-data-set-name is a valid SAS data set name.

Note NODUP is an alias for NODUPRECS.


Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables.


Example

When Company.Reorder is sorted using the BY variable Product_Line and the NODUPRECS option, the duplicate observation that contains the product name Kids Children's Fleece Hat is eliminated because it exactly matches the observation that was previously written to the output data set. The duplicate observation that contains the product name Ski Jacket w/Removable Fleece is retained because it does not exactly match the observation that was previously written to the output data set.

     proc sort data=company.reorder noduprecs;
        by product_line;
     run;

SAS Data Set Company.Reorder
Obs Product_Line Product_Name Supplier_Name
1 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
2 Children Kids Children's Fleece Hat 3Top Sports
3 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc
4 Sports Sparkle Spray Blue CrystalClear Optics Inc
5 Outdoors Money Purse, Black Top Sports
6 Sports Mayday Serious Down Jacket Mayday Inc
7 Children Kids Children's Fleece Hat 3Top Sports
8 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing
9 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S


SAS Data Set Company.Reorder, Before Removing Duplicate Observation
Obs Product_Line Product_Name Supplier_Name
1 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
2 Children Kids Children's Fleece Hat 3Top Sports
3 Children Kids Children's Fleece Hat 3Top Sports
4 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
5 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc
6 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing
7 Outdoors Money Purse, Black Top Sports
8 Sports Sparkle Spray Blue CrystalClear Optic Inc
9 Sports Mayday Serious Down Jacket Mayday Inc


SAS Data Set Company.Reorder, Duplicate Observation Remaining
Obs Product_Line Product_Name Supplier_Name
1 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
2 Children Kids Children's Fleece Hat 3Top Sports
3 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
4 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc
5 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing
6 Outdoors Money Purse, Black Top Sports
7 Sports Sparkle Spray Blue CrystalClear Optic Inc
8 Sports Mayday Serious Down Jacket Mayday Inc


Both duplicate observations are removed when Company.Reorder is sorted by both Product_Line and Product_Name and when the NODUPRECS option is used.
     proc sort data=company.reorder noduprecs;
        by product_line product_name;
     run;

SAS Data Set Company.Reorder, Both Duplicate Observations Removed
Obs Product_Line Product_Name Supplier_Name
1 Children Kids Children's Fleece Hat 3Top Sports
2 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S
3 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing
4 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc
5 Outdoors Money Purse, Black Top Sports
6 Sports Mayday Serious Down Jacket Mayday Inc
7 Sports Sparkle Spray Blue CrystalClear Optics Inc

 
Note The SORTDUP= system option controls how NODUPRECS processing works. Specifying SORTDUP=PHYSICAL removes duplicates based on all variables in the data set. This is the default. Specifying SORTDUP=LOGICAL removes duplicates based only on the variables that remain after the DROP= and KEEP= data set options are processed. See the SAS documentation for more information.


Using the EQUALS | NOEQUALS Option

EQUALS | NOEQUALS is a SORT procedure option that helps to determine the order of observations in the output data set. When you use NODUPRECS or NODUPKEY to remove observations from the output data set, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.

EQUALS is the default. For observations that have identical BY-variable values, EQUALS maintains the order from the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set. NOEQUALS can save CPU time and memory resources.


Example

The following program uses PROC SORT with the NODUPKEY option and the NOEQUALS option to create an output data set that contains only the first observation in each BY group. Notice that the output data set Work.New contains different observations when the EQUALS option is used.

     proc sort data=company.products out=work.new
          nodupkey noequals;
        by product_line;
     run;
SAS Data Set Company.Products
Obs Product_Line Product_Name Supplier_Name
1 Clothes & Shoes Big Guy Men's Ringer T Eclipse Inc
2 Children Boy's and Girl's Ski Pants with Braces Scandinavian Clothing A/S
3 Outdoors Cotton Moneybelt/Polyester 45x11 Prime Sports Ltd
4 Sports Cougar Shorts SD Sporting Goods Inc
5 Clothes & Shoes Far Out Teambag S 3Top Sports
6 Children Kid Basic Tracking Suit Triple Sportswear Inc
7 Sports Maxrun Ultra short Sprinter Tights Force Sports
8 Clothes & Shoes Wa.leather Street Shoes Fuller Trading Co.


SAS Data Set Work.New, NOEQUALS Option Used
Obs Product_Line Product_Name Supplier_Name
1 Children Kid Basic Tracking Suit Triple Sportswear Inc
2 Clothes & Shoes Far Out Teambag S 3Top Sports
3 Outdoors Cotton Moneybelt/Polyester 45x11 Prime Sports Ltd
4 Sports Maxrun Ultra short Sprinter Tights Force Sports


SAS Data Set Work.New, EQUALS Option Used
Obs Product_Line Product_Name Supplier_Name
1 Children Boy's and Girl's Ski Pants with Braces Scandinavian Clothing A/S
2 Clothes & Shoes Big Guy Men's Ringer T Eclipse Inc
3 Outdoors Cotton Moneybelt/Polyester 45x11 Prime Sports Ltd
4 Sports Cougar Shorts SD Sporting Goods Inc


Note

The EQUALS | NOEQUALS option is supported by the threaded sort. However, I/O performance might be reduced when you use the EQUALS option because partitioned data sets will be processed as if they are non-partitioned data sets.

Caution The order of observations within BY groups that are returned by the threaded sort might not be consistent between runs. Therefore, using the NOEQUALS option can produce inconsistent results in your output data sets.


Using FIRST. LAST. Processing in the DATA Step

FIRST. LAST. processing in the DATA step can also be used to remove duplicate observations from a SAS data set.

In the data set Company.Onorder, the fourth observation contains a duplicate value for Product_Name. The following program removes the observation that contains the duplicate value by first sorting the input data set, Company.Onorder, by the value of Product_Name. The DATA step then selects only the first observation in the BY group.

     proc sort data=company.onorder
out=work.sorted;
by product_name;
run;
data work.onorder2;
set work.sorted;
by product_name;
if first.product_name;

run;
SAS Data Set Company.Onorder
Obs Product_Line Product_Name Quantity
1 Clothes & Shoes Big Guy Men's Ringer T 70
2 Children Boy's and Girl's Ski Pants with Braces 55
3 Outdoors Cotton Moneybelt/Polyester 45x11 20
4 Sports Big Guy Men's Ringer T 70
5 Sports Cougar Shorts 40
6 Clothes & Shoes Far Out Teambag S 32
7 Children Kid's Basic Tracking Suit 20
8 Sports Maxrun Ultra short Sprinter Tights 25
9 Clothes & Shoes Wa.leather Street Shoes 30


SAS Data Set Work.Onorder2
Obs Product_Line Product_Name Quantity
1 Clothes & Shoes Big Guy Men's Ringer T 70
2 Children Boy's and Girl's Ski Pants with Braces 55
3 Outdoors Cotton Moneybelt/Polyester 45x11 20
4 Sports Cougar Shorts 40
5 Clothes & Shoes Far Out Teambag S 32
6 Children Kid's Basic Tracking Suit 20
7 Sports Maxrun Ultra short Sprinter Tights 25
8 Clothes & Shoes Wa.leather Street Shoes 30



Bar chart iconComparative Example: Removing Duplicate Observations Efficiently

Suppose you want to remove observations from the data set Retail.Order_fact in which the value of Order_Date is duplicated. Among the techniques you could use are

  1. The NODUPKEY Option and the EQUALS Option
  2. The NODUPKEY Option and the NOEQUALS Option
  3. PROC SORT and a DATA Step with BY-Group and FIRST. Processing

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating duplicates.


Programming Techniques

1. The NODUPKEY Option and the EQUALS Option

This program uses the NODUPKEY option and the EQUALS option to check for and eliminate observations that have duplicate BY-variable values. For observations that have identical BY-variable values, the EQUALS option maintains the order from the input data set in the output data set.

     proc sort data=retail.order_fact
out=work.sorted
nodupkey equals;
by order_date;
run;


2. The NODUPKEY Option and the NOEQUALS Option

This program uses the NODUPKEY option and the NOEQUALS option to check for and eliminate observations that have duplicate BY-variable values. For observations that have identical BY-variable values, the NOEQUALS option does not necessarily maintain the order from the input data set in the output data set.

     proc sort data=retail.order_fact
out=work.sorted
nodupkey noequals;
by order_date;
run;


2. PROC SORT and a DATA Step with BY-Group and FIRST. Processing

In this program, the input data set is first sorted using the SORT procedure. Duplicate observations are then removed using BY-group and FIRST. processing.

     proc sort data=retail.order_fact
out=work.sorted;
by order_date;
run;
data work.sorted;
set work.sorted;
by order_date;
if first.order_date;

run;

  • To remove duplicate observations from a SAS data set, use PROC SORT with the NODUPKEY option rather than a PROC SORT step followed by a DATA step that uses FIRST. processing.

  • Be careful not to confuse NODUPKEY with NODUPRECS.