| The SORT procedure can be used to remove duplicate observations when it is
Generally, PROC SORT with the NODUPKEY option uses less I/O and CPU time than PROC SORT followed by a DATA step that uses FIRST. processing. Before viewing a comparative example, let's examine each of the techniques that are listed above. |
Using the NODUPKEY OptionThe NODUPKEY option checks for and eliminates observations that have duplicate BY-variable values. If you specify this option, then PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set. |
General form, PROC SORT with the NODUPKEY option:
PROC SORT DATA=SAS-data-set-name NODUPKEY;where SAS-data-set-name is a valid SAS data set name. |
ExampleThe SAS data set Company.Reorder contains two duplicated
observations. Observation 9 is a duplicate of observation 1, and observation
7 is a duplicate of observation 2. The duplicate observations are removed
when the data is sorted by the values of proc sort data=company.reorder nodupkey;
by product_line product_name;
run; |
SAS Data Set Company.Reorder
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 2 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 3 | Clothes & Shoes | Watchit 120 Sterling/Reflective | Eclipse Inc |
| 4 | Sports | Sparkle Spray Blue | CrystalClear Optics Inc |
| 5 | Outdoors | Money Purse, Black | Top Sports |
| 6 | Sports | Mayday Serious Down Jacket | Mayday Inc |
| 7 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 8 | Clothes & Shoes | Tyfoon Linen Pants | Typhoon Clothing |
| 9 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
SAS Data Set Company.Reorder, Before Removing Duplicate Observations
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 2 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 3 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 4 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 5 | Clothes & Shoes | Tyfoon Linen Pants | Typhoon Clothing |
| 6 | Clothes & Shoes | Watchit 120 Sterling/Reflective | Eclipse Inc |
| 7 | Outdoors | Money Purse, Black | Top Sports |
| 8 | Sports | Mayday Serious Down Jacket | Mayday Inc |
| 9 | Sports | Sparkle Spray Blue CrystalClear | Optics Inc |
SAS Data Set Company.Reorder, Duplicate Observations Removed
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 2 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 3 | Clothes & Shoes | Tyfoon Linen Pants | Typhoon Clothing |
| 4 | Clothes & Shoes | Watchit 120 Sterling/Reflective | Eclipse Inc |
| 5 | Outdoors | Money Purse, Black | Top Sports |
| 6 | Sports | Mayday Serious Down Jacket | Mayday Inc |
| 7 | Sports | Sparkle Spray Blue CrystalClear | Optics Inc |
Using the NODUPRECS OptionThe NODUPRECS option also checks for and eliminates duplicate observations. However, unlike the NODUPKEY option, the NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set. |
General form, PROC SORT with the NODUPRECS option:
PROC SORT DATA=SAS-data-set-name NODUPRECS; where SAS-data-set-name is a valid SAS data set name.
|
| Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables. |
ExampleWhen Company.Reorder is sorted using the BY variable
proc sort data=company.reorder noduprecs;
by product_line;
run; |
SAS Data Set Company.Reorder
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 2 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 3 | Clothes & Shoes | Watchit 120 Sterling/Reflective | Eclipse Inc |
| 4 | Sports | Sparkle Spray Blue | CrystalClear Optics Inc |
| 5 | Outdoors | Money Purse, Black | Top Sports |
| 6 | Sports | Mayday Serious Down Jacket | Mayday Inc |
| 7 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 8 | Clothes & Shoes | Tyfoon Linen Pants | Typhoon Clothing |
| 9 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
SAS Data Set Company.Reorder, Before Removing Duplicate Observation
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 2 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 3 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 4 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 5 | Clothes & Shoes | Watchit 120 Sterling/Reflective | Eclipse Inc |
| 6 | Clothes & Shoes | Tyfoon Linen Pants | Typhoon Clothing |
| 7 | Outdoors | Money Purse, Black | Top Sports |
| 8 | Sports | Sparkle Spray Blue | CrystalClear Optic Inc |
| 9 | Sports | Mayday Serious Down Jacket | Mayday Inc |
SAS Data Set Company.Reorder, Duplicate Observation Remaining
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 2 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 3 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 4 | Clothes & Shoes | Watchit 120 Sterling/Reflective | Eclipse Inc |
| 5 | Clothes & Shoes | Tyfoon Linen Pants | Typhoon Clothing |
| 6 | Outdoors | Money Purse, Black | Top Sports |
| 7 | Sports | Sparkle Spray Blue | CrystalClear Optic Inc |
| 8 | Sports | Mayday Serious Down Jacket | Mayday Inc |
Both duplicate observations are removed when Company.Reorder
is sorted by both Product_Line and Product_Name
and when the NODUPRECS option is used.
proc sort data=company.reorder noduprecs;
by product_line product_name;
run; |
SAS Data Set Company.Reorder, Both Duplicate Observations Removed
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Kids Children's Fleece Hat | 3Top Sports |
| 2 | Children | Ski Jacket w/Removable Fleece | Scandinavian Clothing A/S |
| 3 | Clothes & Shoes | Tyfoon Linen Pants | Typhoon Clothing |
| 4 | Clothes & Shoes | Watchit 120 Sterling/Reflective | Eclipse Inc |
| 5 | Outdoors | Money Purse, Black | Top Sports |
| 6 | Sports | Mayday Serious Down Jacket | Mayday Inc |
| 7 | Sports | Sparkle Spray Blue CrystalClear | Optics Inc |
| The SORTDUP= system option controls how NODUPRECS processing works. Specifying SORTDUP=PHYSICAL removes duplicates based on all variables in the data set. This is the default. Specifying SORTDUP=LOGICAL removes duplicates based only on the variables that remain after the DROP= and KEEP= data set options are processed. See the SAS documentation for more information. |
Using the EQUALS | NOEQUALS OptionEQUALS | NOEQUALS is a SORT procedure option that helps to determine the order of observations in the output data set. When you use NODUPRECS or NODUPKEY to remove observations from the output data set, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed. |
EQUALS is the default. For observations that have identical BY-variable values, EQUALS maintains the order from the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set. NOEQUALS can save CPU time and memory resources. |
ExampleThe following program uses PROC SORT with the NODUPKEY option and the NOEQUALS option to create an output data set that contains only the first observation in each BY group. Notice that the output data set Work.New contains different observations when the EQUALS option is used. proc sort data=company.products out=work.new
nodupkey noequals;
by product_line;
run; |
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Clothes & Shoes | Big Guy Men's Ringer T | Eclipse Inc |
| 2 | Children | Boy's and Girl's Ski Pants with Braces | Scandinavian Clothing A/S |
| 3 | Outdoors | Cotton Moneybelt/Polyester 45x11 | Prime Sports Ltd |
| 4 | Sports | Cougar Shorts | SD Sporting Goods Inc |
| 5 | Clothes & Shoes | Far Out Teambag S | 3Top Sports |
| 6 | Children | Kid Basic Tracking Suit | Triple Sportswear Inc |
| 7 | Sports | Maxrun Ultra short Sprinter Tights | Force Sports |
| 8 | Clothes & Shoes | Wa.leather Street Shoes | Fuller Trading Co. |
SAS Data Set Work.New, NOEQUALS Option Used
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Kid Basic Tracking Suit | Triple Sportswear Inc |
| 2 | Clothes & Shoes | Far Out Teambag S | 3Top Sports |
| 3 | Outdoors | Cotton Moneybelt/Polyester 45x11 | Prime Sports Ltd |
| 4 | Sports | Maxrun Ultra short Sprinter Tights | Force Sports |
SAS Data Set Work.New, EQUALS Option Used
| Obs | Product_Line | Product_Name | Supplier_Name |
| 1 | Children | Boy's and Girl's Ski Pants with Braces | Scandinavian Clothing A/S |
| 2 | Clothes & Shoes | Big Guy Men's Ringer T | Eclipse Inc |
| 3 | Outdoors | Cotton Moneybelt/Polyester 45x11 | Prime Sports Ltd |
| 4 | Sports | Cougar Shorts | SD Sporting Goods Inc |
The EQUALS | NOEQUALS option is supported by the threaded sort. However, I/O performance might be reduced when you use the EQUALS option because partitioned data sets will be processed as if they are non-partitioned data sets. |
| The order of observations within BY groups that are returned by the threaded sort might not be consistent between runs. Therefore, using the NOEQUALS option can produce inconsistent results in your output data sets. |
| Using FIRST. LAST. Processing in the DATA Step |
FIRST. LAST. processing in the DATA step can also be used to remove duplicate observations from a SAS data set. In the data set Company.Onorder, the fourth observation
contains a duplicate value for proc sort data=company.onorder |
| Obs | Product_Line | Product_Name | Quantity |
| 1 | Clothes & Shoes | Big Guy Men's Ringer T | 70 |
| 2 | Children | Boy's and Girl's Ski Pants with Braces | 55 |
| 3 | Outdoors | Cotton Moneybelt/Polyester 45x11 | 20 |
| 4 | Sports | Big Guy Men's Ringer T | 70 |
| 5 | Sports | Cougar Shorts | 40 |
| 6 | Clothes & Shoes | Far Out Teambag S | 32 |
| 7 | Children | Kid's Basic Tracking Suit | 20 |
| 8 | Sports | Maxrun Ultra short Sprinter Tights | 25 |
| 9 | Clothes & Shoes | Wa.leather Street Shoes | 30 |
SAS Data Set Work.Onorder2
| Obs | Product_Line | Product_Name | Quantity |
| 1 | Clothes & Shoes | Big Guy Men's Ringer T | 70 |
| 2 | Children | Boy's and Girl's Ski Pants with Braces | 55 |
| 3 | Outdoors | Cotton Moneybelt/Polyester 45x11 | 20 |
| 4 | Sports | Cougar Shorts | 40 |
| 5 | Clothes & Shoes | Far Out Teambag S | 32 |
| 6 | Children | Kid's Basic Tracking Suit | 20 |
| 7 | Sports | Maxrun Ultra short Sprinter Tights | 25 |
| 8 | Clothes & Shoes | Wa.leather Street Shoes | 30 |
|
| Suppose you want to remove observations from the data set Retail.Order_fact
in which the value of
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating duplicates. |
Programming Techniques |
| |
| |
|