04 05 05 Removing Duplicate Observations Efficiently

The SORT procedure can be used to remove duplicate observations when it is

used with the NODUPKEY option
used with the NODUPRECS option
followed by FIRST. processing in the DATA step.

Generally, PROC SORT with the NODUPKEY option uses less I/O and CPU time than PROC SORT followed by a DATA step that uses FIRST. processing. Before viewing a comparative example, let's examine each of the techniques that are listed above.

Using the NODUPKEY Option

The NODUPKEY option checks for and eliminates observations that have duplicate BY-variable values. If you specify this option, then PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

General form, PROC SORT with the NODUPKEY option:
PROC SORT DATA=SAS-data-set-name NODUPKEY;
where SAS-data-set-name is a valid SAS data set name.

Example

The SAS data set Company.Reorder contains two duplicated observations. Observation 9 is a duplicate of observation 1, and observation 7 is a duplicate of observation 2. The duplicate observations are removed when the data is sorted by the values of Product_Line and Product_Name and when the NODUPKEY option is used.

proc sort data=company.reorder nodupkey; by product_line product_name; run;

SAS Data Set Company.Reorder

Obs Product_Line Product_Name Supplier_Name

1 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

2 Children Kids Children's Fleece Hat 3Top Sports

3 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc

4 Sports Sparkle Spray Blue CrystalClear Optics Inc

5 Outdoors Money Purse, Black Top Sports

6 Sports Mayday Serious Down Jacket Mayday Inc

7 Children Kids Children's Fleece Hat 3Top Sports

8 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing

9 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Clothes & Shoes	Watchit 120 Sterling/Reflective	Eclipse Inc
4	Sports	Sparkle Spray Blue	CrystalClear Optics Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
7	Children	Kids Children's Fleece Hat	3Top Sports
8	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
9	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S

SAS Data Set Company.Reorder, Before Removing Duplicate Observations

Obs Product_Line Product_Name Supplier_Name

1 Children Kids Children's Fleece Hat 3Top Sports

2 Children Kids Children's Fleece Hat 3Top Sports

3 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

4 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

5 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing

6 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc

7 Outdoors Money Purse, Black Top Sports

8 Sports Mayday Serious Down Jacket Mayday Inc

9 Sports Sparkle Spray Blue CrystalClear Optics Inc

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kids Children's Fleece Hat	3Top Sports
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
4	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
5	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
6	Clothes & Shoes	Watchit 120 Sterling/Reflective	Eclipse Inc
7	Outdoors	Money Purse, Black	Top Sports
8	Sports	Mayday Serious Down Jacket	Mayday Inc
9	Sports	Sparkle Spray Blue CrystalClear	Optics Inc

SAS Data Set Company.Reorder, Duplicate Observations Removed

Obs Product_Line Product_Name Supplier_Name

1 Children Kids Children's Fleece Hat 3Top Sports

2 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

3 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing

4 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc

5 Outdoors Money Purse, Black Top Sports

6 Sports Mayday Serious Down Jacket Mayday Inc

7 Sports Sparkle Spray Blue CrystalClear Optics Inc

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kids Children's Fleece Hat	3Top Sports
2	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
3	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
4	Clothes & Shoes	Watchit 120 Sterling/Reflective	Eclipse Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
7	Sports	Sparkle Spray Blue CrystalClear	Optics Inc

Using the NODUPRECS Option

The NODUPRECS option also checks for and eliminates duplicate observations. However, unlike the NODUPKEY option, the NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

General form, PROC SORT with the NODUPRECS option:
PROC SORT DATA=SAS-data-set-name NODUPRECS;

where SAS-data-set-name is a valid SAS data set name.

NODUP is an alias for NODUPRECS.

Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables.

Example

When Company.Reorder is sorted using the BY variable Product_Line and the NODUPRECS option, the duplicate observation that contains the product name Kids Children's Fleece Hat is eliminated because it exactly matches the observation that was previously written to the output data set. The duplicate observation that contains the product name Ski Jacket w/Removable Fleece is retained because it does not exactly match the observation that was previously written to the output data set.

proc sort data=company.reorder noduprecs; by product_line; run;

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Clothes & Shoes	Watchit 120 Sterling/Reflective	Eclipse Inc
4	Sports	Sparkle Spray Blue	CrystalClear Optics Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
7	Children	Kids Children's Fleece Hat	3Top Sports
8	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
9	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S

SAS Data Set Company.Reorder, Before Removing Duplicate Observation

Obs Product_Line Product_Name Supplier_Name

1 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

2 Children Kids Children's Fleece Hat 3Top Sports

3 Children Kids Children's Fleece Hat 3Top Sports

4 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

5 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc

6 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing

7 Outdoors Money Purse, Black Top Sports

8 Sports Sparkle Spray Blue CrystalClear Optic Inc

9 Sports Mayday Serious Down Jacket Mayday Inc

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Children	Kids Children's Fleece Hat	3Top Sports
4	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
5	Clothes & Shoes	Watchit 120 Sterling/Reflective	Eclipse Inc
6	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
7	Outdoors	Money Purse, Black	Top Sports
8	Sports	Sparkle Spray Blue	CrystalClear Optic Inc
9	Sports	Mayday Serious Down Jacket	Mayday Inc

SAS Data Set Company.Reorder, Duplicate Observation Remaining

Obs Product_Line Product_Name Supplier_Name

1 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

2 Children Kids Children's Fleece Hat 3Top Sports

3 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

4 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc

5 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing

6 Outdoors Money Purse, Black Top Sports

7 Sports Sparkle Spray Blue CrystalClear Optic Inc

8 Sports Mayday Serious Down Jacket Mayday Inc

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
4	Clothes & Shoes	Watchit 120 Sterling/Reflective	Eclipse Inc
5	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
6	Outdoors	Money Purse, Black	Top Sports
7	Sports	Sparkle Spray Blue	CrystalClear Optic Inc
8	Sports	Mayday Serious Down Jacket	Mayday Inc

Both duplicate observations are removed when Company.Reorder is sorted by both Product_Line and Product_Name and when the NODUPRECS option is used.
proc sort data=company.reorder noduprecs; by product_line product_name; run;

SAS Data Set Company.Reorder, Both Duplicate Observations Removed

Obs Product_Line Product_Name Supplier_Name

1 Children Kids Children's Fleece Hat 3Top Sports

2 Children Ski Jacket w/Removable Fleece Scandinavian Clothing A/S

3 Clothes & Shoes Tyfoon Linen Pants Typhoon Clothing

4 Clothes & Shoes Watchit 120 Sterling/Reflective Eclipse Inc

5 Outdoors Money Purse, Black Top Sports

6 Sports Mayday Serious Down Jacket Mayday Inc

7 Sports Sparkle Spray Blue CrystalClear Optics Inc

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kids Children's Fleece Hat	3Top Sports
2	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
3	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
4	Clothes & Shoes	Watchit 120 Sterling/Reflective	Eclipse Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
7	Sports	Sparkle Spray Blue CrystalClear	Optics Inc

The SORTDUP= system option controls how NODUPRECS processing works. Specifying SORTDUP=PHYSICAL removes duplicates based on all variables in the data set. This is the default. Specifying SORTDUP=LOGICAL removes duplicates based only on the variables that remain after the DROP= and KEEP= data set options are processed. See the SAS documentation for more information.

Using the EQUALS | NOEQUALS Option

EQUALS | NOEQUALS is a SORT procedure option that helps to determine the order of observations in the output data set. When you use NODUPRECS or NODUPKEY to remove observations from the output data set, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.

EQUALS is the default. For observations that have identical BY-variable values, EQUALS maintains the order from the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set. NOEQUALS can save CPU time and memory resources.

Example

The following program uses PROC SORT with the NODUPKEY option and the NOEQUALS option to create an output data set that contains only the first observation in each BY group. Notice that the output data set Work.New contains different observations when the EQUALS option is used.

proc sort data=company.products out=work.new nodupkey noequals; by product_line; run;

SAS Data Set Company.Products

Obs Product_Line Product_Name Supplier_Name

1 Clothes & Shoes Big Guy Men's Ringer T Eclipse Inc

2 Children Boy's and Girl's Ski Pants with Braces Scandinavian Clothing A/S

3 Outdoors Cotton Moneybelt/Polyester 45x11 Prime Sports Ltd

4 Sports Cougar Shorts SD Sporting Goods Inc

5 Clothes & Shoes Far Out Teambag S 3Top Sports

6 Children Kid Basic Tracking Suit Triple Sportswear Inc

7 Sports Maxrun Ultra short Sprinter Tights Force Sports

8 Clothes & Shoes Wa.leather Street Shoes Fuller Trading Co.

Obs	Product_Line	Product_Name	Supplier_Name
1	Clothes & Shoes	Big Guy Men's Ringer T	Eclipse Inc
2	Children	Boy's and Girl's Ski Pants with Braces	Scandinavian Clothing A/S
3	Outdoors	Cotton Moneybelt/Polyester 45x11	Prime Sports Ltd
4	Sports	Cougar Shorts	SD Sporting Goods Inc
5	Clothes & Shoes	Far Out Teambag S	3Top Sports
6	Children	Kid Basic Tracking Suit	Triple Sportswear Inc
7	Sports	Maxrun Ultra short Sprinter Tights	Force Sports
8	Clothes & Shoes	Wa.leather Street Shoes	Fuller Trading Co.

SAS Data Set Work.New, NOEQUALS Option Used

Obs Product_Line Product_Name Supplier_Name

1 Children Kid Basic Tracking Suit Triple Sportswear Inc

2 Clothes & Shoes Far Out Teambag S 3Top Sports

3 Outdoors Cotton Moneybelt/Polyester 45x11 Prime Sports Ltd

4 Sports Maxrun Ultra short Sprinter Tights Force Sports

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kid Basic Tracking Suit	Triple Sportswear Inc
2	Clothes & Shoes	Far Out Teambag S	3Top Sports
3	Outdoors	Cotton Moneybelt/Polyester 45x11	Prime Sports Ltd
4	Sports	Maxrun Ultra short Sprinter Tights	Force Sports

SAS Data Set Work.New, EQUALS Option Used

Obs Product_Line Product_Name Supplier_Name

1 Children Boy's and Girl's Ski Pants with Braces Scandinavian Clothing A/S

2 Clothes & Shoes Big Guy Men's Ringer T Eclipse Inc

3 Outdoors Cotton Moneybelt/Polyester 45x11 Prime Sports Ltd

4 Sports Cougar Shorts SD Sporting Goods Inc

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Boy's and Girl's Ski Pants with Braces	Scandinavian Clothing A/S
2	Clothes & Shoes	Big Guy Men's Ringer T	Eclipse Inc
3	Outdoors	Cotton Moneybelt/Polyester 45x11	Prime Sports Ltd
4	Sports	Cougar Shorts	SD Sporting Goods Inc

The EQUALS | NOEQUALS option is supported by the threaded sort. However, I/O performance might be reduced when you use the EQUALS option because partitioned data sets will be processed as if they are non-partitioned data sets.

The order of observations within BY groups that are returned by the threaded sort might not be consistent between runs. Therefore, using the NOEQUALS option can produce inconsistent results in your output data sets.

Using FIRST. LAST. Processing in the DATA Step

FIRST. LAST. processing in the DATA step can also be used to remove duplicate observations from a SAS data set.

In the data set Company.Onorder, the fourth observation contains a duplicate value for Product_Name. The following program removes the observation that contains the duplicate value by first sorting the input data set, Company.Onorder, by the value of Product_Name. The DATA step then selects only the first observation in the BY group.
proc sort data=company.onorder
out=work.sorted;
by product_name;
run;
data work.onorder2;
set work.sorted;
by product_name;
if first.product_name;
run;

SAS Data Set Company.Onorder

Obs Product_Line Product_Name Quantity

1 Clothes & Shoes Big Guy Men's Ringer T 70

2 Children Boy's and Girl's Ski Pants with Braces 55

3 Outdoors Cotton Moneybelt/Polyester 45x11 20

4 Sports Big Guy Men's Ringer T 70

5 Sports Cougar Shorts 40

6 Clothes & Shoes Far Out Teambag S 32

7 Children Kid's Basic Tracking Suit 20

8 Sports Maxrun Ultra short Sprinter Tights 25

9 Clothes & Shoes Wa.leather Street Shoes 30

Obs	Product_Line	Product_Name	Quantity
1	Clothes & Shoes	Big Guy Men's Ringer T	70
2	Children	Boy's and Girl's Ski Pants with Braces	55
3	Outdoors	Cotton Moneybelt/Polyester 45x11	20
4	Sports	Big Guy Men's Ringer T	70
5	Sports	Cougar Shorts	40
6	Clothes & Shoes	Far Out Teambag S	32
7	Children	Kid's Basic Tracking Suit	20
8	Sports	Maxrun Ultra short Sprinter Tights	25
9	Clothes & Shoes	Wa.leather Street Shoes	30

SAS Data Set Work.Onorder2

Obs Product_Line Product_Name Quantity

1 Clothes & Shoes Big Guy Men's Ringer T 70

2 Children Boy's and Girl's Ski Pants with Braces 55

3 Outdoors Cotton Moneybelt/Polyester 45x11 20

4 Sports Cougar Shorts 40

5 Clothes & Shoes Far Out Teambag S 32

6 Children Kid's Basic Tracking Suit 20

7 Sports Maxrun Ultra short Sprinter Tights 25

8 Clothes & Shoes Wa.leather Street Shoes 30

Obs	Product_Line	Product_Name	Quantity
1	Clothes & Shoes	Big Guy Men's Ringer T	70
2	Children	Boy's and Girl's Ski Pants with Braces	55
3	Outdoors	Cotton Moneybelt/Polyester 45x11	20
4	Sports	Cougar Shorts	40
5	Clothes & Shoes	Far Out Teambag S	32
6	Children	Kid's Basic Tracking Suit	20
7	Sports	Maxrun Ultra short Sprinter Tights	25
8	Clothes & Shoes	Wa.leather Street Shoes	30

Comparative Example: Removing Duplicate Observations Efficiently

Suppose you want to remove observations from the data set Retail.Order_fact in which the value of Order_Date is duplicated. Among the techniques you could use are

The NODUPKEY Option and the EQUALS Option

The NODUPKEY Option and the NOEQUALS Option

PROC SORT and a DATA Step with BY-Group and FIRST. Processing

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating duplicates.

Programming Techniques

The NODUPKEY Option and the EQUALS Option

This program uses the NODUPKEY option and the EQUALS option to check for and eliminate observations that have duplicate BY-variable values. For observations that have identical BY-variable values, the EQUALS option maintains the order from the input data set in the output data set.

proc sort data=retail.order_fact
out=work.sorted
nodupkey equals;
by order_date;
run;

The NODUPKEY Option and the NOEQUALS Option

This program uses the NODUPKEY option and the NOEQUALS option to check for and eliminate observations that have duplicate BY-variable values. For observations that have identical BY-variable values, the NOEQUALS option does not necessarily maintain the order from the input data set in the output data set.

proc sort data=retail.order_fact
out=work.sorted
nodupkey noequals;
by order_date;
run;

PROC SORT and a DATA Step with BY-Group and FIRST. Processing

In this program, the input data set is first sorted using the SORT procedure. Duplicate observations are then removed using BY-group and FIRST. processing.

proc sort data=retail.order_fact
out=work.sorted;
by order_date;
run;
data work.sorted;
set work.sorted;
by order_date;
if first.order_date;
run;

To remove duplicate observations from a SAS data set, use PROC SORT with the NODUPKEY option rather than a PROC SORT step followed by a DATA step that uses FIRST. processing.

Be careful not to confuse NODUPKEY with NODUPRECS.

Using the NODUPKEY Option

Example

Using the NODUPRECS Option

Example

Using the EQUALS | NOEQUALS Option

Example

Comparative Example: Removing Duplicate Observations Efficiently

Programming Techniques

document.write(generalRecommendationsTitle);