Reducing Data Storage Space for Numeric Variables
Another way to eliminate wasted space, and thereby reduce the amount of data storage space that you need, is to reduce the length of numeric variables. In addition to conserving data storage space, reduced-length numeric variables use less I/O, both when data is written and when it is read. For a file that is read frequently, these savings can be significant. However, in order to safely reduce the length of numeric variables, you need to understand how SAS stores numeric data.


How SAS Stores Numeric Variables

To store numbers of large magnitude and to perform computations that require many digits of precision to the right of the decimal point, SAS stores all numeric values using double-precision floating-point representation. SAS stores the value of a numeric variable as multiple digits per byte. A SAS numeric variable can be from 2 to 8 bytes or 3 to 8 bytes in length, depending on your operating environment. The default length for a numeric variable is 8 bytes.

The figures below show how SAS stores a numeric value in 8 bytes. For mainframe environments, the first bit stores the sign, the next seven bits store the exponent of the value, and the remaining 56 bits store the mantissa.


Detail of numeric variable storage in mainframe environments

For non-mainframe environments, the first bit stores the sign, the next eleven bits store the exponent of the value, and the remaining 52 bits store the mantissa.


Detail of numeric variable storage in non-mainframe environments

Note: The minimum length for a numeric variable is 2 bytes in mainframe environments and 3 bytes in non-mainframe environments.
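
To see the stored bytes for yourself, you can write a value with the HEX16. format, which displays the floating-point representation of a numeric value in hexadecimal. The following is a minimal sketch, not part of the original lesson; the variable name x and the sample values are chosen only for illustration.

```sas
/* Sketch: write a few values to the log together with the 8 bytes */
/* that SAS uses to hold them (16 hexadecimal digits).             */
data _null_;
   do x=1, -1, 0.5, 70;
      put x= +1 'is stored as' +1 x hex16.;
   end;
run;
```

In a non-mainframe (IEEE) environment you would expect the value 1 to appear as 3FF0000000000000 and the value -1 as BFF0000000000000. Only the leading sign bit differs between the two. In a mainframe environment the digits differ because the exponent and mantissa are laid out differently.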


Now that you have seen how SAS stores numeric variables, let's look at how you can assign a length to your numeric variables that is less than the default length of 8 bytes.


Assigning Lengths to Numeric Variables

You can use a LENGTH statement to assign a length from 2 to 8 bytes to numeric variables. Remember, the minimum length of numeric variables depends on the operating environment. Also, keep in mind that the LENGTH statement affects the length of a numeric variable only in the output data set. Numeric variables always have a length of 8 bytes in the program data vector and during processing.


General form, LENGTH statement for numeric variables:
LENGTH variable(s) length <DEFAULT=n>;
where
  • variable(s) specifies the name of one or more numeric SAS variables, separated by spaces.

  • length is an integer that specifies the length of the variable(s).

  • the optional DEFAULT=n argument changes the default number of bytes that SAS uses to store the values of any newly created numeric variables. If you use the DEFAULT= argument, you do not need to list any variable(s).
Note: Valid values for n or length are 2 through 8 or 3 through 8, depending on your operating environment.


DEFAULT= applies only to numeric variables that are added to the program data vector after the LENGTH statement is compiled. You would list specific variables in the LENGTH statement along with the DEFAULT= argument only if you wanted those variables to have a length other than the value for DEFAULT=. If you list individual variables in the LENGTH statement, you must list an integer length for each of them.
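
The following sketch shows one way to combine a listed variable with the DEFAULT= argument. The data set Work.Length_Demo and its variables are hypothetical and are used only to illustrate the syntax described above.

```sas
/* Sketch: Store_ID is listed explicitly, so it gets 3 bytes.      */
/* Units_Sold and Sale_Year are created after the LENGTH statement */
/* is compiled, so they pick up the DEFAULT= length of 4 bytes.    */
data work.length_demo;
   length Store_ID 3 default=4;
   Store_ID=12;
   Units_Sold=250;
   Sale_Year=2024;
run;

/* The Length column in the PROC CONTENTS output shows the stored lengths. */
proc contents data=work.length_demo;
run;
```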


Caution: You should assign reduced lengths to numeric variables only if those variables have integer values. Fractional numbers lose precision if truncated. You will learn more about the loss of precision with reduced-length numeric variables later in this lesson.


Example

The following program assigns a length of 4 to the new variable Sale_Percent in the data set ReducedSales. The LENGTH statement in this DATA step does not apply to the variables that are read in from the Sales data set; those variables will maintain whatever length they had in Sales when they are read into ReducedSales.

     data reducedsales;
        length default=4;
        set sales;
        Sale_Percent=15;   
     run;

Maintaining Precision in Reduced-Length Numeric Variables

There is a limit to the values that you can precisely store in a reduced-length numeric variable. You have learned that reducing the number of bytes that are used for storing a numeric variable does not affect how the numbers are stored in the program data vector. Instead, specifying a value of less than 8 in the LENGTH statement causes the number to be truncated to the specified length when the value is written to the SAS data set.

You should never use the LENGTH statement to reduce the length of your numeric variables if the values are not integers. Fractional numbers lose precision if truncated. Even if the values are integers, you should keep in mind that reducing the length of a numeric variable limits the integer values that can accurately be stored as a value.

The following table lists the largest integer that can be stored exactly at each length in UNIX or Windows operating environments.


UNIX/Windows

Length (bytes)   Largest Integer Represented Exactly
3                8,192
4                2,097,152
5                536,870,912
6                137,438,953,472
7                35,184,372,088,832
8                9,007,199,254,740,992


The following table lists the largest integer that can be stored exactly at each length in the z/OS operating environment.


z/OS

Length (bytes)   Largest Integer Represented Exactly
2                256
3                65,536
4                16,777,216
5                4,294,967,296
6                1,099,511,627,776
7                281,474,976,710,656
8                72,057,594,037,927,936
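
These limits follow from the bit layouts described earlier. A variable of a given length keeps 8 bits per byte, minus the sign and exponent bits. In non-mainframe environments the format also carries one implied leading mantissa bit. The DATA step below is a sketch of that arithmetic for comparison with the tables; the variable names len, ieee_max, and zos_max are illustrative only.

```sas
/* Sketch: derive the largest exactly representable integer per length. */
data _null_;
   do len=2 to 8;
      /* UNIX/Windows: 8*len bits, less 1 sign and 11 exponent bits, */
      /* plus 1 implied bit. Lengths below 3 are not valid there.    */
      if len>=3 then ieee_max=2**(8*len-11);
      else ieee_max=.;
      /* z/OS: 8*len bits, less 1 sign and 7 exponent bits. */
      zos_max=2**(8*len-8);
      put len= ieee_max=comma26. zos_max=comma26.;
   end;
run;
```

For example, a length of 3 gives 2**13, or 8,192, in a UNIX or Windows environment and 2**16, or 65,536, on z/OS.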


When you store an integer that is equal to or less than the number listed above as the largest integer that can be represented exactly in a reduced-length variable, SAS truncates bytes that contain only zeros. If the integer that is stored in a reduced-length variable is larger than the recommended limit, SAS truncates bytes that contain numbers other than zero, and the integer value is changed. Similarly, you should not reduce the stored size of non-integer data because it can result in a loss of precision due to the truncation of nonzero bytes.
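
The following sketch illustrates this truncation with a 3-byte variable, assuming a UNIX or Windows environment where the limit for that length is 8,192. The data set Work.Trunc_Demo and the variables Small_Int, Original, and Lost are hypothetical.

```sas
/* Sketch: store integers on either side of the 3-byte limit. */
data work.trunc_demo;
   length Small_Int 3;
   do Small_Int=8191, 8192, 8193, 8195;
      Original=Small_Int;                  /* 8-byte copy for comparison   */
      put 'In the DATA step: ' Small_Int=; /* still 8 bytes in the PDV     */
      output;                              /* truncation happens on output */
   end;
run;

/* Read the values back and measure what was lost. */
data _null_;
   set work.trunc_demo;
   Lost=Original-Small_Int;
   put 'Read back: ' Small_Int= Original= Lost=;
run;
```

In the first step the log should show every value intact, because the variable still occupies 8 bytes in the program data vector. In the second step you would expect 8191 and 8192 to come back unchanged and the two larger values to come back smaller than they were stored. On z/OS, where a 3-byte variable can hold integers up to 65,536, none of these values should change.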

If you decide to reduce the length of your numeric variables, you might want to verify that you have not lost any precision in your values. Let's look at one way to do this.


Using PROC COMPARE

You can use PROC COMPARE to gauge the precision of the values that are stored in a shortened numeric variable by comparing the original variable with the shortened variable. The COMPARE procedure compares the contents of two SAS data sets, selected variables in different data sets, or variables within the same data set.


General form, PROC COMPARE step to compare two data sets:
PROC COMPARE BASE=SAS-data-set-one
                           COMPARE=SAS-data-set-two;
RUN;
where SAS-data-set-one and SAS-data-set-two specify the two SAS data sets that you want to compare.


PROC COMPARE is a good technique to use for gauging the loss of precision in shortened numeric variables because it shows you whether there are differences in the stored numeric values even if these differences do not show up once the numeric variables have been formatted. PROC COMPARE looks at the two data sets and compares their

  • data set attributes
  • variables
  • variable attributes for matching variables
  • observations
  • values in matching variables.

Output from the COMPARE procedure includes

  • a data set summary
  • a variables summary
  • a listing of common variables that have different attributes
  • an observation summary
  • a values comparison summary
  • a listing of variables that have unequal values
  • a detailed list of value comparison results for variables.
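
PROC COMPARE can also compare variables within the same data set. To do this, omit the COMPARE= option and pair the variables with the VAR and WITH statements. The sketch below uses a hypothetical data set, Work.Rate_Check, that holds a fractional value at full length (Rate) and at a reduced length (Rate_Short).

```sas
/* Sketch: keep a full-length and a reduced-length copy side by side. */
data work.rate_check;
   length Rate_Short 4;
   do i=1 to 5;
      Rate=i/3;           /* fractional values, stored in 8 bytes */
      Rate_Short=Rate;    /* same value, truncated to 4 bytes     */
      output;
   end;
   drop i;
run;

/* Compare Rate with Rate_Short within the one data set. */
proc compare base=work.rate_check;
   var Rate;
   with Rate_Short;
run;
```

Because the values are not integers, you would expect the procedure to report small differences between the two variables.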


Example

The data set Company.Discount contains data about sale dates and discounts for certain retail products. There are 35 observations in Company.Discount, which is described below.


Variable           Type   Length   Description
Product_ID         num    8        product ID number
Start_Date         num    4        start date of sale
End_Date           num    5        end date of sale
Unit_Sales_Price   num    8        discounted sales price per unit
Discount           num    8        discount as percent of normal sales price


Suppose you shorten the length of the numeric variable Discount. The DATA step below creates a new data set named Company.Discount_Short, whose only difference from Company.Discount is that the length of the variable Discount is 4 instead of 8.

     data company.discount_short;
        length Discount 4;
        set Company.Discount;
     run;

You can use PROC COMPARE to evaluate whether shortening the length of Discount affects the precision of its values by comparing Company.Discount to Company.Discount_Short.

     proc compare base=company.discount
                  compare=company.discount_short;
     run;

If you were to print these two data sets (Company.Discount and Company.Discount_Short), the values might appear to be identical. However, there are differences in the values as they are stored that are not apparent in the formatted output.

In the partial output below, you can see that shortening the length of Discount results in a loss of precision in its values; the values for Discount in Company.Discount_Short differ by a maximum of 1.9073E-07. The value comparison results show that although the values for Discount in the first five observations appear as 70% in both data sets, the precise (unformatted) values differ by -1.907E-7.


Partial PROC COMPARE Output
     
                     Variables Summary
                                              
     Number of Variables in Common: 5.         
     Number of Variables with Differing Attributes: 1.
                      
  
     Listing of Common Variables with Differing Attributes
                          
     Variable  Dataset                Type  Length  Format 
                                  
     Discount  COMPANY.DISCOUNT        Num       8  PERCENT. 
               COMPANY.DISCOUNT_SHORT  Num       4  PERCENT.

  
                     Values Comparison Summary           
                                                    
    Number of Variables Compared with All Observations Equal: 4.    
    Number of Variables Compared with Some Observations Unequal: 1. 
    Total Number of Values which Compare Unequal: 35.             
    Maximum Difference: 1.9073E-07.                      

   
            Value Comparison Results for Variables         
                                                              
  __________________________________________________________   
             ||  Discount as Percent of Normal Retail Sal      
             ||  .. es Price                                   
             ||       Base    Compare                         
         Obs ||   Discount   Discount      Diff.     % Diff     
   ________  ||  _________  _________  _________  _________    
             ||                                                 
          1  ||       70%        70%   -1.907E-7  -0.000027     
          2  ||       70%        70%   -1.907E-7  -0.000027      
          3  ||       70%        70%   -1.907E-7  -0.000027     
          4  ||       70%        70%   -1.907E-7  -0.000027      
          5  ||       70%        70%   -1.907E-7  -0.000027      



Comparative Example: Creating a SAS Data Set That Contains Reduced-Length Numeric Variables

Suppose you want to create a SAS data set in which to store retail data about a group of orders. Suppose that the data you want to include in your data set is all numeric data and that it is currently stored in a raw data file. You can create the data set using

  1. Default-Length Numeric Variables
  2. Reduced-Length Numeric Variables

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. General recommendations for creating reduced-length numeric variables follow the sample programs.


Note: Throughout this course, the keyword _NULL_ is often used as the data set name in sample programs. Using _NULL_ causes SAS to execute the DATA step as if it were creating a new data set. However, no observations or variables are written to an output data set. Using _NULL_ when benchmarking enables you to determine what resources are used to read a SAS data set.


Programming Techniques

1. Default-Length Numeric Variables

This program reads the external data file that is referenced by the fileref flat1 and creates a new data set called Retail.Longnums that contains 12 numeric variables. Each of the variables in Retail.Longnums has the default storage length of 8 bytes. The second DATA step in this program reads the numeric variables from Retail.Longnums.

data retail.longnums;
   infile flat1;
   input  Customer_ID          12.
          Employee_ID          12.
          Street_ID            12.
          Order_Date           date9.
          Delivery_Date        date9.
          Order_ID             12. 
          Order_Type           comma16.
          Product_ID           12.     
          Quantity             4.     
          Total_Retail_Price   dollar13.2
          CostPrice_Per_Unit   dollar13.2
          Discount             5.2       ;
run;

data _null_;
  set retail.longnums;
run;


2. Reduced-Length Numeric Variables

This program reads the external data file that is referenced by the fileref flat1 and creates a new SAS data set called Retail.Shortnums that contains 12 numeric variables. A LENGTH statement is used to reduce the storage length of most of the numeric variables in Retail.Shortnums, as follows:

  • Total_Retail_Price and CostPrice_Per_Unit have a storage length of 8 bytes.
  • Product_ID has a storage length of 7 bytes.
  • Street_ID and Order_ID have a storage length of 6 bytes.
  • Employee_ID has a storage length of 5 bytes.
  • Customer_ID, Order_Date, Delivery_Date and Discount have a storage length of 4 bytes.
  • Order_Type and Quantity have a storage length of 3 bytes.
The second DATA step reads the reduced-length numeric variables from Retail.Shortnums.

data retail.shortnums;
   infile flat1;
   length Order_Type Quantity       3
          Customer_ID Order_Date 
             Delivery_Date Discount 4
          Employee_ID               5
          Street_ID Order_ID        6
          Product_ID                7
          Total_Retail_Price   
             CostPrice_Per_Unit     8;
   input  Customer_ID          12.
          Employee_ID          12.
          Street_ID            12.
          Order_Date           date9.
          Delivery_Date        date9.
          Order_ID             12. 
          Order_Type           comma16.
          Product_ID           12.     
          Quantity             4.     
          Total_Retail_Price   dollar13.2
          CostPrice_Per_Unit   dollar13.2
          Discount             5.2       ;
run;

data _null_;
   set retail.shortnums;
run;

Note: Remember that when you reduce the storage length of numeric variables, you risk losing precision in their values. You can use PROC COMPARE to verify the precision of shortened numeric variables.

     proc compare base=retail.longnums
                  compare=retail.shortnums;
     run;
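
To turn the two techniques into a benchmark, you might capture resource statistics while each data set is read and then compare the stored sizes. The sketch below is one possible approach, not part of the original sample programs. It assumes that the Retail library is assigned and that both data sets already exist.

```sas
/* Sketch: FULLSTIMER writes detailed resource statistics to the log. */
options fullstimer;

data _null_;
   set retail.longnums;     /* read the default-length variables */
run;

data _null_;
   set retail.shortnums;    /* read the reduced-length variables */
run;

options nofullstimer;

/* Compare observation length and variable lengths for the two data sets. */
proc contents data=retail.longnums;
run;

proc contents data=retail.shortnums;
run;
```

Running each step several times and comparing the log statistics helps to separate real differences from normal run-to-run variation.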

General Recommendations

  • Create reduced-length numeric variables for integer values when you need to conserve data storage space.