Linear Regression in SAS

In the linear regression model, we try to explain the relationship between a dependent variable \mathbold{y} and one or more explanatory variables \mathbold{x_1}, \mathbold{x_2} and so on. The relationship between the dependent variable and the independent variables is assumed to be linear. In matrix notation, this means that the model is written as \mathbold{y} = \mathbold{\mu + X \beta + \epsilon}, where \mathbold{y} is a vector of dependent variables to be explained, \mathbold{\mu} is the overall mean of the model, \mathbold{X} is a matrix of independent explanatory variables, \mathbold{\epsilon} is a vector of residuals and \mathbold{\beta} is a vector of parameters to be estimated from the independent variables.

The usual method of estimating \mathbold{\beta} is Ordinary Least Squares, which minimizes the sum of the squared residuals. This method, along with a little calculus, leads to the following closed form expression for the estimated parameters, \mathbold{\hat{\beta}} = \left( \mathbold{X^T X} \right)^{-1} \mathbold{X^T y}. Assuming that the error terms have finite variance and are uncorrelated with the regressors, this estimator is unbiased and consistent. Further assuming that the variance is constant across the observations, the estimator is also efficient. Wikipedia provides a more thorough examination of the theory of the linear regression model, its applications and the assumptions behind it.
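As a quick sketch of where the closed form comes from (absorbing the overall mean \mathbold{\mu} into \mathbold{X} as a column of ones): write the sum of squared residuals as a function of \mathbold{\beta}, set its derivative to zero, and solve the resulting normal equations.

\begin{align*} S(\mathbold{\beta}) &= (\mathbold{y} - \mathbold{X \beta})^T (\mathbold{y} - \mathbold{X \beta}) \\ \frac{\partial S}{\partial \mathbold{\beta}} &= -2 \mathbold{X}^T (\mathbold{y} - \mathbold{X \beta}) = \mathbold{0} \\ \mathbold{X}^T \mathbold{X} \hat{\mathbold{\beta}} &= \mathbold{X}^T \mathbold{y} \\ \hat{\mathbold{\beta}} &= \left( \mathbold{X}^T \mathbold{X} \right)^{-1} \mathbold{X}^T \mathbold{y} \end{align*}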

Fitting a linear regression model in SAS

The simplest way to fit linear regression models in SAS is using one of the procedures that support OLS estimation. The first procedure you should consult is PROC REG. A simple example is shown below.

proc reg data = sashelp.class;
   model weight = height;
run;quit;

In the MODEL statement, we list the dependent variable on the left side of the equal sign and the explanatory variables on the right side. This means that the model looks like this

(1)   \begin{equation*} \text{Weight} = \mu + \beta \text{Height} + \epsilon \end{equation*}

The REG Procedure produces a lot of output, and it is important to go through it in the right order (which is NOT the order in which the output is presented). First, you should look at the ‘Fit Diagnostics’ plots. Based on the histogram and QQ plot of the residuals, do the residuals look approximately normal? If they do, you can proceed to the ‘Analysis of Variance’ and ‘Parameter Estimates’ tables. Here, you can see that we get a very small p-value for the overall model, which means that if the explanatory variables actually had no effect, data like ours would be very unlikely to arise purely by chance. This indicates a good overall model fit. Finally, you should have a look at the parameter estimates and the associated t-tests and p-values. In this case, both the Intercept and the parameter for Height are highly significant.
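If you want to limit the output to the plots you need, or keep the tables for further processing, you can combine the PLOTS option with the ODS OUTPUT statement. A small sketch is below; the dataset names pe and aov are arbitrary placeholders.

proc reg data = sashelp.class plots(only) = (diagnostics residuals);
   ods output ParameterEstimates = pe ANOVA = aov; /* save the two tables as datasets */
   model weight = height;
run;quit;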

The OLS parameter estimates provided by PROC REG imply that the best fitting linear regression model given the specified variables is

(2)   \begin{equation*} \text{Weight} = -143.03 + 3.9 * \text{Height} \end{equation*}

which means that a one unit increase in Height implies a 3.9 unit increase in Weight. If an intercept does not make sense in your model, you can suppress it using the NOINT option in the MODEL statement, as shown below.
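A minimal sketch: the NOINT option goes after a slash in the MODEL statement. Note that without an intercept, R-square is computed from the uncorrected total sum of squares and is not comparable to the R-square of a model with an intercept.

proc reg data = sashelp.class;
   model weight = height / noint; /* suppress the intercept */
run;quit;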

Using PROC GLM

The linear regression model is a special case of the general linear model, where the dependent variable is a continuous, normally distributed variable and no class variables appear among the independent variables. Therefore, another common way to fit a linear regression model in SAS is using PROC GLM.

proc glm data = sashelp.class;
   model weight = height;
run;quit;
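Unlike PROC REG, PROC GLM also handles class variables among the independent variables. As an illustrative sketch, the model below adds the Sex variable from SASHELP.CLASS as a classification effect:

proc glm data = sashelp.class;
   class sex;                  /* Sex enters the model as a classification variable */
   model weight = height sex;
run;quit;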

For more material and examples of model fitting using the above procedures, consult the SAS documentation for PROC REG and PROC GLM. Both procedures assume normally distributed errors, so you should familiarize yourself with the Normal Distribution.
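One way to assess the normality assumption is to save the residuals with an OUTPUT statement and examine them in PROC UNIVARIATE. A minimal sketch, where the names reg_out and resid are arbitrary:

proc reg data = sashelp.class;
   model weight = height;
   output out = reg_out r = resid;     /* save residuals in dataset reg_out */
run;quit;

proc univariate data = reg_out normal; /* NORMAL requests tests of normality */
   var resid;
   histogram resid / normal;           /* histogram with fitted normal curve */
run;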

Linear Regression in IML

The two procedures used in the sections above both produce a lot of output and give a lot of information with very little code. However, it can be a bit confusing how all these quantities are actually calculated. I have written an IML program below that calculates all the quantities from the ‘Analysis of Variance’ and ‘Parameter Estimates’ tables shown previously. Admittedly, calling one of the above procedures with three lines of code is much simpler than working through IML, but it gives a nice overview of the calculations performed in linear regression.

proc iml;
use sashelp.class;                         /* Open dataset for reading                       */
   read all var {'weight'} into y;         /* Read dependent variable into vector y */
   read all var {'height'} into X[c=names];/* Read independent variable(s) into matrix X     */
close sashelp.class;                       /* Close dataset for reading                      */
 
df_model = ncol(X);                        /* Model degrees of freedom                       */
X = j(nrow(X),1,1) || X;                   /* Add intercept column of ones to X              */
df_error = nrow(X) - ncol(X);              /* Error degrees of freedom                       */  
 
beta_hat = inv(t(X)*X) * t(X)*y;           /* Solve normal equations for parameter estimates */             
y_hat = X*beta_hat;                        /* Predicted values                               */
res = y - y_hat;                           /* Residuals                                      */
 
SSM = sum((y_hat - mean(y))##2);           /* Model Sum of Squares                           */
SSE = sum(res##2);                         /* Error Sum of Squares                           */
MSM = SSM / df_model;                      /* Model Mean Square                              */  
MSE = SSE / df_error;                      /* Error Mean Square                              */
R_square = SSM / (SSM + SSE);              /* R^2                                            */
 
F = MSM / MSE;                             /* F test statistic for overall model             */
p_F = 1 - CDF('F',F,df_model,df_error);    /* p-value for the F statistic                    */
 
std_err = sqrt(MSE*vecdiag(inv(t(X)*X)));  /* Standard Errors of estimated parameters        */
t = beta_hat / std_err;                    /* t test statistic for estimated parameters      */
p_t = 2 * (1-cdf('t',abs(t),df_error));    /* p-values for the t statistics                  */
 
print ('Intercept' // t(names))[l='Parameters']
      beta_hat[f=best10.2 l='Estimate']
      std_err[f=best10.2 l='Std. Error']
      t[f=best5. l='t Value']              
      p_t[f=pvalue6.4 l='p Value'];        /* Print beta values, t-stats and p-values        */
 
print R_square[f=best10.2 l='R^2'];
 
print ({'Model', 'Error', 'Corrected Total'})[l='Source']
      (df_model // df_error // df_model+df_error)[f=best10. l='DF']
      (SSM // SSE // SSM+SSE)[f=best10. l='Sums of Squares']
      (MSM // MSE)[f=best10. l='Mean Square']
       F[f=best5. l='F Value']
       p_F[f = pvalue6.4 l='p Value'];     /* Print sums of squares, F test and p-value      */          
quit;
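As a closing remark, forming the explicit inverse of \mathbold{X^T X} as above can be numerically fragile when the regressors are nearly collinear. A sketch of an alternative is to solve the normal equations directly with IML's SOLVE function, replacing the beta_hat line above:

beta_hat = solve(t(X)*X, t(X)*y);          /* Solve normal equations without explicit inverse */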

You can download the entire program here