Fit Continuous Distribution in SAS

One of the most important questions preceding most statistical analyses is “How is my data distributed?”. In statistics, it is important to know your data well and to be confident about the distribution of your data. Therefore, you should familiarize yourself with the different tools available for distribution fitting. In this post, I present the most common way of fitting distributions in SAS, with an example using PROC UNIVARIATE. Remember that univariate simply means ‘one variable’.

PROC UNIVARIATE

When we examine the distribution of continuous univariate data, the first procedure that should come to mind is PROC UNIVARIATE. I usually assess the distribution of my data using these three tools:

  • Histogram – Plotting an empirical histogram of your data and overlaying it with the best-fitting theoretical densities of the distributions you wish to assess is usually the first order of business. It gives you a visual overview of which distributions may fit your data well and which may not.
  • QQ Plot – The Quantile-Quantile plot compares the ordered values of the variable with the quantiles of a specified theoretical distribution. If the theoretical distribution fits the data well, the points in the QQ plot form a roughly linear pattern.
  • Goodness of Fit Test – The Histogram and QQ Plot are great tools for assessing the distribution of your data visually. Goodness of Fit tests supplement them with statistics that quantify how well a given distribution fits your data.

An Example

Using the three tools described above, let us look at an example with the Miles Per Gallon in the City variable (mpg_city). The dataset used is the sashelp.cars dataset.

ods select Histogram QQPlot GoodnessOfFit;               /* Select only the relevant output  */
proc univariate data=sashelp.cars;
   /* Histogram overlaid with fitted Normal, Weibull and Lognormal densities */
   histogram mpg_city / normal
                        weibull(theta=est)               /* Default is theta = 0.            */
                        lognormal(theta=est)
                        midpoints = 4 to 62 by 2
                        odstitle = "Assessing Distribution Using Proc Univariate";
   /* One QQPLOT statement per distribution */
   qqplot mpg_city / normal(mu=est sigma=est) odstitle = "Normal QQ Plot.";
   qqplot mpg_city / weibull(c=est sigma=est theta=est) odstitle = "Weibull QQ Plot.";
   qqplot mpg_city / lognormal(sigma=est theta=est zeta=est) odstitle = "LogNormal QQ Plot.";
run;

Histogram overlaid with densities of fitted distributions

Histogram

First of all, I use the HISTOGRAM statement in the above procedure with the Normal, Weibull and Lognormal options. This requests a histogram of the mpg_city variable overlaid with the fitted densities of the Normal, Weibull and Lognormal distributions, with parameters estimated by Maximum Likelihood. For the Weibull and Lognormal distributions, I use the theta=est option to override the default of theta=0 and tell SAS to estimate the threshold parameter θ as well. In the resulting histogram, the Lognormal distribution appears to fit our data best, because it captures both the peak and the skewness of the data.
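To see the actual maximum likelihood estimates behind the fitted curves, the parameter estimates table produced by the HISTOGRAM statement can be displayed on its own and saved to a dataset. The following is a minimal sketch rather than part of the original program; the dataset name work.param_est is a hypothetical name chosen for illustration.

```sas
/* Sketch: show only the parameter estimates and keep a copy as a dataset. */
/* work.param_est is an illustrative name, not part of the original post.  */
ods select ParameterEstimates;                    /* display only the estimates tables */
ods output ParameterEstimates = work.param_est;   /* save the same tables to a dataset */
proc univariate data=sashelp.cars;
   histogram mpg_city / normal
                        weibull(theta=est)        /* estimate the threshold as well    */
                        lognormal(theta=est);
run;

/* Inspect the saved estimates for all three fitted distributions */
proc print data=work.param_est noobs;
run;
```

Having the estimates in a dataset makes it easier to reuse them later, for example when simulating from the fitted distribution.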

QQ Plots

After looking at the histogram, I also request QQ plots for the same three distributions as specified in the histogram. Here, though, the syntax does not allow me to request them in a single statement, so I have to use three distinct QQPLOT statements. The QQ plots from PROC UNIVARIATE below support the visual evidence that the data is Lognormally distributed, since the points in the Lognormal plot follow the linear reference line most closely.

QQ Plot for Normal Distribution

QQ Plot for Weibull Distribution

QQ Plot for Lognormal Distribution
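As a cross-check on the Lognormal plot: if a variable follows a Lognormal distribution with threshold zero, its natural logarithm follows a Normal distribution. The sketch below rests on that assumption of theta = 0, unlike the theta=est fit in the example, so it only checks the two-parameter Lognormal. The names work.cars_log and log_mpg_city are hypothetical and chosen for illustration.

```sas
/* Sketch: log-transform the variable and check normality of the result. */
/* Assumes a threshold of zero; work.cars_log and log_mpg_city are       */
/* illustrative names, not part of the original program.                 */
data work.cars_log;
   set sashelp.cars;
   log_mpg_city = log(mpg_city);   /* natural log; requires mpg_city > 0 */
run;

ods select QQPlot;                 /* show only the QQ plot              */
proc univariate data=work.cars_log;
   qqplot log_mpg_city / normal(mu=est sigma=est)
                         odstitle = "Normal QQ Plot of log(mpg_city)";
run;
```

If the points in this plot track the reference line about as well as in the Lognormal QQ plot, the estimated threshold adds little to the fit.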

Goodness Of Fit Tests

Finally, we look at the Goodness of Fit statistics for the three distributions, which the HISTOGRAM statement produces automatically for each fitted distribution. These also suggest that the Lognormal distribution fits the data best of the three.
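To compare the tests side by side, the GoodnessOfFit tables can be written to a single dataset through ODS OUTPUT. This is a minimal sketch under the same fits as in the example; the dataset name work.gof is a hypothetical name chosen for illustration.

```sas
/* Sketch: collect the goodness-of-fit tables for all fitted distributions. */
/* work.gof is an illustrative name, not part of the original program.      */
ods select GoodnessOfFit;                  /* display only the test tables  */
ods output GoodnessOfFit = work.gof;       /* stack them in one dataset     */
proc univariate data=sashelp.cars;
   histogram mpg_city / normal
                        weibull(theta=est)
                        lognormal(theta=est);
run;

/* One row per distribution and test */
proc print data=work.gof noobs;
run;
```

For the tests based on the empirical distribution function, a smaller test statistic means the fitted distribution function lies closer to the empirical one, so comparing the statistics across the three distributions gives a rough ranking of the fits.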

Summary

PROC UNIVARIATE fits continuous distributions only. You cannot use it to fit discrete probability distributions, such as the Poisson or Negative Binomial, to univariate data. To see an example of how to fit discrete data, see the post Fit Poisson And Negative Binomial Distribution In SAS. For code examples of the Normal, Weibull and Lognormal distributions assessed in the PROC UNIVARIATE example above, and many more, check the Distribution Examples under the Examples menu.

For more information and examples of PROC UNIVARIATE, consult the procedure documentation here.

You can download the entire program here.