Fit Continuous Distribution in SAS
One of the most important questions preceding most statistical analyses is “How is my data distributed?”. In statistics, it is very important to know your data well and to be confident about its distribution. Therefore, you should familiarize yourself with the different tools available for distribution fitting. In this post, I present the most common way of fitting distributions in SAS, with an example using PROC UNIVARIATE. Remember that univariate simply means ‘one variable’.
When examining the distribution of your data, the first procedure that should come to mind is PROC UNIVARIATE. I usually assess the distribution of my data using these three tools:
- Histogram – Plotting an empirical histogram of your data and overlaying it with the best-fitting theoretical densities of the distributions you wish to assess is usually the first order of business. It gives you a quick visual overview of which distributions may fit your data well and which may not.
- QQ Plot – The Quantile-Quantile plot compares ordered variable values with quantiles of some known theoretical distribution. If the theoretical distribution fits the data well, the QQ plot will form a linear pattern of points.
- Goodness of Fit Test – The Histogram and QQ Plot are great tools to visually assess the distribution of your data. Goodness of Fit tests complement them with statistics that quantify how well a given distribution fits your data.
Using the three tools described above, let us look at an example with the Miles Per Gallon in the City variable (mpg_city) from the sashelp.cars dataset.
ods select Histogram QQplot GoodnessOfFit; /* Select only the relevant output */
proc univariate data=sashelp.cars;
   /* Histogram overlaid with three fitted densities */
   histogram mpg_city / normal
                        weibull(theta=est)   /* Default is theta = 0. */
                        lognormal(theta=est)
                        midpoints = 4 to 62 by 2
                        odstitle = "Assessing Distribution Using Proc Univariate";
   /* One QQPLOT statement per distribution */
   qqplot mpg_city / normal(mu=est sigma=est)
                     odstitle = "Normal QQ Plot.";
   qqplot mpg_city / weibull(c=est sigma=est theta=est)
                     odstitle = "Weibull QQ Plot.";
   qqplot mpg_city / lognormal(sigma=est theta=est zeta=est)
                     odstitle = "LogNormal QQ Plot.";
run;
First of all, I use the HISTOGRAM statement in the above procedure with the Normal, Weibull and Lognormal options to request a histogram of the mpg_city variable overlaid with fitted densities from the Normal, Weibull and Lognormal distributions using Maximum Likelihood estimation. For the Weibull and Lognormal distributions, I add the theta=est option to override the default of theta=0 and tell SAS to estimate the threshold parameter as well. The histogram is seen to the right. Visually, the Lognormal distribution seems to fit our data best, because it captures both the peak and the skewness of the data.
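For reference, the threshold parameter theta sets the lower bound of the support of the Weibull and Lognormal distributions. As a sketch, the three-parameter densities can be written as follows, using the same parameter names as the options in the code above (c, sigma, theta and zeta). The histogram overlay additionally scales these curves to the bar heights, which is left out here.

```latex
% Three-parameter Weibull density: shape c, scale \sigma, threshold \theta
f(x) = \frac{c}{\sigma}\left(\frac{x-\theta}{\sigma}\right)^{c-1}
       \exp\!\left[-\left(\frac{x-\theta}{\sigma}\right)^{c}\right],
       \quad x > \theta

% Three-parameter lognormal density: scale \zeta, shape \sigma, threshold \theta
f(x) = \frac{1}{\sigma\sqrt{2\pi}\,(x-\theta)}
       \exp\!\left[-\frac{\bigl(\log(x-\theta)-\zeta\bigr)^{2}}{2\sigma^{2}}\right],
       \quad x > \theta
```

With theta = 0, both reduce to the familiar two-parameter forms. Estimating theta lets the fitted curve start where the data start rather than at zero.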
After looking at the histogram, I also request QQ plots for the same three distributions. Here, the syntax does not allow me to request them in a single statement, so I use three separate QQPLOT statements. The QQ plots from PROC UNIVARIATE below support the visual evidence that the data is Lognormally distributed, since the points in the Lognormal plot follow the linear reference line most closely.
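To make concrete what the Normal QQ plot shows, here is a minimal sketch that rebuilds it by hand. It assumes Blom plotting positions, (i - 0.375)/(n + 0.25), which I believe match the default used by the QQPLOT statement. The dataset name work.qq and the variable name normal_score are placeholders of my own. PROC RANK gives tied values their mean rank, so the result may differ slightly from the PROC UNIVARIATE plot.

```sas
/* Compute normal scores from the ranks of mpg_city (Blom plotting positions). */
proc rank data=sashelp.cars normal=blom out=work.qq;  /* work.qq is a placeholder name */
   var mpg_city;
   ranks normal_score;                                /* hypothetical variable name */
run;

/* Plot the observed values against the theoretical normal quantiles. */
proc sgplot data=work.qq;
   scatter x=normal_score y=mpg_city;
   xaxis label="Normal Quantiles";
   yaxis label="MPG (City)";
run;
```

If the data were normally distributed, the points would fall roughly along a straight line. Curvature in the plot indicates skewness or heavy tails.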
Goodness Of Fit Tests
Finally, we look at the Goodness of Fit statistics for the three distributions, which are requested implicitly through the distribution options in the HISTOGRAM statement. These also suggest that, of the three distributions, the Lognormal fits the data best.
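If you want to compare the test statistics side by side rather than read them from the printed output, one option is to capture the tables with ODS OUTPUT. The following is a sketch under that assumption. The dataset names work.gof and work.parms are placeholders of my own, and I do not rely on any particular column names in the output datasets.

```sas
/* Save the goodness-of-fit tests and parameter estimates as datasets. */
ods output GoodnessOfFit=work.gof           /* placeholder dataset name */
           ParameterEstimates=work.parms;   /* placeholder dataset name */
proc univariate data=sashelp.cars;
   var mpg_city;
   histogram mpg_city / normal
                        weibull(theta=est)
                        lognormal(theta=est);
run;

/* Inspect the stacked test results for all three distributions. */
proc print data=work.gof noobs;
run;
```

Inspect the resulting dataset with PROC CONTENTS before building any further processing on it.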
Both the visual assessment and the goodness of fit tests point to the same conclusion: the Lognormal distribution is a good fit for the data.
Finally, it is worth mentioning that PROC UNIVARIATE fits continuous distributions only. You cannot use it to fit discrete probability distributions, such as the Poisson or Negative Binomial, to univariate data. I will show how to do this in a future post. For code examples of the Normal, Weibull and Lognormal distributions assessed above, and many more, check the Distribution Examples under the Examples menu.
For more information and examples of PROC UNIVARIATE consult the procedure documentation here.
You can download the entire program here.