This post completes a trio with my two most recent posts on the same subject for R and Excel, this time in Python. The dataset comes from the sklearn package; the summary table itself is in the format printed by the statsmodels package, since sklearn's own linear regression does not print a comparable summary.
The example given in the official documentation (https://goo.gl/GGmUKK) is somewhat sketchy and uses a different dataset than the previous two posts. That difference does not matter here, because my main purpose is to illustrate the default output.
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 29 Aug 2018   Prob (F-statistic):           3.83e-62
Time:                        08:39:07   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
Source: User JARH on Stack Overflow (https://goo.gl/vZ3UWP), using the same dataset as the official documentation.
The output is divided into three sections.
- The first section, OLS REGRESSION RESULTS, lacks the “call” identifying the dependent and independent variables. It reports less about the residuals than R does and adds several fields of its own. The date/time stamp is a gift during the exploratory phase, before the analyst has started more formal procedures.
  Four additional substantive fields are added:
  - Covariance Type records how the standard errors were computed. “nonrobust” means they rest on the classical assumptions (notably constant error variance) rather than on a robust estimator, so it is a useful reminder to check whether the requirements for a reliable linear regression are actually satisfied.
  - Log-Likelihood is the maximized log-likelihood of the fitted model, the raw ingredient from which AIC and BIC are calculated.
  - AIC (the Akaike information criterion) is a “score” for assessing the quality of competing models; lower is better.
  - BIC (the Bayesian information criterion) performs a similar role, with a penalty for extra parameters that grows with sample size.
- The second section, below the “===============” line, is similar to R and Excel in providing coefficients and standard errors. Like Excel, it provides a confidence interval (in the standard 2.5/97.5 percentile form for the 95% confidence interval), which R requires a separate step to produce. Unlike the other two programs, it omits the convenient asterisk coding of significance levels.
- The third section, below the next “================” line, has information that R and Excel either omit or present in a different form.
  - The Omnibus test checks whether the residuals are normally distributed. It is not the F statistic from analysis of variance (ANOVA); that statistic, included in Excel and separately available in R, appears in the first section as F-statistic.
  - Skew and kurtosis describe the asymmetry and tail weight of the residual distribution.
  - Durbin-Watson tests for autocorrelation in the residuals; values near 2 indicate little of it.
  - Jarque-Bera tests whether the residuals are normally distributed, based on their skew and kurtosis.
  - The condition number is a measure of collinearity among the independent variables, which can have the effect of overemphasizing redundant variables.
  The Durbin-Watson, Jarque-Bera, and condition number diagnostics are each separately available in R; I haven’t checked in Excel.
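Several of the numbers in the table can be recomputed from one another, which helps clarify what they mean. The sketch below uses only the rounded values printed in the summary and the textbook formulas for AIC, BIC, adjusted R-squared, the t statistic, and the Jarque-Bera statistic. The variable names are my own, and small differences in the last digit would come from rounding.

```python
import math

# Values copied from the summary table above (already rounded).
n = 442              # No. Observations
k = 10               # Df Model (predictors, excluding the constant)
log_lik = -2386.0    # Log-Likelihood
r2 = 0.518           # R-squared
skew, kurt = 0.017, 2.726

params = k + 1       # estimated coefficients, including the constant

# Information criteria: lower values indicate the preferred model.
aic = 2 * params - 2 * log_lik
bic = params * math.log(n) - 2 * log_lik

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The t statistic for x3 is simply coefficient / standard error.
t_x3 = 519.8398 / 66.534

# Jarque-Bera combines skew and excess kurtosis of the residuals.
# Under normality it is chi-squared with 2 degrees of freedom,
# whose upper-tail probability is exp(-JB / 2).
jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
prob_jb = math.exp(-jb / 2)

print(f"AIC      {aic:.0f}")
print(f"BIC      {bic:.0f}")
print(f"Adj. R2  {adj_r2:.3f}")
print(f"t (x3)   {t_x3:.3f}")
print(f"JB       {jb:.3f}")
print(f"Prob(JB) {prob_jb:.3f}")
```

The recomputed values should agree with the AIC, BIC, Adj. R-squared, t, Jarque-Bera, and Prob(JB) entries in the table to the precision printed there.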
As is sometimes said of economists, you can lay them end-to-end, and they would still point in different directions. The open-source community behind Python's libraries makes different choices from those of R's developers and from Microsoft's proprietary Excel team. Which default is better is a matter of personal taste, so long as each tool offers some way to produce separately whatever its default linear regression output leaves out.
If you don’t regularly use all three, I hope these guides have been useful.