I was recently working with a limited set of data and used statsmodels to fit a regression model to it.

To check how well the model generalizes, I needed to cross-validate it. The solution that immediately sprang to mind was the cross_val_score function from the scikit-learn library.

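However, that function cannot be applied directly to a statsmodels object. It expects an estimator that follows the scikit-learn interface (fit(X, y), predict(X), get_params), and statsmodels models do not provide that.

Solution: wrap the statsmodels model in a class built on sklearn's base estimator classes, then pass the wrapper to cross_val_score.

Here is the code for wrapping a statsmodels model in an sklearn-style estimator.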

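from sklearn.base import BaseEstimator, RegressorMixin
import statsmodels.formula.api as smf
import statsmodels.api as sm


# sklearn recommends listing mixins before BaseEstimator
class statsmodel(RegressorMixin, BaseEstimator):
    def __init__(self, sm_class, formula):
        # sm_class: a statsmodels formula constructor such as smf.ols
        # formula: the model formula, e.g. "y ~ x1 + x2"
        self.sm_class = sm_class
        self.formula = formula
        self.model = None
        self.result = None

    def fit(self, data, dummy=None):
        # The target is named in the formula, so the second argument is ignored
        self.model = self.sm_class(self.formula, data)
        self.result = self.model.fit()
        # sklearn estimators return self from fit
        return self

    def predict(self, X):
        # X must be a DataFrame containing the columns used in the formula
        return self.result.predict(X)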

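Notice the dummy argument in the fit function. The sklearn regression API calls fit with both X and y, but this implementation relies on the formula API of statsmodels, which reads the target column straight from the data frame. The y that sklearn passes in is therefore accepted and ignored, and the data frame given as X has to include the target column.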

Here’s a quick example of how to use it.

from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Get the data
ccard = sm.datasets.ccard.load_pandas()
print(ccard.data.head())

# Create a model from the wrapper
clf = statsmodel(smf.ols, "AVGEXP ~ AGE + INCOME")

# Print the cross-validation scores for this model.
# The full data frame is passed as X because the formula needs the AVGEXP column.
print(cross_val_score(clf, ccard.data, ccard.data["AVGEXP"]))

# Same thing with sklearn's linear regression model
lm = linear_model.LinearRegression()
print(cross_val_score(lm, ccard.data[["AGE", "INCOME"]].values, ccard.data["AVGEXP"].values))

I love sklearn. A convenient, efficient and consistent API makes things so much easier.
