Recently I was working with a limited set of data and used statsmodels to fit a regression model on it.
To test confidence in the model I needed to do cross-validation. The solution that immediately sprang to mind was the cross_val_score function from the scikit-learn library.
However, that function does not work directly on statsmodels objects.
Solution: wrap the statsmodels objects in sklearn base estimators and then use the wrapped model.
Here is the code for wrapping sklearn's BaseEstimator around statsmodels objects.
from sklearn.base import BaseEstimator, RegressorMixin
import statsmodels.formula.api as smf
import statsmodels.api as sm

class statsmodel(BaseEstimator, RegressorMixin):
    def __init__(self, sm_class, formula):
        # sm_class is a statsmodels formula-API model class (e.g. smf.ols);
        # formula is the formula string passed to it.
        self.sm_class = sm_class
        self.formula = formula
        self.model = None
        self.result = None

    def fit(self, data, dummy):
        # Build and fit the statsmodels model. The formula API pulls the
        # target out of `data`, so `dummy` is ignored (see the note below).
        self.model = self.sm_class(self.formula, data)
        self.result = self.model.fit()
        return self  # sklearn convention: fit returns the estimator

    def predict(self, X):
        return self.result.predict(X)
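A quick aside on why the two base classes are there: BaseEstimator supplies get_params/set_params, which cross_val_score needs in order to clone the estimator for each fold, and RegressorMixin supplies a default score method (R²). A small sanity-check sketch of my own (not from the original snippet):

# Hypothetical check that the wrapper exposes the sklearn plumbing.
wrapper = statsmodel(smf.ols, "AVGEXP ~ AGE + INCOME")
print(wrapper.get_params())   # {'formula': 'AVGEXP ~ AGE + INCOME', 'sm_class': ...}
# get_params works because __init__ stores its arguments under the same
# names, which is exactly what sklearn's clone() relies on.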
Notice the dummy argument in the fit function: sklearn's regressor API expects fit to take both X and y, but this implementation relies on the statsmodels formula API, which extracts the target from the data frame itself, so I had to add a dummy argument that is simply ignored.
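To make that concrete, here is a small sketch of calling the formula API directly (using the same ccard dataset as the example below): the formula names the target column, so no separate y is ever passed.

import statsmodels.api as sm
import statsmodels.formula.api as smf

# "AVGEXP ~ AGE + INCOME" tells statsmodels which column is the target and
# which are the predictors; everything comes out of the data frame.
ccard = sm.datasets.ccard.load_pandas()
result = smf.ols("AVGEXP ~ AGE + INCOME", data=ccard.data).fit()
print(result.params)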
Here’s a quick example of how to use it.
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Get data
ccard = sm.datasets.ccard.load_pandas()
print(ccard.data.head())

# create a model
clf = statsmodel(smf.ols, "AVGEXP ~ AGE + INCOME")

# Print cross val score on this model
print(cross_val_score(clf, ccard.data, ccard.data['AVGEXP']))

# Same thing on sklearn's linear regression model
lm = linear_model.LinearRegression()
print(cross_val_score(lm, ccard.data.iloc[:, 1:3].values, ccard.data.iloc[:, 0].values))
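If you want more control over the folds or the metric, cross_val_score also accepts cv and scoring arguments. A sketch continuing from the example above; the 5-fold split and negative-MSE scorer here are my own choices, not part of the original example:

from sklearn.model_selection import KFold, cross_val_score

# Same wrapped estimator and data as above, with an explicit splitter and scorer.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, ccard.data, ccard.data['AVGEXP'],
                         cv=cv, scoring='neg_mean_squared_error')
print(scores.mean())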
I love sklearn. A convenient, efficient, and consistent API can make things so much easier.