Mosaic Plot in Python

I am sure every one of us has seen charts like this.

In any management training you attend, a version of this chart is bound to sneak into the presentation, often in the lecture notes or a hands-on activity.

These charts are a good representation of categorical data. A mosaic plot allows visualizing multivariate categorical data in a rigorous and informative way. For background, read A Brief History of the Mosaic Display [PDF].

Here’s a quick example of how to draw a mosaic plot in Python:

from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import pandas

gender = ['male', 'male', 'male', 'female', 'female', 'female']
pet = ['cat', 'dog', 'dog', 'cat', 'dog', 'cat']
data = pandas.DataFrame({'gender': gender, 'pet': pet})
mosaic(data, ['pet', 'gender'])
plt.show()
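
In a mosaic plot the area of each tile is proportional to the number of rows in that category combination, so it can help to look at the counts themselves. A minimal sketch, reusing the same toy data and using pandas.crosstab (my addition, not part of the plotting code above):

```python
import pandas

gender = ['male', 'male', 'male', 'female', 'female', 'female']
pet = ['cat', 'dog', 'dog', 'cat', 'dog', 'cat']
data = pandas.DataFrame({'gender': gender, 'pet': pet})

# Contingency table: one cell per (gender, pet) combination.
# These counts are what the tile areas in the mosaic reflect.
counts = pandas.crosstab(data['gender'], data['pet'])
print(counts)
```

Each cell of the table corresponds to one tile of the plot.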

 

Here’s another example from the tips dataset.


from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')
mosaic(tips, ['sex','smoker','time'])
plt.show()

 


Timeline in Python with Matplotlib

I was documenting a series of events and, after the list was complete, thought of showing them on a timeline.

After an hour or two of fiddling with Python standard libraries, this is what I had.

Format of the data


import matplotlib.pyplot as plt
import pandas as pd
from theflile import GenerateTimeLine

# parse the first column as dates and use it as the index
data = pd.read_csv(r'events.txt', parse_dates=True, index_col=0)
ax = GenerateTimeLine(data)
plt.show()

The code is, as always, available on GitHub.

Detect Outliers with Matplotlib

Problem: You have a large multivariate dataset and want to get a list of outliers.

Outlier detection is a significant statistical process with a lot of theory underpinning it, but there is a simple, quick way to do it: the interquartile range (IQR) rule.

Read the linked PDF for a simple example summary

Solution:

boxplot_stats from matplotlib.cbook

It returns a list of dictionaries of statistics, one dictionary per data column.

Here is a quick example:

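from matplotlib.cbook import boxplot_stats
# data is assumed to be a pandas DataFrame with a numeric AMT column
st = boxplot_stats(data.AMT)
# "fliers" holds the points that fall beyond the whiskers
outliers = st[0]["fliers"]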

I like this, because it is quick and does not need any external libraries apart from matplotlib.
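
For reference, the IQR rule itself is short enough to write by hand. The sketch below uses only NumPy and made-up sample values; the 1.5 multiplier is the usual convention and, as far as I know, matches the default whis=1.5 of boxplot_stats. Treat it as an illustration of the rule rather than a copy of matplotlib's implementation.

```python
import numpy as np

# Made-up sample values; 95 is the planted outlier
amounts = np.array([10, 12, 11, 13, 12, 11, 95, 12, 10, 13], dtype=float)

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Anything further than 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers)
```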

Q in Statsmodels

If you are coming from R, you will love the formula API of statsmodels, which works in a similar way.

I love it and have been using it since last year.

It’s good for testing quick regression and GLM models, and it works on pandas dataframes which, at least in my case, are the basic unit where data is captured.

Most real-life data is like this.

It is created to be read by humans, so hyphens and spaces in column names are common. However, this is a major issue if you need to use those names in formulae.

My old approach was to rename the columns so they had no special characters or gaps. This worked, but it was messy.

data.rename(columns={'Max crack size': 'maxcracksize'},inplace=True)

Recently I discovered Q, a helper function from the patsy library, which the formula API of statsmodels uses.

Just wrap the names in your formula with Q if they contain gaps or other special characters.

Here is an example usage with Q

import statsmodels.formula.api as smf
import pandas as pd
# fname is the path to your CSV file
data = pd.read_csv(fname, index_col=0, usecols=[0, 1, 2])
model = smf.ols('Q("Max crack size") ~ CSN', data=data).fit()

I would say this is very convenient. No need for unnecessary and complex pre-processing code.

NumPy Arrays to NumPy Record Arrays

I was recently working with the hdf5 format, and datasets in that format require the arrays to be records.

What does that mean?

NumPy provides powerful capabilities to create arrays of structured datatype. These arrays permit one to manipulate the data by named fields.

One defines a structured array through the dtype object. Creating one from a list is simple and takes the form below.

Example:

import numpy as np

x = np.array([(1, 2., 'Hello'), (2, 3., 'World')],
             dtype=[('foo', 'i4'), ('bar', 'f4'), ('baz', 'S10')])
print(x['foo'])  # [1 2]
print(x['baz'])  # [b'Hello' b'World']

But the problem with my code was that I already had arrays. So how do you convert NumPy arrays to NumPy record arrays?

numpy.core.records.fromarrays to the rescue (it is also reachable as np.rec.fromarrays).

Example:

data = np.random.randn(15).reshape(5,3)
# np.rec.fromarrays is the public alias of np.core.records.fromarrays
rec_data = np.rec.fromarrays(data.T, names=['a','b','c'])

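Notice the transpose. It is required because fromarrays expects one array per field, so each column of data has to be handed over as its own array.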
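
A quick way to convince yourself that the fields line up with the original columns is to use deterministic data instead of random numbers. This is only a sketch with made-up values; np.rec.fromarrays is used here as the public spelling of the same function.

```python
import numpy as np

# 5 rows, 3 columns: column 0 is 0,3,6,..., column 1 is 1,4,7,...
data = np.arange(15).reshape(5, 3)

# Transpose so that each column becomes one input array (one field)
rec_data = np.rec.fromarrays(data.T, names=['a', 'b', 'c'])

print(rec_data.dtype.names)  # field names
print(rec_data['a'])         # first column, accessed by key
print(rec_data.b)            # second column, accessed as an attribute
```

Record arrays allow both the dictionary-style and the attribute-style access shown in the last two lines.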

Cross Validation Score with Statsmodels

I was recently working with a limited set of data and used statsmodels to fit a regression model to it.

To gauge confidence in the model, I needed to do cross-validation. The solution that immediately sprang to mind was the cross_val_score function from the scikit-learn library.

However, that function cannot be applied directly to statsmodels objects.

Solution: wrap the statsmodels object in a sklearn base estimator and then use the wrapped model.

Here is the code for wrapping a statsmodels object in a sklearn BaseEstimator.

from sklearn.base import BaseEstimator, RegressorMixin
import statsmodels.formula.api as smf
import statsmodels.api as sm

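# sklearn recommends listing mixins before BaseEstimator
class statsmodel(RegressorMixin, BaseEstimator):
    def __init__(self, sm_class, formula):
        self.sm_class = sm_class
        self.formula = formula
        self.model = None
        self.result = None

    def fit(self, data, dummy):
        # dummy stands in for sklearn's y; the formula picks the target from data
        self.model = self.sm_class(self.formula, data)
        self.result = self.model.fit()
        return self  # sklearn convention: fit returns the estimator

    def predict(self, X):
        return self.result.predict(X)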

Notice the dummy argument in the fit function. The sklearn regression API expects both X and y, but this implementation relies on the formula API of statsmodels, which reads the target from the data itself, so I had to add a dummy variable to take the place of y.

Here’s a quick example on how to use it.


from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Get data
ccard = sm.datasets.ccard.load_pandas()
print(ccard.data.head())

# create a model
clf = statsmodel(smf.ols, "AVGEXP ~ AGE + INCOME")

# Print cross val score on this model
print(cross_val_score(clf, ccard.data, ccard.data['AVGEXP']))

# Same thing on sklearn's linear regression model
lm = linear_model.LinearRegression()

print(cross_val_score(lm, ccard.data.iloc[:, 1:3].values, ccard.data.iloc[:, 0].values))

I love sklearn. A convenient, efficient and consistent API makes things so much easier.

Parsing date – a simple example

Problem: Let’s say we have data like the image below. It can be anything, from IoT sensor data to something like this expense log.

If the data has day, month and year as separate columns, as shown below, how do you read this kind of log with pandas and create a date index?

Solution:

import pandas as pd
data = pd.read_csv('EXP2018.csv',parse_dates={'date':['YY','MM','DD']}, index_col='date')

Yes, you can use pd.to_datetime after the read for this and other non-standard parsing, but this one is elegant.
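
For completeness, here is a sketch of the pd.to_datetime route. As far as I know, recent pandas releases (2.2 onwards) warn that combining columns through parse_dates is deprecated, so this version may age better. The inline CSV and the amount column are made up for illustration, and I am assuming YY holds four-digit years.

```python
import io
import pandas as pd

# Made-up stand-in for EXP2018.csv
csv_text = """YY,MM,DD,amount
2018,1,5,12.50
2018,1,6,3.20
2018,2,14,40.00
"""
data = pd.read_csv(io.StringIO(csv_text))

# pd.to_datetime can assemble dates from columns named year, month and day
parts = data[['YY', 'MM', 'DD']].rename(
    columns={'YY': 'year', 'MM': 'month', 'DD': 'day'})
data.index = pd.to_datetime(parts)
data.index.name = 'date'

# The separate date columns are no longer needed
data = data.drop(columns=['YY', 'MM', 'DD'])
print(data)
```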

Date parsing in pandas is super useful. Here are some other read_csv options to consider for parsing dates, quoted from the pandas documentation.

parse_dates : boolean or list of ints or names or list of lists or dict, default False

* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv``

Note: A fast-path exists for iso8601-formatted dates.

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

keep_date_col : boolean, default False

If True and parse_dates specifies combining multiple columns then keep the original columns.

date_parser : function, default None

Function to use for converting a sequence of string columns to an array of datetime instances. The default uses ``dateutil.parser.parser`` to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst : boolean, default False

DD/MM format dates, international and European format