Mosaic Plot in Python

I am sure every one of us has seen charts like this.

In any management training you attend, a version of this chart is bound to sneak into the presentation, often in the lecture notes or a hands-on activity.

These charts are a good way to represent categorical data. A mosaic plot lets you visualize multivariate categorical data in a rigorous and informative way. For background, read A Brief History of the Mosaic Display [PDF].

Here’s a quick example of how to draw a mosaic plot in Python:

from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import pandas

gender = ['male', 'male', 'male', 'female', 'female', 'female']
pet = ['cat', 'dog', 'dog', 'cat', 'dog', 'cat']
data = pandas.DataFrame({'gender': gender, 'pet': pet})
mosaic(data, ['pet', 'gender'])
plt.show()
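
To see what the plot encodes, it can help to look at the counts behind it. The sketch below is an addition that uses only pandas: it cross-tabulates the same toy data, and each cell of the resulting table corresponds to one tile of the mosaic, with the tile area roughly proportional to the cell's share of all rows (the plot adds small gaps between tiles).

```python
import pandas as pd

# Same toy data as in the example above
gender = ['male', 'male', 'male', 'female', 'female', 'female']
pet = ['cat', 'dog', 'dog', 'cat', 'dog', 'cat']
data = pd.DataFrame({'gender': gender, 'pet': pet})

# Contingency table: rows are pets, columns are genders
counts = pd.crosstab(data['pet'], data['gender'])
print(counts)

# Share of all observations in each cell (roughly the tile areas)
shares = counts / counts.to_numpy().sum()
print(shares)
```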

 

Here’s another example from the tips dataset.


from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')
mosaic(tips, ['sex','smoker','time'])
plt.show()

 


Timeline in Python with Matplotlib

I was documenting a series of events and, once the list was complete, thought of showing them on a timeline.

After an hour or two of fiddling with Python standard libraries, this is what I had.

Format of the data


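import matplotlib.pyplot as plt
import pandas as pd
from theflile import GenerateTimeLine  # helper from the code linked below

# parse_dates=True with index_col=0 parses the first column as a date index
data = pd.read_csv(r'events.txt', parse_dates=True, index_col=0)
ax = GenerateTimeLine(data)
plt.show()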

The code, as always, is available on GitHub.
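
For readers who just want the general idea, here is a minimal sketch of a timeline drawn with plain matplotlib. It is not the GenerateTimeLine function above. The function name draw_timeline and the sample events are made up for illustration, and it assumes the events come as (date, label) pairs.

```python
from datetime import date

import matplotlib.pyplot as plt


def draw_timeline(events, ax=None):
    """Draw (date, label) pairs as markers on a horizontal line.

    Hypothetical helper for illustration only.
    """
    events = sorted(events)
    dates = [d for d, _ in events]
    # Alternate labels above and below the line so they do not collide
    levels = [1 if i % 2 == 0 else -1 for i in range(len(events))]

    if ax is None:
        _, ax = plt.subplots(figsize=(8, 2.5))

    ax.vlines(dates, 0, levels, color="tab:blue")            # stems
    ax.plot(dates, [0] * len(dates), "o", color="tab:blue")  # markers
    ax.axhline(0, color="black", linewidth=1)                # baseline

    for (d, label), level in zip(events, levels):
        ax.annotate(label, xy=(d, level), ha="center",
                    va="bottom" if level > 0 else "top")

    ax.yaxis.set_visible(False)
    ax.set_ylim(-2, 2)
    return ax


# Made-up sample events
events = [
    (date(2018, 1, 5), "Kick-off"),
    (date(2018, 3, 20), "First draft"),
    (date(2018, 6, 1), "Release"),
]
ax = draw_timeline(events)
```

Call plt.show() afterwards to display the figure.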

Kill Tasks in Windows

Linux and Mac users have it easy: top and kill are two commands that help you take control of the system.

As a Windows user, until recently I was stuck with the Task Manager. Like all managers, this one demands too much attention and cannot be scripted in a batch file.

So I summoned the Google genie and discovered

the tasklist and taskkill commands in CMD.

Neat. Where were these commands hiding?

Here are two GIFs showing how they work.
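
Since the appeal of these commands is that they can be scripted, here is a small Python sketch that builds the command lines and runs them through subprocess. The helper names are made up for illustration. The switches used (/FO, /NH and /FI for tasklist; /IM, /PID and /F for taskkill) are standard options of the two tools, and running them only works on Windows.

```python
import subprocess


def tasklist_command(image_name=None):
    """Build a tasklist call that prints CSV rows without a header."""
    cmd = ["tasklist", "/FO", "CSV", "/NH"]
    if image_name:
        # Filter by executable name, e.g. notepad.exe
        cmd += ["/FI", f"IMAGENAME eq {image_name}"]
    return cmd


def taskkill_command(image_name=None, pid=None, force=False):
    """Build a taskkill call for either an image name or a process id."""
    if (image_name is None) == (pid is None):
        raise ValueError("give exactly one of image_name or pid")
    cmd = ["taskkill"]
    if image_name is not None:
        cmd += ["/IM", image_name]
    else:
        cmd += ["/PID", str(pid)]
    if force:
        cmd.append("/F")  # terminate forcefully
    return cmd


def run(cmd):
    """Run a command and return its text output (Windows only)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

On a Windows machine, run(tasklist_command("notepad.exe")) lists the matching processes and run(taskkill_command(image_name="notepad.exe", force=True)) ends them.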

The Most Audacious Flying Machine Ever

It may only be a matter of weeks before Stratolaunch, the world’s biggest plane, with a wingspan longer than a football field, takes to the air for the first time. The aircraft was unveiled by Paul Allen, the Microsoft co-founder, in June 2017. The aircraft could eventually be used to transport rockets carrying satellites and people into the Earth’s upper atmosphere, where they will blast off into space. Allen recently said of Stratolaunch: “When you see that giant plane, it’s a little nutty. And you don’t build it unless you’re very serious, not only about wanting to see the plane fly but to see it fulfil its purpose. Which is getting vehicles in orbit.”

Via http://www.dailymail.co.uk/sciencetech/article-6080013/The-worlds-biggest-plane-inches-closer-takeoff.html

Homepage for the project https://www.stratolaunch.com/

If you are interested in the story and motivation behind the project, read this excellent post.

Detect Outliers with Matplotlib

Problem: You have a large multivariate dataset and want a list of its outliers.

Outlier detection is a significant statistical topic with a lot of theory underpinning it, but a simple, quick way to do it is the interquartile range (IQR) rule.

Read the linked PDF for a simple summary with an example.

Solution:

boxplot_stats from matplotlib.cbook

It returns a list of dictionaries of statistics, one dictionary per data set.

Here is a quick example

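from matplotlib.cbook import boxplot_stats

# data is assumed to be a pandas DataFrame with a numeric column AMT
st = boxplot_stats(data.AMT)
# "fliers" holds the points that fall outside the whiskers
outliers = st[0]["fliers"]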

I like this because it is quick and does not need any external libraries apart from matplotlib.
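
To make the rule concrete, here is a self-contained sketch on made-up numbers. It runs boxplot_stats with its default whisker factor of 1.5 and then repeats the IQR rule by hand with NumPy: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR counts as an outlier. The sample values are invented for illustration.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Made-up sample with two obvious outliers (-40 and 95)
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 95, 11, 13, -40])

# One dictionary per data set; the default whisker factor is 1.5
stats = boxplot_stats(values)[0]
outliers = sorted(stats["fliers"].tolist())

# The same rule by hand
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
manual = sorted(values[(values < low) | (values > high)].tolist())

print(outliers)
print(manual)
```

Both lists should contain the same points, which is a quick way to convince yourself of what "fliers" means.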

Train your model of the world

I recently read this quote in a random blog post that I stumbled upon while surfing. Sorry, I forgot to save the link.

Loved it. Most of my working hours nowadays are saturated with words like model and training, so this quote struck a nerve.

Reading and experience train your model of the world. And even if you forget the experience or what you read, its effect on your model of the world persists. Your mind is like a compiled program you’ve lost the source of. It works, but you don’t know why. – Paul Graham

Q in Statsmodels

If you are coming from R, you will love the formula API of statsmodels, which works in a similar way.

I love it and have been using it since last year.

It’s good for quickly testing regression and GLM models, and it works on pandas DataFrames which, at least in my case, are the basic unit in which data is captured.

Most real-life data looks like this.

It is created to be read by humans, so hyphens and spaces in column names are common. However, this is a major issue if you need to use those names in a formula.

My old approach was to rename the columns so that they had no special characters or spaces. This worked, but it was messy.

data.rename(columns={'Max crack size': 'maxcracksize'},inplace=True)

Recently I discovered Q, a helper function from the patsy library, which the formula API of statsmodels uses.

Just wrap the column names in your formula with Q if they contain spaces or other special characters.

Here is an example usage with Q

import statsmodels.formula.api as smf
import pandas as pd
data = pd.read_csv(fname, index_col=0, usecols=[0,1,2])
model = smf.ols('Q("Max crack size") ~ CSN', data=data).fit()

I would say this is very convenient. No need for unnecessary and complex pre-processing code.
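
If you have many awkward column names, writing the Q(...) wrappers by hand gets tedious. The sketch below is plain Python with no statsmodels import: it builds a formula string from column names and wraps only those that are not valid Python identifiers. The helper names and the column "Load-cycles" are made up for illustration, and it assumes that column names contain no double quotes.

```python
import keyword


def quote_term(name):
    """Wrap a column name in Q("...") unless it is a plain identifier."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    return 'Q("{}")'.format(name)


def build_formula(response, predictors):
    """Build a formula string such as 'y ~ x1 + x2'."""
    right_side = " + ".join(quote_term(p) for p in predictors)
    return "{} ~ {}".format(quote_term(response), right_side)


# "Load-cycles" is a hypothetical column used only for this example
formula = build_formula("Max crack size", ["CSN", "Load-cycles"])
print(formula)
```

The resulting string can then be passed to smf.ols in place of a hand-written formula.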