Participate in the 2018 Python Developer Survey

If you use python in your work or hobby projects, please consider participating in the 2018 python developer survey.

Reposting a PSF-Community email as a PSA

Excerpt from an email to the and mailing lists:

As some of you may have seen, the 2018 Python Developer Survey is available. If you haven’t taken the survey yet, please do so soon! Additionally, we’d appreciate any assistance you all can provide with sharing the survey with your local Python groups, schools, work colleagues, etc. We will keep the survey open through October 26th, 2018.

Python Developers Survey 2018

We’re counting on your help to better understand how different Python developers use Python and related frameworks, tools, and technologies. We also hope you’ll enjoy going through the questions.

The survey is organized in partnership between the Python Software Foundation and JetBrains. Together we will publish the aggregated results. We will randomly choose and announce 100 winners to receive a Python Surprise Gift Pack (must complete the full survey to qualify).


Outlook Calendar Entries for a Day with Python

For me it is not the mobile. It is the email, which is the most distraction application in office.

The variable reward system has become a habit. I tried closing it but then I keep missing meetings etc.

So had to keep it open just so the reminders for meetings show up.

Only reason I need the outlook is to have a day view why not print it. Bad option, see the already printed stack on my desk. Why not get all the events out for a day and use a text file on the desktop.

Therefore, that is what I did.

Here’s the python function to get the appointments of the day from outlook calendar.

import win32com.client, datetime

def getCalendarEntries(days=1):
    Returns calender entries for days default is 1
    Outlook = win32com.client.Dispatch("Outlook.Application")
    ns = Outlook.GetNamespace("MAPI")
    appointments = ns.GetDefaultFolder(9).Items
    appointments.IncludeRecurrences = "True"
    today =
    begin ="%m/%d/%Y")
    tomorrow= datetime.timedelta(days=days)+today
    end ="%m/%d/%Y")
    appointments = appointments.Restrict("[Start] >= '" +begin+ "' AND [END] <= '" +end+ "'")
    for a in appointments:
        #print a.Start, a.Subject,a.Duration
    return events

The functions spits out a dict and can be easily converted used in any way one wants.

Sweet! Let's see if am able to kill outlook now.

Mosaic Plot in Python

I am sure everyone one of us has seen charts like this.

Any management training you attend, a version of this chart is bound to sneak up in the presentation, often in lecture notes or hands on activity.

These charts are a good representation of categorical entries. A mosaic plot allows visualizing multivariate categorical data in a rigorous and informative way. Click here to read A Brief History of the Mosaic Display [PDF]

Here’s a quick example how to plot mosaic in python

from import mosaic
import matplotlib.pyplot as plt
import pandas

gender = ['male', 'male', 'male', 'female', 'female', 'female']
pet = ['cat', 'dog', 'dog', 'cat', 'dog', 'cat']
data = pandas.DataFrame({'gender': gender, 'pet': pet})
mosaic(data, ['pet', 'gender'])


Here’s another example from the tips dataset.

from import mosaic
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')
mosaic(tips, ['sex','smoker','time'])


Timeline in Python with Matplotlib

Was documenting a series of events and after the list was complete, thought of showing them on a timeline.

After an hour two, fiddling with python standard libraries this is what I had.

Format of the data

from theflile import GenerateTimeLine
import pandas as pd
data = pd.read_csv(r'events.txt', parse_dates=True, index_col=0)
ax = GenerateTimeLine(data)

The code as always available at this GitHub

Detect Outliers with Matplotlib

Problem: You have a huge large multivariate data and want to get list of outliers?

Outlier detection is a significant statistical process and lot of theory under pining but there is a simple, quick way to do this is using the Inter-quartile (IQR) rule.

Read the linked PDF for a simple example summary


bloxplot_stats from matplotlib.cbook

Returns list of dictionaries of statistics

Here is a quick example

From matplotlib.cbook import boxplot_stats
st = boxplot_stats(data.AMT)
outliers = st[0]["fliers"]

I like this, because it is quick and does not need any external libraries apart from matplotlib.

Q in Statsmodels

If you are coming from R, you will love the formula API of statsmodels that work in a similar way.

I love it and been using it for quite some time since last year.

It’s good to test quick regression and GLM models and works on pandas dataframe which, at least in my case, are the basic unit where data is captured.

Most real life data is like this.

Created to be read by humans, so hyphens, spaces in name is common. However, this is a major issue if you need to use them in the formulae.

My old approach was to rename the column names with no special characters and gaps. This worked but messy.

data.rename(columns={'Max crack size': 'maxcracksize'},inplace=True)

Recently I discovered Q, a helper function from the patsy library, which the formula API of statsmodels uses.

Just wrap your formula names with Q if they have gaps of something.

Here is an example usage with Q

import statsmodels.formula.api as smf
import pandas as pd
data = pd.read_csv(fname, index_col=0, usecols=[0,1,2])
model  = smf.ols('Q("Max crack size") ~ CSN',data =data).fit()

I would say this is very convenient. No need of the unnecessarily and complex pre-processing code.

Numpy Arrays to Numpy Arrays Records

Recently working with hdf5 format and the datasets in the format requires the arrays to be records.

What does that mean?

NumPy provides powerful capabilities to create arrays of structured datatype. These arrays permit one to manipulate the data by named fields.

One defines a structured array through the dtype object. Creating them from list is simple and take the below form


x = np.array([(1,2.,'Hello'), (2,3.,"World")],
             <span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>dtype=[('foo', 'i4'),('bar', 'f4'), ('baz', 'S10')])
print x['foo']
array([1, 2])
print x['faz']
array(['Hello', 'World'],dtype='|S10')

But the problem with my code was that I already had arrays. So how to convert numpy arrays to numpy records arrays.

Numpy.core.records.fromarray to the rescue.


data = np.random.randn(15).reshape(5,3)
rec_data  = np.core.records.fromarrays(data.T, names=['a','b','c'])

Notice the transpose. That is required.