Spellings: How to get good at it?

As my school friends will attest, spellings were never a strong suit for me. Most of this could be because I was lazy and never paid attention to it when reading. I liked reading but never looked at the spellings.

I forgot about this until a few years back, when my kids started getting spelling tests at school. To give them extra practice, I turned to Python.

This is the app I developed, which is based on the concept of spaced repetition. According to Wikipedia:

Spaced repetition is an evidence-based learning technique that is usually performed with flashcards. Newly introduced and more difficult flashcards are shown more frequently while older and less difficult flashcards are shown less frequently in order to exploit the psychological spacing effect.
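
To make the idea concrete, here is a minimal Leitner-box style sketch. It is only an illustration of the scheduling principle, not the scheduling used in the app; the `INTERVALS` table and the `review` helper are names and values assumed for this example.

```python
# Minimal Leitner-style scheduler: a word moves up one box when it is
# spelled correctly and drops back to box 0 when it is missed.
# INTERVALS and review() are illustrative assumptions.
INTERVALS = [1, 2, 4, 8]  # days to wait before showing a word again, per box


def review(box, correct):
    """Return (new_box, days_until_next_review) after one attempt."""
    if correct:
        new_box = min(box + 1, len(INTERVALS) - 1)
    else:
        new_box = 0  # difficult words go back to the most frequent box
    return new_box, INTERVALS[new_box]


box = 0
history = []
for correct in [True, True, False, True]:
    box, wait = review(box, correct)
    history.append((box, wait))

print(history)  # [(1, 2), (2, 4), (0, 1), (1, 2)]
```

Words that keep being answered correctly are shown less and less often, while a single miss brings a word back into daily practice.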

Here’s a long demo of the app being used by my older kid.

Another fun thing added to the program is voice, so my kids really do love doing this.

My kids have been using this consistently for over a year now and their spelling has improved.

If you want to give it a try, please download this wheel or visit this github project to download the code.

Do give it a try if you have kids, and let me know how it goes.

Determining whether a system’s screen is locked using Python’s standard library

I think, apart from its ease of use, Python’s batteries-included philosophy is one of the reasons it has become so popular.

Here’s another cool piece of functionality that we needed in one of our apps, which was trying to maximise the usage of computing resources while the user had locked their computer.

The problem: find out whether the screen is locked.

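import ctypes

DESKTOP_SWITCHDESKTOP = 0x0100  # access right needed to call SwitchDesktop

def screen_locked():
    """Find if the user has locked their screen (Windows only)."""
    user32 = ctypes.windll.User32
    # The wide-character variant accepts a Python 3 str directly
    OpenDesktop = user32.OpenDesktopW
    SwitchDesktop = user32.SwitchDesktop

    hDesktop = OpenDesktop("default", 0, False, DESKTOP_SWITCHDESKTOP)
    result = SwitchDesktop(hDesktop)
    # SwitchDesktop fails (returns 0) while the screen is locked
    if result:
        return False
    return True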

File and Folder Comparison with Python

Python standard library modules are incredible. There’s a small gem to compare files and directories.

It’s useful.

Say you have two ASCII files and you want to do a file comparison. Don’t worry, use Python.

import filecmp

# Check two files
assert filecmp.cmp(base_reduced_bdd, bdd_file, shallow=False) is True

To compare two directories, use

x = filecmp.dircmp(dir1, dir2)

# prints a report on the differences between dir1 and dir2
x.report()

filecmp module has utilities for comparing files and directories.

It consists of the following.


    cmp(f1, f2, shallow=True) -> int
    cmpfiles(a, b, common) -> ([], [], [])

Signature: filecmp.cmp(f1, f2, shallow=True)
Compare two files.


f1 -- First file name

f2 -- Second file name

shallow -- Just check stat signature (do not read the files).
           defaults to True.

Return value:

True if the files are the same, False otherwise.

This function uses a cache for past comparisons and the results,
with cache entries invalidated if their stat information
changes.

Refer to the docs for more usage.
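
To see the functions listed above working together, here is a small self-contained sketch. The folder and file names are made up for the example and everything is created in a temporary directory.

```python
import filecmp
import os
import tempfile

# Build two small directories to compare (names are made up for the example).
root = tempfile.mkdtemp()
dir1 = os.path.join(root, "dir1")
dir2 = os.path.join(root, "dir2")
os.mkdir(dir1)
os.mkdir(dir2)


def write(folder, name, text):
    with open(os.path.join(folder, name), "w") as fh:
        fh.write(text)


write(dir1, "same.txt", "hello\n")
write(dir2, "same.txt", "hello\n")
write(dir1, "changed.txt", "version 1\n")
write(dir2, "changed.txt", "version 2\n")
write(dir1, "only_in_dir1.txt", "extra\n")

# shallow=False forces a comparison of the file contents
same = filecmp.cmp(os.path.join(dir1, "same.txt"),
                   os.path.join(dir2, "same.txt"), shallow=False)

# cmpfiles sorts the given file names into match / mismatch / errors
match, mismatch, errors = filecmp.cmpfiles(
    dir1, dir2, ["same.txt", "changed.txt", "only_in_dir1.txt"], shallow=False)

# dircmp exposes the differences as attributes
dc = filecmp.dircmp(dir1, dir2)

print(same, match, mismatch, errors)
print(dc.left_only, sorted(dc.common_files))
```

A file that is missing from one of the two folders ends up in the errors list of cmpfiles, which is easy to overlook.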

Memory Profile Your Code with IPython

Let’s say you have a function whose memory usage you want to check in Python. This can be evaluated with another IPython extension, memory_profiler.

The memory profiler extension contains two useful magic functions: the %memit magic and the %mprun function.

The %memit magic reports the peak memory used by a statement and the increment over the baseline, while %mprun provides line-by-line memory usage.

file: test_interp.py

import numpy as np
from scipy.interpolate import interp1d
def test(n):
	a = np.random.rand(n,4000,30)
	x = np.arange(n)
	xx = np.linspace(0,n, 2*n)
	f= interp1d(x,a, axis=0, copy=False, fill_value="extrapolate", assume_sorted=True)
	b = f(xx)

To test this function with %mprun

from test_interp import test
%mprun -f test test(1000)

This shows a line-by-line description of memory use.

Before using, we need to load the extension:

%load_ext memory_profiler

To install the extension use the following

pip install memory_profiler

Get Activation from scikit-learn’s neural network model

I have a simple multilayer perceptron that I use on my work computer. It works well and has 95% accuracy on a top-5 basis.

It’s a delight to see it work, but I want more insight into what is happening inside it. One way to get that is to see which neurons are firing, and how.

Unlike Keras, sklearn doesn’t give back the activations for each layer by itself, but there is a way to get them.

The following code helps get the activations from a sklearn neural network model.

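import numpy as np

def get_activations(clf, X):
    # hidden_layer_sizes may be a single int or a sequence of ints
    hidden_layer_sizes = clf.hidden_layer_sizes
    if not hasattr(hidden_layer_sizes, "__iter__"):
        hidden_layer_sizes = [hidden_layer_sizes]
    hidden_layer_sizes = list(hidden_layer_sizes)
    layer_units = [X.shape[1]] + hidden_layer_sizes + \
        [clf.n_outputs_]
    activations = [X]
    # allocate one array per layer for the forward pass to fill in
    for i in range(clf.n_layers_ - 1):
        activations.append(np.empty((X.shape[0],
                                     layer_units[i + 1])))
    # _forward_pass is a private sklearn method and may change between versions
    clf._forward_pass(activations)
    return activations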

via stackoverflow

Shutil make_archive to Rescue

I am amazed at the versatility of the shutil library. One use that I discovered recently is its ability to create archives.

Previously I always used Python’s zipfile module, but shutil’s make_archive function is so intuitive to use.

With a single line you can take a backup of a folder.


import shutil

shutil.make_archive(output_filename, 'zip', dir_name)

shutil.make_archive(base_name, format, root_dir=None, base_dir=None,
verbose=0, dry_run=0, owner=None, group=None, logger=None)

Create an archive file (e.g. zip or tar).

For more info, check the docs.
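
Here is a small runnable sketch of the one-liner in context. The folder and file names are invented for the example and live in a temporary directory; the point to notice is that the first argument is given without an extension.

```python
import os
import shutil
import tempfile
import zipfile

# Create a throwaway folder with one file in it (paths are made up).
work = tempfile.mkdtemp()
dir_name = os.path.join(work, "project")
os.mkdir(dir_name)
with open(os.path.join(dir_name, "notes.txt"), "w") as fh:
    fh.write("remember to back up\n")

# base_name is given WITHOUT the extension; make_archive appends ".zip"
output_filename = os.path.join(work, "project_backup")
archive_path = shutil.make_archive(output_filename, "zip", dir_name)

print(os.path.basename(archive_path))  # project_backup.zip

# Read the archive back to check what went in
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
print(names)
```

make_archive returns the path of the file it created, which is handy when the archive name is built from a date or a counter.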

Outlook Calendar Entries for a Day with Python

For me it is not the mobile. It is email that is the most distracting application in the office.

The variable reward system has become a habit. I tried closing it, but then I kept missing meetings.

So I had to keep it open just so the reminders for meetings show up.

The only reason I need Outlook is to have a day view, so why not print it? Bad option: see the stack already printed on my desk. Why not get all the events for a day and put them in a text file on the desktop?

So that is what I did.

Here’s the Python function to get the day’s appointments from the Outlook calendar.

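import datetime
import win32com.client  # part of the pywin32 package (Windows only)

def getCalendarEntries(days=1):
    """Return calendar entries for the next `days` days (default is 1)."""
    Outlook = win32com.client.Dispatch("Outlook.Application")
    ns = Outlook.GetNamespace("MAPI")
    appointments = ns.GetDefaultFolder(9).Items  # 9 = olFolderCalendar
    # Items must be sorted by start time before recurrences are expanded
    appointments.Sort("[Start]")
    appointments.IncludeRecurrences = True
    today = datetime.datetime.today()
    begin = today.date().strftime("%m/%d/%Y")
    tomorrow = datetime.timedelta(days=days) + today
    end = tomorrow.date().strftime("%m/%d/%Y")
    appointments = appointments.Restrict("[Start] >= '" + begin + "' AND [End] <= '" + end + "'")
    # Dict layout is an assumption based on the fields in the original debug print
    events = {'Start': [], 'Subject': [], 'Duration': []}
    for a in appointments:
        events['Start'].append(a.Start)
        events['Subject'].append(a.Subject)
        events['Duration'].append(a.Duration)
    return events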

The function spits out a dict, which can easily be converted or used in any way one wants.

Sweet! Let's see if I am able to kill Outlook now.

Mosaic Plot in Python

I am sure every one of us has seen charts like this.

In any management training you attend, a version of this chart is bound to sneak into the presentation, often in the lecture notes or a hands-on activity.

These charts are a good representation of categorical entries. A mosaic plot allows visualizing multivariate categorical data in a rigorous and informative way. Click here to read A Brief History of the Mosaic Display [PDF]

Here’s a quick example of how to draw a mosaic plot in Python.

from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import pandas

gender = ['male', 'male', 'male', 'female', 'female', 'female']
pet = ['cat', 'dog', 'dog', 'cat', 'dog', 'cat']
data = pandas.DataFrame({'gender': gender, 'pet': pet})
mosaic(data, ['pet', 'gender'])
plt.show()


Here’s another example from the tips dataset.

from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')
mosaic(tips, ['sex','smoker','time'])
plt.show()


Timeline in Python with Matplotlib

I was documenting a series of events and, once the list was complete, thought of showing them on a timeline.

After an hour or two of fiddling with Python libraries, this is what I had.

Format of the data

from theflile import GenerateTimeLine
import pandas as pd
data = pd.read_csv(r'events.txt', parse_dates=True, index_col=0)
ax = GenerateTimeLine(data)

The code, as always, is available on GitHub.

Detect Outliers with Matplotlib

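Problem: you have a large multivariate dataset and want to get a list of outliers.

Outlier detection is a significant statistical process with a lot of theory underpinning it, but there is a simple, quick way to do it: the interquartile range (IQR) rule. A point is flagged as an outlier when it lies more than 1.5 times the IQR below the first quartile or above the third quartile.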

Read the linked PDF for a simple example summary
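
To show what the rule computes, here is a minimal sketch using only the standard library. The `iqr_outliers` helper and the sample numbers are assumptions for illustration; it is not Matplotlib's implementation, only the same rule written out by hand.

```python
import statistics


def iqr_outliers(values, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < low or v > high]


amounts = [10, 12, 11, 13, 12, 11, 95, 10, 12, -40]
print(iqr_outliers(amounts))  # [95, -40]
```

For this sample the quartiles are 10.25 and 12, so the fences sit at 7.625 and 14.625 and only 95 and -40 fall outside them.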


boxplot_stats from matplotlib.cbook

It returns a list of dictionaries of statistics.

Here is a quick example

from matplotlib.cbook import boxplot_stats
st = boxplot_stats(data.AMT)
outliers = st[0]["fliers"]

I like this, because it is quick and does not need any external libraries apart from matplotlib.

Q in Statsmodels

If you are coming from R, you will love the formula API of statsmodels, which works in a similar way.

I love it and have been using it since last year.

It’s good for quickly testing regression and GLM models, and it works on pandas DataFrames which, at least in my case, are the basic unit in which data is captured.

Most real-life data is like this.

It is created to be read by humans, so hyphens and spaces in column names are common. However, this is a major issue if you need to use those names in formulae.

My old approach was to rename the columns so they had no special characters or gaps. This worked but was messy.

data.rename(columns={'Max crack size': 'maxcracksize'},inplace=True)

Recently I discovered Q, a helper function from the patsy library, which the formula API of statsmodels uses.

Just wrap your column names with Q if they have gaps or other special characters.

Here is an example usage with Q

import statsmodels.formula.api as smf
import pandas as pd
data = pd.read_csv(fname, index_col=0, usecols=[0,1,2])
model  = smf.ols('Q("Max crack size") ~ CSN',data =data).fit()

I would say this is very convenient. No need for unnecessary and complex pre-processing code.

NumPy Arrays to NumPy Record Arrays

I was recently working with the HDF5 format, and datasets in that format require the arrays to be records.

What does that mean?

NumPy provides powerful capabilities to create arrays of structured datatype. These arrays permit one to manipulate the data by named fields.

One defines a structured array through the dtype object. Creating one from a list is simple and takes the form below.


import numpy as np

x = np.array([(1,2.,'Hello'), (2,3.,"World")],
             dtype=[('foo', 'i4'),('bar', 'f4'), ('baz', 'S10')])
print(x['foo'])
# [1 2]
print(x['baz'])
# [b'Hello' b'World']

But the problem with my code was that I already had arrays. So how do you convert NumPy arrays to NumPy record arrays?

numpy.core.records.fromarrays to the rescue.


data = np.random.randn(15).reshape(5,3)
rec_data  = np.core.records.fromarrays(data.T, names=['a','b','c'])

Notice the transpose. That is required.

Cross Validation Score with Statsmodels

Recently I was working with a limited set of data. Using statsmodels, I fitted a regression model to the data.

To test confidence in the model, I needed to do cross-validation. The solution that immediately sprang to mind was the cross_val_score function from the scikit-learn library.

However, that function does not work on statsmodels objects.

Solution: wrap the statsmodels object in an sklearn base estimator and then use that.

Here is the code for wrapping statsmodels objects in an sklearn BaseEstimator.

from sklearn.base import BaseEstimator, RegressorMixin
import statsmodels.formula.api as smf
import statsmodels.api as sm

class statsmodel(BaseEstimator, RegressorMixin):
    def __init__(self, sm_class, formula):
        self.sm_class = sm_class
        self.formula = formula
        self.model = None
        self.result = None

    def fit(self,data,dummy):
        self.model = self.sm_class(self.formula,data)
        self.result = self.model.fit()
        return self  # sklearn convention: fit returns the estimator

    def predict(self,X):
        return self.result.predict(X)

Notice the dummy argument in the fit function: sklearn’s regression API needs X and y, but for this implementation I was relying on the formula API of statsmodels, so I had to add a dummy variable.

Here’s a quick example on how to use it.

from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Get data
ccard = sm.datasets.ccard.load_pandas()
print (ccard.data.head())

# create a model
clf = statsmodel(smf.ols, "AVGEXP ~ AGE + INCOME")

# Print cross val score on this model
print (cross_val_score(clf, ccard.data, ccard.data['AVGEXP']))

# Same thing on sklearn's linear regression model
lm = linear_model.LinearRegression()

print (cross_val_score(lm , ccard.data.iloc[:,1:3].values, ccard.data.iloc[:,0].values))

I love sklearn. A convenient, efficient and consistent API can make things so much easier.

Parsing date – a simple example

Problem: let’s say we have data like the image below. It can be anything, from IoT sensor data to something like the expense log below.

If the data has day, month and year as separate columns, as shown below, how do you read this kind of data log with pandas and create a date index?


import pandas as pd
data = pd.read_csv('EXP2018.csv',parse_dates={'date':['YY','MM','DD']}, index_col='date')

Yes, you could use pd.to_datetime after the read for this and other non-standard parsing, but this one is elegant.

Date parsing in pandas is super useful. Here are some other options that can be considered for parsing dates with pandas.

parse_dates : boolean or list of ints or names or list of lists or dict, default False
    * boolean. If True -> try parsing the index.
    * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
      each as a separate date column.
    * list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
      a single date column.
    * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
      'foo'

    If a column or index contains an unparseable date, the entire column or
    index will be returned unaltered as an object data type. For non-standard
    datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv``

    Note: A fast-path exists for iso8601-formatted dates.

infer_datetime_format : boolean, default False
    If True and parse_dates is enabled, pandas will attempt to infer the format
    of the datetime strings in the columns, and if it can be inferred, switch
    to a faster method of parsing them. In some cases this can increase the
    parsing speed by 5-10x.

keep_date_col : boolean, default False
    If True and parse_dates specifies combining multiple columns then
    keep the original columns.

date_parser : function, default None
    Function to use for converting a sequence of string columns to an array of
    datetime instances. The default uses ``dateutil.parser.parser`` to do the
    conversion. Pandas will try to call date_parser in three different ways,
    advancing to the next if an exception occurs: 1) Pass one or more arrays
    (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
    string values from the columns defined by parse_dates into a single array
    and pass that; and 3) call date_parser once for each row using one or more
    strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst : boolean, default False
    DD/MM format dates, international and European format

Variable length list to Numpy array

Suppose you have a variable-length list and you want to convert it to a NumPy array.

alist = [[1,2,3],[5,6]]

What is the efficient way to convert this list to a numpy array?

My first answer was to use pandas, and this is what I did:

import pandas as pd
data = pd.DataFrame(alist).fillna(0).values

This worked and I moved on to my other problem, but then I wondered whether there was a more efficient way. Turns out there is.

import itertools
import numpy as np
data=np.array(list(itertools.izip_longest(*alist, fillvalue=0))).T

That is for Python 2.7; in Python 3, use the following:

import itertools
data=np.array(list(itertools.zip_longest(*alist, fillvalue=0))).T
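
To see why the transpose is needed, here is the same idea stepped through with plain lists, leaving NumPy out so the intermediate values are easy to read. The variable names `columns` and `rows` are just for this illustration.

```python
from itertools import zip_longest

alist = [[1, 2, 3], [5, 6]]

# zip_longest(*alist) walks the rows column by column, filling the gaps
columns = list(zip_longest(*alist, fillvalue=0))
print(columns)  # [(1, 5), (2, 6), (3, 0)]

# zipping the columns again transposes them back to rows,
# which is the job .T does in the NumPy version
rows = [list(r) for r in zip(*columns)]
print(rows)  # [[1, 2, 3], [5, 6, 0]]
```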

How fast and efficient? See the image below.

After writing the above I googled and found this link. Here is the result of both methods on the example data in the link.

Clearly, itertools is the winner.

Pandas Time Index vs DateTime and Inconveniences

Sometimes convenience becomes a handicap. I saw this first hand late last month.

I was so used to pandas DataFrames and the DatetimeIndex object that when I had to move back to a system which didn’t have pandas, I struggled to get a simple day of year, day of week and week of year from Python’s standard datetime module.

Here is how all of this is available in pandas from a DatetimeIndex column or index.

If your data looks like this, with time as the index

You can get all the convenience functions like this

However, this gave me an opportunity to explore datetime, and here is the code to get all this and more from datetime.

import datetime
today = datetime.datetime.today()
day_of_year = today.timetuple().tm_yday
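
The other two values mentioned above, day of week and week of year, can be had from the standard library in the same way. This is a small sketch with a fixed, arbitrarily chosen date so that the printed values are reproducible; the week number here follows the ISO 8601 convention.

```python
import datetime

# A fixed date keeps the example reproducible; use datetime.date.today() in practice.
day = datetime.date(2018, 3, 15)

day_of_year = day.timetuple().tm_yday   # 1 January is day 1
day_of_week = day.weekday()             # Monday is 0, Sunday is 6
iso_year, week_of_year, iso_weekday = day.isocalendar()  # ISO 8601 week numbering

print(day_of_year, day_of_week, week_of_year)  # 74 3 11
```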

Inconveniences are good; they always end up teaching you something.

The Best Debugging Tool of All

Rob Pike (one of the authors of the Go language) has this to say on debugging:

A year or two after I’d joined the Labs, I was pair programming with Ken Thompson on an on-the-fly compiler for a little interactive graphics language designed by Gerard Holzmann. I was the faster typist, so I was at the keyboard and Ken was standing behind me as we programmed. We were working fast, and things broke, often visibly—it was a graphics language, after all. When something went wrong, I’d reflexively start to dig in to the problem, examining stack traces, sticking in print statements, invoking a debugger, and so on. But Ken would just stand and think, ignoring me and the code we’d just written. After a while I noticed a pattern: Ken would often understand the problem before I would, and would suddenly announce, “I know what’s wrong.” He was usually correct. I realized that Ken was building a mental model of the code and when something broke it was an error in the model. By thinking about *how* that problem could happen, he’d intuit where the model was wrong or where our code must not be satisfying the model.

Ken taught me that thinking before debugging is extremely important. If you dive into the bug, you tend to fix the local issue in the code, but if you think about the bug first, how the bug came to be, you often find and correct a higher-level problem in the code that will improve the design and prevent further bugs.

I recognize this is largely a matter of style. Some people insist on line-by-line tool-driven debugging for everything. But I now believe that thinking—without looking at the code—is the best debugging tool of all, because it leads to better software.