The Most Audacious Flying Machine Ever

It may only be a matter of weeks before Stratolaunch, the world’s biggest plane, with a wingspan longer than a football field, takes to the air for the first time. The aircraft was unveiled by Paul Allen, the Microsoft co-founder, in June 2017. The aircraft could eventually be used to transport rockets carrying satellites and people into the Earth’s upper atmosphere, where they will blast off into space. Allen recently said of Stratolaunch: “When you see that giant plane, it’s a little nutty. And you don’t build it unless you’re very serious, not only about wanting to see the plane fly but to see it fulfil its purpose. Which is getting vehicles in orbit.”


Homepage for the project

If you are interested in the story and motivation for the project, read this excellent post


Detect Outliers with Matplotlib

Problem: You have a large multivariate dataset and want a list of its outliers.

Outlier detection is a significant statistical topic with a lot of theory underpinning it, but there is a simple, quick way to do it: the interquartile range (IQR) rule.

Read the linked PDF for a simple example summary
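To make the IQR rule concrete, here is a minimal sketch using plain NumPy. The sample values are made up for illustration, and the 1.5 multiplier is the usual convention for the rule, not something specific to my data.

```python
import numpy as np

# Made-up sample values, used only for illustration.
values = np.array([10, 12, 11, 13, 12, 11, 10, 95, 12, 11], dtype=float)

# First and third quartiles, and the interquartile range between them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the usual convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)  # [95.]
```

Here the quartiles are 11 and 12, so the fences sit at 9.5 and 13.5, and only 95 falls outside them.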


boxplot_stats from matplotlib.cbook

Returns a list of dictionaries of statistics, one dictionary per dataset

Here is a quick example

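from matplotlib.cbook import boxplot_stats
# data is a pandas DataFrame; AMT is the column to check
st = boxplot_stats(data.AMT)
# "fliers" holds the points that fall beyond the whiskers
outliers = st[0]["fliers"]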

I like this, because it is quick and does not need any external libraries apart from matplotlib.
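For multivariate data, boxplot_stats can be given a 2D array and returns one dictionary per column. The sketch below assumes randomly generated data with two planted extreme values. The array, the seed and the planted values are all made up for illustration.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Made-up data: 200 rows, 3 columns, with two planted extreme values.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
data[0, 0] = 12.0
data[1, 2] = -15.0

# One statistics dictionary per column of the 2D array.
stats = boxplot_stats(data)

# Collect the fliers (points beyond the whiskers) for each column index.
outliers = {col: s["fliers"] for col, s in enumerate(stats)}
for col, fliers in outliers.items():
    print(col, fliers)
```

The default whisker setting (whis=1.5) corresponds to the 1.5 * IQR rule, so the planted values show up among the fliers of columns 0 and 2.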

Train your model of the world

Recently I read this quote in a random blog post that I stumbled upon while surfing. Sorry, I forgot to save the link.

Loved it. Most of my working hours nowadays are saturated with words like model, training, etc., so this quote struck a nerve.

Reading and experience train your model of the world. And even if you forget the experience or what you read, its effect on your model of the world persists. Your mind is like a compiled program you’ve lost the source of. It works, but you don’t know why. – Paul Graham

Q in Statsmodels

If you are coming from R, you will love the formula API of statsmodels, which works in a similar way.

I love it and have been using it since last year.

It’s good for quickly testing regression and GLM models, and it works on pandas DataFrames, which, at least in my case, are the basic unit in which data is captured.

Most real life data is like this.

It is created to be read by humans, so hyphens and spaces in column names are common. However, this is a major issue if you need to use those names in formulae.

My old approach was to rename the columns to names without special characters or gaps. This worked, but it was messy.

data.rename(columns={'Max crack size': 'maxcracksize'}, inplace=True)

Recently I discovered Q, a helper function from the patsy library, which the formula API of statsmodels uses.

Just wrap the column names in your formula with Q if they contain gaps or other special characters.

Here is an example usage with Q

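import pandas as pd
import statsmodels.formula.api as smf
# fname is the path to the CSV file
data = pd.read_csv(fname, index_col=0, usecols=[0, 1, 2])
# Q("...") lets the formula refer to a column name that contains spaces
model = smf.ols('Q("Max crack size") ~ CSN', data=data).fit()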

I would say this is very convenient. No need for unnecessary, complex pre-processing code.
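For anyone who wants to try Q without a CSV file at hand, here is a small self-contained sketch. The numbers are made up, and the "load-level" column is a hypothetical addition to show that Q works on the right-hand side of the formula as well.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data; "load-level" is a hypothetical column added for illustration.
data = pd.DataFrame({
    "Max crack size": [5.1, 7.9, 11.2, 13.8, 17.0, 19.6],
    "CSN": [1, 2, 3, 4, 5, 6],
    "load-level": [2, 1, 4, 3, 6, 5],
})

# Q("...") quotes any column name that is not a valid Python identifier,
# on either side of the formula.
model = smf.ols('Q("Max crack size") ~ CSN + Q("load-level")', data=data).fit()
print(model.params)
```

The DataFrame keeps its human-readable column names throughout, so nothing has to be renamed before or after fitting.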

Thank you Guido

Here is a blog post that I am reposting from this link.

I have similar sentiments about Python. I first tried Python in 2010 but truly took it up in the summer of 2011, when a task that called for a Perl script landed as one of my assignments. Instead of a Perl script, I worked with Python, and I have never looked back.

Therefore, without further delay, here is the blog post.

When I was in my early 20s, I was OK at programming, but I definitely didn’t like it. Then, one evening, I read the Python tutorial. That evening changed my mind. I woke up the next morning, like Neo in the matrix, and knew Python.

I was doing statistics at the time. Python, with Numeric, was a powerful tool. It definitely could do things that SPSS could only dream about. Suddenly, something has happened that never happened before — I started to enjoy programming.

I had to spend six years in the desert of programming in languages that were not Python, before my work place, and soon afterwards the world, realized what an amazing tool Python is. I have not had to struggle to find a Python position since.

I started with Python 1.4. I have grown up with Python. Now I am…no longer in my 20s, and Python version 3.7 was recently released.

I owe much of my career, many of my friends, and much of my hobby time to that one evening, sitting down and reading the Python tutorial — and to the man who made the language and wrote the first version of that tutorial, Guido van Rossum.

Python, like all open source projects, like, indeed, all software projects, is not a one man show. A whole team, with changing personnel, works on core Python and its ecosystem. But it was all started by Guido.

As Guido is stepping down to take a less active role in Python’s future, I want to offer my eternal gratitude. For my amazing career, for my friends, for my hobby. Thank you, Guido van Rossum. Your contribution to humanity, and to this one human in particular, is hard to overestimate.

NumPy Arrays to NumPy Record Arrays

Recently I have been working with the HDF5 format, and the datasets in that format require the arrays to be records.

What does that mean?

NumPy provides powerful capabilities to create arrays of structured datatype. These arrays permit one to manipulate the data by named fields.

One defines a structured array through the dtype object. Creating one from a list is simple and takes the form below.


import numpy as np

x = np.array([(1, 2., 'Hello'), (2, 3., 'World')],
             dtype=[('foo', 'i4'), ('bar', 'f4'), ('baz', 'S10')])
print(x['foo'])
# [1 2]
print(x['baz'])
# [b'Hello' b'World']

But the problem with my code was that I already had arrays. So how do you convert NumPy arrays to NumPy record arrays?

numpy.core.records.fromarrays to the rescue (the same function is also exposed as np.rec.fromarrays).


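data = np.random.randn(15).reshape(5,3)
rec_data = np.core.records.fromarrays(data.T, names=['a','b','c'])

Notice the transpose. That is required, because fromarrays expects one array per field: the (5, 3) array has to be turned into three arrays of length 5.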
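Here is a self-contained sketch of the same conversion with a quick check of the result. It uses np.rec.fromarrays, the public alias for the function above. The field names a, b and c are just the ones from my snippet.

```python
import numpy as np

data = np.random.randn(15).reshape(5, 3)

# data.T has shape (3, 5): one length-5 array for each of the three fields.
rec_data = np.rec.fromarrays(data.T, names=["a", "b", "c"])

print(rec_data.dtype.names)  # ('a', 'b', 'c')
print(rec_data.shape)        # (5,)

# Each field holds one column of the original array.
assert np.array_equal(rec_data["a"], data[:, 0])
assert np.array_equal(rec_data.c, data[:, 2])  # attribute access also works
```

The result is a one-dimensional array of five records, and each named field gives back one column of the original 2D array.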