Thank you, Guido

Here is a blog post that I am reposting from this link.

I have similar sentiments about Python. I first tried Python in 2010 but truly took it up in the summer of 2011, when a task that called for a Perl script landed as one of my assignments. I used Python instead, and I have never looked back.

So, without further delay, here is the blog post.

When I was in my early 20s, I was OK at programming, but I definitely didn’t like it. Then, one evening, I read the Python tutorial. That evening changed my mind. I woke up the next morning, like Neo in The Matrix, and knew Python.

I was doing statistics at the time. Python, with Numeric, was a powerful tool. It definitely could do things that SPSS could only dream about. Suddenly, something happened that had never happened before: I started to enjoy programming.

I had to spend six years in the desert of programming in languages that were not Python, before my workplace, and soon afterwards the world, realized what an amazing tool Python is. I have not had to struggle to find a Python position since.

I started with Python 1.4. I grew up with Python. Now I am… no longer in my 20s, and Python 3.7 was recently released.

I owe much of my career, many of my friends, and much of my hobby time to that one evening, sitting down and reading the Python tutorial — and to the man who made the language and wrote the first version of that tutorial, Guido van Rossum.

Python, like all open source projects, like, indeed, all software projects, is not a one man show. A whole team, with changing personnel, works on core Python and its ecosystem. But it was all started by Guido.

As Guido is stepping down to take a less active role in Python’s future, I want to offer my eternal gratitude. For my amazing career, for my friends, for my hobby. Thank you, Guido van Rossum. Your contribution to humanity, and to this one human in particular, is hard to overestimate.


NumPy Arrays to NumPy Record Arrays

Recently I was working with the HDF5 format, and the datasets in that format required the arrays to be record arrays.

What does that mean?

NumPy provides powerful capabilities for creating arrays with a structured datatype. These arrays permit one to manipulate the data by named fields.

One defines a structured array through the dtype object. Creating one from a list is simple and takes the form below.

Example:

import numpy as np

# A structured array: each element is a record with named fields
x = np.array([(1, 2., 'Hello'), (2, 3., 'World')],
             dtype=[('foo', 'i4'), ('bar', 'f4'), ('baz', 'S10')])
print(x['foo'])  # [1 2]
print(x['baz'])  # [b'Hello' b'World']

But the problem with my code was that I already had plain arrays. So how does one convert NumPy arrays to NumPy record arrays?

numpy.core.records.fromarrays to the rescue.

Example:

data = np.random.randn(15).reshape(5, 3)      # 5 rows, 3 columns
rec_data = np.core.records.fromarrays(data.T, names=['a', 'b', 'c'])

Notice the transpose. It is required because fromarrays expects a sequence of arrays, one per field, so the (5, 3) matrix must be handed over as three column arrays of length 5.
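
Since the original motivation was HDF5, here is a minimal sketch of writing the record array to a dataset, assuming the h5py library; the file and dataset names are illustrative:

import h5py

# Write the record array as an HDF5 dataset (names are illustrative)
with h5py.File('example.h5', 'w') as f:
    f.create_dataset('table', data=rec_data)

# Read it back; the compound dtype and field names are preserved
with h5py.File('example.h5', 'r') as f:
    print(f['table']['a'])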

Cross-Validation Score with Statsmodels

Recently I was working with a limited set of data. Using statsmodels, I fit a regression model on the data.

To test confidence in the model, I needed to do cross-validation. The solution that immediately sprang to mind was the cross_val_score function from the scikit-learn library.

However, that function does not work directly on statsmodels objects.

Solution: wrap sklearn's base estimator interface around statsmodels objects, then use the wrapped model.

Here is the code for wrapping sklearn's BaseEstimator around statsmodels objects.

from sklearn.base import BaseEstimator, RegressorMixin
import statsmodels.formula.api as smf
import statsmodels.api as sm

class statsmodel(BaseEstimator, RegressorMixin):
    def __init__(self, sm_class, formula):
        self.sm_class = sm_class   # e.g. smf.ols
        self.formula = formula     # the formula names the response itself
        self.model = None
        self.result = None

    def fit(self, data, dummy):
        # The formula API takes the whole DataFrame; the response
        # variable is picked out by the formula, not by a y argument.
        self.model = self.sm_class(self.formula, data)
        self.result = self.model.fit()
        return self   # sklearn convention: fit returns the estimator

    def predict(self, X):
        return self.result.predict(X)

Notice the dummy argument in the fit function: sklearn's regression API expects both X and y, but this implementation relies on the formula API of statsmodels, where the response variable is named inside the formula itself, so a dummy argument had to be added.

Here’s a quick example on how to use it.


from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Get data
ccard = sm.datasets.ccard.load_pandas()
print(ccard.data.head())

# Create a model
clf = statsmodel(smf.ols, "AVGEXP ~ AGE + INCOME")

# Print cross-validation scores for this model
print(cross_val_score(clf, ccard.data, ccard.data['AVGEXP']))

# Same thing with sklearn's linear regression model
lm = linear_model.LinearRegression()
print(cross_val_score(lm, ccard.data.iloc[:, 1:3].values, ccard.data.iloc[:, 0].values))

I love sklearn. A convenient, efficient, and consistent API makes things so much easier.

Parsing dates – a simple example

Problem: let's say we have a log of data. It could be anything, from IoT sensor readings to an expense log.

If the data has day, month, and year as separate columns, how do we read this kind of log with pandas and create the date as the index? A sample layout is sketched below.
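
For illustration, a file like EXP2018.csv might begin like this (the column names match the code below; the values are hypothetical):

YY,MM,DD,ITEM,AMOUNT
2018,01,05,Groceries,42.50
2018,01,09,Fuel,30.00
2018,02,14,Dinner,55.25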

Solution:

import pandas as pd

# Combine the YY, MM, DD columns into a single 'date' column
# and use it as the index
data = pd.read_csv('EXP2018.csv',
                   parse_dates={'date': ['YY', 'MM', 'DD']},
                   index_col='date')

Yes, one can use pd.to_datetime after the read for this and other non-standard parsing, but the one-liner above is elegant.
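
For comparison, here is a minimal sketch of the pd.to_datetime route, assuming the same hypothetical EXP2018.csv; pd.to_datetime can assemble dates from a DataFrame whose columns are named year, month, and day:

import pandas as pd

data = pd.read_csv('EXP2018.csv')
# Rename the columns so pd.to_datetime can assemble them into dates
data['date'] = pd.to_datetime(
    data[['YY', 'MM', 'DD']].rename(columns={'YY': 'year',
                                             'MM': 'month',
                                             'DD': 'day'}))
data = data.set_index('date').drop(columns=['YY', 'MM', 'DD'])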

Date parsing in pandas is super useful. Here are the other options available for parsing dates with pandas, from the read_csv documentation.

parse_dates : boolean or list of ints or names or list of lists or dict, default False

* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv``. Note: a fast-path exists for iso8601-formatted dates.

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

keep_date_col : boolean, default False

If True and parse_dates specifies combining multiple columns then keep the original columns.

date_parser : function, default None

Function to use for converting a sequence of string columns to an array of datetime instances. The default uses ``dateutil.parser.parser`` to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst : boolean, default False

DD/MM format dates, international and European format.
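
As a quick illustration of a few of these options, here is a sketch assuming a hypothetical log.csv with a DD/MM/YYYY date column named 'when':

import pandas as pd

# 'log.csv' and the 'when' column are hypothetical
df = pd.read_csv('log.csv',
                 parse_dates=['when'],        # parse this column as dates
                 dayfirst=True,               # read 01/02/2018 as 1 February
                 infer_datetime_format=True)  # speed up with inferred format
print(df.dtypes)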