
Planet Python

Last update: November 20, 2017 04:47 PM

November 20, 2017


Stack Abuse

Using Machine Learning to Predict the Weather: Part 2

This article is a continuation of the prior article in a three part series on using Machine Learning in Python to predict weather temperatures for the city of Lincoln, Nebraska in the United States based off data collected from Weather Underground's API services. In the first article of the series, Using Machine Learning to Predict the Weather: Part 1, I described how to extract the data from Weather Underground, parse it, and clean it. For a summary of the topics for each of the articles presented in this series, please see the introduction to the prior article.

The focus of this article will be to describe the processes and steps required to build a rigorous Linear Regression model to predict future mean daily temperature values based off the dataset built in the prior article. To build the Linear Regression model I will be demonstrating the use of two important Python libraries in the Machine Learning industry:
Scikit-Learn and StatsModels.

Refamiliarizing Ourselves with the Dataset

In this GitHub repository you will find a Jupyter Notebook with the file name Weather Underground API.ipynb which describes the step-by-step actions required to collect the dataset we will be working with in this and the final article. Additionally, in this repository you will find a pickled Pandas DataFrame file called end-part1_df.pkl. So, if you would like to follow along without going through the somewhat painful experience of gathering, processing, and cleaning the data described in the prior article then pull down the pickle file and use the following code to deserialize the data back into a Pandas DataFrame for use in this section.

import pickle  
with open('end-part1_df.pkl', 'rb') as fp:  
    df = pickle.load(fp)

Background on Linear Regression using Ordinary Least Squares

Linear regression aims to apply a set of assumptions, primarily regarding linear relationships, and numerical techniques to predict an outcome (Y, aka the dependent variable) based off of one or more predictors (the X's, or independent variables), with the end goal of establishing a model (a mathematical formula) to predict outcomes given only the predictor values, with some amount of uncertainty.

The generalized formula for a Linear Regression model is:

    ŷ = β0 + β1 * x1 + β2 * x2 + ... + βp * xp + Ε

where:

    ŷ is the predicted outcome (here, the mean daily temperature)
    β0 is the intercept and β1 through βp are the coefficients of the predictors x1 through xp
    Ε is the error term, the difference between the predicted and the observed values

That last term in the equation for the Linear Regression is a very important one. The most basic form of building a Linear Regression model relies on an algorithm known as Ordinary Least Squares which finds the combination of βj's values which minimize the Ε term.
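
To make that concrete, here is a minimal, hedged sketch (using made-up numbers rather than the weather dataset) of what Ordinary Least Squares is doing: NumPy's least-squares routine returns the β values that minimize the sum of squared errors for a given set of predictors.

import numpy as np

# a tiny illustrative dataset (hypothetical values, not from the weather data);
# the first column of ones represents the β0 intercept term
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0]])
y = np.array([4.1, 6.2, 10.1, 13.8])

# np.linalg.lstsq finds the betas minimizing ||y - X·β||^2 (the squared error term)
betas, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(betas)  # approximately [β0, β1]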

Selecting Features for our Model

A key assumption required by the linear regression technique is that you have a linear relationship between the dependent variable and each independent variable. One way to assess the linearity between our dependent variable, which for now will be the mean temperature, and the independent variables is to calculate the Pearson correlation coefficient.

The Pearson correlation coefficient (r) is a measurement of the amount of linear correlation between two equal-length arrays, and it outputs a value ranging from -1 to 1. Correlation values from 0 to 1 represent increasingly strong positive correlation. By this I mean that two data series are positively correlated when values in one series increase along with the values in the other and, as the changes become closer to equal in magnitude, the Pearson correlation value approaches 1.

Correlation values from 0 to -1 are said to be inversely, or negatively, correlated in that when the values of one series increase, the corresponding values of the other series decrease; as those changes approach equal magnitude (in opposite directions), the correlation value approaches -1. Pearson correlation values that closely straddle zero suggest a weak linear relationship, becoming weaker as the value approaches zero.
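
As a small illustrative sketch (with made-up series rather than the weather data), both behaviours are easy to see with pandas' Series.corr(), which computes the Pearson coefficient by default:

import pandas as pd

a = pd.Series([1, 2, 3, 4, 5])
b = pd.Series([2, 4, 6, 8, 10])   # moves with a         -> r approaches +1
c = pd.Series([10, 8, 6, 4, 2])   # moves opposite to a  -> r approaches -1

print(a.corr(b))  # 1.0
print(a.corr(c))  # -1.0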

Opinions vary among statisticians and stats books on clear-cut boundaries for the levels of strength of a correlation coefficient. However, I have found that a generally accepted set of classifications for the strengths of correlation are as follows:

Correlation Value Interpretation
0.8 - 1.0 Very Strong
0.6 - 0.8 Strong
0.4 - 0.6 Moderate
0.2 - 0.4 Weak
0.0 - 0.2 Very Weak

To assess the correlation in this data I will call the corr() method of the Pandas DataFrame object. Chained to this corr() method call I can then select the column of interest ("meantempm") and again chain another method call sort_values() on the resulting Pandas Series object. This will output the correlation values from most negatively correlated to the most positively correlated.

df.corr()[['meantempm']].sort_values('meantempm')  
meantempm
maxpressurem_1 -0.519699
maxpressurem_2 -0.425666
maxpressurem_3 -0.408902
meanpressurem_1 -0.365682
meanpressurem_2 -0.269896
meanpressurem_3 -0.263008
minpressurem_1 -0.201003
minhumidity_1 -0.148602
minhumidity_2 -0.143211
minhumidity_3 -0.118564
minpressurem_2 -0.104455
minpressurem_3 -0.102955
precipm_2 0.084394
precipm_1 0.086617
precipm_3 0.098684
maxhumidity_1 0.132466
maxhumidity_2 0.151358
maxhumidity_3 0.167035
maxdewptm_3 0.829230
maxtempm_3 0.832974
mindewptm_3 0.833546
meandewptm_3 0.834251
mintempm_3 0.836340
maxdewptm_2 0.839893
meandewptm_2 0.848907
mindewptm_2 0.852760
mintempm_2 0.854320
meantempm_3 0.855662
maxtempm_2 0.863906
meantempm_2 0.881221
maxdewptm_1 0.887235
meandewptm_1 0.896681
mindewptm_1 0.899000
mintempm_1 0.905423
maxtempm_1 0.923787
meantempm_1 0.937563
mintempm 0.973122
maxtempm 0.976328
meantempm 1.000000

In selecting features to include in this linear regression model, I would like to err on the side of being slightly less permissive in including variables with moderate or lower correlation coefficients. So I will be removing the features that have absolute correlation values less than 0.6. Also, since the "mintempm" and "maxtempm" variables are for the same day as the prediction variable "meantempm", I will be removing those also (i.e. if I already know the min and max temperatures then I already have the answer to my prediction).

With this information, I can now create a new DataFrame that only contains my variables of interest.

predictors = ['meantempm_1',  'meantempm_2',  'meantempm_3',  
              'mintempm_1',   'mintempm_2',   'mintempm_3',
              'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
              'maxdewptm_1',  'maxdewptm_2',  'maxdewptm_3',
              'mindewptm_1',  'mindewptm_2',  'mindewptm_3',
              'maxtempm_1',   'maxtempm_2',   'maxtempm_3']
df2 = df[['meantempm'] + predictors]  

Visualizing the Relationships

Because most people, myself included, are much more accustomed to looking at visuals to assess and verify patterns, I will be graphing each of these selected predictors to prove to myself that there is in fact a linear relationship. To do this I will utilize matplotlib's pyplot module.

For this plot I would like to have the dependent variable "meantempm" be the consistent y-axis across all 18 of the predictor variable plots. One way to accomplish this is to create a grid of plots. Pandas does come with a useful plotting function called scatter_matrix(), but I generally only use it when there are only up to about 5 variables because it turns the plot into an N x N matrix (18 x 18 in our case), which makes it difficult to see details in the data. Instead I will create a grid structure with six rows of three columns in order to avoid sacrificing clarity in the graphs.

import matplotlib  
import matplotlib.pyplot as plt  
import numpy as np  
%matplotlib inline

# manually set the parameters of the figure to an appropriate size
plt.rcParams['figure.figsize'] = [16, 22]

# call subplots specifying the grid structure we desire and that 
# the y axes should be shared
fig, axes = plt.subplots(nrows=6, ncols=3, sharey=True)

# Since it would be nice to loop through the features in order to build this plot
# let us rearrange our data into a 2D array of 6 rows and 3 columns
arr = np.array(predictors).reshape(6, 3)

# use enumerate to loop over the arr 2D array of rows and columns
# and create scatter plots of each meantempm vs each feature
for row, col_arr in enumerate(arr):  
    for col, feature in enumerate(col_arr):
        axes[row, col].scatter(df2[feature], df2['meantempm'])
        if col == 0:
            axes[row, col].set(xlabel=feature, ylabel='meantempm')
        else:
            axes[row, col].set(xlabel=feature)
plt.show()  

From the plots above you can see that all the remaining predictor variables show a good linear relationship with the response variable ("meantempm"). Additionally, it is also worth noting that the relationships all look uniformly randomly distributed. By this I mean there appears to be relatively equal variation in the spread of values, devoid of any fanning or cone shape. This uniform random spread of the points (known as homoscedasticity) is another important assumption of Linear Regression using the Ordinary Least Squares algorithm.

Using Step-wise Regression to Build a Robust Model

A robust Linear Regression model should utilize statistical tests for selecting meaningful, statistically significant, predictors to include. To select statistically significant features, I will utilize the Python statsmodels library. However, before I jump into the practical implementation of using the statsmodels library I would like to take a step back and explain some of the theoretical meaning and purpose for taking this approach.

A key aspect of using statistical methods such as Linear Regression in an analytics project is the establishment and testing of hypotheses to validate the significance of assumptions made about the data under study. There are numerous hypothesis tests that have been developed to test the robustness of a linear regression model against various assumptions that are made. One such hypothesis test is to evaluate the significance of each of the included predictor variables.

The formal definition of the hypothesis test for the significance of a βj parameter is as follows:

    H0: βj = 0 (the predictor xj has no meaningful effect on the outcome)
    Ha: βj ≠ 0 (the predictor xj has a statistically significant effect on the outcome)

By using tests of probability to evaluate the likelihood that each βj is significant beyond simple random chance at a selected threshold Α, we can be more stringent in selecting the variables to include, resulting in a more robust model.

However, in many datasets there can be interactions that occur between variables that can lead to false interpretations of these simple hypothesis tests. To test for the effects of interactions on the significance of any one variable in a linear regression model a technique known as step-wise regression is often applied. Using step-wise regression you add or remove variables from the model and assess the statistical significance of each variable on the resultant model.

In this article, I will be using a technique known as backward elimination, where I begin with a fully loaded general model that includes all my variables of interest.

Backward Elimination works as follows:

  1. Select a significance level Α against which you test your hypothesis to determine whether a variable should stay in the model
  2. Fit the model with all predictor variables
  3. Evaluate the p-values of the βj coefficients; if the greatest p-value is > Α, progress to step 4, otherwise you have your final model
  4. Remove the predictor identified in step 3
  5. Fit the model again, this time without the removed variable, and cycle back to step 3

So, without further delay let us build this fully loaded generalized model using statsmodels following the above steps.

# import the relevant module
import statsmodels.api as sm

# separate out my predictor variables (X) from my outcome variable (y)
X = df2[predictors]  
y = df2['meantempm']

# Add a constant to the predictor variable set to represent the β0 intercept
X = sm.add_constant(X)  
X.iloc[:5, :5]  # note: .ix is deprecated/removed in newer pandas; .iloc selects by position
            const  meantempm_1  meantempm_2  meantempm_3  mintempm_1
date
2015-01-04    1.0         -4.0         -6.0         -6.0       -13.0
2015-01-05    1.0        -14.0         -4.0         -6.0       -18.0
2015-01-06    1.0         -9.0        -14.0         -4.0       -14.0
2015-01-07    1.0        -10.0         -9.0        -14.0       -14.0
2015-01-08    1.0        -16.0        -10.0         -9.0       -19.0
# (1) select a significance value
alpha = 0.05

# (2) Fit the model
model = sm.OLS(y, X).fit()

# (3) evaluate the coefficients' p-values
model.summary()  

The summary() call will produce the following data in your Jupyter notebook:

OLS Regression Results
Dep. Variable: meantempm R-squared: 0.895
Model: OLS Adj. R-squared: 0.893
Method: Least Squares F-statistic: 462.7
Date: Thu, 16 Nov 2017 Prob (F-statistic): 0.00
Time: 20:55:25 Log-Likelihood: -2679.2
No. Observations: 997 AIC: 5396.
Df Residuals: 978 BIC: 5490.
Df Model: 18
Covariance Type: nonrobust

 

coef std err t P>|t| [0.025 0.975]
const 1.0769 0.526 2.049 0.041 0.046 2.108
meantempm_1 0.1047 0.287 0.364 0.716 -0.459 0.669
meantempm_2 0.3512 0.287 1.225 0.221 -0.211 0.914
meantempm_3 -0.1084 0.286 -0.379 0.705 -0.669 0.453
mintempm_1 0.0805 0.149 0.539 0.590 -0.213 0.373
mintempm_2 -0.2371 0.149 -1.587 0.113 -0.530 0.056
mintempm_3 0.1521 0.148 1.028 0.304 -0.138 0.443
meandewptm_1 -0.0418 0.138 -0.304 0.761 -0.312 0.228
meandewptm_2 -0.0121 0.138 -0.088 0.930 -0.282 0.258
meandewptm_3 -0.0060 0.137 -0.044 0.965 -0.275 0.263
maxdewptm_1 -0.1592 0.091 -1.756 0.079 -0.337 0.019
maxdewptm_2 -0.0113 0.091 -0.125 0.900 -0.189 0.166
maxdewptm_3 0.1326 0.089 1.492 0.136 -0.042 0.307
mindewptm_1 0.3638 0.084 4.346 0.000 0.200 0.528
mindewptm_2 -0.0119 0.088 -0.136 0.892 -0.184 0.160
mindewptm_3 -0.0239 0.086 -0.279 0.780 -0.192 0.144
maxtempm_1 0.5042 0.147 3.438 0.001 0.216 0.792
maxtempm_2 -0.2154 0.147 -1.464 0.143 -0.504 0.073
maxtempm_3 0.0809 0.146 0.555 0.579 -0.205 0.367

 

Omnibus: 13.252 Durbin-Watson: 2.015
Prob(Omnibus): 0.001 Jarque-Bera (JB): 17.097
Skew: -0.163 Prob(JB): 0.000194
Kurtosis: 3.552 Cond. No. 291.

Ok, I recognize that the call to summary() just barfed out a whole lot of information onto the screen. Do not get overwhelmed! We are only going to focus on about 2-3 values in this article:

  1. P>|t| - this is the p-value I mentioned above that I will be using to evaluate the hypothesis test. This is the value we are going to use to determine whether to eliminate a variable in this step-wise backward elimination technique.
  2. R-squared - a measure that states how much of the overall variance in the outcome our model can explain
  3. Adj. R-squared - the same as R-squared but, for multiple linear regression, a penalty is applied based off the number of variables included, to account for overfitting.

This is not to say that the other values in this output are without merit, quite the contrary. However, they touch on the more esoteric idiosyncrasies of linear regression which we simply don't have the time to get into now. For a full explanation of them I will defer you to an advanced regression textbook such as Kutner's Applied Linear Regression Models, 5th Ed. as well as the statsmodels documentation.

# (3) cont. - Identify the predictor with the greatest p-value and assess if it's > our selected alpha.
#             based off the table it is clear that meandewptm_3 has the greatest p-value and that it is
#             greater than our alpha of 0.05

# (4) - Use pandas drop function to remove this column from X
X = X.drop('meandewptm_3', axis=1)

# (5) Fit the model 
model = sm.OLS(y, X).fit()

model.summary()  
OLS Regression Results
Dep. Variable: meantempm R-squared: 0.895
Model: OLS Adj. R-squared: 0.893
Method: Least Squares F-statistic: 490.4
Date: Thu, 16 Nov 2017 Prob (F-statistic): 0.00
Time: 20:55:41 Log-Likelihood: -2679.2
No. Observations: 997 AIC: 5394.
Df Residuals: 979 BIC: 5483.
Df Model: 17
Covariance Type: nonrobust

 

coef std err t P>|t| [0.025 0.975]
const 1.0771 0.525 2.051 0.041 0.046 2.108
meantempm_1 0.1040 0.287 0.363 0.717 -0.459 0.667
meantempm_2 0.3513 0.286 1.226 0.220 -0.211 0.913
meantempm_3 -0.1082 0.286 -0.379 0.705 -0.669 0.452
mintempm_1 0.0809 0.149 0.543 0.587 -0.211 0.373
mintempm_2 -0.2371 0.149 -1.588 0.113 -0.530 0.056
mintempm_3 0.1520 0.148 1.028 0.304 -0.138 0.442
meandewptm_1 -0.0419 0.137 -0.305 0.761 -0.312 0.228
meandewptm_2 -0.0121 0.138 -0.088 0.930 -0.282 0.258
maxdewptm_1 -0.1592 0.091 -1.757 0.079 -0.337 0.019
maxdewptm_2 -0.0115 0.090 -0.127 0.899 -0.189 0.166
maxdewptm_3 0.1293 0.048 2.705 0.007 0.036 0.223
mindewptm_1 0.3638 0.084 4.349 0.000 0.200 0.528
mindewptm_2 -0.0119 0.088 -0.135 0.892 -0.184 0.160
mindewptm_3 -0.0266 0.058 -0.456 0.648 -0.141 0.088
maxtempm_1 0.5046 0.146 3.448 0.001 0.217 0.792
maxtempm_2 -0.2154 0.147 -1.465 0.143 -0.504 0.073
maxtempm_3 0.0809 0.146 0.556 0.579 -0.205 0.367

 

Omnibus: 13.254 Durbin-Watson: 2.015
Prob(Omnibus): 0.001 Jarque-Bera (JB): 17.105
Skew: -0.163 Prob(JB): 0.000193
Kurtosis: 3.553 Cond. No. 286.

Out of respect for your reading time, and in an attempt to keep the article to a reasonable length, I am going to omit the remaining elimination cycles required to build each new model, evaluate p-values and remove the least significant variable. Instead I will jump right to the last cycle and provide you with the final model. After all, the main goal here was to describe the process and the reasoning behind it.
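
For the curious, those remaining cycles follow exactly the same pattern and could be automated. The following is only a rough sketch of that loop (not the exact code used for this article); it assumes X (still containing the 'const' column) and y are defined as above.

import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value until all are <= alpha."""
    X = X.copy()
    while True:
        model = sm.OLS(y, X).fit()
        # model.pvalues is a pandas Series indexed by column name;
        # keep the constant out of consideration for removal
        pvalues = model.pvalues.drop('const', errors='ignore')
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            return model, X
        X = X.drop(worst, axis=1)

final_model, X_reduced = backward_eliminate(X, y)
final_model.summary()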

Below you will find the output from the final model I converged on after applying the backward elimination technique. You can see from the output that all the remaining predictors have p-values significantly below our Α of 0.05. Another thing worthy of some attention is the pair of R-squared values in the final output. Two things to note here: (1) the R-squared and Adj. R-squared values are both equal, which suggests there is minimal risk that our model is being overfitted by excessive variables, and (2) the value of 0.894 is interpreted such that our final model explains about 90% of the observed variation in the outcome variable, the "meantempm".

model = sm.OLS(y, X).fit()  
model.summary()  
OLS Regression Results
Dep. Variable: meantempm R-squared: 0.894
Model: OLS Adj. R-squared: 0.894
Method: Least Squares F-statistic: 1196.
Date: Thu, 16 Nov 2017 Prob (F-statistic): 0.00
Time: 20:55:47 Log-Likelihood: -2681.7
No. Observations: 997 AIC: 5379.
Df Residuals: 989 BIC: 5419.
Df Model: 7
Covariance Type: nonrobust

 

coef std err t P>|t| [0.025 0.975]
const 1.1534 0.411 2.804 0.005 0.346 1.961
mintempm_1 0.1310 0.053 2.458 0.014 0.026 0.236
mintempm_2 -0.0964 0.037 -2.620 0.009 -0.169 -0.024
mintempm_3 0.0886 0.041 2.183 0.029 0.009 0.168
maxdewptm_1 -0.1939 0.047 -4.117 0.000 -0.286 -0.101
maxdewptm_3 0.1269 0.040 3.191 0.001 0.049 0.205
mindewptm_1 0.3352 0.051 6.605 0.000 0.236 0.435
maxtempm_1 0.5506 0.024 22.507 0.000 0.503 0.599

 

Omnibus: 13.123 Durbin-Watson: 1.969
Prob(Omnibus): 0.001 Jarque-Bera (JB): 16.871
Skew: -0.163 Prob(JB): 0.000217
Kurtosis: 3.548 Cond. No. 134.

Using SciKit-Learn's LinearRegression Module to Predict the Weather

Now that we have gone through the steps to select statistically meaningful predictors (features), we can use SciKit-Learn to create a prediction model and test its ability to predict the mean temperature. SciKit-Learn is a very well established machine learning library that is widely used in both industry and academia. One thing that is very impressive about SciKit-Learn is that it maintains a very consistent API of "fit", "predict", and "score" across many numerical techniques and algorithms which makes using it very simple. In addition to this consistent API design, SciKit-Learn also comes with several useful tools for processing data common to many machine learning projects.

We will start by using SciKit-Learn to split our dataset into training and testing sets by importing the train_test_split() function from the sklearn.model_selection module. I will split the dataset into 80% training and 20% testing and assign a random_state of 12 to ensure you will get the same random selection of data as I do. This random_state parameter is very useful for reproducibility of results.

from sklearn.model_selection import train_test_split  
# first remove the const column because unlike statsmodels, SciKit-Learn will add that in for us
X = X.drop('const', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)  

The next action to take is to build the regression model using the training dataset. To do this I will import and use the LinearRegression class from the sklearn.linear_model module. As mentioned previously, scikit-learn scores major usability bonus points by implementing a common fit() and predict() API across its numerous numerical techniques which makes using the library very user friendly.

from sklearn.linear_model import LinearRegression  
# instantiate the regressor class
regressor = LinearRegression()

# build the model by fitting the regressor to the training data
regressor.fit(X_train, y_train)

# make a prediction set using the test set
prediction = regressor.predict(X_test)

# Evaluate the prediction accuracy of the model
from sklearn.metrics import mean_absolute_error, median_absolute_error  
print("The Explained Variance: %.2f" % regressor.score(X_test, y_test))  
print("The Mean Absolute Error: %.2f degress celcius" % mean_absolute_error(y_test, prediction))  
print("The Median Absolute Error: %.2f degrees celcius" % median_absolute_error(y_test, prediction))  
The Explained Variance: 0.90  
The Mean Absolute Error: 2.68 degress celcius  
The Median Absolute Error: 2.22 degrees celcius  

As you can see in the few lines of code above using scikit-learn to build a Linear Regression prediction model is quite simple. This is truly where the library shines in its ability to easily fit a model and make predictions about an outcome of interest.

To gain an interpretative understanding of the model's validity I used the regressor model's score() function to determine that the model is able to explain about 90% of the variance observed in the outcome variable, mean temperature. Additionally, I used the mean_absolute_error() and median_absolute_error() functions from the sklearn.metrics module to determine that on average the predicted value is about 3 degrees Celsius off and half of the time it is off by about 2 degrees Celsius.

Resources

Want to learn the tools, machine learning techniques, and data analysis used in this tutorial? Here are a few great resources to get you started:

Conclusion

In this article, I demonstrated how to use the Linear Regression Machine Learning algorithm to predict future mean weather temperatures based off the data collected in the prior article. I demonstrated how to use the statsmodels library to select statistically significant predictors based off of sound statistical methods. I then utilized this information to fit a prediction model based off a training subset using Scikit-Learn's LinearRegression class. Using this fitted model I could then predict the expected values based off of the inputs from a testing subset and evaluate the accuracy of those predictions, which turned out to be reasonable.

I would like to thank you for reading my article and I hope you look forward to the upcoming final article in this machine learning series where I describe how to build a Neural Network to predict the weather temperature.

November 20, 2017 04:40 PM


Doug Hellmann

cmd — Line-oriented Command Processors — PyMOTW 3

The cmd module contains one public class, Cmd , designed to be used as a base class for interactive shells and other command interpreters. By default it uses readline for interactive prompt handling, command line editing, and command completion. Read more… This post is part of the Python Module of the Week series for Python …
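
A minimal sketch of the pattern (a hypothetical shell, not an example taken from the post) looks something like this: subclass cmd.Cmd, add do_* methods for each command, and call cmdloop().

import cmd

class GreeterShell(cmd.Cmd):
    """A tiny interactive shell: 'greet NAME' prints a greeting, 'exit' quits."""
    prompt = '(greeter) '

    def do_greet(self, line):
        """greet [name] -- print a greeting"""
        print('Hello,', line or 'world')

    def do_exit(self, line):
        """exit -- leave the shell"""
        return True  # returning a truthy value stops cmdloop()

if __name__ == '__main__':
    GreeterShell().cmdloop()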

November 20, 2017 02:00 PM


Mike Driscoll

PyDev of the Week: Adrian Rosebrock

This week we welcome Adrian Rosebrock (@PyImageSearch) as our PyDev of the Week. Adrian is the author of several books on Python and OpenCV. He is also the author of PyImageSearch.com, a very popular Python blog that focuses on computer vision. Let’s take some time to get to know him a bit better!

Can you tell us about yourself and PyImageSearch?

Hi Mike, thank you for the opportunity to be interviewed on PyDev of the Week.

My name is Adrian Rosebrock. I have Ph.D in computer science with a focus in computer vision and machine learning from the University of Maryland, Baltimore County (UMBC).

I blog over at PyImageSearch.com about Python, computer vision, and deep learning. Over the past few years running PyImageSearch I have written a handful of books/courses, including:

What are your hobbies?

Fitness is a big part of my life so when I’m not working or coding I spend a lot of my time at the gym.

I’ve always been interested in sports and physical activity. Growing up I played soccer, baseball, basketball, and a handful of other sports.

By my junior year of college I started getting into weight lifting. A buddy in my operating systems class offered to take me to the weight room and teach me a few lifts.

Since then I’ve been focusing less on team sports and more on lifting. I do a combination of olympic lifting, CrossFit, and personal training. I like the constant variation and challenging my body.

Otherwise, I’m a big RPG fan, especially the old school RPGs on the SNES. I’ve played through the Final Fantasy series way too many times to count.

When did you start using Python?

The first time I used Python was when I was about 13 years old, so about 15 years ago, sometime around the Python 2.1/Python 2.2 release.

I was just getting into programming, and to be honest, I didn’t like Python at first. I didn’t fully understand the point of Read-Eval-Print Loop (REPL) and I was still trying to get used to the command line environment and Unix-environments.

At that time Python was a little too complicated for my (extremely) novice programming skills. I continued to play with Python as I learned to code in other languages, but it wasn’t until sophomore year of college that I really got back into Python.

Since then, Python has been my programming language of choice.

What other programming languages do you know and which is your favorite?

There is a difference between knowing the syntax of a language versus knowing the libraries and ecosystem associated with a language.

While in graduate school I did a lot of work in big data, mainly with Hadoop clusters — this required me to do a lot of coding in Java. Java was also used for most of the 200 and 300 level classes at UMBC (especially data structures, although now UMBC uses Python).

My first job out of college also required me to use Java on Hadoop clusters and Accumulo databases. Because of this, I became *very* familiar with Java libraries and building scalable data workflows and predictive models.

I also know PHP far better than I care to admit. I picked up the language as it had C-like syntax but with a scripting language flavor. This also opened the door for my early career in web development.

While in my undergraduate years I worked for RateMyTeachers.com, starting off as a junior developer, working my way up to senior developer, and eventually becoming the CTO.

Along the way I rewrote the site from scratch in PHP and the CodeIgniter framework (this is well before Laravel or any other mature, decent PHP framework that could easily scale existed). This helped me learn a lot about scalability and web applications.

I also know C/C++ fairly well, but nowadays I mainly only use them for optimizing various Python routines.

At this point in my life I can say that Python is undoubtedly my favorite programming language. I use it for everything from production code to one-off scripts that need to parse a log file. It truly is a versatile language.

What projects are you working on now?

Besides the handful of consulting/contracting jobs I’m working on, my primary focus is PyImageSearch and creating quality computer vision and deep learning tutorials centered around the Python programming language.

I just released my new book, Deep Learning for Computer Vision with Python, earlier this month.

Throughout the rest of this year I'll be finalizing the details on PyImageConf, a conference I'll be hosting in San Francisco, CA in 2018.

I also have a few other books in the works for 2018 as well.

Which Python libraries are your favorite (core or 3rd party)?

OpenCV, while not a pure Python library, provides Python bindings. Without OpenCV computer vision would not be as easily accessible. I use OpenCV daily.

The same can be said for both NumPy and SciPy — these libraries are *heavily* responsible for researchers, practitioners, and data scientists moving to the Python programming language. NumPy and SciPy play a pivotal role in every single computer vision + machine learning script I write.

Scikit-learn and scikit-image are another set of libraries I use daily. Without scikit-learn I would have really struggled to finish the research associated with my dissertation.

Finally, I cannot say enough positive things about Keras. Just as scikit-learn has made machine learning accessible, Keras has done the same for deep learning.

How did you become an author?

I became an author at the exact same time I became an entrepreneur. As a young kid, perhaps only seven or eight years old, I was an avid reader. I read everything I could get my hands on.

At some point early in my childhood I decided I wanted to write my own books.

And that’s exactly what I did.

I would take 20ish index cards, write my own story on the index cards (normally focused on horror stories), and then design my own cover using thick cardboard. I would color in the cover using crayons. And then finally staple it all together to create a book.

Once the book was done I would sell it to my parents for a quarter.

When I was a kid I wanted to be either (1) an author or (2) an architect. In many ways I have combined both of them as an adult.

Can you describe some of the things you’ve learned while writing your books?

I’ve been developing software as a career since I was 17 years old but it wasn’t until my mid-20’s that I started writing computer vision books.

Writing wasn’t a challenge for me — I had years of experience writing, especially considering my dissertation.

But marketing and selling a book was brand new territory. At first I didn’t know what I was doing. There was a lot of blind experimentation going on.

I was (pleasantly) surprised to discover that shipping and selling a book isn’t much different than shipping a piece of software: initial user feedback matters.

In my case, I wrote drafts of books and released them before they were 100% finalized. This helped me get initial reader feedback faster and then iterate on the book; the exact process you would take if you were validating a SaaS venture.

It turned out that the skills I learned working for small, startup companies in the software space were directly transferable to writing a book.

Is there anything else you’d like to say?

I just wanted to say thank you again for the opportunity and interview.

If you want to learn more about me, please head over to PyImageSearch.com. You can also follow me on Twitter.

And if you’re interested in learning more about my new book, Deep Learning for Computer Vision with Python, just click here and grab your free set of sample chapters and Table of Contents.

Thanks for doing the interview!

November 20, 2017 01:30 PM


Nigel Babu

Remote Triggering a Jenkins Job

All I wanted to do was trigger a job on another Jenkins instance. Here’s all the things I tried.

Pretty much what I kept hitting

It turns out that what I thought was working wasn’t actually working. I wasn’t passing parameters to my remote job. It was using the defaults. The fix for this problem is the most hilarious that I’ve seen. Turns out if you use the crumbs API and Jenkins auth, you don’t need the token. This was a bit of a lovely discovery after all this pain.
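
The post doesn’t include the code, but a rough sketch of the approach it describes (fetch a CSRF crumb, then POST to the job’s buildWithParameters endpoint with basic auth) might look something like the following; the URL, job name, credentials, and parameter are placeholders, not values from the post.

import requests

JENKINS = 'https://jenkins.example.com'   # hypothetical remote Jenkins
AUTH = ('myuser', 'my-api-token')         # Jenkins user + API token

# fetch a crumb so the POST is accepted (no separate build token needed)
crumb = requests.get(JENKINS + '/crumbIssuer/api/json', auth=AUTH).json()
headers = {crumb['crumbRequestField']: crumb['crumb']}

# trigger the parameterized job with an explicit parameter
resp = requests.post(
    JENKINS + '/job/my-remote-job/buildWithParameters',
    auth=AUTH,
    headers=headers,
    params={'BRANCH': 'master'},
)
resp.raise_for_status()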

Now I need to figure out how to follow the Jenkins job, i.e. get the console output on remote Jenkins in real time. I found a python script that does exactly that. I tested it and it works.

November 20, 2017 01:00 PM


Hynek Schlawack

Python Hashes and Equality

Most Python programmers don’t spend a lot of time thinking about how equality and hashing works. It usually just works. However, there are quite a few gotchas and edge cases that can lead to subtle and frustrating bugs once one starts to customize their behavior – especially if the rules on how they interact aren’t understood.
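
One example of the kind of gotcha involved (a minimal sketch, not an excerpt from the post): in Python 3, defining __eq__ without also defining __hash__ implicitly sets __hash__ to None, which makes instances unhashable.

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    # customizing equality without a matching __hash__ disables hashing
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
print(p == Point(1, 2))   # True
try:
    {p}                   # putting p in a set requires hashing
except TypeError as e:
    print(e)              # unhashable type: 'Point'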

November 20, 2017 06:45 AM

November 19, 2017


Toshio Kuratomi

My first python3 script

I’ve been hacking on other people’s python3 code for a while doing porting and bugfixes but so far my own code has been tied to python2 because of dependencies. Yesterday I ported my first personal script from python2 to python3. This was just a simple, one file script that hacks together a way to track how long my kids are using the computer and log them off after they’ve hit a quota. The kind of thing that many a home sysadmin has probably hacked together to automate just a little bit of their routine. For that use, it seemed very straightforward to make the switch. There were only four changes in the language that I encountered when making the transition:

When I’ve worked on porting libraries that needed to maintain some form of compat between python2 (older versions… no nice shiny python-2.7 for you!) and python3 these concerns were harder to address as there needed to be two versions of the code (usually, maintained via automatic build-time invocation of 2to3). With this application/script, throwing out python2 compatibility was possible so switching over was just a matter of getting an error when the code executed and switching the syntax over.
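
For readers who haven’t done such a port, here are a few generic examples of the kind of syntax changes involved; these are illustrative only and not necessarily the specific changes the author encountered.

# print is a function, not a statement
print('logging out user', 'alice')

# integer division no longer truncates; use // for the old behaviour
minutes_used = 125
hours = minutes_used // 60    # 2 (int)
exact = minutes_used / 60     # 2.083... (float)

# dict.keys()/values()/items() return views, not lists
quota = {'alice': 60, 'bob': 90}
users = list(quota.keys())    # wrap in list() if a real list is needed

# exception handling uses 'as'
try:
    open('/hypothetical/usage.log')
except OSError as exc:
    print('could not open log:', exc)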

This script also didn’t use any modules that had either not been ported, been dropped, or been restructured in the switch from python2 to python3. Unlike my day job where urllib’s restructuring would affect many of the things that we’ve written and lack of ported third-party libraries would prevent even more things from being ported, this script (and many other of my simple-home-use scripts) didn’t require any changes due to library changes.

Verdict? Within these constraints, porting to python3 was as painless as porting between some python2.x releases has been. I don’t see any reason I won’t use python3 for new programming tasks like this. I’ll probably port other existing scripts as I need to enhance them.


November 19, 2017 02:32 PM


Import Python

#151: Augmented reality, Profiling CPython, ptracer by Pinterest and more

Worthy Read

This post chronicles a mobile app development team’s journey to continuous delivery, the challenges along the way, how they overcame them and their thoughts beyond continuous delivery. Check it out.
advert

You may (or may not) have heard of or seen the augmented reality Invizimals video game or the Topps 3D baseball cards. The main idea is to render in the screen of a tablet, PC or smartphone a 3D model of a specific figure on top of a card according to the position and orientation of the card. 
opencv

Must read thread.
core-python

Instagram employs Python in one of the world’s largest settings, using it to implement the “business logic” needed to serve 800 million monthly active users. We use the reference implementation of Python, known as CPython, as the runtime used to execute our code. As we’ve grown, the number of machines required to serve our users has become a significant contributor to our growing infrastructure needs. These machines are CPU bound, and as a result, we keep a close eye on the efficiency of the code we write and deploy, and have focused on building tools to detect and diagnose performance regressions. This continues to serve us well, but the projected growth of our web tier led us to investigate sources of inefficiency in the runtime itself.
cpython
,
production

In this post, I will outline a strategy to ‘learn pandas’. For those who are unaware, pandas is the most popular library in the scientific Python ecosystem for doing data analysis. Pandas
pandas

Making Pinterest faster and more reliable is a constant focus for our engineering team and using hardware resources more efficiently is a major part of this effort. Improving efficiency and reliability requires good diagnostic tools, and today we’re announcing our newest tracing tool: ptracer, which provides granular syscall tracing in Python programs. In this post we’ll cover background on Pinterest’s codebase, why we needed a better tracer and how ptracer can help solve certain engineering problems.
Pinterest

The Definitive Guide to Apache Spark. Download today!
advert

A "wat" is what I call a snippet of code that demonstrates a counterintuitive edge case of a programming language. (The name comes from this excellent talk by Gary Bernhardt.) If you're not familiar with the language, you might conclude that it's poorly designed when you see a wat. Often, more context about the language design will make the wat seem reasonable, or at least justified.
core-python

code snippets

For statically checking Python code, mypy is great, but it only works after you have added type annotations to your codebase. When you have a large codebase, this can be painful. At Dropbox we’ve annotated over 1.2 million lines of code (about 20% of our total Python codebase), so we know how much work this can be. It’s worth it though: the payoff is fantastic.
mypy

Django 2.0 release candidate 1 is the final opportunity for you to try out the assortment of new features before Django 2.0 is released.
django

Tangent is a new, free, and open-source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models.
neural networks

Today, we’re going to do some splunking within the deep, dark place which is your browser history.
pandas
,
codesnippets

DEAL WITH IT is a meme where glasses fly in from off the screen, and on to a user’s face. The best instances of this meme do so in a unique way. Today we’ll write an automatic meme generator, using any static image with faces as our input. This code makes a great starting point for a meme API, or building your own animated version using video input.
meme

In this article, we going to look at Spark Streaming and this is one of several other libraries exposed by Spark Platform.
spark
,
code snippets

What do words look like as colours? What would Shakespear’s sonnets look like as colours? This mini project renders text as colours using Python and saves them in a grid as an image.
image processing
,
art

Building your first poetical chat-bot from scratch
telegram


Embed docs directly on your website with a few lines of code. Test the API for free.
advert


Jobs

Piscataway Township, NJ, United States
Participate in testing RESTful APIs using Postman and SoapUI. Study and analyse the project requirements and ensure the delivered code meets the client's needs. Performance and stress testing through JMeter.


Projects

pyannotate - 383 Stars, 8 Fork
Auto-generate PEP-484 annotations

tosheets - 156 Stars, 1 Fork
Send your stdin to google sheets.

YOLO_Object_Detection - 117 Stars, 38 Fork
This is the code for "YOLO Object Detection" by Siraj Raval on Youtube

django-subadmin - 59 Stars, 2 Fork
A special kind of ModelAdmin that allows it to be nested within another ModelAdmin.

csvs-to-sqlite - 36 Stars, 0 Fork
Convert CSV files into a SQLite database.

automatic-memes - 13 Stars, 3 Fork
Automatic Memes in Python with Face Detection.

smop - 0 Stars, 0 Fork
Small Matlab to Python compiler (Please open an issue if you are using this program)

collab-ide - 0 Stars, 0 Fork
An online real-time collaborative IDE for teams to work together efficiently in a fast-paced project.

November 19, 2017 02:30 PM


David MacIver

Shaping the World

I gave a keynote at PyCon UK recently – it was mostly about the book “Seeing Like A State” and what software developers can learn from it about our effect on the world.

I’ve been meaning to edit it up into a blog post, and totally failing to get around to it, so in lieu of that, here’s my almost entirely unedited script – it’s not that close to the version I actually got up on stage and said, because I saw 800 people looking at me and panicked and all the words went out of my head (apparently this was not at all obvious to people), but the general themes of the two are the same and neither is strictly better than the other – if you prefer text like a sensible person, read this post. If you prefer video, the talk is supposedly pretty good based on the number of people who have said nice things to me about it (I haven’t been able to bear to watch it yet).

The original slides are available here (warning: Don’t load on mobile data. They’re kinda huge). I’ve inserted a couple of the slide images into the post where the words don’t make sense without the accompanying image, but otherwise decided not to clutter the text with images (read: I was too lazy).


Hi, I’m David MacIver. I’m here to talk to you today about the ways which we, as software developers, shape the world, whether we want to or not.

This is a talk about consequences. Most, maybe all, of you are good people. The Python community is great, but I’d be saying that anywhere. Most people are basically good, even though it doesn’t look that way sometimes. But unless you know about what effect your actions have, all the good intentions in the world won’t help you, and good people can still make the world a worse place. I’m going to show you some of the ways that I think we’re currently doing that.

The tool I’m going to use to do this is cultural anthropology: The study of differences and similarities between different cultures and societies. I’m not a cultural anthropologist. I’ve never even taken a class on it, I’ve just read a couple of books. But I wish I’d read those books earlier, and I’d like to share with you some of the important lessons for software development that I’ve drawn from them.

In particular I’d like to talk to you about the work of James C. Scott, and his book “Seeing like a state”. Seeing like a state is about the failure modes of totalitarian regimes, and other attempts to order human societies, which are surprisingly similar to some of the failure modes of software projects. I do recommend reading the book. If you’re like me and not that used to social science writing, it’s a bit of a heavy read, but it’s worth doing. But for now, I’ll highlight what I think are the important points.

Unsorted binary tree

Binary Tree by Derrick Coetzee

Before I talk about totalitarian states, I’d like to talk about trees. If you’re a computer scientist, or have had an unfortunate developer job interview recently, a tree is probably something like this. It has branches and leaves, and not much else.

If you’re anyone else, a tree is rather different. It’s a large living organism. It has leaves and branches, sure, but it also has a lot of other context and content. It provides shade, maybe fruit, it has a complex root system. It’s the center of its own little ecosystem, providing shelter and food for birds, insects, and other animals. Compared to the computer scientist’s view of a tree it’s almost infinitely complicated.

But there’s another simplifying view of a tree we could have taken, which is that of the professional forester. A tree isn’t a complex living organism, it’s just potential wood. The context is no longer relevant; all we really care about is the numbers – it costs this much to produce this amount of this grade of wood and, ultimately, this amount of money when you sell the wood.

This is a very profitable view of a tree, but it runs into some difficulties. If you look at a forest, it’s complicated. You’ve got lots of different types of trees. Some of them are useful, some of them are not – not all wood is really saleable, some trees are new and still need time to grow, trees are not lined up with each other so you have to navigate around ones you didn’t want. As well as the difficulty of harvesting, this also creates difficulty measuring – even counting the trees is hard because of this complexity, let alone more detailed accounting of when and what type of wood will be ready, so how can you possibly predict how much wood you’re going to harvest and thus plan around what profit you’re going to make? Particularly a couple of hundred years ago when wood was the basis of a huge proportion of the national economy, this was a big deal. We have a simple view of the outcomes we want, but the complex nature of reality fights back at our attempts to achieve that. So what are we going to do?

Well, we simplify the forest. If the difficulty in achieving our simple goals is that reality is too complicated, we make the reality simpler. As we cut down the forest, we replant it with easy to manage trees in easy to manage lines. We divide it into regions where all of the trees are of the same age. Now we have a relatively constant amount of wood per unit of area, and we can simply just log an entire region at once, and now our profits become predictable and, most importantly, high.

James Scott talks about this sort of thing as “legibility”. The unmanaged forest is illegible – we literally cannot read it, because it has far more complexity than we can possibly hope to handle – while, in contrast, the managed forest is legible – we’ve reshaped its world to be expressible in a small number of variables – basically just the land area, and the number of regions we’ve divided it into. The illegible world is unmanageable, while the legible world is manageable, and we can control it by adjusting a small number of parameters.

In a technical sense, legibility lets us turn our control over reality into optimisation problems. We have some small number of variables, and an outcome we want to optimise for, so we simply reshape the world by finding the values of those variables that maximize that outcome – our profits. And this works great – we have our new simple refined world, and we maximize our profit. Everyone is happy.

Oh, sure, there are all those other people who were using the forest who might not be entirely happy. The illegible natural forest contains fruit for gathering, brush to collect for firewood, animals for hunting, and a dozen other uses all of which are missing from our legible managed forest. Why? Well because those didn’t affect our profit. The natural behaviour of optimisation processes is to destroy everything in their path that isn’t deliberately preserved or directly required for their outcome. If the other use cases didn’t result in profit for us, they’re at best distractions or at worst impediments. Either way we get rid of them. But those only matter to the little people, so who cares? We’re doing great, and we’re making lots of money.

At least, for about eighty years, at which point all of the trees start dying. This really happened. These days, we’re a bit better at forest management, and have figured out more of which complexity is necessary and which we can safely ignore, but in early scientific forestry, about 200 years ago in Germany, they learned the hard way that a lot of things they had thought weren’t important really were. There was an entire complex ecological cycle that they’d ignored, and they got away with it for about 80 years because they had a lot of high quality soil left over from that period that they could basically strip mine for a while. But the health of the forest deteriorated over time as the soil got worse, and eventually the trees were unhealthy enough that they started getting sick. And because all of the trees were the same, when one got sick it spread like wildfire to the others. They called it Waldsterben – forest death.

The problem that the German scientific foresters ran into is that complex, natural, systems are often robust in ways that simple, optimised systems are not. They’ve evolved over time, with lots of fiddly little details that have occurred locally to adapt to and patch over problems. Much of that illegibility turns out not to be accidental complexity, but instead the adaptation that was required to make the system work at all. That’s not to say all complexity is necessary, or that there isn’t a simpler system that also works, but if the complexity is there, chances are we can’t just remove it without replacing it with something else and assume the system will keep working, even if it might look like it does for a while.

This isn’t actually a talk about trees, but it is a talk about complexity, and about simplification. And it’s a talk about what happens when we apply this kind of simplification process to people. Because it turns out that people are even more complicated than trees, and we have a long history of trying to fix that, to take complex, messy systems of people and produce nice, simple, well behaved social orders that follow straightforward rules.

This is what James Scott calls Authoritarian High-Modernism – the desire to force people to fit into some rational vision of the world. Often this is done for entirely virtuous reasons – many authoritarian high-modernist projects are utopian in nature – we want everyone to be happy and well fed and fulfilled in their lives. Often they are less virtuous – totalitarian regimes love forcing people into their desired mould. But virtuous or not, they often fail in the same way that early scientific forestry did. Seeing like a state has a bunch of good examples of this. I won’t go into them in detail, but here’s a few.

A picture of a building with multiple windows bricked up.

Portland Street, Southampton, England, by Gary Burt

An amusing example is buildings like this. Have you seen these? Do you know why there are these bricked up windows? Well it’s because of window taxes. A while back, income tax was very unpopular. Depending on who you ask, maybe it still is, but it was even more so back then. But the government wanted to extract money from its citizens. What could they do? Well, they could tax where people live by size – rich people live in bigger buildings – but houses are often irregularly shaped, so measuring the size of the house is hard. There is, however, a nice, simple, convenient proxy for it – the number of windows. So this is where window taxes come from – take the complex, messy realities of wealth, pick a simple proxy for them, and you end up taxing the number of windows. Of course what happens is that people brick up their windows to save on taxes. And then suffer health problems from lack of natural light and proper ventilation in their lives, which is less funny, but so it goes.

Another very classic example that also comes from taxation is the early history of the Cadastral, or land-use, map. We want to tax land use, so we need to know who owns the land. So we create these detailed land-use maps which say who owns what, and we tax them accordingly. This seems very straightforward, right? But in a traditional village this is nonsense. Most land isn’t owned by any single person – there are complex systems of shared usage rights. You might have commons on which anyone can graze their animals, but where certain fruit trees are owned, but everyone has the rights to use fallen fruit. It’s not that there aren’t notions of ownership per se, but they’re very fine grained and contextual, and they shift according to a complex mix of circumstance and need. The state doesn’t care. These complex shared ownerships are illegible, so we force people to conform instead to the legible idea of single people or families owning each piece of land. This is where a lot of modern notions of ownership come from by the way – the state created them so they could collect more tax.

And of course we have the soviet union’s program of farm collectivization, which has the state pushing things in entirely the opposite direction. People were operating small family owned farms, which were adapted to their local conditions and what grew well where they were. A lot of it was subsistence farming, particularly in lean times – when you had excess, you sold it. When you didn’t, you lived off the land. This was hard to manage if you’re a state who wants to appropriate large quantities of food to feed your army and decide who is deserving and who gets what. So they forcibly moved everyone to work on large, collective farms which grew what the state wanted, typically in large fields of monocultures that ignored the local conditions. From a perspective of producing enough food, this worked terribly. The large, collectivized farms produced less food less reliably than the more distributed, adapted, local farms. The result was famine which killed millions. But from the point of view of making the food supply legible, and allowing the state to control it, the system worked great, and the soviets weren’t exactly shy about killing millions of people, so the collectivization program was largely considered a success by them, though it did eventually slow and stop before they converted every farm.

But there’s another, more modern, example of all of these patterns. We have met the authoritarians and they are us. Tech may not look much like a state, even ignoring its strongly libertarian bent, but it has many of the same properties and problems, and every tech company is engaged in much the same goal as these states were: Making the world legible in order to increase profit.

Every company does this to some degree, but software is intrinsically a force for legibility. A piece of software has some representation of the part of the world that it interacts with, boiling it down to the small number of variables that it needs to deal with. We don’t necessarily make people conform to that vision, but we don’t have to – as we saw with the windows, people will shape themselves in response to the incentives we give them, as long as we are able to reward compliance with our vision or punish deviance from it.

When you hear tech companies talk about disruption, legibility is at the heart of what we’re doing. We talk about efficiency – taking these slow, inefficient, legacy industries and replacing them with our new, sleek, streamlined versions based on software. But that efficiency comes mostly from legibility – it’s not that we’ve got some magic wand that makes everything better, it’s that we’ve reduced the world to the small subset of it that we think of as the important bits, and discarded the old, illegible, reality as unimportant.

And that legibility we impose often maps very badly to the actual complexity of the world. You only have to look at the endless stream of “falsehoods programmers believe” articles to get a sense of how much of the world’s complexity we’re ignoring. It’s not just programmers of course – if anything the rest of the company is typically worse – but we’re still pretty bad. We believe falsehoods about names, but also gender, addresses, time, and many more.

This probably still feels like it’s not a huge problem. Companies are still not states. We’re not forcing things on anyone, right? If you don’t use our software, nobody is going to kick down your door and make you. Much of the role of the state is to hold a monopoly on the legitimate use of physical force, and we don’t have access to that. We like to pretend this makes some sort of moral difference. We’re just giving people things that they want, not forcing them to obey us.

Unfortunately, that is a fundamental misunderstanding of the nature of power. Mickey Mouse, despite his history of complicity in US racism, has never held a gun to anyone’s head and forced them to do his bidding, outside of a cartoon anyway. Nevertheless he is almost single-handedly responsible for reshaping US copyright law, and by extension copyright law across most of the world. When Mickey Mouse is in danger of going out of copyright, US copyright law mysteriously extends the length of time after the creator’s death that works stay in copyright. We now live in a period of eternal copyright, largely on the strength of the fact that kids like Mickey Mouse.

This is what’s called Soft Power. Conventional ideas of power are derived from coercion – you make someone do what you want – while soft power is power that you derive instead from appeal – People want to do what you want. There are a variety of routes to soft power, but there’s one that has been particularly effective for colonising forces, the early state, and software companies. It goes like this.

First you make them want what you have, then you make them need it.

The trick is to basically ease people in – you give them a hook that makes your stuff appealing, and then once they’re used to it they can’t do without it. Either because it makes their life so much better, or because in the new shape of the world doing without it would make their life so much worse. These aren’t the same thing. There are some common patterns for this, but there are three approaches that have seen a lot of success that I’d like to highlight.

The first is that you create an addiction. You sell them alcohol, or you sell them heroin. The first one’s free – just a sampler, a gift of friendship. But hey, that was pretty good. Why not have just a little bit more… Modern tech companies are very good at this. There’s a whole other talk you could give about addictive behaviours in modern software design. But, for example, I bet a lot of you find yourselves compulsively checking Twitter. You might not want to – you might even want to quit it entirely – but the habit is there. I’m certainly in this boat. That’s an addictive behaviour right there, and perhaps it wasn’t deliberately created, but it sure looks like it was.

The second strategy is that you can sell them guns. Arms dealing is great for creating dependency! You get to create an arms race by offering them to both sides, each side buys it for fear that the other one will, and now they have to keep buying from you because you’re the only one who can supply them bullets. Selling advertising and social media strategies to companies works a lot like this.

The third is you can sell them sugar. It’s cheap and delicious! And is probably quite bad for you and certainly takes over your diet, crowding out other more nutritious options. Look at companies who do predatory pricing, like Uber. It’s great – so much cheaper than existing taxis, and way more convenient than public transport, right? Pity they’re going to hike the prices way up when they’ve driven the competition into the ground and want to stop hemorrhaging money.

And we’re going to keep doing this, because this is the logic of the market. If people don’t want and need our product, they’re not going to use it, we’re not going to make money, and our company will fail and be replaced by one with no such qualms. The choice is not whether or not to exert soft power, it’s how and to what end.

I’m making this all sound very bleak, as if the things I’m talking about were uniformly bad. They’re not. Soft power is just influence, and it’s what happens every day as we interact with people. It’s an inevitable part of human life. Legibility is just an intrinsic part of how we come to understand and manipulate the world, and is at the core of most of the technological advancements of the last couple of centuries. Legibility is why we have only a small number of standardised weights and measures instead of a different notion of a pound or a foot for every village.

Without some sort of legible view of the world, nothing resembling modern civilization would be possible and, while modern civilization is not without its faults, on balance I’m much happier for it existing than not.

But civilizations fall as well as rise, and things that seemed like they were a great idea in the short term often end in forest death and famine. Sometimes it turns out that what we were disrupting was our life support system.

And on that cheerily apocalyptic note, I’d like to conclude with some free advice on how we can maybe try to do a bit better on that balancing act. It’s not going to single handedly save the world, but it might make the little corners of it that we’re responsible for better.

My first piece of free advice is this: Richard Stallman was right. Proprietary software is a harbinger of the end times, and an enemy of human flourishing. … don’t worry, I don’t actually expect you to follow this one. Astute observers will notice that I’m actually running Windows on the computer I’m using to show these slides, so I’m certainly not going to demand that you go out and install Linux, excuse me, GNU/Linux, and commit to a world of 100% free software all the time. But I don’t think this point of view is wrong either. As long as the software we use is not under our control, we are being forced to conform to someone else’s idea of the legible world. If we want to empower users, we can only do that with software they can control. Unfortunately I don’t really know how to get there from here, but a good start would be to be better about funding open source.

In contrast, my second piece of advice is one that I really do want you all to follow. Do user research, listen to what people say, and inform your design decisions based on it. If you’re going to be forming a simplified model of the world, at least base it on what’s important to the people who are going to be using your software.

And finally, here’s the middle ground advice that I’d really like you to think about. Stop relying on ads. As the saying goes, if your users aren’t paying for it, they’re not the customer, they’re the product. The product isn’t a tree, it’s planks. It’s not a person, it’s data. Ads and adtech are one of the most powerful forces for creating a legible society, because they are fundamentally reliant on turning a complex world of people and their interactions into simple lists of numbers, then optimising those numbers to make money. If we don’t want our own human shaped version of forest death, we need to figure out what important complexity we’re destroying, and we need to stop doing that.

And that is all I have to say to you today. I won’t be taking questions, but I will be around for the rest of the conference if you want to come talk to me about any of this. Thank you very much.

November 19, 2017 10:07 AM

November 18, 2017


Weekly Python StackOverflow Report

(c) stackoverflow python report

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2017-11-18 20:59:57 GMT


  1. Python tuple unpacking in return statement - [21/1]
  2. How to get random value of attribute of Enum on each iteration? - [13/7]
  3. How to use functional programming to iterate and find maximum product of five consecutive numbers in a list? - [12/7]
  4. Which operator (+ vs +=) should be used for performance? (In-place Vs not-in-place) - [10/1]
  5. Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode - [8/2]
  6. Explanation for a code needed : Appending the values to list in python - [7/1]
  7. Python - how to multiply characters in string by number after character - [6/4]
  8. pandas.eval with a boolean series with missing data - [6/1]
  9. How to count the most frequent letter in a string? - [5/6]
  10. what happened when using the same variable in two layer loops in python? - [5/6]

November 18, 2017 08:59 PM


Toshio Kuratomi

Python testing: Asserting raw byte output with pytest

The Code to Test

When writing code that can run on both Python2 and Python3, I’ve sometimes found that I need to send and receive bytes to stdout. Here’s some typical code I might write to do that:

# byte_writer.py
import sys
import six

def write_bytes(some_bytes):
    # some_bytes must be a byte string
    if six.PY3:
        stdout = sys.stdout.buffer
    else:
        stdout = sys.stdout
    stdout.write(some_bytes)

if __name__ == '__main__':
    write_bytes(b'\xff')

In this example, my code needs to write a raw byte to stdout. To do this, it uses sys.stdout.buffer on Python3 to circumvent the automatic encoding/decoding that occurs on Python3’s sys.stdout. So far so good. Python2 expects bytes to be written to sys.stdout by default so we can write the byte string directly to sys.stdout in that case.
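
To make the difference concrete, here is a minimal Python 3 snippet (a sketch of the behaviour described above, not part of the original code) showing why the .buffer attribute is needed:

# Illustration only: text-mode vs binary stdout on Python 3
import sys

try:
    sys.stdout.write(b'\xff')        # the text-mode stream expects str, not bytes
except TypeError as exc:
    print("text stdout rejected bytes:", exc)

sys.stdout.buffer.write(b'\xff')     # the underlying binary buffer accepts raw bytes
sys.stdout.buffer.flush()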

The First Attempt: Pytest newb, but willing to learn!

Recently I wanted to write a unittest for some code like that. I had never done this in pytest before so my first try looked a lot like my experience with nose or unittest2: override sys.stdout with an io.BytesIO object and then assert that the right values showed up in sys.stdout:

# test_byte_writer.py
import io
import sys

import mock
import pytest
import six

from byte_writer import write_bytes

@pytest.fixture
def stdout():
    real_stdout = sys.stdout
    fake_stdout = io.BytesIO()
    if six.PY3:
        sys.stdout = mock.MagicMock()
        sys.stdout.buffer = fake_stdout
    else:
        sys.stdout = fake_stdout

    yield fake_stdout

    sys.stdout = real_stdout

def test_write_byte(stdout):
    write_bytes(b'a')
    assert stdout.getvalue() == b'a'

This gave me an error:

[pts/38@roan /var/tmp/py3_coverage]$ pytest              (07:46:36)
_________________________ test_write_byte __________________________

stdout = 

    def test_write_byte(stdout):
        write_bytes(b'a')
>       assert stdout.getvalue() == b'a'
E       AssertionError: assert b'' == b'a'
E         Right contains more items, first extra item: 97
E         Use -v to get the full diff

test_byte_writer.py:27: AssertionError
----------------------- Captured stdout call -----------------------
a
===================== 1 failed in 0.03 seconds =====================

I could plainly see from pytest’s “Captured stdout” output that my test value had been printed to stdout. So it appeared that my stdout fixture just wasn’t capturing what was printed there. What could be the problem? Hmmm…. Captured stdout… If pytest is capturing stdout, then perhaps my fixture is getting overridden by pytest’s internal facility. Let’s google and see if there’s a solution.

The Second Attempt: Hey, that’s really neat!

Wow, not only did I find that there is a way to capture stdout with pytest, I found that you don’t have to write your own fixture to do so. You can just hook into pytest’s builtin capfd fixture to do so. Cool, that should be much simpler:

# test_byte_writer.py
import io
import sys

from byte_writer import write_bytes

def test_write_byte(capfd):
    write_bytes(b'a')
    out, err = capfd.readouterr()
    assert out == b'a'

Okay, that works fine on Python2 but on Python3 it gives:

[pts/38@roan /var/tmp/py3_coverage]$ pytest              (07:46:41)
_________________________ test_write_byte __________________________

capfd = 

    def test_write_byte(capfd):
        write_bytes(b'a')
        out, err = capfd.readouterr()
>       assert out == b'a'
E       AssertionError: assert 'a' == b'a'

test_byte_writer.py:10: AssertionError
===================== 1 failed in 0.02 seconds =====================

The assert looks innocuous enough. So if I was an insufficiently paranoid person I might be tempted to think that this was just stdout using Python native string types (bytes on Python2 and text on Python3), so the solution would be to use a native string here ("a" instead of b"a"). However, where the correctness of anyone else’s bytes <=> text string code is concerned, I subscribe to the philosophy that you can never be too paranoid. So….

The Third Attempt: I bet I can break this more!

Rather than make the seemingly easy fix of switching the test expectation from b"a" to "a" I decided that I should test whether some harder test data would break either pytest or my code. Now my code is intended to push bytes out to stdout even if those bytes are non-decodable in the user’s selected encoding. On modern UNIX systems this is usually controlled by the user’s locale. And most of the time, the locale setting specifies a UTF-8 compatible encoding. With that in mind, what happens when I pass a byte string that is not legal in UTF-8 to write_bytes() in my test function?

# test_byte_writer.py
import io
import sys

from byte_writer import write_bytes

def test_write_byte(capfd):
    write_bytes(b'\xff')
    out, err = capfd.readouterr()
    assert out == b'\xff'

Here I adapted the test function to attempt writing the byte 0xff (255) to stdout. In UTF-8, this is an illegal byte (ie: by itself, that byte cannot be mapped to any unicode code point) which makes it good for testing this. (If you want to make a truly robust unittest, you should probably standardize on the locale settings (and hence, the encoding) to use when running the tests. However, that deserves a blog post of its own.) Anyone want to guess what happens when I run this test?

[pts/38@roan /var/tmp/py3_coverage]$ pytest              (08:19:52)
_________________________ test_write_byte __________________________

capfd = 

    def test_write_byte(capfd):
        write_bytes(b'\xff')
        out, err = capfd.readouterr()
>       assert out == b'\xff'
E       AssertionError: assert '�' == b'\xff'

test_byte_writer.py:10: AssertionError
===================== 1 failed in 0.02 seconds =====================

On Python3, we see that the undecodable byte is replaced with the unicode replacement character. Pytest is likely running the equivalent of b"Byte string".decode(errors="replace") on stdout. This is good when capfd is used to display the Captured stdout call information to the console. Unfortunately, it is not what we need when we want to check that our exact byte string was emitted to stdout.
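
To make that concrete, here is a small sketch (assuming a UTF-8 locale; this is my own illustration, not pytest's internals) of why 0xff is a good test byte and what the replace error handler does to it:

# 0xff is not valid UTF-8 on its own, so a strict decode fails...
try:
    b'\xff'.decode('utf-8')
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc)

# ...while errors="replace" (what pytest appears to be doing) swaps in U+FFFD,
# so the captured text no longer matches the original byte string.
text = b'\xff'.decode('utf-8', errors='replace')
print(text)             # '\ufffd', the unicode replacement character
print(text == b'\xff')  # False: a str never equals a bytes object on Python 3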

With this change, it also becomes apparent that the test isn’t doing the right thing on Python2 either:

[pts/38@roan /var/tmp/py3_coverage]$ pytest-2            (08:59:37)
_________________________ test_write_byte __________________________

capfd = 

    def test_write_byte(capfd):
        write_bytes(b'\xff')
        out, err = capfd.readouterr()
>       assert out == b'\xff'
E       AssertionError: assert '�' == '\xff'
E         - �
E         + \xff

test_byte_writer.py:10: AssertionError
========================= warnings summary =========================
test_byte_writer.py::test_write_byte
  /var/tmp/py3_coverage/test_byte_writer.py:10: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    assert out == b'\xff'

-- Docs: http://doc.pytest.org/en/latest/warnings.html
=============== 1 failed, 1 warnings in 0.02 seconds ===============

In the previous version, this test passed. Now we see that the test was passing because Python2 evaluates u"a" == b"a" as True. However, that’s not really what we want to test; we want to test that the byte string we passed to write_bytes() is the actual byte string that was emitted on stdout. The new data shows that instead, the test is converting the value that got to stdout into a text string and then trying to compare that. So a fix is needed on both Python2 and Python3.

These problems are down in the guts of pytest. How are we going to fix them? Will we have to seek out a different strategy that lets us capture stdout, overriding pytest’s builtin?

The Fourth Attempt: Fortuitous Timing!

Well, as it turns out, the pytest maintainers merged a pull request four days ago which implements a capfdbinary fixture. capfdbinary is like the capfd fixture that I was using in the above example but returns data as byte strings instead of as text strings. Let’s install it and see what happens:

$ pip install --user git+git://github.com/pytest-dev/pytest.git@6161bcff6e3f07359c94a7be52ad32ecb8822142
$ mv ~/.local/bin/pytest ~/.local/bin/pytest-2
$ pip3 install --user git+git://github.com/pytest-dev/pytest.git@6161bcff6e3f07359c94a7be52ad32ecb8822142

And then update the test to use capfdbinary instead of capfd:

# test_byte_writer.py
import io
import sys

from byte_writer import write_bytes

def test_write_byte(capfdbinary):
    write_bytes(b'\xff')
    out, err = capfdbinary.readouterr()
    assert out == b'\xff'

And with those changes, the tests now pass:

[pts/38@roan /var/tmp/py3_coverage]$ pytest              (11:42:06)
======================= test session starts ========================
platform linux -- Python 3.5.4, pytest-3.2.5.dev194+ng6161bcf, py-1.4.34, pluggy-0.5.2
rootdir: /var/tmp/py3_coverage, inifile:
plugins: xdist-1.15.0, mock-1.5.0, cov-2.4.0, asyncio-0.5.0
collected 1 item                                                    

test_byte_writer.py .

===================== 1 passed in 0.01 seconds =====================

Yay! Mission accomplished.


November 18, 2017 07:56 PM


Davy Wybiral

Build a Feed Reader in Python (Parts 2 & 3)

Part 02

Installing and configuring MySQL, flask, and SQLAlchemy.



Part 03

Modeling our feed reader data using SQLAlchemy.

November 18, 2017 01:49 PM

November 17, 2017


Weekly Python Chat

pip: installing Python libraries

This week we're talking about pip: the Python package manager. We'll talk about installing packages with pip and discovering packages that suit your needs.

November 17, 2017 08:30 PM


NumFOCUS

Theano and the Future of PyMC

This is a guest post by Christopher Fonnesbeck of PyMC3. — PyMC, now in its third iteration as PyMC3, is a project whose goal is to provide a performant, flexible, and friendly Python interface to Bayesian inference.  The project relies heavily on Theano, a deep learning framework, which has just announced that development will not be […]

November 17, 2017 05:25 PM


Davy Wybiral

Build a Feed Reader in Python (Part 1)

Adding the feedparser module to the project and using it to extract information about feed sources and articles.
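
If you want a taste before watching, here is a minimal sketch of that kind of feedparser usage (the feed URL is just an example, not necessarily the one used in the videos):

import feedparser

# example feed; any RSS/Atom URL works the same way
feed = feedparser.parse("https://planetpython.org/rss20.xml")
print(feed.feed.title)                # information about the feed source
for entry in feed.entries[:5]:        # information about individual articles
    print(entry.title, entry.link)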


November 17, 2017 04:12 PM

November 16, 2017


Mike Driscoll

wxPython: Moving items in ObjectListView

I was recently asked about how to implement drag-and-drop of items in a wx.ListCtrl or in ObjectListView. Unfortunately neither control has this built-in although I did find an article on the wxPython wiki that demonstrated one way to do drag-and-drop of the items in a ListCtrl.

However, I did think that implementing some buttons to move items around in an ObjectListView widget should be fairly easy. So that’s what this article will be focusing on.


Changing Item Order

If you don’t have wxPython and ObjectListView installed, then you will want to use pip to install them:

pip install wxPython objectlistview

Once that is done, open up your favorite text editor or IDE and enter the following code:

import wx
from ObjectListView import ObjectListView, ColumnDefn
 
 
class Book(object):
    """
    Model of the Book object
    Contains the following attributes:
    'ISBN', 'Author', 'Manufacturer', 'Title'
    """
 
    def __init__(self, title, author, isbn, mfg):
        self.isbn = isbn
        self.author = author
        self.mfg = mfg
        self.title = title
 
    def __repr__(self):
        return "<Book: {title}>".format(title=self.title)
 
 
class MainPanel(wx.Panel):
 
    def __init__(self, parent):
        wx.Panel.__init__(self, parent=parent, id=wx.ID_ANY)
        self.current_selection = None
        self.products = [Book("wxPython in Action", "Robin Dunn",
                              "1932394621", "Manning"),
                         Book("Hello World", "Warren and Carter Sande",
                              "1933988495", "Manning"),
                         Book("Core Python Programming", "Wesley Chun",
                             "0132269937", "Prentice Hall"),
                         Book("Python Programming for the Absolute Beginner",
                              "Michael Dawson", "1598631128",
                              "Course Technology"),
                         Book("Learning Python", "Mark Lutz",
                              "0596513984", "O'Reilly")
                         ]
 
        self.dataOlv = ObjectListView(self, wx.ID_ANY, 
                                      style=wx.LC_REPORT|wx.SUNKEN_BORDER)
        self.setBooks()
 
        # Allow the cell values to be edited when single-clicked
        self.dataOlv.cellEditMode = ObjectListView.CELLEDIT_SINGLECLICK
 
        # create up and down buttons
        up_btn = wx.Button(self, wx.ID_ANY, "Up")
        up_btn.Bind(wx.EVT_BUTTON, self.move_up)
 
        down_btn = wx.Button(self, wx.ID_ANY, "Down")
        down_btn.Bind(wx.EVT_BUTTON, self.move_down)
 
        # Create some sizers
        mainSizer = wx.BoxSizer(wx.VERTICAL)
 
        mainSizer.Add(self.dataOlv, 1, wx.ALL|wx.EXPAND, 5)
        mainSizer.Add(up_btn, 0, wx.ALL|wx.CENTER, 5)
        mainSizer.Add(down_btn, 0, wx.ALL|wx.CENTER, 5)
        self.SetSizer(mainSizer)
 
    def move_up(self, event):
        """
        Move an item up the list
        """        
        self.current_selection = self.dataOlv.GetSelectedObject()
        data = self.dataOlv.GetObjects()
        if self.current_selection:
            index = data.index(self.current_selection)
            if index > 0:
                new_index = index - 1
            else:
                new_index = len(data)-1
            data.insert(new_index, data.pop(index))
            self.products = data
            self.setBooks()
            self.dataOlv.Select(new_index)
 
    def move_down(self, event):
        """
        Move an item down the list
        """
        self.current_selection = self.dataOlv.GetSelectedObject()
        data = self.dataOlv.GetObjects()
        if self.current_selection:
            index = data.index(self.current_selection)
            if index < len(data) - 1:
                new_index = index + 1
            else:
                new_index = 0
            data.insert(new_index, data.pop(index))
            self.products = data
            self.setBooks()
            self.dataOlv.Select(new_index)
 
    def setBooks(self):
        self.dataOlv.SetColumns([
            ColumnDefn("Title", "left", 220, "title"),
            ColumnDefn("Author", "left", 200, "author"),
            ColumnDefn("ISBN", "right", 100, "isbn"),
            ColumnDefn("Mfg", "left", 180, "mfg")
        ])
 
        self.dataOlv.SetObjects(self.products)
 
 
class MainFrame(wx.Frame):
    def __init__(self):
        wx.Frame.__init__(self, parent=None, id=wx.ID_ANY, 
                          title="ObjectListView Demo", size=(800,600))
        panel = MainPanel(self)
        self.Show()
 
 
if __name__ == "__main__":
    app = wx.App(False)
    frame = MainFrame()
    app.MainLoop()

The methods we care about most in this example are move_up() and move_down(). Each of these methods checks whether you have an item selected in the ObjectListView widget and also grabs the current contents of the widget. If you have an item selected, it looks up that item’s index in the data we grabbed when we called GetObjects(). We can then use that index to determine whether to increment (move_down) or decrement (move_up) its position, depending on which button was pressed.

After we update the list with the changed positions, we update self.products, which is the instance variable that setBooks() uses to refresh our ObjectListView widget. Finally we call setBooks() and reset the selection, since our original selection has moved.
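
The wrap-around index arithmetic is the only subtle part, so here is a minimal, wxPython-free sketch of the same reordering logic (the helper name is my own, not from the code above):

def move_item(items, index, step):
    # step is -1 for "up" and +1 for "down"; the modulo wraps around at
    # either end, just like move_up() and move_down() do
    new_index = (index + step) % len(items)
    items.insert(new_index, items.pop(index))
    return new_index

books = ["A", "B", "C"]
print(move_item(books, 0, -1), books)   # 2 ['B', 'C', 'A'] -- wrapped to the end
print(move_item(books, 2, 1), books)    # 0 ['A', 'B', 'C'] -- wrapped to the front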


Wrapping Up

I thought this was a neat little project that didn’t take very long to put together. I will note that there is at least one issue with this implementation: it doesn’t work correctly when you select multiple items in the control. You could probably fix this by disabling multiple selection in your ObjectListView widget or by figuring out the logic to make it work with multiple selections, but I will leave that up to the reader to figure out. Have fun and happy coding!

November 16, 2017 06:15 PM


Django Weblog

DSF calls for applicants for a Django Fellow

After three years of full-time work as the Django Fellow, I'd like to scale back my involvement to part-time. That means it's time to hire another Fellow who would like to work on Django 20-40 hours per week. The position is ongoing - the successful applicant will have the position until they choose to step down.

The position of Fellow is primarily focused on housekeeping and community support - you'll be expected to do the work that would benefit from constant, guaranteed attention rather than volunteer-only efforts. In particular, your duties will include:

Being a committer isn't a prerequisite for this position; we'll consider applications from anyone with a proven history of working with either the Django community or another similar open-source community.

Your geographical location isn't important either - we have several methods of remote communication and coordination that we can use depending on the timezone difference to the supervising members of Django.

You'll be expected to post a weekly report of your work to the django-developers mailing list.

If you don't perform the duties to a satisfactory level, we may end your contract. We may also terminate the contract if we're unable to raise sufficient funds to support the Fellowship on an ongoing basis (unlikely, given the current fundraising levels).

Compensation isn't competitive with full-time salaries in big cities like San Francisco or London. The Fellow will be selected to make best use of available funds.

If you're interested in applying for the position, please email us with details of your experience with Django and open-source contribution and community support in general, the amount of time each week you'd like to dedicate to the position (a minimum of 20 hours a week), your hourly rate, and when you'd like to start working. The start date is flexible and will be on or after January 1, 2018.

Applications will be open until 1200 UTC, December 18, 2017, with the expectation that the successful candidate will be announced around December 22.

Successful applicants will not be employees of the Django Project or the Django Software Foundation. Fellows will be contractors, expected to ensure that they meet all of their resident country's criteria for self-employment or having a shell consulting company, to invoice the DSF on a monthly basis, and to pay all relevant taxes.

If you or your company is interested in helping fund this program and future DSF activities, please consider becoming a corporate member to learn about corporate membership, or you can make a donation to the Django Software Foundation.

November 16, 2017 02:18 PM


Red Hat Developers

Speed up your Python using Rust

What is Rust?

Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.

Featuring

Description is taken from rust-lang.org.

Why does it matter for a Python developer?

The best description of Rust I have heard came from Elias (a member of the Rust Brazil Telegram group).

Rust is a language that allows you to build high level abstractions, but without giving up low-level control – that is, control of how data is represented in memory, control of which threading model you want to use etc.
Rust is a language that can usually detect, during compilation, the worst parallelism and memory management errors (such as accessing data on different threads without synchronization, or using data after it has been deallocated), but it gives you an escape hatch in case you really know what you’re doing.
Rust is a language that, because it has no runtime, can be used to integrate with any runtime; you can write a native extension in Rust that is called by a Node.js program, by a Python program, or by a program in Ruby, Lua, etc. and, conversely, you can script a program in Rust using these languages. — “Elias Gabriel Amaral da Silva”

There are a bunch of Rust packages out there to help you extend Python with Rust.

I can mention Milksnake, created by Armin Ronacher (the creator of Flask), and also PyO3, the Rust bindings for the Python interpreter.

See a complete reference list at the bottom of this article.

Let’s see it in action

For this post, I am going to use rust-cpython; it’s the only one I have tested, it is compatible with the stable version of Rust, and I found it straightforward to use.

NOTE: PyO3 is a fork of rust-cpython and comes with many improvements, but it works only with the nightly version of Rust, so I preferred to use the stable version for this post; anyway, the examples here should also work with PyO3.

Pros: It is easy to write Rust functions and import them from Python, and as you will see from the benchmarks, it is worth it in terms of performance.

Cons: Distributing your project/lib/framework will require the Rust module to be compiled on the target system because of variations in environment and architecture, so there will be a compilation stage which you don’t have when installing pure Python libraries. You can make it easier by using rust-setuptools or by using MilkSnake to embed binary data in Python wheels.

Python is sometimes slow

Yes, Python is known for being “slow” in some cases and the good news is that this doesn’t really matter depending on your project goals and priorities. For most projects, this detail will not be very important.

However, you may face the rare case where a single function or module is taking too much time and is detected as the bottleneck of your project’s performance; this often happens with string parsing and image processing.
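
Before reaching for another language it is worth confirming the bottleneck; here is a rough sketch (my own example, not from the original article) using the standard library profiler on a string-processing function:

import cProfile
import random
import string

def count_doubles(val):
    # stand-in for the "suspect" function you want to measure
    return sum(1 for c1, c2 in zip(val, val[1:]) if c1 == c2)

val = ''.join(random.choice(string.ascii_letters) for _ in range(1000000))
cProfile.run('count_doubles(val)')   # shows where the time is actually spent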

Example

Let’s say you have a Python function which does some string processing. Take the following easy example of counting pairs of repeated chars, but keep in mind that this example can be reproduced with other string-processing functions or any other generally slow process in Python.

# How many subsequent-repeated group of chars are in the given string? 
abCCdeFFghiJJklmnopqRRstuVVxyZZ... {millions of chars here}
  1   2    3        4    5   6

Python is slow for doing large string processing, so you can use pytest-benchmark to compare a Pure Python (with Iterator Zipping) function versus a Regexp implementation.

# Using a Python3.6 environment
$ pip3 install pytest pytest-benchmark

Then write a new Python program called doubles.py

import re
import string
import random

# Python ZIP version
def count_doubles(val):
    total = 0
    # there is an improved version later on this post
    for c1, c2 in zip(val, val[1:]):
        if c1 == c2:
            total += 1
    return total


# Python REGEXP version
double_re = re.compile(r'(?=(.)\1)')

def count_doubles_regex(val):
    return len(double_re.findall(val))


# Benchmark it
# generate 1M of random letters to test it
val = ''.join(random.choice(string.ascii_letters) for i in range(1000000))

def test_pure_python(benchmark):
    benchmark(count_doubles, val)

def test_regex(benchmark):
    benchmark(count_doubles_regex, val)

Run pytest to compare:

$ pytest doubles.py                                                                                                           
=============================================================================
platform linux -- Python 3.6.0, pytest-3.2.3, py-1.4.34, pluggy-0.4.
benchmark: 3.1.1 (defaults: timer=time.perf_counter disable_gc=False min_roun
rootdir: /Projects/rustpy, inifile:
plugins: benchmark-3.1.1
collected 2 items

doubles.py ..


-----------------------------------------------------------------------------
Name (time in ms)         Min                Max               Mean          
-----------------------------------------------------------------------------
test_regex            24.6824 (1.0)      32.3960 (1.0)      27.0167 (1.0)    
test_pure_python      51.4964 (2.09)     62.5680 (1.93)     52.8334 (1.96)   
-----------------------------------------------------------------------------

Let’s take the Mean for comparison: the Regexp version averages roughly 27 ms against roughly 53 ms for the Pure Python version, so the Regexp implementation is about 2x faster.

Extending Python with Rust

Create a new crate

A crate is what Rust packages are called.

Have Rust installed (the recommended way is https://www.rustup.rs/); Rust is also available in the Fedora and RHEL repositories via the rust-toolset.

I used rustc 1.21.0

In the same folder run:

cargo new pyext-myrustlib

It creates a new Rust project in that same folder called pyext-myrustlib containing the Cargo.toml (cargo is the Rust package manager) and also a src/lib.rs (where we write our library implementation).

Edit Cargo.toml

It will use the rust-cpython crate as a dependency and tell cargo to generate a dylib to be imported from Python.

[package]
name = "pyext-myrustlib"
version = "0.1.0"
authors = ["Bruno Rocha <rochacbruno@gmail.com>"]

[lib]
name = "myrustlib"
crate-type = ["dylib"]

[dependencies.cpython]
version = "0.1"
features = ["extension-module"]

Edit src/lib.rs

What we need to do:

  1. Import all macros from cpython crate.
  2. Take Python and PyResult types from CPython into our lib scope.
  3. Write the count_doubles function implementation in Rust, note that this is very similar to the Pure Python version except for:
    • It takes a Python as first argument, which is a reference to the Python Interpreter and allows Rust to use the Python GIL.
    • Receives a &str typed val as reference.
    • Returns a PyResult which is a type that allows the rise of Python exceptions.
    • Returns a PyResult object in Ok(total) (Result is an enum type that represents either success (Ok) or failure (Err)), and as our function is expected to return a PyResult the compiler will take care of wrapping our Ok in that type (note that our PyResult expects a u64 as the return value).
  4. Using py_module_initializer! macro we register new attributes to the lib, including the __doc__ and also we add the count_doubles attribute referencing our Rust implementation of the function.
    • Attention to the names libmyrustlib, initlibmyrustlib, and PyInit.
    • We also use the try! macro, which is the equivalent of Python’s try..except.
    • Return Ok(()) – The () is an empty result tuple, the equivalent of None in Python.
#[macro_use]
extern crate cpython;

use cpython::{Python, PyResult};

fn count_doubles(_py: Python, val: &str) -> PyResult<u64> {
    let mut total = 0u64;

    // There is an improved version later on this post
    for (c1, c2) in val.chars().zip(val.chars().skip(1)) {
        if c1 == c2 {
            total += 1;
        }
    }

    Ok(total)
}

py_module_initializer!(libmyrustlib, initlibmyrustlib, PyInit_myrustlib, |py, m | {
    try!(m.add(py, "__doc__", "This module is implemented in Rust"));
    try!(m.add(py, "count_doubles", py_fn!(py, count_doubles(val: &str))));
    Ok(())
});

Now let’s build it with cargo

$ cargo build --release
    Finished release [optimized] target(s) in 0.0 secs

$ ls -la target/release/libmyrustlib*
target/release/libmyrustlib.d
target/release/libmyrustlib.so*  <-- Our dylib is here

Now let’s copy the generated .so lib to the same folder where our doubles.py is located.

NOTE: on Fedora you should get a .so; on other systems you may get a .dylib, and you can rename it by changing the extension to .so.

$ cd ..
$ ls
doubles.py pyext-myrustlib/

$ cp pyext-myrustlib/target/release/libmyrustlib.so myrustlib.so

$ ls
doubles.py myrustlib.so pyext-myrustlib/

Having the myrustlib.so in the same folder or added to your Python path allows it to be directly imported, transparently, as if it were a regular Python module.
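
As a quick sanity check (assuming the build above succeeded and myrustlib.so sits next to the script), the compiled dylib imports like any other module:

import myrustlib   # the .so built above must be on the path for this to work

print(myrustlib.__doc__)                    # "This module is implemented in Rust"
print(myrustlib.count_doubles("abCCdeFF"))  # 2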

 

Importing from Python and comparing the results

Edit your doubles.py now importing our Rust implemented version and adding a benchmark for it.

import re
import string
import random
import myrustlib   #  <-- Import the Rust implemented module (myrustlib.so)


def count_doubles(val):
    """Count repeated pair of chars ins a string"""
    total = 0
    for c1, c2 in zip(val, val[1:]):
        if c1 == c2:
            total += 1
    return total


double_re = re.compile(r'(?=(.)\1)')


def count_doubles_regex(val):
    return len(double_re.findall(val))


val = ''.join(random.choice(string.ascii_letters) for i in range(1000000))


def test_pure_python(benchmark):
    benchmark(count_doubles, val)


def test_regex(benchmark):
    benchmark(count_doubles_regex, val)


def test_rust(benchmark):   #  <-- Benchmark the Rust version
    benchmark(myrustlib.count_doubles, val)

Benchmark

$ pytest doubles.py
==============================================================================
platform linux -- Python 3.6.0, pytest-3.2.3, py-1.4.34, pluggy-0.4.
benchmark: 3.1.1 (defaults: timer=time.perf_counter disable_gc=False min_round
rootdir: /Projects/rustpy, inifile:
plugins: benchmark-3.1.1
collected 3 items

doubles.py ...


-----------------------------------------------------------------------------
Name (time in ms)         Min                Max               Mean          
-----------------------------------------------------------------------------
test_rust              2.5555 (1.0)       2.9296 (1.0)       2.6085 (1.0)    
test_regex            25.6049 (10.02)    27.2190 (9.29)     25.8876 (9.92)   
test_pure_python      52.9428 (20.72)    56.3666 (19.24)    53.9732 (20.69)  
-----------------------------------------------------------------------------

Let’s take the Mean for comparison:

The Rust implementation can be 10x faster than the Python Regex version and 21x faster than the Pure Python version.

Interestingly, the Regex version is only 2x faster than Pure Python 🙂

NOTE: These numbers make sense only for this particular scenario; for other cases the comparison may be different.

Updates and Improvements

After this article was published, I got some comments on r/python and also on r/rust.

The contributions came as Pull Requests and you can send a new one if you think the functions can be improved.

Thanks to Josh Stone, we got a better implementation for Rust which iterates the string only once, and also the Python equivalent.

Thanks to Purple Pixie, we got a Python implementation using itertools; however, this version does not perform any better and still needs improvements.

Iterating only once

fn count_doubles_once(_py: Python, val: &str) -> PyResult<u64> {
    let mut total = 0u64;

    let mut chars = val.chars();
    if let Some(mut c1) = chars.next() {
        for c2 in chars {
            if c1 == c2 {
                total += 1;
            }
            c1 = c2;
        }
    }

    Ok(total)
}
def count_doubles_once(val):
    total = 0
    chars = iter(val)
    c1 = next(chars)
    for c2 in chars:
        if c1 == c2:
            total += 1
        c1 = c2
    return total

Python with itertools

import itertools

def count_doubles_itertools(val):
    c1s, c2s = itertools.tee(val)
    next(c2s, None)
    total = 0
    for c1, c2 in zip(c1s, c2s):
        if c1 == c2:
            total += 1
    return total

Why not C/C++/Nim/Go/Ĺua/PyPy/{other language}?

OK, that is not the purpose of this post. This post was never about comparing Rust with other languages; it was specifically about how to use Rust to extend and speed up Python. If you are doing that, it means you already have a reason to choose Rust over another language, whether for its ecosystem, its safety and tooling, to follow the hype, or simply because you like it. Whatever the reason, this post is here to show how to use it with Python.

I (personally) would say that Rust is more future-proof, as it is new and there are lots of improvements to come, and also because of its ecosystem, tooling, and community, and because I feel comfortable with Rust’s syntax; I really like it!

So, as expected, people started asking about other languages and it became a sort of benchmark, and I think that is cool!

So, as part of my request for improvements, some people on Hacker News also sent ideas; martinxyz sent an implementation using C and SWIG that performed very well.

C Code (swig boilerplate omitted)

uint64_t count_byte_doubles(char * str) {
  uint64_t count = 0;
  while (str[0] && str[1]) {
    if (str[0] == str[1]) count++;
    str++;
  }
  return count;
}

And our fellow Red Hatter Josh Stone improved the Rust implementation again by replacing chars with bytes, so it is a fair competition with C, since the C version compares bytes instead of Unicode chars.

fn count_doubles_once_bytes(_py: Python, val: &str) -> PyResult<u64> {
    let mut total = 0u64;

    let mut chars = val.bytes();
    if let Some(mut c1) = chars.next() {
        for c2 in chars {
            if c1 == c2 {
                total += 1;
            }
            c1 = c2;
        }
    }

    Ok(total)
}

There were also ideas to compare a Python list comprehension and numpy, so I included them here.

Numpy:

import numpy as np

def count_double_numpy(val):
    ng = np.fromstring(val, dtype=np.byte)
    return np.sum(ng[:-1] == ng[1:])

List comprehension

def count_doubles_comprehension(val):
    return sum(1 for c1, c2 in zip(val, val[1:]) if c1 == c2)

The complete test case is on repository test_all.py file.

New Results

NOTE: Keep in mind that the comparison was done in the same environment and may differ if run in a different environment, using another compiler and/or different flags.

-------------------------------------------------------------------------------------------------
Name (time in us)                     Min                    Max                   Mean          
-------------------------------------------------------------------------------------------------
test_rust_bytes_once             476.7920 (1.0)         830.5610 (1.0)         486.6116 (1.0)    
test_c_swig_bytes_once           795.3460 (1.67)      1,504.3380 (1.81)        827.3898 (1.70)   
test_rust_once                   985.9520 (2.07)      1,483.8120 (1.79)      1,017.4251 (2.09)   
test_numpy                     1,001.3880 (2.10)      2,461.1200 (2.96)      1,274.8132 (2.62)   
test_rust                      2,555.0810 (5.36)      3,066.0430 (3.69)      2,609.7403 (5.36)   
test_regex                    24,787.0670 (51.99)    26,513.1520 (31.92)    25,333.8143 (52.06)  
test_pure_python_once         36,447.0790 (76.44)    48,596.5340 (58.51)    38,074.5863 (78.24)  
test_python_comprehension     49,166.0560 (103.12)   50,832.1220 (61.20)    49,699.2122 (102.13) 
test_pure_python              49,586.3750 (104.00)   50,697.3780 (61.04)    50,148.6596 (103.06) 
test_itertools                56,762.8920 (119.05)   69,660.0200 (83.87)    58,402.9442 (120.02) 
-------------------------------------------------------------------------------------------------

NOTE: If you want to propose changes or improvements send a PR here: https://github.com/rochacbruno/rust-python-example/

Conclusion

Back to the purpose of this post “How to Speed Up your Python with Rust” we started with:

– Pure Python function taking 102 ms.
– Improved with Numpy (which is implemented in C) to take 3 ms.
– Ended with Rust taking 1 ms.

In this example Rust performed 100x faster than our Pure Python.

Rust will not magically save you; you must know the language to be able to implement a clever solution, but once implemented in the right way it is worth as much as C in terms of performance, and it also comes with amazing tooling, ecosystem, community, and safety bonuses.

Rust may not yet be the general-purpose language of choice because of its level of complexity, and it may not yet be the best choice for writing common simple applications such as web sites and test automation scripts.

However, for specific parts of the project where Python is known to be the bottleneck and your natural choice would be implementing a C/C++ extension, writing this extension in Rust seems easy and better to maintain.

There are still many improvements to come in Rust and lots of other crates offering Python <--> Rust integration. Even if you are not including the language in your tool belt right now, it is really worth keeping an eye on its future!

References

The code snippets for the examples showed here are available in GitHub repo: https://github.com/rochacbruno/rust-python-example.

The examples in this publication are inspired by Extending Python with Rust talk by Samuel Cormier-Iijima in Pycon Canada. video here: https://www.youtube.com/watch?v=-ylbuEzkG4M.

Also by My Python is a little Rust-y by Dan Callahan in Pycon Montreal. video here: https://www.youtube.com/watch?v=3CwJ0MH-4MA.

Other references:

Join Community:

Join Rust community, you can find group links in https://www.rust-lang.org/en-US/community.html.

If you speak Portuguese, I recommend you to join https://t.me/rustlangbr and there is the http://bit.ly/canalrustbr on Youtube.

Author

Bruno Rocha

More info: http://about.me/rochacbruno and http://brunorocha.org



The post Speed up your Python using Rust appeared first on RHD Blog.

November 16, 2017 11:00 AM


Davy Wybiral

Write a Feed Reader in Python

I just started a new video tutorial series. This time it'll cover the entire process of writing an RSS feed reader in Python from start to finish using the feedparser module, flask, and SQLAlchemy. Expect to see about 3-4 new videos a week until this thing is finished!

Click to watch

November 16, 2017 10:23 AM


Python Bytes

#52 Call your APIs with uplink and test them in the tavern

November 16, 2017 08:00 AM


Kushal Das

PyConf Hyderabad 2017

In the beginning of October, I attended a new PyCon in India, PyConf Hyderabad (no worries, they are working on the name for the next year). I was super excited about this conference, the main reason is being able to meet more Python developers from India. We are a large country, and we certainly need more local conferences :)

We reached the conference hotel a day before the event started, along with Py. The first day of the conference was workshop day; we reached the venue on time to say hi to everyone, and met the conference team and many old friends. It was good to see that folks traveled from all across the country to volunteer for the conference. Of course, we had a certain number of dgplug folks there :)

On the conference day, Anwesha and I set up the PSF booth and talked to the attendees. During the lightning talk session, Sayan and Anwesha introduced PyCon Pune, and they also opened up the registration during the lightning talks :). I attended Chandan Kumar’s talk about his journey into upstream projects. I have to admit that I feel proud to see all the work he has done.

Btw, I forgot to mention that lunch at PyConf Hyderabad was the best conference food ever. They had some amazing biryani :).

The last talk of the day was my keynote titled Free Software movement & current days. Anwesha and I wrote an article on the history of Free Software a few months back, and the talk was based on that. This was also the first time I spoke about the Freedom of the Press Foundation (I attended my first conference as an FPF staff member).

The team behind the conference did some amazing groundwork to make this conference happen. It was a good opportunity to meet the community and make new friends.

November 16, 2017 04:25 AM


Wallaroo Labs

Non-native event-driven windowing in Wallaroo

Certain applications lend themselves to pure parallel computation better than others. In some cases we need to apply certain algorithms over a “window” in our data. This means that after we have completed a certain amount of processing (be it time, number of messages, or some other arbitrary metric), we want to perform a special action for the data in that window. An example application of this could be producing stats for log files over a certain period of time.
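
The idea is independent of any particular framework; a count-based window can be sketched in a few lines of plain Python (this is an illustration only, not Wallaroo's API):

# Illustration only: buffer messages and act on each fixed-size window.
def windowed(stream, size, action):
    window = []
    for msg in stream:
        window.append(msg)
        if len(window) == size:
            action(window)   # e.g. compute stats for this window of log lines
            window = []
    if window:               # flush the final, partial window
        action(window)

windowed(range(10), 4, lambda w: print("window:", w))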

November 16, 2017 12:00 AM

November 15, 2017


Django Weblog

Django 2.0 release candidate 1 released

Django 2.0 release candidate 1 is the final opportunity for you to try out the assortment of new features before Django 2.0 is released.

The release candidate stage marks the string freeze and the call for translators to submit translations. Provided no major bugs are discovered that can't be solved in the next two weeks, Django 2.0 will be released on or around December 1. Any delays will be communicated on the django-developers mailing list thread.

Please use this opportunity to help find and fix bugs (which should be reported to the issue tracker). You can grab a copy of the package from our downloads page or on PyPI.

The PGP key ID used for this release is Tim Graham: 1E8ABDC773EDE252.

November 15, 2017 11:54 PM


PyCharm

PyCharm 2017.3 EAP 10

This week’s early access program (EAP) version of PyCharm is now available from our website:

Get PyCharm 2017.3 EAP 10

The release is getting close, and we’re just ironing out the last small issues before it’s ready.

Improvements in This Version

If these features sound interesting to you, try them yourself:

Get PyCharm 2017.3 EAP 10

If you are using a recent version of Ubuntu (16.04 and later) you can also install PyCharm EAP versions using snap:

sudo snap install [pycharm-professional | pycharm-community] --classic --edge

If you already used snap for the previous version, you can update using:

sudo snap refresh [pycharm-professional | pycharm-community] --classic --edge

As a reminder, PyCharm EAP versions:

If you run into any issues with this version, or another version of PyCharm, please let us know on our YouTrack. If you have other suggestions or remarks, you can reach us on Twitter, or by commenting on the blog.

November 15, 2017 05:47 PM


Codementor

Onicescu correlation coefficient-Python - Alexandru Daia

Implementing a new correlation method based on kinetic energies that I am researching.

November 15, 2017 05:41 PM


Ian Ozsvald

PyDataBudapest and “Machine Learning Libraries You’d Wish You’d Known About”

I’m back at BudapestBI and this year it has its first PyDataBudapest track. Budapest is fun! I’ve given a second iteration of my slightly updated “Machine Learning Libraries You’d Wish You’d Known About” talk (updated from PyDataCardiff two weeks back). When I was here to give an opening keynote talk two years back the conference was a bit smaller; it has grown by +100 folk since then. There’s also a stronger emphasis on open source R and Python tools. As before, the quality of the members here is high – the conversations are great!

During my talk I used my Explaining Regression Predictions Notebook to cover:

Nick’s photo of me on stage

Some audience members asked about co-linearity detection and explanation. Whilst I don’t have a good answer for identifying these relationships, I’ve added a seaborn pairplot, a correlation plot and the Pandas Profiling tool to the Notebook which help to show these effects.
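
For anyone who wants to try the same checks, a rough sketch of that kind of co-linearity inspection looks like this (my own toy data, not the notebook’s code):

import numpy as np
import pandas as pd
import seaborn as sns

# toy data for illustration: x and y are deliberately correlated
rng = np.random.RandomState(0)
x = rng.normal(size=200)
df = pd.DataFrame({'x': x,
                   'y': 2 * x + rng.normal(scale=0.1, size=200),
                   'z': rng.normal(size=200)})

sns.pairplot(df)                        # pairwise scatter plots
sns.heatmap(df.corr(), annot=True)      # correlation plot; x and y show ~0.99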

Although it is complicated, I’m still pretty happy with this ELI5 plot that’s explaining feature contributions to a set of cheap-to-expensive houses from the Boston dataset:

Boston ELI5

I’m planning to do some training on these sort of topics next year, join my training list if that might be of use.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

November 15, 2017 03:50 PM