Planet Python

Last update: July 28, 2016 07:49 AM

July 28, 2016

Philip Semanchuk

Creating PDF Documents Using LibreOffice and Python

This post is a supplement to a talk I’m giving at PyOhio about using Python to create PDFs “the lazy way”. It’s the first of a series on this subject which is a bit too big for just one blog post.

In the talk and in this series, I advocate a technique for creating PDFs that uses LibreOffice (or OpenOffice) to do most of the hard work, and I contrast that to the common solution of using ReportLab (or a library like it).

This technique offers some unique benefits, and in some common use cases—most importantly, perhaps in your case—it can be much more efficient than the alternative. I’ll compare and contrast the two in another blog post. In this post I just want to describe the technique I’m advocating.


Creating PDFs programmatically is a task most Python programmers encounter at least once.

When I talk about creating PDFs programmatically, I’m thinking of the situation where one wants to create a lot of PDFs that follow a template. For instance, you might work for a bank that wants to produce end-of-month account statements for each of its 100,000 customers. The cover page will always contain the bank’s logo, some legal boilerplate, the month and year, and a bland stock photo of happy customers doing something unrelated to banking, like this one.

The first page after that will be a summary of the customer’s accounts, and then subsequent pages contain information about the account—a list of transactions, changes in values of stocks, etc.

Each PDF will be different, but similar because they follow a template. Computers are great for this sort of thing, and this technique is particularly good at it. As I said above, I’ll tell you why I think it’s good in another blog post. For now, I want to stop talking mysteriously about “the technique” and actually describe it.


Here’s a concise outline. Don’t worry if you don’t understand all the steps; they’re fleshed out below.

  1. Create a LibreOffice document that will serve as a template for the documents you want to create. (Note: I mean “template” in the general sense of a form or skeleton, not a LibreOffice .ott template file.)
  2. Unzip that document.
  3. Manipulate the document’s XML using standard Python libraries.
  4. Zip the modified files into a new LibreOffice .odt file.
  5. Ask LibreOffice to export the document in PDF format.

Let’s go through these step by step. I encourage you to follow along. We’re not going to write a single line of Python code, just explore a process. Writing Python would come later when you automate steps 2 – 5.

1. Create a LibreOffice Document to Use as a Template

This step will probably require the most work.

We usually know in advance at least some of the content we want. For instance, in the bank example above, we know what the cover page will look like, where each section should appear in the document, and how a section (e.g. a list of account transactions) should be formatted, even if we don’t know in advance the exact values of each transaction.

Your job during this step is to create a LibreOffice document that will serve as a skeleton (or template, or form) for your final documents. For content that you don’t know (words in paragraphs, images, bullet points in a list, table contents, etc.), leave placeholders.

If you want to play along with this blog post, here’s the LibreOffice document that I’ll use in the examples below.

2. Unzip the Document

This is a trick you might not know—LibreOffice documents are ZIP files. (This is true of all documents that follow the Open Document Format for Office Applications). You can unzip them with command line tools, or with the zipfile module in Python’s standard library.

On my Mac, the following command unzips the document into the directory unzipped.

unzip practice.odt -d unzipped

After unzipping, you’ll see a bunch of files like this:

drwxr-xr-x  11 philip staff    374 Jul 27 16:43 Configurations2/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 META-INF/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 Thumbnails/
-rw-r--r--   1 philip staff   8988 Jul 27 16:44 content.xml
-rw-r--r--   1 philip staff    899 Jul 27  2016 manifest.rdf
-rw-r--r--   1 philip staff   1005 Jul 27  2016 meta.xml
-rw-r--r--   1 philip staff     39 Jul 27  2016 mimetype
-rw-r--r--   1 philip staff  10319 Jul 27  2016 settings.xml
-rw-r--r--   1 philip staff  14903 Jul 27  2016 styles.xml

Of the files above, you’re only likely to be interested in content.xml. (You might also want to explore styles.xml, but I consider that an advanced topic, and I’m trying to maintain a rigorous standard of laziness.)
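When you get around to automating this step, the zipfile module can stand in for the unzip command. Here is a minimal sketch of that idea. The function name unzip_document is made up for illustration, and the demo builds a tiny stand-in archive rather than assuming practice.odt is on disk.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_document(odt_path, target_dir):
    """Extract every member of an .odt (ZIP) file and return the member names."""
    with zipfile.ZipFile(odt_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Demo: a tiny stand-in archive, not a real LibreOffice document.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "sample.odt"
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    archive.writestr("content.xml", "<placeholder/>")

names = unzip_document(sample, workdir / "unzipped")
print(names)
```

With a real document you would pass the path to your .odt file and then open content.xml from the target directory.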

3. Manipulate the XML

The XML in content.xml is governed by the 846-page Open Document Format for Office Applications. You might think I’m going to suggest you read it, or at least familiarize yourself with it.

Heck no! That’s not the lazy way. I’m very pleased that it’s an ISO standard, but I don’t want to learn it if I can save time and effort by not doing so, and you shouldn’t have to either.

Instead I suggest you use what I use: common sense and intuition, which can get you surprisingly far. For instance, if you see this in the XML:

<text:p text:style-name="P4">
 The fox jumped over the dog.
</text:p>

You don’t have to read 846 pages of documentation to guess that you can change it to this:

<text:p text:style-name="P4">
 The quick brown fox jumped over the lazy dog.
</text:p>

Or even this:

<text:p text:style-name="P4">
 No one expects the Spanish Inquisition!
</text:p>

Are you starting to see some possibilities?

If you’re doing this programmatically, you can use LibreOffice bookmarks to demarcate the text you want to replace. Bookmarks are visible in the XML and trivial to locate using XPath. You can see this in my example document where I’ve surrounded two blank space characters with bookmarks where adjectives might go to describe the fox and dog.

<text:p text:style-name="P1">
    <text:bookmark-start text:name="fox_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="fox_type_placeholder"/>
    fox jumped over the
    <text:bookmark-start text:name="dog_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="dog_type_placeholder"/>
    dog.
</text:p>

What do you think will happen if you replace the first occurrence of  <text:s/> with quick brown?
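If you'd like to see what the automated version of that replacement might look like, here is a rough sketch using xml.etree.ElementTree from the standard library. It assumes each placeholder is a single text:s element sitting directly after its bookmark-start, as in my example document. The fill_bookmark helper and the cut-down SAMPLE paragraph are invented for illustration; a real content.xml declares many more namespaces.

```python
import xml.etree.ElementTree as ET

# Namespace URI that ODF uses for the "text:" prefix.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}

# A cut-down paragraph modeled on the example document (not a full content.xml).
SAMPLE = (
    '<text:p xmlns:text="%s" text:style-name="P1">'
    '<text:bookmark-start text:name="fox_type_placeholder"/><text:s/>'
    '<text:bookmark-end text:name="fox_type_placeholder"/> fox jumped over the '
    '<text:bookmark-start text:name="dog_type_placeholder"/><text:s/>'
    '<text:bookmark-end text:name="dog_type_placeholder"/> dog.</text:p>'
) % TEXT_NS


def fill_bookmark(paragraph, name, value):
    """Swap the <text:s/> right after the named bookmark-start for `value`."""
    children = list(paragraph)
    for start in paragraph.findall("text:bookmark-start", NS):
        if start.get("{%s}name" % TEXT_NS) != name:
            continue
        index = children.index(start)
        if index + 1 >= len(children):
            return False
        space = children[index + 1]
        if space.tag != "{%s}s" % TEXT_NS:
            return False
        # In ElementTree, text that follows an element lives in its .tail.
        start.tail = (start.tail or "") + value + (space.tail or "")
        paragraph.remove(space)
        return True
    return False


paragraph = ET.fromstring(SAMPLE)
fill_bookmark(paragraph, "fox_type_placeholder", "quick brown")
fill_bookmark(paragraph, "dog_type_placeholder", "lazy")
sentence = "".join(paragraph.itertext())
print(sentence)
```

The bookmarks themselves stay in place, so the same paragraph can still be located by name afterwards.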

Text isn’t the only thing you can change.

If you have a list item with bullets and you want another bullet or three, you can just duplicate existing bullets. For instance, if you start with this—

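<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item><text:p text:style-name="P2">First</text:p></text:list-item>
    <text:list-item><text:p text:style-name="P2">Second</text:p></text:list-item>
    <text:list-item><text:p text:style-name="P2">Third</text:p></text:list-item>
</text:list>

You can turn it into this:

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item><text:p text:style-name="P2">First</text:p></text:list-item>
    <text:list-item><text:p text:style-name="P2">Second</text:p></text:list-item>
    <text:list-item><text:p text:style-name="P2">Third</text:p></text:list-item>
    <text:list-item><text:p text:style-name="P2">Fourth</text:p></text:list-item>
    <text:list-item><text:p text:style-name="P2">Fifth</text:p></text:list-item>
    <text:list-item><text:p text:style-name="P2">Sixth</text:p></text:list-item>
</text:list>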

Note that the text:list element itself has what looks like a unique id associated with it. This is a yellow flag that indicates to me that if you want to copy the entire list, you’ll need to give it a new unique id, and hope that LibreOffice doesn’t reference that id in some other file.

I’m sure the details are somewhere in that 846-page document. You can read that document, or you can also just try your change and see what happens. The worst case scenario is that LibreOffice will tell you that your document is corrupted and you’ll have to go back and explore some more.
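Duplicating bullets is easy to script, too. The sketch below assumes each bullet is a text:list-item element wrapping a text:p, which is how ODF structures lists. It clones the last existing item so the new bullets inherit its style. The append_items helper and the one-bullet SAMPLE are invented for illustration.

```python
import copy
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}

# A one-bullet list, trimmed down from what LibreOffice would write.
SAMPLE = (
    '<text:list xmlns:text="%s" text:style-name="L1">'
    '<text:list-item><text:p text:style-name="P2">First</text:p></text:list-item>'
    '</text:list>'
) % TEXT_NS


def append_items(list_element, labels):
    """Clone the last list item once per label so new bullets keep its style."""
    template = list_element.findall("text:list-item", NS)[-1]
    for label in labels:
        item = copy.deepcopy(template)
        item.find("text:p", NS).text = label
        list_element.append(item)


bullets = ET.fromstring(SAMPLE)
append_items(bullets, ["Second", "Third"])
labels = [p.text for p in bullets.iterfind("text:list-item/text:p", NS)]
print(labels)
```

This only adds items to an existing list. Copying a whole list would still run into the unique id question raised above.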

4. Zip a New LibreOffice File

Once you’ve made the changes you want, it’s time to reverse step 2, using your modified content.xml.

Here’s the command that works on my Mac—

cd unzipped && zip -r ../my_new_file.odt * && cd ..

Note that this command doesn’t respect the OpenDocument specification, which has rules regarding how the mimetype file should be represented in the zip file (as the first file in the archive, and uncompressed, per OpenDocument v1.2 part 3, § 3.3 MIME Media Type). It works for me, maybe because LibreOffice is forgiving. It’s not something you should rely on, however. In another post, I’ll present some Python code that constructs the ZIP file according to standard.
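In the meantime, here is a rough sketch of how the zipfile module can honor the two rules mentioned above: mimetype goes in first, and it is stored rather than compressed. The zip_document function and the demo files are made up for illustration, and I haven't checked this against every requirement in the specification.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_document(source_dir, odt_path):
    """Zip source_dir into odt_path with mimetype first and uncompressed."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(odt_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.write(source_dir / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for path in sorted(source_dir.rglob("*")):
            relative = path.relative_to(source_dir).as_posix()
            if path.is_file() and relative != "mimetype":
                archive.write(path, relative)


# Demo with stand-in files rather than a real unzipped document.
workdir = Path(tempfile.mkdtemp())
unzipped = workdir / "unzipped"
(unzipped / "META-INF").mkdir(parents=True)
(unzipped / "mimetype").write_text("application/vnd.oasis.opendocument.text")
(unzipped / "content.xml").write_text("<placeholder/>")
(unzipped / "META-INF" / "manifest.xml").write_text("<placeholder/>")

new_file = workdir / "my_new_file.odt"
zip_document(unzipped, new_file)
with zipfile.ZipFile(new_file) as archive:
    entries = archive.infolist()
print([entry.filename for entry in entries])
```

The output file is written outside the source directory so that it never ends up inside its own archive.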

5. Export to PDF via LibreOffice

If you’re just experimenting, you can just open the document in LibreOffice manually and then use the “File/Export as PDF…” menu item. (Opening manually is also a good test that you didn’t do anything objectionable to the XML.)

Programmatically, I recommend using unoconv for converting your finished document to PDF.
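Calling unoconv from Python can be as small as the sketch below. It assumes unoconv is installed and on your PATH, and it uses unoconv's -f option to pick the output format. The pdf_command and export_pdf names are invented for illustration, and only the command-building part runs in the demo.

```python
import shutil
import subprocess


def pdf_command(odt_path):
    """Build the unoconv command line that converts one document to PDF."""
    return ["unoconv", "-f", "pdf", str(odt_path)]


def export_pdf(odt_path):
    """Run unoconv; raises CalledProcessError if the conversion fails."""
    if shutil.which("unoconv") is None:
        raise RuntimeError("unoconv is not on the PATH")
    subprocess.run(pdf_command(odt_path), check=True)


# Demo: only build the command, so nothing is executed here.
command = pdf_command("my_new_file.odt")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.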


So there you have it! If you feel underwhelmed, keep in mind that this was only a proof of concept. In some future posts, I’ll explain why I think this method is often an excellent choice (and also when it isn’t).

Photo Credit

Thanks to the National Cancer Institute for making many photos available for free, including the one used in this blog post which was taken by Rhoda Baer.

July 28, 2016 03:51 AM

July 27, 2016

Caktus Consulting Group

How I Built a Power Debugger (PyCon 2016 Must-See Talk: 3/6)

Part three of six in our annual PyCon Must-See Series, a weekly highlight of talks our staff especially loved at PyCon. With so many fantastic talks, it’s hard to know where to start, so here’s our short list.

While at PyCon 2016, I really enjoyed Doug Hellmann’s talk, “How I built a power debugger out of the standard library and things I found on the internet” (video below). It's listed as a novice talk, but anyone can learn from it. Doug talked about the process of creating this project more than the project itself. He talked about his original idea, his motivations, and how he worked in pieces towards his goal. His approach and attitude were refreshing, including talking about places where he struggled and how long the process took. It was a beautiful glimpse into the mind of a very smart, creative, and humble developer.

More in the annual PyCon Must-See Talks Series.

July 27, 2016 02:15 PM

Ned Batchelder

Coverage.py 4.2

Coverage.py 4.2 is done.

As I mentioned in the beta 1 announcement, this contains work from the sprint at PyCon 2016 in Portland.

The biggest change since 4.1 is the only incompatible change. The "coverage combine" command now will ignore an existing .coverage data file, rather than appending to it as it used to do. This new behavior makes more sense to people, and matches how "coverage run" works. If you've ever seen (or written!) a tox.ini file with an explicit coverage-clean step, you won't have to any more. There's also a new "--append" option on "coverage combine", so you can get the old behavior if you want it.

The multiprocessing support continues to get the polish it deserves:

Finally, the text report can be sorted by columns as you wish, making it more convenient.

The complete change history is in the source.

July 27, 2016 01:35 PM

PyCon Australia

Announcing keynote speaker Damien George

The PyCon Australia team is quietly ecstatic to announce that our second keynote speaker will be Damien George.

Damien is the creator of MicroPython and ran two very fruitful Kickstarter campaigns to build a community around this microcontroller language. He has built a successful company based on MicroPython and the pyboard, brought it to makers, teachers and industry developers around the world, worked with the BBC on the micro:bit project, and embarked on projects with the European Space Agency to bring MicroPython into space.

“Damien’s work, and continuing community efforts, have been an important part of the Python ecosystem,” said Richard Jones, conference chair. “I’m especially excited to hear Damien talk through the journey of dreaming up and implementing a whole new Python just for the smallest possible deployments, on microcontrollers, and where that journey has taken him.”

We are excited to hear Damien’s keynote address and learn about MicroPython in our macro universe. Will you be there?

Registrations for PyCon Australia 2016 are already open and tickets are almost sold out. Book your conference ticket today!

July 27, 2016 01:21 PM

Mike Driscoll

Python: Visualization with Bokeh

The Bokeh package is an interactive visualization library that uses web browsers for its presentation. Its goal is to provide graphics in the vein of D3.js that look elegant and are easy to construct. Bokeh supports large and streaming datasets. You will probably be using this library for creating plots / graphs. One of its primary competitors seems to be Plotly.

Note: This will not be an in-depth tutorial on the Bokeh library as the number of different graphs and visualizations it is capable of is quite large. Instead, the aim of the article is to give you a taste of what this interesting library can do.

Let’s take a moment and get it installed. The easiest way to do so is to use pip or conda. Here’s how you can use pip:

pip install bokeh

This will install Bokeh and all its dependencies. You may want to install Bokeh into a virtualenv because of this, but that’s up to you. Now let’s check out a simple example. Save the following code into a file with whatever name you deem appropriate.

from bokeh.plotting import figure, output_file, show
output_file('line_example.html')  # assumed filename; any path works
x = range(1, 6)
y = [10, 5, 7, 1, 6]
plot = figure(title='Line example', x_axis_label='x', y_axis_label='y')
plot.line(x, y, legend='Test', line_width=4)
show(plot)

Here we just import a few items from the Bokeh library and tell it where to save the output. You will note that the output is HTML. Then we create some values for the x and y axes so we can create the plot. Next we create the figure object and give it a title and labels for the two axes. Finally we plot the line, give it a legend and line width, and show the plot. The show command will open your plot in your default browser. You should end up seeing something like this:


Bokeh also supports the Jupyter Notebook with the only change being that you will need to use output_notebook instead of output_file.

The Bokeh quick start guide has a neat example of a series of sine waves on a grid plot. I reduced the example down a bit to just one sine wave. Note that you will need NumPy installed for the following example to work correctly:

import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show
N = 100
x = np.linspace(0, 4*np.pi, N)
y0 = np.sin(x)
output_file('sine.html')  # assumed filename; any path works
sine = figure(width=500, plot_height=500, title='Sine')
sine.circle(x, y0, size=10, color="navy", alpha=0.5)
p = gridplot([[sine]], toolbar_location=None)
show(p)

The main difference between this example and the previous one is that we are using NumPy to generate the data points and we’re putting our figure inside of a gridplot instead of just drawing the figure itself. When you run this code, you should end up with a plot that looks like this:


If you don’t like circles, then you’ll be happy to know that Bokeh supports other shapes, such as square, triangle and several others.

Wrapping Up

The Bokeh project is really interesting and provides a simple, easy-to-use API for creating graphs, plots and other visualizations of your data. The documentation is quite well put together and includes lots of examples that showcase what this package can do for you. It is well worth just skimming the documentation so you can see what some of the other graphs look like and how short the code examples are that generate such nice results. My only gripe is that Bokeh doesn’t have a way to save an image file programmatically. This appears to be a long term bug that they have been trying to fix for a couple of years now. Hopefully they find a way to support that feature soon. Otherwise, I thought it was really cool!

July 27, 2016 12:30 PM

Python Software Foundation

PyGotham: a Python Conference at the United Nations

United Nations Headquarters

I've never had to take my belt off to get into a Python conference before.

This is the fifth year I have attended PyGotham, here in New York City. In past years we held the conference in a standard convention center or, memorably, on a couple of boats moored in the Hudson River. But this year PyGotham gathered in the United Nations.

What first struck me about the new venue was its vigilant security, of course. Guards in blue uniforms sent us through metal detectors and x-rayed our bags. Once I got through security and put my belt on, I entered the UN Conference Building. The lobby is full of inspiring posters about anti-poverty summits, scientific committees, global peace initiatives. In the conference rooms themselves every seat has its own microphone and an earpiece for simultaneous translation. Sound-proof booths surround and overlook each room, with signs in their thick glass windows saying "English", "French", "German". Along a hallway stands old-fashioned gray communications gear. There are rows of plug boards, analog meters, tape-to-tape reels, cathode ray tube screens surrounded by switches, buttons, and dials.

I turned my attention from the conference environment to the people there, and I was struck by a second novel impression: demographics! I've come to expect Python conferences to include many women and people of color, but at PyGotham 2016 women of color were particularly well-represented, and there were teenage coders and even a few pre-teens.

PyGotham is a production of Big Apple Py. Our spot in the UN is the outcome of a new partnership: PyGotham has joined Open Camps, a giant UN-sponsored series of technology conferences that focus on technology's humanitarian uses.

I interviewed Big Apple Py's Jon Banafato, and Open Camps coordinator Forest Mars, to learn more about why this PyGotham was so different from past years.

PyGotham started in 2011 as a conference for the New York City Python community. The conference has grown a lot since then. This year, we had over 500 attendees from around the world, but PyGotham still remains a tight-knit community of New Yorkers at heart.

The organizing team strives for a diverse speaker list and audience. This year’s conference would not have been the same without the help of the Python Software Foundation, who funded 50 diversity scholarship tickets.

A half-dozen community groups helped us get tickets into the right hands: NYC PyLadies, Girl Develop It, Django Girls NYC, Write Speak Code, and Women in Machine Learning and Data Science.

Working with Open Camps and the United Nations this year let us make 2016 the most affordable and largest PyGotham to date. We hope this better accomplishes our goals of promoting open source software and making Python education more accessible to all.

— Jon Banafato

Open Camps is a community-organized open source technology conference, which also happens to be one of the largest open source conferences in the world. This year (our 5th) nearly 6,000 individuals attended over the course of 10 days.

Open Camps is "mission-driven": we're distinguished from other conferences by our focus on how technology is used, its impact on the world, and its alignment with humanitarian ideals. Rather than proscribe, however, Open Camps provides a forum where these topics can be discussed.

At the start we were on the campuses of Columbia University and NYU. For the past three years we've been graciously hosted by the United Nations at their world headquarters in New York. Open Camps at the UN is a collaboration of the United Nations Open Source Innovation Initiative (Unite Open Source), the Open Camps organizing team and dozens of open-source communities.

Open Camps is dedicated to the principles of inclusiveness and diversity, and has always been free for anyone to attend. Our 2013 theme was "Get Off the Island"—we wanted to combat isolationism in communities of technology. Our first year at the UN we chose the theme "Women in Technology" featuring two keynote addresses by influential women in tech, and a panel discussion.

Since the beginning, we've included the "Next Generation" initiative for youth in technology. We work with CSNYC and ScriptEd. Open Camps speakers have been as young as 11 years old. The Next Gen program is also ongoing, and we have hosted numerous hands-on workshops teaching youth how to use open source technology.

Long-term goals for Open Camps include a "tech assembly": we want to bring together thought leaders from around the globe to engage in a broader conversation. We'll discuss consensus-driven tech, and technology transfer of open source tools and best practices between the technology "haves" and the technology "have nots".

We care about giving back. Each year we host programs ranging from "Coding for a Cause" to "Hacking for Humanity." Last year we had a ground-breaking event: not just the first Hackathon at the UN, but the first 24 hour hackathon. This year, our Unite For Humanity Hackathon drew 20 teams, again to spend 24 hours building solutions for the UN's 17 Sustainable Development Goals. The winning team will then work with the UN to develop their hackathon project into an application.

— Forest Mars

Image description: man posing in front of old-fashioned gray communications gear. There are analog meters, a tape-to-tape reel, a cathode-ray tube screen surrounded by switches, buttons, and dials. Directly behind the man is a plugboard with dozens of sockets.
Your correspondent, standing in front of vintage United Nations communications gear.

July 27, 2016 09:00 AM

July 26, 2016

A. Jesse Jiryu Davis

Talk Python to Me: "Write an Excellent Programming Blog"

Michael Kennedy and I talked about writing about programming. What kind of writing is most valuable, how do you choose a topic, improve your writing, find an audience, and find the time to write? Listen to the podcast on the Talk Python To Me site.

I've talked with Michael before: Episode 2 of "Talk Python to Me" was about MongoDB and Python.

By the way: Michael's a Python expert and a master teacher. Join his "Python for Entrepreneurs" course on Kickstarter now for early access to the course at a deeply discounted price.

July 26, 2016 08:50 PM

Marcos Dione


For a long time now I've been thinking about a problem: OSM data sometimes contains riverbanks that have no centerline. This means that someone mapped (part of) the banks of a river (or stream!), but didn't care about adding a line that would mark its centerline.

But this should be computationally solvable, right? Well, it's not that easy. See, given any riverbank polygon in OSM's database, you have 4 types of segments: those representing the right and left riverbanks (two types) and the flow-in and flow-out segments, which link the banks upstream and downstream. With a little bit of luck there will be only one flow-in and one flow-out segment, but there are no guarantees here.

One method could try to identify these segments, then draw a line starting in the middle of the flow-in segment, calculating the middle by traversing both banks at the same time, and finally connecting to the middle of the flow-out segment. Identifying the segments is hard by itself, and it is also possible that the result is not optimal, leading to a jagged line. I didn't try anything along those lines, but I could try some examples by hand...

Enter topology, the section of maths that deals with this kind of problem. The skeleton of a polygon is a group of lines that are equidistant to the borders of the polygon. One of the properties this set of lines provides is direction, which can be exploited to find the banks and try to apply the previous algorithm. But a skeleton has a lot of 'branches' that might confuse the algorithm. Going a little further, there's the medial axis, which in most cases can be considered a simplified skeleton, without most of the skeleton branches.

Enter free software :) CGAL is a library that can compute a lot of topological properties. PostGIS is clever enough to leverage those algorithms and present, among others, the functions ST_StraightSkeleton() and ST_ApproximateMedialAxis(). With these two and the original polygon I plan to derive the centerline. But first an image that will help explain it:

The green 'rectangle' is the original riverbank polygon. The thin black line is the skeleton for it; the medium red line is the medial. Notice how the medial and the center of the skeleton coincide. Then we have the four branches, which form two V shapes: each V has its vertex at one end of the medial, and its other two ends coincide with the ends of the flow-in and flow-out segments!

So the algorithm is simple: start with the medial; from its ends, find the branches in the skeleton that form that V; using the other two ends of those Vs, calculate the point right between them, and extend the medial to those points. This only calculates a centerline. The next step would be to give it a direction. For that I will need to see if there are any nearby lines that could be part of the river (that's what the centerline is for, to possibly extend existing rivers/centerlines), and use its direction to give it to the new centerline.
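To make the geometric step concrete, here's a toy sketch in plain Python, with tuples instead of Shapely geometries. It assumes the two outer ends of each V have already been found, which is the hard part and is not shown. The names midpoint and extend_medial and the 10 by 2 rectangle are made up for illustration; this is not the code I put online.

```python
def midpoint(a, b):
    """Return the point halfway between points a and b."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def extend_medial(medial, flow_in_ends, flow_out_ends):
    """Extend the medial to the middle of the flow-in and flow-out segments.

    medial is a list of (x, y) points; each *_ends argument is the pair of
    points where the V branches of the skeleton touch the polygon.
    """
    start = midpoint(*flow_in_ends)
    end = midpoint(*flow_out_ends)
    return [start] + list(medial) + [end]


# Toy riverbank: a 10 x 2 rectangle whose medial runs from (1, 1) to (9, 1).
medial = [(1.0, 1.0), (9.0, 1.0)]
centerline = extend_medial(
    medial,
    ((0.0, 0.0), (0.0, 2.0)),    # ends of the flow-in segment
    ((10.0, 0.0), (10.0, 2.0)),  # ends of the flow-out segment
)
print(centerline)
```

The result is still undirected; choosing which end is upstream is the separate step described above.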

For the moment the algorithm only solves this simple case. A slightly more complex case is not that trivial, as skeletons and medials are returned as a MultiLineString with a line for each segment, so I will have to rebuild them into LineStrings before processing.

I put all the code online, of course :) Besides a preloaded PostgreSQL+PostGIS database with OSM data, you'll need python3-sqlalchemy, geoalchemy, python3-fiona and python3-shapely. The first two allow me to fetch the data from the db. Ah! By the way, you will need a couple of views:

CREATE VIEW planet_osm_riverbank_skel   AS SELECT osm_id, way, ST_StraightSkeleton (way)      AS skel   FROM planet_osm_polygon WHERE waterway = 'riverbank';
CREATE VIEW planet_osm_riverbank_medial AS SELECT osm_id, way, ST_ApproximateMedialAxis (way) AS medial FROM planet_osm_polygon WHERE waterway = 'riverbank';

Shapely allows me to manipulate the polygonal data, and fiona is used to save the results to a shapefile. This is the first time I have used any of them (except SQLAlchemy), and it's nice that it's so easy to do all this in Python.

July 26, 2016 04:55 PM


Writing type stubs for Numpy

Continuing our coverage of MyPy (check parts #1, #2, and #3 of our “A Day With MyPy” series), this time we wanted to show you how we applied what we learned so far, by creating a type stub to a package that we use on a daily basis: NumPy.

Our goals regarding this experiment were:

All the code for the NumPy stub is available on GitHub.

MyPy stubs

When you want to add type annotations to code you don’t own, one solution is to write type stubs which are files with a description of the public interface of the modules with no implementations. Given that MyPy allows mixing dynamic and static typing, we decided to write the declarations for the most popular parts of numpy.


At the core of numpy are the ndarrays, which are multi-dimensional arrays that hold fixed-size items. Given that it’s the most popular part of the library, and that the rest of numpy is built on it, we decided to start by adding types to its interface.

This is when we encountered our first obstacle: Most of numpy is written in C. With a regular package written in Python, we would have walked through the code and written signatures that match it, adding the type information. This wasn’t possible with numpy. In some cases we used introspection, but we relied mostly on the reference documentation.

The second problem we faced was numpy’s inherent flexibility. Take this example, for instance:

In [1]: import numpy as np

In [2]: np.array('a string')
Out[2]: array('a string', dtype='<U8')

In [3]: np.array(2)
Out[3]: array(2)

In [4]: np.array([1,2,3])
Out[4]: array([1, 2, 3])

In [5]: np.array((1,2,"3"))
Out[5]: array(['1', '2', '3'], dtype='<U21')

The array function is used to create array objects, and as you can see, no matter what you pass as the object argument, it does its best to convert it to homogeneous values that can go into the ndarray it returns. This is great for users, but it is a source of headaches if you want to add type annotations.

Luckily, our type signature for ndarray allows us to be explicit about the type of items stored in the arrays:

class ndarray(_ArrayLike[_S], Generic[_S]):...

so we can do things like:

my_array = np.array([1,2,3])  # type: np.ndarray[int]

You’re probably wondering about the _ArrayLike[_S] class, as it doesn’t exist on the numpy namespace. We wrote this fictional class to describe the array interface that is common between arrays and scalars.

Little gotcha regarding type expressions

While testing the stub we found something that might affect other type stubs for structures that work as containers. Take a look at this example:

import numpy as np

def do_something(array: np.ndarray[bool]):
    return array.all()

some_array = np.array([True, False])  # type: np.ndarray[bool]

if do_something(some_array):
    print('done something')

It all seems fine, and mypy doesn’t complain about it, but if you try to run it, you’ll get the following error:

$ python
Traceback (most recent call last):
  File "", line 3, in <module>
    def do_something(array: np.ndarray[bool]):
TypeError: 'type' object is not subscriptable

This makes total sense because, at runtime, ndarray is not a descendant of Generic. This is why we have classes like List or Dict in the typing module, so the type declaration doesn’t clash with the actual classes. There’s an easy workaround, which is to surround the type declaration in quotes:

import numpy as np

def do_something(array: 'np.ndarray[bool]'):
    return array.all()

some_array = np.array([True, False])  # type: np.ndarray[bool]

if do_something(some_array):
    print('done something')

This way the type expression is evaluated as a string and no errors are generated. Notice that there was no problem with the second declaration as it was in a comment, and those aren’t evaluated.

Although this is a valid work-around, we will most likely introduce a class named NDarray (to follow the pattern established by the typing module) that can be used safely in type declarations.
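The difference can be reproduced without numpy. In the sketch below, PlainArray is a made-up stand-in for a runtime class that is not Generic, and NDarray is only a guess at what such a typing-only class could look like, not the class from our stub.

```python
from typing import Generic, TypeVar

_S = TypeVar("_S")


class PlainArray:
    """Stand-in for a runtime class that does not inherit from Generic."""


class NDarray(Generic[_S]):
    """Hypothetical typing-only class, in the spirit of typing.List."""


# Subscripting the plain class fails as soon as the expression is evaluated.
try:
    PlainArray[bool]
    plain_error = None
except TypeError as exc:
    plain_error = exc


# A quoted annotation is stored as a string and never evaluated at runtime.
def quoted(array: "PlainArray[bool]"):
    return array


# A Generic subclass can be subscripted directly, so no quotes are needed.
def generic(array: NDarray[bool]):
    return array


print(type(plain_error).__name__)
print(quoted.__annotations__["array"])
```

The same reasoning explains why the type comment in the earlier example caused no trouble: comments are never evaluated either.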

Problems, problems everywhere

We tried our best to provide meaningful type declarations for mypy’s type inference engine, but the dynamic nature of numpy made it difficult sometimes. Take this signature for example:

def all(self, axis: AxesType=None, out: '_ArrayLike[_U]'=None,
        keepdims: bool=None) -> Union['_ArrayLike[_U]', '_ArrayLike[bool]']: ...

According to the ndarray.all documentation, it returns True when all array elements along a given axis evaluate to True. It actually returns a numpy.bool_ scalar, hence the _ArrayLike[bool] signature. However, if the out parameter is passed, the type of the return value would be the same as out’s.

The proper way to declare all would have been something like:

@overload
def all(self, axis: AxesType=None, keepdims: bool=None) -> '_ArrayLike[bool]': ...

@overload
def all(self, axis: AxesType=None, keepdims: bool=None,
        *, out: '_ArrayLike[_U]') -> '_ArrayLike[_U]': ...

But due to a mypy bug we had to go with the former declaration. Once the bug has been dealt with, we’ll improve the declarations to help mypy’s type inference engine.
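For readers who haven't met overloaded signatures, here's a small runnable sketch of the pattern using typing.overload. The BoolRows class is a toy invented for this illustration and has nothing to do with numpy's implementation; it only mirrors the idea that the return type depends on whether out is passed.

```python
from typing import List, Optional, overload


class BoolRows:
    """Toy container of boolean rows, used only to illustrate overloads."""

    def __init__(self, rows: List[List[bool]]) -> None:
        self.rows = rows

    @overload
    def all(self) -> bool: ...

    @overload
    def all(self, *, out: List[bool]) -> List[bool]: ...

    def all(self, *, out: Optional[List[bool]] = None):
        # Inside the method body, `all` refers to the builtin function.
        result = all(all(row) for row in self.rows)
        if out is None:
            return result
        # When `out` is given, write into it and return the same object.
        out[:] = [result]
        return out


buffer = [True]
plain = BoolRows([[True, True]]).all()
returned = BoolRows([[True, False]]).all(out=buffer)
print(plain, returned)
```

A type checker sees only the two decorated signatures; the undecorated definition is the one that runs.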

We also encountered problems within numpy itself.

In [1]: import numpy as np

In [2]: nda = np.random.rand(4,5) < 0.5

In [3]: ndb = np.arange(5)

In [4]: nda.all(axis=0, out=ndb)
Out[4]: array([0, 1, 0, 0, 0])

In [5]: nda.all(0, ndb)
(traceback not shown)
TypeError: data type not understood

According to the argument specification of ndarray.all, there shouldn’t be any problems with the last statement. In the implementation, however, the positional arguments are not exactly the same as in the docs.

With these problems in mind, we tried the stub against some of our own code. Here’s a snippet that shows what we found:

$ mypy --strict-optional --check-untyped-defs
error: No library stub file for module 'scipy'
note: (Stub files are from
error: No library stub file for module 'sklearn'
error: No library stub file for module 'sklearn.utils.fixes'
error: No library stub file for module 'sklearn.utils.extmath'
error: No library stub file for module 'sklearn.datasets'
error: No library stub file for module 'sklearn.linear_model'
note: In member "fit" of class "LR":
error: "module" has no attribute "unique"
note: In member "decision_function" of class "LR":
error: "module" has no attribute "dot"
note: In member "predict" of class "LR":
error: "module" has no attribute "int"
note: In member "predict_proba" of class "LR":
error: "module" has no attribute "dot"
note: In member "likelihood" of class "LR":
error: "module" has no attribute "dot"
error: "module" has no attribute "sum"
error: "module" has no attribute "dot"
error: "module" has no attribute "dot"

Besides the missing stubs for scipy and sklearn (we might tackle those in the future), most of the problems came from the fact that the developer used the array operation functions (like dot or sum) defined on the numpy namespace instead of the methods defined on the ndarray class. Here’s an example of this:

def decision_function(self, X_test):
    scores = np.dot(X_test, self.weights[:-1].T) + self.weights[-1]
    return scores.ravel() if len(scores.shape) > 1 and scores.shape[1] == 1 else scores

Here, the developer used the np.dot function instead of the dot method of ndarray. We found that this happens quite often (at least in our code), so we’re going to add type declarations for the most common functions defined in the top-level numpy namespace.


During one of our meetings we reviewed our findings and decided that we could improve the stub with a little bit of user input. So if you think you’re up to it, please take a look at the code and give us your feedback. Even if you think we did everything wrong, that’ll be a great help for us, as we aim to provide a meaningful contribution to the NumPy, MyPy and Python communities in general.

July 26, 2016 04:15 PM

Graham Dumpleton

Installing mod_wsgi on MacOS X with native operating system tools.

Operating systems inevitably change over time, and because writing documentation is often an afterthought or developers have no time, the existing instructions on how to install a piece of software can suffer bit rot and stop working. This has been the case for a while with various parts of the documentation for mod_wsgi. This post is a first step at least in getting the documentation for

July 26, 2016 02:21 PM

Bruno Rocha

Microservices with Python, RabbitMQ and Nameko

"Micro-services is the new black" - Splitting the project into independently scalable services is currently the best option to ensure the evolution of the code. In Python there is a framework called "Nameko" which makes this very easy and powerful.

Micro services

The term "Microservice Architecture" has sprung up over the last few years to describe a particular way of designing software applications as suites of independently deployable services. - M. Fowler

I recommend reading Fowler's posts to understand the theory behind it.

OK, so what does it mean?

In brief, a Micro Service Architecture exists when your system is divided into small blocks of responsibility (each bound to a single context). Those blocks don't know about each other; they share only a common point of communication, generally a message queue, and they know only the communication protocol and interfaces.
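The idea can be sketched with nothing but the standard library. In this rough illustration, two worker threads stand in for services and in-process queues stand in for the message broker; the names compute_queue, mail_queue and outbox, and the address user@example.com, are assumptions made up for the example.

```python
import queue
import threading

compute_queue = queue.Queue()  # messages addressed to the compute block
mail_queue = queue.Queue()     # messages addressed to the mail block
outbox = []                    # stands in for actually sending email


def compute_service():
    """Consume compute requests; knows the mail block only via its queue."""
    while True:
        message = compute_queue.get()
        if message is None:  # sentinel: shut down
            break
        result = message['value'] + message['other']
        mail_queue.put({'to': message['email'],
                        'body': 'The result is: %s' % result})


def mail_service():
    """Consume mail requests; knows nothing about who produced them."""
    while True:
        message = mail_queue.get()
        if message is None:
            break
        outbox.append(message)


def run_demo():
    workers = [threading.Thread(target=compute_service),
               threading.Thread(target=mail_service)]
    for worker in workers:
        worker.start()
    compute_queue.put({'value': 30, 'other': 10,
                       'email': 'user@example.com'})
    compute_queue.put(None)   # stop the compute block first
    workers[0].join()
    mail_queue.put(None)      # then stop the mail block
    workers[1].join()
    return outbox
```

Neither function calls the other directly; the only shared contract is the shape of the messages placed on the queues.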

Give me a real-life example

The code is available on github: take a look at service and api folders for more info.

Consider that you have a REST API with an endpoint that receives some data, and you need to perform some kind of computation on that data. Instead of blocking the caller, you can do it asynchronously: return a status "OK - Your request will be processed" to the caller and do the work in a background task.

Also you want to send an email notification when the computation is finished without blocking the main computing process, so it is better to delegate the "email sending" to another service.


Show me the code!

Let's create the system to understand it in practice.


We need an environment with:


The easiest way to have RabbitMQ in a development environment is to run its official Docker container. Assuming you have Docker installed, run:

docker run -d --hostname my-rabbit --name some-rabbit -p 15672:15672 -p 5672:5672 rabbitmq:3-management

Go to the browser and access http://localhost:15672 using the credentials guest:guest. If you can log in to the RabbitMQ dashboard, it means you have it running locally for development.

The Service environment

Now let's create the Micro Services to consume our tasks. We'll have one service for computing and another for mail. Follow the steps below.

In a shell create the root project directory

$ mkdir myproject
$ cd myproject

Create and activate a virtualenv (you can also use virtualenv-wrapper)

$ virtualenv service_env
$ source service_env/bin/activate

Install nameko framework and yagmail

(service_env)$ pip install nameko
(service_env)$ pip install yagmail

The service code

Now that the virtualenv is prepared (keep in mind you can run the service on one server and the API on another), let's code the Nameko RPC services.

We are going to put both services in a single Python module, but you can also split them into separate modules and run them on separate servers if needed.

In a file called

import yagmail
from nameko.rpc import rpc, RpcProxy


class Mail(object):
    name = "mail"

    @rpc
    def send(self, to, subject, contents):
        yag = yagmail.SMTP('', 'mypassword')
        # read the above credentials from a safe place.
        # Tip: take a look at Dynaconf setting module
        yag.send(to=to, subject=subject, contents=[contents])


class Compute(object):
    name = "compute"
    mail = RpcProxy('mail')

    @rpc
    def compute(self, operation, value, other, email):
        operations = {'sum': lambda x, y: int(x) + int(y),
                      'mul': lambda x, y: int(x) * int(y),
                      'div': lambda x, y: int(x) / int(y),
                      'sub': lambda x, y: int(x) - int(y)}
        try:
            result = operations[operation](value, other)
        except Exception as e:
            # notify the user about the failure, then re-raise for the caller
            self.mail.send.async(email, "An error occurred", str(e))
            raise
        else:
            self.mail.send.async(
                email,
                "Your operation is complete!",
                "The result is: %s" % result
            )
            return result
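The try/except/else flow of the compute method can be exercised without a broker. The following is only a local sketch: compute_locally and the notify callback are hypothetical stand-ins for the RPC method and for the asynchronous mail call, and it is written for Python 3.

```python
def compute_locally(operation, value, other, notify):
    """Same dispatch-table logic as the service, with mail replaced by a callback."""
    operations = {'sum': lambda x, y: int(x) + int(y),
                  'mul': lambda x, y: int(x) * int(y),
                  'div': lambda x, y: int(x) / int(y),
                  'sub': lambda x, y: int(x) - int(y)}
    try:
        result = operations[operation](value, other)
    except Exception as error:
        # Unknown operation, bad input or division by zero all land here.
        notify("An error occurred", str(error))
        raise
    else:
        notify("Your operation is complete!", "The result is: %s" % result)
        return result


messages = []


def record(subject, body):
    """Collect notifications instead of sending email."""
    messages.append((subject, body))
```

Calling compute_locally('sum', 30, 10, record) returns the sum and records a success message, while a failing operation records an error message before the exception propagates.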

Now with the above services definition we need to run it as a Nameko RPC service.

NOTE: We are going to run it in a console and leave it running, but in production it is recommended to run the service under supervisord or an alternative.

Run the service and leave it running in a shell

(service_env)$ nameko run service --broker amqp://guest:guest@localhost
starting services: mail, compute
Connected to amqp://guest:**@
Connected to amqp://guest:**@

Testing it

Go to another shell (with the same virtualenv) and test it using nameko shell

(service_env)$ nameko shell --broker amqp://guest:guest@localhost
Nameko Python 2.7.9 (default, Apr  2 2015, 15:33:21) 
[GCC 4.9.2] shell on linux2
Broker: amqp://guest:guest@localhost

You are now in the RPC client testing shell exposing the n.rpc object, play with it

>>> n.rpc.mail.send("", "testing", "Just testing")

The above should send an email. We can also call the compute service to test it; note that it also spawns an asynchronous mail message with the result.

>>> n.rpc.compute.compute('sum', 30, 10, "")
>>> n.rpc.compute.compute('sub', 30, 10, "")
>>> n.rpc.compute.compute('mul', 30, 10, "")
>>> n.rpc.compute.compute('div', 30, 10, "")

Calling the micro-service through the API

In a different shell (or even a different server) prepare the API environment

Create and activate a virtualenv (you can also use virtualenv-wrapper)

$ virtualenv api_env
$ source api_env/bin/activate

Install Nameko, Flask and Flasgger

(api_env)$ pip install nameko
(api_env)$ pip install flask
(api_env)$ pip install flasgger

NOTE: In the API environment you don't need yagmail because sending mail is the service's responsibility

Let's say you have the following code in a file

from flask import Flask, request
from flasgger import Swagger
from nameko.standalone.rpc import ClusterRpcProxy

app = Flask(__name__)
Swagger(app)  # serves the Flasgger UI built from the docstring below
CONFIG = {'AMQP_URI': "amqp://guest:guest@localhost"}


@app.route('/compute', methods=['POST'])
def compute():
    """
    Micro Service Based Compute and Mail API
    This API is made with Flask, Flasgger and Nameko
    ---
    parameters:
      - name: body
        in: body
        required: true
        schema:
          id: data
          properties:
            operation:
              type: string
              enum:
                - sum
                - mul
                - sub
                - div
            email:
              type: string
            value:
              type: integer
            other:
              type: integer
    responses:
      200:
        description: Please wait the calculation, you'll receive an email with results
    """
    operation = request.json.get('operation')
    value = request.json.get('value')
    other = request.json.get('other')
    email = request.json.get('email')
    msg = "Please wait the calculation, you'll receive an email with results"
    subject = "API Notification"
    with ClusterRpcProxy(CONFIG) as rpc:
        # asynchronously spawning an email notification
        rpc.mail.send.async(email, subject, msg)
        # asynchronously spawning the compute task
        result = rpc.compute.compute.async(operation, value, other, email)
        return msg, 200


if __name__ == '__main__':
    app.run()

Run the above API in a different shell or server

(api_env) $ python 
 * Running on (Press CTRL+C to quit)

Then access the URL http://localhost:5000/apidocs/index.html. You will see the Flasgger UI, where you can interact with the API and start producing tasks on the queue for the service to consume.
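If you prefer to call the endpoint from code rather than from the Flasgger UI, a request could be built roughly like this. It is a sketch under assumptions: the API is taken to listen on localhost:5000 as in the URL above, the email address is a made-up placeholder, and the actual send is left commented out so the snippet runs without a server.

```python
import json
import urllib.request

# The keys match what the /compute view reads from request.json.
payload = {'operation': 'sum', 'value': 30, 'other': 10,
           'email': 'user@example.com'}  # hypothetical address

request = urllib.request.Request(
    'http://localhost:5000/compute',
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
    method='POST',
)

# Uncomment to actually send it once the API and the service are running:
# response = urllib.request.urlopen(request)
# print(response.read().decode('utf-8'))
```

The JSON content type matters because the view reads its input through request.json.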


NOTE: You can watch the shell where the service is running for logging, prints and error messages. You can also access the RabbitMQ dashboard to see whether any messages are being processed there.

There are many more advanced things you can do with the Nameko framework; you can find more information on

Let's Micro Serve!

July 26, 2016 01:29 PM

Mike Driscoll

Python 3 – An Intro to asyncio

The asyncio module was added to Python in version 3.4 as a provisional package. What that means is that it is possible that asyncio receives backwards incompatible changes or could even be removed in a future release of Python. According to the documentation asyncio “provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives“. This chapter is not meant to cover everything you can do with asyncio, however you will learn how to use the module and why it is useful.

If you need something like asyncio in an older version of Python, then you might want to take a look at Twisted or gevent.


The asyncio module provides a framework that revolves around the event loop. An event loop basically waits for something to happen and then acts on the event. It is responsible for handling such things as I/O and system events. Asyncio actually has several loop implementations available to it. The module will default to the one most likely to be the most efficient for the operating system it is running under; however you can explicitly choose the event loop if you so desire. An event loop basically says “when event A happens, react with function B”.

Think of a server as it waits for someone to come along and ask for a resource, such as a web page. If the website isn’t very popular, the server will be idle for a long time. But when it does get a hit, then the server needs to react. This reaction is known as event handling. When a user loads the web page, the server will check for and call one or more event handlers. Once those event handlers are done, they need to give control back to the event loop. To do this in Python, asyncio uses coroutines.

A coroutine is a special function that can give up control to its caller without losing its state. A coroutine is a consumer and an extension of a generator. One of their big benefits over threads is that they don’t use very much memory to execute. Note that when you call a coroutine function, it doesn’t actually execute. Instead it will return a coroutine object that you can pass to the event loop to have it executed either immediately or later on.
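A short sketch makes the last point concrete: calling a coroutine function only creates a coroutine object, and the body runs once an event loop drives it. The names greet and calls are made up for this illustration.

```python
import asyncio
import inspect

calls = []


async def greet(name):
    calls.append(name)
    return 'hello ' + name


coro = greet('world')          # creates a coroutine object; the body has not run
ran_before = list(calls)       # snapshot taken before the loop touches it
is_coro = inspect.iscoroutine(coro)

loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(coro)  # now the body executes
finally:
    loop.close()
```

Only after run_until_complete does the name end up in calls.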

One other term you will likely run across when you are using the asyncio module is future. A future is basically an object that represents the result of work that hasn’t completed. Your event loop can watch future objects and wait for them to finish. When a future finishes, it is set to done. Asyncio also supports locks and semaphores.

The last piece of information I want to mention is the Task. A Task is a wrapper for a coroutine and a subclass of Future. You can even schedule a Task using the event loop.
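As a small illustration of these two terms, the sketch below creates a bare future, lets the loop fill it in, and checks the relationship between Task and Future. It is just one way to show the idea, not a recommended pattern for real programs.

```python
import asyncio

loop = asyncio.new_event_loop()
future = loop.create_future()       # a placeholder for a result that does not exist yet
done_before = future.done()

# Ask the loop to complete the future on its next iteration.
loop.call_soon(future.set_result, 42)

try:
    value = loop.run_until_complete(future)  # runs until the future is done
finally:
    loop.close()

done_after = future.done()
task_is_future = issubclass(asyncio.Task, asyncio.Future)
```

The loop stops as soon as the future it is waiting on is marked done.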

async and await

The async and await keywords were added in Python 3.5 to define a native coroutine and make them a distinct type when compared with a generator based coroutine. If you’d like an in-depth description of async and await, you will want to check out PEP 492.

In Python 3.4, you would create a coroutine like this:

# Python 3.4 coroutine example
import asyncio

@asyncio.coroutine
def my_coro():
    yield from func()

This decorator still works in Python 3.5, but the types module received an update in the form of a coroutine function which will now tell you if what you’re interacting with is a native coroutine or not. Starting in Python 3.5, you can use async def to syntactically define a coroutine function. So the function above would end up looking like this:

import asyncio
async def my_coro():
    await func()

When you define a coroutine in this manner, you cannot use yield inside the coroutine function. Instead it must include a return or await statement, which are used for returning values to the caller. Note that the await keyword can only be used inside an async def function.

The async / await keywords can be considered an API to be used for asynchronous programming. The asyncio module is just a framework that happens to use async / await for programming asynchronously. There is actually a project called curio that proves this concept as it is a separate implementation of an event loop that uses async / await underneath the covers.
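To see that async / await does not depend on asyncio, here is a toy driver that steps a native coroutine by hand using send. The helper names pause, worker and run are invented for this sketch, and it is far simpler than what curio or asyncio actually do.

```python
import types


@types.coroutine
def pause():
    # A generator-based awaitable: yielding hands control back to the driver.
    yield 'paused'


async def worker():
    await pause()
    return 'done'


def run(coro):
    """Drive a coroutine to completion, collecting whatever it yields."""
    steps = []
    try:
        while True:
            steps.append(coro.send(None))
    except StopIteration as stop:
        # The coroutine's return value travels on the StopIteration.
        return steps, stop.value
```

Calling run(worker()) steps through the single pause and then hands back the return value, with no event loop involved.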

A Coroutine Example

While it is certainly helpful to have a lot of background information into how all this works, sometimes you just want to see some examples so you can get a feel for the syntax and how to put things together. So with that in mind, let’s start out with a simple example!

A fairly common task that you will want to complete is downloading a file from some location whether that be an internal resource or a file on the Internet. Usually you will want to download more than one file. So let’s create a pair of coroutines that can do that:

import asyncio
import os
import urllib.request


async def download_coroutine(url):
    """
    A coroutine to download the specified url
    """
    request = urllib.request.urlopen(url)
    filename = os.path.basename(url)

    with open(filename, 'wb') as file_handle:
        while True:
            chunk = request.read(1024)  # chunk size is an assumption
            if not chunk:
                break
            file_handle.write(chunk)
    msg = 'Finished downloading {filename}'.format(filename=filename)
    return msg


async def main(urls):
    """
    Creates a group of coroutines and waits for them to finish
    """
    coroutines = [download_coroutine(url) for url in urls]
    completed, pending = await asyncio.wait(coroutines)
    for item in completed:
        print(item.result())


if __name__ == '__main__':
    urls = [""]  # the URLs are elided in the source; list the files to download here

    event_loop = asyncio.get_event_loop()
    try:
        event_loop.run_until_complete(main(urls))
    finally:
        event_loop.close()

In this code, we import the modules that we need and then create our first coroutine using the async syntax. This coroutine is called download_coroutine and it uses Python’s urllib to download whatever URL is passed to it. When it is done, it will return a message that says so.

The other coroutine is our main coroutine. It basically takes a list of one or more URLs and queues them up. We use asyncio’s wait function to wait for the coroutines to finish. Of course, to actually start the coroutines, they need to be added to the event loop. We do that at the very end where we get an event loop and then call its run_until_complete method. You will note that we pass in the main coroutine to the event loop. This starts running the main coroutine which queues up the second coroutine and gets it going. This is known as a chained coroutine.
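The completed and pending sets returned by asyncio.wait can be explored without touching the network. In this sketch, fake_download is a hypothetical stand-in that sleeps briefly instead of fetching anything, and the coroutines are wrapped in tasks explicitly before being handed to wait.

```python
import asyncio


async def fake_download(name, delay):
    # Stand-in for a real download: just wait a little.
    await asyncio.sleep(delay)
    return 'Finished downloading {}'.format(name)


async def main(names):
    tasks = [asyncio.ensure_future(fake_download(name, 0.01))
             for name in names]
    # With the default settings, wait returns once every task is done.
    completed, pending = await asyncio.wait(tasks)
    return sorted(task.result() for task in completed), len(pending)


loop = asyncio.new_event_loop()
try:
    messages, still_pending = loop.run_until_complete(
        main(['a.pdf', 'b.pdf']))
finally:
    loop.close()
```

The completed set is unordered, which is why the results are sorted before being returned.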

Scheduling Calls

You can also schedule calls to regular functions using the asyncio event loop. The first method we’ll look at is call_soon. The call_soon method will basically call your callback or event handler as soon as it can. It works as a FIFO queue, so if some of the callbacks take a while to run, then the others will be delayed until the previous ones have finished. Let’s look at an example:

import asyncio
import functools


def event_handler(loop, stop=False):
    print('Event handler called')
    if stop:
        print('stopping the loop')
        loop.stop()


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.call_soon(functools.partial(event_handler, loop))
        print('starting event loop')
        loop.call_soon(functools.partial(event_handler, loop, stop=True))

        loop.run_forever()
    finally:
        print('closing event loop')
        loop.close()

The majority of asyncio’s functions do not accept keywords, so we will need the functools module if we need to pass keywords to our event handler. Our regular function will print some text out to stdout whenever it is called. If you happen to set its stop argument to True, it will also stop the event loop.

The first time we call it, we do not stop the loop. The second time we call it, we do stop the loop. The reason we want to stop the loop is that we’ve told it to run_forever, which will put the event loop into an infinite loop. Once the loop is stopped, we can close it. If you run this code, you should see the following output:

starting event loop
Event handler called
Event handler called
stopping the loop
closing event loop

There is a related function called call_soon_threadsafe. As the name implies, it works the same way as call_soon, but it’s thread-safe.
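A minimal sketch of when the thread-safe variant matters: a worker thread schedules a callback on a loop that runs in the main thread. The function names here are made up for the example.

```python
import asyncio
import threading


def run_demo():
    loop = asyncio.new_event_loop()
    seen = []

    def handler(message):
        # Runs inside the event loop's thread.
        seen.append(message)
        loop.stop()

    def worker():
        # Runs in another thread, so it must not use plain call_soon.
        loop.call_soon_threadsafe(handler, 'hello from worker thread')

    thread = threading.Thread(target=worker)
    thread.start()
    loop.run_forever()   # returns after handler stops the loop
    thread.join()
    loop.close()
    return seen
```

The thread-safe call also wakes the loop up if it is idle, which a plain call_soon from another thread would not do.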

If you want to actually delay a call until some time in the future, you can do so using the call_later function, which takes a delay in seconds followed by the callback and its arguments. To schedule a call at a specific time instead, you first need to grab the event loop's own clock time rather than the computer's wall-clock time:

current_time = loop.time()

Once you have that, then you can just use the call_at function and pass it the time that you want it to call your event handler. So let’s say we want to call our event handler five minutes from now. Here’s how you might do it:

loop.call_at(current_time + 300, event_handler, loop)

In this example, we take the current time that we grabbed and add 300 seconds, or five minutes, to it. By so doing, we delay calling our event handler for five minutes! Pretty neat!
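Waiting five minutes is impractical for a quick experiment, so here is the same idea sketched with delays of a few hundredths of a second. The list named order is only there to record which callback fired first.

```python
import asyncio

loop = asyncio.new_event_loop()
order = []

# call_later takes a delay relative to now.
loop.call_later(0.05, order.append, 'call_later')

# call_at takes an absolute time on the loop's own clock.
loop.call_at(loop.time() + 0.01, order.append, 'call_at')

# Stop the loop after both callbacks have had time to run.
loop.call_later(0.1, loop.stop)

try:
    loop.run_forever()
finally:
    loop.close()
```

The call_at callback runs first here because its target time is earlier, even though it was scheduled second.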


Tasks are a subclass of a Future and a wrapper around a coroutine. They give you the ability to keep track of when they finish processing. Because they are a type of Future, other coroutines can wait for a task and you can also grab the result of a task when it’s done processing. Let’s take a look at a simple example:

import asyncio
import time


async def my_task(seconds):
    """
    A task to do for a number of seconds
    """
    print('This task is taking {} seconds to complete'.format(
        seconds))
    time.sleep(seconds)  # simulates work; note that this blocks the event loop
    return 'task finished'


if __name__ == '__main__':
    my_event_loop = asyncio.get_event_loop()
    try:
        print('task creation started')
        task_obj = my_event_loop.create_task(my_task(seconds=2))
        my_event_loop.run_until_complete(task_obj)
    finally:
        my_event_loop.close()

    print("The task's result was: {}".format(task_obj.result()))

Here we create an asynchronous function that accepts the number of seconds it will take to run. This simulates a long-running process. Then we create our event loop and create a task object by calling the event loop object’s create_task function. The create_task function accepts the coroutine object that we want to turn into a task. Then we tell the event loop to run until the task completes. At the very end, we get the result of the task since it has finished.

Tasks can also be canceled very easily by using their cancel method. Just call it when you want to end a task. Should a task get canceled when it is waiting for another operation, the task will raise a CancelledError.
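Cancellation can be sketched in a few lines. Here slow_task is a made-up coroutine that would sleep for ten seconds if left alone; it is cancelled almost immediately instead.

```python
import asyncio


async def slow_task():
    await asyncio.sleep(10)
    return 'never reached'


async def main():
    task = asyncio.ensure_future(slow_task())
    await asyncio.sleep(0)      # give the task a chance to start sleeping
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        # Awaiting a cancelled task raises CancelledError in the waiter.
        return task.cancelled()
    return False


loop = asyncio.new_event_loop()
try:
    was_cancelled = loop.run_until_complete(main())
finally:
    loop.close()
```

The program finishes right away rather than waiting out the ten-second sleep.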

Wrapping Up

At this point, you should know enough to start working with the asyncio library on your own. The asyncio library is very powerful and allows you to do a lot of really cool and interesting tasks. You should check out which is a curated listing of various projects that are using asyncio. It is a wonderful place to get ideas for how to use this library. The Python documentation is also a great place to start from.

July 26, 2016 12:30 PM

qutebrowser development blog

qutebrowser v0.8.0 released

I'm happy to announce the release of qutebrowser v0.8.0!

qutebrowser is a keyboard driven browser with a vim-like, minimalistic interface. It's written using PyQt and is cross-platform.

The main reason for this release is that v0.7.0 will break with PyQt 5.7 which is soon going to be released.

I decided to do a new minor release instead of a patch release as plenty of new features have accumulated already. If your distribution can't update to v0.8.0 for some reason, backporting this patch should work, though I haven't verified this.

This release also got a big refactoring to prepare for QtWebEngine support. To my current knowledge, all issues have been smoothed out. If not, crash reports shall now tell me. ;)

You can also already start with "--backend webengine" with this release to try the QtWebEngine support - however many features are still missing.

Source release and binaries for Windows/OS X are available, the Debian packages are still work-in-progress.

The full changelog for this release:


Added

  • New :repeat-command command (mapped to .) to repeat the last command. Note that two former default bindings conflict with that binding, unbinding them via :unbind .i and :unbind .o is recommended.
  • New qute:bookmarks page which displays all bookmarks and quickmarks.
  • New :prompt-open-download (bound to Ctrl-X) which can be used to open a download directly when getting the filename prompt.
  • New {host} replacement for tab- and window titles which evaluates to the current host.
  • New default binding ;t for :hint input.
  • New variables $QUTE_CONFIG_DIR, $QUTE_DATA_DIR and $QUTE_DOWNLOAD_DIR available for userscripts.
  • New option ui -> status-position to configure the position of the status bar (top/bottom).
  • New --pdf <filename> argument for :print which can be used to generate a PDF without a dialog.


Changed

  • :scroll-perc now prefers a count over the argument given to it, which means gg can be used with a count.
  • Aliases can now use ;; to have an alias which executes multiple commands.
  • :edit-url now does nothing if the URL isn't changed in the spawned editor.
  • :bookmark-add can now be passed a URL and title to add that as a bookmark rather than the current page.
  • New taskadd userscript to add a taskwarrior task annotated with the current URL.
  • :bookmark-del and :quickmark-del now delete the current page's URL if none is given.


Fixed

  • Compatibility with PyQt 5.7
  • Fixed some configuration values being lost when a config option gets removed from qutebrowser's code.
  • Fix crash when downloading with a full disk
  • Using :jump-mark (e.g. '') when the current URL is invalid doesn't crash anymore.


Removed

  • The ability to display status messages from webpages, as well as the related ui -> display-statusbar-messages setting.
  • The general -> wrap-search setting as searches now always wrap. According to a quick straw poll and prior crash logs, almost nobody is using wrap-search = false, and turning off wrapping is not possible with QtWebEngine.
  • :edit-url now doesn't accept a count anymore as its behavior was confusing and it doesn't make much sense to add a count.

Since v0.7.0, the following people have contributed to qutebrowser:

  • Ryan Roden-Corrent
  • Jan Verbeek
  • Daniel Schadt
  • Marshall Lochbaum
  • Ismail S
  • David Vogt
  • Michał Góral
  • Panashe M. Fundira
  • Jeremy Kaplan
  • Edgar Hipp
  • Daryl Finlay
  • Jean-Louis Fuchs
  • Kevin Velghe
  • Jakub Klinkovský
  • Dietrich Daroch

Thank you!

July 26, 2016 12:29 PM

David MacIver

It might be worth learning an ML-family language

It’s long been a popular opinion that learning Haskell or another ML-family language will make you a better programmer. I think this is probably true, but I think it’s an overly specific claim because learning almost anything will make you a better programmer, and I’ve not been convinced that Haskell is a much better choice than many other things in terms of reshaping your thinking. I’ve never thought that you shouldn’t learn Haskell of course, I’ve just not been convinced that learning Haskell purely for the sake of learning Haskell was the best use of time.

But… I’ve been noticing something recently when teaching Python programmers to use Hypothesis that has made me reconsider somewhat. Not so much a fundamental reshaping of the way you think as a highly useful microskill that people seem to struggle to learn in dynamically typed languages.

That skill is this: Keeping track of what the type of the value in a variable is.

That may not seem like an important skill in a dynamic language, but it really is: Although functions will tend to be more lenient about what type of value they accept (is it a list or a tuple? Who cares!), they will tend to go wrong in interesting and confusing ways when you get it too wrong, and you then waste valuable debugging time trying to figure out what you did wrong. A good development workflow will typically let you find the problem, but it will still take significantly longer than just not making the mistake in the first place.

In particular this seems to come up when the types are related but distinct. Hypothesis has a notion of a “strategy”, which is essentially a data generator, and people routinely seem to get confused as to whether something is a value of a type, a strategy for producing values of that type, or a function that returns a strategy for producing the type.
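The three levels can be made concrete with a toy model. To be clear, the Strategy class and the integers function below are hypothetical stand-ins written for this illustration and are not Hypothesis's real API; they only mimic the shape of the confusion.

```python
import random


class Strategy(object):
    """Toy data generator: knows how to draw a value from a random source."""

    def __init__(self, draw):
        self._draw = draw

    def example(self, rng):
        return self._draw(rng)


def integers(low, high):
    """A function that returns a strategy for producing integers."""
    return Strategy(lambda rng: rng.randint(low, high))


rng = random.Random(0)
factory = integers               # a function that returns a strategy
strategy = integers(1, 6)        # a strategy for producing ints
value = strategy.example(rng)    # an actual int

# Mixing the levels up fails only when the object is finally used.
try:
    strategy + 1
    mixed_up_detected = False
except TypeError:
    mixed_up_detected = True
```

All three names can be passed around freely in a dynamic language, so nothing complains until arithmetic is attempted on the strategy rather than on a value drawn from it.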

It might be that I’ve just created a really confusing API, but I don’t think that’s it – people generally seem to really like the API and this is by far the second most common basic usage error people make with it (the first most common is confusing the functions one_of and sampled_from, which do similar but distinct things. I’m still trying to figure out better names for them).

It took me a while to notice this because I just don’t think of it as a difficult thing to keep track of, but it’s definitely a common pattern. It also appears to be almost entirely absent from people who have used Haskell (and presumably other ML-family languages – any statically typed language with type-inference and a bias towards functional programming really, but I don’t know of anyone who has tried to use Hypothesis knowing an ML-family language without also knowing Haskell).

I think the reason for this is that in an ML family language, where the types are static but inferred, you are constantly playing a learning game with the compiler as your teacher: Whenever you get this wrong, the compiler tells you immediately that you’ve done so and localises it to the place where you’ve made the mistake. The error messages aren’t always that easy to understand, but it’s a lot easier to figure out where you’ve made the mistake than when the error message is instead “AttributeError: ‘int’ object has no attribute ‘kittens'” in some function unrelated to where you made the actual error. In the dynamically typed context, there’s a larger separation between the observed problem and the solution, which makes it harder to learn from the experience.

This is probably a game worth playing. If people are making this error when using Hypothesis, then I’d expect them to be making it in many other places too. I don’t expect many of these errors are making it through to production (especially if your code is well tested), but they’re certainly wasting time while developing.

In terms of which ML-family language to choose for this, I’m not sure. I haven’t actually used it myself yet (I don’t really have anything I want to write in the space that it targets), but I suspect Elm is probably the best choice. They’ve done some really good work on making type errors clear and easy to understand, which is likely exactly what you need for this sort of learning exercise.


July 26, 2016 09:23 AM

S. Lott

Another Python to the Rescue Story -- Creating a DSL from Python Class Definitions

July 26, 2016 08:00 AM

Talk Python to Me

#69 Write an Excellent Programming Blog

Do you have a blog? How many articles have you written for it? Do you find it hard to keep writing or hard to get started doing technical writing? We might be able to help you out with that this week.

You're probably aware that blogging is one of the key ways to establish yourself as a thought-leader in the industry. You'll make more connections, open more opportunities, and likely find your work more rewarding if you share your experiences and expertise through blogging.

But it can be challenging to keep writing or find time for writing. That's why I asked A. Jesse Jiryu Davis from MongoDB to share his thoughts on writing an excellent programming blog.

You'll even learn about Jesse's 5 "design patterns" for blogging to help break writer's block.

Links from the show:

  • PyCon Talk by Jesse
  • Excellent blog article
  • Unyielding by Glyph
  • Assigning to a threadlocal is not thread-safe
  • Growing Open Source Seeds
  • Why does this Python code raise a SyntaxWarning?
  • Review of O'Reilly's Building Node Applications with MongoDB and Backbone
  • Planet Python
  • Coding with Knives blog
  • Python for Entrepreneurs Kickstarter

July 26, 2016 08:00 AM

Daniel Bader

How to use Python’s min() and max() with nested lists

Let’s talk about using Python’s min and max functions on a list containing other lists. Sometimes this is referred to as a nested list or a list of lists.

Finding the minimum or maximum element of a list of lists [1] based on a specific property of the inner lists is a common situation that can be challenging for someone new to Python.

To give us a more concrete example to work with, let’s say we have the following list of item, weight pairs [2]:

nested_list = [['cherry', 7], ['apple', 100], ['anaconda', 1360]]

We want Python to select the minimum and maximum element based on each item’s weight stored at index 1. We expect min to return ['cherry', 7] and max to return ['anaconda', 1360].

But if we simply call min and max on that nested list we don’t get the results we expected.

The ordering we get seems to be based on the item’s name, stored at index 0:

>>> min(nested_list)
['anaconda', 1360]  # Not what we expected!

>>> max(nested_list)
['cherry', 7]  # Not what we expected!

Alright, why does it pick the wrong elements?

Let’s stop for a moment to think about how Python’s max function works internally. The algorithm looks something like this:

def my_max(sequence):
    """Return the maximum element of a sequence"""
    if not sequence:
        raise ValueError('empty sequence')

    maximum = sequence[0]

    for item in sequence:
        if item > maximum:
            maximum = item

    return maximum

The interesting bit of behavior here can be found in the condition that selects a new maximum: if item > maximum:.

This condition works nicely if sequence only contains primitive types like int or float because comparing those is straightforward (in the sense that it’ll give an answer that we intuitively expect; like 3 > 2).

However, if sequence contains other sequences then things get a little more complex. Let’s look at the Python docs to learn how Python compares sequences:

Sequence objects may be compared to other objects with the same sequence type. The comparison uses lexicographical ordering: first the first two items are compared, and if they differ this determines the outcome of the comparison; if they are equal, the next two items are compared, and so on, until either sequence is exhausted.

When max needs to compare two sequences to find the “larger” element then Python’s default comparison behavior might not be what we want [3].
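A quick way to see this lexicographical ordering in action is to compare a few inner lists directly. The snippet below is only an illustration built from the example data in this post:

```python
# Lists are compared item by item, starting at index 0.
a = ['anaconda', 1360]
b = ['apple', 100]

# 'anaconda' sorts before 'apple' because 'n' comes before 'p' at the
# second character, so the weights at index 1 are never looked at.
print(a < b)                            # True
print(['cherry', 7] > ['apple', 100])   # True, 'c' sorts after 'a'

# Only when the first items are equal does the second item decide.
print(['apple', 7] < ['apple', 100])    # True
```

This is why max(nested_list) hands back the cherry: its name is the "largest" string, and the weight never enters the comparison.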

Now that we understand why we get an unexpected result we can think about ways to fix our code.

How can we change the comparison behavior?

We need to tell max to compare the items differently.

In our example, Python’s max looks at the first item in each inner list (the string cherry, apple, or anaconda) and compares it with the current maximum element. That’s why it returns cherry as the maximum element if we just call max(nested_list).

How do we tell max to compare the second item of each inner list?

Let’s imagine we had an updated version of my_max called my_max_by_weight that uses the second element of each inner list for comparison:

def my_max_by_weight(sequence):
    if not sequence:
        raise ValueError('empty sequence')

    maximum = sequence[0]

    for item in sequence:
        # Compare elements by their weight stored
        # in their second element.
        if item[1] > maximum[1]:
            maximum = item

    return maximum

That would do the trick! We can see that my_max_by_weight selects the maximum element we expected:

>>> my_max_by_weight(nested_list)
['anaconda', 1360]

Now imagine we needed to find the maximum of different kinds of lists.

Perhaps the index (or key) we’re interested in won’t always be the second item. Maybe sometimes it’ll be the third or fourth item, or a different kind of lookup is necessary altogether.

Wouldn’t it be great if we could reuse the bulk of the code in our implementation of my_max? Some parts of it will always work the same, for example checking if an empty sequence was passed to the function.

How can we make max() more flexible?

Because Python allows us to treat functions as data we can extract the code selecting the comparison key into its own function. We’ll call that the key func. We can write different kinds of key funcs and pass them to my_max as necessary.

This gives us complete flexibility! Instead of just being able to choose a specific list index for the comparison, like index 1 or 2, we can tell our function to select something else entirely — for example, the length of the item’s name.

Let’s have a look at some code that implements this idea:

def identity(x):
    return x

def my_max(sequence, key_func=None):
    """
    Return the maximum element of a sequence.
    key_func is an optional one-argument ordering function.
    """
    if not sequence:
        raise ValueError('empty sequence')

    if not key_func:
        key_func = identity

    maximum = sequence[0]

    for item in sequence:
        # Ask the key func which property to compare
        if key_func(item) > key_func(maximum):
            maximum = item

    return maximum

In the code example you can see how by default we let my_max use a key func we called identity, which just uses the whole, unmodified item to do the comparison.

With identity as the key func we expect my_max to behave the same way max behaves.

nested_list = [['cherry', 7], ['apple', 100], ['anaconda', 1360]]

>>> my_max(nested_list)
['cherry', 7]

And we can confirm that we’re still getting the same (incorrect) result as before, which is a pretty good indication that we didn’t screw up the implementation completely 😃.

Now comes the cool part — we’re going to override the comparison behavior by writing a key_func that returns the second sub-element instead of the element itself:

def weight(x):
    return x[1]

>>> my_max(nested_list, key_func=weight)
['anaconda', 1360]

And voilà, this is the maximum element we expected to get!

Just to demonstrate the amount of flexibility this refactoring gave us, here’s a key_func that selects the maximum element based on the length of the item’s name:

def name_length(x):
    return len(x[0])

>>> my_max(nested_list, key_func=name_length)
['anaconda', 1360]

Is there a shorthand for this stuff?

Instead of defining the key func explicitly with def and giving it a name we can also use Python’s lambda keyword to define a function anonymously. This shortens the code quite a bit (and won’t create a named function):

>>> my_max(nested_list, key_func=lambda x: x[1])
['anaconda', 1360]

To make the naming a little slicker (albeit less expressive) imagine we’ll shorten the key_func arg to key and we’ve arrived at a code snippet that works with the max function in vanilla Python.

This means we’ll no longer need our own re-implementation of Python’s max function to find the “correct” maximum element:

# This is pure, vanilla Python:
>>> max(nested_list, key=lambda x: x[1])
['anaconda', 1360]

The same also works for Python’s built-in min:

>>> min(nested_list, key=lambda x: x[1])
['cherry', 7]

It even works for Python’s sorted function, making the “key func” concept really valuable in a number of situations you might face as a Python developer:

>>> sorted(nested_list, key=lambda x: x[1])
[['cherry', 7], ['apple', 100], ['anaconda', 1360]]
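As an aside, the standard library's operator.itemgetter can stand in for the lambda. This is just an alternative spelling of the same key func idea, and the variable names below are made up for the example:

```python
from operator import itemgetter

nested_list = [['cherry', 7], ['apple', 100], ['anaconda', 1360]]

# itemgetter(1) returns a callable that fetches index 1 from its argument,
# which is the same job that lambda x: x[1] does.
by_weight = itemgetter(1)

heaviest = max(nested_list, key=by_weight)
lightest = min(nested_list, key=by_weight)
heaviest_first = sorted(nested_list, key=by_weight, reverse=True)

print(heaviest)        # ['anaconda', 1360]
print(lightest)        # ['cherry', 7]
print(heaviest_first)  # anaconda, then apple, then cherry
```

Whether you prefer the lambda or itemgetter is mostly a matter of taste. Both are ordinary one-argument functions as far as min, max and sorted are concerned.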

Try it out yourself

I hope this post helped you out. What started out as a simple question ended up being a little more involved than you may have expected. But it’s often like that when you learn about new programming concepts.

Feel free to drop me a line on Twitter or over email if you got stuck anywhere. I’d love to improve this tutorial over time :)

  1. Sometimes you’ll see tuples used for the inner lists. Using tuples instead of lists doesn’t really make a difference for how min and max work, but in some cases it can bring a performance benefit. Nothing we’ll have to worry about for now. The code in this tutorial will work fine on a list of tuples. 

  2. I actually googled these for you. Apparently the average cherry sold in a supermarket weighs 7 grams. I’m not 100 per cent sure about anacondas though. 

  3. Note that Python strings are also sequences so when you compare two strings they will be compared character by character. 

July 26, 2016 12:00 AM

July 25, 2016

Robin Wilson

Showing code changes when teaching

A key – but challenging – part of learning to program is moving from writing technically-correct code “that works” to writing high-quality code that is sensibly decomposed into functions, generically-applicable and generally “good”. Indeed, you could say that this is exactly what Software Carpentry is about – taking you from someone bodging together a few bits of wood in the shed, to a skilled carpenter. As well as being challenging to learn, this is also challenging to teach: how should you show the progression from “working” to “good” code in a teaching context?

I’ve been struggling with this recently as part of some small-group programming teaching I’ve been doing. Simply showing the “before” and “after” ends up bombarding the students with too many changes at once: they can’t see how you get from one to the other, so I want some way to show the development of code over time as things are gradually done to it (for example, moving this code into a separate function, adding an extra argument to that function to make it more generic, renaming these variables and so on). Obviously when teaching face-to-face I can go through this interactively with the students – but some changes to real-world code are too large to do live – and students often seem to find these sorts of discussions a bit overwhelming, and want to refer back to the changes and reasoning later (or they may want to look at other examples I’ve given them). Therefore, I want some way to annotate these changes to give the explanation (to show why we’re moving that bit of code into a separate function, but not some other bit of code), but to still show them in context.

Exactly what code should be used for these examples is another discussion: I’ve used real-world code from other projects, code I’ve written specifically for demonstration, code I’ve written myself in the past and sometimes code that the students themselves have written.

So far, I’ve tried the following approaches for showing these changes with annotation:

  1. Making all of the changes to the code and providing a separate document with an ordered list of what I’ve changed and why.
    Simple and low-tech, but often difficult for the students to visualise each change
  2. The same as above but committing between each entry in the list.
    Allows them to step through git commits if they want, and to get back to how the code was after each individual change – but many of the students struggle to do this effectively in git, and it adds a huge technological barrier…particularly with Git’s ‘interesting’ user-interface.
  3. The same as above, but using Github’s line comments feature to put comments at specific locations in the code.
    Allows annotations at specific locations in the code, but rather clunky to step through the full diff view of commits in order using Github’s UI.

I suspect any solution will involve some sort of version control system used in some way (although I’m not sure that standard diffs are quite the best way to represent changes for this particular use-case), but possibly with a different interface on it.

Is this a problem anyone else has faced in their teaching? Can you suggest any tools or approaches that might make this easier – for both the teacher and students?

(This has also been cross-posted on the Software Carpentry blog)

July 25, 2016 04:07 PM

Continuum Analytics News

Analyzing and Visualizing Big Data Interactively on Your Laptop: Datashading the 2010 US Census

Posted Tuesday, July 26, 2016

The 2010 Census collected a variety of demographic information for each of the more than 300 million people in the USA. A subset of this has been pre-processed by the Cooper Center at the University of Virginia, who produced an online map of the population density and the racial/ethnic makeup of the USA. Each dot in this map corresponds to a specific person counted in the census, located approximately at their residence. (To protect privacy, the precise locations have been randomized at the block level, so that the racial category can only be determined to within a rough geographic precision.)

Using Datashader on Big Data

The Cooper Center website delivers pre-rendered image tiles to your browser, which is fast, but limited to the plotting choices they made. What if you want to look at the data a different way - filter it, combine it with other data or manipulate it further? You could certainly re-do the steps they did, using their Python source code, but that would be a big project. Just running the code takes "dozens of hours" and adapting it for new uses requires significant programming and domain expertise. However, the new Python datashader library from Continuum Analytics makes it fast and fully interactive to do these kinds of analyses dynamically, using simple code that is easy to adapt to new uses. The steps below show that using datashader makes it quite practical to ask and answer questions about big data interactively, even on your laptop.

Load Data and Set Up

First, let's load the 2010 Census data into a pandas dataframe:

import pandas as pd

df = pd.read_hdf('data/census.h5', 'census')
df.race = df.race.astype('category')
df.tail()

     CPU times: user 13.9 s, sys: 35.7 s, total: 49.6 s

     Wall time: 1min 7s


           meterswest  metersnorth  race
306674999  -8922890.0    2958501.2     h
306675000  -8922863.0    2958476.2     h
306675001  -8922887.0    2958355.5     h
306675002  -8922890.0    2958316.0     h
306675003  -8922939.0    2958243.8     h

Loading the data from the HDF5-format file takes a minute, as you can see, which is by far the most time-consuming step. The output of .tail() shows that there are more than 300 million datapoints (one per person), each with a location in Web Mercator coordinates, and that the race/ethnicity for each datapoint has been encoded as a single character (where 'w' is white, 'b' is black, 'a' is Asian, 'h' is Hispanic and 'o' is other, typically Native American).
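Web Mercator expresses positions in meters east and north of the point where the equator meets the prime meridian, which is why the western hemisphere has negative meterswest values. As a rough sketch of how a longitude/latitude pair maps into that system, here is the standard spherical Mercator formula. The function name and the sample Chicago coordinates are assumptions for illustration and are not part of the census code:

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius, as used by Web Mercator

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y

# Approximate longitude/latitude of downtown Chicago (assumed for illustration)
x, y = lonlat_to_web_mercator(-87.63, 41.88)
print(round(x), round(y))
```

The resulting pair can be compared against coordinate ranges expressed in the same units, such as the ones defined for specific cities in this post.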

Let's define some geographic ranges to look at later and a default plot size.

USA = ((-13884029, -7453304), (2698291, 6455972))
LakeMichigan = ((-10206131, -9348029), (4975642, 5477059))
Chicago = (( -9828281, -9717659), (5096658, 5161298))
Chinatown = (( -9759210, -9754583), (5137122, 5139825))

NewYorkCity = (( -8280656, -8175066), (4940514, 4998954))
LosAngeles = ((-13195052, -13114944), (3979242, 4023720))
Houston = ((-10692703, -10539441), (3432521, 3517616))
Austin = ((-10898752, -10855820), (3525750, 3550837))
NewOrleans = ((-10059963, -10006348), (3480787, 3510555))
Atlanta = (( -9448349, -9354773), (3955797, 4007753))

x_range,y_range = USA

plot_width = int(900)
plot_height = int(plot_width*7.0/12)

Population Density

For our first examples, let's ignore the race data, focusing on population density alone.

Datashader works by aggregating an arbitrarily large set of data points (millions, for a pandas dataframe, or billions+ for a dask dataframe) into a fixed-size buffer that's the shape of your final image. Each of the datapoints is assigned to one bin in this buffer, and each of these bins will later become one pixel. In this case, we'll aggregate all the datapoints from people in the continental USA into a grid containing the population density per pixel:

import datashader as ds
import datashader.transfer_functions as tf
from datashader.colors import Greys9, Hot, colormap_select as cm
def bg(img): return tf.set_background(img,"black")

cvs = ds.Canvas(plot_width, plot_height, *USA)
agg = cvs.points(df, 'meterswest', 'metersnorth')

     CPU times: user 3.97 s, sys: 12.2 ms, total: 3.98 s
     Wall time: 3.98 s

Computing this aggregate grid will take some CPU power (4-8 seconds on this MacBook Pro), because datashader has to iterate through the hundreds of millions of points in this dataset, one by one. But once the agg array has been computed, subsequent processing will now be nearly instantaneous, because there are far fewer pixels on a screen than points in the original database.
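To get a feel for what this aggregation step computes, here is a small NumPy sketch that bins random points into a fixed-size grid. It is not how datashader is implemented, and the random data, grid size and variable names are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100,000 random points instead of the census
xs = rng.uniform(0.0, 10.0, size=100_000)
ys = rng.uniform(0.0, 5.0, size=100_000)

width, height = 90, 52  # size of the final image in pixels

# Each point falls into exactly one bin; each bin later becomes one pixel.
# Passing ys first makes rows correspond to y and columns to x.
counts, _, _ = np.histogram2d(
    ys, xs,
    bins=(height, width),
    range=((0.0, 5.0), (0.0, 10.0)),
)

print(counts.shape)       # (52, 90)
print(int(counts.sum()))  # 100000, every point was counted once
```

However many points go in, the result always has height x width entries, which is why everything that happens after the aggregation is cheap.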

The aggregate grid now contains a count of the number of people in each location. We can visualize this data by mapping these counts into a grayscale value, ranging from black (a count of zero) to white (maximum count for this dataset). If we do this colormapping linearly, we can very quickly and clearly see...

bg(tf.interpolate(agg, cmap = cm(Greys9), how='linear'))

     CPU times: user 25.6 ms, sys: 4.77 ms, total: 30.4 ms
     Wall time: 29.8 ms

...almost nothing. The amount of detail visible is highly dependent on your monitor and its display settings, but it is unlikely that you will be able to make much out of this plot on most displays. If you know what to look for, you can see hotspots (high population densities) in New York City, Los Angeles, Chicago and a few other places. For feeding 300 million points in, we're getting almost nothing back in terms of visualization.

The first thing we can do is prevent "undersampling." In the plot above, there is no way to distinguish between pixels that are part of the background and those that have low but nonzero counts; both are mapped to black or nearly black on a linear scale. Instead, let's map all values that are not background to a dimly visible gray, leaving the highest-density values at white - let's discard the first 25% of the gray colormap and linearly interpolate the population densities over the remaining range:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.25), how='linear'))

The above plot at least reveals that data has been measured only within the political boundaries of the continental United States and that many areas in the mountainous West are so sparsely populated that many pixels contained not even a single person (in datashader images, the background color is shown for pixels that have no data at all, using the alpha channel of a PNG image, while the specified colormap is shown for pixels that do have data). Some additional population centers are now visible, at least on some monitors. But, mainly what the above plot indicates is that population in the USA is extremely non-uniformly distributed, with hotspots in a few regions, and nearly all other pixels having much, much lower (but nonzero) values. Again, that's not much information to be getting out of 300 million datapoints!

The problem is that of the available intensity scale in this gray colormap, nearly all pixels are colored the same low-end gray value, with only a few urban areas using any other colors. Thus, both of the above plots convey very little information. Because the data are clearly distributed so non-uniformly, let's instead try a nonlinear mapping from population counts into the colormap. A logarithmic mapping is often a good choice for real-world data that spans multiple orders of magnitude:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='log'))

Suddenly, we can see an amazing amount of structure! There are clearly meaningful patterns at nearly every location, ranging from the geographic variations in the mountainous West, to the densely spaced urban centers in New England and the many towns stretched out along roadsides in the Midwest (especially those leading to Denver, the hot spot towards the right of the Rocky Mountains).
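To see why the logarithmic mapping reveals so much more, it helps to compare the two scalings on a handful of numbers. The sketch below uses made-up counts and a simple log1p scaling; datashader's own 'log' option may differ in its details:

```python
import numpy as np

# Hypothetical per-pixel counts spanning several orders of magnitude
counts = np.array([0, 1, 10, 100, 1_000, 10_000], dtype=float)

# Linear mapping: divide by the maximum count
linear = counts / counts.max()

# Logarithmic mapping: log1p keeps a count of zero at zero
log_scaled = np.log1p(counts) / np.log1p(counts.max())

print(np.round(linear, 4))
print(np.round(log_scaled, 4))
```

On the linear scale a pixel with 10 people lands at 0.001 of the brightness range, which is indistinguishable from black. On the log scale the same pixel lands at roughly 0.26, so small towns become visible without the big cities saturating.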

Clearly, we can now see much more of what's going on in this dataset, thanks to the logarithmic mapping. Yet, the choice of 'log' was purely arbitrary, and one could easily imagine that other nonlinear functions would show other interesting patterns. Instead of blindly searching through the space of all such functions, we can step back and notice that the main effect of the log transform has been to reveal local patterns at all population densities -- small towns show up clearly even if they are just slightly more dense than their immediate, rural neighbors, yet large cities with high population density also show up well against the surrounding suburban regions, even if those regions are more dense than the small towns on an absolute scale.

With this idea of showing relative differences across a large range of data values in mind, let's try the image-processing technique called histogram equalization. Given a set of raw counts, we can map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric -- it will equalize any linearly or nonlinearly distributed data, regardless of the distribution:

bg(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='eq_hist'))

Effectively, this transformation converts the data from raw magnitudes, which can easily span a much greater range than the dynamic range visible to the eye, to a rank-order or percentile representation, which reveals density differences at all ranges but obscures the absolute magnitudes involved. In this representation, you can clearly see the effects of geography (rivers, coastlines and mountains) on the population density, as well as history (denser near the longest-populated areas) and even infrastructure (with many small towns located at crossroads).
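The rank-order idea can be sketched in a few lines of NumPy. This is a simplified illustration and not datashader's eq_hist implementation; the equalize function and the sample counts are assumptions made for the example:

```python
import numpy as np

def equalize(counts):
    """Map nonzero counts to (0, 1] by their rank among the nonzero values."""
    counts = np.asarray(counts, dtype=float)
    out = np.zeros_like(counts)
    nonzero = counts > 0
    values = np.sort(counts[nonzero])
    # Number of nonzero pixels whose count is <= this pixel's count
    ranks = np.searchsorted(values, counts[nonzero], side='right')
    out[nonzero] = ranks / values.size
    return out

counts = np.array([[0, 1, 2],
                   [5, 400, 90_000]])
equalized = equalize(counts)
print(equalized)
```

The five nonzero pixels end up evenly spaced at 0.2, 0.4, 0.6, 0.8 and 1.0 even though their raw counts range from 1 to 90,000. The ordering survives, but the magnitudes do not, which is exactly the trade-off described above.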

Given the very different results from the different types of plot, a good practice when visualizing any dataset with datashader is to look at both the linear and the histogram-equalized versions of the data; the linear version preserves the magnitudes but obscures the distribution, while the histogram-equalized version reveals the distribution while preserving only the order of the magnitudes, not their actual values. If both plots are similar, then the data is distributed nearly uniformly across the interval. But, much more commonly, the distribution will be highly nonlinear, and the linear plot will reveal only the envelope of the data - the lowest and the highest values. In such cases, the histogram-equalized plot will reveal much more of the structure of the data, because it maps the local patterns in the data into perceptible color differences on the screen, which is why eq_hist is the default colormapping.

Because we are only plotting a single dimension, we can use the colors of the display to effectively reach a higher dynamic range, mapping ranges of data values into different color ranges. Here, we'll use the colormap with the colors interpolated between the named colors shown:

bg(tf.interpolate(agg, cmap = cm(Hot,0.2)))

     ['darkred', 'red', 'orangered', 'darkorange', 'orange', 'gold', 'yellow', 'white']

Such a representation can provide additional detail in each range, while still accurately conveying the overall distribution.

Because we can control the colormap, we can use it to address very specific questions about the data itself. For instance, after histogram equalization, data should be uniformly distributed across the visible colormap. Thus, if we want to highlight, for example, the top 1% of pixels (by population density), we can use a colormap divided into 100 ranges and simply change the top one to a different color:

import numpy as np
grays2 = cm([(i,i,i) for i in np.linspace(0,255,99)]) + ["red"]
bg(tf.interpolate(agg, cmap = grays2))

The above plot now conveys nearly all the information available in the original linear plot - that only a few pixels have the very highest population densities - while also conveying the structure of the data at all population density ranges via histogram equalization.

Categorical Data (Race)

Since we've got the racial/ethnic category for every pixel, we can use color to indicate the category value, instead of just extending dynamic range or highlighting percentiles, as shown above. To do this, we first need to set up a color key for each category label:

color_key = {'w':'aqua', 'b':'lime', 'a':'red', 'h':'fuchsia', 'o':'yellow' }

We can now aggregate the counts per race into grids, using ds.count_cat, instead of just a single grid with the total counts (which is what happens with the default aggregate reducer ds.count). We then generate an image by colorizing each pixel using the aggregate information from each category for that pixel's location:

def create_image(x_range, y_range, w=plot_width, h=plot_height, spread=0):
     cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
     agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
     img = tf.colorize(agg, color_key, how='eq_hist')
     if spread: img = tf.spread(img,px=spread)

     return tf.set_background(img,"black")

The result shows that the USA is overwhelmingly white, apart from some predominantly Hispanic regions along the Southern border, some regions with high densities of blacks in the Southeast and a few isolated areas of category "other" in the West (primarily Native American reservation areas).



Interestingly, the racial makeup has some sharp boundaries around urban centers, as we can see if we zoom in:


With sufficient zoom, it becomes clear that Chicago (like most large US cities) has both a wide diversity of racial groups, and profound geographic segregation:


Eventually, we can zoom in far enough to see individual datapoints. Here we can see that the Chinatown region of Chicago has, as expected, very high numbers of Asian residents, and that other nearby regions (separated by features like roads and highways) have other races, varying in how uniformly segregated they are:


Note that we've used the tf.spread function here to enlarge each point to cover multiple pixels so that each point is clearly visible.

Other Cities, for Comparison

Different cities have very different racial makeup, but they all appear highly segregated:

Analyzing Racial Data Through Visualization

The racial data and population densities are visible in the original Cooper Center map tiles, but because we aren't just working with static images here, we can look at any aspect of the data we like, with results coming back in a few seconds, rather than days. For instance, if we switch back to the full USA and then select only the black population, we can see that blacks predominantly reside in urban areas, except in the South and the East Coast:

cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height)
agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))

bg(tf.interpolate(agg.sel(race='b'), cmap=cm(Greys9,0.25), how='eq_hist'))

(Compare to the all-race eq_hist plot at the start of this post.)

Or we can show only those pixels where there is at least one resident from each of the racial categories - white, black, Asian and Hispanic - which mainly highlights urban areas (compare to the first racial map shown for the USA above):

agg2 = agg.where((agg.sel(race=['w', 'b', 'a', 'h']) > 0).all(dim='race')).fillna(0)
bg(tf.colorize(agg2, color_key, how='eq_hist'))

In the above plot, the colors still show the racial makeup of each pixel, but the pixels have been filtered so that only those with at least one datapoint from every race are shown.

We can also look at all pixels where there are more black than white datapoints, which highlights predominantly black neighborhoods of large urban areas across most of the USA, but also some rural areas and small towns in the South:

bg(tf.colorize(agg.where(agg.sel(race='w') < agg.sel(race='b')).fillna(0), color_key, how='eq_hist'))

Here the colors still show the predominant race in that pixel, which is black for many of these, but in Southern California it looks like there are several large neighborhoods where blacks outnumber whites, but both are outnumbered by Hispanics.

Notice how each of these queries takes only a line or so of code, thanks to the xarray multidimensional array library that makes it simple to do operations on the aggregated data. Anything that can be derived from the aggregates is visible in milliseconds, not the days of computing time that would have been required using previous approaches. Even calculations that require reaggregating the data only take seconds to run, thanks to the optimized Numba and dask libraries used by datashader.
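To make the "more black than white datapoints" query concrete without needing the census data, here is a tiny stand-in using plain NumPy rather than xarray. The array layout (y, x, race) and all the counts are assumptions for illustration:

```python
import numpy as np

races = ['w', 'b', 'a', 'h', 'o']

# Hypothetical 2x2-pixel aggregate holding one count per race in each pixel,
# laid out as (y, x, race)
agg = np.array([
    [[50, 10, 3, 7, 1], [5, 40, 0, 2, 0]],
    [[0, 0, 0, 0, 0],   [20, 25, 1, 30, 0]],
])

white = agg[..., races.index('w')]
black = agg[..., races.index('b')]

# Pixels where black datapoints outnumber white ones
more_black = black > white

# Keep all the counts in those pixels and zero out everything else
filtered = np.where(more_black[..., np.newaxis], agg, 0)

print(more_black)
print(filtered.sum(axis=-1))  # total count per pixel after filtering
```

Note that the bottom-right pixel passes the filter even though Hispanic residents outnumber both groups there, which mirrors the Southern California observation above. With xarray the same selection is done by label (race='b') instead of by integer position.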

Using datashader, it is now practical to try out your own hypotheses and questions, whether for the USA or for your own region. You can try posing questions that are independent of the number of datapoints in each pixel, since that varies so much geographically, by normalizing the aggregated data in various ways. Now that the data has been aggregated but not yet rendered to the screen, there is an infinite range of queries you can pose!
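One simple form of normalization is to convert the counts in each pixel into fractions of that pixel's total. The sketch below uses made-up numbers and plain NumPy; it is only one of many possible ways to normalize:

```python
import numpy as np

# Hypothetical counts for three pixels and three categories: (pixel, category)
counts = np.array([[80.0, 15.0, 5.0],
                   [2.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0]])

totals = counts.sum(axis=1, keepdims=True)

# Divide each category by the pixel total, leaving empty pixels at zero
fractions = np.divide(
    counts, totals,
    out=np.zeros_like(counts),
    where=totals > 0,
)

print(fractions)
```

After this step a pixel with 100 people and a pixel with 4 people can be compared directly, because each row describes a makeup rather than a headcount.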

Interactive Bokeh Plots Overlaid with Map Data

The above plots all show static images on their own. datashader can also be combined with plotting libraries, in order to add axes and legends, to support zooming and panning (crucial for a large dataset like this one!), and/or to combine datashader output with other data sources, such as map tiles. To start, we can define a Bokeh plot that shows satellite imagery from ArcGIS:

import bokeh.plotting as bp
from bokeh.models.tiles import WMTSTileSource


def base_plot(tools='pan,wheel_zoom,reset',webgl=False):
     p = bp.figure(tools=tools,
         plot_width=int(900), plot_height=int(500),
         x_range=x_range, y_range=y_range, outline_line_color=None,
         min_border=0, min_border_left=0, min_border_right=0,
         min_border_top=0, min_border_bottom=0, webgl=webgl)

     p.axis.visible = False
     p.xgrid.grid_line_color = None
     p.ygrid.grid_line_color = None

     return p

p = base_plot()

# url must hold the tile URL template for the ArcGIS imagery (not shown in this post)
tile_renderer = p.add_tile(WMTSTileSource(url=url))

We can then add an interactive plot that uses a callback to a datashader pipeline. In this pipeline, we'll use the tf.dynspread function to automatically increase the plotted size of each datapoint, once you've zoomed in so far that datapoints no longer have nearby neighbors:

from datashader.bokeh_ext import InteractiveImage

def image_callback(x_range, y_range, w, h):
     cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
     agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
     img = tf.colorize(agg, color_key, 'log')
     return tf.dynspread(img,threshold=0.75, max_px=8)

InteractiveImage(p, image_callback)

The above image will just be a static screenshot, but in a running Jupyter notebook you will be able to zoom and pan interactively, selecting any region of the map to study in more detail. Each time you zoom or pan, the entire datashader pipeline will be re-run, which will take a few seconds. At present, datashader does not use caching, tiling, partitioning or any of the other optimization techniques that would provide even more responsive plots, and, as the library matures, you can expect to see further improvements over time. But, the library is already fast enough to provide interactive plotting of all but the very largest of datasets, allowing you to change any aspect of your plot "on-the-fly" as you interact with the data.

To learn more about datashader, check out our extensive set of tutorial notebooks, then install it using conda install -c bokeh datashader and start trying out the Jupyter notebooks from github yourself! You can also watch my datashader talk from SciPy 2016 on YouTube. 

July 25, 2016 03:12 PM

Django Weblog

Registration for Django: Under the Hood 2016 is now open!

Django: Under the Hood is back for its third edition!

DUTH is an annual Django conference that takes place in Amsterdam, the Netherlands. On 3rd - 6th November this year, we're going to see 9 deep-dive talks on topics including Django channels, testing, Open Source funding, JavaScript, Django forms validation, debugging and many more.

Django: Under the Hood also gives the opportunity to bring many Django core developers to work together and shape the future of Django with a group of 300 passionate Django developers attending the conference.

This year, the registration process for the conference became a lottery to avoid a mad rush and tickets selling out in minutes.


You can register now, and the lottery is only open until 26th of July at noon Amsterdam time.

If you want to make sure that tickets for your team are reserved and set aside, Django: Under the Hood still has a few sponsorship opportunities open. Please get in touch on

July 25, 2016 02:35 PM

Doug Hellmann

threading — Manage Concurrent Operations — PyMOTW 3

Using threads allows a program to run multiple operations concurrently in the same process space. Read more… This post is part of the Python Module of the Week series for Python 3. See for more articles from the series.

July 25, 2016 01:00 PM

Mike Driscoll

PyDev of the Week: Nicholas Tollervey

This week we welcome Nicholas Tollervey (@ntoll) as our PyDev of the Week. He is the author of the Python in Education booklet and the co-author of Learning jQuery Deferreds: Taming Callback Hell with Deferreds and Promises. He was one of the co-founders of the London Python Code Dojo. You should check out his website to see what he’s up to. Let’s spend some time learning more about our fellow Pythonista!

Can you tell us a little about yourself (hobbies, education, etc):

I’m a UK based freelance programmer.

I first learned to program as a kid using a BBC micro. Like many UK programmers of a certain age I go all misty-eyed when remembering the sense of excitement, adventure and possibilities that such a machine generated.

As a teenager I had music lessons and ended up getting my undergraduate degree from the Royal College of Music in London. It was a lot of fun: among other things, my composition tutor used to teach Andrew Lloyd Webber and my keyboard harmony teacher was the Queen’s personal organist (he was director of the Chapel Royal). I played professionally for a while before embarking on a career teaching music. I also went on to read for degrees in philosophy and then computing ~ my twenties were all about teaching and learning cool stuff!

When I started a family I decided I also needed a career change and remembered back to the fun I’d had on the BBC micro so reverted back to programming. Now I spend a lot of my free time making music.

Why did you start using Python?

I was a senior .NET developer in an investment bank in London – I worked with the quants on a suite of in-house SCM related development tools. They wanted to be able to script some of my software so I looked into IronPython as a potential solution. As well as being impressed by Python-the-language, these investigations led me to the remarkably friendly UK Python community and I realised I was among friends. I took some months off to learn Python and re-started my programming career as a junior Python developer and have never looked back! That was six or seven years ago.

What other programming languages do you know and which is your favorite?

Due to the large amount of web-related programming I’ve done, JavaScript is my second language after Python. Obviously I know C# and VB (although they’re very rusty) and I’ve used Ruby, Java and Erlang at various points in my career. I tend to dip my toes into other languages because it’s interesting to get another perspective, although I always love returning to Python: it’s like putting on a favourite pair of comfortable slippers. 😉

Lisp is a beautiful language. I once wrote a version of the Lisp programming language (for fun) although that’ll never see the light of day. It proved to me how difficult it is to design a programming language and made me realise what an amazing job Guido and the core developers do with Python.

What projects are you working on now?

I proposed, coordinated and contributed to the PSF’s involvement with the new BBC micro:bit project. A micro:bit is a small programmable device aimed at 11 year-olds. A million of them have been given out to all the UK’s 11 and 12 year olds. The intention is to recreate the glory-days of the BBC micro (see above) and inspire a new generation of digital makers. My current focus is on making the fruits of these efforts sustainable.

Thanks to the amazing work of Damien George, the device runs MicroPython. It’s an open source project, so anyone can contribute code.

I’ve also written many tools related to the project: Mu (a code editor for kids that’s written in Python), a child-friendly web-based Python editor for the website and various Python related utilities that make it easy to interact with the device. I’ve also been learning some C and C++ as I’ve made minor contributions to the MicroPython code that runs on the micro:bit. It’s fascinating and fun to be a complete beginner again.

Which Python libraries are your favorite (core or 3rd party)?

I love working with asyncio. I’ve written a distributed hash table with it (as a fun educational exercise). I also firmly believe that Python 3 is what everyone should be using for new projects. 😉

I also love simplicity and, happily, there are many examples of simple libraries in the Python ecosystem.

I’m especially impressed with the PyGameZero, GPIOZero and NetworkZero libraries (with more “WhateverZero” libraries coming soon). The idea is to write a simple API above an existing-yet-powerful library to make it more accessible to beginner programmers. Once beginner programmers have the basic concepts worked out they can begin the process of pulling back the curtain and exploring the underlying library. Dan Pope (PyGameZero), Ben Nuttall (GPIOZero) and Tim Golden (NetworkZero) should be showered with praise for their extraordinary efforts (all three of them spent significant amounts of time collaborating with teachers to work out how best to design their library to appeal to new programmers).

Where do you see Python going as a programming language?

What is your take on the current market for Python programmers?

There’s a huge demand for Python developers here in the UK across all sectors. This situation doesn’t look like it’ll change for the foreseeable future (see my comment above about education).

Is there anything else you’d like to say?

The best of Python is in its community: I’d like to publicly say “thank you” to the many people who have helped me as I continue to learn and use this remarkable language.

Thanks for doing the interview!

July 25, 2016 12:30 PM

Caktus Consulting Group

ShipIt Day Recap - July 2016

We finished up last week with another successful ShipIt Day. ShipIt Days are quarterly events where we put down client work for a little bit and focus on learning, stretching ourselves, and sharing. Everyone chooses to work together or individually on an itch or a project that has been on the back of their mind for the last few months. This time, we stretched ourselves by trying out new frameworks, languages, and pluggable apps. Here are some of the projects we worked on during ShipIt Day:

TinyPNG Image Optimization in Django

Kia and Dmitriy started on django_tinypng. This project creates a new OptimizedImageField in Django which uses the tinify client for the tinypng project to compress PNG files. This means that files uploaded by users can be reduced in size by up to 70% without perceivable differences in image quality. Reducing image sizes can free up disk space on servers and improve page load speeds, significantly improving user experiences.

Maintaining Clean Python Dependencies / Requirements.txts

Rebecca Muraya researched how we, as developers, can consistently manage our requirements files. In particular, she was looking for a way to handle second-level (and below) dependencies -- should these be explicitly pinned, or not? Rebecca did some research and found the pip-tools package as a possible solution and presented it to the group. Rebecca described pip-tools as a requirements file compiler which gives you the flexibility to describe your requirements at the level that makes sense to your development team, but have them consistently managed across development, testing, and production environments. Rebecca presented ideas for integrating pip-tools into our standard development workflows.


Neil and Dan each independently decided to build projects using Elm, a functional language for frontend programming. They were excited to demonstrate how they temporarily rearranged their concept of development to focus on state and state changes in data structures, and then on how these state changes would be drawn on the screen dynamically. Dan mentioned missing HTML templates, the norm in languages where not everything is a function, but loved that the strict type system (unlike Python's) forces programmers to handle all cases. Neil dug not only into Elm on the frontend, but also into Yesod, a functional backend framework, and the Haskell language. Neil built a chat app using Websockets and Yesod channels.

Firebase + React = Bill Tracking

Hunter built a bill tracking project using Google’s Firebase database and the React frontend framework. Hunter walked us through his change in thought process from writing code as a workflow to writing code that changes state and code that updates the drawing of the state. It was great to see the Firebase development tools and learn a bit more about React.

Open Data Policing Database Planning

Rebecca Conley worked on learning some new things about database routing and some of the statistics that go into the Open Data Policing project. She also engaged Caelan, Calvin’s son who was in the office during the day, to build a demonstration of what she had been working on.

Mozilla’s DXR Python Parser Contributions

DXR is a web-based code indexing and searching tool built by Mozilla. For his project, Jeff Bradberry decided to create a pull request contribution to the project that improves Python code indexing. Specifically, he used Python’s own Abstract Syntax Tree (AST), a way to introspect and consider Python code as structured data to be analyzed. Jeff’s contribution improves the analysis of nested calls like a(b(c())) and chained calls like a().b().c().
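To make the difference between the two call shapes concrete, here is a small sketch using Python's standard ast module. It is not DXR's code or Jeff's pull request; the function name describe_calls and the labels "nested", "chained" and "plain" are hypothetical. In a nested call the inner call sits in the outer call's arguments, while in a chained call the inner call sits inside the outer call's func attribute.

```python
import ast


def describe_calls(source):
    """Return sorted (callee_name, kind) pairs for every call in source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Call):
            # x().name(...): the thing being called hangs off another call's result
            kind = "chained"
        elif any(isinstance(arg, ast.Call) for arg in node.args):
            # name(y(...)): another call appears among the arguments
            kind = "nested"
        else:
            kind = "plain"
        if isinstance(func, ast.Name):
            name = func.id
        elif isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = "<expr>"
        found.append((name, kind))
    return sorted(found)


print(describe_calls("a(b(c()))"))
print(describe_calls("a().b().c()"))
```

An indexer has to follow both paths (arguments and func.value) to see every call in an expression.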

Hatrack: We all wear lots of hats, switch contexts easily

Rather than working on something completely new, Calvin decided to package up and share Hatrack, a project he has been working on, off and on, in his free time. Hatrack attempts to solve a problem that web developers frequently face: changing projects regularly means switching development environments and running lots of local development web servers. Hatrack notices what projects you try to load up in your browser and starts up the development server automatically. For his project, Calvin put Hatrack up on NPM and shared it with the world. You can also check out the Hatrack source code on GitHub.

Software Testing Certification

Sometimes ShipIt Day can be a chance to read or review documentation. Charlotte went this route and reviewed the requirements for the International Software Testing Qualifications Board (ISTQB)’s certification programs. Charlotte narrowed down on a relevant certification and began reviewing the study materials. She came back to the group and walked us through some QA best practices including ISTQB’s seven principles of software testing.

Cross Functional Depth & Breadth

Sarah began work to visualize project teams’ cross-functional specialties with an eye towards finding skill gaps. She built out a sample questionnaire for the teams and a method of visualizing the skill ranges in specific areas on a team. This could be used in the future when team members move between teams and for long-term planning.

Demographic Data

Colin and Alex each separately investigated adding demographic data into existing project data sets using Sunlight Labs’ Python bindings for the Census API. While the Census dataset contains tens of thousands of variables at various geographic resolution levels (states, counties, down to block groups), using the Census API and Sunlight Labs’ bindings made it relatively quick and painless.

July 25, 2016 12:00 PM

Ilian Iliev

SQLAlchemy and "Lost connection to MySQL server during query"

... a weird behaviour when pool_recycle is ignored

Preface: This is quite a standard problem for apps/websites with low traffic or those that use heavy caching and hit the database quite seldom. Most of the articles you will find on the topic will tell you one thing - change the wait_timeout setting in the database. Unfortunately, in some cases this disconnect occurs much earlier than the expected wait_timeout (default of 8 hours). If you are in one of those cases, keep reading.

This issue haunted our team for weeks. When we first faced it, the project that was raising it was still in the dev phase, so it wasn't critical, but as we got closer to the release date we started to search for a solution. We read several articles and decided that pool_recycle was our solution.

Adding pool_recycle: According to SQLAlchemy's documentation, pool_recycle "causes the pool to recycle connections after the given number of seconds has passed". Nice, so if you recycle the connection at intervals smaller than the wait_timeout, the error above should not appear. Let's try it out:

    import time

    from sqlalchemy.engine import create_engine

    url = 'mysql+pymysql://user:pass@'

    engine = create_engine(url, pool_recycle=1).connect()

    query = 'SELECT NOW();'

    while True:
        print('Q1', engine.execute(query).fetchall())
        engine.execute('SET wait_timeout=2')
        time.sleep(3)
        print('Q2', engine.execute(query).fetchall())

So what does the code do? We create a connection to a local MySQL server and state that it should be recycled every second (the create_engine call). Then we execute a simple query (the first print in the loop) just to verify that the connection is working.
We set the wait_timeout to 2 seconds and wait for 3. At this stage the connection to the server will time out, but SQLAlchemy should recycle it, so the last query should execute successfully and the loop should continue.
Unfortunately the result looks like:

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT NOW();']

Wait, what happened? Why isn't the connection recycled?

Solution: Well, as with all such problems, the solution was much simpler compared to the time it took us to find it (we fought with this for days). The only change that solved it was in the create_engine line:

    engine = create_engine(url, pool_recycle=1)

    # the result
    Q1 [(datetime.datetime(2016, 7, 24, 20, 51, 41),)]
    Q2 [(datetime.datetime(2016, 7, 24, 20, 51, 44),)]
    Q1 [(datetime.datetime(2016, 7, 24, 20, 51, 44),)]
    Q2 [(datetime.datetime(2016, 7, 24, 20, 51, 47),)]

Did you spot the difference? We are not calling the connect() method any more.

Final words: To be honest, I don't know why this solved the issue. Hopefully someone more familiar with SQLAlchemy will come up with a reasonable explanation for it. The bad part is that the examples in the official docs use "connect". So it is either a bug or bad documentation. I will send this article to SQLAlchemy's Twitter account, so hopefully we will see some input from them. Till then, if any of you has an explanation for this behaviour, I'll be happy to hear it.
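A likely explanation, offered as a sketch rather than a confirmed diagnosis: SQLAlchemy's pooling documentation notes that recycling is checked only when a connection is checked out of the pool, not while it is being held. Calling connect() once and reusing the result keeps a single connection checked out for the whole loop, so pool_recycle never gets a chance to act, whereas calling execute() on the engine checks a connection out for each call. The toy model below illustrates that idea only; ToyPool, its fake clock and the integer connection ids are hypothetical and are not SQLAlchemy's implementation.

```python
import itertools


class ToyPool:
    """Toy single-connection pool: the age check happens only at checkout."""

    def __init__(self, recycle_seconds, clock):
        self.recycle_seconds = recycle_seconds
        self.clock = clock            # callable returning the current time
        self._ids = itertools.count(1)
        self._conn = None             # (connection id, creation time)

    def checkout(self):
        now = self.clock()
        if self._conn is None or now - self._conn[1] > self.recycle_seconds:
            # Missing or too old: discard it and "open" a fresh connection.
            self._conn = (next(self._ids), now)
        return self._conn[0]


# Fake clock so the example needs no real waiting.
now = [0.0]
pool = ToyPool(recycle_seconds=1, clock=lambda: now[0])

held = pool.checkout()   # like create_engine(...).connect(): checked out once and kept
now[0] = 3.0             # three "seconds" pass; the server would have dropped the link
fresh = pool.checkout()  # like engine.execute(): a new checkout triggers the recycle

print(held, fresh)       # prints: 1 2
```

The held value still refers to connection 1 after the timeout because nobody asked the pool again; only the second checkout produces a fresh connection.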

July 25, 2016 10:40 AM

July 24, 2016

Weekly Python StackOverflow Report

(xxix) stackoverflow python report

These are the ten highest-rated questions on Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2016-07-24 15:23:27 GMT

  1. Why and how are Python functions hashable? - [24/3]
  2. why is a sum of strings converted to floats - [12/1]
  3. Python: PEP 8 class name as variable - [11/2]
  4. How to assign member variables temporarily? - [8/5]
  5. Simplifying / optimizing a chain of for-loops - [8/4]
  6. Is there a pythonic way to process tree-structured dict keys? - [8/2]
  7. list of objects python - [8/1]
  8. Pandas rolling computations for printing elements in the window - [8/1]
  9. django-debug-toolbar breaking on admin while getting sql stats - [8/1]
  10. How to free memory of python deleted object? - [8/1]

July 24, 2016 03:24 PM