
Planet Python

Last update: September 24, 2014 01:57 AM

September 24, 2014

Vasudev Ram

How Guido nearly dropped out - and then dropped in

By Vasudev Ram

Interview with Guido van Rossum, creator of the Python language.

He says in the interview that he nearly dropped out of school - but was dissuaded from doing so by his manager and professor.


[ I was actually close to dropping out.

Oh my gosh! Why?

The job was so fun, and studying for exams wasn’t. Fortunately, my manager at the data center, as well as one of my professors, cared enough about me to give me small nudges in the direction of, “Well, maybe it would be smart to graduate, and then you can do this full-time!” (laughing) ]

Later, of course, he dropped in ... to the box.

- Vasudev Ram - Dancing Bison Enterprises

Contact Page

September 24, 2014 01:45 AM

September 23, 2014

Mike Driscoll

Python 101: How to send SMS/MMS with Twilio

I’ve been hearing some buzz about a newish web service called Twilio which allows you to send SMS and MMS messages, among other things. There’s a handy Python wrapper for their REST API as well. If you sign up with Twilio, they will give you a trial account without even requiring you to provide a credit card, which I appreciated. You will receive a Twilio number that you can use for sending out your messages. Since you are using a trial account, you do have to authorize any phone numbers you want to send messages to before you can actually send a message. Let’s spend some time learning how this works!

Getting Started

To get started, you’ll need to sign up on Twilio and you’ll also need to install the Python twilio wrapper. To install the latter, just do the following:

pip install twilio

Once that’s installed and you’re signed up, we’ll be ready to continue.

Using Twilio to Send SMS Messages

Sending an SMS (Short Message Service) message via Twilio is really easy. You will need to look in your Twilio account to get the sid and authentication token as well as your Twilio number. Once you have those three key pieces of information, you can send an SMS message. Let’s see how:

from twilio.rest import TwilioRestClient

def send_sms(msg, to):
    sid = "text-random"
    auth_token = "youAreAuthed"
    twilio_number = "123-456-7890"
    client = TwilioRestClient(sid, auth_token)
    message = client.messages.create(body=msg,
                                     from_=twilio_number,
                                     to=to)

if __name__ == "__main__":
    msg = "Hello from Python!"
    to = "111-111-1111"
    send_sms(msg, to)

When you run the code above, you will receive a message on your phone that says the following: Sent from your Twilio trial account – Hello from Python!. As you can see, that was really easy. All you needed to do was create an instance of TwilioRestClient and then create a message. Twilio will do the rest. Sending an MMS (Multimedia Messaging Service) message is almost as easy. Let’s take a look at that in the next example!

Sending an MMS Message

Most of the time, you won’t actually want to put your sid, authentication token or Twilio phone number in the code itself. Instead, that information would normally be stored in a database or config file. So in this example, we’ll put those pieces of information into a config file and extract them using ConfigObj. Here’s the contents of my config file:

[twilio]
sid = "random-text"
auth_token = "youAreAuthed!"
twilio_number = "123-456-7890"

Now let’s write some code to extract this information and send out an MMS message:

import configobj
from twilio.rest import TwilioRestClient

def send_mms(msg, to, img):
    cfg = configobj.ConfigObj("/path/to/config.ini")
    sid = cfg["twilio"]["sid"]
    auth_token = cfg["twilio"]["auth_token"]
    twilio_number = cfg["twilio"]["twilio_number"]
    client = TwilioRestClient(sid, auth_token)
    message = client.messages.create(body=msg,
                                     from_=twilio_number,
                                     to=to,
                                     media_url=img)

if __name__ == "__main__":
    msg = "Hello from Python!"
    to = "111-111-1111"
    img = ""
    send_mms(msg, to=to, img=img)

This code generates the same message as the last one, but it also sends along an image. You will note that Twilio requires you to use either an HTTP or HTTPS URL to attach photos to your MMS message. Otherwise, that was pretty easy to do too!

Wrapping Up

What could you use this for in the real world? I’m sure businesses will want to use it to send out coupons or offer codes for new products. This also sounds like something that a band or politician would use to send out messages to their fans, constituents, etc. I saw one article where someone used it to automate sending themselves sports scores. I found the service easy to use. I don’t know if their pricing is competitive, but the trial account is certainly good enough for testing and worth a try.

Additional Reading

September 23, 2014 10:15 PM

Alex Gaynor

Python for Ada

Last year I wrote about why I think it's important to support diversity within our communities, and about some of the work the Ada Initiative does to support this. The reasons I talked about are good, and (sadly) as relevant today as they were then.

I'd like to add a few more reasons I care about these issues:

I'm very tired of being tired. And yet, I can't even begin to imagine how tired I would be if I was a recipient of the constant stream of harassment that many women who speak up receive.

For all these reasons (and one more that I'll get to), I'm calling on members of the Python community to join me in showing their support for working to fix these issues, and foster a diverse community, by donating to support the Ada Initiative.

For the next 5 days, Jacob Kaplan-Moss, Carl Meyer, and myself will be matching donations, up to $7,500:

Donate now

I encourage you to donate to show your support.

I mentioned there was one additional reason this is important to me. A major theme, for myself, over the last year has been thinking about my ethical obligations as a programmer (and more broadly, the obligations all programmers have). I've been particularly influenced by this blog post by Glyph, and this talk by Mike Monteiro. If you haven't already, take a moment to read/watch them.

Whoever came up with the term "User agent" to describe a browser uncovered a very powerful idea. Computer programs are expected to faithfully execute and represent the agency of their human operator.

Trying to understand the intent and desire of our users can be a challenging thing under the best of circumstances. As an industry, we compound this problem many times over by the underrepresentation of many groups among our workforce. These issues show up again and again in services such as Twitter's and Facebook's handling of harassment and privacy issues. We tend to build products for ourselves, and when we all look the same, we don't build products that serve all of our users well.

The best hope we have for building programs that are respectful of the agency of our users is for the people who use them to be represented by the people who build them. To get there, we need to create an industry where harassment and abuse are simply unacceptable.

It's a long road, but the Ada Initiative does fantastic work to pursue these goals (particularly in the open source community, which is near and dear to me). Please, join us in supporting the ongoing work of building the community I know we all want to see, and which we can be proud of.

Donate now

September 23, 2014 07:38 PM


Wake up Marvin, the Call for Papers Opens Oct 1st!


PyTennessee 2015, taking place in Nashville, will be accepting all types of proposals starting next Wednesday, Oct 1st, 2014 through Nov 15th, 2014. Due to the competitive nature of the selection process, we encourage prospective speakers to submit their proposals as early as possible. We’re looking forward to receiving your best proposals for tutorials and talks. Lightning talk sign ups will be available at the registration desk the day of the event.

Note: To submit a proposal, sign up or log in to your account and proceed to your account dashboard!

What should I speak about? In short, anything software or Python related. We want to place a special emphasis on covering unique things at PyTennessee. We would love talks on financial, scientific, systems management, and general development topics. We also want to have a bit of fun, so bring on the weird, odd and wonderful Python talks!

Talks These are the traditional talk sessions given during the main conference days. They’re mostly 45 minutes long, but we offer a limited number of 30 minute slots. We organize the schedule into three “tracks”, grouping talks by topic and having them in the same room for consecutive sessions.

Tutorials As with the talks, we are looking for tutorials that can grow this community at any level. We aim for tutorials that will advance Python, advance this community, and shape the future. There are 4 2-hour tutorial sessions and 4 1-hour tutorial sessions available.

Code of Conduct Make sure you review the code of conduct prior to submitting a talk.

September 23, 2014 06:18 PM

Python Piedmont Triad User Group

PYPTUG Meeting - September 29th

PYthon Piedmont Triad User Group meeting

Come join PYPTUG at our next meeting (September 29th 2014) to learn more about the Python programming language, modules and tools. Python is the perfect language to learn if you've never programmed before, and at the other end, it is also the perfect tool that no expert would do without.


Meeting will start at 5:30pm.

We will open with an intro to PYPTUG and how to get started with Python, then cover PYPTUG activities and member projects, and then move on to news from the community.

This month we will have two talks!

Talk #1

by Delphine Masse

Title: "Alembic for database version control"

Bio: Delphine is a python software engineer at Inmar working on APIs and B2B applications. Previous to working at Inmar she was developing website solutions for the pharmaceutical industry on LAMP stack with PHP. Her academic background includes computer science and biology and interests range from data mining to UI optimization.

Abstract: Managing databases manually in multiple environments can quickly become a headache. Alembic is a Python package that allows you to formalize each database change and to apply those changes consistently across multiple environments. I will go over the use of Alembic with MySQL as well as some "gotchas" and how best to handle them.


Talk #2

by Paul Pauca

Title: "Assistive Technologies for the Disabled"
Bio: Dr. Paul Pauca is an Associate Professor at Wake Forest. He is interested in the application of computer science to the benefit of society, particularly in the use of technology to assist the disabled.

Abstract: Dr. Pauca will introduce some of his recent research projects in developing highly affordable modern technology that can bridge the interaction gap between computers and people with physical and cognitive disabilities. The goal of his research group is to exploit recent advancements in machine learning, computer vision, and human computer interaction to use the person’s body as the interface between the computer and the brain, thereby reducing or eliminating the limiting effects of conventional interface devices.


 Lightning talks!

We might have some time for extemporaneous "lightning talks" of 5-10 minutes in duration. If you'd like to do one and are looking for inspiration, some talk suggestions were provided here. Or talk about a project you are working on.


Monday, September 29th 2014
Meeting starts at 5:30PM


Wake Forest University, close to Polo Rd and University Parkway:

Wake Forest University, Winston-Salem, NC 27109

 Map this

See also this campus map (PDF) and also the Parking Map (PDF) (Manchester hall is #20A on the parking map)

And speaking of parking:  Parking after 5pm is on a first-come, first-served basis.  The official parking policy is:
"Visitors can park in any general parking lot on campus. Visitors should avoid reserved spaces, faculty/staff lots, fire lanes or other restricted areas on campus. Frequent visitors should contact Parking and Transportation to register for a parking permit."

Mailing List

Don't forget to sign up to our user group mailing list:

It is the only step required to become a PYPTUG member.

Meetup Group

In order to get a feel for how much food we'll need, we ask that you register your attendance to this meeting on meetup:

September 23, 2014 01:02 PM


Support Vector Machines


Support vector machines are one of the most popular classification methods in machine learning. Although they can be used as a black box, understanding what's happening behind the scenes can be very useful, not to mention interesting.

For an internal learning course, I decided to implement SVMs, and my objective with this article is to mention some of the difficulties I encountered. If you’re planning to explore how to implement support vector machines, keep these issues in mind and the problem will be a little easier to tackle.

Basic idea

To simplify explanations, all along this blog post we will consider an example dataset with 2 features and 2 classes. This way you can imagine your dataset on a cartesian plane:

In many machine learning algorithms, the main task during training is the minimization or maximization of a function. In this case we want to find the line that divides both classes while maximizing the distance between the points of one class and the other.


I won't get too deep into the training process, but basically the formula for the width of the margin is turned into a function of some Lagrange multipliers.

Solving this gives you non-zero multipliers applied to some of the training vectors (also called the support vectors). This is all you need to store to be able to classify; all the math at classification time is done using these.


To classify an unknown example, you decide which class it belongs to based on which side of the dividing line it falls on.

Maximization or minimization?

The first trouble I faced was that the literature sometimes expresses this as a maximization problem and sometimes as a minimization problem. There are a lot of algorithms to find the minimum of a function, so if you have to find a maximum, the trick is to find the minimum of the negated function. In theory, as we’ve seen, this is a maximization problem, but to simplify things it is sometimes presented directly with the formula to minimize, so it is important to keep this in mind when doing your own implementation.

Ok, I got it, but how to find this minimum?

Once you understand what you have to minimize, you have to think about which method to use. Many minimization methods need you to find the derivative (or second derivative) of the function. But this particular problem can be put in terms of a quadratic programming problem. This means you can use a minimization solver for these kinds of problems, such as cvxopt.

Not all problems are like that

You might be thinking that not all problems have a data distribution that is linearly separable, and that's true, but SVMs have a trick (the kernel trick) to work with such datasets. So the main idea remains: when stuck, switch to another perspective.

Other sources

September 23, 2014 12:19 PM

Continuum Analytics Blog

Introducing Blaze - HMDA Practice

We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed this dataset is around 10GB on disk, so we don’t want to load it up into memory with a modern commercial notebook.

Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.

September 23, 2014 10:00 AM

Python Meeting Düsseldorf - 2014-09-30

The following is an announcement for a regional user group meeting in Düsseldorf, Germany; it was originally posted in German and is translated here.


The next Python Meeting Düsseldorf takes place on:

Tuesday, 30.09.2014, 6:00 pm
Room 1, 2nd floor, Bürgerhaus Stadtteilzentrum Bilk
Düsseldorfer Arcaden, Bachstr. 145, 40217 Düsseldorf


Talks already registered

Charlie Clark
       "Generator Gotchas"

Marc-Andre Lemburg
       "Pythons und Fliegen - Speicherbedarf von Python Objekten optimieren" (Pythons and flies - optimizing the memory footprint of Python objects)

We will also present the results of the PyDDF Sprint 2014 taking place this coming weekend.

We are still looking for more talks. If you are interested, please let us know informally by email.

Start time and location

We meet at 6:00 pm at the Bürgerhaus in the Düsseldorfer Arcaden.

The Bürgerhaus shares its entrance with the swimming pool and is located next to the entrance of the underground car park of the Düsseldorfer Arcaden.

A large “Schwimm’in Bilk” logo sits above the entrance. Once through the door, turn directly left to the two elevators, then go up to the 2nd floor. The entrance to Room 1 is immediately on the left as you exit the elevator.

>>> Entrance in Google Street View


The Python Meeting Düsseldorf is a regular event in Düsseldorf aimed at Python enthusiasts from the region.

Our PyDDF YouTube channel, where we publish videos of the talks after each meeting, gives a good overview of past presentations.

The meeting is organized by GmbH, Langenfeld, in cooperation with Clark Consulting & Research, Düsseldorf:


The Python Meeting Düsseldorf uses a mix of open space and lightning talks, although with us the "lightning" can sometimes last 20 minutes :-)

Lightning talks can be registered in advance or brought up spontaneously during the meeting. A projector with XGA resolution is available. Please bring your slides as a PDF on a USB stick.

To register a lightning talk, just send an informal email.


The Python Meeting Düsseldorf is organized by Python users for Python users.

Since the meeting room, projector, internet access and drinks incur costs, we ask attendees for a contribution of EUR 10.00 incl. 19% VAT. Pupils and students pay EUR 5.00 incl. 19% VAT.

We ask all attendees to bring this amount in cash.


Since we only have seats for about 20 people, we ask that you register by email. Registering does not create any obligation, but it does make our planning easier.

To register for the meeting, just send an informal email.

Further information

You can find further information on the meeting's website:


Have fun!

Marc-Andre Lemburg,

September 23, 2014 08:00 AM

Nick Coghlan

Seven billion seconds per second

A couple of years ago, YouTube put together their "One hour per second" site, visualising the fact that for every second of time that elapses, an hour of video is uploaded to YouTube. Their current statistics page indicates that figure is now up to 100 hours per minute (about 1.7 hours per second).

Impressive numbers to be sure. However, there's another set of numbers I personally consider significantly more impressive: every second, more than seven billion seconds are added to the tally of collective human existence on Earth.

Think about that for a moment.

Tick. Another 7 billion seconds of collective human existence.

Tick. Another 117 million minutes of collective human existence.

Tick. Another 2 million hours of collective human existence.

Tick. Another 81 thousand days of collective human existence.

Tick. Another 11 thousand weeks of collective human existence.

Tick. Another 222 years of collective human existence.

222 years of collective human experience, every single second, of every single day. And as the world population grows, it's only going to get faster.

222 years of collective human experience per second.

13 millennia per minute.

800 millennia per hour.

19 million years per day.

135 million years per week.

7 billion years per year.
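
The arithmetic behind these tallies is easy to check; a few lines of Python reproduce the figures from 7 billion seconds accruing per second:

```python
# 7 billion people, each living one second per second.
per_second = 7 * 10 ** 9

minutes = per_second / 60                 # ~117 million minutes per second
hours = per_second / 3600                 # ~1.9 million hours per second
days = per_second / 86400                 # ~81 thousand days per second
weeks = days / 7                          # ~11.6 thousand weeks per second
years = per_second / (365.25 * 86400)     # ~222 years per second

# Scaling the ~222 years/second back up:
per_minute_millennia = years * 60 / 1000            # ~13 millennia per minute
per_year_billions = years * 365.25 * 86400 / 1e9    # 7 billion years per year

print(round(years), round(per_minute_millennia))  # 222 13
```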

The growth in our collective human experience over the course of a single year would stretch halfway back to the dawn of time if it was experienced by an individual.

We currently squander most of that potential. We allow a lot of it to be wasted scrabbling for the basic means of survival like food, clean water and shelter. We lock knowledge up behind closed doors, forcing people to reinvent solutions to already solved problems because they can't afford the entry fee.

We ascribe value to people based solely on their success in the resource acquisition game that is the market economy, without acknowledging the large degree to which sheer random chance is frequently the determinant in who wins and who loses.

We inflict bile and hate on people who have the temerity to say "I'm here, I'm human, and I have a right to be heard", while being different from us. We often focus on those superficial differences, rather than our underlying common humanity.

We fight turf wars based on where we were born, the colour of our skin, and which supernatural beings or economic doctrines we allow to guide our actions.

Is it possible to change this? Is it possible to build a world where we consider people to have inherent value just because they're fellow humans, rather than because of specific things they have done, or specific roles they take up?

I honestly don't know, but it seems worthwhile to try. I certainly find it hard to conceive of a better possible way to spend my own meagre slice of those seven billion seconds per second :)

September 23, 2014 07:35 AM

Andrew Dalke

Summary of the /etc/passwd reader API

Next month I will be in Finland for Pycon Finland 2014. It will be my first time in Finland. I'm especially looking forward to the pre-conference sauna on Sunday evening.

My presentation, "Iterators, Generators, and Coroutines", will cover much of the same ground as my earlier essay. In that essay, I walked through the steps to make an /etc/passwd parser API which can be used like this:

from passwd_reader import read_passwd_entries

with read_passwd_entries("/etc/passwd", errors="strict") as reader:
    location = reader.location
    for entry in reader:
        print("%d: %s" % (location.lineno, entry))  # entry formatting elided in the original

I think the previous essay was a bit too detailed to understand the overall points, so in this essay I'll summarize what I did and why I did it. Hopefully it will also help me prepare for Finland.

The /etc/passwd parser is built around a generator, which is a pretty standard practice. Another standard approach is to build a parser object, as a class instance which implements the iterator protocol. The main difference is that the generator uses local variables in the generator's execution frame where the class approach uses instance variables.

Since the parser API can open a file, when the first parameter is a filename string instead of a file object, I want it to implement the context manager protocol and implement deterministic resource handling. If it always created a file then I could use contextlib.closing() or contextlib.contextmanager() to convert an iterator into a self-closing context manager, but my read_passwd_entries reader is polymorphic in that first parameter, so I can't use a standard solution.
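
The simpler, non-polymorphic case really does only need contextlib.closing(); a minimal sketch, with a stand-in line parser and StringIO in place of a real file:

```python
import io
from contextlib import closing

def read_entries(fileobj):
    # Stand-in parser: yield one stripped line per entry.
    for line in fileobj:
        yield line.rstrip("\n")

# When the reader always owns the file, closing() turns the file object
# into a self-closing context manager with no extra wrapper class:
with closing(io.StringIO("root:x:0:0\nbin:x:1:1\n")) as f:
    entries = list(read_entries(f))

print(entries)  # ['root:x:0:0', 'bin:x:1:1']
```

The polymorphic reader can't do this, because it must not close a file object it was handed by the caller.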

I instead wrapped the generator inside of a PasswdReader which implements the appropriate __enter__ and __exit__ methods.

I also want the parser to track location information about the current record; in this case the line number of the current record, but in general it could include the byte position or other information about the record's provenance. I store this in a Location instance accessed via the PasswdReader's "location" attribute.

The line number is stored as a local variable in the iterator's execution frame. While this could be accessed through the generator's ".gi_frame.f_locals", the documentation says that frames are internal types whose definitions may change with future versions of the interpreter. That doesn't sound like something I want to depend upon.

Instead, I used an uncommon technique where the generator registers a callback function that the Location can use to get the line number. This function is defined inside of the generator's scope so can access the local variables. This isn't quite as simple as I would like, because exception handling in a generator, including the StopIteration from calling a generator's close(), is a bit tricky, but it does work.
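
A minimal sketch of that callback technique (simplified; the Location class and entry format here are illustrative, not the essay's exact code):

```python
class Location:
    """Exposes the current line number via a callback the generator registers."""
    def __init__(self):
        self._get_lineno = lambda: 0
    def register(self, get_lineno):
        self._get_lineno = get_lineno
    @property
    def lineno(self):
        return self._get_lineno()

def read_entries(lines, location):
    # Generator function: 'lineno' lives in this generator's execution frame.
    lineno = 0
    def get_lineno():
        return lineno          # closure over the generator's local variable
    location.register(get_lineno)
    for line in lines:
        lineno += 1
        yield line.rstrip("\n")

location = Location()
records = [(location.lineno, entry)
           for entry in read_entries(["root:x:0:0\n", "bin:x:1:1\n"], location)]
print(records)  # [(1, 'root:x:0:0'), (2, 'bin:x:1:1')]
```

Because get_lineno is defined inside the generator, it reads the live local variable without touching gi_frame at all.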

The more common technique is to rewrite the generator as a class which implements the iterator protocol, where each instance stores its state information as instance variables. It's easy to access instance variables, but it's a different sort of tricky to read and write the state information at the start and end, respectively, of each iteration step.

A good software design balances many factors, including performance and maintainability. The weights for each factor depend on the expected use cases. An unusual alternate design can be justified when it's a better match to the use cases, which I think is the case with my uncommon technique.

In most cases, API users don't want the line number of each record. For the /etc/passwd parser I think it's only useful for error reporting. More generally, it could be used to build a record index, or a syntax highlighter, but those are relatively rare needs.

The traditional class-based solution is, I think, easier to understand and implement, though it's a bit tedious to save and restore the parser state at each entry and exit point. This synchronization adds a lot of overhead to the parser, which isn't needed for the common case where that information is ignored.

By comparison, my alternative generator solution has a larger overhead - two function calls instead of an attribute lookup - to access location information, but it doesn't need the explicit save/restore for each step because those are maintained by the generator's own execution frame. I think it's a better match for my use cases.

(Hmm. I think I've gone the other way. I think few will understand this without the examples or context. So, definitely lots of code examples for Pycon Finland.)

September 23, 2014 01:40 AM

September 22, 2014

PyPy Development

PyPy 2.4.0 released, 9 days left in funding drive

We're pleased to announce the availability of PyPy 2.4.0; faster, fewer bugs, and updated to the python 2.7.8 stdlib.

This release contains several bugfixes and enhancements. Among the user-facing improvements:
  • internal refactoring in string and GIL handling which led to significant speedups
  • improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
  • Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
  • Many issues were resolved since the 2.3.1 release in June

You can download PyPy 2.4.0 here

We would like to also point out that in September, the Python Software Foundation will match funds for any donations up to $10k, so head over to our website and help this mostly-volunteer effort out.

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux. 
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

September 22, 2014 08:12 PM

Hernan Grecco

Simulated devices in PyVISA: early preview

PyVISA started as a wrapper for the NI-VISA library, and therefore you need to install the National Instruments VISA library on your system. This works most of the time, for most people. But sometimes you need to test PyVISA without the devices, or even without VISA.

Starting from version 1.6, PyVISA allows you to use different backends. These backends can be dynamically loaded. PyVISA-sim is one such backend. It implements most of the methods for message-based communication (Serial/USB/GPIB/Ethernet) in a simulated environment. The behavior of simulated devices can be controlled by a simple configuration in plain text. In the near future, you will be able to load this from a file to change it depending on your needs.

To test it you need to install PyVISA 1.6 which is currently only available from GitHub:

$ pip install -U

And then install:

$ pip install -U

For those of you interested in the internals, the plugin mechanism for PyVISA hooks in at the VisaLibrary level. Mocking the library allows for comprehensive and powerful testing.

By the end of the week I will be blogging about another cool VISA backend which will be opened soon: PyVISA-py. It is a backend that does not require the NI-VISA library. Stay tuned!

Remember that this is an early preview. Submit your bug reports, comments and suggestions in the Issue Tracker. We will address them promptly.

Or fork the code:

September 22, 2014 03:39 PM

Wesley Chun

Simple Google API access from Python


Back in 2012 when I published Core Python Applications Programming, 3rd ed., I
posted about how I integrated Google technologies into the book. The only problem is that I presented very specific code for Google App Engine and Google+ only. I didn't show a generic way how, using pretty much the same boilerplate Python snippet, you can access any number of Google APIs; so here we are.

In this multi-part series, I'll break down the code that allows you to leverage Google APIs to the most basic level (even for Python), so you can customize as necessary for your app, whether it's running as a command-line tool or something server-side in the cloud backending Web or mobile clients. If you've got the book and played around with our Google+ API example, you'll find this code familiar, if not identical -- I'll go into more detail here, highlighting the common code for generic API access and then bring in the G+-relevant code later.

We'll start in this first post by demonstrating how to access public or unauthorized data from Google APIs. (The next post will illustrate how to access authorized data from Google APIs.) Regardless of which you use, the corresponding boilerplate code stands alone. In fact, it's probably best if you saved these generic snippets in a library module so you can (re)use the same bits for any number of apps which access any number of modern Google APIs.

Google API access

In order to access Google APIs, follow these instructions:

Accessing Google APIs from Python

Now that you're set up, everything else is done on the Python side. To talk to a Google API, you need the Google APIs Client Library for Python, specifically the apiclient.discovery.build() function. Download and install the library in your usual way, for example:

$ pip install -U google-api-python-client
NOTE: If you're building a Python App Engine app, you'll need something else, the Google APIs Client Library for Python on Google App Engine. It's similar but has extra goodies (specifically decorators -- brief generic intro to those in my previous post) just for cloud developers that must be installed elsewhere. As App Engine developers know, libraries must be in the same location on the filesystem as your source code.
Once everything is installed, make sure that you can import apiclient.discovery:

$ python
Python 2.7.6 (default, Apr  9 2014, 11:48:52)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import apiclient.discovery

In apiclient.discovery is the build() function, which is what we need to create a service endpoint for interacting with an API. Now craft the following lines of code in your command-line tool:

from apiclient.discovery import build

API_KEY = 'YOUR_API_KEY'  # copied from project credentials page
SERVICE = build(API, VERSION, developerKey=API_KEY)

Take the API key you copied from the credentials page and assign it to the API_KEY variable as a string. Obviously, embedding an API key in source code isn't something you'd do in practice as it's not secure whatsoever -- stick it in a database, use a key broker, encrypt it, or at least put it in a separate byte code (.pyc/.pyo) file that you import -- but we'll allow it now solely for illustrative purposes of a simple command-line script.
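
If you want something slightly better than a hard-coded key for a quick script like this, one option is to read it from an environment variable. This is just a sketch; the variable name GOOGLE_API_KEY below is an assumption for illustration:

```python
import os

def load_api_key(var='GOOGLE_API_KEY'):
    # Read the key from the process environment rather than source code.
    key = os.environ.get(var)
    if not key:
        raise RuntimeError('%s is not set' % var)
    return key
```

You would then call build() with developerKey=load_api_key(), keeping the key itself out of version control.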

In our short example we're going to do a simple search for "python" in public Google+ posts, so for the API variable, use the string 'plus'. The API version is currently on version 1 (at the time of this writing), so use 'v1' for VERSION. (Each API will use a different name and version string... again, you can find those in the OAuth Playground or in the docs for the specific API you want to use.) Here's the call once we've filled in those variables:

SERVICE = build('plus', 'v1', developerKey=API_KEY)

We need a template for the results that come back. There are many fields in a Google+ post, so we're only going to pick three to display... the user name, post timestamp, and a snippet of the post itself:

TMPL = '''
    User: %s
    Date: %s
    Post: %s
'''

Now for the code. Google+ posts are activities (specifically "notes"; there are other activity types as well). One of the methods you have access to is search(), which lets you query public activities, so that's what we're going to use. Add the following call, using the SERVICE endpoint you already created and the verbs we just described, and execute it:

items = SERVICE.activities().search(query='python').execute().get('items', [])

If all goes well, the (JSON) response payload will contain a set of 'items' (else we assign an empty list for the for loop). From there, we'll loop through each matching post, do some minor string manipulation to replace all whitespace characters (including NEWLINEs [ \n ]) with spaces, and display if not blank:

for data in items:
    post = ' '.join(data['title'].strip().split())
    if post:
        print TMPL % (data['actor']['displayName'],
                      data['published'], post)
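
The whitespace-flattening idiom in that loop is worth a second look: strip() trims the ends, split() breaks on any run of whitespace (including NEWLINEs), and join() glues the pieces back together with single spaces. A quick standalone demonstration:

```python
raw = 'Monty\n  Python\tand the\n\nHoly Grail  '
# Collapse every interior run of whitespace into a single space.
post = ' '.join(raw.strip().split())
print(post)  # Monty Python and the Holy Grail
```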

We're using the print statement here in Python 2, but a pro tip to start getting ready for Python 3 is to add this import to the top of your script (which has no effect in 3.x) so you can use the print() function instead:

from __future__ import print_function


To find out more about the input parameters as well as all the fields that are in the response, take a look at the docs. Below is the entire script missing only the API_KEY which you'll have to fill in yourself.

#!/usr/bin/env python

from apiclient.discovery import build

TMPL = '''
    User: %s
    Date: %s
    Post: %s
'''

API_KEY = ''  # copied from project credentials page
SERVICE = build('plus', 'v1', developerKey=API_KEY)
items = SERVICE.activities().search(query='python').execute().get('items', [])
for data in items:
    post = ' '.join(data['title'].strip().split())
    if post:
        print TMPL % (data['actor']['displayName'],
                      data['published'], post)

When you run it, you should see pretty much what you'd expect: a few posts on Python, some on Monty Python, and of course, some on the snake. I named my script for "Google+ unauthenticated:"

$ python 

    User: Jeff Ward
    Date: 2014-09-20T18:08:23.058Z
    Post: How to make python accessible in the command window.

    User: Fayland Lam
    Date: 2014-09-20T16:40:11.512Z
    Post: Data Engineer #python #hadoop #jobs...

    User: Willy's Emporium LTD
    Date: 2014-09-20T16:19:33.851Z
    Post: MONTY PYTHON QUOTES MUG Take a swig to wash down all that albatross and crunchy frog. Featuring 20 ...

    User: Doddy Pal
    Date: 2014-09-20T15:49:54.405Z
    Post: Classic Monty Python!!!

    User: Sebastian Huskins
    Date: 2014-09-20T15:33:00.707Z
    Post: Made a small python script to get shellcode out of an executable. I found a nice oneline...

EXTRA CREDIT: To test your skills, check the docs and add a fourth line to each output which is the URL/link to that specific post, so that you (and your users) can open a browser to it if of interest.

If you want to build on from here, check out the larger app using the Google+ API featured in Chapter 15 of the book -- it adds some brains to this basic code by sorting the Google+ posts by popularity using a "chatter" score. That just about wraps it up for this post... tune in next time to see how to get authorized data access!

September 22, 2014 09:04 AM

Gael Varoquaux

Hiring an engineer to mine large brain connectivity databases

Work with us to leverage leading-edge machine learning for neuroimaging

At Parietal, my research team, we work on improving the way brain images are analyzed, for medical diagnostic purposes, or to understand the brain better. We develop new machine-learning tools and investigate new methodologies for quantifying brain function from MRI scans.

One of our important avenues of contribution is in deciphering “functional connectivity”: analyzing the correlation of brain activity to measure interactions across the brain. This direction of research is exciting because it can be used to probe the neural support of functional deficits in incapacitated patients, and thus lead to new biomarkers of functional pathologies, such as autism. Indeed, functional connectivity can be computed without resorting to complicated cognitive tasks, unlike most functional imaging approaches. The flip side is that exploiting such “resting-state” signal requires advanced multivariate statistics tools, something at which the Parietal team excels.

For such multivariate processing of brain imaging data, Parietal has an ecosystem of leading-edge, high-quality tools. In particular, we have built the foundations of the most successful Python machine learning library, scikit-learn, and we are growing a dedicated package, nilearn, that leverages machine learning for neuroimaging. To support this ecosystem, we have dedicated top-notch programmers, led by the well-known Olivier Grisel.

We are looking for a data-processing engineer to join our team and work on applying our tools to very large neuroimaging databases to learn specific biomarkers of pathologies. For this, the work will be shared with the CATI, the French platform for multicentric neuroimaging studies, located in the same building as us. The general context of the job is the NiConnect project, a multi-organisational research project that I lead and that focuses on improving diagnostic tools based on resting-state functional connectivity. We have access to unique algorithms and datasets, before they are published. What we are now missing is the link between those two, and that link could be you.

If you want more details, they can be found in the official job offer. This post is to motivate the job in a personal way that I cannot in an official posting.

Why take this job?

I don’t expect someone to take this job only because it pays the bills. To be clear, the kind of person I am looking for would have no difficulty finding a job elsewhere. So, if you are that person, why would you take the job?

What would make me excited in a resume?

If you are interested and feel up for the challenge, read the real job offer, and send me your resume.

September 22, 2014 07:35 AM

Twisted Matrix Labs

Twisted 14.0.1 & 14.0.2 Released

On behalf of Twisted Matrix Laboratories, I’m releasing Twisted 14.0.1, a security release for Twisted 14.0. It is strongly suggested that users of 14.0.0 upgrade to this release.

This patches a bug in Twisted Web’s Agent, where BrowserLikePolicyForHTTPS would not honour the trust root given, and would use the system trust root instead. This would have broken, for example, attempting to pin the issuer for your HTTPS application because you only trust one issuer.

Note: on OS X, with the system OpenSSL, you still can't fully rely on this API for issuer pinning, due to modifications by Apple — please see for more details.

You can find the downloads at (or alternatively The NEWS file is also available at

Thanks to Alex Gaynor for discovering the bug, to Glyph & Alex for developing a patch, and to David Reid for reviewing it.

14.0.2 is a bugfix patch for distributors, that corrects a failing test in the patch for 14.0.1.

Twisted Regards,

Edit, 22nd Sep 2014: CVE-2014-7143 has been assigned for this issue.

September 22, 2014 06:49 AM

Calvin Spealman

Oh Captain, My Captain

I know of you, but can I say I know you?
Every time I see you, you wear a mask.
Every time I see you, there is a new mask.
I think you wore those masks so skillfully,
But I have come to notice them slightly askew.
I see with each mask a glimpse of the hidden face,
And I see among all the masks similarities.
You have taught me much in my life.
As a child you taught me fun and carefree joy.
When coming of age, you taught me of identity.
As an adult, you taught me of freeful dignity.
I mourned your loss with those who shared sorrow,
And my pain is not unique but it shouldn’t be,
Because the difference you have made is vast.
The world may mourn your loss in sorrow,
But we will always remember your name with joy.

In memory of Robin Williams

September 22, 2014 03:44 AM

Thanks to Apple, this is now my iTunes library

September 22, 2014 03:37 AM

September 21, 2014

Ionel Cristian

Ramblings about data-structures

I was reading this the other day. The article presents this peculiar quote:

"It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures."

—Alan J. Perlis [1]

And then some thoughts and questions regarding the quote that seemed a bit ridiculous. But what does this seemingly ambiguous quote really mean?

After some thought, I've concluded that it's an apologia for Lisp's lists everywhere philosophy.

You can reduce the quote to:

It's better to have one extremely generic type than 10 incompatible types.

Python already does this: every object implements, surprise, an object interface that boils down to a bunch of magic methods.

On the other hand, Lisp has generic functions. What this means is that there is a dispatch system that binds certain function implementations to certain data-structures. A data structure that is bound to specific actions (aka functions or methods) is what I call a type.

The idea of not having specialised data-structures is an illusion - if a function takes something as an input then you have assumed a specific data structure, not just a mere list. This is why I think it's worthwhile designing the data-structures before the actions you'll want to have in the system.

Sounds counter-intuitive, as actions make the system run, not the data-structures - it seems natural to map out the actions first. Alas, starting with data-structures first lets you easily see what actions you could have and what actions you can't. In other words, it gives you perspective on many things: hard constraints, data flow and dependencies.

The actions alone don't give you perspective on dependencies. Dependencies imply state. State implies data. Nor do they give you perspective on what the system can and can't do - actions depend on inputs, data in other words. To put it another way, data is the limiting factor on what actions you could have and what actions you could not.

Given Python's lax access to the innards of objects, a large part of your type's API is just the data-structure. Also, given Python's support for properties [2], a large part of your API could be something that looks like a data-structure. So it's worthwhile to look really hard at this aspect of software design in the early stages of your project.
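
To sketch that last point, a property lets computed state read (and write) like a plain data attribute. The Temperature class below is purely illustrative:

```python
class Temperature(object):
    """Stores kelvin internally but exposes what looks like plain data."""

    def __init__(self, celsius=0.0):
        self.celsius = celsius  # goes through the setter below

    @property
    def celsius(self):
        # Computed on every access, yet reads like a data attribute.
        return self._kelvin - 273.15

    @celsius.setter
    def celsius(self, value):
        self._kelvin = value + 273.15
```

Consumers see `t.celsius`, a data-structure-shaped API, while the representation stays free to change underneath.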

Generality of types*

There are two main properties of types:

  • Utility: How well does the type support the consumer of said type. Does it have all the required actions? Is the API well suited, or does it make the consumer handle things they should not be concerned with? Those are the key questions.
  • Reach: How many distinct consumers can use this type. Does the type bring unwanted dependencies or concerns in the consumer? Does the type have many actions that go unused, and can't be used, by a large part of all the possible consumers?

To give a few examples:

  • A list has a very large reach and, most of the time, it fits consumers that just need a sequence-like type fairly well. However, the utility is very limited - you wouldn't use a list where you would use a higher level type, like an invoice.
  • An invoice has a very limited reach, you can only use it in billing code. But the utility of it is tremendous - you wouldn't want to use a mere list - you'd burden your payment code with concerns that are better encapsulated in the invoice type.
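
To make the contrast concrete, here is a hypothetical minimal Invoice type; the field names are invented for illustration, but notice how it encapsulates concerns a bare list of amounts cannot:

```python
class Invoice(object):
    """Narrow reach (billing code only), but high utility."""

    def __init__(self, buyer, seller, lines):
        self.buyer = buyer
        self.seller = seller
        self.lines = lines  # list of (description, amount) pairs

    @property
    def total(self):
        # Payment code asks for a total instead of summing a raw list itself.
        return sum(amount for _, amount in self.lines)
```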

There's a tradeoff in having both reach and utility: complexity vs re-usability. Something with reach and utility can be used in many places. However, complexity is bound to occur - handling all those disparate use-cases is taxing.

I'd even go as far as to argue there's a hard limit to reaching both goals - pushing one limits the other, from the perspective of what you can have in an API.

If you can afford to change things later it's best to start with good utility and then move towards reach as use-cases arise. Otherwise you're at risk of over-engineering and wasting time both on development and maintenance.

What about the JavaScript Array?*

The Array object (as any other object in JavaScript) is a very interesting problem from the perspective of the iterator interface. A for (var element in variable) block will iterate over whatever is there, both attributes and elements of the actual sequence. From this perspective the Array increases complexity (there's a cognitive burden, both on the implementers of the Array object and on its users). If the Array did not allow attributes, this wouldn't be such an issue. But then the reach would be smaller.

On the other hand, you could in theory use an Array object as an invoice substitute, you could just slap the invoice fields like buyer, seller, total value, reference number etc on the Array object. So from this perspective it has higher utility than a plain list where you can't have arbitrary attributes (AttributeError: 'list' object has no attribute 'buyer').
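
That AttributeError is easy to reproduce, and a trivial list subclass behaves more like a JavaScript Array in this respect, since its instances get a __dict__:

```python
class AttrList(list):
    """A list that, like a JavaScript Array, also accepts attributes."""
    pass

records = AttrList([100.0, 20.0])
records.buyer = 'ACME'       # allowed: subclass instances have a __dict__

try:
    [].buyer = 'ACME'        # plain list instances reject new attributes
except AttributeError as error:
    caught = error
```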

Importance of data-structures*

So, you see, designing data-structures is tricky. There are tradeoffs to be made. If your data is wrongly designed then you'll have to compensate those flaws in your code. That means more code, crappy code, to maintain.

Interestingly, this has been well put almost 40 years ago:

"Show me your tables, and I won't usually need your flowchart; it'll be obvious."

—Fred Brooks, in Chapter 9 of The Mythical Man-Month [3]

The same thing, in more modern language:

"Show me your data structures, and I won't usually need your code; it'll be obvious."

—Eric Raymond, in The Cathedral and the Bazaar, paraphrasing Brooks [3]

Though it seems this isn't instilled in everyone's mind: some people actually think that you can accurately reason about business logic before reasoning about the data [4]. You can't reliably reason about logic if you don't know what data dependencies said logic has.

Chicken and egg*

There are situations where you can't know what data you need to store before you know what the actions are. So you'd be inclined to start thinking about all the actions first.

I prefer to sketch out a minimalistic data-structure and refine it as I become aware of what actions I need to support, making a note of what actions I have so far. This works reasonably well and allows me to easily see any redundancy or potential to generalize, or the need to specialize for that matter. In a way, this is similar to CRC cards but more profound.


Starting with the data first allows you to easily see any redundancy in the data and make intelligent normalization choices.

Duplicated data-structures, especially ones that are slightly different but distinct, are a special kind of evil. They will frequently encourage, and sometimes even force, the programmer to produce duplicated code, or code that tries to handle all the variances. It's very tempting because they are so similar. But alas, you can only think of so many things at the same time.

Even if you don't want to normalize your data, starting with the data first can result in synergy: the data from this place mixes or adapts well to the data from that other place. This synergy will reduce the amount of boilerplate and adapter code.

Closing thoughts*

Alan J. Perlis' principle can't be applied in all situations as the reach property is not always the imperative of software design and it has some disadvantages as illustrated above.

There are situations where none of these ideas apply (I don't mean to be complete):

  • You don't have any data or your application is not data-centric, or there are other more pressing things to consider first.
  • You live in the perfect world of frozen requirements that are known before implementation.
[1]9th quote:
[2]Python allows you to implement transparent getters and setters via descriptors or the property builtin (which implements the descriptor interface).
[3](1, 2)

A most peculiar remark: "As a symptom of this, I've interviewed candidates who, when presented with a simple OO business logic exercise, start by writing Django models. Please don't do this."

September 21, 2014 09:00 PM


5000 commits to Nikola!

Nikola, the static site generator, has just hit 5000 commits! The lucky commit was a99ef7a, with the message Document new previewimage metadata, and was authored by Daniel Aleksandersen (Aeyoun).

Over the course of 2.5 years, 103 different people have committed to the project. The top 10 committers are:

[Chart: Top 10 Nikola committers by number of commits — Roberto Alsina, Chris “Kwpolska” Warrick, Daniel Aleksandersen, Niko Wenselowski, Puneeth Chaganti, Damián Avila, Claudio Canepa, Ivan Teoh, schettino72, and Areski Belaid.]

According to OpenHub.Net, there are 54 active developers contributing code in the past year — making Nikola a large open source project, with its contributors count being in the top 2% of all OpenHub teams.

Using data generated by David A. Wheeler’s SLOCCount, 12,002 source lines of code were produced (including tests and internal code). 95.1% of the codebase is Python, with the remaining 4.9% split between JavaScript, CSS, Shell, and XSLT.

Over the years, a total of 1426 issues were opened, 778 of which are already closed. 38 versions were released, including the most recent one, v7.1.0, released 15 days ago. The first ever release was version v1.1. It already contained many features still present to this day, and it also used the doit tool to do its job. Here’s a screenshot of the original page:

Nikola v1.1 Demo Site

In celebration of this milestone, the demo site from version 1.1 is now online.

Thanks for all the commits — and here’s to the next five thousand!

What is Nikola?

Nikola is a static site and blog generator, written in Python. It can use Mako and Jinja2 templates, accepts input in many popular markup formats, such as reStructuredText and Markdown — and can even turn IPython Notebooks into blog posts! It also supports image galleries, and is multilingual. Nikola is flexible, and page builds are extremely fast, courtesy of doit (which rebuilds only what has changed).

September 21, 2014 08:00 AM

Hector Garcia

From LIKE to Full-Text Search (part II)

If you missed it, read the first post of this series

What do you do when you need to filter a long list of records for your users?

That was the question we set out to answer in a previous post. We saw that, for simple queries, the built-in filtering provided by your framework of choice (think Django) is just fine. Most of the time, though, you'll need something more powerful. This is where PostgreSQL's full text search facilities come in handy.

We also saw that just using the to_tsvector and to_tsquery functions goes a long way toward filtering your records. But what about documents that contain accented characters? What can we do to optimize performance? How do we integrate this with Django?

Hola, Mundo!

We have found that the need to search documents in multiple languages is fairly common. You can query your data using to_tsquery without passing a language configuration name but remember that, under the hood, the text search functions always use one.

The default language is english, but you have to use the right language stemmer according to your document language or you might not get any matches.

If, for example, we search for física in spanish documents that have this word and its variations we would only see exact matches for this query:

=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('física');
 física (aparatos e instrumentos de —)
 física (educación —)
 física (investigación en —)
 rehabilitación física (aparatos de —) para uso médico
 educación física
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(8 rows)

And worse, if we search just for fisica, unaccented:

=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('fisica');
(0 rows)

To get results that contain a variation of física (físicas, físico, físicamente, etc…) we have to use the right stemmer. Remember that if our stemmer doesn't have a word in its dictionary it won't stem it:

=> SELECT ts_lexize('english_stem', 'programming');
 {program}
(1 row)

=> SELECT ts_lexize('spanish_stem', 'programming');
 {programming}
(1 row)

But if we use the right language, it will:

=> SELECT text FROM terms WHERE to_tsvector('spanish', text) @@ to_tsquery('spanish', 'física');
 física (aparatos e instrumentos de —)
 ejercicios físicos (aparatos para —)
 entrenamiento físico (aparatos de —)
 físicos (aparatos para ejercicios —)
 física (educación —)
 preparador físico personal [mantenimiento físico] (servicios de —)
 física (investigación en —)
 ejercicio físico (aparatos de —) para uso médico
 rehabilitación física (aparatos de —) para uso médico
 aparatos para ejercicios físicos
 almacenamiento de soportes físicos de datos o documentos electrónicos
 clases de mantenimiento físico
 clubes deportivos [entrenamiento y mantenimiento físico]
 educación física
 conversión de datos o documentos de un soporte físico a un soporte electrónico
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(18 rows)

Working with multiple languages

Of course, we don't want to fill in the language manually every time. A straightforward solution would be to store the record's language in its own column. to_tsvector and friends accept a string to set the language but also a column name. So we could write:

=> SELECT text FROM terms WHERE to_tsvector(language, text) @@ to_tsquery(language, 'entrena');
 entrenamiento físico (aparatos de —)
 clubes deportivos [entrenamiento y mantenimiento físico]
(2 rows)

The only catch here is that the column must be a regconfig. If you are using South or Django migrations to manage your database schema remember to change the type of your language column:

=> ALTER TABLE terms ALTER COLUMN language TYPE regconfig USING language::regconfig;

This way we can query records in their own language.
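
From Python, you'd send that same predicate through any DB-API driver (psycopg2, for example), binding the search term as a parameter. The helper below only builds the SQL string, using the table and column names from this post; it's a sketch, not part of the original article:

```python
def language_aware_search_sql(table='terms', text_col='text', lang_col='language'):
    """Build the language-aware full-text search query used in this post.

    The %s placeholder is filled by the DB-API driver at execute() time,
    e.g. cursor.execute(sql, ['entrena']) -- never by string formatting.
    """
    return (
        'SELECT {text} FROM {table} '
        'WHERE to_tsvector({lang}, {text}) @@ to_tsquery({lang}, %s)'
    ).format(table=table, text=text_col, lang=lang_col)
```

Passing the user's query as a bound parameter keeps the search term properly escaped.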

Even Better: Accented Characters

But what about accented words? If our users don't carefully type them (most users don't) then they won't find anything.

If we search for the word fisica again, even with all the previous setup, nothing shows up:

=> SELECT text FROM terms WHERE to_tsvector(text) @@ to_tsquery('fisica');
(0 rows)

We found this behavior puzzling (and not in a good way): some words work and some don't, presumably depending on how complete the chosen dictionary is. But we can't depend on every word being in the dictionary; in our experience, searches missed far more often than they hit.
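To see concretely what "filtering out accents" means, here is a small pure-Python illustration using Unicode decomposition. This is a client-side sketch of the idea only, not what the PostgreSQL extension actually runs:

```python
import unicodedata

def strip_accents(text):
    # NFKD decomposition splits 'í' into 'i' plus a combining accent mark;
    # dropping the combining marks leaves just the base characters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("física"))     # fisica
print(strip_accents("educación"))  # educacion
```

Doing this in the database, as below, keeps the search logic in one place and lets the index benefit from it.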

But don't worry. There is a PostgreSQL extension for that™.


The unaccent extension provides a dictionary that filters out accented characters. After installing it with CREATE EXTENSION unaccent;, we can add it to our sp search configuration so accents are stripped before stemming:

ALTER TEXT SEARCH CONFIGURATION sp ALTER MAPPING FOR hword, hword_part, word WITH unaccent, spanish_stem;

And, behold! It works great:

=> SELECT text FROM terms WHERE to_tsvector('sp', text) @@ to_tsquery('sp', 'fisica');
 física (aparatos e instrumentos de —)
 ejercicios físicos (aparatos para —)
 entrenamiento físico (aparatos de —)
 físicos (aparatos para ejercicios —)
 física (educación —)
 preparador físico personal [mantenimiento físico] (servicios de —)
 física (investigación en —)
 ejercicio físico (aparatos de —) para uso médico
 rehabilitación física (aparatos de —) para uso médico
 aparatos para ejercicios físicos
 almacenamiento de soportes físicos de datos o documentos electrónicos
 clases de mantenimiento físico
 clubes deportivos [entrenamiento y mantenimiento físico]
 educación física
 conversión de datos o documentos de un soporte físico a un soporte electrónico
 conversión de datos y programas informáticos, excepto conversión física
 investigación en física
 terapia física
(18 rows)

Note that on a normal installation you must be a superuser to install this extension (on Heroku Postgres this is not necessary). If PostgreSQL complains that it can't find the extension, install your distribution's postgresql-contrib package.

A Note on Performance

To make text search work well in a large dataset, consider creating a GIN index:

CREATE INDEX terms_idx ON terms USING gin(to_tsvector(language, text));

GIN indexes take longer to build than GiST indexes, but lookups are much faster. Keep in mind that the planner will only use this index when the query calls to_tsvector with exactly the same arguments as the index expression.

Wrapping Up: Using It in Django

Finally, to make all this easy to use from Python, we wrote a custom queryset and used django-model-utils to hook it into the model's manager:

from django.db import models
from django.utils.translation import ugettext_lazy as _
from model_utils.managers import PassThroughManager

class TermSearchQuerySet(models.query.QuerySet):
    def search(self, query, raw=False):
        # plainto_tsquery parses plain user text; to_tsquery expects
        # raw tsquery syntax (with operators like & and |).
        function = "to_tsquery" if raw else "plainto_tsquery"
        # Bind the query as a parameter rather than interpolating it into
        # the SQL string, to avoid SQL injection.
        where = "to_tsvector(language, text) @@ %s(language, %%s)" % function
        return self.extra(where=[where], params=[query])

class Term(models.Model):
    text = models.CharField(_('text'), max_length=255)
    language = models.CharField(max_length=20, default='sp', editable=False)

    objects = PassThroughManager.for_queryset_class(TermSearchQuerySet)()

Remember our original example from the first part?

>>> Entry.objects.search('man is biting a tail')
[<Entry: Man Bites Dogs Tails>]

September 21, 2014 01:39 AM

September 20, 2014

Varun Nischal

taT4Py | Recursively Search Regex Patterns

I have mainly used Python for parsing text, validating, and transforming it as needed. When I did this with shell scripts, I would end up writing a variety of regular expressions to play around with. Getting Started Well, Python is no different, and in order to cook up regular expressions, one must import the re module and get started. import … Continue reading
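As a flavor of the kind of thing the post covers (this sketch is mine, not from the article): once re is imported, extracting matching lines from a blob of text is a one-liner. The sample log text and pattern below are invented for illustration.

```python
import re

# Toy log text; the pattern pulls out the message after each "error:" line.
log = "ok 1\nerror: disk full\nok 2\nerror: timeout\n"
errors = re.findall(r"^error: (.+)$", log, flags=re.MULTILINE)
print(errors)  # ['disk full', 'timeout']
```

The re.MULTILINE flag makes ^ and $ anchor at each line boundary instead of only at the start and end of the whole string.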

September 20, 2014 06:34 AM

taT4Py | Convert AutoSys Job Attributes into Python Dictionary

If you ever look at the definition of a specific AutoSys job, you will find that it contains attribute-value pairs (line by line), delimited by a colon ‘:’. I thought it would be cool to parse the job definition by creating a Python dictionary from the attribute-value pairs. Let us take a look at a sample job definition; $> cat sample_jil insert_job: … Continue reading
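The parsing idea is simple enough to sketch. The sample attributes below are invented, not taken from the article:

```python
# Hypothetical sketch: split "attribute: value" lines into a dictionary.
jil = """\
insert_job: sample_job
job_type: c
command: /usr/local/bin/backup.sh
machine: prod-host-01
"""

def parse_jil(text):
    job = {}
    for line in text.splitlines():
        if ':' in line:
            # partition splits on the first colon only, so values
            # containing colons survive intact.
            key, _, value = line.partition(':')
            job[key.strip()] = value.strip()
    return job

print(parse_jil(jil)['insert_job'])  # sample_job
```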

September 20, 2014 06:28 AM

September 19, 2014

Robin Wilson

Review: High Performance Python by Gorelick and Ozsvald

Summary: Fascinating book covering the whole breadth of high performance Python. It starts with detailed discussion of various profiling methods, continues with chapters on performance in standard Python, then focuses on high performance using arrays, compiling to C and various approaches to parallel programming. I learnt a lot from the book, and have already started improving the performance of the code I wrote for my PhD (rather than writing up my thesis, but oh well…).


Reference: Gorelick, M. and Ozsvald, I., 2014, High Performance Python, O’Reilly, 351pp. Amazon Link | Publisher’s Link

I would consider myself to be a relatively good Python programmer, but I know that I don’t always write my code in a way that would allow it to run fast. This, as Gorelick and Ozsvald point out a number of times in the book, is actually a good thing: it’s far better to focus on programmer time than CPU time – at least in the early stages of a project. This has definitely been the case for the largest programming project that I’ve worked on recently: my PhD algorithm. It’s been difficult enough to get the algorithm to work properly as it is – and any focus on speed improvements during my PhD would definitely have been a premature optimization!

However, I’ve now almost finished my PhD, and one of the improvements listed in the ‘Further Work’ section at the end of my thesis is to improve the computational efficiency of my algorithm. I specifically requested a review copy of this book from O’Reilly as I hoped it would help me to do this: and it did!

I have a background in C and have taken a ‘High Performance Computing’ class at my university, so I already knew some of the theory, but was keen to see how it applied to Python. I must admit that when I started the book I was disappointed that it didn’t jump straight into high performance programming with numpy and parallel programming libraries – but I soon changed my mind when I learnt about the range of profiling tools (Chapter 2), and the significant performance improvements that can be made in pure Python code (Chapters 3-5). In fact, when I finished the book and started applying it to my PhD algorithm I was surprised just how much optimization could be done on my pure Python code, even though the algorithm is a heavy user of numpy.
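As one concrete example of the kind of pure-Python win those chapters cover (my sketch, not an example from the book): replacing repeated membership tests on a list with a set.

```python
import timeit

# Membership tests are O(n) on a list but O(1) on a set, so looking up
# the worst-case element repeatedly shows a large gap.
setup = "data_list = list(range(10000)); data_set = set(data_list)"
list_time = timeit.timeit("9999 in data_list", setup=setup, number=1000)
set_time = timeit.timeit("9999 in data_set", setup=setup, number=1000)
print(set_time < list_time)  # True
```

The timeit module, covered in the profiling chapter, is the simplest way to check micro-optimizations like this honestly.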

When we got to numpy (Chapter 6) I realised there were a lot of things that I didn’t know – particularly the inefficiency of how numpy allocates memory for storing the results of computations. The whole book is very ‘data-driven’: they show you all of the code, and then the results for each version of the code. This chapter was a particularly good example of this, using the Linux perf tool to show how different Python code led to significantly different behaviour at a very low level. As a quick test I implemented numexpr for one of my more complicated numpy expressions and found that it halved the time taken for that function: impressive!
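The allocation issue is easy to demonstrate. In a sketch of my own (assuming numpy is available; numexpr attacks the same problem by compiling the whole expression), an expression like 2*a + 3*b allocates a fresh temporary array for each intermediate result, while explicit out= arguments reuse buffers:

```python
import numpy as np

a = np.arange(1000, dtype=float)
b = np.arange(1000, dtype=float)

# Naive form: 2*a and 3*b each allocate a temporary array before the sum.
naive = 2 * a + 3 * b

# Buffer-reusing form: compute in-place so intermediates don't allocate.
out = a * 2                 # one allocation, kept as the output buffer
np.multiply(b, 3, out=b)    # 3*b computed in-place, clobbering b
np.add(out, b, out=out)     # sum written back into the output buffer

print(np.allclose(naive, out))  # True
```

On arrays of a few elements this makes no difference, but on large arrays the avoided temporaries reduce both memory traffic and cache pressure.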

I found the methods for compiling to C (discussed in Chapter 7) to be a lot easier than expected, and I even managed to set up Cython on my Windows machine to play around with it (admittedly by around 1am…but still!). Chapter 8 focused on concurrency, mostly in terms of asynchronous events. This wasn’t particularly relevant to my scientific work, but I can see how it would be very useful for some of the other scripts I’ve written in the past: downloading things from the internet, processing data in a webapp etc.

Chapter 9 was definitely useful from the point of view of my research, and I found the discussion of a wide range of solutions for parallel programming (threads, processes, and then the various methods for sharing flags) very useful. I felt that Chapter 10 was a little limited, and focused more on the production side of a cluster (repeatedly emphasising how you need good system admin support) than how to actually program effectively for a cluster. A larger part of this section devoted to the IPython parallel functionality would have been nice here. Chapter 11 was interesting but also less applicable to me – although I was surprised that nothing was mentioned about using integers rather than floats in large amounts of data where possible (in satellite imaging values are often multiplied by 10,000 or 100,000 to make them integers rather than floats and therefore smaller to store and quicker to process). I found the second example in Chapter 12 (by Radim Rehurek) by far the most useful, and wished that the other examples were a little more practical rather than discussing the production and programming process.

Although I have made a few criticisms above, overall the book was very interesting, very useful and also fun to read (the latter is very important for a subject that could be relatively dry). There were a few niggles: some parts of the writing could have done with a bit more proof-reading, some things were repeated a bit too much both within and between chapters, and I really didn’t like the style of the graphs (that is me being really picky – although I’d still prefer those style graphs over no graphs at all!). If these few niggles were fixed in the 2nd edition then I’d have almost nothing to moan about! In fact, I really hope there is a second edition, as one of the great things about this area of Python is how quickly new tools are developed – this is wonderful, but it does mean that books can become out of date relatively quickly. I’d be fascinated to have an update in a couple of years, by which time I imagine many of the projects mentioned in the book will have moved on significantly.

Overall, I would strongly recommend this book for any Python programmer looking to improve the performance of their code. You will get a lot out of it whether you write in pure Python or use numpy a lot, whether you are an expert in C or a novice, and whether you have a single machine or a large cluster.

September 19, 2014 09:50 PM

Mike C. Fletcher

Eventually Text Munging Gets Nasty

So today I started into the basic command-and-control part of the Listener system. The original tack I took was to create escaped commands inside the text stream using the "typing" substitutions (the stuff that converts ,comma into ','):

some text \action_do_something more text

But that got rather grotty rather fast when looking at corner cases (e.g. when you want to type \action to document that mechanism). So I reworked it to have two different levels of operation: the first pre-processes the stream to find commands and splits out the text, so that you get a sequence of commands-and-text while interpreting. That should allow for e.g. switching the interpretation context in-flight.
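A minimal sketch of that pre-processing pass (hypothetical code, not Listener's actual implementation): split the stream into command and text tokens up front, then interpret the resulting sequence.

```python
import re

# Commands are backslash-prefixed words; everything else is plain text.
TOKEN = re.compile(r"\\(\w+)|([^\\]+)")

def tokenize(stream):
    tokens = []
    for command, text in TOKEN.findall(stream):
        if command:
            tokens.append(("command", command))
        else:
            tokens.append(("text", text))
    return tokens

print(tokenize(r"some text \action_do_something more text"))
```

This yields [('text', 'some text '), ('command', 'action_do_something'), ('text', ' more text')], so the interpreter sees commands as distinct tokens rather than having to re-scan the text.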

Still need to get the actual commands hooked up to do something. The meta-commands (commands related to the dictation process itself, such as "correct that" or "undo that") will come first; after that I can look at how to make commands registered by apps get passed through and interpreted on the client side.

September 19, 2014 06:59 PM

Glyph Lefkowitz


I am not an engineer.

I am a computer programmer. I am a software developer. I am a software author. I am a coder.

I program computers. I develop software. I write software. I code.

I’d prefer that you not refer to me as an engineer, but this is not an essay about how I’m going to heap scorn upon you if you do so. Sometimes, I myself slip and use the word “engineering” to refer to this activity that I perform. Sometimes I use the word “engineer” to refer to myself or my peers. It is, sadly, fairly conventional to refer to us as “engineers”, and avoiding this term in a context where it’s what everyone else uses is a constant challenge.

Nevertheless, I do not “engineer” software. Neither do you, because nobody has ever known enough about the process of creating software to “engineer” it.

According to the dictionary, “engineering” is:

“the art or science of making practical application of the knowledge of pure sciences, as physics or chemistry, as in the construction of engines, bridges, buildings, mines, ships, and chemical plants.”

When writing software, we typically do not apply “knowledge of pure sciences”. Very little science is germane to the practical creation of software, and the places where it is relevant (firmware for hard disks, for example, or analytics for physical sensors) are highly rarified. The one thing that we might sometimes use called “science”, i.e. computer science, is a subdiscipline of mathematics, and not a science at all. Even computer science, though, is hardly ever brought to bear - if you’re a working programmer, what was the last project where you had to submit formal algorithmic analysis for any component of your system?

Wikipedia has a heaping helping of criticism of the terminology behind software engineering, but rather than focusing on that, let's see where Wikipedia tells us software engineering comes from in the first place:

The discipline of software engineering was created to address poor quality of software, get projects exceeding time and budget under control, and ensure that software is built systematically, rigorously, measurably, on time, on budget, and within specification. Engineering already addresses all these issues, hence the same principles used in engineering can be applied to software.

Most software projects fail; as of 2009, 44% are late, over budget, or out of specification, and an additional 24% are cancelled entirely. Only about a third of projects succeed by those criteria: on time, on budget, within specification, and completed.

What would that look like if another engineering discipline had that sort of hit rate? Consider civil engineering. Would you want to live in a city where almost a quarter of all the buildings were simply abandoned half-constructed, or fell down during construction? Where almost half of the buildings were missing floors, had rents in the millions of dollars, or both?

My point is not that the software industry is awful. It certainly can be, at times, but it’s not nearly as grim as the metaphor of civil engineering might suggest. Consider this: despite the statistics above, is using a computer today really like wandering through a crumbling city where a collapsing building might kill you at any moment? No! The social and economic costs of these “failures” are far lower than most process consultants would have you believe. In fact, the cause of many such “failures” is a clumsy, ham-fisted attempt to apply engineering-style budgetary and schedule constraints to a process that looks nothing whatsoever like engineering. I have to use scare quotes around “failure” because many of these projects classified as failed have actually delivered significant value. For example, if the initial specification for a project is overambitious due to lack of information about the difficulty of the tasks involved – an extremely common problem at the beginning of a software project – that would still be a failure according to the metric of “within specification”, but it’s a problem with the specification and not the software.

Certain missteps notwithstanding, most of the progress in software development process improvement in the last couple of decades has been in acknowledging that it can’t really be planned very far in advance. Software vendors now have to constantly present works in progress to their customers, because the longer they go without doing so, the greater the risk that the software will not meet the somewhat arbitrary goals for being “finished” and may never be presented to customers at all.

The idea that we should not call ourselves “engineers” is not a new one. It is a minority view, but I’m in good company in that minority. Edsger W. Dijkstra points out that software presents what he calls “radical novelty” - it is too different from all the other types of things that have come before to try to construct it by analogy to those things.

One of the ways in which writing software is different from engineering is the matter of raw materials. Skyscrapers and bridges are made of steel and concrete, but software is made out of feelings. Physical construction projects can be made predictable because the part where creative people are creating the designs - the part of that process most analogous to software - is a small fraction of the time required to create the artifact itself.

Therefore, in order to create software you have to have an “engineering” process that puts its focus primarily upon the psychological issue of making your raw materials - the brains inside the human beings you have acquired for the purpose of software manufacturing - happy, so that they may be efficiently utilized. This is not a common feature of other engineering disciplines.

The process of managing the author’s feelings is a lot more like what an editor does when “constructing” a novel than what a foreperson does when constructing a bridge. In my mind, that is what we should be studying, and modeling, when trying to construct large and complex software systems.

Consequently, not only am I not an engineer, I do not aspire to be an engineer, either. I do not think that it is worthwhile to aspire to the standards of another entirely disparate profession.

This doesn’t mean we shouldn’t measure things, or have quality standards, or try to agree on best practices. We should, by all means, have these things, but we authors of software should construct them in ways that make sense for the specific details of the software development process.

While we are on the subject of things that we are not, I’m also not a maker. I don’t make things. We don’t talk about “building” novels, or “constructing” music, nor should we talk about “building” and “assembling” software. I like software specifically because of all the ways in which it is not like “making” stuff. Making stuff is messy, and hard, and involves making lots of mistakes.

I love how software is ethereal, and mistakes are cheap and reversible, and I don’t have any desire to make it more physical and permanent. When I hear other developers use this language to talk about software, it makes me think that they envy something about physical stuff, and wish that they were doing some kind of construction or factory-design project instead of making an application.

The way we use language affects the way we think. When we use terms like “engineer” and “builder” to describe ourselves as creators, developers, maintainers, and writers of software, we are defining our role by analogy and in reference to other, dissimilar fields.

Right now, I think I prefer the term “developer”, since the verb develop captures both the incremental creation and ongoing maintenance of software, which is so much a part of any long-term work in the field. The only disadvantage of this term seems to be that people occasionally think I do something with apartment buildings, so I am careful to always put the word “software” first.

If you work on software, whichever particular phrasing you prefer, pick one that really calls to mind what software means to you, and don’t get stuck in a tedious metaphor about building bridges or cars or factories or whatever.

To paraphrase a wise man:

I am developer, and so can you.

September 19, 2014 06:22 PM