
Planet Python

Last update: July 24, 2017 09:46 PM

July 24, 2017


Python Anywhere

Outage report: 20, 21 and 22 July 2017

We had several outages over the last few days. The problem appears to be fixed now, but investigations into the underlying cause are still underway. This post is a summary of what happened, and what we know so far. Once we've got a better understanding of the issue, we'll post more.

It's worth saying at the outset that while the problems related to the way we manage our users' files, those files themselves were always safe. While availability problems are clearly a big issue, we regard data integrity as more important.

20 July: the system update

On Thursday 20 July, at 05:00 UTC, we released a new system update for PythonAnywhere. This was largely an infrastructural update. In particular, we updated our file servers from Ubuntu 14.04 to 16.04, as part of a general upgrade of all servers in our cluster.

File servers are, of course, the servers that manage the files in your PythonAnywhere disk storage. Each server handles the data for a specific set of users, and serves the files up to the other servers in the cluster that need them -- the "execution" servers where your websites, your scheduled tasks, and your consoles run. The files themselves are stored on network-attached storage (and mirrored in realtime to redundant disks on a separate set of backup servers); the file servers simply act as NFS servers and manage a few simple things like disk quotas.

While the system update took a little longer than we'd planned, once everything was up and running, the system looked stable and all monitoring looked good.

20 July: initial problems

At 12:07 UTC our monitoring system showed a very brief issue. From some of our web servers, it appeared that access to one of our file servers, file-2, had very briefly slowed right down -- it was taking more than 30 seconds to list the specific directory that is monitored. The problem cleared up after about 60 seconds. Other file servers were completely unaffected. We did some investigations, but couldn't find anything, so we chalked it up as a glitch and kept an eye out for further problems.

At 14:12 UTC it happened again, and then over the course of the afternoon, the "glitches" became more frequent and started lasting longer. We discovered that the symptom from the file server's side was that all of the NFS daemons -- the processes that together make up an NFS server -- would become busy; system load would rise from about 1.5 to 64 or so. They were all waiting uninterruptibly on what we think was disk I/O (status "D" in top).

The problem only affected file-2 -- other file servers were all fine. Given that every file server had been upgraded to an identical system image, our initial suspicion was that there might be some kind of hardware problem. At 17:20 UTC we got in touch with AWS to discuss whether this was likely.

By 19:10 our discussions with AWS had revealed nothing of interest. The "glitches" had become a noticeable problem for users, and we decided that while there was no obvious sign of hardware problems, it would be good to at least eliminate that as a possible cause. So we took a snapshot of all disks containing user data (for safety), then migrated the server to new hardware, causing a 20-minute outage for users on that file server (who were already seeing a serious slowdown anyway), and a 5-minute outage for everyone else, the latter because we had to reboot the execution servers.

After this move, at 19:57 UTC, everything seemed OK. Our monitoring was clear, and the users we were in touch with confirmed that everything was looking good.

21 July: the problem continues

At 14:31 UTC on 21 July, we saw another glitch on our monitoring. Again, the problem cleared up quickly, but we started looking again into what could possibly be the cause. There were further glitches at 15:17 and 16:51, but then the problem seemed to clear up.

Unfortunately, at 22:44 it flared up again. Again, the issues started happening more frequently, and lasting longer each time, until they became very noticeable for our users at around 00:30 UTC. At 00:55 UTC we decided to move the server to different hardware again -- there's no official way to force a move to new hardware on AWS; stopping and starting an instance usually does it, but there's a small chance you'd end up on the same physical host again, so a second attempt seemed worthwhile. If nothing else, it would hopefully at least clear things up for another 12 hours or so and buy us time to work out what was really going wrong.

This time, things didn't go according to plan. The file server failed to come up on the new hardware, and trying to move again did not help. We decided that we were going to need to provision a completely fresh file server, and move the disks across. While we have processes in place for replacing file servers as part of a normal system update, and for moving them to new hardware without changing (for example) their IP address, replacing one under these circumstances is not a procedure we've done before. Luckily, it went as well as could be expected under the circumstances. At 01:23 UTC we'd worked out what to do and started the new file server. By 01:50 we'd started the migration, and by 02:20 UTC everything was moved over. There were a few remaining glitches, but these were cleared up by 02:45 UTC.

22 July: more problems -- and a resolution?

We did not expect the fix we'd put in to be a permanent solution -- though we did have a faint hope that perhaps the problem had been caused by some configuration issue on file-2, which might have been remediated by our having provisioned a new server rather than simply moving the old one. This was never a particularly strong hope, however, and when the problems started again at 12:16 UTC we weren't entirely surprised.

We had generated two new hypotheses about the possible cause of these issues by now:

  1. that the problem was related to the number of NFS daemon processes running on the file server, and
  2. that it was somehow caused by the upgrade from Ubuntu 14.04 to 16.04.

The problem with both of these hypotheses was that only one of our file servers was affected. All file servers had the same number of workers, and all had been upgraded to 16.04.

Still, it was worth a try, we thought. We decided to try changing the number of daemon processes first, as we believed it would cause minimal downtime; however, we started up a new file server on 14.04 so that it would be ready just in case.

At 14:41 UTC we reduced the number of workers down to eight. We were happy to see that this was picked up across the cluster without any need to reboot anything, so there was no downtime.

Unfortunately, at 15:04, we saw another problem. We decided to spend more time investigating a few ideas that had occurred to us before taking the system down again. At 19:00 we tried increasing the number of NFS processes to 128, but that didn't help. At 19:23 we decided to go ahead with switching over to the 14.04 server we'd prepared earlier. We kicked off some backup snapshots of the user data, just in case there were problems, and at 19:38 we started the migration over.

This completed at 19:46, but required a restart of all of the execution servers in order to pick up the new file server. We started this process immediately, and web servers came back online at 19:48, consoles at 19:50, and scheduled tasks at 19:55.

By 20:00 we were satisfied that everything looked good, and so we went back to monitoring.

Where we are now

Since the update on Saturday, there were no monitoring glitches at all on Sunday, but we did see one potential problem on Monday at 12:03. However, this blip was only noticed from one of our web servers (previous issues affected at least three at a time, and sometimes as many as nine), and it has not been followed by any subsequent outages in the four hours since, which is somewhat reassuring.

We're continuing to monitor closely, and are brainstorming hypotheses to explain what might have happened (or, perhaps still be happening). Of particular interest is the fact that this issue only affected one of our file servers, despite all of them having been upgraded. One possibility we're considering is that the correlation in timing with the upgrade is simply a red herring -- that instead there's some kind of access pattern, some particular pattern of reads/writes to the storage, which only started at around midday on Thursday after the system update. We're planning possible ways to investigate that should the problem occur again.

Either way, whether the problem is solved now or not, we clearly have much more investigation to do. We'll post again when we have more information.

July 24, 2017 04:44 PM


Will Kahn-Greene

Soloists: code review on a solo project

Summary

I work on some projects with other people, but I also spend a lot of time working on projects by myself. When I'm working by myself, I have difficulties with the following:

  1. code review
  2. bouncing ideas off of people
  3. pair programming
  4. long slogs
  5. getting help when I'm stuck
  6. publicizing my work
  7. dealing with loneliness
  8. going on vacation

I started a #soloists group at Mozilla figuring there are a bunch of other Mozillians who are working on solo projects and maybe if we all work alone together, then that might alleviate some of the problems of working solo. We hang out in the #soloists IRC channel on irc.mozilla.org. If you're solo, join us!

I keep thinking about writing a set of blog posts for things we've talked about in the channel and how I do things. Maybe they'll help you.

This one covers code review.

Read more… (10 mins to read)

July 24, 2017 04:00 PM


Doug Hellmann

hmac — Cryptographic Message Signing and Verification — PyMOTW 3

The HMAC algorithm can be used to verify the integrity of information passed between applications or stored in a potentially vulnerable location. The basic idea is to generate a cryptographic hash of the actual data combined with a shared secret key. The resulting hash can then be used to check the transmitted or stored message … Continue reading hmac — Cryptographic Message Signing and Verification — PyMOTW 3
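
As a rough, self-contained illustration of that idea (my own sketch, not code from the PyMOTW article; the key and message values are made up), signing and verifying a message with Python's built-in hmac module could look like this:

import hashlib
import hmac

secret_key = b'shared-secret-key'   # assumed shared secret, for illustration only
message = b'important payload'

# Sender side: compute a keyed digest of the message.
digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()

# Receiver side: recompute the digest with the same key and compare it
# in constant time to avoid leaking timing information.
expected = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
print(hmac.compare_digest(digest, expected))  # True when the message is intact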

July 24, 2017 01:00 PM


Python Software Foundation

2017 Bylaw Changes

The PSF has changed its bylaws, following a discussion and vote among the voting members. I'd like to publicly explain those changes.

For each of the changes, I will describe 1) what the bylaws used to say prior to June 2017, 2) what the new bylaws say, and 3) why the changes were implemented.

Certification of Voting Members
Every member had to acknowledge every year whether or not they wanted to vote.
The bylaws now say that the list of voters is based on criteria decided upon by the board.
The previous bylaws pertaining to this topic created too much work for our staff to handle, and sometimes it was not done because we did not have the time or resources to do it. We can now change the certification to something more manageable for our staff and our members.

Voting in New PSF Fellow Members
We did not have a procedure in place for this in the previous bylaws.
Now the bylaws allow any member to nominate a Fellow. Additionally, they give the PSF Board the chance to create a work group for evaluating the nominations.
We lacked a procedure. We had several inquiries and nominations in the past, but did not have a policy to respond with. Now that this bylaw has been voted in, the PSF Board has voted in the creation of the Work Group. We can now begin accepting new Fellow members for the first time in several years.

Staggered Board Terms
We did not have staggered board terms prior to June 2017. Every director would be voted on every term.
The bylaws now say that in the June election, the top 4 vote-getting directors hold 3-year terms, the next 4 hold 2-year terms, and the next 3 hold 1-year terms. That resulted in:
  1. Naomi Ceder (3 yr)
  2. Eric Holscher (3 yr)
  3. Jackie Kazil (3 yr)
  4. Paul Hildebrandt (3 yr)
  5. Lorena Mesa (2 yr)
  6. Thomas Wouters (2 yr)
  7. Kushal Das (2 yr)
  8. Marlene Mhangami (2 yr)
  9. Kenneth Reitz (1 yr)
  10. Trey Hunner (1 yr)
  11. Paola Katherine Pacheco (1 yr)
The main push behind this change is continuity. As the PSF continues to grow, we are hoping to make it more stable and sustainable. Having some directors in place for more than one year will help us better complete short-term and long-term projects. It will also help us pass on context from previous discussions and meetings.

Direct Officers
We did not have Direct Officers prior to June 2017.
The bylaws state that the current General Counsel and Director of Operations will be the Direct Officers of the PSF. Additionally, they state that the Direct Officers become the 12th and 13th members of the board, giving them the right to vote on board business. Direct Officers can be removed if they a) fail an approval vote, held on at least the same schedule as 3-year-term directors; b) leave the office associated with the officer director position; or c) fail a no-confidence vote.
In an effort to become a more stable and mature board, we are appointing two important positions as directors of the board. Having the General Counsel and the Director of Operations on the board gives us more strength in legal matters and in how the PSF operates. The two new Direct Officers are:
  1. Van Lindberg
  2. Ewa Jodlowska

Delegating Ability to Set Compensation
The bylaws used to state that the President of the Foundation would direct how compensation of the Foundation’s employees was decided.
The bylaws have changed so that the Board of Directors decide how employee compensation is decided.
This change was made because even though we keep the president informed of major changes, Guido does not participate in day-to-day operations or employee management. We wanted the bylaws to clarify the most effective and fair way we set compensation for our staff.

We hope this breakdown sheds light on the changes and why they were important to implement. Please feel free to contact me with any questions or concerns.

July 24, 2017 11:35 AM


A. Jesse Jiryu Davis

Vote For Your Favorite PyGotham Talks

Black and white photograph of voters in 1930s-era British dress, standing lined up on one side of a wooden table, consulting with poll workers seated on the other side of the table and checking voter rolls.

We received 195 proposals for talks at PyGotham this year. Now we have to find the best 50 or so. For the first time, we’re asking the community to vote on their favorite talks. Voting will close August 7th; then I and my comrades on the Program Committee will make a final selection.

Your Mission, If You Choose To Accept It

We need your help judging which proposals are the highest quality and the best fit for our community’s interests. For each talk we’ll ask you one question: “Would you like to see this talk at PyGotham?” Remember, PyGotham isn’t just about Python: it’s an eclectic conference about open source technology, policy, and culture.

You can give each talk one of three ratings: +1, 0, or -1.

You can sign up for an account and begin voting at vote.pygotham.org. The site presents you with talks in random order, omitting the ones you have already voted on. For each talk, you will see this form:

image of +1/0/-1 voting form

Click “Save Vote” to make sure your vote is recorded. Once you do, a button appears to jump to the next proposal.

Our thanks to Ned Jackson Lovely, who made this possible by sharing the talk voting app “progcom” that was developed for the PyCon US committee.

So far, about 50 people have cast votes. We need to hear from you, too. Please help us shape this October’s PyGotham. Vote today!


Image: Voting in Brisbane, 1937

July 24, 2017 10:57 AM


Catalin George Festila

Fix Gimp with python script.

Today I will show you how the Python language can help GIMP users.
From my point of view, GIMP does not properly import frames from GIF files.
The screenshot in the original post shows how GIMP imports a GIF file.

Using a Python module, you can get the correct frames from the GIF file.
Here's my script, which uses the Python PIL module.

import sys
from PIL import Image, ImageSequence

try:
    img = Image.open(sys.argv[1])
except IOError:
    print "Can't load", sys.argv[1]
    sys.exit(1)

pal = img.getpalette()
prev = img.convert('RGBA')
prev_dispose = True
for i, frame in enumerate(ImageSequence.Iterator(img)):
    dispose = frame.dispose

    if frame.tile:
        # crop the frame to the region it actually updates
        x0, y0, x1, y1 = frame.tile[0][1]
        if not frame.palette.dirty:
            frame.putpalette(pal)
        frame = frame.crop((x0, y0, x1, y1))
        bbox = (x0, y0, x1, y1)
    else:
        bbox = None

    if dispose is None:
        # no disposal: paste the partial frame over the previous output
        prev.paste(frame, bbox, frame.convert('RGBA'))
        prev.save('result_%03d.png' % i)
        prev_dispose = False
    else:
        if prev_dispose:
            prev = Image.new('RGBA', img.size, (0, 0, 0, 0))
        out = prev.copy()
        out.paste(frame, bbox, frame.convert('RGBA'))
        out.save('result_%03d.png' % i)
Name the Python script convert_gif.py and then you can run it on a GIF file as follows:
C:\Python27>python.exe convert_gif.py 0001.gif
The final result has a smaller number of images than in Gimp, but this was to be expected.

July 24, 2017 10:24 AM

July 23, 2017


Kevin Dahlhausen

Using Beets from 3rd Party Python Applications

I am thinking of using Beets as the music library in a project I am updating. The only example of using it this way is the source code of the Beets command-line interface. That code is well written but does much more than I need, so I decided to create a simple example of using Beets in a 3rd party application.

The hardest part turned out to be determining how to create a proper configuration programmatically. The final code is short:


        config["import"]["autotag"] = False
        config["import"]["copy"] = False
        config["import"]["move"] = False
        config["import"]["write"] = False
        config["library"] = music_library_file_name
        config["threaded"] = True 

This will create a configuration that keeps the music files in place and does not attempt to autotag them.

Importing files requires one to subclass importer.ImportSession. A simple importer that imports files without changing them is:


    class AutoImportSession(importer.ImportSession):
        "a minimal session class for importing that does not change files"

        def should_resume(self, path):
            return True

        def choose_match(self, task):
            return importer.action.ASIS

        def resolve_duplicate(self, task, found_duplicates):
            pass

        def choose_item(self, task):
            return importer.action.ASIS 

That’s the trickiest part of it. The full demo is:


# Copyright 2017, Kevin Dahlhausen
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.

from beets import config
from beets import importer
from beets.ui import _open_library

class Beets(object):
    """a minimal wrapper for using beets in a 3rd party application
       as a music library."""

    class AutoImportSession(importer.ImportSession):
        "a minimal session class for importing that does not change files"

        def should_resume(self, path):
            return True

        def choose_match(self, task):
            return importer.action.ASIS

        def resolve_duplicate(self, task, found_duplicates):
            pass

        def choose_item(self, task):
            return importer.action.ASIS

    def __init__(self, music_library_file_name):
        """ music_library_file_name = full path and name of
            music database to use """
        "configure to keep music in place and do not auto-tag"
        config["import"]["autotag"] = False
        config["import"]["copy"] = False
        config["import"]["move"] = False
        config["import"]["write"] = False
        config["library"] = music_library_file_name
        config["threaded"] = True

        # create/open the beets library
        self.lib = _open_library(config)

    def import_files(self, list_of_paths):
        """import/reimport music from the list of paths.
            Note: This may need some kind of mutex as I
                  do not know the ramifications of calling
                  it a second time if there are background
                  import threads still running.
        """
        query = None
        loghandler = None  # or log.handlers[0]
        self.session = Beets.AutoImportSession(self.lib, loghandler,
                                               list_of_paths, query)
        self.session.run()

    def query(self, query=None):
        """return list of items from the music DB that match the given query"""
        return self.lib.items(query)

if __name__ == "__main__":

    import os

    # this demo places music.db in same lib as this file and
    # imports music from <this dir>/Music
    path_of_this_file = os.path.dirname(__file__)
    MUSIC_DIR = os.path.join(path_of_this_file, "Music")
    LIBRARY_FILE_NAME = os.path.join(path_of_this_file, "music.db")

    def print_items(items, description):
        print("Results when querying for "+description)
        for item in items:
            print("   Title: {} by '{}' ".format(item.title, item.artist))
            print("      genre: {}".format(item.genre))
            print("      length: {}".format(item.length))
            print("      path: {}".format(item.path))
        print("")

    demo = Beets(LIBRARY_FILE_NAME)

    # import music - this demo does not move, copy or tag the files
    demo.import_files([MUSIC_DIR, ])

    # sample queries:
    items = demo.query()
    print_items(items, "all items")

    items = demo.query(["artist:heart,", "title:Hold", ])
    print_items(items, 'artist="heart" or title contains "Hold"')

    items = demo.query(["genre:Hard Rock"])
    print_items(items, 'genre = Hard Rock') 

I hope this helps. It turns out it is easy to use Beets in other apps.

July 23, 2017 08:47 PM


Mike Driscoll

Python is #1 in 2017 According to IEEE Spectrum

It’s always fun to see what languages are considered to be in the top ten. This year, IEEE Spectrum named Python as the #1 language in the Web and Enterprise categories. Some of the Python community over at Reddit think that the scoring of the languages is flawed because JavaScript is below R in web programming. That gives me pause as well. Frankly, I don’t really see how anything ranks above JavaScript when it comes to web programming.

Regardless, it’s still interesting to read through the article.

Related Articles

July 23, 2017 06:53 PM


NumFOCUS

Meet our GSoC Students Part 3: Matplotlib, PyMC3, FEniCS, MDAnalysis, Data Retriever, & Gensim

July 23, 2017 05:00 PM


Trey Hunner

Craft Your Python Like Poetry

Line length is a big deal… programmers argue about it quite a bit. PEP 8, the Python style guide, recommends a 79 character maximum line length but concedes that a line length up to 100 characters is acceptable for teams that agree to use a specific longer line length.

So 79 characters is recommended… but isn’t line length completely obsolete? After all, programmers are no longer restricted by punch cards, teletypes, and 80 column terminals. The laptop screen I’m typing this on can fit about 200 characters per line.

Line length is not obsolete

Line length is not a technical limitation: it’s a human-imposed limitation. Many programmers prefer short lines because long lines are hard to read. This is true in typography and it’s true in programming as well.

Short lines are easier to read.

In the typography world, a line length of 55 characters per line is recommended for electronic text (see line length on Wikipedia). That doesn’t mean we should use a 55 character limit though; typography and programming are different.

Python isn’t prose

Python code isn’t structured like prose. English prose is structured in flowing sentences: each line wraps into the next line. In Python, statements are somewhat like sentences, meaning each statement begins at the start of its own line.

Python code is more like poetry than prose. Poets and Python programmers don’t wrap lines once they hit an arbitrary length; they wrap lines when they make sense for readability and beauty.

I stand amid the roar Of a surf-tormented shore, And I hold within my hand
Grains of the golden sand— How few! yet how they creep Through my fingers to
the deep, While I weep—while I weep! O God! can I not grasp Them with a
tighter clasp? O God! can I not save One from the pitiless wave? Is all that we
see or seem But a dream within a dream?

Don’t wrap lines arbitrarily. Craft each line with care to help readers experience your code exactly the way you intended.

I stand amid the roar
Of a surf-tormented shore,
And I hold within my hand
Grains of the golden sand—
How few! yet how they creep
Through my fingers to the deep,
While I weep—while I weep!
O God! can I not grasp
Them with a tighter clasp?
O God! can I not save
One from the pitiless wave?
Is all that we see or seem
But a dream within a dream?

Examples

It’s not possible to make a single rule for when and how to wrap lines of code. PEP 8 discusses line wrapping briefly, but it only discusses one case of line wrapping and provides three different acceptable styles, leaving the reader to choose which is best.

Line wrapping is best discussed through examples. Let’s look at a few examples of long lines and a few line-wrapping variations for each.

Example: Wrapping a Comprehension

This line of code is over 79 characters long:

employee_hours = [schedule.earliest_hour for employee in self.public_employees for schedule in employee.schedules]

Here we’ve wrapped that line of code so that it’s two shorter lines of code:

employee_hours = [schedule.earliest_hour for employee in
                  self.public_employees for schedule in employee.schedules]

We’re able to insert that line break in this line because we have an unclosed square bracket. This is called an implicit line continuation. Python knows we’re continuing a line of code whenever there’s a line break inside unclosed square brackets, curly braces, or parentheses.

This code still isn’t very easy to read because the line break was inserted arbitrarily. We simply wrapped this line just before a specific line length. We were thinking about line length here, but we completely neglected to think about readability.

This code is the same as above, but we’ve inserted line breaks in very particular places:

employee_hours = [schedule.earliest_hour
                  for employee in self.public_employees
                  for schedule in employee.schedules]

We have two line breaks here and we’ve purposely inserted them before our for clauses in this list comprehension.

Statements have logical components that make up a whole, the same way sentences have clauses that make up the whole. We’ve chosen to break up this list comprehension by inserting line breaks between these logical components.

Here’s another way to break up this statement:

employee_hours = [
    schedule.earliest_hour
    for employee in self.public_employees
    for schedule in employee.schedules
]

Which of these methods you prefer is up to you. It’s important to make sure you break up the logical components though. And whichever method you choose, be consistent!

Example: Function Calls

This is a Django model field with a whole bunch of arguments being passed to it:

default_appointment = models.ForeignKey(othermodel='AppointmentType',
                                        null=True, on_delete=models.SET_NULL,
                                        related_name='+')

We’re already using an implicit line continuation to wrap these lines of code, but again we’re wrapping this code at an arbitrary line length.

Here’s the same Django model field with one argument per line:

default_appointment = models.ForeignKey(othermodel='AppointmentType',
                                        null=True,
                                        on_delete=models.SET_NULL,
                                        related_name='+')

We’re breaking up the component parts (the arguments) of this statement onto separate lines.

We could also wrap this line by indenting each argument instead of aligning them:

default_appointment = models.ForeignKey(
    othermodel='AppointmentType',
    null=True,
    on_delete=models.SET_NULL,
    related_name='+'
)

Notice we’re also leaving that closing parenthesis on its own line. We could additionally add a trailing comma if we wanted:

default_appointment = models.ForeignKey(
    othermodel='AppointmentType',
    null=True,
    on_delete=models.SET_NULL,
    related_name='+',
)

Which of these is the best way to wrap this line?

Personally for this line I prefer that last approach: each argument on its own line, the closing parenthesis on its own line, and a comma after each argument.

It’s important to decide what you prefer, reflect on why you prefer it, and always maintain consistency within each project/file you create. And keep in mind that consistency of your personal style is less important than consistency within a single project.

Example: Chained Function Calls

Here’s a long line of chained Django queryset methods:

books = Book.objects.filter(author__in=favorite_authors).select_related('author', 'publisher').order_by('title')

Notice that there aren’t parentheses around this whole statement, so the only places we can currently wrap our lines are inside those parentheses. We could do something like this:

books = Book.objects.filter(
    author__in=favorite_authors
).select_related(
    'author', 'publisher'
).order_by('title')

But that looks kind of weird and it doesn’t really improve readability.

We could add backslashes at the end of each line to allow us to wrap at arbitrary places:

books = Book.objects\
    .filter(author__in=favorite_authors)\
    .select_related('author', 'publisher')\
    .order_by('title')

This works, but PEP8 recommends against this.

We could wrap the whole statement in parentheses, allowing us to use implicit line continuation wherever we’d like:

books = (Book.objects
    .filter(author__in=favorite_authors)
    .select_related('author', 'publisher')
    .order_by('title'))

It’s not uncommon to see extra parentheses added in Python code to allow implicit line continuations.

That indentation style is a little odd though. We could align our code with the opening parenthesis instead:

books = (Book.objects
         .filter(author__in=favorite_authors)
         .select_related('author', 'publisher')
         .order_by('title'))

Although I’d probably prefer to align the dots in this case:

books = (Book.objects
             .filter(author__in=favorite_authors)
             .select_related('author', 'publisher')
             .order_by('title'))

A fully indentation-based style works too (we’ve also moved objects to its own line here):

books = (
    Book
    .objects
    .filter(author__in=favorite_authors)
    .select_related('author', 'publisher')
    .order_by('title')
)

There are yet more ways to resolve this problem. For example, we could try to use intermediary variables to avoid line wrapping entirely.
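
A sketch of that intermediary-variable approach, reusing the hypothetical Book queryset from the examples above (the matching_books name is just an illustration), might look like this:

# Each step of the chain gets its own, comfortably short line.
matching_books = Book.objects.filter(author__in=favorite_authors)
matching_books = matching_books.select_related('author', 'publisher')
books = matching_books.order_by('title')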

Chained methods pose a different problem for line wrapping than single method calls and require a different solution. Focus on readability when picking a preferred solution and be consistent with the solution you pick. Consistency lies at the heart of readability.

Example: Dictionary Literals

I often define long dictionaries and lists in Python code.

Here’s a dictionary definition that has been split over multiple lines, with line breaks inserted as a maximum line length is approached:

MONTHS = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5,
          'June': 6, 'July': 7, 'August': 8, 'September': 9, 'October': 10,
          'November': 11, 'December': 12}

Here’s the same dictionary with each key-value pair on its own line, aligned with the first key-value pair:

MONTHS = {'January': 1,
          'February': 2,
          'March': 3,
          'April': 4,
          'May': 5,
          'June': 6,
          'July': 7,
          'August': 8,
          'September': 9,
          'October': 10,
          'November': 11,
          'December': 12}

And the same dictionary again, with each key-value pair indented instead of aligned (with a trailing comma on the last line as well):

MONTHS = {
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12,
}

This is the strategy I prefer for wrapping long dictionaries and lists. I very often wrap short dictionaries and lists this way as well, for the sake of readability.

Python is Poetry

The moment of peak readability is the moment just after you write a line of code. Your code will be far less readable to you one day, one week, and one month after you’ve written it.

When crafting Python code, use spaces and line breaks to split up the logical components of each statement. Don’t write a statement on a single line unless it’s already very clear. If you break each statement over multiple lines for clarity, line length shouldn’t be a major concern because your lines of code will mostly be far shorter than 79 characters already.

Make sure to craft your code carefully as you write it because your future self will have a much more difficult time cleaning it up than you will right now. So take that line of code you just wrote and carefully add line breaks to it.

July 23, 2017 05:00 PM


Kay Hayen

Nuitka Release 0.5.27

This is to inform you about the new stable release of Nuitka. It is the extremely compatible Python compiler. Please see the page "What is Nuitka?" for an overview.

This release comes with a lot of bug fixes and improvements.

Bug Fixes

  • Fix, need to add recursed modules immediately to the working set, or else they might first be processed in second pass, where global names that are locally assigned, are optimized to the built-in names although that should not happen. Fixed in 0.5.26.1 already.
  • Fix, the accelerated call of methods could crash for some special types. This had been a regression in 0.5.25, but only happens with custom extension types. Fixed in 0.5.26.1 already.
  • Python3.5: For async def functions parameter variables could fail to properly work with in-place assignments to them. Fixed in 0.5.26.4 already.
  • Compatibility: Decorators that overload type checks didn't pass the checks for compiled types. Now isinstance and, as a result, the inspect module work fine for them.
  • Compatibility: Fix, imports from __init__ were crashing the compiler. You are not supposed to do them, because they duplicate the package code, but they work.
  • Compatibility: Fix, the super built-in on module level was crashing the compiler.
  • Standalone: For Linux, BSD and MacOS extension modules and shared libraries using their own $ORIGIN to find loaded DLLs resulted in those not being included in the distribution.
  • Standalone: Added more missing implicit dependencies.
  • Standalone: Fix, implicit imports now also can be optional, as e.g. _tkinter if not installed. Only include those if available.
  • The --recompile-c-only option was only working with the C compiler as a backend, but not in the C++ compatibility fallback, where files get renamed. This prevented the edit-and-test debugging approach, at least with MSVC.
  • Plugins: The PyLint plug-in didn't consider the symbolic name import-error but only the code F0401.
  • Implicit exception raises in conditional expressions would crash the compiler.

New Features

  • Added support for Visual Studio 2017. Issue#368.
  • Added option --python2-for-scons to specify the Python2 executable to use for calling Scons. This should allow using Anaconda Python for that task.

Optimization

  • References to known unassigned variables are now statically optimized to exception raises and warned about if the according option is enabled.
  • Unhashable keys in dictionaries are now statically optimized to exception raises and warned about if the according option is enabled.
  • Enable forward propagation for classes too, resulting in some classes creating only static dictionaries. Currently this never happens for Python3, but it will, once we can statically optimize __prepare__ too.
  • Enable inlining of class dictionary creations if they are mere return statements of the created dictionary. Currently this never happens for Python3, see above for why.
  • Python2: Selecting the metaclass is now visible in the tree and can be statically optimized.
  • For executables, we now also use a freelist for traceback objects, which also makes exception cases slightly faster.
  • Generator expressions no longer require the use of a function call with a .0 argument value to carry the iterator value, instead their creation is directly inlined.
  • Remove "pass through" frames for Python2 list contractions, they are no longer needed. Minimal gain for generated code, but more lightweight at compile time.
  • When compiling Windows x64 with MinGW64 a link library needs to be created for linking against the Python DLL. This one is now cached and re-used if already done.
  • Use common code for NameError and UnboundLocalError exception code raises. In some cases it was creating the full string at compile time, in others at run time. Since the later is more efficient in terms of code size, we now use that everywhere, saving a bit of binary size.
  • Make sure to release unused functions from a module. This saves memory and can be decided after a full pass.
  • Avoid using OrderedDict in a couple of places, where they are not needed, but can be replaced with a later sorting, e.g. temporary variables by name, to achieve deterministic output. This saves memory at compile time.
  • Add specialized return nodes for the most frequent constant values, which are None, True, and False. Also a general one, for constant value return, which avoids the constant references. This saves quite a bit of memory and makes traversal of the tree a lot faster, due to not having any child nodes for the new forms of return statements.
  • Previously the empty dictionary constant reference was specialized to save memory. Now we also specialize empty set, list, and tuple constants to the same end. Also the hack to keep is from saying that {} is {} was made more general; mutable constant references are now known to never alias.
  • The source references can be marked internal, which means that they should never be visible to the user, but that was tracked as a flag to each of the many source references attached to each node in the tree. Making a special class for internal references avoids storing this in the object, but instead it's now a class property.
  • The nodes for named variable reference, assignment, and deletion got split into separate nodes, one to be used before the actual variable can be determined during tree building, and one for use later on. This makes their API clearer and saves a tiny bit of memory at compile time.
  • Also eliminated target variable references, which were pseudo children of assignments and deletion nodes for variable names, that didn't really do much, but consume processing time and memory.
  • Added optimization for calls to staticmethod and classmethod built-in methods along with type shapes.
  • Added optimization for open built-in on Python3, also adding the type shape file for the result.
  • Added optimization for bytearray built-in and constant values. These mutable constants can now be compile time computed as well.
  • Added optimization for frozenset built-in and constant values. These mutable constants can now be compile time computed as well.
  • Added optimization for divmod built-in.
  • Treat all built-in constant types, e.g. type itself as a constant. So far we did this only for constant values types, but of course this applies to all types, giving slightly more compact code for their uses.
  • Detect static raises if iterating over non-iterables and warn about them if the option is enabled.
  • Split the locals node into different types: one which needs the updated value, and one which just makes a copy. Properly track if a function needs an updated locals dict, and if it doesn't, don't use that. This gives more efficient code for Python2 classes, and for exec-using functions in Python2.
  • Build all constant values without use of the pickle module which has a lot more overhead than marshal, instead use that for too large long values, non-UTF8 unicode values, nan float, etc.
  • Detect the linker arch for all Linux platforms using objdump instead of only a handful of hard-coded ones.

Cleanups

  • The use of INCREASE_REFCOUNT got fully eliminated.
  • Use functions not vulnerable to buffer overflow. This is generally good and avoids warnings given on OpenBSD during linking.
  • Variable closure for classes is different from all functions, don't handle the difference in the base class, but for class nodes only.
  • Make sure mayBeNon doesn't return None, which normally means "unclear", but False instead, since it's always clear for those cases.
  • Comparison nodes were using the general comparison node as a base class, but now a proper base class was added instead, allowing for cleaner code.
  • Valgrind test runners got changed to using proper tool namespace for their code and share it.
  • Made construct case generation code common testing code for re-use in the speedcenter web site. The code also has minor beauty bugs which will then become fixable.
  • Use appdirs package to determine place to store the downloaded copy of depends.exe.
  • The code still mentioned C++ in a lot of places, in comments or identifiers, which might be confusing readers of the code.
  • Code objects now carry all information necessary for their creation, and no longer need to access their parent to determine flag values. That parent is subject to change in the future.
  • Our import sorting wrapper automatically detects imports that could be local and makes them so, removing a few existing ones and preventing further ones in the future.
  • Cleanups and annotations to become Python3 PyLint clean as well. This found e.g. that source code references only had __cmp__ and need rich comparison to be fully portable.

Tests

  • The test runner for construct tests got cleaned up and the constructs now avoid using xrange so as to not need conversion for Python3 execution as much.
  • The main test runner got cleaned up and uses common code making it more versatile and robust.
  • Do not run a test in the debugger if CPython also segfaulted executing the test; then it's not a Nuitka issue, so we can ignore it.
  • Improve the way the Python to test with is found in the main test runner: prefer the running interpreter, then PATH and the registry on Windows; this will find the interesting version more often.
  • Added support for "Landscape.io" to ignore the inline copies of code, they are not under our control.
  • The test runner for Valgrind got merged with the usage for constructs and uses common code now.
  • Construct generation is now common code, intended for sharing it with the Speedcenter web site generation.
  • Rebased Python 3.6 test suite to 3.6.1 as that is the Python generally used now.

Organizational

  • Added inline copy of appdirs package from PyPI.
  • Added credits for RedBaron and isort.
  • The --experimental flag is now creating a list of indications and more than one can be used that way.
  • The PyLint runner can also work with Python3 pylint.
  • The Nuitka Speedcenter got more fine tuning and produces more tags to more easily identify trends in results. This needs to become more visible though.
  • The MSI files are also built on AppVeyor, where their building will not depend on me booting Windows. Getting these artifacts as downloads will be the next step.

Summary

This release improves many areas. The variable closure taking is now fully transparent due to different node types, the memory usage dropped again, a few obvious missing static optimizations were added, and many built-ins were completed.

This release again improves the scalability of Nuitka, which again uses less memory than before, although not as big a jump as before.

This does not extend or use special C code generation for bool or any type yet, which still needs design decisions to proceed and will come in a later release.

July 23, 2017 03:42 PM


Patricio Paez

Concatenating strings with punctuation

Creating strings of the form “a, b, c and d” from a list [‘a’, ‘b’, ‘c’, ‘d’] is a task I faced some time ago, as I needed to include such strings in some HTML documents. The “,” and the “and” are included according to the number of elements: [‘a’, ‘b’] yields “a and b” and [‘a’] yields “a”, for example. In a recent review of the code, I changed the method from using string concatenation:

if len(items) > 1:
    text = items[0]
    for item in items[1:-1]:
        text += ', ' + item
    text += ' and ' + items[-1]
else:
    text = items[0]

to the use of slicing of the items list, addition of the resulting sublists and str.join to include the punctuation:

first = items[:1]
middle = items[1:-1]
last = items[1:][-1:]
first_middle = [', '.join(first + middle)]
text = ' and '.join(first_middle + last)

The old method requires an additional elif branch to work when items is an empty list; the new method returns an empty string if the items list is empty. I share this tip in case it is useful to someone else.
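
To make that behaviour easy to check, here is a minimal sketch (the join_with_and function name is mine, not from the original post) that wraps the new approach in a function and exercises the cases mentioned above:

def join_with_and(items):
    """Return text like 'a, b, c and d' from a list of strings."""
    first = items[:1]
    middle = items[1:-1]
    last = items[1:][-1:]
    first_middle = [', '.join(first + middle)]
    return ' and '.join(first_middle + last)

print(join_with_and(['a', 'b', 'c', 'd']))  # a, b, c and d
print(join_with_and(['a', 'b']))            # a and b
print(join_with_and(['a']))                 # a
print(join_with_and([]))                    # prints an empty line (empty string)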

July 23, 2017 11:45 AM


Full Stack Python

How to Add Hosted Monitoring to Flask Web Applications

How do you know whether your application is running properly with minimal errors after building and deploying it? The fastest and easiest way to monitor your operational Flask web application is to integrate one of the many available fantastic hosted monitoring tools.

In this post we will quickly add Rollbar monitoring to catch errors and visualize our application is running properly.

Our Tools

We can use either Python 2 or 3 to build this tutorial, but Python 3 is strongly recommended for all new applications. I used Python 3.6.2 to execute my code. We will also use the following application dependencies throughout the post: the Flask web framework, the Rollbar Python library, and Blinker, which Flask uses for signal support.

If you need help getting your development environment configured before running this code, take a look at this guide for setting up Python 3 and Flask on Ubuntu 16.04 LTS.

All code in this blog post is available open source under the MIT license on GitHub under the monitor-flask-apps directory of the blog-code-examples repository. Use and abuse the source code as you desire for your own applications.

Installing Dependencies

Change into the directory where you keep your Python virtualenvs. Create a new virtual environment for this project using the following command.

python3 -m venv monitorflask

Activate the virtualenv.

source monitorflask/bin/activate

The command prompt will change after activating the virtualenv:

Activating our Python virtual environment on the command line.

Remember that you need to activate the virtualenv in every new terminal window where you want to use the virtualenv to run the project.

Flask, Rollbar and Blinker can now be installed into the now-activated virtualenv.

pip install flask==0.12.2 rollbar==0.13.12 blinker==1.4

Our required dependencies should be installed within our virtualenv after a short installation period. Look for output like the following to confirm everything worked.

Installing collected packages: blinker, itsdangerous, click, MarkupSafe, Jinja2, Werkzeug, Flask, idna, urllib3, chardet, certifi, requests, six, rollbar
  Running setup.py install for blinker ... done
  Running setup.py install for itsdangerous ... done
  Running setup.py install for MarkupSafe ... done
  Running setup.py install for rollbar ... done
Successfully installed Flask-0.12.2 Jinja2-2.9.6 MarkupSafe-1.0 Werkzeug-0.12.2 blinker-1.4 certifi-2017.4.17 chardet-3.0.4 click-6.7 idna-2.5 itsdangerous-0.24 requests-2.18.1 rollbar-0.13.12 six-1.10.0 urllib3-1.21.1

Now that we have our Python dependencies installed into our virtualenv we can create the initial version of our application.

Building Our Flask App

Create a folder for your project named monitor-flask-apps. Change into the folder and then create a file named app.py with the following code.

import re
from flask import Flask, render_template, Response
from werkzeug.exceptions import NotFound


app = Flask(__name__)
MIN_PAGE_NAME_LENGTH = 2


@app.route("/<string:page>/")
def show_page(page):
    try:
        valid_length = len(page) >= MIN_PAGE_NAME_LENGTH
        valid_name = re.match('^[a-z]+$', page.lower()) is not None
        if valid_length and valid_name:
            return render_template("{}.html".format(page))
        else:
            msg = "Sorry, couldn't find page with name {}".format(page)
            raise NotFound(msg)
    except:
        return Response("404 Not Found")


if __name__ == "__main__":
    app.run(debug=True)

The above application code has some standard Flask imports so we can create a Flask web app and render template files. We have a single function named show_page to serve a single Flask route. show_page checks if the URL path contains only lowercase alpha characters for a potential page name. If the page name can be found in the templates folder then the page is rendered, otherwise an exception is raised saying the page could not be found. We need to create at least one template file if our function is ever going to return a non-error response.

Save app.py and make a new subdirectory named templates under your project directory. Create a new file named battlegrounds.html and put the following Jinja2 template markup into it.

<!DOCTYPE html>
<html>
  <head>
    <title>You found the Battlegrounds GIF!</title>
  </head>
  <body>
    <h1>PUBG so good.</h1>
    <img src="https://media.giphy.com/media/3ohzdLMlhId2rJuLUQ/giphy.gif">
  </body>
</html>

The above Jinja2 template is basic HTML without any embedded template tags. The template creates a very plain page with a header description of "PUBG so good" and a GIF from this excellent computer game.

Time to run and test our code. Change into the base directory of your project where app.py file is located. Execute app.py using the python command as follows (make sure your virtualenv is still activated in the terminal where you are running this command):

python app.py

The Flask development server should start up and display a few lines of output.

Run the Flask development server locally.

What happens when we access the application running on localhost port 5000?

Testing our Flask application at the base URL receives an HTTP 404 error.

HTTP status 404 page not found, which is what we expected because we only defined a single route and it did not live at the base path.

We created a template named battlegrounds.html that should be accessible when we go to localhost:5000/battlegrounds/.

Testing our Flask application at /battlegrounds/ gets the proper template with a GIF.

The application successfully found the battlegrounds.html template but that is the only one available. What if we try localhost:5000/fullstackpython/?

If no template is found we receive a 500 error.

HTTP 500 error. That's no good.

The 404 and 500 errors are obvious to us right now because we are testing the application locally. However, what happens when the app is deployed and a user gets the error in their own web browser? They will typically quit out of frustration and you will never know what happened unless you add some error tracking and application monitoring.

We will now modify our code to add Rollbar to catch and report those errors that occur for our users.

Handling Errors

Head to Rollbar's homepage so we can add their hosted monitoring tools to our oft-erroring Flask app.

Rollbar homepage in the web browser.

Click the "Sign Up" button in the upper right-hand corner. Enter your email address, a username and the password you want on the sign up page.

Enter your basic account information on the sign up page.

After the sign up page you will see the onboarding flow where you can enter a project name and select a programming language. For project name enter "Battlegrounds" and select that you are monitoring a Python app.

Create a new project named 'Battlegrounds' and select Python as the programming language.

Press the "Continue" button at the bottom to move along. The next screen shows us a few quick instructions to add monitoring to our Flask application.

Set up your project using your server-side access token.

Let's modify our Flask application to test whether we can properly connect to Rollbar's service. Change app.py to include the following highlighted lines.

~~import os
import re
~~import rollbar
from flask import Flask, render_template, Response
from werkzeug.exceptions import NotFound


app = Flask(__name__)
MIN_PAGE_NAME_LENGTH = 2


~~@app.before_first_request
~~def add_monitoring():
~~    rollbar.init(os.environ.get('ROLLBAR_SECRET'))
~~    rollbar.report_message('Rollbar is configured correctly')


@app.route("/<string:page>/")
def show_page(page):
    try:
        valid_length = len(page) >= MIN_PAGE_NAME_LENGTH
        valid_name = re.match('^[a-z]+$', page.lower()) is not None
        if valid_length and valid_name:
            return render_template("{}.html".format(page))
        else:
            msg = "Sorry, couldn't find page with name {}".format(page)
            raise NotFound(msg)
    except:
        return Response("404 Not Found")


if __name__ == "__main__":
    app.run(debug=True)

We added a couple of new imports, os and rollbar. os allows us to grab environment variable values, such as our Rollbar secret key. rollbar is the library we installed earlier. The two lines below the Flask app instantiation are to initialize Rollbar using the Rollbar secret token and send a message to the service that it started correctly.

The ROLLBAR_SECRET token needs to be set in an environment variable. Save and quit app.py. Run export ROLLBAR_SECRET='token here' on the command line where your virtualenv is activated. This token can be found on the Rollbar onboarding screen.

I typically store all my environment variables in a file like template.env and invoke it from the terminal using the . ./template.env command. Make sure to avoid committing your secret tokens to a source control repository, especially if the repository is public!

After exporting your ROLLBAR_SECRET key as an environment variable we can test that Rollbar is working as we run our application. Run it now using python:

python app.py

Back in your web browser press the "Done! Go to Dashboard" button. Don't worry about the "Report an Error" section code, we can get back to that in a moment.

If the event hasn't been reported yet we'll see a waiting screen like this one:

Waiting for data on the dashboard.

Once Flask starts up though, the first event will be populated on the dashboard.

First event populated on our dashboard for this project.

Okay, our first test event has been populated, but we really want to see all the errors from our application, not a test event.

Testing Error Handling

How do we make sure real errors are reported rather than just a simple test event? We just need to add a few more lines of code to our app.

import os
import re
import rollbar
~~import rollbar.contrib.flask
from flask import Flask, render_template, Response
~~from flask import got_request_exception
from werkzeug.exceptions import NotFound


app = Flask(__name__)
MIN_PAGE_NAME_LENGTH = 2


@app.before_first_request
def add_monitoring():
    rollbar.init(os.environ.get('ROLLBAR_SECRET'))
~~    ## delete the next line if you don't want this event anymore
    rollbar.report_message('Rollbar is configured correctly')
~~    got_request_exception.connect(rollbar.contrib.flask.report_exception, app)


@app.route("/<string:page>/")
def show_page(page):
    try:
        valid_length = len(page) >= MIN_PAGE_NAME_LENGTH
        valid_name = re.match('^[a-z]+$', page.lower()) is not None
        if valid_length and valid_name:
            return render_template("{}.html".format(page))
        else:
            msg = "Sorry, couldn't find page with name {}".format(page)
            raise NotFound(msg)
    except:
~~        rollbar.report_exc_info()
        return Response("404 Not Found")


if __name__ == "__main__":
    app.run(debug=True)

The above highlighted code modifies the application so it reports all Flask errors as well as our HTTP 404 not found issues that happen within the show_page function.

Make sure your Flask development server is running and try to go to localhost:5000/b/. You will receive an HTTP 404 exception and it will be reported to Rollbar. Next go to localhost:5000/fullstackpython/ and an HTTP 500 error will occur.

You should see an aggregation of errors as you test out these errors:

Rollbar dashboard showing aggregations of errors.

Woohoo, we finally have our Flask app reporting all errors that occur for any user back to the hosted Rollbar monitoring service!

What's Next?

We just learned how to catch and handle errors with Rollbar as a hosted monitoring platform in a simple Flask application. Next you will want to add monitoring to your more complicated web apps. You can also explore Rollbar's more advanced features in their documentation.

There is a lot more to learn about web development and deployments so keep learning by reading up on Flask and other web frameworks such as Django, Pyramid and Sanic. You can also learn more about integrating Rollbar with Python applications via their Python documentation.

Questions? Let me know via a GitHub issue ticket on the Full Stack Python repository, on Twitter @fullstackpython or @mattmakai.

See something wrong in this blog post? Fork this page's source on GitHub and submit a pull request with a fix.

July 23, 2017 04:00 AM

July 22, 2017


Weekly Python StackOverflow Report

(lxxxiii) stackoverflow python report

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2017-07-22 19:52:51 GMT


  1. What's the efficiency difference between using "+" and "," in print()? - [14/7]
  2. Unexpected result from `in` operator - Python - [13/3]
  3. Why is deque implemented as a linked list instead of a circular array? - [13/2]
  4. What is co_names? - [10/1]
  5. list comprehension in exec with empty locals: NameError - [9/4]
  6. Add values of keys and sort it by occurrence of the keys in a list of dictionaries in Python - [8/5]
  7. Pandas - return n smallest indexes by column - [8/3]
  8. Differences between generator comprehension expressions - [8/1]
  9. Sort 2 lists in Python based on the ratio of individual corresponding elements or based on a third list - [7/4]
  10. Dynamically generating elements of list within list - [7/3]

July 22, 2017 07:55 PM


Catalin George Festila

About py-translate python module.

This Python module is used for translating text in the terminal.
You can read about and see examples of this API on its web page.


Installation 
C:\>cd Python27

C:\Python27>cd Scripts

C:\Python27\Scripts>pip install py-translate
Collecting py-translate
Downloading py_translate-1.0.3-py2.py3-none-any.whl (61kB)
100% |################################| 61kB 376kB/s
Installing collected packages: py-translate
Successfully installed py-translate-1.0.3

C:\Python27\Scripts>
Let's test it with a simple example:
>>> import translate
>>> dir(translate)
['TestLanguages', 'TestTranslator', '__author__', '__build__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__path__', '__title__', '__version__', 'accumulator', 'coroutine', 'coroutines', 'languages', 'print_table', 'push_url', 'set_task', 'source', 'spool', 'tests', 'translation_table', 'translator', 'write_stream']
>>> from translate import translator
>>> translator('ro', 'en', 'Consider ca dezvoltarea personala este un pas important')
[[[u'I think personal development is an important step', u'Consider ca dezvoltarea personala este un pas important', None, None, 0]], None, u'ro']
>>>

July 22, 2017 11:46 AM

Make one executable from a python script.

The official website of this tool tells us:
PyInstaller bundles a Python application and all its dependencies into a single package. The user can run the packaged app without installing a Python interpreter or any modules. PyInstaller supports Python 2.7 and Python 3.3+, and correctly bundles the major Python packages such as numpy, PyQt, Django, wxPython, and others.

PyInstaller is tested against Windows, Mac OS X, and Linux. However, it is not a cross-compiler: to make a Windows app you run PyInstaller in Windows; to make a Linux app you run it in Linux, etc. PyInstaller has been used successfully with AIX, Solaris, and FreeBSD, but is not tested against them.

The manual for this tool can be found here.

C:\Python27>cd Scripts

C:\Python27\Scripts>pip install pyinstaller
Collecting pyinstaller
Downloading PyInstaller-3.2.1.tar.bz2 (2.4MB)
100% |################################| 2.4MB 453kB/s
....
Collecting pypiwin32 (from pyinstaller)
Downloading pypiwin32-219-cp27-none-win32.whl (6.7MB)
100% |################################| 6.7MB 175kB/s
...
Successfully installed pyinstaller-3.2.1 pypiwin32-219
This also installs the PyWin32 (pypiwin32) Python module.
Let's write a test Python script and then turn it into an executable.
I used this python script to test it:
from tkinter import Tk, Label, Button

class MyFirstGUI:
    def __init__(self, master):
        self.master = master
        master.title("A simple GUI")

        self.label = Label(master, text="This is our first GUI!")
        self.label.pack()

        self.greet_button = Button(master, text="Greet", command=self.greet)
        self.greet_button.pack()

        self.close_button = Button(master, text="Close", command=master.quit)
        self.close_button.pack()

    def greet(self):
        print("Greetings!")

root = Tk()
my_gui = MyFirstGUI(root)
root.mainloop()
The output of the pyinstaller command:
C:\Python27\Scripts>pyinstaller.exe   --onefile --windowed ..\tk_app.py
92 INFO: PyInstaller: 3.2.1
92 INFO: Python: 2.7.13
93 INFO: Platform: Windows-10-10.0.14393
93 INFO: wrote C:\Python27\Scripts\tk_app.spec
95 INFO: UPX is not available.
96 INFO: Extending PYTHONPATH with paths
['C:\\Python27', 'C:\\Python27\\Scripts']
96 INFO: checking Analysis
135 INFO: checking PYZ
151 INFO: checking PKG
151 INFO: Building because toc changed
151 INFO: Building PKG (CArchive) out00-PKG.pkg
213 INFO: Redirecting Microsoft.VC90.CRT version (9, 0, 21022, 8) -> (9, 0, 30729, 9247)
2120 INFO: Building PKG (CArchive) out00-PKG.pkg completed successfully.
2251 INFO: Bootloader c:\python27\lib\site-packages\PyInstaller\bootloader\Windows-32bit\runw.exe
2251 INFO: checking EXE
2251 INFO: Rebuilding out00-EXE.toc because tk_app.exe missing
2251 INFO: Building EXE from out00-EXE.toc
2267 INFO: Appending archive to EXE C:\Python27\Scripts\dist\tk_app.exe
2267 INFO: Building EXE from out00-EXE.toc completed successfully.
Then I ran the resulting executable:
C:\Python27\Scripts>C:\Python27\Scripts\dist\tk_app.exe

C:\Python27\Scripts>
...and it works well.

The output file comes with this icon:

You can also customize the executable by using your own icon or by setting the file's properties according to the VS_FIXEDFILEINFO structure.
For that you need an icon file and/or a version.txt file that follows the VS_FIXEDFILEINFO structure.
Let's see the version.txt file:
# UTF-8
#
# For more details about fixed file info 'ffi' see:
# http://msdn.microsoft.com/en-us/library/ms646997.aspx
VSVersionInfo(
  ffi=FixedFileInfo(
    # filevers and prodvers should be always a tuple with four items: (1, 2, 3, 4)
    # Set not needed items to zero 0.
    filevers=(2017, 1, 1, 1),
    prodvers=(1, 1, 1, 1),
    # Contains a bitmask that specifies the valid bits 'flags'
    mask=0x3f,
    # Contains a bitmask that specifies the Boolean attributes of the file.
    flags=0x0,
    # The operating system for which this file was designed.
    # 0x4 - NT and there is no need to change it.
    OS=0x4,
    # The general type of file.
    # 0x1 - the file is an application.
    fileType=0x1,
    # The function of the file.
    # 0x0 - the function is not defined for this fileType
    subtype=0x0,
    # Creation date and time stamp.
    date=(0, 0)
    ),
  kids=[
    StringFileInfo(
      [
      StringTable(
        u'040904b0',
        [StringStruct(u'CompanyName', u'python-catalin'),
         StringStruct(u'ProductName', u'test'),
         StringStruct(u'ProductVersion', u'1, 1, 1, 1'),
         StringStruct(u'InternalName', u'tk_app'),
         StringStruct(u'OriginalFilename', u'tk_app.exe'),
         StringStruct(u'FileVersion', u'2017, 1, 1, 1'),
         StringStruct(u'FileDescription', u'test tk'),
         StringStruct(u'LegalCopyright', u'Copyright 2017 free-tutorials.org.'),
         StringStruct(u'LegalTrademarks', u'tk_app is a registered trademark of catafest.'),])
      ]),
    VarFileInfo([VarStruct(u'Translation', [0x409, 1200])])
  ]
)
Now you can use this command for the tk_app.py and version.txt files from the C:\Python27 folder:
 pyinstaller.exe --onefile --windowed --version-file=..\version.txt ..\tk_app.py
Let's see this info in the executable file:

If you want to change the icon, add --icon=tk_app.ico, where tk_app.ico is the new icon for the executable.
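
For example, assuming tk_app.ico also sits in C:\Python27 next to version.txt (the file names here are just the ones used above), you can combine both options in a single build run from C:\Python27\Scripts:

pyinstaller.exe --onefile --windowed --icon=..\tk_app.ico --version-file=..\version.txt ..\tk_app.py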



July 22, 2017 11:12 AM

Python Qt4 - part 001.

Today I started with PyQt4 and this Python version:

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
To install PyQt4 I used this link to download the executable named PyQt4-4.11.4-gpl-Py2.7-Qt4.8.7-x32.exe.
The name of this executable tells us it can be used with Python 2.7.x and comes with Qt 4.8.7 for our 32-bit Python.
I started with a default Example class to make a calculator interface with PyQt4.
This is my example:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
from PyQt4 import QtGui

"""
Qt.Gui calculator example
"""

class Example(QtGui.QWidget):

    def __init__(self):
        super(Example, self).__init__()

        self.initUI()

    def initUI(self):
        title = QtGui.QLabel('Title')
        titleEdit = QtGui.QLineEdit()
        grid = QtGui.QGridLayout()
        grid.setSpacing(10)

        grid.addWidget(title, 0, 0)

        grid.addWidget(titleEdit, 0, 1, 1, 4)

        self.setLayout(grid)

        names = ['Cls', 'Bck', 'OFF',
                 '/', '.', '7', '8',
                 '9', '*', 'SQR', '3',
                 '4', '5', '-', '=',
                 '0', '1', '2', '+']

        positions = [(i, j) for i in range(1, 5) for j in range(0, 5)]

        for position, name in zip(positions, names):

            if name == '':
                continue
            button = QtGui.QPushButton(name)
            grid.addWidget(button, *position)

        self.move(300, 250)
        self.setWindowTitle('Calculator')
        self.show()

def main():
    app = QtGui.QApplication(sys.argv)
    ex = Example()
    sys.exit(app.exec_())

if __name__ == '__main__':
    main()
The example is simple.
First you need a QGridLayout - this creates a grid (a matrix of cells).
I used a label, a line edit and buttons, all from QtGui: QLabel, QLineEdit and QPushButton.
The first widgets placed into this grid - named grid - are the Title label and the edit area named titleEdit.
These two are added to the grid with addWidget.
The next step is to put all the button names into one array.
This array is added to the grid matrix with a for loop.
To map the array onto the grid positions I used the zip function.
The zip function pairs up elements from each of the iterables.
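
For instance, a quick check in the Python 2.7 interpreter shows how zip pairs each grid position with a button name (the values below are just a small slice of the two lists above):

>>> positions = [(1, 0), (1, 1), (1, 2)]
>>> names = ['Cls', 'Bck', 'OFF']
>>> zip(positions, names)
[((1, 0), 'Cls'), ((1, 1), 'Bck'), ((1, 2), 'OFF')]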
Also I set the title to Calculator with setWindowTitle.
I have not implemented the event handling or the calculation logic.
The main function will start the interface by using the QApplication.
The goal of this tutorial was to build the graphical interface with PyQt4.
This is the result of my example:

July 22, 2017 11:11 AM

The pyquery python module.

This tutorial is about the pyquery Python module, using Python 2.7.13.
First I used the pip command to install it.

C:\Python27>cd Scripts

C:\Python27\Scripts>pip install pyquery
Collecting pyquery
Downloading pyquery-1.2.17-py2.py3-none-any.whl
Requirement already satisfied: lxml>=2.1 in c:\python27\lib\site-packages (from pyquery)
Requirement already satisfied: cssselect>0.7.9 in c:\python27\lib\site-packages (from pyquery)
Installing collected packages: pyquery
Successfully installed pyquery-1.2.17
I tried to install it with pip under Python 3.4, but I got errors.
The development team tells us about this python module:
pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jquery. pyquery uses lxml for fast xml and html manipulation.
Let's try a simple example with this python module.
The goal of this example is to find links by HTML tag.
from pyquery import PyQuery

seeds = [
    'https://twitter.com',
    'http://google.com'
]

crawl_frontiers = []

def start_crawler():
    crawl_frontiers = crawler_seeds()

    print(crawl_frontiers)

def crawler_seeds():
    frontiers = []
    for index, seed in enumerate(seeds):
        frontier = {index: read_links(seed)}
        frontiers.append(frontier)

    return frontiers

def read_links(seed):
    crawler = PyQuery(seed)
    return [crawler(tag_a).attr("href") for tag_a in crawler("a")]

start_crawler()
The read_links function takes the links from each URL in the seeds array.
To do that, it reads the links and puts them into another array, crawl_frontiers.
The frontiers array is used just for the crawling process.
This simple example also helps you better understand how the arrays are used.
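
If you want to see just the selection part in isolation, here is a minimal sketch that runs PyQuery on an inline HTML string instead of a live URL (the markup is made up for the example):

from pyquery import PyQuery

doc = PyQuery('<div><a href="http://example.com">one</a><a href="/about">two</a></div>')
print([doc(tag_a).attr("href") for tag_a in doc("a")])
# prints: ['http://example.com', '/about']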
You can read more about this python module here.

July 22, 2017 11:11 AM


Full Stack Python

How to Make Phone Calls in Python

Good old-fashioned phone calls remain one of the best forms of communication despite the slew of new smartphone apps that have popped up over the past several years. With just a few lines of Python code plus a web application programming interface we can make and receive phone calls from any application.

Our example calls will say a snippet of text and put all incoming callers into a recorded conference call. You can modify the instructions using Twilio's TwiML verbs to perform different actions in your own application's phone calls.

Our Tools

You should have either Python 2 or 3 installed to build this application. Throughout the post we will also use the Twilio Python helper library and a free Twilio account.

You can snag all the open source code for this tutorial in the python-twilio-example-apps GitHub repository under the no-framework/phone-calls directory. Use and copy the code for your own applications. Everything in that repository and in this blog post is open source under the MIT license.

Install App Dependencies

Our application will use the Twilio Python helper library to create an HTTP POST request to Twilio's API. The Twilio helper library is installable from PyPI into a virtual environment. Open your terminal and use the virtualenv command to create a new virtualenv:

virtualenv phoneapp

Invoke the activate script within the virtualenv bin/ directory to make this virtualenv the active Python executable. Note that you will need to perform this step in every terminal window that you want the virtualenv to be active.

source phoneapp/bin/activate

The command prompt will change after activating the virtualenv to something like (phoneapp) $.

Next use the pip command to install the Twilio Python package into the virtualenv.

pip install twilio==5.7.0

We will have the required dependency ready for our project as soon as the installation script finishes. Now we can write and execute Python code to dial phone numbers.

Our Python Script

Create a new file named phone_calls.py and copy or type in the following lines of code.

from twilio.rest import TwilioRestClient


# Twilio phone number goes here. Grab one at https://twilio.com/try-twilio
# and use the E.164 format, for example: "+12025551234"
TWILIO_PHONE_NUMBER = ""

# list of one or more phone numbers to dial, in "+19732644210" format
DIAL_NUMBERS = ["",]

# URL location of TwiML instructions for how to handle the phone call
TWIML_INSTRUCTIONS_URL = \
  "http://static.fullstackpython.com/phone-calls-python.xml"

# replace the placeholder values with your Account SID and Auth Token
# found on the Twilio Console: https://www.twilio.com/console
client = TwilioRestClient("ACxxxxxxxxxx", "yyyyyyyyyy")


def dial_numbers(numbers_list):
    """Dials one or more phone numbers from a Twilio phone number."""
    for number in numbers_list:
        print("Dialing " + number)
        # set the method to "GET" from default POST because Amazon S3 only
        # serves GET requests on files. Typically POST would be used for apps
        client.calls.create(to=number, from_=TWILIO_PHONE_NUMBER,
                            url=TWIML_INSTRUCTIONS_URL, method="GET")


if __name__ == "__main__":
    dial_numbers(DIAL_NUMBERS)

There are a few lines that you need to modify in this application before it will run. First, insert one or more phone numbers you wish to dial into the DIAL_NUMBERS list. Each one should be a string, separated by a comma. For example, DIAL_NUMBERS = ["+12025551234", "+14155559876", "+19735551234"].

Next, TWILIO_PHONE_NUMBER and the Account SID and Authentication Token, found on the client = TwilioRestClient("ACxxxxxxxxxx", "yyyyyyyyyy") line, need to be set. We can get these values from the Twilio Console.

In your web browser go to the Twilio website and sign up for a free account or sign into your existing Twilio account.

Twilio sign up screen.

Copy the Account SID and Auth Token from the Twilio Console and paste them into your application's code:

Obtain the Account SID and Auth Token from the Twilio Console.

The Twilio trial account allows you to dial and receive phone calls to your own validated phone number. To handle calls from any phone number, you need to upgrade your account (hit the upgrade button on the top navigation bar).

Once you are signed into your Twilio account, go to the manage phone numbers screen. On this screen you can buy one or more phone numbers or click on an existing phone number in your account to configure it.

Manage phone numbers screen.

After clicking on a number you will reach the phone number configuration screen. Paste in the URL with TwiML instructions and change the dropdown from "HTTP POST" to "HTTP GET". In this post we'll use http://static.fullstackpython.com/phone-calls-python.xml, but that URL can be more than just a static XML file.

Twilio phone number configuration screen.

The power of Twilio really comes in when that URL is handled by your web application so it can dynamically respond with TwiML instructions based on the incoming caller number or other properties stored in your database.
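
As a rough sketch of that idea (this is not part of the tutorial's code; the route, message text and conference name are made up), a tiny Flask view could build an equivalent TwiML response dynamically instead of serving a static XML file:

from flask import Flask, Response

app = Flask(__name__)


@app.route("/twiml", methods=["GET"])
def twiml():
    # Twilio requests this URL when a call comes in; building the XML here
    # means the response can vary per request instead of being a static file
    xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<Response>'
           '<Say voice="alice">Thanks for calling our application!</Say>'
           '<Dial><Conference>python-phone-calls</Conference></Dial>'
           '</Response>')
    return Response(xml, mimetype="text/xml")


if __name__ == "__main__":
    app.run(debug=True)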

Under the Voice webhook, paste in http://static.fullstackpython.com/phone-calls-python.xml and change the drop-down to the right from "HTTP POST" to "HTTP GET". Click the "Save" button at the bottom of the screen.

Now try calling your phone number. You should hear the snippet of text read by the Alice voice and then you will be placed into a conference call. If no one else calls the number then hold music should be playing.

Making Phone Calls

We just handled inbound phone calls to our phone number. Now it's time to dial outbound phone calls. Make sure your phone_calls.py file is saved and that your virtualenv is still activated and then execute the script:

python phone_calls.py

In a moment all the phone numbers you write in the DIAL_NUMBERS list should light up with calls. Anyone that answers will hear our message read by the "Alice" voice and then they'll be placed together into a recorded conference call, just like when someone dials into the number.

Here is my inbound phone call:

Receiving an incoming phone call on the iPhone.

Not bad for just a few lines of Python code!

Next Steps

Now that we know how to make and receive phone calls from a Twilio number that follows programmatic instructions, we can do a whole lot more in our applications. Next you can follow more of Twilio's tutorials to do even more with your phone number.

Questions? Contact me via Twitter @fullstackpython or @mattmakai. I'm also on GitHub as mattmakai.

See something wrong in this post? Fork this page's source on GitHub and submit a pull request.

July 22, 2017 04:00 AM

July 21, 2017


Sandipan Dey

SIR Epidemic model for influenza A (H1N1): Modeling the outbreak of the pandemic in Kolkata, West Bengal, India in 2010 (Simulation in Python & R)

This appeared as a project in the edX course DelftX: MathMod1x Mathematical Modelling Basics and the project report can be found here. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Summary In this report, the spread of the pandemic influenza A (H1N1) that had an outbreak in Kolkata, West Bengal, India, 2010 is going to be simulated. … Continue reading SIR Epidemic model for influenza A (H1N1): Modeling the outbreak of the pandemic in Kolkata, West Bengal, India in 2010 (Simulation in Python & R)

July 21, 2017 05:47 PM


Continuum Analytics News

Galvanize Capstone Series: Geolocation of Twitter Users

Monday, July 24, 2017
Shawn Terryah
Guest Blogger

This post is part of our Galvanize Capstone featured projects. This post was written by Shawn Terryah and posted here with his permission. 

In June of this year, I completed the Data Science Immersive program at Galvanize in Austin, TX. The final few weeks of the program were dedicated to individual capstone projects of our choosing. I have a background in infectious disease epidemiology and, when I was in graduate school, there was a lot of interest in using things like Google search queries, Facebook posts, and tweets to try to track the spread of infectious diseases in real-time. One of the limitations to using Twitter is that only about 1% of tweets are geotagged with the tweet's location, which can make much of this work very difficult. For my capstone project, I chose to train a model using the 1% of tweets that are geotagged to predict the US city-level location of Twitter users who do not geotag their tweets. This is how I did it:

Streaming Training Tweets Using Tweepy

Tweepy is a Python wrapper for the Twitter API that allowed me to easily collect tweets in real-time and store them in MongoDB. The script below was run on an Amazon Web Services EC2 instance with 200 GiB of storage for roughly two weeks using tmux. By filtering based on location, I only received geotagged tweets with a known location to use for training the model.

import tweepy 
import json 
from pymongo import MongoClient 

class StreamListener(tweepy.StreamListener): 
    """tweepy.StreamListener is a class provided by tweepy used to access 
    the Twitter Streaming API to collect tweets in real-time. 
    """ 
    def on_connect(self): 
        """Called when the connection is made""" 
        print("You're connected to the streaming server.") 

    def on_error(self, status_code): 
        """This is called when an error occurs""" 
        print('Error: ' + repr(status_code)) 
        return False 

    def on_data(self, data): 
        """This will be called each time we receive stream data""" 
        client = MongoClient() 

        # I stored the tweet data in a database called 'training_tweets' in MongoDB, if 
        # 'training_tweets' does not exist it will be created for you. 
        db = client.training_tweets 

        # Decode JSON 
        datajson = json.loads(data) 

        # I'm only storing tweets in English. I stored the data for these tweets in a collection 
        # called 'training_tweets_collection' of the 'training_tweets' database. If 
        # 'training_tweets_collection' does not exist it will be created for you. 
        if "lang" in datajson and datajson["lang"] == "en": 
            db.training_tweets_collection.insert_one(datajson) 

if __name__ == "__main__": 
    # These are provided to you through the Twitter API after you create an account 
    consumer_key = "your_consumer_key" 
    consumer_secret = "your_consumer_secret" 
    access_token = "your_access_token" 
    access_token_secret = "your_access_token_secret" 

    auth1 = tweepy.OAuthHandler(consumer_key, consumer_secret) 
    auth1.set_access_token(access_token, access_token_secret) 

    # LOCATIONS are the longitude, latitude coordinate corners for a box that restricts the 
    # geographic area from which you will stream tweets. The first two define the southwest 
    # corner of the box and the second two define the northeast corner of the box. 
    LOCATIONS = [-124.7771694, 24.520833, -66.947028, 49.384472,     # Contiguous US 
                 -164.639405, 58.806859, -144.152365, 71.76871,      # Alaska 
                 -160.161542, 18.776344, -154.641396, 22.878623]     # Hawaii 

    stream_listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
    stream = tweepy.Stream(auth=auth1, listener=stream_listener) 
    stream.filter(locations=LOCATIONS)

Feature Selection, Feature Engineering, and Data Cleaning

Feature Selection

At the end of two weeks, I had collected data from over 21 million tweets from over 15,500 cities. In addition to the tweet itself, the API provides a number of other fields. These are the fields I used to build the model:

FIELD | TYPE | DESCRIPTION
'text' | String | The actual UTF-8 text of the tweet
'country_code' | String | Country code representing the country that tweet was sent from
'full_name' | String | Full representation of the place the tweet was sent from. For the US, often in the form of 'City, State,' but not always.
'coordinates' | Array of Array of Array of Float | A series of longitude and latitude points that define a bounding box from where the tweet was sent
'screen_name' | String | The screen name chosen by the user
'favourites_count' | Int | The number of tweets this user has liked in the account's lifetime
'followers_count' | Int | The number of followers the user currently has
'statuses_count' | Int | The number of tweets (including retweets) issued by the user
'friends_count' | Int | The number of users the user is following (AKA their "followings")
'listed_count' | Int | The number of public lists the user is a member of
'location' | String | The user-defined location for the account's profile, which is not necessarily a geographic location (e.g., 'the library,' 'watching a movie,' 'in my own head,' 'The University of Texas') (Nullable)
'created_at' | String | The UTC datetime of when the tweet was issued
'utc_offset' | Int | The offset from GMT/UTC in seconds based on the Time Zone that the user selects for their profile (Nullable)

To pull these fields I first exported the data from MongoDB as a json file:

$ mongoexport --db training_tweets --collection training_tweets_collection --out training_tweets.json

I then converted training_tweets.json to a csv file and pulled only the fields from the table above:

import json 
import unicodecsv as csv     # unicodecsv ensures that emojis are preserved 

def tweets_json_to_csv(file_list, csv_output_file): 
    ''' 
    INPUT: list of JSON files 
    OUTPUT: single CSV file 

    This function takes a list of JSON files containing tweet data and reads 
    each file line by line, parsing the relevant fields, and writing it to a CSV file. 
    ''' 

    count = 0 
    f = csv.writer(open(csv_output_file, "wb+")) 

    # Column names 
    f.writerow(['tweet', # relabeled: the API calls this 'text' 
                'country_code', 
                'geo_location', # relabeled: the API calls this 'full_name' 
                'bounding_box', 
                'screen_name', 
                'favourites_count', 
                'followers_count', 
                'statuses_count', 
                'friends_count', 
                'listed_count', 
                'user_described_location', # relabeled: the API calls this 'location' 
                'created_at', 
                'utc_offset']) 

    for file_ in file_list: 
        with open(file_, "r") as r: 
            for line in r: 
                try: 
                    tweet = json.loads(line) 
                except: 
                    continue 
                if tweet and tweet['place'] != None: 
                    f.writerow([tweet['text'], 
                                tweet['place']['country_code'], 
                                tweet['place']['full_name'], 
                                tweet['place']['bounding_box']['coordinates'], 
                                tweet['user']['screen_name'], 
                                tweet['user']['favourites_count'], 
                                tweet['user']['followers_count'], 
                                tweet['user']['statuses_count'], 
                                tweet['user']['friends_count'], 
                                tweet['user']['listed_count'], 
                                tweet['user']['location'], 
                                tweet['created_at'], 
                                tweet['user']['utc_offset']]) 
                    count += 1 

                    # Status update 
                    if count % 100000 == 0: 
                        print 'Just stored tweet #{}'.format(count) 

if __name__ == "__main__": 

    tweets_json_to_csv(['training_tweets.json'], 'training_tweets.csv')

From this point forward, I was able to read and manipulate the csv file as a pandas DataFrame:

import pandas as pd 

df = pd.read_csv('training_tweets.csv', encoding='utf-8') # 'utf-8' ensures that emojis are preserved

Feature Engineering

'centroid'

Instead of providing the exact latitude and longitude of the tweet, the Twitter API provides a polygonal bounding box of coordinates that encloses the place where the tweet was sent. To plot the tweets on a map and perform other functions, I found the centroid of each bounding box:

def find_centroid(row): 
    ''' 
    Helper function to return the centroid of a polygonal bounding box of longitude, latitude coordinates 
    ''' 

    try: 
        row_ = eval(row) 
        lst_of_coords = [item for sublist in row_ for item in sublist] 
        longitude = [p[0] for p in lst_of_coords] 
        latitude = [p[1] for p in lst_of_coords] 
        return (sum(latitude) / float(len(latitude)), sum(longitude) / float(len(longitude))) 
    except: 
        return None 

# Create a new column called 'centroid' 
df['centroid'] = map(lambda row: find_centroid(row), df['bounding_box'])

Using the centroids, I was able to plot the training tweets on a map using the Matplotlib Basemap Toolkit. Below is the code for generating a plot of the tweets that originated in or around the contiguous US. The same was also done for Alaska and Hawaii.

from mpl_toolkits.basemap import Basemap 
import matplotlib.pyplot as plt 

def plot_contiguous_US_tweets(lon, lat, file_path): 
    ''' 
    INPUT: List of longitudes (lon), list of latitudes (lat), file path to save the plot (file_path) 
    OUTPUT: Plot of tweets in the contiguous US. 
    ''' 

    map = Basemap(projection='merc', 
                  resolution = 'h',  
                  area_thresh = 10000, 
                  llcrnrlon=-140.25, # lower left corner longitude of contiguous US 
                  llcrnrlat=5.0, # lower left corner latitude of contiguous US 
                  urcrnrlon=-56.25, # upper right corner longitude of contiguous US 
                  urcrnrlat=54.75) # upper right corner latitude of contiguous US 

    x,y = map(lon, lat) 

    map.plot(x, y, 'bo', markersize=2, alpha=.3) 
    map.drawcoastlines() 
    map.drawstates() 
    map.drawcountries() 
    map.fillcontinents(color = '#DAF7A6', lake_color='#a7cdf2') 
    map.drawmapboundary(fill_color='#a7cdf2') 
    plt.gcf().set_size_inches(15,15) 
    plt.savefig(file_path, format='png', dpi=1000)

The resulting plots for the contiguous US, Alaska, and Hawaii were joined in Photoshop and are shown on the left. The plot on the right is from the Vulcan Project at Purdue University and shows carbon footprints in the contiguous US. As you can see, the plots are very similar, providing an indication that streaming tweets in this way provides a representative sample of the US population in terms of geographic location.

'tweet_time_secs'

The field 'created_at' is the UTC datetime of when the tweet was issued. Here is an example:

u'created_at': u'Sun Apr 30 01:23:27 +0000 2017'

I was interested in the UTC time, rather than the date, that a tweet was sent, because there are likely geographic differences in these values. I therefore parsed this information from the time stamp and reported this value in seconds.

from dateutil import parser 

def get_seconds(row): 
    ''' 
    Helper function to parse time from a datetime stamp and return the time in seconds 
    ''' 

    time_str = parser.parse(row).strftime('%H:%M:%S') 
    h, m, s = time_str.split(':') 
    return int(h) * 3600 + int(m) * 60 + int(s) 

# Create a new column called 'tweet_time_secs' 
df['tweet_time_secs'] = map(lambda row: get_seconds(row), df['created_at'])

Data Cleaning

Missing Data

Both 'user_described_location' (note: the API calls this field 'location') and 'utc_offset' are nullable fields that frequently contain missing values. When this was the case, I filled them in with indicator values:

df['user_described_location'].fillna('xxxMISSINGxxx', inplace=True) 
df['utc_offset'].fillna(999999, inplace=True)

Additionally, a small percentage of tweets contained missing values for 'country_code.' When this or other information was missing, I chose to drop the entire row:

df.dropna(axis=0, inplace=True)

Tweets Outside the US

The bounding box I used to stream the tweets included areas outside the contiguous US. Since the goal for this project was to predict the US city-level location of Twitter users, I relabeled tweets that originated from outside the US. For these tweets 'country_code' was relabeled to 'NOT_US' and 'geo_location' (note: the API calls this field 'full_name') was relabeled to 'NOT_IN_US, NONE':

def relabel_geo_locations(row): 
    ''' 
    Helper function to relabel the geo_locations from tweets outside the US 
    to 'NOT_IN_US, NONE' 
    ''' 

    if row['country_code'] == 'US': 
        return row['geo_location'] 
    else: 
        return 'NOT_IN_US, NONE' 

# Relabel 'country_code' for tweets outside the US to 'NOT_US' 
df['country_code'] = map(lambda cc: cc if cc == 'US' else 'NOT_US', df['country_code']) 

# Relabel 'geo_location' for tweets outside the US to 'NOT_IN_US, NONE' 
df['geo_location'] = df.apply(lambda row: relabel_geo_locations(row), axis=1)

Tweets Lacking a 'City, State' Location Label

Most tweets that originated in the US had a 'geo_location' in the form of 'City, State' (e.g., 'Austin, TX'). For some tweets, however, the label was less specific and in the form of 'State, Country' (e.g., 'Texas, USA') or, even worse, in the form of a totally unique value (e.g., 'Tropical Tan'). Since this data was going to be used to train the model, I wanted to have as granular of a label as possible for each tweet. Therefore, I only kept tweets that were in the form of 'City, State' and dropped all others:

def geo_state(row): 
    ''' 
    Helper function to parse the state code for 'geo_location' labels 
    in the form of 'City, State' 
    ''' 

    try: 
        return row['geo_location'].split(', ')[1] 
    except: 
        return None 

# Create a new column called 'geo_state' 
df['geo_state'] = df.apply(lambda row: geo_state(row),axis=1) 

# The 'geo_state' column will contain null values for any row where 'geo_location' was not 
# comma separated (e.g., 'Tropical Tan'). We drop those rows here: 
df.dropna(axis=0, inplace=True) 

# list of valid geo_state labels. "NONE" is the label I created for tweets outside the US 
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "ID", 
          "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", 
          "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", 
          "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", "NONE"] 

# Keep only rows with a valid geo_state, among others this will drop all rows that had 
# a 'geo_location' in the form of 'State, Country' (e.g., 'Texas, USA') 
df = df[df['geo_state'].isin(states)]

Aggregating the Tweets by User

During the two week collection period, many users tweeted more than once. To prevent potential leakage, I grouped the tweets by user ('screen_name'), then aggregated the remaining fields.

import numpy as np 
from collections import Counter 

# aggregation functions 
agg_funcs = {'tweet' : lambda x: ' '.join(x), 
             'geo_location' : lambda x: Counter(x).most_common(1)[0][0], 
             'geo_state' : lambda x: Counter(x).most_common(1)[0][0], 
             'user_described_location': lambda x: Counter(x).most_common(1)[0][0], 
             'utc_offset': lambda x: Counter(x).most_common(1)[0][0], 
             'country_code': lambda x: Counter(x).most_common(1)[0][0], 
             'tweet_time_secs' : np.median, 
             'statuses_count': np.max, 
             'friends_count' :np.mean, 
             'favourites_count' : np.mean, 
             'listed_count' : np.mean, 
             'followers_count' : np.mean} 

# Groupby 'screen_name' and then apply the aggregation functions in agg_funcs 
df = df.groupby(['screen_name']).agg(agg_funcs).reset_index()

Remapping the Training Tweets to the Closest Major City

Since the training tweets came from over 15,500 cities, and I didn't want to do a 15,500-wise classification problem, I used the centroids to remap all the training tweets to their closest major city from a list of 378 major US cities based on population (plus the single label for tweets outside the US, which used Toronto's coordinates). This left me with a 379-wise classification problem. Here is a plot of those major cities and the code to remap all the US training tweets to their closest major US city:

import numpy as np 
import pickle 

def load_US_coord_dict(): 
    ''' 
    Input: n/a 
    Output: A dictionary whose keys are the location names ('City, State') of the 
    378 US classification locations and the values are the centroids for those locations 
    (latitude, longitude) 
    ''' 

    pkl_file = open("US_coord_dict.pkl", 'rb') 
    US_coord_dict = pickle.load(pkl_file) 
    pkl_file.close() 
    return US_coord_dict 

def find_dist_between(tup1, tup2): 
    ''' 
    INPUT: Two tuples of latitude, longitude coordinates pairs for two cities 
    OUTPUT: The distance between the cities 
    ''' 

    return np.sqrt((tup1[0] - tup2[0])**2 + (tup1[1] - tup2[1])**2) 

def closest_major_city(tup): 
    ''' 
    INPUT: A tuple of the centroid coordinates for the tweet to remap to the closest major city 
    OUTPUT: String, 'City, State', of the city in the dictionary 'coord_dict' that is closest to the input city 
    ''' 

    d = {} 
    for key, value in US_coord_dict.iteritems(): 
        dist = find_dist_between(tup, value) 
        if key not in d: 
            d[key] = dist 
    return min(d, key=d.get) 

def get_closest_major_city_for_US(row): 
    ''' Helper function to return the closest major city for US users only. For users 
    outside the US it returns 'NOT_IN_US, NONE' 
    ''' 

    if row['geo_location'] == 'NOT_IN_US, NONE': 
        return 'NOT_IN_US, NONE' 
    else: 
        return closest_major_city(row['centroid']) 

if __name__ == "__main__": 

    # Load US_coord_dict 
    US_coord_dict = load_US_coord_dict() 

    # Create a new column called 'closest_major_city' 
    df['closest_major_city'] = df.apply(lambda row: get_closest_major_city_for_US(row), axis=1)

 

Building the Predictive Model

The steps below were run on an Amazon Web Services r3.8xlarge EC2 instance with 244 GiB of memory. Here is a high-level overview of the final model:

High-level Overview of the Stacked Model

Step 1: Load dependencies and prepare the cleaned data for model fitting

import pandas as pd 
import numpy as np 
import nltk 
from nltk.tokenize import TweetTokenizer 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.externals import joblib 

# Tokenizer to use for text vectorization 
def tokenize(tweet): 
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False) 
    return tknzr.tokenize(tweet) 

# Read cleaned training tweets file into pandas and randomize it 
df = pd.read_pickle('cleaned_training_tweets.pkl') 
randomized_df = df.sample(frac=1, random_state=111) 

# Split randomized_df into two disjoint sets 
half_randomized_df = randomized_df.shape[0] / 2 
base_df = randomized_df.iloc[:half_randomized_df, :] # used to train the base classifiers 
meta_df = randomized_df.iloc[half_randomized_df:, :] # used to train the meta classifier 

# Create variables for the known the geotagged locations from each set 
base_y = base_df['closest_major_city'].values 
meta_y = meta_df['closest_major_city'].values

Step 2: Train a base-level Linear SVC classifier on the user described locations

# Raw text of user described locations 
base_location_doc = base_df['user_described_location'].values 
meta_location_doc = meta_df['user_described_location'].values 

# fit_transform a tf-idf vectorizer using base_location_doc and use it to transform meta_location_doc 
location_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize, ngram_range=(1,2)) 
base_location_X = location_vectorizer.fit_transform(base_location_doc.ravel()) 
meta_location_X = location_vectorizer.transform(meta_location_doc) 

# Fit a Linear SVC Model with 'base_location_X' and 'base_y'. Note: it is important to use 
# balanced class weights otherwise the model will overwhelmingly favor the majority class. 
location_SVC = LinearSVC(class_weight='balanced') 
location_SVC.fit(base_location_X, base_y) 

# We can now pass meta_location_X into the fitted model and save the decision 
# function, which will be used in Step 4 when we train the meta random forest 
location_SVC_decsfunc = location_SVC.decision_function(meta_location_X) 

# Pickle the location vectorizer and the linear SVC model for future use 
joblib.dump(location_vectorizer, 'USER_LOCATION_VECTORIZER.pkl') 
joblib.dump(location_SVC, 'USER_LOCATION_SVC.pkl')

Step 3: Train a base-level Linear SVC classifier on the tweets

# Raw text of tweets 
base_tweet_doc = base_df['tweet'].values 
meta_tweet_doc = meta_df['tweet'].values 

# fit_transform a tf-idf vectorizer using base_tweet_doc and use it to transform meta_tweet_doc 
tweet_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize) 
base_tweet_X = tweet_vectorizer.fit_transform(base_tweet_doc.ravel()) 
meta_tweet_X = tweet_vectorizer.transform(meta_tweet_doc) 

# Fit a Linear SVC Model with 'base_tweet_X' and 'base_y'. Note: it is important to use 
# balanced class weights otherwise the model will overwhelmingly favor the majority class. 
tweet_SVC = LinearSVC(class_weight='balanced') 
tweet_SVC.fit(base_tweet_X, base_y) 

# We can now pass meta_tweet_X into the fitted model and save the decision 
# function, which will be used in Step 4 when we train the meta random forest 
tweet_SVC_decsfunc = tweet_SVC.decision_function(meta_tweet_X) 

# Pickle the tweet vectorizer and the linear SVC model for future use 
joblib.dump(tweet_vectorizer, 'TWEET_VECTORIZER.pkl') 
joblib.dump(tweet_SVC, 'TWEET_SVC.pkl')

Step 4: Train a meta-level Random Forest classifier

# additional features from meta_df to pull into the final model 
friends_count = meta_df['friends_count'].values.reshape(meta_df.shape[0], 1) 
utc_offset = meta_df['utc_offset'].values.reshape(meta_df.shape[0], 1) 
tweet_time_secs = meta_df['tweet_time_secs'].values.reshape(meta_df.shape[0], 1) 
statuses_count = meta_df['statuses_count'].values.reshape(meta_df.shape[0], 1) 
favourites_count = meta_df['favourites_count'].values.reshape(meta_df.shape[0], 1) 
followers_count = meta_df['followers_count'].values.reshape(meta_df.shape[0], 1) 
listed_count = meta_df['listed_count'].values.reshape(meta_df.shape[0], 1) 

# np.hstack these additional features together 
add_features = np.hstack((friends_count, 
                          utc_offset, 
                          tweet_time_secs, 
                          statuses_count, 
                          favourites_count, 
                          followers_count, 
                          listed_count)) 

# np.hstack the two decision function variables from steps 2 & 3 with add_features 
meta_X = np.hstack((location_SVC_decsfunc, # from Step 2 above 
                    tweet_SVC_decsfunc, # from Step 3 above 
                    add_features)) 

# Fit Random Forest with 'meta_X' and 'meta_y' 
meta_RF = RandomForestClassifier(n_estimators=60, n_jobs=-1) 
meta_RF.fit(meta_X, meta_y) 

# Pickle the meta Random Forest for future use 
joblib.dump(meta_RF, 'META_RF.pkl')

Testing the Model

Collecting and Preparing a Fresh Data Set

A week after I collected the training data set, I collected a fresh data set to use to evaluate the model. For this, I followed the same data collection and preparation procedures as above with a few exceptions: 1) I only ran the Tweepy script for 48 hours, 2) I removed any users from the evaluation data set that were in the training data set, and 3) I went back to the Twitter API and pulled the 200 most recent tweets for each user that remained in the data set. Remember, the goal for the model is to predict the US city-level location of Twitter users, not individual tweets; therefore, by giving the model a larger corpus of tweets for each user, I hoped to increase the model's accuracy. Here is the script for pulling the 200 most recent tweets for each user:

import tweepy 
import pandas as pd 

# these are provided to you through the Twitter API after you create an account 
consumer_key = "your_consumer_key" 
consumer_secret = "your_consumer_secret" 
access_token = "your_access_token" 
access_token_secret = "your_access_token_secret" 

count = 0 

def get_200_tweets(screen_name): 
    ''' 
    Helper function to return a list of a Twitter user's 200 most recent tweets 
    ''' 

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
    auth.set_access_token(access_token, access_token_secret) 
    api = tweepy.API(auth, 
                     wait_on_rate_limit=True, 
                     wait_on_rate_limit_notify=True) 

    # Initialize a list to hold the user's 200 most recent tweets 
    tweets_data = [] 

    global count 

    try: 
        # make request for most recent tweets (200 is the maximum allowed per distinct request) 
        recent_tweets = api.user_timeline(screen_name = screen_name, count=200) 

        # save data from most recent tweets 
        tweets_data.extend(recent_tweets) 

        count += 1 

        # Status update 
        if count % 1000 == 0: 
            print 'Just stored tweets for user #{}'.format(count) 

    except: 
        count += 1 
        pass 

    # pull only the tweets and encode them in utf-8 to preserve emojis 
    list_of_recent_tweets = [''.join(tweet.text.encode("utf-8")) for tweet in tweets_data] 

    return list_of_recent_tweets 

# Create a new column in evaluation_df called '200_tweets' 
evaluation_df['200_tweets'] = map(lambda x: get_200_tweets(x), evaluation_df['screen_name'])
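
As an aside, step 2 above (removing users that already appeared in the training set) is a one-line pandas filter; a sketch, assuming training_df and evaluation_df are the cleaned DataFrames from the steps above:

# Keep only users whose screen_name was never seen during training
evaluation_df = evaluation_df[~evaluation_df['screen_name'].isin(training_df['screen_name'])]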

Making Predictions on the Fresh Data Set

To make predictions on evaluation_df, the script below was run on the same Amazon Web Services r3.8xlarge EC2 instance that was used to build the model:

import pandas as pd 
import numpy as np 
from sklearn.externals import joblib 
import nltk 
from nltk.tokenize import TweetTokenizer 

def tokenize(tweet): 
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False) 
    return tknzr.tokenize(tweet) 

class UserLocationClassifier: 

    def __init__(self): 
        ''' 
        Load the stacked classifier's pickled vectorizers, base classifiers, and meta classifier 
        ''' 

        self.location_vectorizer = joblib.load('USER_LOCATION_VECTORIZER.pkl') 
        self.location_SVC = joblib.load('USER_LOCATION_SVC.pkl') 
        self.tweet_vectorizer = joblib.load('TWEET_VECTORIZER.pkl') 
        self.tweet_SVC = joblib.load('TWEET_SVC.pkl') 
        self.meta_RF = joblib.load('META_RF.pkl') 

    def predict(self, df): 
        ''' 
        INPUT: Cleaned and properly formatted dataframe to make predictions for 
        OUTPUT: Array of predictions 
        ''' 
        # Get text from 'user_described_location' column of DataFrame 
        location_doc = df['user_described_location'].values 

        # Convert the '200_tweets' column from a list to just a string of all tweets 
        df.loc[:, '200_tweets'] = map(lambda x: ''.join(x), df['200_tweets']) 

        # Get text from '200_tweets' column of DataFrame 
        tweet_doc = df['200_tweets'].values 

        # Vectorize 'location_doc' and 'tweet_doc' 
        location_X = self.location_vectorizer.transform(location_doc.ravel()) 
        tweet_X = self.tweet_vectorizer.transform(tweet_doc.ravel()) 

        # Store decision functions for 'location_X' and 'tweet_X' 
        location_decision_function = self.location_SVC.decision_function(location_X) 
        tweet_decision_function = self.tweet_SVC.decision_function(tweet_X) 

        # Get additional features to pull into the Random Forest 
        friends_count = df['friends_count'].values.reshape(df.shape[0], 1) 
        utc_offset = df['utc_offset'].values.reshape(df.shape[0], 1) 
        tweet_time_secs = df['tweet_time_secs'].values.reshape(df.shape[0], 1) 
        statuses_count = df['statuses_count'].values.reshape(df.shape[0], 1) 
        favourites_count = df['favourites_count'].values.reshape(df.shape[0], 1) 
        followers_count = df['followers_count'].values.reshape(df.shape[0], 1) 
        listed_count = df['listed_count'].values.reshape(df.shape[0], 1) 

        # np.hstack additional features together 
        add_features = np.hstack((friends_count, 
                               utc_offset, 
                               tweet_time_secs, 
                               statuses_count, 
                               favourites_count, 
                               followers_count, 
                               listed_count)) 

        # np.hstack the two decision function variables with add_features 
        meta_X = np.hstack((location_decision_function, tweet_decision_function, add_features)) 

        # Feed meta_X into Random Forest and make predictions 
        return self.meta_RF.predict(meta_X) 

if __name__ == "__main__": 

        # Load evaluation_df into pandas DataFrame 
        evaluation_df = pd.read_pickle('evaluation_df.pkl') 

        # Load UserLocationClassifier 
        clf = UserLocationClassifier() 

        # Get predicted locations 
        predictions = clf.predict(evaluation_df) 

        # Create a new column called 'predicted_location' 
        evaluation_df.loc[:, 'predicted_location'] = predictions 

        # Pickle the resulting DataFrame with the location predictions 
        evaluation_df.to_pickle('evaluation_df_with_predictions.pkl')

Plotting the Locations of Twitter Users on a Map Using Bokeh

Here are some examples of how the model performed on a few selected cities. For each of the maps shown below, the dots indicate the user's true location, while the title of the map indicates where the model predicted them to be. As you can see, for each city there is a tight cluster in and around the correct location, with only a handful of extreme misses. Here is the code for generating these plots (note: the final plots shown here were constructed in Photoshop after first using the 'pan' and 'wheel_zoom' tools in Bokeh to capture screenshots of the contiguous US, Alaska, and Hawaii):

import pandas as pd 
import pickle 

from bokeh.plotting import figure, output_notebook, output_file, show 
from bokeh.tile_providers import STAMEN_TERRAIN 
output_notebook() 

from functools import partial 
from shapely.geometry import Point 
from shapely.ops import transform 
import pyproj 

# Web mercator bounding box for the US 
US = ((-13884029, -7453304), (2698291, 6455972)) 

x_range, y_range = US 
plot_width = int(900) 
plot_height = int(plot_width*7.0/12) 

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height, **plot_args): 
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height, 
        x_range=x_range, y_range=y_range, outline_line_color=None, 
        min_border=0, min_border_left=0, min_border_right=0, 
        min_border_top=0, min_border_bottom=0, **plot_args) 

    p.axis.visible = False 
    p.xgrid.grid_line_color = None 
    p.ygrid.grid_line_color = None 
    return p 

def plot_predictions_for_a_city(df, name_of_predictions_col, city): 
    ''' 
    INPUT: DataFrame with location predictions; name of column in DataFrame that 
    contains the predictions; city ('City, State') to plot predictions for 
    OUTPUT: Bokeh map that shows the actual location of all the users predicted to 
    be in the selected city 
    ''' 

    df_ = df[df[name_of_predictions_col] == city] 

    # Initialize two lists to hold all the latitudes and longitudes 
    all_lats = [] 
    all_longs = [] 

    # Pull all latitudes in 'centroid' column append to all_lats 
    for i in df_['centroid']: 
        all_lats.append(i[0]) 

    # Pull all longitudes in 'centroid' column append to all_longs 
    for i in df_['centroid']: 
        all_longs.append(i[1]) 

    # Initialize two lists to hold all the latitudes and longitudes 
    # converted to web mercator 
    all_x = [] 
    all_y = [] 

    # Convert latittudes and longitudes to web mercator x and y format 
    for i in xrange(len(all_lats)): 
        pnt = transform( 
            partial( 
                pyproj.transform, 
                pyproj.Proj(init='EPSG:4326'), 
                pyproj.Proj(init='EPSG:3857')), 
                Point(all_longs[i], all_lats[i])) 
        all_x.append(pnt.x) 
        all_y.append(pnt.y) 

    p = base_plot() 
    p.add_tile(STAMEN_TERRAIN) 
    p.circle(x=all_x, y=all_y, line_color=None, fill_color='#380474', size=15, alpha=.5) 
    output_file("stamen_toner_plot.html") 
    show(p) 

if __name__ == "__main__": 

    # Load pickled evaluation_df with location predictions 
    evaluation_df_with_predictions = pd.read_pickle('evaluation_df_with_predictions.pkl') 

    # Plot actual locations for users predicted to be in Eugene, OR 
    plot_predictions_for_a_city(evaluation_df_with_predictions, 'predicted_location', 'Eugene, OR')

Example 1: Eugene, OR

Example 2: Durham, NC

Example 3: Shreveport, LA

Tweet Term Importances for these Cities

To get an idea of what tweet terms were important for predicting these cities, I went through and calculated mean tf-idf values for each of these cities. Below are some of the more interesting terms for each of these cities. To generate these plots, I followed an excellent guide written by Thomas Buhrmann.
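
The post does not show the code for that step, but the idea is straightforward; here is a minimal sketch, assuming the fitted tweet_vectorizer from Step 3 and the aggregated training DataFrame with its 'tweet' and 'closest_major_city' columns (the helper name and top_n value are made up):

import numpy as np

def top_terms_for_city(df, vectorizer, city, top_n=15):
    # Vectorize only the tweets of users assigned to this city
    docs = df[df['closest_major_city'] == city]['tweet'].values
    X = vectorizer.transform(docs)
    # Mean tf-idf weight of every term across the city's documents
    mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
    terms = np.array(vectorizer.get_feature_names())
    top = mean_tfidf.argsort()[::-1][:top_n]
    return zip(terms[top], mean_tfidf[top])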

Emoji Skin Tone Modifications

One of the more interesting things to fall out of the model was the colored boxes shown above. These represent the skin tone modifications you can add to certain emojis. For most emojis there was not a strong geographic signal; however, for the skin tone modifications there was. As you can see in the term importances plots, Twitter users in Eugene, OR, tended to use lighter colored skin tone modifications while users in Durham, NC, and Shreveport, LA, tended to use darker skin tone modifications.

Scoring the Model

Median Error: 49.6 miles

To score the model I chose to use median error, which came out to be 49.6 miles. This was calculated by using the centroids to find the great-circle distance between the predicted city and the true location. Here is how it was calculated (note: if the user was predicted to be in the correct city, the error was scored as 0.0 miles, regardless of the actual distance between the centroids):

from math import sin, cos, sqrt, atan2, radians 
import pickle 

def load_coord_dict(): 
    ''' 
    Input: n/a 
    Output: A dictionary whose keys are the location names ('City, State') of the 
    379 classification labels and the values are the centroids for those locations 
    (latitude, longitude) 
    ''' 

    pkl_file = open("coord_dict.pkl", 'rb') 
    coord_dict = pickle.load(pkl_file) 
    pkl_file.close() 
    return coord_dict 

def compute_error_in_miles(zipped_predictions): 
    ''' 
    INPUT: Tuple in the form of (predicted city, centroid of true location) 
    OUTPUT: Float of the great-circle error distance between the predicted city 
    and the true location. 
    ''' 

    radius = 3959 # approximate radius of earth in miles 

    predicted_city = zipped_predictions[0] 
    actual_centroid = zipped_predictions[1] 

    lat1 = radians(coord_dict[predicted_city][0]) 
    lon1 = radians(coord_dict[predicted_city][1]) 
    lat2 = radians(actual_centroid[0]) 
    lon2 = radians(actual_centroid[1]) 

    delta_lon = lon2 - lon1 
    delta_lat = lat2 - lat1 

    a = sin(delta_lat / 2)**2 + cos(lat1) * cos(lat2) * sin(delta_lon / 2)**2 
    c = 2 * atan2(sqrt(a), sqrt(1 - a)) 

    error_distance = radius * c 
    return error_distance 

def correct_outside_the_us_errors(row): 
    ''' 
    Helper function to correct the errors to 0.0 for the users that were correctly predicted. This 
    is especially important for users outside the US since they were all given the 
    same label ('NOT_IN_US, NONE') even though their centroids were all different. 
    ''' 

    if row['predicted_location'] == row['geo_location']: 
        error = 0.0 
    else: 
        error = row['error_in_miles'] 
    return error 

if __name__ == "__main__": 

    # Load coord_dict 
    coord_dict = load_coord_dict() 

    centroid_of_true_location = evaluation_df['centroid'].values 
    zipped_predictions = zip(predictions, centroid_of_true_location) 

    # Create a new column with the error value for each prediction 
    evaluation_df['error_in_miles'] = map(lambda x: compute_error_in_miles(x), zipped_predictions) 

    # Change the error of correct predictions to 0.0 miles 
    evaluation_df['error_in_miles'] = evaluation_df.apply(lambda x:
                                                          correct_outside_the_us_errors(x),
                                                          axis=1)

    median_error = evaluation_df['error_in_miles'].median()

Histogram of Error Distances

Influence of Tweet Number on the Model's Accuracy

Recall that, for each user I wanted to make a prediction on, I went back to the API and pulled 200 of their most recent tweets. The plot below was generated using the same procedure as above with increasing numbers of tweets for each user. I originally chose 200 because this is the maximum number the API allows you to pull per distinct request. However, as you can see in the plot below, after about 100 tweets there is negligible improvement in the model's accuracy, meaning for future use it might not be necessary to pull so many tweets for each user.
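For reference, here is a minimal Tweepy sketch of that kind of pull (the credential placeholders and screen name below are hypothetical):

import tweepy

# Hypothetical placeholder credentials
CONSUMER_KEY = 'xxx'
CONSUMER_SECRET = 'xxx'
ACCESS_TOKEN = 'xxx'
ACCESS_TOKEN_SECRET = 'xxx'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# 200 is the per-request maximum for user_timeline
tweets = api.user_timeline(screen_name='some_user', count=200)
tweet_texts = [t.text for t in tweets]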

Final Notes

While a median error of 49.6 miles is pretty good, there is still plenty of room for improvement. Running the Tweepy streaming script for a longer period of time and having a larger collection of training data would likely give an immediate improvement. Additionally, with more training data, you could also include more than 379 classification labels, which would also help to decrease the median error of the model. That said, given the time constraints of the project, I'm satisfied with the current model's accuracy and think it could be a valuable resource to many projects where having an extremely granular estimate of a Twitter user's location is not required.

July 21, 2017 04:45 PM


Jaysinh Shukla

PyDelhi Conf 2017: A beautiful conference happened in New Delhi, India

PyDelhi Conf
2017

TL;DR

PyDelhi conf 2017 was a two-day conference featuring workshops, dev sprints, and both full-length and lightning talks. The workshop sessions were included without any extra charge. Delhiites should not miss the chance to attend this conference in the future. I conducted a workshop titled “Tango with Django”, helping beginners understand the Django web framework.

Detailed Review

About the PyDelhi community

PyDelhi
Community

PyDelhi conf 2017 volunteers

The PyDelhi community was known as the NCR Python Users Group until a few years ago. The community acts as an umbrella organization for other FLOSS communities across New Delhi, India, and actively arranges monthly meetups on interesting topics. The last PyCon India, the national-level conference for the Python programming language, was impressively organized by this community, and this year too they have taken on the responsibility of managing it. I am very thankful to this community for its immense contribution to society. If you are around New Delhi, India, you should not miss the chance to attend their meetups. This community has great people who are always happy to mentor.

PyDelhi conf 2017

Conference T-shirt

Conference T-shirt

PyDelhi conf is a regional-level conference on the Python programming language organized by the PyDelhi community. This was their second year organizing it. Last year it was held at JNU; this year it took place at the IIM Lucknow campus in Noida, New Delhi, India. I enjoyed various talks, which I will mention later here; the workshop section, because I was conducting one; and some panel discussions, because the people involved had a good level of experience. 80% of the time slots were divided equally between 30-minute talks and 2-hour workshop sessions; 10% was given to panel discussions and 10% was reserved for lightning talks. The dev sprints ran in parallel with the conference, and the early slots on both days were given to workshops. One large conference hall was located on the 2nd floor of the building and two halls on the ground floor, where food and beverages were served.

Panel discussion

Panel Discussion

Desk

Registration desk

Lunch

Tea break

Keynote speakers

Mr. Richardo Rocha

Mr. Chris Stucchio

Interesting Talks

Because I took the wrong metro train, I was late for the inaugural ceremony and also missed the keynote given by Mr. Rocha. The talks below were impressively presented at the conference.

I love talking with people rather than sitting in sessions. For that reason, I always miss some important talks presented at the conference, though I make sure to watch them once they are publicly available. This year I missed the following talks.

Volunteer Party

I got a warm invitation from the organizers to join the volunteer party, but I was a little tense about my session happening the next day, so I decided to go home and improve my slides. I heard from friends that the party was awesome!

My workshop session

Tango with Django

Me conducting workshop

I conducted a workshop on the Django web framework. “Tango with Django” was chosen as the title in the hope of attracting beginners; I believe it is also the name of a famous book serving the same purpose.

Dev sprints

Dev sprints

Me hacking at dev sprints section

The dev sprints ran in parallel with the conference. Mr. Pillai was representing Junction. I decided to look into a few CPython issues but didn’t do much. There were a bunch of people hacking, but I didn’t find anything interesting to join. The quality of the chairs was so impressive that I have decided to buy the same ones for my home office.

Why attend this conference?

What was missing?

Thank you PyDelhi community!

I would like to thank all the known and unknown volunteers who did their best in arranging this conference. I encourage the PyDelhi community to keep organizing such an affable conference.

Proofreaders: Mr. Daniel Foerster, Mr. Dhavan Vaidya, Mr. Sayan Chowdhury, Mr. Trent Buck

July 21, 2017 12:44 PM


The Digital Cat

Refactoring with tests in Python: a practical example

This post contains a step-by-step example of a refactoring session guided by tests. When dealing with untested or legacy code, refactoring is dangerous: tests can help us do it the right way, minimizing the number of bugs we introduce, and possibly avoiding them completely.

Refactoring is not easy. It requires a double effort: understanding code that others wrote, or that we wrote in the past, and then moving parts of it around, simplifying it, in one word improving it, which is by no means something for the faint-hearted. Like programming, refactoring has its rules and best practices, but it can be described as a mixture of technique, intuition, experience, and risk.

Programming, after all, is craftsmanship.

The starting point

The simple use case I will use for this post is that of a service API that we can access, and that produces data in JSON format, namely a list of elements like the one shown here

{
    'age': 20,
    'surname': 'Frazier',
    'name': 'John',
    'salary': '£28943'
}

Once we convert this to a Python data structure we obtain a list of dictionaries, where 'age' is an integer, and the remaining fields are strings.
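As a quick illustration, the conversion in an interactive shell might look like this (the json_text string here is just a hypothetical one-element payload):

import json

json_text = '[{"age": 20, "surname": "Frazier", "name": "John", "salary": "£28943"}]'
data = json.loads(json_text)

print(type(data[0]['age']))      # <class 'int'>
print(type(data[0]['salary']))   # <class 'str'>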

Someone then wrote a class that computes some statistics on the input data. This class, called DataStats, provides a single method stats(), whose inputs are the data returned by the service (in JSON format), and two integers called iage and isalary. Those, according to the short documentation of the class, are the initial age and the initial salary used to compute the average yearly increase of the salary on the whole dataset.

The code is the following

import math
import json


class DataStats:

    def stats(self, data, iage, isalary):
        # iage and isalary are the starting age and salary used to
        # compute the average yearly increase of salary.

        # Compute average yearly increase
        average_age_increase = math.floor(
            sum([e['age'] for e in data])/len(data)) - iage
        average_salary_increase = math.floor(
            sum([int(e['salary'][1:]) for e in data])/len(data)) - isalary

        yearly_avg_increase = math.floor(
            average_salary_increase/average_age_increase)

        # Compute max salary
        salaries = [int(e['salary'][1:]) for e in data]
        threshold = '£' + str(max(salaries))

        max_salary = [e for e in data if e['salary'] == threshold]

        # Compute min salary
        salaries = [int(d['salary'][1:]) for d in data]
        min_salary = [e for e in data if e['salary'] ==
                      '£{}'.format(str(min(salaries)))]

        return json.dumps({
            'avg_age': math.floor(sum([e['age'] for e in data])/len(data)),
            'avg_salary': math.floor(sum(
                [int(e['salary'][1:]) for e in data])/len(data)),
            'avg_yearly_increase': yearly_avg_increase,
            'max_salary': max_salary,
            'min_salary': min_salary
        })

The goal

It is fairly easy, even for the untrained eye, to spot some issues in the previous class. A list of the most striking ones is

So, since we are going to use this code in some part of our Amazing New Project™, we want to possibly fix these issues.

The class, however, is working perfectly. It has been used in production for many years and there are no known bugs, so our operation has to be a refactoring, which means that we want to write something better, preserving the behaviour of the previous object.

The path

In this post I want to show you how you can safely refactor such a class using tests. This is different from TDD, but the two are closely related. The class we have has not been created using TDD, as there are no tests, but we can use tests to ensure its behaviour is preserved. This should therefore be called Test Driven Refactoring (TDR).

The idea behind TDR is pretty simple. First, we have to write a test that checks the behaviour of some code, possibly a small part with a clearly defined scope and output. This is a posthumous (or late) unit test, and it simulates what the author of the code should have provided (cough cough, it was you some months ago...).

Once you have your unit test you can go and modify the code, knowing that the behaviour of the resulting object will be the same as that of the previous one. As you can easily understand, the effectiveness of this methodology depends strongly on the quality of the tests themselves, possibly more than when developing with TDD, and this is why refactoring is hard.

Caveats

Two remarks before we start our first refactoring. The first is that such a class could easily be refactored into functional code: as you will be able to infer from the final result, there is no real reason to keep an object-oriented approach for this code. I decided to go that way, however, as it gives me the chance to show a design pattern called Wrapper, and the refactoring technique that leverages it.

The second remark is that in pure TDD it is strongly advised not to test internal methods, that is, those methods that do not form the public API of the object. In Python we generally identify such methods by prefixing their names with an underscore, and the reason not to test them is that TDD wants you to shape objects according to the object-oriented programming methodology, which considers objects as behaviours and not as structures. Thus, we are only interested in testing public methods.

It is also true, however, that sometimes even though we do not want to make a method public, that method contains some complex logic that we want to test. So, in my opinion, the TDD advice should sound like "Test internal methods only when they contain some non-trivial logic".

When it comes to refactoring, however, we are somehow deconstructing a previously existing structure, and we usually end up creating a lot of private methods to help extract and generalise parts of the code. My advice in this case is to test those methods, as this gives you a higher degree of confidence in what you are doing. With experience you will then learn which tests are required and which are not.

Setup of the testing environment

Clone this repository and create a virtual environment. Activate it and install the required packages with

pip install -r requirements.txt

The repository already contains a configuration file for pytest; you should customise it so that pytest does not descend into your virtual environment directory. Go and fix the norecursedirs parameter in that file, adding the name of the virtual environment you just created; I usually name my virtual environments with a venv prefix, and this is why that variable contains the entry venv*.
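Just as a reference, a minimal pytest.ini along those lines could look like the following (the actual file in the repository may contain more options):

[pytest]
norecursedirs = venv*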

At this point you should be able to run pytest -svv in the parent directory of the repository (the one that contains pytest.ini), and obtain a result similar to the following

========================== test session starts ==========================
platform linux -- Python 3.5.3, pytest-3.1.2, py-1.4.34, pluggy-0.4.0
cachedir: .cache
rootdir: datastats, inifile: pytest.ini
plugins: cov-2.5.1
collected 0 items 

====================== no tests ran in 0.00 seconds ======================

The repository contains two branches: master, the one you are currently on, contains the initial setup, while develop points to the last step of the whole refactoring process. Every step of this post includes a reference to the commit that contains the changes introduced in that section.

Step 1 - Testing the endpoints

Commit: 27a1d8c

When you start refactoring a system, regardless of its size, you have to test the endpoints. This means that you consider the system as a black box (i.e. you do not know what is inside) and just check its external behaviour. In this case we can write a test that initialises the class and runs the stats() method with some test data, possibly real data, and checks the output. Obviously we will write the test using the actual output returned by the method, so this test passes automatically.

Querying the server we get the following data

test_data = [
    {
        "id": 1,
        "name": "Laith",
        "surname": "Simmons",
        "age": 68,
        "salary": "£27888"
    },
    {
        "id": 2,
        "name": "Mikayla",
        "surname": "Henry",
        "age": 49,
        "salary": "£67137"
    },
    {
        "id": 3,
        "name": "Garth",
        "surname": "Fields",
        "age": 70,
        "salary": "£70472"
    }
]

and calling the stats() method with that output, with iage set to 20, and isalary set to 20000, we get the following JSON result

{
    'avg_age': 62,
    'avg_salary': 55165,
    'avg_yearly_increase': 837,
    'max_salary': [{
        "id": 3,
        "name": "Garth",
        "surname": "Fields",
        "age": 70,
        "salary": "£70472"
    }],
    'min_salary': [{
        "id": 1,
        "name": "Laith",
        "surname": "Simmons",
        "age": 68,
        "salary": "£27888"
    }]
}

Caveat: I'm using a single very short set of real data, namely a list of 3 dictionaries. In a real case I would test the black box with many different use cases, to ensure I am not just checking some corner case.

The test is the following

import json

from datastats.datastats import DataStats


def test_json():
    test_data = [
        {
            "id": 1,
            "name": "Laith",
            "surname": "Simmons",
            "age": 68,
            "salary": "£27888"
        },
        {
            "id": 2,
            "name": "Mikayla",
            "surname": "Henry",
            "age": 49,
            "salary": "£67137"
        },
        {
            "id": 3,
            "name": "Garth",
            "surname": "Fields",
            "age": 70,
            "salary": "£70472"
        }
    ]

    ds = DataStats()

    assert ds.stats(test_data, 20, 20000) == json.dumps(
        {
            'avg_age': 62,
            'avg_salary': 55165,
            'avg_yearly_increase': 837,
            'max_salary': [{
                "id": 3,
                "name": "Garth",
                "surname": "Fields",
                "age": 70,
                "salary": "£70472"
            }],
            'min_salary': [{
                "id": 1,
                "name": "Laith",
                "surname": "Simmons",
                "age": 68,
                "salary": "£27888"
            }]
        }
    )

As said before, this test is obviously passing, having been artificially constructed from a real execution of the code.

Well, this test is very important! Now we know that if we change something inside the code, altering the behaviour of the class, at least one test will fail.

Step 2 - Getting rid of the JSON format

Commit: 65e2997

The method returns its output in JSON format, and looking at the class it is pretty evident that the conversion is done by json.dumps().

The structure of the code is the following

class DataStats:

    def stats(self, data, iage, isalary):
        [code_part_1]

        return json.dumps({
            [code_part_2]
        })

Where obviously code_part_2 depends on code_part_1. The first refactoring, then, will follow this procedure

  1. We write a test called test__stats() for a _stats() method that is supposed to return the data as a Python structure. We can infer the latter manually from the JSON, or by running json.loads() in a Python shell. The test fails.
  2. We duplicate the code of the stats() method that produces the data, putting it in the new _stats() method. The test passes.
class DataStats:

    def _stats(parameters):
        [code_part_1]

        return [code_part_2]

    def stats(self, data, iage, isalary):
        [code_part_1]

        return json.dumps({
            [code_part_2]
        })
  3. We remove the duplicated code in stats(), replacing it with a call to _stats()
class DataStats:

    def _stats(parameters):
        [code_part_1]

        return [code_part_2]

    def stats(self, data, iage, isalary):
        return json.dumps(
            self._stats(data, iage, isalary)
        )

At this point we could refactor the initial test test_json() that we wrote, but this is an advanced consideration, and I'll leave it for some later notes.

So now the code of our class looks like this

class DataStats:

    def _stats(self, data, iage, isalary):
        # iage and isalary are the starting age and salary used to
        # compute the average yearly increase of salary.

        # Compute average yearly increase
        average_age_increase = math.floor(
            sum([e['age'] for e in data])/len(data)) - iage
        average_salary_increase = math.floor(
            sum([int(e['salary'][1:]) for e in data])/len(data)) - isalary

        yearly_avg_increase = math.floor(
            average_salary_increase/average_age_increase)

        # Compute max salary
        salaries = [int(e['salary'][1:]) for e in data]
        threshold = '£' + str(max(salaries))

        max_salary = [e for e in data if e['salary'] == threshold]

        # Compute min salary
        salaries = [int(d['salary'][1:]) for d in data]
        min_salary = [e for e in data if e['salary'] ==
                      '£{}'.format(str(min(salaries)))]

        return {
            'avg_age': math.floor(sum([e['age'] for e in data])/len(data)),
            'avg_salary': math.floor(sum(
                [int(e['salary'][1:]) for e in data])/len(data)),
            'avg_yearly_increase': yearly_avg_increase,
            'max_salary': max_salary,
            'min_salary': min_salary
        }

    def stats(self, data, iage, isalary):
        return json.dumps(
            self._stats(data, iage, isalary)
        )

and we have two tests that check the correctness of it.

Step 3 - Refactoring the tests

Commit: d619017

It is pretty clear that the test_data list of dictionaries is bound to be used in every test we will perform, so it is high time we moved that to a global variable. There is no point now in using a fixture, as the test data is just static data.

We could also move the output data to a global variable, but the upcoming tests are not using the whole output dictionary any more, so we can postpone the decision.

The test suite now looks like

import json

from datastats.datastats import DataStats


test_data = [
    {
        "id": 1,
        "name": "Laith",
        "surname": "Simmons",
        "age": 68,
        "salary": "£27888"
    },
    {
        "id": 2,
        "name": "Mikayla",
        "surname": "Henry",
        "age": 49,
        "salary": "£67137"
    },
    {
        "id": 3,
        "name": "Garth",
        "surname": "Fields",
        "age": 70,
        "salary": "£70472"
    }
]


def test_json():

    ds = DataStats()

    assert ds.stats(test_data, 20, 20000) == json.dumps(
        {
            'avg_age': 62,
            'avg_salary': 55165,
            'avg_yearly_increase': 837,
            'max_salary': [{
                "id": 3,
                "name": "Garth",
                "surname": "Fields",
                "age": 70,
                "salary": "£70472"
            }],
            'min_salary': [{
                "id": 1,
                "name": "Laith",
                "surname": "Simmons",
                "age": 68,
                "salary": "£27888"
            }]
        }
    )


def test__stats():

    ds = DataStats()

    assert ds._stats(test_data, 20, 20000) == {
        'avg_age': 62,
        'avg_salary': 55165,
        'avg_yearly_increase': 837,
        'max_salary': [{
            "id": 3,
            "name": "Garth",
            "surname": "Fields",
            "age": 70,
            "salary": "£70472"
        }],
        'min_salary': [{
            "id": 1,
            "name": "Laith",
            "surname": "Simmons",
            "age": 68,
            "salary": "£27888"
        }]
    }

Step 4 - Isolate the average age algorithm

Commit: 9db1803

Isolating independent features is a key goal of software design. Thus, our refactoring shall aim to disentangle the code, dividing it into small separate functions.

The output dictionary contains five keys, and each of them corresponds to a value computed either on the fly (for avg_age and avg_salary) or by the method's code (for avg_yearly_increase, max_salary, and min_salary). We can start replacing the code that computes the value of each key with dedicated methods, trying to isolate the algorithms.

To isolate some code, the first thing to do is to duplicate it, putting it into a dedicated method. As we are refactoring with tests, the first thing is to write a test for this method.

def test__avg_age():

    ds = DataStats()

    assert ds._avg_age(test_data) == 62

We know that the method's output shall be 62 as that is the value we have in the output data of the original stats() method. Please note that there is no need to pass iage and isalary as they are not used in the refactored code.

The test fails, so we can dutifully go and duplicate the code we use to compute 'avg_age'

    def _avg_age(self, data):
        return math.floor(sum([e['age'] for e in data])/len(data))

and once the test passes we can replace the duplicated code in _stats() with a call to _avg_age()

        return {
            'avg_age': self._avg_age(data),
            'avg_salary': math.floor(sum(
                [int(e['salary'][1:]) for e in data])/len(data)),
            'avg_yearly_increase': yearly_avg_increase,
            'max_salary': max_salary,
            'min_salary': min_salary
        }

After that, check that no test is failing. Well done! We have isolated the first feature, and our refactoring has already produced three tests.

Step 5 - Isolate the average salary algorithm

Commit: 4122201

The avg_salary key works exactly like the avg_age, with different code. Thus, the refactoring process is the same as before, and the result should be a new test__avg_salary() test

def test__avg_salary():

    ds = DataStats()

    assert ds._avg_salary(test_data) == 55165

a new _avg_salary() method

    def _avg_salary(self, data):
        return math.floor(sum([int(e['salary'][1:]) for e in data])/len(data))

and a new version of the final return value

        return {
            'avg_age': self._avg_age(data),
            'avg_salary': self._avg_salary(data),
            'avg_yearly_increase': yearly_avg_increase,
            'max_salary': max_salary,
            'min_salary': min_salary
        }

Step 6 - Isolate the average yearly increase algorithm

Commit: 4005145

The remaining three keys are computed with algorithms that, being longer than one line, couldn't be squeezed directly into the definition of the dictionary. The refactoring process, however, does not really change: as before, we first test a helper method, then define it by duplicating the code, and finally call the helper, removing the code duplication.

For the average yearly increase of the salary we have a new test

def test__avg_yearly_increase():

    ds = DataStats()

    assert ds._avg_yearly_increase(test_data, 20, 20000) == 837

a new method that passes the test

    def _avg_yearly_increase(self, data, iage, isalary):
        # iage and isalary are the starting age and salary used to
        # compute the average yearly increase of salary.

        # Compute average yearly increase
        average_age_increase = math.floor(
            sum([e['age'] for e in data])/len(data)) - iage
        average_salary_increase = math.floor(
            sum([int(e['salary'][1:]) for e in data])/len(data)) - isalary

        return math.floor(average_salary_increase/average_age_increase)

and a new version of the _stats() method

    def _stats(self, data, iage, isalary):
        # Compute max salary
        salaries = [int(e['salary'][1:]) for e in data]
        threshold = '£' + str(max(salaries))

        max_salary = [e for e in data if e['salary'] == threshold]

        # Compute min salary
        salaries = [int(d['salary'][1:]) for d in data]
        min_salary = [e for e in data if e['salary'] ==
                      '£{}'.format(str(min(salaries)))]

        return {
            'avg_age': self._avg_age(data),
            'avg_salary': self._avg_salary(data),
            'avg_yearly_increase': self._avg_yearly_increase(
                data, iage, isalary),
            'max_salary': max_salary,
            'min_salary': min_salary
        }

Please note that we are not removing any code duplication other than the duplication we introduce ourselves while refactoring. The first goal we should aim for is to completely isolate independent features.

Step 7 - Isolate max and min salary algorithms

Commit: 17b2413

When refactoring we should always do one thing at a time, but for the sake of conciseness I'll show here the result of two refactoring steps at once. I recommend that the reader perform them as independent steps, as I did when I wrote the code posted below.

The new tests are

def test__max_salary():

    ds = DataStats()

    assert ds._max_salary(test_data) == [{
        "id": 3,
        "name": "Garth",
        "surname": "Fields",
        "age": 70,
        "salary": "£70472"
    }]


def test__min_salary():

    ds = DataStats()

    assert ds._min_salary(test_data) == [{
        "id": 1,
        "name": "Laith",
        "surname": "Simmons",
        "age": 68,
        "salary": "£27888"
    }]

The new methods in the DataStats class are

    def _max_salary(self, data):
        # Compute max salary
        salaries = [int(e['salary'][1:]) for e in data]
        threshold = '£' + str(max(salaries))

        return [e for e in data if e['salary'] == threshold]

    def _min_salary(self, data):
        # Compute min salary
        salaries = [int(d['salary'][1:]) for d in data]
        return [e for e in data if e['salary'] ==
                '£{}'.format(str(min(salaries)))]

and the _stats() method is now really tiny

    def _stats(self, data, iage, isalary):
        return {
            'avg_age': self._avg_age(data),
            'avg_salary': self._avg_salary(data),
            'avg_yearly_increase': self._avg_yearly_increase(
                data, iage, isalary),
            'max_salary': self._max_salary(data),
            'min_salary': self._min_salary(data)
        }

Step 8 - Reducing code duplication

Commit: b559a5c

Now that we have the main tests in place, we can start changing the code of the various helper methods. These are now small enough to let us change the code without further tests. While this is true in this case, in general there is no definition of what "small enough" means, just as there is no real definition of what a "unit test" is. Generally speaking, you should be confident that the change you are making is covered by the tests you have. If that is not the case, you had better add one or more tests until you feel confident enough.

The two methods _max_salary() and _min_salary() share a great deal of code, even though the second one is more concise

    def _max_salary(self, data):
        # Compute max salary
        salaries = [int(e['salary'][1:]) for e in data]
        threshold = '£' + str(max(salaries))

        return [e for e in data if e['salary'] == threshold]

    def _min_salary(self, data):
        # Compute min salary
        salaries = [int(d['salary'][1:]) for d in data]
        return [e for e in data if e['salary'] ==
                '£{}'.format(str(min(salaries)))]

I'll start by making the threshold variable explicit in the second function. As soon as I change something, I run the tests to check that the external behaviour did not change.

    def _max_salary(self, data):
        # Compute max salary
        salaries = [int(e['salary'][1:]) for e in data]
        threshold = '£' + str(max(salaries))

        return [e for e in data if e['salary'] == threshold]

    def _min_salary(self, data):
        # Compute min salary
        salaries = [int(d['salary'][1:]) for d in data]
        threshold = '£{}'.format(str(min(salaries)))

        return [e for e in data if e['salary'] == threshold]

Now it is pretty evident that the two functions are the same except for the min() and max() calls. They still use different variable names and different code to format the threshold, so my first action is to even them out, copying the code of _min_salary() into _max_salary() and changing min() to max()

    def _max_salary(self, data):
        # Compute max salary
        salaries = [int(d['salary'][1:]) for d in data]
        threshold = '£{}'.format(str(max(salaries)))

        return [e for e in data if e['salary'] == threshold]

    def _min_salary(self, data):
        # Compute min salary
        salaries = [int(d['salary'][1:]) for d in data]
        threshold = '£{}'.format(str(min(salaries)))

        return [e for e in data if e['salary'] == threshold]

Now I can create another helper called _select_salary() that duplicates that code and accepts a function to be used in place of min() or max(). As I did before, first I duplicate the code, and then I remove the duplication by calling the new function.

After a few passes, the code looks like this

    def _select_salary(self, data, func):
        salaries = [int(d['salary'][1:]) for d in data]
        threshold = '£{}'.format(str(func(salaries)))

        return [e for e in data if e['salary'] == threshold]

    def _max_salary(self, data):
        return self._select_salary(data, max)

    def _min_salary(self, data):
        return self._select_salary(data, min)

I noticed then a code duplication between _avg_salary() and _select_salary():

    def _avg_salary(self, data):
        return math.floor(sum([int(e['salary'][1:]) for e in data])/len(data))

    def _select_salary(self, data, func):
        salaries = [int(d['salary'][1:]) for d in data]

and decided to extract the common algorithm in a method called _salaries(). As before, I write the test first

def test_salaries():

    ds = DataStats()

    assert ds._salaries(test_data) == [27888, 67137, 70472]

then I implement the method

    def _salaries(self, data):
        return [int(d['salary'][1:]) for d in data]

and eventually I replace the duplicated code with a call to the new method

    def _salaries(self, data):
        return [int(d['salary'][1:]) for d in data]

    def _select_salary(self, data, func):
        threshold = '£{}'.format(str(func(self._salaries(data))))

        return [e for e in data if e['salary'] == threshold]

While doing this I noticed that _avg_yearly_increase() contains the same code, so I fixed it there as well.

    def _avg_yearly_increase(self, data, iage, isalary):
        # iage and isalary are the starting age and salary used to
        # compute the average yearly increase of salary.

        # Compute average yearly increase
        average_age_increase = math.floor(
            sum([e['age'] for e in data])/len(data)) - iage
        average_salary_increase = math.floor(
            sum(self._salaries(data))/len(data)) - isalary

        return math.floor(average_salary_increase/average_age_increase)

It would be useful at this point to store the input data inside the class and use it as self.data instead of passing it around to all the class's methods. This, however, would break the class's API, as DataStats is currently initialised without any data. Later I will show how to introduce changes that potentially break the API, and briefly discuss the issue. For the moment, however, I'll keep changing the class without modifying its external interface.

It looks like age has the same code duplication issues as salary, so with the same procedure I introduce the _ages() method and change the _avg_age() and _avg_yearly_increase() methods accordingly.
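A sketch of how that could look, mirroring the salary refactoring above (the exact code is in the repository):

    def _ages(self, data):
        return [e['age'] for e in data]

    def _avg_age(self, data):
        return math.floor(sum(self._ages(data))/len(data))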

Speaking of _avg_yearly_increase(), the code of that method contains the code of the _avg_age() and _avg_salary() methods, so it is worth replacing it with two calls. As I am moving code between existing methods, I do not need further tests.

    def _avg_yearly_increase(self, data, iage, isalary):
        # iage and isalary are the starting age and salary used to
        # compute the average yearly increase of salary.

        # Compute average yearly increase
        average_age_increase = self._avg_age(data) - iage
        average_salary_increase = self._avg_salary(data) - isalary

        return math.floor(average_salary_increase/average_age_increase)

Step 9 - Advanced refactoring

Commit: cc0b0a1

The initial class didn't have any __init__() method and was thus missing the encapsulation part of the object-oriented paradigm. There was no reason to keep the class, as the stats() method could easily have been extracted and provided as a plain function.

This is much more evident now that we have refactored the method, because we have 10 methods that accept data as a parameter. It would be nice to load the input data into the class at instantiation time and then access it as self.data. This would greatly improve the readability of the class, and also justify its existence.

If we introduce an __init__() method that requires a parameter, however, we will change the class's API, breaking compatibility with every other piece of code that imports and uses it. Since we want to keep that compatibility, we have to devise a way to provide both the advantages of a new, clean class and a stable API. This is not always perfectly achievable, but in this case the Adapter design pattern (also known as Wrapper) solves the issue perfectly.

The goal is to change the current class to match the new API, and then build a class that wraps the first one and provides the old API. The strategy is not that different from what we did previously, only this time we will deal with classes instead of methods. With a stupendous effort of my imagination I named the new class NewDataStats. Sorry, but sometimes you just have to get the job done.

The first thing, as happens very often with refactoring, is to duplicate the code, and when we insert new code we need tests that justify it. The tests will be the same as before, as the new class shall provide the same functionality as the previous one, so I just create a new file called test_newdatastats.py and put there the first test, test_init().

import json

from datastats.datastats import NewDataStats


test_data = [
    {
        "id": 1,
        "name": "Laith",
        "surname": "Simmons",
        "age": 68,
        "salary": "£27888"
    },
    {
        "id": 2,
        "name": "Mikayla",
        "surname": "Henry",
        "age": 49,
        "salary": "£67137"
    },
    {
        "id": 3,
        "name": "Garth",
        "surname": "Fields",
        "age": 70,
        "salary": "£70472"
    }
]


def test_init():

    ds = NewDataStats(test_data)

    assert ds.data == test_data

This test doesn't pass, and the code that implements the class is very simple

class NewDataStats:

    def __init__(self, data):
        self.data = data

Now I can start an iterative process:

  1. I will copy one of the tests of DataStats and adapt it to NewDataStats.
  2. I will copy some code from DataStats to NewDataStats, adapting it to the new API and making it pass the test.

At this point iteratively removing methods from DataStats and replacing them with a call to NewDataStats would be overkill. I'll show you in the next section why, and what we can do to avoid that.

An example of the resulting tests for NewDataStats is the following

def test_ages():

    ds = NewDataStats(test_data)

    assert ds._ages() == [68, 49, 70]

and the code that passes the test is

    def _ages(self):
        return [d['age'] for d in self.data]

Once finished, I noticed that, since methods like _ages() no longer require an input parameter, I can convert them to properties, changing the tests accordingly.

    @property
    def _ages(self):
        return [d['age'] for d in self.data]
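The corresponding test then accesses the property without calling it; a minimal sketch, using the same test data:

def test_ages():

    ds = NewDataStats(test_data)

    assert ds._ages == [68, 49, 70]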

It is time to replace the methods of DataStats with calls to NewDataStats. We could do it method by method, but actually the only thing we really need to replace is stats(). So the new code is

    def stats(self, data, iage, isalary):
        nds = NewDataStats(data)
        return nds.stats(iage, isalary)
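This assumes that NewDataStats exposes a stats() method taking only iage and isalary, along the lines of the following sketch (the final version is on the develop branch):

    def stats(self, iage, isalary):
        return json.dumps(
            self._stats(iage, isalary)
        )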

And since all the other methods are no longer used, we can safely delete them, checking that the remaining tests do not fail. Speaking of tests, removing those methods will make many tests of DataStats fail, so we need to remove those tests as well.

class DataStats:

    def stats(self, data, iage, isalary):
        nds = NewDataStats(data)
        return nds.stats(iage, isalary)

Final words

I hope this little tour of a refactoring session didn't turn out to be too trivial, and that it helped you grasp the basic concepts of this technique. If you are interested in the subject I strongly recommend the classic book by Martin Fowler, "Refactoring: Improving the Design of Existing Code", which is a collection of refactoring patterns. The reference language is Java, but the concepts can easily be adapted to Python.

Feedback

Feel free to use the blog Google+ page to comment on the post, or reach out to me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

July 21, 2017 08:30 AM

July 20, 2017


Reuven Lerner

Globbing and Python’s “subprocess” module

Python’s “subprocess” module makes it really easy to invoke an external program and grab its output. For example, you can say

import subprocess
print(subprocess.check_output('ls'))

and the output is then

$ ./blog.py
b'blog.py\nblog.py~\ndictslice.py\ndictslice.py~\nhexnums.txt\nnums.txt\npeanut-butter.jpg\nregexp\nshowfile.py\nsieve.py\ntest.py\ntestintern.py\n'

subprocess.check_output returns a bytestring with the filenames on my desktop. To deal with them in a more serious way, and to have the ASCII 10 characters actually function as newlines, I need to invoke the “decode” method, which results in a string:

output = subprocess.check_output('ls').decode('utf-8')
print(output)

This is great, until I want to pass one or more arguments to my “ls” command.  My first attempt might look like this:

output = subprocess.check_output('ls -l').decode('utf-8')
print(output)

But I get the following output:

$ ./blog.py
Traceback (most recent call last):
 File "./blog.py", line 5, in <module>
 output = subprocess.check_output('ls -l').decode('utf-8')
 File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
 **kwargs).stdout
 File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 403, in run
 with Popen(*popenargs, **kwargs) as process:
 File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 707, in __init__
 restore_signals, start_new_session)
 File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 1333, in _execute_child
 raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'ls -l'

The most important part of this error message is the final line, in which the system complains that it cannot find the program “ls -l”. That’s right: it thought that the command plus its option was a single program name, and failed to find that program.

Now, before you go and complain that this doesn’t make any sense, remember that filenames may contain space characters. And that there’s no difference between a “command” and any other file, except for the way that it’s interpreted by the operating system. It might be a bit weird to have a command whose name contains a space, but that’s a matter of convention, not technology.

Remember, though, that when a Python program is invoked, we can look at sys.argv, the list of the user’s arguments. sys.argv[0] is always the program’s name itself. We can see an analogue here: when we invoke another program, we also need to pass that program’s name as the first element of a list, and its arguments as the subsequent list elements.

In other words, we can do this:

output = subprocess.check_output(['ls', '-l']).decode('utf-8')
print(output)

and indeed, we get the following:

$ ./blog.py
total 88
-rwxr-xr-x 1 reuven 501 126 Jul 20 21:43 blog.py
-rwxr-xr-x 1 reuven 501 24 Jul 20 21:31 blog.py~
-rwxr-xr-x 1 reuven 501 401 Jul 17 13:43 dictslice.py
-rwxr-xr-x 1 reuven 501 397 Jun 8 14:47 dictslice.py~
-rw-r--r-- 1 reuven 501 54 Jul 16 11:11 hexnums.txt
-rw-r--r-- 1 reuven 501 20 Jun 25 22:24 nums.txt
-rw-rw-rw- 1 reuven 501 51011 Jul 3 13:51 peanut-butter.jpg
drwxr-xr-x 6 reuven 501 204 Oct 31 2016 regexp
-rwxr-xr-x 1 reuven 501 1669 May 28 03:03 showfile.py
-rwxr-xr-x 1 reuven 501 143 May 19 02:37 sieve.py
-rw-r--r-- 1 reuven 501 0 May 28 09:15 test.py
-rwxr-xr-x 1 reuven 501 72 May 18 22:18 testintern.py

So far, so good.  Notice that check_output can thus get either a string or a list as its first argument.  If we pass a list, we can pass additional arguments, as well:

output = subprocess.check_output(['ls', '-l', '-F']).decode('utf-8')
print(output)

As a result of adding the “-F’ flag, we now get a file-type indicator at the end of every filename:

$ ls -l -F
total 80
-rwxr-xr-x 1 reuven 501 137 Jul 20 21:44 blog.py*
-rwxr-xr-x 1 reuven 501 401 Jul 17 13:43 dictslice.py*
-rw-r--r-- 1 reuven 501 54 Jul 16 11:11 hexnums.txt
-rw-r--r-- 1 reuven 501 20 Jun 25 22:24 nums.txt
-rw-rw-rw- 1 reuven 501 51011 Jul 3 13:51 peanut-butter.jpg
drwxr-xr-x 6 reuven 501 204 Oct 31 2016 regexp/
-rwxr-xr-x 1 reuven 501 1669 May 28 03:03 showfile.py*
-rwxr-xr-x 1 reuven 501 143 May 19 02:37 sieve.py*
-rw-r--r-- 1 reuven 501 0 May 28 09:15 test.py
-rwxr-xr-x 1 reuven 501 72 May 18 22:18 testintern.py*

It’s at this point that we might naturally ask: What if I want to get a file listing of one of my Python programs? I can pass a filename as an argument, right?  Of course:

output = subprocess.check_output(['ls', '-l', '-F', 'sieve.py']).decode('utf-8')
print(output)

And the output is:

-rwxr-xr-x 1 reuven 501 143 May 19 02:37 sieve.py*

Perfect!

Now, what if I want to list all of the Python programs in this directory?  Given that this is a natural and everyday thing we do on the command line, I give it a shot:

output = subprocess.check_output(['ls', '-l', '-F', '*.py']).decode('utf-8')
print(output)

And the output is:

$ ./blog.py
ls: cannot access '*.py': No such file or directory
Traceback (most recent call last):
 File "./blog.py", line 5, in <module>
 output = subprocess.check_output(['ls', '-l', '-F', '*.py']).decode('utf-8')
 File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
 **kwargs).stdout
 File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
 output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ls', '-l', '-F', '*.py']' returned non-zero exit status 2.

Oh, no!  Python thought that I was trying to find the literal file named “*.py”, which clearly doesn’t exist.

It’s here that we discover that when Python runs external programs, it does so on its own, without making use of the Unix shell’s expansion capabilities. Such expansion, often known as “globbing,” is available via the “glob” module in the Python standard library. We could use that to get a list of files, but it seems weird that when I invoke a command-line program, I can’t rely on it to expand the argument.
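For completeness, here is a sketch of that glob-based alternative, expanding the pattern ourselves and keeping the default shell=False:

import glob
import subprocess

# Expand *.py in Python, then pass the real filenames as arguments.
# Note: if the list is empty, "ls -l -F" will simply list the whole directory.
python_files = glob.glob('*.py')
output = subprocess.check_output(['ls', '-l', '-F'] + python_files).decode('utf-8')
print(output)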

But wait: Maybe there is a way to do this!  Many functions in the “subprocess” module, including check_output, have a “shell” parameter whose default value is “False”. But if I set it to “True”, then a Unix shell is invoked between Python and the command we’re running. The shell will surely expand our star, and let us list all of the Python programs in the current directory, right?

Let’s see:

output = subprocess.check_output(['ls', '-l', '-F', '*.py'], shell=True).decode('utf-8')
print(output)

And the results:

$ ./blog.py
blog.py
blog.py~
dictslice.py
dictslice.py~
hexnums.txt
nums.txt
peanut-butter.jpg
regexp
showfile.py
sieve.py
test.py
testintern.py

Hmm. We didn’t get an error.  But we also didn’t get what we wanted.  This is mighty strange.

The solution, it turns out, is to pass everything (the command and its arguments, including the *.py) as a single string, not as a list. When you invoke commands with shell=True, you’re telling Python to hand your string to the shell, which breaks apart the arguments and expands them. If you instead pass a list together with shell=True, only the first element is used as the command string and the remaining elements become arguments to the shell itself, which is why only a bare “ls” ran above. And indeed, with shell=True and a string as the first argument, subprocess.check_output does the right thing:

output = subprocess.check_output('ls -l -F *.py', shell=True).decode('utf-8')
print(output)

And the output from our program is:

$ ./blog.py
-rwxr-xr-x 1 reuven 501 141 Jul 20 22:03 blog.py*
-rwxr-xr-x 1 reuven 501 401 Jul 17 13:43 dictslice.py*
-rwxr-xr-x 1 reuven 501 1669 May 28 03:03 showfile.py*
-rwxr-xr-x 1 reuven 501 143 May 19 02:37 sieve.py*
-rw-r--r-- 1 reuven 501 0 May 28 09:15 test.py
-rwxr-xr-x 1 reuven 501 72 May 18 22:18 testintern.py*

The bottom line is that you can get globbing to work when invoking commands via subprocess.check_output. But you need to know what’s going on behind the scenes, and what shell=True does (and doesn’t) do, to make it work.

The post Globbing and Python’s “subprocess” module appeared first on Lerner Consulting Blog.

July 20, 2017 07:12 PM


PyCharm

PyCharm 2017.2 RC

We’ve been putting the finishing touches on PyCharm 2017.2, and we have a release candidate ready! Go get it on our website.

Fixes since the last EAP:

As this is a release candidate, it does not come with a 30-day EAP license. If you don’t have a license for PyCharm Professional Edition, you can use a trial license.

Even though this is not called an EAP version anymore, our EAP promotion still applies! If you find any issues in this version and report them on YouTrack, you can win prizes in our EAP competition.

To get all EAP builds as soon as we publish them, set your update channel to EAP (go to Help | Check for Updates, click the ‘Updates’ link, and then select ‘Early Access Program’ in the dropdown). If you’d like to keep all your JetBrains tools up to date, try JetBrains Toolbox!

-PyCharm Team
The Drive to Develop

July 20, 2017 03:37 PM