
Planet Python

Last update: March 29, 2015 04:59 PM

March 29, 2015


Dave Haynes

Scalable design is within GRASP

Turberfield is just about ready for release now. I'll put a link to the demo at the end of this post. There are a few things I discovered along the way which I thought I'd write up for you.

I have been hacking away on this for nearly six months, pausing several times to refactor. Now that I'm finally happy with how it goes together, it occurs to me that I don't know why; after all that work I can't articulate how the design is right.

So for safety, I look about for some way to validate all these ad-hoc decisions. I come across a book I've owned for many years, which has been very useful in the past, though I've not looked at it in a while. I find in there the concepts which match the code I've written. Those concepts might be useful to you too if you're writing a modern Python application.

For the purposes of this article, let's just say that Turberfield has to scale. Scale to more users, scale to larger deployments, scale to more developers.

Question is, how do I design something I'm confident will scale in these ways? The answer lies in a set of principles called GRASP.

Got to be a loose fit

How to scale to more developers? I'm going to skate over this today, since I cover these aspects elsewhere.

Suffice to say you should use Python namespace packages as the means to componentise your project. Entry points are the way to provide a plugin mechanism for customised behaviour.
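For anyone unfamiliar with that machinery, here is a minimal sketch of the setuptools side of it; the project, namespace and entry-point group names are made up for illustration and are not Turberfield's own:

# setup.py for a hypothetical add-on package living in a shared namespace.
from setuptools import setup, find_packages

setup(
    name="myproject-weather",
    version="0.1.0",
    packages=find_packages(),
    namespace_packages=["myproject"],      # componentise under one namespace
    entry_points={
        # The host application discovers plugins registered under this group.
        "myproject.plugins": [
            "weather = myproject.weather.plugin:Weather",
        ],
    },
)

The host side then only needs pkg_resources.iter_entry_points("myproject.plugins") to load whatever behaviour happens to be installed, without importing any plugin by name.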

What you're doing here is creating what GRASP calls Loose Coupling between the components in your application. By agreeing to Polymorphic behaviour in those components, you Protect Variation in the deployment of your software.

OK, that's covered; let's skip on to the new stuff!

Too much too young

While I was pondering scalability, I became aware of a phenomenon I'd seen in some other projects which claim to be 'distributed' or 'designed to scale'.

These projects often base themselves around a message queue technology like RabbitMQ or ZeroMQ. Those technologies may be good, but they are not fairy dust.

If your software project cannot work without a message queue infrastructure, it's not designed to scale, it's irreducibly complex. In fact, to be truly scalable, software first must exist as a tiny thing, but contain within it a pattern for growth.

GRASP mentions a design pattern called Controller. It's not a new concept; simply that the business rules and logic of your program should be separate from GUI code, database access, etc.

In Python, the simplest executable unit is a single module which runs from the command line. I organised Turberfield so that it consists of small programs which can either run by themselves or get invoked from a central point.

So when you launch the GUI (it's a Bottle web app), the code invokes one or more controllers, passing to them the command line arguments you gave it:

import concurrent.futures
import bottle
import turberfield.machina.demo.simulation

# 'args' is the parsed command line; 'app' is the Bottle application object.
with concurrent.futures.ProcessPoolExecutor() as executor:
    future = executor.submit(
        turberfield.machina.demo.simulation.main, args
    )
    bottle.run(app, host="localhost", port=8080)

Or else, if you're debugging or working in batch mode, you can do away with the GUI and just run one controller on its own:

$ python -m turberfield.machina.demo.simulation -v

Inspectable

So if Turberfield consists of many collaborating controllers, how do they communicate?

Firstly, there must be a shared understanding of what information they are passing. A useful piece of Indirection here is the concept of events. Each controller generates (and may consume) a series of time-ordered events.

Secondly, the data format must be flexible, and friendly to human inspection. For this I selected RSON which has a number of attractive features. It's essentially a superset of JSON which has the useful property that if you concatenate RSON files the result remains a valid sequence of objects.
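To see why that property is handy, here is a rough sketch only; it uses plain JSON objects, which are also valid RSON, rather than Turberfield's real event format. A reader can walk a log of concatenated records with a streaming decode:

import json

# A hypothetical event log: each broadcast appends one object to the file.
log = '{"t": 0, "actor": "player", "action": "enter"}\n' \
      '{"t": 1, "actor": "door", "action": "open"}\n'

decoder = json.JSONDecoder()
events, pos = [], 0
while pos < len(log):
    obj, pos = decoder.raw_decode(log, pos)   # parse the next object in place
    events.append(obj)
    while pos < len(log) and log[pos].isspace():
        pos += 1                              # skip whitespace between objects

print(events)

So a controller can simply keep appending events, and anything tailing the file always sees a valid sequence of objects.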

This deserves a picture, and so there's a diagram of it all below. The square boxes are our controllers (W is the web tier). One controller can broadcast to others by writing an RSON file of events.

A diagram of the components of a scalable system.

What do you mean, you need a break from the coroutine?

To promote High Cohesion within a controller, we separate the functionality out into classes, each with a specific behaviour. They are the numbered ellipses in the diagram above.

We distribute the desired behaviour of the controller among classes according to which class keeps the data necessary for that particular piece of logic. This is a GRASP principle too. It means our classes are Information Experts.

After working on Turberfield for a while I discovered that these experts must:

  • operate autonomously within an event loop
  • publish certain attributes to other experts within the same controller
  • publish certain attributes to other controllers via RSON
  • listen on a queue for messages from another expert
  • listen on a queue for messages from another controller
  • be configured with global settings like file paths and host names

This is quite a shopping list of features and it took me a while to work out how to do it in a Pythonic way.

Turberfield embraces asyncio whole-heartedly, to the degree that each Expert is a callable object and a coroutine. Each also has a formalised public interface to publish its attributes, and a static method for configuring options. There's a convention I defined on initialisation which lets you connect the Expert to an arbitrary number of message queues.
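In case that shopping list is hard to picture, here is a toy sketch of the shape being described. It is not Turberfield's actual API, and it uses the newer async/await spelling rather than the Python 3.4-era coroutine syntax the project would have used at the time:

import asyncio

class Expert:
    # Hypothetical: a callable object that is also a coroutine.

    @staticmethod
    def options(output_path="demo.rson"):
        # Global settings such as file paths and host names go here.
        return {"output": output_path}

    def __init__(self, opts, *queues):
        # Convention: connect the Expert to any number of message queues.
        self.opts = opts
        self.queues = queues
        self.public = {}                     # attributes published to peers

    async def __call__(self):
        # Operate autonomously within the event loop.
        while True:
            for q in self.queues:
                msg = await q.get()
                self.public["latest"] = msg  # publish what was just learnt

async def demo():
    q = asyncio.Queue()
    expert = Expert(Expert.options(), q)
    task = asyncio.ensure_future(expert())   # schedule the callable coroutine
    await q.put({"event": "tick"})
    await asyncio.sleep(0)                   # give the Expert a turn
    print(expert.public)
    task.cancel()

asyncio.run(demo())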

If you propagate this, then your children will speak REST

A web server is a very handy thing to have around. Whether or not your application is web-based, when you scale it out so that it's deployed to more than one server, HTTP becomes a credible option for host-to-host communications.

In Turberfield, controllers have the ability to publish JSON data which is served outward by the web tier. That web tier accepts REST API calls and feeds these back as messages over a POSIX named pipe to the controller again.
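The pipe half of that round trip is less familiar to many web developers than the JSON half, so here is a hedged sketch of the idea; the path and message shape are invented for illustration and are not what Turberfield actually uses:

import json
import os

PIPE = "/tmp/turberfield-demo.fifo"

if not os.path.exists(PIPE):
    os.mkfifo(PIPE)                  # create the POSIX named pipe once

def forward(payload):
    # Opening a FIFO for writing blocks until a controller opens it to read,
    # so the web tier and the controller rendezvous on the filesystem.
    with open(PIPE, "w") as pipe:
        pipe.write(json.dumps(payload) + "\n")

# A REST call arriving at the web tier becomes a message on the pipe, e.g.:
# forward({"verb": "POST", "path": "/move", "data": {"x": 3, "y": 4}})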

GRASP calls this evolution of microservices Pure Fabrication. So a concrete example is clearly required. You can download turberfield-utils (GPL) and turberfield-machina (AGPL) to try it for yourself.

Happy

Turberfield still has a long way to go, but the current demo proves the concept. You run it like this:

$ pip install turberfield-machina
$ python -m turberfield.machina.demo.main

Point your browser to localhost:8080 and you'll see a simple web-based adventure game.

What I'm particularly happy about is that there were only nine lines of JavaScript to write. All the rest is HTML5, SVG, and scalable asynchronous Python 3. And that's the future of the distributed web.

Now, if there's one thing worse than working to a deadline, it's not working to a deadline. So I'm bashing out the documentation for Turberfield to make it a suitable library for the next PyWeek. I guess that'll be in May.

Oh, and I'll be available for professional engagements from the end of April. You can get in touch via the Contact page.

March 29, 2015 04:46 PM

March 27, 2015


Gaël Varoquaux

Euroscipy 2015: Call for papers

EuroScipy 2015, the annual conference on Python in science, will take place in Cambridge, UK, on 26-30 August 2015. The conference features two days of tutorials followed by two days of scientific talks & posters and an extra day dedicated to developer sprints. It is the major event in Europe in the field of technical/scientific computing within the Python ecosystem. Scientists, PhDs, students, data scientists, analysts, and quants from more than 20 countries attended the conference last year.

The topics presented at EuroSciPy are very diverse, with a focus on advanced software engineering and original uses of Python and its scientific libraries, either in theoretical or experimental research, from both academia and the industry.

Submissions for posters, talks & tutorials (beginner and advanced) are welcome on our website at http://www.euroscipy.org/2015/. Sprint proposals should be addressed directly to the organisation at euroscipy-org@python.org.

Important dates:

We look forward to an exciting conference and hope to see you in Cambridge.

The EuroSciPy 2015 Team - http://www.euroscipy.org/2015/

March 27, 2015 11:00 PM


PyCharm

Feature Spotlight: Python remote development with PyCharm

Happy Friday everyone,

In today’s blog post I’m going to cover some basic principles and features in PyCharm that make Python remote development easy as pie. To demonstrate them I’ll use a very simple flask web application from the official flask github repository. Enjoy the read!

First I clone the official flask repository from https://github.com/mitsuhiko/flask. Then from the PyCharm’s Welcome screen I open the blueprintexample directory, which stores the source of the super simple flask application I’m going to use for the demo:

flask1

PyCharm opens the directory and creates a project based on it:

flask2

Now I’m going to set up the remote machine to start with the remote development. I use Vagrant which PyCharm offers great support for. In one of my previous blog posts I already covered Vagrant integration, so here are just the straight steps to provision and run a VM. I go to Tools | Vagrant | Init in Project Root and select the Ubuntu 14.04 image which was previously imported from the collection of vagrant boxes. This creates the Vagrantfile in the project root. Now, I’m going to edit this file to configure a private network and make my VM visible from my host machine:

flask3

Next, I run the VM with Tools | Vagrant | Up and PyCharm shows me that the VM is up and running:

flask4

We can open a local terminal inside PyCharm to test the VM:

flask5

Alright, the VM responds to ping. Now, I want to run my web application on the VM, so I need to copy my project sources to the remote host. This is easily done with the Deployment tool inside PyCharm.
I go to Tools | Deployment | Configuration and specify the connection details for my VM:

flask6

On the Mappings tab in the same window I specify the path mapping rule:

flask7

In my case I want my current local project directory blueprintexample to be mapped to remote /home/vagrant/blueprintremote.
Now I can right-click my project in the project view and select Upload to:

flask9

And this will upload my project to the specified directory on the remote machine:

flask10

One of the handiest features is that you can set up automatic upload to the remote machine by simply clicking Tools | Deployment | Automatic Upload:

flask11

From this point on, all changes made locally will be uploaded to the remote machine automatically, so you don’t need to worry about having fresh sources on the remote host. Cool, isn’t it?

So now, I’m going to modify one of the files in my project so the flask application will be visible remotely (adding host='0.0.0.0' as a parameter to the app.run() call), and PyCharm automatically uploads the changes to the remote machine:

flask12
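For reference, the change in question is tiny; assuming the standard Flask development-server entry point, it amounts to something like this (0.0.0.0 simply tells the dev server to listen on all interfaces so the app is reachable from outside the VM):

if __name__ == '__main__':
    app.run(host='0.0.0.0')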

Next, I specify the python interpreter to be used for my project. I go to File | Settings (Preferences for Mac OS) | Project | Project Interpreter. By default, PyCharm sets the local Python interpreter as a project interpreter, so I’ll change it to the remote one:

flask13

As I’ve already created a deployment configuration, PyCharm offers to export Python interpreter settings from the existing deployment configuration:

flask14

But I can also specify the remote interpreter manually, using SSH credentials or a Vagrant configuration. Here I’ll do it manually:

flask15

After I specify the new remote python interpreter, PyCharm starts indexing and finds that the flask package is not installed on the project interpreter:

flask16

I can fix this easily with Alt + Enter on the unresolved reference error highlighted with red:

flask17

Alright. Now everything is ok, so we can finally specify Run/Debug configuration and launch our application. Let’s go to Run | Edit Configurations and add a new Python run/debug configuration:

flask18

In the Run/Debug configuration dialog, I specify the name for my new configuration and the script to be executed on the remote host. PyCharm sets the project interpreter (remote in this case) by default for this new run configuration, and finally I need to specify path mappings for this particular run configuration:

flask19

It seems we’re all set. I click the Run button:

flask20

PyCharm shows that the application is up and running on port 5000 on the VM.
I open the browser to check that the application is really working:

flask21

From this point on, we can work with this project like with a normal local project. PyCharm takes care of uploading any changes to the remote machine and keeps the VM up and running.

With the same Run/Debug configuration, we can do a simple remote debug session putting a few breakpoints right in the editor:

flask22

Click the debug button or go to Run | Debug:

flask23

That’s it! Hopefully you’ll appreciate this remote development functionality in PyCharm that makes Python remote development a breeze.

If you’re still craving more details on PyCharm remote development capabilities, as well as other remote development features, please see the online help.

Talk to you next week,
Dmitry

March 27, 2015 09:30 PM


Python Software Foundation

To sublicense or not to sublicense? That is the election.

Earlier this month, the PSF opened an election on two issues: the first was a straightforward vote on the adoption of new Sponsor Members; the second was more experimental: a non-binding vote for the membership to weigh in on a complex issue to be decided by the Board. This poll was part of the larger project (featured in several recent blog posts; see, for example, Let’s Make Decisions) to make the PSF a more inclusive, diverse, and democratic organization.

Source: National Museum of American History. PD-USGOV
The election was closed yesterday, March 26th. The results can be found at Election 9 Results and are as follows:
Sponsor Members Bloomberg LP, Fastly, and Infinite Code were all voted in by large margins:

Sponsor Member Candidates (yes / no / abstain)
  Bloomberg LP: 174 / 7 / 28
  Fastly: 193 / 3 / 13
  Infinite Code: 147 / 13 / 49
The second issue:
The PSF Board of Directors is seeking the collective perspective of PSF Voting Members on the appropriate handling of video recording sublicensing for presentations at PyCon US (see Membership Vote). 
This poll sought members' views along two dimensions of the sublicensing issue: the entities to whom licenses should be granted; and the timeframe of the videos to be licensed.
The results of the poll were quite divided.
Sublicense entities (votes)
  Only YouTube (others embedding): 17
  As many mirrors as possible: 104
  Only non-commercial mirrors: 68

Sublicense timeframe (votes)
  Prospectively only: 87
  Including retroactively: 88
  Not applicable: 9
The PSF wishes to thank everyone who participated. This input from the membership is extremely valuable to the PSF, and this was a useful first run at the use of non-binding polls.
There will be a lot more discussion around this topic while the Board continues to weigh pros and cons prior to making the decision that best supports the interests of the membership. Please feel free to comment on this Blog, on Twitter, to the PSF (or in the Hallway Track in Montreal).
I would love to hear from readers. Please send feedback, comments, or blog ideas to me at msushi@gnosis.cx.

March 27, 2015 07:33 PM


EasyGUI

Easygui project shuts down

Effective March 6, 2013, I am shutting down the EasyGui project.

The EasyGui software will continue to be available at its current location, but I will no longer be supporting, maintaining, or enhancing it.

The reasons for this decision are personal, and not very interesting. I’m older now, and retired. I no longer do software development, in any programming language. I have other interests that I find more compelling. I spend time with my family. I play and promote petanque. Life is good, but it is different.

During the course of my software development career I’ve had occasion to shut down a number of projects. On every occasion when I turned over a project to a new owner, the results were disappointing. Consequently, I have decided to shut down the EasyGui project rather than to try to find a new owner for it.

The EasyGui software will remain frozen in its current state. I invite anyone who has the wish, the will, the energy, and the vision to continue to evolve EasyGui, to do so. Copy it, fork it, and make it the basis for your own new work.

— Steve Ferg, March 6, 2013

March 27, 2015 07:09 PM


Python Software Foundation

Let's make decisions together!

Personal opinion: I think it’s always a good idea to periodically revisit one’s purpose and basic goals; speaking from experience, getting lost is inefficient and no fun at all.
Photo credit: Gerd Altmann; License: CC0
So, let's review:
The mission of the PSF is to:
[…] promote, protect, and advance the Python programming language, and to support and facilitate the growth of a diverse and international community of Python programmers.
The PSF takes this mission seriously. Last year, the Board of Directors changed the membership by-laws in order to make the PSF a more inclusive and diverse organization. Since then, the PSF leadership has been working on ways to build on that change. The recent non-binding poll of voting members (PSF Blog) is one such tactic. Another is a new procedure for strategic decision-making recently proposed by PSF Director Nick Coghlan.
Last week, Nick posted this proposal on the Members' List for discussion (it's also on the Python wiki).
According to Nick,
One step we are proposing is to have a more open strategic decision making process where significant decisions which don’t need to be made quickly, and which don’t require any confidentiality, can be discussed with the full PSF membership before being placed before the Board as a proposed resolution.
The new guidelines are similar to the process used for Python Enhancement Proposals (PEPs), whereby developers and user groups make suggestions for design decisions to amend the Python language. Nick also took inspiration from Red Hat’s “Open Decision Making Framework” and the Fedora change approval process.
Since this proposal is itself the first instance of its use (in what Nick calls “a delightfully meta exercise”), it’s important that the membership review it and offer feedback. And if you’re not a member but would like to become one, see Enroll as a Voting Member to sign up.

Below I’ve excerpted some of the basic ideas from the text of the proposal, but I urge members to read the entire draft before weighing in.  
PSF Strategic Decision Making Process 
The primary mechanism for strategic decision making in the PSF is through resolutions of the PSF Board. Members of the PSF Board of Directors are elected annually in accordance with the PSF Bylaws, and bear the ultimate responsibility for determining “how” the PSF pursues its mission [...]
However, some proposed clarifications of or changes to the way the PSF pursues its strategic priorities are of sufficient import that they will benefit from a period of open discussion amongst the PSF membership prior to presentation for a Board resolution [...]  
Non-binding polls of PSF Voting Members  
At their discretion, the PSF Board may choose to include non-binding polls in ballots issued to PSF members [...]
Proposals for Discussion  
Any PSF Member (including Basic Members) may use the Wiki to submit a proposal for discussion with the full PSF membership [...] 
Proposals for Resolution
Any PSF Director or Officer may determine that a particular proposal is ready for resolution [...]
Proposals submitted for resolution will be resolved either directly by a Board resolution, or, at the Board’s discretion, by a full binding vote of eligible PSF Voting Members.
Nick is also currently drafting proposed guidelines for “PSF Strategic Priorities” and for procedures for recognition and promotion to the designation of “PSF Fellow."

Stay tuned to the members' list and to this blog to stay informed and to participate in the discussion and adoption of these additional proposals to improve the PSF's role as an organization that truly reflects and supports the needs and views of its membership.

I would love to hear from readers. Please send feedback, comments, or blog ideas to me at msushi@gnosis.cx.

March 27, 2015 05:10 PM


Astro Code School

Meet Caktus CTO Colin Copeland

This is the third post in a series of interviews about the people at Astro Code School. This one is about Colin Copeland, the CTO and Co-Founder of Caktus Consulting Group. He’s one of the people who came up with the idea for Astro Code School and a major contributor to its creation.

Where were you born?

Oberlin, Ohio

What was your favorite childhood pastime?

Spending time with friends during the Summer.

Where did you go to college and what did you study?

I went to Earlham College and studied Computer Science.

How did you become a CTO of the nation's largest Django firm?

I collaborated with the co-founders on a software engineering project. We moved to North Carolina to start the business. I was lucky to have met them!

How did you and the other Caktus founders come up with the idea to start Astro Code School?

Caktus has always been involved with trainings and trying to contribute back to the Django community where possible, from hosting Django sprints to leading public and private Django trainings on best practices. We're excited to see the Django community grow and saw an opportunity to focus our training services with Astro.

What is one of your favorite things about Python?

Readability. Whether it's reading through some of my old source code or diving into a new open source project, I feel like you can get your bearings quickly and feel comfortable learning or re-learning the code. The larger Django and Python communities are also very welcoming and friendly to new and long time members.

Who are your mentors and how have they influenced you?

So many, but especially my Caktus business partners and colleagues.

Do you have any hobbies?

I'm co-captain of the Code for Durham Brigade.

Which is your favorite Sci-fi or Fantasy fiction? Why?

Sci-fi. I've always loved the books Neuromancer and Snow Crash. Recently I've been enjoying the Silo science fiction series.

March 27, 2015 03:30 PM


Caktus Consulting Group

Welcome to Our New Staff Members

We’ve hit one of our greatest growth points yet in 2015, adding nine new team members since January to handle our increasing project load. There are many exciting things on the horizon for Caktus and our clients, so it’s wonderful to have a few more hands on deck.

One of the best things about working at Caktus is the diversity of our staff’s interests and backgrounds. In order of their appearance from left to right in the photos above, here’s a quick look at our new Cakti’s roles and some fun facts:

Neil Ashton

Neil was also a Caktus contractor who has made the move to full-time Django developer. He is a keen student of more than programming languages; he holds two degrees in Classics and another Master’s in Linguistics.

Jeff Bradberry

Though Jeff has been working as a contractor at Caktus, he recently became a full-time developer. In his spare time, he likes to play around with artificial intelligence, sometimes giving his creations a dose of inexplicable, random behavior to better mimic us poor humans.

Ross Pike

Ross is our new Lead Designer and has earned recognition for his work from Print, How Magazine, and the AIGA. He also served in the Peace Corps for a year in Bolivia on a health and water mission.

Lucas Rowe

Lucas joins us for six months as a game designer, courtesy of a federal grant to reduce the spread of HIV. When he’s not working on Epic Allies, our HIV medication app, he can be found playing board games or visiting local breweries.

Erin Mullaney

Erin has more than a decade of development experience behind her, making her the perfect addition to our team of Django developers. She loves cooking healthy, vegan meals and watching television shows laden with 90s nostalgia.

Liza Chabot

Liza is an English major who loves to read, write, and organize, all necessary skills as Caktus’ Administrative and Marketing Assistant. She is also a weaver and sells and exhibits her handwoven wall hangings and textiles in the local craft community.

NC Nwoko

NC’s skills are vast in scope. She graduated from UNC Chapel Hill with a BA in Journalism and Mass Communication with a focus on public relations and business as well as a second major in International Studies with a focus on global economics. She now puts this experience to good use as Caktus’ Digital Health Product Manager, but on the weekends you can find her playing video games and reading comic books.

Edward Rowe

Edward is joining us for six months as a game developer for the Epic Allies project. He loves developing games for social good. Outside of work, Edward continues to express his passion for games as an avid indie game developer, UNC basketball fan, and board and video game player.

Rob Lineberger

Rob is our new Django contractor. Rob is a renaissance man; he’s not only a skilled and respected visual artist, he’s trained in bioinformatics, psychology, information systems and knows his way around the kitchen.

To learn more about our team, visit our About Page. And if you’re wishing you could spend your days with these smart, passionate people, keep in mind that we’re still hiring.

March 27, 2015 12:00 PM


Montreal Python User Group

Montréal-Python 53: Sanctified Terabit + MTLData + DevOpsMTL + DockerMTL

affiche-mp53

Sketch credit Cynthia Savard

If PyCon is not enough, Montréal-Python has the solution: a meetup! Now that your first incredible day of sprints is over, we are bringing on stage some of PyCon's superstar presenters for encore presentations.

This special Montréal-Python edition will be co-organized by MTLData, DevOpsMTL and DockerMTL.

Trey Causey: Scalable Machine Learning in Python using GraphLab Create

I'll be giving an overview of how to use GraphLab Create to quickly build scalable predictive models and deploy them to production using just an IPython notebook on a laptop.

Nina Zakharenko: Technical Debt - The code monster in everyone's closet

Technical debt is the code monster hiding in everyone's closet. If you ignore it, it will terrorize you at night. To banish it and re-gain your productivity, you'll need to face it head on.

Olivier Grisel: What's new in scikit-learn 0.16 and what's cooking in the master branch.

Scikit-learn is a Machine Learning library from the Python data ecosystem. Olivier will give an overview and some demos of the (soon to be | recently) released 0.16.0 version.

Jérome Petazzoni: Deep dive into Docker storage drivers

We will present how aufs and btrfs drivers compare from a high-level perspective, explaining their pros and cons. This will help the audience to make more informed decisions when picking the most appropriate driver for their workloads.

Pierre-Yves David: Mercurial, with real python bites

In this talk, we'll go over the advantages of Python that helped the project both in its early life, when so many features needed to be implemented, and nowadays, when major companies like Facebook bet on Mercurial for scaling. We'll also point out the drawbacks of choosing Python and how some workarounds had to be found. Finally, we'll look at how the choice of Python has an impact on users too, with a demonstration of the extensions system.

Thanks also to our special sponsors for this event: Docker Inc. and LightSpeed Retail

When:

Monday, April 13th, 2015

Where

Notman House 51 Rue Sherbrooke West, Montréal, QC H2X 1X2 https://goo.gl/maps/rg4jI

How

Just grab a ticket here: http://pycon-python-data-devops-docker.eventbrite.ca

Schedule:

We’d like to thank our sponsors for their ongoing support:

March 27, 2015 04:00 AM


Vasudev Ram

Which which is which?

By Vasudev Ram

Recently I had blogged about which.py, a simple Python program that I wrote, here:

A simple UNIX-like 'which' command in Python

I also posted the same program on ActiveState Code:

A UNIX-like "which" command for Python (Python recipe)

A reader there, Hongxu Chen, pointed out that my which.py actually implemented the variant "which -a", that is, the UNIX which with the -a (or --all) option included. This variant displays not just the first full pathname of an occurrence of the searched-for name in the PATH (environment variable), but all such occurrences, in any directory in the PATH. That was not what I had intended. It was a bug. I had intended it to only show the first occurrence.

So I rewrote the program to fix that bug, and also implemented the -a option properly - i.e. when -a (or its long form, --all) is given, find all occurrences; otherwise only find the first. Here is the new version:

from __future__ import print_function

# which.py
# A minimal version of the UNIX which utility, in Python.
# Also implements the -a or --all option.
# Author: Vasudev Ram - www.dancingbison.com
# Copyright 2015 Vasudev Ram - http://www.dancingbison.com

import sys
import os
import os.path
import stat

def usage():
    sys.stderr.write("Usage: python which.py [ -a | --all ] name ...\n")
    sys.stderr.write("or: which.py [ -a | --all ] name ...\n")

def which(name, all):
    for path in os.getenv("PATH").split(os.path.pathsep):
        full_path = path + os.sep + name
        if os.path.exists(full_path):
            print(full_path)
            if not all:
                break

def main():
    if len(sys.argv) < 2:
        usage()
        sys.exit(1)
    if sys.argv[1] in ('-a', '--all'):
        # Print all matches in PATH.
        for name in sys.argv[2:]:
            which(name, True)
    else:
        # Stop after printing first match in PATH.
        for name in sys.argv[1:]:
            which(name, False)

if "__main__" == __name__:
    main()

I tested it some and it seems to be working okay both with and without the -a option now. After more testing, I'll upload it to my Bitbucket account.
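For illustration, a run might look something like this on a Linux box (hypothetical output; the paths depend entirely on your own PATH):

$ python which.py ls
/bin/ls

$ python which.py -a python
/usr/local/bin/python
/usr/bin/python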

- Vasudev Ram - Online Python training and programming

Dancing Bison Enterprises


March 27, 2015 02:52 AM


Mikko Ohtamaa

Testing web hook HTTP API callbacks with ngrok in Python

Today many API services provide webhooks calling back your website or system over HTTP. This enables simple third-party interprocess communications and notifications for websites. However, unless you are running in production, you often find yourself in a situation where it is not possible to get an Internet-exposed HTTP endpoint on a publicly accessible IP address. These situations may include your home desktop, a public Wi-Fi access point, or continuous integration services. Thus, developing or testing against webhook APIs becomes painful for contemporary nomad developers.

Screen Shot 2015-03-26 at 17.46.39

ngrok (source) is a pay-what-you-want service to create HTTP tunnels through third-party relays. What makes ngrok attractive is that registration is dead simple with Github credentials and upfront payments are not required. ngrok is also open source, so you can run your own relay for sensitive traffic.

In this blog post, I present a Python solution for programmatically creating ngrok tunnels on demand. This is especially useful for webhook unit tests, as you have zero-configuration tunnels available anywhere you run your code. ngrok is spawned as a controlled subprocess for a given URL. Then, you can tell your webhook service provider to use this URL to make calls back to your unit tests.

One could use ngrok completely login-free, but in this case you lose the ability to name your HTTP endpoints. I have found it practical to have control over the endpoint URLs, as this makes debugging much easier.

For real-life usage, you can check the cryptoassets.core project, where I came up with this ngrok method. ngrok successfully tunneled me out from the drone.io CI service and my laptop.

Installation

Installing ngrok on OSX from Homebrew:

brew install ngrok

Installing ngrok for Ubuntu:

apt-get install -y unzip
cd /tmp
wget -O ngrok.zip "https://api.equinox.io/1/Applications/ap_pJSFC5wQYkAyI0FIVwKYs9h1hW/Updates/Asset/ngrok.zip?os=linux&arch=386&channel=stable"
unzip ngrok
mv ngrok /usr/local/bin

Official ngrok download, self-contained zips.

Sign up for the ngrok service and grab your auth token.

Export auth token as an environment variable in your shell, don’t store it in version control system:

export NGROK_AUTH_TOKEN=xxx

Ngrok tunnel code

Below is Python 3 code for NgrokTunnel class. See the full source code here.

import os
import time
import uuid
import logging
import subprocess
from distutils.spawn import find_executable


logger = logging.getLogger(__name__)


class NgrokTunnel:

    def __init__(self, port, auth_token, subdomain_base="zoq-fot-pik"):
        """Initalize Ngrok tunnel.

        :param auth_token: Your auth token string you get after logging into ngrok.com

        :param port: int, localhost port forwarded through tunnel

        :param subdomain_base: Each new tunnel gets a generated subdomain. This is the prefix used for a random string.
        """
        assert find_executable("ngrok"), "ngrok command must be installed, see https://ngrok.com/"
        self.port = port
        self.auth_token = auth_token
        self.subdomain = "{}-{}".format(subdomain_base, str(uuid.uuid4()))

    def start(self, ngrok_die_check_delay=0.5):
        """Starts the thread on the background and blocks until we get a tunnel URL.

        :return: the tunnel URL which is now publicly open for your localhost port
        """

        logger.debug("Starting ngrok tunnel %s for port %d", self.subdomain, self.port)

        self.ngrok = subprocess.Popen(["ngrok", "-authtoken={}".format(self.auth_token), "-log=stdout", "-subdomain={}".format(self.subdomain), str(self.port)], stdout=subprocess.DEVNULL)

        # See that we don't instantly die
        time.sleep(ngrok_die_check_delay)
        assert self.ngrok.poll() is None, "ngrok terminated abruptly"
        url = "https://{}.ngrok.com".format(self.subdomain)
        return url

    def stop(self):
        """Tell ngrok to tear down the tunnel.

        Stop the background tunneling process.
        """
        self.ngrok.terminate()

Example usage in tests

Here is a short pseudo example from cryptoassets.core block.io webhook handler unit tests. See the full unit test code here.

class BlockWebhookTestCase(CoinTestRoot, unittest.TestCase):

    def setUp(self):

        self.ngrok = None

        self.backend.walletnotify_config["class"] = "cryptoassets.core.backend.blockiowebhook.BlockIoWebhookNotifyHandler"

        # We need ngrok tunnel for webhook notifications
        auth_token = os.environ["NGROK_AUTH_TOKEN"]
        self.ngrok = NgrokTunnel(21211, auth_token)

        # Pass dynamically generated tunnel URL to backend config
        tunnel_url = self.ngrok.start()
        self.backend.walletnotify_config["url"] = tunnel_url
        self.backend.walletnotify_config["port"] = 21211

        # Start the web server
        self.incoming_transactions_runnable = self.backend.setup_incoming_transactions(self.app.conflict_resolver, self.app.event_handler_registry)

        self.incoming_transactions_runnable.start()

    def tearDown(self):

        # Stop webserver
        incoming_transactions_runnable = getattr(self, "incoming_transactions_runnable", None)
        if incoming_transactions_runnable:
            incoming_transactions_runnable.stop()

        # Stop tunnelling
        if self.ngrok:
            self.ngrok.stop()
            self.ngrok = None

Other

Please see the unit tests for NgrokTunnel class itself.


March 27, 2015 12:49 AM

March 26, 2015


Python Software Foundation

World Domination: One Student at a Time!

A couple of years ago, I discovered the edX MIT course 6.00x Intro to Computer Science and Programming Using Python. At the time, I was eager to learn Python and CS basics, so I took the plunge. 
The course has been offered through edX each semester since, and at some point it was divided into two courses to allow more time for in-depth study, as the original one-semester course moved very quickly from basics to more advanced topics, such as complexity classes, plotting techniques, stochastic programs, probability, random walks, and graph optimization. I can’t say enough good things about the excellence of Professor John Guttag, who developed the course and wrote the accompanying textbook (which is recommended but not required), along with co-teachers, Profs. Eric Grimson and Chris Terman.
I was grateful at the time to have found a free introductory college-level course in computer science that uses Python, rather than C, Java, or another language, as I had already had some acquaintance with Python and wanted to solidify my foundation and gain more skill. Working through the course led me to appreciate the features of Python that make it a wonderful teaching language. Since it is relatively easy to learn, it allows the learner to get up and running quickly, to write code and get results early on, without getting too bogged down and discouraged (something that I, as a humanities rather than a math person, had experienced in the past.) In addition, Python teaches good programming habits, including the importance of good documentation, what Prof. Guttag frequently referred to as "good hygiene." I remember wondering at the time why Python wasn’t always the language taught to beginners.
Well, today this is the trend.
According to a July 2014 study by Phillip Guo, Python is Now the Most Popular Introductory Teaching Language at Top U.S. Universities. Guo analyzed the course curricula for the top 39 CS Departments in the US. He used U.S. News' ranking of best computer science schools in 2014, which begins with Carnegie Mellon, MIT, Stanford, and UC Berkeley (he stopped at 39 because apparently there was an 8-way tie for #40), and found that 27 of them teach Python in their Intro courses. Of the top 10 departments, the proportion was higher – 8 of them teach Python. The next most-taught languages the study found were (in descending order): Java, MATLAB, C, C++, Scheme, and Scratch. Moreover, in addition to edX, both Udacity and Coursera use Python for their introductory courses.
Anecdotally, Guo found that professors in academic fields outside of CS are increasingly using Python to fill their students' needs for programming skills. See February’s PSF blog post Python in Nature for an explanation and example of this trend by Dr. Adina Howe, Professor of Agriculture and Biosystems Engineering at Iowa State University.
The increasing popularity of Python as the language for introductory CS courses in the US will undoubtedly lead to further growth of the Python community and the language. As Guo explains: 
… the choice of what language to teach first reflects the pedagogical philosophy of each department and influences many students' first impressions of computer science. The languages chosen by top U.S. departments could indicate broader trends in computer science education, since those are often trendsetters for the rest of the educational community.
I would love to hear from readers. Please send feedback, comments, or blog ideas to me at msushi@gnosis.cx.

March 26, 2015 10:42 PM


Frank Wierzbicki

Jython 2.7 release candidate 1 available!

[Update: on Windows machines the installer shows an error at the end. The installer needs to be closed manually, but then the install should still work. We will fix this for rc2.]

On behalf of the Jython development team, I'm pleased to announce that the first release candidate of Jython 2.7 is available! We're getting very close to a real release finally! I'd like to thank Amobee for sponsoring my work on Jython. I'd also like to thank the many contributors to Jython.

Jython 2.7rc1 brings us up to language level compatibility with the 2.7 version of CPython. We have focused largely on CPython compatibility, and so this release of Jython can run more pure Python apps than any previous release. Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:

To see all of the files available including checksums, go here and navigate to the appropriate distribution and version.

March 26, 2015 07:00 PM


Carl Trachte

IE and Getting a Text File Off the Web - Selenium Web Tools

I've blogged previously about getting information off of a distant server on my employer's internal SharePoint site.  Automating this can be a little challenging, especially when there's a change.

My new desktop showed up with Internet Explorer 11 and Windows 7 Enterprise.  When I went to run my MineSight multirun (basically a batch file with a GUI front end that our mine planning vendor provides) the file fetch from our SharePoint site didn't work.  A little googling led me to Selenium.

As is often the case, I am wayyyy late to the party here.  I remember Selenium from Pycon 2010 in Atlanta because they gave us a nice mug with new string formatting on it that I use frequently (both the mug and the formatting):

 
I was at Pycon 2010 . . . and I have the mug to prove it.


 
My project manager/boss at the time, Eric, seeing me gush over the string formatting commands, did his usual button-pushing exercise by commenting, "I don't know; why didn't they put something on there like 'from pot import coffee'?"  People, y'know?

Back to Selenium - I was able to get what I needed from it with some research and downloading.  The steps are basically:


 
    1) Download IEDriverServer.exe
 
    2) Put the executable in a location in your path.
 
    3) Download Python Selenium Bindings and follow the install instructions.  I went the Python 3.4 route (versus the Python 2.7 that comes with MineSight) - personal preference on my part.

    4) Make sure your Internet Explorer environment/application is set up in a way that won't cause you problems.  I could try to describe this, but this blog post from a Selenium developer does it so much better (complete with screenshots):  http://jimevansmusic.blogspot.com/2012/08/youre-doing-it-wrong-protected-mode-and.html.  When Microsoft talks about "zones" and IE Protected Mode, the zones refer to things like "Trusted Sites," company web, external internet, etc. - all those have to be set to protected mode or things won't work and you'll get a fairly cryptic error message when the script crashes.


For my example, I was able to comment out some of the things I need to do within the MineSight multirun.  The DOS window hangs and IEDriverServer stays open within the MineSight multirun and app - I hacked this problem by killing it with an os.system() call.  Whatever it takes.
 
I couldn't efficiently get the script to recognize HTML tag names, so I hacked that with text processing.  This is bad, but effective.
 
The code:
 
#!C:\Python34\python
 
"""
Get text from site via Internet Explorer.
"""
 
INST = 'instructions.txt'
 
# For killing process inside Multirun.
# import os
 
from time import sleep as slpx
 
from selenium import webdriver
 
# XXX - hack - had difficulty getting
#       things by tag - text processed it.
PRETAG = '<pre>'
PRETAGLEN = len(PRETAG)

PRETAGCLOSE = '</pre>'
# Seconds to pause at end.
PAUSE = 3
INSTRUCTIONS = 'http://ftp3.usa.openbsd.org/pub/OpenBSD/5.6/README'
INSTR = 'instructions.txt'
 
# XXX - may not matter (\r versus \n), in all cases
#       but for numbers in multirun, makeshift chomp
#       processing made a difference.

RETCHAR = '\r'
 
# Hack to shutdown DOS window.
# TASKKILL = 'taskkill /im IEDriverServer.exe /F'
 
def getbody(url):
    """
    Given the website address (url),
    returns inner HTML text stripped of tags.
    """
    browser = webdriver.Ie()
    browser.get(url)
    text = browser.page_source
    browser.close()
    text = text[(text.index(PRETAG) + PRETAGLEN):]
    text = text[:(text.index(PRETAGCLOSE))]
    text = text.split(RETCHAR)
    [x.strip() for x in text]
    return text
 
textii = getbody(INSTRUCTIONS)
print('\nDealing with writing of instructions file . . .\n')
textii = ''.join(textii)
f = open(INSTR, 'w')
f.write(textii)
f.close()
print('Instructions copied.')
print('\nPausing {:d} seconds . . .\n'.format(PAUSE))
slpx(PAUSE)

# XXX - can't get window to close in Multirun (MXPERT) - CBT 23MAR2015
# os.system(TASKKILL)
 
 


March 26, 2015 04:41 PM


PyCon

For Microsoft, Python support extends far beyond Windows installers

You might have known that Python's 1.0 release came at the start of 1994, but did you know Microsoft shipped its Merchant Server 1.0 product built on Python only a few years later in 1996? Microsoft, this year's Keystone sponsor, has long been a user and supporter of Python, with a history of use within build and test infrastructure and individual users all around the company. There are even a few lawyers writing Python code.

In 2006 they introduced the world to IronPython, a .NET runtime for Python, and later the excellent Python Tools for Visual Studio plug-in in 2011. They continue to release Python code, as it's "a must-have language for any team that releases developer kits or libraries, especially for services on Azure that can be used from any operating system," according to Steve Dower, a developer on Microsoft's Python Tools team.

"Python has very strong cross-platform support, which is absolutely critical these days," says Steve. "It’s very attractive for our users to literally be able to 'write once-run anywhere.'

"The breadth of the community is also very attractive, especially the support for scientific use," he continued. Microsoft has been a significant donor to the Jupyter project (formerly IPython) as well as a platinum sponsor of the NumFOCUS Foundation.

Along with supporting those projects, they have also been providing MSDN subscriptions to the core Python team to assist with development and testing on Windows. Beyond supporting the existing developers, they've jumped in the ring themselves as one of the few companies to employ developers working on CPython itself. "Python has done an amazing job of working well on Windows, and we hope that by taking an active involvement we can push things along further," offers Steve, whose work includes being a core developer on the CPython project.


Steve's CPython work has focused around Windows issues, including an improved installer for 3.5. Additionally, the team was able to come up with a special package for Python users: Microsoft Visual C++ Compiler for Python 2.7. Due to Python 2.7 being built on the Visual C++ 2008 runtime, which is no longer supported, they created this package to provide the necessary tools and headers to continue building extension modules for Python 2.7, which will live through at least 2020 as was announced at last year's language summit.


Along with efforts on Python itself, they're hard at work on improving tooling for the upcoming Visual Studio 2015 and Python 3.5 releases. "Practically everything we do will integrate with Visual Studio in some way," says Steve of Python Tools for Visual Studio. "PTVS has been free and open-source from day one, and combined with Visual Studio Community Edition makes for a powerful, free multi-lingual IDE."

As for what's next with PTVS, Steve says, "we try and be responsive to the needs of our users, and we are an open-source project that accepts contributions, so there’s always a chance that the next amazing feature won’t even come from our team. We've also recently joined forces with the Azure Machine Learning team and are looking forward to adding more data science tooling as well.

"We want new and experienced developers alike to have the best tools, the best libraries, the best debugging and the best services without having to give up Linux support, Visual Studio, CPython, git, or whatever tools they’ve already integrated into their workflow."

When it comes to PyCon, they see it as "a learning opportunity for Microsoft, as well as a chance for us to show off some of the work we’ve been doing." "For those of us at Microsoft who always knew how great the Python community is, it’s also been great to bring our colleagues and show them.

"We love that PyCon is about building and diversifying the community, and not about sales, marketing and business deals," says Steve. If you head to their booth in the expo hall, you'll find out first hand that they're there to talk about code and building great things. They're looking forward to showing off some great new demos and have exciting new things to talk about.

The PyCon organizers thank Microsoft for another year of sponsorship and look forward to another great conference!

March 26, 2015 10:14 AM


Tryton News

Pycon 2015

This year, there will be two Foundation members (Sharoon Thomas and Cédric Krier) present at PyCon 2015 in Montréal. PyCon is the largest annual conference for the Python community, of which Tryton is a part.

If you want to meet Tryton's people, we will host an Open Space, and a sprint on Tryton will be organized.

March 26, 2015 10:00 AM


PyPy Development

PyPy 2.5.1 Released

PyPy 2.5.1 - Pineapple Bromeliad

We’re pleased to announce PyPy 2.5.1, Pineapple Bromeliad following on the heels of 2.5.0. You can download the PyPy 2.5.1 release here:
We would like to thank our donors for the continued support of the PyPy project, and for those who donate to our three sub-projects, as well as our volunteers and contributors. We’ve shown quite a bit of progress, but we’re slowly running out of funds. Please consider donating more, or even better convince your employer to donate, so we can finish those projects! The three sub-projects are:
  • Py3k (supporting Python 3.x): We have released a Python 3.2.5 compatible version we call PyPy3 2.4.0, and are working toward a Python 3.3 compatible version
     
  • STM (software transactional memory): We have released a first working version, and continue to try out new promising paths of achieving a fast multithreaded Python

  • NumPy which requires installation of our fork of upstream numpy, available on bitbucket
We would also like to encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows, and OpenBSD), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

While we support 32 bit python on Windows, work on the native Windows 64 bit python is still stalling, we would welcome a volunteer to handle that.

Highlights

  • The past months have seen PyPy mature and grow, as RPython becomes the go-to solution for writing fast dynamic language interpreters. Our separation of RPython from the Python interpreter PyPy is now much clearer in the PyPy documentation and we now have separate RPython documentation. Tell us what still isn’t clear, or even better, help us improve the documentation.
  • We merged version 2.7.9 of python’s stdlib. From the python release notice:
    • The entirety of Python 3.4’s ssl module has been backported. See PEP 466 for justification.
    • HTTPS certificate validation using the system’s certificate store is now enabled by default. See PEP 476 for details.
    • SSLv3 has been disabled by default in httplib and its reverse dependencies due to the POODLE attack.
    • The ensurepip module has been backported, which provides the pip package manager in every Python 2.7 installation. See PEP 477.

  • The garbage collector now ignores parts of the stack which did not change since the last collection, another performance boost
  • errno and LastError are saved around cffi calls so things like pdb will not overwrite it
  • We continue to asymptotically approach a score of 7 times faster than CPython on our benchmark suite; we now rank 6.98 on the latest runs
Please try it out and let us know what you think. We welcome success stories, experiments, or benchmarks, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

March 26, 2015 09:45 AM


Continuum Analytics Blog

The Art of Abstraction - Continuum + Silicon Valley Data Science White Paper

How Separating Code, Data, and Context Will Make Your Business Better

March 26, 2015 12:00 AM

March 25, 2015


Mike Driscoll

The Python 101 Screencast Kickstarter is Now Live!

mousecovertitlejpg_sm_title

My latest project is turning my book, Python 101, into a Screencast. I have started a Kickstarter to raise funds to help in this endeavor. You can check it out here:

https://www.kickstarter.com/projects/34257246/the-python-101-screencast

The basic idea is to take each chapter of the book and turn it into a screencast. There are 44 chapters currently that will be turned into mini-videos. I’ve already realized I can add a lot of other items in a screencast that are easier to show than to write about, so there will definitely be additional content. I hope you will join me in this project.

Thanks,
Mike

March 25, 2015 01:19 PM


ClusterHQ

Moving a database container with Docker Swarm and Flocker

Please note: because this demo uses Powerstrip, which is only meant for prototyping Docker extensions, we do not recommend this configuration for anything approaching production usage. When Docker extensions become official, Flocker will support them. Until then, this is just a proof-of-concept.

Today we are going to demonstrate migrating a database container and its data across hosts using only the Docker client as the trigger.

Imagine we are running a database container with data being saved in a Docker volume and now we want to upgrade the hardware for the host the database container is running on. Clearly, we need a way of moving the container and the data as a single atomic unit.

Docker Swarm is capable of scheduling the container to be run on a particular host (using constraints) – we are going to demonstrate how to combine Flocker with Swarm and migrate the data alongside the container by combining the following tools:

Swarm + Flocker + Weave + Powerstrip + Powerstrip-flocker + Powerstrip-weave

Overview

Here is an overview of the components that are used to make this example work:

overview

Scenario

We have two simple services in our stack – an HTTP API and a Database API, also exposed over HTTP. We have used Flocker to handle our volumes and Weave to handle our networking.

We quickly discover that the node with the spinning disk which hosts our database container is too slow. We need an SSD drive and this means migrating our database container (along with its data) to another host.

Ideally – we want to run our database container on a second machine and for the data to just move with the container. Because we are using Powerstrip-flocker together with Swarm – we can stop the container on the first node, start it up on the second node and our data will have followed the database container – using only the Docker client!

Before

Here is a diagram of the setup before we have moved the database container:

layout-pre

After

Here is a diagram of the setup after we have moved the database container (and its data + IP address):

layout-post

Try it yourself! Requirements for the demo

First you need to install: VirtualBox + Vagrant

Start

To run the demo virtual machines:

$ git clone https://github.com/binocarlos/powerstrip-swarm-demo
$ cd powerstrip-swarm-demo
$ vagrant up

Run

We have included a script that will run through each of the commands in the demo automatically:

$ vagrant ssh master
master$ sudo bash /vagrant/run.sh demo

Manual example

If we run each step of the example manually, we can see clearly what is happening on each step.

Step 1

First we SSH onto the master and export DOCKER_HOST:

$ vagrant ssh master
master$ export DOCKER_HOST=localhost:2375

Step 2

Then we start the HTTP server on node1:

master$ docker run -d \
  --name demo-server \
  -e constraint:storage==disk \
  -e WEAVE_CIDR=10.255.0.11/24 \
  -e API_IP=10.255.0.10 \
  -p 8080:80 \
  binocarlos/multi-http-demo-server:latest

Step 3

Then we start the DB server on node1:

master$ docker run -d \
  --hostname disk \
  --name demo-api \
  -e constraint:storage==disk \
  -e WEAVE_CIDR=10.255.0.10/24 \
  -v /flocker/data1:/tmp \
  binocarlos/multi-http-demo-api:latest

Step 4

Then we hit the web service a few times to increment the number stored in the database:

master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080

Step 5

Then we kill the database container:

master$ docker rm -f demo-api

Step 6

Now we start the DB server on node2:

master$ docker run -d \
  --hostname ssd \
  --name demo-api \
  -e constraint:storage==ssd \
  -e WEAVE_CIDR=10.255.0.10/24 \
  -v /flocker/data1:/tmp \
  binocarlos/multi-http-demo-api:latest

Step 7

Then we hit the web service a few times to confirm that the number carries on incrementing from where it left off – in other words, that the data has moved with the container:

master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080
master$ curl -L http://172.16.255.251:8080

Step 8

Finally, we remove the two containers:

master$ docker rm -f demo-api
master$ docker rm -f demo-server

Info

You can see the state of the swarm by doing this on the master:

$ vagrant ssh master
master$ DOCKER_HOST=localhost:2375 docker ps -a

This displays the containers used for Powerstrip, Flocker and Weave.

You can see the state of the weave network by doing this on node1 or node2:

$ vagrant ssh node1
node1$ sudo bash /vagrant/install.sh weave status

Conclusion

We have moved both an IP address and a data volume across hosts using nothing more than the Docker client!

What do you think?  Join the discussion over on HackerNews.

The post Moving a database container with Docker Swarm and Flocker appeared first on ClusterHQ.

March 25, 2015 01:11 PM


François Marier

Keeping up with noisy blog aggregators using PlanetFilter

I follow a few blog aggregators (or "planets") and it's always a struggle to keep up with the amount of posts that some of these get. The best strategy I have found so far is to filter them to remove the blogs I am not interested in, which is why I wrote PlanetFilter.

Other options

In my opinion, the first step in starting a new free software project should be to look for a reason not to do it :) So I started by looking for another approach and by asking people around me how they dealt with the firehoses that are Planet Debian and Planet Mozilla.

It seems like a lot of people choose to "randomly sample" planet feeds and only read a fraction of the posts that are sent through there. Personally however, I find there are a lot of authors whose posts I never want to miss so this option doesn't work for me.

A better option that other people have suggested is to avoid subscribing to the planet feeds, but rather to subscribe to each of the author feeds separately and prune them as you go. Unfortunately, this whitelist approach is a high maintenance one since planets constantly add and remove feeds. I decided that I wanted to follow a blacklist approach instead.

PlanetFilter

PlanetFilter is a local application that you can configure to fetch your favorite planets and filter the posts you see.

If you get it via Debian or Ubuntu, it comes with a cronjob that looks at all configuration files in /etc/planetfilter.d/ and outputs filtered feeds in /var/cache/planetfilter/.

You can either:

The software will fetch new posts every hour and overwrite the local copy of each feed.

A basic configuration file looks like this:

[feed]
url = http://planet.debian.org/atom.xml

[blacklist]

Filters

There are currently two ways of filtering posts out. The main one is by author name:

[blacklist]
authors =
  Alice Jones
  John Doe

and the other one is by title:

[blacklist]
titles =
  This week in review
  Wednesday meeting for

In both cases, if a blog entry contains one of the blacklisted authors or titles, it will be discarded from the generated feed.
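
To make the filtering rule concrete, here is a minimal sketch of the same blacklist idea using the feedparser library (illustrative only -- this is not PlanetFilter's actual code); the feed URL and the names are just the examples from the configuration snippets above:

import feedparser

BLACKLISTED_AUTHORS = {"Alice Jones", "John Doe"}
BLACKLISTED_TITLES = ("This week in review", "Wednesday meeting for")

# Fetch the planet feed and keep only entries whose author and title
# don't match the blacklist.
feed = feedparser.parse("http://planet.debian.org/atom.xml")
kept = [
    entry for entry in feed.entries
    if entry.get("author", "") not in BLACKLISTED_AUTHORS
    and not any(title in entry.get("title", "") for title in BLACKLISTED_TITLES)
]
print("kept %d of %d entries" % (len(kept), len(feed.entries)))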

Tor support

Since blog updates happen asynchronously in the background, they can work very well over Tor.

In order to set that up in the Debian version of planetfilter:

  1. Install the tor and polipo packages.
  2. Set the following in /etc/polipo/config:

     proxyAddress = "127.0.0.1"
     proxyPort = 8008
     allowedClients = 127.0.0.1
     allowedPorts = 1-65535
     proxyName = "localhost"
     cacheIsShared = false
     socksParentProxy = "localhost:9050"
     socksProxyType = socks5
     chunkHighMark = 67108864
     diskCacheRoot = ""
     localDocumentRoot = ""
     disableLocalInterface = true
     disableConfiguration = true
     dnsQueryIPv6 = no
     dnsUseGethostbyname = yes
     disableVia = true
     censoredHeaders = from,accept-language,x-pad,link
     censorReferer = maybe
    
  3. Tell planetfilter to use the polipo proxy by adding the following to /etc/default/planetfilter:

     export http_proxy="localhost:8008"
     export https_proxy="localhost:8008"
    

Bugs and suggestions

The source code is available on repo.or.cz.

I've been using this for over a month and it's been working quite well for me. If you give it a go and run into any problems, please file a bug!

I'm also interested in any suggestions you may have.

March 25, 2015 09:55 AM


Python Software Foundation

Raspberry Pi 2: Even More Delicious!

For those of you not familiar with the Raspberry Pi Foundation, this UK-based educational charity provides fun projects and opportunities for bringing coding literacy to students in the UK and to learners all over the world. This blog previously featured two of their projects, Astro Pi and Unicef’s Pi4Learning. There are many more, including Piper, which uses the game Minecraft to teach electronics to kids; the use of Raspberry Pis on weather balloons to observe and record (from the UK) today’s solar eclipse; and Picademy, which teaches programming skills to teachers (for these projects and many more, see the RPF Blog).
The one thing these widely-varied projects have in common is that they all rely on the high-performing, incredibly affordable, versatile, and fun-to-use Raspberry Pi! First offered for sale by the RP Foundation in 2012, the device has become hugely popular, with over 5 million in use around the world. And it just got even better! 
The new Raspberry Pi 2 went on sale in February 2015. The reviews have begun pouring in, and the consensus is that it’s truly great! 
Still selling for a mere $35 USD, still the size of a credit card, and of course still pre-loaded with Python (along with Scratch, Wolfram Mathematica, and much more), the new Raspberry Pi features increased speed and functionality over the B and B+ models. With a 900MHz quad-core ARM Cortex-A7 CPU and a full 1 GB of RAM (up from model B+’s 512 MB), it has been benchmarked at 6 to almost 10 times faster than the first model B (see Tao of Mac and PC World).
Its 4-core processor can run all ARM GNU/Linux distributions, and the new Pi is fully compatible with the earlier models. In addition, Microsoft is poised to release a version of Windows 10 that will work with the Pi, further increasing its already broad appeal and versatility (see Raspberry Pi 2).

Photo credit: da.wikipedia.org, under CC license
Features it retains from the previous Model B+ include 4 USB ports, HDMI, Ethernet, Micro SD, Broadcom VideoCore IV graphics, and audio output via both HDMI and a 3.5mm analogue jack (see PC Pro).
Currently the ties between the PSF and the RPF are strong, with many Pythonistas using the Raspberry Pi and many Raspberry Pi projects being done in Python. We hope more people will take a look at this remarkable tool and use it to teach Python, spread programming skills, and put computing power in the hands of anyone who wants it. 
I would love to hear from readers. Please send feedback, comments, or blog ideas to me at msushi@gnosis.cx.

March 25, 2015 02:26 AM

Manuel Kaufmann and Python in Argentina

Several recent blog posts have focused on Python-related and PSF-funded activities in Africa and the Middle East. But the Python community is truly global, and it has been exciting to witness its continued growth. New groups of people are being introduced to Python and to programming so frequently that it’s difficult to keep up with the news. Not only that, but the scope and lasting impact of work being accomplished by Pythonistas with very modest financial assistance from the PSF is astonishing. 

One example is the recent work in South America by Manuel Kaufmann. Manuel’s project is to promote the use of Python “to solve daily issues for common users." His choice of Python as the best language to achieve this end is due to his commitment to "the Software Libre philosophy,” in particular, collaboration rather than competition, as well as Python's ability "to develop powerful and complex software in an easy way."

Toward this end, one year ago, Manuel began his own project, spending his own money and giving his own time, traveling to various South American cities by car (again, his own), organizing meet-ups, tutorials, sprints, and other events to spread the word about Python and its potential to solve everyday problems (see Argentina en Python).

This definitely got the PSF's attention, so in January 2015, the PSF awarded him a $3,000 (USD) grant. With this award, Manuel has been able to continue his work, conducting events that have established new groups that are currently expanding further. This ripple effect of a small investment is something that the PSF has seen over and over again.

On January 17, Resistencia, Argentina was the setting for its first-ever Python Sprint. It was a fairly low-key affair, held at a pub/restaurant “with good internet access.” There were approximately 20 attendees (including 4 young women), who were for the most part beginners. After a general introduction, they broke into 2 work groups, with Manuel leading the beginners' group (see Resistencia, Chaco Sprint), by guiding them through some introductory materials and tutorials (e.g., Learning Python from PyAr's wiki).

Foto grupal con todos los asistentes (group photo of all attendees). 
Photo credit: Manuel Kaufmann

As can happen, momentum built, and the Sprint was followed by a Meet-up on January 30 to consolidate gains and to begin to build a local community. The Meet-up's group of 15 spent the time exploring the capabilities of Python, Brython, Javascript, Django, PHP, OpenStreet Map, and more, in relation to needed projects, and a new Python community was born (see Meetup at Resistencia, Chaco).

The next event in Argentina, the province of Formosa's first official Python gathering, was held on February 14. According to Manuel, it was a great success, attended by around 50 people. The day was structured to have more time for free discussion, which allowed for more interaction and exchange of ideas. In Manuel’s opinion, this structure really helped to forge and strengthen the community. The explicit focus on real world applications, with discussion of a Python/Django software application developed for and currently in use at Formosa’s Tourist Information Office, was especially compelling and of great interest to the attendees. See PyDay Formosa and for pictures, see PyDay Pics.

It looks as though these successes are just the beginning: Manuel has many more events scheduled:
  • 28 Mar - PyDay at Asunción (Gran Asunción, Paraguay and PyDay Asuncion); Manuel reports that registration for this event has already exceeded 100 people, after only 3 days of opening. In addition, the event organizers are working to establish a permanent “Python Paraguay” community!
  • 20-22 May - Educational Track for secondary students at SciPy LA 2015, Posadas, Misiones, Argentina (SciPy LA and Educational Track); and
  • 30 May - PyDay at Encarnación, Itapúa, Paraguay. 
You can learn more and follow Manuel’s project at the links provided and at Twitter. And stay tuned to this blog, because I plan to cover more of his exciting journey to bring Python, open source, and coding empowerment to many more South Americans.

I would love to hear from readers. Please send feedback, comments, or blog ideas to me at msushi@gnosis.cx.

March 25, 2015 12:15 AM


Matthew Rocklin

Partition and Shuffle

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers.

tl;dr We partition out-of-core dataframes efficiently.

Partition Data

Many efficient parallel algorithms require intelligently partitioned data.

For time-series data we might partition into month-long blocks. For text-indexed data we might have all of the “A”s in one group and all of the “B”s in another. These divisions let us arrange work with foresight.

To extend Pandas operations to larger-than-memory data, efficient partitioning algorithms are critical. This is tricky to do well when the data doesn’t fit in memory.

Partitioning is fundamentally hard

Data locality is the root of all performance
    -- A Good Programmer

Partitioning/shuffling is inherently non-local. Every block of input data needs to separate and send bits to every block of output data. If we have a thousand partitions then that’s a million little partition shards to communicate. Ouch.

Shuffling data between partitions

Consider the following setup

  100GB dataset
/ 100MB partitions
= 1,000 input partitions

To repartition, we need to shuffle data from the input partitions into a similar number of output partitions

  1,000 input partitions
* 1,000 output partitions
= 1,000,000 partition shards

If our communication/storage of those shards has even a millisecond of latency then we run into problems.

  1,000,000 partition shards
x 1ms
= 1,000 seconds ≈ 17 minutes

Previously I stored the partition shards individually on the filesystem using cPickle. This was a mistake: it was very slow because it treated each of the million shards independently. Now we aggregate shards headed for the same output block and write out many at a time, amortizing the per-write overhead while balancing against memory constraints. This stresses both Python latencies and memory use.
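
Here is a toy sketch of that routing step (illustrative only -- not the actual pframe internals): assign each row to an output block with searchsorted, then do one buffered write per destination block rather than one write per shard.

import numpy as np
import pandas as pd

blockdivs = [5, 15]                       # partition boundaries on the index
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1., 2., 3., 4.]},
                  index=[1, 4, 10, 20])

# Which output block does each row belong to?
block_ids = np.searchsorted(blockdivs, df.index.values, side='right')

# Group rows by destination and "write" each group once.
for block_id, shard in df.groupby(block_ids):
    # In the real system this would append to the on-disk container
    # for partition `block_id`; here we just show the grouping.
    print(block_id, len(shard), "rows")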

BColz, now for very small data

Fortunately we have a nice on-disk chunked array container that supports append in Cython. BColz (formerly BLZ, formerly CArray) does this for us. It wasn’t originally designed for this use case but performs admirably.

Briefly, BColz is a chunked, compressed (via Blosc) container for numerical data that can live on disk or in memory.

It includes two main objects: carray, a chunked, compressed, appendable array, and ctable, a named collection of carrays that approximates a DataFrame.

Partitioned Frame

We use carray to make a new data structure, pframe, with two main operations: append, which shuffles incoming data into the right partitions, and get_partition, which reads a partition back out as a Pandas DataFrame.

Internally we invent two new data structures: cframe and pframe.

Partitioned Frame design

Through bcolz.carray, cframe manages efficient incremental storage to disk. PFrame partitions incoming data and feeds it to the appropriate cframe.
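
For the curious, here is a minimal sketch of the bcolz primitive that cframe builds on (the rootdir path and the sizes are arbitrary): an on-disk carray that grows by appending chunks.

import numpy as np
import bcolz

# Create an empty on-disk carray and grow it incrementally.
ca = bcolz.carray(np.empty(0, dtype='f8'),
                  rootdir='/tmp/demo-carray', mode='w')
for _ in range(5):
    ca.append(np.random.random(100000))   # chunked, compressed appends
ca.flush()

print(len(ca), ca.nbytes, ca.cbytes)       # logical vs compressed size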

Example

Create test dataset

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [1, 2, 3, 4],
...                        'b': [1., 2., 3., 4.]},
...                       index=[1, 4, 10, 20])

Create pframe like our test dataset, partitioning on divisions 5, 15. Append the single test dataframe.

In [3]: from pframe import pframe
In [4]: pf = pframe(like=df, blockdivs=[5, 15])
In [5]: pf.append(df)

Pull out partitions

In [6]: pf.get_partition(0)
Out[6]:
   a  b
1  1  1
4  2  2

In [7]: pf.get_partition(1)
Out[7]:
    a  b
10  3  3

In [8]: pf.get_partition(2)
Out[8]:
    a  b
20  4  4

Continue to append data…

In [9]: df2 = pd.DataFrame({'a': [10, 20, 30, 40],
...                         'b': [10., 20., 30., 40.]},
...                        index=[1, 4, 10, 20])
In [10]: pf.append(df2)

… and partitions grow accordingly.

In [12]: pf.get_partition(0)
Out[12]:
    a   b
1   1   1
4   2   2
1  10  10
4  20  20

We can continue this until our disk fills up. This runs near peak I/O speeds (on my low-power laptop with admittedly poor I/O.)

Performance

I’ve partitioned the NYCTaxi trip dataset a lot this week, posting my results to the Continuum chat with messages like the following:

I think I've got it to work, though it took all night and my hard drive filled up.
Down to six hours and it actually works.
Three hours!
By removing object dtypes we're down to 30 minutes
20!  This is actually usable.
OK, I've got this to six minutes.  Thank goodness for Pandas categoricals.
Five.
Down to about three and a half with multithreading, but only if we stop blosc from segfaulting.

And that’s where I am now. It’s been a fun week. Here is a tiny benchmark.

>>> import pandas as pd
>>> import numpy as np
>>> from pframe import pframe

>>> df = pd.DataFrame({'a': np.random.random(1000000),
                       'b': np.random.poisson(100, size=1000000),
                       'c': np.random.random(1000000),
                       'd': np.random.random(1000000).astype('f4')}).set_index('a')

Set up a pframe to match the structure of this DataFrame, partitioning the index into divisions of size 0.1.

>>> pf = pframe(like=df,
...             blockdivs=[.1, .2, .3, .4, .5, .6, .7, .8, .9],
...             chunklen=2**15)

Dump the random data into the Partition Frame one hundred times and compute effective bandwidths.

>>> for i in range(100):
...     pf.append(df)

CPU times: user 39.4 s, sys: 3.01 s, total: 42.4 s
Wall time: 40.6 s

>>> pf.nbytes
2800000000

>>> pf.nbytes / 40.6 / 1e6  # MB/s
68.9655172413793

>>> pf.cbytes / 40.6 / 1e6  # Actual compressed bytes on disk
41.5172952955665

We partition and store on disk random-ish data at 68MB/s (cheating with compression). This is on my old small notebook computer with a weak processor and hard drive I/O bandwidth at around 100 MB/s.

Theoretical Comparison to External Sort

There isn’t much literature to back up my approach. That concerns me. There is a lot of literature, however, on external sorting, and it often cites our partitioning problem as a use case. Perhaps we should do an external sort?

I thought I’d quickly give some reasons why I think the current approach is theoretically better than an out-of-core sort; hopefully someone smarter can come by and tell me why I’m wrong.

We don’t need a full sort; we need something far weaker. External sort requires at least two passes over the data, while the method above requires one full pass through the data plus one additional pass through the index column to determine good block divisions. Those divisions should be of approximately equal size, but the sizing can be pretty rough; I don’t think we would notice a variation of a factor of five in block sizes. Task scheduling lets us be pretty sloppy with load imbalance as long as we have many tasks.
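
To put a rough number on the cost of an extra pass (my own back-of-the-envelope arithmetic, using the 100GB dataset from the example above and the roughly 100 MB/s disk bandwidth mentioned earlier):

dataset_bytes = 100e9           # 100 GB, as in the shuffle example above
disk_bandwidth = 100e6          # ~100 MB/s of disk I/O
seconds_per_pass = dataset_bytes / disk_bandwidth
print(seconds_per_pass / 60)    # ~16.7 minutes for each full pass over the data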

I haven’t implemented a good external sort though so I’m only able to argue theory here. I’m likely missing important implementation details.

March 25, 2015 12:00 AM

March 24, 2015


Python Anywhere

XFS to ext4 for user storage - why we made the switch

Last Tuesday, we changed the filesystem we use to store our users' files over from XFS to ext4. This required a much longer maintenance outage than usual -- 2 hours instead of our normal 20-30 minutes.

This post explains why we made the change, and how we did it.

tl;dr for PythonAnywhere users:

We discovered that the quota system we were using with XFS didn't survive hard fileserver reboots in our configuration. After much experimentation, we determined that ext4 handles our particular use case better. So we moved over to using ext4, which was hard, but worthwhile for many reasons.

tl;dr for sysadmins:

A bit of architecture

In order to understand what we changed and why, you'll need a bit of background about how we store our users' files. This is relatively complex, in part because we need to give our users a consistent view of their data regardless of which server their code is running on -- for example so they see the same files from their consoles as they do from their web apps, and so all of the worker processes that make up their web apps can see all of their files -- and in part because we need to keep everything properly backed up to allow for hardware failures and human error.

The PythonAnywhere cluster is made up of a number of different server types. The most important for this post are execution servers, file servers, and backup servers.

Execution servers are the servers where users' code runs. There are three kinds: web servers, console servers, and (scheduled) task servers. From the perspective of file storage, they're all the same -- they run our users' code in containers, with each user's files mounted into the containers. They access the users' files from file servers.

File servers are just what you'd expect. All of a given user's files are on the same file server. They're high-capacity servers with large RAID0 SSD arrays (connected using Amazon's EBS). They run NFS to provide the files to the execution servers, and also run a couple of simple services that allow us to manage quotas and the like.

Backup servers are simpler versions of file servers. Each file server has its own backup server, and they have identical amounts of storage. Data that is written to a file server is asynchronously synchronised over to its associated backup server using a service called drbd.

Here's a diagram of what we were doing prior to the recent update:

Simplified architecture diagram

This architecture has a number of benefits:

XFS

As you can see from the diagram, the filesystem we used to use to store user data was XFS. XFS is a tried-and-tested journaling filesystem, created by Silicon Graphics in 1993, and is perfect for high-capacity storage. We actually started using it because of a historical accident. In an early prototype of PythonAnywhere, all users actually mapped to the same Unix user. When we introduced disk quotas (yes, it was early enough that we didn't even have disk quotas), this was a problem. At that time, we couldn't see any easy way to change the situation with Unix users (that changed later), so we needed some kind of quota system that allowed us to enforce quotas on a per-directory basis, so that (eg.) /home/someuser had a quota of 512MB and /home/otheruser had a quota of 1GB. But most filesystems that provide quotas only support them on a per-user basis.

XFS, however, has a concept of "project quotas". A project is a set of directories, and each project can have its own independent quota. This was perfect for us, so of the tried-and-tested filesystems, XFS was a great choice.

Later on, of course, we worked out how to map each user to a separate Unix user -- so the project quota concept was less useful. But XFS is solid, reliable, and just as fast as, if not faster than, other filesystems, so there was no reason to change.

How things went wrong

A few weeks back, we had an unexpected outage on a core database instance that supports PythonAnywhere. This caused a number of servers to crash (coincidentally due to the code we use to map PythonAnywhere users to Unix users), and we instituted a rolling reboot. This has happened a couple of times before, and has only required execution server reboots. But this time we needed to reboot the file servers as well.

Our normal process for rebooting an execution server is to run sync to synchronise the filesystem (being old Unix hands, we run it three times "just to be safe", despite the fact that this hasn't been necessary since sometime in the early '90s) and then to do a rapid reboot by echoing "b" to /proc/sysrq-trigger.

File servers, however, require a more gentle reboot procedure, because they have critical data stored on them, and are writing so much to disk that stuff can change between the last sync and the reboot, so a normal slow reboot command is much safer.

This time, however, we made a mistake -- we used the execution-server-style hard reboot on the file servers.

There were no obvious ill effects; when everything came back, all filesystems were up and running as normal. No data was lost, and the site was back up and running. So we wiped the sweat from our respective brows, and carried on as normal.

Quotas

We first noticed that something was going wrong an hour or so later. Some of our users started reporting that instead of seeing their own disk usage and quotas on the "Files" tab in the PythonAnywhere web interface, they were seeing things like "1.1TB used of 1.6TB quota". Basically, they were seeing the disk usage across the storage volumes they were linked to instead of the quota details specific to their accounts.

This had happened in the past: setting up a new project quota on XFS can take some time, especially when a volume has a lot of them (our volumes had tens of thousands), and it was done by a service running on the volume's file server that listened to a beanstalk queue and processed updates one at a time. So sometimes, when there was a backlog, people would not see the correct quota information for a while.

But this time, when we investigated, we discovered tons of errors in the "quota queue listener" service's logs.

It appeared that while XFS had managed to store files correctly across the hard reboots, the project quotas had gone wrong. Essentially, all users now had unquota'd disk space. This was obviously a big problem. We immediately set up some alerts so that we could spot anyone going over quota.

We also disabled quota reporting on the PythonAnywhere "Files" interface, so that people wouldn't be confused -- and so that nobody would guess what was up and take advantage by using tons of storage, causing problems for other users. We did not make any announcement about what was going on, as the risks were too high. (Indeed, this blog post is the announcement of what happened :-)

So, how to fix it?

Getting the backups back up

In order to get quotas working again, we'd need to run an XFS quota check on the affected filesystems. We'd done this in the past, and we'd found it to be extremely slow. This is odd, because XFS gurus had advised us that it should be pretty quick -- a few minutes at most. But the last time we'd run one it had taken 20 minutes, and that had been with significantly smaller storage volumes. If it scaled linearly, we'd be looking at at least a couple of hours' downtime. And if it was non-linear, it could be even longer.

We needed to get some kind of idea of how long it would take with our current data size. So, we picked a recent backup of 1.6TB worth of RAID0 disks, created fresh volumes for them, attached them to a fresh server, mounted it all, and kicked off the quota check.

24 hours later, it still hadn't completed. Additionally, in the machine's syslog there were a bunch of errors and warnings about blocked processes. The kind of errors and warnings that made us suspect that the process was never going to complete.

This was obviously not a good sign. The backup we were working from pre-dated the erroneous file server reboots. But the process by which we'd originally created it -- remember, we logged on to a backup server, used drbd to disconnect from its file server, did the backup snapshots, then reconnected drbd -- was actually quite similar to what would have happened during the server's hard reboot. Essentially, we had a filesystem where XFS might have been half-way through doing something when it was interrupted by the backup.

This shouldn't have mattered. XFS is a journaling filesystem, which means that it can be (although it generally shouldn't be) interrupted when it's half-way through something, and can pick up the pieces afterwards. This applies both to file storage and to quotas. But perhaps, we wondered, project quotas are different? Or maybe something else was going wrong?

We got in touch with the XFS mailing list, but unfortunately we were unable to explain the problem with the correct level of detail for people to be able to help us. The important thing we came away with was that what we were doing was not all that unusual, and it should all be working. The quotacheck should be completing in a few minutes.

And now for something completely different

At this point, we had multiple parallel streams of investigations ongoing. While one group worked on getting the quotacheck to pass, another was seeing whether another filesystem would work better for us. This team had come to the conclusion that ext4 -- a more widely-used filesystem than XFS -- might be worth a look. XFS is an immensely powerful tool, and (according to Wikipedia) is used by NASA for 300+ terabyte volumes. But, we thought, perhaps the problem is that we're just not expert enough to use it properly. After all, organisations of NASA's size have filesystem experts who can spend lots of time keeping that scale of system up and running. We're a small team, with smaller requirements, and need a simpler filesystem that "just works". On this theory, we thought that perhaps due to our lack of knowledge, we'd been misusing XFS in some subtle way, and that was the cause of our woes. ext4, being the standard filesystem for most current Linux distros, seemed to be more idiot-proof. And, perhaps importantly, now that we no longer needed XFS's project quotas (because PythonAnywhere users were now separate Unix users), it could also support enough quota management for our needs.

So we created a server with 1.6TB of ext4 storage, and kicked off an rsync to copy the data from another copy of the 1.6TB XFS backup the quotacheck team were using over to it, so that we could run some tests. We left that rsync running overnight.

When we came in the next morning, we saw something scary. The rsync had failed halfway through with IO errors. The backup we were working from was broken. Most of the files were OK, but some of them simply could not be read.

This was definitely something we didn't want to see. With further investigation, we discovered that our backups were generally usable, but in each one, some files were corrupted. Clearly our past backup tests (because, of course, we do test our backups regularly :-) had not been sufficient.

And clearly the combination of our XFS setup and drbd wasn't working the way we thought it did.

We immediately went back to the live system and changed the backup procedure. We started rolling "eternal rsync" processes -- we attached extra (ext4) storage to each file server, matching the existing capacity, and ran looped scripts that used rsync (at the lowest-priority ionice level) to make sure that all user data was backed up there.

We made sure that we weren't adversely affecting filesystem performance by checking out an enormous git repo into one of our own PythonAnywhere home directories, and running git status (which reads a bunch of files) regularly, and timing it.

Once the first eternal rsyncs had completed, we were 100% confident that we really did have everyone's data safe. We then changed the backup process to be:

This meant that we could be sure that the backups were recoverable, as they came from a filesystem that was not being written to while they happened. This time we tested them with a rsync from disk to disk, just to be sure that every file was OK.

We then copied the data from one of the new-style backups, that had come from an ext4 filesystem, over to a new XFS filesystem. We attached the XFS filesystem to a test server, set up the quotas, set some processes to reading from and writing to it, then did a hard reboot on the server. When it came back, it mounted the XFS filesystem, but quotas were disabled. Running a quotacheck on the filesystem crashed.

Further experiments showed that this was a general problem with pretty much any project-quota'ed XFS filesystem we could create; in our tests, a hard reboot caused a quotacheck when the filesystem was remounted, and this would frequently take a very long time, or even crash -- leaving the disk only mountable with no quotas.

We tried running a similar experiment using ext4; when the server came back after a hard reboot, it took a couple of minutes checking quotas and a few harmless-seeming warnings appeared in syslog. But the volumes mounted OK, and quotas were active.

Over to ext4

By this time we'd persuaded ourselves that moving to ext4 was the way forward for dimwits like us. So the question was, how to do it?

The first step was obviously to change our quota-management and system configuration code so that it used ext4's commands instead of XFS's. One benefit of doing this was that we were able to remove a bunch of database dependencies from the file server code. This meant that:

It's worth saying that the database dependency wasn't due to XFS; we were just able to eliminate it at this point because we were changing all of that code anyway.

Once we'd made the changes and run it through our continuous integration environment a few times to work out the kinks, we needed to deploy it. This was trickier.

What we needed to do was:

Parallelise rsync for great good

The "copy" phase was the problem. The initial run of our eternal rsync processes made it clear that copying 1.6TB (our standard volume size) from a 1.6TB XFS volume to an equivalent ext4 one took 26 hours. A 26 hour outage would be completely unacceptable.

However, the fact that we were already running eternal rsync processes opened up some other options. The first sync took 26 hours, but each additional one took 6 hours -- that is, after the initial 26-hour copy of all the data, each subsequent pass took 6 hours to find any changes that had happened on the XFS volume while the previous pass was running, and to copy those changes across to the ext4 volume. And then it took 6 hours to do that again.

We could use our external rsync target ext4 disks as the new disks for the new cluster, and just sync across the changes.

But that would still leave us with a 6+ hour outage -- 6 hours for the copy, and then extra time for moving disks around and so on. Better, but still not good enough.

Now, the eternal rsync processes were running at a very high nice and ionice level so as not to disrupt filesystem access on the live system. So we tested how long it would take to run the rsync with the opposite, resource-gobbling niceness settings. To our surprise, it didn't change things much; a rsync of 6 hours' worth of changes from an XFS volume to an ext4 one took about five and a half hours.

We obviously needed to think outside the box. We looked at what was happening while we ran one of these rsyncs, in top and iotop, and noticed that we were nowhere near maxing out our CPU or our disk IO... which made us think, what happens if we do things in parallel?

At this point, it might be worth sharing some (slightly simplified) code:

rsync-all.sh

#!/bin/bash
# Parameter $1 is the number of rsyncs to run in parallel
cd /mnt/old_xfs_volume/
ls -d * | xargs -n 1 -P $1 ~/rsync-one.sh

rsync-one.sh

#!/bin/bash
mkdir -p /mnt/new_ext4_volume/"$1"
rsync -raXAS --delete /mnt/old_xfs_volume/"$1" /mnt/new_ext4_volume/

For some reason our notes don't capture, on our first test we went a bit crazy and used 720 parallel rsyncs, for a total of about 2,000 processes.

It was way better. The copy completed in about 90 minutes. So we experimented. After many, many tests, we found that the sweet spot was about 30 parallel rsyncs, which took on average about an hour and ten minutes.
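
For the Pythonistas reading this, here is a rough equivalent of the same fan-out using concurrent.futures (a sketch only -- the shell scripts above are what we actually ran); the paths and the worker count of 30 simply mirror the sweet spot we settled on.

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

SRC = "/mnt/old_xfs_volume"
DST = "/mnt/new_ext4_volume"

def copy_home(name):
    # One rsync per home directory, mirroring rsync-one.sh above.
    os.makedirs(os.path.join(DST, name), exist_ok=True)
    return subprocess.call(["rsync", "-raXAS", "--delete",
                            os.path.join(SRC, name), DST + "/"])

# Run at most 30 rsyncs at a time; each thread just waits on its subprocess.
with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(copy_home, sorted(os.listdir(SRC))))

print("non-zero exit codes:", sum(1 for r in results if r != 0))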

Going.... LIVE

We believed that the copy would take about 70 minutes. Given that this deployment was going to require significantly more manual running of scripts and so on than a normal one, we figured that we'd need 50 minutes for the other tasks, so we were up from our normal 20-30 minutes of downtime for a release to two hours. Which was high, but just about acceptable.

The slowest time of day across all of the sites we host is between 4am and 8am UTC, so we decided to go live at 5am, giving us 3 hours just in case things went wrong. On 17 March, we had an all-hands-on deck go-live with the new code. And while there were a couple of scary moments, everything went pretty smoothly -- in particular, the big copy took 75 minutes, almost exactly what we'd expected.

So as of 17 March, we've been running on ext4.

Post-deploy tests

Since we went live, we've run two main tests.

First, and most importantly, we've tested our backups much more thoroughly than before. We've gone back to the old backup technique -- on the backup server, shut down the drbd connection, snapshot the disks, and restart drbd -- but now we're using ext4 as the filesystem. And we've confirmed that our new backups can be re-mounted, they have working quotas, and we can rsync all of their data over to fresh disks without errors. So that's reassuring.

Secondly, we've taken the old XFS volumes and tried to recover the quotas. It doesn't work. The data is all there, and can be rsynced to fresh volumes without IO errors (which means that at no time was anyone's data at risk). But the project quotas are irrecoverable.

We've also (before we went live with ext4, but after we'd committed to it) discovered that there was a bug in XFS -- fixed in Linux kernels since 3.17, but we're on Ubuntu Trusty, which uses 3.13. It is probably related to the problem we're seeing, but certainly doesn't explain it totally -- it explains why a quotacheck ran when we re-mounted the volumes, but doesn't explain why it never completed, or why we were never able to re-mount the volumes with quotas enabled.

Either way, we're on ext4 now. Naturally, we're 100% sure it won't have any problems whatsoever and everything will be just fine from now on ;-)

March 24, 2015 07:04 PM