
Planet Python

Last update: September 30, 2014 04:56 AM

September 29, 2014


Caktus Consulting Group

Celery in Production

(Thanks to Mark Lavin for significant contributions to this post.)

In a previous post, we introduced using Celery to schedule tasks.

In this post, we address things you might need to consider when planning how to deploy Celery in production.

At Caktus, we've made use of Celery in a number of projects, ranging from simple out-of-band tasks that send emails or create image thumbnails, to complex workflows that catalog and process large (10+ GB) files for encryption and remote archival and retrieval. Celery has a number of advanced features (task chains, task routing, auto-scaling) to fit most task workflow needs.

Simple Setup

A simple Celery stack would contain a single queue and a single worker which processes all of the tasks as well as schedules any periodic tasks. Running the worker would be done with

python manage.py celery worker -B

This assumes you are using the django-celery integration, but there are plenty of docs on running the worker (locally as well as daemonized). We typically use supervisord, for which there is an example configuration, but init.d, upstart, runit, or god are all viable alternatives.
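For illustration, a minimal supervisord program section might look something like this (paths, user, and naming here are assumptions, not the project's official example configuration):

[program:celery-worker]
command=/path/to/virtualenv/bin/python manage.py celery worker -B
directory=/path/to/project
user=celery
autostart=true
autorestart=true
stopwaitsecs=60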

The -B option runs the scheduler for any periodic tasks. It can also be run as its own process. See starting-the-scheduler.

We use RabbitMQ as the broker, and in this simple stack we would store the results in our Django database or simply ignore all of the results.

Large Setup

In a large setup we would make a few changes. Here we would use multiple queues so that we can prioritize tasks, and for each queue, we would have a dedicated worker running with the appropriate level of concurrency. The docs have more information on task routing.

The beat process would also be broken out into its own process.

# Default queue
python manage.py celery worker -Q celery
# High priority queue. 10 workers
python manage.py celery worker -Q high -c 10
# Low priority queue. 2 workers
python manage.py celery worker -Q low -c 2
# Beat process
python manage.py celery beat

Note that high and low are just names for our queues, and don't have any implicit meaning to Celery. We allow the high queue to use more resources by giving it a higher concurrency setting.

Again, supervisor would manage the daemonization and group the processes so that they can all be restarted together. RabbitMQ is still the broker of choice. With the additional task throughput, the task results would be stored in something with high write speed: Memcached or Redis. If needed, these worker processes can be moved to separate servers, but they would have a shared broker and results store.

Scaling Features

Creating additional workers isn't free. The default (prefork) pool starts a new process for each worker, and the default concurrency is one worker process per CPU. Pushing the concurrency far above the number of CPUs can quickly pin the memory and CPU resources on the server.

For I/O heavy tasks, you can dedicate workers using either the gevent or eventlet pools rather than new processes. These can have a lower memory footprint with greater concurrency but are both based on greenlets and cooperative multi-tasking. If there is a library which is not properly patched or greenlet safe, it can block all tasks.

There are some notes on using eventlet, though we have primarily used gevent. Not all of the features are available on all of the pools (time limits, auto-scaling, built-in rate limiting). Previously gevent seemed to be the better supported secondary pool, but eventlet seems to have closed that gap or surpassed it.

The process and gevent pools can also auto-scale. It is less relevant for the gevent pool since the greenlets are much lighter weight. As noted in the docs, you can implement your own subclass of the Autoscaler to adjust how/when workers are added or removed from the pool.
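For example, a worker dedicated to an I/O-bound queue could be started with the gevent pool, while a prefork worker is allowed to auto-scale (the queue name io and the numbers are only illustrative):

# I/O-bound queue, gevent pool with up to 100 greenlets
python manage.py celery worker -Q io -P gevent -c 100
# Prefork worker that auto-scales between 2 and 8 processes
python manage.py celery worker -Q celery --autoscale=8,2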

Common Patterns

Task state and coordination is a complex problem. There are no magic solutions whether you are using Celery or your own task framework. The Celery docs have some good best practices which have served us well.

Tasks must assert the state they expect when they are picked up by the worker. You won't know how much time has passed between when the original task was queued and when it executes. Another similar task might have already carried out the operation if there is a backlog.

We make use of a shared cache (Memcache/Redis) to implement task locks or rate limits. This is typically done via a decorator on the task. One example is given in the docs though it is not written as a decorator.
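A minimal sketch of such a lock decorator, in the spirit of the docs example but written as a decorator (the cache backend and timeout are assumptions):

from functools import wraps
from django.core.cache import cache

def single_instance_task(timeout=60 * 5):
    """Skip execution if another instance of the task already holds the lock."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            lock_id = 'lock-%s' % func.__name__
            # cache.add is atomic: it only sets the key if it does not already exist
            if cache.add(lock_id, 'true', timeout):
                try:
                    return func(*args, **kwargs)
                finally:
                    cache.delete(lock_id)
            # Another worker holds the lock, so this run is skipped
        return wrapper
    return decorator

Applied to a task function (below the task decorator), this skips the work if another instance holds the lock; a rate limit can be built the same way by counting invocations against a cache key with an expiry.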

Key Choices

When getting started with Celery you must make two main choices:

  • Broker
  • Result store

The broker manages pending tasks, while the result store stores the results of completed tasks.

There is a comparison of the various brokers in the docs.

As previously noted, we use RabbitMQ almost exclusively, though we have used Redis successfully and experimented with SQS. We prefer RabbitMQ because Celery's message-passing style and much of its terminology were written with AMQP in mind. There are no caveats with RabbitMQ like there are with Redis, SQS, or the other brokers which have to emulate AMQP features.

The major caveat with both Redis and SQS is the lack of built-in late acknowledgment, which requires a visibility timeout setting. This can be important when you have long running tasks. See acks-late-vs-retry.

To configure the broker, use BROKER_URL.

For the result store, you will need some kind of database. A SQL database can work fine, but using a key-value store can help take the load off of the database, as well as provide easier expiration of old results which are no longer needed. Many people choose to use Redis because it makes a great result store, a great cache server and a solid broker. AMQP backends like RabbitMQ are terrible result stores and should never be used for that, even though Celery supports it.

Results that are not needed should be ignored, using CELERY_IGNORE_RESULT or Task.ignore_result.

To configure the result store, use CELERY_RESULT_BACKEND.
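For illustration, the relevant Django settings for the large setup above might look something like this (hostnames, credentials, and the Redis database number are placeholders, not recommendations):

# settings.py -- illustrative values only
BROKER_URL = 'amqp://user:password@rabbit-host:5672/myvhost'
CELERY_RESULT_BACKEND = 'redis://redis-host:6379/1'
CELERY_TASK_RESULT_EXPIRES = 3600  # drop stored results after an hour
# Or, if no task return values are ever read:
# CELERY_IGNORE_RESULT = True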

RabbitMQ in production

When using RabbitMQ in production, one thing you'll want to consider is memory usage.

With its default settings, RabbitMQ will use up to 40% of the system memory before it begins to throttle, and even then can use much more memory. If RabbitMQ is sharing the system with other services, or you are running multiple RabbitMQ instances, you'll want to change those settings. Read the linked page for details.
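For instance, with the Erlang-term rabbitmq.config format in use at the time, lowering the watermark to 20% of system memory might look like this (the exact value is just an illustration):

%% /etc/rabbitmq/rabbitmq.config
[
  {rabbit, [
    {vm_memory_high_watermark, 0.2}
  ]}
].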

Transactions and Django

You should be aware that Django's default handling of transactions can be different depending on whether your code is running in a web request or not. Furthermore, Django's transaction handling changed significantly between versions 1.5 and 1.6. There's not room here to go into detail, but you should review the documentation of transaction handling in your version of Django, and consider carefully how it might affect your tasks.

Monitoring

There are multiple tools available for keeping track of your queues and tasks. I suggest you try some and see which work best for you.

Summary

When going to production with your site that uses Celery, there are a number of decisions to be made that could be glossed over during development. In this post, we've tried to review some of the decisions that need to be thought about, and some factors that should be considered.

September 29, 2014 12:00 PM


Data Community DC

Social Network Analysis with Python Workshop on November 22nd

Social Network Analysis with Python

Data Community DC and District Data Labs are hosting a full-day Social Network Analysis with Python workshop on Saturday November 22nd.  For more info and to sign up, go to http://bit.ly/1lWFlLx.  Register before October 31st for an early bird discount!

Overview

Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are; and trust me, I'm not talking about Myspace! Social networks are extremely interesting models for human behavior, whose study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!

Because networks take a relationship-centered view of the world, the data structures that we will analyze model real-world behaviors and community. Through a suite of algorithms derived from mathematical graph theory, we are able to compute and predict the behavior of individuals and communities through these types of analyses. Clearly this has a number of practical applications, from recommendation to law enforcement to election prediction, and more.

What You Will Learn

In this course we will construct a social network from email communications using Python. We will learn analyses that compute cardinality, as well as traversal and querying techniques on the graph, and even compute clusters to detect community. Besides learning the basics of graph theory, we will also make predictions and create visualizations from our graphs so that we can easily harness social networks in larger data products.

Course Outline

The workshop will cover the following topics:

Upon completion of the course, you will understand how to conduct graph analyses on social networks, as well as have built a library for analyses on a social network constructed from email communications!

Instructor: Benjamin Bengfort

Benjamin is an experienced Data Scientist and Python developer who has worked in military, industry, and academia for the past eight years. He is currently pursuing his PhD in Computer Science at The University of Maryland, College Park, doing research in Metacognition and Active Logic. He is also a Data Scientist at Cobrain Company in Bethesda, MD where he builds data products including recommender systems and classifier models. He holds a Masters degree from North Dakota State University where he taught undergraduate Computer Science courses. He is also adjunct faculty at Georgetown University where he teaches Data Science and Analytics.

For more information and to reserve a seat, please go to  http://bit.ly/1lWFlLx.

The post Social Network Analysis with Python Workshop on November 22nd appeared first on Data Community DC.

September 29, 2014 11:00 AM


eGenix.com

eGenix PyCon UK 2014 Talks & Videos

The PyCon UK Conference is the premier conference for Python users and developers in the UK. This year it was held from September 19-22 in Coventry, UK.

eGenix Talks at PyCon UK 2014

At this year's PyCon UK, Marc-André Lemburg, CEO of eGenix, gave the following talks. The presentations are available for viewing and download from our Presentations and Talks section.

When performance matters ...

Simple idioms you can use to make your Python code run faster and use less memory.

Python applications sometimes need all the performance they can get. Think of e.g. web, REST or RPC servers. There are several ways to address this: scale up by using more processes, use Cython, use PyPy, rewrite parts in C, etc.

However, there are also quite a few things that can be done directly in Python. This talk goes through a number of examples and showcases how sticking to a few idioms can easily enhance the performance of your existing applications without having to resort to more complex optimization strategies.

The talk was complemented with a lightning talk titled "Pythons and Flies", which addresses a memory performance idiom and answers one of the audience questions raised in the above talk.

Click to proceed to the PyCon UK 2014 talk and lightning talk videos and slides ...

Python Web Installer

Installing Python packages is usually done with one of the available package installation systems, e.g. pip, easy_install, zc.buildout, or manually by running "python setup.py install" in a package distribution directory.

These systems work fine as long as you have Python-only packages. For packages that contain binaries, such as Python C extensions or other platform dependent code, the situation is a lot less bright.

In this talk, we present a new web installer system that we're currently developing to overcome these limitations.

The system combines the dynamic Python installation interface supported by all installers ("python setup.py install"), with a web installer which automatically selects, downloads, verifies and installs the binary package for your platform.

Click to proceed to the PyCon UK 2014 talk video and slides ...

If you are interested in learning more about these idioms and techniques, eGenix now offers Python project coaching and consulting services to give your project teams advice on how to achieve best performance and efficiency with Python. Please contact our eGenix Sales Team for information.

Enjoy !

Charlie Clark, eGenix.com Sales & Marketing

September 29, 2014 11:00 AM


Geert Vanderkelen

MySQL Connector/Python 2.0.1 GA

MySQL Connector/Python v2.0 goes GA with version 2.0.1 GA. It is available for download from the MySQL Developer Zone! The previous post about 2.0 described what changed and what was added.

Our manual has the full change log, but here's an overview of the most important changes for this release.

Useful links

September 29, 2014 10:31 AM


Reach Tim

BAM! A Web Framework "Short Stack"

This how-to article describes how you can use Bottle, Apache, and MongoDB to create a simple, understandable, and fast website. A real-life application that supports CRUD operations will most likely also use JavaScript, but BAMJ doesn't sound as cool. If you want to play with the code in this example, it is available on GitHub. The example is kept very simple to show the general workflow.

Table of Contents

Here are the tools:

Note: Be warned if you want to use an NFS file system to house your web files, logs, or database. NFS is nice if you need to share directories with clients over a network. However, file locking is slow, and that can cause problems if a program finds it cannot get a lock. It is not recommended to use NFS for logging or for MongoDB.

Overview

This is what we’re going to do:

  1. Create a simple MongoDB database.
  2. Set up the Bottle web framework.
  3. Set up Apache and WSGI to enable Bottle applications.
  4. Create a Bottle application that can connect (using GET/POST) to the MongoDB database.
  5. Create HTML to view the data. With some final notes on how to handle CRUD operations on the data.

I’m assuming you have installed Bottle, the Apache HTTP server with mod_wsgi, and MongoDB.
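If any of those are still missing, the Python pieces can usually be pulled in with pip (Apache/mod_wsgi and the MongoDB server itself typically come from your platform's package manager); the package names below are the standard PyPI ones, not anything specific to this article:

pip install bottle pymongo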

Create the Database

First, create a toy people database in the mongo shell. We’ll use a collection called test. Here is what it looks like from a terminal window:

> mongo servername
MongoDB shell version: 2.4.9
connecting to: servername/test
rs1:PRIMARY> var people = [
    {name:'Alphonse', age: '15', height: '67', weight:'116.5'},
    {name:'Janice', age: '12', height: '54.3', weight:'60.5'},
    {name:'Sam', age: '12', height: '64.3', weight:'123.25'},
    {name:'Ursula', age: '14', height: '62.8', weight:'104.0'}
]

rs1:PRIMARY> db.people.insert(people)
rs1:PRIMARY> db.people.find()
{ "_id" : ObjectId("..."), "name" : "Alphonse", "age" : "15", "height" : "67", "weight" : "116.5" }
{ "_id" : ObjectId("..."), "name" : "Janice", "age" : "12", "height" : "54.3", "weight" : "60.5" }
{ "_id" : ObjectId("..."), "name" : "Sam", "age" : "12", "height" : "64.3", "weight" : "123.25" }
{ "_id" : ObjectId("..."), "name" : "Ursula", "age" : "14", "height" : "62.8", "weight" : "104.0" }

The data is a set of height and weight measurements on children of various ages. That’s the data we’ll connect to via Bottle/Apache.

Bottle Directory Structure

In /var/www/bottle, the directory structure looks like this. This layout makes it easy to add new applications as you need them. We have the main wsgi script (bottle_adapter.wsgi) and a single application, people. For every application you write, you may also have a CSS file, a JavaScript file, and templates in the views subdirectory.

bottle_adapter.wsgi
people.py

css/
    person_form.css

js/
    person.js

views/
    people/
        people.tpl
        person.tpl

To add a new application:

  1. create the application code file in the root directory, like people.py
  2. mount the application in the bottle_adapter.wsgi file
  3. add the template path in the bottle_adapter.wsgi file
  4. add css, js files if needed
  5. add a view subdirectory to contain the application templates.

In this example we won’t use any javascript or even css, but in real life you will probably want to add that in. This directory layout keeps things simple and separated, and it’s easy to extend.

Apache Configuration

In the Apache httpd.conf file, set the WSGIScriptAlias and WSGIDaemonProcess. For details on wsgi configuration, see the guide. When a request comes in to our server with a URL that begins with /service, the request is passed on to our Bottle application.

LoadModule wsgi_module /usr/lib/apache2/modules/mod_wsgi.so
WSGIScriptAlias /service /var/www/bottle/bottle_adapter.wsgi
WSGIDaemonProcess example02 processes=5 threads=25

You may need to set other parameters depending on your platform. I am on a FreeBSD machine and needed to set WSGIAcceptMutex posixsem in addition to the settings above. The wsgi guide is well written and indispensable.

The wsgi Adapter

When the Apache server gets a request whose URL begins with /service, the httpd daemon calls the Bottle application adapter script, bottle_adapter.wsgi. As you can see below, the script imports the application code, adds to the template path, and "mounts" the application on a specific URL path.

Now that the people application is mounted on the URL path /people, the application will get called for any URL that starts with /service/people.

The debug attribute is set to True until we're ready to go to production. Bottle provides its own development server for getting started; I've commented out that line, but it is helpful when debugging your application code.

import sys, bottle
sys.path.insert(0,'/var/www/bottle')
import people

bottle.TEMPLATE_PATH.insert(0, '/var/www/bottle/views/people')

application = bottle.default_app()
application.mount('/people', people.app)

bottle.debug(True)
#bottle.run(host='example02', port=8090, debug=True)

The Skeleton of an App

Every application will have a similar skeleton; all my apps have the same things at the beginning. The following code block displays the beginning of the people application. First, we make the necessary imports, get a connection to the MongoDB test database, and instantiate our application, app, which is known as people.app to the wsgi adapter.

from bottle import Bottle, view, request
from bson.objectid import ObjectId
import pymongo

conn = pymongo.MongoReplicaSetClient(
    'example02.unx.sas.com, example01.unx.sas.com',
    replicaSet='rs1')
db = conn.test

app = Bottle()

And there we have it—our people.app Bottle application, all wired up but not able to do anything yet. Let’s add some abilities so we can respond to URL requests. For example, when a GET request comes in on service/people, we’ll send back a response with a list of the people in the database.

Most capabilities you code will have three parts:

In the following code block, the @app.get decorator corresponds to a URL route; it responds to a GET request with no further arguments ('/'). However, by the time the request makes it to this point, the URL is actually /service/people; that is what the root URL looks like to the people app.

The @view decorator specifies the template to use to display the data in the response (that is, the template views/people/people.tpl )

The people function creates a cursor which gets all the records (documents in the MongoDB world) from the database, sorts those records on the name and returns them as a list under the key named results.

@app.get('/')
@view('people')
def people():
    p = db.people.find().sort([('name', 1)])
    return {'results': list(p)}

Note: The sort function in pymongo operates differently from the mongo shell; it uses a list of tuples to sort on instead of a document. This tripped me up at first. The documentation explains it well.
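In other words, where the mongo shell takes a document, pymongo takes a list of (field, direction) tuples. A quick sketch of the equivalent calls (field names taken from the toy database above):

# mongo shell:   db.people.find().sort({name: 1, age: -1})
# pymongo:
cursor = db.people.find().sort([('name', 1), ('age', -1)])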

Next, here is a route using the @app.get decorator which responds to a GET request of the form /service/people/somename. It executes the person function and renders the resulting data using the person template (views/people/person.tpl).

The person function gets the single (or first) document with a matching name, converts the _id value to a string and returns the data under the key named person.

@app.get('/<name>')
@view('person')
def person(name):
    person = db['people'].find_one({'name':name})
    person['_id'] = str(person['_id'])
    return {'person': person}

Why convert the _id from an ObjectId (as it exists in the MongoDB database) to a string? If all you want to do is display the record, you could just delete the key, but if you want to interact with the record (possibly updating or deleting it through the client), you have to keep track of that _id. Otherwise there's no way to map back to the record in the database. So we transform it to a string on the way out to the client and, as you'll see in the next route function, we transform it back to an ObjectId on the way back in to the database.

The last route in this example uses the app.post decorator to respond to a POST request. It creates a new dictionary populated with values from the request, converts the string _id value back to an ObjectID, and saves the record back to the database. It returns a bare string (“Thanks…”) so the user knows the data was updated.

@app.post('/<name>')
def update_person(name):
    person = {'name': name}
    person['age'] = request.POST.get('age')
    person['weight'] = request.POST.get('weight')
    person['height'] = request.POST.get('height')
    person['_id'] = ObjectId(request.POST.get('_id'))

    db['people'].save(person)
    return 'Thanks for your data.'

Keep in mind that this is just a simple toy example; if you have a bazillion records or a gazillion requests per second, you’ll want to do a lot of reading on MongoDB index creation, aggregation and filtering techniques.

Summary: What We Have So Far

When a GET request comes in with this url:

http://example01/service/people

that URL is routed from Apache to the Bottle wsgi script, on to the people.app Bottle application, and finally to the /people route. That route, in turn, invokes the people function, which retrieves all the MongoDB documents in the people collection. The JSON data is returned in the response under the key named results and rendered by the people template.

When a request comes in with this url:

http://example01/service/people/Sam

If the request is a GET, the document with name = Sam is retrieved and the _id is converted to a string (it is stored as an ObjectId in the MongoDB database). The JSON data is returned in the response under the key named person and rendered by the person template.

If the request is a POST, the data is read from the request form, the _id is turned back into an ObjectId and the document is saved to the database.

Note: With this smidgen of code and a rational directory structure, before even talking about the HTML side, we have a simple, easy-to-understand API that can send and receive JSON documents, interacting with a MongoDB backend database.

A Sample Template File

You can use plain HTML to display and update records if your underlying data has a simple structure, as our example data does.

This is the view-only template people.tpl which displays the data on a GET request to http://example02/service/people:

<!doctype html>
<html>
<head>
  <title>People</title>
</head>
<body>
     <h1>People</h1>
     <table>
        <tr><th>Name</th><th>Age</th></tr>
        %for person in results:
            <tr>
                <td>{{person['name']}}</td>
                <td>{{person['age']}}</td>
            </tr>
        %end for
    </table>
</body>
</html>

And this is the (unstyled) result:

peoplelist

To view a particular person, the logic is similar, except we get back a single record: Here is the template person.tpl:

<!doctype html>
<html>
<head>
  <title>{{person['name']}} Stats</title>
</head>
<body>
  <h1>{{person['name']}} Stats</h1>
     <table>
     <tr><th>Age</th><td>{{person['age']}}</td></tr>
     <tr><th>Height</th><td>{{person['height']}}</td></tr>
     <tr><th>Weight</th><td>{{person['weight']}}</td></tr>
   </table>

</body>
</html>

If the url is http://example01/service/people/Alphonse, here is the result:

person

CRUD Operations

Performing updates, deletes, or creating records can get tricky, depending on the complexity of your data. If you have nested JSON data, you will need some Javascript to serialize the user-input data so the result has the correct structure when it gets back to MongoDB. More on that later. For now let’s write a form that enables us to update a simple record and save it.

Simple Case

First we need to view the data in a prepopulated form and then we need to update the database record with the changes made in the HTML client.

We already have the Python code (the person and update_person functions), so we just need to write the HTML page. You might call the page person_update.tpl and change the @app.get('/<name>') route to use that instead of the view-only template person.tpl from before.

This page displays the data but now inside an HTML form. We keep the _id value because we must have it in order to update the database. If we don’t include the _id on the way back in to the database, our changed data will be added to the database as a new record instead of updating the original record.

<!doctype html>
<html>
<head>
  <title>{{person['name']}} Stats</title>
</head>
<body>
  <h1>{{person['name']}} Stats</h1>
  <form id="persondata" name="person" method="post" action="http://example02/service/people/{{person['name']}}">
    <input type="text" name="_id" value="{{person['_id']}}" style="display: none;" />
    <input type="text" name="name" value="{{person['name']}}" style="display: none;" />
     <table>
     <tr><th>Age</th>
      <td><input type="text" name="age" value="{{person['age']}}" /></td>
    </tr>
     <tr><th>Height</th>
      <td><input type="text" name="height" value="{{person['height']}}" /></td>
    </tr>
     <tr><th>Weight</th>
      <td><input type="text" name="weight" value="{{person['weight']}}" /></td>
    </tr>
    <tr><button type="submit">Save</button></tr>
   </table>
  </form>
</body>
</html>

Complex Case

If you have nested JSON data you will probably have to use Javascript. There are so many solutions that I’ll just point out some links. The library I use (and I have deeply nested JSON) is form2js. It is a Javascript library for collecting form data and converting it to a Javascript object.

In no particular order, here are some similar libraries that address the same situation:

The following is an example of using the form2js library and jQuery. You can use the library with arbitrarily nested JSON documents. However, you need to make a couple of changes to your Bottle app. This example template and the changed app code are in the GitHub files alt_people.py and alt_person_update.tpl.

The main change to the people app is the update_person function, which now looks like the following. It reads the JSON string from the request body and converts it. The loads function comes from bson.json_util, which ships with the pymongo package; it converts a JSON string into a BSON-compatible document. Finally, our function converts the _id to an ObjectId and saves the record back to the database.

# Requires: from bson.json_util import loads (added to the imports at the top of people.py)
@app.post('/')
def update_person():
    data = request.body.read()
    person = loads(data)
    person['_id'] = ObjectId(person['_id'])
    db['people'].save(person)

In the template, the form itself is identical but the submit button now calls a Javascript function.

<!doctype html>
<head>
<title>{{person['name']}} Stats</title>
</head>
<body>
<h1>{{person['name']}} Stats</h1>
<form id="persondata" name="person">
  <input type="text" name="_id" value="{{person['_id']}}" style="display: none;" />
  <input type="text" name="name" value="{{person['name']}}" style="display: none;" />
  <table>
    <tr>
      <th>Age</th>
      <td><input type="text" name="age" value="{{person['age']}}" /></td>
    </tr>
    <tr>
      <th>Height</th>
      <td><input type="text" name="height" value="{{person['height']}}" /></td>
    </tr>
    <tr>
      <th>Weight</th>
      <td><input type="text" name="weight" value="{{person['weight']}}" /></td>
    </tr>
    <tr>
      <td><input type="submit" /></td>
    </tr>
  </table>
</form>

When the user clicks the submit button, the save_data function is called.

First we load jQuery and the form2js library. Then we have two functions: one that executes when the submit button is clicked, and the save_data function. The save_data function reads the JSON data from the form into the variable json_data, stringifies it to the variable jsonstr, and fires an AJAX POST back to our URL route for updating a person (the only route in our app that accepts a POST request).

<script type="text/javascript" src="/js/jquery.min.js"></script>
<script type="text/javascript" src="/js/form2js.js"></script>
<script type="text/javascript">
    save_data = function(evt){
        var json_data = form2js('persondata', skipEmpty=false);
        var jsonstr = JSON.stringify(json_data, null, '\t');
        $.ajax({
            type:"POST",
            url: 'http://example02/service/people',
            data: jsonstr,
            success: function(){
                alert('Configuration saved.')
            }
        });
    };
  (function($){
     $(function(){
    $('input[type=submit]').val('Submit').click(save_data);
  });
   })(jQuery);
   </script>
</body>
</html>

When a request comes in at this URL:

http://example01/service/people/Janice

The result is displayed in a form:

person_complex

Check it out on GitHub if you want to play around with it. It is very easy to add applications and get CRUD operations going on any MongoDB database.

Let me know if you have any questions or if something needs changing on the GitHub repo.

September 29, 2014 04:00 AM


Vasudev Ram

CommonMark, a pure Python Markdown parser and renderer


By Vasudev Ram

I got to know about CommonMark.org via this post on the Python Reddit:

CommonMark.py - pure Python Markdown parser and renderer

From what I could gather, CommonMark is, or aims to be, two things:

1. "A standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests".

2. A Python parser and renderer for the CommonMark Markdown spec.

CommonMark on PyPI, the Python Package Index.

Excerpts from the CommonMark.org site:

[ We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification. We believe this is necessary, even essential, for the future of Markdown. ]

[ Who are you?
We're a group of Markdown fans who either work at companies with industrial scale deployments of Markdown, have written Markdown parsers, have extensive experience supporting Markdown with end users – or all of the above.

John MacFarlane
David Greenspan
Vicent Marti
Neil Williams
Benjamin Dumke-von der Ehe
Jeff Atwood ]

So I installed the Python library for it with:

pip install commonmark

Then I modified this snippet of example code from the CommonMark PyPI site:

import CommonMark
parser = CommonMark.DocParser()
renderer = CommonMark.HTMLRenderer()
print(renderer.render(parser.parse("Hello *World*")))

on my local machine, to add a few more types of Markdown syntax:

import CommonMark
parser = CommonMark.DocParser()
renderer = CommonMark.HTMLRenderer()
markdown_string = \
"""
Heading
=======

Sub-heading
-----------

# Atx-style H1 heading.
## Atx-style H2 heading.
### Atx-style H3 heading.
#### Atx-style H4 heading.
##### Atx-style H5 heading.
###### Atx-style H6 heading.

Paragraphs are separated
by a blank line.

Leave 2 spaces at the end of a line to do a
line break

Text attributes *italic*, **bold**, `monospace`.

A [link](http://example.com).

Shopping list:

* apples
* oranges
* pears

Numbered list:

1. apples
2. oranges
3. pears

"""
print(renderer.render(parser.parse(markdown_string)))

Here is a screenshot of the output HTML generated by CommonMark, loaded in Google Chrome:


Reddit user bracewel, who seems to be a CommonMark team member, said on the Py Reddit thread:

eventually we'd like to add a few more renderers, PDF/RTF being the first....

So CommonMark looks interesting and worth keeping an eye on, IMO.

- Vasudev Ram - Dancing Bison Enterprises - Python training and consulting

Dancing Bison - Contact Page

September 29, 2014 03:55 AM

September 28, 2014


Ned Batchelder

Coverage.py v4.0a1

The first alpha of the next major version of coverage.py is available: coverage.py v4.0a1.

The big new feature is support for the gevent, greenlet, and eventlet concurrency libraries. Previously, these libraries' behind-the-scenes stack swapping would confuse coverage.py. Now coverage adapts to give accurate coverage measurement. To enable it, use the "concurrency" setting to specify which library you are using.
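For example, a minimal .coveragerc enabling it might look like the following (assuming your code uses gevent):

[run]
concurrency = gevent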

Huge thanks to Peter Portante for getting the concurrency support started, and to Joe Jevnik for the final push.

Also new is that coverage.py will read its configuration from setup.cfg if there is no .coveragerc file. This lets you keep more of your project configuration in one place.

Lastly, the textual summary report now shows missing branches if you are using branch coverage.

One warning: I'm moving around lots of internals. People have a tendency to use whatever they need to in order to get their plugin or tool to work, so some of those third-party packages may now be broken. Let me know what you find.

Full details of other changes are in the CHANGES.txt file.

September 28, 2014 02:36 PM


Mats Kindahl

Pythonic Database API: Now with Launchpad

In a previous post, I demonstrated a simple Python database API with a syntax similar to jQuery. The goal was to provide a simple API that would allow Python programmers to use a database without having to resort to SQL, nor having to use any of the good, but quite heavy, ORM implementations that exist. The code was just an experimental implementation, and I was considering putting it up on Launchpad.
I did some basic cleaning of the code, turned it into a Python package, and pushed it to Launchpad. I also added some minor changes, such as introducing a define function to define new tables instead of automatically creating one when an insert was executed. Automatically constructing a table from values seems neat, but in reality it is quite difficult to ensure that it has the right types for the application. Here is a small code example demonstrating how to use the define function together with some other operations.

import mysql.api.simple as api

server = api.Server(host="example.com")

server.test_api.tbl.define(
    { 'name': 'more', 'type': int },
    { 'name': 'magic', 'type': str },
)

items = [
    {'more': 3, 'magic': 'just a test'},
    {'more': 3, 'magic': 'just another test'},
    {'more': 4, 'magic': 'quadrant'},
    {'more': 5, 'magic': 'even more magic'},
]

for item in items:
    server.test_api.tbl.insert(item)
The table is defined by providing a dictionary for each column that you want in the table. The two most important fields in the dictionary are name and type. The name field supplies a name for the column, and the type field provides the type of the column. The type is denoted using a basic Python type constructor, which then maps internally to a SQL type. So, for example, int maps to the SQL INT type, and bool maps to the SQL type BIT(1). The choice to use Python types is simply because it is more natural for a Python programmer to define the tables from the data that the programmer wants to store in the database. In this case, I would be less concerned with how the types are mapped, just assuming that they are mapped in a way that works. It is currently not possible to register your own mappings, but that is easy to add.

So, why provide the type object and not just a string with the type name? The idea I had here is that since Python has introspection (it is a dynamic language after all), it would be possible to add code that reads the provided type objects and does things with them, such as figuring out what fields there are in the type. It's not that I plan to implement this, but even though this is intended to be a simple database interface, there is no reason to tie one's hands from the start, so this simple approach will provide some flexibility if needed in the future.

Links

Some additional links that you might find useful:
  • Connector/Python: You need to have Connector/Python installed to be able to use this package.
  • Sequalize: This is a JavaScript library that provides a similar interface to a database. It claims to be an ORM layer, but is not really. It is more similar to what I have written above.
  • Roland's MQL to SQL and his presentation on SlideShare are also some inspiration for alternatives.

September 28, 2014 07:11 AM

September 27, 2014


Ram Rachum

Another silly Python riddle

Do you think of yourself as an experienced Python developer? Do you think you know Python’s quirks inside and out? Here’s a silly riddle to test your skills.

Observe the following Python code:


    def f(x):
        return x == not x
    
    f(None)

The question is: What will the call f(None) return?

Think carefully and try to come up with an answer without running the code. Then check yourself :)

September 27, 2014 10:08 PM


Ned Batchelder

How should I distribute coverage.py alphas?

I thought today was going to be a good day. I was going to release the first alpha version of coverage.py 4.0. I finally finished the support for gevent and other concurrency libraries like it, and I wanted to get the code out for people to try it.

So I made the kits and pushed them to PyPI. I used to not do that, because people would get the betas by accident. But pip now understands about pre-releases and real releases, and won't install an alpha version by default. Only if you explicitly use --pre will you get an alpha.

About 10 minutes after I pushed the kits, someone I was chatting with on IRC said, "Did you just release a new version of coverage?" Turns out his Travis build was failing.

He was using coveralls to report his coverage statistics, and it was failing. Turns out coveralls uses internals from coverage.py to do its work, and I've made big refactorings to the internals, so their code was broken. But how did the alpha get installed in the first place?

He was using tox, and it turns out that when tox installs dependencies, it defaults to using the --pre switch! Why? I don't know.

OK, I figured I would just hide the new version on PyPI. That way, if people wanted to try it, they could use "pip install coverage==4.0a1", and no one else would be bothered with it. Nope: pip will find the newer version even if it is hidden on PyPI. Why? I don't know.

In my opinion:

So now the kit is removed entirely from PyPI while I figure out a new approach. Some possibilities, none of them great:

  1. Distribute the kit the way I used to, with a download on my site. This sucks because I don't know if there's a way to do this so that pip will find it, and I don't know if it can handle pre-built binary kits like that.
  2. Do whatever I need to do to coverage.py so that coveralls will continue to work. This sucks because I don't know how much I will have to add back, and I don't want to establish a precedent, and it doesn't solve the problem that people really don't expect to be using alphas of their testing tools on Travis.
  3. Make a new package on PyPI: coverage-prerelease, and instruct people to install from there. This sucks because tools like coveralls won't refer to it, so either you can't ever use it with coveralls, or if you install it alongside, then you have two versions of coverage fighting with each other? I think?
  4. Make a pull request against coveralls to fix their use of the now-missing coverage.py internals. This sucks (but not much) because I don't want to have to understand their code, and I don't have a simple way to run it, and I wish they had tried to stick to supported methods in the first place.
  5. Leave it broken, and let people fix it by overriding their tox.ini settings to not use --pre, or wait until people complain to coveralls and they fix their code. This sucks because there will be lots of people with broken builds.

Software is hard, yo.

September 27, 2014 08:20 PM


Varun Nischal

taT4Py | Recursively Search Regex Patterns

[UPDATE: 09/28/2014] I have mainly used python for text parsing, validation and transforming as needed. If it were done using a shell script, I would end up writing a variety of regular expressions to play around with. Getting Started Well, python is no different and in order to cook up regular expressions, one must import re (module) and get … Continue reading

September 27, 2014 08:02 PM

taT4Py | Convert AutoSys Job Attributes into Python Dictionary

[UPDATE: 09/28/2014] If you ever look at the definition of specific AutoSys Job, you would find that it contains attribute-value pairs (line-by-line), delimited by colon ‘:’ I thought it would be cool to parse the job definition, by creating python dictionary using the attribute-value pairs. Let us take a look at sample job definition; $> cat … Continue reading

September 27, 2014 07:29 PM


Tim Golden

PyConUK 2014: The Teachers

This is the third time that PyCon UK has organised a track especially for educators: really, for Primary & Secondary teachers, rather than for University lecturers. The first, a couple of years ago, was small, essentially a few hours within the main conference. The second, a whole Saturday last year, suffered from a belief that teachers would be too busy during the week and would prefer to come at the weekend. (In fact it seems they’re happy to get a day out from school on legitimate business such as learning about Python). This year we had upwards of thirty for the whole of Friday, students as well as experienced teachers.

You can read about some of the activities on offer and what they got out of them on their blogs and elsewhere. What I want to talk about is what I got out of it. I don’t think I learned anything absolutely new, but certain things were brought home to me more clearly.
I think that, as hackers (in Eric Raymond’s sense of that word), we expect teachers and pupils alike to be on fire with enthusiasm for programming, agog to learn more and to advance in this amazing world, and skilful enough to appreciate the subtle nuances and elegant constructs of the Python idiom. And we expect their schools to be 100% behind them, facilitating their initiatives, and supporting their ideas for learning in the 21st Century.

There surely are such people and institutions out there. But the picture painted by the teachers at PyCon UK was quite different. IT Administrators – whether school or county – can be restrictive in the extreme: disallowing access to the command prompt, enforcing bureaucratic processes for installing new packages, and limiting network access. At the same time, pupils may not be the excited and engaged proto-hackers you might have hoped for; they may have real difficulties with even the simplest of programming concepts. And the teachers are … teachers: people whose daily life is spent charting a course through the administrivia of modern-day schools, managing classrooms full of underappreciative pupils, and struggling to master for themselves some of the concepts they want to pass on to their pupils.

Like many in my situation, I started programming when I was 12, initially on my engineer Father’s programmable HP calculator and then, like so many others, on a BBC B. And I genuinely can’t remember a time when programming wasn’t intuitive. Of course, I’ve learnt and improved over the years. Of course there were things I didn’t understand; there still are. But the basic ideas of programming have always come easily. And I can find it a real challenge to put across concepts which are so very basic and whose explanation you’ve never needed to people to whom they don’t come naturally: to do the job of a teacher, in fact.

But that’s what each of us had to do for most of the teachers at PyCon UK and that’s what they have to do for most of their pupils. Even those teachers who have some experience in programming can struggle to visualise the solution to a more complex problem. To make matters worse, Computing is one of those subjects – like Music, perhaps – where a small number of the pupils can be streets ahead of the rest of the class. And maybe of the teacher.

On the Friday afternoon, after a morning of workshops, we had a Dojo-like session where teachers called out ideas for projects they'd be interested in pursuing with help from developers, and we developers attached ourselves to a group to offer advice. The group I joined was somewhat decided for me, as one of the teachers was asking for help with an A2 project involving using a relational database to produce a simple room-booking website – and since my day job involves relational databases, I was pushed to the front to help a group of eight or so teachers.

In the discussion which followed, it became clear that many of the teachers in the group were hazy not only on the ideas of relational databases, but also on dynamic websites, plus the pieces of Python needed to pull those together. Another developer helpfully jumped in to propose Django, which brings us to the point I started to make above. Django has a good tutorial with a useful get-you-started admin mode, is well-documented, and has a strong and helpful community. But it requires a big level of buy-in from the pupils and, especially, from the teacher who will have to be able to help them.

“But there are forums,” you cry, “and mailing lists, and stackoverflow answers, and books, and lots of documentation”. And there are. But I was talking to teachers who have to fit a project like this into the structure of perhaps one or two classes a week plus homework, among everything else. And something like Django is a big overhead. “But it's easy!” It's not bad, if you already know what you're talking about, or have unlimited time to play around and throw projects away. But, for this small project alone, I was having to pull together a headline course on relational databases and your possibilities using Python, with or without non-stdlib modules, plus the same thing for web programming, plus all the Python bits to make it all work. It was humbling to realise how many obstacles a teacher has to overcome to fulfil even the basics of the current computing curriculum.

As a proof-of-concept, I jotted down some notes on the spot, and ran up a couple of modules using sqlite and bottle (since it’s a one-file import which makes it easier to smuggle past restrictive installation régimes). The results are on my github.

During the Sprints on Monday (and on the way home) I decided to take it further, and to dust off an idea I had last year and which I hadn’t really pursued: One Screenful of Python. In last year’s Education track, I’d understood that one screenful of code was about right for one classroom session. So my idea was to produce useful, engaging, readable, snippets of Python which would fit, roughly, onto one screen and would produce an interesting result and illustrate some useful points of the Python idiom or of programming in general. It’s hosted in a github organisation.

It was quite clear that the project of building a website for room bookings using a relational database wasn’t going to fit on one screen even with some code gymnastics, which would in any case have defeated the pedagogical purpose of the exercise. So I restarted and produced a one-module project, built up over 13 steps, each building on the last (so a code-diff tool will produce a useful comparison). The result is in its own repo within the One Screenful organisation.

The code should run unchanged on 2.7 and 3.x; it should run cross-platform, especially targeting Windows & Raspberry Pi as they're most likely to be found in the classroom; and it should work with the stdlib alone. These restrictions are an attempt to reduce possible difficulties teachers might face in getting it up and running. Sticking to the stdlib obviously precludes handy tools like Bottle (or your other lightweight web framework of choice), SqlAlchemy (or your ORM of choice) and whatever else might be useful. But it does simplify things. (pip install on Python 2.7 on Windows, anyone?)

But the most difficult restriction was in writing code which would be understood by A-level students. I’d struggled valiantly to pull back my hand and write simpler code. And then a teacher who kindly gave me feedback said that she thought the more able among her pupils might be able to cope with the code! This is the nub of the issue: the pupils who are gifted in this area can head off into the distance and work up full-stack Django apps from the online documentation and StackOverflow questions. But the run-of-the-mill average students need things to be kept simple to get them started.

If I have the time I'll probably do a Git branch which will rework the final codebase in terms of, say, Bottle and SqlAlchemy. I don't say that learners shouldn't know about these things: they certainly should. I just think that every tool you add to a stack, no matter how useful, adds complexity. And if someone is a beginner, it might be better to stick to standard stuff. (I recognise that this is not a very strong argument, since the code needed to manipulate SQL without an ORM is relatively verbose; likewise, using the built-in WSGI support.) I also considered an approach of using git tags rather than separate directories for the progressive disclosure of the codebase. But, although it's a neat idea, it's an additional complication to cope with in the classroom.

So what did I get out of the Education Track? Well: that what many teachers need is simple code, easily installed, runnable without complications on different platforms. I’d like to be able to help, but I have commitments of my own which prevent me from easily giving much time to direct help in a classroom. But providing suitable code and advice is something I can do.

September 27, 2014 12:19 PM

September 26, 2014


Mike Driscoll

How to Connect to Twitter with Python

There are several 3rd party packages that wrap Twitter’s API. We’ll be looking at tweepy and twitter. The tweepy documentation is a bit more extensive than twitter’s, but I felt that the twitter package had more concrete examples. Let’s spend some time going over how to use these packages!


Getting Started

To get started using Twitter’s API, you will need to create a Twitter application. To do that, you’ll have to go to their developer’s site and create a new application. After your application is created, you will need to get your API keys (or generate some). You will also need to generate your access tokens.

Since neither of these packages are included with Python, you will need to install them as well. To install tweepy, just do the following:


pip install tweepy

To install twitter, you can do the same kind of thing:


pip install twitter

Now you should be ready to go!


Posting a Status Update

One of the basics that you should be able to do with these packages is post an update to your Twitter account. Let’s see how these two packages work in that regard. We will start with tweepy.

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
client.update_status("#Python Rocks!")

Well that was pretty straight-forward. We had to create an OAuth handler with our keys and then set the access tokens. Finally we created an object that represents Twitter's API and updated our status. This method worked great for me. Now let's see if we can get the twitter package to work.

import twitter
 
auth=twitter.OAuth(token, token_secret, key, secret)
client = twitter.Twitter(auth=auth)
client.statuses.update(status="#Python Rocks!")

This code is pretty simple too. In fact, I think the twitter package’s OAuth implementation is cleaner than tweepy’s.

Note: I sometimes got the following error while using the twitter package: Bad Authentication data, code 215. I'm not entirely sure why; when you look that error up, it's supposedly caused by using Twitter's old API. If that were the case, though, it should never work.

Next we’ll look at how to get our timeline.


Getting Timelines

Getting your own Twitter timeline is really easy in both packages. Let’s take a look at tweepy’s implementation:

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
 
timeline = client.home_timeline()
 
for item in timeline:
    text = "%s says '%s'" % (item.user.screen_name, item.text)
    print text

So here we get authenticated and then we call the home_timeline() method. This returns an iterable of objects that we can loop over and extract various bits of data from. In this case, we just extract the screen name and the text of the Tweet. Let’s see how the twitter package does this:

import twitter
 
auth=twitter.OAuth(token, token_secret, key, secret)
client = twitter.Twitter(auth=auth)
 
timeline = client.statuses.home_timeline()
for item in timeline:
    text = "%s says '%s'" % (item["user"]["screen_name"],
                             item["text"])
    print text

The twitter package is pretty similar. The primary difference is that it returns a list of dictionaries.

What if you wanted to get someone else's timeline? In tweepy, you'd do something like this:

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
 
user = client.get_user(screen_name='pydanny')
timeline = user.timeline()

The twitter package is a little different:

import twitter
 
auth=twitter.OAuth(token, token_secret, key, secret)
client = twitter.Twitter(auth=auth)
 
user_timeline = client.statuses.user_timeline(screen_name='pydanny')

In this case, I think the twitter package is a bit cleaner although one might argue that tweepy’s implementation is more intuitive.


Getting Your Friends and Followers

Just about everyone has friends (people they follow) and followers on Twitter. In this section we will look at how to access those items. The twitter package doesn't really have a good example for finding your Twitter friends and followers, so in this section we'll just focus on tweepy.

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
 
friends = client.friends()
 
for friend in friends:
    print friend.name

If you run the code above, you will notice that the maximum number of friends that it prints out will be 20. If you want to print out ALL your friends, then you need to use a cursor. There are two ways to use the cursor. You can use it to return pages or a specific number of items. In my case, I have 32 people that I follow, so I went with the items way of doing things:

for friend in tweepy.Cursor(client.friends).items(200):
    print friend.name

This piece of code will iterate over up to 200 items. If you have a LOT of friends or you want to iterate over someone else’s friends, but don’t know how many they have, then using the pages method makes more sense. Let’s take a look at how that might work:

for page in tweepy.Cursor(client.friends).pages():
    for friend in page:
        print friend.name

That was pretty easy. Getting a list of your followers is exactly the same:

followers = client.followers()
for follower in followers:
    print follower.name

This too will return just 20 items. I have a lot of followers, so if I wanted to get a list of them, I would have to use one of the cursor methods mentioned above.
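For example, reusing the Cursor approach from above (the item limit is arbitrary):

for follower in tweepy.Cursor(client.followers).items(200):
    print follower.name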


Wrapping Up

These packages provide a lot more functionality than this article covers. I particularly recommend looking at tweepy as it's quite a bit more intuitive and easier to figure out using Python's introspection tools than the twitter package is. You could easily take tweepy and create a user interface around it to keep you up-to-date with your friends, if you didn't already have a bunch of those applications. On the other hand, that would still be a good program to write for a beginner.


Additional Reading

September 26, 2014 10:15 PM


Enthought

Enthought’s Prabhu Ramachandran Announced as Winner of Kenneth Gonsalves Award 2014 at PyCon India

From PyCon India: Published / 25 Sep 2014 PSSI [Python Software Society of India] is happy to announce that Prabhu Ramachandran, faculty member of Department of Aerospace Engineering, IIT Bombay [and managing director of Enthought India] is the winner of Kenneth Gonsalves Award, 2014. Prabhu has been active in the Open source and Python community for close to […]

September 26, 2014 09:53 PM


PyCharm

Feature Spotlight: Behavior-Driven Development in PyCharm

Happy Friday!

Today I’d like to shed some light on another brand-new functionality upcoming for PyCharm 4 – Behavior-Driven Development (BDD) Support. You can already check it out in the PyCharm 4 Public Preview builds available on the EAP page.

Note: The BDD support is available only in the PyCharm Professional Edition, not in the Community Edition.

BDD is a very popular and really effective software development approach nowadays. I'm not going to cover the ideas and principles behind it in this blog post; however, I would like to encourage everyone to try it, since it really drives your development in a more stable and accountable way. Sure, BDD works mostly for companies that require some collaboration between non-programmer management and development teams. However, the same approach can be used in smaller teams that want to benefit from the advanced test-driven development concept.

In the Python world the two most popular tools for behavior-driven development are Behave and Lettuce. PyCharm 4 supports both of them, recognizing feature files and providing syntax highlighting and auto-completion, as well as navigation from specific feature statements to their definitions. On-the-fly error highlighting, automatic quick fixes and other helpful PyCharm features are also available and can be used in a unified fashion.

Let me show you how it works in 10 simple steps:

1. To start with BDD development and in order to get the full support from PyCharm, you first need to define a preferred tool for BDD (Behave or Lettuce) in your project settings:

[screenshot: settings]

2. You can create your own feature files within your project – just press Alt+Insert while in the project view window or in the editor and select “Gherkin feature file”. It will create a feature file where you can define your own features, scenarios, etc. PyCharm recognizes the feature file format and provides syntax highlighting accordingly:

[screenshot: feature file]

3. Since there are no step definitions at the moment, PyCharm highlights the undefined steps in the feature file accordingly. Press Alt+Enter to get a quick fix on a step:

[screenshot: new step]

4. Follow the dialog and create your step definitions:

[screenshot: step definition]
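
For reference, here is roughly what such a feature/step pair might look like with Behave (a hypothetical example; the scenario text and step names are made up purely for illustration):

# features/steps/calc_steps.py -- hypothetical step definitions for a
# scenario such as:
#
#   Scenario: Adding two numbers
#     Given I have entered 2 and 3
#     When I add them
#     Then the result should be 5

from behave import given, when, then

@given("I have entered {a:d} and {b:d}")
def step_enter_numbers(context, a, b):
    # behave shares the 'context' object between the steps of a scenario.
    context.a, context.b = a, b

@when("I add them")
def step_add(context):
    context.result = context.a + context.b

@then("the result should be {expected:d}")
def step_check_result(context, expected):
    assert context.result == expected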

5. You can install behave or lettuce right from the editor. Just press Alt+Enter on unresolved reference to get the quick-fix suggestion to install the BDD tool:

[screenshot: install]

6. Look how intelligently PyCharm keeps your code in a consistent state when working on step definitions. Use Alt+Enter to get a quick-fix action:

[screenshot: intelligence]

7. In feature files, with Ctrl+Click you can navigate from a Scenario description to the actual step definition:

[screenshot: navigate]

Note: Step definitions may contain wildcards, as shown in step 6 – matched steps are highlighted in blue in feature files.

8. PyCharm also gives you handy assistance with automatic run configurations for BDD projects. In the feature file, right-click and choose the “create” option to create an automatic run configuration for behave/lettuce projects:

[screenshot: run configurations]

9. In the run configurations you can specify the scenarios to run, parameters to pass and many other options:

[screenshot: run configurations (2)]

10. Now you’re all set to run your project with a newly created run configuration. Press Shift+F10 and inspect the results:

[screenshot: tests]

That was simple, wasn’t it?
Hope you’ll enjoy the BDD support in PyCharm and give this approach a try in your projects!

See you next week!
-Dmitry

September 26, 2014 02:27 PM


Filipe Saraiva

Month of KDE Contributor: From LaKademy …

In recent weeks I had an intense “Month of KDE Contributor” that began with LaKademy, the KDE Latin American Summit, and ended with Akademy, the KDE World Summit. It was a somewhat tiring month of hard work, but it was also filled with good stories, great meetings, new contacts, discoveries and, I can say, fun.

In this post I will write about LaKademy, and in the next I will comment on Akademy.

The second edition of LaKademy took place in São Paulo, one of the biggest cities in Latin America, at the FLOSS Competence Center of the University of São Paulo, an entire building dedicated to studies and research on various aspects of free software: licenses, software engineering, metrics extracted from repositories, social aspects of collaboration, and more.

This year Aracele and I were the conference organizers, and I believe we provided all the infrastructure necessary for LaKademy attendees to have good days of work in a pleasant and comfortable place.

On the first day we had talks from contributors, and the one that most caught my attention was Rafael Gomes's talk on KDE sysadmin work. The size of the infrastructure behind the scenes is amazing: a solid base that allows developers to do their jobs. It would be interesting to promote this type of contribution more, to attract potential contributors who prefer this side of computing.

That day I presented a talk about Qt on Android, describing how to configure the development tools on Linux, presenting a basic Hello World, and commenting on some of the software available using this technology, especially VoltAir and GCompris. The presentation (in Portuguese) is below.

[embedded presentation slides]

On the second day we had a short course about Qt, presented by Sandro Andrade. It is impressive how good his teaching is and how he manages to hold our attention for a whole day without it getting boring or tiring. That day I was helping the other participants, especially those having their first contact with Qt development.

The third and fourth days were devoted to application hacking and project development. I joined a “task force” to port Bovo to KF5, started the development of a metapackage to install all KF5 packages in Mageia, and started the port of Cantor to KF5. I also fixed some KDE Brazil bots on social networks.

Task force to port Bovo to KF5

On the fourth day we had a meeting to discuss some initiatives to promote KDE in Latin America, and we started using the Kanboard on KDE TODO to organize the implementation of these projects.

Besides the work, we had some moments of relaxation at the event, such as when we went to Garoa Hacker Clube, the main hackerspace in São Paulo, an activity we call Konvescote; and also when we all went to Augusta Street, one of the famous bohemian streets in the city.

KDE + Garoa

However, as in all Free Software and KDE Brazil events, the best thing is seeing old friends again and meeting the new ones coming aboard. To the newcomers: welcome, and let's do great work! To the veterans: we still have a good road ahead in this idea of writing free software and giving back to the world something beautiful, of high technical quality, that respects the user.

The KDE Brazil team wrote an excellent post enumerating what the attendees produced during the event. I suggest that anyone who wants more information read that text.

My thanks to KDE e.V. for making this meeting possible. I hope to see more contributors at the next LaKademy!

September 26, 2014 02:02 PM

Month of KDE Contributor: …to Akademy

In recent weeks I had an intense “Month of KDE Contributor” that began with LaKademy, the KDE Latin American Summit, and ended with Akademy, the KDE World Summit. It was a somewhat tiring month of hard work, but it was also filled with good stories, great meetings, new contacts, discoveries and, I can say, fun.

In the previous post I wrote about LaKademy, and now I will write about Akademy.

LaKademy had ended just one day before, and there I was taking a bus to São Paulo again, preparing for a trip of about 35 hours to Brno, with an unusual connection in Dubai and a bus from Prague, the Czech capital, to the city hosting the event.

Arriving in Brno, my attention was caught by the beautiful architecture of this old Eastern European city, something exotic for Brazilians. During the event I had some time to walk around the city, especially on some nights for dinner and during the Day Trip. I could calmly enjoy the details of several buildings, museums, the castle and the city cathedral.

It was the second Akademy I attended, if you count the Desktop Summit in 2011. This time I am a member of KDE e.V., the organization behind KDE, so my first task was to attend the General Assembly.

I was fascinated by how dozens of contributors from different parts of the world and different cultures were there discussing the future of KDE, planning important steps for the project, checking the accounts of the organization; in short, doing the typical work of any association. I was also impressed by the long applause for Cornelius Schumacher, a member of the KDE e.V. board since 2002 and former president of the association, a way of showing gratitude for all the work he accomplished in those 10+ years on the KDE e.V. board.

At the end of the day we had a reception for participants at Red Hat. I was impressed with the size of the company's presence in the city (three large buildings). We drank some local beers and handed out Brazilian cachaça. =)

The next day the talks began. I would highlight Sascha's keynote (I believe he was invited to Akademy after Kévin Ottens saw him lecture here in Brazil during FISL) and the talk on GCompris, software I admire because it is an educational suite for children. Unfortunately, one of the lectures I wanted to see, Cofunding KDE Applications, did not take place. There was also David Faure talking about software ports to KF5, and presentations from the KDE groups of India and Taiwan at the end of the day.

On the second day of talks we had a curious keynote from Cornelius, who presented some history of KDE using images of old contributors. Other highlights of the day were the presentations by the VDG, the group doing the amazing design work in Plasma 5; now they are extending that work to KDE applications too. Great!

Another interesting presentation was on the Next Generation of Desktop Applications, by Alex Fiestas. He argued that the new generation of software needs to combine information from different web sources in order to provide a unique user experience. He showed examples of such applications, and I'm very curious to try Jungle, a video player that will have these characteristics.

Finally, that day had a very exciting lecture by Paul Adams. He showed that, based on an investigation of KDE repositories, the degree of collaboration among developers decreased with the migration from SVN to Git, the number of commits decreased too, and more. Paul has interesting work in this area, but for my part I think it is necessary to explain these conclusions using other concepts as well, because we need to understand whether this decrease is necessarily a bad thing. Maybe we developers are more specialized today than before? Maybe the drop in commits is just a result of the code base stabilizing over that period? Something not yet settled in KDE is that we went from a large unified project (at the repository level too) to a large community of subprojects (today we are like Apache, maybe). In this scenario, is it worth comparing what we are today with what we were yesterday based only on our repositories? Anyway, it is a good point to ponder.

During the BoF days, I participated in the first two parts of the software documentation BoF (important and necessary work, and we developers need to give it a little more attention), the FOSS in Taiwan BoF and the KDE Edu in India BoF. Unfortunately I could not attend the packagers BoF (and I am a packager in Mageia), because it took place at the same time as the Taiwan BoF. Let's try again at the next Akademy. =)

I like to see the experiences of user/developer groups in other countries; the management of these activities attracts me, mainly because we can apply those lessons in Brazil. I left this Akademy wanting to prepare something about the Latin American community for the next event. I believe we have much to share with the community about what we're doing here, our successes and failures, and the contribution of Latin Americans to the project.

Finally, on the other days I continued working on the Cantor port to KF5, or was talking with different developers in the halls of the university.

To me it’s very important to participate in Akademy because there I can see the power of free software and its contributors, and how this culture of collaboration brings together different people for development and evolution of free computer programs. Therefore, I would like to thank immensely to KDE e.V. for the opportunity to go to Akademy and I would like to say that I feel very good to be part of this great community that is KDE. =)

The best of all is seeing old friends again and meeting new people. When an e-mail address takes on the contours of a human face, it is a very special moment for those of us who work “so close and so distant”. So it was amazing to be with all of you!

Akademy 2014 Group Photo – giant size here

And to finish, I wish the new KDE e.V. board a great job!

For those interested, video and slides for most of the talks are available at this link.

September 26, 2014 02:00 PM


Martijn Faassen

Life at the Boundaries: Conversion and Validation

In software development we deal with boundaries between systems.

Examples of boundaries are:

  • the user interface: a web form submitted by the browser, for instance.
  • the file system.
  • a database.
  • the network: HTTP requests to and from other systems.
  • internal boundaries in your own code, such as functions, classes and subsystems.

It's important to recognize these boundaries. You want to do things at the boundaries of your application, just after input has arrived into your application across an outer boundary, and just before you send output across an inner boundary.

If you read a file and what's in that file is a string representing a number, you want to convert the string to a number as soon as possible after reading it, so that the rest of your codebase can forget about the file and the string in it, and just deal with the number.

Because if you don't and pass a filename around, you may have to open that file multiple times throughout your codebase. Or if you read from the file and leave the value as a string, you may have to convert it to a number each time you need it. This means duplicated code, and multiple places where things can go wrong. All that is more work, more error prone, and less fun.

Boundaries are our friends. So much so that programming languages give us tools like functions and classes to create new boundaries in software. With a solid, clear boundary in place in the middle of our software, both halves can be easier to understand and easier to manage.

One of the most interesting things that happen on the boundaries in software is conversion and validation of values. I find it very useful to have a clear understanding of these concepts during software development. To understand each other better it's useful to share this understanding out loud. So here is how I define these concepts and how I use them.

I hope this helps some of you see the boundaries more clearly.

Following a HTML form submit through boundaries

Let's look at an example of a value going across multiple boundaries in software. In this example, we have a web form with an input field that lets the user fill in their date of birth as a string in the format 'DD-MM-YYYY'.

I'm going to give examples based on web development. I also give a few tiny examples in Python. The web examples and Python used here only exist to illustrate concepts; similar ideas apply in other contexts. You shouldn't need to understand the details of the web or Python to understand this, so don't go away if you don't.

Serializing a web form to a request string

In a traditional non-HTML 5 HTTP web form, the input type for dates is text. This means that the dates are in fact not interpreted by the browser as dates at all. It's just a string to the browser, just like adfdafd. The browser does not know anything about the value otherwise, unless it has loaded JavaScript code that checks whether the input is really a date and shows an error message if it's not.

In HTML 5 there is a new input type called date, but for the sake of this discussion we will ignore it, as it doesn't change all that much in this example.

So when the user submits a form with the birth date field, the inputs in the form are serialized to a longer string that is then sent to the server as the body of a POST request. This serialization happens according to what's specified in the form tag's enctype attribute. When the enctype is multipart/form-data, the request to the server will be a string that looks a lot like this:

POST /some/path HTTP/1.1
Content-type: multipart/form-data, boundary=AaB03x

--AaB03x
content-disposition: form-data; name="birthdate"

21-10-1985
--AaB03x--

Note that this serialization of form input to the multipart/form-data format cannot fail; serialization always succeeds, no matter what form data was entered.

Converting the request string to a Request object

So now this request arrives at the web server. Let's imagine our web server is in Python, and that there's a web framework like Django or Flask or Pyramid or Morepath in place. This web framework takes the serialized HTTP request, that is, the string, and then converts it into a request object.

This request object is much more convenient to work with in Python than the HTTP request string. Instead of having one blob of a string, you can easily check individual aspects of the request -- what request method was used (POST), what path the request is for, what the body of the request was. The web framework also recognizes multipart/form-data and automatically converts the request body with the form data into a convenient Python dictionary-like data structure.

Note that the conversion of HTTP request text to request object may fail. This can happen when the client did not actually format the request correctly. The server should then return an HTTP error, in this case 400 Bad Request, so that the client software (or the developer working on the client software) knows something went wrong.

The potential that something goes wrong is one difference between conversion and serialization; both transform the data, but conversion can fail and serialization cannot. Or perhaps better said: if serialization fails it is a bug in the software, whereas conversion can fail due to bad input. This is because serialization goes from known-good data to some other format, whereas conversion deals with input data from an external source that may be wrong in some way.

Thanks to the web framework's parsing of the web form into a Python data structure, we can easily get the field birthdate from our form. If the request object is implemented by the WebOb library (as for Pyramid and Morepath), we can get it like this:

>>> request.POST['birthdate']
'21-10-1985'

Converting the string to a date

But the birthdate at this point is still a string 21-10-1985. We now want to convert it into something more convenient to Python. Python has a datetime library with a date type, so we'd like to get one of those.

This conversion could be done automatically by a form framework -- these are very handy as you can declaratively describe what types of values you expect and the framework can then automatically convert incoming strings to convenient Python values accordingly. I've written a few web form frameworks in my time. But in this example we'll do it manually, using functionality from the Python datetime library to parse the date:

>>> from datetime import datetime
>>> birthdate = datetime.strptime(request.POST['birthdate'], '%d-%m-%Y').date()
>>> birthdate
datetime.date(1985, 10, 21)

Since this is a conversion operation, it can fail if the user gave input that is not in the right format or is not a proper date. Python will raise a ValueError exception in this case. We need to write code that detects this and then signals the HTTP client that there was a conversion error. The client needs to update its UI to inform the user of this problem. All this can get quite complicated, and here again a form framework can help you with this.
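
If you are not using a form framework, that detection can be centralized in one small helper at the boundary. Here is a minimal sketch (the function and exception names are mine, purely for illustration); the web layer would catch ConversionError and respond with 400 Bad Request:

from datetime import datetime

class ConversionError(Exception):
    """Raised when input at the boundary cannot be converted."""

def convert_birthdate(text):
    # Conversion at the boundary: turn the incoming 'DD-MM-YYYY' string
    # into a datetime.date, or fail with a clear conversion error.
    try:
        return datetime.strptime(text, '%d-%m-%Y').date()
    except ValueError:
        raise ConversionError("birthdate is not a valid DD-MM-YYYY date")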

It's important to note that we should isolate this conversion to one place in our application: the boundary where the value comes in. We don't want to pass the birth date string around in our code and only convert it into a date when we need to do something with it that requires a date object. Doing conversion "just in time" like that has a lot of problems: code duplication is one of them, but even worse is that we would need to worry about conversion errors everywhere instead of in one place.

Validating the date

So now that we have the birth date our web application may want to do some basic checking to see whether it makes sense. For example, we probably don't expect time travellers to fill in the form, so we can safely reject any birth dates set in the future as invalid.

We've already converted the birth date from a string into a convenient Python date object, so validating that the date is not in the future is now easy:

>>> from datetime import date
>>> birthdate <= date.today()
True

Validation needs the value to be in a convenient form, so validation happens after conversion. Validation does not transform the value; it only checks whether the value is valid according to additional criteria.

There are a lot of possible validations:

  • validate that required values are indeed present.
  • check that a value is in a certain range.
  • relate the value to another value elsewhere in the input or in the database. Perhaps the birth date is not supposed to be earlier than some database-defined value, for instance.
  • etc.

If the input passes validation, the code just continues on its merry way. Only when the validation fails do we want to take special action. The minimum action that should be taken is to reject the data and do nothing, but it could also involve sending information about the cause of the validation failure back to the user interface, just like for conversion errors.

Validation should be done just after conversion, at the boundary of the application, so that after that we can stop worrying about all this and just trust the values we have as valid. Our life is easier if we do validation early on like this.
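
Putting it together, the boundary code can convert and then validate in one place, and everything past that point can trust the value. Here is a minimal sketch that reuses the hypothetical convert_birthdate() helper from the earlier sketch:

from datetime import date

class ValidationError(Exception):
    """Raised when converted input fails an application-level check."""

def get_birthdate(request):
    # One place at the outer boundary: convert first, then validate.
    birthdate = convert_birthdate(request.POST['birthdate'])
    if birthdate > date.today():
        raise ValidationError("birth date may not be in the future")
    return birthdate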

Serialize the date into a database

Now the web application wants to store the birth date in a database. The database sits behind a boundary. This boundary may be clever and allow you to pass in straight Python date objects and do a conversion to its internal format afterward. That would be best.

But imagine our database is dumb and expects our dates to be in a string format. Now the task is up to our application: we need to transform the date to a string before it crosses the database boundary.

Let's say the database layer expects date strings in the format 'YYYY-MM-DD'. We then have to serialize our Python date object to that format before we pass it into the database:

>>> birthdate.strftime('%Y-%m-%d')
'1985-10-21'

This is serialization and not conversion because this transformation always succeeds.

Concepts

So we have:

Transformation:
Transform data from one type to another. Transformation by itself cannot fail, as it is assumed to always get correct input. It is a bug in the software if it does not. Conversion and serialization both do transformation.
Conversion:
Transform input across a boundary into a more convenient form inside that boundary. Fails if the input cannot be transformed.
Serialization:
Transform valid data as output across a boundary into a form convenient to outside. Cannot fail if there are no bugs in the software.
Validation:
Check whether input across a boundary that is already converted to convenient form is valid inside that boundary. Can fail. Does not transform.

Reuse

Conversion just deals with converting one value to another and does not interact with the rest of the universe. The implementation of a converter is therefore often reusable between applications.

The behavior of a converter typically does not depend on state or configuration. If conversion behavior does depend on application state, for instance because you want to parse dates as 'MM-DD-YYYY' instead of 'DD-MM-YYYY', it is often a better approach to just swap in a different converter based on the locale than to have the converter itself be aware of the locale.

Validation is different. While some validations are reusable across applications, a lot of them will be application specific. Validation success may depend on the state of other values in the input or on application state. Reusable frameworks that help with validation are still useful, but they do need additional information from the application to do their work.

Serialization and parsing

Serialization is transformation of data to a particular type, such as a string or a memory buffer. These types are convenient for communicating across the boundary: storing on the file system, storing data in a database, or passing data through the network.

The opposite of serialization is deserialization and this is done by parsing: this takes data in its serialized form and transforms it into a more convenient form. Parsing can fail if its input is not correct. Parsing is therefore conversion, but not all conversion is parsing.

Parsing extracts information and checks whether the input conforms to a grammar in one step, though if you treat the parser as a black box you can view these as two separate phases: input validation and transformation.

There are transformation operations in an application that do not serialize but can also not fail. I don't have a separate word for these besides "transformation", but they are quite common. Take for instance an operation that takes a Python object and transforms it into a dictionary convenient for serialization to JSON: it can only consist of dicts, lists, strings, ints, floats, bools and None.

Some developers argue that data should always be kept in such a format instead of in objects, as it can encourage a looser coupling between subsystems. This idea is especially prevalent in Lisp-style homoiconic language communities, where even code is treated as data. It is interesting to note that JSON has made web development go in the direction of more explicit data structures as well. Perhaps it is as they say:

Whoever does not understand LISP is doomed to reinvent it.

Input validation

We can pick apart conversion and find input validation inside. Conversion does input validation before transformation, and serialization (and plain transformation) does not.

Input validation is very different from application-level validation. Input validation is conceptually done just before the convenient form is created, and is an inherent part of the conversion. In practice, a converter typically parses data, doing both in a single step.

I prefer to reserve the term "validation" for application-level validation and discuss input validation only when we talk about implementing a converter.

But sometimes conversion from one perspective is validation from another.

Take the example above where we want to store a Python date in a database. What if this operation does not work for all Python date objects? The database layer could accept dates in a different range than the one supported by the Python date object. The database may therefore be offered a date that is outside of its range and reject it with an error.

We can view this as conversion: the database converts a date value that comes in, and this conversion may fail. But we can also view this in another way: the database transforms the date value that comes in, and then there is an additional validation that may fail. The database is a black box and both perspectives work. That comes in handy a little bit later.

Validation and layers

Consider a web application with an application-level validation layer, and another layer of validation in the database.

Maybe the database also has a rule to make sure that the birth date is not in the future. It gives an error when we give a date in the future. Since validation errors can now occur at the database layer, we need to worry about properly handling them.

But transporting such a validation failure back to the user interface can be tricky: we are on the boundary between application code and database at this point, far from the boundary between application and user interface. And often database-level validation failure messages are in a form that is not very informative to a user; they speak in terms of the database instead of the user.

We can make our life easier. What we can do is duplicate any validation the database layer does at the outer boundary of our application, the one facing the web. Validation failures there are relatively simple to propagate back to the user interface. Since any validation errors that can be given by the database have already been detected at an earlier boundary before the database is ever reached, we don't need to worry about handling database-level validation messages anymore. We can act as if they don't exist, as we've now guaranteed they cannot occur.

We treat the database-level validation as an extra sanity check guarding against bugs in our application-level code. If validation errors occur on the database boundary, we have a bug, and this should not happen, and we can just report a general error: on the web this is a 500 internal server error. That's a lot easier to do.

The general principle is: if we do all validations that the boundary to a deeper layer needs at a higher layer already, we can effectively treat the inner boundary as not having any validations. The validations in the deeper layer then only exist as extra checks that guard against bugs in the validations at the outer boundary.

We can also apply this to conversion errors: if we already make sure we clean up the data with validations at an outer boundary before it reaches an inner boundary that needs to do conversions, the conversions cannot fail. We can treat them as transformations again. We can do this because, treating the converter as a black box, we can view any conversion as a combination of transformation and validation.

Validation in the browser

In the end, let's return to the web browser.

We've seen that doing validation at an outer boundary can let us ignore validation done deeper down in our code. We do validation once when values come into the web server, and we can forget about doing them in the rest of our server code.

We can go one step further. We can lift our validation out of the server, into the client. If we do our validation in JavaScript when the user inputs values into the web form, we are in the right place to give really accurate user interface feedback in the easiest way possible. Validation failure information has to cross from JavaScript to the browser DOM and that's it. The server is not involved.

We cannot always do this. If our validation code needs information on the server that cannot be shared securely or efficiently with the client, the server is still involved in validation, but at least we can still do all the user interface work in the client.

Even if we do not need server-side validation for the user interface, we cannot ignore doing server-side validation altogether, as we cannot guarantee that our JavaScript program is the only program that sends information to the server. Through that route, or because of bugs in our JavaScript code, we can still get input that is potentially invalid. But now if the server detects invalid information, it does not need to do anything complicated to report validation errors to the client. Instead it can just generate an internal server error.

If we could somehow guarantee that only our JavaScript program is the one that sends information to the server, we could forgo doing validation on the server altogether. Someone more experienced in the arts of encryption may be able to say whether this is possible. I suspect the answer will be "no", as it usually is with JavaScript in web browsers and encryption.

In any case, we may in fact want to encourage other programs to use the same web server; that's the whole idea behind offering HTTP APIs. If this is our aim, we need to handle validation on the server as well, and give decent error messages.

September 26, 2014 01:50 PM


Andrew Dalke

chemfp's format API

This is part of a series of essays describing the reasons behind chemfp's new APIs. This one, about the new format API, is a lot shorter than the previous one on parse_molecule() and parse_id_and_molecule().

It's sometimes useful to know which cheminformatics formats are available, if only to display a help message or pulldown menu. The get_formats() toolkit function returns a list of available formats, as 'Format' instances.

>>> from chemfp import rdkit_toolkit as T
>>> T.get_formats()
[Format('rdkit/canstring'), Format('rdkit/inchikey'),
Format('rdkit/usmstring'), Format('rdkit/smistring'),
Format('rdkit/molfile'), Format('rdkit/usm'),
Format('rdkit/inchikeystring'), Format('rdkit/sdf'),
Format('rdkit/can'), Format('rdkit/smi'), Format('rdkit/inchi'),
Format('rdkit/rdbinmol'), Format('rdkit/inchistring')]
(The next version of chemfp will likely support RDKit's relatively new PDB reader.)

You can ask a format for its name, or see if it is an input format or output format by checking respectively "is_input_format" and "is_output_format". If you just want the list of input formats or output formats, use get_input_formats() or get_output_formats().

Here's an example to show which output formats are not also input formats:

>>> [format.name for format in T.get_output_formats() if not format.is_input_format]
['inchikey', 'inchikeystring']
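
That makes it easy to build, say, a help message or pulldown menu of the readable formats. Here's a small sketch using only the calls shown above (the helper function name is mine):

from chemfp import rdkit_toolkit as T

def input_format_help():
    # List the readable format names, e.g. for a --help message or a menu.
    names = sorted(format.name for format in T.get_input_formats())
    return "Supported input formats: " + ", ".join(names)

print(input_format_help())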

You may recall that some formats are record-based and others are, for lack of a better word, "string-based". The latter include "smistring", "inchistring", and "inchikeystring". These are not records in their own right, so can't be read or written to a file.

I really couldn't come up with a good predicate which described those formats. The closest was "is_a_record". I ended up with "supports_io". I'm not happy with the name. If true, the format can be used in file I/O.

The RDKit input formats which do not support I/O are the expected ones ... and rdbinmol.

>>> [format.name for format in T.get_input_formats() if not format.supports_io]
['canstring', 'usmstring', 'smistring', 'molfile', 'rdbinmol', 'inchistring']
(The "rdbinmol" is an experimental format. It's the byte string from calling an RDKit molecule's "ToBinary()" method, which is also the basis for its pickle support.)

get_format() and compression

You can get a specific format by name using get_format(). This can also be used to specify a compressed format:

>>> T.get_format("sdf")
Format('rdkit/sdf')
>>> T.get_format("smi.gz")
Format('rdkit/smi.gz')
>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.name
'smi'
>>> format.compression
'gz'

Default reader and writer arguments

Toolkit- and format-specific arguments were a difficult challenge. I want chemfp to support multiple toolkits, because I know people work with fingerprints from multiple toolkits. Each of the toolkits has its own way to parse and generate records. I needed some way to have a common API but with a mechanism to control the underlying toolkit options.

The result is reader_args, which I discussed in the previous essay, and its writer_args complement for turning a molecule into a record.

A Format instance can be toolkit specific; the "rdkit/smi.gz" is an RDKit format. (The toolkit name is available from the aptly named attribute 'toolkit_name'.) Each Format has a way to get the default reader_args and writer_args for the format:

>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> format.get_default_writer_args()
{'isomericSmiles': True, 'delimiter': None, 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
This is especially useful if you are on the interactive prompt and have forgotten the option names.

Convert text settings into arguments

The -R command-line options for the chemfp tools rdkit2fps, ob2fps, and oe2fps let users set the reader_args. If your target molecules are in a space-delimited SMILES file then you can set the 'delimiter' option to 'space':

oe2fps -R delimiter=space targets.smi.gz -o targets.fps
or ask RDKit to disable sanitization using:
rdkit2fps -R sanitize=false targets.smi.gz -o targets.fps
The -R option takes string keys and values. On the other hand, reader_args take a dictionary with string keys but possibly integers and booleans as values. You could write the converter yourself, but that gets old very quickly. Instead, I included it as the format's get_reader_args_from_text_settings(). (The *2fps programs don't generate structure output, but if they did, the equivalent command-line flag would be -W, and the equivalent format method is get_writer_args_from_text_settings().)

Yes, I agree that get_..._settings() is a very long name. I couldn't think of a better one. I decided that "text settings" are the reader_args and writer_args expressed as a dictionary with string names and string values.

I'll use that long named function to convert some text settings into proper reader_args:

>>> format.get_reader_args_from_text_settings({
...    "delimiter": "tab",
...    "sanitize": "false",
... })
{'delimiter': 'tab', 'sanitize': False}
You can see that the text "false" was converted into the Python False value.

Namespaces

Names like "delimiter" and "sanitize" are 'unqualified' and apply to every toolkit and every format which accepts them. This makes sense for "delimiter" because it's pointless to have OEChem parse a SMILES file using a different delimiter style than RDKit. It's acceptable for "sanitize" because only RDKit knows what it means, and the other toolkits will ignore unknown names. In many cases, then, you could simply do something like:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight, # for all OEChem readers
}

At the moment the toolkits all use different option names for the same format, so there's no conflict there. But toolkits do use the same name for options on different formats, and there can be a good reason why the value for a SMILES output is different than the value for an SDF record output.

The best example is OEChem, which uses a "flavor" flag to specify the input and output options for all formats. (For chemfp I decided to split OEChem's flavor into 'flavor' and 'aromaticity' reader and writer arguments. I leave that discussion for elsewhere.) I'll start by making an OEGraphMol.

from chemfp import openeye_toolkit

phenol = "c1ccccc1[16OH]"
oemol = openeye_toolkit.parse_molecule(phenol, "smistring")
Even though "smistring" output by default generates the canonical isomorphic SMILES for the record, I can ask it to generate a different output flavor. For convenience, the flavor value can be an integer, which is treated as the flavor bitmask, or it can be a string of "|" or "," separated bitmask names. Usually the bitmask names are or'ed together, but a leading "-" means to unset the corresponding bits for that flag.
>>> openeye_toolkit.create_string(oemol, "smistring")
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default"})
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default,-Isotopes"})
'c1ccc(cc1)O'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Canonical|Kekule|Isotopes"})
'C1=CC=C(C=C1)[16OH]'
Here I'll ask for the SDF record output in V3000 format. (In the future I expect to have a special "sdf3" or "sdf3000" format, to make it easier to specify V3000 output across all toolkits.)
>>> print(openeye_toolkit.create_string(oemol, "sdf",
...        writer_args={"flavor": "Default|MV30"}))

  -OEChem-09261411132D

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 7 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 C 0 0 0 0
M  V30 3 C 0 0 0 0
M  V30 4 C 0 0 0 0
M  V30 5 C 0 0 0 0
M  V30 6 C 0 0 0 0
M  V30 7 O 0 0 0 0 MASS=16
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 6
M  V30 2 1 1 2
M  V30 3 2 2 3
M  V30 4 1 3 4
M  V30 5 2 4 5
M  V30 6 1 5 6
M  V30 7 1 6 7
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

What's the problem?

One problem comes when I want to configure chemfp so that if the output is SMILES it uses one flavor, and if the output is SDF it uses another flavor. You could construct a table of format-specific writer_args, like this:

writer_args_by_format = {
  "smi": {"flavor": "Canonical|Kekule|Isotopes", "aromaticity": "openeye"},
  "sdf": {"flavor": "Default|MV30", "aromaticity": "openeye"},
    ...
}

record = T.create_string(mol, format,
           writer_args = writer_args_by_format[format])
but not only is that tedious, it doesn't handle toolkit-specific options. Nor is there an easy way to turn the text settings into this data structure.

Qualified names

Instead, the reader_args and writer_args accept "qualified" names, which can be format-specific like "sdf.flavor", toolkit-specific like "openeye.*.aromaticity", or both, like "openeye.sdf.aromaticity".

A cleaner way to write the previous example is:

writer_args = {
  "smi.flavor": "Canonical|Kekule|Isotopes",
  "sdf.flavor": "Default|MV30",
  "aromaticity": "openeye",   # Use the openeye aromaticity model for all formats
    ...
}

record = T.create_string(mol, format, writer_args = writer_args)
or if you want to be toolkit-specific, use "openeye.smi.flavor", "openeye.sdf.flavor" and "openeye.*.aromaticity", etc.

Precedence

You probably noticed there are many ways to specify the same setting, as in the following:

reader_args = {
  "delimiter": "tab",
  "openeye.*.delimiter": "whitespace",
  "smi.delimiter": "space",
}
The chemfp precedence goes from most-qualified name to least-qualified, so for this case the search order is:
openeye.smi.delimiter
openeye.*.delimiter
smi.delimiter
delimiter
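
To make that lookup concrete, here's a rough sketch of the search order in plain Python (this is just an illustration; chemfp's own get_unqualified_reader_args(), shown below, does the real work):

def resolve(reader_args, toolkit, fmt, key, default=None):
    # Try the most-qualified name first, then fall back step by step.
    for name in ("%s.%s.%s" % (toolkit, fmt, key),
                 "%s.*.%s" % (toolkit, key),
                 "%s.%s" % (fmt, key),
                 key):
        if name in reader_args:
            return reader_args[name]
    return default

reader_args = {
    "delimiter": "tab",
    "openeye.*.delimiter": "whitespace",
    "smi.delimiter": "space",
}
print(resolve(reader_args, "openeye", "smi", "delimiter"))  # -> 'whitespace'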

How to convert qualified names into unqualified names

The Format object's get_unqualified_reader_args() converts a complicated reader_args dictionary which may contain qualified names into a simpler reader_args dictionary with only unqualified names and only the names appropriate for the format. It's used internally to simplify the search for the right name, and it's part of the public API so you can check whether your qualifiers are working correctly. I'll give an example of debugging in a moment.

Here's an example which shows that the previous 'reader_args' example, with several delimiter specifications, is resolved to using the 'whitespace' delimiter style.

>>> from chemfp import openeye_toolkit
>>> 
>>> reader_args = {
...   "delimiter": "tab",
...   "openeye.*.delimiter": "whitespace",
...   "smi.delimiter": "space",
... }
>>> 
>>> format = openeye_toolkit.get_format("smi")
>>> format.get_unqualified_reader_args(reader_args)
{'delimiter': 'whitespace', 'flavor': None, 'aromaticity': None}
You can see that it also fills in the default values for unspecified arguments. Note that this function does not validate values. It's only concerned with resolving the names.

The equivalent method for writer_args is get_unqualified_writer_args() - I try to be predictable in my APIs.

This function is useful for debugging because it helps you spot typos. Readers ignore unknown arguments, so if you type "opneye" instead of "openeye" then it just assumes that you were talking about some other toolkit.

If you can't figure out why your reader_args or writer_args aren't being accepted, pass them through the 'unqualified' method and see what it gives:

>>> format.get_unqualified_reader_args({"opneye.*.aromaticity": "daylight"})
{'delimiter': None, 'flavor': None, 'aromaticity': None}

Qualified names and text settings

The Format object also supports qualifiers in the reader and writer text_settings and applies the same search order to give the unqualified reader_args.

>>> format.get_reader_args_from_text_settings({
...    "sanitize": "true",
...    "rdkit.*.sanitize": "false",
... })
{'sanitize': False}

Errors in the text settings

The get_reader_args_from_text_settings() and get_writer_args_from_text_settings() methods will validate the values as much as they can, and raise a ValueError with a helpful message if that fails.

>>> from chemfp import openeye_toolkit
>>> sdf_format = openeye_toolkit.get_format("sdf")
>>> sdf_format.get_writer_args_from_text_settings({
...   "flavor": "bland",
... })
Traceback (most recent call last):
  File "", line 2, in 
  File "chemfp/base_toolkit.py", line 407, in get_writer_args_from_text_settings
    return self._get_args_from_text_settings(writer_settings, self._format_config.output)
  File "chemfp/base_toolkit.py", line 351, in _get_args_from_text_settings
    % (self.toolkit_name, name, value, err))
ValueError: Unable to parse openeye setting flavor ('bland'): OEChem sdf format does not support the 'bland' flavor option. Available flavors are: CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity

File format detection based on extension

All of the above assumes you know the file format. Sometimes you only know the filename, and want to determine (or "guess") the format based on its extension. The file "abc.smi" is a SMILES file, the file "xyz.sdf" is an SD file, and "xyz.sdf.gz" is a gzip-compressed SD file.

The toolkit function get_input_format_from_source() will try to determine the format for an input file, given the source filename:

>>> from chemfp import openbabel_toolkit as T
>>> T.get_input_format_from_source("example.smi")
Format('openbabel/smi')
>>> T.get_input_format_from_source("example.sdf.gz")
Format('openbabel/sdf.gz')
>>> format = T.get_input_format_from_source("example.sdf.gz")
>>> format.get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
The equivalent for output files is get_output_format_from_destination().

The main difference between the two is that get_input_format_from_source() will raise an exception if the format is known but not supported as an input format, and get_output_format_from_destination() will raise an exception if the format is known but not supported as an output format.

>>> T.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
  File "", line 1, in 
  File "chemfp/openbabel_toolkit.py", line 109, in get_input_format_from_source
    return _format_registry.get_input_format_from_source(source, format)
  File "chemfp/base_toolkit.py", line 606, in get_input_format_from_source
    format_config = self.get_input_format_config(register_name)
  File "chemfp/base_toolkit.py", line 530, in get_input_format_config
    % (self.external_name, register_name))
ValueError: Open Babel does not support 'inchikey' as an input format

The format detection functions actually take two arguments, where the second is the format name.

>>> T.get_input_format_from_source("example.inchikey", "smi.gz")
Format('openbabel/smi.gz')
This is meant to simplify the logic that would otherwise lead to code like:
if format is not None:
    format = T.get_input_format(format)
else:
    format = T.get_input_format_from_source(source)

By the way, the source and destination can be None. This tells chemfp to read from stdin or write to stdout. Since stdin and stdout don't have a file extension, what format do they have? My cheminformatics roots started with Daylight, so I decided that the default format is "smi".

September 26, 2014 12:00 PM


Tim Golden

PyCon UK 2014: The Happening

Someone I meet only in Coventry mentioned in the hallway at this year's PyCon UK that I'd not written anything this year which he could comment on when he met me there. And indeed I was surprised to realise that my last post on this blog was about last year's PyCon UK! I know I'd had several ideas for what to blog about, but clearly those ideas never became a reality.

As I pointed out in that year-ago post, every PyCon UK charts its own course for me each year. This year I was heavily involved in the Education & Kids’ tracks across the road in the Simulation Centre. In fact, apart from quick visits to the dining hall for food, I hardly interacted with the main conference at all throughout Friday and Saturday. As I tweeted at the time, the first talk I attended at the conference proper was the one I was giving first thing on Sunday morning!

Others have already done so, but I’d like to give tons of credit to Nicholas Tollervey who organised and ran the Education and Kids’ track (as well as giving a talk at the conference proper on turning footfall data into music). This involved liaising with schools and teachers for over 30 teachers to be able to attend PyCon UK as part of their Professional Development, in some cases assisted by money (generously provided by the Bank of America) to help pay for classroom cover. The Bank of America also paid for the venue, while the Python Software Foundation provided money to cover travel & accommodation. And then there’s the job of organising for 70 kids to come and have some experience of programming in a supportive environment. Those who only know Nicholas as a developer and organiser-in-chief of the monthly London Python Code Dojo may not be aware that he was a professional tuba player and, significantly here, a Secondary School teacher. This (I assume) gives him an insight into what will give the most benefit to the teachers and youngsters attending.

I hope to write a separate blog post on the Education slice of the conference on Friday. (If only to ensure that I have at least two blog posts to my name this year!). The Saturday event for children was certainly well-attended. We were over there not long after 7am setting up chairs, tables, RPis, keyboards, mice, screens and a *lot* of power cables to keep 70 youngsters occupied. Because of the numbers, activities were split over two large rooms plus two smaller rooms for specific activities. There were guided sessions on Minecraft, using the Pi camera module, and using PyGame plus a fair amount of freeflow action, mostly involving Minecraft. At the end was a show-and-tell where at least some of the various projects were showcased.

The Raspberry Pi Foundation Educational Outreach Team (or whatever they’re called) were there on both days, and in fact throughout the conference including the sprints on Monday. As well as running workshops in the Education track, they also provided a Model B+ Raspberry Pi for each of the youngsters attending, and gave talks and keynotes speeches at the main conference.

I attended relatively few of the talks at the main conference. I did like the gov.uk team’s presentation on their approach to using Flask to implement wizard-style forms (a project codenamed “Gandalf” apparently!) and which they hope to be able to open-source. The lightning talks I did get to see (ably compered as usual by Lightning Talk Man) were interesting; and one of them was even using my active_directory module. Another blog post from me in the making there, I hope.

The one very obvious aspect of PyCon UK this year was the numbers. And, as a result, the queues, especially in the dining hall. Even my own talk – on the fairly niche subject of Core Python Development on Windows – was fully attended, presumably as a result of refugees who couldn't fit into other talks and were desperate for somewhere they could at least sit down. It's a nice problem to have, being too popular; but I imagine the organisers are looking hard at arrangements for next year.

I stayed for the sprints on Monday and so was looking for somewhere to eat on Sunday evening, traditionally the least organised of the PyCon UK evenings. I went looking with Conor & Ben for somewhere to eat in Coventry on a Sunday evening. With limited success. (We did eventually find somewhere open). I spent a pleasantly quiet rest of the evening with a pot of tea in the hotel bar, and chatting to Giacomo about this and that.

Monday was sprints, and although I was available to help people get started on core dev work, I was actually sprinting on my own Screenful of Python idea, of which more in a later post.

September 26, 2014 07:57 AM

September 25, 2014


Fabio Zadrozny

Attaching debugger to running process (PyDev 3.8.0)

The latest PyDev 3.8.0 has just been released... along with a bunch of bugfixes, the major feature added is the possibility of attaching the debugger to a running process.

So, I thought about explaining a bit how to use it (and later a bit on how it was done).

The first thing is that the Debug perspective must be activated (as the attach to process menu is only shown by default in the Debug Perspective). Then, when on the debug perspective, select PyDev > Attach to Process (as the image below shows).



When that action is activated, a list of the active processes is shown so that the process we should attach to can be selected (as a note, by default the list is filtered to show only Python processes, but if you have an executable running Python under a different name, it's also possible to select it).


After selecting the executable, the user must provide a Python interpreter that's compatible with the interpreter we'll attach to (i.e.: if we're attaching to a 64 bit interpreter, the interpreter selected must also be a 64 bit process -- and likewise for 32 bits).


And if everything went right, we'll be connected with the process after that step (you can note in the screenshot below that in the Debug View we have a process running regularly and we connected to the process through the remote debugger) -- after it's connected, it's possible to pause the running process (through right-click process > suspend)


So, this was how to use it within PyDev... now, I'll explain a bit on the internals as I had a great time implementing that feature :)

The first thing to note is that we currently support (only) Windows and Linux... although the basic idea is the same on both cases (we get a dll loaded in the target process and then execute our attach code through it), the implementations are actually pretty different (as it's Open Source, it can be seen at: https://github.com/fabioz/PyDev.Debugger/tree/development/pydevd_attach_to_process)

So, let me start with the Windows implementation...

In the Windows world, I must thank a bunch of people that came before me in order to make it work:
First, Mario Vilas for Winappdbg: https://github.com/MarioVilas/winappdbg -- mostly, this is a Windows debugger written in Python. We use it to attach a dll to a target process. Then, after the dll is loaded, there's some hand-crafted shellcode created to execute a function from that dll (which is actually different for 32 bits and for 64 bits -- I did spend quite some time here since my assembly knowledge was mostly theoretical, so, it was nice to see it working in practice: that's the GenShellCodeHelper class in add_code_to_python_process.py and it's used in the run_python_code_windows function).

Ok, so, that gives us the basic hackery needed in order to execute some code in a different process (which Winappdbg does through the Windows CreateRemoteThread API) -- so, initially I had a simple version working which did (all in shellcode) an acquire of the Python GIL/run some hand-crafted Python code through PyRun_SimpleString/release GIL... but there are a number of problems with that simple approach: the first one is that although we'll execute some Python code there, we'll do it under a new thread, which under Python 3 meant that this would be no good since we have to initialize the Python threading if it still wasn't initialized. Also, to setup the Debugger we need to call sys.settrace on existing threads (while executing in that thread -- or at least making Python think we're at that thread as there's no API for that)... at that point I decided on having the dll (and not only shellcode).

So, here I must thank the PVTS guys (which actually have an attach to process on Windows which does mixed mode debugging), so, the attach.cpp file which generates our dll is adapted from PVTS to the PyDev use case. It goes through a number of hoops to initialize the threading facility in Python if it still wasn't initialized and does the sys.settrace while having all threads suspended and makes Python think we're actually at the proper thread to make that call... looking at it, I find it really strange that Python itself doesn't have an API for that (it should be easy to do on Python, but it's such a major hack on the debugger because that API is not available).

Now, on to Linux: there the approach is simpler because we reuse a debugger that should be readily available to Linux developers: gdb. Also, because gdb stops the threads for us and executes the code in an existing thread, things become much simpler. First, since we're executing in an existing thread, we don't have to start the threading if it's not started -- the only reason that's needed on Windows is that the code runs in a new thread created through CreateRemoteThread. Second, since gdb can be scripted in Python and lets us switch to a thread and execute something in it while the other threads are stopped, we don't need the trick of making Python think we're in a given thread in order to call sys.settrace: we can switch to the thread and really execute code in it.
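
To give a flavor of that general trick -- executing Python code in another process through gdb -- here's a minimal sketch. This is not PyDev's actual attach code (that lives in the repository linked above and does much more, such as switching threads and calling sys.settrace); it only shows the simple GIL-acquire/PyRun_SimpleString/GIL-release idea, and assumes gdb is installed, ptrace attach is permitted, and the target interpreter's symbols are visible:

import subprocess

def run_python_code_linux(pid, python_code):
    # Attach with gdb, grab the GIL in the target process, run one line of
    # Python there, then release the GIL. $1 is gdb's value-history entry
    # for the result of the PyGILState_Ensure() call.
    commands = [
        "call PyGILState_Ensure()",
        'call PyRun_SimpleString("%s")' % python_code.replace('"', r'\"'),
        "call PyGILState_Release($1)",
    ]
    args = ["gdb", "-p", str(pid), "-batch"]
    for command in commands:
        args += ["-ex", command]
    subprocess.check_call(args)

# e.g.: run_python_code_linux(1234, "import sys; sys.stderr.write('attached')")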

So, all in all, it should work properly, although there are a number of caveats... it may fail if we don't have permission for CreateRemoteThread on Windows, or on Linux if the ptrace permissions don't allow us to attach with gdb -- and probably a bunch of other things I still haven't thought of :)

Still, it's nice to see it working!

Also, the last thanks goes to JetBrains/PyCharm, which helped sponsor the work on the debugger -- as I mentioned earlier (http://pydev.blogspot.com.br/2014/08/pydev-370-pydevpycharm-debugger-merge.html), the debugger in PyDev/PyCharm is now merged :)

September 25, 2014 06:08 PM


PyTennessee

Python for Ada

All of us at PyTennessee are huge fans of the diverse community that Python is becoming. We work hard every year to seek out and encourage a diverse group of speakers and attendees. I wanted to make sure all of you have seen the Python for Ada donation drive. There are two days left, and we're past the goal, but we can go further. So please go check out the campaign and contribute if you can, or at least spread the word.

Note: We put our money where our mouth is, and we donated as well. Also, just like last year, at the conference on Sunday we will be having a fundraiser for PyLadies via a mani/pedi party and silent auction.

September 25, 2014 06:00 PM


Andrew Dalke

chemfp's parse_molecule()

I used several use cases to guide chemfp-1.2 development. One, under the category of "web service friendly", was to develop a web service which takes a structure record and optional format name as input for a k=3 nearest neighbor search of a pre-loaded target fingerprint file. That doesn't sound so hard, except I'll let the fingerprint file determine the fingerprint type based on its 'type' header, which might specify Open Babel, RDKit, or OpenEye's toolkits.

Those toolkits all have different APIs, but I would prefer not to write different code for each case. Chemfp-1.1 can be hacked to handle most of what's needed, because read_structure_fingerprints() will read a structure file and compute fingerprints for each record. The hack is to save the XML-RPC query to a file, since read_structure_fingerprints() only works from a file, not a string.

That isn't a full solution. Chemfp-1.1 doesn't support any sort of structure output, so there would need to be toolkit-specific code to set the tag data and convert the molecule to a record.

For chemfp I decided on the classic approach and made my own toolkit-independent API, with a back-end implementation for each of the supported toolkits. I think it's a great API, and I've been using it a lot for my other internal projects. My hope is that some of the ideas I developed go into other toolkits, or at least influence the design of the next generation of toolkits.

To see a more concrete example, here's that use case implemented as an XML-RPC service using the new chemfp-1.2 API.

from SimpleXMLRPCServer import SimpleXMLRPCServer

import chemfp

# Load the target fingerprints and figure out the fingerprint type and toolkit
targets = chemfp.load_fingerprints("databases/chembl_18_FP2.fps")
fptype = targets.get_fingerprint_type()  # uses the 'type' field from the FPS header
toolkit = fptype.toolkit

print("Loaded %d fingerprints of type %r" % (len(targets), fptype.get_type()))

def search(record, format="sdf"):
    # Parse the molecule and report any failures
    try:
        mol = toolkit.parse_molecule(record, format)
    except ValueError as err:
        return {"status": "error", "msg": str(err)}

    # Compute its fingerprint, search the targets, and get the nearest 3 neighbors
    fp = fptype.compute_fingerprint(mol)
    nearest = targets.knearest_tanimoto_search_fp(fp, k=3, threshold=0.0)

    # Add a tag value like "CHEMBL14060 1.00 CHEMBL8020 0.92 CHEMBL7970 0.92"
    (id1, score1), (id2, score2), (id3, score3) = nearest.get_ids_and_scores()
    toolkit.add_tag(mol, "NEAREST", "%s %.2f %s %.2f %s %.2f" %
                    (id1, score1, id2, score2, id3, score3))

    return {"status": "ok",
            "record": toolkit.create_string(mol, "sdf")}

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_introspection_functions()
server.register_function(search)

if __name__ == "__main__":
    server.serve_forever()

I think this is a clean API, and a bit easier to understand and use than the native toolkit APIs. It's very similar to cinfony, though I think at this level cinfony is a bit easier to understand because it wraps its own Molecule object around the native toolkit molecules, while I leave them as native molecule objects. I have to use helper functions where cinfony can use methods. I did this because I don't want the performance overhead of wrapping and unwrapping for the normal use case of converting a structure file to fingerprints.

I also have more stand-alone objects than cinfony, like my fingerprint type object for fingerprint generation, where cinfony uses a method of the molecule.

parse_molecule()

That example, while illustrative of the new API, isn't significantly better than existing systems. For that, I need to delve into the details, starting with parse_molecule().

Chemfp has three "toolkit" APIs, one for each of the supported toolkits. The usual way to get them is through one of the following:

from chemfp import rdkit_toolkit
from chemfp import openeye_toolkit
from chemfp import openbabel_toolkit
Most of the examples are toolkit independent, so I'll use "T" rather than stress a toolkit. I'll use RDKit as the back-end for these examples:
>>> from chemfp import rdkit_toolkit as T

The type signature is:

parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')
In the simplest case I'll parse a SMILES record to get an RDKit molecule object:
>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x104521360>
If you use the openeye_toolkit you would get an OEGraphMol, and openbabel_toolkit returns an OBMol.

You might have noticed that the first two parameters are in reverse order from cinfony's readstring(), which takes the format in the first position instead of the second. This is because the parse_molecule() parameters parallel chemfp's read_molecules() parameters:

read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None)
The format there is second because the default of None means to auto-detect the format based on the source. (If the source is a filename then detection is based on the extension. I'll go into more details in another essay.)

In comparison, cinfony readfile() takes (format, filename), and doesn't auto-detect the format. (A future version could do that, but it would require a format like "auto" or None, which I thought was less elegant than being able to use the default format.) I wanted read_molecules() to support auto-detection, which meant the format had to be in the second position, which is why parse_molecule() takes the format in the second position.

parse_molecule() always returns a new molecule

This function promises to return a new molecule object each time. Compare to OEChem or Open Babel, which parse into an existing molecule object.

Those two toolkits reuse the molecule for performance reasons; clearing is faster than deallocating and reallocating a molecule object. My experience is that in practice molecule reuse is error-prone because it's easy to forget to clear the molecule, or save multiple results to a list and be surprised that they all end up with the same molecule.

I agree that performance is important. I chose a different route to get there. I noticed that even if the molecule were reused, there would still be overhead in calling parse_molecule() because it has to validate, or at least interpret, the function parameters. Since these parameters rarely change, that validation is unneeded overhead.

What I ended up doing was making a new function, make_id_and_molecule_parser(), which takes the same parameters as parse_molecule(), except leaving out 'content'. It returns a specialized parser function which takes only one parameter, the content, and returns the (id, molecule) pair. (In a future release I may also have a make_molecule_parser() which only returns the molecule, but that's too much work for now.)

This new function is designed for performance, and is free to reuse the molecule object. Functions which return functions are relatively unusual, and I think only more advanced programmers will use it, which makes it less likely that people will run into those problems.

It's still possible, so in the future I may add an option to make_molecule_parser() to require a new molecule each time.
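
Here's a small usage sketch of make_id_and_molecule_parser(), based only on the description above; I'm assuming it lives on the toolkit module next to parse_molecule() and that the format comes first once 'content' is dropped:

from chemfp import rdkit_toolkit as T

# Interpret the options once, get back a fast record -> (id, molecule) parser.
parse = T.make_id_and_molecule_parser("smi", reader_args={"delimiter": "whitespace"})

for record in ["c1ccccc1[16OH] phenol\n", "CCO ethanol\n"]:
    rec_id, mol = parse(record)
    # The returned molecule object may be reused between calls, so finish
    # with it (e.g. compute the fingerprint) before parsing the next record.
    print(rec_id)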

Format names

The second parameter is the format name or Format object. For now I'll only consider format names.

mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")

This was surprisingly tricky to get right, because a SMILES record and a SMILES string are different things, and because toolkits differ in what a SMILES record means. When someone says "takes a SMILES as input", does that mean a record or a string?

To clarify, a SMILES file contains SMILES records. A SMILES record is a SMILES string followed by a whitespace, followed by a title/id. Some toolkits take Daylight's lead and say that the title goes to the end of the line. Others interpret the file as a whitespace or space or tab separated file; or even an arbitrary character delimited file. Some also support a header in the first line. Finally, SMILES records are newline terminated, although that's not always required for the last record. (I'll come back to this in a bit.)

I decided to use different format names for these two cases. The "smi" format refers to a SMILES record, and the "smistring" format refers to just the SMILES string.

Björn Grüning pointed out that a similar issue exists with InChI. There's the InChI string, but Open Babel and some other tools also support an "InChI file" with the InChI string as the first term, followed by a whitespace, followed by some title or identifier, as well as an "InChIKey file", using the InChIKey instead of the InChI.

Thus, chemfp has "inchistring" and "inchi", and "inchikeystring" and "inchikey", in parallel with the "smistring"/"smi" distinction.

The other formats, like "sdf" and "pdb", are only record-oriented and don't have the same dual meaning as SMILES and InChI.
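
To make the record/string distinction concrete, here's a quick sketch (benzene in both cases; only the record form carries a title):

mol_from_record = T.parse_molecule("c1ccccc1 benzene", "smi")   # SMILES record: string + title
mol_from_string = T.parse_molecule("c1ccccc1", "smistring")     # bare SMILES string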

Compressed formats

A longer term chemfp idea is to extend the binary format to store record data. My current experiments use SQLite for the records and FPB files for the fingerprints.

Uncompressed SDF records take up a lot of space, and compress well using standard zlib compression. The file-based I/O functions support format names like "smi.gz". I extended the idea to support zlib-compressed records, like "sdf.zlib".
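
Assuming the compressed record is simply the zlib-compressed bytes of the SD record, usage would look something like this sketch ('sdf_record' holds an SD record string, like the ChEBI record shown later in this essay):

import zlib

compressed_record = zlib.compress(sdf_record)
mol = T.parse_molecule(compressed_record, "sdf.zlib")   # an "sdf" record, zlib-compressed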

Output format names: different types of SMILES

The SMILES output format names are also tricky. This needs a bit of history to understand fully. The Daylight toolkit introduced SMILES strings, but the original syntax did not support isotopes or chirality. These were added later, as so-called "isomeric SMILES". Daylight and nearly every toolkit since maintained that separation, where a "SMILES" output string (either canonical or non-canonical) was not isomeric, and something different needed to be done to get an isomeric SMILES.

This was a usability mistake. Most people expect that when the input is 238U then the output SMILES will be "[238U]". I know, because I've probably made that mistake a dozen times in my own code. On the plus side, it's usually very easy to detect and fix. On the minus side, I've only rarely needed canonical non-isomeric SMILES, so the default ends up encouraging mistakes.

OEChem 2.0 decided to break with tradition and say that "smi" refers to canonical isomeric SMILES, which is what most people expect but didn't get, that "can" refers to canonical non-isomeric SMILES (this is unchanged), and that "usm" is the new term for non-canonical, non-isomeric SMILES.

It's a brilliantly simple solution to a usability problem I hadn't really noticed before they solved it. This made so much sense to me that I changed chemfp's new I/O API to use those format names and meanings. I hope others also follow their lead.

That's why "smi" as a chemfp output format means canonical isomeric SMILES, "can" means canonical non-isomeric, and "usm" means non-canonical non-isomeric. The corresponding string formats are "smistring", "canstring", and "usmstring". Here they are in action:

>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> T.create_string(mol, "smi")
'[16OH]c1ccccc1 phenol\n'
>>> T.create_string(mol, "can")
'Oc1ccccc1 phenol\n'
>>> T.create_string(mol, "usm")
'c1ccccc1O phenol\n'
>>> T.create_string(mol, "smistring")
'[16OH]c1ccccc1'

id_tag, parse_id_and_molecule(), and titles

The parse_molecule() function only really uses the "id_tag" parameter to improve error reporting: the error message will use the id_tag's value rather than the record's default title.

The id_tag parameter exists because developers in Hinxton decided to put the record identifier in a tag of an SD file instead of placing it in the record title like the rest of the world. As a result, many cheminformatics tools stumble a bit with the ChEBI and ChEMBL datasets, which either have a blank title or the occasional non-blank but useless title like:

"tetrahydromonapterin"
(1E)-1-methylprop-1-ene-1,2,3-tricarboxylate
S([O-])([O-])(=O)=O.[Ni+2].O.O.O.O.O.O.O
Structure #1
Untitled Document-1
XTP
I've decided that this is a bug, but despite years of complaints they haven't changed it, so chemfp has to work around it.

Here's an SD record from ChEBI that you can copy and paste:


  Marvin  02030822342D          

  2  1  0  0  0  0            999 V2000
   -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
M  STY  1   1 SUP
M  SAL   1  2   1   2
M  SMT   1 O2
M  END
> <ChEBI ID>
CHEBI:15379

> <ChEBI Name>
dioxygen

> <Star>
3

$$$$
I'll assign that to the variable 'sdf_record', parse it, and show that the title is empty, though the identifier is available as the "ChEBI ID" tag.
>>> from chemfp import rdkit_toolkit as T
>>> sdf_record = """\
...   
...   Marvin  02030822342D          
... 
...   2  1  0  0  0  0            999 V2000
...    -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
              ... many lines deleted ...
... $$$$
... """
>>> mol = T.parse_molecule(sdf_record, "sdf")
[00:49:59]  S group SUP ignored on line 8
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x1043de4b0>
>>> T.get_id(mol)
''
>>> T.get_tag(mol, "ChEBI ID")
'CHEBI:15379'
(The "get_id" and "get_tag" methods are part of the new molecule API. I'll discuss them in a future essay.)

I found that I often want to get both the identifier and the molecule. Rather than use a combination of parse_molecule() and get_id()/get_tag() to get that information, I created the parse_id_and_molecule() function, which returns the 2-element tuple of (id, molecule), whose id_tag parameter specifies where to find the id.

>>> T.parse_id_and_molecule(sdf_record, "sdf", id_tag="ChEBI ID")
[00:50:37]  S group SUP ignored on line 8
('CHEBI:15379', <rdkit.Chem.rdchem.Mol object at 0x104346a60>)
If the id_tag is None, then it uses the appropriate default for the given format; for SD records that's the title. Otherwise it gives the name of the tag which contains the identifier. (If there are multiple tags with the same name then the choice of which one to use is arbitrary.)

I can imagine that some people might place an identifier in a non-standard location for other formats. If that happens, then there will likely be an id_tag syntax for that format.

For example, the closest I've come up with is a SMILES variant where the id you want is in the third column. In that case the id_tag might be "#3". If the column is labeled "SMILES" then an alternate would be "@SMILES", or if you know the column name doesn't start with a '#' or '@' then simply "SMILES".

However, that's hypothetical. I haven't seen it in real life. What I have seen are CSV files, where the SMILES and id columns are arbitrary, and which may have Excel-specific delimiter and quoting rules. Chemfp doesn't currently support CSV files; when it does, that will likely be handled through reader_args.

Handling whitespace in a SMILES record

The SMILES record format is not portable across toolkits. According to Daylight, and followed by OEChem, the format is the SMILES string, a single whitespace, and the rest of the line is the title. This title might be a simple alphanumeric id, or an IUPAC name with spaces in it, or anything else.

Other toolkits treat it as a character delimited set of columns. For example, RDKit by default uses the space character as the delimiter, though you can change that to tab, comma, or other character.

This is a problem because you might want to generate fingerprints from a SMILES file containing the record:

C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
In one toolkit you'll end up with the identifier "vitamin" and with another toolkit get the identifier "vitamin a".

My experience is that most people treat a SMILES file as a space or tab delimited set of columns, with a strong preference for the space character. This isn't universally true. Roger Sayle pointed out to me that the formal IUPAC name is a perfectly reasonable unique and canonical identifier, which can have a space in it. The Daylight interpretation supports IUPAC names, while the space-delimited version does not.

There is no perfect automatic solution to this ambiguity. Instead, chemfp lets you specify the appropriate delimiter type using the "delimiter" reader argument. The supported delimiter type names are "whitespace", "tab", "space", and "to-eol", as well as the literal characters " " and "\t". The default delimiter is "whitespace", because most of the people I work with think of a SMILES file more like a whitespace delimited file.

Here's an example of how the default doesn't like "vitamin a", but the "to-eol" delimiter handles it correctly:

>>> smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a\n"
>>> T.parse_id_and_molecule(smiles, "smi")
('vitamin', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)
>>> T.parse_id_and_molecule(smiles, "smi",
...              reader_args={"delimiter": "to-eol"})
('vitamin a', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
as well as an example of the difference between the "tab" and "to-eol" delimiter types:
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "tab"})
('A B', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "to-eol"})
('A B\tC\t', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)

It was a bit of work to make the different toolkits work the same way, and my best attempt isn't perfect. For example, if you are daft and try to interpret the record "C Q\tA" as a tab-delimited set of columns, then OEChem will see this as methane with an id of "Q" while RDKit and Open Babel will say they can't parse the SMILES "C Q".

So don't do that!

reader_args

As you saw, reader_args is a Python dictionary. All of the SMILES parsers accept the 'delimiter' argument, and the RDKit and Open Babel reader_args also support the "has_header" argument. If true, the first line of the file contains a header. (I couldn't think of a good implementation of this for OEChem.)

There are also toolkit-specific reader_args. Here I'll disable RDKit's sanity checker for SMILES, and show that it accepts a SMILES that it would otherwise reject.

>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": False})
<rdkit.Chem.rdchem.Mol object at 0x104407440>
>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": True})
[22:09:40] Can't kekulize mol 

Traceback (most recent call last):
  File "<stdin>", line 1, in 
  File "chemfp/rdkit_toolkit.py", line 251, in parse_molecule
    return _toolkit.parse_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 986, in parse_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 990, in _parse_molecule_impl
    id, mol = format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 1144, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES string %r" % (terms[0],))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES string 'c1ccccc1#N'

The SMILES parsers use different reader_args than the SDF parser. You can see the default reader_args by using the toolkit's format API:

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.get_format("smi").get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> rdkit_toolkit.get_format("sdf").get_default_reader_args()
{'strictParsing': True, 'removeHs': True, 'sanitize': True}
Also, the different toolkits may use different reader_args for the same format.
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.get_format("sdf").get_default_reader_args()
{'flavor': None, 'aromaticity': None}
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
I'll cover more about the format API in another essay.

Namespaces

This can lead to a problem. You saw earlier how to get the correct toolkit for a given fingerprint type. Once you have the toolkit you can parse a record into a toolkit-specific molecule. But what if you want toolkit-specific and format-specific settings?

First off, parsers ignore unknown reader_arg names, and the reader_arg names for the different toolkits are different, except for "delimiter" and "has_header" where it doesn't make sense for them to be different. That means you could do:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight, # for all OEChem readers
}
and have everything work.

Still, it's possible that you want OEChem to parse a SMILES using "daylight" aromaticity and an SD file using "openeye" aromaticity.

The reader_args are namespaced, so for that case you could use a format qualifier, like this:

reader_args = {
  "smi.aromaticity": "daylight",
  "can.aromaticity": "daylight",
  "usm.aromaticity": "daylight",
  "sdf.aromaticity": "openeye",
}
There's also a toolkit qualifier. In this daft example, the OpenEye readers use the whitespace delimiter option for SMILES files, the RDKit SMILES and InChI readers use tab, and the Open Babel SMILES and InChI readers use the space character.
reader_args = {
  "openeye.*.delimiter": "whitespace",
  "rdkit.*.delimiter": "tab",
  "openbabel.*.delimiter": "space",
}
Fully qualified names, like "openeye.smi.delimiter" are also allowed.

The remaining problem is how to configure the reader_args. It's no problem to write the Python dictionary yourself, but what if you want people to pass them in as command-line arguments or in a configuration file? I'll cover that detail when I talk about the format API.

Errors

What happens if the SMILES isn't valid? I'll use my favorite invalid SMILES:

>>> T.parse_id_and_molecule("Q q-ane", "smi")
[22:01:37] SMILES Parse Error: syntax error for input: Q
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/rdkit_toolkit.py", line 264, in parse_id_and_molecule
    return _toolkit.parse_id_and_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 1014, in parse_id_and_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 1018, in _parse_id_and_molecule_impl
    return format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 249, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES %r" % (smiles,))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES 'Q'
A chemfp.ParseError is also a ValueError, so you can check for this exception like this:
try:
    mol = T.parse_molecule("Q", "smistring")
except ValueError as err:
    print("Not a valid SMILES")

I think an exception is the right default action when there's an error, but various toolkits disagree. Some will return a None object in that case, and I can see the reasoning. It's sometimes easier to write things more like:

>>> mol = T.parse_molecule("Q", "smistring", errors="ignore")
[00:04:26] SMILES Parse Error: syntax error for input: Q
>>> print(mol)
None

The default errors value is "strict", which raises an exception. The "ignore" behavior is to return None when the function can't parse a SMILES.

In the above example, the "SMILES Parse Error" messages come from the underlying RDKit C++ library. It's difficult to capture this message from Python, so chemfp doesn't even try. OEChem and Open Babel have similar error messages; chemfp doesn't try to capture those either.

Error handling control for a single record isn't so important. It matters more when parsing multiple records, where "ignore" skips a record and keeps processing, while "strict" raises an exception and stops processing as soon as a record cannot be parsed. I'll cover that in the essay where I talk about reading molecules.
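
For a flavor of the difference, here's a sketch using the read_molecules() signature shown earlier ("compounds.smi" is a made-up filename, and I'm assuming the reader yields the parsed molecules):

from chemfp import rdkit_toolkit as T

# errors="ignore" skips records that cannot be parsed and keeps going;
# the default errors="strict" would raise a ParseError at the first bad record.
for mol in T.read_molecules("compounds.smi", errors="ignore"):
    pass  # only records that parsed successfully get here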

Other error handling - experimental!

The errors parameter can be two other strings: "report" and "log". The "report" option writes the error information to stderr, like this:

>>> T.parse_molecule("Q", "smi", errors="report")
[03:05:52] SMILES Parse Error: syntax error for input: Q
ERROR: RDKit cannot parse the SMILES 'Q'. Skipping.
The "SMILES Parse Error", again, is generated by the RDKit C++ level going directly to stderr. The "RDKit cannot parse the SMILES" line, however, is chemfp sending information to Python's sys.stderr.

I don't think the "report" option is all that useful for this case, since it's pretty much a duplicate of what the underlying C++ toolkit already prints. It does know about the id_tag, though, so it gives better error messages for ChEBI and ChEMBL files.

The "log" option is the same as "report" except that it sends the message to the "chemfp" logger using Python's built-in logging system. I have very little experience with logging, so this is even more experimental than "report".

Finally, the "errors" parameter can take an object to use as the error handler. The idea is that the handler's error() is called when there's an error. See chemfp/io.py for how it works, but consider this to be undocumented, experiemental, and almost certain to change.

I have the idea that the error handler can have a warn() method, which would be called for warnings. The current generation of toolkits uses a global error handler or sends the message directly to stderr. In a multi-threaded system, which might parse two molecules at the same time in different threads, it's hard to know which molecule caused the error.

I haven't done this because ... it's hard using the current generation of tools to get that warning information. I'm also unsure about the error handler protocol. How does it know that it's collected all of the warnings for a given molecule? Is there a special "end_of_warnings()" method? Are the warnings accumulated and sent in one call? What if there are both warnings and errors?

That's why this subsection is titled "experimental". :)

September 25, 2014 12:00 PM


Giampaolo Rodola

psutil 2.0 porting

This, my second blog post, is about psutil 2.0, a major release in which I decided to reorganize the existing API for the sake of consistency. At the time of writing psutil 2.0 is still under development, and the intent of this blog post is to serve as an official reference describing how you should port your existing code base. In doing so I will also explain why I decided to make these changes. Although many APIs will still be available as aliases pointing to the newer ones, the changes are numerous and many of them are not backward compatible. I'm sure many people will be sorely bitten, but I think this is for the better and it needed to be done, hopefully for the first and last time. OK, here we go.

Module constants turned into functions

What changed

Old name               Replacement
psutil.BOOT_TIME       psutil.boot_time()
psutil.NUM_CPUS        psutil.cpu_count()
psutil.TOTAL_PHYMEM    psutil.virtual_memory().total

Why I did it

I already talked about this more extensively in this blog post. In short: besides introducing unnecessary slowdowns, calculating a module-level constant at import time is dangerous because in case of error the whole app will crash. Also, the represented values may change over time (think of the system clock), but a constant cannot be updated.
Thanks to this hack accessing the old constants still works and produces a DeprecationWarning.
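
For example (the output values are just illustrative):

>>> import psutil
>>> psutil.boot_time()                 # was psutil.BOOT_TIME
1411884013.0
>>> psutil.cpu_count()                 # was psutil.NUM_CPUS
4
>>> psutil.virtual_memory().total      # was psutil.TOTAL_PHYMEM
8374149120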

Module functions renamings

What changed


Old name Replacement
psutil.get_boot_time() psutil.boot_time()
psutil.get_pid_list() psutil.pids()
psutil.get_users() psutil.users()

Why I did it

They were the only module-level functions with a "get_" prefix; none of the others have one.

Process class' methods renamings

What changed

All methods lost their "get_" and "set_" prefixes. A single method can now be used for both getting and setting (if a value is passed). Assuming p = psutil.Process():

Old name                   Replacement
p.get_children()           p.children()
p.get_connections()        p.connections()
p.get_cpu_affinity()       p.cpu_affinity()
p.get_cpu_percent()        p.cpu_percent()
p.get_cpu_times()          p.cpu_times()
p.get_io_counters()        p.io_counters()
p.get_ionice()             p.ionice()
p.get_memory_info()        p.memory_info()
p.get_ext_memory_info()    p.memory_info_ex()
p.get_memory_maps()        p.memory_maps()
p.get_memory_percent()     p.memory_percent()
p.get_nice()               p.nice()
p.get_num_ctx_switches()   p.num_ctx_switches()
p.get_num_fds()            p.num_fds()
p.get_num_threads()        p.num_threads()
p.get_open_files()         p.open_files()
p.get_rlimit()             p.rlimit()
p.get_threads()            p.threads()
p.getcwd()                 p.cwd()

...as for set_* methods:

Old name                Replacement
p.set_cpu_affinity()    p.cpu_affinity(cpus)
p.set_ionice()          p.ionice(ioclass, value=None)
p.set_nice()            p.nice(value)
p.set_rlimit()          p.rlimit(resource, limits=None)

Why I did it

I wanted to be consistent with the system-wide module-level functions, which have no "get_" prefix. After getting rid of the "get_" prefixes, removing "set_" as well seemed natural, and it helped reduce the number of methods.
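
As a quick illustration of the unified getter/setter style (the values shown are just an example):

>>> import psutil
>>> p = psutil.Process()
>>> p.nice()      # getter: no argument
0
>>> p.nice(10)    # setter: pass a value
>>> p.nice()
10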

Process properties are now methods

What changed

Assuming p = psutil.Process():

Old name         Replacement
p.cmdline        p.cmdline()
p.create_time    p.create_time()
p.exe            p.exe()
p.gids           p.gids()
p.name           p.name()
p.parent         p.parent()
p.ppid           p.ppid()
p.status         p.status()
p.uids           p.uids()
p.username       p.username()

Why I did it

Different reasons:

CPU percent intervals

What changed

The interval parameter of the cpu_percent* functions now defaults to 0.0 instead of 0.1. The functions affected are:
psutil.Process.cpu_percent()
psutil.cpu_percent()
psutil.cpu_times_percent()

Why I changed it

I originally set 0.1 as the default interval because in order to get a meaningful percent value you need to wait some time.
Having an API which "sleeps" by default is risky though, because it's easy to forget that it does so. That is particularly problematic when calling cpu_percent() for all processes: it's very easy to forget to specify interval=0, resulting in dramatic slowdowns which are hard to spot. For example, with the old default this code snippet could take several seconds to complete, depending on the number of active processes:
>>> # this will be slow
>>> for p in psutil.process_iter():
...     print(p.cpu_percent())
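
With the new non-blocking default, the way to get meaningful values for many processes is to prime the counters first and read them after a pause (the first non-blocking call returns a meaningless 0.0 which you ignore; processes that exit in between will raise NoSuchProcess, which real code should handle). A sketch of that pattern:

>>> import time
>>> procs = list(psutil.process_iter())
>>> for p in procs:
...     p.cpu_percent()           # starts the measurement for each process
>>> time.sleep(1)
>>> for p in procs:
...     print(p.cpu_percent())    # CPU usage since the previous call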

Migration strategy

Except for the Process properties (name, exe, cmdline, etc.), all the old APIs are still available as aliases pointing to the newer names and raising DeprecationWarning. psutil will be very clear about what you should use instead of the deprecated API as long as you start the interpreter with the "-Wd" option. This enables deprecation warnings, which were silenced in Python 2.7 (IMHO, from a developer standpoint this was a bad decision).
giampaolo@ubuntu:/tmp$ python -Wd
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import psutil
>>> psutil.get_pid_list()
__main__:1: DeprecationWarning: psutil.get_pid_list is deprecated; use psutil.pids() instead
[1, 2, 3, 6, 7, 13, ...]
>>>
>>>
>>> p = psutil.Process()
>>> p.get_cpu_times()
__main__:1: DeprecationWarning: get_cpu_times() is deprecated; use cpu_times() instead
pcputimes(user=0.08, system=0.03)
>>>
If you have a solid test suite you can run tests and fix the warnings one by one.

As for the Process properties which were turned into methods, it's more difficult because whereas psutil 1.2.1 returns the actual value, psutil 2.0.0 will return the bound method:
# psutil 1.2.1
>>> psutil.Process().name
'python'
>>>

# psutil 2.0.0
>>> psutil.Process().name
<bound method Process.name of psutil.Process(pid=19816, name='python') at 139845631328144>
>>>
What I would recommend, if you want to drop support for 1.2.1, is to grep for ".name", ".exe", etc. and replace them with ".name()", ".exe()", etc. one by one.
If on the other hand you want to write code which works with both versions, I see two possibilities:

#1 check version info, like this:
>>> PSUTIL2 = psutil.version_info >= (2, 0)
>>> p = psutil.Process()
>>> name = p.name() if PSUTIL2 else p.name
>>> exe = p.exe() if PSUTIL2 else p.exe
#2 get rid of all ".name", ".exe" occurrences you have in your code and use as_dict() instead:
>>> p = psutil.Process()
>>> pinfo = p.as_dict(attrs=["name", "exe"])
>>> pinfo
{'exe': '/usr/bin/python2.7', 'name': 'python'}
>>> name = pinfo['name']
>>> exe = pinfo['exe']

New features introduced in 2.0.0

Ok, enough with the bad news. =) psutil 2.0.0 is not only about code breakage. I also had the chance to integrate a bunch of interesting features.
>>> psutil.cpu_count()  # logical
4
>>> psutil.cpu_count(logical=False) # physical cores only
2

September 25, 2014 10:36 AM