
Planet Python

Last update: March 23, 2017 09:47 PM

March 23, 2017


Vasudev Ram

Analysing that Python code snippet

By Vasudev Ram

Hi readers,

Some days ago I had written this post:

Analyse this Python code snippet

in which I had shown a snippet of Python code (run in the Python shell), and said:

"Analyse the snippet of Python code below. See what you make of it. I will discuss it in my next post."

I am a few days late in discussing it; sorry about that.

Here is the analysis:

First, here's the snippet again, for reference:

>>> a = 1
>>> lis = [a, 2 ]
>>> lis
[1, 2]
>>> lis = [a, 2 ,
... "abc", False ]
>>>
>>> lis
[1, 2, 'abc', False]
>>> a
1
>>> b = 3
>>> lis
[1, 2, 'abc', False]
>>> a = b
>>> a
3
>>> lis
[1, 2, 'abc', False]
>>> lis = [a, 2 ]
>>> lis
[3, 2]
>>>

The potential for confusion (at least, as I said, for newbie Pythonistas) lies in these apparent points:

The variable a is set to 1.
Then it is put into the list lis, along with the constant 2.
Then lis is changed to be [a, 2, "abc", False].
One might now think that the variable a is stored in the list lis.
The next line prints its value, which shows it is 1.
All fine so far.
Then b is set to 3.
Then a is set to b, i.e. to the value of b.
So now a is 3.
But when we print lis again, it still shows 1 for the first item, not 3, as some might expect (since a is now set to 3).
Only when we run the next line:
lis = [a, 2]
and then print lis again, do we see that the first item in lis is now 3.

This has to do with the concept of naming and binding in Python.

When a Python statement like:
a = 1
is run, naming and binding happens. The name on the left is first created, and then bound to the (value of the) object on the right of the equals sign (the assignment operator). The value can be any expression, which, when evaluated, results in a value (a Python object [1]) of some kind. In this case it is the int object with value 1.

[1] Almost everything in Python is an object, like almost everything in Unix is a file. [Conditions apply :)]

When that name, a, is used in an expression, Python looks up the value of the object that the name is bound to, and uses that value in the expression, in place of the name.

So when the name a was used inside any of the lists that were bound to the name lis, it was actually the object bound to the name a that was used. The first time, that object was the int 1, so the first item of the list became 1, and stayed 1 until another (list) object was bound to the name lis.

But by this time, the name a had been rebound to another object, the int 3, the same one that the name b had been bound to just before. So the next time that the name lis was bound to a list, that list included the value of the object that the name a was now bound to, which was 3.

This is the reason why the code snippet works as it does.
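One way to see the binding directly (a small illustrative snippet, not from the original post) is to check object identity with the is operator: the list holds a reference to the object that a was bound to when the list was built, not a reference to the name a:

>>> a = 1
>>> lis = [a, 2]
>>> lis[0] is a      # the list item and the name a refer to the same object
True
>>> a = 3            # rebinds the name a to a different int object
>>> lis[0] is a      # the list still refers to the old object
False
>>> lis[0]
1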

On a related note (also about Python language features, syntax and semantics), I was playing around with the pprint module (Python's pretty-printer) and the Python is operator, and came up with this other snippet:

>>> import pprint
>>> lis = []
>>> for i in range(10):
...     lis.append(lis)
...
>>> print lis
[[...], [...], [...], [...], [...], [...], [...], [...], [...], [...]]

>>> pprint.pprint(lis)
[<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>]

>>> len(lis)
10

>>> lis is lis[0]
True

>>> lis is lis[0] is lis[0][0]
True

>>> lis is lis[0] is lis[0][0] is lis[0][0][0]
True

in which I created a list, appended it to itself (ten times), and then used pprint.pprint on it. I also used the Python is operator between the list and its 0th item, recursively, and was interested to see that the is operator can be used in a chain. I need to look that up (pun intended).
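For the record, Python's comparison operators, is included, chain: an expression like x is y is z is evaluated as (x is y) and (y is z), with each operand evaluated at most once. A quick check with the self-referencing list above:

>>> lis = []
>>> lis.append(lis)
>>> lis is lis[0] is lis[0][0]                  # chained form
True
>>> (lis is lis[0]) and (lis[0] is lis[0][0])   # equivalent expansion
True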

Enjoy.

- Vasudev Ram - Online Python training and consulting

Get updates (via Gumroad) on my forthcoming apps and content.

Jump to posts: Python * DLang * xtopdf

Subscribe to my blog by email

My ActiveState Code recipes

Follow me on: LinkedIn * Twitter

Are you a blogger with some traffic? Get Convertkit:

Email marketing for professional bloggers



March 23, 2017 09:39 PM


NumFOCUS

PyData Atlanta Meetup Celebrates 1 Year

PyData Atlanta holds a meetup at MailChimp, where Jim Crozier spoke about analyzing NFL data with PySpark.

Atlanta tells a new story about data

by Rob Clewley

In late 2015, the three of us (Tony Fast, Neel Shivdasani, and myself) had been regularly nerding out about data over beers and becoming fast friends. We were eager to see Atlanta's data community become more welcoming and encouraging towards beginners, self-starters, and generalists. We were about to find out that we were not alone.

We had met at local data science-related events earlier in the year and had discovered that we had lots of opinions—and weren’t afraid to advocate for them. But we also found that we listened to reason (data-driven learning!), appreciated the art in doing good science, and cared about people and the community. Open science, open data, free-and-open-source software, and creative forms of technical communication and learning were all recurring themes in our conversations. We also all agreed that Python is a great language for working with data.

Invitations were extended to like-minded friends, and the informal hangout was soon known as “Data Beers”. The consistent good buzz that Data Beers generated helped us realize an opportunity to contribute more widely to the Atlanta community. At the time, Atlanta was beginning its emergence as a new hub in the tech world and startup culture.

Some of the existing data-oriented meetups around Atlanta have a more formal business atmosphere, or are highly focused on specific tools or tech opinions. We have found that such environments can intimidate newcomers and those less formally educated in math or computer science. This inspired us to take a new perspective through an informal and eclectic approach oriented towards beginners, self-starters, and generalists. So, in January 2016, with the support of the not-for-profit organization NumFOCUS, we set up the Atlanta chapter of PyData.

The mission of NumFOCUS is to promote sustainable high-level programming languages, open code development, and reproducible scientific research. NumFOCUS sponsors PyData conferences and local meetups internationally. The PyData community gathers to discuss how best to apply tools using Python, R, Stan, and Julia to meet evolving challenges in data management, processing, analytics, and visualization. In all, PyData counts over 28,000 members across 52 international meetups. The Python language and the data-focused ecosystem that has grown around it have been remarkably successful in attracting an inclusive mindset centered around free and open-source software and science. Our Atlanta chapter aims to be even more neutral about specific technologies so long as the underlying spirit resonates with our mission.

The three of us, with the help of friend and colleague Lizzy Rolando, began sourcing great speakers with a distinctive approach to using data that resonated with the local tech culture. We hosted our first meetup in early April. From the beginning, we encouraged a do-it-yourself, interactive vibe to our meetings, supporting shorter-format 30-minute presentations with 20-minute question and answer sessions.

Regardless of the technical focus, we try to bring in speakers who are applying their data-driven work to something of general interest. Our programming balances technical and more qualitative talks. Our meetings have covered a diverse range of applications, addressing computer literacy and education, human rights, neuroscience, journalism, and civics.

A crowd favorite is the inclusion of 3-4 audience-submitted lightning talks at the end of the main Q&A. The strictly five-minute talks add more energy to the mix and give a wider platform to the local community. They’re an opportunity for students to practice presentation skills, to generate conversations around projects needing collaborators, to spark discussions about new tools, or just to have fun looking at interesting data sets.

Students, career changers, and professionals have come together as members of PyData to learn and share. Our network has generated new friends, collaborators, and even new jobs. Local organizations that share our community spirit provide generous sponsorship and refreshments for our meetings.

We believe we were in the right place at the right time to meet a need. It’s evident in the positive response and rapid growth we’ve seen, having acquired over 1,000 members in one year and hosted over 120 attendees at our last event. It has been a whirlwind experience, and we are delighted that our community has shared our spirit and become involved with us so strongly. Here’s to healthy, productive, data-driven outcomes for all of us in 2017!

March 23, 2017 09:23 PM


Reinout van Rees

Fossgis: open source for emergencies - Marco Lechner

(One of my summaries of a talk at the 2017 fossgis conference).

He works for the Bundesamt für Strahlenschutz, basically the German government agency that was started after Chernobyl to protect against and to measure radioactivity. The software system they use/build is called IMIS.

IMIS consists of three parts:

  • Measurements (automatic + mobile measurements + laboratory results).
  • Prediction system. Including documentation (managed in Plone, a python CMS system).
  • Decision support. Help support the government layers that have to make the decisions.

They have a simple map at odlinfo.bfs.de.

The current core of the system is proprietary. They are dependent on one single firm. The system is heavily customized for their usage.

They need a new system because geographical analysis keeps getting more important and because there are new requirements coming out of the government. The current program cannot handle that.

What they want is a new system that is as simple as possible; that uses standards for geographical exchange; they don't want to be dependent on a single firm anymore. So:

  • Use open standards, so OGC. But also a specific world-wide nuclear info protocol.
  • Use existing open source software. OSGEO.
  • If we need something special, can we change/extend existing open source software?
  • If not, then it is OK to create their own software, under an open source license.

They use open source companies to help them, including training their employees and helping these employees get used to modern software development (jenkins, docker, etc.).

If you use an open source strategy, what do you need to do to make it fair?

  • Your own developments should also be open source!
  • You need your own test and build infrastructure. (For instance Jenkins)
  • You need to make it easy to start working with what you made: documentation, docker, buildout (!), etc.

(Personal note: I didn't expect to hear 'buildout' at this open source GIS conference. I've helped quite a bit with that particular piece of python software :-) )

March 23, 2017 02:28 PM


PyBites

Module of the Week - ipaddress

While playing around with code for our post on generators we discovered the ipaddress module, part of the Standard Library. Such a handy little module!
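For readers who have not tried it yet, a few illustrative calls (this is the standard library ipaddress module, available since Python 3.3):

>>> import ipaddress
>>> addr = ipaddress.ip_address('192.168.0.10')
>>> addr.is_private
True
>>> net = ipaddress.ip_network('192.168.0.0/29')
>>> addr in net          # membership tests work between addresses and networks
False
>>> list(net.hosts())    # usable host addresses in the /29
[IPv4Address('192.168.0.1'), IPv4Address('192.168.0.2'), IPv4Address('192.168.0.3'), IPv4Address('192.168.0.4'), IPv4Address('192.168.0.5'), IPv4Address('192.168.0.6')]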

March 23, 2017 10:30 AM


Reinout van Rees

Fossgis: sewer cadastre with qgis - Jörg Höttges

(One of my summaries of a talk at the 2017 fossgis conference).

With engineering firms from the Aachen region they created qkan. Qkan is:

  • A data structure.
  • Plugins for Qgis.
  • Direct access. Not a specific application with restricted access, but unrestricted access from within Qgis. (He noticed lots of interest among the engineers to learn qgis during the project!)

It has been designed for the needs of the engineers that have to work with the data. You first import the data from the local sewer database. Qkan converts the data to what it needs. Then you can do simulations in a separate package. The results of the simulation will be visualized by Qkan in qgis. Afterwards you probably have to make some corrections to the data and give corrections back to the original database. Often you have to go look at the actual sewers to make sure the database is correct. Output is often a map with the sewer system.

Some functionality: import sewer data (in various formats). Simulate water levels. Draw graphs of the water levels in a sewer. Support database-level checks ("an end node cannot occur halfway along a sewer").

They took care to make the database schema simple. The source sewer database is always very complex because it has to hold lots of metadata. The engineer that has to work with it needs a much simpler schema in order to be productive. Qkan does this.

They used qgis, spatialite, postgis, python and qt (for forms). An important note: they used as much postgis functionality as possible instead of the geographical functions from qgis: the reason is that postgis (and even spatialite) is often much quicker.

With qgis, python and the "qt designer", you can make lots of handy forms. But you can always go back to the database that's underneath it.

The code is at https://github.com/hoettges

March 23, 2017 10:24 AM


CubicWeb

Introducing cubicweb-jsonschema

This is the first post of a series introducing the cubicweb-jsonschema project that is currently under development at Logilab. In this post, I'll first introduce the general goals of the project and then present in more detail two aspects: data models (the connection between Yams and JSON Schema in particular) and the basic features of the API. This post does not always present how things work in the current implementation but rather how they should work.

Goals of cubicweb-jsonschema

From a high level point of view, cubicweb-jsonschema addresses mainly two interconnected aspects: one is related to modelling for client-side development of user interfaces to CubicWeb applications, while the other concerns the HTTP API.

As far as modelling is concerned, cubicweb-jsonschema essentially aims at providing a transformation mechanism between a Yams schema and JSON Schema that is both automatic and extensible. This means that we can ultimately expect Yams definitions alone to be sufficient to generate JSON Schema definitions that would be consistent enough to build a UI, pretty much as is currently the case with the automatic web UI in CubicWeb. A corollary of this goal is that we want JSON Schema definitions to match their context of usage, meaning that a JSON Schema definition would not be the same in the context of viewing, editing or relationships manipulations.

In terms of API, cubicweb-jsonschema essentially aims at providing an HTTP API to manipulate entities based on their JSON Schema definitions.

Finally, the ultimate goal is to expose an hypermedia API for a CubicWeb application in order to be able to ultimately build an intelligent client. For this we'll build upon the JSON Hyper-Schema specification. This aspect will be discussed in a later post.

Basic usage as an HTTP API library

Consider a simple case where one wants to manipulate entities of type Author described by the following Yams schema definition:

class Author(EntityType):
    name = String(required=True)

With cubicweb-jsonschema one can get a JSON Schema for this entity type in different contexts such as view, creation or edition. For instance:

  • in a view context, the JSON Schema will be:

    {
        "$ref": "#/definitions/Author",
        "definitions": {
            "Author": {
                "additionalProperties": false,
                "properties": {
                    "name": {
                        "title": "name",
                        "type": "string"
                    }
                },
                "title": "Author",
                "type": "object"
            }
        }
    }
    
  • whereas in creation context, it'll be:

    {
        "$ref": "#/definitions/Author",
        "definitions": {
            "Author": {
                "additionalProperties": false,
                "properties": {
                    "name": {
                        "title": "name",
                        "type": "string"
                    }
                },
                "required": [
                    "name"
                ],
                "title": "Author",
                "type": "object"
            }
        }
    }
    

    (notice the required keyword listing the name property).

Such JSON Schema definitions are automatically generated from Yams definitions. In addition, cubicweb-jsonschema exposes some endpoints for basic CRUD operations on resources through an HTTP (JSON) API. From the client point of view, requests on these endpoints are of course expected to match JSON Schema definitions. Some examples:

Get an author resource:

GET /author/855
Accept:application/json

HTTP/1.1 200 OK
Content-Type: application/json
{"name": "Ernest Hemingway"}

Update an author:

PATCH /author/855
Accept:application/json
Content-Type: application/json
{"name": "Ernest Miller Hemingway"}

HTTP/1.1 200 OK
Location: /author/855/
Content-Type: application/json
{"name": "Ernest Miller Hemingway"}

Create an author:

POST /author
Accept:application/json
Content-Type: application/json
{"name": "Victor Hugo"}

HTTP/1.1 201 Created
Content-Type: application/json
Location: /Author/858
{"name": "Victor Hugo"}

Delete an author:

DELETE /author/858

HTTP/1.1 204 No Content

Now if the client sends invalid input with respect to the schema, they'll get an error:

(We provide a wrong born property in request body.)

PATCH /author/855
Accept:application/json
Content-Type: application/json
{"born": "1899-07-21"}

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
    "errors": [
        {
            "details": "Additional properties are not allowed ('born' was unexpected)",
            "status": 422
        }
    ]
}
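As an aside, the same kind of validation can be reproduced client-side with the Python jsonschema library against the creation schema shown earlier. This is a minimal sketch, independent of cubicweb-jsonschema; the schema is flattened here (no $ref/definitions indirection) and the variable name author_creation_schema is just for illustration:

import jsonschema

author_creation_schema = {
    "title": "Author",
    "type": "object",
    "additionalProperties": False,
    "properties": {"name": {"title": "name", "type": "string"}},
    "required": ["name"],
}

# A valid instance passes silently.
jsonschema.validate({"name": "Victor Hugo"}, author_creation_schema)

# An unexpected property raises a ValidationError, much like the 400 response above.
try:
    jsonschema.validate({"name": "Ernest Hemingway", "born": "1899-07-21"},
                        author_creation_schema)
except jsonschema.ValidationError as exc:
    print(exc.message)   # Additional properties are not allowed ('born' was unexpected)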

From Yams model to JSON Schema definitions

The example above illustrates automatic generation of JSON Schema documents based on Yams schema definitions. These documents are expected to help developing views and forms for a web client. Clearly, we expect that cubicweb-jsonschema serves JSON Schema documents for viewing and editing entities as cubicweb.web serves HTML documents for the same purposes. The underlying logic for JSON Schema generation is currently heavily inspired by the logic of the primary view and the automatic entity form as they exist in cubicweb.web.views. That is: the Yams schema is introspected to determine how properties should be generated, and any additional control over this can be performed through uicfg declarations [1].

To illustrate, let's consider the following schema definitions:

class Book(EntityType):
    title = String(required=True)
    publication_date = Datetime(required=True)

class Illustration(EntityType):
    data = Bytes(required=True)

class illustrates(RelationDefinition):
    subject = 'Illustration'
    object = 'Book'
    cardinality = '1*'
    composite = 'object'
    inlined = True

class Author(EntityType):
    name = String(required=True)

class author(RelationDefinition):
    subject = 'Book'
    object = 'Author'
    cardinality = '1*'

class Topic(EntityType):
    name = String(required=True)

class topics(RelationDefinition):
    subject = 'Book'
    object = 'Topic'
    cardinality = '**'

and consider, as before, JSON Schema documents in different contexts for the Book entity type:

  • in view context:

    {
        "$ref": "#/definitions/Book",
        "definitions": {
            "Book": {
                "additionalProperties": false,
                "properties": {
                    "author": {
                        "items": {
                            "type": "string"
                        },
                        "title": "author",
                        "type": "array"
                    },
                    "publication_date": {
                        "format": "date-time",
                        "title": "publication_date",
                        "type": "string"
                    },
                    "title": {
                        "title": "title",
                        "type": "string"
                    },
                    "topics": {
                        "items": {
                            "type": "string"
                        },
                        "title": "topics",
                        "type": "array"
                    }
                },
                "title": "Book",
                "type": "object"
            }
        }
    }
    

    We have a single Book definition in this document, in which we find attributes defined in the Yams schema (title and publication_date). We also find the two relations where Book is involved: topics and author, both appearing as a single array of "string" items. The author relationship appears like that because it is mandatory but not composite. On the other hand, the topics relationship has the following uicfg rule:

    uicfg.primaryview_section.tag_subject_of(('Book', 'topics', '*'), 'attributes')
    

    so that its definition appears embedded in the document of the Book definition.

    A typical JSON representation of a Book entity would be:

    {
        "author": [
            "Ernest Miller Hemingway"
        ],
        "title": "The Old Man and the Sea",
        "topics": [
            "sword fish",
            "cuba"
        ]
    }
    
  • in creation context:

    {
        "$ref": "#/definitions/Book",
        "definitions": {
            "Book": {
                "additionalProperties": false,
                "properties": {
                    "author": {
                        "items": {
                            "oneOf": [
                                {
                                    "enum": [
                                        "855"
                                    ],
                                    "title": "Ernest Miller Hemingway"
                                },
                                {
                                    "enum": [
                                        "857"
                                    ],
                                    "title": "Victor Hugo"
                                }
                            ],
                            "type": "string"
                        },
                        "maxItems": 1,
                        "minItems": 1,
                        "title": "author",
                        "type": "array"
                    },
                    "publication_date": {
                        "format": "date-time",
                        "title": "publication_date",
                        "type": "string"
                    },
                    "title": {
                        "title": "title",
                        "type": "string"
                    }
                },
                "required": [
                    "title",
                    "publication_date"
                ],
                "title": "Book",
                "type": "object"
            }
        }
    }
    

    Notice the differences: we now only have attributes and required relationships (author) in this schema, and we have the required keyword listing mandatory attributes; the author property is represented as an array whose items consist of pre-existing objects of the author relationship (namely Author entities).

    Now assume we add the following uicfg declaration:

    uicfg.autoform_section.tag_object_of(('*', 'illustrates', 'Book'), 'main', 'inlined')
    

    the JSON Schema for creation context will be:

    {
        "$ref": "#/definitions/Book",
        "definitions": {
            "Book": {
                "additionalProperties": false,
                "properties": {
                    "author": {
                        "items": {
                            "oneOf": [
                                {
                                    "enum": [
                                        "855"
                                    ],
                                    "title": "Ernest Miller Hemingway"
                                },
                                {
                                    "enum": [
                                        "857"
                                    ],
                                    "title": "Victor Hugo"
                                }
                            ],
                            "type": "string"
                        },
                        "maxItems": 1,
                        "minItems": 1,
                        "title": "author",
                        "type": "array"
                    },
                    "illustrates": {
                        "items": {
                            "$ref": "#/definitions/Illustration"
                        },
                        "title": "illustrates_object",
                        "type": "array"
                    },
                    "publication_date": {
                        "format": "date-time",
                        "title": "publication_date",
                        "type": "string"
                    },
                    "title": {
                        "title": "title",
                        "type": "string"
                    }
                },
                "required": [
                    "title",
                    "publication_date"
                ],
                "title": "Book",
                "type": "object"
            },
            "Illustration": {
                "additionalProperties": false,
                "properties": {
                    "data": {
                        "format": "data-url",
                        "title": "data",
                        "type": "string"
                    }
                },
                "required": [
                    "data"
                ],
                "title": "Illustration",
                "type": "object"
            }
        }
    }
    

    We now have an additional illustrates property modelled as an array of #/definitions/Illustration, the latter also added to the document as an additional definition entry.

Conclusion

This post illustrated how a basic (CRUD) HTTP API based on JSON Schema can be built for a CubicWeb application using cubicweb-jsonschema. We have seen a couple of details on JSON Schema generation and how it can be controlled. Feel free to comment and provide feedback on this feature set as well as open the discussion with more use cases.

Next time, we'll discuss how hypermedia controls can be added to the HTTP API that cubicweb-jsonschema provides.

[1] This choice is essentially driven by simplicity and conformance with the existing behavior, to help migration of existing applications.

March 23, 2017 09:57 AM


Reinout van Rees

Fossgis: creating maps with open street map in QGis - Axel Heinemann

(One of my summaries of a talk at the 2017 fossgis conference).

He wanted to make a map for a local run: a nice map with the route and the infrastructure (start, end, parking, etc), instead of the usual not-quite-readable city plan with a simple line on top. With qgis and openstreetmap he should be able to make something better!

A quick try with QGis, combined with the standard openstreetmap base map, already looked quite nice, but he wanted to do more customizations on the map colors. So he needed to download the openstreetmap data. That turned into quite a challenge. He tried two plugins:

  • OSMDownloader: easy selection, quick download. Drawback: too many objects as you cannot filter. The attribute table is hard to read.
  • QuickOSM: key/value selection, quick. Drawback: you need a bit of experience with the tool, as it is easy to forget key/values.

He then landed on https://overpass-turbo.eu . The user interface is very friendly. There is a wizard to get common cases done. And you can browse the available tags.

With the data downloaded with overpass-turbo, he could easily adjust colors and get a much nicer map out of it.

You can get it to work, but it takes a lot of custom work.

Some useful links:

  • https://taginfo.openstreetmap.org
  • http://tagfinder.herokuapp.com
  • https://gis.stackexchange.com

https://abload.de/img/screenshot2017-03-21a6asqe.png

Photo explanation: just a nice unrelated picture from the recent beautiful 'on traxs' model railway exhibition (see video).

March 23, 2017 09:55 AM

Fossgis: introduction on some open source software packages

(One of my summaries of a talk at the 2017 fossgis conference).

The conference started with a quick introduction on several open source programs.

Openlayers 3 - Marc Jansen

Marc works on both openlayers and GeoExt. Openlayers is a javascript library with lots and lots of features.

To see what it can do, look at the 161 examples on the website :-) It works with both vector layers and raster layers.

Openlayers is a quite mature project; the first version is from 2006. It changed a lot to keep up with the state of the art, but they did take care to keep everything backwards compatible. Upgrading from 2.0 to 2.2 should have been relatively easy. The 4.0.0 version came out last month.

Openlayers...

  • Allows many different data sources and layer types.
  • Has built-in interaction and controls.
  • Is very actively developed.
  • Is well documented and has lots of examples.

The aim is to be easy to start with, but also to allow full control of your map and all sorts of customization.

Geoserver - Marc Jansen

(Again Marc: someone was sick...)

Geoserver is a java-based server for geographical data. It supports lots of OGC standards (WMS, WFS, WPS, etc). Flexible, extensible, well documented. "Geoserver is a glorious example that you can write very performant software in java".

Geoserver can connect to many different data sources and make those sources available as map data.

If you're a government agency, you're required to make INSPIRE metadata available for your maps: geoserver can help you with that.

A big advantage of geoserver: it has a browser-based interface for configuring it. You can do 99% of your configuration work in the browser. For maintaining: there is monitoring to keep an eye on it.

Something to look at: the importer plugin. With it you get a REST API to upload shapes, for instance.

The latest version also supports LDAP groups. LDAP was already supported, but group membership not yet.

Mapproxy - Dominik Helle

Dominik is one of the MapProxy developers. Mapproxy is a WMS cache and tile cache. The original goal was to make maps quicker by caching maps.

Some possible sources: WMS, WMTS, tiles (google/bing/etc), MapServer. The output can be WMS, WMS-C, WMTS, TMS, KML. So the input could be google maps and the output WMS. One of their customers combines the output of five different regional organisations into one WMS layer...

The maps that mapproxy returns can be stored on a local disk in order to improve performance. The way they store it allows mapproxy to support intermediary zoom levels instead of fixed ones.

The cache can be in various formats: MBTiles, sqlite, couchdb, riak, arcgis compact cache, redis, s3. The cache is efficient by combining layers and by omitting unneeded data (empty tiles).

You can pre-fill the cache ("seeding").

Some other possibilities, apart from caching:

  • A nice feature: clipping. You can limit a source map to a specific area.
  • Reprojecting from one coordinate system to another. Very handy if someone else doesn't want to support the coordinate system that you need.
  • WMS feature info: you can just pass it on to the backend system, but you can also intercept and change it.
  • Protection of your layers. Password protection. Protect specific layers. Only allow specific areas. Etcetera.

QGis - Thomas Schüttenberg

QGis is an open source GIS platform: desktop, server, browser, mobile. And it is a library. It runs on OSX, Linux, Windows and Android. The base is the Qt UI library, hence the name.

Qgis contains almost everything you'd expect from a GIS package. You can extend it with plugins.

Qgis is a very, very active project. Almost 1 million lines of code. 30,000+ github commits. 332 developers have worked on it, 104 of them in the last 12 months.

Support is available via documentation, mailing lists and http://gis.stackexchange.com/ . In case you're wondering about the names of the releases: they come from the towns where the twice-a-year project meeting takes place :-)

Since december 2016, there's an official (legal) association.

QGis 3 will have lots of changes: QT 5 and python 3.

Mapbender 3 - Astrid Emde

Mapbender is a library for building web GIS applications. Ideally, you don't need to write any code yourself; instead you configure it in your browser. It also supports mobile usage.

You can try it at http://demo.mapbender3.org/ . Examples are at http://mapbender3.org/?q=en/gallery .

You can choose a layout and fill in and configure the various parts. Layers you want to show: add sources. You can configure security/access with roles.

An example component: a search form for addresses that looks up addresses with sql or a web service. Such a search form can be a popup or you can put it in the sidebar, for instance. CSS can be customized.

PostNAS - Astrid Emde, Jelto Buurman

The PostNAS project is a solution for importing ALKIS data, a data exchange format for the German cadastre (German: Kataster).

PostNAS is an extension of the GDAL library for the "NAS" vector data format (NAS = normalisierte Austausch-Schnittstelle, "normalized exchange interface"). This way, you can use all of the gdal functionality with the cadastre data. But that's not the only thing: there's also a qgis plugin. There are configuration and conversion scripts for postgis, mapbender, mapserver, etc.

They needed postprocessing/conversion scripts to get useful database tables out of the original data, tables that are usable for showing in QGis, for instance.

So... basically a complete open source environment for working with the cadastre data!

https://abload.de/img/screenshot2017-03-21a1islt.png

Photo explanation: just a nice unrelated picture from the recent beautiful 'on traxs' model railway exhibition (see video).

March 23, 2017 09:55 AM


Tomasz Früboes

Unittesting print statements

Recently I was refactoring a small package that is supposed to allow execution of arbitrary python code on a remote machine. The first implementation was working nicely but with one serious drawback – the function handling the actual code execution was running in synchronous (blocking) mode. As a result, all of the output (both stdout and stderr) was presented only at the end, i.e. when the code finished its execution. This was unacceptable, since the package should work in a way that is as transparent to the user as possible. A wall of text when the code completes its task wasn’t acceptable.

The goal of the refactoring was simple – to have the output presented to the user immediately after it was printed on the remote host. As a TDD worshipper I wanted to start this in a kosher way, i.e. with a test. And I got stuck.

For a day or so I had no idea how to progress. How do you unittest print statements? It’s funny when I think about this now. I have used a similar technique many times in the past for output redirection, yet somehow hadn’t managed to make the connection with this problem.

The print statement

So how do you do it? First we should understand what happens when a print statement is executed. In Python 2.x the print statement does two things – it converts the provided expressions into strings and writes the result to a file-like object handling stdout. Conveniently, that object is available as sys.stdout (i.e. as part of the sys module). So all you have to do is overwrite sys.stdout with your own object providing a ‘write’ method. Later you may discover that some other methods are also needed (e.g. ‘flush’ is quite often used), but for starters, having only the ‘write’ method should be sufficient.
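In other words (Python 2 syntax, matching the examples below), the following two lines end up writing the same characters to the same place:

import sys

print "hello"                  # the print statement converts and writes to sys.stdout
sys.stdout.write("hello\n")    # roughly what happens under the hood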

A first try – simple stdout interceptor

The code below does just that. The MyOutput class is designed to replace the original sys.stdout:

import unittest
import sys

def fn_print(nrepeat):
    print "ab"*nrepeat

class MyTest(unittest.TestCase):
    def test_stdout(self):
        class MyOutput(object):
            def __init__(self):
                self.data = []

            def write(self, s):
                self.data.append(s)

            def __str__(self):
                return "".join(self.data)

        stdout_org = sys.stdout
        my_stdout = MyOutput()
        try:
            sys.stdout = my_stdout
            fn_print(2)
        finally:
            sys.stdout = stdout_org

        self.assertEquals( str(my_stdout), "abab\n") 

if __name__ == "__main__":
    unittest.main()

The fn_print function provides output to test against. After replacing sys.stdout we call this function and compare the obtained output with the expected one. It is worth noting that in the example above the original sys.stdout is first preserved and then carefully restored inside the ‘finally’ block. If you don’t do this, you are likely to lose any output coming from other tests.

Is my code async? Logging time of arrival

In the second example we will address the original problem – is the output presented as a wall of text at the end, or in real time as we want? For this we will add a time-of-arrival logging capability to the object replacing sys.stdout:

import unittest
import time
import sys

def fn_print_with_delay(nrepeat):
    for i in xrange(nrepeat):
        print    # prints a single newline
        time.sleep(0.5)

class TestServer(unittest.TestCase):
    def test_stdout_time(self):
        class TimeLoggingOutput(object):
            def __init__(self):
                self.data = []
                self.timestamps = []

            def write(self, s):
                self.timestamps.append(time.time())
                self.data.append(s)

        stdout_org = sys.stdout
        my_stdout = TimeLoggingOutput()
        nrep = 3 # make sure is >1
        try:
            sys.stdout = my_stdout
            fn_print_with_delay(nrep)
        finally:
            sys.stdout = stdout_org

        for i in xrange(nrep):
            if i > 0:
                dt = my_stdout.timestamps[i]-my_stdout.timestamps[i-1]
                self.assertTrue(0.5<dt<0.52)

if __name__ == "__main__":
    unittest.main()

The code is pretty much self-explanatory – the fn_print_with_delay function prints newlines at half-second intervals. We override sys.stdout with an instance of a class capable of storing timestamps (obtained with time.time()) of all calls to the write method. At the end we assert that the timestamps are spaced approximately half a second apart. The code above works as expected:

.
----------------------------------------------------------------------
Ran 1 test in 1.502s

OK

If we change the interval inside the fn_print_with_delay function to one second, the test will (fortunately) fail.

Wrap-up

As we saw, testing for expected output is in fact trivial – all you have to do is put an instance of a class with a ‘write’ method in the proper place (i.e. sys.stdout). The only ‘gotcha’ is the cleanup – you should remember to restore sys.stdout to its original state. You may apply the exact same technique if you need to test stderr (just target sys.stderr instead of sys.stdout). It is also worth noting that with a similar technique you could intercept (or completely silence) output coming from external libraries.
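As a side note, on Python 3 the same idea is available ready-made: contextlib.redirect_stdout temporarily swaps sys.stdout and restores it for you. A small sketch (assuming the code under test uses the print() function):

import io
import unittest
from contextlib import redirect_stdout

def fn_print(nrepeat):
    print("ab" * nrepeat)

class MyTest(unittest.TestCase):
    def test_stdout(self):
        buf = io.StringIO()
        with redirect_stdout(buf):      # sys.stdout is restored when the block exits
            fn_print(2)
        self.assertEqual(buf.getvalue(), "abab\n")

if __name__ == "__main__":
    unittest.main()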

March 23, 2017 09:10 AM


DataCamp

PySpark Cheat Sheet: Spark in Python

Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It allows you to speed up analytic applications by up to 100 times compared to other technologies on the market today. You can interface Spark with Python through "PySpark", the Spark Python API, which exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform, inspect, ... your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations.

Let's face it, map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()?
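As a quick, minimal sketch of those differences (the tiny RDDs here are made up for illustration):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(["a b", "c d"])
rdd.map(lambda line: line.split()).collect()       # [['a', 'b'], ['c', 'd']] -- one result per element
rdd.flatMap(lambda line: line.split()).collect()   # ['a', 'b', 'c', 'd']     -- results flattened

sc.parallelize([1, 2, 3, 4]).reduce(lambda x, y: x + y)   # 10 -- the whole RDD reduced to one value
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.reduceByKey(lambda x, y: x + y).collect()    # [('a', 4), ('b', 2)] -- one value per key (order may vary)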

PySpark cheat sheet

Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.

This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet. 

Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.

Are you hungry for more? Don't miss our other Python cheat sheets for data science, which cover topics such as Python basics, NumPy, Pandas, Pandas Data Wrangling and much more!

March 23, 2017 09:10 AM


Rene Dudfield

pip is broken

Help?

Since asking people to use pip to install things, I get a lot of feedback on pip not working. Feedback like this.

"Our fun packaging Jargon"


What is a pip? What's it for? It's not built into python?  It's the almost-default and almost-standard tool for installing python code. Pip almost works a lot of the time. You install things from pypi. I should download pypy? No, pee why, pee eye. The cheeseshop. You're weird. Just call it pee why pee eye. But why is it called pip? I don't know.

"Feedback like this."

pip is broken on the raspberian

pip3 doesn't exist on windows

People have an old pip. Old pip doesn't support wheels. What are wheels? It's a cute bit of jargon to mean a zip file with python code in it structured in a nice way. I heard about eggs... tell me about eggs? Well, eggs are another zip file with python code in it. Used mainly by easy_install. Easy install? Let's use that, this is all too much.

The pip executable or script is for python 2, and they are using python 3.

pip is for a system python, and they have another python installed. How did they install that python? Which of the several pythons did they install? Maybe if they install another python it will work this time.

It's not working one time and they think that sudo will fix things. And now certain files can't be updated without sudo. However, now they have forgotten that sudo exists.

"pip lets you run it with sudo, without warning."


pip doesn't tell them which python it is installing for. But I installed it! Yes you did. But which version of python, and into which virtualenv? Let's use these cryptic commands to try and find out...

pip doesn't install things atomically, so if there is a failed install, things break. If pip was a database (it is)...

Virtual environments work if you use python -m venv, but not virtualenv. Or some times it's the other way around. If you have the right packages installed on Debian, and Ubuntu... because they don't install virtualenv by default.

What do you mean I can't rename my virtualenv folder? I can't move it to another place on my Desktop?

pip installs things into global places by default.

"Globals by default."


Why are packages still installed globally by default?

"So what works currently most of the time?"


python3 -m venv anenv
. ./anenv/bin/activate
pip install pip --upgrade
pip install pygame


This is not ideal. It doesn't work on windows. It doesn't work on Ubuntu. It makes some text editors crash (because virtualenvs have so many files they get sick). It confuses test discovery (because for some reason they don't know about virtual environments still and try to test random packages you have installed). You have to know about virtualenv, about pip, about running things with modules, about environment variables, and system paths. You have to know that at the beginning. Before you know anything at all.

Is there even one set of instructions where people can have a new environment, and install something? Install something in a way that it might not break their other applications? In a way which won't cause them harm? Please let me know the magic words?

I just tell people `pip install pygame`. Even though I know it doesn't work. And can't work. By design. I tell them to do that, because it's probably the best we got. And pip keeps getting better. And one day it will be even better.

Help? Let's fix this.

March 23, 2017 09:00 AM


Kushal Das

Running MicroPython on 96Boards Carbon

I received my Carbon from Seeed Studio a few months back, but I never found time to sit down and work on it. During FOSSASIA, in my MicroPython workshop, Siddhesh was working on putting MicroPython, using Zephyr, on his Carbon. That gave me the motivation to have a look at the same after coming back home.

What is Carbon?

Carbon is a 96Boards IoT edition compatible board, with a Cortex-M4 chip, and 512KB flash. It currently runs Zephyr, which is a Linux Foundation hosted project to build a scalable real-time operating system (RTOS).

Setup MicroPython on Carbon

To install the dependencies in Fedora:

$ sudo dnf group install "Development Tools"
$ sudo dnf install git make gcc glibc-static \
      libstdc++-static python3-ply ncurses-devel \
      python-yaml python2 dfu-util

The next step is to set up the Zephyr SDK. You can download the latest binary from here. Then you can install it under your home directory (you don’t have to install it system-wide). I installed it under ~/opt/zephyr-sdk-0.9.

Next, I had to check out the Zephyr source; I cloned from the https://git.linaro.org/lite/zephyr.git repo. I also cloned MicroPython from the official GitHub repo. I will just copy-paste the next steps below.

$ source zephyr-env.sh
$ cd ~/code/git/
$ git clone https://github.com/micropython/micropython.git
$ cd micropython/zephyr

Then I created a project file especially for the Carbon board; this file is named prj_96b_carbon.conf, and I am pasting its content below. I have submitted it as a patch to the upstream MicroPython project. It disables networking (otherwise you will get stuck while trying to get to the REPL).

# No networking for carbon
CONFIG_NETWORKING=n
CONFIG_NET_IPV4=n
CONFIG_NET_IPV6=n

Next, we have to build MicroPython as a Zephyr application.

$ make BOARD=96b_carbon
$ ls outdir/96b_carbon/
arch     ext          isr_tables.c  lib          Makefile         scripts  tests       zephyr.hex  zephyr.map           zephyr.strip
boards   include      isr_tables.o  libzephyr.a  Makefile.export  src      zephyr.bin  zephyr.lnk  zephyr_prebuilt.elf
drivers  isrList.bin  kernel        linker.cmd   misc             subsys   zephyr.elf  zephyr.lst  zephyr.stat

After the build is finished, you will be able to see a zephyr.bin file in the output directory.

Uploading the fresh build to the carbon

Before anything else, I connected my Carbon board to the laptop using a USB cable to the OTG port (remember to check the port name). Then I had to press the BOOT0 button, and while pressing that one, I also pressed the Reset button. Then I released the Reset button first, and after that the BOOT0 button. If you run the dfu-util command after this, you should be able to see output like below.

$ sudo dfu-util -l
dfu-util 0.9
Copyright 2005-2009 Weston Schmidt, Harald Welte and OpenMoko Inc.
Copyright 2010-2016 Tormod Volden and Stefan Schmidt
This program is Free Software and has ABSOLUTELY NO WARRANTY
Please report bugs to http://sourceforge.net/p/dfu-util/tickets/
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=3, name="@Device Feature/0xFFFF0000/01*004 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=2, name="@OTP Memory /0x1FFF7800/01*512 e,01*016 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=1, name="@Option Bytes /0x1FFFC000/01*016 e", serial="385B38683234"
Found DFU: [0483:df11] ver=2200, devnum=14, cfg=1, intf=0, path="2-2", alt=0, name="@Internal Flash /0x08000000/04*016Kg,01*064Kg,03*128Kg", serial="385B38683234"

This means the board is in DFU mode. Next we flash the new application to the board.

$ sudo dfu-util -d [0483:df11] -a 0 -D outdir/96b_carbon/zephyr.bin -s 0x08000000
dfu-util 0.9
Copyright 2005-2009 Weston Schmidt, Harald Welte and OpenMoko Inc.
Copyright 2010-2016 Tormod Volden and Stefan Schmidt
This program is Free Software and has ABSOLUTELY NO WARRANTY
Please report bugs to http://sourceforge.net/p/dfu-util/tickets/
dfu-util: Invalid DFU suffix signature
dfu-util: A valid DFU suffix will be required in a future dfu-util release!!!
Opening DFU capable USB device...
ID 0483:df11
Run-time device DFU version 011a
Claiming USB DFU Interface...
Setting Alternate Setting #0 ...
Determining device status: state = dfuERROR, status = 10
dfuERROR, clearing status
Determining device status: state = dfuIDLE, status = 0
dfuIDLE, continuing
DFU mode device DFU version 011a
Device returned transfer size 2048
DfuSe interface name: "Internal Flash "
Downloading to address = 0x08000000, size = 125712
Download [=========================] 100% 125712 bytes
Download done.
File downloaded successfully

Hello World on Carbon

The hello world of hardware land is the LED blinking code. I used the on-board LED(s) for this; the sample code is given below. I have now connected the board to the UART (instead of the OTG port).

$ screen /dev/ttyUSB0 115200
>>>
>>> import time
>>> from machine import Pin
>>> led1 = Pin(("GPIOD",2), Pin.OUT)
>>> led2 = Pin(("GPIOB",5), Pin.OUT)
>>> while True:
...     led2.low()
...     led1.high()
...     time.sleep(0.5)
...     led2.high()
...     led1.low()
...     time.sleep(0.5)

March 23, 2017 06:42 AM


Fabio Zadrozny

PyDev 5.6.0 released: faster debugger, improved type inference for super and pytest fixtures

PyDev 5.6.0 is now available for download (and is already bundled in LiClipse 3.5.0).

There are many improvements on this version!

The major one is that the PyDev.Debugger got some attention and should now be 60%-100% faster overall -- in all supported Python versions (and that's on top of the improvements done previously).

This improvement was a nice example of trading memory for speed (the major change was that the debugger now has 2 new caches, one saving whether a frame should be skipped or not and another saving whether a given line in a traced frame should be skipped or not, which enables the debugger to make far fewer checks on those occasions).
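As a rough, purely illustrative sketch of that caching idea (this is not PyDev's actual code; the names _skip_cache, should_skip_frame and skipped_prefixes are made up), a tracing function can memoize its skip decision per code object so the potentially expensive check is paid only once per function:

import os

_skip_cache = {}   # code object -> bool, remembers earlier decisions

def should_skip_frame(frame, skipped_prefixes=("/usr/lib/",)):
    """Decide (once per code object) whether a frame should not be traced."""
    code = frame.f_code
    try:
        return _skip_cache[code]                       # cheap path after the first call
    except KeyError:
        filename = os.path.abspath(code.co_filename)
        skip = filename.startswith(skipped_prefixes)   # stand-in for the expensive check
        _skip_cache[code] = skip
        return skip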

Also, other fixes were done in the debugger. Namely:


Note: from this version onward, the debugger will now only support Python 2.6+ (I believe there should be very few Python 2.5 users -- Python 2.6 itself stopped being supported in 2013, so, I expect this change to affect almost no one -- if someone really needs to use an older version of Python, it's always possible to get an older version of the IDE/debugger too). Also, from now on, supported versions are actually properly tested on the ci (2.6, 2.7 and 3.5 in https://travis-ci.org/fabioz/PyDev.Debugger and 2.7, 3.5 in https://ci.appveyor.com/project/fabioz/pydev-debugger).

The code-completion (Ctrl+Space) and find definition (F3) also had improvements and can now deal with the Python super (so, it's possible to get completions and go to the definition of a method declared in a superclass when using the super construct) and pytest fixtures (so, if you have a pytest fixture, you should now be able to have completions/go to its definition even if you don't add a docstring to the parameter saying its expected type).

Also, this release improved the support for third-party packages, so coverage, pycodestyle (previously pep8.py) and autopep8 now use the latest version available. Also, PyLint was improved to use the same thread pool used in code-analysis, and an issue in the Django shell was fixed for django >= 1.10.

And to finish, the preferences for running unit-tests can now be saved to the project or user settings (i.e.: preferences > PyDev > PyUnit > Save to ...) and an issue was fixed when coloring the matrix multiplication operator (which was wrongly recognized as a decorator).

Thank you very much to all the PyDev supporters and Patrons (http://www.patreon.com/fabioz), who help to keep PyDev moving forward and to JetBrains, which sponsored many of the improvements done in the PyDev.Debugger.

March 23, 2017 04:29 AM


Matthew Rocklin

Dask Release 0.14.1

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.1. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on February 27th.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Arrays

Recent work in distributed computing and machine learning has motivated new performance-oriented and usability changes to how we handle arrays.

Automatic chunking and operation on NumPy arrays

Many interactions between Dask arrays and NumPy arrays work smoothly. NumPy arrays are made lazy and are appropriately chunked to match the operation and the Dask array.

>>> x = np.ones(10)                 # a numpy array
>>> y = da.arange(10, chunks=(5,))  # a dask array
>>> z = x + y                       # combined become a dask.array
>>> z
dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>

>>> z.compute()
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

Reshape

Reshaping distributed arrays is simple in simple cases, and can be quite complex in complex cases. Reshape now supports a much broader set of shape transformations where any dimension can be collapsed or merged into other dimensions.

>>> x = da.ones((2, 3, 4, 5, 6), chunks=(2, 2, 2, 2, 2))
>>> x.reshape((6, 2, 2, 30, 1))
dask.array<reshape, shape=(6, 2, 2, 30, 1), dtype=float64, chunksize=(3, 1, 2, 6, 1)>

This operation ends up being quite useful in a number of distributed array cases.

Optimize Slicing to Minimize Communication

Dask.array slicing optimizations are now careful to produce graphs that avoid situations that could cause excess inter-worker communication. The details of how they do this is a bit out of scope for a short blogpost, but the history here is interesting.

Historically dask.arrays were used almost exclusively by researchers with large on-disk arrays stored as HDF5 or NetCDF files. These users primarily used the single machine multi-threaded scheduler. We heavily tailored Dask array optimizations to this situation and made that community pretty happy. Now as some of that community switches to cluster computing on larger datasets the optimization goals shift a bit. We have tons of distributed disk bandwidth but really want to avoid communicating large results between workers. Supporting both use cases is possible and I think that we’ve achieved that in this release so far, but it’s starting to require increasing levels of care.

Micro-optimizations

With distributed computing also comes larger graphs and a growing importance of graph-creation overhead. This has been optimized somewhat in this release. We expect this to be a focus going forward.

DataFrames

Set_index

Set_index is smarter in two ways:

  1. If you set_index on a column that happens to be sorted then we’ll identify that and avoid a costly shuffle. This was always possible with the sorted= keyword but users rarely used this feature. Now this is automatic (see the short sketch after this list).
  2. Similarly when setting the index we can look at the size of the data and determine if there are too many or too few partitions and rechunk the data while shuffling. This can significantly improve performance if there are too many partitions (a common case).
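Here is the sketch mentioned in the first point: a hedged illustration of the automatic sorted-column detection (the column name, sizes, and partition count are made up).

import pandas as pd
import dask.dataframe as dd

# A small frame whose 'timestamp' column happens to be sorted already
pdf = pd.DataFrame({'timestamp': pd.date_range('2017-01-01', periods=1000, freq='T'),
                    'value': range(1000)})
ddf = dd.from_pandas(pdf, npartitions=8)

# Previously you had to pass sorted=True yourself to skip the shuffle;
# now the sortedness should be detected automatically
ddf2 = ddf.set_index('timestamp')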

Shuffle performance

We’ve micro-optimized some parts of dataframe shuffles. Big thanks to the Pandas developers for the help here. This accelerates set_index, joins, groupby-applies, and so on.

Fastparquet

The fastparquet library has seen a lot of use lately and has undergone a number of community bugfixes.

Importantly, Fastparquet now supports Python 2.

We strongly recommend Parquet as the standard data storage format for Dask dataframes (and Pandas DataFrames).

dask/fastparquet #87

Distributed Scheduler

Replay remote exceptions

Debugging is hard in part because exceptions happen on remote machines where normal debugging tools like pdb can’t reach. Previously we were able to bring back the traceback and exception, but you couldn’t dive into the stack trace to investigate what went wrong:

def div(x, y):
    return x / y

>>> future = client.submit(div, 1, 0)
>>> future
<Future: status: error, key: div-4a34907f5384bcf9161498a635311aeb>

>>> future.result()  # getting result re-raises exception locally
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero

Now Dask can bring a failing task and all necessary data back to the local machine and rerun it so that users can leverage the normal Python debugging toolchain.

>>> client.recreate_error_locally(future)
<ipython-input-3-398a43a7781e> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
ZeroDivisionError: division by zero

Now if you’re in IPython or a Jupyter notebook you can use the %debug magic to jump into the stacktrace, investigate local variables, and so on.

In [8]: %debug
> <ipython-input-3-398a43a7781e>(2)div()
      1 def div(x, y):
----> 2     return x / y

ipdb> pp x
1
ipdb> pp y
0

dask/distributed #894

Async/await syntax

Dask.distributed uses Tornado for network communication and Tornado coroutines for concurrency. Normal users rarely interact with Tornado coroutines; they aren’t familiar to most people so we opted instead to copy the concurrent.futures API. However some complex situations are much easier to solve if you know a little bit of async programming.

Fortunately, the Python ecosystem seems to be embracing this change towards native async code with the async/await syntax in Python 3. In an effort to motivate people to learn async programming and to gently nudge them towards Python 3, Dask.distributed now supports async/await in a few cases.

You can wait on a dask Future

async def f():
    future = client.submit(func, *args, **kwargs)
    result = await future

You can put the as_completed iterator into an async for loop

async for future in as_completed(futures):
    result = await future
    ... do stuff with result ...

And, because Tornado supports the await protocol, you can also use the existing shadow concurrency API (everything prepended with an underscore) with await. (This was doable before.)

results = client.gather(futures)         # synchronous
...
results = await client._gather(futures)  # asynchronous

If you’re in Python 2 you can always do this with normal yield and the tornado.gen.coroutine decorator.
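For example, a rough sketch of the gather call above written for Python 2 (assuming the same client and futures objects as in the snippets before it):

from tornado import gen

@gen.coroutine
def gather_results():
    results = yield client._gather(futures)  # asynchronous, mirroring the await example
    raise gen.Return(results)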

dask/distributed #952

Inproc transport

In the last release we enabled Dask to communicate over more things than just TCP. In practice this doesn’t come up (TCP is pretty useful). However in this release we now support single-machine “clusters” where the clients, scheduler, and workers are all in the same process and transfer data cost-free over in-memory queues.

This allows the in-memory user community to use some of the more advanced features (asynchronous computation, spill-to-disk support, web-diagnostics) that are only available in the distributed scheduler.

This is on by default if you create a cluster with LocalCluster without using Nanny processes.

>>> from dask.distributed import LocalCluster, Client

>>> cluster = LocalCluster(nanny=False)

>>> client = Client(cluster)

>>> client
<Client: scheduler='inproc://192.168.1.115/8437/1' processes=1 cores=4>

>>> from threading import Lock         # Not serializable
>>> lock = Lock()                      # Won't survive going over a socket
>>> [future] = client.scatter([lock])  # Yet we can send to a worker
>>> future.result()                    # ... and back
<unlocked _thread.lock object at 0x7fb7f12d08a0>

dask/distributed #919

Connection pooling for inter-worker communications

Workers now maintain a pool of sustained connections between each other. This pool is of a fixed size and removes connections with a least-recently-used policy. It avoids re-connection delays when transferring data between workers. In practice this shaves off a millisecond or two from every communication.

This is actually a revival of an old feature that we had turned off last year when it became clear that the performance here wasn’t a problem.

Along with other enhancements, this takes our round-trip latency down to 11ms on my laptop.

In [10]: %%time
    ...: for i in range(1000):
    ...:     future = client.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 4.96 s, sys: 348 ms, total: 5.31 s
Wall time: 11.1 s

There may be room for improvement here though. For comparison here is the same test with the concurrent.futures.ProcessPoolExecutor.

In [14]: e = ProcessPoolExecutor(8)

In [15]: %%time
    ...: for i in range(1000):
    ...:     future = e.submit(inc, i)
    ...:     result = future.result()
    ...:
CPU times: user 320 ms, sys: 56 ms, total: 376 ms
Wall time: 442 ms

Also, just to be clear, this measures total roundtrip latency, not overhead. Dask’s distributed scheduler overhead remains in the low hundreds of microseconds.

dask/distributed #935

There has also been activity around Dask and machine learning.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.0 release on February 27th

The following people contributed to the dask/distributed repository since the 1.16.0 release on February 27th

March 23, 2017 12:00 AM

March 22, 2017


Tarek Ziade

Load Testing at Mozilla

After a stabilization phase, I am happy to announce that Molotov 1.0 has been released!

(Logo by Juan Pablo Bravo)

This release is an excellent opportunity to explain a little bit how we do load testing at Mozilla, and what we're planning to do in 2017 to improve the process.

I am talking here specifically about load testing our HTTP services, and when this blog post mentions what Mozilla is doing there, it refers mainly to the Mozilla QA team, helped with Services developers team that works on some of our web services.

What's Molotov?

Molotov is a simple load testing tool

Molotov is a minimalist load testing tool you can use to load test an HTTP API using Python. Molotov leverages Python 3.5+ asyncio and uses aiohttp to send some HTTP requests.

Writing load tests with Molotov is done by decorating asynchronous Python functions with the @scenario function:

from molotov import scenario

@scenario(100)
async def my_test(session):
    async with session.get('http://localhost:8080') as resp:
        assert resp.status == 200

When this script is executed with the molotov command, the my_test function is going to be repeatedly called to perform the load test.

Molotov tries to be as transparent as possible and just hands over session objects from the aiohttp.client module.
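As an illustration, here is a hedged sketch with two weighted scenarios; the endpoint, payload, and weights are invented, and the weights are assumed to be relative shares of the generated traffic.

from molotov import scenario

# Roughly 70% of calls hit the read scenario, 30% the write scenario
@scenario(70)
async def read_index(session):
    async with session.get('http://localhost:8080/') as resp:
        assert resp.status == 200

@scenario(30)
async def create_item(session):
    async with session.post('http://localhost:8080/items',
                            json={'name': 'demo'}) as resp:
        assert resp.status in (200, 201)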

The full documentation is here: http://molotov.readthedocs.io

Using Molotov is the first step to load test our services. From our laptops, we can run that script and hammer a service to make sure it can handle some minimal load.

What Molotov is not

Molotov is not a fully-featured load testing solution

Load testing applications usually come with high-level features to understand how the tested app is performing. Things like performance metrics are displayed when you run a test, like what Apache Bench does by displaying how many requests it was able to perform and their average response time.

But when you are testing web service stacks, the metrics you collect from each client attacking your service will include a lot of variation because of the network and the clients' CPU overhead. In other words, you cannot guarantee reproducibility from one test to the next, so you cannot precisely track how your app evolves over time.

Adding metrics directly in the tested application itself is much more reliable, and that's what we're doing these days at Mozilla.

That's also why I have not included any client-side metrics in Molotov, besides a very simple StatsD integration. When we run Molotov at Mozilla, we mostly watch our centralized metrics dashboards and see how the tested app behaves regarding CPU, RAM, Requests-Per-Second, etc.

Of course, running a load test from a laptop is less than ideal. We want to avoid the hassle of asking people to install Molotov and all the dependencies a test requires every time they want to load test a deployment, and then run everything from their desktop. Doing load tests occasionally from your laptop is fine, but it's not a sustainable process.

And even though a single laptop can generate a lot of load (in one project, we're generating around 30k requests per second from one laptop, and happily killing the service), we also want to do some distributed load testing.

We want to run Molotov from the cloud. And that's what we do, thanks to Docker and Loads.

Molotov & Docker

Since running the Molotov command mostly consists of using the right command-line options and passing a test script, we've added a second command-line utility to Molotov called moloslave.

Moloslave takes the URL of a git repository and will clone it and run the molotov test that's in it by reading a configuration file. The configuration file is a simple JSON file that needs to be at the root of the repo, like how you would do with Travis-CI or other tools.

See http://molotov.readthedocs.io/en/latest/slave

From there, running in Docker can be done with a generic image that has Molotov preinstalled and picks up the test by cloning a repo.

See http://molotov.readthedocs.io/en/latest/docker

Having Molotov running in Docker solves all the dependency issues you can have when you are running a Python app. We can specify all the requirements in the configuration file and have moloslave install them. The generic Docker image I have pushed to the Docker Hub is a standard Python 3 environment that works in most cases, but it's easy to create another Docker image when a very specific environment is required.

But the bottom line is that anyone from any OS can "docker run" a load test by simply passing the load test Git URL into an environment variable.

Molotov & Loads

Once you can run load tests using Docker images, you can use specialized Linux distributions like CoreOS to run them.

Thanks to boto, you can script the Amazon Cloud and deploy hundreds of CoreOS boxes and run Docker images in them.

That's what the Loads project is -- an orchestrator that will run hundreds of CoreOS EC2 instances to perform a massively distributed load test.

Someone who wants to run such a test passes a configuration to a Loads Broker running in the Amazon Cloud; the configuration says where to find the Docker image that runs the Molotov test and for how long the test needs to run.

That allows us to run hours-long tests without having to depend on a laptop to orchestrate it.

But the Loads orchestrator has been suffering from reliability issues. Sometimes, EC2 instances on AWS become unresponsive, and Loads no longer knows what's happening in a load test. We've suffered from that and had to write specific code to clean up boxes and avoid keeping hundreds of zombie instances sticking around.

But even with these issues, we're able to perform massive load tests distributed across hundreds of boxes.

Next Steps

At Mozilla, we are in the process of gradually switching all our load testing scripts to Molotov. Using a single tool everywhere will allow us to simplify the whole process that takes that script and performs a distributed load test.

I am also investigating improving metrics. One idea is to automatically collect all the metrics that are generated during a load test and push them to a specialized performance trend dashboard.

We're also looking at switching from Loads to Ardere. Ardere is a new project that aims at leveraging Amazon ECS. ECS is an orchestrator we can use to create and manage EC2 instances. We've tried ECS in the past, but it was not suited to run hundreds of boxes rapidly for a load test. But ECS has improved a lot, and we started a prototype that leverages it and it looks promising.

For everything related to our Load testing effort at Mozilla, you can look at https://github.com/loads/

And of course, everything is open source and open to contributions.

March 22, 2017 11:00 PM


DataCamp

SciPy Cheat Sheet: Linear Algebra in Python

By now, you will have already learned that NumPy, one of the fundamental packages for scientific computing, forms at least part of the foundation of other important packages that you might use for data manipulation and machine learning with Python. One of those packages is SciPy, another one of the core packages for scientific computing in Python that provides mathematical algorithms and convenience functions built on the NumPy extension of Python.

You might now wonder why this library might come in handy for data science. 

Well, SciPy has many modules that will help you to understand some of the basic components that you need to master when you're learning data science, namely math, stats and machine learning. You can find out what other things you need to tackle to learn data science here. You'll see that, for statistics for example, a module like scipy.stats will definitely be of interest to you.

The other topic that was mentioned was machine learning: here, the scipy.linalg and scipy.sparse modules will offer everything that you're looking for to understand machine learning concepts such as eigenvalues, regression, and matrix multiplication...

But, what is maybe the most obvious is that most machine learning techniques deal with high-dimensional data and that data is often represented as matrices. What's more, you'll need to understand how to manipulate these matrices.  

That is why DataCamp has made a SciPy cheat sheet that will help you to master linear algebra with Python. 

Take a look by clicking on the button below:

python scipy cheat sheet

You'll see that this SciPy cheat sheet covers the basics of linear algebra that you need to get started: it provides a brief explanation of what the library has to offer and how you can use it to interact with NumPy, and goes on to summarize topics in linear algebra, such as matrix creation, matrix functions, basic routines that you can perform with matrices, and matrix decompositions from scipy.linalg. Sparse matrices are also included, with their own routines, functions, and decompositions from the scipy.sparse module. 

(Above is the printable version of this cheat sheet)


Python for Data-Science Cheat Sheet: SciPy - Linear Algebra

SciPy

The SciPy library is one of the core packages for scientific computing that provides mathematical algorithms and convenience functions built on the NumPy extension of Python.

Asking For Help

>>> help(scipy.linalg.diagsvd)

Interacting With NumPy

>>> import numpy as np
>>> a = np.array([1,2,3])
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]])

Index Tricks

>>> np.mgrid[0:5,0:5] Create a dense meshgrid
>>> np.ogrid[0:2,0:2] Create an open meshgrid
>>> np.r_[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise)
>>> np.c_[b,c] Stack arrays column-wise

Shape Manipulation

>>> np.transpose(b) Permute array dimensions
>>> b.flatten() Flatten the array
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise)
>>> np.vstack((a,b)) Stack arrays vertically (row-wise)
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index
>>> np.vsplit(d,2) Split the array vertically at the 2nd index

Polynomials

>>> from numpy import poly1d
>>> p = poly1d([3,4,5]) Create a polynomial object

Vectorizing Functions

>>> def myfunc(a):
...     if a < 0:
...         return a*2
...     else:
...         return a/2
>>> np.vectorize(myfunc) Vectorize functions

Type Handling

>>> np.real(b) Return the real part of the array elements
>>> np.imag(b) Return the imaginary part of the array elements
>>> np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0
>>> np.cast['f'](np.pi) Cast object to a data type

Other Useful Functions

>>> np.angle(b,deg=True) Return the angle of the complex argument
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values (number of samples)
>>> g[3:] += np.pi
>>> np.unwrap(g) Unwrap phase angles
>>> np.logspace(0,10,3) Create an array of evenly spaced values (log scale)
>>> np.select([c<4],[c*2]) Return values from a list of arrays depending on conditions
>>> misc.factorial(a) Factorial
>>> misc.comb(10,3,exact=True) Combine N things taken k at a time
>>> misc.central_diff_weights(3) Weights for Np-point central derivative
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point

Linear Algebra

You'll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.

>>> from scipy import linalg, sparse

Creating Matrices

>>> A = np.matrix(np.random.random((2,2)))
>>> B = np.asmatrix(b)
>>> C = np.mat(np.random.random((10,5)))
>>> D = np.mat([[3,4], [5,6]])

Basic Matrix Routines

Inverse

>>> A.I Inverse
>>> linalg.inv(A) Inverse

Transposition

>>> A.T Transpose matrix
>>> A.H Conjugate transposition

Trace

>>> np.trace(A) Trace

Norm

>>> linalg.norm(A) Frobenius norm
>>> linalg.norm(A,1) L1 norm (max column sum)
>>> linalg.norm(A,np.inf) L inf norm (max row sum)

Rank

>>> np.linalg.matrix_rank(C) Matrix rank

Determinant

>>> linalg.det(A) Determinant

Solving linear problems

>>> linalg.solve(A,b) Solver for dense matrices
>>> E = np.mat(a).T Create a matrix from a and transpose it
>>> linalg.lstsq(F,E) Least-squares solution to linear matrix equation
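As a quick worked example (not part of the original cheat sheet), here is the dense solver on a tiny 2x2 system:

>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[3., 1.], [1., 2.]])
>>> b = np.array([9., 8.])
>>> x = linalg.solve(A, b)    # solves Ax = b directly
>>> x
array([ 2.,  3.])
>>> np.allclose(A.dot(x), b)  # verify the solution
True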

Generalized inverse

>>> linalg.pinv(C) Compute the pseudo-inverse of a matrix (least-squares solver)
>>> linalg.pinv2(C) Compute the pseudo-inverse of a matrix (SVD)

Creating Sparse Matrices

>>> F = np.eye(3, k=1) Create a 3x3 matrix with ones on the first superdiagonal
>>> G = np.mat(np.identity(2)) Create a 2x2 identity matrix
>>> C[C > 0.5] = 0
>>> H = sparse.csr_matrix(C) Compressed Sparse Row matrix
>>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix
>>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix
>>> E.todense() Sparse matrix to full matrix
>>> sparse.isspmatrix_csc(A) Identify sparse matrix

Sparse Matrix Routines

Inverse

>>> sparse.linalg.inv(I) Inverse

Norm

>>> sparse.linalg.norm(I) Norm

Solving linear problems

>>> sparse.linalg.spsolve(H,I) Solver for sparse matrices

Sparse Matrix Functions

>>> sparse.linalg.expm(I) Sparse matrix exponential

Matrix Functions

Addition

>>> np.add(A,D) Addition

Subtraction

>>> np.subtract(A,D) Subtraction

Division

>>> np.divide(A,D) Division

Multiplication

>>> A @ D Multiplication operator (Python 3)
>>> np.multiply(D,A) Multiplication
>>> np.dot(A,D) Dot product
>>> np.vdot(A,D) Vector dot product
>>> np.inner(A,D) Inner product
>>> np.outer(A,D) Outer product
>>> np.tensordot(A,D) Tensor dot product
>>> np.kron(A,D) Kronecker product

Exponential Functions

>>> linalg.expm(A) Matrix exponential
>>> linalg.expm2(A) Matrix exponential (eigenvalue decomposition)
>>> linalg.expm3(D) Matrix exponential (Taylor series)

Logarithm Function

>>> linalg.logm(A) Matrix logarithm

Trigonometric Functions

>>> linalg.sinm(D) Matrix sine
>>> linalg.cosm(D) Matrix cosine
>>> linalg.tanm(A) Matrix tangent

Hyperbolic Trigonometric Functions

>>> linalg.sinhm(D) Hyperbolic matrix sine
>>> linalg.coshm(D) Hyperbolic matrix cosine
>>> linalg.tanhm(A) Hyperbolic matrix tangent

Matrix Sign Function

>>> linalg.signm(A) Matrix sign function

Matrix Square Root

>>> linalg.sqrtm(A) Matrix square root

Arbitrary Functions

>>> linalg.funm(A, lambda x: x*x) Evaluate matrix function

Decompositions

Eigenvalues and Eigenvectors

>>> la, v = linalg.eig(A) Solve ordinary or generalized eigenvalue problem for square matrix
>>> l1, l2 = la Unpack eigenvalues
>>> v[:,0] First eigenvector
>>> v[:,1] Second eigenvector
>>> linalg.eigvals(A) Eigenvalues only

Singular Value Decomposition

>>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
>>> M,N = B.shape
>>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD

LU Decomposition

>>> P,L,U = linalg.lu(C) LU Decomposition

Sparse Matrix Decompositions

>>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
>>> sparse.linalg.svds(H, 2) SVD

Asking For Help

>>> np.info(np.matrix)

PS. Don't miss our other Python cheat sheets for data science that cover NumPy, Scikit-Learn, Bokeh, Pandas and the Python basics.

March 22, 2017 06:36 PM


Python Engineering at Microsoft

Interactive Windows in VS 2017

Last week we announced that the Python development workload is available now in Visual Studio Preview, and briefly covered some of the new improvements in Visual Studio 2017. In this post, we are going to go into more depth on the improvements for the Python Interactive Window.

These are currently available in Visual Studio Preview, and will become available in one of the next updates to the stable release. Over the lifetime of Visual Studio 2017 we will have opportunities to further improve these features, so please provide feedback and suggestions at our GitHub site.

Interactive Windows

People who have been using Visual Studio with many versions of Python installed will be used to seeing a long list of interactive windows – one for each version of Python. Selecting any one of these would let you run short snippets of code with that version and see the results immediately. However, because we only allowed one window for each, there was no way to open multiple windows for the same Python version and try different things.

Comparison between VS 2015 and VS 2017 interactive window menus

In Visual Studio 2017, the main menu has been simplified to only include a single entry. Selecting this entry (or using the Alt+I keyboard shortcut) will open an interactive window with some new toolbar items:

Python interactive window with toolbar elements highlighted

At the right hand side, you’ll see the new “Environment” dropdown. With this field, you can select any version of Python you have installed and the interactive window will switch to it. This will reset your current state (after prompting), but will keep your history and previous output.

Two Python interactive windows with plots

The button at the left hand side of the toolbar creates a new interactive window. Each window is independent of the others: they do not share any state, history or output, can use the same or different versions of Python, and may be arranged however you like. This flexibility allows you to try two different pieces of code in the same version of Python, viewing the results side by side, without having to repeatedly reset everything.

Code Cells

One workflow that we see people using very successfully is what we internally call the scratchpad. In this approach, you have a Python script that contains many little code snippets that you can copy-paste from the editor into an interactive window. Typically you don’t run the script in its entirety, as the code snippets may be completely unrelated. For example, you might have a “scratchpad.py” file with a range of your favorite matplotlib or Bokeh plot commands.

A scratchpad file alongside a Python interactive window

Previously, we provided a command to send selected text to an interactive window (press Ctrl+E twice) to easily copy code from an editor window. This command still exists, but has been enhanced in Visual Studio 2017 in the following ways.

We’ve added Ctrl+Enter as a new keyboard shortcut for this command, which will help more people use muscle memory that they may have developed in other tools. So if you are comfortable with pressing Ctrl+E twice, you can keep doing that, or you can switch to the more convenient Ctrl+Enter shortcut.

We have also made the shortcut work when you don’t have any code selected. (This is the complicated bit, but we’ve found that it makes the most sense when you start using it.) In the normal case, pressing Ctrl+Enter without a selection will send the current line of text and move the caret to the following line. We do not try and figure out whether it is a complete statement or not, so you might send an invalid statement, though in this case we won’t try and execute it straight away. As you send more lines of code (by pressing Ctrl+Enter repeatedly), we will build up a complete statement and then execute it.

For the scratchpad workflow though, you’ll typically have a small block of code you want to send all at once. To simplify this, we’ve added support for code cells. Adding a comment starting with #%% to your code will begin a code cell, and end the previous one. Code cells can be collapsed, and using Ctrl+Enter (or Ctrl+E twice) inside a code cell will send the entire cell to the interactive and move to the next one. So when sending a block of code, you can simply click anywhere inside the cell and press Ctrl+Enter to run the whole block.

We also detect code cells starting with comments like # In[1]:, which is the format you get when exporting a Jupyter notebook as a Python file. So you can easily execute a notebook from Azure Notebooks by downloading as a Python file, opening in Visual Studio, and using Ctrl+Enter to run each cell.
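For example, a hypothetical scratchpad file using both markers might look like this (the plotting cell assumes matplotlib is installed):

#%% Generate some sample data
import numpy as np
data = np.random.randn(1000)

#%% Plot a histogram of the data
import matplotlib.pyplot as plt
plt.hist(data, bins=30)
plt.show()

# In[1]:
# Cells exported from a Jupyter notebook are recognized as well
print(data.mean())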

Startup Scripts

Python Environments window with interactive commands highlighted

As you start using interactive windows more in your everyday workflow, you will likely develop helper functions that you use regularly. While you could write these into a code cell and run them each time you restart, there is a better way.

In the Python Environments window for each environment, there are buttons to open an interactive window and to explore interactive scripts, as well as a checkbox to enable IPython interactive mode. When the checkbox is selected, all interactive windows for that environment will start with IPython mode enabled. This will allow you to use inline plots, as well as the extended IPython syntax such as name? to view help and !command for shell commands. We recommend enabling IPython interactive mode when you have installed an Anaconda distribution, as it requires extra packages.

The “Explore interactive scripts” button will open a directory in File Explorer from your documents folder. You can put any Python scripts you like into this folder and they will be run every time you start an interactive window for that environment. For example, you may make a function that opens a DataFrame in Excel, and then save it to this folder so that you can use it in your interactive window.
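As a rough sketch of such a helper (the function name is invented, pandas and an Excel-capable writer are assumed to be installed, and os.startfile is Windows-only):

import os
import tempfile

def open_in_excel(df, name='preview'):
    """Write a pandas DataFrame to a temporary .xlsx file and open it in Excel."""
    path = os.path.join(tempfile.gettempdir(), name + '.xlsx')
    df.to_excel(path)
    os.startfile(path)  # opens the file with its associated application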

A Python function

Summary

Thanks for installing and using Python support in Visual Studio 2017. We hope that you’ll find these interactive window enhancements useful additions to your development workflow.

As usual, we are constantly looking to improve all of our features based on what our users need and value most. To report any issues or provide suggestions, feel free to post them at our GitHub site.

March 22, 2017 05:00 PM


NumFOCUS

nteract: Building on top of Jupyter (from a rich REPL toolkit to interactive notebooks)

[Image: Blueprint for nteract]
nteract builds upon the very successful foundations of Jupyter. I think of Jupyter as a brilliantly rich REPL toolkit. A typical REPL (Read-Eval-Print-Loop) is an interpreter that takes input from the user and prints results (on stdout and stderr).

Here’s the standard Python interpreter; a REPL many of us know and love.
[Image: Standard Python interpreter]
The standard terminal’s spartan user interface, while useful, leaves something to be desired. IPython was created in 2001 to refine the interpreter, primarily by extending display hooks in Python. Iterative improvement on the interpreter was a big boon for interactive computing experiences, especially in the sciences.
[Image: IPython terminal]
As the team behind IPython evolved, so did their ambitions to create richer consoles and notebooks. Core to this was crafting the building blocks of the protocol that were established on top of ZeroMQ, leading to the creation of the IPython notebook. It decoupled the REPL from a closed loop in one system to multiple components communicating together.
[Image: IP[y]thon Notebook]
As IPython came to embrace more than just Python (R, Julia, Node.js, Scala, …), the IPython leads created a home for the language agnostic parts: Jupyter.
[Image: Jupyter Notebook Classic Edition]
Jupyter isn’t just a notebook or a console.
It’s an establishment of well-defined protocols and formats. It’s a community of people who come together to build interactive computing experiences. We share our knowledge across the sciences, academia, and industry — there’s a lot of overlap in vision, goals, and priorities.

That being said, one project alone may not meet with everyone’s specific needs and workflows. Luckily, with strong support by Jupyter’s solid foundation of protocols to communicate with the interpreters (Jupyter kernels) and document formats (e.g. .ipynb), you too can build your ideal interactive computing environment.

In pursuit of this, members of the Jupyter community created nteract, a Jupyter notebook desktop application as well as an ecosystem of JavaScript packages to support it and more.

What is the platform that Jupyter provides to build rich interactive experiences?

To explore this, I will describe the Jupyter protocol with a lightweight (non-compliant) version of the protocol that hopefully helps explain how this works under the hood.
Also a lightweight Hello World (think print('Hello World')). When a user runs this code, a message is formed:
We send that message and receive replies as JSON:
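A sketch of what that exchange might look like, using simplified field names that are deliberately non-compliant with the real Jupyter message spec:

execute_request = {
    'msg_id': '0001',
    'msg_type': 'execute_request',
    'content': {'code': "print('Hello World')"},
}

replies = [
    {'msg_id': '0002', 'parent_id': '0001', 'msg_type': 'status',
     'content': {'execution_state': 'busy'}},
    {'msg_id': '0003', 'parent_id': '0001', 'msg_type': 'stream',
     'content': {'name': 'stdout', 'text': 'Hello World\n'}},
    {'msg_id': '0004', 'parent_id': '0001', 'msg_type': 'status',
     'content': {'execution_state': 'idle'}},
]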
We’ve received two types of messages so far:
  • execution status for the interpreter — busy or idle
  • a “stream” of stdout
The status tells us the interpreter is ready for more and the stream data is shown below the editor in the output area of a notebook.

What happens when a longer computation runs?

[Image: Sleepy time printing]
As multiple outputs come in, they get appended to the display area below the code editor.

How are tables, plots, and other rich media shown?

[Image: Yay for DataFrames!]
Let’s send that code over to see what comes back.
The power and simplicity of the protocol emerge when using the execute_result and display_data message types. They both have a data field with multiple media types for the frontend to choose how to represent. Pandas provides text/plain and text/html for tabular data:

text/plain

text/html

When the front-end receives the HTML payload, it embeds it directly in the outputs so you get a nice table:
[Image: DataFrame to Table]
This isn’t limited to HTML and text — we can handle images and any other known transform. The primary currency for display is these bundles of media types mapped to data. In nteract we have a few custom mime types, which are also coming soon to a Jupyter notebook near you!
[Image: GeoJSON in nteract]
[Image: Vega / Vega-lite via Altair (https://github.com/altair-viz/altair)]

How do you build a notebook document?

We’ve witnessed how our code gets sent across to the runtime and what we receive on the notebook side. How do we form a notebook? How do we associate messages to the cells they originated from?

We need an ID to identify where an execute_request comes from. Let’s bring in the concept of a message ID and form the cell state over time
We send the execute_request as message 0001
and initialize our state
Each message afterward lists the originating msg_id as parent_id 0001. Responses start flowing in, starting with message 0002
Which we can store as part of the state of our cell
Here comes the plain text output in message 0003
Which we fold into an outputs structure of our cell
Finally, we receive a status to inform us the kernel is no longer busy
Resulting in the final state of the cell
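Folding the replies from the earlier sketch into a cell might look roughly like this (again simplified, not the real .ipynb schema):

cell = {'source': "print('Hello World')", 'outputs': [], 'status': None}

for msg in replies:
    if msg['msg_type'] == 'stream':
        cell['outputs'].append(msg['content'])
    elif msg['msg_type'] == 'status':
        cell['status'] = msg['content']['execution_state']

# Final state:
# {'source': "print('Hello World')",
#  'outputs': [{'name': 'stdout', 'text': 'Hello World\n'}],
#  'status': 'idle'}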
That’s just one cell though — what would an entire notebook structure look like? One way of thinking about a notebook is that it’s a rolling work log of computations. A linear list of cells. Using the same format we’ve constructed above, here’s a lightweight notebook:
As well as the rendered version:
As Jupyter messages are sent back and forth, a notebook is formed. We use message IDs to route outputs to cells. Users run code, get results, and view representations of their data:
This very synchronous imperative description doesn’t give the Jupyter protocol (and ZeroMQ for that matter) enough credence. In reality, it’s a hyper reactor of asynchronous interactive feedback, enabling you to iterate quickly and explore the space of computation and visualization. These messages come in asynchronously, and there are a lot more messages available within the core protocol.
I encourage you to get involved in both the nteract and Jupyter projects. We have plenty to explore and build together. Feel free to reach out on issues or the nteract slack.
Thank you to Safia Abdalla, Lukas Geiger, Paul Ivanov, Peggy Rayzis, and Carol Willing for reviewing, providing feedback, and editing this post.

Thanks to Lukas Geiger, Peggy Rayzis, and Paul “π” Ivanov.

March 22, 2017 04:29 PM


Dataquest

Turbocharge Your Data Acquisition using the data.world Python Library

When working with data, a key part of your workflow is finding and importing data sets. Being able to quickly locate data, understand it and combine it with other sources can be difficult.

One tool to help with this is data.world, where you can search for, copy, analyze, and download data sets. In addition, you can upload your data to data.world and use it to collaborate with others.

In this tutorial, we’re going to show you how to use data.world’s Python library to easily work with data from your python scripts or Jupyter notebooks. You’ll need to create a free data.world account to view the data set and follow along.

The data.world python library allows you to bring data that’s stored in a data.world data set straight into your workflow, without having to first download the data locally and transform it into a format you require.
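As a minimal sketch (assuming the datadotworld package is installed and configured with an API token; the dataset identifier below is a placeholder):

import datadotworld as dw

# 'some-user/some-dataset' stands in for a real data.world dataset ID
dataset = dw.load_dataset('some-user/some-dataset')
print(dataset.dataframes)  # pandas DataFrames, keyed by resource name, loaded lazily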

Because data sets in data.world are stored in the format that the user originally uploaded them in, you often find great data sets that exist in a less-than-ideal format, such as multiple sheets of an Excel workbook, where...

March 22, 2017 03:00 PM


DataCamp

New Python Course: Network Analysis

Hi Pythonistas! Today we're launching Network Analysis in Python by Eric Ma!

From online social networks such as Facebook and Twitter to transportation networks such as bike sharing systems, networks are everywhere, and knowing how to analyze this type of data will open up a new world of possibilities for you as a Data Scientist. This course will equip you with the skills to analyze, visualize, and make sense of networks. You'll apply the concepts you learn to real-world network data using the powerful NetworkX library. With the knowledge gained in this course, you'll develop your network thinking skills and be able to start looking at your data with a fresh perspective!

Start for free

Python: Network Analysis features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master of network analysis in Python!

What you'll learn:

In the first chapter, you'll be introduced to fundamental concepts in network analytics while becoming acquainted with a real-world Twitter network dataset that you will explore throughout the course. In addition, you'll learn about NetworkX, a library that allows you to manipulate, analyze, and model graph data. You'll learn about different types of graphs as well as how to rationally visualize them. Start first chapter for free.

In chapter 2, you'll learn about ways of identifying nodes that are important in a network. In doing so, you'll be introduced to more advanced concepts in network analysis as well as learn the basics of path-finding algorithms. The chapter concludes with a deep dive into the Twitter network dataset which will reinforce the concepts you've learned, such as degree centrality and betweenness centrality.
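As a tiny taster (not taken from the course), NetworkX makes these centrality measures one-liners:

import networkx as nx

G = nx.Graph()
G.add_edges_from([('alice', 'bob'), ('alice', 'carol'),
                  ('bob', 'carol'), ('carol', 'dave')])

print(nx.degree_centrality(G))       # fraction of other nodes each node touches
print(nx.betweenness_centrality(G))  # how often a node sits on shortest paths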

Chapter 3 is all about finding interesting structures within network data. You'll learn about essential concepts such as cliques, communities, and subgraphs, which will leverage all of the skills you acquired in Chapter 2. By the end of this chapter, you'll be ready to apply the concepts you've learned to a real-world case study.

In the final chapter of the course, you'll consolidate everything you've learned by diving into an in-depth case study of GitHub collaborator network data. This is a great example of real-world social network data, and your newly acquired skills will be fully tested. By the end of this chapter, you'll have developed your very own recommendation system which suggests GitHub users who should collaborate together. Enjoy!

Start course for free

March 22, 2017 01:14 PM


PyBites

Best Practices for Compatible Python 2 and 3 Code

95% of the most popular Python packages support Python 3. Maybe you are lucky and get to start fresh using Python 3. However, as of last year Python 2.7 still reigns supreme in pip installs, and at a lot of places 2.x is the only version you get to work in. I think writing Python 2 and 3 compatible code is an important skill, so let's check what it entails.
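As a small, hedged preview of the kind of idioms involved (the article goes into much more detail):

# __future__ imports give Python 3 semantics on Python 2
from __future__ import absolute_import, division, print_function

print(3 / 2)  # 1.5 on both 2.7 and 3.x thanks to the division import

# moved standard-library modules are handled with a try/except import
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse        # Python 2

print(urlparse('https://pybit.es/').netloc)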

March 22, 2017 11:42 AM


PyCharm

Why Postgres Should be your Document Database Webinar Recording

This Monday Jim Fulton, one of the first Python contributors, hosted a webinar about storing JSONB documents in PostgreSQL. Watch it now:

Known mostly for its mature SQL and data-at-scale infrastructure, the PostgreSQL project added a “JSONB” column type in its 9.4 release, then refined it over the next two releases. While using it is straightforward, combining it in hybrid structured/unstructured applications along with other facilities in the database can require skill.

In this webinar, Python and database consultant Jim Fulton shows us how to use JSONB and related machinery for pure and hybrid Python document-oriented applications. We also briefly discuss his long history back to the start of Python, and finish with his unique NewtDB library for native Python objects coupled to JSONB queries.

Jim uses PyCharm Professional during the webinar. PyCharm Professional bundles the database tools from JetBrains DataGrip, our database IDE. However, the webinar itself is focused on the concepts of JSONB.

You can find Jim’s code on GitHub: https://github.com/jimfulton/pycharm-170320

If you have any questions or comments about the webinar, feel free to leave them in the comments below, or you can reach us on Twitter. Jim is on Twitter as well, his Twitter handle is @j1mfulton.

-PyCharm Team
The Drive to Develop

March 22, 2017 11:19 AM


Simple is Better Than Complex

Ask Vitor #2: How to dynamically filter ModelChoice's queryset in a ModelForm?

Michał Strumecki asks:

I just want to filter the select field in a form for the currently logged-in user. Every user has their own categories and budgets. I want to display only the models related to the currently logged-in user. I’ve tried filtering before the is_valid call, but with no result.


Answer

This is a very common use case when dealing with ModelForms. The problem is that the form fields ModelChoiceField and ModelMultipleChoiceField, which are used respectively for the model fields ForeignKey and ManyToManyField, default their queryset to Model.objects.all().

If the filtering was static, you could simply pass a filtered queryset in the form definition, like Model.objects.filter(status='pending').

When the filtering parameter is dynamic, we need to do a few tweaks in the form to get the right queryset.

Let’s simplify the scenario a little bit. We have the Django User model, a Category model and Product model. Now let’s say it’s a multi-user application. And each user can only see the products they create, and naturally only use the categories they own.

models.py

from django.contrib.auth.models import User
from django.db import models

class Category(models.Model):
    name = models.CharField(max_length=30)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

class Product(models.Model):
    name = models.CharField(max_length=30)
    price = models.DecimalField(decimal_places=2, max_digits=10)
    category = models.ForeignKey(Category)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

Here is how we can create a ModelForm for the Product model, using only the currently logged-in user:

forms.py

from django import forms
from .models import Category, Product

class ProductForm(forms.ModelForm):
    class Meta:
        model = Product
        fields = ('name', 'price', 'category', )

    def __init__(self, user, *args, **kwargs):
        super(ProductForm, self).__init__(*args, **kwargs)
        self.fields['category'].queryset = Category.objects.filter(user=user)

That means now the ProductForm has a mandatory parameter in its constructor. So, instead of initializing the form as form = ProductForm(), you need to pass a user instance: form = ProductForm(user).

Here is a working example of view handling this form:

views.py

from django.contrib.auth.decorators import login_required
from django.shortcuts import render, redirect

from .forms import ProductForm

@login_required
def new_product(request):
    if request.method == 'POST':
        form = ProductForm(request.user, request.POST)
        if form.is_valid():
            product = form.save(commit=False)
            product.user = request.user
            product.save()
            return redirect('products_list')
    else:
        form = ProductForm(request.user)
    return render(request, 'products/product_form.html', {'form': form})

Using ModelFormSet

The machinery behind the modelformset_factory is not very flexible, so we can’t add extra parameters in the form constructor. But we certainly can play with the available resources.

The difference here is that we will need to change the queryset on the fly.

Here is what we can do:

views.py

from django.contrib.auth.decorators import login_required
from django.forms import modelformset_factory
from django.shortcuts import render, redirect

from .models import Category, Product

@login_required
def edit_all_products(request):
    ProductFormSet = modelformset_factory(Product, fields=('name', 'price', 'category'), extra=0)
    data = request.POST or None
    formset = ProductFormSet(data=data, queryset=Product.objects.filter(user=request.user))
    for form in formset:
        form.fields['category'].queryset = Category.objects.filter(user=request.user)

    if request.method == 'POST' and formset.is_valid():
        formset.save()
        return redirect('products_list')

    return render(request, 'products/products_formset.html', {'formset': formset})

The idea here is to provide a screen where the user can edit all his products at once. The product form involves handling a list of categories. So for each form in the formset, we need to override the queryset with the proper list of values.

ProductFormSet

The big difference here is that each category dropdown is filtered to the categories owned by the logged-in user.


Get the Code

I prepared a very detailed example you can explore to get more insights.

The code is available on GitHub: github.com/sibtc/askvitor.

March 22, 2017 11:08 AM


David MacIver

How and why to learn about data structures

There’s a common sentiment that 99% of programmers don’t need to know how to build basic data structures, and that it’s stupid to expect them to.

There’s certainly an element of truth to that. Most jobs don’t require knowing how to implement any data structure at all, so a lot of this sentiment is just backlash against using them as part of the interview process. I agree with that backlash. Don’t use data structures as part of your interview process unless you expect the job to routinely involve writing your own data structures (or working on ones somebody has already written). Bad interviewer. No cookie.

But setting aside the interview question, there is still a strong underlying sentiment of this not actually being that useful a thing to spend your time on. After all, you wouldn’t ever implement a hash table when there’s a great one in the standard library, right?

This is like arguing that you don’t need to learn to cook because you can just go out to restaurants.

A second, related, point of view is that if you needed to know how this worked you’d just look it up.

That is, you don’t need to learn how to invent your own recipes because you can just look it up in a cook book.

In principle both of these arguments are fine. There are restaurants, there are cook books, not everybody needs to know how to cook and they certainly don’t need to become a gourmet chef.

But nevertheless, most people’s lives will be improved by acquiring at least a basic facility in the kitchen. Restaurants are expensive and may be inconvenient. You run out of ingredients and can’t be bothered to go to the store so you need to improvise or substitute. Or you’re just feeling creative and want to try something new for the hell of it.

The analogy breaks down a bit, because everybody needs to eat but most people don’t really need to implement custom data structures. It’s not 99%, but it might be 90%. Certainly it’s more than 50%.

But “most” isn’t “all”, and there’s a lot of grey areas at the boundary. If you’re not already sure you need to know this, you can probably get on fine without learning how to implement your own data structures, but you might find it surprisingly useful to do so anyway. Even if you don’t, there are some indirect benefits.

I’m not using this analogy just to make a point about usefulness, I also think it’s a valuable way of looking at it: Data structures are recipes. You have a set of techniques and tools and features, and you put them together in an appropriate way to achieve the result you want.

I think a lot of the problem is that data structures are not usually taught this way. I may be wrong about this – I’ve never formally taken a data structures course because my academic background is maths, not computer science, but it sure doesn’t look like people are being taught this way based on the books I’ve read and the people I’ve talked to.

Instead people are taught “Here’s how you implement an AVL tree. It’s very clever” (confession: I have no idea how you implement an AVL tree. If I needed to know I’d look it up, right?). It’s as if you were going to cookery school and they were taking you through a series of pages from the recipe book and teaching you how to follow them.

Which is not all bad! Learning some recipes is a great way to learn to cook. But some of that is because you already know how to eat food, so you’ve got a good idea what you’re trying to achieve. It’s also not sufficient in its own right – you need to learn to adapt, to combine the things you’ve already seen and apply the basic skills you’ve learned to solve new constraints or achieve new results.

Which is how I would like data structures to be taught. Not “Here is how to implement this named data structure” but “Here is the set of operations I would like to support, with these sorts of complexities as valid. Give me some ideas.”

Because this is the real use of learning to implement data structures: Sometimes the problem you’re given doesn’t match the set of data structures you have in the standard library or any of the standard ones. Maybe you need to support some silly combination of operations that you don’t normally do, or you have an unusual workload where some operations are very uncommon and so you don’t mind paying some extra cost there but some operations are very common so need to be ultra fast.

At that point, knowing the basic skills of data structure design becomes invaluable, because you can take what you’ve learned and put it together in a novel way that supports what you want.

And with that, I’ll finish by teaching you a little bit about data structures.

First let's start with a simple problem: Given a list of N items, I want to sample from them without replacement. How would I do that with an O(N) initialisation and O(1) sample?

Well, it’s easy: You create a copy of the list as an array. Now when you want to sample, you pick an index into the array at random.

That index gives you the value to return. Then replace the value at that index with the value that’s at the end of the array, and reduce the array length by one.

Here’s some python:

def sample(ls, random):
    i = random.randint(0, len(ls) - 1)  # pick a random index
    result = ls[i]                      # the value to return
    ls[i] = ls[-1]                      # move the last element into the gap
    ls.pop()                            # shrink the list by one
    return result
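A quick check of the recipe (not part of the original post) shows that repeated calls draw each element exactly once:

import random

items = list(range(5))
rng = random.Random(0)
drawn = [sample(items, rng) for _ in range(5)]
print(sorted(drawn))  # [0, 1, 2, 3, 4] -- every element sampled exactly once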

Now that I’ve given you a recipe to build on, let's see you improve upon it!

  1. If you assume the list you are given is immutable and you can hang onto it, can you improve the initialisation to O(1)? (you may need to make sampling only O(1) amortised and/or expected time to do this. Feel free to build on other standard data structures rather than inventing them from scratch).
  2. How would I extend that data structure to also support a “Remove the smallest element” operation in O(log(n))? (You may wish to read about how binary heaps work). You’ll probably have to go back to O(n) initialisation, but can you avoid that if you assume the input list is already sorted?
  3. How would you create a data structure to support weighted sampling with rejection? i.e. you start with a list of pairs of values and weights, and each value is sampled with probability proportionate to its weight. You may need to make sample O(log(n)) to do this (you can do it in expected O(1) time, but I don’t know of a data structure that does so without quite a lot of complexity). You can assume the weights are integers and/or just ignore questions of numerical stability.
  4. How would you add an operation to give a key selected uniformly at random to a hash table? (If you haven’t read about how pypy dicts work you may wish to read that first)
  5. How would you extend a hash table to add an O(log(n)) “remove and return the smallest key” operation with no additional storage but increasing the insert complexity to O(log(n))? Can you do it without adding any extra storage to the hash table?

These aren’t completely arbitrary examples. Some of them are ones I’ve actually needed recently, others are just applications of the tricks I figured out in the course of doing so. I do recommend working through them in order though, because each will give you hints for how to do later ones.

You may never need any of these combinations, but that doesn’t matter. The point is not that these represent some great innovations in data structures. The point is to learn how to make your own data structures so that when you need to you can.

If you want to learn more, I recommend just playing around with this yourself. Try to come up with odd problems to solve that could be solved with a good data structure. It can also be worth learning about existing ones – e.g. reading about how the standard library in your favourite language implements things. What are the tricks and variations that it uses?

If you’d like to take a more formal course that is structured like this, I’m told Tim Roughgarden’s Coursera specialization on algorithms follows this model, and the second course in it will cover the basics of data structures. I’ve never taken it though, so this is a second hand recommendation. (Thanks @pozorvlak for the recommendation).

(And if you want to learn more things like this by reading more about it from me, support me on Patreon and say so! Nine out of ten cats prefer it, and you’ll get access to drafts of upcoming blog posts)

March 22, 2017 08:31 AM


Wingware Blog

Wing Python IDE Product Line Changes

Wing 6 makes Wing Personal free, streamlines the process for applying for free Wing Pro licenses, and introduces an annual licensing option.

March 22, 2017 01:00 AM