Planet Python

Last update: November 17, 2017 09:47 PM

November 17, 2017


Weekly Python Chat

pip: installing Python libraries

This week we're talking about pip: the Python package manager. We'll talk about installing packages with pip and discovering packages that suit your needs.
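
For anyone new to pip, the basic commands look like this (generic examples, not taken from the chat itself):

pip install requests            # install a package from PyPI
pip install --upgrade requests  # upgrade it to the latest release
pip list                        # show which packages are installed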

November 17, 2017 08:30 PM


NumFOCUS

Theano and the Future of PyMC

This is a guest post by Christopher Fonnesbeck of PyMC3. — PyMC, now in its third iteration as PyMC3, is a project whose goal is to provide a performant, flexible, and friendly Python interface to Bayesian inference.  The project relies heavily on Theano, a deep learning framework, which has just announced that development will not be […]

November 17, 2017 05:25 PM

November 16, 2017


Mike Driscoll

wxPython: Moving items in ObjectListView

I was recently asked how to implement drag-and-drop of items in a wx.ListCtrl or in ObjectListView. Unfortunately, neither control has this built in, although I did find an article on the wxPython wiki that demonstrated one way to do drag-and-drop of the items in a ListCtrl.

However, I did think that adding some buttons to move items around in an ObjectListView widget should be fairly easy to implement, so that's what this article will focus on.


Changing Item Order

If you don’t have wxPython and ObjectListView installed, then you will want to use pip to install them:

pip install wxPython objectlistview

Once that is done, open up your favorite text editor or IDE and enter the following code:

import wx
from ObjectListView import ObjectListView, ColumnDefn
 
 
class Book(object):
    """
    Model of the Book object
    Contains the following attributes:
    'ISBN', 'Author', 'Manufacturer', 'Title'
    """
 
    def __init__(self, title, author, isbn, mfg):
        self.isbn = isbn
        self.author = author
        self.mfg = mfg
        self.title = title
 
    def __repr__(self):
        return "<Book: {title}>".format(title=self.title)
 
 
class MainPanel(wx.Panel):
 
    def __init__(self, parent):
        wx.Panel.__init__(self, parent=parent, id=wx.ID_ANY)
        self.current_selection = None
        self.products = [Book("wxPython in Action", "Robin Dunn",
                              "1932394621", "Manning"),
                         Book("Hello World", "Warren and Carter Sande",
                              "1933988495", "Manning"),
                         Book("Core Python Programming", "Wesley Chun",
                             "0132269937", "Prentice Hall"),
                         Book("Python Programming for the Absolute Beginner",
                              "Michael Dawson", "1598631128",
                              "Course Technology"),
                         Book("Learning Python", "Mark Lutz",
                              "0596513984", "O'Reilly")
                         ]
 
        self.dataOlv = ObjectListView(self, wx.ID_ANY, 
                                      style=wx.LC_REPORT|wx.SUNKEN_BORDER)
        self.setBooks()
 
        # Allow the cell values to be edited with a single click
        self.dataOlv.cellEditMode = ObjectListView.CELLEDIT_SINGLECLICK
 
        # create up and down buttons
        up_btn = wx.Button(self, wx.ID_ANY, "Up")
        up_btn.Bind(wx.EVT_BUTTON, self.move_up)
 
        down_btn = wx.Button(self, wx.ID_ANY, "Down")
        down_btn.Bind(wx.EVT_BUTTON, self.move_down)
 
        # Create some sizers
        mainSizer = wx.BoxSizer(wx.VERTICAL)
 
        mainSizer.Add(self.dataOlv, 1, wx.ALL|wx.EXPAND, 5)
        mainSizer.Add(up_btn, 0, wx.ALL|wx.CENTER, 5)
        mainSizer.Add(down_btn, 0, wx.ALL|wx.CENTER, 5)
        self.SetSizer(mainSizer)
 
    def move_up(self, event):
        """
        Move an item up the list
        """        
        self.current_selection = self.dataOlv.GetSelectedObject()
        data = self.dataOlv.GetObjects()
        if self.current_selection:
            index = data.index(self.current_selection)
            if index > 0:
                new_index = index - 1
            else:
                new_index = len(data)-1
            data.insert(new_index, data.pop(index))
            self.products = data
            self.setBooks()
            self.dataOlv.Select(new_index)
 
    def move_down(self, event):
        """
        Move an item down the list
        """
        self.current_selection = self.dataOlv.GetSelectedObject()
        data = self.dataOlv.GetObjects()
        if self.current_selection:
            index = data.index(self.current_selection)
            if index < len(data) - 1:
                new_index = index + 1
            else:
                new_index = 0
            data.insert(new_index, data.pop(index))
            self.products = data
            self.setBooks()
            self.dataOlv.Select(new_index)
 
    def setBooks(self):
        self.dataOlv.SetColumns([
            ColumnDefn("Title", "left", 220, "title"),
            ColumnDefn("Author", "left", 200, "author"),
            ColumnDefn("ISBN", "right", 100, "isbn"),
            ColumnDefn("Mfg", "left", 180, "mfg")
        ])
 
        self.dataOlv.SetObjects(self.products)
 
 
class MainFrame(wx.Frame):
    def __init__(self):
        wx.Frame.__init__(self, parent=None, id=wx.ID_ANY, 
                          title="ObjectListView Demo", size=(800,600))
        panel = MainPanel(self)
        self.Show()
 
 
if __name__ == "__main__":
    app = wx.App(False)
    frame = MainFrame()
    app.MainLoop()

The code we care about most in this example is in the move_up() and move_down() methods. Each of these methods checks whether you have an item selected in the ObjectListView widget and grabs the widget's current contents via GetObjects(). If you have an item selected, it looks up that item's index in the data returned by GetObjects(). We can then increment (move_down) or decrement (move_up) that index, depending on which button was pressed.

After we update the list with the changed positions, we update self.products, the instance variable that setBooks() uses to refresh our ObjectListView widget. Finally, we call setBooks() and reset the selection, since our originally selected item has moved.


Wrapping Up

I thought this was a neat little project that didn't take very long to put together. I will note that there is at least one issue with this implementation: it doesn't work correctly when you select multiple items in the control. You could probably fix this by disabling multiple selection in your ObjectListView widget (see the sketch below) or by figuring out the logic to make it work with multiple selections. I will leave that up to the reader to figure out. Have fun and happy coding!
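
If you want to try the single-selection route, a minimal sketch (my own untested addition, not part of the original article) is to add wx.LC_SINGLE_SEL to the style flags when creating the widget in MainPanel:

        self.dataOlv = ObjectListView(self, wx.ID_ANY,
                                      style=wx.LC_REPORT|wx.SUNKEN_BORDER|wx.LC_SINGLE_SEL)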

November 16, 2017 06:15 PM


Django Weblog

DSF calls for applicants for a Django Fellow

After three years of full-time work as the Django Fellow, I'd like to scale back my involvement to part-time. That means it's time to hire another Fellow who would like to work on Django 20-40 hours per week. The position is ongoing - the successful applicant will have the position until they choose to step down.

The position of Fellow is primarily focused on housekeeping and community support - you'll be expected to do the work that would benefit from constant, guaranteed attention rather than volunteer-only efforts. In particular, your duties will include:

Being a committer isn't a prerequisite for this position; we'll consider applications from anyone with a proven history of working with either the Django community or another similar open-source community.

Your geographical location isn't important either - we have several methods of remote communication and coordination that we can use depending on the timezone difference to the supervising members of Django.

You'll be expected to post a weekly report of your work to the django-developers mailing list.

If you don't perform the duties to a satisfactory level, we may end your contract. We may also terminate the contract if we're unable to raise sufficient funds to support the Fellowship on an ongoing basis (unlikely, given the current fundraising levels).

Compensation isn't competitive with full-time salaries in big cities like San Francisco or London. The Fellow will be selected to make best use of available funds.

If you're interested in applying for the position, please email us with details of your experience with Django and open-source contribution and community support in general, the amount of time each week you'd like to dedicate to the position (a minimum of 20 hours a week), your hourly rate, and when you'd like to start working. The start date is flexible and will be on or after January 1, 2018.

Applications will be open until 1200 UTC, December 18, 2017, with the expectation that the successful candidate will be announced around December 22.

Successful applicants will not be employees of the Django Project or the Django Software Foundation. Fellows will be contractors, expected to ensure that they meet all of their resident country's criteria for self-employment or having a shell consulting company, to invoice the DSF on a monthly basis, and to pay all relevant taxes.

If you or your company is interested in helping fund this program and future DSF activities, please consider becoming a corporate member to learn about corporate membership, or you can make a donation to the Django Software Foundation.

November 16, 2017 02:18 PM


Red Hat Developers

Speed up your Python using Rust

What is Rust?

Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.

Featuring

Description is taken from rust-lang.org.

Why does it matter for a Python developer?

The best description of Rust I have heard came from Elias (a member of the Rust Brazil Telegram group):

Rust is a language that allows you to build high level abstractions, but without giving up low-level control – that is, control of how data is represented in memory, control of which threading model you want to use, etc.
Rust is a language that can usually detect, during compilation, the worst parallelism and memory management errors (such as accessing data on different threads without synchronization, or using data after it has been deallocated), but gives you an escape hatch in case you really know what you're doing.
Rust is a language that, because it has no runtime, can be used to integrate with any runtime; you can write a native extension in Rust that is called by a Node.js program, by a Python program, by a program in Ruby, Lua, etc. and, conversely, you can script a program in Rust using these languages. — “Elias Gabriel Amaral da Silva”

There are a bunch of Rust packages out there to help you extend Python with Rust.

I can mention Milksnake, created by Armin Ronacher (the creator of Flask), and also PyO3, the Rust bindings for the Python interpreter.

See a complete reference list at the bottom of this article.

Let’s see it in action

For this post, I am going to use rust-cpython; it's the only one I have tested, it is compatible with the stable version of Rust, and I found it straightforward to use.

NOTE: PyO3 is a fork of rust-cpython that comes with many improvements, but it works only with the nightly version of Rust, so I preferred to use the stable version for this post. The examples here should also work with PyO3.

Pros: It is easy to write Rust functions and import them from Python, and as you will see in the benchmarks, it is worth it in terms of performance.

Cons: Distributing your project/lib/framework will require the Rust module to be compiled on the target system because of variations in environment and architecture, so there will be a compilation stage that you don't have when installing pure Python libraries. You can make this easier using rust-setuptools or using MilkSnake to embed binary data in Python wheels.

Python is sometimes slow

Yes, Python is known for being “slow” in some cases and the good news is that this doesn’t really matter depending on your project goals and priorities. For most projects, this detail will not be very important.

However, you may face the rare case where a single function or module is taking too much time and is detected as the bottleneck of your project's performance; this often happens with string parsing and image processing.

Example

Let's say you have a Python function which does some string processing. Take the following easy example of counting pairs of repeated chars, but keep in mind that this example can be reproduced with other string processing functions or any other generally slow process in Python.

# How many subsequent-repeated group of chars are in the given string? 
abCCdeFFghiJJklmnopqRRstuVVxyZZ... {millions of chars here}
  1   2    3        4    5   6

Python is slow for doing large string processing, so you can use pytest-benchmark to compare a Pure Python (with Iterator Zipping) function versus a Regexp implementation.

# Using a Python3.6 environment
$ pip3 install pytest pytest-benchmark

Then write a new Python program called doubles.py

import re
import string
import random

# Python ZIP version
def count_doubles(val):
    total = 0
    # there is an improved version later on this post
    for c1, c2 in zip(val, val[1:]):
        if c1 == c2:
            total += 1
    return total


# Python REGEXP version
double_re = re.compile(r'(?=(.)\1)')

def count_doubles_regex(val):
    return len(double_re.findall(val))


# Benchmark it
# generate 1M of random letters to test it
val = ''.join(random.choice(string.ascii_letters) for i in range(1000000))

def test_pure_python(benchmark):
    benchmark(count_doubles, val)

def test_regex(benchmark):
    benchmark(count_doubles_regex, val)

Run pytest to compare:

$ pytest doubles.py                                                                                                           
=============================================================================
platform linux -- Python 3.6.0, pytest-3.2.3, py-1.4.34, pluggy-0.4.
benchmark: 3.1.1 (defaults: timer=time.perf_counter disable_gc=False min_roun
rootdir: /Projects/rustpy, inifile:
plugins: benchmark-3.1.1
collected 2 items

doubles.py ..


-----------------------------------------------------------------------------
Name (time in ms)         Min                Max               Mean          
-----------------------------------------------------------------------------
test_regex            24.6824 (1.0)      32.3960 (1.0)      27.0167 (1.0)    
test_pure_python      51.4964 (2.09)     62.5680 (1.93)     52.8334 (1.96)   
-----------------------------------------------------------------------------

Let's take the Mean for comparison: about 27 ms for the regex version versus about 53 ms for the pure Python version.

Extending Python with Rust

Create a new crate

A crate is what Rust packages are called.

Have Rust installed (the recommended way is https://www.rustup.rs/); Rust is also available in the Fedora and RHEL repositories via the rust-toolset.

I used rustc 1.21.0

In the same folder run:

cargo new pyext-myrustlib

This creates a new Rust project called pyext-myrustlib in that same folder, containing Cargo.toml (cargo is the Rust package manager) and also src/lib.rs (where we write our library implementation).

Edit Cargo.toml

It will use the rust-cpython crate as a dependency and tell cargo to generate a dylib to be imported from Python.

[package]
name = "pyext-myrustlib"
version = "0.1.0"
authors = ["Bruno Rocha <rochacbruno@gmail.com>"]

[lib]
name = "myrustlib"
crate-type = ["dylib"]

[dependencies.cpython]
version = "0.1"
features = ["extension-module"]

Edit src/lib.rs

What we need to do:

  1. Import all macros from cpython crate.
  2. Take Python and PyResult types from CPython into our lib scope.
  3. Write the count_doubles function implementation in Rust; note that this is very similar to the Pure Python version except for:
    • It takes a Python as its first argument, which is a reference to the Python interpreter and allows Rust to use the Python GIL.
    • Receives a &str typed val as reference.
    • Returns a PyResult, a type that allows Python exceptions to be raised.
    • Returns a PyResult object as Ok(total) (Result is an enum type that represents either success (Ok) or failure (Err)), and as our function is expected to return a PyResult the compiler will take care of wrapping our Ok in that type (note that our PyResult expects a u64 as the return value).
  4. Using the py_module_initializer! macro we register new attributes on the lib, including __doc__, and we also add the count_doubles attribute referencing our Rust implementation of the function.
    • Pay attention to the names libmyrustlib, initlibmyrustlib, and PyInit_myrustlib.
    • We also use the try! macro, which is the equivalent of Python's try..except.
    • Return Ok(()) – The () is an empty result tuple, the equivalent of None in Python.
#[macro_use]
extern crate cpython;

use cpython::{Python, PyResult};

fn count_doubles(_py: Python, val: &str) -> PyResult<u64> {
    let mut total = 0u64;

    // There is an improved version later on this post
    for (c1, c2) in val.chars().zip(val.chars().skip(1)) {
        if c1 == c2 {
            total += 1;
        }
    }

    Ok(total)
}

py_module_initializer!(libmyrustlib, initlibmyrustlib, PyInit_myrustlib, |py, m | {
    try!(m.add(py, "__doc__", "This module is implemented in Rust"));
    try!(m.add(py, "count_doubles", py_fn!(py, count_doubles(val: &str))));
    Ok(())
});

Now let’s build it with cargo

$ cargo build --release
    Finished release [optimized] target(s) in 0.0 secs

$ ls -la target/release/libmyrustlib*
target/release/libmyrustlib.d
target/release/libmyrustlib.so*  <-- Our dylib is here

Now let’s copy the generated .so lib to the same folder where our doubles.py is located.

NOTE: on Fedora you will get a .so; on other systems you may get a .dylib, which you can rename by changing its extension to .so.

$ cd ..
$ ls
doubles.py pyext-myrustlib/

$ cp pyext-myrustlib/target/release/libmyrustlib.so myrustlib.so

$ ls
doubles.py myrustlib.so pyext-myrustlib/

Having myrustlib.so in the same folder, or added to your Python path, allows it to be imported directly and transparently, as if it were a regular Python module.
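
As a quick sanity check (assuming the build and copy above succeeded), you can try it from a Python shell:

>>> import myrustlib
>>> myrustlib.count_doubles("abCCdeFF")
2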

 

Importing from Python and comparing the results

Edit your doubles.py now importing our Rust implemented version and adding a benchmark for it.

import re
import string
import random
import myrustlib   #  <-- Import the Rust implemented module (myrustlib.so)


def count_doubles(val):
    """Count repeated pair of chars ins a string"""
    total = 0
    for c1, c2 in zip(val, val[1:]):
        if c1 == c2:
            total += 1
    return total


double_re = re.compile(r'(?=(.)\1)')


def count_doubles_regex(val):
    return len(double_re.findall(val))


val = ''.join(random.choice(string.ascii_letters) for i in range(1000000))


def test_pure_python(benchmark):
    benchmark(count_doubles, val)


def test_regex(benchmark):
    benchmark(count_doubles_regex, val)


def test_rust(benchmark):   #  <-- Benchmark the Rust version
    benchmark(myrustlib.count_doubles, val)

Benchmark

$ pytest doubles.py
==============================================================================
platform linux -- Python 3.6.0, pytest-3.2.3, py-1.4.34, pluggy-0.4.
benchmark: 3.1.1 (defaults: timer=time.perf_counter disable_gc=False min_round
rootdir: /Projects/rustpy, inifile:
plugins: benchmark-3.1.1
collected 3 items

doubles.py ...


-----------------------------------------------------------------------------
Name (time in ms)         Min                Max               Mean          
-----------------------------------------------------------------------------
test_rust              2.5555 (1.0)       2.9296 (1.0)       2.6085 (1.0)    
test_regex            25.6049 (10.02)    27.2190 (9.29)     25.8876 (9.92)   
test_pure_python      52.9428 (20.72)    56.3666 (19.24)    53.9732 (20.69)  
-----------------------------------------------------------------------------

Let's take the Mean for comparison:

The Rust implementation can be 10x faster than the Python regex version and 21x faster than the pure Python version.

Interestingly, the regex version is only 2x faster than pure Python 🙂

NOTE: These numbers make sense only for this particular scenario; for other cases the comparison may be different.

Updates and Improvements

After this article was published I got some comments on r/python and also on r/rust.

The contributions came as Pull Requests, and you can send a new one if you think the functions can be improved.

Thanks to Josh Stone, we got a better Rust implementation which iterates the string only once, and also the Python equivalent.

Thanks to Purple Pixie, we got a Python implementation using itertools; however, this version does not perform any better and still needs improvements.

Iterating only once

fn count_doubles_once(_py: Python, val: &str) -> PyResult<u64> {
    let mut total = 0u64;

    let mut chars = val.chars();
    if let Some(mut c1) = chars.next() {
        for c2 in chars {
            if c1 == c2 {
                total += 1;
            }
            c1 = c2;
        }
    }

    Ok(total)
}

And the Python equivalent:

def count_doubles_once(val):
    total = 0
    chars = iter(val)
    c1 = next(chars)
    for c2 in chars:
        if c1 == c2:
            total += 1
        c1 = c2
    return total

Python with itertools

import itertools

def count_doubles_itertools(val):
    c1s, c2s = itertools.tee(val)
    next(c2s, None)
    total = 0
    for c1, c2 in zip(c1s, c2s):
        if c1 == c2:
            total += 1
    return total

Why not C/C++/Nim/Go/Lua/PyPy/{other language}?

Ok, that is not the purpose of this post. This post was never about comparing Rust with other languages; it is specifically about how to use Rust to extend and speed up Python. Doing that means you have a good reason to choose Rust over another language, be it the ecosystem, the safety and tooling, following the hype, or simply because you like Rust; whatever the reason, this post is here to show how to use it with Python.

I (personally) would say that Rust is more future-proof, as it is new and there are lots of improvements to come, and also because of its ecosystem, tooling, and community, and because I feel comfortable with Rust's syntax; I really like it!

So, as expected, people started bringing up other languages and it became a sort of benchmark, and I think that is cool!

So, as part of my request for improvements, some people on Hacker News also sent ideas; martinxyz sent an implementation using C and SWIG that performed very well.

C Code (swig boilerplate omitted)

uint64_t count_byte_doubles(char * str) {
  uint64_t count = 0;
  while (str[0] && str[1]) {
    if (str[0] == str[1]) count++;
    str++;
  }
  return count;
}

And our fellow Red Hatter Josh Stone improved the Rust implementation again by replacing chars with bytes so it is a fair competition with C as C is comparing bytes instead of Unicode chars.

fn count_doubles_once_bytes(_py: Python, val: &str) -> PyResult<u64> {
    let mut total = 0u64;

    let mut chars = val.bytes();
    if let Some(mut c1) = chars.next() {
        for c2 in chars {
            if c1 == c2 {
                total += 1;
            }
            c1 = c2;
        }
    }

    Ok(total)
}

There were also ideas to compare a Python list comprehension and numpy, so I included them here.

Numpy:

import numpy as np

def count_double_numpy(val):
    ng = np.fromstring(val, dtype=np.byte)
    return np.sum(ng[:-1] == ng[1:])

List comprehension

def count_doubles_comprehension(val):
    return sum(1 for c1, c2 in zip(val, val[1:]) if c1 == c2)

The complete test case is on repository test_all.py file.

New Results

NOTE: Keep in mind that the comparison was done in the same environment; the results may differ if run in a different environment, using another compiler and/or different tags.

-------------------------------------------------------------------------------------------------
Name (time in us)                     Min                    Max                   Mean          
-------------------------------------------------------------------------------------------------
test_rust_bytes_once             476.7920 (1.0)         830.5610 (1.0)         486.6116 (1.0)    
test_c_swig_bytes_once           795.3460 (1.67)      1,504.3380 (1.81)        827.3898 (1.70)   
test_rust_once                   985.9520 (2.07)      1,483.8120 (1.79)      1,017.4251 (2.09)   
test_numpy                     1,001.3880 (2.10)      2,461.1200 (2.96)      1,274.8132 (2.62)   
test_rust                      2,555.0810 (5.36)      3,066.0430 (3.69)      2,609.7403 (5.36)   
test_regex                    24,787.0670 (51.99)    26,513.1520 (31.92)    25,333.8143 (52.06)  
test_pure_python_once         36,447.0790 (76.44)    48,596.5340 (58.51)    38,074.5863 (78.24)  
test_python_comprehension     49,166.0560 (103.12)   50,832.1220 (61.20)    49,699.2122 (102.13) 
test_pure_python              49,586.3750 (104.00)   50,697.3780 (61.04)    50,148.6596 (103.06) 
test_itertools                56,762.8920 (119.05)   69,660.0200 (83.87)    58,402.9442 (120.02) 
-------------------------------------------------------------------------------------------------

NOTE: If you want to propose changes or improvements send a PR here: https://github.com/rochacbruno/rust-python-example/

Conclusion

Back to the purpose of this post “How to Speed Up your Python with Rust” we started with:

– Pure Python function taking 102 ms.
– Improved with Numpy (which is implemented in C) to take 3 ms.
– Ended with Rust taking 1 ms.

In this example Rust performed 100x faster than our Pure Python.

Rust will not magically save you; you must know the language to be able to implement a clever solution, and once implemented in the right way it is worth as much as C in terms of performance, while also coming with amazing tooling, ecosystem, community, and safety bonuses.

Rust may not yet be the general-purpose language of choice because of its level of complexity, and it may not yet be the best choice for writing common, simple applications such as web sites and test automation scripts.

However, for specific parts of a project where Python is known to be the bottleneck and your natural choice would be implementing a C/C++ extension, writing that extension in Rust seems easy and easier to maintain.

There are still many improvements to come in Rust and lots of other crates offering Python <--> Rust integration. Even if you are not including the language in your tool belt right now, it is really worth keeping an eye on its future!

References

The code snippets for the examples shown here are available in the GitHub repo: https://github.com/rochacbruno/rust-python-example.

The examples in this publication are inspired by the Extending Python with Rust talk by Samuel Cormier-Iijima at PyCon Canada; video here: https://www.youtube.com/watch?v=-ylbuEzkG4M.

Also by My Python is a little Rust-y by Dan Callahan at PyCon Montreal; video here: https://www.youtube.com/watch?v=3CwJ0MH-4MA.

Other references:

Join Community:

Join the Rust community; you can find group links at https://www.rust-lang.org/en-US/community.html.

If you speak Portuguese, I recommend joining https://t.me/rustlangbr, and there is also http://bit.ly/canalrustbr on YouTube.

Author

Bruno Rocha

More info: http://about.me/rochacbruno and http://brunorocha.org



The post Speed up your Python using Rust appeared first on RHD Blog.

November 16, 2017 11:00 AM


Davy Wybiral

Write a Feed Reader in Python

I just started a new video tutorial series. This time it'll cover the entire process of writing an RSS feed reader in Python from start to finish using the feedparser module, flask, and SQLAlchemy. Expect to see about 3-4 new videos a week until this thing is finished!

Click to watch
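
If you have never used feedparser, the starting point such a project builds on looks roughly like this (a minimal sketch with a placeholder URL, not code from the videos):

import feedparser

# Parse an RSS/Atom feed and print each entry's title and link
feed = feedparser.parse("https://example.com/feed.xml")
for entry in feed.entries:
    print(entry.title, entry.link)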

November 16, 2017 10:23 AM


Python Bytes

#52 Call your APIs with uplink and test them in the tavern

November 16, 2017 08:00 AM


Kushal Das

PyConf Hyderabad 2017

At the beginning of October, I attended a new PyCon in India, PyConf Hyderabad (no worries, they are working on the name for next year). I was super excited about this conference, the main reason being the chance to meet more Python developers from India. We are a large country, and we certainly need more local conferences :)

We reached the conference hotel a day before the event started, along with Py. The first day of the conference was workshop day; we reached the venue on time to say hi to everyone and to meet the conference team and many old friends. It was good to see that folks traveled from all across the country to volunteer for the conference. Of course, we had a certain number of dgplug folks there :)

On the conference day, Anwesha and I set up the PSF booth and talked to the attendees. During the lightning talk session, Sayan and Anwesha introduced PyCon Pune, and they also opened up the registration during the lightning talks :). I attended Chandan Kumar’s talk about his journey into upstream projects. I have to admit that I feel proud to see all the work he has done.

Btw, I forgot to mention that lunch at PyConf Hyderabad was the best conference food ever. They had some amazing biryani :).

The last talk of the day was my keynote titled Free Software movement & current days. Anwesha and I wrote an article on the history of Free Software a few months back, and the talk was based on that. This was also the first time I spoke about the Freedom of the Press Foundation (I attended my first conference as an FPF staff member).

The team behind the conference did some amazing groundwork to make this conference happen. It was a good opportunity to meet the community and make new friends.

November 16, 2017 04:25 AM


Wallaroo Labs

Non-native event-driven windowing in Wallaroo

Certain applications lend themselves to pure parallel computation better than others. In some cases we need to apply certain algorithms over a “window” of our data. This means that after we have completed a certain amount of processing (be it time, number of messages, or some other arbitrary metric), we want to perform a special action on the data in that window. An example application of this could be producing stats for log files over a certain period of time.
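
Wallaroo's own API is not shown in this excerpt, but the idea of a count-based window can be sketched in plain Python (an illustration of the concept, not Wallaroo code):

def windowed(messages, window_size=100):
    """Group a stream of messages into fixed-size windows and yield
    a summary (here just the message count) for each completed window."""
    window = []
    for msg in messages:
        window.append(msg)
        if len(window) == window_size:
            yield len(window), window
            window = []
    if window:  # emit the final, partially filled window
        yield len(window), window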

November 16, 2017 12:00 AM

November 15, 2017


Django Weblog

Django 2.0 release candidate 1 released

Django 2.0 release candidate 1 is the final opportunity for you to try out the assortment of new features before Django 2.0 is released.

The release candidate stage marks the string freeze and the call for translators to submit translations. Provided no major bugs are discovered that can't be solved in the next two weeks, Django 2.0 will be released on or around December 1. Any delays will be communicated on the django-developers mailing list thread.

Please use this opportunity to help find and fix bugs (which should be reported to the issue tracker). You can grab a copy of the package from our downloads page or on PyPI.

The PGP key ID used for this release is Tim Graham: 1E8ABDC773EDE252.

November 15, 2017 11:54 PM


PyCharm

PyCharm 2017.3 EAP 10

This week’s early access program (EAP) version of PyCharm is now available from our website:

Get PyCharm 2017.3 EAP 10

The release is getting close, and we’re just polishing out the last small issues until it’s ready.

Improvements in This Version

If these features sound interesting to you, try them yourself:

Get PyCharm 2017.3 EAP 10

If you are using a recent version of Ubuntu (16.04 and later) you can also install PyCharm EAP versions using snap:

sudo snap install [pycharm-professional | pycharm-community] --classic --edge

If you already used snap for the previous version, you can update using:

sudo snap refresh [pycharm-professional | pycharm-community] --classic --edge

As a reminder, PyCharm EAP versions:

If you run into any issues with this version, or another version of PyCharm, please let us know on our YouTrack. If you have other suggestions or remarks, you can reach us on Twitter, or by commenting on the blog.

November 15, 2017 05:47 PM


Codementor

Onicescu correlation coefficient-Python - Alexandru Daia

Implementing a new correlation method based on kinetic energies that I am researching.

November 15, 2017 05:41 PM


Ian Ozsvald

PyDataBudapest and “Machine Learning Libraries You’d Wish You’d Known About”

I’m back at BudapestBI and this year it has its first PyDataBudapest track. Budapest is fun! I’ve had a second iteration talking on a slightly updated “Machine Learning Libraries You’d Wish You’d Known About” (updated from PyDataCardiff two weeks back). When I was here to give an opening keynote talk two years back the conference was a bit smaller, it has grown by +100 folk since then. There’s also a stronger emphasis on open source R and Python tools. As before, the quality of the members here is high – the conversations are great!

During my talk I used my Explaining Regression Predictions Notebook to cover:

Nick’s photo of me on stage

Some audience members asked about co-linearity detection and explanation. Whilst I don’t have a good answer for identifying these relationships, I’ve added a seaborn pairplot, a correlation plot and the Pandas Profiling tool to the Notebook which help to show these effects.
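
For reference, a pairplot and a correlation heatmap take only a few lines of seaborn (an illustrative sketch on a made-up DataFrame, not the Boston data from the talk):

import pandas as pd
import seaborn as sns

# Small numeric DataFrame; "rooms" and "area" are deliberately correlated
df = pd.DataFrame({
    "rooms": [4, 5, 6, 7, 8],
    "area": [40, 55, 72, 80, 95],
    "age": [30, 12, 45, 7, 20],
})

sns.pairplot(df)                    # pairwise scatter plots
sns.heatmap(df.corr(), annot=True)  # correlation matrix as a heatmap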

Although it is complicated, I’m still pretty happy with this ELI5 plot that’s explaining feature contributions to a set of cheap-to-expensive houses from the Boston dataset:

Boston ELI5

I’m planning to do some training on these sort of topics next year, join my training list if that might be of use.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

November 15, 2017 03:50 PM


EuroPython Society

EuroPython 2018: Location and Dates

After a two month RFP bidding process with 19 venues from all over Europe, we are pleased to announce our selection of the location and venue for EuroPython 2018:

… yes, this is just one week before the famous Edinburgh Festival Fringe, so you can extend your stay a little longer if you like.

Based on the feedback we collected in the last few years, we have switched to a more compact conference layout for 2018:

More information will be available as we progress with the organization.

PS: We are now entering contract negotiations, so the above dates are highly likely, but we cannot confirm 100% yet.

Enjoy,

EuroPython Society

November 15, 2017 03:43 PM


Tryton News

Foundation board renewal 2017

The 2017 foundation board renewal process has finished. We are happy to announce that the new board is composed of:

  • Axel Braun from Germany
  • Jonathan Levy from the United States of America
  • Korbinian Preisler from Germany
  • Nicolas Évrard from Belgium
  • Pedro Luciano Rossi from Argentina
  • Sebastián Marró from Argentina
  • Sergi Almacellas Abellana from Spain

Congratulations to Nicolas Évrard as he became the second president of the Tryton foundation board.

We nearly reached the website redesign goal of our budget for 2017. You can help to make it happen by making a donation.

November 15, 2017 09:00 AM


Talk Python to Me

#138 Anvil: All web, all Python

Have you noticed that web development is kind of hard? If you've been doing it for a long time, this is easy to forget. It probably sounds easy enough

November 15, 2017 08:00 AM


NumFOCUS

Quantopian commits to fund pandas as a new NumFOCUS Corporate Partner

NumFOCUS welcomes Quantopian as our first Emerging Leader Corporate Partner, a partnership for small but growing companies who are leading by providing fiscal support to our open source projects. — Quantopian Supports Open Source by John Fawcett, CEO and founder of Quantopian       It all started with a single tweet. While scrolling through […]

November 15, 2017 12:03 AM

November 14, 2017


Mike Driscoll

Book Review: Python Testing with pytest

A couple of months ago, Brian Okken asked me if I would be interested in reading his book, Python Testing with pytest. I have been interested in learning more about the pytest package for a while, so I agreed to take a look. I also liked that the publisher was The Pragmatic Programmers, which I’ve had good experience with in the past. We will start with a quick review and then dive into the play-by-play.


Quick Review

  • Why I picked it up: The author of the book asked me to read his book
  • Why I finished it: I mostly skimmed the book to see how it was written and to check out the examples
  • I’d give it to: Anyone who is interested in testing in Python and especially in the pytest package

  • Book Formats

    You can get this as a physical softcover, as a Kindle book on Amazon, or in various other eBook formats via The Pragmatic Programmers’ website.


    Book Contents

    This book has 7 chapters, 5 appendices and is 222 pages long.


    Full Review

    This book jumps right in by starting off with an example in chapter 1. I actually found this a bit jarring as usually chapters have an introduction at the beginning that goes over what the chapter will be about. But chapter one just jumps right in with an example of a test in Python. It’s not bad, just different. This chapter explains how to get started using pytest and covers some of the common command line options you can pass to pytest.

    Chapter two goes into writing test functions with pytest. It also talks about how pytest uses Python’s assert keyword rather than assert methods like Python’s unittest library does. I found that to be an appealing feature of pytest. You will also learn how to skip tests and how to mark tests that we expect will fail as well as a few other things.
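
    For example (not code from the book, just to illustrate the difference):

    import unittest

    # pytest style: a plain function using Python's assert keyword
    def test_addition():
        assert 1 + 1 == 2

    # unittest style: a TestCase subclass with assert* methods
    class TestAddition(unittest.TestCase):
        def test_addition(self):
            self.assertEqual(1 + 1, 2)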

    If you’ve wondered about fixtures in pytest, then you will be excited to know that this book has two chapters on the topic; specifically chapters three and four. These chapters cover a lot of material, so I will just mention the highlights. You will learn about creating fixtures for setup and teardown, how to trace fixture execution, fixture scope, parameterized fixtures and builtin fixtures.
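
    To give a flavor of what a simple fixture looks like (my own sketch, not an example taken from the book):

    import pytest

    @pytest.fixture
    def sample_books():
        books = ["wxPython in Action", "Learning Python"]  # setup
        yield books
        books.clear()  # teardown runs after the test finishes

    def test_sample_books(sample_books):
        assert len(sample_books) == 2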

    Chapter five is about how to add plugins to pytest. You will also learn how to write your own plugins, how to install them and how to test your plugins. You will also get a good foundation in how to use conftest.py in this chapter.

    In chapter six, we learn all about configuring pytest. The big topics covered in this chapter deal with pytest.ini, conftest.py and __init__.py as well as what you might use setup.cfg for. There are a lot of interesting topics in this chapter as well such as registering markers or changing test discovery locations. I encourage you to take a look at the book’s table of contents to learn more though!
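
    As one small taste of that kind of configuration (again my own sketch, not an example from the book), a custom marker can be registered from conftest.py:

    # conftest.py
    def pytest_configure(config):
        # Register a custom marker so `pytest --markers` lists it
        config.addinivalue_line("markers", "slow: marks a test as slow to run")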

    Finally, in chapter 7 (the last chapter), we learn about using pytest with other testing tools. In this case, the book covers pdb, coverage.py, mock, tox, Jenkins, and even unittest.

    The rest of the book is a series of five appendices and an index. The appendices cover virtual environments, pip, a plugin sampler, packaging / distributing Python projects and xUnit fixtures.

    I thought this book was well written and stayed on topic well. The examples are short and to the point. I am looking forward to diving more deeply into the book when I want to use the pytest package for my own code. I would recommend this book to anyone who is interested in the pytest package.

    Python Testing with pytest

    by Brian Okken

    Amazon, Book Website


    Other Book Reviews

    November 14, 2017 06:15 PM


    Stack Abuse

    Using Machine Learning to Predict the Weather: Part 1

    Part 1: Collecting Data From Weather Underground

    This is the first article of a multi-part series on using Python and Machine Learning to build models to predict weather temperatures based on data collected from Weather Underground. The series will consist of three different articles describing the major aspects of a Machine Learning project. The topics to be covered are:

    1. Data collection and processing
    2. Linear regression models
    3. Neural network models

    The data used in this series will be collected from Weather Underground's free tier API web service. I will be using the requests library to interact with the API to pull in weather data since 2015 for the city of Lincoln, Nebraska. Once collected, the data will need to be processed and aggregated into a format that is suitable for data analysis, and then cleaned.

    The second article will focus on analyzing the trends in the data with the goal of selecting appropriate features for building a Linear Regression model using the statsmodels and scikit-learn Python libraries. I will discuss the importance of understanding the assumptions necessary for using a Linear Regression model and demonstrate how to evaluate the features to build a robust model. This article will conclude with a discussion of Linear Regression model testing and validation.

    The final article will focus on using Neural Networks. I will compare the process of building a Neural Network model and interpreting the results, as well as the overall accuracy, between the Linear Regression model built in the prior article and the Neural Network model.

    Getting Familiar with Weather Underground

    Weather Underground is a company that collects and distributes data on various weather measurements around the globe. The company provides a swath of APIs that are available for both commercial and non-commercial uses. In this article, I will describe how to programmatically pull daily weather data from Weather Underground using their free tier of service available for non-commercial purposes.

    If you would like to follow along with the tutorial you will want to sign up for their free developer account here. This account provides an API key to access the web service at a rate of 10 requests per minute and up to a total of 500 requests in a day.

    Weather Underground provides many different web service APIs to access data from, but the one we will be concerned with is their history API. The history API provides a summary of various weather measurements for a city and state on a specific day.

    The format of the request for the history API resource is as follows:

    http://api.wunderground.com/api/API_KEY/history_YYYYMMDD/q/STATE/CITY.json  
    

    Making Requests to the API

    To make requests to the Weather Underground history API and process the returned data I will make use of a few standard libraries as well as some popular third party libraries. Below is a table of the libraries I will be using and their description. For installation instructions please refer to the listed documentation.

    Library     | Description of Usage                                | Source
    ------------|-----------------------------------------------------|--------------------
    datetime    | Used to increment our requests by day               | Standard Library
    time        | Used to delay requests to stay under 10 per minute  | Standard Library
    collections | Use namedtuples for structured collection of data   | Standard Library
    pandas      | Used to process, organize and clean the data        | Third Party Library
    requests    | Used to make networked requests to the API          | Third Party Library
    matplotlib  | Used for graphical analysis                         | Third Party Library

    Let us get started by importing these libraries:

    from datetime import datetime, timedelta  
    import time  
    from collections import namedtuple  
    import pandas as pd  
    import requests  
    import matplotlib.pyplot as plt  
    

    Now I will define a couple of constants representing my API_KEY and the BASE_URL of the API endpoint I will be requesting. Note you will need to sign up for an account with Weather Underground and receive your own API_KEY. By the time this article is published I will have deactivated this one.

    BASE_URL is a string with two placeholders represented by curly brackets. The first {} will be filled by the API_KEY and the second {} will be replaced by a string formatted date. Both values will be interpolated into the BASE_URL string using the str.format(...) function.

    API_KEY = '7052ad35e3c73564'  
    BASE_URL = "http://api.wunderground.com/api/{}/history_{}/q/NE/Lincoln.json"  
    

    Next I will initialize the target date to the first day of the year in 2015. Then I will specify the features that I would like to parse from the responses returned from the API. The features are simply the keys present in the history -> dailysummary portion of the JSON response. Those features are used to define a namedtuple called DailySummary which I'll use to organize the individual request's data in a list of DailySummary tuples.

    target_date = datetime(2015, 1, 1)  
    features = ["date", "meantempm", "meandewptm", "meanpressurem", "maxhumidity", "minhumidity", "maxtempm",  
                "mintempm", "maxdewptm", "mindewptm", "maxpressurem", "minpressurem", "precipm"]
    DailySummary = namedtuple("DailySummary", features)  
    

    In this section I will be making the actual requests to the API and collecting the successful responses using the function defined below. This function takes the parameters url, api_key, target_date and days.

    def extract_weather_data(url, api_key, target_date, days):  
        records = []
        for _ in range(days):
            request = url.format(api_key, target_date.strftime('%Y%m%d'))
            response = requests.get(request)
            if response.status_code == 200:
                data = response.json()['history']['dailysummary'][0]
                records.append(DailySummary(
                    date=target_date,
                    meantempm=data['meantempm'],
                    meandewptm=data['meandewptm'],
                    meanpressurem=data['meanpressurem'],
                    maxhumidity=data['maxhumidity'],
                    minhumidity=data['minhumidity'],
                    maxtempm=data['maxtempm'],
                    mintempm=data['mintempm'],
                    maxdewptm=data['maxdewptm'],
                    mindewptm=data['mindewptm'],
                    maxpressurem=data['maxpressurem'],
                    minpressurem=data['minpressurem'],
                    precipm=data['precipm']))
            time.sleep(6)
            target_date += timedelta(days=1)
        return records
    

    I start by defining a list called records which will hold the parsed data as DailySummary namedtuples. The for loop is defined so that it iterates for the number of days passed to the function.

    Then the request is formatted using the str.format() function to interpolate the api_key and the string formatted target_date object. Once formatted, the request variable is passed to the get() method of the requests module and the response is assigned to a variable called response.

    With the response returned, I want to make sure the request was successful by evaluating that the HTTP status code is equal to 200. If it is successful, then I parse the response's body into JSON using the json() method of the returned response object. Chained to the same json() method call, I select the history and dailysummary keys, then grab the first item in the dailysummary list and assign that to a variable named data.

    Now that I have the dict-like data structure referenced by the data variable I can select the desired fields and instantiate a new instance of the DailySummary namedtuple which is appended to the records list.

    Finally, each iteration of the loop concludes by calling the sleep method of the time module to pause the loop's execution for six seconds, guaranteeing that no more than 10 requests are made per minute, keeping us within Weather Underground's limits.

    Then the target_date is incremented by 1 day using the timedelta object of the datetime module so the next iteration of the loop retrieves the daily summary for the following day.

    The First Batch of Requests

    Without further delay, I will kick off the first set of requests, up to the maximum daily allotment of 500 under the free developer account. Then I suggest you grab a refill of your coffee (or other preferred beverage) and get caught up on your favorite TV show, because the function will take at least an hour depending on network latency. With this we have maxed out our requests for the day, and this is only about half the data we will be working with.

    So, come back tomorrow where we will finish out the last batch of requests then we can start working on processing and formatting the data in a manner suitable for our Machine Learning project.

    records = extract_weather_data(BASE_URL, API_KEY, target_date, 500)  
    

    Finishing up the Data Retrieval

    Ok, now that it is a new day we have a clean slate and up to 500 requests that can be made to the Weather Underground history API. Our batch of 500 requests issued yesterday began on January 1st, 2015 and ended on May 15th, 2016 (assuming you didn't have any failed requests). Once again let us kick off another batch of 500 requests, but don't go leaving me for the day this time, because once this last chunk of data is collected we are going to begin formatting it into a Pandas DataFrame and deriving potentially useful features.

    # if you closed our terminal or Jupyter Notebook, reinitialize your imports and
    # variables first and remember to set your target_date to datetime(2016, 5, 16)
    records += extract_weather_data(BASE_URL, API_KEY, target_date, 500)  
    

    Setting up our Pandas DataFrame

    Now that I have a nice and sizable records list of DailySummary named tuples I will use it to build out a Pandas DataFrame. The Pandas DataFrame is a very useful data structure for many programming tasks and is most popularly known for cleaning and processing data to be used in machine learning projects (or experiments).

    I will utilize the Pandas.DataFrame(...) class constructor to instantiate a DataFrame object. The parameters passed to the constructor are records, which provides the data for the DataFrame, and the features list (the same one used to define the DailySummary namedtuple), which specifies the columns of the DataFrame. The set_index() method is chained to the DataFrame instantiation to specify date as the index.

    df = pd.DataFrame(records, columns=features).set_index('date')  
    

    Deriving the Features

    Machine learning projects, also referred to as experiments, often have a few characteristics that are a bit oxymoronic. By this I mean that it is quite helpful to have subject matter knowledge in the area under investigation to aid in selecting meaningful features to investigate, paired with thoughtful assumptions about likely patterns in the data.

    However, I have also seen highly influential explanatory variables and patterns arise out of having almost naive, or at least very open and minimal, presuppositions about the data. Having the knowledge-based intuition to know where to look for potentially useful features and patterns, as well as the ability to look for unforeseen idiosyncrasies in an unbiased manner, is an extremely important part of a successful analytics project.

    In this regard, we have selected quite a few features while parsing the returned daily summary data to be used in our study. However, I fully expect that many of these will prove to be either uninformative in predicting weather temperatures or inappropriate candidates depending on the type of model being used, but the crux is that you simply do not know until you rigorously investigate the data.

    Now I can't say that I have significant knowledge of meteorology or weather prediction models, but I did do a minimal search of prior work on using Machine Learning to predict weather temperatures. As it turns out there are quite a few research articles on the topic, and in 2016 Holmstrom, Liu, and Vo described using Linear Regression to do just that. In their article, Machine Learning Applied to Weather Forecasting, they used weather data from the prior two days for the following measurements.

    I will be expanding upon their list of features using the ones listed below, and instead of only using the prior two days I will be going back three days.

    So next up is to figure out a way to include these new features as columns in our DataFrame. To do so I will make a smaller subset of the current DataFrame to make it easier to work with while developing an algorithm to create these features. I will make a tmp DataFrame consisting of just 10 records and the features meantempm and meandewptm.

    tmp = df[['meantempm', 'meandewptm']].head(10)  
    tmp  
    
    date meantempm meandewptm
    2015-01-01 -6 -12
    2015-01-02 -6 -9
    2015-01-03 -4 -11
    2015-01-04 -14 -19
    2015-01-05 -9 -14
    2015-01-06 -10 -15
    2015-01-07 -16 -22
    2015-01-08 -7 -12
    2015-01-09 -11 -19
    2015-01-10 -6 -12

    Let us break down what we hope to accomplish, and then translate that into code. For each day (row) and for a given feature (column) I would like to find the value for that feature N days prior. For each value of N (1-3 in our case) I want to make a new column for that feature representing the Nth prior day's measurement.

    # 1 day prior
    N = 1
    
    # target measurement of mean temperature
    feature = 'meantempm'
    
    # total number of rows
    rows = tmp.shape[0]
    
    # a list representing Nth prior measurements of feature
    # notice that the front of the list needs to be padded with N
    # None values to maintain a consistent row length for each N
    nth_prior_measurements = [None]*N + [tmp[feature][i-N] for i in range(N, rows)]
    
    # make a new column name of feature_N and add to DataFrame
    col_name = "{}_{}".format(feature, N)  
    tmp[col_name] = nth_prior_measurements  
    tmp  
    
    date meantempm meandewptm meantempm_1
    2015-01-01 -6 -12 None
    2015-01-02 -6 -9 -6
    2015-01-03 -4 -11 -6
    2015-01-04 -14 -19 -4
    2015-01-05 -9 -14 -14
    2015-01-06 -10 -15 -9
    2015-01-07 -16 -22 -10
    2015-01-08 -7 -12 -16
    2015-01-09 -11 -19 -7
    2015-01-10 -6 -12 -11

    Ok so it appears we have the basic steps required to make our new features. Now I will wrap these steps up into a reusable function and put it to work building out all the desired features.

    def derive_nth_day_feature(df, feature, N):  
        rows = df.shape[0]
        nth_prior_measurements = [None]*N + [df[feature][i-N] for i in range(N, rows)]
        col_name = "{}_{}".format(feature, N)
        df[col_name] = nth_prior_measurements
    

    Now I will write a loop over the features in the features list defined earlier, and for each feature that is not "date", and for N from 1 through 3, we'll call our function to add the derived features we want to evaluate for predicting temperatures.

    for feature in features:  
        if feature != 'date':
            for N in range(1, 4):
                derive_nth_day_feature(df, feature, N)
    

    And for good measure I will take a look at the columns to make sure that they look as expected.

    df.columns  
    
    Index(['meantempm', 'meandewptm', 'meanpressurem', 'maxhumidity',  
           'minhumidity', 'maxtempm', 'mintempm', 'maxdewptm', 'mindewptm',
           'maxpressurem', 'minpressurem', 'precipm', 'meantempm_1', 'meantempm_2',
           'meantempm_3', 'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
           'meanpressurem_1', 'meanpressurem_2', 'meanpressurem_3',
           'maxhumidity_1', 'maxhumidity_2', 'maxhumidity_3', 'minhumidity_1',
           'minhumidity_2', 'minhumidity_3', 'maxtempm_1', 'maxtempm_2',
           'maxtempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3', 'maxdewptm_1',
           'maxdewptm_2', 'maxdewptm_3', 'mindewptm_1', 'mindewptm_2',
           'mindewptm_3', 'maxpressurem_1', 'maxpressurem_2', 'maxpressurem_3',
           'minpressurem_1', 'minpressurem_2', 'minpressurem_3', 'precipm_1',
           'precipm_2', 'precipm_3'],
          dtype='object')
    

    Excellent! Looks like we have what we need. The next thing I want to do is assess the quality of the data and clean it up where necessary.

    Data Cleaning - The Most Important Part

    As the section title says, the most important part of an analytics project is to make sure you are using quality data. The proverbial saying, "garbage in, garbage out", is as appropriate as ever when it comes to machine learning. However, the data cleaning part of an analytics project is not just one of the most important parts, it is also the most time consuming and laborious. To ensure the quality of the data for this project, in this section I will be looking to identify unnecessary data, missing values, inconsistent data types, and outliers, and then making some decisions about how to handle them if they arise.

    The first thing I want to do is drop any of the columns of the DataFrame that I am not interested in, to reduce the amount of data I am working with. The goal of the project is to predict the future temperature based on the past three days of weather measurements. With this in mind, we only want to keep the min, max, and mean temperatures for each day plus all the new derived variables we added in the last section.

    # make list of original features without meantempm, mintempm, and maxtempm
    to_remove = [feature  
                 for feature in features 
                 if feature not in ['meantempm', 'mintempm', 'maxtempm']]
    
    # make a list of columns to keep
    to_keep = [col for col in df.columns if col not in to_remove]
    
    # select only the columns in to_keep and assign to df
    df = df[to_keep]  
    df.columns  
    
    Index(['meantempm', 'maxtempm', 'mintempm', 'meantempm_1', 'meantempm_2',  
           'meantempm_3', 'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
           'meanpressurem_1', 'meanpressurem_2', 'meanpressurem_3',
           'maxhumidity_1', 'maxhumidity_2', 'maxhumidity_3', 'minhumidity_1',
           'minhumidity_2', 'minhumidity_3', 'maxtempm_1', 'maxtempm_2',
           'maxtempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3', 'maxdewptm_1',
           'maxdewptm_2', 'maxdewptm_3', 'mindewptm_1', 'mindewptm_2',
           'mindewptm_3', 'maxpressurem_1', 'maxpressurem_2', 'maxpressurem_3',
           'minpressurem_1', 'minpressurem_2', 'minpressurem_3', 'precipm_1',
           'precipm_2', 'precipm_3'],
          dtype='object')
    

    The next thing I want to do is to make use of some built in Pandas functions to get a better understanding of the data and potentially identify some areas to focus my energy on. The first function is a DataFrame method called info() which, big surprise... provides information on the DataFrame. Of interest is the "data type" column of the output.

    df.info()  
    
    <class 'pandas.core.frame.DataFrame'>  
    DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27  
    Data columns (total 39 columns):  
    meantempm          1000 non-null object  
    maxtempm           1000 non-null object  
    mintempm           1000 non-null object  
    meantempm_1        999 non-null object  
    meantempm_2        998 non-null object  
    meantempm_3        997 non-null object  
    meandewptm_1       999 non-null object  
    meandewptm_2       998 non-null object  
    meandewptm_3       997 non-null object  
    meanpressurem_1    999 non-null object  
    meanpressurem_2    998 non-null object  
    meanpressurem_3    997 non-null object  
    maxhumidity_1      999 non-null object  
    maxhumidity_2      998 non-null object  
    maxhumidity_3      997 non-null object  
    minhumidity_1      999 non-null object  
    minhumidity_2      998 non-null object  
    minhumidity_3      997 non-null object  
    maxtempm_1         999 non-null object  
    maxtempm_2         998 non-null object  
    maxtempm_3         997 non-null object  
    mintempm_1         999 non-null object  
    mintempm_2         998 non-null object  
    mintempm_3         997 non-null object  
    maxdewptm_1        999 non-null object  
    maxdewptm_2        998 non-null object  
    maxdewptm_3        997 non-null object  
    mindewptm_1        999 non-null object  
    mindewptm_2        998 non-null object  
    mindewptm_3        997 non-null object  
    maxpressurem_1     999 non-null object  
    maxpressurem_2     998 non-null object  
    maxpressurem_3     997 non-null object  
    minpressurem_1     999 non-null object  
    minpressurem_2     998 non-null object  
    minpressurem_3     997 non-null object  
    precipm_1          999 non-null object  
    precipm_2          998 non-null object  
    precipm_3          997 non-null object  
    dtypes: object(39)  
    memory usage: 312.5+ KB  
    

    Notice that the data type of every column is of type "object". We need to convert all of these feature columns to floats for the type of numerical analysis that we hope to perform. To do this I will use the apply() DataFrame method to apply the Pandas to_numeric function to all values of the DataFrame. The errors='coerce' parameter will convert any textual values to NaN. It is common to find textual values in data from the wild, and they usually originate from the data collector where data is missing or invalid.

    df = df.apply(pd.to_numeric, errors='coerce')  
    df.info()  
    
    <class 'pandas.core.frame.DataFrame'>  
    DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27  
    Data columns (total 39 columns):  
    meantempm          1000 non-null int64  
    maxtempm           1000 non-null int64  
    mintempm           1000 non-null int64  
    meantempm_1        999 non-null float64  
    meantempm_2        998 non-null float64  
    meantempm_3        997 non-null float64  
    meandewptm_1       999 non-null float64  
    meandewptm_2       998 non-null float64  
    meandewptm_3       997 non-null float64  
    meanpressurem_1    999 non-null float64  
    meanpressurem_2    998 non-null float64  
    meanpressurem_3    997 non-null float64  
    maxhumidity_1      999 non-null float64  
    maxhumidity_2      998 non-null float64  
    maxhumidity_3      997 non-null float64  
    minhumidity_1      999 non-null float64  
    minhumidity_2      998 non-null float64  
    minhumidity_3      997 non-null float64  
    maxtempm_1         999 non-null float64  
    maxtempm_2         998 non-null float64  
    maxtempm_3         997 non-null float64  
    mintempm_1         999 non-null float64  
    mintempm_2         998 non-null float64  
    mintempm_3         997 non-null float64  
    maxdewptm_1        999 non-null float64  
    maxdewptm_2        998 non-null float64  
    maxdewptm_3        997 non-null float64  
    mindewptm_1        999 non-null float64  
    mindewptm_2        998 non-null float64  
    mindewptm_3        997 non-null float64  
    maxpressurem_1     999 non-null float64  
    maxpressurem_2     998 non-null float64  
    maxpressurem_3     997 non-null float64  
    minpressurem_1     999 non-null float64  
    minpressurem_2     998 non-null float64  
    minpressurem_3     997 non-null float64  
    precipm_1          889 non-null float64  
    precipm_2          889 non-null float64  
    precipm_3          888 non-null float64  
    dtypes: float64(36), int64(3)  
    memory usage: 312.5 KB  
    

    Now that all of our data has the data type I want, I would like to take a look at some summary stats of the features and use a statistical rule of thumb to check for the existence of extreme outliers. The DataFrame method describe() will produce a DataFrame containing the count, mean, standard deviation, min, 25th percentile, 50th percentile (or median), 75th percentile, and max value. This can be very useful information for evaluating the distribution of the feature data.

    I would like to add to this information by calculating another output column indicating the existence of outliers. The rule of thumb for identifying an extreme outlier is a value that is more than 3 interquartile ranges below the 25th percentile, or more than 3 interquartile ranges above the 75th percentile. The interquartile range is simply the difference between the 75th percentile and the 25th percentile.

    # Call describe on df and transpose it due to the large number of columns
    spread = df.describe().T
    
    # precalculate interquartile range for ease of use in next calculation
    IQR = spread['75%'] - spread['25%']
    
    # create an outliers column which is either 3 IQRs below the first quartile or
    # 3 IQRs above the third quartile
    spread['outliers'] = (spread['min']<(spread['25%']-(3*IQR)))|(spread['max'] > (spread['75%']+3*IQR))
    
    # just display the features containing extreme outliers
    spread.loc[spread.outliers]  
    
    count mean std min 25% 50% 75% max outliers
    maxhumidity_1 999.0 88.107107 9.273053 47.0 83.0 90.0 93.00 100.00 True
    maxhumidity_2 998.0 88.102204 9.276407 47.0 83.0 90.0 93.00 100.00 True
    maxhumidity_3 997.0 88.093280 9.276775 47.0 83.0 90.0 93.00 100.00 True
    maxpressurem_1 999.0 1019.924925 7.751874 993.0 1015.0 1019.0 1024.00 1055.00 True
    maxpressurem_2 998.0 1019.922846 7.755482 993.0 1015.0 1019.0 1024.00 1055.00 True
    maxpressurem_3 997.0 1019.927783 7.757805 993.0 1015.0 1019.0 1024.00 1055.00 True
    minpressurem_1 999.0 1012.329329 7.882062 956.0 1008.0 1012.0 1017.00 1035.00 True
    minpressurem_2 998.0 1012.326653 7.885560 956.0 1008.0 1012.0 1017.00 1035.00 True
    minpressurem_3 997.0 1012.326981 7.889511 956.0 1008.0 1012.0 1017.00 1035.00 True
    precipm_1 889.0 2.908211 8.874345 0.0 0.0 0.0 0.51 95.76 True
    precipm_2 889.0 2.908211 8.874345 0.0 0.0 0.0 0.51 95.76 True
    precipm_3 888.0 2.888885 8.860608 0.0 0.0 0.0 0.51 95.76 True

    Assessing the potential impact of outliers is a difficult part of any analytics project. On the one hand, you need to be concerned about the potential for introducing spurious data artifacts that will significantly impact or bias your models. On the other hand, outliers can be extremely meaningful in predicting outcomes that arise under special circumstances. We will discuss each of these outlier-containing features and see if we can come to a reasonable conclusion as to how to treat them.

    The first set of features all appear to be related to max humidity. Looking at the data I can tell that the outlier for this feature category is due to the apparently very low min value. This indeed looks to be a pretty low value and I think I would like to take a closer look at it, preferably in a graphical way. To do this I will use a histogram.

    %matplotlib inline
    plt.rcParams['figure.figsize'] = [14, 8]  
    df.maxhumidity_1.hist()  
    plt.title('Distribution of maxhumidity_1')  
    plt.xlabel('maxhumidity_1')  
    plt.show()  
    

    [Figure: histogram showing the distribution of maxhumidity_1]

    Looking at the histogram of the values for maxhumidity, the data exhibits quite a bit of negative skew. I will want to keep this in mind when selecting prediction models and evaluating the strength of impact of max humidities. Many of the underlying statistical methods assume that the data is normally distributed. For now I think I will leave these values alone, but it will be good to keep this in mind and maintain a certain amount of skepticism about it.
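
    If you want to put a number on that visual impression, pandas Series have a skew() method; this is just an optional sketch (not part of the original analysis) to quantify the asymmetry:

    # sample skewness of the lag-1 max humidity feature; a clearly negative
    # value backs up the left (negative) skew visible in the histogram
    print(df['maxhumidity_1'].skew())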

    Next I will look at the minimum pressure feature distribution.

    df.minpressurem_1.hist()  
    plt.title('Distribution of minpressurem_1')  
    plt.xlabel('minpressurem_1')  
    plt.show()  
    

    [Figure: histogram showing the distribution of minpressurem_1]

    This plot exhibits another interesting feature: the data is multimodal, which leads me to believe that there are two very different sets of environmental circumstances apparent in this data. I am hesitant to remove these values since I know that the temperature swings in this area of the country can be quite extreme, especially between seasons of the year. I am worried that removing these low values might discard something of explanatory usefulness, but once again I will remain skeptical about them at the same time.

    The final category of features containing outliers, precipitation, is quite a bit easier to understand. Since dry days (i.e., no precipitation) are much more frequent, it is sensible to see outliers here. To me this is no reason to remove these features.

    The last data quality issue to address is that of missing values. Due to the way in which I have built out the DataFrame, the missing values are represented by NaNs. You will probably remember that I have intentionally introduced missing values for the first three days of the data collected by deriving features representing the prior three days of measurements. It is not until the fourth day that all of those derived features are available, so clearly I will want to exclude those first three days from the data set.

    Look again at the output from the last time I issued the info method. There is a column of output that listed the non-null values for each feature column. Looking at this information you can see that for the most part the features contain relatively few missing (null / NaN) values, mostly just the ones I introduced. However, the precipitation columns appear to be missing a significant part of their data.

    Missing data poses a problem because most machine learning methods require complete data sets devoid of any missing values. Beyond that requirement, if I were to remove all the rows just because the precipitation feature contains missing data, then I would be throwing out many other useful feature measurements.

    As I see it I have a couple of options to deal with this issue of missing data:

    1. I can simply remove the rows that contain the missing values, but as I mentioned earlier, throwing out that much data removes a lot of value from the data set.
    2. I can fill the missing values with an interpolated value that is a reasonable estimation of the true values (see the brief sketch after this list).
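
    For completeness, here is a hedged sketch of what option 2 could look like using pandas' built-in interpolation. It is not the route taken in this article, since zero-filling better matches how precipitation actually behaves:

    # linear interpolation between neighboring observations, per column;
    # limit_direction='both' also fills gaps at the start and end
    df_interpolated = df.interpolate(method='linear', limit_direction='both')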

    Since I would rather preserve as much of the data as I can, where there is minimal risk of introducing erroneous values, I am going to fill the missing precipitation values with the most common value of zero. I feel this is a reasonable decision because the great majority of values in the precipitation measurements are zero.

    # iterate over the precip columns
    for precip_col in ['precipm_1', 'precipm_2', 'precipm_3']:  
        # create a boolean Series flagging the NaN values
        missing_vals = pd.isnull(df[precip_col])
        # assign through .loc to avoid pandas chained-assignment warnings
        df.loc[missing_vals, precip_col] = 0
    

    Now that I have filled all the missing values that I can, while being cautious not to negatively impact the quality, I would be comfortable simply removing the remaining records containing missing values from the data set. It is quite easy to drop rows from the DataFrame containing NaNs. All I have to do is call the method dropna() and Pandas will do all the work for me.

    df = df.dropna()  
    

    Conclusion

    In this article I have described the process of collecting, cleaning, and processing a reasonably good-sized data set to be used for upcoming articles on a machine learning project in which we predict future weather temperatures.

    While this is probably going to be the driest of the articles detailing this machine learning project, I have tried to emphasize the importance of collecting quality data suitable for a valuable machine learning experiment.

    Thanks for reading and I hope you look forward to the upcoming articles on this project.

    November 14, 2017 05:21 PM


    Wallaroo Labs

    Identifying Trending Twitter Hashtags in Real-time with Wallaroo

    This week we have a guest post written by Hanee’ Medhat. Hanee’ is a Big Data Engineer, with experience working with massive data in many industries, such as Telecommunications and Banking. Overview One of the primary places where the world is seeing an explosion of data growth is in social media. Wallaroo is a powerful and simple-to-use open-source data engine that is ideally suited for handling massive amounts of streaming data in real-time.

    November 14, 2017 04:00 PM


    Codementor

    30-minute Python Web Scraper

    I’ve been meaning to create a web scraper using Python and Selenium (http://www.seleniumhq.org/) for a while now, but never got around to it. A few nights ago, I decided to give it a spin....

    November 14, 2017 09:31 AM


    Montreal Python User Group

    Montréal-Python 68: Wysiwyg Xylophone

    Please RSVP on our meetup event

    When

    November 20th at 6:00PM

    Where

    Google Montréal, 1253 McGill College #150, Montréal, QC

    We thank Google Montreal for sponsoring MP68

    Schedule

    Presentations

    Va debugger ton Python! - Stéphane Wirtel

    This presentation explains the basics of Pdb as well as GDB, so you can debug your Python scripts more easily.
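
    For readers who have never used Pdb, a minimal, hypothetical example of dropping into the debugger looks like this (just a sketch; the talk covers much more, including GDB):

    import pdb

    def buggy_sum(values):
        total = 0
        for v in values:
            pdb.set_trace()  # execution pauses here; inspect v and total, then type 'c' to continue
            total += v
        return total

    buggy_sum([1, 2, 3])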

    Writing a Python interpreter in Python from scratch - Zhentao Li

    I will show a prototype of a Python interpreter written entirely in Python itself (and that isn't PyPy).

    The goal is to have simpler internals to allow experimenting with changes to the language more easily. This interpreter has a small core, with much of the library modifiable at run time for quickly testing changes. This differs from PyPy, which aims for full Python compatibility and speed (via JIT compilation). I will show some of the interesting things that you can do with this interpreter.

    This interpreter has two parts: a parser to transform source into an abstract syntax tree, and a runner for traversing this tree. I will give an overview of how both parts work and discuss some challenges encountered and their solutions.

    This interpreter makes use of very few libraries, and only those included with CPython.
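
    To give a flavour of that parser-plus-runner structure using nothing but the standard library, here is a hedged, minimal sketch. It is not Zhentao's interpreter, just an illustration of the idea with CPython's ast module and a tiny subset of expressions:

    import ast

    def run(source, env=None):
        env = env or {}
        tree = ast.parse(source, mode='eval')   # part 1: source -> abstract syntax tree

        def walk(node):                         # part 2: traverse the tree and evaluate it
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Num):       # numeric literal
                return node.n
            if isinstance(node, ast.Name):      # variable lookup
                return env[node.id]
            if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
                return walk(node.left) + walk(node.right)
            if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Sub):
                return walk(node.left) - walk(node.right)
            raise NotImplementedError(type(node).__name__)

        return walk(tree)

    print(run("x + 2 - 1", {"x": 40}))  # prints 41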

    This project is looking for members to discuss ways of simplifying parts of the interpreter (among other things).

    The talk will be about Rasa, an open-source chatbot platform - Nathan Zylbersztejn

    Most chatbots rely on external APIs for the cool stuff such as natural language understanding (NLU) and disappoint because if and else conditionals fail at delivering good representations of our non-linear human way to converse. Wouldn’t it be great if we could 1) take control of NLU and tweak it to better fit our needs and 2) really apply machine learning, extract patterns from real conversations, and handle dialogue in a decent manner? Well, we can, thanks to Rasa.ai. It’s open-source, it’s in Python, and it works.

    About Nathan Zylbersztejn:

    Nathan is the founder of Mr. Bot, a dialogue consulting agency in Montreal with clients in the media, banking, and aerospace industries. He holds a master's in economics, a graduate diploma in computer science, and a machine learning nanodegree.

    November 14, 2017 05:00 AM

    November 13, 2017


    Doug Hellmann

    readline — The GNU readline Library — PyMOTW 3

    The readline module can be used to enhance interactive command line programs to make them easier to use. It is primarily used to provide command line text completion, or “tab completion”. Read more… This post is part of the Python Module of the Week series for Python 3. See PyMOTW.com for more articles from the …
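
    As a tiny taste of what the module enables, here is a hedged, minimal sketch of tab completion over a fixed word list (see the full article for the proper treatment):

    import readline

    WORDS = ['start', 'stop', 'status', 'restart']

    def completer(text, state):
        # return the state-th word that begins with the text typed so far
        matches = [w for w in WORDS if w.startswith(text)]
        return matches[state] if state < len(matches) else None

    readline.set_completer(completer)
    readline.parse_and_bind('tab: complete')

    # pressing Tab at this prompt now completes against WORDS
    command = input('command> ')
    print('you entered:', command)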

    November 13, 2017 02:00 PM


    Mike Driscoll

    PyDev of the Week: Bert JW Regeer

    This week we welcome Bert JW Regeer as our PyDev of the Week! Bert is a core developer of the Pyramid web framework. You can check out his experience over on his website or go check out his GitHub profile to see what projects he’s been working on lately. Let’s take a few moments to get to know Bert better!

    Can you tell us a little about yourself? (Hobbies, education, etc?):

    Oh, I have no idea where to start, but let’s give it a shot. I am first and foremost a geek, I love electronics, which in and of itself is an expensive hobby. Always new toys to play with and buy. I studied Computer Science at the University of Advancing Technology, and have been known to spend a lot of time building cool hardware based projects too. Spent a lot of time on FIRST Robotics, first in HS and then mentoring HS students while in college. Lately the only hardware I get to play with though is home automation, installing new switches and sensors to make my laziness even more lazy!

    My other major hobbies all have something in common, they are expensive ones: photography and cars. I am a bit of an amateur photographer and am always looking to get new lenses or find new ideas on how to shoot something new and exciting. My next goal is to do some astrophotography and I am looking to get a nice wide lens with a nice large aperture. I live in Colorado so there are plenty of gorgeous places to go photograph. I drive a Subaru WRX, and I absolutely love going for rides. Been eyeing some upgrades to my car, but so far she is still stock. I enjoy going out and driving around the mountains though, which goes hand in hand with the photography!

    Last but not least, lately I have gotten into making my own sourdough bread. There is nothing better than a freshly baked sourdough bread with a little butter. It’s my newest and most recent hobby, and it is also the one that costs almost nothing. I get to make healthier bread, share it with friends and family, and it costs pennies! I work at a small company named Crunch and a few of my colleagues are bread bakers too, which allows us to share tips and ideas on how to improve our breads.

    I really should update my website to include this, but on a different note, you can find some of my older projects there!

    Why did you start using Python?

    I was introduced to Python while I was at university, at the time (2008 time frame) I didn’t really think much of it other than just a quick way to prototype some ideas before translating them into C/C++. At the time I was a pretty big proponent of C++, and working mostly in the backend it was a pretty natural fit. Write your applications in C++ and run on FreeBSD.

    It wasn’t until I was at my first programming job where we had to quickly write a new server component that I started reaching for Python first and foremost to deliver the project. After quickly prototyping I was able to prove that Python would provide us with the speed and throughput required while alleviating some of the development concerns we were worried about. As time went on there were still components I wrote in C++, but a large part of our service ended up being written in Python due to the speed of development.

    For personal use, I had always written my websites in PHP since it was always available and easy to use, but I never really did enjoy using any of the frameworks built for PHP. All of them were brand new at the time and felt incredibly heavyweight, and because I was using more Python at work it was getting confusing: when do I need to use tabs and when do I use curly braces? It always took me a minute or two to context switch from one programming language to another, so I started looking at Django, Flask, Pylons, Pyramid, Google App Engine and others. I ended up settling on Pyramid due to its simplicity and because it allowed me to pick the components I wanted to use. I ended up becoming the maintainer for WebOb and recently have become a full core contributor to the Pylons Project.

    What other programming languages do you know and which is your favorite?

    This is going to be an incredibly long list, so let’s go with the ones I have used extensively and not just for a toy project here and there. As already mentioned I started out with C/C++, the first language I learned, from a C++ for Dummies book when I was 12. Since then I’ve run the gamut, but PHP was fairly big for me for a while. In high school and university Java of course was used; although it is still my least favorite language, I have used it extensively on some Android projects. I’ve worked on a project that was Objective-C, with a little bit of Swift, mainly doing a security audit so I am not sure it really counts as extensive… Currently the two languages I use most are Python and JavaScript (or TypeScript, transpiled to JavaScript). ES6/ES7 (yay for Babel) are heavily used in various projects.

    Python however has definitely become my favorite programming language. It is incredibly versatile and even though it is not the fastest language by far, it is one of the most flexible and I can appreciate how easy it makes my life. Are there things I’d like to see change in Python? Absolutely. Are there pieces missing? Sure. At the same time I am not sure what other language I would enjoy working in as much as I currently do with Python. I’ve tried Golang, it’s just not for me. Rust comes pretty close, but it feels too much like C/C++ and requires a lot more thinking than I think is necessary for the things I am working on.

    What projects are you working on now?

    Outside of work, just a bunch of open source currently. As I am writing this I am preparing a talk for PloneConf where I am going to talk about taking over maintenance of well loved projects, specifically WebOb which is a Python WSGI/HTTP request/response library that was written by Ian Bicking and is now maintained by the Pylons Project with me as lead.

    The Pylons Project is a collective of people that maintain a bunch of different pieces of software, mostly for the HTTP web. Pyramid the web framework, WebOb the WSGI/HTTP request/response library that underlies it, webtest for testing your WSGI applications, form generation, pure HTTP web server and more. We don’t have a lot of people, so there is a lot of work to be done. Releasing new versions of existing software, accepting/reviewing patches and reducing the issue count faster than the issues continue to be generated!

    There are also many unfinished projects; my GitHub is a veritable graveyard of projects that I’ve started and never finished. Great aspirations, but I find that when I am doing things for myself, once I have figured out the hard part, once I’ve solved “the problem”, completing a project is not nearly as much fun, so off I go to the next project. I always learn something new, I just feel bad that it is mostly half-finished code that no one else can really benefit from.

    Which Python libraries are your favorite (core or 3rd party)?

    I believe it started as a third-party package, but it is now baked into every Python installation (unless you are on one of those special Linux distributions that rips it out and makes it a separate installable), and that is the virtual environment tool venv or virtualenv. It makes it much simpler to have different environments with different libraries installed. Being able to separate out all of my different projects and not have to install globally is amazing! C/C++ makes this much more difficult, especially if you need to include linker flags and all kinds of fun stuff, and pkg-config and friends only get you so far. Similar systems exist for other languages, but it is by far my favorite part about working with Python.
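
    For anyone who has not tried it, creating an isolated environment can even be scripted from Python itself via the standard library venv module; a minimal sketch (the environment name is arbitrary):

    import venv

    # create an isolated environment named "demo-env" with pip pre-installed;
    # packages installed into it do not touch the global site-packages
    venv.create('demo-env', with_pip=True)

    Activating it from a shell and running pip install inside it keeps each project's dependencies separate, which is exactly the property Bert describes.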

    Is there anything else you’d like to say?

    We are always looking for new developers to join us in the Pylons Project, if you are looking for someone to mentor you, please reach out and we will do our best. This year we had an absolutely fantastic Google Summer of Code with Ira and I’d be happy to help introduce more new people to not just the Pylons Project but to open source in general.

    Thanks for doing the interview!

    November 13, 2017 01:30 PM


    Tryton News

    Tryton on Docker

    In order to ease the adoption of Tryton, we are publishing Docker images on Docker Hub for all series starting from 4.4.

    Docker

    They contain the server, the web client and all modules of the series. They are periodically updated and tagged per series. They work with the postgres image as the default storage back-end.

    The usage is pretty simple:

    $ docker run --name postgres -e POSTGRES_DB=tryton -d postgres
    $ docker run --link postgres:postgres -it tryton/tryton trytond-admin -d tryton --all
    $ docker run --name tryton -p 8000:8000 --link postgres:postgres -d tryton/tryton
    

    Then you can connect using: http://localhost:8000/

    November 13, 2017 09:00 AM