
Planet Python

Last update: April 01, 2020 09:48 PM UTC

April 01, 2020


Stack Abuse

One-Hot Encoding in Python with Pandas and Scikit-Learn

Introduction

In computer science, data can be represented in a lot of different ways, and naturally, every single one of them has its advantages as well as disadvantages in certain fields.

Since computers are unable to process categorical data as these categories have no meaning for them, this information has to be prepared if we want a computer to be able to process it.

This action is called preprocessing. A big part of preprocessing is encoding - representing every single piece of data in a way that a computer can understand (the name literally means "convert to computer code").

In many branches of computer science, especially machine learning and digital circuit design, One-Hot Encoding is widely used.

In this article, we will explain what one-hot encoding is and implement it in Python using a few popular choices, Pandas and Scikit-Learn. We'll also compare its effectiveness to other types of representation, cover its strengths and weaknesses, and look at its applications.

What is One-Hot Encoding?

One-hot encoding is a type of vector representation in which all of the elements of a vector are 0, except for one, which has 1 as its value; that single 1 acts as a boolean flag marking the category of the element.

There also exists a similar implementation called One-Cold Encoding, where all of the elements in a vector are 1, except for one, which has 0 as its value.

For instance, [0, 0, 0, 1, 0] and [1, 0, 0, 0, 0] could be some examples of one-hot vectors. A similar technique, also used to represent data, would be dummy variables in statistics.

This is very different from other encoding schemes, which all allow multiple bits to have 1 as their value. Below is a table that compares the representation of numbers from 0 to 7 in binary, Gray code, and one-hot:

Decimal  Binary  Gray code  One-Hot
0        000     000        00000001
1        001     001        00000010
2        010     011        00000100
3        011     010        00001000
4        100     110        00010000
5        101     111        00100000
6        110     101        01000000
7        111     100        10000000

Practically, for every one-hot vector, we ask n questions, where n is the number of categories we have:

Is this the number 0? Is this the number 1? ... Is this the number 7?

Each "0" is "false" and once we hit a "1" in a vector, the answer to the question is "true".

One-hot encoding transforms categorical features into a format that works better with classification and regression algorithms. It's very useful in methods where multiple types of data representation are necessary.

For example, some vectors may be optimal for regression (approximating functions based on former return values), and some may be optimal for classification (categorization into fixed sets/classes, typically binary):

Label ID
Strawberry 1
Apple 2
Watermelon 3
Lemon 4
Peach 5
Orange 6

Here we have six sample inputs of categorical data. The type of encoding used here is called "label encoding" - and it is very simple: we just assign an ID for a categorical value.

Our computer now knows how to represent these categories, because it knows how to work with numbers. However, this method of encoding is not very effective, because the numeric IDs imply an order and magnitude that the categories don't actually have - a model may treat higher IDs as carrying more weight.

It wouldn't make sense to say that our category of "Strawberries" is greater or smaller than "Apples", or that adding the category "Lemon" to "Peach" would give us a category "Orange", since these values are not ordinal.

If we represented these categories with one-hot encoding, we would instead turn each category value into its own column. We do this by creating one boolean column for each of our given categories, where only one of these columns can take on the value 1 for each sample:

Strawberry  Apple  Watermelon  Lemon  Peach  Orange  ID
1           0      0           0      0      0       1
0           1      0           0      0      0       2
0           0      1           0      0      0       3
0           0      0           1      0      0       4
0           0      0           0      1      0       5
0           0      0           0      0      1       6

We can see from the tables above that more digits are needed in one-hot representation compared to Binary or Gray code. For n digits, one-hot encoding can only represent n values, while Binary or Gray encoding can represent 2^n values using n digits.
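A quick sketch makes the density difference obvious (plain Python; the value counts are just illustrative):

import math

for n_values in (8, 64, 1024):
    dense_bits = math.ceil(math.log2(n_values))  # bits needed by binary or Gray code
    print(f"{n_values} values: {dense_bits} bits dense vs {n_values} bits one-hot")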

Implementation

Pandas

Let's take a look at a simple example of how we can convert values from a categorical column in our dataset into their numerical counterparts, via the one-hot encoding scheme.

We'll be creating a really simple dataset - a list of countries and their ID's:

import pandas as pd

ids = [11, 22, 33, 44, 55]
countries = ['Spain', 'France', 'Spain', 'Germany', 'France']

df = pd.DataFrame(list(zip(ids, countries)),
                  columns=['Ids', 'Countries'])

In the script above, we create a Pandas dataframe called df using two lists: ids and countries. If you call the head() method on the dataframe, you should see the following result:

df.head()

[Image: df.head() output showing the Ids and Countries columns]

The Countries column contains categorical values. We can convert the values in the Countries column into one-hot encoded vectors using the get_dummies() function:

y = pd.get_dummies(df.Countries, prefix='Country')
print(y.head())

We passed Country as the value for the prefix parameter of the get_dummies() function, hence you can see the string Country prefixed before the header of each of the one-hot encoded columns in the output.

Running this code yields:

   Country_France  Country_Germany  Country_Spain
0               0                0              1
1               1                0              0
2               0                0              1
3               0                1              0
4               1                0              0
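In practice, you'll usually want the encoded columns alongside the rest of the dataframe rather than on their own. Here is one hedged sketch of how that might look, continuing with the df from above (the variable name is just illustrative):

one_hot_df = pd.concat([df, pd.get_dummies(df.Countries, prefix='Country')], axis=1)
one_hot_df = one_hot_df.drop(columns=['Countries'])  # the encoded columns replace the original
print(one_hot_df)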

Scikit-Learn

An alternative would be to use another popular library - Scikit-Learn. It offers both the OneHotEncoder class and the LabelBinarizer class for this purpose.

First, let's start by importing the LabelBinarizer:

from sklearn.preprocessing import LabelBinarizer

And then, using the same dataframe as before, let's instantiate the LabelBinarizer and fit it:

y = LabelBinarizer().fit_transform(df.Countries)

Printing y would yield:

[[0 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]]

This isn't nearly as pretty as the Pandas approach, though, since we lose the column labels.

Similarly, we can use the OneHotEncoder class, which supports multi-column data, unlike the previous class:

from sklearn.preprocessing import OneHotEncoder

And then, let's populate a list and fit it in the encoder:

x = [[11, "Spain"], [22, "France"], [33, "Spain"], [44, "Germany"], [55, "France"]]
y = OneHotEncoder().fit_transform(x).toarray()
print(y)

Running this will yield:

[[1. 0. 0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 1. 0. 0.]]
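Notice that in the output above the numeric IDs were one-hot encoded as well, which is rarely what you want. One common way to restrict encoding to the categorical column is ColumnTransformer; the sketch below assumes a Scikit-Learn version where OneHotEncoder accepts sparse=False (newer releases call it sparse_output):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

x = [[11, "Spain"], [22, "France"], [33, "Spain"], [44, "Germany"], [55, "France"]]

# Encode only column 1 (the country); pass the numeric ID column through untouched
ct = ColumnTransformer(
    [("country", OneHotEncoder(sparse=False), [1])],
    remainder="passthrough",
)
print(ct.fit_transform(x))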

Applications of One-Hot Encoding

One-hot encoding has seen most of its application in the fields of Machine Learning and Digital Circuit Design.

Machine Learning

As stated above, computers aren't very good with categorical data. While we understand categorical data just fine, it's due to a kind of prerequisite knowledge that computers don't have.

Most Machine Learning techniques and models work with a very bounded dataset (typically binary). Neural networks consume data and produce results in the range of 0..1 and rarely will we ever go beyond that scope.

In short, the vast majority of machine learning algorithms receive sample data ("training data") from which features are extracted. Based on these features, a mathematical model is created, which is then used to make predictions or decisions without being explicitly programmed to perform these tasks.

A great example would be Classification, where the input can be technically unbounded, but the output is typically limited to a few classes. In the case of binary classification (say we're teaching a neural network to classify cats and dogs), we'd have a mapping of 0 for cats, and 1 for dogs.

Most of the time, the training data we wish to perform predictions on is categorical, like the example with fruit mentioned above. Again, while this makes a lot of sense to us, the words themselves are of no meaning to the algorithm as it doesn't understand them.

Using one-hot encoding for representation of data in these algorithms is not technically necessary, but pretty useful if we want an efficient implementation.

Digital Circuit Design

Many basic digital circuits use one-hot notation in order to represent their I/O values.

For example, it can be used to indicate the state of a finite-state machine. If some other type of representation, like Gray or Binary, is used, a decoder is needed to determine the state, as they're not as naturally compatible. Conversely, a one-hot finite-state machine does not need a decoder, because if the nth bit is high, the machine is, logically, in the nth state.

A good example of a finite-state machine is a ring counter - a type of counter composed of flip-flops connected as a shift register, in which the output of one flip-flop connects to the input of the next one.

The first flip-flop in this counter represents the first state, the second represents the second state, and so on. At the beginning, all of the flip-flops in the machine are set to '0', except for the first one, which is set to '1'.

The next clock edge arriving at the flip-flops advances the one 'hot' bit to the second flip-flop. The 'hot' bit advances like this until the last state, after which the machine returns to the first state.
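As an illustration, here is a plain Python sketch (not HDL) that models this behavior by rotating a single hot bit through a list:

def ring_counter(n_states, n_ticks):
    # Reset: the first flip-flop holds the hot bit, all others are 0
    state = [1] + [0] * (n_states - 1)
    for _ in range(n_ticks):
        print(state)
        state = [state[-1]] + state[:-1]  # each clock tick shifts the hot bit forward

ring_counter(4, 5)  # after the last state, the counter wraps back to the first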

Another example of one-hot encoding in digital circuit design is an address decoder, which takes a Binary or Gray code input and converts it to a one-hot output, as well as a priority encoder (shown in the picture below), which is the exact opposite: it takes a one-hot input and converts it to Binary or Gray:

[Image: priority encoder circuit]

Advantages and Disadvantages of One-hot encoding

Like every other type of encoding, one-hot has many good points as well as problematic aspects.

Advantages

A great advantage of one-hot encoding is that determining the state of a machine has a low and constant cost, because all it needs to do is access one flip-flop. Changing the state of the machine is almost as fast, since it just needs to access two flip-flops.

Another great thing about one-hot encoding is the easy implementation. Digital circuits made in this notation are very easy to design and modify. Illegal states in the finite-state machine are also easy to detect.

A one-hot implementation is known for being the fastest one, allowing a state machine to run at a faster clock rate than any other encoding of that state machine.

Disadvantages

One of the main disadvantages of one-hot encoding is the above-mentioned fact that it can't represent many values (for n states, we would need n digits - or flip-flops). This is why, if we wanted to implement a one-hot 15-state ring counter, for example, we would need 15 flip-flops, whereas the binary implementation would only need four flip-flops.

This makes it especially impractical for PAL devices, and it can also be very expensive, but it takes advantage of an FPGA's abundant flip-flops.

Another problem with this type of encoding is that many of the states in a finite-state machine would be illegal - for every n valid states, there are (2^n - n) illegal ones. A good thing is that these illegal states are, as previously said, really easy to detect (one XOR gate would be enough), so it's not very hard to take care of them.
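As a software analogy of that detection (hardware would use gates, of course), a state word is legal one-hot exactly when a single bit is set:

def is_legal_one_hot(state, n_bits):
    # A state is legal in one-hot encoding iff exactly one bit is set
    return 0 <= state < (1 << n_bits) and bin(state).count("1") == 1

print(is_legal_one_hot(0b0100, 4))  # True
print(is_legal_one_hot(0b0110, 4))  # False: two hot bits
print(is_legal_one_hot(0b0000, 4))  # False: no hot bit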

Conclusion

Since one-hot encoding is very simple, it is easy to understand and use in practice. It's no surprise that it's so popular in the world of computer science.

Since the cons aren't too bad, it has seen wide application. At the end of the day, its pros clearly outweigh the cons, which is why this type of encoding is likely to stick around for a long time.

April 01, 2020 08:03 PM UTC


Python Software Foundation

An Update on PyPI Funded Work

Originally announced at the end of 2018, a gift from Facebook Research is funding security improvements for PyPI and its users.

What's been done

After launching a request for information and a subsequent request for proposal in the second half of 2019, contractors were selected, and work on Milestone 2 of the project commenced in December 2019 and was completed in February 2020.
The result is that PyPI now has tooling in place to implement automated checks that run in response to events such as Project or Release creation or File uploads, as well as on schedules. In addition to documentation, example checks were implemented that demonstrate both event-based and scheduled checks.
Results from checks are made available for PyPI moderators and administrators to review, but no automated responses will be put in place. As the check suite is developed and refined, we hope these checks will help identify the malicious uploads and spam that PyPI regularly contends with.

What's next

With the acceptance of PEP 458 on February 15 we're excited to announce that work on implementation of The Update Framework has started.
This work will enable clients like pip to ensure that they have downloaded valid files from PyPI and equip the PyPI administrators to better respond in the event of a compromise.
The timeline for this work is currently planned over the coming months, with an initial key signing to be held at PyCon 2020 in Pittsburgh, Pennsylvania and rollout of the services needed to support TUF enabled clients in May or June.

Other PyPI News

For users who have enabled two-factor authentication on PyPI, support has been added for account recovery codes. These codes are intended for use in the case where you've lost your WebAuthn device or TOTP application, allowing you to recover access to your account.
You can generate and store recovery codes now by visiting your account settings and clicking "Generate Recovery Codes".

April 01, 2020 02:55 PM UTC


Real Python

Linked Lists in Python: An Introduction

Linked lists are like a lesser-known cousin of lists. They’re not as popular or as cool, and you might not even remember them from your algorithms class. But in the right context, they can really shine.

In this article, you’ll learn:

  1. What linked lists are and when you should use them
  2. How to use collections.deque for all of your linked list needs
  3. How to implement your own linked lists
  4. What the other types of linked lists are and what they can be used for

If you’re looking to brush up on your coding skills for a job interview, or if you want to learn more about Python data structures besides the usual dictionaries and lists, then you’ve come to the right place!

You can follow along with the examples in this tutorial by downloading the source code available at the link below:

Get the Source Code: Click here to get the source code you'll use to learn about linked lists in this tutorial.

Understanding Linked Lists

Linked lists are an ordered collection of objects. So what makes them different from normal lists? Linked lists differ from lists in the way that they store elements in memory. While lists use a contiguous memory block to store references to their data, linked lists store references as part of their own elements.

Main Concepts

Before going more in depth on what linked lists are and how you can use them, you should first learn how they are structured. Each element of a linked list is called a node, and every node has two different fields:

  1. Data contains the value to be stored in the node.
  2. Next contains a reference to the next node on the list.

Here’s what a typical node looks like:

[Image: example node of a linked list]

A linked list is a collection of nodes. The first node is called the head, and it’s used as the starting point for any iteration through the list. The last node must have its next reference pointing to None to determine the end of the list. Here’s how it looks:

[Image: example structure of a linked list]

Now that you know how a linked list is structured, you’re ready to look at some practical use cases for it.

Practical Applications

Linked lists serve a variety of purposes in the real world. They can be used to implement (spoiler alert!) queues or stacks as well as graphs. They’re also useful for much more complex tasks, such as lifecycle management for an operating system application.

Queues or Stacks

Queues and stacks differ only in the way elements are retrieved. For a queue, you use a First-In/First-Out (FIFO) approach. That means that the first element inserted in the list is the first one to be retrieved:

[Image: example structure of a queue]

In the diagram above, you can see the front and rear elements of the queue. When you append new elements to the queue, they’ll go to the rear end. When you retrieve elements, they’ll be taken from the front of the queue.

For a stack, you use a Last-In/First-Out (LIFO) approach, meaning that the last element inserted in the list is the first to be retrieved:

[Image: example structure of a stack]

In the above diagram you can see that the first element inserted on the stack (index 0) is at the bottom, and the last element inserted is at the top. Since stacks use the LIFO approach, the last element inserted (at the top) will be the first to be retrieved.

Because of the way you insert and retrieve elements from the edges of queues and stacks, linked lists are one of the most convenient ways to implement these data structures. You’ll see examples of these implementations later in the article.

Graphs

Graphs can be used to show relationships between objects or to represent different types of networks. For example, a visual representation of a graph—say a directed acyclic graph (DAG)—might look like this:

[Image: example directed acyclic graph]

There are different ways to implement graphs like the above, but one of the most common is to use an adjacency list. An adjacency list is, in essence, a list of linked lists where each vertex of the graph is stored alongside a collection of connected vertices:

Vertex  Linked List of Vertices
1       2 → 3 → None
2       4 → None
3       None
4       5 → 6 → None
5       6 → None
6       None

In the table above, each vertex of your graph is listed in the left column. The right column contains a series of linked lists storing the other vertices connected with the corresponding vertex in the left column. This adjacency list could also be represented in code using a dict:

>>> graph = {
...     1: [2, 3, None],
...     2: [4, None],
...     3: [None],
...     4: [5, 6, None],
...     5: [6, None],
...     6: [None]
... }

The keys of this dictionary are the source vertices, and the value for each key is a list. This list is usually implemented as a linked list.

Note: In the above example you could avoid storing the None values, but we’ve retained them here for clarity and consistency with later examples.

In terms of both speed and memory, implementing graphs using adjacency lists is very efficient in comparison with, for example, an adjacency matrix. That’s why linked lists are so useful for graph implementation.

Performance Comparison: Lists vs Linked Lists

In most programming languages, there are clear differences in the way linked lists and arrays are stored in memory. In Python, however, lists are dynamic arrays. That means that the memory usage of both lists and linked lists is very similar.

Further reading: Python’s implementation of dynamic arrays is quite interesting and definitely worth a read. Make sure to have a look and use that knowledge to stand out at your next company party!

Since the difference in memory usage between lists and linked lists is so insignificant, it’s better if you focus on their performance differences when it comes to time complexity.

Insertion and Deletion of Elements

List insertion times vary depending on the need to resize the existing array. If you read the article above on how lists are implemented, then you’ll notice that Python overallocates space for arrays based on their growth. This means that inserting a new element into a list can take anywhere from Θ(1) to Θ(n), depending on whether the existing array needs to be resized.
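You can watch that overallocation happen with a small sketch (the exact byte counts vary by Python version and platform):

import sys

numbers = []
for value in range(10):
    numbers.append(value)
    # getsizeof reports the allocated size, which jumps in steps as the array grows
    print(len(numbers), sys.getsizeof(numbers))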

The same applies for deleting elements from the array. If the size of the array is less than half the allocated size, then Python will shrink the array. So, all in all, inserting or deleting elements from an array has an average time complexity of Θ(n).

Linked lists, however, are much better when it comes to insertion and deletion of elements at the beginning or the end of the list, where their time complexity is constant: Θ(1). This performance advantage is the reason linked lists are so useful for queues and stacks, where elements are continuously inserted and removed from the edges.

Retrieval of Elements

When it comes to element lookup, lists perform much better than linked lists. When you know which element you want to access, lists can perform this operation in Θ(1) time. Trying to do the same with a linked list would take Θ(n) because you need to traverse the whole list to find the element.

When searching for a specific element, however, both lists and linked lists perform very similarly, with a time complexity of Θ(n). In both cases, you need to iterate through the entire list to find the element you’re looking for.

Introducing collections.deque

In Python, there’s a specific object in the collections module that you can use for linked lists called deque (pronounced “deck”), which stands for double-ended queue.

collections.deque uses an implementation of a linked list in which you can access, insert, or remove elements from the beginning or end of a list with constant O(1) performance.
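If you want to see that difference for yourself, here’s a quick (and admittedly crude) timeit sketch comparing insertion at the head of a list with a deque:

from timeit import timeit

setup = "from collections import deque; lst = list(range(100_000)); dq = deque(lst)"
print(timeit("lst.insert(0, None)", setup=setup, number=1000))  # Θ(n): shifts every element
print(timeit("dq.appendleft(None)", setup=setup, number=1000))  # Θ(1): constant time at the head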

How to Use collections.deque

There are quite a few methods that come, by default, with a deque object. However, in this article you’ll only touch on a few of them, mostly for adding or removing elements.

First, you need to create a linked list. You can use the following piece of code to do that with deque:

>>> from collections import deque
>>> deque()
deque([])

The code above will create an empty linked list. If you want to populate it at creation, then you can give it an iterable as input:

>>> deque(['a','b','c'])
deque(['a', 'b', 'c'])

>>> deque('abc')
deque(['a', 'b', 'c'])

>>> deque([{'data': 'a'}, {'data': 'b'}])
deque([{'data': 'a'}, {'data': 'b'}])

When initializing a deque object, you can pass any iterable as an input, such as a string (also an iterable) or a list of objects.

Now that you know how to create a deque object, you can interact with it by adding or removing elements. You can create an abcde linked list and add a new element f like this:

>>> llist = deque("abcde")
>>> llist
deque(['a', 'b', 'c', 'd', 'e'])

>>> llist.append("f")
>>> llist
deque(['a', 'b', 'c', 'd', 'e', 'f'])

>>> llist.pop()
'f'

>>> llist
deque(['a', 'b', 'c', 'd', 'e'])

Both append() and pop() add or remove elements from the right side of the linked list. However, you can also use deque to quickly add or remove elements from the left side, or head, of the list:

>>> llist.appendleft("z")
>>> llist
deque(['z', 'a', 'b', 'c', 'd', 'e'])

>>> llist.popleft()
'z'

>>> llist
deque(['a', 'b', 'c', 'd', 'e'])

Adding or removing elements from both ends of the list is pretty straightforward using the deque object. Now you’re ready to learn how to use collections.deque to implement a queue or a stack.

How to Implement Queues and Stacks

As you learned above, the main difference between a queue and a stack is the way you retrieve elements from each. Next, you’ll find out how to use collections.deque to implement both data structures.

Queues

With queues, you want to add values to a list (enqueue), and when the timing is right, you want to remove the element that has been on the list the longest (dequeue). For example, imagine a queue at a trendy and fully booked restaurant. If you were trying to implement a fair system for seating guests, then you’d start by creating a queue and adding people as they arrive:

>>> from collections import deque
>>> queue = deque()
>>> queue
deque([])

>>> queue.append("Mary")
>>> queue.append("John")
>>> queue.append("Susan")
>>> queue
deque(['Mary', 'John', 'Susan'])

Now you have Mary, John, and Susan in the queue. Remember that since queues are FIFO, the first person who got into the queue should be the first to get out.

Now imagine some time goes by and a few tables become available. At this stage, you want to remove people from the queue in the correct order. This is how you would do that:

>>> queue.popleft()
'Mary'

>>> queue
deque(['John', 'Susan'])

>>> queue.popleft()
'John'

>>> queue
deque(['Susan'])

Every time you call popleft(), you remove the head element from the linked list, mimicking a real-life queue.

Stacks

What if you wanted to create a stack instead? Well, the idea is more or less the same as with the queue. The only difference is that the stack uses the LIFO approach, meaning that the last element to be inserted in the stack should be the first to be removed.

Imagine you’re creating a web browser’s history functionality in which you store every page a user visits so they can go back in time easily. Assume these are the actions a random user takes on their browser:

  1. Visits Real Python’s website
  2. Navigates to Pandas: How to Read and Write Files
  3. Clicks on a link for Reading and Writing CSV Files in Python

If you’d like to map this behavior into a stack, then you could do something like this:

>>> from collections import deque
>>> history = deque()

>>> history.appendleft("https://realpython.com/")
>>> history.appendleft("https://realpython.com/pandas-read-write-files/")
>>> history.appendleft("https://realpython.com/python-csv/")
>>> history
deque(['https://realpython.com/python-csv/',
       'https://realpython.com/pandas-read-write-files/',
       'https://realpython.com/'])

In this example, you created an empty history object, and every time the user visited a new site, you added it to your history variable using appendleft(). Doing so ensured that each new element was added to the head of the linked list.

Now suppose that after the user read both articles, they wanted to go back to the Real Python home page to pick a new article to read. Knowing that you have a stack and want to remove elements using LIFO, you could do the following:

>>> history.popleft()
'https://realpython.com/python-csv/'

>>> history.popleft()
'https://realpython.com/pandas-read-write-files/'

>>> history
deque(['https://realpython.com/'])

There you go! Using popleft(), you removed elements from the head of the linked list until you reached the Real Python home page.

From the examples above, you can see how useful it can be to have collections.deque in your toolbox, so make sure to use it the next time you have a queue- or stack-based challenge to solve.

Implementing Your Own Linked List

Now that you know how to use collections.deque for handling linked lists, you might be wondering why you would ever implement your own linked list in Python. There are a few reasons to do it:

  1. Practicing your Python algorithm skills
  2. Learning about data structure theory
  3. Preparing for job interviews

Feel free to skip this next section if you’re not interested in any of the above, or if you already aced implementing your own linked list in Python. Otherwise, it’s time to implement some linked lists!

How to Create a Linked List

First things first, create a class to represent your linked list:

class LinkedList:
    def __init__(self):
        self.head = None

The only information you need to store for a linked list is where the list starts (the head of the list). Next, create another class to represent each node of the linked list:

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None

In the above class definition, you can see the two main elements of every single node: data and next. You can also add a __repr__ to both classes to have a more helpful representation of the objects:

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None

    def __repr__(self):
        return self.data

class LinkedList:
    def __init__(self):
        self.head = None

    def __repr__(self):
        node = self.head
        nodes = []
        while node is not None:
            nodes.append(node.data)
            node = node.next
        nodes.append("None")
        return " -> ".join(nodes)

Have a look at an example of using the above classes to quickly create a linked list with three nodes:

>>> llist = LinkedList()
>>> llist
None

>>> first_node = Node("a")
>>> llist.head = first_node
>>> llist
a -> None

>>> second_node = Node("b")
>>> third_node = Node("c")
>>> first_node.next = second_node
>>> second_node.next = third_node
>>> llist
a -> b -> c -> None

By defining a node’s data and next values, you can create a linked list quite quickly. These LinkedList and Node classes are the starting points for our implementation. From now on, it’s all about increasing their functionality.

Here’s a slight change to the linked list’s __init__() that allows you to quickly create linked lists with some data:

def __init__(self, nodes=None):
    self.head = None
    if nodes is not None:
        node = Node(data=nodes.pop(0))
        self.head = node
        for elem in nodes:
            node.next = Node(data=elem)
            node = node.next

With the above modification, creating linked lists to use in the examples below will be much faster.

How to Traverse a Linked List

One of the most common things you will do with a linked list is to traverse it. Traversing means going through every single node, starting with the head of the linked list and ending on the node that has a next value of None.

Traversing is just a fancier way to say iterating. So, with that in mind, create an __iter__ to add the same behavior to linked lists that you would expect from a normal list:

def __iter__(self):
    node = self.head
    while node is not None:
        yield node
        node = node.next

The method above goes through the list and yields every single node. The most important thing to remember about this __iter__ is that you need to keep checking that the current node is not None. Once the node becomes None, you’ve reached the end of your linked list.

After yielding the current node, you want to move to the next node on the list. That’s why you add node = node.next. Here’s an example of traversing a random list and printing each node:

>>> llist = LinkedList(["a", "b", "c", "d", "e"])
>>> llist
a -> b -> c -> d -> e -> None

>>> for node in llist:
...     print(node)
a
b
c
d
e

In other articles, you might see traversal defined in a specific method called traverse(). However, using Python’s built-in methods to achieve said behavior makes this linked list implementation a bit more Pythonic.

How to Insert a New Node

There are different ways to insert new nodes into a linked list, each with its own implementation and level of complexity. That’s why you’ll see them split into specific methods for inserting at the beginning, end, or between nodes of a list.

Inserting at the Beginning

Inserting a new node at the beginning of a list is probably the most straightforward insertion since you don’t have to traverse the whole list to do it. It’s all about creating a new node and then pointing the head of the list to it.

Have a look at the following implementation of add_first() for the class LinkedList:

def add_first(self, node):
    node.next = self.head
    self.head = node

In the above example, you’re setting self.head as the next reference of the new node so that the new node points to the old self.head. After that, you need to state that the new head of the list is the inserted node.

Here’s how it behaves with a sample list:

>>> llist = LinkedList()
>>> llist
None

>>> llist.add_first(Node("b"))
>>> llist
b -> None

>>> llist.add_first(Node("a"))
>>> llist
a -> b -> None

As you can see, add_first() always adds the node to the head of the list, even if the list was empty before.

Inserting at the End

Inserting a new node at the end of the list forces you to traverse the whole linked list first and to add the new node when you reach the end. You can’t just append to the end as you would with a normal list because in a linked list you don’t know which node is last.

Here’s an example implementation of a function for inserting a node to the end of a linked list:

def add_last(self, node):
    if not self.head:
        self.head = node
        return
    for current_node in self:
        pass
    current_node.next = node

First, you want to traverse the whole list until you reach the end (that is, until the for loop raises a StopIteration exception). Next, you want to set the current_node as the last node on the list. Finally, you want to add the new node as the next value of that current_node.

Here’s an example of add_last() in action:

>>> llist = LinkedList(["a", "b", "c", "d"])
>>> llist
a -> b -> c -> d -> None

>>> llist.add_last(Node("e"))
>>> llist
a -> b -> c -> d -> e -> None

>>> llist.add_last(Node("f"))
>>> llist
a -> b -> c -> d -> e -> f -> None

In the code above, you start by creating a list with four values (a, b, c, and d). Then, when you add new nodes using add_last(), you can see that the nodes are always appended to the end of the list.

Inserting Between Two Nodes

Inserting between two nodes adds yet another layer of complexity to the linked list’s already complex insertions because there are two different approaches that you can use:

  1. Inserting after an existing node
  2. Inserting before an existing node

It might seem weird to split these into two methods, but linked lists behave differently than normal lists, and you need a different implementation for each case.

Here’s a method that adds a node after an existing node with a specific data value:

def add_after(self, target_node_data, new_node):
    if not self.head:
        raise Exception("List is empty")

    for node in self:
        if node.data == target_node_data:
            new_node.next = node.next
            node.next = new_node
            return

    raise Exception("Node with data '%s' not found" % target_node_data)

In the above code, you’re traversing the linked list looking for the node with data indicating where you want to insert a new node. When you find the node you’re looking for, you’ll insert the new node immediately after it and rewire the next reference to maintain the consistency of the list.

The only exceptions are if the list is empty, making it impossible to insert a new node after an existing node, or if the list does not contain the value you’re searching for. Here are a few examples of how add_after() behaves:

>>> llist = LinkedList()
>>> llist.add_after("a", Node("b"))
Exception: List is empty

>>> llist = LinkedList(["a", "b", "c", "d"])
>>> llist
a -> b -> c -> d -> None

>>> llist.add_after("c", Node("cc"))
>>> llist
a -> b -> c -> cc -> d -> None

>>> llist.add_after("f", Node("g"))
Exception: Node with data 'f' not found

Trying to use add_after() on an empty list results in an exception. The same happens when you try to add after a nonexistent node. Everything else works as expected.

Now, if you want to implement add_before(), then it will look something like this:

 1 def add_before(self, target_node_data, new_node):
 2     if not self.head:
 3         raise Exception("List is empty")
 4 
 5     if self.head.data == target_node_data:
 6         return self.add_first(new_node)
 7 
 8     prev_node = self.head
 9     for node in self:
10         if node.data == target_node_data:
11             prev_node.next = new_node
12             new_node.next = node
13             return
14         prev_node = node
15 
16     raise Exception("Node with data '%s' not found" % target_node_data)

There are a few things to keep in mind while implementing the above. First, as with add_after(), you want to make sure to raise an exception if the linked list is empty (line 2) or the node you’re looking for is not present (line 16).

Second, if you’re trying to add a new node before the head of the list (line 5), then you can reuse add_first() because the node you’re inserting will be the new head of the list.

Finally, for any other case (line 9), you should keep track of the last-checked node using the prev_node variable. Then, when you find the target node, you can use that prev_node variable to rewire the next values.

Once again, an example is worth a thousand words:

>>> llist = LinkedList()
>>> llist.add_before("a", Node("a"))
Exception: List is empty

>>> llist = LinkedList(["b", "c"])
>>> llist
b -> c -> None

>>> llist.add_before("b", Node("a"))
>>> llist
a -> b -> c -> None

>>> llist.add_before("b", Node("aa"))
>>> llist.add_before("c", Node("bb"))
>>> llist
a -> aa -> b -> bb -> c -> None

>>> llist.add_before("n", Node("m"))
Exception: Node with data 'n' not found

With add_before(), you now have all the methods you need to insert nodes anywhere you’d like in your list.

How to Remove a Node

To remove a node from a linked list, you first need to traverse the list until you find the node you want to remove. Once you find the target, you want to link its previous and next nodes. This re-linking is what removes the target node from the list.

That means you need to keep track of the previous node as you traverse the list. Have a look at an example implementation:

 1 def remove_node(self, target_node_data):
 2     if not self.head:
 3         raise Exception("List is empty")
 4 
 5     if self.head.data == target_node_data:
 6         self.head = self.head.next
 7         return
 8 
 9     previous_node = self.head
10     for node in self:
11         if node.data == target_node_data:
12             previous_node.next = node.next
13             return
14         previous_node = node
15 
16     raise Exception("Node with data '%s' not found" % target_node_data)

In the above code, you first check that your list is not empty (line 2). If it is, then you raise an exception. After that, you check if the node to be removed is the current head of the list (line 5). If it is, then you want the next node in the list to become the new head.

If none of the above happens, then you start traversing the list looking for the node to be removed (line 10). If you find it, then you need to update its previous node to point to its next node, automatically removing the found node from the list. Finally, if you traverse the whole list without finding the node to be removed (line 16), then you raise an exception.

Notice how in the above code you use previous_node to keep track of the, well, previous node. Doing so ensures that the whole process will be much more straightforward when you find the right node to be deleted.

Here’s an example using a list:

>>> llist = LinkedList()
>>> llist.remove_node("a")
Exception: List is empty

>>> llist = LinkedList(["a", "b", "c", "d", "e"])
>>> llist
a -> b -> c -> d -> e -> None

>>> llist.remove_node("a")
>>> llist
b -> c -> d -> e -> None

>>> llist.remove_node("e")
>>> llist
b -> c -> d -> None

>>> llist.remove_node("c")
>>> llist
b -> d -> None

>>> llist.remove_node("a")
Exception: Node with data 'a' not found

That’s it! You now know how to implement a linked list and all of the main methods for traversing, inserting, and removing nodes. If you feel comfortable with what you’ve learned and you’re craving more, then feel free to pick one of the challenges below:

  1. Create a method to retrieve an element from a specific position: get(i) or even llist[i].
  2. Create a method to reverse the linked list: llist.reverse().
  3. Create a Queue() object inheriting this article’s linked list with enqueue() and dequeue() methods.

Apart from being great practice, doing some extra challenges on your own is an effective way to assimilate all the knowledge you’ve gained. If you want to get a head start by reusing all the source code from this article, then you can download everything you need at the link below:

Get the Source Code: Click here to get the source code you'll use to learn about linked lists in this tutorial.

Using Advanced Linked Lists

Until now, you’ve been learning about a specific type of linked list called singly linked lists. But there are more types of linked lists that can be used for slightly different purposes.

How to Use Doubly Linked Lists

Doubly linked lists are different from singly linked lists in that they have two references:

  1. The previous field references the previous node.
  2. The next field references the next node.

The end result looks like this:

[Image: example node of a doubly linked list]

If you wanted to implement the above, then you could make some changes to your existing Node class in order to include a previous field:

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None
        self.previous = None

This kind of implementation would allow you to traverse a list in both directions instead of only traversing using next. You could use next to go forward and previous to go backward.
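Here’s a minimal sketch of how those references line up, reusing the modified Node class above (the variable names are just illustrative):

head = Node("a")
tail = Node("b")
head.next = tail       # forward link
tail.previous = head   # backward link

node = tail
while node is not None:  # walk backward using the previous references
    print(node.data)     # prints b, then a
    node = node.previous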

In terms of structure, this is how a doubly linked list would look:

[Image: example structure of a doubly linked list]

You learned earlier that collections.deque uses a linked list as part of its data structure. This is the kind of linked list it uses. With doubly linked lists, deque is capable of inserting or deleting elements from both ends of a queue with constant O(1) performance.

How to Use Circular Linked Lists

Circular linked lists are a type of linked list in which the last node points back to the head of the list instead of pointing to None. This is what makes them circular. Circular linked lists have quite a few interesting use cases, such as cycling through each player’s turn in a multiplayer game or managing the application life cycle of an operating system.

This is what a circular linked list looks like:

[Image: example structure of a circular linked list]

One of the advantages of circular linked lists is that you can traverse the whole list starting at any node. Since the last node points to the head of the list, you need to make sure that you stop traversing when you reach the starting point. Otherwise, you’ll end up in an infinite loop.

In terms of implementation, circular linked lists are very similar to singly linked lists. The only difference is that you can define the starting point when you traverse the list:

class CircularLinkedList:
    def __init__(self):
        self.head = None

    def traverse(self, starting_point=None):
        if starting_point is None:
            starting_point = self.head
        node = starting_point
        while node is not None and (node.next != starting_point):
            yield node
            node = node.next
        yield node

    def print_list(self, starting_point=None):
        nodes = []
        for node in self.traverse(starting_point):
            nodes.append(str(node))
        print(" -> ".join(nodes))

Traversing the list now receives an additional argument, starting_point, that is used to define the start and (because the list is circular) the end of the iteration process. Apart from that, much of the code is the same as what we had in our LinkedList class.

To wrap up with a final example, have a look at how this new type of list behaves when you give it some data:

>>> circular_llist = CircularLinkedList()
>>> circular_llist.print_list()
None

>>> a = Node("a")
>>> b = Node("b")
>>> c = Node("c")
>>> d = Node("d")
>>> a.next = b
>>> b.next = c
>>> c.next = d
>>> d.next = a
>>> circular_llist.head = a
>>> circular_llist.print_list()
a -> b -> c -> d

>>> circular_llist.print_list(b)
b -> c -> d -> a

>>> circular_llist.print_list(d)
d -> a -> b -> c

There you have it! You’ll notice that you no longer have the None while traversing the list. That’s because there is no specific end to a circular list. You can also see that choosing different starting nodes will render slightly different representations of the same list.

Conclusion

In this article, you learned quite a few things! The most important are what linked lists are and when to use them, how to work with collections.deque, and how to implement your own singly, doubly, and circular linked lists.

If you want to learn more about linked lists, then check out Vaidehi Joshi’s Medium post for a nice visual explanation. If you’re interested in a more in-depth guide, then the Wikipedia article is quite thorough. Finally, if you’re curious about the reasoning behind the current implementation of collections.deque, then check out Raymond Hettinger’s thread.

You can download the source code used throughout this tutorial by clicking on the following link:

Get the Source Code: Click here to get the source code you'll use to learn about linked lists in this tutorial.

Feel free to leave any questions or comments below. Happy Pythoning!



April 01, 2020 02:00 PM UTC


PyCharm

What’s New in R Plugin

We’re releasing a new update of the R Plugin for PyCharm and other IntelliJ-based IDEs. If you haven’t tried the plugin yet, download it from our website.

The plugin is available for 2019.3 versions of IDEs and for EAP builds of 2020.1. The latest update comes with many stability improvements and long-awaited features:

1. You want your publications to look good, so we now make it easy to get your graphs in exactly the size you need.

When you execute any code chunk that plots a graph, just open the Plots tab and you’ll be able to export it as a portable network graphics (.png) file.

[Image: exporting a plot from the Plots tab]

2. Build interactive widgets with Shiny and embed them into your R Markdown files. The R plugin provides a separate file type and a Shiny runtime. Just add a code chunk with the widget and execute it.

[Image: using Shiny widgets in the R plugin]

3. Need more packages? How about creating your own? The latest build of the plugin comes with some cool features: a dedicated R Package project type with a prefabricated package structure and templates, and a Build tool window with handy instruments to install, check, and test a new package.

[Image: building an R package]

Interested?

Download PyCharm from our website and install the R plugin. See more details and installation instructions in PyCharm documentation.

April 01, 2020 08:40 AM UTC


Django Weblog

Django bugfix releases issued: 3.0.5 and 2.2.12

Today we've issued 3.0.5 and 2.2.12 bugfix releases.

The release package and checksums are available from our downloads page, as well as from the Python Package Index. The PGP key ID used for this release is Carlton Gibson: E17DF5C82B4F9D00.

Django 1.11 has reached the end of extended support.

Note that with this release, Django 1.11 has reached the end of extended support. All Django 1.11 users are encouraged to upgrade to Django 2.2 or later to continue receiving fixes for security issues.

See the downloads page for a table of supported versions and the future release schedule.

April 01, 2020 08:00 AM UTC


Tryton News

Newsletter April 2020

@ced wrote:

Tryton is a business software platform which comes with a set of modules that can be activated to make an ERP, MRP, CRM and other useful applications for organizations of any kind.
During this month, when most developers are social distancing, we recorded a lot of changes to prepare for the upcoming release 5.6 that is planned for the start of May.

Contents:

Changes for the User

The calendar view can now prevent the use of drag and drop if the dates of the event cannot be changed.

We now display the lines to pay on the invoice. This makes it easier to edit the payments’ due dates when invoice payment is renegotiated.

The reference for productions is now displayed on the record name. This follows the same schema as other documents like sales and purchases.

We have improved the form that is used when creating product variants. The non-editable fields inherited from the template are no longer displayed if they are empty. This helps avoid confusing users, who may mistakenly think they need to fill them in. When creating a template from the variant form, we do not create a default variant any more. This prevents creating two variants by mistake.

Invoicing projects is now more focused on the amount rather than the duration. This makes the invoicing process easier to extend and customize.

The inventory moves are now also planned. This prevents a glitch in the stock quantity calculation between stock receipt and inventorying.

It is now possible to automatically use the maturity date from the lines paid. So, on the payment wizard if the user leaves the payment date empty, Tryton will use the maturity date from the line if there is one.

We’ve added a process button to the purchase request (like on sale and purchase). With this the administrator can force the purchase request to be processed if there was an error when running the task in the queue.

We display the user name when requesting the password. This is useful when you’ve forgotten which user you’re logged in as.

The lost and found location has been removed from the inventory form; instead, the one defined on the warehouse or on a parent of the inventoried location is now used. This ensures that the same lost and found location is always used, and lets the user create other lost and found locations which are not used for inventories, such as scrap locations.

We now show the move resulting from the grouping wizard. This is useful as this move is not posted by default, so the user can check or edit the move before posting it.

It is now possible to define a common prefix for the product variants’ codes on the product template. And we enforce uniqueness on the variant codes so they can be used as SKUs.

It is now possible to set a unit of measure for a purchase request or requisition without a product. This follows the same behavior as the purchase order.

We now use the date of the statement line as the date of the clearing move when the line is linked to a payment.

We now support invoicing subscriptions in advance.

We continue to make invoicing projects more flexible by allowing use of products that are not based on time. In this case the unit price is considered to be the full price for the work.

Accounts from the chart of accounts can now be reported into different account types (on the balance sheet and income statement) depending on the sign of their balance. This is useful for some countries which have such reporting constraints.

We have added two new fields, contact and invoice party, to sale and purchase. This makes the form very flexible and suitable for almost all use cases.

We now keep the link between the credit lines and the stock moves of the original line. This will then allow us to calculate the exact price of a product, including any credit notes.

Deposit lines should be reconciled for performance reasons, but we never do a write-off on those lines. So the reconciliation wizard has been changed to only propose deposit accounts for reconciliation when they are balanced; otherwise they are automatically skipped.

We added a visual indicator on product and location quantities.

When a subscription is closed without an end date, we automatically set it to the greatest end date of the lines.

On the Spanish chart of accounts we have added tax rules for the régimen especial de agricultura, ganadería y pesca (the special regime for agriculture, livestock, and fishing).

New Modules

Account Cash Rounding

The account_cash_rounding module allows cash amounts to be rounded using the cash rounding factor of the currency.
When the invoice has to round the lines to pay, the excess amount is debited or credited to the accounts defined in the accounting configuration.

Changes for the System Administrator

As it may be complicated to configure the email server options, we added an option on trytond-admin to send a test email to ensure it is working.

Changes for the Developer

The speed of displaying a calendar containing a lot of events has been improved using a new strategy.

For future releases of the desktop client, we have added a build for 64bit versions of Windows which complements the existing 32bit build.

We have added an editable attribute to the calendar view which can be used to disable the use of drag and drop to edit events.

All the Python code has been cleaned up so it is 100% compliant with our style guidelines. This is enforced by using flake8 and our reviewbot 🤖.

The string format for proteus records follows the same pattern as on the server side. So it is <model name>,<id> which is compatible with the value used for Reference fields.

It is now possible to order parties according to their proximity to a reference party. The distance is calculated based on the number of steps it takes to get from the reference party to the other party by traveling along party relationships.

The DeactivableMixin automatically sets all the fields to read only when the record is deactivated. So the developer no longer has to manage this manually and this also enforces a common behavior through all the models.

We added support for Python 3.8, which means that our continuous integration system now runs tests for this version too.

We replaced the memoize tool with functools.lru_cache from the Python standard library. This reduces our code base and lets us profit from a more widely used implementation.

When copying a record with a binary field, we do not reprocess the data if it already has a file_id. This reduces the workload and improves the storage usage for any filestore that does not automatically detect duplicate data.

Triggers are now run in the queue. This ensures that internal changes that are reverted later in the same transaction do not trigger an action. This also ensures that records that cause a trigger only fire the same trigger once per transaction.

It is now possible to define depends fields in the view_attributes actions. The required field will be added to the view automatically if it is missing and the xpath expression is matched.


April 01, 2020 08:00 AM UTC


Brett Cannon

What the heck is pyproject.toml?

Recently on Twitter there was a maintainer of a Python project who had a couple of bugs filed against their project due to builds failing (this particular project doesn't provide wheels, only sdists). Eventually it came out that the project was using a pyproject.toml file because that's how you configure Black and not for any other purpose. This isn't the first time I have seen setuptools users use pyproject.toml because they were "told to by <insert name of tool>" without knowing the entire point behind the file. And so I decided to write this blog post to try and explain to setuptools users why pyproject.toml exists and what it does as it's the future of packaging in the Python ecosystem (if you are not a conda user 😉).

PEP 518 and pyproject.toml

I've blogged about this before, but the purpose of PEP 518 was to come up with a way for projects to specify what build tools they required. That's it, real simple and straightforward. Before PEP 518 and the introduction of pyproject.toml there was no way for a project to tell a tool like pip what build tools it required in order to build a wheel (let alone an sdist). Now setuptools has a setup_requires argument to specify what is necessary to build a project, but you can't read that setting unless you have setuptools installed, which meant you couldn't declare you needed setuptools to read the setting in setuptools. This chicken-and-egg problem is why tools like virtualenv install setuptools by default and why pip always injects setuptools and wheel when running a setup.py file regardless of whether you explicitly installed them. Oh, and don't even try to rely on a specific version of setuptools for building your project, as there was no way to specify that; you had to make do with whatever the user happened to have installed.

But PEP 518 and pyproject.toml changed that. Now a tool like pip can read pyproject.toml, see what build tools are specified in it, and install those in a virtual environment to build your project. That means you can rely on a specific version of setuptools and 'wheel' if you want. Heck, you can even build with a tool other than setuptools if you want (e.g. flit or Poetry, but since these other tools require pyproject.toml their users are already familiar with what's going on). The key point is assumptions no longer need to be made about what is necessary to build your project, which frees up the packaging ecosystem to experiment and grow.

PEP 517 and building wheels

With PEP 518 in place, tools knew what needed to be available in order to build a project into a wheel (or sdist). But how do you produce a wheel or sdist from a project that has a pyproject.toml? This is where PEP 517 comes in. That PEP specifies how build tools are to be executed to build both sdists and wheels. So PEP 518 gets the build tools installed and PEP 517 gets them executed. This opens the door to using other tools by standardizing how to run build tools. Before, there was no standardized way to build a wheel or sdist except with python setup.py sdist bdist_wheel which isn't really flexible; there's no way for the tool running the build to pass in environment details as appropriate, for instance. PEP 517 helped solve that problem.

One other change that PEP 517 & 518 has led to is build isolation. Now that projects can specify arbitrary build tools, tools like pip have to build projects in virtual environments to make sure each project's build tools don't conflict with another project's build tool needs. This also helps with reproducible builds by making sure your build tools are consistent.

Unfortunately this frustrates some setuptools users who didn't realize their setup.py file and/or build environment had become structured in such a way that it can't be built in isolation. For instance, one user was doing their builds offline and didn't have setuptools and 'wheel' in their local cache of wheels (aka their local wheelhouse), so when pip tried to build a project in isolation it failed, as pip couldn't find setuptools and 'wheel' to install into the build virtual environment.
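If you run into this yourself, one workaround is to pre-download the build requirements into your local wheelhouse so pip can find them during the isolated build (a sketch; ./wheelhouse is an arbitrary directory name of my choosing):

$ pip download setuptools wheel -d ./wheelhouse
$ pip wheel --no-index --find-links ./wheelhouse .

Alternatively, pip's --no-build-isolation flag opts out of build isolation entirely, at which point you are responsible for having the build tools installed yourself.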

Tools standardizing on pyproject.toml

An interesting side-effect of PEP 518 trying to introduce a standard file that all projects should (eventually) have is that non-build development tools realized they now had a file where they could put their own configuration. I say this is interesting because originally PEP 518 disallowed this, but people chose to ignore this part of the PEP 😄.  We eventually updated the PEP to allow for this use-case since it became obvious people liked the idea of centralizing configuration data in a single file.

And so now projects like Black, coverage.py, towncrier, and tox (in a way) allow you to specify their configurations in pyproject.toml instead of in a separate file. Occasionally you do hear people lament the fact that they are adding yet another configuration file to their project due to pyproject.toml. What I don't think people realize, though, is that these projects could have also created their own configuration files (and in fact both coverage.py and tox do support their own files). And so, thanks to projects consolidating around pyproject.toml, there's actually an argument to be made that there are now fewer configuration files than before.
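For example, Black reads its settings from a [tool.black] table and coverage.py from [tool.coverage.*] tables; a minimal sketch (the particular option values here are just illustrative choices of mine):

[tool.black]
line-length = 88
target-version = ["py37"]

[tool.coverage.run]
branch = true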

How to use pyproject.toml with setuptools

Hopefully I have convinced you to introduce pyproject.toml into your setuptools-based project so you get benefits like build isolation and the ability to specify the version of setuptools you want to depend on. Now you might be wondering what your pyproject.toml should consist of. Unfortunately no one has had the time to document all of this for setuptools yet, but luckily the issue tracking the addition of that documentation outlines what is necessary:

[build-system]
requires = ["setuptools >= 40.6.0", "wheel"]
build-backend = "setuptools.build_meta"
A pyproject.toml file for setuptools users

With that you get to participate in the PEP 517 world of standards! 😉 And as I said, you can now rely on a specific version of setuptools and get build isolation as well (which is why the current directory is not put on sys.path automatically; you will need sys.path.insert(0, os.path.dirname(__file__)) or equivalent if you're importing local files).

But there's a bonus if you use a pyproject.toml file with a setup.cfg configuration for setuptools: you don't need a setup.py file anymore! Since tools like pip are going to call setuptools using the PEP 517 API instead of setup.py it means you can delete that setup.py file!

Unfortunately there is one hitch with dropping the setup.py file: if you want editable installs you still need a setup.py shim, but that's true of any build tool that isn't setuptools as there isn't a standard for editable installs (yet; people have talked about standardizing it and sketched it out, but no one has had the time to implement a proof-of-concept and then the eventual PEP). Luckily the shim to keep editable installs is really small:

#!/usr/bin/env python

import setuptools

if __name__ == "__main__":
    setuptools.setup()
A setup.py shim for use with pyproject.toml and setup.cfg

You could even simplify this down to import setuptools; setuptools.setup() if you really wanted to.

Where all of this is going

What all of this comes down to is the Python packaging ecosystem is working towards basing itself on standards. And those standards are all working towards standardizing artifacts and how to work with them. For instance, if we all know how wheels are formatted and how to install them then you don't have to care about how the wheel is made, just that a wheel exists for the thing you want to install and that it follows the appropriate standards. If you keep pushing this out and standardize more and more it makes it much easier for tools to communicate via artifacts and provide freedom for people to use whatever software they want to produce those artifacts.

For instance, you may have noticed I keep saying "tools like pip" instead of just saying "pip". That's been entirely on purpose. By making all of these standards it means tools don't have to rely solely on pip to do things because "that's how pip does it". As an example, tox could install a wheel by itself by using a library like pep517 to do the building of a wheel and then use another library like distlib to do the wheel installation.

Standards also take out the guessing as to whether something is on purpose or not. This becomes important to make sure everyone agrees on how things should work. There's also coherency as standards start to build on each other and flow into one another nicely. There's also less arguing (eventually 😉) as everyone works toward the same thing that everyone agreed to earlier.

It also takes pressure off of setuptools. It doesn't have to try and be everything to everyone as people can now choose the tool that best fits their project and development style. Same goes for pip.

Besides, don't we all want the platypus? 😉

April 01, 2020 04:25 AM UTC

March 31, 2020


Zero-with-Dot (Oleg Żero)

Hidden Markov Model - A story of the morning insanity

Introduction

In this article, we present an example of an (im-)practical application of the Hidden Markov Model (HMM). It is an artificially constructed problem, where we create a case for a model, rather than applying a model to a particular case… although, maybe a bit of both.

Here, we will rely on the code we developed earlier and discussed in the previous article: “Hidden Markov Model - Implementation from scratch”, including the mathematical notation. Feel free to take a look. The story we are about to tell contains modeling of the problem, uncovering the hidden sequence and training of the model.

Let the story begin…

Picture the following scenario: it’s 7 a.m. You’re preparing to go to work. In practice, that means you are running like crazy between different rooms. You spend some random amount of time in each, doing something, hoping to get everything you need sorted before you leave.

Sounds familiar?

To make things worse, your girlfriend (or boyfriend) has cats. The little furballs want to eat. Due to the morning hustle, it is uncertain whether you will remember to feed them. If you don’t, the cats will be upset… and so will your girlfriend if she finds out.

Modeling the situation

Say your flat has four rooms: the kitchen, bathroom, living room and bedroom. You spend some random amount of time in each, and transition between the rooms with certain probabilities. At the same time, wherever you go, you are likely to make some distinct kinds of noises. Your girlfriend hears these noises and, despite being still asleep, she can infer in which room you are spending your time.

And so she does that day by day. She wants to make sure that you do feed the cats.

However, since she can’t be there, all she can do is to place the cat food bag in a room where you supposedly stay the longest. Hopefully, that will increase the chances that you do feed the “beast” (and save your evening).

Markovian view

From the Markovian perspective, the rooms are the hidden states (N = 4 in our case). Every minute (or any other time constant), we transition from one room to another. The probabilities associated with these transitions are the elements of the matrix A.

At the same time, there exist M = 9 distinct observable noises your girlfriend can hear: coffee, dishes, flushing, radio, shower, silence, television, toothbrush and wardrobe.

The probabilities of their occurrence given a state are the coefficients of the matrix B. In principle, any of these could originate from you being in an arbitrary room (state). In practice, however, there is physically little chance you pulled the toilet-trigger while being in the kitchen, so some of the coefficients of B will be close to zero.

Most importantly, as you hop from one room to the other, it is reasonable to assume that whichever room you go to depends only on the room you have just been in. In other words, the state at time t depends on the state at time t-1 only, especially if you are half brain-dead with the attention span of a goldfish…
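Written out, this is the standard first-order Markov assumption (the symbol $s_t$ for the room occupied at minute $t$ is my shorthand, consistent with the usual HMM convention):

$$ P(s_t \mid s_{t-1}, s_{t-2}, \dots, s_1) = P(s_t \mid s_{t-1}) $$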

Uncovering the hidden states

The goal

For the first attempt, let’s assume that the probability coefficients are known. This means that we have a model λ = (A, B, π), and our task is to estimate the latent state sequence given the observation sequence, which corresponds to finding the state sequence X that maximizes P(X | O, λ). In other words, the girlfriend wants to establish in which room you spend the most time, given what she hears.

Initialization

Let’s initialize our and .

Transition matrix A:

              bathroom  bedroom  kitchen  living room
bathroom          0.90     0.08     0.01         0.01
bedroom           0.01     0.90     0.05         0.04
kitchen           0.03     0.02     0.85         0.10
living room       0.05     0.02     0.23         0.70

Emission matrix B:

              coffee  dishes  flushing  radio  shower  silence  television  toothbrush  wardrobe
bathroom        0.01    0.01      0.20   0.01    0.30     0.05        0.01        0.40      0.01
bedroom         0.01    0.01      0.01   0.10    0.01     0.30        0.05        0.01      0.50
kitchen         0.30    0.20      0.01   0.10    0.01     0.30        0.05        0.02      0.01
living room     0.03    0.01      0.01   0.19    0.01     0.39        0.39        0.01      0.03

Initial state distribution π:

              bathroom  bedroom  kitchen  living room
                     0        1        0            0

... # all initializations as explained in the other article
hml = HiddenMarkovLayer(A, B, pi)
hmm = HiddenMarkovModel(hml)

Simulation

Having defined and , let’s see how a typical “morning insanity” might look like. Here, we assume that the whole “circus” lasts 30 minutes, with one-minute granularity.

observations, latent_states = hml.run(30)
pd.DataFrame({'noise': observations, 'room': latent_states})
t noise room
0 radio bedroom
1 wardrobe bedroom
2 silence bedroom
3 wardrobe bedroom
4 silence living room
5 coffee bedroom
6 wardrobe bedroom
7 wardrobe bedroom
8 radio bedroom
9 wardrobe kitchen

The table above shows the first ten minutes of the sequence. We can see that it kind of makes sense, although we have to note that the girlfriend does not know what room we visited. This sequence is hidden from her.

However, as presented in the previous article, we can guess the statistically most favorable sequence of rooms given the observations. The problem is addressed with the .uncover method.

estimated_states = hml.uncover(observations)
pd.DataFrame({'estimated': estimated_states, 'real': latent_states})
t estimated real
0 bedroom bedroom
1 bedroom bedroom
2 bedroom bedroom
3 bedroom bedroom
4 bedroom living room
5 bedroom bedroom
6 bedroom bedroom
7 bedroom bedroom
8 bedroom bedroom
9 bedroom kitchen

Comparing the results, we get the following count:

              estimated time (min)  real time (min)
bathroom                         1                2
bedroom                         18               17
kitchen                          4               10
living room                      8                2

The resulting estimate gives 12 correct matches. Although this may seem like not much (only ~40% accuracy), it is 1.6 times better than a random guess.

Furthermore, we are not interested in matching the elements of the sequences here anyway. What interests us more is to find the room that you spend the most amount of time in. According to the simulation, you spend as much as 17 minutes in the bedroom. This estimate is off by one minute from the real sequence, which translates to ~6% relative error. Not that bad.

According to these results, the cat food station should be placed in the bedroom.

Training the model

In the last section, we relied on the assumption that the intrinsic transition and observation probabilities are known. In other words, your girlfriend must have been watching you pretty closely, essentially collecting data about you. Otherwise, how else would she be able to formulate a model?

Although this may sound like total insanity, the good news is that our model is also trainable. Given a sequence of observations, it is possible to train the model and then use it to examine the hidden variables.

Let’s take an example sequence of what your girlfriend could have heard some crazy Monday morning. You woke up. Being completely silent for about 3 minutes, you went about to look for your socks in a wardrobe. Having found what you needed (or not), you went silent again for five minutes and flushed the toilet. Immediately after, you proceeded to take a shower (5 minutes), followed by brushing your teeth (3 minutes), although you turn the radio on in between. Once you were done, you turned the coffee machine on, watched TV (3 minutes), and did the dishes.

So, the observable sequence goes as follows:

what_she_heard = ['silence']*3 \
+ ['wardrobe'] \
+ ['silence']*5 \
+ ['flushing'] \
+ ['shower']*5 \
+ ['radio']*2 \
+ ['toothbrush']*3 \
+ ['coffee'] \
+ ['television']*3 \
+ ['dishes']

rooms = ['bathroom', 'bedroom', 'kitchen', 'living room']
pi = PV({'bathroom': 0, 'bedroom': 1, 'kitchen': 0, 'living room': 0}) 

The starting point is the bedroom, but A and B are unknown. Let’s initialize the model and train it on the observation sequence.

np.random.seed(3)

model = HiddenMarkovModel.initialize(rooms, 
	list(set(what_she_heard)))
model.layer.pi = pi
model.train(what_she_heard, epochs=100)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.semilogy(model.score_history)
ax.set_xlabel('Epoch')
ax.set_ylabel('Score')
ax.set_title('Training history')
plt.grid()
plt.show()
Figure 1. Training history using 100 epochs. To train means to maximise the score.

Now, after training the model, the predicted sequence goes as follows:

pd.DataFrame(zip(
	what_she_heard, 
	model.layer.uncover(what_she_heard)), 
	columns=['the sounds you make', 'her guess on where you are'])
t the sounds you make her guess on where you are
0 silence bedroom
1 silence bedroom
2 silence bedroom
3 wardrobe bedroom
4 silence bedroom
5 silence bedroom
6 silence bedroom
7 silence bedroom
8 silence bedroom
9 flushing bathroom
10 shower bathroom
11 shower bathroom
12 shower bathroom
13 shower bathroom
14 shower bathroom
15 radio kitchen
16 radio kitchen
17 toothbrush living room
18 toothbrush living room
19 toothbrush living room
20 coffee living room
21 television living room
22 television living room
23 television living room
24 dishes living room
state (guessed) total time steps
bathroom 6
bedroom 9
kitchen 2
living room 8

According to the table above, it is evident that the cat food should be placed in the bedroom.

However, it is important to note that this result is somewhat of a nice coincidence, because the model was initialized from a purely random state. Consequently, we had no control over the direction it would evolve in with respect to the labels. In other words, the names of the hidden states are simply abstract to the model. They are our convention, not the model’s. Consequently, the model could have just as well associated “shower” with the “kitchen” and “coffee” with the “bathroom”, in which case the model would still be correct, but to interpret the results we would need to swap the labels.

Still, in our case, the model seems to have trained to output something fairly reasonable, without the need to swap the labels.

Conclusion

Hopefully, we have shed a bit of light on this whole story of morning insanity using the Hidden Markov Model approach.

In this short story, we have covered two study cases. The first case assumed that the probability coefficients were known. Using these coefficients, we could define the model and uncover the latent state sequence given the observation sequence. The second case represented the opposite situation. The probabilities were not known and so the model had to be trained first in order to output the hidden sequence.

Closing remark

The situation described here is a real situation that the author faces every day. And yes… the cats survived. ;)

March 31, 2020 10:00 PM UTC


Stack Abuse

Reading and Writing MS Word Files in Python via Python-Docx Module

The MS Word utility from the Microsoft Office suite is one of the most commonly used tools for writing text documents, both simple and complex. Though humans can easily read and write MS Word documents (assuming they have the Office software installed), often you need to read text from Word documents within another application.

For instance, if you are developing a natural language processing application in Python that takes MS Word files as input, you will need to read MS Word files in Python before you can process the text. Similarly, often you need to write text to MS Word documents as output, which could be a dynamically generated report to download, for example.

In this article, you will see how to read and write MS Word files in Python.

Installing Python-Docx Library

Several libraries exist that can be used to read and write MS Word files in Python. However, we will be using the python-docx module owing to its ease-of-use. Execute the following pip command in your terminal to download the python-docx module as shown below:

$ pip install python-docx

Reading MS Word Files with Python-Docx Module

In this section, you will see how to read text from MS Word files via the python-docx module.

Create a new MS Word file and rename it as "my_word_file.docx". I saved the file in the root of my "E" directory, although you can save the file anywhere you want. The my_word_file.docx file should have the following content:

(Screenshot: the contents of my_word_file.docx)

To read the above file, first import the docx module and then create an object of the Document class from the docx module. Pass the path of the my_word_file.docx to the constructor of the Document class, as shown in the following script:

import docx

doc = docx.Document("E:/my_word_file.docx")

The Document class object doc can now be used to read the content of the my_word_file.docx.

Reading Paragraphs

Once you create an object of the Document class using the file path, you can access all the paragraphs in the document via the paragraphs attribute. An empty line is also read as a paragraph by the Document. Let's fetch all the paragraphs from the my_word_file.docx and then display the total number of paragraphs in the document:

all_paras = doc.paragraphs
len(all_paras)

Output:

10

Now we'll iteratively print all the paragraphs in the my_word_file.docx file:

for para in all_paras:
    print(para.text)
    print("-------")

Output:

-------
Introduction
-------

-------
Welcome to stackabuse.com
-------
The best site for learning Python and Other Programming Languages
-------
Learn to program and write code in the most efficient manner
-------

-------
Details
-------

-------
This website contains useful programming articles for Java, Python, Spring etc.
-------

The output shows all of the paragraphs in the Word file.

We can even access a specific paragraph by indexing the paragraphs property like an array. Let's print the 5th paragraph in the file:

single_para = doc.paragraphs[4]
print(single_para.text)

Output:

The best site for learning Python and Other Programming Languages

Reading Runs

A run in a Word document is a continuous sequence of words having similar properties, such as similar font size, shape, and style. For example, if you look at the second line of my_word_file.docx, it contains the text "Welcome to stackabuse.com". Here, the text "Welcome to" is in plain font, while the text "stackabuse.com" is in bold face. Hence, the text "Welcome to" is considered one run, while the bold-faced text "stackabuse.com" is considered another run.

Similarly, "Learn to program and write code in the" and "most efficient manner" are treated as two different runs in the paragraph "Learn to program and write code in the most efficient manner".

To get all the runs in a paragraph, you can use the runs property of the paragraph object.

Let's read all the runs from paragraph number 5 (4th index) in our text:

single_para = doc.paragraphs[4]
for run in single_para.runs:
    print(run.text)

Output:

The best site for
learning Python
 and Other
Programming Languages

In the same way, the following script prints all the runs from the 6th paragraph of the my_word_file.docx file:

second_para = doc.paragraphs[5]
for run in second_para.runs:
    print(run.text)

Output:

Learn to program and write code in the
most efficient manner

Writing MS Word Files with Python-Docx Module

In the previous section, you saw how to read MS Word files in Python using the python-docx module. In this section, you will see how to write MS Word files via the python-docx module.

To write MS Word files, you have to create an object of the Document class with an empty constructor, or without passing a file name.

mydoc = docx.Document()

Writing Paragraphs

To write paragraphs, you can use the add_paragraph() method of the Document class object. Once you have added a paragraph, you will need to call the save() method on the Document class object. The path of the file to which you want to write your paragraph is passed as a parameter to the save() method. If the file doesn't already exist, a new file will be created; if it does, it will be overwritten with the current contents of the Document object. Since we keep adding paragraphs to the same Document object before saving, each new paragraph appears after the previous ones in the saved file.

The following script writes a simple paragraph to a newly created MS Word file named "my_written_file.docx".

mydoc.add_paragraph("This is first paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")

Once you execute the above script, you should see a new file "my_written_file.docx" in the directory that you specified in the save() method. Inside the file, you should see one paragraph which reads "This is first paragraph of a MS Word file."

Let's add another paragraph to the my_written_file.docx:

mydoc.add_paragraph("This is the second paragraph of a MS Word file.")
mydoc.save("E:/my_written_file.docx")

This second paragraph will appear after the first one, since both paragraphs were added to the same Document object before saving.

Writing Runs

You can also write runs using the python-docx module. To write runs, you first have to create a handle for the paragraph to which you want to add your run. Take a look at the following example to see how it's done:

third_para = mydoc.add_paragraph("This is the third paragraph.")
third_para.add_run(" this is a section at the end of third paragraph")
mydoc.save("E:/my_written_file.docx")

In the script above we write a paragraph using the add_paragraph() method of the Document class object mydoc. The add_paragraph() method returns a handle for the newly added paragraph. To add a run to the new paragraph, you need to call the add_run() method on the paragraph handle. The text for the run is passed in the form of a string to the add_run() method. Finally, you need to call the save() method to create the actual file.

Writing Headers

You can also add headings to MS Word files. To do so, you need to call the add_heading() method. The first parameter to the add_heading() method is the text string for the heading, and the second parameter is the heading level. Heading levels start from 0, with 0 being the top-level (title) heading.

The following script adds three headings of levels 0, 1, and 2 to the file my_written_file.docx:

mydoc.add_heading("This is level 1 heading", 0)
mydoc.add_heading("This is level 2 heading", 1)
mydoc.add_heading("This is level 3 heading", 2)
mydoc.save("E:/my_written_file.docx")

Adding Images

To add images to MS Word files, you can use the add_picture() method. The path to the image is passed as a parameter to the add_picture() method. You can also specify the width and height of the image using the docx.shared.Inches() helper. The following script adds an image from the local file system to the my_written_file.docx Word file. The width and height of the image will be 5 and 7 inches, respectively:

mydoc.add_picture("E:/eiffel-tower.jpg", width=docx.shared.Inches(5), height=docx.shared.Inches(7))
mydoc.save("E:/my_written_file.docx")

After executing all the scripts in the Writing MS Word Files with Python-Docx Module section of this article, your final my_written_file.docx file should look like this:

(Screenshot: the final contents of my_written_file.docx)

In the output, you can see the three paragraphs that you added to the MS Word file, along with the three headings and one image.

Conclusion

This article gave a brief overview of how to read and write MS Word files using the python-docx module. It covered how to read paragraphs and runs from within an MS Word file. Finally, it explained the process of writing MS Word files: adding paragraphs, runs, headings, and images.

March 31, 2020 08:06 PM UTC


PyCoder’s Weekly

Issue #414 (March 31, 2020)

#414 – MARCH 31, 2020
View in Browser »



Automatically Finding Codenames Clues With GloVe Vectors

In the Czech boardgame Codenames, one player must come up with a single-word clue that prompts their teammates to select certain words from a 5x5 board while simultaneously avoiding a “bomb” word that, if selected, causes the team to lose. In this article, James Somers explores how to generate clues automatically using Global Vectors for Word Representations—with surprising results.
JAMES SOMERS

Learn Python Skills While Creating Games

In this episode, Christopher interviews Jon Fincher from the Real Python Team. Jon talks about his recent articles on PyGame and Arcade. They discuss if game programming is a good way to develop your Python programming skills, and if a game would make a good portfolio piece.
REAL PYTHON podcast

Automate & Standardize Code Reviews for Python


Take the hassle out of code reviews - Codacy flags errors automatically, directly from your Git workflow. Customize standards on coverage, duplication, complexity & style violations. Use in the cloud or on your servers for 30 different languages. Get started for free →
CODACY sponsor

SimPy: Simulating Real-World Processes With Python

See how you can use the SimPy package to model real-world processes with a high potential for congestion. You’ll create an algorithm to approximate a complex system, and then you’ll design and run a simulation of that system in Python.
REAL PYTHON

psycopg3: A First Report

“What’s left? Well, a lot! Now that the basic machinery is in place, and Python can send and retrieve bytes to and from Postgres, it’s time to attack the adaptation layer.”
DANIELE VARRAZZO

Cognitive Biases in Software Development

“In this post, I will try to answer the question: why do we feel weird about technical solutions?” Not strictly Python-focused, but absolutely worth a read.
STANISLAV MYACHENKOV

Learning Pandas by Exploring COVID-19 Data

“Use the pandas data analysis tool to explore the free COVID-19 data set provided by the European Centre for Disease Prevention and Control.”
MATT MAKAI

How Long Did It Take You to Learn Python?

“Wait, don’t answer that. It doesn’t matter.”
NED BATCHELDER

Discussions

Are F-Strings Good Practice?

REDDIT

What Resources Would You Recommend to Learn Socket Programming?

LOBSTE.RS

Python Jobs

Python Tutorial Authors Wanted (100% Remote)

Real Python

Software Engineer - PyTorch (Remote)

CyberCoders

Senior Python Engineer (Remote)

CyberCoders

Python Quant Developer With Data Focus (Remote)

eFinancial Careers

Senior Python Developer With Django or Flask (Remote)

Botsford Associates LLC

More Python Jobs >>>

Articles & Tutorials

Django vs. Flask in 2019: Which Framework to Choose [2019]

”[…] Django and Flask are by far the two most popular Python web frameworks… In the end, both frameworks are used to develop web applications. The key difference lies in how they achieve this goal. Think of Django as a car and Flask as a bike. Both can get you from point A to point B, but their approaches are quite different.”
MICHAEL HERMAN

How to Calculate Feature Importance With Python

“Feature importance scores play an important role in a predictive modeling project, including providing insight into the data, insight into the model, and the basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem. In this tutorial, you will discover feature importance scores for machine learning in Python.”
JASON BROWNLEE

Get Inspiration for Your Next Machine Learning Project


Python is a leading language for machine learning, and businesses worldwide are using the technology to get ahead. Explore 22 examples of how ML is being applied across four industries →
STXNEXT sponsor

Using Markdown in Django

“As developers, we rely on static analysis tools to check, lint and transform our code. We use these tools to help us be more productive and produce better code. However, when we write content using markdown the tools at our disposal are scarce. In this article we describe how we developed a Markdown extension to address challenges in managing content using Markdown in Django sites.”
HAKI BENITA

Using WSL to Build a Python Development Environment on Windows

Learn how to set up a development environment on Windows that leverages the power of the Windows Subsystem for Linux (WSL). From installing WSL, interfacing with Windows Terminal, setting up VS Code, and even running graphical Linux apps, this comprehensive tutorial covers everything you need to know to set up a development workflow.
CHRIS MOFFITT

How to Use any() in Python

If you’ve ever wondered how to simplify complex conditionals by determining if at least one in a series of conditions is true, then look no further. This tutorial will teach you all about how to use any() in Python to do just that.
REAL PYTHON

Profile, Understand & Optimize Code Performance

You can’t improve what you can’t measure. Profile and understand Python code’s behavior and performance (Wall-time, I/O, CPU, HTTP requests, SQL queries). Browse through appealing graphs. Blackfire.io is now available as Public Beta. New features added regularly.
BLACKFIRE sponsor

Comparing Python Objects the Right Way: “is” vs “==”

Learn when to use the Python is, is not, == and != operators. You’ll see what these comparison operators do under the hood, dive into some quirks of object identity and interning, and define a custom class.
REAL PYTHON video

The Usefulness of Python’s Permutations and Combinations Functions

Learn how the permutations() and combinations() functions in Python’s itertools module can help you write cleaner and faster for loops.
KEVIN DAWE

TLDR Newsletter: Byte Sized News for Techies

TLDR is a daily, curated newsletter with links and TLDRs of the most interesting stories in tech, science, and programming.
TLDRNEWSLETTER.COM

Projects & Code

streamz: Build Pipelines to Manage Continuous Streams of Data

GITHUB.COM/PYTHON-STREAMZ

pyliveupdate: Runtime Code Manipulation Framework for Profiling, Debugging and Bugfixing

GITHUB.COM/DEVOPSPP

meshy: Interactive Graph of Your Django Model Structure

GITHUB.COM/MESHY • Shared by Charlie Denton

rain: Live Example to Illustrate Python Packaging, Testing, Building, and Deploying

GITHUB.COM/ANKUR-GUPTA

pm4py-source: Process Mining for Python

GITHUB.COM/PM4PY

convtools: Generates Python Code of Conversions, Aggregations, Joins

GITHUB.COM/ITECHART-ALMAKOV

rapidfuzz: Rapid Fuzzy String Matching in Python

GITHUB.COM/RHASSPY

Python 3 Maze Generator With Disjoint Set

GITHUB.COM/SCHEDUTRON

glow: Compiler for Neural Network Hardware Accelerators

GITHUB.COM/PYTORCH

pywharf: Host Your Python Package on GitHub

GITHUB.COM/PYWHARF • Shared by HUNT ZHAN

typer-cli: Run Typer Scripts With Auto-Completion

GITHUB.COM/TIANGOLO • Shared by Sebastián Ramírez

Events

PyLadies Remote Lightning Talks

March 28th, 2020 (Remote via Zoom)
PYLADIES


Happy Pythoning!
This was PyCoder’s Weekly Issue #414.
View in Browser »


[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

March 31, 2020 07:30 PM UTC


Continuum Analytics Blog

Securing Pangeo with Dask Gateway

This post is also available on the Pangeo blog. Over the past few weeks, we have made some exciting changes to Pangeo’s cloud deployments. These changes will make using Pangeo’s clusters easier for users while…

The post Securing Pangeo with Dask Gateway appeared first on Anaconda.

March 31, 2020 03:42 PM UTC


Real Python

Comparing Python Objects the Right Way: "is" vs "=="

There’s a subtle difference between the Python identity operator (is) and the equality operator (==). Your code can run fine when you use the Python is operator to compare numbers, until it suddenly doesn’t. You might have heard somewhere that the Python is operator is faster than the == operator, or you may feel that it looks more Pythonic. However, it’s crucial to keep in mind that these operators don’t behave quite the same.

The == operator compares the value or equality of two objects, whereas the Python is operator checks whether two variables point to the same object in memory. In the vast majority of cases, this means you should use the equality operators == and !=, except when you’re comparing to None.
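A quick sketch of the difference (the list values here are arbitrary):

>>> a = [1, 2, 3]
>>> b = [1, 2, 3]
>>> a == b        # True: the two lists hold equal values
True
>>> a is b        # False: they are two distinct objects in memory
False
>>> b = a
>>> a is b        # True: both names now point to the same object
True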

In this course, you’ll learn:


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

March 31, 2020 02:00 PM UTC


EuroPython

EuroPython 2020: Online conference from July 23-26

In the last two weeks, we have discussed and investigated concepts around running this year’s EuroPython conference as an online conference. We have looked at conference tools, your feedback, drafted up ideas on what we can do to make the event interesting and what we can accomplish given our limited resources.

Today, we are happy to announce that we will be running

EuroPython 2020
from July 23 - 26 2020
as an online conference


We are planning the following structure:

Attending the conference days will require a ticket, participating in the sprint days will be free.

We will publish additional information on the new format as it becomes available. A few updates for today (more details will follow in separate blog posts):

Call for Papers (CFP)

With the new plan in place, we will extend and retarget the CFP we had been running in the last three weeks to the online setup.

Please note that we will not have training sessions at EuroPython 2020. We will have keynotes, 30 and 45-minute talks, panels, interactive sessions, as well as try to come up with a format for doing posters and lightning talks.

Unlike for our in-person event, speakers will get free tickets to the event, since we don’t have to cover catering costs.

We need your help

Given that we had already put a lot of work into the in-person event organization, a lot of which we’ll now have to adapt or redo for the online setup in the few months ahead of us, we will need more help from the community to make this happen.

If you would like to help, please write to board@europython.eu. We are specifically looking for people with experience using online conference tools to help host tracks during the conference days.

Sponsoring

As for the in-person event, we will have sponsorship packages available for the online event as well. Because the format is different, we’ll have to adjust the packages we had intended for the in-person event to the online setup.

If you are interested in sponsoring EuroPython 2020, please write to sponsoring@europython.eu. We will then send you more details, as they become available.

Thanks,

EuroPython 2020 Team
https://ep2020.europython.eu/
https://www.europython-society.org/

March 31, 2020 11:16 AM UTC


Python Software Foundation

PSF's Projected 2020 Financial Outcome

The Python Software Foundation (PSF) is a 501(c)(3) non-profit organization dedicated to the Python community and programming language, as well as running PyCon US. Since PyCon US 2020 was cancelled, the community has asked how the PSF’s finances will be affected. Let us take a look at the projected 2020 financial outcome.

Bottom Line

As of today, the PSF will use approximately $627,000 from our financial reserve:
        Expenses    Revenue     Net
PSF     $1,300,000  $550,000    -$750,000
PyCon   $280,000    $403,000    $123,000
Total                           -$627,000
If you are interested in how we arrived at these estimates, continue reading to learn about our projected expenses and revenue for this year. 

Expenses

PyCon US 2020

Pittsburgh and its vendors have been incredibly helpful in reducing or eliminating most of the 2020 conference minimums and cancellation fees. We estimate $280,000 in expenses for pre-conference work related to website/logo design and nonrefundable deposits. In addition, we budgeted significant funds to support travel grantees with non-reimbursable costs, as well as executing PyCon 2020 remote content. Once travel grants and instructor fees are complete, we will revise the expense total. 

PSF

Through March 2020, the PSF awarded several grants*, expended legal fees to protect PyLadies trademarks in dozens of countries, and employed staff. The PSF is projected to spend $1,300,000 in 2020.

Revenue

PyCon US 2020

PyCon US registration and sponsorship revenue is used to produce PyCon, with the largest costs going to food, audio-visual services, and travel grants. 
Our staff works to create the best and most affordable attendee experience possible with the added benefit that 100% of net proceeds fund the PSF. For 2020, we estimated PyCon’s net income at $720,000. As of today, we are estimating PyCon's net income to be $123,000, thanks to individual donations and sponsorship fees. 

PSF

PSF 2020 sponsorships are estimated at $350,000. COVID-19 is impacting financial markets and job security, so we expect individual donations and memberships to decrease in 2020 by 55% from 2019 to around $200,000. 

How can you help?

The PSF’s financial reserve is crucial, as we experienced during the economic downturn of 2008 and again in 2020. The cash reserve prepares us for economic impacts, events out of our control, and provides a stable environment with health benefits for our employees, even during this difficult time. 
Here are ways community members can help and get involved:

We wish our entire community good health.
* PSF Grants: When PyCon 2020 was cancelled, the PSF paused its Grants Program until we can find virtual options and other ways to support events, as well as fully understand the PSF’s financial situation.

March 31, 2020 11:00 AM UTC


Codementor

Michael Kennedy almost learned Python in the 90s... and other things I learned recording his DevJourney

Michael Kennedy is a podcaster and a trainer. After interviewing him for the DevJourney podcast, here are the key takeaways I personally took out of the discussion.

March 31, 2020 10:58 AM UTC


Kushal Das

Introducing ManualBox project

One of the major security features of QubesOS is the file vaults, where access to specific files can only happen via user input in a GUI applet. The same goes for split-ssh, where the user has to allow access to the ssh key (which actually lives on a different VM).

I was hoping to have similar access control for important dotfiles with passwords, ssh private keys, and other similar files on my regular desktop system. So I am introducing ManualBox, which can provide similar access control on normal Linux desktops or even on Mac.

(Animated GIF: ManualBox usage demo)

How to install?

Follow the installation guide for Mac in the wiki. For Linux, we are yet to package the application, but you can run it directly from the source (without installing):

git clone https://github.com/kushaldas/manualbox.git
cd manualbox

On Fedora

sudo dnf install python3-cryptography python3-qt5 python3-fusepy python3-psutil fuse -y

On Debian

sudo apt install python3-cryptography python3-pyqt5 python3-fusepy python3-psutil fuse

Usage guide

To start the application from source:

On Linux:

./devscripts/manualbox

On Mac:

Click on the App icon like any other application.

If you are running the tool for the first time, it will create a new manualbox and mount it in the ~/secured directory. It will also give you the password; please store it somewhere securely, as you will need it to mount the filesystem the next time.

(Screenshot: initial screen)

After selecting (or you can directly type) the mount path (must be an empty directory), you should type in the password, and then click on the Mount button.

(Screenshot: filesystem mounted)

Now, if you try to access any file, the tool will show a system notification, and you can either Allow or Deny via the following dialog.

(Screenshot: allow or deny access dialog)

Every time you allow file access, it shows the notification message via the system tray icon.

(Screenshot: file access notification)

To exit the application, first click on Unmount, then right-click on the systray icon and click Exit, or close via the window close button.

(Screenshot: exiting the application)

Usage examples (think about your important dotfiles with passwords/tokens)

Note: If you open the mounted directory path from a GUI file browser, you will get too many notifications, as these browsers open the file many times separately. It is better to have your GUI application or command line tool use those files as required.

Thunderbird

You can store part of your Thunderbird profile in this tool, for example logins.json, which holds your saved passwords. That way, Thunderbird needs your permission to access it when you start the application.

ls -l ~/.thunderbird/
# now find your right profile (most people have only one)
mv ~/.thunderbird/xxxxxx.default/logins.json ~/secured/
ln -s ~/secured/logins.json ~/.thunderbird/xxxxxx.default/logins.json

SSH private key

mv ~/.ssh/id_rsa ~/secured/
ln -s ~/secured/id_rsa ~/.ssh/id_rsa

If you have any issues, please file issues or even better a PR along with the issue :)

March 31, 2020 06:02 AM UTC


Programiz

Python main function

In this tutorial, we will learn how to use a Python program's __name__ attribute to run it dynamically in different contexts.
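As a taste of what the tutorial covers, here is a minimal sketch of the usual pattern (the function name main is just a convention):

def main():
    print("Running as a script")

if __name__ == "__main__":
    # __name__ is "__main__" only when this file is executed directly;
    # when the file is imported, __name__ is the module's name instead
    main()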

March 31, 2020 05:51 AM UTC


Mike Driscoll

Python 101 – Learning About Dictionaries

Dictionaries are another fundamental data type in Python. A dictionary is a collection of key, value pairs. Some programming languages refer to them as hash tables. They are described as mapping objects that map hashable values to arbitrary objects.

A dictionary’s keys must be immutable, that is, unable to change. Starting in Python 3.7, dictionaries are ordered. What that means is that when you add a new key, value pair to a dictionary, it remembers what order they were added. Prior to Python 3.7, this was not the case and you could not rely on insertion order.

You will learn how to do the following in this chapter:

  • Create dictionaries
  • Access dictionaries
  • Dictionary methods
  • Modifying dictionaries
  • Deleting from your dictionary

Let’s start off by learning about creating dictionaries!

You can create a dictionary in a couple of different ways. The most common method is by placing a comma-separated list of key: value pairs within curly braces.

Let’s look at an example:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict
{'email': 'jdoe@gmail.com', 'first_name': 'James', 'last_name': 'Doe'}

You can also use Python’s built-in dict() function to create a dictionary. dict() will accept a series of keyword arguments (i.e. one=1, two=2, etc.), a list of tuples, or another dictionary.

Here are a couple of examples:

>>> numbers = dict(one=1, two=2, three=3)
>>> numbers
{'one': 1, 'three': 3, 'two': 2}
>>> info_list = [('first_name', 'James'), ('last_name', 'Doe'), ('email', 'jdoes@gmail.com')]
>>> info_dict = dict(info_list)
>>> info_dict
{'email': 'jdoes@gmail.com', 'first_name': 'James', 'last_name': 'Doe'}

The first example uses dict() on a series of keyword arguments. You will learn more about these when you learn about functions. You can think of keyword arguments as a series of keywords with the equals sign between them and their value.

The second example shows you how to create a list that has 3 tuples inside of it. Then you pass that list to dict() to convert it to a dictionary.

Accessing Dictionaries

Dictionaries’ claim to fame is that they are very fast. You can access any value in a dictionary via its key. If the key is not found, you will receive a KeyError.

Let’s take a look at how to use a dictionary:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict['first_name']
'James'

To get the value of first_name, you must use the following syntax: dictionary_name[key]

Now let’s try to get a key that doesn’t exist:

>>> sample_dict['address']
Traceback (most recent call last):
   Python Shell, prompt 118, line 1
builtins.KeyError: 'address'

Well that didn’t work! You asked the dictionary to give you a value that wasn’t in the dictionary!

You can use Python’s in keyword to ask if a key is in the dictionary:

>>> 'address' in sample_dict
False
>>> 'first_name' in sample_dict
True

You can also check to see if a key is not in a dictionary by using Python’s not keyword:

>>> 'first_name' not in sample_dict
False
>>> 'address' not in sample_dict
True

Another way to access keys in dictionaries is by using one of the dictionary methods. Let’s find out more about dictionary methods now!

Dictionary Methods

As with most Python data types, dictionaries have special methods you can use. Let’s check out some of the dictionary’s methods!

d.get(key[, default])

You can use the get() method to get a value. get() requires you to specify a key to look for. It optionally allows you to return a default if the key is not found. The default is None. Let’s take a look:

>>> print(sample_dict.get('address'))
None
>>> print(sample_dict.get('address', 'Not Found'))
Not Found

The first example shows you what happens when you try to get() a key that doesn’t exist without setting get‘s default. In that case, it returns None. Then the second example shows you how to set the default to the string “Not Found”.

d.clear()

The clear() method can be used to remove all the items from the dictionary.

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict
{'email': 'jdoe@gmail.com', 'first_name': 'James', 'last_name': 'Doe'}
>>> sample_dict.clear()
>>> sample_dict
{}

d.copy()

If you need to create a shallow copy of the dictionary, then the copy() method is for you:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> copied_dict = sample_dict.copy()
>>> copied_dict
{'email': 'jdoe@gmail.com', 'first_name': 'James', 'last_name': 'Doe'}

If your dictionary has objects or dictionaries inside of it, then you may end up running into logic errors due to this method, as changing one dictionary can affect the copy. In those cases, you should use Python’s copy module, which has a deepcopy function that will create a completely separate copy for you.
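A short sketch of the difference (the nested list is shared by the shallow copy, but not by the deep one):

>>> import copy
>>> nested = {'scores': [1, 2, 3]}
>>> shallow = nested.copy()
>>> deep = copy.deepcopy(nested)
>>> nested['scores'].append(4)
>>> shallow['scores']  # shares the inner list with the original
[1, 2, 3, 4]
>>> deep['scores']     # fully independent copy
[1, 2, 3]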

d.items()

The items() method will return a new view of the dictionary’s items:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict.items()
dict_items([('first_name', 'James'), ('last_name', 'Doe'), ('email', 'jdoe@gmail.com')])

This view object will change as the dictionary object itself changes.

d.keys()

If you need to get a view of the keys that are in a dictionary, then keys() is the method for you. As a view object, it will provide you with a dynamic view of the dictionary’s keys. You can iterate over a view and also check membership via the in keyword:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> keys = sample_dict.keys()
>>> keys
dict_keys(['first_name', 'last_name', 'email'])
>>> 'email' in keys
True
>>> len(keys)
3

d.values()

The values() method also returns a view object, but in this case it is a dynamic view of the dictionary’s values:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> values = sample_dict.values()
>>> values
dict_values(['James', 'Doe', 'jdoe@gmail.com'])
>>> 'Doe' in values
True
>>> len(values)
3

d.pop(key[, default])

Do you need to remove a key from a dictionary? Then pop() is the method for you. The pop() method takes a key and an optional default value. If you don’t set the default and the key is not found, a KeyError will be raised.

Here are some examples:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict.pop('something')
Traceback (most recent call last):
   Python Shell, prompt 146, line 1
builtins.KeyError: 'something'
>>> sample_dict.pop('something', 'Not found!')
'Not found!'
>>> sample_dict.pop('first_name')
'James'
>>> sample_dict
{'email': 'jdoe@gmail.com', 'last_name': 'Doe'}

d.popitem()

The popitem() method is used to remove and return a (key, value) pair from the dictionary. The pairs are returned in last-in, first-out (LIFO) order. If called on an empty dictionary, you will receive a KeyError.

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict.popitem()
('email', 'jdoe@gmail.com')
>>> sample_dict
{'first_name': 'James', 'last_name': 'Doe'}

d.update([other])

Update a dictionary with the (key, value) pairs from other, overwriting existing keys. Returns None.

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict.update([('something', 'else')])
>>> sample_dict
{'email': 'jdoe@gmail.com',
'first_name': 'James',
'last_name': 'Doe',
'something': 'else'}

Modifying Your Dictionary

You will need to modify your dictionary from time to time. Let’s assume that you need to add a new key, value pair:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict['address'] = '123 Dunn St'
>>> sample_dict
{'address': '123 Dunn St',
'email': 'jdoe@gmail.com',
'first_name': 'James',
'last_name': 'Doe'}

To add a new item to a dictionary, you can use the square brackets to enter a new key and set it to a value.

If you need to update a pre-existing key, you can do the following:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict['email'] = 'jame@doe.com'
>>> sample_dict
{'email': 'jame@doe.com', 'first_name': 'James', 'last_name': 'Doe'}

In this example, you set sample_dict['email'] to jame@doe.com. Whenever you set a pre-existing key to a new value, you will overwrite the previous value.

Deleting Items From Your Dictionary

Sometimes you will need to remove a key from a dictionary. You can use Python’s del keyword for that:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> del sample_dict['email']
>>> sample_dict
{'first_name': 'James', 'last_name': 'Doe'}

In this case, you tell Python to delete the key “email” from sample_dict.

The other method for removing a key is to use the dictionary’s pop() method, which was mentioned in the previous section:

>>> sample_dict = {'first_name': 'James', 'last_name': 'Doe', 'email': 'jdoe@gmail.com'}
>>> sample_dict.pop('email')
'jdoe@gmail.com'
>>> sample_dict
{'first_name': 'James', 'last_name': 'Doe'}

When you use pop(), it will return the value that is being removed.

Wrapping Up

The dictionary data type is extremely useful. You will find it handy to use for quick lookups of all kinds of data. You can set the value of the key: value pair to any object in Python. So you could store lists, tuples, or objects as values in a dictionary.

If you need a dictionary that can create a default when you go to get a key that does not exist, you should take a look at Python’s collections module. It has a defaultdict class that is made for exactly that use case.
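For example, here is a quick sketch of defaultdict counting words (the data is arbitrary):

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word in ['spam', 'eggs', 'spam']:
...     counts[word] += 1
...
>>> counts['spam']
2
>>> counts['bacon']  # missing key gets int() == 0 instead of a KeyError
0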

Related Reading

 

The post Python 101 – Learning About Dictionaries appeared first on The Mouse Vs. The Python.

March 31, 2020 05:05 AM UTC


IslandT

Lists in python example

This is the final chapter of the lists in Python topic. In this chapter, we will create an example that removes duplicate student names from a student list with the help of a Python loop.

We are given a list of student names, our mission is to remove the duplicate student name from the list.

student = ["Rick", "Ricky", "Richard", "Rick", "Rickky", "Rick", "Rickky"] # there are names that have been duplicated
double_enter = []

# first we need to know all the duplicate student names
for i in range(len(student)):
    count = student.count(student[i])
    if count > 1:
        if student[i] not in double_enter:
            double_enter.append(student[i])

# then we will remove those duplicate student names
for name in student:
    if name in double_enter and student.count(name) &gt; 1:
        student.remove(name)

print(student)

Both Rick and Rickky have duplicated values

The above Python program is quite long; can you find a shorter method to achieve the same outcome? Write your own solution in the comment box below.
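(If you want a hint: one much shorter approach, sketched below, uses dict.fromkeys(), relying on the fact that dictionaries preserve insertion order in Python 3.7+ and keep only the first occurrence of each key. There are other ways too.)

student = ["Rick", "Ricky", "Richard", "Rick", "Rickky", "Rick", "Rickky"]
# dict.fromkeys() drops duplicates while preserving first-seen order
print(list(dict.fromkeys(student)))  # ['Rick', 'Ricky', 'Richard', 'Rickky']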

March 31, 2020 04:13 AM UTC


Codementor

How to Install Odoo 13 in Ubuntu 18.04 ?

Install Odoo13, Ubuntu 18.04

March 31, 2020 02:59 AM UTC

March 30, 2020


Podcast.__init__

An Open Source Toolchain For Natural Language Processing From Explosion AI

Summary

The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all the team at Explosion AI have built a strong presence with the trifecta of SpaCy, Thinc, and Prodigy to support fast and flexible data labeling to feed deep learning models and performant and scalable text processing. In this episode founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning development process.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East which has also gone virtual starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on SpaCy

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by giving an overview of your mission at Explosion?
  • We spoke previously about your work on SpaCy. What has changed in the past 3 1/2 years?
    • How have recent innovations in language models such as BERT and GPT-2 influenced the direction or implementation of the project?
  • When I last looked SpaCy only supported English and German, but you have added several new languages. What are the most challenging aspects of building the additional models?
    • What would be required for supporting symbolic or right-to-left languages?
  • How has the ecosystem for language processing in Python shifted or evolved since you first introduced SpaCy?
  • Another project that you have released is Prodigy to support labelling of datasets. Can you talk through the motivation for creating it and describe the workflow for someone using it?
    • What was lacking in the other annotation tools that you have worked with that you are trying to solve for in Prodigy?
  • What are some of the most challenging or problematic aspects of labelling data sets for use in machine learning projects?
    • What is a typical scale of data that can be reasonably handled by an individual or small team working with Prodigy?
      • At what point do you find that it makes sense to use a labeling service rather than generating the labels yourself?
  • Your most recent project is Thinc for building and using deep learning models. What was the motivation for creating it and what problem does it solve in the ecosystem?
    • How does its design and usage compare to other deep learning frameworks such as PyTorch and Tensorflow?
    • How does it compare to projects such as Keras that abstract across those frameworks?
  • How do the SpaCy, Prodigy, and Thinc libraries work together?
  • What are some of the biggest challenges that you are facing in building open source tools to meet the needs of data scientists and machine learning engineers?
  • What are some of the most interesting or impressive projects that you have seen built with the tools your team is creating?
  • What do you have planned for the future of Explosion, SpaCy, Prodigy, and Thinc?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

March 30, 2020 10:49 PM UTC


Continuum Analytics Blog

Why Understanding CVEs Is Critical for Data Scientists

CVEs are Common Vulnerabilities and Exposures found in software components. Because modern software is complex with its many layers, interdependencies, data input, and libraries, vulnerabilities tend to emerge over time. Ignoring a high CVE score…

The post Why Understanding CVEs Is Critical for Data Scientists appeared first on Anaconda.

March 30, 2020 08:51 PM UTC


Wesley Chun

Authorized Google API access from Python (part 2 of 2)

Listing your files with the Google Drive API

NOTE: You can also watch a video walkthrough of the common code covered in this blogpost here.

UPDATE (Mar 2020): You can build this application line-by-line with our codelab (self-paced, hands-on tutorial) introducing developers to G Suite APIs. The deprecated auth library comment from the previous update below is spelled out in more detail in the green sidebar towards the bottom of step/module 5 (Install the Google APIs Client Library for Python). Also, the code sample is now maintained in a GitHub repo which includes a port to the newer auth libraries so you have both versions to refer to.

UPDATE (Apr 2019): In order to have a closer relationship between the GCP and G Suite worlds of Google Cloud, all G Suite Python code samples have been updated, replacing some of the older G Suite API client libraries with their equivalents from GCP. NOTE: using the newer libraries requires more initial code/effort from the developer thus will seem "less Pythonic." However, we will leave the code sample here with the original client libraries (deprecated but not shutdown yet) to be consistent with the video.

UPDATE (Aug 2016): The code has been modernized to use oauth2client.tools.run_flow() instead of the deprecated oauth2client.tools.run(). You can read more about that change here.

UPDATE (Jun 2016): Updated to Python 2.7 & 3.3+ and Drive API v3.

Introduction

In this final installment of a (currently) two-part series introducing Python developers to building on Google APIs, we'll extend from the simple API example from the first post (part 1) just over a month ago. Those first snippets showed some skeleton code and a short, real working sample that demonstrated accessing a public (Google) API with an API key (it queried public Google+ posts). An API key, however, does not grant applications access to authorized data.

Authorized data, including user information such as personal files on Google Drive and YouTube playlists, require additional security steps before access is granted. Sharing of and hardcoding credentials such as usernames and passwords is not only insecure, it's also a thing of the past. A more modern approach leverages token exchange, authenticated API calls, and standards such as OAuth2.

In this post, we'll demonstrate how to use Python to access authorized Google APIs using OAuth2, specifically listing the files (and folders) in your Google Drive. In order to better understand the example, we strongly recommend you check out the OAuth2 guides (general OAuth2 info, OAuth2 as it relates to Python and its client library) in the documentation to get started.

The docs describe the OAuth2 flow: making a request for authorized access, having the user grant access to your app, and obtaining a(n access) token with which to sign and make authorized API calls. The steps you need to take to get started begin nearly the same way as for simple API access. The process diverges when you arrive at the Credentials page when following the steps below.

Google API access

In order to get authorized Google API access, follow these instructions (the first three of which are roughly the same as for simple API access):
NOTE: The instructions from the previous blogpost were for getting an API key. This time, in the steps above, we're creating and downloading OAuth2 credentials. You can also watch a video walkthrough of this app setup process of getting simple or authorized access credentials in the "DevConsole" here.

Accessing Google APIs from Python

In order to access authorized Google APIs from Python, you still need the Google APIs Client Library for Python, so in this case, do follow those installation instructions from part 1.

We will again use googleapiclient.discovery.build(), which is required to create a service endpoint for interacting with an API, authorized or otherwise. However, for authorized data access, we need additional resources, namely the httplib2 and oauth2client packages. Here are the first five lines of the new boilerplate code for authorized access:

from __future__ import print_function
from googleapiclient import discovery
from httplib2 import Http
from oauth2client import file, client, tools

SCOPES = ...  # one or more scopes (strings), set below
SCOPES is a critical variable: it represents the set of scopes of authorization an app wants to obtain (then access) on behalf of user(s). What does a scope look like?

Each scope is a single string, specifically a URL. You can request one or more scopes, given as a single space-delimited string of scopes or an iterable (list, generator expression, etc.) of strings. If you were writing an app that accesses both your YouTube playlists as well as your Google+ profile information, your SCOPES variable could be either of the following:
SCOPES = 'https://www.googleapis.com/auth/plus.me https://www.googleapis.com/auth/youtube'

The first form is a single space-delimited string of scopes; alternatively, you can use an easier-to-read iterable such as a tuple:

SCOPES = (
    'https://www.googleapis.com/auth/plus.me',
    'https://www.googleapis.com/auth/youtube',
)

Our example command-line script will just list the files on your Google Drive, so we only need the read-only Drive metadata scope, meaning our SCOPES variable will be just this:
SCOPES = 'https://www.googleapis.com/auth/drive.metadata.readonly'
The next section of boilerplate represents the security code:
store = file.Storage('storage.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store)
Once the user has authorized access to their personal data by your app, a special "access token" is given to your app. This precious resource must be stored somewhere local for the app to use. In our case, we'll store it in a file called "storage.json". The lines setting the store and creds variables are attempting to get a valid access token with which to make an authorized API call.

If the credentials are missing or invalid, such as being expired, the authorization flow (using the client secret you downloaded along with a set of requested scopes) must be created (by client.flow_from_clientsecrets()) and executed (by tools.run_flow()) to ensure possession of valid credentials. The client_secret.json file is the credentials file you saved when you clicked "Download JSON" from the DevConsole after you've created your OAuth2 client ID.

If you don't have credentials at all, the user must explicitly grant permission — I'm sure you've all seen the OAuth2 dialog describing the type of access an app is requesting (remember those scopes?). Once the user clicks "Accept" to grant permission, a valid access token is returned and saved into the storage file (because you passed a handle to it when you called tools.run_flow()).

Note: tools.run() deprecated by tools.run_flow()
You may have seen usage of the older tools.run() function, but it has been deprecated by tools.run_flow(). We explain this in more detail in another blogpost specifically geared towards migration.

Once the user grants access and valid credentials are saved, you can create one or more endpoints to the secure service(s) desired with googleapiclient.discovery.build(), just like with simple API access. The main difference in the call is that you need to sign your HTTP requests with your credentials rather than passing an API key:

DRIVE = discovery.build(API, VERSION, http=creds.authorize(Http()))

In our example, we're going to list your files and folders in your Google Drive, so for API, use the string 'drive'. The API is currently on version 3 so use 'v3' for VERSION:

DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))

If you want to get comfortable with OAuth2, what its flow is, and how it works, we recommend that you experiment at the OAuth Playground. There you can choose from any number of APIs to access and experience first-hand how your app must be authorized to access personal data.

Going back to our working example, once you have an established service endpoint, you can use the list() method of the files service to request the file data:

files = DRIVE.files().list().execute().get('files', [])

If there's any data to read, the response dict will contain an iterable of files that we can loop over (or default to an empty list so the loop doesn't fail), displaying file names and types:

for f in files:
    print(f['name'], f['mimeType'])

Conclusion

To find out more about the input parameters as well as all the fields that are in the response, take a look at the docs for files().list(). For more information on what other operations you can execute with the Google Drive API, take a look at the reference docs and check out the companion video for this code sample. Don't forget the codelab and this sample's GitHub repo. That's it!
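
To give a flavor of those input parameters, here is a minimal sketch, assuming the pageSize, orderBy, and fields parameters described in the files().list() reference docs:

files = DRIVE.files().list(
    pageSize=10,                    # return at most 10 items
    orderBy='modifiedTime desc',    # newest files first
    fields='files(name,mimeType)',  # trim the response to just these fields
).execute().get('files', [])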

Below is the entire script for your convenience:
'''
drive_list.py -- Google Drive API demo; maintained at:
http://github.com/googlecodelabs/gsuite-apis-intro
'''
from __future__ import print_function

from googleapiclient import discovery
from httplib2 import Http
from oauth2client import file, client, tools

SCOPES = 'https://www.googleapis.com/auth/drive.metadata.readonly'
store = file.Storage('storage.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store)

DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))
files = DRIVE.files().list().execute().get('files', [])
for f in files:
    print(f['name'], f['mimeType'])
When you run it, you should see pretty much what you'd expect: a list of file or folder names followed by their MIME types (I named my script drive_list.py):
$ python3 drive_list.py
Google Maps demo application/vnd.google-apps.spreadsheet
Overview of Google APIs - Sep 2014 application/vnd.google-apps.presentation
tiresResearch.xls application/vnd.google-apps.spreadsheet
6451_Core_Python_Schedule.doc application/vnd.google-apps.document
out1.txt application/vnd.google-apps.document
tiresResearch.xls application/vnd.ms-excel
6451_Core_Python_Schedule.doc application/msword
out1.txt text/plain
Maps and Sheets demo application/vnd.google-apps.spreadsheet
ProtoRPC Getting Started Guide application/vnd.google-apps.document
gtaskqueue-1.0.2_public.tar.gz application/x-gzip
Pull Queues application/vnd.google-apps.folder
gtaskqueue-1.0.1_public.tar.gz application/x-gzip
appengine-java-sdk.zip application/zip
taskqueue.py text/x-python-script
Google Apps Security Whitepaper 06/10/2010.pdf application/pdf
Obviously your output will be different, depending on what files are in your Google Drive. But that's it... hope this is useful. You can now customize this code for your own needs and/or to access other Google APIs. Thanks for reading!

EXTRA CREDIT: To test your skills, add functionality to this code that also displays the last modified timestamp, the file (byte)size, and perhaps shave the MIMEtype a bit as it's slightly harder to read in its entirety... perhaps take just the final path element? One last challenge: in the output above, we have both Microsoft Office documents as well as their auto-converted versions for Google Apps... perhaps only show the filename once and have a double-entry for the filetypes!
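
If you want a head start on that exercise, here is one possible sketch (not the only solution): it assumes the fields parameter from the reference docs and the Drive v3 field names modifiedTime and size; size is absent for native Google Apps files, hence the default:

files = DRIVE.files().list(
    fields='files(name,mimeType,modifiedTime,size)').execute().get('files', [])
for f in files:
    mimetype = f['mimeType'].split('/')[-1]  # keep just the final path element
    print(f['name'], mimetype, f['modifiedTime'], f.get('size', '-'))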

March 30, 2020 03:10 PM UTC


Real Python

How to Use any() in Python

As a Python programmer, you’ll frequently deal with Booleans and conditional statements—sometimes very complex ones. In those situations, you may need to rely on tools that can simplify logic and consolidate information. Fortunately, any() in Python is such a tool. It looks through the elements in an iterable and returns a single value indicating whether any element is true in a Boolean context, or truthy.

In this tutorial, you’ll learn:

Let’s dive right in!

Python Pit Stop: This tutorial is a quick and practical way to find the info you need, so you’ll be back to your project in no time!

Free Bonus: Click here to get a Python Cheat Sheet and learn the basics of Python 3, like working with data types, dictionaries, lists, and Python functions.

How to Use any() in Python

Imagine that you’re writing a program for your employer’s recruiting department. You might want to schedule interviews with candidates who meet any of the following criteria:

  1. Know Python already
  2. Have five or more years of developer experience
  3. Have a degree

One tool you could use to write this conditional expression is or:

# recruit_developer.py
def schedule_interview(applicant):
    print(f"Scheduled interview with {applicant['name']}")

applicants = [
    {
        "name": "Devon Smith",
        "programming_languages": ["c++", "ada"],
        "years_of_experience": 1,
        "has_degree": False,
        "email_address": "devon@email.com",
    },
    {
        "name": "Susan Jones",
        "programming_languages": ["python", "javascript"],
        "years_of_experience": 2,
        "has_degree": False,
        "email_address": "susan@email.com",
    },
    {
        "name": "Sam Hughes",
        "programming_languages": ["java"],
        "years_of_experience": 4,
        "has_degree": True,
        "email_address": "sam@email.com",
    },
]
for applicant in applicants:
    knows_python = "python" in applicant["programming_languages"]
    experienced_dev = applicant["years_of_experience"] >= 5

    meets_criteria = (
        knows_python
        or experienced_dev
        or applicant["has_degree"]
    )
    if meets_criteria:
        schedule_interview(applicant)

In the above example, you check each applicant’s credentials and schedule an interview if the applicant meets any of your three criteria.

Technical Detail: Python’s any() and or aren’t limited to evaluating Boolean expressions. Instead, Python performs a truth value test on each argument, evaluating whether the expression is truthy or falsy. For example, nonzero integer values are considered truthy and zero is considered falsy:

>>> 1 or 0
1

In this example, or evaluated the nonzero value 1 as truthy even though it’s not of type Boolean. or returned 1 and didn’t need to evaluate the truthiness of 0. Later in this tutorial, you’ll learn more about the return value and argument evaluation of or.

If you execute this code, then you’ll see that Susan and Sam will get interviews:

$ python recruit_developer.py
Scheduled interview with Susan Jones
Scheduled interview with Sam Hughes

The reason the program chose to schedule interviews with Susan and Sam is that Susan already knows Python and Sam has a degree. Notice each candidate only needed to meet one criterion.

Another way to evaluate the applicants’ credentials is to use any(). When you use any() in Python, you must pass the applicants’ credentials as an iterable argument:

for applicant in applicants:
    knows_python = "python" in applicant["programming_languages"]
    experienced_dev = applicant["years_of_experience"] >= 5

    credentials = (
        knows_python,
        experienced_dev,
        applicant["has_degree"],
    )
    if any(credentials):
        schedule_interview(applicant)

When you use any() in Python, keep in mind that you can pass any iterable as an argument:

>>> any([0, 0, 1, 0])
True

>>> any(set((True, False, True)))
True

>>> any(map(str.isdigit, "hello world"))
False

In each example, any() loops through a different Python iterable, testing the truth of each element until it finds a truthy value or checks every element.

Note: The last example uses Python’s built-in map(), which returns an iterator in which every element is the result of passing the next character in the string to str.isdigit(). This is a useful way to use any() for more complex checks.
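
If the map() call reads awkwardly to you, a generator expression behaves the same way and may be easier to scan:

>>> any(char.isdigit() for char in "hello world")
False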

You may be wondering if any() is merely a dressed-up version of or. In the next section, you’ll learn the differences between these tools.

How to Distinguish Between or and any()

There are two main differences between or and any() in Python:

  1. Syntax
  2. Return value

First, you’ll learn about how syntax affects the usability and reliability of each tool. Second, you’ll learn the types of values that each tool returns. Knowing these differences will help you decide which tool is best for a given situation.

Syntax

or is an operator, so it takes two arguments, one on either side:

>>> True or False
True

any(), on the other hand, is a function that takes one argument, an iterable of objects that it loops through to evaluate truthiness:

>>> any((False, True))
True

This difference in syntax is significant because it affects each tool’s usability and readability. For example, if you have an iterable, then you can pass the iterable directly to any(). To get similar behavior from or, you’d need to use a loop or a function like reduce():

>>> import functools
>>> functools.reduce(lambda x, y: x or y, (True, False, False))
True

In the above example, you used reduce() to pass an iterable as an argument to or. This can be done much more cleanly with any(), which accepts an iterable directly.
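
For comparison, the any() version of the same check is a single call:

>>> any((True, False, False))
True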

To illustrate another way that the syntax of each tool affects its usability, imagine that you want to avoid testing a condition if any preceding condition is True:

def knows_python(applicant):
    print(f"Determining if {applicant['name']} knows Python...")
    return "python" in applicant["programming_languages"]

def is_local(applicant):
    print(f"Determining if {applicant['name']} lives near the office...")
    # imagine an expensive lookup here that returns a Boolean

should_interview = knows_python(applicant) or is_local(applicant)

If is_local() takes a relatively long time to execute, then you don't want to call it when knows_python() has already returned True. This is called lazy evaluation, or short-circuit evaluation. or always evaluates its operands lazily, whereas any() can only be as lazy as the iterable you pass it.

In the above example, the program wouldn’t even need to determine if Susan is local because it already confirmed that she knows Python. That’s good enough to schedule an interview. In this situation, calling functions lazily with or would be the most efficient approach.

Why not use any() instead? You learned above that any() takes an iterable as an argument, and Python evaluates the conditions according to the iterable type. So, if you use a list, Python will execute both knows_python() and is_local() during the creation of that list before calling any():

should_interview = any([knows_python(applicant), is_local(applicant)])

Here, Python will call is_local() for every applicant, even for those who know Python. Because is_local() will take a long time to execute and is sometimes unnecessary, this is an inefficient implementation of the logic.

There are ways to make Python call functions lazily when you’re using iterables, such as building an iterator with map() or using a generator expression:

any((meets_criteria(applicant) for applicant in applicants))

This example uses a generator expression to generate Boolean values indicating whether an applicant meets the criteria for an interview. Once an applicant meets the criteria, any() will return True without checking the remaining applicants. But keep in mind that these types of workarounds also present their own issues and may not be appropriate in every situation.
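
Similarly, map() produces a lazy iterator, so this variant (assuming meets_criteria() is defined) also stops calling the function as soon as a truthy result appears:

any(map(meets_criteria, applicants))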

The most important thing to remember is that the syntactic difference between any() and or can affect their usability.

Syntax isn’t the only difference that affects the usability of these tools. Next, let’s take a look at the different return values for any() and or and how they might influence your decision on which tool to use.

Return Value

Python’s any() and or return different types of values. any() returns a Boolean, which indicates whether it found a truthy value in the iterable:

>>> any((1, 0))
True

In this example, any() found a truthy value (the integer 1), so it returned the Boolean value True.

or, on the other hand, returns the first truthy value it finds, which will not necessarily be a Boolean. If there are no truthy values, then or returns the last value:

>>> 1 or 0
1

>>> None or 0
0

In the first example, or evaluated 1, which is truthy, and returned it without evaluating 0. In the second example, None is falsy, so or evaluated 0 next, which is also falsy. But since there are no more expressions to check, or returns the last value, 0.
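
This return-value behavior is what powers the common default-value idiom. A small sketch:

>>> username = "" or "guest"  # "" is falsy, so or returns "guest"
>>> username
'guest'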

When you’re deciding which tool to use, it’s helpful to consider if you want to know the actual value of the object or just whether a truthy value exists somewhere in the collection of objects.

Conclusion

Congratulations! You’ve learned the ins and outs of using any() in Python and the differences between any() and or. With a deeper understanding of these two tools, you’re well prepared to decide between them in your own code.

You now know:

If you would like to continue learning about conditional expressions and how to use tools like or and any() in Python, then you can check out the following resources:


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

March 30, 2020 02:00 PM UTC


Chris Moffitt

Using WSL to Build a Python Development Environment on Windows

Introduction

In 2016, Microsoft launched Windows Subsystem for Linux (WSL) which brought robust unix functionality to Windows. In May 2019, Microsoft announced the release of WSL 2 which includes an updated architecture that improved many aspects of WSL - especially file system performance. I have been following WSL for a while but now that WSL 2 is nearing general release, I decided to install it and try it out. In the few days I have been using it, I have really enjoyed the experience. The combo of using Windows 10 and a full Linux distro like Ubuntu is a really powerful development solution that works surprisingly well.

The rest of this article will discuss:

  • What is WSL and why you may want to install and use it on your system?
  • Instructions for installing WSL 2 and some helper apps to make development more streamlined.
  • How to use this new capability to work effectively with python in a combined Windows and Linux environment.

What is WSL?

One of the biggest issues I have had with Windows in the past is that working from the command line was painful - at best. The old Windows command prompt was no match for the power available with a simple bash shell and the full suite of unix commands. WSL fixes this in many ways. With WSL, you can install a real Linux distribution on your Windows system and run it at close to bare metal speeds. You can get the best of both worlds - full unix support in parallel with MS Office and other Windows productivity tools not available on Linux.

The concept may be a little tough to grasp at first. Here is a screenshot to bring it into a little more perspective:

Windows and ubuntu

In this screenshot, I am running a full version of Ubuntu 18.04 on Windows alongside Excel and Word. It all runs at very acceptable speeds on my laptop.

There have been virtualization options such as VMWare and VirtualBox for a while. The main advantage of WSL 2 is its efficient use of system resources. Microsoft accomplishes this by running a very minimal subset of Hyper-V features that consumes few resources when the environment is idle. With this architecture you can spin up your virtual Linux image in a second or so and get started with your Linux environment in a seamless manner.

The other benefit of this arrangement is that you can easily copy files between the virtual environment and your base Windows system. There are also some cool tricks to seamlessly use Visual Studio Code and Windows Explorer to bridge the gap between the two environments. In practice, it works very well.

I will go through some additional examples later in this article and highlight how to do python development in the various environments.

Setting Up WSL 2

I highly recommend using WSL 2 due to the speed improvements with the file system. As of this writing, these instructions are the high level process I used to get it installed on my version of Windows 10 Pro. I recommend checking out the official Windows documentation for the latest instructions. I also found this article and the official Ubuntu WSL page very useful for getting everything set up.

I will apologize in advance because this article has a lot of images and is pretty long. However, I wanted to develop a fairly complete guide to bring together many ideas into one place. I hope you find it useful.

With that caveat out of the way, let’s get started.

Before you start, make sure you have administrator access on your system. You also need to be part of the Windows Insider program in order to get WSL 2. This may change in the future but for now, make sure you are enrolled. I chose to use the “slow” ring as shown below:

Windows Insider

In addition, you need Windows 10 build 18917 or later. I am using Windows Pro but I believe the Home edition will work as well.

Windows Version

If these are new settings for your system, make sure all the updates are applied before proceeding.

Now that the foundation is set up, you need to enable the Windows Subsystem for Linux and Virtual Machine Platform using these PowerShell commands:

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Check the settings here:

Windows Settings

You should restart to make sure the install is complete.

Now that the subsystem is installed, you need to install your preferred Linux distribution from the Microsoft Store. I have chosen to use Ubuntu. There are a few tweaks to this Ubuntu distro to make this combo work better, so I recommend Ubuntu as the first distribution to start with. A benefit is that once you get Ubuntu working you can install other distributions side by side and experiment with the one that works best for you.

Windows Ubuntu Install

The install should not take long. Once it is done, you should see an Ubuntu item on your Windows start menu. Go ahead and click it. You will get a message that the install may take a few minutes:

Installing linux

Then configure your username and password:

Username and password

It’s always a good idea to update your Linux environment using sudo:

sudo apt update
sudo apt upgrade

As you can see, this is just like the normal Ubuntu upgrade process but on your Windows system.

Go ahead and try some of your favorite commands and watch them work. Pretty cool.

The final step is to use the Windows wsl commands to enable WSL 2 for this virtual environment. You need to invoke the wsl commands from PowerShell as an administrator:

Launch PowerShell

The wsl command is used to manage the different environments installed on your system. Use the command wsl -l -v to see what you have installed:

Use WSL

As you can see from the output, Ubuntu-18.04 is still at Version 1 of WSL. We want to upgrade, so use the command wsl --set-version Ubuntu-18.04 2

Convert to version 2

Behind the scenes, this process is upgrading the environment while preserving all the existing configurations. It may take a couple of minutes for the upgrade to complete. If you are interested, this link provides some more detail on the differences between WSL 1 and 2.

When you are done, use wsl -l -v to verify that both are running Version 2.

Convert to version 3

While you are at it, you should probably use this command to set WSL to use version 2 as the default for all new installs - wsl --set-default-version 2

Note, depending on when you install this, you could get a message that says “WSL 2 requires an update to its kernel component.” If you see this, reference the info in this MS blog post.

At this point, we have WSL version 2 up and running. Before we go into using it with python, I want to install a couple of additional components to make the development process easier.

Helper Apps

Windows Terminal

One of the problems with the default Windows environment is that there is not a good Terminal application. As I mentioned in the beginning of this post, working from the command line in Linux has been so much more useful than on Windows. Fortunately, Microsoft has been developing a new Windows Terminal that works really well with WSL and the other consoles. If you are interested in learning about all the differences, I highly recommend this blog post for more detail.

The bottom line is that I recommend installing the Windows Terminal from the Microsoft Store. I will be using it for the rest of the command line examples:

Windows Terminal

Windows Terminal is very configurable and you can trick it out quite a bit. In the interest of keeping this post manageable, I’ll refer you to another post that has some good details and links.

I have configured my terminal to launch a couple of different shells. I found the process of editing and configuring it much simpler than the steps I had to go through to set up a Windows shortcut to launch conda.

If you want to review my profile.json file, I have placed a copy on github. I included commands to launch miniconda and customized some aspects of the prompts. Feel free to use this as a reference but you will need to customize it to work for your system. If you have favorite tips and tricks, include them in the comments.

Generating GUIDs

You will need to create your own GUIDs for the different sections of the profile. One option is to use python.

import uuid
uuid.uuid4()
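
If you'd rather not open an interpreter, the same thing works as a shell one-liner from PowerShell or bash:

python -c "import uuid; print(uuid.uuid4())"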

One final item you should consider is installing the Cascadia Fonts for a nicer looking Terminal experience.

After configuring, I have a single place to launch all the various shells and environments I might need both in Linux and Windows:

Terminal launch

Miniconda

As you can see from this screen, I have also installed Miniconda on my system. In a fun twist, I installed a version on the Ubuntu image as well as on Windows. I will not go into the process for installing but I do encourage you to install it on your system in your Windows and WSL environments. This will be the default python environment setup I use.

VS Code

The final useful component is Visual Studio Code and some helpful extensions. I recommend that you install Visual Studio Code in your windows environment.

In order to get the maximum usefulness out of this setup, you need to install a couple of extensions; the key one is the Remote - WSL extension (used later in this article), along with the Python extension for python work.

You will likely want to customize other aspects with themes and icons which I encourage you to do. The above mentioned extensions are the important ones for working with WSL and conda environments on your local Windows and Ubuntu environments.

Working Across environments

Accessing files

That was a lot of setup! Now what?

You should be able to launch your Ubuntu environment or Windows environment and work with Python like you normally would.

Here is a screenshot showing one terminal with tabs running Ubuntu and PowerShell and another one running conda on the Windows system:

Conda environments

This in and of itself is pretty useful, but the real power is the way you can interact across WSL and native Windows.

For instance, if you type explorer.exe . in your Ubuntu environment, Windows will launch Explorer and show the current directory in the WSL environment:

Explorer Windows

Now you have a Windows Explorer view of the files in that Ubuntu WSL environment.

You can also access this environment directly in Explorer by typing the network path \\wsl$\Ubuntu\home\chris

Explorer Windows - Network access

This cross-environment “magic” is supported by the 9P protocol file server which you can see referenced via the mount command in the screen shot above. Microsoft has a nice write up on their blog with some more details about how this works.

Do not access AppData folder
There is one big caveat with all of this. If you want to copy files across WSL and Windows, use Explorer or copy commands. Do not try to locate the AppData folder and manipulate the files directly. This is not supported and will likely cause problems.

Visual Studio Code

There is another handy trick for working across environments. You can use the WSL Visual Studio Code plugin to access the WSL file system from your VS Code install on Windows.

If you execute the command code . in your Ubuntu environment, Windows will launch VS Code and connect to the files within WSL. You can edit those files using all the normal VS Code functionality, and all changes are saved within the WSL environment. You can see the indicator in the bottom left that lets you know you’re interacting with WSL and not the standard Windows system.

Visual Studio Code

You can also launch VS Code in Windows and access all your running WSL environments through the command palette. Press Ctrl + Shift + P then type Remote-WSL to see the options.

Remote WSL window

If you have more than one WSL environment set up, then you can select the appropriate one as well.

VS Code then ensures that you are editing files in the WSL environment. For instance, when you open a file, you will only see the WSL file system:

Opening files

One minor surprise I encountered is that you need to make sure any VS Code plugins you want to use in WSL are installed within the WSL environment. For instance, if you look at this screenshot, you can see that some of the plugins are installed on the local Windows environment, but you also need to ensure they are installed in the WSL environment.

Fortunately, the installation is pretty simple. In fact, VS Code prompts you with a button that says “Install in WSL: Ubuntu.” The install process is simple but it is an implementation detail to keep in mind.

Installing VS Code plugins

It is a bit crazy to think about how this works but the implementation is very seamless and in my experience you get used to it pretty quickly.

Jupyter Notebooks

Another method to work across environments is using the network. In researching this article, I found many comments that accessing localhost did not work in some of the older versions of WSL. I did not have any problems getting localhost to work when using Pelican or Jupyter Notebooks. I get the sense that this is an active area of focus for the developers so keep this in mind as you experiment on your own.

The one option I do recommend you use is the --no-browser switch to avoid a warning message when launching a Jupyter notebook. In the example below, I’m running Jupyter in Ubuntu but viewing it on my local Edge browser.

Launching a Jupyter Notebook
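
For reference, the launch command from the WSL shell looks like this (assuming Jupyter is installed in that environment):

jupyter notebook --no-browser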

Also, it’s helpful to remember that if you want to copy data from the terminal use Ctrl + Shift + C to copy and Ctrl + Shift + V to paste. You will likely need this to copy the token and authenticate with the Jupyter process.

Directly running apps

The wsl command is a powerful tool for operating on WSL environments. One of its capabilities is that it can run an executable directly from the Linux environment. For example, we can run the fortune command that is installed in our Ubuntu environment.

Execute unix commands

What about running Graphical Apps?

For the most part, I’ve been using the Windows-native apps for graphical applications. Between MS Office Apps, Chrome and VS Code I have most use cases covered. If I want to use apps like Gimp or Inkscape, I can use the Windows versions.

However, I have found a couple of niche apps that did not have a good equivalent in Windows. One simple app I use is Trimage to compress images.

I can install it in Ubuntu, but when I try to execute it, I get an error message, trimage.py: cannot connect to X server

The fix for this is to install an X server on Windows. There are several options including a paid version called X410. I elected to use VcXsrv (oh sourceforge, such memories).

Be forewarned, it is definitely not a native Win 10 app, so this is all going to look a bit ugly. There are likely ways to make it look better, but I have not investigated because this is an approach of last resort for a handful of apps. I’m sharing for completeness.

Anyway, install VcXsrv and run it:

VcXsrv

I needed to Disable Access Control:

X Server Access Control

Once it is launched, it will sit in the system tray and listen for connections.

In order to configure your Ubuntu environment, make sure these two lines are in your .bashrc file. After making the changes, restart the shell:

export DISPLAY=$(awk '/nameserver / {print $2; exit}' /etc/resolv.conf 2>/dev/null):0 # in WSL 2
export LIBGL_ALWAYS_INDIRECT=1

If all is configured correctly, you should see Trimage:

Compressing Images with Trimage

It worked just fine for compressing the images in this post.

If you really feel the need to run a more full-fledged graphical environment, you can install a lightweight desktop environment like xfce and launch it as well. Here is how to install it:

sudo apt install xfce4

Here is a busy screen shot showing all of this working together:

Full desktop with all apps

The image shows:

  • VS Code, editing files in the WSL environment
  • A full xfce desktop running on WSL being displayed in the local Windows X server
  • The WSL environment serving up the Pelican blog

Workflow

Now that you have all these options for python development on a single machine, you need to decide how best to configure for your needs. I am still working through my process but here is what I’m doing right now:

  • Chrome on Win 10: General web browsing, email, Jupyter notebooks
  • Visual Studio Code on Win 10: Text and python file editing
  • Visual Studio Code on Win 10 connected through WSL: Write reStructuredText articles for the blog
  • Ubuntu on WSL: Maintain and develop content on Pelican blog
  • Ubuntu on WSL: Command line tools as needed
  • python on WSL: Blog content and general development/experimentation
  • python on Win 10: Development when working on Windows specific tasks (Excel, Word, etc)

The key point is that even though the WSL and Windows environments can “talk” to each other, there does need to be some segregation of responsibilities. For instance, when using git in WSL, it is recommended that you operate on the files in the WSL environment. The same goes for Windows - don’t try to run Windows executables directly from the WSL file system.

Finally, I still recommend that you use conda environments to keep your python environments clean. I have chosen to have a conda environment on Ubuntu and one on Windows so that I can make sure blog posts work appropriately across Windows and Linux environments.

Conclusion

WSL is a major step forward in making Windows a first class development platform. I have been a long-time Ubuntu user at home and Windows user at work. WSL finally gives me a platform where I can have the best of both worlds. I get access to all the tools and flexibility of working in Ubuntu alongside the common MS Office tools. In addition, I am confident any commercial software I might need can be installed on this system.

I hope you find this guide useful and that it helps you build your own python development environment for Windows and Linux. If you have any other tips, let me know in the comments.

March 30, 2020 12:55 PM UTC