
Planet Python

Last update: January 20, 2017 04:46 AM

January 20, 2017


Kushal Das

Fedora Atomic Working Group update from 2017-01-17

This is an update from the Fedora Atomic Working Group based on the IRC meeting held on 2017-01-17. Fourteen people participated in the meeting; the full log can be found here.

OverlayFS partition

We decided to have a Docker partition in Fedora 26. The root partition sizing will also need to be fixed. All of the discussion can be read in the Pagure issue.

We also need help writing documentation for migrating from devicemapper to OverlayFS and back.

How to compose your own Atomic tree?

Jason Brooks will update his document located at Project Atomic docs.

docker-storage-setup patches require more testing

There are pending patches which will require more testing before merging.

Goals and PRD of the working group

Josh Berkus is updating the goals and PRD documentation for the working group. Both short-term and long-term goals can be seen at this etherpad. The previous Cloud Working Group’s PRD is much longer than most of the other groups’ PRDs, so we also discussed trimming the Atomic WG PRD.

Open floor discussion + other items

I updated the working group about a recent failure of the QCOW2 image on Autocloud. It appears that if we boot the images with only one VCPU and then reboot after disabling the chronyd service, there is no defined time for the ssh service to be up and running.

Misc talked about the hardware plan for FOSP, and later he sent a detailed mail to the list about it.

Antonio Murdaca (runcom) brought up the discussion about testing the latest Docker (1.13) and pushing it to F25. We decided to spend more time testing it before pushing it to Fedora 25; otherwise it may break Kubernetes/OpenShift. We will schedule a 1.13 testing week in the coming days.

January 20, 2017 04:18 AM


Python Diary

Encryption experiment in Python

I recently created a toy encryption tool using pure Python, and it's dead simple to implement and use. It is slow in CPython, a bit faster in Cython, and runs nicely in a compiled language like ObjectPascal.

I created this as a way to better understand how encryption works, and to give others who don't understand cryptography an easy-to-read example of the utter basics of encryption. This code can easily be expanded to strengthen it further. It uses exclusive OR (XOR) to toggle bits, which is what does the actual encryption here. It is a stream cipher, so the key and input can be variable in length. The encryption works by using a custom table, or master key as it is labeled in the code, along with an actual password/passphrase. I'd highly recommend passing in a SHA-512 digest of a password.
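To make the idea concrete, here is a minimal, hypothetical sketch of a repeating-key XOR stream cipher in pure Python. It is not the KCrypt code itself (the xor_cipher name and the use of a SHA-512 digest as the key are my own illustration), and like the original it is a toy, not something to protect real secrets with:

import hashlib

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR each byte of the input with the corresponding (repeating) key byte.
    # Running the same function again with the same key restores the original data.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Derive a fixed-length key from a passphrase, as the post recommends (SHA-512 digest).
key = hashlib.sha512(b'my secret passphrase').digest()

ciphertext = xor_cipher(b'attack at dawn', key)
assert xor_cipher(ciphertext, key) == b'attack at dawn'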

My initial idea was to create a Crypto Virtual Machine, where each byte in the password/passphrase would map to a virtual op code in the VM. This op code would then do something to the current byte to be encrypted, so effectively the password is a bytecode string that tells the VM how to encrypt and decrypt the cleartext or other data. This may make encryption slow, as the VM needs to parse every byte and do something to it, and the operation needs to be reversible using the same bytecode. Essentially you would need an encryption VM and a decryption VM: the encryption VM would perform the encryption of the bytes or blocks, and the decryption VM would perform the decryption. So rather than a fixed algorithm, each byte in your key drives the VM to perform almost-random encryption, making it nearly impossible to decrypt without knowing the bytecode the VM needs to decrypt it.

This is only a theory on how more advanced computers with faster processors could implement encryption; technically, a dedicated microcontroller or processor could implement what the bytecodes do at the hardware level. I still plan on making an example virtual machine like this to play around with the idea. For now, you can check out the encryption project discussed above in my KCrypt source code.

January 20, 2017 01:20 AM

January 19, 2017


Python Software Foundation

Sheila Miguez and Will Kahn-Greene and their love for the Python Community: Community Service Award Quarter 3 2016 Winners

There are two elements which make Open Source function:
  1. Technology
  2. An active community.
The primary need for a successful community is a good contributor base. The contributors are our real heroes, who work persistently, on many (if not most) occasions without any financial benefits, just for the love of the community. The Python Community is blessed with many such heroes. The PSF's quarterly Community Service Award honors these heroes for their notable contributions and dedication to the Python ecosystem.


The PSF is delighted to give the 2016 Third Quarter Community Service Award to Sheila Miguez and Will Kahn-Greene:
Sheila Miguez and William Kahn-Greene for their monumental work in creating and supporting PyVideo over the years.


Community Service Award for 3rd Quarter

Will Kahn-Greene
Taken by Erik Rose, June 2016
The PSF funds a variety of conferences and workshops throughout the year worldwide to educate people about Python. But not everyone can attend all of these events. Two people, Sheila Miguez and Will Kahn-Greene, wanted to solve this problem for Pythonistas. Will came up with the brilliant idea of PyVideo, and Sheila later joined the mission. PyVideo serves as a warehouse of videos from Python conferences, local user groups, screencasts, and tutorials.

The Dawn of PyVideo

Back in 2010, Will started a Python video site using the Miro Community video-sharing platform. PSF encouraged his work with an $1800 grant the following year. As Will recalls, "I was thinking there were a bunch of Python conferences putting out video, but they were hosting the videos in different places. Search engines weren't really finding it. It was hard to find things even if you knew where to look." He started with Miro Community, and later wrote a whole new codebase for generating the data and another codebase for the front end of the website.
With these tools he started PyVideo.org. "This new infrastructure let me build a site closer to what I was envisioning."


When Sheila joined the project she contributed both to its technology and to helping the community find Python videos more easily. Originally, she intended to work only on the codebase, but found herself dedicating a lot of time to adding content to the site.


What is PyVideo?
PyVideo is a repository that indexes and links to thousands of Python videos. It also provides a website, pyvideo.org, where people can browse the collection, which holds more than 5,000 Python videos and is still growing. The goals for PyVideo are:

  1.  Help people get to Python presentations more easily and quickly
  2.  Focus on education
  3.  Collect and categorize data
  4.  Give people an easy, enjoyable experience contributing to open source on PyVideo's GitHub repo

The Community Response

The Python community has welcomed Will and Sheila's noble endeavor enthusiastically. Pythonistas around the world never have to miss another recorded talk or tutorial. Sheila and Will worked relentlessly to give shape to their mammoth task. When I asked Will about the community’s response, he said, "Many learned Python by watching videos they found on pyvideo.org. Many had ideas for different things we could do with the site and other related projects. I talked with some folks who later contributed fixes and corrections to the data."


Will and Sheila worked on pyvideo.org only in their spare time, but it became a major catalyst in the growth of the Python community worldwide. According to Will, pyvideo.org has additional, under-publicized benefits:


  • PyVideo is a primary source to survey diversity trends among Python conference speakers around the globe.
  • Since its videos are solely Python, it is easily searchable and provides more helpful results than other search engines.
  • It offers a preview of conferences: By watching past talks people can choose if they want to go.


PyVideo: The End?

With a blog post Will and Sheila announced the end of pyvideo.org. "I'm pretty tired of working on pyvideo and I haven't had the time or energy to do much on it in a while," Will wrote.


Though they were shutting down the site, they never wanted to lose or waste the valuable data. Will says, "In February 2016 or so, Sheila and I talked about the state of things and I just felt bad about everything. So we decided to focus on extracting the data from PyVideo and make sure that even if the site didn't live on, the data did. We wrote a bunch of tools and
infrastructure for a community of people to add to, improve and otherwise work on the data. We figured someone could take the data and build a static site around it." Will did a blog post about the status of the data of pyvideo.org, and invited new maintainers to replace the site.


The end of pyvideo.org broke the hearts of many Pythonistas, including Paul Logston. Paul’s mornings used to begin with a talk from the site, and he couldn't give up his morning ritual. He resolved to replace pyvideo.org, and began by writing his own project, "PyTube", for storing videos. Though initially his interest was personal, the project's educational outreach drove him to finish and publicize it. Sheila first took notice of Paul when she spotted his fork of the pyvideo data repository. She was excited to see that he'd already built a static site generator based on PyVideo data. She read Paul’s development philosophy and felt he was the right person to carry on the mission.


In May 2016, at PyCon US, there was a lightning talk on PyVideo and its situation. Paul met some fellow PyVideo followers who, just like him, did not want to lose the site. They decided to work on it during the sprints. Though the structure of the website was ready, there was still a lot to be done: gathering data, curating it, and designing the website. So the contributors divided the work among themselves.


Both Sheila and Will were committed to PyVideo's continued benefit for the community, while passing PyVideo to new hands. They were satisfied by Paul's work and transferred the domain to his control. Paul's PyTube code became the replacement of pyvideo.org on August 13, 2016.


Emergence of the Successor: The Present Status of PyVideo

Now the project has 30 contributors, with Paul serving as project lead. These contributors have kept the mission alive. Though PyVideo's aim is still the same, there is a difference in its technology. The old Django app has been replaced with a static site generated with Pelican, and it now has a separate repository for data in JSON files. The team's current work emphasizes making the project hassle-free to maintain.


Listen to Paul talking about PyVideo and its future on Talk Python to Me.


The Wings to Fly

Every community needs someone with a vision for its future. Will and Sheila showed us a path to grow and help the community. It is now our responsibility to take the new PyVideo further. Paul describes its purpose beautifully: "PyVideo's deeper 'why' is the desire to make educating oneself as easy, affordable, and available as possible." Contributors: please come and join the project, and give Paul and the team a hand in moving this great endeavor forward.

January 19, 2017 10:47 PM


PyCharm

Make sense of your variables at a glance with semantic highlighting

Let’s say you have a really dense function or method, with lots of arguments passed in and lots of local variables. Syntax highlighting helps some, but can PyCharm do more?

In 2017.1, PyCharm ships with “semantic highlighting” available as a preference. What is it, what problem does it solve, and how do I use it? Let’s take a look.

It’s So Noisy

Sometimes you have really, really big functions. Not in your codebase, of course, because you are tidy. But hypothetically, you encounter this in a library:

2016-noselection

PyCharm helps, of course. Syntax highlighting sorts out the reserved words and different kinds of symbols: bold for keywords, gray for unused symbols, yellow for suggestions, green for string literals. But that doesn’t help you focus on the parameter “namespaces”. Clicking on a specific symbol highlights it for the rest of the file:

2016-selection

That kind of works, but not only do you have to perform an action for each symbol you want to focus on, it also moves your cursor. It’s a solution to a different problem.

How can my tool help me scan this Python code without much effort or distraction?

IntelliJ Got It

As you likely know, PyCharm and our other IDEs are built atop the IntelliJ IDE platform. In November, IntelliJ landed an experimental cut of “semantic highlighting”:

“Semantic Highlighting, previously introduced in KDevelop and some other IDEs, is now available in IntelliJ IDEA. It extends the standard syntax highlighting with unique colors for each parameter and local variable.”

It wasn’t available in the IDEs, but you could manually enable it via a developer preference. Here’s a quick IntelliJ video describing the problem and how semantic highlighting helps.

With PyCharm 2017.1, the engine is now available to be turned on in preferences. Let’s see it in action.

Crank Up the Signal

Blah blah blah, what does it look like?

2017

Our noisy function now has some help. PyCharm uses semantic highlighting to assign a different color to each parameter and local variable: the “namespaces” parameter is now a certain shade of green. You can then let color help you scan through the function to track the variable, with no distracting action to isolate one of them or switch focus to another.

To turn on semantic highlighting in your project, on a per-font-scheme basis, visit the Editor -> Colors & Fonts -> Language Defaults preference:

prefs

Your Colors Make Me Sad

The default color scheme might not work for you. Some folks have visual issues with red and green, for example. Some might have contrast issues in their theme or workspace. Others might simply hate #114D77 (we’ve all been there).

If you make IDEs for long enough, you learn self-defense, and that means shipping a flexible means of customization:

colorpicker

The pickers let you assign base colors and gradients, tailoring a wide range of local symbols to your needs and taste.

Learn More

PyCharm’s goal is to help you be a badass Python developer, and hopefully our use of semantic highlighting helps you make sense of dense code. We’re still working on the idea itself as well as the implementation, so feel free to follow along in our bug tracker across all our products, since this isn’t a PyCharm-specific feature.

And as usual, if you have any quick questions, drop us a note in the blog comments.

January 19, 2017 05:14 PM


Python Data

Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content for analyzing networks, sentiment, influencers and other various types of analysis.

In order to do this type of analysis, you first need to have some data to analyze. You can scrape websites like Twitter or Facebook using simple web scrapers, but I’ve always found it easier to use the APIs that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real-time and storing it for analysis. Twitter also has a Search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets… but it’s been a while since I’ve looked at the Search API). I’m a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that ‘listens’ to the API for your required keywords and then stores those tweets somewhere for later analysis.

There are tons of ways to connect up to the Streaming API. There are also quite a few Twitter API wrappers for Python (and most of them work very well).   I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I’m working on a small / short-term project, I tend to reach for MongoDB to store the tweets using the PyMongo module. For larger / longer-term projects I usually connect the streaming API script to MySQL instead of MongoDB simply because MySQL fits into my ecosystem of backup scripts, etc better than MongoDB does.  MongoDB is perfectly suited for this type of work for larger projects…I just tend to swing toward MySQL for those projects.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces/parts of this script but I don’t want to discount the help I had from other people / sites in building this script.

Collecting / Storing Tweets with Python and MongoDB

Let’s set up our imports:

from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient

Next, set up your mongoDB path:

MONGO_HOST= 'mongodb://localhost/twitterdb'  # assuming you have mongoDB installed locally
                                             # and a database called 'twitterdb'

Next, set up the words that you want to ‘listen’ for on Twitter. You can use words or phrases separated by commas.

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

Here, I’m listening for words related to machine learning, data science, etc.

Next, let’s set up our Twitter API Access information.  You can set these up here.

CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"

Time to build the listener class.

class StreamListener(tweepy.StreamListener):    
    #This is a class provided by tweepy to access the Twitter Streaming API. 

    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")
 
    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occurred: ' + repr(status_code))
        return False
 
    def on_data(self, data):
        #This is the meat of the script...it connects to your mongoDB and stores the tweet
        try:
            client = MongoClient(MONGO_HOST)
            
            # Use twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb
    
            # Decode the JSON from Twitter
            datajson = json.loads(data)
            
            #grab the 'created_at' data from the Tweet to use for display
            created_at = datajson['created_at']

            #print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))
            
            #insert the data into the mongoDB into a collection called twitter_search
            #if twitter_search doesn't exist, it will be created.
            db.twitter_search.insert(datajson)
        except Exception as e:
            print(e)

Now that we have the listener class, let’s set everything up to start listening.

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
#Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)

Now you are ready to go. The full script is below. You can save this script as “streaming_API.py” and run it as “python streaming_API.py” and – assuming you set up MongoDB and your Twitter API keys correctly – you should start collecting tweets.

The Full Script:

from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient

MONGO_HOST= 'mongodb://localhost/twitterdb'  # assuming you have mongoDB installed locally
                                             # and a database called 'twitterdb'

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"


class StreamListener(tweepy.StreamListener):    
    #This is a class provided by tweepy to access the Twitter Streaming API. 

    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")
 
    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occurred: ' + repr(status_code))
        return False
 
    def on_data(self, data):
        #This is the meat of the script...it connects to your mongoDB and stores the tweet
        try:
            client = MongoClient(MONGO_HOST)
            
            # Use twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb
    
            # Decode the JSON from Twitter
            datajson = json.loads(data)
            
            #grab the 'created_at' data from the Tweet to use for display
            created_at = datajson['created_at']

            #print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))
            
            #insert the data into the mongoDB into a collection called twitter_search
            #if twitter_search doesn't exist, it will be created.
            db.twitter_search.insert(datajson)
        except Exception as e:
            print(e)

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
#Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)
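
Once the stream has been running for a while, reading the stored tweets back out of MongoDB is straightforward with PyMongo. This is just a minimal sketch, assuming the same local 'twitterdb' database and 'twitter_search' collection used above:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost/twitterdb')
db = client.twitterdb

# Print the timestamp and text of the ten most recently inserted tweets
for tweet in db.twitter_search.find().sort('_id', -1).limit(10):
    print(tweet['created_at'], tweet.get('text', ''))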

 

The post Collecting / Storing Tweets with Python and MongoDB appeared first on Python Data.

January 19, 2017 01:17 PM


PyTennessee

PyTN Profiles: Deborah Hanus and Axial Healthcare



Speaker Profile: Deborah Hanus (@deborahhanus)

Deborah graduated from MIT with her Master’s and Bachelor’s degrees in computer science, where she developed mathematical models of human perception. Then, as a Fulbright Scholar in Cambodia, she investigated how education translates into job creation. She worked as an early software engineer at a San Francisco startup before taking a break to work on exciting data and programming-related projects as a PhD candidate in machine learning at Harvard.

Deborah will be presenting “Lights, camera, action! Scraping a great dataset to predict Oscar winners” at 11:00 AM Sunday (2/5) in Room 100. Using Jupyter notebooks and scikit-learn, you’ll predict whether a movie is likely to win an Oscar or be a box office hit (http://oscarpredictor.github.io/). Together, we’ll step through the creation of an effective dataset: asking a question your data can answer, writing a web scraper, and answering those questions using nothing but Python libraries and data from the Internet.

Sponsor Profile: Axial Healthcare (@axialhealthcare)

axialHealthcare is the nation’s leading pain medication and pain care management company. Our cutting-edge analytics engine mines data to give insurers a comprehensive view of their pain problem and what it’s costing them. axial’s pain management solutions improve financial performance by engaging practitioners and patients, optimizing pain care outcomes, and reducing opioid misuse. The axialHealthcare team is comprised of some of the nation’s top physicians, scientists, pharmacists, and operators in the field of pain management. Our team is mission-focused, smart, collaborative, and growing. Learn more at axialhealthcare.com.

January 19, 2017 12:34 PM

PyTN Profiles: Keynoter Courey Elliott and Big Apple Py


Speaker Profile: Courey Elliott (@dev_branch)

Courey Elliott is a software engineer at Emma who has a love of architecture, automation, methodology, and programming principles. They enjoy working on community projects and microcomputing in their spare time. They have a spouse, two kids, two GIANT dogs, two cats, five chickens, a lizard, and a hedgehog named Jack.

Courey will be presenting at 4:00PM Saturday (2/4) in the Auditorium.

Sponsor Profile: Big Apple Py (@bigapplepyinc)

Big Apple Py is a New York State non-profit that promotes the use and education of open source software, in particular, the Python programming language, in and around New York City. Big Apple Py proudly organizes the NYC Python (http://nycpython.org), Learn Python NYC (http://learn.nycpython.org), and Flask-NYC (http://flask-nyc.org) meetup groups, as well as PyGotham (https://pygotham.org), an annual regional Python conference.

January 19, 2017 12:27 PM


Python Anywhere

New release! File sharing, and some nice little fixes and improvements

Honestly, it's strange. We'll work on a bunch of features, care about them deeply for a few days or weeks, commit them to the CI server, and then when we come to deploy them a little while later, we'll have almost forgotten about them. Take today, when Glenn and I were discussing writing the blog post for the release:

-- "Not much user-visible stuff in this one was there? Just infrastructure I think..."

-- "Let's have a look. Oh yes, we fixed the ipad text editor. And we did the disable-webapps-on-downgrade thing. Oh yeah, and change-webapp-python-version, people have been asking for that. Oh, wow, and shared files! I'd almost totally forgotten!"

So actually, dear users, lots of nice things to show you.

File sharing

People have been asking us since forever whether they could use PythonAnywhere to share code with their friends, or to show off that famous text-based guessing game we've all made early in our programming careers. And, after years of saying "I keep telling people there's no demand for it", we've finally managed to make a start.

If you open up the Editor on PythonAnywhere you'll see a new button marked Share.

screenshot of share button

You'll be able to get a link that you can share with your friends, who'll then be able to view your code, and, if they dare, copy it into their own accounts and run it.

screenshot of share menu

We're keen to know what you think, so do send feedback!

Change Python version for a web app

Another feature request, more minor this time; you'll also see a new button that'll let you change the version of Python for an existing web app. Sadly the button won't magically convert all your code from Python 2 to Python 3 though, so that's still up to you...

screenshot of change python ui

More debugging info on the "Unhandled Exception" page

When there's a bug, or your code raises an exception for whatever reason, your site will return our standard "Unhandled Exception" page. We've now enhanced it so that, if it notices you're currently logged into PythonAnywhere and are the owner of the site, it will show you some extra debugging info that's not visible to other users.

screenshot of new error page

Why not introduce some bugs into your code and see for yourself?

ipad fix, and other bits and pieces

We finally gave up on using the fully-featured syntax-highlighting editor on ipads (it seemed like it worked but it really didn't, once you tried to do anything remotely complicated) and have reverted to using a simple textarea.

If you're trying to use PythonAnywhere on a mobile device and notice any problems with the editor, do let us know, and we'll see if we can do the same for your platform.

Other than that, nothing major! A small improvement to the workflow for people who downgrade and re-upgrade their accounts, and a fix to a bug with __init__.py in Django projects.

Keep your suggestions and comments coming, thanks for being our users and customers, and speak soon!

Harry + the team.

January 19, 2017 10:29 AM

January 18, 2017


Flavio Percoco

On communities: Trading off our values... Sometimes

Not long ago I wrote about how much emotions matter in every community. In that post I explained the importance of emotions, how they affect our work and why I believe they are relevant for pretty much everything we do. Emotions matter is a post quite focused on how we can affect, with our actions, other people's emotional state.

I've always considered myself an almost thick-skinned person. Things affect me, but not in a way that would prevent me from moving forward. Most of the time, at least. I used to think this was a weakness; I used to think that letting these emotions through would slow me down. With time I came to accept it as a strength. Acknowledging this characteristic of mine has helped me be more open about the relevance of emotions in our daily interactions and to be mindful of other folks who, like me, are almost thick-skinned, or not thick-skinned at all. I've also come to question the real existence of so-called thick-skinned people, and the more I interact with people, the more I'm convinced they don't really exist.

If you asked me what emotion hits me the most, I would probably say frustration. I'm often frustrated about things happening around me, especially about things that I am involved with. I don't spend time on things I can't change, but rather try to focus on those that not only directly affect me but that I can also have a direct impact on.

At this point, you may be wondering why I'm saying all this and what it has to do with communities and with this post. Bear with me for a bit; I promise you this is relevant.

Culture (as explained in this post), emotions, personality and other factors drive our interactions with other team members. For some people, working in teams is easier than for others, although everyone claims to be an awesome teammate (sarcasm intended, sorry). I believe, however, that one of the most difficult things about working with others is the constant evaluation of the things we value as team members, humans, professionals, etc.

There are no perfect teams and there are no perfect teammates. We weigh the relevance of our values every day, in every interaction we have with other people, in everything we do.

But, what values am I talking about here?

Anything, really. Anything that is important to us. Anything that we stand for and that has slowly become a principle for us, our modus operandi. Our values are our methods. Our values are those beliefs that silently tell us how to react under different circumstances. Our values tell us whether we should care about other people's emotions or not. Controversially, our values are the things that will and won't make us valuable in a team and/or community. Our values are not things we possess; they are things we are and believe. In other words, the things we value are the things we consider important, and they determine our behavior, our interaction with our environment, and how the events happening around us affect us.

The constant trading off of our values is hard. It makes us question our own stances. What's even harder is putting other people's values above ours from time to time. This constant evaluation is not supposed to be easy; it's never been easy. Not for me, at least. Let's face it, we all like to be stubborn; it feels good when things go the way we like. It's easier to manage, it's easier to reason about things when they go our way.

Have you ever found yourself doing something that will eventually make someone else's work useless? If yes, did you do it without first talking with that person? How much value do you put into splitting the work and keeping other folks motivated instead of doing most of it yourself just to get it done? Do you think going faster is more important than having a motivated team? How do you measure your success? Do you base success on achieving a common goal or on your personal performance in the process?

Note that the questions above don't try to express an opinion. There can be two or more answers to each of them depending on your point of view, and that's fine. I don't even think there's a right answer to those questions. However, they do question our beliefs. Choosing one option over the other may go in favor of or against what we value. This is true for many areas of our life, not only our work environment. This applies to our social life, our family life, etc.

Some values are easier to question than others, but we should all spend more time thinking about them. I believe the time we spend weighing and re-evaluating our values allows us to adapt faster to new environments and to grow as individuals and communities. Your cultural values have a great influence on this process. Whether you come from an individualist culture or a collectivist one (listen to 'Customs of the World' for more on this) will make you prefer one option over the other.

Of course, balance is the key. Giving up our beliefs every time is not the answer, but never giving them up is definitely frustrating for everyone and makes interactions with other cultures more difficult. There are things that cannot be traded, and that's fine. That's understandable, that's human. That's how it should be. Nonetheless, there are more things that can be traded than things that you shouldn't give up. The reason I'm sure of this is that our world is extremely diverse, and we wouldn't be where we are if we weren't able to give up some of our own beliefs from time to time.

I don't think we should give up who we are; I think we should constantly evaluate whether our values are still relevant. It's not easy, though. No one said it was.

January 18, 2017 11:00 PM


PyTennessee

PyTN Profiles: Kenneth Reitz and Intellovations




Speaker Profile: Kenneth Reitz (@kennethreitz)

Kenneth Reitz is a well-known software engineer, international keynote speaker, open source advocate, street photographer, and electronic music producer.

He is the product owner of Python at Heroku and a fellow at the Python Software Foundation. He is well-known for his many open source software projects, specifically Requests: HTTP for Humans.

Kenneth will be presenting “The Reality of Developer Burnout” at 11:00AM Sunday (2/5) in the Auditorium. 


Sponsor Profile: Intellovations (@ForecastWatch)

Intellovations builds intelligent and innovative software that helps you understand, communicate, and use your data to make better decisions, increase productivity, and discover new knowledge.

We specialize in large-scale data collection and analysis, Internet-based software, and scientific and educational applications. We have experience building systems that have collected over 500 million metrics per day from Internet-based hardware, have created powerful desktop Internet-search products, and have used genetic algorithms and genetic programming for optimization.

Intellovations’ main product is ForecastWatch, a service that continually monitors and assesses the accuracy of weather forecasts around the world, and is in use by leaders in the weather forecasting industry such as AccuWeather, Foreca, Global Weather Corporation, MeteoGroup, Pelmorex, and The Weather Company.

January 18, 2017 07:11 PM

PyTN Profiles: Matthew Montgomery and Elevation Search Solutions



Speaker Profile: Matthew Montgomery (@signed8bit)

Matthew Montgomery is a Technical Leader at Cisco Systems in the OpenStack group. He has been working professionally on the web since 2000, when he joined Sun Microsystems and worked on a number of high-volume, customer-facing web properties. Moving on after the Oracle acquisition of Sun, he worked briefly in the consulting racket with Accenture and then made some meaningful contributions to clinical workflow at Vanderbilt University Medical Center. Prior to Cisco, he was focused on digital marketing applications deployed on Amazon Web Services. Through most of this, Matthew has called Nashville home and has no plans to change that in the future.

Matthew will be presenting “Test your Automation!” at 3:00PM Saturday (2/4) in Room 300. Learn how to apply the principles of unit testing to your automation code. Using Molecule and Testinfra, this tutorial will provide hands-on guidance for testing an Ansible role.


Sponsor Profile: Elevation Search Solutions (@elevationsearch)

Elevation Search Solutions is a boutique search firm specializing in team build outs for growing companies. We are exceptional at sourcing top professionals to fit unique cultures and push business to the next level.

January 18, 2017 07:03 PM


DataCamp

Pandas Cheat Sheet for Data Science in Python

The Pandas library is one of the most preferred tools for data scientists to do data manipulation and analysis, next to matplotlib for data visualization and NumPy, the fundamental library for scientific computing in Python on which Pandas was built. 

The fast, flexible, and expressive Pandas data structures are designed to make real-world data analysis significantly easier, but this might not be immediately the case for those who are just getting started with it, precisely because there is so much functionality built into the package that the options can be overwhelming.

That's where this Pandas cheat sheet might come in handy. 

It's a quick guide through the basics of Pandas that you will need to get started on wrangling your data with Python. 

As such, you can use it as a handy reference if you are just beginning your data science journey with Pandas or, if you haven't started yet, as a guide to make it easier to learn about and use the library.

Python Pandas Cheat Sheet

The Pandas cheat sheet will guide you through the basics of the Pandas library, going from the data structures to I/O, selection, dropping indices or columns, sorting and ranking, retrieving basic information of the data structures you're working with to applying functions and data alignment.

In short, everything that you need to kickstart your data science learning with Python!
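
As a small taste of what the cheat sheet covers, here is a minimal, hypothetical example of a few of those operations (the column names and values are made up purely for illustration):

import pandas as pd

# One of the core Pandas data structures: the DataFrame
df = pd.DataFrame({'country': ['Belgium', 'India', 'Brazil'],
                   'population': [11.19, 1303.17, 207.85]})

# Selection: rows where population is above 100 (millions)
large = df[df['population'] > 100]

# Sorting and ranking
by_size = df.sort_values('population', ascending=False)
df['rank'] = df['population'].rank(ascending=False)

# Applying a function and basic I/O
df['population'] = df['population'].apply(lambda x: x * 1e6)
df.to_csv('countries.csv', index=False)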

Do you want to learn more? Start the Intermediate Python For Data Science course for free now or try out our Pandas DataFrame tutorial.

Also, don't miss out on our Bokeh cheat sheet for data visualization in Python and our Python cheat sheet for data science.

January 18, 2017 06:56 PM


Caktus Consulting Group

Ship It Day Q1 2017

Last Friday, Caktus set aside client projects for our regular quarterly ShipIt Day. From gerrymandered districts to RPython and meetup planning, the team started off 2017 with another great ShipIt.

Books for the Caktus Library

Liza uses Delicious Library to track books in the Caktus Library. However, the tracking of books isn't visible to the team, so Scott used the FTP export feature of Delicious Library to serve the content on our local network. Scott dockerized Caddy and deployed it to our local Dokku PaaS platform and serves it over HTTPS, allowing the team to see the status of the Caktus Library.

Property-based testing with Hypothesis

Vinod researched using property-based testing in Python. Traditionally it's more used with functional programming languages, but Hypothesis brings the concept to Python. He also learned about new Django features, including testing optimizations introduced with setupTestData.
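
For readers who haven't seen property-based testing before, a minimal Hypothesis example might look something like this (the property being checked is made up for illustration and is not from Vinod's ShipIt work):

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # Hypothesis generates many random lists; the property must hold for all of them.
    assert sorted(sorted(xs)) == sorted(xs)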

Caktus Wagtail Demo with Docker and AWS

David looked into migrating a Heroku-based Wagtail deployment to a container-driven deployment using Amazon Web Services (AWS) and Docker. Utilizing Tobias' AWS Container Basics isolated Elastic Container Service stack, David created a Dockerfile for Wagtail and deployed it to AWS. Down the road, he'd like to more easily debug performance issues and integrate it with GitLab CI.

Local Docker Development

During Code for Durham Hack Nights, Victor noticed local development setup was a barrier of entry for new team members. To help mitigate this issue, he researched using Docker for local development with the Durham School Navigator project. In the end, he used Docker Compose to run a multi-container docker application with PostgreSQL, NGINX, and Django.

Caktus Costa Rica

Daryl, Nicole, and Sarah really like the idea of opening a branch Caktus office in Costa Rica and drafted a business plan to do so! Including everything from an executive summary, to operational and financial plans, the team researched what it would take to run a team from Playa Hermosa in Central America. Primary criteria included short distances to an airport, hospital, and of course, a beach. They even found an office with our name, the Cactus House. Relocation would be voluntary!

Improving the GUI test runner: Cricket

Charlotte M. likes to use Cricket to see test results in real time and have the ability to easily re-run specific tests, which is useful for quickly verifying fixes. However, she encountered a problem causing the application to crash sometimes when tests failed. So she investigated the problem and submitted a fix via a pull request back to the project. She also looked into adding coverage support.

Color your own NC Congressional District

Erin, Mark, Basia, Neil, and Dmitriy worked on an app that visualizes and teaches you about gerrymandered districts. The team ran a mini workshop to define goals and personas, and help the team prioritize the day's tasks by using agile user story mapping. The app provides background information on gerrymandering and uses data from NC State Board of Elections to illustrate how slight changes to districts can vastly impact the election of state representatives. The site uses D3 visualizations, which is an excellent utility for rendering GeoJSON geospatial data. In the future they hope to add features to compare districts and overlay demographic data.

Releasing django_tinypng

Dmitriy worked on testing and documenting django_tinypng, a simple Django library that allows optimization of images using TinyPNG. He published the app to PyPI so it's easily installable via pip.

Learning Django: The Django Girls Tutorial

Gerald and Graham wanted to sharpen their Django skills by following the Django Girls Tutorial. Gerald learned a lot from the tutorial and enjoyed the format, including how it steps through blocks of code describing the syntax. He also learned about how the Django Admin is configured. Graham knew that following tutorials can sometimes be a rocky process, so he worked together with Gerald so they could talk through problems, and Graham was able to learn by reviewing and helping.

Planning a new meetup for Digital Project Management

When Elizabeth first entered the Digital Project Management field several years ago, there were not a lot of resources available specifically for digital project managers. Most information was related to more traditional project management, or the PMP. She attended the 2nd Digital PM Summit with her friend Jillian, and loved the general tone of openness and knowledge sharing (they also met Daryl and Ben there!). The Summit was a wonderful resource. Elizabeth wanted to bring the spirit of the Summit back to the Triangle, so during Ship It Day, she started planning for a new meetup, including potential topics and meeting locations. One goal is to allow remote attendance through Google Hangouts, to encourage openness and sharing without having to commute across the Triangle. Elizabeth and Jillian hope to hold their first meetup in February.

Kanban: Research + Talk

Charlotte F. researched Kanban to prepare for a longer talk to illustrate how Kanban works in development and how it differs from Scrum. Originally designed by Toyota to improve manufacturing plants, Kanban focuses on visualizing workflows to help reveal and address bottlenecks. Picking the right tool for the job is important, and one is not necessarily better than the other, so Charlotte focused on outlining when to use one over the other.

Identifying Code for Cleanup

Calvin created redundant, a tool for identifying technical debt. Last ShipIt he was able to locate completely identical files, but he wanted to improve on that. Now the tool can identify functions that are almost the same and/or might be generalizable. It searches for patterns and generates a report of your codebase. He's looking for codebases to test it on!

RPython Lisp Implementation, Revisited

Jeff B. continued exploring how to create a Lisp implementation in RPython, the framework behind the PyPy project. RPython is a restricted subset of the Python language. In addition to learning about RPython, he wanted to better understand how PyPy is capable of performance enhancements over CPython. Jeff also converted his parser to use Alex Gaynor's RPLY project.

Streamlined Time Tracking

At Caktus, time tracking is important, and we've used a variety of tools over the years. Currently we use Harvest, but it can be tedious to use when switching between projects a lot. Dan would like a tool to make this process more efficient. He looked into Project Hampster, but settled on building a new tool. His implementation makes it easy to switch between projects with a single click. It also allows users to sync daily entries to Harvest.

January 18, 2017 04:39 PM


PyCharm

PyCharm 2017.1 EAP 3 (build 171.2455.3)

We’re happy to announce the next EAP for PyCharm 2017.1, get it now from our website!

This week, we’ve fixed several issues, and added some functionality:

Download it now from our website! To keep up-to-date with our EAP releases set your update channel to Early Access Program: Settings | Appearance & Behavior | System Settings | Updates, Automatically check updates for “Early Access Program”

-PyCharm Team
The Drive to Develop

January 18, 2017 04:34 PM


GoDjango

How I Deploy Django Day-to-Day

There are a lot of ways to deploy Django, so I think it is one of those topics people are really curious about: how do other people do it? Generally, every deploy needs to get the latest code, run migrations, collect static files, and restart the web server processes. How you do those steps is the interesting part.
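
Stripped of any particular tooling, those steps boil down to something like the following sketch. It is only an illustration, not the workflow from the video; the use of plain subprocess calls, the git branch, and the gunicorn service name are all assumptions:

import subprocess

def deploy():
    # Get the latest code
    subprocess.run(['git', 'pull', 'origin', 'master'], check=True)
    # Run database migrations
    subprocess.run(['python', 'manage.py', 'migrate', '--noinput'], check=True)
    # Collect static files
    subprocess.run(['python', 'manage.py', 'collectstatic', '--noinput'], check=True)
    # Restart the web server processes (service name is an assumption)
    subprocess.run(['sudo', 'systemctl', 'restart', 'gunicorn'], check=True)

if __name__ == '__main__':
    deploy()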

In today's video I go over how I deploy Django day to day, followed by some other ways I have done it. This is definitely a topic you can make as easy or as complicated as you want.

Here is the link again: https://www.youtube.com/watch?v=43lIXCPMw_8?vq=hd720

January 18, 2017 04:00 PM


Experienced Django

Django Debug Toolbar

Django debug toolbar is a nifty little utility to allow you to examine what’s going on under the hood.  It’s a fairly easy install and gives quite a lot of info.

Installation

I’m not going to waste your time (or mine) with details of how to install the debug toolbar.  The instructions are here.

I will, however, point out that the “tips” page starts with “The toolbar isn’t displayed!“, which helped me get running. My problem was a lack of <body> </body> tags in my template. (Side note: I’m wondering if something like Bootstrap would provide those surrounding tags automatically.)

Using The Toolbar

The use of the toolbar is pretty obvious. The information is clearly laid out in each of the sections.

The section I found the most interesting was the SQL tab (shown below), which not only shows which queries were done for the given page, but also how long each took.

The page I instrumented has a task which updates several fields in the database the first time it is loaded on any given date.  Using this tab it was clear how much of the page load time was taken up in this update process.

Not only would this be handy for performance troubleshooting, but it’s also instructional to see which python statements turn into queries and how.

Conclusion

As a fan of development tools, Django Debug Toolbar certainly makes me happy, not only for its features, but also for its simplicity in use and design. I would definitely recommend it.

January 18, 2017 02:03 PM


Talk Python to Me

#95 Grumpy: Running Python on Go

Google runs millions of lines of Python code. The front-end server that drives youtube.com and YouTube’s APIs is primarily written in Python, and it serves millions of requests per second!

On this episode you'll meet Dylan Trotter, who is working to increase performance and concurrency on these servers powering YouTube. He just launched Grumpy: a Python implementation based on Go, the highly concurrent language from Google.

Links from the show:

Grumpy home page (redirects): grump.io
Grumpy at GitHub: github.com/google/grumpy
Announcement post: opensource.googleblog.com/2017/01/grumpy-go-running-python.html
Dylan on GitHub: github.com/trotterdylan

Deep Learning Kickstarter: kickstarter.com/projects/adrianrosebrock/1866482244
Hired's Talk Python Offer: hired.com/talkpythontome

January 18, 2017 08:00 AM


Django Weblog

Django 1.11 alpha 1 released

Django 1.11 alpha 1 is now available. It represents the first stage in the 1.11 release cycle and is an opportunity for you to try out the changes coming in Django 1.11.

Django 1.11 has a medley of new features which you can read about in the in-development 1.11 release notes.

This alpha milestone marks a complete feature freeze. The current release schedule calls for a beta release in about a month and a release candidate about a month after that. We'll only be able to keep this schedule if we get early and frequent testing from the community. Updates on the release schedule are available on the django-developers mailing list.

As with all alpha and beta packages, this is not for production use. But if you'd like to take some of the new features for a spin, or to help find and fix bugs (which should be reported to the issue tracker), you can grab a copy of the alpha package from our downloads page or on PyPI.

The PGP key ID used for this release is Tim Graham: 1E8ABDC773EDE252.

January 18, 2017 01:16 AM


Daniel Bader

Assert Statements in Python


How to use assertions to help automatically detect errors in your Python programs in order to make them more reliable and easier to debug.

What Are Assertions & What Are They Good For?

Python’s assert statement is a debugging aid that tests a condition. If the condition is true, it does nothing and your program just continues to execute. But if the assert condition evaluates to false, it raises an AssertionError exception with an optional error message.

The proper use of assertions is to inform developers about unrecoverable errors in a program. They’re not intended to signal expected error conditions, like “file not found”, where a user can take corrective action or just try again.

Another way to look at it is to say that assertions are internal self-checks for your program. They work by declaring some conditions as impossible in your code. If one of these conditions doesn't hold, that means there's a bug in the program.

If your program is bug-free, these conditions will never occur. But if they do occur the program will crash with an assertion error telling you exactly which “impossible” condition was triggered. This makes it much easier to track down and fix bugs in your programs.

To summarize: Python’s assert statement is a debugging aid, not a mechanism for handling run-time errors. The goal of using assertions is to let developers find the likely root cause of a bug more quickly. An assertion error should never be raised unless there’s a bug in your program.

Assert in Python — An Example

Here’s a simple example so you can see where assertions might come in handy. I tried to give this some semblance of a real world problem you might actually encounter in one of your programs.

Suppose you were building an online store with Python. You’re working to add a discount coupon functionality to the system and eventually write the following apply_discount function:

def apply_discount(product, discount):
    price = int(product['price'] * (1.0 - discount))
    assert 0 <= price <= product['price']
    return price

Notice the assert statement in there? It will guarantee that, no matter what, discounted prices cannot be lower than $0 and they cannot be higher than the original price of the product.

Let’s make sure this actually works as intended if we call this function to apply a valid discount:

#
# Our example product: Nice shoes for $149.00
#
>>> shoes = {'name': 'Fancy Shoes', 'price': 14900}

#
# 25% off -> $111.75
#
>>> apply_discount(shoes, 0.25)
11175

Alright, this worked nicely. Now, let’s try to apply some invalid discounts:

#
# A "200% off" discount:
#
>>> apply_discount(shoes, 2.0)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    apply_discount(prod, 2.0)
  File "<input>", line 4, in apply_discount
    assert 0 <= price <= product['price']
AssertionError

#
# A "-30% off" discount:
#
>>> apply_discount(shoes, -0.3)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    apply_discount(prod, -0.3)
  File "<input>", line 4, in apply_discount
    assert 0 <= price <= product['price']
AssertionError

As you can see, trying to apply an invalid discount raises an AssertionError exception that points out the line with the violated assertion condition. If we ever encounter one of these errors while testing our online store it will be easy to find out what happened by looking at the traceback.

This is the power of assertions, in a nutshell.

Python’s Assert Syntax

It’s always a good idea to study up on how a language feature is actually implemented in Python before you start using it. So let’s take a quick look at the syntax for the assert statement according to the Python docs:

assert_stmt ::= "assert" expression1 ["," expression2]

In this case expression1 is the condition we test, and the optional expression2 is an error message that’s displayed if the assertion fails.

At execution time, the Python interpreter transforms each assert statement into roughly the following:

if __debug__:
    if not expression1:
        raise AssertionError(expression2)

You can use expression2 to pass an optional error message that will be displayed with the AssertionError in the traceback. This can simplify debugging even further—for example, I’ve seen code like this:

if cond == 'x':
    do_x()
elif cond == 'y':
    do_y()
else:
    assert False, ("This should never happen, but it does occasionally. "
                   "We're currently trying to figure out why. "
                   "Email dbader if you encounter this in the wild.")

Is this ugly? Well, yes. But it’s definitely a valid and helpful technique if you’re faced with a heisenbug-type issue in one of your applications. 😉

Common Pitfalls With Using Asserts in Python

Before you move on, there are two important caveats with using assertions in Python that I’d like to call out.

The first one has to do with introducing security risks and bugs into your applications, and the second one is about a syntax quirk that makes it easy to write useless assertions.

This sounds (and potentially is) pretty horrible, so you might at least want to skim these two caveats or read their summaries below.

Caveat #1 – Don’t Use Asserts for Data Validation

Asserts can be turned off globally in the Python interpreter. Don’t rely on assert expressions to be executed for data validation or data processing.

The biggest caveat with using asserts in Python is that assertions can be globally disabled with the -O and -OO command line switches, as well as the PYTHONOPTIMIZE environment variable in CPython.

This turns any assert statement into a null-operation: the assertions simply get compiled away and won’t be evaluated, which means that none of the conditional expressions will be executed.
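
For example, given a small script like the following (the file name is hypothetical), the -O switch makes the interpreter skip the assertion entirely:

# optimized_demo.py
assert 1 == 2, 'This assertion fails under normal execution'
print('Reached the end of the script')

# $ python optimized_demo.py    -> raises AssertionError
# $ python -O optimized_demo.py -> prints "Reached the end of the script"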

This is an intentional design decision used similarly by many other programming languages. As a side-effect it becomes extremely dangerous to use assert statements as a quick and easy way to validate input data.

Let me explain—if your program uses asserts to check if a function argument contains a “wrong” or unexpected value this can backfire quickly and lead to bugs or security holes.

Let’s take a look at a simple example. Imagine you’re building an online store application with Python. Somewhere in your application code there’s a function to delete a product as per a user’s request:

def delete_product(product_id, user):
    assert user.is_admin(), 'Must have admin privileges to delete'
    assert store.product_exists(product_id), 'Unknown product id'
    store.find_product(product_id).delete()

Take a close look at this function. What happens if assertions are disabled?

There are two serious issues in this three-line function example, caused by the incorrect use of assert statements:

  1. Checking for admin privileges with an assert statement is dangerous. If assertions are disabled in the Python interpreter, this turns into a null-op. Therefore any user can now delete products. The privileges check doesn’t even run. This likely introduces a security problem and opens the door for attackers to destroy or severely damage the data in your customer’s or company’s online store. Not good.
  2. The product_exists() check is skipped when assertions are disabled. This means find_product() can now be called with invalid product ids—which could lead to more severe bugs depending on how our program is written. In the worst case this could be an avenue for someone to launch Denial of Service attacks against our store. If the store app crashes if we attempt to delete an unknown product, it might be possible for an attacker to bombard it with invalid delete requests and cause an outage.

How might we avoid these problems? The answer is to not use assertions to do data validation. Instead we could do our validation with regular if-statements and raise validation exceptions if necessary. Like so:

def delete_product(product_id, user):
    if not user.is_admin():
        raise AuthError('Must have admin privileges to delete')

    if not store.product_exists(product_id):
        raise ValueError('Unknown product id')

    store.find_product(product_id).delete()

This updated example also has the benefit that instead of raising unspecific AssertionError exceptions, it now raises semantically correct exceptions like ValueError or AuthError (which we’d have to define ourselves).

Caveat #2 – Asserts That Never Fail

It’s easy to accidentally write Python assert statements that always evaluate to true. I’ve been bitten by this myself in the past. I wrote a longer article about this specific issue you can check out by clicking here.

Alternatively, here’s the executive summary:

When you pass a tuple as the first argument in an assert statement, the assertion always evaluates as true and therefore never fails.

For example, this assertion will never fail:

assert(1 == 2, 'This should fail')

This has to do with non-empty tuples always being truthy in Python. If you pass a tuple to an assert statement, the assert condition is always true—which in turn makes the above assert statement useless because it can never fail and trigger an exception.

It’s relatively easy to accidentally write bad multi-line asserts due to this unintuitive behavior. This quickly leads to broken test cases that give a false sense of security in our test code. Imagine you had this assertion somewhere in your unit test suite:

assert (
    counter == 10,
    'It should have counted all the items'
)

Upon first inspection this test case looks completely fine. However, this test case would never catch an incorrect result: it always evaluates to True, regardless of the state of the counter variable.
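
The straightforward fix, for what it's worth, is to pass the message as the assert statement's second operand instead of wrapping everything in parentheses. A minimal corrected sketch:

# This version can actually fail when counter != 10.
assert counter == 10, \
    'It should have counted all the items'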

Like I said, it’s rather easy to shoot yourself in the foot with this (mine still hurts). Luckily, there are some countermeasures you can apply to prevent this syntax quirk from causing trouble:

>> Read the full article on bogus assertions to get the dirty details.

Python Assertions — Summary

Despite these caveats I believe that Python’s assertions are a powerful debugging tool that’s frequently underused by Python developers.

Understanding how assertions work and when to apply them can help you write more maintainable and easier to debug Python programs. It’s a great skill to learn that will help bring your Python to the next level and make you a more well-rounded Pythonista.

January 18, 2017 12:00 AM


Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I'm blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-01 and 2017-01-17. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Stability enhancements for the distributed scheduler and micro-release
  2. NASA Grant writing
  3. Dask-EC2 script
  4. Dataframe categorical flexibility (work in progress)
  5. Communication refactor (work in progress)

Stability enhancements and micro-release

We've released dask.distributed version 1.15.1, which includes important bugfixes after the recent 1.15.0 release. There were a number of small issues that coordinated to remove tasks erroneously. This was generally OK because the Dask scheduler was able to heal the missing pieces (using the same machinery that makes Dask resilient) and so we didn't notice the flaw until the system was deployed in some of the more serious Dask deployments in the wild. PR dask/distributed #804 contains a full writeup in case anyone is interested. The writeup ends with the following line:

This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.

This release also includes other fixes, such as one for a compatibility issue with the new Bokeh 0.12.4 release.

NASA Grant Writing

I’ve been writing a proposal to NASA to help fund distributed Dask+XArray work for atmospheric and oceanographic science at the 100TB scale. Many thanks to our scientific collaborators who are offering support here.

Dask-EC2 startup

The Dask-EC2 project deploys Anaconda, a Dask cluster, and Jupyter notebooks on Amazon’s Elastic Compute Cloud (EC2) with a small command line interface:

pip install dask-ec2 --upgrade
dask-ec2 up --keyname KEYNAME \
            --keypair /path/to/ssh-key \
            --type m4.2xlarge \
            --count 8

This project can be either very useful for people just getting started and for Dask developers when we run benchmarks, or it can be horribly broken if AWS or Dask interfaces change and we don't keep this project maintained. Thanks to a great effort from Ben Zaitlen, dask-ec2 is again in the very useful state, where I'm hoping it will stay for some time.

If you’ve always wanted to try Dask on a real cluster and if you already have AWS credentials then this is probably the easiest way.

This already seems to be paying dividends. There have been a few unrelated pull requests from new developers this week.

Dataframe Categorical Flexibility

Categoricals can significantly improve performance on text-based data. Currently Dask’s dataframes support categoricals, but they expect to know all of the categories up-front. This is easy if this set is small, like the ["Healthy", "Sick"] categories that might arise in medical research, but requires a full dataset read if the categories are not known ahead of time, like the names of all of the patients.
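
As a rough illustration of why categoricals help (plain pandas here, not the Dask change itself), converting a text column to a categorical stores each distinct string once and keeps only compact integer codes per row:

import pandas as pd

# Plain-pandas sketch: the 'status' strings are stored once as categories,
# while the column itself holds small integer codes.
df = pd.DataFrame({'status': ['Healthy', 'Sick', 'Healthy', 'Healthy']})
df['status'] = df['status'].astype('category')
print(df['status'].cat.categories)      # Index(['Healthy', 'Sick'], dtype='object')
print(df['status'].cat.codes.tolist())  # [0, 1, 0, 0]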

Jim Crist is changing this so that Dask can operate on categorical columns with unknown categories at dask/dask #1877. The constituent pandas dataframes all have possibly different categories that are merged as necessary. This distinction may seem small, but it limits performance in a surprising number of real-world use cases.

Communication Refactor

Since the recent worker refactor and optimizations it has become clear that inter-worker communication has become a dominant bottleneck in some intensive applications. Antoine Pitrou is currently refactoring Dask’s network communication layer, making room for more communication options in the future. This is an ambitious project. I for one am very happy to have someone like Antoine looking into this.

January 18, 2017 12:00 AM

January 17, 2017


Brian Curtin

Easy deprecations in Python with @deprecated

Tim Peters once wrote, "[t]here should be one—and preferably only one—obvious way to do it." Sometimes we don't do it right the first time, or we later decide something shouldn't be done at all. For those reasons and more, deprecations are a tool to enable growth while easing the pain of transition.

Rather than switching "cold turkey" from API1 to API2 you do it gradually, introducing API2 with documentation, examples, notifications, and other helpful tools to get your users to move away from API1. Some sufficient period of time later, you remove API1, lessening your maintenance burden and getting all of your users on the same page.

One of the biggest issues I've seen is that last part, the removal. More often than not, it's a manual step. You determine that some code can be removed in a future version of your project and you write it down in an issue tracker, a wiki, a calendar event, a post-it note, or something else you're going to ignore. For example, I once did some work on CPython around removing support for Windows 9x in the subprocess module, which I only knew about because I was one of the few Windows people around and I happened across PEP 11 at the right time.

Automate It!

Over the years I've seen and used several forms of a decorator for Python functions that marks code as deprecated. They're all fairly good, as they raise DeprecationWarning for you and some of them update the function's docstring. However, as Python 2.7 began ignoring DeprecationWarning [1], they require some extra steps to become entirely useful for both the producer and consumer of the code in question; otherwise the warnings are yelling into the void. Enabling the warnings in your development environment is easy, either by passing a -W command-line option or by setting the PYTHONWARNINGS environment variable, but you deserve more.
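
For example, here is a minimal sketch of surfacing these warnings from inside your own code rather than via the command line:

import warnings

# Show DeprecationWarning during development; roughly equivalent to
# running the interpreter with -W default::DeprecationWarning.
warnings.simplefilter('default', category=DeprecationWarning)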

import deprecation

If you pip install libdeprecation [2], you get a couple of things:

  1. If you decorate a function with deprecation.deprecated, your now deprecated code raises DeprecationWarning. Rather, it raises deprecation.DeprecatedWarning, but that's a subclass, as is deprecation.UnsupportedWarning. You'll see why it's useful in a second.
  2. Your docstrings are updated with deprecation details. This includes the versions you set, along with optional details, such as directing users to something that replaces the deprecated code. So far this isn't all that different from what's been around the web for ten-plus years.
  3. If you pass deprecation.deprecated enough information and then use deprecation.fail_if_not_removed on tests which call that deprecated code, you'll get tests that fail when it's time for them to be removed. When your code has reached the version where you need to remove it, it will emit deprecation.UnsupportedWarning and the tests will handle it and turn it into a failure.

@deprecation.deprecated(deprecated_in="1.0", removed_in="2.0",
                        current_version=__version__,
                        details="Use the ``one`` function instead")
def won():
    """This function returns 1"""
    # Oops, it's one, not won. Let's deprecate this and get it right.
    return 1

...

@deprecation.fail_if_not_removed
def test_won(self):
    self.assertEqual(1, won())

All in all, the process of documenting, notifying, and eventually moving on is handled for you. When __version__ = "2.0", that test will fail and you'll be able to catch it before releasing it.

Full documentation and more examples are available at deprecation.readthedocs.io, and the source can be found on GitHub at briancurtin/deprecation.

Happy deprecating!


[1]Exposing application users to DeprecationWarnings that are emitted by lower-level code needlessly involves end-users in "how things are done." It often leads to users raising issues about warnings they're presented, which on one hand is done rightfully so, as it's been presented to them as some sort of issue to resolve. However, at the same time, the warning could be well known and planned for. From either side, loud DeprecationWarnings can be seen as noise that isn't necessary outside of development.
[2]The deprecation name on PyPI is currently being squatted on, so I've reached out to the current holder to see if I can use it. Only the PyPI package name is called libdeprecation, not any of the project's API. I hope to eventually deprecate libdeprecation to change names, which I think is self-deprecating?

January 17, 2017 08:30 PM


Continuum Analytics News

Announcing General Availability of conda 4.3

Wednesday, January 18, 2017
Kale Franz
Continuum Analytics

We're excited to announce that conda 4.3 has been released for general availability. The 4.3 release series has several new features and substantial improvements. Below is a summary. 

To get the latest, just run conda update conda.

New Features

  • Unlink and Link Packages in a Single Transaction: In the past, conda hasn't always been safe and defensive with its disk-mutating actions. It has gleefully clobbered existing files; mid-operation failures left environments completely broken. In some of the most severe examples, conda can appear to "uninstall itself." With this release, the unlinking and linking of packages for an executed command is done in a single transaction. If a failure occurs for any reason while conda is mutating files on disk, the environment will be returned to its previous state. While we've implemented some pre-transaction checks (verifying package integrity for example), it's impossible to anticipate every failure mechanism. In some circumstances, OS file permissions cannot be fully known until an operation is attempted and fails, and conda itself is not without bugs. Moving forward, unforeseeable failures won't be catastrophic.

  • Progressive Fetch and Extract Transactions: Like package unlinking and linking, the download and extract phases of package handling have also been given transaction-like behavior. The distinction is that the rollback on error is limited to a single package. Rather than rolling back the download and extract operation for all packages, the single-package rollback prevents the need for having to re-download every package if an error is encountered.

  • Generic- and Python-Type Noarch/Universal Packages: Along with conda-build 2.1, a noarch/universal type for Python packages is officially supported. These are much like universal Python wheels. Files in a Python noarch package are linked into a prefix just like any other conda package, with the following additional features:

    1. conda maps the site-packages directory to the correct location for the Python version in the environment,
    2. conda maps the python-scripts directory to either $PREFIX/bin or $PREFIX/Scripts depending on platform,
    3. conda creates the Python entry points specified in the conda-build recipe, and
    4. conda compiles pyc files at install time when prefix write permissions are guaranteed.

    Python noarch packages must be "fully universal." They cannot have OS- or Python version-specific dependencies. They cannot have OS- or Python version-specific "scripts" files. If these features are needed, traditional conda packages must be used.

  • Multi-User Package Caches: While the on-disk package cache structure has been preserved, the core logic implementing package cache handling has had a complete overhaul. Writable and read-only package caches are fully supported.

  • Python API Module: An oft requested feature is the ability to use conda as a Python library, obviating the need to "shell out" to another Python process. Conda 4.3 includes a conda.cli.python_api module that facilitates this use case. While we maintain the user-facing command-line interface, conda commands can be executed in-process. There is also a conda.exports module to facilitate longer-term usage of conda as a library across conda releases. However, conda's Python code is considered internal and private, subject to change at any time across releases. At the moment, conda will not install itself into environments other than its original install environment. (A short usage sketch follows this list.)

  • Remove All Locks: Locking has never been fully effective in conda, and it often created a false sense of security. In this release, multi-user package cache support has been implemented for improved safety by hard-linking packages in read-only caches to the user's primary user package cache. Still, users are cautioned that undefined behavior can result when conda is running in multiple processes and operating on the same package caches and/or environments.
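
Here is a sketch of the in-process usage mentioned in the Python API Module bullet above. The module name conda.cli.python_api comes from this announcement, but the specific names used below (Commands, run_command) and the returned tuple reflect my reading of the conda docs and should be treated as an assumption rather than a stable contract.

from conda.cli.python_api import Commands, run_command

# Run "conda list --json" in-process instead of shelling out to a new
# Python process. Assumed return shape: (stdout, stderr, return_code).
stdout, stderr, return_code = run_command(Commands.LIST, '--json')
print(return_code)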

Deprecations/Breaking Changes

  • Conda now has the ability to refuse to clobber existing files that are not within the unlink instructions of the transaction. This behavior is configurable via the path_conflict configuration option, which has three possible values: clobber, warn, and prevent. In 4.3, the default value is clobber. This preserves existing behaviors, and it gives package maintainers time to correct current incompatibilities within their package ecosystem. In 4.4, the default will switch to warn, which means these operations continue to clobber, but the warning messages are displayed. In 4.5, the default value will switch to prevent. As we tighten up the path_conflict constraint, a new command line flag --clobber will loosen it back up on an ad hoc basis. Using --clobber overrides the setting for path_conflict to effectively be clobber for that operation.

  • Conda signed packages have been removed in 4.3. Vulnerabilities existed, and an illusion of security is worse than not having the feature at all. We will be incorporating The Update Framework (TUF) into conda in a future feature release.

  • Conda 4.4 will drop support for older versions of conda-build.

Other Notable Improvements

  • A new "trace" log level is added, with output that is extremely verbose. To enable it, use -v -v -v or -vvv as a command-line flag, set a verbose: 3 configuration parameter, or set a CONDA_VERBOSE=3 environment variable.

  • The r channel is now part of the default channels.

  • Package resolution/solver hints have been improved with better messaging.

January 17, 2017 07:22 PM


Chris Moffitt

Data Science Challenge - Predicting Baseball Fanduel Points

Introduction

Several months ago, I participated in my first crowd-sourced Data Science competition in the Twin Cities run by Analyze This!. In my previous post, I described the benefits of working through the competition and how much I enjoyed the process. I just completed the second challenge and had another great experience that I wanted to share and (hopefully) encourage others to try these types of practical challenges to build their Data Science/Analytics skills.

In this second challenge, I felt much more comfortable with the actual process of cleaning the data, exploring it and building and testing models. I found that the python tools continue to serve me well. However, I also identified a lot of things that I need to do better in future challenges or projects in order to be more systematic about my process. I am curious if the broader community has tips or tricks they can share related to some of the items I will cover below. I will also highlight a few of the useful python tools I used throughout the process. This post does not include any code but is focused more on the process and python tools for Data Science.

Background

As mentioned in my previous post, Analyze This! is an organization dedicated to raising awareness of the power of Data Science and increasing visibility in the local business community of the capabilities that Data Science can bring to their organizations. In order to accomplish this mission, Analyze This! hosts friendly competitions and monthly educational sessions on various Data Science topics.

This specific competition focused on predicting 2015 Major League Baseball Fanduel points. A local company provided ~36,000 rows of data to be used in the analysis. The objective was to use the 116 measures to build a model to predict the actual points a hitter would get in a Fanduel fantasy game. Approximately 10 teams of 3-5 people each participated in the challenge and the top 4 presented at SportCon. I was very proud to be a member of the team that made the final 4 cut and presented at SportCon.

Observations

As I went into the challenge, I wanted to leverage the experience from the last challenge and focus on a few skills to build in this event. I specifically wanted to spend more time on the exploratory analysis in order to more thoughtfully construct my models. In addition, I wanted to actually build out and try the models on my own. My past experience was very ad-hoc. I wanted this process to be a little more methodical and logical.

Leverage Standards

About a year ago, I took an introductory Business Analytics class which used the book Data Science for Business (Amazon Referral) by Foster Provost and Tom Fawcett as one of the primary textbooks for the course. As I have spent more time working on simple Data Science projects, I have really come to appreciate the insights and perspectives from this book.

In the future, I would like to do a more in-depth review of this book but for the purposes of this article, I used it as a reference to inform the basic process I wanted to follow for the project. Not surprisingly, this book mentions that there is an established methodology for Data Mining/Analysis called the “Cross Industry Standard Process for Data Mining” aka CRISP-DM. Here is a simple graphic showing the various phases:

[Figure: diagram of the CRISP-DM process phases (credit: Kenneth Jensen)]

This process matched what my experience had been in the past in that it is very iterative as you explore the potential solutions. I plan to continue to use this as a model for approaching data analysis problems.

Business and Data Understanding

For this particular challenge, there were a lot of interesting aspects to the “business” and “data” understanding. From a personal perspective, I was familiar with baseball as a casual fan but did not have any in-depth experience with Fanduel so one of the first things I had to do was learn more about how scores were generated for a given game.

In addition to the basic understanding of the problem, it was a bit of a challenge to interpret some of the various measures, understand how they were calculated, and figure out what they actually represented. It was clear as we went through the final presentations that some groups understood the intricacies of the data in much more detail than others. It was also interesting that in-depth understanding of each data element was not required to actually "win" the competition.

Finally, this phase of the process would typically involve more thought around what data elements to capture. The structure of this specific challenge made that a non-issue since all data was provided and we were not allowed to augment it with other data sources.

Data Preparation

For this particular problem, the data was relatively clean and easily read in via Excel or csv. However there were three components to the data cleaning that impacted the final model:

  • Handling missing data
  • Encoding categorical data
  • Scaling data

As I worked through the problem, it was clear that managing these three factors required quite a bit of intuition and trial and error to figure out the best approach.

I am generally aware of the options for handling missing data but I did not have a good intuition for when to apply the various approaches:

  • When is it better to replace a missing value with a numerical substitute like mean, median or mode?
  • When should a dummy value like NaN or -1 be used?
  • When should the data just be dropped?
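
For reference, each of the three options above maps to a one-liner in pandas; here is a minimal sketch with a made-up column:

import numpy as np
import pandas as pd

# Hypothetical column with gaps, standing in for the competition data.
batting_avg = pd.Series([0.310, np.nan, 0.275, np.nan])

filled_mean = batting_avg.fillna(batting_avg.mean())  # numerical substitute
filled_flag = batting_avg.fillna(-1)                  # dummy/sentinel value
dropped = batting_avg.dropna()                        # drop the rows entirely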

Categorical data proved to have somewhat similar challenges. There were approximately 16 categorical variables that could be encoded in several ways:

  • Binary (Day/Night)
  • Numerical range (H-M-L converted to 3-2-1)
  • One hot encoding (each value in a column)
  • Excluded from the model
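
As a quick sketch of a couple of these encoding options (the column names are made up, not from the competition data):

import pandas as pd

# Hypothetical columns standing in for the competition data.
games = pd.DataFrame({'time_of_day': ['Day', 'Night', 'Day'],
                      'confidence': ['H', 'L', 'M']})

# Numerical range encoding: H-M-L converted to 3-2-1
games['confidence_num'] = games['confidence'].map({'L': 1, 'M': 2, 'H': 3})

# One hot encoding of the binary Day/Night column
games = games.join(pd.get_dummies(games['time_of_day'], prefix='time'))
print(games)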

Finally, the data included many measures with values < 1 as well as measures > 1000. Depending on the model, these scales could over-emphasize some results at the expense of others. Fortunately, scikit-learn has options for mitigating this, but how do you know when to use which option? In my case, I stuck with using RobustScaler as my go-to function. This may or may not be the right approach.
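
As an example of that scaling step, here is a minimal RobustScaler sketch on made-up values spanning very different ranges:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Two made-up features: one well below 1, one well above 1000.
X = np.array([[0.02, 1500.0],
              [0.05, 90000.0],
              [0.90, 120000.0]])

# RobustScaler centers each feature on its median and scales by the
# interquartile range, so outliers do not dominate the scaled values.
X_scaled = RobustScaler().fit_transform(X)
print(X_scaled)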

The challenge with all these options is that I could not figure out a good systematic way to evaluate each of these data preparation steps and how they impacted the model. The entire process felt like a lot of trial and error.

Trying Stuff Until It Works

Ultimately, I believe this is just part of the process but I am interested in understanding how to systematically approach these types of data preparation steps in a methodical manner.

Modeling and Evaluation

For modeling, I used the standard scikit-learn tools augmented with TPOT and ultimately used XGBoost as the model of choice.

In a similar vein to the challenges with data prep, I struggled to figure out how to choose which model worked best. The data set was not tremendously large but some of the modeling approaches could take several minutes to run. By the time I factored in all of the possible options of data prep + model selection + parameter tuning, it was very easy to get lost in the process.

Scikit-learn has capabilities to tune hyper-parameters, which is helpful. Additionally, TPOT can be a great tool to try a bunch of different approaches. However, these tools don't always help with the further up-stream process related to data prep and feature engineering. I plan to investigate more options in this area in future challenges.
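
As a small example of the scikit-learn side of this (the estimator and parameter grid are placeholders, not the model used in the competition):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder estimator and grid for a hyper-parameter search.
param_grid = {'n_estimators': [100, 300],
              'max_depth': [3, 6, None]}

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid,
                      cv=5,
                      scoring='neg_mean_absolute_error')
# search.fit(X_train, y_train)    # training data assumed to exist
# print(search.best_params_)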

Tool Sets

In this particular challenge, most groups used either R or python for their solution. I found it interesting that python seemed to be the dominant tool and that most people used the standard python Data Science stack. However, even though everyone used similar tools and processes, we did come up with different approaches to the solutions.

I used Jupyter Notebooks quite extensively for my analysis but realized that I need to re-think how to organize them. As I iterated through the various solutions, I started to spend more time struggling to find which notebook contained a certain piece of code I needed. Sorting and searching through the various notebooks is very limited since the notebook name is all that is displayed on the notebook index.

One of my biggest complaints with Jupyter notebooks is that they don't lend themselves to standard version control like a standalone python script. Obviously, storing a notebook in git or mercurial is possible but it is not very friendly for diff viewing. I recently learned about the nbdime project, which looks very interesting and which I may check out next time.

Speaking of Notebooks, I found a lot of useful examples of python code in the Allstate Kaggle Competition. This specific competition had a data set that lent itself to data analysis approaches that also worked well for the Baseball data. I used a lot of code snippets and ideas from these kernels. I encourage people to check out all of the kernels that are available on Kaggle. They do a nice job of showing how to approach problems from multiple different perspectives.

Another project I will likely use going forward is the Cookiecutter template for Data Science. The basic structure may be a little overkill for a small project but I like the idea of enforcing some consistency in the process. As I looked through this template and the basic thought process for its development, it makes a lot of sense and I look forward to trying it in the future.

Another tool that I used in the project was mlxtend which contains a set of tools that are useful for “day-to-day data science tasks.” I particularly liked the ease of creating a visual plot of a confusion matrix. There are several other useful functions in this package that work quite well with scikit-learn. It’s well worth investigating all the functionality.

Finally, this dataset did have a lot of missing data. I enjoyed using the missingno tool to get a quick visualization of where the missing data was and how prevalent the missing values were. This is a very powerful library for visualizing missing data in a pandas DataFrame.
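
The core of what I used boils down to a couple of lines; the DataFrame and file name below are placeholders:

import missingno as msno
import pandas as pd

df = pd.read_csv('fanduel_data.csv')  # hypothetical file name
msno.matrix(df)                       # visual overview of missing values per column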

Conclusion

I have found that the real life process of analyzing and working through a Data Science challenge is one of the best ways to build up my skills and experience. There are many resources on the web that explain how to use the tools like pandas, sci-kit learn, XGBoost, etc but using the tools is just one piece of the puzzle. The real value is knowing how to smartly apply these tools and intuitively understanding how different choices will impact the rest of the downstream processes. This knowledge can only be gained by doing something over and over. Data Science challenges that focus on real-world issues are tremendously useful opportunities to learn and build skills.

Thanks again to all the people that make Analyze This! possible. I feel very fortunate that this type of event is available in my home town and hopefully others can replicate it in their own geographies.

January 17, 2017 02:10 PM


PyTennessee

PyTN Profiles: Calvin Hendryx-Parker and Juice Analytics



Speaker Profile: Calvin Hendryx-Parker (@calvinhp)

Six Feet Up, Inc. co-founder Calvin Hendryx-Parker has over 18 years of experience in the development and hosting of applications using Python and web frameworks including Django, Pyramid and Flask.

As Chief Technology Officer for Six Feet Up, Calvin is responsible for researching cutting-edge advances that could become part of the company’s technology road map. Calvin provides both the company and its clients with recommendations on tools and technologies, systems architecture and solutions that address specific information-sharing needs. Calvin is an advocate of open source and is a frequent speaker at Python conferences on multisite content management, integration, and web app development. Calvin is also a founder and organizer of the IndyPy meetup group and Pythology training series in Indianapolis.

Outside of work, Calvin spends time tinkering with new devices like the Fitbit, Pebble and Raspberry Pi. Calvin is an avid distance runner and ran the 2014 NYC Marathon to support the Innocence Project. Every year he and the family enjoy an extended trip to France, where his wife Gabrielle, the CEO of Six Feet Up, is from. Calvin holds a Bachelor of Science from Purdue University.

Calvin will be presenting “Open Source Deployment Automation and Orchestration with SaltStack” at 3:00PM Saturday (2/4) in Room 200. Salt is way more than a configuration management tool. It supports many types of other activities such as remote execution and full-blown system orchestration. It can be used as a replacement for remote task tools such as Fabric or Paver.

Sponsor Profile: Juice Analytics (@juiceanalytics)

At Juice, we’re building Juicebox, a cloud platform to allow anyone to build and share stories with data. Juicebox is built on AWS, Python, Backbone.js and D3. We’re looking for a frontend dev with a love of teamwork, a passion for pixels, and a devotion to data. Love of Oxford commas also required.

January 17, 2017 01:28 PM

PyTN Profiles: Jared M. Smith and SimplyAgree



Speaker Profile: Jared M. Smith (@jaredthecoder)

I’m a Research Scientist at Oak Ridge National Laboratory, where I engage in computer security research with the Cyber Warfare Research Team. I am also pursuing my PhD in Computer Science at the University of Tennessee, Knoxville. I founded VolHacks, our university’s hackathon, and HackUTK, our university’s computer security club. I used to work at Cisco Systems as a Software Security Engineer working on pentesting engagements and security tooling.

Back at home, I helped start the Knoxville Python meetup. I also serve on the Knoxville Technology Council, volunteer at the Knoxville Entrepreneur Center, do consulting for VC-backed startups, compete in hackathons and pitch competitions, and hike in the Great Smoky Mountains.

Jared will be presenting “Big Data Analysis in Python with Apache Spark, Pandas, and Matplotlib” at 3:00PM Saturday (2/4) in Room 100. Big data processing is finally approachable for the modern Pythonista. Using Apache Spark and other data analysis tools, we can process, analyze, and visualize more data than ever before using Pythonic APIs and a language you already know, without having to learn Java, C++, or even Fortran. Come hang out and dive into the essentials of big data analysis with Python.


Sponsor Profile: SimplyAgree (@simplyagree)

SimplyAgree is an electronic signature and closing management tool for complex corporate transactions. The app is built on Python, Django and Django REST Framework.

Our growing team is based in East Nashville, TN.

January 17, 2017 01:21 PM