
Planet Python

Last update: October 19, 2018 01:48 AM UTC

October 18, 2018


Stack Abuse

Getting User Input in Python

Introduction

How information is obtained and handled is one of the most important aspects of any programming language, and that is especially true for information supplied by, and returned to, the user.

Python, while comparatively slow in this regard compared to programming languages like C or Java, contains robust tools to obtain, analyze, and process data obtained directly from the end user.

This article briefly explains how different Python functions can be used to obtain information from the user through the keyboard, with the help of some code snippets to serve as examples.

Input in Python

To receive information through the keyboard, Python uses either the input() or raw_input() functions (more about the difference between the two in the following section). These functions have an optional parameter, commonly known as prompt, which is a string that will be printed on the screen whenever the function is called.

When one of the input() or raw_input() functions is called, the program flow stops until the user enters the input via the command line. To actually submit the data, the user needs to press the ENTER key after inputting their string. While hitting the ENTER key usually inserts a newline character ("\n"), it does not in this case. The entered string will simply be submitted to the application.

On a curious note, little has changed between Python versions 2 and 3 in how this functionality works, although the function names themselves did change; the relationship between input() and raw_input() is explained in the next section.

Comparing the input and raw_input Functions

The difference between these functions only depends on which version of Python is being used. In Python 2, the raw_input() function is used to get string input from the user via the command line, while the input() function will actually evaluate the input string and try to run it as Python code.

In Python 3, the raw_input() function has been removed and its behavior moved to the input() function, which is now used to obtain a user's string through the keyboard. The evaluating input() function of Python 2 is discontinued in version 3; to obtain the same functionality, the statement eval(input()) can be used in Python 3.
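As a small, hedged illustration of that equivalence (the expression shown is just an example), eval(input()) in Python 3 evaluates whatever the user types as Python code, much like Python 2's input() did. Use it with caution, since arbitrary code can be executed:

# Python 3
# Evaluate the entered text as a Python expression
expr = input("Enter an expression: ")
result = eval(expr)  # entering 2 + 3 * 4 yields 14
print("Result:", result)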

Take a look at an example of the raw_input() function in Python 2.

# Python 2

txt = raw_input("Type something to test this out: ")  
print "Is this what you just said?", txt  

Output

Type something to test this out: Let the Code be with you!

Is this what you just said? Let the Code be with you!  

Similarly, take a look at an example of the input() function in Python 3.

# Python 3

txt = input("Type something to test this out: ")

# Note that in version 3, the print() function
# requires the use of parentheses.
print("Is this what you just said? ", txt)  

Output

Type something to test this out: Let the Code be with you!  
Is this what you just said? Let the Code be with you!  

From here onwards this article will use the input() function from Python 3, unless specified otherwise.

String and Numeric input

The input() function always converts the information it receives into a string. The previous example demonstrates this behavior.

Numbers, on the other hand, need to be explicitly converted to a numeric type, since they originally come in as strings. The following example demonstrates how numeric input is handled:

# An input is requested and stored in a variable
test_text = input("Enter a number: ")

# Converts the string into an integer. If you need
# to convert the user input into decimal format,
# the float() function is used instead of int()
test_number = int(test_text)

# Prints the variable to the console
print("The number you entered is: ", test_number)  

Output

Enter a number: 13  
The number you entered is: 13  

Another way to do the same thing is as follows:

test_number = int(input("Enter a number: "))  

Here we directly save the input, after immediate conversion, into a variable.

Keep in mind that if the user doesn't actually enter an integer then this code will throw an exception, even if the entered string is a floating point number.
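For instance, in this small illustrative sketch (the value "3.5" is just an example), int() rejects a string that looks like a float, while float() accepts it:

# "3.5" is a valid float but not a valid int
text = "3.5"
print(float(text))  # 3.5
print(int(text))    # raises ValueError: invalid literal for int() with base 10: '3.5'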

Input Exception Handling

There are several ways to ensure that the user enters valid information. One of them is to handle all the possible errors that may occur while the user enters the data.

In this section we'll demonstrate some good methods of error handling when taking input.

But first, here is some unsafe code:

test2word = input("Tell me your age: ")  
test2num = int(test2word)  
print("Wow! Your age is ", test2num)  

When running this code, let's say you enter the following:

Tell me your age: Three  

Here, when the int() function is called with the "Three" string, a ValueError exception is thrown and the program will stop and/or crash.

Now let's see how we would make this code safer to handle user input:

test3word = input("Tell me your lucky number: ")

try:  
    test3num = int(test3word)
    print("This is a valid number! Your lucky number is: ", test3num)
except ValueError:  
    print("This is not a valid number. It isn't a number at all! This is a string, go and try again. Better luck next time!")

This code block will evaluate the new input. If the input is an integer represented as a string then the int() function will convert it into a proper integer. If not, an exception will be raised, but instead of crashing the application it will be caught and the second print statement is run.

Here is an example of this code running when an exception is raised:

Tell me your lucky number: Seven  
This is not a valid number. It isn't a number at all! This is a string, go and try again. Better luck next time!  

This is how input-related errors can be handled in Python. You can combine this code with another construct, like a while loop, to ensure that the code runs repeatedly until you receive the valid integer input that your program requires.

A Complete Example

# Makes a function that will contain the
# desired program.
def example():

    # Calls for an infinite loop that keeps executing
    # until an exception occurs
    while True:
        test4word = input("What's your name? ")

        try:
            test4num = int(input("From 1 to 7, how many hours do you play in your mobile?" ))

        # If something else that is not the string
        # version of a number is introduced, the
        # ValueError exception will be called.
        except ValueError:
            # The loop will keep repeating until a valid number is entered
            print("Error! This is not a number. Try again.")

        # When successfully converted to an integer,
        # the loop will end.
        else:
            print("Impressive, ", test4word, "! You spent", test4num*60, "minutes or", test4num*60*60, "seconds in your mobile!")
            break

# The function is called
example()  

The output will be:

What's your name? Francis  
From 1 to 7, how many hours do you play in your mobile? 3  
Impressive, Francis! You spent 180 minutes or 10800 seconds on your mobile!  

Conclusion

In this article, we saw how the built-in Python utilities can be used to get user input in a variety of formats. We also saw how we can handle the exceptions and errors that can possibly occur while obtaining user input.

October 18, 2018 01:36 PM UTC


PyCharm

PyCharm 2018.3 EAP 7

PyCharm 2018.3 EAP 7 is out! Get it now from the JetBrains website.

In this EAP we have introduced a host of new features as well as fixed bugs for various subsystems.

Read the Release Notes

New in This Version

WSL Support


We have some great news for Windows users: PyCharm now supports Windows Subsystem for Linux (WSL). With support for WSL, you can select a WSL-based Python interpreter in PyCharm’s project interpreter settings and then run and debug your project or perform any other actions as if you had a local interpreter set up. There’s only one exception – you won’t be able to create virtual environments with WSL-based interpreters. All packages have to be installed on the corresponding WSL system interpreter. Before trying this new type of Python interpreter in PyCharm, please make sure you have properly installed WSL.

Read more about WSL support in the PyCharm Documentation.

Structure of ‘from’ Imports


The new “Structure of ‘from’ imports” set of style options is available under Settings (Preferences) | Editor | Code Style | Python. Using these options you can control the code style for imports, choosing between joining imports from the same source into one line or splitting them so that each is placed on a new line when imports are optimized (Ctrl(Cmd)+Alt+O).
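As a rough illustration (the module and names below are just examples, and the exact option labels may differ), the two styles look like this:

# Joined into one line
from collections import Counter, OrderedDict

# Split, with each import on its own line
from collections import Counter
from collections import OrderedDict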

Read more about the other code style options available.

Support for Python Stub Files and PEP-561

PyCharm has supported Python stub files (.pyi) for a while. These files let you specify type hints using Python 3 syntax for both Python 2 and 3. PyCharm shows an asterisk in the left-hand gutter for those code elements that have stubs. Clicking the asterisk jumps to the corresponding stub:

[Screenshot: gutter asterisks on code elements that have stubs]
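
For context, a stub file mirrors the module it describes but contains only signatures and type hints. A minimal, illustrative sketch (the module and function names are made up):

# greetings.pyi -- stub for a hypothetical greetings.py
def greet(name: str, excited: bool = ...) -> str: ...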

With the PEP-561 support introduced in this PyCharm 2018.3 EAP build, you can install stubs as packages for a Python 3.7 interpreter:

[Screenshot: installing a stub-only package for the project interpreter]

Read more about the Python stub files support in the PyCharm Documentation.

Time Tracking


With PyCharm’s built-in Time Tracking plugin, you can track the amount of time you spend on a task when working in the editor. To enable this feature go to Settings/Preferences | Tools | Tasks | Time Tracking, and select the Enable Time Tracking checkbox. Once enabled, you can start using the tool to track and record your productivity.

Read more about the Time Tracking tool in the PyCharm documentation.

Copyright Notices in Project Files

Inserting copyright notices into project files can be daunting. PyCharm makes it easier with its new “Copyright”-related set of settings and features. Set up different copyright profiles, along with the project scopes they apply to, in Settings (Preferences) | Copyright. After your copyright profiles are in place, generate copyright notices by pressing Alt + Insert anywhere in a file:

[Screenshot: generating a copyright notice with Alt + Insert]

Interested?

Download this EAP from our website. Alternatively, you can use the JetBrains Toolbox App to keep up to date with the latest releases throughout the entire EAP.

If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm EAP, and stay up to date. You can find the installation instructions on our website.

PyCharm 2018.3 is in constant development during the EAP phase, therefore not all new features are available yet. More features will be added in the coming weeks. As PyCharm 2018.3 is pre-release software, it is worth noting that it is not as stable as the release versions. Furthermore, we may decide to change and/or drop certain features as the EAP progresses.

All EAP versions will ship with a built-in EAP license, which means that these versions are free to use for up to 30 days after the day that they are built. As EAPs are released weekly, you’ll be able to use PyCharm Professional Edition EAP for free for the duration of the EAP program, as long as you upgrade at least once every 30 days.

 

October 18, 2018 11:21 AM UTC


PyBites

Data Analysis of Pybites Community Branch Activity


Pybites Community Branch Activity

I wanted to play around with a dataset and see what I could find out about it. I decided on analyzing the little bit of data that I could collect from Github without having to use an OAuth key, which limits it to just 300 events.

To Run All of The Cells

You have the option of running each of the cells one at a time, or you can run them all in sequential order. Selecting a cell and either clicking the Run button on the menu or using the key combination Shift+Enter will run that cell if it is a code cell.

To run them all you will have to use the menu: Cell > Run All

In [1]:
import json
from collections import Counter
from pathlib import Path

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from dateutil.parser import parse
from matplotlib import rc
from matplotlib.pyplot import figure
In [2]:
data_location = Path.cwd().joinpath("data")

Retrieving and Importing the Data

The following code will load the three event json files from the data directory if the data directory exists. If the directory is not found, it will be created and the files will be pulled down from Github and then loaded into memory.

In [3]:
def retrieve_data():
    if not data_location.exists():
        data_location.mkdir()
    url = "https://api.github.com/repos/pybites/challenges/events?page={}&per_page=1000"
    for page in range(1, 4):
        response = requests.get(url.format(page))
        if response.ok:
            file_name = data_location.joinpath(f"events{page}.json")
            try:
                file_name.write_text(json.dumps(response.json()))
                print(f"  Created: {file_name.name}")
            except Exception as e:
                print(e)
        else:
            print(f"Something went wrong: [response.status_code]: {response.reason}")


def load_data():
    if data_location.exists():
        for page in range(1, 4):
            file_name = data_location.joinpath(f"events{page}.json")
            events.extend(json.loads(file_name.read_text()))
            print(f"  Loaded: {file_name.name}")
    else:
        print("Data directory was not found:")
        retrieve_data()
        load_data()

NOTE: If you want to work with the latest data, just remove the data directory and all its contents to have it pulled down once again.

In [4]:
events = []
load_data()
print(f"Total Events Loaded: {len(events)}")
  Loaded: events1.json
  Loaded: events2.json
  Loaded: events3.json
Total Events Loaded: 300

Parsing the Data

From what I hear, we should just get used to cleaning up data before we can use it, and it's no exception here. I'm interested in exploring a few key points from the data. Mostly I'm interested in the following:

  • Pull Request Events
  • Date that they were created
  • The username of the developer
  • The amount of time spent on the challenge
  • How difficult they found the challenge to be
In [5]:
# helper function
def parse_data(line):
    if '[' in line:
        data = line.split(': [')[1].replace(']', '').strip()
    else:
        data = line.split(': ')[1].strip()
    return data


# lists to store the data
created = []
devs = []
diff_levels = []
time_spent = []

for event in events:
    # only interested in pull request events
    if event['type'] == 'PullRequestEvent':
        # developer username
        dev = event['actor']['login']
        # ignore pybites ;)
        if dev != 'pybites':
            # store developer username
            devs.append(dev)
            # store the date
            created.append(event['created_at'].split('T')[0])
            # parse comment from user for data
            comment = event['payload']['pull_request']['body']
            for line in comment.split('\n'):
                # get difficulty level and time spent
                if 'Difficulty level (1-10):' in line:
                    diff = parse_data(line)
                elif 'Estimated time spent (hours):' in line:
                    spent = parse_data(line)
            # pandas DataFrames require that all columns are the same length
            # so if we have a missing value, None is used in its place
            if diff:
                diff_levels.append(int(diff))
            else:
                diff_levels.append(None)
            if spent:
                time_spent.append(int(spent))
            else:
                time_spent.append(None)

Creating The DataFrame

Now that we have the lists with the data that we parsed, a DataFrame can be created with them.

In [6]:
df = pd.DataFrame({
    'Developers': devs, 
    'Difficulty_Levels': diff_levels, 
    'Time_Spent': time_spent,
    'Date': created,
})

Data Exploration

Here, we can start exploring the data. To take a quick peek at how it's looking, there is no better choice than to use head().

In [7]:
df.head()
Out[7]:
   Developers  Difficulty_Levels  Time_Spent        Date
0   cod3Ghoul                4.0        20.0  2018-10-17
1   YauheniKr                4.0         2.0  2018-10-16
2   YauheniKr                4.0         2.0  2018-10-16
3    clamytoe                6.0         6.0  2018-10-15
4   vipinreyo                4.0         4.0  2018-10-15

To get some quick statistical metrics on the dataset, describe() can be used.

In [8]:
df.describe()
Out[8]:
       Difficulty_Levels  Time_Spent
count          44.000000   44.000000
mean            3.681818    3.090909
std             1.639239    3.297767
min             1.000000    1.000000
25%             2.000000    1.000000
50%             4.000000    2.000000
75%             5.000000    4.000000
max             8.000000   20.000000

Based on what I could see above, I wanted to pull a few of these figures out individually. The median difficulty level already appears above, next to the 50% row, but I also wanted to show you how to pull it out on its own.

In [9]:
print(f'Developers: {len(df["Developers"])}')
print(f'Average Difficulty: {df["Difficulty_Levels"].median()}')
print(f'Time Spent: {df["Time_Spent"].sum()}')
Developers: 53
Average Difficulty: 4.0
Time Spent: 136.0

The following Counters are just me exploring the data even further.

In [10]:
developers = Counter(df['Developers']).most_common(6)
developers
Out[10]:
[('clamytoe', 8),
 ('sorian', 8),
 ('vipinreyo', 7),
 ('demarcoz', 4),
 ('bbelderbos', 3),
 ('mridubhatnagar', 3)]
In [11]:
bite_difficulty = Counter(df['Difficulty_Levels'].dropna()).most_common()
bite_difficulty
Out[11]:
[(4.0, 13), (2.0, 8), (3.0, 7), (6.0, 6), (5.0, 5), (1.0, 4), (8.0, 1)]
In [12]:
bite_duration = Counter(df['Time_Spent'].dropna()).most_common()
bite_duration
Out[12]:
[(1.0, 16),
 (2.0, 10),
 (3.0, 6),
 (4.0, 4),
 (8.0, 3),
 (6.0, 2),
 (5.0, 2),
 (20.0, 1)]
In [13]:
created_at = sorted(Counter(df['Date'].dropna()).most_common())
created_at
Out[13]:
[('2018-10-01', 1),
 ('2018-10-02', 6),
 ('2018-10-03', 3),
 ('2018-10-04', 4),
 ('2018-10-05', 8),
 ('2018-10-07', 7),
 ('2018-10-08', 4),
 ('2018-10-09', 2),
 ('2018-10-10', 1),
 ('2018-10-11', 1),
 ('2018-10-12', 4),
 ('2018-10-13', 3),
 ('2018-10-14', 3),
 ('2018-10-15', 3),
 ('2018-10-16', 2),
 ('2018-10-17', 1)]

Hmm, how many days are we looking at?

In [14]:
len(created_at)
Out[14]:
16

Time To Get Down To Business

Now that we've loaded our data and cleaned it up, let's see what it can tell us.

Number of Pull Request per Day

Pretty amazing that the Pybites Blog Challenges had over 300 distinct GitHub interactions in such a short time!

In [15]:
# resize graph
figure(num=None, figsize=(6, 6), dpi=80, facecolor='w', edgecolor='k')

# gather data into a custom DataFrame
dates = [day[0] for day in created_at]
prs = [pr[1] for pr in created_at]
df_prs = pd.DataFrame({'xvalues': dates, 'yvalues': prs})
 
# plot
plt.plot('xvalues', 'yvalues', data=df_prs)

# labels
plt.xticks(rotation='vertical', fontweight='bold')

# title
plt.title('Number of Pull Request per Day')

# show the graphic
plt.show()

Top Blog Challenge Ninjas

Although there are many more contributors, I had to limit the count so that the data would be easier to visualize.

In [16]:
# resize graph
figure(num=None, figsize=(6, 6), dpi=80, facecolor='w', edgecolor='k')

# create labels
labels = [dev[0] for dev in developers]

# get a count of the pull requests
prs = [dev[1] for dev in developers]

# pull out top ninja slice
explode = [0] * len(developers)
explode[0] = 0.1

# create the pie chart
plt.pie(prs, explode=explode, labels=labels, shadow=True, startangle=90)

# add title and center
plt.axis('equal')
plt.title('Top Blog Challenge Ninjas')

# show the graphic
plt.show()

Time Spent vs Difficulty Level per Pull Request

Finally, I wanted to explore the relation between the time spent per PR and how difficult the developer found the challenge to be.

In [17]:
# resize graph
figure(num=None, figsize=(15, 6), dpi=80, facecolor='w', edgecolor='k')

# drop null values
df_clean = df.dropna()

# add legend
diff = mpatches.Patch(color='#557f2d', label='Difficulty Level')
time = mpatches.Patch(color='#2d7f5e', label='Time Spent')
plt.legend(handles=[time, diff])

# y-axis in bold
rc('font', weight='bold')

# values of each group
bars1 = df_clean['Difficulty_Levels']
bars2 = df_clean['Time_Spent']

# heights of bars1 + bars2
bars = df_clean['Difficulty_Levels'] + df_clean['Time_Spent']

# position of the bars on the x-axis
r = range(len(df_clean))

# names of group and bar width
names = df_clean['Developers']
barWidth = 1

# create green bars (bottom)
plt.bar(r, bars1, color='#557f2d', edgecolor='white', width=barWidth)
# create green bars (top), on top of the first ones
plt.bar(r, bars2, bottom=bars1, color='#2d7f5e', edgecolor='white', width=barWidth)

# custom X axis
plt.xticks(r, names, rotation='vertical', fontweight='bold')
plt.xlabel("Developers", fontweight='bold')

# title
plt.title('Time Spent vs Difficulty Level per Pull Request')

# show graphic
plt.show()

Conclusions

As you can see, the Pybites Ninjas are an active bunch. Even with such a small, limited dataset, it's plain to see that some good information can be extracted from it. It would be interesting to see which challenges are getting the most action, but I'll leave that as an exercise for you to explore!

October 18, 2018 11:10 AM UTC


Mike Driscoll

Python 101: Episode #29 – Installing Packages

In this screencast we will learn how to install 3rd party modules and packages using easy_install, pip and from source.

You can also read the chapter this video is based on here or get the book on Leanpub

October 18, 2018 05:05 AM UTC

October 17, 2018


Davy Wybiral

LoRa IoT Network Programming | RYLR896

Hey everyone, so I just got some LoRa modules from REYAX to experiment with long range network applications and these things are so cool! So far I've made a long range security alarm, a button to water plants on the other side of my property, and some bridge code to interact with IP and BLE networks.

Just thought I'd do a quick video update on this stuff:




The module I wrote is part of the Espruino collection now: https://www.espruino.com/RYLR

I got these LoRa devices from REYAX: https://reyax.com/products/rylr896/

They seem to only sell them on eBay right now: RYLR896

October 17, 2018 08:15 PM UTC


The No Title® Tech Blog

Haiku R1/beta1 review - revisiting BeOS, 18 years after its latest official release

Having experimented with and used BeOS R5 Pro back in the early 2000s, when the company that created it was going under, I have followed the development of Haiku with some interest all these years. While one can argue that both the old BeOS and Haiku lack some important features to be considered modern OSes these days, the fact is that a lightweight operating system can always be, for instance, an excellent way to bring new life into old, or new but less powerful, hardware.

October 17, 2018 07:30 PM UTC


Stack Abuse

Python Dictionary Tutorial

Introduction

Python comes with a variety of built-in data structures, capable of storing different types of data. A Python dictionary is one such data structure that can store data in the form of key-value pairs. The values in a Python dictionary can be accessed using the keys. In this article, we will be discussing the Python dictionary in detail.

Creating a Dictionary

To create a Python dictionary, we need to pass a sequence of items inside curly braces {}, and separate them using a comma (,). Each item has a key and a value expressed as a "key:value" pair.

The values can belong to any data type and they can repeat, but the keys must remain unique.

The following examples demonstrate how to create Python dictionaries:

Creating an empty dictionary:

dict_sample = {}  

Creating a dictionary with integer keys:

dict_sample = {1: 'mango', 2: 'pawpaw'}  

Creating a dictionary with mixed keys:

dict_sample = {'fruit': 'mango', 1: [4, 6, 8]}  

We can also create a dictionary by explicitly calling Python's built-in dict() function:

dict_sample = dict({1:'mango', 2:'pawpaw'})  

A dictionary can also be created from a sequence as shown below:

dict_sample = dict([(1,'mango'), (2,'pawpaw')])  

Dictionaries can also be nested, which means that we can create a dictionary inside another dictionary. For example:

dict_sample = {1: {'student1' : 'Nicholas', 'student2' : 'John', 'student3' : 'Mercy'},  
        2: {'course1' : 'Computer Science', 'course2' : 'Mathematics', 'course3' : 'Accounting'}}

To print the dictionary contents, we can use Python's print() function and pass the dictionary name as the argument to the function. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2012}

Accessing Elements

To access dictionary items, pass the key inside square brackets []. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
x = dict_sample["model"]  
print(x)  

Output:

Premio  

We created a dictionary named dict_sample. A variable named x was then created and its value is set to be the value for the key "model" in the dictionary.

Here is another example:

dict = {'Name': 'Mercy', 'Age': 23, 'Course': 'Accounting'}  
print("Student Name:", dict['Name'])  
print("Course:", dict['Course'])  
print("Age:", dict['Age'])  

Output:

Student Name: Mercy  
Course: Accounting  
Age: 23  

The dictionary object also provides the get() function, which can be used to access dictionary elements as well. We call the function on the dictionary using the dot operator and then pass the name of the key as the argument to the function. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
x = dict_sample.get("model")  
print(x)  

Output:

Premio  
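Unlike the square-bracket syntax, get() does not raise a KeyError when the key is missing; it returns None, or a default that you supply as a second argument. A small illustrative sketch (the "color" key is not in the dictionary above):

x = dict_sample.get("color")           # returns None
y = dict_sample.get("color", "Gray")   # returns "Gray"
print(x, y)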

Now we know how to access dictionary elements using a few different methods. In the next section we'll discuss how to add new elements to an already existing dictionary.

Adding Elements

There are numerous ways to add new elements to a dictionary. We can use a new index key and assign a value to it. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample["Capacity"] = "1800CC"  
print(dict_sample)  

Output:

{'Capacity': '1800CC', 'year': 2012, 'Company': 'Toyota', 'model': 'Premio'}

The new element has "Capacity" as the key and "1800CC" as its corresponding value. Note that it happens to appear first in this output; prior to Python 3.7, dictionaries do not guarantee any particular ordering, so the position of a new entry in the printed output may vary.

Here is another example. First let's first create an empty dictionary:

MyDictionary = {}  
print("An Empty Dictionary: ")  
print(MyDictionary)  

Output:

An Empty Dictionary:  

The dictionary returns nothing as it has nothing stored yet. Let us add some elements to it, one at a time:

MyDictionary[0] = 'Apples'  
MyDictionary[2] = 'Mangoes'  
MyDictionary[3] = 20  
print("\n3 elements have been added: ")  
print(MyDictionary)  

Output:

3 elements have been added:  
{0: 'Apples', 2: 'Mangoes', 3: 20}

To add the elements, we specified keys as well as the corresponding values. For example:

MyDictionary[0] = 'Apples'  

In the above example, 0 is the key while "Apples" is the value.

It is even possible for us to add a set of values to one key. For example:

MyDictionary['Values'] = 1, "Pairs", 4  
print("\nA key with multiple values has been added: ")  
print(MyDictionary)  

Output:

A key with multiple values has been added:  
{0: 'Apples', 2: 'Mangoes', 3: 20, 'Values': (1, 'Pairs', 4)}

In the above example, the name of the key is "Values" while everything after the = sign are the actual values for that key, stored together as a tuple.

Other than adding new elements to a dictionary, dictionary elements can also be updated/changed, which we'll go over in the next section.

Updating Elements

After adding a value to a dictionary we can then modify the existing dictionary element. You use the key of the element to change the corresponding value. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

dict_sample["year"] = 2014

print(dict_sample)  

Output:

{'year': 2014, 'model': 'Premio', 'Company': 'Toyota'}

In this example you can see that we have updated the value for the key "year" from the old value of 2012 to a new value of 2014.

Removing Elements

The removal of an element from a dictionary can be done in several ways, which we'll discuss one-by-one in this section:

The del keyword can be used to remove the element with the specified key. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
del dict_sample["year"]  
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio'}

We used the del keyword followed by the dictionary name. Inside the square brackets that follow the dictionary name, we passed the key of the element we need to delete from the dictionary, which in this example was "year". The entry for "year" in the dictionary was then deleted.

Another way to delete a key-value pair is to use the pop() function and pass the key of the entry to be deleted as the argument to the function. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample.pop("year")  
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio'}

We called the pop() function on the dictionary. Again, in this example the entry for "year" in the dictionary will be deleted.

The popitem() function removes the last item inserted into the dictionary (in Python 3.7 and later; in earlier versions an arbitrary item is removed), without needing to specify the key. Take a look at the following example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample.popitem()  
print(dict_sample)  

Output:

{'Company': 'Toyota', 'model': 'Premio'}

The last entry into the dictionary was "year". It has been removed after calling the popitem() function.

But what if you want to delete the entire dictionary? It would be difficult and cumbersome to use one of these methods on every single key. Instead, you can use the del keyword to delete the entire dictionary. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
del dict_sample  
print(dict_sample)  

Output:

NameError: name 'dict_sample' is not defined  

The code returns an error. The reason is that we are trying to access a dictionary which doesn't exist since it has been deleted.

However, your use-case may require you to just remove all dictionary elements and be left with an empty dictionary. This can be achieved by calling the clear() function on the dictionary:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
dict_sample.clear()  
print(dict_sample)  

Output:

{}

The code returns an empty dictionary since all the dictionary elements have been removed.

Other Common Methods

The len() Method

With this method, you can count the number of elements in a dictionary. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
print(len(dict_sample))  

Output:

3  

There are three entries in the dictionary, hence the method returned 3.

The copy() Method

This method returns a copy of the existing dictionary. For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}
x = dict_sample.copy()

print(x)  

Output:

{'Company': 'Toyota', 'year': 2012, 'model': 'Premio'}

We created a copy of dictionary named dict_sample and assigned it to the variable x. If x is printed on the console, you will see that it contains the same elements as those stored by dict_sample dictionary.

Note that this is useful because modifications made to the copied dictionary won't affect the original one.
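A quick, illustrative check of that independence, continuing with the dictionaries above:

x["year"] = 2015
print(dict_sample["year"])  # still prints 2012; the original is untouched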

The items() Method

When called, this method returns a view object containing the dictionary's key-value pairs as tuples. This method is primarily used when you want to iterate through a dictionary.

The method is simply called on the dictionary object name as shown below:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

for k, v in dict_sample.items():  
  print(k, v)

Output:

Company Toyota
model Premio
year 2012

The object returned by items() also reflects any changes later made to the dictionary. This is demonstrated below:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.items()

print(x)

dict_sample["model"] = "Mark X"

print(x)  

Output:

dict_items([('Company', 'Toyota'), ('model', 'Premio'), ('year', 2012)])  
dict_items([('Company', 'Toyota'), ('model', 'Mark X'), ('year', 2012)])  

The output shows that when you change a value in the dictionary, the items object is also updated to reflect this change.

The fromkeys() Method

This method returns a dictionary with the specified keys and values. It takes the syntax given below:

dictionary.fromkeys(keys, value)  

The required keys parameter is an iterable that specifies the keys of the new dictionary. The value parameter is optional and specifies the default value for all the keys; it defaults to None.

Suppose we need to create a dictionary of three keys all with the same value. We can do so as follows:

name = ('John', 'Nicholas', 'Mercy')  
age = 25

dict_sample = dict.fromkeys(name, age)


print(dict_sample)  

Output:

{'John': 25, 'Mercy': 25, 'Nicholas': 25}

In the script above, we specified the keys and one value. The fromkeys() method was able to pick the keys and combine them with this value to create a populated dictionary.

The keys parameter is mandatory. The following example demonstrates what happens when the value parameter is not specified:

name = ('John', 'Nicholas', 'Mercy')

dict_sample = dict.fromkeys(name)


print(dict_sample)  

Output:

{'John': None, 'Mercy': None, 'Nicholas': None}

The default value, which is None, was used.

The setdefault() Method

This method is applicable when we need to get the value of the element with the specified key. If the key is not found, it will be inserted into the dictionary alongside the specified value.

The method takes the following syntax:

dictionary.setdefault(keyname, value)  

In this function the keyname parameter is required. It is the key of the item whose value you need to return. The value parameter is optional. If the dictionary already has the key, this parameter won't have any effect. If the key doesn't exist, then the value given in this function will become the value of the key. It has a default value of None.

For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.setdefault("color", "Gray")

print(x)  

Output

Gray  

The dictionary doesn't have a key for color. The setdefault() method inserted this key, and the specified value, "Gray", was used as its value.

The following example demonstrates how the method behaves if the value for the key does exist:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.setdefault("model", "Allion")

print(x)  

Output:

Premio  

The value "Allion" has no effect on the dictionary since we already have a value for the key.

The keys() Method

This method also returns an iterable object. The object returned is a view of all the keys in the dictionary. And just like with the items() method, the returned object reflects changes made to the dictionary, as demonstrated after the example below.

To use this method, we only call it on the name of the dictionary, as shown below:

dictionary.keys()  

For example:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.keys()

print(x)  

Output:

dict_keys(['model', 'Company', 'year'])  
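Just like the items object, the keys view reflects later changes to the dictionary. Here is a brief sketch of that behavior (the added color key is just an example):

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

x = dict_sample.keys()
print(x)                        # dict_keys without 'color'

dict_sample["color"] = "Gray"   # add a new key after creating the view

print(x)                        # the view now includes 'color'
                                # (key order may differ depending on your Python version)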

Oftentimes this method is used to iterate through each key in your dictionary, like so:

dict_sample = {  
  "Company": "Toyota",
  "model": "Premio",
  "year": 2012
}

for k in dict_sample.keys():  
  print(k)

Output:

Company  
model  
year  

Conclusion

This marks the end of this tutorial on Python dictionaries. These dictionaries store data in "key:value" pairs. The "key" acts as the identifier for the item while the "value" is the value of the item. The Python dictionary comes with a variety of functions that can be applied for retrieval or manipulation of data. In this article, we saw how a Python dictionary can be created, modified, and deleted, along with some of the most commonly used dictionary methods.

October 17, 2018 02:15 PM UTC


Real Python

Python, Boto3, and AWS S3: Demystified

Amazon Web Services (AWS) has become a leader in cloud computing. One of its core components is S3, the object storage service offered by AWS. With its impressive availability and durability, it has become the standard way to store videos, images, and data. You can combine S3 with other services to build infinitely scalable applications.

Boto3 is the name of the Python SDK for AWS. It allows you to directly create, update, and delete AWS resources from your Python scripts.

If you’ve had some AWS exposure before, have your own AWS account, and want to take your skills to the next level by starting to use AWS services from within your Python code, then keep reading.

Before exploring Boto3’s characteristics, you will first see how to configure the SDK on your machine. This step will set you up for the rest of the tutorial.


Installation

To install Boto3 on your computer, go to your terminal and run the following:

$ pip install boto3

You’ve got the SDK. But, you won’t be able to use it right now, because it doesn’t know which AWS account it should connect to.

To make it run against your AWS account, you'll need to provide some valid credentials. If you already have an IAM user that has full permissions to S3, you can use that user's credentials (their access key and their secret access key) without needing to create a new user. Otherwise, the easiest way to do this is to create a new AWS user and then store the new credentials.

To create a new user, go to your AWS account, then go to Services and select IAM. Then choose Users and click on Add user.

Give the user a name (for example, boto3user). Enable programmatic access. This will ensure that this user will be able to work with any AWS supported SDK or make separate API calls:

add AWS IAM user

To keep things simple, choose the preconfigured AmazonS3FullAccess policy. With this policy, the new user will be able to have full control over S3. Click on Next: Review:

aws s3 IAM user add policy

Select Create user:

aws s3 IAM user finish creation

A new screen will show you the user’s generated credentials. Click on the Download .csv button to make a copy of the credentials. You will need them to complete your setup.

Now that you have your new user, create a new file, ~/.aws/credentials:

$ touch ~/.aws/credentials

Open the file and paste the structure below. Fill in the placeholders with the new user credentials you have downloaded:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Save the file.

Now that you have set up these credentials, you have a default profile, which will be used by Boto3 to interact with your AWS account.

There is one more configuration to set up: the default region that Boto3 should interact with. You can check out the complete table of the supported AWS regions. Choose the region that is closest to you. Copy your preferred region from the Region column. In my case, I am using eu-west-1 (Ireland).

Create a new file, ~/.aws/config:

$ touch ~/.aws/config

Add the following and replace the placeholder with the region you have copied:

[default]
region = YOUR_PREFERRED_REGION

Save your file.

You are now officially set up for the rest of the tutorial.

Next, you will see the different options Boto3 gives you to connect to S3 and other AWS services.

Client Versus Resource

At its core, all that Boto3 does is call AWS APIs on your behalf. For the majority of the AWS services, Boto3 offers two distinct ways of accessing these abstracted APIs:

  • Client: low-level service access
  • Resource: higher-level, object-oriented service access

You can use either to interact with S3.

To connect to the low-level client interface, you must use Boto3’s client(). You then pass in the name of the service you want to connect to, in this case, s3:

import boto3
s3_client = boto3.client('s3')

To connect to the high-level interface, you’ll follow a similar approach, but use resource():

import boto3
s3_resource = boto3.resource('s3')

You’ve successfully connected to both versions, but now you might be wondering, “Which one should I use?”

With clients, there is more programmatic work to be done. The majority of the client operations give you a dictionary response. To get the exact information that you need, you’ll have to parse that dictionary yourself. With resource methods, the SDK does that work for you.

With the client, you might see some slight performance improvements. The disadvantage is that your code becomes less readable than it would be if you were using the resource. Resources offer a better abstraction, and your code will be easier to comprehend.

Understanding how the client and the resource are generated is also important when you’re considering which one to choose:

Boto3 generates the client and the resource from different definitions. As a result, you may find cases in which an operation supported by the client isn’t offered by the resource. Here’s the interesting part: you don’t need to change your code to use the client everywhere. For that operation, you can access the client directly via the resource like so: s3_resource.meta.client.

One such client operation is .generate_presigned_url(), which enables you to give your users access to an object within your bucket for a set period of time, without requiring them to have AWS credentials.
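As a minimal sketch of generating such a URL through the client exposed by the resource (the bucket and key names here are placeholders, and the one-hour expiry is just an example):

url = s3_resource.meta.client.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'your-bucket-name', 'Key': 'your-object-key'},
    ExpiresIn=3600)  # the link expires after one hour
print(url)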

Common Operations

Now that you know about the differences between clients and resources, let’s start using them to build some new S3 components.

Creating a Bucket

To start off, you need an S3 bucket. To create one programmatically, you must first choose a name for your bucket. Remember that this name must be unique throughout the whole AWS platform, as bucket names are DNS compliant. If you try to create a bucket, but another user has already claimed your desired bucket name, your code will fail. Instead of success, you will see the following error: botocore.errorfactory.BucketAlreadyExists.

You can increase your chance of success when creating your bucket by picking a random name. You can generate your own function that does that for you. In this implementation, you’ll see how using the uuid module will help you achieve that. A UUID4’s string representation is 36 characters long (including hyphens), and you can add a prefix to specify what each bucket is for.

Here’s a way you can achieve that:

import uuid
def create_bucket_name(bucket_prefix):
    # The generated bucket name must be between 3 and 63 chars long
    return ''.join([bucket_prefix, str(uuid.uuid4())])

You’ve got your bucket name, but now there’s one more thing you need to be aware of: unless your region is in the United States, you’ll need to define the region explicitly when you are creating the bucket. Otherwise you will get an IllegalLocationConstraintException.

To exemplify what this means when you’re creating your S3 bucket in a non-US region, take a look at the code below:

s3_resource.create_bucket(Bucket=YOUR_BUCKET_NAME,
                          CreateBucketConfiguration={
                              'LocationConstraint': 'eu-west-1'})

You need to provide both a bucket name and a bucket configuration where you must specify the region, which in my case is eu-west-1.

This isn’t ideal. Imagine that you want to take your code and deploy it to the cloud. Your task will become increasingly more difficult because you’ve now hardcoded the region. You could refactor the region and transform it into an environment variable, but then you’d have one more thing to manage.

Luckily, there is a better way to get the region programmatically, by taking advantage of a session object. Boto3 will create the session from your credentials. You just need to take the region and pass it to create_bucket() as its LocationConstraint configuration. Here's how to do that:

def create_bucket(bucket_prefix, s3_connection):
    session = boto3.session.Session()
    current_region = session.region_name
    bucket_name = create_bucket_name(bucket_prefix)
    bucket_response = s3_connection.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={
        'LocationConstraint': current_region})
    print(bucket_name, current_region)
    return bucket_name, bucket_response

The nice part is that this code works no matter where you want to deploy it: locally/EC2/Lambda. Moreover, you don’t need to hardcode your region.

As both the client and the resource create buckets in the same way, you can pass either one as the s3_connection parameter.

You’ll now create two buckets. First create one using the client, which gives you back the bucket_response as a dictionary:

>>> first_bucket_name, first_response = create_bucket(
...     bucket_prefix='firstpythonbucket', 
...     s3_connection=s3_resource.meta.client)
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304 eu-west-1

>>> first_response
{'ResponseMetadata': {'RequestId': 'E1DCFE71EDE7C1EC', 'HostId': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'x-amz-request-id': 'E1DCFE71EDE7C1EC', 'date': 'Fri, 05 Oct 2018 15:00:00 GMT', 'location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'Location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/'}

Then create a second bucket using the resource, which gives you back a Bucket instance as the bucket_response:

>>> second_bucket_name, second_response = create_bucket(
...     bucket_prefix='secondpythonbucket', s3_connection=s3_resource)
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644 eu-west-1

>>> second_response
s3.Bucket(name='secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644')

You’ve got your buckets. Next, you’ll want to start adding some files to them.

Naming Your Files

You can name your objects by using standard file naming conventions. You can use any valid name. In this article, you’ll look at a more specific case that helps you understand how S3 works under the hood.

If you’re planning on hosting a large number of files in your S3 bucket, there’s something you should keep in mind. If all your file names have a deterministic prefix that gets repeated for every file, such as a timestamp format like “YYYY-MM-DDThh:mm:ss”, then you will soon find that you’re running into performance issues when you’re trying to interact with your bucket.

This will happen because S3 takes the prefix of the file and maps it onto a partition. The more files you add, the more will be assigned to the same partition, and that partition will be very heavy and less responsive.

What can you do to keep that from happening?

The easiest solution is to randomize the file name. You can imagine many different implementations, but in this case, you'll use the trusted uuid module to help with that. To make the file names easier to read for this tutorial, you'll take the first six characters of the generated number's hex representation and concatenate them with your base file name.

The helper function below allows you to pass in the number of bytes you want the file to have, the file name, and a sample content for the file to be repeated to make up the desired file size:

def create_temp_file(size, file_name, file_content):
    random_file_name = ''.join([str(uuid.uuid4().hex[:6]), file_name])
    with open(random_file_name, 'w') as f:
        f.write(str(file_content) * size)
    return random_file_name

Create your first file, which you’ll be using shortly:

first_file_name = create_temp_file(300, 'firstfile.txt', 'f')   

By adding randomness to your file names, you can efficiently distribute your data within your S3 bucket.

Creating Bucket and Object Instances

The next step after creating your file is to see how to integrate it into your S3 workflow.

This is where the resource’s classes play an important role, as these abstractions make it easy to work with S3.

By using the resource, you have access to the high-level classes (Bucket and Object). This is how you can create one of each:

first_bucket = s3_resource.Bucket(name=first_bucket_name)
first_object = s3_resource.Object(
    bucket_name=first_bucket_name, key=first_file_name)

The reason you have not seen any errors with creating the first_object variable is that Boto3 doesn’t make calls to AWS to create the reference. The bucket_name and the key are called identifiers, and they are the necessary parameters to create an Object. Any other attribute of an Object, such as its size, is lazily loaded. This means that for Boto3 to get the requested attributes, it has to make calls to AWS.

Understanding Sub-resources

Bucket and Object are sub-resources of one another. Sub-resources are methods that create a new instance of a child resource. The parent’s identifiers get passed to the child resource.

If you have a Bucket variable, you can create an Object directly:

first_object_again = first_bucket.Object(first_file_name)

Or if you have an Object variable, then you can get the Bucket:

first_bucket_again = first_object.Bucket()

Great, you now understand how to generate a Bucket and an Object. Next, you’ll get to upload your newly generated file to S3 using these constructs.

Uploading a File

There are three ways you can upload a file:

  • From an Object instance
  • From a Bucket instance
  • From the client

In each case, you have to provide the Filename, which is the path of the file you want to upload. You'll now explore the three alternatives. Feel free to pick whichever you like most to upload the first_file_name to S3.

Object Instance Version

You can upload using an Object instance:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    Filename=first_file_name)

Or you can use the first_object instance:

first_object.upload_file(first_file_name)

Bucket Instance Version

Here’s how you can upload using a Bucket instance:

s3_resource.Bucket(first_bucket_name).upload_file(
    Filename=first_file_name, Key=first_file_name)

Client Version

You can also upload using the client:

s3_resource.meta.client.upload_file(
    Filename=first_file_name, Bucket=first_bucket_name,
    Key=first_file_name)

You have successfully uploaded your file to S3 using one of the three available methods. In the upcoming sections, you’ll mainly work with the Object class, as the operations are very similar between the client and the Bucket versions.

Downloading a File

To download a file from S3 locally, you'll follow similar steps as you did when uploading. But in this case, the Filename parameter will map to your desired local path. This time, it will download the file to the /tmp directory:

s3_resource.Object(first_bucket_name, first_file_name).download_file(
    f'/tmp/{first_file_name}') # Python 3.6+

You’ve successfully downloaded your file from S3. Next, you’ll see how to copy the same file between your S3 buckets using a single API call.

Copying an Object Between Buckets

If you need to copy files from one bucket to another, Boto3 offers you that possibility. In this example, you’ll copy the file from the first bucket to the second, using .copy():

def copy_to_bucket(bucket_from_name, bucket_to_name, file_name):
    copy_source = {
        'Bucket': bucket_from_name,
        'Key': file_name
    }
    s3_resource.Object(bucket_to_name, file_name).copy(copy_source)

copy_to_bucket(first_bucket_name, second_bucket_name, first_file_name)

Note: If you’re aiming to replicate your S3 objects to a bucket in a different region, have a look at Cross Region Replication.

Deleting an Object

Let’s delete the new file from the second bucket by calling .delete() on the equivalent Object instance:

s3_resource.Object(second_bucket_name, first_file_name).delete()

You’ve now seen how to use S3’s core operations. You’re ready to take your knowledge to the next level with more complex characteristics in the upcoming sections.

Advanced Configurations

In this section, you’re going to explore more elaborate S3 features. You’ll see examples of how to use them and the benefits they can bring to your applications.

ACL (Access Control Lists)

Access Control Lists (ACLs) help you manage access to your buckets and the objects within them. They are considered the legacy way of administrating permissions to S3. Why should you know about them? If you have to manage access to individual objects, then you would use an Object ACL.

By default, when you upload an object to S3, that object is private. If you want to make this object available to someone else, you can set the object’s ACL to be public at creation time. Here’s how you upload a new file to the bucket and make it accessible to everyone:

second_file_name = create_temp_file(400, 'secondfile.txt', 's')
second_object = s3_resource.Object(first_bucket.name, second_file_name)
second_object.upload_file(second_file_name, ExtraArgs={
                          'ACL': 'public-read'})

You can get the ObjectAcl instance from the Object, as it is one of its sub-resource classes:

second_object_acl = second_object.Acl()

To see who has access to your object, use the grants attribute:

>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}, {'Grantee': {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/global/AllUsers'}, 'Permission': 'READ'}]

You can make your object private again, without needing to re-upload it:

>>> response = second_object_acl.put(ACL='private')
>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}]

You have seen how you can use ACLs to manage access to individual objects. Next, you’ll see how you can add an extra layer of security to your objects by using encryption.

Note: If you’re looking to split your data into multiple categories, have a look at tags. You can grant access to the objects based on their tags.

Encryption

With S3, you can protect your data using encryption. You’ll explore server-side encryption using the AES-256 algorithm where AWS manages both the encryption and the keys.

Create a new file and upload it using ServerSideEncryption:

third_file_name = create_temp_file(300, 'thirdfile.txt', 't')
third_object = s3_resource.Object(first_bucket_name, third_file_name)
third_object.upload_file(third_file_name, ExtraArgs={
                         'ServerSideEncryption': 'AES256'})

You can check the algorithm that was used to encrypt the file, in this case AES256:

>>> third_object.server_side_encryption
'AES256'

You now understand how to add an extra layer of protection to your objects using the AES-256 server-side encryption algorithm offered by AWS.

Storage

Every object that you add to your S3 bucket is associated with a storage class. All the available storage classes offer high durability. You choose how you want to store your objects based on your application’s performance access requirements.

At present, you can use the following storage classes with S3:

  • STANDARD: default for frequently accessed data
  • STANDARD_IA: for infrequently used data that needs to be retrieved rapidly when requested
  • ONEZONE_IA: for the same use case as STANDARD_IA, but stores the data in one Availability Zone instead of three
  • REDUCED_REDUNDANCY: for frequently used noncritical data that is easily reproducible

If you want to change the storage class of an existing object, you need to recreate the object.

For example, reupload the third_object and set its storage class to Standard_IA:

third_object.upload_file(third_file_name, ExtraArgs={
                         'ServerSideEncryption': 'AES256', 
                         'StorageClass': 'STANDARD_IA'})

Note: If you make changes to your object, you might find that your local instance doesn’t show them. What you need to do at that point is call .reload() to fetch the newest version of your object.

Reload the object, and you can see its new storage class:

>>> third_object.reload()
>>> third_object.storage_class
'STANDARD_IA'

Note: Use LifeCycle Configurations to transition objects through the different classes as you find the need for them. They will automatically transition these objects for you.
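As an illustration only, a lifecycle rule that moves objects to STANDARD_IA after 30 days could be put in place through the client roughly like this (the rule ID and the empty prefix filter are assumptions made for this sketch):

s3_resource.meta.client.put_bucket_lifecycle_configuration(
    Bucket=first_bucket_name,
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'move-to-ia',
            'Filter': {'Prefix': ''},   # apply the rule to every object
            'Status': 'Enabled',
            'Transitions': [{'Days': 30, 'StorageClass': 'STANDARD_IA'}],
        }]
    })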

Versioning

You should use versioning to keep a complete record of your objects over time. It also acts as a protection mechanism against accidental deletion of your objects. When you request a versioned object, Boto3 will retrieve the latest version.

When you add a new version of an object, the storage that object takes in total is the sum of the size of its versions. So if you’re storing an object of 1 GB, and you create 10 versions, then you have to pay for 10GB of storage.

Enable versioning for the first bucket. To do this, you need to use the BucketVersioning class:

def enable_bucket_versioning(bucket_name):
    bkt_versioning = s3_resource.BucketVersioning(bucket_name)
    bkt_versioning.enable()
    print(bkt_versioning.status)
>>> enable_bucket_versioning(first_bucket_name)
Enabled

Then create two new versions for the first file Object, one with the contents of the original file and one with the contents of the third file:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
   first_file_name)
s3_resource.Object(first_bucket_name, first_file_name).upload_file(
   third_file_name)

Now reupload the second file, which will create a new version:

s3_resource.Object(first_bucket_name, second_file_name).upload_file(
    second_file_name)

You can retrieve the latest available version of your objects like so:

>>> s3_resource.Object(first_bucket_name, first_file_name).version_id
'eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv'

In this section, you’ve seen how to work with some of the most important S3 attributes and add them to your objects. Next, you’ll see how to easily traverse your buckets and objects.

Traversals

If you need to retrieve information from or apply an operation to all your S3 resources, Boto3 gives you several ways to iteratively traverse your buckets and your objects. You’ll start by traversing all your created buckets.

Bucket Traversal

To traverse all the buckets in your account, you can use the resource’s buckets attribute alongside .all(), which gives you the complete list of Bucket instances:

>>> for bucket in s3_resource.buckets.all():
...     print(bucket.name)
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644

You can use the client to retrieve the bucket information as well, but the code is more complex, as you need to extract it from the dictionary that the client returns:

>>> for bucket_dict in s3_resource.meta.client.list_buckets().get('Buckets'):
...     print(bucket_dict['Name'])
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644

You have seen how to iterate through the buckets you have in your account. In the upcoming section, you’ll pick one of your buckets and iteratively view the objects it contains.

Object Traversal

If you want to list all the objects from a bucket, the following code will generate an iterator for you:

>>> for obj in first_bucket.objects.all():
...     print(obj.key)
...
127367firstfile.txt
616abesecondfile.txt
fb937cthirdfile.txt

The obj variable is an ObjectSummary. This is a lightweight representation of an Object. The summary version doesn’t support all of the attributes that the Object has. If you need to access them, use the Object() sub-resource to create a new reference to the underlying stored key. Then you’ll be able to extract the missing attributes:

>>> for obj in first_bucket.objects.all():
...     subsrc = obj.Object()
...     print(obj.key, obj.storage_class, obj.last_modified,
...           subsrc.version_id, subsrc.metadata)
...
127367firstfile.txt STANDARD 2018-10-05 15:09:46+00:00 eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv {}
616abesecondfile.txt STANDARD 2018-10-05 15:09:47+00:00 WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6 {}
fb937cthirdfile.txt STANDARD_IA 2018-10-05 15:09:05+00:00 null {}

You can now iteratively perform operations on your buckets and objects. You’re almost done. There’s one more thing you should know at this stage: how to delete all the resources you’ve created in this tutorial.

Deleting Buckets and Objects

To remove all the buckets and objects you have created, you must first make sure that your buckets have no objects within them.

Deleting a Non-empty Bucket

To be able to delete a bucket, you must first delete every single object within the bucket, or else the BucketNotEmpty exception will be raised. When you have a versioned bucket, you need to delete every object and all its versions.

If you find that a LifeCycle rule that will do this automatically for you isn't suitable to your needs, here's how you can programmatically delete the objects:

def delete_all_objects(bucket_name):
    res = []
    bucket=s3_resource.Bucket(bucket_name)
    for obj_version in bucket.object_versions.all():
        res.append({'Key': obj_version.object_key,
                    'VersionId': obj_version.id})
    print(res)
    bucket.delete_objects(Delete={'Objects': res})

The above code works whether or not you have enabled versioning on your bucket. If you haven’t, the version of the objects will be null. You can batch up to 1000 deletions in one API call, using .delete_objects() on your Bucket instance, which is more cost-effective than individually deleting each object.

Run the new function against the first bucket to remove all the versioned objects:

>>> delete_all_objects(first_bucket_name)
[{'Key': '127367firstfile.txt', 'VersionId': 'eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv'}, {'Key': '127367firstfile.txt', 'VersionId': 'UnQTaps14o3c1xdzh09Cyqg_hq4SjB53'}, {'Key': '127367firstfile.txt', 'VersionId': 'null'}, {'Key': '616abesecondfile.txt', 'VersionId': 'WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6'}, {'Key': '616abesecondfile.txt', 'VersionId': 'null'}, {'Key': 'fb937cthirdfile.txt', 'VersionId': 'null'}]

As a final test, you can upload a file to the second bucket. This bucket doesn’t have versioning enabled, and thus the version will be null. Apply the same function to remove the contents:

>>> s3_resource.Object(second_bucket_name, first_file_name).upload_file(
...     first_file_name)
>>> delete_all_objects(second_bucket_name)
[{'Key': '9c8b44firstfile.txt', 'VersionId': 'null'}]

You’ve successfully removed all the objects from both your buckets. You’re now ready to delete the buckets.

Deleting Buckets

To finish off, you’ll use .delete() on your Bucket instance to remove the first bucket:

s3_resource.Bucket(first_bucket_name).delete()

If you want, you can use the client version to remove the second bucket:

s3_resource.meta.client.delete_bucket(Bucket=second_bucket_name)

Both operations were successful because you emptied each bucket before attempting to delete it.

You’ve now run some of the most important operations that you can perform with S3 and Boto3. Congratulations on making it this far! As a bonus, let’s explore some of the advantages of managing S3 resources with Infrastructure as Code.

Python Code or Infrastructure as Code (IaC)?

As you’ve seen, most of the interactions you’ve had with S3 in this tutorial had to do with objects. You didn’t see many bucket-related operations, such as adding policies to the bucket, adding a LifeCycle rule to transition your objects through the storage classes, archiving them to Glacier or deleting them altogether, or enforcing that all objects be encrypted by configuring Bucket Encryption.

Manually managing the state of your buckets via Boto3’s clients or resources becomes increasingly difficult as your application starts adding other services and grows more complex. To monitor your infrastructure in concert with Boto3, consider using an Infrastructure as Code (IaC) tool such as CloudFormation or Terraform to manage your application’s infrastructure. Either one of these tools will maintain the state of your infrastructure and inform you of the changes that you’ve applied.


Conclusion

Congratulations on making it to the end of this tutorial!

You’re now equipped to start working programmatically with S3. You now know how to create objects, upload them to S3, download their contents and change their attributes directly from your script, all while avoiding common pitfalls with Boto3.

May this tutorial be a stepping stone in your journey to building something great using AWS!


October 17, 2018 02:00 PM UTC


Stack Abuse

Creating a Neural Network from Scratch in Python: Multi-class Classification

This is the third article in the series of articles on "Creating a Neural Network From Scratch in Python".

If you have no prior experience with neural networks, I would suggest you first read Part 1 and Part 2 of the series (linked above). Once you feel comfortable with the concepts explained in those articles, you can come back and continue this article.

Introduction

In the previous article, we saw how we can create a neural network from scratch, which is capable of solving binary classification problems, in Python. A binary classification problem has only two outputs. However, real-world problems are far more complex.

Consider the example of digit recognition problem where we use the image of a digit as an input and the classifier predicts the corresponding digit number. A digit can be any number between 0 and 9. This is a classic example of a multi-class classification problem where input may belong to any of the 10 possible outputs.

In this article, we will see how we can create a simple neural network from scratch in Python, which is capable of solving multi-class classification problems.

Dataset

Let's first briefly take a look at our dataset. Our dataset will have two input features and one of three possible outputs. We will manually create a dataset for this article.

To do so, execute the following script:

import numpy as np  
import matplotlib.pyplot as plt

np.random.seed(42)

cat_images = np.random.randn(700, 2) + np.array([0, -3])  
mouse_images = np.random.randn(700, 2) + np.array([3, 3])  
dog_images = np.random.randn(700, 2) + np.array([-3, 3])  

In the script above, we start by importing our libraries and then we create three two-dimensional arrays of size 700 x 2. You can think of each row in an array as an image of a particular animal. Each array corresponds to one of the three output classes.

An important point to note here is that if we plot the elements of the cat_images array on a two-dimensional plane, they will be centered around x=0 and y=-3. Similarly, the elements of the mouse_images array will be centered around x=3 and y=3, and finally, the elements of the array dog_images will be centered around x=-3 and y=3. You will see this once we plot our dataset.

Next, we need to vertically join these arrays to create our final dataset. Execute the following script to do so:

feature_set = np.vstack([cat_images, mouse_images, dog_images])  

We created our feature set, and now we need to define corresponding labels for each record in our feature set. The following script does that:

labels = np.array([0]*700 + [1]*700 + [2]*700)  

The above script creates a one-dimensional array of 2100 elements. The first 700 elements have been labeled as 0, the next 700 elements have been labeled as 1 while the last 700 elements have been labeled as 2. This is just our shortcut way of quickly creating the labels for our corresponding data.

For multi-class classification problems, we need to define the output label as a one-hot encoded vector, since our output layer will have three nodes and each node will correspond to one output class. We want the node corresponding to the predicted class to have a value of 1 while the remaining nodes have a value of 0. For that, we need three values in the output label for each record. This is why we convert our output vector into a one-hot encoded vector.

Execute the following script to create the one-hot encoded vector array for our dataset:

one_hot_labels = np.zeros((2100, 3))

for i in range(2100):  
    one_hot_labels[i, labels[i]] = 1

In the above script we create the one_hot_labels array of size 2100 x 3, where each row contains the one-hot encoded vector for the corresponding record in the feature set: we insert a 1 in the column corresponding to that record's label.

If you execute the above script, you will see that the one_hot_labels array has a 1 at index 0 for the first 700 records, a 1 at index 1 for the next 700 records, and a 1 at index 2 for the last 700 records.

Now let's plot the dataset that we just created. Execute the following script:

plt.scatter(feature_set[:,0], feature_set[:,1], c=labels, cmap='plasma', s=100, alpha=0.5)  
plt.show()  

Once you execute the above script, you should see the following figure:

Generated dataset

You can clearly see that we have elements belonging to three different classes. Our task will be to develop a neural network capable of classifying data into the aforementioned classes.

Neural Network with Multiple Output Classes

The neural network that we are going to design has the following architecture:

Neural network structure

You can see that our neural network is pretty similar to the one we developed in Part 2 of the series. It has an input layer with 2 input features and a hidden layer with 4 nodes. However, in the output layer, we can see that we have three nodes. This means that our neural network is capable of solving the multi-class classification problem where the number of possible outputs is 3.

Softmax and Cross-Entropy Functions

Before we move on to the code section, let us briefly review the softmax and cross entropy functions, which are respectively the most commonly used activation and loss functions for creating a neural network for multi-class classification.

Softmax Function

From the architecture of our neural network, we can see that we have three nodes in the output layer. We have several options for the activation function at the output layer. One option is to use sigmoid function as we did in the previous articles.

However, there is a more convenient activation function in the form of softmax that takes a vector as input and produces another vector of the same length as output. Since our output contains three nodes, we can consider the output from each node as one element of the input vector. The output will be a vector of the same length where the values of all the elements sum to 1. Mathematically, the softmax function can be represented as:

$$ y_i(z_i) = \frac{e^{z_i}}{ \sum\nolimits_{k=1}^{K}{e^{z_k}} } $$

The softmax function simply divides the exponent of each input element by the sum of exponents of all the input elements. Let's take a look at a simple example of this:

def softmax(A):  
    expA = np.exp(A)
    return expA / expA.sum()

nums = np.array([4, 5, 6])  
print(softmax(nums))  

In the script above we create a softmax function that takes a single vector as input, takes exponents of all the elements in the vector and then divides the resulting numbers individually by the sum of exponents of all the numbers in the input vector.

You can see that the input vector contains elements 4, 5 and 6. In the output, you will see three numbers squashed between 0 and 1 whose sum is equal to 1. The output looks like this:

[0.09003057 0.24472847 0.66524096]

The softmax activation function has two major advantages over other activation functions, particularly for multi-class classification problems: the first advantage is that the softmax function takes a vector as input, and the second is that it produces outputs between 0 and 1 that sum to 1, so they can be interpreted as class probabilities. Remember, in our dataset we have one-hot encoded output labels, which means that our target values are between 0 and 1. However, the output of the feedforward process can be greater than 1, therefore the softmax function is the ideal choice at the output layer since it squashes the output between 0 and 1.

Cross-Entropy Function

With the softmax activation function at the output layer, the mean squared error cost function can be used for optimizing the cost as we did in the previous articles. However, for the softmax function, a more convenient cost function exists, which is called cross-entropy.

Mathematically, the cross-entropy function looks like this:

$$ H(y,\hat{y}) = -\sum_i y_i \log \hat{y_i} $$

The cross-entropy is simply the sum of the products of all the actual probabilities with the negative log of the predicted probabilities. For multi-class classification problems, the cross-entropy cost function is known to outperform the mean squared error cost function.
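As a quick numerical illustration (the label and predicted probabilities below are made-up values), the cost for a single record can be computed like this:

import numpy as np

y = np.array([1, 0, 0])         # actual one-hot encoded label
ao = np.array([0.7, 0.2, 0.1])  # predicted probabilities from softmax

cost = -np.sum(y * np.log(ao))
print(cost)  # 0.3566..., i.e. -log(0.7)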

Now we have sufficient knowledge to create a neural network that solves multi-class classification problems. Let's see how our neural network will work.

As always, a neural network executes in two steps: Feed-forward and back-propagation.

Feed Forward

The feedforward phase will remain more or less similar to what we saw in the previous article. The only difference is that now we will use the softmax activation function at the output layer rather than sigmoid function.

Remember, for the hidden layer output we will still use the sigmoid function as we did previously. The softmax function will be used only for the output layer activations.

Phase 1

Since we are using two different activation functions for the hidden layer and the output layer, I have divided the feed-forward phase into two sub-phases.

In the first phase, we will see how to calculate output from the hidden layer. For each input record, we have two features "x1" and "x2". To calculate the output value for each node in the hidden layer, we have to multiply the input with the corresponding weights of the hidden layer node for which we are calculating the value. Notice that we are also adding a bias term here. We then pass the dot product through the sigmoid activation function to get the final value.

For instance, to calculate the final value for the first node in the hidden layer, which is denoted by "ah1", you need to perform the following calculation:

$$ zh1 = x1w1 + x2w2 + b
$$

$$ ah1 = \frac{\mathrm{1} }{\mathrm{1} + e^{-zh1} }
$$

This is the resulting value for the top-most node in the hidden layer. In the same way, you can calculate the values for the 2nd, 3rd, and 4th nodes of the hidden layer.
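To connect the formula to code, here is a tiny sketch for a single record; the feature values, weights, and bias below are made-up numbers used only for illustration:

import numpy as np

x1, x2 = 0.5, -1.2           # input features for one record (made up)
w1, w2, b = 0.1, 0.4, 0.2    # weights and bias of the first hidden node (made up)

zh1 = x1 * w1 + x2 * w2 + b
ah1 = 1 / (1 + np.exp(-zh1))  # sigmoid activation
print(zh1, ah1)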

Phase 2

To calculate the values for the output layer, the values in the hidden layer nodes are treated as inputs. Therefore, to calculate the output, multiply the values of the hidden layer nodes with their corresponding weights and pass the result through an activation function, which will be softmax in this case.

This operation can be mathematically expressed by the following equation:

$$ zo1 = ah1w9 + ah2w10 + ah3w11 + ah4w12
$$

$$ zo2 = ah1w13 + ah2w14 + ah3w15 + ah4w16
$$

$$ zo3 = ah1w17 + ah2w18 + ah3w19 + ah4w20
$$

Here zo1, zo2, and zo3 will form the vector that we will use as input to the softmax function. Let's name this vector "zo".

zo = [zo1, zo2, zo3]  

Now to find the output value ao1, we can use the softmax function as follows:

$$ ao1(zo) = \frac{e^{zo1}}{ \sum\nolimits_{k=1}^{K}{e^{zok}} } $$

Here "a01" is the output for the top-most node in the output layer. In the same way, you can use the softmax function to calculate the values for ao2 and ao3.

You can see that the feed-forward step for a neural network with multi-class output is pretty similar to the feed-forward step of the neural network for binary classification problems. The only difference is that here we are using softmax function at the output layer rather than the sigmoid function.

Back-Propagation

The basic idea behind back-propagation remains the same. We have to define a cost function and then optimize that cost function by updating the weights such that the cost is minimized. However, unlike previous articles where we used mean squared error as a cost function, in this article we will instead use cross-entropy function.

Back-propagation is an optimization problem where we have to find the minima of our cost function.

To find the minima of a function, we can use the gradient descent algorithm. The gradient descent algorithm can be mathematically represented as follows:

$$ repeat \ until \ convergence: \begin{Bmatrix} w_j := w_j - \alpha \frac{\partial }{\partial w_j} J(w_0, w_1, \dots, w_n) \end{Bmatrix} ............. (1) $$

The details regarding how the gradient descent algorithm minimizes the cost have already been discussed in the previous article. Here we will just see the mathematical operations that we need to perform.

Our cost function is:

$$ H(y,\hat{y}) = -\sum_i y_i \log \hat{y_i} $$

In our neural network, we have an output vector where each element of the vector corresponds to output from one node in the output layer. The output vector is calculated using the softmax function. If "ao" is the vector of the predicted outputs from all output nodes and "y" is the vector of the actual outputs of the corresponding nodes in the output vector, we have to basically minimize this function:

$$ cost(y, {ao}) = -\sum_i y_i \log {ao_i} $$
Phase 1

In the first phase, we need to update weights w9 up to w20. These are the weights of the output layer nodes.

From the previous article, we know that to minimize the cost function, we have to update weight values such that the cost decreases. To do so, we need to take the derivative of the cost function with respect to each weight. Mathematically we can represent it as:

$$ \frac {dcost}{dwo} = \frac {dcost}{dao} * \frac {dao}{dzo} * \frac {dzo}{dwo} ..... (1) $$

Here "wo" refers to the weights in the output layer.

The first part of the equation can be represented as:

$$ \frac {dcost}{dao} *\ \frac {dao}{dzo} ....... (2) $$

The detailed derivation of cross-entropy loss function with softmax activation function can be found at this link.

The derivative of equation (2) is:

$$ \frac {dcost}{dao} *\ \frac {dao}{dzo} = ao - y ....... (3) $$

Where "ao" is predicted output while "y" is the actual output.

Finally, we need to find "dzo" with respect to "dwo" from Equation 1. The derivative is simply the outputs coming from the hidden layer as shown below:

$$ \frac {dzo}{dwo} = ah $$

To find new weight values, the values returned by Equation 1 can be simply multiplied with the learning rate and subtracted from the current weight values.

We also need to update the bias "bo" for the output layer. We need to differentiate our cost function with respect to bias to get new bias value as shown below:

$$ \frac {dcost}{dbo} = \frac {dcost}{dao} *\ \frac {dao}{dzo} * \frac {dzo}{dbo} ..... (4) $$

The first part of Equation 4 has already been calculated in Equation 3. Here we only need to find the derivative of "zo" with respect to "bo", which is simply 1. So:

$$ \frac {dcost}{dbo} = ao - y ........... (5) $$

To find new bias values for output layer, the values returned by Equation 5 can be simply multiplied with the learning rate and subtracted from the current bias value.

Phase 2

In this section, we will back-propagate our error to the previous layer and find the new weight values for hidden layer weights i.e. weights w1 to w8.

Let's collectively denote hidden layer weights as "wh". We basically have to differentiate the cost function with respect to "wh".

Mathematically we can use chain rule of differentiation to represent it as:

$$ \frac {dcost}{dwh} = \frac {dcost}{dah} * \frac {dah}{dzh} * \frac {dzh}{dwh} ...... (6) $$

Here again, we will break Equation 6 into individual terms.

The first term "dcost" can be differentiated with respect to "dah" using the chain rule of differentiation as follows:

$$ \frac {dcost}{dah} = \frac {dcost}{dzo} *\ \frac {dzo}{dah} ...... (7) $$

Let's again break the Equation 7 into individual terms. From the Equation 3, we know that:

$$ \frac {dcost}{dao} * \frac {dao}{dzo} = \frac {dcost}{dzo} = ao - y ........ (8) $$

Now we need to find dzo/dah from Equation 7, which is equal to the weights of the output layer as shown below:

$$ \frac {dzo}{dah} = wo ...... (9) $$

Now we can find the value of dcost/dah by replacing the values from Equations 8 and 9 in Equation 7.

Coming back to Equation 6, we have yet to find dah/dzh and dzh/dwh.

The first term dah/dzh can be calculated as:

$$ \frac {dah}{dzh} = sigmoid(zh) * (1-sigmoid(zh)) ........ (10) $$

And finally, dzh/dwh is simply the input values:

$$ \frac {dzh}{dwh} = \text{input features} ........ (11) $$

If we replace the values from Equations 7, 10 and 11 in Equation 6, we can get the updated matrix for the hidden layer weights. To find new weight values for the hidden layer weights "wh", the values returned by Equation 6 can be simply multiplied with the learning rate and subtracted from the current hidden layer weight values.

Similarly, the derivative of the cost function with respect to hidden layer bias "bh" can simply be calculated as:

$$ \frac {dcost}{dbh} = \frac {dcost}{dah} * \frac {dah}{dzh} * \frac {dzh}{dbh} ...... (12) $$

Which is simply equal to:

$$ \frac {dcost}{dbh} = \frac {dcost}{dah} * \frac {dah}{dzh} ...... (13) $$

because,

$$ \frac {dzh}{dbh} = 1 $$

To find new bias values for the hidden layer, the values returned by Equation 13 can be simply multiplied with the learning rate and subtracted from the current hidden layer bias values, and that's it for the back-propagation.

You can see that the feed-forward and back-propagation process is quite similar to the one we saw in our last articles. The only thing we changed is the activation function and cost function.

Code for Neural Networks for Multi-class Classification

We have covered the theory behind the neural network for multi-class classification, and now is the time to put that theory into practice.

Take a look at the following script:

import numpy as np  
import matplotlib.pyplot as plt

np.random.seed(42)

cat_images = np.random.randn(700, 2) + np.array([0, -3])  
mouse_images = np.random.randn(700, 2) + np.array([3, 3])  
dog_images = np.random.randn(700, 2) + np.array([-3, 3])

feature_set = np.vstack([cat_images, mouse_images, dog_images])

labels = np.array([0]*700 + [1]*700 + [2]*700)

one_hot_labels = np.zeros((2100, 3))

for i in range(2100):  
    one_hot_labels[i, labels[i]] = 1

plt.figure(figsize=(10,7))  
plt.scatter(feature_set[:,0], feature_set[:,1], c=labels, cmap='plasma', s=100, alpha=0.5)  
plt.show()

def sigmoid(x):  
    return 1/(1+np.exp(-x))

def sigmoid_der(x):  
    return sigmoid(x) *(1-sigmoid (x))

def softmax(A):  
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

instances = feature_set.shape[0]  
attributes = feature_set.shape[1]  
hidden_nodes = 4  
output_labels = 3

wh = np.random.rand(attributes,hidden_nodes)  
bh = np.random.randn(hidden_nodes)

wo = np.random.rand(hidden_nodes,output_labels)  
bo = np.random.randn(output_labels)  
lr = 10e-4

error_cost = []

for epoch in range(50000):  
############# feedforward

    # Phase 1
    zh = np.dot(feature_set, wh) + bh
    ah = sigmoid(zh)

    # Phase 2
    zo = np.dot(ah, wo) + bo
    ao = softmax(zo)

########## Back Propagation

########## Phase 1

    dcost_dzo = ao - one_hot_labels
    dzo_dwo = ah

    dcost_wo = np.dot(dzo_dwo.T, dcost_dzo)

    dcost_bo = dcost_dzo

########## Phases 2

    dzo_dah = wo
    dcost_dah = np.dot(dcost_dzo , dzo_dah.T)
    dah_dzh = sigmoid_der(zh)
    dzh_dwh = feature_set
    dcost_wh = np.dot(dzh_dwh.T, dah_dzh * dcost_dah)

    dcost_bh = dcost_dah * dah_dzh

    # Update Weights ================

    wh -= lr * dcost_wh
    bh -= lr * dcost_bh.sum(axis=0)

    wo -= lr * dcost_wo
    bo -= lr * dcost_bo.sum(axis=0)

    if epoch % 200 == 0:
        loss = np.sum(-one_hot_labels * np.log(ao))
        print('Loss function value: ', loss)
        error_cost.append(loss)

The code is pretty similar to the one we created in the previous article. In the feed-forward section, the only difference is that "ao", which is the final output, is being calculated using the softmax function.

Similarly, in the back-propagation section, to find the new weights for the output layer, the cost function is differentiated through the softmax activation rather than the sigmoid activation.

If you run the above script, you will see that the final error cost will be 0.5. The following figure shows how the cost decreases with the number of epochs.

Cost vs epochs

As you can see, not many epochs are needed to reach our final error cost.

Similarly, if you run the same script with the sigmoid function at the output layer, the minimum error cost that you will achieve after 50000 epochs will be around 1.5, which is greater than the 0.5 achieved with softmax.

Conclusion

Real-world neural networks are capable of solving multi-class classification problems. In this article, we saw how we can create a very simple neural network for multi-class classification, from scratch in Python. This is the final article of the series: "Neural Network from Scratch in Python". In future articles, I will explain how we can create more specialized neural networks such as recurrent neural networks and convolutional neural networks from scratch in Python.

October 17, 2018 12:50 PM UTC


Eli Bendersky

Covariance and contravariance in subtyping

Many programming languages support subtyping, a kind of polymorphism that lets us define hierarchical relations on types, with specific types being subtypes of more generic types. For example, a Cat could be a subtype of Mammal, which itself is a subtype of Vertebrate.

Intuitively, functions that accept any Mammal would accept a Cat too. More formally, this is known as the Liskov substitution principle:

Let \phi (x) be a property provable about objects x of type T. Then \phi (y) should be true for objects y of type S where S is a subtype of T.

A shorter way to say S is a subtype of T is S <: T. The relation <: is also sometimes expressed as \le, and can be thought of as "is less general than". So Cat <: Mammal and Mammal <: Vertebrate. Naturally, <: is transitive, so Cat <: Vertebrate; it's also reflexive, as T <: T for any type T [1].

Kinds of variance in subtyping

Variance refers to how subtyping between composite types (e.g. list of Cats versus list of Mammals) relates to subtyping between their components (e.g. Cats and Mammals). Let's use the general Composite<T> to refer to some composite type with components of type T.

Given types S and T with the relation S <: T, variance is a way to describe the relation between the composite types:

  • Covariant means the ordering of component types is preserved: Composite<S> <: Composite<T>.
  • Contravariant means the ordering is reversed: Composite<T> <: Composite<S> [2].
  • Bivariant means both covariant and contravariant.
  • Invariant means neither covariant nor contravariant.

That's a lot of theory and rules right in the beginning; the following examples should help clarify all of this.

Covariance in return types of overriding methods in C++

In C++, when a subclass method overrides a similarly named method in a superclass, their signatures have to match. There is an important exception to this rule, however. When the original return type is B* or B&, the return type of the overriding function is allowed to be D* or D& respectively, provided that D is a public subclass of B. This rule is important to implement methods like Clone:

struct Mammal {
  virtual ~Mammal() = 0;
  virtual Mammal* Clone() = 0;
};

// A pure virtual destructor still needs a definition, because the derived
// destructors call it implicitly.
inline Mammal::~Mammal() {}

struct Cat : public Mammal {
  virtual ~Cat() {}

  Cat* Clone() override {
    return new Cat(*this);
  }
};

struct Dog : public Mammal {
  virtual ~Dog() {}

  Dog* Clone() override {
    return new Dog(*this);
  }
};

And we can write functions like the following:

Mammal* DoSomething(Mammal* m) {
  Mammal* cloned = m->Clone();
  // Do something with cloned
  return cloned;
}

No matter what the concrete run-time class of m is, m->Clone() will return the right kind of object.

Armed with our new terminology, we can say that the return type rule for overriding methods is covariant for pointer and reference types. In other words, given Cat <: Mammal we have Cat* <: Mammal*.

Being able to replace Mammal* by Cat* seems like a natural thing to do in C++, but not all typing rules are covariant. Consider this code:

struct MammalClinic {
  virtual void Accept(Mammal* m);
};

struct CatClinic : public MammalClinic {
  virtual void Accept(Cat* c);
};

Looks legit? We have general MammalClinics that accept all mammals, and more specialized CatClinics that only accept cats. Given a MammalClinic*, we should be able to call Accept and the right one will be invoked at run-time, right? Wrong. CatClinic::Accept does not actually override MammalClinic::Accept; it simply overloads it. If we try to add the override keyword (as we should always do starting with C++11):

struct CatClinic : public MammalClinic {
  virtual void Accept(Cat* c) override;
};

We'll get:

error: ‘virtual void CatClinic::Accept(Cat*)’ marked ‘override’, but does not override
   virtual void Accept(Cat* c) override;
                ^

This is precisely what the override keyword was created for - to help us find erroneous assumptions about methods overriding other methods. The reality is that parameter types of overriding methods are not covariant for pointer types; they are invariant. In fact, the vast majority of typing rules in C++ are invariant; std::vector<Cat> is not a subclass of std::vector<Mammal>, even though Cat <: Mammal. As the next section demonstrates, there's a good reason for that.

Covariant arrays in Java

Suppose we have PersianCat <: Cat, and some class representing a list of cats. Does it make sense for lists to be covariant? On initial thought, yes. Say we have this (pseudocode) function:

MakeThemMeow(List<Cat> lst) {
    for each cat in lst {
        cat->Meow()
    }
}

Why shouldn't we be able to pass a List<PersianCat> into it? After all, all Persian cats are cats, so they can all meow! As long as lists are immutable, this is actually safe. The problem appears when lists can be modified. The best example of this problem can be demonstrated with actual Java code, since in Java arrays are covariant:

class Main {
  public static void main(String[] args) {
    String strings[] = {"house", "daisy"};
    Object objects[] = strings; // covariant

    objects[1] = "cauliflower"; // works fine
    objects[0] = 5;             // throws exception
  }
}

In Java, String <: Object, and since arrays are covariant, it means that String[] <: Object[], which makes the assignment on the line marked with "covariant" type-check successfully. From that point on, objects is an array of Object as far as the compiler is concerned, so assigning anything that's a subclass of Object to its elements is kosher, including integers [3]. Therefore the last line in main throws an exception at run-time:

Exception in thread "main" java.lang.ArrayStoreException: java.lang.Integer
    at Main.main(Main.java:7)

Assigning an integer fails because at run-time it's known that objects is actually an array of strings. Thus, covariance together with mutability makes array types unsound. Note, however, that this is not just a mistake - it's a deliberate historical decision made when Java didn't have generics and polymorphism was still desired; the same problem exists in C# - read this for more details.

Other languages have immutable containers, which can then be made covariant without jeopardizing the soundness of the type system. For example in OCaml lists are immutable and covariant.

Contravariance for function types

Covariance seems like a pretty intuitive concept, but what about contravariance? When does it make sense to reverse the subtyping relation for composite types to get Composite<T> <: Composite<S> for S <: T?

An important use case is function types. Consider a function that takes a Mammal and returns a Mammal; in functional programming the type of this function is commonly referred to as Mammal -> Mammal. Which function types are valid subtypes of this type?

Here's a pseudo-code definition that makes it easier to discuss:

func user(f : Mammal -> Mammal) {
  // do stuff with 'f'
}

Can we call user providing it a function of type Mammal -> Cat as f? Inside its body, user may invoke f and expect its return value to be a Mammal. Since Mammal -> Cat returns cats, that's fine, so this usage is safe. It aligns with our earlier intuition that covariance makes sense for function return types.

Note that passing a Mammal -> Vertebrate function as f doesn't work either, because user expects f to return Mammals, but our function may return a Vertebrate that's not a Mammal (maybe a Bird). Therefore, function return types are not contravariant.

But what about function parameters? So far we've been looking at function types that take Mammal - an exact match for the expected signature of f. Can we call user with a function of type Cat -> Mammal? No, because user expects to be able to pass any kind of Mammal into f, not just Cats. So function parameters are not covariant. On the other hand, it should be safe to pass a function of type Vertebrate -> Mammal as f, because it can take any Mammal, and that's what user is going to pass to it. So contravariance makes sense for function parameters.

Most generally, we can say that Vertebrate -> Cat is a subtype of Mammal -> Mammal, because parameter types are contravariant and return types are covariant. A nice quote that can help remember these rules is: be liberal in what you accept and conservative in what you produce.

This is not just theory; if we go back to C++, this is exactly how function types with std::function behave:

#include <functional>

struct Vertebrate {};
struct Mammal : public Vertebrate {};
struct Cat : public Mammal {};

Cat* f1(Vertebrate* v) {
  return nullptr;
}

Vertebrate* f2(Vertebrate* v) {
  return nullptr;
}

Cat* f3(Cat* v) {
  return nullptr;
}

void User(std::function<Mammal*(Mammal*)> f) {
  // do stuff with 'f'
}

int main() {
  User(f1);       // works

  return 0;
}

The invocation User(f1) compiles, because f1 is convertible to the type std::function<Mammal*(Mammal*)> [4]. Had we tried to invoke User(f2) or User(f3), they would fail because neither f2 nor f3 are proper subtypes of std::function<Mammal*(Mammal*)>.
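The same rules show up in Python's optional typing: mypy treats typing.Callable as contravariant in its parameter types and covariant in its return type. Here is a minimal sketch of that behaviour (the class and function names are ours, purely for illustration):

from typing import Callable

class Vertebrate:
    pass

class Mammal(Vertebrate):
    pass

class Cat(Mammal):
    pass

def user(f: Callable[[Mammal], Mammal]) -> None:
    print(f(Mammal()))

def vertebrate_to_cat(v: Vertebrate) -> Cat:    # Vertebrate -> Cat
    return Cat()

def cat_to_mammal(c: Cat) -> Mammal:            # Cat -> Mammal
    return Mammal()

user(vertebrate_to_cat)  # accepted: parameters are contravariant, returns are covariant
user(cat_to_mammal)      # flagged by mypy: user may call f with a Mammal that isn't a Cat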

Bivariance

So far we've seen examples of invariance, covariance and contravariance. What about bivariance? Recall, bivariance means that given S <: T, both Composite<S> <: Composite<T> and Composite<T> <: Composite<S> are true. When is this useful? Not often at all, it turns out.

In TypeScript, function parameters are bivariant. The following code compiles correctly but fails at run-time:

function trainDog(d: Dog) { ... }
function cloneAnimal(source: Animal, done: (result: Animal) => void): void { ... }
let c = new Cat();

// Runtime error here occurs because we end up invoking 'trainDog' with a 'Cat'
cloneAnimal(c, trainDog);

Once again, this is not because the TypeScript designers are incompetent. The reason is fairly intricate and explained on this page; the summary is that it's needed to help the type-checker treat functions that don't mutate their arguments as covariant for arrays.

That said, in TypeScript 2.6 this is being changed with a new strictness flag that treats parameters only contravariantly.

Explicit variance specification in Python type-checking

If you had to guess which of the mainstream languages has the most advanced support for variance in their type system, Python probably wouldn't be your first guess, right? I admit it wasn't mine either, because Python is dynamically (duck) typed. But the new type hinting support (described in PEP 484 with more details in PEP 483) is actually fairly advanced.

Here's an example:

from typing import List

class Mammal:
    pass

class Cat(Mammal):
    pass

def count_mammals_list(seq : List[Mammal]) -> int:
    return len(seq)

mlst = [Mammal(), Mammal()]
print(count_mammals_list(mlst))

If we run mypy type-checking on this code, it will succeed. count_mammals_list takes a list of Mammals, and this is what we passed in; so far, so good. However, the following will fail:

clst = [Cat(), Cat()]
print(count_mammals_list(clst))

Because List is not covariant. Python doesn't know whether count_mammals_list will modify the list, so allowing calls with a list of Cats is potentially unsafe.

It turns out that the typing module lets us express the variance of types explicitly. Here's a very minimal "immutable list" implementation that only supports counting elements:

from typing import Generic, Iterable, TypeVar

T_co = TypeVar('T_co', covariant=True)

class ImmutableList(Generic[T_co]):
    def __init__(self, items: Iterable[T_co]) -> None:
        self.lst = list(items)

    def __len__(self) -> int:
        return len(self.lst)

And now if we define:

def count_mammals_ilist(seq : ImmutableList[Mammal]) -> int:
    return len(seq)

We can actually invoke it with an ImmutableList of Cats, and this will pass type checking:

cimmlst = ImmutableList([Cat(), Cat()])
print(count_mammals_ilist(cimmlst))

Similarly, we can support contravariant types, etc. The typing module also provides a number of useful built-ins; for example, it's not really necessary to create an ImmutableList type, as there's already a Sequence type that is covariant.
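As a small sketch of the contravariant case (reusing the Mammal and Cat classes from above; the Sink name is ours), a write-only consumer is the classic example. A sink that can absorb any Mammal is a perfectly good place to send Cats, so mypy lets Sink[Mammal] stand in for Sink[Cat]:

from typing import Generic, TypeVar

T_contra = TypeVar('T_contra', contravariant=True)

class Sink(Generic[T_contra]):
    def send(self, item: T_contra) -> None:
        print('storing', item)

def feed_cats(sink: Sink[Cat]) -> None:
    sink.send(Cat())

mammal_sink: Sink[Mammal] = Sink()
feed_cats(mammal_sink)  # passes type checking because Sink is contravariant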


[1] In most cases <: is also antisymmetric, making it a partial order, but in some cases it isn't; for example, structs with permuted fields can be considered subtypes of each other (in most languages they aren't!) but such subtyping is not antisymmetric.
[2] These terms come from math, and a good rule of thumb to remember how they apply is: co means together, while contra means against. As long as the composite types vary together (in the same direction) as their component types, they are co-variant. When they vary against their component types (in the reverse direction), they are contra-variant.
[3] Strictly speaking, integer literals like 5 are primitives in Java and not objects at all. However, due to autoboxing, this is equivalent to wrapping the 5 in Integer prior to the assignment.
[4] Note that we're using pointer types here. The same example would work with std::function<Mammal(Mammal)> and corresponding f1 taking and returning value types. It's just that in C++ value types are not very useful for polymorphism, so pointer (or reference) values are much more commonly used.

October 17, 2018 12:35 PM UTC


PyCon

PyCon 2019 Launches Financial Aid

The PyCon conference prides itself on being affordable. However, registration is only one of several expenses an attendee must incur, and it’s likely the smallest one. Flying, whether halfway around the world or from a few hundred miles away, is more expensive. Staying in a hotel for a few days is also more expensive. All together, the cost of attending a conference can become prohibitively expensive. That’s where PyCon's Financial Aid program comes in. We’re opening applications for Financial Aid today, and we’ll be accepting them through February 12, 2019.

To apply, first set up an account on the site, and then you will be able to fill out the application here or through your dashboard.

For those proposing talks, tutorials, or posters, selecting the “I require a speaker grant if my proposal is accepted” box on your speaker profile serves as your request, so you do not need to fill out the financial aid application. Upon acceptance, we’ll contact the speakers who checked that box to gather the appropriate information. Accepted speakers and presenters are prioritized for travel grants. Additionally, we do not expose grant requests to reviewers while evaluating proposals. The Program Committee evaluates proposals on the basis of their presentation, and later the Financial Aid team comes in and looks at how we can help our speakers.

We offer need-based grants to enable people from across our community to attend PyCon. The criteria for evaluating requests takes into account several things, such as whether the applicant is a student, unemployed, or underemployed; their geographic location; and their involvement in both the conference and the greater Python community.

Our process aims to help a large number of people with partial grants, as opposed to covering full expenses for a small number of people. Based on individual need, we craft grant amounts that we hope can turn PyCon from inaccessible to reality. While some direct costs—like those associated with PyCon itself—are discounted or waived, external costs such as travel are handled via reimbursement, where the attendee pays and then submits receipts to be paid back an amount based on their grant. For the full details, see our FAQ at https://us.pycon.org/2019/financial-assistance/faq/ and contact pycon-aid@python.org with further questions.

The Python Software Foundation & PyLadies make Financial Aid possible. This year the Python Software Foundation is providing $110,000 USD towards financial aid and PyLadies will contribute as much as they can based on the contributions they get throughout 2018.

For more information about Financial Aid, see https://us.pycon.org/2019/financial-assistance.


Our Call for Proposals is open! Tutorial presentations are due November 26, while talk, poster, and education summit proposals are due January 3. For more information, see https://us.pycon.org/2019/speaking/.

*Note: Main content is from post written by Brian Curtin for 2018 launch

October 17, 2018 10:01 AM UTC


Talk Python to Me

#182 Picture Python at Shutterfly

Join me and Doug Farrell as we discuss his career and what he's up to at Shutterfly. You'll learn about the Python stack he's using to work not just with bits and bytes, but with physical devices on a production line for creating all sorts of picturesque items. You'll also hear how both he and I feel it's a great time to be a developer, even if you're on the older side of 30 or 40 or beyond.

October 17, 2018 08:00 AM UTC


Mike Driscoll

Jupyter Notebook Debugging

Debugging is an important skill. It is the process of figuring out what is wrong with your code, or simply of understanding what unfamiliar code is doing. There are many times when I come to unfamiliar code and need to step through it in a debugger to grasp how it works. Most Python IDEs have good debuggers built into them. I personally like Wing IDE, for instance; others prefer PyCharm or PyDev. But what if you want to debug the code in your Jupyter Notebook? How does that work?

In this chapter we will look at a couple of different methods of debugging a Notebook. The first one is by using Python’s own pdb module.


Using pdb

The pdb module is Python’s debugger module. Just as C++ has gdb, Python has pdb.

Let’s start by opening up a new Notebook and adding a cell with the following code in it:

def bad_function(var):
	return var + 0
 
bad_function("Mike")

If you run this code, you should end up with some output that looks like this:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-2f23ed1cac1e> in <module>()
      2         return var + 0
      3 
----> 4 bad_function("Mike")
 
<ipython-input-1-2f23ed1cac1e> in bad_function(var)
      1 def bad_function(var):
----> 2         return var + 0
      3 
      4 bad_function("Mike")
 
TypeError: cannot concatenate 'str' and 'int' objects

What this means is that you cannot concatenate a string and an integer. This is a pretty common problem if you don’t know what types a function accepts. You will find that this is especially true when working with complex functions and classes, unless they happen to be using type hinting. One way to figure out what is going on is by adding a breakpoint using pdb’s set_trace() function:

def bad_function(var):
    import pdb
    pdb.set_trace()
    return var + 0
 
bad_function("Mike")

Now when you run the cell, you will get a prompt in the output which you can use to inspect the variables and basically run code live. If you happen to have Python 3.7, then you can simplify the example above by using the new breakpoint built-in, like this:

def bad_function(var):
    breakpoint()
    return var + 0
 
bad_function("Mike")

This code is functionally equivalent to the previous example but uses the new breakpoint function instead. When you run this code, it should act the same way as the code in the previous section did.

You can read more about how to use pdb here.

You can use any of pdb’s commands right inside of your Jupyter Notebook. Here are some examples:

  • w(here) – Print the stack trace
  • d(own) – Move the current frame X number of levels down. Defaults to one.
  • u(p) – Move the current frame X number of levels up. Defaults to one.
  • b(reak) – With a lineno argument, set a breakpoint at that line number in the current file / context
  • s(tep) – Execute the current line and stop at the next possible line
  • c(ontinue) – Continue execution

Note that these are single-letter commands: typing w, d, u, b, s, or c is enough. You can use these commands to interactively debug your code in your Notebook, along with the other commands listed in the documentation linked above.


ipdb

IPython also has a debugger called ipdb. However it does not work with Jupyter Notebook directly. You would need to connect to the kernel using something like Jupyter console and run it from there to use it. If you would like to go that route, you can read more about using Jupyter console here.

However there is an IPython debugger that we can use called IPython.core.debugger.set_trace. Let’s create a cell with the following code:

from IPython.core.debugger import set_trace
 
def bad_function(var):
    set_trace()
    return var + 0
 
bad_function("Mike")

Now you can run this cell and get the ipdb debugger. Here is what the output looked like on my machine:

The IPython debugger uses the same commands as the Python debugger does. The main difference is that it provides syntax highlighting and was originally designed to work in the IPython console.

There is one other way to open up the ipdb debugger and that is by using the %pdb magic. Here is some sample code you can try in a Notebook cell:

%pdb
 
def bad_function(var):
    return var + 0
 
bad_function("Mike")

When you run this code, you should end up seeing the `TypeError` traceback and then the ipdb prompt will appear in the output, which you can then use as before.


What about %%debug?

There is yet another way that you can open up a debugger in your Notebook. You can use `%%debug` to debug the entire cell like this:

%%debug
 
def bad_function(var):
    return var + 0
 
bad_function("Mike")

This will start the debugging session immediately when you run the cell. What that means is that you would want to use some of the commands that pdb supports to step into the code and examine the function or variables as needed.

Note that you could also use `%debug` if you want to debug a single line.
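For instance, assuming bad_function has already been defined in an earlier cell, the single-statement form might look like this:

%debug bad_function("Mike")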


Wrapping Up

In this chapter we learned about several different methods that you can use to debug the code in your Jupyter Notebook. I personally prefer to use Python’s pdb module, but you can use IPython.core.debugger to get the same functionality, and it may suit you better if you prefer to have syntax highlighting.

There is also a newer “visual debugger” package called PixieDebugger, from the pixiedust package.

I haven’t used it myself. Some reviewers say it is amazing and others have said it is pretty buggy. I will leave that one up to you to determine if it is something you want to add to your toolset.

As far as I am concerned, using pdb or IPython’s debugger works quite well and should work for you too.


Related Reading

October 17, 2018 05:05 AM UTC


Vasudev Ram

The 2018 Python Developer Survey

By Vasudev Ram

Reposting a PSF-Community email as a PSA:

Participate in the 2018 Python Developer Survey.

Excerpt from an email to the psf-community@python.org and psf-members-announce@python.org mailing lists:

[ As some of you may have seen, the 2018 Python Developer Survey is available. If you haven't taken the survey yet, please do so soon! Additionally, we'd appreciate any assistance you all can provide with sharing the survey with your local Python groups, schools, work colleagues, etc. We will keep the survey open through October 26th, 2018.

Python Developers Survey 2018

We’re counting on your help to better understand how different Python developers use Python and related frameworks, tools, and technologies. We also hope you'll enjoy going through the questions.

The survey is organized in partnership between the Python Software Foundation and JetBrains. Together we will publish the aggregated results. We will randomly choose and announce 100 winners to receive a Python Surprise Gift Pack (must complete the full survey to qualify). ]

To my readers: I'll post the answer to A Python email signature puzzle soon, in my next post.


- Vasudev Ram - Online Python training and consulting


October 17, 2018 04:18 AM UTC

October 16, 2018


Andrea Grandi

Using ipdb with Python 3.7.x breakpoint

Python 3.7.x introduced a new method to insert a breakpoint in the code. Before Python 3.7.x, to insert a debugging point we had to write import pdb; pdb.set_trace(), which honestly I could never remember (and I also created a snippet in VS Code to auto-complete it).

Now you can just write breakpoint() and that's it!

Now... the only problem is that by default that command will use pdb, which is not exactly the best debugger you can have. I usually use ipdb, but there wasn't an intuitive way of using it... and no, just installing it in your virtual environment isn't enough; it won't be used by default.

How to use it then? It's very simple. The new debugging command will read an environment variable named PYTHONBREAKPOINT. If you set it properly, you will be able to use ipdb instead of pdb.

export PYTHONBREAKPOINT=ipdb.set_trace

At this point, any time you use breakpoint() in your code, ipdb will be used instead of pdb.
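As an aside, PEP 553 also reserves a special value for this variable: setting PYTHONBREAKPOINT to 0 turns every breakpoint() call into a no-op, which is handy when a stray breakpoint slips into code you just need to run. Since the hook re-reads the variable on each call, you can even flip it from inside Python:

import os

os.environ["PYTHONBREAKPOINT"] = "0"  # breakpoint() is now a no-op for this process
breakpoint()                          # does nothing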

References

October 16, 2018 09:00 PM UTC


Stack Abuse

A Brief Introduction to matplotlib for Data Visualization

Introduction

Python has a wide variety of useful packages for machine learning and statistical analysis such as TensorFlow, NumPy, scikit-learn, Pandas, and more. One package that is essential to most data science projects is matplotlib.

Available for any Python distribution, it can be installed on Python 3 with pip. Other methods are also available, check https://matplotlib.org/ for more details.

Installation

If you use an OS with a terminal, the following command would install matplotlib with pip:

$ python3 -m pip install matplotlib

Importing & Environment

In a Python file, we want to import the pyplot module, which allows us to interface with a MATLAB-like plotting environment. We also import the lines module, which lets us add lines to plots:

import matplotlib.pyplot as plt  
import matplotlib.lines as mlines  

Essentially, this plotting environment lets us save figures and their attributes as variables. These plots can then be printed and viewed with a simple command. As an example, we can look at the stock price of Google: specifically the date, open, close, volume, and adjusted close price (the date is stored as an np.datetime64) for the most recent 250 days:

import numpy as np  
import matplotlib.pyplot as plt  
import matplotlib.cbook as cbook

with cbook.get_sample_data('goog.npz') as datafile:  
    price_data = np.load(datafile)['price_data'].view(np.recarray)
price_data = price_data[-250:] # get the most recent 250 trading days  

We then transform the data in a way that is done quite often for time series: we find the relative difference, $d_i$, between each observation and the one before it:

$$d_i = \frac{y_i - y_{i - 1}}{y_{i - 1}}$$

delta1 = np.diff(price_data.adj_close) / price_data.adj_close[:-1]  

We can also look at the transformations of different variables, such as volume and closing price:

# Marker size in units of points^2
volume = (15 * price_data.volume[:-2] / price_data.volume[0])**2  
close = 0.003 * price_data.close[:-2] / 0.003 * price_data.open[:-2]  

Plotting a Scatter Plot

To actually plot this data, you can use the subplots() function from plt (matplotlib.pyplot). By default this generates the area for the figure and the axes of a plot.

Here we will make a scatter plot of the differences between successive days. To elaborate, x is the difference between day i and the previous day. y is the difference between day i+1 and the previous day (i):

fig, ax = plt.subplots()  
ax.scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.5)

ax.set_xlabel(r'$\Delta_i$', fontsize=15)  
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=15)  
ax.set_title('Volume and percent change')

ax.grid(True)  
fig.tight_layout()

plt.show()  

We then create labels for the x and y axes, as well as a title for the plot. We choose to plot this data with grids and a tight layout.

plt.show() displays the plot for us.
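If you want to keep the figure rather than just display it, the same fig object can write it to disk; the file name and resolution here are just examples:

fig.savefig('scatter.png', dpi=150)  # save the current figure as a PNG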

Adding a Line

We can add a line to this plot by providing x and y coordinates as lists to a Line2D instance:

import matplotlib.lines as mlines

fig, ax = plt.subplots()  
line = mlines.Line2D([-.15,0.25], [-.07,0.09], color='red')  
ax.add_line(line)

# reusing scatterplot code
ax.scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.5)

ax.set_xlabel(r'$\Delta_i$', fontsize=15)  
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=15)  
ax.set_title('Volume and percent change')

ax.grid(True)  
fig.tight_layout()

plt.show()  

Plotting Histograms

To plot a histogram, we follow a similar process and use the hist() function from pyplot. We will generate 10000 random data points, x, with a mean of 100 and standard deviation of 15.

The hist function takes the data, x, number of bins, and other arguments such as density, which normalizes the data to a probability density, or alpha, which sets the transparency of the histogram.

We will also use the library mlab to add a line representing a normal density function with the same mean and standard deviation:

import numpy as np  
import matplotlib.mlab as mlab  
import matplotlib.pyplot as plt

mu, sigma = 100, 15  
x = mu + sigma*np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 30, density=1, facecolor='blue', alpha=0.75)

# add a 'best fit' line
y = mlab.normpdf( bins, mu, sigma)  
l = plt.plot(bins, y, 'r--', linewidth=4)

plt.xlabel('IQ')  
plt.ylabel('Probability')  
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=100,\ \sigma=15$')  
plt.axis([40, 160, 0, 0.03])  
plt.grid(True)

plt.show()  

Bar Charts

While histograms help us visualize the density of data, bar charts help us view counts. To plot a bar chart with matplotlib, we use the bar() function. This takes the bar positions and their heights (the counts) as x and y, along with other arguments; the category labels are then attached with xticks().

As an example, we could look at a sample of the number of programmers that use different languages:

import numpy as np  
import matplotlib.pyplot as plt

objects = ('Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp')  
y_pos = np.arange(len(objects))  
performance = [10,8,6,4,2,1]

plt.bar(y_pos, performance, align='center', alpha=0.5)  
plt.xticks(y_pos, objects)  
plt.ylabel('Usage')  
plt.title('Programming language usage')

plt.show()  

Plotting Images

Analyzing images is very common in Python. Not surprisingly, we can use matplotlib to view images. We use the cv2 library to read in images.

The read_image() function defined below handles reading and resizing a single image.

The rest of the code reads in the first five images of cats and dogs from data used in an image recognition CNN. The pictures are concatenated and displayed on the same axis:

import matplotlib.pyplot as plt  
import numpy as np  
import os, cv2

cwd = os.getcwd()  
TRAIN_DIR = cwd + '/data/train/'

ROWS = 256  
COLS = 256  
CHANNELS = 3

train_images = [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR)] # use this for full dataset  
train_dogs =   [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR) if 'dog' in i]  
train_cats =   [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR) if 'cat' in i]

def read_image(file_path):  
    img = cv2.imread(file_path, cv2.IMREAD_COLOR) #cv2.IMREAD_GRAYSCALE
    b,g,r = cv2.split(img)
    img2 = cv2.merge([r,g,b])
    return cv2.resize(img2, (ROWS, COLS), interpolation=cv2.INTER_CUBIC)

for a in range(0,5):  
    cat = read_image(train_cats[a])
    dog = read_image(train_dogs[a])
    pair = np.concatenate((cat, dog), axis=1)
    plt.figure(figsize=(10,5))
    plt.imshow(pair)
    plt.show()

Conclusion

In this post we saw a brief introduction of how to use matplotlib to plot data in scatter plots, histograms, and bar charts. We also added lines to these plots. Finally, we saw how to read in images using the cv2 library and used matplotlib to plot the images.

October 16, 2018 01:10 PM UTC


A. Jesse Jiryu Davis

Recap: PyGotham 2018 Speaker Coaching

With your help, we raised money for twelve PyGotham speakers to receive free training from opera singer and speaking coach Melissa Collom. Most of the speakers were new to the conference scene; Melissa helped them focus on their value to the audience, clarify their ideas, and speak with confidence and charisma. In a survey, nearly all speakers said the session was “very beneficial” and made them “much more likely” to propose conference talks again.

October 16, 2018 11:14 AM UTC


Codementor

Celery Task Routing: The Basics

October 16, 2018 11:00 AM UTC


PyBites

Code Challenge 55 - #100DaysOfCode Curriculum Generator

There is an immense amount to be learned simply by tinkering with things. - Henry Ford

Hey Pythonistas,

It's time for another code challenge! This week we're asking you to create your own #100DaysOfCode Curriculum Generator.

Sounds exciting? It gets even better: with this challenge you can even be featured on our platform! Read on ...

The Challenge

Did you notice that all serious progress starts with a plan? This is why we are big advocates of #100DaysOfCode. Heck, we even built a whole Python course around it.

So here is the deal: PyBites is expanding its 100 Days tracker ("grid") feature: we want folks to add their own curriculums or learning paths.

Only one requirement: return a valid JSON response

You can make this as simple or sophisticated as you want, the only thing we request is a standard response JSON template so we can easily parse it on the platform:

Built with ObjGen -> http://www.objgen.com/json/models/q2S4Q

    {
    "title": "title of your 100 days",
    "version": 0.1,
    "github_repo": "https://github.com/pybites/100DaysOfCode",
    "tasks": [
        {
        "day": 1,
        "activity": "what you need to do this day?",
        "done": false
        },
        {
        "day": 2,
        "activity": "what you need to do this day?",
        "done": false
        },
        {
        "day": 3,
        "activity": "what you need to do this day?",
        "done": false
        },
    ...
    ...
        {
        "day": 100,
        "activity": "milestone ... 100 days done",
        "done": false
        }
    ]
    }

Update 17/10/2018: we took startDate and goals out because these are not relevant for the learning path, more for the consumers of it. github_repo is optional.
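To make the expected shape concrete, here is a minimal sketch of a generator that produces a valid response; the title and activities are placeholders, not part of the official template:

import json

def make_curriculum(title, activities, github_repo=None):
    """Build the 100-day JSON structure described above.

    activities is an iterable of day descriptions; it is padded
    (or truncated) to exactly 100 days with a placeholder activity.
    """
    activities = (list(activities) + ["free choice"] * 100)[:100]
    tasks = [{"day": day, "activity": activity, "done": False}
             for day, activity in enumerate(activities, start=1)]
    payload = {"title": title, "version": 0.1, "tasks": tasks}
    if github_repo:
        payload["github_repo"] = github_repo
    return json.dumps(payload, indent=2)

print(make_curriculum("100 days of Python reading", ["Read chapter 1"]))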

An example

Here is what we plan to do; maybe it serves as an idea for how you could code this challenge up:

If you like this idea, we opened an API endpoint to more easily pull in book info based on (Google) book ID, for example: http://pbreadinglist.herokuapp.com/api/books/bRpYDgAAQBAJ. Just replace the bookid in this endpoint.

More ideas

Of course it does not have to be centered around books, it can be any other way you like to plan your #100DaysOfCode. As long as you return the required JSON.

Other ideas that come to mind:

As usual, this is a challenge that came about from wanting to scratch our own itch. Lacking ideas? Remember there is always something you can enhance or automate for yourself or somebody else, and by doing so you'll sharpen your coding skills!

Be featured

If you want to share your learning path with our community let us know in your PR linking to your JSON file and a short description. We will then add it to our 100 days grid app.

If you need help getting ready with Github, see our new instruction video.

PyBites Community

A few more things before we take off:


>>> from pybites import Bob, Julian

Keep Calm and Code in Python!

October 16, 2018 10:47 AM UTC

Code Challenge 54 - Python Clipboard History - Review

In this article we review last week's Python Clipboard History code challenge.

Reminder: new structure review post / Hacktoberfest is back!

From now on we will merge our solution into our Community branch and include anything noteworthy here, because:

Don't be shy, share your work!

Community Pull Requests

A good 10+ PRs this week, amazing!

Check out the awesome PRs by our community for PCC54 (or from fork: git checkout community && git merge upstream/community):

Featured

vipinreyo's Clipboard Viewer

Lanseuo's Clipboard

PCC54 Lessons

Refreshed the pyperclip and sqlite modules. PyQt5 documentation is evolving, hence there is not much code available in the public domain to play around with, which is a constraint when designing GUIs for Python apps using Qt.

I had to really think about how to monitor the clipboard and copy the text from it just ONCE, ie, no immediate duplicates. It was more the thought process around it.

I learned some new things about tkinter

Gave me the chance to finally play with Python 3.7's dataclasses, though not by much.

Really nice one to practice various skills. I made a clipboard cache queue, a bit like vim buffers (used: deque, clear terminal, class, property, pyperclip, termcolor)

Read Code for Fun and Profit

You can look at all submitted code here and/or on our Community branch.

Other learnings we spotted in Pull Requests for other challenges this week:

(PCC01) how with works in python

(PCC13) I tweaked your tests in order to make it pass with my data structure.

(PCC39) Played around with 'fixture' and the scope of the fixture.

(PCC47) This one was time consuming because I had to look up how to graph all of these, but it was an excellent learning exercise!

(PCC51) Expanded my skills of working with the databases within python and brushed up on some rusty SQL skills

Thanks to everyone for your participation in our blog code challenges! Keep the PRs coming and include a README.md with one or more screenshots if you want to be featured in this weekly review post.

Keep the PRs coming, again this month it counts for Hacktoberfest!

Need more Python Practice?

Subscribe to our blog (sidebar) to get a new PyBites Code Challenge (PCC) in your inbox every start of the week.

And/or take any of our 50+ challenges on our platform.

Prefer coding self contained Python exercises in the comfort of your browser? Try our growing collection of Bites of Py.

Want to do the #100DaysOfCode but not sure what to work on? Take our course and/or start logging your progress on our platform.


Keep Calm and Code in Python!

-- Bob and Julian

October 16, 2018 10:40 AM UTC


Python Bytes

#99 parse - the regex antidote in Python

October 16, 2018 08:00 AM UTC


Mike Driscoll

Testing Jupyter Notebooks

The more you do programming, the more you will hear about how you should test your code. You will hear about things like Extreme Programming and Test Driven Development (TDD). These are great ways to create quality code. But how does testing fit in with Jupyter? Frankly, it really doesn’t. If you want to test your code properly, you should write your code outside of Jupyter and import it into cells if you need to. This allows you to use Python’s unittest module or py.test to write tests for your code separately from Jupyter. This will also let you add on test runners like nose or put your code into a Continuous Integration setup using something like Travis CI or Jenkins.

However, all is not lost. You can do some testing of your Jupyter Notebooks even though you won’t have the full flexibility that you would get from keeping your code separate. We will look at some ideas that you can use to do some basic testing with Jupyter.


Execute and Check

One popular method of “testing” a Notebook is to run it from the command line and send its output to a file. Here is the example syntax that you could use if you wanted to do the execution on the command line:

jupyter-nbconvert --to notebook --execute --output output_file_path input_file_path

Of course, we want to do this programmatically and we want to be able to capture errors. To do that, we will take our Notebook runner code from my exporting Jupyter Notebook article and re-use it. Here it is again for your convenience:

# notebook_runner.py
 
import nbformat
import os
 
from nbconvert.preprocessors import ExecutePreprocessor
 
 
def run_notebook(notebook_path):
    nb_name, _ = os.path.splitext(os.path.basename(notebook_path))
    dirname = os.path.dirname(notebook_path)
 
    with open(notebook_path) as f:
        nb = nbformat.read(f, as_version=4)
 
    proc = ExecutePreprocessor(timeout=600, kernel_name='python3')
    proc.allow_errors = True
 
    proc.preprocess(nb, {'metadata': {'path': '/'}})
    output_path = os.path.join(dirname, '{}_all_output.ipynb'.format(nb_name))
 
    with open(output_path, mode='wt') as f:
        nbformat.write(nb, f)
 
    errors = []
    for cell in nb.cells:
        if 'outputs' in cell:
            for output in cell['outputs']:
                if output.output_type == 'error':
                    errors.append(output)
 
    return nb, errors
 
if __name__ == '__main__':
    nb, errors = run_notebook('Testing.ipynb')
    print(errors)

You will note that I have updated the code to run a new Notebook. Let’s go ahead and create a Notebook that has two cells of code in it. After creating the Notebook, change the title to Testing and save it. That will cause Jupyter to save the file as Testing.ipynb. Now enter the following code in the first cell:

def add(a, b):
    return a + b
 
add(5, 6)

And enter the following code into cell #2:

1 / 0

Now you can run the Notebook runner code. When you do, you should get the following output:

[{'ename': 'ZeroDivisionError',
  'evalue': 'integer division or modulo by zero',
  'output_type': 'error',
  'traceback': ['\x1b[0;31m\x1b[0m',
                '\x1b[0;31mZeroDivisionError\x1b[0mTraceback (most recent call '
                'last)',
                '\x1b[0;32m<ipython-input-2-bc757c3fda29>\x1b[0m in '
                '\x1b[0;36m<module>\x1b[0;34m()\x1b[0m\n'
                '\x1b[0;32m----> 1\x1b[0;31m \x1b[0;36m1\x1b[0m '
                '\x1b[0;34m/\x1b[0m '
                '\x1b[0;36m0\x1b[0m\x1b[0;34m\x1b[0m\x1b[0m\n'
                '\x1b[0m',
                '\x1b[0;31mZeroDivisionError\x1b[0m: integer division or '
                'modulo by zero']}]

This indicates that we have some code that outputs an error. In this case, we did expect that as this is a very contrived example. In your own code, you probably wouldn’t want any of your code to output an error. Regardless, this Notebook runner script isn’t enough to actually do a real test. You need to wrap this code with testing code. So let’s create a new file that we will save to the same location as our Notebook runner code. We will save this script with the name “test_runner.py”. Put the following code in your new script:

import unittest
 
import notebook_runner as runner
 
 
class TestNotebook(unittest.TestCase):
 
    def test_runner(self):
        nb, errors = runner.run_notebook('Testing.ipynb')
        self.assertEqual(errors, [])
 
 
if __name__ == '__main__':
    unittest.main()

This code uses Python’s unittest module. Here we create a testing class with a single test function inside of it called test_runner. This function calls our Notebook runner and asserts that the errors list should be empty. To run this code, open up a terminal and navigate to the folder that contains your code. Then run the following command:

python test_runner.py

When I ran this, I got the following output:

F
======================================================================
FAIL: test_runner (__main__.TestNotebook)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_runner.py", line 10, in test_runner
    self.assertEqual(errors, [])
AssertionError: Lists differ: [{'output_type': u'error', 'ev... != []
 
First list contains 1 additional elements.
First extra element 0:
{'ename': 'ZeroDivisionError',
 'evalue': 'integer division or modulo by zero',
 'output_type': 'error',
 'traceback': ['\x1b[0;31m---------------------------------------------------------------------------\x1b[0m',
               '\x1b[0;31mZeroDivisionError\x1b[0m                         '
               'Traceback (most recent call last)',
               '\x1b[0;32m<ipython-input-2-bc757c3fda29>\x1b[0m in '
               '\x1b[0;36m<module>\x1b[0;34m()\x1b[0m\n'
               '\x1b[0;32m----> 1\x1b[0;31m \x1b[0;36m1\x1b[0m '
               '\x1b[0;34m/\x1b[0m \x1b[0;36m0\x1b[0m\x1b[0;34m\x1b[0m\x1b[0m\n'
               '\x1b[0m',
               '\x1b[0;31mZeroDivisionError\x1b[0m: integer division or modulo '
               'by zero']}
 
Diff is 677 characters long. Set self.maxDiff to None to see it.
 
----------------------------------------------------------------------
Ran 1 test in 1.463s
 
FAILED (failures=1)

This clearly shows that our code failed. If you remove the cell that has the divide by zero issue and re-run your test, you should get this:

.
----------------------------------------------------------------------
Ran 1 test in 1.324s
 
OK

By removing the cell (or just correcting the error in that cell), you can make your tests pass.


The py.test Plugin

I discovered a neat plugin you can use that appears to help you out by making the workflow a bit easier. I am referring to the py.test plugin for Jupyter, which you can learn more about here.

Basically it gives py.test the ability to recognize Jupyter Notebooks and check if the stored inputs match the stored outputs and also that Notebooks run without error. After installing the nbval package, you can run it with py.test like this (assuming you have py.test installed):

py.test --nbval

Frankly, you can actually run plain py.test with no extra arguments against the test file we already created and it will use our test code as is. The main benefit of adding nbval is that you won’t necessarily need to write wrapper code around Jupyter yourself.


Testing within the Notebook

Another way to run tests is to just include some tests in the Notebook itself. Let’s add a new cell to our Testing Notebook that contains the following code:

import unittest
 
class TestNotebook(unittest.TestCase):
 
    def test_add(self):
        self.assertEqual(add(2, 3), 5)

This will eventually test the add function from the first cell. We could add a bunch of different tests here. For example, we might want to test what happens if we add a string type with a None type. But you may have noticed that if you try to run this cell, you get no output. The reason is that we aren’t instantiating the class yet. We need to call unittest.main to do that. So while it’s good to run that cell to get it into Jupyter’s memory, we actually need to add one more cell with the following code:

unittest.main(argv=[''], verbosity=2, exit=False)

This code should be put in the last cell of your Notebook so it can run all the tests that you have added. It passes an empty argv so that unittest doesn’t try to parse Jupyter’s own command-line arguments, sets the verbosity level to 2, and uses exit=False so that the kernel isn’t shut down when the tests finish. When you run this code you should see the following output in your Notebook:

test_add (__main__.TestNotebook) ... ok
 
----------------------------------------------------------------------
Ran 1 test in 0.003s
 
OK
 
<unittest.main.TestProgram at 0x7fbc8fffc0d0>

You can do something similar with Python’s doctest module inside of Jupyter Notebooks as well.
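For example, a doctest version of the add test can live right in the cell where the function is defined; running the cell then reports the results (this is a minimal sketch, not code from the original Notebook):

import doctest

def add(a, b):
    """Return the sum of a and b.

    >>> add(2, 3)
    5
    >>> add(5, 6)
    11
    """
    return a + b

# In a Notebook cell this checks the doctests of everything defined in the kernel so far
doctest.testmod(verbose=True)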


Wrapping Up

As I mentioned at the beginning, while you can test your code in your Jupyter Notebooks, it is actually much better if you just test your code outside of it. However there are workarounds and since some people like to use Jupyter for documentation purposes, it is good to have a way to verify that they are working correctly. In this chapter you learned how to run Notebooks programmatically and verify that the output was as you expected. You could enhance that code to verify certain errors are present if you wanted to as well.

You also learned how to use Python’s unittest module in your Notebook cells directly. This does offer some nice flexibility as you can now run your code all in one place. Use these tools wisely and they will serve you well.


Related Reading

October 16, 2018 05:05 AM UTC


Randy Zwitch

Using pandas and pymapd for ETL into OmniSci

I’ve got PyData NYC 2018 in two days and rather than finishing up my talk, I just realized that my source data has a silent corruption due to non-standard timestamps. Here’s how I fixed this using pandas and then uploaded the data to OmniSci.

Computers Are Dumb, MAKE THINGS EASIER FOR THEM!

Literally every data tool in the world can read the ISO-8601 timestamp format. Conversely, not every tool in the world can read Excel or whatever horrible other tool people use to generate the CSV files seen in the wild. While I should’ve been more diligent checking my data ingestion, I didn’t until I created a wonky report…

Let’s take a look at the format that tripped me up:

[Screenshot: the Excel-style timestamp format in the source CSV]

Month/Day/Year Hour:Minute:Second AM/PM feels very much like the Excel date format you get when Excel is used as a display medium. Unfortunately, when you write CSV files like this, the next tool to read them has to understand 1) that these columns are timestamps and 2) what format they are in, which means guessing if the user doesn’t specify it.

In my case, I didn’t do descriptive statistics on my timestamp columns and had a silent truncation(!) of the AM/PM portion of the data. So instead of having 24 hours in the day, the parser read the data as follows (the #AM and #PM are my comments for clarity):

datetime_beginning_utc
2001-01-01 01:00:00 #AM
2001-01-01 01:00:00 #PM
2001-01-01 02:00:00 #AM
2001-01-01 02:00:00 #PM
2001-01-01 03:00:00 #AM
2001-01-01 03:00:00 #PM
2001-01-01 04:00:00 #AM
2001-01-01 04:00:00 #PM
2001-01-01 05:00:00 #AM
2001-01-01 05:00:00 #PM
2001-01-01 06:00:00 #AM
2001-01-01 06:00:00 #PM
2001-01-01 07:00:00 #AM
2001-01-01 07:00:00 #PM
2001-01-01 08:00:00 #AM
2001-01-01 08:00:00 #PM
2001-01-01 09:00:00 #AM
2001-01-01 09:00:00 #PM
2001-01-01 10:00:00 #AM
2001-01-01 10:00:00 #PM
2001-01-01 11:00:00 #AM
2001-01-01 11:00:00 #PM
2001-01-01 12:00:00 #AM
2001-01-01 12:00:00 #PM

So while the data looks like it was imported correctly (because it is a timestamp), it wasn’t until I noticed that hours 13-23 were missing from my data that I realized I had an error.

Pandas To The Rescue!

Fixing this issue is as straight-forward as reading the CSV into python using pandas and specifying the date format:

import pandas as pd
import datetime
df = pd.read_csv("/mnt/storage1TB/hrl_load_metered/hrl_load_metered.csv",
                 parse_dates=[0,1],
                 date_parser= lambda x: datetime.datetime.strptime(x, "%m/%d/%Y %I:%M:%S %p"))

[Screenshot of the correctly parsed dataframe]

We can see from the code above that pandas has taken our directive about the format and it appears the data have been parsed correctly. A good secondary check here is that the difference in timestamps is -5, which is the offset of the East Coast of the United States relative to UTC.
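A couple of quick descriptive checks along those lines might look like this (the column names and their order are assumed from the snippets above, so treat this as a sketch):

# After the fix, the UTC column should contain all 24 hours of the day
print(df["datetime_beginning_utc"].dt.hour.nunique())  # expect 24

# The local (Eastern) column should trail the UTC column by about five hours
offset_hours = (df.iloc[:, 1] - df.iloc[:, 0]).dt.total_seconds() / 3600
print(offset_hours.head())  # expect -5.0 (or -4.0 during daylight saving time)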

Uploading to OmniSci Directly From Pandas

Since my PyData talk is going to be using OmniSci, I need to upload this corrected data or rebuild all my work (I’ll opt for fixing my source). Luckily, the pymapd package provides tight integration to an OmniSci database, providing a means of uploading the data directly from a pandas dataframe:

import pymapd

#connect to database
conn = pymapd.connect(host="localhost", port=9091, user="mapd", password="HyperInteractive", dbname="mapd")

#truncate table so that table definition can be reused
conn.execute("truncate table hrl_load_metered")

#re-load data into table
#with none of the optional arguments, pymapd infers that this is an insert operation, since table name exists
conn.load_table_columnar("hrl_load_metered", df)

I have a pre-existing table hrl_load_metered on the database, so I can truncate the table to remove its (incorrect) data but keep the table structure. Then I can use load_table_columnar to insert the cleaned up data into my table and now my data is correct.
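A quick way to confirm the reload worked is to count the rows back out; pymapd follows the DB-API cursor interface, so a sketch might look like this:

# count the rows back out to confirm the insert landed
cursor = conn.execute("SELECT COUNT(*) FROM hrl_load_metered")
print(list(cursor))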

Computers May Be Dumb, But Humans Are Lazy

At the beginning, I joked that computers are dumb. Computers are just tools that do exactly what a human programs them to do, and really, it was my laziness that caused this data error. Luckily, I did catch this before my talk and the fix is pretty easy.

I’d like to say I’m going to remember to check my data going forward, but in reality, I’m just documenting this here for the next time I make the same, lazy mistake.

October 16, 2018 12:00 AM UTC

October 15, 2018


Will McGugan

Adding type hints to the Django ORM

It occurred to me that Django's ORM could do with a bit of a revamp to make use of recent developments in the Python language.

The main area where I think Django's models are missing out is the lack of type hinting (hardly surprising since Django pre-dates type hints). Adding type hints allows Mypy to detect bugs before you even run your code. It may only save you minutes each time, but multiply that by the number of code + run iterations you do each day, and it can save hours of development time. Multiply that by the lifetime of your project, and it could save weeks or months. A clear win.

Typing Django Models

I'd love to be able to use type hints with the Django ORM, but it seems that the magic required to create Django models is just too dynamic and would defy any attempts to use typing. Fortunately that may not necessarily be the case. Type hints can be inspected at runtime, and we could use this information when building the model, while still allowing Mypy to analyze our code. Take the following trivial Django model:

class Foo(models.Model):
    count = models.IntegerField(default=0)

The same information could be encoded in type hints as follows:

class Foo(TypedModel):
    count: int = 0

The TypedModel class could inspect the type hints and create the integer field in the same way as models.Model uses IntegerField and friends. But this would also tell Mypy that instances of Foo have an integer attribute called count.

But what of nullable fields? How can we express those in type hints? The following would cover it:

class Foo(TypedModel):
    count: Optional[int] = 0

The Optional type hint tells Mypy that the attribute could be None, which could also be used to instruct TypedModel to create a nullable field.
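None of this exists yet (more on that below), but the hint-inspection part is straightforward to sketch. Here is a rough, framework-free illustration of a base class that records the type, nullability, and default of each annotated attribute at class-creation time; mapping those onto real Django fields is deliberately left out and all names are illustrative:

import typing

class TypedModel:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._fields = {}
        for name, hint in typing.get_type_hints(cls).items():
            nullable = False
            args = getattr(hint, "__args__", ())
            if type(None) in args:  # Optional[X] is really Union[X, None]
                nullable = True
                hint = next(a for a in args if a is not type(None))
            default = getattr(cls, name, None)
            cls._fields[name] = (hint, nullable, default)

class Foo(TypedModel):
    count: typing.Optional[int] = 0

print(Foo._fields)  # {'count': (<class 'int'>, True, 0)}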

So type hints contain enough information to set the type of the field, the default value, and whether the field is nullable--but there are other pieces of information associated with fields in Django models; a CharField has a max_length attribute for instance:

class Bar(models.Model):
    name = models.CharField(max_length=30)

There's nowhere in the type hinting to express the maximum length of a string, so we would have to use a custom object in addition to the type hints. Here's how that might be implemented:

class Bar(TypedModel):
    name: str = String(max_length=30)

The String class contains the maximum length information and additional meta information for the field. This class would have to be a subclass of the type specified in the hint, i.e. str, or Mypy would complain. Here's an example implementation:

class String(str):
    def __new__(cls, max_length=None):
        obj = super().__new__(cls)
        obj.max_length = max_length
        return obj

The above class creates an object that acts like a str, but has properties that could be inspected by the TypedModel class.

The entire model could be built using these techniques. Here's a larger example of what the proposed changes might look like:

class Student(TypedModel):
    name: str = String(max_length=30)  # CharField
    notes: str = ""  # TextField with empty default 
    birthday: datetime  # DateTimeField
    teacher: Optional[Staff] = None  # Nullable ForeignKey to Staff table
    classes: List[Subject]   # ManyToMany 

It's more terse than a typical Django model, which is a nice benefit, but the main advantage is that Mypy can detect errors (VSCode will even highlight such errors right in the editor).

For instance there is a bug in this line of code:

return {"teacher_name": student.teacher.name}

If the teacher field is ever null, that line will throw something like NoneType has no attribute "name". A silly error which may go unnoticed, even after a code review and 100% unit test coverage. No doubt only occurring in production at the weekend when your boss/client is giving a demo. But with typing, Mypy would catch that.

Specifying Meta

Another area where I think modern Python could improve Django models is specifying the model's meta information.

This may be subjective, but I've never been a huge fan of the way Django uses an inner class (a class defined in a class) to store additional information about the model. Python 3 gives us another option: we can add keyword args to the class statement (where you would specify the metaclass). This feels like a better place to add additional information about the model. Let's compare...

Here's an example taken from the docs:

class Ox(models.Model):
    horn_length = models.IntegerField()

    class Meta:
        ordering = ["horn_length"]
        verbose_name_plural = "oxen"

Here's the equivalent, using class keyword args:

class Ox(TypedModel, ordering=["horn_length"], verbose_name_plural="oxen"):
    horn_length : int

The extra keywords args may result in a large line, but these could be formatted differently (in the style preferred by black):

class Ox(
    TypedModel,
    ordering=["horn_length"],
    verbose_name_plural="oxen"
):
    horn_length : int

I think the class keyword args are neater, but YMMV.
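Mechanically, those keyword arguments are handed to __init_subclass__ (or to a metaclass), so a TypedModel base could capture them as the equivalent of today's Meta options. Continuing the sketch style from above, with illustrative names:

class TypedModel:
    def __init_subclass__(cls, ordering=None, verbose_name_plural=None, **kwargs):
        super().__init_subclass__(**kwargs)
        # store the class keyword arguments where Django-style machinery could find them
        cls._meta_options = {
            "ordering": ordering or [],
            "verbose_name_plural": verbose_name_plural,
        }

class Ox(TypedModel, ordering=["horn_length"], verbose_name_plural="oxen"):
    horn_length: int

print(Ox._meta_options)  # {'ordering': ['horn_length'], 'verbose_name_plural': 'oxen'}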

Code?

I'm sorry to say that none of this exists in code form (unless somebody else has come up with the same idea). I do think it could be written in such a way that the TypedModel and traditional models.Model definitions would be interchangeable, since all I'm proposing is a little syntactical sugar and no changes in functionality.

It did occur to me to start work on this, but then I remembered I have plenty of projects and other commitments to keep me busy for the near future. I'm hoping that this will be picked up by somebody strong on typing who understands metaclasses well enough to take this on.

October 15, 2018 10:14 PM UTC


Programming Ideas With Jake

Python Descriptors 2nd Edition!

The second edition of my book was just published and is available at the source and on Amazon. To purchase, just click one of the links in the sidebar! Also, I'll be writing up a new article to be published here this weekend, so look forward to that!

October 15, 2018 10:11 PM UTC