
Planet Python

Last update: April 23, 2019 01:47 AM UTC

April 22, 2019


NumFOCUS

NumFOCUS Projects to Apply for Inaugural Google Season of Docs

The post NumFOCUS Projects to Apply for Inaugural Google Season of Docs appeared first on NumFOCUS.

April 22, 2019 09:41 PM UTC


Podcast.__init__

Exploring Indico: A Full Featured Event Management Platform

Summary

Managing an event is rife with inherent complexity that scales as you move from scheduling a meeting to organizing a conference. Indico is a platform built at CERN to handle their efforts to organize events such as the Computing in High Energy Physics (CHEP) conference, and now it has grown to manage booking of meeting rooms. In this episode Adrian Mönnich, core developer on the Indico project, explains how it is architected to facilitate this use case, how it has evolved since its first incarnation two decades ago, and what he has learned while working on it. The Indico platform is definitely a feature rich and mature platform that is worth considering if you are responsible for organizing a conference or need a room booking system for your office.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Adrian Mönnich about Indico, the effortless open-source tool for event organisation, archival and collaboration

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Indico is and how the project got started?
    • What are some other projects which target a similar use case and what were they lacking that led to Indico being necessary?
  • Can you talk through an example workflow for setting up and managing an event in Indico?
    • How does the lifecycle change when working with larger events, such as PyCon?
  • Can you describe how Indico is architected and how its design has evolved since it was first built?
    • What are some of the most complex or challenging portions of Indico to implement and maintain?
  • There are a lot of areas for exercising constraint resolution algorithms. Can you talk through some of the business logic of how that operates?
  • Most of Indico is highly configurable and flexible. How do you approach managing sane defaults to prevent users getting overwhelmed when onboarding?
    • What is your approach to testing given how complex the project is?
  • What are some of the most interesting or unexpected ways that you have seen Indico used?
  • What are some of the most interesting/unexpected lessons that you have learned in the process of building Indico?
  • What do you have planned for the future of the project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

April 22, 2019 06:18 PM UTC


Codementor

Variable references in Python


April 22, 2019 03:58 PM UTC


ListenData

Loops in Python explained with examples

This tutorial covers various ways to execute loops in Python. Loops are an important concept in any programming language: they perform iterations, i.e. run specific code repeatedly until a certain condition is reached.

Real Time Examples of Loop

  1. The software in an ATM runs in a loop, processing transaction after transaction until you confirm that you have nothing more to do.
  2. The software on a mobile device gives the user 5 password attempts to unlock it; after that, it resets the device.
  3. Putting your favorite song on repeat mode is also a loop.
  4. You want to run a particular analysis on each column of your data set.

1. For Loop

Like R and the C programming language, Python has a for loop. It is one of the most commonly used methods for automating repetitive tasks.

How does a for loop work?

Suppose you are asked to print the sequence of numbers from 1 to 9, incrementing by 2.
for i in range(1,10,2):
    print(i)
Output
1
3
5
7
9
range(1,10,2) means: start at 1, end at 9 (10 is excluded), and increment by 2.

Iteration over a list
This section covers how to run a for loop over a list.
mylist = [30,21,33,42,53,64,71,86,97,10]
for i in mylist:
    print(i)
Output
30
21
33
42
53
64
71
86
97
10

Suppose you need to select every 3rd value of the list.
for i in mylist[::3]:
    print(i)
Output
30
42
71
10
mylist[::3] is equivalent to mylist[0::3], following the slice syntax list[start:stop:step].
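
The step can also be negative, which walks the list backwards. For example:
mylist[::-1]  #reverses the list
mylist[1::2]  #every 2nd value starting at index 1: [21, 42, 64, 86, 10]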

Python Loop Explained with Examples

Example 1 : Create a new list containing only the items from the list that are between 0 and 10
l1 = [100, 1, 10, 2, 3, 5, 8, 13, 21, 34, 55, 98]

new = [] #Blank list
for i in l1:
    if i > 0 and i <= 10:
        new.append(i)

new
Output: [1, 10, 2, 3, 5, 8]
This can also be done with the numpy package by converting the list to a numpy array and filtering it with a boolean condition (note that both bounds are needed to match the stated condition). See the code below.
import numpy as np
k = np.array(l1)
new = k[np.where((k > 0) & (k <= 10))]
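
The same filter can also be written as a list comprehension, which is the most idiomatic form in plain Python:
new = [i for i in l1 if 0 < i <= 10]
Output: [1, 10, 2, 3, 5, 8]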

Example 2 : Check which letters of the alphabet (a-z) are mentioned in a string

Suppose you have a string named k and you want to check which letters of the alphabet exist in it.
k = "deepanshu"

import string
for n in string.ascii_lowercase:
    if n in k:
        print(n + ' exists in ' + k)
    else:
        print(n + ' does not exist in ' + k)
string.ascii_lowercase returns 'abcdefghijklmnopqrstuvwxyz'.
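
As an aside, you can get the same information without an explicit loop by treating both strings as sets. A small sketch:
import string
present = sorted(set(k) & set(string.ascii_lowercase))
print(present)  #['a', 'd', 'e', 'h', 'n', 'p', 's', 'u']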

Practical Examples : for in loop in Python

Create a sample pandas data frame for illustrative purposes. Note that both pandas and numpy are needed:
import pandas as pd
import numpy as np

np.random.seed(234)
df = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
                   "Month1" : np.random.normal(size=10),
                   "Month2" : np.random.normal(size=10),
                   "Month3" : np.random.normal(size=10),
                   "price" : range(10)
                   })

df
1. Multiply each month column by 1.2
for i in range(1,4):
    print(df["Month"+str(i)]*1.2)
range(1,4) returns 1, 2 and 3. The str( ) function converts a number to a string, so "Month" + str(1) means Month1.
2. Store computed columns in a new data frame
import pandas as pd
newDF = pd.DataFrame()
for i in range(1,4):
    data = pd.DataFrame(df["Month"+str(i)]*1.2)
    newDF = pd.concat([newDF,data], axis=1)
pd.DataFrame( ) creates a blank data frame. The concat() function from the pandas package concatenates two data frames.

3. If the value of x1 >= 50, multiply each month value by price. Otherwise, keep the month value unchanged.
import pandas as pd
import numpy as np
for i in range(1,4):
    df['newcol'+str(i)] = np.where(df['x1'] >= 50,
                                   df['Month'+str(i)] * df['price'],
                                   df['Month'+str(i)])
In this example, we are adding new columns named newcol1, newcol2 and newcol3. np.where(condition, value_if_true, value_if_false) is used to construct an IF-ELSE statement.

4. Filter the data frame by each unique value of a column and store it in a separate data frame
mydata = pd.DataFrame({"X1" : ["A","A","B","B","C"]})

for name in mydata.X1.unique():
    temp = pd.DataFrame(mydata[mydata.X1 == name])
    exec('{} = temp'.format(name))
The unique( ) function returns the distinct values of a variable. The exec( ) function dynamically executes Python code. See the usage of the string format( ) function below -
s = "Your Input"
"i am {}".format(s)

Output: 'i am Your Input'
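
Note that exec( ) creates variables dynamically, which is hard to debug and risky with untrusted input. A safer alternative (our suggestion, not part of the original example) is to store each filtered data frame in a dictionary keyed by the group value:
frames = {}
for name in mydata.X1.unique():
    frames[name] = mydata[mydata.X1 == name].copy()
frames["A"]  #returns the rows where X1 equals "A"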

Loop Control Statements

Loop control statements change execution from its normal sequence of iterations.

Python supports the following control statements.
  1. Continue statement
  2. Break statement

Continue Statement
When a continue statement executes, it skips the rest of the code in the loop for the current iteration and continues with the next one.
In the code below, we skip printing the letters a and d.
for n in "abcdef":
    if n == "a" or n == "d":
        continue
    print("letter :", n)
letter : b
letter : c
letter : e
letter : f
Break Statement
When a break statement runs, it stops the loop entirely.
In this program, the loop stops executing as soon as n is c, so d, e and f are never reached.
for n in "abcdef":
    if n == "c" or n == "d":
        break
    print("letter :", n)
letter : a
letter : b

for loop with else clause

Using an else clause with a for loop is not common in the Python developer community.
The else clause executes after the loop completes normally, meaning the loop did not encounter a break statement.
The program below finds factors for the numbers 2 through 9. The else clause catches numbers which have no factors and are therefore prime:

for k in range(2, 10):
    for y in range(2, k):
        if k % y == 0:
            print(k, '=', y, '*', k//y)
            break
    else:
        print(k, 'is a prime number')
2 is a prime number
3 is a prime number
4 = 2 * 2
5 is a prime number
6 = 2 * 3
7 is a prime number
8 = 2 * 4
9 = 3 * 3
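
A classic use of the else clause is searching: else runs only when the loop finishes without hitting break, i.e. when nothing was found. A minimal sketch:
items = [30, 21, 33, 42]
target = 50
for i in items:
    if i == target:
        print("found", target)
        break
else:
    print(target, "not found")
Output: 50 not found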

While Loop

A while loop executes code repeatedly as long as a condition is true. When the condition becomes false, execution continues at the line immediately after the loop.
i = 1
while i < 10:
    print(i)
    i += 2 #means i = i + 2
    print("new i :", i)
Output:
1
new i : 3
3
new i : 5
5
new i : 7
7
new i : 9
9
new i : 11

While Loop with If-Else Statement

If-Else statements can be used along with a while loop. See the program below -

counter = 1
while counter <= 5:
    if counter < 2:
        print("Less than 2")
    elif counter > 4:
        print("Greater than 4")
    else:
        print(">= 2 and <=4")
    counter += 1
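
Putting while and break together, you can model the mobile-unlock example from the start of this tutorial: allow at most 5 password attempts and reset the device on failure. A small sketch with made-up values:
correct_pin = "1234" #hypothetical stored PIN
attempts = 0
unlocked = False
while attempts < 5 and not unlocked:
    guess = input("Enter PIN: ")
    if guess == correct_pin:
        unlocked = True
    attempts += 1
if unlocked:
    print("Unlocked")
else:
    print("Device reset after 5 failed attempts")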

April 22, 2019 02:10 PM UTC


Real Python

A Beginner’s Guide to the Python time Module

The Python time module provides many ways of representing time in code, such as objects, numbers, and strings. It also provides functionality other than representing time, like waiting during code execution and measuring the efficiency of your code.

This article will walk you through the most commonly used functions and objects in time.

By the end of this article, you’ll be able to:

  • Represent time as a floating point number, a tuple, and a struct_time object
  • Convert between those representations
  • Suspend thread execution with sleep()
  • Measure code performance with perf_counter()

You’ll start by learning how you can use a floating point number to represent time.


Dealing With Python Time Using Seconds

One of the ways you can manage the concept of Python time in your application is by using a floating point number that represents the number of seconds that have passed since the beginning of an era—that is, since a certain starting point.

Let’s dive deeper into what that means, why it’s useful, and how you can use it to implement logic, based on Python time, in your application.

The Epoch

You learned in the previous section that you can manage Python time with a floating point number representing elapsed time since the beginning of an era.

Merriam-Webster defines an era as “a fixed point in time from which a series of years is reckoned.”

The important concept to grasp here is that, when dealing with Python time, you’re considering a period of time identified by a starting point. In computing, you call this starting point the epoch.

The epoch, then, is the starting point against which you can measure the passage of time.

For example, if you define the epoch to be midnight on January 1, 1970 UTC—the epoch as defined on Windows and most UNIX systems—then you can represent midnight on January 2, 1970 UTC as 86400 seconds since the epoch.

This is because there are 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day. January 2, 1970 UTC is only one day after the epoch, so you can apply basic math to arrive at that result:

>>>
>>> 60 * 60 * 24
86400

It is also important to note that you can still represent time before the epoch. The number of seconds would just be negative.

For example, you would represent midnight on December 31, 1969 UTC (using an epoch of January 1, 1970) as -86400 seconds.

While January 1, 1970 UTC is a common epoch, it is not the only epoch used in computing. In fact, different operating systems, filesystems, and APIs sometimes use different epochs.

As you saw before, UNIX systems define the epoch as January 1, 1970. The Win32 API, on the other hand, defines the epoch as January 1, 1601.

You can use time.gmtime() to determine your system’s epoch:

>>>
>>> import time
>>> time.gmtime(0)
time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

You’ll learn about gmtime() and struct_time throughout the course of this article. For now, just know that you can use time to discover the epoch using this function.

Now that you understand more about how to measure time in seconds using an epoch, let’s take a look at Python’s time module to see what functions it offers that help you do so.

Python Time in Seconds as a Floating Point Number

First, time.time() returns the number of seconds that have passed since the epoch. The return value is a floating point number to account for fractional seconds:

>>>
>>> from time import time
>>> time()
1551143536.9323719

The number you get on your machine may be very different because the reference point considered to be the epoch may be very different.

Further Reading: Python 3.7 introduced time_ns(), which returns an integer value representing the same elapsed time since the epoch, but in nanoseconds rather than seconds.
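
If you’re on Python 3.7 or later, a quick check of time_ns() might look like this (the value shown is illustrative; yours will differ):

>>>
>>> from time import time_ns
>>> time_ns()  # integer nanoseconds since the epoch
1551143536932371968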

Measuring time in seconds is useful for a number of reasons: a single number is compact to store, easy to compare, and simple to do arithmetic with.

Sometimes, however, you may want to see the current time represented as a string. To do so, you can pass the number of seconds you get from time() into time.ctime().

Python Time in Seconds as a String Representing Local Time

As you saw before, you may want to convert the Python time, represented as the number of elapsed seconds since the epoch, to a string. You can do so using ctime():

>>>
>>> from time import time, ctime
>>> t = time()
>>> ctime(t)
'Mon Feb 25 19:11:59 2019'

Here, you’ve recorded the current time in seconds into the variable t, then passed t as an argument to ctime(), which returns a string representation of that same time.

Technical Detail: The argument, representing seconds since the epoch, is optional according to the ctime() definition. If you don’t pass an argument, then ctime() uses the return value of time() by default. So, you could simplify the example above:

>>>
>>> from time import ctime
>>> ctime()
'Mon Feb 25 19:11:59 2019'

The string representation of time, also known as a timestamp, returned by ctime() is formatted with the following structure:

  1. Day of the week: Mon (Monday)
  2. Month of the year: Feb (February)
  3. Day of the month: 25
  4. Hours, minutes, and seconds using the 24-hour clock notation: 19:11:59
  5. Year: 2019

The previous example displays the timestamp of a particular moment captured from a computer in the South Central region of the United States. But, let’s say you live in Sydney, Australia, and you executed the same command at the same instant.

Instead of the above output, you’d see the following:

>>>
>>> from time import time, ctime
>>> t = time()
>>> ctime(t)
'Tue Feb 26 12:11:59 2019'

Notice that the day of week, day of month, and hour portions of the timestamp are different than the first example.

These outputs are different because the timestamp returned by ctime() depends on your geographical location.

Note: While the concept of time zones is relative to your physical location, you can modify this in your computer’s settings without actually relocating.

The representation of time dependent on your physical location is called local time and makes use of a concept called time zones.

Note: Since local time is related to your locale, timestamps often account for locale-specific details such as the order of the elements in the string and translations of the day and month abbreviations. ctime() ignores these details.

Let’s dig a little deeper into the notion of time zones so that you can better understand Python time representations.

Understanding Time Zones

A time zone is a region of the world that conforms to a standardized time. Time zones are defined by their offset from Coordinated Universal Time (UTC) and, potentially, the inclusion of daylight savings time (which we’ll cover in more detail later in this article).

Fun Fact: If you’re a native English speaker, you might be wondering why the abbreviation for “Coordinated Universal Time” is UTC rather than the more obvious CUT. However, if you’re a native French speaker, you would call it “Temps Universel Coordonné,” which suggests a different abbreviation: TUC.

Ultimately, the International Telecommunication Union and the International Astronomical Union compromised on UTC as the official abbreviation so that, regardless of language, the abbreviation would be the same.

UTC and Time Zones

UTC is the time standard against which all the world’s timekeeping is synchronized (or coordinated). It is not, itself, a time zone but rather a transcendent standard that defines what time zones are.

UTC time is precisely measured using astronomical time, referring to the Earth’s rotation, and atomic clocks.

Time zones are then defined by their offset from UTC. For example, in North and South America, the Central Time Zone (CT) is behind UTC by five or six hours and, therefore, uses the notation UTC-5:00 or UTC-6:00.

Sydney, Australia, on the other hand, belongs to the Australian Eastern Time Zone (AET), which is ten or eleven hours ahead of UTC (UTC+10:00 or UTC+11:00).

This difference (UTC-6:00 to UTC+10:00) is the reason for the variance you observed in the two outputs from ctime() in the previous examples:

  • Central Time: 'Mon Feb 25 19:11:59 2019'
  • Australian Eastern Time: 'Tue Feb 26 12:11:59 2019'

These times are seventeen hours apart, one hour more than the base offsets above suggest.

You may be wondering where that extra hour comes from, or why CT can be either five or six hours behind UTC and AET ten or eleven hours ahead. The reason for this is that some areas around the world, including parts of these time zones, observe daylight savings time.

Daylight Savings Time

Summer months generally experience more daylight hours than winter months. Because of this, some areas observe daylight savings time (DST) during the spring and summer to make better use of those daylight hours.

For places that observe DST, their clocks will jump ahead one hour at the beginning of spring (effectively losing an hour). Then, in the fall, the clocks will be reset to standard time.

The letters S and D represent standard time and daylight savings time in time zone notation. For example, the US Central Time Zone is written CST (Central Standard Time) in winter and CDT (Central Daylight Time) in summer.

When you represent times as timestamps in local time, it is always important to consider whether DST is applicable or not.

ctime() accounts for daylight savings time. So, the output difference listed previously is more accurately described as CST (UTC-06:00) versus AEDT (Australian Eastern Daylight Time, UTC+11:00), which explains the full seventeen hours.

Dealing With Python Time Using Data Structures

Now that you have a firm grasp on many fundamental concepts of time including epochs, time zones, and UTC, let’s take a look at more ways to represent time using the Python time module.

Python Time as a Tuple

Instead of using a number to represent Python time, you can use another primitive data structure: a tuple.

The tuple allows you to manage time a little more easily by abstracting some of the data and making it more readable.

When you represent time as a tuple, each element in your tuple corresponds to a specific element of time:

  1. Year
  2. Month as an integer, ranging between 1 (January) and 12 (December)
  3. Day of the month
  4. Hour as an integer, ranging between 0 (12 A.M.) and 23 (11 P.M.)
  5. Minute
  6. Second
  7. Day of the week as an integer, ranging between 0 (Monday) and 6 (Sunday)
  8. Day of the year
  9. Daylight savings time as an integer with the following values:
    • 1 is daylight savings time.
    • 0 is standard time.
    • -1 is unknown.

Using the methods you’ve already learned, you can represent the same Python time in two different ways:

>>>
>>> from time import time, ctime
>>> t = time()
>>> t
1551186415.360564
>>> ctime(t)
'Tue Feb 26 07:06:55 2019'

>>> time_tuple = (2019, 2, 26, 7, 6, 55, 1, 57, 0)

In this case, both t and time_tuple represent the same time, but the tuple provides a more readable interface for working with time components.

Technical Detail: Actually, if you look at the Python time represented by time_tuple in seconds (which you’ll see how to do later in this article), you’ll see that it resolves to 1551186415.0 rather than 1551186415.360564.

This is because the tuple doesn’t have a way to represent fractional seconds.

While the tuple provides a more manageable interface for working with Python time, there is an even better object: struct_time.

Python Time as an Object

The problem with the tuple construct is that it still looks like a bunch of numbers, even though it’s better organized than a single, seconds-based number.

struct_time provides a solution to this by associating the tuple’s sequence of numbers with useful named identifiers, in the style of a namedtuple from Python’s collections module:

>>>
>>> from time import struct_time
>>> time_tuple = (2019, 2, 26, 7, 6, 55, 1, 57, 0)
>>> time_obj = struct_time(time_tuple)
>>> time_obj
time.struct_time(tm_year=2019, tm_mon=2, tm_mday=26, tm_hour=7, tm_min=6, tm_sec=55, tm_wday=1, tm_yday=57, tm_isdst=0)

Technical Detail: If you’re coming from another language, the terms struct and object might be in opposition to one another.

In Python, there is no data type called struct. Instead, everything is an object.

However, the name struct_time is derived from the C-based time library where the data type is actually a struct.

In fact, Python’s time module, which is implemented in C, uses this struct directly by including the header file time.h.

Now, you can access specific elements of time_obj using the attribute’s name rather than an index:

>>>
>>> day_of_year = time_obj.tm_yday
>>> day_of_year
57
>>> day_of_month = time_obj.tm_mday
>>> day_of_month
26
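
Because struct_time is still a tuple underneath, positional indexing keeps working alongside the named attributes. For example, index 7 corresponds to tm_yday:

>>>
>>> time_obj[7]
57
>>> time_obj[2] == time_obj.tm_mday
True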

Beyond the readability and usability of struct_time, it is also important to know because it is the return type of many of the functions in the Python time module.

Converting Python Time in Seconds to an Object

Now that you’ve seen the three primary ways of working with Python time, you’ll learn how to convert between the different time data types.

Converting between time data types is dependent on whether the time is in UTC or local time.

Coordinated Universal Time (UTC)

The epoch is defined in terms of UTC rather than a local time zone. Therefore, the number of seconds elapsed since the epoch does not vary with your geographical location.

However, the same cannot be said of struct_time. The object representation of Python time may or may not take your time zone into account.

There are two ways to convert a float representing seconds to a struct_time:

  1. UTC
  2. Local time

To convert a Python time float to a UTC-based struct_time, the Python time module provides a function called gmtime().

You’ve seen gmtime() used once before in this article:

>>>
>>> import time
>>> time.gmtime(0)
time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

You used this call to discover your system’s epoch. Now, you have a better foundation for understanding what’s actually happening here.

gmtime() converts the number of elapsed seconds since the epoch to a struct_time in UTC. In this case, you’ve passed 0 as the number of seconds, meaning you’re trying to find the epoch, itself, in UTC.

Note: Notice the attribute tm_isdst is set to 0. This attribute represents whether the time zone is using daylight savings time. UTC never subscribes to DST, so that flag will always be 0 when using gmtime().

As you saw before, struct_time cannot represent fractional seconds, so gmtime() ignores the fractional seconds in the argument:

>>>
>>> import time
>>> time.gmtime(1.99)
time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=1, tm_wday=3, tm_yday=1, tm_isdst=0)

Notice that even though the number of seconds you passed was very close to 2, the .99 fractional seconds were simply ignored, as shown by tm_sec=1.

The secs parameter for gmtime() is optional, meaning you can call gmtime() with no arguments. Doing so will provide the current time in UTC:

>>>
>>> import time
>>> time.gmtime()
time.struct_time(tm_year=2019, tm_mon=2, tm_mday=28, tm_hour=12, tm_min=57, tm_sec=24, tm_wday=3, tm_yday=59, tm_isdst=0)

Interestingly, there is no inverse for this function within time. Instead, you’ll have to look in Python’s calendar module for a function named timegm():

>>>
>>> import calendar
>>> import time
>>> time.gmtime()
time.struct_time(tm_year=2019, tm_mon=2, tm_mday=28, tm_hour=13, tm_min=23, tm_sec=12, tm_wday=3, tm_yday=59, tm_isdst=0)
>>> calendar.timegm(time.gmtime())
1551360204

timegm() takes a tuple (or struct_time, since it is a subclass of tuple) and returns the corresponding number of seconds since the epoch.

Historical Context: If you’re interested in why timegm() is not in time, you can view the discussion in Python Issue 6280.

In short, it was originally added to calendar because time closely follows C’s time library (defined in time.h), which contains no matching function. The above-mentioned issue proposed the idea of moving or copying timegm() into time.

However, with advances to the datetime library, inconsistencies in the patched implementation of time.timegm(), and a question of how to then handle calendar.timegm(), the maintainers declined the patch, encouraging the use of datetime instead.

Working with UTC is valuable in programming because it’s a standard. You don’t have to worry about DST, time zone, or locale information.

That said, there are plenty of cases when you’d want to use local time. Next, you’ll see how to convert from seconds to local time so that you can do just that.

Local Time

In your application, you may need to work with local time rather than UTC. Python’s time module provides a function for getting local time from the number of seconds elapsed since the epoch called localtime().

The signature of localtime() is similar to gmtime() in that it takes an optional secs argument, which it uses to build a struct_time using your local time zone:

>>>
>>> import time
>>> time.time()
1551448206.86196
>>> time.localtime(1551448206.86196)
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=7, tm_min=50, tm_sec=6, tm_wday=4, tm_yday=60, tm_isdst=0)

Notice that tm_isdst=0. Since DST matters with local time, tm_isdst will change between 0 and 1 depending on whether or not DST is applicable for the given time. Since tm_isdst=0, DST is not applicable for March 1, 2019.

In the United States in 2019, daylight savings time begins on March 10. So, to test if the DST flag will change correctly, you need to add 9 days’ worth of seconds to the secs argument.

To compute this, you take the number of seconds in a day (86,400) and multiply that by 9 days:

>>>
>>> new_secs = 1551448206.86196 + (86400 * 9)
>>> time.localtime(new_secs)
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=10, tm_hour=8, tm_min=50, tm_sec=6, tm_wday=6, tm_yday=69, tm_isdst=1)

Now, you’ll see that the struct_time shows the date to be March 10, 2019 with tm_isdst=1. Also, notice that tm_hour has also jumped ahead, to 8 instead of 7 in the previous example, because of daylight savings time.

Since Python 3.3, struct_time has also included two attributes that are useful in determining the time zone of the struct_time:

  1. tm_zone
  2. tm_gmtoff

At first, these attributes were platform dependent, but they have been available on all platforms since Python 3.6.

First, tm_zone stores the local time zone:

>>>
>>> import time
>>> current_local = time.localtime()
>>> current_local.tm_zone
'CST'

Here, you can see that localtime() returns a struct_time with the time zone set to CST (Central Standard Time).

As you saw before, you can also tell the time zone based on two pieces of information, the UTC offset and DST (if applicable):

>>>
>>> import time
>>> current_local = time.localtime()
>>> current_local.tm_gmtoff
-21600
>>> current_local.tm_isdst
0

In this case, you can see that current_local is 21600 seconds behind GMT, which stands for Greenwich Mean Time. GMT is the time zone with no UTC offset: UTC±00:00.

21,600 seconds divided by 3,600 seconds per hour gives 6 hours, which means that current_local time is GMT-06:00 (or UTC-06:00).

You can use the GMT offset plus the DST status to deduce that current_local is UTC-06:00 at standard time, which corresponds to the Central standard time zone.

Like gmtime(), you can ignore the secs argument when calling localtime(), and it will return the current local time in a struct_time:

>>>
>>> import time
>>> time.localtime()
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=8, tm_min=34, tm_sec=28, tm_wday=4, tm_yday=60, tm_isdst=0)

Unlike gmtime(), the inverse function of localtime() does exist in the Python time module. Let’s take a look at how that works.

Converting a Local Time Object to Seconds

You’ve already seen how to convert a UTC time object to seconds using calendar.timegm(). To convert local time to seconds, you’ll use mktime().

mktime() requires you to pass a parameter called t that takes the form of either a normal 9-tuple or a struct_time object representing local time:

>>>
>>> import time

>>> time_tuple = (2019, 3, 10, 8, 50, 6, 6, 69, 1)
>>> time.mktime(time_tuple)
1552225806.0

>>> time_struct = time.struct_time(time_tuple)
>>> time.mktime(time_struct)
1552225806.0

It’s important to keep in mind that t must be a tuple representing local time, not UTC:

>>>
>>> from time import gmtime, mktime

>>> # 1
>>> current_utc = time.gmtime()
>>> current_utc
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=14, tm_min=51, tm_sec=19, tm_wday=4, tm_yday=60, tm_isdst=0)

>>> # 2
>>> current_utc_secs = mktime(current_utc)
>>> current_utc_secs
1551473479.0

>>> # 3
>>> time.gmtime(current_utc_secs)
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=20, tm_min=51, tm_sec=19, tm_wday=4, tm_yday=60, tm_isdst=0)

Note: For this example, assume that the local time is March 1, 2019 08:51:19 CST.

This example shows why it’s important to use mktime() with local time, rather than UTC:

  1. gmtime() with no argument returns a struct_time using UTC. current_utc shows March 1, 2019 14:51:19 UTC. This is accurate because CST is UTC-06:00, so UTC should be 6 hours ahead of local time.

  2. mktime() tries to return the number of seconds, expecting local time, but you passed current_utc instead. So, instead of understanding that current_utc is UTC time, it assumes you meant March 1, 2019 14:51:19 CST.

  3. gmtime() is then used to convert those seconds back into UTC, which results in an inconsistency. The time is now March 1, 2019 20:51:19 UTC. The reason for this discrepancy is the fact that mktime() expected local time. So, the conversion back to UTC adds another 6 hours to local time.

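To see the correct pairings side by side, here is a small sketch (assuming the same CST environment as in the note above): calendar.timegm() inverts gmtime(), while mktime() inverts localtime():

>>>
>>> import calendar, time
>>> secs = 1551473479.0
>>> calendar.timegm(time.gmtime(secs))      # UTC round trip: value preserved
1551473479
>>> int(time.mktime(time.localtime(secs)))  # local round trip: value preserved
1551473479
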
Working with time zones is notoriously difficult, so it’s important to set yourself up for success by understanding the differences between UTC and local time and the Python time functions that deal with each.

Converting a Python Time Object to a String

While working with tuples is fun and all, sometimes it’s best to work with strings.

String representations of time, also known as timestamps, help make times more readable and can be especially useful for building intuitive user interfaces.

There are two Python time functions that you use for converting a time.struct_time object to a string:

  1. asctime()
  2. strftime()

You’ll begin by learning about asctime().

asctime()

You use asctime() for converting a time tuple or struct_time to a timestamp:

>>>
>>> import time
>>> time.asctime(time.gmtime())
'Fri Mar  1 18:42:08 2019'
>>> time.asctime(time.localtime())
'Fri Mar  1 12:42:15 2019'

Both gmtime() and localtime() return struct_time instances, for UTC and local time respectively.

You can use asctime() to convert either struct_time to a timestamp. asctime() works similarly to ctime(), which you learned about earlier in this article, except instead of passing a floating point number, you pass a tuple. Even the timestamp format is the same between the two functions.

As with ctime(), the parameter for asctime() is optional. If you do not pass a time object to asctime(), then it will use the current local time:

>>>
>>> import time
>>> time.asctime()
'Fri Mar  1 12:56:07 2019'

As with ctime(), it also ignores locale information.

One of the biggest drawbacks of asctime() is its format inflexibility. strftime() solves this problem by allowing you to format your timestamps.

strftime()

You may find yourself in a position where the string format from ctime() and asctime() isn’t satisfactory for your application. Instead, you may want to format your strings in a way that’s more meaningful to your users.

One example of this is if you would like to display your time in a string that takes locale information into account.

To format strings, given a struct_time or Python time tuple, you use strftime(), which stands for “string format time.”

strftime() takes two arguments:

  1. format specifies the order and form of the time elements in your string.
  2. t is an optional time tuple.

To format a string, you use directives. Directives are character sequences that begin with a % and specify a particular time element, such as %Y for the year with century, %m for the month as a zero-padded number, and %d for the day of the month.

For example, you can output the date in your local time using the ISO 8601 standard like this:

>>>
>>> import time
>>> time.strftime('%Y-%m-%d', time.localtime())
'2019-03-01'

Further Reading: While representing dates using Python time is completely valid and acceptable, you should also consider using Python’s datetime module, which provides shortcuts and a more robust framework for working with dates and times together.

For example, you can simplify outputting a date in the ISO 8601 format using datetime:

>>>
>>> from datetime import date
>>> date(year=2019, month=3, day=1).isoformat()
'2019-03-01'

As you saw before, a great benefit of using strftime() over asctime() is its ability to render timestamps that make use of locale-specific information.

For example, if you want to represent the date and time in a locale-sensitive way, you can’t use asctime():

>>>
>>> from time import asctime
>>> asctime()
'Sat Mar  2 15:21:14 2019'

>>> import locale
>>> locale.setlocale(locale.LC_TIME, 'zh_HK')  # Chinese - Hong Kong
'zh_HK'
>>> asctime()
'Sat Mar  2 15:58:49 2019'

Notice that even after programmatically changing your locale, asctime() still returns the date and time in the same format as before.

Technical Detail: LC_TIME is the locale category for date and time formatting. The locale argument 'zh_HK' may be different, depending on your system.

When you use strftime(), however, you’ll see that it accounts for locale:

>>>
>>> from time import strftime, localtime
>>> strftime('%c', localtime())
'Sat Mar  2 15:23:20 2019'

>>> import locale
>>> locale.setlocale(locale.LC_TIME, 'zh_HK')  # Chinese - Hong Kong
'zh_HK'
>>> strftime('%c', localtime())
'六  3/ 2 15:58:12 2019'

Here, you have successfully utilized the locale information because you used strftime().

Note: %c is the directive for locale-appropriate date and time.

If the time tuple is not passed to the parameter t, then strftime() will use the result of localtime() by default. So, you could simplify the examples above by removing the optional second argument:

>>>
>>> from time import strftime
>>> strftime('The current local datetime is: %c')
'The current local datetime is: Fri Mar  1 23:18:32 2019'

Here, you’ve used the default time instead of passing your own as an argument. Also, notice that the format argument can consist of text other than formatting directives.

Further Reading: Check out this thorough list of directives available to strftime().

The Python time module also includes the inverse operation of converting a timestamp back into a struct_time object.

Converting a Python Time String to an Object

When you’re working with date and time related strings, it can be very valuable to convert the timestamp to a time object.

To convert a time string to a struct_time, you use strptime(), which stands for “string parse time”:

>>>
>>> from time import strptime
>>> strptime('2019-03-01', '%Y-%m-%d')
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=60, tm_isdst=-1)

The first argument to strptime() must be the timestamp you wish to convert. The second argument is the format that the timestamp is in.

The format parameter is optional and defaults to '%a %b %d %H:%M:%S %Y'. Therefore, if you have a timestamp in that format, you don’t need to pass it as an argument:

>>>
>>> strptime('Fri Mar 01 23:38:40 2019')
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=23, tm_min=38, tm_sec=40, tm_wday=4, tm_yday=60, tm_isdst=-1)

Since a struct_time has 9 key date and time components, strptime() must provide reasonable defaults for the components it can’t parse from the string.

In the previous examples, tm_isdst=-1. This means that strptime() can’t determine by the timestamp whether it represents daylight savings time or not.

Now you know how to work with Python times and dates using the time module in a variety of ways. However, there are other uses for time outside of simply creating time objects, getting Python time strings, and using seconds elapsed since the epoch.

Suspending Execution

One really useful Python time function is sleep(), which suspends the thread’s execution for a specified amount of time.

For example, you can suspend your program’s execution for 10 seconds like this:

>>>
>>> from time import sleep, strftime
>>> strftime('%c')
'Fri Mar  1 23:49:26 2019'
>>> sleep(10)
>>> strftime('%c')
'Fri Mar  1 23:49:36 2019'

Your program will print the first formatted datetime string, then pause for 10 seconds, and finally print the second formatted datetime string.

You can also pass fractional seconds to sleep():

>>>
>>> from time import sleep
>>> sleep(0.5)

sleep() is useful for testing or making your program wait for any reason, but you must be careful not to halt your production code unless you have good reason to do so.
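
One practical pattern built on sleep() is retrying a flaky operation with a growing delay between attempts. Here is a minimal sketch (our example, not from the time documentation), where func stands in for whatever operation you want to retry:

>>>
>>> from time import sleep
>>> def retry(func, attempts=3, delay=0.5):
...     # Call func(), sleeping a little longer after each failure
...     for attempt in range(attempts):
...         try:
...             return func()
...         except Exception:
...             if attempt == attempts - 1:
...                 raise
...             sleep(delay * (attempt + 1))  # waits 0.5s, then 1.0s, ...
...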

Before Python 3.5, a signal sent to your process could interrupt sleep(). However, in 3.5 and later, sleep() will always suspend execution for at least the specified amount of time, even if the process receives a signal.

sleep() is just one Python time function that can help you test your programs and make them more robust.

Measuring Performance

You can use time to measure the performance of your program.

The way you do this is to use perf_counter(), which, as the name suggests, provides a high-resolution performance counter for measuring short durations of time.

To use perf_counter(), you place a counter before your code begins execution as well as after your code’s execution completes:

>>>
>>> from time import perf_counter, sleep
>>> def longrunning_function():
...     for i in range(1, 11):
...         sleep(i / i ** 2)
...
>>> start = perf_counter()
>>> longrunning_function()
>>> end = perf_counter()
>>> execution_time = (end - start)
>>> execution_time
8.201258441999926

First, start captures the moment before you call the function. end captures the moment after the function returns. The function’s total execution time took (end - start) seconds.

Technical Detail: Python 3.7 introduced perf_counter_ns(), which works the same as perf_counter(), but uses nanoseconds instead of seconds.

perf_counter() (or perf_counter_ns()) is the most precise way to measure the performance of your code using one execution. However, if you’re trying to accurately gauge the performance of a code snippet, I recommend using the Python timeit module.

timeit specializes in running code many times to get a more accurate performance analysis and helps you to avoid oversimplifying your time measurement as well as other common pitfalls.
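
As a quick illustration (our example; the reported time will vary by machine), timing a small snippet with timeit looks like this:

>>>
>>> import timeit
>>> timeit.timeit('sum(range(1000))', number=10_000)  # total seconds for 10,000 runs
0.14501833400025498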

Conclusion

Congratulations! You now have a great foundation for working with dates and times in Python.

Now, you’re able to:

  • Represent time in seconds, as a tuple, and as a struct_time object
  • Convert between those representations with gmtime(), localtime(), mktime(), and calendar.timegm()
  • Produce and parse timestamps with ctime(), asctime(), strftime(), and strptime()
  • Suspend execution with sleep() and measure performance with perf_counter()

On top of all that, you’ve learned some fundamental concepts surrounding date and time, such as epochs, time zones, UTC, and daylight savings time.

Now, it’s time for you to apply your newfound knowledge of Python time in your real world applications!

Further Reading

If you want to continue learning more about using dates and times in Python, take a look at the datetime, calendar, and timeit modules, all of which came up in this article.



April 22, 2019 02:00 PM UTC


PyCharm

Interview: Dan Tofan for this week’s data science webinar

In the past few years, Python has made a big push into data science and PyCharm has as well. Years ago we added Jupyter Notebook integration, then 2017.3 introduced Scientific Mode for workflows that felt more like an IDE. In 2019.1 we re-invented our Jupyter support to also be more like a professional tool.

PyCharm and data science are thus a hot topic. Dan Tofan very recently published a Pluralsight course on using PyCharm for data science and we invited him for a webinar next week.

To help set the stage, below is an interview with Dan.


Let’s start with the key point: what does PyCharm bring to data scientists?

PyCharm brings a productivity boost to data scientists, by helping them explore data, debug Python code, write better Python code, and understand Python code faster. As a PyCharm user, I experienced and benefited from these productivity boosters, which I distilled into my first Pluralsight course, so that data scientists can make the most out of PyCharm in their activities.

For the webinar: who is it for and what can people expect you to cover?

If you are a data scientist who dabbled with PyCharm, then this webinar is for you. I will cover PyCharm’s most relevant features to data science: the scientific mode and the completely rewritten Jupyter support. I will show how these features interplay with other PyCharm features, such as refactoring code from Jupyter cells. I will use easy-to-understand code examples with popular data science libraries.

Now, back to the start: tell us a little about yourself.

Currently, I am a senior backend developer for Dimensions – a research data platform that uses data science, and links data on a total of over 140 million publications, grants, patents and clinical trials. I’ve always been curious, which led me to do my PhD studies at the University of Groningen (Netherlands) and learn more about statistics and data analysis.

Do Python data scientists feel like programmers first and data scientists second, or the reverse?

In my opinion, data science is a melting pot of skills from three complementing backgrounds: programmers, statisticians and business analysts. At the start of your data science journey, you are going to rely on the skills from your main background, and – as your skills expand – you are going to feel more and more like a data scientist.

Your course has a bunch of sections on software development practices and IDE tips. How important are these practices to “professional” data science?

As part of the melting pot, programmers bring a lot of value with their experiences ranging from software development practices to IDE tips. Data scientists from a programming background are already familiar with most of these, and those from other backgrounds benefit immensely.

Think of a code base that starts to grow: how do you write better code? How do you refactor the code? How can a new team member understand that code faster? These are some of the questions that my course helps with.

The course also covers three major facilities in PyCharm Professional: Scientific Mode, Jupyter support, and the Database tool. How do these fit in?

All of them are data centric, so they are very relevant to data scientists. These facilities are integrated nicely with other PyCharm capabilities such as debugging and refactoring. Overall, after watching the course and getting familiar with these capabilities, data scientists get a nice productivity boost.

This webinar is good timing. You just released the course and we just re-invented our Jupyter support. What do you think of the new, IDE-centric Jupyter integration?

I think the new Jupyter integration is an excellent step in the right direction, because you can use both Jupyter and PyCharm features such as debugging and code completion. Joel Grus gave an insightful and entertaining talk about Jupyter limitations at JupyterCon 2018. I think the new Jupyter integration in PyCharm can eventually help solve some Jupyter pain points raised by Joel, such as hidden state.

What’s one big problem or pain point in Jupyter that could benefit from new ideas or tooling?

Reproducibility is problematic with Jupyter and it is important for data science. For example, it’s easy to share a notebook on GitHub, then someone else tries to run it and gets different results. Perhaps the solution is a mix of discipline and better tools.

April 22, 2019 11:47 AM UTC


Ram Rachum

PySnooper: Never use print for debugging again


I just released a new open-source project!

https://github.com/cool-RR/PySnooper/.

April 22, 2019 10:54 AM UTC


ListenData

Pandas Python Tutorial - Learn by Examples

Pandas, one of the most popular packages in Python, is widely used for data manipulation. It is a very powerful and versatile package which makes data cleaning and wrangling much easier and more pleasant.

The pandas library is a great contribution to the Python community, and it helps make Python one of the top programming languages for data science and analytics. It has become the first choice of data analysts and scientists for data analysis and manipulation.

Data Analysis with Python : Pandas Step by Step Guide

Why pandas?
It has many functions which are essential for data handling. In short, it can perform the following tasks for you -
  1. Create a structured data set similar to R's data frame or an Excel spreadsheet.
  2. Read data from various sources such as CSV, TXT, XLSX, SQL databases, R etc.
  3. Select particular rows or columns from a data set
  4. Arrange data in ascending or descending order
  5. Filter data based on some conditions
  6. Summarize data by a classification variable
  7. Reshape data into wide or long format
  8. Perform time series analysis
  9. Merge and concatenate two datasets
  10. Iterate over the rows of a dataset
  11. Write or export data in CSV or Excel format

Datasets:

In this tutorial we will use two datasets: 'income' and 'iris'.
  1. 'income' data : This data contains the income of various states from 2002 to 2015. The dataset contains 51 observations and 16 variables. Download link
  2. 'iris' data : It comprises 150 observations and 5 variables. There are 3 species of flowers (50 flowers for each species), and for all of them the sepal length, sepal width, petal length and petal width are given. Download link


Important pandas functions to remember

The following is a list of common tasks along with pandas functions.
Utility Functions
Extract column names : df.columns
Select first 2 rows : df.iloc[:2]
Select first 2 columns : df.iloc[:,:2]
Select columns by name : df.loc[:,["col1","col2"]]
Select random no. of rows : df.sample(n = 10)
Select fraction of random rows : df.sample(frac = 0.2)
Rename the variables : df.rename( )
Select a column as index : df.set_index( )
Remove rows or columns : df.drop( )
Sort values : df.sort_values( )
Group variables : df.groupby( )
Filter : df.query( )
Find the missing values : df.isnull( )
Drop the missing values : df.dropna( )
Remove the duplicates : df.drop_duplicates( )
Create dummies : pd.get_dummies( )
Ranking : df.rank( )
Cumulative sum : df.cumsum( )
Quantiles : df.quantile( )
Select numeric variables : df.select_dtypes( )
Concatenate two data frames : pd.concat()
Merge on basis of common variable : pd.merge( )

Importing pandas library

You need to import or load the pandas library first in order to use it. "Importing a library" means loading it into memory so that you can use it. Run the following code to import the pandas library:
import pandas as pd
Here "pd" is an alias or abbreviation which will be used as a shortcut to access or call pandas functions. To access the functions from the pandas library, you just need to type pd.function instead of pandas.function every time you need to apply it.

Importing Dataset

To read or import data from CSV file, you can use read_csv() function. In the function, you need to specify the file location of your CSV file.
income = pd.read_csv("C:\\Users\\Hp\\Python\\Basics\\income.csv")
 Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
4 C California 1685349 1675807 1889570 1480280 1735069 1812546

Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607

Get Variable Names

By using the income.columns command, you can fetch the names of the variables of a data frame.
Index(['Index', 'State', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007',
'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013', 'Y2014', 'Y2015'],
dtype='object')
income.columns[0:2] returns the first two column names, 'Index' and 'State'. In Python, indexing starts from 0.

Knowing the Variable types

You can use the dataFrameName.dtypes command to see the types of the variables stored in the data frame.
income.dtypes 
Index    object
State object
Y2002 int64
Y2003 int64
Y2004 int64
Y2005 int64
Y2006 int64
Y2007 int64
Y2008 int64
Y2009 int64
Y2010 int64
Y2011 int64
Y2012 int64
Y2013 int64
Y2014 int64
Y2015 int64
dtype: object

Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

To see the variable type of one variable (let's say "State") instead of all the variables, you can use the command below -
income['State'].dtypes
It returns dtype('O'), where 'O' stands for object, i.e. a character (string) variable.

Changing the data types

Y2008 is an integer. Suppose we want to convert it to float (a numeric variable with decimals). We can write:
income.Y2008 = income.Y2008.astype(float)
income.dtypes
Index     object
State object
Y2002 int64
Y2003 int64
Y2004 int64
Y2005 int64
Y2006 int64
Y2007 int64
Y2008 float64
Y2009 int64
Y2010 int64
Y2011 int64
Y2012 int64
Y2013 int64
Y2014 int64
Y2015 int64
dtype: object

To view the dimensions or shape of the data
income.shape
 (51, 16)

51 is the number of rows and 16 is the number of columns.

You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 
income.shape[0]
income.shape[1]

To view only some of the rows

By default head( ) shows the first 5 rows. If we want to see a specific number of rows, we can mention it in the parentheses. Similarly, the tail( ) function shows the last 5 rows by default.
income.head()
income.head(2) #shows first 2 rows.
income.tail()
income.tail(2) #shows last 2 rows

Alternatively, any of the following commands can be used to fetch the first five rows.
income[0:5]
income.iloc[0:5]

Define Categorical Variable

Like the factor() function in R, we can create a categorical variable in Python using the "category" dtype.
s = pd.Series([1,2,3,1,2], dtype="category")
s
0    1
1 2
2 3
3 1
4 2
dtype: category
Categories (3, int64): [1, 2, 3]
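
You can also convert an existing column to categorical with astype("category"). For example, on the income data:
income["State"] = income["State"].astype("category")
income["State"].dtypes
It now reports a category dtype instead of object.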

Extract Unique Values

The unique() function shows the unique levels or categories in the dataset.
income.Index.unique()
array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)


The nunique( ) shows the number of unique values.
income.Index.nunique()
It returns 19, as the Index column contains 19 distinct values.

Generate Cross Tab

pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between Index and State columns.
pd.crosstab(income.Index,income.State)

Creating a frequency distribution

income.Index selects the 'Index' column of the 'income' dataset and value_counts( ) creates a frequency distribution. By default ascending = False, i.e. it shows the 'Index' value with the maximum frequency at the top.
income.Index.value_counts(ascending = True)
F    1
G 1
U 1
L 1
H 1
P 1
R 1
D 2
T 2
S 2
V 2
K 2
O 3
C 3
I 4
W 4
A 4
M 8
N 8
Name: Index, dtype: int64

To draw the samples
income.sample( ) is used to draw random samples from the dataset containing all the columns. Here n = 5 means we want 5 rows, and frac = 0.1 means we want 10 percent of the data as our sample.
income.sample(n = 5)
income.sample(frac = 0.1)
Selecting only a few of the columns
To select specific columns we use either the loc[ ] or iloc[ ] functions. The rows and columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.

Syntax of df.loc[  ]
df.loc[row_index , column_index]
income.loc[:,["Index","State","Y2008"]]
income.loc[0:2,["Index","State","Y2008"]]  #Selecting rows with Index label 0 to 2 & columns
income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
#In the above command both Index and Y2008 are included.
income.iloc[:,0:5]  #First five columns (positions 0 to 4) are included; the 6th column is not
Difference between loc and iloc

loc considers rows (or columns) with particular labels from the index, whereas iloc considers rows (or columns) at particular positions in the index, so it only takes integers.
x = pd.DataFrame({"var1" : np.arange(1,20,2)}, index=[9,8,7,6,10, 1, 2, 3, 4, 5])
    var1
9      1
8      3
7      5
6      7
10     9
1     11
2     13
3     15
4     17
5     19
iloc Code
x.iloc[:3]

Output:
    var1
9      1
8      3
7      5
loc code
x.loc[:3]

Output:
    var1
9      1
8      3
7      5
6      7
10     9
1     11
2     13
3     15
You can also use the following syntax to select specific variables.
income[["Index","State","Y2008"]]

Renaming the variables
We create a dataframe 'data' containing people's names and their respective zodiac signs.
data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
data 
       A          B
0   John      Libra
1   Mary  Capricorn
2  Julia      Aries
3  Kenny    Scorpio
4  Henry   Aquarius
If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
#Renaming all the variables.
data.columns = ['Names','Zodiac Signs']

   Names Zodiac Signs
0   John        Libra
1   Mary    Capricorn
2  Julia        Aries
3  Kenny      Scorpio
4  Henry     Aquarius
If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
#Renaming only some of the variables.
data.rename(columns = {"Names":"Cust_Name"},inplace = True)
  Cust_Name Zodiac Signs
0      John        Libra
1      Mary    Capricorn
2     Julia        Aries
3     Kenny      Scorpio
4     Henry     Aquarius
By default, inplace = False in pandas, which means no changes are made to the original dataset. Thus if we wish to alter the original dataset we need to set inplace = True.
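For instance, here is a minimal sketch of the inplace behaviour using the 'data' dataframe above (the new name 'Name' is only for illustration):
data.rename(columns = {"Cust_Name":"Name"})                 #returns a modified copy; 'data' is unchanged
data.rename(columns = {"Cust_Name":"Name"},inplace = True)  #modifies 'data' itself and returns None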

Suppose we want to replace only a particular character in the list of the column names then we can use str.replace( ) function. For example, renaming the variables which contain "Y" as "Year"
income.columns = income.columns.str.replace('Y' , 'Year ')
income.columns
Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
dtype='object')

Setting one column in the data frame as the index
Using set_index("column name") we can make that column the index; the column itself is then removed from the data.
income.set_index("Index",inplace = True)
income.head()
#Note that the index has changed and 'Index' is no longer a column
income.columns
income.reset_index(inplace = True)
income.head()
reset_index( ) restores the default integer index.

Removing the columns and rows
To drop a column we use drop( ) where the first argument is a list of columns to be removed.

By default axis = 0, which means rows are dropped by their index labels. To remove a column we need to set axis = 1.
income.drop('Index',axis = 1)

#Alternatively
income.drop("Index",axis = "columns")
income.drop(['Index','State'],axis = 1)
income.drop(0,axis = 0)
income.drop(0,axis = "index")
income.drop([0,1,2,3],axis = 0)
Also, inplace = False by default, so no alterations are made to the original dataset. axis = "columns" and axis = "index" mean that a column or a row (index label) should be removed, respectively.

Sorting the data
To sort the data, the sort_values( ) function is used. By default inplace = False and ascending = True.
income.sort_values("State",ascending = False)
income.sort_values("State",ascending = False,inplace = True)
income.Y2006.sort_values() 
We have duplicate values of Index, so we first sort the dataframe by Index and then, within each Index value, sort by Y2002:
income.sort_values(["Index","Y2002"]) 

Create new variables
Using eval( ) arithmetic operations on various columns can be carried out in a dataset.
income["difference"] = income.Y2008-income.Y2009

#Alternatively
income["difference2"] = income.eval("Y2008 - Y2009")
income.head()
  Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
0     A     Alabama  1296530  1317711  1118631  1492583  1107408  1440134
1     A      Alaska  1170302  1960378  1818085  1447852  1861639  1465841
2     A     Arizona  1742027  1968140  1377583  1782199  1102568  1109382
3     A    Arkansas  1485531  1994927  1119299  1947979  1669191  1801213
4     C  California  1685349  1675807  1889570  1480280  1735069  1812546

       Y2008    Y2009    Y2010    Y2011    Y2012    Y2013    Y2014    Y2015  \
0  1945229.0  1944173  1237582  1440756  1186741  1852841  1558906  1916661
1  1551826.0  1436541  1629616  1230866  1512804  1985302  1580394  1979143
2  1752886.0  1554330  1300521  1130709  1907284  1363279  1525866  1647724
3  1188104.0  1628980  1669295  1928238  1216675  1591896  1360959  1329341
4  1487315.0  1663809  1624509  1639670  1921845  1156536  1388461  1644607

   difference  difference2
0      1056.0       1056.0
1    115285.0     115285.0
2    198556.0     198556.0
3   -440876.0    -440876.0
4   -176494.0    -176494.0

income.ratio = income.Y2008/income.Y2009
The above command does not create a new column; pandas treats 'ratio' as an attribute rather than a column, so to create new columns we need to use square brackets.
We can also use the assign( ) function, but it does not change the original data since there is no inplace parameter. Hence we need to save the result in a new dataframe.
data = income.assign(ratio = (income.Y2008 / income.Y2009))
data.head()

Finding Descriptive Statistics
describe( ) is used to compute summary statistics such as mean, minimum, quartiles etc. for numeric variables.
income.describe() #for numeric variables
              Y2002         Y2003         Y2004         Y2005         Y2006  \
count  5.100000e+01  5.100000e+01  5.100000e+01  5.100000e+01  5.100000e+01
mean   1.566034e+06  1.509193e+06  1.540555e+06  1.522064e+06  1.530969e+06
std    2.464425e+05  2.641092e+05  2.813872e+05  2.671748e+05  2.505603e+05
min    1.111437e+06  1.110625e+06  1.118631e+06  1.122030e+06  1.102568e+06
25%    1.374180e+06  1.292390e+06  1.268292e+06  1.267340e+06  1.337236e+06
50%    1.584734e+06  1.485909e+06  1.522230e+06  1.480280e+06  1.531641e+06
75%    1.776054e+06  1.686698e+06  1.808109e+06  1.778170e+06  1.732259e+06
max    1.983285e+06  1.994927e+06  1.979395e+06  1.990062e+06  1.985692e+06

              Y2007         Y2008         Y2009         Y2010         Y2011  \
count  5.100000e+01  5.100000e+01  5.100000e+01  5.100000e+01  5.100000e+01
mean   1.553219e+06  1.538398e+06  1.658519e+06  1.504108e+06  1.574968e+06
std    2.539575e+05  2.958132e+05  2.361854e+05  2.400771e+05  2.657216e+05
min    1.109382e+06  1.112765e+06  1.116168e+06  1.103794e+06  1.116203e+06
25%    1.322419e+06  1.254244e+06  1.553958e+06  1.328439e+06  1.371730e+06
50%    1.563062e+06  1.545621e+06  1.658551e+06  1.498662e+06  1.575533e+06
75%    1.780589e+06  1.779538e+06  1.857746e+06  1.639186e+06  1.807766e+06
max    1.983568e+06  1.990431e+06  1.993136e+06  1.999102e+06  1.992996e+06

              Y2012         Y2013         Y2014         Y2015
count  5.100000e+01  5.100000e+01  5.100000e+01  5.100000e+01
mean   1.591135e+06  1.530078e+06  1.583360e+06  1.588297e+06
std    2.837675e+05  2.827299e+05  2.601554e+05  2.743807e+05
min    1.108281e+06  1.100990e+06  1.110394e+06  1.110655e+06
25%    1.360654e+06  1.285738e+06  1.385703e+06  1.372523e+06
50%    1.643855e+06  1.531212e+06  1.580394e+06  1.627508e+06
75%    1.866322e+06  1.725377e+06  1.791594e+06  1.848316e+06
max    1.988270e+06  1.994022e+06  1.990412e+06  1.996005e+06
For character or string variables, you can write include = ['object']. It will return the total count, the number of unique values, the most frequently occurring string and its frequency.
income.describe(include = ['object'])  #Only for strings / objects
To find specific descriptive statistics for each column of the data frame:
income.mean()
income.median()
income.agg(["mean","median"])

Mean, median, maximum and minimum can be obtained for a particular column(s) as:
income.Y2008.mean()
income.Y2008.median()
income.Y2008.min()
income.loc[:,["Y2002","Y2008"]].max()

GroupBy function

To group the data by a categorical variable we use the groupby( ) function, which lets us perform operations on each category. The agg( ) function is used to aggregate the data.

The following command finds minimum and maximum values for Y2002 and only mean for Y2003
income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
          Y2002                Y2003
            min      max        mean
Index
A       1170302  1742027  1810289.000
C       1343824  1685349  1595708.000
D       1111437  1330403  1631207.000
F       1964626  1964626  1468852.000
G       1929009  1929009  1541565.000
H       1461570  1461570  1200280.000
I       1353210  1776918  1536164.500
K       1509054  1813878  1369773.000
L       1584734  1584734  1110625.000
M       1221316  1983285  1535717.625
N       1395149  1885081  1382499.625
O       1173918  1802132  1569934.000
P       1320191  1320191  1446723.000
R       1501744  1501744  1942942.000
S       1159037  1631522  1477072.000
T       1520591  1811867  1398343.000
U       1771096  1771096  1195861.000
V       1134317  1146902  1498122.500
W       1677347  1977749  1521118.500
In order to rename the columns after groupby, you can use tuples of (new_name, aggregation function). See the code below.

income.groupby("Index").agg({"Y2002" : [("Y2002_min","min"),("Y2002_max","max")],
"Y2003" : [("Y2003_mean","mean")]})
Renaming columns can also be done via the method below.

dt = income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
dt.columns = ['Y2002_min', 'Y2002_max', 'Y2003_mean']
Groupby more than 1 column

income.groupby(["Index", "State"]).agg({"Y2002": ["min","max"],"Y2003" : "mean"})

Filtering
To filter only those rows which have Index as "A" we write:
income[income.Index == "A"]

#Alternatively
income.loc[income.Index == "A",:]
  Index     State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
0     A   Alabama  1296530  1317711  1118631  1492583  1107408  1440134
1     A    Alaska  1170302  1960378  1818085  1447852  1861639  1465841
2     A   Arizona  1742027  1968140  1377583  1782199  1102568  1109382
3     A  Arkansas  1485531  1994927  1119299  1947979  1669191  1801213

     Y2008    Y2009    Y2010    Y2011    Y2012    Y2013    Y2014    Y2015
0  1945229  1944173  1237582  1440756  1186741  1852841  1558906  1916661
1  1551826  1436541  1629616  1230866  1512804  1985302  1580394  1979143
2  1752886  1554330  1300521  1130709  1907284  1363279  1525866  1647724
3  1188104  1628980  1669295  1928238  1216675  1591896  1360959  1329341
To select the States having Index as "A":
income.loc[income.Index == "A","State"]
income.loc[income.Index == "A",:].State
To filter the rows with Index as "A" and income for 2002 greater than 1500000:
income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]
To filter the rows with index either "A" or "W", we can use isin( ) function:
income.loc[(income.Index == "A") | (income.Index == "W"),:]

#Alternatively.
income.loc[income.Index.isin(["A","W"]),:]
   Index          State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
0      A        Alabama  1296530  1317711  1118631  1492583  1107408  1440134
1      A         Alaska  1170302  1960378  1818085  1447852  1861639  1465841
2      A        Arizona  1742027  1968140  1377583  1782199  1102568  1109382
3      A       Arkansas  1485531  1994927  1119299  1947979  1669191  1801213
47     W     Washington  1977749  1687136  1199490  1163092  1334864  1621989
48     W  West Virginia  1677347  1380662  1176100  1888948  1922085  1740826
49     W      Wisconsin  1788920  1518578  1289663  1436888  1251678  1721874
50     W        Wyoming  1775190  1498098  1198212  1881688  1750527  1523124

      Y2008    Y2009    Y2010    Y2011    Y2012    Y2013    Y2014    Y2015
0   1945229  1944173  1237582  1440756  1186741  1852841  1558906  1916661
1   1551826  1436541  1629616  1230866  1512804  1985302  1580394  1979143
2   1752886  1554330  1300521  1130709  1907284  1363279  1525866  1647724
3   1188104  1628980  1669295  1928238  1216675  1591896  1360959  1329341
47  1545621  1555554  1179331  1150089  1775787  1273834  1387428  1377341
48  1238174  1539322  1539603  1872519  1462137  1683127  1204344  1198791
49  1980167  1901394  1648755  1940943  1729177  1510119  1701650  1846238
50  1587602  1504455  1282142  1881814  1673668  1994022  1204029  1853858
Alternatively we can use query( ) function and write our filtering criteria:
income.query('Y2002>1700000 & Y2003 > 1500000')
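As a side note (a small sketch; 'threshold' is just an illustrative variable name), query( ) can also reference Python variables by prefixing them with @:
threshold = 1700000
income.query('Y2002 > @threshold')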

Dealing with missing values
We create a new dataframe named 'crops'. To create a NaN value we use np.nan after importing numpy.
import numpy as np
mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
        'Yield': [1010, 1025.2, 1404.2, 1251.7],
        'cost' : [102, np.nan, 20, 68]}
crops = pd.DataFrame(mydata)
crops
isnull( ) returns True where a value is NaN, while notnull( ) returns False for those positions.
crops.isnull()  #same as is.na in R
crops.notnull()  #opposite of previous command.
crops.isnull().sum()  #No. of missing values.
crops.cost.isnull() first selects the 'cost' column from the dataframe and then returns a logical vector using isnull()

crops[crops.cost.isnull()] #shows the rows with NAs.
crops[crops.cost.isnull()].Crop #shows the Crop values of the rows with NAs
crops[crops.cost.notnull()].Crop #shows the Crop values of the rows without NAs
To drop all the rows that have a missing value in any column we use dropna(how = "any"). By default inplace = False. how = "all" means a row is dropped only if all the elements in that row are missing.

crops.dropna(how = "any").shape
crops.dropna(how = "all").shape  
To remove NaNs if any of 'Yield' or 'cost' are missing we use the subset parameter and pass a list:
crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
To replace the missing values in the 'cost' column with "UNKNOWN" we use fillna( ):
crops['cost'].fillna(value = "UNKNOWN",inplace = True)
crops

Dealing with duplicates
We create a new dataframe comprising of items and their respective prices.
data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
data
             Items  Price
0               TV  10000
1  Washing Machine  50000
2           Mobile  20000
3               TV  10000
4               TV  10000
5  Washing Machine  40000
duplicated() returns a logical vector that is True whenever a duplicated row is encountered.
data.loc[data.duplicated(),:]
data.loc[data.duplicated(keep = "first"),:]
By default keep = 'first', i.e. the first occurrence is considered a unique value and its repetitions are considered duplicates.
If keep = "last", the last occurrence is considered a unique value and all its earlier repetitions are considered duplicates.
data.loc[data.duplicated(keep = "last"),:] #the last occurrences are not shown; note that the indices differ.
If keep = False (the boolean, not a string), all occurrences of the repeated observations are considered duplicates.
data.loc[data.duplicated(keep = False),:]  #every copy of each duplicated row is shown.
To drop the duplicates drop_duplicates( ) is used, with inplace = False by default; keep = 'first', 'last' or False has the same meaning as in duplicated( ).
data.drop_duplicates(keep = "first")
data.drop_duplicates(keep = "last")
data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False
data

Creating dummies
Now we will consider the iris dataset
iris = pd.read_csv("C:\\Users\\Hp\\Desktop\\work\\Python\\Basics\\pandas\\iris.csv")
iris.head()
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
The map( ) function matches the values and replaces them, automatically creating a new series.
iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
iris.head()
To create dummies, get_dummies( ) is used. The argument prefix = "Species" adds the prefix 'Species' to the names of the new columns.
pd.get_dummies(iris.Species,prefix = "Species")
pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #1 is not included
species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]
With concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.
iris = pd.concat([iris,species_dummies],axis = 1)
iris.head()
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  \
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

   Species_setosa  Species_versicolor  Species_virginica
0               1                   0                  0
1               1                   0                  0
2               1                   0                  0
3               1                   0                  0
4               1                   0                  0
It is usual that for a variable with 'n' categories we create 'n-1' dummies; thus, to drop the first dummy column we write drop_first = True.
pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()

Ranking
 To create a dataframe of all the ranks we use rank( )
iris.rank() 
Ranking by a specific variable
Suppose we want to rank the Sepal.Length for different species in ascending order:
iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)
iris.head()

Calculating the Cumulative sum
Using cumsum( ) function we can obtain the cumulative sum
iris['cum_sum'] = iris["Sepal.Length"].cumsum()
iris.head()
Cumulative sum by a variable
To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )
iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
iris.head()

Calculating the percentiles.
Various quantiles can be obtained by using quantile( )
iris.quantile(0.5)
iris.quantile([0.1,0.2,0.5])
iris.quantile(0.55)

if else in Python
We create a new dataframe of students' names and their respective zodiac signs.
students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                         'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
def name(row):
    if row["Names"] in ["John","Henry"]:
        return "yes"
    else:
        return "no"

students['flag'] = students.apply(name, axis=1)
students
Functions in Python are defined using the keyword def, followed by the function's name. The apply( ) function applies a function along the rows or columns of a dataframe; axis=1 applies it to each row.

Note: When using a simple 'if else' we need to take care of the indentation. Python does not use curly braces for loops or if-else blocks; a small sketch follows the output below.

Output
      Names Zodiac Signs flag
0      John     Aquarius  yes
1      Mary        Libra   no
2     Henry       Gemini  yes
3  Augustus       Pisces   no
4     Kenny        Virgo   no
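As promised above, here is a minimal sketch showing that indentation alone defines the blocks (the variable and values are arbitrary):
x = 10
if x > 5:
    print("big")     # indented lines belong to the if-block
else:
    print("small")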

Alternatively, by importing numpy we can use np.where. The first argument is the condition to be evaluated, the second argument is the value if the condition is True, and the last argument defines the value if the condition evaluates to False.
import numpy as np
students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
students

Multiple Conditions : If Else-if Else
def mname(row):
    if row["Names"] == "John" and row["Zodiac Signs"] == "Aquarius":
        return "yellow"
    elif row["Names"] == "Mary" and row["Zodiac Signs"] == "Libra":
        return "blue"
    elif row["Zodiac Signs"] == "Pisces":
        return "blue"
    else:
        return "black"

students['color'] = students.apply(mname, axis=1)
students

We create a list of conditions and their respective values when evaluated True, and use np.select, where default is the value used when all the conditions are False.
conditions = [
    (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
    (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
    (students['Zodiac Signs'] == 'Pisces')]
choices = ['yellow', 'blue', 'purple']
students['color'] = np.select(conditions, choices, default='black')
students
      Names Zodiac Signs flag   color
0      John     Aquarius  yes  yellow
1      Mary        Libra   no    blue
2     Henry       Gemini  yes   black
3  Augustus       Pisces   no  purple
4     Kenny        Virgo   no   black

Select numeric or categorical columns only
To include numeric columns we use select_dtypes( ) 
data1 = iris.select_dtypes(include=[np.number])
data1.head()
The (private) _get_numeric_data( ) method also provides a utility to select only the numeric columns.
data3 = iris._get_numeric_data()
data3.head(3)
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  cum_sum  cumsum2
0           5.1          3.5           1.4          0.2      5.1      5.1
1           4.9          3.0           1.4          0.2     10.0     10.0
2           4.7          3.2           1.3          0.2     14.7     14.7
For selecting categorical variables
data4 = iris.select_dtypes(include = ['object'])
data4.head(2)
  Species
0  setosa
1  setosa

Concatenating
We create 2 dataframes containing the details of the students:
students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                         'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                          'Marks' : [50,81,98,25,35]})
Using the pd.concat( ) function we can join the 2 dataframes:
data = pd.concat([students,students2])  #by default axis = 0
   Marks     Names Zodiac Signs
0    NaN      John     Aquarius
1    NaN      Mary        Libra
2    NaN     Henry       Gemini
3    NaN  Augustus       Pisces
4    NaN     Kenny        Virgo
0   50.0      John          NaN
1   81.0      Mary          NaN
2   98.0     Henry          NaN
3   25.0  Augustus          NaN
4   35.0     Kenny          NaN
By default axis = 0, so the dataframes are stacked row-wise. If a column is not present in one of the dataframes, NaNs are created. To join column-wise we set axis = 1.
data = pd.concat([students,students2],axis = 1)
data
      Names Zodiac Signs  Marks     Names
0      John     Aquarius     50      John
1      Mary        Libra     81      Mary
2     Henry       Gemini     98     Henry
3  Augustus       Pisces     25  Augustus
4     Kenny        Virgo     35     Kenny
Using the append( ) function we can join the dataframes row-wise:
students.append(students2)  #for rows
Alternatively, we can create a dictionary of the two dataframes and use pd.concat to join them row-wise; the dictionary keys become an outer index level:
classes = {'x': students, 'y': students2}
 result = pd.concat(classes)
result 
     Marks     Names Zodiac Signs
x 0    NaN      John     Aquarius
  1    NaN      Mary        Libra
  2    NaN     Henry       Gemini
  3    NaN  Augustus       Pisces
  4    NaN     Kenny        Virgo
y 0   50.0      John          NaN
  1   81.0      Mary          NaN
  2   98.0     Henry          NaN
  3   25.0  Augustus          NaN
  4   35.0     Kenny          NaN

Merging or joining on the basis of a common variable
We take 2 dataframes with different numbers of observations:
students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                         'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                          'Marks' : [50,81,98,25,35]})
Using pd.merge we can join the two dataframes. on = 'Names' specifies that 'Names' is the common variable on the basis of which the dataframes are to be combined.
result = pd.merge(students, students2, on='Names')  #it only takes intersections
result
   Names Zodiac Signs  Marks
0   John     Aquarius     50
1   Mary        Libra     81
2  Henry       Gemini     98
By default how = "inner", so it takes only the elements common to both dataframes. If you want all the elements of both dataframes, set how = "outer":
result = pd.merge(students, students2, on='Names',how = "outer")  #it takes the union
result
      Names Zodiac Signs  Marks
0      John     Aquarius   50.0
1      Mary        Libra   81.0
2     Henry       Gemini   98.0
3     Maria    Capricorn    NaN
4  Augustus          NaN   25.0
5     Kenny          NaN   35.0
To keep all the rows of the left dataframe (plus the matching rows from the right one) set how = 'left':
result = pd.merge(students, students2, on='Names',how = "left")
result
   Names Zodiac Signs  Marks
0   John     Aquarius   50.0
1   Mary        Libra   81.0
2  Henry       Gemini   98.0
3  Maria    Capricorn    NaN
Similarly, how = 'right' keeps all the rows of the right dataframe (plus the matching rows from the left one).
result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
result
      Names Zodiac Signs  Marks      _merge
0      John     Aquarius     50        both
1      Mary        Libra     81        both
2     Henry       Gemini     98        both
3  Augustus          NaN     25  right_only
4     Kenny          NaN     35  right_only
indicator = True creates a column (_merge) indicating whether each row is present in both dataframes or only in the left or right one.

April 22, 2019 05:07 AM UTC


Mike Driscoll

PyDev of the Week: Dane Hillard

This week we welcome Dane Hillard (@easyaspython) as our PyDev of the Week! Dane is the author of Practices of the Python Pro, an upcoming book from Manning. He is also a blogger and web developer. Let’s take some time to get to know Dane!

Can you tell us a little about yourself (hobbies, education, etc):

I’m a creative type, so many of my interests are in art and music. I’ve been a competitive ballroom dancer, and I’m a published musician and photographer. I’m proud of those accomplishments, but I’m driven to do most of this stuff for personal fulfillment more than anything! I enjoy sharing and discussing what I learn with others, too. When I have some time my next project is to start exploring foodways, which is this idea of exploring food and its cultural impact through written history. I’ve loved cooking (and food in general) for a long time and I want to get to know its origins better, which I think is something this generation is demanding more from industries as a whole. Should be fun!

Why did you start using Python?

I like using my computer engineering skills to build stuff not just for work, but for myself. I had written a website for my photography business in PHP way back in the day, but I wasn’t using a framework of any kind and the application code was mixed with the front-end code in a way that was hard to manage. I decided to try out a framework, and after using (and disliking) Java Spring for a while I gave Django a try. The rest is history! I started using Python for a few work-related things at the time and saw that it adapted well to many different types of tasks, so I kept rolling with it.

What other programming languages do you know and which is your favorite?

I use JavaScript fairly regularly, though it wasn’t until jQuery gave way to reactive paradigms that I really started enjoying it. We’re using React and Vue frequently now and I like it quite a bit for client-side development. I’ve also used Ruby in the past, which I find to be quite Python-like in certain ways. I think I still like Python best, but it’s easy to stick with what you know, right? I wouldn’t mind learning some Rust or Go soon! My original background is mainly in C and C++ but I can barely manage the memory in my own head so I don’t like telling a computer how to manage its memory when I can avoid it, but all these languages have their place.

What projects are you working on now?

At ITHAKA we’ve been managing an open source Python REST client, apiron, for a while now. We just released a feature where I got to explore some metaprogramming, which was stellar. It ended up reducing boilerplate people have to write, which is also stellar. I also built a new website as a bit of a portfolio and to centralize some of my online presence. It’s written in Vue, but was my first chance to explore vue-router and couple of other libraries, along with a headless CMS for blogging.

The biggest amount of my free time definitely goes to thinking about and writing the book I’m working on, which introduces people new to software development to some concepts important in collaborative software, in the context of Python. I’m hoping it will help people just graduating, switching disciplines, or who want to augment their work with software! The book is in early access and I’m chugging away on new chapters as we speak.

Which Python libraries are your favorite (core or 3rd party)?

The requests library is one of the more ubiquitous libraries, and it’s what we built apiron on top of. I’ve started using pytest a bit in place of Python’s built-in unittest, and I like the ways it simplifies existing tests while also providing tooling for doing more complex things with fixtures. There’s a great package, zappa, for deploying Django apps (or anything WSGI-based, I believe) to AWS Lambda. Look into that if you’re spending too much on an EC2 instance! For image manipulation, Pillow is great. One that I’d like to try out more soon is surprise, which helps you make recommendation systems akin to what Netflix or Hulu uses to recommend movies. Too many others to name here!

How did you come to author a book?

I don’t know how it works for most authors, but in my case the publisher, Manning, reached out to me—probably after seeing the blog posts I’ve written online. Presented with the opportunity, it was difficult to figure out if I really felt ready or qualified to do a book, which I still ask myself often if I’m being honest. I try to frame it to myself as an opportunity to help others, so even if I don’t produce something perfect I hope that I’ll still be able to say I did that much!

What challenges did you have writing the book and how did you overcome them?

Finding time and balancing it with other priorities is the primary struggle for me, as I imagine it is for many authors. The uncertainty I mentioned earlier is another one. Something that surprised me was how easy it is to use overloaded terms in the context of programming; many concepts have similar names and many English words can be ambiguous for untrained readers! My editor fortunately keeps these at bay, but I slip up often! Teaching is hard. The best way I’ve found to mitigate issues like this is to automate where I can.

Is there anything else you’d like to say?

If you’re out there thinking about getting into programming or writing a book or anything really, and you’re fortunate to have the means to do so, get to it! I’ve found that I don’t know how I feel about something until I really examine it, flip a few switches, find out how it works under the hood. Sometimes you’ll find you don’t like something as much as you thought, but maybe it uncovers tangentially-related things you want to explore. The most important part is getting started!

Thanks for doing the interview, Dane!

April 22, 2019 05:05 AM UTC

April 21, 2019


The Code Bits

Flask Project for Beginners: Inspirational Quotes

In this project, we will create a web application that displays a random inspirational quote.

The goal of this project is to learn about application factory and how to create templates in Flask.

This is the second part of the “Getting started with Flask” series. In the first part, we learned how to create a basic Hello World application using Flask and run it locally.

Installation and Setup

First, let us install the dependencies and set up our project directory. Basically what we need to do is:
  • Install Python. We will be using Python3 here.
  • Create a directory for our project, say “1_inspirational_quotes”, and go inside the directory.
    mkdir 1_inspirational_quotes
    cd 1_inspirational_quotes
  • We will create a virtual environment for our project where we will install Flask and any other dependencies. So go ahead and create a virtual environment and activate it.
    python3 -m venv venv
    . venv/bin/activate
  • Finally install Flask.
    pip3 install flask
If you need more instructions on installation, refer to Flask installation guide or Getting started with Flask.

Set up the Application Factory

Now that we have set up our project directory, the next thing we need to do is create the Flask application, which is nothing but an instance of the Flask class.

We could create the Flask instance globally, the way we did in Getting Started with Flask: Hello World. However, in this example, we will create it within a function.

Application Factory is the term used to refer to the method inside which we will create our application (Flask instance). All sorts of configuration and setup required for the application will also be done within the application factory. Finally it will return the Flask application instance.

So let us create a directory ‘quotes’ and add an __init__.py file to it. This makes Python treat the directory as a package.

mkdir quotes
cd quotes
touch __init__.py

Then let us define our application factory in this file.

1_inspirational_quotes/quotes/__init__.py

from flask import Flask

def create_app():
    """
    create_app is the application factory.
    """
    # Create the app.
    app = Flask(__name__)

    # Add a route.
    @app.route('/')
    def home():
        return 'Hello there! :)'

    # Return the app.
    return app


Run the basic application

In order to make sure that everything is set up correctly, let us run the application and see if it is working.

First, let us set the FLASK_APP environment variable to be our application package name. This basically tells Flask which application to run.

export FLASK_APP=quotes

We will also set the environment variable FLASK_ENV to development so that:

  1. debug mode is turned on and the debugger is activated.
  2. the server will be restarted whenever we make a code change. We can make modifications to our code and simply refresh the browser to see the changes in effect.
export FLASK_ENV=development

Note: If you are on Windows, use set  instead of export.

Now we are ready to run the application. So go ahead and run it using the flask command. You should see an output similar to the following.

flask run
 * Serving Flask app "quotes" (lazy loading)
 * Environment: development
 * Debug mode: on
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 150-101-403

Note: Make sure that you are running the command from the ‘1_inspirational_quotes’ directory and not ‘quotes’. Otherwise, you will see the error “flask.cli.NoAppException: Could not import “quotes.quotes”.”

To see the app in action, go to http://127.0.0.1:5000/ on your browser. You should see our message displayed in it as shown in the following image.

Awesome! Now let us start building our quotes app.

Add a template

Currently, our app just displays the string, “Hello there! :)” to the user. In this section, we will learn how to create a template that shows a random inspirational quote.

Return HTML content from the application factory

The simplest way to achieve this is to return the HTML code as a string instead of our hello world string as shown below:

1_inspirational_quotes/quotes/__init__.py

from flask import Flask

def create_app():
    """
    create_app is the application factory.
    """
    # Create the app.
    app = Flask(__name__)

    # Add a route.
    @app.route('/')
    def home():
        return '''
<html>
<body>
  I find that the harder I work, the more luck I seem to have. – Thomas Jefferson
</body>
</html>
'''

    # Return the app.
    return app

Now if you go to http://127.0.0.1:5000/, you should see the quote displayed on the screen:

Even though this works perfectly fine, this is not the best approach to serve HTML content for our application. First of all, the code does not look clean. Second, as our application grows, modifying and maintaining the template within the application factory will be tedious. So we need to isolate our template from the application factory.

Create a static HTML template file

A template is a file that contains static data as well as placeholders for dynamic data. In this section, we will just be creating static HTML template that displays a quote to our user. In a later section, we will see how to make it dynamic.

Within the quotes directory, let us add a directory to keep our templates and move our quotes template to a separate HTML file.

mkdir templates
touch templates/quotes.html

Note that our template is stored within a directory named templates under the application directory, quotes. This is where Flask expects its templates by default.
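As an aside (not needed for this project), the default template directory can be overridden when creating the Flask instance; 'my_templates' below is just an illustrative name:

app = Flask(__name__, template_folder='my_templates')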

1_inspirational_quotes/quotes/templates/quotes.html

<!doctype html>
<html>
<body>
  I find that the harder I work, the more luck I seem to have. – Thomas Jefferson
</body>
</html>

Register the template with the application factory

Now we need to modify our application factory such that this HTML file is served when users visit our web page.

1_inspirational_quotes/quotes/__init__.py

from flask import Flask, render_template

def create_app():
    """
    create_app is the application factory.
    """
    # Create the app.
    app = Flask(__name__)

    # Add a route.
    @app.route('/')
    def home():
        return render_template('quotes.html')

    # Return the app.
    return app

Note how we introduced the method render_template(). In this case, it takes our HTML file name and returns the rendered contents. Later on, when we learn about serving dynamic content, we will learn more about rendering and how Flask uses Jinja for template rendering.

Now if we go to http://127.0.0.1:5000/, we should see the quote displayed on the screen just as we saw earlier.

Update the template to render quotes dynamically using Jinja

Now that we have learned how to create a template and register it with the application factory, let us see how we can serve content dynamically.

Right now our app just displays the same quote every time someone visits. Our goal is to dynamically update the quote by selecting one randomly from a set of quotes.

First, let us go ahead and create a list of quotes. To keep things simple, we will be adding it in memory within the application factory. In a later post, we will explore how to use databases with Flask.

1_inspirational_quotes/quotes/__init__.py

from flask import Flask, render_template
import random

def create_app():
    """
    create_app is the application factory.
    """
    # Create the app.
    app = Flask(__name__)

    # Add a route.
    @app.route('/')
    def home():
        sample_quotes = [
            "I find that the harder I work, the more luck I seem to have. – Thomas Jefferson",
            "Success is the sum of small efforts, repeated day in and day out. – Robert Collier",
            "There are no shortcuts to any place worth going. – Beverly Sills",
            "The only place where success comes before work is in the dictionary. – Vidal Sassoon",
            "You don’t drown by falling in the water; you drown by staying there. – Ed Cole"
        ]

        # Select a random quote.
        selected_quote = random.choice(sample_quotes)

        # Pass the selected quote to the template.
        return render_template('quotes.html', quote=selected_quote)

    # Return the app.
    return app

As you can see, now we are passing an additional parameter, quote, to the render_template function. Flask uses Jinja to render dynamic content in the template. With this change, the variable, quote, becomes available in the template, quotes.html. Now let us see how we can update the template file to make use of this variable.

1_inspirational_quotes/quotes/templates/quotes.html

<!doctype html>
<html>
<body>
  {{ quote }}
</body>
</html>

Here, {{..}} is the delimiter used by Jinja to denote expressions which will be evaluated and rendered in the final HTML document.
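To see what render_template does with this delimiter under the hood, here is a small standalone sketch using the jinja2 library directly (the quote string is just an example):

from jinja2 import Template

template = Template("<body>{{ quote }}</body>")
print(template.render(quote="There are no shortcuts to any place worth going."))
# prints: <body>There are no shortcuts to any place worth going.</body>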

Now if we go to http://127.0.0.1:5000/ and keep refreshing the page, we should see a different random quote selected from the list every time. A demo is shown below:

Add a stylesheet

As of now, our app works, but it looks very plain. So now we will see how to add a simple stylesheet to it.

In Flask, just like templates are expected to be in the templates directory by default, static files like CSS stylesheets are expected to be in the static directory within the application folder.

So go ahead and create the directory and add a CSS file style.css to it.

mkdir static
touch static/style.css

1_inspirational_quotes/quotes/static/style.css

body {
  background-color: black;
  background-image: url("background.jpg");
  background-size:cover;
}
.quote_div {
  text-align: center;
  color: white;
  font-size: 30px;
  padding: 25px 5px;
  margin: 15% auto auto;
}

You can also add a background image and keep it under the static directory as shown above.

Now let us modify the template to use the stylesheet.

1_inspirational_quotes/quotes/templates/quotes.html

<!doctype html>
<html>
<head>
  <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
</head>
<body>
  <div class="quote_div">
    {{ quote }}
  </div>
</body>
</html>

Now if we go to http://127.0.0.1:5000/, we will see a nicer app!

A final demo is shown in the following video:

Conclusion

In case you want to browse through the code or download and try it out, the Github link for this project is here.

In this post, we learned how to create a basic Flask application that serves dynamic data using a Jinja template.

For more advanced lessons with projects, stay tuned and subscribe to our blog!

April 21, 2019 11:47 PM UTC

Getting started with Raspberry Pi and Python

Hello there! So you just got a shiny new Raspberry Pi. Well done and congrats! In this tutorial, we are going to look at how we can set up your Raspberry Pi and get it up and running.

What you need to get started

First and foremost we have to ensure that we have all the required items to get started. Here is a short list of items that are absolutely required to get the ball rolling.


Optional

You can also get these as a bundle (buy from Amazon) or if you have some of the components lying around, then you could selectively get them as needed.

Getting your OS ready

Raspberry Pi is a miniature computer and, just like your laptop and desktop PC, it needs to run an Operating System. Since Raspberry Pi runs an ARM-based processor (if you haven’t heard of ARM processors, they are low-power processors commonly found in mobile phones and tablets), we need to use a supporting version of an OS. Luckily, both the Raspberry Pi Foundation and members of the Linux community have created many operating systems that you can choose from to run on your board! In this tutorial, we are going to use the NOOBS distribution, which you can find here.

Once you go to the above link, choose NOOBS; you will be given two options, select the full NOOBS version. This is a large file and the download might take some time. Be patient!

In the meanwhile, we need another piece of software that helps us format and transfer our OS to our microSD card. You can find the software here.

Once the SDFormatter tool is installed, and the OS image is downloaded, we are ready to go!

The first step is to unzip the NOOBS file that you have downloaded. You will get a NOOBS_v3_0_0 folder or something similar depending on the OS version. Open up the SDFormatter tool and insert your microSD card into your PC. You can either use a USB microSD adapter like this or use the one that’s in your laptop / PC.

In the SDFormatter tool, select the drive that corresponds to your SD card. Make sure this is correct!


IMPORTANT: Please back up any data before formatting!

Once the formatting is complete, you can copy the contents of the NOOBS folder onto your formatted SD card.


Connecting all pieces and booting up!

All right, we’re almost there! Now we need to connect our Raspberry Pi to our peripherals and boot it up for the first time.

Here, I have used a USB keyboard and a wireless Logitech mouse with a wireless USB adapter. I have plugged in a USB Wi-Fi adapter to connect to my network and an HDMI cable to connect to my monitor. The overall setup, after connecting everything including power, looks something like this.


Setting up the OS

Once we have the setup ready, let’s connect the board to the peripherals and power and boot it up for the first time. Initially, you will see a dialog box to select the operating system.

Select Raspbian Full and click the install button. The installation dialog will follow afterward. It will copy the files and install the OS.

Once the OS is installed, the system will reboot and you will be welcomed by the new desktop! Next, you will be prompted to set up your account password, keyboard profile, and timezone, as well as connect to a Wi-Fi endpoint. That is pretty much it!

Hello world in Python

Fire up the terminal by going to the menu button on the top left and selecting the terminal. Open up a Python shell by entering

python

This opens up a Python shell. Say hello world from your shiny new Raspberry Pi! (The parenthesized form below works in both Python 2 and Python 3.)

print("Hello world!")

Wrap up

So that’s it! We’ve made it to the end and we have a Raspberry Pi up and running! Woohoo! Now is the fun part, which is all the fun things we could do with it. I’ll be doing some interesting projects that you could follow along on this blog going forward. Make sure to subscribe to thecodebits.com to receive updates. See you all soon!

April 21, 2019 10:52 PM UTC


Talk Python to Me

#208 Packaging, Making the most of PyCon, and more

Are you going to PyCon (or a similar conference)? Join me and Kenneth Reitz as we discuss how to make the most of PyCon and what makes it special for each of us.

April 21, 2019 08:00 AM UTC

April 20, 2019


Weekly Python StackOverflow Report

(clxxiv) stackoverflow python report

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2019-04-20 22:00:34 GMT


  1. Why does Python start at index -1 (as opposed to 0) when indexing a list from the end? - [63/7]
  2. How to remove list items depending on predecessor in python - [11/7]
  3. How does Python know the values already stored in its memory? - [10/3]
  4. Century handling in Pandas - [9/5]
  5. How do I check if current code is part of a try-except-block? - [9/2]
  6. Python sets versus arrays - [9/1]
  7. Shared python generator - [8/1]
  8. Convert pandas column of lists into matrix representation (One Hot Encoding) - [6/2]
  9. Python Flask as Windows Service - [6/2]
  10. Count of values grouped per month, year - Pandas - [6/2]

April 20, 2019 10:01 PM UTC

April 19, 2019


Doug Hellmann

imapautofiler 1.8.0

imapautofiler applies user-defined rules to automatically organize messages on an IMAP server. What’s new in 1.8.0?
  • use yaml safe loader
  • drop python 3.5 and add 3.7 support
  • perform substring matches without regard to case

April 19, 2019 06:47 PM UTC


ListenData

NumPy Tutorial with Exercises

NumPy (acronym for 'Numerical Python' or 'Numeric Python') is one of the most essential packages for speedy mathematical computation on arrays and matrices in Python. It is also quite useful while dealing with multi-dimensional data, and it makes it easy to integrate C, C++ and FORTRAN tools. It also provides numerous functions for Fourier transform (FT) and linear algebra.


Why NumPy instead of lists?

One might wonder why one should prefer NumPy arrays when we can create lists having the same data type. If this thought rings a bell, the following reasons may convince you:
  1. NumPy arrays have contiguous memory allocation, so the same data stored as a list requires more space than a NumPy array.
  2. They are faster to work with and hence more efficient than lists.
  3. They are more convenient to deal with.

    NumPy vs. Pandas

    Pandas is built on top of NumPy. In other words, NumPy is required by pandas to work, so pandas is not an alternative to NumPy. Instead, pandas offers additional methods and a more streamlined way of working with numerical and tabular data in Python.
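    A quick sketch of that relationship: a pandas column is backed by a NumPy array.
    import pandas as pd
    s = pd.Series([1, 2, 3])
    type(s.values)    # <class 'numpy.ndarray'>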

    Importing numpy
    Firstly you need to import the numpy library. Importing numpy can be done by running the following command:
    import numpy as np
    It is the general convention to import numpy with the alias 'np'. If no alias is provided then, to access a function from numpy, we would have to write numpy.function. To make this easier the alias 'np' is introduced, so we can write np.function. Some of the common functions of numpy are listed below -

    Function   Task
    array      Create numpy array
    ndim       Dimension of the array
    shape      Size of the array (number of rows and columns)
    size       Total number of elements in the array
    dtype      Type of elements in the array, i.e., int64, character
    reshape    Reshapes the array without changing the original shape
    resize     Reshapes the array; also changes the original shape
    arange     Create a sequence of numbers in an array
    itemsize   Size in bytes of each item
    diag       Create a diagonal matrix
    vstack     Stacking vertically
    hstack     Stacking horizontally
    1D array
    Using numpy an array is created by using np.array:
    a = np.array([15,25,14,78,96])
    a
    print(a)
    a
    Output: array([15, 25, 14, 78, 96])

    print(a)
    Output: [15 25 14 78 96]
    Notice that square brackets are present inside np.array: it expects a single sequence, so omitting the brackets raises an error. To print the array we can use print(a).

    Changing the datatype
    np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.
    a.dtype
    a = np.array([15,25,14,78,96],dtype = "float")
    a
    a.dtype
    Initially the datatype of 'a' was 'int32'; after the modification it becomes 'float64'.

    1. int32 refers to a number without a decimal point. '32' means the number can be between -2147483648 and 2147483647. Similarly, int16 implies the number can be in the range -32768 to 32767 (these ranges are verified in the sketch below the list).
    2. float64 refers to a number with decimal places.
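    As promised, the ranges above can be checked with np.iinfo (a quick sketch):
    np.iinfo(np.int32)    # iinfo(min=-2147483648, max=2147483647, dtype=int32)
    np.iinfo(np.int16)    # iinfo(min=-32768, max=32767, dtype=int16)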


    Creating the sequence of numbers
    If you want to create a sequence of numbers, you can get it using np.arange. To get the sequence of numbers from 20 to 29 we run the following command.
    b = np.arange(start = 20,stop = 30, step = 1)
    b
    array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
    In np.arange the end point is always excluded.

    np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

    Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, up to 30; 30 being excluded.
    c = np.arange(20,30,2) #30 is excluded.
    c
    array([20, 22, 24, 26, 28])
    Remember that in np.arange( ) the stop argument is always excluded.

    Indexing in arrays
    It is important to note that Python indexing starts from 0. The syntax of indexing is as follows -
    1. x[start:end:step] : Elements from start through end (the end is excluded), taking every step-th element.
    2. x[start:end] : Elements from start through end (the end is excluded), with the default step of 1.
    3. x[start:] : Elements from start through the end of the array.
    4. x[:end] : Elements from the beginning through end (the end is excluded).

    If we want to extract the 3rd element we write the index as 2, since indexing starts from 0.
    x = np.arange(10)
    x[2]
    x[2:5]
    x[::2]
    x[1::2]
    x
    Output: [0 1 2 3 4 5 6 7 8 9]

    x[2]
    Output: 2

    x[2:5]
    Output: array([2, 3, 4])

    x[::2]
    Output: array([0, 2, 4, 6, 8])

    x[1::2]
    Output: array([1, 3, 5, 7, 9])

    Note that in x[2:5] the elements from index 2 up to index 5 (exclusive) are selected.
    If we want to set every 3rd element from the beginning up to index 7 (excluding 7) to the value 123, we write:
    x[:7:3] = 123
    x
     array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])
    To reverse a given array we write:
    x = np.arange(10)
    x[ : :-1] # reversed x
    array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
    Note that the above command does not modify the original array.

    Reshaping the arrays
    To reshape the array we can use reshape( ).
    f = np.arange(101,113)
    f.reshape(3,4)
    f
     array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

    Note that reshape() does not alter the shape of the original array. Thus to modify the original array we can use resize( )
    f.resize(3,4)
    f
    array([[101, 102, 103, 104],
    [105, 106, 107, 108],
    [109, 110, 111, 112]])

    If a dimension is given as -1 in a reshaping, the other dimension is automatically calculated, provided that the given dimension divides the total number of elements in the array exactly.
    f.reshape(3,-1)
    array([[101, 102, 103, 104],
    [105, 106, 107, 108],
    [109, 110, 111, 112]])

    In the above code we only specified that we will have 3 rows. NumPy automatically calculates the size of the other dimension, i.e. 4 columns.

    Missing Data
    The missing data is represented by NaN (acronym for Not a Number). You can use the command np.nan
    val = np.array([15,10, np.nan, 3, 2, 5, 6, 4])
    val.sum()
    Out: nan
    To ignore missing values, you can use np.nansum(val) which returns 45

    To check whether an array contains missing values, you can use the function np.isnan( )
    np.isnan(val)


    2D arrays
    A 2D array in numpy can be created in the following manner:
    g = np.array([(10,20,30),(40,50,60)])
    #Alternatively
    g = np.array([[10,20,30],[40,50,60]])
    g
    The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:
    g.ndim
    g.size
    g.shape
    g.ndim
    Output: 2

    g.size
    Output: 6

    g.shape
    Output: (2, 3)

    Creating some usual matrices
    numpy provides the utility to create some usual matrices which are commonly used for linear algebra.
    To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):
    np.zeros( (2,4) )
    array([[ 0.,  0.,  0.,  0.],
    [ 0., 0., 0., 0.]])
    Here the dtype can also be specified. For a zero matrix the default dtype is 'float'. To change it to integer we write 'dtype = np.int16'
    np.zeros([2,4],dtype=np.int16)
    array([[0, 0, 0, 0],
    [0, 0, 0, 0]], dtype=int16)
    np.empty creates an array without initializing its entries, so it contains whatever values happen to be in memory (often tiny, seemingly random numbers):
    np.empty( (2,3) )
    array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
    [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
    Note: The results may vary every time you run np.empty.
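    If genuine random numbers between 0 and 1 are needed, np.random.rand is the appropriate function (a quick sketch):
    np.random.rand(2,3)    # 2 x 3 array of uniform random values in [0, 1)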
    To create a matrix of ones we write np.ones( ). We can create a 3 * 3 matrix of all ones by:
    np.ones([3,3])
    array([[ 1.,  1.,  1.],
    [ 1., 1., 1.],
    [ 1., 1., 1.]])
    To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:
    np.diag([14,15,16,17])
    array([[14,  0,  0,  0],
    [ 0, 15, 0, 0],
    [ 0, 0, 16, 0],
    [ 0, 0, 0, 17]])
    To create an identity matrix we can use np.eye( ) .
    np.eye(5,dtype = "int")
    array([[1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1]])
    By default the datatype in np.eye( ) is 'float' thus we write dtype = "int" to convert it to integers.

    Reshaping 2D arrays
    To get a flattened 1D array we can use ravel( )
    g = np.array([(10,20,30),(40,50,60)])
    g.ravel()
     array([10, 20, 30, 40, 50, 60])
    To change the shape of 2D array we can use reshape. Writing -1 will calculate the other dimension automatically and does not modify the original array.
    g.reshape(3,-1) # returns the array with a modified shape
    #It does not modify the original array
    g.shape
     (2, 3)
    Similar to 1D arrays, using resize( ) will modify the shape in the original array.
    g.resize((3,2))
    g #resize modifies the original array
    array([[10, 20],
    [30, 40],
    [50, 60]])

    Time for some matrix algebra
    Let us create some arrays A,b and B and they will be used for this section:
    A = np.array([[2,0,1],[4,3,8],[7,6,9]])
    b = np.array([1,101,14])
    B = np.array([[10,20,30],[40,50,60],[70,80,90]])
    In order to get the transpose, trace and inverse we use A.transpose( ) , np.trace( ) and np.linalg.inv( ) respectively.
    A.T #transpose
    A.transpose() #transpose
    np.trace(A) # trace
    np.linalg.inv(A) #Inverse
    A.transpose()  #transpose
    Output:
    array([[2, 4, 7],
    [0, 3, 6],
    [1, 8, 9]])

    np.trace(A) # trace
    Output: 14

    np.linalg.inv(A) #Inverse
    Output:
    array([[ 0.53846154, -0.15384615, 0.07692308],
    [-0.51282051, -0.28205128, 0.30769231],
    [-0.07692308, 0.30769231, -0.15384615]])
    Note that transpose does not modify the original array.

    Matrix addition and subtraction can be done in the usual way:
    A+B
    A-B
    A+B
    Output:
    array([[12, 20, 31],
    [44, 53, 68],
    [77, 86, 99]])

    A-B
    Output:
    array([[ -8, -20, -29],
    [-36, -47, -52],
    [-63, -74, -81]])
    Matrix multiplication of A and B can be accomplished by A.dot(B), where A is the first matrix on the left-hand side and B is the second matrix on the right side.
    A.dot(B)
    array([[  90,  120,  150],
    [ 720, 870, 1020],
    [ 940, 1160, 1380]])
    To solve the system of linear equations: Ax = b we use np.linalg.solve( )
    np.linalg.solve(A,b)
    array([-13.92307692, -24.69230769,  28.84615385])
    The eigenvalues and eigenvectors can be calculated using np.linalg.eig( )
    np.linalg.eig(A)
    (array([ 14.0874236 ,   1.62072127,  -1.70814487]),
    array([[-0.06599631, -0.78226966, -0.14996331],
    [-0.59939873, 0.54774477, -0.81748379],
    [-0.7977253 , 0.29669824, 0.55608566]]))
    The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the respective eigenvalue.
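    As a quick sanity check (a sketch using the matrix A above), each eigenpair should satisfy A.v = eigenvalue * v:
    vals, vecs = np.linalg.eig(A)
    np.allclose(A.dot(vecs[:,0]), vals[0] * vecs[:,0])    # True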

    Some Mathematics functions

    We can have various trigonometric functions like sin, cosine etc. using numpy:
    B = np.array([[0,-20,36],[40,50,1]])
    np.sin(B)
    array([[ 0.        , -0.91294525, -0.99177885],
    [ 0.74511316, -0.26237485, 0.84147098]])
    The result is the matrix of the element-wise sin( ) values.
    To raise the elements to a power we use **
    B**2
    array([[   0,  400, 1296],
    [1600, 2500, 1]], dtype=int32)
    We get the matrix of the square of all elements of B.
To check whether the elements of a matrix satisfy a condition, we state the criterion. For instance, to check if the elements of B are more than 25 we write:
    B>25
    array([[False, False,  True],
    [ True, True, False]], dtype=bool)
    We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.
In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute values, square roots and exponentials respectively.
    np.absolute(B)
    np.sqrt(B)
    np.exp(B)
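For reference, a sketch of what the first of these returns for our matrix B (note that np.sqrt emits a RuntimeWarning and returns nan for the negative element -20):
np.absolute(B)
array([[ 0, 20, 36],
[40, 50,  1]])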
Now we consider a matrix A of shape 3×3:
    A = np.arange(1,10).reshape(3,3)
    A
    array([[1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]])
    To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:
    A.sum()
    A.min()
    A.max()
    A.mean()
    A.std() #Standard deviation
    A.var() #Variance
    A.sum()
    Output: 45

    A.min()
    Output: 1

    A.max()
    Output: 9

    A.mean()
    Output: 5.0

    A.std() #Standard deviation
    Output: 2.5819888974716112

    A.var()
    Output: 6.666666666666667
    In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.
    A.argmin()
    A.argmax()
    A.argmin()
    Output: 0

    A.argmax()
    Output: 8
    If we wish to find the above statistics for each row or column then we need to specify the axis:
    A.sum(axis=0)
    A.mean(axis = 0)
    A.std(axis = 0)
    A.argmin(axis = 0)
    A.sum(axis=0)                 # sum of each column, it will move in downward direction
    Output: array([12, 15, 18])

    A.mean(axis = 0)
    Output: array([ 4., 5., 6.])

    A.std(axis = 0)
    Output: array([ 2.44948974, 2.44948974, 2.44948974])

    A.argmin(axis = 0)
    Output: array([0, 0, 0], dtype=int64)
By defining axis = 0, calculations move in the downward direction, i.e. we get the statistics for each column. To find the min and the index of the maximum element for each row, we need to move in the rightward direction, so we write axis = 1:
    A.min(axis=1)
    A.argmax(axis = 1)
    A.min(axis=1)                  # min of each row, it will move in rightwise direction
    Output: array([1, 4, 7])

    A.argmax(axis = 1)
    Output: array([2, 2, 2], dtype=int64)
    To find the cumulative sum along each row we use cumsum( )
    A.cumsum(axis=1)
    array([[ 1,  3,  6],
    [ 4, 9, 15],
    [ 7, 15, 24]], dtype=int32)

    Creating 3D arrays
    Numpy also provides the facility to create 3D arrays. A 3D array can be created as:
    X = np.array( [[[ 1, 2,3],
    [ 4, 5, 6]],
    [[7,8,9],
    [10,11,12]]])
    X.shape
    X.ndim
    X.size
X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.
    To calculate the sum along a particular axis we use the axis parameter as follows:
    X.sum(axis = 0)
    X.sum(axis = 1)
    X.sum(axis = 2)
    X.sum(axis = 0)
    Output:
    array([[ 8, 10, 12],
    [14, 16, 18]])

    X.sum(axis = 1)
    Output:
    array([[ 5, 7, 9],
    [17, 19, 21]])

    X.sum(axis = 2)
    Output:
    array([[ 6, 15],
    [24, 33]])
    axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
    X.ravel()
     array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
ravel( ) flattens all the elements into a single 1D array.
    Consider a 3D array:
    X = np.array( [[[ 1, 2,3],
    [ 4, 5, 6]],
    [[7,8,9],
    [10,11,12]]])
    To extract the 2nd matrix we write:
    X[1,...] # same as X[1,:,:] or X[1]
    array([[ 7,  8,  9],
    [10, 11, 12]])
Remember Python indexing starts from 0, which is why we wrote 1 to extract the 2nd 2D array.
    To extract the first element from all the rows we write:
    X[...,0] # same as X[:,:,0]
    array([[ 1,  4],
    [ 7, 10]])


    Find out position of elements that satisfy a given condition
    a = np.array([8, 3, 7, 0, 4, 2, 5, 2])
    np.where(a > 4)
(array([0, 2, 6]),)
np.where returns the positions (as a tuple of arrays) where the elements of the array are greater than 4.
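np.where can also take two extra arguments to build a new array from the condition. A small sketch (standard NumPy behavior):
np.where(a > 4, 1, 0)
array([1, 0, 1, 0, 0, 0, 1, 0])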

    Indexing with Arrays of Indices
    Consider a 1D array.
    x = np.arange(11,35,2)
    x
    array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
    We form a 1D array i which subsets the elements of x as follows:
    i = np.array( [0,1,5,3,7,9 ] )
    x[i]
    array([11, 13, 21, 17, 25, 29])
    In a similar manner we create a 2D array j of indices to subset x.
    j = np.array( [ [ 0, 1], [ 6, 2 ] ] )
    x[j]
    array([[11, 13],
    [23, 15]])
    Similarly we can create both i and j as 2D arrays of indices for x
    x = np.arange(15).reshape(3,5)
    x
    i = np.array( [ [0,1], # indices for the first dim
    [2,0] ] )
    j = np.array( [ [1,1], # indices for the second dim
    [2,0] ] )
To pick the elements with row indices i and column indices j we write:
    x[i,j] # i and j must have equal shape
    array([[ 1,  6],
    [12, 0]])
To extract the elements at row indices i from the 3rd column we write:
    x[i,2]
    array([[ 2,  7],
    [12, 2]])
To pick, for every row, the elements at column indices j we write:
    x[:,j]
    array([[[ 1,  1],
    [ 2, 0]],

    [[ 6, 6],
    [ 7, 5]],

    [[11, 11],
    [12, 10]]])
This fixes the 1st row with the jth indices, then the 2nd row with the jth indices, then the 3rd row with the jth indices.

    You can also use indexing with arrays to assign the values:
    x = np.arange(10)
    x
    x[[4,5,8,1,2]] = 0
    x
    array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
    0 is assigned to 4th, 5th, 8th, 1st and 2nd indices of x.
    When the list of indices contains repetitions then it assigns the last value to that index:
    x = np.arange(10)
    x
    x[[4,4,2,3]] = [100,200,300,400]
    x
    array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
Notice that for the 5th element (i.e. the 4th index) the value assigned is 200, not 100.
Caution: when the += operator is used on repeated indices, the operation is carried out only once per index.
    x = np.arange(10)
    x[[1,1,1,7,7]]+=1
    x
     array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
Although indices 1 and 7 are repeated, they are incremented only once.
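If the increment should instead be applied once per occurrence of a repeated index, NumPy provides np.add.at for unbuffered in-place addition. A minimal sketch:
x = np.arange(10)
np.add.at(x, [1,1,1,7,7], 1)  # index 1 incremented 3 times, index 7 twice
x
array([0, 4, 2, 3, 4, 5, 6, 9, 8, 9])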

    Indexing with Boolean Arrays
We create a 2D array and store our condition in b. Wherever the condition holds, the result is True; otherwise it is False.
    a = np.arange(12).reshape(3,4)
    b = a > 4
    b
    array([[False, False, False, False],
    [False, True, True, True],
    [ True, True, True, True]], dtype=bool)
Note that 'b' is a Boolean array with the same shape as 'a'.
    To select the elements from 'a' which adhere to condition 'b' we write:
    a[b]
    array([ 5,  6,  7,  8,  9, 10, 11])
a[b] returns a 1D array containing the selected elements.
    This property can be very useful in assignments:
    a[b] = 0
    a
    array([[0, 1, 2, 3],
    [4, 0, 0, 0],
    [0, 0, 0, 0]])
    All elements of 'a' higher than 4 become 0
    As done in integer indexing we can use indexing via Booleans:
    Let x be the original matrix and 'y' and 'z' be the arrays of Booleans to select the rows and columns.
    x = np.arange(15).reshape(3,5)
    y = np.array([True,True,False]) # first dim selection
    z = np.array([True,True,False,True,False]) # second dim selection
We write x[y,:], which selects only those rows where y is True.
    x[y,:] # selecting rows
    x[y] # same thing
    Writing x[:,z] will select only those columns where z is True.
    x[:,z] # selecting columns
    x[y,:]                                   # selecting rows
    Output:
    array([[0, 1, 2, 3, 4],
    [5, 6, 7, 8, 9]])

    x[y] # same thing
    Output:
    array([[0, 1, 2, 3, 4],
    [5, 6, 7, 8, 9]])

    x[:,z] # selecting columns
    Output:
    array([[ 0, 1, 3],
    [ 5, 6, 8],
    [10, 11, 13]])

    Statistics on Pandas DataFrame

Let's create a dummy data frame for illustration:
import numpy as np
import pandas as pd

np.random.seed(234)
mydata = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
                       "x2" : range(10)})

    1. Calculate mean of each column of data frame
    np.mean(mydata)
    2. Calculate median of each column of data frame
    np.median(mydata, axis=0)
axis = 0 means the median function is run on each column; axis = 1 runs the function on each row.
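Equivalently, you can call the DataFrame's own methods, which compute the statistics column-wise by default (standard pandas behavior):
mydata.mean()
mydata.median()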

    Stacking various arrays
    Let us consider 2 arrays A and B:
    A = np.array([[10,20,30],[40,50,60]])
    B = np.array([[100,200,300],[400,500,600]])
    To join them vertically we use np.vstack( ).
    np.vstack((A,B)) #Stacking vertically
    array([[ 10,  20,  30],
    [ 40, 50, 60],
    [100, 200, 300],
    [400, 500, 600]])
    To join them horizontally we use np.hstack( ).
    np.hstack((A,B)) #Stacking horizontally
    array([[ 10,  20,  30, 100, 200, 300],
    [ 40, 50, 60, 400, 500, 600]])
newaxis helps transform a 1D array into a 2D column vector.
    from numpy import newaxis
    a = np.array([4.,1.])
    b = np.array([2.,8.])
    a[:,newaxis]
    array([[ 4.],
    [ 1.]])
The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. It is equivalent to hstack only for 1D arrays:
    np.column_stack((a[:,newaxis],b[:,newaxis]))
    np.hstack((a[:,newaxis],b[:,newaxis])) # same as column_stack
    np.column_stack((a[:,newaxis],b[:,newaxis]))
    Output:
    array([[ 4., 2.],
    [ 1., 8.]])

    np.hstack((a[:,newaxis],b[:,newaxis]))
    Output:
    array([[ 4., 2.],
    [ 1., 8.]])

    Splitting the arrays
    Consider an array 'z' of 15 elements:
    z = np.arange(1,16)
    Using np.hsplit( ) one can split the arrays
    np.hsplit(z,5) # Split a into 5 arrays
    [array([1, 2, 3]),
    array([4, 5, 6]),
    array([7, 8, 9]),
    array([10, 11, 12]),
    array([13, 14, 15])]
It splits 'z' into 5 arrays of equal length.
    On passing 2 elements we get:
    np.hsplit(z,(3,5))
    [array([1, 2, 3]),
    array([4, 5]),
    array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
    It splits 'z' after the third and the fifth element.
    For 2D arrays np.hsplit( ) works as follows:
    A = np.arange(1,31).reshape(3,10)
    A
    np.hsplit(A,5) # Split a into 5 arrays
    [array([[ 1,  2],
    [11, 12],
    [21, 22]]), array([[ 3, 4],
    [13, 14],
    [23, 24]]), array([[ 5, 6],
    [15, 16],
    [25, 26]]), array([[ 7, 8],
    [17, 18],
    [27, 28]]), array([[ 9, 10],
    [19, 20],
    [29, 30]])]
In the above command A gets split into 5 arrays of the same shape.
    To split after the third and the fifth column we write:
    np.hsplit(A,(3,5))
    [array([[ 1,  2,  3],
    [11, 12, 13],
    [21, 22, 23]]), array([[ 4, 5],
    [14, 15],
    [24, 25]]), array([[ 6, 7, 8, 9, 10],
    [16, 17, 18, 19, 20],
    [26, 27, 28, 29, 30]])]

    Copying
    Consider an array x
    x = np.arange(1,16)
We assign y as x and then check 'y is x':
y = x
y is x
True
Assignment does not copy the data; y is just another name for the same array.
    Let us change the shape of y
    y.shape = 3,5
    Note that it alters the shape of x
    x.shape
    (3, 5)

    Creating a view of the data
    Let us store z as a view of x by:
    z = x.view()
    z is x
    False
    Thus z is not x.
    Changing the shape of z
    z.shape = 5,3
    Creating a view does not alter the shape of x
    x.shape
    (3, 5)
    Changing an element in z
    z[0,0] = 1234
Note that the value in x also gets altered:
    x
    array([[1234,    2,    3,    4,    5],
    [ 6, 7, 8, 9, 10],
    [ 11, 12, 13, 14, 15]])
Thus changing the shape of a view does not alter the original array, but changing the values of a view does affect the original data.
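To check whether an array is a view of another, you can inspect its base attribute (standard NumPy behavior):
z.base is x     # True: z is a view of x
x.base is None  # True: x owns its own data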


    Creating a copy of the data:
    Now let us create z as a copy of x:
    z = x.copy()
Note that z is not x:
z is x
False
    Changing the value in z
    z[0,0] = 9999
    No alterations are made in x.
    x
    array([[1234,    2,    3,    4,    5],
    [ 6, 7, 8, 9, 10],
    [ 11, 12, 13, 14, 15]])
pandas may sometimes give a 'SettingWithCopyWarning' because it cannot tell whether a new dataframe or array (created as a subset of another dataframe or array) is a view or a copy. In such situations the user needs to make the intent explicit, otherwise the results may not be what was expected.
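One common way to make the intent explicit in pandas is to take an explicit copy of the subset before modifying it. A minimal sketch, with a hypothetical dataframe df and column var1:
subset = df[df['var1'] > 4].copy()  # explicit copy, no warning
subset['var1'] = 0                  # modifies only the copy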

    Exercises : Numpy


    1. How to extract even numbers from array?

    arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    Desired Output :array([0, 2, 4, 6, 8])

    Show Solution
    arr[arr % 2 == 0]

2. How to find the positions where the elements of x and y are the same

x = np.array([5,6,7,8,3,4])
y = np.array([5,3,4,5,2,4])
Desired Output : (array([0, 5]),)

    Show Solution
    np.where(x == y)

3. How to scale values so that they lie between 0 and 1

k = np.array([5,3,4,5,2,4])
Hint : (k-min(k))/(max(k)-min(k))

    Show Solution
    kmax, kmin = k.max(), k.min()
    k_new = (k - kmin)/(kmax - kmin)

    4. How to calculate the percentile scores of an array

    p = np.array([15,10, 3,2,5,6,4])

    Show Solution
    np.percentile(p, q=[5, 95])

    5. Print the number of missing values in an array

    p = np.array([5,10, np.nan, 3, 2, 5, 6, np.nan])

    Show Solution
    print("Number of missing values =", np.isnan(p).sum())

    April 19, 2019 02:12 PM UTC


    Python Bytes

    #126 WebAssembly comes to Python

    April 19, 2019 08:00 AM UTC


    Low Kian Seong

    The Human in Devops

    What was significant this week ?

    This week a mild epiphany came to me right after a somewhat heated and tense meeting with a team of developers plus project owner of a web project. They were angry and they were not afraid to show it. They were somewhat miffed about the fact that the head wrote them an email pretty much forcing them to participate to make our DevOps initiative a success. All kinds of expletive words were running through my head in relation to describing this team of flabby, tired looking individuals in front of me, which belied the cool demeanour and composure that I was trying so hard to maintain.

It happened. In the spur of the moment I too got engulfed in a sea of negativity and for a few minutes lost sight of what is the most important component or pillar in a successful DevOps initiative. The people.

    "What a bunch of mule heads !" I thought. It's as plain as day, once this initiative is a success everybody can go home earlier and everything will be more predictable and we can do much much more than we could before. "Why are you fighting this ?!" I was ready to throw my hands up in defeat when it finally dawned on me.

    "Codes that power DevOps projects don't write themselves. People write those code" 
    "Without people powering our initiative now, we are just a few guys with a bunch of code and tools that are irrelevant"

Boom! These thoughts hit me like lightning, and in that moment I felt an equal measure of wisdom brought by this realisation as well as disgust at my stupidity for forgetting one of the main tenets of and requirements for making the dream of a successful DevOps project a reality.

    It was then I realised 2 very important mistakes I had made so far:


1. I was reaching out horizontally to push our agenda across. Developers loved what we proposed and that was pretty much it. It's cool and it's cutting edge. It stopped there. "Hey, thanks for sharing that cool tool! I will try it in my project when I get the chance!" is pretty much the maximum you can expect to get from such an exchange. For you to gain any traction, you have got to sell your proposed solution or improvement to the stakeholders or the decision makers. Efforts that require people to do the right thing or go out of their way to do some unplanned kindness or rightness usually result in zilch.
2. I did not try to see the tool I was proposing through the eyes of the beholders. It was too much of a leap. Much like how, as Abraham says, it's impossible to frog-leap from sadness to happiness, so it was with the developers. They knew it was good for them, they could see it was good for them, they felt it had the potential to improve their lives, but alas they did not internalise it. The proverbial light bulb did not turn on inside of them; more correctly said, I did not do enough to turn that light on. I could see some people opening up, but when this realisation hit me, I just ended the meeting. I had not done enough to understand where the people I hoped would implement DevOps actually were. I had to do that first.

Do I miss coding? Do I miss hunkering down and prototyping my way to showcase a tool or to get something to work? Of course! Who wouldn't? But the main thing I keep going back to is ... what is the main goal and expectation of the people who hired me to lead their DevOps push? Is it to wire together some tools and configure something so they can use it? At a small enough scale that is probably enough value, but when you want the horses you lead to the water to actually drink, you need to give them a reason; just because you are drinking, you can't expect them to follow suit.

I am going to reach out more, I am going to understand more and I am going to engage more. All the people pieces need to be in place before the rest starts falling into place automatically. Stay tuned if this is interesting ...


    April 19, 2019 07:38 AM UTC


    ListenData

    Python for Data Science : Learn in 3 Days

This tutorial helps you learn data science with Python through examples. Python is an open-source language and is widely used as a high-level, general-purpose programming language. It has gained high popularity in the data science world. As the data science domain rises, IBM recently predicted that demand for data science professionals would grow by more than 25% by 2020. In the PyPL Popularity of Programming Language index, Python ranked second with a 14 percent share. In the advanced analytics and predictive analytics market, it is ranked among the top 3 programming languages for advanced analytics.
    Data Science with Python Tutorial

    Table of Contents
    1. Getting Started with Python
    2. Data Structures and Conditional Statements
    3. Python Libraries
    4. Data Manipulation using Pandas
    5. Data Science with Python

    Python 2.7 vs 3.6

Google yields thousands of articles on this topic. Some bloggers are opposed to and some in favor of 2.7. If you filter your search criteria and look only at recent articles (late 2016 onwards), you will see that the majority of bloggers are in favor of Python 3.6. See the following reasons to support Python 3.6.

1. The official end-of-support date for Python 2.7 is the year 2020. Afterward there will be no support from the community. It does not make sense to start learning 2.7 today.

    2. Python 3.6 supports 95% of top 360 python packages and almost 100% of top packages for data science.

    What's new in Python 3.6

It is cleaner and faster, and it is a language for the future. It fixed major issues of the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for 9 years since.

    Key Takeaway
You should go for Python 3.6. In terms of learning, there are no major differences between Python 2.7 and 3.6, and it is not too difficult to move between the two with a few adjustments. Your focus should be on learning Python as a language.

    Python for Data Science : Introduction

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. There are some areas, such as the number of libraries for statistical analysis, where R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of data scientists.

    There are several reasons to learn Python. Some of them are as follows -
    1. Python runs well in automating various steps of a predictive model. 
2. Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial intelligence.
    3. Python wins over R when it comes to deploying machine learning models in production.
    4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
    5. Python has a great online community support.
    Do you know these sites are developed in Python?
    1. YouTube
    2. Instagram
    3. Reddit
    4. Dropbox
    5. Disqus

    How to Install Python

    There are two ways to download and install Python
    1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
    2. Download Python from its official website. You have to manually install libraries.

Recommended : Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

    Coding Environments

Anaconda comes with two popular IDEs:
    1. Jupyter (Ipython) Notebook
    2. Spyder
Spyder. It is like RStudio for Python. It gives an environment wherein writing Python code is user-friendly. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs. It has a console to check each and every line of code. Under the 'Variable explorer', you can access your created data files and functions. I highly recommend Spyder!
    Spyder - Python Coding Environment
    Jupyter (Ipython) Notebook

Jupyter is equivalent to R Markdown. It is useful when you need to present your work to others or create a step-by-step project report, as it can combine code, output, words, and graphics.

    Spyder Shortcut Keys

The following is a list of some useful Spyder shortcut keys which make you more productive.
    1. Press F5 to run the entire script
    2. Press F9 to run selection or line 
    3. Press Ctrl + 1 to comment / uncomment
    4. Go to front of function and then press Ctrl + I to see documentation of the function
    5. Run %reset -f to clean workspace
    6. Ctrl + Left click on object to see source code 
    7. Ctrl+Enter executes the current cell.
    8. Shift+Enter executes the current cell and advances the cursor to the next cell

    List of arithmetic operators with examples

Arithmetic Operators Operation Example
+ Addition 10 + 2 = 12
- Subtraction 10 - 2 = 8
* Multiplication 10 * 2 = 20
/ Division 10 / 2 = 5.0
% Modulus (Remainder) 10 % 3 = 1
** Power 10 ** 2 = 100
// Floor Division 17 // 3 = 5
(x + (d-1)) // d Ceiling (17 + (3-1)) // 3 = 6
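A quick sketch verifying the floor and ceiling idioms from the table above (math.ceil is the standard-library alternative to the integer trick):
import math
x, d = 17, 3
print(x // d)              # 5, floor division
print((x + (d - 1)) // d)  # 6, ceiling via integer arithmetic
print(math.ceil(x / d))    # 6, same result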

    Basic Programs

    Example 1
    #Basics
    x = 10
    y = 3
    print("10 divided by 3 is", x/y)
    print("remainder after 10 divided by 3 is", x%y)
Result :
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1

    Example 2
    x = 100
    x > 80 and x <=95
    x > 35 or x < 60
    x > 80 and x <=95
    Out[45]: False
    x > 35 or x < 60
    Out[46]: True

    Comparison & Logical Operators Description Example
    > Greater than 5 > 3 returns True
    < Less than 5 < 3 returns False
    >= Greater than or equal to 5 >= 3 returns True
<= Less than or equal to 5 <= 3 returns False
    == Equal to 5 == 3 returns False
    != Not equal to 5 != 3 returns True
    and Check both the conditions x > 18 and x <=35
or True if at least one condition holds x > 35 or x < 60
    not Opposite of Condition not(x>7)

    Assignment Operators

It is used to assign a value to a variable. For example, x += 25 means x = x + 25.
    x = 100
    y = 10
    x += y
    print(x)
    print(x)
    110
    In this case, x+=y implies x=x+y which is x = 100 + 10.
Similarly, you can use x -= y, x *= y and x /= y.

    Python Data Structure

    In every programming language, it is important to understand the data structures. Following are some data structures used in Python.

    1. List

It is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of lists below. The first is an integer list containing only integers. The second is a string list containing only string values. The third is a mixed list containing integer, string and float values.
1. x = [1, 2, 3, 4, 5]
2. y = ['A', 'O', 'G', 'M']
3. z = ['A', 4, 5.1, 'M']
    Get List Item

We can extract list items using indexes. Indexing starts at 0 and ends at (number of elements - 1).
    x = [1, 2, 3, 4, 5]
    x[0]
    x[1]
    x[4]
    x[-1]
    x[-2]
    x[0]
    Out[68]: 1

    x[1]
    Out[69]: 2

    x[4]
    Out[70]: 5

    x[-1]
    Out[71]: 5

    x[-2]
    Out[72]: 4

x[0] picks the first element from the list. A negative sign tells Python to count list items from right to left; x[-1] selects the last element of the list.

    You can select multiple elements from a list using the following method
    x[:3] returns [1, 2, 3]
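A few more slicing patterns (standard Python list behavior):
x[1:4]    # [2, 3, 4] - index 1 up to, but not including, index 4
x[::2]    # [1, 3, 5] - every second element
x[::-1]   # [5, 4, 3, 2, 1] - the list reversed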

    2. Tuple

A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -
1. A tuple cannot be changed once constructed, whereas a list can be modified.
2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ]
    Examples
    K = (1,2,3)
    State = ('Delhi','Maharashtra','Karnataka')
    Perform for loop on Tuple
    for i in State:
        print(i)
    Delhi
    Maharashtra
    Karnataka

    Detailed Tutorial : Python Data Structures

    Functions

Like print(), you can create your own custom functions, also called user-defined functions. They help you automate repetitive tasks and call reusable code in an easier way.

    Rules to define a function
    1. Function starts with def keyword followed by function name and ( )
    2. Function body starts with a colon (:) and is indented
3. The keyword return ends a function and gives back the value of the expression that follows it.
    def sum_fun(a, b):
        result = a + b
        return result 
    z = sum_fun(10, 15)
    Result : z = 25

    Suppose you want python to assume 0 as default value if no value is specified for parameter b.
    def sum_fun(a, b=0):
        result = a + b
        return result
    z = sum_fun(10)
In the above function, b is set to 0 if no value is provided for it. This does not mean that only 0 can be passed; the function can still be called as z = sum_fun(10, 15).
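You can also pass arguments by name, in any order (standard Python behavior):
z = sum_fun(b=15, a=10)  # keyword arguments; z = 25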

    Conditional Statements (if else)

Conditional statements are commonly used in coding. These are IF-ELSE statements, which can be read as: "if a condition holds true, then execute something; else execute something else".

Note : The if and else statements end with a colon :

    Example
    k = 27
    if k%5 == 0:
      print('Multiple of 5')
    else:
      print('Not a Multiple of 5')
    Result : Not a Multiple of 5
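For more than two branches you can chain conditions with elif. A small sketch:
k = 27
if k % 5 == 0:
  print('Multiple of 5')
elif k % 3 == 0:
  print('Multiple of 3')
else:
  print('Neither a multiple of 5 nor of 3')
Result : Multiple of 3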

    Popular python packages for Data Analysis & Visualization

    Some of the leading packages in Python along with equivalent libraries in R are as follows-
1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly comparable to the Rcpp package in R, which eliminates the limitation of slow speed in R. Numpy Tutorial
    3. Scipy.  For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on Numpy.
    4. Scikit-learn. A collection of machine learning algorithms. It is built on Numpy and Scipy. It can perform all the techniques that can be done in R using glm, knn, randomForest, rpart, e1071 packages.
    5. Matplotlib. For data visualization. It's a leading package for graphics in Python. It is equivalent to ggplot2 package in R.
    6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, methods to impute missing values, statistical tests and take table output to HTML format.
7. pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come preinstalled with Anaconda and are ready to use in Spyder.
      Comparison of Python and R Packages by Data Mining Task

      Task Python Package R Package
      IDE Rodeo / Spyder Rstudio
      Data Manipulation pandas dplyr and reshape2
      Machine Learning Scikit-learn glm, knn, randomForest, rpart, e1071
      Data Visualization ggplot + seaborn + bokeh ggplot2
      Character Functions Built-In Functions stringr
      Reproducibility Jupyter Knitr
      SQL Queries pandasql sqldf
      Working with Dates datetime lubridate
      Web Scraping beautifulsoup rvest

      Popular Python Commands

      The commands below would help you to install and update new and existing packages. Let's say, you want to install / uninstall pandas package.

      Run these commands from IPython console window. Don't forget to add ! before pip otherwise it would return syntax error.

      Install Package
      !pip install pandas

      Uninstall Package
      !pip uninstall pandas

      Show Information about Installed Package
      !pip show pandas

      List of Installed Packages
      !pip list

      Upgrade a package
      !pip install --upgrade pandas

        How to import a package

        There are multiple ways to import a package in Python. It is important to understand the difference between these styles.

        1. import pandas as pd
It imports the package pandas under the alias pd. A function DataFrame in the pandas package is then called as pd.DataFrame.

        2. import pandas
It imports the package without an alias; here the function DataFrame is called with the full package name: pandas.DataFrame

        3. from pandas import *
It imports the whole package, and the function DataFrame is executed simply by typing DataFrame. This sometimes creates confusion when the same function name exists in more than one package.

        Pandas Data Structures : Series and DataFrame

        In pandas package, there are two data structures - series and dataframe. These structures are explained below in detail -
        1. Series is a one-dimensional array. You can access individual elements of a series using position. It's similar to vector in R.
        In the example below, we are generating 5 random values.
        import pandas as pd
        import numpy as np
        s1 = pd.Series(np.random.randn(5))
        s1
        0   -2.412015
        1 -0.451752
        2 1.174207
        3 0.766348
        4 -0.361815
        dtype: float64

        Extract first and second value

        You can get a particular element of a series using index value. See the examples below -

        s1[0]
        -2.412015
        s1[1]
        -0.451752
        s1[:3]
        0   -2.412015
        1 -0.451752
        2 1.174207

        2. DataFrame

It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different types such as characters, integers, floating point values and factors. Those who are well-conversant with MS Excel can think of a data frame as an Excel spreadsheet.

        Comparison of Data Type in Python and Pandas

        The following table shows how Python and pandas package stores data.

        Data Type Pandas Standard Python
        For character variable object string
        For categorical variable category -
        For Numeric variable without decimals int64 int
        Numeric characters with decimals float64 float
        For date time variables datetime64 -

        Important Pandas Functions

The table below shows a comparison of pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorize pandas functions. It's very handy information for programmers who are new to Python, and it includes solutions for most of the frequently used data exploration tasks.

        Functions R Python (pandas package)
        Installing a package install.packages('name') !pip install name
        Loading a package library(name) import name as other_name
        Checking working directory getwd() import os
        os.getcwd()
        Setting working directory setwd() os.chdir()
        List files in a directory dir() os.listdir()
        Remove an object rm('name') del object
        Select Variables select(df, x1, x2) df[['x1', 'x2']]
        Drop Variables select(df, -(x1:x2)) df.drop(['x1', 'x2'], axis = 1)
        Filter Data filter(df, x1 >= 100) df.query('x1 >= 100')
        Structure of a DataFrame str(df) df.info()
        Summarize dataframe summary(df) df.describe()
        Get row names of dataframe "df" rownames(df) df.index
        Get column names colnames(df) df.columns
        View Top N rows head(df,N) df.head(N)
        View Bottom N rows tail(df,N) df.tail(N)
        Get dimension of data frame dim(df) df.shape
        Get number of rows nrow(df) df.shape[0]
        Get number of columns ncol(df) df.shape[1]
        Length of data frame length(df) len(df)
        Get random 3 rows from dataframe sample_n(df, 3) df.sample(n=3)
        Get random 10% rows sample_frac(df, 0.1) df.sample(frac=0.1)
        Check Missing Values is.na(df$x) pd.isnull(df.x)
        Sorting arrange(df, x1, x2) df.sort_values(['x1', 'x2'])
        Rename Variables rename(df, newvar = x1) df.rename(columns={'x1': 'newvar'})


        Data Manipulation with pandas - Examples

        1. Import Required Packages

        You can import required packages using import statement. In the syntax below, we are asking Python to import numpy and pandas package. The 'as' is used to alias package name.
        import numpy as np
        import pandas as pd

        2. Build DataFrame

        We can build dataframe using DataFrame() function of pandas package.
        mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
        df = pd.DataFrame(mydata)
         In this dataframe, we have three variables - productcode, sales, cost.
        Sample DataFrame

        To import data from CSV file


        You can use read_csv() function from pandas package to get data into python from CSV file.
        mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
Make sure you use double backslashes when specifying the path of the CSV file. Alternatively, you can use forward slashes in the file path inside the read_csv() function.

        Detailed Tutorial : Import Data in Python

        3. To see number of rows and columns

        You can run the command below to find out number of rows and columns.
        df.shape
         Result : (6, 3). It means 6 rows and 3 columns.

        4. To view first 3 rows

The df.head(N) function can be used to check out the first N rows.
        df.head(3)
             cost productcode   sales
        0 1020.0 AA 1010.0
        1 1625.2 AA 1025.2
        2 1204.0 AA 1404.2

        5. Select or Drop Variables

        To keep a single variable, you can write in any of the following three methods -
        df.productcode
        df["productcode"]
        df.loc[: , "productcode"]
        To select variable by column position, you can use df.iloc function. In the example below, we are selecting second column. Column Index starts from 0. Hence, 1 refers to second column.
        df.iloc[: , 1]
        We can keep multiple variables by specifying desired variables inside [ ]. Also, we can make use of df.loc() function.
        df[["productcode", "cost"]]
        df.loc[ : , ["productcode", "cost"]]

        Drop Variable

        We can remove variables by using df.drop() function. See the example below -
        df2 = df.drop(['sales'], axis = 1)

        6. To summarize data frame

        To summarize or explore data, you can submit the command below.
        df.describe()
                      cost       sales
        count 6.000000 6.00000
        mean 1166.150000 1242.65000
        std 237.926793 230.46669
        min 1003.700000 1010.00000
        25% 1020.000000 1058.90000
        50% 1072.000000 1205.85000
        75% 1184.000000 1366.07500
        max 1625.200000 1604.80000

        To summarise all the character variables, you can use the following script.
        df.describe(include=['O'])
        Similarly, you can use df.describe(include=['float64']) to view summary of all the numeric variables with decimals.

        To select only a particular variable, you can write the following code -
        df.productcode.describe()
        OR
        df["productcode"].describe()
        count      6
        unique 2
        top BB
        freq 3
        Name: productcode, dtype: object

        7. To calculate summary statistics

        We can manually find out summary statistics such as count, mean, median by using commands below
        df.sales.mean()
        df.sales.median()
        df.sales.count()
        df.sales.min()
        df.sales.max()

        8. Filter Data

        Suppose you are asked to apply condition - productcode is equal to "AA" and sales greater than or equal to 1250.
        df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
        It can also be written like :
        df1 = df.query('(productcode == "AA") & (sales >= 1250)')
In the second query, we do not need to specify the DataFrame name along with each variable name.

        9. Sort Data

In the code below, we arrange the data in ascending order by sales.
        df.sort_values(['sales'])

        10.  Group By : Summary by Grouping Variable

Like SQL GROUP BY, we can summarize a continuous variable by a classification variable. In this case, we are calculating average sales and cost by product code.
        df.groupby(df.productcode).mean()
                            cost        sales
        productcode
        AA 1283.066667 1146.466667
        BB 1049.233333 1338.833333
Instead of summarising multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
        df["sales"].groupby(df.productcode).mean()

        11. Define Categorical Variable

        Let's create a classification variable - id which contains only 3 unique values - 1/2/3.
        df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
Let's define it as a categorical variable.
        We can use astype() function to make id as a categorical variable.
        df0.id = df0["id"].astype('category')
        Summarize this classification variable to check descriptive statistics.
        df0.describe()
               id
        count 7
        unique 3
        top 2
        freq 3

        Frequency Distribution

You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
        df['productcode'].value_counts()
        BB    3
        AA 3
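To get proportions instead of counts, pass normalize=True (a standard value_counts option):
df['productcode'].value_counts(normalize=True)
BB    0.5
AA    0.5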

        12. Generate Histogram

A histogram is one of the methods to check the distribution of a continuous variable. In the figure shown below, there are two values for the variable 'sales' in the range 1000-1100. In the remaining intervals, there is only a single value. In this case, there are only 5 values. If you have a large dataset, you can plot a histogram to identify outliers in a continuous variable.
        df['sales'].hist()
        Histogram

        13. BoxPlot

A boxplot is a method to visualize a continuous or numeric variable. It shows the minimum, Q1, Q2, Q3, IQR and maximum values in a single graph.
        df.boxplot(column='sales')
        BoxPlot

        Detailed Tutorial : Data Analysis with Pandas Tutorial

        Data Science using Python - Examples

        In this section, we cover how to perform data mining and machine learning algorithms with Python. sklearn is the most frequently used library for running data mining and machine learning algorithms. We will also cover statsmodels library for regression techniques. statsmodels library generates formattable output which can be used further in project report and presentation.

        1. Install the required libraries

        Import the following libraries before reading or exploring data
        #Import required libraries
        import pandas as pd
        import statsmodels.api as sm
        import numpy as np

        2. Download and import data into Python

With the pandas library, we can easily get data from the web into Python.
        # Read data from web
        df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
        Variables Type Description
        gre Continuous Graduate Record Exam score
        gpa Continuous Grade Point Average
        rank Categorical Prestige of the undergraduate institution
        admit Binary Admission in graduate school

        The binary variable admit is a target variable.

        3. Explore Data

        Let's explore data. We'll answer the following questions -
        1. How many rows and columns in the data file?
        2. What are the distribution of variables?
        3. Check if any outlier(s)
        4. If outlier(s), treat them
        5. Check if any missing value(s)
        6. Impute Missing values (if any)
        # See no. of rows and columns
        df.shape
        Result : 400 rows and 4 columns

In the code below, we rename the variable rank to 'position', as rank is already the name of a pandas method.
        # rename rank column
        df = df.rename(columns={'rank': 'position'}) 
        Summarize and plot all the columns.
        # Summarize
        df.describe()
        # plot all of the columns
        df.hist()
        Categorical variable Analysis

        It is important to check the frequency distribution of categorical variable. It helps to answer the question whether data is skewed.
        # Summarize
        df.position.value_counts(ascending=True)
        1     61
        4 67
        3 121
        2 151

        Generating Crosstab 

        By looking at cross tabulation report, we can check whether we have enough number of events against each unique values of categorical variable.
        pd.crosstab(df['admit'], df['position'])
        position   1   2   3   4
        admit
        0 28 97 93 55
        1 33 54 28 12

        Number of Missing Values

        We can write a simple loop to figure out the number of blank values in all variables in a dataset.
        for i in list(df.columns) :
            k = sum(pd.isnull(df[i]))
            print(i, k)
        In this case, there are no missing values in the dataset.
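A more concise pandas alternative gives the same per-column counts:
df.isnull().sum()  # number of missing values in each column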

        4. Logistic Regression Model

Logistic regression is a special type of regression where the target variable is categorical in nature and the independent variables can be discrete or continuous. In this post, we will demonstrate only binary logistic regression, in which the target variable takes only binary values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

        In python, we can write R-style model formula y ~ x1 + x2 + x3 using  patsy and statsmodels libraries. In the formula, we need to define variable 'position' as a categorical variable by mentioning it inside capital C(). You can also define reference category using reference= option.
        #Reference Category
        from patsy import dmatrices, Treatment
        y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
It returns two datasets - X and y. The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), the dummy variables for position, and gre and gpa. Since 4 is set as the reference category, all three dummy variables are 0 for that category. See the sample below -
        P  P_1 P_2 P_3
        3 0 0 1
        3 0 0 1
        1 1 0 0
        4 0 0 0
        4 0 0 0
        2 0 1 0


        Split Data into two parts

        80% of data goes to training dataset which is used for building model and 20% goes to test dataset which would be used for validating the model.
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

        Build Logistic Regression Model

By default, the regression without formula style does not include an intercept. To include it, we have already added an Intercept column to X_train, which is used as a predictor.
        #Fit Logit model
        logit = sm.Logit(y_train, X_train)
        result = logit.fit()

        #Summary of Logistic regression model
        result.summary()
        result.params
                                  Logit Regression Results                           
        ==============================================================================
        Dep. Variable: admit No. Observations: 320
        Model: Logit Df Residuals: 315
        Method: MLE Df Model: 4
        Date: Sat, 20 May 2017 Pseudo R-squ.: 0.03399
        Time: 19:57:24 Log-Likelihood: -193.49
        converged: True LL-Null: -200.30
        LLR p-value: 0.008627
        =======================================================================================
coef    std err    z    P>|z|    [95.0% Conf. Int.]
        ---------------------------------------------------------------------------------------
        C(position)[T.1] 1.4933 0.440 3.392 0.001 0.630 2.356
        C(position)[T.2] 0.6771 0.373 1.813 0.070 -0.055 1.409
        C(position)[T.3] 0.1071 0.410 0.261 0.794 -0.696 0.910
        gre 0.0005 0.001 0.442 0.659 -0.002 0.003
gpa -0.4613 0.214 -2.152 0.031 -0.881 -0.041
        ======================================================================================

        Confusion Matrix and Odd Ratio

        Odd ratio is exponential value of parameter estimates.
        #Confusion Matrix
        result.pred_table()
        #Odd Ratio
        np.exp(result.params)

        Prediction on Test Data
        In this step, we take estimates of logit model which was built on training data and then later apply it into test data.
        #prediction on test data
        y_pred = result.predict(X_test)

        Calculate Area under Curve (ROC)
# AUC on test data
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc(false_positive_rate, true_positive_rate)
        Result : AUC = 0.6763

Calculate Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score([ 1 if p > 0.5 else 0 for p in y_pred ], y_test)

        Decision Tree Model

Decision trees can have a continuous or categorical target variable. When it is continuous, the tree is called a regression tree; when it is categorical, a classification tree. At each step the algorithm selects the variable that best splits the set of values. There are several ways to find the best split; some of them are Gini, entropy, C4.5 and chi-square. Decision trees have several advantages: they are simple to use and easy to understand, they require very few data preparation steps, and they can handle mixed data - both categorical and continuous variables. In terms of speed, it is a very fast algorithm.

        #Drop Intercept from predictors for tree algorithms
        X_train = X_train.drop(['Intercept'], axis = 1)
        X_test = X_test.drop(['Intercept'], axis = 1)

        #Decision Tree
        from sklearn.tree import DecisionTreeClassifier
        model_tree = DecisionTreeClassifier(max_depth=7)

        #Fit the model:
        model_tree.fit(X_train,y_train)

        #Make predictions on test set
        predictions_tree = model_tree.predict_proba(X_test)

        #AUC
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
        auc(false_positive_rate, true_positive_rate)
        Result : AUC = 0.664

        Important Note
        Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select best parameters by using grid search fine tuning technique.

        Random Forest Model

A decision tree has the limitation of overfitting, which implies it does not generalize patterns well. It is very sensitive to small changes in the training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomised data and selects a random subset of variables to grow each tree. It is a more robust algorithm than a decision tree. It is one of the most popular machine learning algorithms and is commonly used in data science competitions, where it is always ranked in the top 5 algorithms. It has become a part of every data science toolkit.

        #Random Forest
        from sklearn.ensemble import RandomForestClassifier
        model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

        #Fit the model:
        target = y_train['admit']
        model_rf.fit(X_train,target)

        #Make predictions on test set
        predictions_rf = model_rf.predict_proba(X_test)

        #AUC
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
        auc(false_positive_rate, true_positive_rate)

        #Variable Importance
        importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
        print(importances)
        importances.plot.bar()

        Result : AUC = 0.6974

        Grid Search - Hyper Parameters Tuning

The sklearn library makes hyper-parameter tuning very easy. It is a strategy to select the best parameters for an algorithm. In scikit-learn they are passed as arguments to the constructor of the estimator classes, for example max_features in random forest or alpha for lasso.

        from sklearn.model_selection import GridSearchCV
        rf = RandomForestClassifier()
        target = y_train['admit']

        param_grid = {
        'n_estimators': [100, 200, 300],
        'max_features': ['sqrt', 3, 4]
        }

        CV_rfc = GridSearchCV(estimator=rf , param_grid=param_grid, cv= 5, scoring='roc_auc')
        CV_rfc.fit(X_train,target)

#Parameters with Scores (cv_results_ replaced the older grid_scores_ attribute)
CV_rfc.cv_results_

        #Best Parameters
        CV_rfc.best_params_
        CV_rfc.best_estimator_

        #Make predictions on test set
        predictions_rf = CV_rfc.predict_proba(X_test)

        #AUC
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
        auc(false_positive_rate, true_positive_rate)

        Cross Validation
        # Cross Validation
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_predict,cross_val_score
        target = y['admit']
        prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
        #AUC
        cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')

        Data Mining : PreProcessing Steps

1. The machine learning package sklearn requires all categorical variables in numeric form. Hence, we need to convert all character/categorical variables to numeric. This can be accomplished using the following script; sklearn's LabelEncoder already handles this step.

from sklearn.preprocessing import LabelEncoder

def ConverttoNumeric(df):
    # pick all character/categorical columns
    cols = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for i in cols:
        try:
            df[i] = le.fit_transform(df[i])
        except:
            print('Error in Variable :' + i)
    return df

ConverttoNumeric(df)
        Encoding

        2. Create Dummy Variables

Suppose you want to convert categorical variables into dummy variables. This differs from the previous example in that it creates dummy variables instead of converting the values to numeric form.
        productcode_dummy = pd.get_dummies(df["productcode"])
        df2 = pd.concat([df, productcode_dummy], axis=1)

        The output looks like below -
           AA  BB
        0 1 0
        1 1 0
        2 1 0
        3 0 1
        4 0 1
        5 0 1

        Create k-1 Categories

To avoid multi-collinearity, you can set one of the categories as the reference category and drop it while creating dummy variables. In the script below, we drop the first category.
        productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
        df2 = pd.concat([df, productcode_dummy], axis=1)

        3. Impute Missing Values

Imputing missing values is an important step of predictive modeling. In many algorithms, if missing values are not filled, the complete row is removed. If data contains a lot of missing values, this can lead to huge data loss. There are multiple ways to impute missing values; some common techniques are to replace a missing value with the mean, median or zero. It makes sense to replace a missing value with 0 when 0 is meaningful, for example a flag for whether a customer holds a credit card product.

        Fill missing values of a particular variable
        # fill missing values with 0
        df['var1'] = df['var1'].fillna(0)
        # fill missing values with mean
        df['var1'] = df['var1'].fillna(df['var1'].mean())

        Apply imputation to the whole dataset
        from sklearn.preprocessing import Imputer

        # Set an imputer object
        mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

# Train the imputer
        mean_imputer = mean_imputer.fit(df)

        # Apply imputation
        df_new = mean_imputer.transform(df.values)

        4. Outlier Treatment

        There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -
1. Cap extreme values at the 95th / 99th percentile depending on distribution (see the sketch after the log-transformation example below)
        2. Apply log transformation of variables. See below the implementation of log transformation in Python.
        import numpy as np
        df['var1'] = np.log(df['var1'])
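For the first method, a minimal sketch of capping at the 99th percentile with NumPy (var1 is the same illustrative variable as above):
upper = np.percentile(df['var1'], 99)          # 99th percentile cutoff
df['var1'] = np.clip(df['var1'], None, upper)  # cap extreme values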

        5. Standardization

In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of making the mean of a variable zero and its standard deviation (and hence variance) one.

#load dataset
from sklearn.datasets import load_boston
dataset = load_boston()
        predictors = dataset.data
        target = dataset.target
        df = pd.DataFrame(predictors, columns = dataset.feature_names)

        #Apply Standardization
        from sklearn.preprocessing import StandardScaler
        k = StandardScaler()
        df2 = k.fit_transform(df)


        Next Steps

Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites, play around with the data, generate insights from it with the pandas package and build statistical models using the sklearn package. I hope you found this tutorial helpful. I tried to cover all the important topics which a beginner must know about Python. After completing this tutorial, you can flaunt that you know how to program in Python and that you can implement machine learning algorithms using the sklearn package.

        April 19, 2019 07:18 AM UTC

        Install Python Package

Python is one of the most popular programming languages for data science and analytics. It is widely used for a variety of tasks in startups and in many multi-national organizations. The beauty of this language is that it is open source, which means it is available for free and has a very active community of developers across the world. Python developers share their solutions in the form of packages or modules with other Python users. This tutorial explains various ways to install a Python package.

        Ways to Install Python Package


        Method 1 : If Anaconda is already installed on your System

Anaconda is the data science platform which comes with pre-installed popular Python packages and a powerful IDE (Spyder) with a user-friendly interface that eases writing Python scripts.

        If Anaconda is installed on your system (laptop), click on Anaconda Prompt as shown in the image below.

        Anaconda Prompt

        To install a python package or module, enter the code below in Anaconda Prompt -
        pip install package-name
        Install Python Package using PIP Windows

        Method 2 : Without Anaconda


        1. Open the RUN box using the shortcut Windows Key + R

        2. Enter cmd in the RUN box
        Command Prompt

        Once you press OK, the command prompt screen will appear.



        3. Search for the folder named Scripts, which is where pip applications are stored.

        Scripts Folder

        4. In command prompt, type cd <file location of Scripts folder>

        cd refers to change directory.

        For example, folder location is C:\Users\DELL\Python37\Scripts so you need to enter the following line in command prompt :
        cd C:\Users\DELL\Python37\Scripts 

        Change Directory

        5. Type pip install package-name

        Install Package via PIP command prompt


        Method 3 : Install Python Package from IPython console

        Make sure to use ! before pip when you enter the command below in the IPython console window. Otherwise it will return a syntax error.
        !pip install package_name
        The ! prefix tells Python to run a shell command.


        Syntax Error : Installing Package using PIP

        Some users face the error "SyntaxError: invalid syntax" when installing packages. To work around this issue, run the command below in the command prompt -
        python -m pip install package-name
        python -m pip tells Python to run the pip module as a script.

        Install Specific Versions of Python Package
        python -m pip install Packagename==1.3     # specific version
        python -m pip install "Packagename>=1.3"  # version greater than or equal to 1.3

        How to load or import package or module

        Once a package is installed, the next step is to put it to use, which means importing it. There are several ways to load a package or module in Python :

        1. import math loads the module math. Then you can use any function defined in the math module via math.function. See the example below -
        import math
        math.sqrt(4)

        2. from math import * loads the module math and brings all of its names into the current namespace, so you don't need to prefix functions with the module name. (Wildcard imports are generally discouraged outside quick experiments, since they can shadow existing names.)
        from math import *
        sqrt(4)

        3. from math import sqrt, cos imports the selected functions of the module math.

        4. import math as m imports the math module under the alias m.
        import math as m
        m.sqrt(4)
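        If you are not sure what a module provides, the built-in dir() function lists its names:
        import math
        print(dir(math))  # shows available functions and constants, e.g. 'sqrt', 'pi'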

        Other Useful Commands
        Description                           Command
        To uninstall a package                pip uninstall package
        To upgrade a package                  pip install --upgrade package
        To search a package                   pip search "package-name"
        To check all the installed packages   pip list
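        Two more commands worth knowing: pip freeze records the exact versions of everything installed, and pip install -r replays that list on another machine -
        pip freeze > requirements.txt      # save installed packages with versions
        pip install -r requirements.txt    # install everything listed in the file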

        April 19, 2019 07:16 AM UTC


        Codementor

        Why Django Is The Popular Python Framework Among Web Developers?

        There are a lot of advantages to web development using Python's Django framework: it works well for small projects, and it offers better security with less effort and less money invested in a project.

        April 19, 2019 06:27 AM UTC


        Vasudev Ram

        Python's dynamic nature: sticking an attribute onto an object


        - By Vasudev Ram - Online Python training / SQL training / Linux training



        Hi, readers,

        [This is a beginner-level Python post.]

        Python, being a dynamic language, has some interesting features that some static languages may not have (and vice versa too, of course).

        One such feature, which I noticed a while ago, is that you can add an attribute to a Python object even after it has been created. (Conditions apply.)

        I had used this feature some time ago to work around some implementation issue in a rudimentary RESTful server that I created as a small teaching project. It was based on the BaseHTTPServer module.

        Here is a (different) simple example program, stick_attrs_onto_obj.py, that demonstrates this Python feature.
        My informal term for this feature is "sticking an attribute onto an object" after the object is created.

        Since the program is simple, and there are enough comments in the code, I will not explain it in detail.
        # stick_attrs_onto_obj.py

        # A program to show:
        # 1) that you can "stick" attributes onto a Python object after it is created, and
        # 2) one use of this technique, to count the number of calls to a function.

        # Copyright 2019 Vasudev Ram
        # Web site: https://vasudevram.github.io
        # Blog: https://jugad2.blogspot.com
        # Training: https://jugad2.blogspot.com/p/training.html
        # Product store: https://gumroad.com/vasudevram
        # Twitter: https://twitter.com/vasudevram

        from __future__ import print_function

        # Define a function.
        def foo(arg):
            # Print something to show that the function has been called.
            print("in foo: arg = {}".format(arg))
            # Increment the "stuck-on" int attribute inside the function.
            foo.call_count += 1

        # A function is also an object in Python.
        # So we can add attributes to it, including after it is defined.
        # I call this "sticking" an attribute onto the function object.
        # The statement below defines the attribute with an initial value,
        # which is changeable later, as we will see.
        foo.call_count = 0

        # Print its initial value before any calls to the function.
        print("foo.call_count = {}".format(foo.call_count))

        # Call the function a few times.
        for i in range(5):
            foo(i)

        # Print the attribute's value after those calls.
        print("foo.call_count = {}".format(foo.call_count))

        # Call the function a few more times.
        for i in range(3):
            foo(i)

        # Print the attribute's value after those additional calls.
        print("foo.call_count = {}".format(foo.call_count))

        And here is the output of the program:
        $ python stick_attrs_onto_obj.py
        foo.call_count = 0
        in foo: arg = 0
        in foo: arg = 1
        in foo: arg = 2
        in foo: arg = 3
        in foo: arg = 4
        foo.call_count = 5
        in foo: arg = 0
        in foo: arg = 1
        in foo: arg = 2
        foo.call_count = 8

        There may be other ways to get the call count of a function, including using a profiler, and maybe by using a closure or decorator or other way. But this way is really simple. And as you can see from the code, it is also possible to use it to find the number of calls to the function, between any two points in the program code. For that, we just have to store the call count in a variable at the first point, and subtract that value from the call count at the second point. In the above program, that would be 8 - 5 = 3, which matches the 3 that is the number of calls to function foo made by the 2nd for loop.
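        For comparison, here is a quick sketch of the decorator approach alluded to above; the counter is still an attribute stuck onto a function object, just onto the wrapper rather than the original function:
        import functools

        def count_calls(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                wrapper.call_count += 1
                return func(*args, **kwargs)
            # Stick the counter onto the wrapper function object.
            wrapper.call_count = 0
            return wrapper

        @count_calls
        def bar(arg):
            print("in bar: arg = {}".format(arg))

        for i in range(4):
            bar(i)
        print(bar.call_count)  # prints 4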

        Enjoy.

        - Vasudev Ram - Online Python training and consulting

        I conduct online courses on Python programming, Unix / Linux commands and shell scripting and SQL programming and database design, with course material and personal coaching sessions.

        The course details and testimonials are here.

        Contact me for details of course content, terms and schedule.

        April 19, 2019 03:38 AM UTC

        April 18, 2019


        PyCharm

        PyCharm at PyCon 2019: The Big Tent

        Last week we announced our “big tent” at PyCon 2019 with the blog post PyCharm Hosts Python Content Creators at Expanded PyCon Booth. Next week we’ll announce more on each individual piece.

        Today, let’s do an overview of the kinds of activities in “the big tent.”

        Workshops

        Miguel Grinberg, one of the boothmates, is doing his First Steps in Web Development With Python tutorial Thursday morning, 9AM to 12:20. He’s fantastic at this and a real icon of PyCon tutorials over the years.

        Thursday afternoon at 3:30 I’m doing 42 PyCharm Tips and Tricks in Room 13. It’s a hands-on workshop with a secret twist which I’ll reveal at the event (and after). We’ll have some of the PyCharm team with me to help folks in the audience with questions.

        Reception

        PyCon’s opening reception starts at 5:30PM on the show floor. It’s got food, it’s got drinks, it’s got…our packed booth with lots of stuff going on. Come meet ten of us from the PyCharm team, along with the Content Creators: Michael Kennedy, Brian Okken, Dan Bader, Miguel Grinberg, Matt Harrison, Anthony Shaw, Luciano Ramalho, Bob Belderbos, Julian Sequeira, and Chris Medina. Perhaps even a FLUFL sighting.

        We’ll also have activities running in the mini-theater throughout the conference; more on that below.

        PyCharm Stand

        Come meet the PyCharm team! We’ll have ten of us, most from the core team. We go to events not to do sales but to listen. (Some might say, face the consequences of our decisions.) Want to talk to the main developer of our debugger? She’s there. Ditto for the new Jupyter support, vim emulation, etc.

        Or if you just want to say hi, then please come by, take a picture and tweet it, and get a retweet from us.

        Content Creators Stands

        Podcasts, articles, video courses and training, books…as the previous article mentioned, we have a home for many of the key Python “content creators” to share a presence, use the mini-theater and one-on-one space, and just hang out and have fun.

        There are two stands for them to share in timeslots throughout the conference. We’ll make the schedule available closer to PyCon. But they’ll all be around for the reception.

        Mini-Theater

        This is the second big addition this year: booth space for small talks, both scheduled and impromptu, by the PyCharm team, the Content Creators, and even by some others. We’ll announce this in detail later.

        Not just talks…we’ll announce some special events as well.

        One-on-Ones

        “Can you take a look at my project?” We get this a lot at conferences, as well as “I’m really interested in the new Jupyter support”, or “I heard your pytest support is really neat, can you show me?”

        The PyCharm booth will have a dedicated area along with the conference miracle of seating, where we can work one-on-one. Bring your laptop “into the shop” for diagnosis. Show us some big idea you’ve been working on. Get a tour of some PyCharm feature that interests you, from the person that implemented it.

        This applies to the Content Creators as well. Saw an article or listened to a podcast and want more? Pick a time to meet up with them in the one-on-one area. Did I mention seating?

        Videography

        We have a crew hanging around different times at the booth, doing interviews and producing clips. If you’re around and want to give a shoutout to PyCon for the hard (volunteer!) work putting on a great show, let’s get you on camera.

        April 18, 2019 08:08 PM UTC


        Mike Driscoll

        Mozilla Announces Pyodide – Python in the Browser

        Mozilla announced a new project called Pyodide earlier this week. The aim of Pyodide is to bring Python’s scientific stack into the browser.

        The Pyodide project will give you a full, standard Python interpreter that runs in your browser and also gives you access to the browser’s Web APIs. Currently, Pyodide does not support threading or networking sockets. Python is also quite a bit slower to run in the browser, although it is usable for interactive exploration.

        The article mentions other projects, such as Brython and Skulpt. These projects are rewrites of Python’s interpreter in JavaScript. Their disadvantage compared to Pyodide is that they cannot use Python extensions written in C, such as NumPy or Pandas. Pyodide overcomes this issue.

        Anyway, this sounds like a really interesting project. I always thought the demos I used to see of Python running in Silverlight in the browser were cool. That project is basically dead at this point, but Pyodide sounds like a really interesting new hack at getting Python into the browser. Hopefully it will go somewhere.

        April 18, 2019 08:02 PM UTC

        Creating a GUI Application for NASA’s API with wxPython

        Growing up, I have always found the universe and space in general to be exciting. It is fun to dream about what worlds remain unexplored. I also enjoy seeing photos from other worlds or thinking about the vastness of space. What does this have to do with Python though? Well, the National Aeronautics and Space Administration (NASA) has a web API that allows you to search their image library.

        You can read all about it on their website.

        The NASA website recommends getting an Application Programming Interface (API) key. If you go to that website, the form that you will fill out is nice and short.

        Technically, you do not need an API key to make requests against NASA’s services. However they do have rate limiting in place for developers who access their site without an API key. Even with a key, you are limited to a default of 1000 requests per hour. If you go over your allocation, you will be temporarily blocked from making requests. You can contact NASA to request a higher rate limit though.

        Interestingly, the documentation doesn’t really say how many requests you can make without an API key.

        The API documentation disagrees with NASA’s Image API documentation about which endpoints to hit, which makes working with their website a bit confusing.

        For example, you will see the API documentation talking about this URL:

        • https://api.nasa.gov/planetary/apod?api_key=API_KEY_GOES_HERE

        But in the Image API documentation, the API root is:

        • https://images-api.nasa.gov

        For the purposes of this tutorial, you will be using the latter.

        This article is adapted from my book:

        Creating GUI Applications with wxPython

        Purchase now on Leanpub


        Using NASA’s API

        When you start out using an unfamiliar API, it is always best to begin by reading the documentation for that interface. Another approach would be to do a quick Internet search and see if there is a Python package that wraps your target API. Unfortunately, there do not seem to be any maintained NASA libraries for Python. When this happens, you get to create your own.

        To get started, try reading the NASA Images API document.

        Their API documentation isn’t very long, so it shouldn’t take you very long to read or at least skim it.

        The next step is to take that information and try playing around with their API.

        Here are the first few lines of an experiment at accessing their API:

        # simple_api_request.py
         
        import requests
         
        from urllib.parse import urlencode, quote_plus
         
         
        base_url = 'https://images-api.nasa.gov/search'
        search_term = 'apollo 11'
        desc = 'moon landing'
        media = 'image'
        query = {'q': search_term, 'description': desc, 'media_type': media}
        full_url = base_url + '?' + urlencode(query, quote_via=quote_plus)
         
        r = requests.get(full_url)
        data = r.json()

        If you run this in a debugger, you can print out the JSON that is returned.

        Here is a snippet of what was returned:

        'items': [{'data': 
                      [{'center': 'HQ',
                         'date_created': '2009-07-18T00:00:00Z',
                         'description': 'On the eve of the '
                                        'fortieth anniversary of '
                                        "Apollo 11's first human "
                                        'landing on the Moon, '
                                        'Apollo 11 crew member, '
                                        'Buzz Aldrin speaks during '
                                        'a lecture in honor of '
                                        'Apollo 11 at the National '
                                        'Air and Space Museum in '
                                        'Washington, Sunday, July '
                                        '19, 2009. Guest speakers '
                                        'included Former NASA '
                                        'Astronaut and U.S. '
                                        'Senator John Glenn, NASA '
                                        'Mission Control creator '
                                        'and former NASA Johnson '
                                        'Space Center director '
                                        'Chris Kraft and the crew '
                                        'of Apollo 11.  Photo '
                                        'Credit: (NASA/Bill '
                                        'Ingalls)',
                         'keywords': ['Apollo 11',
                                      'Apollo 40th Anniversary',
                                      'Buzz Aldrin',
                                      'National Air and Space '
                                      'Museum (NASM)',
                                      'Washington, DC'],
                         'location': 'National Air and Space '
                                     'Museum',
                         'media_type': 'image',
                         'nasa_id': '200907190008HQ',
                         'photographer': 'NASA/Bill Ingalls',
                         'title': 'Glenn Lecture With Crew of '
                                  'Apollo 11'}],
               'href': 'https://images-assets.nasa.gov/image/200907190008HQ/collection.json',
               'links': [{'href': 'https://images-assets.nasa.gov/image/200907190008HQ/200907190008HQ~thumb.jpg',
                          'rel': 'preview',
                          'render': 'image'}]}

        Now that you know what the format of the JSON is, you can try parsing it a bit.

        Let’s add the following lines of code to your Python script:

        item = data['collection']['items'][0]
        nasa_id = item['data'][0]['nasa_id']
        asset_url = 'https://images-api.nasa.gov/asset/' + nasa_id
        image_request = requests.get(asset_url)
        image_json = image_request.json()
        image_urls = [url['href'] for url in image_json['collection']['items']]
        print(image_urls)

        This will extract the first item in the list of items from the JSON response. Then you can extract the nasa_id, which is required to get all the images associated with this particular result. Now you can add that nasa_id to a new URL end point and make a new request.

        The request for the image JSON returns this:

        {'collection': {'href': 'https://images-api.nasa.gov/asset/200907190008HQ',
                        'items': [{'href': 'http://images-assets.nasa.gov/image/200907190008HQ/200907190008HQ~orig.tif'},
                                  {'href': 'http://images-assets.nasa.gov/image/200907190008HQ/200907190008HQ~large.jpg'},
                                  {'href': 'http://images-assets.nasa.gov/image/200907190008HQ/200907190008HQ~medium.jpg'},
                                  {'href': 'http://images-assets.nasa.gov/image/200907190008HQ/200907190008HQ~small.jpg'},
                                  {'href': 'http://images-assets.nasa.gov/image/200907190008HQ/200907190008HQ~thumb.jpg'},
                                  {'href': 'http://images-assets.nasa.gov/image/200907190008HQ/metadata.json'}],
                        'version': '1.0'}}

        The last two lines in your Python code will extract the URLs from the JSON. Now you have all the pieces you need to write a basic user interface!


        Designing the User Interface

        There are many different ways you could design your image downloading application. You will be doing what is simplest as that is almost always the quickest way to create a prototype. The nice thing about prototyping is that you end up with all the pieces you will need to create a useful application. Then you can take your knowledge and either enhance the prototype or create something new with the knowledge you have gained.

        Here’s a mockup of what you will be attempting to create:

        NASA Image Search Mockup

        As you can see, you will want an application with the following features:

        • A search bar
        • A widget to hold the search results
        • A way to display an image when a result is chosen
        • The ability to download the image

        Let’s learn how to create this user interface now!


        Creating the NASA Search Application

        Rapid prototyping is an idea in which you will create a small, runnable application as quickly as you can. Rather than spending a lot of time getting all the widgets laid out, let’s add them from top to bottom in the application. This will give you something to work with more quickly than creating a series of nested sizers will.

        Let’s start by creating a script called nasa_search_ui.py:

        # nasa_search_ui.py
         
        import os
        import requests
        import wx
         
        from download_dialog import DownloadDialog
        from ObjectListView import ObjectListView, ColumnDefn
        from urllib.parse import urlencode, quote_plus

        Here you import a few new items that you haven’t seen yet. The first is the requests package. This is a handy package for downloading files and doing things on the Internet with Python. Many developers feel that it is better than Python’s own urllib. You will need to install it to use it though. You will also need to install ObjectListView.

        Here is how you can do that with pip:

        pip install requests ObjectListView

        The other new piece is the pair of imports from urllib.parse. You will be using this module for encoding URL parameters. Lastly, the DownloadDialog is a class for a small dialog that you will be creating for downloading NASA images.

        Since you will be using ObjectListView in this application, you will need a class to represent the objects in that widget:

        class Result:
         
            def __init__(self, item):
                data = item['data'][0]
                self.title = data['title']
                self.location = data.get('location', '')
                self.nasa_id = data['nasa_id']
                self.description = data['description']
                self.photographer = data.get('photographer', '')
                self.date_created = data['date_created']
                self.item = item
         
                if item.get('links'):
                    try:
                        self.thumbnail = item['links'][0]['href']
                    except:
                        self.thumbnail = ''

        The Result class is what you will be using to hold the data that makes up each row in your ObjectListView. The item parameter is a portion of the JSON that you receive from NASA as a response to your query. In this class, you will need to parse out the information you require.

        In this case, you want the following fields:

        • Title
        • Location of image
        • NASA’s internal ID
        • Description of the photo
        • The photographer’s name
        • The date the image was created
        • The thumbnail URL

        Some of these items aren’t always included in the JSON response, so you will use the dictionary’s get() method to return an empty string in those cases.

        Now let’s start working on the UI:

        class MainPanel(wx.Panel):
         
            def __init__(self, parent):
                super().__init__(parent)
                self.search_results = []
                self.max_size = 300
                self.paths = wx.StandardPaths.Get()
                font = wx.Font(12, wx.SWISS, wx.NORMAL, wx.NORMAL)
         
                main_sizer = wx.BoxSizer(wx.VERTICAL)

        The MainPanel is where the bulk of your code will be. Here you do some housekeeping and create search_results to hold a list of Result objects when the user does a search. You also set the max_size of the thumbnail image and the font to be used, create the sizer, and get some StandardPaths as well.

        Now let’s add the following code to the __init__():

        txt = 'Search for images on NASA'
        label = wx.StaticText(self, label=txt)
        main_sizer.Add(label, 0, wx.ALL, 5)
        self.search = wx.SearchCtrl(
            self, style=wx.TE_PROCESS_ENTER, size=(-1, 25))
        self.search.Bind(wx.EVT_SEARCHCTRL_SEARCH_BTN, self.on_search)
        self.search.Bind(wx.EVT_TEXT_ENTER, self.on_search)
        main_sizer.Add(self.search, 0, wx.EXPAND)

        Here you create a header label for the application using wx.StaticText. Then you add a wx.SearchCtrl, which is very similar to a wx.TextCtrl except that it has special buttons built into it. You also bind the search button’s click event (EVT_SEARCHCTRL_SEARCH_BTN) and EVT_TEXT_ENTER to a search related event handler (on_search).

        The next few lines add the search results widget:

        self.search_results_olv = ObjectListView(
            self, style=wx.LC_REPORT | wx.SUNKEN_BORDER)
        self.search_results_olv.SetEmptyListMsg("No Results Found")
        self.search_results_olv.Bind(wx.EVT_LIST_ITEM_SELECTED,
                                     self.on_selection)
        main_sizer.Add(self.search_results_olv, 1, wx.EXPAND)
        self.update_search_results()

        This code sets up the ObjectListView in much the same way as some of my other articles use it. You customize the empty message by calling SetEmptyListMsg() and you also bind the widget to EVT_LIST_ITEM_SELECTED so that you do something when the user selects a search result.

        Now let’s add the rest of the code to the __init__() method:

        main_sizer.AddSpacer(30)
        self.title = wx.TextCtrl(self, style=wx.TE_READONLY)
        self.title.SetFont(font)
        main_sizer.Add(self.title, 0, wx.ALL|wx.EXPAND, 5)
        img = wx.Image(240, 240)
        self.image_ctrl = wx.StaticBitmap(self,
                                          bitmap=wx.Bitmap(img))
        main_sizer.Add(self.image_ctrl, 0, wx.CENTER|wx.ALL, 5
                       )
        download_btn = wx.Button(self, label='Download Image')
        download_btn.Bind(wx.EVT_BUTTON, self.on_download)
        main_sizer.Add(download_btn, 0, wx.ALL|wx.CENTER, 5)
         
        self.SetSizer(main_sizer)

        These final few lines of code add a title text control and an image widget that will update when a result is selected. You also add a download button to allow the user to select which image size they would like to download. NASA usually gives several different versions of the image from thumbnail all the way up to the original TIFF image.

        The first event handler to look at is on_download():

        def on_download(self, event):
            selection = self.search_results_olv.GetSelectedObject()
            if selection:
                with DownloadDialog(selection) as dlg:
                    dlg.ShowModal()

        Here you call GetSelectedObject() to get the user’s selection. If the user hasn’t selected anything, then this method exits. On the other hand, if the user has selected an item, then you instantiate the DownloadDialog and show it to the user to allow them to download something.

        Now let’s learn how to do a search:

        def on_search(self, event):
            search_term = event.GetString()
            if search_term:
                query = {'q': search_term, 'media_type': 'image'}
                full_url = base_url + '?' + urlencode(query, quote_via=quote_plus)
                r = requests.get(full_url)
                data = r.json()
                self.search_results = []
                for item in data['collection']['items']:
                    if item.get('data') and len(item.get('data')) > 0:
                        data = item['data'][0]
                        if data['title'].strip() == '':
                            # Skip results with blank titles
                            continue
                        result = Result(item)
                        self.search_results.append(result)
                self.update_search_results()

        The on_search() event handler gets the string that the user has entered into the search control, or an empty string if nothing was typed. Assuming that the user actually enters something to search for, you use NASA’s general search query parameter, q, and hard-code the media_type to image. Then you encode the query into a properly formatted URL and use requests.get() to request a JSON response. Note that base_url here is the same search endpoint you defined in simple_api_request.py earlier (https://images-api.nasa.gov/search).

        Next you attempt to loop over the results of the search. Note that if no data is returned, this code will fail and cause an exception to be thrown. But if you do get data, then you will need to parse it to get the bits and pieces you need.

        You will skip items that don’t have the title field set. Otherwise you will create a Result object and add it to the search_results list. At the end of the method, you tell your UI to update the search results.

        Before we get to that function, you will need to create on_selection():

        def on_selection(self, event):
            selection = self.search_results_olv.GetSelectedObject()
            self.title.SetValue(f'{selection.title}')
            if selection.thumbnail:
                self.update_image(selection.thumbnail)
            else:
                img = wx.Image(240, 240)
                self.image_ctrl.SetBitmap(wx.Bitmap(img))
                self.Refresh()
                self.Layout()

        Once again, you get the selected item, but this time you take that selection and update the title text control with the selection’s title text. Then you check to see if there is a thumbnail and update that accordingly if there is one. When there is no thumbnail, you set it back to an empty image as you do not want it to keep showing a previously selected image.

        The next method to create is update_image():

        def update_image(self, url):
            filename = url.split('/')[-1]
            tmp_location = os.path.join(self.paths.GetTempDir(), filename)
            r = requests.get(url)
            with open(tmp_location, "wb") as thumbnail:
                thumbnail.write(r.content)
         
            if os.path.exists(tmp_location):
                img = wx.Image(tmp_location, wx.BITMAP_TYPE_ANY)
                W = img.GetWidth()
                H = img.GetHeight()
                if W > H:
                    # Scale() needs ints; in Python 3, / returns a float
                    NewW = self.max_size
                    NewH = int(self.max_size * H / W)
                else:
                    NewH = self.max_size
                    NewW = int(self.max_size * W / H)
                img = img.Scale(NewW, NewH)
            else:
                img = wx.Image(240, 240)
         
            self.image_ctrl.SetBitmap(wx.Bitmap(img))
            self.Refresh()
            self.Layout()

        The update_image() method accepts a URL as its sole argument. It takes this URL and splits off the filename. Then it creates a new download location, which is the computer’s temp directory. Your code then downloads the image and checks to be sure the file saved correctly. If it did, then the thumbnail is loaded and scaled using the max_size that you set; otherwise you set it to use a blank image.

        The last couple of lines Refresh() and Layout() the panel so that the widgets appear correctly.

        Finally you need to create the last method:

        def update_search_results(self):
            self.search_results_olv.SetColumns([
                ColumnDefn("Title", "left", 250, "title"),
                ColumnDefn("Description", "left", 350, "description"),
                ColumnDefn("Photographer", "left", 100, "photographer"),
                ColumnDefn("Date Created", "left", 150, "date_created")
            ])
            self.search_results_olv.SetObjects(self.search_results)

        Here you define the columns for the ObjectListView widget, giving each one a title, an alignment, a width, and the name of the Result attribute it displays. Then you load the search results into the widget by calling SetObjects().

        This is what the main UI will look like:

        NASA Image Search Main App

        Now let’s learn what goes into making a download dialog!


        The Download Dialog

        The download dialog will allow the user to download one or more of the images that they have selected. There are almost always at least two versions of every image and sometimes five or six.

        The first piece of code to learn about is the first few lines:

        # download_dialog.py
         
        import requests
        import wx
         
        wildcard = "All files (*.*)|*.*"

        Here you once again import requests and set up a wildcard that you will use when saving the images.

        Now let’s create the dialog’s __init__():

        class DownloadDialog(wx.Dialog):
         
            def __init__(self, selection):
                super().__init__(None, title='Download images')
                self.paths = wx.StandardPaths.Get()
                main_sizer = wx.BoxSizer(wx.VERTICAL)
                self.list_box = wx.ListBox(self, choices=[], size=wx.DefaultSize)
                urls = self.get_image_urls(selection)
                if urls:
                    choices = {url.split('/')[-1]: url for url in urls if 'jpg' in url}
                    for choice in choices:
                        self.list_box.Append(choice, choices[choice])
                main_sizer.Add(self.list_box, 1, wx.EXPAND|wx.ALL, 5)
         
                save_btn = wx.Button(self, label='Save')
                save_btn.Bind(wx.EVT_BUTTON, self.on_save)
                main_sizer.Add(save_btn, 0, wx.ALL|wx.CENTER, 5)
                self.SetSizer(main_sizer)

        In this example, you create a new reference to StandardPaths and add a wx.ListBox. The list box will hold the variants of the photos that you can download. It will also automatically add a scrollbar should there be too many results to fit on-screen at once. You call get_image_urls() with the passed-in selection object to get a list of URLs. Then you loop over the URLs and keep the ones that have jpg in their name. This does mean you miss out on alternate image file types, such as PNG or TIFF.

        This gives you an opportunity to enhance this code and improve it. The reason that you are filtering the URLs is that the results usually have non-image URLs in the mix and you probably don’t want to show those as potentially downloadable as that would be confusing to the user.
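        One possible enhancement, sketched below, is to accept several image extensions rather than only jpg; the particular extension tuple is an assumption about what NASA’s asset lists contain:
        # Keep any URL ending in a common image extension, not just jpg
        image_exts = ('.jpg', '.jpeg', '.png', '.tif', '.tiff')
        choices = {url.split('/')[-1]: url for url in urls
                   if url.lower().endswith(image_exts)}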

        The last widget to be added is the “Save” button. You could add a “Cancel” button as well, but the dialog has an exit button along the top that works, so it’s not required.

        Now it’s time to learn what get_image_urls() does:

        def get_image_urls(self, item):
            asset_url = f'https://images-api.nasa.gov/asset/{item.nasa_id}'
            image_request = requests.get(asset_url)
            image_json = image_request.json()
            try:
                image_urls = [url['href'] for url in image_json['collection']['items']]
            except:
                image_urls = []
            return image_urls

        The get_image_urls() method builds the asset URL from the selected item’s nasa_id, requests it, and extracts the list of image URLs from the JSON, falling back to an empty list if the response doesn’t have the expected shape. The next piece is the on_save() event handler, which is activated when the user presses the “Save” button. When the user tries to save without selecting an item in the list box, GetSelection() returns -1. Should that happen, you show them a MessageDialog to tell them that they might want to select something. When they do select something, you show them a wx.FileDialog that allows them to choose where to save the file and what to call it.
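        The on_save() handler itself isn’t reproduced in this excerpt, so here is a minimal sketch consistent with that description; the exact dialog captions are assumptions, and wildcard is the module-level constant defined at the top of download_dialog.py:
        def on_save(self, event):
            selection = self.list_box.GetSelection()
            if selection == -1:
                # Nothing selected: prompt the user instead of failing silently.
                with wx.MessageDialog(None, message='Please select an image',
                                      caption='No Selection',
                                      style=wx.ICON_WARNING) as dlg:
                    dlg.ShowModal()
                return
            # Ask the user where to save the file and what to call it.
            with wx.FileDialog(self, message='Save image as...',
                               wildcard=wildcard,
                               style=wx.FD_SAVE | wx.FD_OVERWRITE_PROMPT) as dlg:
                if dlg.ShowModal() == wx.ID_OK:
                    self.save(dlg.GetPath())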

        The event handler calls the save() method, so that is your next project:

        def save(self, path):
            selection = self.list_box.GetSelection()
            r = requests.get(
                self.list_box.GetClientData(selection))
            try:
                with open(path, "wb") as image:
                    image.write(r.content)
         
                message = 'File saved successfully'
                with wx.MessageDialog(None, message=message,
                                      caption='Save Successful',
                                      style=wx.ICON_INFORMATION) as dlg:
                    dlg.ShowModal()
            except:
                message = 'File failed to save!'
                with wx.MessageDialog(None, message=message,
                                      caption='Save Failed',
                                      style=wx.ICON_ERROR) as dlg:
                    dlg.ShowModal()

        Here you get the selection again and use the requests package to download the image. Note that there is no check to make sure that the user has added an extension, let alone the right extension. You can add that yourself when you get a chance.

        Anyway, when the file is finished downloading, you will show the user a message letting them know.

        If an exception occurs, you can show them a dialog that lets them know that too!

        Here is what the download dialog looks like:

        NASA Image Download Dialog

        Now let’s add some new functionality!


        Adding Advanced Search

        There are several fields that you can use to help narrow your search. However you don’t want to clutter your user interface with them unless the user really wants to use those filters. To allow for that, you can add an “Advanced Search” option.

        Adding this option requires you to rearrange your code a bit, so let’s copy your nasa_search_ui.py file and your download_dialog.py module to a new folder called version_2.

        Now rename nasa_search_ui.py to main.py to make it more obvious which script is the main entry point for your program. To make things more modular, you will extract the search results into their own class and put the advanced search in a separate class. This means that you will have three panels in the end:

        • The main panel
        • The search results panel
        • The advanced search panel

        Here is what the main dialog will look like when you are finished:

        NASA Image Search with Advanced Search Option

        Let’s go over each of these separately.


        The main.py Script

        The main module is your primary entry point for your application. An entry point is the code that your user will run to launch your application. It is also the script that you would use if you were to bundle up your application into an executable.

        Let’s take a look at how your main module starts out:

        # main.py
         
        import wx
         
        from advanced_search import AdvancedSearch
        from regular_search import RegularSearch
        from pubsub import pub
         
         
        class MainPanel(wx.Panel):
         
            def __init__(self, parent):
                super().__init__(parent)
                pub.subscribe(self.update_ui, 'update_ui')
         
                self.main_sizer = wx.BoxSizer(wx.VERTICAL)
                search_sizer = wx.BoxSizer()

        This example imports both of your search-related panels:

        • AdvancedSearch
        • RegularSearch

        It also uses pubsub to subscribe to an update topic.
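        If you haven’t used pubsub (the PyPubSub package) before, the pattern is tiny: a listener subscribes to a named topic, and sendMessage() calls every listener on that topic. A standalone sketch:
        from pubsub import pub

        def listener(query):
            print('got query:', query)

        pub.subscribe(listener, 'search_results')
        pub.sendMessage('search_results', query={'q': 'apollo 11'})
        # prints: got query: {'q': 'apollo 11'}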

        Let’s find out what else is in the __init__():

        txt = 'Search for images on NASA'
        label = wx.StaticText(self, label=txt)
        self.main_sizer.Add(label, 0, wx.ALL, 5)
        self.search = wx.SearchCtrl(
            self, style=wx.TE_PROCESS_ENTER, size=(-1, 25))
        self.search.Bind(wx.EVT_SEARCHCTRL_SEARCH_BTN, self.on_search)
        self.search.Bind(wx.EVT_TEXT_ENTER, self.on_search)
        search_sizer.Add(self.search, 1, wx.EXPAND)
         
        self.advanced_search_btn = wx.Button(self, label='Advanced Search',
                                    size=(-1, 25))
        self.advanced_search_btn.Bind(wx.EVT_BUTTON, self.on_advanced_search)
        search_sizer.Add(self.advanced_search_btn, 0, wx.ALL, 5)
        self.main_sizer.Add(search_sizer, 0, wx.EXPAND)

        Here you add the title for the page along with the search control widget as you did before. You also add the new Advanced Search button and use a new sizer to contain the search widget and the button. You then add that sizer to your main sizer.

        Now let’s add the panels:

        self.search_panel = RegularSearch(self)
        self.advanced_search_panel = AdvancedSearch(self)
        self.advanced_search_panel.Hide()
        self.main_sizer.Add(self.search_panel, 1, wx.EXPAND)
        self.main_sizer.Add(self.advanced_search_panel, 1, wx.EXPAND)

        In this example, you instantiate the RegularSearch and the AdvancedSearch panels. Since the RegularSearch is the default, you hide the AdvancedSearch from the user on startup.

        Now let’s update on_search():

        def on_search(self, event):
            search_term = event.GetString()
            if search_term:
                query = {'q': search_term, 'media_type': 'image'}
                pub.sendMessage('search_results', query=query)

        The on_search() method will get called when the user presses “Enter / Return” on their keyboard or when they press the search button icon in the search control widget. If the user has entered a search string into the search control, a search query will be constructed and then sent off using pubsub.

        Let’s find out what happens when the user presses the Advanced Search button:

        def on_advanced_search(self, event):
            self.search.Hide()
            self.search_panel.Hide()
            self.advanced_search_btn.Hide()
            self.advanced_search_panel.Show()
            self.main_sizer.Layout()

        When on_advanced_search() fires, it hides the search widget, the regular search panel and the advanced search button. Next, it shows the advanced search panel and calls Layout() on the main_sizer. This will cause the panels to switch out and resize to fit properly within the frame.

        The last method to create is update_ui():

        def update_ui(self):
            """
            Hide advanced search and re-show original screen
         
            Called by pubsub when advanced search is invoked
            """
            self.advanced_search_panel.Hide()
            self.search.Show()
            self.search_panel.Show()
            self.advanced_search_btn.Show()
            self.main_sizer.Layout()

        The update_ui() method is invoked via pubsub when the user completes an advanced search. It does the reverse of on_advanced_search(): it un-hides all the widgets that were hidden when the advanced search panel was shown, and it hides the advanced search panel itself.

        The frame code is the same as it was before, so it is not shown here.

        Let’s move on and learn how the regular search panel is created!


        The regular_search.py Script

        The regular_search module is your refactored module that contains the ObjectListView that will show your search results. It also has the Download button on it.

        The following methods / classes will not be covered as they are the same as in the previous iteration:

        • on_download()
        • on_selection()
        • update_image()
        • update_search_results()
        • The Result class

        Let’s get started by seeing how the first few lines in the module are laid out:

        # regular_search.py
         
        import os
        import requests
        import wx
         
        from download_dialog import DownloadDialog
        from ObjectListView import ObjectListView, ColumnDefn
        from pubsub import pub
        from urllib.parse import urlencode, quote_plus
         
        base_url = 'https://images-api.nasa.gov/search'

        Here you have all the imports you had in the original nasa_search_ui.py script from version_1. You also have the base_url that you need to make requests to NASA’s image API. The only new import is for pubsub.

        Let’s go ahead and create the RegularSearch class:

        class RegularSearch(wx.Panel):
         
            def __init__(self, parent):
                super().__init__(parent)
                self.search_results = []
                self.max_size = 300
                font = wx.Font(12, wx.SWISS, wx.NORMAL, wx.NORMAL)
                main_sizer = wx.BoxSizer(wx.VERTICAL)
                self.paths = wx.StandardPaths.Get()
                pub.subscribe(self.load_search_results, 'search_results')
         
                self.search_results_olv = ObjectListView(
                    self, style=wx.LC_REPORT | wx.SUNKEN_BORDER)
                self.search_results_olv.SetEmptyListMsg("No Results Found")
                self.search_results_olv.Bind(wx.EVT_LIST_ITEM_SELECTED,
                                             self.on_selection)
                main_sizer.Add(self.search_results_olv, 1, wx.EXPAND)
                self.update_search_results()

        This code will initialize the search_results list to an empty list and set the max_size of the image. It also sets up a sizer and the ObjectListView widget that you use for displaying the search results to the user. The code is actually quite similar to the first iteration of the code when all the classes were combined.

        Here is the rest of the code for the __init__():

        main_sizer.AddSpacer(30)
        self.title = wx.TextCtrl(self, style=wx.TE_READONLY)
        self.title.SetFont(font)
        main_sizer.Add(self.title, 0, wx.ALL|wx.EXPAND, 5)
        img = wx.Image(240, 240)
        self.image_ctrl = wx.StaticBitmap(self,
                                          bitmap=wx.Bitmap(img))
        main_sizer.Add(self.image_ctrl, 0, wx.CENTER|wx.ALL, 5
                       )
        download_btn = wx.Button(self, label='Download Image')
        download_btn.Bind(wx.EVT_BUTTON, self.on_download)
        main_sizer.Add(download_btn, 0, wx.ALL|wx.CENTER, 5)
         
        self.SetSizer(main_sizer)

        The first item here is to add a spacer to the main_sizer. Then you add the title and the img related widgets. The last widget to be added is still the download button.

        Next, you will need to write a new method:

        def reset_image(self):
            img = wx.Image(240, 240)
            self.image_ctrl.SetBitmap(wx.Bitmap(img))
            self.Refresh()

        The reset_image() method resets the wx.StaticBitmap back to an empty image. This matters when the user does a regular search, selects an item, and then decides to do an advanced search. Resetting the image prevents a previously selected thumbnail from lingering on screen and confusing the user.

        The last method you need to add is load_search_results():

        def load_search_results(self, query):
            full_url = base_url + '?' + urlencode(query, quote_via=quote_plus)
            r = requests.get(full_url)
            data = r.json()
            self.search_results = []
            for item in data['collection']['items']:
                if item.get('data') and len(item.get('data')) > 0:
                    data = item['data'][0]
                    if data['title'].strip() == '':
                        # Skip results with blank titles
                        continue
                    result = Result(item)
                    self.search_results.append(result)
            self.update_search_results()
            self.reset_image()

        The load_search_results() method is called using pubsub. Both the main and the advanced_search modules call it by passing in a query dictionary. Then you encode that dictionary into a formatted URL. Next you use requests to send a JSON request and you then extract the results. This is also where you call reset_image() so that when a new set of results loads, there is no result selected.

        Now you are ready to create an advanced search!


        The advanced_search.py Script

        The advanced_search module is a wx.Panel that has all the widgets you need to do an advanced search against NASA’s API. If you read their documentation, you will find that there are around a dozen filters that can be applied to a search.

        Let’s start at the top:

        class AdvancedSearch(wx.Panel):
         
            def __init__(self, parent):
                super().__init__(parent)
         
                self.main_sizer = wx.BoxSizer(wx.VERTICAL)
         
                self.free_text = wx.TextCtrl(self)
                self.ui_helper('Free text search:', self.free_text)
                self.nasa_center = wx.TextCtrl(self)
                self.ui_helper('NASA Center:', self.nasa_center)
                self.description = wx.TextCtrl(self)
                self.ui_helper('Description:', self.description)
                self.description_508 = wx.TextCtrl(self)
                self.ui_helper('Description 508:', self.description_508)
                self.keywords = wx.TextCtrl(self)
                self.ui_helper('Keywords (separate with commas):',
                               self.keywords)

        The code to set up the various filters is all pretty similar. You create a text control for the filter, then you pass it into ui_helper() along with a string that is a label for the text control widget. Repeat until you have all the filters in place.

        Here are the rest of the filters:

        self.location = wx.TextCtrl(self)
        self.ui_helper('Location:', self.location)
        self.nasa_id = wx.TextCtrl(self)
        self.ui_helper('NASA ID:', self.nasa_id)
        self.photographer = wx.TextCtrl(self)
        self.ui_helper('Photographer:', self.photographer)
        self.secondary_creator = wx.TextCtrl(self)
        self.ui_helper('Secondary photographer:', self.secondary_creator)
        self.title = wx.TextCtrl(self)
        self.ui_helper('Title:', self.title)
        search = wx.Button(self, label='Search')
        search.Bind(wx.EVT_BUTTON, self.on_search)
        self.main_sizer.Add(search, 0, wx.ALL | wx.CENTER, 5)
         
        self.SetSizer(self.main_sizer)

        At the end, you set the sizer to the main_sizer. Note that not all the filters that are in NASA’s API are implemented in this code. For example, I didn’t add media_type because this application will be hard-coded to only look for images. However if you wanted audio or video, you could update this application for that. I also didn’t include the year_start and year_end filters. Feel free to add those if you wish.

        Now let’s move on and create the ui_helper() method:

        def ui_helper(self, label, textctrl):
            sizer = wx.BoxSizer()
            lbl = wx.StaticText(self, label=label, size=(150, -1))
            sizer.Add(lbl, 0, wx.ALL, 5)
            sizer.Add(textctrl, 1, wx.ALL | wx.EXPAND, 5)
            self.main_sizer.Add(sizer, 0, wx.EXPAND)

        The ui_helper() takes in label text and the text control widget. It then creates a wx.BoxSizer and a wx.StaticText. The wx.StaticText is added to the sizer, as is the passed-in text control widget. Finally the new sizer is added to the main_sizer and then you’re done. This is a nice way to reduce repeated code.

        The last item to create in this class is on_search():

        def on_search(self, event):
            query = {'q': self.free_text.GetValue(),
                     'media_type': 'image',
                     'center': self.nasa_center.GetValue(),
                     'description': self.description.GetValue(),
                     'description_508': self.description_508.GetValue(),
                     'keywords': self.keywords.GetValue(),
                     'location': self.location.GetValue(),
                     'nasa_id': self.nasa_id.GetValue(),
                     'photographer': self.photographer.GetValue(),
                     'secondary_creator': self.secondary_creator.GetValue(),
                     'title': self.title.GetValue()}
            pub.sendMessage('update_ui')
            pub.sendMessage('search_results', query=query)

        When the user presses the Search button, this event handler gets called. It creates the search query based on what the user has entered into each of the fields. Then the handler will send out two messages using pubsub. The first message will update the UI so that the advanced search is hidden and the search results are shown. The second message will actually execute the search against NASA’s API.
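        One small refinement you could make, since most users will leave several of the fields blank: drop the empty values before sending the query so the encoded URL stays short. This is a sketch, not part of the book’s code:
        # Keep only the filters the user actually filled in
        query = {key: value for key, value in query.items() if value}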

        Here is what the advanced search page looks like:

        NASA Image Search with Advanced Search Page

        Now let’s update the download dialog.


        The download_dialog.py Script

        The download dialog has a couple of minimal changes to it. Basically you need to add an import of Python’s os module and then update the save() function.

        Add the following lines to the beginning of the function:

        def save(self, path):
            _, ext = os.path.splitext(path)
            if ext.lower() != '.jpg':
                path = f'{path}.jpg'

        This code was added to account for the case where the user does not specify the extension of the image in the saved file name.


        Wrapping Up

        This article covered a lot of fun new information. You learned one approach for working with an open API that doesn’t have a Python wrapper already around it. You discovered the importance of reading the API documentation and then added a user interface to that API. Then you learned how to parse JSON and download images from the Internet.

        While it is not covered here, Python has a json module that you could use as well.

        Here are some ideas for enhancing this application:

        • Caching search results
        • Downloading thumbnails in the background
        • Downloading links in the background

        You could use threads to download the thumbnails and the larger images as well as for doing the web requests in general. This would improve the performance of your application. You may have noticed that the application became slightly unresponsive, depending on your Internet connectivity. This is because when it is doing a web request or downloading a file, it blocks the UI’s main loop. You should give threads a try if you find that sort of thing bothersome.
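        As a starting point, a common wxPython pattern is to run the blocking request in a worker thread and marshal the result back to the UI thread with wx.CallAfter; here is a minimal sketch, where the download_async name and on_done callback are illustrative:
        import threading

        import requests
        import wx

        def download_async(url, on_done):
            """Fetch url in a background thread, then call on_done(bytes) on the UI thread."""
            def worker():
                r = requests.get(url)
                # wx widgets must only be touched from the main thread,
                # so hand the result back via wx.CallAfter
                wx.CallAfter(on_done, r.content)
            threading.Thread(target=worker, daemon=True).start()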


        Download the Code


        Related Reading

        April 18, 2019 05:15 PM UTC


        PyPy Development

        PyPy 7.1.1 Bug Fix Release

        The PyPy team is proud to release a bug-fix release version 7.1.1 of PyPy, which includes two different interpreters:
        • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.
        • PyPy3.6-beta: the second official release of PyPy to support 3.6 features.
        The interpreters are based on much the same codebase, thus the double release.

        This bug-fix release addresses bugs related to large lists, dictionaries, and sets, some corner cases with unicode, and PEP 3118 memory views of ctype structures. It also fixes a few issues related to the ARM 32-bit backend. For the complete list, see the changelog.

        You can download the v7.1.1 releases here:
        http://pypy.org/download.html

        As always, this release is 100% compatible with the previous one and fixes several issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.

        The PyPy3.6 release is rapidly maturing, but is still considered beta-quality.

        The PyPy team

        April 18, 2019 04:24 PM UTC