
Planet Python

Last update: September 29, 2020 01:47 PM UTC

September 29, 2020


Mike Driscoll

wxPython by Example – Drag-and-Drop an Image (Video)

In this tutorial, you will learn how to drag an image into your #wxPython application and display it to your user.

If you enjoy this video, you may want to check out my book, Creating GUI Applications with wxPython, available on Leanpub and Amazon.

The post wxPython by Example – Drag-and-Drop an Image (Video) appeared first on The Mouse Vs. The Python.

September 29, 2020 01:24 PM UTC


Codementor

Why use Python Programming for building a Healthcare Application

Python programming in healthcare is changing how doctors and clinicians approach patient care delivery. Here’s why Python for healthcare is the right choice for better health outcomes.

September 29, 2020 01:07 PM UTC

All You Need To Know For Selenium Testing On The Cloud

Selenium testing on the cloud is the most efficient way to scale up automated browser testing. This blog will help you get started with cross browser testing in Selenium.

September 29, 2020 09:58 AM UTC

Sumana Harihareswara is an open-source software fairy... and other things I learned recording her DevJourney

Sumana Harihareswara is an open-source software fairy. After interviewing her for the DevJourney podcast, here are the key takeaways I personally took out of the discussion.

September 29, 2020 07:24 AM UTC


Podcast.__init__

Solving Python Package Creation For End User Applications With PyOxidizer - Episode 282

Summary

Python is a powerful and expressive programming language with a vast ecosystem of incredible applications. Unfortunately, it has always been challenging to share those applications with non-technical end users. Gregory Szorc set out to solve the problem of how to put your code on someone else’s computer and have it run without having to rely on extra systems such as virtualenvs or Docker. In this episode he shares his work on PyOxidizer and how it allows you to build a self-contained Python runtime along with statically linked dependencies and the software that you want to run. He also digs into some of the edge cases in the Python language and its ecosystem that make this a challenging problem to solve, and some of the lessons that he has learned in the process. PyOxidizer is an exciting step forward in the evolution of packaging and distribution for the Python language and community.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host as usual is Tobias Macey and today I’m interviewing Gregory Szorc about his work on PyOxidizer, a revolutionary new approach to building and distributing self-contained Python applications

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by giving an overview on the shortcomings of the current state of the art for distributing Python projects, both for deployment and end-user consumption?
  • What is PyOxidizer and what motivated you to create it?
  • How does PyOxidizer differ from projects such as CxFreeze, Py2Exe, or Shiv?
  • What are the characteristics of CPython and the packaging ecosystem that make it so challenging to easily distribute self-contained applications?
  • For someone using PyOxidizer, what is their workflow for building an executable that they can share with end users?
    • What are some of the edge cases or special considerations that they need to be aware of?
  • How is PyOxidizer implemented?
    • How has the design or direction evolved since you first began working on it?
  • From your experience in working on PyOxidizer, what changes would you like to see in the Python language or the CPython reference implementation?
  • What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on PyOxidizer?
  • What do you have planned for the future of PyOxidizer?
  • What are the ways that listeners can contribute to PyOxidizer?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

September 29, 2020 01:50 AM UTC

September 28, 2020


Made With Mu

Resources: Python for Kids

Friend of Mu, Kevin Thomas, has been hard at work creating free-to-use resources for kids (and older kids) who want to learn Python with the BBC micro:bit and Mu.

Each of the ten resources tackles a different aspect of Python on the micro:bit through illustrated “baby steps” to get you to a finished fun project. Work is ongoing, but you can find all the links and source code on his GitHub page for the project.

Meanwhile, in our secret fortress of solitude, the Mu “minions” (Munions..?) have been hard at work on some fantastic updates which we hope to reveal very soon.

Stay tuned. :-)

September 28, 2020 04:30 PM UTC


Real Python

The Python return Statement: Usage and Best Practices

The Python return statement is a key component of functions and methods. You can use the return statement to make your functions send Python objects back to the caller code. These objects are known as the function’s return value. You can use them to perform further computation in your programs.

Using the return statement effectively is a core skill if you want to code custom functions that are Pythonic and robust.

In this tutorial, you’ll learn:

  • How to use the Python return statement in your functions
  • How to return single or multiple values from your functions
  • What best practices to observe when using return statements

With this knowledge, you’ll be able to write more readable, maintainable, and concise functions in Python. If you’re totally new to Python functions, then you can check out Defining Your Own Python Function before diving into this tutorial.

Free Bonus: 5 Thoughts On Python Mastery, a free course for Python developers that shows you the roadmap and the mindset you'll need to take your Python skills to the next level.

Getting Started With Python Functions

Most programming languages allow you to assign a name to a code block that performs a concrete computation. These named code blocks can be reused quickly because you can use their name to call them from different places in your code.

Programmers call these named code blocks subroutines, routines, procedures, or functions depending on the language they use. In some languages, there’s a clear difference between a routine or procedure and a function.

Sometimes that difference is so strong that you need to use a specific keyword to define a procedure or subroutine and another keyword to define a function. For example, the Visual Basic programming language uses Sub and Function to differentiate between the two.

In general, a procedure is a named code block that performs a set of actions without computing a final value or result. On the other hand, a function is a named code block that performs some actions with the purpose of computing a final value or result, which is then sent back to the caller code. Both procedures and functions can act upon a set of input values, commonly known as arguments.

In Python, these kinds of named code blocks are known as functions because they always send a value back to the caller. The Python documentation defines a function as follows:

A series of statements which returns some value to a caller. It can also be passed zero or more arguments which may be used in the execution of the body. (Source)

Even though the official documentation states that a function “returns some value to the caller,” you’ll soon see that functions can return any Python object to the caller code.

In general, a function takes arguments (if any), performs some operations, and returns a value (or object). The value that a function returns to the caller is generally known as the function’s return value. All Python functions have a return value, either explicit or implicit. You’ll cover the difference between explicit and implicit return values later in this tutorial.
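As a quick preview of that difference, here is a minimal sketch (the function names are made up just for illustration). A function with a return statement hands back the given object explicitly, while a function that simply falls off the end of its body returns None implicitly:

def explicit_return():
    return 42  # explicit return value

def implicit_return():
    pass  # no return statement, so the return value is implicit

print(explicit_return())  # 42
print(implicit_return())  # None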

To write a Python function, you need a header that starts with the def keyword, followed by the name of the function, an optional list of comma-separated arguments inside a required pair of parentheses, and a final colon.

The second component of a function is its code block, or body. Python defines code blocks using indentation instead of brackets, begin and end keywords, and so on. So, to define a function in Python you can use the following syntax:

def function_name(arg1, arg2,..., argN):
    # Function's code goes here...
    pass

When you’re coding a Python function, you need to define a header with the def keyword, the name of the function, and a list of arguments in parentheses. Note that the list of arguments is optional, but the parentheses are syntactically required. Then you need to define the function’s code block, which will begin one level of indentation to the right.

In the above example, you use a pass statement. This kind of statement is useful when you need a placeholder statement in your code to make it syntactically correct, but you don’t need to perform any action. pass statements are also known as the null operation because they don’t perform any action.

Note: The full syntax to define functions and their arguments is beyond the scope of this tutorial. For an in-depth resource on this topic, check out Defining Your Own Python Function.

To use a function, you need to call it. A function call consists of the function’s name followed by the function’s arguments in parentheses:

function_name(arg1, arg2, ..., argN)

You’ll need to pass arguments to a function call only if the function requires them. The parentheses, on the other hand, are always required in a function call. If you forget them, then you won’t be calling the function but referencing it as a function object.
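For example, with a toy function just to illustrate the difference:

def greet():
    return "Hello!"

print(greet())  # calls the function and prints: Hello!
print(greet)    # no parentheses, so this prints the function object itself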

To make your functions return a value, you need to use the Python return statement. That’s what you’ll cover from this point on.

Read the full article at https://realpython.com/python-return-statement/ »



September 28, 2020 02:00 PM UTC


Test and Code

132: mocking in Python - Anna-Lena Popkes

Using mock objects during testing in Python.

Anna-Lena joins the podcast to teach us about mocks and using unittest.mock objects during testing.

We discuss:

  • the different styles of using mocks
  • pros and cons of mocks
  • dependency injection
  • adapter pattern
  • mock hell
  • magical universe
  • and much more

Special Guest: Anna-Lena Popkes.

Sponsored By:

  • Talk Python Training: Online video courses for Python developers
  • PyCharm Professional: Try PyCharm Pro for 4 months and learn how PyCharm will save you time. Promo Code: TESTANDCODE20
  • HoneyBadger: When bad things happen, it's nice to know that Honeybadger has your back. 30% off for first 6 months when you mention Test & Code Podcast when signing up.

Support Test & Code : Python Testing for Software Engineering

Links:

  • Personal webpage of Anna-Lena Popkes (alpopkes.com)
  • Magical Universe - Awesome Python features explained using the world of magic
  • Test & Code 102: Cosmic Python, TDD, testing and external dependencies - The episode where Harry Percival discusses mocking.
  • Talk: Harry Percival - Stop Using Mocks (for a while) - YouTube
  • unittest.mock
  • Autospeccing
  • Mock Hell Talk (45 min version) - Edwin Jung - PyCon 2019
  • Mock Hell Talk (30 min version) - Edwin Jung - PyConDE
  • PyCon Estonia
  • KI macht Schule!
  • Talk Python #186: 100 Days of Python in a Magical Universe

September 28, 2020 02:00 PM UTC


Stack Abuse

Padding Strings in Python

Introduction

String padding refers to adding characters, usually non-informative ones, to one or both ends of a string. This is most often done for output formatting and alignment purposes, but it can have useful practical applications.

A frequent use case for padding strings is outputting tabular data in a table-like fashion. You can do this in a variety of ways, including using Pandas to convert your data to an actual table. This way, Python would handle the output formatting on its own.

In this article, we'll cover how to pad strings in Python.

Say, we've got these three lists:

medicine1 = ['Morning', 'Dispirine', '1 mg']
medicine2 = ['Noon', 'Arinic', '2 mg']
medicine3 = ['Evening', 'Long_capsule_name', '32 mg']

We can form these into a string, using the join() function:

print(str.join(' ', medicine1))
print(str.join(' ', medicine2))
print(str.join(' ', medicine3))

This would give us the rather untidy output of:

Morning Dispirine 1 mg
Noon Arinic 2 mg
Evening Long_capsule_name 32 mg

To combat this, we could write for/while loops and append spaces to the strings until they reached a certain length, and make sure that all the data is aligned properly for easy visual inspection. Or, we could use built-in functions that can achieve the same goal.

The functions we'll be taking a look at in this article are: ljust(), center(), rjust(), zfill() and format(). Any of these functions can be used to add a certain number of characters to either end of strings, including spaces.

Padding Types

Before we take a closer look at the functions mentioned above, we'll take a look at different types of padding so we can reference them when talking about the functions.

Left Padding

Adding left padding to a string means adding a given character at the start of a string to make it of the specified length. Outside of simple formatting and alignment reasons, left padding can be really useful when naming files that start with a number generated in a sequence.

For example, you need to name 11 files, and each of them starts with a number from 1 to 11. If you simply added the number at the beginning of the file, most operating systems would sort the files in the following order: 1, 10, 11, 2, and so on.

This happens of course because of the rules of lexicographical sorting, but you can avoid these situations by naming files with one or more leading zeroes, depending on how many files you expect, i.e.: 01, 02, 03...

This can be achieved by effectively left padding the numbers with the appropriate number of zeroes, which keeps their original value.

This gives the effect of strings being right-justified.

Center Padding

This means that the given character is added in equal measure to both sides of the string until the new string reaches the given length. Using this effectively centers the string in the provided length:

This is a normal string.
This is a center padded string.

Right Padding

Right padding is analogous to left padding - the given character is added to the end of the string until the string reaches a certain length.

Python Functions For String Padding

Python offers many functions to format and handle strings; which one to use depends on the use case and the developer's personal preference. Most of the functions we'll discuss deal with text justification, which is essentially adding padding to one side of the string. For example, for a string to be left-justified, we need to add padding to the end (right side) of the string.

Note: In all the functions that expect a width or len parameter, if the original string is longer than the specified width or len, the entire string will be kept without changes. This can have the undesired effect of long strings ruining the formatting, so when choosing a width value, make sure you take your longest string, or an upper bound on its length, into account.

ljust()

The ljust() function aligns a string to the left by adding right padding.

The ljust() function takes two parameters: width and fillchar. The width is mandatory and specifies the length of the string after adding padding, while the second parameter is optional and represents the character added to pad the original string.

The default value is a space character, i.e. ' '. This is a particularly good option to use when printing table-like data, like in our example at the beginning:

medicine1 = ['Morning', 'Dispirine', '1 mg']
medicine2 = ['Noon', 'Arinic', '2 mg']
medicine3 = ['Evening', 'Long_capsule_name', '32 mg']

for medicine in [medicine1, medicine2, medicine3]:
    for entry in medicine:
        print(entry.ljust(25), end='')
    print()

Which gives us the output:

Morning                  Dispirine                1 mg                     
Noon                     Arinic                   2 mg                     
Evening                  Long_capsule_name        32 mg 

center()

The center() function aligns a string in the center of the specified width, by adding padding equally to both sides. The parameters are the same as with the ljust() function, a required width, and optional fillchar parameter:

list_of_strings = ["This can give us", "text that's center aligned", "within the specified width"]

for s in list_of_strings:
    print(s.center(50, ' '))

Output:

                 This can give us                 
            text that's center aligned            
            within the specified width            

rjust()

Analogous to the previous two functions, rjust() aligns the string to the right by adding padding to the left (beginning) of the string.

Again, the parameters are the required width and optional fillchar. Like we mentioned previously, this function is very useful when naming files that start with numbers because of the more intuitive sorting:

list_of_names_original = []
list_of_names_padded = []

for n in range(1, 13):
    list_of_names_original.append(str(n) + "_name")
    list_of_names_padded.append(str(n).rjust(2, '0') + "_name")

print("Lexicographical sorting without padding:")
print(sorted(list_of_names_original))
print()

print("Lexicographical sorting with padding:")
print(sorted(list_of_names_padded))

Running this code would give us:

Lexicographical sorting without padding:
['10_name', '11_name', '12_name', '1_name', '2_name', '3_name', '4_name', '5_name', '6_name', '7_name', '8_name', '9_name']

Lexicographical sorting with padding:
['01_name', '02_name', '03_name', '04_name', '05_name', '06_name', '07_name', '08_name', '09_name', '10_name', '11_name', '12_name']

zfill()

The zfill() function performs very similarly to using rjust() with zero as the specified character. It left pads the given string with zeroes until the string reaches the specified length.

The only difference is that in case our string starts with a plus (+) or minus (-) sign, the padding will start after that sign:

neutral = '15'
positive = '+15'
negative = '-15'
length = 4

print(neutral.zfill(length))
print(positive.zfill(length+1))
print(negative.zfill(length+1))

This is done to keep the original value of the number in case the string was a number. Running this code would give us:

0015
+0015
-0015

format()

The format() function is the most advanced in the list. This single function can be used for left, right, and even center padding. It is also used for other formatting too, but we'll only take a look at the padding functionality it provides.

It returns the string after formatting the specified values and putting them inside the string placeholders which are defined by {}.

The placeholders can be identified by either named indexes, numbered indexes, or even empty curly brackets. A quick example of how these placeholders look before we see how we can use this function to add padding:

print("Placeholders can given by {0}, or with {value}".format("adding a number", value="a named value"))
print("They can also be given {}, without adding a {} or {}".format("implicitly", "number", "name"))

Which would give us:

Placeholders can be given by adding a number, or with a named value
They can also be given implicitly, without adding a number or name

These placeholders accept a variety of formatting options. Let's see how we can achieve different types of string padding by using these options:
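For instance, the <, ^ and > options left-align, center, and right-align the value within the given width. Here is a minimal sketch (the strings and the width of 30 are only illustrative):

print('{:<30}'.format('Left aligned (right padded)'))
print('{:^30}'.format('Center padded'))
print('{:>30}'.format('Right aligned (left padded)'))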

You can also append characters other than white spaces, by adding the specified characters before the >, ^ or < character:

print('{:*^50}'.format('Center padding with a specific character'))
*****Center padding with a specific character*****

You can read more about the different possibilities of the format() function in our Guide to Formatting Strings with Python.

Conclusion

Adding padding to strings in Python is a relatively straightforward process that can noticeably increase the readability of your output, especially if the data you have can be read in a table-like fashion.

In this article, we've covered the ljust(), rjust(), center(), zfill() and format() functions as built-in approaches to string padding in Python.

September 28, 2020 01:17 PM UTC


Abhijeet Pal

Python’s Generator and Yield Explained

Generators are iterators, a kind of iterable you can only iterate over once. So what are iterators anyway? An iterator is an object that can be iterated (looped) upon. It ... Read more

The post Python’s Generator and Yield Explained appeared first on Django Central.

September 28, 2020 01:04 PM UTC


Andrew Dalke

Simple FPS fingerprint similarity search: variations on a theme

It's easy to write a fingerprint search tool. Peter Willett tells a story about how very soon after he, Winterman, and Bawden published Implementation of nearest-neighbor searching in an online chemical structure search system (1986) (which described their nearest-neighbor similarity search implementation and observed that Tanimoto similarity gave more satisfactory results than cosine similarity), he heard from a company which wrote their own implementation, on a Friday afternoon, and found it to be very useful.

Now, my memory of his story may be missing in the details, but the key point is that it's always been easy to write a fingerprint similarity search tool. So, let's do it!

I'll call my program ssimsearch because it's going to be a simplified version of chemfp's simsearch command-line tool. In fact, I'll hard-code just about everything, with only the bare minimum of checking.

What's in chembl_27.fps?

To make it even easier, it will be hard-coded so it can only search "chembl_27.fps" in the local directory, which is the decompressed contents of chembl_27.fps.gz.

Let's take a look at the first few lines of that file, which is in FPS format:

% fold -w 90 chembl_27.fps | head -18
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1
#software=RDKit/2019.09.2
#source=chembl_27.fps.gz
#date=2020-05-18 15:43:14.060167
000804000000000000000000000000000010000000000000000000000000000000000000000000000000008000
000004000000000000400000000100000000000000000000010000000000000000000000000000081000000000
200000000000000000008000000000000000000888000000000080000000000010000000000000000000020000
000000000000000002000108002040000000000000000005000000020001000000000000001000000002000000
000000000000000000000000000000800000000000200000000000000000010000000004000000000000000000
00000000000080000002000000000000000008000000000000000000000000	CHEMBL153534
020800000002000000800300000030000010810000200000220000400800040000000008400000000000008004
0000050004000000000040004000000400000000000800020000b0000000000000030000040000000102000002
700009008008000004408000000000000200000800040000004088000002000004200000022c10000000020000
0000004844000000020180084000000010000000000000000080040400002000000000000010000a0000000000
01101004000500000000008400014010800000000000a500020400000000002900008800000001490000000200
00008020008002020082000004802040804000800000000000100001800000	CHEMBL440060

The lines starting with "#" are part of the header. The first line identifies the file format; the other header lines contain key/value-formatted metadata. In this case, the file contains 2048-bit Morgan fingerprints from RDKit version 2019.09.2, and it was created on 18 May 2020.

Fingerprint type

The RDKit code to generate a fingerprint corresponding to the type parameters, given an RDMol molecule object, is:

from rdkit.Chem import rdMolDescriptors

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

This returns a DataStructs.ExplicitBitVect instance.

Fingerprint lines

After the header lines are the fingerprint lines. There is one record per line, with tab-separated fields. The first field is the hex-encoded fingerprint, the second is the identifier. The other fields are user-defined - chemfp ignores them. (They could contain SMILES, property values, etc.)

The FPS format uses hex-encoding because it's very easy in most programs to convert hex-strings into byte-strings. Here's a Python3 example using a hex-encoded MACCS fingerprint for caffeine:

>>> bytes.fromhex("000000003000000001d414d91323915380f138ea1f")
b'\x00\x00\x00\x000\x00\x00\x00\x01\xd4\x14\xd9\x13#\x91S\x80\xf18\xea\x1f'

In Python 2 it was:

>>> "000000003000000001d414d91323915380f138ea1f".decode("hex")
'\x00\x00\x00\x000\x00\x00\x00\x01\xd4\x14\xd9\x13#\x91S\x80\xf18\xea\x1f'

How to represent fingerprints and compute the Tanimoto?

Data representation and data calculation go hand-in-hand. I'll need to compute the binary Tanimoto, which is defined as popcount(fp1&fp2)/popcount(fp1|fp2), where & and | are the usual bitwise and and or operators, and popcount() is the number of bits in the result. Bitwise-and and -or are available on standard Python integers but popcount() is not.

I can think of three approaches:

  1. Turn the hex-encoded fingerprint into an RDKit ExplicitBitVect() and have RDKit compute the Tanimoto for me;
  2. Turn both into a third data type where it's easy to compute the Tanimoto;
  3. Turn the RDKit fingerprint into a byte string and figure out some way to compute the Tanimoto between them.

I'm writing this essay for pedagogical reasons, so I'm going to explore all three options.

Using RDKit's ExplicitBitVect()

RDKit's CreateFromFPSText() function parses a hex-encoded fingerprint into an ExplicitBitVect(), and TanimotoSimilarity() computes the Tanimoto similarity between two fingerprints, which makes it easy to write a search program. It's so easy, I'll let the comments speak for themselves:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs

## Input parameters
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
threshold = 0.7

## Convert the SMILES into an RDKit fingerprint
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise SystemExit(f"Cannot parse SMILES {smiles!r}")

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

## Search chembl_27.fps

# Skip the header; make sure I skipped only to the end of the header
infile = open("chembl_27.fps")
for i in range(6):
    line = infile.readline()
assert line.startswith("#date")

# Process each line, convert to an RDKit fingerprint, and compare
for line in infile:
    target_hex_fp, target_id = line.split()
    target_rd_fp = DataStructs.CreateFromFPSText(target_hex_fp)
    score = DataStructs.TanimotoSimilarity(query_rd_fp, target_rd_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")

The output from running this program is:

CHEMBL113	1.0
CHEMBL1232048	0.7096774193548387

CHEMBL113 is caffeine, so getting the expected score of 1.0 shows I didn't mess up on the GetMorganFingerprintAsBitVect() parameters. CHEMBL1232048 is bisdionin C, with a caffeine-caffeine linker, so that also makes sense.

You can surely see that this is not a complicated program, and can easily be done in an afternoon! The hardest part is probably finding out which functions to call and how to call them.

However, this search took 30 seconds. Can we do better?

Using gmpy2 integers

I could convert a hex-encoded fingerprint to a Python integer and do the bitwise-and and -or operations, but there's no way to get the popcount() of a Python integer directly. (I could convert the Python integer into a byte string and compute the popcount of that, but then why not just work on byte strings directly?)

Instead, I'll try using the GNU Multiple Precision Arithmetic Library through the gmpy2 Python wrapper. The integer-like mpz object implements the bitwise operations, and gmpy2 provides an optimized popcount() function. Here's an example using the RDKit MACCS keys for caffeine and theobromine:

>>> import gmpy2
>>> caffeine = gmpy2.mpz("000000003000000001d414d91323915380f138ea1f", 16)
>>> theobromine = gmpy2.mpz("000000003000000001d414d91323915380e178ea1f", 16)
>>> gmpy2.popcount(theobromine & caffeine)
45
>>> gmpy2.popcount(theobromine | caffeine)
47
>>> gmpy2.popcount(theobromine & caffeine) / gmpy2.popcount(theobromine | caffeine)
0.9574468085106383

(I used chemfp to verify that it computed the same MACCS Tanimoto similarity.)

That, plus the knowledge that BitVectToFPSText() turns an ExplicitBitVect into a hex-encoded fingerprint, makes it straightforward to try out a GMPY2-based search system:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
import gmpy2

## Input parameters
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
threshold = 0.7

## Convert the SMILES into an RDKit fingerprint then into a GMP mpz integer
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise SystemExit(f"Cannot parse SMILES {smiles!r}")

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

query_gmp_fp = gmpy2.mpz(DataStructs.BitVectToFPSText(query_rd_fp), 16)

## Search chembl_27.fps

# Skip the header; make sure I skipped only to the end of the header
infile = open("chembl_27.fps")
for i in range(6):
    line = infile.readline()
assert line.startswith("#date")

# Process each line, convert to a GMP mpz integer, and compare
for line in infile:
    target_hex_fp, target_id = line.split()
    target_gmp_fp = gmpy2.mpz(target_hex_fp, 16)
    score = gmpy2.popcount(query_gmp_fp & target_gmp_fp) / gmpy2.popcount(query_gmp_fp | target_gmp_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")

The GMP-based search takes about 8 seconds, which is a nice speedup from 30 seconds.

One downside of this approach is that each of the bitwise-operators returns a new Python object, which is used only to find the popcount. That's a lot of object creation, which means a lot of object overhead.

Using byte string + my own Tanimoto function

The third option is to use native Python byte strings. This has less overhead, but Python offers no fast way to compute the Tanimoto between two byte strings. I could make one in Python using, say, a byte-based lookup table, but it's clear that a C or C++ extension will be faster, and there are compiler intrinsics for many compilers (gcc/clang, Visual Studio), or std::popcount (for C++) to make it fast.
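For reference, a pure-Python sketch of that lookup-table idea (illustration only; as noted, a compiled extension will be much faster):

# Popcount of every possible byte value, computed once up front.
POPCOUNT_TABLE = [bin(i).count("1") for i in range(256)]

def byte_tanimoto(fp1, fp2):
    # fp1 and fp2 are byte strings of the same length; iterating bytes yields ints.
    intersect = sum(POPCOUNT_TABLE[a & b] for a, b in zip(fp1, fp2))
    union = sum(POPCOUNT_TABLE[a | b] for a, b in zip(fp1, fp2))
    return intersect / union if union else 0.0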

CFFI - C foreign-function interface for Python

While I could use the Python/C API to develop an extension, that documentation points out that "Third party tools like Cython, cffi, SWIG and Numba offer both simpler and more sophisticated approaches to creating C and C++ extensions for Python." I'll use cffi, which implements a way for Python code to call C functions directly.

This includes a way to specify the source code for the extension. Here's an example that will create the extension "_popc" containing a single function, byte_tanimoto_256(), hard-coded to compute the Tanimoto between two 2048-bit/256-byte byte strings.

# I call this file "popc.py". It should work for gcc and clang.
from cffi import FFI

ffibuilder = FFI()

# Create a Python module which can be imported via "import _popc".
# It will contain the single function byte_tanimoto_256() which
# expects two byte strings of length 256 bytes - exactly!
ffibuilder.set_source("_popc", r"""

#include <stdint.h>

static double
byte_tanimoto_256(const unsigned char *fp1, const unsigned char *fp2) {
    int num_words = 2048 / 64;
    int union_popcount = 0, intersect_popcount = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp1_64 = (uint64_t *) fp1, *fp2_64 = (uint64_t *) fp2;

    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
        union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
    }
    if (union_popcount == 0) {
        return 0.0;
    }
    return ((double) intersect_popcount) / union_popcount;
}

""",
    # Tell the compiler to always expect the POPCNT instruction will be available.
    extra_compile_args=["-mpopcnt"])

# Tell cffi to export the above function for use by Python.
ffibuilder.cdef("""
double byte_tanimoto_256(unsigned char *fp1, unsigned char *fp2);
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

(If I don't include -mpopcnt then the compiler will generate slower code that works even on older Intel-compatible CPUs that don't support the POPCNT instruction added to SSE4a in 2007.)

I then run the program to compile the _popc module:

% python popc.py
generating ./_popc.c
the current directory is '/Users/dalke/demo'
running build_ext
building '_popc' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/dalke/venvs/py38-2020-8/include -I/Users/dalke/local-3.8/include/python3.8 -c _popc.c -o ./_popc.o -mpopcnt
gcc -bundle -undefined dynamic_lookup ./_popc.o -o ./_popc.cpython-38-darwin.so

Finally, I'll test it on a couple of strings to check that nothing is obviously wrong:

>>> import _popc
>>> _popc.lib.byte_tanimoto_256(b"\1"*256, b"\3"*256)
0.5
>>> _popc.lib.byte_tanimoto_256(b"\1"*255 + b"\3", b"\3"*256)
0.501953125
>>> 257/512
0.501953125

A bytestring search implementation

Putting it all together, here's an implementation which uses the _popc module I created in the previous subsection:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
from _popc.lib import byte_tanimoto_256

## Input parameters
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
threshold = 0.7

## Convert the SMILES into Morgan fingerprint byte string
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise SystemExit(f"Cannot parse SMILES {smiles!r}")

query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol, radius=2, nBits=2048, useChirality=0,
    useBondTypes=1, useFeatures=0)

query_byte_fp = DataStructs.BitVectToBinaryText(query_rd_fp)

## Search chembl_27.fps

# Skip the header; make sure I skipped only to the end of the header
infile = open("chembl_27.fps")
for i in range(6):
    line = infile.readline()
assert line.startswith("#date")

# Process each line, convert to a byte string, and compare
for line in infile:
    target_hex_fp, target_id = line.split()
    target_byte_fp = bytes.fromhex(target_hex_fp)
    score = byte_tanimoto_256(query_byte_fp, target_byte_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")

This takes a bit over 5 seconds. Recall that the GMP-based search takes about 8 seconds, and the RDKit-based search takes about 30 seconds.

Chemfp's simsearch performance

While it's easy to write a fingerprint search tool, it's hard to make a fast fingerprint search tool. There are many obvious ways to improve the performance of the above code. For example: 1) local variable name lookups in a Python function are faster than module-level lookups, so simply moving the module-level program into its own function increases performance by 8%; 2) profiling shows that looping over the lines takes the longest time, followed by the split() and then the conversion from hex to a byte string, so reducing that overhead is the place to look next, not optimizing the popcount any more.
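The first point is nothing more than wrapping the search loop in a function and binding the hot names to locals; a sketch of that change (same logic as the bytestring search above):

from _popc.lib import byte_tanimoto_256

def search(query_byte_fp, threshold=0.7, filename="chembl_27.fps"):
    # Local bindings avoid repeated global/attribute lookups in the hot loop.
    tanimoto = byte_tanimoto_256
    fromhex = bytes.fromhex
    with open(filename) as infile:
        for i in range(6):
            line = infile.readline()
        assert line.startswith("#date")
        for line in infile:
            target_hex_fp, target_id = line.split()
            score = tanimoto(query_byte_fp, fromhex(target_hex_fp))
            if score >= threshold:
                print(f"{target_id}\t{score}")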

Instead of going through that process yourself, you could use chemfp's simsearch tool, which is fairly well optimized already. A caffeine search takes under 2 seconds, and the implementation supports many different fingerprint types and sizes:

% time simsearch --query "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" --query-format smistring chembl_27.fps --threshold 0.7
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=all threshold=0.7
#software=chemfp/3.4.1
#targets=chembl_27.fps
#target_source=chembl_27.fps.gz
2	Query1	CHEMBL113	1.00000	CHEMBL1232048	0.70968
1.876u 0.263s 0:01.98 107.5%	0+0k 0+0io 426pf+0w

You can test it out for yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

The no-cost Base License Agreement says you may do single-query FPS searches against any sized FPS file, for in-house purposes. If you are interested in faster performance, contact me for a license key to try out the binary FPB format, which is faster to load and supports optimized in-memory searches.

September 28, 2020 12:00 PM UTC


Codementor

Resources to learn Tableau, Power BI, Python etc

September 28, 2020 10:04 AM UTC

Implementing Common Python Built-ins in JavaScript

In this post we'll try to implement common Python builtins such as min, max, etc. in JavaScript. Here's what we'll have: print(1, 2, 3, 4) print(max([1, 2, 100])); print(min([1, 2,...

September 28, 2020 09:49 AM UTC


Mike Driscoll

PyDev of the Week: William Cox

This week we welcome William Cox as our PyDev of the Week. William is a data scientist who has spoken at a few Python conferences. He maintains a blog where you can catch up on what’s new with him.

Let’s spend a few moments getting to know William better!

Can you tell us a little about yourself (hobbies, education, etc):

I’ve always loved building things. I spent most of high school building robots and running a blog about robots. I got a degree in electrical engineering thanks to this, and then went on to get a PhD in signal processing and digital communications. Outside of work I enjoy wood and metal working and being outdoors. Mostly though, I’m a full-time parent.

Why did you start using Python?

I first used Python in graduate school when I needed to automate reading from a serial input device connected up to one of our sensors. I’d been programming for a long time at that point and Python was a pretty easy jump from Perl. My computer science friends were telling me Python was great so I saw it as an opportunity to branch out again.

What other programming languages do you know and which is your favorite?

When I was 12 my dad dropped “Teach Yourself Perl in 21 Days” on my desk and said, “you should learn this.” It took me much longer than 21 days, but I’m glad he did that. I dabbled in several languages (PHP, Java, C) but spent many years in graduate school honing my MATLAB skills, due to its powerful plotting and data analysis capabilities. My first job was at a military contractor and they all used MATLAB. This was the early 10’s and Python was really taking off as the language of scientific computing, so I was able to convince my boss that it was something I should be learning – he was especially attracted to how much money they could save over their massive MATLAB bills. I got my 2nd job with my impressive iPython Notebook skills! It wasn’t, however, till I started my 2nd job that I finally started learning what it means to write software with a team. It’s a lot different than dabbling on your own.

What projects are you working on now?

During the start of the Pandemic I brushed off www.makeamaze.com, which is Python-based and hosted on Heroku. Lately I’ve been spending more time on http://www.remotedatascience.com/, a newsletter for remote data science and machine learning jobs. Promoting remote work is a favorite hobby of mine and I’ve worked fully remote for many years now. I’m both pleased and skeptical of everyone’s newfound love of remote! I’ve also been doing some 1-on-1 coaching for folks wanting to transition from engineering jobs into Data Science/ML roles. That’s been especially rewarding to teach. In my day-job I write Machine Learning jobs to predict demand for my employer. That’s all Python-based too.

Which Python libraries are your favorite (core or 3rd party)?

I work as a machine learning engineer for a big food ordering and delivery company and I use Pandas daily. `Black`, `isort` and `flake8` make my job easier. We distribute all of our machine learning jobs with Dask and its family of packages. I really love what the Dask community is doing and I gave a talk on how we’re using it last year at PyColorado.

Can you describe your journey from electrical engineering to data scientist?

Data science is a fairly new field and as such there isn’t much of an established path to this career. There certainly wasn’t when I came into it. I’d encountered – and avoided – Machine Vision and so-called AI classes in undergraduate, thinking they were too esoteric and not useful in the real-world. It wasn’t till after I’d finished my PhD and was taking Sebastian Thrun’s “build a self driving car” MOOC for fun that I had the realization that I’d seen all of his equations before in my electrical engineering curriculum, just with different names or applications. I also took Andrew Ng’s machine learning class at the same time and experienced building predictive models for the first time. Subsequently I had the advantage of a fantastic mentor at my first job who encouraged exploring all the new topics around machine learning and deep learning that was happening at the time. I’d also had plenty of experience dabbling with computers, running websites, administering Linux machines, etc. while in high school and college (no parties for me!). That all combined together to let me convince a CEO to hire me as a data scientist at a startup. Being active on Twitter, speaking at conferences fairly regularly, and seeking out like-minded machine learning practitioners have accelerated my career dramatically.

Do you have any tips for people who would like to get into data science from another discipline?

First off, the grass isn’t always greener. Data Science is a hard job – just like most jobs – and only a small portion of it is fun and exciting data analysis or training ML models. Significantly I would focus on being a good *software engineer* first, and a data scientist second. The SWE aspects of the job – dealing with computers, writing tests, anticipating failure modes, and writing repeatable analysis – are the parts that are difficult to learn without spending the hours on it. They also don’t make for great blog posts so there is less written about it than some fancy analysis. All that said, the best way to move forward is by finding a mentor who can steer your learning and keep you from wandering off the track too much. https://www.sharpestminds.com/ would be one such place to find one. Joining some Slack/Reddit/Twitter communities and explicitly asking for a mentor would be another avenue. Start conducting informational interviews (my friend Matt has a good podcast on the topic: Informational Interviewing – Learn Cool Stuff, Meet Amazing People, and Stack the Deck in your Favor — Life Meets Money) to find out if you *really* want to be a data scientist and then narrow down the areas of business you’d like to work in. Make a plan and write down *as you learn*. Documenting your learning helps solidify the concepts in your head and makes a good resource to show potential employers *and* people you’ll mentor in your future. Finally, as I mentioned above, don’t forget careers are long things. Be kind and don’t get burnt out!

Is there anything else you’d like to say?

Be kind to your coworkers! Careers are long.

Thanks for doing the interview, William

The post PyDev of the Week: William Cox appeared first on The Mouse Vs. The Python.

September 28, 2020 05:05 AM UTC


Montreal Python User Group

Montréal-Python 80 – Pedal Kayak

Greetings Python community! October is fast approaching with vibrant fall colour and our favourite apples. This is the occasion to set the table for our 80th event – Pedal Kayak – which will take place this coming October 26.

Pedal Kayak will feature a recap of our September code sprint as well as a captivating series of technical presentations.

Help us better meet your expectations by filling out our orientation survey. The survey takes about 5 minutes to fill out and will help us forge high impact events. The survey will remain open until November 10. We are always looking for presenters eager to share their experience and their knowledge. If you have worked with Python for your personal projects or for work, come and share with our community. Submit your talk proposal (10 to 45 mins long) by sending an email to mtlpyteam@googlegroups.com or by joining our Slack and announcing your proposal in #general.

September 28, 2020 04:00 AM UTC

September 27, 2020


Erik Marsja

How to Convert a Float Array to an Integer Array in Python with NumPy

In this short NumPy tutorial, we are going to learn how to convert a float array to an integer array in Python. Specifically, here we are going to learn by example how to carry out this rather simple conversion task. First, we are going to change the data type from float to integer in a 1-dimensional array. Second, we are going to convert float to integer in a 2-dimensional array. 

Now, sometimes we may want to round the numbers before we change the data type. Thus, we are going through a couple of examples, as well, in which we 1) round the numbers with the round() method, 2) round the numbers up to the nearest largest integer with the ceil() method, and 3) round the float numbers down to the nearest smallest integer with the floor() method. Note, all code can be found in a Jupyter Notebook.

Creating a Float Array 

First, however, we are going to create an example NumPy 1-dimensional array:

import numpy as np

# Creating a 1-d array with float numbers
oned = np.array([0.1, 0.3, 0.4, 0.6, -1.1, 0.3])

As you can see, in the code chunk above, we started by importing NumPy as np. Second, we created a 1-dimensional array with the array() method. Here’s the output of the array containing float numbers:

[Image: the 1-dimensional float array]

Now, we are also going to be converting a 2-dimensional array so let’s create this one as well:

# Creating a 2-d float array
twod = np.array([[ 0.3,   1.2,   2.4,  3.1,  4.3],
                 [ 5.9,   6.8,   7.6,  8.5,  9.2],
                 [10.11, 11.1,  12.23, 13.2, 14.2],
                 [15.2,  16.4,  17.1,  18.1, 19.1]])

Note that if you have imported your data with Pandas, you can also convert the dataframe to a NumPy array (see the short example below). In the next section, we will be converting the 1-dimensional array to the integer data type using the astype() method.
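For example, with a hypothetical dataframe df, the conversion is a single method call:

import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.3], 'b': [0.4, 0.6]})  # hypothetical dataframe
arr = df.to_numpy()  # NumPy array holding the dataframe's values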

How to Convert a Float Array to an Integer Array in Python:

Here’s how to convert a float array to an integer array in Python:

# convert array to integer python
oned_int = oned.astype(int)

Now, if we want to change the data type (i.e. from float to int) in the 2-dimensional array we will do as follows:

# python convert array to int
twod_int = twod.astype(int)

Now, in the output from both conversion examples above, we can see that the decimal parts were simply dropped (the values are truncated towards zero rather than rounded). In some cases, we may want the float numbers to be rounded according to common practice. Therefore, in the next section, we are going to use the around() method before converting.

[Image: the arrays converted to integers]

Now, if we want to, we can convert the NumPy array to a Pandas dataframe, as well as carry out descriptive statistics, as sketched below.
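A brief sketch of that, assuming Pandas is imported as pd:

import pandas as pd

df = pd.DataFrame(twod_int)  # the converted 2-d integer array from above
print(df.describe())         # descriptive statistics per column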

Round the Float Numbers Before we Convert them to Integer

Here’s how to use the around() method before converting the float array to an integer array:

oned = np.array([0.1, 0.3, 0.4, 0.6, -1.1, 0.3])
oned = np.around(oned)

# numpy convert to int
oned_int = oned.astype(int)

Now, we can see in the output that the float numbers are rounded to the nearest integer before being converted to integers. Here’s the output of the converted array:

[Image: the rounded array converted to integers]

Round to the Nearest Largest Integer Before Converting to Int

Here’s how we can use the ceil() method before converting the array to integer:

oned = np.array([0.1, 0.3, 0.4, 0.6, -1.1, 0.3])
oned = np.ceil(oned)

# numpy float to int
oned_int = oned.astype(int)

Now, we can see the difference in the output containing the converted float numbers:

[Image: the array rounded up with ceil() and converted to integers]

Round to the Nearest Smallest Integer Before Converting to Int

Here’s how to round the numbers to the smallest integer and changing the data type from float to integer:

oned = np.array([0.1, 0.3, 0.4, 0.6, -1.1, 0.3])
oned = np.floor(oned)

# numpy float to int
oned_int = oned.astype(int)

In the image below, we see the results of using the floor() method before converting the array. It is, of course, possible to carry out the rounding task before converting a 2-dimensional float array to integer, as well.

[Image: the array rounded down with floor() and converted to integers]

Here’s the link to the Jupyter Notebook containing all the code examples found in this post.

Conclusion

In this NumPy tutorial, we have learned a simple conversion task. That is, we have converted a float array to an integer array. To change the data type of the array we used the astype() method. Hope you learned something. Please share the post across your social media accounts if you did! Support the blog by becoming a patron. Finally, if you have any suggestions, comments, or anything you want me to cover in the blog: leave a comment below.

The post How to Convert a Float Array to an Integer Array in Python with NumPy appeared first on Erik Marsja.

September 27, 2020 06:59 PM UTC


Peter Hoffmann

Azure Synapse SQL-on-Demand Openrowset Common Table Expression with SQLAlchemy

In a previous post I have shown how to use turbodbc to access Azure Synapse SQL-on-Demand endpoints. A common pattern is to use the openrowset function to query parquet data from an external data source like the azure blob storage:

select
    result.filepath(1) as [c_date],
    *
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/c_date=*/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [sales_euro] float
    ) as [result]
where c_date='2020-09-01'

Common table expressions help to make the SQL code more readable, especially if more than one external data source is queried. Once you have defined the CTE statements at the top, you can use them like normal tables inside your queries:

WITH location AS
(SELECT
    *
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [l_name] varchar(100),
        [latitude] float,
        [longitude] float
    ) as [result]
),
sales AS
(SELECT
    result.filepath(1) as [c_date],
    *
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/c_date=*/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [sales_euro] float
    ) as [result]
)

SELECT location.l_id, sales.sales_euro
FROM sales JOIN location ON sales.l_id = location.l_id
where c_date = '2020-01-01'

Still, writing such queries in data pipelines soon becomes cumbersome and error prone. So once we moved from writing the queries in the Azure Synapse Workbench to using them in our daily workflows with Python, we wanted a better way to programmatically generate the SQL statements.

SQLAlchemy is still our library of choice to work with SQL in Python. SQLAlchemy already has support for Microsoft SQL Server, so most of the Azure Synapse SQL-on-Demand features are covered. I have not yet found a native way to work with openrowset queries, but it's quite easy to use the text() feature to inject the missing statement:

import sqlalchemy as sa

cte_location_raw = '''
*
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [l_name] varchar(100),
        [latitude] float,
        [longitude] float
    ) as [result]
'''

cte = sa.select([sa.text(cte_location_raw)]).cte('location')
q = sa.select([sa.column('l_id'), sa.column('l_code'), sa.column('l_name')]).select_from(cte)

The cte() call returns a Common Table Expression instance, which behaves like a SELECT statement and can be used as such in other statements to generate the following code:

WITH location AS
(SELECT
        *
    FROM
        OPENROWSET(
            BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',
            FORMAT='PARQUET'
        ) with(
            [l_id] bigint,
            [l_name] varchar(100),
            [latitude] float,
            [longitude] float
        ) as [result]
)
SELECT l_id, l_code, l_name FROM location
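
If you want to inspect the rendered SQL yourself, a minimal sketch (assuming the q object from the snippet above; str() renders with the default dialect, which is sufficient for this statement):

# Render the statement to a SQL string to inspect what will be sent.
# For dialect-specific output, a dialect can be passed to q.compile().
print(str(q))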

The cte statement does not know about its columns because it only gets passed the raw SQL text. But you can annotate the sa.text() statement with a typemap dictionary, so that it exposes which columns are available from the statement. By annotating the cte we can use the table.c.column syntax later to reference the columns instead of using sa.column('l_code') as above.

import sqlalchemy as sa

cte_location_raw = '''
*
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [l_name] varchar(100),
        [latitude] float,
        [longitude] float
    ) as [result]
'''

typemap = {"l_id": sa.Integer, "l_code": sa.String, "l_name": sa.String, "latitude": sa.Float, "longitude": sa.Float}
cte = sa.select([sa.text(cte_location_raw, typemap=typemap)]).cte('location')
q = sa.select([cte.c.l_id, cte.c.l_name]).select_from(cte)

So, putting everything together, you can define and test your CTEs in Python:

import sqlalchemy as sa

cte_sales_raw = '''
SELECT
    result.filepath(1) as [c_date],
    *
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [sales_euro] float
    ) as [result]
'''
cte_location_raw = '''
SELECT
    *
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [l_name] varchar(100),
        [latitude] float,
        [longitude] float
    ) as [result]
'''

typemap_location = {"l_id": sa.Integer, "l_name": sa.String, "latitude": sa.Float, "longitude": sa.Float}
location = sa.select([sa.text(cte_location_raw, typemap=typemap_location).alias("tmp1")]).cte('location')
typemap_sales = {"l_id": sa.Integer,  "c_date": sa.Date, "sales_euro": sa.Float}
sales = sa.select([sa.text(cte_sales_raw, typemap=typemap_sales).alias("tmp2")]).cte('sales')

and then compose more complex statements like with any other SQLAlchemy table definitions:

cols = [sales.c.c_date, sales.c.l_id, location.c.l_name, location.c.latitude, location.c.longitude]
q = sa.select(cols).select_from(sales.join(location, sales.c.l_id == location.c.l_id ))

In our production data pipelines at Blue Yonder we typically provide the building blocks to create complex queries in libraries that are maintained by a central team. Testing smaller parts with SQLAlchemy works much better, and it's easier for data scientists to plug them together and concentrate on high-level model logic.
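
As a rough, self-contained illustration of what such a test could look like, here is a minimal sketch following the same pattern as above (the external source name and helper function are made up; a real library function would wrap the OPENROWSET text shown earlier):

import sqlalchemy as sa


def build_location_cte():
    # Simplified building block; a real helper would wrap the OPENROWSET
    # text from the post (the source name here is made up)
    raw = "* FROM [some_external_source] as [result]"
    typemap = {"l_id": sa.Integer, "l_name": sa.String}
    return sa.select([sa.text(raw, typemap=typemap)]).cte("location")


def test_location_cte_columns():
    cte = build_location_cte()
    q = sa.select([cte.c.l_id, cte.c.l_name]).select_from(cte)
    compiled = str(q)
    # The rendered statement should contain the CTE and its columns
    assert "WITH location AS" in compiled
    assert "location.l_id" in compiled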

We like the power of Azure SQL-on-Demand, but managing and testing complex SQL statements is still a challenge, as you can already see from the result of the above code. But at least SQLAlchemy and Python make it easier:

WITH sales AS
(SELECT l_id AS l_id, c_date AS c_date, sales_euro AS sales_euro
FROM (
SELECT
    result.filepath(1) as [c_date],
    *
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [sales_euro] float
    ) as [result]
)as tmp1),
location AS
(SELECT l_id AS l_id, l_name AS l_name, latitude AS latitude, longitude AS longitude
FROM (
SELECT
    *
FROM
    OPENROWSET(
        BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',
        FORMAT='PARQUET'
    ) with(
        [l_id] bigint,
        [l_name] varchar(100),
        [latitude] float,
        [longitude] float
    ) as [result]
) as tmp2)
SELECT sales.c_date, sales.l_id, location.l_name, location.latitude, location.longitude
FROM sales JOIN location ON sales.l_id = location.l_id

September 27, 2020 12:00 AM UTC

September 26, 2020


Weekly Python StackOverflow Report

(ccxliii) stackoverflow python report

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2020-09-26 18:08:21 GMT


  1. How to download a nested JSON into a pandas dataframe? - [10/4]
  2. I need to change the type of few columns in a pandas dataframe. Can't do so using iloc - [9/1]
  3. How to delete all instances of a repeated number in a list? - [6/6]
  4. Apply function to each row in Pandas dataframe by group - [6/2]
  5. Error at Prepare training and validation data in neural network - [6/1]
  6. Why doesn't small integer caching seem to work with int objects from the round() function in Python 3? - [6/1]
  7. Groupby in Reverse - [5/4]
  8. creating a json object from pandas dataframe - [5/3]
  9. How to convert a dataframe from long to wide, with values grouped by year in the index? - [5/2]
  10. KeyError on If-Condition in dictionary Python - [5/2]

September 26, 2020 06:08 PM UTC


Erik Marsja

How to Perform Mann-Whitney U Test in Python with Scipy and Pingouin


In this data analysis tutorial, you will learn how to carry out a Mann-Whitney U test in Python with the package SciPy. This test is also known as Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test and is a non-parametric hypothesis test.

Outline of the Post

In this tutorial, you will learn when and how to use this non-parametric test. After that, we will see an example of a situation when the Mann-Whitney U test can be used. The example is followed by how to install the needed package (i.e., SciPy), as well as a package that makes it easy to import data and to quickly visualize it to support the interpretation of the results. In the following section, you will learn the 2 steps to carry out the Mann-Whitney-Wilcoxon test in Python. Note, we will also have a look at another package, Pingouin, that enables us to carry out statistical tests with Python. Finally, we will learn how to interpret the results and visualize data to support our interpretation.

When to use the Mann-Whitney U test

This test is a rank-based test that can be used to compare values for two groups.  If we get a significant result it suggests that the values for the two groups are different.  As previously mentioned, the Mann-Whitney U test is equivalent to a two-sample Wilcoxon rank-sum test.

Furthermore, we don’t have to assume that our data follow the normal distribution, and the test can be used to decide whether the population distributions are identical. Note, however, that the Mann–Whitney test does not address hypotheses about the medians of the groups. Rather, it addresses whether it is likely that an observation in one group is greater than an observation in the other group. In other words, it concerns whether one sample has stochastic dominance over the other.

The test assumes that the observations are independent.  That is, it is not appropriate for paired observations or repeated measures data.

Appropriate data

Hypotheses

As with the two-sample t-test, there are normally two hypotheses: the null hypothesis (H0) that the two populations are equal, and the alternative hypothesis (Ha) that the two populations are not equal.

Interpretation

If the results are significant they can be reported as “The values for men were significantly different from those for women.”, if you are examining differences in values between men and women.

To conclude, you should use this test instead of, e.g., a two-sample t-test in Python if the above holds for your data.

Example

In this section, before moving on to how to carry out the test, we will have a quick look at an example when you should use the Mann-Whitney U test. 

Imagine, for example, that you run an intervention study designed to examine the effectiveness of a new psychological treatment to reduce symptoms of depression in adults. Let’s say that you have a total of n=14 participants. Furthermore, these participants are randomized to receive either the treatment or no treatment at all. In your study, the participants are asked to record the number of depressive episodes over a 1-week period following receipt of the assigned treatment. Here are some example data:

Example data to carry out the Wilcoxon rank-sum test

In this example, the question you might want to answer is: is there a difference in the number of depressive episodes over a 1-week period between participants receiving the new treatment and those receiving no treatment? By inspecting your data, it appears that participants receiving no treatment have more depressive episodes. The crucial question is, however: is this difference statistically significant?

Pandas frequency histogram

In this example, the outcome variable is the number of episodes (a count) and, naturally, in this sample, the data do not follow a normal distribution. Note, Pandas was used to create the above histogram; a sketch of how such a plot can be created follows below.
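
Here is a minimal sketch of how such a frequency histogram could be created with Pandas (the data are the same values used later in the post; the exact plot in the image above may differ):

import pandas as pd
import matplotlib.pyplot as plt

# Example data: number of depressive episodes per participant and condition
data = {'Count': [7, 5, 6, 4, 12, 9, 8, 3, 6, 4, 2, 1, 5, 1],
        'Condition': ['No Treatment'] * 7 + ['Treatment'] * 7}
df = pd.DataFrame(data)

# One frequency histogram per condition
df.hist(column='Count', by='Condition', bins=6, grid=False)
plt.show()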

Prerequisites 

To follow this tutorial you will need to have Pandas and SciPy installed. Now, you can get these packages using your favorite Python package manager. For example, installing Python packages with pip can be done as follows:

pip install scipy pandas pingouin

Note, both Pandas and Pingouin are optional. However, using these packages has, as you will see later, its advantages. Hint: Pandas makes data importing easy. If you ever need to, you can also use pip to install a specific version of a package.
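
For example, pinning a specific version looks like this (the version number below is only for illustration):

pip install scipy==1.5.2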

2 Steps to Perform the Mann-Whitney U test in Python

In this section, we will go through the steps to carry out the Mann-Whitney U test using Pandas and SciPy. In the first step, we will get our data. After the data is stored in a dataframe, we will carry out the non-parametric test. 

Step 1: Get your Data

Here’s one way to import data to Python with Pandas:

import pandas as pd

# Getting our data into a dictionary
data = {'Notrt': [7, 5, 6, 4, 12, 9, 8],
        'Trt': [3, 6, 4, 2, 1, 5, 1]}

# Dictionary to Dataframe
df = pd.DataFrame(data)

In the code chunk above, we created a dataframe from a dictionary. Of course, most of the time we will have our data stored in formats such as CSV or Excel.

Example data in wide format

See the following posts about how to import data in Python with Pandas:

wilcoxon rank-sum test in python 2 steps

It’s also worth noting that if your data is stored in long format, you will have to subset your data so that you can get the data for each group into two different variables.

Step 2: Use the mannwhitneyu method from SciPy:

Here’s how to perform the Mann-Whitney U test in Python with SciPy:

from scipy.stats import mannwhitneyu

# Carrying out the Wilcoxon–Mann–Whitney test
results = mannwhitneyu(df['Notrt'], df['Trt'])
results

Notice that we selected the columns, for each group, as x and y parameters to the mannwhitneyu method. If your data, as previously mentioned, is stored in long format (e.g., see image further down below) you can use Pandas query() method to subset the data.

SciPy results from the Mann-Whitney U test in Python

Here’s how to perform the test, using df.query(), if your data is stored in a similar way as in the image above:

import pandas as pd
from scipy.stats import mannwhitneyu

idrt = [i for i in range(1, 8)]
idrt += idrt

data = {'Count': [7, 5, 6, 4, 12, 9, 8, 3, 6, 4, 2, 1, 5, 1],
        'Condition': ['No Treatment']*7 + ['Treatment']*7,
        'IDtrt': idrt}

# Dictionary to Dataframe
df = pd.DataFrame(data)

# Subsetting (i.e., creating new variables):
x = df.query('Condition == "No Treatment"')['Count']
y = df.query('Condition == "Treatment"')['Count']

# Mann-Whitney U test:
mannwhitneyu(x, y)

Now, there are some things to explain here. First, the mannwhitneyu method will, by default, carry out a one-sided test. If we, on the other hand, use the parameter alternative and set it to “two-sided”, we will get different results (see the sketch below). Make sure you check out the documentation before using the method. In the next section, we will have a look at another, previously mentioned, Python package that can also be used to do the Mann-Whitney U test.
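
Here’s a small sketch of requesting the two-sided test explicitly (reusing the df from Step 1):

from scipy.stats import mannwhitneyu

# Explicitly requesting a two-sided test
results_two_sided = mannwhitneyu(df['Notrt'], df['Trt'], alternative='two-sided')
print(results_two_sided)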

Mann-Whitney U Test with the Python Package Pingouin

As previously mentioned, we can also install the Python package Pingouin to carry out the Mann-Whitney U test. Here’s how to perform this test with the mwu() method:

from pingouin import mwu

results2 = mwu(df['Notrt'], df['Trt'], tail='one-sided')

Now, the advantage of using the mwu method is that we get some additional information (e.g., the common language effect size; CLES). Here’s the output:

Results from the mann-whitney u test Python
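
For instance, the effect size can be pulled out of the returned dataframe like this (a sketch; the column names below are those used by recent Pingouin versions and may differ):

# mwu() returns a one-row dataframe; select the statistics of interest
print(results2[['U-val', 'p-val', 'CLES']])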

Interpreting the Results of the Mann-Whitney U test

In this section, we will start off by interpreting the results of the test. Now, this is pretty straightforward.

In our example, we can reject H0 because 3 < 7. Furthermore, we have statistically significant evidence at α =0.05 to show that the treatment groups differ in the number of depressive episodes. Naturally, in a real application, we would have set both the H0 and Ha prior to conducting the hypothesis test, as we did here. 
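
As a quick sketch, the U statistic and p-value can be read off the SciPy result object from Step 2 like this:

# results is a named tuple holding the U statistic and the p-value
print('U =', results.statistic)
print('p =', results.pvalue)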

Visualizing the Data with Boxplots

To aid the interpretation of our results we can create box plots with Pandas:

axarr = df.boxplot(column='Count', by='Condition', figsize=(8, 6), grid=False)
axarr.set_title('')
axarr.set_ylabel('Number of Depressive Episodes')

In the box plot, we can see that the median is greater for the group that did not get any treatment compared to the group that got treatment. Furthermore, if there were any outliers in our data they would show up as dots in the box plot. If you are interested in more data visualization techniques have a look at the post “9 Data Visualization Techniques You Should Learn in Python”.

Visualizing the results of Mann-Whitney U test

Conclusion

In this post, you have learned how to perform the Mann-Whitney U test using the Python packages SciPy, Pandas, and Pingouin. Moreover, you have learned when to carry out this non-parametric test both by learning about e.g. when it is appropriate and by an example. After this, you learned how to carry out the test using data from the example. Finally, you have learned how to interpret the results and visualize the data. Note that you preferably should have a larger sample size than in the example of the current post. Of course, you should also make the decision on whether to carry out a one-sided or two-sided test based on theory. In the example of this post, we can assume that going without treatment would mean more depressive episodes. However, in other examples this may not be true. 

Hope you have learned something and if you have a comment, a suggestion, or anything you can leave a comment below. Finally, I would very much appreciate it if you shared this post across your social media accounts if you found it useful!

References

In this final section, you will find some references and resources that may prove useful. Note, there are both links to blog posts and peer-reviewed articles. Sadly, some of the content here is behind paywalls. 

Mann-Whitney U Test

Mann, H. B.; Whitney, D. R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Statist. 18 (1947), no. 1, 50–60. doi:10.1214/aoms/1177730491. https://projecteuclid.org/euclid.aoms/1177730491

Vargha, A., & Delaney, H. D. (2000). A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2), 101–132. https://doi.org/10.3102/10769986025002101


The post How to Perform Mann-Whitney U Test in Python with Scipy and Pingouin appeared first on Erik Marsja.

September 26, 2020 12:34 PM UTC


Python Circle

Sending email with attachments using Python built-in email module

Sending email with attachments using Python built-in email module, adding image as attachment in email while sending using Python, Automating email sending process using Python, Automating email attachment using python

September 26, 2020 07:44 AM UTC

Python Requests Library: Sending HTTP GET and POST requests using Python

Python requests library to send GET and POST requests, Sending query params in Python Requests GET method, Sending JSON object using python requests POST method, checking response headers and response status in python requests library

September 26, 2020 07:44 AM UTC

How to use Jupyter Notebook for practicing python programs

How to use Jupyter Notebook for practicing python programs, jupyter notebook installation and usage in linux ubuntu 16.04, Writing first program with Jupyter notebook, uploading file in jupyter notebook

September 26, 2020 07:44 AM UTC

Python program to convert Linux file permissions from octal number to rwx string

Python program to convert Linux file permissions from octal number to rwx string, Linux file conversion python program, Python script to convert Linux file permissions

September 26, 2020 07:44 AM UTC

Read, write, tell, seek, check stats, move, copy and delete a file in Python

Performing different file operations in Python, Reading and writing to a file in python, read vs readline vs readlines in python, write vs writelines in python, how to read a file line by line in python, read vs write vs append mode in python file operations, text vs binary read mode in python, solving error can't do nonzero end-relative seeks

September 26, 2020 07:44 AM UTC

Server Access Logging in Django using middleware

Creating access logs in Django application, Logging using middleware in Django app, Creating custom middleware in Django, Server access logging in Django, Server Access Logging in Django using middleware

September 26, 2020 07:44 AM UTC