skip to navigation
skip to content

Planet Python

Last update: November 19, 2024 07:44 AM UTC

November 18, 2024


PyCharm

JetBrains AI Assistant 2024.3 is here! A highlight of this release is the flexibility to choose your preferred chat model. Select between Google Gemini, OpenAI, or local models to tailor interactions for a more customized experience.  This update also brings advanced code completion for all major programming languages, improved context management, and the ability to […]

November 18, 2024 07:41 PM UTC


Python Morsels

Python's pathlib module

Python's pathlib module is the tool to use for working with file paths. See pathlib quick reference tables and examples.

Table of contents

  1. A pathlib cheat sheet
  2. The open function accepts Path objects
  3. Why use a pathlib.Path instead of a string?
  4. The basics: constructing paths with pathlib
  5. Joining paths
  6. Current working directory
  7. Absolute paths
  8. Splitting up paths with pathlib
  9. Listing files in a directory
  10. Reading and writing a whole file
  11. Many common operations are even easier
  12. No need to worry about normalizing paths
  13. Built-in cross-platform compatibility
  14. A pathlib conversion cheat sheet
  15. What about things pathlib can't do?
  16. Should strings ever represent file paths?
  17. Use pathlib for readable cross-platform code

A pathlib cheat sheet

Below is a cheat sheet table of common pathlib.Path operations.

The variables used in the table are defined here:

>>> import pathlib
>>> path = Path("/home/trey/proj/readme.md")
>>> relative = Path("readme.md")
>>> base = Path("/home/trey/proj")
>>> new = Path("/home/trey/proj/sub")
>>> target = path.with_suffix(".txt")  # .md -> .txt
>>> pattern = "*.md"
>>> name = "sub/f.txt"
Path-related task pathlib approach Example
Read all file contents path.read_text() 'Line 1\nLine 2\n'
Write file contents path.write_text('new') Writes new to file
Get absolute file path relative.resolve() Path('/home/trey/proj/readme.md')
Get the filename path.name 'readme.md'
Get parent directory path.parent Path('home/trey/proj')
Get file extension path.suffix '.md'
Get suffix-free name path.stem 'readme'
Ancestor-relative path path.relative_to(base) Path('readme.md')
Verify path is a file path.is_file() True
Make new directory new.mkdir() Makes new directory
Get current directory Path.cwd() Path('/home/trey/proj')
Get home directory Path.home() Path('/home/trey')
Get all ancestor paths path.parents [Path('/home/trey/proj'), ...]
List files/directories base.iterdir() [Path('home/trey/proj/readme.md'), ...]
Find files by pattern base.glob(pattern) [Path('/home/trey/proj/readme.md')]
Find files recursively base.rglob(pattern) [Path('/home/trey/proj/readme.md')]
Join path parts base / name Path('/home/trey/proj/sub/f.txt')
Get file size (bytes) path.stat().st_size 14
Walk the file tree base.walk() Iterable of (path, subdirs, files)
Rename file to new path path.rename(target) Path object for new path
Remove file path.unlink()

Note that iterdir, glob, rglob, and walk all return iterators. The examples above show lists for convenience.

The open function accepts Path objects

What does Python's open function …

Read the full article: https://www.pythonmorsels.com/pathlib-module/

November 18, 2024 05:00 PM UTC


Real Python

Interacting With Python

In this tutorial, you'll explore the various ways of interacting with Python. You'll learn about the REPL for quick testing and running scripts, as well as how to work with IDEs, Jupyter Notebooks, and online interpreters.

November 18, 2024 02:00 PM UTC


Go Deh

There's the easy way...

November 18, 2024 10:11 AM UTC


Python Software Foundation

Help power Python and join in the PSF year-end fundraiser & membership drive!

November 18, 2024 09:56 AM UTC


Python Bytes

#410 Entering the Django core

Topics include Thoughts on Django’s Core, futurepool, Don't return named tuples in new APIs, and Ziglang: Migrating from AWS to Self-Hosting.

November 18, 2024 08:00 AM UTC


James Bennett

Introducing DjangoVer

Version numbering is hard, and there are lots of popular schemes out there for how to do it. Today I want to talk about a system I’ve settled on for my own Django-related packages, and which I’m calling “DjangoVer”, because it ties the version number of a Django-related package to the latest Django version that package supports.

But one quick note to start with: this is not really “introducing” the idea of DjangoVer, because I know I’ve used the name a few times already in other places. I’m also not the person who invented this, and I don’t know for certain who did — I’ve seen several packages which appear to follow some form of DjangoVer and took inspiration from them in defining my own take on it.

Django’s version scheme: an overview

The basic idea of DjangoVer is that the version number of a Django-related package should tell you which version of Django you can use it with. Which probably doesn’t help much if you don’t know how Django releases are numbered, so let’s start there. In brief:

This has been in effect since Django 2.0 was released, and the feature releases have been: 2.0, 2.1, 2.2 (LTS); 3.0, 3.1, 3.2 (LTS); 4.0, 4.1, 4.2 (LTS); 5.0, 5.1. Django 5.2 (LTS) is expected in April 2025, and then eight months later (if nothing is changed) will come Django 6.0.

I’ll talk more about SemVer in a bit, but it’s worth being crystal clear that Django does not follow Semantic Versioning, and the MAJOR number is not a signal about API compatibility. Instead, API compatibility runs LTS-to-LTS, with a simple principle: if your code runs on a Django LTS release and raises no deprecation warnings, it will run unmodified on the next LTS release. So, for example, if you have an application that runs without deprecation warnings on Django 4.2 LTS, it will run unmodified on Django 5.2 LTS (though at that point it might begin raising new deprecation warnings, and you’d need to clear them before it would be safe to upgrade any further).

DjangoVer, defined

In DjangoVer, a Django-related package has a version number of the form DJANGO_MAJOR.DJANGO_FEATURE.PACKAGE_VERSION, where DJANGO_MAJOR and DJANGO_FEATURE indicate the most recent feature release series of Django supported by the package, and PACKAGE_VERSION begins at zero and increments by one with each release of the package supporting that feature release of Django.

Since the version number only indicates the newest Django feature release supported, a package using DjangoVer should also use Python package classifiers to indicate the full range of its Django support (such as Framework :: Django :: 5.1 to indicate support for Django 5.1 — see examples on PyPI).

But while Django takes care to maintain compatibility from one LTS to the next, I do not think DjangoVer packages need to do that; they can use the simpler approach of issuing deprecation warnings for two releases, and then making the breaking change. One of the stated reasons for Django’s LTS-to-LTS compatibility policy is to help third-party packages have an easier time supporting Django releases that people are actually likely to use; otherwise, Django itself generally just follows the “deprecate for two releases, then remove it” pattern. No matter what compatibility policy is chosen, however, it should be documented clearly, since DjangoVer explicitly does not attempt to provide any information about API stability/compatibility in the version number.

That’s a bit wordy, so let’s try an example:

Why another version system?

Some of you probably didn’t even read this far before rushing to instantly post the XKCD “Standards” comic as a reply. Thank you in advance for letting the rest of us know we don’t need to bother listening to or engaging with you. For everyone else: here’s why I think in this case adding yet another “standard” is actually a good idea.

The elephant in the room here is Semantic Versioning (“SemVer”). Others have written about some of the problems with SemVer, but I’ll add my own two cents here: “compatibility” is far too complex and nebulous a concept to be usefully encoded in a simple value like a version number. And if you want my really cynical take, the actual point of SemVer in practice is to protect developers of software from users, by providing endless loopholes and ways to say “sure, this change broke your code, but that doesn’t count as a breaking change”. It’ll turn out that the developer had a different interpretation of the documentation than you did, or that the API contract was “underspecified” and now has been “clarified”, or they’ll just throw their hands up, yell “Hyrum’s Law” and say they can’t possibly be expected to preserve that behavior.

A lot of this is rooted in the belief that changes, and especially breaking changes, are inherently bad and shameful, and that if you introduce them you’re a bad developer who should be ashamed. Which is, frankly, bullshit. Useful software almost always evolves and changes over time, and it’s unrealistic to expect it not to. I wrote about this a few years back in the context of the Python 2/3 transition:

Though there is one thing I think gets overlooked a lot: usually, the anti-Python-3 argument is presented as the desire of a particular company, or project, or person, to stand still and buck the trend of the world to be ever-changing.

But really they’re asking for the inverse of that. Rather than being a fixed point in a constantly-changing world, what they really seem to want is to be the only ones still moving in a world that has become static around them. If only the Python team would stop fiddling with the language! If only the maintainers of popular frameworks would stop evolving their APIs! Then we could finally stop worrying about our dependencies and get on with our real work! Of course, it’s logically impossible for each one of those entities to be the sole mover in a static world, but pointing that out doesn’t always go well.

But that’s a rant for another day and another full post all its own. For now it’s enough to just say I don’t believe SemVer can ever deliver on what it promises. So where does that leave us?

Well, if the version number can’t tell you whether it’s safe to upgrade from one version to another, perhaps it can still tell you something useful. And for me, when I’m evaluating a piece of third-party software for possible use, one of the most important things I want to know is: is someone actually maintaining this? There are lots of potential signals to look for, but some version schemes — like CalVer — can encode this into the version number. Want to know if the software’s maintained? With CalVer you can guess a package’s maintenance status, with pretty good accuracy, from a glance at the version number.

Over the course of this year I’ve been transitioning all my personal non-Django packages to CalVer for precisely this reason. Compatibility, again, is something I think can’t possibly be encoded into a version number, but “someone’s keeping an eye on this” can be. Even if I’m not adding features to something, Python itself does a new version every year and I’ll push a new release to explicitly mark compatibility (as I did recently for the release of Python 3.13). That’ll bump the version number and let anyone who takes a quick glance at it know I’m still there and paying attention to the package.

For packages meant to be used with Django, though, the version number can usefully encode another piece of information: not just “is someone maintaining this”, but “can I use this with my Django installation”. And that is what DjangoVer is about: telling you at a glance the maintenance and Django compatibility status of a package.

DjangoVer in practice

All of my own personal Django-related packages are now using DjangoVer, and say so in their documentation. If I start any new Django-related projects they’ll do the same thing.

A quick scroll through PyPI turns up other packages doing something that looks similar; django-cockroachdb and django-snowflake, for example, versioned their Django 5.1 packages as “5.1”, and explicitly say in their READMEs to install a package version corresponding to the Django version you use (they also have a maintainer in common, who I suspect of having been an early inventor of what I’m now calling “DjangoVer”).

If you maintain a Django-related package, I’d encourage you to at least think about adopting some form of DjangoVer, too. I won’t say it’s the best, period, because something better could always come along, but in terms of information that can be usefully encoded into the version number, I think DjangoVer is the best option I’ve seen for Django-related packages.

November 18, 2024 02:04 AM UTC


Armin Ronacher

Playground Wisdom: Threads Beat Async/Await

November 18, 2024 12:00 AM UTC

November 17, 2024


Django Weblog

2025 DSF Board Election Results

The 2025 DSF Board Election has closed, and the following candidates have been elected:

They will all serve two years for their term.

Directors elected for the 2024 DSF Board, Jacob, Sarah, and Thibaud are continuing with one year left to serve on the board.

Therefore, the combined 2025 DSF Board of Directors are:

Congratulations to our winners, and a huge thank you to our departing board members Çağıl Uluşahin Sonmez, Chaim Kirby, Kátia Yoshime Nakamura, Katie McLaughlin.

Thank you again to everyone who nominated themselves. Even if you were not successful, you gave our community the chance to make their voices heard in who they wanted to represent them.

November 17, 2024 11:56 PM UTC


Paolo Melchiorre

Thoughts on my election as a DSF board member

My thoughts on my election as a member of the Django Software Foundation (DSF) board of directors.

November 17, 2024 11:00 PM UTC


Real Python

Using the Python zip() Function for Parallel Iteration

In this step-by-step tutorial, you'll learn how to use the Python zip() function to solve common programming problems. You'll learn how to traverse multiple iterables in parallel and create dictionaries with just a few lines of code.

November 17, 2024 02:00 PM UTC


Test and Code

223: Writing Stuff Down is a Super Power

Taking notes well can help to listen better, remember things, show respect, be more accountable, free up mind space to solve problems.

This episode discusses


 Learn pytest

November 17, 2024 01:55 AM UTC

November 16, 2024


Real Python

Using the len() Function in Python

In this tutorial, you'll learn how and when to use the len() Python function. You'll also learn how to customize your class definitions so that objects of a user-defined class can be used as arguments in len().

November 16, 2024 02:00 PM UTC

November 15, 2024


Real Python

The Real Python Podcast – Episode #228: Maintaining the Foundations of Python & Cautionary Tales

How do you build a sustainable open-source project and community? What lessons can be learned from Python's history and the current mess that the WordPress community is going through? This week on the show, we speak with Paul Everitt from JetBrains about navigating open-source funding and the start of the Python Software Foundation.

November 15, 2024 12:00 PM UTC


Talk Python to Me

#485: Secure coding for Python with SheHacksPurple

What do developers need to know about AppSec and building secure software? We have Tonya Janca (AKA SheHacksPurple) on the show to tell us all about it. We talk about what developers should expect from threat modeling events as well as concrete tips for security your apps and services.

November 15, 2024 08:00 AM UTC


Matt Layman

Heroku To DigitalOcean - Building SaaS #206

In this episode, I began a migration of my JourneyInbox app from Heroku to DigitalOcean. The first step to this move, since I’m going to use Kamal, is to put the app into a Docker image. We got the whole app into the Docker image, then cleaned up local development and the CI system after making changes that broke those configurations.

November 15, 2024 12:00 AM UTC

November 14, 2024


Python Morsels

Inspecting objects in Python

I rely on 4 functions for inspecting Python objects: type, help, dir, and vars.

Table of contents

  1. Inspecting an object's structure and data
  2. How to see an object's class
  3. Looking up documentation with help
  4. Getting the methods and attributes for an object
  5. Inspecting direct attributes of an object
  6. Base classes, module paths, and more
  7. Inspect Python objects with type, help, dir, and vars

Inspecting an object's structure and data

The scenario is, we're either in the Python REPL or we've used the built-in breakpoint function to drop into the Python debugger within our code. So we're within some sort of interactive Python environment.

For example, we might be running this file, which we've put a breakpoint call in to drop into a Python debugger:

from argparse import ArgumentParser
from collections import Counter
from pathlib import Path
import re

def count_letters(text):
    return Counter(
        char
        for char in text.casefold()
        if char.isalpha()
    )

def main():
    parser = ArgumentParser()
    parser.add_argument("file", type=Path)
    args = parser.parse_args()
    letter_counts = count_letters(args.file.read_text())
    breakpoint()
    for letter, count in letter_counts.most_common():
        print(count, letter.upper())

if __name__ == "__main__":
    main()

And we've used the PDB interact command to start a Python REPL:

~ $ python3 letter_counter.py frankenstein.txt
> /home/trey/letter_counter.py(18)main()
-> breakpoint()
(Pdb) interact
*pdb interact start*
>>>

We have a letter_counts variable that refers to some sort of object. We want to know what this object is all about.

What questions could we ask of this object?

Well, to start with, we could simply refer to the object, and then hit Enter:

>>> letter_counts
Counter({'e': 46043, 't': 30365, 'a': 26743, 'o': 25225, 'i': 24613, 'n': 24367, 's': 21155, 'r': 20818, 'h': 19725, 'd': 16863, 'l': 12739, 'm': 10604, 'u': 10407, 'c': 9243, 'f': 8731, 'y': 7914, 'w': 7638, 'p': 6121, 'g': 5974, 'b': 5026, 'v': 3833, 'k': 1755, 'x': 677, 'j': 504, 'q': 324, 'z': 243})

We've typed the name of a variable that points to an object, and now we see the programmer-readable representation for that object.

How to see an object's class

Often, the string representation tells …

Read the full article: https://www.pythonmorsels.com/inspecting-python-objects/

November 14, 2024 09:49 PM UTC


Django Weblog

Django’s technical governance challenges, and opportunities

As of October 29th, two of four members of the Django Software Foundation Steering Council have resigned from their role, with their intentions being to trigger an election of the Steering Council earlier than otherwise scheduled, per our established governance processes.

To our departing members, Simon and Adam, thank you for your contributions to Django and its governance ❤️. The framework and our community owes a lot to your dedication, and we’re confident our community will join us in celebrating your past contributions – and look forward to learning about your future endeavors in the Django ecosystem. And thanks to the remaining members, James and Andrew, for their service over the years.

Our governance challenges

Governance in open source is hard, and community-driven open source even more so. We’re proud that Django’s original two Benevolent Dictators For Life (BDFLs) both retired from the role and turned things over to community governance ten years ago now. The BDFL model can provide  excellent technical governance, but also has its flaws. So the  mantle of technical governance then went on to the Core Developers and the Technical Board (renamed to Steering Council) was introduced.

However, time has revealed flaws in the Steering Council’s governance model and operations. The Steering Council was able to provide decision-making – tiebreaking when the developer community couldn’t lead to consensus – but didn’t provide more forward-looking leadership or vision. Disagreements over how – or if – the Steering Council should approach this part of leadership led us to the current situation, with no functioning technical governance as of a few weeks ago. Even before those recent events, those flaws were also a common source of frustration for our contributors, and a source of concern for Django users who (rightly or not) might have expectations of Django’s direction – such as the publication of a “roadmap” for Django development.

The Django Software Foundation Board of Directors is and was aware of those issues, and recently made attempts to have the Steering Council rectify them, in coordination with other established community members. The DSF Board has tried to be hands-off when it comes to technical leadership, but in retrospect we should have been getting involved sooner, or more decisively. The lack of technical leadership is an existential threat to Django – a slow moving one, but a threat nonetheless. It’s our responsibility to address this threat.

Where we’re heading

We now need new Steering Council members. But we also need governance reform. There’s a lot about the Steering Council that is good and might only need minimal changes. However, the overall question of the Steering Council’s remit, and how it approaches technical leadership for the Django community, needs to be resolved.

We’re going to hold early elections of the Steering Council, as soon as we’ve completed the ongoing 2025 DSF Board elections. Those elections will follow existing processes, and we will want a Steering Council who strives  to meet the group’s intended goals:

  1. To safeguard big decisions that affect Django projects at a fundamental level.
  2. To help shepherd the project’s future direction.

We expect the new Steering Council will take on those known challenges, resolve those questions of technical leadership, and update Django’s technical governance. They will have the full support of the Board of Directors to address this threat to Django’s future. And the Board will also be more decisive in intervening, should similar issues keep arising.

How you can help

We need contributors willing to take on those challenges and help our community come out ahead. It’s a big role, impactful but demanding. And there are strict, often annoying eligibility rules for the Steering Council.

To help you help us, we’ve set up a form: Django 6.x Steering Council elections - Expression of interest.

If you’re interested in stepping up to shepherd Django’s technical direction, fill in our expression of interest form. We’ll let you know whether or not you meet those eligibility rules, take the guesswork out of the way. You get to focus on your motivation for taking on this kind of high-purpose, high-reward governance role.

And once the elections start and we get to candidate registrations, we’ll be able to reuse details submitted here (if you want to) so the process is smoother for everyone.

Django 6.x Steering Council elections - Expression of interest

How everyone can help

Those elections will be crucial for the future of Django, and will be decided thanks to the vote of our Django Software Foundation Individual Members. If you know people who contribute to ​​the DSF’s mission but aren’t Individual Members already -- use our form to nominate them as Individual Members, so they’re eligible to vote. If you’re that person, do nominate yourself. We consider all contributions towards our mission: advancing and promoting Django, protecting the framework’s long-term viability, and advancing the state of the art in web development.


Any questions? Comment on our forum thread, Discussion thread for “Django’s technical governance challenges, and opportunities” blog post, or reach out via email to foundation@djangoproject.com.

November 14, 2024 05:00 PM UTC


PyCharm

Inline AI Prompting, Coding Assistance for the dataclass_transform Decorator (PEP 681), and More in PyCharm 2024.3!

Code smarter, optimize performance, and stay focused on what matters most with the latest updates in PyCharm 2024.3. From enhanced support for AI Assistant and Jupyter notebooks to new features like no-code data filtering, there’s so much to explore.  Learn about all the updates on our What’s New page, download the latest version from our […]

November 14, 2024 01:42 PM UTC


Real Python

Quiz: Namespaces and Scope in Python

In this quiz, you'll test your understanding of Python namespaces and variable scope. These concepts are crucial for organizing the symbolic names assigned to objects in a Python program and ensuring they don't interfere with one another.

November 14, 2024 12:00 PM UTC


PyPy

Guest Post: Final Encoding in RPython Interpreters

Introduction

This post started as a quick note summarizing a recent experiment I carried out upon a small RPython interpreter by rewriting it in an uncommon style. It is written for folks who have already written some RPython and want to take a deeper look at interpreter architecture.

Some experiments are about finding solutions to problems. This experiment is about taking a solution which is already well-understood and applying it in the context of RPython to find a new approach. As we will see, there is no real change in functionality or the number of clauses in the interpreter; it's more like a comparison between endo- and exoskeletons, a different arrangement of equivalent bones and plates.

Overview

An RPython interpreter for a programming language generally does three or four things, in order:

  1. Read and parse input programs
  2. Encode concrete syntax as abstract syntax
  3. Optionally, optimize or reduce the abstract syntax
  4. Evaluate the abstract syntax: read input data, compute, print output data, etc.

Today we'll look at abstract syntax. Most programming languages admit a concrete parse tree which is readily abstracted to provide an abstract syntax tree (AST). The AST is usually encoded with the initial style of encoding. An initial encoding can be transformed into any other encoding for the same AST, looks like a hierarchy of classes, and is implemented as a static structure on the heap.

In contrast, there is also a final encoding. A final encoding can be transformed into by any other encoding, looks like an interface for the actions of the interpreter, and is implemented as an unwinding structure on the stack. From the RPython perspective, Python builtin modules like os or sys are final encodings for features of the operating system; the underlying implementation is different when translated or untranslated, but the interface used to access those features does not change.

In RPython, an initial encoding is built from a hierarchy of classes. Each class represents a type of tree nodes, corresponding to a parser production in the concrete parse tree. Each class instance therefore represents an individual tree node. The fields of a class, particularly those filled during .__init__(), store pre-computed properties of each node; methods can be used to compute node properties on demand. This seems like an obvious and simple approach; what other approaches could there be? We need an example.

Final Encoding of Brainfuck

We will consider Brainfuck, a simple Turing-complete programming language. An example Brainfuck program might be:

[-]

This program is built from a loop and a decrement, and sets a cell to zero. In an initial encoding which follows the algebraic semantics of Brainfuck, the program could be expressed by applying class constructors to build a structure on the heap:

Loop(Plus(-1))

A final encoding is similar, except that class constructors are replaced by methods, the structure is built on the stack, and we are parameterized over the choice of class:

lambda cls: cls.loop(cls.plus(-1))

In ordinary Python, transforming between these would be trivial, and mostly is a matter of passing around the appropriate class. Indeed, initial and final encodings are equivalent; we'll return to that fact later. However, in RPython, all of the types must line up, and classes must be determined before translation. We'll need to monomorphize our final encodings, using some RPython tricks later on. Before that, let's see what an actual Brainfuck interface looks like, so that we can cover all of the difficulties with final encoding.

Before we embark, please keep in mind that local code doesn't know what cls is. There's no type-safe way to inspect an arbitrary semantic domain. In the initial-encoded version, we can ask isinstance(bf, Loop) to see whether an AST node is a loop, but there simply isn't an equivalent for final-encoded ASTs. So, there is an implicit challenge to think about: how do we evaluate a program in an arbitrary semantic domain? For bonus points, how do we optimize a program without inspecting the types of its AST nodes?

What follows is a dissection of this module at the given revision. Readers may find it satisfying to read the entire interpreter top to bottom first; it is less than 300 lines.

Core Functionality

Final encoding is given as methods on an interface. These five methods correspond precisely to the summands of the algebra of Brainfuck.

class BF(object):
    # Other methods elided
    def plus(self, i): pass
    def right(self, i): pass
    def input(self): pass
    def output(self): pass
    def loop(self, bfs): pass

Note that the .loop() method takes another program as an argument. Initial-encoded ASTs have other initial-encoded ASTs as fields on class instances; final-encoded ASTs have other final-encoded ASTs as parameters to interface methods. RPython infers all of the types, so the reader has to know that i is usually an integer while bfs is a sequence of Brainfuck operations.

We're using a class to implement this functionality. Later, we'll treat it as a mixin, rather than a superclass, to avoid typing problems.

Monoid

In order to optimize input programs, we'll need to represent the underlying monoid of Brainfuck programs. To do this, we add the signature for a monoid:

class BF(object):
    # Other methods elided
    def unit(self): pass
    def join(self, l, r): pass

This is technically a unital magma, since RPython doesn't support algebraic laws, but we will enforce the algebraic laws later on during optimization. We also want to make use of the folklore that free monoids are lists, allowing callers to pass a list of actions which we'll reduce with recursion:

class BF(object):
    # Other methods elided
    def joinList(self, bfs):
        if not bfs: return self.unit()
        elif len(bfs) == 1: return bfs[0]
        elif len(bfs) == 2: return self.join(bfs[0], bfs[1])
        else:
            i = len(bfs) >> 1
            return self.join(self.joinList(bfs[:i]), self.joinList(bfs[i:]))

.joinList() is a little bulky to implement, but Wirth's principle applies: the interpreter is shorter with it than without it.

Idioms

Finally, our interface includes a few high-level idioms, like the zero program shown earlier, which are defined in terms of low-level behaviors. In an initial encoding, these could be defined as module-level functions; here, we define them on the mixin class BF.

class BF(object):
    # Other methods elided
    def zero(self): return self.loop(self.plus(-1))
    def move(self, i): return self.scalemove(i, 1)
    def move2(self, i, j): return self.scalemove2(i, 1, j, 1)
    def scalemove(self, i, s):
        return self.loop(self.joinList([
            self.plus(-1), self.right(i), self.plus(s), self.right(-i)]))
    def scalemove2(self, i, s, j, t):
        return self.loop(self.joinList([
                self.plus(-1), self.right(i), self.plus(s), self.right(j - i),
                self.plus(t), self.right(-j)]))

Interface-oriented Architecture

Applying Interfaces

Now, we hack at RPython's object model until everything translates. First, consider the task of pretty-printing. For Brainfuck, we'll simply regurgitate the input program as a Python string:

class AsStr(object):
    import_from_mixin(BF)
    def unit(self): return ""
    def join(self, l, r): return l + r
    def plus(self, i): return '+' * i if i > 0 else '-' * -i
    def right(self, i): return '>' * i if i > 0 else '<' * -i
    def loop(self, bfs): return '[' + bfs + ']'
    def input(self): return ','
    def output(self): return '.'

Via rlib.objectmodel.import_from_mixin, no stressing with covariance of return types is required. Instead, we shift from a Java-esque view of classes and objects, to an OCaml-ish view of prebuilt classes and constructors. AsStr is monomorphic, and any caller of it will have to create their own covariance somehow. For example, here are the first few lines of the parsing function:

@specialize.argtype(1)
def parse(s, domain):
    ops = [domain.unit()]
    # Parser elided to preserve the reader's attention

By invoking rlib.objectmodel.specialize.argtype, we make copies of the parsing function, up to one per call site, based on our choice of semantic domain. Oleg calls these "symantics" but I prefer "domain" in code. Also, note how the parsing stack starts with the unit of the monoid, which corresponds to the empty input string; the parser will repeatedly use the monoidal join to build up a parsed expression without inspecting it. Here's a small taste of that:

while i < len(s):
    char = s[i]
    if char == '+': ops[-1] = domain.join(ops[-1], domain.plus(1))
    elif char == '-': ops[-1] = domain.join(ops[-1], domain.plus(-1))
    # and so on

The reader may feel justifiably mystified; what breaks if we don't add these magic annotations? Well, the translator will throw UnionError because the low-level types don't match. RPython only wants to make one copy of functions like parse() in its low-level representation, and each copy of parse() will be compiled to monomorphic machine code. In this interpreter, in order to support parsing to an optimized string and also parsing to an evaluator, we need two copies of parse(). It is okay to not fully understand this at first.

Composing Interfaces

Earlier, we noted that an interpreter can optionally optimize input programs after parsing. To support this, we'll precompose a peephole optimizer onto an arbitrary domain. We could also postcompose with a parser instead, but that sounds more difficult. Here are the relevant parts:

def makePeephole(cls):
    domain = cls()
    def stripDomain(bfs): return domain.joinList([t[0] for t in bfs])
    class Peephole(object):
        import_from_mixin(BF)
        def unit(self): return []
        def join(self, l, r): return l + r
        # Actual definition elided... for now...
    return Peephole, stripDomain

Don't worry about the actual optimization yet. What's important here is the pattern of initialization of semantic domains. makePeephole is an SML-style functor on semantic domains: given a final encoding of Brainfuck, it produces another final encoding of Brainfuck which incorporates optimizations. The helper stripDomain is a finalizer which performs the extraction from the optimizer's domain to the underlying cls that was passed in at translation time. For example, let's optimize pretty-printing:

AsStr, finishStr = makePeephole(AsStr)

Now, it only takes one line to parse and print an optimized AST without ever building it on the heap. To be pedantic, fragments of the output string will be heap-allocated, but the AST's node structure will only ever be stack-allocated. Further, to be shallow, the parser is written to prevent malicious input from causing a stack overflow, and this forces it to maintain a heap-allocated RPython list of intermediate operations inside loops.

print finishStr(parse(text, AsStr()))

Performance

But is it fast? Yes. It's faster than the prior version, which was initial-encoded, and also faster than Andrew Brown's classic version (part 1, part 2). Since Brown's interpreter does not perform much optimization, we will focus on how final encoding can outperform initial encoding.

JIT

First, why is it faster than the same interpreter with initial encoding? Well, it still has initial encoding from the JIT's perspective! There is an Op class with a hierarchy of subclasses implementing individual behaviors. A sincere tagless-final student, or those who remember Stop Writing Classes (2012, Pycon US), will recognize that the following classes could be plain functions, and should think of the classes as a concession to RPython's lack of support for lambdas with closures rather than an initial encoding. We aren't ever going to directly typecheck any Op, but the JIT will generate typechecking guards anyway, so we effectively get a fully-promoted AST inlined into each JIT trace. First, some simple behaviors:

class Op(object): _immutable_ = True

class _Input(Op):
    _immutable_ = True
    def runOn(self, tape, position):
        tape[position] = ord(os.read(0, 1)[0])
        return position
Input = _Input()

class _Output(Op):
    _immutable_ = True
    def runOn(self, tape, position):
        os.write(1, chr(tape[position]))
        return position
Output = _Output()

class Add(Op):
    _immutable_ = True
    _immutable_fields_ = "imm",
    def __init__(self, imm): self.imm = imm
    def runOn(self, tape, position):
        tape[position] += self.imm
        return position

The JIT does technically have less information than before; it no longer knows that a sequence of immutable operations is immutable enough to be worth unrolling, but a bit of rlib.jit.unroll_safe fixes that:

class Seq(Op):
    _immutable_ = True
    _immutable_fields_ = "ops[*]",
    def __init__(self, ops): self.ops = ops
    @unroll_safe
    def runOn(self, tape, position):
        for op in self.ops: position = op.runOn(tape, position)
        return position

Finally, the JIT entry point is at the head of each loop, just like with prior interpreters. Since Brainfuck doesn't support mid-loop jumps, there's no penalty for only allowing merge points at the head of the loop.

class Loop(Op):
    _immutable_ = True
    _immutable_fields_ = "op",
    def __init__(self, op): self.op = op
    def runOn(self, tape, position):
        op = self.op
        while tape[position]:
            jitdriver.jit_merge_point(op=op, position=position, tape=tape)
            position = op.runOn(tape, position)
        return position

That's the end of the implicit challenge. There's no secret to it; just evaluate the AST. Here's part of the semantic domain for evaluation, as well as the "functor" to optimize it. In AsOps.join() are the only isinstance() calls in the entire interpreter! This is acceptable because Seq is effectively a type wrapper for an RPython list, so that a list of operations is also an operation; its list is initial-encoded and available for inspection.

class AsOps(object):
    import_from_mixin(BF)
    def unit(self): return Shift(0)
    def join(self, l, r):
        if isinstance(l, Seq) and isinstance(r, Seq):
            return Seq(l.ops + r.ops)
        elif isinstance(l, Seq): return Seq(l.ops + [r])
        elif isinstance(r, Seq): return Seq([l] + r.ops)
        return Seq([l, r])
    # Other methods elided!

AsOps, finishOps = makePeephole(AsOps)

And finally here is the actual top-level code to evaluate the input program. As before, once everything is composed, the actual invocation only takes one line.

tape = bytearray("\x00" * cells)
finishOps(parse(text, AsOps())).runOn(tape, 0)

Peephole Optimization

Our peephole optimizer is an abstract interpreter with one instruction of lookahead/rewrite buffer. It implements the aforementioned algebraic laws of the Brainfuck monoid. It also implements idiom recognition for loops. First, the abstract interpreter. The abstract domain has six elements:

class AbstractDomain(object): pass
meh, aLoop, aZero, theIdentity, anAdd, aRight = [AbstractDomain() for _ in range(6)]

We'll also tag everything with an integer, so that anAdd or aRight can be exact annotations. This is the actual Peephole.join() method:

def join(self, l, r):
    if not l: return r
    rv = l[:]
    bfHead, adHead, immHead = rv.pop()
    for bf, ad, imm in r:
        if ad is theIdentity: continue
        elif adHead is aLoop and ad is aLoop: continue
        elif adHead is theIdentity:
            bfHead, adHead, immHead = bf, ad, imm
        elif adHead is anAdd and ad is aZero:
            bfHead, adHead, immHead = bf, ad, imm
        elif adHead is anAdd and ad is anAdd:
            immHead += imm
            if immHead: bfHead = domain.plus(immHead)
            elif rv: bfHead, adHead, immHead = rv.pop()
            else:
                bfHead = domain.unit()
                adHead = theIdentity
        elif adHead is aRight and ad is aRight:
            immHead += imm
            if immHead: bfHead = domain.right(immHead)
            elif rv: bfHead, adHead, immHead = rv.pop()
            else:
                bfHead = domain.unit()
                adHead = theIdentity
        else:
            rv.append((bfHead, adHead, immHead))
            bfHead, adHead, immHead = bf, ad, imm
    rv.append((bfHead, adHead, immHead))
    return rv

If this were to get much longer, then implementing a DSL would be worth it, but this is a short-enough method to inline. The abstract interpretation is assumed by induction for the left-hand side of the join, save for the final instruction, which is loaded into a rewrite register. Each instruction on the right-hand side is inspected exactly once. The logic for anAdd followed by anAdd is exactly the same as for aRight followed by aRight because they both have underlying Abelian groups given by the integers. The rewrite register is carefully pushed onto and popped off from the left-hand side in order to cancel out theIdentity, which itself is merely a unifier for anAdd or aRight of 0.

Note that we generate a lot of garbage. For example, parsing a string of n '+' characters will cause the peephole optimizer to allocate n instances of the underlying domain.plus() action, from domain.plus(1) up to domain.plus(n). An older initial-encoded version of this interpreter used hash consing to avoid ever building an op more than once, even loops. It appears more efficient to generate lots of immutable garbage than to repeatedly hash inputs and search mutable hash tables, at least for optimizing Brainfuck incrementally during parsing.

Finally, let's look at idiom recognition. RPython lists are initial-coded, so we can dispatch based on the length of the list, and then inspect the abstract domains of each action.

def isConstAdd(bf, i): return bf[1] is anAdd and bf[2] == i

def oppositeShifts(bf1, bf2):
    return bf1[1] is bf2[1] is aRight and bf1[2] == -bf2[2]

def oppositeShifts2(bf1, bf2, bf3):
    return (bf1[1] is bf2[1] is bf3[1] is aRight and
            bf1[2] + bf2[2] + bf3[2] == 0)

def loop(self, bfs):
    if len(bfs) == 1:
        bf, ad, imm = bfs[0]
        if ad is anAdd and imm in (1, -1):
            return [(domain.zero(), aZero, 0)]
    elif len(bfs) == 4:
        if (isConstAdd(bfs[0], -1) and
            bfs[2][1] is anAdd and
            oppositeShifts(bfs[1], bfs[3])):
            return [(domain.scalemove(bfs[1][2], bfs[2][2]), aLoop, 0)]
        if (isConstAdd(bfs[3], -1) and
            bfs[1][1] is anAdd and
            oppositeShifts(bfs[0], bfs[2])):
            return [(domain.scalemove(bfs[0][2], bfs[1][2]), aLoop, 0)]
    elif len(bfs) == 6:
        if (isConstAdd(bfs[0], -1) and
            bfs[2][1] is bfs[4][1] is anAdd and
            oppositeShifts2(bfs[1], bfs[3], bfs[5])):
            return [(domain.scalemove2(bfs[1][2], bfs[2][2],
                                       bfs[1][2] + bfs[3][2],
                                       bfs[4][2]), aLoop, 0)]
        if (isConstAdd(bfs[5], -1) and
            bfs[1][1] is bfs[3][1] is anAdd and
            oppositeShifts2(bfs[0], bfs[2], bfs[4])):
            return [(domain.scalemove2(bfs[0][2], bfs[1][2],
                                       bfs[0][2] + bfs[2][2],
                                       bfs[3][2]), aLoop, 0)]
    return [(domain.loop(stripDomain(bfs)), aLoop, 0)]

This ends the bonus question. How do we optimize an unknown semantic domain? We must maintain an abstract context which describes elements of the domain. In initial encoding, we ask an AST about itself. In final encoding, we already know everything relevant about the AST.

The careful reader will see that I didn't really answer that opening question in the JIT section. Because the JIT still ranges over the same operations as before, it can't really be slower; but why is it now faster? Because the optimizer is now slightly better in a few edge cases. It performs the same optimizations as before, but the rigor of abstract interpretation causes it to emit slightly better operations to the JIT backend.

Concretely, improving the optimizer can shorten pretty-printed programs. The Busy Beaver Gauge measures the length of programs which search for solutions to mathematical problems. After implementing and debugging the final-encoded interpreter, I found that two of my entries on the Busy Beaver Gauge for Brainfuck had become shorter by about 2%. (Most other entries are already hand-optimized according to the standard algebra and have no optimization opportunities.)

Discussion

Given that initial and final encodings are equivalent, and noting that RPython's toolchain is written to prefer initial encodings, what did we actually gain? Did we gain anything?

One obvious downside to final encoding in RPython is interpreter size. The example interpreter shown here is a rewrite of an initial-encoded interpreter which can be seen here for comparison. Final encoding adds about 20% more code in this case.

Final encoding is not necessarily more code than initial encoding, though. All AST encodings in interpreters are subject to the Expression Problem, which states that there is generally a quadratic amount of code required to implement multiple behaviors for an AST with multiple types of nodes; specifically, n behaviors for m types of nodes require n × m methods. Initial encodings improve the cost of adding new types of nodes; final encodings improve the cost of adding new behaviors. Final encoding may tend to win in large codebases for mature languages, where the language does not change often but new behaviors are added frequently and maintained for long periods.

Optimizations in final encoding require a bit of planning. The abstract-interpretation approach is solid but relies upon the monoid and its algebraic laws. In the worst case, an entire class hierarchy could be required to encode the abstraction.

It is remarkable to find a 2% improvement in residual program size merely by reimplementing an optimizer as an abstract interpreter respecting the algebraic laws. This could be the most important lesson for compiler engineers, if it happens to generalize.

Final encoding was popularized via the tagless-final movement in OCaml and Scala, including famously in a series of tutorials by Kiselyov et al. A "tag", in this jargon, is a runtime identifier for an object's type or class; a tagless encoding effectively doesn't allow isinstance() at all. In the above presentation, tags could be hacked in, but were not materially relevant to most steps. Tags were required for the final evaluation step, though, and the tagless-final insight is that certain type systems can express type-safe evaluation without those tags. We won't go further in this direction because tags also communicate valuable information to the JIT.

Summarizing Table

Initial Encoding Final Encoding
hierarchy of classes signature of interfaces
class constructors method calls
built on the heap built on the stack
traversals allocate stack traversals allocate heap
tags are available with isinstance() tags are only available through hacks
cost of adding a new AST node: one class cost of adding a new AST node: one method on every other class
cost of adding a new behavior: one method on every other class cost of adding a new behavior: one class

Credits

Thanks to folks in #pypy on Libera Chat: arigato for the idea, larstiq for pushing me to write it up, and cfbolz and mattip for reviewing and finding mistakes. The original IRC discussion leading to this blog post is available here.

This interpreter is part of the rpypkgs suite, a Nix flake for RPython interpreters. Readers with Nix installed can run this interpreter directly from the flake:

$ nix-prefetch-url https://github.com/MG-K/pypy-tutorial-ko/raw/refs/heads/master/mandel.b
$ nix run github:rpypkgs/rpypkgs#bf -- /nix/store/ngnphbap9ncvz41d0fkvdh61n7j2bg21-mandel.b

November 14, 2024 08:42 AM UTC


Python Bytes

#409 We've moved to Hetzner write-up

Topics include terminal-tree, posting: The API client that lives in your terminal, , and UV does everything or enough that I'm not sure what else it needs to do.

November 14, 2024 08:00 AM UTC


Stefan Scherfke

Publishing to PyPI with a Trusted Publisher from GitLab CI/CD

Learn how to securely upload Python packages to PyPI from GitLab CI/CD pipelines using a “Trusted Publisher” (and without API tokens). Continuously test the release process with TestPyPI on every push. Use GitLab (deploy) environments as an additional security measure.

November 14, 2024 06:18 AM UTC


Seth Michael Larson

Early promising results with SBOMs and Python packages

November 14, 2024 12:00 AM UTC

November 13, 2024


Bojan Mihelac

Building docker images with private python packages

Installing private python packages into Docker container can be tricky because container does not have access to private repositories and you do not want to leave trace of private ssh key in docker…

November 13, 2024 04:37 PM UTC