
Planet Python

Last update: May 04, 2026 07:44 PM UTC

May 04, 2026


Real Python

A New Python Packaging Council and Other News for May 2026

April gave Python developers a new governing body. PEP 772 was accepted on April 16, creating a dedicated Python Packaging Council that will make binding decisions about packaging standards and tools. After years of informal coordination through the Python Packaging Authority (PyPA), the packaging community now has its own elected five-member council with authority comparable to the Steering Council.

On the release side, Python 3.15.0 alpha 8 dropped with a refreshed JIT delivering 6–7 percent speedups on x86-64 Linux and 12–13 percent on AArch64 macOS. The core team also decided to revert 3.14’s incremental garbage collector after production reports of runaway memory use, with the fix landing in the upcoming 3.14.5 patch release. The next pre-release is the first beta, scheduled for May 5, which marks the feature freeze for Python 3.15.

Elsewhere, Google released the open-weights Gemma 4 family, Starlette 1.0 shipped for FastAPI’s foundation, and the broader Python ecosystem absorbed the news that OpenAI acquired Astral, the company behind uv, Ruff, and ty. Get ready to dig into the biggest Python news from the past month!


Python Releases and PEP Highlights

April pushed Python 3.15 to its final alpha before the beta freeze, walked back the incremental garbage collector introduced in 3.14, and gave the Steering Council a busy month of PEP decisions. The packaging community even got its own elected governing body for the first time. Plenty to unpack on the language and process side of the ecosystem.

Python 3.15.0 Alpha 8: Final Alpha Before Beta Freeze

Python 3.15.0a8 landed on April 7, released alongside maintenance updates 3.14.4 and 3.13.13. Release manager Hugo van Kemenade confirmed that a8 is the final alpha before the beta phase begins. If you maintain a library, this is the last alpha where you can file an issue against an unreleased feature and reasonably expect it to land before the freeze.

Alpha 8 consolidates a long list of PEPs you’ve been hearing about in earlier alphas.

The headline number is the JIT performance jump. On x86-64 Linux, the alpha reports a 6–7 percent geometric mean improvement over the standard interpreter. On AArch64 macOS, the gain is 12–13 percent over the tail-calling interpreter introduced in 3.14. Those aren’t microbenchmark curiosities. They’re cumulative gains across a broad suite of workloads.

Note: If you haven’t tried the alpha yet, installing it in an isolated environment is a good idea. Running uv python install 3.15.0a8 pulls the binary, and pyenv handles alpha builds too. The next pre-release, 3.15.0 beta 1, is scheduled for May 5, which marks the feature freeze. After that, no new PEPs land in 3.15.

Incremental GC Reverted in 3.14.5 and 3.15

On April 16, release manager Hugo van Kemenade proposed reverting the incremental garbage collector that debuted in Python 3.14, and the core team agreed. The revert will ship in Python 3.14.5 and also make it into 3.15 before feature freeze.

The reasoning is practical. Neil Schemenauer’s testing on production workloads showed that the incremental collector cut maximum pause times from 26 ms down to 1.3 ms, which looks great on paper. But peak memory usage climbed to as much as 5x the generational baseline in the worst case, and total runtime went up, not down, because of the extra bookkeeping.

For most Python programs, like web apps, data pipelines, and batch jobs, reducing long pauses isn’t the win that matters. Memory pressure is.

The unusual part is doing this in a patch release. The working assumption during 3.14’s release cycle was that the incremental GC had earned its place. Rolling it back in 3.14.5 is a reminder that “passed the benchmark suite” and “works in production” aren’t the same thing. If you noticed your 3.14 deployments using noticeably more memory than 3.13, this is almost certainly why, and 3.14.5 should give you the old behavior back.

Note: The incremental approach isn’t dead. The core team noted that it could return in Python 3.16 through a proper PEP review process, which the original implementation had skipped. If you were on the fence about the switch, waiting for the formal design round is probably the right call.

PEP 772 Accepted: Python Gets a Packaging Council

On April 16, the Python Software Foundation (PSF) and the Steering Council accepted PEP 772, which creates a five-member Packaging Council with broad authority over packaging standards, tools, and implementations. It’s one of the biggest governance changes the ecosystem has seen since the Steering Council itself was established back in 2019.

Council members will be elected by PSF voting members who opt into the election. The council runs on staggered two-year terms, with two seats and three seats rotating in different cycles to preserve institutional continuity. Decision-making emphasizes consensus over voting, following the same pattern that has worked for the Steering Council.

The practical impact is that a formal, elected body now owns decisions about tools like pip, setuptools, and PyPI, replacing the ambiguous delegation model defined in PEP 609.

If you’ve ever wondered why packaging decisions in Python sometimes feel stuck in committee, PEP 772 is the structural answer to that complaint. It also sets the stage for the council to weigh in on LLM-era packaging concerns, which are popping up faster than the PyPA’s informal coordination could address them.

PEP 803 Accepted: Stable ABI Goes Free-Threaded

PEP 803 was accepted on March 30 and targets Python 3.15. It defines abi3t, a new variant of the stable ABI that works with free-threaded builds. When the Steering Council accepted PEP 779 last year, it promised free-threading would get a proper stable ABI story for 3.15. PEP 803 is that follow-through.

Read the full article at https://realpython.com/python-news-may-2026/ »



May 04, 2026 02:00 PM UTC


PyCon

Everything You Always Wanted to Know About Sprints!

Who is at the sprints?

Sprints at PyCon US are organized by the attendees. The conference provides the space with tables, power strips, Internet connectivity, and, for the first two days, catered lunches. The attendees band together to work on their open-source and community projects of choice.

You can expect bigger projects, like CPython, Django, Flask, or BeeWare, to have their own dedicated rooms to hack on their stuff. Smaller projects either group together topically or just join a random friendly room with empty seats.

You’ll find project maintainers, seasoned contributors, community organizers, and first-time contributors alike. Everybody’s welcome!

Which project should I join for sprints?

This is a question worth answering before coming. Sprints work best for contributors who are already users of a given project. If you know a project like, say, CPython or Django well enough, you probably stumbled upon a bug in that software in the past. Maybe you looked into how some internals of that software work. Maybe you thought about making a change to it in the past. Or maybe you at least saw enough tracebacks that you feel somewhat familiar with the software you’re using.

The most impactful way to participate in sprints is to choose a project you’re familiar with, at least as a user. The project maintainers who come to sprints are helpful people and want you to succeed with your contribution. Meet them halfway and choose a project you already use. There’s not enough time during the sprints for maintainers to install software for the first time and explain how it works on somebody else’s laptop.

First PR to CPython

For CPython in particular, the maintainers at the sprint recommend that sprint attendees have at least 2 years of experience as users of the Python programming language. A sprint isn’t the best place to learn the basics of programming. But it’s totally okay to be new at contributing to Python. This year we’ll be continuing to host a dedicated space for first-time contributors to CPython to make sure the experts around you are best-equipped to help you get your first pull request merged. Look for the dedicated “First PR to CPython” signage on the sprint room list.

Are there any easy issues to work on?

This depends on the project. Some issue trackers contain issues marked as “easy” or “help wanted”. You can look at those first. Sometimes such issues are a bit of a trap. When an “easy” issue is still open 3+ years later, it often means there’s something not particularly easy about it.

Asking the maintainers present at the sprint for an issue to fix is another way to get something to work on. That will often mean that you’ll get involved with whatever that maintainer already planned to work on. This might be super interesting or pretty boring to you, depending on what that particular topic area is.

Speak up!

Sprints are different from tutorials and talks. Everybody, including the experts in the room, hunkers down and works on something. If you get stuck on an issue, others around you will probably not notice. A good rule of thumb is to wrestle with an issue for 45 minutes, and if you’re not making progress, let others know where you’re stuck. They might be able to help unblock you.

Have fun!

Sprints are fun, surprisingly productive, and sometimes spark big things! We also can’t overstate the importance of putting a face to a name. “Don’t be a stranger” is good advice. Connecting a GitHub account name with a good experience you had at a conference is a great way to build trust and fast-track online collaboration in the future!

And that’s that. The mysterious sprints demystified! Come and see for yourself whether they’re really the best part of PyCon US.


(Abridged from Łukasz Langa’s 2025 post)

May 04, 2026 01:49 PM UTC


Real Python

Quiz: Data Management With Python, SQLite, and SQLAlchemy

In this quiz, you’ll test your understanding of the tutorial Data Management With Python, SQLite, and SQLAlchemy.

By working through this quiz, you’ll revisit how Python, SQLite, and SQLAlchemy work together to give your programs reliable data storage.

You’ll also check your grasp of primary and foreign keys, SQL operations, and the SQLAlchemy models that let you work with your data as Python objects.



May 04, 2026 12:00 PM UTC


The Python Coding Stack

Do You Get It Now?

When you decide to learn about Python’s special methods, you have to choose which ones to learn first. Some are more straightforward than others. It makes sense to prioritise them.

However, there are some headaches and rabbit holes along the way.

And one of these challenges is when you start exploring the “get*” dunder methods. You come across .__getitem__(), .__getattr__(), .__getattribute__(), and .__get__() and you think:

“Aren’t they all the same thing? They’re all ‘getting’ stuff, right?”

Well, yes, they’re all “getting stuff”, which is why they have “get” in their names. But, as you probably guessed by now, they do different things.

One Deals With [] • The Others Deal With .

Let’s start with the odd one out, which is also possibly the least challenging of the lot. The .__getitem__() special method deals with the square bracket notation, [], which you place after an object’s name. These are the square brackets you use to get an item from a list or a dictionary, say. You’ll see later that the other special methods with “get” in their name deal with the dot, ., which you use in a different context in Python.

With lists, or any other sequence, you use the square brackets with an index that represents the position of an item within the data structure. So, numbers[0] gives you the value of the first item in numbers if numbers is a sequence.

With dictionaries, or mappings in general, you place the key inside the square brackets to fetch the value associated with that key, such as points["Sam"], which gives the value associated with the key "Sam" if it exists.

Let’s explore a custom class:

All code blocks are available in text format at the end of this article • #1

You pass a mapping to PointsTable when creating an instance of the class. The data is stored as a dictionary in the ._data attribute.

Now, let’s say you’d like to access points from a PointsTable instance using the square bracket notation, as you would do with a dictionary:

#2

Unfortunately, this doesn’t work:

Traceback (most recent call last):
  ...
    print(table["Mark"])
          ~~~~~^^^^^^^^
TypeError: 'PointsTable' object is not subscriptable

The PointsTable instance is not a dictionary. It includes a dictionary as a data attribute. The PointsTable object itself is not subscriptable, which means you can’t use the square bracket notation to fetch an item.

When Python sees the square bracket notation after an object identifier, such as table, it calls the class’s .__getitem__() special method. If it’s not there, as in this case, Python raises a TypeError saying the object is not subscriptable.

But what does this tell you? If you want to use the square brackets notation, you just need to define .__getitem__() in the class:

#3

Now, Python finds the .__getitem__() special method. It takes whatever you placed within the square brackets and passes it as an argument to .__getitem__(). Therefore, table["Mark"] returns whatever the call to PointsTable.__getitem__(table, "Mark") returns:

17

The code outputs Mark’s points. The PointsTable class is now subscriptable since it has a .__getitem__() special method. The term item generally refers to the objects contained within a data structure. Therefore, .__getitem__() is there to let you get an item from within a data structure.

Let’s make this example a bit more interesting before we move on to the other special methods with “get” in their name.

Try the following:

#4

This raises an error:

Traceback (most recent call last):
  File ..., line 18, in <module>
    print(table["Mark", "Stephen"])
          ~~~~~^^^^^^^^^^^^^^^^^^^
  File ..., line 6, in __getitem__
    return self._data[item]
           ~~~~~~~~~~^^^^^^
KeyError: ('Mark', 'Stephen')

There’s no key equal to ("Mark", "Stephen"). So, you get a KeyError. Note how Python places parentheses around the two names when showing you the KeyError, even though you didn’t use the parentheses within the square brackets in table["Mark", "Stephen"]. This gives you a clue to what’s happening. But let’s explore this so we’re sure and we don’t just rely on our hunches:

#5

You add two print() calls in .__getitem__(). Here’s what they print out before Python raises the KeyError:

item=('Mark', 'Stephen')
type(item)=<class 'tuple'>

The object Python passes to .__getitem__() is the tuple ("Mark", "Stephen"). You decide to make your PointsTable class super-flexible so you can include more than one name within the square brackets notation:

#6

The .__getitem__() method now checks whether item is a tuple first. If it is, then .__getitem__() returns a tuple with all the values. If item is not a tuple, then it must be just a single key, and the square brackets behave in the normal way:

(17, 20)
(22, 20, 19)
17

But note that you can no longer use an actual tuple as a dictionary key. This is fine in this example since all the keys are strings with people’s names.

So, .__getitem__() gives you full control over what happens when you use the square brackets notation to fetch an item from an object. Now, let’s move on to the other “get*” special methods.


Support The Python Coding Stack


Accessing Data Using The Dot Notation, .

You decide you also want PointsTable to work with the dot notation, so that you can use table.Mark to get the number of points Mark has.

Note that I’m adding more features to this class to demonstrate various special methods in this article. I’m not suggesting it’s necessarily always desirable to try to be too fancy with your classes!

Let’s try it out directly first, without making any changes to the class:

#7

Unfortunately, table.Mark raises an error:

Traceback (most recent call last):
  File ..., line 32, in <module>
    print(table.Mark)
          ^^^^^^^^^^
AttributeError: ‘PointsTable’ object has no attribute ‘Mark’

Python is looking for an attribute called Mark. Attributes are the things you can access using the dot notation in Python. Typically, these are data attributes, methods, properties, and other things you can access using the dot. However, as with everything else, Python provides special methods to handle this. But this is where it gets a bit complicated.

.__getattr__()

Let’s take this one step at a time, starting with the .__getattr__() special method. You can guess that this special method name stands for get attribute (but it’s not the only special method that stands for get attribute!). Let’s add this method:

#8

The .__getattr__() special method has a print() call to show when it’s called. This line is not required, but it will help you figure out when Python calls this special method. It’s a way of peeking into Python’s inner workings! For completeness, I added a similar line to .__getitem__() as well.

There are two print() calls at the bottom: one uses the square brackets notation, which you dealt with in the previous section, and the other uses the dot notation. Now, both work:

Calling __getitem__ with argument item='Mark'
17
Calling __getattr__ with argument name='Mark'
17

As you’ve seen in the previous section, the square brackets notation relies on .__getitem__().

And as the third and fourth lines in this printout show, Python used .__getattr__() to deal with table.Mark.

Python passed the attribute name to the .__getattr__() special method’s name parameter. This method then fetches the value associated with this name from the dictionary ._data.

This trick only works for keys that are valid attribute names and that don’t clash with real attributes or methods. That’s one reason dictionary-style access is often preferable for arbitrary user data. But let’s stick with this exercise in this article.

Also, you wrote the code such that if self._data[name] raises a KeyError, then .__getattr__() raises an AttributeError, which is what you’d expect if you try to use a name that’s not an attribute after the dot:

#9

This raises an AttributeError:

AttributeError: 'PointsTable' object has no attribute 'Matilda'

It seems that .__getattr__() is analogous to .__getitem__() – the first deals with . and the second with []. But...

But things are a bit more complex. What if you try to access a standard attribute, say a data attribute or a method? You haven’t defined any methods in this class, but you do have a data attribute, ._data. It’s marked with a leading underscore, showing you shouldn’t access it directly. But you can “break the rules”. Python won’t stop you:

#10

The output is the following:

{'Stephen': 20, 'Mark': 17, 'Kate': 19, 'Sarah': 22}

Perhaps this is not surprising. But note that there’s no printout that says that .__getattr__() was called, as there was when you used table.Mark. Let’s look at both outputs together:

#11

The output now is the following:

Calling __getattr__ with argument name='Mark'
17
{'Stephen': 20, 'Mark': 17, 'Kate': 19, 'Sarah': 22}

Python called the .__getattr__() special method when you accessed .Mark, but not when you accessed ._data.

This means there’s something else happening behind the scenes.

.__getattribute__()

This is where .__getattribute__() enters the scene. No prizes for guessing that this method name also stands for get attribute, just like .__getattr__(). This is confusing.

And here’s the thing. It’s .__getattribute__() that gets called each and every time you use the dot notation with an instance. Let’s define it:

#12

Note that this code breaks lots of things as it is now. We’ll fix it soon. Let’s first look at the output from this code:

Calling __getattribute__ with argument name='Mark'
None
Calling __getattribute__ with argument name='_data'
None

As you can see, Python calls .__getattribute__() for both table.Mark and table._data. However, the current .__getattribute__() method does nothing else. Therefore, it returns None. You’ll see soon that this is a bigger problem than you might think.

Let’s try using the dot notation to access an attribute that doesn’t exist:

#13

Every time you use the dot, ., Python calls .__getattribute__(), which, for now, does nothing except print out the self-serving statement to show it was called:

Calling __getattribute__ with argument name='Mark'
None
Calling __getattribute__ with argument name='_data'
None
Calling __getattribute__ with argument name='Janine'
None

Therefore, you can’t access any attribute at the moment. All data attributes and methods are out of reach. You should be careful when overriding .__getattribute__(). In fact, you’ll rarely need to write your own .__getattribute__().
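
Here’s the classic pitfall as a quick sketch (a throwaway Broken class, separate from our PointsTable example). If your override accesses an attribute on self directly, that access triggers .__getattribute__() again, and you end up in infinite recursion:

# Illustrative, deliberately broken class -- not from the code above
class Broken:
    def __init__(self):
        self.value = 42

    def __getattribute__(self, name):
        # self.value is itself an attribute access, so this line
        # calls __getattribute__ again, and again, and again...
        return self.value

b = Broken()
# b.value  # RecursionError: maximum recursion depth exceeded

The way out is the same super().__getattribute__() call you’re about to see.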

Let’s fix this code by calling object.__getattribute__() from within PointsTable.__getattribute__(). Recall that all Python classes inherit from object. And object.__getattribute__() is where the important logic lives: it’s what Python uses to decide what to do when you use the dot. You can access object using super():

#14

Note that in this class, .__getattribute__() is not doing anything different to the default version except for printing out a statement. I’m using this for demonstration purposes only.

Let’s deal with one print() call at a time. The code above includes print(table._data). You’re fetching the value of a data attribute:

Calling __getattribute__ with argument name='_data'
{'Stephen': 20, 'Mark': 17, 'Kate': 19, 'Sarah': 22}

Python calls PointsTable.__getattribute__(), which in turn calls the base class’s object.__getattribute__(). This special method recognises that ._data is an instance variable and fetches its value from the object’s .__dict__.

But let’s see what happens with the .Mark attribute access. Recall that .Mark is not a data attribute or method:

#15

Here’s the output now. Look at the various printouts, too:

Calling __getattribute__ with argument name='Mark'
Calling __getattr__ with argument name='Mark'
Calling __getattribute__ with argument name='_data'
17

When Python sees table.Mark, it calls PointsTable.__getattribute__(table, "Mark"). This call gives rise to the first line printed out above. The base class’s .__getattribute__() doesn’t recognise "Mark" as a data attribute, a method name, or anything else it’s expecting (such as descriptors, which we will discuss soon). So, object.__getattribute__() actually raises an AttributeError in this case, which you can’t see in the output above.

That’s because when .__getattribute__() raises an AttributeError during normal attribute access using the dot, Python checks whether there’s a .__getattr__() defined. If there is, it uses it as a fallback. That’s what happens in this case. You can see that .__getattr__() is called next – that’s the second line printed out above.

But .__getattr__() contains this line: return self._data[name]. There’s a dot in that line, too. So Python needs to access .__getattribute__() again, this time using "_data" as the argument. Since ._data is a data attribute, object.__getattribute__() doesn’t need to use .__getattr__() as a fallback since it knows how to deal with data attributes.

Confused? You’re not alone. This is confusing, I know. And there’s a bit more. But we’ll make sure things are clear by the end of this article.

But first, let’s focus on table.Mark again and let’s comment out, just for now, the definition of .__getattr__():

#16

The base class’s .__getattribute__() method works hard to determine whether "Mark" is an attribute it expects. It doesn’t find it, and now, there’s no fallback .__getattr__(), so Python raises an error:

Calling __getattribute__ with argument name='Mark'
Calling __getattribute__ with argument name='__dict__'
Calling __getattribute__ with argument name='__class__'
Traceback (most recent call last):
  ...
AttributeError: 'PointsTable' object has no attribute 'Mark'

You can see a couple of extra printouts, too. These are side effects of Python preparing the error message since we get the printout each time Python uses the dot notation, even when it does so behind the scenes. Let’s not go too far down the rabbit hole in this article. We may never come out again if we go too deep!

You’ve seen that when you use the dot notation, Python calls object.__getattribute__() eventually. That’s why you should include super().__getattribute__() when you override this special method, unless you have a clear (niche) reason why you don’t want to do this. This special method looks in a number of places for the attribute. I’m going to skip a few steps in the hierarchy for now. But we’ll revisit these in the next section (remember, we’re not done yet since there’s still .__get__() to deal with).

The .__getattribute__() special method looks for instance attributes. Let’s temporarily turn .Mark into an instance attribute:

#17

Now, .Mark is a data attribute, so .__getattribute__() finds it:

Calling __getattribute__ with argument name='Mark'
This is Mark as a data attribute

If it’s not an instance variable, object.__getattribute__() also checks whether it’s a class attribute. Let’s test this:

#18

Here’s the output:

Calling __getattribute__ with argument name='Mark'
This is Mark as a class attribute

Note that the line in .__init__() that defines .Mark as a data attribute is commented out now because instance attributes take precedence over class attributes. Only when all else fails (including the steps I skipped for now) does Python check whether the fallback special method .__getattr__() is there.

So, when you use the dot notation for attribute access (the simplified version, for now):

  1. Python calls .__getattribute__()

  2. It checks for known attributes, such as instance attributes and class attributes (and a bit more – coming soon)

  3. If it doesn’t find anything, then .__getattribute__() raises an AttributeError. However, Python does one final check before letting this error through to the user: is there a .__getattr__()? If there is, Python calls it to see whether it contains instructions for handling this unknown attribute.

I’m simplifying slightly here. Descriptors complicate the order, and we’ll return to them in the next section.

Note that in real code, you rarely need to define .__getattribute__(). You can rely on the default provided in the object base class in most cases. Most custom attribute behaviour should use .__getattr__() rather than .__getattribute__(). Use .__getattribute__() only when you need to intercept every attribute access. There is rarely a need for this, though.

Common uses for .__getattr__() include delegating missing attributes to another object, implementing lazy loading, or exposing dynamic attributes from structured data.
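
Here’s a minimal sketch of the delegation pattern (a made-up LoggedFile wrapper, unrelated to PointsTable), which forwards any attribute it doesn’t define to a wrapped object:

import io

# Illustrative wrapper class -- not from the article's main example
class LoggedFile:
    def __init__(self, file):
        self._file = file

    def write(self, text):
        print(f"Writing {len(text)} characters")
        return self._file.write(text)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so .write() above
        # is found first and never reaches this fallback
        return getattr(self._file, name)

f = LoggedFile(io.StringIO())
f.write("hello")     # intercepted by LoggedFile.write()
print(f.getvalue())  # delegated to the StringIO object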

Here’s the tidied-up code in full so far:

#19

The Asymmetry Between .__getattr__() and .__setattr__()

A short note: There’s an asymmetry between .__getattr__() and .__setattr__() despite their names following the same pattern. As you’ve seen above, .__getattr__() is only used as a fallback when .__getattribute__() doesn’t find the attribute through “normal routes”.

However, .__setattr__() doesn’t have a counterpart analogous to .__getattribute__(). Therefore, .__setattr__() is always used when setting the value of an attribute. Life is simpler in the “setting” world!

In this article, I’m focusing on the “getting” part of things, but some of the same logic applies to “setting”, too.
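
Here’s a quick sketch of that asymmetry (a throwaway Demo class):

# Illustrative class
class Demo:
    def __getattr__(self, name):
        print(f"__getattr__ called for {name!r}")

    def __setattr__(self, name, value):
        print(f"__setattr__ called for {name!r}")
        super().__setattr__(name, value)

d = Demo()
d.x = 10  # __setattr__ is always called when setting
d.x       # .x exists now, so __getattr__ is not called
d.y       # .y doesn't exist, so __getattr__ is the fallback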

All The Python Coding Place video courses are included in a single, cost-effective bundle. The bundle covers beginner and intermediate levels, and you also get access to a members-only forum.

Get All The Python Coding Place Courses in One Bundle

The Fourth Horseman: .__get__()

I’ve had this article in my pipeline here on The Python Coding Stack for a very long time. But I never wrote it because I knew that dealing with .__get__() would be a pain and would need a lot of time and space. However, in March, I published The Weird and Wonderful World of Descriptors in Python. If you haven’t read that article yet, well, now is the time!

And therefore I can cheat in this article. I can avoid talking about .__get__() in detail. What follows in this section is a summary of the world of descriptors. See the full article for more details.

If a class has .__get__(), it’s a descriptor class. If it only has .__get__() and it does not define .__set__() or .__delete__(), which are the other methods that make up the descriptor protocol, then the class creates non-data descriptors. Classes that include .__set__() or .__delete__() create data descriptors. This distinction matters when we get back to .__getattribute__() and the order it uses to look for known attributes.

Let’s explore this with a dummy example. This code defines two descriptor classes and another class to test the priority order that object.__getattribute__() uses for different attribute types:

#20

DataDescriptor defines .__get__() and .__set__() methods, which is why it creates data descriptors. However, NonDataDescriptor creates non-data descriptors since the class only defines .__get__().

All the special methods in all classes (except .__init__()) have print() calls so you can see when Python calls them. Before we look at the output from this code, let’s review the five attributes you use in the print() calls and where they appear in the class definitions:

  1. .first is defined as a data descriptor at the top of the TestingAttributeAccess class. Recall from the article on descriptors that you initially define descriptors as class attributes within a class. You also assign a value to .first within the class’s .__init__(). More on this soon.

  2. .second is also defined as a descriptor at the top of the TestingAttributeAccess class, but it’s a non-data descriptor since the NonDataDescriptor class only defines .__get__(). You also assign a value to self.second within .__init__(). We’ll see how the class actually deals with .second soon, since it treats it differently from .first.

  3. .third is also defined as a non-data descriptor. However, there is no other assignment to third within .__init__(). This is the key difference between .second and .third in this code.

  4. .fourth is a class attribute. It’s not a descriptor and it’s not an instance attribute.

  5. .fifth is not defined anywhere in the class, but you still call print(test.fifth).

Let’s look at the whole output from this code and then break it down into steps later:

Calling DataDescriptor.__set__ with value="'first': a data attribute defined in .__init__"

Printing `test.first`
Calling __getattribute__ with argument name='first'
Calling DataDescriptor.__get__
This is the Data Descriptor

Printing `test.second`
Calling __getattribute__ with argument name='second'
'second': a data attribute defined in .__init__

Printing `test.third`
Calling __getattribute__ with argument name='third'
Calling NonDataDescriptor.__get__
This is the Non-Data Descriptor

Printing `test.fourth`
Calling __getattribute__ with argument name='fourth'
I'm just a normal class attribute!

Printing `test.fifth`
Calling __getattribute__ with argument name='fifth'
Calling __getattr__ with argument name='fifth'
This is the fallback value from __getattr__

There’s plenty to digest there. So, let’s break it down in stages.

0. Creating an instance of TestingAttributeAccess

I want to focus on the “getting” bit from the various print() calls. However, there’s this line output first, so let’s deal with it, too:

Calling DataDescriptor.__set__ with value="'first': a data attribute defined in .__init__"

This happens when you create the instance of TestingAttributeAccess and assign it to the variable name test. Why? Because first has already been defined as a data descriptor at the time of the class definition. Therefore, when you assign a value to .first during initialisation, the descriptor protocol kicks in.

When you call TestingAttributeAccess(), Python calls the class’s .__init__() and soon finds this expression: self.first = ...

Therefore, Python calls DataDescriptor.__set__() to set the value of this attribute. Note that DataDescriptor.__set__() doesn’t really set anything in this case. Normally, something more meaningful would happen in .__set__(). See The Weird and Wonderful World of Descriptors in Python for more on this.

1. test.first

The next section in the code’s output is the following:

Printing `test.first`
Calling __getattribute__ with argument name='first'
Calling DataDescriptor.__get__
This is the Data Descriptor

This output is created when you write print(test.first). The dot notation triggers Python to call .__getattribute__(). So, let’s start exploring the hierarchy of checks in .__getattribute__(). What does this special method look for first?

The first thing .__getattribute__() checks is whether the attribute is a data descriptor.

Since .first is a data descriptor, the search stops there. Python calls DataDescriptor.__get__() and returns whatever the .__get__() method returns. This is the string "This is the Data Descriptor" in this demo example.

2. test.second

Let’s look at the next segment in the code’s output. This is generated when you call print(test.second):

Printing `test.second`
Calling __getattribute__ with argument name='second'
'second': a data attribute defined in .__init__

You know the drill by now. The dot notation is the trigger that makes Python call .__getattribute__(). It checks whether the attribute is a data descriptor. But .second is not a data descriptor. You initially define it as a non-data descriptor. Since it’s definitely not a data descriptor, it’s time to move on to the second check in the hierarchy.

The second thing .__getattribute__() checks is whether the attribute is an instance attribute.

Is this attribute name in the object’s .__dict__ dictionary? Instance attributes come second in the hierarchy. Now, you’ve seen that .second is not a data descriptor, which is why we moved to the second check. But is test.second an instance attribute or a non-data descriptor?

You can see from the printout that Python never called NonDataDescriptor.__get__() even though you originally defined .second as a NonDataDescriptor object. The swap occurred when you created the TestingAttributeAccess instance. Python calls the class’s .__init__(), which contains this assignment line: self.second = ....

Since .second is not a data descriptor (it’s a non-data descriptor) and doesn’t have a .__set__(), the descriptor protocol doesn’t kick in here to set the value of this attribute. This is different from .first, which was a data descriptor and, therefore, its .__set__() was responsible for setting the value.

Since the .second non-data descriptor doesn’t have a .__set__(), Python does what it always does in these cases: it assigns the new object, the string "'second': a data attribute defined in .__init__", to test.second. Therefore, test.second is now an instance attribute containing the string rather than the original non-data descriptor.

Although test.second would originally access the non-data descriptor, you created an instance attribute when you initialised the object. It’s now a data attribute, which is an instance attribute.

Note that TestingAttributeAccess.second is still there as a class attribute, and that’s still the non-data descriptor. But test.second now points to another object, the string.

Therefore, when later in the code you call print(test.second), Python calls .__getattribute__() and starts looking through its hierarchy. It’s not a data descriptor. It is an instance attribute. Therefore, it returns its value.

3. test.third

The next block of output is the one generated when you call print(test.third). Recall that .third is a NonDataDescriptor object. So was .second initially. However, unlike .second, Python never creates an instance attribute that’s assigned to test.third since you don’t assign to self.third in .__init__(). Here’s the output:

Printing `test.third`
Calling __getattribute__ with argument name='third'
Calling NonDataDescriptor.__get__
This is the Non-Data Descriptor

In print(test.third), there’s a dot again, so Python calls .__getattribute__(). This method first checks whether .third is a data descriptor. It is not. Then it checks whether .third is an instance attribute. It is not. So...

The third thing .__getattribute__() checks is whether the attribute is a non-data descriptor.

Non-data descriptors come next in the hierarchy. That’s why you can see the printout saying that Python calls NonDataDescriptor.__get__ and then prints the attribute’s value, which is the value it gets from the non-data descriptor’s .__get__().

Incidentally, ordinary instance methods are functions that also implement the .__get__() special method. They are non-data descriptors. So, methods are found in this third check in the hierarchy of checks when you use the dot notation.
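
You can verify this with a quick sketch (a throwaway Greeter class):

# Illustrative class
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
func = Greeter.__dict__["greet"]
print(hasattr(func, "__get__"))  # True: functions define __get__
print(hasattr(func, "__set__"))  # False: so they're non-data descriptors
print(func.__get__(g, Greeter))  # the bound method, same as g.greet
print(g.greet())                 # hello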

4. test.fourth

It’s test.fourth’s turn. Here’s the output you get from print(test.fourth):

Printing `test.fourth`
Calling __getattribute__ with argument name='fourth'
I'm just a normal class attribute!

There’s a dot, so there’s a call to .__getattribute__() which checks whether .fourth is a data descriptor, an instance attribute, or a non-data descriptor, in that order. It’s none of these. Time for the fourth check.

The fourth thing .__getattribute__() checks is whether the attribute is a class attribute.

And look at that! The .fourth attribute is indeed a class attribute.

5. test.fifth

And finally, it’s .fifth’s turn. You call print(test.fifth) and you get the following:

Printing `test.fifth`
Calling __getattribute__ with argument name='fifth'
Calling __getattr__ with argument name='fifth'
This is the fallback value from __getattr__

Yes, yes, there’s a dot again, so Python calls .__getattribute__(). No, it’s not a data descriptor. No, it’s not an instance attribute. No, it’s not a non-data descriptor. No, it’s not a class attribute. It’s nothing. The attribute .fifth doesn’t exist.

When .__getattribute__() fails to find the attribute, Python calls .__getattr__() if this special method exists.

And that’s it. Simple, eh?!?!

Let’s finish off by summarising the order in which Python deals with looking for attributes when you use the dot notation:

  1. The first thing .__getattribute__() checks is whether the attribute is a data descriptor.

  2. The second thing .__getattribute__() checks is whether the attribute is an instance attribute.

  3. The third thing .__getattribute__() checks is whether the attribute is a non-data descriptor.

  4. The fourth thing .__getattribute__() checks is whether the attribute is a class attribute.

  5. Finally, when .__getattribute__() fails to find the attribute, Python calls .__getattr__() if this special method exists.

The fifth step occurs when you access attributes using the dot notation. If you call .__getattribute__() explicitly, which you rarely need to do, Python doesn’t automatically look for .__getattr__().
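
Here’s a quick sketch of that difference (a throwaway Fallback class):

# Illustrative class
class Fallback:
    def __getattr__(self, name):
        return f"fallback for {name!r}"

obj = Fallback()
print(obj.missing)  # the dot notation uses the __getattr__ fallback

try:
    obj.__getattribute__("missing")  # explicit call: no fallback
except AttributeError as error:
    print(error)  # 'Fallback' object has no attribute 'missing'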

Note that Python looks for descriptors, instance and class attributes in the class and its base classes, too, following the method resolution order.
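
For example (a minimal sketch with a made-up Base and Child pair):

# Illustrative classes
class Base:
    shared = "defined on Base"

class Child(Base):
    pass

c = Child()
print(c.shared)       # found on Base by walking the MRO
print(Child.__mro__)  # (<class 'Child'>, <class 'Base'>, <class 'object'>)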

Final Words

Wait for this. I’ve been waiting all article to write this: Now, you won’t forget the four “get*” special methods. Got it! (My son just walked out disapprovingly when he read this.)

You may never need to use all of these methods. Maybe you’ll never need to use any of them. But if you ever wondered why there are these four special methods with similar names, now you know what they do. And you dug a bit more underneath the Python surface along the way. And that’s always fun.

Your call…

The Python Coding Place offers something for everyone:

• a super-personalised one-to-one 6-month mentoring option
$ 4,750

• individual one-to-one sessions
$ 125

• a self-led route with access to 60+ hrs of exceptional video courses and a support forum
$ 400

Which The Python Coding Place student are you?

Photo by Robert Clark


Code in this article uses Python 3.14

The code images used in this article are created using Snappify. [Affiliate link]

Join The Club, the exclusive area for paid subscribers for more Python posts, videos, a members’ forum, and more.

Subscribe now

You can also support this publication by making a one-off contribution of any amount you wish.

Support The Python Coding Stack


For more Python resources, you can also visit Real Python—you may even stumble on one of my own articles or courses there!

Also, are you interested in technical writing? You’d like to make your own writing more narrative, more engaging, more memorable? Have a look at Breaking the Rules.

And you can find out more about me at stephengruppetta.com



Appendix: Code Blocks

Code Block #1
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

table = PointsTable(
    {
        "Stephen": 20,
        "Mark": 17,
        "Kate": 19,
        "Sarah": 22,
    }
)
Code Block #2
# ...
print(table["Mark"])
Code Block #3
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, item):
        return self._data[item]

table = PointsTable(
    {
        "Stephen": 20,
        "Mark": 17,
        "Kate": 19,
        "Sarah": 22,
    }
)

print(table["Mark"])
Code Block #4
# ...
print(table["Mark", "Stephen"])
Code Block #5
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, item):
        print(f"{item=}")
        print(f"{type(item)=}")
        return self._data[item]

table = PointsTable(
    # ...
)

print(table["Mark", "Stephen"])
Code Block #6
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, item):
        if isinstance(item, tuple):
            return tuple(self._data[key] for key in item)
        return self._data[item]

table = PointsTable(
    # ...
)

print(table["Mark", "Stephen"])
print(table["Sarah", "Stephen", "Kate"])
print(table["Mark"])
Code Block #7
# ...
print(table.Mark)
Code Block #8
class PointsTable:
    # ...

    def __getitem__(self, item):
        print(f"Calling __getitem__ with argument {item=}")
        if isinstance(item, tuple):
            return tuple(self._data[key] for key in item)
        return self._data[item]

    def __getattr__(self, name):
        print(f"Calling __getattr__ with argument {name=}")
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(
                f"'{type(self).__name__}' object has "
                f"no attribute '{name}'"
            ) from None

table = PointsTable(
    # ...
)

print(table["Mark"])
print(table.Mark)
Code Block #9
# ...
print(table.Matilda)
Code Block #10
# ...
print(table._data)
Code Block #11
# ...
print(table.Mark)
print(table._data)
Code Block #12
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, item):
        # ...

    def __getattr__(self, name):
        # ...

    def __getattribute__(self, name):
        print(f"Calling __getattribute__ with argument {name=}")

table = PointsTable(
    # ...
)

print(table.Mark)
print(table._data)
Code Block #13
# ...
print(table.Mark)
print(table._data)
print(table.Janine)
Code Block #14
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, item):
        # ...

    def __getattr__(self, name):
        # ...

    def __getattribute__(self, name):
        print(f"Calling __getattribute__ with argument {name=}")
        return super().__getattribute__(name)

table = PointsTable(
    # ...
)

print(table._data)
Code Block #15
# ...
print(table.Mark)
Code Block #16
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, item):
        # ...

    # def __getattr__(self, name):
    #     print(f"Calling __getattr__ with argument {name=}")
    #     try:
    #         return self._data[name]
    #     except KeyError:
    #         raise AttributeError(
    #             f"'{type(self).__name__}' object has "
    #             f"no attribute '{name}'"
    #         ) from None

    def __getattribute__(self, name):
        print(f"Calling __getattribute__ with argument {name=}")
        return super().__getattribute__(name)

table = PointsTable(
    # ...
)

print(table.Mark)
Code Block #17
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)
        self.Mark = "This is Mark as a data attribute"

    def __getitem__(self, item):
        # ...

    # def __getattr__(self, name):
    # ...

    def __getattribute__(self, name):
        print(f"Calling __getattribute__ with argument {name=}")
        return super().__getattribute__(name)

table = PointsTable(
    # ...
)

print(table.Mark)
Code Block #18
class PointsTable:
    Mark = "This is Mark as a class attribute"

    def __init__(self, data):
        self._data = dict(data)
        # self.Mark = "This is Mark as a data attribute"

    def __getitem__(self, item):
        # ...

    # def __getattr__(self, name):
    # ...

    def __getattribute__(self, name):
        print(f"Calling __getattribute__ with argument {name=}")
        return super().__getattribute__(name)

table = PointsTable(
    # ...
)

print(table.Mark)
Code Block #19
class PointsTable:
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, item):
        print(f"Calling __getitem__ with argument {item=}")
        if isinstance(item, tuple):
            return tuple(self._data[key] for key in item)
        return self._data[item]

    def __getattr__(self, name):
        print(f"Calling __getattr__ with argument {name=}")
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(
                f"'{type(self).__name__}' object has "
                f"no attribute '{name}'"
            ) from None

    def __getattribute__(self, name):
        print(f"Calling __getattribute__ with argument {name=}")
        return super().__getattribute__(name)

table = PointsTable(
    {
        "Stephen": 20,
        "Mark": 17,
        "Kate": 19,
        "Sarah": 22,
    }
)

print(table._data)  # Only .__getattribute__() used
print(table.Mark)  # Fallback .__getattr__() needed here
Code Block #20
class DataDescriptor:
    def __get__(self, instance, owner):
        print("Calling DataDescriptor.__get__")
        return "This is the Data Descriptor"

    def __set__(self, instance, value):
        print(f"Calling DataDescriptor.__set__ with {value=}")

class NonDataDescriptor:
    def __get__(self, instance, owner):
        print("Calling NonDataDescriptor.__get__")
        return "This is the Non-Data Descriptor"

class TestingAttributeAccess:
    first = DataDescriptor()
    second = NonDataDescriptor()
    third = NonDataDescriptor()
    fourth = "I'm just a normal class attribute!"

    def __init__(self):
        self.first = "'first': a data attribute defined in .__init__"
        self.second = "'second': a data attribute defined in .__init__"

    def __getattribute__(self, name):
        print(f"Calling __getattribute__ with argument {name=}")
        return super().__getattribute__(name)

    def __getattr__(self, name):
        print(f"Calling __getattr__ with argument {name=}")
        return "This is the fallback value from __getattr__"

test = TestingAttributeAccess()
print("\nPrinting `test.first`")
print(test.first)
print("\nPrinting `test.second`")
print(test.second)
print("\nPrinting `test.third`")
print(test.third)
print("\nPrinting `test.fourth`")
print(test.fourth)
print("\nPrinting `test.fifth`")
print(test.fifth)


May 04, 2026 11:59 AM UTC


Daniel Roy Greenfeld

Word counter that ignores Markdown

I've been doing a lot of writing recently, and tracking my word count. I write in markdown. I could just render the text using a markdown library and then do a count on the generated output, but then I wouldn't have the fun of writing out a bunch of regular expressions. Yes, I know the cautionary meme that says:

"Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems." -- Jamie Zawinski

I don't care.

I love working in regular expressions. It was the one thing I got out of my brief foray into Perl at the very start of my software development career. I carried it into my Java and ColdFusion days and periodically use it in Python. Yes, Python has lots of useful string tools, but playing with regular expressions until they are just right remains a fun puzzle for me.

So here you go, a Python-powered word counter born of my desire to noodle with regular expressions:

"""
word_count.py — Count words in a Markdown file or a directory of markdown files.

Dependencies:
    typer
    rich

Usage:
    python word_count.py README.md
    python word_count.py README.md --no-strip-markdown
    python word_count.py README.md --verbose
    python word_count.py book/
"""

import re
from pathlib import Path

import typer
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich import box

app = typer.Typer(
    name="word-count",
    help="Count words in Markdown files.",
    add_completion=False,
)
console = Console()


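# NOTE: _STRIP_RE below combines every pattern into a single regex, but
# strip_markdown() applies separate substitutions instead, since several
# replacements need capture groups to keep the inner text (links, emphasis).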
MARKDOWN_PATTERNS = [
    r"```[\s\S]*?```",  # fenced code blocks
    r"`[^`]+`",  # inline code
    r"!\[.*?\]\(.*?\)",  # images
    r"\[.*?\]\(.*?\)",  # links => keep link text
    r"^#{1,6}\s+",  # ATX headings
    r"^\s*[-*+]\s+",  # unordered list markers
    r"^\s*\d+\.\s+",  # ordered list markers
    r"[*_]{1,2}([^*_]+)[*_]{1,2}",  # bold / italic => keep inner text
    r"~~([^~]+)~~",  # strikethrough => keep inner text
    r"^>+\s*",  # blockquote markers
    r"^\s*\|.*\|\s*$",  # table rows (kept as-is, words counted)
    r"^[-*_]{3,}\s*$",  # horizontal rules
    r"<!--[\s\S]*?-->",  # HTML comments
    r"<[^>]+>",  # remaining HTML tags
]

_STRIP_RE = re.compile("|".join(MARKDOWN_PATTERNS), re.MULTILINE)


def strip_markdown(text: str) -> str:
    """Remove Markdown syntax, keeping readable prose."""
    # Replace links/images with their label text
    text = re.sub(r"!\[.*?\]\(.*?\)", "", text)
    text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", text)
    # Remove fenced code blocks entirely
    text = re.sub(r"```[\s\S]*?```", "", text)
    # Remove inline code
    text = re.sub(r"`[^`]+`", "", text)
    # Unwrap bold / italic
    text = re.sub(r"[*_]{1,2}([^*_\n]+)[*_]{1,2}", r"\1", text)
    text = re.sub(r"~~([^~]+)~~", r"\1", text)
    # Remove HTML comments and tags
    text = re.sub(r"<!--[\s\S]*?-->", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    # Strip leading syntax characters
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^\s*[-*+]\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^\s*\d+\.\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^>+\s*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^[-*_]{3,}\s*$", "", text, flags=re.MULTILINE)
    return text


def count_stats(text: str) -> dict:
    words = text.split()
    lines = text.splitlines()
    chars_no_space = len(re.sub(r"\s", "", text))
    sentences = len(re.findall(r"[.!?]+", text))
    return {
        "words": len(words),
        "lines": len(lines),
        "chars": len(text),
        "chars_no_space": chars_no_space,
        "sentences": max(sentences, 1),
        "avg_word_len": (
            round(sum(len(x) for x in words) / len(words), 1) if words else 0.0
        ),
        "reading_time_min": max(1, round(len(words) / 200)),  # ~200 wpm
    }


def _count_single_file(
    file: Path,
    strip: bool,
    verbose: bool,
    plain: bool,
) -> dict:
    """Count words for a single file, print output, and return stats."""
    raw = file.read_text(encoding="utf-8")
    text = strip_markdown(raw) if strip else raw
    stats = count_stats(text)

    if plain:
        typer.echo(f"{file.name}\t{stats['words']}")
        return stats

    if not verbose:
        console.print(
            Panel(
                f"[bold cyan]{stats['words']:,}[/bold cyan] words  ·  "
                f"[dim]{stats['reading_time_min']} min read[/dim]",
                title=f"[bold]{file.name}[/bold]",
                border_style="cyan",
            )
        )
        return stats

    # Verbose: full table
    table = Table(box=box.ROUNDED, show_header=True, header_style="bold magenta")
    table.add_column("Metric", style="bold")
    table.add_column("Value", justify="right")

    rows = [
        ("Words", f"{stats['words']:,}"),
        ("Lines", f"{stats['lines']:,}"),
        ("Characters (with spaces)", f"{stats['chars']:,}"),
        ("Characters (no spaces)", f"{stats['chars_no_space']:,}"),
        ("Sentences (approx.)", f"{stats['sentences']:,}"),
        ("Average word length", f"{stats['avg_word_len']} chars"),
        ("Estimated reading time", f"{stats['reading_time_min']} min"),
        ("Markdown stripped", "yes" if strip else "no"),
    ]
    for label, value in rows:
        table.add_row(label, value)

    console.print()
    console.print(f"  [bold]{file}[/bold]", style="dim")
    console.print(table)
    console.print()
    return stats


@app.command()
def count(
    path: Path = typer.Argument(
        ...,
        help="Path to a Markdown file or a directory with digit-prefixed .md files.",
        exists=True,
        file_okay=True,
        dir_okay=True,
        readable=True,
    ),
    strip: bool = typer.Option(
        True,
        "--strip-markdown/--no-strip-markdown",
        help="Strip Markdown syntax before counting (default: True).",
    ),
    verbose: bool = typer.Option(
        False,
        "--verbose",
        "-v",
        help="Show a full breakdown table.",
    ),
    plain: bool = typer.Option(
        False,
        "--plain",
        help="Print a bare number (word count only) — useful for scripting.",
    ),
):
    """Count words in a Markdown FILE or all digit-prefixed .md files in a directory."""
    if path.is_file():
        _count_single_file(path, strip, verbose, plain)
        return

    # Directory mode: find .md files starting with a digit
    files = sorted(x for x in path.glob("[0-9]*.md") if x.is_file())
    if not files:
        console.print(f"[red]No digit-prefixed .md files found in {path}[/red]")
        raise typer.Exit(code=1)

    total_words = 0
    for f in files:
        stats = _count_single_file(f, strip, verbose, plain)
        total_words += stats["words"]

    if plain:
        typer.echo(f"TOTAL\t{total_words}")
    else:
        console.print(
            Panel(
                f"[bold green]{total_words:,}[/bold green] words across "
                f"[bold]{len(files)}[/bold] files  ·  "
                f"[dim]{max(1, round(total_words / 200))} min read[/dim]",
                title=f"[bold]{path}[/bold] — Total",
                border_style="green",
            )
        )


if __name__ == "__main__":
    app()

May 04, 2026 11:55 AM UTC


PyCharm

PyTorch vs. TensorFlow: Choosing the Right Framework in 2026


Choosing between PyTorch and TensorFlow isn’t about finding the “better” framework – it’s about finding the right fit for your project. Both power cutting-edge AI systems, but they excel in different domains. PyTorch dominates research and experimentation, while TensorFlow leads in production deployment at scale.

The frameworks have evolved significantly since their early days, each building tools and capabilities to support both research and production. Despite these improvements, fundamental differences remain in their philosophies, ecosystems, and ideal use cases, and these differences will naturally influence which framework best fits your project.

This guide examines where each framework shines, compares them across key dimensions, and helps you choose the right tool for your natural language processing, computer vision, and reinforcement learning projects.

What sets PyTorch and TensorFlow apart?

PyTorch and TensorFlow took different approaches from day one. Google launched TensorFlow in 2015, focusing on production deployment and enterprise scalability. Meta released PyTorch in 2016, prioritizing research flexibility and Pythonic development. These roots still shape each framework today.

The key difference between the two lies in computational graphs. PyTorch uses dynamic graphs that execute operations immediately, making debugging natural – you use standard Python tools and inspect tensors at any point. TensorFlow originally required static graphs defined before execution, though version 2.x now defaults to eager execution while retaining optional graph compilation for performance.
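To make the dynamic-graph distinction concrete, here's a minimal PyTorch sketch: every operation executes immediately, so you can inspect intermediate values with plain print calls.

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # runs immediately; y is a concrete tensor, not a graph node
print(y.item())     # 14.0 -- inspectable at any point with standard Python tools
y.backward()        # autograd walks the graph that was built on the fly
print(x.grad)       # tensor([2., 4., 6.])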

Market data shows TensorFlow holds a 37% market share, while PyTorch commands 25%. But the research tells a different story: PyTorch powers 85% of deep learning papers presented at top AI conferences.

PyTorch: Strengths and weaknesses

PyTorch’s Pythonic API treats models like regular Python code, making development feel intuitive from the start. The framework’s dynamic computational graphs execute operations immediately rather than requiring upfront model definition, fundamentally changing how you approach debugging and experimentation.

This design philosophy has made PyTorch the dominant choice in research, where flexibility matters more than deployment infrastructure. However, this research-first design means production deployment tools remain less mature than TensorFlow’s enterprise infrastructure.

PyTorch strengths

PyTorch weaknesses

TensorFlow: Strengths and weaknesses

TensorFlow’s production ecosystem provides you with a comprehensive infrastructure for deploying models at scale. Google built the framework specifically for enterprise environments where reliability, performance, and deployment flexibility matter most.

This production-first approach created mature tooling for serving, mobile optimization, and MLOps that PyTorch is still catching up to. The trade-off comes in development experience – TensorFlow’s API can feel more complex and less intuitive than PyTorch’s streamlined approach.

TensorFlow strengths

TensorFlow weaknesses

If you’re new to TensorFlow and want a hands-on starting point, check out How to Train Your First TensorFlow Model in PyCharm, where you’ll build and train a simple model step by step using Keras and visualize the results.

PyTorch vs. TensorFlow: Head-to-head comparison

Choosing between PyTorch and TensorFlow isn’t always straightforward, and there are many factors to consider. 

The table below provides a high-level head-to-head comparison of PyTorch and TensorFlow so you can quickly assess which framework generally fits your needs. We’ll later consider project-specific scenarios and provide a detailed decision matrix to guide your choice.

Dimension | PyTorch | TensorFlow
Learning curve | Easier: Pythonic and intuitive | Steeper: more complex API despite Keras
Debugging | Excellent: standard Python tools work naturally | Good: improved with eager execution
Production deployment | Improving: TorchServe and TorchScript available | Excellent: mature ecosystem (Serving, Lite, JS)
Research/experimentation | Dominant: 85% of deep-learning research papers | Present: but trailing PyTorch in adoption
Community ecosystem | Research-focused: Hugging Face, PyTorch Lightning | Enterprise-focused: TFX, strong cloud integration
Performance at scale | Strong: DDP for distributed training | Strong: graph optimization, TPU support
Industry adoption | Growing: used by 15,800+ companies | Established: used by more than 23,000 companies

PyTorch vs. TensorFlow for different use cases and applications 

Your framework choice depends heavily on what you’re building. Here’s how PyTorch and TensorFlow stack up for major machine learning domains.

Natural language processing

PyTorch dominates NLP with no signs of slowing. The Hugging Face Transformers library – the de facto standard for working with language models – started as a PyTorch-only framework and later added TensorFlow support as a secondary option. When you’re fine-tuning transformers, implementing custom attention mechanisms, or experimenting with novel architectures, PyTorch’s flexibility accelerates your iteration.

Verdict: PyTorch leads NLP decisively. Choose TensorFlow only if you have specific mobile deployment requirements that override all other considerations.

Computer vision

Computer vision presents a more balanced landscape for your projects. PyTorch benefits from research momentum – when you’re developing novel detection algorithms or experimenting with architectures, you’ll find state-of-the-art implementations appear in PyTorch first. TensorFlow excels for building production CV systems, especially for mobile object detection or on-device image classification, where TensorFlow Lite’s optimization matters most.

For a hands-on example, watch this video on how to build a TensorFlow object detection app to see how to take a pre-trained model and turn it into a real-time object detection app running on a robot in PyCharm:

Verdict: Use case dependent. Choose PyTorch for research and novel architectures, TensorFlow when your deployment priorities favor mobile and edge devices.

Reinforcement learning

PyTorch holds a slight edge in reinforcement learning, driven by the research community’s preference for it. When you’re implementing custom RL algorithms, modifying reward functions dynamically, or debugging agent behavior, PyTorch’s flexibility serves you better. TensorFlow offers solid capabilities through TF-Agents for production RL systems at scale.

Verdict: Choose PyTorch for RL research and experimentation or TensorFlow for building large-scale production-grade RL systems like recommendation engines.

Tooling and developer experience in PyCharm

PyCharm provides comprehensive support for both frameworks, streamlining your development workflow regardless of which you choose.

Performance, scalability, and deployment

Training performance barely differs between frameworks for most workloads – both handle GPU training efficiently with comparable speeds. TensorFlow gains an edge when you need TPU support for large-scale training, offering more mature integration with Google’s specialized hardware. For multi-GPU scaling, both deliver strong performance with PyTorch’s DDP and TensorFlow’s MirroredStrategy.
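As a rough illustration of the TensorFlow side, here's a minimal MirroredStrategy sketch for synchronous multi-GPU training; the model is an arbitrary placeholder:

import tensorflow as tf

# Replicates the model across all visible GPUs and averages gradients
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then trains with data sharded across replicas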

Deployment scenarios differentiate the frameworks more clearly. TensorFlow Serving handles production model serving at scale with built-in versioning and A/B testing that PyTorch’s TorchServe can’t yet match in maturity. When deploying to mobile devices or edge hardware, TensorFlow Lite provides industry-standard optimization through quantization and pruning. For browser deployment, TensorFlow.js offers more integrated, optimized inference compared to serving PyTorch models via ONNX Runtime.

Memory management affects development experience – PyTorch’s caching allocator handles GPU memory efficiently with dynamic batch sizes, causing fewer surprises when experimenting with different model configurations.

Community, ecosystem, and library support

PyTorch’s research dominance created a vibrant, innovation-focused community that accelerates development. The PyTorch Conference 2024 saw triple the registrations versus 2023, and when cutting-edge techniques emerge, they appear in PyTorch first. The Hugging Face ecosystem amplifies this advantage – more than 220,000 PyTorch-compatible models versus around 15,000 for TensorFlow makes a tangible difference in development speed.

TensorFlow’s community skews toward production engineering, providing comprehensive enterprise-grade documentation and proven deployment patterns. Google’s backing ensures strong cloud platform integrations, particularly with Google Cloud, offering managed services that reduce operational complexity. The Model Garden provides production-ready implementations optimized for deployment rather than research experimentation.

Learning resources reflect these different audiences – PyTorch tutorials emphasize research workflows and novel implementations, while TensorFlow documentation prioritizes production deployment patterns and enterprise-scale systems.

Choosing the right framework for your project

Many successful teams use both frameworks strategically – researching and experimenting in PyTorch, then deploying in TensorFlow. The frameworks aren’t mutually exclusive. You can use ONNX to enable model conversion between them when needed.
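As a sketch of what that conversion path can look like from the PyTorch side (the toy model and file name are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
model.eval()

# Trace the model with a dummy input and write an ONNX file that
# ONNX Runtime or other tooling can load.
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
)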

When making a choice, it helps to prioritize factors most relevant to your project: Mobile deployment requirements may override other considerations, research-heavy work might make PyTorch essential, and enterprise support with MLOps integration could tip the scales toward TensorFlow. 

Use the table below to match your project requirements with the framework strengths. 

Decision Factor | PyTorch | TensorFlow

By use case
Natural language processing | ✅ NLP standard choice | Only if mobile deployment is critical
Computer vision | ✅ Research/novel architectures | ✅ Production mobile/edge apps
Reinforcement learning | ✅ Research and experimentation | ✅ Large-scale production RL

By experience level
Beginner | ✅ More intuitive API | Keras simplifies learning
Intermediate/Advanced | ✅ Research and prototyping | ✅ Production systems at scale

By project phase
Research/Experimentation | ✅ Dynamic graphs aid iteration | Graph compilation for optimization
Rapid prototyping | ✅ Fast experimentation | Keras for simple models
Production deployment | TorchServe improving | ✅ Mature deployment tools

By deployment target
Cloud/Server | Strong performance | ✅ Strong performance, slight GCP advantage
Mobile/Edge devices | Basic support via PyTorch Mobile | ✅ TensorFlow Lite industry standard
Web Applications | Via ONNX Runtime | ✅ TensorFlow.js optimized

By team context
Research-focused team | ✅ Natural fit for researchers | If already using TensorFlow
Production-focused team | If comfortable with tooling | ✅ Proven enterprise patterns

May 04, 2026 10:07 AM UTC


Python Engineering at Microsoft

Introducing Apache Arrow Support in mssql-python


Reviewed by Sumit Sarabhai

Fetching a million rows from SQL Server into a Polars DataFrame used to mean a million Python objects, a million GC allocations, and then throwing it all away to build a DataFrame. Not anymore. mssql-python now supports fetching SQL Server data directly as Apache Arrow structures – a faster and more memory-efficient path for anyone working with SQL Server data in Polars, Pandas, DuckDB, or any other Arrow-native library. This feature was contributed by community developer Felix Graßl (@ffelixg), and we are thrilled to ship it.

Key Terms

API (Application Programming Interface): a source-code contract that defines how to call a function or library.

ABI (Application Binary Interface): a binary-level contract that specifies how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly – no serialization is needed.

Arrow C Data Interface: Apache Arrow’s ABI specification – the standard that makes zero-copy data exchange between languages possible.

What Is Apache Arrow?

The key insight behind Apache Arrow is zero-copy language interoperability. Arrow defines a stable shared-memory layout – the Arrow C Data Interface, a cross-language ABI – that any language can produce or consume by exchanging a pointer, with no serialization, no copies, and no re-parsing. A C++ database driver and a Python DataFrame library can work on the exact same memory without either one knowing about the other.

Built on top of that, Arrow uses a columnar in-memory format: instead of representing a table as a list of rows, each row a collection of Python objects, Arrow stores all values for a column contiguously in a typed buffer. Nulls are tracked in a compact bitmap rather than per-cell None objects.
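A minimal pyarrow sketch makes the layout visible:

import pyarrow as pa

# All values of the column share one contiguous, typed buffer;
# nulls live in a compact validity bitmap rather than per-cell objects.
col = pa.array([1, 2, None, 4], type=pa.int64())
print(col.null_count)  # 1
print(col.buffers())   # [validity-bitmap buffer, int64 data buffer]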

For a database driver, this means the entire fetch loop can run in C++ and write values directly into Arrow buffers – no Python object creation per row, no garbage-collector pressure. The DataFrame library receives a pointer to that memory and can begin operating on it immediately. Crucially, subsequent operations – filters, joins, aggregations – also work in-place on those same buffers. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage, making Arrow the right foundation for high-throughput data processing pipelines.

For users of mssql-python, this translates into four concrete benefits:

  1. Faster fetches: the entire fetch loop runs in C++ and writes directly into Arrow buffers.
  2. Lower memory usage: values live in compact columnar buffers instead of per-row Python objects.
  3. Less garbage-collector pressure: no Python object is allocated per value during the fetch.
  4. Zero-copy interoperability: Polars, Pandas, DuckDB, and other Arrow-native libraries can consume the result directly.

Calling all Python + SQL developers! We invite the community to try out mssql-python and help us shape the future of high-performance SQL Server connectivity in Python!

The Arrow Fetch APIs

Three APIs have been added to the Cursor object.

1. cursor.arrow_batch(batch_size=8192) → pyarrow.RecordBatch

Fetches the next batch of up to batch_size rows as an Arrow RecordBatch and advances the cursor. RecordBatches are the building blocks for higher-level Arrow data types like tables and the batch reader interface.

import mssql_python

conn   = mssql_python.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT * FROM SalesData")

partial_data = cursor.arrow_batch(batch_size=50000)
process(partial_data)   # pyarrow.RecordBatch

2. cursor.arrow(batch_size=8192) → pyarrow.Table

Eagerly fetches the entire result set into a single Arrow Table. This is the simplest path and works well for analytics queries where the result fits comfortably in memory. However, because it materialises the full result set at once, it can cause high peak RAM usage or out-of-memory errors on very large or unbounded queries. For large exports or ETL workloads, prefer cursor.arrow_reader() (streaming, fetches lazily) or cursor.arrow_batch() (fetch one batch at a time). In both cases, batch_size is a tuning knob: larger batches improve throughput but increase peak memory; smaller batches reduce memory at the cost of slightly more per-batch overhead.

cursor.execute("SELECT customer_id, order_date, amount FROM Orders")
table = cursor.arrow()

# Zero-copy conversion to Polars
import polars as pl
df = pl.DataFrame(table)

# Or to Pandas with Arrow-backed dtypes
import pandas as pd
df = table.to_pandas(types_mapper=pd.ArrowDtype)

3. cursor.arrow_reader(batch_size=8192) → pyarrow.RecordBatchReader

Returns a lazy RecordBatchReader. Batches are fetched only when the reader is iterated, enabling streaming over very large result sets. RecordBatchReader is also accepted directly by DuckDB, Lance, and other Arrow-native libraries.

cursor.execute("SELECT * FROM LargeEventLog")
reader = cursor.arrow_reader(batch_size=100000)

for batch in reader:
    sink.write(batch)
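For example, DuckDB can consume the reader directly through its Arrow integration. A minimal sketch, assuming duckdb is installed and cursor is set up as above:

import duckdb

cursor.execute("SELECT * FROM LargeEventLog")
reader = cursor.arrow_reader(batch_size=100000)

# DuckDB's replacement scan resolves the local name 'reader' and
# streams batches through the query without materializing the full set.
con = duckdb.connect()
print(con.execute("SELECT COUNT(*) FROM reader").fetchone())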

Testing

We validated the Arrow fetch path against the standard Python row fetch path across a range of SQL Server types – numeric, temporal, string, and UUID – for both single-column and wide (20-column) tables. The full test script and results are available in the Resources section; we encourage you to run them on your own hardware to see the difference for your workload.

In our testing, the Arrow path was consistently faster for most SQL Server types. Temporal types showed the largest gains: types like DATETIME and DATETIMEOFFSET benefit significantly because the Arrow path handles timezone normalization and value encoding entirely in C++, eliminating per-value Python-side conversions. DATETIMEOFFSET in particular showed some of the most pronounced speedups we observed.

JSON Serialization Bonus

The Arrow path can also benefit API workloads that serialize results to JSON. Instead of fetchall() + json.dumps(), fetch via cursor.arrow(), wrap in a Polars DataFrame, and call df.write_json() – the entire pipeline bypasses Python objects and can be noticeably faster, especially for types like DATETIMEOFFSET.
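A minimal sketch of that pipeline; the table and column names are placeholders:

import polars as pl

cursor.execute("SELECT order_id, created_at, amount FROM Orders")
df = pl.DataFrame(cursor.arrow())  # zero-copy wrap of the Arrow table
json_payload = df.write_json()     # serializes without per-row Python objects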

NVARCHAR on Linux

Our Linux tests show longer fetch times for NVARCHAR due to the current UTF-16 → UTF-8 conversion path. On Windows, NVARCHAR fetches consistently faster with Arrow. A fix is targeted for a follow-up release.

Getting Started

Install or upgrade mssql-python, then add pyarrow:

pip install mssql-python pyarrow

For IDE type hints and static type checking:

pip install pyarrow-stubs

Then swap in cursor.arrow() wherever you would have called fetchall() and converted to a DataFrame. Your existing code is completely unaffected — Arrow support is purely additive.

import mssql_python
import polars as pl

conn   = mssql_python.connect(conn_str)
cursor = conn.cursor()

cursor.execute("SELECT * FROM dbo.LargeSalesTable")
df = pl.DataFrame(cursor.arrow())

print(df.describe())

What’s Next

One known area we are actively working to improve is NVARCHAR performance on Linux. SQL Server returns Unicode string data in UTF-16 encoding, which the driver must convert to UTF-8 before handing it to Arrow. On Windows this conversion uses a native system API that is very fast, but the current Linux code path goes through a slower chain of intermediate steps. As a result, NVARCHAR columns on Linux show longer fetch times compared to the Python fetch path — the opposite of every other type. A fix using a more efficient codec is in progress for a follow-up release. On Windows, our tests show NVARCHAR fetching noticeably faster with Arrow, and Linux will follow.

A Note of Thanks

This feature was contributed by Felix Graßl (@ffelixg), the author of zodbc, his own Zig-based ODBC driver. His deep familiarity with ODBC and Arrow made this a thorough, well-tested contribution covering both Linux and Windows, and all three fetch patterns. We are very grateful for his work and the care he brought to this feature.

Resources

Try It and Share Your Feedback! 

We invite you to: 

  1. Check out the mssql-python driver and integrate it into your projects.
  2. Share your thoughts: Open issues, suggest features, and contribute to the project. 
  3. Join the conversation: GitHub Discussions | SQL Server Tech Community

Use Python Driver with Free Azure SQL Database

You can use the Python Driver with the free version of Azure SQL Database!

✅ Deploy Azure SQL Database for free

✅ Deploy Azure SQL Managed Instance for free

Perfect for testing, development, or learning scenarios without incurring costs.

Have questions or feedback? Open an issue or discussion on GitHub, or reach out to the team at mssql-python@microsoft.com

The post Introducing Apache Arrow Support in mssql-python appeared first on Microsoft for Python Developers Blog.

May 04, 2026 04:33 AM UTC


Armin Ronacher

Content for Content’s Sake

Language is constantly evolving, particularly in some communities. Not everybody is ready for it at all times. I, for instance, cannot stand that my community is now constantly “cooking” or “cooked”, that people in it are “locked in” or “cracked.” I don’t like it, because the use of the words primarily signals membership of a group rather than one’s individuality.

But some of the changes to that language might now be coming from … machines? Or maybe not. I don’t know. I, like many others, noticed that some words keep showing up more than before, and the obvious assumption is that LLMs are at fault. What I did was take 90 days’ worth of my local coding sessions and look for medium-frequency words where their use is inflated compared to what wordfreq would assume their frequency should be. Then I looked for the more common of these words and did a Google Trends search (filtered to the US). Note that some words like “capability” are more likely going to show up in coding sessions just because of the nature of the problem, so the actual increase is much more pronounced than you would expect.
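If you want to try something similar, here's a minimal sketch of that comparison. The sessions.txt corpus file and the thresholds are hypothetical:

from collections import Counter
from wordfreq import word_frequency

words = open("sessions.txt").read().lower().split()
counts = Counter(words)
total = sum(counts.values())

# Ratio of observed frequency to wordfreq's English baseline;
# skip rare words to keep the estimate stable.
inflation = {
    word: (count / total) / word_frequency(word, "en")
    for word, count in counts.items()
    if count > 10 and word_frequency(word, "en") > 0
}
for word, ratio in sorted(inflation.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{word}: {ratio:.1f}x")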

You can click through it; this is what the change over time looks like. Note that these are all words from agent output in my coding sessions that are inflated compared to historical norms:

[Interactive word trend chart showing the inflated words over time]

Something is going on for sure. Google Trends, in theory, reflects words that people search for. Maybe agents are doing some of the Googling, but it might just be humans Googling for stuff that is LLM-generated; I don’t know. This data set might be a complete fabrication, but for all the words I checked and selected, I also saw an increase on Google Trends.

So how did I select the words to check in the first place? First, I looked for the highest-frequency words. They were, as you would expect, things like “add”, “commit”, “patch”, etc. Then I had an LLM generate a word list of words that it thought were engineering-related, and I excluded them entirely from the list. Then I also removed the most common words to begin with. In the end, I ended up with the list above, plus some other ones that are internal project names. For instance, habitat and absurd, as well as some other internal code names, were heavily over-represented, and I had to remove those. As you can see, not entirely scientific. But of the resulting list of words with a high divergence compared to wordfreq, they all also showed spikes on Google Trends.

There might also be explanations other than LLM generation for what is going on, but I at least found it interesting that my coding session spikes also show up as spikes on Google Trends.

The Rise of LLM Slop

The choice of words is one thing; the way in which LLMs form sentences is another. It’s not hard to spot LLM-generated text, but I’m increasingly worried that I’m starting to write like an LLM because I just read so much more LLM text. The first time I became aware of this was that I used the word “substrate” in a talk I gave earlier this year. I am not sure where I picked it up, but I really liked it for what I wanted to express and I did not want to use the word “foundation”. Since then, however, I am reading this word everywhere. This, in itself, might be a case of the Baader–Meinhof phenomenon, but you can also see from the selection above that my coding agent loves substrate more than it should, and that Google Trends shows an increase.

We have all been exposed to LLM-generated text now, but I feel like this is getting worse recently. A lot of the tweet replies I get and some of the Hacker News comments I see read like they are LLM-generated, and that includes people I know are real humans. It’s really messing with my brain because, on the one hand, I really want to tell people off for talking and writing like LLMs; on the other hand, maybe we all are increasingly actually writing and speaking like LLMs?

I was listening to a talk recording recently (which I intentionally will not link) where the speaker used the same sentence structure that is over-represented in LLM-generated text. Yes, the speaker might have used an LLM to help him generate the talk, but at the same time, the talk sounded natural. So either it was super well-rehearsed, or it was natural.

Engage and Farm

At least on Twitter, LinkedIn, and elsewhere, there is a huge desire among people to write content and be read. Shutting up is no longer an option and, as a result, people try to get reach and build their profile by engaging with anything that is popular or trending. In the same way that everybody has gazillions of Open Source projects all of a sudden, everybody has takes on everything.

My inbox is a disaster of companies sending me AI-generated nonsense and I now routinely see AI-generated blog posts (or at least ones that look like they are AI-generated) being discussed in earnest on Hacker News and elsewhere.

Genuine human discourse had already been an issue because of social media algorithms before, but now it has become incredibly toxic. As more and more people discover that they can use LLMs to optimize their following, they are entering an arms race with the algorithms and real genuine human signal is losing out quickly. There are entire companies now that just exist to automate sending LLM-generated shit and people evidently pay money for it.

Speed Should Kill

If we take into account the idea that the highest-quality content should win out, then the speed element would not matter. If a human-generated comment comes in 15 minutes after a clanker-generated one, but outperforms it by being better, then this whole LLM nonsense would show up less. But I think that LLM-generated noise actually performs really well. We see this plenty with Open Source now. Someone builds an interesting project, puts it on GitHub and within hours, there are “remixes” and “reimplementations” of that codebase. Not only that, many of those forks come with sloppy marketing websites, paid-for domains, and a whole story on socials about why this is the path to take.

I have complained before that Open Source is quickly deteriorating because people now see the opportunity to build products on top of useful Open Source projects, but the underlying mechanics are the same as why we see so much LLM slop. Someone has a formed opinion (hopefully) at lunch, and then has a clanker-made post 3 minutes later. It just does not take that much time to build it. For the tweets, I think it’s worse because I suspect that some people have scripts running to mostly automate the engagement.

And surely, we should hate all of this. These low-effort posts, tweets, and Open Source projects should not make it anywhere. But they do! Whatever they play into, whether in the algorithms or with human engagement, they are not punished enough for how little effort goes into them.

Friction and Rate Limiting

That increases in speed and ease of access can turn into problems is a long-understood issue. ID cards are a very unpopular thing in the UK because the British are suspicious of misuse of a central database after what happened in Nazi Germany. Likewise the US has the Firearm Owners Protection Act from 1986, which also bans the US from creating a central database of gun owners. The gun-tracing methodologies that result from not having such a database look like something out of a Wes Anderson movie. We have known for a long time that certain things should not be easy, because of the misuse that happens.

We know it in engineering; we know it when it comes to governmental overreach. Now we are probably going to learn the same lesson in many more situations because LLMs make almost anything that involves human text much easier. This is hitting existing text-based systems quickly. Take, for instance, the EU complaints system, which is now buckling under the pressure of AI. Or take any AI-adjacent project’s issue tracker. Pi is routinely getting AI-generated issue requests, sometimes even without the knowledge of the author.

Trust Erosion and Gaslighting

I know that’s a lot of complaining for “I am getting too many emails, shitty Twitter mentions, and GitHub issues.” I really think, though, that now that we know that it’s happening, we have to change how we interact with people who are increasingly automating themselves. Not only do they produce a lot of shitty slop that we all have to sit through; they are also influencing the world in much more insidious ways, in that they are influencing our interactions with each other. The moment I start distrusting people I otherwise trust, because they have started picking up LLM phrasing, it erodes trust all over society.

You also can’t completely ban people for bad behavior, because some of this increasingly happens accidentally. You sending Polsia spam to me? You’re dead to me. You sending me an AI-generated issue request and following up with an apology five minutes later? Well, I guess mistakes happen. Yet, in many ways, what is going on and will continue to go on is unsettling.

I recently talked with my friend Ben who said he forced someone to call him to continue a conversation because he was no longer convinced he was talking to a human.

Not all of us have been exposed to the extreme cases of this yet, but I had a handful of interactions in which I questioned reality due to the behavior of the person on the other side. I struggle with this, and I consider myself to be pretty open to new technologies and AI in particular. But how will my children react to stuff like this? My mother? I have strong doubts that technology is going to solve this for us.

Suggestions for Change

The reason I don’t think technology is going to solve this for us is that while it can hide some spam and label some generated text, it won’t fix us humans. What is being damaged here are social interactions across the board: the assumption that when someone writes to you, there is a person on the other side who has put some care into the interaction. I would rather have someone ghost me or reject me than send me back some AI-generated slop.

Change has to start with awareness, and an unfortunate development is that LLMs don’t just influence the text we read, they also influence the text we write, even when we don’t use them. Given the resulting ambiguity, we need to become more aware of how easily we can turn into energy vampires when we use agents to back us up in interactions with others. Consider that every time someone reads text coming from you, they will increasingly have to make a judgement call about whether it was you, an LLM, or you and an LLM together that produced the interaction. Transparency in either direction, when there is ambiguity, can help a great deal.

When someone sends us undeclared slop, we need to change how we engage with them. If we care about them, we should tell them. If we don’t care about them, we should not give them visibility and not engage.

When it comes to creating platforms and interfaces where text can be submitted, we need to throw more wrenches in. The fact that it was cheap for you to produce does not make it cheap for someone else to receive, and we need to find more creative ways to increase the backpressure. GitHub, or whatever wants to replace it, will have a lot to improve here, some of which might go against its core KPIs. More engagement is increasingly the wrong thing to optimize for if you want a long-term healthy platform.

Whatever we can do to rate-limit social interactions is something we should try: more in-person meetings, more platforms where trust has to be earned, and maybe more acceptance that sometimes the right response is no response at all.

And as for AI assistance on this blog: I have had an AI transparency disclaimer for a while. In this particular blog post, I used Pi as an agent to help me generate the dynamic visualization, and I used the agent to write the code to analyze and scrape Google Trends.

May 04, 2026 12:00 AM UTC

May 03, 2026


"Michiel's Blog"

Talk at PyGrunn on httpxyz

On Friday, the 8th of May 2026, I will be giving a talk at PyGrunn on httpxyz, our new fork of the popular Python package httpx. PyGrunn is a full-day Python (“and friends”) conference in Groningen, The Netherlands. httpx is a top-100 Python package for sending HTTP requests, but it has not had a release since the end of 2024; recently, all its issues were set to hidden and all discussions were closed. httpxyz is our friendly fork with lots of fixes for serious and more niche issues. For more info, read my announcement post for httpxyz. I will talk about why we did the fork and how we approach it. I’ll also delve into how to build a performant API client in Python and the technical details of HTTP.

This is an expanded version of the talk I gave at PyAmsterdam last week, and it is part of my ‘promotion efforts’ for httpxyz. I look forward to giving my talk and to seeing you there!

Me presenting at PyAmsterdam

May 03, 2026 08:00 PM UTC


PyCon

Asking the Key Questions: Q&A with the PyCon US 2026 keynote speaker Pablo Galindo Salgado

 

This is a blog series where we're asking each of our PyCon US 2026 keynote speakers about their journey into tech, how excited they are for PyCon US, and any tips they can provide for an awesome conference experience!

Thank you Pablo for this interview! You can learn more about Pablo's keynote on the PyCon US Keynote Speakers page, and you can also attend Pablo Galindo Salgado's meet and greet at the PSF Booth in the Expo Hall on Saturday, May 16, after Pablo's keynote.



How did you get started in tech/Python? Did you have a friend or a mentor that helped you?


I got into tech through the back door: as part of my Physics studies, needing to write code to run simulations and process data. The simulations themselves were in Fortran 77 and C++ but for the rest I tried a bunch of languages before landing on Python, but Python had something the others didn't: it was genuinely fun. And then I discovered the community, and that was it. I didn't have a single mentor so much as a whole constellation of generous people in the Python world. The core dev team is full of some of the most talented and kind people I've ever met, and I learn from them every single day.


What do you think the most important work you've ever done is?


Honestly, a tough one. I've done technical work I'm proud of: the PEG parser, better error messages, performance and memory profilers, debuggers, work in the garbage collector, the Steering Council... but if I'm being real, the human contributions matter more to me than the technical ones. The contributors I've mentored who became core developers themselves. The talk that made someone feel like they could contribute too. The code will always be there (or not!) but helping people feel welcome and capable in this community is the work that actually keeps me going.


Have you been to PyCon US before? What are you looking forward to?


This will be my sixth PyCon US! I should probably be blasé about it by now, but I'm genuinely not. Every year I get that same rush walking into a room full of people who care as deeply about this stuff as I do. What I'm most looking forward to is seeing everyone: there are so many people in this community I only get to see once a year, and those reunions mean the world to me. And the conversations. The hallway conversations, the late-night ones, the ones that start at a talk and end up somewhere completely unexpected. That's where the magic happens.


Do you have any advice for first-time conference goers?


Talk to people! The hallway track is real and it's where some of the best things at PyCon happen. Introduce yourself, go to the social events, ask questions after talks: everyone here is friendly and almost everyone remembers what it felt like to be new. And please, go to the Sprints. They are so underrated. You don't need to be an expert, you just need to show up and people will help you find something to work on, and it might just be the start of something big. Finally, be kind to yourself. You won't see everything and that's okay. Pick what excites you, let yourself be surprised, and enjoy being part of something wonderful.


Can you tell us about an open source project not enough people know about?


A twist: CPython itself. Everyone knows about CPython, but I don't think people really know it as a community : a place where real humans show up every day and do imperfect, collaborative, joyful work together. There's a persistent myth that core developers are geniuses in an ivory tower who never make mistakes. I want to bust that completely. We are normal people. We make mistakes, we don't always know the answers, we learn from each other, and we have an enormous amount of fun. How CPython gets built is still a mystery to many Python developers, but it really doesn't need to be. The project is open, the conversations are public, and the door is open to anyone who wants to contribute. Come take a look! You might be surprised at how human it all is.


May 03, 2026 03:29 PM UTC


Real Python

Quiz: Revisit Python Fundamentals

In this quiz, you’ll revisit the core concepts covered in the Revisit Python Fundamentals learning path. The 15 questions span variables, data types, operators, expressions, keywords, and exceptions, giving you a way to check that you understood the most important ideas.

Take your time and revisit any topics that feel rusty before moving on to the next learning path.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

May 03, 2026 12:00 PM UTC

May 02, 2026


PyCon

Asking the Key Questions: Q&A with the PyCon US 2026 keynote speaker amanda casari

 

This is a blog series where we're asking each of our PyCon US 2026 keynote speakers about their journey into tech, how excited they are for PyCon US, and any tips they can provide for an awesome conference experience!

Thank you amanda for this interview! You can learn more about amanda's keynote on the PyCon US Keynote Speakers page, and you can also attend amanda's meet and greet at the PSF Booth in the Expo Hall on Thursday, May 14, during the opening reception from 5 to 6pm PT.


Without giving away too many spoilers, can you tell us what your keynote is about?

More and more these days, amanda is asking how do you make space in open source for hope.

How did you get started in tech/Python? Did you have a friend or a mentor that helped you?

My first time wrestling with Python was in 2009 when I was struggling to set up a webserver for a graduate student project building a microgrid testbed for a local national park. When I moved to Seattle a few years later, the local Python tech community was extremely welcoming, friendly, and really focused on bringing people together to make them feel connected. I'm especially grateful to PyLadies Seattle leader Wendy Grus, and later Carol Willing, for entertaining and celebrating with me so many silly ideas.

What do you think the most important work you’ve ever done is? Or if you think it might still be in the future, can you tell us something about your plans?

It's hard for me to judge what the most important work I've ever done is, or will be. So much of my work is a series of incremental changes or decisions made with a goal to impact what is next, rather than what is nearby. What I will always be most proud of is when I'm given the opportunity to build teams with other people who challenge me. The most important is always the people, and how we spend the time together when our lives intersect.

Have you been to PyCon US before? What are you looking forward to?

YES! In no particular order, I'm looking forward to: community booth time, meals with old friends, meeting new friends, and finally giving the 5K a noble effort.

Do you have any advice for first-time conference goers or any general conference tips?

Pick at least one talk or session that is completely new to you, or that you have no idea whether or not it intersects with your interests. If it's a low-volume crowd, sit near the front, and be the silent, attentive, and encouraging audience member that every speaker needs.

Can you tell us about an open source or open culture project that you think not enough people know about?

I'm a MASSIVE space nerd. Last year I finally learned about RTEMS, and now I'm obsessed. Everyone talked about the proprietary software failure from the recent Artemis II launch, but they SHOULD have been talking about RTEMS being onboard!!! As a successful open source project that's been running for over 30 years, I want everyone to know about this.

May 02, 2026 12:48 PM UTC

May 01, 2026


Rodrigo Girão Serrão

TIL #144 – Sentinel built-in

Today I learned Python 3.15 will get a new sentinel built-in.

Sentinel values are unique placeholder values that are commonly used in programming. Python 3.15 ships with a new built-in sentinel that can be used to create new sentinel values:

# Python 3.15+
>>> MISSING = sentinel("MISSING")
>>> MISSING
MISSING

Before this built-in was added, the most common sentinel idiom used the built-in object:

MISSING = object()

def my_function(some_arg=MISSING):
    if some_arg is MISSING:
        ... # Handle the sentinel

In the function above, the sentinel value MISSING is used to check whether the user passed anything for the parameter some_arg. PEP 661, which introduced this built-in, has a great discussion of why this pattern, and many other sentinel patterns, fall short. In general, each common sentinel idiom suffers from at least one of the following problems:

  1. Bad string repr: the string representation is too long and uninformative
  2. Type unsafe: the sentinels don't have a distinct type so it becomes hard or impossible to write code that uses the sentinels and is type safe
  3. Unexpected copy behaviour: the sentinels can't be copied or pickled without breaking the sentinel behaviour
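The new built-in is designed to avoid these pitfalls. A minimal sketch on Python 3.15, reusing the function from above (behaviour as described by PEP 661):

# Python 3.15+
import copy

MISSING = sentinel("MISSING")

def my_function(some_arg=MISSING):
    if some_arg is MISSING:
        ...  # Handle the sentinel

print(MISSING)                            # MISSING -- short, informative repr
assert copy.deepcopy(MISSING) is MISSING  # copying preserves identity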

May 01, 2026 05:49 PM UTC


Mike Driscoll

Textual-cogs 0.0.5 Released

I always thought it would be fun to create my own open source libraries or applications and distribute them somehow. When I started writing my book, Creating TUI Applications with Textual and Python, I took the plunge and wrote a helper package called textual-cogs, which is a collection of reusable dialogs and widgets for Textual. Right now, it is mostly just dialogs, but I do hope to add some widgets to it as well.

Anyway, I have released two new dialogs in the past week, with one in v0.0.4 and the other in v0.0.5.

A Textual Directory Dialog

In v0.0.5, I added a directory dialog similar to wxPython’s wx.DirDialog. The dialog will display the user’s directories and allow the user to choose one. It will also allow the user to create a new folder.

Here’s a screenshot:

Textual cogs - Directory Dialog

A Textual Open File Dialog

In v0.0.4, I also added an open file dialog. Textual cogs already has a save file dialog, and I had meant to include the open file dialog originally, but only recently got it added.

Here is what that looks like:

Textual cogs - Open File Dialog

How to Install textual-cogs

You can install textual-cogs using pip or uv:

python -m pip install textual-cogs

Where to Get textual-cogs

You can find textual-cogs on the following websites:

The post Textual-cogs 0.0.5 Released appeared first on Mouse Vs Python.

May 01, 2026 02:58 PM UTC


Real Python

The Real Python Podcast – Episode #293: Agentic Data Science Pair Programming With marimo pair

How do you add agent skills to your data science workflow? How can a coding agent assist with data wrangling and research? This week on the show, Trevor Manz from marimo joins us to discuss marimo pair.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

May 01, 2026 12:00 PM UTC

Quiz: The Factory Method Pattern and Its Implementation in Python

In this quiz, you’ll test your understanding of The Factory Method Pattern and Its Implementation in Python.

Factory Method is one of the most widely used design patterns, and it’s a powerful tool for separating object creation from object use in your code.

By working through this quiz, you’ll revisit the components of the pattern, recognize opportunities to apply it, and see how you can implement a reusable, general-purpose solution in Python.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

May 01, 2026 12:00 PM UTC


Luke Plant

Inverse Sapir-Whorf and programming languages

The Sapir-Whorf hypothesis, in its simplest form, is the idea that the language you speak influences the thoughts you think. This post is about a twist on this idea, that I’m calling “Inverse Sapir-Whorf” (for want of a better term), and how we see it in computer programming languages.

Sapir-Whorf is one of those ideas that has been popularised in general culture in a rather misrepresented and exaggerated form. In the field of linguistics, not many people today take seriously the “strong” forms of Sapir-Whorf, such as “linguistic determinism” – the idea that a language controls your thoughts or limits what you can think, or that you even need certain languages to think certain thoughts.

For example, just because a language might lack grammatical tenses, it doesn’t at all follow that the speakers will be more limited in how they think about time – there are always other ways you can express time.

There is a fair amount of evidence that spoken languages can affect perception, skill and attitudes in certain areas, but it’s usually hard to demonstrate a large direct effect.

Inverse Sapir-Whorf is a bit different. I haven’t been able to track down where I first came across the idea, but it goes like this: if classic Sapir-Whorf says your language limits what you can say or think, or makes it hard to say some things, inverse Sapir-Whorf says your language limits what you can’t say, or makes it hard not to say some things, or even hard not to think about some things. Some examples might clear things up.

Examples in natural language

There are many examples to choose from, but they are not always obvious to native speakers of a language. I’ll pick just a few.

English: temporary or permanent present tense

What’s the difference between someone saying “I’m living in London” and “I live in London”? A non-native speaker may not pick this up at all, and a native speaker may pick it up only subconsciously, but “I’m living in London” reveals that the arrangement is temporary.

Now, this might not even be to do with the actual length of time you have been living there, because “temporary” is pretty relative. It might be more about how much you like London. You have to choose a tense, and because you typically do so subconsciously, the language is forcing you to reveal things – either the period of time you’ve been living somewhere, or how you feel about it.

English/Turkish/French: gendered pronouns and nouns

In English, in normal speech you are going to use “he” or “she” when referring to a specific person. “Singular they” does exist, but it’s very unnatural if you are talking about a specific person of known or assumed sex.

You can compare this to another language which doesn’t have gendered pronouns, such as Turkish, which just has “o” for he/she/it. The lack of gendered pronouns in Turkish doesn’t stop you from thinking or talking about a person’s sex, or produce a “less gendered society”, or anywhere close, so it would be difficult to find support for normal Sapir-Whorf here. But the inverse Sapir-Whorf is obvious – English pronouns push you to talk about it whether you want to or not. If you are trying to talk about someone you know, but do so anonymously, it can be very hard to avoid making their identification easier by revealing their sex with an inadvertent “him” or “her”.

Different again is French, in which nouns are gendered, which in some cases can force you to reveal information. If you translate “my friend” into French, you have to choose between “mon ami” (male friend) and “mon amie” (female friend), which are distinct, at least in written form, or “mon copain” vs “ma copine”. Possessive pronouns are also interesting – they are gendered in both English and French (his/her, son/sa), but refer to the gender of the possessor and possessee respectively, and so reveal different information.

Turkish: “mış” tense

With some simplifications, Turkish has two main past tenses: there is the normal one that is similar to “simple past” in English, and then there is the “mış” form (you can pronounce that “mish” if you want).

This has various functions, but when describing a past event, this form is used when you have second hand or unreliable information. If someone asks you “Did Fred come to work on Monday?”, then if you saw him you would use the normal past tense “geldi” (he came), but if you only heard that he came you would instead say “gelmiş” (he came, but second hand information).

The interesting thing to me as a non-native speaker was the effect of having these options, in contrast to English where you can just use simple past tense without any specific indication of reliability or where the information came from. In certain circumstances, Turkish forces you to include information about your level of certainty or whether you witnessed something – the simple past form is not neutral, because the existence of the “mış” form makes it an unnatural choice if it is not the most appropriate of the two.

Interestingly, having learned to think that way, my wife and I have noticed an effect on our English. Often in Turkish the “mış” suffix would come at the end of the last word in a sentence, so now quite frequently we get to the end of an English sentence and notice that we haven’t put in any marker for “this-is-second-hand-info-I-didn’t-actually-witness-it”, and so we tack “mış” on the end.

Of course, you can easily express the same thing in English, using words like “apparently” and other means, but English doesn’t force you to specify, while Turkish pretty much does.

Comments

You often don’t notice these things until you learn another language, or attempt to teach your language to a foreigner. You kind of just understand them subconsciously. The vast majority of times you choose simple present over present continuous, for example, you won’t be consciously thinking about what that implies.

I should also note that when a language forces you to express something, it might not be in the form of something included, but in something omitted. For example, I might say “I love cake” or “I love the cake”. In the first case, I’m talking about cake generally, in the second about a specific cake. It is the absence of the word “the” in the first case that makes it unambiguous that I’m referring to all cake, because if I’m referring to a specific cake, I must use the word “the” or some other marker like “this”. In another language, there might not be a direct equivalent to this distinction.

Examples in programming

When it comes to programming languages, I think that the “straight” version of Sapir-Whorf is closer to being true - in some programming languages it is simply hard to express certain concepts. For example, in a language like Python or Haskell it’s hard (though not impossible) to talk about memory allocations. We often talk about the limitations of a language in terms of “things that are hard to express” in that language. Hillel Wayne has some more discussion of this in his post Sapir-Whorf does not apply to Programming Languages.

But I want to talk more about Inverse Sapir-Whorf. What is the language forcing you to talk about, even if you don’t actually care about it?

I think there are actually many, many examples of this, but seeing them can be quite hard, and often requires the “foreigner perspective” that comes from learning multiple languages.

Here are a few:

I suspect that many of the features of more “approachable” or “readable” programming languages could be analysed in these terms – they have a low inverse Sapir-Whorf barrier, and don’t force you to talk about things you don’t have an opinion on, and may not even understand yet.

Are there more examples of this that you’ve come across? How do they affect the programming languages we use, or how we perceive them?

Links

May 01, 2026 08:40 AM UTC


Tryton News

Tryton News May 2026

During the last month, we focused on fixing bugs, improving behaviour, and speeding up performance, building on the changes from our last LTS release 8.0. We also added some new features, which we would like to introduce to you in this newsletter.

For an in depth overview of the Tryton issues please take a look at our issue tracker or see the issues and merge requests filtered by label.

Changes for the User

Accounting, Invoicing and Payments

We have updated the supported Stripe API version from 2025-09-30.clover to 2026-03-25.dahlia.

Stock, Production and Shipments

Now we include the time-sheet costs in the production work costs.

User Interface

We now fall back to the model name when no name parameter is given in a Tryton URL.

Now we support sending emails for chat messages and the ability to reply to them.

Modules

We have now moved the account_de_skr03, account_es, and account_es_sii modules to the external tryton-community project.

New Documentation

We have added new documentation for the REST API.

New Releases

We released bug fixes for the currently maintained long-term support series 8.0, 7.0, and 6.0, and for the penultimate series 7.8 and 7.6.

Changes for the System Administrator

Now we add a REST API (https://code.tryton.org/tryton/-/commit/44edc21632c653a7a0db8a0ee42a8631c6d10f31) for user applications. For more information, have a look at its documentation.

Changes for Implementers and Developers

We now fall back to the compact syntax if the RelaxNG files are not present. lxml is able to load the compact syntax when the rnc2rng package is installed. This avoids the need to generate the RelaxNG files when developing.

Authors: @dave @pokoli @udono

1 post - 1 participant

Read full topic

May 01, 2026 06:00 AM UTC


Python GUIs

Streamlit Buttons — Making things happen with Streamlit buttons

Streamlit is a popular choice for creating interactive web applications in Python. With its simple syntax and intuitive interface, developers can quickly create visually appealing dashboards.

One of the great things about Streamlit is its ability to easily handle user interaction and dynamically update the UI in response. One of the main ways for users to trigger actions in UIs is through buttons. In Streamlit, the st.button() method creates a button that users can click to perform an action. Each button can be associated with a different action.

In this tutorial we'll look at how you can use buttons to add interactivity to your Streamlit apps.

Creating Buttons in Streamlit

To create a button in Streamlit, you use the st.button() function, which takes an optional label as an argument. When the button is clicked, it returns True, which you can use to control subsequent actions.

Basic Button Syntax

Here's a simple example of a button in Streamlit:

python
import streamlit as st

if st.button('Click Me'):
    st.write("Button clicked!")

Simple Streamlit app with a single button

The st.button('Click Me') call creates a button labeled Click Me. When the button is clicked, it returns True, so the if condition passes and the nested code runs, displaying the message "Button clicked!"

This basic structure is the foundation of working with buttons in Streamlit. Through this simple mechanism you can build quite complex interactivity.

Multiple Buttons for Different Actions

Building on the basic button structure, you can create multiple buttons within your Streamlit app, each associated with different actions. For instance, let's create buttons that display different messages based on which is clicked.

python
import streamlit as st

if st.button('Show Greeting'):
    st.write("Hello, welcome to the app!")

if st.button('Show Goodbye'):
    st.write("Goodbye! See you soon.")

Simple Streamlit app with two buttons

Each button is wrapped in a conditional statement. When a button is pressed, the corresponding action is executed. Depending on the button pressed, different messages are displayed, providing immediate feedback to the user.

This structure is versatile and can be expanded to include more buttons and actions.

Displaying Dynamic Content Based on Button Clicks

Buttons can be used to display all types of content dynamically, including text, images, and charts. For example, below is a similar example but displaying images.

python
import streamlit as st

img_url_1 = "https://placehold.co/150/FF0000"
img_url_2 = "https://placehold.co/150/8ACE00"

if st.button('Show Red Image'):
    st.image(img_url_1, caption="This is a red image")

if st.button('Show Green Image'):
    st.image(img_url_2, caption="This is a green image")

Simple Streamlit app with two buttons showing images

When the Show Red Image button is pressed, a red image is displayed. The same goes for the Show Green Image button. This setup allows users to switch between different images based on their preferences.

Note that the state isn't persisted between interactions. When you click the "Show Red Image" button, the green image will disappear, and vice versa. This isn't a toggle but a natural consequence of how Streamlit works: the script is re-executed on each interaction, so only one button can be in a "clicked" state at any time.

To persist state between runs of the script, you can use Streamlit's state management features. We'll cover this in a future tutorial.
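
As a quick preview, here's a minimal sketch of that approach using st.session_state (the show_red key name is our own choice for illustration):

python
import streamlit as st

# Initialize a persistent flag on the first run of the script
if "show_red" not in st.session_state:
    st.session_state.show_red = False

# Flip the flag each time the button is clicked
if st.button('Toggle Red Image'):
    st.session_state.show_red = not st.session_state.show_red

# The flag survives reruns, so the image stays visible until toggled off
if st.session_state.show_red:
    st.image("https://placehold.co/150/FF0000", caption="This is a red image")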

Dynamic Forms Based on Button Press

Dynamic forms allow users to provide input in a structured way, which can vary based on user actions. This is particularly useful for collecting information without overwhelming users with multiple fields.

Here's a quick example where users can input their name and age based on button presses:

python
import streamlit as st

# Title
st.title("Dynamic Forms Based on Button Press")

# Name Input Field
if st.button('Enter Name'):
    name = st.text_input('What is your name?')
    if name:
        st.write(f"Hello, {name}\!")

# Age Input Field
if st.button('Enter Age'):
    age = st.number_input('What is your age?', min_value=1, max_value=120)
    if age:
        st.write(f"Your age is {age}.")

The button Enter Name reveals a text input field when clicked, allowing users to enter their names. The button Enter Age displays a number input field for users to enter their age. Keep in mind the caveat from the previous section: because the script reruns on each interaction and the button then returns False, the input field will disappear again as soon as you interact with it, so persisting the input requires session state or the form-based approach shown next.

Handling Form Submission

For more complex collections of inputs that you want to work together, consider using st.form() to group inputs, allowing users to submit all inputs at once:

python
import streamlit as st

# Title
st.title("Dynamic Forms Based on Button Press")

with st.form("my_form"):
    name = st.text_input('What is your name?')
    age = st.number_input('What is your age?', min_value=1, max_value=120)
    submitted = st.form_submit_button("Submit")

    if submitted:
        st.write(f"Hello, {name}\! Your age is {age}.")

Streamlit form with submit button

Conclusion

In this tutorial, we explored how to make things happen in Streamlit using buttons. We learned how to create multiple buttons and display dynamic content based on user interaction.

Now that you have a basic understanding of buttons in Streamlit, you can add basic interaction to your Streamlit applications.

May 01, 2026 06:00 AM UTC


Antonio Cuni

Why Python Is Slow: Talking about SPy on the Behind the Commit Podcast

During EuroPython 2025 I had the pleasure to talk to Mia Bajić for her podcast Behind The Commit.

In the chat we mainly talk about Python performance and how SPy tries to improve it.

Now the full episode is live: you can watch it on YouTube or listen on Spotify.

May 01, 2026 12:00 AM UTC

April 30, 2026


Real Python

Quiz: Using Python for Data Analysis

In this quiz, you’ll test your understanding of Using Python for Data Analysis.

By working through this quiz, you’ll revisit the stages of a data analysis workflow, including cleansing raw data with pandas, spotting outliers and typos, and using regression to find relationships between variables.


April 30, 2026 12:00 PM UTC


EuroPython

EuroPython 2026: Ticket Sales Now Open

Hey hey, folks 👋

Get ready for EuroPython 2026: the conference for all things Python, Data Science, and AI! 

We’ve got an exciting week planned:

We have a special keynote this year: Łukasz Langa and Pablo Galindo Salgado will be recording the core.py podcast right on the conference stage. It will feature their special guest Guido van Rossum, the creator of Python.

Ticket sales for EuroPython 2026 are now open

People who’ve been to EuroPython will tell you that it is more than just talks and tutorials: it's a time when the entire community comes together, regardless of experience level or background. Each conference leads to new friends being made, projects gaining new contributors, and even people securing their next job. We want you all to be a part of it 💚

🎫 Grab your tickets before they sell out:

Can’t wait to see you all in Kraków and hang out with the Python crowd again 🐍💚

Cheers,

The EuroPython 2026 Organisers ✨

April 30, 2026 10:00 AM UTC


Seth Michael Larson

The Frog for Whom the Bell Tolls

Kaeru no Tame ni Kane wa Naru (カエルの為(ため)に鐘(かね)は鳴(な)る) is a Japanese-only Game Boy title published in 1992 by Nintendo and developed by Intelligent Systems. The title’s official English translation is “The Frog for Whom the Bell Tolls”. For brevity, I’ll be using the title “Frog Game” in this article.

After I finished Link’s Awakening, the Frog Game started popping up everywhere in my digital life. The first occurrence was without my knowledge: some of the characters in Link’s Awakening, Prince Richard and his frogs, are originally from the Frog Game and use the same sprites and music.


Picture of my “Kaeru no Tame ni Kane wa Naru” Game Boy cartridge.

While researching what game to play after Link’s Awakening I watched a video by AntDude detailing the history of hand-held Legend of Zelda games. The video starts by mentioning “Frog Game” instead of the actual first Zelda game on the Game Boy: Link’s Awakening. Very intriguing...

After further research I stumbled across a project by Iván Delgado (Bluesky, YouTube) to create a colorization patch for “Frog Game” that appears to still be in progress. I was already a subscriber to Iván’s blog and had previously read their series of posts about colorizing Game Boy games.

Everything I read about the game made me want to play: the game was affordable, short (7 hours to beat), with a light and funny narrative, and had ties to some of my favorite games. I’ve since played Frog Game and I recommend the game as a quick and fun “pocket-sized” adventure.

Playing with English translations

Kaeru no Tame ni Kane wa Naru was never released outside of Japan and despite multiple re-releases to the 3DS eShop and now Nintendo Classics, there is no official English translation.

I can’t read Japanese, but I wanted to experience the dialogue. Luckily for me, there is a fan-created English translation patch from 2011. I would need the actual game ROM to apply the patch.


Japanese title screen for “Kaeru no Tame ni Kane wa Naru”


Title screen with the English translation patch applied

I purchased the game cartridge for $10 on eBay and dumped the ROM using GB Operator. Next I applied the English translation patch (.ips) using ROM Patcher JS by Marco Bledo. I loaded the resulting ROM into the Delta Emulator and played exclusively on this platform (with RetroAchievements enabled).

Beware: There are minor spoilers beyond this point!

References

While the game’s title is a reference to “For Whom the Bell Tolls” by Ernest Hemingway, the game’s story definitely isn’t. One of the goals of the protagonists is to repair and ring the “Spring Bell” to break the curse on the princes and their army, transforming them from frogs back into humans.

The developers of Frog Game, Intelligent Systems, also developed my favorite game of all time: “Paper Mario: The Thousand-Year Door”. Chapter 4 of Paper Mario is titled “For Pigs the Bell Tolls”, which is another reference to Hemingway, and potentially to Frog Game as well. Chapter 4’s story in Paper Mario has the villain “Doo*liss” ringing the Creepy Steeple bell, which transforms the Twilight Town inhabitants one-by-one into pigs.

Frog Game references Nintendo very directly multiple times. During your adventure you visit “Nantendo Inc.” (not a typo!) to talk to the scientists there. One of the “products” you end up needing from Nantendo is a “Mamicon”, likely a reference to the Nintendo Famicom. From the name alone you will never guess what the Mamicon does; you’ll have to play the game to find out!

Frog Game is referenced in a few other Nintendo games beyond Link’s Awakening, including an Assist Trophy and Single-Player Challenge in Super Smash Bros. Mad Scienstein from Nantendo Inc. cameos in Wario Land 3, Wario Land 4, and Dr. Mario 64.

Gameplay

The rumors about Link’s Awakening sharing an engine with Frog Game likely come from using a mix of top-down and side-scrolling platformer perspectives. Frog Game uses the top-down perspective when exploring the world map or different towns and then switches to side-scrolling when in dungeons or the castle. Folks who have dug into the assembly of both games are fairly sure the two games don’t share an engine, meaning the rumors are unlikely to be true. Still a fun story :)

Despite appearing to be a traditional RPG with stats like Health, Attack, and Speed, and the ability to upgrade your equipment, this game does not play like most RPGs. There are no tactics in combat beyond running away from a battle or using an item, which for most of the story means healing with Wine. Battles proceed automatically in a cloud of dust and consistently resolve as either a victory or a defeat.

Combat and stats are used to limit progression with difficult “boss enemies” until you’ve discovered or unlocked every new stat upgrade in an area. Stat upgrades are given out like any other item: hidden in chests or as a reward for defeating an enemy. You can’t increase your stats on your own through “experience points” or “leveling up”, meaning the game is in control of how strong you are.

The “illusion of control” is my favorite design choice of Frog Game, and it goes beyond just combat and items, too. There are many points in the game where, without you even noticing, the game has set you on a “one-way track” where your combat ability, health, and resources are exactly managed to produce an outcome later in the story. It’s fun trying to break the flow and seeing how the game responds!

Factions

The universe of Frog Game has multiple kingdoms and three factions: humans, frogs and snakes.

Frogs are afraid of snakes, as snakes will actively pursue frogs as prey, but frogs and humans are either neutral or friendly towards each other. The antagonist, Lord Delarin, leads the “Croakian Army”, an army of soldiers who are friendly towards frogs but hostile towards snakes and towards humans of other kingdoms. Humans, frogs, and snakes can only converse with members of their own group, and this “information asymmetry” is used throughout the story.

Prince Richard, Prince Sablé, and the Custard Kingdom army are all “cursed” by Mandola the witch, transforming them into frogs. Prince Sablé eventually gains the ability to transform into a frog, snake, and human somewhat at-will from Mandola through additional “curses”. These curses end up being instrumental to your success, similar to the “curses” from Black Chests in Paper Mario or Li’l Devils from Link’s Awakening.

Story

The story of Frog Game after the initial few chapters is quite light. You’re trying to accomplish the main goal, which is to defeat Delarin and find Princess Tiramisu, but a lot of that happens in the background. The bulk of the story is solving your minute-to-minute troubles, caused either by your short-sightedness or by the Croakian army. You don’t meet Delarin until the very end, and despite a few twists, the Princess does not escape her fate. At the end of the day it’s a Game Boy game, so the expectations for the story are not high.



Thanks for keeping RSS alive! ♥

April 30, 2026 12:00 AM UTC

April 29, 2026


PyCharm

Using Bag-of-Words With PyCharm

Have you ever wondered how machine learning models actually work with text? After all, these models require numerical input, but text is, well, text.

Natural language processing (NLP) offers many ways to bridge this gap, from the large language models (LLMs) that are dominating headlines today all the way back to the foundational techniques of the 1950s. Those early methods fall under what we now call the bag-of-words (BoW) model, and despite their age, they remain remarkably effective for a wide range of language problems.

In this post, we’ll unpack how the bag-of-words model works, explore the techniques it uses to convert text into numerical representations, and look at where it fits relative to more modern NLP approaches. We’ll also build a text classification project using BoW techniques, and see how PyCharm’s specific features make the whole process faster and easier.

What is the bag-of-words model?

The bag-of-words model is a text representation technique that converts unstructured text into numerical vectors by tracking which words appear across a corpus (a collection of texts). Rather than preserving grammar or word order, it simply represents each document as a “bag” of its words, recording how often each one appears. The result is a vector of counts that captures what a text is about, even if it discards how that content is expressed.

This apparent limitation turns out to matter less than you might expect. For many tasks, such as text classification and sentiment analysis, the presence of certain words is often a stronger signal than their arrangement, and BoW captures that signal efficiently.

How does bag-of-words work?

To use the bag-of-words model, we need to convert each text in a corpus into a numerical vector. Let’s walk through how that works, starting with what that vector actually looks like.

Take the following sentence:

When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.

A vector representation of this text using the BoW model might look something like this.

natural | naturally | nausea | near | neared | nearing | necessary | negative
--------|-----------|--------|------|--------|---------|-----------|---------
   2    |     1     |   0    |  0   |   0    |    0    |     0     |    1

If we think of this vector as a table, each column represents a word in the corpus, and each cell holds a count of how many times that word occurs in the text, as we can see below:

When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.

Each column represents a word in the vocabulary; each value records how many times that word appears. Here, “natural” appears twice, while “naturally” and “negative” each appear once.

Tokenization

Before we can build this vector, we need to split our text into tokens. In BoW modeling, this is typically straightforward: We split on whitespace and separate out punctuation, so “When diving into natural language processing,” becomes seven tokens: ["When", "diving", "into", "natural", "language", "processing", ","]. This is considerably simpler than the tokenization used in LLMs.

Vocabulary creation

Applying tokenization across every text in the corpus produces a long list of words. Deduplicating this list gives us our vocabulary, which we can see in the set of columns in the vector above. This process does introduce some noise: “Natural” and “natural”, for instance, would be treated as two separate tokens. We’ll look at preprocessing steps to address this shortly.

Encoding

With a vocabulary in hand, we create a vector for each text with one element per vocabulary word. Encoding is then the process of filling in those elements by checking each vocabulary word against the text.

The simplest approach is binary vectorization: 0 if a word is absent, 1 if present. More common, however, is count vectorization, which records the actual number of occurrences, as we saw in the example above. Count vectorization carries more information, since it helps distinguish texts that merely mention a topic from those that focus on it heavily.

One practical consequence of this approach is sparsity. If a corpus contains thousands of unique words, each vector will have thousands of elements, but any individual text will only use a small fraction of them, leaving most values at zero. This signal-to-noise issue is something we’ll return to.
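
To make the encoding step concrete, here's a minimal sketch using scikit-learn's CountVectorizer on an invented two-document corpus, showing both count and binary vectorization:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "natural language processing is natural",
    "sentiment analysis of negative text",
]

# Count vectorization (the default): each cell records how often a word appears
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Binary vectorization: 1 if a word is present, 0 if absent
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(corpus).toarray())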

Advantages of the bag-of-words model

The bag-of-words model has remained a staple in NLP for good reason. Its greatest strength is its simplicity: Because text is represented as a collection of word counts, the approach is easy to understand and straightforward to implement, making it a natural baseline before reaching for more complex architectures.

Beyond simplicity, BoW is computationally efficient. As you saw above, the underlying math is lightweight, which means it scales well to large text collections without demanding significant computing resources. For tasks where the presence of specific words is sufficient to capture meaning, with sentiment analysis and topic categorization being the clearest examples, it remains a highly effective tool.

Applications of bag-of-words

Like many NLP approaches, the bag-of-words model can be applied to many natural language problems. These potential applications include text classification (such as the news categorization project we’ll build below), sentiment analysis, topic categorization, document similarity and retrieval, and spam filtering.

As you can see, the number of potential applications is broad, making bag-of-words modeling a popular first approach to natural language problems.

Why use PyCharm for NLP?

PyCharm is particularly well-suited to bag-of-words modeling because it supports the iterative, detail-oriented workflow that text processing requires. As you’ll soon see, building a reliable BoW pipeline involves multiple steps, such as cleaning text, tokenizing, vectorizing, and validating outputs, and PyCharm’s code intelligence makes each of these smoother. Autocompletion, parameter hints, and quick navigation through specialized NLP libraries reduce friction when experimenting with different vectorizer settings, and help you understand how each component behaves.

Debugging and data inspection are equally important here, since small preprocessing mistakes can have an outsized effect on results. PyCharm lets you step through your code and examine intermediate states of things such as token lists and vocabulary at runtime, making it straightforward to verify that your feature extraction is working as intended. This visibility is especially useful when diagnosing issues like unexpected vocabulary sizes or missing terms.

PyCharm also supports exploratory work through its excellent Jupyter Notebook integration and scientific tooling. BoW modeling often involves trying different preprocessing strategies and observing their effects immediately, so the ability to run code interactively and inspect outputs inline is a genuine advantage. Combined with built-in virtual environment and package management support, this keeps experiments reproducible and well-organized.

As projects grow, PyCharm’s refactoring tools, project navigation, and version control integration help manage the added complexity. BoW models rarely exist in isolation, and they’re often embedded in larger ML pipelines. In such contexts, PyCharm’s features for working with larger applications mean you spend less time managing code and more time improving your models.

Setting up the project

To see these components in action, let’s build an actual bag-of-words project. We’ll use a classic text classification dataset, AG News, and train a model to classify news articles into one of four categories: World, Sports, Business, or Science/Technology.

To get started in PyCharm, open the Projects and Files tool window and select New… > New Project…. Since this is a data science project, we can use PyCharm’s built-in Jupyter project type, which sets up a sensible default structure for us.

During project configuration, you’ll be asked to choose a Python interpreter. By default, PyCharm uses uv and lets you select from a range of Python versions, though all major dependency management systems are supported: pip, Anaconda, Pipenv, Poetry, and Hatch. Every project is automatically created with an attached virtual environment, so your setup will be ready to go each time you reopen the project.

With the project configured, we can install our dependencies via the Python Packages tool window. Simply search for a package by name, select it from the list, and install your desired version directly into the virtual environment. You can also see the same information about the package you’d find on PyPI directly within the IDE. For this project, we’ll need pandas and NumPy, along with datasets from Hugging Face, scikit-learn, PyTorch, and spaCy.

Implementing bag-of-words with PyCharm

There are many versions of this dataset online. We’ll be using one of the versions hosted on Hugging Face Hub.

Loading and preparing the data

We’ll use Hugging Face’s datasets package to download this dataset.

from datasets import load_dataset
ag_news_all = load_dataset("sh0416/ag_news")

This gives us a Hugging Face DatasetDict object. If we look at it, we can see it contains a training dataset with 120,000 news articles, and a test dataset containing 7,600 articles.

ag_news_all
DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 7600
    })
})

As we’ll be training a model, we also need a validation set. We’ll convert the training and test sets to pandas DataFrames, and use the train_test_split function from scikit-learn to create the validation set from the training data.

import pandas as pd
from sklearn.model_selection import train_test_split

ag_news_train = ag_news_all["train"].to_pandas()
ag_news_test = ag_news_all["test"].to_pandas()

ag_news_train, ag_news_val = train_test_split(
   ag_news_train,
   test_size=0.1,     
   random_state=456,   
   stratify=ag_news_train['label'] 
)

print(f"Training set: {len(ag_news_train)} samples")
print(f"Validation set: {len(ag_news_val)} samples")

We now have a validation set with 12,000 articles, and a training set with 108,000 articles.

Training set: 108000 samples
Validation set: 12000 samples

For those of you new to machine learning, you might be wondering why we need all of these different datasets. The reason is to make sure our model will generalize well and perform as expected on unseen data. The training set is the only data the model ever learns from directly.

The validation set is used to monitor how the model performs on unseen data as we make modeling decisions, such as choosing how many epochs to train for, how large to make the hidden layer, or which preprocessing steps to apply (we’ll see all of this later). Because we look at validation performance repeatedly while building the model, there's a risk that our choices gradually become tuned to the quirks of that particular split.

This is why we need a third set, the test set, which we keep completely locked away until we’ve finished all modeling decisions and want a single, unbiased estimate of how well our model will perform on new data. Using the test set for anything other than this final evaluation would give us an overly optimistic picture of our model’s real-world performance.

Let’s now inspect our datasets. PyCharm Pro has a lot of built-in features that make working with DataFrames easier, a few of which we’ll see soon. In this DataFrame, we have three columns: the article title, the article description, and the label indicating which of the four news categories the article belongs to. You can open any of the DataFrame cells in the Value Editor to see its full text, or widen the column to prevent truncation, both of which are useful for a quick visual inspection.

At the top of each column, PyCharm displays column statistics, giving you an at-a-glance summary of the data. Switching from Compact to Detailed mode via Show Column Statistics gives you rich summary statistics about each column, and saves you from writing a lot of pandas boilerplate to get it! From these statistics, we can see that our training set is evenly split across the news categories (which is very handy when training a model). We can also see that some headlines and descriptions are not unique, which may introduce noise when classifying these duplicates.

The first step in preparing the data is basic string cleaning, which normalizes the text and reduces meaningless token variation. For instance, without cleaning, “Natural” and “natural” would be treated as two separate vocabulary entries, as we noted earlier. 

We’ll apply four cleaning steps: lowercasing, punctuation removal, number removal, and whitespace normalization. There are different string cleaning steps you can apply depending on the language and use case, but for English-language texts, these tend to be very standard. Let’s go ahead and write a function to do this.

def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
   patterns_to_remove = [
       r"[^a-zA-Z\s]",
   ]

   cleaned = dataset.str.lower()

   for pattern in patterns_to_remove:
       cleaned = cleaned.str.replace(pattern, " ", regex=True)

   cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

   return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])

This mostly works, but there’s one issue: The regex strips apostrophes entirely, turning contractions like “you’re” into “you re” and possessives like “Canada’s” into “Canada s”. The cleanest fix is a regex that preserves apostrophes in contractions while removing possessive endings, but this is not the most enjoyable thing to write by hand.

This is where PyCharm’s built-in AI Assistant comes in. Open the chat window via the AI Chat icon on the right-hand side of the IDE and enter the following prompt:

Can you please alter the @apply_string_cleaning function so that it retains apostrophes inside words when they’re used for contractions (e.g., “you’re”), but removes them when they’re used for possessives (e.g., “Canada’s” into “Canada”).

The @ notation lets you reference specific files or objects in your IDE without copying and pasting code into the prompt, including Jupyter variables like datasets and functions.

I ran this against Claude Sonnet 4.5, though JetBrains AI supports a wide range of models from OpenAI, Anthropic, Google, and xAI, as well as open models via Ollama, LM Studio, and OpenAI-compatible APIs. Below is the updated function it returned:

def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    cleaned = dataset.str.lower()
    
    # Remove possessive apostrophes (word's -> word)
    # This pattern matches: letter(s) + 's + word boundary
    cleaned = cleaned.str.replace(r"(\w+)'s\b", r"\1", regex=True)
    
    # Remove all non-letter characters except apostrophes within words
    cleaned = cleaned.str.replace(r"[^a-zA-Z'\s]", " ", regex=True)
    
    # Clean up any apostrophes at the start or end of words
    cleaned = cleaned.str.replace(r"\s'|'\s", " ", regex=True)
    
    # Remove multiple spaces and trim
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    
    return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])

We can insert this into our Jupyter notebook directly by clicking on Insert Snippet as Jupyter Cell in the AI chat.

Once we run this updated function on our raw text, we get the correct result:

text: Don’t stand for racism – football chief
text_clean: don’t stand for racism football chief

text: Canada’s Barrick Gold acquires nine per cent stake in Celtic Resources (Canadian Press)
text_clean: canada barrick gold acquires nine per cent stake in celtic resources canadian press

We can see the contraction “don’t” is correctly preserved in the first example, but the possessive “Canada’s” has been removed. We apply this to both the training and validation datasets using the same function, so that the cleaning is consistent across both splits:

ag_news_val["title_clean"] = apply_string_cleaning(ag_news_val["title"])
ag_news_val["description_clean"] = apply_string_cleaning(ag_news_val["description"])

Creating the bag-of-words model

Now that we have clean text, we need to build our vocabulary and encode it. We’ll use scikit-learn’s CountVectorizer for this:

from sklearn.feature_extraction.text import CountVectorizer

# Combine the cleaned title and description into a single text field
ag_news_train["text_clean"] = ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]

countVectorizerNews = CountVectorizer()
countVectorizerNews.fit(ag_news_train["text_clean"])
ag_news_train_cv = countVectorizerNews.transform(ag_news_train["text_clean"]).toarray()

The process has two distinct steps. First, .fit() scans the training data and builds a vocabulary by identifying every unique word and assigning it a fixed index position (for example, “government” = column 8,901). The result is a mapping of 59,544 unique words, which you can think of as the column headers for our eventual matrix.

Second, .transform() uses that vocabulary to convert each headline into a numerical vector, counting how many times each vocabulary word appears and placing that count at the corresponding index position.

The reason these are two separate steps is important: When we later process our validation and test data, we’ll call .transform() using the vocabulary learned from the training set. This ensures that all three splits share a consistent feature space. If we re-ran .fit() on the test data, we’d get a different vocabulary, and the model’s predictions would be meaningless.

With the vectorizer fitted and our training data transformed, we can start exploring what we’ve actually built. Let’s first take a look at the vocabulary. CountVectorizer stores it as a dictionary mapping each word to its index position, accessible via vocabulary_:

countVectorizerNews.vocabulary_
{'fed': 18461,
 'up': 55833,
 'with': 58324,
 'pension': 38929,
 'defaults': 13156,
 'citing': 9475,
 'failure': 18077,
 'of': 36704,
 'two': 54804,
 'big': 5269,
 'airlines': 1139,
 'to': 53531,
 'make': 31397,
 'payments': 38686,
 'their': 52947,
...}
len(countVectorizerNews.vocabulary_)
59544

This confirms that our vocabulary contains 59,544 unique words. Browsing through it, you can start to guess what kinds of terms appear frequently in the different types of news. Country names feature heavily in the “world” news category, terms like “football” and “cricket” in the “sports” news category, terms like “profit” and “losses” in the “business” news category, and company names like “Google” and “Microsoft” in the “science/technology” category.

Next, let’s inspect the feature matrix itself. ag_news_train_cv is a NumPy array with one row per headline and one column per vocabulary word, giving us a matrix of shape (108,000 × 59,544). We can wrap it in a DataFrame to make it easier to inspect in PyCharm’s DataFrame viewer:

pd.DataFrame(ag_news_train_cv, columns=countVectorizerNews.get_feature_names_out())

As expected, the matrix is very sparse. Most values are zero, since any individual headline only contains a small fraction of the full vocabulary. In fact, you might have noticed that the number of columns is more than half the number of rows, which is never good for a feature matrix. We’ll explore how to reduce the dimensionality of the feature space in a later section.

Note that we also need to apply this vectorization to the validation dataset before moving on to modeling. Importantly, we only call the .transform() method on the validation set, as the vectorizer was already fitted on the training data. We also need the same combined text_clean column on the validation set:

ag_news_val["text_clean"] = ag_news_val["title_clean"] + " " + ag_news_val["description_clean"]
ag_news_val_cv = countVectorizerNews.transform(ag_news_val["text_clean"]).toarray()

Visualizing the results

Before we move on to reducing the dimensionality of our feature space, let’s explore the distribution of the words in our corpus. This can help us understand the most common and rare words, and how we might further process our data to improve the signal-to-noise ratio.

Word frequency plots

We’ll start by creating a DataFrame that aggregates word counts across all headlines and ranks them by frequency:

import numpy as np

vocab = countVectorizerNews.get_feature_names_out()
counts = np.asarray(ag_news_train_cv.sum(axis=0)).flatten()

pd.DataFrame({
  'vocab': vocab,
  'count': counts,
}).sort_values('count', ascending=False).reset_index(drop=True)

First, we retrieve the vocabulary in index order using get_feature_names_out(), so each word lines up with its corresponding column in the feature matrix. We then sum the matrix column-wise (that is, across all documents) to get the total number of times each word appears in the training set. Finally, we wrap these two arrays into a DataFrame and sort by count, giving us a ranked list of the most frequent terms.

Once this DataFrame is displayed in PyCharm, we can easily turn it into a visualization without writing a single line of code. By clicking on the Chart View button in the top left-hand corner of the DataFrame, we can explore a range of ways of visualizing our data. Go to Show Series Settings in the top right-hand corner, and adjust the parameters to get the count frequencies of the words: we set the X axis value to “vocab” (and change group and sort to none), the Y axis value to “count”, and the chart type to “Bar”.

Hovering over this chart, we can see that it has a very long-tailed distribution, which is very typical of vocabulary frequencies (this is actually so typical that it is described in something called Zipf’s law). This means that the majority of our words very rarely occur in the text, and in fact, if we hover over the right-hand side of the chart, it looks like around a third of our vocabulary terms are only used once! 

On the other hand, when we hover over the left-hand side of the chart, we can see that this is dominated by very common words, prepositions, and articles, such as “to”, “in”, “the”, and “you”. These words don’t really carry any meaning and pretty much occur in every text, so they’re unlikely to be useful for our classification task. 

Let’s have a look at some things we can do to clean up our feature space and help our semantically meaningful words stand out a bit more.

Advanced bag-of-words techniques

The basic BoW pipeline we’ve built so far is a solid foundation, but there are several techniques that can meaningfully improve its quality. This section walks through the most important ones. We’ll only be using a selection of them in our project, but you can investigate which of these seem appropriate when building your own project.

Stop word removal

Stop words are extremely common words that appear frequently across all kinds of text but carry little meaningful information. This includes words like “the”, “is”, “and”, “of”, as we saw in the histogram in the previous section. They inflate vocabulary size without adding signal, so removing them is one of the most straightforward ways to improve your BoW representation. NLTK provides a built-in stop word list for English and many other languages.
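
As a minimal sketch of what this looks like in practice (assuming NLTK is installed and its stopwords corpus can be downloaded):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "presence", "of", "certain", "words", "is", "a", "signal"]

# Keep only the tokens that aren't in the stop word list
print([t for t in tokens if t not in stop_words])
# ['presence', 'certain', 'words', 'signal']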

Stemming and lemmatization

Another issue you might have noticed in our vocabulary is that words that are semantically equivalent have different syntactic forms, meaning that while they should be treated as the same token, they occupy additional token slots. We can resolve this through two techniques: stemming and lemmatization. Stemming reduces words to their root form using simple rule-based truncation (e.g. “running” → “run”), while lemmatization takes a linguistic approach, mapping words to their dictionary base form. Lemmatization is slower but generally produces cleaner results, particularly for irregular word forms.
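
Here's a minimal sketch contrasting the two, using NLTK's PorterStemmer for stemming and spaCy for lemmatization (both libraries and the en_core_web_sm model are assumed to be installed; the word list is invented for illustration):

from nltk.stem import PorterStemmer
import spacy

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

words = ["running", "ran", "studies", "better"]

# Stemming: fast, rule-based truncation; can produce non-words like "studi"
print([stemmer.stem(w) for w in words])

# Lemmatization: slower, but maps each token to its dictionary base form
print([token.lemma_ for token in nlp(" ".join(words))])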

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is an extension of basic count vectorization that weights each word by how informative it actually is. A word that appears frequently in one document but rarely across the corpus receives a high weight; a word that appears everywhere receives a low one. This neatly addresses one of the core weaknesses of raw count vectors: common but uninformative words can dominate the feature space even after stop-word removal.
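
As a minimal sketch on an invented toy corpus (with its default settings, scikit-learn smooths the IDF term and L2-normalizes each document vector):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "stocks rise as markets rally",
    "markets fall on weak earnings",
    "team wins the final match",
]

# With the defaults, scikit-learn computes
# idf(t) = ln((1 + n_documents) / (1 + document_frequency(t))) + 1
# and then L2-normalizes each row of the resulting matrix
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))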

N-grams

Standard BoW treats each word independently, which means it misses phrases whose meaning depends on word combinations. A classic example is “machine learning”, which has a meaning distinct from “machine” + “learning”. N-grams address this by treating sequences of adjacent words as single tokens, so a bigram model would capture “machine learning” as a feature in its own right. The trade-off is a much larger vocabulary, so in practice, bigrams are most commonly used, with trigrams reserved for cases where capturing longer phrases is important.
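
A minimal sketch of this using CountVectorizer's ngram_range parameter:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams and adds bigrams as extra features
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(["machine learning is fun"])
print(bigram_vectorizer.get_feature_names_out())
# ['fun', 'is', 'is fun', 'learning', 'learning is', 'machine', 'machine learning']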

Handling out-of-vocabulary words

When you apply your fitted vectorizer to new data, any words not present in the training vocabulary are silently ignored by default. For many tasks, this is acceptable, but if your production data is likely to continue introducing new terms that carry meaningful signal, it’s worth considering alternatives. One common approach is to reserve a special <UNK> token to represent unseen words, which at least preserves the information that something unfamiliar appeared, even if its identity is unknown and multiple (perhaps unrelated) words are collapsed onto the same token. 

However, LLMs, with their more flexible approach to tokenization, tend to be a better choice if out-of-vocabulary words will be a major issue for your model once it is in production.
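
As a toy illustration of the <UNK> idea (the known_vocab set and tokenize_with_unk function are hypothetical names for this sketch, not part of any library):

# Map any token outside the training vocabulary to a shared <UNK> token
known_vocab = {"market", "stocks", "rally"}

def tokenize_with_unk(text: str) -> list[str]:
    return [tok if tok in known_vocab else "<UNK>" for tok in text.lower().split()]

print(tokenize_with_unk("Stocks rally on robot news"))
# ['stocks', 'rally', '<UNK>', '<UNK>', '<UNK>']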

Dimensionality reduction

Even after stop word removal and other cleaning steps, BoW feature matrices are typically very high-dimensional and sparse. Two widely used techniques can help. Reducing to the top-N most frequent terms is the simplest approach, discarding low-frequency words that are unlikely to generalize well. For a more principled reduction, techniques like principal component analysis (PCA) or latent semantic analysis (LSA) project the feature matrix into a lower-dimensional space, compressing the representation while preserving as much of the meaningful variance as possible.
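
Here's a minimal sketch of an LSA-style reduction using scikit-learn's TruncatedSVD, which works directly on sparse count or TF-IDF matrices (the toy corpus is invented):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["stocks rise", "markets rally", "team wins match", "player scores goal"]
X = TfidfVectorizer().fit_transform(corpus)

# Project the sparse TF-IDF matrix down to 2 latent dimensions (LSA)
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 2): each document compressed to 2 dimensions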

Feature selection techniques

Rather than reducing dimensionality arbitrarily, feature selection methods identify and retain only the features most relevant to your specific task. Chi-squared testing measures the statistical dependence between each term and the target label, making it well-suited to classification tasks. Mutual information takes a similar approach, scoring each feature by how much it reduces uncertainty about the class. Both methods can substantially reduce vocabulary size while preserving model performance.
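
As a minimal sketch of chi-squared selection with scikit-learn (the toy corpus and labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["stocks rise", "markets rally", "team wins", "player scores"]
labels = [0, 0, 1, 1]  # toy labels: 0 = business, 1 = sports

X = CountVectorizer().fit_transform(corpus)

# Keep only the 4 terms most statistically dependent on the class label
selector = SelectKBest(chi2, k=4)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)  # (4, 4)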

Applying bag-of-words to a real-world problem

Let’s now continue the example we started earlier. We’re going to take the work we’ve done on our AG News text classification task to completion by building a model.

A common way to model encoded text is with a neural network, where each word in the vocabulary is treated as a feature, and the categories we want to predict (in our case, the news category) are the output. We’ll start by building a baseline model that applies only string cleaning and encoding to the text.

I had originally written this model in Keras as part of a previous BoW project from a couple of years ago. However, that code was now out of date. In order to update it and adapt it to PyTorch, I asked JetBrains AI to do the following:

Please update this neural network from Keras to Pytorch, making improvements to make the code as reusable as possible.

This gave us the following successful port of the code:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

class MulticlassClassificationModel(nn.Module):
   def __init__(self, input_size: int, hidden_layer_size: int, num_classes: int = 4):
       super(MulticlassClassificationModel, self).__init__()
       self.fc1 = nn.Linear(input_size, hidden_layer_size)
       self.relu = nn.ReLU()
       self.fc2 = nn.Linear(hidden_layer_size, num_classes)

   def forward(self, x):
       x = self.fc1(x)
       x = self.relu(x)
       x = self.fc2(x)
       return x

def train_text_classification_model(
       train_features: np.ndarray,
       train_labels: np.ndarray,
       validation_features: np.ndarray,
       validation_labels: np.ndarray,
       input_size: int,
       num_epochs: int,
       hidden_layer_size: int,
       num_classes: int = 4,
       batch_size: int = 1920,
       learning_rate: float = 0.001) -> MulticlassClassificationModel:

   # Convert labels to 0-indexed (AG News has labels 1,2,3,4 -> need 0,1,2,3)
   train_labels_indexed = train_labels - 1
   validation_labels_indexed = validation_labels - 1

   # Convert numpy arrays to PyTorch tensors
   X_train = torch.FloatTensor(train_features.copy())
   y_train = torch.LongTensor(train_labels_indexed.copy())
   X_val = torch.FloatTensor(validation_features.copy())
   y_val = torch.LongTensor(validation_labels_indexed.copy())

   # Create datasets and dataloaders
   train_dataset = TensorDataset(X_train, y_train)
   train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

   # Initialize model, loss function, and optimizer
   model = MulticlassClassificationModel(input_size, hidden_layer_size, num_classes)
   criterion = nn.CrossEntropyLoss()
   optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

   # Training loop
   for epoch in range(num_epochs):
       model.train()
       train_loss = 0.0
       correct_train = 0
       total_train = 0

       for batch_features, batch_labels in train_loader:
           # Forward pass
           outputs = model(batch_features)
           loss = criterion(outputs, batch_labels)

           # Backward pass and optimization
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()

           # Calculate training metrics
           train_loss += loss.item()
           _, predicted = torch.max(outputs, 1)
           correct_train += (predicted == batch_labels).sum().item()
           total_train += batch_labels.size(0)

       # Validation
       model.eval()
       with torch.no_grad():
           val_outputs = model(X_val)
           val_loss = criterion(val_outputs, y_val)
           _, val_predicted = torch.max(val_outputs, 1)
           correct_val = (val_predicted == y_val).sum().item()
           total_val = y_val.size(0)

       # Print epoch metrics
       train_acc = correct_train / total_train
       val_acc = correct_val / total_val
       print(f'Epoch [{epoch+1}/{num_epochs}], '
             f'Train Loss: {train_loss/len(train_loader):.4f}, '
             f'Train Acc: {train_acc:.4f}, '
             f'Val Loss: {val_loss:.4f}, '
             f'Val Acc: {val_acc:.4f}')

   return model

def generate_predictions(model: MulticlassClassificationModel,
                       validation_features: np.ndarray,
                       validation_labels: np.ndarray) -> list:
   model.eval()

   # Convert to tensors
   X_val = torch.FloatTensor(validation_features.copy())

   with torch.no_grad():
       outputs = model(X_val)
       _, predicted = torch.max(outputs, 1)

   # Convert back to 1-indexed labels to match original dataset
   predicted_labels = (predicted.numpy() + 1)

   print("Confusion Matrix:")
   print(pd.crosstab(validation_labels, predicted_labels,
                     rownames=['Actual'], colnames=['Predicted']))
   return predicted_labels.tolist()

Let’s walk through this code step-by-step to understand how we’re going to train our text classifier.

The model architecture

MulticlassClassificationModel is a simple two-layer feedforward neural network. It takes a BoW vector as input, with each feature being a vocabulary word, and passes it through two sequential transformations to produce a prediction. The first layer (fc1) compresses this high-dimensional input down to a smaller intermediate representation, whose size we control via hidden_layer_size. A ReLU activation is then applied, introducing the non-linearity that allows the model to learn patterns a simple weighted sum couldn’t capture. The second layer (fc2) maps this intermediate representation down to four output values, one per news category, where the category with the highest value becomes the model’s prediction.

Training and validation

train_text_classification_model handles the full training loop. It starts with a small amount of housekeeping: The AG News labels run from 1 to 4, but PyTorch expects 0-indexed classes, so these are shifted down by 1. The features and labels are then converted to PyTorch tensors, and a DataLoader is created to feed the training data to the model in batches.

Each epoch, the model processes the training data batch by batch. For each batch, it runs a forward pass to generate predictions, computes the cross-entropy loss against the true labels, and then runs a backward pass to update the model weights via the RMSprop optimizer. At the end of every epoch, the model switches into evaluation mode and runs inference over the full validation set, printing the training and validation loss and accuracy so we can monitor how training is progressing.

Generating predictions

Once training is complete, generate_predictions runs the trained model on a held-out dataset and returns the predicted class for each article. It also prints a confusion matrix, which gives us a breakdown of which categories the model is getting right and where it’s getting confused, which is a much more informative picture than accuracy alone.

Running the baseline

We can now train the baseline model. We pass in the raw count-vectorized training and validation features, specify an input size equal to the vocabulary size (59,544 columns), train for two epochs, and use a hidden layer of 5,000 nodes.

baseline_model = train_text_classification_model(
    ag_news_train_cv,
    ag_news_train["label"].to_numpy(),
    ag_news_val_cv,
    ag_news_val["label"].to_numpy(),
    ag_news_train_cv.shape[1],
    2,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_cv,
    ag_news_val["label"].to_numpy()
)
Epoch [1/2], Train Loss: 0.3553, Train Acc: 0.8813, Val Loss: 0.2307, Val Acc: 0.9243
Epoch [2/2], Train Loss: 0.1217, Train Acc: 0.9587, Val Loss: 0.2352, Val Acc: 0.9240

Confusion Matrix:
Predicted     1     2     3     4
Actual                           
1          2774    65    89    72
2            37  2944     9    10
3           112    20  2694   174
4            97    20   207  2676

Even with the very basic data preparation we did, we can see we’ve performed very well on this prediction task, with around 92% accuracy. The confusion matrix shows that the model seems to have the easiest time distinguishing between category two (sports) and the other topics, and the hardest time distinguishing between category three (business) and category four (science/technology). This makes sense, as the words used to describe sports are very distinct and unlikely to be used in other contexts (things like football), whereas there is likely to be overlapping vocabulary between business and technology (especially company names).

As we saw above, there is a lot we can do to improve the signal-to-noise ratio in BoW modeling. Let’s apply four commonly used techniques to our data and see whether they improve our predictions: lemmatization, stop word removal, limiting our vocabulary to the top N terms, and TF-IDF weighting. As you’ll see, all of these can be done relatively simply using built-in functions from packages such as spaCy and scikit-learn.

Lemmatization

As we discussed earlier, lemmatization collapses inflected word forms into a single vocabulary entry by mapping each word to its dictionary base form, which both shrinks the vocabulary and concentrates the signal for each concept into a single feature. We’ll use spaCy for this, which first requires downloading its small English language model:

!python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

Our lemmatise_text function passes each text through spaCy’s NLP pipeline using nlp.pipe(), which processes them in batches of 1,000 for efficiency. For each document, it extracts the .lemma_ attribute of every token and joins them back into a single string. One small detail worth noting: we preserve the original DataFrame index when constructing the output Series, so that rows stay correctly aligned when we assign the results back.
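
The function itself isn't shown in this walkthrough, but based on that description, a sketch of it might look like this (a reconstruction; the exact implementation may differ):

def lemmatise_text(texts: pd.Series) -> pd.Series:
    texts = texts.fillna("").astype(str)

    lemmatised = []
    # Process texts in batches of 1,000 for efficiency
    for doc in nlp.pipe(texts, batch_size=1000):
        # Join each token's dictionary base form back into a single string
        lemmatised.append(" ".join(token.lemma_ for token in doc))

    # Preserve the original index so rows stay aligned on assignment
    return pd.Series(lemmatised, index=texts.index)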

We apply lemmatization before string cleaning, since spaCy needs the original casing and punctuation to correctly identify grammatical structure. For example, “running” and “Running” lemmatize to the same thing, but removing punctuation first can confuse the parser. Once lemmatized, we pass the output through apply_string_cleaning as before:

ag_news_train["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["title"]))
ag_news_train["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["description"]))

ag_news_val["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["title"]))
ag_news_val["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["description"]))

ag_news_train["text_clean"] = ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]

ag_news_val["text_clean"] = ag_news_val["title_clean"] + " " + ag_news_val["description_clean"]

We apply this separately to the title and description columns before concatenating them into a single text_clean field. As you can see, we do this for both the training and validation sets using the same function, so that lemmatization is applied consistently across both splits.

Removing stop words

As with lemmatization, we covered the motivation for stop word removal earlier: Words like “the”, “is”, and “of” appear so frequently across all texts that they add noise rather than signal to our feature matrix. Here we’ll actually apply it to our data.

def remove_stopwords(texts: pd.Series) -> pd.Series:
   texts = texts.fillna("").astype(str)

   filtered_texts = []
   for doc in nlp.pipe(texts, batch_size=1000):
       filtered_texts.append(
           " ".join(token.text for token in doc if not token.is_stop)
       )

   return pd.Series(filtered_texts, index=texts.index)

Our remove_stopwords function again uses nlp.pipe() to process texts in batches. For each document, it filters out any token where spaCy’s is_stop attribute is True, and joins the remaining tokens back into a string. Conveniently, spaCy handles stop word detection using the same pipeline we already loaded for lemmatization, so no additional setup is needed.

We apply this to the already-cleaned and lemmatized text_clean column for both the training and validation sets, so the stop word removal builds directly on our previous preprocessing steps and is applied consistently across both splits.

ag_news_train["text_no_stopwords"] = remove_stopwords(ag_news_train["text_clean"])
ag_news_val["text_no_stopwords"] = remove_stopwords(ag_news_val["text_clean"])

Top N terms and TF-IDF vectorization

The final two improvements we’ll apply are limiting the vocabulary size and switching from raw count vectorization to TF-IDF weighting. Conveniently, scikit-learn’s TfidfVectorizer handles both in a single step.

Recall from earlier that TF-IDF downweights words that appear frequently across many documents while upweighting words that are distinctive to particular documents. This cleans up uninformative words that don’t quite qualify as stop words but add little useful information to our dataset. The max_features=20000 argument caps the vocabulary at the 20,000 terms with the highest frequency across the corpus, which discards the long tail of rare words that are unlikely to generalize well and brings our feature matrix down to a much more manageable size. (The choice of 20,000 words is arbitrary. We could have easily used a smaller or larger number, depending on our dataset and use case.)

As with CountVectorizer, we fit only on the training data and then use that fixed vocabulary to transform both the training and validation sets:

from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizerNews = TfidfVectorizer(max_features=20000)
TfidfVectorizerNews.fit(ag_news_train["text_no_stopwords"])

ag_news_train_tfidf = TfidfVectorizerNews.transform(ag_news_train["text_no_stopwords"]).toarray()
ag_news_val_tfidf = TfidfVectorizerNews.transform(ag_news_val["text_no_stopwords"]).toarray()

We can inspect the resulting vocabulary and feature matrix exactly as we did before:

TfidfVectorizerNews.vocabulary_
{'fed': np.int64(6243),
 'pension': np.int64(13134),
 'default': np.int64(4469),
 'cite': np.int64(3200),
 'failure': np.int64(6109),
 'big': np.int64(1787),
 'airline': np.int64(401),
 'payment': np.int64(13051),
 'plan': np.int64(13424),
 'government': np.int64(7306),
 'official': np.int64(12453),
 'tuesday': np.int64(18437),
 'congress': np.int64(3691),
 'hard': np.int64(7689),
 'corporation': np.int64(3901),
...}
pd.DataFrame(ag_news_train_tfidf, columns=TfidfVectorizerNews.get_feature_names_out())

Compared to our baseline feature matrix of 59,544 columns filled almost entirely with zeros, this is considerably leaner. We now have 20,000 columns of weighted scores that better reflect each word’s actual importance to the document it appears in. It is still relatively sparse, but we can see from both the feature matrix and the vocabulary list that it is much more focused on semantically rich words.

Fitting the revised model

With our improved features in hand, we can now retrain the model. The call is identical to before, except we pass in the TF-IDF feature matrices instead of the raw count vectors, and the input size is now 20,000 rather than 59,544:

baseline_model = train_text_classification_model(
    ag_news_train_tfidf,
    ag_news_train["label"].to_numpy(),
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy(),
    ag_news_train_tfidf.shape[1],
    2,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy()
)
Epoch [1/2], Train Loss: 0.3183, Train Acc: 0.8932, Val Loss: 0.2301, Val Acc: 0.9225
Epoch [2/2], Train Loss: 0.1512, Train Acc: 0.9475, Val Loss: 0.2332, Val Acc: 0.9243
Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual                           
1          2703    71   121   105
2            20  2955    13    12
3            68    19  2691   222
4            77    17   163  2743

The results are actually very encouraging! Our overall validation accuracy is essentially unchanged at around 92%, but we’ve achieved this with a feature matrix about a third of the size. This suggests that the extra vocabulary in the baseline (including the stop words) was contributing noise rather than signal. Reducing the size of the feature matrix makes our model more stable, less prone to overfitting, and much more manageable to deploy.

Looking at the confusion matrix, the pattern of errors is similar to before: Sports (category two) is the easiest category to classify, with 98.5% accuracy, while Business (category three) and Science/Technology (category four) remain the hardest to separate, with 5–7% of the articles in each category being misclassified as the other. This is consistent with what we saw in the baseline, so it seems that the preprocessing improvements have tightened things up at the margins, but the fundamental difficulty of the Business/Technology boundary is a property of the data rather than the feature representation.

Applying our model to the test set

Finally, we need to validate that our model performs as well on the test set as it does on the validation set. Up to this point, we’ve deliberately kept the test set locked away. As mentioned earlier, if we had been making modeling decisions based on test set performance, we’d risk inadvertently overfitting our choices to it, and our final accuracy estimate would be optimistic.

The preprocessing steps must be applied in exactly the same order as for the training and validation data: lemmatization, string cleaning, concatenation of title and description, and stop-word removal. Crucially, we also call .transform() rather than .fit_transform() on the test text, using the vocabulary learned from the training data:

ag_news_test["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["title"]))
ag_news_test["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["description"]))
ag_news_test["text_clean"] = ag_news_test["title_clean"] + " " + ag_news_test["description_clean"]
ag_news_test["text_no_stopwords"] = remove_stopwords(ag_news_test["text_clean"])

# transform(), not fit_transform(): reuse the vocabulary learned on the training data
ag_news_test_tfidf = TfidfVectorizerNews.transform(ag_news_test["text_no_stopwords"]).toarray()

We can then generate predictions and evaluate accuracy on the test set:

test_predictions = generate_predictions(
    baseline_model,
    ag_news_test_tfidf,
    ag_news_test["label"].to_numpy()
)

test_accuracy = accuracy_score(ag_news_test["label"].to_numpy(), test_predictions)
print(f"Test Accuracy: {test_accuracy:.4f}")
Test Accuracy: 0.9183

Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual                           
1          1710    54    78    58
2            13  1870    10     7
3            51    12  1676   161
4            53     9   115  1723

The test accuracy of 91.8% is very close to the 92.4% we saw on the validation set, a reassuring sign that our model has generalized well rather than overfitting to the validation data. The confusion matrix tells the same story as before: Sports (category two) remains the easiest category to classify, with only 30 misclassified articles out of 1,900, while the Business/Technology boundary continues to be the main source of errors: 8.5% of Business articles are misclassified as Science/Technology, and 6.1% in the opposite direction. The consistency between validation and test results gives us confidence that these error patterns reflect genuine properties of the data rather than artifacts of any particular split.

Limitations and alternatives

Loses word order information

The most fundamental limitation of the bag-of-words model is right there in the name: it treats text as an unordered collection of words, discarding all sequence information. This means “the dog bit the man” and “the man bit the dog” produce identical vectors, even though they describe very different events. For many classification tasks, this doesn’t matter much, but for tasks that require understanding the relationship between words, such as question answering or natural language inference, the loss of word order is a serious handicap.
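
You can see this for yourself with scikit-learn. The following standalone sketch (a toy example, separate from our pipeline) vectorizes both sentences and confirms that their bag-of-words representations are indistinguishable:

from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with identical words in a different order
docs = ["the dog bit the man", "the man bit the dog"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())  # ['bit' 'dog' 'man' 'the']
print(vectors[0])  # [1 1 1 2]
print(vectors[1])  # [1 1 1 2] -- identical, so no model can tell them apart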

Ignores semantics and context

BoW has no notion of word meaning or context. Each word is simply a column in a matrix, entirely independent of every other word. This creates two related problems. First, synonyms are treated as completely distinct features: “cheap” and “inexpensive” contribute nothing to each other’s signal, even though they mean the same thing. Second, words with multiple meanings are treated as a single feature regardless of context: “bank” means the same thing whether it appears in a sentence about rivers or finance. Both of these issues limit how well BoW representations can capture the actual semantics of a text.
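
A tiny standalone example makes the synonym problem concrete: “cheap” and “inexpensive” land in separate columns that share no signal at all:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["cheap flights", "inexpensive flights"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())  # ['cheap' 'flights' 'inexpensive']
print(vectors)
# [[1 1 0]
#  [0 1 1]]
# The synonyms occupy unrelated columns: to the model, they are as different
# from each other as they are from 'flights'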

Can result in large, sparse vectors

As we saw in our own example, even a moderately sized corpus of news headlines can produce a vocabulary of nearly 60,000 unique terms. The resulting feature matrix has one column per vocabulary word, but any individual document only uses a tiny fraction of them, leaving the vast majority of values at zero. This sparsity creates two practical problems: The matrices can consume a large amount of memory if stored densely, and the high dimensionality can make it harder for models to find meaningful patterns, a phenomenon sometimes called the curse of dimensionality.
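
To put rough numbers on this, here’s a small standalone sketch (the shapes are illustrative assumptions, not our actual corpus) comparing a dense NumPy matrix against SciPy’s compressed sparse row format:

import numpy as np
from scipy.sparse import csr_matrix

# Toy matrix with BoW-like sparsity: one nonzero entry per row,
# spread across a 60,000-column vocabulary (shapes assumed for illustration)
rng = np.random.default_rng(0)
dense = np.zeros((100, 60_000))
dense[np.arange(100), rng.integers(0, 60_000, size=100)] = 1.0

sparse = csr_matrix(dense)
print(f"Dense:  {dense.nbytes / 1e6:.0f} MB")  # ~48 MB for just 100 documents
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"Sparse: {sparse_bytes / 1e3:.0f} KB")  # roughly 2 KB for the same data

This is why scikit-learn’s vectorizers return sparse matrices by default, and why converting to a dense array should be a deliberate choice rather than a reflex.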

Alternatives

If BoW’s limitations are a bottleneck for your task, there are several well-established alternatives worth considering. N-grams extend BoW by counting short word sequences rather than single tokens, recovering some local word order at the cost of an even larger vocabulary (see the sketch below). Word embeddings such as Word2Vec and GloVe map words to dense, low-dimensional vectors in which synonyms sit close together, addressing both the semantic blindness and the sparsity of BoW. And contextual models such as BERT produce a different representation for the same word depending on its surroundings, resolving ambiguities like our “bank” example that no single-column-per-word scheme can.
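
As a taste of the simplest upgrade, this standalone sketch uses CountVectorizer’s ngram_range parameter to count word pairs alongside single words, which recovers enough local order to tell apart the “dog bit man” sentences from earlier:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog bit the man", "the man bit the dog"]

# ngram_range=(1, 2) counts unigrams and bigrams together
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectors = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())
# Bigrams such as 'dog bit' and 'man bit' now appear as separate features,
# so the two sentences no longer map to the same vector
print((vectors[0] == vectors[1]).all())  # False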

For tasks where BoW already performs well, as we saw here with AG News, the added complexity of these approaches may not be worth the cost. BoW remains a strong baseline, and it’s always worth establishing how far it can take you before reaching for heavier machinery.

Get started with PyCharm today

In this post, we’ve covered a lot of ground: from the fundamentals of the bag-of-words model and how it converts text into numerical vectors, through to building and iteratively improving a real text classification pipeline on the AG News dataset. Along the way, we’ve seen how preprocessing steps like lemmatization, stop word removal, vocabulary capping, and TF-IDF weighting can meaningfully improve the efficiency of your feature representation, and how PyCharm’s DataFrame viewer, column statistics, chart view, and AI Assistant make each of these steps faster and easier to inspect and debug.

If you’d like to try this yourself, PyCharm Pro comes with a 30-day trial. As we saw in this tutorial, its built-in support for Jupyter notebooks, virtual environments, and scientific libraries means you can go from a blank project to a working NLP pipeline with minimal setup friction, leaving you free to focus on the fun parts. 

You can find the full code for this project on GitHub. If you’re interested in exploring more NLP topics, check out our other recent blog posts.

April 29, 2026 05:42 PM UTC


PyCon

PyCon US 2026: Call for Volunteers

Looking to make a meaningful contribution to the Python community? Look no further than PyCon US 2026! Whether you're a seasoned Python pro or a newcomer looking to get involved, there's a volunteer opportunity that's perfect for you.

Sign-up for volunteer roles is done directly through the PyCon US website. This way, you can view and manage shifts you sign up for through your personal dashboard! You can read up on the different roles to volunteer for and how to sign up on the PyCon US website.

PyCon US is largely organized and run by volunteers. Every year, we need to fill over 300 onsite volunteer hours to ensure everything runs smoothly at the event. And the best part? You don't need to commit a lot of time to make a difference: some shifts are as short as 45 minutes! You can sign up for as many or as few shifts as you’d like. Even a couple of hours of your time can go a long way in helping us create an amazing experience for attendees.

Keep in mind that you need to be registered for the conference to sign up for a volunteer role.

One important way to get involved is to sign up as a Session Chair or Session Runner. This is an excellent opportunity to meet and interact with speakers while helping to ensure that sessions run smoothly. And who knows, you might just learn something new along the way. :) If you’re looking for an important yet simple-to-learn role, you may be just the person we’ve been looking for!

If you do sign up for one of these roles, please do your best to avoid canceling (or, worse, not showing up) so that we can ensure coverage for all the necessary time slots. You can sign up for these roles directly on the Talks schedule: click the [+ Volunteer] button in the talk slot for the session of your choice.

Volunteer your time at PyCon US 2026 and you’ll be part of a fantastic community that's passionate about Python programming. You can help us make this year's conference a huge success while connecting with your fellow event attendees. It’s especially great for first-timers looking to get the most out of PyCon US. Sign up today for the shifts that call to you and join the fun!

April 29, 2026 02:00 PM UTC