Planet Python
Last update: April 30, 2026 01:45 PM UTC
April 30, 2026
Real Python
Quiz: Using Python for Data Analysis
In this quiz, you’ll test your understanding of Using Python for Data Analysis.
By working through this quiz, you’ll revisit the stages of a data analysis workflow, including cleansing raw data with pandas, spotting outliers and typos, and using regression to find relationships between variables.
EuroPython
EuroPython 2026: Ticket Sales Now Open
Hey hey, folks 👋
Get ready for EuroPython 2026: the conference for all things Python, Data Science, and AI!
We’ve got an exciting week planned:
- Tutorials (13–14 July, Mon–Tue) 🛠️
- Conference Days (15–17 July, Wed–Fri) 🎤
- Sprint Weekend (18–19 July, Sat–Sun) 🚀
We have a special keynote this year: Łukasz Langa and Pablo Galindo Salgado will be recording the core.py podcast right on the conference stage. It will feature their special guest Guido van Rossum, the creator of Python.
Ticket sales for EuroPython 2026 are now open! People who’ve been to EuroPython will tell you that it’s more than just talks and tutorials: it’s a time when the entire community comes together, regardless of experience level or background. Each conference leads to new friends being made, projects gaining new contributors, and even people securing their next job. We want you all to be a part of it 💚
🎫 Grab your ticket before they sell out:
Can’t wait to see you all in Kraków and hang out with the Python crowd again 🐍💚
Cheers,
The EuroPython 2026 Organisers ✨
April 29, 2026
PyCharm
Using Bag-of-Words With PyCharm
Have you ever wondered how machine learning models actually work with text? After all, these models require numerical input, but text is, well, text.
Natural language processing (NLP) offers many ways to bridge this gap, from the large language models (LLMs) that are dominating headlines today all the way back to the foundational techniques of the 1950s. Those early methods fall under what we now call the bag-of-words (BoW) model, and despite their age, they remain remarkably effective for a wide range of language problems.
In this post, we’ll unpack how the bag-of-words model works, explore the techniques it uses to convert text into numerical representations, and look at where it fits relative to more modern NLP approaches. We’ll also build a text classification project using BoW techniques, and see how PyCharm’s specific features make the whole process faster and easier.
What is the bag-of-words model?
The bag-of-words model is a text representation technique that converts unstructured text into numerical vectors by tracking which words appear across a corpus (a collection of texts). Rather than preserving grammar or word order, it simply represents each document as a “bag” of its words, recording how often each one appears. The result is a vector of counts that captures what a text is about, even if it discards how that content is expressed.
This apparent limitation turns out to matter less than you might expect. For many tasks, such as text classification and sentiment analysis, the presence of certain words is often a stronger signal than their arrangement, and BoW captures that signal efficiently.
How does bag-of-words work?
To use the bag-of-words model, we need to convert each text in a corpus into a numerical vector. Let’s walk through how that works, starting with what that vector actually looks like.
Take the following sentence:
When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.
A vector representation of this text using the BoW model might look something like this:
| … | natural | naturally | nausea | near | neared | nearing | necessary | negative | … |
| … | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | … |
If we think of this vector as a table, each column represents a word in the corpus, and the row contains a number from 0 to 2. This number is a count of how many times that word occurs in the text, as we can see below:
When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.
Each column represents a word in the vocabulary; each value records how many times that word appears. Here, “natural” appears twice, while “naturally” and “negative” each appear once.
Tokenization
Before we can build this vector, we need to split our text into tokens. In BoW modeling, this is typically straightforward: We split on whitespace and punctuation boundaries, so “When diving into natural language processing,” becomes seven tokens: ["When", "diving", "into", "natural", "language", "processing", ","]. This is considerably simpler than the subword tokenization used in LLMs.
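One way to reproduce this behavior in code (a toy sketch, not taken from the post) is a regex that captures runs of word characters and individual punctuation marks:

import re

text = "When diving into natural language processing,"

# Keep words and punctuation marks as separate tokens
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['When', 'diving', 'into', 'natural', 'language', 'processing', ',']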
Vocabulary creation
Applying tokenization across every text in the corpus produces a long list of words. Deduplicating this list gives us our vocabulary, which we can see in the set of columns in the vector above. This process does introduce some noise: “Natural” and “natural”, for instance, would be treated as two separate tokens. We’ll look at preprocessing steps to address this shortly.
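As a toy illustration of this step (the corpus here is invented, not the post's data), deduplicating tokens across texts is a one-liner with a set, and it also shows the casing noise just mentioned:

corpus = [
    "Natural language processing is fun",
    "natural language models process text",
]

# Tokenize each text and deduplicate across the whole corpus
vocabulary = sorted({token for text in corpus for token in text.split()})
print(vocabulary)
# ['Natural', 'fun', 'is', 'language', 'models', 'natural', 'process', 'processing', 'text']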
Encoding
With a vocabulary in hand, we create a vector for each text with one element per vocabulary word. Encoding is then the process of filling in those elements by checking each vocabulary word against the text.
The simplest approach is binary vectorization: 0 if a word is absent, 1 if present. More common, however, is count vectorization, which records the actual number of occurrences, as we saw in the example above. Count vectorization carries more information, since it helps distinguish texts that merely mention a topic from those that focus on it heavily.
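Both encodings are available in scikit-learn's CountVectorizer, which the post uses later; a minimal sketch of the difference:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["python is great and python is fast"]

# Count vectorization records occurrences; binary=True records only presence
counts = CountVectorizer().fit(docs)
binary = CountVectorizer(binary=True).fit(docs)

print(counts.transform(docs).toarray())  # [[1 1 1 2 2]]
print(binary.transform(docs).toarray())  # [[1 1 1 1 1]]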
One practical consequence of this approach is sparsity. If a corpus contains thousands of unique words, each vector will have thousands of elements, but any individual text will only use a small fraction of them, leaving most values at zero. This sparsity, and the signal-to-noise problems it creates, is something we’ll return to.
Advantages of the bag-of-words model
The bag-of-words model has remained a staple in NLP for good reason. Its greatest strength is its simplicity: Because text is represented as a collection of word counts, the approach is easy to understand and straightforward to implement, making it a natural baseline before reaching for more complex architectures.
Beyond simplicity, BoW is computationally efficient. As you saw above, the underlying math is lightweight, which means it scales well to large text collections without demanding significant computing resources. For tasks where the presence of specific words is sufficient to capture meaning, with sentiment analysis and topic categorization being the clearest examples, it remains a highly effective tool.
Applications of bag-of-words
Like many NLP approaches, the bag-of-words model can be applied to many natural language problems. These potential applications include:
- Document classification, where encoded documents are sorted into predefined categories. A classic example is automatically sorting incoming news articles into distinct categories such as sports, politics, or technology, as we’ll see in the project we build in this post.
- Sentiment analysis, where the presence of certain words strongly indicates the overall tone of a text, allowing models to determine whether a piece of writing expresses a positive, negative, or neutral sentiment. If you’re interested in learning more about BoW and other approaches to sentiment analysis, you can see a prior blog post I wrote on this topic.
- Spam detection, which relies heavily on BoW to identify and filter out unwanted emails or messages by learning to recognize the distinct, high-frequency word patterns characteristic of spam.
- Retrieval systems, where it helps to efficiently find the most relevant documents in an immense corpus based on a user’s search query.
- Topic modeling, which aims to group similar text vectors in order to discover and extract the hidden, latent topics present within a large collection of documents.
As you can see, the number of potential applications is broad, making bag-of-words modeling a popular first approach to natural language problems.
Why use PyCharm for NLP?
PyCharm is particularly well-suited to bag-of-words modeling because it supports the iterative, detail-oriented workflow that text processing requires. As you’ll soon see, building a reliable BoW pipeline involves multiple steps, such as cleaning text, tokenizing, vectorizing, and validating outputs, and PyCharm’s code intelligence makes each of these smoother. Autocompletion, parameter hints, and quick navigation through specialized NLP libraries reduce friction when experimenting with different vectorizer settings, and help you understand how each component behaves.
Debugging and data inspection are equally important here, since small preprocessing mistakes can have an outsized effect on results. PyCharm lets you step through your code and examine intermediate states of things such as token lists and vocabulary at runtime, making it straightforward to verify that your feature extraction is working as intended. This visibility is especially useful when diagnosing issues like unexpected vocabulary sizes or missing terms.
PyCharm also supports exploratory work through its excellent Jupyter Notebook integration and scientific tooling. BoW modeling often involves trying different preprocessing strategies and observing their effects immediately, so the ability to run code interactively and inspect outputs inline is a genuine advantage. Combined with built-in virtual environment and package management support, this keeps experiments reproducible and well-organized.
As projects grow, PyCharm’s refactoring tools, project navigation, and version control integration help manage the added complexity. BoW models rarely exist in isolation, and they’re often embedded in larger ML pipelines. In such contexts, PyCharm’s features for working with larger applications mean you spend less time managing code and more time improving your models.
Setting up the project
To see these components in action, let’s build an actual bag-of-words project. We’ll use a classic text classification dataset, the AG News dataset, and then use the model to classify news articles into one of four categories: World, Sports, Business, or Science/Technology.
To get started in PyCharm, open the Projects and Files tool window and select New… > New Project…. Since this is a data science project, we can use PyCharm’s built-in Jupyter project type, which sets up a sensible default structure for us.
During project configuration, you’ll be asked to choose a Python interpreter. By default, PyCharm uses uv and lets you select from a range of Python versions, though all major dependency management systems are supported: pip, Anaconda, Pipenv, Poetry, and Hatch. Every project is automatically created with an attached virtual environment, so your setup will be ready to go each time you reopen the project.
With the project configured, we can install our dependencies via the Python Packages tool window. Simply search for a package by name, select it from the list, and install your desired version directly into the virtual environment. You can also see the same information about the package you’d find on PyPI directly within the IDE. For this project, we’ll need pandas and NumPy, along with Hugging Face’s datasets, scikit-learn, PyTorch, and spaCy.
Implementing bag-of-words with PyCharm
There are many versions of this dataset online. We’ll be using one of the versions hosted on Hugging Face Hub.
Loading and preparing the data
We’ll use Hugging Face’s datasets package to download this dataset.
from datasets import load_dataset
ag_news_all = load_dataset("sh0416/ag_news")
This gives us a Hugging Face DatasetDict object. If we look at it, we can see it contains a training dataset with 120,000 news articles, and a test dataset containing 7,600 articles.
ag_news_all
DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 7600
    })
})
As we’ll be training a model, we also need a validation set. We’ll convert the training and test sets to pandas DataFrames, and use the train_test_split function from scikit-learn to create the validation set from the training data.
import pandas as pd
from sklearn.model_selection import train_test_split
ag_news_train = ag_news_all["train"].to_pandas()
ag_news_test = ag_news_all["test"].to_pandas()
ag_news_train, ag_news_val = train_test_split(
    ag_news_train,
    test_size=0.1,
    random_state=456,
    stratify=ag_news_train['label']
)
print(f"Training set: {len(ag_news_train)} samples")
print(f"Validation set: {len(ag_news_val)} samples")
We now have a validation set with 12,000 articles, and a training set with 108,000 articles.
Training set: 108000 samples
Validation set: 12000 samples
For those of you new to machine learning, you might be wondering why we need all of these different datasets. The reason is to give us confidence that our model will generalize well and perform as expected on unseen data. The training set is the only data the model ever learns from directly. The validation set is used to monitor how the model performs on unseen data as we make modeling decisions, such as choosing how many epochs to train for, how large to make the hidden layer, or which preprocessing steps to apply (we’ll see all of this later). Because we look at validation performance repeatedly while building the model, there is a risk that our choices gradually become tuned to the quirks of that particular split. This is why we need a third set (the test set), which we keep completely locked away until we’ve finished all modeling decisions and want a single, unbiased estimate of how well our model will perform on new data. Using the test set for anything other than this final evaluation would give us an overly optimistic picture of our model’s real-world performance.
Let’s now inspect our datasets. PyCharm Pro has a lot of built-in features that make working with DataFrames easier, a few of which we’ll see soon. In this DataFrame, we have three columns: the article title, the article description, and the label indicating which of the four news categories the article belongs to. You can open any of the DataFrame cells in the Value Editor to see its full text, or widen the column to prevent truncation, both of which are useful for a quick visual inspection.
At the top of each column, PyCharm displays column statistics, giving you an at-a-glance summary of the data. Switching from Compact to Detailed mode via Show Column Statistics gives you rich summary statistics about each column, and saves you from writing a lot of pandas boilerplate to get it! From these statistics, we can see that our training set is evenly split across the news categories (which is very handy when training a model). We can also see that some headlines and descriptions are not unique, which may introduce noise when classifying these duplicates.
The first step in preparing the data is basic string cleaning, which normalizes the text and reduces meaningless token variation. For instance, without cleaning, “Natural” and “natural” would be treated as two separate vocabulary entries, as we noted earlier.
We’ll apply four cleaning steps: lowercasing, punctuation removal, number removal, and whitespace normalization. There are different string cleaning steps you can apply depending on the language and use case, but for English-language texts, these tend to be very standard. Let’s go ahead and write a function to do this.
def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    patterns_to_remove = [
        r"[^a-zA-Z\s]",
    ]
    cleaned = dataset.str.lower()
    for pattern in patterns_to_remove:
        cleaned = cleaned.str.replace(pattern, " ", regex=True)
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    return cleaned
ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])
This mostly works, but there’s one issue: The regex strips apostrophes entirely, turning contractions like “you’re” into “you re” and possessives like “Canada’s” into “Canada s”. The cleanest fix is a regex that preserves apostrophes in contractions while removing possessive endings, but this is not the most enjoyable thing to write by hand.
This is where PyCharm’s built-in AI Assistant comes in. Open the chat window via the AI Chat icon on the right-hand side of the IDE and enter the following prompt:
Can you please alter the @apply_string_cleaning function so that it retains apostrophes inside words when they’re used for contractions (e.g., “you’re”), but removes them when they’re used for possessives (e.g., “Canada’s” becomes “Canada”).
The @ notation lets you reference specific files or objects in your IDE without copying and pasting code into the prompt, including Jupyter variables like datasets and functions.
I ran this against Claude Sonnet 4.5, though JetBrains AI supports a wide range of models from OpenAI, Anthropic, Google, and xAI, as well as open models via Ollama, LM Studio, and OpenAI-compatible APIs. Below is the updated function it returned:
def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    cleaned = dataset.str.lower()
    # Remove possessive apostrophes (word's -> word)
    # This pattern matches: letter(s) + 's + word boundary
    cleaned = cleaned.str.replace(r"(\w+)'s\b", r"\1", regex=True)
    # Remove all non-letter characters except apostrophes within words
    cleaned = cleaned.str.replace(r"[^a-zA-Z'\s]", " ", regex=True)
    # Clean up any apostrophes at the start or end of words
    cleaned = cleaned.str.replace(r"\s'|'\s", " ", regex=True)
    # Remove multiple spaces and trim
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    return cleaned
ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])
We can insert this into our Jupyter notebook directly by clicking on Insert Snippet as Jupyter Cell in the AI chat.
Once we run this updated function on our raw text, we get the correct result:
| text | text_clean |
| Don’t stand for racism – football chief | don’t stand for racism football chief |
| Canada’s Barrick Gold acquires nine per cent stake in Celtic Resources (Canadian Press) | canada barrick gold acquires nine per cent stake in celtic resources canadian press |
We can see the contraction “don’t” is correctly preserved in the first example, but the possessive in “Canada’s” has been removed. We apply this to both the training and validation datasets using the same function, so that the cleaning is consistent across both splits:
ag_news_val["title_clean"] = apply_string_cleaning(ag_news_val["title"])
ag_news_val["description_clean"] = apply_string_cleaning(ag_news_val["description"])
Creating the bag-of-words model
Now that we have clean text, we need to build our vocabulary and encode it. We’ll use scikit-learn’s CountVectorizer for this:
from sklearn.feature_extraction.text import CountVectorizer

# Combine the cleaned title and description into the field we'll vectorize
ag_news_train["text_clean"] = ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]

countVectorizerNews = CountVectorizer()
countVectorizerNews.fit(ag_news_train["text_clean"])
ag_news_train_cv = countVectorizerNews.transform(ag_news_train["text_clean"]).toarray()
The process has two distinct steps. First, .fit() scans the training data and builds a vocabulary by identifying every unique word and assigning it a fixed index position (for example, “government” = column 8,901). The result is a mapping of 59,544 unique words, which you can think of as the column headers for our eventual matrix.
Second, .transform() uses that vocabulary to convert each headline into a numerical vector, counting how many times each vocabulary word appears and placing that count at the corresponding index position.
The reason these are two separate steps is important: When we later process our validation and test data, we’ll call .transform() using the vocabulary learned from the training set. This ensures that all three splits share a consistent feature space. If we re-ran .fit() on the test data, we’d get a different vocabulary, and the model’s predictions would be meaningless.
With the vectorizer fitted and our training data transformed, we can start exploring what we’ve actually built. Let’s first take a look at the vocabulary. CountVectorizer stores it as a dictionary mapping each word to its index position, accessible via vocabulary_:
countVectorizerNews.vocabulary_
{'fed': 18461,
'up': 55833,
'with': 58324,
'pension': 38929,
'defaults': 13156,
'citing': 9475,
'failure': 18077,
'of': 36704,
'two': 54804,
'big': 5269,
'airlines': 1139,
'to': 53531,
'make': 31397,
'payments': 38686,
'their': 52947,
...}
len(countVectorizerNews.vocabulary_)
59544
This confirms that our vocabulary contains 59,544 unique words. Browsing through it, you can start to guess what kinds of terms appear frequently in the different types of news. Country names feature heavily in the “world” news category, terms like “football” and “cricket” in the “sports” news category, terms like “profit” and “losses” in the “business” news category, and company names like “Google” and “Microsoft” in the “science/technology” category.
Next, let’s inspect the feature matrix itself. ag_news_train_cv is a NumPy array with one row per headline and one column per vocabulary word, giving us a matrix of shape (108,000 × 59,544). We can wrap it in a DataFrame to make it easier to inspect in PyCharm’s DataFrame viewer:
pd.DataFrame(ag_news_train_cv, columns=countVectorizerNews.get_feature_names_out())
As expected, the matrix is very sparse. Most values are zero, since any individual headline only contains a small fraction of the full vocabulary. In fact, you might have noticed that the number of columns is more than half the number of rows, which is never good for a feature matrix. We’ll explore how to reduce the dimensionality of the feature space in a later section.
Note that we also need to apply this vectorization to the validation dataset before moving on to modeling. Importantly, we only apply the .transform() method to the validation set, since the vectorizer was already fitted on the training data.
ag_news_val_cv = countVectorizerNews.transform(ag_news_val["text_clean"]).toarray()
Visualizing the results
Before we move on to reducing the dimensionality of our feature space, let’s explore the distribution of the words in our corpus. This can help us understand the most common and rare words, and how we might use this to further process our data and improve the signal-to-noise ratio.
Word frequency plots
We’ll start by creating a DataFrame that aggregates word counts across all headlines and ranks them by frequency:
import numpy as np
vocab = countVectorizerNews.get_feature_names_out()
counts = np.asarray(ag_news_train_cv.sum(axis=0)).flatten()
pd.DataFrame({
    'vocab': vocab,
    'count': counts,
}).sort_values('count', ascending=False).reset_index(drop=True)
First, we retrieve the vocabulary in index order using get_feature_names_out(), so each word lines up with its corresponding column in the feature matrix. We then sum the matrix column-wise (that is, across all documents) to get the total number of times each word appears in the training set. Finally, we wrap these two arrays into a DataFrame and sort by count, giving us a ranked list of the most frequent terms.
Once this DataFrame is displayed in PyCharm, we can easily turn it into a visualization without writing a single line of code. By clicking on the Chart View button in the top left-hand corner of the DataFrame, we can explore a range of ways of visualizing our data. Go to Show Series Settings in the top right-hand corner, and adjust the parameters to get the count frequencies of the words: we set the X axis value to “vocab” (and change group and sort to none), the Y axis value to “count”, and the chart type to “Bar”.
Hovering over this chart, we can see that it has a very long-tailed distribution, which is very typical of vocabulary frequencies (so typical, in fact, that it is described by Zipf’s law). This means that the majority of our words occur very rarely in the text, and if we hover over the right-hand side of the chart, it looks like around a third of our vocabulary terms are only used once!
On the other hand, when we hover over the left-hand side of the chart, we can see that it is dominated by very common words, prepositions, and articles, such as “to”, “in”, “the”, and “you”. These words don’t really carry any meaning and occur in pretty much every text, so they’re unlikely to be useful for our classification task.
Let’s have a look at some things we can do to clean up our feature space and help our semantically meaningful words stand out a bit more.
Advanced bag-of-words techniques
The basic BoW pipeline we’ve built so far is a solid foundation, but there are several techniques that can meaningfully improve its quality. This section walks through the most important ones. We’ll only be using a selection of them in our project, but you can investigate which of these seem appropriate when building your own project.
Stop word removal
Stop words are extremely common words that appear frequently across all kinds of text but carry little meaningful information. This includes words like “the”, “is”, “and”, “of”, as we saw in the histogram in the previous section. They inflate vocabulary size without adding signal, so removing them is one of the most straightforward ways to improve your BoW representation. NLTK provides a built-in stop word list for English and many other languages.
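For example, filtering against NLTK's English list might look like this (a sketch, not the post's code; the one-time corpus download is assumed):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stop word lists
stop_words = set(stopwords.words("english"))

tokens = ["the", "market", "is", "rallying", "and", "stocks", "are", "up"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# ['market', 'rallying', 'stocks']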
Stemming and lemmatization
Another issue you might have noticed in our vocabulary is that words that are semantically equivalent have different syntactic forms, meaning that while they should be treated as the same token, they occupy additional token slots. We can resolve this through two techniques: stemming and lemmatization. Stemming reduces words to their root form using simple rule-based truncation (e.g. “running” → “run”), while lemmatization takes a linguistic approach, mapping words to their dictionary base form. Lemmatization is slower but generally produces cleaner results, particularly for irregular word forms.
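A quick comparison of the two (a sketch using NLTK's PorterStemmer and spaCy's small English model, neither of which the post has loaded at this point):

import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "studies", "better"]])
# ['run', 'studi', 'better'] -- rule-based truncation can produce non-words

nlp = spacy.load("en_core_web_sm")
print([token.lemma_ for token in nlp("He was running two studies")])
# ['he', 'be', 'run', 'two', 'study'] -- dictionary base forms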
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is an extension of basic count vectorization that weights each word by how informative it actually is. A word that appears frequently in one document but rarely across the corpus receives a high weight; a word that appears everywhere receives a low one. This neatly addresses one of the core weaknesses of raw count vectors: common but uninformative words can dominate the feature space even after stop-word removal.
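As a rough sketch of the underlying arithmetic (the classic formulation; scikit-learn's implementation, which we use later, smooths these formulas slightly):

import math

def tf_idf(term_count: int, doc_length: int, n_docs: int, docs_with_term: int) -> float:
    # Term frequency: how prominent the term is within this document
    tf = term_count / doc_length
    # Inverse document frequency: how rare the term is across the corpus
    idf = math.log(n_docs / docs_with_term)
    return tf * idf

# A term appearing 5 times in a 100-word document, found in only 10 of
# 1,000 documents, scores highly; one found in every document scores zero
print(tf_idf(5, 100, 1000, 10))    # ~0.23
print(tf_idf(5, 100, 1000, 1000))  # 0.0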
N-grams
Standard BoW treats each word independently, which means it misses phrases whose meaning depends on word combinations. A classic example is “machine learning”, which has a meaning distinct from “machine” + “learning”. N-grams address this by treating sequences of adjacent words as single tokens, so a bigram model would capture “machine learning” as a feature in its own right. The trade-off is a much larger vocabulary, so in practice, bigrams are most commonly used, with trigrams reserved for cases where capturing longer phrases is important.
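In scikit-learn, this is a single argument; a minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning powers modern search"]

# ngram_range=(1, 2) keeps single words and adds adjacent word pairs
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(docs)
print(bigram_vectorizer.get_feature_names_out())
# ['learning' 'learning powers' 'machine' 'machine learning' 'modern'
#  'modern search' 'powers' 'powers modern' 'search']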
Handling out-of-vocabulary words
When you apply your fitted vectorizer to new data, any words not present in the training vocabulary are silently ignored by default. For many tasks, this is acceptable, but if your production data is likely to continue introducing new terms that carry meaningful signal, it’s worth considering alternatives. One common approach is to reserve a special <UNK> token to represent unseen words, which at least preserves the information that something unfamiliar appeared, even if its identity is unknown and multiple (perhaps unrelated) words are collapsed onto the same token.
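CountVectorizer has no built-in <UNK> option, so this remapping would happen before vectorization; a minimal sketch of the idea (the names here are hypothetical):

training_vocabulary = {"stocks", "rally", "market"}

def map_oov(tokens: list[str]) -> list[str]:
    # Collapse any token outside the training vocabulary onto a shared marker
    return [t if t in training_vocabulary else "<UNK>" for t in tokens]

print(map_oov(["stocks", "rally", "memecoin"]))
# ['stocks', 'rally', '<UNK>']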
However, LLMs, with their more flexible approach to tokenization, tend to be a better choice if out-of-vocabulary words will be a major issue for your model once it is in production.
Dimensionality reduction
Even after stop word removal and other cleaning steps, BoW feature matrices are typically very high-dimensional and sparse. Two widely used techniques can help. Reducing to the top-N most frequent terms is the simplest approach, discarding low-frequency words that are unlikely to generalize well. For a more principled reduction, techniques like principal component analysis (PCA) or latent semantic analysis (LSA) project the feature matrix into a lower-dimensional space, compressing the representation while preserving as much of the meaningful variance as possible.
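For instance, LSA is commonly implemented with scikit-learn's TruncatedSVD, which works directly on sparse matrices (the corpus and component count here are purely illustrative):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stocks rally on earnings",
    "team wins league final",
    "market gains on profits",
    "striker scores in final",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Project the sparse TF-IDF matrix down to two latent components
lsa = TruncatedSVD(n_components=2, random_state=42)
reduced = lsa.fit_transform(tfidf)
print(reduced.shape)  # (4, 2)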
Feature selection techniques
Rather than reducing dimensionality arbitrarily, feature selection methods identify and retain only the features most relevant to your specific task. Chi-squared testing measures the statistical dependence between each term and the target label, making it well-suited to classification tasks. Mutual information takes a similar approach, scoring each feature by how much it reduces uncertainty about the class. Both methods can substantially reduce vocabulary size while preserving model performance.
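Both are available in scikit-learn; a toy sketch of chi-squared selection (labels and corpus invented for illustration):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "stocks rally on earnings",
    "team wins the final",
    "market gains on profits",
    "striker scores in the final",
]
labels = np.array([0, 1, 0, 1])  # 0 = business, 1 = sports

X = CountVectorizer().fit_transform(docs)

# Keep only the k terms most statistically associated with the labels
selector = SelectKBest(chi2, k=4)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)  # (4, 4)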
Applying bag-of-words to a real-world problem
Let’s now continue the example we started earlier. We’re going to take the work we’ve done on our AG News text classification task to completion by building a model.
A common way to build a model on encoded text is a neural network, where each word in the vocabulary is treated as a feature, and the categories we want to predict (in our case, the news category) are the outputs. We’ll start by building a baseline model that applies only string cleaning and encoding to the text.
I had originally written this model in Keras as part of a previous BoW project a couple of years ago. However, that code had since gone out of date. To update it and adapt it to PyTorch, I asked JetBrains AI to do the following:
Please update this neural network from Keras to Pytorch, making improvements to make the code as reusable as possible.
This gave us the following successful port of the code:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
class MulticlassClassificationModel(nn.Module):
    def __init__(self, input_size: int, hidden_layer_size: int, num_classes: int = 4):
        super(MulticlassClassificationModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_layer_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_layer_size, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

def train_text_classification_model(
        train_features: np.ndarray,
        train_labels: np.ndarray,
        validation_features: np.ndarray,
        validation_labels: np.ndarray,
        input_size: int,
        num_epochs: int,
        hidden_layer_size: int,
        num_classes: int = 4,
        batch_size: int = 1920,
        learning_rate: float = 0.001) -> MulticlassClassificationModel:
    # Convert labels to 0-indexed (AG News has labels 1,2,3,4 -> need 0,1,2,3)
    train_labels_indexed = train_labels - 1
    validation_labels_indexed = validation_labels - 1

    # Convert numpy arrays to PyTorch tensors
    X_train = torch.FloatTensor(train_features.copy())
    y_train = torch.LongTensor(train_labels_indexed.copy())
    X_val = torch.FloatTensor(validation_features.copy())
    y_val = torch.LongTensor(validation_labels_indexed.copy())

    # Create datasets and dataloaders
    train_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # Initialize model, loss function, and optimizer
    model = MulticlassClassificationModel(input_size, hidden_layer_size, num_classes)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0.0
        correct_train = 0
        total_train = 0
        for batch_features, batch_labels in train_loader:
            # Forward pass
            outputs = model(batch_features)
            loss = criterion(outputs, batch_labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Calculate training metrics
            train_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct_train += (predicted == batch_labels).sum().item()
            total_train += batch_labels.size(0)

        # Validation
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val)
            val_loss = criterion(val_outputs, y_val)
            _, val_predicted = torch.max(val_outputs, 1)
            correct_val = (val_predicted == y_val).sum().item()
            total_val = y_val.size(0)

        # Print epoch metrics
        train_acc = correct_train / total_train
        val_acc = correct_val / total_val
        print(f'Epoch [{epoch+1}/{num_epochs}], '
              f'Train Loss: {train_loss/len(train_loader):.4f}, '
              f'Train Acc: {train_acc:.4f}, '
              f'Val Loss: {val_loss:.4f}, '
              f'Val Acc: {val_acc:.4f}')

    return model

def generate_predictions(model: MulticlassClassificationModel,
                         validation_features: np.ndarray,
                         validation_labels: np.ndarray) -> list:
    model.eval()

    # Convert to tensors
    X_val = torch.FloatTensor(validation_features.copy())
    with torch.no_grad():
        outputs = model(X_val)
        _, predicted = torch.max(outputs, 1)

    # Convert back to 1-indexed labels to match original dataset
    predicted_labels = (predicted.numpy() + 1)

    print("Confusion Matrix:")
    print(pd.crosstab(validation_labels, predicted_labels,
                      rownames=['Actual'], colnames=['Predicted']))
    return predicted_labels.tolist()
Let’s walk through this code step-by-step to understand how we’re going to train our text classifier.
The model architecture
MulticlassClassificationModel is a simple two-layer feedforward neural network. It takes a BoW vector as input, with each feature being a vocabulary word, and passes it through two sequential transformations to produce a prediction. The first layer (fc1) compresses this high-dimensional input down to a smaller intermediate representation, whose size we control via hidden_layer_size. A ReLU activation is then applied, which introduces the non-linearity that allows the model to learn patterns a simple weighted sum couldn’t capture. The second layer (fc2) takes this intermediate representation and maps it down to four output values, one per news category, where the category with the highest value becomes the model’s prediction.
Training and validation
train_text_classification_model handles the full training loop. It starts with a small amount of housekeeping: The AG News labels run from 1 to 4, but PyTorch expects 0-indexed classes, so these are shifted down by 1. The features and labels are then converted to PyTorch tensors, and a DataLoader is created to feed the training data to the model in batches.
Each epoch, the model processes the training data batch by batch. For each batch, it runs a forward pass to generate predictions, computes the cross-entropy loss against the true labels, and then runs a backward pass to update the model weights via the RMSprop optimizer. At the end of every epoch, the model switches into evaluation mode and runs inference over the full validation set, printing the training and validation loss and accuracy so we can monitor how training is progressing.
Generating predictions
Once training is complete, generate_predictions runs the trained model on a held-out dataset and returns the predicted class for each article. It also prints a confusion matrix, which gives us a breakdown of which categories the model is getting right and where it’s getting confused, which is a much more informative picture than accuracy alone.
Running the baseline
We can now train the baseline model. We pass in the raw count-vectorized training and validation features, specify an input size equal to the vocabulary size (59,544 columns), train for two epochs, and use a hidden layer of 5,000 nodes.
baseline_model = train_text_classification_model(
    ag_news_train_cv,
    ag_news_train["label"].to_numpy(),
    ag_news_val_cv,
    ag_news_val["label"].to_numpy(),
    ag_news_train_cv.shape[1],
    2,
    5000
)
predictions = generate_predictions(
    baseline_model,
    ag_news_val_cv,
    ag_news_val["label"].to_numpy()
)
Epoch [1/2], Train Loss: 0.3553, Train Acc: 0.8813, Val Loss: 0.2307, Val Acc: 0.9243
Epoch [2/2], Train Loss: 0.1217, Train Acc: 0.9587, Val Loss: 0.2352, Val Acc: 0.9240
Confusion Matrix:
Predicted     1     2     3     4
Actual
1          2774    65    89    72
2            37  2944     9    10
3           112    20  2694   174
4            97    20   207  2676
Even with the very basic data preparation we did, we can see we’ve performed very well on this prediction task, with around 92% accuracy. The confusion matrix shows that the model seems to have the easiest time distinguishing between category two (sports) and the other topics, and the hardest time distinguishing between category three (business) and category four (science/technology). This makes sense, as the words used to describe sports are very distinct and unlikely to be used in other contexts (things like “football”), whereas there is likely to be overlapping vocabulary between business and technology (especially company names).
As we saw above, there is a lot we can do to improve the signal-to-noise ratio in BoW modeling. Let’s apply four commonly used techniques to our data and see whether this improves our predictions: lemmatization, stop word removal, limiting our vocabulary to the top N terms, and TF-IDF weighting. As you’ll see, all of these can be done relatively simply using built-in functions in packages such as spaCy and scikit-learn.
Lemmatization
As we discussed earlier, lemmatization collapses inflected word forms into a single vocabulary entry by mapping each word to its dictionary base form, which both shrinks the vocabulary and concentrates the signal for each concept into a single feature. We’ll use spaCy for this, which first requires downloading its small English language model:
!python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
Our lemmatise_text function passes each text through spaCy’s NLP pipeline using nlp.pipe(), which processes them in batches of 1,000 for efficiency. For each document, it extracts the .lemma_ attribute of every token and joins them back into a single string. One small detail worth noting: we preserve the original DataFrame index when constructing the output Series, so that rows stay correctly aligned when we assign the results back.
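The function's body isn't shown at this point in the post; a sketch consistent with the description above (reusing the nlp pipeline we just loaded) might look like:

def lemmatise_text(texts: pd.Series) -> pd.Series:
    texts = texts.fillna("").astype(str)
    lemmatised = []
    # nlp.pipe processes the texts in batches for efficiency
    for doc in nlp.pipe(texts, batch_size=1000):
        lemmatised.append(" ".join(token.lemma_ for token in doc))
    # Preserve the original index so rows stay aligned on assignment
    return pd.Series(lemmatised, index=texts.index)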
We apply lemmatization before string cleaning, since spaCy needs the original casing and punctuation to correctly identify grammatical structure. For example, “running” and “Running” lemmatize to the same thing, but removing punctuation first can confuse the parser. Once lemmatized, we pass the output through apply_string_cleaning as before:
ag_news_train["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["title"]))
ag_news_train["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["description"]))
ag_news_val["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["title"]))
ag_news_val["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["description"]))
ag_news_train["text_clean"] = ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]
ag_news_val["text_clean"] = ag_news_val["title_clean"] + " " + ag_news_val["description_clean"]
We apply this separately to the title and description columns before concatenating them into a single text_clean field. As you can see, we do this for both the training and validation sets using the same function, so that lemmatization is applied consistently across both splits.
Removing stop words
As with lemmatization, we covered the motivation for stop word removal earlier: Words like “the”, “is”, and “of” appear so frequently across all texts that they add noise rather than signal to our feature matrix. Here we’ll actually apply it to our data.
def remove_stopwords(texts: pd.Series) -> pd.Series:
    texts = texts.fillna("").astype(str)
    filtered_texts = []
    for doc in nlp.pipe(texts, batch_size=1000):
        filtered_texts.append(
            " ".join(token.text for token in doc if not token.is_stop)
        )
    return pd.Series(filtered_texts, index=texts.index)
Our remove_stopwords function again uses nlp.pipe() to process texts in batches. For each document, it filters out any token where spaCy’s is_stop attribute is True, and joins the remaining tokens back into a string. Conveniently, spaCy handles stop word detection using the same pipeline we already loaded for lemmatization, so no additional setup is needed.
We apply this to the already-cleaned and lemmatized text_clean column for both the training and validation sets, so the stop word removal builds directly on our previous preprocessing steps and is applied consistently across both splits.
ag_news_train["text_no_stopwords"] = remove_stopwords(ag_news_train["text_clean"])
ag_news_val["text_no_stopwords"] = remove_stopwords(ag_news_val["text_clean"])
Top N terms and TF-IDF vectorization
The final two improvements we’ll apply are limiting the vocabulary size and switching from raw count vectorization to TF-IDF weighting. Conveniently, scikit-learn’s TfidfVectorizer handles both in a single step.
Recall from earlier that TF-IDF downweights words that appear frequently across many documents while upweighting words that are distinctive to particular documents. This cleans up uninformative words that don’t quite qualify as stop words but add little useful information to our dataset. The max_features=20000 argument caps the vocabulary at the 20,000 highest-frequency terms across the corpus, which discards the long tail of rare words that are unlikely to generalize well and brings our feature matrix down to a much more manageable size. (The choice of 20,000 words is arbitrary. We could have easily used a smaller or larger number, depending on our dataset and use case.)
As with CountVectorizer, we fit only on the training data and then use that fixed vocabulary to transform both the training and validation sets:
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizerNews = TfidfVectorizer(max_features=20000)
TfidfVectorizerNews.fit(ag_news_train["text_no_stopwords"])
ag_news_train_tfidf = TfidfVectorizerNews.transform(ag_news_train["text_no_stopwords"]).toarray()
ag_news_val_tfidf = TfidfVectorizerNews.transform(ag_news_val["text_no_stopwords"]).toarray()
We can inspect the resulting vocabulary and feature matrix exactly as we did before:
TfidfVectorizerNews.vocabulary_
{'fed': np.int64(6243),
'pension': np.int64(13134),
'default': np.int64(4469),
'cite': np.int64(3200),
'failure': np.int64(6109),
'big': np.int64(1787),
'airline': np.int64(401),
'payment': np.int64(13051),
'plan': np.int64(13424),
'government': np.int64(7306),
'official': np.int64(12453),
'tuesday': np.int64(18437),
'congress': np.int64(3691),
'hard': np.int64(7689),
'corporation': np.int64(3901),
...}
pd.DataFrame(ag_news_train_tfidf, columns=TfidfVectorizerNews.get_feature_names_out())
Compared to our baseline feature matrix of 59,544 columns filled almost entirely with zeros, this is considerably leaner. We now have 20,000 columns of weighted scores that better reflect each word’s actual importance to the document it appears in. It is still relatively sparse, but we can see from both the feature matrix and the vocabulary list that it is much more focused on semantically rich words.
Fitting the revised model
With our improved features in hand, we can now retrain the model. The call is identical to before, except we pass in the TF-IDF feature matrices instead of the raw count vectors, and the input size is now 20,000 rather than 59,544:
baseline_model = train_text_classification_model(
    ag_news_train_tfidf,
    ag_news_train["label"].to_numpy(),
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy(),
    ag_news_train_tfidf.shape[1],
    2,
    5000
)
predictions = generate_predictions(
    baseline_model,
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy()
)
Epoch [1/2], Train Loss: 0.3183, Train Acc: 0.8932, Val Loss: 0.2301, Val Acc: 0.9225
Epoch [2/2], Train Loss: 0.1512, Train Acc: 0.9475, Val Loss: 0.2332, Val Acc: 0.9243
Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual
1          2703    71   121   105
2            20  2955    13    12
3            68    19  2691   222
4            77    17   163  2743
The results are actually very encouraging! Our overall validation accuracy is essentially unchanged at around 92%, but we’ve achieved this with a feature matrix that is less than a third of the size. This suggests that the extra vocabulary in the baseline (including the stop words) was contributing to noise rather than signal. Reducing the size of the feature matrix makes our model more stable, less prone to overfitting, and much more manageable to deploy.
Looking at the confusion matrix, the pattern of errors is similar to before: Sports (category two) is the easiest category to classify, with 98.5% accuracy, while Business (category three) and Science/Technology (category four) remain the hardest to separate, with around 7% of articles in each category being misclassified as the other. This is consistent with what we saw in the baseline, so it seems that the preprocessing improvements have tightened things up at the margins, but the fundamental difficulty of the Business/Technology boundary is a property of the data rather than the feature representation.
Applying our model to the test set
Finally, we need to validate that our model performs as well on the test set as it does on the validation set. Up to this point, we’ve deliberately kept the test set locked away. As mentioned earlier, if we had been making modeling decisions based on test set performance, we’d risk inadvertently overfitting our choices to it, and our final accuracy estimate would be optimistic.
The preprocessing steps must be applied in exactly the same order as for the training and validation data: lemmatization, string cleaning, concatenation of title and description, and stop-word removal. Crucially, we also call .transform() rather than .fit_transform() on the test text, using the vocabulary learned from the training data:
ag_news_test["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["title"]))
ag_news_test["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["description"]))
ag_news_test["text_clean"] = ag_news_test["title_clean"] + " " + ag_news_test["description_clean"]
ag_news_test["text_no_stopwords"] = remove_stopwords(ag_news_test["text_clean"])
ag_news_test_tfidf = TfidfVectorizerNews.transform(ag_news_test["text_no_stopwords"]).toarray()
We can then generate predictions and evaluate accuracy on the test set:
test_predictions = generate_predictions(
    baseline_model,
    ag_news_test_tfidf,
    ag_news_test["label"].to_numpy()
)
from sklearn.metrics import accuracy_score

test_accuracy = accuracy_score(ag_news_test["label"].to_numpy(), test_predictions)
print(f"Test Accuracy: {test_accuracy:.4f}")
Test Accuracy: 0.9183
Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual
1          1710    54    78    58
2            13  1870    10     7
3            51    12  1676   161
4            53     9   115  1723
The test accuracy of 91.8% is very close to the 92.4% we saw on the validation set, which is a reassuring sign that our model has generalized well rather than overfitting to the validation data. The confusion matrix tells the same story as before: Sports (category two) remains the easiest category to classify, with only 30 misclassified articles out of 1,900, while the Business/Technology boundary continues to be the main source of errors, with around 8% of articles in each category being misclassified as the other. The consistency between validation and test results gives us confidence that these error patterns reflect genuine properties of the data rather than artifacts of any particular split.
Limitations and alternatives
Loses word order information
The most fundamental limitation of the bag-of-words model is right there in the name: it treats text as an unordered collection of words, discarding all sequence information. This means “the dog bit the man” and “the man bit the dog” produce identical vectors, even though they describe very different events. For many classification tasks, this doesn’t matter much, but for tasks that require understanding the relationship between words, such as question answering or natural language inference, the loss of word order is a serious handicap.
Ignores semantics and context
BoW has no notion of word meaning or context. Each word is simply a column in a matrix, entirely independent of every other word. This creates two related problems. First, synonyms are treated as completely distinct features: “cheap” and “inexpensive” contribute nothing to each other’s signal, even though they mean the same thing. Second, words with multiple meanings are treated as a single feature regardless of context: “bank” means the same thing whether it appears in a sentence about rivers or finance. Both of these issues limit how well BoW representations can capture the actual semantics of a text.
Can result in large, sparse vectors
As we saw in our own example, even a moderately sized corpus of news headlines can produce a vocabulary of nearly 60,000 unique terms. The resulting feature matrix has one column per vocabulary word, but any individual document only uses a tiny fraction of them, leaving the vast majority of values at zero. This sparsity creates two practical problems: The matrices can consume a large amount of memory if stored densely, and the high dimensionality can make it harder for models to find meaningful patterns, a phenomenon sometimes called the curse of dimensionality.
Alternatives
If BoW’s limitations are a bottleneck for your task, there are several well-established alternatives worth considering.
- Word embeddings (Word2Vec and GloVe) address the semantics problem by representing each word as a dense vector in a continuous space, where similar words are geometrically close to each other. They capture distributional meaning far more richly than BoW, and are a natural next step when synonym handling or word similarity matters. Doc2Vec extends this idea to produce embeddings for entire documents rather than individual words.
- Transformer-based models (BERT and GPT) go further still, generating contextual representations where the same word receives a different vector depending on the surrounding text. This handles polysemy directly and captures complex long-range dependencies between words. The trade-off is substantially higher computational cost and complexity compared to BoW.
- Topic models like latent Dirichlet allocation (LDA) take a different angle entirely. Rather than encoding documents for downstream classification, they are generative models that discover latent thematic structure in a corpus. This is useful when your goal is exploration and interpretation rather than prediction.
For tasks where BoW already performs well, as we saw here with AG News, the added complexity of these approaches may not be worth the cost. BoW remains a strong baseline, and it’s always worth establishing how far it can take you before reaching for heavier machinery.
Get started with PyCharm today
In this post, we’ve covered a lot of ground: from the fundamentals of the bag-of-words model and how it converts text into numerical vectors, through to building and iteratively improving a real text classification pipeline on the AG News dataset. Along the way, we’ve seen how preprocessing steps like lemmatization, stop word removal, vocabulary capping, and TF-IDF weighting can meaningfully improve the efficiency of your feature representation, and how PyCharm’s DataFrame viewer, column statistics, chart view, and AI Assistant make each of these steps faster and easier to inspect and debug.
If you’d like to try this yourself, PyCharm Pro comes with a 30-day trial. As we saw in this tutorial, its built-in support for Jupyter notebooks, virtual environments, and scientific libraries means you can go from a blank project to a working NLP pipeline with minimal setup friction, leaving you free to focus on the fun parts.
You can find the full code for this project on GitHub. If you’re interested in exploring more NLP topics, check out our recent blogs here.
PyCon
PyCon US 2026: Call for Volunteers
Looking to make a meaningful contribution to the Python community? Look no further than PyCon US 2026! Whether you're a seasoned Python pro or a newcomer to the community and looking to get involved, there's a volunteer opportunity that's perfect for you.
Sign-up for volunteer roles is done directly through the PyCon US website. This way, you can view and manage shifts you sign up for through your personal dashboard! You can read up on the different roles to volunteer for and how to sign up on the PyCon US website.
PyCon US is largely organized and run by volunteers. Every year, we ask to fill over 300 onsite volunteer hours to ensure everything runs smoothly at the event. And the best part? You don’t need to commit a lot of time to make a difference: some shifts are as short as 45 minutes long! You can sign up for as many or as few shifts as you’d like. Even a couple of hours of your time can go a long way in helping us create an amazing experience for attendees.
Keep in mind that you need to be registered for the conference to sign up for a volunteer role.
Real Python
AI Coding Agents Guide: A Map of the Four Workflow Types
AI coding agents can read your code, reason about changes, and act on your behalf. To choose the right one, it helps to understand the four common workflow types: integrated development environment (IDE), terminal, pull request (PR), and cloud.
In this tutorial, you’ll:
- Identify the four common agent interaction modes
- Understand what makes each workflow distinct
- Recognize which mode fits common development scenarios
- Weigh the risks and tradeoffs of each workflow
Before exploring the four workflow types, it’s worth looking at what makes a coding tool agentic in the first place.
Take the Quiz: Test your knowledge with our interactive “AI Coding Agents Guide: A Map of the Four Workflow Types” quiz. You’ll receive a score upon completion to help you track your learning progress:
Interactive Quiz
AI Coding Agents Guide: A Map of the Four Workflow Types
Check your understanding of how AI coding agents fit into your workflow through four interaction modes: IDE, terminal, pull request, and cloud.
Get Your Cheat Sheet: Click here to download your free AI coding agents cheat sheet and keep the four workflow types at your fingertips when choosing the right agent for the job.
Understanding AI Coding Agents
While standard chatbots provide one-off answers, coding agents are designed for autonomy, operating through a continuous execution loop to solve complex tasks. This loop typically follows four distinct steps:
- Read: They read relevant files from your codebase to form their context.
- Reason: They determine the logical steps needed to achieve your goal.
- Act: They execute those steps by editing files, running terminal commands, or using external tools.
- Evaluate: They check the results of their actions to see if more work is needed.
This loop repeats until the task is completed or the agent hands control back to you. Unlike simple predictive text or one-off prompts, agents bridge the gap between suggestion and execution by autonomously navigating the development workflow.
The core agent loop will generally stay the same, but where an agent runs will shape how you interact with it:
- In an editor, it works alongside you.
- In a terminal, you guide it step by step.
- In pull requests, it reviews changes asynchronously.
- In the cloud, it works in a managed environment and reports back later.
These environments define four primary agent types, each enabling a distinct workflow: IDE agents, terminal agents, PR agents, and cloud agents.
Exploring the Four Workflow Types
The four workflow types describe interaction modes and donât always map cleanly to product categories. The same tool often spans multiple workflows. For example, Claude Code runs in your terminal, in your editor, and in the cloud with Claude Code on the web. It can also review pull requests with Code Review.
The goal is to match the workflow to the task. The diagram below summarizes the four types at a glance:
The Four Coding Agent Workflows
Read the full article at https://realpython.com/ai-coding-agents-guide/ »
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
Quiz: ChatterBot: Build a Chatbot With Python
In this quiz, you’ll test your understanding of ChatterBot: Build a Chatbot With Python.
You’ll revisit how ChatterBot learns from conversation data, how it picks replies based on similarity to what it’s already seen, and how it can pull in a local LLM to round out its responses.
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
Quiz: Python 3.13: A Modern REPL
Test your knowledge of the redesigned interactive interpreter introduced in Python 3.13: A Modern REPL, including the help system, multiline statement editing, code pasting improvements, and the history browser.
Good luck!
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
Python GUIs
Actions in one thread changing data in another â How to communicate between threads and windows in PyQt6
I have a main window that starts background threads (e.g., handling GPIO data). From the main window I open secondary windows using buttons. When I press a button in a secondary window, I can't change anything in the background threads. But if I press a button in the main window, everything works. How do I communicate between a secondary window and a thread that was started from the main window?
This is a common problem when building PyQt6 applications with multiple windows and background threads. The good news is that Qt's signal and slot system is designed to handle this and it works safely across threads.
The core idea is that your secondary window doesn't need direct access to the thread or the worker object. Instead the secondary window and the worker just need access to the same signals, and can then use them to communicate with one another. Qt handles the cross-thread communication automatically.
Why doesn't direct access work?
When you create a background thread from the main window, you'll often store a reference to that thread on the main window. If that main window then creates a sub-window, the sub-window doesn't have any access to the objects on its parent. Even if it did, calling methods on the thread directly is not usually the right approach.
You can access the attributes of a parent window using .parent(), but this is a bad habit because it tightly couples the parts of your application together. If you modify the structure of the parent window, you then also need to edit the sub-window. There are better ways that keep things nicely isolated.
The solution is to avoid calling methods directly across threads. Instead, use signals and slots. When a signal is emitted in one thread and connected to a slot in another, Qt automatically queues the call and delivers it safely.
Setting up a background worker
First, let's create a simple worker class that runs in a background thread. This worker simulates handling incoming data (like GPIO data) and also accepts commands from the GUI.
from PyQt6.QtCore import QObject, pyqtSignal, pyqtSlot
import time


class Worker(QObject):
    """A worker that runs in a background thread."""

    data_updated = pyqtSignal(str)

    def __init__(self):
        super().__init__()
        self.running = True
        self.current_value = 0

    @pyqtSlot()
    def run(self):
        """Simulate continuous data handling."""
        while self.running:
            self.current_value += 1
            self.data_updated.emit(f"Data: {self.current_value}")
            time.sleep(1)

    @pyqtSlot(int)
    def set_value(self, value):
        """Receive a new value from the GUI."""
        self.current_value = value
        self.data_updated.emit(f"Value set to: {self.current_value}")
The set_value slot is what we'll trigger from the secondary window. Because it's a slot connected via a signal, Qt will deliver the call on the correct thread.
Creating the secondary window
The secondary window has a button and a spin box. When the user clicks the button, the window emits a signal carrying the new value. The secondary window doesn't know anything about the worker — it just emits a signal.
from PyQt6.QtWidgets import QWidget, QVBoxLayout, QPushButton, QSpinBox, QLabel
from PyQt6.QtCore import pyqtSignal


class SecondaryWindow(QWidget):
    """A secondary window that emits a signal when the user sets a value."""

    value_changed = pyqtSignal(int)

    def __init__(self):
        super().__init__()
        self.setWindowTitle("Secondary Window")

        layout = QVBoxLayout()

        self.label = QLabel("Set a new value for the worker:")
        layout.addWidget(self.label)

        self.spinbox = QSpinBox()
        self.spinbox.setRange(0, 1000)
        layout.addWidget(self.spinbox)

        self.button = QPushButton("Send to Worker")
        self.button.clicked.connect(self.send_value)
        layout.addWidget(self.button)

        self.setLayout(layout)

    def send_value(self):
        self.value_changed.emit(self.spinbox.value())
The value_changed signal is the only interface this window exposes. This keeps things clean and decoupled.
Wiring everything together in the main window
The main window is where all the connections happen. It creates the worker, starts the thread, opens the secondary window, and connects the secondary window's signal to the worker's slot.
from PyQt6.QtWidgets import QMainWindow, QVBoxLayout, QPushButton, QLabel, QWidget
from PyQt6.QtCore import QThread


class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Main Window")

        # Set up the UI
        layout = QVBoxLayout()

        self.status_label = QLabel("Waiting for data...")
        layout.addWidget(self.status_label)

        self.open_button = QPushButton("Open Secondary Window")
        self.open_button.clicked.connect(self.open_secondary)
        layout.addWidget(self.open_button)

        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

        # Keep a reference to the secondary window
        self.secondary_window = None

        # Set up the background thread and worker
        self.thread = QThread()
        self.worker = Worker()
        self.worker.moveToThread(self.thread)

        # Connect signals
        self.thread.started.connect(self.worker.run)
        self.worker.data_updated.connect(self.update_status)

        # Start the thread
        self.thread.start()

    def update_status(self, text):
        self.status_label.setText(text)

    def open_secondary(self):
        if self.secondary_window is None:
            self.secondary_window = SecondaryWindow()
            # Connect the secondary window's signal to the worker's slot.
            # This is the connection that makes cross-window,
            # cross-thread communication work.
            self.secondary_window.value_changed.connect(self.worker.set_value)
        self.secondary_window.show()

    def closeEvent(self, event):
        self.worker.running = False
        self.thread.quit()
        self.thread.wait()
        super().closeEvent(event)
The line that connects everything together is:
self.secondary_window.value_changed.connect(self.worker.set_value)
This connects a signal from the secondary window (running in the main/GUI thread) to a slot on the worker (which has been moved to a background thread). Qt sees that the sender and receiver live in different threads, so it automatically uses a queued connection. The slot call is placed into the background thread's event queue and executed there.
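If you ever want to make that delivery mode explicit rather than relying on Qt's automatic detection, you can pass a connection type when connecting. This is a small sketch against the MainWindow code above, and it's rarely needed in practice since the automatic behavior already does the right thing:

from PyQt6.QtCore import Qt

# Inside MainWindow.open_secondary: explicitly request queued
# (cross-thread) delivery instead of relying on auto-detection.
self.secondary_window.value_changed.connect(
    self.worker.set_value, Qt.ConnectionType.QueuedConnection
)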
Understanding why the main window worked but the secondary didn't
In the original question, buttons in the main window could affect the background threads, but buttons in a secondary window could not. This usually happens because:
- The main window had direct signal-slot connections to the worker (set up when both the worker and the connections were created).
- The secondary window was created later, and its signals were never connected to the worker.
The solution is to connect its signals to the appropriate worker slots when you create the secondary window, just as you would for the main window. The worker doesn't care where the signal comes from — it just responds to whatever signals are connected to its slots. For more on managing multiple windows in PyQt6, see our tutorial on creating multiple windows.
A note about QThreadPool vs QThread
The original question mentions using QThreadPool. If you're using QRunnable with a QThreadPool, the pattern is slightly different because QRunnable doesn't inherit from QObject and can't have slots directly. In that case, you typically create a separate QObject-based signals class and attach it to your runnable. For a detailed walkthrough of that approach, see Multithreading PyQt6 applications with QThreadPool.
However, for long-running background tasks that need two-way communication with the GUI (like GPIO handling), QThread with moveToThread() is usually a better fit. It gives you a proper event loop in the background thread, which means signals and slots work naturally in both directions.
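For reference, here's a rough sketch of that QRunnable pattern. The WorkerSignals class name is a common convention rather than an official API, and in a real application this would run inside an existing QApplication event loop so that queued signals get delivered:

from PyQt6.QtCore import QObject, QRunnable, QThreadPool, pyqtSignal


class WorkerSignals(QObject):
    # QRunnable can't declare signals itself, so a QObject holds them.
    finished = pyqtSignal(str)


class RunnableWorker(QRunnable):
    def __init__(self):
        super().__init__()
        self.signals = WorkerSignals()

    def run(self):
        # Do the background work, then report back through the signal.
        self.signals.finished.emit("done")


# Inside a running QApplication:
worker = RunnableWorker()
worker.signals.finished.connect(print)
QThreadPool.globalInstance().start(worker)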
Complete working example
Here's everything in a single file you can copy, run, and experiment with. If you're new to PyQt6, you may want to start with creating your first window before diving in.
import sys
import time

from PyQt6.QtCore import QObject, QThread, pyqtSignal, pyqtSlot
from PyQt6.QtWidgets import (
    QApplication,
    QLabel,
    QMainWindow,
    QPushButton,
    QSpinBox,
    QVBoxLayout,
    QWidget,
)


class Worker(QObject):
    """A worker that runs in a background thread."""

    data_updated = pyqtSignal(str)

    def __init__(self):
        super().__init__()
        self.running = True
        self.current_value = 0

    @pyqtSlot()
    def run(self):
        """Simulate continuous data handling."""
        while self.running:
            self.current_value += 1
            self.data_updated.emit(f"Data: {self.current_value}")
            time.sleep(1)

    @pyqtSlot(int)
    def set_value(self, value):
        """Receive a new value from the GUI."""
        self.current_value = value
        self.data_updated.emit(f"Value set to: {self.current_value}")


class SecondaryWindow(QWidget):
    """A secondary window that emits a signal when the user sets a value."""

    value_changed = pyqtSignal(int)

    def __init__(self):
        super().__init__()
        self.setWindowTitle("Secondary Window")

        layout = QVBoxLayout()

        self.label = QLabel("Set a new value for the worker:")
        layout.addWidget(self.label)

        self.spinbox = QSpinBox()
        self.spinbox.setRange(0, 1000)
        layout.addWidget(self.spinbox)

        self.button = QPushButton("Send to Worker")
        self.button.clicked.connect(self.send_value)
        layout.addWidget(self.button)

        self.setLayout(layout)

    def send_value(self):
        self.value_changed.emit(self.spinbox.value())


class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Main Window")

        # Set up the UI
        layout = QVBoxLayout()

        self.status_label = QLabel("Waiting for data...")
        layout.addWidget(self.status_label)

        self.open_button = QPushButton("Open Secondary Window")
        self.open_button.clicked.connect(self.open_secondary)
        layout.addWidget(self.open_button)

        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

        # Keep a reference to the secondary window
        self.secondary_window = None

        # Set up the background thread and worker
        self.thread = QThread()
        self.worker = Worker()
        self.worker.moveToThread(self.thread)

        # Connect signals
        self.thread.started.connect(self.worker.run)
        self.worker.data_updated.connect(self.update_status)

        # Start the thread
        self.thread.start()

    def update_status(self, text):
        self.status_label.setText(text)

    def open_secondary(self):
        if self.secondary_window is None:
            self.secondary_window = SecondaryWindow()
            # Connect the secondary window's signal to the worker's slot
            self.secondary_window.value_changed.connect(
                self.worker.set_value
            )
        self.secondary_window.show()

    def closeEvent(self, event):
        self.worker.running = False
        self.thread.quit()
        self.thread.wait()
        super().closeEvent(event)


app = QApplication(sys.argv)
window = MainWindow()
window.show()
sys.exit(app.exec())
When you run this, you'll see the main window counting up once per second. Click "Open Secondary Window", enter a number, and click "Send to Worker" — the worker's counter will jump to your chosen value and continue counting from there.
The secondary window communicates with the background thread entirely through signals and slots, with no direct method calls across threads. This pattern scales well — you can connect as many windows as you like to the same worker, or connect one window to multiple workers. As long as you use signals and slots for cross-thread communication, Qt handles the thread safety for you.
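As a quick hypothetical illustration of that scaling: wiring more windows to the same worker is just one more connect call per window. Here, settings_window and control_window are made-up names for additional SecondaryWindow-style instances, not part of the example above:

# Any number of windows can feed values to the same worker slot.
for window in (settings_window, control_window):
    window.value_changed.connect(self.worker.set_value)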
For an in-depth guide to building Python GUIs with PyQt6 see my book, Create GUI Applications with Python & Qt6.
April 28, 2026
Talk Python Blog
Introducing the new Talk Python web player
We expect that most people who listen to Talk Python do so through their podcast player apps on their phone or even on their laptops. But there are plenty of times that people end up on an episode page and would love to have a nice experience interacting with that episode as well. One really common example: you go back to an episode you discovered several years ago, and the chances it’s still on your device are low. Though we do keep our entire back catalog available in the RSS feed, most podcast players trim down what they keep locally.
PyCoder's Weekly
Issue #732: Web Scraping, Altair Charts, OpenAI's API, and More (April 28, 2026)
#732 – APRIL 28, 2026
View in Browser »
browser-use vs. Playwright: Which to Pick for Web Scraping?
Follow along in this walk-through building a Hacker News synthesizer with browser-use, then see it fail on a harder Newegg scraping task. Includes a side-by-side comparison with Playwright and a breakdown of when each tool is the right call.
CODECUT.AI • Shared by Khuyen Tran
Altair: Declarative Charts With Python
Build interactive Python charts the declarative way with Altair. Map data to visual properties and add linked selections. No JavaScript required.
REAL PYTHON
Positron: The Data Science IDE from Posit PBC
Positron is a free IDE built for Python data science. AI assistance, interactive data frames, Jupyter notebooks, and instant app deployment, all in one place. Stop context-switching. Start shipping. Download free.
POSIT PBC sponsor
Leverage OpenAI’s API in Your Python Projects
Learn how to use the ChatGPT API with Python’s openai library to send prompts, control AI behavior with roles, and get structured outputs.
REAL PYTHON course
Articles & Tutorials
Fixing a Memory “Leak” From Python 3.14’s Incremental Garbage Collection
Adam encountered an out-of-memory error while migrating a client project to Python 3.14. The issue occurred when running Django's database migration command on a limited-resource server, and seemed to be caused by the new incremental garbage collection algorithm in Python 3.14.
ADAM JOHNSON
Logging to File and to Textual Console
When writing TUI applications in Textual you can’t just print out your debug info since the terminal is controlled by the framework. This article shows you how to log and use Textual’s built-in debug console.
MIKE DRISCOLL
Beyond Basic RAG: Build Persistent AI Agents
Master next-gen AI with Python notebooks for agentic reasoning, memory engineering, and multi-agent orchestration. Scale apps using production-ready patterns for LangChain, LlamaIndex, and high-performance vector search. Explore & Star on GitHub.
ORACLE sponsor
Read the Docs Now Supports uv Natively
Popular open source documentation site Read the Docs has announced that it now supports native uv in .readthedocs.yaml for Python dependency installation. Learn how to use it in your configurations.
READ THE DOCS
PyTexas 2026 Recap
Per-talk notes from PyTexas 2026 in Austin: Hynek on domain modeling, Dawn Wages on specialization, MCP security, PEP 810 lazy imports, free-threading, Ruff, ty, uv, supply chain.
BERNÁT GÁBOR
The Carbon Footprint of Wagtail AI
One of the package maintainers for Wagtail AI shares his method for measuring the carbon impact of the different AI tasks users can do and goes over the initial results.
WAGTAIL.ORG • Shared by Meagen Voss
Gemini CLI vs Claude Code: Which to Choose for Python Tasks
Gemini CLI vs Claude Code: compare setup, performance, code quality, and cost to find the right Python AI coding tool for your workflow.
REAL PYTHON
Learn the Agentic Coding Workflow That Actually Works on Real Projects
65% of Python developers are stuck using AI for small tasks that fall apart on anything real. This 2-day live course (May 6-7 via Zoom) walks you through building a complete Python CLI app with Claude Code, from an empty directory to a shipped project on GitHub.
REAL PYTHON
Implementing OpenTelemetry in FastAPI
Learn how you can observe your FastAPI web apps with OpenTelemetry, including how to integrate it and why it is important.
SIGNOZ.IO • Shared by Dhruv Ahuja
Building a Python Library in 2026
So you want to build a Python library in 2026? Here’s everything you need to know about the state of the art.
STEPHEN IF
Projects & Code
Local Usage PyPI Alternative With Vulnerability Scanning
Very interesting project
GITHUB.COM/RUSTEDBYTES âą Shared by Yehor Smoliakov
vibescore: One-Command Quality Score for Any Python Project
GITHUB.COM/STEF41 âą Shared by Anonymous
Events
Weekly Real Python Office Hours Q&A (Virtual)
April 29, 2026
REALPYTHON.COM
PyCamp Spain 2026
April 30 to May 4, 2026
PYCAMP.ES
PyDelhi User Group Meetup
May 2, 2026
MEETUP.COM
PyBodensee Monthly Meetup
May 4, 2026
PYBODENSEE.COM
IndyPy: Lightning Talks
May 5 to May 6, 2026
MEETUP.COM
Happy Pythoning!
This was PyCoder’s Weekly Issue #732.
View in Browser »
[ Subscribe to 🐍 PyCoder's Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]
Django Weblog
Renew Your PyCharm License and Support Django
Only a few days remain to support the Django Software Foundation through our annual JetBrains fundraiser.
You can now use the offer for new purchases and annual renewals. If your PyCharm Professional subscription expires this year, this is a great time to renew or extend it for up to 12 months.
Get 30% off PyCharm Professional, and 100% of proceeds from qualifying purchases and renewals go to the DSF to help fund Django Fellows, community programs, events, and the future of Django.
Offer ends May 1: Learn more about the fundraiser
Claim 30% off here: Get the JetBrains offer
Mariatta
PyCascades 2026 Recap
PyCascades 2026 took place in Vancouver this year. I only got to attend the first day, because I had a 5 a.m. flight to Washington DC the morning after.
Still, the first day's talks were all very insightful and interesting. I'm waiting for all the talks to be published so that I can catch up on the ones I missed.
Here are notes on the talks I got to see.
Real Python
Testing Your Code With Python's unittest
The Python standard library ships with a testing framework named unittest, which you can use to write automated tests for your code. The unittest package has an object-oriented approach where test cases derive from a base class, which has several useful methods.
The framework supports many features that will help you write consistent unit tests for your code. These features include test cases, fixtures, test suites, and test discovery capabilities.
In this video course, you’ll learn how to:
- Write unittest tests with the TestCase class
- Explore the assert methods that TestCase provides
- Use unittest from the command line
- Group test cases using the TestSuite class
- Create fixtures to handle setup and teardown logic
To get the most out of this video course, you should be familiar with some important Python concepts, such as object-oriented programming, inheritance, and assertions. Having a good understanding of code testing is a plus.
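As a tiny taste of what that looks like in practice, here's a minimal standalone sketch (our own example, not taken from the course) showing a TestCase subclass, a fixture, and one of the assert methods:

import unittest


class TestAddition(unittest.TestCase):
    def setUp(self):
        # Fixture: runs before each test method.
        self.values = (2, 3)

    def test_sum(self):
        # One of the many assert methods TestCase provides.
        self.assertEqual(sum(self.values), 5)


if __name__ == "__main__":
    unittest.main()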
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
Quiz: Use Codex CLI to Enhance Your Python Projects
In this quiz, you’ll test your understanding of Use Codex CLI to Enhance Your Python Projects.
By working through this quiz, you’ll revisit how to install and configure Codex CLI, use Plan mode to review changes before they land, and refine features through iterative prompting in your terminal.
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
Quiz: Testing Your Code With Python's unittest
In this quiz, you’ll test your understanding of Testing Your Code With Python’s unittest.
By working through this quiz, you’ll revisit key concepts like structuring tests with TestCase, using assertion methods, skipping tests conditionally, parameterizing with subtests, and preparing test data with fixtures.
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
PyPy
PyPy v7.3.22 release
PyPy v7.3.22: release of Python 2.7, 3.11
The PyPy team is proud to release version 7.3.22 of PyPy after the previous release on March 13, 2026. This is a bug-fix release that fixes several issues in the JIT. Among them is a long-standing JIT bug that started appearing after some instance optimizations exposed it. We also cleaned up many of the remaining stdlib test suite failures, which improves CPython compatibility around line numbers in dis.dis, signatures and objclass attributes for builtins, and other quality-of-life features.
There is now an RPython _pickle module that mirrors the CPython one, greatly speeding up pickling operations. Where before PyPy was 5.7x slower than CPython on the pickle benchmark from the pyperformance benchmark suite, now it is only 1.6x slower [0]. We also added PyPy pickler extensions to dump and load lists using list strategies, and enabled them in the ForkingPickler used by multiprocessing, speeding up cases where such objects are passed between PyPy multiprocessing instances.
We also added an RPython json encoder, speeding up json_bench from being 2.6x slower than CPython to 0.7x (that is, faster than CPython).
The release includes two different interpreters:
- PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7, including the stdlib for CPython 2.7.18+ (the + is for backported security updates)
- PyPy3.11, which is an interpreter supporting the syntax and the features of Python 3.11, including the stdlib for CPython 3.11.15
The interpreters are based on much the same codebase, thus the double release. This is a micro release, all APIs are compatible with the other 7.3 releases.
We recommend updating. You can find links to download the releases here:
We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work. If PyPy is helping you out, we would love to hear about it and encourage submissions to our blog via a pull request to https://github.com/pypy/pypy.org
We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: bug fixes, PyPy and RPython documentation improvements, or general help with making RPython's JIT even better.
If you are a python library maintainer and use C-extensions, please consider making a HPy / CFFI / cppyy version of your library that would be performant on PyPy. In any case, cibuildwheel supports building wheels for PyPy.
Footnotes
[0] Once a PR to pyperformance to use the _pickle module on PyPy is accepted.
What is PyPy?
PyPy is a Python interpreter, a drop-in replacement for CPython. It's fast (see the PyPy and CPython performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
We provide binary builds for:
- x86 machines on most common operating systems (Linux 32/64 bits, macOS 64 bits, Windows 64 bits)
- 64-bit ARM machines running Linux (aarch64) and macOS (macos_arm64)
PyPy supports Windows 32-bit, Linux PPC64 big- and little-endian, Linux ARM 32 bit, RISC-V RV64IMAFD Linux, and s390x Linux but does not release binaries. Please reach out to us if you wish to sponsor binary releases for those platforms. Downstream packagers provide binary builds for debian, Fedora, conda, OpenBSD, FreeBSD, Gentoo, and more.
What else is new?
For more information about the 7.3.22 release, see the full changelog.
Please update, and continue to help us make pypy better.
Cheers, The PyPy Team
Armin Ronacher
Before GitHub
GitHub was not the first home of my Open Source software. SourceForge was.
Before GitHub, I had my own Trac installation. I had Subversion repositories, tickets, tarballs, and documentation on infrastructure I controlled. Later I moved projects to Bitbucket, back when Bitbucket still felt like a serious alternative place for Open Source projects, especially for people who were not all-in on Git yet.
And then, eventually, GitHub became the place, and I moved all of it there.
It is hard for me to overstate how important GitHub became in my life. A large part of my Open Source identity formed there. Projects I worked on found users there. People found me there, and I found other people there. Many professional relationships and many friendships started because some repository, issue, pull request, or comment thread made two people aware of each other.
That is why I find what is happening to GitHub today so sad and so disappointing. I do not look at it as just the folks at Microsoft making product decisions I dislike. GitHub was part of the social infrastructure of Open Source for a very long time. For many of us, it was not merely where the code lived; it was where a large part of the community lived.
So when I think about GitHub’s decline, I also think about what came before it, and what might come after it. I have written a few times over the years about dependencies, and in particular about the problem of micro dependencies. In my mind, GitHub gave life to that phenomenon. It was something I definitely did not completely support, but it also made Open Source more inclusive. GitHub changed how Open Source feels, and later npm and other systems changed how dependencies feel. Put them together and you get a world in which publishing code is almost frictionless, consuming code is almost frictionless, and the number of projects in the world explodes.
That has many upsides. But it is worth remembering that Open Source did not always work this way.
A Smaller World
Before GitHub, Open Source was a much smaller world. Not necessarily in the number of people who cared about it, but in the number of projects most of us could realistically depend on.
There were well-known projects, maintained over long periods of time by a comparatively small number of people. You knew the names. You knew the mailing lists. You knew who had been around for years and who had earned trust. That trust was not perfect, and the old world had plenty of gatekeeping, but reputation mattered in a very direct way. We took pride (and got frustrated) when the Debian folks came and told us our licensing stuff was murky or the copyright headers were not up to snuff, because they packaged things up.
A dependency was not just a package name. It was a project with a history, a website, a maintainer, a release process, a lot of friction, and often a place in a larger community. You did not add dependencies casually, because the act of depending on something usually meant you had to understand where it came from.
Not all of this was necessarily intentional, but because these projects were comparatively large, they also needed to bring their own infrastructure. Small projects might run on a university server, and many of them were on SourceForge, but the larger ones ran their own show. They grouped together into larger collectives to make it work.
We Ran Our Own Infrastructure
My first Open Source projects lived on infrastructure I ran myself. There was a Trac installation, Subversion repositories, tarballs, documentation, and release files served from my own machines or from servers under my control. That was normal. If you wanted to publish software, you often also became a small-time system administrator. Georg and I ran our own collective for our Open Source projects: Pocoo. We shared server costs and the burden of maintaining Subversion and Trac, mailing lists and more.
Subversion in particular made this “running your own forge” natural. It was centralized: you needed a server, and somebody had to operate it. The project had a home, and that home was usually quite literal: a hostname, a directory, a Trac instance, a mailing list archive.
When Mercurial and Git arrived, they were philosophically the opposite. Both were distributed. Everybody could have the full repository. Everybody could have their own copy, their own branches, their own history. In principle, those distributed version control systems should have reduced the need for a single center. But despite all of this, GitHub became the center.
That is one of the great ironies of modern Open Source. The distributed version control system won, and then the world standardized on one enormous centralized service for hosting it.
What GitHub Gave Us
It is easy now to talk only about GitHub’s failures, of which there are currently many, but that would be unfair: GitHub was, and continues to be, a tremendous gift to Open Source.
It made creating a project easy and it made discovering projects easy. It made contributing understandable to people who had never subscribed to a development mailing list in their life. It gave projects issue trackers, pull requests, release pages, wikis, organization pages, API access, webhooks, and later CI. It normalized the idea that Open Source happens in the open, with visible history and visible collaboration. And it was an excellent and reasonable default choice for a decade.
But maybe the most underappreciated thing GitHub did was archival work: GitHub became a library. It became an index of a huge part of the software commons because even abandoned projects remained findable. You could find forks, and old issues and discussions all stayed online. For all the complaints one can make about centralization, that centralization also created discoverable memory. The leaders there once cared a lot about keeping GitHub available even in countries that were sanctioned by the US.
I know what the alternative looks like, because I was living it. Some of my earliest Open Source projects are technically still on PyPI, but the actual packages are gone. The metadata points to my old server, and that server has long stopped serving those files.
That was normal before the large platforms. A personal domain expired, a VPS was shut down, a developer passed away, and with them went the services they paid for. The web was once full of little software homes, and many of them are gone.[1]
npm and the Dependency Explosion
The micro-dependency problem was not just that people published very small packages. The hosted infrastructure of GitHub and npm made it feel as if there was no cost to create, publish, discover, install, and depend on them.
In the pre-GitHub world, reputation and longevity were part of the dependency selection process almost by necessity, and it often required vendoring. Plenty of our early dependencies were just vendored into our own Subversion trees by default, in part because we could not even rely on other services being up when we needed them and because maintaining scripts that fetched them, in the pre-API days, was painful. The implied friction forced some reflection, and it resulted in different developer behavior. With npm-style ecosystems, the package graph can grow faster than anybody’s ability to reason about it.
The problem that this type of thinking created also meant that solutions had to be found along the way. GitHub helped compensate for the accountability problem and it helped with licensing. At one point, the newfound influx of developers and merged pull requests left a lot of open questions about what the state of licenses actually was. GitHub even attempted to rectify this with their terms of service.
The thinking for many years was that if I am going to depend on some tiny package, I at least want to see its repository. I want to see whether the maintainer exists, whether there are issues, whether there were recent changes, whether other projects use it, whether the code is what the package claims it is. GitHub became part of the system that provides trust, and more recently it has even become one of the few systems that can publish packages to npm and other registries with trusted publishing.
That means when trust in GitHub erodes, the problem is not isolated to source hosting. It affects the whole supply chain culture that formed around it.
GitHub Is Slowly Dying
GitHub is currently losing some of what made it feel inevitable. Maybe that’s just the life and death of large centralized platforms: they always disappoint eventually. Right now people are tired of the instability, the product churn, the Copilot AI noise, the unclear leadership, and the feeling that the platform is no longer primarily designed for the community that made it valuable.
Obviously, GitHub also finds itself in the midst of the agentic coding revolution and that causes enormous pressure on the folks over there. But the site has no leadership! It’s a miracle that things are going as well as they are.
For a while, leaving GitHub felt like a symbolic move mostly made by smaller projects or by people with strong views about software freedom. I definitely cringed when Zig moved to Codeberg! But I now see people with real weight and signal talking about leaving GitHub. The most obvious one is Mitchell Hashimoto, who announced that Ghostty will move. Where it will move is not clear, but it’s a strong signal. But there are others, too. Strudel moved to Codeberg and so did Tenacity. Will they cause enough of a shift? Probably not, but I find myself on non-GitHub properties more frequently again compared to just a year ago.
One can argue that this is good: it is healthy for Open Source to stop pretending that one company should be the default home of everything. Git itself was designed for a world with many homes.
Dispersion Has a Cost
Going back to many forges, many servers, many small homes, and many independent communities will increase decentralization, and in many ways it will force systems to adapt. This can restore autonomy and make projects less dependent on the whims of Microsoft leadership. It can also allow different communities to choose different workflows. What's happening in pi's issue tracker currently is largely a result of GitHub's product choices not working in the present-day world of Open Source. It was built for engagement, not for maintainer sanity.
It can also make the web forget again. I quite like software that forgets because it has a cleansing element. Maybe the real risk of loss will make us reflect more on actually taking advantage of a distributed version control system.
But if projects move to something more akin to self-hosted forges, to their own self-hosted Mercurial or cgit servers, we run the risk of losing things that we don’t want to lose. The code might be distributed in theory, but the social context often is not. Issues, reviews, design discussions, release notes, security advisories, and old tarballs are fragile. They disappear much more easily than we like to admit. Mailing lists, which carried a lot of this in earlier years, have not kept up with the needs of today, and are largely a user experience disaster.
We Need an Archive
As much as I like the idea of things fading out of existence, we absolutely need libraries and archives.
Regardless of whether GitHub is here to stay or projects find new homes, what I would like to see is some public, boring, well-funded archive for Open Source software. Something with the power of an endowment or public funding to keep it afloat. Something whose job is not to win the developer productivity market but just to make sure that the most important things we create do not disappear.
The bells and whistles can be someone else’s problem, but source archives, release artifacts, metadata, and enough project context to understand what happened should be preserved somewhere that is not tied to the business model or leadership mood of a single company.
GitHub accidentally became that archive because it became the center of Open Source activity. Once that no longer holds, we should not assume some magic archival function will emerge or that GitHub will continue to function as such. We have already seen what happens when project homes are just personal servers and good intentions, and we have seen what happened to Google Code and Bitbucket.
I hope GitHub recovers, I really do, in part because a lot of history lives there and because the people still working on it inherited something genuinely important. But I no longer think it is responsible to let the continued memory of Open Source depend on GitHub remaining a healthy product.
The world before GitHub had more autonomy and more loss, and in some ways, we’re probably going to move back there, at least for a while. Whatever people want to start building next should try to keep the memory and lose the dependence. It should be easier to move projects, easier to mirror their social context, easier to preserve releases, and harder for one company’s drift to become a cultural crisis for everyone else.
I do not want to go back to the old web of broken tarball links and abandoned Trac instances. I also do not want Open Source to pretend that the last twenty years were normal or permanent. GitHub wrote a remarkable chapter of Open Source, and if that chapter is ending, the next one should learn from it and also from what came before.
[1] This is also a good reminder that we rely so very much on the Internet Archive for many projects of the time.
April 27, 2026
Python Engineering at Microsoft
Python Environments Extension for VS Code – April Update
Python Environments – April 2026 Release
We’re excited to announce the latest update to the Python Environments extension for Visual Studio Code. This release focuses on startup performance, reliability, and quality-of-life improvements for terminals and package management.
Faster startup
Activation is now noticeably snappier, especially on remote and containerized workspaces. We made three key changes:
Lazy manager discovery. Pipenv, pyenv, and poetry environments are no longer discovered eagerly on startup. Instead, detection is deferred until you actually interact with one of those managers – for example, by opening a project that uses a Pipfile or pyproject.toml with a poetry backend. This eliminates unnecessary work for the majority of users who rely on venv, uv, or conda. (#1423, #1408)
Faster environment resolution. The path from “extension activated” to “interpreter ready” is shorter. Resolution during startup and interpreter selection now completes with less overhead. (#1419)
Narrower default workspace scanning. The default search pattern for virtual environments was ./**/.venv, which triggered a recursive scan of the entire workspace tree. On large projects – and especially over Remote-SSH – this could cause the Python Environment Tools (PET) process to hang for 30+ seconds during configuration, leading to cascading timeouts and restart loops (see #1460, #1434). The default is now .venv and */.venv, which covers the standard layout without deep traversal. If you have virtual environments nested more than one level deep, you can add custom paths via the python-envs.workspaceSearchPaths setting. (#1419)
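For example, a workspace that nests environments under services/*/.venv might add something like the following to its settings.json. The exact value shape here is our assumption based on the setting name; check the extension's documentation for the authoritative format:

{
  // Hypothetical example: extend the default .venv search locations.
  "python-envs.workspaceSearchPaths": [".venv", "*/.venv", "services/*/.venv"]
}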
Improved reliability
PET crash recovery. When the PET process crashed mid-refresh, the extension could end up in a broken state with no environments visible. We now retry the refresh after a crash and handle empty or malformed responses defensively, so a transient PET failure no longer leaves you with a blank environment list. (#1442, #1447, #1444)
Conda base environment fix. After a window reload, the conda base environment could be incorrectly restored as a different named environment â making it appear that your interpreter selection had silently changed. This is now fixed. (#1412)
Environment updates and terminals
Auto-refreshing package lists. You no longer need to manually refresh the package view after running pip install or pip uninstall. The extension now watches for metadata changes in site-packages and updates the package list automatically. (#1420)
Multi-project terminal creation. In workspaces with multiple Python projects, creating a new terminal now prompts you to choose which project’s environment to activate, rather than picking one silently. (#1401)
PowerShell activation on Windows. Virtual environment activation via PowerShell could fail if the system execution policy blocked scripts. The extension now sets a process-scoped execution policy before running activation, so .ps1 activate scripts work out of the box without requiring system-wide policy changes. (#1414)
Try the update today and let us know how it works for you. If you run into issues, please file them on GitHub.
The post Python Environments Extension for VS Code- April Update appeared first on Microsoft for Python Developers Blog.
Talk Python to Me
#546: Self hosting apps for Python people
The cloud is convenient until it isn't. You upload your photos, sync your contacts, click through the cookie banners. Then prices go up again or you read about a family that lost their entire Google account over a medical photo sent to a doctor. At some point, the question shifts from "why would I run this myself?" to "why aren't I?"
My guest this week is Alex Kretzschmar, head of DevRel at Tailscale, longtime host of the Self-Hosted podcast, and co-founder of Linuxserver.io. We cover what self-hosting really means in 2026, the apps worth running yourself like Immich and Home Assistant, why Docker Compose ties it all together, and how Tailscale lets you reach any of it from anywhere, without opening a single port. If you've been thinking about pulling your digital life back behind your own walls, this is your roadmap.
Episode sponsors: Temporal (talkpython.fm/temporal-replay), Talk Python Courses (talkpython.fm/training)
Links from the show:
- Guest, Alex Kretzschmar: alex.ktz.me
- Bitflip podcast: bitflip.show
- Self-Hosted podcast (Alex's previous show): selfhosted.show
- Perfect Media Server: perfectmediaserver.com
- KTZ Systems on YouTube: youtube.com/@ktzsystems
- Linuxserver.io (co-founded by Alex): linuxserver.io
- "How Tailscale Works" blog post: tailscale.com/blog/how-tailscale-works
- Tailscale: tailscale.com
- Self-hosted apps discussed: Awesome Self-Hosted (github.com/awesome-selfhosted/awesome-selfhosted), Immich (immich.app), Home Assistant (home-assistant.io), Open Home Foundation (openhomefoundation.org), Plausible Analytics (plausible.io), Umami Analytics (umami.is), umami-analytics on PyPI (pypi.org/project/umami-analytics), Pi-hole (pi-hole.net), AdGuard Home (adguard.com/adguard-home), NextDNS (nextdns.io), Coolify (coolify.io), Docker + ufw (docs.docker.com/engine/network/packet-filtering-firewalls)
- Storage, backup & filesystem: OpenZFS (openzfs.org), ZFS.rent (zfs.rent), Backblaze (backblaze.com), Hetzner Storage Box (hetzner.com/storage/storage-box), DigitalOcean (digitalocean.com)
- Secrets management: OpenBao (openbao.org), HashiCorp Vault (hashicorp.com/products/vault), Bitwarden (bitwarden.com), 1Password (1password.com)
- Hardware mentioned: Proxmox VE (proxmox.com), Minisforum MS01 (minisforum.com), Zima Board / Zima OS (zimaspace.com)
- Other references: Cory Doctorow on "enshittification" (pluralistic.net), Linus Tech Tips' WAN Show (linustechtips.com)
- Watch this episode on YouTube: youtube.com/watch?v=1iAQRY7hiVA
- Episode #546 deep-dive: talkpython.fm/546
- Episode transcripts: talkpython.fm/episodes/transcript/546
EuroPython
Humans of EuroPython: Martin Borus
EuroPython wouldn’t exist if it weren’t for all the volunteers who put in countless hours to organize it. Whether it’s contracting the venue, ordering catering for a week-long conference, selecting and confirming talks & workshops, hundreds of hours of loving work have been put into making each edition the best one yet.
Today, we’d like to share an interview with Martin Borus, a member of the EuroPython 2025 Operations team and a returning conference contributor.
Thank you for making EuroPython such a welcoming conference, Martin!
Martin Borus, member of the Operations Team at EuroPython 2025 Prague & Remote
EP: What first inspired you to volunteer for EuroPython?
When visiting EuroPython - which was my first big Python conference - I got to know some volunteers. From the next year on I got gradually into helping. It seemed like a good idea to help.
EP: How did contributing to EuroPython impact your relationships within the community?
It was an entry point into the Python community. I met a lot of people I would not have met otherwise. Which led to a lot of interesting conversations and specific help for my journey into Python.
EP: Was there a moment when you felt your contribution really made a difference?
One of these moments comes from the Beginners’ Orientation sessions. I still remember the problems I had being alone on my first EuroPython that motivated me to give others a better start. I got feedback that this helped others to enjoy their first conference more.
EP: What's one thing you took away from contributing to EuroPython that you still use today?
The experiences gained in working with a team coming from all over Europe.
EP: If you could add one thing to make the volunteer experience even better, what would it be?
If there was a single thing, we’d have implemented it already, because each year the volunteers try to improve based on the experiences of the previous years.
EP: What tips do you have for people attending the conference?
For anybody coming to EuroPython, volunteer or attendee, I can highly recommend having a note on your phone about what topics you’re interested in. Collect questions in the weeks before the conference, so you can pull them out in conversations. I call this my “EuroPython wish list” and usually get large parts of it covered during the week.
EP: What would you say to someone considering volunteering at EuroPython but feeling hesitant?
Even if it’s at the cost of missing some of the talks, as a volunteer you are where the action is and you have a chance to get more experiences out of the conference.
EP: Thank you for your contribution, Martin!
PyCon
Asking the Key Questions: Q&A with the PyCon US 2026 keynote speaker Lin Qiao
This is a blog series where we're asking each of our PyCon US 2026 keynote speakers about their journey into tech, how excited they are for PyCon US, and any tips they can provide for an awesome conference experience! Here's our interview with Lin Qiao.
Without giving too many spoilers, tell us what your keynote is about?
Most AI products are built on rented land. If your competitor can make the same API call, you do not have a moat. I will break down what the teams pulling ahead are doing differently, with real examples from Cursor, Notion, and Vercel, and get into the hard tradeoffs nobody talks about enough.
How did you get started in tech/Python?
My path into tech started pretty naturally. I studied STEM all through high school and undergrad, so it was always the space I gravitated toward. Python specifically came later, during my PhD, where I started using it to run experiments and support my research papers.
What do you think the most important work you've ever done is?
Co-creating PyTorch was a defining chapter, because it became the foundation for how the world does AI research. But I think the most important work is really what I'm doing now. I founded Fireworks because I spent years watching companies outside Big Tech struggle to get AI into production. They had the ambition but not the infrastructure, and we're changing that.
Have you been to PyCon US before? What are you looking forward to?
PyTorch was built on Python, so this community is close to my heart. I am most looking forward to the hallway conversations. The best ideas come from talking to people who are deep in the work.
Any advice for first-time conference goers?
Talk to people. The sessions are recorded, but the people are only there for a few days. Go to the hallway track, sit at lunch tables where you do not know anyone, and if a talk resonated with you, go tell the speaker. That's how the best professional relationships start.
Can you tell us about an open source or open culture project that you think not enough people know about?
I am biased, but I think the broader open model ecosystem does not get the credit it deserves. Everyone knows the big names, but there is an incredible amount of work happening in specialized open models, evaluation frameworks, and fine-tuning tooling that is quietly making AI more accessible. The pattern I keep seeing is that the most impactful open-source projects are the ones that lower the barrier for the next person to build something better. That was true for PyTorch, and it is true today for the tools that help developers go from an off-the-shelf model to something truly customized for their use case.
Ari Lamstein
A Web App for Exploring Foreign-Born Population Trends
I just created a web app for exploring trends in the foreign-born population in the United States. The app lets you pick a location and see how the size of the foreign-born population there has changed over time. A core purpose of the project is to let people track how the foreign-born population changes as new immigration enforcement policies are implemented.
The app is built in Python with Streamlit, and the data comes from the American Community Survey (ACS) 1-year estimates. Everything is powered by the acs-nativity package I recently published to PyPI. The ACS currently covers 2005–2024, and the 2025 release is expected in September; I'll update the app as soon as the new data becomes available. Data is available for the nation, all states, and any county or city with at least 65,000 residents.
Here’s a screenshot from the app providing data on Chicago, Illinois:
Chicago's foreign-born population has risen and fallen sharply at different points between 2005 and 2024. Last September President Trump launched an immigration enforcement action in the city called Operation Midway Blitz. When the 2025 ACS estimates come out in September, we'll get the first chance to see whether that enforcement action shows up in the data – and how any change compares with the kinds of fluctuations Chicago has experienced in the past.
Exploratory Data Analysis
In addition to the graphs generated by the acs-nativity package, the app provides two additional tabs to help you explore nativity trends: the Table tab and the Compare Years tab.
Table Tab
The Table tab shows the full dataset for the selected geography level, and you can sort by any column. Sorting makes it easy to spot outliers. For example, in 2024 the location with the highest share of foreignâborn residents was Hialeah, Florida (77.1%), while the lowest was Muskingum County, Ohio (0.7%):
Compare Years Tab
The Compare Years tab lets you create a table showing how a demographic has changed between two years. This often surfaces surprising results. For example, between 2023 and 2024, New York City saw an estimated increase of 205,767 in the Native-born population – slightly larger than California’s increase of 204,056, despite California’s population being several times larger:
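Producing a comparison like that boils down to joining two years of data and taking a difference. Another sketch under hypothetical column names (location, year, native_born):

>>> y23 = df[df["year"] == 2023][["location", "native_born"]]
>>> y24 = df[df["year"] == 2024][["location", "native_born"]]
>>> both = y24.merge(y23, on="location", suffixes=("_2024", "_2023"))
>>> both["change"] = both["native_born_2024"] - both["native_born_2023"]
>>> both.sort_values("change", ascending=False).head()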
Try the App
If you're interested in how these patterns play out in your own community, you can explore the app here.
Rodrigo GirĂŁo SerrĂŁo
TIL #143 – Resolve a lazy import manually
Learn how to work around the Python machinery to resolve an explicit lazy import manually.
A couple of articles ago I wrote about how you could inspect a lazy import.
Apparently, you can use a similar trick to check the attributes and methods that a lazy import has:
>>> lazy import json
>>> dir(globals()["json"])
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'resolve']
Apart from a large number of dunder methods and dunder attributes, you'll find the method resolve.
You can run help(globals()["json"].resolve) to get the help text on that method:
Help on built-in function resolve:
resolve() method of builtins.lazy_import instance
resolves the lazy import and returns the actual object
This shows that it's the method resolve that resolves a lazy import.
If you call the method, you can get access to the resolved module:
>>> lazy import json
>>> resolved_json = globals()["json"].resolve()
>>> resolved_json
<module 'json' from '/Users/rodrigogs/.local/share/uv/python/cpython-3.15.0a8-macos-aarch64-none/lib/python3.15/json/__init__.py'>
After calling resolve, the lazy module doesn't disappear automatically:
>>> globals()["json"]
<lazy_import 'json'>
This shows that the mechanism responsible for reification most likely calls the method resolve and then rebinds the module's name to the object returned by resolve.
In a way, it's as if the reification process ran something like
globals()["json"] = globals()["json"].resolve()
In hindsight, this isn't too surprising. After all, Python tends to be very consistent. The only mystery that remains is what triggers the reification process. How is it that Python can detect when something touches the lazy import?
Real Python
How to Conceptualize Python Fundamentals for Greater Mastery
Struggling to conceptualize Python fundamentals is a common problem learners face. If you're unable to put a fundamental concept into perspective and form a clear mental picture of what it's about, it'll be difficult to understand and apply it.
In this guide, you'll walk through a framework of steps to help you better conceptualize Python fundamentals. This process is helpful for Python developers and learners at any experience level, but especially for beginners. If you are just starting out, this guide will help you build a solid understanding of the basics.
You might want to set aside twenty minutes or so to read through the tutorial, and another thirty minutes to practice on a few key concepts. You should also gather a list of difficult topics, your preferred learning resources, and a note-taking app or pen and paper.
Click the link below to download a free cheat sheet that covers the framework steps you'll walk through in this guide:
Get Your Cheat Sheet: Click here to download a free PDF that outlines the framework of steps for conceptualizing Python fundamentals.
Take the Quiz: Test your knowledge with our interactive “How to Conceptualize Python Fundamentals for Greater Mastery” quiz. You'll receive a score upon completion to help you track your learning progress:
Interactive Quiz
How to Conceptualize Python Fundamentals for Greater Mastery: Check your understanding of a framework for conceptualizing Python fundamentals, from defining concepts to comparing similar ideas.
Step 1: Define the Concept in Your Own Words
Begin by briefly describing the concept in your own words. You can write your definition in the downloadable worksheet provided with this tutorial. Note that writing is a powerful tool for reinforcing learning, as educator and former Rutgers University professor Janet Emig asserted in her paper, Writing as a Mode of Learning.
Answer Key Questions for Defining a Concept
As a framework for your definition, consider these key questions:
- What: What is a short description of the concept?
- Why: Why is the concept important in the broader Python context?
- How: How is the concept used in a Python program?
These questions will help you establish a core understanding of the concept you're learning.
You might feel intimidated when you're trying to define a Python concept. If you need help, there are many resources that can assist you. Real Python's Reference section has concise definitions of Python keywords, built-in types, standard library modules, and more to help you build your own descriptions.
If you're a visual learner, using an illustration can be a powerful way to enhance your understanding. In addition to a written definition, you can draw a picture or diagram to illustrate the concept. For example, the Variables in Python: Usage and Best Practices tutorial shows some example images of how you might picture variables. If you look at the Lists vs Tuples in Python tutorial, you can see a diagram of a Python list.
While pictures can be helpful, being able to conceptualize doesn't necessarily mean you have to think visually. There are different thinking styles. Some researchers suggest that people can be visual or verbal thinkers. Pattern-based thinking is another style. Several of the tips in this tutorial encourage you to explore different aspects of these styles, depending on which works best for you.
View Examples of Concept Definitions
You might find a couple of examples helpful in understanding how to define difficult concepts. Suppose you're studying variables. Here are possible responses to the key questions:
- What: A variable is a name that points to an object stored in the program's memory.
- Why: Variables are key for data processing.
- How: Assigning a value to a variable using the assignment operator (=) allows you to access your program's data in a user-friendly way. You can then access and change the value by name throughout the program as needed.
This description provides a concise summary of what a variable is, why it matters, and how to use one. You can also include an example of variable usage as an addendum to your definition:
>>> age = 25
Here, you created a variable called age and assigned it a value of 25. From now on, you can use the variable name age to access, modify, or use the variable's value.
Or, you might be learning about lists. Your definitions could look like this:
- What: A list is a sequence of values or objects.
- Why: Working with sequences of items is a common, foundational task in programming. Python lists make this important work easier.
- How: You can create a list by writing a pair of square brackets, with a comma-separated sequence of items inside them. Assign the list to a variable to use it throughout your program.
Here's a short Python list that demonstrates the points in the definitions above:
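The snippet itself is in the full article, but an illustrative example in that spirit could be:

>>> colors = ["red", "green", "blue"]
>>> colors
['red', 'green', 'blue']

The square brackets and commas create the list, and assigning it to colors lets you use it throughout your program.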
Read the full article at https://realpython.com/conceptualize-python-fundamentals/ »
Django Weblog
It's time to redesign djangoproject.com
If you've felt like djangoproject.com could use a refresh, you're not alone. The site has served the community well for a long time, and it's beloved by a lot of people, but it doesn't reflect where Django is today or who we want to reach. We've been working on a redesign behind the scenes, and we want to share where we're headed and how you can get involved.
Why a redesign
The case has been building for a while. The excellent user research report from 20tab documented in detail what current site users struggle with, and the more recent community discussion on homepage redesigns on the forum focuses on the image issue.
In her recent talk Debunking Django Myths, Sarah Boyce, one of our Django Fellows who helps maintain the project, walked through the gap between how Django is perceived and what it actually offers in 2026. Our website is one of the places where the gap is widest, and we need to close it.
It's harder than it looks on the surface: the site has to serve as a showcase of the value of Django for newcomers, as a central information space for our users, as an online and in-person community hub, and as a fundraising and sustainability tool for our Django Software Foundation.
How we're approaching this
We're planning the work in three phases.
Discovery and groundwork. This is where we're at right now. Before anything gets designed, we need clarity on what the site should communicate: Django's value, who we're speaking to, and what success looks like. That means a marketing strategy (at least at the big-picture level), possibly additional user research focused on new users, definitely site analytics so we know how different aspects of the site are working, and a redesign brief we can share with UX and visual design experts. We also need to build up capacity in UX, Information Architecture (IA), and marketing, since those areas of expertise are essential for the success of the website but not well represented in our working groups.
Design. From there we'll move into IA, mockups, and low-fidelity prototypes. We expect this visual work will be component-driven, producing a small design system and pattern library that can support a section-by-section rollout rather than a big-bang launch. The homepage is the most visible surface and a natural focus, but it might be easier for our volunteers to first look at more specific sections (docs, donation flows, community) before tackling the more complex multi-purpose areas.
Build. For that, we want to work with our existing volunteer contributors as much as possible, so implementation will be incremental against mockups that reflect the long-term goal. This keeps the site working and evolving while we make progress on the design.
Who's doing the work
We hope to do most of this with existing volunteers: the Website working group, the Accessibility team, and the Social Media working group. We'll work with paid contractors for specific tasks if Django Software Foundation finances allow. A project this size really needs both: the continuity of volunteers who know Django, our community, and our Foundation, and focused professional time for the pieces that need it.
Where you come in
If you have relevant experience in any of the following, we'd genuinely love to hear from you:
- UX and interaction design
- User research
- Visual design
- Information Architecture, content strategy, or copywriting
- Marketing
Check out the Django forum thread we're using for ongoing updates, come say hi in DMs, or chime in on the tracking issue for this work. Our Discord server is a good place to reach out too.
And separately, a good redesign will cost real money. We'd like some of this work to be handled by paid contractors where it makes sense, and that depends on what the Foundation can afford. If you're in a position to support the DSF financially, it directly helps us make that possible. Thanks for caring about this! Let's make djangoproject.com as good as the framework and community it represents.
Debunking Django Myths - Sarah Boyce @ Python Unplugged on PyTV