Top Python package picks

A good Python package should be functional, well-documented, and easy to use. Finding the right package for a job can be challenging given the over 300 thousand packages on PyPI, the Python package repository. I compiled a list of my favorite packages to assist any Python coder. You may be familiar with some of these, while others might be hidden gems. All packages listed can be found on PyPI and the Posit Package Manager.
- Data Management
- Utilities
- Exploratory Data Analysis
- Data Visualization
- Web
- Machine Learning
- Text Analysis
- GUI
Data Management
Polars
Polars is an excellent choice for large datasets. It is built for speed and efficient memory usage, enabling datasets too big for Pandas to be loaded and manipulated. Polars has a consistent yet familiar API, making the transition from Pandas easy. As an example, Polars loaded a 50GB file with 1 billion rows in 143 seconds; Pandas couldn’t load the dataset at all.
import polars as pl

q = (
    pl.scan_csv("iris.csv")
    .filter(pl.col("sepal_length") > 5)
    .group_by("species")  # `groupby` was renamed `group_by` in recent Polars
    .agg(pl.all().sum())
)
df = q.collect()
xlwings
xlwings is one of the most valuable packages anyone working with spreadsheets can install. With xlwings, spreadsheets (Excel and Google Sheets) can be created, loaded, altered, and automated. The package has additional features allowing Python to be called from Excel, workbooks to be exposed through a REST API, and array formulas to be built.
DuckDB
DuckDB is an in-process SQL OLAP database management system with a fully featured Python API. One of the best features of DuckDB is allowing users to write SQL queries directly on a variety of dataframes, including both Polars and Pandas. Other features include creating persistent tables from dataframes and registering dataframes as virtual tables.
Utilities
python-fsutil
python-fsutil is a great package for anyone who works with file systems. This package contains a ton of functions that make working with directories, paths, and files a breeze. Many mundane file system operations, like cleaning directories, creating zip files, and getting file creation dates, can be automated in a line or two.
Pins
Pins is a terrific way to store, share, version, and publish Python data products (datasets, models, etc.). Pins is compatible with tools most of us already use, like network drives, Amazon S3, Posit Connect, DropBox, Google Cloud Storage, and more. Pins makes tracking changes and re-running analysis on historical data efficient.
from pins import board_temp
from pins.data import mtcars

# Create a temporary board and write a pin containing a CSV file to it
board = board_temp()

# Save the dataset with detailed information
board.pin_write(
    mtcars,
    name="mtcars2",
    type="csv",
    description=(
        "Data extracted from the 1974 Motor Trend US magazine, "
        "comprising fuel consumption and 10 aspects of automobile design "
        "and performance for 32 automobiles (1973–74 models)."
    ),
    metadata={
        "source": "Henderson and Velleman (1981), "
        "Building multiple regression models interactively. "
        "Biometrics, 37, 391–411."
    },
)

# Read the pin back from the board, using the name it was written under
board.pin_read("mtcars2")
Loguru
Loguru makes logging in Python far more pleasant, with incredibly simple configuration that doesn't sacrifice customizability.
Pendulum
Pendulum improves on the Python datetime library with added functionality like time zone management, enhanced parsing, easier calculations, and additional attributes. This package is a must for anyone working with datetimes across different time zones.
Exploratory Data Analysis
DataPrep
DataPrep is my go-to for doing exploratory analysis on a dataset. In less than three lines of code, you can generate a full report on a dataset, covering everything from missing data to correlations. DataPrep also includes functions to clean and standardize datasets. This package saves a ton of time on some of the traditionally least interesting work.
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show()
ydata-profiling
ydata-profiling has a stated goal of providing one-line exploratory data analysis, and to that end it largely succeeds. A single line of code offers in-depth dataset analysis, helping you understand your data. ydata-profiling also has great functionality for exploring time series and text data.
Data Visualization
Altair
Altair is a simple but powerful plotting library capable of producing elegant and interactive charts. Altair is handy when data transformations and filtering are required to generate a plot, thanks to its built-in support for these actions.
import altair as alt
from vega_datasets import data

source = data.cars()

alt.Chart(source).mark_circle(size=60).encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    tooltip=['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon']
).interactive()
Pygal
Pygal is another great option for those needing to easily create beautiful interactive plots. Pygal has a deep roster of chart types compared to some alternatives.
Web
Shiny for Python
Shiny, R’s primary web app framework, is now available for Python. Shiny enables Python users to build interactive and reactive web applications that remain performant. This is an excellent choice for anyone who needs to allow user input and response without refreshing the entire application.
Beautiful Soup 4
Beautiful Soup is a package that extracts data from HTML and XML files. It is commonly used to parse pages retrieved with the Requests package. With Beautiful Soup you can easily ingest data directly from the internet.
import pandas as pd
import requests
from bs4 import BeautifulSoup

response = requests.get(
    'https://en.wikipedia.org/wiki/List_of_U.S._states_by_median_home_price'
)
tbl = BeautifulSoup(response.text, 'html.parser').find('table', {'class': 'wikitable'})
tbl_df = pd.read_html(str(tbl))[0]
FastAPI
FastAPI allows users to build fully functional and deployable APIs with minimal code. FastAPI helps automate much of the work in creating an API, like documentation and data validation.
Machine Learning
vetiver
Vetiver is a package for individuals and teams that want to instill good MLOps practices. Versioning, deploying, and monitoring ML models is painless with vetiver. Several model frameworks are supported, including scikit-learn and PyTorch.
from vetiver import VetiverModel
from vetiver.data import mtcars
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(mtcars.drop(columns="mpg"), mtcars["mpg"])

v = VetiverModel(
    model,
    model_name="cars_linear",
    prototype_data=mtcars.drop(columns="mpg"),
)
v.description
ELI5
ELI5 helps visualize and debug ML models, explaining their results and surfacing issues in model behavior. For those building and debugging ML models, this package can save a ton of time.
Text Analysis
FlashText
FlashText is a fast alternative to regular expressions for anyone needing to run keyword search-and-replace operations on large datasets. The primary benefit of FlashText is its speed at scale; the benefits are muted on smaller datasets.
NLTK
Natural Language Toolkit, or NLTK for short, is Python’s standard natural language processing framework. Anyone planning to do something involving text analysis will likely be using NLTK, thanks to its ubiquity, diverse feature set, and documentation. NLTK also works well with other packages, making tasks like building sentiment models as easy as possible.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# Use a RoBERTa model fine-tuned for sentiment
model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

sentence = (
    "I really love Posit Package Manager. "
    "It is a great tool for managing all of "
    "my R and Python packages."
)

# Run the model on the sentence to get sentiment scores
encoded_text = tokenizer(sentence, return_tensors='pt')
output = model(**encoded_text)
scores = softmax(output[0][0].detach().numpy())
scores_dict = {
    'roberta_neg': scores[0],
    'roberta_neu': scores[1],
    'roberta_pos': scores[2],
}
GUI
PyQt6
PyQt6 should be the default for anyone who needs to build a Python GUI application. PyQt6 has an optional designer and includes various components, including a web browser, SQL database support, and more. PyQt6 is known for its versatility, light weight, deep API, and wide array of learning resources.
import sys
from PyQt6.QtWidgets import (
    QApplication, QComboBox, QGridLayout, QLineEdit,
    QMainWindow, QPushButton, QWidget,
)

class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        # Set up app title and layout
        self.setWindowTitle("Sample App")
        layout = QGridLayout()
        # Add a button
        button = QPushButton('First Button')
        # Input field
        self.input_field = QLineEdit()
        # Drop-down menu
        self.cb = QComboBox()
        self.cb.addItems(('Option 1', 'Option 2', 'Option 3'))
        layout.addWidget(button, 0, 0)
        layout.addWidget(self.input_field, 1, 0)
        layout.addWidget(self.cb, 2, 0)
        # Attach the layout to the window via a central widget
        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

app = QApplication(sys.argv)
window = MainWindow()
window.show()
app.exec()
fbs
fbs makes creating a desktop application easy. It makes application deployment simple, sometimes taking seconds to generate an installer for a PyQt6 application. Anyone who desires to build and deploy Python-based applications should take a look at pairing PyQt6 with fbs.
Posit Package Manager
My favorite way to manage Python packages is with Posit Package Manager, which provides full mirrors of PyPI, Bioconductor, and CRAN. With Posit Package Manager, I have been able to create curated repositories and take snapshots to avoid broken package dependencies and ensure project reproducibility. Another useful feature is that custom Python packages can be added and installed using pip.