
Building realistic fake datasets with Pointblank

Written by Rich Iannone
2026-03-09
[Header illustration: a data table labeled "Pointblank" with names, emails, and locations such as Japan and Germany; a cartoon hand holds a large black mustache on a stick over the table, suggesting "fake" or "masked" data, against a background of faded international flags.]

Every data practitioner eventually runs into the same problem: you need data, but you don’t have it. Maybe the production database is locked behind access controls. Maybe the dataset you need doesn’t exist yet because the feature hasn’t shipped. Maybe you’re writing tests, building a demo, or teaching a workshop and need something that looks real but carries zero risk. Whatever the reason, the need for synthetic data is everywhere, and it comes up far more often than most of us would like to admit.

The good news is that fake data can be just as good as the real thing. If your synthetic data has the right shape, the right types, the right distributions, and the right internal consistency, it can stand in for real data in many situations.

Pointblank is a Python library for data validation, but over the last several releases (v0.20.0, v0.21.0, and v0.22.0), we’ve been building out a complementary capability: data generation. The idea is simple. You define a schema (the columns, their types, and their constraints), and Pointblank produces n rows of data that conform to it. The result is a Polars or Pandas DataFrame, ready to use.

In this post, I’ll walk through the generate_dataset() function in some depth, show how to build realistic datasets for common scenarios (including a customer data example you might actually use), and highlight the country-specific and coherence features that make the generated data feel surprisingly real.

Note

All examples here use pb.preview() to display results, which renders a compact HTML table showing the head and tail of the dataset. If you want to follow along, install Pointblank with pip install pointblank and make sure you have Polars available.

Starting simple: A schema and a dataset

Everything begins with a Schema object. You declare columns as keyword arguments, using field specification functions to describe each one:

import pointblank as pb

schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    passed=pb.bool_field(p_true=0.7),
)

pb.preview(pb.generate_dataset(schema, n=10, seed=23))
Polars · 10 rows · 3 columns

| # | id (Int64) | score (Float64) | passed (Boolean) |
|---|---|---|---|
| 1 | 5749 | 92.48652516259452 | False |
| 2 | 2368 | 94.86057779931771 | False |
| 3 | 1279 | 89.24333440485793 | False |
| 4 | 6025 | 8.355067683068363 | True |
| 5 | 7942 | 59.20272268857353 | True |
| 6 | 7212 | 42.37474082349614 | True |
| 7 | 9684 | 53.00880101180064 | True |
| 8 | 6866 | 13.030294124748053 | True |
| 9 | 3134 | 19.19971575392927 | True |
| 10 | 4145 | 44.4573573873013 | True |

Three columns, three types, ten rows. The seed=23 parameter makes the output reproducible. The id column has unique integers in the range 1000–9999, score is a uniform float between 0 and 100, and passed is True about 70% of the time.

This is already useful for quick prototyping, but the real power shows up when you start using string presets.

String presets: Names, emails, cities, and more

The string_field() function accepts a preset parameter that taps into Pointblank’s built-in data generators. There are over 40 presets covering personal information, locations, business data, internet artifacts, and more. Here’s a small example:

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    company=pb.string_field(preset="company"),
)

pb.preview(pb.generate_dataset(schema, n=10, seed=23))
Polars · 10 rows · 4 columns

| # | name (String) | email (String) | city (String) | company (String) |
|---|---|---|---|---|
| 1 | Patricia Williams | patricia_williams@yandex.com | Lubbock | Innovative Systems Solutions |
| 2 | Andrea Mitchell | a_mitchell@gmail.com | Anaheim | Sterling Engineering |
| 3 | Maria Valentine | maria.valentine54@gmail.com | Phoenix | Goldman Sachs |
| 4 | Virginia Walker | virginia.walker@outlook.com | Denver | Evans Group |
| 5 | Brenda Lopez | b_lopez@yahoo.com | San Antonio | Goodwin and Garrett |
| 6 | Lauren Davis | l_davis@outlook.com | New York | Hayes and Kennedy |
| 7 | John West | j_west@zoho.com | Charlotte | UnitedHealth Group |
| 8 | Claire Jackson | claire202@outlook.com | Irvine | First Ventures Group |
| 9 | Ariana Wood | ariana_wood@zoho.com | Seattle | Cox Research |
| 10 | Michael Simmons | michaelsimmons@mail.com | Denver | Williams Industries |

Notice that the email addresses aren’t random gibberish. They’re derived from the person’s name. This is one of Pointblank’s coherence systems at work, and it activates automatically when certain presets appear together in the same schema.
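To give a feel for what this coherence means, here is a toy sketch of deriving an email from a generated name. This is my own illustration of the idea, not Pointblank's actual algorithm; the local-part styles and providers are just examples:

```python
import random

def derive_email(full_name: str, rng: random.Random) -> str:
    """Toy illustration of name-coherent email generation
    (not Pointblank's actual implementation)."""
    first, last = full_name.lower().split()
    # Pick one of a few common local-part styles, then a provider.
    local = rng.choice([
        f"{first}_{last}",                # e.g. patricia_williams
        f"{first}.{last}",                # e.g. patricia.williams
        f"{first[0]}_{last}",             # e.g. p_williams
        f"{first}{rng.randint(1, 999)}",  # e.g. patricia202
    ])
    provider = rng.choice(["gmail.com", "outlook.com", "yahoo.com"])
    return f"{local}@{provider}"

rng = random.Random(23)
print(derive_email("Patricia Williams", rng))
```

The point is simply that the email is a function of the name drawn earlier in the row, rather than an independent random string.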

Building a realistic customer dataset

Let’s put these pieces together for a scenario that comes up constantly in practice: generating a table of customer records. This is the kind of dataset you might need for a dashboard prototype, a workshop exercise, or integration testing of a CRM pipeline.

from datetime import date

schema = pb.Schema(
    customer_id=pb.int_field(min_val=10000, max_val=99999, unique=True),
    first_name=pb.string_field(preset="first_name"),
    last_name=pb.string_field(preset="last_name"),
    email=pb.string_field(preset="email"),
    phone=pb.string_field(preset="phone_number"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    postcode=pb.string_field(preset="postcode"),
    signup_date=pb.date_field(
        min_date=date(2022, 1, 1),
        max_date=date(2025, 12, 31),
    ),
    is_active=pb.bool_field(p_true=0.8),
    lifetime_spend=pb.float_field(min_val=0.0, max_val=5000.0),
)

customers = pb.generate_dataset(schema, n=50, seed=23)

pb.preview(customers)
Polars · 50 rows · 11 columns

| # | customer_id (Int64) | first_name (String) | last_name (String) | email (String) | phone (String) | city (String) | state (String) | postcode (String) | signup_date (Date) | is_active (Boolean) | lifetime_spend (Float64) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 47999 | Paul | Woods | paulwoods@hotmail.com | (512) 899-4802 | Lubbock | Texas | 79468 | 2023-08-17 | False | 4624.326258129726 |
| 2 | 20951 | Mark | Smith | mark684@icloud.com | (310) 986-0270 | Anaheim | California | 92873 | 2022-06-21 | False | 4743.028889965885 |
| 3 | 12238 | Willow | Fowler | willowfowler@gmail.com | (623) 938-2304 | Phoenix | Arizona | 85032 | 2022-02-04 | False | 4462.166720242896 |
| 4 | 87598 | Roger | Graham | roger.graham@zoho.com | (970) 514-7904 | Denver | Colorado | 80232 | 2025-04-27 | True | 417.7533841534181 |
| 5 | 50205 | Karen | Horn | karen.horn70@gmail.com | (210) 987-2966 | San Antonio | Texas | 78271 | 2023-09-21 | True | 2960.1361344286765 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 46 | 72136 | Hannah | Weaver | hannahweaver@yahoo.com | (419) 998-5523 | Columbus | Ohio | 43255 | 2022-06-25 | True | 1377.8223075007618 |
| 47 | 33282 | Martin | Ramos | martin_ramos@yahoo.com | (951) 234-6078 | San Jose | California | 95170 | 2024-08-28 | True | 2864.109474442189 |
| 48 | 73318 | Audrey | Jackson | audrey_jackson@aol.com | (252) 401-8878 | Charlotte | North Carolina | 28226 | 2022-12-30 | False | 4103.315904362622 |
| 49 | 87412 | Christina | Cannon | ccannon13@aol.com | (320) 486-6471 | St. Paul | Minnesota | 55195 | 2024-09-16 | True | 1654.024239966494 |
| 50 | 68648 | Melissa | Nelson | m_nelson@yandex.com | (260) 590-0851 | Bloomington | Indiana | 47493 | 2025-04-24 | True | 1848.269660030496 |

What we get here is 50 rows of plausible customer data. The city, state, and postcode are coherent within each row (a customer in "San Antonio" will have a Texas state code and a valid Texas zip code). The email is derived from the customer’s name. The phone number matches the region. None of this required any manual wiring. Pointblank detects the preset combinations and applies the appropriate coherence rules.
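The mechanics behind address coherence can be illustrated with a toy sketch: draw one location record per row, then derive every address field from that single record. The lookup table below is hand-made and hypothetical (Pointblank ships far richer locale data), but it shows why city, state, postcode, and area code agree within a row:

```python
import random

# Tiny, hand-made location table (illustrative only). Each record keeps
# city, state, a zip prefix, and a phone area code together.
LOCATIONS = [
    {"city": "San Antonio", "state": "Texas", "zip_prefix": "782", "area": "210"},
    {"city": "Denver", "state": "Colorado", "zip_prefix": "802", "area": "720"},
    {"city": "Seattle", "state": "Washington", "zip_prefix": "981", "area": "206"},
]

def coherent_address(rng: random.Random) -> dict:
    """Draw one location record, then derive every address field from it,
    so city/state/postcode/phone agree within the row."""
    loc = rng.choice(LOCATIONS)
    return {
        "city": loc["city"],
        "state": loc["state"],
        "postcode": loc["zip_prefix"] + f"{rng.randint(0, 99):02d}",
        "phone": f"({loc['area']}) {rng.randint(200, 999)}-{rng.randint(0, 9999):04d}",
    }

print(coherent_address(random.Random(23)))
```

Because everything flows from one record, the fields can never contradict each other, which is the essence of row-level coherence.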

Extending with Polars

Since the default output is a Polars DataFrame, you can immediately layer on transformations. Let’s add a loyalty tier based on lifetime spend:

import polars as pl

customers_tiered = customers.with_columns(
    pl.when(pl.col("lifetime_spend") >= 3000)
    .then(pl.lit("Gold"))
    .when(pl.col("lifetime_spend") >= 1000)
    .then(pl.lit("Silver"))
    .otherwise(pl.lit("Bronze"))
    .alias("loyalty_tier")
)

pb.preview(customers_tiered)
Polars · 50 rows · 12 columns

| # | customer_id (Int64) | first_name (String) | last_name (String) | email (String) | phone (String) | city (String) | state (String) | postcode (String) | signup_date (Date) | is_active (Boolean) | lifetime_spend (Float64) | loyalty_tier (String) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 47999 | Paul | Woods | paulwoods@hotmail.com | (512) 899-4802 | Lubbock | Texas | 79468 | 2023-08-17 | False | 4624.326258129726 | Gold |
| 2 | 20951 | Mark | Smith | mark684@icloud.com | (310) 986-0270 | Anaheim | California | 92873 | 2022-06-21 | False | 4743.028889965885 | Gold |
| 3 | 12238 | Willow | Fowler | willowfowler@gmail.com | (623) 938-2304 | Phoenix | Arizona | 85032 | 2022-02-04 | False | 4462.166720242896 | Gold |
| 4 | 87598 | Roger | Graham | roger.graham@zoho.com | (970) 514-7904 | Denver | Colorado | 80232 | 2025-04-27 | True | 417.7533841534181 | Bronze |
| 5 | 50205 | Karen | Horn | karen.horn70@gmail.com | (210) 987-2966 | San Antonio | Texas | 78271 | 2023-09-21 | True | 2960.1361344286765 | Silver |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 46 | 72136 | Hannah | Weaver | hannahweaver@yahoo.com | (419) 998-5523 | Columbus | Ohio | 43255 | 2022-06-25 | True | 1377.8223075007618 | Silver |
| 47 | 33282 | Martin | Ramos | martin_ramos@yahoo.com | (951) 234-6078 | San Jose | California | 95170 | 2024-08-28 | True | 2864.109474442189 | Silver |
| 48 | 73318 | Audrey | Jackson | audrey_jackson@aol.com | (252) 401-8878 | Charlotte | North Carolina | 28226 | 2022-12-30 | False | 4103.315904362622 | Gold |
| 49 | 87412 | Christina | Cannon | ccannon13@aol.com | (320) 486-6471 | St. Paul | Minnesota | 55195 | 2024-09-16 | True | 1654.024239966494 | Silver |
| 50 | 68648 | Melissa | Nelson | m_nelson@yandex.com | (260) 590-0851 | Bloomington | Indiana | 47493 | 2025-04-24 | True | 1848.269660030496 | Silver |

Or compute a summary by state:

pb.preview(
    customers_tiered
    .group_by("state", "loyalty_tier")
    .agg(
        pl.col("customer_id").count().alias("count"),
        pl.col("lifetime_spend").mean().alias("avg_spend"),
    )
    .sort("state", "loyalty_tier")
)
Polars · 35 rows · 4 columns

| # | state (String) | loyalty_tier (String) | count (UInt32) | avg_spend (Float64) |
|---|---|---|---|---|
| 1 | Arizona | Gold | 2 | 3882.413633247243 |
| 2 | Arizona | Silver | 1 | 2860.2339059589044 |
| 3 | California | Bronze | 3 | 561.8318745352304 |
| 4 | California | Gold | 2 | 4798.352513930336 |
| 5 | California | Silver | 3 | 2503.2021274153226 |
| … | … | … | … | … |
| 31 | Texas | Bronze | 1 | 978.0392640195001 |
| 32 | Texas | Gold | 3 | 3904.2742899489526 |
| 33 | Texas | Silver | 3 | 2140.2056508843366 |
| 34 | Washington | Bronze | 2 | 623.879217971278 |
| 35 | Washington | Gold | 1 | 3671.0453939174777 |

This is the workflow I keep coming back to! Use Pointblank to generate the raw material, then reach for Polars to shape it into whatever you actually need.

Country-specific data

One of the features I’m most excited about is country-specific data generation. Pointblank ships with locale data for 100 countries, covering names, cities, states/provinces, postcodes, phone number formats, and much more. Switching locales is a single parameter (country=); here’s an example that gets person data for Germany ("DE"):

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    phone=pb.string_field(preset="phone_number"),
)

pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="DE"))
Polars · 8 rows · 5 columns

| # | name (String) | email (String) | city (String) | state (String) | phone (String) |
|---|---|---|---|---|---|
| 1 | Ignaz Schulze | ignazschulze@freenet.de | Potsdam | Brandenburg | (0335) 150-6730 |
| 2 | Sandra Schneider | sandra922@mail.de | Halle (Saale) | Sachsen-Anhalt | (0391) 478-3743 |
| 3 | Antje Jung | antje_jung@yahoo.de | Frankfurt am Main | Hessen | (069) 188-2883 |
| 4 | Jennifer Opitz | j_opitz@gmx.de | Leipzig | Sachsen | (0371) 162-0756 |
| 5 | Eva Lehmann | evalehmann@outlook.de | Cologne | Nordrhein-Westfalen | (0231) 961-3846 |
| 6 | Alexandra Koch | alexandra.koch@outlook.de | Berlin | Berlin | (030) 489-8041 |
| 7 | Christiane Becker | cbecker@gmail.com | Stuttgart | Baden-Württemberg | (0711) 258-6321 |
| 8 | Thomas Mertens | thomas.mertens@posteo.de | Magdeburg | Sachsen-Anhalt | (0345) 881-3877 |

What you see in the above dataset are German names, cities, and phone numbers (where area codes match the locations). Switch to "AU" and you get Australian data:

pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="AU"))
Polars · 8 rows · 5 columns

| # | name (String) | email (String) | city (String) | state (String) | phone (String) |
|---|---|---|---|---|---|
| 1 | Ethan Ryan | ethanryan@bigpond.com | Toowoomba | Queensland | (07) 0308 7150 |
| 2 | Olivia Jones | olivia922@dodo.com.au | Hobart | Tasmania | (03) 7301 4783 |
| 3 | Thea Roberts | troberts@icloud.com | Melbourne | Victoria | (03) 4311 8828 |
| 4 | Frankie Rowe | frankierowe@mail.com | Brisbane | Queensland | (07) 4162 0756 |
| 5 | Freya Lee | flee64@internode.on.net | Brisbane | Queensland | (07) 9613 8466 |
| 6 | Audrey Taylor | audreytaylor@optusnet.com.au | Melbourne | Victoria | (03) 8980 4102 |
| 7 | Sadie Brown | sadie.brown@protonmail.com | Brisbane | Queensland | (07) 8632 1588 |
| 8 | John Dawson | john_dawson@fastmail.com.au | Perth | Western Australia | (08) 3877 4056 |

Or Brazilian data:

pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="BR"))
Polars · 8 rows · 5 columns

| # | name (String) | email (String) | city (String) | state (String) | phone (String) |
|---|---|---|---|---|---|
| 1 | Bruno Soares | brunosoares@terra.com.br | Campinas | São Paulo | (14) 0308-7150 |
| 2 | Ana Santos | ana922@zipmail.com.br | Porto Alegre | Rio Grande do Sul | (55) 7301-4783 |
| 3 | Regina Andrade | randrade@bol.com.br | Rio de Janeiro | Rio de Janeiro | (22) 4311-8828 |
| 4 | Lorena Nóvoa | lorenanovoa@icloud.com | Belo Horizonte | Minas Gerais | (35) 3416-2075 |
| 5 | Alícia Lopes | alopes64@yahoo.com.br | Belo Horizonte | Minas Gerais | (37) 6296-1384 |
| 6 | Vitória Ferreira | vitoriaferreira@globo.com | Rio de Janeiro | Rio de Janeiro | (22) 6489-8041 |
| 7 | Stella Souza | stella.souza@live.com | Belo Horizonte | Minas Gerais | (31) 2586-3215 |
| 8 | José Brito | jose_brito@protonmail.com | Brasilia | Distrito Federal | (61) 3877-4056 |

The country parameter accepts ISO alpha-2 codes ("US", "DE", "JP") and alpha-3 codes ("USA", "DEU", "JPN").

Mixing multiple countries

For datasets that need to represent a multinational user base, pass a list of country codes for an equal distribution, or a dictionary for weighted proportions:

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    country=pb.string_field(preset="country"),
)

# Weighted: 60% US, 25% Germany, 15% Japan
mixed = pb.generate_dataset(
    schema, n=20, seed=23,
    country={"US": 0.60, "DE": 0.25, "JP": 0.15},
)

pb.preview(mixed)
Polars · 20 rows · 4 columns

| # | name (String) | email (String) | city (String) | country (String) |
|---|---|---|---|---|
| 1 | Jens Hartmann | j_hartmann@gmail.com | Augsburg | Germany |
| 2 | Cooper Richards | c_richards@aol.com | Akron | United States |
| 3 | Martina Koch | m_koch@gmx.de | Heilbronn | Germany |
| 4 | Lars Herbst | lherbst@outlook.de | Oldenburg | Germany |
| 5 | Debra Patterson | debra.patterson@yahoo.com | Pittsburgh | United States |
| … | … | … | … | … |
| 16 | Adrian Peters | adrianpeters@outlook.de | Essen | Germany |
| 17 | Yuji Yamamoto | yuji.yamamoto51@docomo.ne.jp | Chiba | Japan |
| 18 | Matteo Bishop | matteo.bishop18@mail.com | Brooklyn | United States |
| 19 | Robert Martin | robert636@gmail.com | Philadelphia | United States |
| 20 | Barbara Simpson | bsimpson56@outlook.com | Rochester | United States |

By default, rows from different countries are shuffled (set shuffle=False to keep them grouped by country instead).

This kind of multinational dataset is really valuable in practice. If you’re building a global e-commerce platform, you need test data that reflects customers in multiple regions. Other uses include: fintech applications processing cross-border transactions, logistics companies tracking shipments through different postal systems, and SaaS products localizing their onboarding flows. All of these use cases can benefit from synthetic data that accurately represents the countries involved, rather than defaulting to US-only placeholders.
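Conceptually, weighted country mixing comes down to a weighted draw per row. Here's a minimal stdlib sketch of the idea (my own illustration, not Pointblank's internals), using the same 60/25/15 weights as above:

```python
import random
from collections import Counter

weights = {"US": 0.60, "DE": 0.25, "JP": 0.15}

rng = random.Random(23)
countries = rng.choices(
    population=list(weights),        # ["US", "DE", "JP"]
    weights=list(weights.values()),  # [0.60, 0.25, 0.15]
    k=1000,
)

# The realized proportions approach the requested ones as k grows.
counts = Counter(countries)
print(counts)
```

For small n (like the 20 rows above), expect some deviation from the exact proportions; the weights describe probabilities, not quotas.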

The three coherence systems

I touched a bit on coherence earlier, but it’s worth spelling out explicitly because it’s one of the things that separates Pointblank’s generator from a bag of random values.

The package applies three coherence systems automatically based on which presets you include.

Person coherence

When name, first_name, last_name, email, or user_name presets appear together, emails and usernames are derived from the person’s actual name.

Address coherence

When city, state, postcode, phone_number, latitude, longitude, or license_plate presets appear together, all values are consistent for the same geographic location within each row.

Business coherence

When both job and company appear, they’re drawn from the same industry. If name_full is also present, people in certain professions get appropriate titles (Dr., Prof., etc.), and any integer field for age is automatically constrained to a realistic working range of 22–65.

An example that uses all three systems

Here’s a more comprehensive example with many uses of string_field(preset=):

schema = pb.Schema(
    name=pb.string_field(preset="name_full"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    job=pb.string_field(preset="job"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    postcode=pb.string_field(preset="postcode"),
    age=pb.int_field(),
)

pb.preview(pb.generate_dataset(schema, n=12, seed=23))
Polars · 12 rows · 8 columns

| # | name (String) | email (String) | company (String) | job (String) | city (String) | state (String) | postcode (String) | age (Int64) |
|---|---|---|---|---|---|---|---|---|
| 1 | Mr. Leo Stevens | leo_stevens@gmail.com | Creative Software Digital | System Administrator | Lubbock | Texas | 79456 | 40 |
| 2 | Rev. Archer Ross | archer.ross@hotmail.com | Anaheim Freight Services | Buyer | Anaheim | California | 92860 | 27 |
| 3 | Mrs. Carolyn Gonzales | carolyn626@protonmail.com | Premier Technologies Solutions | System Administrator | Phoenix | Arizona | 85005 | 23 |
| 4 | Mr. Walter Peters | walter.peters@gmail.com | Costa Legal Services | Attorney | Denver | Colorado | 80267 | 59 |
| 5 | Mr. Everett King | everettking@aol.com | San Antonio School District | Teacher | San Antonio | Texas | 78229 | 41 |
| … | … | … | … | … | … | … | … | … |
| 8 | Dr. Christopher Crawford | christopher.crawford29@aol.com | Harris Medical Group | Nurse | Irvine | California | 92604 | 55 |
| 9 | Mrs. Katherine Flores | katherine545@protonmail.com | CVS Health | Nurse | Seattle | Washington | 98172 | 44 |
| 10 | Mr. Zachary Wright | zachary_wright@aol.com | Wood & Woods | Electrical Engineer | Denver | Colorado | 80265 | 30 |
| 11 | Mr. Russell Hawkins | r_hawkins@mail.com | Baltimore Grand Hotel | Event Coordinator | Baltimore | Maryland | 21297 | 34 |
| 12 | Mrs. Julia Powell | julia_powell@outlook.com | Los Angeles Academy | Librarian | Los Angeles | California | 90008 | 39 |

Notice the professional titles on some names, the consistent city/state/postcode combinations, and the age values falling within a plausible working range.

Profile fields: The fast path

For the very common case of generating person-centric data, profile_fields() provides a shortcut. It returns a dictionary of pre-configured StringField objects that you unpack into a schema:

schema = pb.Schema(
    **pb.profile_fields(set="standard"),
    account_id=pb.int_field(min_val=1, unique=True),
)

pb.preview(pb.generate_dataset(schema, n=10, seed=23))
Polars · 10 rows · 8 columns

| # | first_name (String) | last_name (String) | email (String) | city (String) | state (String) | postcode (String) | phone_number (String) | account_id (Int64) |
|---|---|---|---|---|---|---|---|---|
| 1 | Patricia | Williams | patricia_williams@yandex.com | Lubbock | Texas | 79420 | (713) 225-8632 | 7188536481533917197 |
| 2 | Andrea | Mitchell | a_mitchell@gmail.com | Anaheim | California | 92875 | (323) 788-1387 | 2674009078779859984 |
| 3 | Maria | Valentine | maria.valentine54@gmail.com | Phoenix | Arizona | 85062 | (928) 605-6026 | 7652102777077138151 |
| 4 | Virginia | Walker | virginia.walker@outlook.com | Denver | Colorado | 80296 | (720) 227-6164 | 157503859921753049 |
| 5 | Brenda | Lopez | b_lopez@yahoo.com | San Antonio | Texas | 78213 | (972) 488-4413 | 2829213282471975080 |
| 6 | Lauren | Davis | l_davis@outlook.com | New York | New York | 10084 | (212) 960-7964 | 3497364383162086858 |
| 7 | John | West | j_west@zoho.com | Charlotte | North Carolina | 28266 | (910) 854-4526 | 3302703640991750415 |
| 8 | Claire | Jackson | claire202@outlook.com | Irvine | California | 92648 | (310) 878-4841 | 6695746877064448147 |
| 9 | Ariana | Wood | ariana_wood@zoho.com | Seattle | Washington | 98198 | (360) 542-8519 | 2466163118311913924 |
| 10 | Michael | Simmons | michaelsimmons@mail.com | Denver | Colorado | 80204 | (970) 349-7004 | 129827878195925732 |

The "standard" set includes first_name, last_name, email, city, state, postcode, and phone_number. There’s also "minimal" (just name, email, and phone) and "full" (adds address, company, and job). You can further customize with include= and exclude= parameters to add or remove specific fields.

Regex patterns for structured strings

When none of the built-in presets fit, string_field() also accepts a pattern= parameter for regex-based generation. Pointblank’s regex engine supports character classes, quantifiers, alternation, and groups:

schema = pb.Schema(
    sku=pb.string_field(pattern=r"SKU-[A-Z]{2}-[0-9]{5}"),
    tracking=pb.string_field(pattern=r"1Z[0-9]{4}[A-Z]{2}[0-9]{8}"),
    code=pb.string_field(pattern=r"(ALPHA|BETA|GAMMA)-[0-9]{3}"),
)

pb.preview(pb.generate_dataset(schema, n=8, seed=23))
Polars · 8 rows · 3 columns

| # | sku (String) | tracking (String) | code (String) |
|---|---|---|---|
| 1 | SKU-CA-66852 | 1Z1094MQ23470397 | BETA-094 |
| 2 | SKU-IO-39701 | 1Z1176QU50309529 | BETA-852 |
| 3 | SKU-WP-08650 | 1Z5959VK72797222 | GAMMA-470 |
| 4 | SKU-ZB-29359 | 1Z8391DN94949515 | ALPHA-011 |
| 5 | SKU-SJ-91727 | 1Z8478IE91735829 | GAMMA-608 |
| 6 | SKU-VU-22858 | 1Z6270UN02303087 | GAMMA-503 |
| 7 | SKU-SD-16094 | 1Z5067BC78374311 | ALPHA-293 |
| 8 | SKU-SK-54847 | 1Z8834NF75629613 | GAMMA-959 |

This is useful for generating product codes, tracking numbers, internal identifiers, or any string that follows a predictable format.
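Because these patterns are ordinary regular expressions, you can sanity-check generated values with Python's re module. Here I check a few values from the preview above against their generating patterns:

```python
import re

sku_pattern = r"SKU-[A-Z]{2}-[0-9]{5}"
code_pattern = r"(ALPHA|BETA|GAMMA)-[0-9]{3}"

# Values taken from the generated preview above match in full...
assert re.fullmatch(sku_pattern, "SKU-CA-66852")
assert re.fullmatch(code_pattern, "GAMMA-470")

# ...and strings outside the pattern do not.
assert re.fullmatch(code_pattern, "DELTA-470") is None
```

This round-trip (generate from a pattern, validate with the same pattern) is also a handy way to confirm a pattern says what you meant before generating thousands of rows.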

Categorical columns and nullable fields

For columns drawn from a fixed set of values, use the allowed= parameter:

schema = pb.Schema(
    plan=pb.string_field(allowed=["Free", "Pro", "Enterprise"]),
    region=pb.string_field(allowed=["AMER", "EMEA", "APAC"]),
    satisfaction=pb.int_field(allowed=[1, 2, 3, 4, 5]),
    notes=pb.string_field(preset="user_agent", nullable=True, null_probability=0.3),
)

pb.preview(pb.generate_dataset(schema, n=12, seed=23))
Polars · 12 rows · 4 columns

| # | plan (String) | region (String) | satisfaction (Int64) | notes (String) |
|---|---|---|---|---|
| 1 | Pro | EMEA | 3 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/120.0.0.0 |
| 2 | Free | AMER | 1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 14_6_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15 |
| 3 | Free | AMER | 1 | None |
| 4 | Enterprise | APAC | 5 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 |
| 5 | Pro | EMEA | 3 | None |
| … | … | … | … | … |
| 8 | Enterprise | APAC | 5 | Mozilla/5.0 (Macintosh; Intel Mac OS X 15_0_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15 |
| 9 | Pro | EMEA | 3 | None |
| 10 | Free | AMER | 2 | Mozilla/5.0 (Linux; Android 15; SM-S911B) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/28.0 Chrome/122.0.0.0 Mobile Safari/537.36 |
| 11 | Enterprise | APAC | 2 | Mozilla/5.0 (Macintosh; Intel Mac OS X 15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15 |
| 12 | Free | AMER | 3 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 |

The nullable=True and null_probability= parameters let you introduce realistic missing data. About 30% of the notes values will be null.
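Conceptually, null_probability= amounts to a per-value coin flip. A stdlib sketch of the idea (my own illustration, not Pointblank's implementation):

```python
import random

def maybe_null(value: str, null_probability: float, rng: random.Random):
    """Replace a value with None with the given probability --
    a toy version of nullable=True / null_probability=0.3."""
    return None if rng.random() < null_probability else value

rng = random.Random(23)
notes = [maybe_null(f"note {i}", 0.3, rng) for i in range(10_000)]
null_share = notes.count(None) / len(notes)
print(round(null_share, 3))  # close to 0.3
```

As with the country weights, the probability describes the long-run share, so small samples will drift around the target.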

Frequency-weighted sampling

By default, Pointblank uses frequency-weighted sampling for names and cities (weighted=True). This means you’ll see common names like "James" or "Maria" appearing more often than rare ones, following a four-tier distribution: very common (45%), common (30%), uncommon (20%), and rare (5%).

This produces datasets that feel more realistic than a uniform random draw. If you want every name to have an equal chance of appearing, set weighted=False.
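The tiered draw is easy to picture with a two-stage sample: pick a tier by its weight, then a name uniformly within the tier. The pools below are tiny and hypothetical (Pointblank's are much larger), and this is an illustration of the idea rather than its actual internals:

```python
import random
from collections import Counter

# Toy name pools by tier: (names, tier weight).
tiers = {
    "very_common": (["James", "Maria"], 0.45),
    "common": (["Andrea", "Roger"], 0.30),
    "uncommon": (["Willow", "Everett"], 0.20),
    "rare": (["Ignatius", "Zelda"], 0.05),
}

def weighted_name(rng: random.Random) -> str:
    """Pick a tier by its weight, then a name uniformly within the tier."""
    pools, weights = zip(*tiers.values())
    pool = rng.choices(pools, weights=weights, k=1)[0]
    return rng.choice(pool)

rng = random.Random(23)
counts = Counter(weighted_name(rng) for _ in range(10_000))
# Very-common names should dominate rare ones.
print(counts.most_common())
```

Over many draws, a very-common name like "James" shows up roughly nine times as often as a rare one like "Zelda", which is what makes the output feel like a real population.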

A larger example: Event log data

So far, we’ve focused on person and business data, but generate_dataset() handles temporal and numeric types just as well. Let’s build a simulated event log, the kind of table you’d see behind a product analytics dashboard. This schema brings together several field types we haven’t combined yet: datetime_field() for timestamps, duration_field() for session lengths, bool_field() for success/failure flags, and the "ipv4" string preset for IP addresses.

The allowed= parameter on string_field() is doing the work of defining the event vocabulary. Rather than generating random strings, it draws uniformly from the list of actions we provide, giving us a clean categorical column.

from datetime import datetime, timedelta

schema = pb.Schema(
    event_id=pb.int_field(min_val=1, unique=True),
    user_id=pb.int_field(min_val=1000, max_val=1050),
    action=pb.string_field(
        allowed=["page_view", "click", "purchase", "signup", "logout"]
    ),
    timestamp=pb.datetime_field(
        min_date=datetime(2025, 11, 1),
        max_date=datetime(2025, 11, 30, 23, 59, 59),
    ),
    duration=pb.duration_field(
        min_duration=timedelta(seconds=1),
        max_duration=timedelta(minutes=10),
    ),
    success=pb.bool_field(p_true=0.92),
    ip_address=pb.string_field(preset="ipv4"),
)

events = pb.generate_dataset(schema, n=40, seed=23)

pb.preview(events)
Polars · 40 rows · 7 columns

| # | event_id (Int64) | user_id (Int64) | action (String) | timestamp (Datetime) | duration (Duration) | success (Boolean) | ip_address (String) |
|---|---|---|---|---|---|---|---|
| 1 | 7188536481533917197 | 1049 | purchase | 2025-11-15 01:46:38 | 0:04:57 | False | 148.42.8.157 |
| 2 | 2674009078779859984 | 1018 | page_view | 2025-11-05 01:20:36 | 0:01:26 | False | 216.194.183.66 |
| 3 | 7652102777077138151 | 1005 | page_view | 2025-11-01 19:53:44 | 0:00:18 | True | 98.136.227.7 |
| 4 | 157503859921753049 | 1001 | logout | 2025-11-29 17:45:42 | 0:05:15 | True | 113.232.12.54 |
| 5 | 2829213282471975080 | 1037 | purchase | 2025-11-15 21:22:57 | 0:07:14 | True | 43.255.215.10 |
| … | … | … | … | … | … | … | … |
| 36 | 6232456323939446652 | 1002 | logout | 2025-11-28 18:28:58 | 0:05:22 | True | 41.215.141.245 |
| 37 | 1508803708693178976 | 1037 | purchase | 2025-11-14 15:42:28 | 0:09:47 | True | 90.152.135.44 |
| 38 | 7369527199060817792 | 1023 | logout | 2025-11-16 06:17:31 | 0:01:28 | True | 115.31.254.193 |
| 39 | 4921468493992610632 | 1042 | purchase | 2025-11-28 19:34:35 | 0:08:06 | True | 9.233.210.149 |
| 40 | 6210729776073352921 | 1011 | purchase | 2025-11-05 03:53:23 | 0:03:02 | True | 163.208.178.154 |

What we get is 40 rows of event data spread across November 2025. Each row has a unique event ID, a user ID drawn from a small pool (simulating repeat visitors), a random action, a timestamp within our date window, a session duration between 1 second and 10 minutes, a success flag that’s True about 92% of the time, and a plausible IPv4 address. All from a single generate_dataset() call.

Because the output is a Polars DataFrame, we can immediately run aggregations on it. Here’s a quick summary grouped by action type, showing the count of events, the average success rate, and the mean duration:

pb.preview(
    events
    .group_by("action")
    .agg(
        pl.col("event_id").count().alias("count"),
        pl.col("success").mean().round(2).alias("success_rate"),
        pl.col("duration").mean().alias("avg_duration"),
    )
    .sort("count", descending=True)
)
Polars · 5 rows · 4 columns

| # | action (String) | count (UInt32) | success_rate (Float64) | avg_duration (Duration) |
|---|---|---|---|---|
| 1 | purchase | 10 | 0.9 | 0:05:44.800000 |
| 2 | page_view | 9 | 0.89 | 0:03:44.222222 |
| 3 | logout | 8 | 1.0 | 0:04:20.750000 |
| 4 | signup | 7 | 0.86 | 0:03:57.285714 |
| 5 | click | 6 | 1.0 | 0:06:53.500000 |

This is the sort of exploratory analysis you might do while building a reporting pipeline or testing a dashboard query. The synthetic data gives you something to run your code against before the real event stream is available.

Validating what you generate

Pointblank started as a data validation library, and data generation turns out to be a natural extension of that core mission. The two capabilities complement each other quite well: the same Schema object that describes what your data should look like can also produce data that does look like that. This means you can build validation logic and test it against controlled synthetic inputs, all within one consistent API.

There’s a satisfying loop to this workflow. You define a schema, generate data from it, and then validate that the data meets your expectations. Here we generate 100 rows with a Field-based schema, then verify the structure with col_schema_match() using a dtype-based schema, and add a few value-level checks on top:

gen_schema = pb.Schema(
    id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    active=pb.bool_field(),
)

test_data = pb.generate_dataset(gen_schema, n=100, seed=23)

# A dtype-based schema for structural validation
val_schema = pb.Schema(
    id="Int64",
    name="String",
    score="Float64",
    active="Boolean",
)

validation = (
    pb.Validate(data=test_data)
    .col_schema_match(schema=val_schema)
    .col_vals_between(columns="score", left=0.0, right=100.0)
    .col_vals_not_null(columns="name")
    .col_vals_gt(columns="id", value=0)
    .rows_distinct(columns_subset="id")
    .interrogate()
)

validation
Pointblank Validation · 2026-04-13 17:29:09 UTC · Polars

| STEP | | COLUMNS | VALUES | UNITS | PASS | FAIL |
|---|---|---|---|---|---|---|
| 1 | col_schema_match() | — | SCHEMA | 1 | 1 (1.00) | 0 (0.00) |
| 2 | col_vals_between() | score | [0.0, 100.0] | 100 | 100 (1.00) | 0 (0.00) |
| 3 | col_vals_not_null() | name | — | 100 | 100 (1.00) | 0 (0.00) |
| 4 | col_vals_gt() | id | 0 | 100 | 100 (1.00) | 0 (0.00) |
| 5 | rows_distinct() | id | — | 100 | 100 (1.00) | 0 (0.00) |

Interrogation started and finished at 2026-04-13 17:29:09 UTC (< 1 s).

Notes

Step 1 (schema_check): Schema validation passed. The target and expected schemas agree column-for-column:

| TARGET COLUMN | TARGET DATA TYPE | EXPECTED COLUMN | EXPECTED DATA TYPE |
|---|---|---|---|
| id | Int64 | id | Int64 |
| name | String | name | String |
| score | Float64 | score | Float64 |
| active | Boolean | active | Boolean |

Supplied column schema: [('id', 'Int64'), ('name', 'String'), ('score', 'Float64'), ('active', 'Boolean')]

Schema match settings: complete, in order, case-sensitive column names and dtypes, exact dtype matching (e.g., float ≠ float64).

The generated data should pass all checks, giving you a clean baseline for your validation logic. In practice, this is how you’d develop and refine validation rules before pointing them at real data: generate a known-good dataset, confirm your checks pass, then swap in the production table and see what fails. Having generation and validation in the same package makes that iteration cycle very tight.

Wrapping up

Synthetic data generation sits at the intersection of several real needs: testing, prototyping, teaching, and privacy. Pointblank’s generate_dataset() tries to make it practical by handling the tedious parts automatically (type-appropriate random values, coherent cross-column relationships, country-specific formatting) so you can focus on the shape of the data you actually need.

Define a schema, call generate_dataset(), and you have a DataFrame ready to go, which is the sort of simplicity that matters when you need data but can’t use the real thing. If you’d like to explore further, the Pointblank website has extensive documentation on data generation, including a dedicated User Guide section and full API documentation for every field type and function covered here.

Rich Iannone

Software Engineer at Posit, PBC
Richard is a software engineer and table enthusiast. He and R go way back and he's been getting better at writing code in Python too. For the most part, Rich enjoys creating open source packages in R and Python so that people can do great things in their own work.