
Building realistic fake datasets with Pointblank

Written by Rich Iannone

2026-03-09

Digital illustration of a data table for "Pointblank" featuring names, emails, and locations like Japan and Germany. A cartoon hand holds a large black mustache on a stick over the center, suggesting "fake" or "masked" data against a background of faded international flags.

Every data practitioner eventually runs into the same problem: you need data, but you don’t have it. Maybe the production database is locked behind access controls. Maybe the dataset you need doesn’t exist yet because the feature hasn’t shipped. Or maybe you’re writing tests, building a demo, or teaching a workshop and you need something that looks real but carries zero risk. Whatever the reason, the need for synthetic data is everywhere, and it comes up far more often than most of us would like to admit.

The great news here is that fake can be just as good. If your synthetic data has the right shape, the right types, the right distributions, and the right internal consistency, it can stand in for real data in many different situations.

Pointblank is a Python library for data validation, but over the last several releases (v0.20.0, v0.21.0, and v0.22.0), we’ve been building out a complementary capability: data generation. The idea is simple. You define a schema (the columns, their types, and their constraints), and Pointblank produces n rows of data that conform to it. The result is a Polars or Pandas DataFrame, ready to use.

In this post, I’ll walk through the generate_dataset() function in some depth, show how to build realistic datasets for common scenarios (including a customer data example you might actually use), and highlight the country-specific and coherence features that make the generated data feel surprisingly real.

Note

All examples here use pb.preview() to display results, which renders a compact HTML table showing the head and tail of the dataset. If you want to follow along, install Pointblank with pip install pointblank and make sure you have Polars available.

Starting simple: A schema and a dataset

Everything begins with a Schema object. You declare columns as keyword arguments, using field specification functions to describe each one:

import pointblank as pb

schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    passed=pb.bool_field(p_true=0.7),
)

pb.preview(pb.generate_dataset(schema, n=10, seed=23))

	id Int64	score Float64	passed Boolean
PolarsRows10Columns3
1	5749	92.48652516259452	False
2	2368	94.86057779931771	False
3	1279	89.24333440485793	False
4	6025	8.355067683068363	True
5	7942	59.20272268857353	True
6	7212	42.37474082349614	True
7	9684	53.00880101180064	True
8	6866	13.030294124748053	True
9	3134	19.19971575392927	True
10	4145	44.4573573873013	True

Three columns, three types, ten rows. The seed=23 parameter makes the output reproducible. The id column has unique integers in the range 1000–9999, score is a uniform float between 0 and 100, and passed is True about 70% of the time.
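To make those three guarantees concrete, here is a rough standard-library sketch of what any generator has to do to honor them. This is not how Pointblank implements generate_dataset(); the make_rows() function and everything inside it are invented purely for illustration.

```python
import random


def make_rows(n, seed):
    """Hypothetical stand-in for a schema-driven generator (not Pointblank code)."""
    rng = random.Random(seed)  # a private, seeded RNG makes every run repeatable
    # Sampling without replacement guarantees unique ids in the range 1000..9999
    ids = rng.sample(range(1000, 10000), n)
    rows = []
    for row_id in ids:
        rows.append(
            {
                "id": row_id,
                "score": rng.uniform(0.0, 100.0),  # uniform float between 0 and 100
                "passed": rng.random() < 0.7,  # True with probability 0.7
            }
        )
    return rows


first = make_rows(10, seed=23)
second = make_rows(10, seed=23)
print(first == second)  # the same seed yields the same rows
```

The takeaway is that uniqueness comes from sampling without replacement, and reproducibility comes from seeding a dedicated random number generator instead of relying on global state.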

This is already useful for quick prototyping, but the real power shows up when you start using string presets.

String presets: Names, emails, cities, and more

The string_field() function accepts a preset parameter that taps into Pointblank’s built-in data generators. There are over 40 presets covering personal information, locations, business data, internet artifacts, and more. Here’s a small example:

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    company=pb.string_field(preset="company"),
)

pb.preview(pb.generate_dataset(schema, n=10, seed=23))

	name String	email String	city String	company String
PolarsRows10Columns4
1	Patricia Williams	patricia_williams@yandex.com	Lubbock	Innovative Systems Solutions
2	Andrea Mitchell	a_mitchell@gmail.com	Anaheim	Sterling Engineering
3	Maria Valentine	maria.valentine54@gmail.com	Phoenix	Goldman Sachs
4	Virginia Walker	virginia.walker@outlook.com	Denver	Evans Group
5	Brenda Lopez	b_lopez@yahoo.com	San Antonio	Goodwin and Garrett
6	Lauren Davis	l_davis@outlook.com	New York	Hayes and Kennedy
7	John West	j_west@zoho.com	Charlotte	UnitedHealth Group
8	Claire Jackson	claire202@outlook.com	Irvine	First Ventures Group
9	Ariana Wood	ariana_wood@zoho.com	Seattle	Cox Research
10	Michael Simmons	michaelsimmons@mail.com	Denver	Williams Industries

Notice that the email addresses aren’t random gibberish. They’re derived from the person’s name. This is one of Pointblank’s coherence systems at work, and it activates automatically when certain presets appear together in the same schema.

Building a realistic customer dataset

Let’s put these pieces together for a scenario that comes up constantly in practice: generating a table of customer records. This is the kind of dataset you might need for a dashboard prototype, a workshop exercise, or integration testing of a CRM pipeline.

from datetime import date

schema = pb.Schema(
    customer_id=pb.int_field(min_val=10000, max_val=99999, unique=True),
    first_name=pb.string_field(preset="first_name"),
    last_name=pb.string_field(preset="last_name"),
    email=pb.string_field(preset="email"),
    phone=pb.string_field(preset="phone_number"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    postcode=pb.string_field(preset="postcode"),
    signup_date=pb.date_field(
        min_date=date(2022, 1, 1),
        max_date=date(2025, 12, 31),
    ),
    is_active=pb.bool_field(p_true=0.8),
    lifetime_spend=pb.float_field(min_val=0.0, max_val=5000.0),
)

customers = pb.generate_dataset(schema, n=50, seed=23)

pb.preview(customers)

	customer_id Int64	first_name String	last_name String	email String	phone String	city String	state String	postcode String	signup_date Date	is_active Boolean	lifetime_spend Float64
PolarsRows50Columns11
1	47999	Paul	Woods	paulwoods@hotmail.com	(512) 899-4802	Lubbock	Texas	79468	2023-08-17	False	4624.326258129726
2	20951	Mark	Smith	mark684@icloud.com	(310) 986-0270	Anaheim	California	92873	2022-06-21	False	4743.028889965885
3	12238	Willow	Fowler	willowfowler@gmail.com	(623) 938-2304	Phoenix	Arizona	85032	2022-02-04	False	4462.166720242896
4	87598	Roger	Graham	roger.graham@zoho.com	(970) 514-7904	Denver	Colorado	80232	2025-04-27	True	417.7533841534181
5	50205	Karen	Horn	karen.horn70@gmail.com	(210) 987-2966	San Antonio	Texas	78271	2023-09-21	True	2960.1361344286765
46	72136	Hannah	Weaver	hannahweaver@yahoo.com	(419) 998-5523	Columbus	Ohio	43255	2022-06-25	True	1377.8223075007618
47	33282	Martin	Ramos	martin_ramos@yahoo.com	(951) 234-6078	San Jose	California	95170	2024-08-28	True	2864.109474442189
48	73318	Audrey	Jackson	audrey_jackson@aol.com	(252) 401-8878	Charlotte	North Carolina	28226	2022-12-30	False	4103.315904362622
49	87412	Christina	Cannon	ccannon13@aol.com	(320) 486-6471	St. Paul	Minnesota	55195	2024-09-16	True	1654.024239966494
50	68648	Melissa	Nelson	m_nelson@yandex.com	(260) 590-0851	Bloomington	Indiana	47493	2025-04-24	True	1848.269660030496

What we get here is 50 rows of plausible customer data. The city, state, and postcode are coherent within each row (a customer in "San Antonio" will have "Texas" as the state and a Texas ZIP code). The email is derived from the customer’s name. The phone number’s area code matches the region. None of this required any manual wiring. Pointblank detects the preset combinations and applies the appropriate coherence rules.

Extending with Polars

Since the default output is a Polars DataFrame, you can immediately layer on transformations. Let’s add a loyalty tier based on lifetime spend:

import polars as pl

customers_tiered = customers.with_columns(
    pl.when(pl.col("lifetime_spend") >= 3000)
    .then(pl.lit("Gold"))
    .when(pl.col("lifetime_spend") >= 1000)
    .then(pl.lit("Silver"))
    .otherwise(pl.lit("Bronze"))
    .alias("loyalty_tier")
)

pb.preview(customers_tiered)

	customer_id Int64	first_name String	last_name String	email String	phone String	city String	state String	postcode String	signup_date Date	is_active Boolean	lifetime_spend Float64	loyalty_tier String
PolarsRows50Columns12
1	47999	Paul	Woods	paulwoods@hotmail.com	(512) 899-4802	Lubbock	Texas	79468	2023-08-17	False	4624.326258129726	Gold
2	20951	Mark	Smith	mark684@icloud.com	(310) 986-0270	Anaheim	California	92873	2022-06-21	False	4743.028889965885	Gold
3	12238	Willow	Fowler	willowfowler@gmail.com	(623) 938-2304	Phoenix	Arizona	85032	2022-02-04	False	4462.166720242896	Gold
4	87598	Roger	Graham	roger.graham@zoho.com	(970) 514-7904	Denver	Colorado	80232	2025-04-27	True	417.7533841534181	Bronze
5	50205	Karen	Horn	karen.horn70@gmail.com	(210) 987-2966	San Antonio	Texas	78271	2023-09-21	True	2960.1361344286765	Silver
46	72136	Hannah	Weaver	hannahweaver@yahoo.com	(419) 998-5523	Columbus	Ohio	43255	2022-06-25	True	1377.8223075007618	Silver
47	33282	Martin	Ramos	martin_ramos@yahoo.com	(951) 234-6078	San Jose	California	95170	2024-08-28	True	2864.109474442189	Silver
48	73318	Audrey	Jackson	audrey_jackson@aol.com	(252) 401-8878	Charlotte	North Carolina	28226	2022-12-30	False	4103.315904362622	Gold
49	87412	Christina	Cannon	ccannon13@aol.com	(320) 486-6471	St. Paul	Minnesota	55195	2024-09-16	True	1654.024239966494	Silver
50	68648	Melissa	Nelson	m_nelson@yandex.com	(260) 590-0851	Bloomington	Indiana	47493	2025-04-24	True	1848.269660030496	Silver
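One detail in that when/then chain is easy to miss: the branches are checked in order and the first match wins, which is why the >= 3000 test has to come before the >= 1000 test. Here is the same logic as a plain Python function (loyalty_tier is just an illustrative name), checked against three lifetime_spend values from the table above:

```python
def loyalty_tier(lifetime_spend):
    """Mirror the Polars when/then/otherwise chain: the first matching branch wins."""
    if lifetime_spend >= 3000:
        return "Gold"
    if lifetime_spend >= 1000:
        return "Silver"
    return "Bronze"


# lifetime_spend values taken from rows 1, 4, and 5 of the preview
spends = (4624.326258129726, 417.7533841534181, 2960.1361344286765)
tiers = [loyalty_tier(spend) for spend in spends]
print(tiers)  # ['Gold', 'Bronze', 'Silver']
```

Swapping the two conditions would label every customer with at least 1000 in spend as Silver, and Gold would never appear.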

You can also compute a summary by state and loyalty tier:

pb.preview(
    customers_tiered
    .group_by("state", "loyalty_tier")
    .agg(
        pl.col("customer_id").count().alias("count"),
        pl.col("lifetime_spend").mean().alias("avg_spend"),
    )
    .sort("state", "loyalty_tier")
)

	state String	loyalty_tier String	count UInt32	avg_spend Float64
PolarsRows35Columns4
1	Arizona	Gold	2	3882.413633247243
2	Arizona	Silver	1	2860.2339059589044
3	California	Bronze	3	561.8318745352304
4	California	Gold	2	4798.352513930336
5	California	Silver	3	2503.2021274153226
31	Texas	Bronze	1	978.0392640195001
32	Texas	Gold	3	3904.2742899489526
33	Texas	Silver	3	2140.2056508843366
34	Washington	Bronze	2	623.879217971278
35	Washington	Gold	1	3671.0453939174777

This is the workflow I keep coming back to: use Pointblank to generate the raw material, then use Polars to shape it into whatever you actually need.

Country-specific data

One of the features I’m most excited about is country-specific data generation. Pointblank ships with locale data for 100 countries, covering names, cities, states/provinces, postcodes, phone number formats, and much more. Switching locales is a single parameter (country=); here’s an example that gets person data for Germany ("DE"):

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    phone=pb.string_field(preset="phone_number"),
)

pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="DE"))

	name String	email String	city String	state String	phone String
PolarsRows8Columns5
1	Ignaz Schulze	ignazschulze@freenet.de	Potsdam	Brandenburg	(0335) 150-6730
2	Sandra Schneider	sandra922@mail.de	Halle (Saale)	Sachsen-Anhalt	(0391) 478-3743
3	Antje Jung	antje_jung@yahoo.de	Frankfurt am Main	Hessen	(069) 188-2883
4	Jennifer Opitz	j_opitz@gmx.de	Leipzig	Sachsen	(0371) 162-0756
5	Eva Lehmann	evalehmann@outlook.de	Cologne	Nordrhein-Westfalen	(0231) 961-3846
6	Alexandra Koch	alexandra.koch@outlook.de	Berlin	Berlin	(030) 489-8041
7	Christiane Becker	cbecker@gmail.com	Stuttgart	Baden-Württemberg	(0711) 258-6321
8	Thomas Mertens	thomas.mertens@posteo.de	Magdeburg	Sachsen-Anhalt	(0345) 881-3877

The dataset above contains German names, cities, and phone numbers, with area codes that belong to the same state as each row’s city. Switch to "AU" and you get Australian data:

pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="AU"))

	name String	email String	city String	state String	phone String
PolarsRows8Columns5
1	Ethan Ryan	ethanryan@bigpond.com	Toowoomba	Queensland	(07) 0308 7150
2	Olivia Jones	olivia922@dodo.com.au	Hobart	Tasmania	(03) 7301 4783
3	Thea Roberts	troberts@icloud.com	Melbourne	Victoria	(03) 4311 8828
4	Frankie Rowe	frankierowe@mail.com	Brisbane	Queensland	(07) 4162 0756
5	Freya Lee	flee64@internode.on.net	Brisbane	Queensland	(07) 9613 8466
6	Audrey Taylor	audreytaylor@optusnet.com.au	Melbourne	Victoria	(03) 8980 4102
7	Sadie Brown	sadie.brown@protonmail.com	Brisbane	Queensland	(07) 8632 1588
8	John Dawson	john_dawson@fastmail.com.au	Perth	Western Australia	(08) 3877 4056

Or Brazilian data:

pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="BR"))

	name String	email String	city String	state String	phone String
PolarsRows8Columns5
1	Bruno Soares	brunosoares@terra.com.br	Campinas	São Paulo	(14) 0308-7150
2	Ana Santos	ana922@zipmail.com.br	Porto Alegre	Rio Grande do Sul	(55) 7301-4783
3	Regina Andrade	randrade@bol.com.br	Rio de Janeiro	Rio de Janeiro	(22) 4311-8828
4	Lorena Nóvoa	lorenanovoa@icloud.com	Belo Horizonte	Minas Gerais	(35) 3416-2075
5	Alícia Lopes	alopes64@yahoo.com.br	Belo Horizonte	Minas Gerais	(37) 6296-1384
6	Vitória Ferreira	vitoriaferreira@globo.com	Rio de Janeiro	Rio de Janeiro	(22) 6489-8041
7	Stella Souza	stella.souza@live.com	Belo Horizonte	Minas Gerais	(31) 2586-3215
8	José Brito	jose_brito@protonmail.com	Brasilia	Distrito Federal	(61) 3877-4056

The country parameter accepts ISO alpha-2 codes ("US", "DE", "JP") and alpha-3 codes ("USA", "DEU", "JPN").

Mixing multiple countries

For datasets that need to represent a multinational user base, pass a list for an equal distribution or a dictionary for weighted proportions:

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    country=pb.string_field(preset="country"),
)

# Weighted: 60% US, 25% Germany, 15% Japan
mixed = pb.generate_dataset(
    schema, n=20, seed=23,
    country={"US": 0.60, "DE": 0.25, "JP": 0.15},
)

pb.preview(mixed)

	name String	email String	city String	country String
PolarsRows20Columns4
1	Jens Hartmann	j_hartmann@gmail.com	Augsburg	Germany
2	Cooper Richards	c_richards@aol.com	Akron	United States
3	Martina Koch	m_koch@gmx.de	Heilbronn	Germany
4	Lars Herbst	lherbst@outlook.de	Oldenburg	Germany
5	Debra Patterson	debra.patterson@yahoo.com	Pittsburgh	United States
16	Adrian Peters	adrianpeters@outlook.de	Essen	Germany
17	Yuji Yamamoto	yuji.yamamoto51@docomo.ne.jp	Chiba	Japan
18	Matteo Bishop	matteo.bishop18@mail.com	Brooklyn	United States
19	Robert Martin	robert636@gmail.com	Philadelphia	United States
20	Barbara Simpson	bsimpson56@outlook.com	Rochester	United States

By default, rows from different countries are shuffled (set shuffle=False to keep them grouped by country instead).
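If you’re wondering how weights like 0.60/0.25/0.15 could turn into whole rows, here is one possible approach sketched with the standard library. It is an assumption on my part that a generator would allocate rows this way (largest-remainder rounding followed by a seeded shuffle); Pointblank may instead sample a country per row, and allocate_rows() is a made-up name.

```python
import random


def allocate_rows(weights, n, seed, shuffle=True):
    """Turn country weights into a per-row list of country codes (illustrative only)."""
    total = sum(weights.values())
    # Ideal (fractional) share of the n rows for each country
    exact = {code: n * w / total for code, w in weights.items()}
    counts = {code: int(share) for code, share in exact.items()}
    # Hand out rows lost to rounding down, largest fractional remainder first
    leftover = n - sum(counts.values())
    by_remainder = sorted(exact, key=lambda code: exact[code] - counts[code], reverse=True)
    for code in by_remainder[:leftover]:
        counts[code] += 1
    rows = [code for code, count in counts.items() for _ in range(count)]
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded, so the order is reproducible
    return rows


codes = allocate_rows({"US": 0.60, "DE": 0.25, "JP": 0.15}, n=20, seed=23)
grouped = allocate_rows({"US": 0.60, "DE": 0.25, "JP": 0.15}, n=20, seed=23, shuffle=False)
```

With n=20, this allocation yields 12 US rows, 5 German rows, and 3 Japanese rows. Passing shuffle=False in the sketch keeps them grouped by country, which mirrors the behavior described above.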

This kind of multinational dataset is really valuable in practice. If you’re building a global e-commerce platform, you need test data that reflects customers in multiple regions. Other uses include fintech applications processing cross-border transactions, logistics companies tracking shipments through different postal systems, and SaaS products localizing their onboarding flows. All of these use cases can benefit from synthetic data that accurately represents the countries involved, rather than defaulting to US-only placeholders.

The three coherence systems

I touched a bit on coherence earlier, but it’s worth spelling out explicitly because it’s one of the things that separates Pointblank’s generator from a bag of random values.

The package applies three coherence systems automatically based on which presets you include.

Person coherence

When name, first_name, last_name, email, or user_name presets appear together, emails and usernames are derived from the person’s actual name.
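To give a feel for what "derived from the person’s actual name" can mean, here is a small sketch that builds an email from a first and last name. The local-part styles are modeled on addresses in the sample outputs in this post (paulwoods, roger.graham, martin_ramos, m_nelson), but the functions ascii_slug() and derive_email() and the short domain list are my own illustration, not Pointblank internals.

```python
import random
import unicodedata


def ascii_slug(text):
    """Lowercase and strip accents so that 'Nóvoa' becomes 'novoa'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalnum()).lower()


def derive_email(first_name, last_name, rng, domains=("gmail.com", "outlook.com")):
    """Build an email address from a name using one of a few common styles."""
    first, last = ascii_slug(first_name), ascii_slug(last_name)
    local_part = rng.choice(
        [
            f"{first}{last}",  # e.g. paulwoods
            f"{first}.{last}",  # e.g. roger.graham
            f"{first}_{last}",  # e.g. martin_ramos
            f"{first[0]}_{last}",  # e.g. m_nelson
        ]
    )
    return f"{local_part}@{rng.choice(domains)}"


rng = random.Random(23)
email = derive_email("Lorena", "Nóvoa", rng)
print(email)
```

The accent-stripping step matters for non-English locales: the Brazilian sample earlier pairs the name Lorena Nóvoa with an address that starts with lorenanovoa.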

Address coherence

When city, state, postcode, phone_number, latitude, longitude, or license_plate presets appear together, all values are consistent for the same geographic location within each row.
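The general trick behind this kind of coherence is to pick one location first and then derive every dependent field from that single pick, rather than sampling each column independently. Below is a minimal sketch of the idea. The three-entry lookup table uses ZIP prefixes and area codes that appear in the sample rows in this post; the table structure and the coherent_address() function are assumptions for illustration and say nothing about how Pointblank stores its locale data.

```python
import random

# Tiny illustrative lookup; values come from sample rows shown in this post,
# not from Pointblank's locale data.
LOCATIONS = [
    {"city": "San Antonio", "state": "Texas", "zip_prefix": "782", "area_code": "210"},
    {"city": "Phoenix", "state": "Arizona", "zip_prefix": "850", "area_code": "623"},
    {"city": "Denver", "state": "Colorado", "zip_prefix": "802", "area_code": "970"},
]


def coherent_address(rng):
    """Pick one location, then derive every dependent field from that pick."""
    place = rng.choice(LOCATIONS)
    postcode = place["zip_prefix"] + f"{rng.randrange(100):02d}"
    phone = f"({place['area_code']}) {rng.randrange(200, 1000)}-{rng.randrange(10000):04d}"
    return {
        "city": place["city"],
        "state": place["state"],
        "postcode": postcode,
        "phone": phone,
    }


rng = random.Random(23)
rows = [coherent_address(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the state, postcode prefix, and area code all hang off the same record, a mismatched row (say, a Denver city with a Texas postcode) cannot occur.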

Business coherence

When both job and company appear, they’re drawn from the same industry. If name_full is also present, people in certain professions get appropriate titles (Dr., Prof., etc.), and any integer field for age is automatically constrained to a realistic working range of 22–65.

An example that makes use of all three types

Here’s a more comprehensive example with many uses of string_field(preset=):

schema = pb.Schema(
    name=pb.string_field(preset="name_full"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    job=pb.string_field(preset="job"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    postcode=pb.string_field(preset="postcode"),
    age=pb.int_field(),
)

pb.preview(pb.generate_dataset(schema, n=12, seed=23))

	name String	email String	company String	job String	city String	state String	postcode String	age Int64
PolarsRows12Columns8
1	Mr. Leo Stevens	leo_stevens@gmail.com	Creative Software Digital	System Administrator	Lubbock	Texas	79456	40
2	Rev. Archer Ross	archer.ross@hotmail.com	Anaheim Freight Services	Buyer	Anaheim	California	92860	27
3	Mrs. Carolyn Gonzales	carolyn626@protonmail.com	Premier Technologies Solutions	System Administrator	Phoenix	Arizona	85005	23
4	Mr. Walter Peters	walter.peters@gmail.com	Costa Legal Services	Attorney	Denver	Colorado	80267	59
5	Mr. Everett King	everettking@aol.com	San Antonio School District	Teacher	San Antonio	Texas	78229	41
8	Dr. Christopher Crawford	christopher.crawford29@aol.com	Harris Medical Group	Nurse	Irvine	California	92604	55
9	Mrs. Katherine Flores	katherine545@protonmail.com	CVS Health	Nurse	Seattle	Washington	98172	44
10	Mr. Zachary Wright	zachary_wright@aol.com	Wood & Woods	Electrical Engineer	Denver	Colorado	80265	30
11	Mr. Russell Hawkins	r_hawkins@mail.com	Baltimore Grand Hotel	Event Coordinator	Baltimore	Maryland	21297	34
12	Mrs. Julia Powell	julia_powell@outlook.com	Los Angeles Academy	Librarian	Los Angeles	California	90008	39

Notice the professional titles on some names, the consistent city/state/postcode combinations, and the age values falling within a plausible working range.

Profile fields: The fast path

For the very common case of generating person-centric data, profile_fields() provides a shortcut. It returns a dictionary of pre-configured StringField objects that you unpack into a schema:

schema = pb.Schema(
    **pb.profile_fields(set="standard"),
    account_id=pb.int_field(min_val=1, unique=True),
)

pb.preview(pb.generate_dataset(schema, n=10, seed=23))

	first_name String	last_name String	email String	city String	state String	postcode String	phone_number String	account_id Int64
PolarsRows10Columns8
1	Patricia	Williams	patricia_williams@yandex.com	Lubbock	Texas	79420	(713) 225-8632	7188536481533917197
2	Andrea	Mitchell	a_mitchell@gmail.com	Anaheim	California	92875	(323) 788-1387	2674009078779859984
3	Maria	Valentine	maria.valentine54@gmail.com	Phoenix	Arizona	85062	(928) 605-6026	7652102777077138151
4	Virginia	Walker	virginia.walker@outlook.com	Denver	Colorado	80296	(720) 227-6164	157503859921753049
5	Brenda	Lopez	b_lopez@yahoo.com	San Antonio	Texas	78213	(972) 488-4413	2829213282471975080
6	Lauren	Davis	l_davis@outlook.com	New York	New York	10084	(212) 960-7964	3497364383162086858
7	John	West	j_west@zoho.com	Charlotte	North Carolina	28266	(910) 854-4526	3302703640991750415
8	Claire	Jackson	claire202@outlook.com	Irvine	California	92648	(310) 878-4841	6695746877064448147
9	Ariana	Wood	ariana_wood@zoho.com	Seattle	Washington	98198	(360) 542-8519	2466163118311913924
10	Michael	Simmons	michaelsimmons@mail.com	Denver	Colorado	80204	(970) 349-7004	129827878195925732

The "standard" set includes first_name, last_name, email, city, state, postcode, and phone_number. There’s also "minimal" (just name, email, and phone) and "full" (adds address, company, and job). You can further customize with include= and exclude= parameters to add or remove specific fields.

Regex patterns for structured strings

When none of the built-in presets fit, string_field() also accepts a pattern= parameter for regex-based generation. Pointblank’s regex engine supports character classes, quantifiers, alternation, and groups:

schema = pb.Schema(
    sku=pb.string_field(pattern=r"SKU-[A-Z]{2}-[0-9]{5}"),
    tracking=pb.string_field(pattern=r"1Z[0-9]{4}[A-Z]{2}[0-9]{8}"),
    code=pb.string_field(pattern=r"(ALPHA|BETA|GAMMA)-[0-9]{3}"),
)

pb.preview(pb.generate_dataset(schema, n=8, seed=23))

	sku String	tracking String	code String
PolarsRows8Columns3
1	SKU-CA-66852	1Z1094MQ23470397	BETA-094
2	SKU-IO-39701	1Z1176QU50309529	BETA-852
3	SKU-WP-08650	1Z5959VK72797222	GAMMA-470
4	SKU-ZB-29359	1Z8391DN94949515	ALPHA-011
5	SKU-SJ-91727	1Z8478IE91735829	GAMMA-608
6	SKU-VU-22858	1Z6270UN02303087	GAMMA-503
7	SKU-SD-16094	1Z5067BC78374311	ALPHA-293
8	SKU-SK-54847	1Z8834NF75629613	GAMMA-959

This is useful for generating product codes, tracking numbers, internal identifiers, or any string that follows a predictable format.
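A handy sanity check for pattern-based columns is a round trip: every generated value should fully match the pattern that produced it. The sketch below does that with Python’s re module, using the three patterns from the schema and the first two rows of the output above. The mismatches() helper is a name I made up for this example.

```python
import re

# Patterns copied from the schema above
PATTERNS = {
    "sku": r"SKU-[A-Z]{2}-[0-9]{5}",
    "tracking": r"1Z[0-9]{4}[A-Z]{2}[0-9]{8}",
    "code": r"(ALPHA|BETA|GAMMA)-[0-9]{3}",
}

# First two rows of the generated output shown above
SAMPLE_ROWS = [
    {"sku": "SKU-CA-66852", "tracking": "1Z1094MQ23470397", "code": "BETA-094"},
    {"sku": "SKU-IO-39701", "tracking": "1Z1176QU50309529", "code": "BETA-852"},
]


def mismatches(rows, patterns):
    """Return (row_index, column, value) for every value that fails its pattern."""
    compiled = {col: re.compile(pattern) for col, pattern in patterns.items()}
    return [
        (i, col, row[col])
        for i, row in enumerate(rows)
        for col in compiled
        if compiled[col].fullmatch(row[col]) is None
    ]


bad = mismatches(SAMPLE_ROWS, PATTERNS)
print(bad)  # [] when every value conforms
```

Note the use of fullmatch() rather than search(): a value like "SKU-CA-668521" contains a valid SKU as a substring, but it should still count as a failure.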

Categorical columns and nullable fields

For columns drawn from a fixed set of values, use the allowed= parameter:

schema = pb.Schema(
    plan=pb.string_field(allowed=["Free", "Pro", "Enterprise"]),
    region=pb.string_field(allowed=["AMER", "EMEA", "APAC"]),
    satisfaction=pb.int_field(allowed=[1, 2, 3, 4, 5]),
    notes=pb.string_field(preset="user_agent", nullable=True, null_probability=0.3),
)

pb.preview(pb.generate_dataset(schema, n=12, seed=23))

	plan String	region String	satisfaction Int64	notes String
PolarsRows12Columns4
1	Pro	EMEA	3	Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/120.0.0.0
2	Free	AMER	1	Mozilla/5.0 (Macintosh; Intel Mac OS X 14_6_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15
3	Free	AMER	1	None
4	Enterprise	APAC	5	Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36
5	Pro	EMEA	3	None
8	Enterprise	APAC	5	Mozilla/5.0 (Macintosh; Intel Mac OS X 15_0_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15
9	Pro	EMEA	3	None
10	Free	AMER	2	Mozilla/5.0 (Linux; Android 15; SM-S911B) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/28.0 Chrome/122.0.0.0 Mobile Safari/537.36
11	Enterprise	APAC	2	Mozilla/5.0 (Macintosh; Intel Mac OS X 15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15
12	Free	AMER	3	Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36

The nullable=True and null_probability= parameters let you introduce realistic missing data. About 30% of the notes values will be null.
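The word "about" is doing real work in that sentence. If each value is nulled independently with probability 0.3 (an assumption on my part about the mechanism), then a small table can land well away from 30% while a large one settles close to it. Here is a standard-library sketch of that behavior; with_nulls() is a hypothetical helper, not a Pointblank function.

```python
import random


def with_nulls(values, null_probability, seed):
    """Replace each value with None independently, with the given probability."""
    rng = random.Random(seed)
    return [None if rng.random() < null_probability else v for v in values]


# A 12-row column, like the preview above: the null share can vary quite a bit
small = with_nulls(["note"] * 12, null_probability=0.3, seed=23)
print(small.count(None), "of", len(small), "values are null")

# A 10,000-row column: the share settles near the requested probability
notes = with_nulls(["note"] * 10_000, null_probability=0.3, seed=23)
null_share = notes.count(None) / len(notes)
print(round(null_share, 3))
```

If a test depends on an exact number of missing values, it’s safer to assert a range or to fix the seed and check the count once.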

Frequency-weighted sampling

By default, Pointblank uses frequency-weighted sampling for names and cities (weighted=True). This means you’ll see common names like "James" or "Maria" appearing more often than rare ones, following a four-tier distribution: very common (45%), common (30%), uncommon (20%), and rare (5%).

This produces datasets that feel more realistic than a uniform random draw. If you want every name to have an equal chance of appearing, set weighted=False.
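One way to picture a four-tier distribution is as a two-stage draw: first pick a tier using the tier weights, then pick a name uniformly within that tier. The sketch below implements that reading with random.choices(). The tier weights come from the description above, but the two-stage mechanism, the sample_name() function, and the placeholder names in each tier are assumptions for illustration.

```python
import random

# Tier weights from the text; the names in each tier are placeholders,
# not Pointblank's actual name lists.
TIERS = {
    "very_common": (0.45, ["James", "Maria"]),
    "common": (0.30, ["Claire", "Roger"]),
    "uncommon": (0.20, ["Willow", "Everett"]),
    "rare": (0.05, ["Ignaz", "Thea"]),
}


def sample_name(rng, weighted=True):
    """Draw one name, either tier-weighted or uniformly across all names."""
    if not weighted:
        all_names = [name for _, names in TIERS.values() for name in names]
        return rng.choice(all_names)
    tier_names = list(TIERS)
    tier_weights = [TIERS[tier][0] for tier in tier_names]
    tier = rng.choices(tier_names, weights=tier_weights, k=1)[0]
    return rng.choice(TIERS[tier][1])


rng = random.Random(23)
draws = [sample_name(rng) for _ in range(20_000)]
very_common_share = sum(name in ("James", "Maria") for name in draws) / len(draws)
rare_share = sum(name in ("Ignaz", "Thea") for name in draws) / len(draws)
print(round(very_common_share, 2), round(rare_share, 2))
```

Over many draws, the very common names account for roughly 45% of the values and the rare ones for roughly 5%, whereas the weighted=False branch gives every name the same chance.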

A larger example: Event log data

So far, we’ve focused on person and business data, but generate_dataset() handles temporal and numeric types just as well. Let’s build a simulated event log, the kind of table you’d see behind a product analytics dashboard. This schema brings together several field types we haven’t combined yet: datetime_field() for timestamps, duration_field() for session lengths, bool_field() for success/failure flags, and the "ipv4" string preset for IP addresses.

The allowed= parameter on string_field() is doing the work of defining the event vocabulary. Rather than generating random strings, it draws uniformly from the list of actions we provide, giving us a clean categorical column.

from datetime import datetime, timedelta

schema = pb.Schema(
    event_id=pb.int_field(min_val=1, unique=True),
    user_id=pb.int_field(min_val=1000, max_val=1050),
    action=pb.string_field(
        allowed=["page_view", "click", "purchase", "signup", "logout"]
    ),
    timestamp=pb.datetime_field(
        min_date=datetime(2025, 11, 1),
        max_date=datetime(2025, 11, 30, 23, 59, 59),
    ),
    duration=pb.duration_field(
        min_duration=timedelta(seconds=1),
        max_duration=timedelta(minutes=10),
    ),
    success=pb.bool_field(p_true=0.92),
    ip_address=pb.string_field(preset="ipv4"),
)

events = pb.generate_dataset(schema, n=40, seed=23)

pb.preview(events)

	event_id Int64	user_id Int64	action String	timestamp Datetime	duration Duration	success Boolean	ip_address String
PolarsRows40Columns7
1	7188536481533917197	1049	purchase	2025-11-15 01:46:38	0:04:57	False	148.42.8.157
2	2674009078779859984	1018	page_view	2025-11-05 01:20:36	0:01:26	False	216.194.183.66
3	7652102777077138151	1005	page_view	2025-11-01 19:53:44	0:00:18	True	98.136.227.7
4	157503859921753049	1001	logout	2025-11-29 17:45:42	0:05:15	True	113.232.12.54
5	2829213282471975080	1037	purchase	2025-11-15 21:22:57	0:07:14	True	43.255.215.10
36	6232456323939446652	1002	logout	2025-11-28 18:28:58	0:05:22	True	41.215.141.245
37	1508803708693178976	1037	purchase	2025-11-14 15:42:28	0:09:47	True	90.152.135.44
38	7369527199060817792	1023	logout	2025-11-16 06:17:31	0:01:28	True	115.31.254.193
39	4921468493992610632	1042	purchase	2025-11-28 19:34:35	0:08:06	True	9.233.210.149
40	6210729776073352921	1011	purchase	2025-11-05 03:53:23	0:03:02	True	163.208.178.154

What we get is 40 rows of event data spread across November 2025. Each row has a unique event ID, a user ID drawn from a small pool (simulating repeat visitors), a random action, a timestamp within our date window, a session duration between 1 second and 10 minutes, a success flag that’s True about 92% of the time, and a plausible IPv4 address. All from a single generate_dataset() call.

Because the output is a Polars DataFrame, we can immediately run aggregations on it. Here’s a quick summary grouped by action type, showing the count of events, the average success rate, and the mean duration:

pb.preview(
    events
    .group_by("action")
    .agg(
        pl.col("event_id").count().alias("count"),
        pl.col("success").mean().round(2).alias("success_rate"),
        pl.col("duration").mean().alias("avg_duration"),
    )
    .sort("count", descending=True)
)

	action String	count UInt32	success_rate Float64	avg_duration Duration
PolarsRows5Columns4
1	purchase	10	0.9	0:05:44.800000
2	page_view	9	0.89	0:03:44.222222
3	logout	8	1.0	0:04:20.750000
4	signup	7	0.86	0:03:57.285714
5	click	6	1.0	0:06:53.500000

This is the sort of exploratory analysis you might do while building a reporting pipeline or testing a dashboard query. The synthetic data gives you something to run your code against before the real event stream is available.

Validating what you generate

Pointblank started as a data validation library, and data generation turns out to be a natural extension of that core mission. The two capabilities complement each other quite well: the same Schema object that describes what your data should look like can also produce data that does look like that. This means you can build validation logic and test it against controlled synthetic inputs, all within one consistent API.

There’s a satisfying loop to this workflow. You define a schema, generate data from it, and then validate that the data meets your expectations. Here we generate 100 rows with a Field-based schema, then verify the structure with col_schema_match() using a dtype-based schema, and add a few value-level checks on top:

gen_schema = pb.Schema(
    id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    active=pb.bool_field(),
)

test_data = pb.generate_dataset(gen_schema, n=100, seed=23)

# A dtype-based schema for structural validation
val_schema = pb.Schema(
    id="Int64",
    name="String",
    score="Float64",
    active="Boolean",
)

validation = (
    pb.Validate(data=test_data)
    .col_schema_match(schema=val_schema)
    .col_vals_between(columns="score", left=0.0, right=100.0)
    .col_vals_not_null(columns="name")
    .col_vals_gt(columns="id", value=0)
    .rows_distinct(columns_subset="id")
    .interrogate()
)

validation

Pointblank Validation (Polars), run 2026-04-13 17:29:09 UTC, duration < 1 s

	STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL
1	col_schema_match()	—	SCHEMA	✓	1	1 (1.00)	0 (0.00)
2	col_vals_between()	score	[0.0, 100.0]	✓	100	100 (1.00)	0 (0.00)
3	col_vals_not_null()	name	—	✓	100	100 (1.00)	0 (0.00)
4	col_vals_gt()	id	0	✓	100	100 (1.00)	0 (0.00)
5	rows_distinct()	id	—	✓	100	100 (1.00)	0 (0.00)

Notes

Step 1 (schema_check) ✓ Schema validation passed.

Schema Comparison

	TARGET COLUMN	TARGET DATA TYPE	EXPECTED COLUMN	EXPECTED DATA TYPE
1	id	Int64	id ✓	Int64 ✓
2	name	String	name ✓	String ✓
3	score	Float64	score ✓	Float64 ✓
4	active	Boolean	active ✓	Boolean ✓

Supplied Column Schema: `[('id', 'Int64'), ('name', 'String'), ('score', 'Float64'), ('active', 'Boolean')]`
Schema Match Settings: COMPLETE, IN ORDER, COLUMN ≠ column, DTYPE ≠ dtype, float ≠ float64

The generated data should pass all checks, giving you a clean baseline for your validation logic. In practice, this is how you’d develop and refine validation rules before pointing them at real data: generate a known-good dataset, confirm your checks pass, then swap in the production table and see what fails. Having generation and validation in the same package makes that iteration cycle very tight.

Wrapping up

Synthetic data generation sits at the intersection of several real needs: testing, prototyping, teaching, and privacy. Pointblank’s generate_dataset() tries to make it practical by handling the tedious parts automatically (type-appropriate random values, coherent cross-column relationships, country-specific formatting) so you can focus on the shape of the data you actually need.

Define a schema, call generate_dataset(), and you have a DataFrame ready to go. That simplicity matters when you need data but can’t use the real thing. If you’d like to explore further, the Pointblank website has extensive documentation on data generation, including a dedicated User Guide section and full API documentation for every field type and function covered here.

Rich Iannone

Software Engineer at Posit, PBC

Richard is a software engineer and table enthusiast. He and R go way back and he's been getting better at writing code in Python too. For the most part, Rich enjoys creating open source packages in R and Python so that people can do great things in their own work.
