29 Jun 2023

You have to be able to reason about it

Joe Cheng

CTO at Posit, PBC
Join us with Joe to chat about all things Shiny, a web framework for data scientists.

Episode notes

We were recently joined by Joe Cheng, CTO at Posit, PBC, to chat about all things Shiny (a web framework for data scientists), career journeys, being vulnerable, and so much more.

At 39:53 – Joe shared what it means when he says, “you have to be able to reason about it”

When you write software or do an analysis or whatever, there’s a level of “I got it to work” and then there’s the level of “it works, and I can reason about it.”

Complex pieces of software are among the most complicated things that humankind has ever devised. What other human-made constructs can have hundreds of millions of pieces and yet be expected to fit and work together so precisely that, if one token is off, rockets explode?

I don’t know what the limit is, but even the smartest humans can only hold some small number of variables and operations in their head at any one time.

When we work on software that’s non-trivial, we work on software that in its totality is more than any one human mind can hold at any one time. The main challenge in software engineering is how to take all this complexity and break it down into smaller pieces, each of which you can reason about, each of which you can hold in your head, each of which you can look at and say, “Yeah, I can fully ingest this entire function definition. I can read it, line by line, and prove to myself that this is definitely correct, if the functions that it’s calling don’t have bugs and if it’s called in the right way.”

So with those caveats, if the things that this is calling are correct and are called correctly, then the result will be correct because the logic here is correct.

Software engineering, at all but the most beginner level, is a lot about this. How do you break up inherently complicated things that we’re trying to do into small pieces that are individually easy to reason about?

That’s half the battle right there.

The other half of the battle is – how do we combine them in ways that can be reliable and also easy to reason about?

So it’s these two pieces: small pieces, reliably composed. If you can achieve that, that’s what I’m talking about.

That’s software that you can reason about.

This has implications for data science as well. With data science, you’re doing some kind of analysis on some data, and it starts out as, oh, I’m just doing these simple things. I’m doing this manipulation and then I’m doing this visualization.

But then as you get deeper and deeper into it, it grows and grows and grows. You get to the end and you don’t remember where these variables came from.

You don’t remember what’s the difference between this data frame and this data frame. And you go back and hopefully start breaking it into functions or somehow dividing it into smaller pieces that each focus on a thing. Then you join those pieces together in your overall script.
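As a minimal sketch of what that refactoring might look like (the data, column names, and function names here are all hypothetical, not taken from the hangout), each step of a small analysis becomes a function that can be read and checked on its own, and the overall script just joins them:

```python
from statistics import mean

# Hypothetical records standing in for a real data set.
RAW_ROWS = [
    {"region": "west", "sales": "120"},
    {"region": "east", "sales": "80"},
    {"region": "west", "sales": ""},  # missing value
    {"region": "east", "sales": "100"},
]


def clean(rows):
    """Drop rows with missing sales and convert sales to float."""
    return [
        {"region": r["region"], "sales": float(r["sales"])}
        for r in rows
        if r["sales"] != ""
    ]


def group_by_region(rows):
    """Collect sales values into a dict keyed by region."""
    groups = {}
    for r in rows:
        groups.setdefault(r["region"], []).append(r["sales"])
    return groups


def summarize(groups):
    """Compute the mean sales per region."""
    return {region: mean(values) for region, values in groups.items()}


def run_analysis(rows):
    """The overall script: join the small pieces in one readable pipeline."""
    return summarize(group_by_region(clean(rows)))


result = run_analysis(RAW_ROWS)
print(result)
```

Each function is short enough to verify line by line, under the assumption that the functions it calls are correct and that it is given input in the shape it expects.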

That’s this principle of small pieces individually able to be reasoned about. When you think about other rules that you might have heard about software engineering – we know when you’re writing functions, using global variables is bad. That’s another one of these things where that hurts your ability to hold the entire function in your head and to prove that it’ll work correctly. Because who knows who is setting that global variable to what value? You can’t prove to yourself this function is definitely correct.
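To make the global-variable point concrete, here is a small illustrative sketch. The names `RATE`, `convert_with_global`, and `convert` are made up for this example. The first function cannot be verified by reading it alone, because its result depends on whatever the global holds at call time; the second puts everything it depends on in its signature.

```python
# Hypothetical conversion rate kept as a module-level global.
RATE = 1.5


def convert_with_global(amount):
    # The result depends on whatever RATE happens to be when this runs.
    return amount * RATE


def convert(amount, rate):
    # Everything the result depends on is visible in the signature.
    return amount * rate


before = convert_with_global(100)
RATE = 2.0  # some other part of the program changes the global
after = convert_with_global(100)

# Same call, same argument, different answers.
print(before, after)

# The explicit version gives the same answer for the same inputs, every time.
print(convert(100, 1.5), convert(100, 1.5))
```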

I think if there’s anything that you’re working on that needs to be correct and you do care that the answer is right, I try never to stop when I have an answer but I can’t reason about the code.

I always try to go back and do the refactoring that’s necessary just so I can prove to myself that the answer is right. 

The other big benefit to this is that, if it does turn out that there’s a mistake somewhere, you can debug and unit test those pieces individually.

When there are problems, you’ll much more easily be able to find them.
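A rough sketch of what testing pieces individually could look like, using plain assertions and two hypothetical helper functions (in practice a test runner such as pytest or unittest would usually collect and run the tests):

```python
def clean_values(values):
    """Return only the non-missing values (drop None)."""
    return [v for v in values if v is not None]


def average(values):
    """Mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("average() needs at least one value")
    return sum(values) / len(values)


def test_clean_values_drops_none():
    assert clean_values([1, None, 3]) == [1, 3]


def test_average_of_known_values():
    assert average([2, 4, 6]) == 4


def test_average_rejects_empty_input():
    try:
        average([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


# Run each test on its own; a failure points at one small piece.
for test in (
    test_clean_values_drops_none,
    test_average_of_known_values,
    test_average_rejects_empty_input,
):
    test()
print("all tests passed")
```

If one of these fails, the search for the mistake is limited to a few lines rather than the whole script.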
