Dan Boisvert

Jamie Warner

Rachael Dempsey

Data Science Leadership

The path to effective data stewardship

As a data scientist or data leader, you put data at the center of your work. However, managing that data can feel overlooked. Many of us have asked questions like:

How do we ensure data we create is easily discoverable and reusable?
What happens when multiple people make different data transformations for similar work?

The focused effort of making data accessible and secure goes by many names, including data stewardship, data enablement, and data governance.

While it’s needed across all levels and departments within an organization, best practices for managing data remain difficult to define.

Last month, we got together as a community to discuss the importance of driving data stewardship at the individual level. Jamie Warner, Managing Director, Data Science & Pricing at Plymouth Rock Assurance, and Dan Boisvert, Senior Director Head of Data Stewardship at Biogen, kicked off the conversation by drawing from their own experiences.

Over 80 questions were submitted in advance, and this blog post summarizes the Q&A covered during the session. We grouped similar questions into themes to address as many as possible, but there’s still much more to explore. A great place to continue these conversations is Posit’s weekly Data Science Hangout.

A few key takeaways from the conversation:

Change management is hard. To inspire people to change, make the path as easy to see as possible and create tools that make new tasks easy to complete. A problem not only has to exist; it also has to be recognized as a problem. In the data governance space, legal regulations can help push us forward because they create urgency.
Data stewardship and data governance enable innovation. You see more of what’s possible when you know what data exists. Core systems form the foundation that enables cutting-edge work like AI development.
People have good intentions and want to get their work done and impact the organization. Don’t create processes where the only way to get your job done is to break the process. In the act of governing, remember why you started pulling data together in the first place.
Community engagement is incredibly powerful, and hearing different perspectives is how we grow. We all face similar problems regardless of our role or industry. An internal community for cross-organizational collaboration is vital for knowing what other people are working on and for collaborating with them. This not only helps you with your data, but also with your career.
Implementing effective data stewardship is a journey with companies at all different stages. Those further along the journey may have teams and infrastructure dedicated to this, along with executive buy-in. Others are working to make it happen at the individual level – where it often starts with writing down the definitions of your data (who created it, when it was sourced, intended use, etc.) and recognizing that someone else will use your data output, potentially not in the way you initially intended.

Recording of our community event

Event Q&A

How do you effectively communicate data management standards to your team? How do you engage with teams that are resistant to change?

Change management is very hard. People are busy. It’s not necessarily that they don’t want to change, but you have to show them the path. People also need to recognize that there is a problem. As you lead in this space, you have to show the pain of not being deliberate about governance.

The reason people resist change isn’t always because they don’t see the value, but because they are busy and don’t see how it will help them in the short term. Make it as easy as possible for them. Make a toolkit or framework that is easy to use and share it broadly.

Often when you see resistance to change, it is because people don’t know what you’re asking them to do. When you’re making changes, make sure there’s a value statement behind them that helps people tie what they’re doing back to the value they’re creating.
Dan recommended a book called Switch by Chip and Dan Heath about change management, especially for thinking about the rational mind, emotional mind, and the path to getting there. For change management, you need to solve a problem that actually exists and one that is recognized to exist.
Jamie added that pointing to some of the legal boundaries around data governance (like GDPR, California Right to be Forgotten) is a great way to create an incentive. People seem excited about investing in AI, but less so in the core systems that actually enable advanced data work.
If you can show the pain to the organization, that is when you can get people to adjust their mindset.

We store a lot of data. Are there guidelines/standards for the minimum information that must accompany a dataset?

First, we’re all trying to figure out what good looks like here.
There’s reasonably good research on high-level frameworks: Where did it come from? What was its intended use? Who created it? When did they create it? What time period does it cover? Why did you source it? Who can use it?
The answers can depend on where you are in the organization, your industry, your location, and so on, which is why you won’t find a general set of requirements for every case. In some cases there is a data governance group that can help define this.
Jamie shared a chapter on Data Governance over a Data Life Cycle from O’Reilly as well: https://www.oreilly.com/library/view/data-governance-the/9781492063483/ch04.html
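One way to make the questions above concrete is to keep a small metadata record next to each dataset. The sketch below is only an illustration, not a standard from the session: the class name `DatasetRecord`, its field names, and the sample values are all hypothetical, and your organization's required fields may differ.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata to keep alongside a dataset."""
    name: str
    source: str           # where the data came from
    intended_use: str     # what it was meant to be used for
    created_by: str       # who created it
    created_on: date      # when it was created
    period_start: date    # start of the time period the data covers
    period_end: date      # end of the time period the data covers
    reason_sourced: str   # why it was sourced
    allowed_users: List[str] = field(default_factory=list)  # who can use it

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left empty."""
        return [k for k, v in asdict(self).items() if v in ("", [], None)]


# Illustrative values only; none of these come from the event.
record = DatasetRecord(
    name="policy_quotes",
    source="quoting system export",
    intended_use="pricing model training",
    created_by="",
    created_on=date(2024, 5, 1),
    period_start=date(2023, 1, 1),
    period_end=date(2023, 12, 31),
    reason_sourced="annual rate review",
)
print(record.missing_fields())  # ['created_by', 'allowed_users']
```

A check like `missing_fields` could run before a dataset is shared, so gaps in the documentation surface early rather than when someone else tries to reuse the data.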

What do you think of the idea that data governance could/should be rebranded as data enablement?

Dan shared that “something about the word [data governance] holds a lot of baggage for people. I saw that early on in this journey and rebranded it to data stewardship.” He added that data used to be fully centralized, so you could reasonably have centralized data governance over it. Over time, though, data spreads out, and we’re managing the mesh. When you manage the mesh, you need people at the individual points to manage it, and this is where the idea of data stewardship comes in.
Jamie shared: “…it depends on who I’m talking to. For senior leadership, the idea of data governance is really expensive. Data enablement can be a good catchphrase with leadership to get us where we need to go and get what we need investment-wise. I think it’s important to have frank conversations with data scientists, data engineers, etc. about the fact that this is something that’s important – we have to care about this, and it is critical to the work that you do.” Jamie added, “If you’re curious about your data, you’re actually already doing data governance. You’re going to be asking questions about what looks weird, how is this defined, do I understand this? You being really good at the governance piece translates to being really good at your job as well.”

In theory, an organization should have a single source of truth, but in reality, many have their own definition of metrics and what to track under the same name. How do we overcome that?

Jamie: I think this is actually something we should embrace a little more. Different definitions exist in different environments. The question is – how do we tie it all together? Organizations that say you can only get data from here, and there is only one way to define a field, end up with a lot of shadow IT departments or shadow data that’s been downloaded and massaged. If instead we say we know we have different definitions for this, and ask how we integrate them into our systems, we end up with a much better outcome where people actually get what they need. I think this is the reality of the way we work.
Dan: People are going to go do their jobs whether you like it or not. I’m a big proponent of not creating a process where the only way to get your job done is to break the process.

What is your advice on how to reduce duplicate data stewardship when multiple people are making similar data transformations for similar work?

Dan recommended getting people together in the organization to talk. Grab a list of people who have access to the data and get them to come together once a month or once a quarter. No one wants to do duplicative work. There’s an opportunity for collaboration here. Sometimes the work is duplicative and can be shut off, but sometimes that doesn’t work because of the reality of our systems and how we work.

Following up on that, how do you know when someone else is doing the same work within the organization?

Jamie recommended that if you’re a data scientist, you should know the other data scientists in your organization, if nothing else in case your department gets shuttered. It could be through something like this meetup today. Knowing what other people with your type of role do across the company is really important. You can set up informational interviews or virtual coffees that not only help you with your data, but with your career. It also means the first time you approach someone isn’t with “Hey, we make the same thing,” which can feel a little threatening.
Make sure you have a community. We have a monthly meetup of folks using certain data tools where we walk through what we’re all working on and that really helps us feel like a team to get these efforts pulled together.

Are there possible tools that could facilitate data governance?

Dan shared that the need really has to be there, and the organization has to see that the people doing the work need it. Going from “it’s a mess right now” to putting everything in whatever data governance tool you want is too much of a jump for people.
Jamie added that she’s starting to see a lot of AI tools getting integrated, so that it’s almost like you don’t need your own solution. Tools used for other parts of the process, like ML workflows, are starting to focus on the importance of the data and on it being core to everything we do. Different platforms, such as Databricks, are adding features that allow you to better govern your data as you work with it.
Dan added: When we say data governance, there are probably some high-risk things that aren’t appreciated at the ground level that need control. You’re not going to find a business case for them. There’s not going to be positive ROI for them, but they still need to be done.

What resources beyond this session could I use to learn more about data stewardship?

Dan shared the Data Management Body of Knowledge book and recommended community conversations like this one with other people. There aren’t a lot of data stewardship books out there, so a lot of knowledge you’ll get also comes from your colleagues.

To engage further with Posit:

Every Thursday at 12 ET, we get together for the Data Science Hangout. While there’s no set agenda for the conversation, we often talk about data science leadership, data management best practices, speaking with executives, career advice, and more.
Many Posit customers also use enterprise platforms for data governance, like Databricks and Snowflake, to help people discover, access, and work with the right data securely and efficiently. Posit will be at the Snowflake Summit Dev Day this week and Databricks Data + AI Summit next week. You can use the links to RSVP for customer events we’re hosting there!

Tags: data management, data stewardship
