Announcing DevOps for Data Science by Alex Gold
Are you a data scientist looking to make a bigger impact?
DevOps for Data Science by Alex K Gold offers a practical introduction to DevOps tools and practices tailored specifically for the data science field. From key conventions to effective collaboration with IT teams, this book empowers data scientists with strategies to enhance their workflows, handle some administration tasks, and ultimately, achieve more profound results.
About the Book
Where Can I Get DevOps for Data Science?
DevOps for Data Science is freely accessible online at do4ds.com, so readers can dive right in.
For those who prefer print, the book is also available through Routledge, Amazon, and reputable independent online bookstores.
Alex K Gold leads Posit’s Solutions Engineering team. His team has helped thousands of organizations make their systems for developing and sharing data science products more robust with open-source tooling and Posit’s Professional Products.
Why DevOps for Data Science Matters
Many of us became data scientists out of a passion for tackling challenging questions with statistics. Attend any conference, and you’ll often meet people who revel in exploring complex data, geeking out over their favorite R or Python packages, applying cutting-edge models or ML techniques, and creating stunning visualizations or web apps.
But as Alex writes,
Ultimately – frustratingly – these things don’t matter.
What does matter is whether your work is useful. That is, whether it affects decisions at your organization or in the broader world.
That means you must share your work by putting it in production.
Data scientists are often hired to create predictive models and share insights with stakeholders. However, they quickly discover that the data pipelines needed to power their models are fragile—or, in some cases, nonexistent. As a result, they end up cobbling together workflows just to make their work feasible.
Alex Gold, head of Posit’s Solutions Engineering team, sees this repeatedly. He realized that data scientists don’t have to start from scratch. Tools, best practices, and techniques from the field of DevOps (a blend of software development and IT operations) already exist and can significantly ease their path.
After helping numerous data science teams set up robust engineering pipelines, Alex decided to share his insights and experiences in this new book, offering guidance and solutions for the challenges data scientists face.
What You’ll Find in DevOps for Data Science
DevOps for Data Science guides data scientists through the essential skills needed to deploy their projects into production effectively, with a focus on making data science impactful and practical within organizations.
The book clarifies that “in production” doesn’t always mean complex machine learning pipelines but can simply be the process of sharing insights with stakeholders. It outlines core DevOps principles and IT administration basics, equipping data scientists to collaborate effectively with IT teams—or manage their environments independently when necessary.
The book is divided into three main parts.
Part 1 – DevOps Lessons for Data Science. This section covers essential DevOps practices for data science, focusing on managing environments, building robust app architectures, securely connecting to data sources, and incorporating monitoring and logging. It also explores deployment strategies to streamline moving projects into production and includes an introduction to using Docker for environment management and code sharing.
Part 2 – IT/Admin for Data Science. If you are an independent hobbyist or have only a small data science team, you may be able to operate without any IT/Admin support. This section walks through the basic concepts in IT administration that will get you to the point of being able to host and manage a basic data science environment. Even if you work at an organization with significant IT/Admin support, this section will equip you with the vocabulary to talk to the IT/Admins at your organization and some basic skills for doing IT/Admin tasks yourself.
Part 3 – Enterprise-Grade Data Science. This section is about how everything you learned in Part 2 is inadequate at organizations that operate at enterprise scale. If Part 2 explains how to do IT/Admin tasks yourself, Part 3 explains why you shouldn’t. It offers insights into IT/Admin priorities, such as protecting data and computational resources from threats, ensuring system reliability, and managing access to resources through layered security and the principle of least privilege. The chapters also discuss the tradeoffs IT/Admins face when deciding whether to build custom data science platforms from open-source tools or to invest in proprietary solutions.
What Readers are Saying
I ran a book club at the Data Science Learning Community for this book while Alex was writing it, with one member of the club presenting a chapter each week for the group to discuss. You can learn more about the Data Science Learning Community and our book clubs at DSLC.io! Each meeting was recorded, and is available at dslc.video/do4ds01. We learned a lot of useful information, which I’ve used to make things run more efficiently for our Community. Alex joined us three times for Q&A sessions, and was extremely responsive to our feedback. The final result is this tested and refined book. If you’d like to get data science products off of your computer and in front of users, I cannot recommend it highly enough!
~ Jon Harmon, Executive Director, Data Science Learning Community (DSLC.io)
You can find the DevOps for Data Science book club playlist — including 18 episode discussions, one for each chapter — on the DSLC YouTube channel.
In software development circles, people still point to The Phoenix Project as a must read for learning how successful IT operations orient themselves with the enterprise using lean manufacturing principles. As a piece of fiction it does a great job of demonstrating how people and relationships matter in what they introduce as a new framework called “DevOps,” and that there are markers for change control success, or maturity, to strive for. The book is in print in its fourth edition since first release in 2013.
Gold’s DevOps for Data Science (2024) goes further as a working handbook for establishing the workflows and platform components to ensure that when production data science work is released, it is reliable, environments are safe, and that the software will be available when people need it. If you have ever worked as a lone data scientist, administering your own (shadow) IT environment to build just the statistical models alone is overwhelming. After that, in my own experience, navigating the maze through siloed or outsourced InfoSEC and cloud services groups with their own agendas just does not work. DevOps for Data Science is organized around the relationships and tools that need to be in place before taking a job on. With this book as a template, I am less of an analyst or code writer and more of an internal consultant serving the business, building paths to success.
~ Jim Gruman, Product Manager at The Protectoseal Company
Between Shiny dashboards, report-making software like Quarto, and model deployment and serving tools like Vetiver and MLflow, data scientists have plenty of ways to share artifacts that can be consumed by others in a company as well as by users of a product. However, many data scientists have little exposure to what it takes to get a product to that last mile – making it productionized, scalable, and accepted by IT, devops, and security professionals. This can be the difference between a dashboard that only gets viewed by a few people on a data science team and one that the C-suite uses to inform executive decisions. This book certainly is helpful to address this gap and the unprecedented availability of the aforementioned tools makes this [book] very timely!
~ Kevin Kent, Data Scientist, Nuance Communications, USA