Nick Rohrbaugh

Data Science Leadership

Why R and Python developers need shared infrastructure

2023-05-25


For many years, data science managers, Chief Data Officers, Chief Analytics Officers, and countless others who spend their careers working with data agonized over whether to use R or Python for a specific data project, within their team, or across their entire organization.

As a Customer Success Manager for Posit (previously RStudio), I spent four years talking to hundreds of our enterprise customers about this topic and the relative merits of each language. At the time, I would often compare the R vs. Python “debate” to a series of old commercials by Apple featuring Justin Long, who played a cool, laid-back personification of a Mac, and John Hodgman, who played a PC that was stuffy, bumbling, and overly focused on work. Microsoft countered with its own ad campaign.

When those commercials ran on television in the mid-2000s, it seemed like Apple and Microsoft might continue their battle over the personal computing market forever. These days, arguing about Mac vs. PC seems silly. Microsoft now develops an Office app for iOS, Apple helps support using iCloud in Windows, and whether you use a Mac or PC, you likely work with other people every day who use an alternative operating system without it impacting your ability to collaborate.

I feel the same way about the R vs. Python discussion today. In our eyes, it’s not a debate. It’s a love story where organizations combine their own strengths and experience with the relative strengths of each programming language to produce the best results.

Do R and Python developers have different needs?

Both R and Python are immensely powerful programming languages for a wide variety of tasks, and each has its own relative strengths and weaknesses. Many consider R to be an excellent choice for research and statistics, data exploration and wrangling, and data visualization and reporting. In contrast, others prefer Python for data engineering, machine learning, and MLOps. But do R and Python developers have vastly different needs when it comes to providing them with tools for working with data?

In some cases, yes! Python developers who design data pipelines or train machine learning models may need a more powerful workspace to process large amounts of data that feed into their pipelines and models. R developers who build customer-facing reports and applications may need tools that enable them to iterate quickly to respond to customer feedback. And to state the obvious, R and Python are separate languages with different package ecosystems, and both R and Python developers need access to the right versions of their programming language and their packages to do their work.

Despite these differences, R and Python developers of all stripes require a common set of tools to do great work, which adds value to their organization.

What R and Python developers need

While their day-to-day work may differ significantly, developers using open-source languages like R and Python to work with data share a similar set of needs.

Access to the right data

Without access to the right data, R and Python developers can’t produce useful insights. Data scientists and other data professionals coding in R and Python may work with a wide variety of data formats, from flat files like CSVs and spreadsheets to structured data in enterprise databases and everything in between. That data might be stored locally or in the cloud and may be sourced internally, purchased from a third-party vendor, or downloaded or scraped from the public internet.

In many cases, R and Python developers require access to company data stored in a database or data warehouse like Snowflake, SQL Server, or Teradata. Connecting to these databases is often done through the use of ODBC or JDBC, two popular APIs for accessing data within databases. This requires developers to have the correct database driver installed and configured on their development machine in order to access the right data.
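
As a minimal sketch of what that configuration looks like from the developer's side, the Python below assembles an ODBC connection string and hands it to the third-party `pyodbc` package. The server and database values are hypothetical placeholders, the helper function names are invented for this example, and the keywords a given driver accepts vary, so treat this as an illustration rather than a recipe for any particular database.

```python
def build_odbc_connection_string(driver: str, server: str, database: str, **extra: str) -> str:
    """Assemble a semicolon-delimited ODBC connection string."""
    parts = {"DRIVER": "{" + driver + "}", "SERVER": server, "DATABASE": database}
    parts.update(extra)  # driver-specific keywords, passed through unchanged
    return ";".join(f"{key}={value}" for key, value in parts.items())


def connect(connection_string: str):
    """Open a connection; requires pyodbc and the named driver to be installed."""
    import pyodbc  # third-party; imported lazily so the helper above works without it
    return pyodbc.connect(connection_string)


# Hypothetical values for illustration only.
conn_str = build_odbc_connection_string(
    driver="ODBC Driver 18 for SQL Server",
    server="db.example.com",
    database="sales",
    Trusted_Connection="yes",
)
print(conn_str)
```

If the driver named in the string is not installed on the machine, the call to `connect` fails, which is exactly the configuration burden described above.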

A place to build (a development environment)

Most R and Python developers have a favorite integrated development environment (IDE), the software where they write code to produce data pipelines, models, apps, reports, and other data products that add value to their organization. In R, millions of people each week use RStudio, an open-source IDE developed by Posit, to analyze data. In Python, many data scientists prefer using Jupyter Notebooks to analyze their data, along with JupyterLab and Visual Studio Code.

While there might be significant overlap in features available between different IDEs, a developer’s choice of editor is often highly personal and important to their productivity. Many developers have used the same IDE for years. They are averse to learning a new editor or using a “Frankenstein” editor that’s based on an open-source IDE but otherwise unfamiliar.

Critically, one’s development environment must be powerful enough to do the work the developer needs – merging data sets, training a model, creating a visualization of their data, or some other task. While R and Python developers may work with different kinds of data or amounts of data within your organization, the two programming languages share some similarities when it comes to data science and analytics.

Both R and Python require central processing unit (CPU) cycles to perform operations on data. In both R and Python, data is typically stored in memory (RAM). Copies of the data are made in memory as data is cleaned, transformed, modeled, and visualized, meaning you need enough working memory to hold a few copies of your data, something that can be difficult to accommodate when working with data locally on one’s laptop or desktop. And developers using either language will sometimes want to write files like their scripts, notebooks, and other objects created during development to persistent long-term storage so they can access them later.
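
A rough, back-of-the-envelope way to size a workspace follows from that description: multiply the in-memory size of the data by the number of copies you expect to hold at once. The sketch below does that arithmetic in Python. The 8 bytes per value assumes 64-bit numeric columns, and the default of 3 copies is an illustrative assumption rather than a measured figure.

```python
def estimate_working_memory_gib(rows: int, columns: int, copies: int = 3, bytes_per_value: int = 8) -> float:
    """Estimate the RAM (in GiB) needed to hold `copies` in-memory copies of a numeric table."""
    single_copy_bytes = rows * columns * bytes_per_value
    return single_copy_bytes * copies / 2**30


# Hypothetical table: 50 million rows by 20 numeric columns.
needed = estimate_working_memory_gib(rows=50_000_000, columns=20)
print(f"{needed:.1f} GiB")
```

Real memory use also depends on column types, string data, and how a given library handles copies, so an estimate like this is only a starting point for sizing.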

A place to share (a production environment)

To add value to their organization, R and Python developers need a place to share the work they’ve done. Sharing open-source work in the enterprise is a difficult problem for a few reasons:

Security. Putting an application or model into production typically requires greater attention to security than is enforced during development, as the risk of sharing the wrong data or information with the wrong person is higher in production. At the same time, most open-source tools prioritize the developer experience over hard-to-solve problems like security.

Scalability. It’s one thing for a single developer to use an application or model alone. It’s another thing entirely when hundreds or thousands of people are relying on that work to run your business. Your production infrastructure must scale to handle the load of those end users accessing your apps, models, and other data products while serving them reliably.

Consumer experience. Open-source packages for R and Python are typically built with developers in mind to help those developers produce meaningful insights. Consumers of their work, such as business leaders, executives, and external customers and clients often struggle to discover those insights, access the latest versions reliably, and answer follow-up “what if” questions themselves. A successful production environment will prioritize the end consumers of data products, making sure they can always find the right answers quickly and with confidence.

Breadth of output formats. One of the most difficult problems in supporting data science in production is supporting the massive variety of open-source packages and frameworks that R and Python developers use to build their pipelines, models, visualizations, apps, and reports. ML engineers and scientists might build their models with torch/PyTorch, TensorFlow, Keras, scikit-learn, tidymodels, or H2O (among many others) and want to deploy them as a FastAPI, Flask, or Plumber API. App and report builders might use Shiny, Streamlit, Dash, Bokeh, Quarto, R Markdown, or Jupyter to share their work visually. Data engineers want to schedule their code to run regularly and receive alerts when something goes wrong or when a metric meets some threshold. An effective production environment will facilitate all of these use cases so developers can build with the tools they know and love, helping them work more quickly and produce higher-quality work.

Access to the right packages

From development to production, R and Python developers rely on the vast ecosystem of open-source packages to help them import data and turn it into something valuable to your company. Developers need to be able to install the right versions of the packages they need to produce great work.

At the same time, many organizations worry about the risks of running potentially hundreds of thousands of unvetted open-source packages on their sensitive data. Most organizations want some level of control over which packages are available for developers to install, blocking access to known risks or only allowing packages known to be safe and necessary.

Data scientists, data analysts, and other data professionals writing R and Python code have a unique set of needs when it comes to package management in the enterprise. They need (often undocumented) system dependencies to be installed for their packages to work, they need the ability to install older versions of packages to reproduce their work, and they might want to share their own internally-developed packages for accessing company data or building consistent reports and visualizations.
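
One lightweight way to check that an environment matches the versions a piece of work was built with is to compare pinned versions against what is actually installed. The following Python sketch uses the standard library's `importlib.metadata`. The helper functions are invented for this example, and the package name and version in `pins` are hypothetical placeholders.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins: dict) -> dict:
    """Map each package whose installed version differs from its pin to (pinned, installed)."""
    mismatches = {}
    for package, pinned in pins.items():
        found = installed_version(package)
        if found != pinned:
            mismatches[package] = (pinned, found)
    return mismatches


# Hypothetical pin for a made-up internal package that is not expected to be installed.
pins = {"hypothetical-internal-reporting-pkg-xyz": "1.4.2"}
print(find_mismatches(pins))
```

An empty result means the environment matches the pins. Anything else lists what needs to be installed or downgraded before the work can be reproduced.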

The costs of separate data science infrastructure

R and Python developers may share a similar set of needs, but many organizations opt to serve these developers with separate data science infrastructure. Open-source tools for enabling development on servers are often focused on just one language. For example, many organizations will maintain both RStudio Server and JupyterHub in separate server environments for R and Python users, respectively. Members of the same data science and analytics team also tend to share the same language and tooling, so departments often purchase and provision infrastructure for R users and Python users at different times, leading to bifurcated tooling over time.

Maintaining separate infrastructure for R and Python comes with costs, however.

System configuration and maintenance

Every tool that IT or DevOps supports includes hidden costs beyond license fees, like the resources needed to install and properly configure the software, maintain it by testing and installing updates, and respond to any user complaints. Put simply, the more tools you support, the more you’ll need to invest in hiring and training DevOps staff to adequately support those tools.

Security

Most open-source tools aren’t designed with enterprise security in mind, which puts pressure on IT and developers to find solutions that enable developers to build and share insightful data products using open-source tools while minimizing risk. In an ideal world, data products like interactive apps and model APIs are protected by the same security and authentication layers as other enterprise software, and consumers don’t need to remember extra passwords or question whether something is really secure. Providing a secure experience that users enjoy is difficult to do, especially when you need to do it twice for R and Python users separately.

Finding insights

The most insightful model or dashboard is useless if the right decision-maker can’t find it at the right moment in time. Business leaders and other decision-makers are unlikely to know or even care if their model or dashboard is written in R or Python, but many companies deploy their shared data products to language-specific platforms, making it hard for consumers of data products to find the right answer quickly. In the worst-case scenario, every open-source framework has its own specific publishing environment, which means configuring and maintaining a Shiny server for Shiny apps, a Streamlit server for Streamlit apps, a Dash server for Dash apps, an API server for APIs, and so on – a logistical nightmare for DevOps, developers, and decision-makers alike.

Posit Team: End-to-end tools for R and Python in the enterprise

At Posit, our mission is to build open-source software that people love to use for data science, scientific research, and technical communication in both R and Python. We’ve designed Posit Team as a bundle of our three flagship professional products to meet the needs of R developers, Python developers, IT/DevOps professionals, and end consumers of data products.

Build in your favorite editors with Posit Workbench

Posit Workbench allows developers to code in both R and Python using RStudio, Jupyter Notebooks, JupyterLab, and VS Code in a secure, scalable, centralized environment.

Administrators can configure Workbench to use their company’s authentication provider via SAML, Active Directory, or PAM, so both R and Python developers can sign into a single development platform with the same credentials they use for other services at work. Workbench can be deployed in a number of different architectures to meet your needs on-premises or in the cloud, including on a single server, in a load-balanced cluster, and in Kubernetes.

Share data products with Posit Connect

Posit Connect is the publishing platform for data products worth sharing. Connect lets you easily deploy everything you create in R & Python, from interactive applications built with packages like Shiny, Streamlit, and Dash to documents, notebooks, and dashboards, as well as machine learning models hosted as APIs.

Connect is designed with developers and consumers in mind. R and Python developers can deploy, schedule, and share their content without having to fiddle around with Dockerfiles and cron jobs or worry about package dependencies. Business leaders and other end consumers of data products can easily find what they’re looking for with search and tags to organize content.

Connect supports deploying all of these open-source frameworks, meaning you don’t need to maintain separate infrastructure for hosting each:

Applications & Dashboards: Shiny for R, Shiny for Python, Streamlit, Dash, Bokeh, Voilà
Reports: Quarto, R Markdown, Jupyter Notebooks
Models & APIs: Vetiver, Plumber, FastAPI, Flask, Tableau Analytics Extensions
Data sets: Pins for R and Python

Control the availability of packages with Posit Package Manager

Posit Package Manager provides administrators with repository management tools that are optimized for data science with R and Python. Package Manager lets you host packages from CRAN, Bioconductor, PyPI, and internal packages within your firewall, and provides tools for creating curated package repositories and blocking access to specific packages or open-source licenses.

R and Python developers can time travel to a specific date to install the package versions that were available at that point in time, which helps them keep their work reproducible. They can also easily access package documentation.
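
Conceptually, installing as of a date means choosing, for each package, the newest version released on or before that date. The Python sketch below illustrates that selection logic only. The release history is made up for illustration, and this is not how Package Manager itself is implemented or queried.

```python
from datetime import date


def version_as_of(releases: dict, snapshot: date):
    """Return the newest version released on or before `snapshot`, or None if there is none."""
    eligible = [(released, version) for version, released in releases.items() if released <= snapshot]
    if not eligible:
        return None
    # Tuples sort by release date first, so max() picks the latest eligible release.
    return max(eligible)[1]


# Hypothetical release history for a made-up package.
releases = {
    "1.0.0": date(2021, 3, 1),
    "1.1.0": date(2022, 6, 15),
    "2.0.0": date(2023, 4, 20),
}
print(version_as_of(releases, date(2022, 12, 31)))
```

Resolving every dependency against the same date is what lets an analysis written a year ago run against the package versions it was originally built with.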

Become an R + Python DevOps Champion

While R and Python developers can sometimes have different needs when it comes to working with data, they largely share a need for access to data, packages, and the right environments to build data products and share their work with others.

Interested in learning more about how Posit can help you become a DevOps champion enabling R and Python developers together within your organization? Learn more about Posit Team and book a demo at posit.co/team.

Tags: infrastructure, Posit Connect, Posit Package Manager, Posit Workbench