The Road to Building Ten Million Binaries

A Journey of Risks, Rewards, and R Packages.
2023-11-13
A plot of packages built by Posit from December 2019 to June 2023, segmented by the type of distribution and noting what version of R is being used. It goes from around 0 in December 2019 to over 10 million in June 2023.

Package Binaries

 

The Posit Package Manager team is committed to streamlining data science adoption and enhancing community productivity. One way we achieve this is by distributing pre-built binary packages: compiled code, specific to an OS and an R version, that users can install without manual compilation. Because each binary is tied to an OS and R version combination, supporting everyone requires a large build matrix: a dozen flavors of Linux (including Ubuntu, Debian, RHEL, and SUSE), macOS, and Windows; the last five versions of R; and builds for every historical snapshot back to 2017. Whenever we add a new distribution, we must ensure it has the same coverage, which translates to hundreds of thousands of packages built on demand.
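To make the scale concrete, here is a rough back-of-the-envelope sketch in Python. The platform names, the list of R versions, and the figure of roughly 20,000 CRAN packages are assumptions for illustration only, not numbers taken from our build system, and the sketch ignores historical snapshots entirely.

```python
from itertools import product

# Illustrative values only. The post mentions "a dozen flavors of Linux" and
# "the last five versions of R"; the names below are placeholders.
linux_distros = [f"linux-distro-{i}" for i in range(1, 13)]  # hypothetical names
platforms = linux_distros + ["macos", "windows"]
r_versions = ["4.3", "4.2", "4.1", "4.0", "3.6"]  # assumed "last five" versions

# Every (platform, R version) pair needs its own set of binaries.
build_targets = list(product(platforms, r_versions))
print(len(build_targets))  # 14 platforms x 5 R versions = 70

# Cost of adding ONE new distribution for a single snapshot,
# assuming roughly 20,000 packages on CRAN (an assumption, not a measured value).
ASSUMED_CRAN_PACKAGES = 20_000
builds_for_new_distro = len(r_versions) * ASSUMED_CRAN_PACKAGES
print(builds_for_new_distro)  # 100000
```

Even under these simplified assumptions, a single new distribution implies on the order of a hundred thousand builds before any historical snapshots are counted.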

Now, with nearly ten million of these binaries and millions of package downloads facilitated through Posit Public Package Manager each day, it’s clear that R and Python developers value efficient and reliable package management. The growing popularity of these pre-built packages highlights their cumulative benefits, which include reduced installation time, fewer system dependency requirements, and robust cross-package and cross-platform compatibility.

 

Speed and Efficiency

 

Let’s start by diving into one of the most immediate benefits: the significant reduction in installation time. Unlike compiling from source—a process that can be both laborious and resource-intensive—pre-built binaries allow users to download and install packages in a fraction of the time. This is particularly advantageous for CI/CD pipelines, where every second counts, especially during active development. For example, rstan and xgboost take minutes to build but install in seconds when pre-compiled.

 

Simplifying Dependencies

 

Another crucial benefit of pre-built packages is that they simplify dependencies. Pre-built binaries frequently bundle some of the dependencies they need, which makes installation much easier than compiling from source, a process that often requires complex build-time dependencies and comes with unclear instructions. By providing pre-packaged binaries, we eliminate a large chunk of this complexity, allowing users to focus on their core tasks rather than grapple with package management.

 

Cross-Platform Compatibility

 

Pre-built binaries are tailored to specific operating systems and architectures, ensuring that what users download will work out of the box on their chosen platform, whether that is Linux, Windows, or macOS.

 

Kickstarting our Journey with Linux

 

When the Package Manager project started, we noticed a significant gap in the R ecosystem: Linux binary packages were absent. While CRAN has provided Windows and macOS binaries since the early 2000s, Linux had been left behind, and we sought to fill this gap. Since launching our pre-built binaries, designed specifically to support Linux users, various CI systems, and Posit Workbench customers, we’ve seen widespread adoption in community-led projects such as GitHub Actions for R, Rocker Images, and Binder.

 

Overcoming Technical Challenges: A URL-based Solution

 

The first hurdle was R’s lack of native support for installing Linux binaries. We developed a URL and user-agent-based solution to simplify Linux binary installations. This is why Package Manager URLs include __linux__ to indicate that the downloaded package is optimized for Linux environments.
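As a minimal sketch of the URL side of this idea, the Python helper below assembles a repository URL that carries the `__linux__` marker. The path layout (repository name, then `__linux__`, then a distribution name, then a snapshot) and the example values `cran`, `jammy`, and `latest` are assumptions for illustration; check the Package Manager setup page for the exact URL for your system.

```python
def linux_binary_repo_url(base_url: str, repo: str, distro: str,
                          snapshot: str = "latest") -> str:
    """Return a repository URL whose path carries the __linux__ marker.

    The segment order used here is an assumed layout for illustration.
    """
    return "/".join([base_url.rstrip("/"), repo, "__linux__", distro, snapshot])


url = linux_binary_repo_url("https://packagemanager.posit.co", "cran", "jammy")
print(url)
# https://packagemanager.posit.co/cran/__linux__/jammy/latest
```

Encoding the target distribution in the URL means a client can request Linux binaries simply by pointing at a different repository address.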

 

Architecting for Scale and Security: Kubernetes and Isolation

 

To handle the intricacies of building each package individually, we opted for a Kubernetes-backed architecture. Each package is built in an isolated container, which has both challenges and advantages:

  1. Calculating Reverse-Package Graphs: Understanding the dependencies between packages was crucial to ensure that everything would be built in the right order.
  2. Guaranteed ABI Compatibility: We build snapshots of CRAN’s R packages and their reverse dependencies at a fixed point in time. This approach ensures that all packages are ABI-compatible with one another, providing an extra layer of reliability and cohesiveness in the package ecosystem for a given snapshot date.
  3. Managing Build-Time Dependencies: We provide minimal build-time dependencies without conflicts by isolating each build.
  4. Enhanced Security: Isolating builds in individual pods allowed us to add security features that reduce the risk posed by malicious packages. For example, each package build occurs in a restricted environment, running as an unprivileged user with fewer permissions.
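The first item, ordering builds by their dependencies, can be sketched with a topological sort. This is a simplified illustration using Python's standard `graphlib` module, not our actual scheduler; the dependency map below is a small hypothetical example, and the helper names are made up for this sketch.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified map: package -> packages it needs at build time.
deps = {
    "rlang": set(),
    "cli": set(),
    "lifecycle": {"rlang", "cli"},
    "vctrs": {"rlang", "cli", "lifecycle"},
    "tibble": {"vctrs", "lifecycle"},
}


def build_order(dependencies):
    """Return packages in an order where dependencies come before dependents."""
    return list(TopologicalSorter(dependencies).static_order())


def reverse_dependencies(dependencies, package):
    """Return every package that directly or transitively depends on `package`."""
    affected, frontier = set(), [package]
    while frontier:
        current = frontier.pop()
        for name, needs in dependencies.items():
            if current in needs and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


order = build_order(deps)
print(order)
print(sorted(reverse_dependencies(deps, "lifecycle")))  # ['tibble', 'vctrs']
```

The reverse lookup answers the question "what must be rebuilt if this package changes?", while the topological order answers "in what sequence?".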

Kubernetes can be a heavyweight solution, but our jobs are designed to be ephemeral: they can be rebuilt and rescheduled without conflicts or problems. This design means we can take advantage of spot pricing, reducing our costs when spinning up more than two hundred instances at peak.

 

Broadening Support for Linux Distributions

 

We started with distributions like Ubuntu 16, SLES 12, and CentOS 7. However, as the needs of our user base evolved, so did our support. We’ve since extended our offering to include distributions like Ubuntu 22, RHEL 9, Debian 12, and more, broadening the scope of users leveraging the packages.

It’s important to note that while we stop building binaries for end-of-life distributions, we never delete these packages. That means users can upgrade when it’s convenient and not when the packages they rely on suddenly disappear.

 

Navigating the Windows Ecosystem 

 

After we felt comfortable with our Linux builds, it was time to match CRAN and serve R package binaries for Windows. At the time, we estimated that over 50% of R users were using Windows, which made it a clear next step.

We had a choice: leverage existing CRAN Windows binaries or build our own. The decision was far from trivial. Using existing binaries would have initially saved us time and resources but would have constrained our control and flexibility. Ultimately, we decided to build our own binaries for several advantages, including compatibility, enhanced security, and control over versions. This also allowed us to support older R versions, catering to a broader user base not yet ready for an upgrade.

 

Taking a Leap with Kubernetes on Windows

 

Building our own binaries presented another pivotal choice: whether to extend our existing Kubernetes-based architecture to Windows. We explored Kubernetes on Windows, a promising yet uncertain step toward a unified system architecture in which bug fixes or improvements made to one system would also benefit the other.

We set up a new Kubernetes cluster configured to run native Windows pods. Windows on Kubernetes was released in March 2019, so it was still very new and came with hurdles, including limitations on the host and Docker image versions. We also had to grapple with image sizes that were significantly larger than their Linux counterparts.

 

Windows vs Linux R Package Binaries

 

One major difference between the binary packages built for Windows versus Linux is the practice of static linking. Windows packages typically statically link their system dependencies, which results in self-contained packages without external dependencies. For instance, when a package depends on openssl, the resulting Windows binary includes all the necessary components, which removes the need for separate system libraries. Conversely, on Linux, the package requires the openssl shared library to be installed on the system.

Another distinction lies in the R installation support. R on Windows natively supports installing Windows binaries, simplifying the user experience compared to Linux, where we need users to select the correct operating system and set the corresponding user-agent header.
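To illustrate why the user-agent header matters, the Python sketch below builds a header string that includes the R version and platform, then shows how a server could read the R minor version back out of it. The header format is modelled on the style of R's default user agent and is an assumption here, as are both function names; this is not a description of how Package Manager parses requests.

```python
import re
from typing import Optional


def r_user_agent(r_version: str, platform: str, arch: str, os_name: str) -> str:
    """Build a user-agent string that advertises the R version and platform.

    The exact format is an assumption for illustration.
    """
    return f"R/{r_version} R ({r_version} {platform} {arch} {os_name})"


def r_minor_from_user_agent(header: str) -> Optional[str]:
    """Extract the 'major.minor' R version from a header, or None if absent."""
    match = re.search(r"R \((\d+)\.(\d+)\.\d+ ", header)
    return f"{match.group(1)}.{match.group(2)}" if match else None


ua = r_user_agent("4.3.1", "x86_64-pc-linux-gnu", "x86_64", "linux-gnu")
print(ua)                           # R/4.3.1 R (4.3.1 x86_64-pc-linux-gnu x86_64 linux-gnu)
print(r_minor_from_user_agent(ua))  # 4.3
```

Without a header like this, a server has no way to know which R version a request comes from, and so cannot choose a matching binary.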

 

Conquering macOS

 

Finally, it was time to build R package binaries for macOS. We had held off for a while because we hoped there would be an easy way to isolate macOS instances or incorporate the operating system into our Kubernetes-based solution. With MRAN shutting down, the community’s need was urgent, so we acted.

We looked into dedicated instances through Amazon, a private data center managed by MacStadium, and GitHub’s macOS-based virtual machines. One of our engineers also undertook a research and development project, through which he discovered a path forward to cross-build macOS binaries using a Linux-based Docker image.

 

Build Solutions and the osxcross Project

 

Investigating the best way to build these binaries led us to evaluate various native macOS build solutions. While solutions like Amazon’s dedicated macOS machines and MacStadium offered native environments, the associated costs were not ideal for our use case. GitHub’s macOS-based runners were another option, but they had limitations, particularly their lack of arm64 support and the need for an entirely new build system.

While considering our options, one of our developers found the osxcross project on GitHub and attempted a proof-of-concept to cross-build R packages on Linux, targeting macOS systems. This cross-build system could easily be plugged into our existing Kubernetes architecture for Linux with little modification. While the proof-of-concept was a success, this path came with risks:

  • Build-time scripts: R packages can execute arbitrary code at build-time. This can include checks for which operating system the package is being built on, which is usually the assumed target.
  • Package dependencies: Packages require other packages at build time. It was unclear how we would provide these packages if they were built for macOS and couldn’t be installed normally.
  • Compatibility: While our proof-of-concept showed promise, it was unclear if the packages produced would always be compatible with actual macOS systems.
  • Apple Silicon: The osxcross project only partially supported the arm64 architecture, and we had decided that supporting Apple Silicon, which is based on arm64, was a priority.
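The build-time script risk is easiest to see with a small example. The Python sketch below contrasts a check that inspects the machine running the build with one that honours an explicit target. The `TARGET_OS` variable is hypothetical, invented for this illustration; it is not something osxcross or R defines.

```python
import platform


def detect_os_naive() -> str:
    """Report the OS of the machine running the build (the host)."""
    return platform.system()


def detect_os_cross_aware(env: dict) -> str:
    """Prefer an explicit target OS, falling back to the host OS.

    TARGET_OS is a hypothetical variable a cross-build harness might set.
    """
    return env.get("TARGET_OS") or platform.system()


# In a Linux container that cross-builds for macOS, the naive check reports
# the host, while the cross-aware check reports the intended target.
print(detect_os_naive())
print(detect_os_cross_aware({"TARGET_OS": "Darwin"}))  # Darwin
```

A package whose build script takes the naive approach would configure itself for Linux even though its output is meant for macOS, which is the kind of mismatch that needs patching.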

With a successful proof-of-concept in hand and these concerns in mind, we decided to push forward. In the worst-case scenario, we would only have been able to build pure R packages. That would have been a disappointment, but the work wouldn’t have been wasted: even if these challenges had proved insurmountable, building pure R packages this way would still have lowered the cost of an additional build system.

We can happily report we’ve made significant progress on each challenge. For build-time scripts, we added a handful of patches on an ad-hoc basis to fix any problems, and we continue to patch troublesome packages as we identify them. We worked around the package dependency problem both by relying on the --no-test-load flag for native macOS packages and by installing the corresponding Linux package. We have found some incompatibilities in the packages we’ve built, but we’re quickly fixing them. Finally, we extended previous work on a forked version of the osxcross project to build Apple Silicon binaries with the same x86 Linux images.

Our cross-build system has built over 95% of the top 1000 CRAN packages.

 

Lessons and Future Directions

 

The road to compiling ten million R package binaries has been adventurous, filled with intricate challenges and profound discoveries. Our pursuit of self-built binaries across different operating systems has expanded our technical acumen and ingrained in us the value of agility in this ever-changing landscape of package management.

 

Making a Mark on the Community

 

Over the past four years, we’ve seen adoption of Package Manager skyrocket: it has served up to eleven million packages in a single day and over three billion packages in total since its creation.

A line graph of cumulative daily package downloads from 2019 to 2023, rising from around 0 in 2019 to 3 billion in 2023.

Refining our Support

 

We started with essential Linux distributions, and now we’re continuing to research and expand our support as the community requires. The cost-efficiency of ARM Linux machines has also caught our attention. We’re in the early stages of figuring out how to build arm64 Linux binaries for R packages.

We understand the value of deepening the quality of our existing solutions. Take macOS, for example. CRAN has natively built macOS binaries, and we aim to reach that standard. Our cross-built macOS binaries have shown promise, but we recognize lingering compatibility hurdles. We’re actively engaged in iterative testing and are working to optimize our build environment to address compatibility and build limitations. This balanced approach ensures our offerings are wide-ranging, reliable, and effective.

Finally, security remains a cornerstone of our offerings. We’re exploring advanced measures like package virus scans using tools like ClamAV to continue protecting the community wherever possible.

 

Conclusion

 

For the last five years, Package Manager has navigated a complex landscape to deliver fast, reliable R packages across Linux, Windows, and macOS. Our journey has taught us invaluable lessons in package management and highlighted the broader challenges facing the developer community—challenges related to accessibility, reproducibility, and compatibility. As we continue to refine our offerings and explore new frontiers like environment management, we are committed to elevating the developer experience and safeguarding the ecosystem. As a final note, we’re incredibly grateful for the collective wisdom and support from everyone involved, from our team at Posit to the larger community.