Pharmaceutical machine learning with tidymodels and Posit Connect

Written by Simon Couch
Written by Max Kuhn
2023-06-26

A group of scientists investigate whether they can use drug information to predict if a proposed drug could be a mutagen (i.e., could cause toxicity by damaging DNA). By deploying a tidymodels machine learning model with Posit Connect, these scientists can rapidly assess new drugs for their potential harm to patients.

In pharmaceutical research, mutagenicity refers to a drug’s tendency to increase the rate of mutations due to the damage of genetic material, a key indicator that a drug may be a carcinogen. Mutagenicity can be evaluated using a lab test, though the test requires experienced scientists and time in the lab. A group of scientists are studying whether, instead, they can use known information to quickly predict the mutagenicity of new drugs.

The open-source tidymodels packages for machine learning empower these scientists to quickly propose, train, and evaluate a diversity of statistical approaches to predict mutagenicity. This work may be done in Posit Workbench, an enterprise tool allowing these researchers to code within their preferred development environment, with the ability to collaborate and scale when needed, all while securely and centrally managed. Based on their findings, the most performant machine learning model can then be integrated into a plumber API using Posit Connect, allowing scientists across the organization to quickly input drug information and evaluate the potential for drugs to harm patients.

This post will outline the steps to develop and deploy a machine learning model to predict drug mutagenicity. All of the source code for this process is available at https://github.com/simonpcouch/mutagen.

Training Data

The training data consists of 4,335 rows and 1,580 columns, where the first column gives the outcome of the lab test for a given proposed drug, and the remaining columns give known information about the chemical structure of the drug. This information about the chemical structure can be obtained much more quickly and cheaply than the outcome. In the future, scientists want to be able to predict whether a drug is a mutagen based only on the drug information.

# A tibble: 4,335 × 1,580
   outcome       MW   AMW    Sv    Se    Sp    Ss    Mv    Me    Mp    Ms   nAT
 1 mutagen     326.  7.59  29.3  42.6  30.6  50.7  0.68  0.99  0.71  2.03    43
 2 mutagen     174.  9.17  13.2  19.6  13.4  38    0.7   1.03  0.71  2.92    19
 3 nonmutagen  300.  9.39  20.0  33.6  21.0  61.2  0.63  1.05  0.66  3.06    32
 4 nonmutagen  143.  6.23  12.6  23.1  13.5  26.2  0.55  1     0.59  2.62    23
 5 nonmutagen  216. 18.0   10.6  13.0  11.7  27.1  0.88  1.08  0.98  2.71    12
 6 mutagen     190.  7.93  15.4  24.4  16.0  36    0.64  1.02  0.67  2.57    24
 7 mutagen     328. 12.6   18.8  27.1  20.0  49.4  0.72  1.04  0.77  2.75    26
 8 nonmutagen  324.  8.11  26.3  40.7  27.4  59.2  0.66  1.02  0.68  2.47    40
 9 mutagen     136.  7.56  11.3  18.2  11.8  25.7  0.63  1.01  0.65  2.57    18
10 mutagen     323.  7.89  26.8  41.5  27.9  54.9  0.65  1.01  0.68  2.29    41
# ℹ 4,325 more rows
# ℹ 1,568 more variables: nSK, nBT, nBO, nBM,
#   SCBO, ARR, nCIC, nCIR, RBN, RBF,
#   nDB, nTB, nAB, nH, nC, nN, nO,
#   nP, nS, nF, nCL, nBR, nI, nX,
#   nR03, nR04, nR05, nR06, nR07, nR08,
#   nR09, nR10, nR11, nR12, nBnz, ZM1, …

No particular predictor will allow us to straightforwardly predict whether a drug may be a mutagen. We can plot two commonly used predictors against the outcome to demonstrate:

However, using machine learning, we may be able to find patterns hidden among all of this data to predict whether a drug is a mutagen or not.

Developing the Model

The tidymodels packages provide a consistent interface to hundreds of machine learning models available across the R ecosystem. This consistency allows us to quickly try out a diversity of statistical approaches, relying on tidymodels to protect us from common modeling pitfalls and provide rigorous estimates of model performance.

First, we try out a number of different machine learning techniques to model the mutagenicity of these drugs and judge their effectiveness using a metric called the area under the ROC curve:

Based on the above plot, we see that a boosted tree model fitted with XGBoost consistently outperforms the other models we evaluate, with out-of-sample areas under the ROC curve above 0.85 (a value of 1.0 indicates perfect discrimination, while 0.5 is no better than random guessing). We will thus use these initial results to optimize our XGBoost model further with an approach called simulated annealing.
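To make the metric concrete, here is a minimal sketch of how an area under the ROC curve can be computed from predicted probabilities. It is written in plain Python rather than with the tidymodels packages, and it is not the code from the linked repository. The function name `roc_auc` and the four toy observations are assumptions made up for illustration. The sketch uses the rank-based (Mann-Whitney) formulation: the area equals the probability that a randomly chosen mutagen receives a higher score than a randomly chosen nonmutagen, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation.

    labels: sequence of 1 (event, e.g. mutagen) or 0 (non-event)
    scores: predicted probability of the event, same length as labels
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute the AUC")

    # U statistic of the positive class, scaled by the number of pairs.
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy data (made up): 1 = mutagen, 0 = nonmutagen.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

In the toy data, three of the four mutagen/nonmutagen pairs are ordered correctly, which gives 0.75.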

Simulated annealing performs an iterative search, using results from previous iterations to inform later optimizations. In this search, we see that optimizations made in early iterations resulted in higher ROC scores. The search then proposed optimizations that resulted in less performant models before discovering more performant optimizations later on, giving a maximum out-of-sample ROC score of 0.903. Fitting the best model to the full training set, we see a final test set ROC score of 0.912, indicating that our model generalizes well to data it hasn’t yet seen.
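The accept-or-reject logic behind this kind of search can be sketched in a few lines. The example below is a generic, simplified simulated annealing loop in plain Python, not the tuning code used for the results above. The objective `toy_score`, the proposal function, and every parameter value are hypothetical stand-ins: in the real workflow, the score would be a resampled area under the ROC curve for a set of XGBoost hyperparameters.

```python
import math
import random


def simulated_annealing(score, start, propose, n_iter=200,
                        temp0=1.0, cooling=0.98, seed=1):
    """Maximize `score` with a basic simulated annealing loop."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temp = temp0
    history = [current_score]

    for _ in range(n_iter):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score

        # Always accept improvements. Accept a worse candidate with a
        # probability that shrinks as the temperature cools, which lets
        # the search leave local optima early on.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score

        if current_score > best_score:
            best, best_score = current, current_score

        history.append(current_score)
        temp *= cooling

    return best, best_score, history


def toy_score(x):
    # Hypothetical objective with its maximum at x = 3.
    return -(x - 3.0) ** 2


def toy_propose(x, rng):
    # Propose a nearby value of the single tuning parameter.
    return x + rng.gauss(0.0, 0.5)


best, best_score, history = simulated_annealing(toy_score, 0.0, toy_propose)
```

The `history` list records the score of the accepted candidate at each iteration. It can dip before it recovers, which mirrors the pattern described above, where less performant proposals came before the best result.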

Model Deployment

With our final model fitted and benchmarked, it’s time to put this model into practice. Using vetiver, a multilingual MLOps framework, we can quickly develop a plumber API that gives users a friendly interface to the fitted model. We then host the API on Posit Connect, a safe and performant server that delivers model predictions to practitioners within our organization.

The vetiver plumber API provides documentation and templates for generating predictions from the deployed model. Hosting the API on Posit Connect allows us to easily edit the model’s metadata and documentation, securely manage permissions among our organization, and monitor the model’s usage.
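As a rough illustration of how a colleague might request a prediction from such a deployment, the sketch below builds a request in plain Python without sending it. Several details are assumptions rather than facts from this post: the server URL and API key are placeholders, the helper name `build_predict_request` is made up, and we assume the API exposes a `/predict` route that accepts a JSON array of records and that Connect authenticates with an `Authorization: Key ...` header. The three predictor values come from the first row of the training data shown earlier. A real request would need every predictor column the model expects.

```python
import json
import urllib.request


def build_predict_request(base_url, api_key, records):
    """Build (but do not send) a POST request to a /predict endpoint."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Key " + api_key,
        },
    )


# Hypothetical URL and key; the record is truncated to three columns.
new_drugs = [{"MW": 326.0, "AMW": 7.59, "Sv": 29.3}]
request = build_predict_request(
    "https://connect.example.com/content/mutagen-model",
    "MY_API_KEY",
    new_drugs,
)

# To send the request against a real deployment:
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)
```

Keeping the request construction separate from the network call makes the payload easy to inspect before anything reaches the server.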

Learn more

Above, Simon and Max walked us through a model development and deployment process using tidymodels. You can find all of the source code for data pre-processing, model training, and deployment to Posit Connect with vetiver publicly available at https://github.com/simonpcouch/mutagen.

Learn more about all the tools used in this use case:

  • Learn more about tidymodels. The tidymodels packages provide a consistent interface to hundreds of machine learning models available across the open-source R ecosystem, empowering data scientists to quickly propose, train, and evaluate a diversity of statistical approaches.
  • Learn more about Posit Workbench. Workbench allows data scientists to code in both R and Python within their preferred development environment, without any additional strain on IT. Tap into more compute power, collaborate in real-time with others on data science projects, and access enterprise features like centralized management, security, and commercial support.
  • Learn more about vetiver. Using vetiver – an open-source, multilingual MLOps framework to deploy and maintain machine learning models in production reliably and efficiently – you can quickly develop a plumber API to provide a user-friendly interface to the fitted model.
  • Learn more about developing APIs with plumber.
  • Learn more about Posit Connect. Connect allows teams to deploy work created in R and Python, including APIs, Shiny apps, data, models, notebooks, dashboards, and much more. Use Connect to automate code execution so your data products are always up to date and to give stakeholders, collaborators, and systems the right access to the content they need.

If you’re interested in learning more about how Posit’s enterprise products, like Workbench, Connect, and Package Manager, can help your teams, please schedule a call with our sales team.

Simon Couch

Software Engineer at Posit, PBC
Simon Couch is a member of the AI Core Team at Posit, working at the intersection of R and LLMs. He’s authored several packages that help R users get more out of LLMs, from package-based assistants to tools for evaluation to implementations of emerging technologies like the Model Context Protocol. Drawing on his background in statistics, Simon worked on the tidymodels framework for machine learning in R for a number of years before transitioning to working on LLMs.

Max Kuhn

Software Engineer at Posit, PBC
Max Kuhn is a software engineer at Posit (née RStudio). He is working on improving R's modeling capabilities and maintaining about 30 packages, including caret. He was a Senior Director of Nonclinical Statistics at Pfizer and had been applying models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel award from the American Statistical Association. Their second book, Feature Engineering and Selection, was published in 2019, and his book Tidy Modeling with R was published in 2022.