Five essential models for data scientists in finance
2025-05-20
If you’ve worked with data before, you’re probably familiar with foundational models like logistic and linear regression. These techniques are important for establishing baselines and understanding relationships within data. As you grow your skills, however, you’ll find there’s a whole range of other interesting modeling methods you can learn and apply to your work.
One of the great things about the data science community is sharing knowledge, and our weekly Data Science Hangout is a space for data science leaders to do just that. We get to hear directly from finance experts; for example, Brad Zielke at Target shared insights on bridging the gap between complex technical work and non-technical business stakeholders, and Yu Cao at Exeter Finance talked about the challenges of including macroeconomic factors when forecasting.
Sometimes, our Hangout conversations go deep into specific data science techniques, with attendees asking how to apply them in finance. We’ve put together five types of models that have come up in these discussions, along with an example use case and code snippet for each.
Monte Carlo simulation
What it is
Monte Carlo simulation is a computational technique that uses random sampling and statistical modeling to simulate a range of possible outcomes. Running numerous simulations with different random inputs produces a probability distribution of potential results, which gives a more realistic view of what might happen than a single-point estimate.
Example use case
Risk management: Financial markets and economic factors are inherently uncertain. Instead of using single-point estimates for variables like interest rates or stock prices, Monte Carlo simulations use probability distributions (e.g., normal, log-normal, uniform) to represent the range of possible values and their likelihoods.
Code snippet
There are several Python packages that can help you perform Monte Carlo simulations, such as monaco and pandas-montecarlo (for pandas DataFrames). Sometimes, it’s easier to write your own function with pandas and numpy. Below is Python code that defines a monte_carlo_simulation function that simulates the potential growth of an investment over a specified number of years using a Monte Carlo method. The function takes the initial investment, the investment period in years, the number of simulations to run, the expected annual return, and the annual volatility as inputs. It then calculates the portfolio value at the end of each year for each simulation as the output.
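A minimal sketch of such a function, using numpy and pandas, might look like the following. The parameter values here (a $10,000 initial investment, a one-year horizon, 1,000 simulations, a 7% expected return, and 15% volatility) are illustrative assumptions, so the summary statistics will vary from run to run and won’t exactly match the figures shown below.

import numpy as np
import pandas as pd

def monte_carlo_simulation(initial_investment, years, num_simulations,
                           expected_annual_return, annual_volatility):
    """Simulate the portfolio value at the end of each year for every simulation run."""
    # Draw a random annual return for every year of every simulation
    random_returns = np.random.normal(expected_annual_return, annual_volatility,
                                      size=(years, num_simulations))
    # Compound the initial investment through the simulated returns
    portfolio_values = initial_investment * np.cumprod(1 + random_returns, axis=0)
    # One row per year, one column per simulation
    return pd.DataFrame(portfolio_values, index=range(1, years + 1))

# Illustrative assumptions: $10,000 invested for one year, 1,000 simulations,
# a 7% expected annual return, and 15% annual volatility
results = monte_carlo_simulation(initial_investment=10_000, years=1,
                                 num_simulations=1_000,
                                 expected_annual_return=0.07,
                                 annual_volatility=0.15)

# Summarize the distribution of final portfolio values
final_values = results.iloc[-1]
print(pd.Series({
    'Mean Final Value': final_values.mean(),
    'Median Final Value': final_values.median(),
    'Min Final Value': final_values.min(),
    'Max Final Value': final_values.max(),
    'Standard Deviation': final_values.std(),
}))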
Mean Final Value      10647.272577
Median Final Value    10703.371167
Min Final Value        7326.054614
Max Final Value       15564.862384
Standard Deviation     1610.235976
dtype: float64
It can be easier to analyze Monte Carlo simulations visually, such as in this Shiny app.
Sensitivity analysis
What it is
Sensitivity analysis is a method for evaluating how changes in one or more input variables affect the outcome of a model. The inputs are factors that can change, such as interest rates, and the output is the variable being measured, such as net profit. Sensitivity analysis helps analysts understand how uncertainty affects decision-making.
Example use case
Discounted Cash Flow (DCF) Valuation: A core concept in finance is the Discounted Cash Flow (DCF) analysis, which values an investment based on its projected future cash flows. However, these projections are inherently uncertain. The Discount Rate, typically the Weighted Average Cost of Capital (WACC), is used to discount future cash flows to their present value, and the Terminal Growth Rate is the assumed rate at which the company’s cash flows are expected to grow indefinitely after the forecast period. Sensitivity analysis would involve varying these two assumptions within a reasonable range and observing how the company’s estimated final valuation changes.
As JD Long from RenaissanceRe said during his Data Science Hangout, “I can get an awful long way on sensitivity analysis.”
Code snippet
There are several Python packages that can help you perform sensitivity analysis, such as SALib and sensitivity. You can also write your own sensitivity analysis with pandas and numpy.
Below is a simplified example inspired by the Tidy Finance chapter on Discounted Cash Flow Analysis. The dcf_value function calculates the enterprise value using the DCF method. We define a base case with example cash flows, a discount rate, and a terminal growth rate. We then create ranges of values for the discount rate and the terminal growth rate and calculate the DCF value across each range, showing how the valuation changes.
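A minimal sketch of this analysis with numpy and pandas might look like the following. The five-year cash flows are illustrative placeholders rather than the figures used to produce the table below, so the resulting valuations will differ somewhat.

import numpy as np
import pandas as pd

def dcf_value(cash_flows, discount_rate, terminal_growth_rate):
    """Enterprise value: present value of forecast cash flows plus a discounted terminal value."""
    cash_flows = np.asarray(cash_flows, dtype=float)
    years = np.arange(1, len(cash_flows) + 1)
    # Present value of the explicit forecast period
    pv_cash_flows = np.sum(cash_flows / (1 + discount_rate) ** years)
    # Gordon-growth terminal value, discounted back to today
    terminal_value = cash_flows[-1] * (1 + terminal_growth_rate) / (discount_rate - terminal_growth_rate)
    pv_terminal_value = terminal_value / (1 + discount_rate) ** len(cash_flows)
    return pv_cash_flows + pv_terminal_value

# Base case (illustrative cash flows)
cash_flows = [200, 220, 240, 260, 280]
base_discount_rate = 0.10
base_terminal_growth = 0.03

# Vary one assumption at a time and record the resulting valuation
discount_rates = np.linspace(0.08, 0.12, 5)
growth_rates = np.linspace(0.01, 0.05, 5)

sensitivity = pd.DataFrame({
    "Discount Rate": discount_rates,
    "Value at Different Discount Rates": [
        dcf_value(cash_flows, r, base_terminal_growth) for r in discount_rates
    ],
    "Terminal Growth Rate": growth_rates,
    "Value at Different Terminal Growth Rates": [
        dcf_value(cash_flows, base_discount_rate, g) for g in growth_rates
    ],
})
print(sensitivity)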
   Discount Rate  Value at Different Discount Rates  Terminal Growth Rate  Value at Different Terminal Growth Rates
0           0.08                        4973.896444                  0.01                               2812.603875
1           0.09                        4091.664172                  0.02                               3097.192815
2           0.10                        3463.092880                  0.03                               3463.092880
3           0.11                        2992.995628                  0.04                               3950.959634
4           0.12                        2628.493960                  0.04                               3950.959634
Time series analysis
What it is
Time series analysis comprises statistical methods used to analyze data points that are ordered chronologically. The input is a sequence of data points indexed by time, and the output includes insights into the data’s characteristics, identified patterns, and forecasts of future values.
Example use case
Volatility calculation: Instead of looking at a single stock, an investor might want to understand how the combined daily returns of their entire portfolio fluctuate. This helps gauge the overall risk level of their investments.
Jarus Singh (currently at Adobe, previously at Pandora) said, “We’re using a lot of time series techniques to forecast the future. If it’s univariate, we’ll just look at historical data, how has this been trending, what is the expected seasonality, and it’ll extrapolate that forward, so that makes sense when the future state of the world…”
Code snippet
There are many Python packages for time series, as listed in Awesome Time Series. Below is a simplified example inspired by the Tidy Finance chapter on Working with Stock Returns. This code downloads the past year of Apple’s stock data, calculates the daily percentage change in the closing price, and visualizes these daily returns using plotnine.
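A minimal sketch is shown below. It assumes the yfinance package as the data source, which is one common way to pull daily prices in Python; the exact figures will depend on when the data is downloaded.

import yfinance as yf
from plotnine import ggplot, aes, geom_line, labs

# Download roughly the past year of Apple's daily prices
prices = yf.Ticker("AAPL").history(period="1y")

# Daily percentage change in the closing price
returns = prices[["Close"]].copy()
returns["daily_return"] = returns["Close"].pct_change()
returns = returns.dropna().reset_index()

# Plot the daily returns over time
plot = (
    ggplot(returns, aes(x="Date", y="daily_return"))
    + geom_line()
    + labs(x="Date", y="Daily return", title="Apple daily returns over the past year")
)
plot.save("aapl_daily_returns.png")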
Bayesian models
What it is
Bayesian models use Bayesian inference, a statistical method in which prior beliefs or knowledge about a parameter are updated with observed data to obtain a posterior probability distribution. They can be used in conjunction with the other models mentioned above. For example, a time series model might incorporate Bayesian methods.
Example use case
Financial stress testing: Traditional stress tests often rely heavily on historical data and statistical correlations, which may not hold during extreme, unprecedented events. Bayesian models allow analysts to incorporate prior beliefs and expert judgment about the likelihood and impact of specific stress scenarios. The result is a more nuanced understanding of the overall risk profile, considering not just the magnitude of potential losses under specific events but also the likelihood of those events occurring.
Lindsey Dietz at the Federal Reserve Bank of Minneapolis noted that Bayesian approaches can be particularly useful in situations with very little data, such as assessing diversity goals with only a few people or analyzing segments of credit portfolios that have never had a default event.
Code snippet
Let’s say we want to stress test a small portfolio against a market crash scenario. Based on historical intuition and expert opinion, we initially believe there’s a 10% chance (0.1) that the portfolio loses more than 20% of its value in such a crash. We then simulate the stress scenario a handful of times and use a simplified form of Bayes’ rule, based on averages, to update our initial belief with the simulated data.
# Prior Belief
prior_probability_significant_loss = 0.1
print(f'Initial belief (prior probability of >20% loss): {prior_probability_significant_loss:.2f}')

# Simulated Data (Likelihood)
num_simulations = 10
significant_losses_in_simulation = 2
likelihood_significant_loss_given_stress = significant_losses_in_simulation / num_simulations
print(f'Likelihood of >20% loss based on simulation: {likelihood_significant_loss_given_stress:.2f}')

# Simplified Posterior Update (Conceptual)
# In a full Bayesian model, we'd use the formula.
# One very simple way (not strictly Bayesian but illustrative) is to average them
posterior_probability_significant_loss_simple = (prior_probability_significant_loss + likelihood_significant_loss_given_stress) / 2
print(f'Simplified updated belief (posterior probability): {posterior_probability_significant_loss_simple:.2f}')

# Weight the simulated data more heavily than the prior
weight_prior = 1
weight_data = 3
posterior_probability_weighted = (weight_prior * prior_probability_significant_loss + weight_data * likelihood_significant_loss_given_stress) / (weight_prior + weight_data)
print(f'More data-weighted updated belief (posterior probability): {posterior_probability_weighted:.2f}')
Initial belief (prior probability of >20% loss): 0.10
Likelihood of >20% loss based on simulation: 0.20
Simplified updated belief (posterior probability): 0.15
More data-weighted updated belief (posterior probability): 0.18
Natural language processing
What it is
Natural Language Processing (NLP) is the application of artificial intelligence techniques to understand, interpret, and generate human language, extracting meaningful insights for decision-making, improving processes, and augmenting other types of analysis. Finance generates massive amounts of unstructured text data, such as customer feedback and news articles, and NLP provides the tools to make sense of it.
Example use case
Topic modeling: NLP can identify the main topics and themes discussed within large collections of documents and files to uncover emerging trends and areas of focus.
Greg Shick from Charles Schwab mentioned their use of NLP on large volumes of text data to support call center operations.
Code snippet
Python has a rich ecosystem of NLP libraries, such as NLTK and spaCy. With the advent of Large Language Models (LLMs), data scientists have even more powerful tools at their disposal. For example, the mall package lets you run text data directly against an LLM and capture the results in your Polars DataFrame. In the example below, we feed the Description column from call_operations to mall, which returns a short summary of each issue in the summary column.
import mall
import polars as pl
import ollama

ollama.pull('llama3.2')

call_operations = {
    'Call ID': ['CALL001', 'CALL002', 'CALL003'],
    'DateTime': ['2025-05-06 09:15:00', '2025-05-06 09:30:30', '2025-05-06 09:48:45'],
    'CustomerID': ['CUST123', 'CUST456', 'CUST789'],
    'Issue Category': ['Billing Inquiry', 'Technical Support', 'Account Update'],
    'Description': [
        'Question about a recent charge on my statement.',
        'My internet service is not working.',
        'Need to change the primary email address on my account.'
    ]
}

call_operations_df = pl.DataFrame(call_operations)
call_operations_df.llm.summarize('Description', 2)
shape: (3, 6)
┌───────────┬───────────────────────┬────────────┬─────────────────────┬──────────────────────────────────┬─────────────────────────┐
│ Call ID   ┆ DateTime              ┆ CustomerID ┆ Issue Category      ┆ Description                      ┆ summary                 │
│ ---       ┆ ---                   ┆ ---        ┆ ---                 ┆ ---                              ┆ ---                     │
│ str       ┆ str                   ┆ str        ┆ str                 ┆ str                              ┆ str                     │
╞═══════════╪═══════════════════════╪════════════╪═════════════════════╪══════════════════════════════════╪═════════════════════════╡
│ "CALL001" ┆ "2025-05-06 09:15:00" ┆ "CUST123"  ┆ "Billing Inquiry"   ┆ "Question about a recent charge… ┆ " disputed transaction" │
│ "CALL002" ┆ "2025-05-06 09:30:30" ┆ "CUST456"  ┆ "Technical Support" ┆ "My internet service is not wor… ┆ "internet not working"  │
│ "CALL003" ┆ "2025-05-06 09:48:45" ┆ "CUST789"  ┆ "Account Update"    ┆ "Need to change the primary ema… ┆ "update your info."     │
└───────────┴───────────────────────┴────────────┴─────────────────────┴──────────────────────────────────┴─────────────────────────┘
Your learning journey in financial data science is just beginning
Thank you to the Data Science Leaders who shared their experiences with us and helped provide a valuable starting point for data scientists entering the financial domain.
Join our Data Science Hangout to exchange ideas and deepen your understanding with the community.