Five essential models for data scientists in finance

2025-05-20

If you’ve worked with data before, you’re probably familiar with foundational models like linear and logistic regression. These techniques are essential for establishing baselines and understanding relationships within data. As you grow your skills, however, you’ll find there’s a whole range of other interesting modeling methods you can learn and apply to your work.

One of the great things about the data science community is sharing knowledge, and our weekly Data Science Hangout is a space for data science leaders to do just that. We get to hear directly from finance experts; for example, Brad Zielke at Target shared insights on bridging the gap between complex technical work and non-technical business stakeholders, and Yu Cao at Exeter Finance talked about the challenges of including macroeconomic factors when forecasting.

Sometimes, our Hangout conversations go deep into specific data science techniques, with attendees asking about their application in finance. We’ve put together five types of models that have come up in these discussions, along with an example use case and code snippet for each.

  1. Monte Carlo simulation
  2. Sensitivity analysis
  3. Time series analysis
  4. Bayesian model
  5. Natural language processing

Monte Carlo simulation

What it is

Monte Carlo simulation is a computational technique that uses random sampling and statistical modeling to explore a range of possible outcomes. Running many simulations with different random inputs yields a probability distribution of results rather than a single point estimate, giving a more realistic view of the uncertainty involved.

Example use case

Matt McDonald from KBRA spoke a bit about this during his Data Science Hangout. You can also check out KBRA’s customer spotlight.

Code snippet

There are several Python packages that can help you perform Monte Carlo simulations, such as monaco and pandas-montecarlo (for pandas DataFrames). Sometimes, it’s easier to write your own function with pandas and numpy. Below is Python code that defines a monte_carlo_simulation function, which simulates daily price paths for an asset over a given number of trading days using normally distributed returns. The function takes the initial price, the number of days, the expected daily return (drift), the daily volatility, and the number of simulations to run, and returns a DataFrame of simulated prices indexed by business day. A second function, analyze_simulation_results, converts the final simulated prices into portfolio values for a given initial investment and summarizes the distribution of outcomes.

It can be easier to analyze Monte Carlo simulations visually, such as in this Shiny app.

import pandas as pd
import numpy as np

def monte_carlo_simulation(initial_price, days, drift, volatility, simulations):
    # Pre-allocate a (days x simulations) matrix of prices, seeded with the starting price
    prices = np.zeros((days, simulations))
    prices[0] = initial_price

    # Draw normally distributed daily returns for every remaining day and simulation
    np.random.seed(42)
    daily_returns = np.random.normal(
        loc=drift,
        scale=volatility,
        size=(days-1, simulations)
    )

    # Compound each simulated path forward one day at a time
    for t in range(1, days):
        prices[t] = prices[t-1] * np.exp(daily_returns[t-1])

    # Index the simulated paths by business day for easier analysis
    dates = pd.date_range(start='today', periods=days, freq='B')
    return pd.DataFrame(prices, index=dates)

def analyze_simulation_results(simulation_df, initial_investment):
    # Convert the final simulated prices into portfolio values for the initial investment
    final_prices = simulation_df.iloc[-1]
    shares = initial_investment / simulation_df.iloc[0, 0]
    final_values = final_prices * shares

    # Summarize the distribution of ending portfolio values across simulations
    analysis = pd.Series({
        'Mean Final Value': final_values.mean(),
        'Median Final Value': final_values.median(),
        'Min Final Value': final_values.min(),
        'Max Final Value': final_values.max(),
        'Standard Deviation': final_values.std()
    })
    return analysis

if __name__ == '__main__':
    initial_price = 100
    days = 252                        # one trading year of business days
    mean_return = 0.05 / 252          # 5% annual return expressed as a daily drift
    volatility = 0.15 / np.sqrt(252)  # 15% annual volatility scaled to daily
    simulations = 100
    initial_investment = 10000

    simulation_results = monte_carlo_simulation(
        initial_price,
        days,
        mean_return,
        volatility,
        simulations
    )

    analysis_results = analyze_simulation_results(simulation_results, initial_investment)
    print(analysis_results)
Mean Final Value      10647.272577
Median Final Value    10703.371167
Min Final Value        7326.054614
Max Final Value       15564.862384
Standard Deviation     1610.235976
dtype: float64
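
Once you have a distribution of simulated outcomes, you can derive risk measures from it. As one possible follow-up (a minimal sketch assuming the simulation_results and initial_investment objects from the code above), the snippet below estimates a 95% Value at Risk by taking the 5th percentile of the simulated ending portfolio values.

# Convert final simulated prices into portfolio values, as in analyze_simulation_results
final_values = simulation_results.iloc[-1] * (initial_investment / simulation_results.iloc[0, 0])

# 5th percentile of ending values: 95% of simulations finish above this level
var_95_value = np.percentile(final_values, 5)
var_95_loss = initial_investment - var_95_value
print(f'95% VaR ending value: {var_95_value:.2f} (potential loss: {var_95_loss:.2f})')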

Sensitivity analysis

What it is

Sensitivity analysis is a method for evaluating how changes in one or more input variables affect the outcome of a model. The inputs are factors that can change, such as interest rates, and the output is the variable being measured, such as net profit. Sensitivity analysis helps you understand how uncertainty in the inputs flows through to the decisions you make.

Example use case

As JD Long from RenaissanceRe said during his Data Science Hangout, “I can get an awful long way on sensitivity analysis.”

Code snippet

There are several Python packages that can help you perform sensitivity analysis, such as SALib and sensitivity. You can also write your own sensitivity analysis with pandas and numpy.

Below is a simplified example inspired by the Tidy Finance chapter on Discounted Cash Flow Analysis. The dcf_value function calculates the enterprise value using the DCF method. We define a base case with example cash flows, discount rate, and terminal growth rate. We then create ranges of values for the discount rate and the terminal growth rate and calculate the DCF value as each input varies on its own, holding the other at its base value, showing how the valuation changes.

import pandas as pd
import numpy as np

def dcf_value(cash_flows, discount_rate, terminal_growth_rate):
    # Discount each projected cash flow back to present value
    pv_cash_flows = [cf / (1 + discount_rate) ** (t + 1) for t, cf in enumerate(cash_flows)]
    # Gordon growth terminal value based on the final cash flow, also discounted to present
    terminal_value = cash_flows[-1] * (1 + terminal_growth_rate) / (discount_rate - terminal_growth_rate)
    pv_terminal_value = terminal_value / (1 + discount_rate) ** len(cash_flows)
    enterprise_value = sum(pv_cash_flows) + pv_terminal_value
    return enterprise_value

# Base case assumptions
cash_flows = [100, 150, 200, 250, 300]
base_discount_rate = 0.10
base_terminal_growth_rate = 0.03

# Vary the discount rate while holding the terminal growth rate at its base value
discount_rates = np.arange(0.08, 0.13, 0.01)
values_discount_rate = [dcf_value(cash_flows, dr, base_terminal_growth_rate) for dr in discount_rates]

# Vary the terminal growth rate while holding the discount rate at its base value
terminal_growth_rates = np.arange(0.01, 0.05, 0.01)
values_terminal_growth_rate = [dcf_value(cash_flows, base_discount_rate, tgr) for tgr in terminal_growth_rates]

# The two ranges have different lengths, so pad the shorter one to build a single DataFrame
max_length = max(len(values_discount_rate), len(discount_rates), len(values_terminal_growth_rate), len(terminal_growth_rates))

padded_values_discount_rate = values_discount_rate + [values_discount_rate[-1]] * (max_length - len(values_discount_rate))
padded_discount_rates = list(discount_rates) + [discount_rates[-1]] * (max_length - len(discount_rates))
padded_values_terminal_growth_rate = values_terminal_growth_rate + [values_terminal_growth_rate[-1]] * (max_length - len(values_terminal_growth_rate))
padded_terminal_growth_rates = list(terminal_growth_rates) + [terminal_growth_rates[-1]] * (max_length - len(terminal_growth_rates))
df = pd.DataFrame({
    'Discount Rate': padded_discount_rates,
    'Value at Different Discount Rates': padded_values_discount_rate,
    'Terminal Growth Rate': padded_terminal_growth_rates,
    'Value at Different Terminal Growth Rates': padded_values_terminal_growth_rate
})

print(df)
   Discount Rate  Value at Different Discount Rates  Terminal Growth Rate  \
0           0.08                        4973.896444                  0.01   
1           0.09                        4091.664172                  0.02   
2           0.10                        3463.092880                  0.03   
3           0.11                        2992.995628                  0.04   
4           0.12                        2628.493960                  0.04   

   Value at Different Terminal Growth Rates  
0                               2812.603875  
1                               3097.192815  
2                               3463.092880  
3                               3950.959634  
4                               3950.959634  
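
The example above varies one input at a time. A natural extension is a two-way sensitivity table that varies both inputs together, making their interaction visible. Below is a minimal sketch that reuses the dcf_value function and the ranges defined above to build a grid of valuations, with discount rates as rows and terminal growth rates as columns.

# Two-way sensitivity grid: rows are discount rates, columns are terminal growth rates
grid = pd.DataFrame(
    [[dcf_value(cash_flows, dr, tgr) for tgr in terminal_growth_rates] for dr in discount_rates],
    index=[f'{dr:.2f}' for dr in discount_rates],
    columns=[f'{tgr:.2f}' for tgr in terminal_growth_rates]
)
print(grid.round(0))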

Time series analysis

What it is

Time series analysis comprises statistical methods used to analyze data points that are ordered chronologically. The input is a sequence of data points indexed by time, and the output includes insights into the data’s characteristics, identified patterns, and forecasts of future values.

Example use case

Jarus Singh (currently at Adobe, previously at Pandora) stated, “We’re using a lot of time series techniques to forecast the future… if it’s univariate, we’ll just look at historical data, how has this been trending, what is expected seasonality, and it’ll extrapolate that forward.”

Code snippet

There are many Python packages for time series, as listed in Awesome Time Series. Below is a simplified example inspired by the Tidy Finance chapter on Working with Stock Returns. This code downloads the past year of Microsoft’s stock data, calculates the daily percentage change in the closing price, and visualizes these daily returns with plotnine.

import yfinance as yf
import pandas as pd
from plotnine import ggplot, aes, geom_line, labs, theme_minimal

# Download one year of daily price history for Microsoft
msft = yf.Ticker('MSFT')
data = msft.history(period='1y')

# Daily return: percentage change in the closing price
data['Daily Return'] = data['Close'].pct_change()

# Move the DatetimeIndex into a column so plotnine can map it to the x-axis
data = data.reset_index()

plot = (
    ggplot(data, aes(x='Date', y='Daily Return'))
    + geom_line(color='green')
    + labs(title='Microsoft (MSFT) Daily Returns', x='Date', y='Daily Return')
    + theme_minimal()
)

plot.show()
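
Plotting the returns is usually just a first step; as the quote above suggests, the goal is often a forecast. As one possible next step (a minimal sketch, not part of the original example), the snippet below fits an additive-trend exponential smoothing model from statsmodels to the closing prices in the data DataFrame above and projects them 30 periods ahead.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit an additive-trend exponential smoothing model to the closing prices
close = data.set_index('Date')['Close']
model = ExponentialSmoothing(close, trend='add').fit()

# Project the series 30 periods (roughly six trading weeks) forward
forecast = model.forecast(30)
print(forecast.tail())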

Bayesian model

What it is

Bayesian models use Bayesian inference, a statistical method in which prior beliefs or knowledge about a parameter are updated with observed data to obtain a posterior probability distribution. They can be used in conjunction with the other models mentioned above. For example, a time series model might incorporate Bayesian methods.

Example use case

Lindsey Dietz at the Federal Reserve Bank of Minneapolis noted that Bayesian approaches can be particularly useful in situations with very little data, such as assessing diversity goals with only a few people or analyzing segments of credit portfolios that have never had a default event.

Code snippet

Let’s say we want to stress test a small portfolio against a market crash scenario, where a significant loss means the portfolio value dropping by more than 20%. Based on historical intuition and expert opinion, we initially believe there’s a 10% chance (0.1) of such a loss. We then run a handful of simulated stress scenarios and use a simplified, averaging-based form of a Bayesian update to combine our prior belief with the simulated data.

# Prior Belief
prior_probability_significant_loss = 0.1
print(f'Initial belief (prior probability of >20% loss): {prior_probability_significant_loss:.2f}')

# Simulated Data (Likelihood)
num_simulations = 10
significant_losses_in_simulation = 2
likelihood_significant_loss_given_stress = significant_losses_in_simulation / num_simulations
print(f'Likelihood of >20% loss based on simulation: {likelihood_significant_loss_given_stress:.2f}')

# Simplified Posterior Update (Conceptual)
# In a full Bayesian model, we'd use the formula.
# One very simple way (not strictly Bayesian but illustrative) is to average them
posterior_probability_significant_loss_simple = (prior_probability_significant_loss + likelihood_significant_loss_given_stress) / 2
print(f'Simplified updated belief (posterior probability): {posterior_probability_significant_loss_simple:.2f}')

weight_prior = 1
weight_data = 3
posterior_probability_weighted = (weight_prior * prior_probability_significant_loss + weight_data * likelihood_significant_loss_given_stress) / (weight_prior + weight_data)
print(f'More data-weighted updated belief (posterior probability): {posterior_probability_weighted:.2f}')
Initial belief (prior probability of >20% loss): 0.10
Likelihood of >20% loss based on simulation: 0.20
Simplified updated belief (posterior probability): 0.15
More data-weighted updated belief (posterior probability): 0.18
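
The averaging above is only a stand-in for a genuine Bayesian update. A more faithful (though still minimal) sketch uses a conjugate Beta-Binomial model: we encode the 10% prior belief as a Beta(1, 9) prior (an assumption chosen so the prior mean is 0.10) and update it with the 2 significant losses observed in 10 simulated stress scenarios.

from scipy.stats import beta

# Prior: Beta(1, 9) has mean 1 / (1 + 9) = 0.10, matching the prior belief above
prior_a, prior_b = 1, 9

# Data: 2 significant losses out of 10 simulated stress scenarios
losses, trials = 2, 10

# Conjugate update: add observed losses to a, non-losses to b
post_a = prior_a + losses
post_b = prior_b + (trials - losses)

posterior_mean = post_a / (post_a + post_b)
credible_interval = beta.interval(0.9, post_a, post_b)

print(f'Posterior mean probability of a >20% loss: {posterior_mean:.2f}')
print(f'90% credible interval: ({credible_interval[0]:.2f}, {credible_interval[1]:.2f})')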

Natural language processing

What it is

Natural Language Processing (NLP) is the application of artificial intelligence techniques to understand, interpret, and generate human language, extracting meaningful insights for decision-making, improving processes, and augmenting other types of analysis. Finance generates massive amounts of unstructured text data, such as customer feedback and news articles, and NLP provides the tools to make sense of it.

Example use case

Greg Shick from Charles Schwab mentioned their use of NLP on large volumes of text data to support call center operations.

Code snippet

Python has a rich ecosystem of NLP libraries, such as NLTK and spaCy. With the advent of Large Language Models (LLMs), data scientists have even more powerful tools at their disposal. For example, the mall package lets you run text data directly against an LLM and capture the results in your Polars DataFrame. In the example below, we feed the Description column from call_operations_df to mall, which returns a short summary of each issue in the summary column.

import mall
import polars as pl
import ollama

# Pull the local LLM that mall will use via Ollama
ollama.pull('llama3.2')

call_operations = {'Call ID': ['CALL001', 'CALL002', 'CALL003'],
        'DateTime': ['2025-05-06 09:15:00', '2025-05-06 09:30:30', '2025-05-06 09:48:45'],
        'CustomerID': ['CUST123', 'CUST456', 'CUST789'],
        'Issue Category': ['Billing Inquiry', 'Technical Support', 'Account Update'],
        'Description': ['Question about a recent charge on my statement.',
                        'My internet service is not working.',
                        'Need to change the primary email address on my account.']
                        }

call_operations_df = pl.DataFrame(call_operations)

call_operations_df.llm.summarize('Description', 2)
shape: (3, 6)
Call ID    DateTime               CustomerID  Issue Category       Description                        summary
str        str                    str         str                  str                                str
"CALL001"  "2025-05-06 09:15:00"  "CUST123"   "Billing Inquiry"    "Question about a recent charge…"  "disputed transaction"
"CALL002"  "2025-05-06 09:30:30"  "CUST456"   "Technical Support"  "My internet service is not wor…"  "internet not working"
"CALL003"  "2025-05-06 09:48:45"  "CUST789"   "Account Update"     "Need to change the primary ema…"  "update your info."

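If an LLM is more than you need, classic rule-based NLP tools can still go a long way. As an illustration (a minimal sketch using NLTK's VADER sentiment analyzer on the same hypothetical call descriptions, not part of the original example), the snippet below scores each description for sentiment, which could help flag frustrated customers for faster follow-up.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon the analyzer depends on (only needed once)
nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

descriptions = ['Question about a recent charge on my statement.',
                'My internet service is not working.',
                'Need to change the primary email address on my account.']

# The compound score ranges from -1 (most negative) to +1 (most positive)
for text in descriptions:
    scores = analyzer.polarity_scores(text)
    print(f"{scores['compound']:+.2f}  {text}")
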
Your learning journey in financial data science is just beginning

Thank you to the Data Science Leaders who shared their experiences with us and helped provide a valuable starting point for data scientists entering the financial domain.

The format of this post was inspired by Sarowar Jahan Saurav’s insightful article, 20 Important Statistical Approaches Every Data Scientist Knows🐱🚀.

Further reading