Commercial enterprise offerings

Deploying Scikit-Learn Models for In-Database Scoring with Snowflake

Nick Pelikan
Written by Nick Pelikan
2023-01-13
A dual-pane screenshot displays two separate code editors. The left editor, in a dark theme, shows Python code with data manipulation operations, including variable assignment and data loading. Below the code is a data preview table. The right editor, in a light theme, displays a section titled "Step 7: Massive Scale Inference" with Python code primarily focused on data loading and processing using libraries like pandas. Both editors show line numbers and scrollbars, indicating longer code files.

See this workflow in action

Nick presented a demo on deploying Scikit-learn models for in-database scoring with Snowflake and Posit Team. You can watch it on YouTube!

Introduction

Modern data science workflows often face a critical challenge: how to efficiently deploy machine learning models where the data lives. Traditional approaches require moving large datasets out of the database for scoring, creating bottlenecks, security concerns, and unnecessary data movement costs.

This post demonstrates a powerful solution that eliminates these challenges by deploying scikit-learn models directly inside Snowflake for in-database scoring. We’ll walk through a complete workflow that:

  • Trains a machine learning model using scikit-learn
  • Registers the model in Snowflake’s Model Registry for governance and versioning
  • Converts the trained model to SQL using Orbital
  • Deploys the model as a Snowflake view for production inference

The result is a production-ready ML pipeline that runs entirely within your data warehouse, eliminating data movement while maintaining the full power of Python’s machine learning ecosystem.

The Challenge: Traditional ML Deployment vs. In-Database Scoring

Traditional machine learning deployment typically follows this pattern:

  1. Extract: Pull data from the database to a Python/R environment
  2. Transform: Apply feature engineering and preprocessing
  3. Predict: Run the model on the transformed data
  4. Load: Write predictions back to the database

This ETL approach works for small datasets but becomes problematic at scale:

  • Data Movement Costs: Moving terabytes of data for scoring is expensive and slow
  • Security Risks: Sensitive data leaves the secure database environment
  • Latency Issues: Round-trip data movement introduces delays
  • Resource Inefficiency: Compute resources are duplicated across systems

In-database scoring solves these problems by converting machine learning models to SQL, allowing predictions to run directly where the data lives. This approach offers:

  • Zero Data Movement: All computations happen within the data warehouse
  • Enhanced Security: Data never leaves the secure environment
  • Improved Performance: Leverage the database’s parallel processing power
  • Simplified Architecture: Eliminate external scoring services

Why Snowflake + Orbital?

Snowflake provides an ideal platform for this approach with several key advantages:

  • Snowflake Model Registry: Enterprise-grade model governance, versioning, and metadata management
  • Massive Scale: Handle prediction workloads on billions of rows
  • Security & Compliance: Row-level security, data masking, and audit trails
  • Performance: Automatic scaling and optimization for analytical workloads

Orbital bridges the gap between Python’s rich ML ecosystem and database-native execution by converting scikit-learn pipelines to optimized SQL. This enables data scientists to:

  • Use familiar Python tools for model development
  • Automatically translate complex feature engineering to SQL
  • Deploy models without rewriting code in different languages
  • Maintain model accuracy through faithful SQL translations

Orbital vs. Snowpark ML: When to Choose Each Approach

Snowflake offers multiple approaches for machine learning deployment. Here’s when to choose Orbital over Snowpark ML:

Choose Orbital When:

🚀 Maximum Performance is Critical

  • Native SQL Execution: Orbital generates pure SQL that runs at native database speeds
  • Query Optimizer Benefits: Leverages Snowflake’s advanced query optimization
  • No Python Runtime Overhead: Eliminates the performance cost of Python interpretation
  • Massive Scale: Optimal for scoring billions of records with minimal latency

⚡ Simple Deployment is Priority

  • Zero Infrastructure: No need to manage Python environments or dependencies
  • Database Views: Deploy as standard SQL views that any application can query
  • No Runtime Dependencies: Models run with only SQL - no Python packages required
  • Version Control: Simple database object versioning through standard SQL DDL

🔧 Integration with Existing Systems

  • BI Tool Compatibility: SQL views work seamlessly with Tableau, PowerBI, Looker
  • Application Integration: Any system that can query SQL can use your models
  • Legacy System Support: Works with older applications that don’t support Python
  • Microservices Architecture: Models become stateless SQL services

Choose Snowpark ML When:

🐍 Complex Python Ecosystems

  • Advanced Libraries: Need libraries beyond scikit-learn (XGBoost, LightGBM, PyTorch)
  • Custom Transformations: Complex preprocessing that doesn’t translate well to SQL
  • Dynamic Logic: Business rules that require Python’s flexibility
  • Iterative Algorithms: Models that require loops or complex control flow

🔄 Rapid Prototyping

  • Development Speed: Faster iteration during model development phase
  • Debugging: Python stack traces and debugging tools
  • Interactive Development: Jupyter notebook-style experimentation
  • Complex Data Types: Working with nested JSON, arrays, or unstructured data

Performance Comparison

Here’s what you can expect in terms of performance characteristics:

Aspect Orbital (SQL) Snowpark ML (Python)
Execution Speed Fastest - Native SQL Fast - Optimized Python
Startup Time Instant Python environment startup
Memory Usage Minimal - SQL only Higher - Python runtime
Scaling Linear with data size May have Python overhead
Cold Start None Python initialization

Real-World Performance Example

In our lending club example, the Orbital approach delivered:

  • 2.2M predictions in ~3 seconds using pure SQL
  • Zero cold start time - queries execute immediately
  • Linear scaling - 10x more data = ~10x more time
  • Minimal resource usage - only SQL execution resources needed

Compare this to a Snowpark approach which would need to:

  • Initialize Python environments for each compute node
  • Load scikit-learn and other Python dependencies
  • Transfer Python objects between nodes
  • Manage Python memory allocation

The Orbital SQL approach eliminates all of these overheads, resulting in faster execution and lower resource consumption.

Deployment Simplicity

Orbital Deployment (This Approach):

-- Single SQL statement deploys the model
CREATE OR REPLACE VIEW loan_scoring_model AS 
SELECT id, predicted_interest_rate FROM (
  -- Generated SQL with embedded model logic
  ...
);

Snowpark Deployment:

# Requires Python environment management
# Package dependencies
# Runtime environment setup
# Function registration
# State management

The Orbital approach creates a simple, stateless SQL asset that requires no ongoing maintenance, while Snowpark deployments need to manage Python environments, dependencies, and runtime state.

Let’s see this integration in action with a real-world example.

Our Use Case: Lending Club Interest Rate Prediction

For this demonstration, we’ll build a model to predict interest rates for loans using the Lending Club dataset. This is a practical example that showcases several important aspects of production ML systems:

  • Financial Data: Working with sensitive financial information that benefits from in-database security
  • Large Scale: Millions of loan records requiring efficient processing
  • Real-time Scoring: Interest rate predictions needed for loan applications
  • Regulatory Compliance: Financial models requiring audit trails and governance

The dataset includes features like loan term, borrower credit utilization, and available credit that influence interest rate determination.

Step 1: Secure Database Connection with Managed Credentials

One of the key advantages of this workflow is seamless authentication. Using Posit Workbench’s managed credentials, we can connect to Snowflake without hardcoding credentials or managing connection strings.

import snowflake.connector

# Connect using managed credentials - no passwords or keys needed!
con = snowflake.connector.connect(
    connection_name="workbench",  # Managed credential profile
    warehouse="DEFAULT_WH",       # Compute resources
    database="LENDING_CLUB",      # Database containing our data
    schema="PUBLIC"               # Schema with loan data
)

This approach provides several security benefits:

  • No Credential Exposure: Passwords and keys never appear in code
  • Automatic Rotation: Credentials are automatically managed and rotated
  • User Identity: Each user connects with their own identity for auditing
  • Fine-grained Access: Database permissions control what each user can access

Step 2: Efficient Data Sampling with Ibis

Rather than pulling the entire dataset into memory, we’ll use Ibis to perform intelligent sampling directly in Snowflake. This approach is crucial for working with large datasets efficiently.

import ibis
ibiscon = ibis.snowflake.from_connection(con, create_object_udfs=False)

# Define our feature set for interest rate prediction
feature_columns = [
    "INT_RATE",      # Target variable: loan interest rate
    "TERM",          # Loan term (36 or 60 months)
    "BC_UTIL",       # Bank card utilization rate
    "BC_OPEN_TO_BUY", # Available credit on bank cards
    "ALL_UTIL",      # Total credit utilization
    "ID"             # Unique loan identifier
]

# Create a table reference (no data movement yet!)
lendingclub_data = ibiscon.table("LOAN_DATA")

# Apply data quality filters in the database
lendingclub_data = lendingclub_data.drop_null(feature_columns, how="any")

# Sample 0.3% of the data for model training (still executed in Snowflake)
lendingclub_sample = lendingclub_data[feature_columns]\
    .sample(.003)\
    .to_pandas()  # Only now do we bring data to Python

print(f"Training sample size: {len(lendingclub_sample):,} rows")
lendingclub_sample.head(5)
Training sample size: 4,083 rows
INT_RATE TERM BC_UTIL BC_OPEN_TO_BUY ALL_UTIL ID
0 19.48 60 months 92.9 592 65 68053448
1 13.44 60 months 91.3 1065 91 68674734
2 12.88 36 months 31.9 37853 48 67279352
3 19.48 36 months 61.7 2296 68 68944925
4 11.48 60 months 68.7 4877 69 68883620

Why this approach is powerful:

  • Lazy Evaluation: Ibis builds SQL queries without executing them until .to_pandas() is called
  • Database-side Filtering: Null value removal happens in Snowflake, not Python
  • Intelligent Sampling: Only the sampled data is transferred, not the entire dataset
  • SQL Generation: Ibis automatically generates optimized SQL for complex operations

The result is a clean, representative sample ready for model training, obtained with minimal data movement.

Step 3: Building a Robust ML Pipeline

Now we’ll create a comprehensive machine learning pipeline that handles both feature engineering and model training. This pipeline will later be converted to SQL for in-database deployment.

from sklearn.model_selection import train_test_split

target_column = "INT_RATE"

# Split the data for training and validation
X_train, X_test, y_train, y_test = train_test_split(
    lendingclub_sample.drop(target_column, axis=1), 
    lendingclub_sample[target_column], 
    test_size=0.2, 
    random_state=42
)

print(f"Training set: {len(X_train):,} samples")
print(f"Test set: {len(X_test):,} samples")
Training set: 3,266 samples
Test set: 817 samples

Now let’s build a comprehensive preprocessing and modeling pipeline. This pipeline is designed with SQL conversion in mind, using transformations that Orbital can faithfully translate:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OrdinalEncoder
from sklearn.linear_model import LinearRegression

# Define feature groups for different preprocessing steps
noncategorical_columns = ["BC_UTIL", "BC_OPEN_TO_BUY", "ALL_UTIL"]

# Create a comprehensive ML pipeline
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        # Handle categorical variables (loan term)
        ("encoder", OrdinalEncoder(), ["TERM"]),
        
        # Handle missing values with mean imputation
        ("mean_imputation", SimpleImputer(missing_values=np.nan, strategy='mean'), 
         noncategorical_columns),
        
        # Feature scaling (centering only, compatible with SQL)
        ("scaler", StandardScaler(with_std=False), noncategorical_columns),
    ], remainder="passthrough")),
    
    # Linear regression model
    ("linear_regression", LinearRegression())
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate model performance
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"Training R²: {train_score:.4f}")
print(f"Test R²: {test_score:.4f}")
Training R²: 0.2396
Test R²: 0.2394

Key Pipeline Design Considerations:

  • Modular Design: ColumnTransformer allows different preprocessing for different feature types
  • Robust Handling: Missing value imputation ensures the model works with real-world messy data
  • Interpretability: Linear regression provides explainable predictions for financial applications

Step 4: Leveraging Snowflake’s Model Registry

One of Snowflake’s most powerful features for ML workflows is its integrated Model Registry. This provides enterprise-grade model governance, versioning, and metadata management directly within your data warehouse.

from snowflake.snowpark import Session
from snowflake.ml.registry import Registry

# Create a Snowpark session from our existing connection
session = Session.builder.configs({"connection": con}).create()

# Initialize the Model Registry
reg = Registry(session=session, database_name="LENDING_CLUB", schema_name="PUBLIC")

# Prepare sample input data for model signature
sig = X_train.head(100).copy()
sig.BC_UTIL = sig.BC_UTIL.astype("float64")  # Ensure consistent data types

# Register the model in Snowflake's Model Registry
model_ref = reg.log_model(
    pipeline,
    model_name="orbital_test",
    sample_input_data=sig,
    options={
        "method_options": {
            "predict": {"case_sensitive": True},
        }
    },
)

print(f"Model registered successfully!")
print(f"Version: {model_ref.version_name}")
model_ref
Model registered successfully!
Version: HEAVY_CHIPMUNK_
ModelVersion(
  name='ORBITAL_TEST',
  version='HEAVY_CHIPMUNK_4',
)

What just happened?

The Snowflake Model Registry provides several critical capabilities for production ML:

  • Version Management: Each model gets a unique version name (like NASTY_TREEFROG_4)
  • Metadata Storage: Model artifacts, dependencies, and signatures are stored securely
  • Lineage Tracking: Full audit trail of model creation and usage
  • Access Control: Database-level permissions control who can deploy and use models
  • Integration: Native integration with Snowflake’s security and governance features

This registry approach is particularly valuable in regulated industries where model governance and audit trails are essential.

Step 5: The Magic of Orbital - Converting ML to SQL

Now comes the exciting part: converting our trained scikit-learn pipeline into SQL that can run natively in Snowflake. This is where Orbital shines, providing a bridge between Python’s rich ML ecosystem and database-native execution.

import orbitalml
import orbitalml.types

# Define the schema for our input features
# This tells Orbital how to generate appropriate SQL for each column type
feature_schema = {
    "ID": orbitalml.types.DoubleColumnType(),           # Unique identifier
    "TERM": orbitalml.types.StringColumnType(),         # Categorical: "36 months" or "60 months"
    "BC_UTIL": orbitalml.types.DoubleColumnType(),      # Numeric: bank card utilization
    "BC_OPEN_TO_BUY": orbitalml.types.DoubleColumnType(), # Numeric: available credit
    "ALL_UTIL": orbitalml.types.DoubleColumnType(),     # Numeric: total utilization
}

# Convert the scikit-learn pipeline to Orbital's internal representation
orbitalml_pipeline = orbitalml.parse_pipeline(pipeline, features=feature_schema)

Let’s examine what Orbital extracted from our scikit-learn pipeline:

print("Orbital Pipeline Structure:")
print("=" * 40)
print(orbitalml_pipeline)
Orbital Pipeline Structure:
========================================
ParsedPipeline(
  features={
    ID: DoubleColumnType()
    TERM: StringColumnType()
    BC_UTIL: DoubleColumnType()
    BC_OPEN_TO_BUY: DoubleColumnType()
    ALL_UTIL: DoubleColumnType()
  },
  steps=[
    merged_columns=Concat(
      inputs: BC_UTIL, BC_OPEN_TO_BUY, ALL_UTIL,
      attributes: 
        axis=1
    )
    merged_columns1=Concat(
      inputs: BC_UTIL, BC_OPEN_TO_BUY, ALL_UTIL,
      attributes: 
        axis=1
    )
    label_encoder=LabelEncoder(
      inputs: TERM,
      attributes: 
        keys_strings=['36 months', '60 months']
        values_int64s=[0, 1]
    )
    variable1=Imputer(
      inputs: merged_columns,
      attributes: 
        imputed_value_floats=[53.89932632446289, 12975.7744140625, 56.76362609863281]
        replaced_value_float=nan
    )
    ordinal_output=Reshape(
      inputs: label_encoder, shape_tensor=[-1, 1],
      attributes: 
    )
    variable2=Sub(
      inputs: merged_columns1, Su_Subcst=[53.89932639314145, 12975.774341702388, 56.7636252296387],
      attributes: 
    )
    concat_result=Concat(
      inputs: ordinal_output,
      attributes: 
        axis=1
    )
    variable4=Cast(
      inputs: concat_result,
      attributes: 
        to=1
    )
    variable_cast=Cast(
      inputs: variable4,
      attributes: 
        to=11
    )
    transformed_column=Concat(
      inputs: variable_cast, variable1, variable2, ALL_UTIL,
      attributes: 
        axis=1
    )
    multiplied=MatMul(
      inputs: transformed_column, coef=[3.9272007446774158, 0.008255687576964283, -2.613724348162453e-05, 0.015146...,
      attributes: 
    )
    resh=Add(
      inputs: multiplied, intercept=[10.677283651336923],
      attributes: 
    )
    variable=Reshape(
      inputs: resh, shape_tensor=[-1, 1],
      attributes: 
    )
  ],
)

Understanding the Orbital Conversion:

This output shows how Orbital has deconstructed our scikit-learn pipeline into a series of database-friendly operations:

  • Feature Type Recognition: Each column is properly typed for SQL generation
  • Preprocessing Steps: Mean imputation, scaling, and encoding are converted to SQL operations
  • Model Coefficients: Linear regression weights are embedded as constants
  • Pipeline Flow: The entire transformation sequence is preserved

The beauty of this approach is that the complex Python preprocessing logic is now represented as a series of SQL operations that can execute directly in Snowflake.

Step 6: Generating Production-Ready SQL

Now we’ll generate the SQL that embodies our entire machine learning pipeline:

import sqlparse
from ibis.expr.sql import SQLString

# Generate Snowflake-specific SQL from our Orbital pipeline
sql = orbitalml.export_sql("LOAN_DATA", orbitalml_pipeline, dialect="snowflake")

# Display the generated SQL with syntax highlighting
print("Generated SQL Query:")
print("=" * 50)
SQLString(sql)
Generated SQL Query:
==================================================

SELECT 
    "t0"."ID" AS "ID", 
    CAST(
        CAST(
            CASE 
                WHEN "t0"."TERM" = '36 months' THEN 0 
                WHEN "t0"."TERM" = '60 months' THEN 1 
                ELSE -1 
            END AS FLOAT
        ) AS DOUBLE
    ) * 3.9272007446774158 
    + 10.677283651336923 
    + COALESCE("t0"."BC_UTIL", 53.89932632446289) * 0.008255687576964283 
    + COALESCE("t0"."BC_OPEN_TO_BUY", 12975.7744140625) * -2.613724348162453e-05 
    + COALESCE("t0"."ALL_UTIL", 56.76362609863281) * 0.01514643877937558 
    + ("t0"."BC_UTIL" - 53.89932639314145) * 0.008255689085808633 
    + ("t0"."BC_OPEN_TO_BUY" - 12975.774341702388) * -2.61372411882521e-05 
    + ("t0"."ALL_UTIL" - 56.7636252296387) * 0.015146438779367492 
    + "t0"."ALL_UTIL" * 2.1054364854737794e-09 AS "variable" 
FROM "LOAN_DATA" AS "t0"

What makes this SQL special?

This isn’t just any SQL query - it’s a complete machine learning pipeline encoded as a single SQL statement:

  • Feature Engineering: All preprocessing steps (imputation, scaling, encoding) are embedded
  • Model Scoring: The linear regression calculation is performed using SQL arithmetic
  • Production Ready: Handles missing values, data types, and edge cases
  • Optimized: Leverages Snowflake’s query optimizer for efficient execution
  • Scalable: Can process millions or billions of rows without modification

Step 7: Massive Scale Inference

Let’s see this SQL in action by scoring the entire dataset - something that would be prohibitively expensive with traditional ETL approaches:

import pandas as pd
import time

print("Starting inference on full dataset...")
start_time = time.time()

# Execute the SQL against the entire LOAN_DATA table
result = pd.read_sql(sql, con)

end_time = time.time()
execution_time = end_time - start_time

print(f"✅ Inference completed in {execution_time:.2f} seconds")
print(f"📊 Total predictions generated: {len(result):,}")
print(f"🚀 Throughput: {len(result)/execution_time:,.0f} predictions/second")

# Show sample predictions
print("\nSample Predictions:")
result.dropna().head(10)
Starting inference on full dataset...
✅ Inference completed in 3.05 seconds
📊 Total predictions generated: 2,260,702
🚀 Throughput: 741,492 predictions/second

Sample Predictions:
ID variable
126185 130954621 11.831669
126186 130964697 12.094963
126187 130955326 17.251766
126188 130504052 17.981724
126189 130956066 8.163239
126190 130502727 15.832542
126191 130258413 9.601245
126192 129913276 17.009582
126193 130968790 8.993534
126194 130711441 15.668842

Performance Highlights:

This query processed over 2 million loan records in seconds, demonstrating the power of in-database scoring:

  • No Data Movement: All 2M+ records stayed in Snowflake
  • Parallel Processing: Snowflake’s compute clusters handled the workload
  • Memory Efficiency: No risk of out-of-memory errors in Python
  • Cost Effective: Pay only for compute time, not data transfer

Step 8: Production Deployment as a Database View

The final step is deploying our model as a reusable database view. This creates a permanent, queryable asset that other applications and users can access:

# Deploy the model as a Snowflake view for production use
view_name = f"pyorbital_test_{model_ref.version_name}"

with con.cursor() as cursor:
    con.execute_string(f"""
        CREATE OR REPLACE VIEW {view_name} AS {sql};
    """)

print(f"🎯 Model successfully deployed as view: {view_name}")
print(f"🔗 Other applications can now query this view for real-time predictions")
🎯 Model successfully deployed as view: pyorbital_test_HEAVY_CHIPMUNK_4
🔗 Other applications can now query this view for real-time predictions

Production Benefits:

  • Reusable Asset: The view can be queried by any application with database access
  • Real-time Scoring: New data is scored immediately without batch processing delays
  • Security Integration: Inherits Snowflake’s row-level security and access controls
  • Performance Optimization: Snowflake automatically optimizes query execution
  • Version Control: Model versions are tracked through the naming convention

Why This Deployment Approach Wins

Operational Simplicity:

Unlike Snowpark ML functions that require:

  • Python package management across compute nodes
  • Runtime environment consistency
  • Function registration and permissions
  • State management between executions

The Orbital approach creates a stateless SQL view that:

  • Has no runtime dependencies
  • Requires no package management
  • Works with any SQL-compatible tool
  • Scales automatically with Snowflake’s query engine

Performance at Scale: The generated SQL runs at native database speeds because:

  • No Python interpreter overhead
  • Direct access to Snowflake’s columnar storage
  • Full utilization of query parallelization
  • Optimized execution plans from Snowflake’s query optimizer

This translates to sub-second response times even on massive datasets, making real-time applications feasible.

Real-World Applications

This workflow enables powerful production scenarios:

1. Real-Time Risk Assessment

-- Credit application scoring in real-time
SELECT customer_id, predicted_interest_rate
FROM customer_applications ca
JOIN pyorbital_test_NASTY_TREEFROG_4 model ON ca.id = model.id
WHERE application_date = CURRENT_DATE();

2. Batch Scoring for Marketing

-- Score all customers for targeted loan offers
SELECT customer_id, predicted_interest_rate,
       CASE WHEN predicted_interest_rate < 10 THEN 'Premium'
            WHEN predicted_interest_rate < 15 THEN 'Standard'
            ELSE 'Subprime' END as loan_tier
FROM customer_base cb
JOIN pyorbital_test_NASTY_TREEFROG_4 model ON cb.id = model.id;

3. Model Monitoring and Validation

-- Compare predictions with actual outcomes for model drift detection
SELECT predicted_interest_rate,
       actual_interest_rate,
       ABS(predicted_interest_rate - actual_interest_rate) as prediction_error
FROM loan_outcomes lo
JOIN pyorbital_test_NASTY_TREEFROG_4 model ON lo.id = model.id
WHERE loan_date >= DATEADD(month, -1, CURRENT_DATE());

Key Advantages of This Approach

🚀 Performance & Scale

  • Massive Throughput: Process millions of records in seconds
  • Auto-scaling: Snowflake automatically scales compute resources
  • Query Optimization: Database engine optimizes SQL execution

🔒 Security & Governance

  • Data Residency: Sensitive data never leaves the secure environment
  • Access Control: Leverage existing database security policies
  • Audit Trails: Complete lineage from model training to prediction
  • Compliance: Meet regulatory requirements for data handling

💰 Cost Efficiency

  • No Data Movement: Eliminate expensive data transfer costs
  • Resource Utilization: Pay only for compute time used
  • Infrastructure Simplicity: No separate scoring infrastructure needed

⚒️ Developer Experience

  • Familiar Tools: Use Python/scikit-learn for model development
  • Automatic Translation: No manual SQL rewriting required
  • Version Management: Integrated model versioning and registry

Conclusion

The combination of Snowflake’s Model Registry, Orbital’s SQL generation, and in-database scoring creates a powerful platform for production machine learning that addresses the key challenges of traditional ML deployment:

  • Eliminates Data Movement: Models execute where data lives
  • Scales Effortlessly: Handle enterprise-scale prediction workloads
  • Maintains Security: Keep sensitive data in secure environments
  • Simplifies Operations: No complex infrastructure management
  • Enables Governance: Enterprise-grade model management and auditing

This workflow represents the future of ML deployment: combining the flexibility of Python’s ML ecosystem with the scale, security, and performance of modern data warehouses. Whether you’re building credit scoring models, recommendation engines, or predictive maintenance systems, this approach provides a robust foundation for production ML at scale.

Ready to try this yourself? The complete workflow shown here can be adapted to your own datasets and use cases, providing a template for building production-ready ML systems that scale with your business needs.

Nick Pelikan

Nick Pelikan

Principal Solution Architect at Posit
Nick is a Principal Solution Architect at Posit, where he focuses on helping customers realize their data strategy goals using Posit's open source and commercial tools. When not diving into data architectures, you can find him skiing in Utah's Wasatch Mountains.