
Introducing Orbital for converting Scikit-learn pipelines into SQL queries

Written by Nick Pelikan
2023-01-13

We are thrilled to announce Orbital! With Orbital, you can construct your Scikit-learn pipelines and then transform them into SQL queries. This means your predictions can run directly in the database, with no Python environment required for real-time inference.

You can install Orbital from PyPI1:

pip install orbital

In regulated, secure, or resource-constrained environments, deploying and maintaining Python environments for model inference can be impractical or even impossible. Orbital solves this by transforming your trained Scikit-learn pipelines into native SQL queries. This allows you to execute powerful machine learning predictions directly within any SQL database, leveraging its inherent scalability and security features.

This capability is particularly beneficial for users of platforms like Snowflake. For instance, running a model via SQL on Snowflake with Orbital has demonstrated up to a 5x speedup compared to executing the same model in Python via Snowpark, highlighting the significant performance gains of native SQL execution.

Training and converting Scikit-Learn Pipelines

Orbital currently supports a range of widely used Scikit-learn models for regression and classification, including Linear Regression and Decision Trees2. Define and train your models in Scikit-learn, convert them with Orbital, then deploy the resulting plain SQL directly in your database.

Let’s walk through an example: we’ll create a Scikit-learn linear regression pipeline and convert it to an Orbital pipeline using orbital.parse_pipeline. First, import the libraries:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import orbital
import snowflake.connector

Next, load and prepare the data. In this case, we are using the Iris dataset stored as a table in our Snowflake warehouse.

con = snowflake.connector.connect(
    connection_name="workbench",
    warehouse="DEFAULT_WH",
    database="DEMOS",
    schema="PUBLIC",
)

cur = con.cursor()

cur.execute("SELECT * FROM IRIS_DAT")
iris = cur.fetch_pandas_all()

COLUMNS = ["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"]
iris_x = iris[COLUMNS]

X_train, X_test, y_train, y_test = train_test_split(
    iris_x, iris.SPECIES, test_size=0.2, random_state=42
)

Create the Scikit-learn pipeline:

pipeline = Pipeline(
    [
        ("preprocess", ColumnTransformer([("scaler", StandardScaler(with_std=False), COLUMNS)],
                                        remainder="passthrough")),
        ("linear_regression", LinearRegression()),
    ]
)

pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('scaler',
                                                  StandardScaler(with_std=False),
                                                  ['SEPAL_LENGTH',
                                                   'SEPAL_WIDTH',
                                                   'PETAL_LENGTH',
                                                   'PETAL_WIDTH'])])),
                ('linear_regression', LinearRegression())])

Now, convert it to an Orbital pipeline. Print the pipeline to see the result:

orbital_pipeline = orbital.parse_pipeline(pipeline, features={
    "SEPAL_LENGTH": orbital.types.DoubleColumnType(),
    "SEPAL_WIDTH": orbital.types.DoubleColumnType(),
    "PETAL_LENGTH": orbital.types.DoubleColumnType(),
    "PETAL_WIDTH": orbital.types.DoubleColumnType(),
})

print(orbital_pipeline)
ParsedPipeline(
  features={
    SEPAL_LENGTH: DoubleColumnType()
    SEPAL_WIDTH: DoubleColumnType()
    PETAL_LENGTH: DoubleColumnType()
    PETAL_WIDTH: DoubleColumnType()
  },
  steps=[
    merged_columns=Concat(
      inputs: SEPAL_LENGTH, SEPAL_WIDTH, PETAL_LENGTH, PETAL_WIDTH,
      attributes: 
        axis=1
    )
    variable1=Sub(
      inputs: merged_columns, Su_Subcst=[5.809166666666666, 3.0616666666666665, 3.7266666666666666, 1.18333333...,
      attributes: 
    )
    multiplied=MatMul(
      inputs: variable1, coef=[-0.11633479416518264, -0.059777851719802544, 0.2549137469977225, 0.5475959...,
      attributes: 
    )
    resh=Add(
      inputs: multiplied, intercept=[0.9916666666666668],
      attributes: 
    )
    variable=Reshape(
      inputs: resh, shape_tensor=[-1, 1],
      attributes: 
    )
  ],
)
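Reading the parsed steps, the model reduces to three operations: center the features (Sub), take a dot product with the coefficients (MatMul), and add the intercept (Add). As a sketch of that arithmetic in plain NumPy, using the full-precision constants that appear in the generated SQL later in the post:

```python
import numpy as np

# Constants taken from the generated SQL query below:
# centering means, regression coefficients, and intercept.
means = np.array([5.809166666666666, 3.0616666666666665,
                  3.7266666666666666, 1.1833333333333333])
coef = np.array([-0.11633479416518264, -0.059777851719802544,
                 0.2549137469977225, 0.5475959809777826])
intercept = 0.9916666666666668

# One flower's measurements: Sub, then MatMul, then Add.
x = np.array([5.1, 3.5, 1.4, 0.2])
prediction = (x - means) @ coef + intercept
print(prediction)  # ≈ -0.0836
```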

Generating SQL

After running orbital.parse_pipeline, the Scikit-learn Pipeline becomes an Orbital pipeline. This can then be transformed into SQL using orbital.export_sql. In this example, I’ve specified ‘snowflake’ as the dialect because I intend to execute it in a Snowflake database. Print the resulting query to see the model pipeline converted into SQL:

sql = orbital.export_sql("IRIS_DAT", orbital_pipeline, dialect="snowflake")

print(sql)
SELECT ("t0"."SEPAL_LENGTH" - 5.809166666666666) * -0.11633479416518264 + 0.9916666666666668 + ("t0"."SEPAL_WIDTH" - 3.0616666666666665) * -0.059777851719802544 + ("t0"."PETAL_LENGTH" - 3.7266666666666666) * 0.2549137469977225 + ("t0"."PETAL_WIDTH" - 1.1833333333333333) * 0.5475959809777826 AS "variable" FROM "IRIS_DAT" AS "t0"

Looks like SQL!

Run Scikit-learn pipelines on a database

Once the SQL is generated, you can run the pipeline in a Snowflake database with cur.execute. All of the predictions happen in the database, in SQL.

cur.execute(sql)
predicted_y_snowflake_df = cur.fetch_pandas_all()

print("\nPrediction with SQL (from Snowflake):")
print(predicted_y_snowflake_df["variable"].iloc[:5].to_numpy())
print("-" * 62)

cur.close()
Prediction with SQL (from Snowflake):
[-0.0836039  -0.03044801 -0.044628    0.02396601 -0.07794821]
--------------------------------------------------------------
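As a sanity check, you can reproduce the SQL's arithmetic locally and confirm it agrees with pipeline.predict. The sketch below is self-contained: it rebuilds the same pipeline from scikit-learn's bundled Iris data rather than the Snowflake table (so the fitted constants may differ slightly from the ones printed above), then redoes the centered dot product the generated SQL performs:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

COLUMNS = ["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"]
frame = load_iris(as_frame=True).frame
frame.columns = COLUMNS + ["SPECIES"]

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("scaler", StandardScaler(with_std=False), COLUMNS)],
        remainder="passthrough")),
    ("linear_regression", LinearRegression()),
])
pipeline.fit(frame[COLUMNS], frame.SPECIES)

# The generated SQL computes (x - mean) * coef summed per column, plus the
# intercept; redo that arithmetic here and compare with pipeline.predict.
means = pipeline.named_steps["preprocess"].named_transformers_["scaler"].mean_
model = pipeline.named_steps["linear_regression"]
sql_style = (frame[COLUMNS].to_numpy() - means) @ model.coef_ + model.intercept_

print(np.allclose(sql_style, pipeline.predict(frame[COLUMNS])))  # → True
```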


Learn more about leveraging databases with Python and Scikit-learn

We covered a lot of ground in this post, from connecting to a database to preparing that data for machine learning and executing predictions directly in the database with SQL. With the orbital package, you can unlock the full power of your cloud data warehouse for advanced analytics and machine learning.

We encourage you to check out our resources, or reach out to our sales team to schedule a demo and see Orbital in Databricks or Snowflake.

Footnotes

  1. Scikit-learn and pandas are also required.

  2. Orbital currently supports the following models:

    • Linear Regression
    • Logistic Regression
    • Lasso Regression
    • Elastic Net
    • Decision Tree Regressor
    • Decision Tree Classifier
    • Random Forest Classifier
    • Gradient Boosting Regressor
    • Gradient Boosting Classifier
Nick Pelikan

Principal Solution Architect at Posit
Nick is a Principal Solution Architect at Posit, where he focuses on helping customers realize their data strategy goals using Posit's open source and commercial tools. When not diving into data architectures, you can find him skiing in Utah's Wasatch Mountains.