sqlm fits linear regression models on datasets residing in SQL databases (DuckDB, Postgres, Snowflake, BigQuery, etc.) without pulling data into R memory. It computes sufficient statistics inside the database engine and solves the normal equations in R.
- Formula interface: `lm_sql(y ~ m*x + b, data = data)`.
- Methods for `broom::tidy()`, `broom::glance()`, `orbital::predict()` and `orbital::augment()` (see the orbital article for more information).
- Categorical predictors are expanded to dummy columns, generating `CASE WHEN` SQL for each dummy.
- `*` and `:` interactions, including interactions between numeric and categorical variables. Categorical interactions are expanded to the full cross of dummy columns.
- `y ~ .` expands to all non-response columns (excluding group columns for grouped data).
- `I()`, `log()`, and `sqrt()` in formulas are translated to their SQL equivalents (`POWER`, `LN`, `SQRT`).
- `Date` and `POSIXct` columns are automatically converted to numeric (days or seconds since epoch) in SQL, matching base R's `lm()` behavior.
- Grouped models: fit on a `dplyr::group_by()` table and get a tibble back with one model per group, computed from a single `GROUP BY` query.
- No-intercept models: `y ~ 0 + x` is handled correctly (uncorrected TSS, proper degrees of freedom).

You can install the development version of sqlm with:

```r
# install.packages("pak")
pak::pak("git::https://www.codeberg.org/usrbinr/sqlm")
```

While I'm not a statistician, I do come from a whole family of statisticians. My dad, my brother, and my cousin are not only health statisticians but also heavy R users, so I mainly grew up hearing about R in a very technical / scientific context.
Over the years I've grown more confident in my statistical skills, but I primarily use R for business intelligence and data engineering tasks, so I'll admit that at first it wasn't obvious to me what I could use linear modeling for in my day-to-day work.
This changed with Julia Silge's post on Using linear models to understand crop yields, which really opened my eyes to the power of linear models for understanding relationships in data beyond just prediction tasks.
So if you like this package, send Julia a thank you note.
> [!NOTE]
> I routinely teach R at work, and I always think the thing that will impress people most is dbplyr, or how tidyselect verbs help us easily navigate our 1,000+ column databases. But it is without doubt when I show folks that you can create a linear model with one or two lines of code with `lm()` that the music 'stops'.
So while that seems obvious now, to be honest the literature tends to be overly focused on more 'scientific' applications, or to lean on domain-specific language, which creates a disconnect for folks like me who are more focused on practical applications of statistics in a business context, where we need to explain results to business managers.
The other challenge is that most business data lives in large SQL databases: massive data warehouses or lakehouses that are impractical to pull into R memory.
So I knew I wanted a package that keeps all of the conveniences and syntactic sugar of R's `lm()` function but works directly on SQL database tables, without pulling data into memory.
There are a few other packages that I'm aware of that provide linear modeling on larger-than-memory datasets:

- modeldb by Edgar Ruiz and Max Kuhn
- biglm by Thomas Lumley
- dbreg by Grant McDermott, which is honestly super impressive from both a user-experience and performance point of view

All of these packages are great and I recommend you check them out.
```r
library(sqlm)
library(dplyr)
library(dbplyr)
library(duckdb)

# 1. Connect and create a lazy table reference
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "mtcars", mtcars)
cars_db <- tbl(con, "mtcars")

# 2. Fit — data stays in the DB, only summary stats return to R
model <- lm_sql(mpg ~ wt + hp + cyl, data = cars_db)
print(model)
```

```
Big Data Linear Regression (S7)
-------------------------------

Call:
lm_sql(formula = mpg ~ wt + hp + cyl, data = cars_db)

Coefficients:
(Intercept)          wt          hp         cyl
 38.7517874  -3.1669731  -0.0180381  -0.9416168
```
```r
library(broom)
tidy(model, conf.int = TRUE)
```

```
# A tibble: 4 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  38.8       1.79       21.7  0         35.1     42.4
2 wt           -3.17      0.741      -4.28 0.000199  -4.68    -1.65
3 hp           -0.0180    0.0119     -1.52 0.140     -0.0424   0.00629
4 cyl          -0.942     0.551      -1.71 0.0985    -2.07     0.187
```
```r
glance(model)
```

```
# A tibble: 1 × 11
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.843         0.826  2.51      50.2 2.18e-11     3  -72.7  155.  163.
# ℹ 2 more variables: nobs <dbl>, df.residual <dbl>
```
```r
cars_db %>%
  group_by(am) %>%
  lm_sql(mpg ~ wt + hp, data = .)
```

```
# A tibble: 2 × 2
     am model
  <dbl> <list>
1     0 <sqlm::__>
2     1 <sqlm::__>
```
You can then use the `dplyr::rowwise()` + `dplyr::mutate()` pattern to do further analysis with the model artifacts (see the getting started article for examples), as sketched below.
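For instance, a minimal sketch of that pattern, assuming the list-column is named `model` as in the output above (see the getting started article for the package's own worked examples):

```r
# Pull a fit statistic out of each per-group model with rowwise() + mutate()
cars_db %>%
  group_by(am) %>%
  lm_sql(mpg ~ wt + hp, data = .) %>%
  rowwise() %>%
  mutate(r_squared = broom::glance(model)$r.squared) %>%
  ungroup()
```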
Base R's `lm()` pulls all rows into memory and solves via
QR decomposition. sqlm instead solves the normal equations directly:
\[\hat{\beta} = (X^TX)^{-1} X^T y\]
The key insight is that \(X^TX\) and \(X^Ty\) are sums of products, which are standard SQL aggregations. For \(p\) predictors the sufficient statistics are:
| Statistic | SQL expression | Size |
|---|---|---|
| \(n\) | `COUNT(*)` | scalar |
| \(S_{x_j}\) | `SUM(x_j)` | \(p\) values |
| \(S_{x_j x_k}\) | `SUM(x_j * x_k)` | \(p(p+1)/2\) values |
| \(S_{x_j y}\) | `SUM(x_j * y)` | \(p\) values |
| \(S_{yy}\) | `SUM(y * y)` | scalar |
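As a concrete instance (standard least-squares algebra, nothing sqlm-specific): for an intercept plus a single predictor \(x\), the design is \(X = [\mathbf{1}, x]\) and the cross-products reduce to

\[
X^TX = \begin{pmatrix} n & S_{x} \\ S_{x} & S_{xx} \end{pmatrix},
\qquad
X^Ty = \begin{pmatrix} S_{y} \\ S_{xy} \end{pmatrix},
\]

every entry of which is one of the aggregates in the table above.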
All of these are computed in a single SQL query and returned as one row. The entire dataset is never materialized in R.
For a model y ~ x1 + x2, sqlm generates:
```sql
SELECT
  COUNT(*) AS N,
  SUM(1.0 * "x1") AS S_x1,
  SUM(1.0 * "x2") AS S_x2,
  SUM(1.0 * "y") AS S_y,
  SUM(1.0 * ("x1") * ("x1")) AS S_x1_x1,
  SUM(1.0 * ("x1") * ("x2")) AS S_x1_x2,
  SUM(1.0 * ("x1") * ("y")) AS S_x1_y,
  SUM(1.0 * ("x2") * ("x2")) AS S_x2_x2,
  SUM(1.0 * ("x2") * ("y")) AS S_x2_y,
  SUM(1.0 * ("y") * ("y")) AS S_y_y
FROM (
  SELECT ... FROM table WHERE x1 IS NOT NULL AND x2 IS NOT NULL AND y IS NOT NULL
) AS subquery
```

The `1.0 *` cast prevents integer overflow on databases that use 32-bit integer arithmetic for `SUM`.
For grouped models, the same query adds GROUP BY columns
and returns one row per group.
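To make that concrete, here is a rough dbplyr sketch (not sqlm's internal code) of what the grouped sufficient-statistics query looks like for a single predictor:

```r
# Per-group sums of products via one GROUP BY query; sqlm builds an
# equivalent (wider) query covering the full design matrix.
cars_db %>%
  group_by(am) %>%
  summarise(
    n    = n(),
    s_x  = sum(wt, na.rm = TRUE),
    s_y  = sum(mpg, na.rm = TRUE),
    s_xx = sum(wt * wt, na.rm = TRUE),
    s_xy = sum(wt * mpg, na.rm = TRUE),
    s_yy = sum(mpg * mpg, na.rm = TRUE)
  ) %>%
  show_query()
```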
When a predictor is character or factor, sqlm:
- Runs `SELECT DISTINCT col FROM ... ORDER BY col` to discover all levels.
- Drops the first level as the reference, matching R's default treatment contrasts (`contr.treatment`).
- Builds a `CASE WHEN col = 'level' THEN 1.0 ELSE 0.0 END` expression for each remaining level, which participates in the sum-of-products query like any numeric column (see the sketch below).

For interactions involving categoricals (e.g., `x * Species`), the interaction is expanded to the Cartesian product of the numeric term and all dummy columns.
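A small sketch of the dummy-expansion idea in plain R. This is illustrative only; the level values and the string-building code below are not sqlm's internals:

```r
# Levels as they might come back from SELECT DISTINCT ... ORDER BY
lvls <- c("setosa", "versicolor", "virginica")

# Drop the reference level (contr.treatment), then build one CASE WHEN
# expression per remaining level for use as a numeric column in the query
dummy_sql <- sprintf(
  "CASE WHEN \"Species\" = '%s' THEN 1.0 ELSE 0.0 END",
  lvls[-1]
)
dummy_sql
```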
When a predictor column is a `Date`, `POSIXct`, or `POSIXlt`, sqlm converts it to a numeric value in SQL before including it in the sum-of-products query — mirroring what base R's `lm()` does via `model.matrix()`:

- `Date` columns become days since the Unix epoch: `CAST(("col" - DATE '1970-01-01') AS DOUBLE PRECISION)`, equivalent to R's `as.numeric(date)`.
- `POSIXct`/`POSIXlt` columns become seconds since the epoch: `EXTRACT(EPOCH FROM "col")`.
Back in R, the returned sums are assembled into the \(p \times p\) matrix \(X^TX\) and the \(p\)-vector \(X^Ty\). The system is solved via Cholesky
decomposition (chol2inv(chol(XtX))), which is both
efficient and numerically stable for the positive-definite cross-product
matrices produced by full-rank designs. For rank-deficient cases (e.g.,
perfectly collinear predictors), the solver falls back to
MASS::ginv() (Moore-Penrose pseudoinverse).
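A minimal sketch of that solve step for a single-predictor model, with the sufficient statistics computed in R purely for illustration (sqlm receives the same sums back from SQL):

```r
x <- mtcars$wt
y <- mtcars$mpg
n <- length(y)

# Assemble X'X and X'y from sums of products (intercept = column of ones)
XtX <- matrix(c(n,      sum(x),
                sum(x), sum(x * x)), nrow = 2, byrow = TRUE)
Xty <- c(sum(y), sum(x * y))

# Cholesky solve for full-rank designs, pseudoinverse fallback otherwise
beta <- tryCatch(
  drop(chol2inv(chol(XtX)) %*% Xty),
  error = function(e) drop(MASS::ginv(XtX) %*% Xty)
)
beta
# matches coef(lm(mpg ~ wt, data = mtcars))
```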
From the solution \(\hat\beta\):
| Quantity | Formula |
|---|---|
| RSS | \(S_{yy} - \hat\beta^T X^Ty\) |
| TSS | \(S_{yy} - S_y^2/n\) (intercept models) or \(S_{yy}\) (no-intercept) |
| \(R^2\) | \(1 - \text{RSS}/\text{TSS}\) |
| \(\hat\sigma^2\) | \(\text{RSS} / (n - p)\) |
| Standard errors | \(\sqrt{\text{diag}((X^TX)^{-1}) \cdot \hat\sigma^2}\) |
| \(t\)-statistics | \(\hat\beta_j / \text{SE}_j\) |
| \(F\)-statistic | \(\frac{(\text{TSS} - \text{RSS})/d_1}{\text{RSS}/d_2}\), with \(d_1\) the model and \(d_2\) the residual degrees of freedom |
Log-likelihood, AIC, and BIC are computed from the Gaussian likelihood assuming normal residuals.
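Continuing the single-predictor sketch from the solve step above, the same quantities can be computed from the sufficient statistics alone. This is only a sketch of the formulas in the table, using \(d_1 = p - 1\) and \(d_2 = n - p\) for an intercept model:

```r
p      <- length(beta)                          # number of coefficients
rss    <- sum(y * y) - sum(beta * Xty)          # S_yy - beta' X'y
tss    <- sum(y * y) - sum(y)^2 / n             # corrected TSS (intercept model)
r2     <- 1 - rss / tss
sigma2 <- rss / (n - p)
se     <- sqrt(diag(chol2inv(chol(XtX))) * sigma2)
tstat  <- beta / se
fstat  <- ((tss - rss) / (p - 1)) / (rss / (n - p))

# Gaussian log-likelihood and information criteria (k = p + 1 counts sigma)
loglik <- -0.5 * n * (log(2 * pi) + log(rss / n) + 1)
aic    <- -2 * loglik + 2 * (p + 1)
bic    <- -2 * loglik + log(n) * (p + 1)
```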
Before computing sums, all rows with NULL in any model
variable are excluded via a WHERE ... IS NOT NULL clause
(equivalent to `na.action = na.omit`). This is expressed with
`dplyr::filter(if_all(..., ~ !is.na(.)))` and translated to SQL by dbplyr.
Formula terms wrapped in I() have their inner expression
translated to SQL:
- `x^2` becomes `POWER(x, 2)`
- `log(x)` becomes `LN(x)`
- `sqrt(x)` stays `SQRT(x)`

These translated expressions are substituted into the sum-of-products query like any other predictor.
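For example, a formula mixing the supported transformations. The SQL fragments in the comments are just the mappings listed above; exact quoting depends on the backend:

```r
# Each transformed term is rewritten into a SQL expression before the
# sum-of-products query is built
model <- lm_sql(mpg ~ I(wt^2) + log(hp) + sqrt(disp), data = cars_db)
# I(wt^2)    -> POWER("wt", 2)
# log(hp)    -> LN("hp")
# sqrt(disp) -> SQRT("disp")
```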
A few practical notes:

- Rank-deficient designs use `MASS::ginv()` as a fallback, but results may diverge from `lm()` for near-singular designs.
- sqlm only needs standard SQL aggregations and expressions (`SUM`, `COUNT`, `CASE WHEN`). Tested on DuckDB, PostgreSQL, and Snowflake.