One of the benefits of working in R is the ease with which you can implement complex models and build challenging data analysis pipelines. Take, for example, the parsnip package; with the installation of a few associated libraries and a few lines of code, you can fit something as sophisticated as a boosted tree:
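A minimal sketch of what such a fit might look like, assuming the parsnip and C50 packages are installed. The C5.0 engine, the 15 trees, and the two predictors are illustrative choices rather than requirements:

```r
library(parsnip)

# Assumption: the "C5.0" engine (from the C50 package) is available.
# Any other engine supported by boost_tree() could be swapped in.
fitted_model <- boost_tree(mode = "classification", trees = 15) %>%
  set_engine("C5.0") %>%
  fit(as.factor(am) ~ disp + hp, data = mtcars)

fitted_model
```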
Yet, while this code is compact, the underlying fitted result may not
be. Since parsnip works as a wrapper for many modeling packages, its
fitted model objects inherit the same properties as those that arise
from the original modeling package. A straightforward example is the
lm() function from the base stats package.
Whether you leverage parsnip or not, you get the same result:
library(parsnip)

parsnip_lm <- linear_reg() %>%
  fit(mpg ~ ., data = mtcars)
parsnip_lm
#> parsnip model object
#> 
#> 
#> Call:
#> stats::lm(formula = mpg ~ ., data = data)
#> 
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat           wt  
#>    12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
#>        qsec           vs           am         gear         carb  
#>     0.82104      0.31776      2.52023      0.65541     -0.19942

Using just lm():
old_lm <- lm(mpg ~ ., data = mtcars) 
old_lm
#> 
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#> 
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat           wt  
#>    12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
#>        qsec           vs           am         gear         carb  
#>     0.82104      0.31776      2.52023      0.65541     -0.19942

Let’s say we take this familiar old_lm approach in
building a custom in-house modeling pipeline. Such a pipeline might
entail wrapping lm() in another function, but in doing so, we
may end up carrying around some unnecessary junk.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars) 
}

The linear model fit that comes out of our custom modeling pipeline is then much larger than it needs to be.
But it is functionally the same as our old_lm, which
takes up only a fraction of that memory.
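One way to check both sizes, sketched here with base R only: measure how many bytes each object needs when serialized, which approximates what saveRDS() would write before compression. The helper name serialized_bytes() is hypothetical. Note that utils::object.size() does not count the contents of environments, so it would miss the junk entirely; lobstr::obj_size() is an alternative if that package is installed.

```r
# Same setup as above, repeated so the snippet runs on its own.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # 1e6 doubles, roughly 8 MB
  lm(mpg ~ ., data = mtcars)
}
big_lm <- in_house_model()
old_lm <- lm(mpg ~ ., data = mtcars)

# Hypothetical helper: bytes needed to serialize an object, including
# any environments it drags along.
serialized_bytes <- function(x) length(serialize(x, connection = NULL))

serialized_bytes(big_lm)
serialized_bytes(old_lm)
```

When run at the top level of a fresh session, the first number should be dominated by the million doubles captured inside in_house_model().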
Ideally, we want to avoid saving the output of
in_house_model() to disk when we could have something like
old_lm that takes up less memory. But what the heck is
going on here? We can examine possible issues with a fitted model object
using the butcher package:
library(butcher)

big_lm <- in_house_model()
weigh(big_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.01    
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # ℹ 15 more rows

The problem here is in the terms component of
big_lm. Because of how lm() is implemented in
the base stats package (relying on intermediate forms of
the data from model.frame() and model.matrix()),
the environment in which the linear fit was created is
carried along in the model output.
We can see this with the env_print() function from the
rlang package:
library(rlang)
env_print(big_lm$terms)
#> <environment: 0x13edf0ba0>
#> Parent: <environment: global>
#> Bindings:
#> • some_junk_in_the_environment: <dbl>

To avoid carrying possible junk around in our production pipeline,
whether it comes from an lm() model or something
more complex, we can leverage axe_env() from the butcher
package:
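A minimal sketch of that call, assuming butcher is installed and reusing the in_house_model() defined earlier (repeated here so the snippet runs on its own). The final check is an illustrative addition; it should come back FALSE once the junk is no longer attached to the terms:

```r
library(butcher)

in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # junk captured by the fit
  lm(mpg ~ ., data = mtcars)
}
big_lm <- in_house_model()

# Swap out the environment that the fit carried along.
cleaned_lm <- axe_env(big_lm)

# Is the junk still bound in the environment attached to the terms?
terms_env <- attr(cleaned_lm$terms, ".Environment")
exists("some_junk_in_the_environment", envir = terms_env, inherits = FALSE)
```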
Comparing it against our old_lm, we find:
weigh(cleaned_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         0.00771 
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # ℹ 15 more rows

And now it takes up roughly the same amount of memory as old_lm:
weigh(old_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         0.00763 
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # ℹ 15 more rows

Axing the environment, however, is not the only functionality of butcher. This package provides five S3 generics that include:

- axe_call(): Remove the call object.
- axe_ctrl(): Remove the controls fixed for training.
- axe_data(): Remove the original data.
- axe_env(): Replace inherited environments with empty environments.
- axe_fitted(): Remove fitted values.

In our case here with lm(), if we are only interested in
prediction as the end product of our modeling pipeline, we could free up
a lot of memory if we execute all the possible axe functions at once. To
do so, we simply run butcher():
butchered_lm <- butcher(big_lm)
predict(butchered_lm, mtcars[, 2:11])
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>            22.59951            22.11189            26.25064            21.23740 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>            17.69343            20.38304            14.38626            22.49601 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>            24.41909            18.69903            19.19165            14.17216 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>            15.59957            15.74222            12.03401            10.93644 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>            10.49363            27.77291            29.89674            29.51237 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>            23.64310            16.94305            17.73218            13.30602 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>            16.69168            28.29347            26.15295            27.63627 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>            18.87004            19.69383            13.94112            24.36827

Alternatively, we can pick and choose specific axe functions, removing only those parts of the model object that we are no longer interested in characterizing.
butchered_lm <- big_lm %>%
  axe_env() %>% 
  axe_fitted()
predict(butchered_lm, mtcars[, 2:11])
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>            22.59951            22.11189            26.25064            21.23740 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>            17.69343            20.38304            14.38626            22.49601 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>            24.41909            18.69903            19.19165            14.17216 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>            15.59957            15.74222            12.03401            10.93644 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>            10.49363            27.77291            29.89674            29.51237 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>            23.64310            16.94305            17.73218            13.30602 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>            16.69168            28.29347            26.15295            27.63627 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>            18.87004            19.69383            13.94112            24.36827

The butcher package provides tooling to axe parts of the fitted output that are no longer needed, without sacrificing much functionality from the original model object.
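As a quick sanity check on that claim, the sketch below compares predictions from the original and the butchered fit. It assumes butcher is installed and reuses in_house_model() from earlier, repeated here so the snippet runs on its own:

```r
library(butcher)

in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # junk captured by the fit
  lm(mpg ~ ., data = mtcars)
}
big_lm <- in_house_model()
butchered_lm <- butcher(big_lm)

new_data <- mtcars[, 2:11]

# all.equal() returns TRUE when both fits give the same predictions.
all.equal(predict(big_lm, new_data), predict(butchered_lm, new_data))
```

Prediction is the use case butcher is designed to preserve. Methods that rely on the removed pieces, such as the stored call or the fitted values, may no longer behave as they did on the original object.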