--- title: "Tips for Production" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{tips-for-production} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Manage Dependencies Finn continues to get updated on a regular basis. A best practice is to ensure you are using a specific version of Finn for your production forecast. This can be done through the use of the [renv package](https://rstudio.github.io/renv/) while using Finn on your local machine, or using [docker containers](https://colinfay.me/docker-r-reproducibility/) for running Finn in the cloud. ## Azure ML Pipelines Finn was built to run at scale in Azure, leveraging spark as the parallel back end. Check out the parallel processing vignette to learn how to get Finn running on Azure services like Databricks. The best way to run Finn in production is through the use of [Azure Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/), specifically Azure ML Pipelines. Below are a few tips for leveraging Azure ML Pipelines - Leverage the Azure ML CLI v2 interface to submit R scripts in a pipeline. - Use the sub components of Finn instead of `forecast_time_series()` so you can have different pipeline steps ran for each step of the Finn forecast process. - If you do use the sub components of Finn, then definitely set `add_unique_id` to FALSE within `set_run_info()`. That will let you call the exact same run info in each separate pipeline step. Make sure that you are using your own unique `run_name` within `set_run_info()`, one that is different than previous Finn runs but the same name when using `set_run_info()` in each pipeline step. - Connect your Azure ML Pipeline to a spark compute cluster and mount an Azure Data Lake Storage connection. Either through Azure Databricks or Azure Synapse. Also consider using different spark cluster configurations depending on the Finn forecast sub component you are running in each pipeline step. - For `prep_data()` and `prep_models()`, use the default spark cluster settings. Where each spark task gets sent to a specific core on an executor. - For all other functions like `train_models()`, `ensemble_models()`, and `final_models()` consider adjusting the spark cluster settings like "spark.executor.cores" equal to 1 and set `inner_parallel` to TRUE within each function. That way only a single task/time series gets sent to each spark executor node, and all cores within that node can be used during the modeling process for that task/time series. Use `num_cores` within each function to control how many cores on the executor to use, with the default being all available cores minus one. This can significantly speed up run time, but if you have many time series and want to run all of the models within Finn ensure that you have a spark cluster that can scale to many VM's.