\documentclass[nojss]{jss} %\VignetteIndexEntry{party with the mob} %\VignetteDepends{mlbench,colorspace} %\VignetteKeywords{parametric models, object-orientation, recursive partitioning} %\VignettePackage{party} %% packages \usepackage{amsmath} \SweaveOpts{engine=R} %% neet no \usepackage{Sweave} <>= require("party") options(useFancyQuotes = FALSE) @ %% commands \newcommand{\ui}{\underline{i}} \newcommand{\oi}{\overline{\imath}} \newcommand{\argmin}{\operatorname{argmin}\displaylimits} %% author/title \author{Achim Zeileis\\Universit\"at Innsbruck \And Torsten Hothorn\\Ludwig-Maximilians-\\Universit\"at M\"unchen \And Kurt Hornik\\WU Wirtschafts-\\universit\"at Wien} \Plainauthor{Achim Zeileis, Torsten Hothorn, Kurt Hornik} \title{\pkg{party} with the \code{mob}: Model-Based Recursive Partitioning in \proglang{R}} \Plaintitle{party with the mob: Model-Based Recursive Partitioning in R} \Shorttitle{\pkg{party} with the \texttt{mob}} \Keywords{parametric models, object-orientation, recursive partitioning} %% abstract \Abstract{ The \pkg{party} package \citep{Hothorn+Hornik+Zeileis:2006a} provides the function \code{mob()} implementing a recently suggested algorithm for \underline{mo}del-\underline{b}ased recursive partitioning \citep{Zeileis+Hothorn+Hornik:2005}. The basic steps are: (1)~fit a parametric model to a data set, (2)~test for parameter instability over a set of partitioning variables, (3)~if there is some overall parameter instability, split the model with respect to the variable associated with the highest instability, (4)~repeat the procedure in each of the child nodes. It is discussed how these steps of the conceptual algorithm are translated into computational tools in an object-oriented manner, allowing the user to plug in various types of parametric models. The outcome is a tree where each node is associated with a fitted parametric model that can be effectively visualized and summarized. } \Address{ Achim Zeileis\\ Department of Statistics\\ Faculty of Economics and Statistics\\ Universit\"at Innsbruck\\ Universit\"atsstr.~15\\ 6020 Innsbruck, Austria\\ E-mail: \email{Achim.Zeileis@R-project.org}\\ URL: \url{http://eeecon.uibk.ac.at/~zeileis/}\\ Torsten Hothorn\\ Institut f\"ur Statistik\\ Ludwig-Maximilians-Universit\"at M\"unchen\\ Ludwigstra{\ss}e 33\\ 80539 M\"unchen, Germany\\ E-mail: \email{Torsten.Hothorn@R-project.org}\\ URL: \url{http://www.stat.uni-muenchen.de/~hothorn/}\\ Kurt Hornik\\ Institute for Statistics and Mathematics\\ WU Wirtschaftsuniversit\"at Wien\\ Augasse 2--6\\ 1090 Wien, Austria\\ E-mail: \email{Kurt.Hornik@R-project.org}\\ URL: \url{http://statmath.wu.ac.at/~hornik/}\\ } \begin{document} \section{Motivation} \label{sec:motivation} Consider a parametric model $\mathcal{M}(Y, \theta)$ with (possibly vector-valued) observations $Y$ and a $k$-dimensional vector of parameters $\theta$. This model could be a (possibly multivariate) normal distribution for $Y$, or some kind of regression model when $Y = (y, x)$ can be split up into a dependent variable $y$ and regressors $x$. An example for the latter could be a linear regression model $y = x^\top \theta$ or a generalized linear model (GLM) or a survival regression. Given $n$ observations $Y_i$ ($i = 1, \dots, n$) the model can be fitted by minimizing some objective function $\Psi(Y, \theta)$, e.g., a residual sum of squares or a negative log-likelihood leading to ordinary least squares (OLS) or maximum likelihood (ML) estimation. If a global model for all $n$ observations does not fit well and further covariates $Z_1, \dots, Z_\ell$ are available, it might be possible to partition the $n$ observations with respect to these variables and find a fitting model in each cell of the partition. The algorithm described here tries to find such a partition adaptively using a greedy forward search. This procedure is implemented in the function \code{mob()} and described in more detail in the following section. However, we will state some goals and design principles in advance. To translate the model-based partitioning problem into \proglang{R}, we start with a formula description of the variables involved. This formula should be of type \verb:y ~ x1 + ... + xk | z1 + ... + zl: where the variables on the left of the \code{|} specify the data $Y$ and the variables on the right specify the partitioning variables $Z_j$. Classical regression trees usually have a univariate response $Y$ and various partitioning variables, i.e., could be specified as \verb:y ~ 1 | z1 + ... + zl:. Structural change models, on the other hand, are usually regression models that are segmented with respect to a single partitioning variable, typically time: \verb:y ~ x1 + ... + xk | z:. The type of models $\mathcal{M}$ to be used with \code{mob()} should not be confined (by the implementation), hence we have written an object-oriented implementation. The idea is that $\mathcal{M}$ is translated into software by a model of class ``\code{StatModel}'' as provided by the \pkg{modeltools} package. The algorithm the relies on various methods being available for these models. The ``\code{StatModel}'' objects \code{linearModel} and \code{glinearModel}, implementing (generalized) linear regression models, are readily available in \pkg{modeltools}, others can easily be user-defined. \section{The model-based recursive partitioning algorithm} \label{sec:algorithm} The basic idea is to grow a tee in which every node is associated with a model of type $\mathcal{M}$. To assess whether splitting of the node is necessary a fluctuation test for parameter instability is performed. If there is significant instability with respect to any of the partitioning variables $Z_j$, the node is splitted into $B$ locally optimal segments (currently only $B = 2$ is implemented) and then the procedure is repeated in each of the $B$ children. If no more significant instabilities can be found, the recursion stops. More precisely, the steps of the algorithm are \begin{enumerate} \item Fit the model once to all observations in the current node. \item Assess whether the parameter estimates are stable with respect to every partitioning variable $Z_1, \dots, Z_\ell$. If there is some overall instability, select the variable $Z_j$ associated with the highest parameter instability, otherwise stop. \item Compute the split point(s) that locally optimize the objective function $\Psi$. \item Split the node into child nodes and repeat the procedure. \end{enumerate} The details for steps 1--3 are specified in the following. \subsection{Parameter estimation} This step of the algorithm is common practice, the only additional requirement is (as previously noted) that model has to be of the class ``\code{StatModel}'' as provided by \pkg{modeltools}. Looking at the source code for the \code{linearModel} provided by this package illustrates how a simple wrapper to existing \proglang{R} functionality can be written. In particular, a method to the generic function \code{reweight()} has to be available. The reason is that it is inefficient to fit a brand-new model \code{modelobj} (including formula-parsing) in every node---much computation time is saved if simply \code{reweight(modelobj, weights)} is called in each of the child nodes. The \code{weights} argument controls which observations go into which of the child nodes. \subsection{Testing for parameter instability} The task in this step of the algorithm is to find out whether the parameters of the fitted model are stable over each particular ordering implied by the partitioning variables $Z_j$ or whether splitting the sample with respect to one of the $Z_j$ might capture instabilities in the parameters and thus improve the fit. The tests used in this step belong to the class of generalized M-fluctuation tests \citep{ZeileisHornik2003,Zeileis2005}. For numerical partitioning variables $Z_j$ the $\sup LM$~statistic is used which is the maximum over all single split $LM$ statistics. For categorical partitioning variables, a $\chi^2$~statistic is employed which captures the fluctuation within each of the categories of $Z_j$. For computing the test statistics and corresponding $p$~values $p_j$ for each of the partitioning variables $Z_j$ in \proglang{R}, the only requirement is that there are methods for the extractor functions \code{estfun()} and \code{weights()}. The \code{estfun()} method extracts the empirical estimating functions (model scores) from the fitted \code{modelobj}, these are the main ingredient for M-fluctuation tests. The \code{weights()} method is used to determine which observations are in the current node (i.e., have a weight greater than zero) and which are not (i.e., have zero weight). To determine whether there is some overall instability, it is checked whether the minial $p$~value $p_{j^*} = \min_{j = 1, \dots, \ell} p_j$ falls below a pre-specified significance level $\alpha$ (by default $\alpha = 0.05$) or not. To adjust for multiple testing, the $p$ values can be Bonferroni adjusted (which is the default). If there is significant instability, the variable $Z_{j^*}$ associated with the minimal $p$~value is used for splitting the node. \subsection{Splitting} In this step, the observation in the current node are split with respect to the chosen partitioning variable $Z_{j^*}$ into $B$ child nodes. Currently, the infrastructure in \pkg{party} only supports binary splits, i.e., $B = 2$. For deterimining the split point, an exhaustive search procedure is adopted: For each conceivable split point, the $B$ child node models are fit and the split associated with the minimal value of the objective function $\Psi$ is chosen. Computationally, this means that the fitted model \code{modelobj} is \code{reweight()}ed for each of the child nodes. The observations entering a child node keep their current weight while those observations that go into different child nodes receive zero weight. To compare the objective function $\Psi$, an extractor function is required to compute it from the fitted \code{modelobj}. This extractor function can be user-specified and set in \verb:mob_control():, it defaults to \code{deviance()}. This concludes one iteration of the recursive partitioning algorithm and steps~1--3 are carried out again in each of the $B$ daughter nodes until no significant instability is detected in step~2. \section{Illustrations} \label{sec:illustration} \subsection{Boston housing data} Since the analysis by \cite{BreimanFriedman1985}, the Boston housing data are a popular and well-investigated empirical basis for illustrating non-linear regression methods both in machine learning and statistics \citep[see][for two recent examples]{Gama2004,Samarovetal2005} and we follow these examples by segmenting a bivariate linear regression model for the house values. Thus, the model $\mathcal{M}$ used is \code{linearModel} from the \pkg{modeltools} package which is automatically loaded together with \pkg{party}. <<>>= library("party") @ The data set is available in package \pkg{mlbench} via <<>>= data("BostonHousing", package = "mlbench") @ and provides $n = \Sexpr{NROW(BostonHousing)}$ observations of the median value of owner-occupied homes in Boston (in USD~1000) along with $\Sexpr{NCOL(BostonHousing)}$ covariates including in particular the number of rooms per dwelling (\code{rm}) and the percentage of lower status of the population (\code{lstat}). A segment-wise linear relationship between the value and these two variables is very intuitive, whereas the shape of the influence of the remaining covariates is rather unclear and hence should be learned from the data. Therefore, a linear regression model for median value explained by \verb:rm^2: and \verb:log(lstat): with $k = 3$ regression coefficients is employed and partitioned with respect to all $\ell = 11$ remaining variables. To facilitate subsequent commands, the transformations are explicitely stored in \code{BostonHousing}: <<>>= BostonHousing$lstat <- log(BostonHousing$lstat) BostonHousing$rm <- BostonHousing$rm^2 @ Choosing appropriate transformations of the dependent variable and the regressors that enter the linear regression model is important to obtain a well-fitting model in each segment and we follow in our choice the recommendations of \cite{BreimanFriedman1985}. Monotonous transformations of the partitioning variables do not affect the recursive partitioning algorithm and hence do not have to be performed. However, it is important to distinguish between numerical and categorical variables for choosing an appropriate parameter stability test. The variable \code{chas} is a dummy indicator variable (for tract bounds with Charles river) and should thus be turned into a factor. Furthermore, the variable \code{rad} is an index of accessibility to radial highways and takes only 9 distinct values. Thus it is most appropriately treated as an ordered factor. <<>>= BostonHousing$chas <- factor(BostonHousing$chas, levels = 0:1, labels = c("no", "yes")) BostonHousing$rad <- factor(BostonHousing$rad, ordered = TRUE) @ Both transformations only affect the parameter stability test chosen (step~2), not the splitting procedure (step~3). The model is estimated by OLS, the instability is assessed using a Bonferroni-corrected significance level of $\alpha = 0.05$ and the nodes are split with a required minimal segment size of $40$ observations. The control parameters are thus set to <<>>= ctrl <- mob_control(alpha = 0.05, bonferroni = TRUE, minsplit = 40, objfun = deviance, verbose = TRUE) @ Actually, all of these settings are the defaults except \code{minsplit = 40} and \code{verbose = TRUE} which causes some information about the fitting process being written to the screen. The objective function \code{deviance()} extracts in this case the residual sum of squares from a fitted \code{linearModel} object. Having collected all building blocks, we can now call the function \code{mob()} that takes the model specification of the linear regression model \verb:medv ~ lstat + rm: plus all partitioning variables, along with the \code{data} set, the \code{control} settings and the \code{model} to be used. <<>>= fmBH <- mob(medv ~ lstat + rm | zn + indus + chas + nox + age + dis + rad + tax + crim + b + ptratio, data = BostonHousing, control = ctrl, model = linearModel) @ The result is the fitted model \code{fmBH} of class ``\code{mob}'' that contains the tree with a fitted linear regression associated with every node. Printing this object will show the splits, their $p$ values and call the \code{print()} method for the model in each terminal node (i.e., this simply relies on a \code{print()} method being available for the fitted model and re-uses it). <<>>= fmBH @ Looking at the printed output is typically rather tedious, a visualization via the \code{plot()} method <>= plot(fmBH) @ is much easier to interpret. By default, this produces partial scatter plots of the variable $y$ against each of the regressors $x_i$ in the terminal nodes. Each scatter plot also shows the fitted values, i.e., a project of the fitted hyperplane. \setkeys{Gin}{width=\textwidth} \begin{figure}[p] \begin{center} <>= plot(fmBH) @ \includegraphics[width=18cm,keepaspectratio,angle=90]{MOB-BostonHousing-plot} \caption{\label{fig:BostonHousing} Linear-regression-based tree for the Boston housing data. The plots in the leaves give partial scatter plots for \code{rm} (upper panel) and \code{lstat} (lower panel).} \end{center} \end{figure} From this visualization, it can be seen that in the nodes~4, 6, 7 and 8 the increase of value with the number of rooms dominates the picture (upper panel) whereas in node~9 the decrease with the lower status population percentage (lower panel) is more pronounced. Splits are performed in the variables \code{tax} (poperty-tax rate) and \code{ptratio} (pupil-teacher ratio). Various quantities of interest can be computed, provided that the \code{model} used provides the corresponding methods, e.g., \code{predict()}, \code{residuals()}, \code{logLik()}, \code{coef()} and \code{summary()}. The latter two by default try to extract information for the terminal nodes, but a \code{node} argument can be set to the node IDs of interest. As an example, the regression coefficients for the terminal node models can be easily extracted by <<>>= coef(fmBH) @ reflecting the differences of the models that can also be seen in the the associated \code{plot()}. Even more information is available in a \code{summary()}, e.g., for node 7: <<>>= summary(fmBH, node = 7) @ The test statistics and $p$~values computed in each node, can be extracted analogously by using the method for the function \code{sctest()} (for performing \underline{s}tructural \underline{c}hange \underline{test}s). <<>>= sctest(fmBH, node = 7) @ For summarizing the quality of the fit, we could compute the mean squared error, log-likelihood or AIC: <<>>= mean(residuals(fmBH)^2) logLik(fmBH) AIC(fmBH) @ <>= nt <- NROW(coef(fmBH)) nk <- NCOL(coef(fmBH)) @ As the \code{logLik()} method simply re-uses the method for \code{linearModel} objects, this does not only report $\Sexpr{(nk+1)*nt-1}$ estimated parameters ($\Sexpr{nk}$ parameters in each of the $\Sexpr{nt}$ terminal nodes plus $\Sexpr{nt} - 1$ split points) but $\Sexpr{(nk+2)*nt-1}$ parameters because each terminal node is additionally associated with a variance estimate. However, for the fitting process, the variance was treated as a nuisance parameter as we employed OLS estimation (rather than fully-specified ML estimation). \subsection{Pima Indians diabetes data} Another popular benchmark data set for binary classifications is the Pima Indians diabetes database which is also available from \pkg{mlbench}: <<>>= data("PimaIndiansDiabetes2", package = "mlbench") PimaIndiansDiabetes <- na.omit(PimaIndiansDiabetes2[,-c(4, 5)]) @ After omitting missing values (and the variables \verb:triceps: and \verb:insulin: which are missing for most women), the data set provides diabetes test results for $n = \Sexpr{NROW(PimaIndiansDiabetes)}$ women along with $\Sexpr{NCOL(PimaIndiansDiabetes)}$ covariates including in particular the plasma glucose concentration \code{glucose} as an important predictor for diabetes. Fitting a logistic regression model \verb:diabetes ~ glucose: seems to be straightforward, whereas the influence of the remaining variables should again be learned by recursive partitioning. This will yield a model tree with $k = 2$ regression coefficients in each terminal node, partitioned with respect to the remaining $\ell = 5$ remaining variables. The model is estimated by ML employing the \code{glinearModel}, the instability is assessed using a Bonferroni-corrected significance level of $\alpha = 0.05$ and the nodes are split with a required minimal segment size of $20$ observations. Hence, all control parameters correspond to the default values in \verb:mob_control(): and do not have to be set explicitely in the \code{mob()} call: <<>>= fmPID <- mob(diabetes ~ glucose | pregnant + pressure + mass + pedigree + age, data = PimaIndiansDiabetes, model = glinearModel, family = binomial()) @ To visualize this, we simply call again: <>= plot(fmPID) @ \setkeys{Gin}{width=\textwidth} \begin{figure}[bth] \begin{center} <>= plot(fmPID) @ \caption{\label{fig:PimaIndiansDiabetes} Logistic-regression-based tree for the Pima Indians diabetes data. The plots in the leaves give spinograms for \code{diabetes} versus \code{glucose}.} \end{center} \end{figure} which produces again a plot of the dependent variable $y$ against the only regressors $x$ in the terminal nodes. As $y$ is \code{diabetes}, a binary variable, and $x$ is \code{glucose}, a numeric variable, a spinogram is chosen for visualization. The breaks in the spinogram are the five-point summary of \code{glucose} on the full data set. The fitted lines are the mean predicted probabilities in each group. The model tree distinguishes three different groups: \begin{itemize} \item[\#2] Women with low body mass index that have on average a low risk of diabetes, however this increases clearly with glucose level. \item[\#4] Women with average and high body mass index, younger than 30 years, that have a higher avarage risk that also increases with glucose level. \item[\#5] Women with average and high body mass index, older than 30 years, that have a high avarage risk that increases only slowly with glucose level. \end{itemize} The same interpretation can also be drawn from the coefficient estimates and the corresponding odds ratios (with respect to glucose): <<>>= coef(fmPID) exp(coef(fmPID)[,2]) @ <>= risk <- round(100 * (exp(coef(fmPID)[,2])-1), digits = 1) @ i.e., the odds increase by \Sexpr{risk[1]}\%, \Sexpr{risk[2]}\% and \Sexpr{risk[3]}\% with respect to glucose in the three groups. \section{Conclusion} \label{sec:conclusion} The function \code{mob()} in the \pkg{party} package provides a flexible and object-oriented implementation of the general algorithm for model-based recursive partitioning. Models of class ``\code{StatModel}''---that employ a formula interface and are equipped with methods for the generic functions \code{reweight()}, \code{weights()}, \code{estfun()} plus some function for extracting the value of the objective function---can be easily partitioned. The resulting ``\code{mob}'' tree can be flexibly summarized, both numerically and graphically, and used for predictions on new data. \bibliography{partyrefs} \end{document}