1. Introduction
In a seminal 1997 work, Rich Caruana introduced the multi-task learning (MTL) paradigm in machine learning [12]. Multi-task learning is based on the intuitive idea that, like humans, machines may jointly learn distinct tasks that are nonetheless related to one another. In this way, the knowledge acquired by learning one task can be exploited to improve performance on the other tasks, and vice versa. Sharing information between related tasks is particularly useful in applications where only a small number of samples is available for each individual task, but it is also effective in more general situations.
In fact, multi-task learning strategies have been successfully employed in several settings: with supervised, unsupervised or semi-supervised tasks, with reinforcement learning problems, with graphical models. Also, the range of applications is wide: computer vision [29,32,38], bioinformatics [25,33,37], natural language processing [22,36], web applications [2,5], ubiquitous computing [26,41]. For the supervised learning setting alone, several different multi-task approaches have been proposed in the literature. Namely, we can list Feature-based approaches [4,12], Low-rank approaches [1,3], Task Clustering approaches [6,16,21,24,42,43], Task Relation Learning approaches [35,40] and Decomposition approaches [13,23]. The latter four classes of methodologies are collectively referred to as parameter-based approaches. We refer the reader to [39] for a thorough review of multi-task learning models.
In this work, we are interested in parameter-based multi-task approaches to regression problems in a homogeneous setting, i.e., where the input space is the same for all tasks [39]. Many strategies have been proposed in the literature to tackle regression problems in a multi-task environment by linear models, employing the common squared error loss function for training.
The main contribution of this work consists in showing that, in some of the aforementioned cases, the underlying optimization problem can in fact be equivalently reformulated as a Mixed-Integer Quadratic Programming (MIQP) problem. MIQP solvers, like Gurobi, CPLEX, CBC or GLPK, are nowadays able to efficiently handle problems with a large number of integer variables, finding the certified global optimum. This is in contrast with the local optimization procedures typically employed to tackle the original continuous formulations, which only attain local minima. Furthermore, we argue, and also numerically show, that solving the training problems to global optimality provides benefits in terms of generalization capabilities and predictive performance of the trained models.
The manuscript is organized as follows. In Section 2, we briefly review some basic multi-task learning models from the literature. Then, we show in Section 3 how such approaches can be reformulated by employing mixed-integer programming techniques. In Section 4, we present computational experiments aiming to assess the practical advantages of solving our reformulations, compared to using classical algorithmic schemes. Finally, we draw some conclusions in Section 5. In Appendix A, we list some possible additional elements that can be taken into account within our models, whereas in Appendix B we show the results of a computational study aimed at evaluating the scalability of the proposed approaches.
2. Preliminaries
In this section, after introducing the notation employed in this work, we briefly review some of the most basic approaches to multi-task learning in regression problems.
2.1. Notation
We are interested in multi-task linear regression problems. We are given a set of $m$ tasks $T_1,\dots,T_m$, each one associated with a dataset $D_i=(X_i,y_i)$, $i=1,\dots,m$. We consider the homogeneous setting, therefore $X_i=(x_{i1},\dots,x_{iN_i})$ with $x_{ij}\in\mathbb{R}^n$ for all $i=1,\dots,m$ and all $j=1,\dots,N_i$. We also have $y_{ij}\in\mathbb{R}$, as we are considering regression tasks. For each task $T_i$, we want to construct a linear regression model $w_i\in\mathbb{R}^n$. The loss function $L_i(\cdot)$ associated to model $w_i$ on the dataset $D_i$ is the mean squared error:
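$$L_i(w_i) \;=\; \frac{1}{N_i}\sum_{j=1}^{N_i}\big(w_i^T x_{ij} - y_{ij}\big)^2.$$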
2.2. Basic continuous MTL models
Many works in parameter-based homogeneous MTL have focused on the idea that similarity among tasks shall be transferred to the learned models by enforcing that corresponding feature weights across tasks are close to each other. With linear regression models, one of the simplest ways to enforce this requirement is by introducing an additional regularization term into the ridge regression setting, as first shown by [16,17]:
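Up to the exact weighting of the regularization terms, the problem can be sketched as

$$\min_{w_1,\dots,w_m,\,\bar{w}}\;\sum_{i=1}^m \Big[L_i(w_i)+\lambda\|w_i\|^2\Big] \;+\; \nu\sum_{i=1}^m\|w_i-\bar{w}\|^2,$$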
where $\bar{w}$ acts as a connection term. Problem (2.1) is quadratic, convex and unconstrained, hence it is easily solvable to global optimality. However, the model has evident limitations. First, not all tasks may be related to each other, and hence enforcing proximity may in fact deteriorate predictive performance. Moreover, the assumption that the weights of related models are similar is often too strong for most real-world data.
For this reason, many variants and alternatives to model (2.1) have been proposed. A first extension is the important class of Clustered Multi-Task Learning (CMTL) models. Evgeniou et al. [16] have shown that, by a simple extension of (2.1), it is possible to retrieve the task-clustering setting. Task-clustering models have been reformulated in many different fashions [6,21,24,42,43]. In the hard-clustering setting, in which each task is associated with one and only one cluster, the following basic formulation can be considered:
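Up to the exact placement of the regularization terms, such a formulation can be sketched as

$$\min_{\{w_i\},\{z_k\},\{\delta_{ik}\}}\;\sum_{i=1}^m \Big[L_i(w_i)+\lambda\|w_i\|^2\Big] \;+\; \nu\sum_{i=1}^m\sum_{k=1}^K \delta_{ik}\,\|w_i-z_k\|^2 \qquad \text{s.t.}\;\; \sum_{k=1}^K \delta_{ik}=1,\;\; \delta_{ik}\in\{0,1\},$$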
where $K$ is the number of clusters, $z_k\in\mathbb{R}^n$ for all $k$, and $\delta_{ik}$, for $i=1,\dots,m$ and $k=1,\dots,K$, is a binary indicator variable which is set to 1 if task $i$ belongs to cluster $k$ and to 0 otherwise. In the soft-clustering setting, the problem can be formulated similarly, with the variables $\delta_{ik}$ representing probability values: $\delta_{ik}\in[0,1]$, $\sum_{k=1}^K \delta_{ik}=1$.
Problem (2.2) and similar formulations such as those in [24,42,43] are typically solved by Alternating Minimization (AM), where three steps are iteratively repeated:
1) minimize the objective function with respect to the model weights, keeping the cluster compositions and representatives fixed:
2) assign each task model to the cluster with the nearest representative (for the hard-clustering setting):
3) set the representative of each cluster as the mean of the models belonging to that cluster:
This approach can be shown to converge, but global optimality of the retrieved solution cannot be guaranteed; in fact, global optimality is rarely attained.
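For reference, a minimal numpy sketch of this scheme for the hard-clustering case follows; the closed-form update in step 1 assumes the objective weighting sketched above (ridge term $\lambda$ plus cluster-compactness term $\nu$), and all function and variable names are illustrative.

```python
import numpy as np

def cmtl_alternating_minimization(X, y, K, lam=0.01, nu=1.0, iters=100, seed=0):
    """Sketch of the AM loop for hard-clustering CMTL (steps 1-3 above)."""
    rng = np.random.default_rng(seed)
    m, n = len(X), X[0].shape[1]
    assign = rng.integers(K, size=m)      # random initial cluster assignment
    Z = np.zeros((K, n))                  # cluster representatives
    W = np.zeros((m, n))                  # task models
    for _ in range(iters):
        # 1) update each task model, with its cluster representative fixed;
        #    closed form of min (1/N_i)||X_i w - y_i||^2 + lam||w||^2 + nu||w - z||^2
        for i in range(m):
            Ni = X[i].shape[0]
            A = X[i].T @ X[i] / Ni + (lam + nu) * np.eye(n)
            b = X[i].T @ y[i] / Ni + nu * Z[assign[i]]
            W[i] = np.linalg.solve(A, b)
        # 2) reassign each task to the cluster with the nearest representative
        dists = ((W[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # 3) recompute each representative as the mean of its cluster's models
        for k in range(K):
            members = W[assign == k]
            if len(members):
                Z[k] = members.mean(axis=0)
    return W, Z, assign
```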
As an alternative to regularization-based approaches, with weaker assumptions on data and task relatedness, a polarity-constrained multi-task learning (or weakly constrained MTL, wcMTL) model has recently been proposed [30,34]; the idea is that corresponding weights in related tasks are not necessarily close in magnitude, but reasonably share the same polarity, i.e., if a feature is positively relevant to the output of one task, then its weight will also be positive for the other tasks. This is modeled by the following optimization problem:
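A sketch of such a problem (the exact regularization terms in (2.3) may differ) is

$$\min_{w_1,\dots,w_m}\;\sum_{i=1}^m \Big[L_i(w_i)+\lambda\|w_i\|^2\Big] \qquad \text{s.t.}\;\; w_{ij}\,w_{kj}\ge 0 \quad \forall\, i,k\in\{1,\dots,m\},\ \forall\, j\in\{1,\dots,n\},$$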
which, again, can be solved by alternating minimization approaches such as Block Coordinate Descent (BCD) [7] or the Alternating Direction Method of Multipliers (ADMM) [10] up to stationarity. Specifically, the main steps in the ADMM loop are carried out as follows for the considered problem [34]:
1) Sequentially update models wi, i=1,…,m:
2) Update auxiliary variables:
3) Compute primal and dual residuals:
4) Update dual variables:
3. Mixed-integer reformulations
In this section, we show how problems (2.2) and (2.3) can be reformulated as MIQP problems. In particular, Task-clustering multi-task learning (2.2) can be formulated, in the hard-clustering setting, as:
where $s_{ik}\in\mathbb{R}^n$, $M$ is a sufficiently large constant used in big-M type constraints, and $e\in\mathbb{R}^n$ is the vector of all ones.
Constraints (3.1c)–(3.1d) are used to model the implication $\delta_{ik}=1 \Rightarrow s_{ik}=w_i-z_k$: if $\delta_{ik}=1$, i.e., the $i$-th task belongs to the $k$-th cluster, then $s_{ik}$ equals the difference between the model and the cluster representative, and this quantity is quadratically penalized in the objective function; otherwise, $s_{ik}$ is free and, since its squared norm is minimized, it will be set to zero, i.e., the distance between the model and the representative is not penalized. Therefore, exactly as in model (2.2), the squared distance between each model and its cluster representative is penalized in the optimization process.
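For concreteness, a minimal gurobipy sketch of this reformulation follows; the data layout (lists of numpy arrays), the function name, and the exact placement of the regularization weights $\lambda$ and $\nu$ are illustrative assumptions, not the original formulation verbatim.

```python
import gurobipy as gp
from gurobipy import GRB

def build_cmtl_miqp(X, y, K, lam=0.01, nu=1.0, M=1e4):
    """Sketch of the MIQP reformulation (3.1) for hard-clustering CMTL.

    X, y: per-task numpy arrays, X[i] of shape (N_i, n), y[i] of shape (N_i,).
    """
    m_tasks, n = len(X), X[0].shape[1]
    model = gp.Model("cmtl_miqp")
    w = model.addVars(m_tasks, n, lb=-GRB.INFINITY, name="w")     # task models
    z = model.addVars(K, n, lb=-GRB.INFINITY, name="z")           # cluster representatives
    s = model.addVars(m_tasks, K, n, lb=-GRB.INFINITY, name="s")  # penalized differences
    delta = model.addVars(m_tasks, K, vtype=GRB.BINARY, name="delta")
    # each task belongs to exactly one cluster
    model.addConstrs(gp.quicksum(delta[i, k] for k in range(K)) == 1
                     for i in range(m_tasks))
    # big-M constraints: delta[i, k] = 1  =>  s[i, k, :] = w[i, :] - z[k, :]
    for i in range(m_tasks):
        for k in range(K):
            for j in range(n):
                diff = s[i, k, j] - (w[i, j] - z[k, j])
                model.addConstr(diff <= M * (1 - delta[i, k]))
                model.addConstr(diff >= -M * (1 - delta[i, k]))
    # objective: task MSE losses + ridge terms + cluster-compactness terms
    obj = gp.QuadExpr()
    for i in range(m_tasks):
        Ni = X[i].shape[0]
        for p in range(Ni):
            res = gp.quicksum(float(X[i][p, j]) * w[i, j] for j in range(n)) - float(y[i][p])
            obj += (1.0 / Ni) * res * res
        obj += lam * gp.quicksum(w[i, j] * w[i, j] for j in range(n))
        obj += nu * gp.quicksum(s[i, k, j] * s[i, k, j]
                                for k in range(K) for j in range(n))
    model.setObjective(obj, GRB.MINIMIZE)
    return model, w, delta
```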
On the other hand, problem (2.3) can be reformulated as:
This time, the big-M constraint models the following implication: if $y_j=0$, then the $j$-th weight is nonpositive in all models; if $y_j=1$, it is nonnegative for all tasks. Basically, $y_j$ denotes the polarity sign of the $j$-th feature, which is shared among all tasks.
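For illustration, these constraints can be added to the gurobipy model of the previous sketch as follows; the variable dictionary w is the one returned above, and the function name is illustrative.

```python
from gurobipy import GRB

def add_polarity_constraints(model, w, m_tasks, n, M=1e4):
    """Add the big-M sign-sharing constraints of (3.2) to an existing model.

    w is a dict of Gurobi variables indexed as w[i, j] (task i, feature j),
    e.g., the one created in the CMTL sketch above.
    """
    pol = model.addVars(n, vtype=GRB.BINARY, name="pol")  # y_j in the text
    for j in range(n):
        for i in range(m_tasks):
            # pol[j] = 0  =>  w[i, j] <= 0 ;   pol[j] = 1  =>  w[i, j] >= 0
            model.addConstr(w[i, j] <= M * pol[j])
            model.addConstr(w[i, j] >= -M * (1 - pol[j]))
    return pol
```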
Since the number of introduced binary variables is limited ($K\times m$ for problem (3.1), with both $K$ and $m$ usually small, and $n$ for problem (3.2)), such formulations can be solved, in a reasonable amount of time, to certified global optimality by employing off-the-shelf mixed-integer programming solvers such as Gurobi [20]. In the following section, we show that this is advantageous, in terms of predictive performance, with respect to using local optimization techniques such as Alternating Minimization or the Alternating Direction Method of Multipliers.
A remark is in order at this point: exact optimization approaches for mixed-integer problems are well known to be computationally hard. Local algorithms are certainly cheaper and possess better scalability properties. However, the intent when multi-task approaches are employed in linear regression tasks is to obtain the best possible improvement for models that perform poorly because of the lack of training data. We are hence in a context where it is worth employing greater computational resources in order to obtain even slight performance boosts. Nonetheless, we present in Appendix B a computational study suggesting the applicability limits of the proposed approach.
Note that models (3.1) and (3.2) are quite flexible, and it is possible to introduce additional modeling elements aimed at further improving the predictive performance. For example, the two considered models can be combined in different ways; moreover, mixed-integer formulations make it easy to introduce sparsity and feature selection. We formalize these aspects in more detail in Appendix A. However, in the computational analysis we focus on the basic models, leaving a thorough experimentation of these variants to future work.
4. Computational experiments
In this section we evaluate the benefits, in terms of generalization performance, of solving the basic CMTL (2.2) and wcMTL (2.3) models to global optimality using reformulations (3.1) and (3.2).
To this aim, we implemented models (3.1) and (3.2) and solved them with Gurobi [20]. As also suggested in the Gurobi documentation, we found that directly implementing big-M constraints, with the reasonable value of $M=10000$, is computationally more convenient than employing the "indicator constraint" construct provided by the library. We then implemented in Python3 the AM and ADMM procedures to solve problems (2.2) and (2.3), respectively. We employed the numpy library for all basic operations and the L-BFGS-B solver [11] to solve the bound-constrained subproblems in ADMM. As for the parameter setting, we set the number of Alternating Minimization iterations for problem (2.2) to 100, while for ADMM we set $\tau=1000$ and the tolerance for the stopping criterion based on residual convergence to $\|s^{t+1}\|\le 0.01$. As for the model hyperparameters, we detail our choices case by case in the following. As starting points of the local optimization procedures, we initialized each model with the optimal solution of the least-squares regression obtained by considering each task independently. Single-task least-squares regression can easily be obtained by solving the normal equations.
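As a minimal sketch (assuming the task datasets are stored as lists of numpy arrays, and with an illustrative function name), this initialization can be computed as:

```python
import numpy as np

def single_task_init(X_list, y_list):
    """Independent least-squares solution for each task, used as starting point."""
    W0 = []
    for Xi, yi in zip(X_list, y_list):
        # Solves the normal equations X_i^T X_i w = X_i^T y_i; lstsq also covers
        # the (frequent) case N_i < n, where X_i^T X_i is singular.
        wi, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
        W0.append(wi)
    return np.array(W0)
```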
Concerning the starting cluster assignment in CMTL, it is randomly drawn from a uniform distribution. The approach is thus nondeterministic; therefore, we carried out 10 independent runs with different random cluster initializations every time we employed the Alternating Minimization method.
In the experiments we employed both synthetic and real-world benchmarks, which we describe in the following section.
4.1. Datasets
4.1.1. Synthetic datasets
We generated 10 datasets as follows. Each dataset contains 16 tasks, all concerning regression problems with data in $\mathbb{R}^{12}$. The size of the training set for each task is equal to $N$, where $N=3,\dots,12$ varies for every dataset, while the size of the test set is 5 for all tasks and all datasets. The samples are generated from the normal distribution $\mathcal{N}_{12}(0,1)$, while the output for a sample $x$ of the $i$-th task is given by $y=w_i^T x$, where $w_i$ is generated as follows:
where $A=\{1,2,3,4,5\}$, $B=\{6,7,8,9,10,11\}$, $C=\{12\}$, $\bar{w}_A^k\sim\mathcal{N}_5(0,1)$ for $k=1,\dots,3$, with the $i$-th task belonging to one of the $K=3$ clusters, $\bar{w}_B\sim\mathcal{N}_6(0,1)$, $\odot$ denotes the element-wise product and $U$ the uniform distribution.
4.1.2. Real-world datasets
We considered a total of 4 real-world datasets for multi-task linear regression problems:
● school* [18]: the dataset concerns the problem of estimating the examination scores of 15,362 students from 139 British secondary schools in the period 1985–87. Each school is treated as a task; the inputs consist of four school-specific and three student-specific attributes.
*http://www.bristol.ac.uk/cmm/learning/support/datasets/
● parkinson† [31]: the dataset contains 5,875 data points for 42 patients suffering from Parkinson's disease, each patient being treated as a separate task. Given 19 biomedical features, the aim is to predict the disease progression status at different times. The target can be measured by two different scores: motor UPDRS and total UPDRS. We hence obtain two different datasets: parkinson_motor and parkinson_total.
†https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring
● insurance‡: the dataset concerns the prediction of the individual medical costs billed by health insurance. There are 1,338 data samples; we used five attributes as features and the sixth one, i.e., the residential area in the US, to identify 4 different tasks.
‡https://www.kaggle.com/mirichoi0218/insurance
4.2. Results
We performed three groups of experiments. In the first one, we considered the synthetic datasets described in Section 4.1.1. We compare the test mean squared error (MSE) attained by the CMTL and wcMTL models trained by solving the corresponding optimization problems via both mixed-integer and continuous (local) optimization procedures. We also report the results of training each task independently (single-task learning, ST).
For this experiment, we set the number of clusters in the task-clustering model to 3. We set $\nu=1$ and $\lambda=0.01$ for all models. We recall that the results of Alternating Minimization for CMTL are the mean of 10 independent runs with different random cluster initializations. The results of the experiment are reported in Table 1. We can observe that, with the only exception of the CMTL model on the problems of size $N=3$ and $N=4$, solving the optimization problem to global optimality has clear benefits in terms of the predictive performance of the models. We can also observe that the CMTL model appears to be superior to the wcMTL on this class of problems. Also, we can note that both models are indeed an improvement compared to the single-task models.
Next, we turn to the real-world datasets. We begin by evaluating the performance of the MIQP approach, compared to the local minimization approaches, on the school dataset for different hyperparameter settings. For this experiment, we kept the train-test split (80/20) of all tasks, obtained at random, fixed. We show the results of the experiment in Tables 2 and 3. Again, we can see that finding a better solution to the training optimization problems consistently translates into benefits in terms of generalization performance.
Finally, we carried out a wider experiment on all 4 real-world datasets. This time we repeated the training and testing process on 10 different train-test splits for each dataset. For each run, we selected the hyperparameters of each model by a 5-fold cross-validation step on the training set. The results of the experiment are reported in Tables 4 and 5.
From Table 4 we can observe that the MIQP approach always outperforms the corresponding local minimization one, with the only exception of the polarity-constrained model on the insurance dataset. On the other hand, the wcMTL is competitive with the CMTL on the two parkinson datasets only if the mixed-integer approach is employed. We also note that Task-clustering solved as a MIQP is the overall best approach among those considered.
From Table 5, we can further see that solving the weakly constrained model as an MIQP is consistently advantageous as the train-test splits vary, except for the case of the insurance dataset. As for task clustering, because of the nondeterministic nature of the Alternating Minimization procedure, we consider both the mean and the best MSE obtained among 10 different initializations for each split of each dataset. If the average is taken into account, the mixed-integer approach proves to be clearly preferable to the AM approach. Even if we consider the best run of Alternating Minimization for each split, the MIQP method remains slightly superior.
In the end, we can conclude that basic models for multi-task linear regression should definitely be reformulated as MIQP problems, so that the global optimum of the training problem can be found with benefits in the prediction phase.
5. Conclusions
In this work, we showed that basic multi-task learning models for linear regression can be equivalently reformulated by means of mixed-integer quadratic programming techniques. This is useful, as MIQP problems can be solved to global optimality by off-the-shelf solvers like Gurobi, in contrast with the local optimization strategies usually employed. By a set of computational experiments, we also showed that this strategy indeed leads to a general improvement in the performance of the models at predicting out-of-sample values. In conclusion, the proposed approaches should allow practitioners to obtain stronger performance from classical MTL models with a very limited implementation effort. We shall highlight, however, that the computational cost of the proposed strategy is far from negligible, especially as the number of employed integer variables grows (see Appendix B).
Future research should focus on evaluating the performance of the extensions and combinations of the proposed models, such as those discussed in Appendix A. Moreover, the nontrivial extension of the proposed approach to classification problems might be considered. This could be achieved, for example, by taking inspiration from works such as [8,14,28], where the problem of best subset selection in logistic regression is tackled by mixed-integer programming formulations.
Acknowledgments
We would like to thank Prof. Marco Sciandrone for giving us the opportunity to work on the topic and for giving us numerous useful suggestions.
Conflict of interest
The authors declare no conflict of interest.
A. Extensions
In this Appendix, we show possible extensions of the basic models (2.2) and (2.3) that mixed-integer modeling makes it possible to handle.
Firstly, we highlight that the CMTL and the wcMTL models might be combined in several ways:
● Trivially, corresponding weights could be simultaneously forced to share the sign and to be close to each other if the tasks belong to the same cluster. We would have model (3.1) with the addition of the constraints of (3.2).
● Some weights could be forced to be close while others to only share the polarity. This requirement can be modeled by the following formulation:
● Sign constraints could be based on the clusters structure. The constraints would have the following form:
A second modeling element that can be handled by mixed-integer approaches concerns sparsity and feature selection. Indeed, the best feature subset selection task in linear regression problems has been tackled by MIQP formulations in many works in recent years [9,15,19,27].
Now, in the multi-task setting, we can either:
● force a feature selection that is common to all tasks, by adding binary variables ν to the model together with the following constraints (a sketch is given after this list):
● select a different set of relevant features for each cluster:
● select a different set of relevant features for each individual task:
Note that the latter approach introduces n×m binary variables, which may excessively increase the computational cost of the approach.
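As anticipated, the first of these options could be sketched with binary variables $\nu_j$ and a budget $B$ on the number of selected features (the budget constraint is one possible way to enforce sparsity; a penalty term is an equally viable alternative):

$$-M\,\nu_j \;\le\; w_{ij} \;\le\; M\,\nu_j \quad \forall\, i,j, \qquad \sum_{j=1}^n \nu_j \le B, \qquad \nu_j\in\{0,1\}.$$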
Employing analogous mechanisms, it is also possible to select the variables on which to impose the polarity constraints and those to consider when forcing cluster compactness.
B. Scalability of the proposed approach
In this Appendix we focus on scalability issues regarding our proposed approaches. The hardness of mixed-integer mathematical programming problems is well known: the computational cost of exact solvers for mixed-integer problems grows significantly with the number of integer variables. This still holds true with the powerful modern software developed in recent years.
We are therefore interested in providing readers with an insight into the computational resources demanded by the mixed-integer approaches and the extent to which they are practically sustainable.
To this aim, we generated two new problem benchmarks. The first one is designed for testing the CMTL model: we generated MTL problems with $n=8$ features, 4 examples per task and a variable number of tasks $m=2,\dots,30$. The number of clusters $K$ was fixed to 3, so that the problems have a minimum of 6 and a maximum of 90 binary variables. We also set $M=1000$, $\lambda=0.01$, $\nu=1$.
As for the wcMTL case, we generated problems with $m=15$ tasks, $n=3t$ features for $t=3,\dots,30$ and $N=\lfloor n/2\rfloor$ examples per task; the number of binary variables is determined by $n$, so we again have problems with up to 90 integer variables. Here, we set $M=500$ and $\lambda=0.01$.
For both benchmarks, we extracted the samples $x_{ij}$ from the normal distribution $\mathcal{N}_n(0,1)$, whereas the labels are generated by computing $y_{ij}=(w_{\text{common}}+0.1\,w_j)^T x_{ij}+\varepsilon_{ij}$ with $\varepsilon_{ij}\sim\mathcal{N}(0,0.2)$, where $w_{\text{common}}, w_j\sim\mathcal{N}_n(0,1)$.
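A minimal sketch of this generator follows; reading the vector $w_j$ as one task-specific vector per task and the value 0.2 as the noise standard deviation are assumptions, and the function name is illustrative.

```python
import numpy as np

def make_benchmark(m_tasks, n, N, noise_std=0.2, seed=0):
    """Sketch of the scalability benchmark generator described above."""
    rng = np.random.default_rng(seed)
    w_common = rng.standard_normal(n)          # component shared by all tasks
    X, y = [], []
    for _ in range(m_tasks):
        w_task = rng.standard_normal(n)        # task-specific component
        Xi = rng.standard_normal((N, n))       # samples from N_n(0, 1)
        yi = Xi @ (w_common + 0.1 * w_task) + noise_std * rng.standard_normal(N)
        X.append(Xi)
        y.append(yi)
    return X, y
```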
The results of the experiments, performed running Gurobi 9.1.0 on a computer with Ubuntu Server 20.04 LTS OS, Intel Xeon E5-2430 v2 @ 2.50GHz CPU, 12 cores and 16GB RAM, are reported in Figures 1 and 2.
We can observe that, for problems of the considered size, globally optimal solutions can be obtained in a reasonable amount of time ($\sim 10^2$–$10^3$ s for the hardest problems); nonetheless, the order of magnitude of the cost, measured by CPU time, not surprisingly grows quite fast with the number of integer variables. This suggests that problems with hundreds of integer variables may become computationally unsustainable for the proposed approaches; we also note that we did not test problems of larger size, as certifying global optimality of the solutions becomes impractical.
In conclusion, the proposed methods appear to be reasonably employable in the typical MTL use case where few data are available and we want to obtain the most accurate model possible out of them. As the size of the problem grows, the computational cost likely becomes significant, and the approach may become unsustainable in very large-scale scenarios. However, we note that in our experiments we employed a very efficient, yet general-purpose, off-the-shelf solver; branching or bounding strategies specific to our problem may further improve the performance of the training procedure.