


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CYBERNETICS 1

Multitask TSK Fuzzy System Modeling by Mining Intertask Common Hidden Structure

Yizhang Jiang, Member, IEEE, Fu-Lai Chung, Member, IEEE, Hisao Ishibuchi, Fellow, IEEE, Zhaohong Deng, Senior Member, IEEE, and Shitong Wang

Abstract—Classical fuzzy system modeling methods implicitly assume that data are generated from a single task, which is not in accordance with many practical scenarios where data can be acquired from the perspective of multiple tasks. Although one can build an individual fuzzy system model for each task, this individual modeling approach yields poor generalization ability because it ignores the intertask hidden correlation. In order to circumvent this shortcoming, we consider a general framework for preserving the independent information of different tasks while mining the hidden correlation information among all tasks in multitask fuzzy modeling. In this framework, a low-dimensional subspace (structure) is assumed to be shared among all tasks and hence serves as the hidden correlation information among all tasks. Under this framework, a multitask Takagi–Sugeno–Kang (TSK) fuzzy system model called MTCS-TSK-FS (TSK-FS for multiple tasks with common hidden structure), based on the classical L2-norm TSK fuzzy system, is proposed in this paper. The proposed model can not only take advantage of the independent sample information in the original space of each task, but also effectively use the intertask common hidden structure among multiple tasks to enhance the generalization performance of the built fuzzy systems. Experiments on synthetic and real-world datasets demonstrate the applicability and distinctive performance of the proposed multitask fuzzy system model in multitask regression learning scenarios.

Index Terms—Common hidden structure, fuzzy modeling, multitask learning, Takagi–Sugeno–Kang (TSK) fuzzy systems.

Manuscript received September 19, 2013; revised April 11, 2014; accepted June 9, 2014. This work was supported in part by the Hong Kong Polytechnic University under Grant G-UA68, in part by the National Natural Science Foundation of China under Grant 61170122 and Grant 61272210, in part by the Natural Science Foundation of Jiangsu Province under Grant BK201221834, in part by the JiangSu 333 Expert Engineering under Grant BRA2011142, in part by the Fundamental Research Funds for the Central Universities under Grant JUDCF13030, in part by the Ministry of Education Program for New Century Excellent Talents under Grant NCET-120882, and in part by the 2013 Postgraduate Student's Creative Research Fund of Jiangsu Province under Grant CXZZ13_0760. This paper was recommended by Associate Editor H. M. Schwartz.

Y. Jiang, Z. Deng, and S. Wang are with the School of Digital Media, Jiangnan University, Wuxi 214122, China (e-mail: [email protected]; [email protected]; [email protected]).

F.-L. Chung is with the Department of Computing, Hong Kong Polytechnic University, Hong Kong (e-mail: [email protected]).

H. Ishibuchi is with the Department of Computer Science and Intelligent Systems, Osaka Prefecture University, Osaka 599-8531, Japan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2014.2330844

I. INTRODUCTION

A RECENT principle tells us that multitask learning, i.e., learning multiple related tasks simultaneously, achieves better performance than learning these tasks independently [1]–[5]. The principal goal of multitask learning is to improve the generalization performance of learners by leveraging the domain-specific information contained in the related tasks [1]. One way to reach this goal is to learn multiple related tasks simultaneously while using a common representation. In fact, the training signals for the extra tasks serve as an inductive bias that is helpful for learning multiple complex tasks together [1]. Empirical and theoretical studies on multitask learning have been actively performed in the following three areas: multitask classification learning [2]–[13], multitask clustering [14]–[21], and multitask regression learning [22]–[28]. It has been shown in those studies that, when there are relations between the multiple tasks to learn, it is beneficial to learn them simultaneously instead of learning each task independently. Although those studies have indicated the significance of multitask learning and demonstrated its effectiveness in different real-world applications, current multitask learning methods are still very limited and cannot keep up with real-world requirements, particularly in fuzzy modeling. Thus, this paper focuses on multitask fuzzy modeling.

With the rapid development of data collection technologies, regression datasets are collected in different formats. Usually, the datasets obtained can be divided into four types: 1) Single-Input Single-Output (SISO); 2) Single-Input Multi-Output (SIMO); 3) Multiple-Input Single-Output (MISO); and 4) Multi-Input Multi-Output (MIMO). Except for the SISO and MISO types, the other two types can be transformed into multitask regression learning scenarios. Taking a MIMO dataset as an example, we can always divide it into multiple MISO datasets. Each resulting MISO dataset is simpler to process than the original MIMO dataset. This corresponds to a typical multitask (regression) learning scenario. While each resulting MISO dataset can be modeled individually to preserve the independence of each task (MISO function), the intertask hidden correlation is virtually lost. In order to solve this problem, this paper proposes a novel modeling technique which effectively takes into consideration the hidden correlation information between MISO datasets. Thus, the proposed multitask regression learning method makes good use of both the task-independent information and the intertask hidden correlation information of a given MIMO dataset.
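As a concrete illustration of the decomposition described above, the following minimal sketch splits a MIMO regression dataset into one MISO task per output dimension. The helper name `mimo_to_miso_tasks` is ours, not the paper's:

```python
def mimo_to_miso_tasks(X, Y):
    """Split a MIMO dataset (inputs X: N x d, outputs Y: N x q)
    into q MISO regression tasks that share the same inputs."""
    q = len(Y[0])
    # Task t keeps all inputs but only the t-th output column.
    return [(X, [row[t] for row in Y]) for t in range(q)]

# Example: N = 3 samples, d = 2 inputs, q = 2 outputs -> 2 MISO tasks.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0.5, 2.0], [1.5, 3.0], [2.5, 4.0]]
tasks = mimo_to_miso_tasks(X, Y)
```

Each element of `tasks` is then a single MISO training set, which could be modeled independently or, as proposed here, jointly.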

2168-2267 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Because of their interpretability and learning capability, fuzzy systems have been widely used in many fields such as intelligent control, signal processing, and pattern recognition [29]–[35], [57]. Just like most state-of-the-art machine-learning methods, current fuzzy systems have been designed for a single task. For multitask learning, one can build an individual fuzzy system for the dataset of each task. Obviously, the intertask hidden correlation information is then ignored, and hence the generalization performance of the overall system cannot be upgraded by mining such hidden correlation information. As is well known, the hidden structural learning mechanism is attracting more and more attention in the field of machine learning, and its effectiveness has been verified [36]–[41]. In this paper, we attempt to mine the common low-dimensional structure among all tasks to enhance the generalization performance of the built fuzzy systems. By exploiting both the task-independent information from the original space and the intertask common low-dimensional hidden structure, a new multitask fuzzy system modeling method is proposed. It is rooted in the classical Takagi–Sugeno–Kang (TSK) type fuzzy systems and makes use of a novel objective function constructed to synthesize task independence and intertask hidden correlation. Our experimental results indicate that the proposed fuzzy system modeling method indeed enhances the generalization performance of the trained system and hence is very promising for multitask regression learning. The contributions of this paper can be highlighted as follows.

1) Fuzzy system modeling is studied for the first time from the viewpoint of multitask learning among all tasks, and accordingly a multitask fuzzy system learning framework is developed.

2) Within the multitask learning framework, a learning mechanism based on the common hidden structure is proposed for the TSK fuzzy system. In our approach, a new objective function for multitask learning is formulated to integrate the task-independent information and the task correlation information. The latter is captured by the common low-dimensional hidden structure among all tasks.

3) It is demonstrated by experimental results on synthetic and real-world multitask regression datasets that the proposed method outperforms standard single-task TSK-type fuzzy system modeling methods in multitask scenarios.

The rest of this paper is organized as follows. In Section II, the concept and principle of classical TSK fuzzy systems are briefly reviewed, and then a novel multitask TSK fuzzy system based on the intertask common hidden structure is proposed. In Section III, the learning algorithm of the classical TSK-FS is briefly introduced; a novel multitask TSK fuzzy system model called MTCS-TSK-FS (TSK-FS for multiple tasks with common hidden structure), based on the ε-insensitive criterion and L2-norm penalty terms, is then proposed. The proposed system is extensively evaluated with the experimental results reported in Section IV. Conclusions are given in Section V.

II. TSK FUZZY SYSTEMS FOR MULTIPLE TASKS WITH COMMON HIDDEN STRUCTURE

A. Problem Definition

Let us first describe our multitask learning problem as follows. Given a set of $K$ regression tasks $\{T_1, T_2, \ldots, T_K\}$, for each individual task $T_k$ $(1 \le k \le K)$ we have a set of $N_k$ training data samples $D_k = [X_k, Y_k]$ with an input dataset $X_k = \{\mathbf{x}_{i,k},\ i = 1, \ldots, N_k\}$, $\mathbf{x}_{i,k} \in R^d$, and an output dataset $Y_k = \{y_{i,k},\ i = 1, \ldots, N_k\}$.

According to the description of the above multitask scenario, our main purpose is to design a novel TSK fuzzy model for the multitask scenario which is able to achieve the following two goals: 1) preserving the independence information to take full advantage of the characteristics of each task and 2) mining the intertask hidden correlation information to enhance the generalization performance of the built fuzzy systems. To achieve these goals, we propose a novel multitask TSK fuzzy system based on the intertask common hidden structure in this section; its training algorithm will be introduced in the next section.

B. Classical Single-Task TSK Fuzzy System

Let us first review the classical single-task TSK fuzzy system [42]. For the classical TSK fuzzy system, the most commonly used fuzzy inference rules are as follows:

TSK Fuzzy Rule $R^m$:

$$\text{If } x_1 \text{ is } A_1^m \wedge x_2 \text{ is } A_2^m \wedge \cdots \wedge x_d \text{ is } A_d^m$$
$$\text{then } f^m(\mathbf{x}) = p_0^m + p_1^m x_1 + \cdots + p_d^m x_d, \quad m = 1, \ldots, M. \qquad (1)$$

In (1), $A_i^m$ is a fuzzy subset on the input variable $x_i$ for the $m$th rule, which implies a linguistic variable in the corresponding subspace; $M$ is the number of fuzzy rules, and $\wedge$ is a fuzzy conjunction operator. Each rule is premised on the input vector $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$, and maps a fuzzy subspace in the input space $A^m \subset R^d$ to a varying singleton denoted by $f^m(\mathbf{x})$. When multiplicative conjunction is employed as the conjunction operator, multiplicative implication as the implication operator, and additive disjunction as the disjunction operator, the output of the TSK fuzzy model can be formulated as

$$y^o = \sum_{m=1}^{M} \frac{\mu^m(\mathbf{x})}{\sum_{m'=1}^{M} \mu^{m'}(\mathbf{x})} \cdot f^m(\mathbf{x}) = \sum_{m=1}^{M} \tilde{\mu}^m(\mathbf{x}) \cdot f^m(\mathbf{x}) \qquad (2.a)$$

where $\mu^m(\mathbf{x})$ and $\tilde{\mu}^m(\mathbf{x})$ denote the compatibility grade and the normalized compatibility grade with the antecedent part (i.e., fuzzy subspace) $A^m$, respectively. These two functions can be calculated as

$$\mu^m(\mathbf{x}) = \prod_{i=1}^{d} \mu_{A_i^m}(x_i) \qquad (2.b)$$
$$\tilde{\mu}^m(\mathbf{x}) = \mu^m(\mathbf{x}) \Big/ \sum_{m'=1}^{M} \mu^{m'}(\mathbf{x}). \qquad (2.c)$$

A commonly used membership function is the Gaussian membership function, which can be expressed by

$$\mu_{A_i^m}(x_i) = \exp\left(-\frac{(x_i - c_i^m)^2}{2\delta_i^m}\right) \qquad (2.d)$$


where the parameters $c_i^m$ and $\delta_i^m$ can be estimated by clustering techniques or other partition methods. For example, with fuzzy c-means (FCM) clustering, $c_i^m$ and $\delta_i^m$ can be estimated as follows:

$$c_i^m = \sum_{j=1}^{N} u_{jm}\, x_{ji} \Big/ \sum_{j=1}^{N} u_{jm} \qquad (2.e)$$

$$\delta_i^m = h \cdot \sum_{j=1}^{N} u_{jm}\,(x_{ji} - c_i^m)^2 \Big/ \sum_{j=1}^{N} u_{jm} \qquad (2.f)$$

where $u_{jm}$ denotes the membership degree of the $j$th input datum $\mathbf{x}_j = (x_{j1}, \ldots, x_{jd})^T$ to the $m$th cluster obtained by FCM clustering [43], [44] or other partition methods. Here $h$ is a scalar parameter and can be adjusted manually.
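The antecedent estimation in (2.d)–(2.f) can be sketched as follows, assuming the cluster memberships $u_{jm}$ have already been produced by FCM or another partition method. Function and variable names here are ours, not the paper's:

```python
import math

def estimate_antecedents(X, U_mem, h=1.0):
    """Estimate Gaussian antecedent parameters c_i^m and delta_i^m
    from data X (N x d) and membership degrees U_mem (N x M), per (2.e)-(2.f)."""
    N, d, M = len(X), len(X[0]), len(U_mem[0])
    c = [[0.0] * d for _ in range(M)]
    delta = [[0.0] * d for _ in range(M)]
    for m in range(M):
        s = sum(U_mem[j][m] for j in range(N))  # common denominator
        for i in range(d):
            c[m][i] = sum(U_mem[j][m] * X[j][i] for j in range(N)) / s       # (2.e)
            delta[m][i] = h * sum(U_mem[j][m] * (X[j][i] - c[m][i]) ** 2
                                  for j in range(N)) / s                     # (2.f)
    return c, delta

def gaussian_mf(x_i, c_mi, delta_mi):
    """Gaussian membership function (2.d)."""
    return math.exp(-(x_i - c_mi) ** 2 / (2.0 * delta_mi))
```

With uniform memberships, $c_i^m$ reduces to the sample mean and $\delta_i^m$ (for $h = 1$) to the sample variance of the cluster, which matches the intuition behind (2.e) and (2.f).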

When the premise of the TSK fuzzy model is determined, let

$$\mathbf{x}_e = (1, \mathbf{x}^T)^T \qquad (3.a)$$
$$\mathbf{x}^m = \tilde{\mu}^m(\mathbf{x})\, \mathbf{x}_e \qquad (3.b)$$
$$\mathbf{x}_g = ((\mathbf{x}^1)^T, (\mathbf{x}^2)^T, \ldots, (\mathbf{x}^M)^T)^T \qquad (3.c)$$
$$\mathbf{p}^m = (p_0^m, p_1^m, \ldots, p_d^m)^T \qquad (3.d)$$
$$\mathbf{p}_g = ((\mathbf{p}^1)^T, (\mathbf{p}^2)^T, \ldots, (\mathbf{p}^M)^T)^T \qquad (3.e)$$

then (2.a) can be formulated as the following linear regression problem:

$$y^o = \mathbf{p}_g^T \mathbf{x}_g. \qquad (3.f)$$

Thus, the training problem of the above TSK model can be transformed into the learning of the parameters in the corresponding linear regression model [33], [34], [45].
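The whole inference chain (2.a)–(3.f) can be made concrete in a few lines: compute the rule compatibility grades, normalize them, stack the weighted augmented inputs into $\mathbf{x}_g$, and take an inner product with $\mathbf{p}_g$. This is an illustrative sketch with our own names, not the paper's implementation:

```python
import math

def tsk_predict(x, centers, deltas, p_g):
    """Evaluate a single-task TSK model with Gaussian antecedents.
    centers/deltas: M x d antecedent parameters; p_g: flat vector of length M*(d+1)."""
    M, d = len(centers), len(x)
    # Compatibility grade of each rule, per (2.b) with Gaussian MFs (2.d).
    mus = []
    for m in range(M):
        mu = 1.0
        for i in range(d):
            mu *= math.exp(-(x[i] - centers[m][i]) ** 2 / (2.0 * deltas[m][i]))
        mus.append(mu)
    total = sum(mus)
    # Build x_g by stacking the weighted augmented inputs, per (3.a)-(3.c).
    x_e = [1.0] + list(x)                 # (3.a)
    x_g = []
    for m in range(M):
        w = mus[m] / total                # normalized grade (2.c)
        x_g.extend(w * v for v in x_e)    # x^m = w * x_e  (3.b)
    # Linear-regression form (3.f): y^o = p_g^T x_g.
    return sum(p * v for p, v in zip(p_g, x_g))
```

With $M = 1$ the normalized grade is 1 and the model reduces to the single rule's linear consequent, which is a useful sanity check.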

1) Discussion (Classical TSK Fuzzy Systems for Multitask Scenarios): From the above, we know that the classical fuzzy system modeling strategies were designed for a single-task scenario, and hence they suffer from certain deficiencies in multitask learning. For example, a straightforward modeling strategy is to construct individual fuzzy systems for the different tasks from the samples in the corresponding tasks, as shown in Fig. 1. Although the resulting framework provides a viable solution to various multitask learning application scenarios, individual modeling in fact ignores the relationship between tasks, which is important in boosting the generalization ability of the different fuzzy systems for the different tasks.

C. Multitask TSK Fuzzy System With Intertask Common Hidden Structure

Fig. 1. Fuzzy systems for multiple tasks: Independent learning strategy.

In order to empower the classical single-task TSK fuzzy system modeling methods with multitask learning ability, a multitask TSK fuzzy system based on the intertask common hidden structure (MTCS-TSK-FS) is proposed in this paper. As mentioned at the beginning of this section, the proposed MTCS-TSK-FS has two goals, i.e., it should preserve the independence information of each task and mine the intertask hidden correlation information simultaneously. To achieve these two goals, we propose to divide the output $f_k^m(\mathbf{x}_k)$ for the $m$th rule of task $k$ into two parts, i.e., a common part and an individual part. For the common part, a common output $\Phi_S^m(\mathbf{x}_k)$ is defined. In order to mine the intertask hidden correlation information among all tasks, the corresponding parameters $\mathbf{U}$ of $\Phi_S^m(\cdot)$ and the common low-dimensional ($r$-dimensional) hidden structure $\mathbf{H}^m$ are defined. Meanwhile, for the individual part, we define an individual output $g_k^m(\mathbf{x}_k)$, and the corresponding parameters $\boldsymbol{\theta}_k^m$ are also defined to preserve the independence information of each task in the original space. Overall, the novel multitask fuzzy inference rules are defined as follows:

TSK Fuzzy Rule $R_k^m$:

$$\text{If } x_{1,k} \text{ is } A_{1,k}^m \wedge x_{2,k} \text{ is } A_{2,k}^m \wedge \cdots \wedge x_{d,k} \text{ is } A_{d,k}^m$$
$$\begin{aligned}
\text{then } f_k^m(\mathbf{x}_k) &= \Phi_S^m(\mathbf{x}_k) + g_k^m(\mathbf{x}_k) \\
&= \mathbf{U}^T \mathbf{H}^m \mathbf{x}_k + (\boldsymbol{\theta}_k^m)^T \mathbf{x}_k \\
&= \mathbf{U}^T \mathbf{h}_0^m + \mathbf{U}^T \mathbf{h}_1^m x_{1,k} + \cdots + \mathbf{U}^T \mathbf{h}_d^m x_{d,k} + \theta_{0,k}^m + \theta_{1,k}^m x_{1,k} + \cdots + \theta_{d,k}^m x_{d,k} \\
&= (\mathbf{U}^T \mathbf{h}_0^m + \theta_{0,k}^m) + (\mathbf{U}^T \mathbf{h}_1^m + \theta_{1,k}^m)\, x_{1,k} + \cdots + (\mathbf{U}^T \mathbf{h}_d^m + \theta_{d,k}^m)\, x_{d,k}
\end{aligned}$$
$$m = 1, \ldots, M, \quad k = 1, \ldots, K. \qquad (4)$$

In (4), $M$ is the number of fuzzy rules, $K$ is the number of tasks, and the meanings of the other symbols are the same as in classical TSK fuzzy systems. Each rule is premised on the input vector $\mathbf{x}_k = (x_{1,k}, x_{2,k}, \ldots, x_{d,k})^T \in R^{d\times 1}$ for each task $k$, and maps a fuzzy subspace in the input space $A_k^m \subset R^d$ to a varying singleton denoted by $f_k^m(\mathbf{x}_k)$ for each task $k$. Here $f_k^m(\mathbf{x}_k)$ consists of two parts: 1) the individual part (i.e., $g_k^m(\cdot)$) for each task $k$, which is characterized by the corresponding parameters $\boldsymbol{\theta}_k^m = (\theta_{0,k}^m, \theta_{1,k}^m, \theta_{2,k}^m, \ldots, \theta_{d,k}^m)^T \in R^{(d+1)\times 1}$ generated by using the independence information of each task in the original, i.e., $(d+1)$-dimensional, space and 2) the common part (i.e., $\Phi_S^m(\cdot)$) among all tasks, which is characterized by the corresponding parameters $\mathbf{U} = (u_1, \ldots, u_r)^T \in R^{r\times 1}$ of $\Phi_S^m(\cdot)$ and the common low-dimensional ($r$-dimensional) hidden structure $\mathbf{H}^m = (\mathbf{h}_0^m, \mathbf{h}_1^m, \ldots, \mathbf{h}_d^m) \in R^{r\times(d+1)}$, $\mathbf{h}_i = (h_1, \ldots, h_r)^T \in R^{r\times 1}$, $i = 0, 1, \ldots, d$, obtained by mining the intertask hidden correlation information among all tasks. As in the processing of the classical single-task TSK model, the output of the multitask TSK fuzzy model for task $k$ can be formulated as

$$y_k^0 = \sum_{m=1}^{M} \frac{\mu_k^m(\mathbf{x})}{\sum_{m'=1}^{M} \mu_k^{m'}(\mathbf{x})} \cdot f_k^m(\mathbf{x}) = \sum_{m=1}^{M} \tilde{\mu}_k^m(\mathbf{x}) \cdot f_k^m(\mathbf{x}) \qquad (5.a)$$


where $\mu_k^m(\mathbf{x})$ and $\tilde{\mu}_k^m(\mathbf{x})$ denote the fuzzy membership function and the normalized fuzzy membership associated with the fuzzy set $A_k^m$ for each task $k$, respectively. These two functions can be calculated by using

$$\mu_k^m(\mathbf{x}) = \prod_{i=1}^{d} \mu_{A_{i,k}^m}(x_{i,k}) \qquad (5.b)$$
$$\tilde{\mu}_k^m(\mathbf{x}) = \mu_k^m(\mathbf{x}) \Big/ \sum_{m'=1}^{M} \mu_k^{m'}(\mathbf{x}). \qquad (5.c)$$

Similar to the classical TSK fuzzy system, for each task $k$ we use the Gaussian membership function as the fuzzy membership function

$$\mu_{A_{i,k}^m}(x_{i,k}) = \exp\left(-\frac{(x_{i,k} - c_{i,k}^m)^2}{2\delta_{i,k}^m}\right) \qquad (5.d)$$

where FCM clustering is again used to estimate the values of $c_{i,k}^m$ and $\delta_{i,k}^m$; for each task $k$ they can be estimated as follows:

$$c_{i,k}^m = \sum_{j=1}^{N} u_{jm,k}\, x_{ji,k} \Big/ \sum_{j=1}^{N} u_{jm,k} \qquad (5.e)$$

$$\delta_{i,k}^m = h_k \cdot \sum_{j=1}^{N} u_{jm,k}\,(x_{ji,k} - c_{i,k}^m)^2 \Big/ \sum_{j=1}^{N} u_{jm,k} \qquad (5.f)$$

where $u_{jm,k}$ denotes the fuzzy membership of the $j$th input datum $\mathbf{x}_{j,k} = (x_{j1,k}, \ldots, x_{jd,k})^T$ of task $k$ to the $m$th cluster obtained by FCM clustering. Here $h_k$ is a scalar parameter and can be adjusted manually.

Similar to the classical TSK fuzzy system (see the above subsection), when the premise of the multitask TSK fuzzy model for each task $k$ is determined, let

$$\mathbf{x}_{e,k} = (1, \mathbf{x}_k^T)^T \qquad (6.a)$$
$$\mathbf{x}_k^m = \tilde{\mu}_k^m(\mathbf{x}_k)\, \mathbf{x}_{e,k} \qquad (6.b)$$
$$\mathbf{x}_{g,k} = ((\mathbf{x}_k^1)^T, (\mathbf{x}_k^2)^T, \ldots, (\mathbf{x}_k^M)^T)^T \qquad (6.c)$$
$$\mathbf{p}_k^m = \big((\mathbf{U}^T\mathbf{h}_0^m + \theta_{0,k}^m), (\mathbf{U}^T\mathbf{h}_1^m + \theta_{1,k}^m), \ldots, (\mathbf{U}^T\mathbf{h}_d^m + \theta_{d,k}^m)\big)^T \qquad (6.d)$$
$$\mathbf{p}_{g,k} = \big(((\mathbf{H}^1)^T\mathbf{U} + \boldsymbol{\theta}_{g,k}^1)^T, ((\mathbf{H}^2)^T\mathbf{U} + \boldsymbol{\theta}_{g,k}^2)^T, \ldots, ((\mathbf{H}^M)^T\mathbf{U} + \boldsymbol{\theta}_{g,k}^M)^T\big)^T \qquad (6.e)$$
$$\mathbf{p}_{g,k} = ((\mathbf{p}_k^1)^T, (\mathbf{p}_k^2)^T, \ldots, (\mathbf{p}_k^M)^T)^T = \mathbf{H}^T\mathbf{U} + \boldsymbol{\theta}_k \qquad (6.f)$$

where $\mathbf{H} = (\mathbf{H}^1, \ldots, \mathbf{H}^M) \in R^{r\times(M\cdot(d+1))}$ and $\boldsymbol{\theta}_k = ((\boldsymbol{\theta}_{g,k}^1)^T, \ldots, (\boldsymbol{\theta}_{g,k}^M)^T)^T \in R^{(M\cdot(d+1))\times 1}$; then (5.a) can be formulated as the following linear regression problem:

$$y_k^0 = \mathbf{p}_{g,k}^T \mathbf{x}_{g,k}. \qquad (6.g)$$

Thus, the training problem of the above multitask TSK model for each task $k$ can also be transformed into the learning of the parameters in the corresponding linear regression model.
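To make the decomposition $\mathbf{p}_{g,k} = \mathbf{H}^T\mathbf{U} + \boldsymbol{\theta}_k$ in (6.f) concrete, the following sketch assembles each task's consequent vector from the shared pair $(\mathbf{H}, \mathbf{U})$ and the task-specific $\boldsymbol{\theta}_k$. The function name and the toy dimensions are ours:

```python
def task_consequents(H, U, thetas):
    """Compose p_{g,k} = H^T U + theta_k per (6.f).
    H: r x D common hidden structure (D = M*(d+1)); U: length-r vector;
    thetas: list of K task-specific length-D vectors theta_k."""
    r, D = len(H), len(H[0])
    common = [sum(H[row][col] * U[row] for row in range(r)) for col in range(D)]  # H^T U
    return [[common[col] + theta[col] for col in range(D)] for theta in thetas]

# Two tasks sharing an r = 1 hidden structure over D = 2 consequent entries.
H = [[1.0, 2.0]]
U = [0.5]
thetas = [[0.0, 0.0], [0.1, -0.1]]
p_g = task_consequents(H, U, thetas)
```

Note how the first task's consequents coincide with the shared part $\mathbf{H}^T\mathbf{U}$ when its $\boldsymbol{\theta}_k$ is zero, which is exactly the regime the paper describes for highly similar tasks.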

The framework for the construction of MTCS-TSK-FS is described in Fig. 2, which shows its modeling strategy. It can be seen that each fuzzy system is trained in a multitask learning manner on the multitask dataset, which makes full use of the intertask hidden correlation information. In the following section, the learning algorithm of MTCS-TSK-FS based on the ε-insensitive criterion and L2-norm penalty terms will be elaborated (Algorithm 1).

Fig. 2. Fuzzy systems for multiple tasks: Multitask learning strategy.

III. TSK FUZZY MODEL LEARNING FOR MULTIPLE TASKS WITH COMMON HIDDEN STRUCTURE

In this section, the ε-insensitive criterion and L2-norm penalty-based TSK fuzzy system (L2-TSK-FS) learning algorithm is first reviewed briefly. Then the learning algorithm of MTCS-TSK-FS based on the ε-insensitive criterion and L2-norm penalty terms is introduced in detail (Algorithm 1).

A. Classical Single-Task TSK Fuzzy Model Learning

Given a training dataset $D_{tr} = \{\mathbf{x}_i, y_i \mid \mathbf{x}_i \in R^d,\ y_i \in R,\ i = 1, \ldots, N\}$, for fixed antecedents obtained via clustering of the input space (or by other partition techniques), the least squares (LS) solution to the consequent parameters minimizes the following LS criterion function [46]:

$$\min_{\mathbf{p}_g} E_{\mathbf{p}_g} = \sum_{i=1}^{N} (y_i^o - y_i)^2 = \sum_{i=1}^{N} (\mathbf{p}_g^T \mathbf{x}_{gi} - y_i)^2 = (\mathbf{y} - \mathbf{X}_g \mathbf{p}_g)^T (\mathbf{y} - \mathbf{X}_g \mathbf{p}_g) \qquad (7)$$

where $\mathbf{X}_g = [\mathbf{x}_{g1}, \ldots, \mathbf{x}_{gN}]^T \in R^{N\times(M\cdot(d+1))}$ and $\mathbf{y} = [y_1, \ldots, y_N]^T \in R^N$.
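A minimal sketch of solving (7) via the normal equations $(\mathbf{X}_g^T\mathbf{X}_g)\mathbf{p}_g = \mathbf{X}_g^T\mathbf{y}$ is given below, using plain Gaussian elimination for illustration; in practice a pseudo-inverse or a small ridge term would be preferred when $\mathbf{X}_g$ is ill-conditioned. All names are our own:

```python
def solve_ls(X_g, y):
    """Least-squares consequents for (7): solve (X_g^T X_g) p = X_g^T y
    by Gaussian elimination with partial pivoting (illustrative sketch)."""
    n, D = len(X_g), len(X_g[0])
    # Normal-equation matrix A = X_g^T X_g and right-hand side b = X_g^T y.
    A = [[sum(X_g[t][i] * X_g[t][j] for t in range(n)) for j in range(D)]
         for i in range(D)]
    b = [sum(X_g[t][i] * y[t] for t in range(n)) for i in range(D)]
    # Forward elimination.
    for col in range(D):
        piv = max(range(col, D), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, D):
            f = A[row][col] / A[col][col]
            for c in range(col, D):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    # Back substitution.
    p = [0.0] * D
    for i in range(D - 1, -1, -1):
        p[i] = (b[i] - sum(A[i][j] * p[j] for j in range(i + 1, D))) / A[i][i]
    return p
```

For data lying exactly on a line, e.g. rows $[1, x]$ with targets $1 + 2x$, the recovered consequents are the true coefficients.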

The most popular LS-criterion-based TSK fuzzy system learning algorithm is the one used in adaptive-network-based fuzzy inference systems (ANFIS) [46]. A major shortcoming of this type of algorithm is its weak robustness for modeling tasks involving noisy and/or small datasets.

In addition to the LS-criterion-based TSK fuzzy system learning algorithm, another important representative, the ε-insensitive criterion-based TSK-FS learning method, has been developed by employing L1-norm penalty terms [45] and L2-norm penalty terms [33], [34]. Compared with the L1-norm penalty terms-based TSK fuzzy system (L1-TSK-FS)


learning algorithms, the L2-norm penalty terms-based algorithms (L2-TSK-FS) have shown more advantages through improved versions, such as the MEB-based L2-TSK-FS for very large datasets [33] and the knowledge-leverage-based transfer learning L2-TSK-FS for missing data [34]. Here, we mainly focus on the L2-TSK-FS learning algorithm in [33] since it is most related to our work in this paper.

For TSK fuzzy system training, the ε-insensitive objective function is defined as follows.

Given a scalar $g$ and a vector $\mathbf{g} = [g_1, \ldots, g_d]^T$, the corresponding ε-insensitive loss functions take the following forms, respectively: $|g|_\varepsilon = g - \varepsilon$ ($g > \varepsilon$), $|g|_\varepsilon = 0$ ($g \le \varepsilon$), and $|\mathbf{g}|_\varepsilon = \sum_{i=1}^{d} |g_i|_\varepsilon$. For the linear regression problem of the TSK fuzzy model in (3.f), the corresponding ε-insensitive loss-based criterion function [45] is defined as

$$\min_{\mathbf{p}_g} E = \sum_{i=1}^{N} |y_i^o - y_i|_\varepsilon = \sum_{i=1}^{N} |\mathbf{p}_g^T \mathbf{x}_{gi} - y_i|_\varepsilon. \qquad (8.a)$$
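The ε-insensitive loss can be sketched directly from the scalar definition above; errors inside the ε-tube cost nothing, and beyond it the cost grows linearly. Here we apply the scalar loss to absolute residuals, as (8.a) does; the function names are ours:

```python
def eps_loss(g, eps):
    """Scalar eps-insensitive loss: |g|_eps = g - eps if g > eps, else 0."""
    return g - eps if g > eps else 0.0

def eps_criterion(residuals, eps):
    """Criterion (8.a) evaluated on residuals y_i^o - y_i:
    each residual inside the eps-tube contributes zero."""
    return sum(eps_loss(abs(r), eps) for r in residuals)
```

This tube-shaped loss is what gives the learned consequents their robustness to small perturbations, compared with the LS criterion (7), which penalizes every residual quadratically.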

In general, the inequalities $y_i - \mathbf{p}_g^T \mathbf{x}_{gi} < \varepsilon$ and $\mathbf{p}_g^T \mathbf{x}_{gi} - y_i < \varepsilon$ are not satisfied for all data pairs $(\mathbf{x}_{gi}, y_i)$. Further, by introducing the regularization term [45], (8.a) is modified to

$$\min_{\mathbf{p}_g} g(\mathbf{p}_g, \varepsilon) = \frac{1}{\tau} \cdot \frac{1}{N} \sum_{i=1}^{N} |\mathbf{p}_g^T \mathbf{x}_{gi} - y_i|_\varepsilon + \frac{1}{2} \mathbf{p}_g^T \mathbf{p}_g \qquad (8.b)$$

where $|\cdot|_\varepsilon$ is the ε-insensitive measure, and the balance parameter $\tau$ ($> 0$) controls the tradeoff between the complexity of the regression model and the tolerance of the errors. When L2-norm penalty terms with slack variables $\xi_i^+$ and $\xi_i^-$ are introduced, the corresponding objective function of the L2-TSK-FS can be formulated as follows [33]:

$$\min_{\mathbf{p}_g, \boldsymbol{\xi}^+, \boldsymbol{\xi}^-, \varepsilon} g(\mathbf{p}_g, \boldsymbol{\xi}^+, \boldsymbol{\xi}^-, \varepsilon) = \frac{1}{\tau} \cdot \frac{1}{N} \sum_{i=1}^{N} \big((\xi_i^+)^2 + (\xi_i^-)^2\big) + \frac{1}{2} \mathbf{p}_g^T \mathbf{p}_g + \frac{2}{\tau} \cdot \varepsilon$$
$$\text{s.t. } \begin{cases} y_i - \mathbf{p}_g^T \mathbf{x}_{gi} < \varepsilon + \xi_i^+ \\ \mathbf{p}_g^T \mathbf{x}_{gi} - y_i < \varepsilon + \xi_i^- \end{cases} \quad \forall i. \qquad (9.a)$$

Compared with the L1-norm penalty-based ε-insensitive criterion function [45], the L2-norm penalty-based criterion [33] is advantageous because of the following characteristics: 1) the constraints $\xi_i^+ \ge 0$ and $\xi_i^- \ge 0$ in the objective function of L1-TSK-FS [45] are not needed for the optimization and 2) the insensitive parameter $\varepsilon$ can be obtained automatically by optimization without the need for manual setting. Similar properties can also be found in other L2-norm penalty-based machine learning algorithms, such as L2-SVR [47]. Based on optimization theory, the dual of (9.a) can be formulated as the following QP problem:

$$\max_{\boldsymbol{\alpha}^+, \boldsymbol{\alpha}^-} \; -\sum_{i=1}^{N}\sum_{j=1}^{N} (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-) \cdot \mathbf{x}_{gi}^T \mathbf{x}_{gj} - \sum_{i=1}^{N} \frac{N\tau}{2}(\alpha_i^+)^2 - \sum_{i=1}^{N} \frac{N\tau}{2}(\alpha_i^-)^2 + \sum_{i=1}^{N} \alpha_i^+ \cdot y_i \cdot \tau - \sum_{i=1}^{N} \alpha_i^- \cdot y_i \cdot \tau$$
$$\text{s.t. } \sum_{i=1}^{N} (\alpha_i^+ + \alpha_i^-) = 1, \quad \alpha_i^+, \alpha_i^- \ge 0 \;\; \forall i. \qquad (9.b)$$

Notably, the characteristics of the QP problem in (9.b) enable the use of the coreset-based minimal enclosing ball (MEB) approximation technique to solve problems involving very large datasets [47]. The scalable L2-TSK-FS learning algorithm (STSK) has been proposed in this regard [33]. Equation (9.b) can also be used to develop a transfer learning version of the TSK fuzzy system to solve the problem of missing data [34].

B. TSK Fuzzy Model Learning for Multiple Tasks With Common Hidden Structure

Owing to the advantages of L2-TSK-FS, we mainly focus on the L2-TSK-FS learning algorithm, which is most related to our work in this paper. As distinguished from the single-task learning algorithm, when we design the objective function of the TSK fuzzy system for multiple tasks with a common hidden structure, based on the classic ε-insensitive criterion and L2-norm penalty terms, we should consider how to maintain a balance between the unique characteristics of the data samples of different tasks (independence information) and the correlation information (intertask hidden correlation obtained by mining the common hidden structure among all tasks), and how to generalize the independence and correlation information. Accordingly, the following objective function is defined:

$$
\min_{\boldsymbol{\theta},\mathbf{U},\mathbf{H},\boldsymbol{\xi},\boldsymbol{\varepsilon}}\;
J(\boldsymbol{\theta},\mathbf{U},\mathbf{H},\boldsymbol{\xi},\boldsymbol{\varepsilon})
=\sum_{k=1}^{K} g_k(\boldsymbol{\theta}_k,\boldsymbol{\xi}_k,\varepsilon_k)
+\Omega_S(\mathbf{H}^{T}\mathbf{U})
$$
$$
\text{s.t.}\quad
\begin{cases}
y_{i,k}-(\mathbf{H}^{T}\mathbf{U}+\boldsymbol{\theta}_k)^{T}\mathbf{x}_{gi,k}<\varepsilon_k+\xi_{i,k}^{+}\\
(\mathbf{H}^{T}\mathbf{U}+\boldsymbol{\theta}_k)^{T}\mathbf{x}_{gi,k}-y_{i,k}<\varepsilon_k+\xi_{i,k}^{-}\\
\mathbf{H}\mathbf{H}^{T}=\mathbf{I}_{r\times r}
\end{cases}
\quad \forall i,k
\tag{10}
$$

where
$$
g_k(\boldsymbol{\theta}_k,\boldsymbol{\xi}_k,\varepsilon_k)
=\frac{\lambda}{K}\boldsymbol{\theta}_k^{T}\boldsymbol{\theta}_k
+\frac{1}{N_k\tau_k}\sum_{i=1}^{N_k}\bigl((\xi_{i,k}^{+})^2+(\xi_{i,k}^{-})^2\bigr)
+\frac{2}{\tau_k}\varepsilon_k
\tag{10.a}
$$
$$
\Omega_S(\mathbf{H}^{T}\mathbf{U})=\frac{1}{2}(\mathbf{H}^{T}\mathbf{U})^{T}(\mathbf{H}^{T}\mathbf{U}).
\tag{10.b}
$$

It can be seen from (10) that (10.a) and (10.b) play different roles: (10.a) represents the independence information in the original regression sample space (of dimension M·(d+1)) of each task, and (10.b) represents the correlation information in the common hidden structure shared among all tasks. Specifically, to represent the independence information and the correlation information (intertask hidden correlation), the vector θ_k ∈ R^{(M·(d+1))×1} represents the individual consequents of each task; it tends to zero when the tasks are similar to each other, and otherwise the common hidden structure consequent U ∈ R^{r×1} tends to zero. The model parameters p_{g,k} for task k are then obtained as p_{g,k} = H^T U + θ_k. In other words, the common structure H ∈ R^{r×(M·(d+1))} maps the original regression sample space (of dimension M·(d+1)) into a low-dimensional hidden structure (of dimension r, with r < M·(d+1)). Note that the regularization parameters τ_k > 0 control the tradeoff between the complexity of the


$$
L(\lambda_1^+,\ldots,\lambda_K^+,\lambda_1^-,\ldots,\lambda_K^-)
=-\frac{1}{2}\left(
\begin{aligned}
&\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_l}
(\lambda_{j,l}^+-\lambda_{j,l}^-)(\lambda_{i,k}^+-\lambda_{i,k}^-)\,\mathbf{x}_{gj,l}^{T}\mathbf{x}_{gi,k}\\
&+\frac{K}{\lambda}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}
(\lambda_{i,k}^+-\lambda_{i,k}^-)(\lambda_{j,k}^+-\lambda_{j,k}^-)\,\mathbf{x}_{gj,k}^{T}\mathbf{x}_{gi,k}\\
&+\sum_{k=1}^{K}\frac{N_k\tau_k}{2}\sum_{i=1}^{N_k}\bigl((\lambda_{i,k}^+)^2+(\lambda_{i,k}^-)^2\bigr)
\end{aligned}
\right)
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^+-\lambda_{i,k}^-)y_{i,k}
\tag{11}
$$
$$
\text{s.t.}\quad \boldsymbol{\lambda}_k^+\ge 0,\ \boldsymbol{\lambda}_k^-\ge 0,\qquad
\sum_{i=1}^{N_k}(\lambda_{i,k}^+ + \lambda_{i,k}^-)=\frac{2}{\tau_k}\quad \forall k,\ k=1,\ldots,K.
$$

Let
$$
\boldsymbol{\upsilon}=\bigl(\lambda_{1,1}^+,\ldots,\lambda_{N_1,1}^+,\lambda_{1,1}^-,\ldots,\lambda_{N_1,1}^-,\ldots,
\lambda_{1,K}^+,\ldots,\lambda_{N_K,K}^+,\lambda_{1,K}^-,\ldots,\lambda_{N_K,K}^-\bigr)^{T}
=\bigl((\boldsymbol{\lambda}_1^+)^T,(\boldsymbol{\lambda}_1^-)^T,\ldots,(\boldsymbol{\lambda}_K^+)^T,(\boldsymbol{\lambda}_K^-)^T\bigr)^{T}
\tag{12.a}
$$
$$
\mathbf{z}_{i,k}=
\begin{cases}
\mathbf{x}_{gi,k}, & i=1,\ldots,N_k\\
-\mathbf{x}_{g(i-N_k),k}, & i=N_k+1,\ldots,2N_k
\end{cases}
\tag{12.b}
$$
$$
\boldsymbol{\beta}=(\mathbf{y}_1^{T},-\mathbf{y}_1^{T},\ldots,\mathbf{y}_K^{T},-\mathbf{y}_K^{T})^{T},\qquad
\mathbf{y}_k=(y_{1,k},\ldots,y_{N_k,k})^{T},\quad k=1,\ldots,K
\tag{12.c}
$$

regression model and the tolerance of errors, and the parameter λ has an impact on θ_k. When λ → +∞, each θ_k tends to 0, denoting strong correlation and weak independence. When λ → 0, each θ_k tends to +∞, denoting strong independence and weak correlation. The values of the vector τ and the constant λ can be set manually or chosen by a cross-validation strategy [48].
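The roles of H, U, and θ_k described above can be illustrated with a small numerical sketch; the dimensions follow the paper, while the rule count and random values are purely illustrative.

```python
import numpy as np

# Sketch of the parameter decomposition in (10): each task's consequent
# p_gk lives in the original M*(d+1)-dimensional regression space and is
# split into a shared part H^T U (mapped up from an r-dimensional hidden
# structure) plus a task-specific offset theta_k.
M, d, r, K = 5, 3, 2, 3          # rules, input dim, hidden dim, tasks
D = M * (d + 1)                  # dimension of the regression space

rng = np.random.default_rng(42)
H = np.linalg.qr(rng.normal(size=(D, r)))[0].T   # H in R^{r x D}, H H^T = I_r
U = rng.normal(size=(r, 1))                      # shared consequent
theta = [rng.normal(size=(D, 1)) for _ in range(K)]

# model parameters per task: p_gk = H^T U + theta_k
p_g = [H.T @ U + theta_k for theta_k in theta]

assert np.allclose(H @ H.T, np.eye(r))           # orthonormality constraint
assert p_g[0].shape == (D, 1)
```

The QR factorization is one convenient way to produce an H satisfying the constraint H H^T = I_{r×r}.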

C. Parameter Solution for MTCS-TSK-FS

In (10), all of the variables θ, U, and H need to be optimized, and solving this problem directly is not a trivial task. In this paper, an iterative method, which was used in our previous work [35], is adopted. The optimization procedure contains three main steps.

Step 1: The computation of θ* and U*.
When the common structural parameter H is fixed, (10) becomes the typical quadratic programming (QP) problem in (11)–(14.c), among which (11)–(12.c) can be seen at the top of the page.

Equation (11) can be reformulated as
$$
\arg\max_{\boldsymbol{\upsilon}}\;
-\frac{1}{2}\boldsymbol{\upsilon}^{T}\mathbf{K}\boldsymbol{\upsilon}+\boldsymbol{\upsilon}^{T}\boldsymbol{\beta}
\qquad
\text{s.t.}\quad \boldsymbol{\upsilon}_k^{T}\mathbf{1}=\frac{2}{\tau_k},\quad \upsilon_{i,k}\ge 0\ \forall i,k
\tag{13}
$$
where
$$
\mathbf{K}_k=\bigl[k_{ij}\bigr]_{2N_k\times 2N_k},\quad
k_{ij}=\frac{K}{\lambda}\mathbf{z}_{gj,k}^{T}\mathbf{z}_{gi,k}+\frac{N_k\tau_k}{2}\delta_{ij},\quad
\delta_{ij}=\begin{cases}1,& i=j\\ 0,& i\neq j\end{cases}
\tag{14.a}
$$

$$
\mathbf{K}_{k,l}=\bigl[k_{ij}\bigr]_{2N_l\times 2N_k},\qquad k_{ij}=\mathbf{z}_{gj,l}^{T}\mathbf{z}_{gi,k}
\tag{14.b}
$$
$$
\mathbf{K}=
\begin{pmatrix}
\mathbf{K}_1+\mathbf{K}_{1,1} & \mathbf{K}_{2,1} & \cdots & \mathbf{K}_{K,1}\\
\mathbf{K}_{1,2} & \mathbf{K}_2+\mathbf{K}_{2,2} & \cdots & \mathbf{K}_{K,2}\\
\vdots & \vdots & \ddots & \vdots\\
\mathbf{K}_{1,K} & \mathbf{K}_{2,K} & \cdots & \mathbf{K}_K+\mathbf{K}_{K,K}
\end{pmatrix}.
\tag{14.c}
$$

With the optimal solution $(\boldsymbol{\lambda}_1^+)^*,(\boldsymbol{\lambda}_1^-)^*,\ldots,(\boldsymbol{\lambda}_K^+)^*,(\boldsymbol{\lambda}_K^-)^*$ of the dual in (11) or (13), we can get the optimal solution of the primal in (10) based on the relations in (15.a) and (15.b)
$$
\boldsymbol{\theta}_k^{*}=\frac{K}{\lambda}\sum_{i=1}^{N_k}\bigl((\lambda_{i,k}^+)^*-(\lambda_{i,k}^-)^*\bigr)\mathbf{x}_{gi,k}
\tag{15.a}
$$
$$
\mathbf{U}^{*}=\mathbf{H}\left(\sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^+)^*\mathbf{x}_{gi,k}
-\sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^-)^*\mathbf{x}_{gi,k}\right).
\tag{15.b}
$$

Please note that the detailed derivations of (11), (15.a), and (15.b) can be found in the Appendix.
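In code, recovering the primal parameters from the optimal dual variables via (15.a) and (15.b) might look as follows; this is a sketch, and the variable names and list-of-arrays data layout are our own assumptions.

```python
import numpy as np

def recover_primal(H, Xg_tasks, lam_plus, lam_minus, lam, K):
    """Map optimal dual variables to primal parameters per (15.a)/(15.b).

    H         : (r, D) common structure matrix
    Xg_tasks  : list of (N_k, D) input matrices, one per task
    lam_plus, lam_minus : lists of (N_k,) optimal dual vectors
    lam       : the regularization constant lambda
    K         : number of tasks
    """
    # (15.a): theta_k* = (K/lambda) * sum_i (lam+_{i,k} - lam-_{i,k}) x_{gi,k}
    theta = [(K / lam) * Xg_k.T @ (lp - lm)
             for Xg_k, lp, lm in zip(Xg_tasks, lam_plus, lam_minus)]
    # (15.b): U* = H times the dual-weighted sum of all tasks' inputs
    s = sum(Xg_k.T @ (lp - lm)
            for Xg_k, lp, lm in zip(Xg_tasks, lam_plus, lam_minus))
    return theta, H @ s
```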

Step 2: The computation of H*.
When the parameters θ and U are fixed and the dual variables $\boldsymbol{\lambda}_1^+,\ldots,\boldsymbol{\lambda}_K^+,\boldsymbol{\lambda}_1^-,\ldots,\boldsymbol{\lambda}_K^-$ have been obtained in Step 1, the optimal H that solves the optimization problem in (10) can be expressed via
$$
G(\mathbf{H})=\frac{1}{2}(\mathbf{H}^{T}\mathbf{U})^{T}(\mathbf{H}^{T}\mathbf{U})
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\bigl(-(\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr)
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\bigl((\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr)
\qquad
\text{s.t.}\ \mathbf{H}\mathbf{H}^{T}=\mathbf{I}_{r\times r}.
\tag{16.a}
$$

We have the gradient with respect to H as follows:
$$
\frac{\partial G}{\partial \mathbf{H}}=\mathbf{U}\mathbf{U}^{T}\mathbf{H}
-\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\mathbf{U}\mathbf{x}_{gi,k}^{T}
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\mathbf{U}\mathbf{x}_{gi,k}^{T}.
\tag{16.b}
$$


Algorithm 1: Learning Algorithm for MTCS-TSK-FS

Stage 1: Initialization and Optimal Conditions
  Initialize t = 0 and H(0); compute J(0); set the maximum number of external iterations t_max, the maximum number of internal iterations l_max, the regularization parameters τ_k and λ, the number of fuzzy rules M_k for each task k = 1,…,K (set M_1 = M_2 = … = M_K), and the threshold δ.

Stage 2: Constructing the Multitask Dataset for Linear Regression
  Step 1: Use FCM or another partition method to generate the regression datasets D_k = {x_{gi,k}, y_{i,k}}, i = 1, 2,…,N_k, k = 1, 2,…,K for each task.

Stage 3: Optimizing the Objective Function of MTCS-TSK-FS
  Repeat:
    t = t + 1;
    Step 2: Use (11) or (13) to update θ_k^{(t)} and U^{(t)};
    Step 3: l = 0; H^{(t,l)} = H^{(t−1)}; compute G(0);
      Repeat:
        l = l + 1;
        Use (17.c) to update the step size η;
        Use (18) to update H;
      Until ‖G(t, l) − G(t, l − 1)‖ ≤ δ or l ≥ l_max
      H^{(t)} = H^{(t,l)};
    Step 4: Use (19) to update the consequent parameter p_{g,k} of each task; compute J(t);
  Until ‖J(t) − J(t − 1)‖ ≤ δ or t ≥ t_max

Stage 4: Generating the MTCS-TSK Fuzzy System for Each Task
  Step 5: Generate the desired MTCS-TSK-FS for each task by using the final optimal consequent parameter p_{g,k} and (6.g).
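The outer loop of Algorithm 1 (Stage 3) can be sketched generically, with the three update rules abstracted as callables; all names here are placeholders, not the authors' code.

```python
def mtcs_tsk_fs_skeleton(J_eval, update_theta_U, update_H, H0,
                         t_max=50, delta=1e-4):
    """Outer loop of Algorithm 1 (Stage 3), sketched generically.

    The three callables stand in for the paper's update rules:
      update_theta_U(H)     -> (theta, U)   # solve the QP in (11)/(13)
      update_H(theta, U, H) -> H            # inner gradient loop, (17.c)/(18)
      J_eval(theta, U, H)   -> float        # objective value (10)
    """
    H = H0
    theta, U = update_theta_U(H)
    J_prev = J_eval(theta, U, H)            # J(0)
    for _ in range(t_max):
        theta, U = update_theta_U(H)        # Step 2
        H = update_H(theta, U, H)           # Step 3
        J = J_eval(theta, U, H)             # Step 4
        if abs(J - J_prev) <= delta:        # convergence check
            break
        J_prev = J
    return theta, U, H
```

The same structure accommodates any concrete implementations of the QP solver and the H update.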

Then, the variable H can be learned via a gradient descent algorithm on the Grassmann manifold [49]–[51]
$$
\mathbf{H}\leftarrow \mathbf{H}-\eta\frac{\partial G}{\partial \mathbf{H}}\bigl(\mathbf{I}_{r\times r}-\mathbf{H}\mathbf{H}^{T}\bigr)
=\mathbf{H}-\eta\nabla\mathbf{H}.
\tag{16.c}
$$

That is
$$
\mathbf{H}^{(t+1)}=\mathbf{H}^{(t)}-\eta\left(\mathbf{U}\mathbf{U}^{T}\mathbf{H}^{(t)}
-\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\mathbf{U}\mathbf{x}_{gi,k}^{T}
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\mathbf{U}\mathbf{x}_{gi,k}^{T}\right)
\bigl(\mathbf{I}_{r\times r}-\mathbf{H}^{(t)}\mathbf{H}^{(t)T}\bigr).
\tag{16.d}
$$

Please note that the step size η can be set manually or obtained analytically as follows. First, substituting the update rule (16.c) into the objective function (16.a), we have

$$
f(\eta)=-\frac{\eta}{2}(\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{H}^{T}\mathbf{U}
-\frac{\eta}{2}(\mathbf{H}^{T}\mathbf{U})^{T}\nabla\mathbf{H}^{T}\mathbf{U}
+\frac{\eta^{2}}{2}(\nabla\mathbf{H}^{T}\mathbf{U})^{T}\nabla\mathbf{H}^{T}\mathbf{U}
+\eta\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\bigl((\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr)
-\eta\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\bigl((\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr)
\tag{17.a}
$$

and the gradient with respect to the step size η is
$$
\frac{\partial f}{\partial \eta}=-\frac{1}{2}(\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{H}^{T}\mathbf{U}
-\frac{1}{2}(\mathbf{H}^{T}\mathbf{U})^{T}\nabla\mathbf{H}^{T}\mathbf{U}
+\eta(\nabla\mathbf{H}^{T}\mathbf{U})^{T}\nabla\mathbf{H}^{T}\mathbf{U}
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\bigl((\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr)
-\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\bigl((\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr).
\tag{17.b}
$$

Setting ∂f/∂η = 0, we obtain (17.c), shown at the bottom of the next page.

Above all, we have the update rule for the common structural parameter H by using (16.d) and (17.c), that is
$$
\mathbf{H}\leftarrow \mathbf{H}-\eta\nabla\mathbf{H}
\tag{18}
$$

where $\nabla\mathbf{H}=\frac{\partial G}{\partial\mathbf{H}}(\mathbf{I}_{r\times r}-\mathbf{H}\mathbf{H}^{T})$ and $\frac{\partial G}{\partial\mathbf{H}}=\mathbf{U}\mathbf{U}^{T}\mathbf{H}-\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\mathbf{U}\mathbf{x}_{gi,k}^{T}+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\mathbf{U}\mathbf{x}_{gi,k}^{T}$.
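A numerical sketch of one update step for H follows. Note one labeled deviation: the printed projector (I_{r×r} − HH^T) is identically zero whenever the constraint HH^T = I_{r×r} holds exactly, so this sketch substitutes the standard Grassmann tangent projector (I_D − H^T H) and retracts back to the constraint set with a QR factorization; this substitution is ours, not the paper's formula.

```python
import numpy as np

def grassmann_step(H, U, Xg_tasks, lam_plus, lam_minus, eta):
    """One projected gradient step for the common structure H (sketch).

    H : (r, D) with H H^T = I_r;  U : (r, 1);  Xg_tasks[k] : (N_k, D).
    Uses the Euclidean gradient (16.b), projects it with (I_D - H^T H)
    (our substitution for the printed projector), and re-orthonormalizes.
    """
    D = H.shape[1]
    # Euclidean gradient (16.b): U U^T H - sum lam+ U x^T + sum lam- U x^T
    S = sum(Xg_k.T @ (lp - lm)
            for Xg_k, lp, lm in zip(Xg_tasks, lam_plus, lam_minus))
    dG = U @ U.T @ H - U @ S[None, :]
    grad = dG @ (np.eye(D) - H.T @ H)        # tangent-space projection
    Q, _ = np.linalg.qr((H - eta * grad).T)  # retraction back to H H^T = I_r
    return Q.T
```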

Step 3: The computation of p*_{g,k}.
With the optimal parameters θ* and U* of the dual in (11) or (13) from Step 1 and the optimal common structural parameter H* from Step 2, the optimal model parameter of the trained MTCS-TSK-FS for task k, (p_{g,k})*, is given by
$$
(\mathbf{p}_{g,k})^{*}=(\mathbf{H}^{*})^{T}\mathbf{U}^{*}+(\boldsymbol{\theta}_k)^{*}.
\tag{19}
$$
Finally, we can use the obtained optimal parameter (p_{g,k})* to construct the TSK fuzzy system for each task by using (6.g).

Note here that the joint optimization over θ, U, and H makes the problem nonconvex; therefore, only a locally optimal solution can be obtained. This is usually not a serious problem, as the local optimum is effective enough in most practical applications.

D. Learning Algorithm for MTCS-TSK-FS

Based on the above update rules, the learning algorithm ofthe proposed MTCS-TSK-FS is presented in Algorithm 1.


IV. EXPERIMENTAL RESULTS

A. Setup

1) Methods Adopted for Comparison: In this section, we evaluate the effectiveness of the proposed MTCS-TSK-FS in comparison with three representative methods on synthetic and real-world datasets. The three methods are: 1) the L2-norm penalty-based ε-insensitive TSK fuzzy model (L2-TSK-FS) [33]; 2) TS-fuzzy-system-based support vector regression (TSFS-SVR) [52]; and 3) the fuzzy system learned through fuzzy clustering and support vector machine (FS-FCSVM) [53].

2) Parameter Setting: In our experiments, the parameters of the above three methods and our proposed method are set as follows.

1) The number of fuzzy rules K: For all the algorithms except FS-FCSVM (which automatically determines its number of fuzzy rules from the number of support vectors), the number of fuzzy rules is determined, according to the scale of the datasets, by a fivefold cross-validation strategy over the parameter set {5, 10, 15, 20, 25, 30}.

2) The regularization parameters τ of the L2-norm-based TSK-FS (L2-TSK-FS and MTCS-TSK-FS) and C of the SVM-based TSK-FS (TSFS-SVR and FS-FCSVM): These regularization parameters are determined by a fivefold cross-validation strategy over the parameter sets {2^{−6}, 2^{−5}, …, 2^{5}, 2^{6}} and {10^{−3}, 10^{−2}, …, 10^{2}, 10^{3}}, respectively.

3) The kernel function of the SVM-based TSK-FS (TSFS-SVR and FS-FCSVM): We use the Gaussian kernel function K(x, y) = e^{−‖x−y‖²/σ²} as in [52] and [53], and its kernel parameter σ is determined by a fivefold cross-validation strategy over the parameter set {2^{−6}, 2^{−5}, …, 2^{5}, 2^{6}}.

4) The common hidden r-dimensional structure of the proposed MTCS-TSK-FS: We tested different values of r and found that r = ⌈d/2⌉ seems to give the best results, where d is the dimension of the regression datasets. When r is too large, the computation cost is too high; when r is too small, the performance may suffer due to information loss.

3) Performance Index: The index
$$
J=\sqrt{\frac{\frac{1}{N}\sum_{i=1}^{N}(y_i'-y_i)^2}{\frac{1}{N}\sum_{i=1}^{N}(y_i-\bar{y})^2}}
$$
is adopted to evaluate the training and test performance. In this index, N is the number of samples in a dataset, y_i is the desired output of the ith data pair, y_i' is the fuzzy model output for the ith input datum, and $\bar{y}=\frac{1}{N}\sum_{i=1}^{N}y_i$. The smaller the value of J obtained on a test set, the better the generalization performance.
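The performance index J is straightforward to implement; a minimal sketch:

```python
import numpy as np

def perf_index_J(y_true, y_pred):
    """Performance index J: root of the ratio between the mean squared
    prediction error and the variance of the targets (smaller is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = np.mean((y_pred - y_true) ** 2)
    den = np.mean((y_true - y_true.mean()) ** 2)
    return np.sqrt(num / den)
```

J = 0 indicates perfect prediction, while J = 1 corresponds to always predicting the target mean.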

For clarity, the notations for the datasets and their definitions are listed in Table I.

In the experiments, the parameters of all the methods adopted for comparison are determined by a fivefold cross-validation strategy on the training datasets. All the algorithms are implemented using 64-bit MATLAB on a computer with two Intel Xeon E5-2620 2.0 GHz CPUs and 32 GB RAM.

TABLE I: NOTATIONS OF THE ADOPTED DATASETS AND THEIR DEFINITIONS

B. Synthetic Datasets

1) Generation of Synthetic Datasets: Synthetic datasets were generated to simulate different scenes. These datasets need to satisfy the following assumptions: 1) there exist different tasks (independence information) and 2) these tasks are related (correlation information). In other words, our synthetic datasets satisfy the assumption that the multiple tasks are different but related.

Based on the above assumptions, we define four different scenes, as described in Table II, to simulate real-world ones as follows.

a) Same Input-Different Output [SI-DO(DN)] scene: This scene contains three tasks, which have the same input dataset, i.e., X1 = X2 = X3, but different outputs obtained by using the same mapping function with different noise added. Thus, a multitask dataset {T1, T2, T3} is obtained, where T1 = [X, Y1], T2 = [X, Y2], and T3 = [X, Y3].

b) Same Input-Different Output [SI-DO(DF)] scene: This scene contains three tasks, again having the same input dataset, i.e., X1 = X2 = X3. However, their outputs differ because different mapping functions are used, with the same noise-adding function applied.

c) Different Input-Different Output [DI-DO(SN&SF)] scene: This scene contains three tasks, with different input datasets generated from different input ranges and different output values generated by the same mapping function and the same noise-adding function.

d) Different Input-Different Output [DI-DO(DN&SF)] scene: This scene contains three tasks, with different input datasets generated from different input ranges and different output values generated by the same mapping function but with different noise added.

The settings for generating the above synthetic datasets are described in Table II. In this experiment, ten noisy training datasets and one noise-free test dataset were generated for each multitask scene. The average performance of each algorithm under the different multitask scenes is reported.
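A sketch of how such a scene can be generated, using SI-DO(DN) as an example; the mapping function and noise levels here are illustrative assumptions, not the exact settings of Table II.

```python
import numpy as np

def make_si_do_dn(n=200, noise_levels=(0.1, 0.3, 0.5), seed=0):
    """Sketch of the SI-DO(DN) scene: three tasks share the input X and
    the mapping function, and differ only in the noise added to the
    output.  Both f and the noise levels are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(n, 1))          # shared input dataset
    f = lambda x: (np.sin(x) * x).ravel()        # shared mapping function
    # same input, same mapping, different noise per task
    tasks = [(X, f(X) + rng.normal(0, s, size=n)) for s in noise_levels]
    return tasks  # [(X, Y1), (X, Y2), (X, Y3)]
```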

$$
\eta=\frac{\displaystyle \frac{1}{2}(\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{H}^{T}\mathbf{U}
+\frac{1}{2}(\mathbf{H}^{T}\mathbf{U})^{T}\nabla\mathbf{H}^{T}\mathbf{U}
-\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\bigl((\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr)
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\bigl((\nabla\mathbf{H}^{T}\mathbf{U})^{T}\mathbf{x}_{gi,k}\bigr)}
{(\nabla\mathbf{H}^{T}\mathbf{U})^{T}\nabla\mathbf{H}^{T}\mathbf{U}}
\tag{17.c}
$$
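The closed-form step size (17.c) follows from setting the derivative of the quadratic f(η) in (17.a) to zero, and can be computed directly from the current gradient direction; a sketch with shapes as in the surrounding derivation, names our own.

```python
import numpy as np

def analytic_step_size(gradH, H, U, Xg_tasks, lam_plus, lam_minus):
    """Closed-form step size (17.c): the root of df/d(eta) = 0 for the
    quadratic f(eta) in (17.a).  gradH is the update direction nabla H;
    H is (r, D) and U is (r, 1)."""
    gU = (gradH.T @ U).ravel()             # (nabla H)^T U, a (D,) vector
    hU = (H.T @ U).ravel()                 # H^T U, a (D,) vector
    cross = sum((lp - lm) @ (Xg_k @ gU)    # sum (lam+ - lam-) * (gU . x)
                for Xg_k, lp, lm in zip(Xg_tasks, lam_plus, lam_minus))
    return (gU @ hU - cross) / (gU @ gU)
```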


TABLE II: DETAILS OF THE SYNTHETIC DATASETS

2) Comparing With Related TSK-FS Modeling Methods: The proposed MTCS-TSK-FS is compared with three related TSK-based fuzzy system methods, and the results are shown in Table III and Fig. 3 [for Fig. 3, to save space, we only show the experimental results on the dataset of the DI-DO(SF&DN) scene]. According to the experimental results on these four different scenes, the following observations can be made.

1) For the SI-DO(DN) scene, it can be seen from the experimental results in Table III that as the degree of noise increases, the performance of the comparison methods, i.e., L2-TSK-FS, TSFS-SVR, and FS-FCSVM, becomes poorer and poorer. An obvious reason is that the training data of each task are noisy, which degrades the generalization capability of these three methods. However, our proposed MTCS-TSK-FS can considerably improve not only the generalization performance but also the fitting to each individual task in this scene, indicating that the use of common information (i.e., the common hidden structure) has positive effects on both the fitting of each individual fuzzy model to its training dataset and the generalization performance of all models on unseen test datasets.

2) For the SI-DO(DF) scene, the three data-generating functions have different levels of complexity. The modeling results in Table III show that the three related comparison methods are not ideal for this multitask scene. In particular, on the second function f_2(x_2, N_2) = x_2² · cos(x_2) + N_2, both the generalization and fitting performance of these three related methods are much weaker. Although these three methods demonstrate undesirable results on this scene, our proposed MTCS-TSK-FS still obtains an acceptable generalization capability.

3) For the DI-DO(SN&SF) scene, the results in Table III show that, as the interval of the input data is expanded, the complexity of the output data increases and the comparison methods cannot achieve ideal results. By exploiting the independence information of each task and mining the shared hidden correlation information among all tasks, MTCS-TSK-FS achieves the best performance.

4) For the DI-DO(DN&SF) scene, Table III and Fig. 3 show the modeling results of L2-TSK-FS, TSFS-SVR, FS-FCSVM, and our MTCS-TSK-FS. It can be seen that both the generalization and fitting performance of the proposed MTCS-TSK-FS are better than those of the related methods on this scene. By simultaneously expanding the interval of the input data and increasing the degree of noise, the output data become exceptionally complex across all tasks. The single-task-based methods cannot make full use of the information among all tasks, in particular the common information, so they still exhibit poor performance. In contrast, the multitask-based MTCS-TSK-FS achieves the best performance by exploiting both the independence information and the shared hidden correlation information.

Above all, the experimental results confirm that the proposed multitask TSK-FS with common hidden structure outperforms the existing state-of-the-art TSK-based fuzzy system methods in multitask learning scenes.

C. Real-World Datasets

1) Glutamic Acid Fermentation Process Modeling: The glutamic acid fermentation process [33], [54] has strong nonlinear and time-varying characteristics, so it is difficult to establish a mechanistic model for it; moreover, an overly complex model is of little use for controlling and optimizing the process. Therefore, it is of great practical significance to model the glutamic acid fermentation process using fuzzy modeling technology. In this subsection, to further evaluate the performance of the proposed multitask TSK fuzzy


TABLE III: GENERALIZATION AND FITTING PERFORMANCES (J) OF THE PROPOSED METHOD MTCS-TSK-FS AND SEVERAL RELATED TSK-BASED FUZZY SYSTEM METHODS ON THE DATASETS OF DIFFERENT SCENES

system learning method, an experiment is conducted to apply the proposed method to model a biochemical process with real-world datasets [33]–[35]. The adopted datasets originate from the glutamic acid fermentation process, which is a MIMO system. The input variables of the dataset include the fermentation time h, glucose concentration S(h), thalli concentration X(h), glutamic acid concentration P(h), stirring speed R(h), and ventilation Q(h), where h = 0, 2, …, 28. The output variables are the glucose concentration S(h+2), thalli concentration X(h+2), and glutamic acid concentration P(h+2) at the future time h+2. The TSK-FS-based biochemical process prediction model is illustrated in Fig. 4. The data in this experiment were collected from 21 batches of fermentation processes, each batch containing 14 effective data samples. We randomly selected 20 batches as the training dataset (D1-train) and the remaining batch as the test dataset (D2-test). This procedure was repeated ten times to obtain the average performance of each algorithm. To match the situation discussed in this paper, the dataset is divided into three tasks as follows.

1) Task 1 (Glucose Concentration-FS): The main objective of this fuzzy system is to predict the glucose concentration at the next time step, i.e., S(h + 2).

2) Task 2 (Thalli Concentration-FS): The main objective of this fuzzy system is to predict the thalli concentration at the next time step, i.e., X(h + 2).

3) Task 3 (Glutamic Acid Concentration-FS): The main objective of this fuzzy system is to predict the glutamic acid concentration at the next time step, i.e., P(h + 2).

2) Polymer Test Plant Modeling: This dataset was taken from a polymer test plant.¹ There are ten input variables, which are measurements of controlled variables in a polymer processing plant (temperatures, feed rates, etc.), X ∈ R^d, d = 10, and four output variables [Y1, Y2, Y3, Y4] that are measures of the output of the plant. This dataset is claimed to be particularly good for testing the robustness of nonlinear modeling methods to irregularly spaced data. It is also a MIMO system. According to its outputs, we divide this dataset into four tasks, i.e., [X, Y1] for task 1, [X, Y2] for task 2, [X, Y3] for task 3, and [X, Y4] for task 4. In this experiment, the dataset was randomly partitioned with ratio 3:1 for training and testing. This procedure was repeated ten times and the average performance of each algorithm over the ten runs is reported.

3) Wine Preferences Modeling: This dataset is adopted from the wine quality dataset [55], [56]. It contains two subdatasets that measure physicochemical properties of red wine (1599 samples) and white wine (4898 samples), with 11 conditional attributes (inputs) based on physicochemical tests (e.g., pH value), X ∈ R^d, d = 11, and one decision attribute (output) based on sensory data (a quality score between 0 and 10 given by wine experts). This dataset is divided into two tasks, i.e., red wine [X_red, Y_red] for task 1

¹The polymer dataset is available from ftp://ftp.cis.upenn.edu/pub/ungar/chemdata


Fig. 3. Experimental results of the proposed MTCS-TSK-FS method, the L2-TSK-FS method, the TSFS-SVR method, and the FS-FCSVM method on the testing dataset of the DI-DO(SF&DN) scene in a certain run: (a-1)–(a-3) L2-TSK-FS for each task; (b-1)–(b-3) TSFS-SVR for each task; (c-1)–(c-3) FS-FCSVM for each task; and (d-1)–(d-3) MTCS-TSK-FS for each task.

Fig. 4. Illustration of the glutamic acid fermentation process prediction model based on TSK-FSs.

and white wine [X_white, Y_white] for task 2. In this experiment, the dataset was also randomly partitioned with ratio 3:1 for training and testing. This procedure was repeated ten times to obtain the average performance of each algorithm.

4) Concrete Slump Modeling: This dataset was taken from a ready-mix concrete batching plant [58], [59]. Concrete is a highly complex material, which makes modeling its behavior a very difficult task. This dataset includes 103 data points with seven input variables (cement, slag, fly ash, water, SP, coarse aggregate, and fine aggregate), making X ∈ R^d, d = 7, and three output variables [Y1, Y2, Y3] (slump, flow, and compressive strength (MPa)). It is a MIMO system. According to its outputs, we divide this dataset into three tasks, i.e., [X, Y1] for task 1, [X, Y2] for task 2, and [X, Y3] for task 3. In our experiment, this dataset was also randomly partitioned with ratio 3:1 for training and testing. The procedure was repeated ten times to obtain the average performance of each algorithm.

5) Multivalued (MV) Data Modeling: This is a benchmark artificial dataset [61] with dependencies between attribute values.² It is a large-scale dataset containing 40 768 data points, with ten input variables and one output variable. According to its 8th input variable, we divide this dataset into two tasks, i.e., normal-type for task 1 and large-type for task 2. In this experiment, the dataset was also randomly partitioned with ratio 3:1 for training and testing. The procedure was repeated ten times to obtain the average performance of each algorithm.

6) Comparing With Related TSK-FS Modeling Methods: These five real-world multitask regression datasets were used to test the proposed method MTCS-TSK-FS and the three related TSK-based fuzzy system methods, i.e., L2-TSK-FS, TSFS-SVR, and FS-FCSVM, and the results are given in Table IV. The findings are similar to those presented in Section IV-B for the experiments on the synthetic datasets. The results show that our proposed MTCS-TSK-FS obtains better performance than the other three methods on the four small-scale datasets (i.e., the fermentation, polymer, wine, and concrete datasets) and the large-scale dataset (i.e., the MV dataset). This can again be explained by the fact that the proposed MTCS-TSK-FS effectively exploits not only the independent information of the original data space of each task but also the useful intertask hidden correlation information obtained by mining the common hidden structure among all tasks. Therefore, both the generalization and fitting capabilities of the TSK-FS obtained by the proposed MTCS-TSK-FS for each task are promising on the five adopted datasets, which include a large-scale dataset.

V. CONCLUSION

In this paper, a multitask fuzzy system modeling method based on mining the intertask common hidden structure is proposed to overcome the weaknesses of classical TSK-based fuzzy modeling methods in multitask learning. When classical (single-task) fuzzy modeling methods are applied to multitask datasets, they usually focus on the task-independence information and ignore the correlation between different tasks. Here, we mine the common hidden structure among multiple tasks to realize multitask TSK fuzzy system learning. The method makes good use of both the independence information of each task and the correlation information captured by the common hidden structure among all tasks. Thus, the proposed learning algorithm can effectively improve both the generalization and fitting performance of the learned fuzzy system for each task. Our experimental results demonstrate that the proposed MTCS-TSK-FS has better modeling performance and adaptability than existing TSK-based fuzzy modeling methods on multitask datasets. Although the performance of the proposed multitask fuzzy system is very promising, there is still room for further study. For example, a fast learning algorithm for MTCS-TSK-FS is needed in order to

²The MV dataset is available from http://www.keel.es/


TABLE IV: GENERALIZATION PERFORMANCE (J) OF THE PROPOSED MTCS-TSK-FS AND SEVERAL RELATED TSK-BASED FUZZY SYSTEM METHODS ON REAL-WORLD MULTITASK DATASETS

make it more efficient on large-scale datasets. For this purpose, the minimum enclosing ball approximation technique [33] and the stochastic dual coordinate descent algorithm [60] can be considered for developing the corresponding fast algorithms of the proposed method. Another issue is how to determine an appropriate size of the common hidden structure and how to explain and leverage the knowledge contained in the mined common hidden structure. Besides, the development of transfer learning mechanisms for MTCS-TSK-FS by mining knowledge from the common hidden structure is also very important for applications where data is missing in some tasks. Future work will be devoted to these issues.

APPENDIX

For (10), the corresponding Lagrangian function is given in (A1), shown at the bottom of the page.

From this equation, the optimal values can be computed by setting the derivatives of L(·) with respect to U, θ_k, ξ_k^+, ξ_k^−, and ε_k to zero, respectively, that is

$$
\frac{\partial L}{\partial \mathbf{U}}=\mathbf{H}\mathbf{H}^{T}\mathbf{U}
-\mathbf{H}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\mathbf{x}_{gi,k}
+\mathbf{H}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\mathbf{x}_{gi,k}=0.
\tag{A2.a}
$$
Since $\mathbf{H}\mathbf{H}^{T}=\mathbf{I}_{r\times r}$, this becomes
$$
\frac{\partial L}{\partial \mathbf{U}}=\mathbf{U}
-\mathbf{H}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\mathbf{x}_{gi,k}
+\mathbf{H}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\mathbf{x}_{gi,k}=0
\tag{A2.b}
$$

$$
L(\mathbf{U},\mathbf{H},\boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_K,
\boldsymbol{\xi}_1^+,\ldots,\boldsymbol{\xi}_K^+,
\boldsymbol{\xi}_1^-,\ldots,\boldsymbol{\xi}_K^-,
\varepsilon_1,\ldots,\varepsilon_K,
\boldsymbol{\lambda}_1^+,\ldots,\boldsymbol{\lambda}_K^+,
\boldsymbol{\lambda}_1^-,\ldots,\boldsymbol{\lambda}_K^-)
$$
$$
\begin{aligned}
={}&\frac{1}{2}(\mathbf{H}^{T}\mathbf{U})^{T}(\mathbf{H}^{T}\mathbf{U})
+\frac{\lambda}{2K}\sum_{k=1}^{K}\boldsymbol{\theta}_k^{T}\boldsymbol{\theta}_k
+\sum_{k=1}^{K}\frac{1}{N_k\tau_k}\sum_{i=1}^{N_k}\bigl((\xi_{i,k}^{+})^2+(\xi_{i,k}^{-})^2\bigr)
+2\sum_{k=1}^{K}\frac{1}{\tau_k}\varepsilon_k\\
&+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\bigl(y_{i,k}-(\mathbf{H}^{T}\mathbf{U}+\boldsymbol{\theta}_k)^{T}\mathbf{x}_{gi,k}-\varepsilon_k-\xi_{i,k}^{+}\bigr)
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\bigl((\mathbf{H}^{T}\mathbf{U}+\boldsymbol{\theta}_k)^{T}\mathbf{x}_{gi,k}-y_{i,k}-\varepsilon_k-\xi_{i,k}^{-}\bigr).
\end{aligned}
\tag{A1}
$$


$$
L(\lambda_1^+,\ldots,\lambda_K^+,\lambda_1^-,\ldots,\lambda_K^-)
=-\frac{1}{2}\left(
\begin{aligned}
&\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_l}
(\lambda_{j,l}^+-\lambda_{j,l}^-)(\lambda_{i,k}^+-\lambda_{i,k}^-)\,\mathbf{x}_{gj,l}^{T}\mathbf{x}_{gi,k}\\
&+\frac{K}{\lambda}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}
(\lambda_{i,k}^+-\lambda_{i,k}^-)(\lambda_{j,k}^+-\lambda_{j,k}^-)\,\mathbf{x}_{gj,k}^{T}\mathbf{x}_{gi,k}\\
&+\sum_{k=1}^{K}\frac{N_k\tau_k}{2}\sum_{i=1}^{N_k}\bigl((\lambda_{i,k}^+)^2+(\lambda_{i,k}^-)^2\bigr)
\end{aligned}
\right)
+\sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^+-\lambda_{i,k}^-)y_{i,k}
$$
$$
\text{s.t.}\quad \boldsymbol{\lambda}_k^+\ge 0,\ \boldsymbol{\lambda}_k^-\ge 0,\qquad
\sum_{i=1}^{N_k}(\lambda_{i,k}^+ + \lambda_{i,k}^-)=\frac{2}{\tau_k}\quad \forall k,\ k=1,\ldots,K
\tag{A12}
$$

$$
\frac{\partial L}{\partial \boldsymbol{\theta}_k}=\frac{\lambda}{K}\boldsymbol{\theta}_k
-\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\mathbf{x}_{gi,k}
+\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\mathbf{x}_{gi,k}=0
\tag{A3}
$$
$$
\frac{\partial L}{\partial \xi_{i,k}^{+}}=\frac{2}{N_k\tau_k}\xi_{i,k}^{+}-\lambda_{i,k}^{+}=0
\tag{A4}
$$
$$
\frac{\partial L}{\partial \xi_{i,k}^{-}}=\frac{2}{N_k\tau_k}\xi_{i,k}^{-}-\lambda_{i,k}^{-}=0
\tag{A5}
$$
$$
\frac{\partial L}{\partial \varepsilon_k}=\frac{2}{\tau_k}
-\sum_{i=1}^{N_k}\lambda_{i,k}^{+}-\sum_{i=1}^{N_k}\lambda_{i,k}^{-}=0.
\tag{A6}
$$

From (A2) to (A6), we obtain
$$
\mathbf{U}=\mathbf{H}\left(\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{+}\mathbf{x}_{gi,k}
-\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^{-}\mathbf{x}_{gi,k}\right)
\tag{A7}
$$
$$
\boldsymbol{\theta}_k=\frac{K}{\lambda}\sum_{i=1}^{N_k}(\lambda_{i,k}^{+}-\lambda_{i,k}^{-})\mathbf{x}_{gi,k}
\tag{A8}
$$
$$
\xi_{i,k}^{+}=\frac{N_k\tau_k}{2}\lambda_{i,k}^{+}
\tag{A9}
$$
$$
\xi_{i,k}^{-}=\frac{N_k\tau_k}{2}\lambda_{i,k}^{-}
\tag{A10}
$$
$$
\frac{2}{\tau_k}=\sum_{i=1}^{N_k}(\lambda_{i,k}^{+}+\lambda_{i,k}^{-}).
\tag{A11}
$$

Substituting (A7)–(A11) into (A1), the optimization problem (A12), shown at the top of the page, is obtained. It is clear that (A7), (A8), and (A12) here are equivalent to (15.b), (15.a), and (11) in the main text, respectively.

REFERENCES

[1] R. Caruana, “Multitask learning,” Mach. Learn., vol. 28, no. 1,pp. 41–75, 1997.

[2] S. Sun, “Multitask learning for EEG-based biometrics,” in Proc. 19thInt. Conf. Pattern Recognit., Tampa, FL, USA, 2008, pp. 1–4.

[3] X. T. Yuan and S. Yan, “Visual classification with multi-task joint sparserepresentation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. PatternRecognit., San Francisco, CA, USA, 2010, pp. 3493–3500.

[4] S. Parameswaran and K. Q. Weinberger, “Large margin multi-task metriclearning,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1867–1875.

[5] Y. Ji and S. Sun, “Multitask multiclass support vector machines: Modeland experiments,” Pattern Recognit., vol. 46, no. 3, pp. 914–924, 2013.

[6] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” inProc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining,Seattle, WA, USA, 2004, pp. 109–117.

[7] T. Evgeniou, C. Micchelli, and M. Pontil, “Learning multiple tasks withkernel methods,” J. Mach. Learn. Res., vol. 6, no. 1, pp. 615–637, 2005.

[8] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task featurelearning,” Mach. Learn., vol. 73, no. 3, pp. 243–272, 2008.

[9] T. Jebara, “Multi-task feature and kernel selection for SVMs,” inProc. 21st Int. Conf. Mach. Learn., Banff, AB, Canada, Jul. 2004.

[10] A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying, “Universalmulti-task kernels,” J. Mach. Learn. Res., vol. 68, pp. 1615–1646,Jul. 2008.

[11] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile, “Linear algorithms foronline multitask classification,” in Proc. COLT, 2008.

[12] J. Fang, S. Ji, Y. Xue, and L. Carin, “Multitask classification by learningthe task relevance,” IEEE Signal Process. Lett., vol. 15, pp. 593–596,Oct. 2008.

[13] J. Chen, L. Tang, J. Liu, and J. Ye, “A convex formulation for learningshared structures from multiple tasks,” in Proc. ICML, Montreal, QC,Canada, 2009, p. 18.

[14] Q. Gu and J. Zhou, “Learning the shared subspace for multi-task clustering and transductive transfer classification,” in Proc. ICDM, Miami, FL, USA, 2009, pp. 159–168.

[15] Q. Gu, Z. Li, and J. Han, “Learning a kernel for multi-task clustering,” in Proc. AAAI, 2011.

[16] Z. Zhang and J. Zhou, “Multi-task clustering via domain adaptation,” Pattern Recognit., vol. 45, no. 1, pp. 465–473, 2012.

[17] L. Jacob, F. Bach, and J.-P. Vert, “Clustered multi-task learning: A convex formulation,” in Proc. Adv. Neural Inf. Process. Syst., 2008, pp. 745–752.

[18] S. Xie, H. Lu, and Y. He, “Multi-task co-clustering via nonnegative matrix factorization,” in Proc. ICPR, Tsukuba, Japan, 2012, pp. 2954–2958.

[19] J. Zhou, J. Chen, and J. Ye, “Clustered multi-task learning via alternating structure optimization,” in Proc. NIPS, 2011, pp. 702–710.

[20] S. Kong and D. Wang, “A multi-task learning strategy for unsupervised clustering via explicitly separating the commonality,” in Proc. ICPR, Tsukuba, Japan, 2012, pp. 771–774.

[21] B. Bakker and T. Heskes, “Task clustering and gating for Bayesian multitask learning,” J. Mach. Learn. Res., vol. 4, pp. 83–99, May 2003.

[22] F. Cai and V. Cherkassky, “SVM+ regression and multi-task learning,” in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, USA, 2009, pp. 418–424.

[23] H. Wang et al., “A new sparse multi-task regression and feature selection method to identify brain imaging predictors for memory performance,” in Proc. ICCV, 2011, pp. 557–562.

[24] M. Solnon, S. Arlot, and F. Bach, “Multi-task regression using minimal penalties,” J. Mach. Learn. Res., vol. 13, pp. 2773–2812, Sep. 2012.

[25] Y. Zhang and D. Y. Yeung, “Semi-supervised multi-task regression,” in Proc. Eur. Conf. Mach. Learn. Knowl. Discov., Bled, Slovenia, 2009, pp. 617–631.

[26] J. Zhou, L. Yuan, J. Liu, and J. Ye, “A multi-task learning formulation for predicting disease progression,” in Proc. KDD, San Diego, CA, USA, 2011, pp. 814–822.

[27] S. Kim and E. P. Xing, “Tree-guided group lasso for multi-task regression with structured sparsity,” in Proc. 27th Int. Conf. Mach. Learn., Haifa, Israel, 2010, pp. 543–550.

[28] K. Puniyani, S. Kim, and E. P. Xing, “Multi-population GWA mapping via multi-task regularized regression,” Bioinformatics, vol. 26, no. 12, pp. 208–216, 2010.

[29] K. J. Astrom and T. J. McAvoy, “Intelligent control,” J. Process Control, vol. 2, no. 3, pp. 115–127, 1993.

[30] F. L. Chung, Z. H. Deng, and S. T. Wang, “An adaptive fuzzy-inference-rule-based flexible model for automatic elastic image registration,” IEEE Trans. Fuzzy Syst., vol. 17, no. 5, pp. 995–1010, Oct. 2009.

[31] A. H. Sonbol, M. S. Fadali, and S. Jafarzadeh, “TSK fuzzy function approximators: Design and accuracy analysis,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 3, pp. 702–712, Jun. 2012.

[32] Q. Gao, X. J. Zeng, G. Feng, Y. Wang, and J. B. Qiu, “T-S-fuzzy-model-based approximation and controller design for general nonlinear systems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 4, pp. 1143–1154, Aug. 2012.

[33] Z. H. Deng, K. S. Choi, F. L. Chung, and S. T. Wang, “Scalable TSK fuzzy modeling for very large datasets using minimal-enclosing-ball approximation,” IEEE Trans. Fuzzy Syst., vol. 19, no. 2, pp. 210–226, Apr. 2011.

[34] Z. H. Deng, Y. Z. Jiang, K. S. Choi, F. L. Chung, and S. T. Wang, “Knowledge-leverage-based TSK fuzzy system modeling,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 8, pp. 1200–1212, Aug. 2013.

[35] Z. H. Deng, Y. Z. Jiang, F. L. Chung, H. Ishibuchi, and S. T. Wang, “Knowledge-leverage based fuzzy system and its modeling,” IEEE Trans. Fuzzy Syst., vol. 21, no. 4, pp. 597–609, Aug. 2013.

[36] R. K. Ando and T. Zhang, “A framework for learning predictive structures from multiple tasks and unlabeled data,” J. Mach. Learn. Res., vol. 6, pp. 1817–1853, Nov. 2005.

[37] S. J. Pan, J. T. Kwok, and Q. Yang, “Transfer learning via dimensionality reduction,” in Proc. 23rd Nat. Conf. Artif. Intell., vol. 2, 2008, pp. 677–682.

[38] S. Ji, L. Tang, S. Yu, and J. Ye, “A shared-subspace learning framework for multi-label classification,” ACM Trans. Knowl. Discov. Data, vol. 2, no. 1, pp. 1–29, 2010.

[39] G. Obozinski, B. Taskar, and M. I. Jordan, “Joint covariate selection and joint subspace selection for multiple classification problems,” Statist. Comput., vol. 20, no. 2, pp. 231–252, 2010.

[40] J. Zhang, Z. Ghahramani, and Y. Yang, “Flexible latent variable models for multi-task learning,” Mach. Learn., vol. 73, no. 3, pp. 221–242, 2008.

[41] S. Han, X. Liao, and L. Carin, “Cross-domain multitask learning with latent probit models,” in Proc. Int. Conf. Mach. Learn., Edinburgh, U.K., 2012, pp. 1463–1470.

[42] T. Takagi and M. Sugeno, “Fuzzy identification of systems and its application to modeling and control,” IEEE Trans. Syst., Man, Cybern., vol. 15, no. 1, pp. 116–132, Jan./Feb. 1985.

[43] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York, NY, USA: Plenum Press, 1981.

[44] Z. H. Deng, K. S. Choi, F. L. Chung, and S. T. Wang, “Enhanced soft subspace clustering integrating within-cluster and between-cluster information,” Pattern Recognit., vol. 43, no. 3, pp. 767–781, 2010.

[45] J. Leski, “TSK-fuzzy modeling based on ε-insensitive learning,” IEEE Trans. Fuzzy Syst., vol. 13, no. 2, pp. 181–193, Apr. 2005.

[46] J.-S. R. Jang, “ANFIS: Adaptive-network-based fuzzy inference systems,” IEEE Trans. Syst., Man, Cybern., vol. 23, no. 3, pp. 665–685, May 1993.

[47] I. W. Tsang, J. T. Kwok, and J. M. Zurada, “Generalized core vector machines,” IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1126–1140, Sep. 2006.

[48] K. Ito and R. Nakano, “Optimizing support vector regression hyperparameters based on cross-validation,” in Proc. Int. Joint Conf. Neural Netw., 2003, pp. 2077–2082.

[49] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthonormality constraints,” SIAM J. Matrix Anal. Appl., vol. 20, no. 2, pp. 303–353, 1998.

[50] R. Keshavan, A. Montanari, and S. Oh, “Matrix completion from noisy entries,” J. Mach. Learn. Res., vol. 11, pp. 2057–2078, Jul. 2010.

[51] N. Del Buono and T. Politi, “A continuous technique for the weighted low-rank approximation problem,” in Proc. Int. Conf. Comput. Sci. Appl., Assisi, Italy, 2004, pp. 988–997.

[52] C. F. Juang, S. H. Chiu, and S. J. Shiu, “Fuzzy system learned through fuzzy clustering and support vector machine for human skin color segmentation,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 37, no. 6, pp. 1077–1087, Nov. 2007.

[53] C. F. Juang and C. D. Hsieh, “TS-fuzzy system-based support vector regression,” Fuzzy Sets Syst., vol. 160, no. 17, pp. 2486–2504, 2009.

[54] S. Kinoshita, S. Udaka, and M. Shimamoto, “Studies on amino acid fermentation, Part I. Production of L-glutamic acid by various microorganisms,” J. Gen. Appl. Microbiol., vol. 3, no. 6, pp. 193–205, 1957.

[55] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, “Modeling wine preferences by data mining from physicochemical properties,” Decis. Support Syst., vol. 47, no. 4, pp. 547–553, 2009.

[56] D. Kim, S. Sra, and I. S. Dhillon, “A scalable trust-region algorithm with application to mixed-norm regression,” in Proc. ICML, Haifa, Israel, 2010, pp. 519–526.

[57] X. J. Su, L. G. Wu, P. Shi, and Y. D. Song, “Model reduction of Takagi–Sugeno fuzzy stochastic systems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 6, pp. 1574–1585, Dec. 2012.

[58] I.-C. Yeh, “Simulation of concrete slump using neural networks,” Construct. Mater., vol. 162, no. 1, pp. 11–18, 2009.

[59] I.-C. Yeh, “Modeling slump flow of concrete using second-order regressions and artificial neural networks,” Cement Concrete Comp., vol. 29, no. 6, pp. 474–480, 2007.

[60] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, “A dual coordinate descent method for large-scale linear SVM,” in Proc. ICML, Helsinki, Finland, 2008.

[61] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García, “KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” J. Multiple-Valued Logic Soft Comput., vol. 17, nos. 2–3, pp. 255–287, 2011.

Yizhang Jiang (M’12) received the B.S. degree in computer science from Nanjing University of Science and Technology, Nanjing, China, in 2010, and the M.S. degree in computer science from Jiangnan University, Wuxi, China, in 2012. He is currently pursuing the Ph.D. degree with the School of Digital Media, Jiangnan University, Wuxi, China.

He was a Research Assistant with the Computing Department, Hong Kong Polytechnic University, Hong Kong, from May 2013 to January 2014. His current research interests include pattern recognition, intelligent computation, and their applications.

Mr. Jiang has published several papers in international journals, including the IEEE TRANSACTIONS ON FUZZY SYSTEMS and the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.

Fu-Lai Chung (M’95) received the B.Sc. degree from the University of Manitoba, Winnipeg, MB, Canada, in 1987, and the M.Phil. and Ph.D. degrees from the Chinese University of Hong Kong, Hong Kong, in 1991 and 1995, respectively.

In 1994, he joined the Department of Computing, Hong Kong Polytechnic University, where he is currently an Associate Professor. His current research interests include transfer learning, social network analysis and mining, kernel learning, dimensionality reduction, and big data learning. He has authored or co-authored over 80 journal papers published in the areas of soft computing, data mining, machine intelligence, and multimedia.

Hisao Ishibuchi (M’93–F’14) received the B.S. and M.S. degrees in precision mechanics from Kyoto University, Kyoto, Japan, in 1985 and 1987, respectively, and the Ph.D. degree from Osaka Prefecture University, Osaka, Japan, in 1992.

Since 1999, he has been a Full Professor with Osaka Prefecture University. His current research interests include artificial intelligence, neural fuzzy systems, and data mining.

Dr. Ishibuchi is on the editorial boards of several journals, including the IEEE TRANSACTIONS ON FUZZY SYSTEMS and the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B.

Zhaohong Deng (M’12–SM’14) received the B.S. degree in physics from Fuyang Normal College, Fuyang, China, in 2002, and the Ph.D. degree in information technology and engineering from Jiangnan University, Wuxi, China, in 2008.

He is currently an Associate Professor with the School of Digital Media, Jiangnan University. He has visited the University of California-Davis, Davis, CA, USA, and the Hong Kong Polytechnic University, Hong Kong, for over two years. His current research interests include neuro-fuzzy systems, pattern recognition, and their applications. He has authored or coauthored over 50 research papers in international/national journals.

Shitong Wang received the M.S. degree in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1987.

He has visited London University, London, U.K., Bristol University, Bristol, U.K., Hiroshima International University, Hiroshima, Japan, Osaka Prefecture University, Osaka, Japan, Hong Kong University of Science and Technology, Hong Kong, and Hong Kong Polytechnic University, Hong Kong, as a Research Scientist, for over six years. He is currently a Full Professor with the School of Digital Media, Jiangnan University, Wuxi, China. His current research interests include artificial intelligence, neuro-fuzzy systems, pattern recognition, and image processing. He has published over 100 papers in international/national journals and has authored seven books.