Distributed Momentum-Based Frank-Wolfe Algorithm for Stochastic Optimization

IEEE/CAA Journal of Automatica Sinica, 2023, Issue 3

Jie Hou, Xianlin Zeng, Gang Wang, Jian Sun, Senior Member, IEEE, and Jie Chen

Abstract—This paper considers distributed stochastic optimization, in which a number of agents cooperate to optimize a global objective function through local computations and information exchanges with neighbors over a network. Stochastic optimization problems are usually tackled by variants of projected stochastic gradient descent. However, projecting a point onto a feasible set is often expensive. The Frank-Wolfe (FW) method has well-documented merits in handling convex constraints, but existing stochastic FW algorithms are basically developed for centralized settings. In this context, the present work puts forth a distributed stochastic Frank-Wolfe solver, by judiciously combining Nesterov’s momentum and gradient tracking techniques for stochastic convex and nonconvex optimization over networks. It is shown that the convergence rate of the proposed algorithm is O(1/√k) for convex optimization, and O(1/log2(k)) for nonconvex optimization. The efficacy of the algorithm is demonstrated by numerical simulations against a number of competing alternatives.

I. INTRODUCTION

DISTRIBUTED stochastic optimization is a basic problem that arises widely in diverse engineering applications, including unmanned systems [1]–[3], distributed machine learning [4], and multi-agent reinforcement learning [5]–[7], to name a few. The goal is to minimize a shared objective function, which is defined as the expectation of a set of stochastic functions subject to general convex constraints, by means of local computations and information exchanges between working agents.
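The formal statement of problem (1) did not survive extraction; a reconstruction consistent with the description above (the symbols F, F_i, f_i, X, ξ_i and the agent count n are assumed here, not taken from the original display) is

$\min_{x\in\mathcal{X}}\; F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} F_i(x), \qquad F_i(x) \triangleq \mathbb{E}_{\xi_i}\!\left[f_i(x,\xi_i)\right], \qquad (1)$

where X ⊂ R^d is a closed convex constraint set and ξ_i is the random variable modeling the data available to agent i.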

A popular approach to solving problem (1) is the projected stochastic gradient descent (pSGD) [12]–[14]. In pSGD and its variants, the iteration variable is projected back onto X after taking a step in the direction of the negative stochastic gradient [15]–[20]. Such algorithms are efficient when the computational cost of performing the projection is low, e.g., when projecting onto a hypercube or simplex. In many practical situations of interest, however, the cost of projecting onto X can be high, e.g., when dealing with a trace norm ball or a base polytope X in submodular minimization [21].
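For reference, a single pSGD step in one of the cheap cases mentioned above (projection onto a hypercube) reduces to a coordinate-wise clip; the function below is a generic illustration with assumed names, not code from any of the cited works.

```python
import numpy as np

def psgd_step_box(x, stoch_grad, lr, lo=-1.0, hi=1.0):
    """One projected-SGD step onto the hypercube [lo, hi]^d.

    Projection onto a box is just a coordinate-wise clip, so it is cheap.
    For sets such as a trace-norm ball, the projection instead requires a
    full SVD per iteration, which is the cost FW-type methods avoid.
    """
    return np.clip(x - lr * stoch_grad, lo, hi)
```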

An alternative for tackling problem (1) is projection-free methods, including the Frank-Wolfe (FW) method [22] and conditional gradient sliding [23]. In this paper, we focus on the FW algorithm, which is also known as the conditional gradient method [22]. Classical FW methods circumvent the projection step by first solving a linear minimization subproblem over the constraint set X to obtain a sort of conditional gradient θ_k, and then updating x_{k+1} through a convex combination of the current iterate x_k and θ_k. On top of this idea, a number of modifications have been proposed to improve or accelerate the FW method in algorithm design or convergence analysis; see, e.g., [8]–[10] and [24]–[31].
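As an illustration of why the linear subproblem can be far cheaper than a projection, the sketch below performs one classical FW step over a trace-norm ball (an assumed example set; the function and variable names are illustrative). The linear minimization only needs the leading singular pair of the gradient, whereas projecting onto the same ball requires a full SVD followed by thresholding.

```python
import numpy as np

def fw_step_tracenorm(X, G, radius, eta):
    """One classical FW step over {X : ||X||_* <= radius}.

    The conditional gradient theta_k = argmin_{Phi in the ball} <G, Phi> is
    the rank-one matrix -radius * u1 v1^T built from the leading singular
    pair of G.  (A full SVD is used below only for brevity; in practice a
    power/Lanczos method computing just (u1, v1) suffices.)
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    theta = -radius * np.outer(U[:, 0], Vt[0, :])   # conditional gradient
    return X + eta * (theta - X)   # x_{k+1}: convex combination of x_k and theta_k
```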

Nonetheless, most existing pSGD and Frank-Wolfe variants are designed for constrained centralized problems, and they cannot directly handle distributed problems. Therefore, it is necessary to develop distributed stochastic projection-free methods for problem (1). In addition, stochastic FW methods may not converge, even in the centralized convex case, without increasing the batch size [8]. In this context, a natural question arises: is it possible to develop a distributed FW method for problem (1) that uses an arbitrary fixed batch size, while enjoying a convergence rate comparable to that of centralized stochastic FW methods? In this paper, we answer this question affirmatively by carefully designing a distributed stochastic FW algorithm that converges for any fixed batch size (which can be as small as 1) and enjoys a convergence rate comparable to the centralized stochastic setting.

A. Related Work

Although considerable results have been reported for distributed FW in deterministic settings, they cannot be directly applied to, or generalized to, stochastic settings. The reason is twofold: 1) FW may diverge due to the non-vanishing variance of the gradient estimates; and 2) the convergence rate of FW for stochastic optimization is not guaranteed to be comparable to that of pSGD, even in the centralized setting.

To address these challenges, the present paper puts forth a distributed stochastic version of the celebrated FW algorithm for stochastic optimization over networks. The main idea behind our proposal is a judicious combination of the recursive momentum [39] and Nesterov’s momentum [40]. On the theory side, it is shown that the proposed algorithm not only attenuates the noise in the gradient approximation, but also achieves a convergence guarantee comparable to pSGD in the convex case. A comparison of the proposed algorithm with existing methods is provided in Table I.
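To fix ideas, the recursive momentum estimator of [39] maintains a gradient surrogate of the generic form

$d^k = \nabla f(x^k,\xi^k) + (1-\gamma_k)\big(d^{k-1} - \nabla f(x^{k-1},\xi^k)\big),$

so that the estimation error contracts across iterations and the gradient noise is attenuated without resorting to large batches; step (3) of Algorithm 1 below is algebraically this update, applied by each agent at its locally averaged iterate.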

TABLE I CONVERGENCE RATE FOR STOCHASTIC OPTIMIZATION

B. Our Contributions

In succinct form, the contributions of this work are summarized as follows.

1) We propose a projection-free algorithm, referred to as the distributed momentum-based Frank-Wolfe (DMFW) algorithm, for convex and nonconvex stochastic optimization over networks. Compared with the centralized FW methods [8]–[10], [28] and [31], DMFW differs considerably in both algorithm design and convergence analysis.

II. PRELIMINARIES AND ALGORITHM DESIGN

A. Notation and Preliminaries

B. Algorithm Design

To solve problem (1), we propose a distributed momentum-based Frank-Wolfe algorithm, which is summarized in Algorithm 1.

Algorithm 1 Distributed Momentum-Based Frank-Wolfe
Input: Number of iterations K, initial conditions x_i^1 ∈ X and y_i^1 = ∇f_i(x̂_i^1, ξ_i^1) = s_i^1 for all i ∈ N.
1: for k = 1, 2, ..., K do
2: Average consensus:
   x̂_i^k = ∑_{j∈N_i} c_ij x_j^k   (2)
   where N_i is the set of neighbors of node i.
3: Momentum update:
   y_i^k = (1 − γ_k) y_i^{k−1} + ∇f_i(x̂_i^k, ξ_i^k) − (1 − γ_k) ∇f_i(x̂_i^{k−1}, ξ_i^k)   (3)
   where γ_k ∈ (0, 1] is a step size.
4: Gradient tracking:
   s_i^k = ∑_{j∈N_i} c_ij s_j^{k−1} + y_i^k − y_i^{k−1}   (4)
   p_i^k = ∑_{j∈N_i} c_ij s_j^k   (5)
5: Frank-Wolfe step:
   θ_i^k ∈ argmin_{ϕ∈X} ⟨p_i^k, ϕ⟩   (6)
   x_i^{k+1} = x̂_i^k + η_k (θ_i^k − x̂_i^k)   (7)
   where η_k ∈ (0, 1] is a step size.
6: end for
7: return x_i^{K+1} for all i ∈ N.
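To make the information flow of Algorithm 1 concrete, below is a minimal, self-contained Python sketch of DMFW on a toy problem. Everything problem-specific is an illustrative assumption rather than the paper's setup: the local least-squares losses, the l1-ball constraint and its closed-form linear minimization oracle, the ring-topology mixing matrix C, and the step sizes γ_k = η_k = 1/(k+1). The comments map back to steps (2)–(7).

```python
import numpy as np

rng = np.random.default_rng(0)

# ----- Toy setup: every name and constant below is illustrative ------------
n_agents, dim, n_samples, radius = 5, 20, 50, 5.0

# Agent i holds private least-squares data; f_i(x, xi) = 0.5 * (a^T x - b)^2.
A = [rng.normal(size=(n_samples, dim)) for _ in range(n_agents)]
x_true = rng.normal(size=dim)
B = [Ai @ x_true + 0.1 * rng.normal(size=n_samples) for Ai in A]

def stoch_grad(i, x, idx):
    """Stochastic gradient of agent i's loss at x, using sample index idx."""
    a, b = A[i][idx], B[i][idx]
    return a * (a @ x - b)

def lmo_l1(g, r):
    """Linear minimization oracle over the l1-ball of radius r:
    argmin_{||v||_1 <= r} <g, v> is a signed, scaled coordinate vertex."""
    v = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    v[j] = -r * np.sign(g[j])
    return v

# Doubly stochastic mixing matrix C on a ring topology (Assumptions 1 and 2).
C = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    C[i, i] = 0.5
    C[i, (i - 1) % n_agents] = 0.25
    C[i, (i + 1) % n_agents] = 0.25

# ----- DMFW iterations, following (2)-(7) of Algorithm 1 -------------------
K = 500
x = np.zeros((n_agents, dim))              # x_i^1 in X (0 is feasible here)
x_hat_prev = x.copy()
y = np.array([stoch_grad(i, x[i], rng.integers(n_samples))
              for i in range(n_agents)])   # y_i^1: one stochastic gradient
s = y.copy()                               # s_i^1 = y_i^1

for k in range(1, K + 1):
    gamma, eta = 1.0 / (k + 1), 1.0 / (k + 1)   # illustrative step sizes
    x_hat = C @ x                               # (2) average consensus
    y_new = np.empty_like(y)
    for i in range(n_agents):
        idx = rng.integers(n_samples)           # fresh local sample xi_i^k
        # (3) recursive-momentum update of the local gradient estimate
        y_new[i] = ((1 - gamma) * y[i]
                    + stoch_grad(i, x_hat[i], idx)
                    - (1 - gamma) * stoch_grad(i, x_hat_prev[i], idx))
    s_new = C @ s + (y_new - y)                 # (4) gradient tracking
    p = C @ s_new                               # (5) tracked direction p_i^k
    for i in range(n_agents):
        theta = lmo_l1(p[i], radius)            # (6) linear minimization step
        x[i] = x_hat[i] + eta * (theta - x_hat[i])   # (7) Frank-Wolfe update
    x_hat_prev, y, s = x_hat, y_new, s_new

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
```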

III. MAIN RESULTS

In this section, we establish the convergence results of the proposed algorithm for convex and nonconvex problems, respectively. Before providing the results, we outline some standing assumptions and facts.

A. Assumptions and Facts

Assumption 1 (Weight rule): The weighted adjacency matrix C is a doubly stochastic matrix, i.e., the row sums and the column sums of C are all equal to 1.

Assumption 1 indicates that, in each round of the Average consensus step of Algorithm 1, each agent takes a weighted average of the values from its neighbors according to C.

Assumption 2 (Connectivity): The network G is connected.
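Assumptions 1 and 2 can be satisfied constructively from the graph topology alone. As an illustration (the paper does not prescribe a particular weight rule, so the choice below is an assumption), the Metropolis-Hastings rule yields a symmetric, doubly stochastic C for any connected undirected graph:

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic, symmetric weight matrix for an undirected graph.

    adj: (n, n) 0/1 symmetric adjacency matrix of a connected graph with no
    self-loops.  The returned C satisfies Assumption 1.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                C[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        C[i, i] = 1.0 - C[i].sum()   # put the remaining mass on the diagonal
    return C

# Example: a 4-node ring (connected, so Assumption 2 holds as well).
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
C = metropolis_weights(adj)
assert np.allclose(C.sum(axis=0), 1) and np.allclose(C.sum(axis=1), 1)
```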

B. Convergence Rate for Convex Stochastic Optimization

This subsection is dedicated to the performance analysis of Algorithm 1. Let us start by defining the following auxiliary vectors:
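The definitions themselves were lost in extraction; judging from the barred quantities that appear later (e.g., p̄^k and ȳ^k in Fig. 2 and the appendices), they are presumably the network averages of the local variables, along the lines of

$\bar{x}^k = \frac{1}{n}\sum_{i=1}^{n}\hat{x}_i^k,\qquad \bar{y}^k = \frac{1}{n}\sum_{i=1}^{n} y_i^k,\qquad \bar{p}^k = \frac{1}{n}\sum_{i=1}^{n} p_i^k.$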

C. Convergence Rate for Nonconvex Optimization

This subsection provides the convergence rate of the proposed DMFW for (1) with nonconvex objective functions. To show the convergence performance of DMFW in the nonconvex case, we introduce the FW-gap, which is defined as
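The display for (15) was lost in extraction; the standard FW-gap at a point x ∈ X, which (15) presumably instantiates at the algorithm's iterates, is

$g(x) = \max_{\theta\in\mathcal{X}} \langle \nabla F(x),\, x-\theta\rangle,$

which is nonnegative on X and vanishes exactly at the stationary points of (1), making it a natural optimality measure in the nonconvex case.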

IV. NUMERICAL TESTS

A. Binary Classification

TABLE II REAL DATA FOR BLACK-BOX BINARY CLASSIFICATION

Fig. 1. Multi-agent communication topology.

Fig. 2. The error ‖p̄^k − ȳ^k‖ of DMFW on the a9a dataset.

Fig. 4 shows the FW-gap of SFW, MSHFW, DeFW and DMFW for solving the nonconvex problem (16). From the results, it can be observed that the stochastic algorithms (SFW, DMFW and MSHFW) perform better than the deterministic algorithm DeFW on all tested datasets. This indicates that stochastic FW algorithms are more efficient than deterministic FW algorithms in solving the nonconvex problem (16). Comparing DMFW with the centralized algorithms MSHFW and SFW, DMFW slightly outperforms SFW, especially on the a9a dataset, but is slower than MSHFW.

B. Stochastic Ridge Regression

V. CONCLUSIONS

Fig. 3. The comparison between SFW, MSHFW, DeFW and DMFW on three datasets. (a) covtype.binary dataset; (b) a9a dataset; (c) w8a dataset.

Fig. 4. The comparison between SFW, MSHFW, DeFW and DMFW on three datasets. (a) covtype.binary dataset; (b) a9a dataset; (c) w8a dataset.

Fig. 5. The comparison between SFW, MSHFW and DMFW. (a) l1-norm ball constraint; (b) l2-norm ball constraint; (c) l5-norm ball constraint.

APPENDIX A TECHNICAL LEMMAS

Before proving Lemma 1, we first give the following technical lemmas.

APPENDIX B PROOF OF LEMMA 1

APPENDIX C PROOF OF LEMMA 2

APPENDIX D PROOF OF LEMMA 3

APPENDIX E PROOF OF LEMMA 4

Proof: Adding and subtracting p̄^k and ȳ^k to ‖∇F(x̄^k) − p_i^k‖², we have

APPENDIX G PROOF OF THEOREM 2

Proof: It follows from the definition of the FW-gap (15) that: