For as long as I can remember, every single Wednesday in Berkeley has been the occasion for a “theory lunch”: the theory group gets together around a slice of world-famous pizza from the Cheese Board Collective, or some other dish from one of the excellent Berkeley restaurants selected by the good care of Neha — today it was chicken pot pie and Halloween-themed cupcakes.

We enjoy the food and chat for half an hour or so, before getting down to business: someone, usually a member of the group but sometimes a visitor — last week Anup Rao told us about his work on recovering from adversarial error in Boolean circuits — gives a half-hour whiteboard talk. The specific constraints imposed by the lunch format (shorter talk, irresistible urge to nap) make these talks all the more enjoyable, as the speaker is aware that he needs to take particular care in keeping his audience engaged. The layout of the room also makes it particularly conducive to questions and interruptions.

Today I was up, and I talked about recent work with Assaf Naor and Oded Regev (see here for the recently uploaded paper) on “efficient rounding of the non-commutative Grothendieck inequality”. Since I didn’t have time to cover all the material I was hoping to discuss, and since I very much like the work we did in that paper, I will give a brief introduction to the results — in the format of an extended theory lunch talk — below.

**0. The non-commutative Grothendieck optimization problem.**

Our starting point is the following optimization problem. Given a four-dimensional table of real numbers $latex {M = (M_{ijkl}) \in \mathbb{R}^{n\times n\times n\times n}}&fg=000000$, the goal is to compute

$latex \displaystyle \mathrm{OPT}(M) \,:=\, \sup_{U,V \in O_n}\, \sum_{i,j,k,l=1}^n M_{ijkl}\, U_{ij} V_{kl}, \ \ \ \ \ (1)&fg=000000$

where $latex {O_n}&fg=000000$ denotes the set of orthogonal matrices, i.e. the set of all real $latex {n\times n}&fg=000000$ matrices $latex {U}&fg=000000$ satisfying $latex {UU^T = U^T U = \mathrm{Id}}&fg=000000$. Computing $latex {\mathrm{OPT}(M)}&fg=000000$ is a quadratic optimization problem, and in general we don’t expect there to be an efficient exact algorithm. Below I will describe a randomized polynomial-time constant-factor approximation algorithm for this problem. First let’s see some interesting classes of examples that can be put in the form of (1).
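As a quick sanity check of this formulation, here is a small numerical sketch (Python with NumPy; the function names are mine, not from the paper) that evaluates the objective of (1) at a random pair of orthogonal matrices:

```python
import numpy as np

def objective(M, U, V):
    """Evaluate the objective of (1): sum_{ijkl} M[i,j,k,l] * U[i,j] * V[k,l]."""
    return np.einsum("ijkl,ij,kl->", M, U, V)

def random_orthogonal(n, rng):
    """A random orthogonal matrix, from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution Haar-uniform

rng = np.random.default_rng(0)
n = 3
M = rng.standard_normal((n, n, n, n))
U = random_orthogonal(n, rng)
V = random_orthogonal(n, rng)
print(objective(M, U, V))  # the value of one feasible point, a lower bound on OPT(M)
```

Computing $latex {\mathrm{OPT}(M)}&fg=000000$ itself requires optimizing over $latex {U, V}&fg=000000$; the point here is only that feasible points are easy to generate and evaluate.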

**1. Some applications of (1)**

**(i) The commutative Grothendieck inequality.** To start with a simple example, suppose $latex {M}&fg=000000$ is “diagonal”, meaning $latex {M_{ijkl} = 0}&fg=000000$ unless $latex {i=j}&fg=000000$ and $latex {k=l}&fg=000000$; write $latex {A_{ik} := M_{iikk}}&fg=000000$. Then the only coefficients of $latex {U}&fg=000000$ or $latex {V}&fg=000000$ that appear in (1) are the diagonal coefficients $latex {U_{ii}}&fg=000000$ and $latex {V_{kk}}&fg=000000$. (Note that diagonal matrices *commute* — this is the reason for the commutative/noncommutative distinction in the inequalities’ name.) In that case we can rewrite (1) as

$latex \displaystyle \mathrm{OPT}(M) \,=\, \max_{\varepsilon,\,\delta \in \{\pm 1\}^n}\, \sum_{i,k=1}^n A_{ik}\,\varepsilon_i \delta_k. \ \ \ \ \ (2)&fg=000000$

To see why the equality holds, first observe that for any $latex {\varepsilon,\delta \in \{\pm 1\}^n}&fg=000000$ the matrices $latex {U = \mathrm{diag}(\varepsilon)}&fg=000000$, $latex {V = \mathrm{diag}(\delta)}&fg=000000$ are orthogonal: $latex {\mathrm{OPT}(M)}&fg=000000$ is at least as large as the right-hand side of (2). Conversely, for any orthogonal $latex {U,V}&fg=000000$ their diagonal coefficients must be in $latex {[-1,1]}&fg=000000$, so that starting from any $latex {U,V}&fg=000000$ which optimize (1) we can obtain an assignment of values in $latex {[-1,1]}&fg=000000$ to the variables in the right-hand side of (2) achieving the value $latex {\mathrm{OPT}(M)}&fg=000000$. Using linearity, it is clear that the optimum over $latex {[-1,1]^{2n}}&fg=000000$ is achieved at a choice of values in $latex {\{\pm 1\}^{2n}}&fg=000000$, hence equality holds.
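This diagonal case is easy to check numerically for small $latex {n}&fg=000000$: for a diagonal $latex {M}&fg=000000$ the objective only sees the diagonals of $latex {U}&fg=000000$ and $latex {V}&fg=000000$, which lie in $latex {[-1,1]^n}&fg=000000$, so a brute-force search over signs always dominates. (A quick sketch; everything below is my own illustration, not code from the paper.)

```python
import itertools
import numpy as np

def sign_opt(A):
    """Brute-force max of sum_{ik} A[i,k]*eps[i]*delta[k] over signs in {-1,1}."""
    n = A.shape[0]
    best = -np.inf
    for eps in itertools.product([-1, 1], repeat=n):
        for delta in itertools.product([-1, 1], repeat=n):
            best = max(best, np.einsum("ik,i,k->", A, eps, delta))
    return best

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
# For diagonal M (only M[i,i,k,k] = A[i,k] nonzero) the objective in (1)
# depends only on the diagonals of the orthogonal matrices U and V:
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
val = np.einsum("ik,i,k->", A, np.diag(U), np.diag(V))
print(val <= sign_opt(A) + 1e-9)  # True: a sign assignment always does at least as well
```

The final comparison prints `True` for any orthogonal pair, since a function that is linear in each block of variables over the box $latex {[-1,1]^{2n}}&fg=000000$ attains its maximum at a vertex.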

The problem of evaluating (2) must be familiar to many. For a general real matrix $latex {A}&fg=000000$, the quantity defined in (2) is closely related to the CUT-NORM of $latex {A}&fg=000000$, which is defined as (2) except that the maximum is taken over all $latex {\varepsilon_i, \delta_k \in \{0,1\}}&fg=000000$. The CUT-NORM plays an important role in the work of Frieze and Kannan on the design of approximation algorithms for dense graph problems. Alon and Naor [AN04] gave a polynomial-time algorithm (in fact, *three* algorithms!) giving a constant-factor approximation to (2). Their approach is based on exploiting *Grothendieck’s inequality* [Gro53], which states that the supremum

$latex \displaystyle \mathrm{SDP}(A) \,:=\, \sup_{x_i,\,y_k}\, \sum_{i,k=1}^n A_{ik}\,\langle x_i, y_k\rangle, \ \ \ \ \ (3)&fg=000000$

taken over all vectors $latex {x_i, y_k}&fg=000000$ of norm at most $latex {1}&fg=000000$, satisfies

$latex \displaystyle \mathrm{SDP}(A) \,\leq\, K_G\, \max_{\varepsilon,\,\delta \in \{\pm 1\}^n}\, \sum_{i,k=1}^n A_{ik}\,\varepsilon_i\delta_k, \ \ \ \ \ (4)&fg=000000$

where $latex {K_G}&fg=000000$ is the real *Grothendieck constant*. (Note that by choosing $latex {x_i = \varepsilon_i e}&fg=000000$ and $latex {y_k = \delta_k e}&fg=000000$, for $latex {e}&fg=000000$ a fixed unit vector, one sees that the supremum on the left-hand side is always at least as large as the maximum on the right-hand side; moreover we can always assume the vectors have dimension at most $latex {2n}&fg=000000$.) Since the supremum on the left-hand side can be cast as a semidefinite program (SDP), it can be computed in polynomial time. To obtain an approximation algorithm one needs to show how a vector solution to the SDP can be *rounded* to a good assignment to the variables $latex {\varepsilon_i, \delta_k}&fg=000000$, and this is the content of the work of Alon and Naor: they show how different *proofs* of (4) can be turned into *algorithms*.

Alon and Naor also showed, by a reduction from MAXCUT, that the problem of approximating (2), and hence (1), to within some constant factor is NP-hard. We will return to (4) soon; first let’s see some other special instances of (1).

**(ii) Robust PCA.** In principal component analysis (PCA), one is given a set of points $latex {a_1,\ldots,a_m \in \mathbb{R}^n}&fg=000000$. The goal is to find a direction along which the points have the highest variance, i.e. to compute a unit vector $latex {x}&fg=000000$ achieving

$latex \displaystyle \max_{\|x\|=1}\, \sum_{i=1}^m \langle a_i, x\rangle^2 \,=\, \|A\|^2,&fg=000000$

where $latex {A}&fg=000000$ is the $latex {m \times n}&fg=000000$ matrix with rows the $latex {a_i}&fg=000000$ and $latex {\|A\|}&fg=000000$ is the operator norm (the largest singular value) of $latex {A}&fg=000000$. More generally, for any $latex {k \leq n}&fg=000000$ one could ask for a $latex {k}&fg=000000$-dimensional subspace on which the projections of the $latex {a_i}&fg=000000$ have the largest “spread”:

$latex \displaystyle \max\, \sum_{i=1}^m \sum_{j=1}^k \langle a_i, x_j\rangle^2 \qquad \text{s.t.}\quad \langle x_j, x_{j'}\rangle = \delta_{jj'}, \ \ \ \ \ (5)&fg=000000$

where $latex {\delta_{jj'}}&fg=000000$ is the Kronecker symbol. This problem can be solved in polynomial time: the optimal $latex {x_1,\ldots,x_k}&fg=000000$ are the right singular vectors of $latex {A}&fg=000000$ corresponding to its $latex {k}&fg=000000$ largest singular values (with multiplicities). One drawback of (5) in practice is that it tends to be very sensitive to “outliers”: if any one of the points $latex {a_i}&fg=000000$ has a much larger norm than the others, it will have a disproportionate effect on the optimal solution. To overcome this issue one can replace the $latex {\ell_2}&fg=000000$ norm used in (5) by a less sensitive measure, such as the $latex {\ell_1}&fg=000000$ norm. The resulting optimization problem is called L1-PCA:

$latex \displaystyle \max\, \sum_{i=1}^m \sum_{j=1}^k |\langle a_i, x_j\rangle| \qquad \text{s.t.}\quad \langle x_j, x_{j'}\rangle = \delta_{jj'}. \ \ \ \ \ (6)&fg=000000$

To the best of my knowledge, prior to our work the best algorithm for the problem of computing (6) was due to McCoy and Tropp [MT12], whose approximation factor grows with the dimensions of the problem. (They also obtain a constant-factor approximation algorithm for the case $latex {k=1}&fg=000000$.)
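The outlier sensitivity is easy to see in a toy experiment (my own illustration, not from the paper). For $latex {k=1}&fg=000000$ and points in the plane, the L1-PCA objective can be maximized by a simple grid search over directions:

```python
import numpy as np

def l2_pca_dir(A):
    """Top right singular vector of A: maximizes the sum of squared projections."""
    return np.linalg.svd(A)[2][0]

def l1_pca_dir_2d(A, steps=10000):
    """Grid search over the unit circle for max_v sum_i |<a_i, v>| (k=1, n=2)."""
    thetas = np.linspace(0, np.pi, steps)
    V = np.stack([np.cos(thetas), np.sin(thetas)])  # shape (2, steps)
    scores = np.abs(A @ V).sum(axis=0)
    return V[:, scores.argmax()]

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 2)) * [3.0, 0.3]  # bulk of points spread along the x-axis
A_out = np.vstack([A, [0.0, 30.0]])            # one extreme outlier along the y-axis
print(abs(l2_pca_dir(A_out)[1]))   # close to 1: the L2 direction is grabbed by the outlier
print(abs(l1_pca_dir_2d(A_out)[0]))  # close to 1: the L1 direction stays with the bulk
```

The $latex {\ell_2}&fg=000000$ objective is dominated by the single outlier, while the $latex {\ell_1}&fg=000000$ objective still picks out the direction of the bulk of the points.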

To formulate (6) as an instance of (1), observe that $latex {|t| = \max_{\varepsilon \in \{\pm 1\}} \varepsilon\, t}&fg=000000$ for any real $latex {t}&fg=000000$, so that (6) can be rewritten as

$latex \displaystyle \max\, \sum_{i=1}^m \sum_{j=1}^k \varepsilon_{ij}\, \langle a_i, x_j\rangle \qquad \text{s.t.}\quad \varepsilon_{ij} \in \{\pm 1\},\ \langle x_j, x_{j'}\rangle = \delta_{jj'}. \ \ \ \ \ (7)&fg=000000$

Introduce two matrix variables: an orthogonal matrix $latex {U}&fg=000000$ whose diagonal is meant to carry the signs $latex {\varepsilon_{ij}}&fg=000000$, and an orthogonal matrix $latex {V}&fg=000000$ whose first $latex {k}&fg=000000$ rows are meant to carry the $latex {x_j}&fg=000000$, and define a table $latex {M}&fg=000000$ by setting the only nonzero entries to the corresponding coefficients of the points $latex {a_i}&fg=000000$. (If desired we may make all four dimensions of $latex {M}&fg=000000$ equal by padding with zeros.) With this choice of $latex {M}&fg=000000$, and $latex {U, V}&fg=000000$ as defined above, it is clear that $latex {\mathrm{OPT}(M)}&fg=000000$ is at least as large as the optimum in (7). Conversely, from any orthogonal matrices achieving the optimum in (1) we may obtain values $latex {\varepsilon_{ij} \in [-1,1]}&fg=000000$ and a matrix $latex {Y}&fg=000000$ such that $latex {YY^T \leq \mathrm{Id}}&fg=000000$ achieving the same value in (7). Using linearity, it is then not hard to see that there will exist a choice of $latex {\varepsilon'_{ij} \in \{\pm 1\}}&fg=000000$ and $latex {Y'}&fg=000000$ such that $latex {Y'(Y')^T = \mathrm{Id}}&fg=000000$ that achieve at least as high a value. Hence any constant-factor approximation algorithm for (1) immediately implies a constant-factor approximation algorithm, with the same constant, for the problem of computing (6).

As a remark, we could also have considered the following variant, called R1-PCA and introduced by Ding et al.:

$latex \displaystyle \max\, \sum_{i=1}^m \Big(\sum_{j=1}^k \langle a_i, x_j\rangle^2\Big)^{1/2} \qquad \text{s.t.}\quad \langle x_j, x_{j'}\rangle = \delta_{jj'}. \ \ \ \ \ (8)&fg=000000$

R1-PCA is perhaps more natural than L1-PCA in that it measures the length of the projection of each $latex {a_i}&fg=000000$ on the subspace in the $latex {\ell_2}&fg=000000$ norm, but then sums the contributions of the different points using the $latex {\ell_1}&fg=000000$ norm (i.e. it takes the sum of the norms instead of the sum of their squares). I leave it as an exercise to the reader to show that (8) can also be formulated as an instance of (1).

**(iii) The generalized orthogonal Procrustes problem.** Suppose we are given $latex {k}&fg=000000$ sets of $latex {m}&fg=000000$ points in $latex {\mathbb{R}^n}&fg=000000$ each: $latex {\{a_1^{(s)},\ldots,a_m^{(s)}\}}&fg=000000$ for $latex {s = 1,\ldots,k}&fg=000000$. The goal is to find rotations that map the sets of points to each other as closely as possible:

$latex \displaystyle \min_{U^{(1)},\ldots,U^{(k)} \in O_n}\, \sum_{i=1}^m \sum_{s,t=1}^k \big\| U^{(s)} a_i^{(s)} - U^{(t)} a_i^{(t)} \big\|^2. \ \ \ \ \ (9)&fg=000000$

The study of this problem goes back to Greek mythology, where the bandit Procrustes would rotate, stretch (and even amputate!) his guests in order to make them “fit” the bed he had prepared for them. In more recent times it has been studied in the psychometrics (the theory of psychological measurement) literature since the 70s, and it also plays an important role in shape analysis. For instance, given a large set of sample butterfly shapes corresponding to a certain species, solving (9) lets one compute a “group average”

$latex \displaystyle \bar a_i \,=\, \frac{1}{k} \sum_{s=1}^k U^{(s)} a_i^{(s)},&fg=000000$

against which any newly observed specimen can be compared in order to ascertain whether it comes from the same species (see e.g. these notes for more explanations and nice pictures).

The problem of computing (9) can be solved efficiently when $latex {k=2}&fg=000000$ by performing the singular value decomposition of $latex {(A^{(1)})^T A^{(2)}}&fg=000000$, where $latex {A^{(s)}}&fg=000000$ is the matrix whose rows are given by the points $latex {a_i^{(s)}}&fg=000000$, but it is known to be NP-hard as soon as $latex {k \geq 3}&fg=000000$. Nemirovski [Nem07] gave a polynomial-time algorithm giving a polynomial-factor approximation to (9), and this was improved by So [So11] to a polylogarithmic-factor approximation.
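For $latex {k=2}&fg=000000$ the classical SVD solution can be sketched as follows (standard textbook material, not specific to the paper): to minimize $latex {\|AU - B\|_F}&fg=000000$ over orthogonal $latex {U}&fg=000000$, take the SVD of $latex {A^T B}&fg=000000$.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal U minimizing ||A U - B||_F, via the SVD of A^T B."""
    W, _, Vt = np.linalg.svd(A.T @ B)
    return W @ Vt

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a hidden rotation
B = A @ Q                                         # second point set: rotated copy of the first
U = procrustes_rotation(A, B)
print(np.linalg.norm(A @ U - B))  # ~0: the SVD recovers the hidden rotation exactly
```

When the second set is an exact rotation of the first, as here, the recovered $latex {U}&fg=000000$ equals the hidden rotation; with noisy data it returns the best rigid alignment in the least-squares sense.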

Expanding the square and ignoring constant terms we get that the problem of computing (9) is equivalent to evaluating

$latex \displaystyle \max_{U^{(1)},\ldots,U^{(k)} \in O_n}\, \sum_{i=1}^m \sum_{s \neq t} \big\langle U^{(s)} a_i^{(s)},\, U^{(t)} a_i^{(t)} \big\rangle. \ \ \ \ \ (10)&fg=000000$

By introducing two matrix variables that contain the $latex {U^{(s)}}&fg=000000$ as diagonal blocks and defining an appropriate table $latex {M}&fg=000000$ one can rewrite (10) as an instance of (1); I leave the details as an exercise. As a result, a constant-factor approximation algorithm for (1) again translates directly to an approximation algorithm for (10) (with the same constant).

**(iv) Quantum XOR games.** Let me briefly mention one last application, which is how I initially got interested in (1). A *quantum XOR game* is a game played by two cooperating players against a referee. The game is specified by a collection of quantum states $latex {\rho_1,\ldots,\rho_N}&fg=000000$, where each $latex {\rho_t}&fg=000000$ is a density matrix on $latex {\mathbb{C}^n \otimes \mathbb{C}^n}&fg=000000$, together with bits $latex {c_1,\ldots,c_N \in \{0,1\}}&fg=000000$ and a probability distribution $latex {\pi}&fg=000000$ on $latex {\{1,\ldots,N\}}&fg=000000$. The game is played as follows: the referee first selects a $latex {t}&fg=000000$ according to $latex {\pi}&fg=000000$. He then prepares the state $latex {\rho_t}&fg=000000$ and sends the first half to the first player, the second half to the other. The players each have to reply with a bit $latex {a, b \in \{0,1\}}&fg=000000$ respectively. They win the game if and only if their answers satisfy $latex {a \oplus b = c_t}&fg=000000$.

The goal of the players is to maximize their chances of winning the game. The description of the game itself is public, but the index $latex {t}&fg=000000$ chosen by the referee is secret: each player only has access to his or her own half of $latex {\rho_t}&fg=000000$. In a sense the players are trying to distinguish which state was prepared, albeit in the specific “distributed” manner specified by the game. Now, the surprising fact is the following: the players’ maximum success probability in any quantum XOR game can be *exactly* cast as an instance of (1), for some table $latex {M}&fg=000000$ that depends on the game description. Conversely, for any “Hermitian” table $latex {M}&fg=000000$ (i.e. one that, viewed as a matrix, is Hermitian), there exists a quantum XOR game that gives rise to this $latex {M}&fg=000000$! This tight connection holds for players not using any entanglement. If they do use entanglement, then their maximum winning probability turns out to be tightly connected with a different variant of Grothendieck’s inequality, the “operator-space” Grothendieck inequality… more details can be found here.

**2. An efficient approximation algorithm**

Taking inspiration from the classical inequality (4), to obtain an approximation algorithm for (1) we first look for an appropriate semidefinite relaxation. Replacing each of the entries of the variables $latex {U}&fg=000000$ and $latex {V}&fg=000000$ by a $latex {d}&fg=000000$-dimensional vector, we consider the set of all $latex {n\times n}&fg=000000$ matrices with entries that lie in $latex {\mathbb{R}^d}&fg=000000$. The constraint that $latex {U}&fg=000000$ is orthogonal is equivalent to $latex {UU^T = U^TU = \mathrm{Id}}&fg=000000$. The $latex {(i,k)}&fg=000000$-th coefficient of $latex {UU^T}&fg=000000$ is $latex {\sum_j U_{ij} U_{kj}}&fg=000000$, and for a vector-valued matrix $latex {U}&fg=000000$ we analogously define $latex {UU^T}&fg=000000$ and $latex {U^TU}&fg=000000$ by

$latex \displaystyle (UU^T)_{ik} \,=\, \sum_{j} \langle U_{ij}, U_{kj}\rangle \qquad\text{and}\qquad (U^TU)_{jl} \,=\, \sum_{i} \langle U_{ij}, U_{il}\rangle,&fg=000000$

where here each term of the sum is the inner product between the two corresponding vector-entries of $latex {U}&fg=000000$. Hence the following natural definition for the set of *orthogonal vector-valued matrices*:

$latex \displaystyle O_n^d \,:=\, \big\{ U \in (\mathbb{R}^d)^{n\times n}\,:\ UU^T = U^TU = \mathrm{Id} \big\}.&fg=000000$

For $latex {d=1}&fg=000000$ we recover the set $latex {O_n = O_n^1}&fg=000000$, so the inequality

$latex \displaystyle \mathrm{OPT}(M) \,\leq\, \mathrm{SDP}(M) \,:=\, \sup_{U,V \in O_n^d}\, \sum_{i,j,k,l=1}^n M_{ijkl}\, \langle U_{ij}, V_{kl}\rangle \ \ \ \ \ (11)&fg=000000$

holds (note we may always restrict $latex {d \leq 2n^2}&fg=000000$). Is this a *good* relaxation? If so, can it lead to an approximation algorithm: can we *round* vector-valued orthogonal matrices to orthogonal matrices, while not degrading the objective function too much?

Pisier [Pis78] and Haagerup [Haa85] answered the first question in the affirmative. They considered a slightly different formulation of (11), and they focused on the complex case, but from their results one can deduce that the integrality gap of (11) in the real case is at most $latex {2\sqrt{2}}&fg=000000$: for all $latex {M}&fg=000000$, $latex {\mathrm{SDP}(M) \leq 2\sqrt{2}\,\mathrm{OPT}(M)}&fg=000000$. (Haagerup obtained a factor $latex {2}&fg=000000$ for the complex case, which is known to be tight; we do not know if the factor $latex {2\sqrt{2}}&fg=000000$ for the real case is tight.)

Our main result is an efficient rounding algorithm matching this guarantee.

**Theorem** [NRV12]. Let $latex {M}&fg=000000$ be given and $latex {\varepsilon > 0}&fg=000000$. Let $latex {U, V \in O_n^d}&fg=000000$ be orthogonal vector-valued matrices such that

$latex \displaystyle \sum_{i,j,k,l=1}^n M_{ijkl}\, \langle U_{ij}, V_{kl}\rangle \,\geq\, (1-\varepsilon)\,\mathrm{SDP}(M).&fg=000000$

Then one can find in time polynomial in $latex {n, d}&fg=000000$ and $latex {1/\varepsilon}&fg=000000$ two orthogonal matrices $latex {U', V' \in O_n}&fg=000000$ such that

$latex \displaystyle \sum_{i,j,k,l=1}^n M_{ijkl}\, U'_{ij} V'_{kl} \,\geq\, \Big(\frac{1}{2\sqrt{2}} - \varepsilon\Big)\,\mathrm{SDP}(M).&fg=000000$

In the following I will describe the rounding algorithm for the case of the *complex* commutative inequality. This may sound like the opposite of what we’re ultimately interested in (the real non-commutative inequality), but it is the best setting in which to explain the main ideas, which generalize straightforwardly to the matrix setting. I will also describe a *randomized* algorithm, but it is possible to derandomize it in polynomial time.

Our goal is thus to devise a rounding algorithm that, given complex unit vectors $latex {x_1,\ldots,x_n, y_1,\ldots,y_n \in \mathbb{C}^d}&fg=000000$ such that

$latex \displaystyle \Big|\sum_{i,j=1}^n A_{ij}\, \langle x_i, y_j\rangle\Big| \,\geq\, (1-\varepsilon)\,\mathrm{SDP}(A), \ \ \ \ \ (12)&fg=000000$

produces complex numbers $latex {u_1,\ldots,u_n, v_1,\ldots,v_n}&fg=000000$ of modulus $latex {1}&fg=000000$ such that

$latex \displaystyle \Big|\sum_{i,j=1}^n A_{ij}\, u_i \overline{v_j}\Big| \,\geq\, \Big(\frac{1}{2} - \varepsilon\Big)\,\mathrm{SDP}(A).&fg=000000$

Our first step consists in performing the usual random projection.

Step 1. Choose a vector $latex {g \in \mathbb{C}^d}&fg=000000$ with i.i.d. standard complex Gaussian entries. For $latex {i,j \in \{1,\ldots,n\}}&fg=000000$, define $latex {z_i = \frac{1}{\sqrt{2}}\langle g, x_i\rangle}&fg=000000$ and $latex {z'_j = \frac{1}{\sqrt{2}}\langle g, y_j\rangle}&fg=000000$.

(The factor $latex {\frac{1}{\sqrt{2}}}&fg=000000$ in the definition of $latex {z_i, z'_j}&fg=000000$ is for convenience and will only play a role later.) By expanding out the terms, one checks that in expectation, for any $latex {i,j}&fg=000000$, $latex {\mathbb{E}[z_i \overline{z'_j}] = \langle x_i, y_j\rangle}&fg=000000$. The problem of course is that in general $latex {z_i}&fg=000000$ will not have modulus $latex {1}&fg=000000$. The standard procedure, due to Goemans and Williamson, would be to define $latex {u_i = \mathrm{sign}(z_i)}&fg=000000$, $latex {v_j = \mathrm{sign}(z'_j)}&fg=000000$, where $latex {\mathrm{sign}(z)}&fg=000000$ denotes $latex {z/|z|}&fg=000000$ (in case $latex {z}&fg=000000$ is real this is just the sign). Goemans and Williamson showed that, in the real case,

$latex \displaystyle \mathbb{E}\big[\mathrm{sign}(z_i)\,\mathrm{sign}(z'_j)\big] \,=\, \frac{2}{\pi}\,\arcsin\big(\langle x_i, y_j\rangle\big),&fg=000000$

so that, recalling (3), such “sign” rounding will work well for a problem, such as MAXCUT, in which all the coefficients are of the same sign. However, in general the coefficients may have different signs and large cancellations could occur, rendering an inequality such as the one above (even if it was two-sided) useless — one needs a more precise control of the relation between the $latex {u_i \overline{v_j}}&fg=000000$ and the $latex {\langle x_i, y_j\rangle}&fg=000000$.
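As an aside, the Goemans–Williamson identity $latex {\mathbb{E}[\mathrm{sign}\langle g, x\rangle\,\mathrm{sign}\langle g, y\rangle] = \frac{2}{\pi}\arcsin\langle x, y\rangle}&fg=000000$, for real unit vectors $latex {x, y}&fg=000000$ and Gaussian $latex {g}&fg=000000$, is easy to verify by simulation — a quick Monte Carlo check with parameters of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.array([1.0, 0.0])
y = np.array([0.6, 0.8])              # real unit vectors with <x, y> = 0.6
G = rng.standard_normal((200000, 2))  # 200000 i.i.d. Gaussian projection vectors
corr = np.mean(np.sign(G @ x) * np.sign(G @ y))
print(corr, (2 / np.pi) * np.arcsin(0.6))  # the two numbers agree to about 1e-2
```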

Our rounding algorithm proceeds differently. The idea of the second step is to use the information that is contained in the moduli $latex {|z_i|, |z'_j|}&fg=000000$, instead of completely discarding it as in the “sign” rounding procedure. If these moduli are $latex {1}&fg=000000$ then all is well. If, however, they are far from $latex {1}&fg=000000$ then intuitively this represents a “miss” of the random projection step: such values may be best discarded, or could be re-randomized. To this effect we perform an additional *correlated* randomized rounding step, as follows.

Step 2. Choose $latex {s \in \mathbb{R}}&fg=000000$ from the hyperbolic secant distribution, which has density proportional to $latex {1/\cosh(\pi x/2)}&fg=000000$. Return $latex {u_i = z_i |z_i|^{is-1}}&fg=000000$, $latex {v_j = z'_j |z'_j|^{is-1}}&fg=000000$.

Note that $latex {u_i, v_j}&fg=000000$ have modulus exactly $latex {1}&fg=000000$. Why does this work? In his proof, Haagerup uses the maximum modulus principle to show the *existence* of a real $latex {s}&fg=000000$ for which $latex {u_i, v_j}&fg=000000$ as defined above will provide a good solution. Note that varying $latex {s}&fg=000000$ corresponds to simultaneously “rotating” the $latex {z_i, z'_j}&fg=000000$ at different speeds, proportional to the logarithm of the modulus of $latex {z_i, z'_j}&fg=000000$ respectively. Hence terms for which the modulus is far from $latex {1}&fg=000000$ will be rotated at greater speeds, going through all possible values on the unit circle much more frequently than terms whose modulus is close to $latex {1}&fg=000000$.

We go a bit further and identify a specific distribution — the hyperbolic secant distribution — such that sampling $latex {s}&fg=000000$ according to that distribution will give a good solution, in expectation. To see why, the key property of the hyperbolic secant distribution we need is that its characteristic function satisfies $latex {\mathbb{E}_s[e^{ist}] = \mathrm{sech}(t) = \frac{2}{e^t + e^{-t}}}&fg=000000$, for any $latex {t \in \mathbb{R}}&fg=000000$. Using that equation, we obtain

$latex \displaystyle \mathbb{E}_s\big[u_i \overline{v_j}\big] \,=\, \frac{z_i \overline{z'_j}}{|z_i|\,|z'_j|}\,\mathrm{sech}\big(\ln(|z_i|/|z'_j|)\big) \,=\, \frac{2\, z_i \overline{z'_j}}{|z_i|^2 + |z'_j|^2} \,=\, z_i \overline{z'_j} \,-\, z_i \overline{z'_j}\,\frac{|z_i|^2 + |z'_j|^2 - 2}{|z_i|^2 + |z'_j|^2}.&fg=000000$
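The characteristic-function identity for the hyperbolic secant distribution (density $latex {\frac{1}{2}\mathrm{sech}(\pi x/2)}&fg=000000$, characteristic function $latex {\mathrm{sech}(t)}&fg=000000$) can also be checked by simulation. The inverse-CDF sampler below is standard; the parameter choices are mine:

```python
import numpy as np

def sample_sech(n, rng):
    """Sample the standard hyperbolic secant distribution by inverting its CDF,
    F(x) = (2/pi) * arctan(exp(pi*x/2))."""
    u = rng.uniform(1e-12, 1 - 1e-12, size=n)
    return (2.0 / np.pi) * np.log(np.tan(np.pi * u / 2.0))

rng = np.random.default_rng(5)
s = sample_sech(400000, rng)
t = 1.3
est = np.mean(np.cos(t * s))  # the characteristic function is real, by symmetry
print(est, 1 / np.cosh(t))    # the two numbers agree to about 1e-2
```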

Taking the expectation over $latex {g}&fg=000000$, the first term above is $latex {\langle x_i, y_j\rangle}&fg=000000$. To conclude it only remains to bound the contribution of the second term. The idea is to see the two families of vectors implicit in the second term — one depending only on $latex {i}&fg=000000$, the other only on $latex {j}&fg=000000$, as functions of $latex {g}&fg=000000$ — as a pair of possible solutions to the semidefinite program (note these vectors are infinite-dimensional, but this is not an issue — we are allowed to work directly with infinite-dimensional vectors, or we could project them down to finite dimensions, or discretize the choice of $latex {g}&fg=000000$). We are about to show that the squared norm of each such vector is at most $latex {\frac{1}{2}}&fg=000000$, hence the value they give in the objective function defining $latex {\mathrm{SDP}(A)}&fg=000000$ in (12) can be at most $latex {\frac{1}{2}\,\mathrm{SDP}(A)}&fg=000000$. Plugging this bound back into the equations above will finish the analysis.

Hence it remains to compute

Here, in the second line, we used that the only terms that have a nonzero expectation are those in which the Gaussian coordinates of $latex {g}&fg=000000$ pair up.
