## Theory lunch: rounding the noncommutative Grothendieck inequality

For as long as I can remember, every single Wednesday in Berkeley is the occasion for a “theory lunch”: the theory group gets together around a slice of world-famous pizza from the Cheese Board Collective, or some other dish from one of the excellent Berkeley restaurants carefully selected by Neha — today it was chicken pot pie and Halloween-styled cupcakes.

Soda Hall’s beautiful Woz, where theory lunches are held.

We enjoy the food and chat for half an hour or so, before getting down to business: someone, usually a member of the group but sometimes a visitor — last week Anup Rao told us about his work on recovering from adversarial error in Boolean circuits — gives a half-hour whiteboard talk. The specific constraints imposed by the lunch format (shorter talk, irresistible urge to nap) make these talks all the more enjoyable, as the speaker is aware that he needs to take particular care in keeping his audience engaged. The layout of the room also makes it particularly conducive to questions and interruptions.

Today I was up, and I talked about recent work with Assaf Naor and Oded Regev (see here for the recently uploaded paper) on “efficient rounding of the non-commutative Grothendieck inequality”. Since I didn’t have time to cover all the material I was hoping to discuss, and since I very much like the work we did in that paper, I will give a brief introduction to the results — in the format of an extended theory lunch talk — below.

0. The non-commutative Grothendieck optimization problem.

Our starting point is the following optimization problem. Given a table of real numbers ${M}$ of size ${n\times n\times n\times n}$, the goal is to compute

$\displaystyle \mathrm{opt}(M)\,:=\, \max_{U,V\in\mathcal{O}_n}\, \sum_{ijkl} \, M_{ijkl}\, U_{ij} V_{kl}, \ \ \ \ \ (1)$

where ${\mathcal{O}_n}$ denotes the set of ${n\times n}$ orthogonal matrices, i.e. the set of all real matrices ${X}$ satisfying ${XX^T = X^T X = \mathrm{Id}}$. Computing ${\mathrm{opt}(M)}$ is a quadratic optimization problem, and in general we don’t expect there to be an efficient exact algorithm. Below I will describe a randomized polynomial-time constant-factor approximation algorithm for this problem. First let’s see some interesting classes of examples that can be put in the form of (1).
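To fix ideas, here is a minimal numpy sketch (my own illustration, not from the paper) of how one evaluates the objective in (1) for a given pair ${U,V}$; the instance below is a toy example for which the maximum is easy to see by hand.

```python
import numpy as np

def objective(M, U, V):
    """Evaluate the bilinear form sum_{ijkl} M_{ijkl} U_{ij} V_{kl} from (1)."""
    return float(np.einsum('ijkl,ij,kl->', M, U, V))

# Toy instance: M_{ijkl} = delta_{ik} delta_{jl}, so the objective equals
# tr(U V^T), which is maximized (value n) by taking U = V in O_n.
n = 3
M = np.einsum('ik,jl->ijkl', np.eye(n), np.eye(n))
```

Of course, the snippet only evaluates the objective; maximizing it over ${\mathcal{O}_n\times\mathcal{O}_n}$ is the hard part.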

1. Some applications of (1)

(i) The commutative Grothendieck inequality. To start with a simple example, suppose ${M}$ is “diagonal”, meaning ${M_{ijkl} = A_{ik}}$ if ${i=j}$ and ${k=l}$, and ${0}$ otherwise. Then the only coefficients of ${U}$ or ${V}$ that appear in (1) are the diagonal coefficients ${U_{ii}}$ and ${V_{kk}}$. (Note that diagonal matrices commute — this is the reason for the commutative/noncommutative distinction in the inequalities’ name.) In that case we can rewrite (1) as

$\displaystyle \mathrm{opt}(A)\,:=\,\mathrm{opt}(M)\,=\,\max_{\varepsilon_i,\delta_k\in\{-1,1\}}\ \sum_{ik} A_{ik}\, \varepsilon_i \delta_k. \ \ \ \ \ (2)$

To see why the equality holds, first observe that if ${\varepsilon_i,\delta_k\in\{-1,1\}}$ the matrices ${U := \mathrm{diag}(\varepsilon_i)}$, ${V := \mathrm{diag}(\delta_k)}$ are orthogonal: ${\mathrm{opt}(M)}$ is at least as large as the right-hand side of (2). Conversely, for any orthogonal ${U,V}$ their diagonal coefficients must be in ${[-1,1]}$, so that starting from any ${U,V}$ which optimize (1) we can obtain an assignment of values in ${[-1,1]}$ to the variables ${\varepsilon_i,\delta_k}$ in the right-hand side of (2) achieving the value ${\mathrm{opt}(M)}$. Using linearity, it is clear that the optimum over ${[-1,1]}$ is achieved for a choice of values in ${\{-1,1\}}$, hence equality holds.
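For small ${n}$ this equality can be checked by brute force. The sketch below (my own, with hypothetical helper names) enumerates all sign vectors on both sides:

```python
import itertools
import numpy as np

def opt_signs(A):
    """Brute-force the right-hand side of (2) over eps, delta in {-1,1}^n."""
    n = A.shape[0]
    return max(float(np.array(e) @ A @ np.array(d))
               for e in itertools.product([-1, 1], repeat=n)
               for d in itertools.product([-1, 1], repeat=n))

def opt_diag_embedding(A):
    """Brute-force (1) for the 'diagonal' table M_{iikk} = A_{ik}, restricting
    the maximization to the diagonal orthogonal matrices diag(eps), diag(delta)."""
    n = A.shape[0]
    M = np.zeros((n, n, n, n))
    for i in range(n):
        for k in range(n):
            M[i, i, k, k] = A[i, k]
    return max(float(np.einsum('ijkl,ij,kl->', M, np.diag(e), np.diag(d)))
               for e in itertools.product([-1, 1], repeat=n)
               for d in itertools.product([-1, 1], repeat=n))
```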

The problem of evaluating ${\mathrm{opt}(A)}$ must be familiar to many. For a general real ${n\times n}$ matrix ${A}$, the quantity ${\mathrm{opt}(A)}$ defined in (2) is closely related to the CUT-NORM of ${A}$, which is defined as ${\mathrm{opt}(A)}$ except that the maximum is taken over all ${\varepsilon_i,\delta_j \in \{0,1\}}$. The CUT-NORM plays an important role in the work of Frieze and Kannan on the design of approximation algorithms for dense graph problems. Alon and Naor [AN04] gave a polynomial-time algorithm (in fact, three algorithms!) giving a constant-factor approximation to ${\mathrm{opt}(A)}$. Their approach is based on exploiting Grothendieck’s inequality [Gro53], which states that

$\displaystyle \sup_{d,\,x_i,y_k\in{\mathbb R}^d:\,\|x_i\|,\|y_k\|= 1} \sum_{ik}\, A_{ik}\, x_i\cdot y_k \,\leq\, K_G\max_{\varepsilon_i,\delta_k\in\{-1,1\}}\, \sum_{ik}\, A_{ik}\, \varepsilon_i \delta_k, \ \ \ \ \ (4)$

where ${K_G \leq 1.782\ldots}$ is the real Grothendieck constant. (Note that by choosing ${d=1}$ one sees that the supremum on the left-hand side is always at least as large as the maximum on the right-hand side; moreover we can always assume ${d\leq 2n}$.) Since the supremum on the left-hand side can be cast as a semidefinite program (SDP), it can be solved in polynomial time. To obtain an approximation algorithm one needs to show how a vector solution to the SDP can be rounded to a good assignment to the variables ${\varepsilon_i,\delta_k}$, and this is the content of the work of Alon and Naor: they show how different proofs of (4) can be turned into algorithms.

Alon and Naor also showed, by a reduction from MAXCUT, that the problem of approximating (4), and hence (1), to within a constant is NP-hard. We will return to (4) soon; first let’s see some other special instances of (1).

(ii) Robust PCA. In principal component analysis (PCA), one is given a set of ${n}$ points ${a_1,\ldots,a_n \in {\mathbb R}^d}$. The goal is to find a direction along which the points have the highest variance, i.e. to compute a unit vector ${y\in{\mathbb R}^d}$ achieving

$\displaystyle \max_{y\in{\mathbb R}^d,\,\|y\|= 1} \sum_{i=1}^n (a_i\cdot y)^2 \,=\,\max_{y\in{\mathbb R}^d,\,\|y\|= 1} \|Ay\|_2^2 \,=\, \|A\|_\infty^2,$

where ${A}$ is the ${n\times d}$ matrix with rows the ${a_i}$ and ${\|A\|_\infty}$ is the operator norm (the largest singular value) of ${A}$. More generally, for any ${1\leq k \leq d}$ one could ask for a ${k}$-dimensional subspace on which the projections of the ${a_i}$ have the largest “spread”:

$\displaystyle \max_{\substack{y_1,\ldots,y_k \in {\mathbb R}^d\\ y_i\cdot y_j = \delta_{ij}}} \, \sum_{j=1}^k \,\|A y_j\|_2^2, \ \ \ \ \ (5)$

where ${\delta_{ij}}$ is the Kronecker symbol. This problem can be solved in polynomial time: the optimal ${y_1,\ldots,y_k}$ are the right singular vectors of ${A}$ corresponding to its ${k}$ largest singular values (with multiplicities). One drawback of (5) in practice is that it tends to be very sensitive to “outliers”: if any one of the points ${a_i}$ has a much larger norm than the others, it will have a disproportionate effect on the optimal solution ${(y_j)}$. To overcome this issue one can replace the ${\ell_2}$ norm used in (5) by a less sensitive measure, such as the ${\ell_1}$ norm. The resulting optimization problem is called L1-PCA:

$\displaystyle \max_{\substack{y_1,\ldots,y_k \in {\mathbb R}^d\\ \langle y_i,y_j\rangle = \delta_{ij}}} \,\sum_{j=1}^k \,\|A y_j\|_1. \ \ \ \ \ (6)$

To the best of my knowledge, prior to our work the best algorithm for the problem of computing (6) was due to McCoy and Tropp [MT12], who gave an ${O(\log(n+k+d))}$-approximation algorithm. (They also obtain a constant-factor approximation algorithm for the case ${k=1}$.)
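As a point of comparison, the classical problem (5) really is just an SVD computation. The sketch below (my own illustration, assuming numpy) computes the optimum of (5) from the singular values, and evaluates the ${\ell_1}$ objective of (6) at a given ${Y}$ — note that plugging the SVD subspace into (6) gives a feasible point but not, in general, the optimum.

```python
import numpy as np

def pca_value(A, k):
    """Optimum of (5): the sum of the k largest squared singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s[:k] ** 2))

def l1_objective(A, Y):
    """Objective of (6) for a k x d matrix Y with orthonormal rows."""
    return float(np.sum(np.abs(A @ Y.T)))

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
Y_svd = np.linalg.svd(A)[2][:2]   # top-2 right singular vectors: optimal for (5)
```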

To formulate (6) as an instance of (1), observe that (6) can be rewritten a

$\displaystyle \max_{\substack{Y\in {\mathbb R}^{k\times d}\\ YY^T = \mathrm{Id}}}\,\, \max_{z_{ij}\in\{-1,1\}}\,\sum_{i=1}^n \sum_{j=1}^k\, z_{ij}\,\Big(\sum_{l=1}^d a_{il} y_{jl}\Big). \ \ \ \ \ (7)$

Introduce two matrix variables

$\displaystyle U = \begin{pmatrix} z_{11} & & 0 \\ & \ddots & \\ 0 & & z_{nk} \end{pmatrix} \in {\mathbb R}^{nk\times nk}\qquad \text{and} \qquad V = \begin{pmatrix} y_{11} & \cdots & y_{1d} \\ \vdots & & \vdots \\ y_{k1} & \cdots & y_{kd} \\ & 0 & \end{pmatrix} \in {\mathbb R}^{d\times d},$

and define an ${nk\times nk\times d \times d}$ table ${M}$ by setting the only nonzero entries to ${M_{(ij)(ij)(jl)} = a_{il}}$. (If desired we may make all four dimensions of ${M}$ equal by padding with zeros.) With this choice of ${M}$, and ${U,V}$ as defined above, it is clear that ${\mathrm{opt}(M)}$ is at least as large as the optimum in (7). Conversely, from any orthogonal matrices ${U,V}$ achieving the optimum in (1) we may obtain values ${z_{ij}\in[-1,1]}$ and a ${k\times d}$ matrix ${Y}$ such that ${YY^T \leq \mathrm{Id}}$ achieving the same value ${\mathrm{opt}(M)}$ in (7). Using linearity, it is then not hard to see that there will exist a choice of ${z'_{ij}\in\{-1,1\}}$ and ${Y'}$ such that ${Y'(Y')^T = \mathrm{Id}}$ that achieve at least as high a value. Hence any constant-factor approximation algorithm for (1) immediately implies a constant-factor approximation algorithm, with the same constant, for the problem of computing (6).
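The bookkeeping in this reduction is easy to get wrong, so here is a small numerical check (my own sketch, using the index convention ${M_{(ij)(ij)(jl)} = a_{il}}$, with the pair ${(ij)}$ flattened to a single index): for a fixed ${Y}$ and the optimal signs ${z_{ij}}$, the objective of (1) agrees with the ${\ell_1}$ objective of (6).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, d = 4, 2, 3
A = rng.standard_normal((n, d))

# A k x d matrix Y with orthonormal rows, and the optimal signs z_ij for it.
Y = np.linalg.qr(rng.standard_normal((d, d)))[0][:k]
Z = np.sign(A @ Y.T)
l1_value = float(np.sum(np.abs(A @ Y.T)))   # objective of (6) at Y

# The table M: only nonzero entries are M_{(ij)(ij)(jl)} = a_{il}, with the
# pair (i, j) flattened to the single index p = i*k + j.
M = np.zeros((n * k, n * k, d, d))
for i in range(n):
    for j in range(k):
        M[i * k + j, i * k + j, j, :] = A[i, :]

U = np.diag(Z.reshape(-1))            # nk x nk diagonal sign matrix
V = np.zeros((d, d)); V[:k, :] = Y    # Y padded with zero rows; rows k..d-1
                                      # only multiply zero entries of M
value = float(np.einsum('pqjl,pq,jl->', M, U, V))
```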

As a remark, we could also have considered the following variant, called R1-PCA, and introduced by Ding et al.:

$\displaystyle \max_{\substack{y_1,\ldots,y_k \in {\mathbb R}^d\\ y_i\cdot y_j = \delta_{ij}}} \,\sum_{i=1}^n \Big(\sum_{j=1}^k \,|a_i\cdot y_j|^2\Big)^{1/2}. \ \ \ \ \ (8)$

R1-PCA is perhaps more natural than L1-PCA in that it measures the length of the projection of each ${a_i}$ on the subspace ${\mathrm{span}(y_j)}$ in the ${\ell_2}$ norm, but then sums the contributions of each ${a_i}$ using the ${\ell_1}$ norm (i.e. it takes the sum of the lengths instead of the sum of their squares). I leave as an exercise to the reader to show that (8) can also be formulated as an instance of (1).

(iii) The generalized orthogonal Procrustes problem. Suppose we are given ${k}$ sets ${S_1,\ldots,S_k}$ of ${n}$ points in ${{\mathbb R}^d}$ each: ${S_i=\{x_1^i,\ldots,x_n^i\}}$. The goal is to find rotations ${U_1,\ldots,U_k\in \mathcal{O}_d}$ that map the sets of points to each other as closely as possible:

$\displaystyle \min_{U_1,\ldots,U_k\in\mathcal{O}_d} \, \sum_{i,j=1}^k \sum_{l=1}^n\, \big\| U_ix_l^i - U_j x_l^j \big\|_2^2. \ \ \ \ \ (9)$

The bandit Procrustes trying to accommodate one of his guests

The study of this problem goes back to Greek mythology, where the bandit Procrustes would rotate, stretch (and even amputate!) his guests in order to make them “fit” the bed he had prepared for them. In more recent times it has been studied in the Psychometrics (the theory of psychological measurement) literature since the 70s, and also plays an important role in shape analysis. For instance, given a set of many sample butterfly shapes corresponding to a certain species, solving (9) lets one compute a “group average”

$\displaystyle S_{\mathrm{avg}} \,:=\, \Big\{\,\frac{1}{k}\sum_i \,U_i x_l^i,\, l=1,\ldots,n\,\Big\}$

against which any newly observed specimen can be compared in order to ascertain whether it comes from the same species (see e.g. these notes for more explanations and nice pictures).

The problem of computing (9) can be solved efficiently when ${k=2}$ by performing the singular value decomposition of ${X_1^TX_2}$, where ${X_i}$ is the ${n\times d}$ matrix whose rows are given by the points ${x_l^i}$, but it is known to be NP-hard as soon as ${k>2}$. Nemirovski [Nem07] gave a polynomial-time algorithm giving a polynomial-factor approximation to (9), and this was improved by So [So11] to an ${O(\log(n+d+k))}$ approximation.
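For concreteness, here is a sketch of the ${k=2}$ case (my own, not from the paper): writing the points as columns of ${d\times n}$ matrices and setting ${W = U_2^TU_1}$, the problem reduces to a single orthogonal Procrustes problem solved by one SVD.

```python
import numpy as np

def procrustes_k2(X1, X2):
    """Orthogonal W minimizing ||W X1 - X2||_F over W in O_d, for d x n
    matrices whose columns are the two point sets.  Writing X1 X2^T = U S V^T,
    the maximizer of tr(W X1 X2^T) over orthogonal W is W = V U^T."""
    U, s, Vt = np.linalg.svd(X1 @ X2.T)
    return Vt.T @ U.T

rng = np.random.default_rng(3)
d, n = 3, 10
X1 = rng.standard_normal((d, n))
R = np.linalg.qr(rng.standard_normal((d, d)))[0]   # a random orthogonal matrix
X2 = R @ X1                                        # an exactly rotated copy
W = procrustes_k2(X1, X2)
```

On the toy instance above, where ${X_2}$ is an exact rotation of ${X_1}$, the algorithm recovers the rotation.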

Expanding the square and ignoring constant terms we get that the problem of computing (9) is equivalent to evaluating

$\displaystyle \max_{U_1,\ldots,U_k\in \mathcal{O}_d} \, \sum_{i,j=1}^k\,\sum_{l=1}^n\, \big(U_i x_l^i \cdot U_j x_l^j\big). \ \ \ \ \ (10)$

By introducing two matrix variables ${U,V}$ that contain the ${U_i}$ as diagonal blocks and defining an appropriate table ${M}$ one can rewrite (10) as an instance of (1), and I leave the details as an exercise. As a result, a constant-factor approximation algorithm for (1) again translates directly to an approximation algorithm for (10) (with the same constant).
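The equivalence between (9) and (10) is just the expansion of the square; the following check (my own sketch) verifies numerically that for any rotations the two objectives satisfy ${f_9 = \mathrm{const} - 2 f_{10}}$, with a constant independent of the ${U_i}$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 3, 5, 4
X = [rng.standard_normal((d, n)) for _ in range(k)]             # columns are x_l^i
Us = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(k)]

def f9(Us):
    """Objective of (9): sum of squared distances between rotated point sets."""
    return sum(float(np.sum((Us[i] @ X[i] - Us[j] @ X[j]) ** 2))
               for i in range(k) for j in range(k))

def f10(Us):
    """Objective of (10): sum of inner products between rotated point sets."""
    return sum(float(np.sum((Us[i] @ X[i]) * (Us[j] @ X[j])))
               for i in range(k) for j in range(k))

# Expanding the square in (9): f9 = const - 2*f10, with the constant
# 2k * sum_i ||X_i||_F^2 independent of the rotations U_i.
const = 2 * k * sum(float(np.sum(Xi ** 2)) for Xi in X)
```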

(iv) Quantum XOR games. Let me briefly mention one last application, which is how I initially got interested in (1). A quantum XOR game is a game played by two cooperating players against a referee. The game is specified by a collection of ${n}$ quantum states ${\rho_1,\ldots,\rho_{n}}$, where each ${\rho_i}$ is a density matrix on ${{\mathbb C}^d \otimes {\mathbb C}^d}$, together with ${n}$ bits ${c_i\in\{0,1\}}$ and a probability distribution ${p}$ on ${\{1,\ldots,n\}}$. The game is played as follows: the referee first selects an ${i\in\{1,\ldots, n\}}$ according to ${p}$. He then prepares the state ${\rho_i}$ and sends the first half to the first player, the second half to the other. The players each have to reply with a bit ${a,b\in\{0,1\}}$ respectively. They win the game if and only if their answers satisfy ${a\oplus b = c_i}$.

The goal of the players is to maximize their chances of winning the game. The description of the game itself is public, but the index ${i}$ chosen by the referee is secret: each player only has access to his or her own half of ${\rho_i}$. In a sense the players are trying to distinguish which state ${\rho_i}$ was prepared, albeit in the specific “distributed” manner specified by the game. Now, the surprising fact is the following: the players’ maximum success probability in any quantum XOR game can be exactly cast as an instance of (1), for some table ${M}$ that depends on the game description. Conversely, for any “Hermitian” table ${M}$ (i.e. ${M_{ijkl} = \overline{M_{jilk}}}$ for any ${i,j,k,l}$), there exists a quantum XOR game that gives rise to this ${M}$! This tight connection holds for players not using any entanglement. If they do use entanglement, then their maximum winning probability turns out to be tightly connected with a different variant of Grothendieck’s inequality, the “operator-space” Grothendieck inequality…more details can be found here.

2. An efficient approximation algorithm

Taking inspiration from the classical inequality (4), to obtain an approximation algorithm for (1) we first look for an appropriate semidefinite relaxation. Replacing each of the entries of the variables ${U}$ and ${V}$ by a vector, we define ${M_n({\mathbb R}^d)}$ to be the set of all ${n\times n}$ matrices with entries that lie in ${{\mathbb R}^d}$. The constraint that ${U}$ is orthogonal is equivalent to ${UU^T = U^TU = \mathrm{Id}}$. The ${(i,j)}$-th coefficient of ${UU^T}$ is ${(UU^T)_{ij} = \sum_k U_{ik}U_{jk}}$, and for a vector-valued matrix ${X\in M_n({\mathbb R}^d)}$ we analogously define ${XX^T}$ and ${X^T X}$ by

$\displaystyle (XX^T)_{ij} := \sum_k X_{ik}\cdot X_{jk}\qquad\text{and}\qquad (X^T X)_{ij} := \sum_k X_{ki}\cdot X_{kj},$

where here each term of the sum is the inner product between the two corresponding vector-entries of ${X}$. Hence the following natural definition for the set of orthogonal vector-valued matrices:

$\displaystyle \mathcal{O}_n({\mathbb R}^d) \,:=\, \{ X \in M_n({\mathbb R}^d):\, XX^T = X^T X = \mathrm{Id}\}.$
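Concretely, a vector-valued matrix can be stored as an ${(n,n,d)}$ array, and membership in ${\mathcal{O}_n({\mathbb R}^d)}$ checked directly from the definition. A small sketch (my own, with hypothetical helper names):

```python
import numpy as np

def vv_products(X):
    """For X in M_n(R^d), stored as an (n, n, d) array, return XX^T and X^TX
    with (XX^T)_{ij} = sum_k X_{ik}.X_{jk} and (X^TX)_{ij} = sum_k X_{ki}.X_{kj}."""
    return np.einsum('ika,jka->ij', X, X), np.einsum('kia,kja->ij', X, X)

def is_vv_orthogonal(X, tol=1e-9):
    """Membership test for O_n(R^d)."""
    n = X.shape[0]
    XXt, XtX = vv_products(X)
    return np.allclose(XXt, np.eye(n), atol=tol) and np.allclose(XtX, np.eye(n), atol=tol)

# The d = 1 case: an ordinary orthogonal matrix, viewed as having
# 1-dimensional vector entries.
n = 3
U = np.linalg.qr(np.random.default_rng(5).standard_normal((n, n)))[0]
X = U[:, :, None]
```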

For ${d=1}$ we recover the set ${\mathcal{O}_n}$, so the inequality

$\displaystyle \mathrm{opt}(M)\,\leq\,\mathrm{SDP}(M) \,:=\, \sup_{d,\, U,V \in \mathcal{O}_n({\mathbb R}^d)} \sum_{ijkl}\, M_{ijkl} U_{ij} \cdot V_{kl} \ \ \ \ \ (11)$

holds (note we may always restrict ${d\leq 2n^2}$). Is this a good relaxation? If so, can it lead to an approximation algorithm: can we round vector-valued orthogonal matrices to orthogonal matrices, while not degrading the objective function too much?

Pisier [Pis78] and Haagerup [Haa85] answered the first question in the affirmative. They considered a slightly different formulation of ${\mathrm{SDP}(M)}$, and they focused on the complex case, but from their results one can deduce that the integrality gap of (11) in the real case is at most ${2\sqrt{2}}$: for all ${M}$, ${\mathrm{SDP}(M)\leq 2\sqrt{2}\,\mathrm{opt}(M)}$. (Haagerup obtained a factor ${2}$ for the complex case, which is known to be tight; we do not know if the ${2\sqrt{2}}$ factor for the real case is tight.)

Our main result is an efficient rounding algorithm matching this guarantee.

Theorem [NRV12]. Let ${M}$ be given and ${\varepsilon>0}$. Let ${X,Y}$ be orthogonal vector-valued matrices such that

$\displaystyle \sum_{ijkl}\, M_{ijkl}\, X_{ij}\cdot Y_{kl} \,\geq\, (1-\varepsilon) \mathrm{SDP}(M).$

Then one can find in time polynomial in ${n}$ and ${1/\varepsilon}$ two orthogonal matrices ${U,V}$ such that

$\displaystyle \sum_{ijkl} \,M_{ijkl}\,U_{ij} V_{kl} \,\geq\, \Big(\frac{1}{2\sqrt{2}} -2\varepsilon\Big)\mathrm{opt}(M).$

In the following I will describe the rounding algorithm for the case of the complex commutative inequality. This may sound like the opposite of what we’re ultimately interested in (the real non-commutative inequality), but it is the best setting in which to explain the main ideas, which generalize straightforwardly to the matrix setting. The algorithm I will describe is randomized, but it is possible to derandomize it in polynomial time.

Our goal is thus to devise a rounding algorithm such that, given complex unit vectors ${x_j,y_k\in {\mathbb C}^d}$ such that

$\displaystyle \Big|\sum_{j,k} \,A_{jk}\, x_j\cdot y_k\Big|\,\geq\, (1-\varepsilon)\,\mathrm{SDP}_{\mathbb C}(A)\,:=\,\sup_{\substack{d,\, x'_j,y'_k\in{\mathbb C}^d,\\ \|x'_j\|,\|y'_k\|=1}} \,\Big|\sum_{j,k} \,A_{jk}\, x'_j\cdot y'_k\Big|,$

produces complex numbers ${\varepsilon_j,\delta_k}$ of modulus ${1}$ such that

$\displaystyle \Big|\sum_{jk} A_{jk} \varepsilon_j \delta_k \Big| \geq \Big(\frac{1}{2\sqrt{2}}-2\varepsilon\Big) \mathrm{SDP}_{\mathbb C}(A). \ \ \ \ \ (12)$

Our first step consists in performing the usual random projection.

Step 1. Choose ${z \in\{\pm 1,\pm i\}^d}$ uniformly at random. For ${j,k=1,\ldots,n}$, define ${\alpha_j := ( x_j\cdot z)/\sqrt{2}}$ and ${\beta_k := (y_k \cdot z)/\sqrt{2}}$.

(The factor ${\sqrt{2}}$ in the definition of ${\alpha_j,\beta_k}$ is for convenience and will only play a role later.) By expanding out the terms, one checks that in expectation, for any ${j,k\in\{1,\ldots,n\}}$, ${\mathrm{E}_z[\alpha_j\overline{\beta_k}] = (x_j\cdot y_k)/2}$. The problem of course is that in general ${\alpha_j,\beta_k}$ will not have modulus ${1}$. The standard procedure, due to Goemans and Williamson, would be to define ${\varepsilon_j := \mathrm{phase}(\alpha_j)}$, ${\delta_k := \mathrm{phase}(\beta_k)}$, where ${\mathrm{phase}(x)}$ denotes ${x/|x|}$ (in case ${x}$ is real this is just the sign). Goemans and Williamson showed that, in the real case,

$\displaystyle \mathrm{E}\big[1-\varepsilon_j\, \delta_k\big] \,\geq\, \alpha\,(1- x_j\cdot y_k),$

where ${\alpha \approx 0.878\cdots}$, so that recalling (3) such “sign” rounding will work well for a problem, such as MAXCUT, in which all the coefficients ${A_{ij}}$ are of the same sign. However, in general the coefficients ${A_{ij}}$ may have different signs and large cancellations could occur, rendering an inequality such as the one above (even if it were two-sided) useless — one needs more precise control of the relation between the ${\varepsilon_j,\delta_k}$ and the ${\alpha_j,\beta_k}$.
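Going back to Step 1: since the coordinates of ${z}$ take only four values, the identity ${\mathrm{E}_z[\alpha_j\overline{\beta_k}] = (x_j\cdot y_k)/2}$ can be verified exactly for small ${d}$ by enumerating all choices of ${z}$. A sketch (my own, with the convention ${x\cdot y = \sum_l x_l\overline{y_l}}$):

```python
import itertools
import numpy as np

def step1_expectation(x, y):
    """Exact E_z[alpha * conj(beta)] over z uniform in {1,-1,i,-i}^d, where
    alpha = (x.z)/sqrt(2), beta = (y.z)/sqrt(2) and u.v = sum_l u_l conj(v_l)."""
    d = len(x)
    total = 0
    for z in itertools.product([1, -1, 1j, -1j], repeat=d):
        z = np.array(z)
        alpha = np.sum(x * np.conj(z)) / np.sqrt(2)
        beta = np.sum(y * np.conj(z)) / np.sqrt(2)
        total += alpha * np.conj(beta)
    return total / 4 ** d

rng = np.random.default_rng(6)
d = 2
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = rng.standard_normal(d) + 1j * rng.standard_normal(d)
```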

Our rounding algorithm proceeds differently. The idea of the second step is to use information that is contained in the moduli ${|\alpha_j|,|\beta_k|}$, instead of completely discarding that information as in the “sign” rounding procedure. If these moduli are ${1}$ then all is well. If, however, they are far from ${1}$ then intuitively this represents a “miss” of the random projection step: such values may be best discarded, or could be re-randomized. To this effect we perform an additional correlated randomized rounding step, as follows.

Step 2. Choose ${t\in {\mathbb R}}$ from the hyperbolic secant distribution, which has density proportional to ${(e^{\pi t/2}+e^{-\pi t/2})^{-1}}$. Return ${\varepsilon_j := \mathrm{phase}(\alpha_j) |\alpha_j|^{it}}$, ${\delta_k:= \mathrm{phase}(\beta_k) |\beta_k|^{-it}}$.

Note that ${\varepsilon_j,\delta_k}$ have modulus exactly ${1}$. Why does this work? In his proof, Haagerup uses the maximum modulus principle to show the existence of a real ${t}$ for which ${\varepsilon_j,\delta_k}$ as defined above will provide a good solution. Note that varying ${t}$ corresponds to simultaneously “rotating” the ${\varepsilon_j,\delta_k}$ at different speeds, proportional to the logarithm of the modulus of ${\alpha_j,\beta_k}$ respectively. Hence terms for which the modulus is far from ${1}$ will be rotated at greater speeds, going through all possible values on the unit circle much more frequently than terms whose modulus is close to ${1}$.

The hyperbolic secant distribution

We go a bit further and identify a specific distribution — the hyperbolic secant distribution — such that sampling ${t}$ according to that distribution gives a good solution in expectation. The key property of the hyperbolic secant distribution that we need is that its characteristic function satisfies ${\mathrm{E}_t[a^{it}] = 2a/(1+a^2)}$ for any ${a>0}$. Using that equation, we obtain

$\displaystyle \begin{array}{rcl} \mathrm{E}_t\Big[\sum_{j,k} A_{jk} \, \varepsilon_j\overline{\delta_k} \Big] &=& \sum_{j,k} A_{jk} \mathrm{E}\Big[ \frac{\alpha_j\overline{\beta_k}}{|\alpha_j\beta_k|} |\alpha_j\beta_k|^{it}\Big]\\ &=& 2\sum_{j,k} A_{jk} \alpha_j\overline{\beta_k} - \sum_{j,k} A_{jk} \mathrm{E}_t\Big[ \frac{\alpha_j\overline{\beta_k}}{|\alpha_j\beta_k|} |\alpha_j\beta_k|^{2+it}\Big]. \end{array}$

Taking the expectation over ${z}$, the first term above is, in absolute value, at least ${(1-\varepsilon)\mathrm{SDP}_{\mathbb C}(A)}$. To conclude it only remains to bound the contribution of the second term. The idea is to see the vectors ${u_j:=(\alpha_j |\alpha_j|^{1+it})_{z,t}}$ and ${v_k:=(\beta_k |\beta_k|^{1+it})_{z,t}}$ as a pair of possible solutions to the semidefinite program ${\mathrm{SDP}_{\mathbb C}(A)}$ (note these vectors are infinite-dimensional, but this is not an issue — we are allowed to work directly with infinite-dimensional vectors, or we could project them down to finite dimensions, or discretize the choice of ${t}$). We are about to show that the squared norm of each such vector is at most ${1/2}$, hence the value they give in the objective function defining ${\mathrm{SDP}_{\mathbb C}(A)}$ in (12) can be at most ${\mathrm{SDP}_{\mathbb C}(A)/2}$. Plugging this bound back into the equations above will finish the analysis.
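As a sanity check on the characteristic-function identity ${\mathrm{E}_t[a^{it}] = 2a/(1+a^2)}$ that drives this computation, one can integrate against the density numerically (my own sketch; note the density ${(e^{\pi t/2}+e^{-\pi t/2})^{-1}}$ is already normalized):

```python
import numpy as np

# The hyperbolic secant density p(t) = (e^{pi t/2} + e^{-pi t/2})^{-1};
# with this normalization it integrates to 1 over the real line.
t = np.linspace(-60.0, 60.0, 600001)
dt = t[1] - t[0]
p = 1.0 / (np.exp(np.pi * t / 2) + np.exp(-np.pi * t / 2))

def char(a):
    """E_t[a^{it}] by a Riemann sum, using a^{it} = exp(i t log a)."""
    return np.sum(p * np.exp(1j * t * np.log(a))) * dt
```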

Hence it remains to compute

$\displaystyle \begin{array}{rcl} \mathrm{E}_{z,t}\big[\|u_j\|_2^2\big]\,=\,\mathrm{E}_{z}\big[\,|\alpha_j|^4\,\big]&=&\frac{1}{4}\,\mathrm{E}_{z}\Big[ \sum_{l,l',l'',l'''} z_l \overline{z_{l'}} z_{l''} \overline{z_{l'''}} \,(x_j)_l \overline{(x_j)_{l'}} (x_j)_{l''} \overline{(x_j)_{l'''}}\Big]\\ &=&\frac{1}{4}\,\Big( \sum_l |(x_j)_l|^4 + 2\sum_{l\neq l'} |(x_j)_l|^2 |(x_j)_{l'}|^2\Big)\\ &\leq & \frac{1}{4}\,2 \,\Big( \sum_l |(x_j)_l|^2\Big)^2 \,=\, \frac{1}{2}\,\|x_j\|_2^4\,=\,1/2. \end{array}$

Here in the second line we used that the only terms ${z_l \overline{z_{l'}} z_{l''} \overline{z_{l'''}}}$ that have nonzero expectation are those where either ${l=l'=l''=l'''}$, or ${(l=l')\neq (l''=l''')}$, or ${(l=l''')\neq(l'=l'')}$.
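Finally, the fourth-moment bound can also be checked exactly by enumerating ${z}$ for small ${d}$; the sketch below (my own) compares the brute-force expectation with the closed form ${\big(2\|x\|_2^4-\sum_l |x_l|^4\big)/4}$ and verifies it is at most ${1/2}$ for unit ${x}$.

```python
import itertools
import numpy as np

def fourth_moment(x):
    """Exact E_z[|(x.z)/sqrt(2)|^4] over z uniform in {1,-1,i,-i}^d."""
    d = len(x)
    total = 0.0
    for z in itertools.product([1, -1, 1j, -1j], repeat=d):
        alpha = np.sum(x * np.conj(np.array(z))) / np.sqrt(2)
        total += abs(alpha) ** 4
    return total / 4 ** d

rng = np.random.default_rng(7)
x = rng.standard_normal(3) + 1j * rng.standard_normal(3)
x = x / np.linalg.norm(x)               # unit vector, as in the analysis
closed_form = float((2 * np.sum(np.abs(x) ** 2) ** 2 - np.sum(np.abs(x) ** 4)) / 4)
```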