## Amir Shpilka on fast matrix multiplication

Amir Shpilka gave an excellent talk in the MSR/MIT theory reading group last Friday, walking us all the way from the initial tensor-rank formulation of the matrix multiplication exponent to the algorithm of Coppersmith-Winograd and the recent improvements by Vassilevska Williams and Stothers. Amir also recommended these lecture notes by Markus Bläser as a good introduction to the topic.

Let $\omega$ be the matrix multiplication exponent: the smallest number such that the multiplication of two real $n\times n$ matrices $X,Y$ can be carried out by an arithmetic circuit using $n^{\omega+o(1)}$ multiplications (additions come for free). A trivial upper bound is $\omega \leq 3$, and the trivial lower bound is $\omega \geq 2$: multiplying a column by a row matrix requires the computation of all $n^2$ possible products. A lot of excitement was generated recently when the best previously known upper bound, due to Coppersmith and Winograd more than 20 years ago, was improved by Virginia Vassilevska Williams to $\omega \leq 2.3727$. As far as I know, the best lower bound known is due to Amir himself: multiplying two $n\times n$ matrices requires at least $3n^2$ real multiplications. There is still some work ahead!
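
To fix ideas, here is a minimal sketch (mine, not from the talk; the function name is of course made up) of the schoolbook algorithm behind the trivial bound $\omega \leq 3$: it uses exactly $n^3$ scalar multiplications.

```python
def schoolbook_matmul(X, Y):
    """Multiply two n x n matrices given as lists of lists.

    Returns the product together with the number of scalar
    multiplications performed -- exactly n**3, the trivial bound.
    """
    n = len(X)
    mults = 0
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                Z[i][j] += X[i][k] * Y[k][j]
                mults += 1
    return Z, mults
```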

All known upper bounds follow the same overall structure. I will attempt to reproduce some of Amir’s explanations below; all misstatements are mine of course!

1. A tensor associated to matrix multiplication.

The starting point for all known efficient algorithms for matrix multiplication over the reals is the following.  Suppose you have an algorithm, represented as an arithmetic circuit, that computes the product $XY$ of two $n\times n$ real matrices $X,Y$. The first claim is that all the algorithm can be doing is computing linear forms $L_i(X)$ in the coefficients of $X$, linear forms $L'_i(Y)$ in those of $Y$, and outputting $\sum_i L_i(X)L'_i(Y)$. This is because each entry of the product $XY$ is a bilinear form in $(X,Y)$, so any polynomial of degree at least $2$ in either $X$ or $Y$ that appears in an intermediate step of the circuit will eventually have to cancel out, and it might as well be replaced by a $0$ the first time it is computed.

Given an arbitrary tensor $T$ (equivalently, a trilinear map), define its rank $R(T)$ as the smallest integer $m$ such that $T(X,Y,Z) = \sum_{i=1}^m L_i(X)L'_i(Y)L''_i(Z)$, where for each $i$ the $L_i,L'_i,L''_i$ are linear forms in the coefficients of $X,Y,Z$ respectively. For any triple of integers $(m,n,k)$, let $\langle m,n,k \rangle$ be the trilinear map

$\displaystyle \langle m,n,k \rangle:\, (X,Y,Z)\in M_{m,n}\times M_{n,k} \times M_{m,k} \,\to\, \mathrm{Tr}(XYZ^T) \in \mathbb{R},$

where $M_{n,k}$ denotes the set of all $n\times k$ matrices with real entries. Equivalently, $\langle m,n,k \rangle$ is a real tensor of dimensions $mn\times nk \times mk$. The key observation is that the coefficient of $\langle m,n,k \rangle$ (seen as a trilinear form) in $Z_{i,j}$ is exactly the $(i,j)$-th entry of the product $XY$. Given this, it is not hard to see that the rank of $\langle m,n,k \rangle$ is exactly the least number of multiplications required by any (bilinear, but as mentioned above this is essentially without loss of generality) arithmetic circuit that computes the matrix product: any circuit leads to a decomposition as a sum of products of linear forms, and conversely any decomposition can be translated into a circuit with as many multiplication gates as there are terms in the decomposition.

So we are almost done: to determine the complexity of matrix multiplication, it suffices to determine the rank of the rather simple tensor whose $(ii',jj',kk')$-th entry is the coefficient of $\mathrm{Tr}(XYZ^T)$ in $X_{ii'}Y_{jj'}Z_{kk'}$… easy?
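
To make this concrete, here is a small sketch (helper names are mine) that builds $\langle m,n,k \rangle$ entry by entry, using the convention that $Z$ is $m\times k$ so that $\mathrm{Tr}(XYZ^T) = \sum_{i,j}(XY)_{ij}Z_{ij}$, and checks that evaluating the trilinear form recovers the trace.

```python
def matmul_tensor(m, n, k):
    """The tensor <m,n,k> as a dict mapping index triples to coefficients.

    With Z an m x k matrix, Tr(X Y Z^T) = sum_{i,a,j} X[i][a] Y[a][j] Z[i][j],
    so the coefficient of X_{ia} Y_{bj} Z_{pq} is 1 iff a == b, i == p, j == q.
    """
    return {((i, a), (a, j), (i, j)): 1
            for i in range(m) for a in range(n) for j in range(k)}

def trilinear(T, X, Y, Z):
    """Evaluate the trilinear form associated to the tensor T."""
    return sum(c * X[xi][xa] * Y[yb][yj] * Z[zp][zq]
               for ((xi, xa), (yb, yj), (zp, zq)), c in T.items())
```

In particular, taking $Z$ to be the identity recovers $\mathrm{Tr}(XY)$.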

2. Tensor rank

Unfortunately, unlike matrix rank, which can be computed efficiently via Gaussian elimination, the rank of tensors of order $3$ or more is NP-hard to compute. In fact, tensor rank differs from matrix rank in many surprising ways:

Tensor rank is not continuous: suppose a sequence of tensors $T_\epsilon \to_{\epsilon \to 0} T$, where convergence is measured in an arbitrary norm, and suppose further that $R(T_\epsilon) \leq r$ for some $r$ and all $\epsilon$ small enough. What is the rank of $T$? If $T$ were a matrix, it would be at most $r$, as can be seen from the characterization of the rank as the size of the largest minor with nonzero determinant, together with continuity of the determinant. But if $T$ is a tensor, this is no longer true! A simple example is the tensor $T_{\epsilon} = (X_0+\epsilon X_1)(Y_0+\epsilon Y_1)Z_1+X_0Y_0(\epsilon Z_0-Z_1)$: $T_\epsilon$ has rank $2$ for all $\epsilon >0$, but expanding and keeping the terms linear in $\epsilon$, we see that as $\epsilon\to 0$, $T_\epsilon/\epsilon \to T = X_0Y_1Z_1+X_1Y_0Z_1+X_0Y_0Z_0$, which has rank $3$.
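
Here is a numerical sketch of this example (a direct check; the helper names are mine): $T_\epsilon$ is a sum of two rank-one $2\times 2\times 2$ tensors, yet $T_\epsilon/\epsilon$ converges entrywise to the rank-$3$ tensor $X_0Y_1Z_1+X_1Y_0Z_1+X_0Y_0Z_0$.

```python
def rank_one(u, v, w):
    """The rank-one tensor u (x) v (x) w as a nested list."""
    return [[[ui * vj * wk for wk in w] for vj in v] for ui in u]

def add(S, T):
    return [[[S[i][j][k] + T[i][j][k] for k in range(2)]
             for j in range(2)] for i in range(2)]

def T_eps(eps):
    # (X0 + eps*X1)(Y0 + eps*Y1) Z1  +  X0 Y0 (eps*Z0 - Z1): rank <= 2 for eps > 0
    return add(rank_one([1, eps], [1, eps], [0, 1]),
               rank_one([1, 0], [1, 0], [eps, -1]))

# The limit of T_eps / eps: X0 Y1 Z1 + X1 Y0 Z1 + X0 Y0 Z0 (a rank-3 tensor).
limit = add(add(rank_one([1, 0], [0, 1], [0, 1]),
                rank_one([0, 1], [1, 0], [0, 1])),
            rank_one([1, 0], [1, 0], [1, 0]))
```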

Tensor rank is not multiplicative: given two tensors $T_1$ and $T_2$, their tensor product satisfies $R(T_1 \otimes T_2)\leq R(T_1)\,R(T_2)$, but equality need not hold in general. In fact, as we will see, the fact that this inequality is sometimes strict is used crucially in all improvements on the exponent $\omega$ ever since Strassen’s first $\omega \leq 2.81$ algorithm!

Tensor rank is (maybe) not additive: in general $R(T_1 \oplus T_2)\leq R(T_1)+ R(T_2)$, and equality holds for all examples for which we can carry out the computation… but it is not known to hold in general. In fact, Strassen conjectured that equality always holds.

Now, here is our first algorithm for matrix multiplication that beats the trivial bound. We need one key observation, which we will use repeatedly, and one computation. The observation is that for all integers $k_1,k_2,m_1,m_2,n_1,n_2$, we have the equality $\langle k_1,m_1,n_1\rangle \otimes \langle k_2,m_2,n_2\rangle = \langle k_1k_2,m_1m_2,n_1n_2 \rangle$. This follows from the rule $(X_1Y_1)\otimes (X_2Y_2) = (X_1\otimes X_2)\cdot (Y_1\otimes Y_2)$ for the Kronecker product. The calculation is $R(\langle 2,2,2\rangle)=7$: the upper bound is Strassen’s famous seven-multiplication identity (rather than the eight multiplications we all learned in school), and a matching lower bound is known. Repeated powering leads to Strassen’s algorithm, and the bound $\omega\leq \log_2 7 \approx 2.807$.
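
For concreteness, here is Strassen’s seven-multiplication identity for a single $2\times 2$ product (a standard sketch, with my own function name; applying it recursively to block matrices gives the $O(n^{\log_2 7})$ algorithm).

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using Strassen's 7 multiplications
    (instead of the 8 used by the schoolbook method)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Note that only the seven $m_i$ involve a product of two linear forms, one in the entries of $A$ and one in those of $B$, exactly the bilinear normal form discussed above.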

3. Border rank and Schönhage’s $\tau$-theorem

Motivated by the fact that a convergent sequence of tensors can have strictly lower rank than its limit, we introduce the notion of border rank, which goes back to Bini, Capovani, Lotti and Romani: given a tensor ${T}$ and an integer ${d}$, define ${R_d(T)}$ as the least integer ${m}$ such that there exists a tensor ${T'}$ of rank at most ${m}$, with coefficients that are polynomials in ${\varepsilon}$, such that ${T' = \varepsilon^d T + \varepsilon^{d+1} T''}$ for some tensor ${T''}$. Clearly, ${R_d(T)}$ is a non-increasing function of ${d}$, and the border rank ${\underline{R}(T)}$ is defined as the limit of ${R_d(T)}$ as ${d\rightarrow\infty}$.

For the special case of matrix multiplication tensors it turns out that the border rank is (asymptotically) equal to the rank. Precisely, ${R(\langle m,n,k\rangle )\leq d^2 R_d(\langle m,n,k\rangle)}$ for any ${m,n,k}$ and $d$ (to show this, simply use the definition of ${R_d}$, expand all linear forms as series in ${\varepsilon}$, and collect the terms in ${\varepsilon^d}$). Moreover, bounds on the rank of ${\langle n,n,n \rangle}$ for large ${n}$ can be obtained by first proving bounds on ${\langle n',n',n' \rangle}$ for small ${n'}$ and then taking tensor powers, and the inequality ${R_{d+d'}(T\otimes T')\leq R_{d}(T) R_{d'}(T')}$ ensures that the degree grows much more slowly than the dimension: morally, a bound ${\underline{R}(\langle n,n,n\rangle) \leq r}$ implies ${\omega\leq \log r/\log n}$.
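
For completeness, here is a sketch of the standard calculation behind the rank-versus-border-rank conversion (the counting constant $\binom{d+2}{2} = O(d^2)$ is where the $d^2$ comes from):

```latex
% Suppose T' = \sum_{i=1}^r L_i L_i' L_i'' has \varepsilon-polynomial coefficients
% and satisfies T' = \varepsilon^d T + \varepsilon^{d+1} T''. Expand each form:
L_i = \sum_a \varepsilon^a L_i^{(a)}, \qquad
L_i' = \sum_b \varepsilon^b L_i'^{(b)}, \qquad
L_i'' = \sum_c \varepsilon^c L_i''^{(c)}.
% Collecting the coefficient of \varepsilon^d on both sides:
T = \sum_{i=1}^r \; \sum_{a+b+c=d} L_i^{(a)} \, L_i'^{(b)} \, L_i''^{(c)},
% a sum of at most r\binom{d+2}{2} products of triples of linear forms, so
R(T) \;\leq\; \binom{d+2}{2} \, R_d(T) \;=\; O(d^2)\, R_d(T).
```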

Using this idea, Bini, Capovani, Lotti and Romani were able to obtain the bound ${\omega \leq 2.79}$ by first proving an upper bound on the border rank of the tensor ${\langle 2,2,3\rangle}$ (note that such a bound immediately implies a bound on ${\langle 12,12,12 \rangle = \langle 2,2,3\rangle \otimes \langle 2,3,2\rangle \otimes \langle 3,2,2\rangle}$; hence when computing the rank of ${\langle m,n,k\rangle}$, the “volume” ${mnk}$ is the only quantity that matters).

Schönhage, in his “${\tau}$-theorem”, went one step further by also considering direct sums. His idea was that, if we have a bound such as ${\underline{R}(\oplus_i \langle k_i,m_i,n_i\rangle)\leq r}$ (where the direct sum is obtained by sticking together the tensors $\langle k_i,m_i,n_i\rangle$, acting on disjoint sets of variables), then we expect ${\omega}$ to satisfy ${\sum_i (k_im_in_i)^{\omega/3} \leq r}$. Indeed, we might in general expect ${\underline{R}}$ to satisfy some kind of direct sum rule, in which case the direct sum cannot be computed much more efficiently than by computing each term individually, leading to the above bound. Using this idea, Schönhage obtained the bound ${\omega \leq 2.55}$.
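
As a sanity check on the $2.55$, here is a sketch (mine) plugging in Schönhage’s classical example, as stated e.g. in Bläser’s notes, ${\underline{R}(\langle 4,1,4\rangle \oplus \langle 1,9,1\rangle)\leq 17}$, and solving ${\sum_i (k_im_in_i)^{\omega/3} \leq r}$ for ${\omega}$ by bisection:

```python
def tau_theorem_bound(volumes, r):
    """Largest w in [2, 3] with sum(v ** (w/3) for v in volumes) <= r.

    The left-hand side is increasing in w, so plain bisection applies.
    """
    lo, hi = 2.0, 3.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if sum(v ** (mid / 3) for v in volumes) <= r:
            lo = mid  # constraint still satisfied: push w higher
        else:
            hi = mid
    return lo

# Schönhage's example: <4,1,4> (volume 16) plus <1,9,1> (volume 9), border rank <= 17.
omega_bound = tau_theorem_bound([16, 9], 17)  # comes out around 2.55
```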

The Coppersmith-Winograd algorithm is based on another idea due to Strassen, called the “laser method” (there was a heated discussion as to what this name refers to, but I am sorry to report that no clear consensus emerged :-)). Start with a tensor ${T}$, which could be arbitrary, but for which we have a good upper bound on the (border) rank. By sub-multiplicativity, for any ${\ell}$ we have ${\underline{R}(T^{\otimes \ell})\leq (\underline{R}(T))^\ell}$. The idea is then to look for matrix multiplication tensors inside ${T^{\otimes \ell}}$: if ${T^{\otimes \ell}}$ contains many independent such tensors (i.e. acting on disjoint subsets of the variables), then by Schönhage’s direct sum idea above, we may, from the bound on ${\underline{R}(T^{\otimes \ell})}$, obtain a bound on the rank of the matrix multiplication tensor itself.

Coppersmith and Winograd used the following tensor as their starting point:

$\displaystyle T = \sum_{i=1}^q (X_0Y_iZ_i+X_iY_0Z_i+X_iY_iZ_0),$

for some integer ${q}$ (${q}$ will eventually be optimized over, and will be something like ${5}$ or ${6}$). By explicitly exhibiting an approximating sequence, one can check that ${\underline{R}(T)\leq q+2}$, and in fact ${\underline{R}(T) = q+2}$. It is a bit trickier to visualize why tensor powers of ${T}$ would contain copies of the matrix multiplication tensor. To see it, it is perhaps easiest to look at the third tensor power. Expanding out all terms, we will get terms such as

$\displaystyle (X_0Y_iZ_i)\otimes (X_{i'}Y_{i'}Z_0)\otimes (X_{i''}Y_0Z_{i''}) = X_0X_{i''}X_{i'}\otimes Y_0Y_{i'}Y_{i}\otimes Z_0Z_{i''}Z_{i},$

which, when summed over all $(i,i',i'')$ and setting $X_0=Y_0=Z_0=1$, is exactly the tensor ${\langle q,q,q \rangle}$! Many such copies of $\langle q,q,q\rangle$ can be found, but note that they are not independent, the same variable appearing in different copies. Most of the work done by Coppersmith and Winograd is combinatorial: they construct a hypergraph whose vertices are tuples of variables, such as ${\{X_0X_iX_{i'}\}}$, that appear in a given copy of ${\langle q,q,q \rangle}$, with a hyperedge between any three tuples that together constitute a copy of $\langle q,q,q\rangle$. The challenge is to delete the fewest possible vertices (meaning the corresponding variables are set to ${0}$) while keeping the largest possible number of disjoint hyperedges — the final number of independent copies of ${\langle q,q,q \rangle}$ in ${T^{\otimes 3\ell}}$.
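
To make the starting tensor concrete in code (a small sketch of mine, directly transcribing the displayed definition of $T$): as a $(q+1)\times(q+1)\times(q+1)$ array it is supported on just $3q$ entries, and it is symmetric under cyclically permuting the three slots.

```python
def cw_tensor(q):
    """Support of the Coppersmith-Winograd starting tensor: coefficient-1
    entries at (0,i,i), (i,0,i) and (i,i,0) for i = 1..q, zero elsewhere."""
    T = {}
    for i in range(1, q + 1):
        for idx in [(0, i, i), (i, 0, i), (i, i, 0)]:
            T[idx] = 1
    return T
```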

Amir gave the main ideas behind this construction, but I’m afraid it would stretch me too far to reproduce them here… Eventually, making the best choice of parameters led Coppersmith and Winograd to their famous bound ${\omega \leq 2.3755}$.

4. Epilogue: the Stothers and Vassilevska Williams improvements

By the time we got here, I think both the audience and Amir were getting pretty exhausted, so I will be brief. The main idea of the recent improvements consists in decomposing the powers ${T^{\otimes 3\ell} = (T^{\otimes k})^{\otimes 3\ell/k}}$. In fact, Coppersmith and Winograd already used this trick for ${k=2}$, but Stothers used it for ${k=4}$, and Vassilevska Williams for ${k=8}$ (and even higher $k$?). The point is to first gain a good understanding of the different terms that appear in, say, the expansion of ${T^{\otimes k}}$, so that when one takes the ${3\ell/k}$-th tensor power, one has more leverage over the structure of the hypergraph that arises, as described in the paragraph above, and one can obtain a better decomposition — leading to a better upper bound on ${\omega}$: ${\omega \leq 2.3727}$! Of course, the larger the $k$ the more daunting the task, and computer-assisted optimization plays an important role in Vassilevska Williams’ result.