Amir Shpilka gave an excellent talk in the MSR/MIT theory reading group last Friday, walking us all the way from the initial tensor-rank formulation of the matrix multiplication exponent to the algorithm of Coppersmith-Vinograd and recent improvements by Vassilevska-Williams and Slothers. Amir also recommended these lecture notes by Markus Bläser as a good introduction to the topic.
Let be the matrix multiplication exponent: the smallest number such that the multiplication of two real matrices can be carried out by an arithmetic circuit using multiplications (additions come for free). A trivial upper bound is , and the trivial lower bound : multiplying a column by a row matrix requires the computation of all possible products. A lot of excitement was generated recently when the best previously known upper bound, due to Coppersmith and Vinograd more than 20 years ago, was improved by Virginia Vassilevska Williams, to . As far as I know, the best lower bound known is due to Amir himself: multiplying two matrices requires at least real multiplications. There is still some work ahead!
All known upper bounds follow the same overall structure. I will attempt to reproduce some of Amir’s explanations below; all misstatements are mine of course!
1. A tensor associated to matrix multiplication.
The starting point for all known efficient algorithms for matrix multiplication over the reals is the following. Suppose you have an algorithm, represented as an arithmetic circuit, that computes the product of two real matrices . The first claim is that all the algorithm can be doing is computing linear forms in the coefficients of , linear forms in those of , and outputting . This is because each entry of the product is a bilinear form in , so any polynomial of degree at least in either or that appears in an intermediate step of the circuit will eventually have to cancel out, and it might as well be replaced by a the first time it is computed.
Given an arbitrary tensor (equivalently, trilinear map) define its rank as the smallest integer such that , where for each the are linear forms in the coefficients of respectively. For any triple of integers , let be the trilinear map
where denotes the set of all matrices with real entries. Equivalently, is a real tensor of dimensions . The key observation is that the coefficient of (seen as a trilinear form) in is exactly the -th entry of the product . Given this, it is not hard to see that the rank of is exactly the least number of multiplications required by any (bilinear, but as mentioned above this is essentially without loss of generality) arithmetic circuit that computes the matrix product: any circuit leads to a decomposition as a sum of products of linear forms, and any decomposition can be translated to a circuit with as many multiplication gates as there are terms in the decomposition.
So we are almost done: to determine the complexity of matrix multiplication, it suffices to determine the rank of the rather simple tensor whose -th entry is the coefficient of in …easy?
2. Tensor rank
Unfortunately, contrary to the matrix rank, for which there is an efficient algorithm, the rank of tensors of order or more is known to be NP-hard to compute. In fact, tensor rank differs from matrix rank in many surprising ways:
Tensor rank is not continuous: suppose a sequence of tensors , where convergence is measured in an arbitrary norm, and suppose further that for some and all small enough. What is the rank of ? If was a matrix, it would be at most , as can be seen from the characterization of the rank as the size of the largest minor with nonzero determinant and continuity of the determinant. But if is a tensor, this is no longer true! A simple example is the tensor : has rank for all , but as , which has rank .
Tensor rank is not multiplicative: given two tensors and , their tensor product satisfies , but equality need not hold in general. In fact, as we will see the fact that the inequality is sometimes strict is used crucially in all improvements on the exponent ever since Strassen’s first algorithm!
Tensor rank is (maybe) not additive: in general , and equality holds for all examples for which we can carry out the computation…but is not known to hold in general. In fact, Strassen conjectured that equality does not always hold.
Now, here is our first algorithm for matrix multiplication that beats the trivial bound. We need one key observation, that we will use repeatedly, and one computation. The observation is that for all integers , we have the equality . This follows from the rule $latex X_1Y_1\otimes X_2Y_2 = (X_1\otimes X_2)\cdot (Y_1\otimes Y_2) $. The calculation is : we have all seen the upper bound in high school, and a matching lower bound is known. Repearing powering leads to Strassen’s algorithm, and the bound .
3. Border rank and Schönhage’s -theorem
Motivated by the fact that a convergent sequence of tensors can have a strictly lower rank than its limiting tensor, we introduce the notion of border rank, due to Strassen: given a tensor and integer , define as the least integer such that there exists a tensor with coefficients that are polynomials in and such that for some . Clearly, is a non-increasing function of , and the border rank is defined as the limit of as .
For the special case of matrix multiplication tensors it turns out that the border rank is (asymptotically) equal to the rank. Precisely, for any and (to show this, simply use the definition of and expand all linear forms as series in , collecting terms in ). Since, however, bounds on the rank of for large can be obtained by first proving bounds on and then taking tensor powers, in which the inequality ensures that the degree grows much more slowly than the dimension: morally, a bound implies .
Using this idea, Strassen was able to obtain the bound by first proving an upper bound on the border rank of the tensor (note that such a bound immediately implies a bound on . Hence when computing the rank of , the “volume” is the only important quantity).
Schönhage, in his “-theorem”, went one step further by also considering direct products. His idea was that, if we have a bound such as (where here the direct sum is obtained by sticking together the tensors , acting on disjoint sets of variables), then we expect to satisfy . Indeed, we might in general expect to satisfy some kind of direct product rule, in which case the direct sum cannot be computed much more efficiently than by computing each term individually, leading to the above bound. Using this idea, Schönhage obtained the bound .
4. The Coppersmith-Vinograd algorithm
The Coppersmith-Vinograd algorithm is based on another idea due to Strassen, called the “laser method” (there was a heated discussion as to what this name referred to, but I am sorry to report that no clear consensus emerged :-)). Start with a tensor , which could be arbitrary, but for which we have a good upper bound on the (border) rank. By sub-multiplicativity, for any we have . The idea is then to look for matrix multiplication tensors inside : if contains many such independent (i.e. acting on disjoint subsets of the variables) tensors, then by Strassen’s direct product idea above, we may, from the bound on , obtain a bound on the rank of the matrix multiplication tensor itself.
Coppersmith and Vinograd used the following tensor as their starting point:
for some integer ( will eventually be optimized over, and be something like or ). By explicitly exhibiting an approximating sequence, one can check that , and in fact . It is a bit more tricky to visualize why tensor powers of would contain copies of the matrix multiplication tensor. To see it, it is maybe easiest to look at the third tensor power. Expanding out all terms, we will get terms such as
which, when summed over all and setting , is exactly the tensor ! Many such copies of can be found, but note that they are not independent, the same variable appearing in different copies. Most of the work done by Coppersmith and Vinograd is combinatorial: they construct a hypergraph whose vertices are tuples of variables, such as , that appear in a given copy of , and has an hyperedge between any three triples that together constitute a copy of . The challenge is to delete the fewest possible number of vertices (meaning the corresponding variables are set to ), in order to keep the largest possible number of disjoint hyperedges — the final number of independent copies of in .
Amir gave the main ideas behind this construction, but I’m afraid it would stretch me too far to reproduce them here… Eventually making the best choice of parameters led Coppersmith and Vinograd to their famous bound .
5. Epilogue: the Slothers and Vassilevska-Williams improvements
By the time we got here, I think both the audience and Amir were getting pretty much exhausted, and I will be brief. The main idea of the recent improvements consists in decomposing the powers . In fact, Coppersmith and Vinograd already used this trick for , but Slothers used it for , and Vassilevska-Williams for (and even higher ?). The point is to first gain a good understanding of the different terms that appear in, say, the expansion of , so that when one takes the -tensor power, one has more leverage as to the structure of the hypergraph that arises, as described in the paragraph above, and one can obtain a better decomposition — leading to a better upper bound on : Of course, the larger the the more daunting the task, and computer-assisted optimization plays an important role in Vassilevska-Williams’ result.